CN114783408A - Audio data processing method and device, computer equipment and medium

Audio data processing method and device, computer equipment and medium

Info

Publication number
CN114783408A
Authority
CN
China
Prior art keywords
audio
audio data
target
text information
information
Prior art date
Legal status
Pending
Application number
CN202210334833.6A
Other languages
Chinese (zh)
Inventor
张心愿
张晶晶
刘恺
李栋梁
程龙
郎勇
许亚东
刘皓冬
姜鹏
王思远
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202210334833.6A
Publication of CN114783408A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027 - Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of this application provide an audio data processing method and apparatus, a computer device, and a medium. The method can be applied to scenarios such as cloud technology, artificial intelligence, intelligent transportation, and audio, and includes the following steps: displaying original text information in an application interface; acquiring target tone information and target audio data corresponding to target text information, the target text information being the text selected from the original text information; and acquiring spliced audio data for the original text information. The spliced audio data is obtained by splicing fused audio data with standard audio data corresponding to the remaining text information, where the remaining text information is the text in the original text information other than the target text information, and the fused audio data is obtained by fusing the target audio data with the target tone information. The method and apparatus can improve the richness of audio creation and the quality of audio data.

Description

Audio data processing method and device, computer equipment and medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to an audio data processing method and apparatus, a computer device, and a medium.
Background
At present, dubbing is mainly produced by invoking a speech synthesis algorithm: the text to be dubbed is fed directly into the speech synthesis algorithm, which directly generates the corresponding audio data. Speech synthesis can quickly produce audio with standard pronunciation, but the generated audio is monotonous in audio information such as tone, speaking rate, and emotion, which reduces the richness of audio creation. In addition, because different scenes call for audio with different audio information, the uniform audio produced by a speech synthesis algorithm is difficult to adapt to the dubbing requirements of diverse scenes, which further degrades the quality of the audio data.
Disclosure of Invention
The embodiment of the application provides an audio data processing method, an audio data processing device, a computer device and a medium, which can improve the richness of audio creation and improve the quality of audio data.
An aspect of the embodiments of this application provides an audio data processing method, including:
displaying original text information in an application interface;
acquiring target tone information and target audio data corresponding to target text information, the target text information being the text selected from the original text information, and the audio content in the target audio data matching the target text information; and
acquiring spliced audio data for the original text information, the spliced audio data being obtained by splicing fused audio data with standard audio data corresponding to remaining text information, where the remaining text information is the text in the original text information other than the target text information, the fused audio data is obtained by fusing the target audio data with the target tone information, and the audio content in the standard audio data matches the remaining text information.
An embodiment of this application provides an audio data processing apparatus, including:
a text display module, configured to display original text information in an application interface;
an audio acquisition module, configured to acquire target tone information and target audio data corresponding to target text information, the target text information being the text selected from the original text information, and the audio content in the target audio data matching the target text information; and
an audio splicing module, configured to acquire spliced audio data for the original text information, the spliced audio data being obtained by splicing fused audio data with standard audio data corresponding to remaining text information, where the remaining text information is the text in the original text information other than the target text information, the fused audio data is obtained by fusing the target audio data with the target tone information, and the audio content in the standard audio data matches the remaining text information.
The text display module is specifically configured to display an application interface, the application interface including a text entry area;
and the text display module is specifically configured to, in response to an input operation on the text entry area, display the entered original text information in the text entry area.
The text display module is specifically configured to display an application interface, the application interface including a text upload control;
the text display module is specifically configured to, in response to a trigger operation on the text upload control, display a text selection interface for selecting a text file;
the text display module is specifically configured to, in response to a text file selection operation on the text selection interface, take the text file selected by the text file selection operation as a target text file;
and the text display module is specifically configured to, in response to a text file confirmation operation on the text selection interface, take the text information in the target text file as the original text information and display the original text information in the application interface.
The audio acquisition module includes:
a text selection unit, configured to, in response to a text selection operation on the original text information, take the text information selected by the text selection operation as target text information;
a voice conversion unit, configured to display an audio conversion interface in response to a voice conversion operation on the target text information;
a tone acquisition unit, configured to acquire target tone information in the audio conversion interface;
and an audio acquisition unit, configured to acquire target audio data corresponding to the target text information in response to an audio upload operation on the audio conversion interface.
The application interface includes a first voice conversion control;
and the voice conversion unit is specifically configured to display the audio conversion interface in response to a trigger operation on the first voice conversion control.
The voice conversion unit is specifically configured to display a text control list in response to a trigger operation on the target text information, the text control list including a second voice conversion control;
and the voice conversion unit is specifically configured to display the audio conversion interface in response to a trigger operation on the second voice conversion control.
The audio conversion interface includes a recording start control;
the audio acquisition unit is specifically configured to display a recording stop control in the audio conversion interface in response to a trigger operation on the recording start control;
the audio acquisition unit is specifically configured to, in response to a trigger operation on the recording stop control, acquire the audio data entered by the target object within the time interval between the trigger operation on the recording start control and the trigger operation on the recording stop control, and take the audio data entered by the target object as the target audio data corresponding to the target text information;
and the audio acquisition unit is further specifically configured to display an audio file identifier corresponding to the target audio data in the audio conversion interface.
The audio conversion interface includes an audio upload control;
the audio acquisition unit is specifically configured to display an audio selection interface for selecting an audio file in response to a trigger operation on the audio upload control;
the audio acquisition unit is specifically configured to, in response to an audio file selection operation on the audio selection interface, take the audio file selected by the audio file selection operation as a target audio file;
the audio acquisition unit is specifically configured to, in response to an audio file confirmation operation on the audio selection interface, take the audio data in the target audio file as the target audio data corresponding to the target text information;
and the audio acquisition unit is further specifically configured to display an audio file identifier corresponding to the target audio data in the audio conversion interface.
The audio conversion interface includes one or more pieces of candidate timbre information;
the tone acquisition unit is specifically configured to, in response to a timbre selection operation on the one or more pieces of candidate timbre information, take the candidate timbre information selected by the timbre selection operation as the target tone information;
and the tone acquisition unit is further specifically configured to highlight the target tone information in the audio conversion interface.
The audio splicing module includes:
a fusion unit, configured to, in response to a confirmation operation on the target audio data and the target tone information, fuse the target audio data with the target tone information to obtain fused audio data;
and a splicing unit, configured to acquire the standard audio data corresponding to the remaining text information and splice the fused audio data with the standard audio data to obtain the spliced audio data for the original text information.
The audio splicing module is further specifically configured to display an audio conversion identifier in a target area associated with the target text information when the spliced audio data is generated, the audio conversion identifier being used for indicating, among the one or more pieces of candidate timbre information, the target tone information corresponding to the target text information.
The application interface includes an audio download control and an audio sharing control;
the apparatus is specifically configured to download the spliced audio data to a local disk of the terminal in response to a trigger operation on the audio download control;
and the apparatus is further specifically configured to share the spliced audio data in response to a trigger operation on the audio sharing control.
The audio acquisition unit includes:
an audio upload subunit, configured to acquire auxiliary audio data associated with the target text information in response to the audio upload operation on the audio conversion interface;
a denoising subunit, configured to denoise the auxiliary audio data to obtain denoised auxiliary audio data;
a content detection subunit, configured to perform content detection on the denoised auxiliary audio data to obtain a content matching degree between the audio content in the denoised auxiliary audio data and the target text information;
and an audio determination subunit, configured to determine the denoised auxiliary audio data as the target audio data corresponding to the target text information if the content matching degree is greater than a matching degree threshold.
The denoising subunit is specifically configured to input the audio frequency signal of the auxiliary audio data into a denoising network model and denoise the frequency signal through the denoising network model to obtain a target frequency signal;
and the denoising subunit is specifically configured to acquire the audio attributes of the auxiliary audio data and restore the target frequency signal according to the audio attributes to obtain the denoised auxiliary audio data.
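The application does not specify the denoising network model itself; purely as a minimal sketch of the "frequency signal in, denoised signal out, then restore according to the audio attributes" flow, a simple spectral-gating stand-in (the gating rule and thresholds are assumptions, not the disclosed model) might look like the following Python:

    import numpy as np
    from scipy.signal import stft, istft

    def denoise_sketch(samples: np.ndarray, sample_rate: int) -> np.ndarray:
        """Crude spectral-gating stand-in for the denoising network model:
        transform to the frequency domain, suppress an estimated noise floor,
        then restore a time-domain signal at the original length and sample
        rate (the 'audio attributes' of the auxiliary audio data)."""
        _, _, spec = stft(samples, fs=sample_rate)
        mag, phase = np.abs(spec), np.angle(spec)
        noise_floor = np.median(mag, axis=1, keepdims=True)   # assumed noise estimate
        gated = np.maximum(mag - noise_floor, 0.0)             # gate out the noise floor
        _, restored = istft(gated * np.exp(1j * phase), fs=sample_rate)
        return restored[: len(samples)]                         # keep original length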
The content detection subunit is specifically configured to determine an audio pronunciation sequence of the denoised auxiliary audio data and a text pronunciation sequence of the target text information;
the content detection subunit is specifically configured to divide the audio pronunciation sequence into at least two audio pronunciation subsequences, determine the subsequence matching degree of each audio pronunciation subsequence with respect to the text pronunciation sequence, and take the audio pronunciation subsequences whose subsequence matching degree is greater than a subsequence threshold as matched pronunciation subsequences;
and the content detection subunit is specifically configured to determine the proportion of the matched pronunciation subsequences among the at least two audio pronunciation subsequences and take this proportion as the content matching degree between the audio content in the denoised auxiliary audio data and the target text information.
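As an illustration only (the pronunciation extraction and the subsequence scoring metric are not specified in the application), the proportion-based matching degree described above could be computed along these lines; the chunk size, threshold, and longest-common-run score are assumptions:

    from difflib import SequenceMatcher

    def content_matching_degree(audio_prons, text_prons,
                                chunk_size=4, sub_threshold=0.6):
        """Split the audio pronunciation sequence into subsequences, score each one
        against the text pronunciation sequence, and return the proportion of
        subsequences whose score exceeds the threshold."""
        chunks = [audio_prons[i:i + chunk_size]
                  for i in range(0, len(audio_prons), chunk_size)]

        def score(chunk):
            # fraction of the chunk covered by its longest run also found in the text
            m = SequenceMatcher(None, chunk, text_prons)
            return m.find_longest_match(0, len(chunk), 0, len(text_prons)).size / len(chunk)

        matched = [c for c in chunks if score(c) > sub_threshold]
        return len(matched) / len(chunks) if chunks else 0.0

    # e.g. pinyin of "谁知盘中餐" against a recording's recognized pronunciations
    print(content_matching_degree(["shei", "zhi", "pan", "zhong", "can"],
                                  ["shei", "zhi", "pan", "zhong", "can"]))  # -> 1.0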
The content detection subunit is specifically configured to perform speech recognition on the denoised auxiliary audio data to obtain audio text information of the denoised auxiliary audio data;
and the content detection subunit is specifically configured to determine a text similarity between the audio text information and the target text information, and use the text similarity as a content matching degree between the audio content in the denoised auxiliary audio data and the target text information.
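Assuming the speech-recognition step is handled by an external ASR engine (none is named in the application), the text-similarity variant of the matching degree could be as simple as the following sketch; the use of SequenceMatcher rather than, say, edit distance is an assumption:

    from difflib import SequenceMatcher

    def text_similarity(audio_text: str, target_text: str) -> float:
        """Character-level similarity between the recognized audio text and the
        target text, used here as the content matching degree."""
        return SequenceMatcher(None, audio_text, target_text).ratio()

    print(text_similarity("谁知盘中餐", "谁知盘中餐"))   # identical -> 1.0
    print(text_similarity("谁知盘中饭", "谁知盘中餐"))   # one character differs -> 0.8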
The audio acquisition unit is further specifically configured to display error prompt information in the audio conversion interface if the content matching degree is less than or equal to the matching degree threshold; or,
the audio obtaining unit is further specifically configured to display an error prompt interface if the content matching degree is less than or equal to the matching degree threshold, and display error prompt information in the error prompt interface.
The fusion unit is specifically configured to input the target audio data into a fusion network model and extract, through the fusion network model, the audio text information and the audio features in the target audio data, the audio features including at least one of emotion features, mood features, or prosody features;
and the fusion unit is specifically configured to fuse the audio text information, the audio features, and the target tone information in the fusion network model to obtain the fused audio data.
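The fusion network model is not disclosed; purely to illustrate the interface described above (content plus emotion, mood, and prosody features kept, timbre swapped), a toy stand-in might look like this, where every class, field, and value is hypothetical:

    from dataclasses import dataclass, field

    @dataclass
    class ExtractedAudio:
        """Hypothetical output of the feature-extraction stage of the fusion model."""
        audio_text: str                 # recognized content of the target audio
        emotion: str = "neutral"        # emotion feature
        mood: str = "declarative"       # mood feature
        prosody: list = field(default_factory=list)  # e.g. per-syllable durations

    def fuse_with_timbre(extracted: ExtractedAudio, target_timbre_id: str) -> dict:
        """Toy fusion step: keep everything extracted from the uploaded audio and
        replace only the timbre with the selected standard timbre, mirroring how
        the fused audio data keeps real mood/emotion/prosody but a standard timbre."""
        return {
            "audio_text": extracted.audio_text,
            "emotion": extracted.emotion,
            "mood": extracted.mood,
            "prosody": extracted.prosody,
            "timbre": target_timbre_id,   # only this comes from the candidate timbres
        }

    fused = fuse_with_timbre(ExtractedAudio("谁知盘中餐", emotion="wistful"), "timbre_03")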
The splicing unit includes:
a voice conversion subunit, configured to input the original text information into a speech conversion network model and perform speech conversion on the original text information through the speech conversion network model to obtain the original audio data corresponding to the original text information;
a position determination subunit, configured to determine, according to the position information of the target text information in the original text information, an audio start position and an audio end position for the target text information in the original audio data, and take the audio data before the audio start position and the audio data after the audio end position in the original audio data as the standard audio data corresponding to the remaining text information;
and an audio splicing subunit, configured to extract candidate audio data from the fused audio data and splice the candidate audio data between the audio start position and the audio end position in the standard audio data to obtain the spliced audio data for the original text information.
The position determination subunit is specifically configured to acquire the original duration of the original audio data and the unit duration of each unit text in the original text information;
and the position determination subunit is specifically configured to determine the audio start position and the audio end position for the target text information in the original audio data according to the original duration, the unit duration, and the position information of the target text information in the original text information.
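A minimal sketch of this duration-based positioning, assuming each unit text (character) takes an equal share of the synthesized duration; the actual per-unit durations could instead come from the synthesis model:

    def locate_target_span(original_duration: float, original_text: str,
                           target_text: str):
        """Map the character span of the target text onto a time span of the
        original (synthesized) audio using a uniform unit duration per character."""
        unit_duration = original_duration / len(original_text)
        start_char = original_text.find(target_text)
        end_char = start_char + len(target_text)
        return start_char * unit_duration, end_char * unit_duration

    # 10 s of audio for 20 characters -> 0.5 s per character; a target text that
    # occupies characters 8..12 therefore spans 4.0 s to 6.5 s.
    print(locate_target_span(10.0, "abcdefghijklmnopqrst", "ijklm"))  # (4.0, 6.5)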
The audio splicing subunit is specifically configured to acquire the fused waveform data of the fused audio data and determine the mute time period in the fused audio data according to the fused waveform data;
and the audio splicing subunit is specifically configured to crop the fused audio data according to the mute time period and take the fused audio data from which the audio corresponding to the mute time period has been cropped as the candidate audio data.
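To illustrate the last two subunits together (trimming silence from the fused waveform, then inserting the result between the two halves of the standard audio), a rough numpy sketch follows; the frame size and amplitude threshold are assumptions, and this version only crops silence at the edges:

    import numpy as np

    def trim_silence(waveform: np.ndarray, sample_rate: int,
                     threshold: float = 0.01) -> np.ndarray:
        """Drop leading and trailing frames whose mean amplitude stays below the
        threshold, i.e. crop the mute time periods at the edges of the fused audio."""
        frame = max(1, int(0.02 * sample_rate))                 # 20 ms frames
        n = len(waveform) // frame
        energy = np.abs(waveform[:n * frame]).reshape(n, frame).mean(axis=1)
        voiced = np.flatnonzero(energy > threshold)
        if voiced.size == 0:
            return waveform[:0]
        return waveform[voiced[0] * frame:(voiced[-1] + 1) * frame]

    def splice(standard_before: np.ndarray, fused: np.ndarray,
               standard_after: np.ndarray, sample_rate: int) -> np.ndarray:
        """Insert the trimmed fused audio between the standard audio preceding the
        audio start position and the standard audio following the audio end position."""
        candidate = trim_silence(fused, sample_rate)
        return np.concatenate([standard_before, candidate, standard_after])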
An aspect of an embodiment of the present application provides a computer device, including: a processor and a memory;
the processor is connected to the memory, wherein the memory is used for storing a computer program, and when the computer program is executed by the processor, the computer device is caused to execute the method provided by the embodiment of the application.
In one aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program, which is adapted to be loaded and executed by a processor, so as to enable a computer device having the processor to execute the method provided by the embodiments of the present application.
An aspect of an embodiment of the present application provides a computer program product or a computer program, which includes computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the method provided by the embodiment of the application.
According to the embodiments of this application, the target audio data and the target tone information can be fused through a voice-changing fusion technology to obtain fused audio data that carries both the audio information indicated by the target audio data and the standard-timbre pronunciation indicated by the target tone information; the fused audio data is then spliced with the standard audio data of standard-timbre pronunciation generated through a speech synthesis technology, yielding spliced audio data for the original text information. The fused audio data is generated based on the target text information selected from the original text information, and the standard audio data is generated based on the remaining text information other than the target text information. In this way, the embodiments of this application can merge fused audio data, which has both audio information and standard-timbre pronunciation, with standard audio data, which has standard-timbre pronunciation only, so that the generated spliced audio data satisfies a dubbing requirement that combines standard-timbre pronunciation without audio information and standard-timbre pronunciation with audio information. Audio data with different audio information can thus be generated for different scenes in the original text information, which guarantees dubbing efficiency while improving the richness of audio creation and the quality of the audio data.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments or related technologies of the present application, the drawings needed to be used in the description of the embodiments or related technologies are briefly introduced below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a schematic structural diagram of a network architecture according to an embodiment of the present application;
fig. 2 is a schematic view of a scenario for performing data interaction according to an embodiment of the present application;
fig. 3 is a schematic flowchart of an audio data processing method according to an embodiment of the present application;
fig. 4a is a schematic view of a scene displaying original text information according to an embodiment of the present application;
fig. 4b is a schematic view of a scene displaying original text information according to an embodiment of the present application;
fig. 4c is a schematic view of a scene displaying original text information according to an embodiment of the present application;
FIG. 5a is a schematic view of a scene displaying an audio conversion interface according to an embodiment of the present application;
FIG. 5b is a schematic diagram of a scene displaying an audio conversion interface according to an embodiment of the present application;
FIG. 5c is a schematic diagram of a scene displaying an audio conversion interface according to an embodiment of the present application;
fig. 6a is a schematic view of a scenario for acquiring target audio data according to an embodiment of the present application;
fig. 6b is a schematic view of a scene for acquiring target audio data according to an embodiment of the present application;
fig. 7a is a schematic view of a scene displaying an audio conversion identifier according to an embodiment of the present application;
FIG. 7b is a schematic diagram of a scene displaying an audio conversion identifier according to an embodiment of the present application;
fig. 8 is a schematic flowchart of an audio data processing method according to an embodiment of the present application;
fig. 9 is a schematic flowchart of an audio data processing method according to an embodiment of the present application;
FIG. 10 is a schematic flow chart of an audio synthesis scheme provided by an embodiment of the present application;
fig. 11 is a schematic flowchart of audio voice changing provided in an embodiment of the present application;
fig. 12 is a schematic structural diagram of an audio data processing apparatus according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be appreciated that Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, autonomous driving, and intelligent transportation.
Specifically, please refer to fig. 1, where fig. 1 is a schematic structural diagram of a network architecture according to an embodiment of the present disclosure. As shown in fig. 1, the network architecture may include a service server 2000 and a cluster of end devices. The terminal device cluster may specifically include one or more terminal devices, and the number of terminal devices in the terminal device cluster is not limited herein. As shown in fig. 1, the plurality of terminal devices may specifically include a terminal device 3000a, a terminal device 3000b, terminal devices 3000c, …, and a terminal device 3000 n; the terminal device 3000a, the terminal device 3000b, the terminal devices 3000c, …, and the terminal device 3000n may be directly or indirectly connected to the service server 2000 through a wired or wireless communication manner, so that each terminal device may perform data interaction with the service server 2000 through the network connection.
Each terminal device in the terminal device cluster may be an intelligent terminal with a data processing function, such as a smartphone, a tablet computer, a notebook computer, a desktop computer, a smart home device, a smart television, a wearable device, or a vehicle-mounted terminal. It should be understood that each terminal device in the terminal device cluster shown in fig. 1 may be installed with an application client, and when the application client runs in a terminal device, it may perform data interaction with the service server 2000 shown in fig. 1. The application client may be an independent client or an embedded sub-client integrated in another client, which is not limited in this application.
The application client may specifically include a browser, a vehicle-mounted client, an intelligent home client, an entertainment client, a multimedia client (e.g., a video client), a social client, an information client, and other clients with a data processing function. The vehicle-mounted terminal can be an intelligent terminal in an intelligent traffic scene, and the application client on the vehicle-mounted terminal can be the vehicle-mounted client.
The service server 2000 may be a server corresponding to an application client, where the service server 2000 may be an independent physical server, may also be a server cluster or distributed system formed by a plurality of physical servers, and may also be a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN, and a big data and artificial intelligence platform.
For convenience of understanding, in the embodiment of the present application, one terminal device may be selected as a target terminal device from a plurality of terminal devices shown in fig. 1. For example, the terminal device 3000b shown in fig. 1 may be used as a target terminal device in the embodiment of the present application, and an application client having a data processing function may be installed in the target terminal device. At this time, the target terminal device may perform data interaction with the service server 2000 through the application client.
For convenience of understanding, the embodiments of the present application may collectively refer to a user who produces audio data through an application client as a target object, that is, the target object is an object that logs in to the application client. For convenience of understanding, the text information used for making the audio data uploaded by the target object in the application client may be collectively referred to as original text information. The audio data generated by the target object through the application client may be original audio data corresponding to the original text information, or may also be spliced audio data generated for the original text information, where the spliced audio data is generated based on the original audio data and the target audio data provided by the target object.
The original audio data may be audio data generated by a TTS (Text-To-Speech, speech synthesis) algorithm; the spliced audio data may be audio data obtained by splicing the standard audio data in the original audio data with the fused audio data; the fused audio data may be audio data obtained by fusing the target audio data with the target tone information; the target audio data may be audio data uploaded for the target text information; the target text information may be text information selected from the original text information; and the standard audio data may be the audio data of the remaining text information in the original text information other than the target text information. Since the standard audio data is synthesized based on the remaining text information, the standard audio data may also be referred to as synthesized audio data in the embodiments of this application.
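To keep these terms straight, the following sketch restates the relationships between the data items defined above in plain Python; all names are ours, not the application's, and the byte string merely stands in for real audio:

    from dataclasses import dataclass

    @dataclass
    class DubbingJob:
        """Illustrative container for the data items defined above."""
        original_text: str        # full text uploaded by the target object
        target_text: str          # selected portion to be dubbed with real audio info
        target_audio: bytes       # audio recorded/uploaded for the target text
        target_timbre: str        # timbre selected from the candidate timbre list

        @property
        def remaining_text(self) -> str:
            # text other than the target text; drives the standard (synthesized) audio
            start = self.original_text.find(self.target_text)
            return (self.original_text[:start]
                    + self.original_text[start + len(self.target_text):])

    job = DubbingJob("锄禾日当午，汗滴禾下土。谁知盘中餐，粒粒皆辛苦。",
                     "谁知盘中餐", b"...recorded audio...", "timbre_03")
    print(job.remaining_text)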
It should be understood that the network framework may be applied to fields such as audio content production and dubbing. The service scenarios to which the network framework applies may include, for example, movie dubbing, game dubbing, and novel dubbing; the applicable service scenarios are not exhaustively listed here. For example, in a movie dubbing scenario, the original text information may be a movie script uploaded by the target object, and the audio data generated from the movie script can serve as the dubbing of the movie. For another example, in a game dubbing scenario, the original text information may be a game script uploaded by the target object, and the audio data generated from the game script can serve as the dubbing of a virtual game (for example, an adventure game (AVG), an interactive game in which the player controls a character along a "virtual adventure" main storyline and advances by completing tasks or solving puzzles). For another example, in a novel dubbing scenario, the original text information may be a novel text uploaded by the target object, and an audio novel can be produced from the audio data generated from the novel text.
For ease of understanding, the embodiments of this application take the original text information as a movie script. In this case, on the basis of generating original audio data for the movie script, dubbing is produced for the target text information in the script (i.e., one or more lines selected from the movie script): the target audio data provided for those lines is fused with the target timbre information to obtain fused audio data that carries both the audio information of the target audio data and the target timbre information, and the audio data of the movie to which the script belongs (i.e., the spliced audio data) is then obtained from the fused audio data and the original audio data.
For easy understanding, please refer to fig. 2, and fig. 2 is a schematic diagram of a scenario for performing data interaction according to an embodiment of the present application. The server 20a shown in fig. 2 may be the service server 2000 in the embodiment corresponding to fig. 1, and the terminal device 20b shown in fig. 2 may be the target terminal device in the embodiment corresponding to fig. 1. The terminal device 20b is installed with an application client, a target object corresponding to the terminal device 20b may be an object 20c, and the application client may be configured to display original text information uploaded by the object 20c through the application client.
The terminal device 20b shown in fig. 2 may display the original text information uploaded by the object 20c in an application interface of the application client, and transmit the original text information to the server 20a, so that the server 20a converts the original text information into original audio data through a speech synthesis algorithm. When the object 20c needs to use the target text information of the real-person audio information, a trigger operation may be performed for the voice conversion function in the application interface. In this way, the terminal device 20b may respond to the trigger operation for the voice conversion function in the application interface, use the text information selected in the original text information based on the trigger operation as the target text information, and further send the target text information to the server 20 a. The audio information may include, but is not limited to, mood information, emotion information, and prosodic information.
As shown in fig. 2, the server 20a may use text information in the original text information except the target text information as the remaining text information, and divide the original audio data corresponding to the original text information into two parts according to the target text information and the remaining text information, where the two parts are the audio data corresponding to the target text information and the audio data corresponding to the remaining text information. For convenience of understanding, the embodiment of the present application may use audio data corresponding to the remaining text information in the original audio data as standard audio data.
Further, as shown in fig. 2, the terminal device 20b may acquire the target tone color information and the target audio data corresponding to the target text information, and then send the target tone color information and the target audio data to the server 20a in response to a confirmation operation for the target audio data and the target tone color information. The target audio data may be audio data uploaded by the object 20c through the application client, the object 20c may provide the audio data either by recording a reading of the text or by uploading an audio file, and the audio content in the target audio data matches the target text information.
The target tone color information may be tone color information selected by the object 20c from one or more candidate tone color information displayed by the terminal device 20b, and the one or more candidate tone color information may be tone color information configured by the server 20 a. Note that the timbre information may indicate a feature of sound, and refers to a perceptual characteristic of sound, and different sound generators (for example, musical instruments) generate different timbres of sound due to different materials and structures. In other words, the terminal device 20b may display the acquired one or more candidate tone color information after acquiring the one or more candidate tone color information from the server 20a in response to the triggering operation for the voice conversion function in the application interface, so that the object 20c selects the target tone color information among the one or more candidate tone color information.
In this way, after receiving the target tone information and the target audio data, the server 20a may fuse the target tone information and the target audio data to obtain fused audio data, and then splice the fused audio data and the standard audio data to obtain spliced audio data for the original text information. Wherein the audio content in the standard audio data matches the remaining text information. In other words, the server 20a may replace the audio data corresponding to the target text information in the original audio data by the fusion audio data generated based on the target text information, so as to obtain the spliced audio data for the original text information.
The fused audio data contains real audio information (e.g., mood, emotion, prosody), its timbre is the selected target tone color information, and it corresponds to the target text information selected in the original text information. The spliced audio data therefore contains both audio data with a purely synthesized timbre (i.e., the standard audio data) and audio data with real mood, emotion, and prosody in the target timbre (i.e., the fused audio data).
It can be understood that, when the object 20c needs to obtain the spliced audio data, an audio acquisition request may be sent to the server 20a through the terminal device 20b, so that the server 20a may return the spliced audio data to the terminal device 20b in response to the audio acquisition request, allowing the object 20c to audition or download the spliced audio data in the application client of the terminal device 20b. The audio acquisition request may be an audio audition request or an audio download request, which is not limited in this application.
It can be seen that the target text information whose audio information needs to be modified can be selected from the original text information, the target audio data carrying audio information can be obtained for that target text information, and the target audio data can be combined with the selected target tone information to obtain fused audio data that has both the audio information and a standard timbre. Furthermore, splicing the fused audio data with the standard audio data corresponding to the remaining text information (the text other than the target text information in the original text information) generates spliced audio data with both audio information and a standard timbre, which enriches the audio information in the audio data, improves the richness of audio creation, and improves the quality of the audio data.
Further, please refer to fig. 3, wherein fig. 3 is a schematic flowchart of an audio data processing method according to an embodiment of the present application. The method may be executed by a server, or may be executed by an application client, or may be executed by both the server and the application client, where the server may be the server 20a in the embodiment corresponding to fig. 2, and the application client may be the application client in the embodiment corresponding to fig. 2. For the convenience of understanding, the embodiment of the present application is described as an example in which the method is executed by an application client. Wherein, the audio data processing method may include the following steps S101 to S103:
step S101, displaying original text information in an application interface;
it should be appreciated that the application client may display an application interface. Wherein the application interface includes a text entry area. Further, the application client may display the input original text information in the text entry area in response to an input operation directed to the text entry area.
For easy understanding, please refer to fig. 4a, and fig. 4a is a schematic view of a scene displaying original text information according to an embodiment of the present application. The application interface 40a and the application interface 40b as shown in fig. 4a may be application interfaces of application clients at different times, and the text entry area 41a may be included in the application interface 40a and the application interface 40 b. The target object corresponding to the application client may be the object 41 b.
As shown in fig. 4a, the object 41b may perform an input operation on the text entry area 41a, so that the application client, in response to the input operation performed by the object 41b on the text entry area 41a, may take the text information entered by that operation as the original text information. The original text information may be original text information 41c, which may read: "Hoeing in the field at noon, sweat drips onto the soil beneath the seedlings. Who knows that the meal on the plate, grain by grain, is the fruit of hard toil."
At this time, the object 41b can directly enter the original text information 41c in the text entry area 41 a. As shown in fig. 4a, the application client may switch the application interface from application interface 40a to application interface 40b, displaying the original text information 41c in the text entry area 41a of application interface 40 b.
Alternatively, it should be understood that the application client may display the application interface. Wherein the application interface comprises a text upload control. Further, the application client may display a text selection interface for selecting a text file in response to a triggering operation for the text upload control. Further, the application client may respond to a text file selection operation for the text selection interface, and take the text file selected based on the text file selection operation as a target text file. Further, the application client may respond to a text file confirmation operation for the text selection interface, and display the original text information in the application interface with the text information in the target text file as the original text information.
The format of the target text file may be TXT (plain text), DOC (Word Document), DOCX (Office Open XML Document), PDF (Portable Document Format), or the like; the possible formats of the target text file are not exhaustively listed here.
For easy understanding, please refer to fig. 4b and 4c, and fig. 4b and 4c are schematic views of a scene displaying original text information according to an embodiment of the present application. The application interfaces 42a, 42b, 42c, and 42d shown in fig. 4b and 4c may be application interfaces of application clients at different times, the application interface 42a may be the application interface 40a in the embodiment corresponding to fig. 4a, the application interface 42d may be the application interface 40b in the embodiment corresponding to fig. 4a, and the application interface 42a and the application interface 42d may include text entry areas 43c therein. The target object corresponding to the application client may be an object 43b, and the object 43b may be an object 41b in the embodiment corresponding to fig. 4 a.
As shown in fig. 4b, the application interface 42a may include a text upload control 43a, and the object 43b may perform a trigger operation on the text upload control 43a, so that the application client may display a text selection interface 44a for selecting a text file in response to the trigger operation performed by the object 43b on the text upload control 43 a.
As shown in fig. 4b, the text selection interface 44a may include one or more folders and one or more text files; for ease of understanding, the description here assumes one folder and one text file. The one or more folders may include a folder J, and the one or more text files may include a file 43d.
It will be appreciated that the object 43b may perform a text file selection operation with respect to the text selection interface 44a (e.g., the object 43b may perform a text file selection operation with respect to the file 43d in the text selection interface 44 a), such that the application client may target the file 43d selected based on the text file selection operation as the target text file in response to the text file selection operation performed by the object 43b with respect to the file 43 d.
Optionally, the object 43b may also perform a folder selection operation on folder J, so that the application client may display one or more text files under folder J (e.g., file A1 and file A2, not shown in the figure) in the text selection interface 44a in response to that folder selection operation. Further, the object 43b may perform a text file selection operation on a text file under folder J (e.g., file A1), so that the application client, in response to the text file selection operation performed by the object 43b on file A1, may take the file A1 selected by that operation as the target text file.
As shown in fig. 4b and 4c, when responding to the text file selection operation, the application client may switch the text selection interface from the text selection interface 44a shown in fig. 4b to the text selection interface 44b shown in fig. 4c, and highlight the file 43d in the text selection interface 44b, that is, set the display status of the file 43d to the selected status. Where text selection interface 44a and text selection interface 44b may be text selection interfaces of the application client at different times.
As shown in fig. 4c, the text selection interface 44b may include a text confirmation control 43e, and the object 43b may perform a text file confirmation operation on the text selection interface 44b (e.g., on the text confirmation control 43e), so that the application client, in response to that confirmation operation, may take the text information in the target text file as the original text information. The original text information may be original text information 43f, which may read: "Hoeing in the field at noon, sweat drips onto the soil beneath the seedlings. Who knows that the meal on the plate, grain by grain, is the fruit of hard toil."
At this time, the object 43b has indirectly entered the original text information 43f into the text entry area 43c. As shown in fig. 4b and 4c, the application client may switch the application interface from application interface 42a to application interface 42d and display the original text information 43f in the text entry area 43c of application interface 42d.
Step S102, target tone information and target audio data corresponding to the target text information are obtained;
Specifically, the application client may obtain the target tone color information and the target audio data corresponding to the target text information in response to a trigger operation for the voice conversion function. The specific process may be described as follows: the application client may, in response to a text selection operation on the original text information, take the text information selected by that operation as the target text information, i.e., the target text information is the text selected from the original text information by the trigger operation for the voice conversion function. Further, the application client may display an audio conversion interface in response to a voice conversion operation on the target text information. Further, the application client may obtain the target tone color information in the audio conversion interface. Further, the application client may obtain the target audio data corresponding to the target text information in response to an audio upload operation on the audio conversion interface. The audio content in the target audio data matches the target text information.
For ease of understanding, please refer to fig. 5a, fig. 5a is a schematic view of a scene displaying an audio conversion interface according to an embodiment of the present application. The application interface 50a and the application interface 50b shown in fig. 5a may be application interfaces of application clients at different time, the application interface 50a and the application interface 50b may include a text entry area 51a, and the application interface 50a may be the application interface 40b in the embodiment corresponding to fig. 4a or the application interface 42d in the embodiment corresponding to fig. 4 c. The target object corresponding to the application client may be the object 51 c.
As shown in fig. 5a, the original text information 51b may be displayed in the text entry area 51a, and the object 51c may perform a text selection operation on the original text information 51b, so that the application client, in response to that text selection operation, may take the text information 51d selected by the operation as the target text information, where the text information 51d may be "Who knows that the meal on the plate".
As shown in fig. 5a, when responding to the text selection operation, the application client may switch the application interface from the application interface 50a to the application interface 50b, and highlight the target text information in the application interface 50b, that is, set the display state of the target text information to the selected state.
Wherein the application interface includes a first voice conversion control. It should be appreciated that the application client may display the audio conversion interface in response to a triggering operation for the first speech conversion control. For convenience of understanding, reference may be made to fig. 5b for a specific process of the application client responding to a trigger operation for the first voice conversion control, where fig. 5b is a schematic view of a scene displaying an audio conversion interface according to an embodiment of the present application. The application interface 52a shown in fig. 5b may be the application interface 50b in the embodiment corresponding to fig. 5a, and the object 53c shown in fig. 5b may be the object 51c in the embodiment corresponding to fig. 5 a.
As shown in fig. 5b, the application interface 52a may include a first speech conversion control 53b, and the object 53c may perform a trigger operation on the first speech conversion control 53b, so that the application client may display the audio conversion interface 52b for the target text information 53a in the selected state in response to the trigger operation performed by the object 53c on the first speech conversion control 53 b.
Alternatively, it should be understood that the application client may display the text control list in response to a trigger operation on the target text information, where the text control list includes a second voice conversion control. Further, the application client may display an audio conversion interface in response to a trigger operation on the second voice conversion control. For ease of understanding of this process, refer to fig. 5c, which is a schematic view of a scene displaying an audio conversion interface according to an embodiment of this application. As shown in fig. 5c, the application interface 54a may be the application interface 50b in the embodiment corresponding to fig. 5a, the application interface 54a and the application interface 54b may be application interfaces of the application client at different times, and the object 55c shown in fig. 5c may be the object 51c in the embodiment corresponding to fig. 5a.
As shown in fig. 5c, the object 55c may perform a trigger operation on the target text information 55a in the selected state, so that the application client may switch the application interface from the application interface 54a to the application interface 54b in response to the trigger operation performed by the object 55c on the target text information 55a in the selected state, and display the text control list in the application interface 54 b. Wherein a second speech conversion control 55b may be included in the text control list.
As shown in fig. 5c, the object 55c may perform a trigger operation with respect to the second speech conversion control 55b, and thus, the application client may display the audio conversion interface 54c for the target text information 55a in the selected state in response to the trigger operation performed by the object 55c with respect to the second speech conversion control 55 b.
The audio conversion interface comprises a recording starting control. It should be appreciated that the application client may display a recording stop control in the audio conversion interface in response to a triggering operation for the recording start control. Further, the application client may respond to the trigger operation for the recording stop control, acquire audio data that is entered by the target object within a time interval between the response of the trigger operation for the recording start control and the response of the trigger operation for the recording stop control, and use the audio data that is entered by the target object as target audio data corresponding to the target text information. Further, the application client may display the audio file identifier corresponding to the target audio data in the audio conversion interface.
For ease of understanding, please refer to fig. 6a and fig. 6b, and fig. 6a and fig. 6b are schematic views illustrating a scenario for acquiring target audio data according to an embodiment of the present application. The audio conversion interface 60a, the audio conversion interface 60b, and the audio conversion interface 60c shown in fig. 6a and fig. 6b may be audio conversion interfaces of the application client at different time, and the audio conversion interface 60c may be the audio conversion interface 52b in the embodiment corresponding to fig. 5b or the audio conversion interface 54c in the embodiment corresponding to fig. 5 c. The target object corresponding to the application client may be an object 61a, and the object 61a may be an object 41b in the embodiment corresponding to fig. 4 a.
As shown in fig. 6a, the audio conversion interface 60a may include a recording start control 61b, and the object 61a may perform a trigger operation on the recording start control 61b, so that the application client may switch the audio conversion interface from the audio conversion interface 60a to the audio conversion interface 60b in response to the trigger operation performed by the object 61a on the recording start control 61b, and display a recording stop control 61c in the audio conversion interface 60 b.
As shown in fig. 6a, the object 61a may perform a triggering operation with respect to the recording stop control 61c. Thus, in response to the triggering operation performed by the object 61a with respect to the recording stop control 61c, the application client may acquire the audio data entered by the object 61a within the time interval (e.g., 5 seconds) between responding to the triggering operation performed by the object 61a with respect to the recording start control 61b and responding to the triggering operation with respect to the recording stop control 61c, and take the audio data entered by the object 61a as the target audio data associated with the target text information (i.e., "whose disc is known to have dinner").
As shown in fig. 6b, when responding to the trigger operation for the recording stop control 61c, the application client may switch the audio conversion interface from the audio conversion interface 60b to the audio conversion interface 60c, and display an audio file identifier 61d corresponding to the target audio data in the audio conversion interface 60 c. The audio file identifier 61d may be used to perform audition on the target audio data, and the time bar in the audio file identifier 61d may be used to adjust the playing progress of the target audio data.
Optionally, when the application client responds to the trigger operation for the recording start control, a recording pause control may also be displayed in the audio conversion interface. Further, the application client may respond to the triggering operation for the recording pause control, take the audio data entered by the target object within the time interval between the triggering operation for the recording start control and the triggering operation for the recording pause control as the first audio data, and display a recording continuation control in the audio conversion interface. Further, the application client may continue to acquire the audio data entered by the target object in response to the trigger operation for the recording continuation control. Further, the application client may respond to the triggering operation for the recording stop control, and take the audio data entered by the target object within the time interval between responding to the triggering operation for the recording continuation control and responding to the triggering operation for the recording stop control as the second audio data. Further, the application client may splice the first audio data and the second audio data to obtain the target audio data corresponding to the target text information, as illustrated in the sketch below.
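For illustration only (and not as part of the claimed embodiment), the following minimal Python sketch shows one way the first audio data and the second audio data could be concatenated into the target audio data; the array contents and the 16 kHz sample rate are hypothetical.

```python
import numpy as np

def splice_recording_segments(first_audio: np.ndarray,
                              second_audio: np.ndarray) -> np.ndarray:
    """Concatenate the segment recorded before the pause with the segment
    recorded after resuming, yielding one continuous target audio clip.
    Both segments are assumed to share the same sample rate and channels."""
    return np.concatenate([first_audio, second_audio])

# Hypothetical usage: two mono clips sampled at 16 kHz.
first = np.zeros(16000, dtype=np.float32)   # 1 s recorded before the pause
second = np.ones(8000, dtype=np.float32)    # 0.5 s recorded after resuming
target_audio = splice_recording_segments(first, second)
```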
Optionally, the audio conversion interface includes an audio upload control. It should be appreciated that the application client may display an audio selection interface for selecting an audio file in response to a triggering operation for the audio upload control. Further, the application client can respond to the audio file selection operation of the audio selection interface, and the audio file selected based on the audio file selection operation is used as the target audio file. Further, the application client may respond to an audio file confirmation operation for the audio selection interface, and take the audio data in the target audio file as the target audio data corresponding to the target text information. Further, the application client may display the audio file identifier corresponding to the target audio data in the audio conversion interface.
It should be understood that, as shown in fig. 6a, the audio conversion interface 60a may include an audio upload control 62a, and a specific process of the application client obtaining the target audio data based on the audio upload control 62a may refer to the description of obtaining the target text information based on the text upload control 43a in the embodiment corresponding to fig. 4b and fig. 4c, which will not be described again here.
Referring to fig. 6a and fig. 6b again, when the application client obtains the audio conversion interface 60c based on the recording start control 61b or the audio upload control 62a, the audio conversion interface 60c may include an update recording start control 62b and an update audio upload control 62c. The update recording start control 62b may be used to record update audio data again in the same manner as the recording start control 61b, and the update audio upload control 62c may be used to upload update audio data again in the same manner as the audio upload control 62a. Wherein the update audio data may be used to update the target audio data.
Wherein the audio conversion interface includes one or more candidate timbre information. It should be understood that the application client may take the candidate timbre information selected based on the timbre selection operation as the target timbre information in response to the timbre selection operation for one or more candidate timbre information. Further, the application client may highlight the target timbre information in the audio conversion interface.
For easy understanding, please refer to fig. 7a, and fig. 7a is a schematic view of a scene displaying an audio conversion identifier according to an embodiment of the present application. The audio conversion interface 70a and the audio conversion interface 70b shown in fig. 7a may be audio conversion interfaces of the application client at different time, and the audio conversion interface 70a may be the audio conversion interface 60c in the embodiment corresponding to fig. 6 b. The target object corresponding to the application client may be the object 72 a.
As shown in fig. 7a, the audio conversion interface 70a may include one or more candidate timbre information, and the one or more candidate timbre information may specifically include candidate timbre information 71a, candidate timbre information 71b, candidate timbre information 71c, and candidate timbre information 71 d. The object 72a may perform a tone color selection operation with respect to one or more candidate tone color information (e.g., candidate tone color information 71a), and thus the application client may take the candidate tone color information 71a selected based on the tone color selection operation as target tone color information in response to the tone color selection operation performed by the object 72a with respect to the candidate tone color information 71 a.
As shown in fig. 7a, in response to the tone color selection operation, the application client may switch the audio conversion interface from the audio conversion interface 70a to the audio conversion interface 70b, and highlight the target tone color information 71a in the audio conversion interface 70b, that is, set the display state of the target tone color information 71a to the selected state.
Step S103, acquiring spliced audio data aiming at the original text information.
Specifically, the application client may respond to the confirmation operation for the target audio data and the target tone information to obtain the spliced audio data for the original text information. The spliced audio data is obtained by splicing the fused audio data and the standard audio data corresponding to the residual text information; the residual text information is the text information except the target text information in the original text information; the audio content in the standard audio data matches the remaining text information. The fusion audio data is obtained by fusing target audio data and target tone information.
It should be appreciated that, in generating the spliced audio data, the application client may display the audio conversion identifier in the target area associated with the target text information. The audio conversion identifier is used for representing the target tone color information corresponding to the target text information among the one or more candidate tone color information.
For easy understanding, please refer to fig. 7b, and fig. 7b is a schematic view of a scene displaying an audio conversion identifier according to an embodiment of the present application. As shown in fig. 7b, the audio conversion interface 70b is the audio conversion interface 70b in the embodiment corresponding to fig. 7a, and the application interface 70c and the audio conversion interface 70b are different interfaces of the application client.
As shown in fig. 7b, a fusion confirmation control 72b may be included in the audio conversion interface 70b, and the object 72a may perform a confirmation operation with respect to the target audio data and the target timbre information 71a (e.g., the object 72a may perform a confirmation operation with respect to the fusion confirmation control 72 b), so that the application client may display an audio conversion identifier 72d (i.e., "mock porket") in the target area 72c of the application interface 70c associated with the target textual information in response to the confirmation operation performed by the object 72a with respect to the fusion confirmation control 72 b. The audio conversion identifier 72d is used to characterize the target tone color information 71a corresponding to the target text information (i.e., "whose disc is known to have dinner").
The target area 72c may be located at any position in the application interface 70c so as to reduce the influence on the display effect of the original text information. Generally, the target area 72c may be located at an edge position of the target text information. For ease of understanding, the embodiment of the present application takes the case where the target area 72c is located at a position to the right of the target text information as an example.
Optionally, the application client may display a tone color selection interface in response to a trigger operation for the audio conversion identifier. Wherein the tone color selection interface may include one or more candidate tone color information. Further, the application client may respond to the trigger operation for the one or more candidate tone color information, and take the candidate tone color information selected based on the trigger operation for the one or more candidate tone color information as the updated tone color information. Further, the application client may obtain the updated spliced audio data for the original text information in response to a confirmation operation for the updated tone color information. The updating and splicing audio data is obtained by splicing the updating and fusing audio data and the standard audio data corresponding to the residual text information; the updated fusion audio data is obtained by fusing the target audio data and the updated tone information.
Alternatively, it should be appreciated that in generating the updated splice audio data, the application client may display the updated audio conversion identifier in the target area associated with the target textual information. And the updated audio conversion identifier is used for representing the corresponding updated timbre information of the target text information in the one or more candidate timbre information.
Therefore, in the embodiment of the application, the target audio data and the target tone color information can be fused through a voice-changing fusion technology to obtain fused audio data that carries the audio information indicated by the target audio data and the standard timbre pronunciation indicated by the target tone color information, and the fused audio data is spliced with the standard audio data of the standard timbre pronunciation generated through a speech synthesis technology to obtain the spliced audio data for the original text information. The fused audio data is generated based on the target text information selected in the original text information, and the standard audio data is generated based on the remaining text information in the original text information other than the target text information. Therefore, the embodiment of the application can merge the fused audio data, which has both the audio information and the standard timbre pronunciation, with the standard audio data, which has only the standard timbre pronunciation, so that the generated spliced audio data simultaneously satisfies the dubbing appeal of combining standard timbre pronunciation without audio information and standard timbre pronunciation with audio information. In this way, audio data with different audio information can be generated for different scenes in the original text information, dubbing efficiency can be guaranteed, the richness of audio creation is improved, and the quality of the audio data is improved.
Further, please refer to fig. 8, and fig. 8 is a flowchart illustrating an audio data processing method according to an embodiment of the present application. The method may be executed by a server, or may be executed by an application client, or may be executed by both the server and the application client, where the server may be the server 20a in the embodiment corresponding to fig. 2, and the application client may be the application client in the embodiment corresponding to fig. 2. For the convenience of understanding, the embodiment of the present application is described as an example in which the method is executed by an application client. Wherein, the audio data processing method may include the following steps S201 to S203:
step S201, displaying an audio conversion identifier in a target area associated with target text information;
the application interface comprises an audio downloading control and an audio sharing control.
Referring to fig. 7b again, the standard tone color information 80a shown in fig. 7b may represent tone color information corresponding to standard audio data, the standard tone color information 80a may be "elegant", and the speech rate of the standard audio data is 1.0x (i.e. 1 × speed). The application client may respond to the trigger operation for the standard tone color information 80a to obtain the sample tone color information, and update the standard tone color information 80a with the sample tone color information. At this time, the application client may update the standard tone color information 80a corresponding to the standard audio data to the sample tone color information. The sample tone color information and the standard tone color information 80a are acquired from the tone color library.
Alternatively, the same timbre information in the timbre library may correspond to different dialects; for example, the timbre information "maiden" in the timbre library may correspond to "Sichuan dialect", and the same timbre information "maiden" may also correspond to "Hunan dialect".
The audio download control 80b shown in fig. 7b may be used to download the spliced audio data, and the audio sharing control 80c shown in fig. 7b may be used to share the spliced audio data. For the specific process of the application client responding to the trigger operation for the audio download control 80b, reference may be made to step S202, and for the specific process of the application client responding to the trigger operation for the audio sharing control 80c, reference may be made to step S203. In addition, the application interface 70c may further include an audio audition control 80d, the audio audition control 80d may be used to audition the spliced audio data, and the application client may play the spliced audio data on the application interface 70c when responding to the trigger operation for the audio audition control 80d.
Step S202, responding to the trigger operation aiming at the audio downloading control, and downloading the spliced audio data to a terminal disk;
specifically, the application client may display the audio format list in response to a trigger operation for the audio download control. Wherein, the audio format list comprises one or more audio format information. Further, the application client may respond to the triggering operation for the one or more pieces of audio format information, take the audio format selected based on the triggering operation for the one or more pieces of audio format information as the target audio format information, and download the spliced audio data with the target audio format information to the terminal disk.
The list of audio formats may be used to display the audio format information supported by the spliced audio data, such as the MP3 (Moving Picture Experts Group Audio Layer III) format and the WAV (Waveform Audio File Format) format.
When responding to the triggering operation for the audio format list, the application client may automatically download the spliced audio data to the default directory, or may display a directory selection interface and, when responding to a directory selection operation for the directory selection interface, download the spliced audio data to the target directory selected based on the directory selection operation.
Step S203, responding to the trigger operation for the audio sharing control, and sharing the spliced audio data.
Specifically, the application client may respond to a trigger operation for the audio sharing control, and display the sharing platform list. The sharing platform list comprises one or more sharing platform information. Further, the application client may respond to the trigger operation for the one or more pieces of sharing platform information, use the sharing platform information selected based on the trigger operation for the one or more pieces of sharing platform information as the target sharing platform information, and share the spliced audio data to the social sharing platform to which the target sharing platform information belongs.
The social sharing platform may be a sharing platform inside the application client, or may be a sharing platform outside the application client (i.e., a sharing platform in another application client). In other words, the application client may share the spliced audio data to the application client itself, and may also share the spliced audio data to other application clients.
Optionally, the application client may respond to the trigger operation for the audio sharing control to acquire the audio link of the spliced audio data. The audio link is used for opening spliced audio data, the terminal equipment to which the application client belongs can display the audio link when responding to pasting operation aiming at the audio link, and then display the spliced audio data pointed by the audio link when responding to opening operation aiming at the audio link.
Therefore, the target object in the embodiment of the application can automatically make original audio data through a productized platform tool, and the original audio data is corrected through the uploaded target audio data, namely, the pronunciation, the tone, the emotion and the like are adjusted, so that the made audio data is more standard, rich in emotion and rich in tone. Meanwhile, the target object can also audit the generated spliced audio data, so that the dubbing cost is greatly reduced, and the user experience of making the dubbing audio is improved. In addition, after the spliced audio data is generated, the target object can also download, listen to or share the spliced audio data, and the user experience of the target object using the platform tool is improved.
Further, please refer to fig. 9, and fig. 9 is a flowchart illustrating an audio data processing method according to an embodiment of the present application. The method may be executed by a server, or may be executed by an application client, or may be executed by both the server and the application client, where the server may be the server 20a in the embodiment corresponding to fig. 2, and the application client may be the application client in the embodiment corresponding to fig. 2. For convenience of understanding, the embodiment of the present application is described as an example of the method being executed by an application client. Wherein, the audio data processing method may include the following steps S301 to S307:
step S301, displaying original text information in an application interface;
for a specific process of displaying the original text information in the application interface by the application client, reference may be made to the description of the embodiments corresponding to fig. 4a, fig. 4b, and fig. 4c, which will not be described again here.
Step S302, responding to the text selection operation aiming at the original text information, and taking the text information selected based on the text selection operation as target text information;
for a specific process of determining the target text information by the application client based on the text selection operation, reference may be made to the description of the embodiment corresponding to fig. 5a, which will not be repeated herein.
Step S303, responding to the voice conversion operation aiming at the target text information, and displaying an audio conversion interface;
for a specific process of displaying the audio conversion interface based on the voice conversion operation by the application client, reference may be made to the description of the embodiment corresponding to fig. 5b and fig. 5c, which will not be described again here.
Step S304, acquiring target tone information in an audio conversion interface;
for a specific process of the application client obtaining the target tone information in the audio conversion interface, reference may be made to the description of the embodiment corresponding to fig. 7a, which will not be repeated herein.
Step S305, responding to an audio uploading operation aiming at an audio conversion interface, and acquiring target audio data corresponding to target text information;
specifically, the application client may obtain auxiliary audio data associated with the target text information in response to an audio upload operation for the audio conversion interface. Further, the application client may perform denoising processing on the auxiliary audio data to obtain denoised auxiliary audio data. Further, the application client can perform content detection on the denoised auxiliary audio data to obtain the content matching degree between the audio content in the denoised auxiliary audio data and the target text information. Further, if the content matching degree is greater than the matching degree threshold, the application client may determine the auxiliary audio data after denoising as the target audio data corresponding to the target text information.
The auxiliary audio data may be audio data uploaded through the recording start control and the recording stop control, that is, the auxiliary audio data may be audio data recorded by the target object within a time interval in response to the trigger operation for the recording start control and in response to the trigger operation for the recording stop control. Optionally, the auxiliary audio data may be audio data uploaded through the audio upload control, that is, the auxiliary audio data may be audio data in a target audio file selected by the target object in the audio selection interface. The audio data in the target audio file may be audio data recorded in advance, or audio data with audio information generated by a model.
It should be understood that the specific process of the application client performing denoising processing on the auxiliary audio data can be described as follows: the application client can input the auxiliary frequency signal of the auxiliary audio data into the denoising network model, and denoising the auxiliary frequency signal through the denoising network model to obtain the target frequency signal. Further, the application client can obtain the audio attribute of the auxiliary audio data, and restore the target frequency signal through the audio attribute to obtain the denoised auxiliary audio data.
The audio attribute may include, but is not limited to, sampling rate, bit depth, duration, and channel number information. The target frequency signal is restored through the audio attribute, so that the sampling rate, bit depth, duration, and channel number information of the generated denoised auxiliary audio data remain unchanged after the denoising processing.
The denoising process is performed in the frequency domain. A frequency-domain signal is concerned with frequency distribution and amplitude, so the key to denoising is to extract the frequency spectrum of the noise and then perform a reverse compensation operation on the noise-containing speech (i.e., the auxiliary audio data) according to the frequency spectrum of the noise, thereby obtaining the noise-reduced speech (i.e., the denoised auxiliary audio data). In practical applications, after the digitally sampled signal is subjected to a Fourier transform, the frequency spectrum of the signal (i.e., the auxiliary frequency signal) can be obtained, and after the processing in the frequency domain is completed, the signal can be converted from the frequency domain back to the time domain by using an inverse Fourier transform.
The denoising network model can be used for denoising the auxiliary audio data (namely noise reduction), and the denoising process can remove noise whose frequency is too high or too low in the auxiliary audio data. The noise can be of various types, such as white noise with a stable frequency spectrum, as well as non-stationary impulse noise and fluctuating noise. It should be understood that the embodiment of the present application does not limit the model type of the denoising network model, and meanwhile, the embodiment of the present application does not limit the specific algorithm used in the denoising process.
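As an illustrative, non-limiting sketch of the frequency-domain idea described above (not of the denoising network model itself), the following Python snippet performs classical spectral subtraction: the signal is transformed to the frequency domain frame by frame, an estimated noise spectrum is subtracted as the reverse compensation, and the result is converted back to the time domain. The frame size, hop size, and the assumption that the first few frames contain only noise are hypothetical.

```python
import numpy as np

def spectral_subtraction(noisy: np.ndarray, frame: int = 512, hop: int = 256,
                         noise_frames: int = 8) -> np.ndarray:
    """Classical spectral-subtraction sketch; assumes the first `noise_frames`
    frames contain only noise and that len(noisy) >= frame."""
    window = np.hanning(frame)
    n_frames = 1 + (len(noisy) - frame) // hop
    # Short-time Fourier transform: time domain -> frequency domain, frame by frame.
    stft = np.stack([np.fft.rfft(window * noisy[i * hop: i * hop + frame])
                     for i in range(n_frames)])
    noise_mag = np.abs(stft[:noise_frames]).mean(axis=0)    # estimated noise spectrum
    clean_mag = np.maximum(np.abs(stft) - noise_mag, 0.0)   # reverse compensation
    clean_stft = clean_mag * np.exp(1j * np.angle(stft))    # keep the noisy phase
    # Inverse Fourier transform and overlap-add back into the time domain.
    out = np.zeros(len(noisy))
    weight = np.zeros(len(noisy))
    for i, spec in enumerate(clean_stft):
        out[i * hop: i * hop + frame] += np.fft.irfft(spec, n=frame) * window
        weight[i * hop: i * hop + frame] += window ** 2
    return out / np.maximum(weight, 1e-8)

# Hypothetical usage: denoised = spectral_subtraction(auxiliary_audio_samples)
```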
It should be understood that the specific process of the application client performing content detection on the denoised auxiliary audio data can be described as follows: the application client can determine an audio pronunciation sequence of the denoised auxiliary audio data and a text pronunciation sequence of the target text information. Further, the application client may divide the audio pronunciation sequence into at least two audio pronunciation subsequences, determine a subsequence matching degree of each audio pronunciation subsequence with respect to the text pronunciation sequence, and use the audio pronunciation subsequence with the subsequence matching degree greater than a subsequence threshold value as a matching pronunciation subsequence. Further, the application client may determine a proportion of the matching pronunciation subsequence in the at least two audio pronunciation subsequences, and use the proportion of the matching pronunciation subsequence in the at least two audio pronunciation subsequences as a content matching degree (i.e., pronunciation integrity degree) between the audio content in the denoised auxiliary audio data and the target text information.
The audio pronunciation sequence and the text pronunciation sequence are compared according to pronunciation phoneme levels, the audio pronunciation sequence may be a phoneme sequence corresponding to the denoised auxiliary audio data, and the text pronunciation sequence may be a phoneme sequence corresponding to the target text information. For example, in the case where the target text information is "whose disc is known to have a meal", the text pronunciation sequence may be "sheizhipanzhongcan"; the audio utterance sequence may be "heiizhipanzhongcanya" when the audio utterance in the denoised auxiliary audio data is "whose disc is known to have dinner".
The sub-sequence matching degree can be used for representing the matching degree of each audio pronunciation sub-sequence, and the content matching degree can be used for representing the matching degree of the audio pronunciation sequence. If the matching degree of the subsequence is greater than the subsequence threshold, determining that the audio pronunciation subsequence matches the text pronunciation sequence; if the content matching degree is greater than the matching degree threshold value, it can be determined that the audio pronunciation sequence matches the text pronunciation sequence. It should be understood that the embodiments of the present application do not limit specific values of the subsequence threshold and the matching degree threshold, for example, the subsequence threshold may be 90%, and the matching degree threshold may be 60%.
The application client can compare each audio pronunciation subsequence with the text pronunciation sequence, and determine the subsequence matching degree corresponding to each audio pronunciation subsequence. Optionally, the application client may also divide the text pronunciation sequence into at least two text pronunciation subsequences, and compare each audio pronunciation subsequence with each text pronunciation subsequence, so as to determine a subsequence matching degree corresponding to each audio pronunciation subsequence. Wherein, the number of the at least two text pronunciation subsequences and the number of the at least two audio pronunciation subsequences can be the same or different.
The application client can divide the text pronunciation sequence into at least two text pronunciation subsequences according to the sequence length of the audio pronunciation subsequences; optionally, the application client may also divide the text pronunciation sequence into at least two text pronunciation subsequences according to the duration of the audio pronunciation subsequence in the denoised auxiliary audio data. The sequence length of each audio pronunciation subsequence may be the same or different, and the sequence length of each text pronunciation subsequence may be the same or different.
Wherein, it is understood that the application client can use the N-Gram model to obtain the audio candidate sequence associated with each audio pronunciation subsequence and obtain the text candidate sequence associated with each text pronunciation subsequence. For example, the reserved maximum N-Gram may be set to be 4-Gram in the embodiment of the present application, and the number of phonemes in the audio candidate sequence and the text candidate sequence may be 1, 2, 3, or 4. Thus, the application client can respectively determine the candidate sequence similarity between the audio candidate sequence and the text candidate sequence under the 1-Gram, the 2-Gram, the 3-Gram and the 4-Gram, and further carry out weighted summation on the candidate sequence similarity under the 1-Gram, the 2-Gram, the 3-Gram and the 4-Gram to obtain the subsequence matching degree corresponding to the audio pronunciation subsequence.
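For ease of understanding, the following Python sketch illustrates the subsequence matching idea described above: each audio pronunciation subsequence is scored against the text pronunciation sequence by averaging 1-Gram to 4-Gram overlaps, and the content matching degree is the proportion of subsequences whose score exceeds the subsequence threshold. The phoneme sequences, the subsequence length, and the thresholds are hypothetical examples rather than the exact algorithm of the embodiment.

```python
from typing import List

def ngram_set(seq: List[str], n: int) -> set:
    """All n-grams of a phoneme sequence, as a set of tuples."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}

def subsequence_match(sub: List[str], text_seq: List[str], max_n: int = 4) -> float:
    """Average of 1-Gram to 4-Gram overlaps between one audio pronunciation
    subsequence and the text pronunciation sequence."""
    scores = []
    for n in range(1, max_n + 1):
        sub_grams = ngram_set(sub, n)
        if not sub_grams:
            continue
        scores.append(len(sub_grams & ngram_set(text_seq, n)) / len(sub_grams))
    return sum(scores) / len(scores) if scores else 0.0

def content_matching_degree(audio_phonemes: List[str], text_phonemes: List[str],
                            sub_len: int = 4, sub_threshold: float = 0.9) -> float:
    """Fraction of audio pronunciation subsequences whose match exceeds the
    subsequence threshold; compared against a matching-degree threshold
    (e.g. 0.6) to accept or reject the uploaded audio."""
    subs = [audio_phonemes[i:i + sub_len]
            for i in range(0, len(audio_phonemes), sub_len)]
    matched = sum(1 for s in subs
                  if subsequence_match(s, text_phonemes) > sub_threshold)
    return matched / len(subs) if subs else 0.0

# Hypothetical character-level sequences standing in for real phoneme sequences.
audio = list("sheizhipanzhongcan")
text = list("sheizhipanzhongcan")
print(content_matching_degree(audio, text))   # 1.0 -> above a 0.6 threshold
```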
Optionally, the application client may also directly determine the sequence similarity between the audio pronunciation sequence and the text pronunciation sequence without dividing the audio pronunciation sequence and the text pronunciation sequence, and use the sequence similarity as the content matching degree between the audio content in the denoised auxiliary audio data and the target text information. It should be understood that the embodiment of the present application does not limit the specific method for determining the sequence similarity; for example, the embodiment of the present application may determine the sequence similarity between the audio pronunciation sequence and the text pronunciation sequence through Cosine Similarity.
Optionally, it should be understood that the specific process of the application client performing content detection on the denoised auxiliary audio data may be described as follows: the application client can perform voice recognition on the denoised auxiliary audio data to obtain audio text information of the denoised auxiliary audio data. Further, the application client may determine a text similarity between the audio text information and the target text information, and use the text similarity as a content matching degree between the audio content in the denoised auxiliary audio data and the target text information.
The application client may recognize, by using ASR (Automatic Speech Recognition), the audio text information of the denoised auxiliary audio data, in other words, the audio text information of the denoised auxiliary audio data may be referred to as the audio content in the denoised auxiliary audio data. For example, the target text information may be "whose person knows dinner in a disc", and the audio text information of the auxiliary audio data after the denoising may be "whose person knows dinner in a disc". It should be understood that the embodiment of the present application does not limit a specific method for determining the text Similarity, for example, the embodiment of the present application may determine the text Similarity between the audio text information and the target text information through Cosine Similarity (Cosine Similarity).
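The following minimal sketch shows one possible realisation of the text-similarity variant: the ASR transcript and the target text information are turned into character-frequency vectors and compared with Cosine Similarity. The transcript string and the 0.6 threshold are hypothetical.

```python
import math
from collections import Counter

def cosine_text_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between character-frequency vectors of the ASR
    transcript and the target text; one simple way to realise the
    content matching degree described above."""
    vec_a, vec_b = Counter(text_a), Counter(text_b)
    dot = sum(vec_a[c] * vec_b[c] for c in set(vec_a) & set(vec_b))
    norm = math.sqrt(sum(v * v for v in vec_a.values())) * \
           math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / norm if norm else 0.0

# Hypothetical: transcript returned by an ASR engine vs. the selected text.
asr_transcript = "who knows the meal on the plate"
target_text = "who knows the meal on the plate"
print(cosine_text_similarity(asr_transcript, target_text) > 0.6)  # True: check passes
```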
Optionally, if the content matching degree is less than or equal to the matching degree threshold, the application client may display an error prompt message in the audio conversion interface. Optionally, if the content matching degree is less than or equal to the matching degree threshold, the application client may display an error prompt interface, and display error prompt information in the error prompt interface. For example, the error prompt may be "upload audio does not match text, please re-upload audio! ". If the content matching degree is smaller than or equal to the matching degree threshold value, the target audio data needs to be uploaded again.
Step S306, responding to the confirmation operation aiming at the target audio data and the target tone information, and fusing the target audio data and the target tone information to obtain fused audio data;
specifically, the application client may input the target audio data into the fusion network model, and extract audio text information and audio features in the target audio data through the fusion network model. Wherein the audio features include at least one of emotional features, mood features, or prosodic features. Further, the application client can fuse the audio text information, the audio features, and the target tone information in the fusion network model to obtain the fused audio data.
When the audio features comprise emotion features, mood features and rhythm features, the application client can fuse audio text information, emotion features, mood features, rhythm features and target tone information to obtain fused audio data with mood information, emotion information and rhythm information. Optionally, when the audio features include a mood feature and a prosody feature, the application client may fuse the audio text information, the mood feature, the prosody feature, and the target tone information to obtain fused audio data having the mood information and the prosody information.
It should be understood that the embodiment of the present application does not limit the model type of the fusion network model, and meanwhile, the embodiment of the present application does not limit the specific algorithm used for fusing the target audio data and the target timbre information.
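Since the embodiment does not limit the model type of the fusion network model, the following Python sketch only illustrates the data flow of the fusion step; `feature_extractor` and `synthesizer` are hypothetical callables standing in for the two halves of such a model, not a concrete library API.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class AudioFeatures:
    text_content: str        # recognised content of the uploaded audio
    emotion: np.ndarray      # emotion-related features (e.g. fundamental frequency)
    mood: np.ndarray         # mood / speaking-style features
    prosody: np.ndarray      # duration, stress and rhythm features

def fuse_audio(target_audio: np.ndarray,
               timbre_embedding: np.ndarray,
               feature_extractor,
               synthesizer) -> np.ndarray:
    """Data-flow sketch of the fusion step; the injected callables are
    hypothetical stand-ins for a fusion network model."""
    feats: AudioFeatures = feature_extractor(target_audio)
    # Re-render the same content, emotion, mood and prosody, but with the
    # pronunciation of the selected target timbre.
    return synthesizer(text=feats.text_content,
                       emotion=feats.emotion,
                       mood=feats.mood,
                       prosody=feats.prosody,
                       timbre=timbre_embedding)
```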
And S307, acquiring standard audio data corresponding to the residual text information, and splicing the fusion audio data and the standard audio data to obtain spliced audio data for the original text information.
Specifically, the application client may input the original text information to the voice conversion network model, and perform voice conversion on the original text information through the voice conversion network model to obtain original audio data corresponding to the original text information. Further, the application client may determine, according to the position information of the target text information in the original text information, an audio start position and an audio end position for the target text information in the original audio data, and use the audio data of the original audio data before the audio start position and the audio data after the audio end position as standard audio data corresponding to the remaining text information. Further, the application client may extract candidate audio data from the fused audio data, and splice the candidate audio data to the audio start position and the audio end position in the standard audio data to obtain spliced audio data for the original text information.
It should be understood that the embodiment of the present application does not limit the model type of the voice conversion network model. The application client may use audio data of the original audio data before the audio start position as first standard audio data, and use audio data of the original audio data after the audio end position as second standard audio data. Wherein the first standard audio data and the second standard audio data may be collectively referred to as standard audio data. In this way, the application client may splice the head of the candidate audio data to the end of the first standard audio data (i.e., audio start position), and splice the end of the candidate audio data to the head of the second standard audio data (i.e., audio end position).
It is understood that the original audio data in the embodiment of the present application may have the same audio information, where the audio information may include, but is not limited to, mood information, emotion information, and prosody information. Optionally, the original audio data in the embodiment of the present application may also have different audio information, where the different audio information is generated when the original text information is converted into the original audio data through the speech conversion network model, that is, the speech conversion network model may convert the original text information into the original audio data, and predict different audio information corresponding to different text information in the original text information. At this time, since the audio information predicted by the voice conversion network model may not be accurate, the embodiment of the present application may be used to modify different audio information in the original audio data.
Optionally, if the position information of the target text information in the original text information is a header position (that is, the target text information is located at a header of the original text information), the application client may determine an audio end position for the target text information in the original audio data, and use the audio data of the original audio data after the audio end position as standard audio data corresponding to the remaining text information. Further, the application client may extract candidate audio data from the fusion audio data, splice the candidate audio data to an audio end position in the standard audio data, that is, before the candidate audio data is spliced to the standard audio data, obtain spliced audio data for the original text information.
Optionally, similarly, if the position information of the target text information in the original text information is the tail position (that is, the target text information is located at the tail of the original text information), the application client may determine an audio start position for the target text information in the original audio data, and use the audio data of the original audio data before the audio start position as the standard audio data corresponding to the remaining text information. Further, the application client may extract candidate audio data from the fusion audio data, splice the candidate audio data to the audio start position in the standard audio data, that is, splice the candidate audio data to the standard audio data, and obtain spliced audio data for the original text information.
It should be understood that the application client may retrieve the original duration of the original audio data, as well as the unit duration of each unit of text in the original text data. Further, the application client may determine an audio start position and an audio end position for the target text information in the original audio data according to the original time length, the unit time length and the position information of the target text information in the original text information.
The candidate audio data may be the voiced segment in the fused audio data (that is, the candidate audio data may be the audio data with sound in the fused audio data). The corresponding position of the candidate audio data in the original audio data may be calculated according to the original duration of the original audio data predicted by the speech conversion network model and the unit duration of each unit text, which gives the start time (that is, the time of the audio start position in the original audio data) and the end time (that is, the time of the audio end position in the original audio data) of the corresponding position. Finally, the audio data within this time segment is replaced by the candidate audio data, so that the sound conversion process is completed and the final audio data after sound conversion (that is, the spliced audio data) is output.
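As an illustration of the replacement step described above, the following sketch derives the audio start position and audio end position from per-character unit durations and splices the candidate audio data between the first and second standard audio data; the parameter names and values are assumptions made for the example.

```python
import numpy as np

def splice_converted_segment(original_audio: np.ndarray,
                             candidate_audio: np.ndarray,
                             unit_durations: list,
                             target_start_char: int,
                             target_end_char: int,
                             sample_rate: int) -> np.ndarray:
    """Per-character durations (in seconds) give the start and end time of the
    selected text inside the original audio; the audio before/after those
    positions is the standard audio data, and the voiced candidate segment is
    spliced in between."""
    start_time = sum(unit_durations[:target_start_char])
    end_time = sum(unit_durations[:target_end_char])
    start_sample = int(start_time * sample_rate)   # audio start position
    end_sample = int(end_time * sample_rate)       # audio end position
    head = original_audio[:start_sample]           # first standard audio data
    tail = original_audio[end_sample:]             # second standard audio data
    return np.concatenate([head, candidate_audio, tail])
```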
It should be understood that the application client may acquire the fused waveform data of the fused audio data, and determine the mute time period in the fused audio data according to the fused waveform data. Further, the application client may crop the fused audio data according to the mute time period, and the fused audio data from which the audio data corresponding to the mute time period has been cropped is used as the candidate audio data.
It can be understood that, in order to ensure that the transition of the front and back audio data of the fused audio data in the original audio data is natural, the application client may cut the front and back mute portions of the audio data after changing the sound (i.e., the fused audio data), and only the sound segment portion (i.e., the candidate audio data) remains. Optionally, the application client may also perform cropping on a mute portion in the fusion audio data to obtain candidate audio data.
The fused waveform data can be used for generating an audio waveform map of the fused audio data; in the audio waveform map, the amplitudes of the sound waves of the head and tail mute parts are small, and the amplitudes of the effective voice parts are large. Therefore, the mute parts in the fused audio data can be removed by using the fused waveform data, and the candidate audio data with sound can be obtained.
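A minimal sketch of this cropping step, assuming the mute portions can be detected by a simple amplitude threshold on the fused waveform data (the threshold value is hypothetical):

```python
import numpy as np

def trim_silence(fused_audio: np.ndarray, threshold: float = 0.01) -> np.ndarray:
    """Remove the leading and trailing mute portions of the fused audio using
    its waveform amplitude, keeping only the voiced segment as the candidate
    audio data."""
    voiced = np.flatnonzero(np.abs(fused_audio) > threshold)
    if voiced.size == 0:          # nothing above the threshold: return as-is
        return fused_audio
    return fused_audio[voiced[0]: voiced[-1] + 1]
```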
Optionally, if the target text information selected by the target object in the original text information is the original text information itself, that is, the text information selected based on the text selection operation is the original text information, the application client may generate the fusion audio data for the original text information, and then directly use the fusion audio data as the spliced audio data for the original text information.
Optionally, the application client may also directly input the remaining text information to the voice conversion network model, and perform voice conversion on the remaining text information through the voice conversion network model to obtain standard audio data corresponding to the remaining text information.
For ease of understanding, please refer to fig. 10, in which fig. 10 is a schematic flowchart of an audio synthesis scheme provided by an embodiment of the present application. The flowchart shown in fig. 10 mainly includes main processing steps of text import (i.e., step S11), selecting a text segment requiring change of voice (i.e., step S12), recording/uploading audio (i.e., step S13 and step S14), selecting a change tone color (i.e., step S15), changing voice processing and replacing an audio segment (i.e., step S16), synthesizing full-text audio and downloading audio (i.e., step S17).
As shown in fig. 10, the application client may perform step S11 to display the original text information in the application interface, and further perform step S12 to respond to the text selection operation for the original text information, and take the text information selected based on the text selection operation as the target text information (i.e., the text segment requiring change of voice), and further display an audio conversion interface for the target text information. Further, the application client may execute step S13 or step S14, and obtain target audio data corresponding to the target text information in response to an audio uploading operation for the audio conversion interface, where the target audio data may be the audio data recorded in step S13 or the audio data in the file uploaded in step S14.
As shown in fig. 10, the application client may perform step S15 to select target timbre information from the one or more candidate timbre information of the audio conversion interface, and further perform step S16 to fuse the target audio data and the target timbre information in response to a confirmation operation for the target audio data and the target timbre information to obtain fused audio data. The fused audio data can be used for replacing the audio data corresponding to the target text information in the original audio data.
As shown in fig. 10, the application client may execute step S17 to obtain standard audio data corresponding to the remaining text information, and splice the fused audio data and the standard audio data to obtain spliced audio data (i.e., full-text audio) for the original text information, so as to download the full-text audio.
For ease of understanding, please refer to fig. 11, which is a schematic flowchart of an audio voice-changing scheme provided in an embodiment of the present application. The flowchart shown in fig. 11 mainly describes the functions executed by the sound conversion module and the implementation of each function, from the time the target object uploads the target audio data until the sound conversion function is completed. The sound conversion module mainly comprises an audio denoising module (i.e., step S23), a content detection module (i.e., step S24), a sound variation module (i.e., step S26), and an audio splicing module (i.e., step S27).
As shown in fig. 11, the application client may perform step S21, which represents opening the application client in the terminal device. Further, the application client may execute step S22 to receive the auxiliary audio data uploaded by the target object through step S22, where the auxiliary audio data may be the audio data recorded by the target object or the audio data in the uploaded file. Further, the application client may execute step S23, and perform denoising processing on the auxiliary audio data through the denoising module in step S23, where the denoising module may be configured to remove noise inside the auxiliary audio data to obtain the denoised auxiliary audio data.
Further, the application client may execute step S24, and perform content detection on the denoised auxiliary audio data through the content detection module in step S24, where the content detection module may take the denoised auxiliary audio data and the selected text of the target object (i.e., the target text information) as input, and is configured to compare whether the internal text content of the denoised auxiliary audio data (i.e., the audio content in the denoised auxiliary audio data, i.e., the audio pronunciation sequence or the audio text information) is consistent with the target text information, so as to obtain a content detection result.
The application client may execute step S25, determine the content detection result through step S25, and if the content detection result indicates that the audio content in the auxiliary audio data after denoising is inconsistent with the target text information, the application client may execute step S29, return an error flag through step S29, prompt the target object that the uploaded audio content is inconsistent, and prompt the target object to upload the target audio data again. Further, the application client may perform step S30, determine whether the target object re-uploads the target audio data in step S30, and perform step S22 if the target object re-uploads the target audio data; if the target object does not upload the target audio data again, step S28 is executed to end the process.
Optionally, if the content detection result indicates that the audio content in the denoised auxiliary audio data is consistent with the target text information, the application client may execute step S26, and fuse the denoised auxiliary audio data (i.e., the target audio data) and the target timbre information through the sound variation module in step S26, that is, perform voice changing on the target audio data through the target timbre information. The sound variation module can take the target audio data and the target timbre information as input, and is responsible for converting the timbre information in the uploaded target audio data, which carries emotion, mood and prosody, into the target timbre information required by the target object while retaining the emotion, mood and prosody information in the target audio data, thereby completing the voice changing function and outputting the voice-changed audio (namely, the fused audio data). In the internal process of the sound variation module, features related to content (i.e., the audio text information), emotion, mood and prosody information need to be extracted from the target audio data, and the extraction of this information is interfered with by noise, which in turn affects the sound conversion (i.e., voice changing) effect; therefore, a denoising module needs to be added before the sound variation module. Wherein, the emotion-related information is characterized by fundamental frequency features.
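Because the emotion-related information is characterized by fundamental frequency features, the following sketch shows one classical (autocorrelation-based) way such an F0 feature could be estimated per frame; the embodiment does not prescribe this method, and the frame length and pitch range are assumptions.

```python
import numpy as np

def estimate_f0(frame: np.ndarray, sample_rate: int,
                f0_min: float = 60.0, f0_max: float = 400.0) -> float:
    """Autocorrelation-based fundamental-frequency estimate for one audio
    frame; returns 0.0 for unvoiced frames."""
    frame = frame - frame.mean()
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(corr) - 1)
    if lag_max <= lag_min:
        return 0.0
    lag = lag_min + int(np.argmax(corr[lag_min:lag_max]))
    return sample_rate / lag if corr[lag] > 0 else 0.0
```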
Further, the application client may execute step S27, and splice the fused audio data output by the sound variation module into the original audio data according to the position of the text content (i.e., the target text information) selected by the target object through the audio splicing module in step S27, that is, replace the corresponding portion of the original audio data with the fused audio data, finally forming the content audio (i.e., the spliced audio data) required by the target object. It is understood that, when the target object does not need to modify the spliced audio data, the application client executes step S28, and the process ends.
It can be seen that the embodiments of the present application provide a technique for combining machine dubbing with artificial dubbing. Firstly, the original text information is imported into a text dubbing tool (namely the application client); secondly, the original text information is synthesized into standard timbre speech audio (namely the original audio data) based on a speech synthesis algorithm; thirdly, the text in the original text information that needs to be read with rich emotion (namely the target text information) is selected, and the target text information is manually read with emotion and mood; next, the emotion, mood and prosody of the manually read segment are extracted through the sound variation technology and combined with the standard timbre model (namely the target tone color information) to generate synthesized-timbre speech audio with the emotion, mood and prosody of the manual reading (namely the fused audio data); finally, the generated synthesized-timbre speech audio with the manually read emotion, mood and prosody is spliced with the other, unselected standard timbre speech audio (namely the standard audio data) in the order of the original text to form an audio file combining the synthesized-timbre audio (the fused audio data) and the standard audio data.
Further, referring to fig. 12, fig. 12 is a schematic structural diagram of an audio data processing apparatus according to an embodiment of the present application, where the audio data processing apparatus 1 may include: the system comprises a text display module 11, an audio acquisition module 12 and an audio splicing module 13;
the text display module 11 is used for displaying original text information in the application interface;
the text display module 11 is specifically configured to display an application interface; the application interface comprises a text entry area;
the text display module 11 is specifically configured to display the input original text information in the text entry area in response to an input operation for the text entry area.
The text display module 11 is specifically configured to display an application interface; the application interface comprises a text uploading control;
the text display module 11 is specifically configured to respond to a trigger operation for the text upload control, and display a text selection interface for selecting a text file;
the text display module 11 is specifically configured to respond to a text file selection operation for the text selection interface, and take a text file selected based on the text file selection operation as a target text file;
the text display module 11 is specifically configured to respond to a text file confirmation operation for the text selection interface, take the text information in the target text file as original text information, and display the original text information in the application interface.
The audio acquiring module 12 is configured to acquire the target tone information and the target audio data corresponding to the target text information; the target text information refers to the text information selected from the original text information; the audio content in the target audio data matches the target text information;
the audio acquisition module 12 includes: a text selection unit 121, a voice conversion unit 122, a tone acquisition unit 123, an audio acquisition unit 124;
a text selecting unit 121 for, in response to a text selecting operation for original text information, taking text information selected based on the text selecting operation as target text information;
a voice conversion unit 122, configured to respond to a voice conversion operation for the target text information, and display an audio conversion interface;
the application interface comprises a first voice conversion control;
the voice conversion unit 122 is specifically configured to display an audio conversion interface in response to a trigger operation for the first voice conversion control.
The voice conversion unit 122 is specifically configured to display a text control list in response to a trigger operation for the target text information; the text control list comprises a second voice conversion control;
the voice conversion unit 122 is specifically configured to display an audio conversion interface in response to the triggering operation for the second voice conversion control.
A tone acquiring unit 123 configured to acquire target tone information in the audio conversion interface;
wherein the audio conversion interface comprises one or more candidate timbre information;
a tone acquiring unit 123 specifically configured to, in response to a tone selection operation for one or more candidate tone information, take the candidate tone information selected based on the tone selection operation as target tone information;
the tone acquiring unit 123 is further specifically configured to highlight the target tone information in the audio conversion interface.
The audio obtaining unit 124 is configured to respond to an audio uploading operation for the audio conversion interface, and obtain target audio data corresponding to the target text information.
The audio conversion interface comprises a recording starting control;
the audio obtaining unit 124 is specifically configured to respond to a trigger operation for the recording start control, and display a recording stop control in the audio conversion interface;
the audio obtaining unit 124 is specifically configured to, in response to the trigger operation for the recording stop control, obtain audio data that is entered by the target object within a time interval between the response of the trigger operation for the recording start control and the response of the trigger operation for the recording stop control, and use the audio data that is entered by the target object as target audio data corresponding to the target text information;
the audio obtaining unit 124 is further specifically configured to display an audio file identifier corresponding to the target audio data in the audio conversion interface.
The audio conversion interface comprises an audio uploading control;
the audio acquiring unit 124 is specifically configured to respond to a trigger operation for the audio upload control, and display an audio selection interface for selecting an audio file;
the audio obtaining unit 124 is specifically configured to respond to an audio file selection operation for the audio selection interface, and take an audio file selected based on the audio file selection operation as a target audio file;
the audio obtaining unit 124 is specifically configured to respond to an audio file confirmation operation for the audio selection interface, and use audio data in the target audio file as target audio data corresponding to the target text information;
the audio obtaining unit 124 is further specifically configured to display an audio file identifier corresponding to the target audio data in the audio conversion interface.
Wherein, the audio acquiring unit 124 includes: an audio uploading subunit 1241, a denoising processing subunit 1242, a content detection subunit 1243, and an audio determination subunit 1244;
an audio uploading subunit 1241, configured to respond to an audio uploading operation for the audio conversion interface, and acquire auxiliary audio data associated with the target text information;
a denoising subunit 1242, configured to perform denoising processing on the auxiliary audio data to obtain denoised auxiliary audio data;
the denoising processing subunit 1242 is specifically configured to input an auxiliary frequency signal of the auxiliary audio data to a denoising network model, and perform denoising processing on the auxiliary frequency signal through the denoising network model to obtain a target frequency signal;
the denoising processing subunit 1242 is specifically configured to obtain the audio attribute of the auxiliary audio data, and restore the target frequency signal according to the audio attribute to obtain the denoised auxiliary audio data.
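A minimal sketch of this denoising flow is given below in Python; the spectral-subtraction step is only a placeholder standing in for the denoising network model, and the function and parameter names are illustrative rather than taken from the disclosure:

```python
import numpy as np

def denoise_audio(samples: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """Sketch of the denoising step: clean the signal in the frequency domain,
    then rebuild a waveform with the original length (standing in for the
    'restore according to the audio attribute' step)."""
    orig_len = len(samples)
    if orig_len < n_fft:                        # pad very short clips
        samples = np.pad(samples, (0, n_fft - orig_len))
    hop = n_fft // 2
    window = np.hanning(n_fft)
    starts = range(0, len(samples) - n_fft + 1, hop)
    spectra = [np.fft.rfft(samples[i:i + n_fft] * window) for i in starts]
    # Placeholder "denoising network model": subtract an estimated noise floor.
    noise_floor = np.median(np.abs(spectra), axis=0)
    cleaned = [s * np.clip((np.abs(s) - noise_floor) / (np.abs(s) + 1e-8), 0, 1)
               for s in spectra]
    # Overlap-add the cleaned (target) frequency signal back to the time domain.
    out = np.zeros(len(samples))
    for i, spec in zip(starts, cleaned):
        out[i:i + n_fft] += np.fft.irfft(spec, n_fft)
    return out[:orig_len]
```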
A content detection subunit 1243, configured to perform content detection on the denoised auxiliary audio data, so as to obtain a content matching degree between audio content in the denoised auxiliary audio data and target text information;
the content detection subunit 1243 is specifically configured to determine an audio pronunciation sequence of the denoised auxiliary audio data and a text pronunciation sequence of the target text information;
a content detection subunit 1243, specifically configured to divide the audio pronunciation sequence into at least two audio pronunciation subsequences, determine a subsequence matching degree of each audio pronunciation subsequence with respect to the text pronunciation sequence, and use the audio pronunciation subsequence with the subsequence matching degree greater than a subsequence threshold as a matching pronunciation subsequence;
the content detecting sub-unit 1243 is specifically configured to determine a proportion of the matching pronunciation sub-sequence in the at least two audio pronunciation sub-sequences, and use the proportion of the matching pronunciation sub-sequence in the at least two audio pronunciation sub-sequences as a content matching degree between the audio content in the denoised auxiliary audio data and the target text information.
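As a rough illustration of this subsequence-based matching, the sketch below assumes the pronunciation sequences are already lists of pronunciation units (e.g. pinyin syllables); the names and thresholds are illustrative only:

```python
from difflib import SequenceMatcher

def content_match_degree(audio_prons: list, text_prons: list,
                         num_subseqs: int = 4,
                         subseq_threshold: float = 0.6) -> float:
    """Split the audio pronunciation sequence into subsequences, score each
    against the text pronunciation sequence, and return the proportion of
    subsequences whose score exceeds the threshold."""
    if not audio_prons:
        return 0.0
    size = max(1, len(audio_prons) // num_subseqs)
    subseqs = [audio_prons[i:i + size] for i in range(0, len(audio_prons), size)]
    text_seq = " ".join(text_prons)

    def subseq_score(subseq):
        # Similarity of one audio pronunciation subsequence to the text sequence.
        return SequenceMatcher(None, " ".join(subseq), text_seq).ratio()

    matched = sum(1 for s in subseqs if subseq_score(s) > subseq_threshold)
    return matched / len(subseqs)
```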
The content detection subunit 1243 is specifically configured to perform speech recognition on the denoised auxiliary audio data to obtain audio text information of the denoised auxiliary audio data;
the content detection subunit 1243 is specifically configured to determine a text similarity between the audio text information and the target text information, and use the text similarity as a content matching degree between the audio content in the auxiliary audio data after denoising and the target text information.
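The alternative, recognition-based detection can be sketched in the same spirit; the `recognize` callable below is an assumed hook for any speech recognition engine and is not an API defined by the disclosure:

```python
from difflib import SequenceMatcher

def content_match_by_asr(denoised_audio, target_text: str, recognize) -> float:
    """Sketch: transcribe the denoised auxiliary audio with an external ASR
    hook, then use text similarity as the content matching degree."""
    audio_text = recognize(denoised_audio)   # assumed ASR callable
    return SequenceMatcher(None, audio_text, target_text).ratio()

# Hypothetical usage: accept the audio only if it matches well enough.
# if content_match_by_asr(audio, text, my_asr) > 0.8: ...
```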
The audio determining subunit 1244 is configured to determine, if the content matching degree is greater than the matching degree threshold, the auxiliary audio data after denoising as target audio data corresponding to the target text information.
For specific implementation manners of the audio uploading subunit 1241, the denoising processing subunit 1242, the content detecting subunit 1243 and the audio determining subunit 1244, reference may be made to the description of step S305 in the embodiment corresponding to fig. 9 described above, and details will not be described here.
The audio obtaining unit 124 is further specifically configured to display an error prompt message in the audio conversion interface if the content matching degree is less than or equal to the matching degree threshold; or,
the audio obtaining unit 124 is further specifically configured to display an error prompt interface if the content matching degree is less than or equal to the matching degree threshold, and display error prompt information in the error prompt interface.
For specific implementation manners of the text selecting unit 121, the voice converting unit 122, the tone acquiring unit 123 and the audio acquiring unit 124, reference may be made to the description of step S102 in the embodiment corresponding to fig. 3, and details will not be described here.
The audio splicing module 13 is configured to acquire spliced audio data for the original text information; the spliced audio data is obtained by splicing the fused audio data and the standard audio data corresponding to the residual text information; the residual text information is the text information except the target text information in the original text information; the fusion audio data is obtained by fusing target audio data and target tone information; the audio content in the standard audio data matches the remaining text information.
Wherein, the audio splicing module 13 includes: a fusion unit 131, a splicing unit 132;
the fusion unit 131 is configured to fuse the target audio data and the target tone information in response to a confirmation operation for the target audio data and the target tone information to obtain fused audio data;
the fusion unit 131 is specifically configured to input target audio data into a fusion network model, and extract audio text information and audio features in the target audio data through the fusion network model; the audio features comprise at least one of emotional features, mood features, or prosodic features;
the fusion unit 131 is specifically configured to fuse the audio text information, the audio features, and the target tone information in the fusion network model to obtain fusion audio data.
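A hypothetical interface for such a fusion network model is sketched below; the encoder and decoder callables are assumptions standing in for trained components, not structures specified by the disclosure:

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class FusionNetworkSketch:
    """Keeps what was said (content) and how it was said (emotion, mood,
    prosody) from the target audio, and re-renders it in the target timbre."""
    content_encoder: Callable[[np.ndarray], np.ndarray]
    prosody_encoder: Callable[[np.ndarray], np.ndarray]
    decoder: Callable[[np.ndarray, np.ndarray, np.ndarray], np.ndarray]

    def fuse(self, target_audio: np.ndarray,
             timbre_embedding: np.ndarray) -> np.ndarray:
        content = self.content_encoder(target_audio)   # audio text information
        prosody = self.prosody_encoder(target_audio)    # emotion/mood/prosody
        # Combine the extracted features with the selected tone information.
        return self.decoder(content, prosody, timbre_embedding)
```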
The splicing unit 132 is configured to acquire standard audio data corresponding to the remaining text information, splice the fusion audio data and the standard audio data, and obtain spliced audio data for the original text information;
wherein, the splicing unit 132 includes: a voice conversion subunit 1321, a position determination subunit 1322, an audio splicing subunit 1323;
the voice conversion subunit 1321 is configured to input the original text information to the voice conversion network model, and perform voice conversion on the original text information through the voice conversion network model to obtain original audio data corresponding to the original text information;
the position determining subunit 1322 is configured to determine, according to the position information of the target text information in the original text information, an audio start position and an audio end position for the target text information in the original audio data, and use audio data of the original audio data before the audio start position and audio data after the audio end position as standard audio data corresponding to the remaining text information;
the position determining subunit 1322 is specifically configured to obtain an original duration of the original audio data and a unit duration of each unit text in the original text information;
the position determining subunit 1322 is specifically configured to determine, according to the original time length, the unit time length, and the position information of the target text information in the original text information, an audio start position and an audio end position for the target text information in the original audio data.
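Under the simplifying assumption of a uniform per-unit duration, this position calculation reduces to a small helper such as the following (names are illustrative):

```python
def audio_span_for_target(original_duration: float, num_units: int,
                          start_index: int, end_index: int) -> tuple:
    """Map the character positions of the target text to an audio start
    position and audio end position (in seconds) inside the original audio."""
    unit_duration = original_duration / max(num_units, 1)
    audio_start = start_index * unit_duration
    audio_end = (end_index + 1) * unit_duration
    return audio_start, audio_end

# Example: 10 s of audio for 20 characters, target text at characters 5-9.
# audio_span_for_target(10.0, 20, 5, 9) -> (2.5, 5.0)
```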
And the audio splicing subunit 1323 is configured to extract candidate audio data from the fused audio data, splice the candidate audio data to the audio start position and the audio end position in the standard audio data, and obtain spliced audio data for the original text information.
The audio splicing subunit 1323 is specifically configured to acquire fusion waveform data of the fusion audio data, and determine a mute time period in the fusion audio data according to the fusion waveform data;
the audio splicing subunit 1323 is specifically configured to crop the fused audio data according to the mute time period, and use the fused audio data from which the audio data corresponding to the mute time period has been cropped as the candidate audio data.
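A minimal sketch of this cropping-and-splicing step is shown below; it uses a plain amplitude threshold as a stand-in for the waveform-based silence detection, and assumes the audio start and end positions have already been computed in seconds:

```python
import numpy as np

def crop_and_splice(original_tts: np.ndarray, fused: np.ndarray,
                    sample_rate: int, audio_start: float, audio_end: float,
                    silence_threshold: float = 0.01) -> np.ndarray:
    """Drop leading/trailing silence from the fused audio, then splice it
    between the standard audio before the start position and the standard
    audio after the end position."""
    voiced = np.where(np.abs(fused) > silence_threshold)[0]
    candidate = fused[voiced[0]:voiced[-1] + 1] if voiced.size else fused
    head = original_tts[:int(audio_start * sample_rate)]   # standard audio (before)
    tail = original_tts[int(audio_end * sample_rate):]     # standard audio (after)
    return np.concatenate([head, candidate, tail])
```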
For specific implementation of the voice conversion subunit 1321, the position determining subunit 1322, and the audio splicing subunit 1323, reference may be made to the description of step S307 in the embodiment corresponding to fig. 9, which will not be described herein again.
The audio splicing module 13 is further specifically configured to display an audio conversion identifier in a target area associated with the target text information when generating the spliced audio data; the audio conversion identifier is used for representing the target tone information corresponding to the target text information among the one or more candidate tone information.
For specific implementation of the fusion unit 131 and the splicing unit 132, reference may be made to the description of step S103 in the embodiment corresponding to fig. 3, which will not be described herein again.
The application interface comprises an audio downloading control and an audio sharing control;
the audio data processing device 1 is further specifically configured to respond to a trigger operation for the audio downloading control, and download the spliced audio data to the terminal disk;
the audio data processing apparatus 1 is further specifically configured to share the spliced audio data in response to a trigger operation for the audio sharing control.
For specific implementation manners of the text display module 11, the audio acquisition module 12, and the audio splicing module 13, reference may be made to the description of steps S101 to S103 in the embodiment corresponding to fig. 3, the description of steps S201 to S203 in the embodiment corresponding to fig. 8, and the description of steps S301 to S307 in the embodiment corresponding to fig. 9, which will not be repeated here. In addition, the beneficial effects of the same method are not described in detail.
Further, please refer to fig. 13, which is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 13, the computer device 1000 may include: a processor 1001, a network interface 1004, and a memory 1005; further, the computer device 1000 may also include: a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. In some embodiments, the user interface 1003 may include a display screen (Display) and a keyboard (Keyboard); optionally, the user interface 1003 may further include a standard wired interface and a wireless interface. Optionally, the network interface 1004 may include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory, or a non-volatile memory, such as at least one disk memory. Optionally, the memory 1005 may also be at least one storage device located remotely from the processor 1001. As shown in fig. 13, the memory 1005, as a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the computer device 1000 shown in fig. 13, the network interface 1004 may provide a network communication function; the user interface 1003 is mainly used for providing an input interface for a user; and the processor 1001 may be used to invoke the device control application stored in the memory 1005 to implement:
displaying original text information in an application interface;
acquiring target tone information and target audio data corresponding to the target text information; the target text information refers to the text information selected from the original text information; matching the audio content in the target audio data with the target text information;
acquiring spliced audio data aiming at original text information; the spliced audio data is obtained by splicing the fused audio data and the standard audio data corresponding to the residual text information; the residual text information is the text information except the target text information in the original text information; the fusion audio data is obtained by fusing target audio data and target tone information; the audio content in the standard audio data matches the remaining text information.
It should be understood that the computer device 1000 described in the embodiment of the present application may execute the audio data processing method described in the embodiments corresponding to fig. 3, fig. 8, and fig. 9, and may also perform the functions of the audio data processing apparatus 1 described in the embodiment corresponding to fig. 12, which will not be repeated here. In addition, the beneficial effects of using the same method are not described again.
Further, it should be noted that an embodiment of the present application also provides a computer-readable storage medium, in which the computer program executed by the aforementioned audio data processing apparatus 1 is stored, and the computer program includes program instructions. When a processor executes the program instructions, the audio data processing method described in the embodiments corresponding to fig. 3, fig. 8, and fig. 9 can be performed, which will not be repeated here. In addition, the beneficial effects of using the same method are not described again. For technical details not disclosed in the embodiment of the computer-readable storage medium of the present application, refer to the description of the method embodiments of the present application.
Further, it should be noted that an embodiment of the present application also provides a computer program product or a computer program, which may include computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the audio data processing method described in the embodiments corresponding to fig. 3, fig. 8, and fig. 9, which will not be repeated here. In addition, the beneficial effects of using the same method are not described again. For technical details not disclosed in the embodiment of the computer program product or the computer program of the present application, refer to the description of the method embodiments of the present application.
Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments may be implemented by instructing relevant hardware through a computer program. The program may be stored in a computer-readable storage medium, and when the program is executed, the processes of the above method embodiments may be included. The storage medium may be a magnetic disk, an optical disc, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above disclosure describes only preferred embodiments of the present application and certainly cannot be used to limit the scope of the claims of the present application; therefore, equivalent changes made according to the claims of the present application still fall within the scope covered by the present application.

Claims (24)

1. A method of audio data processing, comprising:
displaying original text information in an application interface;
acquiring target tone information and target audio data corresponding to the target text information; the target text information refers to the text information selected from the original text information; the audio content in the target audio data matches the target text information;
acquiring spliced audio data aiming at the original text information; the spliced audio data is obtained by splicing the fused audio data and the standard audio data corresponding to the residual text information; the residual text information is the text information in the original text information except the target text information; the fusion audio data is obtained by fusing the target audio data and the target tone information; and the audio content in the standard audio data matches the residual text information.
2. The method of claim 1, wherein displaying the original text information in the application interface comprises:
displaying an application interface; the application interface comprises a text entry area;
displaying the input original text information in the text entry area in response to an input operation directed to the text entry area.
3. The method of claim 1, wherein displaying the original text information in the application interface comprises:
displaying an application interface; the application interface comprises a text upload control;
responding to the triggering operation aiming at the text uploading control, and displaying a text selection interface for selecting a text file;
responding to a text file selection operation aiming at the text selection interface, and taking a text file selected based on the text file selection operation as a target text file;
and responding to a text file confirmation operation aiming at the text selection interface, taking the text information in the target text file as original text information, and displaying the original text information in the application interface.
4. The method according to claim 1, wherein the obtaining target audio data corresponding to the target timbre information and the target text information comprises:
responding to a text selection operation aiming at the original text information, and taking the text information selected based on the text selection operation as target text information;
responding to the voice conversion operation aiming at the target text information, and displaying an audio conversion interface;
acquiring target tone information in the audio conversion interface;
and responding to the audio uploading operation aiming at the audio conversion interface, and acquiring target audio data corresponding to the target text information.
5. The method of claim 4, wherein the application interface comprises a first voice conversion control;
the displaying an audio conversion interface in response to the voice conversion operation for the target text information comprises the following steps:
and responding to the triggering operation aiming at the first voice conversion control, and displaying an audio conversion interface.
6. The method of claim 4, wherein displaying an audio conversion interface in response to the voice conversion operation for the target text information comprises:
responding to the trigger operation aiming at the target text information, and displaying a text control list; the text control list includes a second voice conversion control;
and responding to the triggering operation aiming at the second voice conversion control, and displaying an audio conversion interface.
7. The method of claim 4, wherein the audio conversion interface comprises a recording launch control;
the acquiring target audio data corresponding to the target text information in response to the audio uploading operation for the audio conversion interface comprises the following steps:
responding to the triggering operation aiming at the recording starting control, and displaying a recording stopping control in the audio conversion interface;
responding to the triggering operation aiming at the recording stopping control, acquiring audio data which are input by a target object within a time interval of responding to the triggering operation aiming at the recording starting control and responding to the triggering operation aiming at the recording stopping control, and taking the audio data input by the target object as target audio data corresponding to the target text information;
the method further comprises the following steps:
and displaying the audio file identification corresponding to the target audio data in the audio conversion interface.
8. The method of claim 4, wherein the audio conversion interface comprises an audio upload control;
the acquiring target audio data corresponding to the target text information in response to the audio uploading operation for the audio conversion interface comprises the following steps:
responding to the triggering operation aiming at the audio uploading control, and displaying an audio selection interface for selecting an audio file;
responding to an audio file selection operation aiming at the audio selection interface, and taking an audio file selected based on the audio file selection operation as a target audio file;
responding to an audio file confirmation operation aiming at the audio selection interface, and taking the audio data in the target audio file as target audio data corresponding to the target text information;
the method further comprises the following steps:
and displaying the audio file identification corresponding to the target audio data in the audio conversion interface.
9. The method of claim 4, wherein the audio conversion interface comprises one or more candidate timbre information;
the obtaining of the target tone information in the audio conversion interface includes:
in response to a tone selection operation for the one or more candidate tone color information, taking the candidate tone color information selected based on the tone selection operation as target tone color information;
the method further comprises the following steps:
and highlighting the target tone color information in the audio conversion interface.
10. The method of claim 1, wherein the obtaining spliced audio data for the original textual information comprises:
responding to the confirmation operation aiming at the target audio data and the target tone information, and fusing the target audio data and the target tone information to obtain fused audio data;
acquiring standard audio data corresponding to the residual text information, and splicing the fused audio data and the standard audio data to obtain spliced audio data for the original text information;
the method further comprises the following steps:
displaying an audio conversion identifier in a target area associated with the target text information when generating the spliced audio data; the audio conversion identifier is used for representing target tone color information corresponding to the target text information in one or more candidate tone color information.
11. The method of claim 1, wherein the application interface comprises an audio download control and an audio share control;
the method further comprises the following steps:
responding to the trigger operation aiming at the audio downloading control, and downloading the spliced audio data to a terminal disk;
responding to the triggering operation aiming at the audio sharing control, and sharing the spliced audio data.
12. The method according to claim 4, wherein the obtaining of the target audio data corresponding to the target text information in response to the audio uploading operation on the audio conversion interface comprises:
responding to an audio uploading operation aiming at the audio conversion interface, and acquiring auxiliary audio data associated with the target text information;
denoising the auxiliary audio data to obtain denoised auxiliary audio data;
performing content detection on the denoised auxiliary audio data to obtain a content matching degree between the audio content in the denoised auxiliary audio data and the target text information;
and if the content matching degree is greater than a matching degree threshold value, determining the denoised auxiliary audio data as target audio data corresponding to the target text information.
13. The method as claimed in claim 12, wherein the denoising the auxiliary audio data to obtain denoised auxiliary audio data comprises:
inputting an auxiliary frequency signal of the auxiliary audio data into a denoising network model, and denoising the auxiliary frequency signal through the denoising network model to obtain a target frequency signal;
and acquiring the audio attribute of the auxiliary audio data, and restoring the target frequency signal through the audio attribute to obtain the auxiliary audio data after denoising.
14. The method as claimed in claim 12, wherein the performing content detection on the denoised auxiliary audio data to obtain a content matching degree between the audio content in the denoised auxiliary audio data and the target text information comprises:
determining an audio pronunciation sequence of the denoised auxiliary audio data and a text pronunciation sequence of the target text information;
dividing the audio pronunciation sequence into at least two audio pronunciation subsequences, determining the subsequence matching degree of each audio pronunciation subsequence relative to the text pronunciation sequence, and taking the audio pronunciation subsequences with the subsequence matching degree larger than a subsequence threshold value as matching pronunciation subsequences;
determining the proportion of the matching pronunciation subsequence in the at least two audio pronunciation subsequences, and taking the proportion of the matching pronunciation subsequence in the at least two audio pronunciation subsequences as the content matching degree between the audio content in the auxiliary audio data after denoising and the target text information.
15. The method as claimed in claim 12, wherein the performing content detection on the denoised auxiliary audio data to obtain a content matching degree between the audio content in the denoised auxiliary audio data and the target text information comprises:
performing voice recognition on the denoised auxiliary audio data to obtain audio text information of the denoised auxiliary audio data;
and determining the text similarity between the audio text information and the target text information, and taking the text similarity as the content matching degree between the audio content in the auxiliary audio data after denoising and the target text information.
16. The method of claim 12, further comprising:
if the content matching degree is smaller than or equal to the matching degree threshold value, displaying error prompt information in the audio conversion interface; or,
and if the content matching degree is smaller than or equal to the matching degree threshold value, displaying an error prompt interface, and displaying error prompt information in the error prompt interface.
17. The method according to claim 10, wherein said fusing the target audio data and the target timbre information to obtain fused audio data comprises:
inputting the target audio data into a fusion network model, and extracting audio text information and audio features in the target audio data through the fusion network model; the audio features comprise at least one of emotional features, mood features, or prosodic features;
and in the fusion network model, fusing the audio text information, the audio characteristics and the target tone information to obtain fusion audio data.
18. The method according to claim 10, wherein the obtaining standard audio data corresponding to the remaining text information, and splicing the fused audio data and the standard audio data to obtain spliced audio data for the original text information comprises:
inputting the original text information into a voice conversion network model, and performing voice conversion on the original text information through the voice conversion network model to obtain original audio data corresponding to the original text information;
according to the position information of the target text information in the original text information, determining an audio starting position and an audio ending position aiming at the target text information in the original audio data, and taking the audio data of the original audio data before the audio starting position and the audio data after the audio ending position as standard audio data corresponding to the residual text information;
and extracting candidate audio data from the fusion audio data, splicing the candidate audio data to the audio starting position and the audio ending position in the standard audio data, and obtaining spliced audio data aiming at the original text information.
19. The method of claim 18, wherein determining an audio start position and an audio end position for the target text information in the original audio data according to the position information of the target text information in the original text information comprises:
acquiring an original duration of the original audio data and a unit duration of each unit text in the original text information;
and determining an audio starting position and an audio ending position aiming at the target text information in the original audio data according to the original time length, the unit time length and the position information of the target text information in the original text information.
20. The method of claim 18, wherein the extracting candidate audio data from the fused audio data comprises:
acquiring fusion waveform data of the fusion audio data, and determining a mute time period in the fusion audio data according to the fusion waveform data;
and cropping the fused audio data according to the mute time period, and taking the fused audio data from which the audio data corresponding to the mute time period has been cropped as candidate audio data.
21. An audio data processing apparatus, comprising:
the text display module is used for displaying original text information in the application interface;
the audio acquisition module is used for acquiring target tone information and target audio data corresponding to the target text information; the target text information refers to the text information selected in the original text information; the audio content in the target audio data is matched with the target text information;
the audio splicing module is used for acquiring spliced audio data aiming at the original text information; the spliced audio data is obtained by splicing the fused audio data and the standard audio data corresponding to the residual text information; the residual text information is the text information in the original text information except the target text information; the fusion audio data is obtained by fusing the target audio data and the target tone information; the audio content in the standard audio data matches the remaining text information.
22. A computer device, comprising: a processor and a memory;
the processor is connected to the memory, wherein the memory is used for storing a computer program, and the processor is used for calling the computer program to enable the computer device to execute the method of any one of claims 1-20.
23. A computer-readable storage medium, in which a computer program is stored which is adapted to be loaded and executed by a processor, so that a computer device having said processor performs the method of any of claims 1-20.
24. A computer program product comprising computer instructions stored in a computer readable storage medium and adapted to be read and executed by a processor to cause a computer device having the processor to perform the method of any of claims 1-20.
CN202210334833.6A 2022-03-31 2022-03-31 Audio data processing method and device, computer equipment and medium Pending CN114783408A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210334833.6A CN114783408A (en) 2022-03-31 2022-03-31 Audio data processing method and device, computer equipment and medium

Publications (1)

Publication Number Publication Date
CN114783408A true CN114783408A (en) 2022-07-22

Family

ID=82427325

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210334833.6A Pending CN114783408A (en) 2022-03-31 2022-03-31 Audio data processing method and device, computer equipment and medium

Country Status (1)

Country Link
CN (1) CN114783408A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200234689A1 (en) * 2017-11-06 2020-07-23 Tencent Technology (Shenzhen) Company Limited Audio file processing method, electronic device, and storage medium
CN109493869A (en) * 2018-12-25 2019-03-19 苏州思必驰信息科技有限公司 The acquisition method and system of audio data
CN110933330A (en) * 2019-12-09 2020-03-27 广州酷狗计算机科技有限公司 Video dubbing method and device, computer equipment and computer-readable storage medium
CN114121028A (en) * 2021-09-27 2022-03-01 腾讯科技(深圳)有限公司 Voice playing method, device, equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116364064A (en) * 2023-05-19 2023-06-30 北京大学 Audio splicing method, electronic equipment and storage medium
CN116364064B (en) * 2023-05-19 2023-07-28 北京大学 Audio splicing method, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110148427B (en) Audio processing method, device, system, storage medium, terminal and server
CN106898340B (en) Song synthesis method and terminal
CN112687259B (en) Speech synthesis method, device and readable storage medium
JP7152791B2 (en) Crosslingual speech conversion system and method
CN110970014B (en) Voice conversion, file generation, broadcasting and voice processing method, equipment and medium
US6463412B1 (en) High performance voice transformation apparatus and method
US20030028380A1 (en) Speech system
WO2001057851A1 (en) Speech system
CN111402842A (en) Method, apparatus, device and medium for generating audio
WO2022184055A1 (en) Speech playing method and apparatus for article, and device, storage medium and program product
CN110675886A (en) Audio signal processing method, audio signal processing device, electronic equipment and storage medium
CN111161695B (en) Song generation method and device
CN113314097B (en) Speech synthesis method, speech synthesis model processing device and electronic equipment
CN113205793B (en) Audio generation method and device, storage medium and electronic equipment
KR20200045852A (en) Speech and image service platform and method for providing advertisement service
CN114783408A (en) Audio data processing method and device, computer equipment and medium
KR102184053B1 (en) Method for generating webtoon video for delivering lines converted into different voice for each character
CN113948062B (en) Data conversion method and computer storage medium
WO2023276539A1 (en) Voice conversion device, voice conversion method, program, and recording medium
CN113539215B (en) Music style conversion method, device, equipment and storage medium
CN117597728A (en) Personalized and dynamic text-to-speech sound cloning using a text-to-speech model that is not fully trained
CN113299271B (en) Speech synthesis method, speech interaction method, device and equipment
JP7117228B2 (en) karaoke system, karaoke machine
CN114120943A (en) Method, device, equipment, medium and program product for processing virtual concert
CN114724540A (en) Model processing method and device, emotion voice synthesis method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination