CN111370024A - Audio adjusting method, device and computer-readable storage medium

Info

Publication number: CN111370024A
Application number: CN202010107251.5A
Authority: CN (China)
Prior art keywords: audio, adjusted, pronunciation, user, audio data
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN111370024B
Inventor: 何涛 (He Tao)
Assignee (current and original): Tencent Technology (Shenzhen) Co., Ltd.
Events: application CN202010107251.5A filed by Tencent Technology (Shenzhen) Co., Ltd.; publication of CN111370024A; application granted; publication of CN111370024B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003: Changing voice quality, e.g. pitch or formants
    • G10L21/007: Changing voice quality, e.g. pitch or formants, characterised by the process used

Landscapes

  • Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention provides an audio adjusting method, an audio adjusting device, and a computer-readable storage medium. The method comprises the following steps: receiving audio data to be adjusted sent by a terminal; acquiring the acoustic audio data corresponding to the audio data to be adjusted from an acoustic audio database; performing pronunciation matching detection on the audio data to be adjusted against the acoustic audio data to obtain a pronunciation difference result, the pronunciation difference result characterizing the degree of difference in pronunciation between the audio data to be adjusted and the acoustic audio data; and correcting the pronunciation of the audio data to be adjusted with the pronunciation of the acoustic audio data based on the pronunciation difference result, to obtain the adjusted audio. The invention can improve the effect of audio adjustment.

Description

Audio adjusting method, device and computer readable storage medium
Technical Field
The present invention relates to voice processing technologies, and in particular, to an audio adjusting method, device, and computer readable storage medium.
Background
Most terminals have an audio recording function, with which users can record themselves reading passages aloud or singing songs, enriching their lives. In practical applications, the terminal can send the audio recorded by the user to a server so that the server corrects or tunes the audio, making it more pleasant to listen to.
At present, user audio is adjusted mainly by enhancing its sound effects during recording, for example by adding reverberation effects such as "recording studio" or "concert hall" to the user's music.
Disclosure of Invention
Embodiments of the present invention provide an audio adjusting method, an audio adjusting device, and a computer-readable storage medium, which can improve an audio adjusting effect.
The technical scheme of the embodiment of the invention is realized as follows:
the embodiment of the invention provides an audio adjusting method, which comprises the following steps:
receiving audio data to be adjusted sent by a terminal; the audio data to be adjusted is audio data recorded by a user;
acquiring the acoustic audio data corresponding to the audio data to be adjusted from an acoustic audio database;
carrying out pronunciation matching detection on the audio data to be adjusted and the acoustic audio data to obtain a pronunciation difference result; the pronunciation difference result represents the difference degree of the audio data to be adjusted and the acoustic audio data in pronunciation;
and correcting the pronunciation of the audio data to be adjusted by utilizing the pronunciation of the acoustic audio data based on the pronunciation difference result to obtain an adjusted audio.
The embodiment of the invention provides an audio adjusting method, which comprises the following steps:
acquiring audio data to be adjusted of a user on an audio recording function interface;
sending the audio data to be adjusted to a server so that the server performs audio adjustment on the audio data to be adjusted, wherein the adjusted audio is generated by the server based on the audio data to be adjusted;
and receiving and playing the adjusted audio sent by the server, and completing audio adjustment aiming at the audio data to be adjusted.
An embodiment of the present invention provides a server, including:
a first memory to store executable audio adjustment instructions;
the first processor is configured to implement the audio adjusting method provided by the server side in the embodiment of the present invention when executing the executable audio adjusting instructions stored in the first memory.
An embodiment of the present invention provides a terminal, including:
a second memory to store executable audio adjustment instructions;
and the second processor is configured to implement the audio adjusting method provided by the terminal side in the embodiment of the present invention when executing the executable audio adjusting instructions stored in the second memory.
An embodiment of the present invention provides a computer-readable storage medium, which stores executable audio adjusting instructions, and is configured to cause a first processor to execute the audio adjusting method provided by the server side according to the embodiment of the present invention, or cause a second processor to execute the audio adjusting method provided by the terminal side according to the embodiment of the present invention.
The embodiment of the invention has the following beneficial effects:
in the embodiment of the present invention, the server can receive the audio data to be adjusted sent by the terminal, obtain the acoustic audio data corresponding to the audio data to be adjusted from the acoustic audio database, perform pronunciation matching detection on the audio data to be adjusted against the acoustic audio data to obtain a pronunciation difference result, and, based on the pronunciation difference result, correct the pronunciation of the audio data to be adjusted with the pronunciation of the acoustic audio data to obtain the adjusted audio. In this way, the pronunciation of the audio data to be adjusted can be corrected, the types of adjustment that can be applied to audio are increased, and the adjustment effect for the user's audio is ultimately improved.
Drawings
FIG. 1 is a schematic view of an audio adjustment interface in the related art;
fig. 2 is a schematic diagram of an alternative architecture of the audio adaptation system 100 according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a server 200 according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a terminal 400 according to an embodiment of the present invention;
FIG. 5 is a first diagram illustrating an alternative timing sequence of the audio adjustment method according to the embodiment of the invention;
FIG. 6 is a first schematic diagram of adjusting audio according to an embodiment of the present invention;
FIG. 7 is a first flowchart illustrating an alternative audio adjustment method according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of an alternative flow chart of an audio adjustment method according to an embodiment of the present invention;
FIG. 9 is a timing diagram illustrating an alternative audio adjustment method according to an embodiment of the present invention;
FIG. 10 is a second schematic diagram of adjusting audio according to an embodiment of the present invention;
FIG. 11 is a flow chart of audio beautification performed on the terminal side according to an embodiment of the present invention;
FIG. 12 is a schematic diagram of an interaction flow for beautifying audio provided by an embodiment of the present invention;
fig. 13 is a schematic diagram of a process of audio synthesis provided by an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention will be further described in detail with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the description that follows, the terms "first", "second", and the like are intended only to distinguish similar objects and do not indicate a particular ordering. It should be understood that "first", "second", and the like may be interchanged, where permissible, in specific order or sequence, so that the embodiments of the invention described herein can be practiced in an order other than that illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
Before the embodiments of the present invention are described in further detail, the terms and expressions mentioned in the embodiments are explained below.
1) Audio adjustment refers to adjusting a user's audio in terms of sound effect, tone quality, equalization, and the like, so that the user's audio sounds better. For example, after a user records a song, a reverberation effect may be added to the song and the user's timbre may be optimized.
2) Acoustic (original) audio data represents the original version of the audio corresponding to the audio recorded by the user; for example, when the user records a song he or she is currently singing, the acoustic audio data is the original version of that song.
3) A pronunciation syllable is the unit of speech in audio that is most easily distinguished by ear. For example, a sentence of Chinese speech contains several characters, each with its own pronunciation; when a person listens to Chinese speech, he actually distinguishes the pronunciation of each character, and the pronunciation of one character corresponds to one syllable.
4) A phoneme is the smallest unit of speech, and each pronunciation syllable is made up of one or more phonemes. For example, the syllable "pu" in the Chinese word "putonghua" can be decomposed into the two phonemes "p" and "u".
5) Audio features refer to the musical characteristics of audio, such as pitch, tempo, the duration and intensity of each note, and the melody of the music.
6) Prosody refers to the tones, moods, pause patterns, and pronunciation lengths a person uses when speaking or singing; it belongs to the prosodic features. In other words, prosody reflects the habits of different people when speaking or singing.
At present, most terminals have an audio recording function, with which users can record themselves reading passages aloud or singing songs, enriching their lives. In practical applications, the terminal can send the audio recorded by the user to the server so that the server adjusts the audio, making it more pleasant to listen to.
In the related art, there are two main ways to adjust a user's audio. One is to enhance the sound effects of the user's audio during recording, i.e., to add a reverberation effect to the user's audio, such as the reverberation of scenes like a recording studio, a theater, or a concert hall, so that the user can adjust his or her own audio with different reverberation effects. The other is to adjust the timbre and equalization according to the user's selection after recording is finished, and to eliminate noise picked up during recording, so as to make the user's audio more beautiful and pleasant. For example, FIG. 1 is a schematic diagram of an audio adjustment interface in the related art. A progress bar of the audio is displayed in display area 1-1, showing the total duration of the audio, 03:25, and the time point of the currently playing audio, 00:06. In display area 1-2 there are two options, single-sentence edit 1-21 and add video 1-22. In display area 1-3 there are three columns, sound effect adjustment 1-31, timbre adjustment 1-32, and equalization adjustment 1-33, and the user can enter the corresponding function by clicking on these columns. Within sound effect adjustment 1-31 there is a reverberation adjustment 1-311 with 8 modes: recording studio, KTV, magnetic, song spirit, karaoke, leisurely, illusion, and old record; there is also a voice-change adjustment 1-312 with 3 modes: acoustic, electric, and harmony. In display area 1-4 there are release 1-41, re-record 1-42, and save 1-43 options. After users finish recording their audio, they can enter the audio beautification interface shown in FIG. 1, select the part of the audio to be adjusted in display area 1-1, choose in display area 1-2 whether to optimize a single sentence of the audio or add video to it, and select an adjustment mode in display area 1-3; the audio is then adjusted accordingly. Finally, the user chooses to release, save, or re-record the audio through the options in display area 1-4, completing the adjustment process for the recorded audio.
However, in the related art, only simple adjustments, such as enhancing sound effects, can be made while the audio is being recorded, and after recording the available adjustments are mostly concentrated on sound effects, equalization, timbre, and the like. The selectable adjustment types therefore remain limited, so the adjustment effect for the user's audio is poor.
Embodiments of the present invention provide an audio adjusting method, an audio adjusting device, and a computer-readable storage medium, which can improve the adjustment effect for user audio. An exemplary application of the audio adjusting device provided in the embodiments of the present invention is described below; the audio adjusting device may be implemented as various types of user terminals such as a smartphone, a tablet computer, or a notebook computer, and may also be implemented as a server. Next, exemplary applications in which the audio adjusting device is implemented as the terminal and as the server, respectively, will be explained.
Referring to fig. 2, fig. 2 is an alternative architecture diagram of the audio adjusting system 100 according to an embodiment of the present invention. To support an audio adjusting application, the terminal 400 is connected to the server 200 through the network 300, which may be a wide area network, a local area network, or a combination of the two; the server 200 is further configured with an acoustic audio database 500 that stores various types of acoustic audio data.
After the user enters the audio recording function interface 410 by operating the terminal 400, the terminal 400 obtains the audio data to be adjusted of the user on the audio recording function interface 410, and then the terminal 400 sends the audio data to be adjusted to the server 200 through the network 300. After receiving the audio data to be adjusted sent by the terminal 400, the server 200 obtains the acoustic audio data corresponding to the audio data to be adjusted from the acoustic audio database 500. Then, the server 200 performs pronunciation matching detection on the audio data to be adjusted and the acoustic audio data to compare the difference degree of the audio data to be adjusted and the acoustic audio data in pronunciation, so as to obtain a pronunciation difference result. Then, the server 200 corrects the pronunciation of the audio data to be adjusted by using the pronunciation of the acoustic audio data based on the pronunciation difference result to obtain an adjusted audio, and returns the adjusted audio to the terminal 400. The terminal 400 receives and plays the adjusted audio sent by the server 200, so that the audio adjustment process of the audio data to be adjusted can be completed through the cooperative operation of the server 200 and the terminal 400.
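To make this round trip concrete, the following is a minimal terminal-side sketch. The patent does not specify a transport protocol, so the HTTP endpoint, URL, and field names below are assumptions for illustration only.

```python
# Minimal sketch of the terminal-side round trip, assuming a hypothetical
# HTTP endpoint "/adjust" on the adjustment server.
import requests

SERVER_URL = "http://example.com/adjust"  # placeholder address

def request_adjustment(wav_path: str, acoustic_id: str) -> bytes:
    """Send the recorded audio plus the acoustic-audio identifier; return adjusted audio."""
    with open(wav_path, "rb") as f:
        resp = requests.post(
            SERVER_URL,
            files={"audio": f},
            data={"acoustic_id": acoustic_id},  # lets the server look up the original
            timeout=60,
        )
    resp.raise_for_status()
    return resp.content  # adjusted audio bytes, ready for playback
```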
Referring to fig. 3, fig. 3 is a schematic structural diagram of a server 200 according to an embodiment of the present invention, where the server 200 shown in fig. 3 includes: at least one first processor 210, a first memory 250, at least one first network interface 220, and a first user interface 230. The various components in server 200 are coupled together by a first bus system 240. It is understood that the first bus system 240 is used to enable communications for connections between these components. The first bus system 240 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as first bus system 240 in fig. 3.
The first processor 210 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, where the general-purpose processor may be a microprocessor or any conventional processor.
The first user interface 230 includes one or more first output devices 231, including one or more speakers and/or one or more visual display screens, that enable presentation of media content. The first user interface 230 also includes one or more first input devices 232, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The first memory 250 includes volatile memory or nonvolatile memory and may include both volatile and nonvolatile memory. The non-volatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The first memory 250 described in embodiments of the invention is intended to comprise any suitable type of memory. The first memory 250 optionally includes one or more storage devices physically located remotely from the first processor 210.
In some embodiments, the first memory 250 is capable of storing data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
A first operating system 251 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
a first network communication module 252 for communicating to other computing devices via one or more (wired or wireless) first network interfaces 220, an exemplary first network interface 220 comprising: bluetooth, wireless-compatibility authentication (Wi-Fi), and Universal Serial Bus (USB), etc.;
a first display module 253 to enable presentation of information (e.g., a user interface for operating peripherals and displaying content and information) via one or more first output devices 231 (e.g., a display screen, speakers, etc.) associated with the first user interface 230;
a first input processing module 254 for detecting one or more user inputs or interactions from one of the one or more first input devices 232 and translating the detected inputs or interactions.
In some embodiments, the audio adjusting apparatus provided by the embodiments of the present invention may be implemented in software, and fig. 3 illustrates the audio adjusting apparatus 255 stored in the first memory 250, which may be software in the form of programs and plug-ins, and includes the following software modules: a first receiving module 2551, an obtaining module 2552, a difference comparing module 2553, an adjusting module 2554 and a first sending module 2555, the functions of which will be explained below.
In other embodiments, the audio adjusting apparatus provided in the embodiments of the present invention may be implemented in hardware, and as an example, the audio adjusting apparatus provided in the embodiments of the present invention may be a processor in the form of a hardware decoding processor, which is programmed to execute the audio adjusting method provided in the embodiments of the present invention, for example, the processor in the form of the hardware decoding processor may employ one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
Illustratively, an embodiment of the present invention provides a server, including:
a first memory to store executable audio adjustment instructions;
the first processor is configured to implement the audio adjustment method provided by the server side in the embodiment of the present invention when executing the executable audio adjustment instruction stored in the first memory.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a terminal 400 according to an embodiment of the present invention, where the terminal 400 shown in fig. 4 includes: at least one second processor 410, a second memory 450, at least one second network interface 420, and a second user interface 430. The various components in the terminal 400 are coupled together by a second bus system 440. It is understood that the second bus system 440 is used to enable connection communication between these components. The second bus system 440 includes a power bus, a control bus, and a status signal bus in addition to the data bus. But for clarity of illustration the various buses are labeled as the second bus system 440 in figure 4.
The second processor 410 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, where the general-purpose processor may be a microprocessor or any conventional processor.
The second user interface 430 includes one or more second output devices 431, including one or more speakers and/or one or more visual displays, that enable the presentation of media content. The second user interface 430 also includes one or more second input devices 432, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The second memory 450 includes either volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The non-volatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The second memory 450 described in the embodiments of the present invention is intended to comprise any suitable type of memory. The second memory 450 optionally includes one or more storage devices physically located remote from the second processor 410.
In some embodiments, the second memory 450 is capable of storing data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
A second operating system 451 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
a second network communication module 452 for communicating to other computing devices via one or more (wired or wireless) second network interfaces 420, the example second network interfaces 420 including: bluetooth, wireless-compatibility authentication (Wi-Fi), and Universal Serial Bus (USB), etc.;
a second display module 453 for enabling presentation of information (e.g., user interfaces for operating peripherals and displaying content and information) via one or more second output devices 431 (e.g., display screens, speakers, etc.) associated with the second user interface 430;
a second input processing module 454 for detecting one or more user inputs or interactions from one of the one or more second input devices 432 and translating the detected inputs or interactions.
In some embodiments, the audio playing apparatus provided by the embodiments of the present invention may be implemented in software, and fig. 4 shows the audio playing apparatus 455 stored in the second memory 450, which may be software in the form of programs and plug-ins, and includes the following software modules: an obtaining module 4551, a second sending module 4552, and a second receiving module 4553, the functions of which will be described below.
In other embodiments, the audio playing apparatus provided in the embodiments of the present invention may be implemented in hardware, and as an example, the audio playing apparatus provided in the embodiments of the present invention may be a processor in the form of a hardware decoding processor, which is programmed to execute the audio adjusting method provided in the embodiments of the present invention, for example, the processor in the form of the hardware decoding processor may employ one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
Illustratively, an embodiment of the present invention provides a terminal, including:
a second memory to store executable audio adjustment instructions;
and the second processor is used for implementing the audio adjusting method provided by the terminal side in the embodiment of the invention when the executable audio adjusting instruction stored in the second memory is executed.
In the following, the audio adjusting method provided by the embodiment of the present invention will be described in conjunction with exemplary applications and implementations of the server and the terminal provided by the embodiment of the present invention.
Referring to fig. 5, fig. 5 is a first timing diagram of an alternative audio adjusting method according to an embodiment of the present invention, which will be described with reference to the steps shown in fig. 5.
S101, the terminal obtains audio data to be adjusted of a user.
The embodiment of the present invention applies to scenarios in which a user's audio data needs to be adjusted, for example, correcting the pronunciation of an English passage read aloud by the user, or adjusting the pronunciation of a Cantonese song sung by the user. In addition, the embodiment of the present invention can adjust the user's audio data while it is being recorded and transmitted, or transmit and process the audio data in one pass after recording is finished. When the user wakes up the terminal and enters the audio recording function interface, the audio adjusting function provided by the interface can be triggered by clicking or similar operations. The terminal then acquires the user's audio data that needs audio adjustment, i.e., acquires the audio data to be adjusted of the user.
In some embodiments of the present invention, the server may perform audio adjustment on the audio data to be adjusted after the user starts the audio adjustment function through operations such as clicking. At this time, the terminal may receive an audio adjustment instruction triggered by the user on the audio recording interface, and send the audio adjustment instruction and the audio data to be adjusted to the server together.
It should be noted that, in the embodiment of the present invention, the audio data to be adjusted may be selected by the user from audio data stored in the terminal, that is, specified from already recorded audio data, or may be audio data recorded by the user at the current time, that is, audio data recorded in real time.
It can be understood that, before sending the audio data to be adjusted, the terminal may extract the identification information of the acoustic audio data corresponding to the audio data to be adjusted, and send the identification information and the audio data to be adjusted to the server together, so that the server determines which acoustic audio data corresponds to the audio data to be adjusted. For example, when the audio data to be adjusted is a song recorded by the user in real time, the user necessarily specifies which song the user wants to record on the terminal, that is, the identification information of the song to be recorded is specified, and at this time, the terminal can extract the identification information and send the identification information and the audio data to be adjusted to the server together, so that the server determines which song the user wants to sing and adjust.
In some embodiments of the present invention, the audio adjustment instruction may carry identification information of the user, so that when the server obtains the audio adjustment instruction, it is clear which user's audio it is for which adjustment operation is performed, so that the server correspondingly stores the adjustment audio of different users.
It can be understood that the terminal may receive the audio adjustment instruction triggered by a touch or click operation of the user on the audio recording interface, and may also receive an audio adjustment instruction triggered by a voice command of the user; the embodiment of the present invention is not specifically limited herein.
S102, the terminal sends the audio data to be adjusted to the server so that the server can conduct audio adjustment on the audio data to be adjusted, wherein the adjustment audio is generated by the server based on the audio data to be adjusted.
After obtaining the audio data to be adjusted and the acoustic audio identifier, the terminal sends the audio data to be adjusted to the server through the network, and the server receives it. The audio data to be adjusted represents the audio data recorded by the user, and the adjusted audio generated by the server is generated based on this recorded audio data.
It can be understood that, when the terminal sends the audio data to be adjusted, the audio data to be adjusted may be sent at one time, that is, adjustment after recording is implemented, or the audio data to be adjusted may be sent immediately after a small segment of the data to be adjusted is recorded, and then the next small segment is recorded and sent, that is, adjustment while recording is implemented, and embodiments of the present invention are not limited specifically herein.
S103, the server acquires the acoustic audio data corresponding to the audio data to be adjusted from the acoustic audio database.
After the server obtains the audio adjustment instruction, it determines that it needs to perform audio adjustment on the audio data to be adjusted. For example, the server may compare the acoustic audio identifier with each audio identifier in the acoustic audio database and extract the audio data whose identifier matches the acoustic audio identifier; the extracted audio data is the acoustic audio data corresponding to the audio data to be adjusted.
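As a minimal sketch of this identifier lookup, the acoustic audio database can be modelled as a simple mapping; the record fields and identifiers below are hypothetical, since the patent leaves the storage backend unspecified.

```python
# Hypothetical in-memory stand-in for the acoustic audio database 500.
ACOUSTIC_DB = {
    "song-0001": {"title": "海阔天空", "path": "/data/acoustic/song-0001.wav"},
}

def fetch_acoustic_audio(acoustic_id: str) -> dict:
    """Return the acoustic (original) audio record matching the identifier."""
    record = ACOUSTIC_DB.get(acoustic_id)
    if record is None:
        raise KeyError(f"no acoustic audio stored for id {acoustic_id!r}")
    return record
```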
It should be understood that the acoustic audio data refers to original audio that has not been modified by others after its author released it. It should be noted that the author of the acoustic audio data may be the original performer or a cover performer; in other words, the user may record his or her own audio on the basis of the audio of either. For example, when the user covers the original version of the Cantonese song "海阔天空" ("Boundless Oceans, Vast Skies"), the corresponding acoustic audio data is the version sung by the original singer; when the user covers a cover version of that song, the acoustic audio data is the version sung by that cover singer.
S104, carrying out pronunciation matching detection on the audio data to be adjusted and the acoustic audio data by the server to obtain a pronunciation difference result; the pronunciation difference result represents the difference degree of the audio data to be adjusted and the acoustic audio data in pronunciation.
After the server obtains the acoustic audio data corresponding to the audio data to be adjusted, it compares the two in terms of pronunciation so as to find the parts of the audio data to be adjusted whose pronunciation is inaccurate, making it convenient to correct those parts later.
In some embodiments of the present invention, the server may decompose the audio data to be adjusted into a plurality of pronunciation syllables, decompose the acoustic audio data into a plurality of pronunciation syllables, and compare the syllables decomposed from the audio data to be adjusted with those decomposed from the acoustic audio data, thereby determining the parts of the audio data to be adjusted whose pronunciation is inaccurate.
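A minimal sketch of this syllable-level comparison, assuming both recordings have already been decomposed into ordered, one-to-one aligned syllable labels (for example by a forced aligner, which is outside the scope of the sketch):

```python
def find_mispronounced(user_syllables: list[str], standard_syllables: list[str]) -> list[int]:
    """Return the indices of user syllables whose label differs from the standard one.

    Assumes the two sequences are already aligned one-to-one (see S202 below).
    """
    return [
        i
        for i, (user, std) in enumerate(zip(user_syllables, standard_syllables))
        if user != std
    ]

# Example: only the third syllable differs and would be flagged for correction.
assert find_mispronounced(["wo", "ai", "ni"], ["wo", "ai", "nin"]) == [2]
```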
And S105, the server corrects the pronunciation of the audio data to be adjusted by using the pronunciation of the acoustic audio data based on the pronunciation difference result to obtain the adjusted audio.
Based on the obtained pronunciation difference result, the server can judge whether the difference in pronunciation between the acoustic audio data and the audio data to be adjusted is too large. When the pronunciation difference is large, the server corrects the pronunciation of the audio data to be adjusted with the pronunciation of the acoustic audio data, and the corrected audio serves as the adjusted audio. When the pronunciation difference is small, the audio data to be adjusted needs no pronunciation correction, and the server can directly use it as the adjusted audio.
It should be noted that, in the embodiment of the present invention, when the server corrects the pronunciation of the audio data to be adjusted with the pronunciation of the acoustic audio data, it extracts from the acoustic audio data the pronunciation syllables showing a large pronunciation difference, replaces the corresponding syllables to be corrected in the audio data to be adjusted with them, and then re-synthesizes the replaced syllables together with the remaining syllables into audio.
For example, when the data to be adjusted is an English song recorded by the user, the server may compare the syllables of the original version of the song with the syllables of the version recorded by the user. When it determines that the part with inaccurate pronunciation in the user's recording is "heart" in "my heart will go on and on", the server extracts the syllables of "heart" from the corresponding part of the original song, replaces the inaccurately pronounced syllables with the extracted ones, and synthesizes that part of the song in the user's timbre together with the other syllables, to obtain the final adjusted audio.
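A time-domain sketch of this replace-and-resynthesize step: the mispronounced span of the user's recording is overwritten by the corresponding span of the acoustic audio, with a short crossfade to hide the splice. The sample indices are assumed to come from the earlier alignment, both waveforms are assumed to be float arrays at the same sample rate, and the span is assumed to be at least two fade lengths long; transferring the user's timbre, as the patent describes, would require a vocoder or similar and is not shown.

```python
import numpy as np

def splice_correction(user: np.ndarray, acoustic: np.ndarray,
                      start: int, end: int, fade: int = 256) -> np.ndarray:
    """Replace user[start:end] with acoustic[start:end], crossfading at both edges."""
    out = user.copy()
    patch = acoustic[start:end].copy()
    ramp = np.linspace(0.0, 1.0, fade)
    # Fade the patch in against the user's audio at the left edge...
    patch[:fade] = ramp * patch[:fade] + (1 - ramp) * user[start:start + fade]
    # ...and fade it back out into the user's audio at the right edge.
    patch[-fade:] = ramp[::-1] * patch[-fade:] + (1 - ramp[::-1]) * user[end - fade:end]
    out[start:end] = patch
    return out
```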
And S106, the terminal receives and plays the adjusted audio sent by the server.
After the server synthesizes the adjusted audio, it returns the adjusted audio to the terminal for playback, completing the audio adjustment for the audio data to be adjusted. After receiving the adjusted audio sent by the server, the terminal invokes the playback control to play it so that the user can hear it; the audio adjustment process for the audio data to be adjusted is thus completed.
It can be understood that when the terminal sends the audio data to be adjusted to the server all at once, i.e., when adjustment is performed after recording, the server can likewise send the adjusted audio back to the terminal all at once, so that the terminal can play the adjusted audio continuously and the user can grasp its overall effect. When the terminal sends each small segment of the data to be adjusted immediately after recording it, i.e., when adjustment is performed while recording, the server can likewise quickly adjust and return each small segment, so that the terminal can play the adjusted part corresponding to each segment in real time, making it convenient for the user to adjust pronunciation, volume, and other aspects of the recording in real time according to the adjusted audio.
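The segment-by-segment mode can be sketched as a simple loop; the recording, sending, and playback helpers below are placeholders for whatever the terminal actually uses, since the patent does not name them.

```python
def adjust_while_recording(record_segments, send_to_server, play_adjusted):
    """Adjust-while-recording: each finished segment (e.g. one line of lyrics)
    is sent for correction as soon as it is complete and played back at once,
    e.g. through the singer's in-ear monitor."""
    for segment in record_segments:          # generator yielding finished segments
        adjusted = send_to_server(segment)   # the server corrects this segment only
        play_adjusted(adjusted)
```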
For example, as shown in fig. 6, when the audio data to be adjusted is the Cantonese song "海阔天空" recorded by the user in real time, the terminal takes one line of lyrics as a unit. After recording one line, for example the lyric "how many times have I faced cold eyes and jeers", the terminal sends the segment corresponding to that line to the server. After receiving the segment, the server compares the syllables of the user's singing with the syllables of the original singing, determines that the pronunciation of "cold eyes" and "jeers" is inaccurate, corrects those pronunciations with the original singer's pronunciation of "cold eyes" and "jeers", and returns the corrected result to the terminal. Upon receiving the corrected result, the terminal plays it to the user through the in-ear monitor and bolds "cold eyes" and "jeers" in lyrics display area 6-1 to indicate to the user that the two incorrectly pronounced words have been adjusted.
In the embodiment of the present invention, the server can receive the audio data to be adjusted sent by the terminal, obtain the acoustic audio data corresponding to the audio data to be adjusted from the acoustic audio database, perform pronunciation matching detection on the audio data to be adjusted against the acoustic audio data to obtain a pronunciation difference result, and, based on the pronunciation difference result, correct the pronunciation of the audio data to be adjusted with the pronunciation of the acoustic audio data to obtain the adjusted audio. In this way, the pronunciation of the audio data to be adjusted can be corrected, the types of adjustment that can be applied to audio are increased, and the adjustment effect for the user's audio is ultimately improved.
In some embodiments of the present invention, the server performs pronunciation matching detection on the audio data to be adjusted and the acoustic audio data to obtain a pronunciation difference result, that is, the specific implementation process of S104 may include: s1041 or S1042, as follows:
s1041, the server takes the audio data to be adjusted as a single segment to be adjusted and performs pronunciation matching detection directly against the acoustic audio data to obtain a pronunciation difference result.
The server can directly treat the audio data to be adjusted as one complete segment and perform pronunciation matching detection against the complete acoustic audio data; the result obtained is the pronunciation difference result. In this case the server performs pronunciation detection on the audio data to be adjusted as a whole, and can retain, to the greatest extent, both the vocal information and the non-vocal information in the audio data to be adjusted, which facilitates pronunciation matching detection.
S1042, after the audio data to be adjusted and the acoustic audio data are segmented, the server performs pronunciation matching detection to obtain a pronunciation difference result.
The server can also segment the audio data to be adjusted and the acoustic audio data, and perform pronunciation matching detection on the segment to be adjusted and the acoustic segment obtained after the segmentation operation, at the moment, the granularity of the pronunciation matching detection is smaller, so that the obtained pronunciation difference result is more accurate, and the audio adjustment effect is further improved.
It should be noted that S1041 and S1042 are two optional implementation processes in S104, and the implementation process of S104 may be selected according to an actual situation, and the embodiment of the present invention is not limited herein.
In the embodiment of the invention, the server can carry out pronunciation matching detection on the audio data to be adjusted as a whole so as to retain various sound information in the audio data to be adjusted to the maximum extent and facilitate the realization of pronunciation matching detection, and can also carry out pronunciation matching detection on the audio data to be adjusted in a segmented manner so as to ensure that the granularity of pronunciation matching detection is smaller and the accuracy of the obtained pronunciation difference result is higher.
In some embodiments of the present invention, after segmenting the audio data to be adjusted and the acoustic audio data, the server performs pronunciation matching detection to obtain a pronunciation difference result, that is, the specific implementation process of S1042 may include: s1042a-S1042c, as follows:
s1042a, the server divides the acoustic audio data into a plurality of acoustic segments.
After the server acquires the acoustic audio data, in order to facilitate pronunciation comparison between the acoustic audio data and the audio data to be adjusted, the server can also perform paragraph analysis on the acoustic audio data first, so as to determine time points at which the acoustic audio data can be segmented, and then divide the acoustic audio data into a plurality of acoustic segments according to the time points.
It should be noted that some blank portions without a human voice may exist in the acoustic audio data, such as pauses in reading aloud, breaths taken while singing, and prelude portions. Correspondingly, if the user records the audio data to be adjusted against this acoustic audio data, there is no user voice in these portions either, and it is meaningless for the server to compare pronunciation over them. Therefore, the server may perform paragraph analysis on the acoustic audio data according to the pause portions, breathing portions, prelude portions, and the like, divide the acoustic audio data into a plurality of acoustic segments, and perform pronunciation comparison only on the acoustic segments containing a human voice.
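A sketch of this paragraph analysis using energy-based silence detection: librosa.effects.split returns sample intervals of non-silent audio, which here stand in for the voiced acoustic segments. The patent does not prescribe a particular detection method, and the 40 dB threshold is an assumption.

```python
import librosa

def split_acoustic(path: str, top_db: float = 40.0):
    """Split the acoustic audio into voiced segments by dropping quiet spans
    (pauses, breaths, preludes). Returns (intervals, sr), intervals in samples."""
    y, sr = librosa.load(path, sr=None, mono=True)
    intervals = librosa.effects.split(y, top_db=top_db)
    return intervals, sr

# Segmentation times in seconds, later reused in S1042b to cut the user's recording:
# times = [(start / sr, end / sr) for start, end in intervals]
```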
S1042b, the server segments the audio data to be adjusted into a plurality of segments to be adjusted by using a plurality of segment times corresponding to the plurality of original sound segments.
After obtaining the plurality of acoustic segments, the server can first use the segmentation time corresponding to each acoustic segment to determine a rough time point at which the audio to be adjusted needs to be cut, then identify whether a voice exists near the rough time point and which character's pronunciation in the acoustic audio data that voice corresponds to, adjust the rough time point accordingly to obtain a segmentation time point, and finally cut the audio data to be adjusted into a plurality of segments to be adjusted at the segmentation time points. Correspondingly, the pronunciation comparison between the audio data to be adjusted and the acoustic audio data becomes a pronunciation comparison between the plurality of segments to be adjusted and the plurality of acoustic segments, which makes the pronunciation difference result more accurate.
It can be understood that when the user records the audio data to be adjusted, there may be differences from the acoustic audio data in speaking rhythm, song tempo, and the like, and directly cutting the audio data to be adjusted at the segmentation times of the acoustic segments could lose vocal information from the segments to be adjusted. The server therefore needs to identify the user's voice near each rough time point and adjust that point so that the adjusted segmentation time point corresponds to the segmentation time of the acoustic segment before cutting; in this way each segment to be adjusted corresponds to an acoustic segment while the user's vocal information is preserved.
Of course, for audio data to be adjusted that shows no difference from the acoustic audio data in speaking rhythm or song tempo, the server may directly segment it using only the segmentation times of the acoustic segments to obtain the segments to be adjusted.
S1042c, the server carries out pronunciation matching detection on each segment to be adjusted and each corresponding acoustic segment to obtain a pronunciation difference result.
After the server finishes segmenting the audio data to be adjusted, the user pronunciation syllables of each segment to be adjusted can be matched with the standard pronunciation syllables of the corresponding original sound segment, and the result obtained by matching is the pronunciation difference result.
In the embodiment of the present invention, the server can first perform paragraph analysis on the acoustic audio data, divide it into a plurality of acoustic segments, and then use the segmentation times corresponding to the acoustic segments to divide the audio data to be adjusted into a plurality of segments to be adjusted, so that pronunciation comparison can subsequently be performed segment by segment, improving the accuracy of the pronunciation difference result.
Referring to fig. 7, fig. 7 is a schematic diagram illustrating an optional flow of an audio adjusting method according to an embodiment of the present invention, in some embodiments of the present invention, a server performs pronunciation matching detection on each to-be-adjusted segment and each corresponding acoustic segment to obtain a pronunciation difference result, that is, a specific implementation process of S1042c may include: S201-S203, as follows:
s201, the server extracts at least one user pronunciation syllable from each segment to be adjusted and extracts at least one standard pronunciation syllable from each original sound segment.
When the server performs pronunciation comparison on the to-be-adjusted segment and the original sound segment, the server needs to perform syllable decomposition on the to-be-adjusted segment and the original sound segment respectively, so as to obtain one or more user pronunciation syllables corresponding to each to-be-adjusted segment and obtain one or more standard pronunciation syllables corresponding to each original sound segment.
It is understood that the user pronunciation syllable refers to the pronunciation of each word in the audio data to be adjusted by the user, and the standard pronunciation syllable refers to the pronunciation of each word by the original writer of the acoustic audio data. Because different people have different accent habits, different pronunciation modes and the like, different people have different pronunciations for the same character, and the pronunciations can be embodied by pronunciation syllables, so that the server can determine the pronunciation difference between the audio data to be adjusted and the original sound audio data by extracting the user pronunciation syllables and the standard pronunciation syllables.
S202, the server determines a corresponding standard syllable for each user pronunciation syllable of the at least one user pronunciation syllable from the at least one standard pronunciation syllable.
Since the audio data to be adjusted recorded by the user is close in duration to the acoustic audio data, and the order of the words in the two is also extremely similar, the server can determine a corresponding standard syllable for each user pronunciation syllable from the at least one standard pronunciation syllable by combining the time point of each word with its position in the sequence. In other words, in this step the server puts all standard pronunciation syllables into one-to-one correspondence with all user pronunciation syllables.
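A sketch of this one-to-one correspondence, assuming each syllable carries a start time from the decomposition step and that both recordings contain the same number of syllables; the greedy nearest-start-time pairing below is one simple realization, not the patent's prescribed algorithm.

```python
def pair_syllables(user_syls, std_syls):
    """Pair each user syllable with the unused standard syllable whose start
    time is closest. Syllables are (label, start_seconds) tuples; the two
    lists are assumed to be the same length."""
    pairs, used = [], set()
    for u_label, u_start in user_syls:
        best = min(
            (j for j in range(len(std_syls)) if j not in used),
            key=lambda j: abs(std_syls[j][1] - u_start),
        )
        used.add(best)
        pairs.append(((u_label, u_start), std_syls[best]))
    return pairs
```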
S203, the server matches each user pronunciation syllable with the corresponding standard syllable to obtain a pronunciation difference result.
After obtaining the standard syllable corresponding to each user pronunciation syllable, the server compares each user pronunciation syllable with its corresponding standard syllable to judge whether they differ. When a user pronunciation syllable differs from its standard syllable, the server considers that it needs to be corrected; when it does not differ, the server considers the user's pronunciation sufficiently standard and no correction is needed.
It should be noted that, since the server obtains one pronunciation difference result for each user pronunciation syllable, in the embodiment of the present invention the server obtains as many pronunciation difference results as there are user pronunciation syllables. For example, when the user pronounces 10 syllables, the server will get 10 pronunciation difference results.
It will be appreciated that the pronounced syllable may be broken down into phonemes and in some embodiments of the invention, the server may determine whether the user pronounced syllable differs from the standard syllable by a comparison between the phonemes.
In the embodiment of the invention, the server can extract the user pronunciation syllables from the segment to be adjusted, extract the standard pronunciation syllables from the original sound segment, correspond the user pronunciation syllables and the standard pronunciation syllables one by one, and finally compare the standard syllables corresponding to the user pronunciation syllables to obtain the final pronunciation difference result. Thus, the server determines whether the user's pronunciation needs to be modified based on the syllable.
In some embodiments of the present invention, the server compares each user pronunciation syllable with its corresponding standard syllable to obtain a pronunciation difference result, that is, the specific implementation process of S203 may include: s2031 to S2035, as follows:
s2031, the server carries out phoneme decomposition on each user pronunciation syllable to obtain a first phoneme corresponding to each user pronunciation syllable, and carries out phoneme decomposition on a standard syllable corresponding to each user pronunciation syllable to obtain a second phoneme corresponding to the standard syllable.
Since a syllable may be composed of one or more phones, the server decomposes each user-spoken syllable into one or more first phones, and similarly, the server also decomposes the standard syllable corresponding to each user-spoken syllable into one or more second phones.
It will be appreciated that since there is a correspondence between the user-pronounced syllable and the standard syllable, there is also a correspondence between the first phone decomposed from the user-pronounced syllable and the second phone decomposed from the standard syllable.
S2032, the server matches the pronunciation of the first phoneme with the pronunciation of the second phoneme to obtain a pronunciation matching result.
Since the difference between pronounced syllables is also determined by the pronunciation of their phonemes, the server can match the pronunciations of the first phonemes one by one against the pronunciations of the second phonemes. For each pair, the server thus obtains either a pronunciation matching result indicating that the pronunciation of the first phoneme matches that of the second phoneme, or one indicating that it does not.
In some embodiments of the present invention, the server may determine whether the pronunciation of the first phoneme matches the pronunciation of the second phoneme by observing whether the sound waveform of the first phoneme is the same as or similar to the sound waveform of the second phoneme.
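A minimal waveform-similarity sketch is given below, reading "the same as or similar to" as normalized correlation; the choice of metric and the 0.8 threshold are assumptions, not values from the patent:

```python
import numpy as np

def waveforms_match(wave_a, wave_b, threshold=0.8):
    """Judge whether two phoneme waveforms are similar enough to match.

    wave_a and wave_b are 1-D float arrays of audio samples. They are
    truncated to a common length and compared by normalized correlation;
    a production system might compare spectral features instead.
    """
    n = min(len(wave_a), len(wave_b))
    a, b = np.asarray(wave_a[:n]), np.asarray(wave_b[:n])
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return False
    return float(np.dot(a, b) / denom) >= threshold
```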
S2033, the server extracts the duration of the first phoneme to obtain a first duration, and extracts the duration of the second phoneme to obtain a second duration.
S2034, the server obtains a time comparison result according to the first duration and the second duration.
Since the difference between pronounced syllables may also be reflected in the duration of their phonemes, the server extracts the duration of each first phoneme to obtain the first duration, and extracts the duration of each second phoneme to obtain the second duration. The server then pairs each first duration with its second duration according to the correspondence between the first and second phonemes, and checks whether their difference exceeds a preset time threshold. When the difference is greater than or equal to the preset time threshold, the server obtains a time comparison result indicating that the first and second durations differ significantly; otherwise, when the difference is less than the preset time threshold, the server obtains a time comparison result indicating that the first and second durations are close.
Since different phonemes have different sound waveforms, in some embodiments of the present invention the server may extract the duration of each first phoneme, that is, the first duration, from the sound waveform of the first phoneme; similarly, the server extracts the second duration of each second phoneme from the sound waveform of the second phoneme.
It should be understood that, in the embodiment of the present invention, the preset time threshold may be set according to the actual situation, and the embodiment of the present invention is not limited herein. For example, the preset time threshold may be set to 2 ms, or to 5 ms.
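The duration check then reduces to a threshold comparison, sketched below; the 5 ms default simply mirrors one of the example thresholds above, and the function name is an assumption:

```python
def durations_close(first_duration_ms, second_duration_ms, threshold_ms=5.0):
    """Time comparison result: True when the two phoneme durations are close."""
    return abs(first_duration_ms - second_duration_ms) < threshold_ms
```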
S2035, the server determines the pronunciation difference result according to the time comparison result and the pronunciation matching result.
After the server obtains the time comparison result and the pronunciation matching result for the first and second phonemes, it can combine the two to obtain the final pronunciation difference result.
In the embodiment of the invention, the server can split the user pronunciation syllable into first phonemes and split the standard syllable into second phonemes, then match whether the pronunciations of the first and second phonemes are similar, and compare whether the duration of the first phoneme is close to the duration of the second phoneme. The pronunciation difference result between the user pronunciation syllable and the standard syllable is then determined from the time comparison result and the pronunciation matching result. In this way, the server completes the determination of the pronunciation difference result, so that the pronunciation of the audio data to be adjusted can later be corrected according to it.
In some embodiments of the present invention, the server determines the pronunciation difference result according to the time comparison result and the pronunciation matching result, that is, the specific implementation process of S2035 may include: S2035a-S2035d, as follows:
S2035a, when the time comparison result indicates that the first duration matches the second duration, and the phoneme matching result indicates that the pronunciation of the first phoneme matches the pronunciation of the second phoneme, the pronunciation difference result indicates that the user pronunciation syllable does not differ from its corresponding standard syllable.
When the first duration matches the second duration, that is, the difference between them is smaller than the preset time threshold, and the pronunciation of the first phoneme is the same as or similar to that of the second phoneme, the server obtains a pronunciation difference result indicating that the user pronunciation syllable does not differ from the standard syllable.
S2035b, when the time comparison result indicates that the first duration does not match the second duration, and the phoneme matching result indicates that the pronunciation of the first phoneme matches the pronunciation of the second phoneme, the pronunciation difference result indicates that the user pronunciation syllable differs from its corresponding standard syllable.
When the first duration does not match the second duration, that is, their difference exceeds the preset time threshold, the server obtains a pronunciation difference result indicating that the user pronunciation syllable differs from the standard syllable, even if the pronunciation of the first phoneme is the same as or similar to that of the second phoneme.
S2035c, when the time comparison result indicates that the first duration matches the second duration, and the phoneme matching result indicates that the pronunciation of the first phoneme does not match the pronunciation of the second phoneme, the pronunciation difference result indicates that the user pronunciation syllable differs from its corresponding standard syllable.
When the first duration and the second duration are matched, but the difference between the pronunciation of the first phoneme and the pronunciation of the second phoneme is too large, the server also obtains the pronunciation difference result that the user pronunciation syllable is different from the standard syllable.
S2035d, when the time comparison result indicates that the first duration does not match the second duration, and the phoneme matching result indicates that the pronunciation of the first phoneme does not match the pronunciation of the second phoneme, the pronunciation difference result indicates that the user pronunciation syllable differs from its corresponding standard syllable.
When the first duration does not match the second duration and the pronunciation of the first phoneme differs too much from that of the second phoneme, the two phonemes are not similar at all, and the server necessarily obtains a pronunciation difference result indicating that the user pronunciation syllable differs from the standard syllable.
In the embodiment of the invention, the server can combine the various cases of the time comparison result with the various cases of the pronunciation matching result to obtain the pronunciation difference result between the user pronunciation syllable and the standard syllable, so that the pronunciation of the audio data to be adjusted can subsequently be corrected based on this result.
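Cases S2035a to S2035d reduce to a single conjunction, as the sketch below shows; the return convention (True meaning "differs") is an assumption:

```python
def pronunciation_differs(durations_match, pronunciations_match):
    """Combine the time comparison and phoneme matching results (S2035a-S2035d).

    A user pronunciation syllable is considered unchanged only when both
    the durations and the phoneme pronunciations match; every other
    combination yields a difference.
    """
    return not (durations_match and pronunciations_match)
```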
Referring to fig. 8, fig. 8 is a second optional flowchart of the audio adjusting method according to an embodiment of the present invention. In some embodiments of the present invention, the server corrects the pronunciation of the audio data to be adjusted by using the pronunciation of the acoustic audio data based on the pronunciation difference result to obtain the adjusted audio, that is, the specific implementation process of S105 may include: S1051-S1054, as follows:
S1051, the server selects, from the at least one user pronunciation syllable of the audio data to be adjusted, the syllables to be corrected whose pronunciation difference results indicate a difference.
Because each user pronunciation syllable has a corresponding pronunciation difference result indicating whether it differs from its standard pronunciation syllable, the server only needs to read the pronunciation difference results one by one and select the user pronunciation syllables whose results indicate a difference; these are the syllables to be corrected.
It should be noted that, since more than one pronunciation difference result may indicate a difference between a user pronunciation syllable and its standard pronunciation syllable, the syllable to be corrected selected by the server is not a single syllable but the set of all user pronunciation syllables that need correction.
S1052, the server extracts a standard syllable corresponding to the syllable to be corrected from at least one standard pronunciation syllable.
After obtaining the syllables to be corrected, the server determines, according to the correspondence between user pronunciation syllables and standard pronunciation syllables, a corresponding standard pronunciation syllable for each user pronunciation syllable among the syllables to be corrected, and extracts these standard pronunciation syllables into a set to obtain the standard syllables.
It will be appreciated that the number of standard pronunciation syllables in the standard syllables is the same as the number of user pronunciation syllables in the syllables to be corrected.
S1053, the server replaces the syllable to be corrected with the standard syllable to obtain the corrected syllable.
The server replaces each user pronunciation syllable among the syllables to be corrected with its corresponding standard pronunciation syllable from the standard syllables, obtaining as many corrected user pronunciation syllables as there are syllables to correct; these corrected user pronunciation syllables form a set, namely the corrected syllables.
S1054, the server synthesizes the adjusted audio by using the corrected syllable and the syllables except the syllable to be corrected in at least one user pronunciation syllable.
After obtaining the corrected syllables, the server re-synthesizes the audio using each corrected user pronunciation syllable together with the syllables in the at least one user pronunciation syllable other than the syllables to be corrected, that is, the user pronunciation syllables that do not differ from their standard pronunciation syllables; the audio synthesized at this point is the adjusted audio.
It will be appreciated that the server uses the user's timbre when re-synthesizing the adjusted audio, so that the synthesized adjusted audio carries the user's timbre; that is, the adjusted audio is audio specific to the user.
In the embodiment of the invention, the server can select, from the at least one user pronunciation syllable of the audio data to be adjusted, the syllables to be corrected whose pronunciation difference results indicate a difference, that is, the syllables the user pronounced non-standardly; it then extracts the corresponding standard syllables from the at least one standard pronunciation syllable, replaces the syllables to be corrected with the standard syllables to obtain the corrected syllables, and synthesizes the adjusted audio from the corrected syllables and the remaining user pronunciation syllables. In this way, the server corrects the non-standard user pronunciation syllables with standard pronunciation syllables to obtain the adjusted audio.
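Steps S1051-S1054 can be sketched as below; the data layout and the implied concatenation-style synthesis are assumptions (the patent only requires that re-synthesis keep the user's timbre):

```python
def correct_pronunciation(aligned_pairs, differs):
    """Replace differing user syllables with standard ones (S1051-S1054).

    aligned_pairs: list of (user_syllable, standard_syllable) pairs in
    temporal order; differs: one boolean per pair from the difference test.
    Returns the syllable sequence from which the adjusted audio is synthesized.
    """
    output = []
    for (user_syl, std_syl), is_diff in zip(aligned_pairs, differs):
        # Keep the user's own syllable wherever it already matches the standard.
        output.append(std_syl if is_diff else user_syl)
    return output
```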
Referring to fig. 9, fig. 9 is a second optional timing diagram of the audio adjustment method according to the embodiment of the present invention, in some embodiments of the present invention, after the server acquires the acoustic audio data corresponding to the audio data to be adjusted from the acoustic audio database, that is, after S103, the method may further include: S107-S109, in other words, after S103, the server can choose to execute S104-S105, or choose to execute S107-S109, or can execute S104-S105 first and then S107-S109, or execute S107-S109 first and then S104-S105.
S107, the server extracts the original sound audio features from the original sound audio data and extracts the user audio features from the audio data to be adjusted; wherein the audio features include at least pitch, frequency, and duration.
Besides correcting the pronunciation of the audio data to be adjusted, the server can also perform prosody optimization on it, so as to correct the off-key and rhythmically disordered parts of the user's audio and obtain a more pleasing adjusted audio. To this end, the server first extracts the original sound audio features from the original sound audio data and extracts the user audio features from the audio data to be adjusted, so that the two can subsequently be compared to identify the off-key and rhythmically disordered parts of the audio data to be adjusted.
It should be understood that, in the embodiment of the present invention, the audio features at least include pitch, frequency, and duration, that is, at least the pitch of the user's voice, the overall frequency distribution of the audio data to be adjusted, the duration of each tone, and the like; they may also include other features capable of describing the audio data, and the embodiment of the present invention is not limited herein.
S108, the server carries out similarity calculation on the original sound audio features and the user audio features to obtain an audio difference result.
The server compares the original sound audio features with the user audio features to learn how similar they are, and thereby obtains the audio difference result between them.
In order to facilitate the calculation of the similarity between the acoustic audio features and the user audio features, in some embodiments of the present invention the server may map both feature sets into a vector space, and then obtain their similarity from a similarity calculation over the vectors.
S109, the server performs prosody optimization on the audio data to be adjusted by using the original sound audio data based on the audio difference result, to obtain the adjusted audio.
After obtaining the audio difference result, the server can determine whether the audio data to be adjusted is off-key or whether its rhythm detracts from how pleasing it sounds. If the audio difference result indicates that the melody and rhythm of the audio data to be adjusted differ greatly from those of the original sound audio data, the server can use the melody and rhythm of the original sound audio data to perform prosody optimization on the melody and rhythm of the audio data to be adjusted; the audio obtained after this optimization is the adjusted audio.
It should be noted that, in some embodiments of the present invention, because the original sound audio data and the audio data to be adjusted may differ only in prosody and melody, only in pronunciation, or in both, the server may perform only prosody optimization on the audio data to be adjusted to obtain the adjusted audio, perform only pronunciation correction to obtain the adjusted audio, or perform both pronunciation correction and prosody optimization to obtain the adjusted audio. The specific adjustment mode may be determined by the server according to the difference between the audio data to be adjusted and the acoustic audio data, which the embodiment of the present invention does not limit herein.
Illustratively, referring to fig. 10, a second schematic diagram of adjusting audio is provided in the embodiment of the present invention. In fig. 10, the audio data to be adjusted is the Cantonese song "platoon sky" recorded by the user. The bolded "cool eye" and "jeer" in the lyrics display area 10-1 are words with inaccurate pronunciation that the server has corrected. The prosody display area 10-2 shows the pitch and duration of each tone the user sang, together with the pitch and duration of some tones after the server's prosody optimization: the pitch and duration of the tones sung by the user are shown with solid line segments, and those after prosody optimization are shown with dotted line segments. As can be seen from fig. 10, the pitch and duration of tone 10-21 sung by the user are incorrect, and the server performs prosody optimization on tone 10-21 to obtain the optimized tone 10-22, thereby correcting the user's off-key part.
In the embodiment of the invention, the server can also extract the original sound audio features from the original sound audio data and the user audio features from the audio data to be adjusted, compare the two to identify the rhythm and melody problems in the audio data to be adjusted, and then perform prosody optimization on the audio data to be adjusted by using the original sound audio data to obtain the adjusted audio.
In some embodiments of the present invention, the server performs similarity calculation on the acoustic audio features and the user audio features to obtain an audio difference result, that is, the specific implementation process of S108 may include: S1081-S1082, as follows:
S1081, the server vectorizes the acoustic audio features and the user audio features respectively to obtain acoustic feature vectors corresponding to the acoustic audio features and user feature vectors corresponding to the user audio features.
The server maps the acoustic audio features into a vector space to obtain the acoustic feature vectors corresponding to the acoustic audio features, and simultaneously maps the user audio features into the vector space to obtain the user feature vectors corresponding to the user audio features.
It will be appreciated that, since pitch, frequency, and duration can all be represented as numbers, the server may convert these audio features into numeric form when mapping them into the vector space, thereby obtaining the feature vectors.
S1082, the server calculates an angle value between the acoustic feature vector and the user feature vector, and then obtains an audio difference result according to the angle value.
In a vector space, the similarity between two vectors can be measured by the angle between them: the smaller the angle, the more similar the vectors; the larger the angle, the less similar. Therefore, the server can calculate the angle value between the acoustic feature vector and the user feature vector; a larger angle value yields an audio difference result indicating that the acoustic audio features and the user audio features are dissimilar, while a smaller angle value yields an audio difference result indicating that they are similar.
Further, the server may quantify the difference between the acoustic feature vector and the user feature vector by setting an angle value threshold: when the angle value is greater than or equal to the preset angle value threshold, the acoustic audio features are not similar to the user audio features; when it is smaller than the preset angle value threshold, they are similar.
In the embodiment of the invention, the server can first vectorize the acoustic audio features and the user audio features, and obtain the audio difference result by calculating the angle value between the acoustic feature vector and the user feature vector, so that the server can measure whether the original sound audio data and the audio data to be adjusted differ in rhythm and melody.
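A minimal sketch of the angle computation follows, using cosine similarity (named later in this document as the concrete measure) and a degree threshold; the 30° default is only one of the example values mentioned in this description:

```python
import numpy as np

def audio_features_differ(acoustic_vec, user_vec, angle_threshold_deg=30.0):
    """Audio difference result: True when the feature vectors' angle is too wide."""
    cos_sim = np.dot(acoustic_vec, user_vec) / (
        np.linalg.norm(acoustic_vec) * np.linalg.norm(user_vec))
    angle_deg = np.degrees(np.arccos(np.clip(cos_sim, -1.0, 1.0)))
    return angle_deg >= angle_threshold_deg
```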
In some embodiments of the present invention, the server vectorizes the acoustic audio features and the user audio features respectively to obtain the acoustic feature vectors corresponding to the acoustic audio features and the user feature vectors corresponding to the user audio features, that is, the specific implementation process of S1081 may include: S1081a-S1081c, as follows:
S1081a, the server digitizes each original sound sub-feature in the original sound audio features to obtain sub-elements corresponding to each original sound sub-feature.
S1081b, the server digitizes each user sub-feature in the user audio features to obtain sub-elements corresponding to each user sub-feature.
Since each sub-feature of the audio features, such as pitch, frequency, and duration, can be represented as a number, the server can directly digitize each original sound sub-feature to obtain the sub-element corresponding to it, and digitize each user sub-feature in the same way to obtain the sub-elements corresponding to the user sub-features.
In other embodiments of the present invention, the server may further perform rounding, averaging, and the like on the numbers obtained after digitizing the sub-features to obtain the sub-elements corresponding to the sub-features, which is not limited in this embodiment of the present invention.
S1081c, the server combines the sub-elements corresponding to the original sound sub-features into the original sound feature vector, and combines the sub-elements corresponding to the user sub-features into the user feature vector.
The server concatenates the sub-elements corresponding to the original sound sub-features in the order of the original sound sub-features to obtain the original sound feature vector, and likewise concatenates the sub-elements corresponding to the user sub-features in the order of the user sub-features to obtain the user feature vector; the server has then completed the vectorization of the original sound audio features and the user audio features.
In the embodiment of the invention, the server vectorizes the original sound audio features and the user audio features to obtain the original sound feature vector and the user feature vector, from which the audio difference result is subsequently calculated.
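Steps S1081a-S1081c amount to flattening per-word sub-features into one vector, as sketched below; the per-word (pitch, frequency, duration) layout is an assumption consistent with the feature list above:

```python
import numpy as np

def build_feature_vector(words):
    """Concatenate per-word (pitch, frequency, duration) into one feature vector.

    words: list of dicts like {"pitch": 220.0, "freq": 440.0, "dur": 0.3},
    kept in word order, mirroring S1081a-S1081c.
    """
    return np.array(
        [v for w in words for v in (w["pitch"], w["freq"], w["dur"])],
        dtype=float)
```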
In some embodiments of the present invention, the server performs prosody optimization on the audio data to be adjusted by using the acoustic audio data based on the audio difference result to obtain an adjusted audio, that is, a specific implementation process of S109 may include: S1091-S1094, as follows:
S1091, when the audio difference result is greater than the preset audio difference threshold, the server derives the user-specific prosody statistically from the audio data to be adjusted, and extracts the original voice prosody from the original voice audio data.
Only when the audio difference result is greater than the preset audio difference threshold, that is, when the original voice audio data and the audio data to be adjusted differ sufficiently in melody and rhythm, does the server perform prosody optimization on the audio data to be adjusted. In other words, when the difference in melody and rhythm is small, the audio data to be adjusted is already pleasing enough and needs no prosody optimization. When performing prosody optimization, the server first computes the user-specific prosody, which carries the user's characteristics, from the audio data to be adjusted, and at the same time extracts the original voice prosody from the original voice audio data.
It is understood that, in the embodiment of the present invention, the preset audio difference threshold may be set according to the actual situation; for example, when the audio difference result is measured by an angle value, the preset audio difference threshold may be set to 10° or to 30°, and the embodiment of the present invention is not limited herein.
It should be noted that the prosody the user frequently uses in the audio data to be adjusted constitutes the user-specific prosody. In other embodiments of the present invention, the user-specific prosody may not be obtainable statistically from the audio data to be adjusted alone; in that case, the server may retrieve the user's historically recorded audio from the database and derive the user-specific prosody from that historical audio.
S1092, the server extracts from the user-specific prosody the prosody to be adjusted, namely the prosody whose difference from the original voice prosody exceeds the preset prosody threshold.
After obtaining the user-specific prosody, the server compares it with the original voice prosody. When the difference between a user-specific prosody and the original voice prosody exceeds the preset prosody threshold, that user-specific prosody deviates severely from the original voice prosody, and the server extracts it as prosody to be adjusted.
It is to be understood that the preset prosody threshold may be set according to the actual situation, and the embodiment of the present invention is not limited herein. For example, the preset prosody threshold may be set to 50%; that is, when a user-specific prosody differs from the original voice prosody by 50%, it becomes prosody to be adjusted.
It should be noted that, since the user's timbre may differ from that of the performer of the original voice audio data, the comparison of the user-specific prosody with the original voice prosody, and even the comparison of user pronunciation syllables with original pronunciation syllables, should be performed with timbre stripped out; that is, a difference in timbre must not be mistaken for a difference in pronunciation or prosody.
S1093, the server weights the prosody to be adjusted and the original voice prosody corresponding to the prosody to be adjusted to obtain the adjusted user prosody.
The server determines the corresponding original voice prosody for each prosody to be adjusted, weights the prosody to be adjusted against that original voice prosody, and takes the weighted result as the adjusted user prosody. It is understood that the server may set the weight according to how much the prosody to be adjusted differs from the original voice prosody. For example, when the prosody to be adjusted is completely different from the original voice prosody, the weight is set to 1, that is, the original voice prosody replaces the prosody to be adjusted entirely; when the prosody to be adjusted differs only partially, the weight may be set to 0.4, that is, 60% of the prosody to be adjusted and 40% of the original voice prosody are blended to produce the final adjusted user prosody. In this way, the user's prosody is optimized while its characteristic traits are preserved.
S1094, the server synthesizes the adjusted audio by using the adjusted user prosody together with the prosody in the user-specific prosody other than the prosody to be adjusted.
The server performs audio synthesis using the adjusted user prosody together with the prosody in the user-specific prosody other than the prosody to be adjusted, obtaining the adjusted audio. It will be appreciated that the server still uses the user's timbre when synthesizing the adjusted audio, so the resulting adjusted audio carries the user's timbre.
In the embodiment of the invention, the server first derives the user-specific prosody from the audio data to be adjusted, then extracts the prosody to be adjusted from it, weights each prosody to be adjusted against its corresponding original voice prosody to obtain the adjusted user prosody, and finally synthesizes the adjusted audio based on the adjusted user prosody. In this way, the server can optimize the melody and rhythm of the audio data to be adjusted to obtain a more pleasing adjusted audio.
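The weighting in S1093 can be sketched as a per-parameter blend; treating a prosody as a (pitch, duration) record and using the 0.4 weight from the example above are assumptions:

```python
def blend_prosody(prosody_to_adjust, original_prosody, weight=0.4):
    """Weight the prosody to be adjusted against the original voice prosody.

    Both arguments are dicts like {"pitch": 220.0, "dur": 0.30}; `weight`
    is the share given to the original voice prosody (1.0 would replace
    the user's prosody entirely, per the example in the text above).
    """
    return {k: (1.0 - weight) * prosody_to_adjust[k]
               + weight * original_prosody[k]
            for k in prosody_to_adjust}
```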
In some embodiments of the present invention, after the terminal acquires the user's audio data to be adjusted on the audio recording function interface and before it sends the audio data to be adjusted to the server, that is, after S101 and before S102, the method may further include S112, as follows:
S112, the terminal compresses the audio data to be adjusted to obtain the compressed audio data to be adjusted.
After the terminal acquires the audio data to be adjusted, it can compress the data in order to transmit it to the server more conveniently, saving transmission time and the data traffic occupied by transmission. Correspondingly, instead of sending the audio adjustment instruction, the acoustic audio identifier, and the audio data to be adjusted to the server, the terminal then sends the audio adjustment instruction, the acoustic audio identifier, and the compressed audio data to be adjusted.
In the embodiment of the invention, the terminal can compress the audio data to be adjusted and send the compressed audio data to be adjusted to the server, thereby saving the transmission time of the audio data to be adjusted and the data traffic occupied by transmission.
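As a purely illustrative sketch, the snippet below compresses the recorded bytes losslessly with zlib before upload; a production system would more likely use a lossy audio codec, and the function names are assumptions:

```python
import zlib

def compress_audio(pcm_bytes, level=6):
    """Compress raw recorded audio bytes before sending them to the server."""
    return zlib.compress(pcm_bytes, level)

def decompress_audio(blob):
    """Server-side inverse of compress_audio."""
    return zlib.decompress(blob)
```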
In the following, an exemplary application of the embodiments of the present invention in a practical application scenario will be described.
The embodiment of the invention applies to scenarios in which the server beautifies audio, for example when a user sings a Cantonese song, a dialect song, or an opera on a terminal, or reads English aloud for an English short video on the terminal.
Fig. 11 is a flowchart illustrating audio beautification performed at the terminal according to an embodiment of the present invention. After the process starts 11-1, the user clicks the karaoke interface (audio recording function interface) of the terminal to start the song-as-you-go function 11-2 (audio adjustment instruction) and then records the user audio 11-3 (audio data to be adjusted); the terminal transmits the recorded user audio to the server 11-4 in segments, so that the server processes the user audio 11-5. The terminal then receives the beautified audio 11-6 and plays it back through the user's ear monitor so that the user can hear the adjusted audio 11-7. After the audio recording finishes 11-8, the terminal judges from the user's operation whether the user is satisfied with the beautified audio 11-9; if so, the beautified audio is output 11-10, and if not, the flow returns to recording the user audio 11-3.
Fig. 12 is a schematic diagram of an interaction flow for beautifying audio according to an embodiment of the present invention. When the user starts recording a song or opera, or recording English read aloud, the terminal requests the recording device to record 12-1, and the recording device returns the user audio 12-2 to the terminal. The terminal performs preprocessing 12-3 such as compression on the user audio and sends the preprocessed audio (the compressed audio data to be adjusted) to the server 12-4. The server performs a pronunciation matching degree test (pronunciation comparison) on the preprocessed audio and beautifies the user audio 12-5 according to the matching degree (pronunciation difference result), then returns the beautified audio to the terminal 12-6; the terminal plays the beautified audio to the user 12-7 through the ear monitor or loudspeaker, so that the user can decide on subsequent operations.
Specifically, in the pronunciation matching degree test, the server may compare the syllables of the user audio (user pronunciation syllables) with the syllables of the acoustic audio, such as Cantonese syllables, opera syllables, dialect song syllables, or syllables of the English short text (standard pronunciation syllables), to obtain a pronunciation matching degree test result. The server can also optimize the rhythm and pitch of the user audio. First, the server cuts the user audio and the acoustic audio into small sections (the acoustic segments and the segments to be beautified), one line of lyrics per section, and compares them section by section, which simplifies the comparison and reduces comparison errors. The server then characterizes the audio, that is, converts the user audio and the acoustic audio into vector representations (the acoustic feature vectors and the user feature vectors), whose parameters include the pitch, frequency, and duration of each word. Next, the server calculates the similarity between the user audio vector and the acoustic audio vector using cosine similarity and, according to the similarity, divides the difference between the user audio and the acoustic audio into four cases: (a) pronunciation normal, but rhythm and pitch different; (b) pronunciation abnormal but rhythm and pitch matched: in this case the user's pronunciation is translated, for example non-Cantonese into Cantonese, and the produced audio is smoothed using the historical audio and the acoustic audio; (c) pronunciation abnormal and rhythm and pitch not matched, the differences being too large, so new audio (the adjusted audio) must be produced from the user audio and the acoustic audio; (d) pronunciation normal and rhythm and pitch consistent: the ideal case, in which the user's audio needs no adjustment.
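A hypothetical dispatcher for these four cases might look as follows; the boolean inputs and case labels simply mirror the list above:

```python
def classify_difference(pronunciation_ok, rhythm_pitch_ok):
    """Map the two test outcomes to the four cases (a)-(d) above."""
    if pronunciation_ok and rhythm_pitch_ok:
        return "d"  # ideal case: no adjustment needed
    if pronunciation_ok:
        return "a"  # correct rhythm and pitch only
    if rhythm_pitch_ok:
        return "b"  # translate/correct the pronunciation only
    return "c"      # produce new audio from the user and acoustic audio
```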
Fig. 13 is a schematic diagram of the audio synthesis process provided by an embodiment of the present invention. During audio synthesis, the user's characteristic prosody (user-specific prosody) 13-3 may be extracted from the user's historical audio 13-1 and the user audio 13-2. When the server first starts operating, a user may have little or no historical audio; in that case, the server can search the audio of all users for the audio most similar to the current user audio and use it as the user's historical audio. Then, the server extracts the standard prosody (original voice prosody) 13-5 from the acoustic audio 13-4, performs prosody adjustment 13-6 on the user's characteristic prosody using the standard prosody, and at the same time synthesizes the user's beautified audio 13-8 (the adjusted audio) from the extracted standard syllables 13-7 and the prosody adjustment result. It will be appreciated that both the standard syllables and the standard prosody can be extracted in advance and stored on the server to speed up audio beautification. Finally, the server can smooth the synthesized audio so that the user's audio sounds more natural.
By the above method, the server can optimize and adjust the user-specific prosody and the user's pronunciation in the user's songs, operas, and even English read aloud, so that the user's audio sounds more graceful; this broadens the available types of adjustment for the user's audio and improves the audio adjustment effect.
Continuing with the exemplary structure of the audio adjustment device 255 provided by the embodiments of the present invention implemented as software modules, in some embodiments, as shown in fig. 3, the software modules of the audio adjustment device 255 stored in the first memory 250 may include:
a first receiving module 2551, configured to receive audio data to be adjusted sent by a terminal; the audio data to be adjusted represents audio data recorded by a user;
an obtaining module 2552, configured to obtain, from an acoustic audio database, acoustic audio data corresponding to the audio data to be adjusted;
a difference comparison module 2553, configured to perform pronunciation matching detection on the audio data to be adjusted and the acoustic audio data to obtain a pronunciation difference result; the pronunciation difference result represents the difference degree of the audio data to be adjusted and the acoustic audio data in pronunciation;
an adjusting module 2554, configured to modify the pronunciation of the audio data to be adjusted by using the pronunciation of the acoustic audio data based on the pronunciation difference result to obtain an adjusted audio;
a first sending module 2555, configured to return the adjusted audio to the terminal, so that the terminal can play the adjusted audio.
In some embodiments of the present invention, the difference comparing module 2553 is specifically configured to take the audio data to be adjusted as a segment to be adjusted, and directly perform pronunciation matching detection on the audio data to be adjusted and the original sound audio data to obtain the pronunciation difference result; or after segmenting the audio data to be adjusted and the acoustic audio data, carrying out pronunciation matching detection to obtain the pronunciation difference result.
In some embodiments of the invention, the difference comparing module 2553 is specifically configured to divide the acoustic audio data into a plurality of acoustic segments; utilize a plurality of segment times corresponding to the plurality of acoustic segments to segment the audio data to be adjusted into a plurality of segments to be adjusted; and perform pronunciation matching detection on each segment to be adjusted and each corresponding acoustic segment to obtain the pronunciation difference result.
In some embodiments of the present invention, the difference comparing module 2553 is specifically configured to extract at least one user pronunciation syllable from each of the segments to be adjusted, and extract at least one standard pronunciation syllable from each of the acoustic segments; determining, for each of the at least one user-pronounced syllable, a corresponding standard syllable from the at least one standard pronounced syllable; and matching each user pronunciation syllable with the corresponding standard syllable to obtain the pronunciation difference result.
In some embodiments of the invention, the difference comparing module 2553 is specifically configured to perform phoneme decomposition on each user-spoken syllable to obtain a first phoneme corresponding to each user-spoken syllable, and perform phoneme decomposition on a standard syllable corresponding to each user-spoken syllable to obtain a second phoneme corresponding to the standard syllable; matching the pronunciation of the first phoneme with the pronunciation of the second phoneme to obtain a pronunciation matching result; extracting the duration of the first phoneme to obtain the first duration, and extracting the duration of the second phoneme to obtain the second duration; obtaining a time comparison result according to the first duration and the second duration; and determining the pronunciation difference result according to the time comparison result and the pronunciation matching result.
In some embodiments of the present invention, the difference comparing module 2553 is specifically configured to, when the time comparison result indicates that the first duration and the second duration are matched and the phoneme matching result indicates that the pronunciation of the first phoneme is matched with the pronunciation of the second phoneme, indicate through the pronunciation difference result that the user pronunciation syllable has no difference from the corresponding standard syllable; when the time comparison result represents that the first duration and the second duration are not matched, and the phoneme matching result represents that the pronunciation of the first phoneme is matched with the pronunciation of the second phoneme, the pronunciation difference result represents that the user pronunciation syllable is different from the corresponding standard syllable; when the time comparison result represents that the first duration and the second duration are matched and the phoneme matching result represents that the pronunciation of the first phoneme is not matched with the pronunciation of the second phoneme, the pronunciation difference result represents that the user pronunciation syllable is different from the corresponding standard syllable; and when the time comparison result represents that the first duration and the second duration are not matched and the phoneme matching result represents that the pronunciation of the first phoneme is not matched with the pronunciation of the second phoneme, the pronunciation difference result represents that the user pronunciation syllable is different from the corresponding standard syllable.
In some embodiments of the present invention, the adjusting module 2554 is specifically configured to select a to-be-corrected syllable having a difference in pronunciation difference result characteristic from at least one user pronunciation syllable of the to-be-adjusted audio data; extracting a standard syllable corresponding to the syllable to be corrected from the at least one standard pronunciation syllable; replacing the syllable to be corrected with the standard syllable to obtain a corrected syllable; and synthesizing the adjusted audio by using the corrected syllable and syllables except the syllable to be corrected in the at least one user pronunciation syllable.
In some embodiments of the present invention, the difference comparing module 2553 is further configured to extract an acoustic audio feature from the acoustic audio data, and extract a user audio feature from the audio data to be adjusted; wherein the audio features include at least pitch, frequency, and duration; carrying out similarity calculation on the acoustic audio features and the user audio features to obtain an audio difference result;
the adjusting module 2554 is further configured to perform prosody optimization on the audio data to be adjusted by using the acoustic audio data based on the audio difference result, so as to obtain the adjusted audio.
In some embodiments of the present invention, the difference comparing module 2553 is further configured to vectorize the acoustic audio features and the user audio features, respectively, to obtain acoustic feature vectors corresponding to the acoustic audio features and user feature vectors corresponding to the user audio features; and calculating an angle value between the original sound feature vector and the user feature vector, and further obtaining the audio difference result according to the angle value.
In some embodiments of the present invention, the difference comparing module 2553 is further configured to digitize each original sound sub-feature in the original sound audio features to obtain a sub-element corresponding to each original sound sub-feature; digitizing each user sub-feature in the user audio features to obtain sub-elements corresponding to each user sub-feature; and combining the acoustic feature vectors by utilizing the sub-elements corresponding to the acoustic sub-features, and combining the user feature vectors by utilizing the sub-elements corresponding to the user sub-features.
In some embodiments of the present invention, the adjusting module 2554 is further configured to count a user specific prosody from the audio data to be adjusted and extract an original prosody from the original audio data when the audio difference result is greater than a preset audio difference threshold; extracting prosody to be adjusted, the difference between which and the original prosody exceeds a preset prosody threshold, from the special prosody of the user; weighting the prosody to be adjusted and the original voice prosody corresponding to the prosody to be adjusted to obtain the adjusted user prosody; and synthesizing the adjusted audio by using the adjusted user prosody and prosody except the prosody to be adjusted in the special prosody of the user.
Continuing with the exemplary structure of the audio playing device 455 provided by the embodiment of the present invention implemented as software modules, in some embodiments, as shown in fig. 4, the software modules of the audio playing device 455 stored in the second memory 450 may include:
an obtaining module 4551, configured to obtain audio data to be adjusted of a user;
a second sending module 4552, configured to send the audio data to be adjusted to a server, so that the server performs audio adjustment on the audio data to be adjusted, where the adjusted audio is generated by the server based on the audio data to be adjusted;
a second receiving module 4553, configured to receive the adjusted audio sent by the server and play the adjusted audio.
In some embodiments of the present invention, the audio playing device 455 further comprises a compression module 4554;
the compression module 4554 is configured to compress the audio data to be adjusted to obtain compressed audio data to be adjusted;
Correspondingly, the second sending module 4552 is further configured to send the compressed audio data to be adjusted to the server.
Embodiments of the present invention provide a computer-readable storage medium having stored therein executable instructions that, when executed by a processor, cause the processor to perform an audio adjustment method provided by embodiments of the present invention, for example, the method as illustrated in fig. 5, fig. 7, fig. 8, fig. 9, fig. 11, fig. 12, and fig. 13.
In some embodiments, the computer-readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, an optical disk, or a CD-ROM; or may be any device including one of the above memories or any combination thereof.
In some embodiments, the executable audio adjustment instructions may be in the form of a program, software module, script, or code written in any form of programming language (including compiled or interpreted languages), or declarative or procedural languages, and they may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, the executable audio adjustment instructions may, but need not, correspond to files in a file system, may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
The above description is only an example of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present invention are included in the protection scope of the present invention.

Claims (14)

1. An audio adjustment method, comprising:
receiving audio data to be adjusted sent by a terminal; the audio data to be adjusted is audio data recorded by a user;
acquiring the acoustic audio data corresponding to the audio data to be adjusted from an acoustic audio database;
carrying out pronunciation matching detection on the audio data to be adjusted and the acoustic audio data to obtain a pronunciation difference result; the pronunciation difference result represents the difference degree of the audio data to be adjusted and the acoustic audio data in pronunciation;
and correcting the pronunciation of the audio data to be adjusted by utilizing the pronunciation of the acoustic audio data based on the pronunciation difference result to obtain an adjusted audio.
2. The method according to claim 1, wherein the performing pronunciation matching detection on the audio data to be adjusted and the acoustic audio data to obtain a pronunciation difference result comprises:
taking the audio data to be adjusted as a segment to be adjusted, and directly carrying out pronunciation matching detection on the audio data to be adjusted and the original sound audio data to obtain the pronunciation difference result; or,
and segmenting the audio data to be adjusted and the acoustic audio data, and then carrying out pronunciation matching detection to obtain the pronunciation difference result.
3. The method according to claim 2, wherein the segmenting the audio data to be adjusted and the acoustic audio data and then performing pronunciation matching detection to obtain the pronunciation difference result comprises:
dividing the acoustic audio data into a plurality of acoustic segments;
utilizing a plurality of segment times corresponding to the plurality of acoustic segments to segment the audio data to be adjusted into a plurality of segments to be adjusted;
and carrying out pronunciation matching detection on each segment to be adjusted and each corresponding acoustic segment to obtain the pronunciation difference result.
4. The method according to claim 3, wherein the performing pronunciation matching detection on each segment to be adjusted and each corresponding acoustic segment to obtain the pronunciation difference result comprises:
extracting at least one user pronunciation syllable from each segment to be adjusted, and extracting at least one standard pronunciation syllable from each acoustic segment;
determining, for each of the at least one user-pronounced syllable, a corresponding standard syllable from the at least one standard pronounced syllable;
and matching each user pronunciation syllable with the corresponding standard syllable to obtain the pronunciation difference result.
5. The method of claim 4, wherein said matching each of said user-uttered syllables to its corresponding standard syllable for said pronunciation difference result comprises:
performing phoneme decomposition on each user pronunciation syllable to obtain a first phoneme corresponding to each user pronunciation syllable, and performing phoneme decomposition on a standard syllable corresponding to each user pronunciation syllable to obtain a second phoneme corresponding to the standard syllable;
matching the pronunciation of the first phoneme with the pronunciation of the second phoneme to obtain a pronunciation matching result;
extracting the duration of the first phoneme to obtain a first duration, and extracting the duration of the second phoneme to obtain a second duration;
obtaining a time comparison result according to the first duration and the second duration;
and determining the pronunciation difference result according to the time comparison result and the pronunciation matching result.
6. The method according to any one of claims 1 to 5, wherein the modifying the pronunciation of the audio data to be adjusted by using the pronunciation of the acoustic audio data based on the pronunciation difference result to obtain an adjusted audio comprises:
selecting syllables to be corrected with differences in pronunciation difference result representation from at least one user pronunciation syllable of the audio data to be adjusted;
extracting a standard syllable corresponding to the syllable to be corrected from the at least one standard pronunciation syllable;
replacing the syllable to be corrected with the standard syllable to obtain a corrected syllable;
and synthesizing the adjusted audio by using the corrected syllable and syllables except the syllable to be corrected in the at least one user pronunciation syllable.
7. The method according to any one of claims 1 to 6, wherein after the obtaining of the acoustic audio data corresponding to the audio data to be adjusted from the acoustic audio database, the method further comprises:
extracting original sound audio features from the original sound audio data, and extracting user audio features from the audio data to be adjusted; wherein the audio features include at least pitch, frequency, and duration;
carrying out similarity calculation on the acoustic audio features and the user audio features to obtain an audio difference result;
and performing prosody optimization on the audio data to be adjusted by using the acoustic audio data based on the audio difference result to obtain the adjusted audio.
8. The method of claim 7, wherein the performing a similarity calculation between the acoustic audio features and the user audio features to obtain an audio difference result comprises:
vectorizing the acoustic audio features and the user audio features respectively to obtain acoustic feature vectors corresponding to the acoustic audio features and user feature vectors corresponding to the user audio features;
and calculating an angle value between the original sound feature vector and the user feature vector, and further obtaining the audio difference result according to the angle value.
9. The method according to claim 8, wherein the vectorizing the acoustic audio features and the user audio features respectively to obtain acoustic feature vectors corresponding to the acoustic audio features and user feature vectors corresponding to the user audio features comprises:
digitizing each original sound sub-feature in the original sound audio features to obtain sub-elements corresponding to each original sound sub-feature;
digitizing each user sub-feature in the user audio features to obtain sub-elements corresponding to each user sub-feature;
and combining the acoustic feature vectors by utilizing the sub-elements corresponding to the acoustic sub-features, and combining the user feature vectors by utilizing the sub-elements corresponding to the user sub-features.
10. The method according to any one of claims 7 to 9, wherein performing prosody optimization on the audio data to be adjusted by using the acoustic audio data based on the audio difference result to obtain the adjusted audio comprises:
when the audio difference result is larger than a preset audio difference threshold, gathering user-specific prosody from the audio data to be adjusted, and extracting acoustic prosody from the acoustic audio data;
extracting, from the user-specific prosody, prosody to be adjusted whose difference from the acoustic prosody exceeds a preset prosody threshold;
weighting the prosody to be adjusted and the acoustic prosody corresponding to the prosody to be adjusted to obtain adjusted user prosody;
and synthesizing the adjusted audio by using the adjusted user prosody and the prosody other than the prosody to be adjusted in the user-specific prosody.
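A minimal sketch of claim 10's weighting step, treating each prosodic feature as a scalar; the blend weight w = 0.5 is an assumption, since the patent does not fix the weights:

    def adjust_prosody(user_prosody, acoustic_prosody, prosody_threshold, w=0.5):
        # Blend only the prosodic features whose deviation from the acoustic
        # reference exceeds the threshold; all other user prosody is kept as-is.
        adjusted = {}
        for name, user_value in user_prosody.items():
            reference = acoustic_prosody[name]
            if abs(user_value - reference) > prosody_threshold:
                adjusted[name] = w * user_value + (1 - w) * reference  # claim 10
            else:
                adjusted[name] = user_value
        return adjusted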
11. An audio adjustment method, comprising:
acquiring a user's audio data to be adjusted;
sending the audio data to be adjusted to a server so that the server performs audio adjustment on the audio data to be adjusted, wherein the adjusted audio is generated by the server based on the audio data to be adjusted;
and receiving and playing the adjusted audio sent by the server.
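Claim 11 is the terminal-side counterpart of claims 1 to 10. A minimal client sketch over HTTP follows; the endpoint URL, the form-field name and the WAV response format are assumptions, as the patent does not specify a transport protocol:

    import requests

    SERVER_URL = "https://example.com/api/adjust-audio"  # hypothetical endpoint

    def adjust_audio(path, out_path="adjusted.wav"):
        # Upload the audio data to be adjusted; the server returns the
        # adjusted audio, which the terminal then saves and plays back.
        with open(path, "rb") as f:
            response = requests.post(SERVER_URL, files={"audio": f}, timeout=60)
        response.raise_for_status()
        with open(out_path, "wb") as out:
            out.write(response.content)
        return out_path  # hand off to any audio player for playback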
12. A server, comprising:
a first memory to store executable audio adjustment instructions;
a first processor, configured to implement the method of any one of claims 1 to 10 when executing the executable audio adjustment instructions stored in the first memory.
13. A terminal, comprising:
a second memory to store executable audio adjustment instructions;
a second processor, configured to implement the method of claim 11 when executing the executable audio adjustment instructions stored in the second memory.
14. A computer-readable storage medium having stored thereon executable audio adjustment instructions which, when executed, cause a first processor to perform the method of any one of claims 1 to 10, or cause a second processor to perform the method of claim 11.
CN202010107251.5A 2020-02-21 2020-02-21 Audio adjustment method, device and computer readable storage medium Active CN111370024B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010107251.5A CN111370024B (en) 2020-02-21 2020-02-21 Audio adjustment method, device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010107251.5A CN111370024B (en) 2020-02-21 2020-02-21 Audio adjustment method, device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111370024A true CN111370024A (en) 2020-07-03
CN111370024B CN111370024B (en) 2023-07-04

Family

ID=71210069

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010107251.5A Active CN111370024B (en) 2020-02-21 2020-02-21 Audio adjustment method, device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111370024B (en)


Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006201278A * 2005-01-18 2006-08-03 Nippon Telegr & Teleph Corp <Ntt> Method and apparatus for automatically analyzing metrical structure of piece of music, program, and recording medium on which program of method is recorded
CN101510423A * 2009-03-31 2009-08-19 Li Wei Pronunciation detection method and apparatus
CN105702240A * 2014-11-25 2016-06-22 Tencent Technology (Shenzhen) Co., Ltd. Method and device for enabling intelligent terminal to adjust song accompaniment music
WO2016128729A1 * 2015-02-09 2016-08-18 24 Acoustics Limited Audio signal processing apparatus, client device, system and method
CN106971704A * 2017-04-27 2017-07-21 Vivo Mobile Communication Co., Ltd. Audio processing method and mobile terminal
CN107800879A * 2017-10-23 2018-03-13 Nubia Technology Co., Ltd. Audio adjustment method, terminal and computer-readable storage medium
CN107945788A * 2017-11-27 2018-04-20 Guilin University of Electronic Technology Text-related spoken English pronunciation error detection and quality scoring method
CN108074557A * 2017-12-11 2018-05-25 Shenzhen TCL New Technology Co., Ltd. Tone adjustment method, device and storage medium
CN109272975A * 2018-08-14 2019-01-25 Wuxi Binghe Computer Technology Development Co., Ltd. Automatic singing accompaniment adjustment method, device and KTV jukebox
CN110148427A * 2018-08-22 2019-08-20 Tencent Digital (Tianjin) Co., Ltd. Audio processing method, device, system, storage medium, terminal and server
CN109979418A * 2019-03-06 2019-07-05 Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. Audio processing method and apparatus, electronic device and storage medium
CN110246489A * 2019-06-14 2019-09-17 Suzhou AISpeech Information Technology Co., Ltd. Speech recognition method and system for children
CN110782908A * 2019-11-05 2020-02-11 Guangzhou Huanliao Network Technology Co., Ltd. Audio signal processing method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TRISTA P. CHEN: "VRAPS: Visual Rhythm-based Audio Playback System", 2010 IEEE International Conference on Multimedia and Expo *
才让卓玛 (Cairang Zhuoma): "Research on Tibetan Speech Synthesis Technology Based on Mixed Units", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112447182A * 2020-10-20 2021-03-05 Open Intelligent Machine (Shanghai) Co., Ltd. Automatic sound modification system and sound modification method
CN112487238A * 2020-10-27 2021-03-12 Baiguoyuan Technology (Singapore) Pte. Ltd. Audio processing method, device, terminal and medium
CN112487238B * 2020-10-27 2024-05-17 Baiguoyuan Technology (Singapore) Pte. Ltd. Audio processing method, device, terminal and medium
CN112463107A * 2020-11-25 2021-03-09 Guangdong OPPO Mobile Telecommunications Co., Ltd. Audio playing parameter determination method and device, electronic equipment and readable storage medium
CN112908302A * 2021-01-26 2021-06-04 Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. Audio processing method, device and equipment, and readable storage medium
CN112908302B * 2021-01-26 2024-03-15 Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. Audio processing method, device and equipment, and readable storage medium
CN114446268A * 2022-01-28 2022-05-06 Beijing Baidu Netcom Science and Technology Co., Ltd. Audio data processing method and apparatus, electronic device, medium and program product
WO2023142413A1 * 2022-01-28 2023-08-03 Beijing Baidu Netcom Science and Technology Co., Ltd. Audio data processing method and apparatus, electronic device, medium, and program product

Also Published As

Publication number Publication date
CN111370024B (en) 2023-07-04

Similar Documents

Publication Publication Date Title
US10789290B2 (en) Audio data processing method and apparatus, and computer storage medium
CN111370024B (en) Audio adjustment method, device and computer readable storage medium
US10490181B2 (en) Technology for responding to remarks using speech synthesis
US10347238B2 (en) Text-based insertion and replacement in audio narration
CN106898340B (en) Song synthesis method and terminal
JP4296231B2 (en) Voice quality editing apparatus and voice quality editing method
JP7424359B2 (en) Information processing device, singing voice output method, and program
CN110782875B (en) Voice rhythm processing method and device based on artificial intelligence
CN112951198A (en) Singing voice synthesis
KR100659212B1 (en) Language learning system and voice data providing method for language learning
CN112164379A (en) Audio file generation method, device, equipment and computer readable storage medium
JP7363954B2 (en) Singing synthesis system and singing synthesis method
CN112992109B (en) Auxiliary singing system, auxiliary singing method and non-transient computer readable recording medium
CN110782918A (en) Voice rhythm evaluation method and device based on artificial intelligence
JP5598516B2 (en) Voice synthesis system for karaoke and parameter extraction device
CN112992110B (en) Audio processing method, device, computing equipment and medium
Aso et al. Speakbysinging: Converting singing voices to speaking voices while retaining voice timbre
CN113555001A (en) Singing voice synthesis method and device, computer equipment and storage medium
CN114464151B (en) Sound repairing method and device
CN113421544B (en) Singing voice synthesizing method, singing voice synthesizing device, computer equipment and storage medium
US12027165B2 (en) Computer program, server, terminal, and speech signal processing method
CN113255313B (en) Music generation method, device, electronic equipment and storage medium
US20210335364A1 (en) Computer program, server, terminal, and speech signal processing method
CN117275454A (en) Audio synthesis method and device, electronic equipment and storage medium
JP3133347B2 (en) Prosody control device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant