CN111370024B - Audio adjustment method, device and computer readable storage medium - Google Patents

Audio adjustment method, device and computer readable storage medium

Info

Publication number
CN111370024B
Authority
CN
China
Prior art keywords
audio
pronunciation
user
adjusted
audio data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010107251.5A
Other languages
Chinese (zh)
Other versions
CN111370024A (en)
Inventor
何涛 (He Tao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010107251.5A priority Critical patent/CN111370024B/en
Publication of CN111370024A publication Critical patent/CN111370024A/en
Application granted granted Critical
Publication of CN111370024B publication Critical patent/CN111370024B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G – PHYSICS
    • G10 – MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L – SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 – Speech or voice analysis techniques not restricted to a single one of groups G10L15/00–G10L21/00
    • G10L25/48 – Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 – Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L21/00 – Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 – Changing voice quality, e.g. pitch or formants
    • G10L21/007 – Changing voice quality characterised by the process used

Abstract

The invention provides an audio adjustment method, device and computer readable storage medium; the method comprises the following steps: receiving audio data to be adjusted sent by a terminal; acquiring original-sound audio data corresponding to the audio data to be adjusted from an original-sound audio database; performing pronunciation matching detection on the audio data to be adjusted and the original-sound audio data to obtain a pronunciation difference result, which characterizes the degree of difference in pronunciation between the audio data to be adjusted and the original-sound audio data; and correcting the pronunciation of the audio data to be adjusted using the pronunciation of the original-sound audio data based on the pronunciation difference result, to obtain the adjusted audio. The invention can improve the audio adjustment effect.

Description

Audio adjustment method, device and computer readable storage medium
Technical Field
The present invention relates to speech processing technology, and in particular, to an audio adjustment method, apparatus, and computer readable storage medium.
Background
Most terminals have an audio recording function, through which users can record passages they read aloud or songs they sing, enriching their daily lives. In practical applications, the terminal can send the audio recorded by the user to a server, which corrects or tunes the audio so that it sounds better.
At present, user audio is adjusted mainly by enhancing its sound effects during recording, for example by adding reverberation effects such as those of a recording studio or a concert hall. Such processing only lightly modifies the audio, and because this single type of adjustment is all that is available, the effect of adjusting the user's audio is poor.
Disclosure of Invention
The embodiment of the invention provides an audio adjusting method, audio adjusting equipment and a computer readable storage medium, which can improve the audio adjusting effect.
The technical scheme of the embodiment of the invention is realized as follows:
the embodiment of the invention provides an audio adjusting method, which comprises the following steps:
receiving audio data to be adjusted sent by a terminal; the audio data to be adjusted are audio data recorded by a user;
acquiring the original sound audio data corresponding to the audio data to be adjusted from an original sound audio database;
performing pronunciation matching detection on the audio data to be adjusted and the original sound audio data to obtain a pronunciation difference result; the pronunciation difference result characterizes the difference degree of the audio data to be adjusted and the original sound audio data in pronunciation;
And correcting the pronunciation of the audio data to be adjusted by utilizing the pronunciation of the original sound audio data based on the pronunciation difference result to obtain the adjustment audio.
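The server-side steps above can be illustrated with a minimal Python sketch. The patent does not specify how pronunciation is represented; this sketch assumes the audio has already been converted to phoneme sequences (e.g. by an acoustic model), and all names are illustrative rather than from the patent.

```python
from dataclasses import dataclass

@dataclass
class PronunciationDiff:
    position: int          # index of the differing pronunciation unit
    user_phoneme: str      # what the user produced
    original_phoneme: str  # what the original-sound audio contains

def detect_pronunciation_diffs(user_seq, original_seq):
    """Pronunciation matching detection: compare the user's phoneme
    sequence against the original-sound sequence unit by unit."""
    return [
        PronunciationDiff(i, u, o)
        for i, (u, o) in enumerate(zip(user_seq, original_seq))
        if u != o
    ]

def correct_pronunciation(user_seq, original_seq, diffs):
    """Correct the user's pronunciation using the original-sound
    phoneme at each differing position."""
    corrected = list(user_seq)
    for d in diffs:
        corrected[d.position] = original_seq[d.position]
    return corrected

# The user mispronounced the final phoneme.
user = ["p", "u", "s", "u"]
orig = ["p", "u", "s", "ü"]
diffs = detect_pronunciation_diffs(user, orig)
corrected = correct_pronunciation(user, orig, diffs)
```

The pronunciation difference result here is simply the list of differing positions; a real system would score the degree of difference rather than test exact equality.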
The embodiment of the invention provides an audio adjusting method, which comprises the following steps:
acquiring audio data to be adjusted of a user on an audio recording function interface;
transmitting the audio data to be adjusted to a server so that the server can carry out audio adjustment on the audio data to be adjusted, wherein the adjustment audio is generated by the server based on the audio data to be adjusted;
and receiving and playing the adjustment audio sent by the server, and completing the audio adjustment of the audio data to be adjusted.
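The three terminal-side steps can be sketched as follows; the network and the server's adjustment are stubbed so the sketch runs stand-alone, and every function name is a hypothetical illustration.

```python
def acquire_to_adjust_audio():
    """Step 1: obtain the user's recording from the recording interface."""
    return b"raw-user-recording"  # placeholder for recorded audio bytes

def send_to_server(audio):
    """Step 2: upload to the server; here the 'server' simply returns a
    tagged copy standing in for the real adjusted audio."""
    return b"adjusted:" + audio

def receive_and_play(adjusted):
    """Step 3: receive the adjusted audio and 'play' it; returns True
    once playback of an adjusted recording has started."""
    return adjusted.startswith(b"adjusted:")

audio = acquire_to_adjust_audio()
adjusted = send_to_server(audio)
played = receive_and_play(adjusted)
```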
The embodiment of the invention provides a server, which comprises:
a first memory for storing executable audio adjustment instructions;
and the first processor is used for realizing the audio adjustment method provided by the server side in the embodiment of the invention when executing the executable audio adjustment instruction stored in the first memory.
An embodiment of the present invention provides a terminal, including:
a second memory for storing executable audio adjustment instructions;
and the second processor is used for realizing the audio adjustment method provided by the terminal side of the embodiment of the invention when executing the executable audio adjustment instruction stored in the second memory.
The embodiment of the invention provides a computer readable storage medium, which stores executable audio adjustment instructions for realizing the audio adjustment method provided by the server side of the embodiment of the invention when the first processor is caused to execute, or for realizing the audio adjustment method provided by the terminal side of the embodiment of the invention when the second processor is caused to execute.
The embodiment of the invention has the following beneficial effects:
in the embodiment of the invention, the server can receive the audio data to be adjusted sent by the terminal, then acquire the acoustic audio data corresponding to the audio data to be adjusted from the acoustic audio database, then perform pronunciation comparison on the audio data to be adjusted and the acoustic audio data to obtain a pronunciation difference result, and correct pronunciation of the audio data to be adjusted by pronunciation of the acoustic audio data based on the pronunciation difference result to obtain the adjustment audio. Therefore, the pronunciation of the audio data to be adjusted can be corrected, the type which can be adjusted in the audio is increased, and finally the adjusting effect for the user audio is improved.
Drawings
FIG. 1 is a schematic diagram of an audio adjustment interface in the related art;
FIG. 2 is a schematic diagram of an alternative architecture of an audio adjustment system 100 according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a server 200 according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a terminal 400 according to an embodiment of the present invention;
FIG. 5 is a first timing diagram of an alternative audio adjustment method according to an embodiment of the present invention;
FIG. 6 is a first schematic diagram of adjusting audio according to an embodiment of the present invention;
FIG. 7 is a first flowchart of an alternative audio adjustment method according to an embodiment of the present invention;
FIG. 8 is a second flowchart of an alternative audio adjustment method according to an embodiment of the present invention;
FIG. 9 is a second timing diagram of an alternative audio adjustment method according to an embodiment of the present invention;
FIG. 10 is a second schematic diagram of adjusting audio according to an embodiment of the present invention;
FIG. 11 is a schematic flowchart of beautifying audio performed at the terminal side according to an embodiment of the present invention;
FIG. 12 is a schematic diagram of an interaction flow for beautifying audio provided by an embodiment of the present invention;
FIG. 13 is a schematic diagram of an audio synthesis process according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
In the following description, the terms "first", "second", and the like are merely used to distinguish between similar objects and do not represent a particular ordering of the objects, it being understood that the "first", "second", or the like may be interchanged with one another, if permitted, to enable embodiments of the invention described herein to be practiced otherwise than as illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
Before describing embodiments of the present invention in further detail, the terms and terminology involved in the embodiments of the present invention will be described, and the terms and terminology involved in the embodiments of the present invention will be used in the following explanation.
1) The audio adjustment means that the audio of the user is adjusted in terms of sound effect, tone color, balance and the like, so that the audio of the user is more audible. For example, after a song is recorded by a user, a reverberation effect may be added to the song, the user's timbre may be optimized, etc.
2) The original audio data represents the audio of the original version corresponding to the audio recorded by the user, for example, when the user records the song sung by the user, the original edition of the song is referred to as the original edition of the song.
3) A pronunciation syllable is the phonetic unit in audio that is most easily distinguished by ear. For example, a sentence contains multiple characters, each with its own pronunciation; when listening to Chinese, a listener actually distinguishes the pronunciation of each character, and the pronunciation of one character corresponds to one syllable.
4) Phonemes are the smallest phonetic units, and each pronunciation syllable is made up of one or more phonemes. For example, the Mandarin syllable "pu" can be decomposed into the two phonemes "p" and "u".
5) Audio features refer to musical features in audio, such as the pitch, beat, duration of each tone, intensity of each tone, tune, and the like.
6) Prosody refers to different tones, mood, pause modes, length of pronunciation and the like of a person when speaking or singing, and belongs to prosodic features, in other words, prosody characterizes habits of different persons when speaking or singing.
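Terms 3) to 5) can be illustrated with a toy Python sketch: a hypothetical lexicon decomposes syllables into phonemes, and a trivial extractor computes two of the audio features named above. A real system would use a pronunciation dictionary or grapheme-to-phoneme model, plus full pitch, beat and tune analysis; everything below is a stand-in.

```python
import math

# Illustrative syllable-to-phoneme lexicon (hypothetical entries).
SYLLABLE_PHONEMES = {
    "pu": ["p", "u"],
    "ma": ["m", "a"],
    "zhong": ["zh", "ong"],
}

def decompose(syllables):
    """Flatten a sequence of pronunciation syllables into phonemes."""
    phonemes = []
    for s in syllables:
        phonemes.extend(SYLLABLE_PHONEMES[s])
    return phonemes

def audio_features(samples, sample_rate):
    """Toy feature extraction over PCM samples (floats in [-1, 1]):
    duration in seconds and RMS intensity; pitch, beat and tune
    analysis are omitted."""
    duration = len(samples) / sample_rate
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return {"duration_s": duration, "rms_intensity": rms}

phonemes = decompose(["pu", "zhong"])
# One second of a constant 0.5-amplitude "signal" at 8 kHz.
feats = audio_features([0.5] * 8000, 8000)
```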
Nowadays, most terminals have an audio recording function, through which users can record passages they read aloud or songs they sing, enriching their daily lives. In practical applications, the terminal can send the audio recorded by the user to a server, which adjusts the audio so that it sounds better.
In the related art, there are two main ways of adjusting a user's audio. One is to enhance the sound effect during recording, i.e., to add reverberation effects such as those of a recording studio, theatre or concert hall, so that the user can adjust the audio with different reverberation effects. The other is to adjust the timbre and equalization according to the user's selection after recording is completed, and to eliminate noise picked up during recording, so that the user's audio sounds more graceful. For example, FIG. 1 is a schematic diagram of an audio adjustment interface in the related art. Display area 1-1 shows the progress bar of the audio, with the total duration 03:25 and the current playback point 00:06. Display area 1-2 offers two options: single-sentence edit 1-21 and add video 1-22. Display area 1-3 contains three columns: sound-effect adjustment 1-31, timbre adjustment 1-32 and equalization adjustment 1-33; the user can enter the corresponding functions by tapping these columns. Sound-effect adjustment 1-31 contains a reverberation adjustment 1-311 with eight modes (recording studio, KTV, magnetic, singer, spacious, distant, ethereal and old record) and a voice-change adjustment 1-312 with three modes (original sound, electronic and harmony). Display area 1-4 provides release 1-41, re-record 1-42 and save 1-43 options. After finishing a recording, the user can enter the audio beautification interface of FIG. 1, select the audio segment to be adjusted in display area 1-1, choose in display area 1-2 whether to optimize a single sentence or add video, select an adjustment mode in display area 1-3, apply the adjustment, and finally release, save or re-record the audio through the options in display area 1-4, thereby completing the adjustment of the recorded audio.
However, in the related art, when the audio of the user is adjusted, the audio effect can only be enhanced when the audio is recorded, that is, some simpler adjustments are performed, and after the recording is finished, the adjustment of the audio of the user is mostly focused on the adjustment of the audio effect, the equalization, the tone color, and the like, and the optional adjustment types are still single, so that the effect of adjusting the audio of the user is poor.
Embodiments of the present invention provide an audio adjustment method, apparatus, and computer-readable storage medium, which can improve the adjustment effect for user audio. The following describes an exemplary application of the audio adjustment device provided by the embodiment of the present invention, where the audio adjustment device provided by the embodiment of the present invention may be implemented as various types of user terminals such as a smart phone, a tablet computer, a notebook computer, and the like, and may also be implemented as a server. Next, an exemplary application when the audio adjustment apparatus implements the terminal and the server, respectively, will be described.
Referring to fig. 2, fig. 2 is a schematic diagram of an alternative architecture of the audio adjustment system 100 according to an embodiment of the present invention, in order to support an audio adjustment application, a terminal 400 is connected to a server 200 through a network 300, where the network 300 may be a wide area network or a local area network, or a combination of both, and the server 200 is further configured with an acoustic audio database 500 to store various types of acoustic audio data.
After the user enters the audio recording function interface 410 by operating the terminal 400, the terminal 400 obtains the audio data to be adjusted of the user on the audio recording function interface 410, and then the terminal 400 sends the audio data to be adjusted to the server 200 through the network 300. After receiving the audio data to be adjusted sent by the terminal 400, the server 200 obtains the acoustic audio data corresponding to the audio data to be adjusted from the acoustic audio database 500. Then, the server 200 performs pronunciation matching detection on the audio data to be adjusted and the acoustic audio data to compare the difference degree of the audio data to be adjusted and the acoustic audio data on pronunciation, so as to obtain a pronunciation difference result. Then, the server 200 corrects the pronunciation of the audio data to be adjusted by using the pronunciation of the original audio data based on the pronunciation difference result, to obtain the adjustment audio, and returns the adjustment audio to the terminal 400. The terminal 400 receives the adjustment audio transmitted from the server 200 and plays the adjustment audio, so that the audio adjustment process of the audio data to be adjusted can be completed through the cooperation of the server 200 and the terminal 400.
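The interaction in FIG. 2 can be sketched in-process as follows; the acoustic audio database 500 and the pronunciation comparison are stubbed with plain Python structures, and all identifiers are illustrative assumptions, not from the patent.

```python
# Stand-in for the acoustic audio database 500: id -> original phonemes.
ACOUSTIC_DB = {"song-42": ["p", "u", "s", "ü"]}

def server_adjust(audio_id, user_phonemes):
    """What server 200 does: fetch original-sound data, run pronunciation
    matching detection, then correct the differing positions."""
    original = ACOUSTIC_DB[audio_id]
    diff_positions = [
        i for i, (u, o) in enumerate(zip(user_phonemes, original))
        if u != o
    ]  # the pronunciation difference result
    adjusted = list(user_phonemes)
    for i in diff_positions:
        adjusted[i] = original[i]  # correct using the original pronunciation
    return adjusted

def terminal_flow(audio_id, recorded_phonemes):
    """What terminal 400 does: 'send' the recording, 'receive' the
    adjusted audio (the network hop is elided in this sketch)."""
    return server_adjust(audio_id, recorded_phonemes)

result = terminal_flow("song-42", ["p", "u", "s", "u"])
```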
Referring to fig. 3, fig. 3 is a schematic structural diagram of a server 200 according to an embodiment of the present invention, and the server 200 shown in fig. 3 includes: at least one first processor 210, a first memory 250, at least one first network interface 220, and a first user interface 230. The various components in server 200 are coupled together by a first bus system 240. It is appreciated that the first bus system 240 is used to enable connected communications between these components. The first bus system 240 includes a power bus, a control bus, and a status signal bus in addition to a data bus. But for clarity of illustration the various buses are labeled as first bus system 240 in fig. 3.
The first processor 210 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP, Digital Signal Processor), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components; the general-purpose processor may be a microprocessor or any conventional processor.
The first user interface 230 includes one or more first output devices 231, including one or more speakers and/or one or more visual displays, that enable presentation of media content. The first user interface 230 also includes one or more first input devices 232 including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The first memory 250 includes volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory. The non-volatile memory may be a read-only memory (ROM, Read-Only Memory) and the volatile memory may be a random access memory (RAM, Random Access Memory). The first memory 250 described in embodiments of the present invention is intended to comprise any suitable type of memory. The first memory 250 optionally includes one or more storage devices physically remote from the first processor 210.
In some embodiments, the first memory 250 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
A first operating system 251 including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and handling hardware-based tasks;
a first network communication module 252 for reaching other computing devices via one or more (wired or wireless) first network interfaces 220, the exemplary first network interface 220 comprising: bluetooth, wireless compatibility authentication (Wi-Fi), universal serial bus (USB, universal Serial Bus), and the like;
a first display module 253 for enabling presentation of information (e.g., a user interface for operating peripheral devices and displaying content and information) via one or more first output devices 231 (e.g., a display screen, a speaker, etc.) associated with the first user interface 230;
a first input processing module 254 for detecting one or more user inputs or interactions from one of the one or more first input devices 232 and translating the detected inputs or interactions.
In some embodiments, the audio adjustment device provided in the embodiments of the present invention may be implemented in software, and fig. 3 shows the audio adjustment device 255 stored in the first memory 250, which may be software in the form of a program, a plug-in, or the like, including the following software modules: the first receiving module 2551, the acquiring module 2552, the difference comparing module 2553, the adjusting module 2554 and the first transmitting module 2555 will be described below.
In other embodiments, the audio adjustment device provided by the embodiments of the present invention may be implemented in hardware; by way of example, it may be a processor in the form of a hardware decoding processor that is programmed to perform the audio adjustment method provided by the embodiments of the present invention. For example, the processor in the form of a hardware decoding processor may employ one or more application-specific integrated circuits (ASIC, Application Specific Integrated Circuit), DSPs, programmable logic devices (PLD, Programmable Logic Device), complex programmable logic devices (CPLD, Complex Programmable Logic Device), field-programmable gate arrays (FPGA, Field-Programmable Gate Array), or other electronic components.
Exemplary, an embodiment of the present invention provides a server, including:
a first memory for storing executable audio adjustment instructions;
and the first processor is used for realizing the audio adjustment method provided by the server side in the embodiment of the invention when executing the executable audio adjustment instruction stored in the first memory.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a terminal 400 according to an embodiment of the present invention, and the terminal 400 shown in fig. 4 includes: at least one second processor 410, a second memory 450, at least one second network interface 420, and a second user interface 430. The various components in terminal 400 are coupled together by a second bus system 440. It is appreciated that the second bus system 440 is used to enable connected communication between these components. The second bus system 440 includes a power bus, a control bus, and a status signal bus in addition to the data bus. But for clarity of illustration the various buses are labeled in fig. 4 as a second bus system 440.
The second processor 410 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP, Digital Signal Processor), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components; the general-purpose processor may be a microprocessor or any conventional processor.
The second user interface 430 includes one or more second output devices 431, including one or more speakers and/or one or more visual displays, that enable presentation of the media content. The second user interface 430 also includes one or more second input devices 432, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The second memory 450 includes volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory. The non-volatile memory may be a read-only memory (ROM, Read-Only Memory) and the volatile memory may be a random access memory (RAM, Random Access Memory). The second memory 450 described in embodiments of the present invention is intended to comprise any suitable type of memory. The second memory 450 optionally includes one or more storage devices physically remote from the second processor 410.
In some embodiments, the second memory 450 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
A second operating system 451 including system programs, such as a framework layer, a core library layer, a driver layer, etc., for handling various basic system services and performing hardware-related tasks, for implementing various basic services and handling hardware-based tasks;
a second network communication module 452 for reaching other computing devices via one or more (wired or wireless) second network interfaces 420, the exemplary second network interface 420 comprising: bluetooth, wireless compatibility authentication (Wi-Fi), universal serial bus (USB, universal Serial Bus), and the like;
a second display module 453 for enabling presentation of information (e.g., a user interface for operating peripheral devices and displaying content and information) via one or more second output devices 431 (e.g., a display screen, speakers, etc.) associated with the second user interface 430;
a second input processing module 454 for detecting one or more user inputs or interactions from one of the one or more second input devices 432 and translating the detected inputs or interactions.
In some embodiments, the audio playing device provided in the embodiments of the present invention may be implemented in software, and fig. 4 shows the audio playing device 455 stored in the second memory 450, which may be software in the form of a program, a plug-in, or the like, including the following software modules: the acquisition module 4551, the second transmission module 4552 and the second reception module 4553, the functions of each of which will be described below.
In other embodiments, the audio playing device provided by the embodiments of the present invention may be implemented in hardware; by way of example, it may be a processor in the form of a hardware decoding processor that is programmed to perform the audio adjustment method provided by the embodiments of the present invention. For example, the processor in the form of a hardware decoding processor may employ one or more application-specific integrated circuits (ASIC, Application Specific Integrated Circuit), DSPs, programmable logic devices (PLD, Programmable Logic Device), complex programmable logic devices (CPLD, Complex Programmable Logic Device), field-programmable gate arrays (FPGA, Field-Programmable Gate Array), or other electronic components.
An exemplary embodiment of the present invention provides a terminal, including:
a second memory for storing executable audio adjustment instructions;
and the second processor is used for realizing the audio adjustment method provided by the terminal side of the embodiment of the invention when executing the executable audio adjustment instruction stored in the second memory.
The audio adjustment method provided by the embodiment of the present invention will be described below in conjunction with exemplary applications and implementations of the server and the terminal provided by the embodiment of the present invention.
Referring to fig. 5, fig. 5 is an alternative timing diagram of an audio adjustment method according to an embodiment of the present invention, and will be described with reference to the steps shown in fig. 5.
S101, the terminal acquires audio data to be adjusted of a user.
The embodiment of the present invention applies to scenarios in which a user's audio data needs to be adjusted, for example, correcting the pronunciation of short English passages read aloud by the user, or adjusting the pronunciation of Cantonese songs sung by the user. In addition, the user's audio data can be adjusted while it is being recorded and sent, or processed in one pass after recording is completed. When the user wakes the terminal and enters the audio recording function interface, the audio adjustment function provided by that interface can be triggered by a click or similar operation. The terminal can then acquire the user audio data that needs audio adjustment, i.e., the user's audio data to be adjusted.
In some embodiments of the present invention, the server also performs audio adjustment on the audio data to be adjusted after the user turns on the audio adjustment function by clicking or the like. At this time, the terminal may receive an audio adjustment instruction triggered by the user on the audio recording interface, and send the audio adjustment instruction and the audio data to be adjusted to the server together.
It should be noted that, in the embodiment of the present invention, the audio data to be adjusted may be selected by the user from the audio data stored in the terminal, that is, specified from audio data that has already been recorded, or it may be audio data recorded by the user at the current time, that is, audio data recorded in real time.
It can be understood that, before sending the audio data to be adjusted, the terminal may first extract the identification information of the acoustic audio data corresponding to the audio data to be adjusted and send it to the server together with the audio data to be adjusted, so that the server can determine which acoustic audio data the audio data to be adjusted corresponds to. For example, when the audio data to be adjusted is a song recorded by the user in real time, the user will have specified on the terminal which song is to be recorded, that is, the identification information of the song to be recorded; the terminal may then extract this identification information and send it to the server together with the audio data to be adjusted, so that the server knows which song the user wants to sing and adjust.
In some embodiments of the present invention, the audio adjustment instruction may carry user identification information, so that when the server obtains the audio adjustment instruction it is clear whose audio is being adjusted, and the server can store the adjusted audio of different users correspondingly.
It can be understood that the terminal may receive an audio adjustment instruction triggered by the user through a touch or click operation on the audio recording interface, or through a voice instruction of the user, which is not limited herein.
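As a rough illustration of what the terminal might send, the sketch below assembles a request carrying the three pieces of information the text mentions: the user identifier, the original-audio identifier, and the audio data to be adjusted. The function and field names are assumptions for illustration, not taken from the patent.

```python
def build_adjustment_request(user_id, original_audio_id, audio_chunk):
    """Assemble the payload the terminal sends with the audio adjustment
    instruction. Field names are illustrative; the text only requires
    that the request carry the user identifier (so the server can store
    adjusted audio per user), the original-audio identifier, and the
    audio data to be adjusted."""
    return {
        "user_id": user_id,                      # whose audio is being adjusted
        "original_audio_id": original_audio_id,  # which acoustic track to compare against
        "audio": audio_chunk,                    # recorded bytes or one small section
    }

req = build_adjustment_request("user-42", "song-XXXX", b"\x00\x01")
print(req["original_audio_id"])  # song-XXXX
```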
S102, the terminal sends the audio data to be adjusted to the server so that the server can carry out audio adjustment on the audio data to be adjusted, wherein the adjustment audio is generated by the server based on the audio data to be adjusted.
After the terminal obtains the audio data to be adjusted and the original audio identifier, it sends the audio data to be adjusted to the server through the network, and the server receives the audio data to be adjusted sent by the terminal. The audio data to be adjusted represents the audio data recorded by the user, and the adjustment audio is generated by the server based on the audio data to be adjusted recorded by the user.
It can be understood that the terminal may send the audio data to be adjusted all at once, that is, send it after recording is completed; or it may send each small section immediately after that section is recorded, then record and send the next small section, that is, send the audio data to be adjusted while recording. This is not limited in the embodiment of the present invention.
S103, the server acquires the original sound audio data corresponding to the audio data to be adjusted from the original sound audio database.
After the server obtains the audio adjustment instruction, it determines that audio adjustment needs to be performed on the audio data to be adjusted. For example, the server may compare the received original audio identifier with each audio identifier in the acoustic audio database and extract the audio data corresponding to the identical audio identifier; the extracted audio data is the acoustic audio data corresponding to the audio data to be adjusted.
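The identifier lookup just described can be sketched as a simple scan over the acoustic audio database; the dictionary below merely stands in for that database, and the function name is an assumption for illustration.

```python
def find_original_audio(acoustic_db, original_audio_id):
    """Look up the acoustic audio whose identifier is identical to the
    received original audio identifier (S103). `acoustic_db` maps audio
    identifiers to audio data and stands in for the acoustic audio
    database described in the text. Returns None when no entry matches."""
    for audio_id, audio_data in acoustic_db.items():
        if audio_id == original_audio_id:
            return audio_data
    return None

db = {"song-A": b"acoustic-a", "song-B": b"acoustic-b"}
print(find_original_audio(db, "song-B"))  # b'acoustic-b'
```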
It is understood that the acoustic audio data refers to original audio that has not been changed by other personnel after being released by its author. It should be noted that the author of the acoustic audio data may be the original author or a cover author; in other words, the user may record his own audio on the basis of the audio of the original author or of a cover author. For example, when the user covers the original version of the Cantonese song XXXX, the corresponding acoustic audio data is the XXXX performed by the original singer; when the user covers a cover version of the Cantonese song XXXX, the acoustic audio data is the XXXX performed by that cover singer.
S104, the server performs pronunciation matching detection on the audio data to be adjusted and the original sound audio data to obtain a pronunciation difference result; the pronunciation difference result characterizes the difference degree of the audio data to be adjusted and the original sound audio data in pronunciation.
After obtaining the original sound audio data corresponding to the audio data to be adjusted, the server compares the pronunciation of the original sound audio data with that of the audio data to be adjusted, so that the inaccurately pronounced parts of the audio data to be adjusted can be found and corrected subsequently.
In some embodiments of the present invention, the server may decompose the audio data to be adjusted into a plurality of pronunciation syllables, then decompose the acoustic audio data into a plurality of syllables, and compare the pronunciation syllable decomposed from the audio data to be adjusted with the pronunciation syllable decomposed from the acoustic audio data, thereby determining a portion of the audio data to be adjusted that has inaccurate pronunciation.
And S105, the server corrects the pronunciation of the audio data to be adjusted by using the pronunciation of the original sound audio data based on the pronunciation difference result, and the adjustment audio is obtained.
The server judges, according to the obtained pronunciation difference result, whether the pronunciation of the original sound audio data and the pronunciation of the audio data to be adjusted differ too much. When the pronunciation difference is large, the server corrects the pronunciation of the audio data to be adjusted using the pronunciation of the original sound audio data and takes the corrected audio as the adjustment audio. When the pronunciation difference is small, the audio data to be adjusted does not need pronunciation correction, and the server can directly take the audio data to be adjusted as the adjustment audio.
In the embodiment of the invention, when the server corrects the pronunciation of the audio data to be adjusted using the pronunciation of the original audio data, it extracts from the original audio data the pronunciation syllables of the parts with larger pronunciation differences, replaces the corresponding pronunciation syllables to be corrected in the audio data to be adjusted with these syllables, and then re-synthesizes the audio from the replaced syllables and the remaining pronunciation syllables. The user's timbre is still used when the audio is synthesized, so pronunciation correction is performed on the audio data to be adjusted while the user's timbre is preserved, yielding the adjustment audio.
For example, when the data to be adjusted is an English song recorded by the user, the server may compare the syllables of the original song with the syllables of the English song recorded by the user. When it determines that the inaccurately pronounced part of the user's recording is "heart" in "my heart will go on and on", the server extracts the syllable "heart" of the corresponding part from the original song, replaces the inaccurately pronounced syllable with the extracted syllable, and synthesizes the result with the other syllables to obtain the final adjustment audio.
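The replace-and-resynthesize step can be sketched as follows. Real syllables would be audio fragments; plain strings stand in for them here, and the function name and index-set argument are assumptions for illustration.

```python
def correct_pronunciation(user_syllables, standard_syllables, mismatched):
    """Replace each mismatched user pronunciation syllable with the
    corresponding standard syllable from the acoustic audio, keep every
    other syllable of the user's own (preserving the user's timbre),
    and concatenate the result back into one sequence (S105).

    `mismatched` is a set of syllable indices where the pronunciation
    difference result indicated a difference."""
    corrected = []
    for i, syllable in enumerate(user_syllables):
        if i in mismatched:
            corrected.append(standard_syllables[i])  # take the original singer's syllable
        else:
            corrected.append(syllable)               # keep the user's own syllable
    return corrected

user = ["my", "hart", "will", "go", "on"]
std  = ["my", "heart", "will", "go", "on"]
print(correct_pronunciation(user, std, {1}))  # ['my', 'heart', 'will', 'go', 'on']
```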
S106, the terminal receives and plays the adjustment audio sent by the server.
After synthesizing the adjustment audio, the server returns it to the terminal for playback, completing the audio adjustment of the audio data to be adjusted. After receiving the adjustment audio sent by the server, the terminal calls the playback control to play the adjustment audio so that the user can hear it; thus the audio adjustment process for the audio data to be adjusted is completed.
It can be understood that when the terminal sends the audio data to be adjusted to the server all at once, that is, when adjustment after recording is implemented, the server can also send the adjustment audio back to the terminal all at once, so that the terminal can play the adjustment audio continuously and the user can grasp its overall effect. When the terminal sends each small section immediately after it is recorded, that is, when adjusting while recording, the server can quickly adjust each small section and return it, so that the terminal can play the corresponding part of the adjustment audio in real time, and the user can adjust pronunciation, volume and other aspects of the recording according to the adjustment audio in real time.
In an exemplary embodiment, as shown in fig. 6, when the audio data to be adjusted is a Cantonese song "XXXX" recorded by a user in real time, the terminal takes a sentence of lyrics as a unit: after a sentence of lyrics is recorded, for example a line containing the words "cold eyes" and "jeers", the terminal sends the segment corresponding to that sentence of lyrics to the server. After receiving the segment, the server compares the syllables of the part sung by the user with the syllables of the same part sung by the original singer, determines that the pronunciation of "cold eyes" and "jeers" is inaccurate, corrects the inaccurate pronunciation using the original singer's pronunciation of "cold eyes" and "jeers", and returns the corrected result to the terminal. After receiving the corrected result, the terminal plays it to the user through the earphone and displays "cold eyes" and "jeers" in bold in the lyrics display area 6-1 to indicate to the user that the two inaccurately pronounced words have been adjusted.
In the embodiment of the invention, the server can receive the audio data to be adjusted sent by the terminal, acquire the acoustic audio data corresponding to it from the acoustic audio database, perform pronunciation comparison between the audio data to be adjusted and the acoustic audio data to obtain a pronunciation difference result, and correct the pronunciation of the audio data to be adjusted with the pronunciation of the acoustic audio data based on the pronunciation difference result to obtain the adjustment audio. In this way, the pronunciation of the audio data to be adjusted can be corrected, the types of adjustment that can be applied to audio are increased, and the adjustment effect for the user's audio is ultimately improved.
In some embodiments of the present invention, the server performs pronunciation matching detection on the audio data to be adjusted and the acoustic audio data to obtain a pronunciation difference result, that is, a specific implementation process of S104 may include: s1041 or S1042 as follows:
S1041, the server takes the audio data to be adjusted as one segment to be adjusted and directly performs pronunciation matching detection with the original audio data to obtain a pronunciation difference result.
The server can directly take the audio data to be adjusted as one complete segment to be adjusted and perform pronunciation matching detection against the complete original sound audio data; the obtained result is the pronunciation difference result. In this case, the server performs pronunciation detection on the audio data to be adjusted as a whole, so that the voice information in the audio data to be adjusted, and the information other than the voice information, can be retained to the greatest extent, which facilitates pronunciation matching detection.
S1042, the server segments the audio data to be adjusted and the original audio data, and then performs pronunciation matching detection to obtain a pronunciation difference result.
The server can segment the audio data to be adjusted and the acoustic audio data, and performs pronunciation matching detection based on the segment to be adjusted and the acoustic segment obtained after the segmentation operation, at this time, the granularity of pronunciation matching detection is smaller, so that the obtained pronunciation difference result is more accurate, and the audio adjustment effect is further improved.
It should be noted that S1041 and S1042 are two optional implementations of S104; the implementation of S104 may be selected according to the actual situation, which is not limited herein.
In the embodiment of the invention, the server can perform pronunciation matching detection on the audio data to be adjusted as a whole, so as to retain the various sound information in it to the greatest extent and facilitate pronunciation matching detection; or it can segment the audio data to be adjusted before pronunciation matching detection, so that the granularity of detection is smaller and the accuracy of the obtained pronunciation difference result is higher.
In some embodiments of the present invention, after segmenting the audio data to be adjusted and the acoustic audio data, the server performs pronunciation matching detection to obtain a pronunciation difference result, that is, a specific implementation process of S1042, may include: s1042a to S1042c are as follows:
S1042a, the server divides the acoustic audio data into a plurality of acoustic segments.
After the server acquires the acoustic audio data, in order to facilitate pronunciation comparison between the acoustic audio data and the audio data to be adjusted, the server may perform paragraph analysis on the acoustic audio data first, so as to determine at which time points the acoustic audio data may be divided into a plurality of acoustic segments, and then divide the acoustic audio data into a plurality of acoustic segments according to the time points.
It should be noted that audio data may contain blank parts without a human voice, such as pauses while reading, breaths, or the prelude of a song. Correspondingly, if the user records audio data to be adjusted against such audio data, these parts will contain no voice of the user, and there is no need for the server to perform pronunciation comparison on them. The server may therefore perform paragraph analysis on the acoustic audio data according to the pause parts, breath parts, prelude and the like, divide it into a plurality of acoustic segments, and perform pronunciation comparison only on the acoustic segments that contain a human voice.
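A toy stand-in for this paragraph analysis is to split a waveform wherever a sufficiently long run of low-energy samples appears (pauses, breaths, preludes). A real system would use frame energy or a voice activity detector; the threshold and gap values below are assumptions for illustration.

```python
def split_on_silence(samples, threshold=0.05, min_gap=3):
    """Split a sequence of amplitude values into voiced segments,
    treating runs of at least `min_gap` consecutive low-energy samples
    as boundaries (S1042a). Returns half-open (start, end) index pairs,
    one per voiced acoustic segment."""
    segments, start, quiet = [], None, 0
    for i, s in enumerate(samples):
        if abs(s) >= threshold:          # voiced sample
            if start is None:
                start = i
            quiet = 0
        elif start is not None:          # silent sample inside a segment
            quiet += 1
            if quiet >= min_gap:         # pause long enough: close the segment
                segments.append((start, i - quiet + 1))
                start, quiet = None, 0
    if start is not None:                # segment running to the end
        segments.append((start, len(samples) - quiet))
    return segments

print(split_on_silence([0, 0, 0.5, 0.6, 0.5, 0, 0, 0, 0.4, 0.4, 0]))  # [(2, 5), (8, 10)]
```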
S1042b, the server segments the audio data to be adjusted into a plurality of segments to be adjusted by utilizing a plurality of segment times corresponding to the plurality of acoustic segments.
After obtaining the plurality of acoustic segments, the server can first determine, using the segmentation time corresponding to each acoustic segment, a rough time point at which the audio to be adjusted needs to be cut; then identify whether a voice exists near the rough time point and, by reading which word in the acoustic audio data the voice corresponds to, adjust the rough time point accordingly to obtain a segmentation time point; and finally cut the audio data to be adjusted into a plurality of segments to be adjusted using the segmentation time points. Correspondingly, the process in which the server performs pronunciation comparison between the audio data to be adjusted and the original sound audio data to obtain a pronunciation difference result becomes performing pronunciation comparison between the plurality of segments to be adjusted and the plurality of acoustic segments to obtain the pronunciation difference result.
It can be understood that, when the user records the audio data to be adjusted, there may be some gap between the user's speaking rhythm, or the speed of the song's beat, and that of the original audio data; directly segmenting the audio data to be adjusted using the segmentation times corresponding to the acoustic segments may therefore cause loss of the voice information of the segments to be adjusted.
Of course, for audio data to be adjusted that has no gap with the speaking rhythm or song-beat speed of the original sound audio data, the server can directly cut the audio data to be adjusted using only the segmentation times of the acoustic segments to obtain the segments to be adjusted.
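For the no-gap case just described, the direct cut amounts to slicing the user's recording at the same boundaries as the acoustic segments. The sketch below assumes boundaries are given as sample-index pairs, as in the earlier segmentation discussion.

```python
def cut_at_segment_times(user_samples, acoustic_segments):
    """Cut the user's recording into segments to be adjusted using only
    the segmentation times of the acoustic segments (S1042b), assuming
    the user follows the original rhythm closely. `acoustic_segments`
    holds half-open (start, end) sample indices."""
    return [user_samples[start:end] for start, end in acoustic_segments]

user = list(range(10))  # stand-in for recorded samples
print(cut_at_segment_times(user, [(0, 4), (6, 9)]))  # [[0, 1, 2, 3], [6, 7, 8]]
```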
S1042c, the server performs pronunciation matching detection on each fragment to be adjusted and each corresponding acoustic fragment to obtain a pronunciation difference result.
After the server completes the segmentation of the audio data to be adjusted, the user pronunciation syllable of each segment to be adjusted can be matched with the standard pronunciation syllable of the corresponding original sound segment, and the matching result is a pronunciation difference result.
In the embodiment of the invention, the server can first perform paragraph analysis on the original sound audio data and divide it into a plurality of acoustic segments, and then use the segmentation times corresponding to the acoustic segments to cut the audio data to be adjusted into a plurality of segments to be adjusted, so that subsequent pronunciation comparison can be performed based on the acoustic segments and the segments to be adjusted, which improves the accuracy of the pronunciation difference result.
Referring to fig. 7, fig. 7 is a schematic flow chart of an alternative audio adjustment method provided by an embodiment of the present invention, in some embodiments of the present invention, a server performs pronunciation match detection on each segment to be adjusted and each corresponding acoustic segment to obtain a pronunciation difference result, that is, a specific implementation process of S1042c may include: s201 to S203, as follows:
S201, the server extracts at least one user pronunciation syllable from each segment to be adjusted, and extracts at least one standard pronunciation syllable from each acoustic segment.
When the server performs pronunciation comparison on the to-be-adjusted fragments and the original sound fragments, syllable disassembly is required to be performed on the to-be-adjusted fragments and the original sound fragments respectively, so that one or more user pronunciation syllables corresponding to each to-be-adjusted fragment are obtained, and one or more standard pronunciation syllables corresponding to each original sound fragment are obtained.
It will be appreciated that a user pronunciation syllable refers to the user's pronunciation of each word in the audio data to be adjusted, while a standard pronunciation syllable refers to the pronunciation of each word by the original author of the acoustic audio data. Because different people have different accent habits, sounding modes and the like, different people pronounce the same word differently, and the pronunciation can be embodied in pronunciation syllables; therefore the server can determine the pronunciation difference between the audio data to be adjusted and the acoustic audio data by extracting the user pronunciation syllables and the standard pronunciation syllables.
S202, the server determines a corresponding standard syllable for each user pronunciation syllable of at least one user pronunciation syllable from at least one standard pronunciation syllable.
Because the duration of the audio data to be adjusted recorded by the user is relatively close to that of the original audio data, and the order of the words in the audio data to be adjusted is very similar to the order of the words in the original audio data, the server can determine the corresponding standard syllable for each user pronunciation syllable from the at least one standard pronunciation syllable by combining the time point of each word with its position in the word order. In other words, in this step the server puts all the standard pronunciation syllables into one-to-one correspondence with all the user pronunciation syllables.
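The time-and-order matching can be sketched with a nearest-start-time pairing, which is one simple stand-in for the correspondence step; the tuple layout `(start_time, label)` is an assumption for illustration.

```python
def align_syllables(user_syllables, standard_syllables):
    """Pair each user pronunciation syllable with a standard syllable
    (S202). Each syllable is a (start_time, label) tuple; since the two
    recordings have similar duration and word order, the standard
    syllable with the nearest start time is taken as the counterpart."""
    pairs = []
    for u_time, u_label in user_syllables:
        nearest = min(standard_syllables, key=lambda s: abs(s[0] - u_time))
        pairs.append((u_label, nearest[1]))
    return pairs

user = [(0.0, "ma"), (0.52, "hat"), (1.1, "go")]
std  = [(0.0, "my"), (0.5, "heart"), (1.0, "go")]
print(align_syllables(user, std))  # [('ma', 'my'), ('hat', 'heart'), ('go', 'go')]
```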
S203, the server matches each user pronunciation syllable with the corresponding standard syllable to obtain a pronunciation difference result.
After obtaining the standard syllable corresponding to each user pronunciation syllable, the server compares each user pronunciation syllable with its corresponding standard syllable to judge whether they differ. When the user pronunciation syllable differs from the standard syllable, the server considers that the user pronunciation syllable needs to be corrected; when it does not, the server considers the user pronunciation syllable relatively standard and in no need of correction.
It should be noted that, because the server obtains one pronunciation difference result for each user pronunciation syllable, in the embodiment of the present invention the server obtains as many pronunciation difference results as there are user pronunciation syllables. For example, when there are 10 user pronunciation syllables, the server will get 10 pronunciation difference results.
It will be appreciated that the pronunciation syllables may be broken down into phonemes and that in some embodiments of the present invention the server may determine if the user pronunciation syllables differ from the standard syllables by a comparison between the phonemes.
In the embodiment of the invention, the server can extract the user pronunciation syllables from the segments to be adjusted and the standard pronunciation syllables from the acoustic segments, then put the user pronunciation syllables into one-to-one correspondence with the standard pronunciation syllables, and finally compare each user pronunciation syllable with its corresponding standard syllable to obtain the final pronunciation difference result. In this way, the server determines on a syllable basis whether the user's pronunciation needs to be corrected.
In some embodiments of the present invention, the server compares each syllable of the user with its corresponding standard syllable to obtain a pronunciation difference result, that is, the specific implementation process of S203 may include: s2031 to S2035 are as follows:
S2031, the server performs phoneme decomposition on each user pronunciation syllable to obtain the first phonemes corresponding to each user pronunciation syllable, and performs phoneme decomposition on the standard syllable corresponding to each user pronunciation syllable to obtain the second phonemes corresponding to the standard syllable.
Because a syllable may be composed of one or more phonemes, the server decomposes each user pronunciation syllable into one or more first phonemes; similarly, the server decomposes the standard syllable corresponding to each user pronunciation syllable into one or more second phonemes.
It will be appreciated that, since there is a correspondence between the user pronunciation syllable and the standard syllable, there is also a correspondence between the first phoneme decomposed from the user pronunciation syllable and the second phoneme decomposed from the standard syllable.
S2032, the server matches the pronunciation of the first phoneme with the pronunciation of the second phoneme to obtain a pronunciation matching result.
Since the difference between pronunciation syllables is also determined by the pronunciation of their phonemes, the server can match the pronunciation of each first phoneme with the pronunciation of the corresponding second phoneme one by one; the server thus obtains either a pronunciation matching result indicating that the pronunciation of the first phoneme matches that of the second phoneme, or one indicating that it does not.
In some embodiments of the present invention, the server may determine whether the pronunciation of the first phoneme and the pronunciation of the second phoneme match by observing whether the sound waveform of the first phoneme is identical or similar to the sound waveform of the second phoneme.
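One simple way to judge whether two sound waveforms are "identical or similar" is their cosine similarity; the sketch below uses that, with an assumed 0.9 cutoff that does not come from the patent. Real systems would compare spectral features rather than raw samples.

```python
import math

def waveform_similarity(a, b):
    """Cosine similarity between two equal-length sound waveforms, a
    stand-in for judging whether the pronunciation of a first phoneme
    matches that of a second phoneme (S2032). Returns a value in
    [-1, 1]; 1 means the waveforms have identical shape."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def pronunciations_match(a, b, threshold=0.9):
    """Assumed decision rule: waveforms count as matching when their
    cosine similarity reaches the (illustrative) threshold."""
    return waveform_similarity(a, b) >= threshold

w1 = [0.0, 0.5, 1.0, 0.5, 0.0]
w2 = [0.0, 0.4, 1.0, 0.6, 0.0]   # similar shape
w3 = [1.0, 0.0, -1.0, 0.0, 1.0]  # very different shape
print(pronunciations_match(w1, w2), pronunciations_match(w1, w3))  # True False
```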
S2033, the server extracts the duration of the first phoneme to obtain a first duration, and extracts the duration of the second phoneme to obtain a second duration.
And S2034, the server obtains a time comparison result according to the first duration and the second duration.
Since the difference between pronunciation syllables can also be manifested in the duration of their phonemes, the server can extract the duration of each first phoneme to obtain the first duration, and extract the duration of each second phoneme to obtain the second duration. The server then pairs the first durations with the second durations according to the correspondence between the first phonemes and the second phonemes, and compares the difference between each first duration and its second duration against a preset time threshold. When the difference is greater than or equal to the preset time threshold, the time comparison result is that the first duration and the second duration differ greatly; when the difference is smaller than the preset time threshold, the time comparison result is that the first duration and the second duration are close.
Since the acoustic waveforms of the different phones are different, in some embodiments of the present invention, the server may extract the duration of each first phone based on the acoustic waveform of the first phone, i.e., obtain the first duration, and similarly, the server extracts the second duration of each second phone based on the acoustic waveform of the second phone.
It can be understood that, in the embodiment of the present invention, the preset time threshold may be set according to actual situations, and the embodiment of the present invention is not limited herein. For example, the preset time threshold may be set to 2ms or 5ms.
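The duration check of S2033/S2034 reduces to comparing the gap between two durations against the preset time threshold; the sketch below uses 5 ms, one of the example values in the text, expressed in seconds.

```python
def compare_durations(first_duration, second_duration, preset_threshold=0.005):
    """Time comparison result of S2034: the first duration and second
    duration (in seconds) match when their difference stays below the
    preset time threshold (5 ms here, one of the example values); a
    difference at or above the threshold means they differ greatly."""
    return abs(first_duration - second_duration) < preset_threshold

print(compare_durations(0.120, 0.123))  # True  (3 ms apart)
print(compare_durations(0.120, 0.140))  # False (20 ms apart)
```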
S2035, the server determines a pronunciation difference result according to the time comparison result and the pronunciation matching result.
After obtaining the time comparison result of the first phoneme and the second phoneme and the pronunciation matching result, the server can synthesize the time comparison result and the pronunciation matching result to obtain a final pronunciation difference result.
In the embodiment of the invention, the server can decompose the user pronunciation syllable into first phonemes and the standard syllable into second phonemes, then match whether the pronunciation of the first phonemes and the pronunciation of the second phonemes are similar, and compare whether the duration of the first phonemes is close to that of the second phonemes, so as to determine the pronunciation difference result between the user pronunciation syllable and the standard syllable according to the time comparison result and the pronunciation matching result. In this way, the server completes the determination of the pronunciation difference result, so that the pronunciation of the audio data to be adjusted can be corrected according to it.
In some embodiments of the present invention, the server determines, according to the time comparison result and the pronunciation matching result, a pronunciation difference result, that is, a specific implementation process of S2035, may include: s2035a-S2035d are as follows:
S2035a, when the time comparison result indicates that the first duration and the second duration match, and the phoneme matching result indicates that the pronunciation of the first phoneme matches the pronunciation of the second phoneme, the pronunciation difference result indicates that the user pronunciation syllable is not different from the standard syllable corresponding to the user pronunciation syllable.
When the first duration and the second duration match, that is, the difference between them is smaller than the preset time threshold, and the pronunciation of the first phoneme is the same as or similar to that of the second phoneme, the server obtains a pronunciation difference result indicating that the user pronunciation syllable does not differ from the standard syllable.
S2035b, when the time comparison result indicates that the first duration and the second duration are not matched, and the phoneme matching result indicates that the pronunciation of the first phoneme is matched with the pronunciation of the second phoneme, the pronunciation difference result indicates that the pronunciation syllable of the user is different from the corresponding standard syllable.
When the first duration and the second duration do not match, that is, the difference between them reaches the preset time threshold, the server obtains a pronunciation difference result indicating that the user pronunciation syllable differs from the standard syllable, even if the pronunciation of the first phoneme is the same as or similar to that of the second phoneme.
S2035c, when the time comparison result indicates that the first duration and the second duration are matched, and the phoneme matching result indicates that the pronunciation of the first phoneme and the pronunciation of the second phoneme are not matched, the pronunciation difference result indicates that the syllable of the pronunciation of the user is different from the corresponding standard syllable.
When the first duration and the second duration match but the difference between the pronunciation of the first phoneme and that of the second phoneme is too great, the server also obtains a pronunciation difference result indicating that the user pronunciation syllable differs from the standard syllable.
S2035d, when the time comparison result indicates that the first duration and the second duration are not matched, and the phoneme match result indicates that the pronunciation of the first phoneme and the pronunciation of the second phoneme are not matched, the pronunciation difference result indicates that the syllable of the pronunciation of the user is different from the corresponding standard syllable.
When the first duration and the second duration do not match and the difference between the pronunciation of the first phoneme and that of the second phoneme is too great, that is, the first phoneme and the second phoneme are not similar, the server necessarily obtains a pronunciation difference result indicating that the user pronunciation syllable differs from the standard syllable.
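The four cases S2035a through S2035d collapse to a single rule, sketched below: the user pronunciation syllable differs from its standard syllable unless BOTH the time comparison result and the phoneme pronunciation result match.

```python
def pronunciation_differs(durations_match, pronunciations_match):
    """Combine the time comparison result and the pronunciation matching
    result (S2035a-S2035d): only when both match (S2035a) is there no
    difference; in the other three cases the syllable differs."""
    return not (durations_match and pronunciations_match)

# The four cases from the text:
print(pronunciation_differs(True, True))    # False (S2035a: no difference)
print(pronunciation_differs(False, True))   # True  (S2035b)
print(pronunciation_differs(True, False))   # True  (S2035c)
print(pronunciation_differs(False, False))  # True  (S2035d)
```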
In the embodiment of the invention, the server can combine the various cases of the time comparison result with the various cases of the pronunciation matching result to obtain the pronunciation difference result between the user pronunciation syllable and the standard syllable, which facilitates subsequent pronunciation correction of the audio data to be adjusted based on the pronunciation difference result.
Referring to fig. 8, fig. 8 is a second flowchart of an alternative audio adjustment method provided by the embodiment of the present invention, in some embodiments of the present invention, a server corrects a pronunciation of audio data to be adjusted by using a pronunciation of original audio data to obtain an adjustment audio, that is, a specific implementation process of S105 may include: s1051 to S1054, as follows:
S1051, the server selects, from the at least one user pronunciation syllable of the audio data to be adjusted, the syllables to be corrected whose pronunciation difference results indicate a difference.
Because each user pronunciation syllable has a corresponding pronunciation difference result indicating whether it differs from the corresponding standard pronunciation syllable, the server only needs to read the pronunciation difference results one by one and select the user pronunciation syllables whose results indicate a pronunciation difference, thereby obtaining the syllables to be corrected.
It should be noted that, because more than one pronunciation difference result may indicate a difference between a user pronunciation syllable and its standard pronunciation syllable, the syllables to be corrected selected by the server are not a single syllable but the set of all user pronunciation syllables that need correction.
S1052, the server extracts standard syllables corresponding to the syllables to be corrected from at least one standard pronunciation syllable.
After obtaining the syllables to be corrected, the server determines, according to the correspondence between user pronunciation syllables and standard pronunciation syllables, the standard pronunciation syllable corresponding to each syllable to be corrected from the at least one standard pronunciation syllable, and extracts these syllables to form a set, obtaining the standard syllables.
It will be appreciated that the number of standard pronunciation syllables in the standard syllables is the same as the number of user pronunciation syllables in the syllables to be corrected.
S1053, replacing syllables to be corrected by standard syllables by the server to obtain corrected syllables.
The server replaces each user pronunciation syllable in the syllables to be corrected with the corresponding standard pronunciation syllable in the standard syllables, obtaining a corrected user pronunciation syllable for each; these corrected user pronunciation syllables form a set, the corrected syllables.
S1054, the server synthesizes the adjustment audio by using the correction syllable and syllables except the syllable to be corrected in at least one syllable of the user pronunciation.
After obtaining the corrected syllables, the server re-synthesizes audio using each corrected user pronunciation syllable in the corrected syllables together with the syllables in the at least one user pronunciation syllable other than the syllables to be corrected, namely the user pronunciation syllables that do not differ from the standard pronunciation syllables; the synthesized audio is the adjusted audio.
It will be appreciated that the server still uses the user's timbre when re-synthesizing the adjusted audio, so the synthesized adjusted audio carries the user's timbre; that is, the adjusted audio is audio specific to the user.
In the embodiment of the invention, the server can select, from the at least one user pronunciation syllable in the audio data to be adjusted, the syllables to be corrected whose pronunciation difference results indicate a pronunciation difference, namely the user pronunciation syllables whose pronunciation is not standard; then extract the corresponding standard syllables for the syllables to be corrected from the at least one standard pronunciation syllable; then replace the syllables to be corrected with the standard syllables to obtain the corrected syllables; and finally synthesize the adjusted audio using the corrected syllables and the syllables in the original at least one user pronunciation syllable other than the syllables to be corrected. Thus, the server can correct the user pronunciation syllables with nonstandard pronunciation using the standard pronunciation syllables to obtain the adjusted audio.
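The selection-and-replacement steps S1051 to S1054 can be sketched as a simple aligned substitution. This is a minimal illustration under the assumption that the user syllables, standard syllables, and difference flags are already aligned one-to-one; syllables are represented as opaque strings.

```python
def correct_syllables(user_syllables, standard_syllables, difference_results):
    """Replace each user syllable flagged as different by its aligned
    standard syllable (S1051-S1053); unflagged syllables are kept so the
    user's own pronunciation survives where it already matches (S1054)."""
    corrected = []
    for user_syl, std_syl, differs in zip(user_syllables,
                                          standard_syllables,
                                          difference_results):
        corrected.append(std_syl if differs else user_syl)
    return corrected
```

The corrected sequence would then be fed to a synthesizer that renders it in the user's timbre, as the text describes.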
Referring to fig. 9, fig. 9 is a second alternative timing diagram of an audio adjustment method according to an embodiment of the present invention, in some embodiments of the present invention, after the server obtains the acoustic audio data corresponding to the audio data to be adjusted from the acoustic audio database, that is, after S103, the method may further include: S107-S109, in other words, after S103, the server may choose to execute S104-S105, may choose to execute S107-S109, may execute S104-S105 first, then S107-S109, or execute S107-S109 first, then S104-S105.
S107, the server extracts the original sound audio characteristics from the original sound audio data and extracts the user audio characteristics from the audio data to be adjusted; wherein the audio features include at least pitch, frequency and duration.
The server can not only perform pronunciation correction on the audio data to be adjusted but also perform prosody optimization on it, so as to correct the portions of the user's audio that are off-pitch or rhythmically disordered and obtain a more graceful adjusted audio. To do so, the server first extracts the acoustic audio features from the acoustic audio data and the user audio features from the audio data to be adjusted, so that the two can subsequently be compared to locate the off-pitch and rhythmically disordered portions of the audio data to be adjusted.
It should be understood that, in the embodiment of the present invention, the audio features include at least pitch, frequency and duration, that is, at least information such as the pitch of the user's voice, the frequency distribution of the entire audio data to be adjusted, and the duration of each tone; they may further include other features capable of describing the audio data, which is not limited herein.
S108, the server calculates the similarity between the original sound audio characteristics and the user audio characteristics to obtain an audio difference result.
The server compares the acoustic audio features with the user audio features to obtain their degree of similarity and thereby the audio difference result, which the server uses to represent how the acoustic audio data and the audio data to be adjusted differ in tune and rhythm.
In order to facilitate calculation of the similarity of the acoustic audio feature and the user audio feature, in some embodiments of the present invention, the server may map both the acoustic audio feature and the user audio feature to a vector space, and then calculate, according to the similarity of the vectors, the similarity of the acoustic audio feature and the user audio feature.
And S109, the server performs prosody optimization on the audio data to be adjusted by utilizing the acoustic audio data based on the audio difference result, and obtains the adjusted audio.
After obtaining the audio difference result, the server can determine whether the audio data to be adjusted is out of tune, or whether its rhythm impairs its gracefulness. If the audio difference result indicates that the audio data to be adjusted differs greatly from the acoustic audio data in tune and rhythm, the server uses the tune and rhythm of the acoustic audio data to perform prosody optimization on the audio data to be adjusted; the audio obtained after optimization is the adjusted audio.
It should be noted that, in some embodiments of the present invention, the acoustic audio data and the audio data to be adjusted may differ only in rhythm and tune, only in pronunciation, or in both. Accordingly, the server may perform only prosody optimization on the audio data to be adjusted, only pronunciation correction, or both pronunciation correction and prosody optimization, to obtain the adjusted audio. The specific manner of obtaining the adjusted audio may be determined by the server according to the difference between the audio data to be adjusted and the acoustic audio data, which is not limited herein.
Referring to fig. 10, an embodiment of the present invention provides a second schematic diagram for adjusting audio. In fig. 10, the audio data to be adjusted is a Cantonese song "XXXX" recorded by the user. The bolded "cold eyes" and "jeer" in the lyrics display area 10-1 are words with inaccurate pronunciation that have been corrected by the server. The prosody display area 10-2 shows the pitch and duration of each tone sung by the user together with the pitch and duration of some tones after prosody optimization by the server: the tones sung by the user are drawn with solid line segments, and the tones after prosody optimization are drawn with dashed line segments. As can be seen from fig. 10, the pitch and duration of the tone 10-21 sung by the user do not match, so the server performs prosody optimization on the tone 10-21 to obtain the optimized tone 10-22, thereby correcting the out-of-tune part of the user's singing.
In the embodiment of the invention, the server can also extract the acoustic audio features from the acoustic audio data and the user audio features from the audio data to be adjusted, and compare the two to locate the rhythm and tune problems in the audio data to be adjusted; it then performs prosody optimization on the audio data to be adjusted using the acoustic audio data to obtain the adjusted audio. In this way, the server can optimize the tune and rhythm of the audio data to be adjusted, obtain a more graceful adjusted audio, and further improve the audio adjustment effect.
In some embodiments of the present invention, the server performs similarity calculation on the acoustic audio feature and the user audio feature to obtain an audio difference result, that is, a specific implementation process of S108 may include: S1081-S1082, as follows:
S1081, the server vectorizes the acoustic audio features and the user audio features to obtain the acoustic feature vector corresponding to the acoustic audio features and the user feature vector corresponding to the user audio features.
The server maps the acoustic audio features to a vector space to obtain acoustic feature vectors corresponding to the acoustic features, and maps the user audio features to the vector space to obtain user feature vectors corresponding to the user audio features.
It will be appreciated that, since pitch, frequency and duration can all be represented as numbers, the server can convert the audio features into numerical form when mapping them to the vector space, thereby obtaining the feature vectors.
S1082, the server calculates an angle value between the acoustic feature vector and the user feature vector, and further obtains an audio difference result according to the angle value.
In a vector space, the similarity between two vectors can be measured by the angle between them: the smaller the angle, the more similar the two vectors; the larger the angle, the less similar they are. Therefore, the server can calculate the angle value between the acoustic feature vector and the user feature vector: when the angle value is large, it obtains an audio difference result indicating that the acoustic audio features and the user audio features are dissimilar, and when the angle value is small, it obtains an audio difference result indicating that they are similar.
Further, the server may measure the difference between the acoustic feature vector and the user feature vector by setting an angle value threshold, that is, when the angle value is greater than or equal to a preset angle value threshold, the acoustic audio feature is dissimilar to the user audio feature, and when the angle value is less than the preset angle value threshold, the acoustic audio feature is similar to the user audio feature.
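The angle-based comparison of S1082 can be sketched as follows. This is a minimal illustration; the 30-degree threshold echoes the example threshold values mentioned later in the text, and the vector contents are assumed to be plain numeric feature lists.

```python
import math

def angle_between(u, v):
    # Angle (in degrees) between two feature vectors; a smaller angle
    # means the acoustic and user audio features are more similar.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def features_similar(acoustic_vec, user_vec, threshold_deg=30.0):
    # Dissimilar when the angle reaches the preset threshold, similar
    # otherwise, matching the thresholding rule described in the text.
    return angle_between(acoustic_vec, user_vec) < threshold_deg
```

Orthogonal vectors yield a 90-degree angle (dissimilar), while nearly parallel vectors yield a small angle (similar).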
In the embodiment of the invention, the server can vectorize the original sound audio feature and the user audio feature, and the audio difference result between the original sound feature vector and the user feature vector is obtained by calculating the angle value between the original sound feature vector and the user feature vector, so that the server can measure whether the original sound audio data and the audio data to be adjusted have differences in rhythm and tune.
In some embodiments of the present invention, the server vectorizes the acoustic audio feature and the user audio feature to obtain an acoustic feature vector corresponding to the acoustic audio feature and a user feature vector corresponding to the user audio feature, that is, a specific implementation process of S1081 may include: s1081a to S1081c are as follows:
S1081a, the server digitizes each acoustic sub-feature in the acoustic audio features to obtain the sub-element corresponding to each acoustic sub-feature.
S1081b, the server digitizes each user sub-feature in the user audio features to obtain sub-elements corresponding to each user sub-feature.
Because each sub-feature in the audio feature, such as tone, frequency, duration and the like, can be converted into numbers to be represented, the server can directly digitize each original sound sub-feature to obtain sub-elements corresponding to each original sound sub-feature, digitize each user sub-feature in the same manner to obtain sub-elements corresponding to each user sub-feature.
In other embodiments of the present invention, the server may further round, average or otherwise process the numbers obtained after digitizing each sub-feature to obtain the corresponding sub-elements, which is not limited in the embodiments of the present invention.
S1081c, the server assembles the acoustic feature vector from the sub-elements corresponding to the acoustic sub-features, and assembles the user feature vector from the sub-elements corresponding to the user sub-features.
The server splices the sub-elements corresponding to the original sound sub-features according to the sequence of the original sound sub-features to obtain an original sound feature vector, and splices the sub-elements corresponding to the user sub-features according to the sequence of the user sub-features to obtain a user feature vector, so that the server finishes vectorization of the original sound audio features and the user audio features.
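Steps S1081a to S1081c can be sketched as a digitize-then-splice routine. This is an illustrative sketch: the three sub-feature names and the rounding to two decimals are assumptions standing in for the digitizing step, not specified by the text.

```python
def build_feature_vector(sub_features):
    # Digitize each sub-feature (here: round to two decimals) and splice
    # the resulting sub-elements in a fixed sub-feature order, so that
    # acoustic and user vectors are directly comparable element by element.
    order = ("pitch", "frequency", "duration")
    return [round(float(sub_features[name]), 2) for name in order]
```

The same routine would be applied to the acoustic sub-features and the user sub-features, yielding the two vectors compared in S1082.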
In the embodiment of the invention, the server can vectorize the acoustic audio features and the user audio features to obtain the acoustic feature vector and the user feature vector, facilitating the subsequent calculation of the audio difference result.
In some embodiments of the present invention, the server performs prosody optimization on the audio data to be adjusted using the acoustic audio data based on the audio difference result, to obtain the adjusted audio, that is, the specific implementation process of S109 may include: S1091-S1094 as follows:
S1091, when the audio difference result is greater than a preset audio difference threshold, the server derives the user-specific prosody from the audio data to be adjusted and extracts the acoustic prosody from the acoustic audio data.
The server performs prosody optimization on the audio data to be adjusted only when the audio difference result is greater than the preset audio difference threshold, that is, when the acoustic audio data and the audio data to be adjusted differ sufficiently in tune and rhythm. In other words, when the difference between them in tune and rhythm is small, the audio data to be adjusted is already graceful enough in tune and rhythm and needs no prosody optimization. When performing prosody optimization, the server first derives the user-specific prosody, which carries the user's characteristics, from the audio data to be adjusted, and at the same time extracts the acoustic prosody from the acoustic audio data.
It may be appreciated that, in the embodiment of the present invention, the preset audio difference threshold may be set according to actual situations, for example, when the audio difference result is measured by using an angle value, the preset audio difference threshold may be set to 10 ° or 30 °, which is not limited herein.
It should be noted that the prosody the user frequently uses in the audio data to be adjusted constitutes the user-specific prosody. In other embodiments of the present invention, the user-specific prosody may not be obtainable from the audio data to be adjusted alone; in that case the server may retrieve the user's previously recorded audio from a database and derive the user-specific prosody from that recorded audio.
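The notion that the user's most frequently used prosody constitutes the user-specific prosody can be sketched as a frequency count. This is an assumption-laden illustration: prosody patterns are represented as opaque labels, and the `top_n` cutoff is an invented parameter.

```python
from collections import Counter

def user_specific_prosody(prosody_sequence, top_n=3):
    # The prosody patterns the user employs most often in the audio data
    # to be adjusted (or in retrieved historical audio) are taken as the
    # user-specific prosody; ties follow first-seen order via Counter.
    counts = Counter(prosody_sequence)
    return [pattern for pattern, _ in counts.most_common(top_n)]
```

A real system would derive these patterns from pitch contours rather than labels; the counting idea is the same.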
S1092, the server extracts, from the user-specific prosody, the prosody to be adjusted whose difference from the acoustic prosody exceeds a preset prosody threshold.
After obtaining the user-specific prosody, the server compares it with the acoustic prosody; when the difference between them exceeds the preset prosody threshold, the user-specific prosody deviates seriously from the acoustic prosody, and the server extracts it as a prosody to be adjusted.
It will be appreciated that the preset prosody threshold may be set according to the actual situation, and embodiments of the present invention are not limited herein. For example, the preset prosody threshold may be set to 50%, that is, when a user-specific prosody differs from the acoustic prosody by more than 50%, that user-specific prosody is taken as a prosody to be adjusted.
Note that, since the user's timbre and the timbre of the performer of the acoustic audio data may differ, when comparing the user-specific prosody with the acoustic prosody, or even the user pronunciation syllables with the acoustic pronunciation syllables, the comparison should be performed with timbre stripped out; that is, a difference in timbre must not be mistaken for a difference in pronunciation or prosody.
S1093, the server weights the prosody to be adjusted and the acoustic prosody corresponding to the prosody to be adjusted to obtain the adjusted user prosody.
The server determines the corresponding acoustic prosody for each prosody to be adjusted, then weights the prosody to be adjusted and the corresponding acoustic prosody, and takes the weighted result as the adjusted user prosody. It will be appreciated that the server may set the weight according to the gap between the prosody to be adjusted and the acoustic prosody. For example, when the prosody to be adjusted is completely different from the acoustic prosody, the weight is set to 1, i.e., the acoustic prosody covers the prosody to be adjusted; when the prosody to be adjusted differs only partially from the acoustic prosody, the weight is set to 0.4, i.e., the final adjusted user prosody is generated from 60% of the prosody to be adjusted and 40% of the acoustic prosody, so that the user's prosody is optimally adjusted while its characteristics are preserved.
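The weighting in S1093 can be sketched as a convex blend of a user prosody value and its aligned acoustic prosody value. This is a minimal sketch assuming prosody is reduced to a single number (e.g., a pitch value); the weights 1.0 and 0.4 mirror the examples in the text.

```python
def adjust_prosody_value(user_value, acoustic_value, weight):
    # weight is the share taken from the acoustic prosody:
    #   weight = 1.0 covers the user's prosody with the acoustic prosody;
    #   weight = 0.4 keeps 60% user prosody and takes 40% acoustic prosody,
    # preserving the user's prosodic character while correcting it.
    return (1.0 - weight) * user_value + weight * acoustic_value
```

In practice the blend would be applied per prosodic feature (pitch, duration), but the arithmetic is the same.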
S1094, the server synthesizes the adjusted audio using the adjusted user prosody and the prosody in the user-specific prosody other than the prosody to be adjusted.
The server synthesizes audio using the adjusted user prosody and the other prosody in the user-specific prosody apart from the prosody to be adjusted, obtaining the adjusted audio. It will be appreciated that the server still uses the user's timbre when synthesizing the adjusted audio, so the resulting adjusted audio carries the user's timbre.
In the embodiment of the invention, the server first derives the user-specific prosody from the audio data to be adjusted, then extracts the prosody to be adjusted from the user-specific prosody, weights the prosody to be adjusted with its corresponding acoustic prosody to obtain the adjusted user prosody, and finally synthesizes the adjusted audio based on the adjusted user prosody. In this way, the server can optimize the tune and rhythm of the audio data to be adjusted to obtain a more graceful adjusted audio.
In some embodiments of the present invention, after the terminal obtains the audio data to be adjusted of the user on the audio recording function interface, before the audio data to be adjusted is sent to the server, that is, after S101, before S102, the method may further include: s112, as follows:
S112, the terminal compresses the audio data to be adjusted to obtain compressed audio data to be adjusted.
After the terminal obtains the audio data to be adjusted, in order to facilitate transmission to the server and to save transmission time and the data traffic occupied by transmission, the terminal can compress the audio data to be adjusted. Correspondingly, the terminal then sends the audio adjustment instruction, the acoustic audio identifier and the compressed audio data to be adjusted to the server.
In the embodiment of the invention, the terminal can compress the audio data to be adjusted and send the compressed audio data to be adjusted to the server so as to save the transmission time of the audio data to be adjusted and the data flow occupied by transmission.
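The compression step S112 can be sketched with a lossless codec. This is a stand-in illustration: a real client would more likely use a perceptual speech codec such as Opus or AAC, but zlib keeps the sketch dependency-free and shows the round trip between terminal and server.

```python
import zlib

def compress_for_upload(audio_bytes: bytes) -> bytes:
    # Terminal side (S112): shrink the recording before sending it to the
    # server, saving transmission time and data traffic.
    return zlib.compress(audio_bytes, level=6)

def decompress_on_server(payload: bytes) -> bytes:
    # Server side: recover the original audio bytes before pronunciation
    # matching and prosody optimization.
    return zlib.decompress(payload)
```

Lossless compression guarantees the server analyzes exactly the audio the user recorded.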
In the following, an exemplary application of the embodiment of the present invention in a practical application scenario will be described.
The embodiment of the invention can be applied to scenarios in which a user sings a Cantonese song, a dialect song or a piece of drama on the terminal, or even reads an English short text aloud, and the server beautifies the recorded audio.
Fig. 11 is a schematic flow chart of beautifying audio performed at the terminal side according to an embodiment of the present invention. After the process starts 11-1, the user clicks the sing-along interface (audio recording function interface) of the terminal to start the sing-along 11-2 function (audio adjustment instruction) and then records the user audio 11-3 (audio data to be adjusted). The terminal transmits the recorded user audio segment to the server 11-4 so that the server processes the user audio 11-5. The terminal then receives the beautified audio 11-6 and plays it back through the user's ear return, so that the user can adjust 11-7 after hearing it. After the audio recording is completed 11-8, the terminal judges according to the user's operation whether the user is satisfied with the beautified audio 11-9: if satisfied, the beautified audio is output 11-10; if not, the terminal re-enters the flow of recording the user's song 11-3.
FIG. 12 is a schematic diagram of an interactive process for beautifying audio provided by an embodiment of the present invention. When the user starts recording a sung song, a piece of drama or spoken English, the terminal requests the recording device to record 12-1, and the recording device returns the user audio 12-2 to the terminal. The terminal performs preprocessing 12-3 such as compression on the user audio and sends the preprocessed audio (compressed audio data to be adjusted) to the server 12-4. The server performs a pronunciation matching degree test (pronunciation comparison) on the preprocessed audio, beautifies the user audio 12-5 according to the matching degree (pronunciation difference result), and returns the beautified audio to the terminal 12-6. The terminal plays the beautified audio to the user 12-7 through the ear return or a loudspeaker, so that the user can decide on subsequent operations.
Specifically, in the pronunciation matching degree test, the server compares the syllables of the user audio (user pronunciation syllables) with the Cantonese syllables, drama syllables, dialect-song syllables or English short-text syllables (standard pronunciation syllables) of the acoustic audio to obtain the matching degree result. The server can also optimize the rhythm and tune of the user audio: it first cuts the user audio and the acoustic audio into small segments (acoustic segments and segments to be beautified), taking one line of lyrics as a unit, and compares them segment by segment, which facilitates comparison and reduces comparison errors. The server then characterizes the audio, i.e., converts the user audio and the acoustic audio into vector representations (the acoustic feature vector and the user feature vector); the parameters in the vectors include the pitch, frequency and duration of each word.
Then, the server calculates the similarity between the vector of the user audio and the vector of the acoustic audio using cosine similarity, and divides the difference between the user audio and the acoustic audio into four cases according to the similarity: (a) normal pronunciation, differing rhythm and tune; (b) abnormal pronunciation, consistent rhythm and tune: in this case the user's pronunciation is translated, for example non-Cantonese is translated into Cantonese, and the produced audio is smoothed according to the historical audio and the acoustic audio; (c) abnormal pronunciation and inconsistent rhythm and tune: the difference in both pronunciation and rhythm is too large, and new audio (adjusted audio) needs to be produced from the user audio and the acoustic audio; (d) normal Cantonese pronunciation and consistent rhythm and tune, which is the ideal case and requires no adjustment of the user audio.
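The four cases (a) to (d) amount to a decision over two boolean tests. The sketch below is illustrative; the returned labels paraphrase the text rather than quote any server API.

```python
def classify_difference(pronunciation_normal: bool, prosody_consistent: bool) -> str:
    # Map the pronunciation test and the rhythm/tune similarity test onto
    # the four cases (a)-(d) described in the text.
    if pronunciation_normal and prosody_consistent:
        return "d: ideal, no adjustment needed"
    if pronunciation_normal:
        return "a: optimize rhythm and tune only"
    if prosody_consistent:
        return "b: correct (translate) the pronunciation only"
    return "c: produce new audio from the user audio and the acoustic audio"
```

Each label selects a different processing path: prosody optimization (S107-S109), pronunciation correction (S104-S105), both, or neither.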
Fig. 13 is a schematic diagram of an audio synthesis process according to an embodiment of the present invention. In the audio synthesis process, the user's distinctive prosody (user-specific prosody) 13-3 is first extracted from the user's historical audio 13-1 and the user audio 13-2. When the server has just started running, there is little or no user history audio; in this case, the server can find the user audio most similar to the current user audio among the audio of all users and use it as the user history audio. Then, the server extracts the standard prosody (acoustic prosody) 13-5 from the acoustic audio 13-4, adjusts the user's distinctive prosody with the standard prosody 13-6, and synthesizes the user's beautified audio 13-8 (adjusted audio) from the extracted standard syllables 13-7 and the prosody adjustment result. It can be appreciated that both the standard syllables and the standard prosody can be pre-extracted and stored on the server to speed up audio beautification. Finally, the server can smooth the synthesized audio so that the user's audio sounds more natural.
By the above method, the server can optimally adjust the user-specific prosody and the user's pronunciation in songs, dramas and even spoken English short texts, so that the user's audio sounds more graceful; this increases the adjustment types available for the user's audio and improves the audio adjustment effect.
Continuing with the description below of an exemplary architecture of the audio adjustment device 255 implemented as a software module provided by an embodiment of the present invention, in some embodiments, as shown in fig. 3, the software module stored in the audio adjustment device 255 of the first memory 250 may include:
a first receiving module 2551, configured to receive audio data to be adjusted sent by a terminal; the audio data to be adjusted represent audio data recorded by a user;
an obtaining module 2552, configured to obtain acoustic audio data corresponding to the audio data to be adjusted from an acoustic audio database;
the difference comparison module 2553 is configured to perform pronunciation matching detection on the audio data to be adjusted and the acoustic audio data to obtain a pronunciation difference result; the pronunciation difference result characterizes the difference degree of the audio data to be adjusted and the original sound audio data in pronunciation;
an adjusting module 2554, configured to correct the pronunciation of the audio data to be adjusted by using the pronunciation of the original sound audio data based on the pronunciation difference result, so as to obtain an adjustment audio;
and the first sending module 2555 is configured to return the adjustment audio to the terminal, so that the terminal plays the adjustment audio.
In some embodiments of the present invention, the difference comparing module 2553 is specifically configured to directly perform pronunciation matching detection with the original sound audio data by using the audio data to be adjusted as a segment to be adjusted, so as to obtain the pronunciation difference result; or segmenting the audio data to be adjusted and the original sound audio data, and then performing pronunciation matching detection to obtain the pronunciation difference result.
In some embodiments of the present invention, the difference comparing module 2553 is specifically configured to divide the acoustic audio data into a plurality of acoustic segments; dividing the audio data to be adjusted into a plurality of segments to be adjusted by utilizing a plurality of segmentation times corresponding to the plurality of acoustic segments; and carrying out pronunciation matching detection on each fragment to be adjusted and each corresponding original sound fragment to obtain the pronunciation difference result.
In some embodiments of the present invention, the difference comparing module 2553 is specifically configured to extract at least one syllable of a user pronunciation from each segment to be adjusted, and extract at least one syllable of a standard pronunciation from each original sound segment; determining, from the at least one standard pronunciation syllable, a corresponding standard syllable for each user pronunciation syllable of the at least one user pronunciation syllable; and matching each user pronunciation syllable with the corresponding standard syllable to obtain the pronunciation difference result.
In some embodiments of the present invention, the difference comparing module 2553 is specifically configured to perform phoneme decomposition on each user pronunciation syllable to obtain a first phoneme corresponding to each user pronunciation syllable, and perform phoneme decomposition on the standard syllable corresponding to each user pronunciation syllable to obtain a second phoneme corresponding to the standard syllable; match the pronunciation of the first phoneme with the pronunciation of the second phoneme to obtain a pronunciation matching result; extract the duration of the first phoneme to obtain a first duration, and extract the duration of the second phoneme to obtain a second duration; obtain a time comparison result according to the first duration and the second duration; and determine the pronunciation difference result according to the time comparison result and the pronunciation matching result.
In some embodiments of the present invention, the difference comparing module 2553 is specifically configured such that, when the time comparison result indicates that the first duration matches the second duration, and the phoneme matching result indicates that the pronunciation of the first phoneme matches the pronunciation of the second phoneme, the pronunciation difference result indicates that the user pronunciation syllable has no difference from its corresponding standard syllable; when the time comparison result indicates that the first duration does not match the second duration, and the phoneme matching result indicates that the pronunciation of the first phoneme matches the pronunciation of the second phoneme, the pronunciation difference result indicates that the user pronunciation syllable differs from its corresponding standard syllable; when the time comparison result indicates that the first duration matches the second duration, and the phoneme matching result indicates that the pronunciation of the first phoneme does not match the pronunciation of the second phoneme, the pronunciation difference result indicates that the user pronunciation syllable differs from its corresponding standard syllable; and when the time comparison result indicates that the first duration does not match the second duration, and the phoneme matching result indicates that the pronunciation of the first phoneme does not match the pronunciation of the second phoneme, the pronunciation difference result indicates that the user pronunciation syllable differs from its corresponding standard syllable.
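The four enumerated cases reduce to a single rule: a difference is reported unless both the duration comparison and the phoneme pronunciation comparison match. A minimal sketch, assuming a hypothetical duration tolerance that the disclosure does not specify:

```python
def durations_match(d1, d2, tol=0.15):
    # Hypothetical tolerance: the disclosure does not state how "matching"
    # durations are decided; here, durations within `tol` seconds match.
    return abs(d1 - d2) <= tol

def pronunciation_differs(user_dur, std_dur, phonemes_match):
    # The four enumerated cases collapse to: a difference exists unless
    # BOTH the durations match AND the phoneme pronunciations match.
    return not (durations_match(user_dur, std_dur) and phonemes_match)
```

For example, a syllable sung with the right phoneme but held far too long would still be flagged as different under this rule.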
In some embodiments of the present invention, the adjusting module 2554 is specifically configured to select, from the at least one user pronunciation syllable of the audio data to be adjusted, a syllable to be corrected whose pronunciation difference result represents a difference; extract the standard syllable corresponding to the syllable to be corrected from the at least one standard pronunciation syllable; replace the syllable to be corrected with the standard syllable to obtain a corrected syllable; and synthesize the adjustment audio by utilizing the corrected syllable and the syllables, other than the syllable to be corrected, in the at least one user pronunciation syllable.
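A minimal sketch of the replacement step described above, in which syllables whose pronunciation difference result represents a difference are swapped for their standard counterparts before resynthesis; the syllables are represented here as plain strings purely for illustration, whereas a real system would operate on audio segments:

```python
def correct_syllables(user_syllables, standard_syllables, differ_flags):
    """Replace each user syllable flagged as different with its standard
    counterpart, keeping the remaining user syllables in order."""
    return [
        std if differs else usr
        for usr, std, differs in zip(user_syllables, standard_syllables, differ_flags)
    ]

# Toy data: flags would come from the pronunciation difference result.
user_syllables = ["nii", "hao", "maa"]
standard_syllables = ["ni", "hao", "ma"]
differ_flags = [True, False, True]
corrected = correct_syllables(user_syllables, standard_syllables, differ_flags)
```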
In some embodiments of the present invention, the difference comparing module 2553 is further configured to extract an acoustic audio feature from the acoustic audio data, and extract a user audio feature from the audio data to be adjusted, wherein the audio features include at least pitch, frequency, and duration; and perform similarity calculation on the acoustic audio feature and the user audio feature to obtain an audio difference result;
the adjusting module 2554 is further configured to perform prosody optimization on the audio data to be adjusted by using the acoustic audio data based on the audio difference result, so as to obtain the adjusted audio.
In some embodiments of the present invention, the difference comparing module 2553 is specifically further configured to vectorize the acoustic audio feature and the user audio feature to obtain an acoustic feature vector corresponding to the acoustic audio feature and a user feature vector corresponding to the user audio feature; and calculating an angle value between the original sound feature vector and the user feature vector, and further obtaining the audio difference result according to the angle value.
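The angle value between the two feature vectors can be computed from the cosine identity; a smaller angle indicates that the user audio is closer to the original sound audio. A self-contained sketch (how the angle value is then mapped to the final audio difference result is left open, as the disclosure does not fix it):

```python
import math

def angle_between(v1, v2):
    """Angle in radians between two feature vectors, via
    cos(theta) = (v1 . v2) / (|v1| * |v2|). Vector entries would be the
    digitized sub-elements (e.g. pitch, frequency, duration values)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    cos_theta = max(-1.0, min(1.0, dot / (norm1 * norm2)))  # clamp float error
    return math.acos(cos_theta)
```

Identical or proportional feature vectors give an angle near 0, while orthogonal vectors give pi/2, so the angle itself can serve directly as a difference score to compare against a threshold.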
In some embodiments of the present invention, the difference comparing module 2553 is specifically further configured to digitize each original sound sub-feature in the original sound audio feature to obtain a sub-element corresponding to each original sound sub-feature; digitize each user sub-feature in the user audio feature to obtain a sub-element corresponding to each user sub-feature; and combine the sub-elements corresponding to the original sound sub-features into the acoustic feature vector, and combine the sub-elements corresponding to the user sub-features into the user feature vector.
In some embodiments of the present invention, the adjusting module 2554 is specifically further configured to, when the audio difference result is greater than a preset audio difference threshold, calculate a user-specific prosody from the audio data to be adjusted, and extract an original acoustic prosody from the acoustic audio data; extract, from the user-specific prosody, prosody to be adjusted whose difference from the original acoustic prosody exceeds a preset prosody threshold; weight the prosody to be adjusted and the original acoustic prosody corresponding to the prosody to be adjusted to obtain adjusted user prosody; and synthesize the adjusted audio by utilizing the adjusted user prosody and the prosody, other than the prosody to be adjusted, in the user-specific prosody.
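An illustrative reading of the weighting step above, in which only prosody values whose gap from the original acoustic prosody exceeds the threshold are blended toward the original, while the rest pass through unchanged. The convex-combination weight is an assumption, since the disclosure says only that the two prosodies are "weighted":

```python
def blend_prosody(user_value, original_value, weight=0.5):
    # Hypothetical weighting: one plausible reading of "weighting the
    # prosody to be adjusted and the corresponding original prosody".
    return weight * user_value + (1.0 - weight) * original_value

def adjust_prosody(user_prosody, original_prosody, threshold, weight=0.5):
    """For each prosody value whose gap from the original exceeds the
    threshold, blend it toward the original; others remain unchanged."""
    return [
        blend_prosody(u, o, weight) if abs(u - o) > threshold else u
        for u, o in zip(user_prosody, original_prosody)
    ]
```

Blending rather than outright replacing preserves some of the user's own expression, which matches the stated goal of correcting only the prosody that deviates too far from the original sound audio.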
Continuing with the description of an exemplary architecture of the audio playing device 455 implemented as software modules provided by embodiments of the present invention, in some embodiments, as shown in fig. 4, the software modules stored in the audio playing device 455 of the second memory 450 may include:
the acquisition module 4551 is configured to acquire audio data to be adjusted of a user;
the second sending module 4552 is configured to send the audio data to be adjusted to a server, so that the server performs audio adjustment on the audio data to be adjusted, where the adjusted audio is generated by the server based on the audio data to be adjusted; and the second receiving module 4553 is configured to receive and play the adjusted audio sent by the server.
In some embodiments of the present invention, the audio playing device 455 further includes a compression module 4554;
the compression module 4554 is configured to compress the audio data to be adjusted to obtain compressed audio data to be adjusted;
Correspondingly, the second sending module 4552 is further configured to send the compressed audio data to be adjusted to the server.
Embodiments of the present invention provide a computer readable storage medium having stored therein executable instructions that, when executed by a processor, cause the processor to perform the audio adjustment method provided by embodiments of the present invention, for example, as shown in fig. 5, 7, 8, 9, 11, 12, and 13.
In some embodiments, the computer readable storage medium may be FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, an optical disc, or CD-ROM; or may be various devices including one of or any combination of the above memories.
In some embodiments, the executable audio adjustment instructions may be in the form of programs, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, the executable audio adjustment instructions may, but need not, correspond to files in a file system, and may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a Hypertext Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices located at one site or, alternatively, distributed across multiple sites and interconnected by a communication network.
The foregoing descriptions are merely exemplary embodiments of the present invention and are not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and scope of the present invention shall fall within the protection scope of the present invention.

Claims (13)

1. An audio adjustment method, comprising:
receiving audio data to be adjusted sent by a terminal; the audio data to be adjusted are audio data recorded by a user;
acquiring the original sound audio data corresponding to the audio data to be adjusted from an original sound audio database;
extracting acoustic audio features from the acoustic audio data and extracting user audio features from the audio data to be adjusted; wherein the audio features include at least pitch, frequency, and duration;
performing similarity calculation on the acoustic audio features and the user audio features to obtain an audio difference result;
when the audio difference result is greater than a preset audio difference threshold, calculating a user-specific prosody from the audio data to be adjusted, and extracting an original acoustic prosody from the original sound audio data;
extracting, from the user-specific prosody, prosody to be adjusted whose difference from the original acoustic prosody exceeds a preset prosody threshold;
weighting the prosody to be adjusted and the original acoustic prosody corresponding to the prosody to be adjusted to obtain adjusted user prosody;
and synthesizing the adjusted audio by utilizing the adjusted user prosody and the prosody, other than the prosody to be adjusted, in the user-specific prosody.
2. The method according to claim 1, wherein after the obtaining the acoustic audio data corresponding to the audio data to be adjusted from the acoustic audio database, the method further comprises:
performing pronunciation matching detection on the audio data to be adjusted and the original sound audio data to obtain a pronunciation difference result; the pronunciation difference result characterizes the difference degree of the audio data to be adjusted and the original sound audio data in pronunciation;
and correcting the pronunciation of the audio data to be adjusted by utilizing the pronunciation of the original sound audio data based on the pronunciation difference result to obtain the adjustment audio.
3. The method according to claim 2, wherein the performing the pronunciation match detection on the audio data to be adjusted and the acoustic audio data to obtain a pronunciation difference result includes:
taking the audio data to be adjusted as a segment to be adjusted, and directly performing pronunciation matching detection with the original sound audio data to obtain the pronunciation difference result; or
segmenting the audio data to be adjusted and the original sound audio data, and performing pronunciation matching detection on the segmentation operation result to obtain the pronunciation difference result.
4. A method according to claim 3, wherein the segmenting the audio data to be adjusted and the acoustic audio data, performing pronunciation match detection on the result of the segmentation operation to obtain the pronunciation difference result, comprises:
dividing the acoustic audio data into a plurality of acoustic segments;
dividing the audio data to be adjusted into a plurality of segments to be adjusted by utilizing a plurality of segmentation times corresponding to the plurality of acoustic segments;
and carrying out pronunciation matching detection on each fragment to be adjusted and each corresponding original sound fragment to obtain the pronunciation difference result.
5. The method of claim 4, wherein performing the pronunciation match detection on each segment to be adjusted and each corresponding acoustic segment to obtain the pronunciation difference result comprises:
extracting at least one user pronunciation syllable from each fragment to be adjusted, and extracting at least one standard pronunciation syllable from each original sound fragment;
determining, from the at least one standard pronunciation syllable, a corresponding standard syllable for each user pronunciation syllable of the at least one user pronunciation syllable;
and matching each user pronunciation syllable with the corresponding standard syllable to obtain the pronunciation difference result.
6. The method of claim 5, wherein said matching each of said user's pronunciation syllables with its corresponding standard syllable to obtain said pronunciation difference result comprises:
performing phoneme decomposition on each user pronunciation syllable to obtain a first phoneme corresponding to each user pronunciation syllable, and performing phoneme decomposition on a standard syllable corresponding to each user pronunciation syllable to obtain a second phoneme corresponding to the standard syllable;
matching the pronunciation of the first phoneme with the pronunciation of the second phoneme to obtain a pronunciation matching result;
extracting the duration of the first phoneme to obtain a first duration, and extracting the duration of the second phoneme to obtain a second duration;
obtaining a time comparison result according to the first duration and the second duration;
and determining the pronunciation difference result according to the time comparison result and the pronunciation matching result.
7. The method according to any one of claims 2 to 6, wherein correcting the pronunciation of the audio data to be adjusted with the pronunciation of the acoustic audio data based on the pronunciation difference result, to obtain the adjustment audio, comprises:
selecting, from at least one user pronunciation syllable of the audio data to be adjusted, a syllable to be corrected whose pronunciation difference result represents a difference;
extracting standard syllables corresponding to the syllables to be corrected from the at least one standard pronunciation syllable;
replacing the syllable to be corrected with the standard syllable to obtain a corrected syllable;
and synthesizing the adjustment audio by utilizing the corrected syllable and the syllables, other than the syllable to be corrected, in the at least one user pronunciation syllable.
8. The method according to any one of claims 1 to 6, wherein the performing similarity calculation on the acoustic audio feature and the user audio feature to obtain an audio difference result includes:
vectorizing the original sound audio feature and the user audio feature to obtain an original sound feature vector corresponding to the original sound audio feature and a user feature vector corresponding to the user audio feature;
and calculating an angle value between the original sound feature vector and the user feature vector, and obtaining the audio difference result according to the angle value.
9. The method of claim 8, wherein the vectorizing the acoustic audio feature and the user audio feature, respectively, to obtain an acoustic feature vector corresponding to the acoustic audio feature and a user feature vector corresponding to the user audio feature, comprises:
digitizing each original sound sub-feature in the original sound audio feature to obtain sub-elements corresponding to each original sound sub-feature;
digitizing each user sub-feature in the user audio features to obtain sub-elements corresponding to each user sub-feature;
and combining the sub-elements corresponding to the original sound sub-features into the acoustic feature vector, and combining the sub-elements corresponding to the user sub-features into the user feature vector.
10. An audio adjustment method, comprising:
acquiring audio data to be adjusted of a user;
transmitting the audio data to be adjusted to a server, so that the server performs audio adjustment on the audio data to be adjusted through the audio adjustment method according to any one of claims 1 to 9, wherein the adjustment audio is generated by the server based on the audio data to be adjusted;
and receiving the adjustment audio sent by the server and playing the adjustment audio.
11. A server, comprising:
a first memory for storing executable audio adjustment instructions;
a first processor for implementing the method of any one of claims 1 to 9 when executing the executable audio adjustment instructions stored in the first memory.
12. A terminal, comprising:
a second memory for storing executable audio adjustment instructions;
a second processor for implementing the method of claim 10 when executing the executable audio adjustment instructions stored in the second memory.
13. A computer readable storage medium, storing executable audio adjustment instructions for causing a first processor to perform the method of any one of claims 1 to 9 or for causing a second processor to perform the method of claim 10.
CN202010107251.5A 2020-02-21 2020-02-21 Audio adjustment method, device and computer readable storage medium Active CN111370024B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010107251.5A CN111370024B (en) 2020-02-21 2020-02-21 Audio adjustment method, device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111370024A CN111370024A (en) 2020-07-03
CN111370024B true CN111370024B (en) 2023-07-04

Family

ID=71210069

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112447182A (en) * 2020-10-20 2021-03-05 开放智能机器(上海)有限公司 Automatic sound modification system and sound modification method
CN112487238A (en) * 2020-10-27 2021-03-12 百果园技术(新加坡)有限公司 Audio processing method, device, terminal and medium
CN112463107A (en) * 2020-11-25 2021-03-09 Oppo广东移动通信有限公司 Audio playing parameter determination method and device, electronic equipment and readable storage medium
CN112908302B (en) * 2021-01-26 2024-03-15 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method, device, equipment and readable storage medium
CN114446268B (en) * 2022-01-28 2023-04-28 北京百度网讯科技有限公司 Audio data processing method, device, electronic equipment, medium and program product

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006201278A (en) * 2005-01-18 2006-08-03 Nippon Telegr & Teleph Corp <Ntt> Method and apparatus for automatically analyzing metrical structure of piece of music, program, and recording medium on which program of method is recorded
CN105702240A (en) * 2014-11-25 2016-06-22 腾讯科技(深圳)有限公司 Method and device for enabling intelligent terminal to adjust song accompaniment music
WO2016128729A1 (en) * 2015-02-09 2016-08-18 24 Acoustics Limited Audio signal processing apparatus, client device, system and method
CN110246489A (en) * 2019-06-14 2019-09-17 苏州思必驰信息科技有限公司 Audio recognition method and system for children
CN110782908A (en) * 2019-11-05 2020-02-11 广州欢聊网络科技有限公司 Audio signal processing method and device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101510423B (en) * 2009-03-31 2011-06-15 北京志诚卓盛科技发展有限公司 Multilevel interactive pronunciation quality estimation and diagnostic system
CN106971704B (en) * 2017-04-27 2020-03-17 维沃移动通信有限公司 Audio processing method and mobile terminal
CN107800879A (en) * 2017-10-23 2018-03-13 努比亚技术有限公司 A kind of audio regulation method, terminal and computer-readable recording medium
CN107945788B (en) * 2017-11-27 2021-11-02 桂林电子科技大学 Method for detecting pronunciation error and scoring quality of spoken English related to text
CN108074557B (en) * 2017-12-11 2021-11-23 深圳Tcl新技术有限公司 Tone adjusting method, device and storage medium
CN109272975B (en) * 2018-08-14 2023-06-27 无锡冰河计算机科技发展有限公司 Automatic adjustment method and device for singing accompaniment and KTV jukebox
CN109979418B (en) * 2019-03-06 2022-11-29 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
VRAPS: Visual Rhythm-based Audio Playback System;Trista P. Chen;《2010 IEEE International Conference on Multimedia and Expo》;全文 *
基于混合基元的藏语语音合成技术研究;才让卓玛;《中国优秀硕士学位论文全文数据库信息科技辑》;全文 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant