CN117854458A

CN117854458A - Audio adjustment method, system, computer device and computer readable storage medium

Info

Publication number: CN117854458A
Application number: CN202311663676.4A
Authority: CN
Inventors: 岳伯禹; 李成
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2023-12-06
Filing date: 2023-12-06
Publication date: 2024-04-09

Abstract

The invention discloses an audio adjustment method, an audio adjustment system, a computer device and a computer readable storage medium, and relates to the technical field of audio processing. The method comprises the following steps: dividing the audio to be adjusted into tracks to obtain the audio track data corresponding to each sound source; determining at least one target sound source, and performing pitch-shifting (tone-changing) processing on the audio track data corresponding to each target sound source to obtain the final audio track data corresponding to each target sound source; and merging the final audio track data corresponding to each target sound source with the audio track data that has not undergone pitch-shifting processing to obtain the adjusted audio. With the invention, only one audio recording is required; different audio effects can be realized more flexibly by pitch-shifting the audio track data corresponding to a sound source, and the economic cost and the time cost are greatly reduced.

Description

Audio adjustment method, system, computer device and computer readable storage medium

Technical Field

The present invention relates to the field of audio processing technology, and in particular, to an audio adjustment method, an audio adjustment system, a computer device, and a computer readable storage medium.

Background

At present, to give the same song a male-female duet effect, a multi-singer chorus effect or a solo effect, a different sung version of the song has to be recorded for each effect, which entails high economic cost and high time cost. Singing errors and similar problems can also occur during recording, which further increases the time cost.

Disclosure of Invention

The technical problem to be solved by the invention is that of the prior art, in particular the high economic cost and high time cost of existing song recording. To reduce the economic cost and the time cost, the invention provides an audio adjustment method, an audio adjustment system, a computer device and a computer readable storage medium, as follows:

1) In a first aspect, the present invention provides an audio adjustment method, which specifically includes:

dividing the audio to be adjusted into tracks to obtain the audio track data corresponding to each sound source;

determining at least one target sound source, and performing pitch-shifting (tone-changing) processing on the audio track data corresponding to each target sound source to obtain the final audio track data corresponding to each target sound source;

and merging the final audio track data corresponding to each target sound source with the audio track data that has not undergone pitch-shifting processing to obtain the adjusted audio.

The audio adjusting method provided by the invention has the beneficial effects that:

only one audio recording is needed; different audio effects can be realized more flexibly by pitch-shifting the audio track data corresponding to a sound source, and the economic cost and the time cost are greatly reduced.

On the basis of the scheme, the audio adjusting method can be improved as follows.

Further, the method further comprises the following steps:

when the audio to be adjusted is music and when a user is listening to the adjusted music, acquiring a facial image of the user;

identifying the emotion of the user according to the facial image of the user to obtain an emotion identification result of the user;

and adjusting the volume and rhythm of the music which is being played according to the emotion recognition result of the user.

The beneficial effects of adopting the further scheme are as follows: the user's emotion is accurately identified, and the volume and rhythm of the adjusted music being played are then adjusted accordingly, so that the music serves to regulate the user's emotion.

Further, the method further comprises the following steps:

when the audio to be adjusted is music and when a user is listening to the adjusted music, acquiring a video containing the face of the user, extracting a plurality of key frames from the video, respectively inputting each key frame into a trained emotion recognition model, and obtaining an emotion recognition result corresponding to each key frame;

determining a final emotion recognition result of the user according to the emotion recognition result corresponding to each key frame;

and adjusting the volume and rhythm of the music which is being played according to the final emotion recognition result of the user.

Further, the method further comprises the following steps: and denoising the initial audio to obtain the audio to be adjusted.

The beneficial effects of adopting the further scheme are as follows: by denoising the initial audio, higher-quality audio to be adjusted, such as music, can be obtained, which meets the user's needs and improves the user experience.

2) In a second aspect, the present invention further provides an audio adjustment system, which has the following specific technical scheme:

the system comprises a track dividing module, a pitch-shifting processing module and a merging module;

the track dividing module is used for: dividing the audio to be adjusted into tracks to obtain the audio track data corresponding to each sound source;

the pitch-shifting processing module is used for: determining at least one target sound source, and performing pitch-shifting processing on the audio track data corresponding to each target sound source to obtain the final audio track data corresponding to each target sound source;

the merging module is used for: merging the final audio track data corresponding to each target sound source with the audio track data that has not undergone pitch-shifting processing to obtain the adjusted audio.

On the basis of the scheme, the audio adjusting system can be improved as follows.

Further, the system also comprises a first identification adjustment module, wherein the first identification adjustment module is used for:

Further, the system also comprises a second identification adjustment module, wherein the second identification adjustment module is used for:

Further, the system also comprises a denoising module, wherein the denoising module is used for: denoising the initial audio to obtain the audio to be adjusted.

3) In a third aspect, the present invention also provides a computer device, the computer device including a processor, the processor being coupled to a memory, the memory storing at least one computer program, the at least one computer program being loaded and executed by the processor to cause the computer device to implement any one of the above-mentioned audio adjustment methods.

4) In a fourth aspect, the present invention also provides a computer readable storage medium, in which at least one computer program is stored, the at least one computer program being loaded and executed by a processor, so that the computer implements any one of the above-mentioned audio adjustment methods.

It should be noted that, for the technical effects of the second to fourth aspects of the present invention and their possible implementations, reference may be made to the technical effects of the first aspect and its possible implementations, which are not repeated here.

Drawings

Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings in which:

FIG. 1 is a flow chart of an audio adjustment method according to an embodiment of the present invention;

FIG. 2 is a first flow chart for adjusting the volume and rhythm of the adjusted music being played;

FIG. 3 is a second flow chart for adjusting the volume and rhythm of the adjusted music being played;

FIG. 4 is a schematic structural diagram of an audio adjustment system according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a computer device according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings.

As shown in fig. 1, an audio adjustment method according to an embodiment of the present invention includes the following steps:

S1, dividing the audio to be adjusted into tracks to obtain the audio track data corresponding to each sound source;

the audio to be adjusted may be music, speech recorded by a user, a recorded cat call, and so on. The music may be instrumental music or a song. The sound sources in instrumental music include the instruments involved, such as guitar and piano. The sound sources in a song include the instruments and singers involved. The sound sources in speech recorded by a user include the user and sources in the environment, such as vehicle noise. The sound sources in a recorded cat call include the cat and sources in the environment, such as vehicle noise.

S2, determining at least one target sound source, and performing pitch-shifting processing on the audio track data corresponding to each target sound source to obtain the final audio track data corresponding to each target sound source;

S3, merging the final audio track data corresponding to each target sound source with the audio track data that has not undergone pitch-shifting processing to obtain the adjusted audio, specifically:

1) For example, if a song is sung by two singers and a male-female duet effect is desired, either singer is determined as the target sound source. The audio track data corresponding to that singer, here a male singer, is pitch-shifted to obtain a female voice, and the processed female voice is then merged with the audio track data that has not undergone pitch-shifting processing to obtain the adjusted audio, that is, the adjusted song, which achieves the male-female duet effect.

2) For example, to make a video, a user records a video that contains a cat call and extracts the audio from the video. The cat is taken as the target sound source, and the audio track data corresponding to the cat is pitch-shifted into a dog bark. The dog bark is then merged with the audio track data that has not undergone pitch-shifting processing to obtain the adjusted audio, and the adjusted audio is combined with the video, which improves the user experience.

With this audio adjustment method, only one audio recording is needed; different audio effects can be achieved more flexibly by pitch-shifting the audio track data corresponding to a sound source, and the economic cost and the time cost are greatly reduced.
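Steps S1 to S3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the processing used in the embodiment. Track division (S1) is assumed to have been done already by a separate source-separation step, so the sketch starts from a dictionary of per-source tracks. The names tracks, targets, naive_pitch_shift and adjust_audio are hypothetical. The pitch shift is a naive resampling that changes pitch by reading the track faster or slower and then zero-pads or trims to the original length, so unlike a production pitch shifter it does not preserve the duration of the sound.

```python
import numpy as np


def naive_pitch_shift(track, semitones):
    """Shift pitch by resampling, then pad or trim to the original length.

    Assumption: a simple stand-in for S2; it changes the sound's duration.
    """
    track = np.asarray(track, dtype=float)
    n = len(track)
    factor = 2.0 ** (semitones / 12.0)  # frequency ratio for the shift
    # Read the track `factor` times faster: pitch rises, duration shrinks.
    src_pos = np.arange(0, n - 1, factor)
    shifted = np.interp(src_pos, np.arange(n), track)
    out = np.zeros(n, dtype=float)
    m = min(n, len(shifted))
    out[:m] = shifted[:m]
    return out


def adjust_audio(tracks, targets, semitones):
    """S2 + S3: pitch-shift the target sources, then sum all tracks.

    tracks:  dict mapping a source name to a 1-D array (equal lengths),
             assumed to come from a separate track-division step (S1).
    targets: set of source names chosen as target sound sources.
    """
    final = []
    for name, data in tracks.items():
        data = np.asarray(data, dtype=float)
        if name in targets:
            final.append(naive_pitch_shift(data, semitones))
        else:
            final.append(data)  # untouched track is merged as-is
    return np.sum(final, axis=0)
```

For instance, calling adjust_audio({"vocal": vocal, "guitar": guitar}, {"vocal"}, 12) would raise only the vocal track by one octave before mixing it back with the unchanged guitar track.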

Optionally, in the above technical solution, as shown in fig. 2, the method further includes:

S10, acquiring a facial image of the user, specifically:

when the audio to be adjusted is music and the user is listening to the adjusted music, a facial image of the user is acquired. It should be noted that the facial image is acquired only within the scope permitted by law, for example through cameras installed in settings such as airplanes, trains, airports, railway stations and psychotherapy rooms.

S11, emotion recognition, specifically:

the emotion of the user is identified according to the facial image of the user to obtain an emotion recognition result of the user, which specifically comprises S110 to S111:

S110, facial images of a plurality of users are collected in advance and labeled with emotions to establish a data set, and a preset deep learning network is trained on the data set to obtain a trained emotion recognition model. The preset deep learning network is a convolutional neural network, a recurrent neural network, an autoencoder or a Transformer.

S111, inputting the facial image of the user into a trained emotion recognition model, and outputting an emotion recognition result of the user by the trained emotion recognition model.

S12, adjusting, specifically:

the volume and rhythm of the adjusted music being played are adjusted according to the emotion recognition result of the user, which specifically comprises S120 and S121:

S120, for each emotion, the proportion of people who prefer each volume and rhythm is determined through questionnaires combined with advice from psychological experts, and the preferred volume and rhythm with the highest proportion are taken as the optimal volume and optimal rhythm for that emotion, thereby obtaining the optimal volume and optimal rhythm corresponding to each emotion;

S121, according to the emotion recognition result of the user, the optimal volume and optimal rhythm corresponding to the user are determined from the predetermined optimal volume and optimal rhythm corresponding to each emotion, and the volume and rhythm of the adjusted music being played are adjusted to the optimal volume and optimal rhythm corresponding to the user.

It should be noted that combining questionnaires with advice from psychological experts is intended to meet the needs of most people. In addition, users who turn off the music within a preset duration after its volume and rhythm have been adjusted are counted, and a separate record table is built for each such user so that the volume and rhythm of the adjusted music being played can be corrected for that user later, which improves the user experience and in turn user stickiness.

In this embodiment, the purpose of adjusting the emotion of the user is achieved by accurately identifying the emotion of the user and then adjusting the volume and rhythm of the adjusted music being played.
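The selection in S120 and the lookup in S121 can be sketched as follows. This is an illustrative sketch only: the survey data layout, the (volume, tempo) tuple, the default setting and the function names build_optimal_settings and settings_for are assumptions made for the example, and the expert-advice input mentioned in S120 is not modeled.

```python
from collections import Counter


def build_optimal_settings(survey):
    """S120 sketch: pick the most preferred (volume, tempo) per emotion.

    survey: dict mapping an emotion label to a list of (volume, tempo)
            preferences, one entry per respondent (assumed layout).
    """
    optimal = {}
    for emotion, prefs in survey.items():
        # most_common(1) yields the preference with the largest share
        optimal[emotion] = Counter(prefs).most_common(1)[0][0]
    return optimal


def settings_for(emotion, optimal, default=(0.5, 1.0)):
    """S121 sketch: look up the setting for a recognized emotion.

    The default for an unseen emotion is a hypothetical fallback.
    """
    return optimal.get(emotion, default)
```

A per-user record table, as described for users who turn the music off, could be layered on top by consulting a user-specific dictionary before falling back to the table built from the survey.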

Optionally, in the above technical solution, as shown in fig. 3, the method further includes:

S20, extracting key frames and performing emotion recognition, specifically:

when the audio to be adjusted is music and the user is listening to the adjusted music, acquiring a video containing the face of the user, extracting a plurality of key frames from the video, and respectively inputting each key frame into a trained emotion recognition model to obtain an emotion recognition result corresponding to each key frame;

S21, determining a final emotion recognition result, specifically:

the final emotion recognition result of the user is determined according to the emotion recognition result corresponding to each key frame. For example, if there are 10 key frames in total, of which 9 are recognized as anger and the remaining 1 as a different emotion, anger is determined as the final emotion recognition result of the user. Performing emotion recognition over a plurality of key frames prevents the large recognition error that a single image can cause.

It should be noted that if, for example, 5 key frames are recognized as anger and the remaining 5 as a different emotion, no majority exists and the model needs to be retrained.
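The decision rule in S21 amounts to a majority vote over the per-frame results, with a tie treated as a signal that the model should be retrained. A minimal sketch follows; the function name final_emotion and the choice to return None on a tie are assumptions made for illustration.

```python
from collections import Counter


def final_emotion(frame_results):
    """Majority vote over per-key-frame emotion labels.

    Returns the winning label, or None when there is no input or the
    top two labels are tied (the case where retraining is called for).
    """
    ranked = Counter(frame_results).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: no majority among key frames
    return ranked[0][0]
```

With nine frames labeled "anger" and one labeled otherwise, the function returns "anger"; with a five-to-five split it returns None.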

S22, adjusting:

the volume and rhythm of the adjusted music being played are adjusted according to the final emotion recognition result of the user; for the specific process, refer to the explanation of S12 above, which is not repeated here.

Optionally, in the above technical solution, the method further includes: denoising the initial audio to obtain the audio to be adjusted. By denoising the initial audio, higher-quality audio to be adjusted, such as music, can be obtained, which meets the user's needs and improves the user experience.

Specifically, a training set comprising a plurality of noise samples is established in advance, and a convolutional neural network is trained on the training set to obtain a trained noise recognition model. The initial audio is input into the trained noise recognition model, the noise in the initial audio is recognized, and denoising is performed to obtain the audio to be adjusted. The initial audio refers to the audio as originally recorded.

The convolutional neural network comprises an input layer, a convolutional layer, an activation function, a pooling layer and a fully connected layer, wherein the convolutional layer uses dilated convolution, the activation function is the Sigmoid function, and the pooling layer uses mixed pooling.
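The layer types named above can be illustrated with a small NumPy forward pass. This is a sketch under assumptions, not the trained noise recognition model: it is one-dimensional with a single channel, the kernel, weights and bias are untrained placeholders supplied by the caller, and mixed pooling is taken here to mean a weighted blend of max pooling and average pooling with an assumed mixing weight lam, which the text does not specify. All function names are hypothetical.

```python
import numpy as np


def dilated_conv1d(x, kernel, dilation):
    """Valid 1-D cross-correlation with `dilation` samples between taps."""
    span = (len(kernel) - 1) * dilation
    out = np.empty(len(x) - span)
    for i in range(len(out)):
        out[i] = np.dot(x[i : i + span + 1 : dilation], kernel)
    return out


def sigmoid(z):
    """Element-wise Sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))


def mixed_pool1d(x, size, lam=0.5):
    """Blend of max and average pooling over non-overlapping windows.

    lam is an assumed mixing weight: 1.0 gives max pooling, 0.0 average.
    """
    n = len(x) // size
    windows = x[: n * size].reshape(n, size)
    return lam * windows.max(axis=1) + (1.0 - lam) * windows.mean(axis=1)


def noise_score(x, kernel, dilation, weights, bias):
    """Conv -> Sigmoid -> mixed pooling -> fully connected -> Sigmoid."""
    h = mixed_pool1d(sigmoid(dilated_conv1d(x, kernel, dilation)), size=2)
    return sigmoid(np.dot(h, weights) + bias)
```

Dilation widens the receptive field of the kernel without adding parameters, which is one plausible reason to prefer it for audio, where noise patterns can extend over many samples.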

In the above embodiments, although the steps are numbered S1, S2 and so on, these are only specific embodiments of the present invention; those skilled in the art may adjust the execution order of S1, S2, etc. according to the actual situation, which also falls within the scope of protection of the present invention. It is understood that some embodiments may include some or all of the above embodiments.

As shown in FIG. 4, an audio adjustment system 200 according to an embodiment of the present invention includes a track dividing module 201, a pitch-shifting processing module 202 and a merging module 203;

the track dividing module 201 is configured to: divide the audio to be adjusted into tracks to obtain the audio track data corresponding to each sound source;

the pitch-shifting processing module 202 is configured to: determine at least one target sound source, and perform pitch-shifting processing on the audio track data corresponding to each target sound source to obtain the final audio track data corresponding to each target sound source;

the merging module 203 is configured to: merge the final audio track data corresponding to each target sound source with the audio track data that has not undergone pitch-shifting processing to obtain the adjusted audio.
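The composition of the three modules can be sketched as a small class that wires together three interchangeable callables. This is only an illustration of the module structure: the class name AudioAdjustmentSystem, its field names and the list-of-floats track type are assumptions, and the actual track division, pitch shifting and merging algorithms are left to whatever implementations are plugged in.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

Track = List[float]


@dataclass
class AudioAdjustmentSystem:
    divide: Callable[[Track], Dict[str, Track]]  # track dividing module 201
    shift: Callable[[Track], Track]              # pitch-shifting module 202
    merge: Callable[[List[Track]], Track]        # merging module 203

    def adjust(self, audio: Track, targets: Set[str]) -> Track:
        """Divide, pitch-shift only the target sources, then merge."""
        tracks = self.divide(audio)
        final = [
            self.shift(data) if name in targets else data
            for name, data in tracks.items()
        ]
        return self.merge(final)
```

Keeping each module behind a plain callable matches the note that the functional allocation may be redistributed among modules as needed, since any one of them can be replaced without touching the others.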

Optionally, in the above technical solution, the system further includes a first recognition adjustment module, where the first recognition adjustment module is configured to:

when the audio to be adjusted is music and when the user is listening to the adjusted music, acquiring a facial image of the user;

and adjusting the volume and rhythm of the adjusted music being played according to the emotion recognition result of the user.

Optionally, in the above technical solution, the system further includes a second recognition adjustment module, where the second recognition adjustment module is configured to:

Optionally, in the above technical solution, the system further includes a denoising module, where the denoising module is configured to: and denoising the initial audio to obtain the audio to be adjusted.

It should be noted that the beneficial effects of the audio adjustment system 200 provided in the above embodiment are the same as those of the audio adjustment method described above, and will not be described herein. In addition, when the system provided in the above embodiment implements the functions thereof, only the division of the above functional modules is used as an example, in practical application, the above functional allocation may be implemented by different functional modules according to needs, that is, the system is divided into different functional modules according to practical situations, so as to implement all or part of the functions described above. In addition, the system and method embodiments provided in the foregoing embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiments and are not described herein again.

As shown in fig. 5, in a computer device 300 according to an embodiment of the present invention, the computer device 300 includes a processor 320, the processor 320 is coupled to a memory 310, at least one computer program 330 is stored in the memory 310, and the at least one computer program 330 is loaded and executed by the processor 320, so that the computer device 300 implements any one of the above-mentioned audio adjustment methods, specifically:

the computer device 300 may include one or more processors 320 (Central Processing Units, CPU) and one or more memories 310, where the one or more memories 310 store at least one computer program 330, and the at least one computer program 330 is loaded and executed by the one or more processors 320 to enable the computer device 300 to implement any of the audio adjustment methods provided by the embodiments described above. Of course, the computer device 300 may also have a wired or wireless network interface, a keyboard, an input/output interface and other components for implementing the functions of the device, which are not described herein.

The embodiment of the invention provides a computer readable storage medium, wherein at least one computer program is stored in the computer readable storage medium, and the at least one computer program is loaded and executed by a processor, so that a computer realizes any one of the audio adjustment methods.

Alternatively, the computer readable storage medium may be a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a compact disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.

In an exemplary embodiment, a computer program product or a computer program is also provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform any of the above-described audio adjustment methods.

It should be noted that the terms "first," "second," and the like in the description and in the claims of the present application are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. The order of use of similar objects may be interchanged where appropriate so that embodiments of the present application described herein may be implemented in other sequences than those illustrated or described.

Those skilled in the art will appreciate that the present invention may be embodied as a system, method or computer program product. The disclosure may therefore take the form of entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software, referred to herein generally as a "circuit," "module" or "system." Furthermore, in some embodiments, the invention may also be embodied in the form of a computer program product in one or more computer-readable media, which contain computer-readable program code.

Any combination of one or more computer readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the invention, and that changes, modifications, substitutions and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims

1. An audio adjustment method, comprising:

2. The audio adjustment method of claim 1, further comprising:

3. The audio adjustment method of claim 1, further comprising:

4. The audio adjustment method of any one of claims 1 to 3, further comprising: denoising the initial audio to obtain the audio to be adjusted.

5. An audio adjustment system, characterized by comprising a track dividing module, a pitch-shifting processing module and a merging module;

6. The audio adjustment system of claim 5, further comprising a first recognition adjustment module configured to:

7. The audio adjustment system of claim 5, further comprising a second recognition adjustment module configured to:

8. The audio adjustment system of any one of claims 5 to 7, further comprising a denoising module for:

denoising the initial audio to obtain the audio to be adjusted.

9. A computer device, characterized in that it comprises a processor coupled to a memory, in which at least one computer program is stored, which is loaded and executed by the processor, so that the computer device implements an audio adjustment method according to any of claims 1 to 4.

10. A computer-readable storage medium, in which at least one computer program is stored, which is loaded and executed by a processor, to cause a computer to implement an audio adjustment method according to any one of claims 1 to 4.