CN111526242B - Audio processing method and device and electronic equipment - Google Patents


Info

Publication number
CN111526242B
CN202010366389A · CN111526242A · CN111526242B
Authority
CN
China
Prior art keywords
audio
sub
sound
target
input
Prior art date
Legal status
Active
Application number
CN202010366389.7A
Other languages
Chinese (zh)
Other versions
CN111526242A (en)
Inventor
王诗云
Current Assignee
Vivo Mobile Communication Co Ltd
Original Assignee
Vivo Mobile Communication Co Ltd
Priority date
Filing date
Publication date
Application filed by Vivo Mobile Communication Co Ltd
Priority to CN202010366389.7A
Publication of CN111526242A
Application granted
Publication of CN111526242B

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04M: TELEPHONIC COMMUNICATION
    • H04M1/00: Substation equipment, e.g. for use by subscribers
    • H04M1/64: Automatic arrangements for answering calls; Automatic arrangements for recording messages for absent subscribers; Arrangements for recording conversations
    • H04M1/65: Recording arrangements for recording a message from the calling party
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04M: TELEPHONIC COMMUNICATION
    • H04M1/00: Substation equipment, e.g. for use by subscribers
    • H04M1/72: Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724: User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403: User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H04M1/7243: User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages
    • H04M1/72433: User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages for voice messaging, e.g. dictaphones

Abstract

The application discloses an audio processing method and device and an electronic device, belonging to the technical field of electronic devices. The audio processing method comprises the following steps: acquiring a target audio, where the target audio comprises N first sub-audio bands and each first sub-audio band comprises the sound of one sound-producing object; adding a sound-producing object identifier to each first sub-audio band according to the sound-producing object to which it belongs; and displaying M audio tracks corresponding to the target audio, where each audio track comprises at least one first sub-audio band and each audio track corresponds to one sound-producing object identifier; where N is a positive integer, and M is a positive integer less than or equal to N. The audio processing method and device and the electronic device can solve the prior-art problem that audio processing is difficult.

Description

Audio processing method and device and electronic equipment
Technical Field
The application belongs to the technical field of electronic equipment, and particularly relates to an audio processing method and device and electronic equipment.
Background
At present, when recording audio or video, electronic devices fuse and record sounds from different sound-producing objects together. When a user needs to post-process the sound of one or more sound-producing objects in the recorded audio, the sound must be separated manually from the recorded audio, which is time-consuming and labor-intensive, and may also result in sounds that cannot be separated or are separated incorrectly, so the difficulty of audio processing is high.
Disclosure of Invention
The embodiment of the application aims to provide an audio processing method, an audio processing device and electronic equipment, and can solve the problem that the difficulty of audio processing is high in the prior art.
In order to solve the technical problem, the present application is implemented as follows:
in a first aspect, an embodiment of the present application provides an audio processing method, where the method includes:
acquiring a target audio, wherein the target audio comprises N first sub-audio bands, and each first sub-audio band comprises the sound of one sound-producing object;
adding a sound-producing object identifier to each first sub-audio band according to the sound-producing object to which the first sub-audio band belongs;
displaying M audio tracks corresponding to the target audio, wherein each audio track comprises at least one first sub-audio band, and each audio track corresponds to one sound-producing object identifier;
wherein N is a positive integer, and M is a positive integer less than or equal to N.
In a second aspect, an embodiment of the present application provides an audio processing apparatus, including:
the audio acquisition module is used for acquiring a target audio, wherein the target audio comprises N first sub-audio bands, and each first sub-audio band comprises the sound of one sound-producing object;
the first adding module is used for adding a sound-producing object identifier to each first sub-audio band according to the sound-producing object to which the first sub-audio band belongs;
the first display module is used for displaying M audio tracks corresponding to the target audio, wherein each audio track comprises at least one first sub-audio band, and each audio track corresponds to one sound-producing object identifier;
wherein N is a positive integer, and M is a positive integer less than or equal to N.
In a third aspect, embodiments of the present application provide an electronic device, which includes a processor, a memory, and a program or instructions stored on the memory and executable on the processor, where the program or instructions, when executed by the processor, implement the steps of the audio processing method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a readable storage medium, on which a program or instructions are stored, which when executed by a processor implement the steps of the audio processing method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement the audio processing method according to the first aspect.
In the embodiment of the application, after the target audio is acquired, a sound-producing object identifier is added to each first sub-audio band according to the sound-producing object to which it belongs, and the audio track corresponding to each identifier is displayed. In this way, the user can process the sound of each sound-producing object directly through its own audio track, without manually separating sounds from the recorded audio, which reduces the difficulty of audio processing.
Drawings
Fig. 1 is a schematic flowchart of an audio processing method according to an embodiment of the present application;
Fig. 2 is a schematic interface diagram of a recording display interface according to an embodiment of the present application;
Fig. 3 is a schematic interface diagram illustrating a recording process according to a first embodiment of the present application;
Fig. 4 is a schematic interface diagram illustrating a recording process according to a second embodiment of the present application;
Fig. 5 is a schematic interface diagram illustrating a recording process according to a third embodiment of the present application;
Fig. 6 is a schematic interface diagram illustrating a recording process according to a fourth embodiment of the present application;
Fig. 7 is a schematic interface diagram illustrating a recording process according to a fifth embodiment of the present application;
Fig. 8 is a schematic interface diagram illustrating a recording process according to a sixth embodiment of the present application;
Fig. 9A is a schematic interface diagram illustrating a recording process according to a seventh embodiment of the present application;
Fig. 9B is a schematic interface diagram illustrating another recording process according to the seventh embodiment of the present application;
Fig. 10 is a schematic interface diagram of a recording display interface provided in another embodiment of the present application;
Fig. 11A is a schematic interface diagram of a video processing process according to a first embodiment of the present application;
Fig. 11B is a schematic interface diagram of another video processing process provided in the first embodiment of the present application;
Fig. 12 is a schematic interface diagram of a video processing process provided in a second embodiment of the present application;
Fig. 13 is a schematic interface diagram of a video processing process according to a third embodiment of the present application;
Fig. 14A is a schematic interface diagram of a video processing process according to a fourth embodiment of the present application;
Fig. 14B is a schematic interface diagram of another video processing process according to the fourth embodiment of the present application;
Fig. 15 is a schematic interface diagram of a video processing process according to a fifth embodiment of the present application;
Fig. 16 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application;
Fig. 17 is a hardware configuration diagram of an electronic device implementing an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms first, second and the like in the description and in the claims of the present application are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used are interchangeable under appropriate circumstances, so that the embodiments of the application can operate in sequences other than those illustrated or described herein. In addition, "and/or" in the specification and claims denotes at least one of the connected objects, and the character "/" generally indicates that the objects before and after it are in an "or" relationship.
Currently, a user typically records sound or video through an electronic device so that the recording can be played back and edited later. When an electronic device records sound or video, sounds from different sound-producing objects are fused and recorded together. For example, when an electronic device records the audio or video of a multi-person conversation, the voices of different persons, background sounds, and miscellaneous noises are merged together in the recording.
In practice, the applicant has found that the prior art has at least the following problem:
when a user needs to post-process the sounds of one or more sound-producing objects in recorded audio, the sounds must be separated manually from the recorded audio, which is time-consuming and labor-intensive, and may also result in sounds that cannot be separated or are separated incorrectly (for example, attributing speech to the wrong speaker), so the difficulty of audio processing is high.
In order to solve the above problem, embodiments of the present application provide an audio processing method, an audio processing apparatus, and an electronic device. The audio processing method provided by the embodiment of the present application is described in detail below with reference to the accompanying drawings through specific embodiments and application scenarios thereof.
Fig. 1 shows a schematic flowchart of an audio processing method according to an embodiment of the present application.
In some embodiments of the present application, the method illustrated in fig. 1 may be performed by an audio processing device. As shown in fig. 1, the audio processing method may include:
step 110, acquiring a target audio, where the target audio comprises N first sub-audio bands, and each first sub-audio band comprises the sound of one sound-producing object;
step 120, adding a sound-producing object identifier to each first sub-audio band according to the sound-producing object to which the first sub-audio band belongs;
step 130, displaying M audio tracks corresponding to the target audio, where each audio track comprises at least one first sub-audio band, and each audio track corresponds to one sound-producing object identifier;
wherein N is a positive integer, and M is a positive integer less than or equal to N.
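Steps 110 to 130 amount to a grouping operation: first sub-audio bands sharing a sound-producing object identifier share one audio track. The Python sketch below illustrates this under simplifying assumptions that are not taken from the patent: each first sub-audio band is reduced to an (identifier, start, end) triple, and a track is simply an ordered list of time spans.

```python
from collections import OrderedDict

def assign_tracks(sub_bands):
    """Group labeled first sub-audio bands into per-object audio tracks.

    sub_bands: list of (object_id, start, end) triples in playback order,
    where object_id is the sound-producing-object identifier of the band.
    Returns an ordered mapping object_id -> list of (start, end) spans,
    so the number of tracks M equals the number of distinct identifiers
    and is at most the number of bands N.
    """
    tracks = OrderedDict()
    for object_id, start, end in sub_bands:
        # Each identifier gets exactly one track; each band keeps its
        # original playback position inside that track.
        tracks.setdefault(object_id, []).append((start, end))
    return tracks
```

For example, three bands spoken by two objects yield two tracks, matching the claim that M is a positive integer less than or equal to N.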
Specific implementations of the above steps will be described in detail below.
In the embodiment of the application, after the target audio is acquired, a sound-producing object identifier is added to each first sub-audio band according to the sound-producing object to which it belongs, and the audio track corresponding to each identifier is displayed. In this way, the user can process the sound of each sound-producing object directly through its own audio track, without manually separating sounds from the recorded audio, which reduces the difficulty of audio processing.
In some embodiments of the present application, the target audio may be already recorded audio.
Optionally, in these embodiments, the specific method of step 110 may include:
target audio is retrieved from a target storage location.
In some embodiments, the target audio may be local audio and the target storage location may be a storage location in a local storage space of the audio processing device. At this time, the audio processing apparatus may directly acquire the target audio from the local storage space based on the target storage path.
In other embodiments, the target audio may be network audio and the target storage location may be an internet storage device. At this time, the audio processing apparatus may access the internet storage device based on the target access address and acquire the target audio from the internet storage device.
Optionally, in these embodiments, the specific method of step 110 may further include:
receiving the target audio sent by a target device.
The target device may be an electronic device communicating with the audio processing apparatus, for example, an electronic device communicating with the audio processing apparatus through an instant messaging application.
In other embodiments of the present application, the target audio may be audio in a multimedia file.
Optionally, in these embodiments, the specific method of step 110 may include:
acquiring a multimedia file;
target audio is extracted from the multimedia file.
The multimedia file may be a local multimedia file, a network multimedia file, or a multimedia file sent by a target device, and the method for acquiring the multimedia file is similar to the method for acquiring the target audio, and is not described herein again.
After the multimedia file is obtained, the target audio may be extracted from the multimedia file based on an audio extraction technique or an audio extraction application.
In still other embodiments of the present application, the target audio may be audio that is being recorded by the audio processing device.
Optionally, in these embodiments, the specific method of step 110 may include:
collecting the target audio through an audio collection device.
The audio collection device may include a microphone provided on the audio processing apparatus, or a recorder, video recorder, or microphone that communicates with the audio processing apparatus.
In an embodiment of the present application, the sound generating object may include at least one of a person, an animal, a thing, and an interference source.
When the sound-producing object is a person, its sound may be the person's speech; when it is an animal, its sound may be the animal's cry; when it is a thing, the thing may be a natural phenomenon (e.g., wind, rain, thunder), a vehicle, a construction site, etc., and its sound may be the sound of the natural phenomenon, the whistle of the vehicle, the construction noise of the site, etc.; when the sound-producing object is an interference source, the interference source may be an electromagnetic wave or the like, and its sound may be noise.
The above is a specific implementation of step 110, and a specific implementation of step 120 will be described below.
In step 120 of some embodiments of the present application, a sound-producing object identifier corresponding to the sound-producing object to which the band belongs may be added to each first sub-audio band. The sound-producing object identifier is a label identifying the sound-producing object to which a sub-audio band belongs, and each sound-producing object has one identifier.
In some embodiments of the present application, before step 120, the audio processing method may further include:
acquiring sound characteristics of the first sub-audio band;
determining the sound-producing object to which the first sub-audio band belongs according to the sound characteristics.
In some embodiments, the sound characteristics may be characteristics such as loudness, pitch, and tone of the sound emitted by the sound emitting object. In other embodiments, the sound feature may also be a voiceprint feature of a sound emitted by the sound emitting object. The voiceprint features may include audio parameter features reflecting physiological and behavioral features of the sound-emitting object in an audio waveform corresponding to an audio signal formed by sound emitted by the sound-emitting object, and the voiceprint features of each sound-emitting object are different.
In some embodiments, the sound characteristics of a plurality of first sub-audio bands may be obtained, voiceprint recognition may be performed on the N first sub-audio bands based on these characteristics, and first sub-audio bands with the same sound characteristics may be grouped together, each group belonging to one sound-producing object.
After the N first sub-audio bands are grouped, each group may be numbered, and the group number used as the sound-producing object identifier of the sound-producing object to which that group belongs; the group's number is then added to each first sub-audio band in the group.
For example, if a first sub-audio band belongs to the first group, the sound-producing object to which it belongs may be object A, and the identifier added to it may be tag 1.
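The grouping-and-numbering scheme above can be sketched as follows. The voiceprint comparison itself is abstracted away by assuming each band's sound feature is already reduced to a directly comparable value; this is a simplification, since real voiceprint matching is approximate rather than exact equality.

```python
def number_groups(band_features):
    """Number first sub-audio bands by sound-feature group.

    band_features: one (hashable) sound feature per first sub-audio band,
    in playback order. Bands with the same feature form one group and
    share one tag; groups are numbered 1, 2, ... by first appearance,
    and the group number serves as the sound-producing-object identifier.
    """
    group_of = {}
    tags = []
    for feature in band_features:
        if feature not in group_of:
            group_of[feature] = len(group_of) + 1  # next group number
        tags.append(group_of[feature])
    return tags
```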
In other embodiments, a plurality of preset sound characteristics may be stored in the audio processing device in advance. For example, the sound characteristics of a plurality of persons, the sound characteristics of animals, or the sound characteristics of vehicles may be stored in advance. And each preset sound characteristic corresponds to a preset sound production object identifier.
Specifically, the user may pre-record the sounds of people, animals, or vehicles, respectively, using the audio processing device, and recognize the sound features in the recorded sounds using the audio processing device, and store them as preset sound features. The user may also add a personalized preset sound object identifier to the preset sound feature when storing the preset sound feature, for example, the preset sound object identifier may be a name of a person, an animal, or the like.
For example, the user may record his or her own voice in advance by using the audio processing apparatus, recognize a voice feature in the recorded voice by using the audio processing apparatus, store the voice feature as a preset voice feature, and set a preset sound emitting object identifier corresponding to the preset voice feature as "himself" at the same time.
For another example, the user may pre-record the sound of a pet using the audio processing device, recognize the sound feature in the recorded sound, store it as a preset sound feature, and set the corresponding preset sound-producing object identifier as "Floret".
After the audio processing device obtains the sound characteristics of the first sub-audio band, the sound characteristics may be compared with the preset sound characteristics, and a sound emission object to which the preset sound characteristics that are the same as the sound characteristics belong is taken as a sound emission object to which the first sub-audio band belongs.
After determining the sound emitting object to which the first sub-audio band belongs, a preset sound emitting object identifier corresponding to a preset sound characteristic which is the same as the sound characteristic may be added to the first sub-audio band.
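The preset lookup with a fall-back to automatic numbering can be sketched as below; the dictionary-based exact-match lookup stands in for the feature comparison the text describes, and the feature values and names are illustrative assumptions.

```python
def identify_band(feature, presets, auto_groups):
    """Resolve a band's sound-producing-object identifier.

    presets: dict mapping a pre-recorded preset sound feature to the
    user's custom identifier (e.g. "himself", "Floret").
    auto_groups: mutable dict used to auto-number features that match
    no preset, mirroring the group-numbering fallback.
    """
    if feature in presets:
        # Same feature as a stored preset: reuse the preset identifier.
        return presets[feature]
    if feature not in auto_groups:
        auto_groups[feature] = "tag %d" % (len(auto_groups) + 1)
    return auto_groups[feature]
```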
In other embodiments, after acquiring the first of the first sub-audio bands, the audio processing apparatus may acquire its sound characteristics and determine that the sound-producing object to which it belongs is object A, adding identifier tag 1. After acquiring the second of the first sub-audio bands, the apparatus may acquire its sound characteristics and determine whether they are the same as those of the first band; if so, the second band also belongs to object A and is tagged tag 1, and if not, it belongs to object B and is tagged tag 2. Proceeding in this way, the apparatus identifies the sound-producing objects to which all the first sub-audio bands belong and adds the corresponding identifiers to them.
After adding to each first sub-audio band the identifier of its sound-producing object, the first sub-audio bands may be recorded into different tracks according to their identifiers, in chronological order, and the recording time of each first sub-audio band in its audio track is the same as its playing time in the target audio.
The above is a specific implementation of step 120, and a specific implementation of step 130 is described below.
In some embodiments of the present application, in a case that the target audio is an already recorded audio, M audio tracks corresponding to the target audio may be displayed in an audio display interface of the target audio.
In other embodiments of the present application, in a case that the target audio is an audio being recorded, M audio tracks corresponding to the target audio may be displayed in a recording display interface of the target audio.
Because the audio processing apparatus records the first sub-audio bands into separate tracks according to their sound-producing object identifiers, each audio track corresponds to one identifier; that is, all first sub-audio bands of one sound-producing object are recorded in one audio track.
In this embodiment of the application, optionally, the M audio tracks may be displayed vertically, one above another, and each audio track has the same track length and track duration.
In this embodiment of the application, optionally, a sound emission object identifier corresponding to a sound emission object to which each audio track belongs may be displayed.
In the case where the preset sound-producing object identifier that the user set in advance for a preset sound feature is the object's name, the audio processing apparatus can automatically display the name of the sound-producing object to which each audio track belongs, for example, "himself" and "Floret".
In the case where the sound-producing object identifier is the number of the sub-audio band group corresponding to the sound-producing object, the user can also modify the displayed identifier, so that the identifier of each sound-producing object can be customized.
Fig. 2 shows an interface diagram of a recording display interface according to an embodiment of the present application.
As shown in fig. 2, the recording display interface displays a recording name "record 1", a recording duration 00:10:17, an audio track 1 of an object a, an audio track 2 of an object B, an audio track 3 of an object C, and a function button 202. The audio track 1, the audio track 2 and the audio track 3 respectively include a first sub-audio band 201.
In a scene where the target audio is a multi-person conversation, the object a may be a user a, the object B may be a user B, and the object C may be a user C, where each audio track corresponds to a human voice.
In a scene where the target audio is a natural-environment recording, object A may be a bird, object B the wind, and object C running water; audio track 1 then corresponds to birdsong, audio track 2 to the sound of the wind, and audio track 3 to the sound of running water.
In another embodiment of the present application, the target audio may further include K second sub-audio bands, where each second sub-audio band includes the sounds of at least two sound-producing objects, and K is a positive integer. In order to record the sounds into separate tracks more accurately, the audio processing method may further include, before step 130:
performing audio separation on the second sub-audio band to obtain at least two third sub-audio bands, wherein each third sub-audio band comprises the sound of one sound-producing object;
adding a sound production object identifier for the third sub-audio frequency band according to the sound production object to which the third sub-audio frequency band belongs;
wherein the audio track further comprises a third sub-audio band.
In some embodiments, the second sub-audio band may be separated using a multi-speaker audio separation technique to obtain at least two third sub-audio bands, each of which includes the sound of one sound-producing object; a sound-producing object identifier is then added to each third sub-audio band according to the sound-producing object to which it belongs.
The method for adding a sound-producing object identifier to the third sub-audio band is similar to the method for adding one to the first sub-audio band, and is not described herein again.
It should be noted that when third sub-audio bands are recorded into tracks, the third sub-audio bands split from the same second sub-audio band are all recorded at the same time position in their respective tracks, equal to the playing time of that second sub-audio band in the target audio.
In this way, a second sub-audio band containing the sounds of two or more sound-producing objects can be split into at least two third sub-audio bands, each containing the sound of one sound-producing object, and sound-producing object identifiers are added to the third sub-audio bands respectively, so that the sounds of different sound-producing objects can be recorded into separate tracks more accurately, further reducing the difficulty of audio processing.
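The bookkeeping for this split can be sketched as follows. The separation algorithm itself is not specified by the text, so it is passed in as a caller-supplied function here; the tuple layout is an illustrative assumption. The point of the sketch is the time-alignment rule from the preceding note: every third sub-audio band inherits its parent's play time.

```python
def split_second_band(band, separate):
    """Split a second sub-audio band (mixed sounds) into third sub-audio bands.

    band: (mixed_samples, start, end), the playback span of the second band.
    separate: caller-supplied multi-speaker source-separation function
    (assumed here, not specified by the text) returning a list of
    (object_id, samples) pairs, one per sound-producing object.
    Every resulting third sub-audio band inherits the parent band's
    start/end, so it lands at the same play time in its own track.
    """
    mixed, start, end = band
    return [(object_id, samples, start, end)
            for object_id, samples in separate(mixed)]
```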
In some embodiments of the present application, before step 110, the audio processing method may further include:
acquiring at least two audio frames in a target audio;
determining the sound characteristics corresponding to each audio frame;
and generating at least one sub-audio band of the target audio based on consecutive audio frames with the same sound characteristics.
Specifically, the audio processing apparatus may divide the target audio into at least one sub-audio band based on the sound characteristics of each audio frame in the target audio, so that the sub-audio bands are divided accurately according to the sound characteristics.
If a sub-audio band includes the sound of one sound-producing object, it is a first sub-audio band; if it includes the sounds of at least two sound-producing objects, it is a second sub-audio band.
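The feature-based segmentation can be sketched as a run-length merge over per-frame features. As before, representing each frame's sound characteristics as a directly comparable value is a simplifying assumption.

```python
def segment_by_feature(frame_features):
    """Merge consecutive audio frames with identical sound features
    into sub-audio bands.

    frame_features: one comparable sound feature per audio frame,
    in playback order.
    Returns one (feature, first_frame, last_frame) triple per band.
    """
    bands = []
    start = 0
    for i in range(1, len(frame_features)):
        if frame_features[i] != frame_features[start]:
            # Feature changed: close the current run of frames.
            bands.append((frame_features[start], start, i - 1))
            start = i
    if frame_features:
        bands.append((frame_features[start], start, len(frame_features) - 1))
    return bands
```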
In other embodiments of the present application, before step 110, the audio processing method may further include:
at least one sub-audio segment of the target audio is generated based on a predetermined number of consecutive audio frames in the target audio.
Specifically, each group of a predetermined number of consecutive audio frames may be taken as one sub-audio band, starting with the first audio frame in the target audio.
In this way, the target audio can be divided into sub-audio bands quickly, reducing the amount of data processing.
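This fixed-length alternative reduces to simple index arithmetic; the sketch below assumes frames are identified by their index, with the last band allowed to be shorter when the frame count is not a multiple of the band size.

```python
def segment_fixed(n_frames, frames_per_band):
    """Divide n_frames consecutive audio frames into sub-audio bands of
    frames_per_band frames each, starting from the first frame; the
    final band may be shorter.

    Returns one (first_frame, last_frame) pair per sub-audio band.
    """
    return [(i, min(i + frames_per_band, n_frames) - 1)
            for i in range(0, n_frames, frames_per_band)]
```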
In another embodiment of the present application, in order to further reduce the difficulty of audio processing, after step 130, the audio processing method may further include:
receiving a first input of a user on a first audio track;
in response to the first input, performing audio processing on at least one sub-audio band in the first audio track according to a first processing mode corresponding to the first input;
wherein the first processing mode comprises at least one of the following: audio deletion, audio speed change, audio sound modification, and audio text extraction.
In some embodiments, the first input may be an input for triggering audio processing, and may include at least one of a click input, a double-click input, a long-press input, a slide input, and a drag input. Correspondingly, the first processing mode may include at least one of audio deletion, audio speed change, and audio text extraction.
Specifically, after receiving the first input of the user, the audio processing apparatus may determine the first processing mode corresponding to the first input based on a correspondence between inputs and processing modes, and perform audio processing on at least one sub-audio segment in the first audio track according to the first processing mode.
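The correspondence between a received first input and a first processing mode can be sketched as a dispatch table. This is illustrative only; the gesture names and the handler bodies are hypothetical and not defined by the disclosure.

```python
# Illustrative sketch: bind recognized first inputs to first processing
# modes with a dispatch table. Gesture names and handlers are hypothetical.

def delete_audio(segment):
    return []                  # audio deletion empties the segment

def double_speed(segment):
    return segment[::2]        # crude 2x speed change by dropping samples

FIRST_INPUT_HANDLERS = {
    "long_press": delete_audio,
    "double_click": double_speed,
}

def handle_first_input(gesture, segment):
    handler = FIRST_INPUT_HANDLERS.get(gesture)
    if handler is None:
        raise ValueError(f"no processing mode bound to input {gesture!r}")
    return handler(segment)
```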
Optionally, audio processing may be performed on all sub-audio segments in the first audio track, or only on sub-audio segments in a selected state.
In the case of audio processing of sub-audio segments in the selected state, before receiving the first input of the user to the first audio track, the audio processing method may further include:
receiving a ninth input of the user to the first audio track;
in response to the ninth input, setting the sub-audio segment selected by the ninth input to the selected state.
In still other embodiments, before receiving the first input of the user to the first audio track, a tenth input of the user to the first audio track may further be received. The tenth input may be an input for triggering display of processing mode options, and may include at least one of a click input, a double-click input, and a long-press input.
Specifically, after the audio processing apparatus receives the tenth input of the user, a plurality of processing mode options corresponding to the first audio track may be displayed. Accordingly, the processing mode options may include at least one of an audio deletion option, an audio speed-change option, an audio voice-change option, an audio sound-modification option, and an audio text-extraction option.
In this case, the first input may be a selection input of the user on the processing mode options of the first audio track and the sub-options under each processing mode option, and the first input may include at least one of a click input, a double-click input, and a long-press input. Correspondingly, the first processing mode may include at least one of audio deletion, audio speed change, audio voice change, audio sound modification, and audio text extraction.
Optionally, when the plurality of processing mode options corresponding to the first audio track are displayed, the other audio tracks may be hidden, or the processing mode options may be displayed in a display area corresponding to the first audio track, for example, a blank area adjacent to the first audio track.
Next, the first input of the user and the first processing mode corresponding to the first input will be described in detail with reference to figs. 3 to 8.
Fig. 3 is a schematic interface diagram illustrating a recording process according to a first embodiment of the present application.
As shown in FIG. 3, the audio display interface displays the recording name "recording 1", the audio duration 00:10:17, audio track 1 of object A, audio track 2 of object B, audio track 3 of object C, and a function button 301. The user can click on audio track 1, so that audio track 2 and audio track 3 are hidden in the audio display interface and the processing mode options 302 corresponding to audio track 1 are displayed: the "delete" option, the "shift" option, the "sound change" option, the "sound modification" option, and the "text extraction" option.
Fig. 4 is a schematic interface diagram illustrating a recording process according to a second embodiment of the present application.
As shown in fig. 4, the audio display interface displays the recording name "recording 1", the audio duration 00:10:17, audio track 1 of object A, and the processing mode options 401 corresponding to audio track 1: the "delete" option, the "shift" option, the "sound change" option, the "sound modification" option, and the "text extraction" option. The user may click on the "delete" option, causing first prompt information 402 to be displayed in the audio display interface; the first prompt information may be "Are you sure you want to delete this audio?". The first prompt information 402 may also include sub-options 403 of the "delete" option: an "OK" option and a "cancel" option. The user may click on the "OK" option to complete the deletion of audio track 1. The audio display interface after audio track 1 is deleted may display the recording name "recording 1", the audio duration 00:10:17, audio track 2 of object B, audio track 3 of object C, and a function button 404.
Fig. 5 is a schematic interface diagram illustrating a recording process according to a third embodiment of the present application.
As shown in fig. 5, the audio display interface displays the recording name "recording 1", the audio duration 00:10:17, audio track 1 of object A, and the processing mode options 501 corresponding to audio track 1: the "delete" option, the "shift" option, the "sound change" option, the "sound modification" option, and the "text extraction" option. The user may click on the "shift" option, causing the sub-options 502 of the "shift" option to be displayed within the audio display interface: the "0.5X" option, the "0.75X" option, the "1X" option, the "1.25X" option, the "1.5X" option, and the "2X" option, with different sub-options corresponding to different speed-change multiples. The user can click the option corresponding to the required multiple, so that each sub-audio segment of audio track 1 is speed-changed according to the multiple selected by the user.
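The speed-change multiples above can be illustrated with a naive index-resampling sketch. This is an assumption for illustration only; a real implementation would use a time-stretching algorithm (e.g. WSOLA or a phase vocoder) so that changing speed does not also change pitch.

```python
# Illustrative sketch: naive speed change by index resampling.
# factor > 1 plays faster (shorter output); factor < 1 plays slower.
# This changes duration only; pitch preservation is not attempted here.

def change_speed(samples, factor):
    if factor <= 0:
        raise ValueError("speed factor must be positive")
    out, pos = [], 0.0
    while int(pos) < len(samples):
        out.append(samples[int(pos)])
        pos += factor
    return out
```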
Fig. 6 is a schematic interface diagram illustrating a recording process according to a fourth embodiment of the present application.
As shown in fig. 6, the audio display interface displays the recording name "recording 1", the audio duration 00:10:17, audio track 1 of object A, and the processing mode options 601 corresponding to audio track 1: the "delete" option, the "shift" option, the "sound change" option, the "sound modification" option, and the "text extraction" option. The user may click on the "sound change" option, causing the sub-options 602 of the "sound change" option to be displayed within the audio display interface, for example an "uncle voice" option, a "shota voice" option, a "loli voice" option, a "young lady voice" option, an "AI voice" option, and an "original sound" option, with different sub-options corresponding to different voice-change effects. The user can click the option corresponding to the required voice-change effect, so that each sub-audio segment of audio track 1 is voice-changed according to the effect selected by the user.
Fig. 7 is a schematic interface diagram illustrating a recording process according to a fifth embodiment of the present application.
As shown in fig. 7, the audio display interface displays the recording name "recording 1", the audio duration 00:10:17, audio track 1 of object A, and the processing mode options 701 corresponding to audio track 1: the "delete" option, the "shift" option, the "sound change" option, the "sound modification" option, and the "text extraction" option. The user may click on the "sound modification" option, causing the sub-options 702 of the "sound modification" option to be displayed within the audio display interface: the "pop" option, the "R&B" option, the "rock" option, the "hip hop" option, the "soul" option, the "phonograph" option, and the "original sound" option, with different sub-options corresponding to different sound-modification effects. The user can click the option corresponding to the required sound-modification effect, so that each sub-audio segment of audio track 1 is sound-modified according to the effect selected by the user.
Fig. 8 is a schematic interface diagram illustrating a recording process according to a sixth embodiment of the present application.
As shown in fig. 8, the audio display interface displays the recording name "recording 1", the audio duration 00:10:17, audio track 1 of object A, and the processing mode options 801 corresponding to audio track 1: the "delete" option, the "shift" option, the "sound change" option, the "sound modification" option, and the "text extraction" option. The user can click the "text extraction" option, causing the audio processing device to convert the sound in the sub-audio segments of audio track 1 into text and display the converted text "today's weather is really good" in the audio display interface.
Optionally, while the text "today's weather is really good" is displayed in fig. 8, the word count of the text converted from the sound in the sub-audio segments of audio track 1, e.g. "6 words", may also be displayed.
In other embodiments of the present application, after step 130, the audio processing method may further include:
receiving a second input of the user to the at least one second audio track;
in response to the second input, performing audio processing on at least one sub-audio segment in each second audio track according to a second processing mode corresponding to the second input;
wherein the second processing mode comprises at least one of the following: audio merging and audio deleting.
Specifically, in the case that at least one second audio track is displayed, the user may perform a second input. The second input may be at least one of a click input, a long-press input, and a double-click input, and the second processing mode may be at least one of audio merging and audio deletion.
Specifically, after receiving the second input of the user, the audio processing apparatus may determine the second processing mode corresponding to the second input based on a correspondence between inputs and processing modes, and perform audio processing on at least one sub-audio segment in the at least one displayed second audio track according to the second processing mode.
Optionally, audio processing may be performed on all sub-audio segments in the second audio track, or only on sub-audio segments in a selected state. The method for placing a sub-audio segment in the selected state is described above and is not repeated here.
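The audio merging applied to second audio tracks can be sketched as sample-wise mixing. This is illustrative only; a real implementation would align the tracks by timestamp and guard against clipping.

```python
# Illustrative sketch: merge two tracks by sample-wise mixing,
# zero-padding the shorter track so positions line up.

def merge_tracks(track_a, track_b):
    n = max(len(track_a), len(track_b))
    a = list(track_a) + [0] * (n - len(track_a))
    b = list(track_b) + [0] * (n - len(track_b))
    return [x + y for x, y in zip(a, b)]
```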
In some further embodiments of the present application, prior to receiving a second input of the at least one second audio track by the user, the audio processing method may further comprise:
receiving an eleventh input of the user to the second target control;
and in response to the eleventh input, displaying a selection control and processing mode options corresponding to each audio track.
The second target control may be a control that triggers entering into the batch processing mode.
In these embodiments, the eleventh input may be at least one of a click input, a long-press input, and a double-click input. The second input may be a selection input on at least one of the selection controls and the processing mode options, and may be at least one of a click input, a long-press input, and a double-click input. An audio track whose selection control is in the selected state may be used as a second audio track.
In these embodiments, the processing mode options may include a merge option and a delete option. When the user selects the merge option, at least one sub-audio segment in the at least one second audio track can be merged; when the user selects the delete option, at least one sub-audio segment in the at least one second audio track may be deleted.
Optionally, audio processing may be performed on all sub-audio segments in the second audio track, or only on sub-audio segments in a selected state. The method for placing a sub-audio segment in the selected state is described above and is not repeated here.
In some embodiments, after the user selects the delete option, second prompt information may further be displayed in response to the second input, the second prompt information being used to prompt the user to confirm the deletion. The second prompt information may include sub-options of the "delete" option: an "OK" option and a "cancel" option. The user may click on the "OK" option to cause the audio processing device to complete the deletion of the second audio track.
Next, the second input of the user and the second processing mode corresponding to the second input will be described in detail with reference to fig. 9.
Fig. 9A is a schematic interface diagram illustrating a recording process according to a seventh embodiment of the present application.
As shown in fig. 9A, the audio display interface displays the recording name "recording 1", the audio duration 00:10:17, audio track 1 of object A, audio track 2 of object B, audio track 3 of object C, a second target control 901, and a function button 902. The user may click on the second target control 901, causing a selection control 903 to be displayed at the upper right of each audio track within the audio display interface, and causing the processing mode options 904 to be displayed within the audio display interface: a "merge" option and a "delete" option. The user can click the selection controls 903 of audio track 2 and audio track 3 to place them in the selected state, and then click the "merge" option to make the audio processing device merge the sub-audio segments in audio track 2 and audio track 3 into a merged track 1.
Fig. 9B is a schematic interface diagram illustrating another recording process according to a seventh embodiment of the present application.
As shown in fig. 9B, the audio display interface displays the recording name "recording 1", the audio duration 00:10:17, audio track 1 of object A, audio track 2 of object B, audio track 3 of object C, a second target control 901, and a function button 902. The user may click on the second target control 901, causing a selection control 903 to be displayed at the upper right of each audio track within the audio display interface, and causing the processing mode options 904 to be displayed within the audio display interface: a "merge" option and a "delete" option. The user may click the selection controls 903 of audio track 2 and audio track 3 to place them in the selected state, and then click the "delete" option to display second prompt information 905 in the audio display interface; the second prompt information 905 may be "Are you sure you want to delete this audio?". The second prompt information 905 may also include sub-options 906 of the "delete" option: an "OK" option and a "cancel" option. The user may click on the "OK" option to complete the deletion of audio track 2 and audio track 3. The audio display interface after audio track 2 and audio track 3 are deleted may display the recording name "recording 1", the audio duration 00:10:17, audio track 1 of object A, the second target control 901, and the function button 902.
In the embodiments of the application, the function buttons in the above embodiments may further include a save button, and the user may click the save button to save the edited audio tracks.
Therefore, in the embodiments of the application, voiceprint recognition technology can be used to accurately identify the sound-producing object to which each sub-audio segment belongs, and the sub-audio segments of different sound-producing objects can be recorded on separate tracks according to the sound-producing object identifiers and displayed on different audio tracks. In this way, the audio content of different sound-producing objects can be accurately distinguished both during recording and when others listen to the recording, and audio processing can be performed on the audio content of different sound-producing objects track by track. This improves audio processing efficiency, makes audio processing more engaging, and allows the user to modify and adjust the audio content independently.
In yet another embodiment of the present application, the audio processing method may further include:
acquiring a target video corresponding to a target audio;
and displaying an image preview window and a video track corresponding to the target video, wherein the image preview window is used for displaying the image frame of the target video.
In some embodiments of the present application, the target video may be a video that has already been recorded, and the target video may be a video stored in association with the target audio. The method for obtaining the target video corresponding to the target audio is similar to the method for obtaining the target audio, and is not described herein again.
In other embodiments of the present application, the target video may be a video in a multimedia file.
Optionally, in these embodiments, a specific method for obtaining a target video corresponding to a target audio may include:
acquiring a multimedia file;
and extracting a target video corresponding to the target audio from the multimedia file.
The multimedia file may be a local multimedia file, a network multimedia file, or a multimedia file sent by a target device, and the method for acquiring the multimedia file is similar to the method for acquiring the target audio, and is not described herein again.
After the multimedia file is obtained, the target video may be extracted from the multimedia file based on a video extraction technique or a video extraction application.
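As an illustrative sketch (not part of the disclosure), extracting the video stream from a multimedia file can be done by shelling out to ffmpeg: the file names below are hypothetical, `-an` drops the audio stream, and `-c:v copy` copies the video stream without re-encoding.

```python
# Illustrative sketch: build an ffmpeg command that extracts the video
# stream from a multimedia file (-an drops audio, -c:v copy avoids
# re-encoding the video).
import subprocess

def extract_video_command(multimedia_path, output_path):
    return ["ffmpeg", "-i", multimedia_path, "-an", "-c:v", "copy", output_path]

# To actually run it (requires ffmpeg on the PATH):
# subprocess.run(extract_video_command("clip.mp4", "video_only.mp4"), check=True)
```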
In still other embodiments of the present application, the target video may be a video being recorded by the audio processing device.
Optionally, in these embodiments, a specific method for obtaining a target video corresponding to a target audio may include:
and while the target audio is collected through the audio collection device, collecting the target video corresponding to the target audio through the video collection device.
The video collection device may include a camera, a video recorder, or the like provided on the audio processing apparatus.
In some embodiments of the present application, in a case that the target audio and the target video are already recorded audio and video, M audio tracks corresponding to the target audio and an image preview window and a video track corresponding to the target video may be displayed in a video display interface.
In other embodiments of the present application, in a case that the target audio and the target video are audio and video being recorded, M audio tracks corresponding to the target audio and an image preview window and a video track corresponding to the target video may be displayed in the recording display interface.
Because the audio processing apparatus can record the sub-audio segments on separate tracks according to the sound-producing object identifiers, one audio track can correspond to one sound-producing object identifier; that is, at least one sub-audio segment corresponding to one sound-producing object can be recorded in one audio track.
In the embodiment of the present application, optionally, the image preview window is displayed at the top of the display interface, the video track is displayed below the image preview window, and the M audio tracks may be displayed vertically below the video track, where the video track and the audio tracks all have the same track length and track duration.
Fig. 10 is a schematic interface diagram illustrating a recording display interface according to another embodiment of the present application.
As shown in fig. 10, the recording display interface displays an image preview window 1001, a video track, audio track 1 of object A, audio track 2 of object B, audio track 3 of object C, and a function button 1002. Audio track 1, audio track 2, and audio track 3 each include sub-audio segments 1003.
In some embodiments of the present application, the audio processing method may further include:
receiving a third input of the user to a third audio track in the case that the image preview window displays the target image frame;
in response to the third input, performing audio processing on a target audio frame in the third audio track according to a third processing mode corresponding to the third input;
wherein the target audio frame is an audio frame having the same timestamp as the target image frame, and the third processing mode includes at least one of the following: audio deletion, audio segmentation, audio speed change, audio voice change, audio replacement, and subtitle addition.
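Locating the target audio frame by timestamp can be sketched as follows (illustrative only; the function name and the tolerance value are assumptions, since frame timestamps rarely match exactly in practice):

```python
# Illustrative sketch: find the audio frame whose timestamp matches the
# image frame currently shown in the preview window, within a tolerance.

def target_audio_frame(audio_timestamps, image_timestamp, tolerance=0.02):
    for i, t in enumerate(audio_timestamps):
        if abs(t - image_timestamp) <= tolerance:
            return i
    return None
```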
In some embodiments, the third input may be an input for triggering audio processing, and the third input may include at least one of a click input, a double-click input, a long-press input, a slide input, and a drag input. Accordingly, the third processing mode may include at least one of audio deletion, audio speed change, and subtitle addition.
Specifically, after receiving the third input of the user, the audio processing apparatus may determine the third processing mode corresponding to the third input based on a correspondence between inputs and processing modes, and perform audio processing on at least one sub-audio segment in the third audio track according to the third processing mode.
Optionally, audio processing may be performed on all sub-audio segments in the third audio track, or only on sub-audio segments in a selected state. The method for placing a sub-audio segment in the selected state is described above and is not repeated here.
In still other embodiments, before receiving the third input of the user to the third audio track, a twelfth input of the user to the third audio track may further be received. The twelfth input may be an input for triggering display of processing mode options, and may include at least one of a click input, a double-click input, and a long-press input.
Specifically, after the audio processing apparatus receives the twelfth input of the user, a plurality of processing mode options corresponding to the third audio track may be displayed. Accordingly, the processing mode options may include at least one of an audio deletion option, an audio segmentation option, an audio speed-change option, an audio voice-change option, an audio replacement option, and an add-subtitle option.
In this case, the third input may be a selection input of the user on the processing mode options of the third audio track and the sub-options under each processing mode option, and the third input may include at least one of a click input, a double-click input, and a long-press input. Accordingly, the third processing mode may include at least one of audio deletion, audio segmentation, audio speed change, audio voice change, audio replacement, and subtitle addition.
Optionally, when the plurality of processing mode options corresponding to the third audio track are displayed, the other audio tracks may be hidden; the processing mode options may be displayed in a display area corresponding to the third audio track, for example, a blank area adjacent to the third audio track; or the processing mode options may be displayed below all the audio tracks, with a selection box displayed on the third audio track.
Next, the third input of the user and the third processing mode corresponding to the third input will be described in detail with reference to figs. 11 to 13.
Fig. 11A is a schematic interface diagram illustrating a video processing procedure according to a first embodiment of the present application.
As shown in fig. 11A, the video display interface displays an image preview window 1101, a video track, audio track 1 of object A, audio track 2 of object B, audio track 3 of object C, and the processing mode options 1102 corresponding to audio track 1: the "delete" option, the "split" option, the "shift" option, the "sound change" option, the "replace" option, and the "add caption" option. Audio track 1 is marked with a selection box. The user may click on the "split" option, causing the audio processing device to delete the audio frame in audio track 1 whose timestamp is the same as that of the image frame displayed in the image preview window 1101. The user may click on the "sound change" option, causing the sub-options 1105 of the "sound change" option to be displayed within the video display interface, for example an "uncle voice" option, a "shota voice" option, a "loli voice" option, a "young lady voice" option, an "AI voice" option, and an "original sound" option, with different sub-options corresponding to different voice-change effects. The user can click the option corresponding to the required voice-change effect, so that each sub-audio segment of audio track 1 is voice-changed according to the effect selected by the user. The user may click on the "replace" option, causing the video display interface to display the sub-options of the "replace" option: a plurality of preset audios 1106, an "OK" option, and a "cancel" option. The preset audios 1106 may include local audio and network audio, and the user may select one of the preset audios 1106 and click on the "OK" option, causing the audio processing device to replace the audio in audio track 1 with the selected preset audio 1106.
Fig. 11B is a schematic interface diagram illustrating another video processing procedure according to the first embodiment of the present application.
As shown in fig. 11B, the video display interface displays an image preview window 1101, a video track, audio track 1 of object A, audio track 2 of object B, audio track 3 of object C, and the processing mode options 1102 corresponding to audio track 1: the "delete" option, the "split" option, the "shift" option, the "sound change" option, the "replace" option, and the "add caption" option. Audio track 1 is marked with a selection box. The user may click on the "delete" option, causing third prompt information 1103 to be displayed in the video display interface; the third prompt information may be "Are you sure you want to delete this audio?". The third prompt information 1103 may also include sub-options 1104 of the "delete" option: an "OK" option and a "cancel" option. The user may click on the "OK" option to complete the deletion of audio track 1. The video display interface after audio track 1 is deleted may display the image preview window 1101, the video track, audio track 2 of object B, audio track 3 of object C, and the processing mode options 1102: the "delete" option, the "split" option, the "shift" option, the "sound change" option, the "replace" option, and the "add caption" option.
Fig. 12 is a schematic interface diagram illustrating a video processing procedure according to a second embodiment of the present application.
As shown in fig. 12, the video display interface displays an image preview window 1201, a video track, audio track 1 of object A, audio track 2 of object B, audio track 3 of object C, and the processing mode options 1202 corresponding to audio track 1: the "delete" option, the "split" option, the "shift" option, the "sound change" option, the "replace" option, and the "add caption" option. The user may click on the "shift" option, causing the sub-options 1203 of the "shift" option to be displayed within the video display interface: the "0.5X" option, the "0.75X" option, the "1X" option, the "1.25X" option, the "1.5X" option, and the "2X" option, with different sub-options corresponding to different speed-change multiples. The user can click the option corresponding to the required multiple, so that each sub-audio segment of audio track 1 is speed-changed according to the multiple selected by the user.
Fig. 13 is a schematic interface diagram illustrating a video processing procedure according to a third embodiment of the present application.
As shown in fig. 13, the video display interface displays an image preview window 1301, a video track, audio track 1 of object A, audio track 2 of object B, audio track 3 of object C, and the processing mode options 1302 corresponding to audio track 1: the "delete" option, the "split" option, the "shift" option, the "sound change" option, the "replace" option, and the "add caption" option. The user may click on the "add caption" option, causing the audio processing device to convert the sound in the sub-audio segments of audio track 1 into text and display the converted text in the image preview window 1301.
In other embodiments of the present application, the audio processing method may further include:
receiving a fourth input of the user to the at least one fourth audio track in a case where the image preview window displays the target image frame;
in response to the fourth input, performing audio processing on the target audio frame in each fourth audio track according to a fourth processing mode corresponding to the fourth input;
wherein the target audio frame is an audio frame with the same timestamp as the target image frame, and the fourth processing mode includes at least one of the following: audio merging and audio deleting.
Specifically, in the case that at least one fourth audio track is displayed, the user may perform a fourth input. The fourth input may be at least one of a click input, a long-press input, and a double-click input, and the fourth processing mode may be at least one of audio merging and audio deletion.
Specifically, after receiving the fourth input of the user, the audio processing apparatus may determine the fourth processing mode corresponding to the fourth input based on a correspondence between inputs and processing modes, and perform audio processing on at least one sub-audio segment in the at least one displayed fourth audio track according to the fourth processing mode.
Optionally, audio processing may be performed on all sub-audio segments in the fourth audio track, or only on sub-audio segments in a selected state. The method for placing a sub-audio segment in the selected state is described above and is not repeated here.
In some further embodiments of the present application, before receiving a fourth input of the at least one fourth audio track by the user, the audio processing method may further include:
receiving a thirteenth input of the user to the third target control;
and in response to the thirteenth input, displaying a selection control and processing mode options corresponding to each audio track.
The third target control may be a control that triggers entering into the batch processing mode.
In these embodiments, the thirteenth input may be at least one of a click input, a long-press input, and a double-click input. The fourth input may be a selection input on at least one of the selection controls and the processing mode options, and may be at least one of a click input, a long-press input, and a double-click input. The selection control selected by the fourth input changes from the unselected state to the selected state, and an audio track whose selection control is in the selected state may be used as a fourth audio track.
In these embodiments, the processing mode options may include a merge option and a delete option. When the user selects the merge option, at least one sub-audio segment in the at least one fourth audio track can be merged; when the user selects the delete option, at least one sub-audio segment in the at least one fourth audio track may be deleted.
Optionally, audio processing may be performed on all sub-audio segments in the fourth audio track, or only on sub-audio segments in a selected state. The method for placing a sub-audio segment in the selected state is described above and is not repeated here.
Next, the fourth input of the user and the fourth processing mode corresponding to the fourth input will be described in detail with reference to fig. 14.
Fig. 14A is a schematic interface diagram illustrating a video processing procedure according to a fourth embodiment of the present application.
As shown in fig. 14A, the video display interface displays an image preview window 1401, a third target control 1402, a video track, an audio track 1 of an object A, an audio track 2 of an object B, an audio track 3 of an object C, and processing mode options 1403 corresponding to the audio track 1: a "delete" option, a "split" option, a "speed change" option, a "sound change" option, a "replace" option, and an "add caption" option. The user may click the third target control 1402, so that a selection control 1404 is displayed at the upper right of each audio track within the video display interface, and processing mode options 1403 are displayed within the video display interface: a "merge" option and a "delete" option. The user may click the selection controls 1404 of the audio track 2 and the audio track 3 to set them to the selected state, and then click the "merge" option, so that the audio processing apparatus merges the sub-audio bands of the audio track 2 and the audio track 3 to obtain a merged track 1.
Fig. 14B is a schematic interface diagram illustrating another video processing procedure according to the fourth embodiment of the present application.
As shown in fig. 14B, the video display interface displays an image preview window 1401, a third target control 1402, a video track, an audio track 1 of an object A, an audio track 2 of an object B, an audio track 3 of an object C, and processing mode options 1403 corresponding to the audio track 1: a "delete" option, a "split" option, a "speed change" option, a "sound change" option, a "replace" option, and an "add caption" option. The user may click the third target control 1402, so that a selection control 1404 is displayed at the upper right of each audio track within the video display interface, and processing mode options 1403 are displayed within the video display interface: a "merge" option and a "delete" option. The user may click the selection controls 1404 of the audio track 2 and the audio track 3 to place them in the selected state, and then click the "delete" option, so that a fourth prompt 1405 is displayed in the video display interface. The fourth prompt 1405 may read "Are you sure you want to delete this audio?" and may also include sub-options 1406 of the "delete" option: an "ok" option and a "cancel" option. The user may click the "ok" option to complete the deletion of the audio track 2 and the audio track 3. After the audio track 2 and the audio track 3 are deleted, the video display interface may display the image preview window 1401, the third target control 1402, the video track, the audio track 1 of the object A, and the processing mode options 1403 corresponding to the audio track 1: a "delete" option, a "split" option, a "speed change" option, a "sound change" option, a "replace" option, and an "add caption" option.
In this embodiment of the application, function buttons may also be displayed in the video display interface, and the function buttons may include a save button, so that the user may click it to save the edited audio tracks.
In other embodiments of the present application, the audio processing method may further include:
receiving a fifth input of the first target control from the user under the condition that the image preview window displays the target image frame;
displaying at least one preset audio in response to a fifth input;
receiving a sixth input of the user to a first audio frequency in the at least one preset audio frequency;
in response to a sixth input, adding the first audio to the target track position in the fifth audio track;
and the fifth audio track is an audio track with the same sound characteristics as the first audio, and the timestamp corresponding to the target track position is the same as the timestamp of the target image frame.
In some embodiments of the present application, the first target control may be a control for triggering an audio add function, the fifth input may be at least one of a click input, a long press input, a double click input, and the sixth input may be at least one of a click input, a long press input, a double click input.
The audio processing apparatus may enter the audio add function in response to the fifth input and display the sub-options of the audio add function: the at least one preset audio and an audio add button. The user may apply the sixth input to a first audio among the at least one preset audio and to the audio add button, so that the audio processing apparatus acquires the sound characteristic of the first audio, identifies a fifth audio track having the same sound characteristic as the first audio, and then adds the first audio to the target track position in the fifth audio track.
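The matching step just described — finding the fifth audio track whose sound characteristic matches that of the first audio, then inserting it at the track position whose timestamp equals that of the previewed image frame — could be sketched roughly as follows. This sketch assumes sound characteristics are compared as fixed-length voiceprint embeddings via cosine similarity; the patent does not specify the comparison method, so this is an illustrative assumption:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def find_matching_track(tracks, audio_embedding, threshold=0.8):
    """Return the track whose voiceprint embedding is most similar to
    that of the preset audio, or None if no track clears the threshold."""
    best, best_sim = None, threshold
    for track in tracks:
        sim = cosine(track["embedding"], audio_embedding)
        if sim > best_sim:
            best, best_sim = track, sim
    return best

def add_audio_at_frame(track, audio, frame_index, fps):
    """Insert the preset audio at the track position whose timestamp
    equals the timestamp of the previewed image frame."""
    timestamp = frame_index / fps
    track["clips"].append({"start": timestamp, "audio": audio})
    track["clips"].sort(key=lambda clip: clip["start"])
    return timestamp
```

The `threshold` value and the dictionary-based track representation are both placeholders chosen for the sketch.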
In other embodiments of the present application, the audio processing method may further include:
receiving a seventh input of the first target control by the user;
displaying at least one preset audio in response to a seventh input;
receiving an eighth input of a second audio frequency in the at least one preset audio frequency by the user;
in response to an eighth input, a sixth audio track corresponding to the second audio is added.
In some embodiments of the present application, the first target control may be a control for triggering an audio add function, the seventh input may be at least one of a click input, a long press input, a double click input, and the eighth input may be at least one of a click input, a long press input, a double click input.
The audio processing apparatus may enter the audio add function in response to the seventh input and display the sub-options of the audio add function: the at least one preset audio and an audio add button. The user may apply the eighth input to a second audio among the at least one preset audio and to the audio add button, so that the audio processing apparatus directly adds a sixth audio track corresponding to the second audio in the current display interface, for example, below the M audio tracks.
In this way, a new audio track can be added for the selected new audio, so that new sound elements can be added to the target audio and the user can edit the target audio in a personalized manner.
Fig. 15 is a schematic interface diagram illustrating a video processing procedure according to a fifth embodiment of the present application.
As shown in fig. 15, the video display interface displays an image preview window 1501, a first target control 1502, a video track, an audio track 1 of an object A, an audio track 2 of an object B, an audio track 3 of an object C, and processing mode options 1503 corresponding to the audio track 1: a "delete" option, a "split" option, a "speed change" option, a "sound change" option, a "replace" option, and an "add caption" option. The user may long press the first target control 1502, so that at least one preset audio 1504, an "ok" option, and a "cancel" option are displayed within the video display interface. The user may select a preset audio 1504 and click the "ok" option, so that the audio processing apparatus adds an audio track 4 corresponding to the preset audio 1504 within the video display interface.
Therefore, in the embodiments of the present application, voiceprint recognition technology can be used to accurately identify the sound-producing object to which each sub-audio band belongs. The sub-audio bands of different sound-producing objects are recorded on separate tracks according to their sound-producing object identifiers and displayed on different audio tracks, and the video track and the audio tracks are displayed together on the display interface. In this way, the user can perform audio processing on the audio content of different sound-producing objects based on the different audio tracks, which improves video processing efficiency, makes video processing more engaging, and allows the user to independently modify and adjust the audio content corresponding to the video.
In conclusion, the audio processing method provided by the application can record the sounds of different sound-producing objects on different tracks, making audio processing more efficient and convenient, and adds more engaging features to the audio processing workflow to increase user enjoyment.
In the above embodiments, the audio processing method is described by taking the execution subject as the audio processing apparatus as an example. However, the execution subject of the audio processing method provided in the embodiment of the present application is not limited to the audio processing apparatus, and may also be a functional module in the audio processing apparatus for executing each step of the loaded audio processing method.
Fig. 16 shows a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application.
As shown in fig. 16, the audio processing apparatus may include:
an audio obtaining module 1610, configured to obtain a target audio, where the target audio includes N first sub-audio bands, and each first sub-audio band includes the sound of one sound-producing object;
a first adding module 1620, configured to add a sound generating object identifier to the first sub-audio band according to the sound generating object to which the first sub-audio band belongs;
a first display module 1630, configured to display M audio tracks corresponding to a target audio, where each audio track includes at least one first sub-audio band and each audio track corresponds to a sound object identifier;
wherein N is a positive integer, and M is a positive integer less than or equal to N.
In the embodiment of the application, after the target audio is acquired, according to the sounding object to which each first sub-audio frequency band in the target audio belongs, a sounding object identifier is added to each first sub-audio frequency band, and the audio track corresponding to each sounding object identifier is displayed.
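A minimal sketch of the grouping performed by the first adding module and the first display module — N first sub-audio bands collapsed into M audio tracks keyed by sound-object identifier, with M less than or equal to N — might look like this. The function name and dictionary representation are illustrative only:

```python
from collections import OrderedDict

def group_bands_into_tracks(sub_bands):
    """Group N first sub-audio bands by their sound-producing object
    identifier into M tracks (M <= N); each track keeps exactly one
    identifier, preserving first-appearance order."""
    tracks = OrderedDict()
    for band in sub_bands:
        tracks.setdefault(band["object_id"], []).append(band)
    return tracks
```

Each key of the returned mapping corresponds to one displayed audio track, and each value holds that object's sub-audio bands.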
In some embodiments of the present application, the audio processing apparatus may further include:
the characteristic acquisition module is used for acquiring the sound characteristic of the first sub-tone frequency band;
and the object determining module is used for determining a sound production object to which the first sub-sound band belongs according to the sound characteristics.
In some embodiments of the present application, the target audio further includes K second sub-audio bands, the second sub-audio bands include sounds of at least two sound-producing objects, and K is a positive integer;
accordingly, the audio processing apparatus may further include:
the audio separation module is used for carrying out audio separation on the second sub-audio frequency band to obtain at least two third sub-audio frequency bands; wherein a third sub-band comprises a sound of a sound-emitting object;
the second adding module is used for adding a sound production object identifier for the third sub-audio frequency band according to the sound production object to which the third sub-audio frequency band belongs;
wherein the audio track further comprises a third sub-audio band.
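The audio separation module's labeling step can be sketched as below, assuming an upstream source-separation model has already produced one sample stream per sound-producing object (the separation itself is outside this sketch, and all names are hypothetical):

```python
def label_separated_streams(separated, start, end):
    """Wrap each separated stream (the assumed output of a source-
    separation model, one stream per sound-producing object) as a
    third sub-audio band carrying its object identifier and the time
    span of the original second sub-audio band."""
    return [{"object_id": object_id,
             "start": start,
             "end": end,
             "samples": samples}
            for object_id, samples in separated.items()]
```

Each resulting third sub-audio band contains the sound of a single object and can then be appended to that object's audio track.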
In some embodiments of the present application, the audio processing apparatus may further include:
the first receiving module is used for receiving a first input of a user to the first audio track;
the first processing module is used for responding to the first input and carrying out audio processing on at least one sub-audio frequency band in the first audio track according to a first processing mode corresponding to the first input;
wherein the first processing mode comprises at least one of the following: audio deletion, audio speed change, audio sound modification and audio character extraction.
In some embodiments of the present application, the audio processing apparatus may further include:
a second receiving module, configured to receive a second input of the at least one second audio track from the user;
the second processing module is used for responding to the second input and carrying out audio processing on at least one sub-audio frequency band in each second audio track according to a second processing mode corresponding to the second input;
wherein the second processing mode comprises at least one of the following: audio merging and audio deleting.
In some embodiments of the present application, the audio processing apparatus may further include:
the video acquisition module is used for acquiring a target video corresponding to the target audio;
and the second display module is used for displaying an image preview window and a video track corresponding to the target video, and the image preview window is used for displaying the image frame of the target video.
In some embodiments of the present application, the audio processing apparatus may further include:
the third receiving module is used for receiving a third input of a user to a third audio track under the condition that the target image frame is displayed on the image preview window;
the third processing module is used for responding to a third input and carrying out audio processing on a target audio frame in a third audio track according to a third processing mode corresponding to the third input;
the target audio frame is an audio frame with the same time stamp as the target image frame, and the third processing mode includes at least one of the following: audio deletion, audio segmentation, audio speed change, audio sound change, audio replacement and subtitle addition.
In some embodiments of the present application, the audio processing apparatus may further include:
the fourth receiving module is used for receiving a fourth input of the user to at least one fourth audio track under the condition that the target image frame is displayed in the image preview window;
the fourth processing module is used for responding to a fourth input and carrying out audio processing on the target audio frame in each fourth audio track according to a fourth processing mode corresponding to the fourth input;
wherein the target audio frame is an audio frame with the same timestamp as the target image frame, and the fourth processing mode includes at least one of the following: audio merging and audio deleting.
In some embodiments of the present application, the audio processing apparatus may further include:
the fifth receiving module is used for receiving a fifth input of the first target control from the user under the condition that the image preview window displays the target image frame;
the third display module is used for responding to a fifth input and displaying at least one preset audio;
the sixth receiving module is used for receiving sixth input of the user on the first audio in the at least one preset audio;
a fifth processing module to add the first audio to a target track position in a fifth audio track in response to a sixth input;
and the fifth audio track is an audio track with the same sound characteristics as the first audio, and the timestamp corresponding to the target track position is the same as the timestamp of the target image frame.
In some embodiments of the present application, the audio processing apparatus may further include:
the seventh receiving module is used for receiving seventh input of the first target control by the user;
a fourth display module for displaying at least one preset audio in response to a seventh input;
the eighth receiving module is used for receiving an eighth input of the user to a second audio frequency in the at least one preset audio frequency;
and the fifth display module is used for responding to the eighth input and adding a sixth audio track corresponding to the second audio.
In some embodiments of the present application, the audio processing apparatus may further include:
the audio frame acquisition module is used for acquiring at least two audio frames in the target audio;
the characteristic identification module is used for determining the sound characteristic corresponding to each audio frame;
and the audio segment generating module is used for generating at least one sub-audio segment of the target audio based on the continuous audio frames with the same sound characteristics.
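The audio segment generating module's rule — group consecutive audio frames that share the same sound characteristic into one sub-audio band — can be sketched as a simple run-length grouping. In practice the sound characteristics would be voiceprint labels from a classifier; here they are plain labels for illustration:

```python
from itertools import groupby

def generate_sub_bands(frame_characteristics):
    """Generate sub-audio bands by grouping consecutive audio frames
    with the same sound characteristic. Returns a list of
    (characteristic, start_frame, end_frame) spans, end exclusive."""
    bands, index = [], 0
    for characteristic, run in groupby(frame_characteristics):
        length = sum(1 for _ in run)
        bands.append((characteristic, index, index + length))
        index += length
    return bands
```

Note that the same characteristic can yield several sub-audio bands when its frames are interrupted by another object's frames, which is why N bands may later collapse into M tracks with M less than N.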
The audio processing device in the embodiment of the present application may be a device, and may also be a component, an integrated circuit, or a chip in the device. The device can be mobile electronic equipment or non-mobile electronic equipment. By way of example, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palm top computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a Personal Digital Assistant (PDA), and the like, and the non-mobile electronic device may be a server, a Network Attached Storage (NAS), a Personal Computer (PC), a Television (TV), a teller machine or a self-service machine, and the like, and the embodiments of the present application are not particularly limited.
The audio processing apparatus in the embodiment of the present application may be an apparatus having an operating system. The operating system may be an Android (Android) operating system, an ios operating system, or other possible operating systems, and embodiments of the present application are not limited specifically.
The audio processing apparatus provided in the embodiment of the present application can implement each process implemented by the audio processing apparatus in the method embodiments of fig. 1 to fig. 15, and for avoiding repetition, details are not described here again.
In conclusion, the audio processing apparatus provided by the application can record the sounds of different sound-producing objects on different tracks, making audio processing more efficient and convenient, and adds more engaging features to the audio processing workflow to increase user enjoyment.
Optionally, an embodiment of the present application further provides an electronic device, which includes a processor, a memory, and a program or an instruction stored in the memory and capable of running on the processor, where the program or the instruction is executed by the processor to implement each process of the above-mentioned audio processing method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
It should be noted that the electronic devices in the embodiments of the present application include the mobile electronic devices and the non-mobile electronic devices described above.
Fig. 17 is a schematic diagram illustrating a hardware structure of an electronic device implementing an embodiment of the present application.
The electronic device 1700 includes, but is not limited to: radio frequency unit 1701, network module 1702, audio output unit 1703, input unit 1704, sensor 1705, display unit 1706, user input unit 1707, interface unit 1708, memory 1709, and processor 1710.
The input unit 1704 is configured to receive audio or video signals and may include a camera 17041 and a microphone 17042. The display unit 1706 is configured to display information input by the user or information provided to the user and may include a display panel. The user input unit 1707 is configured to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the mobile terminal; it may include a touch panel, which may be overlaid on the display panel. The memory 1709 is configured to store software programs and various data, and may include a program storage area, which stores an operating system and the application programs required by at least one function (such as a sound playing function and an image playing function), and a data storage area, which stores data created according to the use of the mobile phone (such as audio data, video data, and a phone book).
Those skilled in the art will appreciate that the electronic device 1700 may also include a power supply (e.g., a battery) for powering the various components, and that the power supply may be logically coupled to the processor 1710 via a power management system to manage charging, discharging, and power consumption management functions via the power management system. The electronic device structure shown in fig. 17 does not constitute a limitation of the electronic device, and the electronic device may include more or less components than those shown, or combine some components, or arrange different components, and thus, the description thereof is omitted.
The input unit 1704, in some embodiments of the present application, includes an audio acquisition device, such as a microphone 17042, configured to acquire a target audio, where the target audio includes N first sub-audio bands, and each first sub-audio band includes the sound of one sound-producing object;
a processor 1710, configured to add a sound generation object identifier to the first sub-audio band according to the sound generation object to which the first sub-audio band belongs;
a display unit 1706, configured to display M audio tracks corresponding to a target audio, where each audio track includes at least one first sub-audio band, and one audio track corresponds to one sound object identifier;
wherein N is a positive integer, and M is a positive integer less than or equal to N.
In the embodiment of the application, after the target audio is acquired, according to the sounding object to which each first sub-audio frequency band in the target audio belongs, a sounding object identifier is added to each first sub-audio frequency band, and the audio track corresponding to each sounding object identifier is displayed.
In some embodiments of the present application, the processor 1710 is further configured to obtain a sound characteristic of the first sub-audio band, and determine, according to the sound characteristic, a sound emission object to which the first sub-audio band belongs.
In some embodiments of the present application, the target audio further includes K second sub-audio bands, the second sub-audio bands include sounds of at least two sound-producing objects, and K is a positive integer;
correspondingly, the processor 1710 is further configured to perform audio separation on the second sub-audio band, so as to obtain at least two third sub-audio bands; adding a sound production object identifier for the third sub-audio frequency band according to the sound production object to which the third sub-audio frequency band belongs;
wherein a third sub-audio band comprises a sound of a sound object and the audio track further comprises the third sub-audio band.
In some embodiments of the present application, a user input unit 1707 for receiving a first input of a first audio track by a user;
correspondingly, the processor 1710 is further configured to, in response to the first input, perform audio processing on at least one sub-audio band in the first audio track in a first processing manner corresponding to the first input; wherein the first processing mode comprises at least one of the following: audio deletion, audio speed change, audio sound modification and audio character extraction.
In some embodiments of the present application, the user input unit 1707 is further configured to receive a second input from the user to at least one second audio track;
correspondingly, the processor 1710 is further configured to, in response to the second input, perform audio processing on at least one sub-audio band in each second audio track according to a second processing manner corresponding to the second input; wherein the second processing mode comprises at least one of the following: audio merging and audio deleting.
In some embodiments of the present application, the input unit 1704 further includes a video capture device, such as a camera 17041, and the input unit 1704 is further configured to obtain a target video corresponding to the target audio;
correspondingly, the display unit 1706 is further configured to display an image preview window and a video track corresponding to the target video, where the image preview window is used to display image frames of the target video.
In some embodiments of the present application, the user input unit 1707 is further configured to receive a third input from the user to the third audio track in a case where the image preview window displays the target image frame;
correspondingly, the processor 1710 is further configured to, in response to a third input, perform audio processing on the target audio frame in the third audio track in a third processing manner corresponding to the third input;
the target audio frame is an audio frame with the same time stamp as the target image frame, and the third processing mode includes at least one of the following: audio deletion, audio segmentation, audio speed change, audio sound change, audio replacement and subtitle addition.
In some embodiments of the present application, the user input unit 1707 is further configured to receive a fourth input from the user to at least one fourth audio track in a case where the image preview window displays the target image frame;
correspondingly, the processor 1710 is further configured to, in response to the fourth input, perform audio processing on the target audio frame in each fourth audio track in a fourth processing manner corresponding to the fourth input;
wherein the target audio frame is an audio frame with the same timestamp as the target image frame, and the fourth processing mode includes at least one of the following: audio merging and audio deleting.
In some embodiments of the present application, the user input unit 1707 is further configured to receive a fifth input from the user to the first target control in a case where the image preview window displays the target image frame;
correspondingly, the display unit 1706 is further configured to display at least one preset audio in response to the fifth input;
the user input unit 1707 is further configured to receive a sixth input of the first audio of the at least one preset audio by the user;
the processor 1710 is further configured to add the first audio to a target track position in the fifth audio track in response to a sixth input;
and the fifth audio track is an audio track with the same sound characteristics as the first audio, and the timestamp corresponding to the target track position is the same as the timestamp of the target image frame.
In some embodiments of the present application, the user input unit 1707 is further configured to receive a seventh input of the first target control by the user;
correspondingly, the display unit 1706 is further configured to display at least one preset audio in response to the seventh input;
the user input unit 1707 is further configured to receive an eighth input of a second audio of the at least one preset audio by the user;
the display unit 1706 is further configured to add a sixth audio track corresponding to the second audio in response to the eighth input.
In some embodiments of the present application, the processor 1710 is further configured to obtain at least two audio frames in the target audio, determine a sound characteristic corresponding to each audio frame, and generate at least one sub-audio band of the target audio based on consecutive audio frames with the same sound characteristic.
In conclusion, the electronic equipment provided by the application can record the sounds of different sound-producing objects on different tracks, making audio processing more efficient and convenient, and adds more engaging features to the audio processing workflow to increase user enjoyment.
The embodiment of the present application further provides a readable storage medium, where a program or an instruction is stored on the readable storage medium, and when the program or the instruction is executed by a processor, the program or the instruction implements each process of the embodiment of the audio processing method, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
The processor is the processor in the electronic device described in the above embodiment. Readable storage media, including computer-readable storage media, such as Read-Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, etc.
The embodiment of the present application further provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to execute a program or an instruction to implement each process of the above-mentioned audio processing method embodiment, and can achieve the same technical effect, and in order to avoid repetition, the description is omitted here.
It should be understood that the chips mentioned in the embodiments of the present application may also be referred to as system-on-chip, system-on-chip or system-on-chip, etc.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Further, it should be noted that the scope of the methods and apparatus of the embodiments of the present application is not limited to performing the functions in the order illustrated or discussed, but may include performing the functions in a substantially simultaneous manner or in a reverse order based on the functions involved, e.g., the methods described may be performed in an order different than that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (11)

1. An audio processing method, comprising:
acquiring at least two audio frames of a target audio;
determining a sound characteristic corresponding to each audio frame;
generating at least one first sub-audio segment of the target audio based on consecutive audio frames having the same sound characteristic;
acquiring a sound characteristic of the first sub-audio segment;
determining, according to the sound characteristic, a sound-producing object to which the first sub-audio segment belongs;
acquiring the target audio, wherein the target audio comprises N first sub-audio segments, and each first sub-audio segment comprises the sound of one sound-producing object;
adding a sound-producing-object identifier to the first sub-audio segment according to the sound-producing object to which the first sub-audio segment belongs;
displaying M audio tracks corresponding to the target audio, wherein each audio track comprises at least one first sub-audio segment, and each audio track corresponds to one sound-producing-object identifier;
wherein N is a positive integer, and M is a positive integer less than or equal to N.
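The segmentation-and-track-building steps of claim 1 can be sketched as follows. This is an illustrative sketch, not the patented implementation: the `(timestamp, characteristic_label)` frame representation and the label values stand in for the output of a real voiceprint/feature extractor, which the claim leaves unspecified.

```python
from itertools import groupby

def build_tracks(frames):
    """Group consecutive audio frames that share a sound characteristic into
    first sub-audio segments, then bucket the segments into one audio track
    per sound-producing object.

    `frames` is a list of (timestamp, characteristic_label) pairs; the label
    is a stand-in for a real per-frame sound-characteristic extractor.
    """
    tracks = {}  # sound-producing-object identifier -> list of sub-segments
    for label, run in groupby(frames, key=lambda f: f[1]):
        run = list(run)
        tracks.setdefault(label, []).append(
            {"object": label, "start": run[0][0], "end": run[-1][0]}
        )
    return tracks
```

Note that M (the number of tracks) is at most N (the number of segments): each run of consecutive same-characteristic frames yields one segment, but segments sharing an object land on the same track.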
2. The method of claim 1, wherein the target audio further comprises K second sub-audio segments, wherein each second sub-audio segment comprises the sounds of at least two sound-producing objects, and K is a positive integer;
wherein, before the displaying of the M audio tracks corresponding to the target audio, the method further comprises:
performing audio separation on the second sub-audio segment to obtain at least two third sub-audio segments, wherein each third sub-audio segment comprises the sound of one sound-producing object;
adding a sound-producing-object identifier to the third sub-audio segment according to the sound-producing object to which the third sub-audio segment belongs;
wherein the audio track further comprises the third sub-audio segment.
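The separation step of claim 2 can be illustrated with a deliberately simplified sketch. This is a toy stand-in: the mixed segment is represented as already-labeled `(object, samples)` components, so "separation" reduces to regrouping; a real implementation would run a source-separation model on the raw waveform.

```python
def separate_mixed_segment(mixed):
    """Toy stand-in for claim 2's audio separation: the mixed second
    sub-audio segment is given as a list of (object, samples) components,
    so separation only regroups them into one third sub-audio segment per
    sound-producing object.
    """
    by_object = {}
    for obj, samples in mixed:
        by_object.setdefault(obj, []).extend(samples)
    return [{"object": obj, "samples": s} for obj, s in by_object.items()]
```

Each returned third sub-audio segment carries exactly one object, so the identifier-adding step of claim 1 can then be applied to it unchanged.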
3. The method according to any one of claims 1 to 2, wherein after the displaying of the M audio tracks corresponding to the target audio, the method further comprises:
receiving a first input of a user on a first audio track;
in response to the first input, performing audio processing on at least one sub-audio segment in the first audio track according to a first processing mode corresponding to the first input;
wherein the first processing mode comprises at least one of the following: audio deletion, audio speed change, audio voice change, and audio text extraction.
4. The method according to any one of claims 1 to 2, wherein after the displaying of the M audio tracks corresponding to the target audio, the method further comprises:
receiving a second input of the user on at least one second audio track;
in response to the second input, performing audio processing on at least one sub-audio segment in each second audio track according to a second processing mode corresponding to the second input;
wherein the second processing mode comprises at least one of the following: audio merging and audio deletion.
5. The method according to any one of claims 1 to 2, further comprising:
acquiring a target video corresponding to the target audio;
displaying an image preview window and a video track corresponding to the target video, wherein the image preview window is used for displaying image frames of the target video.
6. The method of claim 5, further comprising:
receiving a third input of a user on a third audio track in a case that the image preview window displays a target image frame;
in response to the third input, performing audio processing on a target audio frame in the third audio track according to a third processing mode corresponding to the third input;
wherein the target audio frame is an audio frame having the same timestamp as the target image frame, and the third processing mode comprises at least one of the following: audio deletion, audio segmentation, audio speed change, audio voice change, audio replacement, and subtitle addition.
7. The method of claim 5, further comprising:
receiving a fourth input of the user on at least one fourth audio track in a case that the image preview window displays a target image frame;
in response to the fourth input, performing audio processing on the target audio frame in each fourth audio track according to a fourth processing mode corresponding to the fourth input;
wherein the target audio frame is an audio frame having the same timestamp as the target image frame, and the fourth processing mode comprises at least one of the following: audio merging and audio deletion.
8. The method of claim 5, further comprising:
receiving a fifth input of a user on a first target control in a case that the image preview window displays a target image frame;
displaying at least one preset audio in response to the fifth input;
receiving a sixth input of the user on a first audio of the at least one preset audio;
in response to the sixth input, adding the first audio to a target track position in a fifth audio track;
wherein the fifth audio track is an audio track having the same sound characteristic as the first audio, and the timestamp corresponding to the target track position is the same as the timestamp of the target image frame.
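Claim 8's placement rule can be sketched as follows. The track map, characteristic labels, and clip payloads are illustrative assumptions, not the patented data model: the sketch only shows that the destination track is chosen by matching sound characteristic and the insertion position by matching the displayed image frame's timestamp.

```python
def place_preset_audio(tracks, preset, frame_timestamp):
    """Select the audio track whose sound characteristic matches the chosen
    preset audio (the "fifth audio track") and insert the preset at the
    track position whose timestamp equals that of the currently displayed
    image frame.

    `tracks` maps a sound-characteristic label to a time-ordered list of
    (timestamp, clip) entries.
    """
    track = tracks[preset["characteristic"]]
    track.append((frame_timestamp, preset["clip"]))
    track.sort(key=lambda entry: entry[0])  # keep the track in time order
    return track
```

For example, with the preview window paused on the frame at timestamp 5, the preset lands between existing entries at timestamps 0 and 8 on the matching track.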
9. The method of claim 5, further comprising:
receiving a seventh input of the user on the first target control;
displaying at least one preset audio in response to the seventh input;
receiving an eighth input of the user on a second audio of the at least one preset audio;
in response to the eighth input, adding a sixth audio track corresponding to the second audio.
10. An audio processing apparatus, comprising:
the audio frame acquisition module is used for acquiring at least two audio frames of a target audio;
the feature identification module is used for determining a sound characteristic corresponding to each audio frame;
the audio segment generation module is used for generating at least one first sub-audio segment of the target audio based on consecutive audio frames having the same sound characteristic;
the feature acquisition module is used for acquiring a sound characteristic of the first sub-audio segment;
the object determination module is used for determining, according to the sound characteristic, a sound-producing object to which the first sub-audio segment belongs;
the audio acquisition module is used for acquiring the target audio, wherein the target audio comprises N first sub-audio segments, and each first sub-audio segment comprises the sound of one sound-producing object;
the first adding module is used for adding a sound-producing-object identifier to the first sub-audio segment according to the sound-producing object to which the first sub-audio segment belongs;
the first display module is used for displaying M audio tracks corresponding to the target audio, wherein each audio track comprises at least one first sub-audio segment, and each audio track corresponds to one sound-producing-object identifier;
wherein N is a positive integer, and M is a positive integer less than or equal to N.
11. An electronic device, comprising a processor, a memory, and a program or instructions stored in the memory and executable on the processor, wherein the program or instructions, when executed by the processor, implement the steps of the audio processing method according to any one of claims 1 to 9.
CN202010366389.7A 2020-04-30 2020-04-30 Audio processing method and device and electronic equipment Active CN111526242B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010366389.7A CN111526242B (en) 2020-04-30 2020-04-30 Audio processing method and device and electronic equipment


Publications (2)

Publication Number Publication Date
CN111526242A CN111526242A (en) 2020-08-11
CN111526242B (en) 2021-09-07

Family

ID=71906471

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010366389.7A Active CN111526242B (en) 2020-04-30 2020-04-30 Audio processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111526242B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113395577A (en) * 2020-09-10 2021-09-14 腾讯科技(深圳)有限公司 Sound changing playing method and device, storage medium and electronic equipment
CN112346698B (en) * 2020-10-14 2023-03-21 维沃移动通信有限公司 Audio processing method and device
CN112291615A (en) * 2020-10-30 2021-01-29 维沃移动通信有限公司 Audio output method and audio output device
CN112367551B (en) * 2020-10-30 2023-06-16 维沃移动通信有限公司 Video editing method and device, electronic equipment and readable storage medium
CN112423081B (en) * 2020-11-09 2021-11-05 腾讯科技(深圳)有限公司 Video data processing method, device and equipment and readable storage medium
CN112416229A (en) * 2020-11-26 2021-02-26 维沃移动通信有限公司 Audio content adjusting method and device and electronic equipment
CN112506412B (en) * 2020-12-07 2022-09-30 北京达佳互联信息技术有限公司 Video editing method and device and electronic equipment
CN112885369A (en) * 2021-01-26 2021-06-01 维沃移动通信有限公司 Audio processing method and audio processing device
CN115278350A (en) * 2021-04-29 2022-11-01 华为技术有限公司 Rendering method and related equipment
US11546715B2 (en) 2021-05-04 2023-01-03 Google Llc Systems and methods for generating video-adapted surround-sound
CN113573096A (en) * 2021-07-05 2021-10-29 维沃移动通信(杭州)有限公司 Video processing method, video processing device, electronic equipment and medium
CN114124875B (en) * 2021-11-04 2023-12-19 维沃移动通信有限公司 Voice message processing method, device, electronic equipment and medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2503545A1 (en) * 2011-03-21 2012-09-26 Sony Ericsson Mobile Communications AB Arrangement and method relating to audio recognition
CN104123115A (en) * 2014-07-28 2014-10-29 联想(北京)有限公司 Audio information processing method and electronic device
CN104866477A (en) * 2014-02-21 2015-08-26 联想(北京)有限公司 Information processing method and electronic equipment
CN106686404A (en) * 2016-12-16 2017-05-17 中兴通讯股份有限公司 Video analysis platform, matching method, accurate advertisement delivery method and system
CN108174236A (en) * 2017-12-22 2018-06-15 维沃移动通信有限公司 A kind of media file processing method, server and mobile terminal
CN108228776A (en) * 2017-12-28 2018-06-29 广东欧珀移动通信有限公司 Data processing method, device, storage medium and electronic equipment
CN108521603A (en) * 2018-04-20 2018-09-11 深圳市零度智控科技有限公司 DTV and its playback method and computer readable storage medium
CN111010608A (en) * 2019-12-20 2020-04-14 维沃移动通信有限公司 Video playing method and electronic equipment



Similar Documents

Publication Publication Date Title
CN111526242B (en) Audio processing method and device and electronic equipment
US9576569B2 (en) Playback control apparatus, playback control method, and medium for playing a program including segments generated using speech synthesis
CN105828101B (en) Generate the method and device of subtitle file
US20160336039A1 (en) Systems and methods for creating music videos synchronized with an audio track
CN107193841A (en) Media file accelerates the method and apparatus played, transmit and stored
US20180295427A1 (en) Systems and methods for creating composite videos
CN109147745B (en) Song editing processing method and device, electronic equipment and storage medium
US20170236551A1 (en) Systems and methods for creating composite videos
CN106155470B (en) A kind of audio file generation method and device
CN1856065B (en) Video processing apparatus
CN112269898A (en) Background music obtaining method and device, electronic equipment and readable storage medium
CN109413478A (en) Video editing method, device, electronic equipment and storage medium
CN113411516A (en) Video processing method and device, electronic equipment and storage medium
CN110943908A (en) Voice message sending method, electronic device and medium
CN113676668A (en) Video shooting method and device, electronic equipment and readable storage medium
KR101727587B1 (en) A Method Generating Transcripts Of Digital Recording File
JP2001143451A (en) Automatic index generating device and automatic index applying device
CN110324702B (en) Information pushing method and device in video playing process
JP6641045B1 (en) Content generation system and content generation method
CN112287129A (en) Audio data processing method and device and electronic equipment
CN111726696A (en) Application method, device and equipment of sound barrage and readable storage medium
KR101781353B1 (en) A Method Generating Digital Recording File Having Integrity
CN113573096A (en) Video processing method, video processing device, electronic equipment and medium
CN113535116A (en) Audio file playing method and device, terminal and storage medium
CN110109645A (en) A kind of interactive music audition method, device and terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant