CN115101094A - Audio processing method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN115101094A
Authority
CN
China
Prior art keywords
audio
candidate
segments
target
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210697617.8A
Other languages
Chinese (zh)
Inventor
范欣悦
陈联武
郑羲光
张晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202210697617.8A priority Critical patent/CN115101094A/en
Publication of CN115101094A publication Critical patent/CN115101094A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The disclosure relates to an audio processing method and device, electronic equipment and a storage medium. The audio processing method comprises the following steps: performing audio division on audio to be processed to obtain a plurality of candidate audio segments, wherein different candidate audio segments correspond to different audio tags, and the audio tags are determined by performing feature clustering on the audio to be processed; for each candidate audio segment, determining a feature evaluation result of the candidate audio segment on a preset audio evaluation feature; and determining a target audio segment from the plurality of candidate audio segments according to the feature evaluation result, and taking the target audio segment as an audio segment to be used. According to the audio processing method and device, the selection accuracy of the target audio segment can be improved.

Description

Audio processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of audio and video technology. More particularly, the present disclosure relates to an audio processing method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
Background
Highlight segments generally refer to the most exciting and appealing parts of a piece of audio, such as the chorus of a song. Because the duration of a short video is limited, highlight segments are often selected as its background music. Currently, a few bars of a song's chorus are manually selected as the highlight segment of each song in order to provide background music for short videos. With manual selection, the highlight segments cannot be selected accurately.
Disclosure of Invention
Exemplary embodiments of the present disclosure are directed to providing an audio processing method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product, to at least solve the problem in the related art that highlight segments cannot be selected accurately.
According to an exemplary embodiment of the present disclosure, there is provided an audio processing method including: performing audio division on audio to be processed to obtain a plurality of candidate audio segments, wherein different candidate audio segments correspond to different audio tags, and the audio tags are determined by performing feature clustering on the audio to be processed; for each candidate audio segment, determining a feature evaluation result of the candidate audio segment on a preset audio evaluation feature; and determining a target audio segment from the plurality of candidate audio segments according to the feature evaluation result, and taking the target audio segment as an audio segment to be used.
Optionally, there are a plurality of the audio evaluation features. The determining a target audio segment from the plurality of candidate audio segments according to the feature evaluation result may include: determining a target feature evaluation result of each candidate audio segment according to the feature evaluation results of the candidate audio segment under the plurality of audio evaluation features; comparing the target feature evaluation results of the plurality of candidate audio segments, and selecting a target feature evaluation result meeting a preset condition; and determining, as the target audio segment, the candidate audio segment corresponding to the target feature evaluation result meeting the preset condition among the plurality of candidate audio segments.
Optionally, the audio evaluation feature may include at least one of average loudness, spectral brightness, and drum point intensity.
Optionally, the determining a target audio segment from a plurality of candidate audio segments may include: carrying out sound-accompaniment separation processing on each candidate audio clip to obtain a content component of each candidate audio clip, wherein the content component comprises at least one of a target object component, an accompaniment component and silence; determining the target audio segment from the candidate audio segments containing the target object component.
Optionally, the audio dividing the audio to be processed to obtain a plurality of candidate audio segments may include: carrying out audio structure division on the audio to be processed to obtain a plurality of audio structure segments; clustering audio features corresponding to the plurality of audio structure segments, and determining an audio label of each audio structure segment; and combining the adjacent audio structure segments with the same audio label to obtain the plurality of candidate audio segments.
Optionally, the audio structure division is performed on the audio to be processed to obtain a plurality of audio structure segments, which may include: extracting a Mel cepstral coefficient feature spectrum of the audio to be processed; determining a self-similarity matrix of the audio to be processed based on the Mel cepstral coefficient feature spectrum; performing preset processing on the self-similarity matrix to obtain an enhanced self-similarity matrix; determining local and global relationships of the enhanced self-similarity matrix; and determining cut points of the audio to be processed based on the local and global relationships to obtain the plurality of audio structure segments.
Optionally, the determining a cut point of the audio to be processed based on the local and global relationships may include: determining structural change data of the audio to be processed based on the local and global relationships; determining a cut point of the audio to be processed based on the structural change data.
Optionally, after the audio division is performed on the audio to be processed to obtain the plurality of candidate audio segments, the audio processing method may further include: detecting downbeats of the audio to be processed; and adjusting the start positions and the end positions of the plurality of candidate audio segments based on the downbeats to obtain a plurality of adjusted candidate audio segments. The determining a target audio segment from the plurality of candidate audio segments may include: selecting at least one candidate audio segment from the adjusted plurality of candidate audio segments as the target audio segment.
According to an exemplary embodiment of the present disclosure, there is provided an audio processing apparatus including: the segment dividing unit is configured to perform audio division on the audio to be processed to obtain a plurality of candidate audio segments, wherein different candidate audio segments correspond to different audio tags, and the audio tags are determined by performing feature clustering on the audio to be processed; the segment evaluation unit is configured to determine a feature evaluation result of each candidate audio segment on a preset audio evaluation feature for each candidate audio segment; and the segment determining unit is configured to determine a target audio segment from the candidate audio segments according to the characteristic evaluation result, and the target audio segment is used as an audio segment to be used.
Optionally, there may be a plurality of the audio evaluation features. The segment determination unit may be configured to: determine a target feature evaluation result of each candidate audio segment according to the feature evaluation results of the candidate audio segment under the plurality of audio evaluation features; compare the target feature evaluation results of the plurality of candidate audio segments, and select a target feature evaluation result meeting a preset condition; and determine, as the target audio segment, the candidate audio segment corresponding to the target feature evaluation result meeting the preset condition among the plurality of candidate audio segments.
Optionally, the audio evaluation feature may include at least one of average loudness, spectral brightness, and drum point intensity.
Optionally, the segment determining unit may be configured to: carrying out sound-accompaniment separation processing on each candidate audio clip to obtain a content component of each candidate audio clip, wherein the content component comprises at least one of a target object component, an accompaniment component and silence; determining the target audio segment from the candidate audio segments containing the target object component.
Optionally, the fragment dividing unit may be configured to: carrying out audio structure division on the audio to be processed to obtain a plurality of audio structure segments; clustering audio features corresponding to the plurality of audio structure segments, and determining an audio label of each audio structure segment; and combining the adjacent audio structure segments with the same audio label to obtain the plurality of candidate audio segments.
Optionally, the fragment dividing unit may be configured to: extract a Mel cepstral coefficient feature spectrum of the audio to be processed; determine a self-similarity matrix of the audio to be processed based on the Mel cepstral coefficient feature spectrum; perform preset processing on the self-similarity matrix to obtain an enhanced self-similarity matrix; determine local and global relationships of the enhanced self-similarity matrix; and determine cut points of the audio to be processed based on the local and global relationships to obtain the plurality of audio structure segments.
Optionally, the fragment dividing unit may be configured to: determining structural change data of the audio to be processed based on the local and global relationships; determining a cut point of the audio to be processed based on the structural change data.
Optionally, the audio processing apparatus may further include: a segment adjusting unit configured to detect downbeats of the audio to be processed, and adjust the start positions and the end positions of the plurality of candidate audio segments based on the downbeats to obtain a plurality of adjusted candidate audio segments. The segment determining unit may be configured to: select at least one candidate audio segment from the adjusted plurality of candidate audio segments as the target audio segment.
According to an exemplary embodiment of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement an audio processing method according to an exemplary embodiment of the present disclosure.
According to an exemplary embodiment of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor of an electronic device, causes the electronic device to execute an audio processing method according to an exemplary embodiment of the present disclosure.
According to an exemplary embodiment of the present disclosure, a computer program product is provided, comprising computer programs/instructions which, when executed by a processor, implement an audio processing method according to an exemplary embodiment of the present disclosure.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
The candidate audio segments corresponding to the audio tags are obtained by performing audio division and feature clustering on the audio to be processed, so that the integrity of candidate audio segments of the same type can be ensured. For each candidate audio segment, a feature evaluation result of the candidate audio segment on a preset audio evaluation feature is determined, and a target audio segment is determined from the plurality of candidate audio segments according to the feature evaluation results. Since the audio evaluation feature can represent the degree of emotional expression of a candidate audio segment, screening out the target audio segment from the candidate audio segments based on the audio evaluation feature and taking it as the highlight segment can improve the selection accuracy of the target audio segment.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 illustrates an exemplary system architecture to which exemplary embodiments of the present disclosure may be applied.
Fig. 2 illustrates a flowchart of an audio processing method according to an exemplary embodiment of the present disclosure.
Fig. 3 illustrates an example of an enhanced self-similarity matrix for a MFCC spectrum of a segment of audio according to an exemplary embodiment of the present disclosure.
FIG. 4 illustrates a process of changing a self-similarity matrix of a song to structural features to structural change data according to an exemplary embodiment of the present disclosure.
Fig. 5 illustrates an example of a deep audio separation network according to an exemplary embodiment of the present disclosure.
Fig. 6 shows a block diagram of an audio processing system according to an exemplary embodiment of the present disclosure.
Fig. 7 illustrates a block diagram of an audio processing apparatus according to an exemplary embodiment of the present disclosure.
Fig. 8 is a block diagram of an electronic device 800 according to an example embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The embodiments described in the following examples do not represent all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
In the present disclosure, the phrase "at least one of a plurality of items" covers three parallel cases: "any one of the plurality of items", "a combination of any several of the plurality of items", and "all of the plurality of items". For example, "including at least one of A and B" covers the following three parallel cases: (1) including A; (2) including B; (3) including A and B. For another example, "performing at least one of step one and step two" covers the following three parallel cases: (1) performing step one; (2) performing step two; (3) performing step one and step two.
Hereinafter, an audio processing method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product according to exemplary embodiments of the present disclosure will be described in detail with reference to fig. 1 to 8.
Fig. 1 illustrates an exemplary system architecture 100 in which exemplary embodiments of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few. A user may use the terminal devices 101, 102, 103 to interact with the server 105 over the network 104 to receive or send messages (e.g., audio processing requests), etc. Various audio and video applications can be installed on the terminal devices 101, 102, 103. The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a display screen and capable of playing, recording, editing, etc. audio and video, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, etc. When the terminal device 101, 102, 103 is software, it may be installed in the electronic devices listed above, it may be implemented as a plurality of software or software modules (for example, to provide distributed services), or it may be implemented as a single software or software module. And is not particularly limited herein.
The terminal devices 101, 102, 103 may be equipped with an image capture device (e.g., a camera) to capture video data. In practice, the smallest visual unit that makes up a video is a Frame (Frame). Each frame is a static image. Temporally successive sequences of frames are composited together to form a dynamic video. Further, the terminal apparatuses 101, 102, 103 may also be mounted with a component (e.g., a speaker) for converting an electric signal into sound to play the sound, and may also be mounted with a device (e.g., a microphone) for converting an analog audio signal into a digital audio signal to pick up the sound.
The server 105 may be a server providing various services, such as a background server providing support for audio-video applications installed on the terminal devices 101, 102, 103. The backend server may perform parsing, storage, and other processing on the received data such as the audio processing request, and may also feed back the audio processing result to the terminal devices 101, 102, and 103.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be noted that the audio processing method provided by the embodiment of the present disclosure is generally executed by a terminal device, but may also be executed by a server, or may also be executed by cooperation of the terminal device and the server. Accordingly, the audio processing means may be provided in the terminal device, in the server, or in both the terminal device and the server.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation, and the disclosure is not limited thereto.
Fig. 2 illustrates a flowchart of an audio processing method according to an exemplary embodiment of the present disclosure.
Referring to fig. 2, in step S201, an audio to be processed is divided into multiple candidate audio segments. Here, different candidate audio segments correspond to different audio tags, and the audio tags are determined by performing feature clustering on the audio to be processed.
In an exemplary embodiment of the present disclosure, when audio division is performed on the audio to be processed, the audio to be processed may first be subjected to audio structure division to obtain a plurality of audio structure segments; the audio features corresponding to the plurality of audio structure segments are clustered to determine an audio tag for each audio structure segment; and adjacent audio structure segments having the same audio tag are then merged to obtain the plurality of candidate audio segments, thereby improving the accuracy of audio division. In one implementation, k-means clustering may be performed on the Mel cepstral coefficient features of the divided audio structure segments, with the parameter k of the k-means clustering set to 4, and the audio tag of each audio structure segment is obtained through the clustering.
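As a minimal sketch of this clustering step (not part of the original disclosure), the following Python code assumes librosa and scikit-learn are available and that the structure-segment boundaries (in seconds) have already been obtained; the function and variable names are hypothetical, and summarizing each segment by its mean MFCC vector is an added assumption.

```python
import librosa
import numpy as np
from sklearn.cluster import KMeans

def label_structure_segments(y, sr, boundaries, k=4):
    """Assign an audio tag (cluster label) to each structure segment.

    boundaries: list of (start_sec, end_sec) tuples, one per structure segment.
    """
    segment_features = []
    for start, end in boundaries:
        seg = y[int(start * sr):int(end * sr)]
        mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=20)
        # Summarize each segment by its mean MFCC vector over time (an assumption;
        # the text only says the MFCC features of the segments are clustered).
        segment_features.append(mfcc.mean(axis=1))
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(
        np.stack(segment_features))
    return labels  # e.g. array([0, 1, 0, 2, 1, 3, ...]), one tag per segment
```

Adjacent segments that receive the same label would then be merged into a single candidate audio segment.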
In the exemplary embodiment of the present disclosure, when audio structure division is performed on an audio to be processed to obtain a plurality of audio structure segments, a mel cepstrum coefficient feature spectrum of the audio to be processed may be first extracted, a self-similarity matrix of the audio to be processed is determined based on the mel cepstrum coefficient feature spectrum, the self-similarity matrix is subjected to preset processing to obtain an enhanced self-similarity matrix, a local and global relationship of the enhanced self-similarity matrix is determined, then a cutting point of the audio to be processed is determined based on the local and global relationship, and a plurality of audio structure segments are obtained, thereby improving accuracy of audio division.
In the exemplary embodiment of the present disclosure, when determining the cut point of the audio to be processed based on the local and global relationships, the structural change data of the audio to be processed may be first determined based on the local and global relationships, and then the cut point of the audio to be processed may be determined based on the structural change data, thereby improving the accuracy of the cut point, and further improving the accuracy of audio division.
In an exemplary embodiment of the present disclosure, audio (e.g., a song) is typically composed of some of the following elements: intro, verse, pre-chorus, chorus, bridge, and outro. The intro is the opening of the audio and lays the foundation for its rhythm and melody. The verse is the narrative part of the audio; it introduces the main information of the audio to the listener and drives the audio forward. The chorus (refrain) is the most brilliant part of the audio and the part most deeply remembered; it forms a sharp contrast with the verse and is usually repeated many times in the audio. The bridge is a connecting section, usually purely instrumental with no vocals, and may serve as the connecting point of a key change. The outro is the ending of the audio.
Considering that the verse and the chorus in the audio are repeated many times, the division of the audio structure may be implemented by a self-similarity matrix (SSM). In one implementation, the self-similarity matrix may be calculated from the Mel-Frequency Cepstral Coefficient (MFCC) feature spectrum.
When extracting the mel frequency cepstrum coefficient characteristic spectrum of the audio, firstly, the audio is converted to mel frequency, and then the cepstrum analysis is carried out. Cepstral analysis is often applied to speech recognition, audio genre classification, and audio signal similarity measurements, and the low-order signal features of cepstral analysis include timbre and loudness. The cepstrum analysis can be performed by the following formula.
X=MFCC(x)
Here, x is the audio time-domain signal, and X is the Mel-frequency cepstral coefficient feature spectrum.
The self-similarity matrix can be calculated by the following formula.
S = X^T X
Here, S denotes the self-similarity matrix, X denotes the Mel cepstral coefficient feature spectrum, and the superscript T denotes transposition.
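A minimal sketch of these two formulas in Python, assuming librosa and numpy are available; the per-frame normalization is an added assumption (the text only states S = X^T X) and simply keeps the similarity values in a comparable range.

```python
import librosa
import numpy as np

def mfcc_self_similarity(y, sr, n_mfcc=20):
    """Compute X = MFCC(x) and the self-similarity matrix S = X^T X."""
    X = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # shape: (n_mfcc, n_frames)
    # Normalize each frame vector (assumption, not stated in the text).
    X = X / (np.linalg.norm(X, axis=0, keepdims=True) + 1e-9)
    S = X.T @ X                                            # shape: (n_frames, n_frames)
    return X, S
```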
After the self-similarity matrix is obtained, an enhanced self-similarity matrix (Enhanced SSM) can be obtained through a series of operations such as feature smoothing, path smoothing, and threshold setting. Fig. 3 illustrates an example of an enhanced self-similarity matrix of the MFCC spectrum of a piece of audio according to an exemplary embodiment of the present disclosure.
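The exact smoothing and thresholding scheme is not specified in the text; the following is one simple possibility (diagonal path smoothing followed by percentile thresholding), with diag_len and keep_percentile as hypothetical parameters.

```python
import numpy as np

def enhance_ssm(S, diag_len=8, keep_percentile=90):
    """Enhance a self-similarity matrix: smooth along diagonals, then threshold."""
    N = S.shape[0]
    acc = np.zeros_like(S)
    for d in range(diag_len):
        # Accumulate copies of S shifted along the main-diagonal direction,
        # which reinforces repeated (diagonal) structures such as chorus repeats.
        acc[:N - d, :N - d] += S[d:, d:]
    smoothed = acc / diag_len
    # Keep only the strongest similarities; zero out the rest.
    threshold = np.percentile(smoothed, keep_percentile)
    return np.where(smoothed >= threshold, smoothed, 0.0)
```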
After the self-similarity matrix is enhanced, its local and global relationships can be captured for segmentation. In one implementation, the structure feature (Structure Feature) may be obtained by computing a circular time-lag matrix. For example, the circular time-lag matrix may be calculated by the formula
L°(l, n) = S((n + l) mod N, n)
Here, S denotes the self-similarity matrix, whose dimension is N × N, mod denotes the modulo operation, n denotes a sampling point (frame index), l is a positive integer lag, and L°(l, n) denotes the circular time-lag matrix.
Then, the circular time-lag matrix can be smoothed and the 2-norm of its first-order difference can be calculated to obtain the structural change data (novelty function) of the song structure features. For example, the structural change data may be calculated by the formula
ΔStructure(n) := ||L°[n+1] - L°[n]||
Here, ΔStructure(n) represents the structural change data, n represents a sampling point (frame index), and L°[n+1] and L°[n] represent columns of the song structure feature, i.e., of the circular time-lag matrix.
Fig. 4 illustrates the process of going from the self-similarity matrix of a song to the structure features and then to the structural change data according to an exemplary embodiment of the present disclosure. In Fig. 4, the structural change data is shown in the form of a novelty curve. For example, the segmentation time points can be determined from the peaks of the novelty curve in Fig. 4.
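Read literally, the two formulas above and the peak picking might be sketched as follows; the Gaussian smoothing width and the minimum peak distance are assumptions, not values given in the text.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import find_peaks

def structure_cut_points(S, sigma=4, min_distance=16):
    """Circular time-lag matrix, novelty curve, and candidate cut points."""
    N = S.shape[0]
    lag = np.zeros_like(S)
    for l in range(N):
        # L°(l, n) = S((n + l) mod N, n)
        lag[l] = S[(np.arange(N) + l) % N, np.arange(N)]
    lag = gaussian_filter1d(lag, sigma=sigma, axis=1)        # smooth over time
    # ΔStructure(n) = ||L°[n+1] - L°[n]|| (2-norm of the first-order difference)
    novelty = np.linalg.norm(np.diff(lag, axis=1), axis=0)
    peaks, _ = find_peaks(novelty, distance=min_distance)
    return peaks, novelty   # peak frame indices are the candidate cut points
```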
In an exemplary embodiment of the present disclosure, after the audio division is performed on the audio to be processed to obtain the plurality of candidate audio segments, the downbeats of the audio to be processed may further be detected, and the start positions and end positions of the plurality of candidate audio segments are adjusted based on the downbeats to obtain the adjusted plurality of candidate audio segments, thereby improving the accuracy of the candidate audio segments.
A beat is an organized unit of audio with a fixed duration and intensity, and it defines the prosodic structure of the audio (e.g., of a musical piece). In audio, tempo is characterized by a repeating sequence of beats and non-beats, while a downbeat refers to a strong beat in the audio. In one implementation, beat and downbeat detection may be performed based on a combination of a convolutional recurrent neural network and deep belief network post-processing. For example, the beats and downbeats of the audio to be processed can be detected by inputting the audio to be processed into a combined model of a convolutional recurrent neural network and a deep belief network.
In order to ensure that each candidate audio segment starts on a downbeat, the downbeat time closest to the start time of each candidate audio segment can be taken as the fine-tuned start time of that segment, and the downbeat time closest to the end time of each candidate audio segment can be taken as the fine-tuned end time of that segment.
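A sketch of this boundary fine-tuning, assuming the downbeat times (in seconds) have already been produced by a detector such as the combined model mentioned above; since the detector interface is not specified here, the downbeat array is simply taken as an input.

```python
import numpy as np

def snap_to_downbeats(segments, downbeat_times):
    """Move each candidate segment's start and end to the nearest downbeat.

    segments: list of (start_sec, end_sec); downbeat_times: 1-D array of seconds.
    """
    downbeat_times = np.asarray(downbeat_times, dtype=float)
    adjusted = []
    for start, end in segments:
        new_start = downbeat_times[np.argmin(np.abs(downbeat_times - start))]
        new_end = downbeat_times[np.argmin(np.abs(downbeat_times - end))]
        if new_end > new_start:            # drop segments that collapse to nothing
            adjusted.append((float(new_start), float(new_end)))
    return adjusted
```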
In step S202, for each candidate audio segment, a feature evaluation result of the candidate audio segment on a preset audio evaluation feature is determined. Here, there may be a plurality of preset audio evaluation features; alternatively, there may be a single audio evaluation feature.
In an exemplary embodiment of the present disclosure, the audio evaluation feature may include at least one of average loudness, spectral brightness, and drum point intensity, thereby improving the accuracy of the feature evaluation result. In addition, the audio evaluation features may also include other evaluation features, which are not limited by this disclosure.
Considering that the highlight segment of a popular song is usually the part with the strongest emotion, its timbre is usually the fullest, its overall tone is brighter, its loudness is the greatest, its drum beats are heavier, and its low frequencies are the fullest. Therefore, the candidate audio segments can be comprehensively scored by calculating their audio features in terms of average loudness, spectral brightness, drum point intensity, and the like.
For example, the average loudness of a candidate audio segment may be calculated by the following formula.
RMS[m] = sqrt( (1/N) * Σ_{n=0}^{N-1} ( x[n + m·h] · ω[n] )² )
Here, RMS[m] denotes the average loudness of the m-th frame, x is the input audio signal (the candidate audio segment is divided into a plurality of frames), N denotes the length of one frame (i.e., the window length), m denotes the frame index, h denotes the number of sampling points between adjacent frames (the hop size), n denotes a sampling point, and ω[n] denotes the window function.
For example, the spectral brightness of a candidate audio segment may be calculated by the following formula.
SC[m] = Σ_k f_k · |X[k, m]| / Σ_k |X[k, m]|
Here, SC[m] represents the spectral brightness (spectral centroid) of the m-th frame, f_k is the k-th frequency of the short-time Fourier transform, X is the short-time Fourier transform spectrum of the candidate audio segment, and m is the frame index.
For example, the drum point intensity of a candidate audio segment may be calculated by the following formula.
Δspectral(m) = Σ_k ( |X[k, m]| - |X[k, m-1]| )
Here, Δspectral(m) represents the drum point intensity (spectral flux) of the m-th frame, X is the short-time Fourier transform spectrum of the candidate audio segment, and m is the frame index.
Finally, weights are assigned to the average loudness, the spectral brightness, and the drum point intensity respectively, the candidate audio segments of the audio are scored, and the segment with the highest score is selected as the highlight segment.
For example, candidate audio segments of audio may be scored by the following formula.
Score=k1*RMS+k2*SC+k3*Δspectral
(k1+k2+k3=1)
Here, Score denotes the scoring result, k1 denotes the weight of average loudness, k2 denotes the weight of spectral luminance, k3 denotes the weight of drum point intensity, RMS denotes the average loudness, SC denotes the spectral luminance, and Δ spectral denotes the drum point intensity.
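A per-segment scoring sketch under the formulas above; librosa's built-in RMS, spectral-centroid, and onset-strength functions are used here as stand-ins for the three features, and the example weights and the min-max normalization across segments are added assumptions.

```python
import librosa
import numpy as np

def score_segments(y, sr, segments, k1=0.4, k2=0.3, k3=0.3):
    """Weighted score per candidate segment: loudness, brightness, drum intensity."""
    feats = []
    for start, end in segments:
        seg = y[int(start * sr):int(end * sr)]
        rms = librosa.feature.rms(y=seg).mean()                      # average loudness
        sc = librosa.feature.spectral_centroid(y=seg, sr=sr).mean()  # spectral brightness
        flux = librosa.onset.onset_strength(y=seg, sr=sr).mean()     # drum point intensity
        feats.append([rms, sc, flux])
    feats = np.asarray(feats)
    # Normalize each feature across segments so the weights are comparable
    # (an assumption; the text only gives the weighted sum with k1 + k2 + k3 = 1).
    feats = (feats - feats.min(axis=0)) / (np.ptp(feats, axis=0) + 1e-9)
    return k1 * feats[:, 0] + k2 * feats[:, 1] + k3 * feats[:, 2]
```

The candidate segment with the highest score, e.g. segments[int(np.argmax(scores))], would then be taken as the highlight (target) segment.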
In step S203, a target audio segment is determined from the multiple candidate audio segments according to the feature evaluation result, and the target audio segment is used as an audio segment to be used.
In the exemplary embodiment of the disclosure, when a target audio segment is determined from a plurality of candidate audio segments according to a feature evaluation result, a target feature evaluation result of each candidate audio segment may be determined according to a feature evaluation result under a plurality of audio evaluation features of each candidate audio segment, the target feature evaluation results of the plurality of candidate audio segments are compared, a target feature evaluation result meeting a preset condition is selected, and then a candidate audio segment corresponding to the target feature evaluation result meeting the preset condition in the plurality of candidate audio segments is determined as the target audio segment, thereby improving the accuracy of the target audio segment.
In the exemplary embodiment of the present disclosure, when determining a target audio segment from a plurality of candidate audio segments, a sound-accompaniment separation process may be performed on each candidate audio segment first to obtain a content component of each candidate audio segment, where the content component includes at least one of a target object component, an accompaniment component, and silence, and then the target audio segment may be determined from the candidate audio segments including the target object component, so as to reduce interference of non-target object components. As an example, the target object component may be, for example, but not limited to, a human voice component, an animal voice component, or the like.
For example, the human voice and the sounds of various instruments in the audio can be separated by a deep audio separation network, such as separating the input audio signal into 4 parts such as human voice, drums, bass, and the like. Fig. 5 illustrates an example of a deep audio separation network according to an exemplary embodiment of the present disclosure. The deep audio separation network in Fig. 5 is constructed based on a U-Net (a variant of the fully convolutional network, FCN). As shown in Fig. 5, the audio X is fed into a complex-valued network, the output of the complex-valued network is used as a mask, and the masking result is combined with the audio X to obtain the audio separation result.
Thereafter, the pure vocal part of each candidate audio segment may first be obtained by the vocal-accompaniment separation technique. Candidate audio segments that contain no vocals are then deleted, so that every remaining candidate audio segment contains vocals and a bridge segment is prevented from being mistakenly detected as the chorus.
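A sketch of this filter, assuming a source-separation model (which model is used is not specified here) has already produced a vocal stem for each candidate segment; the energy threshold is a hypothetical parameter.

```python
import numpy as np

def drop_non_vocal_segments(segments, vocal_stems, rms_threshold=0.01):
    """Keep only candidate segments whose separated vocal stem carries energy.

    segments: list of (start_sec, end_sec);
    vocal_stems: list of 1-D arrays, the separated vocal signal of each segment.
    """
    kept = []
    for seg, vocal in zip(segments, vocal_stems):
        rms = np.sqrt(np.mean(np.asarray(vocal, dtype=float) ** 2))
        if rms > rms_threshold:            # vocals present, so keep the segment
            kept.append(seg)
    return kept
```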
In an exemplary embodiment of the present disclosure, after the audio division is performed on the audio to be processed to obtain the plurality of candidate audio segments, and in the case where the downbeats of the audio to be processed have been detected and the adjusted plurality of candidate audio segments have been obtained, at least one candidate audio segment may be selected from the adjusted plurality of candidate audio segments as the target audio segment when the target audio segment is determined, thereby improving the selection effect of the target audio segment.
Fig. 6 shows a block diagram of an audio processing system according to an exemplary embodiment of the present disclosure.
As shown in Fig. 6, after the audio is input into the audio processing system, the audio is first cut by the audio structure analysis technique to obtain different audio structure segments. The audio structure segments are then clustered by the audio segment clustering technique to obtain different audio tags, and adjacent audio segments of the same category are merged to ensure the integrity of the audio segments. Next, the start and end positions of the audio segments are fine-tuned with the help of the downbeat detection result of the neural network, ensuring that each audio segment starts and ends on a downbeat. Finally, the divided audio segments are understood, analyzed, and compared, and the audio segments without vocals are removed by the vocal-accompaniment separation technique to obtain the highlight segment of the input audio.
The audio processing method according to the exemplary embodiment of the present disclosure has been described above with reference to fig. 1 to 6. Hereinafter, an audio processing apparatus and units thereof according to an exemplary embodiment of the present disclosure will be described with reference to fig. 7.
Fig. 7 shows a block diagram of an audio processing device according to an exemplary embodiment of the present disclosure.
Referring to fig. 7, the audio processing apparatus includes a section dividing unit 71, a section evaluating unit 72, and a section determining unit 73.
The segment dividing unit 71 is configured to perform audio division on the audio to be processed, resulting in a plurality of candidate audio segments. Here, different candidate audio segments correspond to different audio tags, and the audio tags are determined by performing feature clustering on the audio to be processed.
In an exemplary embodiment of the present disclosure, the segmentation dividing unit 71 may be configured to: carrying out audio structure division on audio to be processed to obtain a plurality of audio structure fragments; clustering audio features corresponding to the multiple audio structure segments, and determining an audio label of each audio structure segment; and combining the adjacent audio structure segments with the same audio label to obtain a plurality of candidate audio segments.
In an exemplary embodiment of the present disclosure, the segmentation dividing unit 71 may be configured to: extract a Mel cepstral coefficient feature spectrum of the audio to be processed; determine a self-similarity matrix of the audio to be processed based on the Mel cepstral coefficient feature spectrum; perform preset processing on the self-similarity matrix to obtain an enhanced self-similarity matrix; determine local and global relationships of the enhanced self-similarity matrix; and determine cut points of the audio to be processed based on the local and global relationships to obtain a plurality of audio structure segments.
In an exemplary embodiment of the present disclosure, the fragment dividing unit 71 may be configured to: determining structural change data of the audio to be processed based on the local and global relationships; and determining a cutting point of the audio to be processed based on the structural change data.
In an exemplary embodiment of the present disclosure, the audio processing apparatus may further include: a segment adjusting unit (not shown) configured to detect downbeats of the audio to be processed, and adjust the start positions and the end positions of the plurality of candidate audio segments based on the downbeats to obtain the adjusted plurality of candidate audio segments.
The segment evaluation unit 72 is configured to determine, for each candidate audio segment, a feature evaluation result of the candidate audio segment on a preset audio evaluation feature. In an exemplary embodiment of the present disclosure, there may be a plurality of audio evaluation features; alternatively, there may be a single audio evaluation feature.
In an exemplary embodiment of the present disclosure, the audio evaluation feature may include at least one of average loudness, spectral brightness, and drum point intensity. The segment determining unit 73 is configured to determine a target audio segment from the plurality of candidate audio segments according to the feature evaluation result, and take the target audio segment as an audio segment to be used.
In an exemplary embodiment of the present disclosure, the section determining unit 73 may be configured to: determining a target characteristic evaluation result of each candidate audio clip according to the characteristic evaluation result of each candidate audio clip under the multiple audio evaluation characteristics; comparing the target characteristic evaluation results of the candidate audio clips, and selecting a target characteristic evaluation result meeting a preset condition; and determining the candidate audio clip corresponding to the target feature evaluation result meeting the preset conditions in the plurality of candidate audio clips as the target audio clip.
In an exemplary embodiment of the present disclosure, the section determining unit 73 may be configured to: carrying out sound-accompaniment separation processing on each candidate audio clip to obtain a content component of each candidate audio clip, wherein the content component comprises at least one of a target object component, an accompaniment component and silence; and determining the target audio segment from the candidate audio segments containing the target object components.
In an exemplary embodiment of the present disclosure, the section determining unit 73 may be configured to: and selecting at least one candidate audio segment from the adjusted plurality of candidate audio segments as the target audio segment.
With regard to the apparatus in the above-described embodiment, the specific manner in which each unit performs the operation has been described in detail in the embodiment related to the method, and will not be described in detail here.
The audio processing apparatus according to the exemplary embodiment of the present disclosure has been described above with reference to fig. 7. Next, an electronic apparatus according to an exemplary embodiment of the present disclosure is described with reference to fig. 8.
Fig. 8 is a block diagram of an electronic device 800 according to an example embodiment of the present disclosure.
Referring to fig. 8, an electronic device 800 includes at least one memory 801 and at least one processor 802, the at least one memory 801 having stored therein a set of computer-executable instructions that, when executed by the at least one processor 802, perform a method of audio processing according to an example embodiment of the present disclosure.
In exemplary embodiments of the present disclosure, the electronic device 800 may be a PC computer, a tablet device, a personal digital assistant, a smartphone, or other device capable of executing the above-described instruction sets. Here, the electronic device 800 need not be a single electronic device, but can be any collection of devices or circuits that can execute the above instructions (or sets of instructions) either individually or in combination. The electronic device 800 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces with local or remote (e.g., via wireless transmission).
In the electronic device 800, the processor 802 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a dedicated processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
The processor 802 may execute instructions or code stored in the memory 801, wherein the memory 801 may also store data. The instructions and data may also be transmitted or received over a network via a network interface device, which may employ any known transmission protocol.
The memory 801 may be integrated with the processor 802, for example, with RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, memory 801 may include a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The memory 801 and the processor 802 may be operatively coupled or may communicate with each other, such as through I/O ports, network connections, etc., so that the processor 802 can read files stored in the memory.
In addition, the electronic device 800 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device 800 may be connected to each other via a bus and/or a network.
There is also provided, in accordance with an example embodiment of the present disclosure, a computer-readable storage medium, such as a memory 801, including instructions executable by a processor 802 of a device 800 to perform the above-described method. Alternatively, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
According to an exemplary embodiment of the present disclosure, a computer program product may also be provided, which comprises computer programs/instructions, which when executed by a processor, implement the method of audio processing according to an exemplary embodiment of the present disclosure.
The audio processing method and apparatus, the electronic device, the computer-readable storage medium, and the computer program product according to the exemplary embodiments of the present disclosure have been described above with reference to fig. 1 to 8. However, it should be understood that: the audio processing apparatus and its units shown in fig. 7 may be respectively configured as software, hardware, firmware, or any combination thereof to perform a specific function, the electronic device shown in fig. 8 is not limited to include the above-shown components, but some components may be added or deleted as needed, and the above components may also be combined.
According to the audio processing method and apparatus of the present disclosure, the audio to be processed is first divided into a plurality of candidate audio segments (different candidate audio segments correspond to different audio tags, and the audio tags are determined by performing feature clustering on the audio to be processed). For each candidate audio segment, a feature evaluation result of the candidate audio segment on a preset audio evaluation feature is determined; a target audio segment is then determined from the plurality of candidate audio segments according to the feature evaluation results and taken as the audio segment to be used. In this way, the selection of the target audio segment of the audio to be processed is achieved by performing feature evaluation on each candidate audio segment of the audio to be processed.
In addition, according to the audio processing method and the audio processing device, the audio structure segments cut out according to the structure characteristics can be clustered, and the audio structure segments of the same category are combined, so that the completeness of the cut segments is ensured.
In addition, according to the audio processing method and apparatus of the present disclosure, the audio segments can be fine-tuned through downbeat detection, thereby ensuring that each candidate audio segment starts on a downbeat and ends on a downbeat.
In addition, according to the audio processing method and apparatus of the present disclosure, the segments without vocals among the candidate audio segments can be eliminated by using the vocal-accompaniment separation technique, so that the interference of bridge segments with chorus detection is reduced.
In addition, according to the audio processing method and apparatus of the present disclosure, the candidate audio segments may be comprehensively scored from at least one of the aspects of average loudness, spectral brightness, drum point intensity, and the like, so as to find the target audio segment with the highest score.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. An audio processing method, comprising:
performing audio division on audio to be processed to obtain a plurality of candidate audio segments, wherein different candidate audio segments correspond to different audio tags, and the audio tags are determined by performing feature clustering on the audio to be processed;
for each candidate audio segment, determining a feature evaluation result of each candidate audio segment on a preset audio evaluation feature;
and determining a target audio clip from the candidate audio clips according to the characteristic evaluation result, and taking the target audio clip as an audio clip to be used.
2. The audio processing method according to claim 1, wherein there are a plurality of the audio evaluation features,
the determining a target audio clip from the candidate audio clips according to the feature evaluation result comprises:
determining a target characteristic evaluation result of each candidate audio clip according to the characteristic evaluation results under the multiple audio evaluation characteristics of each candidate audio clip;
comparing the target feature evaluation results of the candidate audio clips, and selecting a target feature evaluation result meeting a preset condition;
and determining the candidate audio clip corresponding to the target feature evaluation result meeting the preset conditions in the plurality of candidate audio clips as the target audio clip.
3. The audio processing method according to claim 1, wherein the audio evaluation feature comprises at least one of average loudness, spectral brightness, and drum point intensity.
4. The audio processing method of claim 1, wherein the determining a target audio segment from the plurality of candidate audio segments comprises:
carrying out sound-accompaniment separation processing on each candidate audio clip to obtain a content component of each candidate audio clip, wherein the content component comprises at least one of a target object component, an accompaniment component and silence;
determining the target audio segment from the candidate audio segments containing the target object components.
5. The audio processing method of claim 1, wherein the audio partitioning of the audio to be processed to obtain a plurality of candidate audio segments comprises:
carrying out audio structure division on the audio to be processed to obtain a plurality of audio structure segments;
clustering audio features corresponding to the plurality of audio structure segments, and determining an audio label of each audio structure segment;
and combining the adjacent audio structure segments with the same audio label to obtain the plurality of candidate audio segments.
6. The audio processing method according to claim 5, wherein the audio structure dividing the audio to be processed to obtain a plurality of audio structure segments comprises:
extracting a mel cepstrum coefficient characteristic spectrum of the audio to be processed;
determining a self-similarity matrix of the audio to be processed based on the Mel cepstrum coefficient feature spectrum;
performing preset processing on the self-similarity matrix to obtain an enhanced self-similarity matrix;
determining local and global relationships of the enhanced self-similarity matrix;
and determining the cutting point of the audio to be processed based on the local and global relations to obtain the plurality of audio structure segments.
7. An audio processing apparatus, comprising:
the segment dividing unit is configured to perform audio division on the audio to be processed to obtain a plurality of candidate audio segments, wherein different candidate audio segments correspond to different audio tags, and the audio tags are determined by performing feature clustering on the audio to be processed;
a segment evaluation unit configured to determine, for each of the candidate audio segments, a feature evaluation result of each of the candidate audio segments on a preset audio evaluation feature; and
and the segment determining unit is configured to determine a target audio segment from the candidate audio segments according to the feature evaluation result, and take the target audio segment as an audio segment to be used.
8. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the audio processing method of any of claims 1 to 6.
9. A computer-readable storage medium storing a computer program, which when executed by a processor of an electronic device causes the electronic device to perform the audio processing method of any of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program realizes the audio processing method of any of claims 1 to 6 when executed by a processor.
CN202210697617.8A 2022-06-20 2022-06-20 Audio processing method and device, electronic equipment and storage medium Pending CN115101094A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210697617.8A CN115101094A (en) 2022-06-20 2022-06-20 Audio processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115101094A (en) 2022-09-23

Family

ID=83291741

Country Status (1)

Country Link
CN (1) CN115101094A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination