CN113327628B - Audio processing method, device, readable medium and electronic equipment


Info

Publication number
CN113327628B
Authority
CN
China
Prior art keywords
audio
time
processed
cluster
audio frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110586577.5A
Other languages
Chinese (zh)
Other versions
CN113327628A (en)
Inventor
徐怡廷
王素珍
丁锐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Douyin Vision Co Ltd
Original Assignee
Douyin Vision Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Douyin Vision Co Ltd
Priority to CN202110586577.5A
Publication of CN113327628A
Application granted
Publication of CN113327628B
Legal status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/2163: Partitioning the feature space
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G06F 18/23: Clustering techniques
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00, characterised by the type of extracted parameters
    • G10L 25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00, characterised by the analysis technique
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The disclosure relates to an audio processing method, an apparatus, a readable medium and an electronic device, and to the technical field of audio signal processing. The method comprises the following steps: extracting a frequency domain feature of each audio frame in the audio to be processed, and determining an initial similarity matrix corresponding to the audio to be processed based on the frequency domain features; acquiring the time sequence of each audio frame in the audio to be processed, and correcting the initial similarity matrix according to the time sequence to obtain a target similarity matrix that fuses the time relationship; constructing an audio feature map corresponding to the audio to be processed according to the target similarity matrix; performing spectral clustering on the audio feature map to obtain a plurality of clusters, and determining a plurality of segmentation boundaries of the audio to be processed according to the cluster boundaries of the clusters; and segmenting the audio to be processed according to the segmentation boundaries to obtain a plurality of audio segments. The method and the device can improve the accuracy and the adaptability of audio segmentation.

Description

Audio processing method, device, readable medium and electronic equipment
Technical Field
The present disclosure relates to the field of audio signal processing technologies, and in particular, to an audio processing method, an apparatus, a readable medium, and an electronic device.
Background
With the continuous development of terminal technology and electronic information technology, terminal devices play an increasingly important role in people's daily lives, and people can acquire information in various forms through them. Compared with text and pictures, video conveys richer content and can transmit information to a user through both the visual and the auditory dimension. For video, suitable background music tends to increase the expressiveness and reach of the information.
Because the duration of a complete piece of music usually differs from the duration of the video footage it is meant to accompany, simply truncating the music leaves it incomplete, makes the result sound abrupt, and may even weaken the expressiveness of the video. Therefore, to match video footage with audio, the music structure of the audio needs to be analyzed so that the audio can be segmented according to that structure into a plurality of audio segments, which can then be matched to the corresponding picture segments.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides an audio processing method, the method comprising:
extracting frequency domain characteristics of each audio frame in audio to be processed, and determining an initial similarity matrix corresponding to the audio to be processed based on the frequency domain characteristics;
acquiring the time sequence of each audio frame in the audio to be processed, and correcting the initial similarity matrix according to the time sequence to obtain a target similarity matrix fusing time relations;
constructing an audio feature map corresponding to the audio to be processed according to the target similarity matrix;
performing spectral clustering on the audio feature map to obtain a plurality of clusters, and determining a plurality of segmentation boundaries of the audio to be processed according to the clustering boundaries of the clusters;
and segmenting the audio to be processed according to the segmentation boundaries to obtain a plurality of audio segments.
In a second aspect, the present disclosure provides an audio processing apparatus, the apparatus comprising:
the first determining module is used for extracting the frequency domain characteristics of each audio frame in the audio to be processed and determining an initial similarity matrix corresponding to the audio to be processed based on the frequency domain characteristics;
the second determining module is used for obtaining the time sequence of each audio frame in the audio to be processed, and correcting the initial similarity matrix according to the time sequence to obtain a target similarity matrix fusing the time relations;
The map construction module is used for constructing an audio characteristic map corresponding to the audio to be processed according to the target similarity matrix;
the clustering module is used for carrying out spectral clustering on the audio feature map to obtain a plurality of clusters, and determining a plurality of segmentation boundaries according to the clustering boundaries of the plurality of clusters;
and the segmentation module is used for segmenting the audio to be processed according to the segmentation boundary so as to obtain a plurality of audio segments.
In a third aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which when executed by a processing device performs the steps of the method of the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides an electronic device comprising:
a storage device having a computer program stored thereon;
processing means for executing said computer program in said storage means to carry out the steps of the method of the first aspect of the disclosure.
Through the above technical solution, the present disclosure first extracts the frequency domain feature of each audio frame of the audio to be processed and determines an initial similarity matrix based on those frequency domain features; it then acquires the time sequence of each audio frame in the audio to be processed and corrects the initial similarity matrix according to that time sequence to obtain a target similarity matrix that fuses the time relationship. A corresponding audio feature map is constructed according to the target similarity matrix, spectral clustering is performed on the audio feature map to obtain a plurality of clusters, a plurality of segmentation boundaries is determined according to the cluster boundaries of those clusters, and finally the audio to be processed is segmented according to the segmentation boundaries to obtain a plurality of audio segments. Because the target similarity matrix combines the similarity of the audio frames in their frequency domain features with the time sequence of the audio frames in the audio to be processed, the segmentation boundaries used to segment the audio are determined from a matrix that reflects the association between audio frames in both the frequency domain and the time domain, which can improve the accuracy and the adaptability of audio segmentation.
Additional features and advantages of the present disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flow chart illustrating a method of audio processing according to an exemplary embodiment;
FIG. 2 is a schematic diagram of an audio feature map shown according to an exemplary embodiment;
FIG. 3 is a flow chart illustrating another audio processing method according to an exemplary embodiment;
FIG. 4 is a schematic diagram of an initial similarity matrix shown in accordance with an exemplary embodiment;
FIG. 5 is a flowchart illustrating another audio processing method according to an exemplary embodiment;
FIG. 6 is a schematic diagram of a beat time sequence shown according to an example embodiment;
FIG. 7 is a flowchart illustrating another audio processing method according to an exemplary embodiment;
FIG. 8 is a schematic diagram of a target similarity matrix shown according to an example embodiment;
FIG. 9 is a flowchart illustrating another audio processing method according to an exemplary embodiment;
FIG. 10 is a flowchart illustrating another audio processing method according to an exemplary embodiment;
FIG. 11 is a schematic diagram of an initial audio segment shown in accordance with an exemplary embodiment;
FIG. 12 is a block diagram of an audio processing device, according to an example embodiment;
FIG. 13 is a block diagram of another audio processing device shown in accordance with an exemplary embodiment;
FIG. 14 is a block diagram of another audio processing device shown in accordance with an exemplary embodiment;
FIG. 15 is a block diagram of another audio processing device shown in accordance with an exemplary embodiment;
FIG. 16 is a block diagram of another audio processing device shown in accordance with an exemplary embodiment;
FIG. 17 is a block diagram of another audio processing device shown in accordance with an exemplary embodiment;
FIG. 18 is a block diagram of an electronic device, according to an example embodiment.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that the disclosure will be more thorough and complete. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "a", "an", and "a plurality" in this disclosure are illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
Fig. 1 is a flow chart illustrating a method of audio processing, as shown in fig. 1, according to an exemplary embodiment, the method comprising:
step 101, extracting frequency domain characteristics of each audio frame in the audio to be processed, and determining an initial similarity matrix corresponding to the audio to be processed based on the frequency domain characteristics. The initial similarity matrix comprises the similarity of every two audio frames on the frequency domain characteristics.
For example, the audio to be processed is divided into a plurality of audio segments, wherein each audio segment includes a plurality of consecutive audio frames, and the audio frames included in one audio segment have a similar music structure (for example, belong to the same melody). First, the frequency domain feature of each audio frame included in the audio to be processed may be extracted, and the frequency domain feature may be understood as a local feature of the audio frame in the frequency domain, and the frequency domain feature may be, for example, an amplitude value of the audio frame at each note frequency. After the frequency domain features of each audio frame are obtained, an initial similarity matrix may be constructed based on the frequency domain features. The initial similarity matrix comprises the similarity of any two audio frames on the frequency domain characteristics, and can reflect the association of each audio frame on the frequency domain in the audio to be processed. For example, the similarity of each two audio frames on the frequency domain features may be calculated according to a preset algorithm according to the two frequency domain features corresponding to each two audio frames. The preset algorithm may be, for example, a gaussian kernel function, or other algorithms for calculating similarity, which is not specifically limited in this disclosure. Further, an initial similarity matrix may be constructed based on the similarity of each two audio frames in the frequency domain characteristics. It can be understood that the similarity of each two audio frames on the frequency domain features is filled into an initial similarity matrix, and the element of the ith row and the jth column in the initial similarity matrix represents the similarity of the ith audio frame and the jth audio frame in the audio to be processed on the frequency domain features. The similarity of the ith audio frame and the jth audio frame in the frequency domain features is equivalent to the similarity of the jth audio frame and the ith audio frame in the frequency domain features, so that the initial similarity matrix is a symmetric matrix.
Step 102, obtaining the time sequence of each audio frame in the audio to be processed, and correcting the initial similarity matrix according to the time sequence to obtain the target similarity matrix fusing the time relations. The target similarity matrix comprises target similarity between every two audio frames.
For example, for each audio frame included in the audio to be processed, temporally adjacent audio frames are more likely to belong to the same melody, that is, audio frames that are temporally shorter in distance are naturally similar. The initial similarity matrix reflects the association of each audio frame in the audio to be processed in the frequency domain. Therefore, on the basis of the initial similarity matrix, the time sequence of each audio frame in the audio to be processed is firstly acquired, and each element in the initial similarity matrix is corrected according to the acquired time sequence, so that a target similarity matrix is acquired, wherein the target similarity between every two audio frames is included. The target similarity between every two audio frames is determined according to the similarity of the two audio frames on the frequency domain characteristics and the distance of the two audio frames on the time domain, so that the target similarity matrix fuses the time relationship on the basis of the initial similarity matrix and can reflect the association of each audio frame on the frequency domain and the time domain in the audio frames to be processed. The element of the ith row and the jth column in the target similarity matrix represents the target similarity between the ith audio frame and the jth audio frame in the audio to be processed. The target similarity between the ith and jth audio frames is equivalent to the target similarity between the jth and ith audio frames, so the target similarity matrix is also a symmetric matrix.
And step 103, constructing an audio feature map corresponding to the audio to be processed according to the target similarity matrix. The audio feature map may include a node corresponding to each audio frame, and an edge between any two nodes, where each edge is used to represent a target similarity between two corresponding audio frames at two ends of the edge.
For example, after the target similarity matrix is obtained, an audio feature map corresponding to the audio to be processed may be constructed according to the elements included in the target similarity matrix and all audio frames included in the audio to be processed. The process of constructing the audio feature map may include, for example: establishing a plurality of nodes according to the number of audio frames included in the audio to be processed, with each node corresponding to one audio frame. Then, taking the first node as any node in the audio feature map and the second node as any node in the audio feature map other than the first node, the target similarity between the two audio frames corresponding to the first node and the second node may be determined first, and an edge may then be established between the first node and the second node according to that target similarity. Specifically, the target similarity may be taken as the weight of the edge, or the width of the edge may be used to represent the target similarity, i.e., the greater the target similarity between the two audio frames corresponding to the first node and the second node, the greater the width of the edge. Further, a similarity threshold (for example, 0.3) may be set: if the target similarity is greater than or equal to the threshold, the target similarity is taken as the weight of the edge; if the target similarity is less than the threshold, the weight of the edge is set to 0. Taking audio to be processed that comprises 4 audio frames as an example, an audio feature map as shown in fig. 2 is established, which includes 4 nodes: node A, node B, node C and node D, corresponding to the 1st, 2nd, 3rd and 4th audio frames respectively, and 6 edges: edge AB, edge AC, edge AD, edge BC, edge BD and edge CD, where the weight of edge AB is 0.7, representing the target similarity between node A and node B (i.e., the 1st and 2nd audio frames), the weight of edge AC is 0.3, representing the target similarity between node A and node C (i.e., the 1st and 3rd audio frames), and so on.
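To make the construction above concrete, the following sketch builds the weighted adjacency of such an audio feature map from a target similarity matrix. It is an illustrative numpy implementation assumed for this description (the function name, the thresholding behaviour and the toy matrix are not taken from the patent):

import numpy as np

def build_audio_feature_graph(target_similarity, threshold=0.3):
    """Build a weighted adjacency matrix from the target similarity matrix.

    target_similarity: (n, n) symmetric matrix of target similarities.
    Edges whose similarity falls below `threshold` are dropped (weight 0),
    mirroring the example similarity threshold of 0.3 above.
    """
    W = np.array(target_similarity, dtype=float)
    W[W < threshold] = 0.0          # drop weak edges
    np.fill_diagonal(W, 0.0)        # no self-loops
    return W

# Toy example with 4 audio frames (nodes A, B, C, D):
S = np.array([
    [1.0, 0.7, 0.3, 0.1],
    [0.7, 1.0, 0.5, 0.2],
    [0.3, 0.5, 1.0, 0.6],
    [0.1, 0.2, 0.6, 1.0],
])
adjacency = build_audio_feature_graph(S)
print(adjacency)  # edge AB keeps weight 0.7, edge AD (0.1) is removed

The resulting adjacency matrix is what the spectral clustering of the next step would operate on.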
And 104, performing spectral clustering on the audio feature map to obtain a plurality of clusters, and determining a plurality of segmentation boundaries of the audio to be processed according to the clustering boundaries of the plurality of clusters.
Step 105, dividing the audio to be processed according to the dividing boundary to obtain a plurality of audio segments.
For example, after the audio feature map is established, spectral clustering may be performed on it to obtain the result of the spectral clustering: a plurality of clusters. Spectral clustering of the audio feature map can be understood as cutting the graph so that the sum of the edge weights between different sub-graphs after the cut is as low as possible while the sum of the edge weights within each sub-graph is as high as possible. Each resulting sub-graph corresponds to one cluster, each cluster includes the nodes corresponding to at least one audio frame, and the audio frames corresponding to those nodes are similar in both the frequency domain and the time domain. Specifically, a Laplacian matrix may first be computed from the target similarity matrix, for example L = D - W, where D denotes the degree matrix of the audio feature map and W denotes the target similarity matrix. The resulting Laplacian matrix is then normalized. Further, the eigenvalues and eigenvectors of the normalized Laplacian matrix are computed. The eigenvalues are then sorted, the preset number of smallest eigenvalues is selected, and the eigenvectors corresponding to those smallest eigenvalues are clustered with K-means to obtain the preset number of clusters.
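As a rough illustration of the clustering procedure just described (Laplacian L = D - W, normalization, eigendecomposition, K-means on the eigenvectors of the smallest eigenvalues), the following is a minimal sketch using numpy and scikit-learn; it is an assumed implementation, not code from the patent:

import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W, n_clusters):
    """Cluster the nodes of a graph given by adjacency matrix W.

    W: (n, n) symmetric non-negative adjacency (target similarity) matrix.
    Returns an array of cluster labels, one per audio frame / node.
    """
    d = W.sum(axis=1)
    D = np.diag(d)
    L = D - W                                         # unnormalized Laplacian
    # Symmetric normalization: L_sym = D^(-1/2) L D^(-1/2)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L_sym = d_inv_sqrt @ L @ d_inv_sqrt
    # Eigenvectors belonging to the smallest eigenvalues (eigh sorts ascending)
    eigvals, eigvecs = np.linalg.eigh(L_sym)
    embedding = eigvecs[:, :n_clusters]
    # Row-normalize and run K-means in the spectral embedding
    norms = np.linalg.norm(embedding, axis=1, keepdims=True)
    embedding = embedding / np.maximum(norms, 1e-12)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embedding)
    return labels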
Further, since the audio frames corresponding to the nodes included in each cluster are similar in both the frequency domain and the time domain, a plurality of partition boundaries can be determined from the cluster boundaries of a plurality of clusters. The number of partition boundaries may be the same as or different from the number of clusters. After the segmentation boundary is obtained, the audio to be processed may be directly segmented according to the segmentation boundary, so as to obtain a plurality of audio segments. The number of audio segments may be the same as or different from the number of segment boundaries.
In one implementation, the central node included in each cluster may first be determined based on the cluster boundary of that cluster. The segmentation boundaries can then be determined from the central node of each cluster. For example, after spectral clustering is performed on the audio feature map, 3 clusters are obtained and 3 central nodes are determined from their cluster boundaries: the audio frame corresponding to the central node of the first cluster is located at 1s in the audio to be processed, the audio frame corresponding to the central node of the second cluster at 5s, and the audio frame corresponding to the central node of the third cluster at 10s. The midpoint of 1s and 5s, i.e. 3s, can then be taken as one segmentation boundary, and the midpoint of 5s and 10s, i.e. 7.5s, as another, giving the 2 segmentation boundaries 3s and 7.5s. If the duration of the audio to be processed is 16s, segmenting it at 3s and 7.5s yields 0s-3s, 3s-7.5s and 7.5s-16s, i.e. 3 audio segments.
In another implementation, the segmentation boundaries may also be determined directly from the cluster boundary of each cluster. For example, the cluster boundary of the first cluster is 0.2s to 2.5s, that of the second cluster is 2.6s to 8.1s, and that of the third cluster is 8.6s to 12.8s. The midpoint of 2.5s and 2.6s, i.e. 2.55s, can then be taken as one segmentation boundary, and the midpoint of 8.1s and 8.6s, i.e. 8.35s, as another, giving the 2 segmentation boundaries 2.55s and 8.35s. If the duration of the audio to be processed is 16s, segmenting it at 2.55s and 8.35s yields 0s-2.55s, 2.55s-8.35s and 8.35s-16s, i.e. 3 audio segments.
In this way, because the target similarity matrix can reflect the association of the audio frames of the audio to be processed in both the frequency domain and the time domain, performing spectral clustering on the audio feature map corresponding to the target similarity matrix yields clusters whose nodes correspond to audio frames that are similar in both dimensions, and the segmentation boundaries determined from those clusters therefore segment the audio to be processed according to the association in both dimensions. Compared with technical solutions that segment audio only according to the association of local features, or that only consider the temporal closeness of audio frames, the present disclosure jointly considers the similarity of the frequency domain features of the audio frames and their association in time, which can effectively improve the accuracy and the adaptability of audio segmentation.
In summary, the present disclosure first extracts the frequency domain feature of each audio frame in the audio to be processed and determines an initial similarity matrix based on those frequency domain features; it then acquires the time sequence of each audio frame in the audio to be processed and corrects the initial similarity matrix according to that time sequence to obtain a target similarity matrix that fuses the time relationship. A corresponding audio feature map is constructed according to the target similarity matrix, spectral clustering is performed on it to obtain a plurality of clusters, a plurality of segmentation boundaries is determined according to the cluster boundaries of those clusters, and finally the audio to be processed is segmented according to the segmentation boundaries to obtain a plurality of audio segments. Because the target similarity matrix combines the similarity of the audio frames in their frequency domain features with the time sequence of the audio frames in the audio to be processed, the segmentation boundaries used to segment the audio are determined from a matrix that reflects the association between audio frames in both the frequency domain and the time domain, which can improve the accuracy and the adaptability of audio segmentation.
FIG. 3 is a flow chart illustrating another audio processing method according to an exemplary embodiment, as shown in FIG. 3, the implementation of step 101 may include:
At step 1011, the frequency domain characteristics of each audio frame are extracted according to the CQT.
Step 1012, determining the similarity of each two audio frames on the frequency domain features according to the frequency domain features of each audio frame.
In step 1013, an initial similarity matrix is generated according to the similarity of each two audio frames in the frequency domain features, and the initial similarity matrix is smoothed.
For example, the frequency domain features of each audio frame may first be extracted using the CQT (Constant-Q Transform). Because the frequency axis of the CQT spectrum is spaced on a log2 scale and the bandwidth of the filter window varies with the spectral line frequency, the CQT matches the distribution of musical scale frequencies, so the amplitude value of the audio frame at each note frequency can be obtained directly and used as the frequency domain feature of that audio frame. For low frequencies the bandwidth of the filter window is very small, giving a high frequency resolution that can separate adjacent notes; for high frequencies the bandwidth is relatively large, giving a high time resolution that can track fast-changing overtones, so the obtained frequency domain features are more accurate. For example, the frequency domain features of each audio frame may be determined by equation 1, which in the standard constant-Q form is:

X[k] = (1/N[k]) * Σ_{n=0}^{N[k]-1} W[k,n] · x[n] · e^(-j·2π·Q·n/N[k]),   with W[k,n] = (1 - α) - α·cos(2π·n/N[k])

where W[k,n] denotes the value of the k-th window function (here a Hamming window is taken as an example) at the n-th sampling instant, α is a constant, for example 0.46, N[k] denotes the bandwidth of the k-th window function (i.e., the number of sampling instants it covers), and k denotes the frequency index of the CQT spectrum. X[k] denotes the result of filtering the audio frame with the k-th window function, i.e., the component of the audio frame at the k-th frequency, x[n] denotes the amplitude of the audio frame at the n-th sampling instant, and Q is the constant factor of the CQT transform.
Then, the similarity of every two audio frames in their frequency domain features may be determined according to the frequency domain features of each audio frame obtained in step 1011. The similarity of two audio frames in their frequency domain features reflects how similar those frequency domain features are. The similarities are then filled into an initial similarity matrix, i.e., the similarity of the i-th and j-th audio frames in their frequency domain features is placed in the i-th row and j-th column. The resulting initial similarity matrix may look like (a) in fig. 4, where both axes represent time and the darkness of each point represents the similarity, in frequency domain features, between the audio frame corresponding to the horizontal coordinate and the audio frame corresponding to the vertical coordinate. Because noise is inevitably introduced while extracting the frequency domain features of the audio to be processed and computing the similarities, the initial similarity matrix may contain some sparse, discontinuous inter-frame similarities. The initial similarity matrix may therefore be smoothed (e.g., median filtered) to suppress the similarities caused by noise; the result of smoothing (a) in fig. 4 is shown in (b) of fig. 4. Because the goal of audio segmentation is to obtain audio segments of relatively long duration, and smoothing filters out sparse, discontinuous inter-frame similarities, the accuracy and the adaptability of audio segmentation can be further improved.
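A compact sketch of this feature extraction and smoothing step, assuming the librosa and scipy libraries (the hop length and filter size are illustrative choices, not values from the patent):

import librosa
import numpy as np
from scipy.ndimage import median_filter

def cqt_features(path, hop_length=512):
    """Return per-frame CQT magnitude features (frames x frequency bins)."""
    y, sr = librosa.load(path, sr=None)
    C = np.abs(librosa.cqt(y, sr=sr, hop_length=hop_length))  # (bins, frames)
    return C.T, sr, hop_length

def smooth_similarity(R, size=7):
    """Median-filter the initial similarity matrix to suppress sparse,
    discontinuous inter-frame similarities caused by noise."""
    return median_filter(R, size=(size, size))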
The implementation of step 1012 is specifically described below, and step 1012 may be implemented by:
step a) calculates a first similarity between frequency domain features of each two audio frames using a gaussian kernel function.
Step b), determining adjacent audio frames corresponding to each audio frame on the frequency domain characteristics according to a preset adjacent algorithm.
And c) if the first audio frame belongs to an adjacent audio frame corresponding to the second audio frame, taking the first similarity between the frequency domain characteristics of the first audio frame and the frequency domain characteristics of the second audio frame as the similarity between the frequency domain characteristics of the first audio frame and the second audio frame. And if the first audio frame does not belong to the adjacent audio frame corresponding to the second audio frame, setting the similarity of the first audio frame and the second audio frame on the frequency domain characteristics to be zero.
Wherein the first audio frame is different from the second audio frame. That is, the first audio frame is any audio frame in the audio to be processed, and the second audio frame is any audio frame except the first audio frame in the audio to be processed.
In one implementation, the frequency domain features of the two audio frames may be substituted into a Gaussian kernel function to calculate a first similarity between them. The Gaussian kernel function is shown in formula two:

S_ij = exp( -||x_i - x_j||² / (2σ²) )

where S_ij denotes the first similarity between the frequency domain features of the i-th and the j-th audio frame, x_i denotes the frequency domain feature of the i-th audio frame, x_j denotes the frequency domain feature of the j-th audio frame, and σ denotes the width parameter of the Gaussian kernel.
In one application, the first similarity between the frequency domain features of each two audio frames may be directly used as the similarity between each two audio frames in the frequency domain features to generate the initial similarity matrix.
In another application, the first similarity between the frequency domain features of every two audio frames may first be calculated with formula two. The adjacent audio frames of each audio frame in the frequency domain features can then be determined according to a preset neighbour algorithm. The neighbour algorithm may be, for example, the KNN (K-nearest neighbour) algorithm, which yields the K (for example, 3) nearest audio frames of each audio frame in the frequency domain features. Finally, the similarity of every two audio frames in the frequency domain features is determined according to formula three to generate the initial similarity matrix:

R'_ij = S_ij · R_ij

where R'_ij denotes the similarity of the i-th and the j-th audio frame in their frequency domain features, S_ij denotes the first similarity between their frequency domain features obtained from formula two, and R_ij is 1 when the i-th and the j-th audio frame are k-nearest neighbours of each other and 0 when they are not.
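Formulas two and three could be implemented together roughly as follows (an assumed numpy sketch; sigma and k are illustrative values, and the mutual k-nearest-neighbour test follows the description above):

import numpy as np

def initial_similarity(features, k=3, sigma=1.0):
    """Initial similarity matrix from per-frame frequency-domain features.

    features: (n_frames, n_bins) array of CQT features.
    Gaussian-kernel similarities (formula two) are kept only between frames
    that are mutual k-nearest neighbours in feature space (formula three).
    """
    n = features.shape[0]
    # Pairwise squared distances and Gaussian kernel
    diff = features[:, None, :] - features[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)
    S = np.exp(-dist2 / (2.0 * sigma ** 2))                  # first similarity
    # Mutual k-nearest-neighbour mask (excluding the frame itself)
    order = np.argsort(dist2, axis=1)
    knn = np.zeros((n, n), dtype=bool)
    for i in range(n):
        knn[i, order[i, 1:k + 1]] = True
    mutual = knn & knn.T                                     # R_ij in formula three
    return np.where(mutual, S, 0.0)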
Fig. 5 is a flow chart illustrating another audio processing method according to an exemplary embodiment, and as shown in fig. 5, the implementation of step 105 may include:
in step 1051, a corresponding beat time sequence in the audio to be processed is extracted, where the beat time sequence includes a time corresponding to each beat in the audio to be processed.
Step 1052, for each of the division boundaries, searches in the beat time sequence for a target beat with the smallest time difference between the corresponding time and the division boundary.
In step 1053, the audio to be processed is segmented according to the time corresponding to the target beat corresponding to each segmentation boundary, so as to obtain a plurality of audio segments.
The segmentation boundaries obtained in step 104 are based on the association of the audio frames in the two dimensions of the frequency domain and the time domain. On this basis, the audio to be processed can further be segmented in combination with the beats it contains. Because beats follow the principles of composition and the way listeners perceive music, segmenting the audio to be processed by combining the segmentation boundaries with the beats better matches the listener's perception, which can further improve the accuracy and the adaptability of audio segmentation.
Specifically, a beat time sequence including the time corresponding to each beat in the audio to be processed may be extracted first. For example, if the duration of the audio to be processed is 60s, the beat time sequence might be {0.5s, 1s, 1.5s, 2s, 2.5s, 3s, 3.5s, …, 36.5s, 37s, …, 59.5s, 60s}, which includes 120 beats and the time corresponding to each of them. Then, for each segmentation boundary, the beat closest to it is searched for in the beat time sequence and taken as the target beat (i.e., the beat whose corresponding time has the smallest time difference from the segmentation boundary). Finally, the audio to be processed is segmented according to the time corresponding to the target beat of each segmentation boundary to obtain a plurality of audio segments. For example, if the segmentation boundaries are 15.6s, 27.4s and 46.9s, then 15.6s is closest to 15.5s, so the beat at 15.5s can be taken as a target beat; 27.4s is closest to 27.5s, so the beat at 27.5s can be taken as a target beat; and 46.9s is closest to 47s, so the beat at 47s can be taken as a target beat. The audio to be processed can then be segmented at 15.5s, 27.5s and 47s, giving 0s-15.5s, 15.5s-27.5s, 27.5s-47s and 47s-60s, i.e. 4 audio segments in total.
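A possible implementation of this beat-snapping step is sketched below (plain numpy; the function name is illustrative, not from the patent):

import numpy as np

def snap_boundaries_to_beats(boundaries, beat_times):
    """Move each segmentation boundary to the nearest beat time.

    boundaries: iterable of boundary times in seconds (e.g. [15.6, 27.4, 46.9]).
    beat_times: beat time sequence in seconds (e.g. [0.5, 1.0, ..., 60.0]).
    """
    beat_times = np.asarray(beat_times, dtype=float)
    snapped = []
    for b in boundaries:
        idx = np.argmin(np.abs(beat_times - b))  # beat with the smallest time difference
        snapped.append(float(beat_times[idx]))
    return snapped

# With the example above: [15.6, 27.4, 46.9] -> [15.5, 27.5, 47.0]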
In one application scenario, the beat time sequence in step 1051 can be obtained by:
step d) determining the starting point of each note in the audio to be processed according to the short-time energy of each audio frame.
Step e) determining the tempo and beat of the audio to be processed according to the starting point of each note.
Step f) determining the corresponding time of each beat in the audio to be processed according to the short-time energy of each audio frame, the speed and the beat of the audio to be processed.
First, the short-time energy (i.e., the short-time average energy) of each audio frame may be calculated, so that the starting point of each note in the audio to be processed is detected from the short-time energy of each audio frame. The onsets may be detected with any of a variety of note onset detection algorithms, which the present disclosure does not specifically limit. The tempo and the beat of the audio to be processed can then be determined from the starting point of each note, for example a tempo of 120 and a 4/4 beat. Finally, according to the short-time energy of each audio frame and the tempo and beat of the audio to be processed, energy peaks are selected as the times corresponding to the beats to obtain the beat time sequence, where each beat is an energy peak and conforms to the rule indicated by the tempo and beat of the audio to be processed. The beat time sequence may be as shown in fig. 6, where the times identified by the dashed lines are the times corresponding to the beats, the horizontal axis represents time and the vertical axis represents energy.
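Steps d) to f) correspond roughly to standard onset detection and beat tracking. A minimal sketch using librosa is shown below; the choice of library and of the specific onset and beat functions is an assumption (librosa's default onset strength is based on spectral flux rather than the short-time energy described above), not something prescribed by the patent:

import librosa

def beat_time_sequence(path):
    """Estimate the time (in seconds) of each beat in the audio to be processed."""
    y, sr = librosa.load(path, sr=None)
    # Per-frame onset strength (spectral-flux based in librosa's default)
    onset_env = librosa.onset.onset_strength(y=y, sr=sr)
    # Tempo (speed) and beat positions consistent with the estimated tempo
    tempo, beat_frames = librosa.beat.beat_track(onset_envelope=onset_env, sr=sr)
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)
    return tempo, beat_times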
Fig. 7 is a flowchart illustrating another audio processing method according to an exemplary embodiment, and as shown in fig. 7, step 102 may include:
in step 1021, a target similarity between each two audio frames is determined according to the time sequence of each audio frame in the audio to be processed and the similarity of each two audio frames in the frequency domain characteristics.
Step 1022, generating a target similarity matrix according to the target similarity between every two audio frames.
For example, on the basis of the initial similarity matrix, each of its elements may be corrected by combining it with the time sequence of each audio frame in the audio to be processed to obtain the target similarity matrix. Specifically, the target similarity of two audio frames may be determined according to their similarity in the frequency domain features and their distance in the time domain. The target similarity of every two audio frames is then filled into the target similarity matrix, i.e., the target similarity between the i-th and the j-th audio frame is placed in the i-th row and j-th column. The resulting target similarity matrix may be as shown in fig. 8, where both axes represent time and the darkness of each point represents the target similarity between the audio frame corresponding to the horizontal coordinate and the audio frame corresponding to the vertical coordinate.
In an application scenario, the implementation manner of step 1021 may be:
if the time difference between the first audio frame and the second audio frame is smaller than the preset time threshold, setting the target similarity between the first audio frame and the second audio frame to be 1. And if the time difference between the first audio frame and the second audio frame is greater than or equal to the time threshold, taking the similarity of the first audio frame and the second audio frame on the frequency domain characteristics as the target similarity between the first audio frame and the second audio frame.
Wherein the first audio frame is different from the second audio frame. That is, the first audio frame is any audio frame in the audio to be processed, and the second audio frame is any audio frame except the first audio frame in the audio to be processed.
By way of example, the target similarity for every two audio frames may be determined by equation four:
R''_ij = δ_ij + (1 - δ_ij) · R'_ij,   with δ_ij = 1 if |i - j| < n and δ_ij = 0 otherwise

where R''_ij denotes the target similarity of the i-th and the j-th audio frame, |i - j| denotes the time difference between the i-th and the j-th audio frame, and n denotes the time threshold, which can be understood as a number of frame intervals, for example 1 (i.e., δ_ij is 1 when the i-th and the j-th audio frame are adjacent and 0 when they are not).
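Equation four amounts to forcing the target similarity of temporally close frames to 1 while keeping the frequency-domain similarity elsewhere. An illustrative numpy sketch (names and the exact threshold semantics are assumptions):

import numpy as np

def fuse_time_relation(R_initial, n_frames_threshold=1):
    """Correct the initial similarity matrix with the time sequence of the frames.

    Entries whose frame indices differ by strictly less than the time threshold
    are set to 1 (formula four); all other entries keep their frequency-domain
    similarity from the initial similarity matrix.
    """
    n = R_initial.shape[0]
    idx = np.arange(n)
    time_diff = np.abs(idx[:, None] - idx[None, :])
    delta = time_diff < n_frames_threshold          # delta_ij in formula four
    return np.where(delta, 1.0, R_initial)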
Fig. 9 is a flowchart illustrating another audio processing method according to an exemplary embodiment, and as shown in fig. 9, step 104 may include:
step 1041, for each cluster, clustering according to the times corresponding to the plurality of audio frames included in the cluster, so as to obtain at least one time cluster included in the cluster.
In step 1042, time clusters included in the plurality of clusters are arranged according to a time sequence.
Step 1043, determining a plurality of partition boundaries according to the cluster boundaries of each two adjacent time clusters.
For example, for each cluster obtained by spectral clustering, the plurality of audio frames included in the cluster may be clustered according to their corresponding times to obtain at least one time cluster included in the cluster. This can be understood as splitting the cluster in time to obtain at least one time cluster. Each time cluster includes at least one audio frame, and the time differences between those audio frames are smaller than a preset time radius (for example, 2.5s). The central time of each of the time clusters included in the clusters may be determined first, and all time clusters may then be arranged in the chronological order of their central times. Finally, a plurality of segmentation boundaries is determined according to the cluster boundaries of every two time clusters that are adjacent in that order.
Taking an audio feature map that yields 3 clusters after spectral clustering as an example: the first cluster includes time cluster a and time cluster b, whose central times are 2s and 16s respectively; the second cluster includes time cluster c, whose central time is 7s; and the third cluster includes time cluster d, whose central time is 12s, giving 4 time clusters in total. The cluster boundary of time cluster a is 0.6s-5.3s, that of time cluster b is 14.9s-20.6s, that of time cluster c is 5.9s-10.1s, and that of time cluster d is 10.7s-14s. Arranging the 4 time clusters in the chronological order of their central times gives a-c-d-b, i.e., time cluster a is adjacent to time cluster c, time cluster c is adjacent to time cluster d, and time cluster d is adjacent to time cluster b. Then the midpoint of the cluster boundaries of time clusters a and c, i.e., (5.3s + 5.9s)/2 = 5.6s, is taken as one segmentation boundary, the midpoint of the cluster boundaries of time clusters c and d, i.e., (10.1s + 10.7s)/2 = 10.4s, as another, and the midpoint of the cluster boundaries of time clusters d and b, i.e., (14s + 14.9s)/2 = 14.45s, as a third, giving the 3 segmentation boundaries 5.6s, 10.4s and 14.45s.
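The splitting of each cluster into time clusters and the derivation of segmentation boundaries from adjacent time clusters might be sketched as follows. This is an assumed implementation: the time radius of 2.5s follows the example above, splitting at gaps between consecutive frame times approximates the pairwise time-difference criterion, and the centre of a time cluster is approximated by the midpoint of its boundary.

import numpy as np

def time_clusters(frame_times, labels, time_radius=2.5):
    """Split each spectral cluster into time clusters of temporally close frames.

    frame_times: (n,) array of frame times in seconds.
    labels:      (n,) array of spectral-cluster labels, one per frame.
    Returns a list of (start, end, centre) tuples, one per time cluster.
    """
    clusters = []
    for lab in np.unique(labels):
        times = np.sort(frame_times[labels == lab])
        start = prev = times[0]
        for t in times[1:]:
            if t - prev > time_radius:           # a gap ends the current time cluster
                clusters.append((start, prev, (start + prev) / 2.0))
                start = t
            prev = t
        clusters.append((start, prev, (start + prev) / 2.0))
    return clusters

def segmentation_boundaries(clusters):
    """Order time clusters by centre time and take the midpoint between the
    cluster boundaries of every two adjacent time clusters as a boundary."""
    ordered = sorted(clusters, key=lambda c: c[2])
    return [(a[1] + b[0]) / 2.0 for a, b in zip(ordered, ordered[1:])]

With the example above (time clusters a, c, d and b), this yields the boundaries 5.6s, 10.4s and 14.45s.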
Fig. 10 is a flowchart illustrating another audio processing method according to an exemplary embodiment, and as shown in fig. 10, step 105 may be implemented by:
in step 1054, the plurality of segmentation boundaries are arranged in time order, and the audio to be processed is segmented according to the ordered segmentation boundaries, so as to obtain a first number of initial audio segments.
Step 1055, for each initial audio segment, determining a target time cluster to which the initial audio segment belongs according to two partition boundaries corresponding to two ends of the initial audio segment, determining a partition boundary corresponding to a start end of the initial audio segment by a cluster boundary of the target time cluster and a cluster boundary of a time cluster before the target time cluster, and determining a partition boundary corresponding to an end of the initial audio segment by a cluster boundary of the target time cluster and a cluster boundary of a time cluster after the target time cluster.
For example, when segmenting the audio to be processed, the audio may first be segmented according to the segmentation boundaries, and the result of the segmentation may then be associated with the clusters obtained in step 104 so that the segments can be merged according to that association. Specifically, the plurality of segmentation boundaries may be arranged in chronological order and the audio to be processed segmented according to the ordered segmentation boundaries to obtain a first number of initial audio segments, where the first number = the number of segmentation boundaries + 1. An initial audio segment is obtained by directly segmenting the audio to be processed according to the segmentation boundaries obtained in step 104, which can be understood to mean that the audio frames included in one initial audio segment belong to the same melody.
Then, for each initial audio segment, the target time cluster to which it belongs can be determined according to the two segmentation boundaries corresponding to its two ends. For example, the duration of the audio to be processed is 60s and corresponds to 2 clusters, where the first cluster includes time cluster e and time cluster f with central times 8s and 29s, and the second cluster includes time cluster g and time cluster h with central times 17s and 43s; arranged in chronological order, the 4 time clusters are e-g-f-h. The segmentation boundaries determined from the 4 time clusters are 12.5s, 23s and 36s, i.e., 3 segmentation boundaries in total. The corresponding 4 (i.e., the first number of) initial audio segments are then 0s-12.5s, 12.5s-23s, 23s-36s and 36s-60s. For the initial audio segment 12.5s-23s, the segmentation boundaries corresponding to its two ends are 12.5s and 23s, so the target time cluster to which it belongs can be determined to be time cluster g, i.e., the boundary determined from the cluster boundaries of time clusters g and e is 12.5s and the boundary determined from the cluster boundaries of time clusters g and f is 23s. Similarly, it can be determined that the target time cluster of 0s-12.5s is time cluster e, that of 23s-36s is time cluster f, and that of 36s-60s is time cluster h.
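The association of each initial audio segment with its target time cluster (and hence with a target cluster) could be realized, for instance, by picking the time cluster that overlaps the segment the most; the following is only an illustrative sketch, not a procedure prescribed by the patent:

def assign_segments_to_clusters(initial_segments, time_clusters):
    """Assign each initial audio segment to the time cluster it overlaps most.

    initial_segments: list of (start, end) tuples in seconds.
    time_clusters:    list of (start, end, cluster_label) tuples.
    Returns the cluster label of each initial segment, in order.
    """
    labels = []
    for seg_start, seg_end in initial_segments:
        best_label, best_overlap = None, float("-inf")
        for tc_start, tc_end, label in time_clusters:
            overlap = min(seg_end, tc_end) - max(seg_start, tc_start)
            if overlap > best_overlap:
                best_overlap, best_label = overlap, label
        labels.append(best_label)
    return labels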
In step 1056, it is determined that the initial audio segment belongs to the target cluster according to the target cluster to which the target time cluster belongs.
In step 1057, the first number of initial audio segments are combined according to the target cluster to which each initial audio segment belongs, so as to obtain the second number of audio segments. The first number and the second number may be the same or different.
For example, after the target time cluster of each initial audio segment has been determined, the initial audio segment may be marked according to the target cluster to which that target time cluster belongs; for example, a tag may be added to the initial audio segment to indicate the target cluster to which it belongs. For instance: 0s-12.5s belongs to time cluster e, and time cluster e belongs to the first cluster, so 0s-12.5s may be marked X to indicate that its target cluster is the first cluster; 12.5s-23s may be marked Y to indicate that its target cluster is the second cluster; and similarly, 23s-36s is marked X and 36s-60s is marked Y. That is, the tags corresponding to the 4 initial audio segments are X-Y-X-Y. It can be understood that initial audio segments carrying the same tag belong to the same cluster, that is, they have a similar music structure, so the music structure to which each initial audio segment belongs can be output explicitly in the form of these tags.
Furthermore, the pattern of the initial audio segments can be obtained from the target cluster to which each initial audio segment belongs, so that the first number of initial audio segments can be merged to obtain a second number of audio segments, where the second number is less than or equal to the first number. That is, combining the music structure to which each initial audio segment belongs (i.e., the target cluster to which it belongs) reveals the relationship between the music structures of the first number of initial audio segments, so that the initial audio segments can be merged according to specific requirements. From the tags X-Y-X-Y corresponding to the 4 initial audio segments it can be seen that, in the audio to be processed, the 4 initial audio segments repeat regularly, i.e., the structure X-Y is repeated 2 times, and X-Y can be understood as one bar. Then 0s-12.5s and 12.5s-23s can be merged into the audio segment 0s-23s, and 23s-36s and 36s-60s can be merged into the audio segment 23s-60s.
For another example, suppose the audio to be processed corresponds to 8 segmentation boundaries; a spectrogram of the audio to be processed is shown in fig. 11, where the horizontal axis represents time and the vertical axis represents frequency. Following step 1054, 9 (i.e., the first number of) initial audio segments are obtained, as indicated by the dashed divisions in fig. 11. Marking them according to the target cluster of each initial audio segment gives A1-B1-A1-B1-C1-D1-C1-D1-E1, where the structure A1-B1 is repeated 2 times and the structure C1-D1 is repeated 2 times. A1-B1 can then be treated as one bar and C1-D1 as one bar, and merging gives (A1-B1)-(C1-D1)-E1, i.e., 5 (a second number of) audio segments. Alternatively, A1-B1-A1-B1 can be treated as one section and C1-D1-C1-D1 as one section, and merging gives (A1-B1-A1-B1)-(C1-D1-C1-D1)-E1, i.e., 3 (a second number of) audio segments. Therefore, on the basis of the initial audio segments, the relationship between their music structures can be used to merge them, which further improves the flexibility of audio segmentation.
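One simple way to realize this kind of merging, given the per-segment tags, is to detect the shortest repeating tag pattern and merge every occurrence of that pattern into one audio segment. The sketch below only illustrates that idea for the simple X-Y-X-Y case above; it is not prescribed by the patent, and other grouping rules (for example, merging two repetitions at a time) are equally possible.

def merge_by_repeating_pattern(segments, labels):
    """Merge consecutive initial audio segments that form a repeating tag pattern.

    segments: list of (start, end) tuples in seconds, in time order.
    labels:   cluster tag of each initial segment, e.g. ['X', 'Y', 'X', 'Y'].
    Consecutive segments covered by one occurrence of the repeating pattern
    (e.g. 'X-Y') are merged into a single audio segment.
    """
    n = len(labels)
    period = n
    for p in range(1, n + 1):                 # shortest period that tiles the tags exactly
        if n % p == 0 and all(labels[i] == labels[i % p] for i in range(n)):
            period = p
            break
    merged = []
    for i in range(0, n, period):
        group = segments[i:i + period]
        merged.append((group[0][0], group[-1][1]))
    return merged

# Example from the text: tags X-Y-X-Y with segments 0-12.5, 12.5-23, 23-36, 36-60
print(merge_by_repeating_pattern(
    [(0, 12.5), (12.5, 23), (23, 36), (36, 60)], ['X', 'Y', 'X', 'Y']))
# -> [(0, 23), (23, 60)]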
It should be noted that after the second number of audio segments is obtained, each audio segment may be matched with a different video segment according to specific requirements. For example, matching may be performed according to the duration of each audio segment and the duration of each video segment, and the audio segment and the video segment with the closest durations may be fused. Alternatively, a content tag (or emotion tag) can be added to each audio segment, and each audio segment is fused with the video segment whose content tag (or emotion tag) matches it.
In summary, the present disclosure first extracts the frequency domain feature of each audio frame in the audio to be processed and determines an initial similarity matrix based on the frequency domain features; it then obtains the time sequence of the audio frames in the audio to be processed and corrects the initial similarity matrix according to that time sequence to obtain a target similarity matrix that fuses the time relationship. A corresponding audio feature map is constructed according to the target similarity matrix, spectral clustering is performed on the constructed audio feature map to obtain a plurality of clusters, a plurality of segmentation boundaries are determined according to the cluster boundaries of the plurality of clusters, and finally the audio to be processed is segmented according to the segmentation boundaries to obtain a plurality of audio segments. Because the target similarity matrix combines the similarity of the audio frames in the frequency domain features with the time sequence of the audio frames in the audio to be processed, the segmentation boundaries are determined from a matrix that reflects the association between audio frames in both the frequency domain and the time domain, which improves the accuracy and adaptability of audio segmentation.
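Purely for illustration, the clustering stage of this pipeline can be sketched by supplying the fused target similarity matrix to an off-the-shelf spectral clustering implementation (scikit-learn is assumed here) as a precomputed affinity; the boundary rule shown (a boundary wherever consecutive frame labels change) is a simplification of the time-cluster procedure described above, and the number of clusters and other parameters are assumptions.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_frames(target_sim, n_clusters=4):
    """Spectral clustering on the fused similarity matrix: each audio frame
    receives a cluster label from which segmentation boundaries are derived."""
    return SpectralClustering(n_clusters=n_clusters,
                              affinity='precomputed',
                              assign_labels='kmeans',
                              random_state=0).fit_predict(target_sim)

def boundaries_from_labels(labels, frame_times):
    """Simplified boundary rule: place a boundary wherever consecutive frames
    change cluster label (an approximation of the time-cluster procedure)."""
    change = np.flatnonzero(np.diff(labels)) + 1
    return [frame_times[i] for i in change]
```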
Fig. 12 is a block diagram of an audio processing apparatus according to an exemplary embodiment, and as shown in fig. 12, the apparatus 200 includes:
the first determining module 201 is configured to extract a frequency domain feature of each audio frame in the audio to be processed, and determine an initial similarity matrix corresponding to the audio to be processed based on the frequency domain feature.
The second determining module 202 is configured to obtain a time sequence of each audio frame in the audio to be processed, and correct the initial similarity matrix according to the time sequence to obtain a target similarity matrix with a fused time relationship.
The map construction module 203 is configured to construct an audio feature map corresponding to the audio to be processed according to the target similarity matrix.
The clustering module 204 is configured to perform spectral clustering on the audio feature map to obtain a plurality of clusters, and determine a plurality of segmentation boundaries according to the cluster boundaries of the plurality of clusters.
The segmentation module 205 is configured to segment the audio to be processed according to the segmentation boundary to obtain a plurality of audio segments.
Fig. 13 is a block diagram of another audio processing apparatus according to an exemplary embodiment, and as shown in fig. 13, the first determining module 201 may include:
An extraction submodule 2011 is configured to extract the frequency domain features of each audio frame according to the CQT (constant-Q transform).
A first determining submodule 2012 is configured to determine the similarity of each two audio frames in the frequency domain feature according to the frequency domain feature of each audio frame.
A first generating sub-module 2013, configured to generate an initial similarity matrix according to the similarity of each two audio frames in the frequency domain feature, and perform smooth filtering on the initial similarity matrix.
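As an illustrative sketch only (the disclosure does not mandate a specific library), the CQT feature extraction of submodule 2011 could be realised as follows; the use of librosa, the sample rate, and the hop length are assumptions, and the columns of the CQT are taken to correspond to the audio frames referred to above.

```python
import numpy as np
import librosa

def extract_cqt_features(path, sr=22050, hop_length=512):
    """Load audio and return one CQT magnitude vector per audio frame.

    Library, sample rate and hop length are illustrative assumptions,
    not values fixed by the disclosure.
    """
    y, sr = librosa.load(path, sr=sr)
    # Constant-Q transform: rows are frequency bins, columns are audio frames.
    cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=hop_length))
    # Log-magnitude scale so similarities are less dominated by loud frames.
    features = librosa.amplitude_to_db(cqt, ref=np.max)
    return features.T  # shape: (num_frames, num_bins)
```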
In one application scenario, the first determination submodule 2012 is configured to perform the following steps:
step a) calculates a first similarity between frequency domain features of each two audio frames using a gaussian kernel function.
Step b), determining adjacent audio frames corresponding to each audio frame on the frequency domain characteristics according to a preset adjacent algorithm.
Step c) if the first audio frame belongs to an adjacent audio frame corresponding to the second audio frame, the first similarity between the frequency domain features of the first audio frame and the second audio frame is taken as the similarity of the first audio frame and the second audio frame in the frequency domain features. If the first audio frame does not belong to an adjacent audio frame corresponding to the second audio frame, the similarity of the first audio frame and the second audio frame in the frequency domain features is set to zero.
Wherein the first audio frame is different from the second audio frame.
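A minimal sketch of steps a) to c), plus the smooth filtering mentioned for submodule 2013, is given below under the assumption that a k-nearest-neighbour rule plays the role of the "preset adjacent algorithm"; the kernel width, the value of k, and the choice of a median filter for smoothing are all assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from scipy.ndimage import median_filter

def initial_similarity_matrix(features, k=15, sigma=1.0, smooth_size=5):
    """Steps a)-c) plus smoothing: Gaussian first similarity, kept only for
    feature-space neighbours, then smoothed (parameter values are illustrative)."""
    features = np.asarray(features)
    n = len(features)
    # Step a) first similarity between every two frames via a Gaussian (RBF) kernel.
    dists = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    first_sim = np.exp(-(dists ** 2) / (2 * sigma ** 2))
    # Step b) neighbours of each frame in the frequency-domain feature space.
    neighbors = NearestNeighbors(n_neighbors=k + 1).fit(features)
    idx = neighbors.kneighbors(features, return_distance=False)[:, 1:]  # drop self
    # Step c) keep the first similarity only where one frame is a neighbour of the other.
    sim = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    sim[rows, idx.ravel()] = first_sim[rows, idx.ravel()]
    sim = np.maximum(sim, sim.T)   # symmetrise the neighbour relation
    np.fill_diagonal(sim, 1.0)
    # Smooth filtering of the initial similarity matrix (median filter assumed here).
    return median_filter(sim, size=smooth_size)
```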
Fig. 14 is a block diagram of another audio processing apparatus, shown in fig. 14, according to an exemplary embodiment, the segmentation module 205 may include:
The sequence extraction submodule 2051 is configured to extract a corresponding beat time sequence in the audio to be processed, where the beat time sequence includes a time corresponding to each beat in the audio to be processed.
The searching sub-module 2052 is configured to search, for each partition boundary, in the beat time sequence, for a target beat with a smallest time difference between the corresponding time and the partition boundary.
The dividing submodule 2053 is configured to divide the audio to be processed according to the time corresponding to the target beat corresponding to each dividing boundary, so as to obtain a plurality of audio segments.
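The boundary-to-beat alignment performed by submodules 2051 to 2053 can be sketched as follows; `beat_times` is assumed to be the beat time sequence expressed in seconds, and the helper names are illustrative.

```python
import numpy as np

def snap_boundaries_to_beats(boundaries, beat_times):
    """For each partition boundary, pick the beat whose time is closest,
    then deduplicate and sort the snapped boundaries."""
    beat_times = np.asarray(beat_times)
    snapped = [beat_times[np.argmin(np.abs(beat_times - b))] for b in boundaries]
    return sorted(set(snapped))

def cut_audio(y, sr, snapped_boundaries):
    """Split the waveform at the snapped boundary times (in seconds)."""
    samples = [int(t * sr) for t in snapped_boundaries]
    edges = [0] + samples + [len(y)]
    return [y[a:b] for a, b in zip(edges[:-1], edges[1:]) if b > a]
```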
In one application scenario, the sequence extraction submodule 2051 may be used to perform the following steps:
step d) determining the starting point of each note in the audio to be processed according to the short-time energy of each audio frame.
Step e) determining the tempo and beat of the audio to be processed according to the starting point of each note.
Step f) determining the time corresponding to each beat in the audio to be processed according to the short-time energy of each audio frame and the tempo and beat of the audio to be processed.
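Steps d) to f) correspond broadly to onset detection followed by tempo and beat tracking. The sketch below uses librosa's onset-strength envelope and beat tracker as one possible realisation, not the specific procedure prescribed by the disclosure; the hop length is an assumed parameter.

```python
import librosa

def beat_time_sequence(y, sr, hop_length=512):
    """Approximate steps d)-f): note-onset strength from an energy-like envelope,
    then tempo and beat positions, returned as times in seconds."""
    # Step d) an onset-strength envelope stands in for the short-time energy cue.
    onset_env = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop_length)
    # Steps e)-f) tempo and beat frames derived from the onset envelope.
    tempo, beat_frames = librosa.beat.beat_track(onset_envelope=onset_env, sr=sr,
                                                 hop_length=hop_length)
    beat_times = librosa.frames_to_time(beat_frames, sr=sr, hop_length=hop_length)
    return tempo, beat_times
```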
Fig. 15 is a block diagram of another audio processing apparatus, shown in fig. 15, according to an exemplary embodiment, the second determination module 202 may include:
A second determining submodule 2021 is configured to determine a target similarity between each two audio frames according to a time sequence of each audio frame in the audio to be processed and a similarity of each two audio frames in the frequency domain feature.
A second generation sub-module 2022 is configured to generate a target similarity matrix according to the target similarity between each two audio frames.
In one application scenario, the second determination submodule 2021 may be used to:
if the time difference between the first audio frame and the second audio frame is smaller than the preset time threshold, setting the target similarity between the first audio frame and the second audio frame to be 1. And if the time difference between the first audio frame and the second audio frame is greater than or equal to the time threshold, taking the similarity of the first audio frame and the second audio frame on the frequency domain characteristics as the target similarity between the first audio frame and the second audio frame.
Wherein the first audio frame is different from the second audio frame.
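A sketch of this time-relationship correction, assuming the frame times are known and `time_threshold` is a tunable parameter chosen for illustration:

```python
import numpy as np

def fuse_time_relationship(initial_sim, frame_times, time_threshold=0.2):
    """Target similarity: 1 for frames whose time difference is below the threshold,
    otherwise the frequency-domain similarity is kept unchanged."""
    frame_times = np.asarray(frame_times)
    time_diff = np.abs(frame_times[:, None] - frame_times[None, :])
    return np.where(time_diff < time_threshold, 1.0, initial_sim)
```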
Fig. 16 is a block diagram of another audio processing device, shown in fig. 16, according to an example embodiment, the clustering module 204 may include:
the clustering submodule 2041 is configured to, for each cluster, perform clustering according to times corresponding to a plurality of audio frames included in the cluster, so as to obtain at least one time cluster included in the cluster.
An arrangement submodule 2042 for arranging time clusters included in the plurality of clusters in time order.
A third determining submodule 2043 is configured to determine a plurality of partition boundaries according to the cluster boundaries of each two adjacent time clusters.
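For submodules 2041 to 2043, a hedged sketch: the frames of each cluster are grouped into contiguous runs in time ("time clusters"), the runs from all clusters are arranged in time order, and a partition boundary is placed between every two adjacent runs. The gap threshold used to break runs and the midpoint rule for the boundary are assumptions.

```python
import numpy as np

def time_clusters_and_boundaries(labels, frame_times, gap=0.5):
    """Group each cluster's frame times into contiguous time clusters, order them,
    and place a partition boundary between every two adjacent time clusters."""
    labels = np.asarray(labels)
    frame_times = np.asarray(frame_times)
    runs = []  # (start_time, end_time, cluster_label)
    for c in np.unique(labels):
        times = np.sort(frame_times[labels == c])
        start = times[0]
        for prev, cur in zip(times[:-1], times[1:]):
            if cur - prev > gap:          # a large gap starts a new time cluster
                runs.append((start, prev, c))
                start = cur
        runs.append((start, times[-1], c))
    runs.sort(key=lambda r: r[0])          # arrange time clusters in time order
    # Boundary between the end of one time cluster and the start of the next.
    boundaries = [(end_a + start_b) / 2
                  for (_, end_a, _), (start_b, _, _) in zip(runs[:-1], runs[1:])]
    return runs, boundaries
```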
Fig. 17 is a block diagram of another audio processing apparatus, shown in accordance with an exemplary embodiment, and as shown in fig. 17, the segmentation module 205 may include:
the initial segmentation submodule 2054 is configured to arrange a plurality of segmentation boundaries in time sequence, and segment the audio to be processed according to the ordered segmentation boundaries, so as to obtain a first number of initial audio segments.
A fourth determining submodule 2055, configured to determine, for each initial audio segment, a target time cluster to which the initial audio segment belongs according to two partition boundaries corresponding to two ends of the initial audio segment, where a cluster boundary of the target time cluster and a cluster boundary of a time cluster preceding the target time cluster determine a partition boundary corresponding to a start end of the initial audio segment, and a cluster boundary of the target time cluster and a cluster boundary of a time cluster following the target time cluster determine a partition boundary corresponding to an end of the initial audio segment.
And a fifth determining submodule 2056, configured to determine that the initial audio segment belongs to the target cluster according to the target cluster to which the target time cluster belongs.
And the merging submodule 2057 is configured to merge the first number of initial audio segments according to the target cluster to which each initial audio segment belongs, so as to obtain a second number of audio segments.
The specific manner in which the various modules perform the operations in the apparatus of the above embodiments has been described in detail in connection with the embodiments of the method, and will not be described in detail herein.
In summary, the present disclosure first extracts the frequency domain feature of each audio frame in the audio to be processed and determines an initial similarity matrix based on the frequency domain features; it then obtains the time sequence of the audio frames in the audio to be processed and corrects the initial similarity matrix according to that time sequence to obtain a target similarity matrix that fuses the time relationship. A corresponding audio feature map is constructed according to the target similarity matrix, spectral clustering is performed on the constructed audio feature map to obtain a plurality of clusters, a plurality of segmentation boundaries are determined according to the cluster boundaries of the plurality of clusters, and finally the audio to be processed is segmented according to the segmentation boundaries to obtain a plurality of audio segments. Because the target similarity matrix combines the similarity of the audio frames in the frequency domain features with the time sequence of the audio frames in the audio to be processed, the segmentation boundaries are determined from a matrix that reflects the association between audio frames in both the frequency domain and the time domain, which improves the accuracy and adaptability of audio segmentation.
Referring now to fig. 18, a schematic diagram of an electronic device (e.g., an execution body in the embodiment of the present disclosure, which may be a terminal device or a server) 300 suitable for implementing the embodiment of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 18 is merely an example, and should not impose any limitation on the functionality and scope of use of the embodiments of the present disclosure.
As shown in fig. 18, the electronic device 300 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 301, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 302 or a program loaded from a storage means 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data required for the operation of the electronic apparatus 300 are also stored. The processing device 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to bus 304.
In general, the following devices may be connected to the I/O interface 305: input devices 306 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 307 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 308 including, for example, magnetic tape, hard disk, etc.; and communication means 309. The communication means 309 may allow the electronic device 300 to communicate with other devices wirelessly or by wire to exchange data. While fig. 18 illustrates an electronic device 300 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via a communication device 309, or installed from a storage device 308, or installed from a ROM 302. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing means 301.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some embodiments, the terminal devices and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: extracting frequency domain characteristics of each audio frame in audio to be processed, and determining an initial similarity matrix corresponding to the audio to be processed based on the frequency domain characteristics; acquiring the time sequence of each audio frame in the audio to be processed, and correcting the initial similarity matrix according to the time sequence to obtain a target similarity matrix fusing time relations; constructing an audio feature map corresponding to the audio to be processed according to the target similarity matrix; performing spectral clustering on the audio feature map to obtain a plurality of clusters, and determining a plurality of segmentation boundaries of the audio to be processed according to the clustering boundaries of the clusters; and dividing the audio to be processed according to the dividing boundary to obtain a plurality of audio segments.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented in software or hardware. In some cases the name of a module does not constitute a limitation of the module itself; for example, the first determining module may also be described as "a module for determining an initial similarity matrix".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, example 1 provides an audio processing method, comprising: extracting frequency domain characteristics of each audio frame in audio to be processed, and determining an initial similarity matrix corresponding to the audio to be processed based on the frequency domain characteristics; acquiring the time sequence of each audio frame in the audio to be processed, and correcting the initial similarity matrix according to the time sequence to obtain a target similarity matrix fusing time relations; constructing an audio feature map corresponding to the audio to be processed according to the target similarity matrix; performing spectral clustering on the audio feature map to obtain a plurality of clusters, and determining a plurality of segmentation boundaries of the audio to be processed according to the clustering boundaries of the clusters; and dividing the audio to be processed according to the dividing boundary to obtain a plurality of audio segments.
In accordance with one or more embodiments of the present disclosure, example 2 provides the method of example 1, the extracting frequency domain features of each audio frame in the audio to be processed, comprising: extracting the frequency domain characteristics of each audio frame according to CQT; the determining an initial similarity matrix corresponding to the audio to be processed based on the frequency domain features comprises the following steps: according to the frequency domain characteristics of each audio frame, determining the similarity of each two audio frames on the frequency domain characteristics; and generating the initial similarity matrix according to the similarity of every two audio frames on the frequency domain characteristics, and carrying out smooth filtering on the initial similarity matrix.
According to one or more embodiments of the present disclosure, example 3 provides the method of example 2, the determining a similarity of each two of the audio frames in terms of frequency domain features of each of the audio frames, including: calculating a first similarity between frequency domain features of each two audio frames by using a Gaussian kernel function; determining adjacent audio frames corresponding to each audio frame on the frequency domain characteristics according to a preset adjacent algorithm; if the first audio frame belongs to the adjacent audio frame corresponding to the second audio frame, taking the first similarity between the frequency domain characteristics of the first audio frame and the frequency domain characteristics of the second audio frame as the similarity between the frequency domain characteristics of the first audio frame and the second audio frame; if the first audio frame does not belong to the adjacent audio frame corresponding to the second audio frame, setting the similarity of the first audio frame and the second audio frame on the frequency domain characteristics to zero; wherein the first audio frame is different from the second audio frame.
According to one or more embodiments of the present disclosure, example 4 provides the method of example 1, the dividing the audio to be processed according to the dividing boundary to obtain a plurality of audio segments, including: extracting a corresponding beat time sequence in the audio to be processed, wherein the beat time sequence comprises the time corresponding to each beat in the audio to be processed; for each dividing boundary, searching in the beat time sequence, and obtaining a target beat with the minimum time difference between the corresponding time and the dividing boundary; and dividing the audio to be processed according to the time corresponding to the target beat corresponding to each dividing boundary to obtain a plurality of audio segments.
According to one or more embodiments of the present disclosure, example 5 provides the method of example 4, the extracting a corresponding beat time sequence in the audio to be processed, including: determining a starting point of each note in the audio to be processed according to the short-time energy of each audio frame; determining the speed and the beat of the audio to be processed according to the starting point of each note; and determining the time corresponding to each beat in the audio to be processed according to the short-time energy of each audio frame, the speed and the beat of the audio to be processed.
In accordance with one or more embodiments of the present disclosure, example 6 provides the method of example 1, said modifying the initial similarity matrix in the temporal order to obtain a target similarity matrix fusing temporal relationships, comprising: determining target similarity between every two audio frames according to the time sequence of each audio frame in the audio to be processed and the similarity of every two audio frames on frequency domain characteristics; and generating the target similarity matrix according to the target similarity between every two audio frames.
According to one or more embodiments of the present disclosure, example 7 provides the method of example 6, the determining the target similarity between each two of the audio frames according to a temporal sequence of each of the audio frames in the audio to be processed and a similarity of each of the two audio frames in frequency domain features, comprising: if the time difference between the first audio frame and the second audio frame is smaller than a preset time threshold, setting the target similarity between the first audio frame and the second audio frame to be 1; if the time difference between the first audio frame and the second audio frame is greater than or equal to the time threshold, the similarity of the first audio frame and the second audio frame on the frequency domain characteristics is used as the target similarity between the first audio frame and the second audio frame; wherein the first audio frame is different from the second audio frame.
According to one or more embodiments of the present disclosure, example 8 provides the method of example 1, the determining a plurality of partition boundaries of the audio to be processed from cluster boundaries of a plurality of the clusters, comprising: clustering according to the time corresponding to the plurality of audio frames included in each cluster aiming at each cluster to obtain at least one time cluster included in the cluster; arranging the time clusters included in the clusters according to a time sequence; and determining a plurality of segmentation boundaries according to the clustering boundaries of every two adjacent time clusters.
According to one or more embodiments of the present disclosure, example 9 provides the method of example 8, the segmenting the audio to be processed according to the segmentation boundary to obtain a plurality of audio segments, including: arranging a plurality of segmentation boundaries according to a time sequence, and segmenting the audio to be processed according to the sequenced segmentation boundaries to obtain a first number of initial audio segments; for each initial audio segment, determining a target time cluster to which the initial audio segment belongs according to two partition boundaries respectively corresponding to two ends of the initial audio segment, determining a partition boundary corresponding to a start end of the initial audio segment by a cluster boundary of the target time cluster and a cluster boundary of a time cluster before the target time cluster, and determining a partition boundary corresponding to an end of the initial audio segment by a cluster boundary of the target time cluster and a cluster boundary of a time cluster after the target time cluster; determining that the initial audio segment belongs to the target cluster according to the target cluster to which the target time cluster belongs; and merging the first number of the initial audio segments according to the target cluster to which each initial audio segment belongs, so as to obtain a second number of the audio segments.
In accordance with one or more embodiments of the present disclosure, example 10 provides an audio processing apparatus, comprising: the first determining module is used for extracting the frequency domain characteristics of each audio frame in the audio to be processed and determining an initial similarity matrix corresponding to the audio to be processed based on the frequency domain characteristics; the second determining module is used for obtaining the time sequence of each audio frame in the audio to be processed, and correcting the initial similarity matrix according to the time sequence to obtain a target similarity matrix fusing the time relations; the map construction module is used for constructing an audio characteristic map corresponding to the audio to be processed according to the target similarity matrix; the clustering module is used for carrying out spectral clustering on the audio feature map to obtain a plurality of clusters, and determining a plurality of segmentation boundaries according to the clustering boundaries of the plurality of clusters; and the segmentation module is used for segmenting the audio to be processed according to the segmentation boundary so as to obtain a plurality of audio segments.
According to one or more embodiments of the present disclosure, example 11 provides a computer-readable medium having stored thereon a computer program which, when executed by a processing device, implements the steps of the methods described in examples 1 to 9.
Example 12 provides an electronic device according to one or more embodiments of the present disclosure, comprising: a storage device having a computer program stored thereon; processing means for executing the computer program in the storage means to realize the steps of the method described in examples 1 to 9.
The foregoing description is only of the preferred embodiments of the present disclosure and an illustration of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the disclosure is not limited to the specific combinations of features described above, but also covers other embodiments formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, embodiments in which the above features are interchanged with (but not limited to) technical features having similar functions disclosed in the present disclosure.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims.

Claims (11)

1. A method of audio processing, the method comprising:
extracting frequency domain characteristics of each audio frame in audio to be processed, and determining an initial similarity matrix corresponding to the audio to be processed based on the frequency domain characteristics;
acquiring the time sequence of each audio frame in the audio to be processed, and correcting the initial similarity matrix according to the time sequence to obtain a target similarity matrix fusing time relations;
constructing an audio feature map corresponding to the audio to be processed according to the target similarity matrix;
performing spectral clustering on the audio feature map to obtain a plurality of clusters, and determining a plurality of segmentation boundaries of the audio to be processed according to the clustering boundaries of the clusters;
Extracting a corresponding beat time sequence in the audio to be processed, wherein the beat time sequence comprises the time corresponding to each beat in the audio to be processed;
for each dividing boundary, searching in the beat time sequence, and obtaining a target beat with the minimum time difference between the corresponding time and the dividing boundary;
and dividing the audio to be processed according to the time corresponding to the target beat corresponding to each dividing boundary to obtain a plurality of audio segments.
2. The method of claim 1, wherein extracting frequency domain features for each audio frame in the audio to be processed comprises:
extracting the frequency domain characteristics of each audio frame according to CQT;
the determining an initial similarity matrix corresponding to the audio to be processed based on the frequency domain features comprises the following steps:
according to the frequency domain characteristics of each audio frame, determining the similarity of each two audio frames on the frequency domain characteristics;
and generating the initial similarity matrix according to the similarity of every two audio frames on the frequency domain characteristics, and carrying out smooth filtering on the initial similarity matrix.
3. The method of claim 2, wherein said determining the similarity of each two of said audio frames in frequency domain features from the frequency domain features of each of said audio frames comprises:
Calculating a first similarity between frequency domain features of each two audio frames by using a Gaussian kernel function;
determining adjacent audio frames corresponding to each audio frame on the frequency domain characteristics according to a preset adjacent algorithm;
if the first audio frame belongs to the adjacent audio frame corresponding to the second audio frame, taking the first similarity between the frequency domain characteristics of the first audio frame and the frequency domain characteristics of the second audio frame as the similarity between the frequency domain characteristics of the first audio frame and the second audio frame; if the first audio frame does not belong to the adjacent audio frame corresponding to the second audio frame, setting the similarity of the first audio frame and the second audio frame on the frequency domain characteristics to zero;
wherein the first audio frame is different from the second audio frame.
4. The method according to claim 1, wherein the extracting the corresponding beat time sequence in the audio to be processed comprises:
determining a starting point of each note in the audio to be processed according to the short-time energy of each audio frame;
determining the speed and the beat of the audio to be processed according to the starting point of each note;
and determining the time corresponding to each beat in the audio to be processed according to the short-time energy of each audio frame, the speed and the beat of the audio to be processed.
5. The method of claim 1, wherein said modifying said initial similarity matrix in said temporal order to obtain a target similarity matrix fusing temporal relationships comprises:
determining target similarity between every two audio frames according to the time sequence of each audio frame in the audio to be processed and the similarity of every two audio frames on frequency domain characteristics;
and generating the target similarity matrix according to the target similarity between every two audio frames.
6. The method of claim 5, wherein said determining a target similarity between each two of said audio frames based on a temporal order of each of said audio frames in said audio to be processed and a similarity of each two of said audio frames in frequency domain characteristics, comprises:
if the time difference between the first audio frame and the second audio frame is smaller than a preset time threshold, setting the target similarity between the first audio frame and the second audio frame to be 1; if the time difference between the first audio frame and the second audio frame is greater than or equal to the time threshold, the similarity of the first audio frame and the second audio frame on the frequency domain characteristics is used as the target similarity between the first audio frame and the second audio frame;
Wherein the first audio frame is different from the second audio frame.
7. The method of claim 1, wherein the determining a plurality of partition boundaries for the audio to be processed from cluster boundaries for a plurality of the clusters comprises:
clustering according to the time corresponding to the plurality of audio frames included in each cluster aiming at each cluster to obtain at least one time cluster included in the cluster;
arranging the time clusters included in the clusters according to a time sequence;
and determining a plurality of segmentation boundaries according to the clustering boundaries of every two adjacent time clusters.
8. The method of claim 7, wherein the dividing the audio to be processed according to the dividing boundary to obtain a plurality of audio segments comprises:
arranging a plurality of segmentation boundaries according to a time sequence, and segmenting the audio to be processed according to the sequenced segmentation boundaries to obtain a first number of initial audio segments;
for each initial audio segment, determining a target time cluster to which the initial audio segment belongs according to two partition boundaries respectively corresponding to two ends of the initial audio segment, determining a partition boundary corresponding to a start end of the initial audio segment by a cluster boundary of the target time cluster and a cluster boundary of a time cluster before the target time cluster, and determining a partition boundary corresponding to an end of the initial audio segment by a cluster boundary of the target time cluster and a cluster boundary of a time cluster after the target time cluster;
Determining that the initial audio segment belongs to the target cluster according to the target cluster to which the target time cluster belongs;
and merging the first number of the initial audio segments according to the target cluster to which each initial audio segment belongs, so as to obtain a second number of the audio segments.
9. An audio processing apparatus, the apparatus comprising:
the first determining module is used for extracting the frequency domain characteristics of each audio frame in the audio to be processed and determining an initial similarity matrix corresponding to the audio to be processed based on the frequency domain characteristics;
the second determining module is used for obtaining the time sequence of each audio frame in the audio to be processed, and correcting the initial similarity matrix according to the time sequence to obtain a target similarity matrix fusing the time relations;
the map construction module is used for constructing an audio characteristic map corresponding to the audio to be processed according to the target similarity matrix;
the clustering module is used for carrying out spectral clustering on the audio feature map to obtain a plurality of clusters, and determining a plurality of segmentation boundaries according to the clustering boundaries of the plurality of clusters;
the segmentation module is used for extracting a corresponding beat time sequence in the audio to be processed, wherein the beat time sequence comprises the time corresponding to each beat in the audio to be processed; for each dividing boundary, searching in the beat time sequence, and obtaining a target beat with the minimum time difference between the corresponding time and the dividing boundary; and dividing the audio to be processed according to the time corresponding to the target beat corresponding to each dividing boundary to obtain a plurality of audio segments.
10. A computer readable medium on which a computer program is stored, characterized in that the program, when being executed by a processing device, carries out the steps of the method according to any one of claims 1-8.
11. An electronic device, comprising:
a storage device having a computer program stored thereon;
processing means for executing said computer program in said storage means to carry out the steps of the method according to any one of claims 1-8.
CN202110586577.5A 2021-05-27 2021-05-27 Audio processing method, device, readable medium and electronic equipment Active CN113327628B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110586577.5A CN113327628B (en) 2021-05-27 2021-05-27 Audio processing method, device, readable medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110586577.5A CN113327628B (en) 2021-05-27 2021-05-27 Audio processing method, device, readable medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN113327628A CN113327628A (en) 2021-08-31
CN113327628B true CN113327628B (en) 2023-12-22

Family

ID=77421923

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110586577.5A Active CN113327628B (en) 2021-05-27 2021-05-27 Audio processing method, device, readable medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113327628B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114465737B (en) * 2022-04-13 2022-06-24 腾讯科技(深圳)有限公司 Data processing method and device, computer equipment and storage medium
CN115168643B (en) * 2022-09-07 2023-04-07 腾讯科技(深圳)有限公司 Audio processing method, device, equipment and computer readable storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9436876B1 (en) * 2014-12-19 2016-09-06 Amazon Technologies, Inc. Video segmentation techniques
CN109522693A (en) * 2018-11-19 2019-03-26 中国银行股份有限公司 Information processing method, device, electronic equipment and readable storage medium storing program for executing
CN109862389A (en) * 2018-11-20 2019-06-07 北京奇艺世纪科技有限公司 A kind of method for processing video frequency, device, server and storage medium
CN110688520A (en) * 2019-09-20 2020-01-14 腾讯音乐娱乐科技(深圳)有限公司 Audio feature extraction method, apparatus, and medium
CN111309962A (en) * 2020-01-20 2020-06-19 北京字节跳动网络技术有限公司 Method and device for extracting audio clip and electronic equipment
CN111680187A (en) * 2020-05-26 2020-09-18 平安科技(深圳)有限公司 Method and device for determining music score following path, electronic equipment and storage medium
CN112037764A (en) * 2020-08-06 2020-12-04 杭州网易云音乐科技有限公司 Music structure determination method, device, equipment and medium
CN112634935A (en) * 2021-03-10 2021-04-09 北京世纪好未来教育科技有限公司 Voice separation method and device, electronic equipment and readable storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6915009B2 (en) * 2001-09-07 2005-07-05 Fuji Xerox Co., Ltd. Systems and methods for the automatic segmentation and clustering of ordered information
DE102004047069A1 (en) * 2004-09-28 2006-04-06 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Device and method for changing a segmentation of an audio piece
WO2007011308A1 (en) * 2005-07-22 2007-01-25 Agency For Science, Technology And Research Automatic creation of thumbnails for music videos
WO2015160728A1 (en) * 2014-04-14 2015-10-22 Brown University System for electronically generating music
US10289916B2 (en) * 2015-07-21 2019-05-14 Shred Video, Inc. System and method for editing video and audio clips
US11024288B2 (en) * 2018-09-04 2021-06-01 Gracenote, Inc. Methods and apparatus to segment audio and determine audio segment similarities
CN110569702B (en) * 2019-02-14 2021-05-14 创新先进技术有限公司 Video stream processing method and device

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9436876B1 (en) * 2014-12-19 2016-09-06 Amazon Technologies, Inc. Video segmentation techniques
CN109522693A (en) * 2018-11-19 2019-03-26 中国银行股份有限公司 Information processing method, device, electronic equipment and readable storage medium storing program for executing
CN109862389A (en) * 2018-11-20 2019-06-07 北京奇艺世纪科技有限公司 A kind of method for processing video frequency, device, server and storage medium
CN110688520A (en) * 2019-09-20 2020-01-14 腾讯音乐娱乐科技(深圳)有限公司 Audio feature extraction method, apparatus, and medium
CN111309962A (en) * 2020-01-20 2020-06-19 北京字节跳动网络技术有限公司 Method and device for extracting audio clip and electronic equipment
CN111680187A (en) * 2020-05-26 2020-09-18 平安科技(深圳)有限公司 Method and device for determining music score following path, electronic equipment and storage medium
CN112037764A (en) * 2020-08-06 2020-12-04 杭州网易云音乐科技有限公司 Music structure determination method, device, equipment and medium
CN112634935A (en) * 2021-03-10 2021-04-09 北京世纪好未来教育科技有限公司 Voice separation method and device, electronic equipment and readable storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Yu Shiu, Hong Jeong, C.-C. J. Kuo, "Musical structure analysis using similarity matrix and dynamic programming," Multimedia Systems and Applications VIII, Vol. 6015, No. 1, pp. 398-409 *
J. Serrà, M. Müller, P. Grosche and J. L. Arcos, "Unsupervised Music Structure Annotation by Time Series Structure Features and Segment Similarity," IEEE Transactions on Multimedia, Vol. 16, No. 5, pp. 1229-1240 *
Yang Gui'an, Shao Yubin, Long Hua et al., "An Audio Classification Algorithm Based on Audio Segmentation," Communications Technology, Vol. 54, No. 2, pp. 317-322 *

Also Published As

Publication number Publication date
CN113327628A (en) 2021-08-31

Similar Documents

Publication Publication Date Title
CN108989882B (en) Method and apparatus for outputting music pieces in video
CN113327628B (en) Audio processing method, device, readable medium and electronic equipment
US9424743B2 (en) Real-time traffic detection
CN112153460B (en) Video dubbing method and device, electronic equipment and storage medium
CN111883117B (en) Voice wake-up method and device
US11907288B2 (en) Audio identification based on data structure
CN111222509B (en) Target detection method and device and electronic equipment
CN108431800B (en) Image processing apparatus and display method of image search interface
CN112232422A (en) Target pedestrian re-identification method and device, electronic equipment and storage medium
CN113610034B (en) Method and device for identifying character entities in video, storage medium and electronic equipment
CN113033707B (en) Video classification method and device, readable medium and electronic equipment
CN114494709A (en) Feature extraction model generation method, image feature extraction method and device
CN114153959A (en) Key value matching method and device, readable medium and electronic equipment
CN111128131B (en) Voice recognition method and device, electronic equipment and computer readable storage medium
CN105989000B (en) Audio-video copy detection method and device
CN111312224B (en) Training method and device of voice segmentation model and electronic equipment
CN111931932B (en) Method and device for generating countermeasure sample, electronic equipment and readable storage medium
CN111312223B (en) Training method and device of voice segmentation model and electronic equipment
JP4516940B2 (en) Iris recognition method using cumulative sum-based change point analysis and apparatus thereof
CN112634940A (en) Voice endpoint detection method, device, equipment and computer readable storage medium
CN115631514B (en) User identification method, device, equipment and medium based on palm vein fingerprint
Chen et al. Long-term scalogram integrated with an iterative data augmentation scheme for acoustic scene classification
CN111460214B (en) Classification model training method, audio classification method, device, medium and equipment
CN111444384B (en) Audio key point determining method, device, equipment and storage medium
CN112926623B (en) Method, device, medium and electronic equipment for identifying synthesized video

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Applicant after: Tiktok vision (Beijing) Co.,Ltd.

Address before: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Applicant before: BEIJING BYTEDANCE NETWORK TECHNOLOGY Co.,Ltd.

Address after: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Applicant after: Douyin Vision Co.,Ltd.

Address before: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Applicant before: Tiktok vision (Beijing) Co.,Ltd.

GR01 Patent grant
GR01 Patent grant