CN111506766A

CN111506766A - Audio frame clustering method, device and equipment

Info

Publication number: CN111506766A
Application number: CN202010314785.5A
Authority: CN
Inventors: 王征韬
Original assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Current assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date: 2020-04-20
Filing date: 2020-04-20
Publication date: 2020-08-07
Anticipated expiration: 2040-04-20
Also published as: CN111506766B

Abstract

The application provides an audio frame clustering method, an audio frame clustering device and audio frame clustering equipment, wherein the method comprises the following steps: acquiring a distance matrix corresponding to target audio data; determining a first seed element from the distance matrix, determining each element meeting the region growing condition in the distance matrix as a first element by taking the first seed element as a starting point of region growing, and determining the category of the audio frame corresponding to each element in the first element as a first category; acquiring a difference set between an element set and a first set included in the distance matrix; and determining a second seed element from the difference set, determining each element meeting the region growing condition in the distance matrix as a second element by taking the second seed element as a starting point of region growing, and determining the category of the audio frame corresponding to each element in the second element as a second category. By adopting the embodiment of the application, a plurality of audio frames with similar characteristics can be clustered, so that the searching efficiency is improved.

Description

Audio frame clustering method, device and equipment

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, and a device for clustering audio frames.

Background

In techniques such as music structure analysis of audio data, it is usually necessary to search for audio frames. Because one or more songs or pieces of music may contain a large number of audio frames, specific audio frames have to be located among them when performing music structure analysis.

The existing audio frame searching method traverses all audio frames corresponding to the audio data in order to find an audio frame that satisfies a condition. For example, searching for one audio frame requires traversing all audio frames corresponding to the audio data once, and searching for a plurality of audio frames requires traversing all of them a plurality of times, which results in low audio frame searching efficiency.

Disclosure of Invention

The embodiment of the application provides an audio frame clustering method, an audio frame clustering device and audio frame clustering equipment, and solves the problem that searching is inconvenient under the condition that the number of audio frames is too large.

In a first aspect, a method for clustering audio frames is provided, which includes:

acquiring a distance matrix corresponding to target audio data, wherein the target audio data comprises N audio frames, the distance matrix is a symmetric matrix, elements on a main diagonal in the distance matrix are 0, other elements except the elements on the main diagonal in the distance matrix are used for representing the distance between audio feature vectors corresponding to every two audio frames in the N audio frames, and N is a positive integer greater than or equal to 2;

determining a first seed element from the distance matrix, determining each element meeting a region growing condition in the distance matrix as a first element by taking the first seed element as a starting point of region growing, determining the category of an audio frame corresponding to each element in the first element as a first category, and determining the category of the audio frame corresponding to the first seed element as the first category, wherein the absolute difference value between horizontal and vertical coordinates of the first seed element mapped on a two-dimensional coordinate system is equal to 1;

obtaining a difference set between a set of elements included in the distance matrix and a first set, wherein the first set includes the first seed element and the first element;

determining a second seed element from the difference set, determining each element meeting the region growing condition in the distance matrix as a second element by taking the second seed element as a starting point of region growing, determining the category of the audio frame corresponding to each element in the second element as a second category, and determining the category of the audio frame corresponding to the second seed element as the second category, wherein the absolute difference between the horizontal and vertical coordinates of the second seed element mapped on the two-dimensional coordinate system is equal to 1, and the second seed element is smaller than the first seed element.

With reference to the first aspect, in a possible implementation manner, the distance matrix includes N × N elements, and the determining a first seed element from the distance matrix, determining, with the first seed element as a starting point of region growing, each element in the distance matrix that meets a region growing condition as a first element, and determining, as a first category, a category of an audio frame corresponding to each element in the first element, includes: mapping the N × N elements in the distance matrix to the two-dimensional coordinate system to obtain N × N coordinate points corresponding to the N × N elements in the distance matrix, wherein one element corresponds to one coordinate point, the first element in the distance matrix is mapped to the origin of coordinates of the two-dimensional coordinate system, and the distance between two adjacent coordinate points mapped to the two-dimensional coordinate system by every two adjacent elements in the distance matrix is equal; determining a seed coordinate point with the absolute difference value between horizontal and vertical coordinates equal to 1 from the N × N coordinate points, and determining an element corresponding to the seed coordinate point as the first seed element; determining elements in the distance matrix, which belong to the neighborhood of the first seed element and are smaller than a target threshold value, as the first elements, and determining the categories of the two audio frames corresponding to each element in the first elements as the first category, until no elements in the distance matrix, which belong to the neighborhood of the first seed element and are smaller than the target threshold value, exist; and determining elements in the distance matrix, which belong to the neighborhood of each element in the first elements and are smaller than the target threshold value, as third elements, and determining the categories of the two audio frames corresponding to each element in the third elements as the first category.

With reference to the first aspect, in a possible implementation manner, the method further includes: if K fourth elements whose absolute differences between horizontal and vertical coordinates mapped onto the two-dimensional coordinate system are greater than 1 exist in the distance matrix, and the K fourth elements belong to the neighborhoods of the first seed element and the second seed element, determining the category of the audio frame corresponding to each fourth element in the K fourth elements to obtain K third categories, wherein the category of the audio frame corresponding to one fourth element is one third category, and K is a positive integer greater than or equal to 1.

With reference to the first aspect, in a possible implementation manner, the method further includes: and generating an audio frame sequence corresponding to the target audio data according to the time sequence of each audio frame belonging to the first category, each audio frame belonging to the second category and each audio frame belonging to each third category in the K third categories in the N audio frames.

With reference to the first aspect, in a possible implementation manner, the method further includes: acquiring a first frame number of an audio frame with the minimum playing time in each audio frame belonging to the first category in the N audio frames; acquiring a second frame number of an audio frame with the maximum playing time in each audio frame belonging to the first category in the N audio frames; calculating the average value of the audio characteristic vectors corresponding to the audio frames belonging to the first category in the N audio frames; and generating a category identifier of the first category according to the first frame number, the second frame number and the mean value, wherein the category identifier of the first category comprises the first frame number, the second frame number and the mean value.
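The category identifier described above can be illustrated with a minimal Python sketch. The names `CategoryIdentifier` and `make_category_identifier` are hypothetical, and the sketch assumes that frame numbers are 0-based indices that increase with playing time, so that the minimum and maximum frame numbers correspond to the earliest and latest playing times.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class CategoryIdentifier:
    first_frame: int          # frame number with the minimum playing time
    last_frame: int           # frame number with the maximum playing time
    mean_vector: List[float]  # element-wise mean of the member feature vectors


def make_category_identifier(
    frame_numbers: Sequence[int],
    features: Sequence[Sequence[float]],
) -> CategoryIdentifier:
    """Build the identifier of one category.

    frame_numbers: frame numbers of the frames in the category
        (assumption: 0-based and increasing with playing time).
    features: one audio feature vector per frame, indexed by frame number.
    """
    if not frame_numbers:
        raise ValueError("a category needs at least one audio frame")
    members = [features[n] for n in frame_numbers]
    dim = len(members[0])
    mean = [sum(vec[d] for vec in members) / len(members) for d in range(dim)]
    return CategoryIdentifier(min(frame_numbers), max(frame_numbers), mean)
```

With such an identifier, a search could compare a query feature vector against one mean vector per category instead of against every frame, which is one plausible way to use the clustering result.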

With reference to the first aspect, in one possible implementation manner, the element a_ij in the distance matrix is the similarity distance between the audio feature vector corresponding to the i-th audio frame and the audio feature vector corresponding to the j-th audio frame in the N audio frames, wherein i and j are positive integers which are greater than 0 and less than or equal to N.

With reference to the first aspect, in a possible implementation manner, before the obtaining the distance matrix corresponding to the target audio data, the method further includes: dividing the target audio data into the N audio frames; acquiring an audio feature vector of each audio frame in the N audio frames to obtain N audio feature vectors, wherein the audio feature vectors are Mel frequency spectrum feature vectors; calculating the similarity distance between every two audio feature vectors in the N audio feature vectors to obtain N × (N-1) distance values; and constructing a symmetric matrix with a main diagonal of 0 according to the N × (N-1) distance values, and determining the symmetric matrix as the distance matrix.

In a second aspect, an audio frame clustering apparatus is provided, including:

a matrix acquisition module, configured to acquire a distance matrix corresponding to target audio data, where the target audio data includes N audio frames, the distance matrix is a symmetric matrix, elements on a main diagonal in the distance matrix are 0, other elements except the elements on the main diagonal in the distance matrix are used for representing the distance between audio feature vectors corresponding to every two audio frames in the N audio frames, and N is a positive integer greater than or equal to 2;

a first determining module, configured to determine a first seed element from the distance matrix, determine, with the first seed element as a starting point of region growth, each element that satisfies a region growth condition in the distance matrix as a first element, determine, as a first category, a category of an audio frame corresponding to each element in the first element, and determine, as the first category, a category of an audio frame corresponding to the first seed element, where an absolute difference between horizontal and vertical coordinates of the first seed element mapped onto a two-dimensional coordinate system is equal to 1;

a difference set obtaining module, configured to obtain a difference set between a set of elements included in the distance matrix and a first set, where the first set includes the first seed element and the first element;

a second determining module, configured to determine a second seed element from the difference set, determine, with the second seed element as a starting point of region growth, each element in the distance matrix that meets the region growth condition as a second element, determine, as a second category, a category of an audio frame corresponding to each element in the second element, and determine, as the second category, a category of an audio frame corresponding to the second seed element, where an absolute difference between horizontal and vertical coordinates of the second seed element mapped onto the two-dimensional coordinate system is equal to 1, and the second seed element is smaller than the first seed element.

With reference to the second aspect, in a possible implementation manner, the distance matrix includes N × N elements, and the first determining module is specifically configured to map the N × N elements in the distance matrix onto the two-dimensional coordinate system to obtain N × N coordinate points corresponding to the N × N elements in the distance matrix, where one element corresponds to one coordinate point, a first element in the distance matrix is mapped to a coordinate origin of the two-dimensional coordinate system, and distances between two adjacent coordinate points mapped to the two-dimensional coordinate system by every two adjacent elements in the distance matrix are equal; determining a seed coordinate point with the absolute difference value between horizontal and vertical coordinates equal to 1 from the N x N coordinate points, and determining an element corresponding to the seed coordinate point as the first seed element; determining elements in the distance matrix, which belong to the neighborhood of the first seed element and are smaller than a target threshold value, as the first elements, and determining the categories of the two audio frames corresponding to each element in the first elements as the first categories until no elements in the distance matrix, which belong to the neighborhood of the first seed element and are smaller than the target threshold value, exist; and determining elements in the distance matrix, which belong to the neighborhood of each element in the first elements and are smaller than the target threshold value, as third elements, and determining the categories of the two audio frames corresponding to each element in the third elements as the first category.

With reference to the second aspect, in a possible implementation manner, the apparatus further includes: a third determining module, configured to determine, if K fourth elements whose absolute differences between horizontal and vertical coordinates mapped to the two-dimensional coordinate system are greater than 1 exist in the distance matrix, and the K fourth elements belong to neighborhoods of the first seed element and the second seed element, a category of an audio frame corresponding to each fourth element in the K fourth elements, to obtain K third categories, where a category of an audio frame corresponding to a fourth element is a third category, and K is a positive integer greater than or equal to 1.

With reference to the second aspect, in a possible implementation manner, the apparatus further includes: and the sequence determining module is used for generating an audio frame sequence corresponding to the target audio data according to the time sequence of each audio frame belonging to the first category, each audio frame belonging to the second category and each audio frame belonging to each of the K third categories in the N audio frames.

With reference to the second aspect, in a possible implementation manner, the apparatus further includes: the identification determining module is used for acquiring a first frame number of an audio frame with the minimum playing time in the audio frames belonging to the first category in the N audio frames; acquiring a second frame number of an audio frame with the maximum playing time in each audio frame belonging to the first category in the N audio frames; calculating the average value of the audio characteristic vectors corresponding to the audio frames belonging to the first category in the N audio frames; and generating a category identifier of the first category according to the first frame number, the second frame number and the mean value, wherein the category identifier of the first category comprises the first frame number, the second frame number and the mean value.

With reference to the second aspect, in one possible implementation manner, the element a_ij in the distance matrix is the similarity distance between the audio feature vector corresponding to the i-th audio frame and the audio feature vector corresponding to the j-th audio frame in the N audio frames, wherein i and j are positive integers which are greater than 0 and less than or equal to N.

With reference to the second aspect, in a possible implementation manner, the apparatus further includes: a matrix construction module, configured to divide the target audio data into the N audio frames; acquire an audio feature vector of each audio frame in the N audio frames to obtain N audio feature vectors, wherein the audio feature vectors are Mel frequency spectrum feature vectors; calculate the similarity distance between every two audio feature vectors in the N audio feature vectors to obtain N × (N-1) distance values; and construct a symmetric matrix with a main diagonal of 0 according to the N × (N-1) distance values, and determine the symmetric matrix as the distance matrix.

In a third aspect, an electronic device is provided, which includes a processor, a memory, and an input/output interface, where the processor, the memory, and the input/output interface are connected to each other, where the input/output interface is used for inputting or outputting data, the memory is used for storing application program codes for the electronic device to execute the method, and the processor is configured to execute the audio frame clustering method of the first aspect.

In a fourth aspect, there is provided a computer-readable storage medium storing a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the audio frame clustering method of the first aspect.

In the embodiment of the application, a distance matrix corresponding to target audio data is obtained; determining a first seed element from the distance matrix, determining each element meeting the region growing condition in the distance matrix as a first element by taking the first seed element as a starting point of region growing, determining the category of an audio frame corresponding to each element in the first element as a first category, and determining the category of the audio frame corresponding to the first seed element as a first category; acquiring a difference set between an element set and a first set included in the distance matrix; and determining a second seed element from the difference set, determining each element meeting the region growing condition in the distance matrix as a second element by taking the second seed element as a starting point of region growing, determining the category of the audio frame corresponding to each element in the second element as a second category, and determining the category of the audio frame corresponding to the second seed element as a second category. When all audio frame information of the target audio data is reserved, the audio frames with similar audio feature vectors are determined to be of the same type, and when the audio frames are searched, the searching efficiency can be improved; the elements corresponding to the seed elements are determined by the region growing method, and because only the elements meeting the conditions need to be searched in the neighborhood of the seed elements, each audio frame included in the target audio data does not need to be searched, and the searched similar audio frames are clustered, so that the searching and clustering efficiency of the audio frames can be improved.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.

Fig. 1 is a schematic flowchart of an audio frame clustering method according to an embodiment of the present application;

Fig. 2 is a schematic diagram of a distance matrix provided by an embodiment of the present application;

Fig. 3 is a schematic flowchart of a distance matrix construction method provided in an embodiment of the present application;

Fig. 4 is a schematic flowchart of a method for determining a first seed element and a first category provided in an embodiment of the present application;

Fig. 5 is a schematic diagram of a mapping relationship between a distance matrix and a two-dimensional coordinate system provided in an embodiment of the present application;

Fig. 6 is a schematic diagram of a neighborhood of a first seed element provided by an embodiment of the present application;

Fig. 7 is a schematic diagram of the upper triangular elements of a distance matrix provided by an embodiment of the present application;

Fig. 8 is a schematic flowchart of another audio frame clustering method provided in the embodiments of the present application;

Fig. 9 is a schematic structural diagram of an audio frame clustering apparatus according to an embodiment of the present application;

Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Referring to fig. 1, fig. 1 is a schematic flowchart of an audio frame clustering method according to an embodiment of the present application. As shown in fig. 1, the audio frame clustering method may include the steps of:

S101, obtaining a distance matrix corresponding to the target audio data.

The target audio data comprises N audio frames, the distance matrix is a symmetric matrix, elements on the main diagonal in the distance matrix are 0, other elements except the elements on the main diagonal in the distance matrix are used for representing the distance between the audio feature vectors corresponding to every two audio frames in the N audio frames, and N is a positive integer greater than or equal to 2. The target audio data may be, for example, audio data corresponding to a song, audio data corresponding to a music piece, or the like.

Here, the distance matrix includes N × N elements. The element a_ij in the distance matrix is the similarity distance between the audio feature vector corresponding to the i-th audio frame and the audio feature vector corresponding to the j-th audio frame in the N audio frames, where i and j are both positive integers greater than 0 and less than or equal to N. Since the distance matrix is a symmetric matrix, a_ij = a_ji, where a_ji is the similarity distance between the audio feature vector corresponding to the j-th audio frame and the audio feature vector corresponding to the i-th audio frame in the N audio frames. As shown in fig. 2, fig. 2 is a schematic diagram of a distance matrix provided in an embodiment of the present application. As shown in 2a of fig. 2, the distance matrix includes N × N elements, and the elements on the main diagonal are a_ii. Each a_ii represents the distance between the audio feature vectors corresponding to two identical audio frames, so its value is 0; that is, the distance matrix can also be shown as 2b in fig. 2.

In this embodiment of the present application, the distance matrix may be constructed in advance, before the distance matrix corresponding to the target audio data is obtained. A specific method for constructing the distance matrix may be as shown in fig. 3, where fig. 3 is a schematic flowchart of the distance matrix construction method provided in this embodiment of the present application, and includes the following steps:

S1, dividing the target audio data into N audio frames.

The discrete audio data may then be filtered through a digital filter having the transfer function H(z) = 1 - αz⁻¹, which increases the high-frequency resolution of the audio data. Here α is a pre-emphasis coefficient, and α is greater than 0.9 and less than 1.
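In the time domain, the transfer function H(z) = 1 - αz⁻¹ corresponds to the difference equation y[n] = x[n] - α·x[n-1]. The following Python sketch illustrates this filter; the function name `pre_emphasis`, the default α = 0.97, and the treatment of the sample before the first one as 0 are assumptions made for illustration and are not taken from the application.

```python
from typing import List, Sequence


def pre_emphasis(samples: Sequence[float], alpha: float = 0.97) -> List[float]:
    """Apply y[n] = x[n] - alpha * x[n-1] to a sequence of audio samples.

    alpha is the pre-emphasis coefficient; the text requires 0.9 < alpha < 1.
    The default of 0.97 is only an illustrative assumption.
    """
    if not 0.9 < alpha < 1.0:
        raise ValueError("alpha must be greater than 0.9 and less than 1")
    if not samples:
        return []
    # Assumption: the sample before the first one is treated as 0.
    out = [float(samples[0])]
    for n in range(1, len(samples)):
        out.append(samples[n] - alpha * samples[n - 1])
    return out
```

A constant (low-frequency) input is strongly attenuated after the first sample, while rapid sample-to-sample changes pass through almost unchanged, which is the high-frequency boost described above.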

Optionally, after the target audio data is divided into N audio frames, noise and interference in the audio frames may also be rejected through endpoint detection. The endpoint detection can be performed by energy-based endpoint detection, information entropy-based endpoint detection, or band variance-based endpoint detection.

S2, obtaining the audio feature vector of each audio frame in the N audio frames to obtain N audio feature vectors, wherein the audio feature vectors are Mel frequency spectrum feature vectors.

Here, one audio frame corresponds to one audio feature vector. In specific implementation, a Short-time Fourier Transform (STFT) may be performed on audio data corresponding to each of the N audio frames to obtain a mel-frequency spectrum feature vector of each of the N audio frames, that is, an audio feature vector of each of the N audio frames, so as to obtain N audio feature vectors corresponding to the N audio frames.

S3, calculating the similarity distance between every two audio feature vectors in the N audio feature vectors to obtain N × (N-1) distance values.

Here, the similarity distance between two audio feature vectors, i.e., the degree of similarity between the two audio feature vectors, may be calculated by using an existing similarity calculation method, such as the Euclidean distance, the cosine distance, or the city-block (Manhattan) distance. It is understood that the same similarity calculation method should be used for every two audio feature vectors in the N audio feature vectors, for example, the Euclidean distance is used for all of them, or the cosine distance is used for all of them, and so on.
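The distance measures named above can be sketched in Python as follows. The function names are hypothetical, and the cosine distance is taken here as 1 minus the cosine similarity, which is a common convention but an assumption with respect to the application.

```python
import math
from typing import Sequence


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Square root of the sum of squared component differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Sum of absolute component differences (city-block distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity (assumed convention); undefined for zero vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (norm_u * norm_v)
```

All three functions are symmetric in their arguments and return 0 for identical vectors, which matches the symmetric, zero-diagonal distance matrix described in the text.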

It can be known that the distance value obtained by calculating the similarity distance between the audio feature vector corresponding to audio frame 1 and the audio feature vector corresponding to audio frame 2 is equal to the distance value obtained by calculating the similarity distance between the audio feature vector corresponding to audio frame 2 and the audio feature vector corresponding to audio frame 1, where audio frame 1 and audio frame 2 are two different audio frames. Counting both orders of every pair of different audio frames, N × (N-1) distance values can be obtained.

S4, constructing a symmetric matrix with a main diagonal of 0 according to the N × (N-1) distance values, and determining the symmetric matrix as the distance matrix.

Here, the distance matrix with a main diagonal of 0 constructed from the N × (N-1) distance values may be as shown in 2b of fig. 2.
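Steps S3 and S4 can be sketched in Python as follows. The function name `build_distance_matrix` is hypothetical, the Euclidean distance is chosen only as one of the permitted measures, and the feature vectors are assumed to be already available as plain lists of numbers.

```python
import math
from typing import List, Sequence


def build_distance_matrix(features: Sequence[Sequence[float]]) -> List[List[float]]:
    """Build an N x N symmetric distance matrix with a zero main diagonal.

    Each unordered pair of frames is computed once and written to both
    a[i][j] and a[j][i], which yields the N * (N - 1) off-diagonal values.
    """
    n = len(features)
    if n < 2:
        raise ValueError("N must be greater than or equal to 2")
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(features[i], features[j])  # Euclidean distance
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix
```

Because the matrix is symmetric, only N × (N-1)/2 distances are actually evaluated in this sketch; the remaining half is filled in by mirroring.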

S102, determining a first seed element from the distance matrix, determining each element meeting the region growing condition in the distance matrix as a first element by taking the first seed element as a starting point of region growing, determining the category of the audio frame corresponding to each element in the first element as a first category, and determining the category of the audio frame corresponding to the first seed element as a first category.

Here, the absolute difference between the horizontal and vertical coordinates of the first seed element mapped onto the two-dimensional coordinate system is equal to 1. Region growing refers to a process of first determining a starting point of the region growing (i.e., the first seed element) and then clustering the elements which are in the neighborhood of the starting point and have the same or similar characteristics, that is, a process of first determining an audio frame and then clustering the audio frames which are in the neighborhood of that audio frame and have the same or similar characteristics as that audio frame. For example, the starting point of the region growing is audio frame d1, and the starting point d1 has feature a; all elements (audio frames) which are in the neighborhood of d1 and have feature a or have features similar to feature a are clustered to obtain category a. All elements (audio frames) in category a can then be found by subsequently searching category a, and all found elements (audio frames) have feature a or have features similar to feature a. Here, an audio frame satisfying the region growing condition has the same or similar characteristics as the audio frame corresponding to the starting point (i.e., the first seed element) of the region growing. In other words, the elements satisfying the region growing condition are the elements that belong to the neighborhood of the first seed element and are smaller than the target threshold.
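The region growing described above can be sketched in Python as a breadth-first expansion from the seed. This is only an illustrative sketch: the function name `region_grow` is hypothetical, indices are 0-based, the neighborhood is assumed to be the four directly adjacent elements, and growth is assumed to stay in the upper triangle of the symmetric matrix; the application may define the neighborhood differently.

```python
from collections import deque
from typing import List, Set, Tuple

Cell = Tuple[int, int]


def region_grow(
    matrix: List[List[float]], seed: Cell, threshold: float
) -> Tuple[Set[Cell], Set[int]]:
    """Grow a region from `seed` over elements smaller than `threshold`.

    Returns the set of matrix cells in the region and the set of frame
    indices (row and column of every cell) assigned to the category.
    """
    n = len(matrix)
    region: Set[Cell] = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        # Assumption: 4-neighborhood (up, down, left, right).
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if not (0 <= nr < n and 0 <= nc < n):
                continue
            if nr >= nc:  # assumption: stay strictly above the main diagonal
                continue
            if (nr, nc) in region:
                continue
            if matrix[nr][nc] < threshold:
                region.add((nr, nc))
                queue.append((nr, nc))
    # Each element a[r][c] relates two frames, r and c; both join the category.
    frames = {index for cell in region for index in cell}
    return region, frames
```

Only the neighbors of elements already in the region are inspected, so the number of comparisons depends on the size of the grown region rather than on the total number of matrix elements.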

Specifically, a method for determining a first seed element and a first category in a distance matrix may be as shown in fig. 4, where fig. 4 is a flowchart of the method for determining a first seed element and a first category provided in the embodiment of the present application, and includes the following steps:

S1021, mapping the N × N elements in the distance matrix to the two-dimensional coordinate system to obtain N × N coordinate points corresponding to the N × N elements in the distance matrix.

Here, one element corresponds to one coordinate point, the first element in the distance matrix is mapped to the origin of coordinates of the two-dimensional coordinate system, and the distance between two adjacent coordinate points mapped to the two-dimensional coordinate system by every two adjacent elements in the distance matrix is equal. The mapping of N × N elements in the distance matrix to the two-dimensional coordinate system to obtain N × N coordinate points corresponding to the N × N elements in the distance matrix may be as shown in fig. 5, where fig. 5 is a schematic diagram of a mapping relationship between the distance matrix and the two-dimensional coordinate system provided in the embodiment of the present application.

As shown in fig. 5, the element a₁₁ in the distance matrix is mapped to the coordinate origin b₁₁(0,0) of the two-dimensional coordinate system, the element a₁₂ is mapped to the coordinate point b₁₂(0,1), the element a₁₃ is mapped to the coordinate point b₁₃(0,2), the element a₁₄ is mapped to the coordinate point b₁₄(0,3), and so on. From this, the coordinate point on the two-dimensional coordinate system to which each element in the distance matrix is mapped can be obtained. Here b_ij is a coordinate point on the two-dimensional coordinate system, and the distances from b_ij to its adjacent coordinate points b_(i+1)j, b_(i-1)j, b_i(j+1) and b_i(j-1) are all equal. For example, when i = 1 and j = 1, the distance between b₁₁ and the adjacent coordinate point b₁₂ is equal to the distance between b₁₁ and the adjacent coordinate point b₂₁; this distance may be 1 or 2, etc. The distance between two adjacent coordinate points in fig. 5 is 1. As shown in fig. 5, the coordinates of the coordinate point b₁₁ to which the element a₁₁ in the distance matrix is mapped are (0,0); the coordinates of the coordinate point b₁₂ to which the element a₁₂ is mapped are (0,1); the coordinates of the coordinate point b₁₃ to which the element a₁₃ is mapped are (0,2), and so on. From this, the coordinates of the coordinate point on the two-dimensional coordinate system to which each element in the distance matrix is mapped can be obtained.
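The mapping of step S1021 can be illustrated with a short Python sketch that follows the convention of the examples above (a₁₁ to (0,0), a₁₂ to (0,1), a₁₃ to (0,2)). The function names and the `spacing` parameter are hypothetical; `spacing` stands for the equal distance between adjacent coordinate points, which is 1 in fig. 5.

```python
from typing import Dict, List, Tuple


def element_to_point(i: int, j: int, spacing: int = 1) -> Tuple[int, int]:
    """Map the 1-based matrix indices (i, j) of a_ij to the point b_ij."""
    if i < 1 or j < 1:
        raise ValueError("i and j are positive integers")
    return ((i - 1) * spacing, (j - 1) * spacing)


def map_matrix(matrix: List[List[float]]) -> Dict[Tuple[int, int], Tuple[int, int]]:
    """Return a dict from 1-based (i, j) to the coordinates of b_ij."""
    return {
        (i + 1, j + 1): element_to_point(i + 1, j + 1)
        for i in range(len(matrix))
        for j in range(len(matrix[i]))
    }
```

Under this mapping, the absolute difference between the two coordinates of b_ij equals |i - j| times the spacing, so the points with an absolute difference of 1 in fig. 5 are those immediately next to the main diagonal.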

S1022, determining a seed coordinate point whose absolute difference between horizontal and vertical coordinates is equal to 1 from the N × N coordinate points, and determining an element corresponding to the seed coordinate point as a first seed element.

Here, since the distance matrix is a symmetric matrix, the upper triangle element of the main diagonal line is equal to the lower triangle element of the main diagonal line, the upper triangle element of the distance matrix is an element above the main diagonal line in the distance matrix, and the lower triangle element of the distance matrix is an element below the main diagonal line in the distance matrix. In order to avoid the repetition of the calculation and reduce the calculation amount, a seed coordinate point having an absolute difference between horizontal and vertical coordinates equal to 1 may be determined only from coordinate points corresponding to elements in the upper triangle of the distance matrix, that is, a seed coordinate point having an absolute difference between horizontal and vertical coordinates equal to 1 may be determined from N × N/2 coordinate points corresponding to N × N/2 elements in the upper triangle of the distance matrix, and an element corresponding to the seed coordinate point may be determined as the first seed element.

Specifically, the coordinate points whose absolute difference between horizontal and vertical coordinates is equal to 1 may be determined among the N × N/2 coordinate points, the minimum element among the elements corresponding to these coordinate points may be determined, and the coordinate point corresponding to the minimum element may be determined as the seed coordinate point. For example, as shown in fig. 5, the horizontal and vertical coordinates of b₁₂ are (1,0), so the absolute difference between the horizontal and vertical coordinates of b₁₂ is |1-0| = 1; the horizontal and vertical coordinates of b₂₃ are (2,1), so the absolute difference for b₂₃ is |2-1| = 1; the horizontal and vertical coordinates of b₃₄ are (3,2), so the absolute difference for b₃₄ is |3-2| = 1. If the element a₁₂ corresponding to b₁₂ is equal to 0.5, the element a₂₃ corresponding to b₂₃ is equal to 0.8, and the element a₃₄ corresponding to b₃₄ is equal to 0.6, then a₁₂ is the minimum element, so the seed coordinate point is b₁₂, and the element a₁₂ corresponding to the seed coordinate point b₁₂ is determined as the first seed element.

Here, a₁₂ represents the similarity distance between the audio feature vector corresponding to the first audio frame and the audio feature vector corresponding to the second audio frame, a₂₃ represents the similarity distance between the audio feature vector corresponding to the second audio frame and the audio feature vector corresponding to the third audio frame, and a₃₄ represents the similarity distance between the audio feature vector corresponding to the third audio frame and the audio feature vector corresponding to the fourth audio frame.
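The seed selection of step S1022 can be sketched as follows. This is only an illustration under stated assumptions: the upper triangle of the distance matrix is represented as a dictionary `upper` that maps 1-based index pairs (i, j) with i < j to the distance a_ij, the helper name `find_seed` is hypothetical, and the values of a₁₃, a₂₄ and a₁₄ are invented to complete the example (only a₁₂, a₂₃ and a₃₄ come from the example above).

```python
def find_seed(upper):
    """Return the index pair of the smallest element whose indices differ by 1.

    `upper` maps 1-based (i, j) with i < j to the distance a_ij, so only the
    upper triangle is searched and main-diagonal elements never appear.
    """
    candidates = {pos: v for pos, v in upper.items() if abs(pos[0] - pos[1]) == 1}
    return min(candidates, key=candidates.get)


upper = {
    (1, 2): 0.5, (2, 3): 0.8, (3, 4): 0.6,   # values from the example above
    (1, 3): 0.9, (2, 4): 0.9, (1, 4): 0.9,   # hypothetical filler values
}
seed = find_seed(upper)
print(seed, upper[seed])  # (1, 2) 0.5
```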

S1023, determining the elements in the distance matrix that belong to the neighborhood of the first seed element and are smaller than the target threshold as first elements, and determining the categories of the two audio frames corresponding to each of the first elements as the first category, until no element that belongs to the neighborhood of the first seed element and is smaller than the target threshold remains in the distance matrix.

In a specific implementation, after the first elements in the neighborhood of the first seed element are determined, the category of the audio frames corresponding to each of the first elements is determined as the first category, and the category of the two audio frames corresponding to the first seed element is also determined as the first category, so that the audio frames in the first category include the two audio frames corresponding to the first seed element and the audio frames corresponding to each of the first elements. Here, the neighborhood of the first seed element may be a 4-neighborhood or an 8-neighborhood. The 4-neighborhood of the first seed element may include the elements at the four positions above, below, to the left of and to the right of the first seed element. Fig. 6 is a schematic neighborhood diagram of a first seed element provided in this embodiment of the present application. As shown in 6a of fig. 6, assuming that the first seed element is a₂₄, the elements in the 4-neighborhood of the first seed element are the four elements marked by the dashed box in 6a of fig. 6, i.e. a₂₃, a₁₄, a₂₅ and a₃₄. The 8-neighborhood of the first seed element may include the elements at the eight positions above, below, left, right, upper left, upper right, lower left and lower right of the first seed element. As shown in 6b of fig. 6, if the first seed element is a₂₄, the elements in the 8-neighborhood of the first seed element are the eight elements marked by the dashed box in 6b of fig. 6, i.e. a₂₃, a₁₄, a₂₅, a₃₄, a₁₃, a₁₅, 0 and a₃₅, where 0 is the value of the main-diagonal element a₃₃.
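The two neighborhood definitions can be illustrated with a short sketch. The function name `neighborhood`, the offset tables and the matrix size of 5 are assumptions made for the example; indices are 1-based to match the a_ij notation.

```python
# Row/column offsets for the 4-neighborhood (up, down, left, right) ...
OFFSETS_4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
# ... and the four extra diagonal offsets that complete the 8-neighborhood.
OFFSETS_8 = OFFSETS_4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def neighborhood(i, j, n, connectivity=4):
    """Return the 1-based index pairs of the neighbors of a_ij in an n x n matrix."""
    offsets = OFFSETS_4 if connectivity == 4 else OFFSETS_8
    return [
        (i + di, j + dj)
        for di, dj in offsets
        if 1 <= i + di <= n and 1 <= j + dj <= n
    ]


# Neighbors of a_24 in a hypothetical 5 x 5 distance matrix.
print(sorted(neighborhood(2, 4, 5, connectivity=4)))
# [(1, 4), (2, 3), (2, 5), (3, 4)]
print(sorted(neighborhood(2, 4, 5, connectivity=8)))
# [(1, 3), (1, 4), (1, 5), (2, 3), (2, 5), (3, 3), (3, 4), (3, 5)]
```

Positions outside the matrix are dropped, so a corner or edge element simply has fewer neighbors.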

The target threshold may be a value such as 0.3, 0.5, 0.8, etc., and the size of the target threshold is not limited in the embodiment of the present application.

Since the upper triangle element of the main diagonal of the distance matrix is equal to the lower triangle element of the main diagonal, in order to avoid repeated operation and reduce calculation amount, the first element can be determined from only N × N/2 elements in the upper triangle of the distance matrix. And searching all elements which belong to the neighborhood of the first seed element and are smaller than the target threshold value from N x N/2 elements in the upper triangle of the distance matrix, and determining the searched elements as the first elements, namely the number of the first elements may be multiple. Each element corresponds to two audio frames since each element is obtained by calculating the distance between the audio feature vectors corresponding to the two audio frames. And determining the categories of the two audio frames corresponding to each first element in the plurality of first elements as first categories until no elements which belong to the neighborhood of the first seed element and are smaller than the target threshold value exist in N x N/2 elements in the upper triangle of the distance matrix.

For example, suppose the neighborhood of the first seed element is the 4-neighborhood, the target threshold is 0.6, and the first seed element is a₂₃. The 4-neighborhood of a₂₃ includes a₁₃, a₂₄, a₂₂ and a₃₃, where a₁₃ is equal to 0.5 and a₂₄ is equal to 0.3. Since a₂₂ and a₃₃ are elements on the main diagonal of the distance matrix, they are excluded, and the first elements are a₁₃ and a₂₄, both of which are smaller than the target threshold.

It will be appreciated that the first elements do not include elements on the main diagonal of the distance matrix: even if an element on the main diagonal belongs to the neighborhood of the first seed element and is smaller than the target threshold, it is not a first element. That is, the first elements are those elements, among the N × N/2 elements in the upper triangle of the distance matrix and excluding the elements on the main diagonal, that belong to the neighborhood of the first seed element and are smaller than the target threshold.
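The selection of the first elements in step S1023 can be sketched as follows, using the numbers of the example with seed a₂₃ and a target threshold of 0.6. The dictionary representation of the upper triangle, the helper name `first_elements` and all values other than a₁₃ = 0.5 and a₂₄ = 0.3 are assumptions made for illustration.

```python
def first_elements(upper, seed, threshold):
    """Return the 4-neighbors of `seed` that lie in the upper triangle
    and whose distance is below `threshold`.

    `upper` maps 1-based (i, j) with i < j to a_ij, so main-diagonal and
    lower-triangle positions are absent and are skipped automatically.
    """
    i, j = seed
    neighbors = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    return sorted(p for p in neighbors if p in upper and upper[p] < threshold)


upper = {
    (1, 3): 0.5, (2, 4): 0.3,                            # values from the example
    (1, 2): 0.7, (1, 4): 0.9, (2, 3): 0.2, (3, 4): 0.8,  # hypothetical values
}
print(first_elements(upper, seed=(2, 3), threshold=0.6))  # [(1, 3), (2, 4)]
```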

S1024, determining the elements in the distance matrix, which belong to the neighborhood of each element in the first elements and are smaller than the target threshold value, as third elements, and determining the categories of the two audio frames corresponding to each element in the third elements as the first categories.

Here, the neighborhood of a first element may be a 4-neighborhood or an 8-neighborhood. The 4-neighborhood of a first element may include the elements at the four positions above, below, to the left of and to the right of the first element, and the 8-neighborhood of a first element may include the elements at the eight positions above, below, left, right, upper left, upper right, lower left and lower right of the first element. Similarly, in order to avoid repeated operations and reduce the amount of computation, the third elements may be determined from only the N × N/2 elements in the upper triangle of the distance matrix: all elements that are in the neighborhood of a first element and are smaller than the target threshold are searched for among the N × N/2 elements in the upper triangle, and the elements found are determined as the third elements. Since there may be a plurality of third elements, the categories of the two audio frames corresponding to each third element are determined as the first category. It will be appreciated that the third elements also do not include elements on the main diagonal of the distance matrix.

The process of determining the first elements and the third elements in the distance matrix is explained by way of example. Fig. 7 is a schematic diagram of the upper triangular elements of a distance matrix provided in an embodiment of the present application: 7a of fig. 7 illustrates the elements included in the distance matrix, and region C in 7b of fig. 7 illustrates the upper triangle elements of the distance matrix. If it is determined that the first seed element is a₂₃ and the neighborhood is the 4-neighborhood, then the 4-neighborhood of the first seed element a₂₃ includes a₁₃, a₂₄, a₂₂ and a₃₃. Suppose the target threshold is 0.8, a₁₃ is equal to 0.6 and a₂₄ is equal to 0.7. Since a₂₂ and a₃₃ are elements on the main diagonal of the distance matrix, they are excluded, i.e. the first elements include a₁₃ and a₂₄, and the category of the two audio frames corresponding to a₁₃ and the category of the two audio frames corresponding to a₂₄ are determined as the first category. It is understood that, apart from a₁₃ and a₂₄, no element in the distance matrix belongs to the neighborhood of the first seed element and is smaller than the target threshold.

The 4-neighborhood of the first element a₁₃ includes a₁₂ and a₁₄, and the 4-neighborhood of the first element a₂₄ includes a₁₄, a₂₅ and a₃₄. If a₁₂, a₁₄, a₂₅ and a₃₄ are equal to 0.4, 0.7, 0.6 and 0.5, respectively, all of them are smaller than the target threshold of 0.8, so the third elements include a₁₂, a₁₄, a₂₅ and a₃₄, and the categories of the two audio frames corresponding to each of a₁₂, a₁₄, a₂₅ and a₃₄ are determined as the first category. That is, the audio frames in the first category include audio frame 1, audio frame 2, audio frame 3, audio frame 4 and audio frame 5. If all the elements in the neighborhoods of the third elements a₁₂, a₁₄, a₂₅ and a₃₄ are greater than the target threshold, all audio frames belonging to the first category have been determined. If an element in the neighborhood of a third element is smaller than the target threshold, the category of the two audio frames corresponding to that element is determined as the first category, and so on until no element that is in the neighborhood of a third element and smaller than the target threshold remains. Thus, the first seed element and the first category can be determined, and the audio frames belonging to the first category can be determined.
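The worked example of fig. 7 can be reproduced with a small breadth-first region-growing sketch. It is offered as one possible reading of steps S1023 and S1024, not as the applicant's implementation: the names `grow_region` and `frames_of`, the dictionary representation of the upper triangle, and the values assigned to a₂₃, a₁₅, a₃₅ and a₄₅ are assumptions.

```python
from collections import deque


def grow_region(upper, seed, threshold):
    """Grow a region from `seed` over 4-neighbors whose distance is below
    `threshold`, restricted to the upper triangle stored in `upper`."""
    region = {seed}
    queue = deque([seed])
    while queue:
        i, j = queue.popleft()
        for p in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            if p in upper and p not in region and upper[p] < threshold:
                region.add(p)
                queue.append(p)
    return region


def frames_of(region):
    """Each element a_ij involves audio frames i and j; collect them all."""
    return sorted({frame for pos in region for frame in pos})


upper = {
    (1, 3): 0.6, (2, 4): 0.7,                            # first elements in the example
    (1, 2): 0.4, (1, 4): 0.7, (2, 5): 0.6, (3, 4): 0.5,  # third elements in the example
    (2, 3): 0.3, (1, 5): 0.9, (3, 5): 0.9, (4, 5): 0.9,  # hypothetical values
}
region = grow_region(upper, seed=(2, 3), threshold=0.8)
print(frames_of(region))  # [1, 2, 3, 4, 5]
```

The queue visits the seed first, then the first elements, then the third elements, which mirrors the order in which the text introduces them.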

S103, acquiring a difference set between an element set and a first set included in the distance matrix, wherein the first set includes a first seed element and a first element.

Here, the difference set refers to the difference between two sets: the difference set between the set of elements included in the distance matrix and the first set consists of the elements that belong to the set of elements included in the distance matrix and do not belong to the first set. For example, if the distance matrix includes a₁₁, a₁₂, a₁₃, …, a_1n and the first set includes the 8 elements a₁₁ to a₁₈, then the difference set between the set of elements included in the distance matrix and the first set is a₁₉ to a_1n, i.e. the difference set represents the elements of the distance matrix excluding the elements included in the first set.
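The difference set of step S103 corresponds directly to a set difference. The sketch below mirrors the example above with index pairs standing in for the elements; the value n = 12 is a hypothetical choice, since the example leaves n open.

```python
n = 12  # hypothetical matrix width; the example only requires n > 8

# Index pairs standing in for the elements a_11 ... a_1n of the example.
all_elements = {(1, j) for j in range(1, n + 1)}
# The first set of the example: the 8 elements a_11 ... a_18.
first_set = {(1, j) for j in range(1, 9)}

# Elements of the distance matrix that do not belong to the first set.
difference = all_elements - first_set
print(sorted(difference))  # [(1, 9), (1, 10), (1, 11), (1, 12)]
```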

S104, determining a second seed element from the difference set, determining each element meeting the region growing condition in the distance matrix as a second element by taking the second seed element as a starting point of region growing, determining the category of the audio frame corresponding to each element in the second elements as a second category, and determining the category of the audio frame corresponding to the second seed element as the second category.

In a specific implementation, after the second element in the neighborhood of the second seed element is determined, the category of the audio frame corresponding to each element in the second element is determined as a second category, and the categories of the two audio frames corresponding to the second seed element are determined as second categories, so that the audio frames in the second category include the two audio frames corresponding to the second seed element and the audio frame corresponding to each element in the second element. Here, an absolute difference between horizontal and vertical coordinates of the second seed element mapped on the two-dimensional coordinate system is equal to 1, the second seed element is smaller than the first seed element, and the category of the audio frame corresponding to the second seed element is the second category. Referring to the method for determining the first seed element and the first category from the distance matrix in step S102, the step of determining the second seed element and the second category from the difference set is as follows:

a. Mapping the elements in the difference set to the two-dimensional coordinate system to obtain the coordinate points corresponding to the elements in the difference set, wherein one element corresponds to one coordinate point, the first element in the distance matrix is mapped to the coordinate origin of the two-dimensional coordinate system, and the distance between the two adjacent coordinate points to which every two adjacent elements in the distance matrix are mapped is equal.

b. Determining a seed coordinate point whose absolute difference between horizontal and vertical coordinates is equal to 1 from the coordinate points corresponding to the elements in the difference set, and determining the element corresponding to the seed coordinate point as the second seed element.

c. Determining the elements in the difference set that belong to the neighborhood of the second seed element and are smaller than the target threshold as second elements, and determining the categories of the two audio frames corresponding to each of the second elements as the second category, until no element that belongs to the neighborhood of the second seed element and is smaller than the target threshold remains in the distance matrix.

d. Determining the elements in the distance matrix that belong to the neighborhood of each of the second elements and are smaller than the target threshold as fifth elements, and determining the categories of the two audio frames corresponding to each of the fifth elements as the second category.

Thus, the second seed element and the second category can be determined by the method of steps a to d in step S104, and each audio frame belonging to the second category can be determined. Steps S102 to S104 are performed in a loop until all seed elements in the distance matrix and the elements belonging to the neighborhoods of the seed elements are determined.
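The loop over steps S102 to S104 might be sketched as follows. Several choices here are assumptions rather than statements of the application: the upper triangle is stored in a dictionary `upper`, the remaining candidate with the smallest value is picked as the next seed, only candidates below the target threshold are used as seeds, and each region grows only over elements still in the difference set. The names `grow` and `cluster` and the 6-frame example data are hypothetical.

```python
from collections import deque


def grow(upper, seed, threshold, allowed):
    """Region growing over 4-neighbors, restricted to the positions in `allowed`."""
    region, queue = {seed}, deque([seed])
    while queue:
        i, j = queue.popleft()
        for p in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            if p in allowed and p not in region and upper[p] < threshold:
                region.add(p)
                queue.append(p)
    return region


def cluster(upper, threshold):
    """Repeat seed selection and region growing on the shrinking difference set."""
    remaining = set(upper)  # difference set, initially the whole upper triangle
    regions = []
    while True:
        # Candidate seeds: indices differ by 1 and (assumption) value below threshold.
        seeds = [p for p in remaining if p[1] - p[0] == 1 and upper[p] < threshold]
        if not seeds:
            break
        seed = min(seeds, key=lambda p: (upper[p], p))  # ties broken by position
        region = grow(upper, seed, threshold, remaining)
        regions.append(region)
        remaining -= region  # new difference set
    return regions, remaining


n = 6  # hypothetical number of audio frames
upper = {(i, j): 0.9 for i in range(1, n + 1) for j in range(i + 1, n + 1)}
upper.update({(1, 2): 0.1, (1, 3): 0.2, (2, 3): 0.15,
              (4, 5): 0.1, (4, 6): 0.2, (5, 6): 0.12})

regions, leftover = cluster(upper, threshold=0.5)
categories = [sorted({f for pos in r for f in pos}) for r in regions]
print(categories)  # [[1, 2, 3], [4, 5, 6]]
```

The elements left in `leftover` belong to no grown region; how they are handled is outside the scope of this sketch.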

In a possible case, after determining all the seed elements in the distance matrix and the elements in the neighborhood of the seed elements, K fourth elements whose absolute difference between horizontal and vertical coordinates mapped on the two-dimensional coordinate system is greater than 1 also exist in the distance matrix, and the K fourth elements belong to the neighborhoods of the first seed element and the second seed element, the category of the audio frame corresponding to each element in the K fourth elements is determined, and K third categories are obtained. The audio frame corresponding to a fourth element is of a third category, and K is a positive integer greater than or equal to 1.

For example, after the first seed element and the second seed element in the distance matrix are determined, if 3 fourth elements are present in the distance matrix and are respectively the fourth element y1, the fourth element y2, and the fourth element y3, and the 3 fourth elements belong to the neighborhood of the first seed element and the second seed element, the category of the audio frame corresponding to each element in the 3 fourth elements is determined, so as to obtain 3 third categories, that is, the third category 1, the third category 2, and the third category 3, the category of the two audio frames corresponding to the fourth element y1 is determined as the third category 1, the category of the two audio frames corresponding to the fourth element y2 is determined as the third category 2, and the category of the two audio frames corresponding to the fourth element y3 is determined as the third category 3.

In the embodiment of the present application, elements in an upper triangle of a distance matrix are searched to determine a first seed element, a first element, a second element, a third element, a second seed element, a fourth element, and so on, and elements in a lower triangle of the distance matrix may also be searched to determine the first seed element, the first element, the second element, the third element, the second seed element, the fourth element, and so on, and none of the first seed element, the first element, the second element, the third element, the second seed element, and the fourth element includes an element on a main diagonal line in the distance matrix.

In a possible situation, after the category of each of the N audio frames included in the target audio data is determined, an audio frame sequence corresponding to the target audio data may be generated according to a time sequence of each of the N audio frames belonging to the first category, each of the N audio frames belonging to the second category, and each of the N audio frames belonging to each of the K third categories.

For example, when K is 3, the first category is category a, the second category is category B, the third category is category C1, category C2, and category C3, where the playback time of each audio frame of category a is less than the playback time of each audio frame of category B, the playback time of each audio frame of category B is less than the playback time of each audio frame of category C1, the playback time of each audio frame of category C1 is less than the playback time of each audio frame of category C2, and the playback time of each audio frame of category C2 is less than the playback time of each audio frame of category C3, that is, the audio frame sequence corresponding to the generated target audio data is (a, B, C1, C2, C3).
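Ordering the categories by playing time can be sketched as below. The frame numbers assigned to each category are hypothetical, and the sketch assumes that a smaller frame number means an earlier playing time.

```python
# Hypothetical assignment of frame numbers to the five categories of the example.
categories = {
    "C2": [12, 13],
    "A": [1, 2, 3],
    "C1": [9, 10, 11],
    "B": [4, 5, 6, 7, 8],
    "C3": [14],
}

# Sort the category names by the earliest frame each category contains.
sequence = sorted(categories, key=lambda name: min(categories[name]))
print(sequence)  # ['A', 'B', 'C1', 'C2', 'C3']
```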

In the embodiment of the application, a distance matrix corresponding to target audio data is obtained; a first seed element is determined from the distance matrix, each element meeting the region growing condition in the distance matrix is determined as a first element by taking the first seed element as a starting point of region growing, and the category of the audio frame corresponding to each element in the first elements is determined as a first category; a difference set between the element set included in the distance matrix and a first set is acquired; and a second seed element is determined from the difference set, each element meeting the region growing condition in the distance matrix is determined as a second element by taking the second seed element as a starting point of region growing, and the category of the audio frame corresponding to each element in the second elements is determined as a second category. Because the elements associated with each seed element are determined by region growing, audio frames with similar audio feature vectors are assigned to the same category while all audio frame information of the target audio data is retained, so that the search efficiency can be improved when audio frames are searched.

In a possible implementation manner, after determining each audio frame belonging to the first category in the N audio frames, each audio frame belonging to the second category in the N audio frames, and each audio frame belonging to each category in the K third categories in the N audio frames, the category identifier of the first category, the category identifier of the second category, and the category identifier of each third category in the K third categories may also be generated, and the category corresponding to the category identifier may be quickly found by finding the category identifier, so that each audio frame belonging to the category is found. Referring to fig. 8, fig. 8 is a schematic flowchart of another audio frame clustering method provided in the embodiment of the present application, and as shown in fig. 8, the method includes the following steps:

S201, obtaining a distance matrix corresponding to the target audio data.

S202, determining a first seed element from the distance matrix, determining each element meeting the region growing condition in the distance matrix as a first element by taking the first seed element as a starting point of region growing, determining the category of the audio frame corresponding to each element in the first element as a first category, and determining the category of the audio frame corresponding to the first seed element as a first category.

S203, a difference set between the element set included in the distance matrix and a first set is obtained, wherein the first set includes a first seed element and a first element.

S204, determining a second seed element from the difference set, determining each element meeting the region growing condition in the distance matrix as a second element by taking the second seed element as a starting point of region growing, determining the category of the audio frame corresponding to each element in the second elements as a second category, and determining the category of the audio frame corresponding to the second seed element as the second category.

Here, the specific implementation manner of steps S201 to S204 may refer to the description of steps S101 to S104 in the embodiment corresponding to fig. 1, and is not described herein again.

S205, a first frame number of an audio frame with the minimum playing time in each audio frame belonging to the first category in the N audio frames is obtained.

Here, since each of the N audio frames has a corresponding playing time, the frame number of the audio frame with the smallest playing time among the audio frames belonging to the first category is acquired, and the frame number of the audio frame is determined as the first frame number. The playing time of the audio frames in the first category is continuous. For example, if the audio frame with the smallest playing time among the audio frames belonging to the first category is the first audio frame in the first category, i.e., audio frame 1, the first frame number is 1.

S206, acquiring a second frame number of the audio frame with the maximum playing time in the audio frames belonging to the first category in the N audio frames.

Similarly, the frame number of the audio frame with the largest playing time in the audio frames belonging to the first category is obtained, and the frame number of the audio frame is determined as the second frame number. For example, if the audio frame with the largest playing time among the audio frames belonging to the first category is the sixth audio frame in the first category, i.e., audio frame 6, the second frame number is 6.

S207, calculating the average value of the audio feature vectors corresponding to the audio frames belonging to the first category in the N audio frames.

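Here, the mean of the audio feature vectors corresponding to n audio frames may be calculated by formula (1-1):

$$\left(\frac{x_1 + x_2 + \dots + x_n}{n},\ \frac{y_1 + y_2 + \dots + y_n}{n},\ \frac{z_1 + z_2 + \dots + z_n}{n}\right) \qquad (1\text{-}1)$$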

Where n is the number of audio frames, (x1, y1, z1) represents the audio feature vector corresponding to the first audio frame, (x2, y2, z2) represents the audio feature vector corresponding to the second audio frame, (xn, yn, zn) represents the audio feature vector corresponding to the nth audio frame, and so on.

It can be known that, in the case that the first frame number is 1 and the second frame number is 6, since the playing time of the audio frames in the first category is continuous, the number of audio frames in the first category is 6. For example, if the audio feature vectors corresponding to the 6 audio frames are (1,1,1), (2,2,2), (3,3,3), (7,7,7), (8,8,8) and (9,9,9), the calculated mean is then:

$$\left(\frac{1+2+3+7+8+9}{6},\ \frac{1+2+3+7+8+9}{6},\ \frac{1+2+3+7+8+9}{6}\right) = (5, 5, 5)$$

S208, generating a category identifier of the first category according to the first frame number, the second frame number and the mean of the audio feature vectors corresponding to the audio frames belonging to the first category in the N audio frames.

Here, the category identifier of the first category includes the first frame number, the second frame number, and the mean of the audio feature vectors corresponding to the audio frames belonging to the first category among the N audio frames. The category identifier of the first category is used to uniquely indicate the first category, i.e. to uniquely indicate the category of each audio frame in the first category. For example, the category identifier of the first category may be the pinyin of the first category, an abbreviation of the first category, a unique number of the first category, or a combination of the frame number of the audio frame with the minimum playing time, the frame number of the audio frame with the maximum playing time, and the mean of the audio feature vectors corresponding to the audio frames in the first category. For example, the category identifier of the first category generated according to the first frame number, the second frame number and the mean of the audio feature vectors corresponding to the audio frames belonging to the first category among the N audio frames is the triple [1, 6, (5,5,5)].
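Steps S205 to S208 can be sketched as follows. The function name `category_identifier` and the dictionary `features` are hypothetical; the six feature vectors are those of the example for step S207, taking the third vector as (3,3,3), and the sketch assumes that frame numbers increase with playing time.

```python
def category_identifier(frames, features):
    """Build [first frame number, last frame number, mean feature vector].

    `frames` holds the frame numbers of one category and `features` maps a
    frame number to its audio feature vector.
    """
    first, last = min(frames), max(frames)
    vectors = [features[f] for f in frames]
    # Component-wise mean: zip(*vectors) groups all x, all y and all z values.
    mean = tuple(sum(component) / len(vectors) for component in zip(*vectors))
    return [first, last, mean]


features = {
    1: (1, 1, 1), 2: (2, 2, 2), 3: (3, 3, 3),
    4: (7, 7, 7), 5: (8, 8, 8), 6: (9, 9, 9),
}
identifier = category_identifier(range(1, 7), features)
print(identifier)  # [1, 6, (5.0, 5.0, 5.0)]
```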

Optionally, the class identifier of the second class and the class identifier of each of the K third classes may also be determined with reference to the methods of step S205 to step S208.

For example, it is determined that the category identification of the second category is [7,15, (6,6,6) ]. Assuming that K is equal to 1, it is determined that the class identifier of the third class is [16,17, (7,7,7) ], and the audio frame sequence corresponding to the target audio data may be { [1,6, (5,5,5) ], [7,15, (6,6,6) ], [16,17, (7,7,7) ] }.

In the embodiment of the application, by generating the category identifier of the category corresponding to the audio frame, when the audio frame is searched subsequently, the category corresponding to the category identifier can be quickly searched by searching the category identifier, so that each audio frame belonging to the category is searched; for example, when feature analysis needs to be performed on the audio frame of the start portion or the end portion in the target audio data, the audio frame with the minimum playing time or the audio frame with the maximum playing time may be determined by searching the audio frame number in the category identifier, so as to improve the efficiency of searching for the audio frame.
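A lookup over such category identifiers might look like the sketch below. It uses the identifier sequence from the example above and assumes that the identifiers are sorted by first frame number and that each category covers a continuous range of frame numbers; the function name `find_category` is hypothetical.

```python
from bisect import bisect_right

# Identifier sequence from the example: [first frame, last frame, mean vector].
sequence = [[1, 6, (5, 5, 5)], [7, 15, (6, 6, 6)], [16, 17, (7, 7, 7)]]


def find_category(sequence, frame):
    """Return the identifier whose frame range contains `frame`, or None."""
    starts = [identifier[0] for identifier in sequence]
    k = bisect_right(starts, frame) - 1  # last category starting at or before `frame`
    if k >= 0 and sequence[k][0] <= frame <= sequence[k][1]:
        return sequence[k]
    return None


print(find_category(sequence, 10))           # [7, 15, (6, 6, 6)]
print(sequence[0][0], sequence[-1][1])       # 1 17  (first and last frame overall)
```

Because each identifier carries its first and last frame number, the start or end portion of the target audio data can be located without scanning individual frames.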

The method of the embodiments of the present application is described above, and the apparatus of the embodiments of the present application is described below.

Referring to fig. 9, fig. 9 is a schematic diagram of a structure of an audio frame clustering apparatus provided in an embodiment of the present application, where the apparatus 90 includes:

a matrix obtaining module 901, configured to obtain a distance matrix corresponding to target audio data, where the target audio data includes N audio frames, the distance matrix is a symmetric matrix, an element on a main diagonal in the distance matrix is 0, other elements in the distance matrix except the element on the main diagonal are used to represent a distance between audio feature vectors corresponding to every two audio frames in the N audio frames, and N is a positive integer greater than or equal to 2;

a first determining module 902, configured to determine a first seed element from the distance matrix, determine, with the first seed element as a starting point of region growth, each element in the distance matrix that meets a region growth condition as a first element, determine, as a first category, a category of an audio frame corresponding to each element in the first element, and determine, as the first category, a category of an audio frame corresponding to the first seed element, where an absolute difference between horizontal and vertical coordinates of the first seed element mapped onto a two-dimensional coordinate system is equal to 1;

a difference set obtaining module 903, configured to obtain a difference set between a set of elements included in the distance matrix and a first set, where the first set includes the first seed element and the first element;

a second determining module 904, configured to determine a second seed element from the difference set, determine, with the second seed element as a starting point of region growing, each element in the distance matrix that meets the region growing condition as a second element, determine, as a second category, a category of an audio frame corresponding to each element in the second element, and determine, as the second category, a category of an audio frame corresponding to the second seed element, where an absolute difference between horizontal and vertical coordinates of the second seed element mapped onto the two-dimensional coordinate system is equal to 1, and the second seed element is smaller than the first seed element.

In a possible design, the distance matrix includes N × N elements, and the first determining module 902 is specifically configured to:

mapping N x N elements in the distance matrix to the two-dimensional coordinate system to obtain N x N coordinate points corresponding to the N x N elements in the distance matrix, wherein one element corresponds to one coordinate point, the first element in the distance matrix is mapped to the origin of coordinates of the two-dimensional coordinate system, and the distance between two adjacent coordinate points mapped to the two-dimensional coordinate system by every two adjacent elements in the distance matrix is equal;

determining a seed coordinate point with the absolute difference value between horizontal and vertical coordinates equal to 1 from the N x N coordinate points, and determining an element corresponding to the seed coordinate point as the first seed element;

determining elements in the distance matrix, which belong to the neighborhood of the first seed element and are smaller than a target threshold value, as the first elements, and determining the categories of the two audio frames corresponding to each element in the first elements as the first categories until no elements in the distance matrix, which belong to the neighborhood of the first seed element and are smaller than the target threshold value, exist;

and determining elements in the distance matrix, which belong to the neighborhood of each element in the first elements and are smaller than the target threshold value, as third elements, and determining the categories of the two audio frames corresponding to each element in the third elements as the first category.
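
As an illustration only, the region growing performed by the first determining module 902 might be sketched as follows. The sketch assumes a 4-neighbourhood (up, down, left, right), zero-based matrix indices used directly as coordinates, and a hypothetical target threshold of 0.5; the names `grow_region` and `frames_of` and the sample matrix are not part of the application.

```python
from collections import deque

def grow_region(dist, seed, threshold):
    """Grow a region from `seed` over cells whose value is below `threshold`."""
    n = len(dist)
    region = {seed}
    queue = deque([seed])
    while queue:
        i, j = queue.popleft()
        # Assumed 4-neighbourhood: up, down, left, right.
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ni, nj = i + di, j + dj
            inside = 0 <= ni < n and 0 <= nj < n
            if inside and (ni, nj) not in region and dist[ni][nj] < threshold:
                region.add((ni, nj))
                queue.append((ni, nj))
    return region

def frames_of(region):
    """Cell (i, j) relates frames i and j, so both frames join the category."""
    return sorted({k for cell in region for k in cell})

# Hypothetical 5-frame distance matrix: frames 0-2 are close, frames 3-4 are close.
dist = [
    [0.0, 0.1, 0.2, 0.9, 0.8],
    [0.1, 0.0, 0.1, 0.9, 0.9],
    [0.2, 0.1, 0.0, 0.8, 0.9],
    [0.9, 0.9, 0.8, 0.0, 0.1],
    [0.8, 0.9, 0.9, 0.1, 0.0],
]
seed = (0, 1)  # |0 - 1| == 1, as required of a seed element
region = grow_region(dist, seed, threshold=0.5)
first_category = frames_of(region)
```

With this sample matrix the growth from (0, 1) covers the 3 × 3 block of small distances, so frames 0, 1 and 2 fall into the first category. The sketch uses a 4-neighbourhood because, with diagonal steps allowed, the zero-valued main diagonal would connect every block.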

In one possible design, the apparatus further includes:

a third determining module 905, configured to determine, if K fourth elements whose absolute differences between horizontal and vertical coordinates mapped to the two-dimensional coordinate system are greater than 1 exist in the distance matrix, and the K fourth elements belong to neighborhoods of the first seed element and the second seed element, a category of an audio frame corresponding to each fourth element in the K fourth elements, to obtain K third categories, where a category of an audio frame corresponding to a fourth element is a third category, and K is a positive integer greater than or equal to 1.

In one possible design, the apparatus further includes:

a sequence determining module 906, configured to generate an audio frame sequence corresponding to the target audio data according to a time sequence of each audio frame belonging to the first category, each audio frame belonging to the second category, and each audio frame belonging to each of the K third categories in the N audio frames.

In one possible design, the apparatus further includes:

an identifier determining module 907, configured to obtain a first frame number of an audio frame with a minimum playing time in the audio frames belonging to the first category from among the N audio frames;

acquiring a second frame number of an audio frame with the maximum playing time in each audio frame belonging to the first category in the N audio frames;

calculating the average value of the audio characteristic vectors corresponding to the audio frames belonging to the first category in the N audio frames;

and generating a category identifier of the first category according to the first frame number, the second frame number and the mean value, wherein the category identifier of the first category comprises the first frame number, the second frame number and the mean value.
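
A minimal sketch of how such a category identifier could be assembled is given below. It assumes that frame numbers increase with playing time, so that the smallest and largest frame numbers correspond to the minimum and maximum playing times, and that `features` is a list of feature vectors indexed by frame number; the function name `category_identifier` and the sample values are hypothetical.

```python
def category_identifier(frame_numbers, features):
    """Return (first_frame, second_frame, mean_vector) for one category."""
    first_frame = min(frame_numbers)   # frame with the minimum playing time
    second_frame = max(frame_numbers)  # frame with the maximum playing time
    dim = len(features[first_frame])
    count = len(frame_numbers)
    # Component-wise mean of the feature vectors of the frames in the category.
    mean_vector = [
        sum(features[k][d] for k in frame_numbers) / count
        for d in range(dim)
    ]
    return first_frame, second_frame, mean_vector

# Hypothetical 2-dimensional feature vectors for four frames.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
identifier = category_identifier([0, 1, 2], features)
```

Here `identifier` holds the first frame number, the second frame number and the mean vector of the category made up of frames 0 to 2.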

In one possible design, an element a_ij in the distance matrix is the similarity distance between the audio feature vector corresponding to the i-th audio frame and the audio feature vector corresponding to the j-th audio frame among the N audio frames, where i and j are positive integers greater than 0 and less than or equal to N.

In one possible design, the apparatus further includes:

a matrix construction module 908 for dividing the target audio data into the N audio frames;

acquiring an audio feature vector of each audio frame in the N audio frames to obtain N audio feature vectors, wherein the audio feature vectors are Mel frequency spectrum feature vectors;

calculating the similarity distance between every two audio feature vectors in the N audio feature vectors to obtain N × (N-1) distance values;

and constructing a symmetric matrix with a main diagonal of 0 according to the N × (N-1) distance values, and determining the symmetric matrix as the distance matrix.
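
The matrix construction can be illustrated with the following sketch. It assumes that the feature vectors have already been extracted (the Mel spectrum extraction itself is not shown) and uses the Euclidean distance as a stand-in for the similarity distance, which is an assumption of this example; the function name `build_distance_matrix` and the sample vectors are hypothetical.

```python
import math

def build_distance_matrix(features):
    """Build a symmetric N x N distance matrix with a zero main diagonal."""
    n = len(features)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Each unordered pair is computed once and mirrored, which
            # fills the N x (N-1) off-diagonal entries of the matrix.
            d = math.dist(features[i], features[j])
            dist[i][j] = d
            dist[j][i] = d
    return dist

# Hypothetical feature vectors for three audio frames.
features = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
dist = build_distance_matrix(features)
```

Because the distance is symmetric, only the upper triangle needs to be computed; the lower triangle is a mirror copy and the main diagonal stays at 0.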

It should be noted that, for the content that is not mentioned in the embodiment corresponding to fig. 9, reference may be made to the description of the method embodiment, and details are not described here again.

In the embodiment of the application, the audio frame clustering device assigns audio frames with similar audio feature vectors to the same category while retaining all audio frame information of the target audio data, which improves search efficiency; the elements associated with each seed element are determined by a region growing method. Because a category identifier is generated for the category to which audio frames belong, a subsequent search can quickly locate a category by its category identifier and thereby find each audio frame belonging to that category. For example, when feature analysis needs to be performed on the audio frames of the start portion or the end portion of the target audio data, the audio frame with the minimum playing time or the audio frame with the maximum playing time can be determined from the frame numbers in the category identifier, so that the efficiency of searching for audio frames is improved.

Referring to fig. 10, fig. 10 is a schematic diagram of a composition structure of an electronic device according to an embodiment of the present application, where the device 100 includes a processor 1001, a memory 1002, and an input/output interface 1003. The processor 1001 is connected to the memory 1002 and the input/output interface 1003, and for example, the processor 1001 may be connected to the memory 1002 and the input/output interface 1003 via a bus.

The processor 1001 is configured to support the electronic device in performing the corresponding functions of the audio frame clustering method of fig. 1, 3-4 or 8. The processor 1001 may be a Central Processing Unit (CPU), a Network Processor (NP), a hardware chip, or any combination thereof. The hardware chip may be an Application Specific Integrated Circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.

The memory 1002 is used to store program codes and the like. The memory 1002 may include a volatile memory (VM), such as a random access memory (RAM); the memory 1002 may also include a non-volatile memory (NVM), such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the memory 1002 may also comprise a combination of the above-described types of memory.

The input/output interface 1003 is used for inputting or outputting data.

The processor 1001 may call the program code to perform the following operations:

determining a first seed element from the distance matrix, determining each element meeting a region growing condition in the distance matrix as a first element by taking the first seed element as a starting point of region growing, determining the category of the audio frame corresponding to each element in the first element as a first category, and determining the category of the audio frame corresponding to the first seed element as the first category, wherein the absolute difference between the horizontal and vertical coordinates of the first seed element mapped on a two-dimensional coordinate system is equal to 1;

determining a second seed element from the difference set, determining each element meeting the region growing condition in the distance matrix as a second element by taking the second seed element as a starting point of region growing, determining the category of the audio frame corresponding to each element in the second element as a second category, determining the category of the audio frame corresponding to the second seed element as the second category, wherein the absolute difference between the horizontal and vertical coordinates of the second seed element mapped on the two-dimensional coordinate system is equal to 1, and the second seed element is smaller than the first seed element.

It should be noted that, for the implementation of each operation, reference may also be made to the corresponding description in the above method embodiments; the processor 1001 may also cooperate with the input/output interface 1003 to perform other operations in the above-described method embodiments.
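
The way successive seed elements are taken from the difference set can be sketched as follows. This is only an illustration under several assumptions that the application does not fix: seeds are scanned along the superdiagonal (cells with |i - j| = 1) in increasing order, a seed is used only if its own value is below a hypothetical threshold of 0.5, and a 4-neighbourhood is used. The ordering constraint between the first and second seed elements is not modelled, and the names `grow_within` and `cluster` are hypothetical.

```python
from collections import deque

def grow_within(dist, seed, threshold, allowed):
    """Region growing restricted to the cells in `allowed`."""
    region = {seed}
    queue = deque([seed])
    while queue:
        i, j = queue.popleft()
        for cell in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            if cell in allowed and cell not in region:
                if dist[cell[0]][cell[1]] < threshold:
                    region.add(cell)
                    queue.append(cell)
    return region

def cluster(dist, threshold):
    """Return one sorted list of frame numbers per grown region."""
    n = len(dist)
    # Element set of the distance matrix; shrinks to the difference set.
    remaining = {(i, j) for i in range(n) for j in range(n)}
    categories = []
    for i in range(n - 1):
        seed = (i, i + 1)  # |i - (i + 1)| == 1
        if seed not in remaining or dist[i][i + 1] >= threshold:
            continue
        region = grow_within(dist, seed, threshold, remaining)
        remaining -= region  # difference set used for the next seed
        categories.append(sorted({k for cell in region for k in cell}))
    return categories

# Hypothetical 5-frame distance matrix: frames 0-2 are close, frames 3-4 are close.
dist = [
    [0.0, 0.1, 0.2, 0.9, 0.8],
    [0.1, 0.0, 0.1, 0.9, 0.9],
    [0.2, 0.1, 0.0, 0.8, 0.9],
    [0.9, 0.9, 0.8, 0.0, 0.1],
    [0.8, 0.9, 0.9, 0.1, 0.0],
]
categories = cluster(dist, threshold=0.5)
```

For the sample matrix, the first seed at (0, 1) yields a category containing frames 0 to 2, and the next usable seed in the difference set, at (3, 4), yields a category containing frames 3 and 4.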

Embodiments of the present application also provide a computer-readable storage medium storing a computer program, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the method according to the foregoing embodiments; the computer may be a part of the above-mentioned electronic device, such as the processor 1001 described above.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

The above disclosure describes only preferred embodiments of the present application and is not to be construed as limiting the scope of the present application; equivalent variations and modifications made in accordance with the present application therefore still fall within the scope of the present application.

Claims

1. An audio frame clustering method, comprising:

obtaining a distance matrix corresponding to target audio data, wherein the target audio data comprises N audio frames, the distance matrix is a symmetric matrix, elements on a main diagonal in the distance matrix are 0, other elements except the elements on the main diagonal in the distance matrix are used for representing the distance between audio feature vectors corresponding to every two audio frames in the N audio frames, and N is a positive integer greater than or equal to 2;

determining a first seed element from the distance matrix, determining each element meeting a region growing condition in the distance matrix as a first element by taking the first seed element as a starting point of region growing, determining a category of an audio frame corresponding to each element in the first element as a first category, and determining a category of an audio frame corresponding to the first seed element as the first category, wherein an absolute difference value between horizontal and vertical coordinates of the first seed element mapped on a two-dimensional coordinate system is equal to 1;

determining a second seed element from the difference set, determining, with the second seed element as a starting point of region growth, each element in the distance matrix that meets the region growth condition as a second element, determining a category of an audio frame corresponding to each element in the second element as a second category, and determining a category of an audio frame corresponding to the second seed element as the second category, where an absolute difference between horizontal and vertical coordinates of the second seed element mapped onto the two-dimensional coordinate system is equal to 1, and the second seed element is smaller than the first seed element.

2. The method according to claim 1, wherein the distance matrix comprises N × N elements, the determining a first seed element from the distance matrix, determining, with the first seed element as a starting point of region growing, each element in the distance matrix that satisfies a region growing condition as a first element, and determining, as a first category, a category of an audio frame corresponding to each element in the first element, comprises:

mapping the N × N elements in the distance matrix to the two-dimensional coordinate system to obtain N × N coordinate points corresponding to the N × N elements in the distance matrix, wherein one element corresponds to one coordinate point, the first element in the distance matrix is mapped to the origin of coordinates of the two-dimensional coordinate system, every two adjacent elements in the distance matrix are mapped to two adjacent coordinate points on the two-dimensional coordinate system, and the distance between any two adjacent coordinate points is equal;

determining, from the N × N coordinate points, a seed coordinate point whose absolute difference between horizontal and vertical coordinates is equal to 1, and determining the element corresponding to the seed coordinate point as the first seed element;

determining elements in the distance matrix which belong to the neighborhood of the first seed element and are smaller than a target threshold value as the first elements, and determining the categories of two audio frames corresponding to each element in the first elements as the first categories until no elements which belong to the neighborhood of the first seed element and are smaller than the target threshold value exist in the distance matrix;

determining elements in the distance matrix which belong to the neighborhood of each element in the first elements and are smaller than the target threshold value as third elements, and determining the categories of the two audio frames corresponding to each element in the third elements as the first categories.

3. The method of claim 2, further comprising:

if K fourth elements with absolute differences between horizontal and vertical coordinates mapped on the two-dimensional coordinate system being larger than 1 exist in the distance matrix, and the K fourth elements belong to the neighborhood of the first seed element and the second seed element, determining the category of the audio frame corresponding to each fourth element in the K fourth elements to obtain K third categories, wherein the category of the audio frame corresponding to one fourth element is one third category, and K is a positive integer larger than or equal to 1.

4. The method of claim 3, further comprising:

and generating an audio frame sequence corresponding to the target audio data according to the time sequence of each audio frame belonging to the first category, each audio frame belonging to the second category and each audio frame belonging to each third category in the K third categories.

5. The method of claim 1, further comprising:

acquiring a first frame number of an audio frame with the minimum playing time in each audio frame belonging to the first category in the N audio frames;

calculating the average value of the audio feature vectors corresponding to the audio frames belonging to the first category in the N audio frames;

6. The method of claim 1, wherein an element a_ij in the distance matrix is the similarity distance between the audio feature vector corresponding to the i-th audio frame and the audio feature vector corresponding to the j-th audio frame among the N audio frames, wherein i and j are positive integers greater than 0 and less than or equal to N.

7. The method of claim 1, wherein before obtaining the distance matrix corresponding to the target audio data, the method further comprises:

dividing the target audio data into the N audio frames;

and constructing a symmetric matrix with a main diagonal of 0 according to the N x (N-1) distance values, and determining the symmetric matrix as the distance matrix.

8. An audio frame clustering apparatus, comprising:

a first determining module, configured to determine a first seed element from the distance matrix, determine, with the first seed element as a starting point of region growth, each element in the distance matrix that meets a region growth condition as a first element, determine, as a first category, a category of an audio frame corresponding to each element in the first element, and determine, as the first category, a category of an audio frame corresponding to the first seed element, where an absolute difference between horizontal and vertical coordinates of the first seed element mapped onto a two-dimensional coordinate system is equal to 1;

9. An electronic device, comprising a processor, a memory and an input/output interface, wherein the processor, the memory and the input/output interface are connected to each other, wherein the input/output interface is used for inputting or outputting data, the memory is used for storing program codes, and the processor is used for calling the program codes and executing the method according to any one of claims 1-7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to carry out the method according to any one of claims 1-7.