CN111460215A - Audio data processing method and device, computer equipment and storage medium - Google Patents

Audio data processing method and device, computer equipment and storage medium

Info

Publication number
CN111460215A
Authority
CN
China
Prior art keywords
audio data
processed
sequence
index information
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010237269.7A
Other languages
Chinese (zh)
Other versions
CN111460215B (en)
Inventor
缪畅宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010237269.7A priority Critical patent/CN111460215B/en
Publication of CN111460215A publication Critical patent/CN111460215A/en
Application granted granted Critical
Publication of CN111460215B publication Critical patent/CN111460215B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/61 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/63 Querying
    • G06F16/638 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/65 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines

Abstract

The embodiment of the application provides an audio data processing method, an audio data processing device, computer equipment and a storage medium, wherein the method comprises the following steps: acquiring at least two frequency sequences corresponding to each audio data in an audio data set, and clustering the at least two frequency sequences corresponding to each audio data to obtain at least two sequence clusters; determining index information corresponding to each audio data according to cluster identifications of sequence clusters to which at least two frequency sequences in each audio data belong respectively; and acquiring recommended audio data corresponding to the audio data to be retrieved in the audio data set according to the index information. By adopting the embodiment of the application, the retrieval accuracy of the audio data can be improved.

Description

Audio data processing method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of internet technologies, and in particular, to an audio data processing method and apparatus, a computer device, and a storage medium.
Background
With the continuous development of multimedia data, users increasingly listen to multimedia information such as music and radio broadcasts through terminals, and doing so has become a basic user demand. With the rapid growth of data volume, big data shows a trend toward diversification and decentralization. Against this background of large-scale data, users need to search a large amount of data to obtain the audio data they are interested in.
In the prior art, audio data can be retrieved by keyword to obtain audio data matched with the keyword, where the keyword may be text information such as a song name, lyrics, a song type label, or a singer name. However, the amount of audio data obtained by keyword retrieval is often too large, and it is difficult to ensure that the retrieved audio data is audio the user is interested in, so the accuracy of the retrieved audio data is too low.
Disclosure of Invention
The embodiment of the application provides an audio data processing method and device, computer equipment and a storage medium, and can improve the retrieval accuracy of audio data.
An aspect of the present embodiment provides an audio data processing method, including:
acquiring at least two frequency sequences corresponding to each audio data in an audio data set, and clustering the at least two frequency sequences corresponding to each audio data to obtain at least two sequence clusters;
determining index information corresponding to each audio data according to cluster identifications of sequence clusters to which at least two frequency sequences in each audio data belong respectively;
and acquiring recommended audio data corresponding to the audio data to be retrieved in the audio data set according to the index information.
Wherein, acquiring the at least two frequency sequences respectively corresponding to each audio data in the audio data set includes:
acquiring each audio data contained in the audio data set, and respectively sampling each audio data according to sampling interval time to obtain a sampling time sequence corresponding to each audio data;
grouping the sampling time sequences according to the time period information to obtain at least two time sequences corresponding to each audio data;
and respectively carrying out frequency domain transformation on the at least two time sequences to obtain at least two frequency sequences corresponding to each audio data.
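The three steps above (time-domain sampling, grouping the sampled values by time-period information, and frequency-domain transformation) can be sketched as follows. The window length, the synthetic input signal, and the use of a NumPy FFT magnitude spectrum are illustrative assumptions, not details fixed by the claims.

```python
import numpy as np

def audio_to_frequency_sequences(samples, samples_per_window=30):
    """Split a sampled time sequence into time sequences of fixed length and
    transform each one into a frequency sequence (magnitude spectrum)."""
    n_windows = len(samples) // samples_per_window
    # Group the sampling time sequence according to the time-period information.
    windows = np.reshape(np.asarray(samples)[:n_windows * samples_per_window],
                         (n_windows, samples_per_window))
    # Frequency-domain transformation of each time sequence.
    return [np.abs(np.fft.rfft(w)) for w in windows]

# A synthetic signal standing in for one piece of sampled audio data.
signal = np.sin(np.linspace(0, 40 * np.pi, 300))
freq_seqs = audio_to_frequency_sequences(signal, samples_per_window=30)
print(len(freq_seqs), len(freq_seqs[0]))  # 10 windows, 16 frequency bins each
```

Each piece of audio data thus yields at least two frequency sequences, which become the inputs to the clustering steps that follow.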
Wherein, clustering at least two frequency sequences respectively corresponding to each audio data to obtain at least two sequence clusters, comprising:
determining the at least two frequency sequences respectively corresponding to each audio data as sequences to be processed, adding each sequence to be processed to a sequence set, and selecting a central sequence tk from the sequence set; k is a positive integer less than or equal to the number of central sequences;
acquiring the similarity between each sequence to be processed contained in the sequence set and the central sequence tk; if the similarity between a sequence to be processed Gi in the sequence set and the central sequence tk is the maximum, adding the sequence to be processed Gi to the cluster to be processed Ck to which the central sequence tk belongs; i is a positive integer less than or equal to the number of sequences to be processed contained in the sequence set;
updating the central sequence tk according to the sequences to be processed contained in the cluster to be processed Ck, and when the updated central sequence tk in the cluster to be processed Ck is the same as the central sequence tk before updating, determining the cluster to be processed Ck as a sequence cluster.
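The iterative procedure above resembles k-means clustering: assign each sequence to its most similar central sequence, recompute the central sequences, and stop when a centre no longer changes. A minimal sketch, assuming negative Euclidean distance as the similarity measure and explicitly supplied initial centres; both choices are left open by the claims.

```python
import numpy as np

def cluster_sequences(sequences, init_idx, n_iter=100):
    """Assign each sequence to its most similar central sequence, then update
    the central sequences until they stop changing."""
    X = np.asarray(sequences, dtype=float)
    centers = X[list(init_idx)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Similarity is taken as negative Euclidean distance (an assumption).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(len(centers))])
        if np.allclose(new_centers, centers):  # updated centre equals centre before update
            break
        centers = new_centers
    return labels, centers

# Two well-separated groups of toy "frequency sequences".
seqs = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
labels, _ = cluster_sequences(seqs, init_idx=[0, 2])
print(labels)  # [0 0 1 1]
```

Similar frequency sequences end up in the same sequence cluster, which is the property the index construction relies on.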
Wherein, clustering at least two frequency sequences respectively corresponding to each audio data to obtain at least two sequence clusters, comprising:
determining at least two frequency sequences corresponding to each audio data as to-be-processed sequences, and adding each to-be-processed sequence to a sequence set;
acquiring M to-be-processed clusters corresponding to the sequence set, and acquiring the similarity between any two to-be-processed clusters in the M to-be-processed clusters; m is the number of sequences to be processed contained in the sequence set, and each cluster to be processed comprises one sequence to be processed;
combining two clusters to be processed corresponding to the maximum similarity in the M clusters to be processed to obtain updated M-1 clusters to be processed, and obtaining the similarity between any two updated clusters to be processed in the updated M-1 clusters to be processed;
and combining the two updated clusters to be processed corresponding to the maximum similarity in the updated M-1 clusters to be processed to obtain updated M-2 clusters to be processed until the number of the updated clusters to be processed is equal to the threshold of the number of clusters, and determining the updated clusters to be processed as sequence clusters.
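This bottom-up merging of the most similar cluster pair is a form of agglomerative clustering. A compact sketch, with the similarity between two clusters taken as the negative distance between their centroid sequences; the claim does not fix this choice.

```python
import numpy as np
from itertools import combinations

def agglomerative_cluster(sequences, cluster_count_threshold):
    """Start with one cluster per sequence; repeatedly merge the two most
    similar clusters until the cluster-count threshold is reached."""
    X = [np.asarray(s, dtype=float) for s in sequences]

    def sim(a, b):
        # Negative distance between cluster centroids (an illustrative choice).
        return -np.linalg.norm(np.mean([X[i] for i in a], axis=0)
                               - np.mean([X[i] for i in b], axis=0))

    clusters = [[i] for i in range(len(X))]  # M clusters, one sequence each
    while len(clusters) > cluster_count_threshold:
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda p: sim(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]  # merge the most similar pair
        del clusters[b]
    return clusters

seqs = [[0.0], [0.1], [5.0], [5.2]]
print(agglomerative_cluster(seqs, 2))  # [[0, 1], [2, 3]]
```

Unlike the centre-based variant, this approach needs no initial central sequences, only a target number of clusters.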
Determining index information corresponding to each audio data according to cluster identifiers of sequence clusters to which at least two frequency sequences in each audio data belong respectively, wherein the determining comprises the following steps:
respectively setting cluster identifiers for at least two sequence clusters, and determining initial index information corresponding to each audio data according to the cluster identifiers of the sequence clusters to which at least two frequency sequences in each audio data respectively belong;
when the adjacent cluster identifiers in the initial index information are different, acquiring a combined identifier corresponding to the adjacent cluster identifiers;
and adding the combined identification to the initial index information to obtain the index information corresponding to each audio data.
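Index construction can be illustrated directly: take the per-sequence cluster identifiers, in order, as the initial index information, then append a combined identifier for every pair of differing adjacent cluster identifiers. The string format of the combined identifier ("A+B") is an assumption made for illustration.

```python
def build_index_info(cluster_ids):
    """Initial index = the cluster identifier of each frequency sequence in
    order; append a combined identifier for each differing adjacent pair."""
    index_info = list(cluster_ids)
    for a, b in zip(cluster_ids, cluster_ids[1:]):
        if a != b:  # adjacent cluster identifiers are different
            index_info.append(f"{a}+{b}")  # combined-identifier format is assumed
    return index_info

# Audio whose four frequency sequences fall into clusters A, A, B, C:
print(build_index_info(["A", "A", "B", "C"]))
# ['A', 'A', 'B', 'C', 'A+B', 'B+C']
```

The combined identifiers capture transitions between clusters, so the index reflects not only which clusters a piece of audio touches but also their order.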
The method for acquiring the recommended audio data corresponding to the audio data to be retrieved in the audio data set according to the index information comprises the following steps:
when the retrieval triggering operation aiming at the audio data to be retrieved in the audio data set is detected, determining the index information corresponding to the audio data to be retrieved as target index information, and determining the index information corresponding to the rest audio data except the audio data to be retrieved in the audio data set as candidate index information;
acquiring the matching degree between the target index information and the candidate index information, and sequencing the candidate audio data contained in the audio data set according to the matching degree; the candidate audio data are audio data corresponding to the candidate index information;
and acquiring recommended audio data from the sorted candidate audio data according to the sorting order, and sending the recommended audio data to the terminal equipment corresponding to the audio data to be retrieved.
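The retrieval step then reduces to scoring every candidate's index information against the target index information and returning the best-ranked candidates. A sketch, with a simple set-overlap score standing in for the logarithmic matching degree the claims describe; the song names and indexes are hypothetical.

```python
def recommend(target_index, candidates, top_k=2):
    """Rank candidate audio data by matching degree with the target index
    information and return the top results in descending order."""
    def score(cand_index):
        # Set overlap as a stand-in for the log-summation matching degree.
        return len(set(target_index) & set(cand_index))
    ranked = sorted(candidates.items(), key=lambda kv: score(kv[1]), reverse=True)
    return [name for name, _ in ranked[:top_k]]

target = ["A", "A", "B", "A+B"]
candidates = {
    "song1": ["A", "B", "A+B"],   # shares A, B and A+B with the target
    "song2": ["C", "D", "C+D"],   # shares nothing
    "song3": ["A", "C", "A+C"],   # shares only A
}
print(recommend(target, candidates))  # ['song1', 'song3']
```

Because the indexes are built from frequency-domain clusters, highly ranked candidates sound similar to the audio being retrieved rather than merely matching its text metadata.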
The target index information and the candidate index information both comprise cluster identification and combined identification;
obtaining the matching degree between the target index information and the candidate index information, including:
determining cluster identification and combined identification contained in the target index information as first identification to be processed;
selecting, from the audio data set, candidate index information xn corresponding to candidate audio data yn, and determining the cluster identifiers and the combined identifiers contained in the candidate index information xn as second identifiers to be processed; n is a positive integer less than or equal to the number of candidate audio data;
classifying first to-be-processed identifications contained in the target index information to obtain at least two target identification categories, and respectively counting a first quantity of the first to-be-processed identifications contained in each target identification category;
classifying the second identifiers to be processed contained in the candidate index information xn to obtain at least two candidate identification categories, and respectively counting a second number of the second identifiers to be processed contained in each candidate identification category;
determining the matching degree between the target index information and the candidate index information xn according to the first number and the second number.
Wherein, determining the matching degree between the target index information and the candidate index information xn according to the first number and the second number includes:
acquiring a matching identification category from the at least two target identification categories and the at least two candidate identification categories; the at least two target identification categories and the at least two candidate identification categories each comprise the matching identification category;
performing logarithmic summation on the first number of the matching identification category in the target index information and the second number of the matching identification category in the candidate index information xn, to obtain the matching degree between the target index information and the candidate index information xn.
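Read literally, the matching degree sums a logarithmic term over the identifier categories shared by both indexes, weighted by the per-category counts. One plausible reading is sketched here; the exact logarithmic combination, log(1 + n1·n2), is an assumption, not the formula fixed by the claim.

```python
import math
from collections import Counter

def matching_degree(target_index, candidate_index):
    """Count identifiers per category in each index, then sum a log-weighted
    contribution over the matching identifier categories."""
    first = Counter(target_index)      # first number, per target category
    second = Counter(candidate_index)  # second number, per candidate category
    shared = first.keys() & second.keys()  # matching identification categories
    # Logarithmic summation over the shared categories (assumed form).
    return sum(math.log(1 + first[c] * second[c]) for c in shared)

target = ["A", "A", "B", "A+B"]
candidate = ["A", "B", "B", "A+B"]
score = matching_degree(target, candidate)
```

The logarithm dampens the influence of identifier categories that occur very frequently, so a few highly repeated cluster identifiers cannot dominate the matching degree.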
An aspect of an embodiment of the present application provides an audio data processing apparatus, including:
the clustering module is used for acquiring at least two frequency sequences corresponding to each audio data in the audio data set, and clustering the at least two frequency sequences corresponding to each audio data to obtain at least two sequence clusters;
the determining module is used for determining the index information corresponding to each audio data according to the cluster identification of the sequence cluster to which at least two frequency sequences in each audio data belong respectively;
and the recommending module is used for acquiring recommended audio data corresponding to the audio data to be retrieved in the audio data set according to the index information.
Wherein, the clustering module includes:
the sampling unit is used for acquiring each audio data contained in the audio data set, and respectively sampling each audio data according to sampling interval time to obtain a sampling time sequence corresponding to each audio data;
the grouping unit is used for grouping the sampling time sequences according to the time period information to obtain at least two time sequences corresponding to each audio data;
and the frequency domain transformation unit is used for respectively carrying out frequency domain transformation on the at least two time sequences to obtain at least two frequency sequences corresponding to each audio data.
Wherein, the clustering module includes:
a sequence selection unit, configured to determine at least two frequency sequences corresponding to each piece of audio data as to-be-processed sequences, add each to-be-processed sequence to a sequence set, and select a center sequence tk from the sequence set; k is a positive integer less than or equal to the number of center sequences;
a sequence dividing unit, configured to acquire the similarity between each sequence to be processed contained in the sequence set and the center sequence tk, and if the similarity between a sequence to be processed Gi in the sequence set and the center sequence tk is the maximum, add the sequence to be processed Gi to the cluster to be processed Ck to which the center sequence tk belongs; i is a positive integer less than or equal to the number of sequences to be processed contained in the sequence set;
a cluster updating unit, configured to update the center sequence tk according to the sequences to be processed contained in the cluster to be processed Ck, and when the updated center sequence tk in the cluster to be processed Ck is the same as the center sequence tk before updating, determine the cluster to be processed Ck as a sequence cluster.
Wherein, the clustering module includes:
the sequence set acquisition unit is used for determining at least two frequency sequences respectively corresponding to each piece of audio data as sequences to be processed and adding each sequence to be processed to a sequence set;
the similarity obtaining unit is used for obtaining M to-be-processed clusters corresponding to the sequence set and obtaining the similarity between any two to-be-processed clusters in the M to-be-processed clusters; m is the number of sequences to be processed contained in the sequence set, and each cluster to be processed comprises one sequence to be processed;
the first merging unit is used for merging two clusters to be processed corresponding to the maximum similarity in the M clusters to be processed to obtain updated M-1 clusters to be processed and acquiring the similarity between any two updated clusters to be processed in the updated M-1 clusters to be processed;
and the second merging unit is used for merging the two updated clusters to be processed corresponding to the maximum similarity in the updated M-1 clusters to be processed to obtain updated M-2 clusters to be processed until the number of the updated clusters to be processed is equal to the threshold of the number of clusters, and determining the updated clusters to be processed as the sequence clusters.
Wherein the determining module comprises:
an initial index obtaining unit, configured to set cluster identifiers for at least two sequence clusters respectively, and determine initial index information corresponding to each audio data according to the cluster identifiers of the sequence clusters to which at least two frequency sequences in each audio data belong respectively;
a combined identifier obtaining unit, configured to obtain a combined identifier corresponding to an adjacent cluster identifier when adjacent cluster identifiers in the initial index information are different;
and the combined identifier adding unit is used for adding the combined identifier to the initial index information to obtain the index information corresponding to each audio data.
Wherein, the recommendation module includes:
the detection unit is used for determining index information corresponding to the audio data to be retrieved as target index information and determining index information corresponding to the rest audio data except the audio data to be retrieved in the audio data set as candidate index information when the retrieval triggering operation aiming at the audio data to be retrieved in the audio data set is detected;
the sorting unit is used for acquiring the matching degree between the target index information and the candidate index information and sorting the candidate audio data contained in the audio data set according to the matching degree; the candidate audio data are audio data corresponding to the candidate index information;
and the recommended audio acquiring unit is used for acquiring recommended audio data from the sorted candidate audio data according to the sorting order and sending the recommended audio data to the terminal equipment corresponding to the audio data to be retrieved.
The target index information and the candidate index information both comprise cluster identification and combined identification;
the sorting unit includes:
a first determining subunit, configured to determine a cluster identifier and a combined identifier included in the target index information as a first identifier to be processed;
a second determining subunit, configured to select, from the audio data set, candidate index information xn corresponding to candidate audio data yn, and determine the cluster identifiers and the combined identifiers contained in the candidate index information xn as second identifiers to be processed; n is a positive integer less than or equal to the number of candidate audio data;
the first counting subunit is configured to classify the first to-be-processed identifiers included in the target index information to obtain at least two target identifier categories, and count a first number of the first to-be-processed identifiers included in each target identifier category;
a second statistical subunit, configured to classify the second identifiers to be processed contained in the candidate index information xn to obtain at least two candidate identification categories, and respectively count a second number of the second identifiers to be processed contained in each candidate identification category;
a matching degree determining subunit, configured to determine the matching degree between the target index information and the candidate index information xn according to the first number and the second number.
Wherein the matching degree determining subunit includes:
a matching identifier obtaining subunit, configured to obtain a matching identification category from the at least two target identification categories and the at least two candidate identification categories; the at least two target identification categories and the at least two candidate identification categories each comprise the matching identification category;
a summation subunit, configured to perform logarithmic summation on the first number of the matching identification category in the target index information and the second number of the matching identification category in the candidate index information xn, to obtain the matching degree between the target index information and the candidate index information xn.
An aspect of the embodiments of the present application provides a computer device, including a memory and a processor, where the memory stores a computer program, and the computer program, when executed by the processor, causes the processor to execute the steps of the method in the aspect of the embodiments of the present application.
An aspect of the embodiments of the present application provides a computer-readable storage medium, in which a computer program is stored, the computer program comprising program instructions that, when executed by a processor, perform the steps of the method as in an aspect of the embodiments of the present application.
According to the embodiment of the application, at least two frequency sequences corresponding to each audio data in the audio data set can be obtained, clustering is carried out on the at least two frequency sequences corresponding to each audio data to obtain at least two sequence clusters, index information corresponding to each audio data is determined according to cluster identifications of the sequence clusters to which the at least two frequency sequences in each audio data belong, and then recommended audio data corresponding to the audio data to be retrieved in the audio data set are obtained according to the index information. Therefore, the frequency sequences corresponding to all the audio data in the audio data set are clustered to obtain a plurality of sequence clusters, a cluster identifier is set for each sequence cluster, the cluster identifier of the sequence cluster to which each frequency sequence in the audio data to be retrieved belongs is used as the index information of the audio data to be retrieved, and the recommended audio data corresponding to the audio data to be retrieved is determined based on the index information of the audio data to be retrieved and the index information of the rest audio data, so that the similarity between the audio data to be retrieved and the recommended audio data can be enhanced, and the retrieval accuracy of the audio data can be further improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a diagram of a network architecture provided by an embodiment of the present application;
fig. 2a and fig. 2b are schematic diagrams of an audio data processing scenario provided by an embodiment of the present application;
fig. 3 is a schematic flowchart of an audio data processing method according to an embodiment of the present application;
fig. 4 is a schematic diagram of an acquisition frequency sequence provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of clustering of frequency sequences provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of another frequency sequence clustering provided in the embodiments of the present application;
fig. 7 is a schematic diagram of an audio data retrieval scenario provided by an embodiment of the present application;
fig. 8 is a schematic structural diagram of an audio data processing apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The scheme provided by the embodiment of the application relates to Machine Learning (ML), which belongs to the field of artificial intelligence. Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines.
Fig. 1 is a diagram of a network architecture according to an embodiment of the present application. The network architecture may include a server 10d and a plurality of terminal devices (specifically, as shown in fig. 1, including a terminal device 10a, a terminal device 10b, and a terminal device 10c), where the server 10d may perform data transmission with each terminal device through a network.
The server 10d may obtain all audio data in the audio application, and perform time domain sampling on each audio data according to a sampling interval time (e.g., 0.1 second) to obtain a discrete sampling time sequence corresponding to each audio data, and further may group sampling values included in the discrete sampling time sequence, and may divide the discrete sampling time sequence corresponding to each audio data into at least two time sequences. Of course, it is also possible to perform frequency domain transformation on each time sequence, and perform frequency domain sampling on the frequency signal after the frequency domain transformation according to the sampling interval frequency, so as to obtain the frequency sequence corresponding to each time sequence. The server 10d may cluster the frequency sequences (or time sequences) corresponding to all the audio data, so that similar frequency sequences are divided into the same sequence cluster, and dissimilar frequency sequences are divided into different sequence clusters, so as to obtain at least two sequence clusters, and a cluster identifier may be set for each sequence cluster; the server 10d may generate index information corresponding to each audio data according to the cluster identifier of the sequence cluster to which the frequency sequence included in each audio data belongs, that is, the cluster identifier is used to represent the corresponding frequency sequence. 
For the audio data to be retrieved in the audio application (e.g., the audio data being played by the user in the audio application of the terminal device 10a, or the audio data played by the user the most times in the audio application of the terminal device 10 a), the server 10d may select the audio data most similar to the audio data to be retrieved from the audio application according to the index information corresponding to the audio data to be retrieved and the index information corresponding to the other audio data, and add the audio data to the playlist of the user as the retrieval result.
Of course, if the terminal device integrates the audio data sampling, frequency domain transformation, and clustering functions, the terminal device may also directly determine at least two frequency sequences (or time sequences) corresponding to all audio data in the audio application, and obtain a sequence cluster to which the frequency sequence included in each audio data belongs through clustering, thereby obtaining index information corresponding to each audio data, determine a matching degree between the audio data to be retrieved and the other audio data according to the index information, and further determine recommended audio data corresponding to the audio data to be retrieved according to the matching degree. It can be understood that the audio data processing scheme provided in the embodiment of the present application may be executed by a computer program (including program code) in a computer device, for example, the audio data processing scheme is executed by an application software, a backend server of the application software may directly obtain each piece of audio data from a backend database, and generate index information corresponding to each piece of audio data, a client of the application software may detect a user behavior (e.g., a behavior of playing audio data, collecting audio data, and the like) with respect to the audio data, and the backend server of the application software determines recommended audio data that matches the audio data. In the following, how the terminal device determines the target recommendation data corresponding to the multimedia data is taken as an example for explanation.
The terminal device 10a, the terminal device 10b, the terminal device 10c, and the like may include a mobile phone, a tablet computer, a notebook computer, a palm computer, a Mobile Internet Device (MID), a wearable device (e.g., a smart watch, a smart bracelet, and the like), and each terminal device may be installed with an audio application.
Please refer to fig. 2a and fig. 2b, which are schematic diagrams of an audio data processing scenario provided in an embodiment of the present application, and take a music application in a terminal device as an example to specifically describe an implementation process of an audio data processing scheme provided in the embodiment of the present application. As shown in fig. 2a, the terminal device may retrieve all music in the background database of the music application, and add the retrieved music to the music collection 20 a. In order to ensure that the terminal device obtains all music in the music application, the terminal device may read music data from the background database at intervals (e.g., 10 minutes) and update the music collection 20a in real time.
The terminal device may sample the music contained in the music collection 20a, representing each piece of music in the music collection 20a as a plurality of frequency sequences. Taking music 1 in the music collection 20a as an example, after the terminal device selects music 1 from the music collection 20a, time-domain sampling may be performed on music 1 according to the sampling interval time, for example, one sampling value every 1 second, so as to obtain a sampling time sequence {T1, T2, …, Tn} corresponding to music 1, where each Ti (i is a positive integer less than or equal to n) may be represented as the numerical value of music 1 at the corresponding sampling point. Further, the sampling time sequence {T1, T2, …, Tn} may be grouped according to the time period information to obtain a time sequence set of music 1 (the time sequence set includes at least two time sequences). When the time period information is 30 seconds, the sampling points in every 30 seconds are represented as one time sequence, so each time sequence includes the values of 30 sampling points; for example, T1 to T30 may be the first time sequence of music 1, T31 to T60 the second time sequence, and so on, to obtain the time sequence set corresponding to music 1.
Further, the terminal device may perform frequency domain transformation on each time sequence of music 1 to obtain the frequency signal corresponding to each time sequence, and perform frequency domain sampling on each frequency signal according to a sampling interval frequency (e.g., 10 Hz) to obtain the frequency sequence set 20b corresponding to music 1. The number of frequency sequences included in the frequency sequence set 20b may be the same as the number of time sequences included in the time sequence set, that is, one time sequence corresponds to one frequency sequence; for example, the frequency sequence set 20b corresponding to music 1 may include 7 frequency sequences, represented as frequency sequence G1 to frequency sequence G7. When the frequency range of the frequency signal is 0 to f, each frequency sequence may include f/10 numerical values, i.e., any one of the frequency sequences G1 to G7 may be regarded as a vector of dimension f/10. Similarly, the terminal device may obtain the frequency sequence set corresponding to each piece of music in the music collection 20a, for example, the frequency sequence set 20c corresponding to music 2, the frequency sequence set 20d corresponding to music 3, the frequency sequence set 20e corresponding to music 4, the frequency sequence set 20f corresponding to music 5, and so on. Because different pieces of music have different durations, the number of frequency sequences in their frequency sequence sets may also differ; for example, the frequency sequence set 20b corresponding to music 1 may include 7 frequency sequences (G1 to G7), while the frequency sequence set 20c corresponding to music 2 may include 9 frequency sequences (G8 to G16).
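The sampling, windowing, and frequency domain transformation steps above can be sketched as follows. This is a minimal illustration assuming NumPy and an FFT (one of several transforms the embodiment allows); the function name and parameters `window_len` (samples per time sequence) and `n_bins` (frequency-domain sampling points kept per window) are hypothetical:

```python
import numpy as np

def audio_to_frequency_sequences(samples, window_len, n_bins):
    """Group time-domain samples into fixed-length time sequences and
    turn each one into a frequency sequence via an FFT plus
    frequency-domain sampling (keeping n_bins magnitude values)."""
    freq_seqs = []
    for start in range(0, len(samples) - window_len + 1, window_len):
        window = samples[start:start + window_len]      # one time sequence
        spectrum = np.abs(np.fft.rfft(window))          # frequency signal
        # frequency-domain sampling: keep n_bins evenly spaced magnitudes
        idx = np.linspace(0, len(spectrum) - 1, n_bins).astype(int)
        freq_seqs.append(spectrum[idx])                 # one frequency sequence
    return freq_seqs

# e.g. 210 seconds sampled once per second, grouped into 30-second
# time sequences, yields 7 frequency sequences (like G1 ... G7 above)
seqs = audio_to_frequency_sequences(np.sin(np.arange(210.0)), 30, 5)
```

Each returned element is a small vector, matching the patent's observation that a frequency sequence can be regarded as a vector whose dimension depends on the sampling interval frequency.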
The terminal device may add the frequency sequences corresponding to each piece of music in the music collection 20a to the frequency sequence total collection 20g; in other words, the frequency sequence total collection 20g may include the frequency sequences corresponding to all pieces of music in the music application. The terminal device may calculate the similarity between any two frequency sequences in the total frequency sequence set 20g (i.e., the distance between any two frequency sequences), and cluster the frequency sequences contained in the total set 20g according to the similarity, so that similar frequency sequences are clustered into the same sequence cluster and dissimilar frequency sequences are divided into different sequence clusters, thereby obtaining at least two sequence clusters, where each sequence cluster may include a plurality of frequency sequences. When clustering the frequency sequences in the total frequency sequence set 20g, the number k of clustering categories (i.e., the number of sequence clusters) may be preset; if the category number k is set to 500, then 500 sequence clusters, namely sequence cluster 1, sequence cluster 2, …, and sequence cluster 500, may be obtained after clustering is completed. The terminal device may set a cluster identifier for each sequence cluster, where the cluster identifier may be a cluster number; when the number of sequence clusters is 500, the cluster identifier may be a number between 1 and 500, for example, the cluster identifier of sequence cluster 1 is set to 1, the cluster identifier of sequence cluster 2 is set to 2, and so on, up to the cluster identifier of sequence cluster 500 being set to 500. The clustering method includes but is not limited to: Gaussian Mixture Model (GMM), K-means algorithm, Spectral Clustering algorithm, and Hierarchical Clustering algorithm.
The terminal device may represent each frequency sequence by its cluster identifier. For example, suppose the frequency sequence G1 in the frequency sequence set 20b of music 1 belongs to sequence cluster 9, the frequency sequences G2, G3, and G4 all belong to sequence cluster 1, the frequency sequences G5 and G6 both belong to sequence cluster 2, and the frequency sequence G7 belongs to sequence cluster 3. Then G1 may be represented by cluster identifier 9, G2, G3, and G4 by cluster identifier 1, G5 and G6 by cluster identifier 2, and G7 by cluster identifier 3, generating the index information {9, 1, 1, 1, 2, 2, 3} corresponding to music 1. In other words, music 1 may be compressed into the index information {9, 1, 1, 1, 2, 2, 3}, i.e., each multidimensional frequency sequence may be compressed into a single cluster identifier. Similarly, the terminal device may generate the index information corresponding to each piece of music in the music collection 20a. Because frequency sequences belonging to the same sequence cluster are similar, the generated index information preserves the similarity between pieces of music.
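The compression of music 1 into index information amounts to a simple lookup from each frequency sequence to its cluster identifier. A sketch in plain Python, with the cluster assignments taken directly from the example above (the dict and function names are illustrative):

```python
# Cluster assignments from the example: G1 -> cluster 9, G2-G4 -> cluster 1,
# G5-G6 -> cluster 2, G7 -> cluster 3.
cluster_of = {"G1": 9, "G2": 1, "G3": 1, "G4": 1, "G5": 2, "G6": 2, "G7": 3}

def build_index_information(sequence_ids, cluster_of):
    """Replace each multidimensional frequency sequence with the single
    cluster identifier of the sequence cluster it belongs to."""
    return [cluster_of[sid] for sid in sequence_ids]

music1_index = build_index_information(
    ["G1", "G2", "G3", "G4", "G5", "G6", "G7"], cluster_of)
# music1_index == [9, 1, 1, 1, 2, 2, 3]
```

The resulting list of small integers is far cheaper to store and compare than the original frequency sequences, which is the point of the index information.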
As shown in fig. 2b, when the terminal device detects a play click operation by the user on music 1 in the music application, the terminal device may use music 1 as the music to be retrieved and calculate the matching degree between music 1 and the rest of the music according to the index information corresponding to each piece of music in the music collection 20a; for example, the matching degree between music 1 and music 2 is a1, the matching degree between music 1 and music 3 is a2, the matching degree between music 1 and music 4 is a3, and so on. The music other than music 1 in the music collection 20a is sorted according to matching degree, and the top 5 pieces, i.e., the 5 pieces of music with the highest matching degree with music 1, are selected from the sorted music as the recommended music corresponding to music 1; the number of recommended pieces here can be preset according to actual requirements. When the 5 pieces of music with the highest matching degree with music 1 are music 2, music 7, music 11, music 50, and music 20, the terminal device may add them to the user's playlist in the music platform, so that the user may continuously listen to music of interest, effectively reducing the time the user spends retrieving relevant music.
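The embodiment does not fix a formula for the matching degree at this point in the description. One plausible sketch, not the patent's definitive method, is to compare the cluster-identifier histograms of two pieces of index information with cosine similarity and return the highest-scoring tracks; all names here (`matching_degree`, `recommend`, the sample catalog) are illustrative:

```python
import math
from collections import Counter

def matching_degree(index_a, index_b):
    """Cosine similarity between the cluster-identifier histograms of two
    pieces of index information (one possible matching-degree definition)."""
    ca, cb = Counter(index_a), Counter(index_b)
    dot = sum(ca[k] * cb[k] for k in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
           math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def recommend(query_index, catalog, top_n=5):
    """Rank all other tracks by matching degree; return the top_n names."""
    scored = sorted(catalog.items(),
                    key=lambda kv: matching_degree(query_index, kv[1]),
                    reverse=True)
    return [name for name, _ in scored[:top_n]]
```

For example, a track whose index information shares most cluster identifiers with music 1's {9, 1, 1, 1, 2, 2, 3} would score near 1.0 and be recommended first.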
Please refer to fig. 3, which is a flowchart illustrating an audio data processing method according to an embodiment of the present disclosure. As shown in fig. 3, the audio data processing method may include the steps of:
step S101, at least two frequency sequences corresponding to each audio data in the audio data set are obtained, and at least two frequency sequences corresponding to each audio data are clustered to obtain at least two sequence clusters.
Specifically, the terminal device may obtain audio data from a background database of the audio application, and add the obtained audio data to an audio data set, where the audio data set may include all audio data included in the background database of the audio application. The terminal device may perform digital processing on each audio data included in the audio data set to obtain at least two frequency sequences corresponding to each audio data, where each frequency sequence may be regarded as a vector, and the frequency sequence may be a vector of an original spectrogram of the audio data, or a vector obtained after the audio data is subjected to an audio representation method (e.g., an audio representation vector obtained after feature extraction is performed on the audio data). It can be understood that, when the audio data contained in the background database of the audio application is updated, the terminal device may update the audio data set according to the updated audio data in the background database, and if the audio data is newly added to the background database, the terminal device may add the newly added audio data in the background database to the audio data set, and perform digital processing on the newly added audio data to obtain at least two frequency sequences corresponding to the newly added audio data; or some audio data are deleted from the background database, the terminal device may delete the audio data deleted from the background database from the audio data set, or delete at least two frequency sequences corresponding to the deleted audio data. Among these, audio-like applications may include, but are not limited to: music applications, radio applications, audio book applications.
Each audio data contained in the audio data set is an analog signal that changes continuously in time, so the terminal device may perform digital processing on each audio data. The digital processing process may include: the terminal device obtains each audio data contained in the audio data set and samples each audio data according to the sampling interval time to obtain the sampling time sequence corresponding to each audio data; the sampling time sequences may be grouped according to the time period information to obtain at least two time sequences corresponding to each audio data; and frequency domain transformation is performed on the at least two time sequences to obtain at least two frequency sequences corresponding to each audio data. Specifically, the terminal device may perform time-domain sampling on each audio data according to the sampling interval time; if the sampling interval time is 0.1 second, the terminal device acquires a sampling point every 0.1 second to obtain the sampling time sequence corresponding to each audio data, with each audio data corresponding to one sampling time sequence. The terminal device then combines the sampling points in the sampling time sequences according to the time period information; if the time period information is 3 seconds, the terminal device may take the sampling points contained in a 3-second duration as one time sequence, so each time sequence includes 3/0.1 = 30 sampling points, and at least two time sequences corresponding to each audio data are obtained. Further, the terminal device may perform frequency domain transformation on each time sequence to obtain the frequency signal corresponding to each time sequence, and perform frequency domain sampling on each frequency signal according to the sampling interval frequency; for example, when the sampling interval frequency is 10 Hz, a sampling point is obtained every 10 Hz from the frequency signal to obtain the frequency sequence corresponding to each time sequence, so that at least two frequency sequences corresponding to each audio data are obtained. When the frequency range of the frequency signal is 0 to f, each frequency sequence may include f/10 sampling points. The frequency domain transformation method may include, but is not limited to: Fast Fourier Transform (FFT), Mel-Frequency Cepstral Coefficients (MFCC), Discrete Fourier Transform (DFT), and Inverse Fast Fourier Transform (IFFT).
Please refer to fig. 4, which is a schematic diagram of an acquisition frequency sequence according to an embodiment of the present application. The process of acquiring a frequency sequence will be described with reference to any audio data in the audio data set as an example. As shown in fig. 4, the terminal device obtains certain audio data from the audio data set, and may obtain a time domain diagram of the audio data (as shown by a curve 30 a), where an abscissa in the time domain diagram may represent time (in units of: seconds), an ordinate in the time domain diagram may represent amplitude of the audio data (such as "loudness" of the audio data, in units of: joules), and the curve 30a is a graph of a change of the audio data with time, and the duration of the audio data may be determined to be 50 seconds through the curve 30 a. The terminal equipment can perform time domain sampling on the audio data according to the sampling interval time, namely perform time domain sampling on a curve 30a in a time domain graph to obtain a time domain sampling graph of the audio data, wherein the meanings represented by horizontal and vertical coordinates in the time domain sampling graph are the same as the meanings represented by the horizontal and vertical coordinates in the time domain graph; when the sampling interval time is 1 second, the terminal device may obtain one sampling value every 1 second, 50 sampling values may be obtained for audio data having a duration of 50 seconds, and the 50 sampling values are represented as a discrete sampling time series { a1, a2, a3, …, a50}, so that the sampling time series { a1, a2, a3, …, a50} may be a time-domain sampling map of the audio data. 
The terminal device may combine the sample values contained in the sample time series { a1, a2, a3, …, a50} according to the time period information, and divide the sample time series { a1, a2, a3, …, a50} into at least two time series; when the time period information is 10s, the sampling values obtained in each 10s may be taken as a time series, that is, the sampling time series { a1, a2, a3, …, a50} may be divided into 5 time series, each of which may include 10 sampling values, for example, the time series 30b is the first time series of the audio data, which may be represented as [ a1, a2, a3, …, a10 ]; the time series 30c is a second time series of audio data, which may be denoted as [ a11, a12, a13, …, a20 ]; the time series 30d is the last time series of audio data and may be represented as [ a41, a42, a43, …, a50], etc.
Further, the terminal device may perform frequency domain transformation on each time sequence corresponding to the audio data to obtain a frequency domain map (which may also be referred to as an amplitude-frequency characteristic curve) corresponding to each time sequence. Taking the time series 30b as an example, after the terminal device performs frequency domain transformation on the time series 30b, a frequency domain graph (as shown by a curve 30 f) corresponding to the time series 30b can be obtained, an abscissa in the frequency domain graph can be expressed as a frequency (unit is: hertz), an ordinate in the frequency domain graph can be expressed as an amplitude (as defined by the ordinate in the time domain graph), and as can be seen from the frequency domain graph corresponding to the time series 30b, a frequency range corresponding to the curve 30f is: 0 to 50 Hz. The terminal device may perform frequency domain sampling on the curve 30f according to the sampling interval frequency to obtain a frequency sequence 30g corresponding to the time sequence 30 b; when the sampling interval frequency is 10 hz, the terminal device may obtain one sampling value every 10 hz, and after the sampling of the curve 30f of 0 to 50 hz in the frequency domain is completed, 5 sampling values may be obtained, where the 5 sampling values may form a frequency series 30g corresponding to the time series 30b, and the frequency series 30g may be represented as [ b1, b2, b3, b4, b5 ]. Similarly, the terminal device may obtain a frequency sequence corresponding to each time sequence included in the audio data.
After obtaining the at least two frequency sequences corresponding to each audio data in the audio data set, the terminal device may cluster these frequency sequences, clustering similar frequency sequences into the same sequence cluster and partitioning dissimilar frequency sequences into different sequence clusters, to obtain at least two sequence clusters. Optionally, the terminal device may also directly cluster the at least two time sequences corresponding to each audio data in the audio data set to obtain the at least two sequence clusters.
Optionally, taking K-means as an example, the following specifically describes a clustering process of at least two frequency sequences respectively corresponding to each piece of audio data.
The terminal device may determine the at least two frequency sequences corresponding to each audio data as sequences to be processed, add each sequence to be processed to a sequence set, and select center sequences tk from the sequence set, where k is a positive integer less than or equal to the number of center sequences. The similarity between each sequence to be processed contained in the sequence set and each center sequence tk is then obtained; if the similarity between a sequence to be processed Gi in the sequence set and the center sequence tk is the maximum, the sequence Gi is added to the cluster to be processed Ck to which the center sequence tk belongs, where i is a positive integer less than or equal to the number of sequences to be processed contained in the sequence set. The center sequence tk is then updated according to the sequences to be processed contained in the cluster Ck; when the updated center sequence tk of the cluster Ck is the same as the center sequence tk before updating, the cluster to be processed Ck is determined as a sequence cluster.
The terminal device may preset a cluster number (which may also be referred to as a category number) corresponding to all sequences to be processed in the sequence set. When the cluster number is p (p is a positive integer smaller than the number of sequences to be processed in the sequence set), p sequences to be processed may be randomly selected from the sequence set as the initial center sequences, that is, the center sequences tk (k is a positive integer less than or equal to p). The cluster number set in the clustering algorithm affects the final clustering effect of the frequency sequences: if the cluster number is too large, similar frequency sequences may not be classified into the same sequence cluster; if the cluster number is too small, dissimilar frequency sequences may be classified into the same sequence cluster. In this embodiment, the cluster number may be set empirically, for example, p = 500; of course, a method such as cross validation may also be used to determine the cluster number.
Optionally, in order to achieve a better clustering effect, the terminal device may select, from the sequence set, p sequences to be processed that are as far apart from each other as possible as the initial center sequences. The specific selection mode is as follows: randomly select a sequence to be processed from the sequence set as the first center sequence t1; then, from the remaining (unselected) sequences to be processed in the sequence set, select the sequence to be processed farthest from the first center sequence t1 as the second center sequence t2; next, calculate the central point between the first center sequence t1 and the second center sequence t2 (the central point may be the sequence corresponding to the average value of t1 and t2), and select, from the remaining sequences to be processed in the sequence set, the sequence farthest from this central point as the third center sequence t3; and so on, until the p-th center sequence tp is determined, at which point all p center sequences have been obtained.
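This far-apart initialization can be sketched with NumPy; Euclidean distance is assumed here (the embodiment leaves the distance measure open), and the function name and seed parameter are illustrative:

```python
import numpy as np

def far_apart_centers(seqs, p, seed=0):
    """Pick p initial center sequences: the first at random, each later one
    as the sequence farthest from the central point (mean) of the centers
    already chosen."""
    rng = np.random.default_rng(seed)
    seqs = np.asarray(seqs, dtype=float)
    centers = [seqs[rng.integers(len(seqs))]]
    while len(centers) < p:
        central_point = np.mean(centers, axis=0)       # midpoint of chosen centers
        dists = np.linalg.norm(seqs - central_point, axis=1)
        centers.append(seqs[int(np.argmax(dists))])    # farthest sequence
    return np.stack(centers)

centers = far_apart_centers([[0, 0], [0, 1], [10, 10]], 2)
```

With p = 2 this reproduces the rule in the text: the second center is the sequence farthest from the first.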
The terminal device may calculate the distance between each sequence to be processed in the sequence set and each center sequence (the distance between two sequences may be measured by their similarity: the greater the similarity, the shorter the distance; the smaller the similarity, the greater the distance), that is, the distance between each sequence to be processed Gi in the sequence set and each center sequence tk, and divide the sequence Gi into the cluster to be processed Ck to which the nearest center sequence tk belongs. In this way, p clusters to be processed can be obtained preliminarily (each center sequence corresponds to one cluster to be processed), and each cluster to be processed may contain a plurality of sequences to be processed (the number of sequences in each cluster may differ). The method for calculating the distance between two sequences to be processed may include, but is not limited to: Euclidean Distance, Manhattan Distance, Minkowski Distance, and Cosine Similarity. Taking cosine similarity as an example, for any two frequency sequences Gi and Gj in the sequence set, the distance between Gi and Gj can be expressed as: d<Gi, Gj> = cosine(Gi, Gj), where d<Gi, Gj> represents the distance between the frequency sequence Gi and the frequency sequence Gj, and cosine is the cosine function. If the distances between a sequence to be processed Gi in the sequence set and the 8 center sequences (assuming the cluster number p is 8) are 35, 17, 25, 30, 41, 5, 10, and 28, the sequence Gi can be divided into the cluster to be processed to which the center sequence corresponding to the distance 5 belongs.
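The nearest-center assignment in the numerical example above is simply an argmin over the eight distances. A minimal sketch in plain Python, with cosine similarity written out as in the formula d<Gi, Gj> = cosine(Gi, Gj) (function names are illustrative):

```python
import math

def cosine_similarity(g_i, g_j):
    """cosine(Gi, Gj); a larger value means the two sequences are more
    similar, i.e. the distance between them is shorter."""
    dot = sum(a * b for a, b in zip(g_i, g_j))
    norm = math.sqrt(sum(a * a for a in g_i)) * math.sqrt(sum(b * b for b in g_j))
    return dot / norm

def nearest_center(distances):
    """Index of the center sequence at the shortest distance to Gi."""
    return min(range(len(distances)), key=distances.__getitem__)

# the numerical example: distances from Gi to the 8 center sequences
chosen = nearest_center([35, 17, 25, 30, 41, 5, 10, 28])  # index of distance 5
```

Here `chosen` picks the center with distance 5, matching the example.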
For the p clusters to be processed obtained above, the mean value of all the sequences to be processed contained in each cluster may be calculated, and the center sequence corresponding to that cluster is updated according to the mean value to obtain a new center sequence; the sequences to be processed contained in each cluster are then updated according to the distance between each sequence in the sequence set and each new center sequence. This process is repeated; when the sequences to be processed contained in each cluster no longer change, that is, when the center sequence corresponding to each cluster is fixed, the p clusters to be processed at that moment may be determined as the final clustering result of the sequence set.
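The update-and-reassign loop can be sketched with NumPy. Euclidean distance is used here for brevity (any of the measures listed above could be substituted), and the function name is illustrative:

```python
import numpy as np

def kmeans_iterate(seqs, centers, max_iter=100):
    """Repeatedly assign each sequence to its nearest center and move each
    center to the mean of its cluster, until assignments stop changing."""
    seqs = np.asarray(seqs, dtype=float)
    centers = np.asarray(centers, dtype=float).copy()
    labels = None
    for _ in range(max_iter):
        # distance of every sequence to every center
        dists = np.linalg.norm(seqs[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                      # center sequences are now fixed
        labels = new_labels
        for k in range(len(centers)):
            members = seqs[labels == k]
            if len(members):
                centers[k] = members.mean(axis=0)   # update center sequence
    return labels, centers

labels, centers = kmeans_iterate(
    [[0, 0], [0, 1], [10, 10], [10, 11]], [[0, 0], [10, 10]])
```

On this toy input the loop converges after one update, with the two centers settling at the means of the two obvious groups.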
Please refer to fig. 5, which is a schematic diagram of clustering frequency sequences provided in the embodiment of the present application. As shown in fig. 5, when the positions of the sequences to be processed (i.e., the frequency sequences included in each audio data) contained in the sequence set 40a are as shown in fig. 5, and the cluster number is 3 (circles in the figure represent sequences to be processed), the terminal device may select 3 sequences to be processed from the sequence set 40a as initial center sequences, for example, the sequence to be processed t1 as the first center sequence, the sequence to be processed t2 as the second center sequence, and the sequence to be processed t3 as the third center sequence. The terminal device may calculate the distance between each sequence to be processed in the sequence set 40a and the sequences t1, t2, and t3, divide each sequence to be processed into the cluster to be processed to which the center sequence at the shortest distance belongs, and complete the first iteration of the k-means clustering algorithm; the cluster to be processed C1, the cluster to be processed C2, and the cluster to be processed C3 are the results obtained after the first iteration on the sequence set 40a. The terminal device may then update the center sequence of each cluster according to the sequences to be processed contained in the clusters C1, C2, and C3 respectively; for example, the center sequence of the cluster C1 is updated from the sequence t1 to the sequence t4, the center sequence of the cluster C2 is updated from the sequence t2 to the sequence t5, and the center sequence of the cluster C3 is updated from the sequence t3 to the sequence t6. The distances between each sequence to be processed in the sequence set 40a and the sequences t4, t5, and t6 are calculated again, and based on these distances the cluster C1 is updated to the cluster C4, the cluster C2 is updated to the cluster C5, and the cluster C3 is updated to the cluster C6, completing the second iteration of the clustering algorithm. The above process is repeated until the sequences to be processed contained in each cluster no longer change, or a preset maximum number of iterations is reached, and the finally obtained clusters to be processed are determined as the 3 sequence clusters corresponding to the sequence set 40a, such as sequence cluster 1, sequence cluster 2, and sequence cluster 3 in fig. 5.
Optionally, other clustering algorithms may also be used to cluster the sequences to be processed contained in the sequence set. The hierarchical clustering algorithm is taken as an example below to specifically describe the clustering process.
The terminal equipment can acquire M to-be-processed clusters corresponding to the sequence set and acquire the similarity between any two to-be-processed clusters in the M to-be-processed clusters; m is the number of sequences to be processed contained in the sequence set, and each cluster to be processed comprises one sequence to be processed; combining two clusters to be processed corresponding to the maximum similarity in the M clusters to be processed to obtain updated M-1 clusters to be processed, and obtaining the similarity between any two updated clusters to be processed in the updated M-1 clusters to be processed; and combining the two updated clusters to be processed corresponding to the maximum similarity in the updated M-1 clusters to be processed to obtain updated M-2 clusters to be processed until the number of the updated clusters to be processed is equal to the threshold of the number of clusters, and determining the updated clusters to be processed as sequence clusters.
The terminal device may preset the cluster number (which may also be referred to as a category number) corresponding to all sequences to be processed in the sequence set, for example, cluster number p (p is a positive integer less than or equal to M). When the number of sequences to be processed contained in the sequence set is M, the terminal device may regard each frequency sequence in the sequence set as one cluster to be processed, obtaining M initial clusters to be processed. It may then calculate the distance between any two clusters to be processed (at this point, the distance between two clusters is the distance between the two corresponding frequency sequences, and the distance may still be measured by similarity), merge the two clusters at the shortest distance, and obtain M-1 updated clusters to be processed. The distance between any two of the updated M-1 clusters is then calculated, and the two at the shortest distance are merged to obtain M-2 updated clusters. This process is repeated until the number of clusters to be processed equals p, and the final p clusters to be processed are determined as the p sequence clusters corresponding to the sequence set.
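This bottom-up merging can be sketched in plain Python. Average linkage is used, matching the averaged pairwise distances described later in this section; Euclidean distance and the function names are assumptions for illustration:

```python
import math

def hierarchical_cluster(seqs, p):
    """Start with one cluster per sequence to be processed, repeatedly merge
    the two clusters at the shortest (average-linkage) distance, and stop
    once p clusters remain."""
    clusters = [[tuple(s)] for s in seqs]   # M initial clusters to be processed

    def linkage(a, b):
        # mean pairwise distance between the members of two clusters
        return sum(math.dist(x, y) for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > p:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))  # merge the two closest clusters
    return clusters

result = hierarchical_cluster([(0, 0), (0, 1), (10, 10), (10, 11)], 2)
```

On this toy input the two nearby pairs are merged first, leaving the expected 2 sequence clusters; an O(M^3) brute-force search like this is fine for a sketch, though production code would cache the distance matrix.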
Please refer to fig. 6, which is a schematic diagram of another frequency sequence clustering scheme provided in the embodiment of the present application. As shown in fig. 6, when the positions of the sequences to be processed (i.e., the frequency sequences included in the audio data) included in the sequence set 50a are as shown in fig. 6 (circles in the figure are used to represent the sequences to be processed), the terminal device may use each sequence to be processed included in the sequence set 50a as an initial cluster to be processed, that is, may obtain 8 clusters to be processed; the distance between any two clusters to be processed can be calculated, the two clusters to be processed with the closest distance are merged to obtain the cluster to be processed 50b, the number of the clusters to be processed is 7 at this time, the cluster to be processed 50b in the 7 clusters to be processed includes two sequences to be processed, and the rest of the clusters to be processed only include one sequence to be processed. And then, the distance between any two to-be-processed clusters in the 7 to-be-processed clusters can be calculated, the two to-be-processed clusters with the minimum distance in the 7 to-be-processed clusters are combined into one to-be-processed cluster to obtain 6 to-be-processed clusters, and after 4 iteration processes, the sequence set 50a can be clustered into 4 to-be-processed clusters, namely, a to-be-processed cluster 50b, a to-be-processed cluster 50c, a to-be-processed cluster 50d and a to-be-processed cluster 50 e. Of course, the terminal device may also continue to iterate the above process, and after 6 iterations, the cluster to be processed 50f and the cluster to be processed 50g may be obtained. 
When the set number of clusters is 4, the to-be-processed clusters 50b, 50c, 50d, and 50e may be used as the 4 sequence clusters finally obtained from the sequence set 50a; when the set number of clusters is 2, the to-be-processed clusters 50f and 50g may be used as the 2 sequence clusters finally obtained from the sequence set 50a.
In the process of calculating the distance between two to-be-processed clusters, three cases are involved: the distance between a single to-be-processed sequence and a single to-be-processed sequence, the distance between a single to-be-processed sequence and a combination of to-be-processed sequences, and the distance between one combination of to-be-processed sequences and another. If both to-be-processed clusters contain only one to-be-processed sequence each, the distance between the two clusters is the distance between the two sequences. If one to-be-processed cluster includes two to-be-processed sequences, G1 and G2, and the other to-be-processed cluster includes only one to-be-processed sequence, G3, the distance between the two clusters can be expressed as: (d<G1, G3> + d<G2, G3>)/2, where d<G1, G3> represents the distance between the to-be-processed sequence G1 and the to-be-processed sequence G3, and d<G2, G3> represents the distance between the to-be-processed sequence G2 and the to-be-processed sequence G3. If each of the two to-be-processed clusters includes two to-be-processed sequences, one including G1 and G2 and the other including G4 and G5, the distance between the two clusters can be expressed as: (d<G1, G4> + d<G1, G5> + d<G2, G4> + d<G2, G5>)/4, where d<G1, G4> represents the distance between the to-be-processed sequence G1 and the to-be-processed sequence G4, d<G1, G5> the distance between G1 and G5, d<G2, G4> the distance between G2 and G4, and d<G2, G5> the distance between G2 and G5.
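The three cases above are all instances of one rule: the distance between two to-be-processed clusters is the mean of all pairwise distances between their member sequences. A minimal sketch, assuming an illustrative Euclidean per-sequence distance (function names are hypothetical):

```python
def pairwise_dist(a, b):
    # Illustrative distance between two frequency sequences (Euclidean);
    # the embodiment may measure it by degree of similarity instead.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cluster_distance(cluster_a, cluster_b):
    # Mean of all pairwise distances: reduces to (d<G1,G3> + d<G2,G3>)/2
    # for clusters {G1, G2} and {G3}, and to the four-term average
    # (d<G1,G4> + d<G1,G5> + d<G2,G4> + d<G2,G5>)/4 for {G1, G2} and {G4, G5}.
    total = sum(pairwise_dist(a, b) for a in cluster_a for b in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))
```

With G1 = [0.0], G2 = [2.0], and G3 = [3.0], the distance between {G1, G2} and {G3} is (3 + 1)/2 = 2.0, matching the two-term formula above.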
Step S102: determine, according to the cluster identifiers of the sequence clusters to which the at least two frequency sequences in each piece of audio data respectively belong, the index information corresponding to each piece of audio data.
Specifically, the terminal device may set a cluster identifier for each of the at least two sequence clusters. A cluster identifier may be the cluster number of the corresponding sequence cluster; for example, if the number of sequence clusters is 500, the cluster identifier of a sequence cluster may be a number between 1 and 500. Each frequency sequence may then be represented by the cluster identifier of the sequence cluster to which it belongs; for example, if the cluster identifier of the sequence cluster to which a certain frequency sequence belongs is 4, then 4 may be used as the index of that frequency sequence, i.e., 4 represents the frequency sequence. The terminal device may therefore determine, according to the cluster identifiers of the sequence clusters to which the at least two frequency sequences in each piece of audio data respectively belong, the index information corresponding to that audio data, i.e., each piece of audio data in the audio data set may be represented in the form of its index information. For example, if the audio data 1 includes the frequency sequences G1, G2, G3, and G4, and the cluster identifiers of the sequence clusters to which G1, G2, G3, and G4 respectively belong are 2, 1, 1, and 5, the index information of the audio data 1 can be expressed as: {2, 1, 1, 5}.
Alternatively, since abrupt-change information may exist in the audio data (e.g., a change of melody in music) and cluster identifiers alone may fail to characterize it, abrupt-change-point information may be added to the index information in the form of combined identifiers built with n-grams. The terminal device may first determine initial index information for each piece of audio data according to the cluster identifiers of the sequence clusters to which its at least two frequency sequences respectively belong; the initial index information contains only cluster identifiers. When adjacent cluster identifiers in the initial index information differ, indicating that abrupt-change information exists between the frequency sequences corresponding to those adjacent cluster identifiers, the combined identifier corresponding to the adjacent cluster identifiers is acquired and added to the initial index information, yielding the index information corresponding to each piece of audio data. When adjacent cluster identifiers in the initial index information are the same, no abrupt-change information exists between the corresponding frequency sequences, and the identical cluster identifiers need not be combined.
Different values of n yield different combined identifiers for the same initial index information; n indicates the number of cluster identifiers that are combined. Continuing the foregoing example, the initial index information of the audio data 1 is {2, 1, 1, 5}. If a 2-gram is used, the combined identifiers in the initial index information {2, 1, 1, 5} include 2#1 and 1#5, and the index information corresponding to the audio data 1 is: {2, 1, 1, 5, 2#1, 1#5}; if a 3-gram is used, the combined identifiers include 2#1#1 and 1#1#5, and the index information corresponding to the audio data 1 is: {2, 1, 1, 5, 2#1#1, 1#1#5}.
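Building on this example, the construction of index information with n-gram combined identifiers can be sketched as follows. The helper name is hypothetical, and the rule of combining only windows that span a change point is inferred from the 2-gram and 3-gram examples in this description:

```python
def build_index(cluster_ids, n=2):
    # Initial index information: the sequence of cluster identifiers.
    index = list(cluster_ids)
    for i in range(len(cluster_ids) - n + 1):
        window = cluster_ids[i:i + n]
        # Add a combined identifier only when the window contains differing
        # cluster identifiers, i.e. when it spans an abrupt-change point.
        if len(set(window)) > 1:
            index.append("#".join(str(c) for c in window))
    return index
```

For the initial index information {2, 1, 1, 5}, this reproduces the 2-gram result {2, 1, 1, 5, 2#1, 1#5} and the 3-gram result {2, 1, 1, 5, 2#1#1, 1#1#5} given above.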
Step S103: acquire, according to the index information, the recommended audio data corresponding to the audio data to be retrieved in the audio data set.
Specifically, the terminal device may detect a retrieval trigger operation in the audio application (e.g., a user clicking to play audio data, adding audio data to a favorites bar, clicking to retrieve similar audio data, or other user behaviors) and acquire the audio data to be retrieved targeted by the retrieval trigger operation; the audio data to be retrieved belongs to the audio data set. The terminal device may determine the index information corresponding to the audio data to be retrieved as target index information, determine the remaining audio data in the audio data set other than the audio data to be retrieved as candidate audio data, and determine the index information corresponding to the candidate audio data as candidate index information. By calculating the matching degree between the target index information and each piece of candidate index information, the recommended audio data corresponding to the audio data to be retrieved can be determined from the audio data set. The retrieval trigger operation may include, but is not limited to: a click-to-play operation, an add-to-favorites operation, a tag-adding operation, and a retrieval operation. The audio data to be retrieved may be the audio data being played by the user in the audio application, the audio data to which the user has added a tag (e.g., audio data marked as "I like", or audio data added to the favorites bar), the audio data being retrieved by the user, or the audio data played the most times by the user in the audio application, among others.
Further, the terminal device may calculate, from the target index information corresponding to the audio data to be retrieved and the candidate index information corresponding to the remaining audio data (i.e., the candidate audio data) in the audio data set, the matching degree between the target index information and each piece of candidate index information; sort the candidate audio data contained in the audio data set in descending order of matching degree; acquire L pieces of recommended audio data from the sorted candidate audio data in that order; and display the L pieces of recommended audio data in the audio application so that the user can view them on the screen of the terminal device. L is a positive integer greater than or equal to 1 and smaller than the number of candidate audio data.
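The sorting and selection step can be sketched as follows, assuming the matching degrees between the target index information and each candidate have already been computed (the function and parameter names are illustrative, not from the embodiment):

```python
def top_l_recommendations(candidates, matching_degrees, l):
    # candidates: identifiers of the candidate audio data;
    # matching_degrees: one matching degree per candidate, in the same order.
    # Sort in descending order of matching degree and keep the top l pieces.
    ranked = sorted(zip(candidates, matching_degrees),
                    key=lambda pair: pair[1], reverse=True)
    return [candidate for candidate, _ in ranked[:l]]
```

For example, with candidates ["a", "b", "c", "d"] and matching degrees [0.2, 1.5, 0.9, 0.1], the top 2 recommendations are ["b", "c"].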
Optionally, when the recommended audio data is determined by a server (e.g., a background server corresponding to the audio application), after the server determines the recommended audio data, the server needs to send the recommended audio data to the terminal device corresponding to the audio data to be retrieved, so that the terminal device can display the recommended audio data in the presentation page of the audio application.
Optionally, in a scene where the audio data is music, the terminal device may monitor the user's behavior operations in real time. When the terminal device detects that the user triggers a retrieval operation for a piece of music to be retrieved, it may acquire that music, determine recommended music similar to it according to the matching degrees between the target index information corresponding to the music to be retrieved and the index information of the remaining music, add the recommended music to the music playlist of the music to be retrieved, and display the recommended music in the playlist. For the recommended music displayed in the playlist, the user may click to play it and view its details, such as the album it belongs to, its lyrics, singers, duration, and the like.
When the index information corresponding to each piece of audio data includes both cluster identifiers and combined identifiers, the matching-degree calculation between the target index information and a single piece of candidate index information is described as follows. The terminal device may acquire the target index information corresponding to the audio data to be retrieved in the audio data set, and determine the cluster identifiers and combined identifiers contained in the target index information as first to-be-processed identifiers; it may select, from the audio data set, the candidate index information xn corresponding to candidate audio data yn, and determine the cluster identifiers and combined identifiers contained in the candidate index information xn as second to-be-processed identifiers, where n is a positive integer less than or equal to the number of candidate audio data. The first to-be-processed identifiers contained in the target index information are classified (identical first to-be-processed identifiers are divided into the same category) to obtain at least two target identifier categories, and the first number of first to-be-processed identifiers contained in each target identifier category is counted; likewise, the second to-be-processed identifiers contained in the candidate index information xn are classified to obtain at least two candidate identifier categories, and the second number of second to-be-processed identifiers contained in each candidate identifier category is counted. The matching identifier categories shared by the at least two target identifier categories and the at least two candidate identifier categories are then obtained; a matching identifier category is a cluster identifier or combined identifier contained in both the target index information and the candidate index information xn. For each matching identifier category, the first number of that category in the target index information and the second number of that category in the candidate index information xn are summed and the logarithm of the sum is taken; summing these logarithms over all matching identifier categories yields the matching degree between the target index information and the candidate index information xn. Similarly, the matching degree between the target index information and every other piece of candidate index information can be obtained. The calculation of the matching degree between target index information x0 and candidate index information x1 can be expressed as: S<x0, x1> = Σ log(cnt(cd of x0) + cnt(cd of x1)), where the summation runs over each matching identifier category cd (a cluster identifier or combined identifier contained in both x0 and x1), cnt denotes a counter, and S<x0, x1> denotes the matching degree between x0 and x1.
For example, suppose the target index information x0 is {9, 1, 1, 1, 2, 2, 3, 9#1, 1#2, 2#3} and the candidate index information x1 is {9, 1, 2, 1, 2, 2, 5, 9#1, 1#2, 2#1, 1#2, 2#5}. The target index information x0 can also be denoted as {9:1, 1:3, 2:2, 3:1, 9#1:1, 1#2:1, 2#3:1}, indicating that x0 comprises: one cluster identifier 9, three cluster identifiers 1, two cluster identifiers 2, one cluster identifier 3, one combined identifier 9#1, one combined identifier 1#2, and one combined identifier 2#3. The candidate index information x1 can also be denoted as {9:1, 1:2, 2:3, 5:1, 9#1:1, 1#2:2, 2#1:1, 2#5:1}, indicating that x1 comprises: one cluster identifier 9, two cluster identifiers 1, three cluster identifiers 2, one cluster identifier 5, one combined identifier 9#1, two combined identifiers 1#2, one combined identifier 2#1, and one combined identifier 2#5. Therefore, the matching degree between the target index information x0 and the candidate index information x1 can be expressed as: log(1+1) + log(3+2) + log(2+3) + log(1+1) + log(1+2).
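The worked example above can be reproduced with a short sketch of the matching-degree formula. The natural logarithm is an assumption here; the description does not fix the base.

```python
from collections import Counter
from math import log

def matching_degree(target_index, candidate_index):
    # Count each identifier (cluster or combined) in both index lists,
    # keep only the identifier categories present in both, and sum
    # log(first number + second number) over those matching categories.
    target_counts = Counter(target_index)
    candidate_counts = Counter(candidate_index)
    matching = target_counts.keys() & candidate_counts.keys()
    return sum(log(target_counts[c] + candidate_counts[c]) for c in matching)
```

Applied to x0 = {9, 1, 1, 1, 2, 2, 3, 9#1, 1#2, 2#3} and x1 = {9, 1, 2, 1, 2, 2, 5, 9#1, 1#2, 2#1, 1#2, 2#5}, this yields log(1+1) + log(3+2) + log(2+3) + log(1+1) + log(1+2), as in the example above.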
Please refer to fig. 7, which is a schematic view of an audio data retrieval scene according to an embodiment of the present application. As shown in fig. 7, taking a music application as an example, the terminal device 60a may open the music application, and the user may open a music list 60b in the application to select music to play; the music list 60b may include multiple pieces of music, such as music 1, music 2, music 3, and so on, and the user may select any piece from the list to play. If the user is interested in music 1 and wants to play more music similar to it, the user may long-press music 1 in the music list 60b (a long press means that the terminal device detects the user pressing the terminal screen for longer than a set threshold, such as 2 seconds). After detecting the user's long-press operation on music 1, the terminal device may pop up the information prompt box 60c over the music list 60b, and the user may click the information prompt box 60c to obtain the retrieval result corresponding to music 1. It is understood that, in addition to the long-press operation described above, the trigger operation for popping up the information prompt box 60c in the music list 60b may include: a single click, a double click, voice, and the like.
After detecting the user's click on the information prompt box 60c, the terminal device 60a may obtain the index information set 60d corresponding to all the music in the music application, where the index information set 60d may include the index information corresponding to each piece of music; for example, the index information corresponding to music 1 is {2, 1, …, 2#1, …}, the index information corresponding to music 2 is {5, 3, …, 5#3, …}, and so on, and each piece of index information may include cluster identifiers and combined identifiers. By calculating the matching degrees between the index information of music 1 and the index information of each remaining piece of music, the remaining music in the application can be sorted to obtain a ranked music list 60e, in which the music is arranged in descending order of matching degree with music 1. The terminal device 60a may select the top 4 pieces of music (music 20, music 11, music 50, and music 7) from the ranked music list 60e as the recommended music for music 1, and present these 4 pieces as the retrieval result for music 1 in the music result retrieval page 60f of the terminal device 60a. The user can click a piece of music on the music result retrieval page 60f to play it, e.g., music 20. It can be understood that the computation of the index information set 60d, the ranked music list 60e, and the matching degrees is background processing of the terminal device 60a and is not displayed on the terminal screen.
In the embodiments of the present application, at least two frequency sequences corresponding to each piece of audio data in the audio data set can be obtained, the frequency sequences corresponding to all the audio data in the set can be clustered into multiple sequence clusters, and a cluster identifier can be set for each sequence cluster. The cluster identifiers of the sequence clusters to which the frequency sequences of the audio data to be retrieved belong are used as its index information, and combined identifiers are added to the index information; the recommended audio data corresponding to the audio data to be retrieved is then determined from its index information and the index information of the remaining audio data. This enhances the similarity between the audio data to be retrieved and the recommended audio data, thereby further improving the retrieval accuracy of the audio data. Moreover, representing the audio data by index information compresses the audio data, and retrieving audio data via the index information improves retrieval efficiency.
Fig. 8 is a schematic structural diagram of an audio data processing apparatus according to an embodiment of the present application. The audio data processing means may be a computer program (comprising program code) running on a computer device, for example the audio data processing means being an application software; the apparatus may be used to perform the corresponding steps in the methods provided by the embodiments of the present application. As shown in fig. 8, the audio data processing apparatus 1 may include: the system comprises a clustering module 10, a determining module 11 and a recommending module 12;
the clustering module 10 is configured to obtain at least two frequency sequences corresponding to each piece of audio data in an audio data set, and cluster the at least two frequency sequences corresponding to each piece of audio data to obtain at least two sequence clusters;
a determining module 11, configured to determine, according to cluster identifiers of sequence clusters to which at least two frequency sequences in each audio data respectively belong, index information corresponding to each audio data respectively;
and the recommending module 12 is configured to obtain recommended audio data corresponding to the audio data to be retrieved in the audio data set according to the index information.
The specific functional implementation manners of the clustering module 10, the determining module 11, and the recommending module 12 may refer to steps S101 to S103 in the embodiment corresponding to fig. 3, which is not described herein again.
Referring also to fig. 8, clustering module 10 may include: a sampling unit 100, a grouping unit 101, a frequency domain transforming unit 102, a sequence selecting unit 103, a sequence dividing unit 104, a cluster updating unit 105, a sequence set acquiring unit 106, a similarity acquiring unit 107, a first merging unit 108, and a second merging unit 109;
the sampling unit 100 is configured to acquire each audio data included in the audio data set, and respectively sample each audio data according to a sampling interval time to obtain a sampling time sequence corresponding to each audio data;
a grouping unit 101, configured to group the sampling time sequences according to the time period information to obtain at least two time sequences corresponding to each piece of audio data;
a frequency domain transforming unit 102, configured to perform frequency domain transformation on the at least two time sequences respectively to obtain at least two frequency sequences corresponding to each piece of audio data;
a sequence selection unit 103, configured to determine the at least two frequency sequences corresponding to each piece of audio data as to-be-processed sequences, add each to-be-processed sequence to a sequence set, and select a center sequence tk from the sequence set; k is a positive integer less than or equal to the number of center sequences;
a sequence dividing unit 104, configured to obtain the similarity between each to-be-processed sequence contained in the sequence set and the center sequence tk, and, when the similarity between a to-be-processed sequence Gi in the sequence set and the center sequence tk is the maximum, add the to-be-processed sequence Gi to the to-be-processed cluster Ck to which the center sequence tk belongs; i is a positive integer less than or equal to the number of to-be-processed sequences contained in the sequence set;
a cluster updating unit 105, configured to update the center sequence tk according to the to-be-processed sequences contained in the to-be-processed cluster Ck, until the updated center sequence tk in the to-be-processed cluster Ck is the same as the center sequence tk before updating, and then determine the to-be-processed cluster Ck as a sequence cluster;
a sequence set obtaining unit 106, configured to determine at least two frequency sequences corresponding to each piece of audio data as to-be-processed sequences, and add each to-be-processed sequence to a sequence set;
a similarity obtaining unit 107, configured to obtain M to-be-processed clusters corresponding to the sequence set, and obtain a similarity between any two to-be-processed clusters in the M to-be-processed clusters; m is the number of sequences to be processed contained in the sequence set, and each cluster to be processed comprises one sequence to be processed;
a first merging unit 108, configured to merge two to-be-processed clusters corresponding to the maximum similarity among the M to-be-processed clusters to obtain updated M-1 to-be-processed clusters, and obtain a similarity between any two updated to-be-processed clusters in the updated M-1 to-be-processed clusters;
a second merging unit 109, configured to merge two updated to-be-processed clusters corresponding to the maximum similarity in the updated M-1 to-be-processed clusters to obtain updated M-2 to-be-processed clusters, and determine the updated to-be-processed clusters as sequence clusters until the number of the updated to-be-processed clusters is equal to the threshold of the number of clusters.
For specific functional implementations of the sampling unit 100, the grouping unit 101, the frequency domain transforming unit 102, the sequence selection unit 103, the sequence dividing unit 104, the cluster updating unit 105, the sequence set obtaining unit 106, the similarity obtaining unit 107, the first merging unit 108, and the second merging unit 109, reference may be made to step S101 in the embodiment corresponding to fig. 3, which is not described herein again. When the sequence selection unit 103, the sequence dividing unit 104, and the cluster updating unit 105 perform their corresponding operations, the sequence set obtaining unit 106, the similarity obtaining unit 107, the first merging unit 108, and the second merging unit 109 suspend theirs; conversely, when the latter units perform their corresponding operations, the former units suspend theirs.
Referring to fig. 8, the determining module 11 may include: an initial index acquisition unit 111, a combined identifier acquisition unit 112, a combined identifier addition unit 113;
an initial index obtaining unit 111, configured to set cluster identifiers for at least two sequence clusters, and determine initial index information corresponding to each audio data according to the cluster identifiers of the sequence clusters to which at least two frequency sequences in each audio data belong;
a combined identifier obtaining unit 112, configured to obtain a combined identifier corresponding to an adjacent cluster identifier when adjacent cluster identifiers in the initial index information are different;
and a combined identifier adding unit 113, configured to add a combined identifier to the initial index information, so as to obtain index information corresponding to each piece of audio data.
For specific functional implementation manners of the initial index obtaining unit 111, the combined identifier obtaining unit 112, and the combined identifier adding unit 113, reference may be made to step S102 in the embodiment corresponding to fig. 3, which is not described herein again.
Referring also to fig. 8, the recommendation module 12 may include: a detection unit 120, a sorting unit 121, a recommended audio acquisition unit 122;
the detecting unit 120 is configured to, when a retrieval trigger operation for audio data to be retrieved in an audio data set is detected, determine index information corresponding to the audio data to be retrieved as target index information, and determine index information corresponding to other audio data except the audio data to be retrieved in the audio data set as candidate index information;
the sorting unit 121 is configured to obtain a matching degree between the target index information and the candidate index information, and sort the candidate audio data included in the audio data set according to the matching degree; the candidate audio data are audio data corresponding to the candidate index information;
and the recommended audio acquiring unit 122 is configured to acquire recommended audio data from the sorted candidate audio data according to the sorting order, and send the recommended audio data to the terminal device corresponding to the audio data to be retrieved.
The specific functional implementation manners of the detecting unit 120, the sorting unit 121, and the recommended audio acquiring unit 122 may refer to step S103 in the embodiment corresponding to fig. 3, which is not described herein again.
Referring to fig. 8, when the target index information and the candidate index information both include the cluster identifier and the combination identifier, the sorting unit 121 may include: a first determining subunit 1211, a second determining subunit 1212, a first statistics subunit 1213, a second statistics subunit 1214, a matching degree determining subunit 1215;
a first determining subunit 1211, configured to determine a cluster identifier and a combined identifier included in the target index information as a first identifier to be processed;
a second determining subunit 1212, configured to select, from the audio data set, the candidate index information xn corresponding to candidate audio data yn, and determine the cluster identifiers and combined identifiers contained in the candidate index information xn as second to-be-processed identifiers; n is a positive integer less than or equal to the number of candidate audio data;
a first statistics subunit 1213, configured to classify the first to-be-processed identifier included in the target index information to obtain at least two target identifier categories, and count a first number of the first to-be-processed identifiers included in each target identifier category respectively;
a second statistics subunit 1214, configured to classify the second to-be-processed identifiers contained in the candidate index information xn to obtain at least two candidate identifier categories, and respectively count the second number of second to-be-processed identifiers contained in each candidate identifier category;
a matching degree determination subunit 1215, configured to determine, according to the first number and the second number, the matching degree between the target index information and the candidate index information xn.
The specific functional implementation manners of the first determining subunit 1211, the second determining subunit 1212, the first statistics subunit 1213, the second statistics subunit 1214, and the matching degree determining subunit 1215 may refer to step S103 in the embodiment corresponding to fig. 3, which is not described herein again.
Referring to fig. 8, the matching degree determining subunit 1215 may include: a matching identity acquisition sub-unit 12151, a summation sub-unit 12152;
a matching identifier obtaining subunit 12151, configured to obtain the matching identifier categories shared by the at least two target identifier categories and the at least two candidate identifier categories; both the at least two target identifier categories and the at least two candidate identifier categories comprise the matching identifier categories;
a summation subunit 12152, configured to perform logarithmic summation on the first number of each matching identifier category in the target index information and the second number of that matching identifier category in the candidate index information xn, to obtain the matching degree between the target index information and the candidate index information xn.
The specific functional implementation manners of the matching identifier obtaining subunit 12151 and the summing subunit 12152 may refer to step S103 in the embodiment corresponding to fig. 3, which is not described herein again.
In the embodiments of the present application, at least two frequency sequences corresponding to each audio data in an audio data set can be obtained, and the frequency sequences corresponding to all audio data in the audio data set are clustered to obtain a plurality of sequence clusters. A cluster identifier is set for each sequence cluster; the cluster identifiers of the sequence clusters to which the frequency sequences of the audio data to be retrieved respectively belong are used as the index information of the audio data to be retrieved, and a combined identifier is added to the index information. Recommended audio data corresponding to the audio data to be retrieved is then determined according to the index information of the audio data to be retrieved and the index information of the remaining audio data, which enhances the similarity between the audio data to be retrieved and the recommended audio data and further improves retrieval accuracy. In addition, representing the audio data by the index information compresses the audio data, and retrieving the audio data through the index information improves retrieval efficiency.
Fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 9, the computer device 1000 may include: a processor 1001, a network interface 1004, and a memory 1005; the computer device 1000 may further include: a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and a keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a standard wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., at least one disk memory). The memory 1005 may optionally be at least one storage device located remotely from the processor 1001. As shown in fig. 9, the memory 1005, which is a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the computer device 1000 shown in fig. 9, the network interface 1004 may provide a network communication function; the user interface 1003 is an interface for providing a user with input; and the processor 1001 may be used to invoke a device control application stored in the memory 1005 to implement:
acquiring at least two frequency sequences corresponding to each audio data in an audio data set, and clustering the at least two frequency sequences corresponding to each audio data to obtain at least two sequence clusters;
determining index information corresponding to each audio data according to cluster identifications of sequence clusters to which at least two frequency sequences in each audio data belong respectively;
and acquiring recommended audio data corresponding to the audio data to be retrieved in the audio data set according to the index information.
It should be understood that the computer device 1000 described in this embodiment of the present application may perform the description of the audio data processing method in the embodiment corresponding to fig. 3, and may also perform the description of the audio data processing apparatus 1 in the embodiment corresponding to fig. 8, which is not described herein again. In addition, the beneficial effects of the same method are not described in detail.
Further, it should be noted that an embodiment of the present application also provides a computer-readable storage medium, in which the computer program executed by the audio data processing apparatus 1 mentioned above is stored. The computer program includes program instructions, and when the processor executes the program instructions, the description of the audio data processing method in the embodiment corresponding to fig. 3 can be performed, which is therefore not repeated here. The beneficial effects of the same method are likewise not described in detail. For technical details not disclosed in the embodiments of the computer-readable storage medium of the present application, reference is made to the description of the method embodiments of the present application. As an example, the program instructions may be deployed to be executed on one computing device, or on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network, which may constitute a blockchain system.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and can include the processes of the embodiments of the methods described above when the computer program is executed. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above disclosure is only a preferred embodiment of the present application and is not intended to limit the scope of the present application; equivalent variations and modifications made in accordance with the present application shall therefore remain within its scope.

Claims (11)

1. A method of audio data processing, comprising:
acquiring at least two frequency sequences corresponding to each audio data in an audio data set, and clustering the at least two frequency sequences corresponding to each audio data to obtain at least two sequence clusters;
determining index information corresponding to each audio data according to cluster identifiers of the sequence clusters to which the at least two frequency sequences respectively belong in each audio data;
and acquiring recommended audio data corresponding to the audio data to be retrieved in the audio data set according to the index information.
2. The method of claim 1, wherein obtaining at least two frequency sequences corresponding to each audio data in the audio data set comprises:
acquiring each audio data contained in the audio data set, and sampling each audio data according to sampling interval time to obtain a sampling time sequence corresponding to each audio data;
grouping the sampling time sequences according to time period information to obtain at least two time sequences corresponding to each audio data;
and respectively carrying out frequency domain transformation on the at least two time sequences to obtain the at least two frequency sequences corresponding to each audio data.
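The frequency-sequence extraction recited in claim 2 can be sketched as follows. The fixed group length, the naive DFT magnitude used as the "frequency domain transform", and all function and variable names are illustrative assumptions, not details fixed by the patent.

```python
# Sketch of claim 2: group a sampled time sequence into fixed-length time
# sequences, then transform each group into a frequency (magnitude) sequence.
import math

def to_frequency_sequences(samples, period_len):
    # Group the sampling time sequence according to time period information.
    groups = [samples[i:i + period_len]
              for i in range(0, len(samples) - period_len + 1, period_len)]
    freq_seqs = []
    for g in groups:
        n = len(g)
        # Naive discrete Fourier transform magnitudes (first n // 2 bins).
        mags = []
        for k in range(n // 2):
            re = sum(g[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
            im = -sum(g[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
            mags.append(math.hypot(re, im))
        freq_seqs.append(mags)
    return freq_seqs

# A sine with one cycle per group: energy concentrates in bin 1 of each group.
samples = [math.sin(2 * math.pi * t / 8) for t in range(16)]
seqs = to_frequency_sequences(samples, 8)
print(len(seqs))                                 # 2 frequency sequences
print(max(range(4), key=lambda k: seqs[0][k]))   # dominant bin: 1
```

In practice an FFT over windowed frames would replace the naive DFT, but the grouping-then-transform structure is the same.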
3. The method according to claim 1, wherein the clustering the at least two frequency sequences respectively corresponding to each audio data to obtain at least two sequence clusters comprises:
determining the at least two frequency sequences respectively corresponding to each audio data as to-be-processed sequences, adding each to-be-processed sequence to a sequence set, and selecting a center sequence tk from the sequence set; k is a positive integer less than or equal to the number of center sequences;
obtaining the similarity between each to-be-processed sequence contained in the sequence set and the center sequence tk; if the similarity between a to-be-processed sequence Gi in the sequence set and the center sequence tk is maximum, adding the to-be-processed sequence Gi to a to-be-processed cluster Ck to which the center sequence tk belongs; i is a positive integer less than or equal to the number of to-be-processed sequences contained in the sequence set;
updating the center sequence tk according to the to-be-processed sequences contained in the to-be-processed cluster Ck, until the updated center sequence tk in the to-be-processed cluster Ck is the same as the center sequence tk before updating, and determining the to-be-processed cluster Ck as a sequence cluster.
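The iterative assign-and-update clustering of claim 3 can be sketched as below. Using negative Euclidean distance as the similarity and taking fixed initial center sequences are assumptions for illustration; the claim does not fix either choice.

```python
# Sketch of claim 3: assign each to-be-processed sequence to its most similar
# center sequence, update each center from its cluster, and stop when the
# updated centers equal the centers before updating.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cluster_sequences(sequences, centers, max_iters=100):
    for _ in range(max_iters):
        clusters = [[] for _ in centers]
        for seq in sequences:
            # Maximum similarity == minimum distance under this assumption.
            k = min(range(len(centers)), key=lambda j: euclidean(seq, centers[j]))
            clusters[k].append(seq)
        # Update each center to the mean of its to-be-processed cluster.
        new_centers = [
            [sum(vals) / len(c) for vals in zip(*c)] if c else centers[j]
            for j, c in enumerate(clusters)
        ]
        if new_centers == centers:   # updated center == center before update
            break
        centers = new_centers
    return clusters, centers

seqs = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
clusters, centers = cluster_sequences(seqs, centers=[[0.0, 0.0], [5.0, 5.0]])
print([len(c) for c in clusters])   # [2, 2]
```

This is essentially k-means with the convergence test the claim states: the cluster becomes a sequence cluster once its center stops changing.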
4. The method according to claim 1, wherein the clustering the at least two frequency sequences respectively corresponding to each audio data to obtain at least two sequence clusters comprises:
determining the at least two frequency sequences respectively corresponding to each piece of audio data as sequences to be processed, and adding each sequence to be processed to a sequence set;
acquiring M to-be-processed clusters corresponding to the sequence set, and acquiring the similarity between any two to-be-processed clusters in the M to-be-processed clusters; the M is the number of sequences to be processed contained in the sequence set, and each cluster to be processed comprises a sequence to be processed;
merging two clusters to be processed corresponding to the maximum similarity in the M clusters to be processed to obtain updated M-1 clusters to be processed, and obtaining the similarity between any two updated clusters to be processed in the updated M-1 clusters to be processed;
and merging the two updated clusters to be processed corresponding to the maximum similarity in the updated M-1 clusters to be processed to obtain updated M-2 clusters to be processed until the number of the updated clusters to be processed is equal to the threshold of the number of clusters, and determining the updated clusters to be processed as sequence clusters.
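The agglomerative variant of claim 4 can be sketched as follows: start with M single-sequence clusters and repeatedly merge the two most similar ones until the cluster-count threshold is reached. Single-linkage similarity (smallest pairwise distance between clusters) is an assumption; the claim leaves the similarity measure open.

```python
# Sketch of claim 4: M to-be-processed clusters, one sequence each, merged
# pairwise by maximum similarity until the cluster-number threshold is hit.

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def agglomerative(sequences, cluster_count_threshold):
    clusters = [[s] for s in sequences]   # M clusters, one sequence each
    while len(clusters) > cluster_count_threshold:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single-linkage: closest pair of member sequences.
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)    # merge the most similar pair
    return clusters

seqs = [[0.0], [0.2], [5.0], [5.2]]
print(sorted(len(c) for c in agglomerative(seqs, 2)))  # [2, 2]
```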
5. The method according to claim 1, wherein the determining the index information corresponding to each of the audio data according to the cluster identifier of the cluster of the sequences to which the at least two frequency sequences respectively belong in each of the audio data comprises:
respectively setting cluster identifiers for the at least two sequence clusters, and determining initial index information corresponding to each audio data according to the cluster identifiers of the sequence clusters to which the at least two frequency sequences respectively belong in each audio data;
when the adjacent cluster identifiers in the initial index information are different, acquiring a combined identifier corresponding to the adjacent cluster identifiers;
and adding the combined identifier to the initial index information to obtain the index information corresponding to each audio data.
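The index construction of claim 5 can be sketched directly: the initial index is the ordered list of cluster identifiers, and a combined identifier is appended for every pair of adjacent cluster identifiers that differ. Encoding the combined identifier as "A+B" is an assumption for illustration.

```python
# Sketch of claim 5: initial index information = cluster identifiers of the
# frequency sequences in order; add a combined identifier whenever two
# adjacent cluster identifiers differ.

def build_index(cluster_ids):
    index = list(cluster_ids)                 # initial index information
    for a, b in zip(cluster_ids, cluster_ids[1:]):
        if a != b:                            # adjacent identifiers differ
            index.append(f"{a}+{b}")          # add the combined identifier
    return index

print(build_index(["c1", "c1", "c2", "c3"]))
# ['c1', 'c1', 'c2', 'c3', 'c1+c2', 'c2+c3']
```

The combined identifiers capture the order of cluster transitions, which plain cluster counts would lose.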
6. The method according to claim 1, wherein the obtaining, according to the index information, recommended audio data corresponding to audio data to be retrieved in the audio data set includes:
when the retrieval triggering operation aiming at the audio data to be retrieved in the audio data set is detected, determining the index information corresponding to the audio data to be retrieved as target index information, and determining the index information corresponding to the rest audio data except the audio data to be retrieved in the audio data set as candidate index information;
obtaining the matching degree between the target index information and the candidate index information, and sorting the candidate audio data contained in the audio data set according to the matching degree; the candidate audio data is audio data corresponding to the candidate index information;
and acquiring recommended audio data from the sorted candidate audio data according to the sorting order, and sending the recommended audio data to the terminal equipment corresponding to the audio data to be retrieved.
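The retrieval flow of claim 6 can be sketched as a sort over candidates by matching degree. The placeholder `score` here simply counts shared index entries; claims 7 and 8 define the matching degree the patent actually claims, and the names and top-k cutoff are illustrative assumptions.

```python
# Sketch of claim 6: score every candidate index against the target index,
# sort the candidate audio data by matching degree, and take the top ranks
# as recommended audio data.

def score(target_index, candidate_index):
    return len(set(target_index) & set(candidate_index))  # placeholder score

def recommend(target_index, candidates, top_k=2):
    # candidates: list of (audio_id, candidate_index) pairs.
    ranked = sorted(candidates,
                    key=lambda c: score(target_index, c[1]),
                    reverse=True)
    return [audio_id for audio_id, _ in ranked[:top_k]]

target = ["c1", "c2", "c1+c2"]
candidates = [("song_a", ["c1", "c2", "c1+c2"]),
              ("song_b", ["c3", "c4", "c3+c4"]),
              ("song_c", ["c1", "c3", "c1+c3"])]
print(recommend(target, candidates))  # ['song_a', 'song_c']
```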
7. The method of claim 6, wherein the target index information and the candidate index information each comprise a cluster identifier and a combination identifier;
the obtaining of the matching degree between the target index information and the candidate index information includes:
determining the cluster identifier and the combined identifier contained in the target index information as a first identifier to be processed;
selecting candidate index information xn corresponding to candidate audio data yn from the audio data set, and determining the cluster identifier and the combined identifier contained in the candidate index information xn as second to-be-processed identifiers; n is a positive integer less than or equal to the number of candidate audio data;
classifying the first to-be-processed identifiers contained in the target index information to obtain at least two target identifier categories, and respectively counting a first number of the first to-be-processed identifiers contained in each target identifier category;
classifying the second to-be-processed identifiers contained in the candidate index information xn to obtain at least two candidate identifier categories, and respectively counting a second number of the second to-be-processed identifiers contained in each candidate identifier category;
determining the degree of match between the target index information and the candidate index information xn according to the first number and the second number.
8. The method of claim 7, wherein the determining the degree of match between the target index information and the candidate index information xn according to the first number and the second number comprises:
acquiring a matching identifier category from the at least two target identifier categories and the at least two candidate identifier categories; the at least two target identifier categories and the at least two candidate identifier categories each comprise the matching identifier category;
summing the first number of the matching identifier category in the target index information and the second number of the matching identifier category in the candidate index information xn to obtain the degree of match between the target index information and the candidate index information xn.
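The matching degree of claims 7 and 8 can be sketched with a pair of counters. Treating each distinct identifier as its own category is an assumption for illustration; the claims only require that identifiers be classified into categories and the first and second numbers be summed over the categories present in both indexes.

```python
# Sketch of claims 7-8: count first/second to-be-processed identifiers per
# category, find the matching identifier categories shared by both indexes,
# and sum the first and second numbers over them.
from collections import Counter

def match_degree(target_index, candidate_index):
    first = Counter(target_index)              # first number per target category
    second = Counter(candidate_index)          # second number per candidate category
    matching = first.keys() & second.keys()    # matching identifier categories
    return sum(first[c] + second[c] for c in matching)

target = ["c1", "c1", "c2", "c1+c2"]
candidate = ["c1", "c2", "c2", "c2+c3"]
print(match_degree(target, candidate))  # matching: c1, c2 -> (2+1)+(1+2) = 6
```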
9. An audio data processing apparatus, comprising:
the clustering module is used for acquiring at least two frequency sequences corresponding to each audio data in an audio data set, and clustering the at least two frequency sequences corresponding to each audio data to obtain at least two sequence clusters;
a determining module, configured to determine, according to cluster identifiers of the sequence clusters to which the at least two frequency sequences in each piece of audio data respectively belong, index information corresponding to each piece of audio data respectively;
and the recommending module is used for acquiring recommended audio data corresponding to the audio data to be retrieved in the audio data set according to the index information.
10. A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 8.
11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, perform the steps of the method according to any one of claims 1 to 8.
CN202010237269.7A 2020-03-30 2020-03-30 Audio data processing method and device, computer equipment and storage medium Active CN111460215B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010237269.7A CN111460215B (en) 2020-03-30 2020-03-30 Audio data processing method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010237269.7A CN111460215B (en) 2020-03-30 2020-03-30 Audio data processing method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111460215A true CN111460215A (en) 2020-07-28
CN111460215B CN111460215B (en) 2021-08-24

Family

ID=71679260

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010237269.7A Active CN111460215B (en) 2020-03-30 2020-03-30 Audio data processing method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111460215B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112015942A (en) * 2020-08-28 2020-12-01 上海掌门科技有限公司 Audio processing method and device
CN113688951A (en) * 2021-10-25 2021-11-23 腾讯科技(深圳)有限公司 Video data processing method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100064217A1 (en) * 2007-03-30 2010-03-11 Taro Nakajima Content explaining apparatus and method
CN102184169A (en) * 2011-04-20 2011-09-14 北京百度网讯科技有限公司 Method, device and equipment used for determining similarity information among character string information
CN103853749A (en) * 2012-11-30 2014-06-11 国际商业机器公司 Mode-based audio retrieval method and system
CN110322897A (en) * 2018-03-29 2019-10-11 北京字节跳动网络技术有限公司 A kind of audio retrieval recognition methods and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100064217A1 (en) * 2007-03-30 2010-03-11 Taro Nakajima Content explaining apparatus and method
CN102184169A (en) * 2011-04-20 2011-09-14 北京百度网讯科技有限公司 Method, device and equipment used for determining similarity information among character string information
CN103853749A (en) * 2012-11-30 2014-06-11 国际商业机器公司 Mode-based audio retrieval method and system
CN110322897A (en) * 2018-03-29 2019-10-11 北京字节跳动网络技术有限公司 A kind of audio retrieval recognition methods and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIN YANG et al.: "Principles and Practice of Big Data Technology", 31 March 2018 *
HAO ZHIFENG et al.: "Data Science and Mathematical Modeling", 31 January 2019 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112015942A (en) * 2020-08-28 2020-12-01 上海掌门科技有限公司 Audio processing method and device
CN113688951A (en) * 2021-10-25 2021-11-23 腾讯科技(深圳)有限公司 Video data processing method and device
CN113688951B (en) * 2021-10-25 2022-01-21 腾讯科技(深圳)有限公司 Video data processing method and device

Also Published As

Publication number Publication date
CN111460215B (en) 2021-08-24

Similar Documents

Publication Publication Date Title
Yang et al. Revisiting the problem of audio-based hit song prediction using convolutional neural networks
JP4945877B2 (en) System and method for recognizing sound / musical signal under high noise / distortion environment
CN111309965B (en) Audio matching method, device, computer equipment and storage medium
US20100217755A1 (en) Classifying a set of content items
CN110188356B (en) Information processing method and device
Chen et al. Learning audio embeddings with user listening data for content-based music recommendation
CN103455538B (en) Information processing unit, information processing method and program
Kiktova-Vozarikova et al. Feature selection for acoustic events detection
CN110010159B (en) Sound similarity determination method and device
CN111460215B (en) Audio data processing method and device, computer equipment and storage medium
CN111192601A (en) Music labeling method and device, electronic equipment and medium
Shen et al. A novel framework for efficient automated singer identification in large music databases
Farajzadeh et al. PMG-Net: Persian music genre classification using deep neural networks
WO2021016013A1 (en) Systems and methods for identifying dynamic types in voice queries
Dhall et al. Music genre classification with convolutional neural networks and comparison with f, q, and mel spectrogram-based images
Sanden et al. A perceptual study on music segmentation and genre classification
Álvarez et al. Riada: a machine-learning based infrastructure for recognising the emotions of Spotify songs
CN108777804B (en) Media playing method and device
WO2020225338A1 (en) Methods and systems for determining compact semantic representations of digital audio signals
CN111309966A (en) Audio matching method, device, equipment and storage medium
CN111445922A (en) Audio matching method and device, computer equipment and storage medium
Bargaje Emotion recognition and emotion based classification of audio using genetic algorithm-an optimized approach
CN115359785A (en) Audio recognition method and device, computer equipment and computer-readable storage medium
Geetha Ramani et al. Improvised emotion and genre detection for songs through signal processing and genetic algorithm
Pao et al. Comparison between weighted d-knn and other classifiers for music emotion recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40026283

Country of ref document: HK

GR01 Patent grant
GR01 Patent grant