CN113157968A - Method, terminal and storage medium for acquiring a same-melody audio set - Google Patents
Method, terminal and storage medium for acquiring a same-melody audio set
- Publication number
- CN113157968A CN113157968A CN202110373096.6A CN202110373096A CN113157968A CN 113157968 A CN113157968 A CN 113157968A CN 202110373096 A CN202110373096 A CN 202110373096A CN 113157968 A CN113157968 A CN 113157968A
- Authority
- CN
- China
- Prior art keywords
- audio
- characteristic information
- audios
- determining
- feature information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/68—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/683—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
Abstract
The application discloses a method, a terminal and a storage medium for acquiring a same-melody audio set, and belongs to the technical field of the internet. The method comprises the following steps: determining a plurality of audios corresponding to a target music name; extracting feature information of each audio; for each audio, determining the similarity between the feature information of that audio and the feature information of each other audio, and determining the number of similarities greater than a first preset threshold as the number of similar audios corresponding to that audio; determining benchmark feature information according to the number of similar audios corresponding to each audio; and acquiring, from the plurality of audios, audios whose feature information has a similarity to the benchmark feature information greater than a second preset threshold, to form the same-melody audio set corresponding to the target music name. The embodiment of the application can accurately gather as many low-difference audios as possible, so that audios can then be accurately ranked according to the same-melody audio set.
Description
Technical Field
The present application relates to the field of internet technologies, and in particular to a method, a terminal, and a storage medium for acquiring a same-melody audio set.
Background
In the field of music, the most common recommendation strategy is popularity-based. The popular recommendation method determines the total play count of the same-melody audio set corresponding to each music name, sorts the music names by this total, and obtains a ranking result. It then takes the same-melody audio sets corresponding to a preset number of top-ranked music names, and for each of these sets determines, according to the play count of each audio in the set, the audio with the largest play count; that audio becomes the audio to be recommended for the set. The determined audios to be recommended are then recommended to the user. Every audio in a same-melody audio set has the same melody.
In the related art, to determine the same-melody audio set corresponding to a target music name, it is necessary to determine a plurality of audios corresponding to the target music name, extract the feature information of each audio, take the feature information of the audio with the largest play count among the plurality of audios as the benchmark feature information, calculate the similarity between the feature information of each audio and the benchmark feature information, and form the same-melody audio set from the audios whose similarity is greater than a preset value.
In the above process, after the plurality of audios are mapped into the feature space, if the feature point corresponding to the audio with the largest play count is an outlier (as shown in fig. 1), the many low-difference audios cannot be gathered together, so the total play count of the same-melody audio set is small and the ranking result is inaccurate. Here, an outlier is a feature point far away from the dense region of feature points.
Disclosure of Invention
The embodiments of the application provide a method, a terminal and a storage medium for acquiring a same-melody audio set; the scheme can rapidly and accurately determine the same-melody audio set. The technical scheme is as follows:
in one aspect, a method for acquiring a same-melody audio set is provided, the method comprising:
determining a plurality of audios corresponding to a target music name;
extracting feature information of each audio;
for each audio, determining the similarity between the feature information of the audio and the feature information of each other audio, and determining the number of similarities greater than a first preset threshold as the number of similar audios corresponding to the audio;
determining benchmark feature information according to the number of similar audios corresponding to each audio;
and acquiring, from the plurality of audios, audios whose feature information has a similarity to the benchmark feature information greater than a second preset threshold, to form the same-melody audio set corresponding to the target music name.
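Read together, the claimed steps amount to a density-based selection of a benchmark. The following is a minimal NumPy sketch of that pipeline under assumptions of ours (cosine similarity on unit-normalized feature vectors rather than the patent's inner-product formula, the "N = M" variant for the benchmark, and made-up threshold values); it is an illustration, not the patent's implementation.

```python
import numpy as np

def same_melody_group(features, t1, t2):
    """Toy sketch of the claimed steps (names and thresholds are assumptions).

    features: (n, d) array, one feature vector per audio for one music name.
    t1: first preset threshold (similar-audio counting).
    t2: second preset threshold (membership in the same-melody audio set).
    """
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = feats @ feats.T                       # pairwise similarities
    np.fill_diagonal(sims, -np.inf)              # exclude self-comparisons
    counts = (sims > t1).sum(axis=1)             # similar-audio count per audio
    order = np.argsort(-counts)                  # audio sequence, descending
    n = int(counts.max())                        # "first way": N = M
    benchmark = feats[order[:n]].mean(axis=0)    # mean of the first N features
    benchmark /= np.linalg.norm(benchmark)
    group = np.where(feats @ benchmark > t2)[0]  # the same-melody audio set
    return group, benchmark
```

For instance, with three near-identical feature vectors and one outlier, the returned group contains the three similar audios and excludes the outlier, regardless of which audio has the largest play count.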
Optionally, the determining the benchmark feature information according to the number of similar audios corresponding to each audio includes:
sorting the audios in descending order of the number of similar audios corresponding to each audio to obtain an audio sequence;
determining the number N of related feature information of the benchmark feature information according to the maximum number M of similar audios in the audio sequence;
and determining the benchmark feature information according to the feature information of the first N audios in the audio sequence.
Optionally, the determining, according to the maximum number M of similar audios, the number N of related feature information of the benchmark feature information includes:
determining the maximum number M of similar audios as the number N of related feature information of the benchmark feature information.
Optionally, the determining, according to the maximum number M of similar audios in the audio sequence, the number N of related feature information of the benchmark feature information includes:
if M is an odd number, determining the number of similar audios corresponding to the (M+1)/2-th audio in the audio sequence as the number N of related feature information of the benchmark feature information;
and if M is an even number, determining the average value of the number of similar audios corresponding to the (M+1)/2-th audio and the number of similar audios corresponding to the (M-1)/2-th audio in the audio sequence as the number N of related feature information of the benchmark feature information.
Optionally, the determining the benchmark feature information according to the feature information of the first N audios in the audio sequence includes:
calculating the average value of the feature information of the first N audios in the audio sequence as the benchmark feature information.
Optionally, the determining the benchmark feature information according to the feature information of the first N audios in the audio sequence includes:
determining at least one feature information group according to the feature information of the first N audios in the audio sequence, wherein the similarity between the feature information of every two audios in each feature information group is greater than a third preset threshold, and the similarity between the feature information of audios in any two feature information groups is less than a fourth preset threshold;
and determining the benchmark feature information corresponding to each feature information group according to the feature information in each feature information group.
Optionally, the determining the benchmark feature information corresponding to each feature information group includes:
calculating the average value of all the feature information in each feature information group as the benchmark feature information corresponding to that feature information group.
Optionally, the determining the similarity between the feature information of the audio and the feature information of each other audio includes:
determining the similarity between the feature information of the audio and the feature information of the other audio according to the formula d = 1 - Ei · Ecenter;
wherein d is the similarity between the feature information of the audio and the feature information of the other audio, Ei is the feature information of the audio, and Ecenter is the feature information of the other audio.
In one aspect, an apparatus for acquiring a same-melody audio set is provided, the apparatus comprising:
the first determining module is configured to determine a plurality of audios corresponding to the target music name;
the extraction module is configured to extract feature information of each audio;
the similar-audio-number determining module is configured to determine, for each audio, the similarity between the feature information of that audio and the feature information of each other audio, and determine the number of similarities greater than a first preset threshold as the number of similar audios corresponding to that audio;
the second determining module is configured to determine the benchmark feature information according to the number of similar audios corresponding to each audio;
and the same-melody-audio-set determining module is configured to acquire, from the plurality of audios, audios whose feature information has a similarity to the benchmark feature information greater than a second preset threshold, to form the same-melody audio set corresponding to the target music name.
Optionally, the second determining module is configured to:
sorting the audios in descending order of the number of similar audios corresponding to each audio to obtain an audio sequence;
determining the number N of related feature information of the benchmark feature information according to the maximum number M of similar audios in the audio sequence;
and determining the benchmark feature information according to the feature information of the first N audios in the audio sequence.
Optionally, the second determining module is configured to:
determining the maximum number M of similar audios as the number N of related feature information of the benchmark feature information.
Optionally, the second determining module is configured to:
if M is an odd number, determining the number of similar audios corresponding to the (M+1)/2-th audio in the audio sequence as the number N of related feature information of the benchmark feature information;
and if M is an even number, determining the average value of the number of similar audios corresponding to the (M+1)/2-th audio and the number of similar audios corresponding to the (M-1)/2-th audio in the audio sequence as the number N of related feature information of the benchmark feature information.
Optionally, the second determining module is configured to:
calculating the average value of the feature information of the first N audios in the audio sequence as the benchmark feature information.
Optionally, the second determining module is configured to:
determining at least one feature information group according to the feature information of the first N audios in the audio sequence, wherein the similarity between the feature information of every two audios in each feature information group is greater than a third preset threshold, and the similarity between the feature information of audios in any two feature information groups is less than a fourth preset threshold;
and determining the benchmark feature information corresponding to each feature information group according to the feature information in each feature information group.
Optionally, the second determining module is configured to:
calculating the average value of all the feature information in each feature information group as the benchmark feature information corresponding to that feature information group.
Optionally, the similar audio number determining module is configured to:
determining the similarity between the feature information of the audio and the feature information of the other audio according to the formula d = 1 - Ei · Ecenter;
wherein d is the similarity between the feature information of the audio and the feature information of the other audio, Ei is the feature information of the audio, and Ecenter is the feature information of the other audio.
In one aspect, a terminal is provided, comprising a processor and a memory, where at least one program code is stored in the memory; the at least one program code is loaded and executed by the processor to implement the above method for acquiring a same-melody audio set.
In one aspect, a computer-readable storage medium is provided, in which at least one program code is stored; the at least one program code is loaded and executed by a processor to implement the above method for acquiring a same-melody audio set.
In one aspect, a computer program product or a computer program is provided, comprising computer program code stored in a computer-readable storage medium; a processor of a computer device reads the computer program code from the computer-readable storage medium and executes it, causing the computer device to perform the above method for acquiring a same-melody audio set.
In the embodiments of the present application, the benchmark feature information is determined according to the number of similar audios corresponding to each audio. This number indicates the local density around each audio in a region of the feature space: the greater the number of similar audios corresponding to an audio, the denser its neighbourhood; the fewer, the sparser. The benchmark feature information determined in this way is therefore tied to the density of the audios, which avoids the possibility that the feature point corresponding to the calculated benchmark feature information is an outlier. The method can thus accurately gather as many low-difference audios as possible, and the audios can then be accurately ranked according to the same-melody audio set.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic diagram of a method for acquiring a same-melody audio set according to an embodiment of the present application;
Fig. 2 is a schematic diagram of an implementation environment of a method for acquiring a same-melody audio set according to an embodiment of the present application;
Fig. 3 is a flowchart of a method for acquiring a same-melody audio set according to an embodiment of the present application;
Fig. 4 is a schematic diagram of a method for acquiring a same-melody audio set according to an embodiment of the present application;
Fig. 5 is a schematic diagram of a method for acquiring a same-melody audio set according to an embodiment of the present application;
Fig. 6 is a schematic diagram of a method for acquiring a same-melody audio set according to an embodiment of the present application;
Fig. 7 is a schematic diagram of a method for acquiring a same-melody audio set according to an embodiment of the present application;
Fig. 8 is a schematic diagram of an apparatus for acquiring a same-melody audio set according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of a terminal according to an embodiment of the present application;
Fig. 10 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Fig. 2 is a schematic diagram of an implementation environment of a method for acquiring a same-melody audio set provided by an embodiment of the present application. Referring to fig. 2, the implementation environment includes: a server 201 and a terminal 202.
The server 201 may be one server or a server cluster composed of a plurality of servers. The server 201 may be at least one of a cloud computing platform and a virtualization center, which is not limited in this embodiment of the present application. The server 201 may send the audio to be recommended to the terminal 202. Of course, the server 201 may also include other functional servers to provide more comprehensive and diversified services.
The terminal 202 may be at least one of a smart phone, a game console, a desktop computer, a tablet computer, an MP3(Moving Picture Experts Group Audio Layer III) player, an MP4(Moving Picture Experts Group Audio Layer IV) player, and a laptop computer. The terminal 202 is connected to the server 201 through a wired network or a wireless network, and an application program supporting music playing is installed and operated in the terminal 202. The terminal 202 may receive the audio to be recommended sent by the server 201 and recommend the audio to be recommended to the user.
Fig. 3 is a flowchart of a method for acquiring a same-melody audio set according to an embodiment of the present application. This embodiment is described with the server as the execution subject. Referring to fig. 3, the embodiment includes:
Step 301, determining a plurality of audios corresponding to the target music name.
The audio referred to in the embodiments of the present application includes both pure-music audio and song audio.
In implementation, all audios corresponding to the target music name are determined in the server.
Step 302, extracting the feature information of each audio.
The feature information may be spectral feature information or other feature information.
Specifically, each audio is input into a pre-trained feature extraction model to obtain the feature information of each audio.
The training method of the feature extraction model is as follows. A sample is randomly selected from a sample set, where the sample set comprises a plurality of samples and each sample comprises a first sample audio, a second sample audio and a manual labeling result. The manual labeling result indicates whether the first sample audio and the second sample audio are same-melody audios. After a sample is selected and its first sample audio, second sample audio and labeling result are obtained, as shown in fig. 4, the first sample audio and the second sample audio are respectively input into the feature extraction model to obtain first feature information corresponding to the first sample audio and second feature information corresponding to the second sample audio. The similarity between the feature information of the first sample audio and the feature information of the second sample audio is calculated, and a prediction result is determined based on the similarity: when the similarity is greater than a preset threshold, the prediction result is that the two sample audios are same-melody audios; when the similarity is less than the preset threshold, the prediction result is that they are not. The feature extraction model is then adjusted according to the prediction result, the manual labeling result and the loss function, which completes one training adjustment. Further samples are randomly selected from the sample set, and the model obtained from the previous adjustment is adjusted again on each of them, until the preset number of training adjustments is completed and the pre-trained feature extraction model is obtained.
In the above process, when the manual labeling result is that the first sample audio and the second sample audio are same-melody audios, the labeling result may be set to 1; when it is that they are not, it may be set to 0. Similarly, when the prediction result is that the two sample audios are same-melody audios, the prediction result is set to 1, and otherwise to 0.
Of course, the opposite convention may also be used: the result that the two sample audios are not same-melody audios may be set to 1, and the result that they are may be set to 0.
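The patent specifies neither the model architecture nor the loss function. Purely to illustrate the loop described above (embed a labeled sample pair, compare the predicted similarity against the manual labeling result, adjust the model), here is a toy built on assumptions of ours: a linear map standing in for the "feature extraction model", a squared-error loss, and numerical gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8)) * 0.1          # toy linear "feature extraction model"

def embed(W_, x):
    e = W_ @ x
    return e / (np.linalg.norm(e) + 1e-9)  # unit-normalized feature information

def pair_loss(W_, x1, x2, label):
    # predicted similarity vs. the manual labeling result (1 or 0)
    sim = float(embed(W_, x1) @ embed(W_, x2))
    return (sim - label) ** 2

def train_step(W_, x1, x2, label, lr=0.1, eps=1e-5):
    """One training adjustment, via a finite-difference gradient (toy only)."""
    base = pair_loss(W_, x1, x2, label)
    grad = np.zeros_like(W_)
    for i in range(W_.shape[0]):
        for j in range(W_.shape[1]):
            Wp = W_.copy()
            Wp[i, j] += eps
            grad[i, j] = (pair_loss(Wp, x1, x2, label) - base) / eps
    return W_ - lr * grad
```

Repeating `train_step` over randomly drawn samples mirrors the patent's "select another sample, adjust again" loop; a real system would use a learned audio encoder and backpropagation instead.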
The feature information extracted in the embodiment of the present application may be a feature vector or a feature matrix, which is not limited in the embodiment of the present application.
Step 303, for each audio, determining the similarity between the feature information of the audio and the feature information of each other audio, and determining the number of similarities greater than the first preset threshold as the number of similar audios corresponding to the audio.
In the embodiments of the present application, the sum of the similarity and the distance is 1; that is, similarity and distance are inversely related: the greater the similarity, the smaller the distance, and vice versa.
In implementation, each audio is mapped into the feature space according to its corresponding feature information. The feature space contains a plurality of feature points, one per audio. After the feature point of each audio is determined, for any feature point A in the feature space, the distances between the other feature points and feature point A are determined; the feature points whose distance to feature point A is smaller than the first distance threshold are taken as the similar feature points of feature point A, and their number is taken as the number of similar audios corresponding to feature point A.
It should be noted that the sum of the first distance threshold and the first preset threshold is 1.
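The neighbor counting described above can be sketched as follows. This sketch uses Euclidean distance (one of the alternative distance formulas the description mentions later) and a hypothetical threshold value of our own choosing.

```python
import numpy as np

def similar_audio_counts(points, first_distance_threshold):
    """For each feature point, count the other feature points lying within
    the first distance threshold (= 1 - first preset similarity threshold)."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)   # a point is not its own neighbour
    return (dist < first_distance_threshold).sum(axis=1)
```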
Optionally, the similarity between the feature information of the audio and the feature information of another audio is determined according to the formula d = 1 - E · Ecenter, where d is the similarity between the feature information of the audio and the feature information of the other audio, E is the feature information of the audio, and Ecenter is the feature information of the other audio.
The "·" in the above formula denotes the inner product; the specific operation is a · b = a1*b1 + a2*b2 + ... + an*bn, where a = (a1, a2, ..., an) and b = (b1, b2, ..., bn).
For example, suppose the similarity between feature information A and feature information B needs to be calculated. It is calculated according to the formula d = 1 - EA · EB, where d is the similarity between feature information A and feature information B, EA is the feature information of the one audio, and EB is the feature information of the other audio.
In the embodiments of the present application, the similarity between feature information may also be calculated using other distance formulas, for example the Euclidean distance, Mahalanobis distance, Manhattan distance, or Chebyshev distance formulas.
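The formula can be written down directly. One observation worth hedging: for identical unit vectors it evaluates to 0, and for orthogonal ones to 1, so under the description's convention that similarity and distance sum to 1, the quantity 1 - Ei · Ecenter reads naturally as a distance and Ei · Ecenter as the similarity; thresholds should be interpreted with that convention in mind.

```python
import numpy as np

def formula_d(e_i, e_center):
    # d = 1 - Ei . Ecenter, with a . b = a1*b1 + a2*b2 + ... + an*bn
    return 1.0 - float(np.dot(e_i, e_center))
```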
Step 304, determining the benchmark feature information according to the number of similar audios corresponding to each audio.
In the embodiments of the present application, the larger the number of similar audios corresponding to an audio, the greater the probability that its feature point lies in the dense region of feature points; the feature point of the benchmark feature information determined from these numbers is therefore bound to lie in the feature-point dense region.
Optionally, the audios are sorted in descending order of their number of similar audios to obtain an audio sequence. The number N of related feature information of the benchmark feature information is determined according to the maximum number M of similar audios. The benchmark feature information is then determined according to the feature information of the first N audios in the audio sequence.
In the embodiments of the present application, the number N of related feature information can be determined in several ways; two are described as follows:
In the first way, the maximum number M of similar audios is determined as the number N of related feature information of the benchmark feature information.
In the second way, if M is an odd number, the number of similar audios corresponding to the (M+1)/2-th audio in the audio sequence is determined as the number N of related feature information of the benchmark feature information; if M is an even number, the average of the number of similar audios corresponding to the (M+1)/2-th audio and the number of similar audios corresponding to the (M-1)/2-th audio in the audio sequence is determined as the number N of related feature information of the benchmark feature information.
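The "second way" can be sketched as below. One caveat: for even M, the printed positions (M+1)/2 and (M-1)/2 are not whole numbers, so the code assumes (this is our reading, not the patent's text) that the two middle positions M/2 and M/2+1 of the audio sequence were intended.

```python
def related_feature_count(counts_desc, M):
    """Second way of determining N.

    counts_desc: similar-audio counts sorted in descending order (the audio
    sequence). Positions in the patent are 1-based, hence the -1 offsets.
    """
    if M % 2 == 1:
        return counts_desc[(M + 1) // 2 - 1]
    # Assumed even-M reading: average the two middle positions M/2 and M/2+1.
    return (counts_desc[M // 2 - 1] + counts_desc[M // 2]) / 2
```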
In the embodiments of the present application, after the number N of related feature information is determined, the average value of the feature information of the first N audios in the audio sequence may be calculated directly and used as the benchmark feature information.
It should be noted that the first N audios in the sequence have the largest numbers of similar audios, so each of them corresponds to a dense region of feature points. These dense regions are likely to be one and the same region, and the feature points corresponding to the first N audios are therefore likely to lie in a single feature-point dense region. Consequently, the feature point corresponding to the benchmark feature information determined from the feature information of these N audios also lies in the feature-point dense region.
This calculation method is suitable when most feature points lie in one dense region and only a few points are scattered elsewhere in the feature space, i.e., when the feature space contains a single feature-point dense region, as shown in fig. 5. When the distribution of the feature points resembles that of fig. 5, the region where the feature points of the first N audios lie can be regarded as the same dense region, and the feature point corresponding to the benchmark feature information is the centre of that dense region.
In practice, however, one music name may correspond to audios of at least two melodies, in which case the feature space may contain at least two feature-point-dense regions. For example, for a music name corresponding to audios of two melodies, two dense regions exist in the feature space; the distribution of the feature points, shown in fig. 6, has part of the feature points in dense region A and part in dense region B.
For the situation in which one music name corresponds to audios of at least two melodies, after the first N audios in the audio sequence are determined, they can be divided into groups, and at least one piece of benchmark feature information determined from the division result. The specific steps are as follows: determine at least one feature information group according to the feature information of the first N audios in the audio sequence, such that the similarity between the feature information of every two audios within a group is greater than a third preset threshold, and the similarity between the feature information of audios in any two different groups is less than a fourth preset threshold; then determine the benchmark feature information corresponding to each feature information group.
The third preset threshold and the fourth preset threshold in the embodiment of the present application may be equal or different. When the third preset threshold and the fourth preset threshold are not equal, the fourth preset threshold may be greater than the third preset threshold.
In implementation, the feature points of the N audios in the feature space are determined, the distance between every pair of feature points is calculated, and feature points whose distance is smaller than the third preset threshold are connected to obtain an undirected graph. The vertices of the undirected graph are the feature points, and its edges are the connection relations between pairs of feature points. At least one connected sub-graph is then determined in the undirected graph, and the feature information corresponding to the feature points contained in one sub-graph is taken as one feature information group.
In one embodiment, after the at least one sub-graph is determined, the number of feature points contained in each sub-graph is counted. If the number of feature points in a sub-graph is greater than a preset number, the feature information corresponding to the feature points of that sub-graph is determined and taken as a feature information group. If the number of feature points in a sub-graph is less than the preset number, the sub-graph is ignored; that is, the feature information of its feature points is not determined and is not taken as a feature information group.
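The grouping just described, connecting feature points whose distance is below the third preset threshold, taking the connected sub-graphs, and discarding sub-graphs with too few points, can be sketched as follows. The function and its parameter names are illustrative assumptions; a sub-graph is kept when it has at least `min_size` points, since the text does not specify the boundary case of exactly the preset number.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def feature_groups(points, dist_threshold, min_size):
    """Return connected components of the undirected graph whose edges
    join feature points closer than dist_threshold, ignoring components
    with fewer than min_size points."""
    n = len(points)
    adjacency = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if dist(points[i], points[j]) < dist_threshold:
                adjacency[i].append(j)
                adjacency[j].append(i)
    seen, groups = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []  # iterative DFS over one sub-graph
        while stack:
            v = stack.pop()
            component.append(v)
            for w in adjacency[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        if len(component) >= min_size:
            groups.append(sorted(component))
    return groups
```

With two tight clusters and one isolated point, the two clusters come back as feature information groups and the isolated point is dropped, matching the sub-graph filtering above.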
Optionally, the average of all feature information in each feature information group is calculated to determine the benchmark feature information corresponding to that group.
In the embodiment of the application, after the at least one feature information group is obtained, the average of all feature information in each group is calculated and used as the benchmark feature information corresponding to that group. The average involved here is itself a piece of averaged feature information, not a single numerical value.
The feature point corresponding to the benchmark feature information determined by the technical scheme of the present application necessarily lies in a feature-point-dense region. This avoids the possibility, present in the prior art, that the feature point corresponding to the benchmark is an outlier, so the audios of songs with the same melody can be accurately clustered together.
Step 305: from the plurality of audios, acquire the audios whose feature information has a similarity with the benchmark feature information greater than a second preset threshold, to form the same-melody audio group corresponding to the target music name.
In the embodiment of the present application, the first preset threshold and the second preset threshold are determined empirically by those skilled in the art. There is no fixed relationship between the two: the first preset threshold may be greater than, less than, or even equal to the second preset threshold. The melodies of the audios within one same-melody audio group in this application are similar.
In implementation, for each of the plurality of audios, the similarity between its feature information and the benchmark feature information is calculated; the audios whose similarity is greater than the second preset threshold are determined and combined into the same-melody audio group corresponding to the target music name.
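Step 305 then reduces to a similarity filter against the benchmark. A minimal sketch, with the similarity measure passed in as a callable, since the concrete formula is only introduced later; the parallel-list layout of `audios` and `features` is an assumption of this sketch.

```python
def same_melody_group(audios, features, benchmark, similarity, threshold):
    """Collect the audios whose feature information is similar enough to
    the benchmark feature information (step 305).

    audios and features are parallel lists; similarity is whatever
    similarity measure the embodiment uses.
    """
    return [audio for audio, feat in zip(audios, features)
            if similarity(feat, benchmark) > threshold]
```

For instance, with a dot-product similarity over normalized feature vectors and a threshold of 0.5, only the audios whose vectors point close to the benchmark direction survive the filter.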
If a plurality of pieces of benchmark feature information are determined in step 304, then for each piece of benchmark feature information, the similarity between the feature information of each of the plurality of audios and that benchmark feature information is calculated, the audios whose similarity with the benchmark feature information is greater than the second preset threshold are determined, and these audios are taken as the same-melody audio group corresponding to that benchmark feature information. Each audio in the same-melody audio group corresponding to each piece of benchmark feature information is then analyzed, and the music name corresponding to each same-melody audio group is re-determined.
In the above process, the plurality of audios are all the audios corresponding to the target music name; that is, for each piece of benchmark feature information, the similarity between it and the feature information of every audio corresponding to the target music name is calculated, and the audios whose similarity with the benchmark feature information is greater than the second preset threshold are then determined.
In the embodiment of the application, a same-melody audio group contains at most one original audio, the rest being cover audios. A cover audio is an audio obtained when someone other than the original singer re-sings the original; the original audio is the song first published by a singer and performed by that singer or a collaborator. For example, suppose two pieces of benchmark feature information, and thus two same-melody audio groups, are determined among the audios corresponding to "Folmos". When the audios in one group are analyzed, the original audio of each is found to be "Folmos" sung by Zhou Chang, so the music name of that group is set to "Folmos - Zhou Chang". When the other group is analyzed, the original audio of each is found to be "Folmos" sung by Xiao Jing, so the music name of that group is set to "Folmos - Xiao Jing". In the above process, it should be understood that the original audio corresponding to "Folmos - Zhou Chang" and the original audio corresponding to "Folmos - Xiao Jing" differ in both content and melody.
It should be noted that, in practice, an original-audio label may be attached to the original audio, and the original audio within a same-melody audio group is then found by checking the label of each audio in the group. The original and cover audios in a same-melody audio group may also be identified by a technician.
In this embodiment of the present application, the benchmark feature information is determined according to the number of similar audios corresponding to each audio, and that number indicates the local density around each audio: the greater the number of similar audios corresponding to an audio, the higher the density around it; the fewer, the lower. The benchmark feature information determined in this way is therefore tied to the density of the audios, which avoids the possibility that the feature point corresponding to the calculated benchmark feature information is an outlier, so the method can accurately gather together the largest set of audios with small mutual differences.
For example, suppose the distance between feature point A and feature point B in the feature space is 10 and the second preset threshold is 6, where audio A corresponding to feature point A and audio B corresponding to feature point B are a group of same-melody audios. With the method in the related art, the determined same-melody group contains only audio A or only audio B, so the same-melody audios cannot be accurately grouped together. By contrast, the feature point corresponding to the benchmark feature information determined in the present application is close to, or even at, the center point between feature points A and B; therefore the same-melody audio group determined in the present application includes both audio A and audio B, and all same-melody audios can be accurately clustered together.
Further, as shown in fig. 7, the feature point 1 corresponding to the benchmark feature information determined in the present application necessarily lies in the feature-point-dense region, whereas, as shown in fig. 1, the feature point corresponding to the most-played audio in the related art may be the outlier 1', which cannot aggregate all same-melody audios together and thus cannot achieve accurate recommendation.
In the embodiment of the application, after the same-melody audio group corresponding to each music name is determined, the audios may be ranked according to the total play count corresponding to each same-melody audio group to determine the audio to be recommended. The specific steps are as follows: add up the play counts of the audios in each music name's same-melody audio group to determine the total play count corresponding to that music name, and determine the first audio corresponding to each music name within its same-melody audio group, where the first audio is the audio with the largest play count in the group. Rank the first audios of the music names according to the total play count corresponding to each music name to obtain a ranking result. Determine the preset number of audios with the largest total play counts in the ranking result as the audios to be recommended.
Alternatively, after the total play count corresponding to each music name is determined, the second audio, rather than the first audio, corresponding to each music name is determined, where the second audio is the original audio corresponding to that music name. The second audios of the music names are ranked according to the total play count corresponding to each music name to obtain a ranking result, from which the audios to be recommended are determined.
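Both ranking variants above, picking the most-played audio or the original audio from each group after ranking music names by total play count, can be sketched as follows. The record layout (dicts with "id", "plays", and "is_original" keys) is an illustrative assumption, not the patent's data model.

```python
def audios_to_recommend(groups, top_k, use_original=False):
    """groups maps each music name to its same-melody audio group.

    Music names are ranked by the total play count of their group; from
    each of the top_k groups, either the most-played audio (the "first
    audio") or the original audio (the "second audio") is picked.
    """
    ranked = sorted(groups.items(),
                    key=lambda item: sum(a["plays"] for a in item[1]),
                    reverse=True)
    picks = []
    for name, group in ranked[:top_k]:
        if use_original:
            pick = next(a for a in group if a["is_original"])
        else:
            pick = max(group, key=lambda a: a["plays"])
        picks.append((name, pick["id"]))
    return picks
```

Note that the ranking key is the group's total play count even when the picked audio is the original: a group whose cover versions dominate the plays still ranks ahead, but the original is what gets recommended.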
In the above process, the first audio is the audio with the largest play count in the same-melody audio group, and the second audio is the original audio corresponding to the music name; for example, the original song corresponding to the music name "Folmos - Zhou Chang" is "Folmos" sung by Zhou Chang. In the embodiment of the present application, the audio recommended to the user may be the most-played audio or the original audio. Generally, the melody of the original audio is the most polished and gives the user the best listening experience; in practice, however, there is no shortage of cover audios that also offer the user an auditory feast.
After the server determines the audio to be recommended, it determines the audio identifier of that audio and sends a display notification to the terminal, the display notification containing the audio identifier of the audio to be recommended. After receiving the display notification sent by the server, the terminal obtains the audio identifiers in the notification and displays them at a preset position of the terminal interface. When the terminal receives a selection instruction for a certain audio identifier, it sends the identifier to the server, so that the server sends the audio corresponding to the audio identifier to the terminal, and the terminal plays the audio.
Fig. 8 is a schematic structural diagram of an apparatus for acquiring a same-melody audio group provided by an embodiment of the present application; the apparatus may be disposed in a server. Referring to fig. 8, the apparatus includes:
a first determining module 810 configured to determine a plurality of audios corresponding to a target music name;
an extraction module 820 configured to extract feature information of each audio;
the similar audio number determining module 830 is configured to determine, for each audio, similarity between the feature information of the audio and the feature information of each other audio, and determine the number of similarities larger than a first preset threshold as the number of similar audios corresponding to the audio;
a second determining module 840 configured to determine the benchmark feature information according to the number of similar audios corresponding to each audio;
and a same-melody audio group determining module 850 configured to obtain, from the plurality of audios, the audios whose feature information has a similarity with the benchmark feature information greater than a second preset threshold, and form the same-melody audio group corresponding to the target music name.
Optionally, the second determining module 840 is configured to:
sorting the audios in descending order of the number of similar audios corresponding to each audio to obtain an audio sequence;
determining the number N of associated feature information of the benchmark feature information according to the maximum number M of similar audios in the audio sequence;
and determining the benchmark feature information according to the feature information of the first N audios in the audio sequence.
Optionally, the second determining module 840 is configured to:
and determining the maximum number M of similar audios as the number N of associated feature information of the benchmark feature information.
Optionally, the second determining module 840 is configured to:
if M is an odd number, determining the number of similar audios corresponding to the (M+1)/2-th audio in the audio sequence as the number N of associated feature information of the benchmark feature information;
and if M is an even number, determining the average of the number of similar audios corresponding to the (M+1)/2-th audio and the number of similar audios corresponding to the (M-1)/2-th audio in the audio sequence as the number N of associated feature information of the benchmark feature information.
Optionally, the second determining module 840 is configured to:
and calculating the average of the feature information of the first N audios in the audio sequence as the benchmark feature information.
Optionally, the second determining module 840 is configured to:
determining at least one feature information group according to the feature information of the first N audios in the audio sequence, wherein the similarity between the feature information of every two audios in each feature information group is greater than a third preset threshold, and the similarity between the feature information of audios in any two feature information groups is less than a fourth preset threshold;
and determining the benchmark feature information corresponding to each feature information group.
Optionally, the second determining module 840 is configured to:
and calculating the average of all feature information in each feature information group to determine the benchmark feature information corresponding to each feature information group.
Optionally, the similar audio number determining module 830 is configured to:
determining the similarity between the feature information of the audio and the feature information of another audio according to the formula d = 1 - Ei · Ecenter;
wherein d is the similarity between the feature information of the audio and the feature information of the other audio, Ei is the feature information of the audio, and Ecenter is the feature information of the other audio.
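Read literally, the formula above computes d = 1 - Ei · Ecenter. A minimal sketch under the assumption that Ei and Ecenter are equal-length vectors and that the operator is a vector dot product:

```python
def similarity(ei, ecenter):
    """d = 1 - Ei . Ecenter, per the formula above; reading the operator
    as a vector dot product is an assumption of this sketch."""
    return 1 - sum(x * y for x, y in zip(ei, ecenter))
```

With this reading, two identical unit vectors yield d = 0, which behaves like a distance even though the text calls d a similarity; in practice the sign convention would need to be checked against the embodiment's actual feature extraction.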
In this embodiment of the present application, the benchmark feature information is determined according to the number of similar audios corresponding to each audio, and that number indicates the local density around each audio: the greater the number of similar audios corresponding to an audio, the higher the density around it; the fewer, the lower. The benchmark feature information determined in this way is therefore tied to the density of the audios, which avoids the possibility that the feature point corresponding to the calculated benchmark feature information is an outlier, so the method can accurately gather together the largest set of audios with small mutual differences.
It should be noted that, when the apparatus for acquiring a same-melody audio group provided in the above embodiment performs this function in practice, the functions may be distributed among different functional modules as required; that is, the internal structure of the apparatus is divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus embodiment and the method embodiments for acquiring a same-melody audio group provided above belong to the same concept; for the specific implementation process, refer to the method embodiments, which is not repeated here.
Fig. 9 shows a block diagram of a terminal 900 according to an exemplary embodiment of the present application. The terminal 900 may be a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 900 may also be referred to by other names, such as user equipment, portable terminal, laptop terminal, or desktop terminal.
In general, terminal 900 includes: a processor 901 and a memory 902.
In some embodiments, terminal 900 can also optionally include: a peripheral interface 903 and at least one peripheral. The processor 901, memory 902, and peripheral interface 903 may be connected by buses or signal lines. Various peripheral devices may be connected to the peripheral interface 903 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 904, a display screen 905, a camera assembly 906, an audio circuit 907, a positioning assembly 908, and a power supply 909.
The peripheral interface 903 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 901 and the memory 902. In some embodiments, the processor 901, memory 902, and peripheral interface 903 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 901, the memory 902 and the peripheral interface 903 may be implemented on a separate chip or circuit board, which is not limited by this embodiment.
The Radio Frequency circuit 904 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 904 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 904 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 904 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 904 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 904 may also include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 905 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 905 is a touch display screen, it also has the ability to capture touch signals on or over its surface. The touch signal may be input to the processor 901 as a control signal for processing. In that case, the display screen 905 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 905, disposed on the front panel of the terminal 900; in other embodiments, there may be at least two display screens 905, each disposed on a different surface of the terminal 900 or in a foldable design; in still other embodiments, the display screen 905 may be a flexible display disposed on a curved or folded surface of the terminal 900. The display screen 905 may even be arranged in a non-rectangular irregular figure, that is, a shaped screen. The display screen 905 may be made using materials such as an LCD (Liquid Crystal Display) or an OLED (Organic Light-Emitting Diode).
The camera assembly 906 is used to capture images or video. Optionally, camera assembly 906 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 906 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The positioning component 908 is used to locate the current geographic location of the terminal 900 for navigation or LBS (Location Based Service). The positioning component 908 may be based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
In some embodiments, terminal 900 can also include one or more sensors 910. The one or more sensors 910 include, but are not limited to: acceleration sensor 911, gyro sensor 912, pressure sensor 913, fingerprint sensor 914, optical sensor 915, and proximity sensor 916.
The acceleration sensor 911 can detect the magnitude of acceleration in three coordinate axes of the coordinate system established with the terminal 900. For example, the acceleration sensor 911 may be used to detect the components of the gravitational acceleration in three coordinate axes. The processor 901 can control the display screen 905 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 911. The acceleration sensor 911 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 912 may detect a body direction and a rotation angle of the terminal 900, and the gyro sensor 912 may cooperate with the acceleration sensor 911 to acquire a 3D motion of the user on the terminal 900. The processor 901 can implement the following functions according to the data collected by the gyro sensor 912: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
The pressure sensor 913 may be disposed on a side bezel of the terminal 900 and/or underneath the display 905. When the pressure sensor 913 is disposed on the side frame of the terminal 900, the user's holding signal of the terminal 900 may be detected, and the processor 901 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 913. When the pressure sensor 913 is disposed at a lower layer of the display screen 905, the processor 901 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 905. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 914 is used for collecting a fingerprint of the user, and the processor 901 identifies the user according to the fingerprint collected by the fingerprint sensor 914, or the fingerprint sensor 914 identifies the user according to the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, processor 901 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings, etc. The fingerprint sensor 914 may be disposed on the front, back, or side of the terminal 900. When a physical key or vendor Logo is provided on the terminal 900, the fingerprint sensor 914 may be integrated with the physical key or vendor Logo.
The optical sensor 915 is used to collect ambient light intensity. In one embodiment, the processor 901 may control the display brightness of the display screen 905 based on the ambient light intensity collected by the optical sensor 915. Specifically, when the ambient light intensity is high, the display brightness of the display screen 905 is increased; when the ambient light intensity is low, the display brightness of the display screen 905 is reduced. In another embodiment, the processor 901 can also dynamically adjust the shooting parameters of the camera assembly 906 according to the ambient light intensity collected by the optical sensor 915.
Those skilled in the art will appreciate that the configuration shown in fig. 9 does not constitute a limitation of terminal 900, and may include more or fewer components than those shown, or may combine certain components, or may employ a different arrangement of components.
The computer device provided by the embodiment of the application may be provided as a server. Fig. 10 is a schematic structural diagram of a server according to an embodiment of the present application. The server 1000 may vary considerably in configuration or performance, and may include one or more processors (CPUs) 1001 and one or more memories 1002, where the memory 1002 stores at least one piece of program code, and the at least one piece of program code is loaded and executed by the processor 1001 to implement the method for acquiring a same-melody audio group provided by the various method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for performing input and output, and the server may further include other components for implementing device functions, which are not described here.
In an exemplary embodiment, a computer-readable storage medium is also provided, such as a memory including program code executable by a processor in a terminal or a server to perform the method for acquiring a same-melody audio group in the above embodiments. For example, the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact-disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by hardware associated with program code, and the program may be stored in a computer readable storage medium, and the above mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.
Claims (10)
1. A method for acquiring a same-melody audio group, the method comprising:
determining a plurality of audios corresponding to the target music name;
extracting characteristic information of each audio;
for each audio, determining the similarity between the characteristic information of the audio and the characteristic information of each other audio respectively, and determining the number of the similarities larger than a first preset threshold value as the number of similar audios corresponding to the audio;
determining the characteristic information of the benchmarks according to the number of similar audios corresponding to each audio;
and acquiring audios of which the similarity between the corresponding characteristic information and the benchmarking characteristic information is greater than a second preset threshold value from the plurality of audios to form a same-melody audio group corresponding to the target music name.
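Taken together, the steps of claim 1 can be sketched end to end as follows. This is a minimal illustration, not the patent's implementation: it assumes the feature information of each audio is a unit-normalized embedding vector, uses the dot product as the similarity measure, and takes the feature of the audio with the most similar audios as the benchmark (claims 2 to 7 refine that choice). The threshold values and function names are illustrative.

```python
import numpy as np

def same_melody_group(features, t1=0.8, t2=0.8):
    """Sketch of claim 1: pick out the audios of one song that share a melody.

    features: one embedding vector per audio (feature extraction is assumed).
    t1, t2: the first and second preset thresholds (illustrative values).
    """
    # Unit-normalize so the dot product acts as a similarity score.
    E = np.stack([f / np.linalg.norm(f) for f in features])
    sim = E @ E.T                    # pairwise similarity of feature vectors
    np.fill_diagonal(sim, -np.inf)   # exclude each audio's self-similarity
    counts = (sim > t1).sum(axis=1)  # similar-audio count per audio

    # Benchmark feature: here simply the feature of the audio with the most
    # similar audios; claims 2-7 describe more elaborate choices.
    benchmark = E[int(np.argmax(counts))]
    keep = (E @ benchmark) > t2
    return [i for i, k in enumerate(keep) if k]
```

Three near-identical renditions and one unrelated audio would thus yield a group containing only the first three indices.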
2. The method according to claim 1, wherein determining the benchmark feature information according to the number of similar audios corresponding to each audio comprises:
sorting the audios in descending order of their numbers of similar audios to obtain an audio sequence;
determining a related-feature-information number N for the benchmark feature information according to the maximum number M of similar audios in the audio sequence;
and determining the benchmark feature information according to the feature information of the first N audios in the audio sequence.
3. The method according to claim 2, wherein determining the related-feature-information number N for the benchmark feature information according to the maximum number M of similar audios in the audio sequence comprises:
determining the maximum number M of similar audios as the related-feature-information number N for the benchmark feature information.
4. The method according to claim 2, wherein determining the related-feature-information number N for the benchmark feature information according to the maximum number M of similar audios comprises:
if M is odd, determining the number of similar audios corresponding to the ((M+1)/2)-th audio in the audio sequence as the related-feature-information number N for the benchmark feature information;
and if M is even, determining the average of the number of similar audios corresponding to the (M/2)-th audio and the number of similar audios corresponding to the ((M/2)+1)-th audio in the audio sequence as the related-feature-information number N for the benchmark feature information.
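Claims 3 and 4 give two ways to derive N from the audio sequence of claim 2; claim 4 amounts to a median over positions selected by M. A sketch of the claim 4 variant follows. Note the even-M branch is a reconstruction: the published indices are garbled (they are not integers for even M), so the usual median pairing of the (M/2)-th and ((M/2)+1)-th entries is assumed. The function name is illustrative.

```python
def related_count_n(counts_sorted_desc):
    """Claim 4 sketch: derive N from the similar-audio counts.

    counts_sorted_desc: similar-audio counts, one per audio, sorted from
    largest to smallest (the audio sequence of claim 2).
    M is the largest count, i.e. the first element.
    """
    m = counts_sorted_desc[0]
    if m % 2 == 1:
        # M odd: the count at position (M+1)/2 (1-indexed) in the sequence.
        return counts_sorted_desc[(m + 1) // 2 - 1]
    # M even (reconstructed): average of the counts at positions M/2 and M/2+1.
    return (counts_sorted_desc[m // 2 - 1] + counts_sorted_desc[m // 2]) / 2
```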
5. The method according to claim 2, wherein determining the benchmark feature information according to the feature information of the first N audios in the audio sequence comprises:
calculating the average of the feature information of the first N audios in the audio sequence as the benchmark feature information.
6. The method according to claim 2, wherein determining the benchmark feature information according to the feature information of the first N audios in the audio sequence comprises:
determining at least one feature information group according to the feature information of the first N audios in the audio sequence, wherein the similarity between the feature information of every two audios within a feature information group is greater than a third preset threshold, and the similarity between feature information of audios belonging to any two different feature information groups is less than a fourth preset threshold;
and determining the benchmark feature information corresponding to each feature information group according to the feature information in that group.
7. The method of claim 6, wherein determining the benchmark feature information corresponding to each feature information group comprises:
calculating the average of all the feature information in each feature information group as the benchmark feature information corresponding to that group.
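Claims 6 and 7 can be sketched as a grouping pass over the top-N features followed by a per-group mean. The greedy single-pass grouping below is an assumption — the claims state only the pairwise-similarity conditions, not a procedure — and the similarity measure (dot product of unit vectors), threshold value, and names are illustrative.

```python
import numpy as np

def benchmark_features(top_n_feats, t3=0.9):
    """Claims 6-7 sketch: split the first N features into groups and take
    the mean of each group as that group's benchmark feature information.

    top_n_feats: the feature vectors of the first N audios in the sequence.
    t3: the third preset threshold (illustrative value).
    """
    groups = []
    for f in top_n_feats:
        f = f / np.linalg.norm(f)
        for g in groups:
            # Join a group only if similar to every member already in it.
            if all(f @ member > t3 for member in g):
                g.append(f)
                break
        else:
            groups.append([f])  # no group fits: start a new one
    # Claim 7: each group's benchmark is the mean of its feature vectors.
    return [np.mean(g, axis=0) for g in groups]
```

With two clusters of near-parallel vectors, this yields one benchmark vector per cluster.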
8. The method of claim 1, wherein determining the similarity between the feature information of the audio and the feature information of each other audio comprises:
determining the similarity between the feature information of the audio and the feature information of the other audio according to the formula d = 1 − E·Ecenter;
wherein d is the similarity between the feature information of the audio and the feature information of the other audio, E is the feature information of the audio, and Ecenter is the feature information of the other audio.
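The formula in claim 8 is garbled in the published translation ("d-1-E @ Ecenter"); one plausible reading, assumed here, is d = 1 − E·Ecenter with unit-normalized feature vectors, i.e. the cosine distance between the two embeddings. The sketch below implements that reading only; the normalization step is also an assumption.

```python
import numpy as np

def similarity_d(e, e_center):
    """Claim 8 sketch under the assumed reading d = 1 - E . Ecenter,
    where E and Ecenter are unit-normalized feature vectors."""
    e = np.asarray(e, dtype=float)
    e_center = np.asarray(e_center, dtype=float)
    e /= np.linalg.norm(e)
    e_center /= np.linalg.norm(e_center)
    return 1.0 - float(e @ e_center)  # 0 for identical directions
```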
9. A terminal, characterized in that the terminal comprises a processor and a memory, wherein the memory stores at least one program code, and the at least one program code is loaded and executed by the processor to perform the operations of the method for acquiring a same-melody audio group according to any one of claims 1 to 8.
10. A computer-readable storage medium having at least one program code stored therein, the at least one program code being loaded and executed by a processor to perform the operations of the method for acquiring a same-melody audio group according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110373096.6A CN113157968A (en) | 2021-04-07 | 2021-04-07 | Method, terminal and storage medium for acquiring a same-melody audio group |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110373096.6A CN113157968A (en) | 2021-04-07 | 2021-04-07 | Method, terminal and storage medium for acquiring a same-melody audio group |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113157968A true CN113157968A (en) | 2021-07-23 |
Family
ID=76888590
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110373096.6A Pending CN113157968A (en) | Method, terminal and storage medium for acquiring a same-melody audio group | 2021-04-07 | 2021-04-07 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113157968A (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101950302A (en) * | 2010-09-29 | 2011-01-19 | 李晓耕 | Method for managing immense amounts of music libraries based on mobile device |
CN103530426A (en) * | 2013-11-01 | 2014-01-22 | 深圳市金立通信设备有限公司 | Music search method, music search device and terminal equipment |
CN108132995A (en) * | 2017-12-20 | 2018-06-08 | 北京百度网讯科技有限公司 | For handling the method and apparatus of audio-frequency information |
US20180232446A1 (en) * | 2016-03-18 | 2018-08-16 | Tencent Technology (Shenzhen) Company Limited | Method, server, and storage medium for melody information processing |
CN108831423A (en) * | 2018-05-30 | 2018-11-16 | 腾讯音乐娱乐科技(深圳)有限公司 | Extract method, apparatus, terminal and the storage medium of theme track in audio data |
CN109189976A (en) * | 2018-09-20 | 2019-01-11 | 腾讯音乐娱乐科技(深圳)有限公司 | The method and apparatus for searching for audio data |
US20200104320A1 (en) * | 2017-12-29 | 2020-04-02 | Guangzhou Kugou Computer Technology Co., Ltd. | Method, apparatus and computer device for searching audio, and storage medium |
CN111309966A (en) * | 2020-03-20 | 2020-06-19 | 腾讯科技(深圳)有限公司 | Audio matching method, device, equipment and storage medium |
CN112015942A (en) * | 2020-08-28 | 2020-12-01 | 上海掌门科技有限公司 | Audio processing method and device |
Non-Patent Citations (1)
Title |
---|
罗宏,徐俊 (Luo Hong, Xu Jun): "MPEG标准中的音频技术" (Audio Technology in the MPEG Standards), 广东通信技术 (Guangdong Communication Technology), vol. 22, no. 5, 31 May 2002 (2002-05-31), pages 6 - 9 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109151593B (en) | Anchor recommendation method, device and storage medium | |
CN109168073B (en) | Method and device for displaying cover of live broadcast room | |
CN108320756B (en) | Method and device for detecting whether audio is pure music audio | |
CN110688082B (en) | Method, device, equipment and storage medium for determining adjustment proportion information of volume | |
CN110139143B (en) | Virtual article display method, device, computer equipment and storage medium | |
CN111711838B (en) | Video switching method, device, terminal, server and storage medium | |
CN112052354A (en) | Video recommendation method, video display method and device and computer equipment | |
CN110750734A (en) | Weather display method and device, computer equipment and computer-readable storage medium | |
CN111081277B (en) | Audio evaluation method, device, equipment and storage medium | |
CN112667844A (en) | Method, device, equipment and storage medium for retrieving audio | |
CN111061405A (en) | Method, device and equipment for recording song audio and storage medium | |
CN113918767A (en) | Video clip positioning method, device, equipment and storage medium | |
CN109961802B (en) | Sound quality comparison method, device, electronic equipment and storage medium | |
CN109547847B (en) | Method and device for adding video information and computer readable storage medium | |
CN109189978B (en) | Method, device and storage medium for audio search based on voice message | |
CN112699268A (en) | Method, device and storage medium for training scoring model | |
CN110853124B (en) | Method, device, electronic equipment and medium for generating GIF dynamic diagram | |
CN109788308B (en) | Audio and video processing method and device, electronic equipment and storage medium | |
CN109036463B (en) | Method, device and storage medium for acquiring difficulty information of songs | |
CN112069350A (en) | Song recommendation method, device, equipment and computer storage medium | |
CN108831423B (en) | Method, device, terminal and storage medium for extracting main melody tracks from audio data | |
CN111563201A (en) | Content pushing method, device, server and storage medium | |
CN111368136A (en) | Song identification method and device, electronic equipment and storage medium | |
CN111063372B (en) | Method, device and equipment for determining pitch characteristics and storage medium | |
CN109618018B (en) | User head portrait display method, device, terminal, server and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||