CN111143604A - Audio similarity matching method and device and storage medium - Google Patents

Info

Publication number: CN111143604A (application CN201911353609.6A; granted as CN111143604B)
Authority: CN (China)
Original language: Chinese (zh)
Inventor: 王征韬
Assignee (original and current): Tencent Music Entertainment Technology Shenzhen Co Ltd
Legal status: Active (granted)
Prior art keywords: audio, similarity, sample, network model, similar

Classifications

    • G06F16/635 — Information retrieval of audio data; querying; filtering based on additional data, e.g. user or group profiles
    • G06F16/68 — Information retrieval of audio data; retrieval characterised by using metadata
    • G06F18/22 — Pattern recognition; matching criteria, e.g. proximity measures
    • G06N3/045 — Neural networks; architecture; combinations of networks
    • G06N3/08 — Neural networks; learning methods

Abstract

The embodiment of the invention discloses a method and a device for matching audio similarity and a storage medium. The scheme determines a similar user group from a plurality of users according to the audio list of the users; determining a characteristic audio set of a similar user group based on user behavior data corresponding to the audio; calculating the similarity between sample audio pairs in the training set according to the characteristic audio set, and determining a positive sample and a negative sample in the training set according to the similarity between the sample audio pairs; training a twin network model by using a positive sample and a negative sample, wherein the twin network model comprises two basic networks for obtaining a feature vector of audio, and the two basic networks have the same structure and share weight; and matching the similarity of the audio based on the trained twin network model. The accuracy of audio similarity matching is improved.

Description

Audio similarity matching method and device and storage medium
Technical Field
The invention relates to the technical field of audio processing, in particular to a method and a device for matching similarity of audio and a storage medium.
Background
With the popularization of networks and the ease of producing songs, thousands of new songs appear every day, and the number of songs will grow exponentially in the future. As songs accumulate, users also show clear personalized music preferences. For example, a user who likes a certain song often wishes to keep listening to songs of that type. In a huge song library, how to recommend other songs similar to a song the user prefers has become a problem to be solved.
Traditional song recommendation mostly adopts collaborative filtering, and most collaborative filtering methods search for similar songs based on the songs' attribute information, such as artist, genre, and language; for example, songs with the same genre and language labels are regarded as similar and recommended together.
However, collaborative filtering depends too heavily on the songs' attribute information: similarity can be judged only for songs that carry enough attribute information, and only those songs have a chance of entering the recommendation pool.
Disclosure of Invention
The embodiment of the invention provides an audio similarity matching method, an audio similarity matching device and a storage medium, and aims to improve the accuracy of audio similarity matching.
The embodiment of the invention provides an audio similarity matching method, which comprises the following steps:
determining a similar user group from a plurality of users according to the audio list of the users;
determining a characteristic audio set of the similar user group based on user behavior data corresponding to audio;
calculating the similarity between sample audio pairs in a training set according to the characteristic audio set, and determining a positive sample and a negative sample in the training set according to the similarity between the sample audio pairs;
training a twin network model by using the positive sample and the negative sample, wherein the twin network model comprises two basic networks for obtaining the feature vector of the audio, and the two basic networks have the same structure and share the weight;
and carrying out similarity matching of audio based on the trained twin network model.
The embodiment of the present invention further provides an audio similarity matching apparatus, including:
the user clustering unit is used for determining a similar user group from a plurality of users according to the audio list of the users;
the set determining unit is used for determining a characteristic audio set of the similar user group based on user behavior data corresponding to audio;
the sample acquisition unit is used for calculating the similarity between sample audio pairs in a training set according to the characteristic audio set and determining a positive sample and a negative sample in the training set according to the similarity between the sample audio pairs;
a model training unit, configured to train a twin network model using the positive samples and the negative samples, where the twin network model includes two basic networks for obtaining feature vectors of audio, and the two basic networks have the same structure and share weights;
and the audio matching unit is used for matching the similarity of the audio based on the trained twin network model.
The embodiment of the invention also provides a storage medium, where a plurality of instructions are stored in the storage medium, and the instructions are suitable for being loaded by a processor to execute any of the audio similarity matching methods provided by the embodiments of the present invention.
According to the audio similarity matching scheme provided by the embodiment of the invention, a similar user group is determined from a plurality of users according to the users' audio lists; a characteristic audio set of the similar user group is determined based on user behavior data corresponding to the audio; the similarity between sample audio pairs in the training set is calculated according to the characteristic audio sets, and positive and negative samples in the training set are determined according to those similarities; the twin network model is trained using the positive and negative samples, where the twin network model includes two basic networks for obtaining feature vectors of audio, the two basic networks having the same structure and sharing weights; and similarity matching of audio is performed based on the trained twin network model. In this scheme, the similarity between audios is preliminarily determined from the similar user groups and used to train the twin network model. When similar songs of a given song are searched for, the twin network model matches the similarity between songs without requiring the songs' attribute information; matching can be performed from the songs alone, which improves the accuracy of similar-song matching.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1a is a first flowchart of a similarity matching method for audio according to an embodiment of the present invention;
FIG. 1b is a schematic structural diagram of a twin network model in the audio similarity matching method according to an embodiment of the present invention;
fig. 2a is a schematic diagram of a second process of the similarity matching method for audio according to the embodiment of the present invention;
fig. 2b is a third flow chart of the audio similarity matching method according to the embodiment of the present invention;
fig. 3a is a schematic diagram of a first structure of an audio similarity matching apparatus according to an embodiment of the present invention;
fig. 3b is a schematic diagram of a second structure of the audio similarity matching apparatus according to the embodiment of the present invention;
FIG. 3c is a schematic diagram of a third structure of an audio similarity matching apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The embodiment of the present invention provides an audio similarity matching method, where an execution subject of the audio similarity matching method may be the audio similarity matching device provided in the embodiment of the present invention, or an electronic device integrated with the audio similarity matching device, where the audio similarity matching device may be implemented in a hardware or software manner. Wherein the electronic device may be a server.
Referring to fig. 1a, fig. 1a is a first flow chart of an audio similarity matching method according to an embodiment of the present invention. The specific process of the audio similarity matching method may be as follows:
101. A similar user group is determined from a plurality of users according to the users' audio lists.
The audio in this embodiment may be audio data in various forms, such as songs or the audio tracks of videos. A song may have lyrics and rhythm, or may be pure instrumental music with a melody and no lyrics. The following description takes songs as an example. The scheme can be applied to servers of music applications, music websites, and the like. Taking a music application as an example, its server maintains a song library storing a large number of songs, some of which have a certain similarity to one another. Once the similarity between songs has been matched, the server can recommend similar songs to a user according to the user's listening habits or a recommendation request sent by a client, improving the efficiency of music operations.
In this scheme, the similarity between songs is determined from the song content together with the user behavior data generated when users listen to songs, without acquiring attribute information of the songs. The user behavior data includes data such as the play count, collection count, and comment count of songs. Specifically, the scheme determines the similarity between songs mainly through a twin network model. The twin network model includes two basic networks for obtaining feature vectors of songs; the two basic networks have the same structure and share weights.
The method of training the twin network model is described below. To train the twin network model, a training set needs to be prepared. When using the music application, a user creates song lists according to personal listening habits and preferences and adds audio to those song lists, so the similarity between two users can be determined from their song collection lists.
In some embodiments, "determining a group of similar users from among a plurality of users according to an audio list of users" may include: acquiring the audio lists of the users, and calculating the Jaccard coefficient of the audio lists of every two users as the similarity between those two users; and dividing the users into a plurality of similar user groups according to the similarities between users, where the similarity between any two users in one similar user group is greater than a first preset threshold.
In this embodiment, a training set may be constructed from the song-related data of all or some registered users of the music application, and the similarity between the positive and negative sample song pairs in the training set is determined; the number of users can be set as needed. For each pair of these users, the similarity between the two users is calculated from their song collection lists. For example, the set similarity of the two users' song collections is computed as the Jaccard coefficient (also called the Jaccard similarity coefficient) of the two collections. The Jaccard coefficient of user A's song collection list S_A and user B's song collection list S_B can be calculated as J_AB = |S_A ∩ S_B| / |S_A ∪ S_B|.
After the similarity between every two users is determined, users whose mutual similarity is greater than the first preset threshold are divided into a similar user group, and all the selected users can be divided into n similar user groups U_i, i ∈ {1, …, n}. It can be understood that, to improve the reliability of the data, similar user groups containing fewer than a certain number of users are filtered out and only larger groups are used; for example, similar user groups with fewer than 1000 users are discarded.
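As an illustrative sketch of the user-grouping step (the greedy grouping strategy, the threshold value, and the data below are all assumptions; the patent does not prescribe a specific clustering algorithm):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two song-collection lists."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def group_similar_users(collections: dict, threshold: float) -> list:
    """Greedily place each user into the first group where the user's
    similarity to every existing member exceeds the first preset threshold
    (per the "any two users" condition above); otherwise start a new group."""
    groups = []
    for user, songs in collections.items():
        placed = False
        for group in groups:
            if all(jaccard(songs, collections[u]) > threshold for u in group):
                group.append(user)
                placed = True
                break
        if not placed:
            groups.append([user])
    return groups

# Hypothetical song-collection lists for three users
collections = {
    "A": {"s1", "s2", "s3"},
    "B": {"s1", "s2", "s4"},
    "C": {"s9"},
}
groups = group_similar_users(collections, threshold=0.4)  # → [["A", "B"], ["C"]]
```

Users A and B share two of four distinct songs (Jaccard 0.5 > 0.4), so they form one group; user C has no overlap and stands alone.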
102. And determining a characteristic audio set of the similar user group based on the user behavior data corresponding to the audio.
After a plurality of similar user groups are determined, all collected songs of all users in the similar user groups are used as a candidate song set of the similar user groups, and songs meeting a certain condition are selected from the candidate song set to be used as a characteristic audio set of the similar user groups.
In some embodiments, the user behavior data is the collection count. In this case, "determining a characteristic audio set of a similar user group based on user behavior data corresponding to audio" may include: determining, from the audios corresponding to the similar user group, the audios whose collection count is greater than a second preset threshold, to form the characteristic audio set of the similar user group.
In this embodiment, the feature song set (i.e., the characteristic audio set) corresponding to a similar user group is determined according to the collection counts of songs. For example, for a similar user group, the collection counts of all songs collected by the users in the group are tallied, and the songs whose collection count exceeds the second preset threshold form the characteristic song set. Suppose a similar user group U_i contains ten thousand users and the second preset threshold is one thousand: if a song is collected by one thousand of those ten thousand users, the song is added to the characteristic song set of the group. This second preset threshold is only illustrative; in other embodiments it may be set to other values according to the accuracy requirements of the model and the number of songs in the song library.
Alternatively, in other embodiments, the user behavior data is the play count, and the characteristic song set corresponding to the similar user group is selected according to play counts.
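The collection-count variant described above can be sketched as follows (the threshold and sample data are illustrative assumptions):

```python
from collections import Counter

def characteristic_audio_set(group_collections, min_collect_count):
    """Count, across all users in one similar user group, how many users
    collected each song, and keep the songs whose collection count exceeds
    the second preset threshold."""
    counts = Counter(song for songs in group_collections for song in songs)
    return {song for song, n in counts.items() if n > min_collect_count}

# Hypothetical group of three users: s1 collected 3 times, s2 twice, s3 once
group = [{"s1", "s2"}, {"s1", "s3"}, {"s1", "s2"}]
feature_set = characteristic_audio_set(group, min_collect_count=2)  # → {"s1"}
```

Only `s1` clears the threshold of 2 collections, so it alone enters the characteristic set.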
103. The similarity between sample audio pairs in the training set is calculated according to the characteristic audio sets, and positive and negative samples in the training set are determined according to the similarity between the sample audio pairs.
After the similar user groups and the feature song sets corresponding to the similar user groups are determined, the similarity between the sample audio pairs in the training set is calculated according to the feature song sets, and a positive sample and a negative sample are determined. Wherein the sample audio pair is a sample song pair.
All the songs of the user selected in 101 may be used as the songs in the training set, or all or part of the songs may be selected from all the feature song sets to form the training set. Any two songs in the training set constitute a sample song pair.
The similarity between the two songs in a sample song pair is calculated as follows. "Calculating the similarity between the sample audio pairs in the training set from the feature audio set" may include: for any sample audio pair in the training set, counting the number of characteristic audio sets that contain both audios of the pair; and dividing that number by the total number of characteristic audio sets to obtain the similarity of the sample audio pair.
Assume a sample song pair consists of song C and song D. Define P as the probability that song C and song D appear in the same characteristic song set: P = (the number of characteristic song sets containing both song C and song D) / (the total number of characteristic song sets). Since the songs in the characteristic song set of a similar user group can be considered to have a certain similarity, P can be used as the similarity between song C and song D.
In some embodiments, the "determining positive and negative examples in the training set according to the similarity between the sample audio pairs" may include:
and taking the sample song pair with the similarity larger than a fourth preset threshold as a positive sample, and taking the sample song pair with the similarity not larger than a fifth preset threshold as a negative sample, wherein the fourth preset threshold is larger than or equal to the fifth preset threshold. For example, the fourth preset threshold is 0.6, and the fifth preset threshold is 0.3; for another example, the fourth preset threshold and the fifth preset threshold are both 0.5.
Since the twin network model is used in this embodiment, the positive sample and the negative sample are both in the form of a song pair, that is, one positive sample contains two songs whose similarity is greater than the fourth preset threshold, and one negative sample contains two songs whose similarity is not greater than the fifth preset threshold.
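The pair-similarity computation and the threshold-based labeling above can be sketched together (the threshold values 0.6 and 0.3 are the example values from the text; the handling of pairs that fall between the two thresholds — discarding them — is an assumption):

```python
def pair_similarity(song_c, song_d, feature_sets):
    """P = (number of characteristic sets containing both songs) / (total sets)."""
    both = sum(1 for s in feature_sets if song_c in s and song_d in s)
    return both / len(feature_sets)

def label_pair(similarity, pos_threshold=0.6, neg_threshold=0.3):
    """Positive if similarity exceeds the fourth preset threshold, negative if
    it does not exceed the fifth; pairs in between are left unlabeled here."""
    if similarity > pos_threshold:
        return "positive"
    if similarity <= neg_threshold:
        return "negative"
    return None

# Hypothetical characteristic song sets: "c" and "d" co-occur in 2 of 4 sets
feature_sets = [{"c", "d"}, {"c"}, {"c", "d", "e"}, {"e"}]
sim = pair_similarity("c", "d", feature_sets)  # → 0.5
```

With the example thresholds, a similarity of 0.5 is neither positive nor negative, while 0.7 would label the pair positive and 0.2 negative.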
104. The twin network model is trained using the positive and negative samples, where the twin network model includes two basic networks for obtaining feature vectors of audio, the two basic networks having the same structure and sharing weights.
After positive and negative examples in the training set are determined, the twin network model is trained using the positive and negative examples.
In some embodiments, "training the twin network model using positive and negative examples" may include: extracting audio features of the audio in the positive sample and the negative sample; and training the twin network model based on the audio features and the similarity of the positive samples and the audio features and the similarity of the negative samples until the loss value of the loss function reaches the minimum value.
Audio features of the songs in the positive and negative samples, such as mel spectrum features, are extracted. The mel spectrum features serve as input data of the twin network model, and the similarity between the sample pair calculated above serves as the target output. The model is then trained iteratively: in each training round, a loss value of the twin network model is obtained from the loss function and the model's output, and the parameters of the twin network model are adjusted according to that loss value. Training stops when the loss value of the loss function reaches its minimum; the model parameters are then fixed, yielding the trained twin network model.
The loss function L(F_1, F_2, Y) may take the form of a contrastive loss, for example:

L(F_1, F_2, Y) = Y · D(F_1, F_2)² + (1 − Y) · max(m − D(F_1, F_2), 0)²

where F_1 and F_2 are the feature vectors corresponding to the audio features output by the two basic networks, Y is the similarity between the two songs, m is a margin, and D(F_1, F_2) is a distance function used to calculate the distance between the feature vectors. The distance function may be a cosine distance, a Euclidean distance, or the like; this embodiment does not particularly limit it, and any function capable of calculating the distance between two vectors may be used.
Referring to fig. 1b, fig. 1b is a schematic structural diagram of the twin network model in the audio similarity matching method according to the embodiment of the present invention. The twin network model used in this embodiment includes two structurally identical basic networks that share weights. Having the same structure means that the two networks contain the same layers and the same number of neurons in each layer.
During training, the two basic networks learn jointly; the content each network learns differs and is complementary. The model's final training objective is to make the distance between two similar inputs as small as possible and the distance between two dissimilar inputs as large as possible. Because the two basic networks learn jointly, the resulting twin network model achieves higher accuracy and better performance than a single network, and the feature vectors it outputs allow the similarity between two audios to be calculated more accurately.
In addition, the basic network in this embodiment is not particularly limited; it may be any network capable of learning audio features, such as a multilayer convolutional neural network, a recurrent neural network, a Transformer network, a VGG (Visual Geometry Group) network, or a ResNet (deep residual network).
In some embodiments, "extracting audio features of the audio in the positive and negative examples" may include: for any audio in the positive samples and the negative samples, dividing the audio into a plurality of audio segments; carrying out short-time Fourier transform on each audio clip to obtain a frequency domain signal; carrying out Mel scale transformation on the frequency domain signal to obtain Mel frequency spectrum characteristics of the audio frequency fragment; and combining the Mel frequency spectrum characteristics of the plurality of audio frequency segments to obtain the audio frequency characteristics of the audio frequency.
In this embodiment, a complete song is divided into a plurality of short audio segments, and each segment undergoes a short-time Fourier transform and a mel scale transform to obtain its mel spectrum features. Taking the mel spectrum features of one audio segment as one row, the features of all segments are stacked vertically to form the mel spectrum features of the complete song. It can be understood that the scheme is not limited to obtaining the mel spectrum of the audio in this manner as the audio feature; in other embodiments, the audio features of a song may be extracted in other ways, as long as they reflect the characteristics of the audio content.
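A dependency-light sketch of this feature-extraction pipeline — windowed STFT followed by a triangular mel filterbank. The window size, hop, sample rate, and mel-band count are assumed values, and a library such as librosa would normally replace this hand-rolled version:

```python
import numpy as np

def mel_filterbank(n_mels, n_fft, sr):
    """Simplified triangular mel filterbank (no area normalization)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):                      # rising slope
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                      # falling slope
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mel_spectrogram(signal, sr=16000, n_fft=512, hop=256, n_mels=40):
    """Short-time Fourier transform, then mel-scale mapping, as in the
    feature-extraction step above.  Returns (frames, n_mels)."""
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return power @ mel_filterbank(n_mels, n_fft, sr).T

audio = np.random.default_rng(0).standard_normal(16000)  # 1 s of test noise
feat = mel_spectrogram(audio)
```

Each row of `feat` corresponds to one short audio segment; stacking the rows gives the mel spectrum features of the whole clip, matching the row-wise stacking described above.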
105. And matching the similarity of the audio based on the trained twin network model.
After the twin network model is obtained through training, song similarity matching can be performed according to the twin network model, and therefore similar songs can be recommended for the user. There are various similarity matching methods for songs, and three of them are listed below for explanation.
In the first mode, when calculating the similarity of song E and song F, the audio features of song E and song F are extracted and input into the trained twin network model, which calculates and outputs the similarity of song E and song F.
In the second mode, all songs in the song library are combined pairwise, and the twin network model is used to calculate the similarity between every two songs, forming a similarity matrix in which each row or column holds the similarities between one song and all other songs in the library. When querying for songs similar to a given song, the songs whose similarity to it exceeds a certain threshold are looked up in the similarity matrix.
In the third mode, the twin network model is used to calculate the feature vector of every song in the song library, forming a feature vector library. When songs similar to a given song are queried, an approximate vector search is performed in the feature vector library based on that song's feature vector, for example using a perceptual hash, LSH (Locality-Sensitive Hashing), the Annoy algorithm, or the like, so that similar vectors can be found quickly among a large number of vectors.
In practical application, different ways can be selected according to the number of songs in the song library to search for similar songs. For example, when the number of songs stored in the song library is small, the scheme of the first mode or the second mode can be used for recommending similar songs, and when the number of songs stored in the song library is large, the scheme of the third mode can be used for recommending similar songs. Alternatively, in other embodiments, similar song searches may be performed in other ways based on the twin network model.
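The third mode can be sketched with an exact cosine search over a small feature-vector library (the vectors below are illustrative; a real deployment at song-library scale would substitute an approximate index such as Annoy or LSH, as noted above):

```python
import numpy as np

def build_vector_library(feature_vectors):
    """L2-normalize the rows so that a dot product equals cosine similarity."""
    v = np.asarray(feature_vectors, dtype=float)
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def most_similar(library, query_index, top_k=2):
    """Exact nearest-neighbour search by cosine similarity over the
    normalized feature-vector library."""
    sims = library @ library[query_index]
    order = np.argsort(-sims)
    return [int(i) for i in order if i != query_index][:top_k]

# Three hypothetical 2-D song embeddings (real embeddings would be larger)
vecs = build_vector_library([[1, 0], [0.9, 0.1], [0, 1]])
neighbours = most_similar(vecs, query_index=0, top_k=2)  # → [1, 2]
```

Song 1 points almost the same direction as the query (cosine ≈ 0.99), so it ranks first; the orthogonal song 2 ranks last.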
Furthermore, it will be appreciated that the training set may be periodically updated to retrain the twin network model, updating the parameters of the model to ensure that the model has a higher accuracy.
In particular implementation, the present application is not limited by the execution sequence of the described steps, and some steps may be performed in other sequences or simultaneously without conflict.
As described above, in the audio similarity matching method provided by the embodiment of the present invention, a similar user group is determined from a plurality of users according to the users' audio lists; a characteristic audio set of the similar user group is determined based on user behavior data corresponding to the audio; the similarity between sample audio pairs in the training set is calculated according to the characteristic audio sets, and positive and negative samples in the training set are determined according to those similarities; the twin network model, which includes two basic networks of identical structure and shared weights for obtaining feature vectors of audio, is trained using the positive and negative samples; and similarity matching of audio is performed based on the trained twin network model. In this scheme, the similarity between audios is preliminarily determined from the similar user groups and used to train the twin network model. When similar songs of a given song are searched for, the twin network model matches the similarity between songs without requiring the songs' attribute information; matching can be performed from the songs alone, which improves the accuracy of similar-song matching.
The method according to the preceding embodiment is illustrated in further detail below by way of example.
Referring to fig. 2a, fig. 2a is a second flow chart of the audio similarity matching method according to the embodiment of the present invention. The method comprises the following steps:
201. Receive an audio recommendation request, where the audio recommendation request indicates a query for audio similar to a target audio.
202. Obtain a first feature vector of the target audio.
The audio recommendation request may be a song recommendation request, and the user may send the song recommendation request to the server through a music application program of the client, where the audio recommendation request carries identification information of the target song. When receiving an audio recommendation request, a server may obtain a first feature vector of a target song, for example, obtain a corresponding first feature vector from a feature vector library according to identification information of a target audio; or extracting the audio features of the target audio, and calculating a first feature vector of the target audio according to the audio features and the twin network model.
203. Search the feature vector library for a first preset number of second feature vectors with the highest similarity to the first feature vector, where the feature vector library is formed from the feature vectors of the audios in the song library computed with the trained twin network model, and the twin network model is trained on the users' collected songs.
According to the twin network model obtained by training in the above embodiment, the feature vector of each song in the song library is calculated, that is, the output of the base network in the twin network model is obtained. And forming a feature vector library by using the feature vectors of all the songs, wherein each feature vector in the feature vector library corresponds to one song. Please refer to the above embodiments, and details are not repeated herein.
After the first feature vector is obtained, an approximate vector search is performed in the feature vector library based on the first feature vector to find the second feature vectors closest to it; the number of second feature vectors can be set as required. For example, approximate nearest-neighbor techniques such as perceptual hashing, LSH, or Annoy can quickly find the vectors most similar to a given vector among a large number of vectors.
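For a small library, an exact search already illustrates the retrieval step. The sketch below ranks library vectors by cosine similarity; at scale, an approximate index such as LSH or Annoy would replace this brute-force scan (the function name and parameters here are illustrative):

```python
import numpy as np

def top_k_similar(query, vector_library, k):
    # Normalize rows so that a dot product equals cosine similarity.
    lib = vector_library / np.linalg.norm(vector_library, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = lib @ q
    # Indices of the k library vectors most similar to the query, best first.
    return np.argsort(scores)[::-1][:k]
```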
204. In response to the audio recommendation request, push the audio corresponding to the second feature vectors to the terminal that issued the request.
After the second feature vectors are found, the songs corresponding to them are pushed, in response to the audio recommendation request, to the terminal that issued the request. For example, information such as the name and singer of each corresponding song is pushed to the terminal.
As described above, the audio similarity matching method provided in the embodiment of the present invention computes the feature vector of each song in the song library in advance using the trained twin network model to form a feature vector library. When a song recommendation request asking for songs similar to a target song is received, an approximate vector search is performed in the feature vector library using the first feature vector of the target song to obtain the second feature vectors most similar to it, and the corresponding songs are pushed to the user.
Referring to fig. 2b, fig. 2b is a third flow chart of the audio similarity matching method according to the embodiment of the invention. The method comprises the following steps:
205. Calculate the similarity between every two audios in the song library according to the twin network model to obtain a similarity matrix for the song library, where the twin network model is trained on the users' collected songs.
In this embodiment, after the trained twin network model is obtained, the similarity between every two songs in the server's song library may be calculated. For example, for any two songs, their audio features are extracted and input into the trained twin network model, which outputs the similarity of the two songs. From the similarities between every two songs, a similarity matrix for the whole song library is obtained. Assuming the song library contains 1000 songs, a 1000 × 1000 similarity matrix is finally obtained, and the similarity between any two songs in the library can be looked up in this matrix.
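If the base-network embedding of every song is kept, the pairwise similarities can be computed in one step. A sketch, assuming cosine similarity of embeddings stands in for the model's similarity output:

```python
import numpy as np

def similarity_matrix(embeddings):
    # Row-normalize, so entry (i, j) of the product is the cosine
    # similarity between song i and song j.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return e @ e.T
```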
206. Obtain a similar audio list for each audio in the song library from the similarity matrix, where two audios are judged to be similar when the similarity between them is greater than a third preset threshold.
After the similarity matrix corresponding to the song library is obtained, each row (or column) of the matrix gives the similarities between one song and all other songs in the library. A list of similar songs for each song can therefore be derived from the matrix. For example, for song A in the song library, the songs whose similarity is greater than the third preset threshold are taken from the row (or column) of song A; these songs form the similar song list of song A, which is stored in association with song A. In this way, a similar song list is obtained for every song in the library.
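Deriving the per-song similar lists from the matrix can be sketched as follows (the names and the threshold value are illustrative):

```python
def similar_audio_lists(matrix, song_ids, threshold):
    """For each song, collect the other songs whose similarity exceeds
    the third preset threshold, most similar first."""
    lists = {}
    for i, song in enumerate(song_ids):
        hits = [(matrix[i][j], song_ids[j])
                for j in range(len(song_ids))
                if j != i and matrix[i][j] > threshold]
        # Sort by similarity, descending, and keep only the song ids.
        lists[song] = [s for _, s in sorted(hits, reverse=True)]
    return lists
```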
207. Receive an audio recommendation request, where the audio recommendation request indicates a query for audio similar to a target audio.
208. Obtain the similar audio list corresponding to the target audio, and respond to the audio recommendation request according to that list.
When the server receives the audio recommendation request, it can obtain the similar song list of the target song and push, as required, all of the list or the part with the highest similarity to the terminal that issued the request.
As described above, the audio similarity matching method provided in the embodiment of the present invention computes the similarity between every two songs in the song library in advance using the trained twin network model to form a similarity matrix, from which the similarity of any two songs can be matched. The scheme does not depend on the songs' attribute information but matches based on the songs' content, which improves the matching accuracy for similar songs.
To implement the method, an embodiment of the present invention further provides an audio similarity matching apparatus, which may be integrated in a terminal device such as a mobile phone or a tablet computer.
For example, please refer to fig. 3a, fig. 3a is a first structural diagram of an audio similarity matching apparatus according to an embodiment of the present invention. The audio similarity matching device may include a user clustering unit 301, a set determining unit 302, a sample obtaining unit 303, a model training unit 304, and an audio matching unit 305, as follows:
a user clustering unit 301, configured to determine a similar user group from multiple users according to the users' audio lists;
a set determining unit 302, configured to determine a feature audio set of the similar user group based on user behavior data corresponding to audio;
a sample obtaining unit 303, configured to calculate a similarity between sample audio pairs in a training set according to the feature audio set, and determine a positive sample and a negative sample in the training set according to the similarity between the sample audio pairs;
a model training unit 304, configured to train a twin network model using the positive samples and the negative samples, where the twin network model includes two basic networks for obtaining feature vectors of audio, and the two basic networks have the same structure and share weights;
an audio matching unit 305, configured to perform audio similarity matching based on the trained twin network model.
Referring to fig. 3c, fig. 3c is a schematic diagram illustrating a third structure of an audio similarity matching apparatus according to an embodiment of the present invention.
in some embodiments, the user clustering unit 301 is further configured to: acquiring audio lists of users, and calculating Jacard coefficients of the audio lists of every two users as similarity between the two users; and dividing the users into a plurality of similar user groups according to the similarity between the users, wherein the similarity between any two users in one similar user group is greater than a first preset threshold value.
In some embodiments, the sample obtaining unit 303 is further configured to: for any sample audio pair in the training set, count the number of characteristic audio sets that contain both audios of the pair; and divide this number by the total number of characteristic audio sets to obtain the similarity of the sample audio pair.
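This co-occurrence ratio can be sketched directly:

```python
def pair_similarity(audio_pair, feature_sets):
    """Number of characteristic audio sets containing both audios of the
    pair, divided by the total number of characteristic audio sets."""
    a, b = audio_pair
    both = sum(1 for s in feature_sets if a in s and b in s)
    return both / len(feature_sets)
```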
In some embodiments, the model training unit 304 is further configured to: extract audio features of the audio in the positive and negative samples; and train the twin network model based on the audio features and similarities of the positive samples and of the negative samples until the loss value of the loss function reaches its minimum.
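The patent does not name a specific loss function; a contrastive-style loss is one common choice for twin networks and is sketched here purely as an assumption (the margin value is illustrative):

```python
def contrastive_loss(sim_pred, label, margin=0.5):
    # Positive pair (label 1): penalize any shortfall from similarity 1.
    if label == 1:
        return (1.0 - sim_pred) ** 2
    # Negative pair: penalize only predicted similarity above the margin.
    return max(sim_pred - margin, 0.0) ** 2
```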
In some embodiments, model training unit 304 is further configured to:
for any audio of a positive or negative sample, dividing the audio into multiple audio segments;
performing a short-time Fourier transform on each audio segment to obtain a frequency-domain signal;
applying a mel-scale transformation to the frequency-domain signal to obtain the mel spectrum features of the audio segment;
and combining the mel spectrum features of the multiple audio segments to obtain the audio features of the audio.
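The four steps above can be sketched with NumPy alone. Every parameter here (sample rate, FFT size, hop length, number of mel bands, segment length) is an illustrative assumption, and a production system would typically use an audio library's mel-spectrogram routine instead:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters spaced evenly on the mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(segment, sr=22050, n_fft=1024, hop=512, n_mels=64):
    # Short-time Fourier transform via a sliding Hann window.
    window = np.hanning(n_fft)
    frames = [segment[i:i + n_fft] * window
              for i in range(0, len(segment) - n_fft + 1, hop)]
    spectrum = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # power spectrum
    # Mel-scale transformation of the frequency-domain signal.
    return spectrum @ mel_filterbank(sr, n_fft, n_mels).T

def audio_features(audio, segment_len, **kwargs):
    # Divide the audio into segments and stack their mel spectra.
    segments = [audio[i:i + segment_len]
                for i in range(0, len(audio) - segment_len + 1, segment_len)]
    return np.concatenate([mel_spectrogram(s, **kwargs) for s in segments])
```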
In some embodiments, the user behavior data is collection (favoriting) behavior; the set determining unit 302 is further configured to: from the audios corresponding to the similar user group, determine the audios whose collection count is greater than a second preset threshold to form the characteristic audio set of the similar user group.
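A sketch of this filtering step (the group's collections and the threshold are illustrative):

```python
from collections import Counter

def feature_audio_set(group_collections, threshold):
    """group_collections: one collected-song list per user in a similar
    user group. Keep the songs collected by more than `threshold` users."""
    counts = Counter(song for songs in group_collections for song in set(songs))
    return {song for song, n in counts.items() if n > threshold}
```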
Referring to fig. 3b, fig. 3b is a schematic diagram illustrating a second structure of an audio similarity matching apparatus according to an embodiment of the present invention.
In some embodiments, the apparatus for matching similarity of audio may further include a first recommending unit 306, where the first recommending unit 306 is configured to:
calculating the feature vectors of the audios in the song library according to the trained twin network model to form a feature vector library; receiving an audio recommendation request, wherein the audio recommendation request indicates a query for audio similar to a target audio; acquiring a first feature vector of the target audio; searching the feature vector library for a first preset number of second feature vectors with the highest similarity to the first feature vector; and responding to the audio recommendation request by pushing the audio corresponding to the second feature vectors to the terminal corresponding to the request.
In some embodiments, the first recommending unit 306 is further configured to: acquiring a corresponding first feature vector from the feature vector library according to the identification information of the target audio; or extracting the audio feature of the target audio, and calculating a first feature vector of the target audio according to the audio feature and the twin network model.
In some embodiments, the apparatus for matching similarity of audio may further include a second recommending unit 307, where the second recommending unit 307 is configured to:
calculating the similarity between every two audios in a music library according to the twin network model to obtain a similarity matrix corresponding to the music library; and obtaining a similar audio list of each audio in the music library according to the similarity matrix, wherein when the similarity between the two audios is greater than a third preset threshold value, the two audios are judged to be similar audios.
In some embodiments, the second recommending unit 307 is further configured to:
receiving an audio recommendation request, wherein the audio recommendation request indicates a query for audio similar to a target audio; and acquiring the similar audio list corresponding to the target audio, and responding to the audio recommendation request according to the similar audio list.
In a specific implementation, the above units may be implemented as independent entities, or may be combined arbitrarily to be implemented as the same or several entities, and the specific implementation of the above units may refer to the foregoing method embodiments, which are not described herein again.
It should be noted that the audio similarity matching device provided in the embodiment of the present invention and the audio similarity matching method in the foregoing embodiment belong to the same concept, and any method provided in the audio similarity matching method embodiment may be run on the audio similarity matching device, and a specific implementation process thereof is detailed in the audio similarity matching method embodiment and is not described herein again.
In the audio similarity matching device provided in the embodiment of the present invention, the user clustering unit 301 determines a similar user group from a plurality of users according to the users' audio lists; the set determining unit 302 determines a characteristic audio set of the similar user group based on user behavior data corresponding to the audio; the sample obtaining unit 303 calculates the similarity between sample audio pairs in the training set according to the characteristic audio set, and determines positive and negative samples in the training set according to that similarity; the model training unit 304 trains a twin network model using the positive and negative samples, where the twin network model comprises two base networks that produce feature vectors of audio, have the same structure, and share weights; and the audio matching unit 305 matches audio similarity based on the trained twin network model. Because the similarity between audios is first determined from the similar user groups and then used to train the twin network model, similar songs for a given song can be matched by the model from the songs themselves, without any attribute information about the songs, which improves the matching accuracy for similar songs.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. Specifically:
the electronic device may include components such as a processor 401 of one or more processing cores, memory 402 of one or more computer-readable storage media, a power supply 403, and an input unit 404. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 4 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
The processor 401 is the control center of the electronic device; it connects the various parts of the whole device through various interfaces and lines, and performs the device's functions and processes its data by running or executing the software programs and/or modules stored in the memory 402 and calling the data stored in the memory 402, thereby monitoring the device as a whole. Optionally, the processor 401 may include one or more processing cores; preferably, it may integrate an application processor, which mainly handles the operating system, user interface, and application programs, and a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor may also not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and processes data by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area; the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function or an image playing function), and the like, while the data storage area may store data created according to the use of the electronic device, and the like. Further, the memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The electronic device further comprises a power supply 403 for supplying power to the various components, and preferably, the power supply 403 is logically connected to the processor 401 through a power management system, so that functions of managing charging, discharging, and power consumption are realized through the power management system. The power supply 403 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
The electronic device may further include an input unit 404, and the input unit 404 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the electronic device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 401 in the electronic device loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application program stored in the memory 402, thereby implementing various functions as follows:
determining a similar user group from a plurality of users according to the audio list of the users;
determining a characteristic audio set of the similar user group based on user behavior data corresponding to audio;
calculating the similarity between sample audio pairs in a training set according to the characteristic audio set, and determining a positive sample and a negative sample in the training set according to the similarity between the sample audio pairs;
training a twin network model by using the positive sample and the negative sample, wherein the twin network model comprises two basic networks for obtaining the feature vector of the audio, and the two basic networks have the same structure and share the weight;
and carrying out similarity matching of audio based on the trained twin network model.
In some embodiments, the processor 401 runs an application program stored in the memory 402, and may also implement the following functions:
acquiring the users' audio lists, and calculating the Jaccard coefficient of every two users' audio lists as the similarity between the two users;
and dividing the users into a plurality of similar user groups according to the similarity between the users, wherein the similarity between any two users in one similar user group is greater than a first preset threshold value.
In some embodiments, the processor 401 runs an application program stored in the memory 402, and may also implement the following functions:
for any sample audio pair in the training set, counting the number of characteristic audio sets that contain both audios of the pair;
and dividing the number by the total number of the characteristic audio sets to obtain the similarity of the sample audio pairs.
In some embodiments, the processor 401 runs an application program stored in the memory 402, and may also implement the following functions:
extracting audio features of the audio in the positive sample and the negative sample;
and training the twin network model based on the audio features and the similarity of the positive samples and the audio features and the similarity of the negative samples until the loss value of the loss function reaches the minimum value.
In some embodiments, the processor 401 runs an application program stored in the memory 402, and may also implement the following functions:
for any audio of a positive or negative sample, dividing the audio into multiple audio segments;
performing a short-time Fourier transform on each audio segment to obtain a frequency-domain signal;
applying a mel-scale transformation to the frequency-domain signal to obtain the mel spectrum features of the audio segment;
and combining the mel spectrum features of the multiple audio segments to obtain the audio features of the audio.
In some embodiments, the processor 401 runs an application program stored in the memory 402, and may also implement the following functions:
determining, from the audios corresponding to the similar user group, the audios whose collection count is greater than a second preset threshold to form the characteristic audio set of the similar user group.
In some embodiments, the processor 401 runs an application program stored in the memory 402, and may also implement the following functions:
calculating the feature vectors of the audios in the song library according to the trained twin network model to form a feature vector library;
receiving an audio recommendation request, wherein the audio recommendation request indicates a query for audio similar to a target audio;
acquiring a first feature vector of the target audio;
searching a first preset number of second feature vectors with highest similarity with the first feature vectors from the feature vector library;
and responding to the audio recommendation request, and pushing the audio corresponding to the second feature vector to a terminal corresponding to the audio recommendation request.
In some embodiments, the processor 401 runs an application program stored in the memory 402, and may also implement the following functions:
acquiring a corresponding first feature vector from the feature vector library according to the identification information of the target audio;
or extracting the audio feature of the target audio, and calculating a first feature vector of the target audio according to the audio feature and the twin network model.
In some embodiments, the processor 401 runs an application program stored in the memory 402, and may also implement the following functions:
calculating the similarity between every two audios in a music library according to the twin network model to obtain a similarity matrix corresponding to the music library;
and obtaining a similar audio list of each audio in the music library according to the similarity matrix, wherein when the similarity between the two audios is greater than a third preset threshold value, the two audios are judged to be similar audios.
In some embodiments, the processor 401 runs an application program stored in the memory 402, and may also implement the following functions:
receiving an audio recommendation request, wherein the audio recommendation request indicates a query for audio similar to a target audio;
and acquiring a similar audio list corresponding to the target audio, and responding to the audio recommendation request according to the similar audio list.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
As described above, in the electronic device according to the embodiment of the present invention, a similar user group is determined from a plurality of users according to an audio list of the users; determining a characteristic audio set of a similar user group based on user behavior data corresponding to the audio; calculating the similarity between sample audio pairs in the training set according to the characteristic audio set, and determining a positive sample and a negative sample in the training set according to the similarity between the sample audio pairs; training a twin network model by using a positive sample and a negative sample, wherein the twin network model comprises two basic networks for obtaining a feature vector of audio, and the two basic networks have the same structure and share weight; and matching the similarity of the audio based on the trained twin network model. According to the method and the device, the similarity between the audios is preliminarily determined according to the similar user groups, so that the twin network model is trained, when the similar songs of a certain song are searched, the twin network model is used for matching the similarity between the songs, the attribute information of the songs is not needed, the matching of the similar songs can be realized only according to the songs, and the matching accuracy of the similar songs is improved.
To this end, the embodiment of the present invention provides a storage medium, in which a plurality of instructions are stored, and the instructions can be loaded by a processor to execute any one of the methods for matching similarity of audio provided by the embodiment of the present invention. For example, the instructions may perform:
determining a similar user group from a plurality of users according to the audio list of the users;
determining a characteristic audio set of the similar user group based on user behavior data corresponding to audio;
calculating the similarity between sample audio pairs in a training set according to the characteristic audio set, and determining a positive sample and a negative sample in the training set according to the similarity between the sample audio pairs;
training a twin network model by using the positive sample and the negative sample, wherein the twin network model comprises two basic networks for obtaining the feature vector of the audio, and the two basic networks have the same structure and share the weight;
and carrying out similarity matching of audio based on the trained twin network model.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the storage medium can execute any audio similarity matching method provided by the embodiments of the present invention, they can achieve the beneficial effects achievable by any of those methods; see the foregoing embodiments for details, which are not repeated here.

The audio similarity matching method, apparatus, and storage medium provided by the embodiments of the present invention are described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is only intended to help understand the method and its core idea. Meanwhile, for those skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (12)

1. A method for matching similarity of audio, comprising:
determining a similar user group from a plurality of users according to the audio list of the users;
determining a characteristic audio set of the similar user group based on user behavior data corresponding to audio;
calculating the similarity between sample audio pairs in a training set according to the characteristic audio set, and determining a positive sample and a negative sample in the training set according to the similarity between the sample audio pairs;
training a twin network model by using the positive sample and the negative sample, wherein the twin network model comprises two basic networks for obtaining the feature vector of the audio, and the two basic networks have the same structure and share the weight;
and carrying out similarity matching of audio based on the trained twin network model.
2. The audio similarity matching method according to claim 1, wherein the determining a similar user group from a plurality of users according to the audio list of the users comprises:
acquiring the audio lists of users, and calculating the Jaccard coefficient of every two users' audio lists as the similarity between the two users;
and dividing the users into a plurality of similar user groups according to the similarity between the users, wherein the similarity between any two users in one similar user group is greater than a first preset threshold value.
3. The method for matching similarity of audio according to claim 1, wherein the calculating the similarity between the sample audio pairs in the training set according to the feature audio set comprises:
for any sample audio pair in the training set, counting the number of characteristic audio sets that contain both audios of the pair;
and dividing the number by the total number of the characteristic audio sets to obtain the similarity of the sample audio pairs.
4. The audio similarity matching method of claim 1, wherein the training of the twin network model using the positive samples and the negative samples comprises:
extracting audio features of the audio in the positive sample and the negative sample;
and training the twin network model based on the audio features and the similarity of the positive samples and the audio features and the similarity of the negative samples until the loss value of the loss function reaches the minimum value.
5. The method for matching similarity of audio according to claim 4, wherein the extracting the audio features of the audio in the positive and negative samples comprises:
for any audio in the positive samples and the negative samples, dividing the audio into a plurality of audio segments;
carrying out short-time Fourier transform on each audio segment to obtain a frequency domain signal;
carrying out Mel scale transformation on the frequency domain signal to obtain the Mel spectrum features of the audio segment;
and combining the Mel spectrum features of the plurality of audio segments to obtain the audio features of the audio.
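The STFT-then-Mel pipeline of claim 5 can be sketched with plain NumPy. The frame length, hop size, number of Mel bands, and the HTK Mel formula are all illustrative assumptions; the patent excerpt does not fix these parameters:

```python
import numpy as np

def mel_from_hz(f):
    """Hz to Mel scale (HTK formula, one common convention)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_spectrum_features(signal, sample_rate, frame_len=1024, hop=512, n_mels=40):
    """Sketch of claim 5: windowed STFT power per frame, then a
    triangular Mel filterbank to get Mel spectrum features."""
    # Frame the signal and apply a Hann window before the FFT.
    frames = [signal[i:i + frame_len] * np.hanning(frame_len)
              for i in range(0, len(signal) - frame_len + 1, hop)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (n_frames, n_bins)

    # Triangular Mel filterbank: centres equally spaced on the Mel scale.
    n_bins = frame_len // 2 + 1
    mel_pts = np.linspace(0.0, mel_from_hz(sample_rate / 2), n_mels + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((frame_len + 1) * hz_pts / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_bins))
    for m in range(1, n_mels + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):
            fbank[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[m - 1, k] = (right - k) / max(right - centre, 1)
    return power @ fbank.T  # (n_frames, n_mels)
```

In practice a library such as librosa would normally replace this hand-rolled filterbank; the sketch only shows the order of operations the claim recites (segment, STFT, Mel transform, combine per-segment features).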
6. The audio similarity matching method according to claim 1, wherein the user behavior data is collection (favoriting) data; and the determining a feature audio set of the similar user group based on the user behavior data corresponding to the audio comprises:
determining, from the audios corresponding to the similar user group, the audios whose collection count is greater than a second preset threshold value, to form the feature audio set of the similar user group.
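Claim 6's threshold filter amounts to counting collections per audio within the group; a minimal sketch (function name and data layout are assumptions):

```python
from collections import Counter

def feature_audio_set(group_collections, threshold):
    """Claim 6 sketch: audios whose collection count within a similar
    user group exceeds the threshold form the group's feature audio set.
    group_collections is one list of collected audios per group member."""
    counts = Counter(
        audio for user_audios in group_collections for audio in user_audios
    )
    return {audio for audio, n in counts.items() if n > threshold}
```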
7. The audio similarity matching method according to any one of claims 1 to 6, wherein the audio similarity matching based on the trained twin network model comprises:
calculating the feature vectors of the audios in the music library according to the trained twin network model to form a feature vector library;
receiving an audio recommendation request, wherein the audio recommendation request is used for querying the similar audio of a target audio;
acquiring a first feature vector of the target audio;
searching the feature vector library for a first preset number of second feature vectors with the highest similarity to the first feature vector;
and responding to the audio recommendation request, and pushing the audio corresponding to the second feature vector to a terminal corresponding to the audio recommendation request.
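The search step of claim 7 is a nearest-neighbour lookup over the feature vector library. The sketch below uses cosine similarity, which the excerpt does not mandate; the similarity measure and function name are assumptions:

```python
import numpy as np

def top_k_similar(query_vec, vector_library, k):
    """Claim 7 sketch: cosine similarity between the target audio's
    feature vector and every vector in the library, returning the
    indices of the k most similar entries."""
    lib = np.asarray(vector_library, dtype=float)
    q = np.asarray(query_vec, dtype=float)
    sims = lib @ q / (np.linalg.norm(lib, axis=1) * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims)[:k]
```

At library scale an approximate-nearest-neighbour index would replace this brute-force scan, but the claimed behaviour (return the first preset number of highest-similarity vectors) is the same.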
8. The audio similarity matching method according to claim 7, wherein the obtaining the first feature vector of the target audio includes:
acquiring a corresponding first feature vector from the feature vector library according to the identification information of the target audio;
or extracting the audio feature of the target audio, and calculating a first feature vector of the target audio according to the audio feature and the twin network model.
9. The audio similarity matching method according to any one of claims 1 to 6, wherein the audio similarity matching based on the trained twin network model comprises:
calculating the similarity between every two audios in a music library according to the twin network model to obtain a similarity matrix corresponding to the music library;
and obtaining a similar audio list of each audio in the music library according to the similarity matrix, wherein when the similarity between the two audios is greater than a third preset threshold value, the two audios are judged to be similar audios.
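Claims 9 and 10 precompute a full pairwise similarity matrix and read similar-audio lists off it. A sketch, again assuming cosine similarity over the twin-network feature vectors (the claim only fixes the thresholding rule):

```python
import numpy as np

def similarity_matrix(feature_vectors):
    """Claim 9 sketch: pairwise cosine similarity between every two
    audios in the music library, from their twin-network feature vectors."""
    v = np.asarray(feature_vectors, dtype=float)
    v = v / (np.linalg.norm(v, axis=1, keepdims=True) + 1e-12)
    return v @ v.T

def similar_audio_lists(matrix, threshold):
    """Audios i and j are judged similar when matrix[i, j] exceeds the
    third preset threshold value."""
    n = matrix.shape[0]
    return {i: [j for j in range(n) if j != i and matrix[i, j] > threshold]
            for i in range(n)}
```

Precomputing the matrix trades O(n²) storage for constant-time responses to the recommendation requests of claim 10.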
10. The audio similarity matching method according to claim 9, wherein after obtaining the list of similar audios for each audio in the music library according to the similarity matrix, the method further comprises:
receiving an audio recommendation request, wherein the audio recommendation request is used for querying the similar audio of a target audio;
and acquiring a similar audio list corresponding to the target audio, and responding to the audio recommendation request according to the similar audio list.
11. An apparatus for matching similarity of audio, comprising:
the user clustering unit is used for determining a similar user group from a plurality of users according to the audio list of the users;
the set determining unit is used for determining a characteristic audio set of the similar user group based on user behavior data corresponding to audio;
the sample acquisition unit is used for calculating the similarity between sample audio pairs in a training set according to the characteristic audio set and determining a positive sample and a negative sample in the training set according to the similarity between the sample audio pairs;
a model training unit, configured to train a twin network model using the positive samples and the negative samples, where the twin network model includes two basic networks for obtaining feature vectors of audio, and the two basic networks have the same structure and share weights;
and the audio matching unit is used for matching the similarity of the audio based on the trained twin network model.
12. A storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the method of similarity matching of audio according to any one of claims 1 to 10.
CN201911353609.6A 2019-12-25 2019-12-25 Similarity matching method and device for audio frequency and storage medium Active CN111143604B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911353609.6A CN111143604B (en) 2019-12-25 2019-12-25 Similarity matching method and device for audio frequency and storage medium

Publications (2)

Publication Number Publication Date
CN111143604A true CN111143604A (en) 2020-05-12
CN111143604B CN111143604B (en) 2024-02-02

Family

ID=70519810

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911353609.6A Active CN111143604B (en) 2019-12-25 2019-12-25 Similarity matching method and device for audio frequency and storage medium

Country Status (1)

Country Link
CN (1) CN111143604B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102811207A (en) * 2011-06-02 2012-12-05 腾讯科技(深圳)有限公司 Network information pushing method and system
CN109784182A (en) * 2018-12-17 2019-05-21 北京飞搜科技有限公司 Pedestrian recognition methods and device again
CN110309359A (en) * 2019-05-20 2019-10-08 北京大学 Video correlation prediction technique, device, equipment and storage medium
CN110334204A (en) * 2019-05-27 2019-10-15 湖南大学 A kind of exercise similarity calculation recommended method based on user record

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11636342B2 (en) 2020-03-19 2023-04-25 Adobe Inc. Searching for music
US20230097356A1 (en) * 2020-03-19 2023-03-30 Adobe Inc. Searching for Music
US20210294840A1 (en) * 2020-03-19 2021-09-23 Adobe Inc. Searching for Music
US11461649B2 (en) * 2020-03-19 2022-10-04 Adobe Inc. Searching for music
US11676609B2 (en) 2020-07-06 2023-06-13 Beijing Century Tal Education Technology Co. Ltd. Speaker recognition method, electronic device, and storage medium
CN112347786A (en) * 2020-10-27 2021-02-09 阳光保险集团股份有限公司 Artificial intelligence scoring training method and device
CN112487236A (en) * 2020-12-01 2021-03-12 腾讯音乐娱乐科技(深圳)有限公司 Method, device, equipment and storage medium for determining associated song list
CN112651429B (en) * 2020-12-09 2022-07-12 歌尔股份有限公司 Audio signal time sequence alignment method and device
CN112651429A (en) * 2020-12-09 2021-04-13 歌尔股份有限公司 Audio signal time sequence alignment method and device
CN112784130B (en) * 2021-01-27 2022-05-27 杭州网易云音乐科技有限公司 Twin network model training and measuring method, device, medium and equipment
CN112784130A (en) * 2021-01-27 2021-05-11 杭州网易云音乐科技有限公司 Twin network model training and measuring method, device, medium and equipment
CN112884040A (en) * 2021-02-19 2021-06-01 北京小米松果电子有限公司 Training sample data optimization method and system, storage medium and electronic equipment
CN112884040B (en) * 2021-02-19 2024-04-30 北京小米松果电子有限公司 Training sample data optimization method, system, storage medium and electronic equipment
CN113836346A (en) * 2021-09-08 2021-12-24 网易(杭州)网络有限公司 Method and device for generating abstract for audio file, computing device and storage medium
CN113836346B (en) * 2021-09-08 2023-08-08 网易(杭州)网络有限公司 Method, device, computing equipment and storage medium for generating abstract for audio file
CN115497633A (en) * 2022-10-19 2022-12-20 联仁健康医疗大数据科技股份有限公司 Data processing method, device, equipment and storage medium
CN115497633B (en) * 2022-10-19 2024-01-30 联仁健康医疗大数据科技股份有限公司 Data processing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN111143604B (en) 2024-02-02

Similar Documents

Publication Publication Date Title
CN111143604B (en) Similarity matching method and device for audio frequency and storage medium
CN109101620B (en) Similarity calculation method, clustering method, device, storage medium and electronic equipment
US20210256056A1 (en) Automatically Predicting Relevant Contexts For Media Items
CN111309965B (en) Audio matching method, device, computer equipment and storage medium
EP2707815A1 (en) Generating a playlist
JP2008165714A (en) Method, device and program for information retrieval
CN114117213A (en) Recommendation model training and recommendation method, device, medium and equipment
De Boom et al. Semantics-driven event clustering in Twitter feeds
Darshna Music recommendation based on content and collaborative approach & reducing cold start problem
EP3736804A1 (en) Methods and systems for determining compact semantic representations of digital audio signals
Chen et al. Combining content and sentiment analysis on lyrics for a lightweight emotion-aware Chinese song recommendation system
CN113268667A (en) Chinese comment emotion guidance-based sequence recommendation method and system
Takama et al. Context-aware music recommender system based on implicit feedback
Chen et al. Music recommendation based on multiple contextual similarity information
Sánchez-Moreno et al. Recommendation of songs in music streaming services: dealing with sparsity and gray sheep problems
CN114328800A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN111460215B (en) Audio data processing method and device, computer equipment and storage medium
Niu Music Emotion Recognition Model Using Gated Recurrent Unit Networks and Multi‐Feature Extraction
CN108280165A (en) Reward value music recommendation algorithm based on state transfer
Gupta et al. Songs recommendation using context-based semantic similarity between lyrics
CN111639199A (en) Multimedia file recommendation method, device, server and storage medium
Sánchez-Moreno et al. Dynamic inference of user context through social tag embedding for music recommendation
Wishwanath et al. A personalized and context aware music recommendation system
CN103870476A (en) Retrieval method and device
Nguyen et al. Learning approach for domain-independent linked data instance matching

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant