CN111444383A

CN111444383A - Audio data processing method and device and computer readable storage medium

Info

Publication number: CN111444383A
Application number: CN202010237268.2A
Authority: CN
Inventors: 缪畅宇
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2020-03-30
Filing date: 2020-03-30
Publication date: 2020-07-24
Anticipated expiration: 2040-03-30
Also published as: CN111444383B

Abstract

The application discloses an audio data processing method, an audio data processing device and a computer readable storage medium, wherein the method comprises the following steps: acquiring a segment spectrum sequence of a target audio; the segment spectrum sequence is obtained by sampling the energy amplitude of the audio segment of the target audio; generating an integral frequency spectrum sequence of the target audio according to the fragment frequency spectrum sequence; generating an initial fitting spectrum function according to the spectrum representation basis function, and adjusting the spectrum representation basis function to obtain an adjusted initial fitting spectrum function; when the convergence condition is met between the adjusted initial fitting spectrum function and the whole spectrum sequence, obtaining at least two adjusted spectrum representation basis functions as spectrum reconstruction basis functions according to the adjusted initial fitting spectrum function; and reconstructing the segment spectrum sequence according to the spectrum reconstruction basis function, and determining an audio representation vector of the target audio according to the reconstructed segment spectrum sequence. By the method and the device, the accuracy of the audio expression vector of the acquired target audio can be improved.

Description

Audio data processing method and device and computer readable storage medium

Technical Field

The present application relates to the field of audio data processing technologies, and in particular, to an audio data processing method and apparatus, and a computer-readable storage medium.

Background

Listening to music is a national activity, and the amount of music on the internet is increasing nowadays, including various styles of music, various forms of music, and music of singers. Therefore, it is difficult for the user to select the music to be played from the plurality of music, and it is necessary to search for music that may be of interest to the user by means of the system and recommend the searched music to the user.

When recommending music for the user, the retrieved music B similar to the music a marked as favorite by the user can be recommended to the user. In this process, a process of retrieving music B similar to music a from a plurality of music in a music library is involved, and before starting the process, music a and each music in the music library need to be converted into machine language (for example, the music is expressed as a vector) so that the music similar to music a can be retrieved in the music library.

In the prior art, the energy amplitude in each music is usually directly sampled, and a music representation vector of each music is generated by sampling the obtained sequence. However, in a music, there are usually cases where the energy amplitude at the time t1 is large and the energy amplitudes at other times are relatively small, or the energy amplitude at the time t2 is small and the energy amplitudes at other times are relatively large. Therefore, if the sequence including the energy amplitude at the time t1 or the time t2 is directly used to generate the music representation vector of the music, the obtained music representation vector of the music is not accurate.

Content of application

The application provides an audio data processing method, an audio data processing device and a computer readable storage medium, which can improve the accuracy of an audio representation vector of an acquired target audio.

One aspect of the present application provides an audio data processing method, including:

acquiring at least two fragment frequency spectrum sequences of a target audio; the at least two segment spectrum sequences are obtained by sampling the energy amplitudes of at least two audio segments of the target audio;

generating an integral frequency spectrum sequence of the target audio according to the energy amplitude values contained in the at least two fragment frequency spectrum sequences respectively;

generating an initial fitting spectrum function according to the at least two spectrum representation basis functions, and adjusting the at least two spectrum representation basis functions to obtain an adjusted initial fitting spectrum function;

when the convergence condition is met between the adjusted initial fitting spectrum function and the whole spectrum sequence, acquiring at least two adjusted spectrum representation basis functions as spectrum reconstruction basis functions according to the adjusted initial fitting spectrum function;

reconstructing at least two fragment frequency spectrum sequences according to the frequency spectrum reconstruction basis function to obtain reconstructed fragment frequency spectrum sequences; the reconstructed segment spectral sequence is used to determine an audio representation vector for the target audio.

The obtaining of the at least two segment spectrum sequences of the target audio includes:

sampling the energy amplitude of the target audio according to the sampling time interval to obtain an energy time sequence corresponding to the target audio;

segmenting the energy time sequence according to the sampling time period to obtain at least two segment energy time sequences contained in the energy time sequence;

at least two segment spectral sequences are generated from the at least two segment energy time sequences.

Wherein, according to the at least two segment energy time sequences, generating at least two segment frequency spectrum sequences comprises:

respectively carrying out frequency domain transformation on at least two fragment energy time sequences to obtain a fragment frequency signal corresponding to each fragment energy time sequence;

sampling each segment frequency signal according to the sampling frequency interval to obtain a segment frequency sequence corresponding to each segment frequency signal;

and determining the segment frequency sequence corresponding to each segment frequency signal as the segment frequency spectrum sequence of the target audio.

Each of the at least two segment spectrum sequences comprises energy amplitude values corresponding to at least two sampling frequencies respectively;

generating an overall frequency spectrum sequence of the target audio according to the energy amplitude values contained in the at least two fragment frequency spectrum sequences respectively, wherein the overall frequency spectrum sequence comprises the following steps:

summing the energy amplitudes belonging to the same sampling frequency in each fragment frequency spectrum sequence to obtain a summed energy amplitude corresponding to each sampling frequency;

and generating the whole frequency spectrum sequence of the target audio according to the summation energy amplitude value corresponding to each sampling frequency.

The frequency spectrum reconstruction basis function comprises a function dependent variable; the value range of the function dependent variable comprises at least two sampling frequencies;

reconstructing at least two segment spectrum sequences according to the spectrum reconstruction basis function to obtain a reconstructed segment spectrum sequence, including:

inputting at least two sampling frequencies into the spectrum reconstruction basis function to obtain a reconstruction energy amplitude value corresponding to the spectrum reconstruction basis function;

and reconstructing at least two fragment frequency spectrum sequences according to the reconstructed energy amplitude and the energy amplitude corresponding to each sampling frequency included in each fragment frequency spectrum sequence to obtain a reconstructed fragment frequency spectrum sequence.

Wherein generating an initial fit spectral function from at least two spectral representation basis functions comprises:

obtaining at least two normal distribution functions, and determining the at least two normal distribution functions as at least two frequency spectrum representation basis functions; each frequency spectrum representation basis function comprises a function dependent variable; the value range of the function dependent variable comprises at least two sampling frequencies;

respectively inputting at least two sampling frequencies into each frequency spectrum representation basis function to obtain at least two energy amplitude expressions respectively corresponding to each frequency spectrum representation basis function; each frequency spectrum representation basis function is used for outputting an energy amplitude expression corresponding to each sampling frequency;

inputting energy amplitude expressions with the same sampling frequency into at least two energy amplitude expressions corresponding to each frequency spectrum representation basis function respectively for weighted summation to obtain a summation energy amplitude expression corresponding to each sampling frequency respectively;

and generating an initial fitting spectrum function according to the summation energy amplitude expression corresponding to each sampling frequency respectively.

Each frequency spectrum representation basis function comprises corresponding initial function parameters;

adjusting at least two frequency spectrum representation basis functions to obtain an adjusted initial fitting frequency spectrum function, comprising:

acquiring a convergence function; the convergence function comprises an initial fitting spectrum function and an overall spectrum sequence; the convergence function is used for representing the difference degree between the initially fitted spectrum function and the whole spectrum sequence;

adjusting initial fitting spectrum functions in the convergence function, wherein each spectrum represents an initial function parameter corresponding to the basis function respectively, and obtaining adjusted initial fitting spectrum functions;

when the initial function parameters respectively corresponding to each frequency spectrum representation basis function are adjusted until the convergence function reaches the minimum value, determining that the convergence condition is met between the adjusted initial fitting frequency spectrum function and the whole frequency spectrum sequence;

then, when the convergence condition is satisfied between the adjusted initial fitting spectrum function and the entire spectrum sequence, obtaining at least two adjusted spectrum representation basis functions as spectrum reconstruction basis functions according to the adjusted initial fitting spectrum function, including:

when the convergence condition is met between the adjusted initial fitting spectrum function and the whole spectrum sequence, replacing the initial function parameter in each spectrum representation basis function with a fixed function parameter corresponding to each spectrum representation basis function in the adjusted initial fitting spectrum function respectively to obtain at least two adjusted spectrum representation basis functions; the fixed function parameter corresponding to each frequency spectrum representation basis function is the adjusted initial function parameter corresponding to each frequency spectrum representation basis function;

and determining the at least two adjusted frequency spectrum representation basis functions as frequency spectrum reconstruction basis functions.

Wherein, still include:

respectively obtaining the vector distance between the audio expression vector of the target audio and the audio expression vectors of at least two audios to be matched in the audio library;

determining the audio to be matched corresponding to the audio representation vector with the minimum vector distance with the audio representation vector of the target audio in the audio library as the similar audio of the target audio;

and recommending the similar audio of the target audio to the target client.

Recommending the similar audio of the target audio to the target client, wherein the recommending comprises the following steps:

determining the user client terminal with the playing times of the target audio greater than the playing time threshold value in the target time period as a target client terminal; alternatively, the first and second electrodes may be,

determining a user client side with a system collection song list containing target audio as a target client side; alternatively, the first and second electrodes may be,

determining the user client end responding to the similar retrieval operation aiming at the target audio as a target client end;

and recommending the similar audio of the target audio to the target client.

An aspect of the present application provides an audio data processing apparatus, including:

the sequence acquisition module is used for acquiring at least two fragment frequency spectrum sequences of the target audio; the at least two segment spectrum sequences are obtained by sampling the energy amplitudes of at least two audio segments of the target audio;

the sequence generation module is used for generating an integral frequency spectrum sequence of the target audio according to the energy amplitude values respectively contained in the at least two fragment frequency spectrum sequences;

the adjusting module is used for generating an initial fitting spectrum function according to the at least two spectrum representation basic functions, and adjusting the at least two spectrum representation basic functions to obtain an adjusted initial fitting spectrum function;

the convergence determining module is used for acquiring at least two adjusted spectrum representation basis functions as spectrum reconstruction basis functions according to the adjusted initial fitting spectrum function when the convergence condition is met between the adjusted initial fitting spectrum function and the whole spectrum sequence;

the reconstruction module is used for reconstructing the at least two fragment frequency spectrum sequences according to the frequency spectrum reconstruction basis function to obtain reconstructed fragment frequency spectrum sequences; the reconstructed segment spectral sequence is used to determine an audio representation vector for the target audio.

Wherein, the sequence acquisition module includes:

the time sampling unit is used for sampling the energy amplitude of the target audio according to the sampling time interval to obtain an energy time sequence corresponding to the target audio;

the sequence segmentation unit is used for segmenting the energy time sequence according to the sampling time period to obtain at least two fragment energy time sequences contained in the energy time sequence;

and the sequence generating unit is used for generating at least two fragment frequency spectrum sequences according to the at least two fragment energy time sequences.

Wherein, the sequence generation unit includes:

the conversion subunit is used for respectively carrying out frequency domain conversion on the at least two fragment energy time sequences to obtain a fragment frequency signal corresponding to each fragment energy time sequence;

the frequency sampling subunit is used for respectively sampling each segment frequency signal according to the sampling frequency interval to obtain a segment frequency sequence corresponding to each segment frequency signal;

and the sequence determining subunit is used for determining the segment frequency sequence corresponding to each segment frequency signal as the segment frequency spectrum sequence of the target audio.

a sequence generation module comprising:

the amplitude summing unit is used for summing the energy amplitudes belonging to the same sampling frequency in each segment frequency spectrum sequence to obtain a summed energy amplitude corresponding to each sampling frequency;

and the summation sequence generating unit is used for generating the whole frequency spectrum sequence of the target audio according to the summation energy amplitude value corresponding to each sampling frequency.

a reconstruction module comprising:

the input unit is used for inputting at least two sampling frequencies into the spectrum reconstruction basis function to obtain a reconstruction energy amplitude value corresponding to the spectrum reconstruction basis function;

and the reconstruction unit is used for reconstructing at least two segment frequency spectrum sequences according to the reconstructed energy amplitude and the energy amplitude corresponding to each sampling frequency included in each segment frequency spectrum sequence to obtain a reconstructed segment frequency spectrum sequence.

Wherein, the adjustment module includes:

a basis function obtaining unit, configured to obtain at least two normal distribution functions, and determine the at least two normal distribution functions as at least two spectrum representation basis functions; each frequency spectrum representation basis function comprises a function dependent variable; the value range of the function dependent variable comprises at least two sampling frequencies;

the expression obtaining unit is used for respectively inputting at least two sampling frequencies into each frequency spectrum representation basis function to obtain at least two energy amplitude expressions respectively corresponding to each frequency spectrum representation basis function; each frequency spectrum representation basis function is used for outputting an energy amplitude expression corresponding to each sampling frequency;

the expression summation unit is used for inputting the energy amplitude expressions with the same sampling frequency to carry out weighted summation in at least two energy amplitude expressions respectively corresponding to each frequency spectrum representation basis function so as to obtain a summation energy amplitude expression respectively corresponding to each sampling frequency;

and the initial function generating unit is used for generating an initial fitting spectrum function according to the summation energy amplitude expression corresponding to each sampling frequency.

an adjustment module, comprising:

a convergence function acquisition unit configured to acquire a convergence function; the convergence function comprises an initial fitting spectrum function and an overall spectrum sequence; the convergence function is used for representing the difference degree between the initially fitted spectrum function and the whole spectrum sequence;

the parameter adjusting unit is used for adjusting initial fitting spectrum functions in the convergence function, and each spectrum represents an initial function parameter corresponding to the basis function to obtain an adjusted initial fitting spectrum function;

the convergence determining unit is used for determining that the convergence condition is met between the adjusted initial fitting spectrum function and the whole spectrum sequence when the initial function parameters respectively corresponding to each spectrum representation basis function are adjusted until the convergence function reaches the minimum value;

then, a convergence determination module comprising:

the parameter replacing unit is used for respectively replacing the initial function parameters in each spectrum representation basis function with the fixed function parameters corresponding to each spectrum representation basis function in the adjusted initial fitting spectrum function when the convergence condition is met between the adjusted initial fitting spectrum function and the whole spectrum sequence, so as to obtain at least two adjusted spectrum representation basis functions; the fixed function parameter corresponding to each frequency spectrum representation basis function is the adjusted initial function parameter corresponding to each frequency spectrum representation basis function;

and a basis function determining unit, configured to determine the adjusted at least two frequency spectrum representation basis functions as frequency spectrum reconstruction basis functions.

Wherein, the audio data processing device still includes:

the distance acquisition module is used for respectively acquiring the vector distance between the audio expression vector of the target audio and the audio expression vectors of at least two audios to be matched in the audio library;

the similar audio determining module is used for determining the audio to be matched corresponding to the audio expression vector with the minimum vector distance with the audio expression vector of the target audio in the audio library as the similar audio of the target audio;

and the recommending module is used for recommending the similar audio of the target audio to the target client.

Wherein, the recommendation module includes:

the first client determining unit is used for determining the user client with the playing times of the target audio greater than the playing time threshold value in the target time period as the target client; alternatively, the first and second electrodes may be,

a second client determining unit for determining a user client having a system favorite song list including a target audio as a target client; alternatively, the first and second electrodes may be,

a third client determining unit configured to determine, as a target client, a user client that responds to a similar retrieval operation for the target audio;

and the recommending unit is used for recommending the similar audio of the target audio to the target client.

An aspect of the application provides a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform a method as in an aspect of the application.

An aspect of the application provides a computer-readable storage medium having stored thereon a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method of the above-mentioned aspect.

The method and the device for obtaining the target audio frequency spectrum sequence can obtain at least two segment spectrum sequences of the target audio frequency, and can obtain the whole spectrum sequence of the target audio frequency through the at least two segment spectrum sequences. Secondly, an initial fitting spectrum function can be obtained through the spectrum representation basis function, and then the initial fitting spectrum function can be adjusted through the whole spectrum sequence (namely, the spectrum representation basis function is adjusted), so that the adjusted initial fitting spectrum function is obtained. When the convergence condition is satisfied between the adjusted initial fitting spectrum function and the whole spectrum sequence, the adjusted spectrum representation basis function can be used as a spectrum reconstruction basis function. And then, reconstructing the segment spectrum sequence of the target audio through the obtained spectrum reconstruction basis function to obtain a reconstructed segment spectrum sequence, wherein the reconstructed segment spectrum sequence is used for obtaining an audio representation vector of the target audio. The reconstructed segment spectrum sequence is obtained by adjusting and reconstructing an energy amplitude value, which is relatively abrupt (for example, an energy amplitude value that is too large compared with other energy amplitude values in the segment spectrum sequence, or an energy amplitude value that is too small compared with other energy amplitude values in the segment spectrum sequence), in the segment spectrum sequence of the target audio through a spectrum reconstruction basis function, so that an audio representation vector of the target audio obtained by reconstructing the segment spectrum sequence can more accurately represent the target audio.

Drawings

In order to more clearly illustrate the technical solutions in the present application or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

FIG. 1 is a schematic diagram of a system architecture provided herein;

FIG. 2 is a schematic diagram of a scenario of audio data processing provided herein;

FIG. 3 is a schematic flow chart of an audio data processing method provided by the present application;

FIG. 4 is a schematic diagram of a data sampling scenario provided herein;

FIG. 5 is a schematic diagram of a sequence generation scenario provided herein;

FIG. 6 is a schematic view of a sequence reconstruction scenario provided herein;

fig. 7 is a schematic page diagram of a terminal device provided in the present application;

FIG. 8 is a schematic diagram of an audio data processing apparatus according to the present application;

fig. 9 is a schematic structural diagram of a computer device provided in the present application.

Detailed Description

The technical solutions in the present application will be described clearly and completely with reference to the accompanying drawings in the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.

The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.

The Machine learning is the core of artificial intelligence, which is a fundamental way to make computers have intelligence and is applied to various fields of artificial intelligence.

In the present application, a series of algorithms are mainly provided to enable a computer to learn how to reconstruct the frequency spectrum of the audio through the overall features (mainly energy spectral density features) of the audio to obtain a reconstructed frequency spectrum which is more capable of representing the audio as a whole, and the reconstructed frequency spectrum can be understood as the features of the audio learned by the computer.

Please refer to fig. 1, which is a schematic diagram of a system architecture provided in the present application. As shown in fig. 1, the system architecture diagram includes a server 100 and a plurality of terminal devices, and the plurality of terminal devices specifically include a terminal device 200a, a terminal device 200b, and a terminal device 200 c. The terminal device 200a, the terminal device 200b, and the terminal device 200c can communicate with the server 100 through a network, and the terminal device may be a mobile phone, a tablet computer, a notebook computer, a palm computer, a Mobile Internet Device (MID), a wearable device (e.g., a smart watch, a smart band, etc.). Here, the communication between the terminal device 200a and the server 100 is taken as an example for explanation, and please refer to the following process.

Please refer to fig. 2, which is a schematic view of a scene of audio data processing according to the present application. As shown in fig. 2, the terminal device 200a may carry a music client, which may be a web page type client or an application type client (i.e. music-related app, music may also be referred to as audio). The music client can be used for music related operations such as retrieving music, playing music, and uploading music. As shown in fig. 2, the terminal page 100a, the terminal page 101a, and the terminal page 102a displayed in the terminal apparatus 200a are each a page of a music client carried in the terminal apparatus 200 a. In the terminal page 100a, a music list is displayed, and the music list includes music a, music b, music c, music d, and music e. The terminal device 200a may display the terminal page 101a in response to a long press operation (actually, a double-click operation, a slide operation, or the like, which is not limited thereto) for the music a in the terminal page 100 a. The terminal page 101a has one more button 103a for "retrieve similar music" for music a than the terminal page 100 a. The terminal device 200a may generate a similar retrieval instruction for the music a in response to the click operation of the button 103a in the terminal page 101 a. The terminal apparatus 200a may transmit the generated similarity retrieval instruction to the server 100, and the server 100 may retrieve music similar to the music a in an audio library (including several pieces of music) after acquiring the similarity retrieval instruction. After the server 100 completes the retrieval of the music similar to music a, the retrieved music similar to music a may be transmitted to the terminal apparatus 200 a. The terminal apparatus 200a may jump from the terminal page 101a to the terminal page 102a after acquiring the music similar to the music a transmitted by the server. The terminal page 102a displays therein a music retrieval result for music a (i.e., retrieved music similar to music a) including music f, music g, and music h (the music f, music g, and music h are names of corresponding music) similar to music a transmitted by the server acquired by the terminal apparatus 200 a. After that, the terminal apparatus 200a may also respond to the click operation for the music f, the music g, or the music h in the terminal page 102a, and start playing the clicked music f, music g, or music h.

The above-mentioned process of the server 100 retrieving music similar to the music a from the audio library may be performed by the terminal device 200 a. In other words, the execution subject for retrieving music similar to music a may be the server 100 or the terminal device 200a, and is not limited thereto, depending on the actual application scenario. The following describes in detail the process of retrieving music similar to music a, and the server 100 is taken as an execution subject to retrieve english similar to music a.

In the following process, music is referred to as audio. As shown in fig. 2, the curve 104a is an energy distribution curve of the audio a (i.e. the music a) in the time domain, and the energy of the audio at a certain time point, i.e. the loudness of the audio at the certain time point, and the unit thereof may be joule (i.e. J). The horizontal axis of the coordinate axis of the curve 104a represents time, and the vertical axis represents energy amplitude (i.e., the magnitude of energy at each time point). The curve 104a is a curve continuous in the time domain. First, the server 100 may sample the energy distribution curve 104 of the audio a, and the sampling time interval of the sampling may be specifically determined according to the actual application scenario. For example, if x1 seconds is used as the sampling time interval, it means that one energy amplitude needs to be sampled every x1 seconds in the curve 104a, and if the audio a has x2 seconds in total, then x2/x1 energy amplitudes can be finally sampled. The server 100 may arrange all the sampled energy amplitudes according to their corresponding time sequence to obtain a sequence, which may be referred to as an energy time sequence.

Next, the server 100 may segment the obtained energy time series, i.e., segment the energy time series into a plurality of (at least two) sequences, and each of the plurality of sequences may be referred to as a segment energy time series. The server 100 may slice the energy time sequence according to a sampling time period, for example, the sampling time period may be 3 seconds, that is, the server 100 needs to slice the energy amplitude in the energy time sequence every other 3 seconds to obtain a segment energy time sequence. For example, assuming that the energy time series is [1, 2, 3, 4, 5, 6], the sampling time period is 3 seconds, and the energy amplitude 1 and the energy amplitude 2 in the energy time series are sampled by the server 100 in the energy amplitude within the 1 st 3 seconds of the curve 104a, the segmented energy time series is [1, 2 ]. The energy amplitude 3 and the energy amplitude 4 in the energy time series are sampled by the server 100 in the energy amplitude within the 2 nd 3 seconds in the curve 104a, and a segmented energy time series obtained by segmentation is [3, 4 ]. The energy amplitude 5 and the energy amplitude 6 in the energy time series are sampled by the server 100 in the energy amplitude within the 3 rd second in the curve 104a, and a segmented energy time series is [5, 6 ]. Therefore, by segmenting the energy time series [1, 2, 3, 4, 5, 6], the resulting fragment energy time series includes the fragment energy time series [1, 2], the fragment energy time series [3, 4] and the fragment energy time series [5, 6 ]. Here, as shown in fig. 2, the plurality of segment energy time series obtained by the server 100 specifically include a plurality of series in the set 105a (the number of the plurality of series needs to be determined according to the actual sampling time interval, the sampling time period, and the total duration of the audio a), and the plurality of series specifically may include a plurality of series including a series 106a (i.e., [10, 10, 13, 15, 52, 6]) and a series 107a (i.e., [11, 12, 13, 14, 15, 20 ]).

After obtaining the plurality of segment energy time sequences, the server 100 needs to obtain a segment spectrum sequence corresponding to each segment energy time sequence (the segment spectrum sequence characterizes the distribution of the energy amplitude of the audio a in the frequency domain, that is, the segment energy time sequence is transformed from the time domain to the frequency domain). Specifically, the process of obtaining the segment spectrum sequence corresponding to each segment energy time sequence is the same, and here, the process of obtaining the segment spectrum sequence 110a corresponding to the segment energy time sequence 106a is taken as an example for explanation. First, the server 100 may perform frequency domain transformation (the frequency domain transformation may be fourier transformation) on the segment energy time series 106a to obtain a spectrum curve 108a corresponding to the segment energy time series 106a, where the spectrum curve 108a is a continuous curve. The horizontal axis of the plot 108a represents frequency (in Hz) and the vertical axis represents energy amplitude (in joules J). Then, the server 100 may sample the spectrum curve 108a, and the sampling frequency interval of the sampling may be determined according to a specific actual scenario, which is not limited thereto. For example, if the sampling frequency interval is 10Hz, an energy amplitude is obtained by sampling every 10Hz in the spectrum curve 108a, and if the upper limit of the frequency of the spectrum curve 108a is 50Hz and the lower line of the frequency is 0, a plurality of energy amplitudes of 50Hz/10Hz can be obtained after sampling the spectrum curve 108 a. The server 100 may sort the plurality of energy amplitudes obtained by sampling the spectrum curve 108a according to the frequency size, and obtain a segment spectrum sequence 110a (i.e., [38, 35, 30, 15, 5]) corresponding to the segment energy time sequence 106 a. Through the above process, the server 100 may obtain the segment spectrum sequence corresponding to each segment energy time sequence in the set 105 a. As shown in a set 109a in fig. 2, the set 109a includes each segment spectrum sequence obtained by the server 100, and specifically includes a plurality of sequences, such as a segment spectrum sequence 110a corresponding to the segment energy time sequence 106a and a segment spectrum sequence 111a (i.e., [11, 12, 40, 14, 15]) corresponding to the segment energy time sequence 107 a.

It will be appreciated that the number of energy magnitudes in each of the sequences of segment spectra in the set 109a is the same, and the sampling frequency in different sequences of segment spectra is the same. For example, assuming that the sequences [1, 2, 3] and [4, 5, 6] are two fractional spectral sequences, the energy amplitude 1 in the sequences [1, 2, 3] and the energy amplitude 4 in the sequences [4, 5, 6] correspond to the same sampling frequency, for example, 10 Hz; the energy amplitude 2 in the sequence [1, 2, 3] and the energy amplitude 5 in the sequence [4, 5, 6] correspond to the same sampling frequency, for example both to 20 Hz; the energy amplitude 3 in the sequence [1, 2, 3] corresponds to the same sampling frequency as the energy amplitude 6 in the sequence [4, 5, 6], for example to 30 Hz. I.e. the sampling frequencies for sequence [1, 2, 3] and sequence [4, 5, 6] are 10Hz, 20Hz and 30 Hz. Therefore, the server 100 may add (sum) the energy amplitudes corresponding to the same (same) sampling frequency in each of the segment spectrum sequences in the set 109a to obtain a summed energy amplitude corresponding to each sampling frequency. For example, the energy amplitude 1 in the sequence [1, 2, 3] can be added to the energy amplitude 4 in the sequence [4, 5, 6] to obtain a summed energy amplitude 5 corresponding to the sampling frequency of 10 Hz; the energy amplitude 2 in the sequence [1, 2, 3] can be added to the energy amplitude 5 in the sequence [4, 5, 6] to obtain a summation energy amplitude 7 corresponding to the sampling frequency of 20 Hz; the energy amplitude 3 in the sequence [1, 2, 3] may be added to the energy amplitude 6 in the sequence [4, 5, 6] to obtain a summed energy amplitude 9 corresponding to a sampling frequency of 30 Hz. The entire spectrum sequence 112a can be obtained by the above-mentioned obtained summed energy amplitude value corresponding to each sampling frequency (i.e., D ═ 110, 120, 130, 140, 150). For example, the overall spectrum sequence obtained by the above sequence [1, 2, 3] and sequence [4, 5, 6] may be [5, 7, 9 ]. As can be known from the above, the dimension of the segment spectrum sequence is the same as that of the whole spectrum sequence, that is, how many energy amplitudes are included in one segment spectrum sequence, and so many energy amplitudes are also included in the corresponding whole spectrum sequence.

Further, the server 100 may also represent the entire spectrum sequence by using a spectrum representation basis function, as shown in fig. 2, the spectrum representation basis function may specifically include basis functions f1(x), f2(x), f3(x), f4(x), and f5(x) in the set 113a, where x represents a sampling frequency, in other words, a value range of x may include each sampling frequency corresponding to the entire spectrum sequence (for example, the sampling frequency 10Hz, the sampling frequency 20Hz, and the sampling frequency 30Hz corresponding to the entire spectrum sequence [5, 7, 9 ]. The number of spectrum expression basis functions is determined according to the actual application scenario, and 5 are taken as an example here for explanation. Each spectral representation basis function may be a normal distribution function with different (in practice, the same) function parameters. The server 100 may bring each value (i.e., each sampling frequency) of x into each spectrum representation basis function in the set 113a, and add each spectrum representation basis function with the value of x (corresponding to the sampling frequency), to obtain f (x) (i.e., the equation 115a), where f (x) is also a sequence.

For example, if the sampling frequency includes 10Hz, 20Hz, and 30Hz, F (x) is [ f1(10) + f2(10) + f3(10) + f4(10) + f5 (10)), f1(20) + f2(20) + f3 (20))

+ f4(20) + f5(20), f1(30) + f2(30) + f3(30) + f4(30) + f5 (30). It is to be understood that, since each of the spectrum-representing basis functions includes a respective function parameter (which is random at first), the function parameter of each of the spectrum-representing basis functions is also included in f (x). The server 100 may adjust the function parameters of each spectrum representation basis function in f (x) until f (x) approaches the obtained overall spectrum sequence (as shown in equation 114 a), i.e., f (x) and the overall spectrum sequence approach the same as possible. After f (x) is approximately the same as the whole spectrum sequence, the server 100 may take the function parameter of each adjusted spectrum representation function from f (x) at this time, and then may bring the function parameter of each adjusted spectrum representation basis function into each spectrum representation basis function to obtain a final spectrum representation basis function (i.e., an adjusted spectrum representation basis function). As shown in the set 116a in fig. 2, the basis function f11(x) can be obtained after adjusting the spectral representation basis function f1 (x); the basic function f22(x) can be obtained after the frequency spectrum representation basic function f2(x) is adjusted; the basic function f33(x) can be obtained after the frequency spectrum representation basic function f3(x) is adjusted; the basic function f44(x) can be obtained after the frequency spectrum representation basic function f4(x) is adjusted; the basis function f55(x) can be obtained after adjusting the spectral representation basis function f5 (x).

Then, the server 100 may reconstruct each segment spectrum sequence in the set 105a through the obtained basis function f11(x), basis function f22(x), basis function f33(x), basis function f44(x), and basis function f55(x), to obtain a reconstructed spectrum sequence corresponding to each segment spectrum sequence, which may specifically include a plurality of sequences including a reconstructed spectrum sequence 118a corresponding to the segment spectrum sequence 106a in the set 117a and a reconstructed spectrum sequence 119a corresponding to the segment spectrum sequence 106 a. The specific process of reconstructing the spectrum sequence of each segment through the finally obtained basis functions can be seen in step S105. The server 100 may obtain an audio representation vector 120a of the audio a through the obtained reconstructed spectrum sequence corresponding to each segment spectrum sequence, and may interpret the audio representation vector 120a as the machine characteristic of the finally obtained audio a.

It is understood that the server 100 can obtain the audio representation vector of each audio in the audio library by the same principle as the above-described process. Therefore, the server 100a retrieves the audio similar to the audio a in the audio library, that is, retrieves the audio corresponding to the audio representation vector having the smallest vector distance with the audio representation vector 120a of the audio a in the audio representation vector corresponding to each audio in the audio library as the audio similar to the audio a.

As can be seen from the above, in the present application, the segment spectrum sequences of the audio frequency need to be reconstructed by adjusting the completed basis functions, and the operation of reconstruction can achieve the purposes of enhancing the partial sequences of each segment spectrum sequence that are consistent with the variation trend of the entire spectrum sequence, and weakening the partial sequences of each segment spectrum sequence that are inconsistent with the variation trend of the entire spectrum sequence. For example, if a certain spectrum sequence of a segment is [1, 2, 1, 2, 1, 2, 51, 1, 2, 3], then 51 in the sequence has an energy amplitude that is particularly large compared to other energy amplitudes (i.e. 1, 2, 1, 2, 1, 2 before 51 and 1, 2, 3 after 51), so that when reconstructing the sequence [1, 2, 1, 2, 1, 2, 51, 1, 2, 3], the purpose of appropriately reducing the value of the energy amplitude 51 and adjusting the

energy amplitudes

1, 2, 1, 2, 1, 2 before 51 and 1, 2, 3 after 51 to be closer to the variation trend of the whole spectrum sequence can be achieved. Therefore, the audio representation vector obtained through the finally reconstructed segment spectrum sequence can more accurately represent the corresponding audio (because the variation trend of the energy amplitude in each segment spectrum sequence is more fit with the variation trend of the energy amplitude in the whole spectrum sequence), so that other audio similar to a certain audio can be more accurately retrieved through the audio representation vector of each audio.

Referring to fig. 3, it is a schematic flow chart of an audio data processing method provided in the present application, and as shown in fig. 3, the method may include:

step S101, acquiring at least two fragment frequency spectrum sequences of a target audio; the at least two segment spectrum sequences are obtained by sampling the energy amplitudes of at least two audio segments of the target audio;

specifically, the execution subject in this embodiment may be a terminal device, or may be a server. Here, a server is taken as an execution subject. The target audio may refer to any audio, such as any song or any recording. First, the server may obtain an energy distribution of the target audio in a time domain (i.e., a time dimension) (the energy distribution is a size of energy of the target audio at each time point, and the size of the energy also represents a loudness of the target audio, and the size of the energy of the target audio at each time point may be referred to as an energy amplitude of the target audio at the corresponding time point, in other words, the energy amplitude is a value of the energy size of the target audio), and the curve 104a in fig. 2 represents the energy distribution of the audio a in the time domain. The server may sample the energy distribution of the target audio in the time domain (the energy distribution may be represented as a curve, and thus the sampling may be to sample the energy distribution curve), obtain a sequence of the energy of the target audio in the time domain (the sequence includes the energy amplitude of the target audio at each sampling time point), and may be referred to as an energy time sequence of the target audio. The specific process of sampling the energy distribution of the target audio in the time domain can be referred to the process described in fig. 4 below.

Please refer to fig. 4, which is a schematic view of a data sampling scenario provided in the present application. As shown in fig. 4, it is assumed that a curve 100b is an energy distribution curve of the target audio in the time domain, and the horizontal axis of the coordinate axis of the curve 100b is time (in seconds, i.e., s) and the vertical axis is energy (in joules, i.e., J). As can be seen from fig. 4, the target audio has a total of 200 seconds (that is, the maximum value of the abscissa axis of the coordinate axis in fig. 4 is 200), and the sequence [5, 5, 5, 10, 25, 22, 13, 8, 7, 6, 7, 7, 6, 7, 8, 9, 8, 6, 5, 5] or the sequence 101b can be obtained by sampling the curve 100b at sampling time intervals of 10 seconds (the sampling time intervals may be determined according to the actual application scene, for example, 0.1 second or the like). The number of the energy amplitude values in the sequence 101b is the audio time divided by the sampling time interval, that is, 200/10 is equal to 20, and the sequence 101b is the sampled energy time sequence of the target audio. The sequence 101b includes energy amplitudes corresponding to the target audio at 20 sampling time points (see a curve 100b, which specifically includes a sampling time point corresponding to 10 seconds, a sampling time point corresponding to 20 seconds, a sampling time point corresponding to 30 seconds, a sampling time point corresponding to 40 seconds, a sampling time point corresponding to 50 seconds, a sampling time point corresponding to 60 seconds, a sampling time point corresponding to 70 seconds, a sampling time point corresponding to 80 seconds, a sampling time point corresponding to 90 seconds, a sampling time point corresponding to 100 seconds, a sampling time point corresponding to 110 seconds, a sampling time point corresponding to 120 seconds, a sampling time point corresponding to 130 seconds, a sampling time point corresponding to 140 seconds, a sampling time point corresponding to 150 seconds, a sampling time point corresponding to 160 seconds, a sampling time point corresponding to 170 seconds, a sampling time point corresponding to 180 seconds, a sampling time point corresponding to 190 seconds, a sampling time point corresponding to 200 seconds).

Next, the server may segment the energy time series of the target audio according to the sampling time period, that is, segment the energy time series into a plurality of (in this application, "a plurality" refers to "at least two") sequences, and may refer to the plurality of sequences obtained by segmenting the energy time series as a plurality of segment energy time series of the target audio.

Similarly, referring to fig. 4, assuming that the sampling time period is 50 seconds, the number of energy amplitudes in one energy time sequence is equal to 5, that is, 50/10 is equal to 5, that is, the energy time sequence 101b is sliced every 5 energy amplitudes to obtain a plurality of energy time sequences (i.e., sequences in the set 102 b), and the plurality of energy time sequences specifically include the sequence [5, 5, 5, 10, 25] (i.e., the sequence 103b), the sequence [22, 13, 8, 7, 6] (i.e., the sequence 104b), the sequence [7, 7, 6, 7, 8] (i.e., the sequence 105b), and the sequence [9, 8, 6, 5, 5] (i.e., the sequence 106 b). Wherein, 5 energy amplitudes in the sequence [5, 5, 5, 10, 25] (i.e. the sequence 103b) are obtained by sampling the energy distribution of the target audio within 1 st 50 seconds (i.e. 0 to 50 seconds); the 5 energy amplitudes in the sequence [22, 13, 8, 7, 6] (i.e., the sequence 104b) are obtained by sampling the energy distribution of the target audio within 2 nd 50 seconds (i.e., 50 seconds to 100 seconds); the 5 energy amplitudes in the sequence [7, 7, 6, 7, 8] (i.e., the sequence 105b) are obtained by sampling the energy distribution of the target audio within the 3 rd 50 seconds (i.e., 100 seconds to 150 seconds); the 5 energy amplitudes in the sequence [9, 8, 6, 5, 5] (i.e., the sequence 106b) are sampled for the energy distribution of the target audio in the 4 th 50 seconds (i.e., 150 seconds to 200 seconds).

After obtaining the plurality of segment energy time series of the target audio, the server may generate a plurality of segment spectrum sequences of the target audio through the obtained plurality of segment energy time series of the target audio, one segment energy time series corresponding to one segment spectrum sequence. The specific process can be as follows: the server may perform frequency domain transformation on each segment energy time sequence, that is, transform the segment energy time sequence from a time domain (i.e., a time dimension) to a frequency domain (i.e., a frequency dimension), to obtain segment frequency signals corresponding to each segment energy time sequence, where one segment energy time sequence corresponds to one segment frequency signal. The slice frequency signal can also be represented as a curve, the horizontal axis of the coordinate axis of the curve is frequency (in hertz, Hz), the vertical axis is energy (in joules, J), and the magnitude of the energy of the target audio at each frequency can also be referred to as energy amplitude. Wherein the segment frequency signal reflects the energy spectral density of the target audio (energy spectral density is the distribution of energy over different frequencies). The server may sample each segment frequency signal according to the sampling frequency interval (the sampling mode is the same as the above-described sampling mode for the energy distribution of the target audio in the time domain), to obtain a segment frequency sequence corresponding to each segment frequency signal, where one segment frequency signal corresponds to one segment frequency sequence. Assuming that the upper frequency limit of a segment frequency signal is f1, the lower frequency limit is f2, and the sampling frequency interval is f3, the number of energy amplitudes in a segment frequency sequence is (f1-f2)/f 3. Wherein, the upper frequency limit and the lower frequency limit of each fragment frequency sequence are the same.

For example, if the upper limit of the segment frequency signal is 50, the lower limit is 0, and the sampling frequency interval is 10, the segment frequency sequence obtained after sampling the segment frequency signal includes an energy amplitude corresponding to the sampling frequency 10, an energy amplitude corresponding to the sampling frequency 20, an energy amplitude corresponding to the sampling frequency 30, an energy amplitude corresponding to the sampling frequency 40, and an energy amplitude corresponding to the sampling frequency 50 in the segment frequency signal.

Each of the obtained segment frequency sequences is a segment spectrum sequence of the target audio. As can be understood from the above process, the plurality of segment spectrum sequences of the target audio are obtained by sampling energy magnitudes of a plurality of audio segments of the target audio (which may be understood as splitting the target audio into a plurality of audio segments through a sampling time period, and one segment energy time sequence corresponds to one audio segment).

Step S102, generating an integral frequency spectrum sequence of the target audio according to the energy amplitude values respectively contained in the at least two fragment frequency spectrum sequences;

specifically, the server may generate the entire spectrum sequence of the target audio by using the energy amplitude respectively included in each segment spectrum sequence. Here, it should be noted that each segment spectrum sequence of the target audio includes energy amplitudes corresponding to at least two sampling frequencies, and the sampling frequencies corresponding to each segment spectrum sequence are the same. The sampling frequency is the frequency corresponding to each energy amplitude in the segment spectrum sequence. For example, if the upper frequency limit of the segment frequency signal 1 is 30, the lower frequency limit is 0, and the sampling frequency interval is 10, and the segment frequency signal 1 is sampled to obtain the segment frequency spectrum sequences [1, 2, 3], then 1 in the sequences [1, 2, 3] corresponds to the

sampling frequency

10, 2 in the sequences [1, 2, 3] corresponds to the

sampling frequency

20, and 3 in the sequences [1, 2, 3] corresponds to the sampling frequency 30, that is, in this case, each segment frequency spectrum sequence includes an energy amplitude corresponding to the sampling frequency 10, an energy amplitude corresponding to the sampling frequency 20, and an energy amplitude corresponding to the sampling frequency 30. Except that the energy amplitude may be different for the same sampling frequency (e.g., sampling frequency 10) in different fractional spectral sequences.

Therefore, the server can sum the energy amplitude values corresponding to the same sampling frequency in each fragment frequency spectrum sequence to obtain a sum energy amplitude value corresponding to each sampling frequency, and then the server can generate the whole frequency spectrum sequence of the target audio according to each sum energy amplitude value.

For example, please refer to fig. 5, which is a schematic view of a sequence generation scenario provided in the present application. As shown in fig. 5, the time series of segment energies of the target audio obtained by the server includes a sequence 100c (i.e., [5, 5, 5, 10, 25]), a sequence 102c (i.e., [22, 13, 8, 7, 6]), a sequence 104c (i.e., [7, 7, 6, 7, 8]), and a sequence 106c (i.e., [9, 8, 6, 5, 5 ]). The server may perform frequency domain transformation on the sequence 100c to obtain a segment frequency signal 100d (which is a curve); the server may perform frequency domain transformation on the sequence 102c to obtain a segment frequency signal 101d (which is a curve); the server may perform frequency domain transformation on the sequence 104c to obtain a segment frequency signal 102d (which is a curve); the server may perform a frequency domain transform on the sequence 106c to obtain a segment frequency signal 103d (which is a curve). The server may sample the segment frequency signal 100d (the sampling frequency interval is determined according to the actual application scenario, which is not limited), to obtain a segment spectrum sequence 101c (i.e., [5, 10, 7, 9, 8 ]); the server may sample the segment frequency signal 101d to obtain a segment spectrum sequence 103c (i.e., [6, 8, 9, 10, 14 ]); the server may sample the segment frequency signal 102d to obtain a segment spectrum sequence 105c (i.e., [15, 12, 8, 7, 6 ]); the server may sample the segment frequency signal 103d to obtain a segment spectrum sequence 107c (i.e., [5, 8, 6, 7, 5 ]).

Wherein the 1 st energy amplitude 5 in the segment spectrum sequence 101c, the 1 st energy amplitude 6 in the segment spectrum sequence 103c, the 1 st energy amplitude 15 in the segment spectrum sequence 105c, and the 1 st energy amplitude 5 in the segment spectrum sequence 107c of fig. 5 correspond to the same sampling frequency (assumed to be sampling frequency 1); the 2 nd energy amplitude 10 in the segment spectrum sequence 101c, the 2 nd energy amplitude 8 in the segment spectrum sequence 103c, the 2 nd energy amplitude 12 in the segment spectrum sequence 105c, and the 2 nd energy amplitude 8 in the segment spectrum sequence 107c correspond to the same sampling frequency (assumed to be sampling frequency 2); the 3 rd energy amplitude 7 in the segment spectrum sequence 101c, the 3 rd energy amplitude 9 in the segment spectrum sequence 103c, the 3 rd energy amplitude 8 in the segment spectrum sequence 105c, and the 3 rd energy amplitude 6 in the segment spectrum sequence 107c correspond to the same sampling frequency (assumed to be sampling frequency 3); the 4 th energy amplitude 9 in the segment spectrum sequence 101c, the 4 th energy amplitude 10 in the segment spectrum sequence 103c, the 4 th energy amplitude 7 in the segment spectrum sequence 105c, and the 4 energy amplitudes 7 in the segment spectrum sequence 107c correspond to the same sampling frequency (assumed to be sampling frequency 4); the 5 th energy amplitude 8 in the segment spectrum sequence 101c, the 5 th energy amplitude 14 in the segment spectrum sequence 103c, the 5 th energy amplitude 6 in the segment spectrum sequence 105c, and the 5 th energy amplitude 5 in the segment spectrum sequence 107c correspond to the same sampling frequency (assumed to be sampling frequency 5).

Then, the server may add the energy amplitudes corresponding to the sampling frequency 1 (including the energy amplitude 5, the energy amplitude 6, the energy amplitude 15, and the energy amplitude 5) to obtain a summed energy amplitude 31 corresponding to the sampling frequency 1 (as shown in equation 108 c); the server may add the energy amplitudes corresponding to the sampling frequency 2 (including the energy amplitude 10, the energy amplitude 8, the energy amplitude 12, and the energy amplitude 8) to obtain a summed energy amplitude 38 corresponding to the sampling frequency 2 (as shown in equation 109 c); the server may add the energy amplitudes corresponding to the sampling frequency 3 (including the energy amplitude 7, the energy amplitude 9, the energy amplitude 8, and the energy amplitude 6) to obtain a summed energy amplitude 30 corresponding to the sampling frequency 3 (as shown in equation 110 c); the server may add the energy amplitudes corresponding to the sampling frequency 4 (including the energy amplitude 9, the energy amplitude 10, the energy amplitude 7, and the energy amplitude 7) to obtain a summed energy amplitude 33 corresponding to the sampling frequency 4 (as shown in equation 111 c); the server may add the energy amplitudes corresponding to the sampling frequency 5 (including the energy amplitude 8, the energy amplitude 14, the energy amplitude 6, and the energy amplitude 5) to obtain a summed energy amplitude 33 corresponding to the sampling frequency 5 (as shown in equation 112 c). Then, the server may sequence the summation energy amplitude 31 corresponding to the sampling frequency 1, the summation energy amplitude 38 corresponding to the sampling frequency 2, the summation energy amplitude 30 corresponding to the sampling frequency 3, the summation energy amplitude 33 corresponding to the sampling frequency 4, and the summation energy amplitude 33 corresponding to the sampling frequency 5 in sequence (time sequence, or frequency sequence), so as to obtain an overall frequency spectrum sequence 113c of the target audio (i.e., [31, 38, 30, 33, 33 ]). It is understood that the sampling frequencies corresponding to the entire spectrum sequence and the segment spectrum sequence are also the same. As can be seen from the above, the dimension of the whole spectrum sequence of the target audio is the same as that of each segment spectrum sequence of the target audio, in other words, the number of energy amplitudes contained in the whole spectrum sequence of the target audio is the same as the number of energy amplitudes contained in one segment spectrum sequence of the target audio.

Step S103, generating an initial fitting spectrum function according to the at least two spectrum representation basis functions, and adjusting the at least two spectrum representation basis functions to obtain an adjusted initial fitting spectrum function;

specifically, since the normal distribution function is the most suitable distribution function for describing the probability distribution in the physical world, the normal distribution function can be selected as the spectrum expression basis function, as shown in the following formula (1):

where x refers to a sampling frequency, that is, a value range of x includes each sampling frequency, for example, if the sampling frequency includes 10Hz, 20Hz, and 30Hz, a value of x may include 10, 20, and 30. Where x may be referred to as a function dependent variable in the spectral representation basis function. Both σ and μ are function parameters in the spectral representation basis function.

Here, the initial fitting spectrum function may be generated using a plurality of spectrum representation basis functions, all of which are normal distribution functions, that is, all of the function forms of the plurality of spectrum representation basis functions are in the form of formula (1), except that the function parameters (including σ and μ) of the plurality of spectrum representation basis functions may be different. The final fitted spectrum sequence of the whole spectrum sequence can be obtained by initially fitting the spectrum function (that is, the adjusted initially fitted spectrum function actually represents a sequence), and the fitted spectrum sequence is a sequence which is approximated to the whole spectrum sequence and is obtained by using a plurality of spectrum representation basis functions, that is, the fitted spectrum sequence is almost the same as the whole spectrum sequence. This process may be referred to as spectral estimation, i.e. fitting the overall spectral sequence by spectral representation basis functions. The number of the spectrum representation basis functions is determined according to the actual application scenario, and is not limited to this, for example, 5 spectrum representation basis functions may be used to implement the spectrum estimation process. It can be understood that, in general, the more the number of the spectrum representation basis functions is, the more accurate the resulting fitted spectrum sequence is, but the more the number of the spectrum representation basis functions is, the more the calculation amount is for the server, so that a compromise number can be adopted, for example, 5 spectrum representation basis functions are adopted.

Here, assuming that 3 spectral representation basis functions are adopted here, which are respectively denoted as f1(x), f2(x) and f3(x), each spectral representation basis function may be weighted and summed to obtain an initial fitting spectral function, which is denoted as f (x), see the following equation (2):

F(x)＝a1*f1(x)+a2*f2(x)+a3*f3(x) (2)

because the frequency spectrum representation basis functions are probability distribution, the value range of the frequency spectrum representation basis functions is 0 to 1, the value of the energy amplitude can be hundreds of thousands, and the like, and therefore, a weight needs to be added in front of each frequency spectrum representation basis function to accurately fit the whole frequency spectrum sequence through the frequency spectrum representation basis functions. In the above formula (2), the weight of the spectral representation basis function f1(x) is a1, the weight of the spectral representation basis function f2(x) is a2, and the weight of the spectral representation basis function f3(x) is a 3.

The manner in which the above-described plurality of spectral representation basis functions (including f1(x), f2(x), and f3(x)) are summed is described below: each sampling frequency (i.e., each value in the value range of x) can be respectively substituted into each spectrum representation basis function to obtain a plurality of energy amplitude expressions respectively corresponding to each spectrum representation basis function. For example, assuming that the sampling frequency includes 10, 20, and 30, the value of x may include 10, 20, and 30. Each value of x can be substituted into a spectrum representation basis function f1(x) to obtain an energy amplitude expression corresponding to the spectrum representation basis function f1(x), wherein the energy amplitude expression corresponding to f1(x) specifically comprises f1(10), f1(20) and f1 (30); similarly, each value of x may be substituted into the spectrum representation basis function f2(x) to obtain an energy amplitude expression corresponding to the spectrum representation basis function f2(x), where the energy amplitude expression corresponding to f2(x) specifically includes f2(10), f2(20), and f2 (30); similarly, each value of x may be substituted into the spectrum representation basis function f3(x), so as to obtain an energy amplitude expression corresponding to the spectrum representation basis function f3(x), where the energy amplitude expression corresponding to f3(x) specifically includes f3(10), f3(20), and f3 (30).

Then, the energy amplitude expressions with the same sampling frequency (i.e., the same value of x) can be brought into the multiple energy amplitude expressions corresponding to each spectrum representation basis function for weighted summation, so as to obtain the summed energy amplitude expression corresponding to each sampling frequency. For example, the above-mentioned energy amplitude expression with the sampling frequency 10 (i.e. x is equal to 10) includes the energy amplitude expression f1(10), the energy amplitude expression f2(10) and the energy amplitude expression f3(10), and then the summation energy amplitude expression corresponding to the sampling frequency 10 is f1(10) + f2(10) + f3 (10). Similarly, the above-mentioned energy amplitude expression with the sampling frequency 20 (i.e. x is equal to 20) includes the energy amplitude expression f1(20), the energy amplitude expression f2(20) and the energy amplitude expression f3(20), so that the summation energy amplitude expression corresponding to the sampling frequency 20 is f1(20) + f2(20) + f3 (20). Similarly, the energy amplitude expression carried in the sampling frequency 30 (i.e. x equals to 30) includes the energy amplitude expression f1(30), the energy amplitude expression f2(30) and the energy amplitude expression f3(30), and then the summed energy amplitude expression corresponding to the sampling frequency 30 is f1(30) + f2(30) + f3 (30).

In fact, the initial fit spectral function f (x) represents a sequence, and the sampling frequency corresponding to f (x) the entire spectral sequence is the same. Therefore, if the sampling frequency includes 10, 20, and 30 (i.e., x may be 10, 20, and 30), and the spectrum representation basis function includes f1(x), f2(x), and f3(x), f (x) may be [ f1(10) + f2(10) + f3 (10)), f1(20) + f2(20) + f3 (20)), f1(30) + f2(30) + f3(30) ].

The function parameters of the spectral representation basis function f1(x), the spectral representation basis function f2(x), and the spectral representation basis function f3(x) are all randomly generated at the beginning, the randomly generated basis function at the beginning of each spectral representation basis function can be referred to as the initial function parameter of each spectral representation basis function, and the initial function parameter of the spectral representation basis function f1(x) can be referred to as σ₁And mu₁It can be noted that the initial function parameter of the spectral representation basis function f2(x) is σ₂And mu₂It can be noted that the initial function parameter of the spectral representation basis function f3(x) is σ₃And mu₃。

The server may adjust the function parameters (including specifically the above σ) of each spectral representation basis function in the initially fitted spectral function f (x)₁、μ₁、σ₂、μ₂、σ₃And mu₃) And eachThe spectrum represents the corresponding weights of the basis functions (including a1, a2 and a3 above), so that the adjusted initial fit spectral function f (x) approximates to the whole spectral sequence, i.e. the adjusted initial fit spectral function is approximately the same as the whole spectral sequence. The whole spectrum sequence may be denoted as D, and in the adjustment process, the approximation degree between the adjusted initial fitting spectrum function f (x) and the whole spectrum sequence D may be represented by the following formula (3):

that is, the function J may be used as an objective function (the function J may also be referred to as a convergence function), and a smaller value of the function J indicates a smaller degree of difference between the adjusted initial fitting spectral function f (x) and the entire spectral sequence D, and a smaller degree of difference indicates a greater similarity (i.e., a greater tendency to be the same) between the adjusted initial fitting spectral function f (x) and the entire spectral sequence. Conversely, the larger the value of the function J, the larger the difference between the adjusted initial fit spectral function f (x) and the entire spectral sequence, and the larger the difference, the more dissimilar (i.e., more prone to be different) between the adjusted initial fit spectral function f (x) and the entire spectral sequence. For example, when the spectrum indicates that the basis functions include a basis function f1(x), a basis function f2(x) and a basis function f3(x), the basis function f1(x) corresponds to a1 by weight, the basis function f2(x) corresponds to a2 by weight, and the basis function f3(x) corresponds to a3 by weight, the values of x include 10, 20 and 30 (i.e., the sampling frequency includes 10Hz, 20Hz and 30Hz), and the entire spectrum sequence is [ g (10), g (20), g (30) ] (where g (10) is the energy amplitude corresponding to the sampling frequency 10 in the entire spectrum sequence, g (20) is the energy amplitude corresponding to the sampling frequency 20 in the entire spectrum sequence, and g (30) is the energy amplitude corresponding to the sampling frequency 30 in the entire spectrum sequence), if the function J is expanded, J may be [ a1 × f1(10 × 2a 68692 f) and g (30) + f3 a) + g (4620 a) may be 1) + g (19 a) and 1b 42 b 9 b) 20) + [ a 1f 1(30) + a 2f 2(30) + a 3f 3(30) -g (30) ].

Thus, the server may adjust each spectral representation basis function in the initially fitted spectral function f (x) in the convergence function JInitial function parameters (including σ as described above)₁、μ₁、σ₂、μ₂、σ₃And mu₃) Until the value of the function J is adjusted to the minimum. When the function parameter of each spectrum representation basis function is adjusted so that the value of the function J reaches the minimum value, the initial fitting spectrum function adjusted at this time can be obtained, which means that the convergence condition (it can be understood that the convergence condition is that the value of the function J is the minimum) is satisfied between the initial fitting spectrum function adjusted at this time and the whole spectrum sequence.

Step S104, when the convergence condition is satisfied between the adjusted initial fitting spectrum function and the whole spectrum sequence, obtaining at least two adjusted spectrum representation basis functions as spectrum reconstruction basis functions according to the adjusted initial fitting spectrum function;

specifically, when it is determined that the convergence condition is satisfied between the adjusted initial fitting spectral function and the entire spectral sequence, the adjusted function parameter σ at this time may be obtained from the adjusted initial fitting spectral function₁Is marked as σ₁₁(ii) a Obtaining the adjusted function parameter mu from the adjusted initial fitting spectrum function₁Is recorded as mu₁₁(ii) a Obtaining the adjusted function parameter sigma at the moment from the adjusted initial fitting spectrum function₂Is marked as σ₂₂(ii) a Obtaining the adjusted function parameter mu from the adjusted initial fitting spectrum function₂Is recorded as mu₂₂(ii) a Obtaining the adjusted function parameter sigma at the moment from the adjusted initial fitting spectrum function₃Is marked as σ₃₃(ii) a Obtaining the adjusted function parameter mu from the adjusted initial fitting spectrum function₃Is recorded as mu₃₃。

Wherein the function parameter σ₁₁、μ₁₁、σ₂₂、μ₂₂、σ₃₃、μ₃₃Respectively, the function parameter sigma of the adjustment₁、μ₁、σ₂、μ₂、σ₃、μ₃. The adjusted initial function parameter corresponding to each spectral representation basis function can be called as each spectral representation basis function pairCorresponding fixed function parameters (including σ as described above)₁₁、μ₁₁、σ₂₂、μ₂₂、σ₃₃、μ₃₃)。

Wherein the frequency spectrum may be represented as a basis function

The initial function parameter σ in₁Substitution as a fixed function parameter σ₁₁Initial function parameter μ₁Replacement by a fixed function parameter mu₁₁Obtaining a spectral reconstruction basis function corresponding to the spectral representation basis function f1(x)

Similarly, the frequency spectrum can be represented as a basis function

The initial function parameter σ in₂Substitution as a fixed function parameter σ₂₂Initial function parameter μ₂Replacement by a fixed function parameter mu₂₂Obtaining a spectral reconstruction basis function corresponding to the spectral representation basis function f2(x)

Similarly, the frequency spectrum can be represented as a basis function

The initial function parameter σ in₃Substitution as a fixed function parameter σ₃₃Initial function parameter μ₃Replacement by a fixed function parameter mu₃₃Obtaining a spectral reconstruction basis function corresponding to the spectral representation basis function f3(x)

Through the above process, a plurality of spectrum reconstruction basis functions can be obtained, and the plurality of spectrum reconstruction basis functions specifically include a spectrum reconstruction basis function f11(x) corresponding to the spectrum representation basis function f1(x), a spectrum reconstruction basis function f22(x) corresponding to the spectrum representation basis function f2(x), and a spectrum reconstruction basis function corresponding to the spectrum representation basis function f3(x)f33(x)。

Step S105, reconstructing at least two fragment frequency spectrum sequences according to the frequency spectrum reconstruction basis function to obtain reconstructed fragment frequency spectrum sequences; reconstructing the segment spectrum sequence for determining an audio representation vector of the target audio;

specifically, the server may reconstruct each segment spectrum sequence of the target audio through all the obtained spectrum reconstruction basis functions, and this process may be referred to as a spectrum reconstruction process. The sequence of the segment spectrum can be denoted as G_i(x) And i represents the number of the segment spectrum sequences, the minimum value of i is 1, and the maximum value of i is the number of the segment spectrum sequences. Can be used for the segment spectrum sequence G_i(x) The sequence obtained after reconstitution was denoted G'_i(x) (i.e., the reconstructed fragment spectrum sequence described above), G may be represented by_i(x) Is recorded as G for each energy amplitude_iX. Thus, for the segment spectrum sequence G_i(x) The principle of performing the reconstruction can be seen in the following equation (4):

wherein, ∑ G_i(x) Is the sum of the energy amplitudes corresponding to the same sampling frequency x in each sequence of the segment spectrum.

For example, please refer to fig. 6, which is a schematic view of a sequence reconstruction scenario provided in the present application. The sequence of the spectrum of the segment of the target audio includes a sequence 100e (i.e., [1, 2, 3]), a sequence 101e (i.e., [4, 5, 6]), and a sequence 102e (i.e., [7, 8, 9 ]). Wherein the energy amplitude 1 in the sequence 100e, the energy amplitude 4 in the sequence 101e, and the energy amplitude 7 in the sequence 102e correspond to the same sampling frequency of 10 hz; energy amplitude 2 in sequence 100e, energy amplitude 5 in sequence 101e, and energy amplitude 8 in sequence 102e correspond to the same sampling frequency of 20 hz; the energy amplitude 3 in the sequence 100e, the energy amplitude 6 in the sequence 101e and the energy amplitude 9 in the sequence 102e correspond to the same sampling frequency of 30 hz. The spectral reconstruction basis functions include basis function f11(x), basis function f22(x), and basis function f33(x) (as shown in set 103 c). Each sampling frequency (i.e., each value of x) may be respectively brought into each spectrum reconstruction basis function to obtain a reconstruction energy amplitude corresponding to each spectrum reconstruction basis function, where the reconstruction energy amplitude includes each energy amplitude in the set 104c, and specifically includes f11(10), f11(20), f11(30), f22(10), f22(20), f22(30), f33(10), f33(20), and f33 (30). Each segment spectral sequence may be reconstructed by each reconstruction energy magnitude in the set 104c, the reconstruction process is as follows:

g in the above equation (4) when reconstructing the energy amplitude 1 (corresponding to the sampling frequency 10) in the sequence 100e_iX is equal to 1, ∑ G_i(x) For the sum of the energy amplitudes corresponding to the sampling frequency 10 in each segment spectrum sequence, that is, 1+4+7 equals 12, f11(x) + f22(x) + f33(x) equals f11(10) + f22(10) + f33(10), therefore, the energy amplitude 105c reconstructed from the energy amplitude 1 in the sequence 100e can be obtained by substituting the values corresponding to the energy amplitude 1 into the formula (4). G in the above equation (4) when reconstructing the energy amplitude 2 (corresponding to the sampling frequency 20) in the sequence 100e_iX is equal to 2, ∑ G_i(x) For the sum of the energy amplitudes corresponding to the sampling frequency 20 in each segment spectrum sequence, that is, 2+5+8 equals 15, f11(x) + f22(x) + f33(x) equals f11(20) + f22(20) + f33(20), therefore, the energy amplitude 106c after reconstructing the energy amplitude 2 in the sequence 100e can be obtained by substituting the values corresponding to the energy amplitude 2 into the formula (4). G in equation (4) above when reconstructing the energy amplitude 3 (corresponding to the sampling frequency 30) in the sequence 100e_iX is equal to 3, ∑ G_i(x) For the sum of the energy amplitudes corresponding to the sampling frequency 30 in each segment spectrum sequence, that is, 3+6+9 equals 18, f11(x) + f22(x) + f33(x) equals f11(30) + f22(30) + f33(30), the energy amplitude 107c after reconstruction of the energy amplitude 3 in the sequence 100e can be obtained by substituting the values corresponding to the energy amplitude 3 into the formula (4). Through the energy amplitude 105c, the energy amplitude 106c and the energy amplitude 107c, a reconstructed segment spectrum sequence [ energy amplitude 105c, energy amplitude 106c, energy amplitude 107c ] obtained by reconstructing the segment spectrum sequence 100e can be obtained]。

Similarly, the energy amplitude 4 (corresponding to the sampling frequency 10) in the sequence 101e is reconstructedG in the above formula (4)_iX is equal to 4, ∑ G_i(x) For the sum of the energy amplitudes corresponding to the sampling frequency 10 in each segment spectrum sequence, that is, 1+4+7 equals 12, f11(x) + f22(x) + f33(x) equals f11(10) + f22(10) + f33(10), therefore, the energy amplitude 108c reconstructed for the energy amplitude 4 in the sequence 100e can be obtained by substituting the values corresponding to the energy amplitude 4 into the formula (4). G in the above equation (4) when reconstructing the energy amplitude 5 (corresponding to the sampling frequency 20) in the sequence 101e_iX is equal to 5, ∑ G_i(x) For the sum of the energy amplitudes corresponding to the sampling frequency 20 in each segment spectrum sequence, that is, 2+5+8 equals 15, f11(x) + f22(x) + f33(x) equals f11(20) + f22(20) + f33(20), the energy amplitude 109c reconstructed from the energy amplitude 5 in the sequence 101e can be obtained by substituting the values corresponding to the energy amplitude 5 into the formula (4). G in equation (4) above when reconstructing the energy amplitude 6 (corresponding to the sampling frequency 30) in the sequence 101e_iX is equal to 6, ∑ G_i(x) For the sum of the energy amplitudes corresponding to the sampling frequency 30 in each segment spectrum sequence, that is, 3+6+9 equals 18, f11(x) + f22(x) + f33(x) equals f11(30) + f22(30) + f33(30), the energy amplitude 110c reconstructed from the energy amplitude 6 in the sequence 101e can be obtained by substituting the values corresponding to the energy amplitude 6 into the formula (4). Through the energy amplitude 108c, the energy amplitude 109c and the energy amplitude 110c, a reconstructed segment spectrum sequence [ energy amplitude 108c, energy amplitude 109c, energy amplitude 110c ] obtained by reconstructing the segment spectrum sequence 101e can be obtained]。

Similarly, when the energy amplitude 7 (corresponding to the sampling frequency 10) in the sequence 102e is reconstructed, G in the above formula (4)_iX is equal to 7, ∑ G_i(x) For the sum of the energy amplitudes corresponding to the sampling frequency 10 in each segment spectrum sequence, that is, 1+4+7 equals 12, f11(x) + f22(x) + f33(x) equals f11(10) + f22(10) + f33(10), the energy amplitude 111c after reconstruction of the energy amplitude 7 in the sequence 102e can be obtained by substituting the values corresponding to the energy amplitude 7 into the formula (4). G in equation (4) above when reconstructing the energy amplitude 8 (corresponding to the sampling frequency 20) in the sequence 102e_iX is equal to 8, ∑ G_i(x)For the sum of the energy amplitudes corresponding to the sampling frequency 20 in each segment spectrum sequence, that is, 2+5+8 equals 15, f11(x) + f22(x) + f33(x) equals f11(20) + f22(20) + f33(20), the energy amplitude 112c after the energy amplitude 8 in the sequence 102e is reconstructed can be obtained by substituting the values corresponding to the energy amplitude 8 into the formula (4). G in equation (4) above when reconstructing the energy amplitude 9 (corresponding to the sampling frequency 30) in the sequence 102e_iX is equal to 9, ∑ G_i(x) For the sum of the energy amplitudes corresponding to the sampling frequency 30 in each segment spectrum sequence, that is, 3+6+9 equals 18, f11(x) + f22(x) + f33(x) equals f11(30) + f22(30) + f33(30), the energy amplitude 113c reconstructed from the energy amplitude 9 in the sequence 102e can be obtained by substituting the values corresponding to the energy amplitude 9 into the formula (4). Through the energy amplitude 111c, the energy amplitude 112c and the energy amplitude 113c, a reconstructed segment spectrum sequence [ energy amplitude 111c, energy amplitude 112c, energy amplitude 113c ] obtained by reconstructing the segment spectrum sequence 102e can be obtained]。

The server can obtain an audio representation vector of the target audio through each reconstructed fragment spectrum sequence of the target audio, and the audio representation vector of the target audio contains audio features of the target audio. Through the above process of reconstructing the segment spectrum sequence of the target audio, a small amount of energy amplitude values in the segment spectrum sequence, which are especially large compared with other most energy amplitude values, can be reduced, and a small amount of energy amplitude values in the segment spectrum sequence, which are especially small compared with other most energy amplitude values, can be increased, the magnitude of the energy amplitude in the reconstructed fragment frequency spectrum sequence obtained by final reconstruction is enabled to be closer to the magnitude of the overall energy distribution of the target audio, therefore, the audio representation vector of the target audio obtained by reconstructing the segment spectrum sequence can more accurately represent the audio features of the target audio (because only one tone in the target audio may have a particularly large energy amplitude and the other tones have relatively small energy amplitudes, and if the target audio is not reconstructed, the tone with the particularly large energy amplitude cannot accurately represent the distribution size of the energy amplitude of the target audio as a whole).

Further, the server may obtain the audio representation vector of each audio in the audio library in the same way as obtaining the audio representation vector of the target audio. Subsequently, when the server needs to retrieve other audio similar to the target audio in the audio library, the other audio in the audio library except the target audio can be used as the audio to be matched. The server can obtain the vector distance between the audio representation vector of the target audio and the audio representation vector of each audio to be matched. It can be understood that when the vector distance between the audio representation vector of a certain audio to be matched and the audio representation vector of the target audio is smaller, the audio to be matched is indicated to be more similar to the target audio. Conversely, when the vector distance between the audio representation vector of a certain audio to be matched and the audio representation vector of the target audio is larger, the audio to be matched is indicated to be more dissimilar to the target audio.

Optionally, the server may use the audio to be matched corresponding to the audio representation vector having the smallest vector distance with the audio representation vector of the target audio in the audio library as the similar audio of the target audio. For example, if the audio library includes audio 1 to be matched, audio 2 to be matched, and audio 3 to be matched, the vector distance between the audio representation vector of the audio 1 to be matched and the audio representation vector of the target audio is d1, the vector distance between the audio representation vector of the audio 2 to be matched and the audio representation vector of the target audio is d2, the vector distance between the audio representation vector of the audio 3 to be matched and the audio representation vector of the target audio is d3, d1 is greater than d2, and d2 is greater than d3, the server may use the audio 3 to be matched as the similar audio of the target audio.

Or, the server may use 2 (the number may be determined according to an actual application scenario, and is not limited thereto) audios to be matched with the audio representation vector of the target audio in the audio library as similar audios of the target audio. For example, if the audio library includes audio 1 to be matched, audio 2 to be matched, and audio 3 to be matched, the vector distance between the audio representation vector of the audio 1 to be matched and the audio representation vector of the target audio is d1, the vector distance between the audio representation vector of the audio 2 to be matched and the audio representation vector of the target audio is d2, the vector distance between the audio representation vector of the audio 3 to be matched and the audio representation vector of the target audio is d3, d1 is greater than d2, and d2 is greater than d3, the server may use the audio 2 to be matched and the audio 3 to be matched as similar audio of the target audio.

Or, the server may use, as the similar audio of the target audio, the audio to be matched corresponding to the audio representation vector in the audio library, where a vector distance between the audio representation vector and the target audio is smaller than a vector distance threshold (which may be set according to an actual application scenario, but is not limited to this). For another example, if the audio library includes audio 1 to be matched, audio 2 to be matched, and audio 3 to be matched, the vector distance between the audio representation vector of the audio 1 to be matched and the audio representation vector of the target audio is d1, the vector distance between the audio representation vector of the audio 2 to be matched and the audio representation vector of the target audio is d2, and the vector distance between the audio representation vector of the audio 3 to be matched and the audio representation vector of the target audio is d 3. If the vector distance threshold is d0, and the d1 is smaller than d0, d2 is smaller than d0, and d3 is larger than d0, the audio 1 to be matched and the audio 2 to be matched can be used as similar audio of the target audio.

Optionally, the server may obtain an audio representation vector of the target audio through the reconstructed segment spectrum sequence of the target audio (each reconstructed segment spectrum sequence of the target audio may be sequentially spliced (according to time sequence) to obtain an audio representation vector of the target audio), and may also obtain an audio representation matrix of the target audio through the reconstructed segment spectrum sequence of the target audio. Wherein, the audio representation matrix of the target audio and the audio representation vector of the target audio have the same function, and both contain the audio features of the target audio. The audio representation matrix of the target audio may comprise a plurality of vectors, it being understood that a row or a column in the audio representation matrix of the target audio is a vector, and that a vector in the audio representation matrix of the target audio may be obtained by reconstructing a sequence of the segment spectra, i.e. a reconstructed sequence of the segment spectra is a row or a column in the audio representation matrix of the target audio. Therefore, the server may also obtain an audio representation matrix of each audio (including the target audio) in the audio library through the same process as described above, and subsequently, the server may also retrieve other audio similar to the target audio through the audio representation matrix of each audio in the audio library.

The target audio may be music (for example, any song), the server may recommend similar audio of the target audio to the target client, where the target client is a user client and the target client is a client related to the music. The server can send the similar audio of the target audio to the target client, the target client can display a popup window containing the similar audio of the target audio in a client page, or the target client can add the similar audio of the target audio in a recommended song list in the client page so as to achieve the purpose of recommendation.

Optionally, the target client may be a client that plays the target audio for a time period that reaches a play time threshold, for example, the time period may be 1 day, the play time threshold may be 5 times, and then the target client may be a user client that plays the target audio for more than 5 times in one day. Or, each user client may further include a system favorite song list, where the system favorite song list is a song list set by the system for each user client, and the target client may be a user client having a system favorite song list containing target audio, in other words, the target audio is added to the system favorite song list in the target client. Alternatively, the target client may also be a user client responding to a similar retrieval operation for the target audio, that is, a user initiates a request for retrieving other audio similar to the target audio through the target client.

By the method, more accurate audio expression vectors of each audio (including the target audio and each audio in the audio library) can be obtained, so that similar audio of the target audio can be more accurately retrieved when other audio similar to the target audio is retrieved subsequently, and more accurate audio recommendation and audio retrieval service can be provided for the target client.

Please refer to fig. 7, which is a schematic page diagram of a terminal device provided in the present application. As shown in fig. 7, the page 100f, the page 101f, the page 102f, the page 103f, the page 104f, and the page 105f are all terminal pages in the terminal device, and the terminal pages may be client pages in a user client in the terminal device. In the first recommended case, the page 100f includes a list corresponding to popular songs including music a, music b, music c, music d, and music e. Also, music a, music b, music c, music d, and music e in the page 100f each correspond to a "search for similarity" button. The terminal device may jump to the page 101f in response to a click operation of a "retrieve similar" button corresponding to the music a in the page 100f, and include in the page 101f a music retrieval result for the music a, the music retrieval result including the music f and the music g, indicating that the retrieved music similar to the music a includes the music f and the music g. When the terminal device searches for music similar to music a, the method provided by the application can be adopted to generate the audio expression vector of each audio, so as to realize the search operation of music similar to music a. The process of retrieving other music similar to music a may be performed by the terminal device independently, or may be performed by the terminal device through the server.

In the second recommendation scenario, as shown in fig. 7, the system favorite song list is included in the page 102f, and the system favorite song list is provided by the system for each user client. The system favorite song list in the page 102f includes music a, music b, and music c, which may all be the target audio. The terminal device may retrieve other music similar to music a, music b and music c in the audio library, and after the retrieval is completed, the terminal device may add the retrieved music to the "daily recommended" song list. The song list recommended every day also is the song list set by the system for each user client, and the song list recommended every day comprises music recommended by the system to the user client every day. The terminal device may display the page 103f in response to a user's click operation for the "daily recommendation" song list. Other music similar to music a, music b and music c, specifically including music f and music g similar to music a, music d similar to music b and music e similar to music c, which are retrieved, are displayed in the page 103.

As shown in fig. 7, in the third recommended case, the page 104f displays history music including music a, music b, and music c, indicating that music a, music b, and music c have been played. In addition, the terminal device may acquire the playing times of each piece of music in the history playing music, and determine which pieces of music in the history playing music have the playing times reaching the playing time threshold. Assuming that the number of times of playing of music a in the history playing music reaches the threshold number of times of playing, the terminal device may retrieve other music similar to music a and may add music f and music g similar to music a in the "daily recommendation" song list as well (as shown in page 105 f).

Please refer to fig. 8, which is a schematic structural diagram of an audio data processing apparatus according to the present application. The audio data processing device may be a computer program (including program code) running on a computer device, for example, the audio data processing device is an application software, and the image data processing device may be configured to execute corresponding steps in the method provided by the embodiment of the present application. As shown in fig. 8, the audio data processing apparatus 1 may include: the system comprises a sequence acquisition module 11, a sequence generation module 12, an adjustment module 13, a convergence determination module 14 and a reconstruction module 15;

a sequence obtaining module 11, configured to obtain at least two segment spectrum sequences of a target audio; the at least two segment spectrum sequences are obtained by sampling the energy amplitudes of at least two audio segments of the target audio;

the sequence generating module 12 is configured to generate an overall frequency spectrum sequence of the target audio according to the energy amplitudes respectively included in the at least two segment frequency spectrum sequences;

the adjusting module 13 is configured to generate an initial fitting spectrum function according to the at least two spectrum representation basis functions, and adjust the at least two spectrum representation basis functions to obtain an adjusted initial fitting spectrum function;

a convergence determining module 14, configured to, when a convergence condition is satisfied between the adjusted initial fitting spectrum function and the entire spectrum sequence, obtain, according to the adjusted initial fitting spectrum function, at least two adjusted spectrum representation basis functions as spectrum reconstruction basis functions;

the reconstruction module 15 is configured to reconstruct the at least two segment spectrum sequences according to the spectrum reconstruction basis function to obtain reconstructed segment spectrum sequences; the reconstructed segment spectral sequence is used to determine an audio representation vector for the target audio.

For specific functional implementation manners of the sequence obtaining module 11, the sequence generating module 12, the adjusting module 13, the convergence determining module 14, and the reconstructing module 15, please refer to steps S101 to S105 in the corresponding embodiment of fig. 3, which is not described herein again.

Wherein, the sequence acquiring module 11 includes: a time sampling unit 111, a sequence segmentation unit 112, and a sequence generation unit 113;

the time sampling unit 111 is configured to sample an energy amplitude of the target audio according to a sampling time interval to obtain an energy time sequence corresponding to the target audio;

the sequence segmentation unit 112 is configured to segment the energy time sequence according to the sampling time period to obtain at least two segment energy time sequences included in the energy time sequence;

a sequence generating unit 113, configured to generate at least two segment spectrum sequences according to the at least two segment energy time sequences.

For a specific implementation manner of functions of the time sampling unit 111, the sequence segmentation unit 112, and the sequence generation unit 113, please refer to step S101 in the embodiment corresponding to fig. 3, which is not described herein again.

Wherein, the sequence generation unit includes: a transform subunit 1131, a frequency sampling subunit 1132, and a sequence determination subunit 1133;

a transform subunit 1131, configured to perform frequency domain transform on at least two segment energy time sequences respectively to obtain a segment frequency signal corresponding to each segment energy time sequence;

the frequency sampling subunit 1132 is configured to respectively sample each segment frequency signal according to the sampling frequency interval to obtain a segment frequency sequence corresponding to each segment frequency signal;

the sequence determining subunit 1133 is configured to determine the segment frequency sequence corresponding to each segment frequency signal as the segment frequency spectrum sequence of the target audio.

For a specific function implementation manner of the transformation subunit 1131, the frequency sampling subunit 1132 and the sequence determining subunit 1133, please refer to step S101 in the corresponding embodiment of fig. 3, which is not described herein again.

a sequence generation module 12, comprising: an amplitude summing unit 121 and a sum sequence generating unit 122;

an amplitude summing unit 121, configured to sum energy amplitudes belonging to the same sampling frequency in each segment spectrum sequence to obtain a summed energy amplitude corresponding to each sampling frequency;

and a summation sequence generating unit 122, configured to generate an overall frequency spectrum sequence of the target audio according to the summation energy amplitude corresponding to each sampling frequency.

For a specific functional implementation manner of the amplitude summation unit 121 and the summation sequence generation unit 122, please refer to step S102 in the embodiment corresponding to fig. 3, which is not described herein again.

a reconstruction module 15 comprising: an input unit 151 and a reconstruction unit 152;

an input unit 151, configured to input at least two sampling frequencies into the spectrum reconstruction basis function to obtain a reconstruction energy amplitude corresponding to the spectrum reconstruction basis function;

and a reconstructing unit 152, configured to reconstruct the at least two segment spectrum sequences according to the reconstructed energy amplitude and the energy amplitude corresponding to each sampling frequency included in each segment spectrum sequence, so as to obtain a reconstructed segment spectrum sequence.

Please refer to step S105 in the embodiment corresponding to fig. 3 for a specific implementation manner of functions of the input unit 151 and the reconstructing unit 152, which is not described herein again.

Wherein, the adjusting module 13 includes: a basis function acquisition unit 131, an expression acquisition unit 132, an expression summation unit 133, and an initial function generation unit 134;

a basis function obtaining unit 131, configured to obtain at least two normal distribution functions, and determine the at least two normal distribution functions as at least two spectrum representation basis functions; each frequency spectrum representation basis function comprises a function dependent variable; the value range of the function dependent variable comprises at least two sampling frequencies;

the expression obtaining unit 132 is configured to input at least two sampling frequencies into each spectrum representation basis function, so as to obtain at least two energy amplitude expressions corresponding to each spectrum representation basis function; each frequency spectrum representation basis function is used for outputting an energy amplitude expression corresponding to each sampling frequency;

the expression summing unit 133 is configured to input, in at least two energy amplitude expressions corresponding to each spectrum representation basis function, the energy amplitude expressions with the same sampling frequency for weighted summation to obtain a summed energy amplitude expression corresponding to each sampling frequency;

and an initial function generating unit 134, configured to generate an initial fitting spectrum function according to the summation energy amplitude expression corresponding to each sampling frequency.

For specific functional implementation manners of the basis function obtaining unit 131, the expression obtaining unit 132, the expression summing unit 133, and the initial function generating unit 134, please refer to step S103 in the embodiment corresponding to fig. 3, which is not described herein again.

an adjustment module 13, comprising: a convergence function acquisition unit 135, a parameter adjustment unit 136, and a convergence determination unit 137;

a convergence function acquisition unit 135 for acquiring a convergence function; the convergence function comprises an initial fitting spectrum function and an overall spectrum sequence; the convergence function is used for representing the difference degree between the initially fitted spectrum function and the whole spectrum sequence;

the parameter adjusting unit 136 is configured to adjust initial fitting spectrum functions in the convergence function, where each spectrum represents an initial function parameter corresponding to the basis function, and obtain an adjusted initial fitting spectrum function;

a convergence determining unit 137, configured to determine that a convergence condition is satisfied between the adjusted initial fitting spectrum function and the entire spectrum sequence when the initial function parameter corresponding to each spectrum representation basis function is adjusted until the convergence function reaches a minimum value;

then, the convergence determination module 14 includes: a parameter replacement unit 141 and a basis function determination unit 142;

the parameter replacing unit 141 is configured to, when the convergence condition is satisfied between the adjusted initial fitting spectral function and the entire spectral sequence, replace the initial function parameter in each spectral representation basis function with a fixed function parameter corresponding to each spectral representation basis function in the adjusted initial fitting spectral function, respectively, to obtain at least two adjusted spectral representation basis functions; the fixed function parameter corresponding to each frequency spectrum representation basis function is the adjusted initial function parameter corresponding to each frequency spectrum representation basis function;

a basis function determining unit 142, configured to determine the adjusted at least two spectrum representation basis functions as spectrum reconstruction basis functions.

For specific functional implementation manners of the convergence function obtaining unit 135, the parameter adjusting unit 136 and the convergence determining unit 137, please refer to step S103 in the embodiment corresponding to fig. 3, and for specific functional implementation manners of the parameter replacing unit 141 and the basis function determining unit 142, refer to step S104 in the embodiment corresponding to fig. 3, which is not described herein again.

Wherein, the audio data processing device 1 further comprises: a distance acquisition module 16, a similar audio determination module 17 and a recommendation module 18;

the distance acquisition module 16 is configured to respectively acquire vector distances between an audio representation vector of a target audio and audio representation vectors of at least two to-be-matched audios in an audio library;

the similar audio determining module 17 is configured to determine, as a similar audio of the target audio, an audio to be matched corresponding to an audio representation vector having a minimum vector distance from an audio representation vector of the target audio in the audio library;

and the recommending module 18 is used for recommending the similar audio of the target audio to the target client.

For specific functional implementation manners of the distance obtaining module 16, the similar audio determining module 17, and the recommending module 18, please refer to step S105 in the corresponding embodiment of fig. 3, which is not described herein again.

Wherein, the recommending module 18 includes: a first client determining unit 181, a second client determining unit 182, a third client determining unit 183, and a recommending unit 184;

a first client determining unit 181, configured to determine, as a target client, a user client whose play frequency of the target audio is greater than the play frequency threshold in the target time period; alternatively, the first and second electrodes may be,

a second client determining unit 182 for determining a user client having the system favorite song list including the target audio as a target client; alternatively, the first and second electrodes may be,

a third client determining unit 183 configured to determine, as a target client, a user client that responds to a similar retrieval operation for the target audio;

and the recommending unit 184 is configured to recommend the similar audio of the target audio to the target client.

For specific functional implementation manners of the first client determining unit 181, the second client determining unit 182, the third client determining unit 183, and the recommending unit 184, please refer to step S105 in the corresponding embodiment of fig. 3, which is not described herein again.

Please refer to fig. 9, which is a schematic structural diagram of a computer device provided in the present application. As shown in fig. 9, the computer apparatus 1000 may include: the processor 1001, the network interface 1004, and the memory 1005, and the computer device 1000 may further include: a user interface 1003, and at least one communication bus 1002. Wherein a communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a Display screen (Display) and a Keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface and a standard wireless interface. The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (non-volatile memory), such as at least one disk memory. The memory 1005 may optionally be at least one memory device located remotely from the processor 1001. As shown in fig. 9, a memory 1005, which is a kind of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and a device control application program.

In the computer device 1000 shown in fig. 9, the network interface 1004 may provide a network communication function; the user interface 1003 is an interface for providing a user with input; the processor 1001 may be configured to call the device control application stored in the memory 1005 to implement the audio data processing method described in the embodiment corresponding to fig. 3. It should be understood that the computer device 1000 described in this application can also perform the description of the audio data processing apparatus 1 in the embodiment corresponding to fig. 8, and the description is not repeated here. In addition, the beneficial effects of the same method are not described in detail.

Further, here, it is to be noted that: the present application further provides a computer-readable storage medium, and the computer-readable storage medium stores the aforementioned computer program executed by the audio data processing apparatus 1, and the computer program includes program instructions, and when the processor executes the program instructions, the description of the audio data processing method in the embodiment corresponding to fig. 3 can be performed, so that details are not repeated here. In addition, the beneficial effects of the same method are not described in detail. For technical details not disclosed in the embodiments of the computer storage medium referred to in the present application, reference is made to the description of the embodiments of the method of the present application. As an example, program instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network, which may comprise a block chain system.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

The above disclosure is only for the purpose of illustrating the preferred embodiments of the present application and is not to be construed as limiting the scope of the present application, so that the present application is not limited thereto but rather by the claims appended hereto.

Claims

1. A method of audio data processing, comprising:

acquiring at least two fragment frequency spectrum sequences of a target audio; the at least two segment spectrum sequences are obtained by sampling energy amplitudes of at least two audio segments of the target audio;

generating an initial fitting spectrum function according to the at least two spectrum representation basic functions, and adjusting the at least two spectrum representation basic functions to obtain an adjusted initial fitting spectrum function;

reconstructing the at least two fragment frequency spectrum sequences according to the frequency spectrum reconstruction basis function to obtain reconstructed fragment frequency spectrum sequences; the reconstructed segment spectral sequence is used to determine an audio representation vector for the target audio.

2. The method of claim 1, wherein the obtaining at least two sequences of segment spectrums of the target audio comprises:

segmenting the energy time sequence according to a sampling time period to obtain at least two fragment energy time sequences contained in the energy time sequence;

and generating the at least two fragment frequency spectrum sequences according to the at least two fragment energy time sequences.

3. The method of claim 2, wherein the generating the at least two sequences of segment spectra from the at least two sequences of segment energy times comprises:

respectively carrying out frequency domain transformation on the at least two fragment energy time sequences to obtain a fragment frequency signal corresponding to each fragment energy time sequence;

4. The method according to claim 1, wherein each of the at least two segment spectrum sequences includes energy amplitude values corresponding to at least two sampling frequencies;

the generating of the whole spectrum sequence of the target audio according to the energy amplitude values respectively contained in the at least two segment spectrum sequences comprises:

5. The method of claim 4, wherein the spectral reconstruction basis functions include function dependent variables; the value range of the function dependent variable comprises the at least two sampling frequencies;

reconstructing the at least two segment spectrum sequences according to the spectrum reconstruction basis function to obtain reconstructed segment spectrum sequences, including:

inputting the at least two sampling frequencies into the frequency spectrum reconstruction basis function to obtain a reconstruction energy amplitude value corresponding to the frequency spectrum reconstruction basis function;

and reconstructing the at least two fragment frequency spectrum sequences according to the reconstructed energy amplitude and the energy amplitude corresponding to each sampling frequency included in each fragment frequency spectrum sequence to obtain the reconstructed fragment frequency spectrum sequence.

6. The method of claim 4, wherein generating an initial fit spectral function from at least two spectral representation basis functions comprises:

obtaining at least two normal distribution functions, and determining the at least two normal distribution functions as the at least two frequency spectrum representation basis functions; each frequency spectrum representation basis function comprises a function dependent variable; the value range of the function dependent variable comprises the at least two sampling frequencies;

inputting the at least two sampling frequencies into each frequency spectrum representation basis function respectively to obtain at least two energy amplitude expressions corresponding to each frequency spectrum representation basis function respectively; each frequency spectrum representation basis function is used for outputting an energy amplitude expression corresponding to each sampling frequency;

and generating the initial fitting spectrum function according to the summation energy amplitude expression corresponding to each sampling frequency respectively.

7. The method according to claim 6, wherein each of the spectrum representation basis functions includes the corresponding initial function parameter;

the adjusting the at least two spectrum representation basis functions to obtain an adjusted initial fitting spectrum function includes:

acquiring a convergence function; the convergence function comprises the initial fitting spectrum function and the whole spectrum sequence; the convergence function is used for representing the difference degree between the initial fitting spectrum function and the whole spectrum sequence;

adjusting initial function parameters corresponding to the basis functions in the initial fitting spectrum functions in the convergence function respectively by each spectrum to obtain the adjusted initial fitting spectrum functions;

when the initial function parameters corresponding to each frequency spectrum representation basis function are adjusted until the convergence function reaches the minimum value, determining that the convergence condition is met between the adjusted initial fitting frequency spectrum function and the whole frequency spectrum sequence;

when the convergence condition is satisfied between the adjusted initial fitting spectrum function and the whole spectrum sequence, replacing the initial function parameters in each spectrum representation basis function with fixed function parameters corresponding to each spectrum representation basis function in the adjusted initial fitting spectrum function respectively to obtain at least two adjusted spectrum representation basis functions; the fixed function parameter corresponding to each frequency spectrum representation basis function is the adjusted initial function parameter corresponding to each frequency spectrum representation basis function;

and determining the at least two adjusted spectrum representation basis functions as the spectrum reconstruction basis functions.

8. The method of claim 1, further comprising:

respectively obtaining the vector distance between the audio expression vector of the target audio and the audio expression vectors of at least two audios to be matched in an audio library;

determining the audio to be matched corresponding to the audio representation vector with the minimum vector distance to the audio representation vector of the target audio in the audio library as the similar audio of the target audio;

recommending the similar audio of the target audio to a target client.

9. The method of claim 8, wherein recommending the similar audio of the target audio to a target client comprises:

determining the user client terminal with the playing times of the target audio frequency being larger than the playing time threshold value in the target time period as the target client terminal; alternatively, the first and second electrodes may be,

determining a user client having a system favorite song list containing the target audio as the target client; alternatively, the first and second electrodes may be,

determining the user client end responding to the similar retrieval operation aiming at the target audio frequency as the target client end;

recommending the similar audio of the target audio to the target client.

10. An audio data processing apparatus, comprising:

the sequence acquisition module is used for acquiring at least two fragment frequency spectrum sequences of the target audio; the at least two segment spectrum sequences are obtained by sampling energy amplitudes of at least two audio segments of the target audio;

a convergence determining module, configured to, when a convergence condition is satisfied between the adjusted initial fitting spectrum function and the entire spectrum sequence, obtain, according to the adjusted initial fitting spectrum function, at least two adjusted spectrum representation basis functions as spectrum reconstruction basis functions;

11. The apparatus according to claim 10, wherein each of the at least two segment spectrum sequences includes energy amplitude values corresponding to at least two sampling frequencies;

the sequence generation module comprises:

and the summation sequence generating unit is used for generating the whole spectrum sequence of the target audio according to the summation energy amplitude value corresponding to each sampling frequency respectively.

12. The apparatus of claim 11, wherein the spectral reconstruction basis functions include function dependent variables; the value range of the function dependent variable comprises the at least two sampling frequencies;

the reconstruction module comprises:

the input unit is used for inputting the at least two sampling frequencies into the spectrum reconstruction basis function to obtain a reconstruction energy amplitude value corresponding to the spectrum reconstruction basis function;

and the reconstruction unit is used for reconstructing the at least two fragment frequency spectrum sequences according to the reconstructed energy amplitude and the energy amplitude corresponding to each sampling frequency included in each fragment frequency spectrum sequence to obtain the reconstructed fragment frequency spectrum sequence.

13. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1-9.

14. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, perform the method according to any one of claims 1-9.