CN113593607B - Audio processing method and device and electronic equipment


Info

Publication number: CN113593607B
Application number: CN202010364644.4A
Authority: CN (China)
Prior art keywords: audio, score, target, user, training sample
Legal status: Active (granted)
Other versions: CN113593607A
Other languages: Chinese (zh)
Inventor: 张家隆
Applicant and current assignee: Guangzhou Huancheng Culture Media Co., Ltd.

Classifications

    • G PHYSICS > G10 MUSICAL INSTRUMENTS; ACOUSTICS > G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use, for comparison or discrimination
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use

Abstract

The invention discloses an audio processing method, an audio processing device and electronic equipment, wherein the method comprises the following steps: acquiring target audio to be processed; obtaining a selected feature vector, wherein the feature vector includes at least one feature that affects a score of the audio; obtaining a mapping function between the feature vector and the score; and obtaining a prediction score of the target audio according to the mapping function and the vector value of the feature vector of the target audio.

Description

Audio processing method and device and electronic equipment
Technical Field
The present invention relates to the field of internet technology, and more particularly, to an audio processing method, an audio processing apparatus, an electronic device, and a computer-readable storage medium.
Background
With the rapid development of science and technology, play-and-sing has become a new mode of singing entertainment. While singing, the user can play the corresponding chords by tapping the corresponding chord keys, providing accompaniment for the singing, and audio is generated from the performance.
In the prior art, background operators are usually required to score the audio content that users generate with the play-and-sing tool, and to select high-quality content and performers for subsequent content consumption.
However, the background operators have no objective, uniform standard for scoring the audio, so the scoring results obtained are subjective, and the cost is high.
Disclosure of Invention
It is an object of the present invention to provide a new solution for automatically scoring audio.
According to a first aspect of the present invention, there is provided an audio processing method comprising:
acquiring target audio to be processed;
obtaining a selected feature vector, wherein the feature vector includes at least one feature that affects a score of audio;
obtaining a mapping function between the feature vector and the score;
and obtaining a prediction score of the target audio according to the mapping function and the vector value of the feature vector of the target audio.
Optionally, the at least one feature includes: at least one of mel frequency cepstrum coefficient, zero crossing rate, short-time energy, short-time autocorrelation function, short-time average amplitude difference, spectrogram, spectral entropy, fundamental frequency and formants.
Optionally, the obtaining a mapping function between the feature vector and the score includes:
obtaining training samples, wherein each training sample is audio and is marked with a corresponding actual score;
and training to obtain the mapping function according to the vector value and the actual score of the feature vector of the training sample.
Optionally, the acquiring the training samples includes:
acquiring at least one initial audio, wherein each initial audio is marked with a corresponding actual score;
taking initial audio whose actual score is a specified score as reference audio;
determining a reference user according to the reference audio;
acquiring other audio generated by the reference user as extension audio;
marking the extension audio with the specified score;
and taking the marked extension audio and the marked initial audio as the training samples.
Optionally, the determining the reference user according to the reference audio includes:
determining a user who generated each reference audio as a target user;
determining, for each of the target users, a first number of generated reference audio and a second number of generated initial audio;
determining, for each of the target users, a ratio of the first number to the second number;
and selecting the reference user from the target users according to the ratio.
Optionally, the determining the reference user according to the reference audio includes:
determining a user who generated each reference audio as a target user;
determining, for each of the target users, a first number of generated reference audio;
and selecting the reference user from the target users according to the first number.
Optionally, the training to obtain the mapping function according to the vector values and the actual scores of the feature vectors of the training samples includes:
taking undetermined coefficients of the mapping function as variables, and determining a scoring prediction expression of each training sample according to the vector value of the feature vector of each training sample;
constructing a loss function according to the scoring prediction expression of each training sample and the actual score of each training sample;
and determining the undetermined coefficients according to the loss function, completing the training of the mapping function.
Optionally, the constructing a loss function according to the scoring prediction expression of each training sample and the actual score of each training sample includes:
for each training sample, determining a corresponding loss expression according to the scoring prediction expression and the actual score;
and summing the loss expressions of all the training samples to obtain the loss function.
Optionally, the method further comprises:
acquiring an actual score of the target audio;
taking the target audio as a new training sample, and marking the new training sample according to the actual score;
and correcting the mapping function according to the vector value of the feature vector of the new training sample and the actual score of the new training sample.
Optionally, the method further comprises:
executing the step of training the mapping function according to a preset training period.
Optionally, the method further comprises:
providing the prediction score of the target audio to the client that generated the target audio, for presentation.
Optionally, the method further comprises:
determining whether the target audio is high-quality audio according to the prediction scores;
and adding the target audio to a recommendation list in the case that the target audio is high-quality audio.
According to a second aspect of the present invention, there is provided an audio processing apparatus, comprising:
the audio acquisition module is used for acquiring target audio to be processed;
A feature acquisition module for acquiring a selected feature vector, wherein the feature vector includes at least one feature that affects a score of audio;
the function acquisition module is used for acquiring a mapping function between the feature vector and the score;
and the scoring prediction module is used for obtaining the prediction score of the target audio according to the mapping function and the vector value of the feature vector of the target audio.
According to a third aspect of the present invention, there is provided an electronic device comprising:
the apparatus according to the second aspect of the present invention; or
a processor and a memory, wherein the memory is used for storing instructions for controlling the processor to perform the method according to the first aspect of the present invention.
According to a fourth aspect of the present invention there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor implements a method according to the first aspect of the present invention.
In the embodiments of the present invention, the prediction score of the target audio can be obtained according to the feature vector and the mapping function; the prediction score is obtained automatically, without manual scoring, which reduces labor cost. In addition, since the mapping function is trained on a large number of training samples, using it to determine the prediction score of the target audio improves the accuracy of the obtained prediction score and makes the result more objective.
Other features of the present invention and its advantages will become apparent from the following detailed description of exemplary embodiments of the invention, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
Fig. 1a is a block diagram showing one example of a hardware configuration of an electronic device that may be used to implement an embodiment of the present invention.
Fig. 1b is a block diagram showing another example of a hardware configuration of an electronic device that may be used to implement an embodiment of the invention.
Fig. 2 shows a schematic diagram of an application scenario of an audio processing method according to an embodiment of the present invention.
Fig. 3 shows a flow diagram of an audio processing method according to an embodiment of the invention.
FIG. 4 shows a flow chart of the steps of obtaining training samples according to an embodiment of the present invention.
Fig. 5 shows a block diagram of an audio processing device of an embodiment of the invention.
Fig. 6 shows a block diagram of an example of an electronic device of an embodiment of the invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless it is specifically stated otherwise.
The following description of at least one exemplary embodiment is merely exemplary in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of exemplary embodiments may have different values.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
< Hardware configuration >
Fig. 1a and 1b are block diagrams of hardware configurations of an electronic device 1000 that may be used to implement the audio processing method of any embodiment of the invention.
In one embodiment, as shown in FIG. 1a, electronic device 1000 may be a server 1100.
The server 1100 provides computing, database, and communication services. The server 1100 may be a monolithic server or a distributed server spanning multiple computers or computer data centers. A server may be of various types, such as, but not limited to, a web server, a news server, a mail server, a message server, an advertisement server, a file server, an application server, an interaction server, a database server, or a proxy server. In some embodiments, each server may include hardware, software, embedded logic components, or a combination of two or more such components for performing the appropriate functions supported or implemented by the server. For example, the server may be a blade server, a cloud server, or the like, or may be a server group consisting of multiple servers of one or more of the types described above.
In this embodiment, the server 1100 may include a processor 1110, a memory 1120, an interface device 1130, a communication device 1140, a display device 1150, and an input device 1160, as shown in fig. 1 a.
In this embodiment, the server 1100 may also include a speaker, microphone, etc., without limitation.
The processor 1110 may be a dedicated server processor, or a desktop processor, a mobile processor, or the like that meets the performance requirements, which is not limited herein. The memory 1120 includes, for example, ROM (read-only memory), RAM (random access memory), and nonvolatile memory such as a hard disk. The interface device 1130 includes, for example, various bus interfaces, such as a serial bus interface (including a USB interface) and a parallel bus interface. The communication device 1140 can, for example, perform wired or wireless communication. The display device 1150 is, for example, a liquid crystal display, an LED display, or a touch display. The input device 1160 may include, for example, a touch screen and a keyboard.
In this embodiment, the memory 1120 of the server 1100 is used to store instructions for controlling the processor 1110 to operate at least to perform an audio processing method according to any embodiment of the present invention. The skilled person can design instructions according to the disclosed solution. How the instructions control the processor to operate is well known in the art and will not be described in detail here.
Although a plurality of devices of the server 1100 are shown in fig. 1a, the present invention may relate to only some of the devices, for example, the server 1100 may relate to only the memory 1120 and the processor 1110.
In one embodiment, the electronic device 1000 may be a terminal device 1200 such as a PC, a notebook computer, etc. used by an operator, which is not limited herein.
In this embodiment, as shown with reference to fig. 1b, the terminal apparatus 1200 may include a processor 1210, a memory 1220, an interface device 1230, a communication device 1240, a display device 1250, an input device 1260, a speaker 1270, a microphone 1280, and the like.
The processor 1210 may be a mobile processor. The memory 1220 includes, for example, ROM (read-only memory), RAM (random access memory), and nonvolatile memory such as a hard disk. The interface device 1230 includes, for example, a USB interface and a headphone interface. The communication device 1240 may, for example, perform wired or wireless communication; it may include a short-range communication device, for example, any device performing short-range wireless communication based on Hilink, Wi-Fi (IEEE 802.11), Mesh, Bluetooth, ZigBee, Thread, Z-Wave, NFC, UWB, LiFi, or similar protocols, and it may also include a long-range communication device, for example, any device performing WLAN, GPRS, or 2G/3G/4G/5G communication. The display device 1250 is, for example, a liquid crystal display or a touch display. The input device 1260 may include, for example, a touch screen and a keyboard. The user may input/output voice information through the speaker 1270 and the microphone 1280.
In this embodiment, the memory 1220 of the terminal device 1200 is used to store instructions for controlling the processor 1210 to operate at least to perform the audio processing method according to any embodiment of the present invention. The skilled person can design instructions according to the disclosed solution. How the instructions control the processor to operate is well known in the art and will not be described in detail here.
Although a plurality of devices of the terminal apparatus 1200 are shown in fig. 1b, the present invention may relate to only some of the devices thereof, for example, the terminal apparatus 1200 may relate to only the memory 1220 and the processor 1210 and the display device 1250.
< Application scenario >
Fig. 2 is a schematic diagram of an application scenario of an audio processing method according to an embodiment of the present invention.
The audio processing method of this embodiment can be applied, in particular, to KTV scenarios, play-and-sing scenarios, recording scenarios, live-broadcast scenarios, virtual-anchor scenarios, and the like, for scoring a user's singing.
As shown in fig. 2, user a may input speech through his client, and the client of user a generates target audio from the speech input by user a and provides the target audio to the electronic device 1000. The electronic device 1000 obtains the selected feature vector and a mapping function between the feature vector and the score, and obtains a prediction score of the target audio according to the mapping function and a vector value of the feature vector of the target audio. The electronic device 1000 may be configured to return the predicted score of the target audio to the client of user a for presentation.
According to the method provided by the embodiment of the invention, the prediction score of the target audio can be automatically obtained without manual scoring, and the labor cost can be reduced. In addition, the mapping function is obtained through training according to a large number of training samples, so that when the mapping function is used for determining the scores of the predicted target audio, the accuracy of the obtained predicted scores can be improved, and the results of the predicted scores can be more objective.
For example, in a KTV, play-and-sing, or recording scenario, a user may input singing voice through client A, and client A generates target audio from the user's singing voice and provides the target audio to the electronic device 1000. The electronic device 1000 obtains the selected feature vector and the mapping function between the feature vector and the score, and obtains the prediction score of the target audio according to the mapping function and the vector value of the feature vector of the target audio. Once the prediction score is obtained, the electronic device 1000 may return it to the user's client A. When client A obtains the prediction score, it can display the score in an interface for the user to view.
For another example, in a live-broadcast scenario, a host may input singing voice through client B, which generates target audio from the host's singing voice and provides the target audio to the electronic device 1000. The electronic device 1000 obtains the selected feature vector and the mapping function between the feature vector and the score, and obtains the prediction score of the target audio according to the mapping function and the vector value of the feature vector of the target audio. When the prediction score is obtained, the electronic device 1000 may return it to the host's client B. When client B obtains the prediction score, it can display the score in an interface for the host to view.
In one example, the electronic device 1000 may also return the prediction score to the clients of the users in the host's live room, so that the users in the live room can also view the prediction score of the host's singing voice.
For another example, in a virtual-anchor scenario, a dubbing person inputs singing voice through client C, which generates target audio from the dubbing person's singing voice and provides the target audio to the electronic device 1000. The electronic device 1000 obtains the selected feature vector and the mapping function between the feature vector and the score, and obtains the prediction score of the target audio according to the mapping function and the vector value of the feature vector of the target audio. When the prediction score is obtained, the electronic device 1000 may return it to the dubbing person's client C. When client C obtains the prediction score, it can display the score in an interface for the dubbing person to view.
In one example, the electronic device 1000 may also return the prediction score to the clients of the users watching the video corresponding to the singing voice, so that those users can also view the prediction score of the dubbing person's singing voice.
< Method example >
In this embodiment, an audio processing method is provided. The audio processing method may be implemented by an electronic device. The electronic device may be a server 1100 as shown in fig. 1a or a terminal device 1200 as shown in fig. 1 b.
As shown in fig. 3, the audio processing method of the present embodiment may include the following steps S1000 to S4000:
step S1000, obtaining target audio to be processed.
In one embodiment of the invention, the target audio may be audio derived by a user's client that contains at least speech input by the user.
The client may be a designated application, such as a play-and-sing application or a singing application.
In one example, the target audio may be derived from a user's client capturing both speech input by the user and accompaniment played by the client.
In another example, the target audio may be synthesized from speech input by the user and accompaniment generated by the client.
Step S2000, the selected feature vector is acquired.
Wherein the feature vector includes at least one feature that affects the score of the audio.
The feature vector X includes at least one feature $x_j$ that affects the score of the audio, where j is a natural number from 1 to n and n is the total number of features in the feature vector X.
In one embodiment of the present invention, the at least one feature $x_j$ may measure the corresponding audio along multiple dimensions such as timbre, intonation, rhythm, and chords, and may include: at least one of mel frequency cepstrum coefficient, zero-crossing rate, short-time energy, short-time autocorrelation function, short-time average amplitude difference, spectrogram, spectral entropy, fundamental frequency, and formants.
The mel frequency cepstrum coefficient (Mel Frequency Cepstrum Coefficient, MFCC) is a spectral feature calculated by exploiting the nonlinear correspondence between mel frequency and frequency in Hz. The mel frequency scale was proposed based on the auditory characteristics of the human ear.
The zero-crossing rate (ZCR) is the rate of sign changes of a signal, for example, the signal changing from positive to negative or back. This feature is widely used in speech comparison, speech recognition, and music information retrieval, and is a main feature for classifying percussive sounds.
Short-time energy is the speech energy calculated over a short time span. The short time span here generally refers to one frame; that is, the speech energy within one frame is the short-time energy.
The short-time autocorrelation function is the result of intercepting a segment of the signal with a short-time window near the N-th sample point and performing an autocorrelation calculation on it. Since the speech signal is non-stationary, the short-time autocorrelation function is used when processing it.
The short-time average amplitude difference may be used for pitch period detection.
The spectrogram is a speech spectrum image, generally obtained by processing the received time-domain signal. Its abscissa is time, its ordinate is frequency, and the value at each coordinate point is the speech energy. Since a two-dimensional plane is used to express three-dimensional information, the magnitude of the energy value is expressed by color: the darker the color, the stronger the speech energy at that point.
Spectral entropy describes the relationship between the power spectrum and the entropy rate.
The fundamental frequency determines the pitch of the whole tone. In sound, the fundamental frequency is the frequency of the fundamental tone in a complex tone. Among the tones constituting a complex tone, the fundamental tone has the lowest frequency and the greatest intensity, and its level determines the pitch of the tone. The frequency of speech usually refers to the frequency of the fundamental tone.
Formants: when quasi-periodic pulses from the glottis excite the vocal tract, the vocal tract's resonance characteristics produce a set of resonance frequencies, referred to as formant frequencies, or formants for short.
In this embodiment, $x_j$ may be any feature capable of affecting the score of the audio. For example, if the at least one feature includes the mel frequency cepstrum coefficient, zero-crossing rate, short-time energy, short-time autocorrelation function, short-time average amplitude difference, spectrogram, spectral entropy, fundamental frequency, and formants, then the feature vector X has 9 features, i.e., n = 9, and may be denoted as $X = (x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9)$. Of course, the feature vector X may also include other features related to the score of the audio.
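To make the feature vector concrete, the following is a minimal Python sketch that assembles a subset of the listed features into one vector. The use of librosa, the sampling rate, the yin pitch range, and the mean aggregation over frames are all assumptions made here for illustration; the patent does not prescribe an extraction library or an aggregation.

```python
import librosa
import numpy as np

def extract_features(path, sr=16000):
    """Assemble a per-audio feature vector X = (x_1, ..., x_n) from a
    subset of the features described above."""
    y, sr = librosa.load(path, sr=sr)

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # mel frequency cepstrum coefficients
    zcr = librosa.feature.zero_crossing_rate(y)           # zero-crossing rate per frame
    energy = librosa.feature.rms(y=y) ** 2                # short-time energy per frame
    f0 = librosa.yin(y, fmin=65.0, fmax=600.0, sr=sr)     # fundamental frequency per frame

    # Spectral entropy: the entropy of the normalized power spectrum of each frame.
    power = np.abs(librosa.stft(y)) ** 2
    p = power / (power.sum(axis=0, keepdims=True) + 1e-12)
    spectral_entropy = -(p * np.log2(p + 1e-12)).sum(axis=0)

    # Collapse the frame-level features into per-audio statistics (the mean here).
    return np.concatenate([
        mfcc.mean(axis=1),
        [zcr.mean(), energy.mean(), f0.mean(), spectral_entropy.mean()],
    ])
```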
Step S3000, a mapping function between the feature vector and the score is obtained.
The independent variable of the mapping function F (X) is the feature vector X, and the dependent variable F (X) is the prediction score determined by the feature vector X.
In one embodiment of the present invention, the obtaining of the mapping function between the feature vector and the score includes steps S3100 to S3200 as follows:
step S3100, a training sample is acquired.
Wherein each training sample is audio and is marked with a corresponding actual score.
In one embodiment of the present invention, each training sample may be manually scored.
In one embodiment of the present invention, obtaining training samples includes steps S3110-S3160 as shown in fig. 4:
In step S3110, at least one initial audio is acquired.
Wherein each initial audio is marked as a corresponding actual score.
In one embodiment of the invention, the initial audio may be generated by a plurality of users. Specifically, each user may record his or her own voice through the play-and-sing tool in the respective client. For the manner of generating the initial audio, reference may be made to the manner of generating the target audio, which is not repeated here.
The actual score of the initial audio may be obtained by a manual scoring by a background operator.
In one example, the actual score of each piece of initial audio may be a specific score, or may be one of a first score and a second score that are used to distinguish premium audio from non-premium audio. The first score and the second score may be values set according to the application scenario or specific requirements; for example, the first score may be 1 and the second score may be 0.
Step S3120, taking initial audio whose actual score is the specified score as reference audio.
The specified score may be set in advance according to the application scenario or specific requirements. In embodiments where the actual score includes a first score for indicating that the corresponding audio is premium audio and a second score for indicating that the corresponding audio is non-premium audio, the specified score may be, for example, the first score or the second score.
In step S3130, the reference user is determined from the reference audio.
In this embodiment, the reference user determined from the reference audio can be expected to generate other audio whose actual score is also the specified score.
In one embodiment of the present invention, determining the reference user from the reference audio may include steps S3131-a to S3134-a as follows:
in step S3131-a, the user who generated each reference audio is determined as the target user.
Step S3132-a, for each target user, determining a first number of generated reference audio and a second number of generated initial audio.
Specifically, the number of reference audio generated by each target user may be used as the first number of corresponding target users, and the number of initial audio generated by each target user may be used as the second number of corresponding target users.
Step S3133-a, for each target user, determining a ratio of the first number to the second number.
Specifically, the ratio of the first number and the second number of each target user may be calculated as the ratio of the corresponding target users.
And S3134-a, selecting a reference user from target users according to the ratio.
In one embodiment of the present invention, a target user whose ratio exceeds a preset first threshold may be selected as the reference user. The first threshold may be set in advance according to an application scenario or specific requirements, and may be, for example, but not limited to, 90%.
In another embodiment of the present invention, all the target users may be sorted in ascending or descending order by the ratio, obtaining a ranking value for each target user, and the target users whose ranking values fall within a set range are selected as reference users. The set range may be set according to the sorting manner (ascending or descending), the application scenario, or specific requirements. For example, if the sorting is descending, the set range may be 1 to 3, and then the 3 target users with the largest ratios are selected as reference users.
In another embodiment of the present invention, determining the reference user from the reference audio includes steps S3131-b to S3133-b as follows:
in step S3131-b, the user who generated each reference audio is determined as the target user.
Step S3132-b, for each target user, a first number of generated reference audio is determined.
Step S3133-b, selecting a reference user from the target users according to the first quantity.
In one embodiment of the present invention, target users whose first number exceeds a preset second threshold may be selected as reference users. The second threshold may be set in advance according to the application scenario or specific requirements, and may be, for example, but not limited to, 90.
In another embodiment of the present invention, all the target users may be sorted in ascending or descending order by the first number, obtaining a ranking value for each target user, and the target users whose ranking values fall within a set range are selected as reference users. The set range may be set according to the sorting manner (ascending or descending), the application scenario, or specific requirements. For example, if the sorting is descending, the set range may be 1 to 3, and then the 3 target users with the largest first number are selected as reference users.
In step S3140, other audio generated by the reference user is acquired as extension audio.
In the present embodiment, the other audio may be audio other than the initial audio.
In step S3150, the expanded audio is marked as a specified score.
In step S3160, the extension audio and the initial audio are used as training samples.
For the reference user selected in step S3130, the actual score of other audio generated by that user may be considered to be the specified score. Thus, the extension audio may be marked directly with the specified score, and both the marked initial audio and the marked extension audio may be used as training samples.
In this embodiment, in order to reduce the cost of manual marking, the number of samples is expanded by selecting reference users whose generated audio can be taken to have the specified score as its actual score, acquiring the other audio they generated as extension audio, marking it with the specified score, and using the marked extension audio as training samples.
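Read end to end, steps S3110 to S3160 with the ratio-based selection of steps S3131-a to S3134-a can be sketched as follows. The data layout and the fetch_user_audios helper are hypothetical, and the 90% threshold is the example value given above:

```python
from collections import defaultdict

def expand_training_samples(initial_audios, specified_score,
                            fetch_user_audios, ratio_threshold=0.9):
    """initial_audios: list of (user_id, audio_id, actual_score) tuples.
    fetch_user_audios(user_id): hypothetical helper returning the ids of the
    user's other audio, i.e. audio not among the initial audio."""
    first_number = defaultdict(int)    # reference audio generated per target user
    second_number = defaultdict(int)   # initial audio generated per target user
    for user_id, _, score in initial_audios:
        second_number[user_id] += 1
        if score == specified_score:   # S3120: this initial audio is reference audio
            first_number[user_id] += 1

    # S3134-a: a target user becomes a reference user when the ratio exceeds the threshold.
    reference_users = [u for u, n in first_number.items()
                       if n / second_number[u] > ratio_threshold]

    # S3140 to S3160: mark the reference users' other audio with the specified score.
    samples = [(audio_id, score) for _, audio_id, score in initial_audios]
    for user_id in reference_users:
        for audio_id in fetch_user_audios(user_id):
            samples.append((audio_id, specified_score))   # marked extension audio
    return samples
```

The count-based variant of steps S3131-b to S3133-b differs only in selecting the reference users by first_number alone.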
In step S3200, training is performed to obtain a mapping function according to the vector value and the actual score of the feature vector of the training sample.
In one embodiment of the present invention, steps S3100 to S3200 of training the mapping function may be performed according to a preset training period. The training period may be set according to the specific application scenario or application requirements, for example, to 1 day.
In this embodiment, the mapping function F (x) may be obtained by various fitting means based on the vector value of the feature vector of the training sample and the actual score corresponding to the training sample, for example, the mapping function F (x) may be obtained by using an arbitrary multiple linear regression model, which is not limited herein.
In one example, the multiple linear regression model may be a simple polynomial function embodying the mapping function F (x), in which the coefficients of each order are unknown; these coefficients may be determined by substituting the vector values of the feature vectors of the training samples and the corresponding actual scores into the polynomial function, thereby obtaining the mapping function F (x).
In another example, various regression or classification models may be used, taking the vector values of the feature vectors of the training samples and the corresponding actual scores as ground-truth samples for multiple rounds of training; each round learns the residuals left after the previous round's fit, and after iterating for T rounds the residuals can be controlled to very low values, so that the finally obtained mapping function F (x) has very high accuracy. The classification model is, for example, SVM, GBDT, or CNN, and is not limited herein.
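Both embodiments can be sketched with scikit-learn; the library choice and the placeholder data are assumptions made here, not part of the patent:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor

# X_train: vector values of the training samples' feature vectors, shape (m, n);
# y_train: the corresponding actual scores, shape (m,). Placeholder data below.
rng = np.random.default_rng(0)
X_train = rng.random((200, 9))
y_train = rng.random(200)

# Multiple linear regression: fits the constant weight b and the feature weights a_j.
linear_model = LinearRegression().fit(X_train, y_train)

# GBDT-style residual fitting: each round learns the residuals left by the
# previous round's fit, driving the residuals toward very low values.
gbdt_model = GradientBoostingRegressor(n_estimators=100).fit(X_train, y_train)
```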
In one embodiment of the present invention, according to the vector value and the actual score of the feature vector of the training sample, the training and obtaining the mapping function may include steps S3210 to S3230 as follows:
step S3210, determining a scoring prediction expression of each training sample according to the vector value of the feature vector of each training sample by using the undetermined coefficient of the mapping function as a variable.
Assume that the feature vector X in the mapping function includes n features $x_1, x_2, \ldots, x_n$, and that the vector value of the k-th training sample over these n features is $(x_1^{(k)}, x_2^{(k)}, \ldots, x_n^{(k)})$. Then, taking the constant weight $b$ and the n feature weights $a_1, a_2, \ldots, a_n$ included in the set of undetermined coefficients as variables, the scoring prediction expression of the k-th training sample can be obtained as $Y_k$:

$$Y_k = b + a_1 x_1^{(k)} + a_2 x_2^{(k)} + \cdots + a_n x_n^{(k)}$$
Step S3220, constructing a loss function according to the scoring prediction expression of each training sample and the actual score of each training sample.
In one embodiment of the present invention, constructing the loss function according to the scoring prediction expression of each training sample and the actual score of each training sample may include steps S3221 to S3222 as follows:
in step S3221, for each training sample, a corresponding loss expression is determined according to the scoring prediction expression and the actual score.
Assume that the number of collected training samples is m. For the k-th training sample, the marked actual score is $y_k$ and the scoring prediction expression is $Y_k$; the corresponding loss expression is then $(Y_k - y_k)^2$, $k = 1, \ldots, m$, where $Y_k$ is as given in step S3210.
In step S3222, the loss expressions of all the training samples are summed to obtain the loss function.
In this embodiment, the loss function may be:

$$L(b, a_1, a_2, \ldots, a_n) = \sum_{k=1}^{m} (Y_k - y_k)^2$$

where $Y_k = b + a_1 x_1^{(k)} + \cdots + a_n x_n^{(k)}$ is the scoring prediction expression of the k-th training sample and $y_k$ is its actual score.
Step S3230, determining the undetermined coefficients according to the loss function, completing the training of the mapping function.
In one embodiment of the present invention, determining the undetermined coefficients according to the loss function and completing the training of the mapping function may further include steps S3231 to S3233 as follows:
Step S3231, setting the initial values of the constant weight and each feature weight in the set of undetermined coefficients to random numbers within a preset numerical range.
Assume the set of undetermined coefficients $\{b, a_1, a_2, \ldots, a_n\}$ includes a constant weight b and n feature weights $a_1, a_2, \ldots, a_n$; their initial values can be set to random numbers in a preset numerical range. The preset range may be set according to the application scenario or application requirements; for example, if the preset range is 0 to 1, the initial values of the constant weight b and the n feature weights $a_1, a_2, \ldots, a_n$ are random numbers between 0 and 1.
Step S3232, substituting the constant weight and each feature weight, with their initial values set, into the loss function, and performing iterative processing.
In this embodiment, step S3232 of substituting the constant weight and each feature weight into the loss function and iterating may further include the following steps S3232-1 to S3232-2:
Step S3232-1, for the constant weight and each feature weight, obtaining its value after this iteration according to its value before this iteration, the convergence parameter, and the loss function with the pre-iteration set of undetermined coefficients substituted in.
The convergence parameter is a parameter controlling the convergence speed of the iterative process, and may be set according to the application scenario or application requirements, for example, to 0.01.
Step S3232-2, obtaining the set of undetermined coefficients after this iteration from the post-iteration values of the constant weight and each feature weight.
Assuming this is the (k+1)-th iteration (k starts at 0 and increases by 1 with each iteration), the set of undetermined coefficients after this iteration is $\{b, a_1, a_2, \ldots, a_n\}^{(k+1)}$.
Step S3233, when the set of undetermined coefficients obtained by the iterative process satisfies the convergence condition, terminating the iterative process and taking the current values of the constant weight and each feature weight as the determined set of undetermined coefficients; otherwise, continuing the iterative process.
The convergence condition may be set according to the specific application scenario or application requirements.
For example, the convergence condition is that the number of iterative processes is greater than a preset count threshold. The count threshold may be set according to engineering experience or experimental simulation results, for example, to 300. Correspondingly, assuming the number of iterative processes is k+1 and the count threshold is itemNums, the corresponding convergence condition is: $k \geq itemNums$.
For another example, the convergence condition is that the iteration result value of the set of undetermined coefficients obtained by the iterative process is smaller than a preset result threshold. The iteration result value is determined from the partial derivatives of the loss function, with the iterated set of undetermined coefficients substituted in, with respect to the constant weight and each feature weight.
In one example, the convergence condition may be that either of the two example conditions above is satisfied; the specific conditions are as described above and are not repeated here.
If the set of undetermined coefficients $\{b, a_1, a_2, \ldots, a_n\}^{(k+1)}$ obtained by the (k+1)-th iterative process satisfies the convergence condition, the iterative process terminates, yielding the values of all $a_i^{(k+1)}$ ($i = 1, \ldots, n$) and $b^{(k+1)}$; otherwise, the iterative process continues until the set of undetermined coefficients satisfies the convergence condition.
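Read as plain gradient descent on the squared loss above, steps S3231 to S3233 can be sketched in numpy as follows. The gradient expressions are reconstructed from the loss function rather than quoted from the patent, and lr = 0.01, item_nums = 300, and the gradient-norm result threshold reuse the example values mentioned above:

```python
import numpy as np

def fit_mapping_function(X, y, lr=0.01, item_nums=300, result_threshold=1e-6):
    """X: (m, n) feature vectors of the training samples; y: (m,) actual scores."""
    n = X.shape[1]
    rng = np.random.default_rng()
    a = rng.random(n)                   # S3231: feature weights a_1..a_n start in [0, 1)
    b = rng.random()                    # S3231: constant weight b starts in [0, 1)
    for k in range(item_nums):          # first convergence condition: iteration count
        residual = X @ a + b - y        # Y_k - y_k for every training sample
        grad_a = 2.0 * (X.T @ residual) # partial derivatives of the loss w.r.t. each a_j
        grad_b = 2.0 * residual.sum()   # partial derivative of the loss w.r.t. b
        a -= lr * grad_a                # S3232-1: update using the convergence parameter
        b -= lr * grad_b
        # Second convergence condition: iteration result value below the result threshold.
        if np.sqrt((grad_a ** 2).sum() + grad_b ** 2) < result_threshold:
            break
    return b, a                         # the determined set of undetermined coefficients
```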
According to the embodiment of the invention, the mapping function can be obtained through training according to a large number of training samples, so that the accuracy of the obtained prediction score can be improved when the score of the prediction audio is determined by using the mapping function.
Step S4000, obtaining the prediction score of the target audio according to the mapping function and the vector value of the feature vector of the target audio.
The vector value may specifically be a value of a feature vector of the target audio.
In this embodiment, the mapping function between the feature vector and the score is obtained in step S3000; substituting the vector value of the feature vector of the target audio into the mapping function F (x) yields the prediction score of the target audio.
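In terms of the earlier sketches, this step reduces to a single call; extract_features and linear_model are the hypothetical names introduced above:

```python
x_target = extract_features("target_audio.wav")  # vector value of the target audio's feature vector
prediction_score = float(linear_model.predict(x_target.reshape(1, -1))[0])
```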
According to the embodiment of the invention, the prediction score of the target audio can be automatically obtained without manual scoring, and the labor cost can be reduced. In addition, the mapping function is obtained through training according to a large number of training samples, so that when the mapping function is used for determining the scores of the predicted target audio, the accuracy of the obtained predicted scores can be improved, and the results of the predicted scores can be more objective.
In embodiments where the actual score of the training sample includes a first score for representing that the corresponding audio is premium audio and a second score for representing that the corresponding audio is non-premium audio, the output of the mapping function may be a probability that the target audio is premium audio.
In one example of the present invention, the probability output by the mapping function may be directly used as a predictive score for the target audio. For example, the mapping function may output a result of 0.23, and then the prediction score for the target audio may be 0.23.
In another example of the present invention, the probability output by the mapping function may be normalized according to the first score and the second score, and then multiplied by 100 to obtain the prediction score of the target audio. For example, the first score is 1, the second score is 0, the output of the mapping function is 0.89, and then the prediction score of the target audio may be 89. For another example, the first score is 2 and the second score is 1, the output of the mapping function is 1.96, and then the prediction score of the target audio may be 96.
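A one-line sketch of this normalization, reproducing the two worked examples above (the function and argument names are mine):

```python
def normalized_score(p, first_score=1.0, second_score=0.0):
    # Normalize the mapping function's output by the two marker scores, then scale to 0-100.
    return (p - second_score) / (first_score - second_score) * 100.0

assert round(normalized_score(0.89)) == 89
assert round(normalized_score(1.96, first_score=2.0, second_score=1.0)) == 96
```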
In one embodiment of the present invention, the method may further comprise: providing the prediction score to the client that generated the target audio for presentation, so that the user who generated the target audio can view it.
In one embodiment of the present invention, the method may further comprise:
Determining whether the target audio is high-quality audio according to the prediction score; in the case where the target audio is high-quality audio, adding the target audio to a recommendation list.
The audio in the recommendation list may be provided to each user's client in a preset manner. The preset manner may be set in advance according to the application scenario or specific requirements, and may be, for example, at least one of: by score from high to low, by play count from high to low, according to each user's preferences, or at random.
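A minimal sketch of this gating step; the score cut-off is an assumption, since the patent does not state how a prediction score maps to high-quality audio:

```python
HIGH_QUALITY_THRESHOLD = 80.0   # assumed cut-off on a 0-100 prediction score

def maybe_recommend(audio_id, prediction_score, recommendation_list):
    # Add the target audio to the recommendation list only if it is high-quality audio.
    if prediction_score >= HIGH_QUALITY_THRESHOLD:
        recommendation_list.append(audio_id)
```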
In one embodiment of the present invention, in the case where the target audio is premium audio, the target audio may be taken as exemplary audio of the corresponding song for other users to learn.
In one embodiment of the present invention, the method may further comprise:
acquiring an actual score of a target audio;
Taking the target audio as a new training sample, and marking the new training sample according to the actual score;
And correcting the mapping function according to the vector value of the feature vector of the new training sample and the actual score of the new training sample.
In this embodiment, the actual score of the target audio may be obtained by manually scoring the target audio by a background operator.
According to the embodiment of the present invention, the target audio marked with its corresponding actual score is used as a new training sample to revise the mapping function; that is, new training samples are continually added and the mapping function is retrained, so that the score prediction results of the mapping function become more and more accurate.
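Continuing the scikit-learn sketch, the correction step can be read as appending the newly marked sample and refitting on the preset training period; a sketch under those assumptions, not the patent's prescribed implementation:

```python
import numpy as np

def correct_mapping_function(model, X_train, y_train, x_new, actual_score):
    """Append the manually scored target audio as a new training sample and retrain."""
    X_train = np.vstack([X_train, x_new.reshape(1, -1)])   # the new sample's feature vector
    y_train = np.append(y_train, actual_score)             # its marked actual score
    model.fit(X_train, y_train)                            # retrain the mapping function
    return model, X_train, y_train
```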
< Device example >
In the present embodiment, an audio processing apparatus 5000 is provided, which includes an audio acquisition module 5100, a feature acquisition module 5200, a function acquisition module 5300, and a score prediction module 5400, as shown in fig. 5. The audio acquisition module 5100 is configured to acquire target audio to be processed; the feature acquisition module 5200 is configured to acquire a selected feature vector, wherein the feature vector includes at least one feature that affects a score of the audio; the function obtaining module 5300 is configured to obtain a mapping function between the feature vector and the score; the score prediction module 5400 is used for obtaining a prediction score of the target audio according to the mapping function and vector values of feature vectors of the target audio.
In one embodiment of the invention, at least one feature comprises: at least one of mel frequency cepstrum coefficient, zero crossing rate, short-time energy, short-time autocorrelation function, short-time average amplitude difference, spectrogram, spectral entropy, fundamental frequency and formants.
In one embodiment of the present invention, the function obtaining module 5300 may be configured to:
obtaining training samples, wherein each training sample is audio and marked as corresponding actual scores;
and training to obtain a mapping function according to the vector value and the actual score of the feature vector of the training sample.
In one embodiment of the invention, obtaining training samples comprises:
acquiring at least one initial audio, wherein each initial audio is marked with a corresponding actual score;
taking initial audio whose actual score is the specified score as reference audio;
determining a reference user according to the reference audio;
acquiring other audio generated by the reference user as extension audio;
marking the extension audio with the specified score;
and taking the marked extension audio and the initial audio as training samples.
In one embodiment of the invention, determining the reference user according to the reference audio comprises:
determining a user generating each reference audio as a target user;
for each target user, determining a first number of generated reference audio and a second number of generated initial audio;
for each target user, determining a ratio of the first number and the second number;
and selecting a reference user from the target users according to the ratio.
In one embodiment of the invention, determining the reference user according to the reference audio comprises:
determining a user generating each reference audio as a target user;
determining, for each target user, a first number of generated reference audio;
And selecting a reference user from the target users according to the first quantity.
In one embodiment of the present invention, training to obtain the mapping function based on vector values and actual scores of feature vectors of training samples includes:
taking the undetermined coefficients of the mapping function as variables, and determining the scoring prediction expression of each training sample according to the vector value of the feature vector of each training sample;
constructing a loss function according to the scoring prediction expression of each training sample and the actual score of each training sample;
and determining the undetermined coefficients according to the loss function, completing the training of the mapping function.
In one embodiment of the invention, constructing the loss function based on the scoring predictive expression for each training sample and the actual score for each training sample comprises:
for each training sample, determining a corresponding loss expression according to the scoring prediction expression and the actual score;
the loss expressions for each training sample are summed to obtain a loss function.
In one embodiment of the present invention, the audio processing apparatus 5000 may further include:
A module for obtaining an actual score of the target audio;
A module for taking the target audio as a new training sample and marking the new training sample according to the actual score;
And a module for correcting the mapping function according to the vector value of the feature vector of the new training sample and the actual score of the new training sample.
In one embodiment of the present invention, the audio processing apparatus 5000 may further include:
a module for executing the step of training the mapping function according to a preset training period.
In one embodiment of the present invention, the audio processing apparatus 5000 may further include:
a module for providing the prediction score of the target audio to the client that generated the target audio, for presentation.
In one embodiment of the present invention, the audio processing apparatus 5000 may further include:
a module for determining whether the target audio is high-quality audio according to the prediction score;
and a module for adding the target audio to a recommendation list in the case that the target audio is high-quality audio.
Those skilled in the art will appreciate that the audio processing apparatus 5000 may be implemented in a variety of ways. For example, the audio processing apparatus 5000 may be implemented by configuring a processor with instructions: the instructions may be stored in a ROM and, when the device starts, read from the ROM into a programmable device to implement the audio processing apparatus 5000. As another example, the audio processing apparatus 5000 may be solidified into a dedicated device (e.g., an ASIC). The audio processing apparatus 5000 may be divided into mutually independent units, or these units may be combined and implemented together. The audio processing apparatus 5000 may be implemented by one of the above implementations, or by a combination of two or more of them.
In this embodiment, the audio processing device 5000 may have various implementation forms, for example, the audio processing device 5000 may be any functional module running in a software product or an application program that provides an audio processing service, or a peripheral embedded part, a plug-in part, a patch part, or the like of the software product or the application program, or may be the software product or the application program itself.
< Electronic device >
In the present embodiment, an electronic apparatus 6000 is also provided. The electronic device 6000 may be the server 1100 shown in fig. 1a or the terminal device 1200 shown in fig. 1 b.
In one aspect, the electronic device 6000 may include the aforementioned audio processing apparatus 5000 for implementing the audio processing method according to any embodiment of the present invention.
In another aspect, as shown in fig. 6, the electronic device 6000 may further include a processor 6100 and a memory 6200, the memory 6200 being used for storing executable instructions, and the processor 6100 being configured to control, according to the instructions, the electronic device 6000 to perform the audio processing method according to any embodiment of the present invention.
In this embodiment, the electronic device 6000 may be a terminal device such as a smart speaker, an earphone, a mobile phone, a tablet computer, a palmtop computer, a desktop computer, or a notebook computer, or may be a server. For example, the electronic device 6000 may be any electronic product having an audio processing function.
< Computer-readable storage Medium >
In this embodiment, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements an audio processing method as in any of the embodiments of the present invention.
The present invention may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present invention.
The computer-readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanical encoding device such as a punch card or a raised-groove structure having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber-optic cables), or electrical signals transmitted through wires.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to the respective computing/processing devices, or to an external computer or external storage device over a network such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network interface card or network interface in each computing/processing device receives the computer readable program instructions from the network and forwards them for storage in a computer readable storage medium within the respective computing/processing device.
Computer program instructions for carrying out operations of the present invention may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the remote-computer case, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA), with state information of the computer readable program instructions, which the circuitry can then execute.
Various aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, implementation by software, and implementation by a combination of software and hardware are all equivalent.
The foregoing description of the embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or technical improvements over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims (11)

1. An audio processing method, comprising:
acquiring target audio to be processed;
obtaining a selected feature vector, wherein the feature vector includes at least one feature that affects a score of audio;
obtaining a mapping function between the feature vector and the score;
obtaining a prediction score of the target audio according to the mapping function and the vector values of the feature vector of the target audio;
The obtaining a mapping function between the feature vector and the score includes:
obtaining training samples, wherein each training sample is an audio item labeled with a corresponding actual score;
training to obtain the mapping function according to the vector values of the feature vectors of the training samples and the actual scores of the training samples;
The obtaining training samples includes:
acquiring at least one initial audio, wherein each initial audio is labeled with a corresponding actual score;
taking, as reference audio, initial audio whose actual score is a specified score;
determining a reference user according to the reference audio;
acquiring other audio generated by the reference user as extension audio;
marking the extension audio with the specified score;
taking the marked extension audio and the initial audio as the training samples;
The determining a reference user according to the reference audio includes:
determining the user who generated each reference audio as a target user;
determining, for each target user, a first number of reference audio generated and a second number of initial audio generated;
determining, for each target user, the ratio of the first number to the second number;
selecting target users whose ratio exceeds a preset first threshold as the reference users; or sorting the target users by the ratio, obtaining a ranking value for each target user, and selecting target users whose ranking values fall within a set range as the reference users.
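For illustration only (not part of the claims), the training-sample expansion recited above can be sketched in Python as follows. Every name here — the Audio record, the fetch_other_audios callback, and the constants SPECIFIED_SCORE and RATIO_THRESHOLD — is a hypothetical choice, since the claim prescribes the procedure but no implementation.

from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Audio:
    user_id: str
    features: list        # vector values of the selected feature vector
    actual_score: float   # labeled actual score

SPECIFIED_SCORE = 5.0     # the "specified score" marking reference audio (assumed value)
RATIO_THRESHOLD = 0.5     # the "preset first threshold" (assumed value)

def expand_training_samples(initial_audios, fetch_other_audios):
    # Reference audio: initial audio whose actual score is the specified score.
    reference = [a for a in initial_audios if a.actual_score == SPECIFIED_SCORE]

    # For each target user, count reference audio generated (first number)
    # and initial audio generated (second number).
    first, second = defaultdict(int), defaultdict(int)
    for a in initial_audios:
        second[a.user_id] += 1
    for a in reference:
        first[a.user_id] += 1

    # Reference users: ratio of the first number to the second number
    # exceeds the preset first threshold.
    reference_users = {u for u in first if first[u] / second[u] > RATIO_THRESHOLD}

    # Extension audio: other audio generated by reference users,
    # marked with the specified score.
    extension = []
    for user in reference_users:
        for audio in fetch_other_audios(user):   # hypothetical data-access callback
            audio.actual_score = SPECIFIED_SCORE
            extension.append(audio)

    # The marked extension audio plus the initial audio form the training samples.
    return initial_audios + extension

The claim's alternative branch — ranking users by the ratio and keeping those within a set range — would replace the threshold test with a sort over the per-user ratios.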
2. The method of claim 1, wherein the at least one feature comprises at least one of: a Mel frequency cepstrum coefficient, a zero crossing rate, short-time energy, a short-time autocorrelation function, a short-time average amplitude difference, a spectrogram, spectral entropy, a fundamental frequency, and formants.
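As a sketch of how such a feature vector might be assembled, the snippet below extracts a subset of the listed features with the librosa library. librosa is merely one possible toolkit (the patent names none), short-time energy is approximated by RMS, and the spectral-entropy formula shown is one common formulation rather than a prescribed one.

import numpy as np
import librosa

def extract_feature_vector(path):
    y, sr = librosa.load(path, sr=None)

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)  # Mel frequency cepstrum coefficients
    zcr = librosa.feature.zero_crossing_rate(y).mean()               # zero crossing rate
    energy = librosa.feature.rms(y=y).mean()                         # short-time energy (RMS proxy)
    f0 = float(np.mean(librosa.yin(y, fmin=80, fmax=400, sr=sr)))    # fundamental frequency

    # Spectral entropy from the normalized power spectrogram.
    power = np.abs(librosa.stft(y)) ** 2
    p = power / (power.sum(axis=0, keepdims=True) + 1e-12)
    spectral_entropy = float((-(p * np.log2(p + 1e-12)).sum(axis=0)).mean())

    return np.concatenate([mfcc, [zcr, energy, f0, spectral_entropy]])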
3. The method of claim 1, wherein the training to obtain the mapping function according to the vector values of the feature vectors of the training samples and the actual scores of the training samples comprises:
taking the undetermined coefficients of the mapping function as variables, and determining a scoring prediction expression for each training sample according to the vector values of its feature vector;
constructing a loss function according to the scoring prediction expression and the actual score of each training sample;
determining the undetermined coefficients according to the loss function, thereby completing the training of the mapping function.
4. The method of claim 3, wherein the constructing a loss function according to the scoring prediction expression and the actual score of each training sample comprises:
for each training sample, determining a corresponding loss expression according to its scoring prediction expression and actual score;
summing the loss expressions of all the training samples to obtain the loss function.
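Assuming a linear mapping function score = w·x + b, with w and b as the undetermined coefficients, and a squared-error loss summed over the training samples (one plausible reading — claims 3 and 4 fix neither the functional form nor the loss), the training step collapses to a least-squares fit:

import numpy as np

def train_mapping(X, y):
    # X: (n_samples, n_features) vector values of the feature vectors;
    # y: (n_samples,) actual scores of the training samples.
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # bias column so b is fitted with w
    # Loss(w, b) = sum_i (w . x_i + b - y_i)^2, minimized in closed form.
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return coef[:-1], coef[-1]                      # w, b

def predict_score(w, b, x):
    # The scoring prediction expression evaluated for a feature vector x.
    return float(np.dot(w, x) + b)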
5. The method of claim 1, further comprising:
acquiring an actual score of the target audio;
taking the target audio as a new training sample, and marking the new training sample according to the actual score;
and correcting the mapping function according to the vector value of the feature vector of the new training sample and the actual score of the new training sample.
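One straightforward realization of this correction step — again a sketch, reusing the hypothetical train_mapping from the previous example — appends the newly scored target audio to the sample pool and refits:

import numpy as np

def correct_mapping(X, y, x_new, actual_score_new):
    # Add the target audio as a new labeled training sample, then retrain.
    X2 = np.vstack([X, x_new])
    y2 = np.append(y, actual_score_new)
    return train_mapping(X2, y2)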
6. The method of claim 1, further comprising:
executing the step of training the mapping function according to a preset training period.
7. The method of claim 1, further comprising:
providing the prediction score of the target audio to a client that generated the target audio, for presentation.
8. The method of claim 1, further comprising:
determining, according to the prediction score, whether the target audio is high-quality audio;
and adding the target audio to a recommendation list in the case that the target audio is high-quality audio.
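Because the patent leaves the criterion for "high-quality audio" open, the sketch below simply thresholds the prediction score; QUALITY_THRESHOLD is an assumed value, not one taken from the claims.

QUALITY_THRESHOLD = 4.5   # assumed cutoff for "high-quality audio"

def maybe_recommend(audio_id, prediction_score, recommendation_list):
    # Add the target audio to the recommendation list when it is high-quality.
    if prediction_score >= QUALITY_THRESHOLD:
        recommendation_list.append(audio_id)
    return recommendation_list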
9. An audio processing apparatus, comprising:
an audio acquisition module for acquiring target audio to be processed;
a feature acquisition module for acquiring a selected feature vector, wherein the feature vector includes at least one feature that affects a score of audio;
a function acquisition module for acquiring a mapping function between the feature vector and the score;
a scoring prediction module for obtaining a prediction score of the target audio according to the mapping function and the vector values of the feature vector of the target audio;
The obtaining a mapping function between the feature vector and the score includes:
obtaining training samples, wherein each training sample is an audio item labeled with a corresponding actual score;
training to obtain the mapping function according to the vector values of the feature vectors of the training samples and the actual scores of the training samples;
The obtaining training samples includes:
acquiring at least one initial audio, wherein each initial audio is labeled with a corresponding actual score;
taking, as reference audio, initial audio whose actual score is a specified score;
determining a reference user according to the reference audio;
acquiring other audio generated by the reference user as extension audio;
marking the extension audio with the specified score;
taking the marked extension audio and the initial audio as the training samples;
The determining a reference user according to the reference audio includes:
determining the user who generated each reference audio as a target user;
determining, for each target user, a first number of reference audio generated and a second number of initial audio generated;
determining, for each target user, the ratio of the first number to the second number;
selecting target users whose ratio exceeds a preset first threshold as the reference users; or sorting the target users by the ratio, obtaining a ranking value for each target user, and selecting target users whose ranking values fall within a set range as the reference users.
10. An electronic device, comprising:
the apparatus of claim 9; or
a processor and a memory, the memory being configured to store instructions for controlling the processor to perform the method according to any one of claims 1 to 8.
11. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of claims 1 to 8.
CN202010364644.4A 2020-04-30 2020-04-30 Audio processing method and device and electronic equipment Active CN113593607B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010364644.4A CN113593607B (en) 2020-04-30 2020-04-30 Audio processing method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010364644.4A CN113593607B (en) 2020-04-30 2020-04-30 Audio processing method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN113593607A CN113593607A (en) 2021-11-02
CN113593607B true CN113593607B (en) 2024-07-30

Family

ID=78237265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010364644.4A Active CN113593607B (en) 2020-04-30 2020-04-30 Audio processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN113593607B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106558308A (en) * 2016-12-02 2017-04-05 深圳撒哈拉数据科技有限公司 A kind of internet audio quality of data auto-scoring system and method

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102664018B (en) * 2012-04-26 2014-01-08 杭州来同科技有限公司 Singing scoring method with radial basis function-based statistical model
EP3374880A1 (en) * 2015-11-12 2018-09-19 Semantic Machines, Inc. Interaction assistant
CN107958673B (en) * 2017-11-28 2021-05-11 北京先声教育科技有限公司 Spoken language scoring method and device
CN108520303A * 2018-03-02 2018-09-11 阿里巴巴集团控股有限公司 A recommendation system construction method and device
US11514333B2 (en) * 2018-04-30 2022-11-29 Meta Platforms, Inc. Combining machine-learning and social data to generate personalized recommendations
CN109308913A (en) * 2018-08-02 2019-02-05 平安科技(深圳)有限公司 Sound quality evaluation method, device, computer equipment and storage medium
CN110880329B (en) * 2018-09-06 2022-11-04 腾讯科技(深圳)有限公司 Audio identification method and equipment and storage medium
CN109344794B (en) * 2018-10-19 2022-04-19 深圳市微蓝智能科技有限公司 Piano playing scoring method and device and computer storage medium
CN109658918A * 2018-12-03 2019-04-19 广东外语外贸大学 An intelligent scoring method and system for spoken-English retelling questions
CN109920449B (en) * 2019-03-18 2022-03-04 广州市百果园网络科技有限公司 Beat analysis method, audio processing method, device, equipment and medium
CN110598756A (en) * 2019-08-22 2019-12-20 腾讯音乐娱乐科技(深圳)有限公司 Model training method and device and storage medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106558308A (en) * 2016-12-02 2017-04-05 深圳撒哈拉数据科技有限公司 A kind of internet audio quality of data auto-scoring system and method

Also Published As

Publication number Publication date
CN113593607A (en) 2021-11-02

Similar Documents

Publication Publication Date Title
CN109377539B (en) Method and apparatus for generating animation
CN105489221B An audio recognition method and device
US12027165B2 (en) Computer program, server, terminal, and speech signal processing method
CN112309365B (en) Training method and device of speech synthesis model, storage medium and electronic equipment
US20180137425A1 (en) Real-time analysis of a musical performance using analytics
CN111354332A (en) Singing voice synthesis method and device
KR20110068869A (en) Rating speech naturalness of speech utterances based on a plurality of human testers
CN106295717A A Western musical instrument classification method based on sparse representation and machine learning
CN113327580A (en) Speech synthesis method, device, readable medium and electronic equipment
CN106205571A A processing method and apparatus for singing voice
CN113205793B (en) Audio generation method and device, storage medium and electronic equipment
JP7069819B2 (en) Code identification method, code identification device and program
CN114913859B (en) Voiceprint recognition method, voiceprint recognition device, electronic equipment and storage medium
Yu Research on multimodal music emotion recognition method based on image sequence
JP6856697B2 (en) Information processing device, information processing method, information processing program, learning device, learning method and learning program
CN113593607B (en) Audio processing method and device and electronic equipment
CN114999440B (en) Avatar generation method, apparatus, device, storage medium, and program product
US20200013409A1 (en) Speaker retrieval device, speaker retrieval method, and computer program product
CN112863476A (en) Method and device for constructing personalized speech synthesis model, method and device for speech synthesis and testing
CN113870897A (en) Audio data teaching evaluation method and device, equipment, medium and product thereof
CN114999441B (en) Avatar generation method, apparatus, device, storage medium, and program product
JP7230085B2 (en) Method and device, electronic device, storage medium and computer program for processing sound
WO2016039465A1 (en) Acoustic analysis device
CN116320222B (en) Audio processing method, device and storage medium
CN117133316A (en) Digital person lip synchronization method and device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220315

Address after: 510627 room 1701, No. 163, Pingyun Road, Tianhe District, Guangzhou City, Guangdong Province (Location: self compiled room 01) (office only)

Applicant after: Guangzhou Huancheng culture media Co.,Ltd.

Address before: 100102 901, floor 9, building 9, zone 4, Wangjing Dongyuan, Chaoyang District, Beijing

Applicant before: Beijing wall breaker Technology Co.,Ltd.

GR01 Patent grant