CN113593607A - Audio processing method and device and electronic equipment


Info

Publication number
CN113593607A
CN113593607A (application CN202010364644.4A)
Authority
CN
China
Prior art keywords
audio
score
target
mapping function
user
Prior art date
Legal status
Pending
Application number
CN202010364644.4A
Other languages
Chinese (zh)
Inventor
张家隆
Current Assignee
Guangzhou Huancheng Culture Media Co., Ltd.
Original Assignee
Beijing Wall Breaker Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Beijing Wall Breaker Technology Co., Ltd.
Priority to CN202010364644.4A
Publication of CN113593607A


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the type of extracted parameters
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for particular use for comparison or discrimination

Abstract

The invention discloses an audio processing method, an audio processing device and electronic equipment, wherein the method comprises the following steps: acquiring target audio to be processed; obtaining a selected feature vector, wherein the feature vector comprises at least one feature that affects a score of the audio; acquiring a mapping function between the feature vector and the score; and obtaining the prediction score of the target audio according to the mapping function and the vector value of the feature vector of the target audio.

Description

Audio processing method and device and electronic equipment
Technical Field
The present invention relates to the field of internet technologies, and in particular, to an audio processing method, an audio processing apparatus, an electronic device, and a computer-readable storage medium.
Background
With the rapid development of science and technology, singing with such tools has become a new form of entertainment. While singing, a user can tap chord keys so that the corresponding chords are played as accompaniment and an audio recording is generated.
In the prior art, background operators are usually required to score the audio content that users generate through a singing tool, and to select high-quality content to deliver to users for subsequent content consumption.
However, scoring audio by background operators lacks objective and uniform standards, so the resulting scores are subjective, and the process is costly.
Disclosure of Invention
It is an object of the present invention to provide a new technical solution for automatically scoring audio.
According to a first aspect of the present invention, there is provided an audio processing method comprising:
acquiring target audio to be processed;
obtaining a selected feature vector, wherein the feature vector comprises at least one feature that affects a score of audio;
acquiring a mapping function between the feature vector and the score;
obtaining a prediction score of the target audio according to the mapping function and vector values of the feature vectors of the target audio.
Optionally, the at least one feature comprises: at least one of mel frequency cepstrum coefficient, zero crossing rate, short-time energy, short-time autocorrelation function, short-time average amplitude difference, spectrogram, spectral entropy, fundamental frequency and formant.
Optionally, the obtaining a mapping function between the feature vector and the score includes:
acquiring training samples, wherein each training sample is audio and is marked as a corresponding actual score;
and training to obtain the mapping function according to the vector value and the actual score of the feature vector of the training sample.
Optionally, the obtaining the training sample includes:
obtaining at least one initial audio, wherein each initial audio is marked as a corresponding actual score;
taking, as reference audio, the initial audio whose actual score is a specified score;
determining a reference user according to the reference audio;
acquiring other audio generated by the reference user as an extended audio;
labeling the extended audio with the specified score;
and using the labeled extended audio and the initial audio as the training samples.
Optionally, the determining a reference user according to the reference audio includes:
determining a user generating each reference audio as a target user;
for each of the target users, determining a first number of reference audios generated by the user and a second number of initial audios generated by the user;
for each of the target users, determining a ratio of the first number to the second number;
and selecting the reference user from the target users according to the ratio.
Optionally, the determining a reference user according to the reference audio includes:
determining a user generating each reference audio as a target user;
for each of the target users, determining a first number of reference audios generated by the user;
and selecting the reference user from the target users according to the first quantity.
Optionally, the training to obtain the mapping function according to the vector value of the feature vector of the training sample and the actual score includes:
determining a score prediction expression of each training sample, with the undetermined coefficients of the mapping function as variables, according to the vector value of the feature vector of each training sample;
constructing a loss function according to the score prediction expression of each training sample and the actual score of each training sample;
and determining the undetermined coefficient according to the loss function, and finishing the training of the mapping function.
Optionally, the constructing a loss function according to the score prediction expression of each training sample and the actual score of each training sample includes:
for each training sample, determining a corresponding loss expression according to the score prediction expression and the actual score;
and summing the loss expressions of each training sample to obtain the loss function.
Optionally, the method further includes:
acquiring an actual score of the target audio;
taking the target audio as a new training sample, and marking the new training sample according to the actual score;
and correcting the mapping function according to the vector value of the feature vector of the new training sample and the actual score of the new training sample.
Optionally, the method further includes:
and executing the step of training the mapping function according to a preset training period.
Optionally, the method further includes:
providing the prediction score of the target audio to a client generating the target audio for presentation.
Optionally, the method further includes:
determining whether the target audio is a high-quality audio according to the prediction score;
and adding the target audio into a recommendation list under the condition that the target audio is the high-quality audio.
According to a second aspect of the present invention, there is provided an audio processing apparatus, comprising:
the audio acquisition module is used for acquiring target audio to be processed;
a feature obtaining module for obtaining a selected feature vector, wherein the feature vector comprises at least one feature that affects a score of audio;
the function acquisition module is used for acquiring a mapping function between the feature vector and the score;
and the score prediction module is used for obtaining the prediction score of the target audio according to the mapping function and the vector value of the feature vector of the target audio.
According to a third aspect of the present invention, there is provided an electronic apparatus comprising:
the apparatus according to the second aspect of the invention; or,
a processor and a memory for storing instructions for controlling the processor to perform the method according to the first aspect of the invention.
According to a fourth aspect of the present invention, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method according to the first aspect of the present invention.
In the embodiment of the invention, the prediction score of the target audio can be obtained according to the feature vector and the mapping function; the prediction score can thus be obtained automatically, without manual scoring, which reduces labor cost. Moreover, because the mapping function is trained on a large number of training samples, using it to determine the prediction score of the target audio improves the accuracy of the prediction score and makes the result more objective.
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1a is a block diagram showing one example of a hardware configuration of an electronic device that may be used to implement an embodiment of the invention.
FIG. 1b is a block diagram showing another example of a hardware configuration of an electronic device that may be used to implement an embodiment of the invention.
Fig. 2 shows a schematic diagram of an application scenario of an audio processing method of an embodiment of the present invention.
Fig. 3 shows a flow diagram of an audio processing method of an embodiment of the invention.
FIG. 4 shows a flowchart of the steps of obtaining training samples according to an embodiment of the present invention.
Fig. 5 shows a block diagram of an audio processing device of an embodiment of the invention.
FIG. 6 shows a block diagram of one example of an electronic device of an embodiment of the invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
< hardware configuration >
Fig. 1a and 1b are block diagrams of hardware configurations of an electronic apparatus 1000 that can be used to implement an audio processing method of any embodiment of the present invention.
In one embodiment, as shown in FIG. 1a, the electronic device 1000 may be a server 1100.
The server 1100 provides processing, database, and communication facilities. The server 1100 can be a unitary server or a distributed server spanning multiple computers or computer data centers. Servers may be of various types, such as, but not limited to, a web server, a news server, a mail server, a message server, an advertisement server, a file server, an application server, an interaction server, a database server, or a proxy server. In some embodiments, each server may include hardware, software, or embedded logic components, or a combination of two or more such components, for performing the appropriate functions supported or implemented by the server. The server may be, for example, a blade server or a cloud server, or it may be a server group consisting of a plurality of servers, which may include one or more of the above server types.
In this embodiment, the server 1100 may include a processor 1110, a memory 1120, an interface device 1130, a communication device 1140, a display device 1150, and an input device 1160, as shown in fig. 1 a.
In this embodiment, the server 1100 may also include a speaker, a microphone, and the like, which are not limited herein.
The processor 1110 may be a dedicated server processor, or a desktop or mobile processor that meets the performance requirements, and is not limited herein. The memory 1120 includes, for example, ROM (read-only memory), RAM (random access memory), and nonvolatile memory such as a hard disk. The interface device 1130 includes various bus interfaces, such as a serial bus interface (including a USB interface) and a parallel bus interface. The communication device 1140 is capable of wired or wireless communication, for example. The display device 1150 is, for example, a liquid crystal display panel, an LED display panel, a touch display panel, or the like. The input device 1160 may include, for example, a touch screen and a keyboard.
In this embodiment, the memory 1120 of the server 1100 is configured to store instructions for controlling the processor 1110 to operate at least to perform an audio processing method according to any of the embodiments of the present invention. The skilled person can design the instructions according to the disclosed solution. How the instructions control the operation of the processor is well known in the art and will not be described in detail herein.
Although shown as multiple devices in FIG. 1a, the present invention may relate to only some of the devices, e.g., server 1100 may relate to only memory 1120 and processor 1110.
In one embodiment, the electronic device 1000 may be a terminal device 1200, such as a PC or a notebook computer used by an operator, which is not limited herein.
In this embodiment, referring to fig. 1b, the terminal device 1200 may include a processor 1210, a memory 1220, an interface device 1230, a communication device 1240, a display device 1250, an input device 1260, a speaker 1270, a microphone 1280, and the like.
The processor 1210 may be a mobile version processor. The memory 1220 includes, for example, a ROM (read only memory), a RAM (random access memory), a nonvolatile memory such as a hard disk, and the like. The interface device 1230 includes, for example, a USB interface, a headphone interface, and the like. The communication device 1240 may be capable of wired or wireless communication, for example, the communication device 1240 may include a short-range communication device, such as any device that performs short-range wireless communication based on short-range wireless communication protocols, such as the Hilink protocol, WiFi (IEEE 802.11 protocol), Mesh, bluetooth, ZigBee, Thread, Z-Wave, NFC, UWB, LiFi, and the like, and the communication device 1240 may also include a long-range communication device, such as any device that performs WLAN, GPRS, 2G/3G/4G/5G long-range communication. The display device 1250 is, for example, a liquid crystal display, a touch display, or the like. The input device 1260 may include, for example, a touch screen, a keyboard, and the like. A user can input/output voice information through the speaker 1270 and the microphone 1280.
In this embodiment, the memory 1220 of the terminal device 1200 is used to store instructions for controlling the processor 1210 to operate at least to perform an audio processing method according to any of the embodiments of the present invention. The skilled person can design the instructions according to the disclosed solution. How the instructions control the operation of the processor is well known in the art and will not be described in detail herein.
Although a plurality of devices of the terminal apparatus 1200 are shown in fig. 1b, the present invention may relate to only some of them; for example, the terminal apparatus 1200 may relate to only the memory 1220, the processor 1210, and the display device 1250.
< application scenarios >
Fig. 2 is a schematic diagram of an application scenario of the audio processing method according to the embodiment of the present invention.
The audio processing method of this embodiment can be applied in particular to scenarios in which a user's singing is scored, such as KTV scenes, singing scenes, recording scenes, live broadcast scenes, virtual anchor scenes, and the like.
As shown in fig. 2, user A may input a voice through their client; the client of user A generates a target audio from the voice input by user A and provides the target audio to the electronic device 1000. The electronic device 1000 obtains the selected feature vector and the mapping function between the feature vector and the score, and obtains the predicted score of the target audio according to the mapping function and the vector value of the feature vector of the target audio. The electronic device 1000 may return the predicted score of the target audio to the client of user A.
By the method, the prediction score of the target audio can be automatically obtained without manual scoring, and the labor cost can be reduced. Moreover, the mapping function is obtained by training according to a large number of training samples, so that when the mapping function is used for determining the score of the prediction target audio, the accuracy of the obtained prediction score can be improved, and the result of the prediction score can be more objective.
For example, in a KTV, singing, or recording scenario, a user may input a singing voice through client A; client A generates a target audio from the user's singing voice and provides it to the electronic device 1000. The electronic device 1000 obtains the selected feature vector and the mapping function between the feature vector and the score, and obtains the predicted score of the target audio according to the mapping function and the vector value of the feature vector of the target audio. When the predicted score is obtained, the electronic device 1000 may return it to the user's client A. Once client A obtains the prediction score, it can display the score in its interface for the user to view.
For another example, in a live scene, the anchor may input a singing voice through the client B, and the client B generates a target audio according to the singing voice of the anchor and provides the target audio to the electronic device 1000. The electronic device 1000 obtains the selected feature vector and the mapping function between the feature vector and the score, and obtains the predicted score of the target audio according to the mapping function and the vector value of the feature vector of the target audio. When the predicted score is obtained, the electronic device 1000 may return the predicted score to the client B of the user. Under the condition that the client B obtains the prediction score, the prediction score can be displayed in an interface for the anchor to view.
In one example, the electronic device 1000 may also return the prediction score to the clients of users in the anchor's live room, so that users in the live room can also view the prediction score of the anchor's singing voice.
For another example, in the virtual anchor scene, the dubbing staff inputs the singing voice through the client C, and the client C generates the target audio according to the singing voice of the dubbing staff and provides the target audio to the electronic device 1000. The electronic device 1000 obtains the selected feature vector and the mapping function between the feature vector and the score, and obtains the predicted score of the target audio according to the mapping function and the vector value of the feature vector of the target audio. When the predicted score is obtained, the electronic device 1000 may return the predicted score to the client C of the user. Under the condition that the client C obtains the prediction score, the prediction score can be displayed in an interface for a dubbing staff to view.
In one example, the electronic device 1000 may also return the prediction score to the clients of users watching the video corresponding to the singing voice, so that those users can also view the prediction score of the dubbing staff's singing voice.
< method examples >
In the present embodiment, an audio processing method is provided. The audio processing method may be implemented by an electronic device. The electronic device may be the server 1100 as shown in fig. 1a or the terminal device 1200 as shown in fig. 1 b.
As shown in fig. 3, the audio processing method of the present embodiment may include the following steps S1000 to S4000:
step S1000, a target audio to be processed is obtained.
In one embodiment of the invention, the target audio may be audio obtained by the user's client that contains at least the speech input by the user.
The client may be a designated application, such as a singing type or singing type application.
In one example, the target audio may be a result of a client of the user simultaneously capturing voice input by the user and accompaniment played by the client.
In another example, the target audio may be synthesized from a voice input by the user and an accompaniment generated by the client.
Step S2000, the selected feature vector is acquired.
Wherein the feature vector comprises at least one feature that affects a score of the audio.
The feature vector X comprises at least one feature x_j that affects the score of the audio, where j takes natural numbers from 1 to n and n represents the total number of features of the feature vector X.
In one embodiment of the invention, the at least one feature x_j measures the corresponding audio along multiple dimensions such as tone, intonation, rhythm, and chord, and may include: at least one of mel-frequency cepstral coefficients, zero-crossing rate, short-time energy, short-time autocorrelation function, short-time average amplitude difference, spectrogram, spectral entropy, fundamental frequency, and formants.
Mel-frequency cepstral coefficients (MFCC) are spectral features computed by using the nonlinear correspondence between mel frequency and frequency in Hz. The mel frequency scale is derived from the auditory characteristics of the human ear.
Zero-crossing rate (ZCR) refers to the rate at which the sign of a signal changes, e.g., the signal changes from positive to negative or vice versa. This feature is widely used in speech comparison, speech recognition, and music information retrieval, and is a main feature for classifying percussive sounds.
The short-time energy is the speech energy calculated in a short time. The shorter time here is usually referred to as one frame. That is, the speech energy in one frame time is the short-term energy.
The short-time autocorrelation function is the result of intercepting a segment of signal with a short time window near the Nth sample point of the signal and performing autocorrelation calculation. Since the speech signal is a non-stationary signal, the short-time autocorrelation function is used for processing the signal.
The short-time average amplitude difference may be used for pitch period detection.
A spectrogram is a speech spectrum diagram, generally obtained by processing the received time-domain signal. Its abscissa is time, its ordinate is frequency, and the value at each coordinate point is the speech energy. Because three-dimensional information is expressed on a two-dimensional plane, the magnitude of the energy is expressed by color: the darker the color, the stronger the speech energy at that point.
Spectral entropy describes the relationship between the power spectrum and the entropy rate.
The fundamental frequency is the frequency of the fundamental tone, and determines the pitch of the entire tone. In sound, fundamental frequency refers to the frequency of a fundamental tone in a complex tone. Among the several tones constituting a complex tone, the fundamental tone has the lowest frequency and the highest intensity. The level of the fundamental frequency determines the level of a tone. The frequency of speech is usually the frequency of fundamental tones.
When quasi-periodic pulses from the glottis excite the vocal tract, its resonance characteristics produce a set of resonance frequencies, referred to as formant frequencies or simply formants.
In this example, x_j may be any feature that can affect the score of the audio. For example, the at least one feature may include mel-frequency cepstral coefficients, zero-crossing rate, short-time energy, short-time autocorrelation function, short-time average amplitude difference, spectrogram, spectral entropy, fundamental frequency, and formants; the feature vector X then has 9 features, i.e., n = 9, and may be represented as X = (x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9). Of course, other features related to audio scores may also be included in the feature vector X.
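For illustration only (the librosa library is assumed as the extraction tool; the chosen feature subset, the helper name feature_vector, and parameters such as n_mfcc=13 are not specified by the patent), a vector value for one audio file might be assembled as follows:

```python
# Minimal sketch: compute a few of the listed features for one audio file and
# concatenate them into a vector value X. The feature subset is illustrative.
import numpy as np
import librosa

def feature_vector(path: str) -> np.ndarray:
    y, sr = librosa.load(path, sr=None)                   # waveform, sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    zcr = librosa.feature.zero_crossing_rate(y).mean()    # zero-crossing rate
    energy = librosa.feature.rms(y=y).mean()              # short-time energy (RMS)
    f0 = librosa.yin(y, fmin=65, fmax=1000, sr=sr)        # fundamental frequency
    return np.concatenate([mfcc, [zcr, energy, f0.mean()]])
```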
Step S3000, a mapping function between the feature vector and the score is obtained.
The independent variable of the mapping function f (X) is the feature vector X, and the dependent variable f (X) is the prediction score determined by the feature vector X.
In an embodiment of the present invention, obtaining the mapping function between the feature vector and the score includes steps S3100 to S3200 as follows:
step S3100, a training sample is acquired.
Each training sample is audio and is labeled as a corresponding actual score.
In one embodiment of the present invention, each training sample may be scored manually.
In one embodiment of the present invention, obtaining training samples includes steps S3110-S3160 as shown in FIG. 4:
in step S3110, at least one initial audio is acquired.
Wherein each initial audio is labeled as a corresponding actual score.
In one embodiment of the invention, the initial audio may be generated by multiple users. Specifically, each initial audio may be generated by recording a user's voice through the singing tool in the corresponding client. The generation manner of the initial audio may refer to that of the target audio and is not repeated here.
The actual rating of the initial audio may be manually scored by a back-office operator.
In one example, the actual score for each initial audio may be a specific score, or may be a first score and a second score that are used to distinguish between premium audio and non-premium audio. The first score and the second score may be values set according to an application scenario or specific requirements, for example, the first score may be 1, and the second score may be 0.
Step S3120, taking the initial audio whose actual score is the specified score as the reference audio.
The assigned score may be set in advance according to an application scenario or a specific requirement. In an embodiment where the actual score includes a first score indicating that the corresponding audio is premium audio and a second score indicating that the corresponding audio is non-premium audio, the specified score may be, for example, the first score or the second score.
Step S3130, a reference user is determined according to the reference audio.
In this embodiment, other audio generated by the reference user determined from the reference audio can be expected to have an actual score equal to the specified score as well.
In one embodiment of the present invention, determining a reference user according to the reference audio may include steps S3131-a to S3134-a as follows:
in step S3131-a, the user who generated each reference audio is determined as a target user.
Step S3132-a, for each target user, determines a first number of reference audios to be generated and a second number of initial audios to be generated.
Specifically, the number of reference audios generated by each target user may be used as the first number of corresponding target users, and the number of initial audios generated by each target user may be used as the second number of corresponding target users.
Step S3133-a, for each target user, determines a ratio of the first number and the second number.
Specifically, a ratio of the first number to the second number of each target user may be calculated as a ratio of the corresponding target user.
Step S3134-a, selecting a reference user from the target users according to the ratio.
In an embodiment of the present invention, a target user whose ratio exceeds a preset first threshold may be selected as a reference user. The first threshold may be set in advance according to an application scenario or specific requirements, and the first threshold may be, for example and without limitation, 90%.
In another embodiment of the present invention, all the target users may be sorted in ascending or descending order of the ratio to obtain each target user's rank, and the target users whose ranks fall within a set range are selected as reference users. The set range may be chosen according to the sorting manner (ascending or descending), the application scenario, or specific requirements. For example, when the sorting is descending, the set range may be 1 to 3, in which case the 3 target users with the largest ratios are selected as reference users.
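A minimal sketch of steps S3131-a to S3134-a follows (the record layout, the helper name select_reference_users, and the default threshold of 0.9 are illustrative assumptions, not the patent's prescription):

```python
# Sketch: select reference users by the ratio of reference audios (actual
# score equals the specified score) to all initial audios they generated.
from collections import defaultdict

def select_reference_users(initial_audios, first_threshold=0.9):
    # initial_audios: iterable of (user_id, is_reference) pairs
    first = defaultdict(int)   # first number: reference audios per target user
    second = defaultdict(int)  # second number: initial audios per target user
    for user_id, is_reference in initial_audios:
        second[user_id] += 1
        if is_reference:
            first[user_id] += 1
    # Keep the target users whose ratio exceeds the preset first threshold.
    return [u for u in first if first[u] / second[u] > first_threshold]
```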
In another embodiment of the present invention, determining the reference user according to the reference audio includes steps S3131-b to S3133-b as follows:
step S3131-b, the user who generated each reference audio is determined as the target user.
Step S3132-b, for each target user, determines a first amount of reference audio to generate.
Step S3133-b, selecting a reference user from the target users according to the first number.
In an embodiment of the present invention, target users whose first number exceeds a preset second threshold may be selected as reference users. The second threshold may be set in advance according to the application scenario or specific requirements, for example and without limitation, 90.
In another embodiment of the present invention, all the target users may be sorted in ascending or descending order of the first number to obtain each target user's rank, and the target users whose ranks fall within a set range are selected as reference users. The set range may be chosen according to the sorting manner (ascending or descending), the application scenario, or specific requirements. For example, when the sorting is descending, the set range may be 1 to 3, in which case the 3 target users with the largest first numbers are selected as reference users.
Step S3140, other audio generated by the reference user is acquired as extended audio.
In the present embodiment, the other audio may be audio other than the initial audio.
Step S3150, the extended audio is labeled with the specified score.
Step S3160, the extended audio and the initial audio are used as training samples.
For a reference user selected in the foregoing step S3130, the actual scores of the other audios generated by that user can be considered to equal the specified score. Thus, the extended audio may be directly labeled with the specified score, and both the already labeled initial audio and the extended audio may be used as training samples.
In this embodiment, to reduce the cost of manual labeling, reference users whose audio can be expected to receive the specified score are screened out, the other audios they generated are used as extended audio and labeled with the specified score, and the labeled extended audio is also used as training samples, which expands the number of samples.
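Continuing the sketch under the same assumptions, the expansion of steps S3140 to S3160 might read:

```python
# Sketch: label the reference users' other audios with the specified score
# (extended audio) and merge them with the manually labeled initial samples.
def expand_training_samples(initial_samples, other_audios_by_user,
                            reference_users, specified_score=1):
    samples = list(initial_samples)            # (audio, actual_score) pairs
    for user in reference_users:
        for audio in other_audios_by_user.get(user, []):
            samples.append((audio, specified_score))   # extended audio
    return samples
```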
Step S3200, training to obtain a mapping function according to the vector value and the actual score of the feature vector of the training sample.
In an embodiment of the present invention, the steps S3100 to S3200 of training the mapping function may be performed according to a preset training period. The training period may be set according to a specific application scenario or application requirements, and may be set to 1 day, for example.
In this embodiment, the mapping function f(X) may be obtained by various fitting means based on the vector values of the feature vectors of the training samples and the actual scores corresponding to the training samples; for example, the mapping function f(X) may be obtained by using an arbitrary multiple linear regression model, which is not limited herein.
In one example, the multiple linear regression model may be a simple polynomial function reflecting the mapping function f(X), where the coefficients of each order of the polynomial are unknown; the coefficients can be determined by substituting the vector values of the feature vectors of the training samples and the corresponding actual scores into the polynomial function, thereby obtaining the mapping function f(X).
In another example, various regression models, or even a classification model, may be used to perform multiple rounds of training with the vector values of the feature vectors of the training samples and the corresponding actual scores as labeled samples. Each round learns the residual left after the previous round's fit, and by iterating T rounds the residual is driven to a very low value, so that the finally obtained mapping function f(X) is highly accurate. The model is, for example, SVM, GBDT, or CNN, and is not limited herein.
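As one hedged example of such residual-learning training, GBDT via scikit-learn could be used (scikit-learn is an assumed dependency here, and the hyperparameters are illustrative):

```python
# Sketch: fit the mapping function f(X) with gradient-boosted trees, in which
# each round fits the residual left by the previous rounds.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_mapping_function(vectors, scores, rounds=300):
    # vectors: (m, n) vector values of the training samples' feature vectors
    # scores:  (m,)  actual scores of the training samples
    model = GradientBoostingRegressor(n_estimators=rounds, learning_rate=0.1)
    model.fit(np.asarray(vectors), np.asarray(scores))
    return model.predict  # plays the role of the mapping function f(X)
```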
In an embodiment of the present invention, training to obtain a mapping function according to vector values and actual scores of feature vectors of training samples may include steps S3210 to S3230 as follows:
step S3210, determining a score prediction expression of each training sample according to the vector value of the feature vector of each training sample by using the undetermined coefficient of the mapping function as a variable.
Assume that the feature vector X in the mapping function includes n features x_1, x_2, ..., x_n, and that the vector value of the k-th training sample over the n features is

X^{(k)} = (x_1^{(k)}, x_2^{(k)}, \ldots, x_n^{(k)})

Then, with the set of undetermined coefficients, comprising a constant weight b and n feature weights a_1, a_2, ..., a_n, as variables, the score prediction expression of the k-th training sample is

Y_k = b + a_1 x_1^{(k)} + a_2 x_2^{(k)} + \cdots + a_n x_n^{(k)}
Step S3220, a loss function is constructed according to the score prediction expression of each training sample and the actual score of each training sample.
In an embodiment of the present invention, constructing the loss function according to the score prediction expression of each training sample and the actual score of each training sample may include steps S3221 to S3222 as follows:
step S3221, for each training sample, determining a corresponding loss expression according to the score prediction expression and the actual score.
Assuming that the number of training samples collected is m, the actual score of the k-th training sample is y_k and its score prediction expression is Y_k; the corresponding loss expression is (y_k - Y_k)^2 for k = 1, ..., m, where

Y_k = b + \sum_{j=1}^{n} a_j x_j^{(k)}
step S3222, the loss expressions of each training sample are summed to obtain a loss function.
In this embodiment, the loss function may be:

L(b, a_1, \ldots, a_n) = \sum_{k=1}^{m} (y_k - Y_k)^2

where Y_k = b + \sum_{j=1}^{n} a_j x_j^{(k)} as above.
and step S3230, determining undetermined coefficients according to the loss function, and finishing the training of the mapping function.
In an embodiment of the present invention, the undetermined coefficient is determined according to the loss function, and the completing the training of the mapping function may further include steps S3231 to S3233 as follows:
step S3231, setting constant weights in the undetermined coefficient set and initial values of each characteristic weight as random numbers in a preset numerical range.
Assume the undetermined coefficient set {b, a_1, a_2, ..., a_n} comprises a constant weight b and n feature weights a_1, a_2, ..., a_n; their initial values may be set to random numbers in a preset numerical range. The preset range may be set according to the application scenario or application requirements; for example, if the preset range is 0-1, the constant weight b and the n feature weights a_1, a_2, ..., a_n are all random numbers between 0 and 1.
Step S3232, substituting the constant weight and each feature weight, after their initial values are set, into the loss function and performing iterative processing.
In this embodiment, the step S3232 of substituting the constant weight after setting the initial value and each feature weight into the loss function may further include the following steps S3232-1 to S3232-2:
step S3232-1, for each constant weight and each feature weight, obtaining a value of the constant weight or the feature weight after the corresponding iteration according to the constant weight or the value of the feature weight before the current iteration, the convergence parameter, and the loss function substituted into the undetermined coefficient set before the current iteration.
The convergence parameter is a relevant parameter for controlling the convergence speed of the iterative process, and may be set according to an application scenario or an application requirement, for example, to 0.01.
And step S3232-2, obtaining a undetermined coefficient set after the iteration according to the constant weight and the value after the iteration of each characteristic weight.
Assuming the current iteration is the (k+1)-th (k starts at 0 and increases by 1 with each iteration), the undetermined coefficient set after this iteration is {b, a_1, a_2, ..., a_n}^{(k+1)}.
Step S3233, when the undetermined coefficient set obtained by the iterative processing meets the convergence condition, terminating the iterative processing and fixing the values of the constant weight and each feature weight in the undetermined coefficient set; otherwise, continuing the iterative processing.
The convergence condition may be set according to a specific application scenario or application requirements.
For example, the convergence condition is that the number of iterations exceeds a preset number threshold. The preset number threshold may be set according to engineering experience or experimental simulation results, and may be set to 300, for example. Correspondingly, assuming the number of iterations is k + 1 and the number threshold is itemNums, the corresponding convergence condition is: k ≥ itemNums.
For another example, the convergence condition is that the iteration result value of the undetermined coefficient set obtained by the iterative processing is smaller than a preset result threshold, where the iteration result value is determined from the partial derivatives of the loss function, evaluated at the undetermined coefficient set obtained by the iteration, with respect to the constant weight and each feature weight.
In an example, the convergence condition is that any one of the convergence conditions in the two examples is satisfied, and the specific convergence condition has been described in the two examples and is not described herein again.
Suppose the undetermined coefficient set {b, a_1, a_2, ..., a_n}^{(k+1)} obtained by the (k+1)-th iteration meets the convergence condition; the iterative processing is then stopped, and the corresponding values a_i^{(k+1)} (i = 1, ..., n) and b^{(k+1)} are taken. Otherwise, the iterative processing continues until the undetermined coefficient set meets the convergence condition.
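Taken together, steps S3210 to S3233 amount to gradient descent on the squared-error loss. A minimal NumPy sketch follows (the convergence parameter 0.01 and the iteration cap 300 echo the examples above, while the 1/m scaling of the gradient is an illustrative normalization not stated in the text):

```python
# Sketch of steps S3210-S3233: fit the constant weight b and the feature
# weights a_1..a_n by iterating gradient-descent updates on the loss function.
import numpy as np

def train_mapping_function(X, y, eta=0.01, iter_nums=300):
    # X: (m, n) vector values of the feature vectors; y: (m,) actual scores
    m, n = X.shape
    rng = np.random.default_rng()
    a = rng.random(n)            # feature weights: random initial values in 0-1
    b = rng.random()             # constant weight: random initial value in 0-1
    for _ in range(iter_nums):   # convergence condition: iteration count
        Y = X @ a + b                     # score prediction for every sample
        grad_a = 2.0 / m * X.T @ (Y - y)  # partial derivatives of the loss
        grad_b = 2.0 / m * np.sum(Y - y)
        a -= eta * grad_a                 # update by the convergence parameter
        b -= eta * grad_b
    return lambda x: float(x @ a + b)     # the trained mapping function f(X)
```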
According to the embodiment of the invention, the mapping function can be obtained according to a large number of training samples, so that when the score of the predicted audio is determined by using the mapping function, the accuracy of the obtained predicted score can be improved.
And step S4000, obtaining the prediction score of the target audio according to the mapping function and the vector value of the feature vector of the target audio.
The vector value may specifically be a value of a feature vector of the target audio.
In this embodiment, the mapping function f(X) between the feature vector and the score is obtained in step S3000; substituting the vector value of the feature vector of the target audio into the mapping function f(X) yields the prediction score of the target audio.
According to the embodiment of the invention, the prediction score of the target audio can be automatically obtained without manual scoring, so that the labor cost can be reduced. Moreover, the mapping function is obtained by training according to a large number of training samples, so that when the mapping function is used for determining the score of the prediction target audio, the accuracy of the obtained prediction score can be improved, and the result of the prediction score can be more objective.
In an embodiment where the actual scores of the training samples include a first score representing that the corresponding audio is a premium audio and a second score representing that the corresponding audio is a non-premium audio, the output of the mapping function may be a probability that the target audio is a premium audio.
In one example of the present invention, the probability output by the mapping function may be directly used as the prediction score of the target audio. For example, the output result of the mapping function is 0.23, and then the prediction score of the target audio may be 0.23.
In another example of the present invention, the probability output by the mapping function may be normalized according to the first score and the second score, and then multiplied by 100 to obtain the prediction score of the target audio. For example, if the first score is 1 and the second score is 0, the output result of the mapping function is 0.89, and then the prediction score of the target audio may be 89. As another example, if the first score is 2, the second score is 1, and the output result of the mapping function is 1.96, then the prediction score of the target audio may be 96.
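Written out as a sketch (the function name is hypothetical; the arithmetic matches the two examples above):

```python
# Sketch: normalize the mapping function's probability output by the first
# (premium) and second (non-premium) scores, then scale to 0-100.
def prediction_score(p: float, first: float = 1.0, second: float = 0.0) -> float:
    return (p - second) / (first - second) * 100

# Reproduces the worked examples (up to floating-point rounding):
# prediction_score(0.89) -> 89.0; prediction_score(1.96, first=2, second=1) -> 96.0
```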
In one embodiment of the present invention, the method may further comprise: and providing the prediction score to a client side which generates the target audio for display, so that the user which generates the target audio can view the prediction score.
In one embodiment of the present invention, the method may further comprise:
determining whether the target audio is high-quality audio according to the prediction score; and in the case that the target audio is the high-quality audio, adding the target audio to the recommendation list.
The audio in the recommendation list may be provided to each user's client in a preset manner. The preset manner can be set in advance according to the application scenario or specific requirements; for example, it may be at least one of: by score from high to low, by play count from high to low, according to each user's preference, or at random.
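As a toy illustration of this step (the threshold value and the list handling are assumptions):

```python
# Sketch: mark the target audio as premium by a score threshold and keep the
# recommendation list ordered by score from high to low.
def maybe_recommend(recommendations, audio_id, score, threshold=80.0):
    if score >= threshold:                  # deemed high-quality audio
        recommendations.append((score, audio_id))
        recommendations.sort(reverse=True)  # score from high to low
    return recommendations
```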
In one embodiment of the present invention, in the case that the target audio is the high-quality audio, the target audio may be taken as the exemplary audio of the corresponding song for other users to learn.
In one embodiment of the present invention, the method may further comprise:
acquiring an actual score of the target audio;
taking the target audio as a new training sample, and marking the new training sample according to the actual score;
and correcting the mapping function according to the vector value of the feature vector of the new training sample and the actual score of the new training sample.
In this embodiment, the actual score of the target audio may be obtained by manually scoring the target audio by a background operator.
According to the embodiment of the invention, the target audio labeled with its actual score is used as a new training sample to correct the mapping function; that is, new training samples are continually added to retrain the mapping function, so that its score prediction results become increasingly accurate.
< apparatus embodiment >
In the present embodiment, there is provided an audio processing apparatus 5000, as shown in fig. 5, including an audio acquisition module 5100, a feature acquisition module 5200, a function acquisition module 5300, and a score prediction module 5400. The audio obtaining module 5100 is configured to obtain a target audio to be processed; the feature obtaining module 5200 is configured to obtain a selected feature vector, where the feature vector includes at least one feature that affects a score of the audio; the function obtaining module 5300 is configured to obtain a mapping function between the feature vectors and the scores; the score prediction module 5400 is configured to obtain a prediction score of the target audio according to the mapping function and a vector value of the feature vector of the target audio.
In one embodiment of the invention, the at least one feature comprises: at least one of mel frequency cepstrum coefficient, zero crossing rate, short-time energy, short-time autocorrelation function, short-time average amplitude difference, spectrogram, spectral entropy, fundamental frequency and formant.
In an embodiment of the present invention, the function acquisition module 5300 may be configured to:
acquiring training samples, wherein each training sample is audio and is marked as a corresponding actual score;
and training to obtain a mapping function according to the vector value and the actual score of the feature vector of the training sample.
In one embodiment of the invention, obtaining training samples comprises:
obtaining at least one initial audio, wherein each initial audio is marked as a corresponding actual score;
taking, as reference audio, the initial audio whose actual score is the specified score;
determining a reference user according to the reference audio;
acquiring other audio generated by the reference user as extended audio;
labeling the extended audio with the specified score;
and using the labeled extended audio and the initial audio as training samples.
In one embodiment of the present invention, determining the reference user from the reference audio comprises:
determining a user generating each reference audio as a target user;
for each target user, determining a first number of reference audios generated by the user and a second number of initial audios generated by the user;
for each target user, determining a ratio of the first number to the second number;
and selecting a reference user from the target users according to the ratio.
In one embodiment of the present invention, determining the reference user from the reference audio comprises:
determining a user generating each reference audio as a target user;
for each target user, determining a first number of reference audios generated by the user;
and selecting a reference user from the target users according to the first quantity.
In an embodiment of the present invention, training to obtain the mapping function according to the vector values and the actual scores of the feature vectors of the training samples includes:
determining a score prediction expression for each training sample, with the undetermined coefficients of the mapping function as variables, according to the vector value of the feature vector of each training sample;
constructing a loss function according to the score prediction expression of each training sample and the actual score of each training sample;
and determining the undetermined coefficient according to the loss function, and finishing the training of the mapping function.
In one embodiment of the present invention, constructing the loss function according to the score prediction expression of each training sample and the actual score of each training sample comprises:
for each training sample, determining a corresponding loss expression according to the score prediction expression and the actual score;
and summing the loss expressions of each training sample to obtain a loss function.
In an embodiment of the present invention, the audio processing apparatus 5000 may further include:
a module for obtaining an actual score of the target audio;
a module for taking the target audio as a new training sample and marking the new training sample according to the actual score;
and the module is used for correcting the mapping function according to the vector value of the feature vector of the new training sample and the actual score of the new training sample.
In an embodiment of the present invention, the audio processing apparatus 5000 may further include:
and a module for executing the step of training the mapping function according to a preset training period.
In an embodiment of the present invention, the audio processing apparatus 5000 may further include:
a module for providing the prediction score of the target audio to the client that generated the target audio for presentation.
In an embodiment of the present invention, the audio processing apparatus 5000 may further include:
a module for determining whether the target audio is a high-quality audio according to the prediction score;
and the module is used for adding the target audio to the recommendation list in the case that the target audio is the high-quality audio.
It will be appreciated by those skilled in the art that the audio processing apparatus 5000 may be implemented in various ways. For example, the audio processing apparatus 5000 may be implemented by instructions configuring a processor: the instructions may be stored in a ROM and, when the device starts, read from the ROM into a programmable device to implement the audio processing apparatus 5000. For example, the audio processing apparatus 5000 may be solidified into a dedicated device (e.g., an ASIC). The audio processing apparatus 5000 may be divided into mutually independent units, or these units may be combined together for implementation. The audio processing apparatus 5000 may be implemented by one of the above implementations, or by a combination of two or more of them.
In this embodiment, the audio processing apparatus 5000 may take various implementation forms; for example, it may be any functional module running in a software product or application program that provides an audio processing service, or a peripheral add-on, plug-in, or patch of such a software product or application program, or the software product or application program itself.
< electronic apparatus >
In this embodiment, an electronic device 6000 is also provided. The electronic device 6000 may be the server 1100 shown in fig. 1a, or may be the terminal device 1200 shown in fig. 1 b.
In one aspect, the electronic device 6000 may include the aforementioned audio processing apparatus 5000 for implementing the audio processing method of any embodiment of the present invention.
In another aspect, as shown in fig. 6, the electronic device 6000 may further include a processor 6100 and a memory 6200, the memory 6200 being configured to store executable instructions; the processor 6100 is configured to operate the electronic device 6000 to perform an audio processing method according to any of the embodiments of the present invention according to the control of the instructions.
In this embodiment, the electronic device 6000 may be a terminal device such as a smart speaker, an earphone, a mobile phone, a tablet computer, a palm computer, a desktop computer, and a notebook computer, or may be a server. For example, the electronic device 6000 may be an electronic product having an audio processing function.
< computer-readable storage Medium >
In this embodiment, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements an audio processing method according to any of the embodiments of the present invention.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), with state information of the computer-readable program instructions, the electronic circuit being able to execute the computer-readable program instructions.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or by combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementations in hardware, in software, and in a combination of software and hardware are all equivalent.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims (15)

1. An audio processing method, comprising:
acquiring target audio to be processed;
obtaining a selected feature vector, wherein the feature vector comprises at least one feature that affects a score of the audio;
acquiring a mapping function between the feature vector and the score;
and obtaining a prediction score of the target audio according to the mapping function and the vector value of the feature vector of the target audio.
2. The method of claim 1, wherein the at least one feature comprises: at least one of a Mel-frequency cepstral coefficient, a zero-crossing rate, a short-time energy, a short-time autocorrelation function, a short-time average magnitude difference, a spectrogram, a spectral entropy, a fundamental frequency, and a formant.
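By way of illustration only, the feature vector of claims 1 and 2 could be assembled as in the following sketch. This is not part of the claimed method; the use of the librosa library, the choice of 13 MFCCs, and the pitch-range bounds are all assumptions made here for concreteness.

```python
# Illustrative sketch only: one possible feature vector built from a few
# of the features listed in claim 2. librosa and all parameters are assumptions.
import numpy as np
import librosa

def extract_feature_vector(path):
    """Concatenate several claim-2 features into a single vector."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)  # Mel-frequency cepstral coefficients
    zcr = float(librosa.feature.zero_crossing_rate(y).mean())        # zero-crossing rate
    energy = float(np.mean(y ** 2))                                  # crude stand-in for short-time energy
    f0 = librosa.yin(y, fmin=65.0, fmax=1000.0, sr=sr)               # fundamental-frequency track
    return np.concatenate([mfcc, [zcr, energy, float(np.mean(f0))]])
```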
3. The method of claim 1, wherein the acquiring of the mapping function between the feature vector and the score comprises:
acquiring training samples, wherein each training sample is an audio item marked with a corresponding actual score;
and training the mapping function according to the vector values of the feature vectors of the training samples and the actual scores.
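For illustration, a linear mapping function fitted by least squares would satisfy the training step of claim 3. The linear form is an assumption; the claims leave the form of the mapping function open.

```python
# Minimal sketch under the assumption of a linear mapping function f(x) = w·x + b.
import numpy as np

def train_mapping_function(X, y):
    """X: (n_samples, n_features) feature vectors; y: (n_samples,) actual scores."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # minimise ||A·coef - y||^2
    return coef[:-1], float(coef[-1])              # weights w and bias b

def predict_score(x, w, b):
    """Apply the trained mapping function to one target audio's feature vector."""
    return float(np.dot(w, x) + b)
```

A caller would pass the feature vectors of the training samples as the rows of X and their marked actual scores as y, then score a target audio with predict_score.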
4. The method of claim 3, wherein the acquiring of the training samples comprises:
obtaining at least one initial audio, wherein each initial audio is marked with a corresponding actual score;
taking initial audio whose actual score is a specified score as reference audio;
determining a reference user according to the reference audio;
acquiring other audio generated by the reference user as extended audio;
marking the extended audio with the specified score;
and using the marked extended audio and the initial audio as the training samples.
5. The method of claim 4, wherein the determining of the reference user according to the reference audio comprises:
determining the user who generated each reference audio as a target user;
for each target user, determining a first number of reference audios generated by the target user and a second number of initial audios generated by the target user;
for each target user, determining the ratio of the first number to the second number;
and selecting the reference user from the target users according to the ratio.
6. The method of claim 4, wherein the determining of the reference user according to the reference audio comprises:
determining the user who generated each reference audio as a target user;
for each target user, determining a first number of reference audios generated by the target user;
and selecting the reference user from the target users according to the first number.
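As a sketch of the sample expansion of claims 4 to 6: users whose existing audio earned the specified score become reference users, and their other audio is labelled with that score. The tuple layout and the 0.5 ratio threshold below are assumptions; the claim-6 variant would simply threshold the first number directly instead of the ratio.

```python
# Illustrative sketch of claims 4-5; data layout and threshold are assumed.
from collections import Counter

def select_reference_users(initial_audios, specified_score, min_ratio=0.5):
    """initial_audios: list of (user_id, audio_id, actual_score) triples."""
    total = Counter(u for u, _, _ in initial_audios)                           # second number (claim 5)
    as_ref = Counter(u for u, _, s in initial_audios if s == specified_score)  # first number
    return {u for u, n in as_ref.items() if n / total[u] >= min_ratio}         # ratio test

def expand_training_samples(initial_audios, other_audios, specified_score):
    """other_audios: list of (user_id, audio_id) pairs by the same users."""
    refs = select_reference_users(initial_audios, specified_score)
    extended = [(u, a, specified_score) for u, a in other_audios if u in refs]  # claim-4 labelling
    return initial_audios + extended
```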
7. The method of claim 3, wherein the training of the mapping function according to the vector values of the feature vectors of the training samples and the actual scores comprises:
determining a score prediction expression for each training sample according to the vector value of its feature vector, taking the undetermined coefficients of the mapping function as variables;
constructing a loss function according to the score prediction expression and the actual score of each training sample;
and determining the undetermined coefficients according to the loss function, thereby completing the training of the mapping function.
8. The method of claim 7, wherein the constructing of the loss function according to the score prediction expression and the actual score of each training sample comprises:
for each training sample, determining a corresponding loss expression according to its score prediction expression and actual score;
and summing the loss expressions of all training samples to obtain the loss function.
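To make claims 7 and 8 concrete: with a linear mapping function, each sample's score prediction expression is w·x_i + b with w and b as the undetermined coefficients, its loss expression is the squared error against the actual score, and the loss function is their sum. The gradient-descent solver and learning rate below are assumptions; any minimiser of the loss would do.

```python
# Illustrative sketch of claims 7-8 under a linear mapping function.
import numpy as np

def loss(w, b, X, y):
    """Sum over samples of squared (prediction - actual score) losses."""
    pred = X @ w + b                        # score prediction expression per sample (claim 7)
    return float(np.sum((pred - y) ** 2))   # summed loss expressions (claim 8)

def fit_by_gradient_descent(X, y, lr=1e-3, steps=5000):
    """Pick the undetermined coefficients w, b by minimising the loss."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        err = X @ w + b - y
        w -= lr * (2.0 / len(y)) * (X.T @ err)       # mean-scaled gradient w.r.t. w
        b -= lr * (2.0 / len(y)) * float(err.sum())  # mean-scaled gradient w.r.t. b
    return w, b
```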
9. The method of claim 3, further comprising:
acquiring an actual score of the target audio;
taking the target audio as a new training sample, and marking the new training sample according to the actual score;
and correcting the mapping function according to the vector value of the feature vector of the new training sample and the actual score of the new training sample.
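A sketch of the correction step of claim 9 follows, assuming a trainer like the one sketched under claim 3 is reused. Re-fitting from scratch on the grown sample set is only one of several ways the claim could be realised.

```python
# Illustrative sketch of claim 9: fold a newly scored target audio back in.
import numpy as np

def correct_mapping_function(X, y, new_x, new_score, train):
    """Append the new training sample and re-fit the mapping function."""
    X2 = np.vstack([X, new_x])       # feature vector of the new training sample
    y2 = np.append(y, new_score)     # its actual score
    return train(X2, y2), (X2, y2)   # corrected coefficients plus the grown sample set
```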
10. The method of claim 3, further comprising:
and executing the step of training the mapping function according to a preset training period.
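A minimal sketch of the periodic retraining of claim 10; the one-day period and the retrain callable are assumptions.

```python
# Illustrative sketch of claim 10; period and callback are assumed.
import time

def retrain_periodically(retrain, period_seconds=24 * 60 * 60):
    """Re-run the mapping-function training step once per preset period."""
    while True:
        retrain()                   # e.g. re-fit on the current training samples
        time.sleep(period_seconds)  # wait out one training period
```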
11. The method of claim 1, further comprising:
providing the prediction score of the target audio to the client that generated the target audio, for presentation.
12. The method of claim 1, further comprising:
determining, according to the prediction score, whether the target audio is high-quality audio;
and adding the target audio to a recommendation list if the target audio is determined to be high-quality audio.
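A sketch of claim 12's recommendation step; the threshold value is an assumption, since the claims do not fix what counts as high-quality.

```python
# Illustrative sketch of claim 12; the cut-off value is assumed.
QUALITY_THRESHOLD = 80.0  # not specified by the claims

def maybe_recommend(audio_id, predicted_score, recommendation_list):
    """Append the target audio to the recommendation list if it is high-quality."""
    if predicted_score >= QUALITY_THRESHOLD:  # the high-quality test
        recommendation_list.append(audio_id)
    return recommendation_list
```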
13. An audio processing apparatus, comprising:
an audio acquisition module for acquiring target audio to be processed;
a feature acquisition module for obtaining a selected feature vector, wherein the feature vector comprises at least one feature that affects a score of the audio;
a function acquisition module for acquiring a mapping function between the feature vector and the score;
and a score prediction module for obtaining a prediction score of the target audio according to the mapping function and the vector value of the feature vector of the target audio.
14. An electronic device, comprising:
the apparatus of claim 13; or,
a processor and a memory for storing instructions for controlling the processor to perform the method of any of claims 1 to 12.
15. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 12.
CN202010364644.4A 2020-04-30 2020-04-30 Audio processing method and device and electronic equipment Pending CN113593607A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010364644.4A CN113593607A (en) 2020-04-30 2020-04-30 Audio processing method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN113593607A 2021-11-02

Family

ID=78237265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010364644.4A Pending CN113593607A (en) 2020-04-30 2020-04-30 Audio processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN113593607A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102664018A (en) * 2012-04-26 2012-09-12 杭州来同科技有限公司 Singing scoring method with radial basis function-based statistical model
CN106558308A (en) * 2016-12-02 2017-04-05 深圳撒哈拉数据科技有限公司 A kind of internet audio quality of data auto-scoring system and method
CN108520303A (en) * 2018-03-02 2018-09-11 阿里巴巴集团控股有限公司 A kind of recommendation system building method and device
CN109308913A (en) * 2018-08-02 2019-02-05 平安科技(深圳)有限公司 Sound quality evaluation method, device, computer equipment and storage medium
CN109658918A (en) * 2018-12-03 2019-04-19 广东外语外贸大学 A kind of intelligence Oral English Practice repetition topic methods of marking and system
US20190332946A1 (en) * 2018-04-30 2019-10-31 Facebook, Inc. Combining machine-learning and social data to generate personalized recommendations
CN110598756A (en) * 2019-08-22 2019-12-20 腾讯音乐娱乐科技(深圳)有限公司 Model training method and device and storage medium
CN110880329A (en) * 2018-09-06 2020-03-13 腾讯科技(深圳)有限公司 Audio identification method and equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220315

Address after: 510627 room 1701, No. 163, Pingyun Road, Tianhe District, Guangzhou City, Guangdong Province (Location: self compiled room 01) (office only)

Applicant after: Guangzhou Huancheng culture media Co.,Ltd.

Address before: 100102 901, floor 9, building 9, zone 4, Wangjing Dongyuan, Chaoyang District, Beijing

Applicant before: Beijing wall breaker Technology Co.,Ltd.
