WO2019233361A1 - Method and device for adjusting volume of music

Method and device for adjusting volume of music

Info

Publication number
WO2019233361A1
WO2019233361A1 (application PCT/CN2019/089758, CN2019089758W)
Authority
WO
WIPO (PCT)
Prior art keywords
music
noise
played
neural network
volume
Prior art date
Application number
PCT/CN2019/089758
Other languages
French (fr)
Chinese (zh)
Inventor
姚青山
秦宇
喻浩文
卢峰
Original Assignee
安克创新科技股份有限公司
Priority date
Filing date
Publication date
Application filed by 安克创新科技股份有限公司
Publication of WO2019233361A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0316 Speech enhancement, e.g. noise reduction or echo cancellation, by changing the amplitude
    • G10L 21/0324 Details of processing therefor
    • G10L 21/034 Automatic adjustment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 Sound input; Sound output
    • G06F 3/165 Management of the audio stream, e.g. setting of volume, audio stream path
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • Embodiments of the present invention relate to the field of sound, and more particularly, to a method and device for adjusting volume of music.
  • Sound quality is a subjective evaluation of audio quality. Generally, sound quality is divided into dozens of indicators; volume (also called loudness) is one of the important ones.
  • The volume affects how well people receive the information in the music.
  • the volume setting is generally related to the ambient sound. For example, the volume of music in a noisy environment is generally higher than the volume of music in a quiet environment.
  • The current volume setting is mainly adjusted by the user manually, which adds operational complexity and affects the user experience.
  • In addition, some existing automatic volume adjustment technologies generally consider only environmental noise parameters, so their automatic adjustment ability is limited.
  • In fact, an individual user's volume preference is related to many factors, such as the type of music: people may set different volumes when listening to different styles of music, and different types of environmental noise affect the volume setting differently.
  • Other factors include personal preferences, personal hearing, audio playback device parameters, and so on.
  • A volume model must take all of these factors into account to achieve better performance.
  • Embodiments of the present invention provide a method and device for automatically adjusting the volume of music, which can adjust the volume of music based on deep learning, simplify user operations, and thereby improve the user experience.
  • In a first aspect, a method for adjusting the volume of music is provided, including: obtaining a time-domain waveform of the music to be played and a time-domain waveform of noise of the playback environment; using a pre-trained neural network to obtain a volume setting of the music to be played according to the two time-domain waveforms; and using the volume setting to adjust the volume of the music to be played.
  • the method further includes:
  • If the number of readjustment instructions from the specific user reaches a preset value, the volume adjusted by the specific user is used as a training sample, learning is performed on the basis of the parameters of the baseline model to obtain an updated model, and the updated model replaces the baseline model.
  • the pre-trained neural network includes a music style neural network, a noise category identification neural network, and a volume adjustment neural network.
  • the process of obtaining the volume setting of the music to be played includes:
  • the style vector of the music to be played, the category of the noise, the energy characteristics of the music to be played, and the energy characteristics of the noise are input to the volume adjustment neural network to obtain the volume setting of the music to be played.
  • a process of obtaining a style vector of the music to be played includes:
  • the characteristics of the music to be played are input to the music style neural network to obtain the style vector of the music to be played.
  • the process of obtaining the category of the noise includes:
  • the characteristics of the noise are input to the noise category identification neural network to obtain the category of the noise.
  • the energy characteristics of the music to be played include the average amplitude of the music to be played, and the process of obtaining the energy characteristics of the music to be played includes: calculating the absolute value of the amplitude of each point in the time-domain waveform of the music to be played, and dividing by the total number of points to obtain the average amplitude of the music to be played.
  • the energy characteristic of the noise includes an average amplitude of the noise
  • a process of obtaining the energy characteristic of the noise includes:
  • the absolute value of the amplitude of each point in the time domain waveform of the noise is calculated, and then divided by the total number of points to obtain the average amplitude of the noise.
  • Before using the music style neural network, the method further includes:
  • the music style neural network is obtained through training.
  • each music training data in the music training data set has a music style vector
  • the music style vector of the music training data is obtained in the following manner:
  • a music style vector of each music training data is determined according to the annotation matrix.
  • the determining of a music style vector of each music training data according to the annotation matrix includes: decomposing the annotation matrix into a product of a first matrix and a second matrix; and determining each row vector of the first matrix as the music style vector of the corresponding music training data.
  • Before using the noise category identification neural network, the method further includes:
  • the noise class identification neural network is obtained through training.
  • the time-domain waveform of the noise is collected by a pickup device of a user audio playback device.
  • the method further includes: playing the music to be played after the volume is adjusted.
  • In a second aspect, a device for adjusting the volume of music is provided; the device is configured to implement the steps of the method described in the first aspect or any implementation manner, and the device includes:
  • An acquisition module for acquiring a time-domain waveform of music to be played and a time-domain waveform of noise of a playback environment
  • a determining module configured to obtain a volume setting of the music to be played according to the time domain waveform of the music to be played and the time domain waveform of the noise by using a pre-trained neural network
  • An adjustment module is used to adjust the volume of the music to be played using the volume setting.
  • In a third aspect, a device for adjusting the volume of music is provided, which includes a memory, a processor, and a computer program stored on the memory and running on the processor. When the processor executes the computer program, the steps of the method described in the foregoing first aspect or any implementation manner are implemented.
  • In a fourth aspect, a computer storage medium is provided, on which a computer program is stored. When the computer program is executed by a processor, the steps of the method described in the foregoing first aspect or any implementation manner are implemented.
  • The embodiment of the present invention uses a pre-trained neural network including a music style neural network, a noise category identification neural network, and a volume adjustment neural network. Because this network takes into account factors that affect the user's current volume preference, such as the category of the environmental noise and the style of the music, it can automatically adjust the volume of the music the user is about to play, which greatly simplifies the user's operation and improves the user experience. In addition, the model can be adjusted again according to a specific user's volume preference, and a volume adjustment model dedicated to that user can be obtained through online learning; this dedicated model can then be used to automatically set the volume of the music that the specific user wants to play.
  • FIG. 1 is a schematic flowchart of obtaining a music style vector of music training data according to an embodiment of the present invention
  • FIG. 2 is a schematic diagram of a labeling matrix in an embodiment of the present invention.
  • FIG. 3 is a schematic flowchart of a method for adjusting volume of music in an embodiment of the present invention
  • FIG. 4 is another schematic flowchart of a method for adjusting volume of music in an embodiment of the present invention.
  • FIG. 5 is a schematic flowchart of a user readjusting the volume on the basis of the automatically obtained volume setting according to an embodiment of the present invention
  • FIG. 6 is a schematic flowchart of obtaining a volume adjustment model dedicated to a specific user through online learning based on a baseline model in an embodiment of the present invention
  • FIG. 7 is a schematic flowchart of obtaining a volume adjustment model dedicated to a specific user in an embodiment of the present invention.
  • FIG. 8 is a schematic block diagram of a device for adjusting volume of music in an embodiment of the present invention.
  • FIG. 9 is another schematic block diagram of a device for adjusting volume of music in an embodiment of the present invention.
  • Deep learning is a machine learning method that uses deep neural networks to learn features of data with complex models, and intelligently organizes low-level features of data to form more advanced abstract forms. Because deep learning has strong feature extraction and modeling capabilities for complex data that is difficult to abstract and model manually, deep learning is an effective implementation method for tasks such as adaptive adjustment of sound quality that are difficult to model manually.
  • An embodiment of the present invention provides a pre-trained neural network, which includes a musical style neural network, a noise category identification neural network, and a volume adjustment neural network. Each will be explained below.
  • a musical style neural network is constructed based on deep learning.
  • the musical style neural network is trained based on the music training data set.
  • the music training data set includes a large amount of music training data, and a single music training data is described in detail below.
  • the music training data is music data, including the characteristics of the music training data, which can be used as the input of the neural network; it also includes the music style vector of the music training data, which can be used as the output of the neural network.
  • the original music waveform is a time-domain waveform
  • the time-domain waveform may be divided into frames, and feature extraction is performed on each frame to obtain the characteristics of the music training data.
  • For example, the short-time Fourier transform (STFT) can be used for feature extraction, and the extracted features can be Mel-frequency cepstral coefficients (MFCC).
  • the features obtained by feature extraction here and thereafter may be expressed as a feature tensor, for example, as an N-dimensional feature vector; or, the extracted features may also be expressed in other forms. It is not limited here.
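  • As an illustration only (this sketch is not part of the original disclosure), the following Python snippet shows one common way to divide a time-domain waveform into frames and extract a simple STFT magnitude feature per frame. The frame length, hop size, and Hann window are assumptions; MFCC features could likewise be computed with an audio library.

```python
import numpy as np

def frame_signal(waveform, frame_len=1024, hop=512):
    """Split a 1-D time-domain waveform into overlapping frames (assumed sizes)."""
    n_frames = 1 + (len(waveform) - frame_len) // hop
    return np.stack([waveform[i * hop: i * hop + frame_len] for i in range(n_frames)])

def stft_features(waveform, frame_len=1024, hop=512):
    """Per-frame magnitude spectrum: a simple STFT-based feature vector per frame."""
    frames = frame_signal(waveform, frame_len, hop)
    window = np.hanning(frame_len)                        # Hann window is an assumption
    return np.abs(np.fft.rfft(frames * window, axis=1))   # shape: (n_frames, frame_len//2 + 1)

# usage: features for a 3-second, 16 kHz mono waveform (random stand-in for decoded audio)
features = stft_features(np.random.randn(3 * 16000))      # shape: (92, 513)
```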
  • the music style vector of the music training data can be obtained by referring to the method shown in FIG. 1, and the process includes:
  • the style annotation information of different users may be the same or different.
  • some users may label it as “Folk Music”
  • some users may label it as “Popular”
  • some users may label it as “Folk Music” and “Beisheng”, and so on.
  • the number of different style annotations can be obtained. As an example, referring to FIG. 2, for "My Motherland”, the number of annotations of "Folk Music” is 12, the number of annotations of "Popular” is 3, and the number of annotations of "Beisheng” is 10.
  • a labeling matrix may be generated based on labeling information of a plurality of music training data.
  • the rows of the labeling matrix may represent labeling information of a certain music training data, for example, each row represents a "style label” of the corresponding music training data.
  • the columns of the labeling matrix represent styles. Referring to FIG. 2, a labeling matrix can be generated from the labeling information of "My Motherland", "Qilixiang", "Coral Sea", and "Ten Send Red Army".
  • FIG. 2 is only schematic: although only 4 pieces of music training data and 4 styles are shown therein, the present invention is not limited thereto, and the annotation matrix may be obtained based on a larger amount of music training data and a larger number of styles.
  • a music style vector can be extracted from the annotation matrix.
  • a vector corresponding to a row of music training data in the labeling matrix may be used as its music style vector.
  • For example, for "My Motherland", the music style vector is [12, 3, 0, 10].
  • a vector corresponding to a row of music training data in the labeling matrix may be normalized as its music style vector.
  • For "My Motherland", the normalized music style vector is [12/25, 3/25, 0, 10/25].
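  • As a worked illustration of the annotation matrix (not part of the original disclosure), the sketch below builds the matrix row by row and normalizes the "My Motherland" row. Only that row's counts come from the example above; the other rows and the fourth style name are hypothetical placeholders.

```python
import numpy as np

# rows = songs, columns = style annotation counts
# styles: ["Folk Music", "Popular", "OtherStyle (hypothetical)", "Beisheng"]
annotation_matrix = np.array([
    [12, 3, 0, 10],   # "My Motherland" (counts from the example above)
    [ 0, 9, 5,  0],   # "Qilixiang"          (hypothetical counts)
    [ 1, 7, 6,  0],   # "Coral Sea"          (hypothetical counts)
    [11, 2, 0,  8],   # "Ten Send Red Army"  (hypothetical counts)
], dtype=float)

raw_style_vector = annotation_matrix[0]                               # [12, 3, 0, 10]
normalized_style_vector = raw_style_vector / raw_style_vector.sum()  # [12/25, 3/25, 0, 10/25]
```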
  • Alternatively, a music style vector can be extracted from the annotation matrix using algorithms that include, but are not limited to, matrix decomposition, factorization machines, or word vectorization algorithms. The music style vector obtained in this way has a smaller dimension; that is, a denser music style vector can be obtained.
  • the vectors of each row in the labeling matrix are sparse vectors.
  • some of the values are positive integers, and the rest are 0.
  • the labeling matrix is also a sparse matrix.
  • the labeling matrix may be decomposed into a first matrix multiplied by a second matrix.
  • the rows of the first matrix represent music style vectors corresponding to the music training data, which can be regarded as compression of style labels in the form of sparse vectors.
  • the music style vector of "My Motherland" is [1.2, 3.7, 3.1]
  • the music style vector of "Ten Send Red Army" is [1.8, 4.0, 4.1].
  • the cosine similarity between the two vectors is high, so it can be determined that "My Motherland" and "Ten Send Red Army" are similar music.
  • the second matrix represents the weight of each item of the first matrix (the specific values of the elements of the second matrix are not shown in FIG. 2). Specifically, each column of the second matrix corresponds to a music style, and the values in that column represent the weights relating that style to the elements of the first matrix.
  • FIG. 2 is only schematic. Although it shows that the dimension of the number of columns of the labeling matrix is 4 and the dimension of the obtained music style vector is 3, the present invention is not limited thereto. For example, in practical applications, the dimensions of matrices and vectors can be larger.
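  • The decomposition described above can be realized, for example, with non-negative matrix factorization; the following sketch (an assumption, not the patent's prescribed algorithm) factors an annotation matrix into a dense first matrix and a weight (second) matrix, then compares two songs by cosine similarity. The 3-dimensional size and the resulting numbers are illustrative only.

```python
import numpy as np
from sklearn.decomposition import NMF

# (songs x styles) count matrix from the previous sketch
annotation_matrix = np.array([[12, 3, 0, 10], [0, 9, 5, 0], [1, 7, 6, 0], [11, 2, 0, 8]], dtype=float)

model = NMF(n_components=3, init="random", random_state=0, max_iter=500)
first_matrix = model.fit_transform(annotation_matrix)   # rows: dense music style vectors
second_matrix = model.components_                        # weights of each style per component

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# a high similarity suggests two songs share a style, as in the
# "My Motherland" / "Ten Send Red Army" example above
similarity = cosine_similarity(first_matrix[0], first_matrix[3])
```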
  • the music style vector of each music training data can be obtained. Taking the features as input and the music style vector as output, the music style neural network is trained until convergence, and then a trained music style neural network can be obtained.
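  • A minimal sketch of such training, assuming fixed-size per-frame features and a mean-squared-error fit of the style vector (the layer sizes, feature dimension, and optimizer settings are assumptions, not values from the patent):

```python
import torch
import torch.nn as nn

feature_dim, style_dim = 513, 3          # assumed sizes, matching the earlier sketches

music_style_net = nn.Sequential(          # features in, music style vector out
    nn.Linear(feature_dim, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, style_dim),
)
optimizer = torch.optim.Adam(music_style_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(features, style_vectors):
    """One gradient step; features: (batch, feature_dim), style_vectors: (batch, style_dim)."""
    optimizer.zero_grad()
    loss = loss_fn(music_style_net(features), style_vectors)
    loss.backward()
    optimizer.step()
    return loss.item()

# usage with random stand-in data; in practice this loop runs until convergence
train_step(torch.randn(32, feature_dim), torch.rand(32, style_dim))
```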
  • a noise class recognition neural network is also constructed based on deep learning.
  • the noise class recognition neural network is trained based on the noise training data set.
  • the noise training data set includes a large amount of noise training data, and a single noise training data is described in detail below.
  • the noise training data is noise data, including the characteristics of the noise training data, which can be used as the input of the neural network; it also includes the noise category of the noise training data, which can be used as the output of the neural network.
  • the original noise waveform is a time-domain waveform
  • the time-domain waveform may be divided into frames, and feature extraction is performed on each frame to obtain the characteristics of the noise training data.
  • Feature extraction may be performed through the short-time Fourier transform (STFT), and the extracted features may be Mel-frequency cepstral coefficients (MFCC).
  • each noise training data may be labeled with a noise category to which it belongs.
  • Noise categories may include, but are not limited to, airports, pedestrian streets, buses, shopping malls, restaurants, and the like.
  • the method of marking is not limited in the present invention. For example, "000" may be used to indicate an airport, "001" a pedestrian street, and "010" a bus, etc.; other marking methods may also be used, which are not listed here one by one.
  • one noise training data may be marked by one user or multiple users, and the noise categories marked by different users may be the same or different.
  • The category marked most often can be determined as the noise category to which that noise training data belongs. For example, suppose noise training data A is labeled as "000" by m1 users, as "001" by m2 users, and as "010" by m3 users. If m1 > m2 and m1 > m3, it can be determined that the noise category to which noise training data A belongs is "000".
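  • A small helper illustrating this majority-vote rule (illustrative only; the category codes follow the "000"/"001"/"010" example above):

```python
from collections import Counter

def resolve_noise_label(user_labels):
    """Return the category chosen by the most users (the majority-vote rule above)."""
    return Counter(user_labels).most_common(1)[0][0]

# m1 = 5 users marked "000" (airport), m2 = 2 marked "001", m3 = 1 marked "010"
resolve_noise_label(["000"] * 5 + ["001"] * 2 + ["010"])   # -> "000"
```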
  • Taking the characteristics of the noise training data as input and the noise category as output, the noise category recognition neural network is trained until convergence, and the trained noise category recognition neural network is obtained.
  • a volume adjustment neural network is also constructed based on deep learning.
  • the volume adjustment neural network is obtained by training according to a training data set.
  • the training data set includes a large amount of training data, and the training data set may be a user behavior set, such as collecting data of multiple users listening to music in various environments.
  • the single training data is explained in detail below.
  • When a user listens to music in an environment and sets the volume, the corresponding data can be acquired as one piece of training data.
  • the time domain waveform of the music can be obtained according to the music being played by the user
  • the time domain waveform of the ambient noise can be obtained through the pickup device of the playback terminal used by the user
  • the user's volume setting can be obtained.
  • the acquiring the time-domain waveform of the music may include: acquiring the time-domain waveform of the music from a client used by the user. Alternatively, it may include: acquiring music information of the music from a client used by the user, and acquiring the time-domain waveform of the music from a music database on the server according to the music information, so that the transmission amount can be reduced.
  • the music information may include at least one of a song title, a singer, an album, and the like. It can be understood that the music information described in the embodiment of the present invention is only exemplary, and it may include other information, such as duration, format, etc., which are not listed here one by one.
  • The pickup device may be, for example, a headset microphone or a mobile phone microphone, which is not limited here.
  • the volume may be expressed as a percentage, or the volume may also be expressed in other manners, which is not limited in the present invention.
  • the characteristics of the music included in the training data can be obtained based on the time-domain waveform of the music included in the training data.
  • the time-domain waveform of the music can be divided into frames, and feature extraction is performed on each frame to obtain the characteristics of the music.
  • the characteristics of the music are input to the aforementioned music style neural network, and a style vector of the music can be obtained.
  • If the style vectors obtained for different frames differ, the style vectors of these frames may be averaged, and the averaged style vector may be used as the style vector of the music.
  • the "average” used herein is a result value obtained by averaging a plurality of style vector items (or values).
  • the "average” can also obtain the result value through other calculation methods, such as a weighted average, in which the weights of different items can be equal or different, and the embodiment of the present invention does not limit the average method.
  • Similarly, the characteristics of the noise can be obtained based on the time-domain waveform of the noise included in the training data. Specifically, the time-domain waveform of the noise can be divided into frames, and feature extraction is performed on each frame to obtain the characteristics of the noise. Then, the characteristics of the noise are input to the aforementioned noise category identification neural network to obtain the category of the noise. For example, if the categories obtained for different frames differ, the categories of these frames can be counted, and the category with the largest count is used as the category of the noise.
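  • The two aggregation rules just described (averaging per-frame style vectors, and taking the most frequent per-frame noise category) could look like the following sketch; uniform frame weights and the stand-in values are assumptions:

```python
import numpy as np
from collections import Counter

def aggregate_style(frame_style_vectors, weights=None):
    """Average the per-frame style vectors (optionally weighted) into a single vector."""
    return np.average(np.asarray(frame_style_vectors), axis=0, weights=weights)

def aggregate_noise_category(frame_categories):
    """Pick the noise category that occurs most often across frames."""
    return Counter(frame_categories).most_common(1)[0][0]

# usage with stand-in per-frame outputs
style_vector = aggregate_style([[0.5, 0.1, 0.4], [0.6, 0.1, 0.3], [0.4, 0.2, 0.4]])
noise_category = aggregate_noise_category(["bus", "bus", "mall", "bus"])   # -> "bus"
```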
  • Music energy characteristics can be obtained based on the time-domain waveform of the music included in the training data.
  • the embodiment of the present invention does not limit the manner of calculating the energy characteristics of music.
  • the energy characteristics of music can be calculated according to the amplitude of each point of the time-domain waveform of the music.
  • the music energy feature may include the average amplitude of music.
  • the absolute value of the amplitude of each point in the time-domain waveform of the music may be calculated, and then divided by the total number of points to obtain the average music amplitude. That is, the arithmetic mean of the absolute amplitudes of all points in the time-domain waveform of the music can be used as the music energy feature.
  • the geometric mean or weighted mean of the amplitudes of all points in the time domain waveform of the music may be used as the music energy feature.
  • the amplitudes of all points in the time-domain waveform of the music may be taken as natural logarithms and then arithmetically averaged as the music energy feature.
  • the energy characteristics of music can also be obtained by other calculation methods, which is not limited in the present invention.
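  • For concreteness, a sketch of the average-amplitude feature and two of the alternatives mentioned above (the small epsilon added before the logarithm is an assumption to avoid log(0)):

```python
import numpy as np

def average_amplitude(waveform):
    """Arithmetic mean of |amplitude| over all points: the average amplitude above."""
    return float(np.mean(np.abs(waveform)))

def log_mean_amplitude(waveform, eps=1e-12):
    """Natural log of |amplitude|, then arithmetically averaged (alternative feature)."""
    return float(np.mean(np.log(np.abs(waveform) + eps)))   # eps is an assumption

def weighted_mean_amplitude(waveform, weights):
    """Weighted mean of |amplitude| (another alternative mentioned above)."""
    return float(np.average(np.abs(waveform), weights=weights))

music_energy = average_amplitude(np.random.randn(16000))     # stand-in waveform
```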
  • the noise energy characteristics can be obtained based on the time-domain waveform of the noise included in the training data.
  • the embodiment of the present invention does not limit the manner of calculating the noise energy characteristics.
  • the noise energy characteristics may be calculated according to the amplitude of each point of the time domain waveform of the noise.
  • the noise energy characteristic may include the average amplitude of the noise.
  • the absolute value of the amplitude of each point in the time domain waveform of the noise may be calculated, and then divided by the total number of points to obtain the average amplitude of the noise. That is, the arithmetic mean of the absolute amplitudes of all points in the time domain waveform of the noise can be used as the noise energy feature.
  • a geometric mean or a weighted mean of the amplitudes of all points in the time domain waveform of the noise may be used as the noise energy feature.
  • the amplitudes of all points in the time domain waveform of the noise may be taken as natural logarithms and then arithmetically averaged as the noise energy characteristic.
  • the noise energy characteristics can also be obtained by other calculation methods, which is not limited in the present invention.
  • In this way, for each training data, the style vector of the music, the category of the noise, the music energy feature, and the noise energy feature can be obtained, together with the user's volume setting.
  • Taking the style vector of the music, the category of the noise, the music energy feature, and the noise energy feature as input and the user's volume setting as output, the volume adjustment neural network is trained until convergence, and the trained volume adjustment neural network is obtained.
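  • A minimal sketch of this training step, assuming the noise category is one-hot encoded and the volume setting is expressed as a 0-1 fraction of the maximum (the layer sizes, category count, and encoding are assumptions):

```python
import torch
import torch.nn as nn

style_dim, n_noise_categories = 3, 8               # assumed sizes
input_dim = style_dim + n_noise_categories + 2      # + music energy + noise energy

volume_net = nn.Sequential(
    nn.Linear(input_dim, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1), nn.Sigmoid(),                 # volume as a 0-1 fraction (e.g. 0.65 = 65%)
)
optimizer = torch.optim.Adam(volume_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def make_input(style_vec, noise_category, music_energy, noise_energy):
    """Concatenate the four inputs described above into one feature vector."""
    noise_one_hot = torch.zeros(n_noise_categories)
    noise_one_hot[noise_category] = 1.0
    return torch.cat([style_vec, noise_one_hot, torch.tensor([music_energy, noise_energy])])

def train_step(inputs, user_volumes):
    optimizer.zero_grad()
    loss = loss_fn(volume_net(inputs), user_volumes)
    loss.backward()
    optimizer.step()
    return loss.item()

# usage with a single stand-in sample whose recorded user volume was 65%
x = make_input(torch.rand(style_dim), noise_category=2, music_energy=0.12, noise_energy=0.05)
train_step(x.unsqueeze(0), torch.tensor([[0.65]]))
```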
  • An embodiment of the present invention provides a method for adjusting volume of music. As shown in FIG. 3, a flowchart of the method includes:
  • the pre-trained neural network may include a musical style neural network, a noise category identification neural network, and a volume adjustment neural network.
  • Specifically, the music style neural network, the noise category identification neural network, and the volume adjustment neural network may be used to obtain the volume setting of the music to be played.
  • The music style neural network, the noise category identification neural network, and the volume adjustment neural network can be the aforementioned trained networks, respectively. It is understandable that the aforementioned training process is generally performed on the server side (i.e., the cloud).
  • the method shown in FIG. 3 may be executed by a server (that is, the cloud), or may be executed by a client.
  • the client can directly obtain the time domain waveform of the music to be played. If the music to be played is online music, the client can obtain the time domain waveform of the music to be played from the server. In addition, the time-domain waveform of the noise in the environment can be obtained by the pickup device of the client. Before S220, the client can obtain the pre-trained music style neural network, noise category identification neural network, and volume adjustment neural network from the server.
  • the server receives the music to be played from the client to obtain the time domain waveform of the music to be played.
  • the server receives the music information of the music to be played from the client.
  • the music information here may include at least one of the song name, singer, album, and so on. The server acquires the music to be played from the music database on the server side according to the music information, thereby obtaining the time-domain waveform of the music to be played.
  • the server can also receive the time-domain waveform of the ambient noise collected by the client's pickup device from the client.
  • S220 may include:
  • S2201: The time-domain waveform of the music to be played can be divided into frames, and feature extraction is performed on each frame to obtain the characteristics of the music to be played. Then, the characteristics of the music to be played can be input to the music style neural network to obtain the style vector of the music to be played.
  • the method for feature extraction may include, but is not limited to, STFT, MFCC, and the like.
  • the extracted features may be amplitude spectrum, log spectrum, energy spectrum, etc., which is not limited in the present invention.
  • S2202: The time-domain waveform of the noise can be divided into frames, and feature extraction is performed on each frame to obtain the characteristics of the noise. The characteristics of the noise can then be input to the noise category identification neural network to obtain the category of the noise.
  • the method for feature extraction may include, but is not limited to, STFT, MFCC, and the like.
  • the extracted features may be amplitude spectrum, log spectrum, energy spectrum, etc., which is not limited in the present invention.
  • S2203: Obtain an energy characteristic of the music to be played according to the time-domain waveform of the music to be played.
  • the energy characteristics of the music may include the average amplitude of the music.
  • the absolute value of the amplitude of each point of the time-domain waveform of the music to be played can be calculated, and then divided by the total number of points to obtain the average amplitude of the music to be played.
  • a geometric average or a weighted average of the amplitudes of all points of the time-domain waveform of the music to be played may be used as the energy feature of the music to be played.
  • the amplitudes of all points of the time-domain waveform of the music to be played may be taken as natural logarithms and then arithmetically averaged as the energy characteristic of the music to be played.
  • S2204: Obtain an energy characteristic of the noise according to the time-domain waveform of the noise. The energy characteristics of the noise may include the average amplitude of the noise.
  • the absolute value of the amplitude of each point in the time domain waveform of the noise can be calculated, and then divided by the total number of points to obtain the average amplitude of the noise.
  • a geometric average or a weighted average of the amplitudes of all points of the time domain waveform of the noise may be used as the energy characteristic of the noise.
  • the amplitudes of all points in the time domain waveform of the noise may be taken as natural logarithms and then arithmetically averaged as the energy characteristic of the noise.
  • the embodiment of the present invention does not limit the execution order of S2201 to S2204.
  • the four steps S2201-S2204 can be executed in parallel.
  • S2201 and S2202 can be executed sequentially or in parallel, and then S2203 and S2204 can be executed sequentially or in parallel.
  • S2204 and S2203 can be executed sequentially or in parallel, and then S2201 and S2202 can be executed sequentially or in parallel.
  • S2201 and S2203 may be executed sequentially or in parallel, and then S2202 and S2204 may be executed sequentially or in parallel.
  • S2201-S2204 can be executed in any order, and no longer listed here.
  • S2205: Input the style vector of the music to be played, the category of the noise, the energy characteristics of the music to be played, and the energy characteristics of the noise to the volume adjustment neural network to obtain the volume setting of the music to be played.
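  • Pulling the earlier sketches together, an illustrative (non-normative) end-to-end pipeline mirroring S2201-S2205 might look as follows; it reuses the helper functions and networks defined in the sketches above, and the stand-in noise classifier defined here is an assumption:

```python
import torch

# stand-in noise classifier (assumed): per-frame features -> one logit per noise category
noise_net = torch.nn.Sequential(torch.nn.Linear(513, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8))

def set_volume_for_playback(music_waveform, noise_waveform):
    """Illustrative pipeline: waveforms in, volume setting (0-1 fraction) out."""
    # S2201: frame the music, extract features, get per-frame style vectors, average them
    music_feats = torch.as_tensor(stft_features(music_waveform), dtype=torch.float32)
    with torch.no_grad():
        frame_styles = music_style_net(music_feats)
    style_vec = torch.as_tensor(aggregate_style(frame_styles.numpy()), dtype=torch.float32)

    # S2202: frame the noise, extract features, take the majority noise category
    noise_feats = torch.as_tensor(stft_features(noise_waveform), dtype=torch.float32)
    with torch.no_grad():
        frame_cats = noise_net(noise_feats).argmax(dim=1)
    noise_category = aggregate_noise_category(frame_cats.tolist())

    # S2203 / S2204: energy features of the music and of the noise
    music_energy = average_amplitude(music_waveform)
    noise_energy = average_amplitude(noise_waveform)

    # S2205: feed everything to the volume adjustment neural network
    x = make_input(style_vec, noise_category, music_energy, noise_energy).unsqueeze(0)
    with torch.no_grad():
        return float(volume_net(x))
```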
  • In this way, the embodiment of the present invention adopts a pre-trained neural network including a music style neural network, a noise category identification neural network, and a volume adjustment neural network, which considers factors such as the environmental noise category and the music style that influence the user's current volume preference. It can therefore automatically adjust the volume of the user's music to be played, which greatly simplifies the user's operation and improves the user experience.
  • the trained volume adjustment neural network may be referred to as a volume adjustment baseline neural network or may be referred to as a volume adjustment baseline model.
  • the user's preferences can be considered, and the volume adjustment neural network for specific users can be obtained through online learning.
  • the volume adjustment neural network in S2205 may be a volume adjustment baseline model, and in S230, the volume setting determined by S2205 may be used to adjust the volume of the music to be played. And, after S230, the adjusted volume can be used to play the music to be played.
  • If the volume setting obtained in S230 is satisfactory to the user, it can be used to play the music to be played, and the above-mentioned volume adjustment baseline model is also a dedicated volume adjustment model suitable for that user.
  • Alternatively, the volume obtained in S230 may not be satisfactory to the user; in that case, after S230, the user may adjust the volume again on this basis to obtain the desired volume.
  • This process can be shown in Figure 5.
  • a volume adjustment model dedicated to a specific user can be obtained through online learning based on a user's readjustment based on a pre-trained neural network.
  • the process may include:
  • a pre-trained neural network is used as a baseline model.
  • the corresponding volume setting can be obtained using the baseline model.
  • the baseline model can be updated online using the specific user's readjustment instructions (that is, the user's feedback on the volume setting) until the user gives little or no feedback, and the model finally obtained in S320 can be determined to be the volume adjustment model dedicated to that specific user.
  • the model is a volume adjustment model dedicated to a specific user.
  • the dedicated model can be used to automatically set the volume for the music played by a specific user without manual adjustment by the user, thereby improving the user experience.
  • The number of readjustments performed by the specific user being less than a preset value may mean that the frequency of readjustment by the specific user is less than a preset frequency.
  • For example, the preset frequency may be equal to N0/N; that is, among N pieces of music played, the number of pieces that the specific user has readjusted is less than N0.
  • a volume adjustment model dedicated to a specific user can be obtained through online learning based on the volume adjustment baseline model and according to readjustment by a specific user. After that, the volume adjustment model dedicated to a specific user can be used to automatically set the volume of the music to be played that the specific user wants to play, reducing user operations and improving the user experience.
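  • A minimal sketch of this online-learning loop, assuming the user's readjusted volumes are collected as (input, volume) samples and that the dedicated model is a fine-tuned copy of the baseline; the learning rate, epoch count, and stopping threshold are assumptions:

```python
import copy
import torch

def online_update(baseline_net, user_samples, lr=1e-4, epochs=5):
    """Fine-tune a copy of the baseline on the specific user's readjusted volumes."""
    user_net = copy.deepcopy(baseline_net)            # start from the baseline parameters
    optimizer = torch.optim.Adam(user_net.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for inputs, adjusted_volume in user_samples:  # inputs: (1, input_dim); volume: (1, 1)
            optimizer.zero_grad()
            loss = loss_fn(user_net(inputs), adjusted_volume)
            loss.backward()
            optimizer.step()
    return user_net

def keep_learning(n_readjusted, n_played, preset_frequency=0.1):
    """Continue online learning while the readjustment frequency (N0/N) is not yet below the preset value."""
    return n_played > 0 and (n_readjusted / n_played) >= preset_frequency
```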
  • FIG. 8 is a schematic block diagram of a device for adjusting volume of music according to an embodiment of the present invention.
  • the device 30 shown in FIG. 8 includes an acquisition module 310, a determination module 320, and an adjustment module 330.
  • the obtaining module 310 is configured to obtain a time-domain waveform of music to be played and a time-domain waveform of noise of a playback environment.
  • the determining module 320 is configured to obtain a volume setting of the music to be played according to a time domain waveform of the music to be played and a time domain waveform of the noise by using a pre-trained neural network.
  • the adjusting module 330 is configured to use the volume setting to adjust the volume of the music to be played.
  • the device 30 shown in FIG. 8 may be a server side (that is, the cloud).
  • the device 30 may further include a training module for obtaining the pre-trained neural network through training based on the training data set.
  • the device 30 may include a training module for obtaining a volume adjustment neural network dedicated to the specific user through online learning.
  • Specifically, the pre-trained neural network may be used as a baseline model, and the following steps are repeated until the number of readjustment instructions from the specific user is less than a preset value: for the music being played, use the baseline model to obtain the corresponding volume setting; obtain the specific user's readjustment instruction for that volume setting; and, if the number of readjustment instructions from the specific user reaches the preset value, use the volumes adjusted by the specific user as training samples, learn on the basis of the baseline model to obtain an updated model, and replace the baseline model with the updated model. The updated model finally obtained is the volume adjustment neural network dedicated to the specific user.
  • the pre-trained neural network includes: a musical style neural network, a noise category identification neural network, and a volume adjustment neural network.
  • The determining module 320 may be specifically configured to use the music style neural network, the noise category identification neural network, and the volume adjustment neural network to obtain the volume setting of the music to be played according to the time-domain waveform of the music to be played and the time-domain waveform of the noise.
  • the determination module 320 may include a style vector determination unit, a noise category determination unit, a music energy feature determination unit, a noise energy feature determination unit, and a volume determination unit.
  • a style vector determining unit is configured to obtain a style vector of the music to be played according to a time-domain waveform of the music to be played by using the music style neural network.
  • The noise category determination unit is configured to use the noise category identification neural network to obtain the category of the noise according to the time-domain waveform of the noise.
  • the music energy characteristic determining unit is configured to obtain an energy characteristic of the music to be played according to a time-domain waveform of the music to be played.
  • the noise energy characteristic determining unit is configured to obtain an energy characteristic of the noise according to a time-domain waveform of the noise.
  • The volume determining unit is configured to input the style vector of the music to be played, the category of the noise, the energy characteristics of the music to be played, and the energy characteristics of the noise to the volume adjustment neural network to obtain the volume setting of the music to be played.
  • The style vector determining unit is specifically configured to: divide the time-domain waveform of the music to be played into frames, and extract features from each frame to obtain the characteristics of the music to be played; and input the characteristics of the music to be played to the music style neural network to obtain the style vector of the music to be played.
  • The noise category determination unit is specifically configured to: divide the time-domain waveform of the noise into frames, and extract features from each frame to obtain the characteristics of the noise; and input the characteristics of the noise to the noise category identification neural network to obtain the category of the noise.
  • the energy characteristic of the music to be played includes the average amplitude of the music to be played.
  • The music energy characteristic determining unit is specifically configured to calculate the absolute value of the amplitude of each point of the time-domain waveform of the music to be played, and divide by the total number of points to obtain the energy characteristics of the music to be played.
  • the energy characteristic of the noise includes an average amplitude of the noise
  • The noise energy characteristic determining unit is specifically configured to calculate the absolute value of the amplitude of each point in the time-domain waveform of the noise, and divide by the total number of points to obtain the energy characteristics of the noise.
  • the device 30 further includes a training module, configured to obtain the music style neural network through training based on the music training data set.
  • each music training data in the music training data set has a music style vector.
  • The training module obtains the music style vector of the music training data in the following way: acquiring style annotation information of a plurality of music training data from a large number of users, generating an annotation matrix based on the style annotation information, and determining the music style vector of each music training data according to the annotation matrix.
  • the labeling matrix is decomposed into a product of a first matrix and a second matrix; and each row vector of the first matrix is determined as a music style vector of corresponding music training data.
  • the device 30 further includes a training module, configured to obtain the noise category identification neural network through training based on the noise training data set.
  • the time-domain waveform of the noise acquired by the acquiring module 310 is acquired by a pickup device of the client.
  • the device 30 further includes a playback module for playing the music to be played after the volume is adjusted.
  • the device 30 shown in FIG. 8 can be used to implement the foregoing method for adjusting volume of music. To avoid repetition, details are not described herein again.
  • an embodiment of the present invention further provides another device for adjusting volume of music, which includes a memory, a processor, and a computer program stored on the memory and running on the processor. The steps of the method shown previously are carried out when the program is executed.
  • the processor may obtain the time domain waveform of the music to be played and the time domain waveform of the noise of the playback environment; according to the time domain waveform of the music to be played and the time domain waveform of the noise, use a pre-trained neural network To obtain the volume setting of the music to be played; use the volume setting to adjust the volume of the music to be played.
  • the pre-trained neural network includes a music style neural network, a noise category identification neural network, and a volume adjustment neural network.
  • the processor can also learn online to obtain a volume-adjusting neural network dedicated to a specific user.
  • The device for adjusting the volume of music in the embodiment of the present invention may include one or more processors, one or more memories, input devices, and output devices, and these components are interconnected through a bus system and/or another form of connection mechanism. It should be noted that the device may also have other components and structures as required.
  • the processor may be a central processing unit (CPU) or other form of processing unit having data processing capabilities and / or instruction execution capabilities, and may control other components in the device to perform desired functions.
  • the memory may include one or more computer program products, and the computer program product may include various forms of computer-readable storage media, such as volatile memory and / or non-volatile memory.
  • the volatile memory may include, for example, a random access memory (RAM) and / or a cache memory.
  • the non-volatile memory may include, for example, a read-only memory (ROM), a hard disk, a flash memory, and the like.
  • One or more computer program instructions may be stored on the computer-readable storage medium, and the processor may run the program instructions to implement the client functions (implemented by the processor) in the embodiments of the present invention described below, and/or other desired functions.
  • Various application programs and various data can also be stored in the computer-readable storage medium.
  • the input device may be a device used by a user to input instructions, and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like.
  • the output device may output various information (for example, images or sounds) to the outside (for example, a user), and may include one or more of a display, a speaker, and the like.
  • the input device / output device may be an external device and communicate with the processor through a wired or wireless manner.
  • an embodiment of the present invention also provides a computer storage medium on which a computer program is stored.
  • When the computer program is executed by a processor, the steps of the aforementioned method for adjusting volume can be implemented.
  • the computer storage medium is a computer-readable storage medium.
  • In summary, the embodiment of the present invention uses a pre-trained neural network including a music style neural network, a noise category identification neural network, and a volume adjustment neural network. Because this network takes into account factors that affect the user's current volume preference, such as the category of the environmental noise and the style of the music, it can automatically adjust the volume of the music the user is about to play, which greatly simplifies the user's operation and improves the user experience. In addition, the model can be adjusted again according to a specific user's volume preference, and a volume adjustment model dedicated to that user can be obtained through online learning, so that this dedicated model can be used to automatically set the volume of the music that the specific user wants to play.
  • the disclosed systems, devices, and methods may be implemented in other ways.
  • the device embodiments described above are only schematic.
  • the division of the unit is only a logical function division.
  • For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, which may be electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objective of the solution of this embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist separately physically, or two or more units may be integrated into one unit.
  • If the functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product.
  • The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method described in various embodiments of the present invention.
  • The foregoing storage media include: USB flash drives, mobile hard disks, read-only memories (ROM), random access memories (RAM), magnetic disks, optical discs, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

A method and device for adjusting a volume of music. The method comprises: obtaining a time domain waveform of music to be played and a time domain waveform of noise in a playback environment (S210); using, according to the time domain waveform of the music and the time domain waveform of the noise, a pre-trained neural network to obtain volume settings for the music (S220); and using the volume settings to adjust a volume of the music (S230). The invention employs a pre-trained neural network comprising a music style neural network, a noise class recognition neural network and a volume adjustment neural network, and takes into consideration factors, such as an ambient noise class and a music style, which influence a current user volume preference, such that a volume of music to be played by a user can be automatically adjusted, thereby maximally simplifying an operation procedure for the user, and improving user experience.

Description

Method and device for adjusting the volume of music
This application claims priority to the Chinese invention patent application filed on June 5, 2018 with application number 201810583114.1 and entitled "Method and Device for Adjusting Volume of Music".
Technical Field
Embodiments of the present invention relate to the field of sound, and more particularly, to a method and device for adjusting the volume of music.
Background
Sound quality is a subjective evaluation of audio quality. Generally, sound quality is divided into dozens of indicators; volume (also called loudness) is one of the important ones. The volume affects how well people receive the information in the music. The volume setting is generally related to the ambient sound; for example, the volume of music in a noisy environment is generally higher than in a quiet environment.
At present, volume is mainly adjusted by the user manually, which adds operational complexity and affects the user experience. In addition, some existing automatic volume adjustment technologies generally consider only environmental noise parameters, so their automatic adjustment ability is limited. In fact, an individual user's volume preference is related to many factors, such as the type of music: people may set different volumes when listening to different styles of music, and different types of environmental noise affect the volume setting differently. Other factors include personal preferences, personal hearing, audio playback device parameters, and so on. A volume model must take all of these factors into account to achieve better performance.
Summary of the Invention
Embodiments of the present invention provide a method and device for automatically adjusting the volume of music, which can adjust the volume of music based on deep learning and simplify user operations, thereby improving the user experience.
In a first aspect, a method for adjusting the volume of music is provided, including:
obtaining a time-domain waveform of music to be played and a time-domain waveform of noise of the playback environment;
obtaining a volume setting of the music to be played according to the time-domain waveform of the music to be played and the time-domain waveform of the noise by using a pre-trained neural network; and
using the volume setting to adjust the volume of the music to be played.
In an implementation manner of the present invention, the method further includes:
using the pre-trained neural network as a baseline model;
repeating the following steps until the number of readjustment instructions from a specific user is less than a preset value:
for the music being played, using the baseline model to obtain a corresponding volume setting;
obtaining the specific user's readjustment instruction for the corresponding volume setting; and
if the number of readjustment instructions from the specific user reaches the preset value, using the volume adjusted by the specific user as training samples, learning on the basis of the parameters of the baseline model to obtain an updated model, and replacing the baseline model with the updated model.
In an implementation manner of the present invention, the pre-trained neural network includes a music style neural network, a noise category identification neural network, and a volume adjustment neural network.
In an implementation manner of the present invention, the process of obtaining the volume setting of the music to be played includes:
obtaining a style vector of the music to be played according to the time-domain waveform of the music to be played by using the music style neural network;
obtaining a category of the noise according to the time-domain waveform of the noise by using the noise category identification neural network;
obtaining an energy characteristic of the music to be played according to the time-domain waveform of the music to be played;
obtaining an energy characteristic of the noise according to the time-domain waveform of the noise; and
inputting the style vector of the music to be played, the category of the noise, the energy characteristic of the music to be played, and the energy characteristic of the noise to the volume adjustment neural network to obtain the volume setting of the music to be played.
In an implementation manner of the present invention, the process of obtaining the style vector of the music to be played includes:
dividing the time-domain waveform of the music to be played into frames, and performing feature extraction on each frame to obtain the characteristics of the music to be played; and
inputting the characteristics of the music to be played to the music style neural network to obtain the style vector of the music to be played.
In an implementation manner of the present invention, the process of obtaining the category of the noise includes:
dividing the time-domain waveform of the noise into frames, and performing feature extraction on each frame to obtain the characteristics of the noise; and
inputting the characteristics of the noise to the noise category identification neural network to obtain the category of the noise.
In an implementation manner of the present invention, the energy characteristic of the music to be played includes the average amplitude of the music to be played, and the process of obtaining the energy characteristic of the music to be played includes:
calculating the absolute value of the amplitude of each point of the time-domain waveform of the music to be played, and dividing by the total number of points to obtain the average amplitude of the music to be played.
In an implementation manner of the present invention, the energy characteristic of the noise includes the average amplitude of the noise, and the process of obtaining the energy characteristic of the noise includes:
calculating the absolute value of the amplitude of each point of the time-domain waveform of the noise, and dividing by the total number of points to obtain the average amplitude of the noise.
在本发明的一种实现方式中,在使用音乐风格神经网络之前,还包括:In an implementation manner of the present invention, before using a musical style neural network, the method further includes:
基于音乐训练数据集,通过训练得到所述音乐风格神经网络。Based on the music training data set, the music style neural network is obtained through training.
在本发明的一种实现方式中,所述音乐训练数据集中的每个音乐训练数据具有音乐风格向量,所述音乐训练数据的音乐风格向量通过以下方式得到:In an implementation manner of the present invention, each music training data in the music training data set has a music style vector, and the music style vector of the music training data is obtained in the following manner:
获取大量用户对多个音乐训练数据的风格标注信息,并基于所述风格标注信息生成标注矩阵;Acquiring style annotation information of a large number of users on multiple music training data, and generating a annotation matrix based on the style annotation information;
根据所述标注矩阵确定各个音乐训练数据的音乐风格向量。A music style vector of each music training data is determined according to the annotation matrix.
在本发明的一种实现方式中,所述根据所述标注矩阵确定各个音乐训练数据的音乐风格向量,包括:In an implementation manner of the present invention, the determining a music style vector of each music training data according to the annotation matrix includes:
将所述标注矩阵分解为第一矩阵与第二矩阵的乘积;Decomposing the labeling matrix into a product of a first matrix and a second matrix;
将所述第一矩阵的各个行向量确定为对应的音乐训练数据的音乐风格向量。Each row vector of the first matrix is determined as a music style vector of the corresponding music training data.
在本发明的一种实现方式中,在使用噪声类别辨识神经网络之前,还包括:In an implementation manner of the present invention, before using the noise category identification neural network, the method further includes:
基于噪声训练数据集,通过训练得到所述噪声类别辨识神经网络。Based on the noise training data set, the noise class identification neural network is obtained through training.
在本发明的一种实现方式中,所述噪声的时域波形是由用户音频播放设备的拾音设备采集的。In an implementation manner of the present invention, the time-domain waveform of the noise is collected by a pickup device of a user audio playback device.
在本发明的一种实现方式中,还包括:In an implementation manner of the present invention, the method further includes:
将音量调节后的待播放音乐进行播放。Play the music to be played after the volume is adjusted.
第二方面,提供了一种对音乐进行音量调节的设备,所述设备用于实现前述第一方面或任一实现方式所述方法的步骤,所述设备包括:In a second aspect, a device for volume adjustment of music is provided, the device is configured to implement the steps of the method described in the first aspect or any implementation manner, and the device includes:
获取模块,用于获取待播放音乐的时域波形以及播放环境的噪声的时域波形;An acquisition module for acquiring a time-domain waveform of music to be played and a time-domain waveform of noise of a playback environment;
确定模块,用于根据所述待播放音乐的时域波形以及所述噪声的时域波形,使用预先训练好的神经网络,得到所述待播放音乐的音量设置;A determining module, configured to obtain a volume setting of the music to be played according to the time domain waveform of the music to be played and the time domain waveform of the noise by using a pre-trained neural network;
调节模块,用于使用所述音量设置调节所述待播放音乐的音量。An adjustment module is used to adjust the volume of the music to be played using the volume setting.
According to a third aspect, a device for adjusting the volume of music is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the steps of the method described in the foregoing first aspect or any of its implementation manners.
第四方面,提供了一种计算机存储介质,其上存储有计算机程序,所述计算机程序被处理器执行时实现前述第一方面或任一实现方式所述方法的步骤。According to a fourth aspect, a computer storage medium is provided, on which a computer program is stored. When the computer program is executed by a processor, the steps of the method according to the foregoing first aspect or any implementation manner are implemented.
It can be seen that the embodiments of the present invention use a pre-trained neural network comprising a music style neural network, a noise category identification neural network and a volume adjustment neural network, which takes into account factors that influence the user's current volume preference, such as the noise category of the environment and the style of the music, and can therefore adjust the volume of the user's music to be played automatically. This greatly simplifies the user's operations and improves the user experience. Furthermore, the volume can be readjusted according to a specific user's volume preference, and a volume adjustment model dedicated to that user can be obtained through online learning. The dedicated model can then be used to automatically set the volume of the music that the specific user wants to play.
附图说明BRIEF DESCRIPTION OF THE DRAWINGS
In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and a person of ordinary skill in the art may obtain other drawings from these drawings without creative effort.
图1是本发明实施例的得到音乐训练数据的音乐风格向量的示意性流程图;FIG. 1 is a schematic flowchart of obtaining a music style vector of music training data according to an embodiment of the present invention; FIG.
图2是本发明实施例中标注矩阵的示意图;2 is a schematic diagram of a labeling matrix in an embodiment of the present invention;
图3是本发明实施例中对音乐进行音量调节的方法的示意性流程图;3 is a schematic flowchart of a method for adjusting volume of music in an embodiment of the present invention;
图4是本发明实施例中对音乐进行音量调节的方法的另一示意性流程 图;4 is another schematic flowchart of a method for adjusting volume of music in an embodiment of the present invention;
图5是本发明实施例中对用户在音量设置基础上再次调节的示意性流程图;5 is a schematic flowchart of readjusting a user based on a volume setting according to an embodiment of the present invention;
图6是本发明实施例中基于基线模型通过在线学习得到专用于特定用户的音量调节模型的示意性流程图;6 is a schematic flowchart of obtaining a volume adjustment model dedicated to a specific user through online learning based on a baseline model in an embodiment of the present invention;
图7是本发明实施例中得到专用于特定用户的音量调节模型的示意性流程图;7 is a schematic flowchart of obtaining a volume adjustment model dedicated to a specific user in an embodiment of the present invention;
图8是本发明实施例中对音乐进行音量调节的设备的示意性框图;8 is a schematic block diagram of a device for adjusting volume of music in an embodiment of the present invention;
图9是本发明实施例中对音乐进行音量调节的设备的另一示意性框图。FIG. 9 is another schematic block diagram of a device for adjusting volume of music in an embodiment of the present invention.
DETAILED DESCRIPTION OF THE EMBODIMENTS
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
深度学习(Deep Learning)是一种机器学习方法,其应用深层神经网络对具有复杂模型的数据进行特征学习,并将数据低层次特征进行智能组织,形成更高级抽象形式。由于深度学习对人工难以抽象并建模的复杂数据具有较强的特征提取和建模能力,对音质自适应调整这类较难进行人工建模的任务,深度学习是一种有效的实现方法。Deep learning is a machine learning method that uses deep neural networks to learn features of data with complex models, and intelligently organizes low-level features of data to form more advanced abstract forms. Because deep learning has strong feature extraction and modeling capabilities for complex data that is difficult to abstract and model manually, deep learning is an effective implementation method for tasks such as adaptive adjustment of sound quality that are difficult to model manually.
本发明实施例提供了一种预先训练好的神经网络,其包括音乐风格神经网络、噪声类别辨识神经网络以及音量调节神经网络。下面将分别进行阐述。An embodiment of the present invention provides a pre-trained neural network, which includes a musical style neural network, a noise category identification neural network, and a volume adjustment neural network. Each will be explained below.
本发明实施例中基于深度学习构建了一种音乐风格神经网络。该音乐风格神经网络是根据音乐训练数据集进行训练得到的。其中,音乐训练数据集中包括大量的音乐训练数据,下面对单个音乐训练数据进行详细阐述。In the embodiment of the present invention, a musical style neural network is constructed based on deep learning. The musical style neural network is trained based on the music training data set. Among them, the music training data set includes a large amount of music training data, and a single music training data is described in detail below.
音乐训练数据是音乐数据,包括该音乐训练数据的特征,其可以作为神经网络的输入;还包括该音乐训练数据的音乐风格向量,其可以作为神经网络的输出。The music training data is music data, including the characteristics of the music training data, which can be used as the input of the neural network; it also includes the music style vector of the music training data, which can be used as the output of the neural network.
Exemplarily, for a piece of music training data, the original music waveform is a time-domain waveform. The time-domain waveform may be divided into frames, and feature extraction may be performed on each frame to obtain the features of the music training data. Optionally, as an example, feature extraction may be performed by means of the Short-Time Fourier Transform (STFT), and the extracted features may be Mel Frequency Cepstrum Coefficients (MFCC). It should be understood that the feature extraction manner described herein is only illustrative, and other features such as an amplitude spectrum, a logarithmic spectrum or an energy spectrum may also be obtained; they are not listed one by one here. Optionally, in the embodiments of the present invention, the features obtained by feature extraction here and below may be expressed in the form of a feature tensor, for example an N-dimensional feature vector, or may be expressed in other forms, which is not limited here.
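Purely as an illustrative sketch (not part of the claimed subject matter), the framing and feature-extraction step described above could be implemented as follows; the use of the librosa library, the sample rate and the frame parameters are assumptions made for illustration only.

```python
# Illustrative sketch only: frame a time-domain waveform and extract per-frame MFCCs.
# librosa, the sample rate and the frame parameters are assumptions for illustration.
import librosa
import numpy as np

def extract_features(waveform: np.ndarray, sr: int = 44100) -> np.ndarray:
    # librosa performs the framing internally via its STFT; n_fft and hop_length
    # control the frame length and the frame shift.
    mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=20,
                                n_fft=2048, hop_length=512)
    return mfcc.T  # shape (num_frames, 20): one feature vector per frame
```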
示例性地,可以参照如图1所示的方法得到音乐训练数据的音乐风格向量,该过程包括:Exemplarily, the music style vector of the music training data can be obtained by referring to the method shown in FIG. 1, and the process includes:
S101,获取用户对多个音乐训练数据的风格标注信息,并基于风格标注信息生成标注矩阵。S101. Acquire style annotation information of a plurality of music training data by a user, and generate a annotation matrix based on the style annotation information.
For a given piece of music training data, the style annotation information from different users may be the same or different. For example, for the song "My Motherland", some users may label it as "folk music", some users may label it as "pop", some users may label it as both "folk music" and "bel canto", and so on. By collecting the style annotation information of multiple users, the number of annotations for each style can be obtained. As an example, referring to FIG. 2, for "My Motherland", the number of "folk music" annotations is 12, the number of "pop" annotations is 3, and the number of "bel canto" annotations is 10.
Further, an annotation matrix may be generated based on the annotation information of multiple pieces of music training data. Each row of the annotation matrix may represent the annotation information of one piece of music training data, that is, each row represents the "style label" of the corresponding music training data, and each column of the annotation matrix represents one style. Referring to FIG. 2, the annotation matrix generated from the annotation information of "My Motherland", "Qilixiang", "Coral Sea" and "Ten Send Red Army" can be expressed as:
[Annotation matrix of FIG. 2: a 4x4 matrix of annotation counts, with one row per song and one column per style; the row for "My Motherland" is [12, 3, 0, 10].]
It should be understood that FIG. 2 is only schematic. Although it shows only 4 pieces of music training data and 4 styles, the present invention is not limited thereto, and the annotation matrix may be generated based on a larger number of pieces of music training data and a larger number of styles.
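For illustration only, the following sketch shows one way the annotation count matrix could be assembled from per-user style labels; the style names (the fourth style, shown as "rock", is a pure placeholder) and the song list follow the FIG. 2 example.

```python
# Illustrative sketch only: assemble the annotation count matrix from per-user labels.
import numpy as np

STYLES = ["folk", "pop", "rock", "bel canto"]   # "rock" is a placeholder for the unnamed style
SONGS = ["My Motherland", "Qilixiang", "Coral Sea", "Ten Send Red Army"]

def build_annotation_matrix(user_labels):
    """user_labels: iterable of (song, style) pairs, one pair per user annotation."""
    matrix = np.zeros((len(SONGS), len(STYLES)))
    for song, style in user_labels:
        matrix[SONGS.index(song), STYLES.index(style)] += 1
    return matrix
```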
S102,根据标注矩阵确定各个音乐训练数据的音乐风格向量。S102. Determine a music style vector of each music training data according to the annotation matrix.
Specifically, the music style vectors may be extracted from the annotation matrix. As one example, the vector of the row corresponding to a piece of music training data in the annotation matrix may be used directly as its music style vector; for "My Motherland", the music style vector is then [12, 3, 0, 10]. As another example, the row vector corresponding to a piece of music training data may first be normalized and then used as its music style vector; for "My Motherland", the music style vector is then [12/25, 3/25, 0, 10/25]. It can be understood that the music style vectors obtained in these two examples have a large dimension and are sparse vectors. As yet another example, the sparsity of the annotation matrix may be taken into account and the music style vectors may be extracted from it; the extraction algorithms include, but are not limited to, matrix factorization, factorization machines or word-vectorization algorithms. The music style vectors obtained in this example have a smaller dimension, that is, denser music style vectors can be obtained.
图2中以矩阵分解为例阐述该提取的过程。标注矩阵中每一行的向量均为稀疏的向量。例如针对某特定的音乐训练数据的风格标签,其中的某些值是正整数,而其余的均为0,很少会出现风格标签中所有项都为正整数的情况,也就是说,某特定的音乐训练数据一般只对应一种或几种风格。因此该标注矩阵也是稀疏矩阵,可以通过对该稀疏矩阵进行提取使得每个音乐训练数据的音乐风格向量的维度小于标注矩阵的列数,并且能够更好地反映不同音乐训练数据之间的相关度。In Figure 2, matrix extraction is used as an example to illustrate the extraction process. The vectors of each row in the labeling matrix are sparse vectors. For example, for a certain style label of music training data, some of the values are positive integers, and the rest are 0. It is rare that all items in the style labels are positive integers, that is, a specific Music training data generally corresponds to only one or several styles. Therefore, the labeling matrix is also a sparse matrix. By extracting the sparse matrix, the dimension of the music style vector of each music training data is smaller than the number of columns of the labeling matrix, and it can better reflect the correlation between different music training data. .
Referring to FIG. 2, the annotation matrix may be decomposed into a first matrix multiplied by a second matrix. Each row of the first matrix represents the music style vector of the corresponding music training data, which can be regarded as a compression of the sparse style label. As shown by the first matrix in FIG. 2, the music style vector of "My Motherland" is [1.2, 3.7, 3.1] and the music style vector of "Ten Send Red Army" is [1.8, 4.0, 4.1]; since the cosine similarity between these two vectors is high, it can be determined that "My Motherland" and "Ten Send Red Army" are similar pieces of music.
第二矩阵是表示第一矩阵各项的权重(图2中未示出第二矩阵的各个元素的具体值)。具体地,第二矩阵的每一列对于一个音乐风格,一列中的数值表征该音乐风格类对第一矩阵中各个元素的权重。The second matrix is a weight representing each item of the first matrix (specific values of each element of the second matrix are not shown in FIG. 2). Specifically, each column of the second matrix is for a music style, and the values in one column represent the weight of the music style class to each element in the first matrix.
It can be understood that the annotation matrix can be restored by multiplying the first matrix by the second matrix, and that the annotation matrix displays the various annotated styles more intuitively. It can also be understood that FIG. 2 is only schematic: although it shows an annotation matrix with 4 columns and music style vectors of dimension 3, the present invention is not limited thereto. For example, in practical applications, the dimensions of the matrices and vectors can be larger.
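As a non-limiting sketch of the matrix-decomposition route, non-negative matrix factorization (one possible choice among the decomposition algorithms mentioned above) could be applied as follows; the component count k = 3 mirrors the FIG. 2 example, and scikit-learn is an assumed tool choice.

```python
# Illustrative sketch only: extract dense style vectors by factorizing the annotation matrix.
import numpy as np
from sklearn.decomposition import NMF

def style_vectors_from_annotations(annotation_matrix: np.ndarray, k: int = 3):
    model = NMF(n_components=k, init="nndsvda", max_iter=500)
    first_matrix = model.fit_transform(annotation_matrix)  # rows = music style vectors
    second_matrix = model.components_                       # per-style weights
    # first_matrix @ second_matrix approximately restores the annotation matrix
    return first_matrix, second_matrix

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # used to judge whether two pieces of music (e.g. the two songs above) are similar
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```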
如此,针对每一个音乐训练数据,均可以通过特征提取得到其特征。通 过图1和图2所示的过程,可以得到每个音乐训练数据的音乐风格向量。将特征作为输入,并将音乐风格向量作为输出,对音乐风格神经网络进行训练直到收敛,便可以得到训练好的音乐风格神经网络。In this way, for each piece of music training data, its features can be obtained through feature extraction. Through the process shown in Fig. 1 and Fig. 2, the music style vector of each music training data can be obtained. Taking the features as input and the music style vector as output, the music style neural network is trained until convergence, and then a trained music style neural network can be obtained.
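A minimal sketch of such a music style network is given below, assuming a Keras implementation; the layer sizes and the regression formulation (mean-squared error against the style vector) are illustrative assumptions, not requirements of the present disclosure.

```python
# Illustrative sketch only: a small music style network mapping per-frame features
# to a style vector. Keras, the layer sizes and the MSE loss are assumptions.
import tensorflow as tf

def build_style_network(feature_dim: int = 20, style_dim: int = 3) -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(feature_dim,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(style_dim),  # regression output: the music style vector
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# Training until convergence (features as input, style vectors as targets):
# build_style_network().fit(frame_features, frame_style_targets, epochs=..., batch_size=...)
```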
本发明实施例中还基于深度学习构建了一种噪声类别辨识神经网络。该噪声类别辨识神经网络是根据噪声训练数据集进行训练得到的。其中,噪声训练数据集中包括大量的噪声训练数据,下面对单个噪声训练数据进行详细阐述。In the embodiment of the present invention, a noise class recognition neural network is also constructed based on deep learning. The noise class recognition neural network is trained based on the noise training data set. Among them, the noise training data set includes a large amount of noise training data, and a single noise training data is described in detail below.
噪声训练数据是噪声数据,包括该噪声训练数据的特征,其可以作为神经网络的输入;还包括该噪声训练数据的噪声类别,其可以作为神经网络的输出。The noise training data is noise data, including the characteristics of the noise training data, which can be used as the input of the neural network; it also includes the noise category of the noise training data, which can be used as the output of the neural network.
Exemplarily, for a piece of noise training data, the original noise waveform is a time-domain waveform. The time-domain waveform may be divided into frames, and feature extraction may be performed on each frame to obtain the features of the noise training data. Optionally, as an example, feature extraction may be performed by means of the Short-Time Fourier Transform (STFT), and the extracted features may be Mel Frequency Cepstrum Coefficients (MFCC). It should be understood that the feature extraction manner described herein is only illustrative, and other features such as an amplitude spectrum, a logarithmic spectrum or an energy spectrum may also be obtained; they are not listed one by one here.
示例性地,可以为每个噪声训练数据标记其所属的噪声类别。噪声类别可以包括但不限于机场、步行街、公交车、商场、餐厅等。本发明对标记的方式不做限定,例如,可以用“000”表示机场,“001”表示步行街,“010”表示公交车等;也可以采用其他方式进行标记,这里不再一一罗列。For example, each noise training data may be labeled with a noise category to which it belongs. Noise categories may include, but are not limited to, airports, pedestrian streets, buses, shopping malls, restaurants, and the like. The method of marking is not limited in the present invention. For example, "000" may be used to indicate an airport, "001" to indicate a pedestrian street, and "010" to indicate a bus, etc .; other methods may also be used for marking, which are not listed here one by one.
For ease of understanding, an example is given here to illustrate one implementation of the labeling. Specifically, one piece of noise training data may be labeled by one user or by multiple users, and the noise categories labeled by different users may be the same or different. After the labels of multiple users for one piece of noise training data are obtained, the category labeled most often can be determined as the noise category to which that piece of noise training data belongs. For example, suppose the noise training data A is labeled "000" by m1 users, "001" by m2 users and "010" by m3 users; if m1 > m2 and m1 > m3, it can be determined that the noise category to which the noise training data A belongs is "000".
如此,针对每一个噪声训练数据,均可以通过特征提取得到其特征,并标记出其所属的噪声类别。将特征作为输入,并将噪声类别作为输出,对噪声类别辨识神经网络进行训练直到收敛,便可以得到训练好的噪声类别辨识 神经网络。In this way, for each noise training data, its features can be obtained through feature extraction, and the noise category to which it belongs is marked. Taking the features as input and the noise category as output, the noise category recognition neural network is trained until convergence, and the trained noise category recognition neural network can be obtained.
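Analogously, a minimal sketch of the noise category identification network might look as follows, again assuming Keras; the number of categories and the layer sizes are illustrative assumptions.

```python
# Illustrative sketch only: a small noise category classifier over per-frame features.
import tensorflow as tf

def build_noise_classifier(feature_dim: int = 20, num_classes: int = 5) -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(feature_dim,)),
        tf.keras.layers.Dense(num_classes, activation="softmax"),  # one unit per category
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```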
本发明实施例中还基于深度学习构建了一种音量调节神经网络。该音量调节神经网络是根据训练数据集进行训练得到的。其中,训练数据集中包括大量的训练数据,该训练数据集可以是用户行为集,如可以通过采集多个用户在各种环境下听音乐的数据等。In the embodiment of the present invention, a volume adjustment neural network is also constructed based on deep learning. The volume adjustment neural network is obtained by training according to a training data set. The training data set includes a large amount of training data, and the training data set may be a user behavior set, such as collecting data of multiple users listening to music in various environments.
下面对单个训练数据进行详细阐述。示例性地,某用户在某环境下听某音乐时,可以获取该数据作为训练数据。具体的,可以根据用户正在播放的音乐获取该音乐的时域波形,可以通过用户所使用的播放终端的拾音设备获取所处的环境的噪声的时域波形,并且可以获取用户的音量设置等。The single training data is explained in detail below. Exemplarily, when a user listens to certain music in a certain environment, the data can be acquired as training data. Specifically, the time domain waveform of the music can be obtained according to the music being played by the user, the time domain waveform of the ambient noise can be obtained through the pickup device of the playback terminal used by the user, and the user's volume setting can be obtained. .
其中,获取音乐的时域波形可以包括:从用户使用的客户端获取该音乐的时域波形。或者,可以包括:从用户使用的客户端获取该音乐的音乐信息,并根据该音乐信息从服务器端的音乐数据库中获取该音乐的时域波形,如此能够减少传输量。其中,音乐信息可以包括歌名、歌手、专辑等中的至少一项。可理解,本发明实施例中所述的音乐信息仅仅是示例性的,其可以包括其他信息,诸如时长、格式等,这里不再一一罗列。The acquiring the time-domain waveform of the music may include: acquiring the time-domain waveform of the music from a client used by the user. Alternatively, it may include: acquiring music information of the music from a client used by the user, and acquiring the time-domain waveform of the music from a music database on the server according to the music information, so that the transmission amount can be reduced. The music information may include at least one of a song title, a singer, an album, and the like. It can be understood that the music information described in the embodiment of the present invention is only exemplary, and it may include other information, such as duration, format, etc., which are not listed here one by one.
其中,拾音设备诸如耳机麦克、手机麦克等,这里不作限定。其中,可以获取用户对音量的调节指令或者获取在稳定播放该音乐时用户所设置的稳定音量。可选地,该音量可以用百分比表示,或者,音量也可以用其他方式表示,本发明对此不限定。Among them, pickup devices such as a headset microphone and a mobile phone microphone are not limited here. Among them, it is possible to obtain a volume adjustment instruction of the user or obtain a stable volume set by the user when the music is stably played. Optionally, the volume may be expressed as a percentage, or the volume may also be expressed in other manners, which is not limited in the present invention.
可以基于训练数据所包括的音乐的时域波形得到该音乐的特征。具体的,可以对该音乐的时域波形进行分帧,并对分帧后的每帧进行特征提取从而得到该音乐的特征。随后,将该音乐的特征输入至前述的音乐风格神经网络,便可以得到该音乐的风格向量。示例性地,如果不同的帧所得到的音乐的风格向量不同,可以通过对这些帧得到的风格向量进行平均,将平均后的风格向量作为该音乐的风格向量。应注意,这里所使用的“平均”是将多个风格向量项(或值)进行均值计算得到结果值。例如,可以为算术平均。然而,可理解,“平均”也可以通过其他计算方式得到结果值,如加权平均,其中不同项的权重可以相等或不等,本发明实施例对平均的方式不作限定。The characteristics of the music included in the training data can be obtained based on the time-domain waveform of the music included in the training data. Specifically, the time-domain waveform of the music can be framed, and feature extraction is performed on each frame after the framed frame to obtain the characteristics of the music. Then, the characteristics of the music are input to the aforementioned music style neural network, and a style vector of the music can be obtained. Exemplarily, if the style vectors of the music obtained in different frames are different, the style vectors obtained in these frames may be averaged, and the averaged style vector may be used as the style vector of the music. It should be noted that the "average" used herein is a result value obtained by averaging a plurality of style vector items (or values). For example, it can be arithmetic mean. However, it can be understood that the "average" can also obtain the result value through other calculation methods, such as a weighted average, in which the weights of different items can be equal or different, and the embodiment of the present invention does not limit the average method.
可以基于训练数据所包括的噪声的时域波形得到该噪声的特征。具体的,可以对该噪声的时域波形进行分帧,并对分帧后的每帧进行特征提取从而得到该噪声的特征。随后,将该噪声的特征输入至前述的噪声类别辨识神 经网络,便可以得到该噪声的类别。示例性地,如果不同的帧所得到的噪声的类别不同,可以通过对这些帧得到的类别进行分类统计,将数量最多的一个类别作为该噪声的类别。The characteristics of the noise can be obtained based on the time-domain waveform of the noise included in the training data. Specifically, the time-domain waveform of the noise can be framed, and feature extraction is performed on each frame after the framed frame to obtain the characteristics of the noise. Then, the characteristics of the noise are input to the aforementioned noise category identification neural network, and the category of the noise can be obtained. Exemplarily, if the types of noise obtained from different frames are different, the categories obtained from these frames may be classified and counted, and the category with the largest number is used as the category of the noise.
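The per-frame aggregation described in the two preceding paragraphs (averaging the frame-wise style vectors, and taking a majority vote over the frame-wise noise categories) could be sketched as follows; the helper names are hypothetical.

```python
# Illustrative sketch only: aggregate per-frame outputs into a single result per clip.
import numpy as np

def aggregate_style(frame_style_vectors: np.ndarray) -> np.ndarray:
    # arithmetic mean over frames; a weighted mean would equally fit the description
    return frame_style_vectors.mean(axis=0)

def aggregate_noise_class(frame_classes: np.ndarray) -> int:
    # majority vote over the per-frame category predictions
    values, counts = np.unique(frame_classes, return_counts=True)
    return int(values[np.argmax(counts)])
```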
可以基于训练数据所包括的音乐的时域波形得到音乐能量特征。本发明实施例对计算音乐能量特征的方式不作限定,例如可以根据音乐的时域波形的各点的幅度来计算音乐能量特征。作为一例,该音乐能量特征可以包括音乐平均幅度,具体地可以计算该音乐的时域波形的每一点的幅度的绝对值,然后再除以总点数得到音乐平均幅度。也就是说,可以将该音乐的时域波形的所有点的幅度的算术平均作为音乐能量特征。作为另一例,也可以将该音乐的时域波形的所有点的幅度的几何平均或加权平均作为音乐能量特征。作为再一例,也可以将该音乐的时域波形的所有点的幅度取自然对数后再进行算术平均作为该音乐能量特征。当然,也可以通过其他的计算方法得到音乐能量特征,本发明对此不限定。Music energy characteristics can be obtained based on the time-domain waveform of the music included in the training data. The embodiment of the present invention does not limit the manner of calculating the energy characteristics of music. For example, the energy characteristics of music can be calculated according to the amplitude of each point of the time-domain waveform of the music. As an example, the music energy feature may include the average amplitude of music. Specifically, the absolute value of the amplitude of each point in the time-domain waveform of the music may be calculated, and then divided by the total number of points to obtain the average music amplitude. That is, the arithmetic mean of the amplitudes of all points in the time domain waveform of the music can be used as the music energy feature. As another example, the geometric mean or weighted mean of the amplitudes of all points in the time domain waveform of the music may be used as the music energy feature. As yet another example, the amplitudes of all points in the time-domain waveform of the music may be taken as natural logarithms and then arithmetically averaged as the music energy feature. Of course, the energy characteristics of music can also be obtained by other calculation methods, which is not limited in the present invention.
可以基于训练数据所包括的噪声的时域波形得到噪声能量特征。本发明实施例对计算噪声能量特征的方式不作限定,例如可以根据噪声的时域波形的各点的幅度来计算噪声能量特征。作为一例,该噪声能量特征可以包括噪声平均幅度,具体地可以计算该噪声的时域波形的每一点的幅度的绝对值,然后再除以从点数得到噪声平均幅度。也就是说,可以将该噪声的时域波形的所有点的幅度的算术平均作为噪声能量特征。作为另一例,也可以将该噪声的时域波形的所有点的幅度的几何平均或加权平均作为噪声能量特征。作为再一例,也可以将该噪声的时域波形的所有点的幅度取自然对数后再进行算术平均作为该噪声能量特征。当然,也可以通过其他的计算方法得到噪声能量特征,本发明对此不限定。The noise energy characteristics can be obtained based on the time-domain waveform of the noise included in the training data. The embodiment of the present invention does not limit the manner of calculating the noise energy characteristics. For example, the noise energy characteristics may be calculated according to the amplitude of each point of the time domain waveform of the noise. As an example, the noise energy characteristic may include the average amplitude of the noise. Specifically, the absolute value of the amplitude of each point in the time domain waveform of the noise may be calculated, and then divided by the number of points to obtain the average amplitude of the noise. That is, the arithmetic mean of the amplitudes of all points in the time domain waveform of the noise can be used as the noise energy feature. As another example, a geometric mean or a weighted mean of the amplitudes of all points in the time domain waveform of the noise may be used as the noise energy feature. As yet another example, the amplitudes of all points in the time domain waveform of the noise may be taken as natural logarithms and then arithmetically averaged as the noise energy characteristic. Of course, the noise energy characteristics can also be obtained by other calculation methods, which is not limited in the present invention.
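As a sketch of the energy features, the mean absolute amplitude and the natural-logarithm variant mentioned above could be computed as follows; the epsilon guard against log(0) is an added assumption.

```python
# Illustrative sketch only: the energy features described above.
import numpy as np

def average_amplitude(waveform: np.ndarray) -> float:
    # sum of the absolute amplitudes divided by the number of sample points
    return float(np.mean(np.abs(waveform)))

def log_average_amplitude(waveform: np.ndarray, eps: float = 1e-12) -> float:
    # natural-logarithm variant; eps is an added guard against log(0)
    return float(np.mean(np.log(np.abs(waveform) + eps)))
```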
如此,针对每一个训练数据,均可以得到音乐的风格向量、噪声的类别、音乐能量特征、噪声能量特征,并获取用户的音量设置。将音乐的风格向量、噪声的类别、音乐能量特征、噪声能量特征作为输入,将音量设置作为输出,对音量调节神经网络进行训练直到收敛,便可以得到训练好的音量调节神经网络。In this way, for each training data, the style vector of the music, the type of noise, the characteristics of the music energy, and the characteristics of the noise energy can be obtained, and the user's volume setting can be obtained. Taking the style vector of the music, the category of the noise, the characteristics of the music energy, and the characteristics of the noise energy as the input and the volume setting as the output, the volume adjustment neural network is trained until convergence, and the trained volume adjustment neural network can be obtained.
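A minimal sketch of the volume adjustment network, and of how its input vector could be assembled, is given below, again assuming Keras; representing the noise category as a one-hot vector and the volume as a 0 to 1 fraction are illustrative choices not prescribed by the present disclosure.

```python
# Illustrative sketch only: the volume adjustment network and its input vector.
import numpy as np
import tensorflow as tf

def build_volume_network(style_dim: int = 3, num_noise_classes: int = 5) -> tf.keras.Model:
    input_dim = style_dim + num_noise_classes + 2  # + music energy + noise energy
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(input_dim,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # volume setting as a fraction
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

def make_input(style_vec, noise_class, music_energy, noise_energy, num_noise_classes=5):
    one_hot = np.eye(num_noise_classes)[noise_class]
    return np.concatenate([style_vec, one_hot, [music_energy, noise_energy]])
```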
本发明实施例提供了一种对音乐进行音量调节的方法,如图3所示为该方法的流程图,包括:An embodiment of the present invention provides a method for adjusting volume of music. As shown in FIG. 3, a flowchart of the method includes:
S210,获取待播放音乐的时域波形以及播放环境的噪声的时域波形;S210. Obtain the time domain waveform of the music to be played and the time domain waveform of the noise of the playback environment;
S220,根据所述待播放音乐的时域波形以及所述噪声的时域波形,使用预先训练好的神经网络,得到所述待播放音乐的音量设置;S220. Use a pre-trained neural network to obtain a volume setting of the music to be played according to the time domain waveform of the music to be played and the time domain waveform of the noise;
S230,使用所述音量设置调节所述待播放音乐的音量。S230. Use the volume setting to adjust the volume of the music to be played.
The pre-trained neural network may include a music style neural network, a noise category identification neural network and a volume adjustment neural network. Specifically, in S220, the music style neural network, the noise category identification neural network and the volume adjustment neural network may be used to obtain the volume setting of the music to be played according to the time-domain waveform of the music to be played and the time-domain waveform of the noise. These may respectively be the aforementioned trained music style neural network, trained noise category identification neural network and trained volume adjustment neural network. It can be understood that the aforementioned training process is generally performed on the server side (that is, in the cloud).
图3所示的方法可以由服务器端(即云端)执行,或者可以由客户端执行。The method shown in FIG. 3 may be executed by a server (that is, the cloud), or may be executed by a client.
在由客户端执行的实施例中,在S210中,若待播放音乐是客户端本地音乐,则客户端可以直接获取该待播放音乐的时域波形。若待播放音乐是在线音乐,则客户端可以从服务器端获取该待播放音乐的时域波形。另外,还可以由客户端的拾音设备获取所处的环境的噪声的时域波形。在S220之前,客户端可以从服务器端获取预先训练好的音乐风格神经网络、噪声类别辨识神经网络以及音量调节神经网络。In the embodiment performed by the client, in S210, if the music to be played is the client's local music, the client can directly obtain the time domain waveform of the music to be played. If the music to be played is online music, the client can obtain the time domain waveform of the music to be played from the server. In addition, the time-domain waveform of the noise in the environment can be obtained by the pickup device of the client. Before S220, the client can obtain the pre-trained music style neural network, noise category identification neural network, and volume adjustment neural network from the server.
In the embodiment executed by the server side, in S210, if the music to be played is local music on the client, the server side (that is, the cloud) receives the music to be played from the client, thereby obtaining the time-domain waveform of the music to be played. If the music to be played is music stored on the server side, for example in a music database on the server side, the server side (that is, the cloud) receives from the client the music information of the music to be played, where the music information may include at least one of the song title, the singer, the album and so on; the music to be played is then obtained from the music database on the server side according to the music information, thereby obtaining its time-domain waveform. In addition, the server side may also receive from the client the time-domain waveform of the environmental noise collected by the pickup device of the client.
示例性地,如图4所示,S220可以包括:Exemplarily, as shown in FIG. 4, S220 may include:
S2201,根据所述待播放音乐的时域波形,使用音乐风格神经网络,得到所述待播放音乐的风格向量。S2201. Use a music style neural network to obtain a style vector of the music to be played according to the time-domain waveform of the music to be played.
具体地,可以对待播放音乐的时域波形进行分帧,并对分帧后的每帧进 行特征提取,得到该待播放音乐的特征。随后可以将该待播放音乐的特征输入至音乐风格神经网络,得到该待播放音乐的风格向量。Specifically, the time-domain waveform of the music to be played can be framed, and feature extraction is performed for each frame after the framed frame to obtain the characteristics of the music to be played. Then, the features of the music to be played can be input to a music style neural network to obtain a style vector of the music to be played.
其中,特征提取的方法可以包括但不限于STFT、MFCC等。所提取的特征可以为幅度谱、对数谱、能量谱等,本发明对此不限定。The method for feature extraction may include, but is not limited to, STFT, MFCC, and the like. The extracted features may be amplitude spectrum, log spectrum, energy spectrum, etc., which is not limited in the present invention.
S2202,根据所述噪声的时域波形,使用噪声类别辨识神经网络,得到所述噪声的类别。S2202. Use a noise category identification neural network to obtain the category of the noise according to the time-domain waveform of the noise.
具体地,可以对噪声的时域波形进行分帧,并对分帧后的每帧进行特征提取,得到该噪声的特征。随后可以将该噪声的特征输入至噪声类别辨识神经网络,得到该噪声的类别。Specifically, the time domain waveform of the noise can be framed, and feature extraction is performed on each frame after the framed frame to obtain the characteristics of the noise. The characteristics of the noise can then be input to a noise category identification neural network to obtain the category of the noise.
其中,特征提取的方法可以包括但不限于STFT、MFCC等。所提取的特征可以为幅度谱、对数谱、能量谱等,本发明对此不限定。The method for feature extraction may include, but is not limited to, STFT, MFCC, and the like. The extracted features may be amplitude spectrum, log spectrum, energy spectrum, etc., which is not limited in the present invention.
S2203,根据所述待播放音乐的时域波形得到所述待播放音乐的能量特征。S2203: Obtain an energy characteristic of the music to be played according to a time-domain waveform of the music to be played.
可选地,音乐的能量特征可以包括音乐的平均幅度。可以计算该待播放音乐的时域波形的每一点的幅度的绝对值,然后再除以总点数得到该待播放音乐的平均幅度。Alternatively, the energy characteristics of the music may include the average amplitude of the music. The absolute value of the amplitude of each point of the time-domain waveform of the music to be played can be calculated, and then divided by the total number of points to obtain the average amplitude of the music to be played.
可选地,可以将该待播放音乐的时域波形的所有点的幅度的几何平均或加权平均作为该待播放音乐的能量特征。Optionally, a geometric average or a weighted average of the amplitudes of all points of the time-domain waveform of the music to be played may be used as the energy feature of the music to be played.
可选地,可以将该待播放音乐的时域波形的所有点的幅度取自然对数后再进行算术平均作为该待播放音乐的能量特征。Optionally, the amplitudes of all points of the time-domain waveform of the music to be played may be taken as natural logarithms and then arithmetically averaged as the energy characteristic of the music to be played.
S2204,根据所述噪声的时域波形得到所述噪声的能量特征。S2204. Obtain an energy characteristic of the noise according to a time-domain waveform of the noise.
可选地,噪声的能量特征可以包括噪声的平均幅度。可以计算该噪声的时域波形的每一点的幅度的绝对值,然后再除以总点数得到该噪声的平均幅度。Alternatively, the energy characteristics of the noise may include the average amplitude of the noise. The absolute value of the amplitude of each point in the time domain waveform of the noise can be calculated, and then divided by the total number of points to obtain the average amplitude of the noise.
可选地,可以将该噪声的时域波形的所有点的幅度的几何平均或加权平均作为该噪声的能量特征。Optionally, a geometric average or a weighted average of the amplitudes of all points of the time domain waveform of the noise may be used as the energy characteristic of the noise.
可选地,可以将该噪声的时域波形的所有点的幅度取自然对数后再进行算术平均作为该噪声的能量特征。Alternatively, the amplitudes of all points in the time domain waveform of the noise may be taken as natural logarithms and then arithmetically averaged as the energy characteristic of the noise.
应注意,尽管图4中按照S2201至S2204示出了该过程,然而本发明实施例对S2201至S2204的执行顺序不做限定。例如,S2201-S2204四个步骤可以并行执行。例如,可以先依次执行或并行执行S2201和S2202,然后再 依次执行或并行执行S2203和S2204。例如,可以先依次执行或并行执行S2204和S2203,然后再依次执行或并行执行S2201和S2202。例如,可以先依次执行或并行执行S2201和S2203,然后再依次执行或并行执行S2202和S2204。也就是说,S2201-S2204可以以任意顺序执行,这里不再一一罗列。It should be noted that although the process is shown in FIG. 4 according to S2201 to S2204, the embodiment of the present invention does not limit the execution order of S2201 to S2204. For example, the four steps S2201-S2204 can be executed in parallel. For example, S2201 and S2202 can be executed sequentially or in parallel, and then S2203 and S2204 can be executed sequentially or in parallel. For example, S2204 and S2203 can be executed sequentially or in parallel, and then S2201 and S2202 can be executed sequentially or in parallel. For example, S2201 and S2203 may be executed sequentially or in parallel, and then S2202 and S2204 may be executed sequentially or in parallel. In other words, S2201-S2204 can be executed in any order, and no longer listed here.
S2205,将所述待播放音乐的风格向量、所述噪声的类别、所述待播放音乐的能量特征、所述噪声的能量特征输入至音量调节神经网络,得到所述待播放音乐的音量设置。S2205: input the style vector of the music to be played, the category of the noise, the energy characteristics of the music to be played, and the energy characteristics of the noise to a volume adjustment neural network to obtain a volume setting of the music to be played.
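Tying S2201 to S2205 together, an end-to-end sketch is shown below for illustration only; it reuses the hypothetical helpers and models introduced in the earlier sketches (extract_features, aggregate_style, aggregate_noise_class, average_amplitude, make_input and the three assumed Keras models) and is not a definitive implementation.

```python
# Illustrative sketch only: S2201-S2205 end to end, under the earlier assumptions.
import numpy as np

def volume_setting(music_wave, noise_wave, style_net, noise_net, volume_net, sr=44100):
    music_feats = extract_features(music_wave, sr)   # S2201: frame and featurize the music
    noise_feats = extract_features(noise_wave, sr)   # S2202: frame and featurize the noise
    style_vec = aggregate_style(style_net.predict(music_feats))
    noise_cls = aggregate_noise_class(np.argmax(noise_net.predict(noise_feats), axis=1))
    music_energy = average_amplitude(music_wave)      # S2203
    noise_energy = average_amplitude(noise_wave)      # S2204
    x = make_input(style_vec, noise_cls, music_energy, noise_energy)
    return float(volume_net.predict(x[np.newaxis, :])[0, 0])  # S2205: volume setting
```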
It can be seen that the embodiments of the present invention use a pre-trained neural network comprising a music style neural network, a noise category identification neural network and a volume adjustment neural network, which takes into account multiple factors that influence the user's current volume preference, such as the noise category of the environment and the style of the music, and can therefore adjust the volume of the user's music to be played automatically. This greatly simplifies the user's operations and improves the user experience.
Different users have different volume preferences. For example, some people like the surging feeling of a high volume, while others like to fall asleep to low-volume music before sleeping; an elderly person may need a high volume because of hearing loss, whereas a low volume may be sufficient for a young person. The above training of the volume adjustment neural network does not take the differences between individual users into account, so the trained volume adjustment neural network may be referred to as a volume adjustment baseline neural network, or as a volume adjustment baseline model.
在该音量调节基线模型的基础上,可以考虑用户的使用偏好,通过在线学习得到针对特定用户的音量调节神经网络。Based on the baseline model of volume adjustment, the user's preferences can be considered, and the volume adjustment neural network for specific users can be obtained through online learning.
示例性地,S2205中的音量调节神经网络可以是音量调节基线模型,S230中可以使用S2205所确定的音量设置调节待播放音乐的音量。并且,在S230之后,可以使用该调节后的音量播放待播放音乐。Exemplarily, the volume adjustment neural network in S2205 may be a volume adjustment baseline model, and in S230, the volume setting determined by S2205 may be used to adjust the volume of the music to be played. And, after S230, the adjusted volume can be used to play the music to be played.
可理解,若S230所得到的音量设置使用户感到满意,则可以使用该音量设置播放待播放音乐,并且,上述的音量调节基线模型同时也是适合该用户的专有音量调节模型。然而,考虑到不同用户对音量的不同偏好,S230所得到的音量不一定是用户所满意的,因此,S230之后,用户可能会在此基础上再次进行音量调节,以得到该用户所期望的音量。该过程可以如图5所示。It can be understood that if the volume setting obtained by S230 is satisfactory to the user, the volume setting can be used to play the music to be played, and the above-mentioned volume adjustment baseline model is also a proprietary volume adjustment model suitable for the user. However, considering the different preferences of different users for the volume, the volume obtained by S230 may not be satisfactory to the user. Therefore, after S230, the user may adjust the volume again on this basis to obtain the desired volume of the user. . This process can be shown in Figure 5.
本发明实施例可以在预先训练好的神经网络的基础上,基于用户的再次调节,通过在线学习得到专用于特定用户的音量调节模型。具体地,如图6 所示,该过程可以包括:In the embodiment of the present invention, a volume adjustment model dedicated to a specific user can be obtained through online learning based on a user's readjustment based on a pre-trained neural network. Specifically, as shown in FIG. 6, the process may include:
S310,将预先训练好的神经网络作为基线模型。S310. A pre-trained neural network is used as a baseline model.
S320,重复执行以下步骤,直到特定用户的再次调节指令的次数小于预设值:S320. Repeat the following steps until the number of times the specific user adjusts the instruction again is less than a preset value:
S3201,对在播放音乐,可以使用基线模型得到相应的音量设置。S3201, for the music being played, the corresponding volume setting can be obtained using the baseline model.
S3202,获取特定用户对S3201中的音量设置的再次调节指令。S3202. Obtain a readjustment instruction of the volume setting in S3201 by a specific user.
S3203: if the number of readjustment instructions of the specific user reaches a preset value, the volumes adjusted by the specific user are used as training samples, learning is performed on the basis of the baseline model to obtain an updated model, and the baseline model is replaced with the updated model.
It can be understood that, in S320, the baseline model can be learned online from the specific user's readjustment instructions (that is, the user's feedback on the volume settings) until the user rarely gives, or no longer gives, such feedback; the model finally obtained in S320 can then be determined to be the volume adjustment model dedicated to that specific user. In other words, when the user no longer, or only rarely, readjusts the volume settings determined by the model finally obtained in S320, that model is the volume adjustment model dedicated to the specific user. After that, the dedicated model can be used to automatically set the volume for the music played by the specific user without manual adjustment by the user, thereby improving the user experience.
具体地,假设特定用户播放N个音乐,则可以使用音量调节基线模型得到对应的N个音量设置。如果随后该特定用户对其中的部分音量设置不满意,则会进行再次调节,假设特定用户对其中的N1个音乐的音量进行了再次调节。如果N1大于预设值(假设为N0),则可以使用这N1个音乐作为训练样本,在音量调节基线模型的基础上进行训练,得到训练后的模型,将其称为模型M(T=1)。其中,T可以表示针对特定用户进行在线训练的批次。在此之后,该特定用户播放音乐时,可以使用模型M(T=1)而不再使用音量调节基线模型。具体地,假设特定用户播放N个音乐,则可以使用模型M(T=1)得到对应的N个音量设置,如果随后该特定用户对其中的部分音量设置不满意,则会进行再次调节,假设特定用户对其中的N2个音乐的音量进行了再次调节。如果N2大于预设值(假设为N0),则可以使用这N2个音乐作为训练样本,在模型M(T=1)的基础上进行训练,得到训练后的模型,将其称为模型M(T=2)。在此之后,该特定用户播放音乐时,可以使用模型M(T=2)而不再使用音量调节基线模型和模型M(T=1)……以此类推,直到得到模型M(T=n)。在此之后,该特定用户播放音乐时,可以使用模型M(T=n)。也就 是说,可以使用M(T=n)得到对应的音量设置。如果特定用户对此次得到的音量设置都满意,不再做再次调节,则模型M(T=n)即为针对该特定用户的专用于特定用户的音量调节模型。或者,即使特定用户对其中部分音量设置不满意,但是该特定用户进行再次调节的数量小于预设值,则模型M(T=n)为针对该特定用户的专用于特定用户的音量调节模型。示例性地,该过程可以参见图7所示。Specifically, assuming that a particular user plays N pieces of music, the volume adjustment baseline model can be used to obtain corresponding N volume settings. If the specific user is not satisfied with some of the volume settings later, it will be adjusted again, assuming that the specific user has adjusted the volume of the N1 music again. If N1 is greater than the preset value (assuming N0), you can use this N1 music as a training sample to train on the basis of the volume adjustment baseline model to get the trained model, which is called model M (T = 1 ). Among them, T may represent a batch of online training for a specific user. After that, when the particular user plays music, the model M (T = 1) can be used instead of the baseline model for volume adjustment. Specifically, if a specific user plays N pieces of music, then the model M (T = 1) can be used to obtain the corresponding N volume settings. If the specific user is not satisfied with some of the volume settings, it will be adjusted again, assuming The specific user adjusted the volume of the N2 music again. If N2 is greater than a preset value (assuming N0), you can use these N2 music as training samples to train on the basis of model M (T = 1), get the trained model, and call it model M ( T = 2). After that, when the particular user plays music, the model M (T = 2) can be used instead of the volume adjustment baseline model and model M (T = 1) ... and so on, until the model M (T = n ). After that, the model M (T = n) can be used when the specific user plays music. That is, you can use M (T = n) to get the corresponding volume setting. If a specific user is satisfied with the volume settings obtained this time and does not perform adjustment again, the model M (T = n) is a volume adjustment model dedicated to the specific user for the specific user. Alternatively, even if a specific user is not satisfied with some of the volume settings, but the number of readjustments made by the specific user is less than a preset value, the model M (T = n) is a volume adjustment model dedicated to the specific user for the specific user. Exemplarily, this process can be shown in FIG. 7.
其中,特定用户进行再次调节的数量小于预设值可以是指,特定用户进行再次调节的频率小于预设频率,举例来说,该预设频率可以等于N0/N。例如,使用模型M(T=n)得到N个音乐的音量设置,该特定用户进行再次调节的音乐的数量小于N0。或者,例如,使用模型M(T=n)得到NN个音乐的音量设置,该特定用户进行再次调节的音乐的数量小于NN*N0/N。则说明该特定用户再次调节的频率小于预设频率。Wherein, the number of readjustments performed by a specific user is less than a preset value, which may mean that the frequency of readjustment performed by a specific user is less than a preset frequency. For example, the preset frequency may be equal to N0 / N. For example, using the model M (T = n) to obtain the volume settings of N pieces of music, the number of pieces of music that the specific user has adjusted again is less than N0. Or, for example, using the model M (T = n) to obtain the volume settings of NN pieces of music, the number of pieces of music that the specific user performs adjustment again is less than NN * N0 / N. It means that the frequency that the specific user adjusts again is less than the preset frequency.
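A sketch of this online-learning loop is given below for illustration only; the batch structure, the helper names and the fine-tuning call are assumptions, with n0 playing the role of the preset value N0 and each batch corresponding to N played songs.

```python
# Illustrative sketch only: the per-user online-learning loop described above.
import numpy as np

def personalize(model, play_batches, n0: int):
    """play_batches: iterable of lists of (input_vector, readjusted_volume or None)."""
    for batch in play_batches:
        feedback = [(x, v) for x, v in batch if v is not None]  # songs the user readjusted
        if len(feedback) < n0:
            break  # the user rarely readjusts: the current model is the dedicated model
        xs = np.stack([x for x, _ in feedback])
        ys = np.array([v for _, v in feedback])
        model.fit(xs, ys, epochs=5, verbose=0)  # fine-tune on the user's own settings
    return model  # user-specific volume adjustment model M(T=n)
```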
由此可见,本发明实施例可以在音量调节基线模型的基础上,根据特定用户的再次调节,通过在线学习得到专用于特定用户的音量调节模型。在此之后,可以使用该专用于特定用户的音量调节模型,对特定用户想要播放的待播放音乐自动进行音量设置,减少了用户的操作,提升了用户体验。It can be seen that, in the embodiment of the present invention, a volume adjustment model dedicated to a specific user can be obtained through online learning based on the volume adjustment baseline model and according to readjustment by a specific user. After that, the volume adjustment model dedicated to a specific user can be used to automatically set the volume of the music to be played that the specific user wants to play, reducing user operations and improving the user experience.
图8是本发明实施例的对音乐进行音量调节的设备的一个示意性框图。图8所示的设备30包括获取模块310、确定模块320和调节模块330。FIG. 8 is a schematic block diagram of a device for adjusting volume of music according to an embodiment of the present invention. The device 30 shown in FIG. 8 includes an acquisition module 310, a determination module 320, and an adjustment module 330.
获取模块310用于获取待播放音乐的时域波形以及播放环境的噪声的时域波形。The obtaining module 310 is configured to obtain a time-domain waveform of music to be played and a time-domain waveform of noise of a playback environment.
确定模块320用于根据所述待播放音乐的时域波形以及所述噪声的时域波形,使用预先训练好的神经网络,得到所述待播放音乐的音量设置。The determining module 320 is configured to obtain a volume setting of the music to be played according to a time domain waveform of the music to be played and a time domain waveform of the noise by using a pre-trained neural network.
调节模块330用于使用所述音量设置调节所述待播放音乐的音量。The adjusting module 330 is configured to use the volume setting to adjust the volume of the music to be played.
作为一种实现方式,图8所示的设备30可以为服务器端(即云端)。可选地,该设备30还可以包括训练模块,用于基于训练数据集,通过训练得到所述预先训练好的神经网络。As an implementation manner, the device 30 shown in FIG. 8 may be a server side (that is, the cloud). Optionally, the device 30 may further include a training module for obtaining the pre-trained neural network through training based on the training data set.
作为一种实现方式,设备30可以包括训练模块,用于通过在线学习得到专用于所述特定用户的音量调节神经网络。As an implementation manner, the device 30 may include a training module for obtaining a volume adjustment neural network dedicated to the specific user through online learning.
具体地:可以将所述预先训练好的神经网络作为基线模型。重复执行以下步骤,直到特定用户的再次调节指令的次数小于预设值:对在播放音乐,使用所述基线模型得到相应的音量设置;获取所述特定用户对所述相应的音 量设置的再次调节指令;若所述特定用户的再次调节指令的次数达到预设值,则将所述特定用户调节后的音量作为训练样本,在所述基线模型的基础上进行学习,得到更新后的模型,并用所述更新后的模型替换基线模型。则最终得到的更新后的模型即为专用于所述特定用户的音量调节神经网络。Specifically: the pre-trained neural network may be used as a baseline model. Repeat the following steps until the number of times the specific user readjusts the instruction is less than the preset value: for the music being played, use the baseline model to obtain the corresponding volume setting; obtain the specific user's readjustment of the corresponding volume setting Instruction; if the number of times that the specific user adjusts the instruction again reaches a preset value, the volume adjusted by the specific user is used as a training sample, learning is performed on the basis of the baseline model, and an updated model is obtained and used The updated model replaces the baseline model. Then the updated model finally obtained is a volume adjustment neural network dedicated to the specific user.
作为一种实现方式,所述预先训练好的神经网络包括:音乐风格神经网络、噪声类别辨识神经网络以及音量调节神经网络。确定模块320可以具体用于:根据所述待播放音乐的时域波形以及所述噪声的时域波形,使用音乐风格神经网络、噪声类别辨识神经网络以及音量调节神经网络,得到所述待播放音乐的音量设置。As an implementation manner, the pre-trained neural network includes: a musical style neural network, a noise category identification neural network, and a volume adjustment neural network. The determining module 320 may be specifically configured to use the music style neural network, the noise category recognition neural network, and the volume adjustment neural network to obtain the music to be played according to the time domain waveform of the music to be played and the time domain waveform of the noise. Volume setting.
可选地,确定模块320可以包括风格向量确定单元、噪声类别确定单元、音乐能量特征确定单元、噪声能量特征确定单元以及音量确定单元。Optionally, the determination module 320 may include a style vector determination unit, a noise category determination unit, a music energy feature determination unit, a noise energy feature determination unit, and a volume determination unit.
风格向量确定单元用于根据所述待播放音乐的时域波形,使用所述音乐风格神经网络,得到所述待播放音乐的风格向量。A style vector determining unit is configured to obtain a style vector of the music to be played according to a time-domain waveform of the music to be played by using the music style neural network.
The noise category determination unit is configured to use the noise category identification neural network to obtain the category of the noise according to the time-domain waveform of the noise.
音乐能量特征确定单元用于根据所述待播放音乐的时域波形得到所述待播放音乐的能量特征。The music energy characteristic determining unit is configured to obtain an energy characteristic of the music to be played according to a time-domain waveform of the music to be played.
噪声能量特征确定单元用于根据所述噪声的时域波形得到所述噪声的能量特征。The noise energy characteristic determining unit is configured to obtain an energy characteristic of the noise according to a time-domain waveform of the noise.
The volume determination unit is configured to input the style vector of the music to be played, the category of the noise, the energy feature of the music to be played and the energy feature of the noise to the volume adjustment neural network to obtain the volume setting of the music to be played.
其中,风格向量确定单元具体用于:对所述待播放音乐的时域波形进行分帧,并对分帧后的每帧进行特征提取,得到所述待播放音乐的特征;将所述待播放音乐的特征输入至所述音乐风格神经网络,得到所述该待播放音乐的风格向量。The style vector determining unit is specifically configured to frame the time-domain waveform of the music to be played, and extract features from each frame after the frame to obtain the characteristics of the music to be played; Music characteristics are input to the music style neural network to obtain the style vector of the music to be played.
其中,噪声类别确定单元具体用于:对所述噪声的时域波形进行分帧,并对分帧后的每帧进行特征提取,得到所述噪声的特征;将所述噪声的特征输入至所述噪声类别辨识神经网络,得到所述噪声的类别。The noise category determination unit is specifically configured to: frame the time-domain waveform of the noise, and extract features from each frame after the frame to obtain the characteristics of the noise; and input the characteristics of the noise to all The noise category identification neural network is used to obtain the category of the noise.
The energy feature of the music to be played includes the average amplitude of the music to be played, and the music energy feature determination unit is specifically configured to: take the absolute value of the amplitude at each point of the time-domain waveform of the music to be played, and divide the sum of these values by the total number of points to obtain the energy feature of the music to be played.
The energy feature of the noise includes the average amplitude of the noise, and the noise energy feature determination unit is specifically configured to: take the absolute value of the amplitude at each point of the time-domain waveform of the noise, and divide the sum of these values by the total number of points to obtain the energy feature of the noise.
作为一种实现方式,设备30还包括训练模块,用于:基于音乐训练数据集,通过训练得到所述音乐风格神经网络。As an implementation manner, the device 30 further includes a training module, configured to obtain the music style neural network through training based on the music training data set.
Each piece of music training data in the music training data set has a music style vector. The training module obtains the music style vector of a piece of music training data in the following manner: acquiring the style annotation information of a large number of users for multiple pieces of music training data, and generating an annotation matrix based on the style annotation information; and determining the music style vector of each piece of music training data according to the annotation matrix.
具体地,将所述标注矩阵分解为第一矩阵与第二矩阵的乘积;将所述第一矩阵的各个行向量确定为对应的音乐训练数据的音乐风格向量。Specifically, the labeling matrix is decomposed into a product of a first matrix and a second matrix; and each row vector of the first matrix is determined as a music style vector of corresponding music training data.
In one implementation, the device 30 further includes a training module configured to obtain the noise category identification neural network through training based on a noise training data set.
Exemplarily, the time-domain waveform of the noise acquired by the acquiring module 310 is collected by a sound pickup device of the client.
In one implementation, the device 30 further includes a playback module configured to play the music to be played after its volume has been adjusted.
The device 30 shown in FIG. 8 can be used to implement the foregoing method for adjusting the volume of music; to avoid repetition, the details are not repeated here.
As shown in FIG. 9, an embodiment of the present invention further provides another device for adjusting the volume of music, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, the steps of the method described above are implemented.
Specifically, the processor may acquire the time-domain waveform of the music to be played and the time-domain waveform of the noise of the playback environment; obtain the volume setting of the music to be played from the time-domain waveform of the music to be played and the time-domain waveform of the noise by using a pre-trained neural network; and adjust the volume of the music to be played using the volume setting. The pre-trained neural network includes a music style neural network, a noise category identification neural network, and a volume adjustment neural network.
The processor may also obtain, through online learning, a volume adjustment neural network dedicated to a specific user.
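A minimal sketch of such online learning is shown below, following the procedure recited in claim 2: the baseline model's parameters are copied, the user's manual readjustments are collected, and once their number reaches a preset value the parameters are fine-tuned on those samples. The linear stand-in model, the threshold of 5, and the class name OnlineVolumeTuner are assumptions for illustration; the disclosure itself uses the volume adjustment neural network.

```python
import numpy as np

class OnlineVolumeTuner:
    """Sketch of user-specific online learning: start from the baseline model's
    parameters and fine-tune them on the volumes a specific user sets manually.
    A linear model stands in for the volume adjustment neural network."""

    def __init__(self, baseline_w, baseline_b, preset_count=5, lr=0.05):
        self.w = np.array(baseline_w, dtype=float)   # parameters copied from the baseline model
        self.b = float(baseline_b)
        self.preset_count = preset_count             # number of readjustments that triggers an update
        self.lr = lr
        self.samples = []                            # (feature vector, user-adjusted volume) pairs

    def predict(self, x):
        return float(np.dot(x, self.w) + self.b)

    def record_readjustment(self, x, user_volume):
        """Called whenever the user manually overrides the suggested volume."""
        self.samples.append((np.asarray(x, dtype=float), float(user_volume)))
        if len(self.samples) >= self.preset_count:
            self._fine_tune()
            self.samples.clear()                     # the updated model becomes the new baseline

    def _fine_tune(self, epochs=200):
        X = np.stack([s[0] for s in self.samples])
        y = np.array([s[1] for s in self.samples])
        for _ in range(epochs):                      # plain gradient descent on squared error
            err = X @ self.w + self.b - y
            self.w -= self.lr * (X.T @ err) / len(y)
            self.b -= self.lr * err.mean()

# Usage: the features are the same inputs the volume adjustment network receives
tuner = OnlineVolumeTuner(baseline_w=np.zeros(9), baseline_b=0.5)
x = np.random.default_rng(0).random(9)
tuner.record_readjustment(x, user_volume=0.8)
print(tuner.predict(x))
```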
Exemplarily, the device for adjusting the volume of music in an embodiment of the present invention may include one or more processors, one or more memories, an input apparatus, and an output apparatus, and these components are interconnected through a bus system and/or another form of connection mechanism. It should be noted that the device may also have other components and structures as required.
The processor may be a central processing unit (CPU) or another form of processing unit having data processing capability and/or instruction execution capability, and may control other components in the device to perform desired functions.
The memory may include one or more computer program products, and the computer program products may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), a hard disk, or flash memory. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor may run the program instructions to implement the client functions (implemented by the processor) in the embodiments of the present invention described herein and/or other desired functions. Various applications and various data, such as data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
The input apparatus may be an apparatus used by a user to input instructions, and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like.
The output apparatus may output various information (for example, images or sounds) to the outside (for example, to a user), and may include one or more of a display, a speaker, and the like.
The input apparatus and the output apparatus may be external apparatuses that communicate with the processor in a wired or wireless manner.
In addition, an embodiment of the present invention further provides a computer storage medium on which a computer program is stored. When the computer program is executed by a processor, the steps of the volume adjustment method described above can be implemented. For example, the computer storage medium is a computer-readable storage medium.
It can be seen that the embodiments of the present invention use a pre-trained neural network that includes a music style neural network, a noise category identification neural network, and a volume adjustment neural network, and that takes into account factors affecting the user's current volume preference, such as the noise category of the environment and the style of the music, so that the volume of the music the user wants to play can be adjusted automatically. This greatly simplifies the user's operation and improves the user experience. Moreover, the volume can be readjusted according to the volume preference of a specific user, and a volume adjustment model dedicated to that user can be obtained through online learning, so that this dedicated model can then be used to automatically set the volume of the music the specific user wants to play.
A person of ordinary skill in the art may realize that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each specific application, but such implementation should not be considered as going beyond the scope of the present invention.
A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division of the units is merely a logical functional division, and there may be other divisions in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be implemented through some interfaces, and the indirect couplings or communication connections between apparatuses or units may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the present invention essentially, or the part contributing to the prior art, or a part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above descriptions are merely specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (17)

1. A method for adjusting the volume of music, comprising:
    acquiring a time-domain waveform of music to be played and a time-domain waveform of noise of a playback environment;
    obtaining a volume setting of the music to be played from the time-domain waveform of the music to be played and the time-domain waveform of the noise by using a pre-trained neural network; and
    adjusting the volume of the music to be played using the volume setting.
2. The method according to claim 1, further comprising:
    using the pre-trained neural network as a baseline model; and
    repeating the following steps until the number of readjustment instructions from a specific user is less than a preset value:
    obtaining, for music being played, a corresponding volume setting using the baseline model;
    acquiring a readjustment instruction of the specific user for the corresponding volume setting; and
    if the number of readjustment instructions from the specific user reaches the preset value, using the volumes adjusted by the specific user as training samples, learning on the basis of the parameters of the baseline model to obtain an updated model, and replacing the baseline model with the updated model.
3. The method according to claim 1, wherein the pre-trained neural network comprises a music style neural network, a noise category identification neural network, and a volume adjustment neural network.
4. The method according to claim 3, wherein obtaining the volume setting of the music to be played comprises:
    obtaining a style vector of the music to be played from the time-domain waveform of the music to be played by using the music style neural network;
    obtaining a category of the noise from the time-domain waveform of the noise by using the noise category identification neural network;
    obtaining an energy characteristic of the music to be played from the time-domain waveform of the music to be played;
    obtaining an energy characteristic of the noise from the time-domain waveform of the noise; and
    inputting the style vector of the music to be played, the category of the noise, the energy characteristic of the music to be played, and the energy characteristic of the noise into the volume adjustment neural network to obtain the volume setting of the music to be played.
5. The method according to claim 4, wherein obtaining the style vector of the music to be played comprises:
    dividing the time-domain waveform of the music to be played into frames and performing feature extraction on each frame to obtain features of the music to be played; and
    inputting the features of the music to be played into the music style neural network to obtain the style vector of the music to be played.
6. The method according to claim 4, wherein obtaining the category of the noise comprises:
    dividing the time-domain waveform of the noise into frames and performing feature extraction on each frame to obtain features of the noise; and
    inputting the features of the noise into the noise category identification neural network to obtain the category of the noise.
7. The method according to claim 4, wherein the energy characteristic of the music to be played comprises an average amplitude of the music to be played, and obtaining the energy characteristic of the music to be played comprises:
    taking the absolute value of the amplitude at each point of the time-domain waveform of the music to be played and dividing the sum of these values by the total number of points to obtain the average amplitude of the music to be played.
8. The method according to claim 4, wherein the energy characteristic of the noise comprises an average amplitude of the noise, and obtaining the energy characteristic of the noise comprises:
    taking the absolute value of the amplitude at each point of the time-domain waveform of the noise and dividing the sum of these values by the total number of points to obtain the average amplitude of the noise.
9. The method according to claim 3, further comprising, before the music style neural network is used:
    obtaining the music style neural network through training based on a music training data set.
10. The method according to claim 9, wherein each music training data item in the music training data set has a music style vector, and the music style vector of the music training data is obtained in the following manner:
    acquiring style annotation information of a large number of users on a plurality of music training data items, and generating an annotation matrix based on the style annotation information; and
    determining the music style vector of each music training data item according to the annotation matrix.
11. The method according to claim 10, wherein determining the music style vector of each music training data item according to the annotation matrix comprises:
    decomposing the annotation matrix into a product of a first matrix and a second matrix; and
    determining each row vector of the first matrix as the music style vector of the corresponding music training data item.
12. The method according to claim 3, further comprising, before the noise category identification neural network is used:
    obtaining the noise category identification neural network through training based on a noise training data set.
13. The method according to claim 1, wherein the time-domain waveform of the noise is collected by a sound pickup device of a client.
14. The method according to any one of claims 1 to 13, further comprising:
    playing the music to be played after the volume has been adjusted.
15. A device for adjusting the volume of music, wherein the device is configured to implement the method according to any one of claims 1 to 14, and the device comprises:
    an acquiring module, configured to acquire a time-domain waveform of music to be played and a time-domain waveform of noise of a playback environment;
    a determining module, configured to obtain a volume setting of the music to be played from the time-domain waveform of the music to be played and the time-domain waveform of the noise by using a pre-trained neural network; and
    an adjusting module, configured to adjust the volume of the music to be played using the volume setting.
16. A device for adjusting the volume of music, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method according to any one of claims 1 to 14 when executing the computer program.
17. A computer storage medium on which a computer program is stored, wherein the steps of the method according to any one of claims 1 to 14 are implemented when the computer program is executed by a processor.
PCT/CN2019/089758 2018-06-05 2019-06-03 Method and device for adjusting volume of music WO2019233361A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810583114.1 2018-06-05
CN201810583114.1A CN109147816B (en) 2018-06-05 2018-06-05 Method and equipment for adjusting volume of music

Publications (1)

Publication Number Publication Date
WO2019233361A1 true WO2019233361A1 (en) 2019-12-12

Family

ID=64802002

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/089758 WO2019233361A1 (en) 2018-06-05 2019-06-03 Method and device for adjusting volume of music

Country Status (2)

Country Link
CN (1) CN109147816B (en)
WO (1) WO2019233361A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109147816B (en) * 2018-06-05 2021-08-24 安克创新科技股份有限公司 Method and equipment for adjusting volume of music
CN110012386B (en) * 2019-03-29 2021-05-11 维沃移动通信有限公司 Volume adjusting method of terminal and terminal
CN112118485B (en) * 2020-09-22 2022-07-08 英华达(上海)科技有限公司 Volume self-adaptive adjusting method, system, equipment and storage medium
CN113823318A (en) * 2021-06-25 2021-12-21 腾讯科技(深圳)有限公司 Multiplying power determining method based on artificial intelligence, volume adjusting method and device
CN116208700B (en) * 2023-04-25 2023-07-21 深圳市华卓智能科技有限公司 Control method and system for communication between mobile phone and audio equipment

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2101411B1 (en) * 2008-03-12 2016-06-01 Harman Becker Automotive Systems GmbH Loudness adjustment with self-adaptive gain offsets
CN102446504B (en) * 2010-10-08 2013-10-09 华为技术有限公司 Voice/Music identifying method and equipment
CN102664017B (en) * 2012-04-25 2013-05-08 武汉大学 Three-dimensional (3D) audio quality objective evaluation method
CN102842310A (en) * 2012-08-10 2012-12-26 上海协言科学技术服务有限公司 Method for extracting and utilizing audio features for repairing Chinese national folk music audios
CN105159066B (en) * 2015-06-18 2017-11-07 同济大学 A kind of intelligent music Room regulation and control method and regulation device
KR20170030384A (en) * 2015-09-09 2017-03-17 삼성전자주식회사 Apparatus and Method for controlling sound, Apparatus and Method for learning genre recognition model
US9571628B1 (en) * 2015-11-13 2017-02-14 International Business Machines Corporation Context and environment aware volume control in telephonic conversation
CN105845120B (en) * 2016-05-24 2023-08-01 广东禾川电机科技有限公司 Silencer, atomizer and silencer screw design method
CN106502618B (en) * 2016-10-21 2020-10-13 深圳市冠旭电子股份有限公司 Hearing protection method and device
CN107436751A (en) * 2017-08-18 2017-12-05 广东欧珀移动通信有限公司 volume adjusting method, device, terminal device and storage medium
CN107564538A (en) * 2017-09-18 2018-01-09 武汉大学 The definition enhancing method and system of a kind of real-time speech communicating
CN107682561A (en) * 2017-11-10 2018-02-09 广东欧珀移动通信有限公司 volume adjusting method, device, terminal and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160125869A1 (en) * 2014-11-05 2016-05-05 Voyetra Turtle Beach, Inc. HEADSET WITH USER CONFIGURABLE NOISE CANCELLATION vs AMBIENT NOISE PICKUP
CN106027809A (en) * 2016-07-27 2016-10-12 维沃移动通信有限公司 Volume adjusting method and mobile terminal
CN106374864A (en) * 2016-09-29 2017-02-01 深圳市茁壮网络股份有限公司 Volume adjustment method and device
CN107886943A (en) * 2017-11-21 2018-04-06 广州势必可赢网络科技有限公司 A kind of method for recognizing sound-groove and device
CN109147816A (en) * 2018-06-05 2019-01-04 安克创新科技股份有限公司 The method and apparatus of volume adjustment is carried out to music

Also Published As

Publication number Publication date
CN109147816B (en) 2021-08-24
CN109147816A (en) 2019-01-04

Similar Documents

Publication Publication Date Title
US11790934B2 (en) Deep learning based method and system for processing sound quality characteristics
WO2019233361A1 (en) Method and device for adjusting volume of music
US11875807B2 (en) Deep learning-based audio equalization
US9691379B1 (en) Selecting from multiple content sources
CN104768049B (en) Method, system and computer readable storage medium for synchronizing audio data and video data
WO2020155490A1 (en) Method and apparatus for managing music based on speech analysis, and computer device
Muthusamy et al. Particle swarm optimization based feature enhancement and feature selection for improved emotion recognition in speech and glottal signals
CN106898339B (en) Song chorusing method and terminal
WO2019137392A1 (en) File classification processing method and apparatus, terminal, server, and storage medium
CN104091596A (en) Music identifying method, system and device
Haque et al. An analysis of content-based classification of audio signals using a fuzzy c-means algorithm
CN110853606A (en) Sound effect configuration method and device and computer readable storage medium
TW202223804A (en) Electronic resource pushing method and system
CN116132875B (en) Multi-mode intelligent control method, system and storage medium for hearing-aid earphone
JP6233625B2 (en) Audio processing apparatus and method, and program
CN113032616B (en) Audio recommendation method, device, computer equipment and storage medium
WO2019233359A1 (en) Method and device for transparency processing of music
CN113395577A (en) Sound changing playing method and device, storage medium and electronic equipment
Dutta et al. A hierarchical approach for silence/speech/music classification
JP7230085B2 (en) Method and device, electronic device, storage medium and computer program for processing sound
Astapov et al. Acoustic event mixing to multichannel AMI data for distant speech recognition and acoustic event classification benchmarking
CN115565508A (en) Song matching method and device, electronic equipment and storage medium
CN114664316A (en) Audio restoration method, device, equipment and medium based on automatic pickup
JP6169526B2 (en) Specific voice suppression device, specific voice suppression method and program
CN117174082A (en) Training and executing method, device, equipment and storage medium of voice wake-up model

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19814093

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19814093

Country of ref document: EP

Kind code of ref document: A1