CN116072147A - Music detection model training method and device, electronic equipment and storage medium - Google Patents

Music detection model training method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN116072147A
CN116072147A (application CN202310027938.1A)
Authority
CN
China
Prior art keywords
music
audio
training
event
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310027938.1A
Other languages
Chinese (zh)
Inventor
郑雪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202310027938.1A
Publication of CN116072147A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The disclosure relates to a music detection model training method and apparatus, an electronic device and a storage medium, in the technical field of music detection. The method comprises the following steps: determining a detection frame corresponding to audio in a training data set, the detection frame being a predefined a priori frame for detecting a musical event in the audio; acquiring, in units of the detection frame, truth labels of a plurality of musical events in the audio, and converting the truth labels into a standard format, the standard format being a data set containing the category of the musical event and the start time and end time of the corresponding detection frame, for use as training labels; and training a neural network based on the training data set and the training labels to obtain a music detection model, the music detection model being used to detect the category of a musical event and the positioning of the musical event in the audio. The method introduces the concept of the detection frame, which improves music detection accuracy and simplifies the detection process.

Description

Music detection model training method and device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of music detection, and in particular to a music detection model training method and apparatus, an electronic device and a storage medium.
Background
As a carrier of emotional expression, music plays an irreplaceable role in audio and video production. To support the development of the music industry, protecting music property rights is particularly important, and music detection is one of the important means of doing so. Music detection refers to determining, given an audio file, whether a piece of music exists in it, and if so, the type and start and stop positions of that piece of music.
In the prior art, frame-level music detection results are usually output by a multi-class or multi-label classification network, and the timestamps of musical events are obtained in a post-processing stage. This easily produces isolated erroneous frames, so the detection results are inaccurate and discontinuous; moreover, the frame-level discrete outputs must be converted into continuous values in the post-processing stage, making the process cumbersome and the accuracy low.
The embodiments of the disclosure provide a music detection model training method and apparatus, an electronic device and a storage medium. A music detection model trained with the provided method performs both classification and positioning of musical events and can solve the above problems in the prior art.
Disclosure of Invention
The disclosure provides a music detection model training method and apparatus, an electronic device and a storage medium, which at least solve the problems in the related art that frame-level detection easily produces isolated erroneous frames, that detection results are inaccurate and discontinuous, and that the processing flow is cumbersome. The technical scheme of the present disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a music detection model training method, including: determining a detection frame corresponding to audio in a training data set, wherein the detection frame is a predefined a priori frame for detecting a musical event in the audio; acquiring, in units of the detection frame, truth labels of a plurality of musical events in the audio, and converting the truth labels into a standard format, the standard format being a data set containing the category of the musical event and the start time and end time of the corresponding detection frame, the converted labels serving as training labels; and training a neural network based on the training data set and the training labels to obtain a music detection model, wherein the music detection model is used to detect the category of a musical event and the positioning of the musical event in the audio.
Optionally, training the neural network based on the training data set and the training label includes: determining deep global features of the audio; identifying the music event in the audio according to the deep global features, and determining the category and the positioning of the music event to obtain an identification result; and adjusting parameters of the neural network according to the difference between the identification result and the training label to obtain the music detection model.
Optionally, determining the deep global features of the audio includes: extracting mel spectrum features of the audio; inputting the mel spectrum features into the neural network to obtain shallow features of the audio; performing global context information modeling on the shallow features to extract global features of the audio; superimposing the global features onto the shallow features by introducing attention; and repeating, several times, the steps from inputting the mel spectrum features into the neural network to obtain the shallow features of the audio through superimposing the global features onto the shallow features by introducing attention, to obtain the deep global features of the audio.
Optionally, the music detection model includes a classification prediction branch for outputting a prediction result of a class of the music event and a localization prediction branch for outputting a prediction result of localization of the music event in the audio.
Optionally, the adjusting the parameters of the neural network according to the difference between the recognition result and the training label includes: determining a first loss value according to the difference between the category recognition result of the musical event in the recognition result and the category of the musical event in the training label; determining a second loss value according to the difference between the positioning recognition result of the musical event in the recognition result and the positioning of the musical event determined according to the training label; and adjusting the parameters of the neural network according to the first loss value and the second loss value.
Optionally, the adjusting the parameters of the neural network according to the first loss value and the second loss value includes: adjusting a first network parameter of the classification prediction branch according to the first loss value; adjusting a second network parameter of the positioning prediction branch according to the second loss value; and determining parameters of the neural network according to the first network parameters and the second network parameters.
Optionally, the classification prediction branch and the positioning prediction branch both acquire prediction results through a convolution head; or the classification prediction branch and the positioning prediction branch both acquire prediction results through a full connector (a fully connected head); or the classification prediction branch acquires its prediction result through a convolution head and the positioning prediction branch acquires its prediction result through a full connector; or the classification prediction branch acquires its prediction result through a full connector and the positioning prediction branch acquires its prediction result through a convolution head.
According to a second aspect of embodiments of the present disclosure, there is provided a music detection method, including: acquiring audio to be detected; inputting the audio to be detected into the music detection model obtained by training the method in any one of the first aspect, and outputting the category of the music event in the audio to be detected and the positioning of the music event in the audio to be detected.
According to a third aspect of the embodiments of the present disclosure, there is provided a music detection model training apparatus, including: a detection frame determination module configured to determine a detection frame corresponding to audio in a training data set, the detection frame being a predefined a priori frame for detecting a musical event in the audio; a training label determination module configured to acquire, in units of the detection frame, truth labels of a plurality of musical events in the audio, and to convert the truth labels into a standard format, the standard format being a data set containing the category of the musical event and the start time and end time of the corresponding detection frame, as training labels; and a model training module configured to train a neural network based on the training data set and the training labels to obtain a music detection model, the music detection model being used to detect the category of a musical event and the positioning of the musical event in the audio.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a music detection apparatus, including: a data acquisition module configured to acquire audio to be detected; and a detection module configured to input the audio to be detected into a music detection model trained with the method of any one of the first aspect, and to output the category of the musical event in the audio to be detected and the positioning of the musical event in the audio to be detected.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic device, comprising: a processor; and a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement the method of any one of the above.
According to a sixth aspect of the embodiments of the present disclosure, there is provided a computer readable storage medium storing instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the method of any one of the above.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product, which when executed by a processor, implements the method of any of the above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
in the music detection model training method provided by the embodiments of the disclosure, first, a detection frame corresponding to audio in a training data set is determined, where the detection frame is a predefined a priori frame for detecting a musical event in the audio; then, taking the predefined detection frame as a unit, truth labels of a plurality of musical events in the audio are acquired and converted into a standard format, the standard format being a data set containing the category of the musical event and the start time and end time of the corresponding detection frame, and the converted labels are used as training labels; finally, a neural network is trained based on the training data set and the training labels to obtain a music detection model, the music detection model being used to detect the category of a musical event and the positioning of the musical event in the audio. On the one hand, by introducing the detection frame, when the trained music detection model detects a musical event in audio, the obtained start and end times of the musical event are continuous values; compared with frame-level discrete values, no complicated post-processing flow is needed to convert discrete values into continuous values, so the flow is simple, and problems such as low detection accuracy caused by factors such as frame-selection precision and the choice of merging strategy in the post-processing stage are avoided. On the other hand, the truth labels of the musical events are converted into training labels, so the music detection task becomes the task of detecting those training labels with the trained music detection model; the training labels can therefore be detected accurately, which improves the accuracy of the prediction result.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
FIG. 1 is a schematic diagram of a system architecture of an exemplary application environment for a music detection model training method and apparatus, according to an exemplary embodiment;
FIG. 2 is a schematic diagram of a computer system of an electronic device, shown according to an exemplary embodiment;
FIG. 3 is a schematic diagram showing a music detection process and results according to a related art;
FIG. 4 is a flowchart illustrating a music detection model training method, according to an example embodiment;
FIG. 5 is a schematic diagram of audio and predefined prior frames for detecting musical events in the audio for a music detection model training method, according to an exemplary embodiment;
FIG. 6 is a diagram showing 4 ways in which a music detection model of a music detection model training method may obtain predictions, according to an example embodiment;
FIG. 7 is a process diagram of a music detection model training method according to one particular embodiment;
FIG. 8 is a flowchart illustrating a music detection method according to an exemplary embodiment;
fig. 9 is a process diagram of a music detection method according to a specific embodiment shown in an exemplary embodiment;
FIG. 10 is a block diagram of a music detection model training apparatus, according to an example embodiment;
fig. 11 is a block diagram of a music detection apparatus according to an exemplary embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present disclosure. One skilled in the relevant art will recognize, however, that the aspects of the disclosure may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
Fig. 1 is a schematic diagram of a system architecture of an exemplary application environment to which a music detection model training method and apparatus of embodiments of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include one or more of the terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others. The terminal devices 101, 102, 103 may be various electronic devices with display screens including, but not limited to, desktop computers, portable computers, smart phones, tablet computers, and the like. It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, the server 105 may be a server cluster formed by a plurality of servers.
The music detection model training method provided by the embodiment of the disclosure may be executed by the terminal devices 101, 102, 103, and correspondingly, the music detection model training apparatus may be set in the terminal devices 101, 102, 103. The music detection model training method provided by the embodiment of the present disclosure may also be performed by the terminal devices 101, 102, 103 and the server 105 together, and accordingly, the music detection model training apparatus may be disposed in the terminal devices 101, 102, 103 and the server 105. In addition, the music detection model training method provided in the embodiment of the present disclosure may also be executed by the server 105, and accordingly, the music detection model training apparatus may be provided in the server 105, which is not particularly limited in the present exemplary embodiment.
For example, the music detection model training method provided by the embodiments of the present disclosure may be performed by the server 105. The server 105 firstly acquires a plurality of audios, takes the acquired audios as a training data set, and determines a detection frame corresponding to the audios in the training data set, wherein the detection frame is a priori frame for detecting musical events in the audio; then, taking the predefined detection frame as a unit, acquiring truth labels of a plurality of music events in the audio, and converting the acquired truth labels into a standard format which is a data set containing the category of the music event, the starting time and the ending time of the corresponding detection frame as training labels; and finally, training a neural network based on the training data set and the training label to obtain a music detection model, wherein the music detection model is used for detecting the category of the music event and the positioning of the music event in the audio.
Fig. 2 shows a schematic diagram of a computer system suitable for use in implementing embodiments of the present disclosure.
It should be noted that the computer system 200 of the electronic device shown in fig. 2 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present disclosure.
As shown in fig. 2, the computer system 200 includes a Central Processing Unit (CPU) 201, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 202 or a program loaded from a storage section 208 into a Random Access Memory (RAM) 203. In the RAM 203, various programs and data required for the system operation are also stored. The CPU201, ROM 202, and RAM 203 are connected to each other through a bus 204. An input/output (I/O) interface 205 is also connected to bus 204.
The following components are connected to the I/O interface 205: an input section 206 including a keyboard, a mouse, and the like; an output portion 207 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker, and the like; a storage section 208 including a hard disk or the like; and a communication section 209 including a network interface card such as a LAN card, a modem, and the like. The communication section 209 performs communication processing via a network such as the internet. The drive 210 is also connected to the I/O interface 205 as needed. A removable medium 211 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed on the drive 210 as needed, so that a computer program read out therefrom is installed into the storage section 208 as needed.
Music is used as a carrier for emotion expression and widely applied to audio and video production, and music detection is one of important ways of protecting music property rights. Music detection refers to determining whether a piece of music exists in an audio file given the audio file, and the type and start-stop positions of the piece of music.
The prior art generally outputs frame-level music detection results based on a multi-class or multi-label classification network and obtains the timestamps of musical events through a post-processing stage. As shown in fig. 3, such a music detection method divides the input audio into frames and defines a label for each frame according to the occurrence time of each musical event; then, under the supervised learning paradigm, a frame-level prediction result is output by a multi-class or multi-label classification network; finally, according to the characteristics of each musical event in the audio-visual data, the frame-level predictions are merged with a suitable merging strategy to obtain the labels of the musical events.
However, the above method has the following problems: it easily produces isolated erroneous frames, the detection results are inaccurate and discontinuous, and the output frame-level discrete values must be converted into continuous values through a post-processing stage, so the flow is complicated and the accuracy is low.
In order to solve the problems of the related art, the present exemplary embodiment proposes a technical solution, which is described in detail below:
FIG. 4 is a flowchart illustrating a music detection model training method according to an exemplary embodiment. As shown in FIG. 4, the method includes the following steps.
In step S410, a detection box corresponding to the audio in the training dataset is determined, wherein the detection box is a predefined a priori box for detecting the musical event in the audio.
The audio is any sound file that can be heard. The audio may be any one of human voice, music and other sounds, or a combination of two or more of them. A musical event is a temporal organization of a series of sounds and silences, covering melody, rhythm, tune, harmony and polyphony; a musical event may be, for example, an instrumental performance or a song.
The embodiment of the disclosure obtains a plurality of the audios as a training data set to train to obtain a music detection model based on the training data set. Alternatively, a music relative loudness detection data set may be used as the training data set described above. It will be appreciated that the method provided in the embodiments of the present disclosure may also obtain audio through other approaches as a training data set, for example, may also obtain audio from an audio-video platform, which is not particularly limited in the embodiments of the present disclosure.
In order to improve the accuracy of the music detection model and solve problems such as frame-level prediction results easily producing erroneous frames and requiring post-processing, the embodiments of the disclosure introduce the detection frame. The detection frame is a predefined a priori frame for detecting musical events in the audio and is used to detect whether the framed region contains a musical event. Illustratively, as shown in fig. 5, 501 denotes a predefined a priori frame for detecting musical events in the audio. It should be noted that, when the predefined detection frame cannot cover a musical event in the audio, it may be replaced by a detection frame defined by the length of the musical event.
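For illustration only, the following Python sketch (not part of the original disclosure; the helper names and the 10-second frame length are assumptions) represents such an a priori frame as a fixed-length time window and checks whether the framed region covers a given musical event:

def detection_frame(anchor_time, frame_len=10.0):
    """A predefined a priori frame of fixed length, starting at anchor_time."""
    return (anchor_time, anchor_time + frame_len)

def covers(frame, event_start, event_end):
    """True if the framed region fully contains the musical event."""
    return frame[0] <= event_start and event_end <= frame[1]

frame = detection_frame(2.0)        # frame spanning second 2 to second 12
print(covers(frame, 2.0, 7.0))      # True: the frame covers the event
print(covers(frame, 2.0, 15.0))     # False: the frame must be replaced by one
                                    # defined by the length of the event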
In step S420, the truth labels of the plurality of musical events in the audio are obtained by taking the detection frame as a unit, and the truth labels are converted into a standard format, wherein the standard format is a data set containing the category of the musical event, the start time and the end time of the corresponding detection frame.
After determining the detection frame corresponding to the audio through step S410, the method provided by the embodiments of the disclosure further needs to acquire the truth labels of the musical events in the audio and convert the truth labels into a standard format. A truth label describes the category of a musical event and its positioning in the audio; for example, the truth label of a certain musical event may be "the category of the musical event is background music, and it is located in the audio from the 2nd second to the 7th second". The standard format is a data set containing the category of the musical event and the start time and end time of the corresponding detection frame. The category in the standard format is the type of the musical event; for example, the type of the musical event may be foreground music, background music, or no music. The start time in the standard format is the start time of the musical event in the audio, and the end time in the standard format is the time in the audio after the predefined length of the detection frame has elapsed from the start time of the musical event.
Illustratively, the standard format may be defined as the following data set: Music object<class, onset, offset>, where Music object is the name of the data set, class is the category of the musical event, onset is the start time, and offset is the end time. Taking the above musical event whose "category is background music, from the 2nd to the 7th second of the audio" as an example, and assuming that the length of the detection frame is 10 seconds, the result of converting the truth label of this musical event into the standard format is Music object<background music, second 2, second 12>. It will be appreciated that when the length of the detection frame cannot cover the musical event, the length of the musical event is taken as a new detection frame: for example, assuming that the length of the detection frame is 3 seconds, since this detection frame cannot cover the musical event, the length of the new detection frame is defined as the length of the musical event, i.e., 5 seconds, and the converted standard-format result is Music object<background music, second 2, second 7>. It should be noted that the above scenario is an exemplary illustration, and the scope of the embodiments of the present disclosure is not limited thereto.
In addition, one piece of audio may contain a plurality of musical events. After the truth labels of all the musical events it contains are converted into the standard format, a set of standard-format entries is obtained, which may for example be defined as E = {e_i = [c_i, s_i, d_i]}, where E is the name of the set, e_i represents the i-th musical event in the audio, c_i represents the category of the i-th musical event, s_i represents the start time of the standard format of the i-th musical event, and d_i represents the end time of the standard format of the i-th musical event. After the set of standard-format entries corresponding to the musical events in the audio is obtained, it is used as the training label, so that the neural network can be trained with the training data set and the training labels to obtain the music detection model.
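As a purely illustrative sketch of the label conversion described above (the MusicObject data class, the helper name and the default 10-second frame length are assumptions introduced for the example, not the disclosure's actual data structures), the truth labels can be turned into standard-format training labels as follows:

from dataclasses import dataclass

@dataclass
class MusicObject:      # standard format: <class, onset, offset>
    cls: str            # category of the musical event (c_i)
    onset: float        # start time s_i of the corresponding detection frame
    offset: float       # end time d_i of the corresponding detection frame

def to_training_labels(truth_labels, frame_len=10.0):
    """Convert truth labels (category, event_start, event_end) into the set
    E = {e_i = [c_i, s_i, d_i]} used as training labels."""
    labels = []
    for cls, start, end in truth_labels:
        if end - start <= frame_len:
            labels.append(MusicObject(cls, start, start + frame_len))
        else:   # the detection frame cannot cover the event: use the event length
            labels.append(MusicObject(cls, start, end))
    return labels

# Example from the description: background music from second 2 to second 7 with
# a 10-second frame -> MusicObject(cls='background music', onset=2.0, offset=12.0)
print(to_training_labels([("background music", 2.0, 7.0)]))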
In step S430, training the neural network based on the training data set and the training tag to obtain a music detection model, where the music detection model is used to detect a category of a musical event and a location of the musical event in audio.
After the training data set and the training label are obtained through the steps S410 to S420, the method provided by the embodiment of the present disclosure trains the neural network through the step S430 to obtain the music detection model. The process of training the neural network based on the training data set and the training label can be realized as follows: determining deep global features of the audio; identifying a music event in the audio according to the deep global features, determining the category and the positioning of the music event, and obtaining an identification result; and adjusting parameters of the neural network according to the difference between the identification result and the training label to obtain a music detection model.
Illustratively, determining the deep global features of the audio may be implemented as follows: extracting mel spectrum features of the audio; inputting the mel spectrum features into the neural network to obtain shallow features of the audio; performing global context information modeling on the shallow features to extract global features of the audio; superimposing the global features onto the shallow features by introducing attention; and repeating, a target number of times, the steps from inputting the mel spectrum features into the neural network to obtain the shallow features through superimposing the global features onto the shallow features by introducing attention, so as to obtain the deep global features of the audio. The global features and the local features are audio features that influence the training process of the neural network.
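The following PyTorch sketch illustrates one possible reading of this feature pipeline. It is an assumption-laden example rather than the disclosed implementation: the layer sizes, the use of torchaudio for the mel spectrum, the multi-head attention module and the number of repetitions are all illustrative choices.

import torch
import torch.nn as nn
import torchaudio

class GlobalContextBlock(nn.Module):
    """One round of: shallow (local) features -> global context modeling ->
    superimpose the global features onto the shallow features via attention."""
    def __init__(self, channels):
        super().__init__()
        self.shallow = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(channels),
            nn.ReLU())
        self.attn = nn.MultiheadAttention(embed_dim=channels, num_heads=4,
                                          batch_first=True)

    def forward(self, x):                    # x: (batch, channels, time)
        local = self.shallow(x)              # shallow features
        seq = local.transpose(1, 2)          # (batch, time, channels)
        glob, _ = self.attn(seq, seq, seq)   # global context information
        return (seq + glob).transpose(1, 2)  # superimposed via attention

class DeepGlobalFeatureExtractor(nn.Module):
    def __init__(self, n_mels=64, channels=64, num_blocks=3):
        super().__init__()
        self.mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000,
                                                        n_mels=n_mels)
        self.proj = nn.Conv1d(n_mels, channels, kernel_size=1)
        self.blocks = nn.ModuleList(
            [GlobalContextBlock(channels) for _ in range(num_blocks)])

    def forward(self, waveform):             # waveform: (batch, samples)
        x = self.proj(self.mel(waveform))    # mel spectrum -> feature channels
        for block in self.blocks:            # repeat a target number of times
            x = block(x)
        return x                             # deep global features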
The embodiments of the disclosure also provide a Double-Head Conv Global Context Network structure (a dual-head convolutional global context network). Training is performed based on this dual-head network structure, and music detection can be divided into two tasks, classification and positioning; correspondingly, the trained music detection model includes two branches: a classification prediction branch and a positioning prediction branch, where the classification prediction branch is used to output the prediction result of the category of the musical event and the positioning prediction branch is used to output the prediction result of the positioning of the musical event in the audio.
The process of adjusting the parameters of the neural network according to the difference between the recognition result and the training label can be implemented as follows: determining a first loss value according to the difference between the category recognition result of the musical event in the recognition result and the category of the musical event in the training label; determining a second loss value according to the difference between the positioning recognition result of the musical event in the recognition result and the positioning of the musical event determined according to the training label; and adjusting the parameters of the neural network according to the first loss value and the second loss value.
The first loss value is the loss value of the classification prediction branch, and the second loss value is the loss value of the positioning prediction branch. The above adjustment of the parameters of the neural network according to the first loss value and the second loss value may be implemented as follows: adjusting a first network parameter of the classification prediction branch according to the first loss value; adjusting a second network parameter of the positioning prediction branch according to the second loss value; and determining the parameters of the neural network according to the first network parameter and the second network parameter.
Further, regarding the manner in which the music detection model obtains prediction results, as shown in fig. 6, the embodiments of the disclosure provide 4 modes: (1) the classification prediction branch and the positioning prediction branch both obtain prediction results through a convolution head; (2) the classification prediction branch and the positioning prediction branch both obtain prediction results through a full connector; (3) the classification prediction branch obtains its prediction result through a convolution head and the positioning prediction branch obtains its prediction result through a full connector; (4) the classification prediction branch obtains its prediction result through a full connector and the positioning prediction branch obtains its prediction result through a convolution head. Optionally, the embodiments of the disclosure may test the above 4 modes, so that the best-performing mode can be selected as the mode in which the music detection model obtains prediction results.
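A minimal sketch of mode (3), with a convolution head for the classification prediction branch and a fully connected (full connector) head for the positioning prediction branch, is given below; all layer sizes are assumptions, and for simplicity a single musical event per input clip is assumed.

import torch
import torch.nn as nn

class DoubleHead(nn.Module):
    """Dual-head musical event classification and positioning (mode (3))."""
    def __init__(self, channels=64, num_classes=3):
        super().__init__()
        # classification prediction branch: convolution head
        self.cls_head = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(channels, num_classes))
        # positioning prediction branch: fully connected (full connector) head
        self.loc_head = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(channels, channels),
            nn.ReLU(),
            nn.Linear(channels, 2))          # (start time, end time)

    def forward(self, feats):                # feats: (batch, channels, time)
        return self.cls_head(feats), self.loc_head(feats)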
In the following, the above process of training the neural network based on the training data set and the training labels to obtain the music detection model is described in detail with reference to a specific embodiment. As shown in fig. 7, the network structure designed in this specific embodiment includes two parts: a deep global feature extraction part and a musical event classification and positioning part.
The deep global feature extraction part performs feature extraction and global context information modeling on the audio, so as to extract features carrying both global and local information. Correspondingly, the above determination of the deep global features of the audio may be specifically implemented as follows: as shown in the part indicated by 701 in fig. 7, after the audio is input into the neural network, mel spectrum features are extracted from the audio, shallow features are extracted through Feature Extracting (feature extraction), global information is extracted through Context Modeling, and the global information is finally superimposed onto the shallow features through a Transformer structure (a network structure based on the attention mechanism), i.e., attention is added; the above process is repeated to obtain the deep global features.
The above-mentioned musical event classification and positioning part classifies and positions musical events, i.e., it detects whether a musical event exists in the audio and, if so, the positioning task outputs the specific timestamps. This part is designed based on a dual-head network architecture; as shown in the part indicated by 702 in fig. 7, it includes a classification prediction branch for outputting the prediction result of the category of the musical event and a positioning prediction branch for outputting the prediction result of the positioning of the musical event in the audio. In this specific embodiment, the classification prediction result is output by a convolution head and the positioning prediction result is output by a full connector.
The loss used by the embodiments of the disclosure when training the neural network to obtain the music detection model mainly comprises two parts, a classification loss and a positioning loss, which correspond to the first loss value and the second loss value respectively. The total loss function can be calculated by the following formula:
L = a·L_cl + b·L_re
where a and b are weight coefficients, L_cl is the classification loss, implemented as a multi-class cross entropy, and L_re is the positioning loss, i.e., the loss under the regression task, which can be determined by the following formula:
[The regression loss formula is provided as an image (BDA0004045963570000101) in the original publication.]
where c_i' is the predicted value of the category of the musical event and c_i is the true value of the category, i.e., the category in the training label; s_i' is the predicted value of the start time of the detection frame corresponding to the musical event and s_i is the true value of that start time, i.e., the value in the training label; and d_i' is the predicted value of the end time of the detection frame corresponding to the musical event and d_i is the true value of that end time, i.e., the value in the training label.
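The combined loss can be sketched as follows. The classification term uses multi-class cross entropy as stated above, while the exact regression formula is only available as an image in the original publication, so an L1 loss over the start and end times is used here purely as a stand-in assumption.

import torch
import torch.nn.functional as F

def detection_loss(cls_logits, loc_pred, cls_true, loc_true, a=1.0, b=1.0):
    # cls_logits: (batch, num_classes); cls_true: (batch,) class indices
    # loc_pred / loc_true: (batch, 2), columns are (start time s, end time d)
    l_cl = F.cross_entropy(cls_logits, cls_true)   # classification loss L_cl
    l_re = F.l1_loss(loc_pred, loc_true)           # assumed positioning loss L_re
    return a * l_cl + b * l_re                     # L = a*L_cl + b*L_re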
The embodiments of the disclosure may optimize the music detection model through the loss function described above. Specifically, this may be implemented as follows: first, forward propagation is performed, i.e., the training data set and the training labels are fed into the neural network for inference to obtain a prediction result; then, the prediction result and the training label are input into the loss function to calculate the loss; next, back propagation is performed and the gradients are calculated. The above steps are repeated and the gradients are accumulated; after the gradients have been accumulated a certain number of times, the network parameters are updated and the gradients are set to zero. After repeated iterative training, a converged music detection model with good performance is obtained.
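A minimal training-loop sketch of the above steps (forward propagation, loss computation, back propagation, gradient accumulation and periodic parameter update) is given below; the optimizer choice, learning rate and accumulation interval are assumptions, detection_loss refers to the loss sketch above, and the model is assumed to return (cls_logits, loc_pred) as in the dual-head sketch.

import torch

def train(model, loader, epochs=10, accum_steps=4, lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        for step, (audio, cls_true, loc_true) in enumerate(loader, start=1):
            cls_logits, loc_pred = model(audio)            # forward propagation
            loss = detection_loss(cls_logits, loc_pred,
                                  cls_true, loc_true) / accum_steps
            loss.backward()                                # back propagation, gradients accumulate
            if step % accum_steps == 0:                    # update after enough accumulation
                optimizer.step()
                optimizer.zero_grad()                      # set the gradients to zero
    return model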
The embodiment of the disclosure also provides a music detection method, which detects the category and the location of the music event in the audio through the music detection model obtained through the training. As shown in fig. 8, the method comprises the following steps:
in step S810, audio to be detected is acquired.
In step S820, the audio to be detected is input into the above-mentioned trained music detection model, and the category of the music event in the audio to be detected and the location of the music event in the audio to be detected are output.
As shown in fig. 9, the music detection process may be implemented as follows: the audio to be detected is input and processed by 901, the classification prediction result is output by the convolution head, the positioning prediction result is output by the full connector, and finally the result shown at 902 is output. As shown in fig. 9, the category and positioning of the musical event predicted by the music detection model trained in the embodiments of the disclosure require no post-processing, and the precision of the prediction result of this scheme is higher than that shown in fig. 5.
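For illustration, an inference sketch along these lines is shown below; the class names and the single-event assumption are illustrative, and the point is that the model outputs the category together with continuous start and end times, so no frame-level post-processing is needed.

import torch

@torch.no_grad()
def detect_music(model, audio,
                 class_names=("no music", "background music", "foreground music")):
    """audio: 1-D waveform tensor of the audio to be detected."""
    model.eval()
    cls_logits, loc_pred = model(audio.unsqueeze(0))   # batch of one
    category = class_names[cls_logits.argmax(dim=1).item()]
    start, end = loc_pred.squeeze(0).tolist()
    return category, start, end                        # category + continuous times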
Correspondingly, the embodiments of the disclosure also provide a music detection model training apparatus. As shown in FIG. 10, the apparatus includes a detection frame determination module 1010, a training tag determination module 1020, and a model training module 1030. Wherein:
The detection frame determination module 1010 is configured to acquire a plurality of pieces of audio as a training data set and determine the detection frame corresponding to the audio, the detection frame being a predefined a priori frame for detecting musical events in the audio.
The training tag determination module 1020 is configured to acquire, in units of the detection frame, truth labels of a plurality of musical events in the audio, and to convert the truth labels into a standard format, the standard format being a data set containing the category of the musical event and the start time and end time of the corresponding detection frame, as the training labels.
The model training module 1030 is configured to perform training of the neural network based on the training data set and the training tag to obtain a music detection model for detecting a category of a musical event and a localization of the musical event in audio.
Optionally, the model training module includes a result prediction unit and a parameter adjustment unit, where: the result prediction unit is configured to determine deep global features of the audio, identify musical events in the audio according to the deep global features, determine categories and positions of the musical events, and obtain identification results; the parameter adjusting unit is configured to perform adjustment of parameters of the neural network according to the difference between the recognition result and the training label, so as to obtain a music detection model.
Optionally, the result prediction unit specifically performs the determination of the deep global features of the audio by: extracting mel spectrum features of the audio; inputting the mel spectrum features into the neural network to obtain shallow features of the audio; performing global context information modeling on the shallow features to extract global features of the audio; superimposing the global features onto the shallow features by introducing attention; and repeating, several times, the steps from inputting the mel spectrum features into the neural network to obtain the shallow features through superimposing the global features onto the shallow features by introducing attention, to obtain the deep global features of the audio.
Optionally, the music detection model includes a classification prediction branch for outputting a prediction result of a class of the music event and a localization prediction branch for outputting a prediction result of localization of the music event in the audio.
Optionally, the above parameter adjustment unit is specifically configured to: determine a first loss value according to the difference between the category recognition result of the musical event in the recognition result and the category of the musical event in the training label; determine a second loss value according to the difference between the positioning recognition result of the musical event in the recognition result and the positioning of the musical event determined according to the training label; and adjust the parameters of the neural network according to the first loss value and the second loss value.
The adjusting the parameters of the neural network according to the first loss value and the second loss value can be implemented as follows: adjusting a first network parameter of the classification prediction branch according to the first loss value; adjusting a second network parameter of the positioning prediction branch according to the second loss value; and determining parameters of the neural network according to the first network parameters and the second network parameters.
Optionally, the classification prediction branch and the positioning prediction branch both acquire prediction results through a convolution head; or the classification prediction branch and the positioning prediction branch both acquire prediction results through a full connector; or the classification prediction branch acquires its prediction result through a convolution head and the positioning prediction branch acquires its prediction result through a full connector; or the classification prediction branch acquires its prediction result through a full connector and the positioning prediction branch acquires its prediction result through a convolution head.
The embodiment of the disclosure also provides a music detection device. As shown in fig. 11, the apparatus includes a data acquisition module 1110 and a detection module 1120. Wherein:
the data acquisition module 1110 is configured to perform acquisition of audio to be detected.
The detection module 1120 is configured to input the audio to be detected into the music detection model obtained by training with the above method, and to output the category of the musical event in the audio to be detected and the positioning of the musical event in the audio to be detected.
The specific manner in which the individual units or modules perform the operations in the apparatus of the above embodiments has been described in detail in relation to the embodiments of the method, and will not be described in detail here.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
As another aspect, the present application also provides a computer-readable medium that may be contained in the electronic device described in the above embodiment; or may exist alone without being incorporated into the electronic device. The computer-readable medium carries one or more programs which, when executed by one of the electronic devices, cause the electronic device to implement the methods described in the embodiments below.
It should be noted that the computer readable medium shown in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any adaptations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (12)

1. A music detection model training method, comprising:
determining a detection frame corresponding to audio in a training data set, wherein the detection frame is a predefined priori frame for detecting a music event in the audio;
acquiring truth labels of a plurality of music events in the audio frequency by taking the detection frame as a unit, and converting the truth labels into a standard format which is a data set containing the category of the music event, the starting time and the ending time corresponding to the detection frame and then serving as training labels;
And training a neural network based on the training data set and the training label to obtain a music detection model, wherein the music detection model is used for detecting the category of the music event and the positioning of the music event in the audio.
2. The method of claim 1, wherein the training a neural network based on the training data set and the training label comprises:
determining deep global features of the audio, identifying the musical event in the audio according to the deep global features, and determining the category and the positioning of the musical event to obtain an identification result;
and adjusting parameters of the neural network according to the difference between the identification result and the training label to obtain the music detection model.
3. The music detection model training method of claim 2, wherein the determining deep global features of the audio comprises:
extracting mel spectrum characteristics of the audio;
inputting the Mel spectrum characteristics into the neural network to obtain shallow characteristics of the audio;
carrying out global context information modeling on the shallow features, and extracting global features of the audio;
Superimposing the global feature to the shallow feature by means of attention;
repeating the steps of inputting the mel spectrum features into the neural network for a target number of times to obtain shallow features of the audio and overlapping the global features to the shallow features in a mode of introducing attention to obtain the deep global features of the audio.
4. The music detection model training method of claim 2, wherein the neural network includes a classification prediction branch for outputting a prediction result of a class of the musical event and a localization prediction branch for outputting a prediction result of localization of the musical event in the audio.
5. The method according to claim 4, wherein adjusting the parameters of the neural network according to the difference between the recognition result and the training label comprises:
determining a first loss value according to the difference between the identification result of the category of the music event in the identification result and the category of the music event in the training label;
Determining a second loss value according to the difference between the positioning recognition result of the music event in the recognition result and the positioning of the audio event determined according to the training label;
and adjusting parameters of the neural network according to the first loss value and the second loss value.
6. The method of claim 5, wherein adjusting parameters of the neural network according to the first loss value and the second loss value comprises:
adjusting a first network parameter of the classified prediction branch according to the first loss value;
adjusting a second network parameter of the location prediction branch according to the second loss value;
and determining parameters of the neural network according to the first network parameters and the second network parameters.
7. The music model training method according to claim 4 or 5, wherein the classification prediction branch and the localization prediction branch each acquire a prediction result through a convolution head; or alternatively
The classification prediction branch and the positioning prediction branch acquire prediction results through a full connector; or alternatively
The classification prediction branch obtains a prediction result through a convolution head, and the positioning prediction branch obtains a prediction result through a full-connector; or alternatively
The classification prediction branch obtains a prediction result through the full connector, and the positioning prediction branch obtains a prediction result through the convolution head.
8. A music detection method, comprising:
acquiring audio to be detected;
inputting the audio to be detected into the music detection model according to any one of claims 1-7, and outputting the category of the musical event in the audio to be detected and the positioning of the musical event in the audio to be detected.
9. A music detection model training device, comprising:
a detection frame determining module configured to perform determining a detection frame corresponding to audio in a training dataset, the detection frame being a predefined a priori frame detecting a musical event in the audio;
the training label determining module is configured to execute the detection frame as a unit, acquire truth labels of a plurality of music events in the audio, and convert the truth labels into a standard format, wherein the standard format is a data set containing the category of the music event, the starting time and the ending time corresponding to the detection frame, and the training label is used as a training label;
and the model training module is configured to perform training of the neural network based on the training data set and the training label to obtain a music detection model, wherein the music detection model is used for detecting the category of the music event and the positioning of the music event in the audio.
10. A music detection apparatus, comprising:
a data acquisition module configured to perform acquisition of audio to be detected;
a detection module configured to perform inputting the audio to be detected into the music detection model of any one of claims 1-7, outputting a category of a musical event in the audio to be detected and a localization of the musical event in the audio to be detected.
11. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of any one of claims 1 to 8.
12. A computer readable storage medium, which when executed by a processor of an electronic device, causes the electronic device to perform the method of any of claims 1 to 8.
CN202310027938.1A 2023-01-09 2023-01-09 Music detection model training method and device, electronic equipment and storage medium Pending CN116072147A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310027938.1A CN116072147A (en) 2023-01-09 2023-01-09 Music detection model training method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310027938.1A CN116072147A (en) 2023-01-09 2023-01-09 Music detection model training method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116072147A true CN116072147A (en) 2023-05-05

Family

ID=86181563

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310027938.1A Pending CN116072147A (en) 2023-01-09 2023-01-09 Music detection model training method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116072147A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination