CN113299318B - Audio beat detection method and device, computer equipment and storage medium - Google Patents

Audio beat detection method and device, computer equipment and storage medium

Info

Publication number
CN113299318B
CN113299318B (application CN202110565138.6A)
Authority
CN
China
Prior art keywords
audio
audio signal
attention
background vector
frame
Prior art date
Legal status
Active
Application number
CN202110565138.6A
Other languages
Chinese (zh)
Other versions
CN113299318A (en)
Inventor
罗海斯·马尔斯
胡正倫
Current Assignee
Bigo Technology Singapore Pte Ltd
Original Assignee
Bigo Technology Singapore Pte Ltd
Priority date
Filing date
Publication date
Application filed by Bigo Technology Singapore Pte Ltd filed Critical Bigo Technology Singapore Pte Ltd
Priority to CN202110565138.6A priority Critical patent/CN113299318B/en
Publication of CN113299318A publication Critical patent/CN113299318A/en
Application granted granted Critical
Publication of CN113299318B publication Critical patent/CN113299318B/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The embodiment of the invention provides an audio beat detection method and device, computer equipment and a storage medium. The method comprises: obtaining a multimedia file that contains a multi-frame audio signal; extracting local features from the multi-frame audio signal to obtain multi-frame audio feature vectors; encoding the multi-frame audio feature vectors to obtain a first background vector; and globally decoding the first background vector, with attention about the audio signal added to it, to obtain the notes expressed by the audio signal. The attention mechanism has a larger receptive field, can perceive global information, and is good at modeling long-term information, making it suitable for processing beats in audio signals. Moreover, training with an attention mechanism does not require frame-level alignment of samples, which lowers the requirements on samples and increases the number of samples that satisfy the conditions; model training is thereby simplified while the performance of the model is preserved, ensuring the accuracy of beat detection.

Description

Audio beat detection method and device, computer equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of multimedia, in particular to a method and a device for detecting audio beats, computer equipment and a storage medium.
Background
Multimedia data such as short videos and micro movies are widely used, and different music is used within them. Keeping the multimedia data synchronized with the music makes post-processing such as subtitles and special effects more convenient and improves the user's experience when viewing the multimedia data.
Many users make multimedia data using dedicated tools. These tools align the multimedia data with musical notes using a musical beat detection algorithm, thereby reducing the effort of synchronizing the multimedia data with the notes.
At present, most beat detection algorithms use a convolutional neural network, which must be trained with multimedia data aligned to beats as samples. However, annotating continuous music into segments corresponding to notes, i.e. obtaining the accurate positions of beats on the time axis, is difficult and time-consuming, so the number of such samples is small. The limited receptive field and the lack of samples degrade the performance of the convolutional neural network, and the accuracy of the detected beats is therefore low.
Disclosure of Invention
The embodiment of the invention provides an audio beat detection method and device, computer equipment and a storage medium, which are used to solve the problem of the low accuracy of detected beats.
In a first aspect, an embodiment of the present invention provides a method for detecting an audio beat, including:
acquiring a multimedia file, wherein the multimedia file is provided with multi-frame audio signals;
extracting local characteristics from multi-frame audio signals of the multimedia file to obtain multi-frame audio characteristic vectors;
encoding the multi-frame audio feature vector to obtain a first background vector;
and carrying out global decoding on the first background vector under the condition that attention about the audio signal is added to the first background vector, so as to obtain notes expressed by the audio signal.
In a second aspect, an embodiment of the present invention further provides an apparatus for detecting an audio beat, including:
the multimedia file acquisition module is used for acquiring a multimedia file, wherein the multimedia file is provided with multi-frame audio signals;
the local feature extraction module is used for extracting local features from multi-frame audio signals of the multimedia file to obtain multi-frame audio feature vectors;
the audio coding module is used for coding the multi-frame audio feature vectors to obtain a first background vector;
and the audio decoding module is used for globally decoding the first background vector under the condition that the attention of the audio signal is added to the first background vector, so as to obtain notes expressed by the audio signal.
In a third aspect, an embodiment of the present invention further provides a computer apparatus, including:
the feature extractor is used for extracting local features from multi-frame audio signals of the multimedia file to obtain multi-frame audio feature vectors;
the encoder is used for encoding the multi-frame audio feature vectors to obtain a first background vector;
and the decoder with attention is used for globally decoding the first background vector under the condition of adding attention to the first background vector about the audio signal to obtain notes expressed by the audio signal.
In a fourth aspect, an embodiment of the present invention further provides a computer apparatus, including:
one or more processors;
a memory for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for detecting audio beats as described in the first aspect.
In a fifth aspect, an embodiment of the present invention further provides a computer readable storage medium, where a computer program is stored, where the computer program is executed by a processor to implement the method for detecting an audio beat according to the first aspect.
In this embodiment, a multimedia file containing a multi-frame audio signal is obtained; local features are extracted from the multi-frame audio signal to obtain multi-frame audio feature vectors; the multi-frame audio feature vectors are encoded to obtain a first background vector; and the first background vector is globally decoded, with attention about the audio signal added to it, to obtain the notes expressed by the audio signal. The attention mechanism has a larger receptive field, can perceive global information, and is good at modeling long-term information, making it suitable for processing beats in audio signals. Moreover, training with an attention mechanism does not require frame-level alignment of samples, which lowers the requirements on samples and increases the number of samples that satisfy the conditions; model training is thereby simplified while the performance of the model is preserved, ensuring the accuracy of beat detection.
Drawings
Fig. 1 is a flowchart of a method for detecting audio beats according to a first embodiment of the present invention;
Fig. 2 is a schematic structural diagram of an audio beat detection model according to the first embodiment of the present invention;
Fig. 3 is a schematic diagram of an encoder and a decoder with attention according to the first embodiment of the present invention;
Fig. 4 is a schematic structural diagram of an audio beat detection device according to a second embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a computer device according to a third embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Example 1
Fig. 1 is a flowchart of a method for detecting an audio beat according to a first embodiment of the present invention. The method is applicable to detecting the beat of audio by learning short-range local features and modeling the long-range global background. The method may be performed by an audio beat detection device, which may be implemented in software and/or hardware and may be configured in a computer device, for example a server, a workstation or a personal computer, and specifically includes the following steps:
Step 101, acquiring a multimedia file.
The form of the multimedia file may be different in different service scenarios, such as short video, movie, television show, etc., and the present embodiment is not limited thereto.
The multimedia file contains a multi-frame audio signal, the format of which may include MP3, WMA, AAC, etc.; the present embodiment is not limited thereto.
Depending on its content, the audio signal may take the form of speech uttered by a user, animal sounds, music played in a car, pure music, a user's humming, and so on. That is, at least part of the audio signal may have beats, i.e. a combination rule of strong beats and weak beats. Specifically, the time signature gives the total length of the notes in each bar of the score, commonly 1/4, 2/4, 3/4, 4/4, 3/8, 6/8, 7/8, 9/8 or 12/8, and the length of each bar is generally fixed.
Step 102, extracting local characteristics from multi-frame audio signals of the multimedia file to obtain multi-frame audio characteristic vectors.
In this embodiment, audio signals may be used in advance as samples to train a detection model of the audio beat, which can then be used to detect the beat of an audio signal. When training is completed, the structure of the detection model of the audio beat and its parameters are recorded; when detecting the beat of the current multimedia data, the detection model with that structure is loaded into memory and the parameters are applied to it.
Further, as shown in fig. 2, the detection model of the audio beat includes a feature extractor, an Encoder, and an Attention Decoder (Attention-Decoder).
The feature extractor is operable to extract features of the audio signal: for the current multimedia file, the sequence x_1, x_2, …, x_T of its audio signals is input into the feature extractor, which processes it and outputs features of the audio signal; for ease of identification, these may be referred to as audio feature vectors.
In one example of a feature extractor, the feature extractor may be a convolution layer (Convolutional Layer) that is made up of a number of convolution units, the parameters of each of which are optimized by a back-propagation algorithm.
Thus, in this example, a convolution layer may be determined in the detection model of the audio beat, and the multi-frame audio signal of the multimedia file may be input into the convolution layer for a convolution operation to obtain multi-frame audio feature vectors. The convolution layer encodes the audio signal into a high-level representation, processing localized structure within a limited receptive field.
The receptive field is defined as the size of the region of the input data that is mapped to a point of the feature map output by the convolution layer.
Of course, the above-mentioned feature extractor is merely an example; when implementing the embodiment of the present invention, other feature extractors may be used according to the actual situation, for example multiple convolution layers, a GRU (gated recurrent unit), and the like, and the embodiment of the present invention is not limited thereto.
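For the purpose of illustration only, the following minimal Python sketch shows how such a convolution layer could extract local features from framed audio. The frame count, number of spectral bins, kernel width and feature dimension are hypothetical choices, not values specified by this embodiment.

import numpy as np

def conv_feature_extractor(frames, kernel, bias):
    T, F = frames.shape            # T frames, F spectral bins per frame
    K, F_in, D = kernel.shape      # kernel width K, input bins, output dim D
    assert F == F_in
    pad = K // 2
    padded = np.pad(frames, ((pad, pad), (0, 0)))
    out = np.empty((T, D))
    for t in range(T):             # each output frame sees only K neighbouring
        window = padded[t:t + K]   # input frames: the limited receptive field
        out[t] = np.tensordot(window, kernel, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)    # ReLU non-linearity

rng = np.random.default_rng(0)
frames = rng.standard_normal((100, 80))           # 100 frames, 80 mel bins
kernel = rng.standard_normal((5, 80, 64)) * 0.01  # width-5 kernel, 64 features
bias = np.zeros(64)
audio_feature_vectors = conv_feature_extractor(frames, kernel, bias)
print(audio_feature_vectors.shape)                # (100, 64)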
Step 103, encoding the multi-frame audio feature vectors to obtain a first background vector.
In general, an encoder may be used to transform an input sequence of varying length into a background variable of fixed length; in the detection model of the audio beat, the encoder may be used to extract high-level speech features from the audio signal.
Further, the encoder may comprise a network capable of processing sequence data, typically a stacked neural network such as a recurrent neural network. The recurrent neural network may be unidirectional, where the first hidden state of each first time step depends on the audio feature vectors of that first time step and of those before it, or bidirectional, where the first hidden state of each first time step depends on the audio feature vectors before and after the first time step as well as the one input at the current first time step; a bidirectional network encodes the entire sequence of audio feature vectors, so that future information can be used for prediction at the current first time step. After the audio feature vectors are extracted by the feature extractor, the sequence of audio feature vectors is input into the encoder for encoding, and when the encoding is completed the encoder outputs the first background vector.
In one coding scheme, for an encoder such as a unidirectional recurrent neural network, the background variable usually comes from the hidden state of the final time step. For the audio feature vectors x_1, x_2, …, x_T of the audio signals in a multimedia file of batch size 1, at first time step i the encoder transforms the audio feature vector x_i and the first hidden state h_{i-1} of the previous first time step into the hidden state h_i of the current first time step. The transform of the encoder hidden layer can be expressed with a function f:
h_i = f(x_i, h_{i-1})
Next, the encoder transforms the first hidden states of all the first time steps into the first background variable through a custom function q:
c = q(h_1, …, h_T)
In some applications, the hidden state of the final hidden layer may also be used directly as the final semantic code, i.e. the following is satisfied:
c = q(h_1, …, h_T) = h_T
In this case, therefore, an encoder trained in advance for the audio signal may be determined, and the multi-frame audio feature vectors are input into the encoder for encoding, so as to output the multi-frame hidden state at the last first time step in the encoder as the first background vector.
Step 104, globally decoding the first background vector under the condition that attention about the audio signal is added to the first background vector, and obtaining the notes expressed by the audio signal.
In a conventional scheme, convolutional neural networks share a location-based kernel to obtain local information, thereby capturing characteristics such as edges and shapes. However, audio signals such as music often have a long-term hierarchy, and similar patterns may repeatedly appear in the samples.
Convolutional neural networks do not explicitly model the time dependencies and patterns in the characteristics of the input audio signal. Recurrent neural networks have previously been used for this modeling; however, as the length of the audio signal increases, a recurrent neural network cannot remember all past information.
This limitation of forgetting past information as the input grows can be addressed by adding attention. Attention mimics the internal process of biological observation behavior: a mechanism that aligns internal experience with external sensation to increase the precision of observation of a partial region. For example, when human vision processes a picture, a quick scan of the global image yields the target area that needs to be focused on, i.e. the focus of attention; more attention resources are then devoted to this area to obtain detailed information about the target, while other useless information is suppressed.
Therefore, in this embodiment, a decoder with added attention is provided in the detection model of the audio beat; the decoder is used to transform the fixed-length background variable into a target sequence. In order to preserve the long-term hierarchical structure using self-attention, an attention mechanism can be added to the decoder. In the case of beat detection, the attention is computed within the input first background vector itself, hence self-attention, which helps adjust the encoding and decoding processes according to how important the different audio signals are to the notes.
Unlike a convolutional neural network, the self-attention representation is calculated from the attention that each frame of the first background vector pays to the different states, and more relevant states are given greater weight; each state thus perceives global information, which helps model the long-term information of musical structure and beat. Furthermore, systems trained using the attention mechanism do not rely on frame-level alignment, which simplifies the training of the detection model of the audio beat.
In one embodiment of the present invention, step 104 includes the steps of:
step 1041, determining a decoder trained in advance for the audio signal.
Step 1042, calculate a second background vector at a current second time step in the decoder based on all the current first background vectors to express attention to the audio signal.
For a decoder with added attention, the first background vector output by the encoder and the second background vector of the previous second time step may be used to predict the second background vector of the current frame, so that the detection model of the audio beat can be trained end to end.
In one example of an embodiment of the present invention, step 1042 comprises the steps of:
step 10421, determining a first concealment status for each first time step currently located in the encoder.
In this example, if an encoder such as a recurrent neural network is used, the multi-frame hidden state of the last first time step is the first background vector. At this point, the first hidden state of each current first time step in the encoder may be queried, where the encoder is used to encode the multi-frame audio feature vectors.
Step 10422, in synchronization with the current second time step in the decoder, configuring for each of the first hidden states a weight related to the attention of the audio signal.
In this example, the weight is positively correlated with the attention: the higher the attention paid to the first hidden state of a certain frame, the greater the weight of that first hidden state; conversely, the lower the attention, the smaller the weight.
In a specific implementation, for a decoder employing a recurrent neural network or the like, if the symbol y_t of the audio signal is to be predicted, then at the second time step t the second hidden state s_{t-1} previously output at the hidden layer nodes of the decoder is known. The purpose of adding attention to the decoder is to calculate, for the generated symbol y_t, the importance of each input audio signal to y_t. For example, the attention probability distribution may be obtained by comparing the second hidden state s_{t-1} of the second time step t with the corresponding hidden layer node states of the encoder, yielding the likelihood of alignment between the symbol y_t and each audio signal.
On the one hand, the first hidden state at each current first time step in the encoder may be determined; on the other hand, the second hidden state at the previous second time step in the decoder may be determined.
The first hidden state and the second hidden state are then input into an attention mechanism adapted to the audio signal to output a correlation. The attention mechanism may be used to calculate the contribution of each frame of the first hidden state to each frame of the second hidden state; examples include the additive attention mechanism (additive attention), the location-based attention mechanism (location-based attention), the dot-product attention mechanism (dot product attention), and the scaled dot-product attention mechanism (scaled dot product attention), and the present embodiment is not limited thereto.
Taking the softmax function as an example, as shown in FIG. 3, the attention mechanism score(·) takes the second hidden state s_{t-1} of the decoder at the previous second time step and the first hidden state h_i of the encoder at each first time step as the input of a softmax function, which outputs a probability distribution as the weights α_{t,i}. This can be expressed as:
α_{t,i} = softmax(score(s_{t-1}, h_i))
step 10423, calculate a second background vector at the current second time step in the decoder based on the first concealment state and the weights to express attention to the audio signal.
In a specific implementation, for each first time step in the encoder the product between the first hidden state and the weight is calculated, and the sum of these products over all first time steps in the encoder yields the second background vector at the current second time step in the decoder, expressing attention to the audio signal.
Further, as shown in fig. 3, the attention uses a context vector for the second hidden state, which is a weighted average of the first hidden states of the encoder at all first time steps, expressed as:
c_t = Σ_i α_{t,i} · h_i
where c_t is the second background vector at second time step t, h_i is the first hidden state of the encoder at each first time step, and α_{t,i} is the weight of each first hidden state at the second time step.
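A minimal Python sketch of this attention step is given below for illustration; a dot-product score function is assumed purely as one of the interchangeable score functions mentioned above, and the state dimensions are hypothetical.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(s_prev, encoder_states):
    # score(s_{t-1}, h_i): a dot product is assumed here for simplicity
    scores = encoder_states @ s_prev
    alpha = softmax(scores)                 # weights alpha_{t,i}, summing to 1
    c_t = alpha @ encoder_states            # c_t = sum_i alpha_{t,i} * h_i
    return c_t, alpha

rng = np.random.default_rng(2)
encoder_states = rng.standard_normal((100, 128))   # h_1 ... h_T
s_prev = rng.standard_normal(128)                  # decoder state s_{t-1}
c_t, alpha = attend(s_prev, encoder_states)
print(round(alpha.sum(), 6))                       # 1.0, a probability distribution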
Step 1043, inputting the second background vector into the decoder for global decoding, so as to output the notes expressed by the audio signal.
In general, a decoder is operable to transform the characteristic information into an output sequence; in the detection model of the audio beat, the decoder is operable to identify the notes expressed by the audio signal based on the speech features of the audio signal.
Further, the decoder may comprise a network capable of processing sequence data, typically a stacked neural network such as a recurrent neural network.
The first background variable output by the encoder encodes the audio feature vectors x_1, x_2, …, x_T of the entire audio signal, giving the first background vector, and the second background vector is obtained by adding attention. Given the sequence y_1, y_2, …, y_T of notes expressed by the audio signal, for each second time step t (distinguished from the first time step i of the encoder), the conditional probability of the decoder output y_t is based on the previous output sequence y_1, y_2, …, y_{t-1} and the second background variable c, i.e. P(y_t | y_1, y_2, …, y_{t-1}, c).
In general, the decoder is a stacked neural network that calculates the probability of the symbol sequence y = [y_1, y_2, …, y_T] expressed by the output audio signal, which is expressed as follows:
P(y) = ∏_{t=1}^{T} P(y_t | y_1, …, y_{t-1}, c)
Further, from the hidden states h = [h_1, h_2, …, h_T] used to form the second background vectors, the decoder predicts the symbol sequence y = [y_1, y_2, …, y_T] expressed by the audio signal x, where T represents the number of symbols predicted by the decoder.
In this embodiment, a multimedia file containing a multi-frame audio signal is obtained; local features are extracted from the multi-frame audio signal to obtain multi-frame audio feature vectors; the multi-frame audio feature vectors are encoded to obtain a first background vector; and the first background vector is globally decoded, with attention about the audio signal added to it, to obtain the notes expressed by the audio signal. The attention mechanism has a larger receptive field, can perceive global information, and is good at modeling long-term information, making it suitable for processing beats in audio signals. Moreover, training with an attention mechanism does not require frame-level alignment of samples, which lowers the requirements on samples and increases the number of samples that satisfy the conditions; model training is thereby simplified while the performance of the model is preserved, ensuring the accuracy of beat detection.
It should be noted that, for simplicity of description, the method embodiments are shown as a series of acts, but it should be understood by those skilled in the art that the embodiments are not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required by the embodiments of the invention.
Example two
Fig. 4 is a block diagram of an audio beat detection apparatus according to a second embodiment of the present invention, which may specifically include the following modules:
a multimedia file obtaining module 401, configured to obtain a multimedia file, where the multimedia file has a multi-frame audio signal;
a local feature extraction module 402, configured to extract local features from a multi-frame audio signal of the multimedia file, and obtain multi-frame audio feature vectors;
an audio encoding module 403, configured to encode a plurality of frames of the audio feature vectors to obtain a first background vector;
an audio decoding module 404, configured to globally decode the first background vector under the condition that attention about the audio signal is added to the first background vector, so as to obtain a note expressed by the audio signal.
In one embodiment of the present invention, the local feature extraction module 402 includes:
the convolution layer determining module is used for determining a convolution layer;
and the convolution operation module is used for inputting the multi-frame audio signals of the multimedia file into the convolution layer to carry out convolution operation, so as to obtain multi-frame audio feature vectors.
In one embodiment of the present invention, the audio encoding module 403 includes:
an encoder determination module for determining an encoder trained in advance for the audio signal;
and the first background vector coding module is used for inputting a plurality of frames of the audio feature vectors into the coder for coding so as to output a multi-frame hiding state positioned at the last first time step in the coder as a first background vector.
In one embodiment of the present invention, the audio decoding module 404 includes:
a decoder determination module for determining a decoder trained in advance for the audio signal;
a second background vector calculation module for calculating a second background vector at a current second time step in the decoder based on all of the first background vectors at present to express attention to the audio signal;
and the second background vector decoding module is used for inputting the second background vector into the decoder for global decoding so as to output notes expressed by the audio signal.
In one embodiment of the present invention, the second background vector calculation module includes:
a first concealment state determining module, configured to determine a first concealment state at each current first time step in an encoder, where the encoder is configured to encode a plurality of frames of the audio feature vector;
a weight configuration module, configured to configure weights related to the attention of the audio signal for each of the first hidden states under the condition of synchronizing the current second time step in the decoder;
a second background vector solving module for calculating a second background vector at a current second time step in the decoder based on the first hidden state and the weights to express attention to the audio signal.
In one embodiment of the present invention, the weight configuration module includes:
a second concealment state determination module for determining a second concealment state at the previous second time step in the decoder;
a correlation calculation module, configured to input the first hidden state and the second hidden state into an attention mechanism adapted to the audio signal, so as to output a correlation;
and the correlation activation module is used for activating the correlation and obtaining the weight related to the attention of the audio signal.
In one embodiment of the present invention, the second background vector solving module includes:
a product calculation module, configured to calculate, for each first time step in the encoder, a product between the first concealment state and the weight;
a summation module for calculating a sum between the products for all first time steps in the encoder, obtaining a second background vector at a current second time step in the decoder to express attention to the audio signal.
The audio beat detection device provided by the embodiment of the invention can execute the audio beat detection method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example III
Fig. 5 is a block diagram of a computer device according to a third embodiment of the present invention, where the computer device may specifically include:
a feature extractor 501, configured to extract local features from a multi-frame audio signal of a multimedia file, and obtain multi-frame audio feature vectors;
an encoder 502, configured to encode a plurality of frames of the audio feature vectors to obtain a first background vector;
and the decoder 503 with attention is used for globally decoding the first background vector under the condition that the attention about the audio signal is added to the first background vector, so as to obtain the notes expressed by the audio signal.
In one embodiment of the invention, the feature extractor 501 comprises:
and the convolution layer is used for carrying out convolution operation on multi-frame audio signal input of the multimedia file to obtain multi-frame audio feature vectors.
In one embodiment of the present invention, the encoder 502 is trained on audio signals in advance, and is further configured to:
and encoding the multi-frame audio feature vector input to output multi-frame hiding states at the last first time step in the encoder as a first background vector.
In one embodiment of the invention, the attentive decoder 503 includes:
a decoder trained in advance for the audio signal;
an attention module for calculating a second background vector at a current second time step in the decoder based on all of the first background vectors currently to express attention to the audio signal;
the decoder is used for inputting the second background vector into the decoder to perform global decoding so as to output notes expressed by the audio signal.
In one embodiment of the invention, the attention module is further configured to:
determining a first concealment state at each current first time step in an encoder for encoding a plurality of frames of the audio feature vector;
under the condition of synchronizing the current second time step in the decoder, respectively configuring weights related to the attention of the audio signal for each of the first hidden states;
a second background vector at a current second time step in the decoder is calculated based on the first concealment state and the weights to express attention to the audio signal.
In one embodiment of the invention, the attention module is further configured to:
determining a first concealment state for each current first time step in the encoder;
determining a second concealment state at the previous second time step in the decoder;
inputting the first hidden state and the second hidden state into an attention mechanism adapted to the audio signal to output a correlation;
activating the correlation to obtain a weight related to the attention of the audio signal.
In one embodiment of the invention, the decoder is further configured to:
calculating a product between the first concealment state and the weights for each first time step in the encoder;
the sum between the products is calculated for all first time steps in the encoder, and a second background vector is obtained at a current second time step in the decoder to express attention to the audio signal.
The computer equipment provided by the embodiment of the invention can execute the audio beat detection method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example IV
Fig. 6 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention. FIG. 6 illustrates a block diagram of an exemplary computer device 12 suitable for use in implementing embodiments of the present invention. The computer device 12 shown in fig. 6 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.
As shown in FIG. 6, the computer device 12 is in the form of a general purpose computing device. Components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, a bus 18 that connects the various system components, including the system memory 28 and the processing units 16.
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computer device 12 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 6, commonly referred to as a "hard disk drive"). Although not shown in fig. 6, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be coupled to bus 18 through one or more data medium interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored in, for example, memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methods of the embodiments described herein.
The computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), one or more devices that enable a user to interact with the computer device 12, and/or any devices (e.g., network card, modem, etc.) that enable the computer device 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Moreover, computer device 12 may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through network adapter 20. As shown, network adapter 20 communicates with other modules of computer device 12 via bus 18. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with computer device 12, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processing unit 16 executes various functional applications and data processing by running a program stored in the system memory 28, for example, to implement the audio beat detection method provided by the embodiment of the present invention.
Example five
The fifth embodiment of the present invention further provides a computer readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements each process of the above audio beat detection method and can achieve the same technical effects. To avoid repetition, details are not described again here.
The computer readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims (10)

1. A method for detecting an audio beat, comprising:
acquiring a multimedia file, wherein the multimedia file is provided with multi-frame audio signals;
extracting local characteristics from multi-frame audio signals of the multimedia file to obtain multi-frame audio characteristic vectors;
encoding the multi-frame audio feature vector to obtain a first background vector;
globally decoding the first background vector under the condition that attention about the audio signal is added to the first background vector, and obtaining notes expressed by the audio signal;
the encoding the multi-frame audio feature vector to obtain a first background vector includes:
determining an encoder trained in advance for the audio signal;
inputting a plurality of frames of the audio feature vectors into the encoder for encoding so as to output a plurality of frames of hidden states positioned at the last first time step in the encoder as a first background vector.
2. The method of claim 1, wherein extracting local features from the multi-frame audio signal of the multimedia file to obtain multi-frame audio feature vectors comprises:
determining a convolution layer;
inputting the multi-frame audio signals of the multimedia file into the convolution layer for convolution operation to obtain multi-frame audio feature vectors.
3. The method according to any of claims 1-2, wherein said globally decoding the first background vector with added attention to the audio signal to obtain notes expressed by the audio signal comprises:
determining a decoder trained in advance for the audio signal;
calculating a second background vector at a current second time step in the decoder based on all of the first background vectors currently to express attention to the audio signal;
and inputting the second background vector into the decoder for global decoding so as to output notes expressed by the audio signal.
4. A method according to claim 3, wherein said calculating a second background vector at a current second time step in said decoder based on all of said first background vectors present to express attention to said audio signal comprises:
determining a first concealment state at each current first time step in an encoder for encoding a plurality of frames of the audio feature vector;
under the condition of synchronizing the current second time step in the decoder, respectively configuring weights related to the attention of the audio signal for each of the first hidden states;
a second background vector at a current second time step in the decoder is calculated based on the first concealment state and the weights to express attention to the audio signal.
5. The method of claim 4, wherein said configuring the respective first concealment states with the respective attention-related weights of the audio signal in synchronization with the current second time step in the decoder comprises:
determining a second concealment state at the previous second time step in the decoder;
inputting the first hidden state and the second hidden state into an attention mechanism adapted to the audio signal to output a correlation;
activating the correlation to obtain a weight related to the attention of the audio signal.
6. The method of claim 4, wherein said calculating a second background vector at a current second time step in the decoder based on the first concealment state and the weights to express attention to the audio signal comprises:
calculating a product between the first concealment state and the weights for each first time step in the encoder;
the sum between the products is calculated for all first time steps in the encoder, and a second background vector is obtained at a current second time step in the decoder to express attention to the audio signal.
7. An apparatus for detecting an audio beat, comprising:
the multimedia file acquisition module is used for acquiring a multimedia file, wherein the multimedia file is provided with multi-frame audio signals;
the local feature extraction module is used for extracting local features from multi-frame audio signals of the multimedia file to obtain multi-frame audio feature vectors;
the audio coding module is used for coding the multi-frame audio feature vectors to obtain a first background vector;
an audio decoding module, configured to globally decode the first background vector under a condition that attention about the audio signal is added to the first background vector, to obtain a note expressed by the audio signal;
the audio encoding module includes:
an encoder determination module for determining an encoder trained in advance for the audio signal;
and the first background vector coding module is used for inputting a plurality of frames of the audio feature vectors into the coder for coding so as to output a multi-frame hiding state positioned at the last first time step in the coder as a first background vector.
8. A computer device, comprising:
the feature extractor is used for extracting local features from multi-frame audio signals of the multimedia file to obtain multi-frame audio feature vectors;
the encoder is used for encoding the multi-frame audio feature vectors to obtain a first background vector;
a decoder with attention for globally decoding the first background vector under the condition of adding attention to the audio signal to the first background vector to obtain notes expressed by the audio signal;
the encoder is trained in advance for audio signals, and is further configured to:
and encoding the multi-frame audio feature vector input to output multi-frame hiding states at the last first time step in the encoder as a first background vector.
9. A computer device, the computer device comprising:
one or more processors;
a memory for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of detecting an audio beat as defined in any one of claims 1-6.
10. A computer readable storage medium, wherein a computer program is stored on the computer readable storage medium, which when executed by a processor implements the method for detecting audio beats according to any one of claims 1-6.
CN202110565138.6A 2021-05-24 2021-05-24 Audio beat detection method and device, computer equipment and storage medium Active CN113299318B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110565138.6A CN113299318B (en) 2021-05-24 2021-05-24 Audio beat detection method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110565138.6A CN113299318B (en) 2021-05-24 2021-05-24 Audio beat detection method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113299318A CN113299318A (en) 2021-08-24
CN113299318B true CN113299318B (en) 2024-02-23

Family

ID=77324250

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110565138.6A Active CN113299318B (en) 2021-05-24 2021-05-24 Audio beat detection method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113299318B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018194456A1 (en) * 2017-04-20 2018-10-25 Universiteit Van Amsterdam Optical music recognition omr : converting sheet music to a digital format
CN110852181A (en) * 2019-10-18 2020-02-28 天津大学 Piano music score difficulty identification method based on attention mechanism convolutional neural network
WO2020136948A1 (en) * 2018-12-26 2020-07-02 日本電信電話株式会社 Speech rhythm conversion device, model learning device, methods for these, and program

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018194456A1 (en) * 2017-04-20 2018-10-25 Universiteit Van Amsterdam Optical music recognition omr : converting sheet music to a digital format
WO2020136948A1 (en) * 2018-12-26 2020-07-02 日本電信電話株式会社 Speech rhythm conversion device, model learning device, methods for these, and program
CN110852181A (en) * 2019-10-18 2020-02-28 天津大学 Piano music score difficulty identification method based on attention mechanism convolutional neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
END-TO-END MELODY NOTE TRANSCRIPTION BASED ON A BEAT-SYNCHRONOUS ATTENTION MECHANISM; Nishikimi, R. et al.; IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA); full text *

Also Published As

Publication number Publication date
CN113299318A (en) 2021-08-24

Similar Documents

Publication Publication Date Title
CN109874029B (en) Video description generation method, device, equipment and storage medium
CN109117777B (en) Method and device for generating information
CN109785824B (en) Training method and device of voice translation model
CN112866586B (en) Video synthesis method, device, equipment and storage medium
CN110622176A (en) Video partitioning
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN112509555B (en) Dialect voice recognition method, device, medium and electronic equipment
CN110263218B (en) Video description text generation method, device, equipment and medium
CN112825249A (en) Voice processing method and device
CN112818670B (en) Segmentation grammar and semantics in a decomposable variant automatic encoder sentence representation
CN117337467A (en) End-to-end speaker separation via iterative speaker embedding
CN113392265A (en) Multimedia processing method, device and equipment
CN113450774A (en) Training data acquisition method and device
CN112017643B (en) Speech recognition model training method, speech recognition method and related device
CN113299318B (en) Audio beat detection method and device, computer equipment and storage medium
Ivanko et al. Designing advanced geometric features for automatic Russian visual speech recognition
CN114707518B (en) Semantic fragment-oriented target emotion analysis method, device, equipment and medium
CN116363250A (en) Image generation method and system
CN114360491B (en) Speech synthesis method, device, electronic equipment and computer readable storage medium
CN113762056A (en) Singing video recognition method, device, equipment and storage medium
CN113239215A (en) Multimedia resource classification method and device, electronic equipment and storage medium
CN115512692B (en) Voice recognition method, device, equipment and storage medium
CN115982343B (en) Abstract generation method, and method and device for training abstract generation model
Xie et al. Global-shared Text Representation based Multi-Stage Fusion Transformer Network for Multi-modal Dense Video Captioning
CN113096687B (en) Audio and video processing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant