CN117746906A - Music file identification method and device, equipment and medium - Google Patents


Info

Publication number
CN117746906A
CN117746906A
Authority
CN
China
Prior art keywords
music file
level
identified
music
audio
Prior art date
Legal status
Pending
Application number
CN202311764704.1A
Other languages
Chinese (zh)
Inventor
方晓胤
Current Assignee
China Mobile Communications Group Co Ltd
MIGU Music Co Ltd
MIGU Culture Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
MIGU Music Co Ltd
MIGU Culture Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, MIGU Music Co Ltd, MIGU Culture Technology Co Ltd
Priority to CN202311764704.1A
Publication of CN117746906A
Status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a music file identification method, apparatus, device, and medium. The method includes: performing text recognition on the voice audio data contained in a music file to be identified to determine audio text data of the music file to be identified; performing emotion analysis on the accompaniment audio data contained in the music file to be identified to obtain an emotion prediction result of the music file to be identified; and obtaining target music related information based on the emotion prediction result and the audio text data. The method fills the gap left by music applications in identifying music files, and also improves the audio identification effect and identification efficiency.

Description

Music file identification method and device, equipment and medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a medium for identifying a music file.
Background
Currently, most music applications can perform voice interaction or speech recognition, but they have difficulty identifying unknown music files and acquiring related information about the unknown music files.
In the related art, although music can be identified by speech recognition or voiceprint comparison, these music recognition methods merely extract audio features and then search the Internet based on the extracted features, resulting in a poor audio recognition effect and low recognition efficiency.
Disclosure of Invention
According to an aspect of the present disclosure, there is provided a music file identification method including:
carrying out text recognition on voice audio data contained in a music file to be recognized, and determining audio text data of the music file to be recognized;
carrying out emotion analysis on accompaniment audio data contained in the music file to be identified to obtain an emotion prediction result of the music file to be identified;
and acquiring target music related information corresponding to the music file to be identified based on the emotion prediction result and the audio text data.
According to another aspect of the present disclosure, there is provided a music file recognition apparatus, comprising:
the identification module is used for carrying out text identification on voice audio data contained in the music file to be identified, determining audio text data of the music file to be identified, carrying out emotion analysis on accompaniment audio data contained in the music file to be identified, and obtaining emotion prediction results of the music file to be identified;
and the searching module is used for acquiring the target music related information corresponding to the music file to be identified based on the emotion prediction result and the audio text data.
According to another aspect of the present disclosure, there is provided an electronic device including:
a processor; and,
a memory storing a program;
wherein the program comprises instructions which, when executed by the processor, cause the processor to perform the method according to an exemplary embodiment of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method according to an exemplary embodiment of the present disclosure.
According to one or more technical solutions provided by the exemplary embodiments of the present disclosure, text recognition can be performed on the voice audio data contained in a music file to be identified to determine the audio text data of the music file to be identified, and emotion analysis can be performed on the accompaniment audio data contained in the music file to be identified to obtain an emotion prediction result of the music file to be identified. On this basis, a music information search is performed based on the emotion prediction result and the audio text data, so that target music related information corresponding to the music file to be identified is obtained. Therefore, the method of the exemplary embodiments of the present disclosure is applicable to searching for music related information of unknown music, filling a technical gap of music applications in the related art.
In addition, searching for music information based on the emotion prediction result and the audio text data narrows the search range of the music information and improves the efficiency of searching for music related information corresponding to the music file to be identified; since the emotion prediction result and the audio text data are combined in the search, the accuracy of the retrieved target music related information is also ensured.
Drawings
Further details, features and advantages of the present disclosure are disclosed in the following description of exemplary embodiments, with reference to the following drawings, wherein:
fig. 1 shows a flowchart of a music file recognition method according to an exemplary embodiment of the present disclosure;
FIG. 2 shows a schematic diagram of the structure of a Valence-Arousal model according to an exemplary embodiment of the present disclosure;
FIG. 3 illustrates a flow diagram of a music source feature extraction and separation model in accordance with an exemplary embodiment of the present disclosure;
FIG. 4 illustrates a structural framework diagram of an encoder-decoder network of an exemplary embodiment of the present disclosure;
fig. 5 shows a schematic block diagram of a decoder according to an exemplary embodiment of the present disclosure;
FIG. 6 illustrates an architectural diagram of an audio text recognition model in accordance with an exemplary embodiment of the present disclosure;
FIG. 7 illustrates an architectural diagram of a music emotion analysis model of an exemplary embodiment of the present disclosure;
fig. 8 shows a functional block diagram of a music file recognition apparatus according to an exemplary embodiment of the present disclosure;
FIG. 9 shows a schematic block diagram of a chip according to an exemplary embodiment of the present disclosure;
fig. 10 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below. It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "a" and "a plurality" in this disclosure are intended to be illustrative rather than limiting; those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
Before describing embodiments of the present disclosure, the following definitions are first provided for the relative terms involved in the embodiments of the present disclosure:
Sparse decomposition decomposes an original signal on a non-orthogonal basis (an overcomplete atom library) using the Matching Pursuit (MP) algorithm; it can adaptively construct an overcomplete dictionary according to the characteristics of the signal itself and represent the signal with a linear combination of a few atoms that match those characteristics.
The fast fourier transform (Fast Fourier Transform, FFT) is a fast algorithm for discrete fourier transform by which a signal can be transformed from the time domain to the frequency domain.
The short-time Fourier transform (STFT) establishes a link between the frequency domain and the time domain of a signal: the signal is intercepted in the time domain by a window function of fixed width, and the window position is moved sequentially until the whole signal is decomposed into a series of small segments, on each of which Fourier analysis is then performed.
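As a non-limiting illustration (not part of the claimed method), the following Python sketch shows how a signal may be moved from the time domain to the frequency domain with an FFT and how an STFT windows the signal before Fourier analysis; the sampling rate, test tone, and window parameters are illustrative assumptions.

```python
# Illustrative sketch only: FFT and STFT with NumPy/SciPy; parameters are assumed.
import numpy as np
from scipy import signal

fs = 16000                                    # assumed sampling rate (Hz)
t = np.arange(0, 1.0, 1 / fs)
x = np.sin(2 * np.pi * 440 * t)               # toy 440 Hz tone standing in for audio

# FFT: transform the whole signal from the time domain to the frequency domain.
spectrum = np.fft.rfft(x)
freqs = np.fft.rfftfreq(len(x), d=1 / fs)

# STFT: slide a fixed-width window over the signal and Fourier-analyse each segment.
f, frames, Zxx = signal.stft(x, fs=fs, window="hann", nperseg=1024, noverlap=512)
print(spectrum.shape, Zxx.shape)              # (8001,) and (513, number_of_frames)
```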
The attention mechanism (Attention Mechanism) is mainly used for screening input characteristic information, reserving or strengthening the influence of relatively important information in a characteristic diagram according to the setting, weakening or even discarding unimportant information, and therefore improving the performance of the whole neural network model.
The skip-attention mechanism (Skip-Attention) is a structure that can replace the plain skip connections in a convolutional encoder-decoder network (Convolutional Encoder-Decoder Network, CEDN) to realize information exchange between downsampling feature blocks and upsampling feature blocks at the same level, and can effectively control the flow of low-level features to high-level features in the network.
UNet++ is an efficient medical image segmentation architecture: a deeply supervised encoder-decoder network in which the encoder and decoder are connected by a series of nested, dense skip paths.
The GammaTone filter is a group of filter models that simulate the frequency decomposition characteristics of the cochlea; it can be used to decompose audio signals and facilitates subsequent feature extraction.
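As a non-limiting illustration, the following sketch decomposes an audio signal with a small bank of gammatone filters built from the standard fourth-order gammatone impulse response; the filter order, bandwidth factor, and centre frequencies are common defaults assumed here and are not taken from the present disclosure.

```python
# Hedged sketch: gammatone filter bank decomposition (textbook form, assumed parameters).
import numpy as np
from scipy.signal import fftconvolve

def gammatone_ir(fc, fs, duration=0.05, order=4, b=1.019):
    """Impulse response t^(n-1) * exp(-2*pi*b*ERB(fc)*t) * cos(2*pi*fc*t)."""
    t = np.arange(0, duration, 1 / fs)
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)    # equivalent rectangular bandwidth
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * erb * t) * np.cos(2 * np.pi * fc * t)
    return g / np.max(np.abs(g))

def cochlear_decompose(audio, fs, centre_freqs=(125, 250, 500, 1000, 2000, 4000)):
    """Split audio into sub-bands by convolving with each gammatone impulse response."""
    return np.stack([fftconvolve(audio, gammatone_ir(fc, fs), mode="same")
                     for fc in centre_freqs])

fs = 16000
audio = np.random.randn(fs)                    # placeholder one-second signal
print(cochlear_decompose(audio, fs).shape)     # (6, 16000)
```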
The spatial attention mechanism is an adaptive process for selecting salient spatial regions; compared with channel attention, it is applied to a wider range of tasks such as fine-grained classification and image captioning.
In daily life, people widely use various music playing applications for playing music, the music playing applications have rich music libraries, and users can input relevant information of music in a search box to search music files to be played and then play the searched music files. However, these music playing applications do not have a search function for unknown music files, and therefore it is a relatively important task to develop a search function for unknown music files for music playing applications.
At present, speech recognition or voiceprint comparison technology is generally used for music recognition. Such recognition simply extracts audio features of a music file and then performs an Internet-wide search, so that both the search accuracy and the search efficiency are low. Meanwhile, an Internet-wide search of audio features requires one device to play the music file while another device recognizes it, making this music file recognition approach poor in practicability.
In view of the above problems, exemplary embodiments of the present disclosure provide a music file identification method, which may be suitable for various music applications, and has strong practicality, and may identify unknown music files, and improve the accuracy and efficiency of music file identification.
In practical applications, the music file of the exemplary embodiments of the present disclosure may be a music file in the generalized sense: it may include not only a music file in the narrow sense, but also speech accompanied by music, such as recitations of articles set to music, all of which belong to the music file of the exemplary embodiments of the present disclosure. The music file may be a complete music file or a partial section of a complete music file.
The music file may be an unknown music file recorded or a known music file directly imported into the electronic device. The known and unknown herein may be that the music file is unknown or known relative to the user initiating the music file identification.
For example, if a user needs to record a music file, the recording function of the electronic device may be directly opened or called by various applications with recording functions deployed on the electronic device, and the music file may be recorded, or the music file may be directly imported into the electronic device. For example: the music file may be sent to the electronic device via instant messaging software.
The music file identification method of the exemplary embodiments of the present disclosure may be applied to various electronic devices or chips in electronic devices, which may include user devices, servers, and the like. The user device may be an electronic device with a display function, for example: a cell phone, a tablet, a wearable device, an in-vehicle device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), a wearable device based on augmented reality (AR) and/or virtual reality (VR) technology, and the like.
For example, when the user device is a wearable device, the wearable device may be a general term for devices developed by applying wearable technology to the intelligent design of daily wear, such as glasses, gloves, watches, clothes, and shoes. A wearable device is a portable device that is worn directly on the body or integrated into the clothing or accessories of the user. The wearable device is not only a hardware device, but can also realize powerful functions through software support, data interaction, and cloud interaction. Wearable intelligent devices in the generalized sense include full-featured, large-sized devices that can realize complete or partial functions without relying on a smartphone, such as smart watches or smart glasses, as well as devices that focus only on a certain type of application function and need to be used together with other devices such as a smartphone, for example, various smart bands and smart jewelry for monitoring physical signs.
Fig. 1 shows a flowchart of a music file recognition method according to an exemplary embodiment of the present disclosure. As shown in fig. 1, a music file recognition method of an exemplary embodiment of the present disclosure includes:
step 101: and carrying out text recognition on the voice audio data contained in the music file to be recognized, and determining the audio text data of the music file to be recognized.
In practical applications, when the exemplary embodiments of the present disclosure perform text recognition on the voice audio data contained in the music file to be identified and determine the audio text data of the music file to be identified, what is essentially recognized is the text data contained in the voice audio data, thereby obtaining the audio text data of the music file to be identified. In implementation, valid audio mel-spectrum features may be determined from the voice audio data, hidden vectors corresponding to the valid audio mel-spectrum features may be determined based on the valid audio mel-spectrum features, and the audio text data of the music file to be identified may be obtained based on the hidden vectors corresponding to the valid audio mel-spectrum features.
A valid audio mel-spectrum feature of the exemplary embodiments of the present disclosure may refer to a segment of mel-spectrum features in which valid voice (i.e., high-frequency voice) exists. The human voice audio data may be converted into a mel-spectrum feature sequence corresponding to the human voice audio data, and then one or more valid audio mel-spectrum features are obtained from this mel-spectrum feature sequence based on valid feature interception parameters, while the mel-spectrum features of invalid audio (i.e., low-frequency voice) are discarded, thereby alleviating the computational pressure of subsequent text recognition.
Step 102: and carrying out emotion analysis on accompaniment audio data contained in the music file to be identified to obtain an emotion prediction result of the music file to be identified.
The emotion prediction result of the exemplary embodiments of the present disclosure may describe the emotion space to which it belongs through a discrete emotion space description model or a dimensional emotion space description model. When the emotion space to which the emotion prediction result belongs is described through the discrete emotion space description model, the emotion prediction result can be described through emotion labels such as anger, happiness, surprise, disgust, sadness, and fear, which has natural interpretability. When the emotion space to which the emotion prediction result belongs is described through the dimensional emotion space description model, different emotions can be represented by continuous multidimensional vectors. Each point in the dimensional emotion space represents an emotion; owing to the continuity of the numerical vectors, differences and continuity between different emotions can be conveniently calculated. The dimensional emotion space description model may be a Valence-Arousal model, a Plutchik emotion wheel model, or the like.
FIG. 2 shows a schematic diagram of the structure of a Valence-Arousal model according to an exemplary embodiment of the present disclosure. As shown in fig. 2, the Valence-Arousal model of the exemplary embodiment of the present disclosure divides emotion into the two dimensions of Arousal (degree of activation) and Valence, and represents the emotion space corresponding to their numerical variation in coordinate form. As can be seen from fig. 2, the degree of activation (Arousal) indicates how intense or activated the emotion is, for example excitement versus calm, while the valence (Valence) indicates how positive or negative the emotion is, for example happiness versus fear; the greater the degree of activation and the higher the valence, the more intense and positive the emotion, and vice versa.
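Purely for illustration, the following sketch maps a continuous (valence, arousal) prediction to a coarse emotion quadrant; the quadrant labels are common examples and are not the label set of the present disclosure.

```python
# Illustrative sketch only: coarse quadrant lookup in the Valence-Arousal plane.
def emotion_quadrant(valence: float, arousal: float) -> str:
    """valence and arousal are assumed to lie in [-1, 1]."""
    if valence >= 0 and arousal >= 0:
        return "excited / happy"      # positive valence, high arousal
    if valence < 0 and arousal >= 0:
        return "angry / fearful"      # negative valence, high arousal
    if valence < 0 and arousal < 0:
        return "sad / depressed"      # negative valence, low arousal
    return "relaxed / calm"           # positive valence, low arousal

print(emotion_quadrant(0.6, 0.7))     # -> "excited / happy"
```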
Step 103: and searching music information based on the emotion prediction result and the audio text data to obtain target music related information corresponding to the music file to be identified.
The target music related information of the exemplary embodiments of the present disclosure may include attribute information of the music file to be identified, such as the song name, singer, album name, release time, and upload time of the music file to be identified. Meanwhile, in order to provide the user with richer information related to the file to be identified, the emotion prediction result of the music file and the audio text data of the music file to be identified may also be collected and pushed to the user terminal as part of the target music related information.
The method of the disclosed exemplary embodiment can be applied to various music applications such as music playing applications and music searching applications, and can search for unknown music files or known music files which are recorded or uploaded by a user and wait for identification of the music files, so as to obtain target music related information corresponding to the music files to be identified.
The method of the disclosed exemplary embodiment can complete recording and identifying functions through one device, can also introduce music files to be identified from instant messaging application or other storage devices without recording, and can identify the music files.
In addition, searching for music information based on the emotion prediction result and the audio text data narrows the search range of the music information and improves the efficiency of searching for music related information corresponding to the music file to be identified; since the emotion prediction result and the audio text data are combined in the search, the accuracy of the retrieved music related information is also ensured.
In an alternative manner, when the music file to be identified is recorded, the surrounding environment may be noisy and the noise relatively loud, so that the accuracy of music file identification is not high. On this basis, the method of the exemplary embodiments of the present disclosure may further include: performing noise reduction processing on the music file to be identified in a sparse decomposition manner. In the exemplary embodiments of the present disclosure, the format of the music file to be identified may also be converted before the noise reduction processing to obtain a music file to be identified in an identifiable format. When the sparse decomposition manner is adopted to denoise the audio data, the problem of low recognition accuracy of the music file to be identified caused by noise can be solved.
When noise reduction processing is performed on the music file to be identified in the sparse decomposition manner, iterative sparse decomposition may be performed on the music file to be identified repeatedly based on the overcomplete atom library to obtain the music file to be identified after multiple iterations of sparse decomposition; if the difference parameter of the music files to be identified after the last two iterations of sparse decomposition satisfies the noise reduction termination condition, the music file to be identified after the last iteration of sparse decomposition is determined to be the noise-reduced music file to be identified.
The difference parameter of the music files to be identified after the last two iterations of sparse decomposition may be the ratio of the music files to be identified after the last two iterations of sparse decomposition, and the noise reduction termination condition may include that this ratio is smaller than a residual threshold parameter. The specific value of the residual threshold parameter may be determined according to the identification accuracy required for the music file to be identified, and is not specifically limited herein. It should be understood that, when the music file to be identified is subjected to iterative sparse decomposition, the object of each iteration of sparse decomposition is the music file to be identified obtained after the previous iteration of sparse decomposition.
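As a non-limiting sketch of this idea, the following code performs greedy matching pursuit over an overcomplete dictionary and stops when the residual energy of two consecutive iterations barely changes, which is one plausible reading of the ratio-based termination condition; the random dictionary, the tolerance value, and the iteration cap are illustrative assumptions.

```python
# Simplified matching-pursuit denoising sketch; dictionary and thresholds are assumed.
import numpy as np

def matching_pursuit_denoise(x, dictionary, tolerance=1e-3, max_iter=200):
    """Greedily approximate x with unit-norm dictionary atoms; stop when the ratio of
    consecutive residual energies is within `tolerance` of 1 (little further change)."""
    residual = x.astype(float).copy()
    approx = np.zeros_like(residual)
    prev_energy = np.linalg.norm(residual)
    for _ in range(max_iter):
        correlations = dictionary.T @ residual      # match the residual against every atom
        k = np.argmax(np.abs(correlations))         # best-matching atom
        coeff = correlations[k]
        approx += coeff * dictionary[:, k]
        residual -= coeff * dictionary[:, k]
        energy = np.linalg.norm(residual)
        if prev_energy > 0 and abs(1.0 - energy / prev_energy) < tolerance:
            break
        prev_energy = energy
    return approx                                    # denoised approximation of x

rng = np.random.default_rng(0)
atoms = rng.standard_normal((256, 1024))
atoms /= np.linalg.norm(atoms, axis=0)               # unit-norm overcomplete atom library
noisy_signal = rng.standard_normal(256)
denoised = matching_pursuit_denoise(noisy_signal, atoms)
```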
The overcomplete atom library employed in the exemplary embodiments of the present disclosure may be pre-established from a large amount of music audio data, and an exemplary expression thereof may refer to formula one:
where E(f) represents the short-time average zero-crossing rate, sgn represents the sign function, g represents the resonance function of the sound pressure wave in the sound channel, n represents the length of the audio data sequence, i represents the index of the audio data sequence, x_i represents the i-th audio data sample, x_{i-1} represents the (i-1)-th audio data sample, f represents the mathematical function of the audio data, f = f_r + f_z, f_r represents the noiseless music audio data, and f_z represents the noise data.
In an alternative manner, the method of the exemplary embodiment of the present disclosure may further include, before performing text recognition and emotion analysis: and extracting and separating the music source characteristics of the music file to be identified, so that the required voice audio data and accompaniment audio data are acquired from the music file to be identified.
In practical applications, in order to ensure that the music application can provide the user with richer music related information, when the voice audio data is a voice audio amplitude spectrogram and the accompaniment audio data is an accompaniment audio amplitude spectrogram, the exemplary embodiments of the present disclosure may also determine a voice audio file based on the voice audio amplitude spectrogram and the mixed phase spectrogram of the music file to be identified, and determine an accompaniment audio file based on the accompaniment audio amplitude spectrogram and the mixed phase spectrogram of the music file to be identified. For example, the exemplary embodiments of the present disclosure may convert the music file to be identified into a mixed amplitude spectrum of the music file to be identified and a phase spectrum of the music file to be identified through the STFT.
The exemplary embodiments of the present disclosure may input the mixed amplitude spectrum of the music file to be identified into the encoder-decoder network to obtain a vocal audio amplitude spectrum and an accompaniment audio amplitude spectrum, input the phase spectrum of the music file to be identified and the vocal audio amplitude spectrum into an inverse short-time Fourier transform (ISTFT) to reconstruct the vocal audio file, and input the phase spectrum of the music file to be identified and the accompaniment audio amplitude spectrum into the ISTFT to reconstruct the accompaniment audio file.
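The flow described above can be illustrated, under assumptions, by the short sketch below: the mixture is transformed with an STFT, the magnitude is split by two masks (here random placeholders standing in for the encoder-decoder network's output), and each masked magnitude is recombined with the mixture phase and inverted with an ISTFT; librosa and the frame parameters are assumptions, not requirements of the present disclosure.

```python
# Hedged illustration of STFT -> magnitude/phase -> masking -> ISTFT reconstruction.
import numpy as np
import librosa

sr = 22050
mixture = np.sin(2 * np.pi * 220 * np.arange(2 * sr) / sr)     # placeholder mixture signal

spec = librosa.stft(mixture, n_fft=2048, hop_length=512)
mix_mag, mix_phase = np.abs(spec), np.angle(spec)

# Placeholder masks standing in for the network-predicted masking information.
vocal_mask = np.random.rand(*mix_mag.shape)
accomp_mask = 1.0 - vocal_mask

vocal_spec = mix_mag * vocal_mask * np.exp(1j * mix_phase)      # reuse the mixture phase
accomp_spec = mix_mag * accomp_mask * np.exp(1j * mix_phase)

vocal_audio = librosa.istft(vocal_spec, hop_length=512)         # reconstructed vocal file
accomp_audio = librosa.istft(accomp_spec, hop_length=512)       # reconstructed accompaniment
```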
Fig. 3 shows a flow diagram of a music source feature extraction and separation model according to an exemplary embodiment of the present disclosure. As shown in fig. 3, the music source feature extraction and separation of the music file to be identified according to the exemplary embodiment of the present disclosure may include:
step 301: and extracting accompaniment audio masking information and voice audio masking information from the mixed magnitude spectrum of the music file to be identified.
In practical use, the exemplary embodiments of the present disclosure extract the accompaniment audio masking information and the vocal audio masking information from the mixed amplitude spectrum of the music file to be identified using an encoder-decoder network. An encoder-decoder network typically includes an encoder module and a decoder module, which may be connected by a skip connection structure to communicate information between the encoder module and the decoder module.
Illustratively, when the encoder module of the exemplary embodiments of the present disclosure includes N levels of encoders and the decoder module includes N levels of decoders, the mixed amplitude spectrum of the music file to be identified is input into the N levels of encoders connected in series to obtain N levels of amplitude spectrum encoding features; the N levels of amplitude spectrum encoding features are input into the N levels of decoders connected in series to obtain the N-th level amplitude spectrum decoding feature; and the accompaniment audio masking information and the human voice audio masking information are then determined based on the mixed amplitude spectrum of the music file to be identified and the N-th level amplitude spectrum decoding feature, where each level of decoder is used to output the amplitude spectrum decoding feature of the corresponding level, each level of encoder is used to output the amplitude spectrum encoding feature of the corresponding level, and N represents an integer greater than or equal to 2.
step 302: and determining the voice audio data based on the mixed amplitude spectrum of the music file to be identified and the voice audio masking information. For example: the mixed magnitude spectrum of the music file to be identified and the voice audio masking information can be dot multiplied, so that voice audio data is obtained. At this time, the human voice audio data may be a human voice audio amplitude spectrogram.
Step 303: the accompaniment audio data is determined based on the mixed amplitude spectrum of the music file to be identified and the accompaniment audio masking information. For example: the mixed magnitude spectrum of the music file to be identified and the accompaniment audio masking information may be dot multiplied to obtain accompaniment audio data. At this time, the accompaniment audio data may be an accompaniment audio amplitude spectrogram.
According to the exemplary embodiments of the present disclosure, the N levels of encoder modules and the N levels of decoder modules connected in series can be connected through the skip connection structure, so that the skip attention module is used to control the flow of the low-level features output by the encoder module to the high-level features of the decoder module. This alleviates the problem of the decoder module losing boundary information (such as accompaniment audio features and voice audio features) and ensures that the output accompaniment audio masking information and voice audio masking information are accurate.
Illustratively, the input of the (N-n+1)-th level encoder and the output of the (n-1)-th level decoder are both connected to the input of the n-th level decoder through a first skip attention module, so that the input feature of the n-th level decoder is a first skip attention feature determined by the (n-1)-th level amplitude spectrum decoding feature and the input feature of the (N-n+1)-th level encoder, where n represents an integer greater than or equal to 2 and less than or equal to N.
In other words, the first skip attention module may be provided at each of the input of the level 2 decoder to the input of the level N decoder such that the input characteristic of the level 2 decoder is a first skip attention characteristic determined by the level 1 amplitude decoding characteristic and the input characteristic of the N-1 encoder, the input characteristic of the level 3 decoder is a first skip attention characteristic determined by the level 2 amplitude decoding characteristic and the input characteristic of the N-2 encoder, … …, and the input characteristic of the level N decoder is a first skip attention characteristic determined by the level N-1 amplitude decoding characteristic and the input characteristic of the level 1 encoder.
For example, the first skip attention module of the exemplary embodiments of the present disclosure may be configured to determine first weight information based on the (n-1)-th level amplitude spectrum decoding feature and the (N-n+1)-th level amplitude spectrum encoding feature, and may derive the input feature of the n-th level decoder based on the (N-n+1)-th level amplitude spectrum encoding feature and the first weight information.
Illustratively, in the exemplary embodiments of the present disclosure, determining the accompaniment audio masking information and the vocal audio masking information based on the mixed amplitude spectrum of the music file to be identified and the N-th level amplitude spectrum decoding feature may include: determining second weight information based on the mixed amplitude spectrum of the music file to be identified and the N-th level amplitude spectrum decoding feature, and determining the accompaniment audio masking information and the voice audio masking information based on the mixed amplitude spectrum of the music file to be identified and the second weight information. It can be seen that the exemplary embodiments of the present disclosure may also integrate the N-th level amplitude spectrum decoding feature and the mixed amplitude spectrum of the music file to be identified through a skip attention mechanism, so as to ensure that the output accompaniment audio masking information and voice audio masking information are as accurate as possible.
Considering that the mixed amplitude spectrum of the music file to be identified has been subjected to feature extraction multiple times by the encoder-decoder network, the accompaniment audio features and the vocal audio features contained therein have already been separated; therefore, the N-th level amplitude spectrum decoding feature of the exemplary embodiments of the present disclosure may substantially comprise a two-channel representation of the mixed amplitude spectrum of the music file to be identified, in which the accompaniment audio features and the vocal audio features are already exhibited. On this basis, the obtained second weight information integrates, through the skip attention mechanism, the N-th level amplitude spectrum decoding feature and the mixed amplitude spectrum of the music file to be identified, compensating for the relationship information between the accompaniment audio features and the voice audio features lost by the N levels of decoders connected in series, thereby further improving the output accuracy of the accompaniment audio masking information and the voice audio masking information.
As can be seen from the above, the exemplary embodiments of the present disclosure may selectively perform the skip attention operation between the N levels of encoders and the N levels of decoders connected in series, so as to compensate for the relationship information between the accompaniment audio features and the vocal audio features lost by the N levels of decoders connected in series, thereby ensuring the accuracy of the output accompaniment audio masking information and vocal audio masking information and improving the accuracy of separating the vocal audio data and the accompaniment audio data.
The first weight information and the second weight information of the exemplary embodiments of the present disclosure may both be calculated with reference to the principle of the self-attention mechanism of the Transformer model.
The (N-n+1)-th level amplitude spectrum encoding feature can be linearly mapped through a query mapping matrix to obtain a first query matrix Q; the (n-1)-th level amplitude spectrum decoding feature and the (N-n+1)-th level amplitude spectrum encoding feature are spliced to obtain a first codec splicing feature; the codec splicing feature is then linearly mapped through a value mapping matrix and a key mapping matrix respectively to obtain a corresponding first value matrix V and first key matrix K; and the first weight matrix, namely the first weight information, can then be obtained through the self-attention mechanism formula shown in formula two, i.e., softmax(Q·K^T/√d)·V.
Here, d represents the number of channels of the (N-n+1)-th level amplitude spectrum encoding feature, and the number of channels of the (N-n+1)-th level amplitude spectrum encoding feature is the same as the number of channels of the (n-1)-th level amplitude spectrum decoding feature.
The second weight information is obtained in the same manner as the first weight information: the mixed amplitude spectrum of the music file to be identified is linearly mapped through the query mapping matrix to obtain a second query matrix; the mixed amplitude spectrum of the music file to be identified and the N-th level amplitude spectrum decoding feature are spliced to obtain a second codec splicing feature; the second codec splicing feature is linearly mapped through the value mapping matrix and the key mapping matrix respectively to obtain a corresponding second value matrix and second key matrix; and the second weight matrix, namely the second weight information, can then be obtained through the self-attention mechanism formula shown in formula two.
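A minimal sketch of such a skip attention computation is given below, assuming the feature maps are flattened into token sequences, the encoder and decoder features have matching shapes, and the mappings are simple linear projections; the exact layer shapes are not specified by the present disclosure and are chosen here only for illustration.

```python
# Hedged skip-attention sketch: Q from the encoding feature, K and V from the
# spliced (decoding, encoding) feature, combined as softmax(Q K^T / sqrt(d)) V.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Linear(channels, channels)       # query mapping matrix
        self.key = nn.Linear(2 * channels, channels)     # key mapping matrix
        self.value = nn.Linear(2 * channels, channels)   # value mapping matrix

    def forward(self, enc_feat: torch.Tensor, dec_feat: torch.Tensor) -> torch.Tensor:
        # enc_feat, dec_feat: (batch, channels, freq, time) with identical shapes
        b, c, f, t = enc_feat.shape
        enc = enc_feat.flatten(2).transpose(1, 2)        # (b, f*t, c)
        dec = dec_feat.flatten(2).transpose(1, 2)
        spliced = torch.cat([dec, enc], dim=-1)          # codec splicing feature
        q, k, v = self.query(enc), self.key(spliced), self.value(spliced)
        weights = F.softmax(q @ k.transpose(1, 2) / c ** 0.5, dim=-1)
        out = weights @ v                                # weighted feature
        return out.transpose(1, 2).reshape(b, c, f, t)

enc = torch.randn(1, 16, 32, 32)
dec = torch.randn(1, 16, 32, 32)
print(SkipAttention(16)(enc, dec).shape)                 # torch.Size([1, 16, 32, 32])
```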
In addition, since the N-th level amplitude spectrum decoding feature of the exemplary embodiments of the present disclosure may substantially comprise a two-channel representation of the mixed amplitude spectrum of the music file to be identified, the second weight information obtained based on the mixed amplitude spectrum of the music file to be identified and the N-th level amplitude spectrum decoding feature substantially includes weight information of the accompaniment audio features and weight information of the human voice audio features. On this basis, the human voice audio masking information can be obtained using the weight information of the accompaniment audio features and the mixed amplitude spectrum of the music file to be identified, and the accompaniment audio masking information can be obtained using the weight information of the human voice audio features and the mixed amplitude spectrum of the music file to be identified.
Fig. 4 shows a schematic diagram of a structural framework of an encoder-decoder network of an exemplary embodiment of the present disclosure. As shown in fig. 4, an encoder-decoder network 400 of an exemplary embodiment of the present disclosure may include an encoder module, a decoder module, and an attention module. Wherein the encoder module includes a first stage encoder 401a, a second stage encoder 401b, a third stage encoder 401c, and a fourth stage encoder 401d, and the decoder module includes a first stage decoder 402a, a second stage decoder 402b, a third stage decoder 402c, and a fourth stage decoder 402d. The first stage encoder 401a, the second stage encoder 401b, the third stage encoder 401c, the fourth stage encoder 401d, the first stage decoder 402a, the second stage decoder 402b, the third stage decoder 402c, and the fourth stage decoder 402d are sequentially connected in series.
To implement the skip attention mechanism, as shown in fig. 4, the attention module of the exemplary embodiment of the present disclosure includes a first-stage skip attention module 403a, a second-stage skip attention module 403b, a third-stage skip attention module 403c, and a fourth-stage skip attention module 403d.
As shown in fig. 4, the first-stage decoder 402a and the second-stage decoder 402b are connected by a first-stage skip attention module 403a, the second-stage decoder 402b and the third-stage decoder 402c are connected by a second-stage skip attention module 403b, the third-stage decoder 402c and the fourth-stage decoder 402d are connected by a third-stage skip attention module 403c, and the fourth-stage skip attention module 403d is provided at the output end of the fourth-stage decoder 402d. An input of the first stage encoder 401a is connected to an input of the third stage skip attention module 403c and an input of the fourth stage skip attention module 403d, an input of the second stage encoder 401b is connected to an input of the second stage skip attention module 403b, and an input of the third stage encoder 401c is connected to an input of the first stage skip attention module 403 a.
It can be seen that as shown in fig. 4, the input of the first-stage skip attention module 403a of the exemplary embodiment of the present disclosure is connected to the input of the third-stage encoder 401c and the output of the first-stage decoder 402a, respectively, and thus, the first-stage skip attention module 403a may determine the first-stage skip attention feature based on the input feature of the third-stage encoder 401c (i.e., the second-stage amplitude spectrum encoding feature output by the second-stage encoder 401 b) and the first-stage amplitude spectrum decoding feature.
As shown in fig. 4, the input of the second-stage skip attention module 403b is connected to the input of the second-stage encoder 401b and the output of the second-stage decoder 402b, respectively, and thus the second-stage skip attention module 403b may determine the second-stage skip attention feature based on the input features of the second-stage encoder 401b (i.e., the first-stage amplitude spectrum encoding features output by the first-stage encoder 401 a) and the second-stage amplitude spectrum decoding features.
As shown in fig. 4, the input of the third-level skip attention module 403c is connected to the input of the first-level encoder 401a and the output of the third-level decoder 402c, respectively, and thus the third-level skip attention module 403c may determine the third-level skip attention feature based on the input feature of the first-level encoder 401a (i.e., the mixed-amplitude spectrum of the music file to be identified) and the third-level amplitude spectrum decoding feature.
As shown in fig. 4, the input terminal of the fourth stage skip attention module 403d is connected to the input terminal of the first stage encoder 401a and the output terminal of the fourth stage decoder 402d, respectively, so that the fourth stage skip attention module 403d may determine the fourth stage skip attention feature, which is a dual-channel feature and may include accompaniment audio masking information and vocal audio masking information, based on the input feature of the first stage encoder 401a (i.e., the mixed amplitude spectrum of the music file to be recognized) and the fourth stage amplitude spectrum decoding feature.
The encoders of each level of the exemplary embodiments of the present disclosure maintain the scale of the input features while further extracting them, and each stage of decoder includes an upsampling module. Considering that the features input to each stage of skip attention module come from the output end of a decoder and the input end of an encoder, in order to ensure the computability of the input features, the decoder output and the encoder input connected to the same skip attention module in the exemplary embodiments of the present disclosure need to have the same scale. Therefore, as shown in fig. 4, the encoder-decoder network of the exemplary embodiments of the present disclosure may further include downsampling modules, namely a first-stage downsampling module 404a, a second-stage downsampling module 404b, a third-stage downsampling module 404c, and a fourth-stage downsampling module 404d.
The first-stage downsampling module 404a is disposed at the input end of the first-stage encoder 401a, so that the input end of the first-stage downsampling module 404a is connected with the input end of the fourth-stage skip attention module 403d and the output end of the first-stage downsampling module 404a is connected with the input end of the third-stage skip attention module 403c; the second-stage downsampling module 404b is disposed between the first-stage encoder 401a and the second-stage encoder 401b, and the output end of the second-stage downsampling module 404b is connected with the input end of the second-stage skip attention module 403b; the third-stage downsampling module 404c is disposed between the second-stage encoder 401b and the third-stage encoder 401c, so that the output end of the third-stage downsampling module 404c is connected with the input end of the first-stage skip attention module 403a; and the fourth-stage downsampling module 404d is disposed between the third-stage encoder 401c and the fourth-stage encoder 401d.
The exemplary embodiments of the present disclosure may also adjust the encoder architecture to optimize amplitude encoding feature extraction, based on which at least one encoder includes M-level downsampling, M-level upsampling, and spatial attention modules in series, M representing an integer greater than or equal to 1. The number of up-sampling modules included in each stage of up-sampling modules may be at least one.
The output end of the (k-1)-th level downsampling module is connected with the input end of the (k-1)-th level upsampling module, and the output end of the (k-1)-th level downsampling module is also skip-connected with the k-th level upsampling module, where k represents an integer greater than or equal to 3 and less than or equal to M. The upsampling modules of different levels are connected through a skip connection structure, the output end of the first-level downsampling module is connected with the output end of each level of upsampling module, and the output end of the last level of upsampling module is connected with the input end of the spatial attention module. In this way, the first-level downsampling feature output by the first-level downsampling module and the upsampling features output by the upsampling modules of different levels can be spliced, and spatial attention feature extraction is then performed through the spatial attention module to obtain the output feature of the decoder, thereby improving the separation effect of the voice audio features and the accompaniment audio features.
It can be seen that the exemplary embodiments of the present disclosure are inspired by UNet++: the encoder modules included in the multiple UNet networks are shared, while the decoder modules maintain their original structure, and for different UNet networks the decoder networks may be connected through a skip connection structure.
The present disclosure illustrates that when the encoder modules of the various UNet networks are shared, they are not shared in the conventional sense; rather, the decoder modules of different UNet networks are connected to downsampling modules of different levels included in the same encoder module. This structure can not only minimize the number of parameters of the encoder module, but also provide a better feature extraction function and improve the output accuracy of the accompaniment audio masking information and the voice audio masking information. Meanwhile, the spatial attention module extracts the splicing feature formed by the first-level downsampling feature and the upsampling features output by the upsampling modules of different levels, so as to obtain spatial attention information; therefore, the method of the exemplary embodiments of the present disclosure can also improve the separation effect of the voice audio features and the accompaniment audio features.
Fig. 5 shows a schematic diagram of a structural framework of a decoder according to an exemplary embodiment of the present disclosure. As shown in fig. 5, the decoder 500 of the exemplary embodiment of the present disclosure includes a four-stage downsampling module and a four-stage upsampling module. The four-stage downsampling module includes a first channel downsampling module 501, a first pooled downsampling module 502, a second channel downsampling module 503, and a second pooled downsampling module 504, and the four-stage upsampling module includes a first-stage upsampling module, a second-stage upsampling module, a third-stage upsampling module, and a fourth-stage upsampling module, respectively, where the first-stage upsampling module may include a first channel upsampling module 501a, the second-stage upsampling module may include a first deconvolution module 502a and a second channel upsampling module 502b, and the third-stage upsampling module may include a third channel upsampling module 503a, a second deconvolution module 503b, and a third channel upsampling module 503c.
As shown in fig. 5, a first channel downsampling module 501, a first pooled downsampling module 502, a second channel downsampling module 503, and a second pooled downsampling module 504 are connected in series, an output end of the first pooled downsampling module 502 is connected with an input end of the first channel upsampling module 501a, an output end of the second channel downsampling module 503 is connected in series with a first deconvolution module 502a and a second channel upsampling module 502b, and the second pooled downsampling module 504 is connected in series with a third channel upsampling module 503a, a second deconvolution module 503b, and a third channel upsampling module 503c.
As shown in fig. 5, the output end of the first pooling downsampling module 502 is in skip connection with the output end of the first deconvolution module 502a, the output end of the first deconvolution module 502a is in skip connection with the output end of the second deconvolution module 503b, the output end of the second channel downsampling module 503 is in skip connection with the output end of the third channel upsampling module 503a, the output end of the first channel downsampling module 501 is in skip connection with the output end of the first channel upsampling module 501a, and the output end of the first channel upsampling module 501a is in skip connection with the output end of the second channel upsampling module 502 b.
As shown in fig. 5, the encoder 500 of the exemplary embodiment of the present disclosure may further include a spatial attention module 505, where an output of the second channel upsampling module 502b is skip connected to an output of the third channel upsampling module 503c, and an output of the third channel upsampling module 503c is connected to an input of the spatial attention module 505.
It can be seen that the output end of the third channel upsampling module 503c of the exemplary embodiment of the present disclosure may substantially output a multi-splice feature of the first channel downsampling feature, the first channel upsampling feature, the second channel upsampling feature, and the third channel upsampling feature, on which spatial attention feature extraction is performed by the spatial attention module 505, so as to improve the separation capability for the vocal audio features and the accompaniment audio features in the amplitude spectrum coding features output by the decoder.
For example, the spatial attention module may be configured to: aggregate the multi-splice features in the channel dimension so that information between the vocal audio features and the accompaniment audio features is exchanged, thereby obtaining a channel aggregation feature; pool the channel aggregation feature by average pooling and maximum pooling respectively to obtain an average pooling result and a maximum pooling result, and splice the two to obtain a pooled splicing feature; then determine a spatial attention weight based on the pooled splicing feature; and obtain the amplitude spectrum decoding feature output by the decoder through the spatial attention weight and the multi-splice feature. For example, feature extraction may be performed on the pooled splicing feature through a convolution layer, and the result of the feature extraction may be processed with a sigmoid activation function to obtain the spatial attention weight.
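A hedged sketch of this spatial attention step is shown below; the convolution kernel size and feature shapes are assumptions made only for illustration.

```python
# Illustrative spatial attention: channel-wise average and max pooling, splicing,
# a convolution, and a sigmoid produce a spatial weight applied to the input feature.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time) -- the multi-splice feature
        avg_pool = x.mean(dim=1, keepdim=True)           # average pooling over channels
        max_pool, _ = x.max(dim=1, keepdim=True)         # maximum pooling over channels
        pooled = torch.cat([avg_pool, max_pool], dim=1)  # pooled splicing feature
        weight = self.sigmoid(self.conv(pooled))         # spatial attention weight
        return x * weight                                # re-weighted output feature

feature = torch.randn(1, 64, 128, 128)
print(SpatialAttention()(feature).shape)                 # torch.Size([1, 64, 128, 128])
```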
In addition, in order to ensure that the amplitude spectrum coding feature of the encoder output is the same as the input feature scale of the encoder input, the encoder 500 of the exemplary embodiment of the present disclosure may further include a third deconvolution module 506, and the output end of the third channel upsampling module 503c is connected to the input end of the spatial attention module 505 through the third deconvolution module 506, so that the splice features of the first channel downsampling feature, the first channel upsampling feature, the second channel upsampling feature, and the third channel upsampling feature may be restored to the scale of the input feature of the encoder through the third deconvolution module 506, and then the spatial attention feature extraction may be performed through the spatial attention module 505.
It should be noted that the channel downsampling module of the exemplary embodiments of the present disclosure extracts the input features and halves the number of channels without changing the feature scale; similarly, the channel upsampling module extracts the input features and doubles the number of channels without changing the feature scale. The pooling downsampling module extracts the input features and halves the feature scale without changing the number of channels; similarly, the deconvolution module extracts the input features and doubles the feature scale without changing the number of channels.
In an alternative manner, when performing text recognition on the voice audio data, the exemplary embodiments of the present disclosure may perform the text recognition based on an audio text recognition model. Fig. 6 shows an architectural diagram of an audio text recognition model of an exemplary embodiment of the present disclosure. As shown in fig. 6, an audio text recognition model 600 of an exemplary embodiment of the present disclosure may include a first preprocessor 601, a first mel-scale filter bank 602, a shared encoder 603, and an attention decoder 604. The training data of the audio text recognition model may include a plurality of audio samples and the audio text corresponding to the audio samples, where the audio text may be used as the label of the audio sample.
As shown in fig. 6, the first preprocessor 601 may perform weighting processing on the high-frequency part of the human voice audio data to obtain weighted human voice audio data, obtain valid feature interception parameters based on the weighted human voice audio data, and convert the human voice audio data into human voice audio mel-spectrum features using the mel-scale filter bank 602. Considering that the human voice audio data has intermittent high-frequency parts, multiple groups of valid feature interception parameters are obtained, so that one or more valid audio mel-spectrum features can be intercepted from the human voice audio mel-spectrum features based on the valid feature interception parameters; the plurality of valid audio mel-spectrum features can form an audio feature sequence group, and each valid audio mel-spectrum feature included in the audio feature sequence group can be defined as an audio feature sequence.
Illustratively, the high frequency portion of the exemplary embodiments of the present disclosure may be defined by a frequency threshold, that is, a segment of the human voice audio data greater than the frequency threshold is a high frequency portion, and each high frequency signal of the high frequency portion is weighted, thereby obtaining weighted human voice audio data, which includes the weighted high frequency portion.
The valid feature interception parameter of the exemplary embodiment of the present disclosure may include a valid feature start time and a valid feature end time, and the determining manner may include: and determining the starting time of the effective feature based on the peak high-frequency signal in the weighted high-frequency part, and determining the ending time of the effective feature based on the signal falling position of the high-frequency part. Here, the signal falling position of the high frequency part may refer to a position where energy carried in the falling process of the high frequency part is the same as that carried by the non-high frequency part.
The exemplary embodiments of the present disclosure may weight each high frequency signal of the high frequency part by a weighting formula in which H(f_r) represents the weighted high frequency signal, y(f_r) represents the amplitude spectrogram of the voice audio data, w_m represents the weight of the high frequency signal, m represents the distribution of the high frequency signal components over the frequency band, and d represents the frequency bandwidth.
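The weighting formula itself is given in the disclosure; purely for illustration, the following sketch shows the general idea of boosting amplitude-spectrum bins above a frequency threshold. The threshold value and the uniform weight are placeholder assumptions of this sketch, not the disclosed weights w_m.

```python
# Illustrative sketch only: boosts amplitude-spectrum bins above a frequency
# threshold with a per-bin weight. Threshold and weight values are assumptions.
import numpy as np
import librosa

def weight_high_frequencies(y, sr, freq_threshold=4000.0, weight=1.5, n_fft=1024):
    """Return an amplitude spectrogram whose high-frequency bins are weighted."""
    spec = np.abs(librosa.stft(y, n_fft=n_fft))           # amplitude spectrogram y(f_r)
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)   # centre frequency of each bin
    weights = np.where(freqs > freq_threshold, weight, 1.0)
    return spec * weights[:, None]                        # weighted spectrogram H(f_r)
```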
When the human voice audio data is converted into human voice audio mel-spectrum features using the mel-scale filter bank, the mel-scale filter bank outputs human voice audio mel-spectrum features in the form of frames, and thus, the individual frames of human voice audio mel-spectrum features may constitute a mel-spectrum feature sequence and be defined as an audio feature sequence.
Considering that the human voice audio data has intermittent high frequency parts, one or more effective audio mel-spectrum features may be intercepted from the human voice audio mel-spectrum features based on the effective feature interception parameters. The plurality of effective audio mel-spectrum features may form an audio feature sequence group, and each effective audio mel-spectrum feature included in the group may be defined as an audio feature sequence.
As shown in fig. 6, when the effective feature start time and the effective feature end time are acquired, the one or more effective audio mel-spectrum features corresponding to them may be obtained directly from the audio feature sequence group. For example, if a plurality of effective audio mel-spectrum features are obtained, the audio feature sequence group formed by them may be input into the shared encoder 603 to obtain the hidden vectors corresponding to the effective audio mel-spectrum features; the hidden vectors are then input into the attention decoder 604 to obtain the audio text data corresponding to each effective audio mel-spectrum feature; finally, the audio text data corresponding to the effective audio mel-spectrum features are spliced in time order, thereby obtaining the audio text data of the music file to be identified.
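For illustration only, the following sketch traces that flow with placeholder shared_encoder and attention_decoder callables; their interfaces, the hop length, and the whitespace splicing of the decoded text are assumptions of this sketch.

```python
# Sketch of the recognition flow in fig. 6: valid mel-spectrum segments are
# encoded by a shared encoder, decoded by an attention decoder, and the decoded
# text pieces are spliced in time order. Encoder/decoder interfaces are assumed.
import librosa

def recognize_audio_text(voice_audio, sr, intervals, shared_encoder, attention_decoder):
    """intervals: list of (start_time, end_time) effective-feature interception parameters."""
    mel = librosa.feature.melspectrogram(y=voice_audio, sr=sr, hop_length=512)
    frames_per_sec = sr / 512
    pieces = []
    for start, end in sorted(intervals):                    # keep time order
        a, b = int(start * frames_per_sec), int(end * frames_per_sec)
        segment = mel[:, a:b]                               # one audio feature sequence
        hidden = shared_encoder(segment)                    # hidden vector(s) for the segment
        pieces.append(attention_decoder(hidden))            # text for this segment
    return " ".join(pieces)                                 # splice by time order
```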
In an alternative manner, when performing emotion analysis on the accompaniment audio data included in the music file to be identified, the exemplary embodiments of the present disclosure may obtain mel-frequency cepstrum coefficient features and cochlear frequency features based on the accompaniment audio data, and then determine the emotion prediction result based on the mel-frequency cepstrum coefficient features and the cochlear frequency features. It should be appreciated that, when performing emotion analysis, the exemplary embodiments of the present disclosure may analyze the emotion prediction result of the music file using a music emotion analysis model.
Fig. 7 shows an architectural diagram of a music emotion analysis model of an exemplary embodiment of the present disclosure. As shown in fig. 7, a music emotion analysis model 700 of an exemplary embodiment of the present disclosure may include a second preprocessor 701, a fast Fourier transformer 702, a second mel-scale filter bank 703, a discrete cosine transformer 704, a GammaTone filter 705, and a CLDNN_BILSTM model 706, where the CLDNN_BILSTM model 706 includes a CLDNN and a BILSTM (Bi-directional Long Short Term Memory): the CLDNN is a combined network of a convolutional neural network (Convolutional Neural Networks, CNN), a bidirectional long short-term memory network, and a deep neural network (Deep-Learning Neural Network, DNN), and the BILSTM is a bidirectional long short-term memory network.
As shown in fig. 7, the present disclosure may input the accompaniment audio amplitude spectrogram of the accompaniment audio data to the second preprocessor 701, which weights the high frequency part of the accompaniment audio data and performs framing and windowing to obtain a preprocessed accompaniment audio amplitude spectrogram. The preprocessed accompaniment audio amplitude spectrogram is then processed sequentially by the fast Fourier transformer 702, the second mel-scale filter bank 703, and the discrete cosine transformer 704 to obtain the mel-frequency cepstrum coefficient features; the mel-frequency cepstrum coefficient features are processed by the GammaTone filter 705 to obtain the cochlear frequency features; finally, the mel-frequency cepstrum coefficient features and the cochlear frequency features are input to the CLDNN_BILSTM model 706 to obtain the emotion prediction result of the music file.
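For illustration only, the following sketch extracts the two feature types with librosa and SciPy: librosa.feature.mfcc wraps the framing, FFT, mel filter bank, and DCT chain, and a gammatone filterbank stands in for the cochlear frequency features. The centre frequencies, the per-band energy summary, and the choice to apply the gammatone filters to the waveform rather than to the MFCC features are assumptions of this sketch.

```python
# Sketch of the two feature types fed to the emotion model. Centre frequencies
# and the waveform-level gammatone filtering are illustrative assumptions.
import numpy as np
import librosa
from scipy.signal import gammatone, lfilter

def accompaniment_features(accomp_audio, sr, n_mfcc=20,
                           centre_freqs=(100, 300, 700, 1500, 3000, 6000)):
    mfcc = librosa.feature.mfcc(y=accomp_audio, sr=sr, n_mfcc=n_mfcc)
    cochlear = []
    for fc in centre_freqs:                          # one gammatone channel per centre frequency
        b, a = gammatone(fc, "iir", fs=sr)
        band = lfilter(b, a, accomp_audio)
        cochlear.append(np.log1p(np.abs(band)).mean())   # crude per-band energy summary
    return mfcc, np.array(cochlear)
```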
Illustratively, the exemplary embodiment of the present disclosure may use the CLDNN included in the CLDNN_BILSTM model as a filtering channel of the music emotion features to acquire the music emotion features based on the mel-frequency cepstrum coefficient features and the cochlear frequency features, and then learn the context-related information of the music emotion features through the BILSTM included in the CLDNN_BILSTM model, thereby obtaining the emotion prediction result of the music file.
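For illustration only, a minimal PyTorch sketch of a CLDNN_BILSTM-style classifier follows: a convolutional front end acts as the filtering channel for emotion-related features, a bidirectional LSTM learns context over time, and a small DNN head outputs the emotion prediction. All layer sizes, the input layout, and the number of emotion classes are assumptions of this sketch.

```python
# Minimal sketch of a CNN -> BiLSTM -> DNN emotion classifier. Layer sizes and
# the number of emotion classes are assumptions, not values from the disclosure.
import torch
import torch.nn as nn

class CLDNNBiLSTM(nn.Module):
    def __init__(self, n_features: int, n_classes: int = 8, hidden: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(n_features, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.bilstm = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        self.dnn = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_classes))

    def forward(self, x):            # x: (batch, time, n_features), e.g. MFCC + cochlear features
        z = self.cnn(x.transpose(1, 2)).transpose(1, 2)   # (batch, time, 64)
        z, _ = self.bilstm(z)                             # (batch, time, 2 * hidden)
        return self.dnn(z[:, -1])                         # emotion logits from the last step
```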
In an alternative manner, when acquiring the music related information corresponding to the music file to be identified based on the emotion prediction result and the audio text data, the exemplary embodiment of the present disclosure may acquire the candidate music related information based on the emotion prediction result, and then acquire the target music related information from the candidate music related information based on the audio text data.
In practical applications, when retrieving the candidate music related information, the exemplary embodiments of the present disclosure retrieve the candidate music related information from the whole network or from a certain music database through the emotion prediction result. In essence, this retrieval compares the emotion prediction result with the emotion labels of the music files to obtain the related information matching the emotion prediction result as the candidate music related information. Therefore, when retrieving the candidate music related information based on the emotion prediction result, a large amount of feature comparison is not needed, and the search range of the music related information can be narrowed.
A search for music file related information based on audio text data requires comparing the audio text data with all audio file data within the search range. For example, the similarity may be computed based on character strings, based on a corpus, or based on knowledge; the comparison may also be performed based on a similarity function, including but not limited to Euclidean distance, cosine distance, Jaccard similarity, and Hamming distance. The calculation amount of searching for the target music related information from a large number of music files based on the audio text data alone is therefore relatively large, resulting in low search efficiency. In contrast, the exemplary embodiments of the present disclosure retrieve the candidate music related information through the emotion prediction result, which narrows the search range of the music related information; the target music related information is then searched for within a smaller range based on the audio text data, so that the search efficiency of the target music related information can be effectively improved.
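For illustration only, the following sketch shows the two-stage search: the emotion prediction result first narrows the candidate set, and a text-similarity score (Jaccard over word sets here; cosine distance or edit distance would work equally well) then selects the target. The candidate record layout (emotion_tag, lyrics_text, info keys) is an assumption of this sketch.

```python
# Sketch of the two-stage search: emotion label narrows the candidate set,
# then text similarity picks the target. Record layout is an assumption.
def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def search_music_info(emotion, audio_text, music_db):
    """music_db: iterable of dicts with 'emotion_tag', 'lyrics_text', 'info' keys."""
    candidates = [m for m in music_db if m["emotion_tag"] == emotion]   # stage 1: narrow the range
    if not candidates:
        return None
    best = max(candidates, key=lambda m: jaccard(audio_text, m["lyrics_text"]))  # stage 2: text match
    return best["info"]
```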
Moreover, the candidate music related information of the exemplary embodiments of the present disclosure essentially consists of the related information of the music files that match the emotion prediction result, which ensures that the music file label corresponding to the obtained target music related information matches the emotion prediction result. Therefore, by retrieving the candidate music related information through the emotion prediction result, the exemplary embodiments of the present disclosure can ensure that the target music related information is accurately obtained from the candidate music related information based on the audio text data.
In order to increase the information capacity of the target music related information, the method of the exemplary embodiment of the present disclosure may further include: and determining a voice audio file based on the voice audio amplitude spectrogram and the mixed phase spectrogram of the music file to be identified, determining an accompaniment audio file based on the accompaniment audio amplitude spectrogram and the mixed phase spectrogram of the music file to be identified, and expanding the relevant information of the target music based on the voice audio file, the accompaniment audio file and the audio text data. Of course, the exemplary embodiments of the present disclosure may also store the vocal audio files, accompaniment audio files, and audio text data for subsequent selection and storage by other users.
Illustratively, the exemplary embodiments of the present disclosure may input the voice audio amplitude spectrogram and the mixed phase spectrogram of the music file to be identified into an inverse short-time Fourier transformer (Inverse Short-Time Fourier Transform, ISTFT) and perform music source reconstruction through the ISTFT to obtain the voice audio file; similarly, the accompaniment audio amplitude spectrogram and the mixed phase spectrogram of the music file to be identified may be input into the ISTFT, and music source reconstruction may be performed through the ISTFT to obtain the accompaniment audio file. The audio text data may be identified from the voice audio data (for example, the voice audio amplitude spectrogram) according to the method described above. In this way, the functions of the music application can be enhanced, and the music application can be guaranteed to provide more music file related information for the user.
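For illustration only, the following sketch shows the reconstruction step with librosa: each separated amplitude spectrogram is combined with the phase of the original mixture and inverted with an ISTFT. The hop length is an assumption of this sketch and must match the analysis STFT.

```python
# Sketch of the source reconstruction step: separated amplitude + mixture phase -> ISTFT.
# The hop length is an assumption and must match the analysis STFT.
import numpy as np
import librosa

def reconstruct_source(amplitude_spec, mixture_phase, hop_length=512):
    """amplitude_spec, mixture_phase: arrays of shape (frequency_bins, frames)."""
    complex_spec = amplitude_spec * np.exp(1j * mixture_phase)
    return librosa.istft(complex_spec, hop_length=hop_length)

# Usage: vocal_audio = reconstruct_source(vocal_amplitude, mix_phase)
#        accomp_audio = reconstruct_source(accomp_amplitude, mix_phase)
```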
According to one or more technical solutions provided by the exemplary embodiments of the present disclosure, text recognition can be performed on the voice audio data contained in a music file to be identified to determine the audio text data of the music file to be identified, and emotion analysis can be performed on the accompaniment audio data contained in the music file to be identified to obtain the emotion prediction result of the music file to be identified. On this basis, a music information search is performed based on the emotion prediction result and the audio text data to obtain the target music related information corresponding to the music file to be identified. Therefore, the method of the exemplary embodiments of the present disclosure is applicable to searching for the music related information of unknown music, filling the technical gap of music applications in the related art.
In addition, performing the music information search based on the emotion prediction result and the audio text data can narrow the search range of the music information and improve the search efficiency of the music related information corresponding to the music file to be identified. Searching for the music information by combining the emotion prediction result and the audio text data also ensures the accuracy of the retrieved target music related information.
The foregoing description of the solutions provided by the embodiments of the present disclosure has been presented mainly from the perspective of an electronic device. It will be appreciated that, in order to implement the above functions, the electronic device includes corresponding hardware structures and/or software modules for performing the respective functions. Those skilled in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or as combinations of hardware and computer software. Whether a function is implemented as hardware or as computer-software-driven hardware depends upon the particular application and the design constraints of the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The embodiment of the disclosure may divide the functional units of the electronic device according to the above method example, for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated modules may be implemented in hardware or in software functional modules. It should be noted that, in the embodiment of the present disclosure, the division of the modules is merely a logic function division, and other division manners may be implemented in actual practice.
In the case where the functional modules are divided corresponding to the respective functions, the exemplary embodiments of the present disclosure provide a music file recognition apparatus, which may be an electronic device or a chip applied to the electronic device. Fig. 8 shows a functional block diagram of a music file recognition apparatus according to an exemplary embodiment of the present disclosure. As shown in fig. 8, the music file recognition apparatus 800 includes:
the recognition module 801 is configured to perform text recognition on voice audio data included in a music file to be recognized, determine audio text data of the music file to be recognized, perform emotion analysis on accompaniment audio data included in the music file to be recognized, and obtain an emotion prediction result of the music file to be recognized;
and a searching module 802, configured to obtain, based on the emotion prediction result and the audio text data, target music related information corresponding to the music file to be identified.
In one possible implementation, the identifying module 801 is configured to extract accompaniment audio masking information and vocal audio masking information from a mixed amplitude spectrum of a music file to be identified, determine vocal audio data based on the mixed amplitude spectrum of the music file to be identified and the vocal audio masking information, and determine accompaniment audio data based on the mixed amplitude spectrum of the music file to be identified and the accompaniment audio masking information.
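Purely for illustration, a minimal sketch of the masking step follows: the separated amplitude spectrograms are obtained by element-wise multiplication of the mixed amplitude spectrum with the respective masks. Mask values in [0, 1] are an assumption of this sketch.

```python
# Minimal sketch of applying the extracted masks to the mixed amplitude spectrum.
# Element-wise masks with values in [0, 1] are an assumption of this sketch.
def apply_masks(mixed_amplitude, vocal_mask, accompaniment_mask):
    """All inputs: NumPy arrays of shape (frequency_bins, frames)."""
    vocal_amplitude = mixed_amplitude * vocal_mask
    accompaniment_amplitude = mixed_amplitude * accompaniment_mask
    return vocal_amplitude, accompaniment_amplitude
```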
In one possible implementation manner, the identifying module 801 is configured to input the mixed amplitude spectrum of the music file to be identified into N-level encoders connected in series to obtain N-level amplitude spectrum encoding features, input the N-level amplitude spectrum encoding features into N-level decoders connected in series to obtain N-level amplitude spectrum decoding features, determine accompaniment audio masking information and voice audio masking information based on the mixed amplitude spectrum of the music file to be identified and the N-level amplitude spectrum decoding features, each level encoder is configured to output a corresponding level amplitude spectrum encoding feature, and each level decoder is configured to output a corresponding level amplitude spectrum decoding feature, where N represents an integer greater than or equal to 2.
In one possible implementation manner, the input feature of the n-th level decoder is a first skip attention feature determined by the n-1 th level amplitude spectrum decoding feature and the input feature of the N-n+1 th level encoder; the input end of the N-n+1 th level encoder and the output end of the n-1 th level decoder are each connected with the input end of the n-th level decoder through a first skip attention module, where n represents an integer greater than or equal to 2 and less than or equal to N;
the first skip attention module is used for determining first weight information based on the n-1 th level amplitude spectrum decoding feature and the input feature of the N-n+1 th level encoder, and obtaining the input feature of the n-th level decoder based on the input feature of the N-n+1 th level encoder and the first weight information.
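For illustration only, a minimal PyTorch sketch of a gate in the spirit of the first skip attention module follows: weight information is derived from the previous decoder output and the corresponding encoder input, then applied to the encoder input to form the next decoder's input. The 1x1 convolutions, the sigmoid gating, and the assumption that the decoder feature and the encoder input share the same channel count and spatial size are choices of this sketch rather than of the disclosure.

```python
# Sketch of a skip-attention gate: weights from (decoder feature, encoder input),
# applied to the encoder input. Gating design is an illustrative assumption.
import torch
import torch.nn as nn

class SkipAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.project = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, kernel_size=1), nn.Sigmoid())

    def forward(self, decoder_feat, encoder_input):
        weights = self.gate(self.project(torch.cat([decoder_feat, encoder_input], dim=1)))
        return weights * encoder_input        # weighted encoder input -> next decoder's input
```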
In one possible implementation manner, the identifying module 801 is configured to determine second weight information based on the mixed amplitude spectrum and the nth level amplitude spectrum decoding feature of the music file to be identified, and determine accompaniment audio masking information and voice audio masking information based on the mixed amplitude spectrum and the second weight information of the music file to be identified.
As one possible implementation, the at least one encoder includes an M-level downsampling module, an M-level upsampling module, and a spatial attention module connected in series, M representing an integer greater than or equal to 1;
The output end of the k-1 level down sampling module is connected with the input end of the k-1 level up sampling module, the output end of the k-1 level down sampling module is connected with the k level up sampling module in a jumping manner, and up sampling modules corresponding to the adjacent two levels of down sampling modules are connected through a jumping connection structure;
the output end of the first-stage downsampling module is connected with the output end of each stage downsampling module in series, the output end of the last-stage upsampling module is connected with the input end of the spatial attention module, the kth-stage upsampling module is connected with the downsampling module of the corresponding stage in a jumping manner, and k represents an integer which is more than or equal to 3 and less than or equal to M.
In one possible implementation manner, the identifying module 801 is configured to determine an effective audio mel spectrum feature based on the voice audio data included in the music file to be identified, determine a hidden vector corresponding to the effective audio mel spectrum feature based on the effective audio mel spectrum feature, and obtain audio text data based on the hidden vector corresponding to the effective audio mel spectrum feature.
In one possible implementation, the identifying module 801 is configured to obtain mel-frequency cepstral coefficient features and cochlear frequency features based on the accompaniment audio data, and determine emotion prediction results based on the mel-frequency cepstral coefficient features and the cochlear frequency features.
In one possible implementation, the search module 802 is configured to obtain candidate music related information based on the emotion prediction result, and to acquire the target music related information from the candidate music related information based on the audio text data.
In one possible implementation, the apparatus further includes an expansion module configured to determine the voice audio file based on the voice audio amplitude spectrogram and the mixed phase spectrogram of the music file to be identified, determine the accompaniment audio file based on the accompaniment audio amplitude spectrogram and the mixed phase spectrogram of the music file to be identified, and expand the target music related information based on the voice audio file, the accompaniment audio file, and the audio text data.
Fig. 9 shows a schematic block diagram of a chip according to an exemplary embodiment of the present disclosure. As shown in fig. 9, the chip 900 includes one or more (including two) processors 901 and a communication interface 902. The communication interface 902 may support the electronic device to perform the data transceiving steps of the method described above, and the processor 901 may support the electronic device to perform the data processing steps of the method described above.
Optionally, as shown in fig. 9, the chip 900 further includes a memory 903, where the memory 903 may include a read-only memory and a random access memory, and provides operating instructions and data to the processor. A portion of the memory may also include non-volatile random access memory (non-volatile random access memory, NVRAM).
In some embodiments, as shown in fig. 9, the processor 901 performs the corresponding operations by invoking operating instructions stored in the memory (which may be stored in an operating system). The processor 901 controls the processing operations of any one of the terminal devices, and may also be referred to as a central processing unit (central processing unit, CPU). The memory 903 may include a read-only memory and a random access memory, and provides instructions and data to the processor 901; a portion of the memory 903 may also include NVRAM. The processor 901, the communication interface 902, and the memory 903 are coupled together by a bus system, which may include a power bus, a control bus, a status signal bus, and the like in addition to a data bus; for clarity of illustration, the various buses are labeled as bus system 904 in fig. 9.
The method disclosed in the embodiments of the present disclosure may be applied to a processor or implemented by a processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general purpose processor, a digital signal processor (digital signal processing, DSP), an ASIC, a field-programmable gate array (field-programmable gate array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present disclosure may be implemented or performed. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present disclosure may be embodied directly in hardware, in a decoding processor, or in a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, a register, or other storage media well known in the art. The storage medium is located in the memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method.
The exemplary embodiments of the present disclosure also provide an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor. The memory stores a computer program executable by the at least one processor for causing the electronic device to perform a method according to embodiments of the present disclosure when executed by the at least one processor.
The present disclosure also provides a non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor of a computer, is for causing the computer to perform a method according to an embodiment of the present disclosure.
The present disclosure also provides a computer program product comprising a computer program, wherein the computer program, when executed by a processor of a computer, is for causing the computer to perform a method according to an embodiment of the present disclosure.
Referring to fig. 10, a block diagram of a structure of an electronic device 1000 that may be a server or a client of the present disclosure, which is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. Electronic devices are intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 10, the electronic device 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data required for the operation of the device 1000 can also be stored. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.
Various components in the electronic device 1000 are connected to the I/O interface 1005, including: an input unit 1006, an output unit 1007, a storage unit 1008, and a communication unit 1009. The input unit 1006 may be any type of device capable of inputting information to the electronic device 1000, and the input unit 1006 may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device. The output unit 1007 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. Storage unit 1008 may include, but is not limited to, magnetic disks, optical disks. Communication unit 1009 allows electronic device 1000 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as bluetooth (TM) devices, wiFi devices, wiMax devices, cellular communication devices, and/or the like.
The computing unit 1001 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 1001 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1001 performs the respective methods and processes described above. For example, in some embodiments, the methods of the exemplary embodiments of the present disclosure may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 1000 via the ROM 1002 and/or the communication unit 1009. In some embodiments, the computing unit 1001 may be configured to perform the methods of the exemplary embodiments of the present disclosure in any other suitable manner (e.g., by means of firmware).
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As used in this disclosure, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer program or instructions are loaded and executed on a computer, the processes or functions of the embodiments of the present disclosure are performed in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, a terminal, a user equipment, or other programmable apparatus. The computer program or instructions may be stored in or transmitted from one computer readable storage medium to another, for example, by wired or wireless means from one website site, computer, server, or data center. Computer readable storage media can be any available media that can be accessed by a computer or data storage devices such as servers, data centers, etc. that integrate one or more available media. Usable media may be magnetic media such as floppy disks, hard disks, magnetic tape; optical media, such as digital video discs (digital video disc, DVD); but also semiconductor media such as solid state disks (solid state drive, SSD).
Although the present disclosure has been described in connection with specific features and embodiments thereof, it will be apparent that various modifications and combinations thereof can be made without departing from the spirit and scope of the disclosure. Accordingly, the specification and drawings are merely exemplary illustrations of the present disclosure as defined in the appended claims and are considered to cover any and all modifications, variations, combinations, or equivalents within the scope of the disclosure. It will be apparent to those skilled in the art that various modifications and variations can be made to the present disclosure without departing from the spirit or scope of the disclosure. Thus, the present disclosure is intended to include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. A music file recognition method, comprising:
carrying out text recognition on voice audio data contained in a music file to be recognized, and determining audio text data of the music file to be recognized;
carrying out emotion analysis on accompaniment audio data contained in the music file to be identified to obtain an emotion prediction result of the music file to be identified;
and searching music information based on the emotion prediction result and the audio text data to obtain target music related information corresponding to the music file to be identified.
2. The method according to claim 1, wherein the method further comprises:
extracting accompaniment audio masking information and voice audio masking information from the mixed magnitude spectrum of the music file to be identified;
determining the voice audio data based on the mixed amplitude spectrum of the music file to be identified and the voice audio masking information;
and determining the accompaniment audio data based on the mixed magnitude spectrum of the music file to be identified and the accompaniment audio masking information.
3. The method according to claim 2, wherein the extracting accompaniment audio masking information and voice audio masking information from the mixed amplitude spectrum of the music file to be identified comprises:
inputting the mixed amplitude spectrum of the music file to be identified into N-level encoders connected in series to obtain N-level amplitude spectrum coding features, wherein each level of encoder is used for outputting corresponding level amplitude spectrum coding features, and N represents an integer greater than or equal to 2;
inputting the N-level amplitude spectrum coding features into N-level decoders connected in series to obtain N-level amplitude spectrum decoding features, wherein each level of decoder is used for outputting corresponding level amplitude spectrum decoding features;
and determining the accompaniment audio masking information and the voice audio masking information based on the mixed amplitude spectrum and the Nth-level amplitude spectrum decoding characteristics of the music file to be identified.
4. A method according to claim 3, wherein the input feature of the n-th level decoder is a first skip attention feature determined by the n-1 th level amplitude spectrum decoding feature and the input feature of the N-n+1 th level encoder, the input end of the N-n+1 th level encoder and the output end of the n-1 th level decoder are each connected to the input end of the n-th level decoder via a first skip attention module, and n represents an integer greater than or equal to 2 and less than or equal to N;
the first skip attention module is used for determining first weight information based on the n-1 th level amplitude spectrum decoding feature and the input feature of the N-n+1 th level encoder, and obtaining the input feature of the n-th level decoder based on the input feature of the N-n+1 th level encoder and the first weight information.
5. The method of claim 3, wherein the determining the accompaniment audio masking information and the voice audio masking information based on the mixed amplitude spectrum and the nth level amplitude spectrum decoding feature of the music file to be identified comprises:
determining second weight information based on the mixed amplitude spectrum and the nth level amplitude spectrum decoding characteristics of the music file to be identified;
and determining the accompaniment audio masking information and the voice audio masking information based on the mixed magnitude spectrum of the music file to be identified and the second weight information.
6. A method according to claim 3, wherein at least one of the encoders comprises an M-level downsampling module, an M-level upsampling module and a spatial attention module connected in series, M representing an integer greater than or equal to 1;
the output end of the k-1 level down sampling module is connected with the input end of the k-1 level up sampling module, the output end of the k-1 level down sampling module is connected with the k level up sampling module in a jumping manner, and the up sampling modules corresponding to the adjacent two levels of down sampling modules are connected through a jumping connection structure;
the output end of the first-stage downsampling module is connected with the output end of each stage downsampling module in series, the output end of the last-stage upsampling module is connected with the input end of the spatial attention module, the kth-stage upsampling module is connected with the downsampling module of the corresponding stage in a jumping manner, and k represents an integer which is greater than or equal to 3 and less than or equal to M.
7. The method according to any one of claims 1 to 6, wherein the performing emotion analysis on accompaniment audio data included in the music file to be identified to obtain an emotion prediction result of the music file to be identified includes:
acquiring mel frequency cepstrum coefficient features and cochlear frequency features based on the accompaniment audio data contained in the music file to be identified;
determining the emotion prediction result based on the mel frequency cepstrum coefficient features and the cochlear frequency features;
and/or,
the music information searching based on the emotion prediction result and the audio text data to obtain music related information corresponding to the music file to be identified comprises the following steps:
acquiring candidate music related information based on the emotion prediction result;
and acquiring the target music related information from the candidate music related information based on the audio text data.
8. A music file recognition apparatus, characterized by comprising:
the identification module is used for carrying out text identification on voice audio data contained in the music file to be identified, determining audio text data of the music file to be identified, carrying out emotion analysis on accompaniment audio data contained in the music file to be identified, and obtaining emotion prediction results of the music file to be identified;
and the searching module is used for acquiring the target music related information corresponding to the music file to be identified based on the emotion prediction result and the audio text data.
9. An electronic device, comprising:
a processor; the method comprises the steps of,
a memory storing a program;
wherein the program comprises instructions which, when executed by the processor, cause the processor to perform the method according to any one of claims 1 to 7.
10. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-7.
CN202311764704.1A 2023-12-19 2023-12-19 Music file identification method and device, equipment and medium Pending CN117746906A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311764704.1A CN117746906A (en) 2023-12-19 2023-12-19 Music file identification method and device, equipment and medium

Publications (1)

Publication Number Publication Date
CN117746906A 2024-03-22

Family

ID=90252345

Country Status (1)

Country Link
CN (1) CN117746906A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination