CN115312064A - Singing object recognition method and device, electronic equipment and storage medium - Google Patents


Publication number
CN115312064A
Authority
CN
China
Prior art keywords: target, characteristic, fusion, feature, layer
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210906243.6A
Other languages
Chinese (zh)
Inventor
张旭龙
王健宗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202210906243.6A
Publication of CN115312064A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Database Structures And File System Structures Therefor (AREA)

Abstract

The embodiment of the application provides a singing object identification method and device, electronic equipment and a storage medium, and belongs to the technical field of artificial intelligence. The method comprises the following steps: acquiring target audio data of a target singing object; extracting the frequency spectrum characteristics of the target audio data to obtain Mel cepstrum characteristics; extracting tone color characteristics of the target audio data to obtain target tone color characteristics; performing middle-layer feature extraction on the target audio data to obtain music characteristic features; performing first fusion processing on the Mel cepstrum characteristic and the target tone characteristic to obtain a first fusion audio characteristic; performing second fusion processing on the first fusion audio characteristic and the music characteristic feature to obtain a second fusion audio characteristic; and performing prediction processing on the second fusion audio feature through a preset character recognition model to obtain a target identity tag of the target singing object, wherein the target identity tag is used for representing the identity of the target singing object. The method and the device can improve the recognition accuracy of the singing object.

Description

Singing object recognition method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a singing object recognition method and apparatus, an electronic device, and a storage medium.
Background
With the development of metaverse technology, many aspects of daily life can be extended through the metaverse to a world combining the virtual and the real. As the number of singing objects in the metaverse increases, conventional identification methods often have difficulty accurately identifying the identities of singing objects, so how to improve the identification accuracy of singing objects has become a technical problem to be solved urgently.
Disclosure of Invention
The embodiment of the application mainly aims to provide a singing object identification method and device, electronic equipment and a storage medium, and aims to improve the identification accuracy of a singing object.
In order to achieve the above object, a first aspect of the embodiments of the present application provides a singing object identification method, where the method includes:
acquiring target audio data of a target singing object;
extracting the frequency spectrum characteristic of the target audio data to obtain a Mel cepstrum characteristic;
extracting tone color characteristics of the target audio data to obtain target tone color characteristics;
performing middle-layer feature extraction on the target audio data to obtain music characteristic features;
performing first fusion processing on the Mel cepstrum characteristic and the target tone characteristic to obtain a first fusion audio characteristic;
performing second fusion processing on the first fusion audio characteristic and the music characteristic feature to obtain a second fusion audio characteristic;
and predicting the second fusion audio features through a preset character recognition model to obtain a target identity label of the target singing object, wherein the target identity label is used for representing the identity of the target singing object.
In some embodiments, the step of extracting a spectral feature of the target audio data to obtain a mel-frequency cepstrum feature includes:
performing sound spectrum calculation on the target audio data through short-time Fourier transform to obtain a target spectrogram;
and filtering the target spectrogram through a preset Mel cepstrum filter to obtain the Mel cepstrum characteristics.
In some embodiments, the step of performing timbre feature extraction on the target audio data to obtain a target timbre feature includes:
segmenting the target audio data through a preset audio segmentation model to obtain a plurality of target audio segments;
extracting the characteristics of each target audio segment through the audio segmentation model to obtain a target audio hidden vector;
performing feature calculation on all the target audio hidden vectors to obtain a target audio mean vector and a target audio variance vector;
splicing the target audio variance vector and the target audio mean vector to obtain a target timbre hidden vector;
and performing prediction processing on the target tone implicit vector through a preset function to obtain the target tone characteristic.
In some embodiments, the step of performing middle-layer feature extraction on the target audio data to obtain music characteristic features includes:
inputting the target audio data into a preset feature extraction model, wherein the feature extraction model comprises a first convolution layer, a second convolution layer and a third convolution layer;
performing music middle layer feature extraction on the target audio data through the first convolution layer to obtain a first music middle layer feature;
performing music middle layer feature extraction on the target audio data through the second convolution layer to obtain a second music middle layer feature;
performing music middle-layer feature extraction on the target audio data through the third convolutional layer to obtain third music middle-layer features;
and splicing the first music middle layer feature, the second music middle layer feature and the third music middle layer feature to obtain the music characteristic feature.
In some embodiments, the character recognition model includes a GRU layer and a fully connected layer, and the step of obtaining the target identity tag of the target singing object by performing prediction processing on the second fusion audio feature through a preset character recognition model includes:
performing time sequence feature extraction on the second fusion audio features through the GRU layer to obtain fusion time sequence feature vectors;
and performing label prediction processing on the fusion time sequence feature vector through the full connection layer to obtain a target identity label of the target singing object.
In some embodiments, the step of performing time-series feature extraction on the second fused audio feature through the GRU layer to obtain a fused time-series feature vector includes:
performing time sequence feature extraction on the second fusion audio feature through a first gating circulating unit of the GRU layer to obtain an initial time sequence feature vector;
and performing time sequence feature extraction on the initial time sequence feature vector through a second gating circulation unit of the GRU layer to obtain the fusion time sequence feature vector.
In some embodiments, the step of performing label prediction processing on the fused temporal feature vector through the fully-connected layer to obtain a target identity label of the target singing object includes:
performing label probability calculation on the fusion time sequence feature vector through the classification function of the full connection layer and preset identity labels to obtain a label probability vector corresponding to each preset identity label;
selecting a preset identity label corresponding to the label probability vector with the maximum value to obtain a candidate identity label;
and obtaining the target identity label according to the candidate identity label.
In order to achieve the above object, a second aspect of the embodiments of the present application provides a singing object recognition apparatus, including:
the data acquisition module is used for acquiring target audio data of a target singing object;
the frequency spectrum characteristic extraction module is used for extracting frequency spectrum characteristics of the target audio data to obtain Mel cepstrum characteristics;
the tone characteristic extraction module is used for extracting tone characteristics of the target audio data to obtain target tone characteristics;
the middle-layer feature extraction module is used for performing middle-layer feature extraction on the target audio data to obtain music characteristic features;
the first fusion module is used for carrying out first fusion processing on the Mel cepstrum characteristic and the target tone characteristic to obtain a first fusion audio characteristic;
the second fusion module is used for carrying out second fusion processing on the first fusion audio characteristic and the music characteristic feature to obtain a second fusion audio characteristic;
and the prediction module is used for performing prediction processing on the second fusion audio characteristic through a preset character recognition model to obtain a target identity tag of the target singing object, wherein the target identity tag is used for representing the identity of the target singing object.
In order to achieve the above object, a third aspect of the embodiments of the present application provides an electronic device, which includes a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for implementing connection communication between the processor and the memory, wherein the program, when executed by the processor, implements the method of the first aspect.
To achieve the above object, a fourth aspect of the embodiments of the present application proposes a storage medium, which is a computer-readable storage medium for computer-readable storage, and stores one or more programs, which are executable by one or more processors to implement the method of the first aspect.
The singing object recognition method, the singing object recognition device, the electronic equipment and the storage medium are characterized in that target audio data of a target singing object are obtained; further, extracting the frequency spectrum characteristics of the target audio data to obtain Mel cepstrum characteristics; extracting tone color characteristics of the target audio data to obtain target tone color characteristics; the middle-layer feature extraction is carried out on the target audio data to obtain the music characteristic feature, and the tone characteristic information, the music characteristic information and other contents of the target singing object can be conveniently determined. Further, the Mel cepstrum features and the target tone features are subjected to first fusion processing to obtain first fusion audio features, tone feature information can be fused in the subsequent identification process through the method, and the influence of music accompaniment or background music on the identification effect is eliminated; further, the first fusion audio characteristic and the music characteristic are subjected to second fusion processing to obtain a second fusion audio characteristic, so that music characteristic information can be fused in a subsequent identification process, identification of music types and music styles is increased, and identification accuracy is improved. And finally, performing prediction processing on the second fusion audio features through a preset character recognition model to obtain a target identity tag of the target singing object, wherein the target identity tag is used for representing the identity of the target singing object. Therefore, the identity of the target singing object can be conveniently determined, the problem that the identity of the singing object is difficult to identify due to the fact that the number of the singing objects in the meta universe is increased can be effectively solved, and the identification accuracy of the singing object is improved.
Drawings
Fig. 1 is a flowchart of a singing object recognition method provided in an embodiment of the present application;
FIG. 2 is a flowchart of step S102 in FIG. 1;
FIG. 3 is a flowchart of step S103 in FIG. 1;
FIG. 4 is a flowchart of step S104 in FIG. 1;
fig. 5 is a flowchart of step S107 in fig. 1;
fig. 6 is a flowchart of step S501 in fig. 5;
fig. 7 is a flowchart of step S502 in fig. 5;
fig. 8 is a schematic structural diagram of a singing object recognition apparatus according to an embodiment of the present application;
fig. 9 is a schematic hardware structure diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
It should be noted that although functional blocks are partitioned in a schematic diagram of an apparatus and a logical order is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the partitioning of blocks in the apparatus or the order in the flowchart. The terms first, second and the like in the description and in the claims, and the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
First, several terms referred to in the present application are resolved:
artificial Intelligence (AI): is a new technical science for researching and developing theories, methods, technologies and application systems for simulating, extending and expanding human intelligence; artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produces a new intelligent machine that can react in a manner similar to human intelligence, and research in this field includes robotics, language recognition, image recognition, natural language processing, and expert systems, among others. The artificial intelligence can simulate the information process of human consciousness and thinking. Artificial intelligence is also a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results.
Natural Language Processing (NLP): NLP uses computer to process, understand and use human language (such as chinese, english, etc.), and belongs to a branch of artificial intelligence, which is a cross discipline between computer science and linguistics, also commonly called computational linguistics. Natural language processing includes parsing, semantic analysis, discourse understanding, and the like. Natural language processing is commonly used in the technical fields of machine translation, character recognition of handwriting and print, speech recognition and text-to-speech conversion, information intention recognition, information extraction and filtering, text classification and clustering, public opinion analysis and viewpoint mining, and relates to data mining, machine learning, knowledge acquisition, knowledge engineering, artificial intelligence research, linguistic research related to language calculation and the like related to language processing.
Information Extraction (IE): a text processing technology that extracts factual information of specified types, such as entities, relations and events, from natural language text and outputs structured data. Information extraction is a technique for extracting specific information from text data. Text data is composed of specific units, such as sentences, paragraphs and chapters, and text information is composed of smaller specific units, such as words, phrases, sentences and paragraphs, or combinations of these units. Extracting noun phrases, names of people, place names and the like from text data is text information extraction; of course, the information extracted by text information extraction technology may be of various types.
Metaverse: a digital living space; a virtual world built by technological means that is linked with, mapped to and interacts with the real world and carries a new type of social system. The metaverse is essentially a virtualization and digitization process of the real world, requiring a great deal of transformation of content production, economic systems, user experience and physical-world content. The development of the metaverse is gradual, and it is finally shaped by the continuous fusion and evolution of many tools and platforms, supported by shared infrastructure, standards and protocols. The metaverse provides immersive experience based on augmented reality technology, generates a mirror image of the real world based on digital twin technology, builds an economic system based on blockchain technology, tightly fuses the virtual world and the real world in terms of the economic system, the social system and the identity system, and allows each user to produce content and edit the world.
Fourier transform: a transform that can express a function satisfying certain conditions as a linear combination of trigonometric functions (sine and/or cosine functions) or their integrals. In different fields of research, the Fourier transform has many different variant forms, such as the continuous Fourier transform and the discrete Fourier transform.
Mel-Frequency Cepstral Coefficients (MFCC): a set of key coefficients used to create the mel-frequency cepstrum. From segments of a music signal, a set of cepstral coefficients sufficient to represent the signal is obtained, and the mel-frequency cepstrum is a cepstrum (i.e., the spectrum of a spectrum) computed on the mel scale. Unlike the general cepstrum, the most distinctive property of the mel cepstrum is that its frequency bands are uniformly distributed on the mel scale, i.e., such frequency bands are closer to the human nonlinear auditory system than the commonly used linear cepstrum representation. For example, mel cepstrum processing is often used in the field of audio compression.
Pooling: essentially sampling; a selected way of performing dimensionality reduction and compression on an input feature map so as to speed up computation. The most commonly used pooling method is Max Pooling.
Activation Function (Activation Function): is a function that runs on a neuron of an artificial neural network responsible for mapping the input of the neuron to the output.
Vector Quantization (Vector Quantization, VQ): the method clusters original continuous data into discrete data in a clustering-like mode, so that the data quantity required to be stored is reduced, and the aim of data compression is fulfilled.
Encoding (Encoder): the input sequence is converted into a vector of fixed length.
Decoding (Decoder): converting the fixed vector generated before into an output sequence; wherein, the input sequence can be characters, voice, images and videos; the output sequence may be text, images.
Gated Recurrent Unit (GRU): the GRU is a kind of Recurrent Neural Network (RNN). Like the LSTM (Long Short-Term Memory), it was proposed to address problems such as long-term memory and gradients in back-propagation.
Softmax function: the Softmax function is a normalized exponential function that "compresses" one K-dimensional vector z containing arbitrary real numbers into another K-dimensional real vector σ (z) such that each element ranges between (0,1) and the sum of all elements is 1, which is commonly used in multi-classification problems.
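For reference, the standard form of the Softmax function can be written as follows (a supplementary illustration, not part of the original text):

```latex
% Softmax over a K-dimensional input vector z; each output lies in (0,1) and all outputs sum to 1.
\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K
```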
With the development of metaverse technology, many aspects of daily life can be extended through the metaverse to a world combining the virtual and the real. As the number of singing objects in the metaverse increases, common identification methods often have difficulty accurately identifying the identity of a singing object, so how to improve the identification accuracy of singing objects has become a technical problem to be solved urgently.
Based on this, the embodiment of the application provides a singing object identification method, a singing object identification device, an electronic device and a storage medium, and aims to improve the identification accuracy of a singing object.
The singing object recognition method, the singing object recognition apparatus, the electronic device, and the storage medium provided in the embodiments of the present application are specifically described with reference to the following embodiments, in which the singing object recognition method in the embodiments of the present application is first described.
The embodiment of the application can acquire and process related data based on artificial intelligence technology. Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The embodiment of the application provides a singing object recognition method, and relates to the technical field of artificial intelligence. The singing object recognition method provided by the embodiment of the application can be applied to a terminal, a server side and software running in the terminal or the server side. In some embodiments, the terminal may be a smartphone, tablet, laptop, desktop computer, or the like; the server side can be configured into an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, and cloud servers for providing basic cloud computing services such as cloud service, a cloud database, cloud computing, cloud functions, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN (content delivery network) and big data and artificial intelligence platforms; the software may be an application or the like implementing a singing object recognition method, but is not limited to the above form.
The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Fig. 1 is an alternative flowchart of a singing object recognition method provided in an embodiment of the present application, and the method in fig. 1 may include, but is not limited to, steps S101 to S107.
Step S101, acquiring target audio data of a target singing object;
step S102, extracting frequency spectrum characteristics of target audio data to obtain Mel cepstrum characteristics;
step S103, extracting tone color characteristics of the target audio data to obtain target tone color characteristics;
step S104, performing middle-layer feature extraction on the target audio data to obtain music characteristic features;
step S105, performing first fusion processing on the Mel cepstrum characteristic and the target tone characteristic to obtain a first fusion audio characteristic;
Step S106, carrying out second fusion processing on the first fusion audio characteristic and the music characteristic feature to obtain a second fusion audio characteristic;
and S107, performing prediction processing on the second fusion audio characteristic through a preset character recognition model to obtain a target identity tag of the target singing object, wherein the target identity tag is used for representing the identity of the target singing object.
The steps S101 to S107 illustrated in the embodiment of the present application are performed by acquiring target audio data of a target singing object; further, extracting the frequency spectrum characteristic of the target audio data to obtain a Mel cepstrum characteristic; extracting tone color characteristics of the target audio data to obtain target tone color characteristics; the middle-layer feature extraction is carried out on the target audio data to obtain the music characteristic feature, and the tone characteristic information, the music characteristic information and other contents of the target singing object can be conveniently determined. The method has the advantages that the first fusion processing is carried out on the Mel cepstrum characteristics and the target tone characteristics to obtain first fusion audio characteristics, tone characteristic information can be fused in the subsequent identification process through the method, and the influence of music accompaniment or background music on the identification effect is eliminated. And performing second fusion processing on the first fusion audio characteristic and the music characteristic to obtain a second fusion audio characteristic, so that music characteristic information can be fused in a subsequent identification process, the identification of the music type and the music style is increased, and the identification precision is improved. And predicting the second fusion audio features through a preset character recognition model to obtain a target identity label of the target singing object, wherein the target identity label is used for representing the identity of the target singing object. Therefore, the identity of the target singing object is conveniently determined, the problem that the identity of the singing object is difficult to identify due to the fact that the number of the singing objects in the meta universe is increased can be effectively solved, and the identification accuracy of the singing object is improved.
In step S101 of some embodiments, a web crawler may be written, and a data source is set, and then data is crawled in a targeted manner, so as to obtain target audio data of a target singing object, where the data source may be various types of network platforms, social media, some specific audio databases, and the like, and the target audio data may be music materials, lecture reports, chat conversations, and the like of the target singing object. The target audio data may also be acquired by other means, without being limited thereto.
In each embodiment of the present application, when data related to the user identity or characteristic, such as user information, user behavior data, user history data, and user location information, is processed, permission or consent of the user is obtained, and the data collection, use, and processing comply with relevant laws and regulations and standards of relevant countries and regions. In addition, when the embodiment of the present application needs to acquire sensitive personal information of a user, individual permission or individual consent of the user is obtained through a pop-up window or a jump to a confirmation page, and after the individual permission or individual consent of the user is definitely obtained, necessary user-related data for enabling the embodiment of the present application to operate normally is acquired.
Referring to fig. 2, in some embodiments, step S102 may include, but is not limited to, step S201 to step S202:
step S201, performing sound spectrum calculation on target audio data through short-time Fourier transform to obtain a target spectrogram;
step S202, filtering the target spectrogram through a preset Mel cepstrum filter to obtain Mel cepstrum characteristics.
In step S201 of some embodiments, a target spectrogram is obtained by performing a sound spectrum calculation on target audio data through short-time fourier transform. Specifically, the target audio data is subjected to signal framing and windowing to obtain multiple frames of audio segments, short-time Fourier transform is performed on the audio segments of each frame, time domain features of the audio segments are converted into frequency domain features, and finally, the frequency domain features of each frame are stacked in the time dimension to obtain a target spectrogram.
In step S202 of some embodiments, the target spectrogram is filtered through a 64-dimensional mel cepstrum filter bank and subjected to a logarithm operation to obtain a target logarithmic spectrum, which is then subjected to an inverse Fourier transform to obtain the target mel cepstrum. Further, feature extraction is performed on the target mel cepstrum to obtain the mel cepstrum features, where the feature dimension of the mel cepstrum features is T×64, and T is the number of frames of the target audio data.
Through the steps S201 to S202, the target audio data can be conveniently converted into the frequency spectrum feature, and the frequency spectrum feature is filtered to obtain the mel cepstrum feature, so that the identity of the target singing object can be identified through the frequency spectrum feature, and the identification accuracy is improved.
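As an illustration of steps S201 to S202 (a minimal sketch, not part of the patent text), the sound-spectrum calculation and mel filtering can be approximated in Python with librosa; the sampling rate, frame length, hop size and the use of a DCT as the inverse transform are assumptions made for this example:

```python
# Illustrative sketch only: STFT sound-spectrum calculation, 64-band mel filtering,
# a logarithm operation and an inverse (DCT) transform, yielding a (T, 64) feature matrix.
import librosa
import numpy as np
from scipy.fftpack import dct

def mel_cepstrum_features(wav_path, sr=16000, n_fft=1024, hop_length=256, n_mels=64):
    audio, _ = librosa.load(wav_path, sr=sr)                        # target audio data
    spectrogram = np.abs(librosa.stft(audio, n_fft=n_fft,           # short-time Fourier transform
                                      hop_length=hop_length)) ** 2  # target spectrogram (power)
    mel = librosa.feature.melspectrogram(S=spectrogram, sr=sr, n_mels=n_mels)  # mel filter bank
    log_mel = librosa.power_to_db(mel)                              # logarithm operation
    cepstrum = dct(log_mel, axis=0, norm='ortho')                   # stand-in for the inverse transform
    return cepstrum.T                                               # shape (T, 64): T frames, 64 dimensions
```

In practice the FFT size, hop length and number of mel bands would be tuned to the audio material.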
Referring to fig. 3, in some embodiments, step S103 may include, but is not limited to, step S301 to step S305:
step S301, segmenting target audio data through a preset audio segmentation model to obtain a plurality of target audio segments;
step S302, extracting the characteristics of each target audio segment through an audio segmentation model to obtain a target audio hidden vector;
step S303, performing feature calculation on all target audio hidden vectors to obtain a target audio mean vector and a target audio variance vector;
step S304, splicing the target audio variance vector and the target audio mean vector to obtain a target timbre implicit vector;
step S305, the target tone color implicit vector is subjected to prediction processing through a preset function, and target tone color characteristics are obtained.
In step S301 of some embodiments, the preset audio segmentation model may be constructed based on an X-Vector network structure, and the audio segmentation model may include at least one Deep Neural Networks (DNN) layer. And carrying out segmentation processing on the target audio data through the audio segmentation model and the preset segment length, and dividing the target audio data into a plurality of target audio segments with the same time length according to different audio time.
In step S302 of some embodiments, feature extraction is performed on each target audio segment through a DNN layer of the audio segmentation model, and audio feature information in the target audio segment is obtained to obtain a target audio hidden vector corresponding to each target audio segment.
In step S303 of some embodiments, when performing feature calculation on all target audio hidden vectors, first performing mean calculation on all target audio hidden vectors to obtain a target audio mean vector; and then performing difference calculation on each target audio implicit vector and the target audio mean vector to obtain a target audio variance vector corresponding to each target audio implicit vector.
In step S304 of some embodiments, vector splicing is performed on the target audio variance vector and the target audio mean vector to obtain a target timbre hidden vector.
In step S305 of some embodiments, the preset function is a softmax function, by which a probability distribution can be created for the target timbre hidden vector on each preset reference timbre feature, the probability that the target timbre hidden vector belongs to each preset reference timbre feature is reflected according to the probability distribution, the reference timbre feature to which the target timbre hidden vector belongs and which has the highest probability is selected as the target timbre feature, and the feature dimension of the target timbre feature may be 512 dimensions. The target tone characteristic can reflect the tone characteristics of the target singing object more accurately, and the tone characteristics comprise information such as tone height, frequency speed, volume, tone quality and the like of the target singing object during sounding.
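The timbre branch of steps S301 to S305 can be sketched as follows (an illustrative PyTorch example, not the patent's implementation; the layer sizes and the number of reference timbre classes are assumptions):

```python
# Illustrative sketch only: an X-Vector-style timbre branch. Segment-level DNN
# embeddings are pooled into mean and variance statistics, concatenated into a
# timbre hidden vector, and mapped by a softmax onto reference timbre classes.
import torch
import torch.nn as nn

class TimbreExtractor(nn.Module):
    def __init__(self, feat_dim=64, hidden_dim=512, num_ref_timbres=256):
        super().__init__()
        self.frame_dnn = nn.Sequential(               # DNN layers of the audio segmentation model
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        self.classifier = nn.Linear(2 * hidden_dim, num_ref_timbres)

    def forward(self, segments):                      # segments: (num_segments, feat_dim)
        hidden = self.frame_dnn(segments)             # target audio hidden vectors
        mean = hidden.mean(dim=0)                     # target audio mean vector
        var = hidden.var(dim=0, unbiased=False)       # target audio variance vector
        timbre_hidden = torch.cat([var, mean], dim=0) # spliced target timbre hidden vector
        probs = torch.softmax(self.classifier(timbre_hidden), dim=0)  # preset softmax function
        return probs  # probability over reference timbre features; the argmax gives the target timbre class
```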
Referring to fig. 4, in some embodiments, step S104 may include, but is not limited to, step S401 to step S405:
step S401, inputting target audio data into a preset feature extraction model, wherein the feature extraction model comprises a first convolution layer, a second convolution layer and a third convolution layer;
step S402, performing music middle-layer feature extraction on the target audio data through the first convolution layer to obtain a first music middle-layer feature;
step S403, performing music middle layer feature extraction on the target audio data through the second convolution layer to obtain a second music middle layer feature;
step S404, performing music middle-layer feature extraction on the target audio data through the third convolutional layer to obtain third music middle-layer features;
step S405, the first music middle layer feature, the second music middle layer feature, and the third music middle layer feature are spliced to obtain the music characteristic feature.
In step S401 of some embodiments, the target audio data is input into a preset feature extraction model, where the feature extraction model may be constructed based on the Inception V3 network structure; the feature extraction model includes an input layer, an intermediate layer, a first convolutional layer, a second convolutional layer, and a third convolutional layer, where the input layer and the intermediate layer may also be convolutional structures.
In step S402 of some embodiments, music middle layer feature extraction is performed on the target audio data through the first convolution layer, and music characteristic information in the target audio data is captured to obtain a first music middle layer feature, where a feature dimension of the first music middle layer feature is 128 dimensions.
In step S403 in some embodiments, music middle-layer feature extraction is performed on the target audio data through the second convolutional layer, and music characteristic information in the target audio data is captured, so as to obtain a second music middle-layer feature, where a feature dimension of the second music middle-layer feature is 128 dimensions.
In step S404 of some embodiments, music middle-layer feature extraction is performed on the target audio data through the third convolutional layer, and music characteristic information in the target audio data is captured, so as to obtain a third music middle-layer feature, where a feature dimension of the third music middle-layer feature is 256 dimensions.
In step S405 of some embodiments, vector splicing is performed on the first music middle-layer feature, the second music middle-layer feature, and the third music middle-layer feature in the vector form to obtain a music characteristic feature, and a feature dimension of the music characteristic feature is 512 dimensions.
Through the steps S401 to S405, the target audio data can be migrated and learned well, the music middle layer feature information in the target audio data is extracted, the music feature characteristics are obtained, and the music feature information in the target audio data can be reflected more comprehensively through the music feature characteristics, so that the music feature content can be blended in the singing object recognition process, and the recognition accuracy of the singing object is improved.
It should be explained that the characteristics of the music domain can be roughly divided into three levels. The low-level features of music have a well-defined concept, and the low-level features of music include music beats, music chords, and the like. The music high-level features are not clearly defined and subjective concepts, include emotions of a singer, music genres, music similarities and the like, and can be defined only by considering various aspects of music. The middle-layer music feature refers to a music feature between the low-layer music feature and the high-layer music feature, and the middle-layer music feature includes the speed of music, the stability of rhythm, the melody of music, the complexity of rhythm of music, the perceptibility of music, and the like. The music mid-level features can be used to improve music emotion recognition, music retrieval and music classification.
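A minimal sketch of the three-branch extractor described in steps S401 to S405 is shown below (illustration only; the kernel sizes, the pooling and the single-channel spectrogram input are assumptions, and a real Inception V3 backbone is considerably deeper):

```python
# Illustrative sketch only: three convolutional branches whose outputs (128, 128 and
# 256 dimensions) are concatenated into a 512-dimensional music characteristic feature.
import torch
import torch.nn as nn

class MidLevelFeatureExtractor(nn.Module):
    def __init__(self, in_channels=1):
        super().__init__()
        def branch(out_channels, kernel_size):
            return nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size, padding=kernel_size // 2),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1))              # collapse the time/frequency axes
        self.conv1 = branch(128, 3)                   # first convolution layer  -> 128 dimensions
        self.conv2 = branch(128, 5)                   # second convolution layer -> 128 dimensions
        self.conv3 = branch(256, 7)                   # third convolution layer  -> 256 dimensions

    def forward(self, spectrogram):                   # (batch, 1, freq, time)
        feats = [m(spectrogram).flatten(1) for m in (self.conv1, self.conv2, self.conv3)]
        return torch.cat(feats, dim=1)                # music characteristic feature, 512 dimensions
```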
In step S105 of some embodiments, performing the first fusion process on the mel-frequency cepstral features and the target timbre features includes two stages. The first stage is to perform splicing processing on the mel-frequency cepstrum features and the target tone features, specifically, vector addition can be performed on the mel-frequency cepstrum features and the target tone features in a vector form to obtain preliminary fusion features. In the second stage, the initial fusion feature is subjected to convolution processing for multiple times to obtain a first fusion audio feature, specifically, in the singing object recognition method in the embodiment of the application, the convolution processing for the initial fusion feature includes convolution processing for four times, that is, feature extraction is performed on the initial fusion feature through a first convolution network to obtain a first fusion feature, feature extraction is performed on the first fusion feature through a second convolution network to obtain a second fusion feature, further, feature extraction is performed on the second fusion feature through a third convolution network to obtain a third fusion feature, and finally, feature extraction is performed on the third fusion feature through a fourth convolution network to obtain the first fusion audio feature.
It should be noted that the processing process of each convolution network includes stages of convolution operation, pooling operation, activation operation, and the like, and the dimension reduction processing and feature extraction on the initial fusion feature can be realized by this method, so as to obtain a first fusion audio feature meeting the requirement, where the first convolution network, the second convolution network, the third convolution network, and the fourth convolution network may have the same network structure. For example, each convolutional network comprises a convolutional layer, a pooling layer and an active layer, wherein the convolutional layer has a convolutional kernel size of 3 × 3 or 1 × 1, and the number of channels may be 128; the activation function of the activation layer may be a Relu function, a Sigmoid function, or the like, without limitation.
In step S106 of some embodiments, in performing the second fusion processing on the first fusion audio feature and the music characteristic feature, vector addition may be performed on the first fusion audio feature and the music characteristic feature in a vector form to implement merging of the first fusion audio feature and the music characteristic feature, thereby obtaining a second fusion audio feature.
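The two fusion stages may be sketched as follows (illustration only; the feature shapes, the 1-D convolutions standing in for the 3×3 or 1×1 convolution networks, and the final projection layer are assumptions made so the example runs end to end):

```python
# Illustrative sketch only: stage one adds the mel cepstrum and timbre features and
# passes the result through four small convolution networks (convolution + pooling +
# activation, 128 channels); stage two adds the music characteristic feature.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch=128):
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.MaxPool1d(2),
        nn.ReLU())

class FusionNetwork(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.blocks = nn.Sequential(conv_block(1), conv_block(128),
                                    conv_block(128), conv_block(128))   # four convolution networks
        self.project = nn.Linear(128 * (feat_dim // 16), feat_dim)

    def forward(self, mel_feat, timbre_feat, music_feat):      # each assumed to be a 512-dim vector
        preliminary = mel_feat + timbre_feat                    # first-stage vector addition
        x = self.blocks(preliminary.unsqueeze(0).unsqueeze(0))  # (batch=1, channels=1, length=512)
        first_fused = self.project(x.flatten(1)).squeeze(0)     # first fusion audio feature
        return first_fused + music_feat                         # second fusion audio feature
```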
Referring to fig. 5, in some embodiments where the character recognition model includes a GRU layer and a full link layer, step S107 may include, but is not limited to including, steps S501 through S502:
step S501, performing time sequence feature extraction on the second fusion audio feature through a GRU layer to obtain a fusion time sequence feature vector;
and step S502, performing label prediction processing on the fusion time sequence characteristic vector through a full connection layer to obtain a target identity label of the target singing object.
In step S501 of some embodiments, when the time sequence feature of the second fusion audio feature is extracted through the GRU layer of the character recognition model, the time sequence dimension feature and the audio state feature of the second fusion audio feature can be merged, and a fusion time sequence feature vector is output, where the fusion time sequence feature vector fuses time sequence relationships of the frequency spectrum feature, the tone feature, and the music feature of the target audio data, and can better provide support for subsequent identification of the target singing object, thereby improving the recognition accuracy.
In step S502 of some embodiments, label probability calculation is performed on the fusion time sequence feature vector through a classification function of the full connection layer to obtain a label probability vector corresponding to each preset identity label, and the preset identity label corresponding to the label probability vector with the largest value is selected as the target identity label of the target singing object, so as to determine the identity of the target singing object according to the target identity label, where the target identity label can not only represent who the target singing object is, but also represent whether the target singing object is a virtual character or a real character.
Referring to fig. 6, in some embodiments, step S501 includes, but is not limited to, steps S601 to S602:
step S601, performing time sequence feature extraction on the second fusion audio feature through a first gating circulating unit of the GRU layer to obtain an initial time sequence feature vector;
step S602, performing time sequence feature extraction on the initial time sequence feature vector through a second gating circulation unit of the GRU layer to obtain a fusion time sequence feature vector.
In step S601 in some embodiments, a first gating cycle unit of the GRU layer performs time sequence feature extraction on the second fusion audio feature, captures a time sequence dimension feature of the second fusion audio feature, and then performs feature fusion on the time sequence dimension feature of the second fusion audio feature and the first audio state feature in the first gating cycle unit to obtain an initial time sequence feature vector.
In step S602 of some embodiments, a second gating circulation unit of the GRU layer performs timing feature extraction on the initial timing feature vector, captures a time series dimension feature of the initial timing feature vector, and then performs feature fusion on the time series dimension feature of the initial timing feature vector and a second audio state feature in the second gating circulation unit to obtain a fused timing feature vector.
It should be noted that, the structures of the first gated loop unit and the second gated loop unit may be completely the same, or may be different, and are not limited. For example, in some embodiments, the first gated loop unit and the second gated loop unit each have a 32-unit structure, and the random activation parameter is set to 0.5.
In the steps S601 to S602, the two gate control cycle units are used to extract the time sequence feature of the second fusion audio feature, so that the time sequence feature information of the target audio data on the tone feature, the music feature and the frequency spectrum feature can be captured better, and better data support is provided for the subsequent prediction process of the singing object, thereby improving the identification accuracy of the singing object.
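A minimal sketch of the two-unit GRU layer described in steps S601 to S602 is given below (illustration only; the input dimensionality and the interpretation of the random activation parameter as dropout are assumptions):

```python
# Illustrative sketch only: two stacked gated recurrent units with 32 hidden units
# each and a 0.5 dropout rate, extracting a fused timing feature vector from the
# sequence of second fusion audio features.
import torch
import torch.nn as nn

class TimingFeatureExtractor(nn.Module):
    def __init__(self, input_dim=512, hidden_dim=32, dropout=0.5):
        super().__init__()
        self.gru1 = nn.GRU(input_dim, hidden_dim, batch_first=True)   # first gated recurrent unit
        self.gru2 = nn.GRU(hidden_dim, hidden_dim, batch_first=True)  # second gated recurrent unit
        self.dropout = nn.Dropout(dropout)

    def forward(self, fused_sequence):                 # (batch, time, input_dim)
        initial, _ = self.gru1(fused_sequence)         # initial timing feature vectors
        initial = self.dropout(initial)
        _, last_hidden = self.gru2(initial)            # fused timing features
        return last_hidden.squeeze(0)                  # (batch, hidden_dim) fused timing feature vector
```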
Referring to fig. 7, in some embodiments, step S502 may include, but is not limited to, step S701 to step S703:
step S701, performing label probability calculation on the fusion time sequence feature vector through a classification function of a full connection layer and preset identity labels to obtain a label probability vector corresponding to each preset identity label;
step S702, selecting a preset identity label corresponding to the label probability vector with the maximum value to obtain a candidate identity label;
and step S703, obtaining a target identity label according to the candidate identity label.
In step S701 of some embodiments, the classification function may be a probability function such as a softmax function, and the preset identity tag may be extracted from different data sources, for example, basic information of various people, including identity information, personal conditions, and related audio-video data, etc., obtained from network media and a social platform. Taking the softmax function as an example, the probability distribution condition of the fusion time sequence feature vector on each preset identity tag can be created through the softmax function, and the probability of the fusion time sequence feature vector belonging to each preset identity tag is reflected through the probability distribution condition, so that the tag probability vector corresponding to each preset identity tag is obtained.
In step S702 and step S703 of some embodiments, the size of the tag probability vector may visually reflect the possibility that the fused timing sequence feature vector belongs to each preset identity tag, and when the numerical value of the tag probability vector is larger, the higher the matching degree of the preset identity tag corresponding to the fused timing sequence feature vector is, the higher the possibility that the fused timing sequence feature vector comes from the character corresponding to the preset identity tag is indicated, so that the preset identity tag corresponding to the tag probability vector with the largest numerical value is selected to obtain one or more candidate identity tags, and then a certain identity tag is selected from the candidate identity tags as the target identity tag, thereby representing the identity of the target singing object through the target identity tag.
In the steps S701 to S703, the probability that the fusion timing feature vector belongs to each preset identity tag can be conveniently quantified through the classification function to obtain the tag probability vector, and then the most suitable preset identity tag is selected as the target identity tag according to the size of the tag probability vector, so that the identity of the target singing object is confirmed according to the target identity tag, and the recognition accuracy of the singing object is improved.
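Steps S701 to S703 can be illustrated with the following classification head (illustration only; the identity label list and the feature dimensionality are hypothetical):

```python
# Illustrative sketch only: the fully connected layer maps the fused timing feature
# vector to one logit per preset identity label, softmax turns the logits into label
# probabilities, and the label with the largest probability is returned.
import torch
import torch.nn as nn

class IdentityClassifier(nn.Module):
    def __init__(self, feature_dim=32,
                 identity_labels=("singer_a", "singer_b", "virtual_idol_c")):  # hypothetical labels
        super().__init__()
        self.identity_labels = identity_labels
        self.fc = nn.Linear(feature_dim, len(identity_labels))   # fully connected layer

    def forward(self, fused_timing_vector):                      # (batch, feature_dim)
        logits = self.fc(fused_timing_vector)
        probs = torch.softmax(logits, dim=-1)                    # label probability vectors
        best = probs.argmax(dim=-1)                              # label with the maximum probability
        return [self.identity_labels[i] for i in best.tolist()], probs
```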
According to the singing object recognition method, target audio data of a target singing object are obtained; further, extracting the frequency spectrum characteristic of the target audio data to obtain a Mel cepstrum characteristic; extracting tone color characteristics of the target audio data to obtain target tone color characteristics; the middle-layer feature extraction is carried out on the target audio data to obtain the music characteristic feature, and the tone characteristic information, the music characteristic information and other contents of the target singing object can be conveniently determined. Further, the Mel cepstrum features and the target tone features are subjected to first fusion processing to obtain first fusion audio features, tone feature information can be fused in the subsequent identification process through the method, and the influence of music accompaniment or background music on the identification effect is eliminated; furthermore, second fusion processing is carried out on the first fusion audio characteristic and the music characteristic to obtain a second fusion audio characteristic, so that music characteristic information can be fused in a subsequent identification process, identification of music types and music styles is increased, and identification accuracy is improved. And finally, performing prediction processing on the second fusion audio features through a preset character recognition model to obtain a target identity tag of the target singing object, wherein the target identity tag is used for representing the identity of the target singing object. Therefore, the identity of the target singing object is conveniently determined, the problem that the identity of the singing object is difficult to identify due to the fact that the number of the singing objects in the meta universe is increased can be effectively solved, and the identification accuracy of the singing object is improved.
Referring to fig. 8, an embodiment of the present application further provides a singing object recognition apparatus, which can implement the singing object recognition method described above, and the apparatus includes:
a data obtaining module 801, configured to obtain target audio data of a target singing object;
the spectral feature extraction module 802 is configured to perform spectral feature extraction on the target audio data to obtain mel cepstrum features;
a tone characteristic extraction module 803, configured to perform tone characteristic extraction on the target audio data to obtain a target tone characteristic;
the middle-layer feature extraction module 804 is used for performing middle-layer feature extraction on the target audio data to obtain music characteristic features;
a first fusion module 805, configured to perform first fusion processing on the mel cepstrum feature and the target timbre feature to obtain a first fusion audio feature;
a second fusion module 806, configured to perform second fusion processing on the first fusion audio feature and the music characteristic feature to obtain a second fusion audio feature;
the predicting module 807 is configured to perform prediction processing on the second fusion audio feature through a preset character recognition model to obtain a target identity tag of the target singing object, where the target identity tag is used to represent an identity of the target singing object.
In some embodiments, the spectral feature extraction module 802 includes:
the audio spectrum calculation unit is used for carrying out audio spectrum calculation on the target audio data through short-time Fourier transform to obtain a target spectrogram;
and the filtering unit is used for filtering the target spectrogram through a preset Mel cepstrum filter to obtain Mel cepstrum characteristics.
In some embodiments, the timbre feature extraction module 803 comprises:
the segmentation unit is used for segmenting the target audio data through a preset audio segmentation model to obtain a plurality of target audio segments;
the characteristic extraction unit is used for extracting the characteristics of each target audio segment through the audio segmentation model to obtain a target audio hidden vector;
the characteristic calculation unit is used for carrying out characteristic calculation on all target audio implicit vectors to obtain a target audio mean vector and a target audio variance vector;
the first splicing unit is used for splicing the target audio variance vector and the target audio mean vector to obtain a target timbre implicit vector;
and the prediction unit is used for performing prediction processing on the target tone hidden vector through a preset function to obtain target tone characteristics.
In some embodiments, the middle tier feature extraction module 804 includes:
the device comprises an input unit, a processing unit and a processing unit, wherein the input unit is used for inputting target audio data into a preset feature extraction model, and the feature extraction model comprises a first convolution layer, a second convolution layer and a third convolution layer;
the first extraction unit is used for performing music middle-layer feature extraction on the target audio data through the first convolution layer to obtain a first music middle-layer feature;
the first extraction unit is used for performing music middle-layer feature extraction on the target audio data through the second convolution layer to obtain second music middle-layer features;
the first extraction unit is used for performing music middle-layer feature extraction on the target audio data through the third convolution layer to obtain third music middle-layer features;
and the second splicing unit is used for splicing the first music middle layer characteristic, the second music middle layer characteristic and the third music middle layer characteristic to obtain the music characteristic.
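The following sketch illustrates one plausible form of such a middle-layer feature extractor: three convolution layers with different receptive fields are applied to the same audio representation, and their outputs are spliced along the channel dimension. Kernel sizes, channel counts, and the use of a spectrogram input are assumptions, not details taken from this embodiment.

```python
import torch
import torch.nn as nn

class MidLevelExtractor(nn.Module):
    def __init__(self, in_ch=1, out_ch=32):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)  # first convolution layer
        self.conv2 = nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2)  # second convolution layer
        self.conv3 = nn.Conv2d(in_ch, out_ch, kernel_size=7, padding=3)  # third convolution layer

    def forward(self, x):
        # x: (batch, 1, n_mels, num_frames), e.g. a Mel spectrogram of the target audio data
        f1 = torch.relu(self.conv1(x))  # first music middle-layer feature
        f2 = torch.relu(self.conv2(x))  # second music middle-layer feature
        f3 = torch.relu(self.conv3(x))  # third music middle-layer feature
        return torch.cat([f1, f2, f3], dim=1)  # spliced music characteristic feature
```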
In some embodiments, the character recognition model includes a GRU layer and a fully connected layer, and the prediction module 807 includes:
a temporal feature extraction unit, configured to perform temporal feature extraction on the second fusion audio feature through the GRU layer to obtain a fused temporal feature vector;
and a label prediction unit, configured to perform label prediction processing on the fused temporal feature vector through the fully connected layer to obtain the target identity tag of the target singing object.
In some embodiments, the temporal feature extraction unit includes:
a first temporal feature extraction subunit, configured to perform temporal feature extraction on the second fusion audio feature through a first gated recurrent unit of the GRU layer to obtain an initial temporal feature vector;
and a second temporal feature extraction subunit, configured to perform temporal feature extraction on the initial temporal feature vector through a second gated recurrent unit of the GRU layer to obtain the fused temporal feature vector.
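As a hedged illustration of this two-stage temporal extraction, the sketch below stacks two gated recurrent units so that the first yields the initial temporal feature vectors and the second yields the fused temporal feature vector; the feature and hidden dimensions are assumptions.

```python
import torch
import torch.nn as nn

class TemporalFeatureExtractor(nn.Module):
    def __init__(self, feat_dim=256, hidden_dim=128):
        super().__init__()
        self.gru1 = nn.GRU(feat_dim, hidden_dim, batch_first=True)    # first gated recurrent unit
        self.gru2 = nn.GRU(hidden_dim, hidden_dim, batch_first=True)  # second gated recurrent unit

    def forward(self, second_fused):
        # second_fused: (batch, num_frames, feat_dim) second fusion audio feature
        initial_seq, _ = self.gru1(second_fused)  # initial temporal feature vectors, frame by frame
        _, h = self.gru2(initial_seq)             # final hidden state of the second unit
        return h.squeeze(0)                       # fused temporal feature vector, (batch, hidden_dim)
```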
In some embodiments, the label prediction unit includes:
a probability calculation subunit, configured to perform label probability calculation on the fused temporal feature vector through a classification function of the fully connected layer and preset identity labels to obtain a label probability vector corresponding to each preset identity label;
a label selection subunit, configured to select the preset identity label corresponding to the label probability vector with the largest value to obtain a candidate identity label;
and a label determination subunit, configured to obtain the target identity tag according to the candidate identity label.
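Continuing the sketch, label prediction through the fully connected layer could look as follows; the softmax classification function and the argmax selection mirror the probability calculation and label selection subunits, while the mapping from the selected index to the target identity tag is assumed to be a direct lookup.

```python
import torch

def predict_identity_tag(fused_vec, fc_layer, preset_identity_labels):
    # fused_vec: (batch, hidden_dim) fused temporal feature vector
    # fc_layer: a torch.nn.Linear(hidden_dim, num_labels), the fully connected layer
    # preset_identity_labels: one preset identity label per output unit
    logits = fc_layer(fused_vec)
    probs = torch.softmax(logits, dim=-1)  # label probability vector for each preset identity label
    idx = probs.argmax(dim=-1)             # index of the largest label probability
    # Candidate identity label taken as the target identity tag (assumed lookup).
    return [preset_identity_labels[i] for i in idx.tolist()]
```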
The specific implementation of the singing object recognition apparatus is basically the same as the specific implementation of the singing object recognition method, and is not described herein again.
An embodiment of the present application further provides an electronic device. The electronic device includes a memory, a processor, a program stored in the memory and executable on the processor, and a data bus for implementing connection and communication between the processor and the memory; when the program is executed by the processor, the singing object recognition method described above is implemented. The electronic device may be any intelligent terminal, including a tablet computer, a vehicle-mounted computer, and the like.
Referring to fig. 9, fig. 9 illustrates a hardware structure of an electronic device according to another embodiment, where the electronic device includes:
the processor 901 may be implemented by a general-purpose CPU (central processing unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute a relevant program to implement the technical solution provided in the embodiment of the present application;
the memory 902 may be implemented in the form of a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 902 may store an operating system and other application programs; when the technical solution provided in the embodiments of the present application is implemented by software or firmware, the relevant program codes are stored in the memory 902 and called by the processor 901 to execute the singing object recognition method of the embodiments of the present application;
an input/output interface 903 for implementing information input and output;
a communication interface 904, configured to implement communication interaction between this device and other devices, where communication may be implemented in a wired manner (e.g., USB, network cable) or in a wireless manner (e.g., mobile network, Wi-Fi, Bluetooth);
a bus 905 that transfers information between various components of the device (e.g., the processor 901, the memory 902, the input/output interface 903, and the communication interface 904);
wherein the processor 901, the memory 902, the input/output interface 903 and the communication interface 904 are communicatively connected to each other within the device via a bus 905.
The embodiment of the present application further provides a storage medium, which is a computer-readable storage medium for computer-readable storage, where the storage medium stores one or more programs, and the one or more programs are executable by one or more processors to implement the singing object recognition method.
The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
According to the singing object recognition method and apparatus, the electronic device, and the storage medium provided by the embodiments of the present application, target audio data of a target singing object is acquired; spectral feature extraction is performed on the target audio data to obtain Mel cepstrum features; timbre feature extraction is performed on the target audio data to obtain a target timbre feature; and middle-layer feature extraction is performed on the target audio data to obtain a music characteristic feature, so that the timbre information, musical characteristics, and other content of the target singing object can be conveniently determined. Further, first fusion processing is performed on the Mel cepstrum features and the target timbre feature to obtain a first fusion audio feature, so that timbre information is fused into the subsequent recognition process and the influence of musical accompaniment or background music on the recognition effect is eliminated. Second fusion processing is then performed on the first fusion audio feature and the music characteristic feature to obtain a second fusion audio feature, so that musical characteristic information is also fused into the subsequent recognition process, recognition of music genre and music style is added, and recognition accuracy is improved. Finally, prediction processing is performed on the second fusion audio feature through a preset character recognition model to obtain a target identity tag of the target singing object, where the target identity tag is used to represent the identity of the target singing object. In this way, the identity of the target singing object can be conveniently determined, the difficulty of identifying singing objects caused by the growing number of singing objects in the metaverse can be effectively alleviated, and the recognition accuracy of singing objects is improved.
The embodiments described above are intended to illustrate the technical solutions of the embodiments of the present application more clearly and do not constitute a limitation on those technical solutions; it will be apparent to those skilled in the art that, with the evolution of technologies and the emergence of new application scenarios, the technical solutions provided in the embodiments of the present application are equally applicable to similar technical problems.
It will be appreciated by those skilled in the art that the solutions shown in figs. 1 to 7 do not limit the embodiments of the present application; an implementation may include more or fewer steps than those shown, combine some of the steps, or include different steps.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, that is, they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
One of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the application and the above-described figures, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate that only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following" or a similar expression refers to any combination of the listed items, including any combination of single or plural items. For example, at least one of a, b, or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may be single or plural.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the above-described division of units is only one type of division of logical functions, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part thereof contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product; the software product is stored in a storage medium and includes multiple instructions for causing a computer device (which may be a personal computer, a server, or a network device) to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing programs, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The preferred embodiments of the present application have been described above with reference to the accompanying drawings, and the scope of the claims of the embodiments of the present application is not limited thereto. Any modifications, equivalents and improvements that may occur to those skilled in the art without departing from the scope and spirit of the embodiments of the present application are intended to be within the scope of the claims of the embodiments of the present application.

Claims (10)

1. A singing object recognition method, the method comprising:
acquiring target audio data of a target singing object;
performing spectral feature extraction on the target audio data to obtain Mel cepstrum features;
performing timbre feature extraction on the target audio data to obtain a target timbre feature;
performing middle-layer feature extraction on the target audio data to obtain a music characteristic feature;
performing first fusion processing on the Mel cepstrum features and the target timbre feature to obtain a first fusion audio feature;
performing second fusion processing on the first fusion audio feature and the music characteristic feature to obtain a second fusion audio feature;
and performing prediction processing on the second fusion audio feature through a preset character recognition model to obtain a target identity tag of the target singing object, wherein the target identity tag is used for representing the identity of the target singing object.
2. The singing object recognition method according to claim 1, wherein the step of performing spectral feature extraction on the target audio data to obtain Mel cepstrum features comprises:
performing audio spectrum calculation on the target audio data through a short-time Fourier transform to obtain a target spectrogram;
and filtering the target spectrogram through a preset Mel cepstrum filter to obtain the Mel cepstrum features.
3. The singing object recognition method according to claim 1, wherein the step of performing timbre feature extraction on the target audio data to obtain a target timbre feature comprises:
segmenting the target audio data through a preset audio segmentation model to obtain a plurality of target audio segments;
extracting the characteristics of each target audio segment through the audio segmentation model to obtain a target audio hidden vector;
performing feature calculation on all the target audio hidden vectors to obtain a target audio mean vector and a target audio variance vector;
splicing the target audio variance vector and the target audio mean vector to obtain a target timbre hidden vector;
and performing prediction processing on the target timbre hidden vector through a preset function to obtain the target timbre feature.
4. The singing object recognition method according to claim 1, wherein the step of performing middle-layer feature extraction on the target audio data to obtain a music characteristic feature comprises:
inputting the target audio data into a preset feature extraction model, wherein the feature extraction model comprises a first convolution layer, a second convolution layer and a third convolution layer;
performing music middle-layer feature extraction on the target audio data through the first convolution layer to obtain a first music middle-layer feature;
performing music middle-layer feature extraction on the target audio data through the second convolution layer to obtain a second music middle-layer feature;
performing music middle-layer feature extraction on the target audio data through the third convolution layer to obtain a third music middle-layer feature;
and splicing the first music middle-layer feature, the second music middle-layer feature and the third music middle-layer feature to obtain the music characteristic feature.
5. The singing object recognition method according to any one of claims 1 to 4, wherein the character recognition model comprises a GRU layer and a fully connected layer, and the step of performing prediction processing on the second fusion audio feature through a preset character recognition model to obtain a target identity tag of the target singing object comprises:
performing temporal feature extraction on the second fusion audio feature through the GRU layer to obtain a fused temporal feature vector;
and performing label prediction processing on the fused temporal feature vector through the fully connected layer to obtain the target identity tag of the target singing object.
6. The singing object recognition method according to claim 5, wherein the step of performing temporal feature extraction on the second fusion audio feature through the GRU layer to obtain a fused temporal feature vector comprises:
performing temporal feature extraction on the second fusion audio feature through a first gated recurrent unit of the GRU layer to obtain an initial temporal feature vector;
and performing temporal feature extraction on the initial temporal feature vector through a second gated recurrent unit of the GRU layer to obtain the fused temporal feature vector.
7. The singing object recognition method according to claim 5, wherein the step of performing label prediction processing on the fused temporal feature vector through the fully connected layer to obtain the target identity tag of the target singing object comprises:
performing label probability calculation on the fused temporal feature vector through a classification function of the fully connected layer and preset identity labels to obtain a label probability vector corresponding to each preset identity label;
selecting the preset identity label corresponding to the label probability vector with the largest value to obtain a candidate identity label;
and obtaining the target identity tag according to the candidate identity label.
8. An apparatus for singing object recognition, the apparatus comprising:
a data acquisition module, configured to acquire target audio data of a target singing object;
a spectral feature extraction module, configured to perform spectral feature extraction on the target audio data to obtain Mel cepstrum features;
a timbre feature extraction module, configured to perform timbre feature extraction on the target audio data to obtain a target timbre feature;
a middle-layer feature extraction module, configured to perform middle-layer feature extraction on the target audio data to obtain a music characteristic feature;
a first fusion module, configured to perform first fusion processing on the Mel cepstrum features and the target timbre feature to obtain a first fusion audio feature;
a second fusion module, configured to perform second fusion processing on the first fusion audio feature and the music characteristic feature to obtain a second fusion audio feature;
and a prediction module, configured to perform prediction processing on the second fusion audio feature through a preset character recognition model to obtain a target identity tag of the target singing object, wherein the target identity tag is used for representing the identity of the target singing object.
9. An electronic device, characterized in that the electronic device comprises a memory, a processor, a program stored in the memory and executable on the processor, and a data bus for enabling connection and communication between the processor and the memory, wherein the program, when executed by the processor, implements the steps of the singing object recognition method according to any one of claims 1 to 7.
10. A storage medium, which is a computer-readable storage medium for computer-readable storage, characterized in that the storage medium stores one or more programs, which are executable by one or more processors to implement the steps of the singing object recognition method according to any one of claims 1 to 7.
CN202210906243.6A 2022-07-29 2022-07-29 Singing object recognition method and device, electronic equipment and storage medium Pending CN115312064A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210906243.6A CN115312064A (en) 2022-07-29 2022-07-29 Singing object recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210906243.6A CN115312064A (en) 2022-07-29 2022-07-29 Singing object recognition method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115312064A true CN115312064A (en) 2022-11-08

Family

ID=83859086

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210906243.6A Pending CN115312064A (en) 2022-07-29 2022-07-29 Singing object recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115312064A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination