CN113593606B - Audio recognition method and device, computer equipment and computer-readable storage medium - Google Patents

Audio recognition method and device, computer equipment and computer-readable storage medium

Info

Publication number
CN113593606B
CN113593606B
Authority
CN
China
Prior art keywords
heterogeneous
relation
audio data
audio
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111156129.8A
Other languages
Chinese (zh)
Other versions
CN113593606A (en)
Inventor
李金朋
邵云飞
张卫强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202111156129.8A priority Critical patent/CN113593606B/en
Publication of CN113593606A publication Critical patent/CN113593606A/en
Application granted granted Critical
Publication of CN113593606B publication Critical patent/CN113593606B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to an audio recognition method and device, a computer device, and a computer-readable storage medium. The method comprises: acquiring audio features corresponding to audio data; acquiring heterogeneous relation features from a preset heterogeneous relation graph, wherein the preset heterogeneous relation graph represents the relations between labels corresponding to the audio data in a training set, namely the relations between scene labels, between event labels, and between scene labels and event labels, and is generated by inputting an initial heterogeneous relation graph into a preset R-GCN relational graph convolutional neural network; and inputting the audio features and the heterogeneous relation features into a preset deep neural network for audio recognition to generate a scene label and an event label corresponding to the audio data. By adopting the method, the dual recognition and classification tasks of scenes and events in audio can be carried out simultaneously, improving the accuracy and reliability of recognition and classification.

Description

Audio recognition method and device, computer equipment and computer-readable storage medium
Technical Field
The present application relates to the field of multimedia recognition technologies, and in particular, to an audio recognition method and apparatus, a computer device, and a computer-readable storage medium.
Background
With the continuous development of multimedia-related technologies, audio processing technology has advanced as well. Audio recognition is a crucial link in the audio processing pipeline.
In traditional methods, audio recognition mainly identifies scenes and events from the audio, and in practical recognition tasks the scenes and the events are usually recognized separately. In general, however, there are associations between events, between scenes and events, and between scenes in audio. If the scenes and the events in audio are recognized independently, these associations cannot be taken into account during recognition. As a result, the accuracy of recognizing the scenes and events of audio is low.
Disclosure of Invention
In view of the foregoing, it is desirable to provide an audio recognition method and apparatus, a computer device, and a computer-readable storage medium, which can improve the accuracy and reliability of recognition classification.
A method of audio recognition, the method comprising:
acquiring audio features corresponding to the audio data;
acquiring heterogeneous relation characteristics from a preset heterogeneous relation graph, wherein the preset heterogeneous relation graph is used for representing the relation between labels corresponding to audio data in a training set; the relationship between the labels comprises the relationship between the scene labels, the relationship between the event labels and the relationship between the scene labels and the event labels; the preset heterogeneous relation graph is generated by inputting the initial heterogeneous relation graph into a preset R-GCN relation graph convolutional neural network;
and inputting the audio characteristics and the heterogeneous relation characteristics into a preset deep neural network for audio identification, and generating a scene label and an event label corresponding to the audio data.
In one embodiment, the method for generating a scene tag and an event tag corresponding to audio data by inputting audio features and heterogeneous relation features into a preset deep neural network for audio recognition includes:
splicing the audio features and the heterogeneous relation features to generate a fusion heterogeneous relation feature;
inputting the fused heterogeneous relation features into a preset deep neural network for convolution processing to generate target features;
and generating a scene label and an event label corresponding to the audio data according to the target characteristics.
In one embodiment, a method of audio recognition is provided, further comprising:
acquiring a training set, and setting an annotation label for each item of preset audio data in the training set; the annotation labels comprise a scene label and an event label;
constructing an initial heterogeneous relation graph according to a label of preset audio data in a training set;
and inputting the initial heterogeneous relation graph into an initial R-GCN relation graph convolution neural network to generate an intermediate heterogeneous relation graph.
In one embodiment, constructing an initial heterogeneous relationship graph according to a label of preset audio data in a training set includes:
constructing an adjacency matrix according to the co-occurrence probabilities among the annotation labels of the preset audio data in the training set;
constructing a relation category matrix according to the relation categories among the annotation labels of the audio data in the training set;
and constructing an initial heterogeneous relationship diagram according to the adjacency matrix and the relationship type matrix.
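As a rough illustration of how such an adjacency matrix might be built from label co-occurrence, here is a minimal numpy sketch; the mini training set, the label names, and the conditional co-occurrence estimate P(label j present | label i present) are illustrative assumptions rather than the patent's exact construction:

```python
import numpy as np

# Hypothetical mini training set: each clip carries one scene label and a
# set of event labels (all names here are illustrative only).
clips = [
    {"scene": "city_center", "events": {"vehicle", "music"}},
    {"scene": "city_center", "events": {"vehicle", "crowd"}},
    {"scene": "forest_path", "events": {"bird", "leaves"}},
    {"scene": "forest_path", "events": {"bird"}},
]
labels = ["city_center", "forest_path", "vehicle", "music", "crowd", "bird", "leaves"]
idx = {name: k for k, name in enumerate(labels)}
n = len(labels)

count = np.zeros(n)        # occurrences of each label
co = np.zeros((n, n))      # co-occurrences of ordered label pairs per clip
for clip in clips:
    present = {clip["scene"]} | clip["events"]
    for a in present:
        count[idx[a]] += 1
        for b in present:
            if a != b:
                co[idx[a], idx[b]] += 1

# Directed adjacency: A[i, j] = P(label j present | label i present).
# A is generally asymmetric, matching the directed heterogeneous graph.
adjacency = np.divide(co, count[:, None],
                      out=np.zeros_like(co), where=count[:, None] > 0)
```

Note how the conditioning makes the graph directed: "forest_path" always co-occurs with "bird" (weight 1.0), while "city_center" co-occurs with "music" only half the time (weight 0.5).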
In one embodiment, inputting the initial heterogeneous relationship graph into an initial R-GCN relationship graph convolutional neural network to generate an intermediate heterogeneous relationship graph, including:
acquiring initial heterogeneous relation characteristics from the initial heterogeneous relation graph, and performing aggregation updating on the initial heterogeneous relation characteristics through an initial R-GCN relation graph convolutional neural network to generate intermediate heterogeneous relation characteristics;
and updating the initial heterogeneous relationship diagram based on the intermediate heterogeneous relationship characteristics to generate an intermediate heterogeneous relationship diagram.
In one embodiment, the R-GCN graph convolution neural network comprises an R-GCN layer and an activation function; performing aggregation updating on the initial heterogeneous relationship features through an initial R-GCN relationship graph convolutional neural network to generate intermediate heterogeneous relationship features, wherein the method comprises the following steps:
inputting the initial heterogeneous relation features into the R-GCN layer for processing to generate processed initial heterogeneous relation features;
and inputting the processed initial heterogeneous relation characteristics into an activation function for processing to generate intermediate heterogeneous relation characteristics.
In one embodiment, a method of audio recognition is provided, further comprising:
extracting audio features from each preset audio data in the training set, and extracting intermediate heterogeneous relation features from the intermediate heterogeneous relation graph;
inputting the audio features and the intermediate heterogeneous relation features of the preset audio data into an initial deep neural network, and generating a predicted scene label and a predicted event label of the preset audio data;
calculating the value of a loss function according to a predicted scene label and a predicted event label of preset audio data and a labeled scene label and a labeled event label of the preset audio data;
adjusting parameters of an initial R-GCN relation graph convolution neural network according to the value of the loss function to generate a preset R-GCN relation graph convolution neural network;
and adjusting the parameters of the initial deep neural network according to the value of the loss function to generate a preset deep neural network.
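A minimal sketch of how such a joint loss could be computed for one sample, assuming cross-entropy for the single scene label and binary cross-entropy for the multiple event labels; the patent does not specify the exact loss form, and `joint_loss` and its arguments are hypothetical names:

```python
import numpy as np

def joint_loss(logits, scene_target, event_targets, n_scenes):
    """Illustrative joint loss for one sample: softmax cross-entropy over the
    first n_scenes logits (one scene label) plus sigmoid binary cross-entropy
    over the remaining event logits (several event labels)."""
    scene_logits = logits[:n_scenes]
    event_logits = logits[n_scenes:]
    # softmax cross-entropy for the scene head (numerically stabilized)
    shifted = scene_logits - scene_logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    scene_loss = -log_probs[scene_target]
    # sigmoid binary cross-entropy for the event head
    p = 1.0 / (1.0 + np.exp(-event_logits))
    eps = 1e-12
    event_loss = -(event_targets * np.log(p + eps)
                   + (1 - event_targets) * np.log(1 - p + eps)).mean()
    return scene_loss + event_loss
```

A single scalar like this can then drive gradient updates of both the R-GCN and the deep neural network, as the embodiment describes.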
An audio recognition apparatus, the apparatus comprising:
the audio characteristic acquisition module is used for acquiring audio characteristics corresponding to the audio data;
the heterogeneous relation feature acquisition module is used for acquiring heterogeneous relation features from a preset heterogeneous relation graph, wherein the preset heterogeneous relation graph is used for representing the relations between labels corresponding to the audio data in the training set; the relations between the labels comprise the relations between scene labels, between event labels, and between scene labels and event labels; the preset heterogeneous relation graph is generated by inputting the initial heterogeneous relation graph into a preset R-GCN relational graph convolutional neural network;
and the audio identification module is used for inputting the audio characteristics and the heterogeneous relation characteristics into a preset deep neural network for audio identification, and generating a scene label and an event label corresponding to the audio data.
A computer device comprising a memory storing a computer program and a processor implementing the steps of the method as above when executing said computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method as above.
According to the audio recognition method and device, the computer device, and the computer-readable storage medium, in the process of recognizing audio data, the audio features corresponding to the audio data are first obtained, and the heterogeneous relation features are then obtained from the preset heterogeneous relation graph, which is generated by inputting the initial heterogeneous relation graph into a preset R-GCN relational graph convolutional neural network. Finally, the audio features and the heterogeneous relation features are input into a preset deep neural network for audio recognition, generating the scene label and event label corresponding to the audio data. The preset heterogeneous relation graph represents the relations between the labels corresponding to the audio data in the training set: the relations between scene labels, between event labels, and between scene labels and event labels. Obtaining the heterogeneous relation features of the audio data from the preset heterogeneous relation graph therefore fully accounts for the scene-scene, scene-event, and event-event heterogeneous relationships in the audio data. The method can thus perform the dual recognition and classification tasks of scenes and events in audio simultaneously, improving the accuracy and reliability of audio recognition.
Drawings
FIG. 1 is a diagram of an exemplary audio recognition application environment;
FIG. 2 is a flow diagram illustrating an exemplary audio recognition method;
FIG. 3 is a schematic diagram of a process for obtaining audio features corresponding to audio data according to an embodiment;
FIG. 4 is a diagram illustrating heterogeneous relationships in one embodiment;
FIG. 5 is a flow diagram illustrating the generation of scene tags and event tags corresponding to audio data, according to one embodiment;
FIG. 6 is a flow diagram illustrating an exemplary audio recognition method;
FIG. 7 is a schematic flow diagram for building an initial heterogeneous relationship graph in one embodiment;
FIG. 8 is a schematic diagram of a network training process of an embodiment of an audio recognition method;
FIG. 9 is a flow diagram of an exemplary embodiment of a method for audio recognition;
FIG. 10 is a block diagram showing the structure of an audio recognition apparatus according to an embodiment;
FIG. 11 is a block diagram of the audio recognition module of FIG. 10;
FIG. 12 is a diagram showing an internal configuration of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Fig. 1 is a diagram of an application scenario of audio recognition in one embodiment. As shown in fig. 1, the application environment includes a computer device 140. The computer device 140 obtains the audio features corresponding to the audio data 120, and then obtains the heterogeneous relationship features from the preset heterogeneous relationship diagram. The preset heterogeneous relation graph is used for representing the relation between labels corresponding to the audio data in the training set; the relationship between the labels includes a relationship between the scene labels, a relationship between the event labels, and a relationship between the scene labels and the event labels. The preset heterogeneous relational graph is generated based on inputting the initial heterogeneous relational graph into a preset R-GCN relational graph convolution neural network. And finally, inputting the audio characteristics and the heterogeneous relation characteristics into a preset deep neural network for audio recognition, and generating a scene label and an event label corresponding to the audio data.
Fig. 2 is a flowchart illustrating an audio recognition method according to an embodiment, and as shown in fig. 2, an audio recognition method applied to a computer device is provided, which includes steps 220 to 260.
S220, acquiring the audio characteristics corresponding to the audio data.
Audio data is a sequence of numerical samples representing a signal that varies continuously in the time domain, conventionally visualized as a waveform. Extracting audio features simplifies this waveform signal and speeds up the understanding of semantic content in audio by machines such as computers or servers. Common audio features include, but are not limited to: zero-crossing rate, short-term energy, short-term autocorrelation function, short-term average magnitude difference, spectrogram, short-term power spectral density, spectral entropy, fundamental frequency, mel spectrum, and the like. Extracting audio features generally requires sampling and quantizing the audio signal, where sampling is the discretization of continuous time and quantization converts the continuous waveform amplitudes into discrete numbers. Common transforms for extracting audio features from audio data include, but are not limited to: the short-time Fourier transform, discrete cosine transform, discrete wavelet transform, mel spectrum and mel-frequency cepstrum, constant-Q transform, and the like. The present application does not limit the means of obtaining the audio features corresponding to the audio data.
Fig. 3 is a schematic flowchart of acquiring the audio features corresponding to audio data in one embodiment. As shown in fig. 3, in this embodiment a log-mel spectrum is preferably extracted from the audio data as the audio feature, which includes:
S221, framing the audio data. Audio data is short-time stationary: unstable as a whole but approximately stationary locally. It is therefore divided into frames, yielding multiple frames of audio data;
S222, windowing each frame of audio data obtained in S221. Framing introduces discontinuities at the boundary between two adjacent frames, which adds error relative to the original audio data. Windowing mitigates this problem, so that the signal is continuous between frames and exhibits a periodic characteristic. Commonly used window functions include, but are not limited to: rectangular windows, Hann windows, Hamming windows, Blackman windows, etc. Preferably, in this embodiment a Hann window is used to window the audio data, obtaining multiple frames of windowed audio data;
S223, acquiring the spectrogram of the audio data. A short-time Fourier transform is applied to each of the windowed frames obtained in S222; it converts the time-domain signal of the audio data into a frequency-domain signal. The per-frame spectra are then stacked along time to obtain the spectrogram of the audio data.
And S224, acquiring an energy spectrum of the audio data. And taking the modulus square of the spectrogram of the audio data obtained in the step S223 to obtain an energy spectrum of the whole audio data.
S225, obtaining the log-mel spectrum of the audio data: the frequencies of the energy spectrum obtained in S224 are converted to the mel scale. A common form of this conversion is:

m = 2595 · log10(1 + f / 700)
(1)

where f is a frequency in the energy spectrum and m is the corresponding frequency on the mel scale.
Further, the energy spectrum converted into the Mel scale is passed through a preset Mel filter bank to obtain a log-mel spectrum of the audio data. Preferably, each filter in the predetermined mel filter bank is a triangular filter.
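The steps S221 to S225 above can be sketched end to end in numpy; the frame length, hop size, mel-filter count, and the 2595/700 constants of the mel conversion are common choices assumed for illustration, not values fixed by the embodiment:

```python
import numpy as np

def hz_to_mel(f):
    # Mel-scale conversion, a common variant of equation (1):
    # m = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(signal, sr=16000, n_fft=512, hop=256, n_mels=40):
    """Sketch of S221-S225: frame, Hann window, STFT, power spectrum
    (modulus squared), triangular mel filter bank, log compression."""
    # S221/S222: framing plus Hann window
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    # S223/S224: short-time Fourier transform and power spectrum
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # S225: triangular filters spaced uniformly on the mel scale
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return np.log(power @ fbank.T + 1e-10)   # log-mel spectrum, (frames, mels)
```

The small additive constant before the logarithm guards against log(0) in silent frames, a standard practical detail.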
S240, acquiring heterogeneous relation characteristics from a preset heterogeneous relation graph, wherein the preset heterogeneous relation graph is used for representing the relation between labels corresponding to the audio data in the training set; the relationship between the labels comprises the relationship between the scene labels, the relationship between the event labels and the relationship between the scene labels and the event labels; the preset heterogeneous relational graph is generated based on inputting the initial heterogeneous relational graph into a preset R-GCN relational graph convolution neural network.
Specifically, fig. 4 is a diagram illustrating a heterogeneous relationship graph in one embodiment. As shown in fig. 4, the heterogeneous relationship graph includes four scene labels: city center, city park, forest path, and lakeside beach; and nine event labels: vehicle sound, stream sound, birdsong, rustling leaves, crowd noise, music, sports sound, cheering, and water waves. In the heterogeneous relationship graph of fig. 4, scene labels are represented by rectangular nodes and event labels by elliptical nodes. To distinguish the connection type between two nodes, a thick line represents a scene-event relationship and a thin line an event-event relationship, so the graph reflects the relationship type of two connected nodes. Further, a solid line indicates a high degree of association between two nodes, a dashed line a low degree of association, and two unconnected nodes have no association; that is, the graph reflects the degree of association of two connected nodes, with the weight of each edge quantifying that degree. In addition, the heterogeneous relationship graph is a directed graph: for example, the weight of the edge from the node "city center" to the node "music" differs from the weight of the edge from "music" to "city center".
The initial heterogeneous relationship graph is input into a preset R-GCN relational graph convolutional neural network to obtain the preset heterogeneous relationship graph. A heterogeneous relationship graph is an irregular, effectively infinite-dimensional data structure without translation invariance, so neural networks such as the convolutional neural network (CNN) and the recurrent neural network (RNN) handle it poorly. The graph convolutional network (GCN), like the CNN, is essentially a feature extractor, except that its input is graph data. The GCN extracts graph features from the input graph, and the resulting features can support tasks such as node classification, graph classification, and edge prediction on the input graph. Compared with the GCN, the relational graph convolutional neural network R-GCN better accounts for the types and directions of the edges in the heterogeneous relationship graph.
The initial heterogeneous relationship graph is input into the preset R-GCN relational graph convolutional neural network, and the state of each node in the initial graph is updated to obtain the preset heterogeneous relationship graph. The weights of the edges between nodes in the preset heterogeneous relationship graph reflect the relations between scenes, between events, and between scenes and events in the audio data.
The R-GCN is used for extracting the heterogeneous relation characteristics of the heterogeneous relation graph, so that the relations between scenes, between scenes and events and between events in the heterogeneous relation graph can be fully considered, and the accuracy of audio data identification is improved.
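A single R-GCN layer of the kind described, which aggregates neighbor features with a separate weight matrix per relation type, might be sketched as follows; this is a simplified illustration assuming mean normalization c_{i,r} = |N_r(i)| and a ReLU activation, not the patent's exact layer:

```python
import numpy as np

def rgcn_layer(H, A_by_rel, W_rel, W_self):
    """One relational graph-convolution (R-GCN) layer, sketched in numpy.

    H        : (n_nodes, d_in) node features
    A_by_rel : dict relation -> (n_nodes, n_nodes) directed adjacency
    W_rel    : dict relation -> (d_in, d_out) per-relation weight matrix
    W_self   : (d_in, d_out) self-loop weight matrix

    h_i' = ReLU( W_self h_i + sum_r sum_{j in N_r(i)} (1/c_{i,r}) W_r h_j )
    """
    out = H @ W_self                                  # self-loop term
    for rel, A in A_by_rel.items():
        deg = A.sum(axis=1, keepdims=True)            # c_{i,r} = |N_r(i)|
        norm_A = np.divide(A, deg, out=np.zeros_like(A), where=deg > 0)
        out = out + norm_A @ H @ W_rel[rel]           # per-relation aggregation
    return np.maximum(out, 0.0)                       # ReLU activation
```

Keeping one weight matrix per relation type is what lets the layer treat scene-scene, scene-event, and event-event edges differently, which a plain GCN cannot do.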
And S260, inputting the audio characteristics and the heterogeneous relation characteristics into a preset deep neural network for audio identification, and generating a scene label and an event label corresponding to the audio data.
Specifically, the audio features of the audio data obtained in step S220 and the heterogeneous relationship features obtained in step S240 are input into a preset deep neural network for audio recognition, and the scene label and event label corresponding to the audio data are determined from the network's output vector. The deep neural network includes, but is not limited to: CNN convolutional neural networks, GAN generative adversarial networks, ResNet residual networks, etc. Preferably, in the embodiment of the present application, the audio features and the heterogeneous relationship features are input into a preset ResNet residual network for audio recognition, and the scene label and event label corresponding to the audio data are determined from the output vector.
In the embodiment of the present application, in the process of recognizing the audio data, the audio features corresponding to the audio data are first obtained, and the heterogeneous relationship features of the audio data are then obtained from the preset heterogeneous relationship graph. The heterogeneous relationship graph represents the relations between scene labels, between event labels, and between scene labels and event labels corresponding to the audio data, so obtaining the heterogeneous relationship features from the preset graph fully accounts for the scene-scene, scene-event, and event-event relations in the audio data. Finally, the audio features and the heterogeneous relationship features are input into a preset deep neural network for audio recognition, and the scene label and event label corresponding to the audio data are determined from the network output, so that scenes and events in the audio are recognized simultaneously. The conventional method, by contrast, generally recognizes the scenes and events of audio separately and does not fully consider the relations between scenes, between scenes and events, and between events. The method can therefore perform the dual recognition and classification tasks of scenes and events in audio simultaneously, improving the accuracy and reliability of audio recognition.
In one embodiment, fig. 5 is a schematic flowchart illustrating a process of generating a scene tag and an event tag corresponding to audio data in one embodiment, and as shown in fig. 5, S260 inputs an audio feature and a heterogeneous relationship feature into a preset deep neural network for audio recognition to generate a scene tag and an event tag corresponding to audio data, including:
and S262, splicing the audio features and the heterogeneous relation features to generate a fusion heterogeneous relation feature.
Specifically, the audio features extracted from the audio data and the heterogeneous relationship features extracted from the preset heterogeneous relationship graph are concatenated; both are expressed in matrix form. The present application does not limit the specific concatenation operation; preferably, in this embodiment, the audio features and the heterogeneous relationship features are concatenated along the row-vector direction. For example, if the audio features form a matrix of size T × C and the heterogeneous relationship features a matrix of size N × C, the fused heterogeneous relationship features generated after concatenation form a matrix of size (T + N) × C.
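The concatenation described above is straightforward; the particular values of T, N, and C below are illustrative only:

```python
import numpy as np

# Illustrative shapes: T audio frames and N graph nodes, shared feature dim C.
T, N, C = 61, 13, 128
audio_feat = np.arange(T * C, dtype=float).reshape(T, C)  # audio features
graph_feat = -np.ones((N, C))          # heterogeneous-relation node features
fused = np.concatenate([audio_feat, graph_feat], axis=0)  # shape (T + N, C)
```

The fused matrix keeps the audio rows first and the graph rows after them, so the downstream network sees both kinds of feature in one input.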
And S264, inputting the fused heterogeneous relation characteristics into a preset deep neural network for convolution processing to generate target characteristics.
Specifically, the fused heterogeneous relationship features are input into a preset deep neural network for convolution processing to obtain a tensor, namely the target feature. The scene label is a multi-class, single-output task: the output determines one scene label for the audio data. The event labels form a multi-class, multi-output task: the output can determine several event labels for the audio data. The activation function of the last convolutional layer in the deep neural network is designed accordingly: a softmax function is adopted for scene-label recognition and a sigmoid function for event-label recognition. The softmax function maps each element into (0, 1) with all elements summing to 1; the sigmoid function maps each element into (0, 1), but the elements need not sum to 1. Specifically:

z_i = softmax(x_i) = exp(x_i) / Σ_{j=1}^{N1} exp(x_j), for 1 ≤ i ≤ N1
z_i = sigmoid(x_i) = 1 / (1 + exp(-x_i)), for N1 < i ≤ N
(2)

where z_i is the target-feature tensor, x_i is the output of the last convolutional layer of the deep neural network, N is the dimension of the output target state (the total number of scene labels and event labels), N1 is the total number of scene labels, and N2 is the total number of event labels.
The fused heterogeneous relation feature is input into the preset deep neural network for convolution processing; the convolution operations learn and extract features from the input data and update its representation layer by layer, so that the target feature is obtained after the updates.
And S266, generating a scene label and an event label corresponding to the audio data according to the target characteristics.
Specifically, the target feature obtained in S264 is an N-dimensional tensor. Among the first N₁ dimensions of the tensor, the label corresponding to the maximum element value is selected as the recognition result of the scene tag; among the last N₂ dimensions, the labels corresponding to element values greater than a preset threshold are selected as the recognition results of the event tags, where the preset threshold is selected from the range (0, 1); preferably, 0.4 is used as the threshold in this embodiment.
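A minimal sketch of this decision rule, assuming the first N₁ outputs are softmax scene scores and the remaining outputs are sigmoid event scores; the scores below are illustrative only.

```python
import numpy as np

def decode_labels(z, n_scene, threshold=0.4):
    """Pick one scene label (argmax over the first n_scene dims) and all
    event labels whose score exceeds the threshold (remaining dims)."""
    scene = int(np.argmax(z[:n_scene]))
    events = [i for i, p in enumerate(z[n_scene:]) if p > threshold]
    return scene, events

z = np.array([0.1, 0.7, 0.2,    # 3 scene scores (softmax, sum to 1)
              0.9, 0.05, 0.6])  # 3 event scores (independent sigmoids)
scene, events = decode_labels(z, n_scene=3)
print(scene, events)  # 1 [0, 2]
```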
In the embodiment of the application, the audio feature and the heterogeneous relation feature are spliced to generate the fused heterogeneous relation feature, which carries not only the audio feature of the audio data but also the relation features between scenes, between scenes and events, and between events. The fused heterogeneous relation feature is then input into the preset deep neural network for convolution processing to generate the target feature, and finally the scene tag and event tag corresponding to the audio data are generated according to the target feature. Because the target feature output by the deep neural network reflects both the audio feature of the audio data and the relationships between scenes, between scenes and events, and between events, the scenes and events of the audio data are recognized simultaneously, improving the accuracy of scene and event recognition from the audio data.
In one embodiment, fig. 6 is a schematic flowchart of an audio recognition method in an embodiment, and as shown in fig. 6, an audio recognition method is provided, which further includes:
s620, acquiring a training set, and setting a label for each preset audio data in the training set; the labeling label comprises a scene label and an event label.
Specifically, in a general case, the manner of acquiring the audio data includes, but is not limited to: directly acquiring the existing audio, such as downloading from an online sound material website or searching through a multimedia optical disc; acquiring audio by using audio processing software, such as capturing and intercepting CD disc audio data by using audio processing software, or stripping sound in video by using software similar to "thousands of listeners" or the like; the sound is recorded by a microphone, for example, a "recorder" or a "microphone" carried by a computer, a terminal or the like is used for collecting the sound. The method for acquiring the audio data is not limited in the present application.
An audio data training set is made from the acquired audio data, and a label tag is set for each audio data in the training set: each audio data is labeled with one scene tag and a plurality of event tags. The label tags may be set manually or with labeling tools; the method of setting the label tags is not limited in this embodiment. Preferably, in this embodiment, manual labeling is adopted to set the label tag for each preset audio data in the training set.
And S640, constructing an initial heterogeneous relation graph according to the label labels of the preset audio data in the training set.
Specifically, the heterogeneous relationship graph is composed of nodes and edges, wherein the nodes represent the types of labels, and the edges represent the weight relationship between two connected nodes. Therefore, by counting the labeling labels of the preset audio data in the training set, the relevance between the labeling labels of the preset audio data can be obtained, and a heterogeneous relational graph is further constructed.
And S660, inputting the initial heterogeneous relation graph into an initial R-GCN relation graph convolution neural network to generate an intermediate heterogeneous relation graph.
Specifically, because the initial heterogeneous relationship graph is obtained by counting the label tags of the preset audio data in the training set, it can only reflect the statistical relations between the labels of the preset audio data in the training set and cannot accurately reflect the logical relationships between scene tags, between scene tags and event tags, and between event tags in audio data; the initial R-GCN relational graph convolutional neural network is therefore used to refine it into the intermediate heterogeneous relationship graph.
In the embodiment of the application, an audio data training set is obtained first, and a label is set for each preset audio data in the training set, wherein the label includes a scene label and an event label. And then, constructing an initial heterogeneous relation graph by counting the labeling labels of the preset audio data in the training set. And finally, inputting the initial heterogeneous relationship diagram into an initial R-GCN relationship diagram convolution neural network to generate an intermediate heterogeneous relationship diagram. The established initial heterogeneous relation graph fully considers the relation among all the label labels in the audio data training set, so that the accuracy of audio identification is improved, and the reliability of results is improved.
In one embodiment, fig. 7 is a schematic flowchart of a process of constructing an initial heterogeneous relationship diagram in an embodiment, and as shown in fig. 7, S640 constructs the initial heterogeneous relationship diagram according to a label tag of preset audio data in a training set, including:
S642, constructing an adjacency matrix according to the co-occurrence probability among the label tags of the preset audio data in the training set.
Specifically, assume the total number of label-tag types of the preset audio data in the training set is N, the total number of labeled scene tags is N₁, and the total number of labeled event tags is N₂. The co-occurrence state among the label tags of the preset audio data in the training set is counted to construct a co-occurrence matrix M, where M is an N×N matrix. M_ij indicates, over all preset audio data in the training set, the number of times label L_j is also present when label L_i is present. The total number of occurrences of each of the N tags over all preset audio data in the training set is counted and represented as an N-dimensional vector S. Then, each element of each column vector in M is divided by the corresponding element of the S vector to obtain the adjacency matrix A, which is also an N×N matrix, specifically represented as:
$$
A_{ij} = \frac{M_{ij}}{S_i} \qquad (3)
$$
as can be seen, the adjacency matrix a reflects the co-occurrence probability among the label labels of all the preset audio data in the training set. Specifically, the adjacency matrix a may reflect the thickness of an edge connecting two nodes in the heterogeneous relationship diagram, and further, the adjacency matrix a may reflect a relationship weight between two nodes in the heterogeneous relationship diagram.
Preferably, in consideration of the fact that noise labeling labels exist in all preset audio data in the training set, in this embodiment, the value of an element in the adjacency matrix a that is smaller than the threshold is set to 0, which may indicate that there is no association between two labels, i.e., there is no edge connection between two nodes in the heterogeneous relationship graph.
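Following the statistics described above, the adjacency matrix can be sketched as follows; the tiny label matrix and the threshold value are illustrative only, not prescribed by the method.

```python
import numpy as np

def build_adjacency(label_matrix, tau=0.1):
    """label_matrix: (num_clips, N) multi-hot tags of each training clip.
    Returns the N x N adjacency A with A[i, j] = M[i, j] / S[i], i.e. the
    probability that tag j co-occurs given tag i; entries below tau are
    zeroed to suppress noisy co-occurrences."""
    M = label_matrix.T @ label_matrix        # co-occurrence counts M[i, j]
    S = label_matrix.sum(axis=0)             # S[i]: total occurrences of tag i
    A = M / np.maximum(S[:, None], 1)        # divide row i by S[i]
    A[A < tau] = 0.0
    return A

labels = np.array([[1, 1, 0],
                   [1, 0, 1],
                   [1, 1, 0]])
A = build_adjacency(labels)
print(A[0, 1])  # 2/3: tag 1 appears in 2 of the 3 clips containing tag 0
```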
S644, constructing a relation category matrix according to the relation categories among the labeled labels of the audio data in the training set.
Specifically, a relation category set R is constructed, including the scene-scene relation, the scene-event relation, and the event-event relation, each represented by a different number; preferably, in this embodiment, the numbers 1, 2, and 3 are used to represent the scene-scene, scene-event, and event-event relations respectively. Further, a relation category matrix RMat of the same dimension as the adjacency matrix A is constructed, where RMat_ij indicates the category of the relationship between label L_i and label L_j. Specifically, the relation category matrix RMat reflects the relation category of two connected nodes in the heterogeneous relationship graph.
S646, constructing an initial heterogeneous relation graph according to the adjacent matrixes and the relation category matrixes.
Specifically, an initial heterogeneous relationship graph G = { V, E, R } is constructed according to an adjacency matrix and a relationship type matrix, where V represents each node in the heterogeneous relationship graph, E represents an edge connecting two nodes in the heterogeneous relationship graph, the adjacency matrix a may reflect a relationship weight between two nodes in the heterogeneous relationship graph, and the relationship type matrix RMat may reflect a relationship type between two nodes in the heterogeneous relationship graph.
Further, word vector extraction operation is carried out on the label labels of the preset audio data in the training set, and the word vector extraction is used for mapping a word or phrase into a real number vector. Ways to extract word vectors include, but are not limited to: GloVe, n-gram, word2vec, fastText, ELMO, etc., and the extraction mode of the word vector is not limited in this embodiment. Preferably, in this embodiment, GloVe is adopted to perform word vector extraction operation on the label tags of the preset audio data in the training set, and then the word vector is used as the initial state of each node in the initial heterogeneous relationship graph, that is, the initial heterogeneous relationship feature.
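Putting S644 and S646 together, the relation category matrix and the initial node states can be sketched as below; the label split (2 scene tags, 3 event tags) is hypothetical, and random vectors stand in for the GloVe word vectors assumed in the text.

```python
import numpy as np

N1, N2 = 2, 3                 # hypothetical counts: scene tags, event tags
N = N1 + N2
is_scene = np.arange(N) < N1  # first N1 labels are scenes

# RMat[i, j]: 1 = scene-scene, 2 = scene-event, 3 = event-event (numbering above).
RMat = np.where(is_scene[:, None] & is_scene[None, :], 1,
                np.where(~is_scene[:, None] & ~is_scene[None, :], 3, 2))

# Initial node states of G = {V, E, R}: one word vector per label; random
# placeholders here where the text uses GloVe embeddings.
node_states = np.random.randn(N, 300)
print(RMat[0, 1], RMat[0, 2], RMat[3, 4])  # 1 2 3
```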
In the embodiment of the application, an adjacency matrix is first constructed according to the co-occurrence probability among the label tags of the preset audio data in the training set; the adjacency matrix reflects the relationship weight between two nodes in the heterogeneous relationship graph. Then, a relation category matrix is constructed according to the relation categories among the label tags of the audio data in the training set; the relation category matrix reflects the category of the relationship between two nodes in the heterogeneous relationship graph. Finally, the heterogeneous relationship graph is constructed from the adjacency matrix and the relation category matrix, so that it fully reflects the heterogeneous relationships among the nodes, namely the relationships between scenes, between events, and between scenes and events. The scenes and events of the audio data can thus be recognized simultaneously, which effectively improves the accuracy of audio recognition.
In one embodiment, inputting the initial heterogeneous relationship graph into an initial R-GCN relationship graph convolutional neural network to generate an intermediate heterogeneous relationship graph, including:
and acquiring initial heterogeneous relation characteristics from the initial heterogeneous relation graph, and performing aggregation updating on the initial heterogeneous relation characteristics through an initial R-GCN relation graph convolutional neural network to generate intermediate heterogeneous relation characteristics.
Specifically, the initial state of each node in the initial heterogeneous relationship graph represents the initial heterogeneous relation feature. The initial heterogeneous relation feature is aggregated and updated through the initial R-GCN relational graph convolutional neural network, that is, the state of each node in the initial heterogeneous relationship graph is aggregated and updated to obtain the intermediate heterogeneous relation feature. The aggregation update applied to the state of each node is specifically:
$$
h_i^{(l+1)} = \sigma\left( \sum_{r \in R} \sum_{j \in \mathcal{N}_i^r} A_{ij} W_r^{(l)} h_j^{(l)} + W_0^{(l)} h_i^{(l)} \right) \qquad (4)
$$

where $h_i^{(l)}$ represents the state vector, i.e. the feature, of the $i$-th node at the $l$-th R-GCN layer of the R-GCN relational graph convolutional neural network; $h_i^{(l+1)}$ represents the state vector of the $i$-th node after the $l$-th R-GCN layer, i.e. the intermediate heterogeneous feature obtained after the initial heterogeneous relationship graph is computed by the above formula; $\sigma$ represents the activation function; $R$ represents the set of relation categories; $\mathcal{N}_i^r$ represents the set of neighbors of the $i$-th node under relation $r$; $A$ represents the adjacency matrix; $W_r^{(l)}$ represents the trainable weight matrix under relation $r$ at the $l$-th R-GCN layer; and $W_0^{(l)}$ represents the trainable self-loop weight matrix at the $l$-th R-GCN layer. It can be seen that when the R-GCN relational graph convolutional neural network updates each node in the heterogeneous relationship graph, the node features of each layer are obtained from the node features of the previous layer and the relationships between nodes. In addition, R-GCN obtains the new feature of a node as a weighted sum of its neighbors' features and the node's own feature, so it both retains the node's own information and takes the self-loop into account.
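One aggregation step of equation (4) can be sketched in numpy as below; the dense masking by relation category is an illustrative implementation choice, and all sizes and values are toy inputs.

```python
import numpy as np

def rgcn_layer(H, A, RMat, W_rel, W_self):
    """One R-GCN update: for each relation r, sum A-weighted neighbour
    messages transformed by W_rel[r], add the self-loop term H @ W_self,
    then apply a ReLU activation."""
    out = H @ W_self                      # self-loop term W_0 h_i
    for r, W in W_rel.items():
        mask = (RMat == r) & (A > 0)      # neighbours of each node under r
        out = out + (A * mask) @ H @ W    # weighted neighbour aggregation
    return np.maximum(out, 0.0)           # ReLU

H = np.array([[1.0, 0.0], [0.0, 2.0]])    # 2 nodes, feature dim 2
A = np.array([[0.0, 1.0], [0.5, 0.0]])    # adjacency weights
RMat = np.array([[0, 1], [1, 0]])         # a single relation category (1)
out = rgcn_layer(H, A, RMat, {1: np.eye(2)}, np.eye(2))
# out == [[1.0, 2.0], [0.5, 2.0]]: each node keeps its own feature and
# adds its neighbour's feature scaled by the adjacency weight.
```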
Further, the adjacency matrix A, obtained from the initial heterogeneous relationship graph, is trainable and changes continuously as the R-GCN relational graph convolutional neural network aggregates and updates the initial heterogeneous relation features of the initial heterogeneous relationship graph.
Further, when the state of each node in the initial heterogeneous relationship graph is aggregated and updated, the weight matrices $W_r^{(l)}$ may contain too many parameters; the parameter quantity can be reduced by means of basis decomposition, with the formula:

$$
W_r^{(l)} = \sum_{b=1}^{B} a_{rb}^{(l)} V_b^{(l)} \qquad (5)
$$

where $V_b^{(l)} \in \mathbb{R}^{d^{(l+1)} \times d^{(l)}}$ are the basis matrices, $d^{(l)}$ is the dimension of the node state vectors at the $l$-th R-GCN layer, $B$ is the number of bases, and $a_{rb}^{(l)}$ are coefficients that depend only on the relation category $r$. As can be seen, each $W_r^{(l)}$ can be expressed as a linear combination of the bases $V_b^{(l)}$.
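Equation (5) amounts to sharing B basis matrices across all relations; a sketch with hypothetical dimensions:

```python
import numpy as np

d_in, d_out, B, num_rel = 300, 128, 2, 3
V = np.random.randn(B, d_out, d_in)   # shared basis matrices V_b
a = np.random.randn(num_rel, B)       # coefficients a_rb, depending only on r

# W[r] = sum_b a[r, b] * V[b]: each relation weight is a linear
# combination of the shared bases.
W = np.einsum('rb,boi->roi', a, V)
print(W.shape)  # (3, 128, 300)

# Parameter count drops from num_rel*d_out*d_in to B*d_out*d_in + num_rel*B.
```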
And updating the initial heterogeneous relationship diagram based on the intermediate heterogeneous relationship characteristics to generate an intermediate heterogeneous relationship diagram.
Specifically, the state of each node in the initial heterogeneous relationship graph is the initial heterogeneous relationship characteristic, and the intermediate heterogeneous relationship characteristic is obtained by updating the initial heterogeneous relationship characteristic. For the initial heterogeneous relationship graph, the state of each node is updated, and an intermediate heterogeneous relationship graph is obtained.
In the embodiment of the application, the initial heterogeneous relationship characteristics are obtained from the initial heterogeneous relationship diagram, the initial heterogeneous relationship characteristics are subjected to aggregation updating through the initial R-GCN relationship diagram convolutional neural network, namely, all nodes of the initial heterogeneous relationship diagram are updated, and then the intermediate heterogeneous relationship characteristics are generated. And then updating the initial heterogeneous relationship diagram based on the intermediate heterogeneous relationship characteristics to generate an intermediate heterogeneous relationship diagram. Compared with the initial heterogeneous relationship graph, the obtained intermediate heterogeneous relationship graph can better reflect the relationship among all nodes in the heterogeneous relationship graph, namely the relationship among scenes, scenes and events in the audio data, and further the accuracy of identifying the audio data is improved.
In one embodiment, the R-GCN graph convolution neural network comprises an R-GCN layer and an activation function; performing aggregation updating on the initial heterogeneous relationship features through an initial R-GCN relationship graph convolutional neural network to generate intermediate heterogeneous relationship features, wherein the method comprises the following steps:
and inputting the initial heterogeneous relation characteristics into the R-GCN layer for processing to generate the processed initial heterogeneous relation characteristics.
Specifically, the R-GCN relation graph convolution neural network comprises an R-GCN layer and an activation function, and after the initial heterogeneous relation characteristics are input into the R-GCN layer, the R-GCN layer updates the initial heterogeneous relation characteristics so as to generate the processed initial heterogeneous relation characteristics.
And inputting the processed initial heterogeneous relation characteristics into an activation function for processing to generate intermediate heterogeneous relation characteristics.
Specifically, the R-GCN relational graph convolutional neural network includes R-GCN layers and activation functions; the number of R-GCN layers is not limited here, and preferably, the network used in this embodiment has 2 R-GCN layers. Each R-GCN layer is followed by a corresponding activation function, whose types include but are not limited to: the ReLU activation function, the Tanh activation function, the LReLU activation function, the PReLU activation function, etc.; preferably, the ReLU activation function is used in this embodiment.
In the embodiment of the application, the initial heterogeneous relation characteristics are input into an R-GCN relation graph convolutional neural network R-GCN layer for processing, and the processed initial heterogeneous relation characteristics are generated. And then inputting the processed initial heterogeneous relationship characteristics into an activation function for processing to generate intermediate heterogeneous relationship characteristics. The updating of the initial heterogeneous relation characteristics is realized, and the accuracy of the audio data identification is further improved.
In one embodiment, fig. 8 is a schematic diagram of a network training process of an audio recognition method in one embodiment, and as shown in fig. 8, an audio recognition method is provided, which further includes:
and S810, extracting audio features from each preset audio data in the training set, and extracting intermediate heterogeneous relation features from the intermediate heterogeneous relation graph.
Specifically, an audio feature is extracted from each preset audio data in the training set, and preferably, in this embodiment, a log-mel spectrum of each preset audio data in the training set is extracted as the audio feature. And extracting intermediate heterogeneous relation characteristics by acquiring the state of each node in the intermediate heterogeneous relation graph.
And S820, inputting the audio features and the intermediate heterogeneous relation features of the preset audio data into the initial deep neural network, and generating a predicted scene label and a predicted event label of the preset audio data.
Specifically, the audio features of the preset audio data and the intermediate heterogeneous relation features are spliced and input into an initial deep neural network, and a predicted scene label and a predicted event label of the preset audio data are obtained according to the output vector.
And S830, calculating the value of the loss function according to the predicted scene label and the predicted event label of the preset audio data and the labeled scene label and the labeled event label of the preset audio data.
Specifically, the loss function helps to optimize parameters of the neural network, and the loss of the neural network is minimized by optimizing the parameters of the neural network. The output vector of the deep neural network can reflect the scene label and the event label of the audio data at the same time, the scene label is identified to be a multi-classification single-output task, and the event label is identified to be a multi-classification multi-output task.
Preferably, a multi-class cross entropy loss function is adopted for identifying scene tags, and a two-class cross entropy loss is adopted for identifying event tags, so that the loss function is as follows:
$$
\mathcal{L} = -\sum_{i=1}^{N_1} y_i \log z_i \;-\; \sum_{i=N_1+1}^{N} \bigl[ y_i \log z_i + (1 - y_i) \log (1 - z_i) \bigr] \qquad (6)
$$

It can also be combined and simplified as follows:

$$
\mathcal{L} = -\sum_{i=1}^{N} y_i \log z_i \;-\; \sum_{i=N_1+1}^{N} (1 - y_i) \log (1 - z_i) \qquad (7)
$$

where $y_i$ is the value of the $i$-th element of the corresponding one-hot label encoding (the element corresponding to a labeled tag has the value 1 and the other elements 0); $z_i$ is the value of the $i$-th element of the $N$-dimensional tensor output by the deep neural network, i.e. the target feature; $N$ represents the total number of scene tags and event tags; and $N_1$ represents the total number of scene tags.
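The combined form (7) can be sketched directly; the scores and targets below are illustrative, with the first 3 outputs treated as the softmaxed scene part and the rest as independently sigmoided event parts.

```python
import numpy as np

def joint_loss(z, y, n_scene, eps=1e-12):
    """Cross-entropy over all outputs plus the extra negative-class term
    of the binary cross-entropy for the event outputs, as in equation (7)."""
    ce = -np.sum(y * np.log(z + eps))
    bce_neg = -np.sum((1 - y[n_scene:]) * np.log(1 - z[n_scene:] + eps))
    return ce + bce_neg

z = np.array([0.7, 0.2, 0.1, 0.9, 0.2])  # 3 scene probs + 2 event probs
y = np.array([1.0, 0.0, 0.0, 1.0, 0.0])  # one scene label, one event label
loss = joint_loss(z, y, n_scene=3)
print(round(loss, 4))  # 0.6852, i.e. -ln(0.7) - ln(0.9) - ln(0.8)
```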
And S840, adjusting parameters of the initial R-GCN relation graph convolution neural network according to the value of the loss function, and generating a preset R-GCN relation graph convolution neural network.
Specifically, the initial R-GCN relation graph convolutional neural network can adjust each trainable weight value in the initial R-GCN relation graph convolutional neural network through the value of a loss function, the loss function determines the parameter updating direction of the initial R-GCN relation graph convolutional neural network, and then the preset R-GCN relation graph convolutional neural network is obtained.
And S850, adjusting the parameters of the initial deep neural network according to the value of the loss function, and generating a preset deep neural network.
Specifically, each trainable weight value in the initial deep neural network can be adjusted by the initial deep neural network according to the value of the loss function, and the loss function determines the parameter updating direction of the initial deep neural network, so as to obtain the preset deep neural network.
In the embodiment of the application, firstly, audio features are extracted from each preset audio data in a training set, an intermediate heterogeneous relation feature is extracted from an intermediate heterogeneous relation graph, then the audio features and the intermediate heterogeneous relation features of the preset audio data are input into an initial deep neural network, and a predicted scene label and a predicted event label of the preset audio data are generated. Further, the value of the loss function is calculated according to the predicted scene tag and the predicted event tag of the preset audio data, and the labeled scene tag and the labeled event tag of the preset audio data. Furthermore, adjusting the parameters of the initial R-GCN relation graph convolution neural network according to the value of the loss function, and generating a preset R-GCN relation graph convolution neural network. Further, adjusting parameters of the initial deep neural network according to the value of the loss function to generate a preset deep neural network. By training the initial R-GCN relation graph convolution neural network and the initial deep neural network, all parameters of the two networks are updated, so that the heterogeneous characteristics of the audio data are effectively learned, and the accuracy of identifying the audio data is improved.
It should be understood that although the various steps in the flowcharts of fig. 2-8 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly ordered and may be performed in other orders. Moreover, at least some of the steps in fig. 2-8 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages are not necessarily performed in sequence, but may be performed in turn or alternately with other steps or with at least some sub-steps or stages of other steps.
In a specific embodiment, as shown in fig. 9, there is provided an audio recognition method including:
in the process of training the R-GCN relation graph convolution neural network and the deep neural network, the method specifically comprises the following steps:
s901, acquiring a training set, and setting a label for each preset audio data in the training set, wherein the label comprises a scene label and an event label.
S902, constructing an adjacency matrix according to the co-occurrence probability among the label tags of the preset audio data in the training set; constructing a relation category matrix according to the relation categories among the label tags of the audio data in the training set; and constructing an initial heterogeneous relationship graph according to the adjacency matrix and the relation category matrix.
And S903, acquiring initial heterogeneous relation characteristics from the initial heterogeneous relation graph.
S904, the R-GCN relation graph convolution neural network comprises an R-GCN layer and an activation function, the initial heterogeneous relation characteristics are subjected to aggregation updating through the initial R-GCN relation graph convolution neural network, the initial heterogeneous relation characteristics are input into the R-GCN layer to be processed, and the processed initial heterogeneous relation characteristics are generated; inputting the processed initial heterogeneous relationship characteristics into an activation function for processing to generate intermediate heterogeneous relationship characteristics, and updating the initial heterogeneous relationship diagram based on the intermediate heterogeneous relationship characteristics to generate an intermediate heterogeneous relationship diagram.
S905, extracting intermediate heterogeneous relation features from the intermediate heterogeneous relation graph;
S906, extracting audio features from each preset audio data in the training set;
s907, inputting the audio features and the intermediate heterogeneous relation features of the preset audio data into an initial deep neural network, and generating a predicted scene label and a predicted event label of the preset audio data;
s908, calculating the value of the loss function according to the predicted scene label and the predicted event label of the preset audio data and the labeled scene label and the labeled event label of the preset audio data;
s909, adjusting the parameters of the initial R-GCN relation graph convolution neural network according to the value of the loss function, and generating a preset R-GCN relation graph convolution neural network; and adjusting the parameters of the initial deep neural network according to the value of the loss function to generate a preset deep neural network.
S910, generating a preset heterogeneous relation graph according to a preset R-GCN relation graph convolutional neural network.
In the actual use process, the method specifically comprises the following steps:
s911, obtaining the audio characteristics corresponding to the audio data;
s912, obtaining heterogeneous relation characteristics from a preset heterogeneous relation graph, wherein the preset heterogeneous relation graph is used for representing the relation between labels corresponding to audio data in a training set; the relationship between the labels comprises the relationship between the scene labels, the relationship between the event labels and the relationship between the scene labels and the event labels;
and S913, inputting the audio features and the heterogeneous relation features into a preset deep neural network for audio recognition, and generating a scene tag and an event tag corresponding to the audio data.
In this embodiment, in the process of training the R-GCN relational graph convolutional neural network and the deep neural network, a training set is first obtained and a label tag is set for each preset audio data in the training set; an initial heterogeneous relationship graph is then constructed according to the co-occurrence probabilities and relation categories among the label tags of the preset audio data in the training set. Further, the initial heterogeneous relation features obtained from the initial heterogeneous relationship graph are convolved by the initial R-GCN relational graph convolutional neural network to obtain an intermediate heterogeneous relationship graph. Further, the audio features extracted from each preset audio data in the training set and the intermediate heterogeneous relation features extracted from the intermediate heterogeneous relationship graph are input into the initial deep neural network to generate the predicted scene tag and predicted event tag of the preset audio data. Further, the preset R-GCN relational graph convolutional neural network and the preset deep neural network are obtained by computing the value of the loss function and adjusting the network parameters accordingly.
In the actual use process, firstly, audio features corresponding to the audio data are obtained, the heterogeneous relationship features and the audio features extracted from the preset heterogeneous relationship graph are input into a preset deep neural network for audio identification, and a scene label and an event label corresponding to the audio data are generated. Therefore, the invention can simultaneously carry out the dual recognition and classification tasks of scenes and events in the audio, and can improve the accuracy and the credibility of the audio recognition and classification.
In one embodiment, fig. 10 is a block diagram illustrating a structure of an audio recognition apparatus in one embodiment, and as shown in fig. 10, there is provided an audio recognition apparatus 1000, including:
an audio feature obtaining module 1020, configured to obtain an audio feature corresponding to the audio data;
the heterogeneous relationship characteristic obtaining module 1040 is configured to obtain a heterogeneous relationship characteristic from a preset heterogeneous relationship diagram, where the preset heterogeneous relationship diagram is used to represent a relationship between labels corresponding to audio data in a training set; the relationship between the labels comprises the relationship between the scene labels, the relationship between the event labels and the relationship between the scene labels and the event labels; the preset heterogeneous relation graph is generated by inputting the initial heterogeneous relation graph into a preset R-GCN relation graph convolutional neural network;
the audio recognition module 1060 is configured to input the audio features and the heterogeneous relationship features into a preset deep neural network for audio recognition, and generate a scene tag and an event tag corresponding to the audio data.
In one embodiment, as shown in fig. 11, the audio recognition module 1060 includes:
a fused heterogeneous relationship feature obtaining unit 1062, configured to splice the audio feature and the heterogeneous relationship feature to generate a fused heterogeneous relationship feature;
the target feature obtaining unit 1064 is configured to input the feature of the fused heterogeneous relationship into a preset deep neural network for convolution processing, so as to generate a target feature;
the scene and event classification unit 1066 is configured to generate a scene label and an event label corresponding to the audio data according to the target features.
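The three units above (splice, convolution processing, classification) can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: a single dense layer stands in for the embodiment's convolution processing, the scene head is a softmax over mutually exclusive scenes, and the event head uses independent sigmoids for multi-label events; all dimensions and parameter names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def classify(audio_feat, relation_feat, params):
    # 1) Splice (concatenate) the audio feature with the heterogeneous
    #    relation feature to form the fused heterogeneous relation feature.
    fused = np.concatenate([audio_feat, relation_feat], axis=-1)
    # 2) Processing to obtain the target feature (dense layer + ReLU here,
    #    standing in for the convolution processing of unit 1064).
    target = np.maximum(fused @ params["W"] + params["b"], 0.0)
    # 3) Two heads: scene classification and event detection.
    scene = softmax(target @ params["Ws"])
    event = 1.0 / (1.0 + np.exp(-(target @ params["We"])))
    return scene, event

d_audio, d_rel, d_hid, n_scene, n_event = 64, 32, 128, 10, 20
params = {
    "W": rng.standard_normal((d_audio + d_rel, d_hid)) * 0.05,
    "b": np.zeros(d_hid),
    "Ws": rng.standard_normal((d_hid, n_scene)) * 0.05,
    "We": rng.standard_normal((d_hid, n_event)) * 0.05,
}
scene_p, event_p = classify(rng.standard_normal(d_audio),
                            rng.standard_normal(d_rel), params)
print(scene_p.shape, event_p.shape)  # (10,) (20,)
```

The design choice of separate heads reflects the dual task: a clip belongs to one acoustic scene but may contain several concurrent events.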
In one embodiment, there is provided an audio recognition apparatus, further comprising:
the audio data training set acquisition module is configured to acquire an audio data training set and set a label for each piece of preset audio data in the training set; the label comprises a scene label and an event label;
the heterogeneous relation graph acquisition module is configured to construct an initial heterogeneous relation graph according to the labels of the preset audio data in the training set;
and the heterogeneous relation graph updating module is used for inputting the initial heterogeneous relation graph into the initial R-GCN relation graph convolution neural network to generate an intermediate heterogeneous relation graph.
In one embodiment, the heterogeneous relation graph acquisition module is further configured to construct an adjacency matrix according to the co-occurrence probabilities between the labels of the preset audio data in the training set; construct a relation category matrix according to the relation categories between the labels of the audio data in the training set; and construct an initial heterogeneous relation graph according to the adjacency matrix and the relation category matrix.
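The construction of the two matrices can be sketched as follows. The label names, the 0.2 probability threshold, and the category codes (1 = scene-scene, 2 = event-event, 3 = scene-event) are illustrative assumptions; the embodiment does not specify them:

```python
import numpy as np

# Toy annotated training set: each clip carries one scene label and one or
# more event labels (all names are hypothetical).
scenes = ["street", "park"]
events = ["car_horn", "birdsong", "footsteps"]
labels = scenes + events
clips = [
    {"street", "car_horn", "footsteps"},
    {"street", "car_horn"},
    {"park", "birdsong", "footsteps"},
    {"park", "birdsong"},
]

n = len(labels)
count = np.zeros(n)            # occurrences of each label
co = np.zeros((n, n))          # co-occurrences of each ordered label pair

for clip in clips:
    for i, li in enumerate(labels):
        if li in clip:
            count[i] += 1
            for j, lj in enumerate(labels):
                if i != j and lj in clip:
                    co[i, j] += 1

# Adjacency matrix: connect label i to label j when the conditional
# co-occurrence probability P(j | i) exceeds a threshold.
prob = co / np.maximum(count[:, None], 1)
adjacency = (prob > 0.2).astype(int)

# Relation category matrix: encodes which kind of edge connects each pair.
is_scene = np.array([l in scenes for l in labels])
category = np.where(is_scene[:, None] & is_scene[None, :], 1,
                    np.where(~is_scene[:, None] & ~is_scene[None, :], 2, 3))

print(adjacency[labels.index("street"), labels.index("car_horn")])  # 1
```

Together, `adjacency` (which labels are connected) and `category` (by which relation type) define the initial heterogeneous relation graph that the R-GCN consumes.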
In one embodiment, the heterogeneous relationship diagram updating module is further configured to obtain an initial heterogeneous relationship characteristic from the initial heterogeneous relationship diagram, and perform aggregation updating on the initial heterogeneous relationship characteristic through an initial R-GCN relationship diagram convolutional neural network to generate an intermediate heterogeneous relationship characteristic; and updating the initial heterogeneous relationship diagram based on the intermediate heterogeneous relationship characteristics to generate an intermediate heterogeneous relationship diagram.
In one embodiment, the heterogeneous relationship map updating module is further configured to input the initial heterogeneous relationship feature into the R-GCN layer for processing, and generate a processed initial heterogeneous relationship feature; and inputting the processed initial heterogeneous relation characteristics into an activation function for processing to generate intermediate heterogeneous relation characteristics.
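The R-GCN layer followed by an activation function can be sketched as below. This follows the standard R-GCN update (a relation-specific weight per edge category plus a self-connection, with degree normalization and a ReLU); the embodiment does not spell out these details, so the formula and dimensions here are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def rgcn_layer(h, adj, cat, weights, w_self, n_rel):
    """One R-GCN aggregation step followed by a ReLU activation.

    h       : (n, d) node (label) features
    adj     : (n, n) adjacency matrix
    cat     : (n, n) relation category of each edge (1 .. n_rel)
    weights : list of (d, d) matrices, one per relation category
    w_self  : (d, d) self-connection weight
    """
    out = h @ w_self                      # self-loop term
    for r in range(1, n_rel + 1):
        mask = (adj == 1) & (cat == r)    # edges of this relation type
        deg = np.maximum(mask.sum(axis=1, keepdims=True), 1)
        out += (mask / deg) @ (h @ weights[r - 1])  # normalized aggregation
    return np.maximum(out, 0.0)           # ReLU activation function

n, d, n_rel = 5, 16, 3
h = rng.standard_normal((n, d))
adj = (rng.random((n, n)) > 0.5).astype(int)
cat = rng.integers(1, n_rel + 1, size=(n, n))
weights = [rng.standard_normal((d, d)) * 0.1 for _ in range(n_rel)]
w_self = rng.standard_normal((d, d)) * 0.1
h_next = rgcn_layer(h, adj, cat, weights, w_self, n_rel)
print(h_next.shape)  # (5, 16)
```

The output `h_next` corresponds to the intermediate heterogeneous relation features used to update the heterogeneous relation graph.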
In one embodiment, there is provided an audio recognition apparatus, further comprising: the network training module is used for extracting audio features from each preset audio data in the training set and extracting intermediate heterogeneous relation features from the intermediate heterogeneous relation graph; inputting the audio features and the intermediate heterogeneous relation features of the preset audio data into an initial deep neural network, and generating a predicted scene label and a predicted event label of the preset audio data; calculating the value of a loss function according to a predicted scene label and a predicted event label of preset audio data and a labeled scene label and a labeled event label of the preset audio data; adjusting parameters of an initial R-GCN relation graph convolution neural network according to the value of the loss function to generate a preset R-GCN relation graph convolution neural network; and adjusting the parameters of the initial deep neural network according to the value of the loss function to generate a preset deep neural network.
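The loss calculation driving the parameter adjustment above can be sketched as follows. The embodiment does not specify the loss form; the sketch assumes cross-entropy for the single scene label, binary cross-entropy summed over event labels, and an equally weighted sum of the two:

```python
import numpy as np

def joint_loss(scene_pred, scene_true, event_pred, event_true):
    eps = 1e-12
    # Scene head: single-label cross-entropy against the labeled scene index.
    l_scene = -np.log(scene_pred[scene_true] + eps)
    # Event head: binary cross-entropy over the multi-label event vector.
    l_event = -np.sum(event_true * np.log(event_pred + eps)
                      + (1 - event_true) * np.log(1 - event_pred + eps))
    # One joint value adjusts both the R-GCN and the deep neural network;
    # equal weighting of the two terms is an assumption here.
    return l_scene + l_event

scene_pred = np.array([0.7, 0.2, 0.1])       # predicted scene distribution
event_pred = np.array([0.9, 0.1, 0.8, 0.2])  # per-event probabilities
loss = joint_loss(scene_pred, 0, event_pred, np.array([1, 0, 1, 0]))
print(round(float(loss), 3))  # 1.014
```

Because the relation features feed the deep neural network, the gradient of this single loss reaches both networks, which is what lets one loss function tune the R-GCN and the classifier jointly.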
The division of the modules in the audio recognition apparatus is only for illustration, and in other embodiments, the audio recognition apparatus may be divided into different modules as needed to complete all or part of the functions of the audio recognition apparatus.
In one embodiment, FIG. 12 is a diagram illustrating an internal structure of a computer device in one embodiment. As shown in fig. 12, the computer device may be a server, which includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The database of the computer device is used for storing audio data, the network interface of the computer device is used for communicating with an external terminal through network connection, and the internal memory provides an environment for running an operating system and a computer program in the nonvolatile storage medium. The computer program is executable by a processor for implementing an audio recognition method provided in the following embodiments.
The implementation of each module in the audio recognition apparatus provided in the embodiment of the present application may be in the form of a computer program. The computer program may be run on a computer device or a server. The program modules constituting the computer program may be stored on a memory of the computer device or the server. Which when executed by a processor, performs the steps of the method described in the embodiments of the present application.
The embodiment of the application also provides a computer readable storage medium. One or more non-transitory computer-readable storage media containing computer-executable instructions that, when executed by one or more processors, cause the processors to perform the steps of the audio recognition method.
A computer program product comprising instructions which, when run on a computer, cause the computer to perform an audio recognition method.
Any reference to memory, storage, a database, or another medium used by the embodiments of the present application may include non-volatile and/or volatile memory. Suitable non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any such combination should be considered within the scope of this specification as long as it contains no contradiction.
The above embodiments merely express several implementations of the present application, and while their description is relatively specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method for audio recognition, the method comprising:
acquiring audio features corresponding to the audio data;
acquiring heterogeneous relation features from a preset heterogeneous relation graph, wherein the preset heterogeneous relation graph is used for representing the relations between labels corresponding to the audio data in a training set; the relations among the labels comprise relations among scene labels, relations among event labels, and relations between scene labels and event labels; the preset heterogeneous relation graph is generated by inputting an initial heterogeneous relation graph into a preset R-GCN relation graph convolutional neural network; the initial heterogeneous relation graph is used for representing the relations between the labels corresponding to the audio data in the training set; the relations among the labels refer to the connection relations among the labels corresponding to the audio data; the initial heterogeneous relation graph is obtained by performing statistics on the labels of the audio data in the training set; and the preset R-GCN relation graph convolutional neural network is obtained by adjusting parameters of an R-GCN relation graph convolutional neural network;
splicing the audio features and the heterogeneous relation features to generate a fusion heterogeneous relation feature;
inputting the fusion heterogeneous relation features into a preset deep neural network for convolution processing to generate target features;
and generating a scene label and an event label corresponding to the audio data according to the target characteristics.
2. The audio identification method according to claim 1, wherein the obtaining of the audio features corresponding to the audio data comprises:
obtaining a log-mel spectrum of the audio data;
audio features of the audio data are determined from a log-mel spectrum of the audio data.
3. The audio recognition method of claim 1, further comprising:
acquiring the training set, and setting a label for each preset audio data in the training set; the label comprises a scene label and an event label;
constructing an initial heterogeneous relation graph according to the label of the preset audio data in the training set;
and inputting the initial heterogeneous relation graph into an initial R-GCN relation graph convolution neural network to generate an intermediate heterogeneous relation graph.
4. The audio identification method according to claim 3, wherein the constructing an initial heterogeneous relationship graph according to the label corresponding to the preset audio data in the training set includes:
constructing an adjacency matrix according to the co-occurrence probabilities among the labels of the preset audio data in the training set;
constructing a relation category matrix according to the relation categories among the labels of the audio data in the training set;
and constructing the initial heterogeneous relationship graph according to the adjacency matrix and the relationship type matrix.
5. The audio recognition method of claim 3, wherein the inputting the initial heterogeneous relationship graph into an initial R-GCN relationship graph convolutional neural network to generate an intermediate heterogeneous relationship graph comprises:
acquiring initial heterogeneous relation characteristics from the initial heterogeneous relation graph, and performing aggregation updating on the initial heterogeneous relation characteristics through an initial R-GCN relation graph convolutional neural network to generate intermediate heterogeneous relation characteristics;
and updating the initial heterogeneous relationship diagram based on the intermediate heterogeneous relationship characteristics to generate the intermediate heterogeneous relationship diagram.
6. The audio recognition method of claim 5, wherein the R-GCN graph convolutional neural network comprises an R-GCN layer and an activation function; the aggregating and updating the initial heterogeneous relationship features through an initial R-GCN relational graph convolutional neural network to generate intermediate heterogeneous relationship features, including:
inputting the initial heterogeneous relationship feature into the R-GCN layer for processing to generate a processed initial heterogeneous relationship feature;
and inputting the processed initial heterogeneous relationship characteristic into the activation function for processing to generate the intermediate heterogeneous relationship characteristic.
7. The audio recognition method of claim 3, further comprising:
extracting audio features from each preset audio data in the training set, and extracting the intermediate heterogeneous relation features from the intermediate heterogeneous relation graph;
inputting the audio features of the preset audio data and the intermediate heterogeneous relation features into an initial deep neural network, and generating a predicted scene label and a predicted event label of the preset audio data;
calculating the value of a loss function according to the predicted scene label and the predicted event label of the preset audio data and the labeled scene label and the labeled event label of the preset audio data;
adjusting parameters of the initial R-GCN relation graph convolution neural network according to the value of the loss function to generate a preset R-GCN relation graph convolution neural network;
and adjusting the parameters of the initial deep neural network according to the value of the loss function to generate the preset deep neural network.
8. An audio recognition apparatus, characterized in that the apparatus comprises:
the audio characteristic acquisition module is used for acquiring audio characteristics corresponding to the audio data;
the heterogeneous relation feature acquisition module is configured to acquire heterogeneous relation features from a preset heterogeneous relation graph, wherein the preset heterogeneous relation graph is used for representing the relations between labels corresponding to the audio data in a training set; the relations among the labels comprise relations among scene labels, relations among event labels, and relations between scene labels and event labels; the preset heterogeneous relation graph is generated by inputting an initial heterogeneous relation graph into a preset R-GCN relation graph convolutional neural network; the initial heterogeneous relation graph is used for representing the relations between the labels corresponding to the audio data in the training set; the relations among the labels refer to the connection relations among the labels corresponding to the audio data; the initial heterogeneous relation graph is obtained by performing statistics on the labels of the audio data in the training set; and the preset R-GCN relation graph convolutional neural network is obtained by adjusting parameters of an R-GCN relation graph convolutional neural network;
the audio identification module is used for splicing the audio features and the heterogeneous relation features to generate fusion heterogeneous relation features; inputting the fusion heterogeneous relation features into a preset deep neural network for convolution processing to generate target features; and generating a scene label and an event label corresponding to the audio data according to the target characteristics.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the audio recognition method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the audio recognition method of any one of claims 1 to 7.
CN202111156129.8A 2021-09-30 2021-09-30 Audio recognition method and device, computer equipment and computer-readable storage medium Active CN113593606B (en)


Publications (2)

Publication Number Publication Date
CN113593606A CN113593606A (en) 2021-11-02
CN113593606B true CN113593606B (en) 2022-02-15

