CN112309405A - Method and device for detecting multiple sound events, computer equipment and storage medium

Method and device for detecting multiple sound events, computer equipment and storage medium

Info

Publication number
CN112309405A
CN112309405A (application number CN202011186597.5A)
Authority
CN
China
Prior art keywords
matrix
weight
sound
feature
sound source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011186597.5A
Other languages
Chinese (zh)
Inventor
刘博卿
王健宗
张之勇
程宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202011186597.5A priority Critical patent/CN112309405A/en
Publication of CN112309405A publication Critical patent/CN112309405A/en
Priority to PCT/CN2021/083752 priority patent/WO2022001245A1/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 - Speaker identification or verification techniques
    • G10L 17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 17/04 - Training, enrolment or model building
    • G10L 17/18 - Artificial neural networks; Connectionist approaches
    • G10L 17/22 - Interactive procedures; Man-machine interfaces

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The scheme relates to blockchain technology, belongs to the technical field of speech recognition, and provides a method for detecting multiple sound events, comprising the steps of: extracting a sound source matrix from sound source data; inputting the sound source matrix into a trained feature extraction network to extract a feature matrix of the sound events; inputting the feature matrix into a trained weight-gated recurrent layer, and weighting the corresponding consequent vectors in the feature matrix according to the weight matrix of the weight-gated recurrent layer and the weights of the antecedent vectors in the feature matrix, to obtain a weighted feature matrix; inputting the weighted feature matrix into a fully connected layer, and acquiring a probability matrix through full connection, wherein the dimensionality of the probability matrix corresponds to the number of types of sound events; and determining the occurring target sound events according to the probability matrix. The sound source matrix may be stored in a blockchain. The application also provides a device, a computer device, and a storage medium for detecting multiple sound events. The method and the device can accurately detect multiple sound events at the same time.

Description

Method and device for detecting multiple sound events, computer equipment and storage medium
Technical Field
The present application relates to the field of speech recognition technology, and in particular, to a method and an apparatus for detecting multiple sound events, a computer device, and a storage medium.
Background
Sound event detection is applied in fields such as smart home speaker development and telephone customer service, where detected sound events can help relevant personnel identify and react to them in real time. Most existing event detection can only detect a single sound, for example only one of a baby crying or the alarm of a smoke detector, and more typically the detection of a wake-up word, such as 'Hi Sara' or 'xiaoaizhike'.
Existing sound event detection schemes require targeted training on specific sound spectra, and need a dedicated labeling team to carefully annotate, in large volume, the detailed time endpoints of sound events, which then serve as training material for those specific sound events. On this basis, when existing sound event monitoring schemes train models for multiple sound events simultaneously, the training effect of the models becomes worse, and the detection accuracy of the sound events suffers accordingly.
Disclosure of Invention
The embodiment of the application aims to accurately and simultaneously detect a plurality of sound events.
In order to solve the above technical problem, an embodiment of the present application provides a method for detecting multiple sound events, which adopts the following technical solutions:
a method for detecting multiple sound events, the method comprising,
extracting a sound source matrix from the sound source data;
inputting the sound source matrix into a trained feature extraction network to extract a feature matrix of a sound event;
inputting the feature matrix into a trained weight-gated recurrent layer, and weighting the corresponding consequent vectors in the feature matrix according to the weight matrix of the weight-gated recurrent layer and the weights of the antecedent vectors in the feature matrix, to obtain a weighted feature matrix;
inputting the weighted feature matrix into a fully connected layer, and acquiring a probability matrix through full connection, wherein the dimensionality of the probability matrix corresponds to the number of types of sound events;
and determining the occurring target sound event according to the probability matrix.
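As a minimal end-to-end sketch of these steps, assuming PyTorch and illustrative module names (feature_net, recurrent_layer, fc) and a detection threshold that are not part of the patent:

```python
import torch

def detect_sound_events(sound_source_matrix: torch.Tensor,
                        feature_net, recurrent_layer, fc,
                        threshold: float = 0.5) -> torch.Tensor:
    # sound_source_matrix: (frames, fbank_dim) matrix extracted from the audio
    features = feature_net(sound_source_matrix)    # feature matrix of the sound events
    weighted = recurrent_layer(features)           # weighted feature matrix
    probs = torch.softmax(fc(weighted), dim=-1)    # probability matrix, one dimension per event type
    # an event is deemed to occur if its peak probability exceeds the threshold (assumed)
    return probs.max(dim=0).values > threshold
```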
Further, the feature extraction network comprises convolutional layers under a gated linear activation function and max-pooling layers; at least one of each is provided, and the convolutional layers under the gated linear activation function and the max-pooling layers are arranged alternately in sequence;
the inputting the sound source matrix into a trained feature extraction network to extract a feature matrix specifically comprises:
inputting the sound source matrix into the convolutional layer under the gated linear activation function for a convolution operation, and applying gating to obtain an intermediate matrix;
and inputting the intermediate matrix output by the convolutional layer under the gated linear activation function into a max-pooling layer for dimensionality reduction, so as to output a feature matrix.
Further, the sound source matrix is input into the convolutional layer under the gated linear activation function for a convolution operation, and gating is applied to obtain an intermediate matrix, specifically through the following formula:
Y=(W*X+b)⊙sigmoid(V*X+c)
wherein Y is an intermediate matrix, W and V are convolution kernels with the same number and size respectively, b and c are offsets obtained through training, and X is the sound source matrix.
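As a minimal sketch, this gated linear convolution could be implemented as follows in PyTorch; the kernel size and channel count follow the embodiment described later (3×3 kernels, 128 channels), and all names are illustrative rather than from the patent:

```python
import torch
import torch.nn as nn

class GLUConv(nn.Module):
    """Convolution under a gated linear activation: Y = (W*X+b) ⊙ sigmoid(V*X+c)."""
    def __init__(self, in_channels: int, out_channels: int = 128, kernel_size: int = 3):
        super().__init__()
        # W and V: two convolutions with the same number and size of kernels
        self.content = nn.Conv2d(in_channels, out_channels, kernel_size, padding=kernel_size // 2)
        self.gate = nn.Conv2d(in_channels, out_channels, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # b and c are the trained biases held inside the two Conv2d layers
        return self.content(x) * torch.sigmoid(self.gate(x))
```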
Further, the weight matrix includes a first weight matrix and a second weight matrix. Inputting the feature matrix into a trained weight-gated recurrent layer and weighting the corresponding consequent vectors in the feature matrix according to the weight matrix of the weight-gated recurrent layer and the weights of the antecedent vectors in the feature matrix to obtain a weighted feature matrix specifically includes:
obtaining an activation value corresponding to each vector in the feature matrix through the following formula:
ht=g(Y*xt+ωUht-1+b)
wherein g is an activation function; ht is the activation value corresponding to xt, ht-1 is the activation value corresponding to xt-1, and xt is the vector corresponding to time t in the feature matrix; Y is the first weight matrix, U is the second weight matrix, and ω is the weight applied to xt-1 when the activation value of xt is calculated; b is an offset.
And splicing the activation values to obtain a weighted feature matrix.
Further, after the weighted feature matrix is fully connected to obtain a probability matrix, the method further includes:
and classifying the sound events corresponding to the probability matrix through a softmax function, wherein the number of the classifications is consistent with the dimension number in the probability matrix.
Further, the extracting a sound source matrix from the sound source data specifically includes:
dividing sound source data according to the frame length and the frame shift amount, and extracting an audio frame;
extracting the audio frame as an audio vector according to an FBANK format;
and splicing the audio vectors to obtain a sound source matrix.
A detection device for multiple sound events comprises,
a sound source extraction module for extracting a sound source matrix from the sound source data;
the feature extraction module is used for inputting the sound source matrix into a trained feature extraction network so as to extract a feature matrix of the sound events;
the weight weighting module is used for inputting the feature matrix into a trained weight-gated recurrent layer, and weighting the corresponding consequent vectors in the feature matrix according to the weight matrix of the weight-gated recurrent layer and the weights of the antecedent vectors in the feature matrix to obtain a weighted feature matrix;
the fully connected module is used for inputting the weighted feature matrix into a fully connected layer and acquiring a probability matrix through full connection, wherein the dimensionality of the probability matrix corresponds to the number of types of sound events;
and the determining module is used for determining the occurring target sound events according to the probability matrix.
Further, the weight matrix includes a first weight matrix and a second weight matrix, and the weight weighting module specifically includes:
the activation value determining submodule is used for acquiring the activation value corresponding to each vector in the feature matrix through the following formula:
ht=g(Y*xt+ωUht-1+b)
wherein g is an activation function; ht is the activation value corresponding to xt, ht-1 is the activation value corresponding to xt-1, and xt is the vector corresponding to time t in the feature matrix; Y is the first weight matrix, U is the second weight matrix, and ω is the weight applied to xt-1 when the activation value of xt is calculated; b is an offset.
And the characteristic weighting submodule is used for splicing the activation values to obtain a weighted characteristic matrix.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which adopts the following technical solutions:
a computer device comprising a memory having stored therein a computer program and a processor implementing the steps of the method of detecting a plurality of sound events as described above when executing the computer program.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, which adopts the following technical solutions:
a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method of detecting a plurality of sound events as set forth above.
Compared with the prior art, the embodiments of the application mainly have the following beneficial effects: feature extraction is performed on the extracted sound source to obtain a feature matrix comprising multiple vectors; in the weight-gated recurrent layer, the consequent vectors in the feature matrix are weighted according to the weights of the antecedent vectors, in cooperation with the trained weight matrix, so that the sound-event features in the antecedent vectors influence the consequent vectors; the influence of the hidden layer on the sound features during weighting is reduced, so that the sound-event features between frames form continuous feedback; both short-duration and long-duration sound features are effectively highlighted through weighting; and a probability matrix corresponding to the sound-event types is then obtained through full connection. The occurrence of each sound event is determined according to the probabilities, and the occurring target sound events are determined. The scheme can accurately detect multiple sound events simultaneously.
Drawings
In order to more clearly illustrate the solution of the present application, the drawings needed for describing the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
FIG. 1 is a flow diagram of one embodiment of a method for detection of multiple sound events according to the present application;
FIG. 2 is a flowchart of one embodiment of step S200 of FIG. 1;
FIG. 3 is a flowchart of one embodiment of step S300 of FIG. 1;
FIG. 4 is a flow diagram of one embodiment of a method for detection of multiple sound events according to the present application;
FIG. 5 is a flowchart of one embodiment of step S100 of FIG. 1;
FIG. 6 is a schematic block diagram of one embodiment of a multiple sound event detection apparatus according to the present application;
FIG. 7 is a block diagram of one embodiment of the feature extraction module shown in FIG. 6;
FIG. 8 is a block diagram illustrating one embodiment of the weight module shown in FIG. 6;
FIG. 9 is a schematic diagram of one embodiment of the sound source extraction module shown in FIG. 6;
FIG. 10 is a schematic block diagram of one embodiment of a computer device according to the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
Referring to FIG. 1, a flow diagram of one embodiment of a method for detecting multiple sound events according to the present application is shown. The method for detecting multiple sound events comprises the following steps:
in step S100, a sound source matrix is extracted from the sound source data.
Because multiple sound events are to be detected, the sound source is complex and still contains multiple sounds after filtering and denoising; data extraction is performed on it to obtain digitized sound source data. Digitized sound stores the information in the sound in matrix form: when the sound data is stored in a sound source matrix, each vector in the matrix stores the sound data of one audio frame, and the sound source matrix formed by splicing these vectors stores the whole sound data. An audio frame is a short period of audio and is the minimum unit of audio storage and computation in this scheme.
And step S200, inputting the sound source matrix into a trained feature extraction network to extract a feature matrix of the sound event.
In the feature extraction network, features can be extracted from the data through the deep learning capability of the neural network; the data is usually convolved layer by layer through a convolutional neural network with pre-trained convolution kernels, and data features are extracted through the channels formed by the convolution kernels.
Step S300, inputting the feature matrix into a trained weight-gated recurrent layer, and weighting the corresponding consequent vectors in the feature matrix according to the weight matrix of the weight-gated recurrent layer and the weights of the antecedent vectors in the feature matrix, to obtain the weighted feature matrix.
The feature matrix comprises multiple vectors, each corresponding to one audio frame, and the feature data of a segment of sound is stored through these vectors. The consequent vectors are weighted according to the pre-trained weight matrix and the weights provided by their antecedent vectors, and the feature matrix is updated. On the one hand, the feature matrix can thus be weighted according to the multiple detected sound events so as to highlight the features of those sound events; on the other hand, the vectors in the feature matrix are related through the weights provided by the antecedent vectors, and the sound features can be corrected according to factors such as the sounding lengths of different sounds, so that the influence of the serialization of the vectors in the feature matrix is weakened. This prevents a sound that has already stopped from still being reflected in the consequent vectors; conversely, if a sound event lasts longer, the weight provided by the antecedent vectors continues to be reflected in the consequent vectors.
In this embodiment, the feature matrix consists of n vectors x1, x2, x3, ..., xn in sequence, where each antecedent vector provides a weight Ot (t = [1, n]) to the vector of the following term. For example: x1 has the weight O1 and the weight matrix is Z; weighting x2 then requires combining the weight matrix with O1, where O1 is specifically a value related to x1. The weighting of the vector x2 in the feature matrix can be performed, for example, through the corresponding cross product O1 * x2 * Z, the element-wise addition O1 * Z + x2, the splicing O1 * Z ⊕ x2, and similar ways. Here the operator + denotes the addition of the corresponding elements of the vectors on both sides of the operator, and the operator ⊕ denotes splicing the vectors or matrices on both sides of the operator into one matrix. Obviously, the way in which the antecedent vectors provide weights to the consequent vectors is not limited to the above; any scheme in which the antecedent content is weighted so that the influence of an antecedent sound event is transmitted to the consequent term, and the consequent vector is weighted accordingly, belongs to a specific implementation of this scheme.
Step S400, inputting the weighted feature matrix into a fully connected layer, and acquiring a probability matrix through full connection, wherein the dimensionality of the probability matrix corresponds to the number of types of sound events.
The weighted feature matrix is fully connected; the fully connected layer is arranged after the weight-gated recurrent layer, and its purpose is to adjust the matrix dimensions to form a probability matrix whose dimensionality equals the number of sound-event types. If n sound events are to be detected, the number of neurons in the fully connected layer is n, so that an n-dimensional matrix is output, comprising n vectors reflecting the occurrence probability of the specific sound events. That is, the vector of each dimension in the probability matrix corresponds to the occurrence probability of one sound event.
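As a sketch under assumed shapes, the fully connected layer could look like this, with n_events and feat_dim standing in for values the patent does not specify:

```python
import torch
import torch.nn as nn

n_events = 8      # assumed number of sound-event types to detect
feat_dim = 128    # assumed width of the weighted feature matrix

fc = nn.Linear(feat_dim, n_events)              # n neurons for an n-dimensional output

weighted_features = torch.randn(431, feat_dim)  # (frames, features), illustrative
probability_matrix = fc(weighted_features)      # (frames, n_events): one dimension per event type
```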
And step S600, determining the target sound event according to the probability matrix.
Determining whether a sound event occurs in a piece of audio according to the probability corresponding to the sound event provided by the vector in the probability matrix, and determining a target sound event that occurs.
Feature extraction is performed on the extracted sound source to obtain a feature matrix comprising multiple vectors; in the weight-gated recurrent layer, the consequent vectors in the feature matrix are weighted according to the weights of the antecedent vectors, in cooperation with the trained weight matrix, so that the sound-event features in the antecedent vectors influence the consequent vectors; the influence of the hidden layer on the sound features during weighting is reduced, so that the sound-event features between frames form continuous feedback; both short-duration and long-duration sound features are effectively highlighted through weighting; and a probability matrix corresponding to the sound-event types is then obtained through full connection. The occurrence of each sound event is determined according to the probabilities, and the occurring target sound events are determined. The scheme can accurately detect multiple sound events simultaneously.
It should be emphasized that, in order to further ensure the privacy and security of the information used by the detection method, the sound source matrix, the probability matrix, and the feature matrix may also be stored in the nodes of a blockchain.
Further, the feature extraction network comprises convolutional layers under a gated linear activation function and max-pooling layers; at least one of each is provided, and the convolutional layers under the gated linear activation function and the max-pooling layers are arranged alternately in sequence;
the step S200 of inputting the sound source matrix into a trained feature extraction network to extract a feature matrix of a sound event includes:
step S201, inputting a sound source matrix into a convolution layer under the gating linear activation function for convolution operation, and applying gating to obtain an intermediate matrix;
specifically, the gated linear activation function used in this embodiment is:
Y=(W*X+b)⊙sigmoid(V*X+c)
w and V are convolution kernels with the same number and size, respectively, and more preferably, the convolution kernels used in the embodiment are 3 × 3 convolution kernels, 128 convolution kernels are provided to provide 128 channels for convolution, b and c are offsets obtained through training, X is a feature matrix, and the convolution result is activated under gating through a sigmoid activation function to obtain the feature matrix.
Applying gating makes the data of the feature matrix extracted from the sound source matrix smoother, and the features extracted in this way are more accurate and concentrated.
Step S202, inputting the intermediate matrix output by the convolutional layer under the gated linear activation function into the max-pooling layer for dimensionality reduction, so as to output the feature matrix.
Pooling prevents overfitting of the feature matrix and ensures the extraction precision of the feature matrix.
In this scheme, gating is applied to the convolution result so that the feature extraction of the feature matrix is accurate and concentrated, which improves the accuracy of feature extraction.
Further, the step S300 of inputting the feature matrix into a trained weight-gated recurrent layer and weighting the corresponding consequent vectors in the feature matrix according to the weight matrix of the weight-gated recurrent layer and the weights of the antecedent vectors in the feature matrix to obtain a weighted feature matrix specifically includes:
step S301: and acquiring an activation value corresponding to each vector in the feature matrix.
Specifically, in this embodiment, the calculation process of the hidden layer in the weight-gated recurrent layer is as follows:
ht=g(Y*xt+ωUht-1+b)
where g is an activation function, in this embodiment the hyperbolic tangent; ht is the activation value at time t, ht-1 is the activation value at time t-1, and xt is the vector corresponding to time t in the feature matrix; Y and U are the weight matrices corresponding to xt and xt-1 respectively; ω is the weight applied to xt-1 when the activation value of xt is calculated; and b is an offset. When t = 1 there is no antecedent vector, and weighting can be performed using only the trained weight matrices and parameters; when t > 1, the activation value of the antecedent term is combined with the trained weight matrices and parameters for weighting.
In this embodiment, the weighting process is divided into two stages when t > 1:
First weighting, i.e. Y*xt: each vector of the feature matrix is weighted by the first weight matrix to emphasize the sound-event-related features in the vector.
Second weighting, i.e. ωUht-1+b: the weight reflecting the occurrence and continuation of the sound event in the antecedent vector is applied to the current vector. This weight is embodied by the activation value of the antecedent term, the second weight matrix, and the weight ω applied to the antecedent vector. When an antecedent sound event continues backwards and influences the consequent vector, the weight is large; conversely, if the sound event in the antecedent term does not continue backwards or its influence on the consequent vector weakens, the weight is small. The first weight matrix and the second weight matrix are each determined by training. This double weighting can overcome the effect of vector serialization in a sequential network: when the duration of a sound event in an antecedent vector in the sequence is short, the weight applied to the consequent vector is small, and if the duration of a sound event in an antecedent vector is long, the weight applied to the consequent vector is large. The weighted feature matrix can therefore effectively represent both short-duration and long-duration sound events.
Step S302: and splicing the activation values to obtain a weighted feature matrix.
Specifically, the activation values are vectors corresponding in direction to the feature vectors, and their dimensionalities are the same; the activation values of equal dimensionality are sequentially arranged to obtain the weighted feature matrix.
In this embodiment, the activation value of an antecedent vector embodies the features it contains. If the correlation between the features of the antecedent vector and those of the consequent vector is high, the antecedent vector strongly influences the weighting of the consequent vector; a weight is then applied to the antecedent vector that reflects the duration of the sound event. If the sound event lasts longer, the applied weight is higher and the features of the corresponding sound event are maintained; conversely, if the sound event is shorter, the applied weight is lower, the features of the sound event are discarded in time, and the feature representation of the subsequent vectors is not affected.
Therefore, the scheme can highlight the characteristics of the sound event through the adjustment of the weight, and the accuracy of the probability matrix extraction is improved.
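A minimal sketch of this weight-gated recurrent layer, assuming PyTorch, tanh as the activation g, and illustrative shapes; in practice Y, U, ω, and b would be obtained by training:

```python
import torch
import torch.nn as nn

class WeightGatedRecurrentLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.Y = nn.Parameter(torch.randn(dim, dim) * 0.01)   # first weight matrix
        self.U = nn.Parameter(torch.randn(dim, dim) * 0.01)   # second weight matrix
        self.omega = nn.Parameter(torch.ones(1))              # ω: weight applied to the antecedent term
        self.b = nn.Parameter(torch.zeros(dim))               # offset

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (T, dim) feature matrix; returns the (T, dim) weighted feature matrix
        h_prev = feats.new_zeros(feats.shape[-1])
        activations = []
        for x_t in feats:
            # h_t = g(Y*x_t + ω*U*h_{t-1} + b), with g = tanh
            h_t = torch.tanh(self.Y @ x_t + self.omega * (self.U @ h_prev) + self.b)
            activations.append(h_t)
            h_prev = h_t
        return torch.stack(activations)   # splice the activation values
```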
Further, in the step S400, after the weighted feature matrices are fully connected to obtain the probability matrix, the method further includes:
step S500, sound event classification corresponding to the probability matrix is carried out through a softmax function, and the sound event classification corresponds to the vectors in the probability matrix.
The probability matrix is mapped and time-classified through the softmax function, the probability of each sound event is adjusted to be between 0 and 1, and when the sounding probability of the sound event is close to 1, the sound event is determined to possibly occur, so that whether the sound event occurs or not can be reflected more intuitively.
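Illustratively (the 0.9 cutoff below is an assumption standing in for "close to 1", and the shapes are invented):

```python
import torch

probability_matrix = torch.randn(431, 8)            # (frames, n_events), illustrative
scores = torch.softmax(probability_matrix, dim=-1)  # each probability now lies in (0, 1)
peak = scores.max(dim=0).values                     # peak probability per sound-event type
occurred = (peak > 0.9).nonzero().flatten()         # events whose probability is close to 1
print("target sound events:", occurred.tolist())
```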
Further, the step S100 of extracting a sound source matrix from the sound source data specifically includes:
step S101, dividing sound source data according to frame length and frame shift amount, and extracting audio frames;
the sound source data generally defines a time length, the sound source data is accessed according to the time length, and the judgment method of multiple sounds is to detect multiple sound events for the sound source data within the time length range.
Specific 10s is 10000ms, time range of the 1 st audio frame: 0-100ms, 2 nd frame 23ms-123ms, the consecutive frame is the audio frequency in the time range of [23 x (n-1) ms,23 x (n-1) +100ms ], and 23 x (n-1) +100< 10000ms, so that 10s sound source data can be solved n < 431.43, n is an integer, so n is 431, it can be seen that the audio data in two adjacent audio frames are partially overlapped, which is also beneficial to the following feature extraction and weighting process, the feature of the sound event can be continuous and is easy to capture.
In this application, the value of 10s is the minimum unit for analysis and determination in this embodiment, and if the obtained sound source data is very long, the sound source data may be segmented according to the minimum unit that can be analyzed to obtain multiple segments of sub data, and perform sound detection on each segment of sub data to determine what sound event occurs.
Step S102, extracting the audio frame as an audio vector according to the FBANK format;
FBANK is a feature extraction format for audio: for a segment of audio, the extracted features can be recorded and stored in the form of vectors, where each vector corresponds to the audio data of one time period. In this embodiment, the audio data of each audio frame is converted into a vector in FBANK format for storage, and the dimensionality of the audio vector corresponding to each audio frame is set to 64.
And step S103, splicing the audio vectors to obtain a sound source matrix.
The sound source matrix obtained by splicing the audio vectors is thus a 431×64 sound source matrix.
In this scheme, the data in the sound source data is segmented by time and extracted as vectors, which facilitates feature extraction and processing of the sound source data at the audio-frame level and ultimately reflects the probability of sound events in the whole audio. This scheme benefits the storage of audio data and improves the efficiency of sound detection.
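A sketch of this extraction pipeline, assuming librosa and 16 kHz audio; beyond the frame length (100 ms), frame shift (23 ms), and dimensionality (64) given above, all parameters are assumptions:

```python
import numpy as np
import librosa

def sound_source_matrix(path: str) -> np.ndarray:
    # load at most 10 s of audio, the minimum unit of analysis in this embodiment
    audio, sr = librosa.load(path, sr=16000, duration=10.0)
    frame_len = int(0.100 * sr)     # 100 ms frame length
    frame_shift = int(0.023 * sr)   # 23 ms frame shift
    # 64 mel filter banks per frame; log-mel is a common realization of FBANK
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=frame_len,
                                         hop_length=frame_shift, n_mels=64)
    return np.log(mel + 1e-8).T     # (frames, 64), approximately 431 x 64
```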
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the computer program is executed. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a Read-only Memory (ROM), or a Random Access Memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, the steps are not necessarily performed in order as indicated by the arrows. The steps are not performed in the exact order shown and may be performed in other orders unless explicitly stated herein. Moreover, at least a portion of the steps in the flow chart of the figure may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time, but may be performed at different times, which are not necessarily performed in sequence, but may be performed alternately or alternately with other steps or at least a portion of the sub-steps or stages of other steps.
With further reference to FIG. 6, as an implementation of the method shown in FIG. 1, the present application provides an embodiment of a device for detecting multiple sound events; the device embodiment corresponds to the method embodiment shown in FIG. 1, and the device can be applied to various electronic devices.
In order to solve the above technical problem, an apparatus for detecting multiple sound events provided in an embodiment of the present application adopts the following technical solutions:
a device for detecting a plurality of sound events, comprising,
a sound source extraction module 100, configured to extract a sound source matrix from the sound source data;
a feature extraction module 200, configured to input the sound source matrix into a trained feature extraction network to extract a feature matrix of the sound events;
a weight weighting module 300, configured to input the feature matrix into a trained weight-gated recurrent layer, and weight the corresponding consequent vectors in the feature matrix according to the weight matrix of the weight-gated recurrent layer and the weights of the antecedent vectors in the feature matrix, to obtain a weighted feature matrix;
a fully connected module 400, configured to input the weighted feature matrix into a fully connected layer and obtain a probability matrix through full connection, wherein the dimensionality of the probability matrix corresponds to the number of types of sound events;
a determining module 600, configured to determine the occurring target sound events according to the probability matrix.
Feature extraction is performed on the extracted sound source to obtain a feature matrix comprising multiple vectors; in the weight-gated recurrent layer, the consequent vectors in the feature matrix are weighted according to the weights of the antecedent vectors, in cooperation with the trained weight matrix, so that the sound-event features in the antecedent vectors influence the consequent vectors; the influence of the hidden layer on the sound features during weighting is reduced, so that the sound-event features between frames form continuous feedback; both short-duration and long-duration sound features are effectively highlighted through weighting; and a probability matrix corresponding to the sound-event types is then obtained through full connection. The occurrence of each sound event is determined according to the probabilities, and the scheme can accurately detect multiple sound events simultaneously.
Further, the feature extraction network comprises convolutional layers under a gated linear activation function and max-pooling layers; at least one of each is provided, and the convolutional layers under the gated linear activation function and the max-pooling layers are arranged alternately in sequence;
the feature extraction module 200 specifically includes:
a feature extraction sub-module 201, configured to input the sound source matrix into the convolutional layer under the gated linear activation function for a convolution operation and apply gating to obtain an intermediate matrix;
and a feature pooling sub-module 202, configured to input the intermediate matrix output by the convolutional layer under the gated linear activation function into the max-pooling layer for dimensionality reduction, so as to output the feature matrix.
Specifically, the gated linear activation function used in this embodiment is:
Y=(W*X+b)⊙sigmoid(V*X+c)
wherein W and V are convolution kernels; more preferably, the convolution kernels used in this embodiment are 3×3 kernels, of which 128 are provided so as to supply 128 channels for convolution; b and c are offsets obtained through training; X is the input sound source matrix; and the convolution result is activated under gating through a sigmoid activation function to obtain the intermediate matrix. In this scheme, gating is applied to the convolution result so that the feature extraction of the feature matrix is accurate and concentrated, which improves the accuracy of feature extraction.
Further, the weight weighting module 300 specifically includes:
the activation value determining submodule 301 is configured to obtain an activation value corresponding to each vector in the feature matrix.
A feature weighting submodule 302, configured to concatenate the activation values to obtain a weighted feature matrix.
Specifically, in this embodiment, the calculation process of the hidden layer of the weight-gated recurrent layer in the activation value determining sub-module 301 is as follows:
ht=g(Y*xt+ωUht-1+b)
where g is an activation function, in this embodiment the hyperbolic tangent; ht is the activation value at time t, ht-1 is the activation value at time t-1, and xt is the vector corresponding to time t in the feature matrix; Y and U are the weight matrices corresponding to xt and xt-1 respectively; ω is the weight applied to xt-1 when the activation value of xt is calculated; and b is an offset.
The scheme can highlight the characteristics of the sound event through the adjustment of the weight, and improves the accuracy of the probability matrix extraction.
Further, the device for detecting multiple sound events further includes a probability sorting module 500, which classifies the sound events corresponding to the probability matrix through a softmax function, wherein the sound-event classification corresponds to the vectors in the probability matrix.
The probability matrix is mapped and classified through the softmax function so that the probability of each sound event lies between 0 and 1, which helps reflect more intuitively whether a sound event occurs.
Further, the sound source extraction module 100 specifically includes:
and the audio frame extraction submodule 101 is configured to segment the sound source data according to the frame length and the frame shift amount, and extract an audio frame.
And the audio vector extraction submodule 102 is configured to extract the audio frame as an audio vector according to the FBANK format.
And the sound source matrix splicing submodule 103 is used for splicing the audio vectors to acquire a sound source matrix.
In this scheme, the data in the sound source data is segmented by time and extracted as vectors, which facilitates feature extraction and processing of the sound source data at the audio-frame level and ultimately reflects the probability of sound events in the whole audio. This scheme benefits the storage of audio data and improves the efficiency of sound detection.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device. Referring to FIG. 10, FIG. 10 is a block diagram of the basic structure of the computer device according to this embodiment.
The computer device 6 comprises a memory 61, a processor 62, and a network interface 63 communicatively connected to each other via a system bus. It is noted that only a computer device 6 having components 61-63 is shown, but it should be understood that not all of the shown components are required; more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory 61 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 61 may be an internal storage unit of the computer device 6, such as a hard disk or a memory of the computer device 6. In other embodiments, the memory 61 may also be an external storage device of the computer device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the computer device 6. Of course, the memory 61 may also comprise both an internal storage unit of the computer device 6 and an external storage device thereof. In this embodiment, the memory 61 is generally used for storing an operating system installed in the computer device 6 and various types of application software, such as program codes of detection methods of various sound events. Further, the memory 61 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 62 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 62 is typically used to control the overall operation of the computer device 6. In this embodiment, the processor 62 is configured to execute the program code stored in the memory 61 or process data, for example, execute the program code of the detection method of the plurality of sound events.
The network interface 63 may comprise a wireless network interface or a wired network interface, and the network interface 63 is typically used for establishing a communication connection between the computer device 6 and other electronic devices.
When executing the method for detecting multiple sound events, the computer device provided by this embodiment performs feature extraction on the extracted sound source to obtain a feature matrix comprising multiple vectors; in the weight-gated recurrent layer, the consequent vectors in the feature matrix are weighted according to the weights of the antecedent vectors, in cooperation with the trained weight matrix, so that the sound-event features in the antecedent vectors influence the consequent vectors; the influence of the hidden layer on the sound features during weighting is reduced, so that the sound-event features between frames form continuous feedback; both short-duration and long-duration sound features are effectively highlighted through weighting; and a probability matrix corresponding to the sound-event types is then obtained through full connection. The occurrence of each sound event is determined according to the probabilities, and the scheme can accurately detect multiple sound events simultaneously.
The present application further provides another embodiment, which is to provide a computer readable storage medium storing a program of a method for detecting multiple sound events, the program of the method for detecting multiple sound events being executable by at least one processor, so that the at least one processor performs the steps of the method for detecting multiple sound events as described above.
This embodiment provides a computer program recorded on a computer-readable storage medium which, when executing the method for detecting multiple sound events, performs feature extraction on the extracted sound source to obtain a feature matrix comprising multiple vectors; in the weight-gated recurrent layer, the consequent vectors in the feature matrix are weighted according to the weights of the antecedent vectors, in cooperation with the trained weight matrix, so that the sound-event features in the antecedent vectors influence the consequent vectors; the influence of the hidden layer on the sound features during weighting is reduced, so that the sound-event features between frames form continuous feedback; both short-duration and long-duration sound features are effectively highlighted through weighting; and a probability matrix corresponding to the sound-event types is then obtained through full connection. The occurrence of each sound event is determined according to the probabilities, and the scheme can accurately detect multiple sound events simultaneously.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
It is to be understood that the above-described embodiments are merely illustrative of some, but not restrictive, of the broad invention, and that the appended drawings illustrate preferred embodiments of the invention and do not limit the scope of the invention. This application is capable of embodiments in many different forms and is provided for the purpose of enabling a thorough understanding of the disclosure of the application. Although the present application has been described in detail with reference to the foregoing embodiments, it will be apparent to one skilled in the art that the present application may be practiced without modification or with equivalents of some of the features described in the foregoing embodiments. All equivalent structures made by using the contents of the specification and the drawings of the present application are directly or indirectly applied to other related technical fields and are within the protection scope of the present application.

Claims (10)

1. A method for detecting multiple sound events, the method comprising:
extracting a sound source matrix from the sound source data;
inputting the sound source matrix into a trained feature extraction network to extract a feature matrix of a sound event;
inputting the feature matrix into a trained weight-gated recurrent layer, and weighting the corresponding consequent vectors in the feature matrix according to the weight matrix of the weight-gated recurrent layer and the weights of the antecedent vectors in the feature matrix, to obtain a weighted feature matrix;
inputting the weighted feature matrix into a fully connected layer, and acquiring a probability matrix through full connection, wherein the dimensionality of the probability matrix corresponds to the number of types of sound events;
and determining the occurring target sound event according to the probability matrix.
2. The method of detecting multiple sound events according to claim 1,
the feature extraction network comprises convolutional layers under a gated linear activation function and max-pooling layers, wherein at least one of each is provided, and the convolutional layers under the gated linear activation function and the max-pooling layers are arranged alternately in sequence;
the inputting the sound source matrix into a trained feature extraction network to extract a feature matrix of a sound event specifically includes:
inputting the sound source matrix into the convolutional layer under the gated linear activation function for a convolution operation, and applying gating to obtain an intermediate matrix;
and inputting the intermediate matrix output by the convolutional layer under the gated linear activation function into a max-pooling layer for dimensionality reduction, so as to output a feature matrix.
3. The method according to claim 2, wherein the sound source matrix is input to the convolutional layer under the gated linear activation function for convolution operation, and gating is applied to obtain an intermediate matrix, specifically by the following formula:
Y=(W*X+b)⊙sigmoid(V*X+c)
wherein Y is an intermediate matrix, W and V are convolution kernels with the same number and size respectively, b and c are offsets obtained through training, and X is the sound source matrix.
4. The method according to claim 1, wherein the weight matrix comprises a first weight matrix and a second weight matrix, and inputting the feature matrix into a trained weight-gated recurrent layer and weighting the corresponding consequent vectors in the feature matrix according to the weight matrix of the weight-gated recurrent layer and the weights of the antecedent vectors in the feature matrix to obtain a weighted feature matrix specifically comprises:
obtaining an activation value corresponding to each vector in the feature matrix through the following formula:
ht=g(Y*xt+ωUht-1+b)
wherein g is an activation function; ht is the activation value corresponding to xt, ht-1 is the activation value corresponding to xt-1, and xt is the vector corresponding to time t in the feature matrix; Y is the first weight matrix, U is the second weight matrix, and ω is the weight applied to xt-1 when the activation value of xt is calculated; b is an offset.
And splicing the activation values to obtain a weighted feature matrix.
5. The method of claim 1, wherein after fully concatenating the weighted feature matrices to obtain a probability matrix, the method further comprises:
and classifying the sound events corresponding to the probability matrix through a softmax function, wherein the number of the classifications is consistent with the dimension number in the probability matrix.
6. The method for detecting multiple sound events according to claim 1, wherein the extracting a sound source matrix from sound source data specifically comprises:
dividing sound source data according to the frame length and the frame shift amount, and extracting an audio frame;
extracting the audio frame as an audio vector according to an FBANK format;
and splicing the audio vectors to obtain a sound source matrix.
7. A device for detecting multiple sound events, comprising,
a sound source extraction module for extracting a sound source matrix from the sound source data;
the feature extraction module is used for inputting the sound source matrix into a trained feature extraction network so as to extract a feature matrix of the sound events;
the weight weighting module is used for inputting the feature matrix into a trained weight-gated recurrent layer, and weighting the corresponding consequent vectors in the feature matrix according to the weight matrix of the weight-gated recurrent layer and the weights of the antecedent vectors in the feature matrix to obtain a weighted feature matrix;
the fully connected module is used for inputting the weighted feature matrix into a fully connected layer and acquiring a probability matrix through full connection, wherein the dimensionality of the probability matrix corresponds to the number of types of sound events;
and the determining module is used for determining the generated target sound event according to the probability matrix.
8. The apparatus for multiple sound event detection according to claim 7,
the weight matrix comprises a first weight matrix and a second weight matrix, and the weight weighting module specifically comprises:
the activation value determining submodule is used for acquiring the activation value corresponding to each vector in the feature matrix through the following formula:
ht=g(Y*xt+ωUht-1+b)
wherein g is an activation function; ht is the activation value corresponding to xt, ht-1 is the activation value corresponding to xt-1, and xt is the vector corresponding to time t in the feature matrix; Y is the first weight matrix, U is the second weight matrix, and ω is the weight applied to xt-1 when the activation value of xt is calculated; b is an offset.
And the characteristic weighting submodule is used for splicing the activation values to obtain a weighted characteristic matrix.
9. A computer device comprising a memory and a processor, the memory having stored therein a computer program, characterized in that the processor, when executing the computer program, carries out the steps of the method of detecting a plurality of sound events according to any one of claims 1 to 6.
10. A computer-readable storage medium, characterized in that a computer program is stored thereon, which computer program, when being executed by a processor, carries out the steps of the method of detecting a plurality of sound events according to any one of claims 1 to 6.
CN202011186597.5A 2020-10-29 2020-10-29 Method and device for detecting multiple sound events, computer equipment and storage medium Pending CN112309405A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011186597.5A CN112309405A (en) 2020-10-29 2020-10-29 Method and device for detecting multiple sound events, computer equipment and storage medium
PCT/CN2021/083752 WO2022001245A1 (en) 2020-10-29 2021-03-30 Method and apparatus for detecting plurality of types of sound events

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011186597.5A CN112309405A (en) 2020-10-29 2020-10-29 Method and device for detecting multiple sound events, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112309405A (en) 2021-02-02

Family

ID=74332357

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011186597.5A Pending CN112309405A (en) 2020-10-29 2020-10-29 Method and device for detecting multiple sound events, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112309405A (en)
WO (1) WO2022001245A1 (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106356052B * 2016-10-17 2019-03-15 腾讯科技(深圳)有限公司 Speech synthesis method and device
GB2577570A (en) * 2018-09-28 2020-04-01 Cirrus Logic Int Semiconductor Ltd Sound event detection
CN110751955B (en) * 2019-09-23 2022-03-01 山东大学 Sound event classification method and system based on time-frequency matrix dynamic selection
CN111723874B (en) * 2020-07-02 2023-05-26 华南理工大学 Sound field scene classification method based on width and depth neural network
CN112309405A (en) * 2020-10-29 2021-02-02 平安科技(深圳)有限公司 Method and device for detecting multiple sound events, computer equipment and storage medium

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107229967A (en) * 2016-08-22 2017-10-03 北京深鉴智能科技有限公司 Hardware accelerator and method for implementing sparsified GRU neural networks based on an FPGA
CN106248801A (en) * 2016-09-06 2016-12-21 哈尔滨工业大学 Rail crack detection method based on multiple acoustic emission event probabilities
US20180129931A1 (en) * 2016-11-04 2018-05-10 Salesforce.Com, Inc. Quasi-recurrent neural network based encoder-decoder model
CN109923559A (en) * 2016-11-04 2019-06-21 易享信息技术有限公司 Quasi-recurrent neural network
CN108648748A (en) * 2018-03-30 2018-10-12 沈阳工业大学 Acoustic event detection method in hospital noise environments
CN109559735A (en) * 2018-10-11 2019-04-02 平安科技(深圳)有限公司 Neural-network-based speech recognition method, terminal device and medium
CN110263304A (en) * 2018-11-29 2019-09-20 腾讯科技(深圳)有限公司 Sentence encoding method, sentence decoding method, apparatus, storage medium and device
CN110097089A (en) * 2019-04-05 2019-08-06 华南理工大学 Document-level sentiment classification method based on an attention combination neural network
CN110176248A (en) * 2019-05-23 2019-08-27 广西交通科学研究院有限公司 Road sound recognition method, system, computer device and readable storage medium
CN111046172A (en) * 2019-10-30 2020-04-21 北京奇艺世纪科技有限公司 Public opinion analysis method, device, equipment and storage medium
CN110992979A (en) * 2019-11-29 2020-04-10 北京搜狗科技发展有限公司 Detection method and device, and electronic equipment

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022001245A1 (en) * 2020-10-29 2022-01-06 平安科技(深圳)有限公司 Method and apparatus for detecting plurality of types of sound events

Also Published As

Publication number Publication date
WO2022001245A1 (en) 2022-01-06

Similar Documents

Publication Publication Date Title
CN112966697B (en) Target detection method, device and equipment based on scene semantics and storage medium
CN109087667B (en) Voice fluency recognition method and device, computer equipment and readable storage medium
WO2019136909A1 (en) Voice living-body detection method based on deep learning, server and storage medium
CN111091839B (en) Voice awakening method and device, storage medium and intelligent device
CN109616097A (en) Voice data processing method, device, equipment and storage medium
CN113298152B (en) Model training method, device, terminal equipment and computer readable storage medium
CN113205820B (en) Method for generating voice coder for voice event detection
CN111126049B (en) Object relation prediction method, device, terminal equipment and readable storage medium
CN111951008A (en) Risk prediction method and device, electronic equipment and readable storage medium
CN113160823B (en) Voice awakening method and device based on impulse neural network and electronic equipment
CN112309405A (en) Method and device for detecting multiple sound events, computer equipment and storage medium
CN114360182B (en) Intelligent alarm method, device, equipment and storage medium
CN112417886B (en) Method, device, computer equipment and storage medium for extracting intention entity information
CN114238656A (en) Reinforced learning-based affair atlas completion method and related equipment thereof
CN116913325A (en) Noise event detection method and device
CN116542783A (en) Risk assessment method, device, equipment and storage medium based on artificial intelligence
CN115774784A (en) Text object identification method and device
CN115240647A (en) Sound event detection method and device, electronic equipment and storage medium
CN110837596B (en) Intelligent recommendation method and device, computer equipment and storage medium
CN111243609A (en) Method and device for intelligently detecting effective voice and computer readable storage medium
CN116580063B (en) Target tracking method, target tracking device, electronic equipment and storage medium
CN112949317B (en) Text semantic recognition method and device, computer equipment and storage medium
CN113420628B (en) Group behavior identification method and device, computer equipment and storage medium
CN115701866B (en) E-commerce platform risk identification model training method and device
CN115238805B (en) Training method of abnormal data recognition model and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination