WO2022001245A1 - Method and apparatus for detecting multiple sound events, computer device, and storage medium - Google Patents

Method and apparatus for detecting multiple sound events, computer device, and storage medium (多种声音事件的检测方法、装置、计算机设备及存储介质)

Info

Publication number
WO2022001245A1
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
weight
feature
sound source
sound
Prior art date
Application number
PCT/CN2021/083752
Other languages
English (en)
French (fr)
Inventor
刘博卿
王健宗
张之勇
程宁
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2022001245A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 17/04 Training, enrolment or model building
    • G10L 17/18 Artificial neural networks; Connectionist approaches
    • G10L 17/22 Interactive procedures; Man-machine interfaces

Definitions

  • the present application relates to the technical field of sound recognition, and in particular to a method, apparatus, computer device, and storage medium for detecting multiple sound events.
  • sound event detection is used in applications such as home smart speakers and telephone customer service, where detected sound events can immediately help the relevant personnel identify and respond to them.
  • the inventor has realized that most existing event detection can only detect a single sound; for example, it can identify only one of a baby's crying or the alarm of a smoke detector, and a more typical case is the detection of a wake-up word, such as "Hi Sara" or "Xiao Ai".
  • the purpose of the embodiments of the present application is to accurately and simultaneously detect multiple sound events.
  • the embodiments of the present application provide a method for detecting multiple sound events, which adopts the following technical solutions:
  • a method for detecting multiple sound events, comprising:
  • extracting a sound source matrix from sound source data;
  • inputting the sound source matrix into a trained feature extraction network to extract a feature matrix of sound events;
  • inputting the feature matrix into a trained weight-gated recurrent layer, and weighting the corresponding succeeding vectors in the feature matrix according to the weight matrix of the weight-gated recurrent layer and the weights of the preceding vectors, to obtain a weighted feature matrix;
  • inputting the weighted feature matrix into a fully connected layer to obtain a probability matrix through full connection, wherein the dimension of the probability matrix corresponds to the number of types of sound events;
  • determining, according to the probability matrix, the target sound event that occurred.
  • the embodiment of the present application also provides a detection device for multiple sound events, including:
  • a sound source extraction module for extracting the sound source matrix from the sound source data;
  • a feature extraction module for inputting the sound source matrix into a trained feature extraction network to extract the feature matrix of sound events;
  • a weighting module for inputting the feature matrix into the trained weight-gated recurrent layer and weighting the corresponding succeeding vectors in the feature matrix according to the weight matrix of the weight-gated recurrent layer and the weights of the preceding vectors, to obtain the weighted feature matrix;
  • the fully connected module is used to input the weighted feature matrix into the fully connected layer, and obtain a probability matrix through the full connection, wherein the dimension of the probability matrix corresponds to the number of types of sound events;
  • the determining module is configured to determine the target sound event that occurs according to the probability matrix.
  • the embodiment of the present application also provides a computer device, which adopts the following technical solutions:
  • a computer device includes a memory and a processor, wherein computer-readable instructions are stored in the memory, and the processor, when executing the computer-readable instructions, further implements the following steps:
  • extracting a sound source matrix from sound source data;
  • inputting the sound source matrix into a trained feature extraction network to extract a feature matrix of sound events;
  • inputting the feature matrix into a trained weight-gated recurrent layer, and weighting the corresponding succeeding vectors according to the weight matrix of the weight-gated recurrent layer and the weights of the preceding vectors, to obtain a weighted feature matrix;
  • inputting the weighted feature matrix into a fully connected layer to obtain a probability matrix through full connection, wherein the dimension of the probability matrix corresponds to the number of types of sound events;
  • determining, according to the probability matrix, the target sound event that occurred.
  • the embodiments of the present application also provide a computer-readable storage medium, which adopts the following technical solutions:
  • a computer-readable storage medium on which computer-readable instructions are stored, the computer-readable instructions, when executed by a processor, causing the processor to further perform the following steps:
  • extracting a sound source matrix from sound source data;
  • inputting the sound source matrix into a trained feature extraction network to extract a feature matrix of sound events;
  • inputting the feature matrix into a trained weight-gated recurrent layer, and weighting the corresponding succeeding vectors according to the weight matrix of the weight-gated recurrent layer and the weights of the preceding vectors, to obtain a weighted feature matrix;
  • inputting the weighted feature matrix into a fully connected layer to obtain a probability matrix through full connection, wherein the dimension of the probability matrix corresponds to the number of types of sound events;
  • determining, according to the probability matrix, the target sound event that occurred.
  • the embodiments of the present application mainly have the following beneficial effects: feature extraction is performed on the extracted sound source to obtain a feature matrix including several vectors; then, in the weight-gated recurrent layer, the succeeding vector in the feature matrix is weighted according to the weight of the preceding vector together with the trained weight matrix, so that the features of the sound events in the preceding vector influence the succeeding vector, reducing the influence of the hidden layer on the sound features during weighting.
  • this enables the features of sound events between frames to form continuous feedback; both short-duration and long-duration sound features are effectively highlighted by the weighting, and the connection then yields a probability matrix corresponding to the types of sound events. The occurrence of each sound event is determined according to the probabilities, and the target sound event that occurred is determined.
  • the scheme can accurately detect multiple sound events simultaneously.
  • FIG. 1 is a flowchart of an embodiment of a method for detecting multiple sound events according to the present application;
  • FIG. 2 is a flowchart of a specific implementation of step S200 in FIG. 1;
  • FIG. 3 is a flowchart of a specific implementation of step S300 in FIG. 1;
  • FIG. 4 is a flowchart of an embodiment of a method for detecting multiple sound events according to the present application;
  • FIG. 5 is a flowchart of a specific implementation of step S100 in FIG. 1;
  • FIG. 6 is a schematic structural diagram of an embodiment of an apparatus for detecting multiple sound events according to the present application;
  • FIG. 7 is a schematic structural diagram of a specific implementation of the feature extraction module shown in FIG. 6;
  • FIG. 8 is a schematic structural diagram of a specific implementation of the weighting module shown in FIG. 6;
  • FIG. 9 is a schematic structural diagram of a specific implementation of the sound source extraction module shown in FIG. 6;
  • FIG. 10 is a schematic structural diagram of an embodiment of a computer device according to the present application.
  • referring to FIG. 1, a flowchart of one embodiment of a method for detecting multiple sound events according to the present application is shown.
  • the method for detecting multiple sound events includes the following steps:
  • Step S100 extracting a sound source matrix from the sound source data.
  • Data extraction is performed on the sound source to obtain digitized sound source data.
  • since the detection targets multiple sound events, the sound source is complex, and the sound source data still contains multiple sounds even after filtering and noise reduction.
  • the digitized sound stores the information of the sound in the form of a matrix: each vector in the matrix holds the sound data of one audio frame, and the sound source matrix formed by concatenating these vectors stores the entire piece of sound data.
  • an audio frame refers to a period of audio and is the smallest unit of audio storage and operation in this solution.
  • Step S200 the sound source matrix is input into the trained feature extraction network to extract the feature matrix of the sound event.
  • the feature extraction network uses the deep learning capability of a neural network to extract features from the data.
  • commonly, a convolutional neural network convolves the data layer by layer with pre-trained convolution kernels, and the channels formed by the kernels extract the data features.
  • Step S300: the feature matrix is input into the trained weight-gated recurrent layer, and the corresponding succeeding vectors in the feature matrix are weighted according to the weight matrix of the weight-gated recurrent layer and the weights of the preceding vectors, to obtain the weighted feature matrix.
  • the feature matrix includes several vectors; each vector corresponds to one audio frame and holds the feature data of a piece of sound. Each succeeding vector is weighted according to the pre-trained weight matrix and the weight provided by its preceding vector, and the feature matrix is updated, so that the feature matrix can be weighted according to the multiple sound events to be detected and the features of the sound events are highlighted.
  • through the weights provided by the preceding vectors, the vectors in the feature matrix become related to each other; in this way the sound features can be corrected according to factors such as the utterance length of different sounds, which weakens the serialization effect of the vectors in the feature matrix. This prevents a sound that has already stopped from still appearing in later vectors, while a sound that lasts longer can continue to be reflected in later vectors through the weights provided by its preceding vectors.
  • in this embodiment, the feature matrix consists of n vectors x_1, x_2, x_3, ..., x_n in order, where each preceding vector provides a weight O_t (t = 1, ..., n) to its succeeding vector. For example: the weight of x_1 is O_1 and the weight matrix is Z; weighting x_2 combines the weight matrix with O_1, where O_1 is a value associated with x_1, for instance through the vector product O_1*x_2*Z, the element-wise sum O_1*Z + x_2, or a concatenation of O_1*Z and x_2.
  • here, the operator + adds the corresponding elements of the vectors on both sides of the operator, and the concatenation operator joins the vectors or matrices on both sides into one matrix.
  • the way the preceding vector provides a weight to the succeeding vector is not limited to the above schemes; any method that weights the succeeding content with the preceding content, so as to pass the influence of the preceding sound event on to the succeeding vector, falls within the specific implementations of this solution.
  • Step S400 the weighted feature matrix is input into the full connection layer, and a probability matrix is obtained through full connection, wherein the dimension of the probability matrix corresponds to the number of types of sound events.
  • the weighted feature matrix is fully connected; the fully connected layer is placed after the weight-gated recurrent layer.
  • its purpose is to adjust the matrix dimension to form a probability matrix whose dimension equals the number of sound event types.
  • the vector of each dimension in the obtained probability matrix corresponds to the probability of occurrence of one sound event.
  • the number of neurons in the fully connected layer equals the number of sound events to be detected: if n sound events are detected, the fully connected layer has n neurons, so an n-dimensional matrix is output, consisting of n vectors that reflect the probabilities of specific sound events. That is, the vector of each dimension in the probability matrix corresponds to the probability that one sound event has occurred.
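  • to make the shape bookkeeping concrete, here is a minimal sketch of the fully connected step (PyTorch; the event count of 5 and the 431x64 shapes follow the embodiment described below, and everything else is an illustrative assumption rather than the patented implementation):

    import torch
    import torch.nn as nn

    n_events = 5           # assumed number of sound event types to detect
    feat_dim = 64          # assumed dimension of each weighted feature vector

    # One neuron per sound event type: the output dimension equals the
    # number of event types, so each row of the output holds one frame's
    # per-event scores.
    fc = nn.Linear(feat_dim, n_events)

    weighted_features = torch.randn(431, feat_dim)   # weighted feature matrix
    probability_matrix = fc(weighted_features)       # shape (431, n_events)
    print(probability_matrix.shape)                  # torch.Size([431, 5])
    # Step S500 below then maps these scores into the range [0, 1].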
  • Step S600 according to the probability matrix, determine the occurring target sound event.
  • according to the probability corresponding to each sound event provided by the vectors in the probability matrix, whether the sound event occurred in the piece of audio is determined, and the target sound event that occurred is determined.
  • the present application performs feature extraction on the extracted sound source to obtain a feature matrix including several vectors.
  • then, in the weight-gated recurrent layer, the succeeding vector in the feature matrix is weighted according to the weight of the preceding vector together with the trained weight matrix, so that the features of the sound events in the preceding vector influence the succeeding vector, reducing the influence of the hidden layer on the sound features during weighting.
  • the features of sound events between frames can thus form continuous feedback; both short-duration and long-duration sound features are effectively highlighted by the weighting, and the connection then yields a probability matrix corresponding to the types of sound events. The occurrence of each sound event is determined according to the probabilities, and the target sound event that occurred is determined.
  • the scheme can accurately detect multiple sound events simultaneously.
  • the above-mentioned sound source matrix, probability matrix, and feature matrix information can also be stored in a node of a blockchain.
  • the feature extraction network includes convolution layers under a gated linear activation function and max pooling layers; at least one group of each is provided, and the convolution layers under the gated linear activation function and the max pooling layers are arranged alternately in sequence;
  • the sound source matrix is input into the trained feature extraction network to extract the feature matrix of the sound event, specifically including:
  • Step S201 input the sound source matrix into the convolution layer under the gated linear activation function to perform a convolution operation, and apply gating to obtain an intermediate matrix;
  • the gated linear activation function used in this embodiment is:
  • Y = (W*X + b) ☉ sigmoid(V*X + c)
  • where W and V are convolution kernels of the same number and size. More preferably, the kernels used in this embodiment are 3×3 kernels, and 128 such kernels are provided so that convolution is performed over 128 channels; b and c are offsets obtained through training, and X is the input (sound source) matrix. The convolution result is gated through the sigmoid activation function to obtain the feature matrix.
  • applying gating makes the data of the feature matrix extracted from the sound source matrix smoother, so the features extracted in this way are more accurate and concentrated.
  • Step S202 the intermediate matrix output by the convolution layer under the gated linear activation function is input into the maximum pooling layer for dimension reduction to output a feature matrix.
  • the feature matrix is prevented from overfitting by pooling, and the extraction accuracy of the feature matrix is guaranteed.
  • the scheme implements gating on the convolution results, so that the feature extraction of the feature matrix is accurate and concentrated, and the accuracy of feature extraction is improved.
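  • as an illustration only, a minimal PyTorch sketch of one such group, a convolution layer under a gated linear activation followed by max pooling, might look as follows (the 2x2 pooling size and input shape are assumptions; the embodiment fixes only the 128 kernels of size 3x3):

    import torch
    import torch.nn as nn

    class GatedConvBlock(nn.Module):
        # One convolution layer under a gated linear activation function,
        # Y = (W*X + b) * sigmoid(V*X + c), followed by max pooling.
        # The 2x2 pooling size is an assumption; the patent states only
        # that pooling performs dimension reduction.
        def __init__(self, in_ch: int, out_ch: int = 128):
            super().__init__()
            self.conv_w = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)  # W*X + b
            self.conv_v = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)  # V*X + c
            self.pool = nn.MaxPool2d(kernel_size=2)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            gated = self.conv_w(x) * torch.sigmoid(self.conv_v(x))  # intermediate matrix
            return self.pool(gated)                                 # dimension reduction

    # Example: a 10 s clip as a one-channel 431x64 input (frames x FBANK bins).
    x = torch.randn(1, 1, 431, 64)
    print(GatedConvBlock(in_ch=1)(x).shape)  # torch.Size([1, 128, 215, 32])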
  • the feature matrix is input into the trained weight-gated recurrent layer, and the corresponding succeeding vectors in the feature matrix are weighted according to the weight matrix of the weight-gated recurrent layer and the weights of the preceding vectors, to obtain the weighted feature matrix; this specifically includes:
  • Step S301 Obtain the activation value corresponding to each vector in the feature matrix.
  • the calculation process of the hidden layer in the weight-gated recurrent layer is:
  • h_t = g(Y*x_t + ω*U*h_{t-1} + b)
  • where g is the activation function (the hyperbolic tangent in this embodiment); h_t is the activation value at time t, h_{t-1} is the activation value at time t-1, and x_t is the vector of the feature matrix at time t; Y and U are the weight matrices corresponding to x_t and x_{t-1}, respectively; ω is the weight applied to x_{t-1} when computing the activation value of x_t; and b is the offset.
  • the weighting process when t>1 is divided into two stages:
  • the first weighting, Y*x_t, weights a vector of the feature matrix by the first weight matrix to highlight the sound-event-related features in the vector.
  • the second weighting, ω*U*h_{t-1} + b, weights the vector according to how strongly the sound event of the preceding vector occurs and continues; it is expressed through the activation value of the preceding term, the second weight matrix, and the weight ω attached to the preceding vector.
  • this weight is large when the preceding sound event continues backward and affects the succeeding vector; conversely, if the sound event in the preceding term does not continue, or its influence on the succeeding vector weakens, the weight is small.
  • the first weight matrix and the second weight matrix are determined separately through training.
  • the two-stage weighting can overcome the serialization effect of vectors in sequential networks: when the sound event in the preceding vector lasts only briefly, the weighting applied to the succeeding vector is small; when it lasts longer, the weighting applied to the succeeding vector is larger.
  • the feature matrix weighted in this way can effectively display both short-duration and long-duration sound events.
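  • the recurrence can be sketched directly from the formula above. In this illustrative PyTorch version (not the patented implementation), Y, U, and b are trained parameters, and treating the weight ω attached to the preceding vector as a single learned scalar is an assumption:

    import torch
    import torch.nn as nn

    class WeightGatedRecurrentLayer(nn.Module):
        # Sketch of h_t = g(Y*x_t + w*U*h_{t-1} + b) with g = tanh.
        def __init__(self, dim: int):
            super().__init__()
            self.Y = nn.Parameter(torch.randn(dim, dim) * 0.1)  # first weight matrix
            self.U = nn.Parameter(torch.randn(dim, dim) * 0.1)  # second weight matrix
            self.b = nn.Parameter(torch.zeros(dim))
            self.w = nn.Parameter(torch.tensor(1.0))            # weight on the preceding term

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            # feats: (T, dim) feature matrix; returns the activation values (T, dim).
            outputs = []
            h = None
            for t in range(feats.shape[0]):
                if t == 0:
                    # No preceding vector: weight with the trained parameters only.
                    h = torch.tanh(self.Y @ feats[t] + self.b)
                else:
                    h = torch.tanh(self.Y @ feats[t] + self.w * (self.U @ h) + self.b)
                outputs.append(h)
            # Concatenating the activation values gives the weighted feature matrix.
            return torch.stack(outputs)

    layer = WeightGatedRecurrentLayer(dim=64)
    print(layer(torch.randn(431, 64)).shape)  # torch.Size([431, 64])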
  • Step S302 splicing the activation values to obtain a weighted feature matrix.
  • each activation value is a vector corresponding to a feature vector; all activation values have the same dimension, and the activation values of the same dimension are arranged in order to obtain the weighted feature matrix.
  • the activation value of the preceding vector reflects the features contained in that vector: if the feature correlation between the preceding and succeeding vectors is high, the preceding vector strongly influences the weighting of the succeeding vector.
  • the weight applied to the preceding vector reflects the duration of the sound event: if the sound event lasts long, the applied weight is higher and the features of the corresponding sound event are maintained; conversely, if the sound event lasts only briefly, the applied weight is lower and the features of the sound event are dropped in time without affecting the feature display of subsequent vectors.
  • this scheme can highlight the characteristics of sound events through weight adjustment, and improve the accuracy of probability matrix extraction.
  • the method further includes:
  • Step S500 classify the sound events corresponding to the probability matrix by the softmax function, and the sound event classification corresponds to the vector in the probability matrix.
  • the softmax function maps and classifies the probability matrix so that the occurrence probability of each sound event is arranged between 0 and 1.
  • when the probability of a sound event is close to 1, it is determined that the sound event is very likely to have occurred, which helps reflect the occurrence of sound events more intuitively.
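  • a one-line sketch of this mapping (taking the softmax over the event dimension of the probability matrix is an assumption about how the embodiment arranges the probabilities):

    import torch

    scores = torch.randn(431, 5)            # probability matrix from the fully connected layer
    probs = torch.softmax(scores, dim=-1)   # every entry now lies between 0 and 1
    print(float(probs.min()) >= 0.0, float(probs.max()) <= 1.0)  # True True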
  • the sound source matrix is extracted from the sound source data, which specifically includes:
  • Step S101 segment sound source data according to frame length and frame shift, and extract audio frames
  • sound source data generally has a limited duration, and the data is taken in by duration; the detection of multiple sounds is performed on the sound source data within this time range.
  • in this embodiment, the duration of each piece of sound source data is 10 s.
  • audio frames are cut out with a frame length of 100 ms and a frame shift of 23 ms, so a total of 431 audio frames can be cut out: the n-th frame covers [23*(n-1) ms, 23*(n-1)+100 ms], and 23*(n-1)+100 <= 10000 gives n <= 431.43, hence n = 431. Adjacent audio frames therefore partially overlap, which helps the features of sound events stay continuous and easy to capture in the subsequent feature extraction and weighting.
  • the 10 s value is the smallest unit of analysis and judgment in this embodiment. If the acquired sound source data is very long, it can be cut into multiple sub-segments according to this smallest analyzable unit, and sound detection is performed on each sub-segment to determine which sound events occurred. Of course, in other embodiments, the duration of each piece of sound source data can also be adjusted according to actual needs.
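  • the frame count can be checked with a few lines of arithmetic (plain Python, mirroring the derivation above):

    # A 10 s (10000 ms) clip, 100 ms frames, 23 ms frame shift:
    # frame n covers [23*(n-1) ms, 23*(n-1) + 100 ms].
    clip_ms, frame_ms, shift_ms = 10_000, 100, 23

    n_frames = (clip_ms - frame_ms) // shift_ms + 1
    print(n_frames)  # 431

    last_start = shift_ms * (n_frames - 1)   # 9890 ms
    assert last_start + frame_ms <= clip_ms  # last frame ends at 9990 ms, inside the clip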
  • Step S102: each audio frame is extracted as an audio vector in the FBANK format.
  • FBANK is a feature extraction format for audio.
  • after feature extraction, the audio can be recorded and stored in the form of vectors.
  • each vector corresponds to the audio data in one time period.
  • in this embodiment, the audio data in each audio frame is stored as one vector generated in the FBANK format.
  • the audio vector corresponding to each audio frame has 64 dimensions.
  • Step S103 splicing the audio vectors to obtain a sound source matrix.
  • the sound source matrix obtained by audio vector splicing is a 431*64 sound source matrix.
  • this scheme divides the sound source data by time and extracts it as vectors, which facilitates subsequent feature extraction and processing of the sound source data in the dimension of audio frames, and finally reflects the probability of occurrence of sound events in the entire piece of audio.
  • the storage of audio data in this solution is beneficial to improve the detection efficiency of sound detection.
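  • as a sketch of this front end, librosa's log-mel spectrogram can stand in for the FBANK extractor (the 16 kHz sample rate and the librosa choice are assumptions; only the 100 ms/23 ms framing and the 64 dimensions come from the embodiment):

    import numpy as np
    import librosa

    def sound_source_matrix(path: str, sr: int = 16000) -> np.ndarray:
        # Slice a 10 s clip into 100 ms frames with a 23 ms shift and take
        # 64 log-mel filterbank (FBANK-style) energies per frame.
        y, _ = librosa.load(path, sr=sr, duration=10.0)
        mel = librosa.feature.melspectrogram(
            y=y, sr=sr,
            n_fft=int(0.100 * sr),       # 100 ms frame length
            hop_length=int(0.023 * sr),  # 23 ms frame shift
            n_mels=64,                   # 64-dimensional audio vectors
            center=False,
        )
        return np.log(mel + 1e-10).T     # shape (431, 64): one vector per frame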
  • the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM) or the like.
  • with further reference to FIG. 6, as an implementation of the method shown in FIG. 1, the present application provides an embodiment of an apparatus for detecting multiple sound events; the apparatus embodiment corresponds to the method embodiment shown in FIG. 1, and the apparatus can be applied to various electronic devices.
  • a device for detecting multiple sound events provided by the embodiments of the present application adopts the following technical solutions:
  • the sound source extraction module 100 is used to extract the sound source matrix from the sound source data;
  • the feature extraction module 200 is used to input the sound source matrix into the trained feature extraction network to extract the feature matrix of the sound event;
  • the weighting module 300 is used to input the feature matrix into the trained weight-gated recurrent layer and to weight the corresponding succeeding vectors in the feature matrix according to the weight matrix of the weight-gated recurrent layer and the weights of the preceding vectors, to obtain the weighted feature matrix;
  • the full connection module 400 is used to input the weighted feature matrix into the full connection layer, and obtains a probability matrix through the full connection, wherein the dimension of the probability matrix corresponds to the number of types of sound events;
  • the determining module 600 is configured to determine the occurring target sound event according to the probability matrix.
  • the present application performs feature extraction on the extracted sound source to obtain a feature matrix including several vectors.
  • then, in the weight-gated recurrent layer, the succeeding vector in the feature matrix is weighted according to the weight of the preceding vector together with the trained weight matrix, so that the features of the sound events in the preceding vector influence the succeeding vector, reducing the influence of the hidden layer on the sound features during weighting.
  • the features of sound events between frames can thus form continuous feedback; both short-duration and long-duration sound features are effectively highlighted by the weighting, and connection and pooling then yield a probability matrix corresponding to the types of sound events. The occurrence of each sound event is determined according to the probabilities, and the scheme can accurately detect multiple sound events at the same time.
  • the feature extraction network includes convolution layers under a gated linear activation function and max pooling layers; at least one group of each is provided, and the convolution layers under the gated linear activation function and the max pooling layers are arranged alternately in sequence;
  • the feature extraction module 200 specifically includes:
  • the feature extraction submodule 201 is used to input the sound source matrix into the convolution layer under the gated linear activation function to perform convolution operation, and apply gate control to obtain an intermediate matrix;
  • the feature pooling sub-module 202 is configured to input the intermediate matrix output by the convolution layer under the gated linear activation function into the maximum pooling layer for dimension reduction to output a feature matrix.
  • the gated linear activation function used in this embodiment is:
  • Y = (W*X + b) ☉ sigmoid(V*X + c)
  • where W and V are convolution kernels. More preferably, the kernels used in this embodiment are 3×3 kernels, 128 of which are provided to give 128 channels for convolution; b and c are offsets obtained through training, and X is the input (sound source) matrix. The convolution result is gated through the sigmoid activation function to obtain the feature matrix.
  • the scheme implements gating on the convolution results, so that the feature extraction of the feature matrix is accurate and concentrated, and the accuracy of feature extraction is improved.
  • the weighting module 300 specifically includes:
  • the activation value determination sub-module 301 is used to obtain the activation value corresponding to each vector in the feature matrix.
  • the feature weighting sub-module 302 is configured to concatenate the activation values to obtain a weighted feature matrix.
  • the calculation process of the hidden layer in the weight-gated recurrent layer used in the activation value determination sub-module 301 is:
  • h_t = g(Y*x_t + ω*U*h_{t-1} + b)
  • where g is the activation function (the hyperbolic tangent in this embodiment); h_t is the activation value at time t, h_{t-1} is the activation value at time t-1, and x_t is the vector of the feature matrix at time t; Y and U are the weight matrices corresponding to x_t and x_{t-1}, respectively; ω is the weight applied to x_{t-1} when computing the activation value of x_t; and b is the offset.
  • This scheme can highlight the characteristics of sound events through weight adjustment, and improve the accuracy of probability matrix extraction.
  • the apparatus for detecting multiple sound events further includes a probability sorting module 500, which classifies the sound events corresponding to the probability matrix through a softmax function, and the sound event classification corresponds to the vector in the probability matrix.
  • the softmax function maps and classifies the probability matrix so that the occurrence probability of each sound event is arranged between 0 and 1, which helps reflect more intuitively whether a sound event has occurred.
  • the sound source extraction module 100 specifically includes:
  • the audio frame extraction sub-module 101 is used for dividing the sound source data according to the frame length and the frame shift amount, and extracting the audio frame.
  • the audio vector extraction sub-module 102 is configured to extract the audio frame as an audio vector according to the FBANK format.
  • the sound source matrix splicing sub-module 103 is used for splicing the audio vectors to obtain the sound source matrix.
  • the data in the sound source data is divided by time and extracted by vectors, which is conducive to the subsequent feature extraction and processing of the sound source data in the dimension of the audio frame, and finally reflects the occurrence of sound events in the entire audio. probability.
  • the storage of audio data in this solution is beneficial to improve the detection efficiency of sound detection.
  • refer to FIG. 10, which is a block diagram of the basic structure of the computer device according to this embodiment.
  • the computer device 6 includes a memory 61 , a processor 62 , and a network interface 63 that communicate with each other through a system bus. It should be pointed out that only the computer device 6 with components 61-63 is shown in the figure, but it should be understood that it is not required to implement all of the shown components, and more or less components may be implemented instead.
  • the computer device here is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions; its hardware includes but is not limited to microprocessors, application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), digital signal processors (DSP), embedded devices, etc.
  • the computer equipment may be a desktop computer, a notebook computer, a palmtop computer, a cloud server and other computing equipment.
  • the computer device can perform human-computer interaction with the user through a keyboard, a mouse, a remote control, a touch pad or a voice control device.
  • the memory 61 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static Random Access Memory (SRAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Programmable Read Only Memory (PROM), Magnetic Memory, Magnetic Disk, Optical Disk, etc.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the memory 61 may be an internal storage unit of the computer device 6 , such as a hard disk or a memory of the computer device 6 .
  • the memory 61 may also be an external storage device of the computer device 6, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash memory card (Flash Card), etc.
  • the memory 61 may also include both the internal storage unit of the computer device 6 and its external storage device.
  • the memory 61 is generally used to store the operating system and various application software installed on the computer device 6 , such as computer-readable instructions of various sound event detection methods, and the like.
  • the memory 61 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 62 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips. This processor 62 is typically used to control the overall operation of the computer device 6 . In this embodiment, the processor 62 is configured to execute computer-readable instructions stored in the memory 61 or process data, for example, computer-readable instructions for executing the methods for detecting multiple sound events.
  • the network interface 63 may include a wireless network interface or a wired network interface, and the network interface 63 is generally used to establish a communication connection between the computer device 6 and other electronic devices.
  • when executing the method for detecting multiple sounds, the computer device performs feature extraction on the extracted sound source to obtain a feature matrix including several vectors; then, in the weight-gated recurrent layer, the succeeding vector in the feature matrix is weighted according to the weight of the preceding vector together with the trained weight matrix, so that the features of the sound events in the preceding vector influence the succeeding vector, reducing the influence of the hidden layer on the sound features during weighting.
  • the features of sound events between frames can thus form continuous feedback; both short-duration and long-duration sound features are effectively highlighted by the weighting, and the connection then yields a probability matrix corresponding to the types of sound events. The occurrence of each sound event is determined according to the probabilities, and this scheme can accurately detect multiple sound events at the same time.
  • the present application also provides another implementation, namely a computer-readable storage medium storing computer-readable instructions of the method for detecting multiple sound events, where the computer-readable instructions can be executed by at least one processor to cause the at least one processor to perform the steps of the method for detecting multiple sound events described above.
  • when executing the method for detecting multiple sounds, the computer-readable instructions recorded on the computer-readable storage medium provided in this embodiment perform feature extraction on the extracted sound sources to obtain a feature matrix including several vectors; then, in the weight-gated recurrent layer, the succeeding vector in the feature matrix is weighted according to the weight of the preceding vector together with the trained weight matrix, so that the features of the sound events in the preceding vector influence the succeeding vector, reducing the influence of the hidden layer on the sound features during weighting.
  • the features of sound events between frames can thus form continuous feedback; both short-duration and long-duration sound features are effectively highlighted by the weighting, and the connection then yields a probability matrix corresponding to the types of sound events. The occurrence of each sound event is determined according to the probabilities, and this scheme can accurately detect multiple sound events at the same time.
  • the methods of the above embodiments can be implemented by means of software plus a necessary general hardware platform, and of course also by hardware, but in many cases the former is the better implementation.
  • the technical solution of the present application can be embodied in the form of a software product in essence or in a part that contributes to the prior art, and the computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, CD-ROM), including several instructions to make a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) execute the methods described in the various embodiments of this application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

A method for detecting multiple sound events, comprising: extracting a sound source matrix from sound source data (S100); inputting the sound source matrix into a trained feature extraction network to extract a feature matrix of sound events (S200); inputting the feature matrix into a trained weight-gated recurrent layer, and weighting the corresponding succeeding vectors in the feature matrix according to the weight matrix of the weight-gated recurrent layer and the weights of the preceding vectors, to obtain a weighted feature matrix (S300); inputting the weighted feature matrix into a fully connected layer to obtain a probability matrix through full connection, where the dimension of the probability matrix corresponds to the number of types of sound events (S400); and determining, according to the probability matrix, the target sound event that occurred (S600). The sound source matrix can be stored in a blockchain.

Description

Method and apparatus for detecting multiple sound events, computer device, and storage medium
This application claims priority to the Chinese patent application filed with the China National Intellectual Property Administration on October 29, 2020, with application number 202011186597.5 and invention title "Method and apparatus for detecting multiple sound events, computer device, and storage medium", the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
This application relates to the technical field of sound recognition, and in particular to a method, apparatus, computer device, and storage medium for detecting multiple sound events.
BACKGROUND
Sound event detection is used in applications such as home smart speakers and telephone customer service, where detected sound events can immediately help the relevant personnel identify and respond to them. The inventor has realized that most existing event detection can only detect a single sound; for example, it can identify only one of a baby's crying or the alarm of a smoke detector, and a more typical case is the detection of a wake-up word, such as "Hi Sara" or "Xiao Ai".
Existing sound event detection schemes require targeted training on specific sound spectra, and thus a dedicated annotation team to carefully label large amounts of data with the detailed time endpoints at which sound events occur, as training material for the specific sound events. Because many sound sources are labeled only as a whole segment, even when the sound is not present throughout the entire segment, the correspondence between sound events and time endpoints cannot be guaranteed. As a result, existing sound event detection schemes train poorly and detect sound inaccurately. On this basis, when existing schemes are trained on models for multiple sound events at the same time, the training results are even worse and the detection accuracy of sound events is even lower.
SUMMARY
The purpose of the embodiments of this application is to detect multiple sound events accurately and simultaneously.
To solve the above technical problem, an embodiment of this application provides a method for detecting multiple sound events, adopting the following technical solution:
A method for detecting multiple sound events, the method comprising:
extracting a sound source matrix from sound source data;
inputting the sound source matrix into a trained feature extraction network to extract a feature matrix of sound events;
inputting the feature matrix into a trained weight-gated recurrent layer, and weighting the corresponding succeeding vectors in the feature matrix according to the weight matrix of the weight-gated recurrent layer and the weights of the preceding vectors in the feature matrix, to obtain a weighted feature matrix;
inputting the weighted feature matrix into a fully connected layer to obtain a probability matrix through full connection, wherein the dimension of the probability matrix corresponds to the number of types of sound events;
determining, according to the probability matrix, the target sound event that occurred.
To solve the above technical problem, an embodiment of this application further provides an apparatus for detecting multiple sound events, comprising:
a sound source extraction module for extracting a sound source matrix from sound source data;
a feature extraction module for inputting the sound source matrix into a trained feature extraction network to extract a feature matrix of sound events;
a weighting module for inputting the feature matrix into a trained weight-gated recurrent layer, and weighting the corresponding succeeding vectors in the feature matrix according to the weight matrix of the weight-gated recurrent layer and the weights of the preceding vectors, to obtain a weighted feature matrix;
a fully connected module for inputting the weighted feature matrix into a fully connected layer to obtain a probability matrix through full connection, wherein the dimension of the probability matrix corresponds to the number of types of sound events;
a determination module for determining, according to the probability matrix, the target sound event that occurred.
To solve the above technical problem, an embodiment of this application further provides a computer device, adopting the following technical solution:
A computer device comprising a memory and a processor, wherein computer-readable instructions are stored in the memory, and the processor, when executing the computer-readable instructions, further implements the following steps:
extracting a sound source matrix from sound source data;
inputting the sound source matrix into a trained feature extraction network to extract a feature matrix of sound events;
inputting the feature matrix into a trained weight-gated recurrent layer, and weighting the corresponding succeeding vectors in the feature matrix according to the weight matrix of the weight-gated recurrent layer and the weights of the preceding vectors, to obtain a weighted feature matrix;
inputting the weighted feature matrix into a fully connected layer to obtain a probability matrix through full connection, wherein the dimension of the probability matrix corresponds to the number of types of sound events;
determining, according to the probability matrix, the target sound event that occurred.
To solve the above technical problem, an embodiment of this application further provides a computer-readable storage medium, adopting the following technical solution:
A computer-readable storage medium on which computer-readable instructions are stored, the computer-readable instructions, when executed by a processor, causing the processor to further perform the following steps:
extracting a sound source matrix from sound source data;
inputting the sound source matrix into a trained feature extraction network to extract a feature matrix of sound events;
inputting the feature matrix into a trained weight-gated recurrent layer, and weighting the corresponding succeeding vectors in the feature matrix according to the weight matrix of the weight-gated recurrent layer and the weights of the preceding vectors, to obtain a weighted feature matrix;
inputting the weighted feature matrix into a fully connected layer to obtain a probability matrix through full connection, wherein the dimension of the probability matrix corresponds to the number of types of sound events;
determining, according to the probability matrix, the target sound event that occurred.
Compared with the prior art, the embodiments of this application mainly have the following beneficial effects: feature extraction is performed on the extracted sound source to obtain a feature matrix including several vectors; then, in the weight-gated recurrent layer, the succeeding vector in the feature matrix is weighted according to the weight of the preceding vector together with the trained weight matrix, so that the features of the sound events in the preceding vector influence the succeeding vector, reducing the influence of the hidden layer on the sound features during weighting. The features of sound events between frames can thus form continuous feedback, both short-duration and long-duration sound features are effectively highlighted by the weighting, and the connection then yields a probability matrix corresponding to the types of sound events. The occurrence of each sound event is determined according to the probabilities, and the target sound event that occurred is determined. This scheme can accurately detect multiple sound events at the same time.
BRIEF DESCRIPTION OF THE DRAWINGS
To explain the solution in this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a flowchart of an embodiment of a method for detecting multiple sound events according to this application;
FIG. 2 is a flowchart of a specific implementation of step S200 in FIG. 1;
FIG. 3 is a flowchart of a specific implementation of step S300 in FIG. 1;
FIG. 4 is a flowchart of an embodiment of a method for detecting multiple sound events according to this application;
FIG. 5 is a flowchart of a specific implementation of step S100 in FIG. 1;
FIG. 6 is a schematic structural diagram of an embodiment of an apparatus for detecting multiple sound events according to this application;
FIG. 7 is a schematic structural diagram of a specific implementation of the feature extraction module shown in FIG. 6;
FIG. 8 is a schematic structural diagram of a specific implementation of the weighting module shown in FIG. 6;
FIG. 9 is a schematic structural diagram of a specific implementation of the sound source extraction module shown in FIG. 6;
FIG. 10 is a schematic structural diagram of an embodiment of a computer device according to this application.
DETAILED DESCRIPTION
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of this application. The terms used in the specification are only for describing specific embodiments and are not intended to limit this application. The terms "comprising" and "having" and any variants thereof in the specification, claims, and drawing descriptions are intended to cover non-exclusive inclusion. The terms "first", "second", etc. in the specification, claims, or drawings are used to distinguish different objects, not to describe a specific order.
Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of this application. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment that is mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
To enable those skilled in the art to better understand the solution of this application, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings.
Referring to FIG. 1, a flowchart of one embodiment of a method for detecting multiple sound events according to this application is shown. The method for detecting multiple sound events comprises the following steps:
Step S100: extracting a sound source matrix from sound source data.
Data extraction is performed on the sound source to obtain digitized sound source data. This embodiment targets the detection of multiple sound events, so the sound source is complex, and the sound source data still contains multiple sounds even after filtering and noise reduction. The digitized sound stores the information of the sound in the form of a matrix: when a sound source matrix is used to store sound data, each vector in the matrix holds the sound data of one audio frame, and the sound source matrix formed by concatenating these vectors stores the entire piece of sound data. An audio frame refers to a period of audio and is the smallest unit of audio storage and operation in this solution.
Step S200: inputting the sound source matrix into a trained feature extraction network to extract a feature matrix of sound events.
The feature extraction network uses the deep learning capability of a neural network to extract features from the data; commonly, a convolutional neural network convolves the data layer by layer with pre-trained convolution kernels, and the channels formed by the kernels extract the data features.
Step S300: inputting the feature matrix into a trained weight-gated recurrent layer, and weighting the corresponding succeeding vectors in the feature matrix according to the weight matrix of the weight-gated recurrent layer and the weights of the preceding vectors, to obtain a weighted feature matrix.
The feature matrix contains several vectors; each vector corresponds to one audio frame and holds the feature data of a piece of sound. Each succeeding vector is weighted according to the pre-trained weight matrix and the weight provided by its preceding vector, and the feature matrix is updated. In this way, first, the feature matrix can be weighted according to the multiple sound events to be detected so as to highlight the features of the sound events; second, through the weights provided by the preceding vectors, the vectors in the feature matrix become related to each other, so the sound features can be corrected according to factors such as the utterance length of different sounds, weakening the serialization effect of the vectors in the feature matrix. This prevents a sound that has already stopped from still being reflected in the succeeding vectors; and if a sound lasts longer, it can continue to be reflected in the succeeding vectors through the weights provided by its preceding vectors.
In this embodiment, the feature matrix consists of n vectors x_1, x_2, x_3, ..., x_n in order, where each preceding vector provides a weight O_t (t = 1, ..., n) to its succeeding vector. For example: the weight of x_1 is O_1 and the weight matrix is Z; weighting x_2 combines the weight matrix with O_1, where O_1 is a value associated with x_1, for instance through the vector product O_1*x_2*Z, the element-wise sum O_1*Z + x_2, or a concatenation of O_1*Z and x_2, and the vector x_2 in the feature matrix is weighted accordingly. Here, the operator + adds the corresponding elements of the vectors on both sides of the operator, and the concatenation operator joins the vectors or matrices on both sides into one matrix. Obviously, the way the preceding vector provides a weight to the succeeding vector is not limited to the above schemes; any method that weights the succeeding content with the preceding content, so as to pass the influence of the preceding sound event on to the succeeding vector, falls within the specific implementations of this solution.
Step S400: inputting the weighted feature matrix into a fully connected layer to obtain a probability matrix through full connection, wherein the dimension of the probability matrix corresponds to the number of types of sound events.
The weighted feature matrix is fully connected; the fully connected layer is placed after the weight-gated recurrent layer, and its purpose is to adjust the matrix dimension to form a probability matrix whose dimension equals the number of sound event types. The vector of each dimension in the obtained probability matrix corresponds to the probability of occurrence of one sound event. The number of neurons used in the fully connected layer equals the number of sound events to be detected: if n sound events are detected, the fully connected layer has n neurons, so an n-dimensional matrix is output, consisting of n vectors that reflect the probabilities of specific sound events. That is, the vector of each dimension in the probability matrix corresponds to the probability that one sound event has occurred.
Step S600: determining, according to the probability matrix, the target sound event that occurred.
According to the probability corresponding to each sound event provided by the vectors in the probability matrix, whether the sound event occurred in the piece of audio is determined, and the target sound event that occurred is determined.
This application performs feature extraction on the extracted sound source to obtain a feature matrix including several vectors; then, in the weight-gated recurrent layer, the succeeding vector in the feature matrix is weighted according to the weight of the preceding vector together with the trained weight matrix, so that the features of the sound events in the preceding vector influence the succeeding vector, reducing the influence of the hidden layer on the sound features during weighting. The features of sound events between frames can thus form continuous feedback, both short-duration and long-duration sound features are effectively highlighted by the weighting, and the connection then yields a probability matrix corresponding to the types of sound events. The occurrence of each sound event is determined according to the probabilities, and the target sound event that occurred is determined. This scheme can accurately detect multiple sound events at the same time.
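As an editorial illustration of how these steps compose, the following sketch chains the pieces end to end, reusing the GatedConvBlock and WeightGatedRecurrentLayer sketches given earlier in this document; the single conv/pool group, the flattening, and the event count are assumptions, not the patented architecture:

    import torch
    import torch.nn as nn

    class MultiSoundEventDetector(nn.Module):
        # S200 feature extraction -> S300 weight-gated weighting ->
        # S400 fully connected probability matrix -> S500 softmax mapping.
        def __init__(self, n_events: int = 5):
            super().__init__()
            self.extractor = GatedConvBlock(in_ch=1, out_ch=128)      # S200
            self.recurrent = WeightGatedRecurrentLayer(dim=128 * 32)  # S300
            self.fc = nn.Linear(128 * 32, n_events)                   # S400

        def forward(self, source_matrix: torch.Tensor) -> torch.Tensor:
            x = source_matrix[None, None]              # (1, 1, 431, 64), from S100
            feats = self.extractor(x)                  # (1, 128, 215, 32)
            t = feats.shape[2]
            feats = feats.squeeze(0).permute(1, 0, 2).reshape(t, -1)  # (215, 4096)
            weighted = self.recurrent(feats)           # S300: (215, 4096)
            scores = self.fc(weighted)                 # S400: (215, n_events)
            return torch.softmax(scores, dim=-1)       # S500: entries in [0, 1]

    probs = MultiSoundEventDetector()(torch.randn(431, 64))
    print(probs.shape)  # torch.Size([215, 5])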
It should be emphasized that, to further guarantee the privacy and security of the information of the above method for detecting multiple sound events, the above sound source matrix, probability matrix, and feature matrix information can also be stored in a node of a blockchain.
Further, the feature extraction network includes convolution layers under a gated linear activation function and max pooling layers; at least one group of each is provided, and the convolution layers under the gated linear activation function and the max pooling layers are arranged alternately in sequence.
Step S200, inputting the sound source matrix into the trained feature extraction network to extract the feature matrix of sound events, specifically includes:
Step S201: inputting the sound source matrix into the convolution layer under the gated linear activation function to perform a convolution operation, and applying gating to obtain an intermediate matrix.
Specifically, the gated linear activation function used in this embodiment is:
Y = (W*X + b) ☉ sigmoid(V*X + c)
where W and V are convolution kernels of the same number and size. More preferably, the kernels used in this embodiment are 3×3 kernels, and 128 such kernels are provided so that convolution is performed over 128 channels; b and c are offsets obtained through training, and X is the input (sound source) matrix. The convolution result is gated through the sigmoid activation function to obtain the feature matrix.
Applying gating makes the data of the feature matrix extracted from the sound source matrix smoother, and the features extracted in this way are more accurate and concentrated.
Step S202: inputting the intermediate matrix output by the convolution layer under the gated linear activation function into the max pooling layer for dimension reduction, to output the feature matrix.
Pooling prevents the feature matrix from overfitting and guarantees the extraction accuracy of the feature matrix.
By gating the convolution results, this scheme makes the feature extraction of the feature matrix accurate and concentrated, improving the accuracy of feature extraction.
Further, step S300, inputting the feature matrix into the trained weight-gated recurrent layer and weighting the corresponding succeeding vectors in the feature matrix according to the weight matrix of the weight-gated recurrent layer and the weights of the preceding vectors, to obtain the weighted feature matrix, specifically includes:
Step S301: obtaining the activation value corresponding to each vector in the feature matrix.
Specifically, in this embodiment, the calculation process of the hidden layer in the weight-gated recurrent layer is:
h_t = g(Y*x_t + ω*U*h_{t-1} + b)
where g is the activation function (the hyperbolic tangent in this embodiment); h_t is the activation value at time t, h_{t-1} is the activation value at time t-1, and x_t is the vector of the feature matrix at time t. Y and U are the weight matrices corresponding to x_t and x_{t-1}, respectively; ω is the weight applied to x_{t-1} when computing the activation value of x_t; b is the offset. When t=1, there is no preceding vector, and the weighting uses only the trained weight matrix and parameters; when t>1, the weighting combines the activation value of the preceding term with the trained weight matrix and parameters.
In this embodiment, the weighting process when t>1 is divided into two stages:
The first weighting, Y*x_t, weights a vector in the feature matrix by the first weight matrix to highlight the sound-event-related features in the vector.
The second weighting, ω*U*h_{t-1} + b, weights the vector according to the proportion in which the sound event of the preceding vector occurs and continues; this weighting is expressed through the activation value of the preceding term, the second weight matrix, and the weight ω attached to the preceding vector. When the preceding sound event continues backward and affects the succeeding vector, this weight is large; conversely, if the sound event in the preceding term does not continue, or its influence on the succeeding vector weakens, the weight is small. The first weight matrix and the second weight matrix are determined separately through training. The two-stage weighting can overcome the serialization effect of vectors in sequential networks: when the sound event in the preceding vector lasts only briefly, the weighting applied to the succeeding vector is small; when it lasts longer, the weighting applied to the succeeding vector is larger. The feature matrix weighted in this way can effectively display both short-duration and long-duration sound events.
Step S302: concatenating the activation values to obtain the weighted feature matrix.
Specifically, each activation value is a vector corresponding to a feature vector; all activation values have the same dimension, and the activation values of the same dimension are arranged in order to obtain the weighted feature matrix.
In this embodiment, the activation value of the preceding vector reflects the features contained in that vector: if the feature correlation between the preceding and succeeding vectors is high, the preceding vector strongly influences the weighting of the succeeding vector. In addition, the weight applied to the preceding vector reflects how long the sound event lasts: if the sound event lasts longer, the applied weight is higher and the features of the corresponding sound event are maintained; conversely, if the sound event lasts only briefly, the applied weight is lower and the features of the sound event are dropped in time without affecting the feature display of subsequent vectors.
Therefore, this scheme can highlight the features of sound events through weight adjustment, improving the accuracy of probability matrix extraction.
Further, after step S400, fully connecting the weighted feature matrix to obtain the probability matrix, the method further includes:
Step S500: classifying the sound events corresponding to the probability matrix by a softmax function, where the sound event classes correspond to the vectors in the probability matrix.
Mapping and classifying the probability matrix with the softmax function arranges the occurrence probability of each sound event between 0 and 1; when the probability of a sound event is close to 1, it is determined that the sound event is very likely to have occurred, which helps reflect more intuitively whether a sound event has occurred.
Further, step S100, extracting the sound source matrix from the sound source data, specifically includes:
Step S101: segmenting the sound source data according to the frame length and frame shift, and extracting audio frames.
Sound source data generally has a limited duration, and the data is taken in by duration; the detection of multiple sounds is performed on the sound source data within this time range. In this embodiment, the duration of each piece of sound source data is 10 s; within the 10 s of sound source data, audio frames are cut out with a frame length of 100 ms and a frame shift of 23 ms, so a total of 431 audio frames can be cut out.
Specifically, 10 s = 10000 ms. The first audio frame covers 0-100 ms and the second frame 23-123 ms; consecutive frames overlap, so the n-th frame is the audio in the range [23*(n-1) ms, 23*(n-1)+100 ms]. From 23*(n-1)+100 <= 10000 ms, the 10 s of sound source data gives n <= 431.43; since n is an integer, n = 431. As can be seen, the audio data in adjacent audio frames partially overlap, which also helps the features of sound events stay continuous and easy to capture during subsequent feature extraction and weighting.
In this application, the 10 s value is the smallest unit of analysis and judgment in this embodiment. If the acquired sound source data is very long, it can be cut into multiple sub-segments according to this smallest analyzable unit, and sound detection is performed on each sub-segment to determine which sound events occurred. Of course, in other embodiments, the duration of each piece of sound source data can also be adjusted according to actual needs.
Step S102: extracting each audio frame as an audio vector in the FBANK format.
FBANK is a feature extraction format for audio: after feature extraction, a piece of audio can be recorded and stored in the form of vectors, where each vector corresponds to the audio data in one time period. In this embodiment, the audio data in each audio frame is stored as one vector generated in the FBANK format, and the audio vector corresponding to each audio frame has 64 dimensions.
Step S103: concatenating the audio vectors to obtain the sound source matrix.
The sound source matrix obtained by concatenating the audio vectors is a 431×64 matrix.
This scheme divides the sound source data by time and extracts it as vectors, which facilitates subsequent feature extraction and processing of the sound source data in the dimension of audio frames, and finally reflects the probability of occurrence of sound events in the entire piece of audio. This way of storing audio data helps improve the efficiency of sound detection.
Those of ordinary skill in the art can understand that all or part of the processes in the above embodiment methods can be implemented by instructing the relevant hardware through computer-readable instructions, which can be stored in a computer-readable storage medium; when executed, the computer-readable instructions can include the processes of the above method embodiments. The aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, or a read-only memory (ROM), or a random access memory (RAM), etc.
It should be understood that although the steps in the flowcharts of the drawings are shown in sequence as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict order restriction on their execution, and they can be executed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but can be executed at different moments; their execution order is also not necessarily sequential, and they can be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
With further reference to FIG. 6, as an implementation of the method shown in FIG. 1, this application provides an embodiment of an apparatus for detecting multiple sound events; this apparatus embodiment corresponds to the method embodiment shown in FIG. 1, and the apparatus can be applied to various electronic devices.
To solve the above technical problem, the apparatus for detecting multiple sound events provided by the embodiments of this application adopts the following technical solution:
An apparatus for detecting multiple sound events, comprising:
a sound source extraction module 100 for extracting a sound source matrix from sound source data;
a feature extraction module 200 for inputting the sound source matrix into a trained feature extraction network to extract a feature matrix of sound events;
a weighting module 300 for inputting the feature matrix into a trained weight-gated recurrent layer, and weighting the corresponding succeeding vectors in the feature matrix according to the weight matrix of the weight-gated recurrent layer and the weights of the preceding vectors, to obtain a weighted feature matrix;
a fully connected module 400 for inputting the weighted feature matrix into a fully connected layer to obtain a probability matrix through full connection, wherein the dimension of the probability matrix corresponds to the number of types of sound events;
a determination module 600 for determining, according to the probability matrix, the target sound event that occurred.
This application performs feature extraction on the extracted sound source to obtain a feature matrix including several vectors; then, in the weight-gated recurrent layer, the succeeding vector in the feature matrix is weighted according to the weight of the preceding vector together with the trained weight matrix, so that the features of the sound events in the preceding vector influence the succeeding vector, reducing the influence of the hidden layer on the sound features during weighting. The features of sound events between frames can thus form continuous feedback, both short-duration and long-duration sound features are effectively highlighted by the weighting, and connection and pooling then yield a probability matrix corresponding to the types of sound events. The occurrence of each sound event is determined according to the probabilities, and this scheme can accurately detect multiple sound events at the same time.
Further, the feature extraction network includes convolution layers under a gated linear activation function and max pooling layers; at least one group of each is provided, and the convolution layers under the gated linear activation function and the max pooling layers are arranged alternately in sequence.
The feature extraction module 200 specifically includes:
a feature extraction submodule 201 for inputting the sound source matrix into the convolution layer under the gated linear activation function to perform a convolution operation, and applying gating to obtain an intermediate matrix;
a feature pooling submodule 202 for inputting the intermediate matrix output by the convolution layer under the gated linear activation function into the max pooling layer for dimension reduction, to output the feature matrix.
Specifically, the gated linear activation function used in this embodiment is:
Y = (W*X + b) ☉ sigmoid(V*X + c)
where W and V are convolution kernels; more preferably, the kernels used in this embodiment are 3×3 kernels, 128 of which are provided to give 128 channels for convolution; b and c are offsets obtained through training, and X is the input (sound source) matrix. The convolution result is gated through the sigmoid activation function to obtain the feature matrix. By gating the convolution results, this scheme makes the feature extraction of the feature matrix accurate and concentrated, improving the accuracy of feature extraction.
Further, the weighting module 300 specifically includes:
an activation value determination submodule 301 for obtaining the activation value corresponding to each vector in the feature matrix;
a feature weighting submodule 302 for concatenating the activation values to obtain the weighted feature matrix.
Specifically, in this embodiment, the calculation process of the hidden layer in the weight-gated recurrent layer used in the activation value determination submodule 301 is:
h_t = g(Y*x_t + ω*U*h_{t-1} + b)
where g is the activation function (the hyperbolic tangent in this embodiment); h_t is the activation value at time t, h_{t-1} is the activation value at time t-1, and x_t is the vector of the feature matrix at time t. Y and U are the weight matrices corresponding to x_t and x_{t-1}, respectively; ω is the weight applied to x_{t-1} when computing the activation value of x_t; b is the offset.
This scheme can highlight the features of sound events through weight adjustment, improving the accuracy of probability matrix extraction.
Further, the apparatus for detecting multiple sound events also includes a probability arrangement module 500, which classifies the sound events corresponding to the probability matrix by a softmax function, where the sound event classes correspond to the vectors in the probability matrix.
Mapping and classifying the probability matrix with the softmax function arranges the occurrence probability of each sound event between 0 and 1, which helps reflect more intuitively whether a sound event has occurred.
Further, the sound source extraction module 100 specifically includes:
an audio frame extraction submodule 101 for segmenting the sound source data according to the frame length and frame shift and extracting audio frames;
an audio vector extraction submodule 102 for extracting each audio frame as an audio vector in the FBANK format;
a sound source matrix concatenation submodule 103 for concatenating the audio vectors to obtain the sound source matrix.
This scheme divides the sound source data by time and extracts it as vectors, which facilitates subsequent feature extraction and processing of the sound source data in the dimension of audio frames, and finally reflects the probability of occurrence of sound events in the entire piece of audio. This way of storing audio data helps improve the efficiency of sound detection.
To solve the above technical problem, an embodiment of this application further provides a computer device. Refer specifically to FIG. 10, which is a basic structural block diagram of the computer device of this embodiment.
The computer device 6 includes a memory 61, a processor 62, and a network interface 63 that communicate with each other through a system bus. It should be noted that only the computer device 6 with components 61-63 is shown in the figure, but it should be understood that implementing all of the shown components is not required; more or fewer components may be implemented instead. Those skilled in the art understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions; its hardware includes but is not limited to microprocessors, application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), digital signal processors (DSP), embedded devices, etc.
The computer device may be a desktop computer, a notebook, a palmtop computer, a cloud server, or other computing device. The computer device can interact with the user through a keyboard, mouse, remote control, touchpad, or voice control device.
The memory 61 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. The computer-readable storage medium may be non-volatile or volatile. In some embodiments, the memory 61 may be an internal storage unit of the computer device 6, such as its hard disk or internal memory. In other embodiments, the memory 61 may also be an external storage device of the computer device 6, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, or flash card provided on the computer device 6. Of course, the memory 61 may also include both the internal storage unit of the computer device 6 and its external storage device. In this embodiment, the memory 61 is generally used to store the operating system and various application software installed on the computer device 6, such as the computer-readable instructions of the method for detecting multiple sound events. In addition, the memory 61 can also be used to temporarily store various types of data that have been output or are to be output.
The processor 62 may, in some embodiments, be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor 62 is generally used to control the overall operation of the computer device 6. In this embodiment, the processor 62 is used to run the computer-readable instructions stored in the memory 61 or to process data, for example to run the computer-readable instructions of the method for detecting multiple sound events.
The network interface 63 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the computer device 6 and other electronic devices.
When executing the method for detecting multiple sounds, the computer device provided by this embodiment performs feature extraction on the extracted sound source to obtain a feature matrix including several vectors; then, in the weight-gated recurrent layer, the succeeding vector in the feature matrix is weighted according to the weight of the preceding vector together with the trained weight matrix, so that the features of the sound events in the preceding vector influence the succeeding vector, reducing the influence of the hidden layer on the sound features during weighting. The features of sound events between frames can thus form continuous feedback, both short-duration and long-duration sound features are effectively highlighted by the weighting, and the connection then yields a probability matrix corresponding to the types of sound events. The occurrence of each sound event is determined according to the probabilities, and this scheme can accurately detect multiple sound events at the same time.
This application also provides another implementation, namely a computer-readable storage medium storing computer-readable instructions of the method for detecting multiple sound events, where the computer-readable instructions can be executed by at least one processor to cause the at least one processor to perform the steps of the method for detecting multiple sound events described above.
When executing the method for detecting multiple sounds, the computer-readable instructions recorded on the computer-readable storage medium provided by this embodiment perform feature extraction on the extracted sound source to obtain a feature matrix including several vectors; then, in the weight-gated recurrent layer, the succeeding vector in the feature matrix is weighted according to the weight of the preceding vector together with the trained weight matrix, so that the features of the sound events in the preceding vector influence the succeeding vector, reducing the influence of the hidden layer on the sound features during weighting. The features of sound events between frames can thus form continuous feedback, both short-duration and long-duration sound features are effectively highlighted by the weighting, and the connection then yields a probability matrix corresponding to the types of sound events. The occurrence of each sound event is determined according to the probabilities, and this scheme can accurately detect multiple sound events at the same time.
From the description of the above embodiments, those skilled in the art can clearly understand that the above embodiment methods can be implemented by means of software plus a necessary general hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product; this computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, or optical disk) and includes several instructions to cause a terminal device (which may be a mobile phone, computer, server, air conditioner, network device, etc.) to execute the methods described in the embodiments of this application.
Obviously, the embodiments described above are only some of the embodiments of this application, not all of them; the drawings show preferred embodiments of this application but do not limit its patent scope. This application can be implemented in many different forms; rather, these embodiments are provided so that the disclosure of this application will be understood more thoroughly and comprehensively. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions recorded in the foregoing specific embodiments or replace some of their technical features with equivalents. Any equivalent structure made using the contents of the specification and drawings of this application, applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of this application.

Claims (20)

  1. A method for detecting multiple sound events, the method comprising:
    extracting a sound source matrix from sound source data;
    inputting the sound source matrix into a trained feature extraction network to extract a feature matrix of sound events;
    inputting the feature matrix into a trained weight-gated recurrent layer, and weighting the corresponding succeeding vectors in the feature matrix according to the weight matrix of the weight-gated recurrent layer and the weights of the preceding vectors in the feature matrix, to obtain a weighted feature matrix;
    inputting the weighted feature matrix into a fully connected layer to obtain a probability matrix through full connection, wherein the dimension of the probability matrix corresponds to the number of types of sound events;
    determining, according to the probability matrix, the target sound event that occurred.
  2. The method for detecting multiple sound events according to claim 1, wherein
    the feature extraction network comprises convolution layers under a gated linear activation function and max pooling layers, at least one group of each being provided, with the convolution layers under the gated linear activation function and the max pooling layers arranged alternately in sequence;
    the inputting of the sound source matrix into the trained feature extraction network to extract the feature matrix of sound events specifically comprises:
    inputting the sound source matrix into the convolution layer under the gated linear activation function to perform a convolution operation, and applying gating to obtain an intermediate matrix;
    inputting the intermediate matrix output by the convolution layer under the gated linear activation function into the max pooling layer for dimension reduction, to output the feature matrix.
  3. The method for detecting multiple sound events according to claim 2, wherein the inputting of the sound source matrix into the convolution layer under the gated linear activation function for convolution and gating to obtain the intermediate matrix is specifically realized by the following formula:
    Y = (W*X + b) ☉ sigmoid(V*X + c)
    wherein Y is the intermediate matrix, W and V are convolution kernels of the same number and size, b and c are offsets obtained through training, and X is the sound source matrix.
  4. The method for detecting multiple sound events according to claim 1, wherein the weight matrix comprises a first weight matrix and a second weight matrix, and the inputting of the feature matrix into the trained weight-gated recurrent layer and the weighting of the corresponding succeeding vectors in the feature matrix according to the weight matrix of the weight-gated recurrent layer and the weights of the preceding vectors, to obtain the weighted feature matrix, specifically comprises:
    obtaining the activation value corresponding to each vector in the feature matrix by the following formula:
    h_t = g(Y*x_t + ω*U*h_{t-1} + b)
    wherein g is the activation function; h_t is the activation value corresponding to x_t, h_{t-1} is the activation value corresponding to x_{t-1}, and x_t is the vector of the feature matrix at time t; Y is the first weight matrix, U is the second weight matrix, ω is the weight applied to x_{t-1} when computing the activation value of x_t, and b and c are offsets;
    concatenating the activation values to obtain the weighted feature matrix.
  5. The method for detecting multiple sound events according to claim 1, wherein after the weighted feature matrix is fully connected to obtain the probability matrix, the method further comprises:
    classifying the sound events corresponding to the probability matrix by a softmax function, the number of classes being equal to the dimension of the probability matrix.
  6. The method for detecting multiple sound events according to claim 1, wherein the extracting of the sound source matrix from the sound source data specifically comprises:
    segmenting the sound source data according to a frame length and a frame shift, and extracting audio frames;
    extracting each audio frame as an audio vector in the FBANK format;
    concatenating the audio vectors to obtain the sound source matrix.
  7. An apparatus for detecting multiple sound events, comprising:
    a sound source extraction module for extracting a sound source matrix from sound source data;
    a feature extraction module for inputting the sound source matrix into a trained feature extraction network to extract a feature matrix of sound events;
    a weighting module for inputting the feature matrix into a trained weight-gated recurrent layer, and weighting the corresponding succeeding vectors in the feature matrix according to the weight matrix of the weight-gated recurrent layer and the weights of the preceding vectors, to obtain a weighted feature matrix;
    a fully connected module for inputting the weighted feature matrix into a fully connected layer to obtain a probability matrix through full connection, wherein the dimension of the probability matrix corresponds to the number of types of sound events;
    a determination module for determining, according to the probability matrix, the target sound event that occurred.
  8. The apparatus for detecting multiple sound events according to claim 7, wherein
    the weight matrix comprises a first weight matrix and a second weight matrix, and the weighting module specifically comprises:
    an activation value determination submodule for obtaining the activation value corresponding to each vector in the feature matrix by the following formula:
    h_t = g(Y*x_t + ω*U*h_{t-1} + b)
    wherein g is the activation function; h_t is the activation value corresponding to x_t, h_{t-1} is the activation value corresponding to x_{t-1}, and x_t is the vector of the feature matrix at time t; Y is the first weight matrix, U is the second weight matrix, ω is the weight applied to x_{t-1} when computing the activation value of x_t, and b and c are offsets;
    a feature weighting submodule for concatenating the activation values to obtain the weighted feature matrix.
  9. A computer device, comprising a memory and a processor, wherein computer-readable instructions are stored in the memory, and the processor, when executing the computer-readable instructions, further implements the following steps:
    extracting a sound source matrix from sound source data;
    inputting the sound source matrix into a trained feature extraction network to extract a feature matrix of sound events;
    inputting the feature matrix into a trained weight-gated recurrent layer, and weighting the corresponding succeeding vectors in the feature matrix according to the weight matrix of the weight-gated recurrent layer and the weights of the preceding vectors in the feature matrix, to obtain a weighted feature matrix;
    inputting the weighted feature matrix into a fully connected layer to obtain a probability matrix through full connection, wherein the dimension of the probability matrix corresponds to the number of types of sound events;
    determining, according to the probability matrix, the target sound event that occurred.
  10. The computer device according to claim 9, wherein
    the feature extraction network comprises convolution layers under a gated linear activation function and max pooling layers, at least one group of each being provided, with the convolution layers under the gated linear activation function and the max pooling layers arranged alternately in sequence;
    the inputting of the sound source matrix into the trained feature extraction network to extract the feature matrix of sound events specifically comprises:
    inputting the sound source matrix into the convolution layer under the gated linear activation function to perform a convolution operation, and applying gating to obtain an intermediate matrix;
    inputting the intermediate matrix output by the convolution layer under the gated linear activation function into the max pooling layer for dimension reduction, to output the feature matrix.
  11. The computer device according to claim 10, wherein the inputting of the sound source matrix into the convolution layer under the gated linear activation function for convolution and gating to obtain the intermediate matrix is specifically realized by the following formula:
    Y = (W*X + b) ☉ sigmoid(V*X + c)
    wherein Y is the intermediate matrix, W and V are convolution kernels of the same number and size, b and c are offsets obtained through training, and X is the sound source matrix.
  12. The computer device according to claim 9, wherein the weight matrix comprises a first weight matrix and a second weight matrix, and the inputting of the feature matrix into the trained weight-gated recurrent layer and the weighting of the corresponding succeeding vectors in the feature matrix according to the weight matrix of the weight-gated recurrent layer and the weights of the preceding vectors, to obtain the weighted feature matrix, specifically comprises:
    obtaining the activation value corresponding to each vector in the feature matrix by the following formula:
    h_t = g(Y*x_t + ω*U*h_{t-1} + b)
    wherein g is the activation function; h_t is the activation value corresponding to x_t, h_{t-1} is the activation value corresponding to x_{t-1}, and x_t is the vector of the feature matrix at time t; Y is the first weight matrix, U is the second weight matrix, ω is the weight applied to x_{t-1} when computing the activation value of x_t, and b and c are offsets;
    concatenating the activation values to obtain the weighted feature matrix.
  13. The computer device according to claim 9, wherein after the weighted feature matrix is fully connected to obtain the probability matrix, the processor, when executing the computer-readable instructions, further implements the following step:
    classifying the sound events corresponding to the probability matrix by a softmax function, the number of classes being equal to the dimension of the probability matrix.
  14. The computer device according to claim 9, wherein the extracting of the sound source matrix from the sound source data specifically comprises:
    segmenting the sound source data according to a frame length and a frame shift, and extracting audio frames;
    extracting each audio frame as an audio vector in the FBANK format;
    concatenating the audio vectors to obtain the sound source matrix.
  15. A computer-readable storage medium, wherein computer-readable instructions are stored on the computer-readable storage medium, and the computer-readable instructions, when executed by a processor, cause the processor to further perform the following steps:
    extracting a sound source matrix from sound source data;
    inputting the sound source matrix into a trained feature extraction network to extract a feature matrix of sound events;
    inputting the feature matrix into a trained weight-gated recurrent layer, and weighting the corresponding succeeding vectors in the feature matrix according to the weight matrix of the weight-gated recurrent layer and the weights of the preceding vectors in the feature matrix, to obtain a weighted feature matrix;
    inputting the weighted feature matrix into a fully connected layer to obtain a probability matrix through full connection, wherein the dimension of the probability matrix corresponds to the number of types of sound events;
    determining, according to the probability matrix, the target sound event that occurred.
  16. The computer-readable storage medium according to claim 15, wherein
    the feature extraction network comprises convolution layers under a gated linear activation function and max pooling layers, at least one group of each being provided, with the convolution layers under the gated linear activation function and the max pooling layers arranged alternately in sequence;
    the inputting of the sound source matrix into the trained feature extraction network to extract the feature matrix of sound events specifically comprises:
    inputting the sound source matrix into the convolution layer under the gated linear activation function to perform a convolution operation, and applying gating to obtain an intermediate matrix;
    inputting the intermediate matrix output by the convolution layer under the gated linear activation function into the max pooling layer for dimension reduction, to output the feature matrix.
  17. The computer-readable storage medium according to claim 16, wherein the inputting of the sound source matrix into the convolution layer under the gated linear activation function for convolution and gating to obtain the intermediate matrix is specifically realized by the following formula:
    Y = (W*X + b) ☉ sigmoid(V*X + c)
    wherein Y is the intermediate matrix, W and V are convolution kernels of the same number and size, b and c are offsets obtained through training, and X is the sound source matrix.
  18. The computer-readable storage medium according to claim 15, wherein the weight matrix comprises a first weight matrix and a second weight matrix, and the inputting of the feature matrix into the trained weight-gated recurrent layer and the weighting of the corresponding succeeding vectors in the feature matrix according to the weight matrix of the weight-gated recurrent layer and the weights of the preceding vectors, to obtain the weighted feature matrix, specifically comprises:
    obtaining the activation value corresponding to each vector in the feature matrix by the following formula:
    h_t = g(Y*x_t + ω*U*h_{t-1} + b)
    wherein g is the activation function; h_t is the activation value corresponding to x_t, h_{t-1} is the activation value corresponding to x_{t-1}, and x_t is the vector of the feature matrix at time t; Y is the first weight matrix, U is the second weight matrix, ω is the weight applied to x_{t-1} when computing the activation value of x_t, and b and c are offsets;
    concatenating the activation values to obtain the weighted feature matrix.
  19. The computer-readable storage medium according to claim 15, wherein after the weighted feature matrix is fully connected to obtain the probability matrix, the computer-readable instructions, when executed by the processor, cause the processor to further perform the following step:
    classifying the sound events corresponding to the probability matrix by a softmax function, the number of classes being equal to the dimension of the probability matrix.
  20. The computer-readable storage medium according to claim 15, wherein the extracting of the sound source matrix from the sound source data specifically comprises:
    segmenting the sound source data according to a frame length and a frame shift, and extracting audio frames;
    extracting each audio frame as an audio vector in the FBANK format;
    concatenating the audio vectors to obtain the sound source matrix.
PCT/CN2021/083752 2020-10-29 2021-03-30 Method and apparatus for detecting multiple sound events, computer device, and storage medium WO2022001245A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011186597.5 2020-10-29
CN202011186597.5A CN112309405A (zh) Method and apparatus for detecting multiple sound events, computer device, and storage medium

Publications (1)

Publication Number Publication Date
WO2022001245A1 (zh) Method and apparatus for detecting multiple sound events, computer device, and storage medium

Family

ID=74332357

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/083752 WO2022001245A1 (zh) Method and apparatus for detecting multiple sound events, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN112309405A (zh)
WO (1) WO2022001245A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112309405A (zh) * 2020-10-29 2021-02-02 Ping An Technology (Shenzhen) Co., Ltd. Method and apparatus for detecting multiple sound events, computer device, and storage medium


Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107229967B (zh) * 2016-08-22 2021-06-15 Xilinx, Inc. FPGA-based hardware accelerator and method for implementing a sparse GRU neural network
US20180129937A1 (en) * 2016-11-04 2018-05-10 Salesforce.Com, Inc. Quasi-recurrent neural network
CN109559735B (zh) * 2018-10-11 2023-10-27 Ping An Technology (Shenzhen) Co., Ltd. Neural-network-based speech recognition method, terminal device, and medium
CN110263304B (zh) * 2018-11-29 2023-01-10 Tencent Technology (Shenzhen) Co., Ltd. Sentence encoding method, sentence decoding method, apparatus, storage medium, and device
CN110097089A (zh) * 2019-04-05 2019-08-06 South China University of Technology Document-level sentiment classification method based on an attention-combined neural network
CN110176248B (zh) * 2019-05-23 2020-12-22 Guangxi Jiaoke Group Co., Ltd. Road sound recognition method and system, computer device, and readable storage medium
CN111046172B (zh) * 2019-10-30 2024-04-12 Beijing QIYI Century Science & Technology Co., Ltd. Public opinion analysis method, apparatus, device, and storage medium
CN110992979B (zh) * 2019-11-29 2022-04-08 Beijing Sogou Technology Development Co., Ltd. Detection method and apparatus, and electronic device

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106248801A (zh) * 2016-09-06 2016-12-21 Harbin Institute of Technology Rail crack detection method based on the probability of multiple acoustic emission events
CN106356052A (zh) * 2016-10-17 2017-01-25 Tencent Technology (Shenzhen) Co., Ltd. Speech synthesis method and apparatus
CN108648748A (zh) * 2018-03-30 2018-10-12 Shenyang University of Technology Acoustic event detection method in a hospital noise environment
CN110751955A (zh) * 2019-09-23 2020-02-04 Shandong University Sound event classification method and system based on dynamic selection from a time-frequency matrix
US20200105293A1 (en) * 2018-09-28 2020-04-02 Cirrus Logic International Semiconductor Ltd. Sound event detection
CN111723874A (zh) * 2020-07-02 2020-09-29 South China University of Technology Acoustic scene classification method based on broad and deep neural networks
CN112309405A (zh) * 2020-10-29 2021-02-02 Ping An Technology (Shenzhen) Co., Ltd. Method and apparatus for detecting multiple sound events, computer device, and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WANG JINJIA;CUI LIN;YANG QIAN;JI SHAONAN: "General Audio Tagging Based on Attention-Gated Convolutional Recurrent Neural Network", JOURNAL OF FUDAN UNIVERSITY(NATURAL SCIENCE), vol. 59, no. 3, 15 June 2020 (2020-06-15), pages 360 - 367, XP055884007, ISSN: 0427-7104, DOI: 10.15943/j.cnki.fdxb-jns.2020.03.016 *
YANG DE-JU;MA LIANG-LI;TAN LIN-SHAN;PEI JING-JING: "End-to-end Speech Recognition based on Gated Convolutional Neural Network and CTC", COMPUTER ENGINEERING AND DESIGN, vol. 41, no. 9, 16 September 2020 (2020-09-16), pages 2650 - 2654, XP055884012, ISSN: 1000-7024, DOI: 10.16208/j.issn1000-7024.2020.09.037 *

Also Published As

Publication number Publication date
CN112309405A (zh) 2021-02-02

Similar Documents

Publication Publication Date Title
US11996091B2 (en) Mixed speech recognition method and apparatus, and computer-readable storage medium
  • CN111489290B (zh) Face image super-resolution reconstruction method, apparatus, and terminal device
  • CN112749758B (zh) Image processing method, neural network training method, apparatus, device, and medium
  • CN109840413B (zh) Phishing website detection method and apparatus
  • WO2020140374A1 (zh) Voice data processing method, apparatus, device, and storage medium
  • WO2020164272A1 (zh) Method and apparatus for identifying internet access devices, storage medium, and computer device
  • CN111552633A (zh) Method and apparatus for testing abnormal interface calls, computer device, and storage medium
  • WO2020168754A1 (zh) Prediction-model-based performance prediction method, apparatus, and storage medium
  • CN111125529A (zh) Product matching method, apparatus, computer device, and storage medium
  • WO2022001245A1 (zh) Method and apparatus for detecting multiple sound events, computer device, and storage medium
  • WO2022052633A1 (zh) Text backup method, apparatus, device, and computer-readable storage medium
  • CN110705282A (zh) Keyword extraction method, apparatus, storage medium, and electronic device
  • WO2021008356A1 (zh) Switching control method and apparatus for multi-model gate controllers, electronic device, and storage medium
  • WO2020258509A1 (zh) Method and apparatus for isolating abnormal access of terminal devices
  • CN117058421A (zh) Multi-head-model-based image keypoint detection method, system, platform, and medium
  • CN114241411B (zh) Counting model processing method and apparatus based on object detection, and computer device
  • US11886590B2 (en) Emulator detection using user agent and device model learning
  • CN112071331B (zh) Voice file repair method, apparatus, computer device, and storage medium
  • CN115240647A (zh) Sound event detection method, apparatus, electronic device, and storage medium
  • CN112596846A (zh) Method and apparatus for determining interface display content, terminal device, and storage medium
  • CN110929033A (zh) Long text classification method, apparatus, computer device, and storage medium
  • EP4199456A1 (en) Traffic classification method and apparatus, training method and apparatus, device and medium
  • CN110992067B (zh) Message push method, apparatus, computer device, and storage medium
  • CN110875874B (zh) Electronic red packet detection method, apparatus, and mobile terminal
  • CN118057525A (zh) Speech recognition method and apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21834375

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21834375

Country of ref document: EP

Kind code of ref document: A1