WO2022116420A1 - Voice event detection method and apparatus, electronic device, and computer storage medium - Google Patents

Voice event detection method and apparatus, electronic device, and computer storage medium

Info

Publication number
WO2022116420A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
classification model
feature
speech
event
Prior art date
Application number
PCT/CN2021/082872
Other languages
English (en)
Chinese (zh)
Inventor
罗剑
王健宗
程宁
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2022116420A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/45 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular, to a voice event detection method, apparatus, electronic device, and computer-readable storage medium.
  • Voice event detection refers to detecting events such as human voices, singing, tapping, dogs barking, and car horns in audio, and marking their start time and end time.
  • Traditional speech event detection methods include methods based on signal processing and methods based on hidden Markov models.
  • the inventor realized that the occurrence of events is often highly uncertain and that it is difficult to collect large numbers of speech event samples, so the accuracy of traditional speech event detection methods is low. At the same time, for a given speech event, a model that makes frame-level judgments may judge different frames of the same event differently, making the event detection result unstable.
  • a voice event detection method provided by this application includes:
  • obtaining the audio to be detected, and performing acoustic feature extraction on the audio to be detected to obtain a speech frame feature sequence;
  • performing feature analysis on the speech frame feature sequence by using a classification model based on a self-attention mechanism to obtain a hidden state sequence to be identified;
  • performing event identification on the hidden state sequence to be identified by using the classification model to obtain an event label sequence; and
  • smoothing the event label sequence to obtain a speech event detection result corresponding to the speech to be detected.
  • the present application also provides a voice event detection device, the device comprising:
  • a feature extraction module used to obtain the audio to be detected, and perform acoustic feature extraction on the audio to be detected to obtain a speech frame feature sequence
  • the self-attention module is used to perform feature analysis on the speech frame feature sequence by using the classification model based on the self-attention mechanism to obtain the hidden state sequence to be identified;
  • an identification module configured to perform event identification on the hidden state sequence to be identified by using the classification model to obtain an event label sequence
  • the smoothing module is used for smoothing the event label sequence to obtain a speech event detection result corresponding to the speech to be detected.
  • the present application also provides an electronic device, the electronic device comprising:
  • the processor executes the computer program stored in the memory to realize the following steps:
  • obtaining the audio to be detected, and performing acoustic feature extraction on the audio to be detected to obtain a speech frame feature sequence;
  • performing feature analysis on the speech frame feature sequence by using a classification model based on a self-attention mechanism to obtain a hidden state sequence to be identified;
  • performing event identification on the hidden state sequence to be identified by using the classification model to obtain an event label sequence; and
  • smoothing the event label sequence to obtain a speech event detection result corresponding to the speech to be detected.
  • the present application also provides a computer-readable storage medium, including a storage data area and a storage program area, wherein the storage data area stores created data and the storage program area stores a computer program; the computer program, when executed by a processor, implements the following steps:
  • obtaining the audio to be detected, and performing acoustic feature extraction on the audio to be detected to obtain a speech frame feature sequence;
  • performing feature analysis on the speech frame feature sequence by using a classification model based on a self-attention mechanism to obtain a hidden state sequence to be identified;
  • performing event identification on the hidden state sequence to be identified by using the classification model to obtain an event label sequence; and
  • smoothing the event label sequence to obtain a speech event detection result corresponding to the speech to be detected.
  • FIG. 1 is a schematic flowchart of a voice event detection method provided by an embodiment of the present application.
  • FIG. 2 is a schematic block diagram of a voice event detection apparatus provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of the internal structure of an electronic device implementing a voice event detection method provided by an embodiment of the present application.
  • the embodiment of the present application provides a voice event detection method.
  • the execution subject of the voice event detection method includes, but is not limited to, at least one of the electronic devices, such as a server and a terminal, that can be configured to execute the method provided by the embodiments of the present application.
  • the voice event detection method can be executed by software or hardware installed in a terminal device or a server device, and the software can be a blockchain platform.
  • the server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.
  • referring to FIG. 1, it is a schematic flowchart of a voice event detection method according to an embodiment of the present application.
  • the voice event detection method includes:
  • the audio to be detected is audio that includes various sound events, such as human voices, singing, percussion, dogs barking, car horns and other events.
  • the to-be-detected audio can be obtained from a database.
  • the audio to be detected can be obtained from a node of a blockchain.
  • in one embodiment, performing acoustic feature extraction on the to-be-detected audio to obtain a speech frame feature sequence includes:
  • Cepstral analysis is performed on the Mel spectrum to obtain a speech frame feature sequence corresponding to the audio to be detected.
  • the embodiment of the present application performs acoustic feature extraction on the audio to be detected.
  • the audio to be detected first needs to be framed (for example, every 10 milliseconds forms one frame); the Mel spectrum is then calculated for each speech frame, and cepstral analysis is performed to extract the acoustic features of the audio to be detected.
  • the Mel spectrum is a commonly used speech feature representation, which can effectively describe the basic information of speech and facilitate subsequent analysis and processing of speech.
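To make this step concrete, here is a minimal sketch of the framing, Mel spectrum, and cepstral analysis pipeline. It assumes the librosa library, a 16 kHz sample rate, a 10 ms hop, 64 Mel bands, and 20 cepstral coefficients; none of these specifics are fixed by the application, and the file name is hypothetical.

```python
# A minimal sketch of the acoustic feature extraction step: frame the audio,
# compute the Mel spectrum of each frame, then apply cepstral analysis.
import librosa
import numpy as np

def extract_frame_features(path: str, n_mfcc: int = 20) -> np.ndarray:
    y, sr = librosa.load(path, sr=16000)       # audio to be detected
    hop = int(0.010 * sr)                      # one frame every 10 milliseconds
    mel = librosa.feature.melspectrogram(      # Mel spectrum of each frame
        y=y, sr=sr, n_fft=int(0.025 * sr), hop_length=hop, n_mels=64)
    mfcc = librosa.feature.mfcc(               # cepstral analysis of the Mel spectrum
        S=librosa.power_to_db(mel), n_mfcc=n_mfcc)
    return mfcc.T                              # (num_frames, n_mfcc) feature sequence

features = extract_frame_features("audio_to_detect.wav")  # hypothetical file name
```

The resulting matrix has one row per 10 ms frame, i.e. the speech frame feature sequence consumed by the classification model.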
  • the classification model is a deep learning model that can recognize speech features and is used to detect and classify different sound events.
  • the classification model includes an input layer, a hidden layer and a fully connected layer.
  • the self-attention mechanism is a mechanism that attends to particular details according to the detection target, rather than analyzing the input globally.
  • feature vectors recalculated by the self-attention mechanism can fully take into account the contextual relationships within continuous audio.
  • the feature analysis is performed on the speech frame feature sequence using the classification model based on the self-attention mechanism to obtain the hidden state sequence to be identified, including:
  • the feature vector of the previous window, the feature vector of the current window, and the feature vector of the next window are combined through the input layer of the classification model to obtain a common speech feature;
  • the common speech feature is input into the hidden layer of the classification model, and the hidden state sequence to be recognized is obtained based on the self-attention mechanism.
  • the embodiment of the present application may divide the speech frame feature sequence into several windows, for example $X_t = \{x_t, x_{t+1}, \ldots, x_{t+T-1}\}$, where $t$ represents the frame sequence number, $T$ represents the length of the window, and $x_t$ represents a feature of length 1 (a single frame) in the speech frame feature sequence.
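A minimal sketch of this windowing and window concatenation follows, assuming the frame features from the previous sketch, an illustrative window length T = 10, and edge-padding by repetition (the application does not specify boundary handling):

```python
# A sketch of dividing the speech frame feature sequence into windows of length T
# and concatenating previous/current/next windows into a common speech feature.
import numpy as np

def make_window_inputs(features: np.ndarray, T: int = 10) -> np.ndarray:
    num_windows = len(features) // T
    windows = features[: num_windows * T].reshape(num_windows, T, -1)
    z = []
    for t in range(num_windows):
        prev = windows[max(t - 1, 0)]               # previous window (edge-padded)
        nxt = windows[min(t + 1, num_windows - 1)]  # next window (edge-padded)
        z.append(np.concatenate([prev, windows[t], nxt], axis=0))
    return np.stack(z)  # (num_windows, 3 * T, feature_dim) common speech features

z = make_window_inputs(features)  # 'features' from the previous sketch
```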
  • the hidden layer of the classification model described in the embodiments of the present application is a deep neural network composed of several layers of self-attention mechanism networks.
  • the common speech feature is input into the hidden layer of the classification model, and the hidden state sequence to be recognized is obtained based on the self-attention mechanism, including:
  • the common speech feature is input into the first layer of the self-attention mechanism network in the hidden layer of the classification model to obtain a first hidden state sequence; the first hidden state sequence is input into the second layer to obtain a second hidden state sequence; the second hidden state sequence is used as the input of the third layer of the self-attention mechanism network in the hidden layer of the classification model, and this calculation step is repeated until the last layer of the self-attention mechanism network in the hidden layer of the classification model is reached, yielding the hidden state sequence to be recognized.
  • if the common speech feature of the current window is $z_t$, the hidden state sequence calculated by the several layers of the self-attention mechanism network in the hidden layer is $o_t = D_L(D_{L-1}(\cdots D_1(z_t)))$, where $D_l$ is the $l$-th layer of the self-attention mechanism network in the hidden layer of the classification model, and the output of the previous layer $l-1$ serves as the input of the next layer of the self-attention mechanism network.
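As an illustrative sketch of a hidden layer built from several stacked self-attention networks, the following uses PyTorch's nn.MultiheadAttention as a stand-in for each layer D_l; the feature dimension, layer count, and head count are assumptions, and the application's actual layer structure may differ:

```python
# A sketch of the hidden layer as stacked self-attention networks D_1 ... D_L,
# each layer consuming the previous layer's output.
import torch
import torch.nn as nn

class SelfAttentionHidden(nn.Module):
    def __init__(self, dim: int = 20, num_layers: int = 4, heads: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in range(num_layers))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h = z                     # input: common speech feature z_t
        for attn in self.layers:
            h, _ = attn(h, h, h)  # self-attention: query = key = value = h
        return h                  # output: hidden state sequence o_t

hidden = SelfAttentionHidden(dim=20)
o = hidden(torch.randn(8, 30, 20))  # (batch, 3*T frames, feature dim)
```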
  • in one embodiment, performing event identification on the hidden state sequence to be identified by using the classification model to obtain an event label sequence includes:
  • the hidden state sequence to be identified is mapped to a multi-dimensional space vector through the fully connected layer of the classification model;
  • Probability calculation is performed on the multi-dimensional space vector by using a preset activation function to obtain an event label sequence.
  • the event label sequence consists of the probability values of the various sound events contained in each frame of speech in the audio to be detected.
  • the categories of sound events contained in a window may be denoted $y_t$; for example, $y_t$ can be an $N$-dimensional vector whose $i$-th component indicates whether the $i$-th of the $N$ sound event categories occurs in the window.
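A sketch of this event identification step: a fully connected layer maps each hidden state to an N-dimensional space vector, and an activation function turns it into per-event probabilities. The excerpt does not name the activation function; a sigmoid is assumed here so that several events can be active in the same window, and N_EVENTS and the dimension 20 are illustrative:

```python
# A sketch of event identification: fully connected layer + activation.
import torch
import torch.nn as nn

N_EVENTS = 8                  # assumed total number of sound event categories
fc = nn.Linear(20, N_EVENTS)  # fully connected layer over 20-dim hidden states

def predict_labels(o: torch.Tensor) -> torch.Tensor:
    """o: (batch, time, 20) hidden states -> (batch, time, N_EVENTS)
    event probabilities, i.e. the event label sequence."""
    return torch.sigmoid(fc(o))

labels = predict_labels(o)    # 'o' from the previous sketch
```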
  • in one embodiment, before using the classification model based on the self-attention mechanism to perform feature analysis on the speech frame feature sequence to obtain the hidden state sequence to be identified, the method further includes training the classification model:
  • the training sample set is input into the classification model to obtain the predicted label sequence
  • the training error of the predicted label sequence is calculated by using a preset loss function and a real event sequence, and the classification model is updated according to the training error to obtain the trained classification model.
  • the real event sequence $y_t$ described in the embodiment of the present application is obtained by labeling.
  • the predicted label sequence of the classification model is denoted $\hat{y}_t$. In this embodiment of the present application, a cross-entropy loss function can be used to calculate the training error (Loss) so as to update the classification model; training stops when the classification model converges, and the trained classification model is obtained.
  • the cross-entropy loss function may, for example, take the form $\mathrm{Loss} = -\sum_{t}\sum_{i=1}^{N}\left[y_{t,i}\log\hat{y}_{t,i} + (1-y_{t,i})\log(1-\hat{y}_{t,i})\right]$, where Loss is the training error, $N$ is the total number of categories of sound events, $y_t$ is the real event sequence, and $\hat{y}_t$ is the predicted label sequence.
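A minimal training-step sketch consistent with this loss, using PyTorch's binary cross-entropy over the N event categories; the end-to-end model and the optimizer wiring are assumptions carried over from the earlier sketches:

```python
# A sketch of one update step of the classification model.
import torch
import torch.nn as nn

criterion = nn.BCELoss()  # cross-entropy over the N event categories

def training_step(model, optimizer, z, y_true):
    """z: common speech features; y_true: real event sequence in {0, 1}^N."""
    y_pred = model(z)                 # predicted label sequence
    loss = criterion(y_pred, y_true)  # training error (Loss)
    optimizer.zero_grad()
    loss.backward()                   # backpropagate to update the model
    optimizer.step()
    return loss.item()
```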
  • the event label sequence is the result of frame-by-frame output, so there will inevitably be glitches and jitters in the result.
  • in the embodiment of the present application, a sequence matching network (SMN) is used to smooth the event label sequence along the time dimension, filling in sudden gaps in the event prediction and removing very brief glitches.
  • the smoothing of the event label sequence to obtain the speech event detection result corresponding to the speech to be detected includes:
  • the event label sequence is smoothed by using a preset sequence matching network to obtain a smooth event label sequence
  • according to the endpoints in the smooth event label sequence, the start time and end time of each event included in the to-be-detected speech are determined, and multiple event detection results are obtained;
  • the multiple event detection results are collected to obtain a voice event detection result corresponding to the to-be-detected voice.
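The application uses a sequence matching network for the smoothing; as a simple illustrative stand-in (not the application's SMN), the sketch below median-filters the frame-wise decisions and then reads each event's start and end times from the endpoints of the smoothed label sequence. The threshold, filter width, and frame duration are assumptions:

```python
# A sketch of smoothing the event label sequence and extracting endpoints.
import numpy as np
from scipy.ndimage import median_filter

def detect_events(probs: np.ndarray, frame_sec: float = 0.010,
                  threshold: float = 0.5, width: int = 5):
    """probs: (num_frames, N_EVENTS) event label sequence.
    Returns (event_index, start_time, end_time) detection results."""
    active = probs > threshold                    # frame-by-frame decisions
    smooth = median_filter(active.astype(int), size=(width, 1)).astype(bool)
    results = []
    for e in range(smooth.shape[1]):
        track = np.concatenate([[0], smooth[:, e].astype(int), [0]])
        edges = np.diff(track)                    # +1 at starts, -1 at ends
        starts, ends = np.where(edges == 1)[0], np.where(edges == -1)[0]
        for s, t in zip(starts, ends):
            results.append((e, s * frame_sec, t * frame_sec))  # endpoints -> times
    return results
```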
  • the classification model described in the embodiments of the present application can detect audio in real time, so that it can be applied to scenarios with high real-time requirements, such as sports competitions, live broadcasts and other fields, and has high practical application value.
  • a classification model based on a self-attention mechanism is used to perform feature analysis on the speech frame feature sequence.
  • the classification model combines the features of multiple windows to improve the accuracy of speech features.
  • because the classification model is based on a self-attention mechanism, the accuracy of event detection can be improved, and smoothing the event label sequence improves the stability of the event detection results. Therefore, the voice event detection method, apparatus, and computer-readable storage medium proposed in this application can improve the stability and accuracy of voice event detection.
  • referring to FIG. 2, it is a schematic block diagram of the voice event detection apparatus of the present application.
  • the voice event detection apparatus 100 described in this application can be installed in an electronic device. According to the implemented functions, the voice event detection apparatus may include a feature extraction module 101 , a self-attention module 102 , a recognition module 103 and a smoothing module 104 .
  • the modules described in this application may also be referred to as units, which refer to a series of computer program segments that can be executed by the processor of an electronic device and can perform fixed functions, and are stored in the memory of the electronic device.
  • each module/unit is as follows:
  • the feature extraction module 101 is configured to acquire the audio to be detected, and perform acoustic feature extraction on the audio to be detected to obtain a speech frame feature sequence.
  • the audio to be detected is audio that includes various sound events, such as human voices, singing, percussion, dogs barking, car horns and other events.
  • the to-be-detected audio can be obtained from a database.
  • the audio to be detected can be obtained from a node of a blockchain.
  • in one embodiment, when performing acoustic feature extraction on the to-be-detected audio to obtain a speech frame feature sequence, the feature extraction module 101 specifically performs the following operations:
  • Cepstral analysis is performed on the Mel spectrum to obtain a speech frame feature sequence corresponding to the audio to be detected.
  • the embodiment of the present application performs acoustic feature extraction on the audio to be detected.
  • the audio to be detected first needs to be framed (for example, every 10 milliseconds forms one frame); the Mel spectrum is then calculated for each speech frame, and cepstral analysis is performed to extract the acoustic features of the audio to be detected.
  • the Mel spectrum is a commonly used speech feature representation, which can effectively describe the basic information of speech and facilitate subsequent analysis and processing of speech.
  • the self-attention module 102 is configured to perform feature analysis on the speech frame feature sequence using a classification model based on the self-attention mechanism to obtain a hidden state sequence to be identified.
  • the classification model is a deep learning model that can recognize speech features and is used to detect and classify different sound events.
  • the classification model includes an input layer, a hidden layer and a fully connected layer.
  • the self-attention mechanism is a mechanism that attends to particular details according to the detection target, rather than analyzing the input globally.
  • feature vectors recalculated by the self-attention mechanism can fully take into account the contextual relationships within continuous audio.
  • the self-attention module 102 is specifically used for:
  • the feature vector of the previous window, the feature vector of the current window, and the feature vector of the next window are combined through the input layer of the classification model to obtain a common speech feature;
  • the common speech feature is input into the hidden layer of the classification model, and the hidden state sequence to be recognized is obtained based on the self-attention mechanism.
  • the embodiment of the present application may divide the speech frame feature sequence into several windows, for example $X_t = \{x_t, x_{t+1}, \ldots, x_{t+T-1}\}$, where $t$ represents the frame sequence number, $T$ represents the length of the window, and $x_t$ represents a feature of length 1 (a single frame) in the speech frame feature sequence.
  • the hidden layer of the classification model described in the embodiments of the present application is a deep neural network composed of several layers of self-attention mechanism networks.
  • the common speech feature is input into the hidden layer of the classification model, and the hidden state sequence to be recognized is obtained based on the self-attention mechanism, including:
  • the common speech feature is input into the first layer of the self-attention mechanism network in the hidden layer of the classification model to obtain a first hidden state sequence; the first hidden state sequence is input into the second layer to obtain a second hidden state sequence; the second hidden state sequence is used as the input of the third layer of the self-attention mechanism network in the hidden layer of the classification model, and this calculation step is repeated until the last layer of the self-attention mechanism network in the hidden layer of the classification model is reached, yielding the hidden state sequence to be recognized.
  • if the common speech feature of the current window is $z_t$, the hidden state sequence calculated by the several layers of the self-attention mechanism network in the hidden layer is $o_t = D_L(D_{L-1}(\cdots D_1(z_t)))$, where $D_l$ is the $l$-th layer of the self-attention mechanism network in the hidden layer of the classification model, and the output of the previous layer $l-1$ serves as the input of the next layer of the self-attention mechanism network.
  • the identifying module 103 is configured to use the classification model to perform event identification on the hidden state sequence to be identified to obtain an event label sequence.
  • the identification module 103 is specifically used for:
  • the hidden state sequence to be identified is mapped to a multi-dimensional space vector through the fully connected layer of the classification model;
  • Probability calculation is performed on the multi-dimensional space vector by using a preset activation function to obtain an event label sequence.
  • the event label sequence consists of the probability values of the various sound events contained in each frame of speech in the audio to be detected.
  • the categories of sound events contained in a window may be denoted $y_t$; for example, $y_t$ can be an $N$-dimensional vector whose $i$-th component indicates whether the $i$-th of the $N$ sound event categories occurs in the window.
  • in one embodiment, before the classification model based on the self-attention mechanism is used to perform feature analysis on the speech frame feature sequence to obtain the hidden state sequence to be identified, the classification model is also trained:
  • the training sample set is input into the classification model to obtain the predicted label sequence
  • the training error of the predicted label sequence is calculated by using a preset loss function and a real event sequence, and the classification model is updated according to the training error to obtain the trained classification model.
  • the real event sequence $y_t$ described in the embodiment of the present application is obtained by labeling.
  • the predicted label sequence of the classification model is denoted $\hat{y}_t$. In this embodiment of the present application, a cross-entropy loss function can be used to calculate the training error (Loss) so as to update the classification model; training stops when the classification model converges, and the trained classification model is obtained.
  • the cross-entropy loss function may, for example, take the form $\mathrm{Loss} = -\sum_{t}\sum_{i=1}^{N}\left[y_{t,i}\log\hat{y}_{t,i} + (1-y_{t,i})\log(1-\hat{y}_{t,i})\right]$, where Loss is the training error, $N$ is the total number of categories of sound events, $y_t$ is the real event sequence, and $\hat{y}_t$ is the predicted label sequence.
  • the smoothing module 104 is configured to perform smoothing processing on the event label sequence to obtain a speech event detection result corresponding to the speech to be detected.
  • the event label sequence is the result of frame-by-frame output, so there will inevitably be glitches and jitters in the result.
  • in the embodiment of the present application, a sequence matching network (SMN) is used to smooth the event label sequence along the time dimension, filling in sudden gaps in the event prediction and removing very brief glitches.
  • the smoothing module 104 is specifically used for:
  • the event label sequence is smoothed by using a preset sequence matching network to obtain a smooth event label sequence
  • according to the endpoints in the smooth event label sequence, the start time and end time of each event included in the to-be-detected speech are determined, and multiple event detection results are obtained;
  • the multiple event detection results are collected to obtain a voice event detection result corresponding to the to-be-detected voice.
  • the classification model described in the embodiments of the present application can detect audio in real time, so that it can be applied to scenarios with high real-time requirements, such as sports competitions, live broadcasts and other fields, and has high practical application value.
  • referring to FIG. 3, it is a schematic structural diagram of an electronic device implementing the voice event detection method of the present application.
  • the electronic device 1 may include a processor 10, a memory 11 and a bus, and may also include a computer program stored in the memory 11 and executable on the processor 10, such as a voice event detection program 12.
  • the memory 11 includes at least one type of readable storage medium, including flash memory, mobile hard disk, multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, magnetic disk, optical disc, etc.
  • the memory 11 may be an internal storage unit of the electronic device 1 , such as a mobile hard disk of the electronic device 1 .
  • in other embodiments, the memory 11 may also be an external storage device of the electronic device 1, such as a pluggable mobile hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash memory card (Flash Card) equipped on the electronic device 1.
  • the memory 11 may also include both an internal storage unit of the electronic device 1 and an external storage device.
  • the memory 11 can not only be used to store application software installed in the electronic device 1 and various types of data, such as the code of the voice event detection program 12, etc., but also can be used to temporarily store data that has been output or will be output.
  • in some embodiments, the processor 10 may be composed of integrated circuits, for example, a single packaged integrated circuit, or multiple packaged integrated circuits with the same or different functions, including one or more combinations of central processing units (CPU), microprocessors, digital processing chips, graphics processors, various control chips, and the like.
  • the processor 10 is the control core (Control Unit) of the electronic device; it connects the various components of the entire electronic device by means of various interfaces and lines, and executes the various functions of the electronic device 1 and processes data by running or executing the programs or modules stored in the memory 11 (for example, the voice event detection program) and calling the data stored in the memory 11.
  • the bus may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like.
  • the bus can be divided into address bus, data bus, control bus and so on.
  • the bus is configured to implement connection communication between the memory 11 and at least one processor 10 and the like.
  • FIG. 3 only shows an electronic device with certain components. Those skilled in the art can understand that the structure shown in FIG. 3 does not constitute a limitation on the electronic device 1, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
  • the electronic device 1 may also include a power supply (such as a battery) for powering the various components. Preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that the power management device implements functions such as charge management, discharge management, and power consumption management.
  • the power source may also include one or more DC or AC power sources, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and any other components.
  • the electronic device 1 may further include various sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be repeated here.
  • the electronic device 1 may also include a network interface; optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface, a Bluetooth interface, etc.), which is usually used to establish a communication connection between the electronic device 1 and other electronic devices.
  • the electronic device 1 may further include a user interface, and the user interface may be a display (Display), an input unit (eg, a keyboard (Keyboard)), optionally, the user interface may also be a standard wired interface or a wireless interface.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode, organic light-emitting diode) touch device, and the like.
  • the display may also be appropriately called a display screen or a display unit, which is used for displaying information processed in the electronic device 1 and for displaying a visualized user interface.
  • the voice event detection program 12 stored in the memory 11 of the electronic device 1 is a combination of multiple computer programs, which, when run in the processor 10, can realize:
  • obtaining the audio to be detected, and performing acoustic feature extraction on the audio to be detected to obtain a speech frame feature sequence;
  • performing feature analysis on the speech frame feature sequence by using a classification model based on a self-attention mechanism to obtain a hidden state sequence to be identified;
  • performing event identification on the hidden state sequence to be identified by using the classification model to obtain an event label sequence; and
  • smoothing the event label sequence to obtain a speech event detection result corresponding to the speech to be detected.
  • the modules/units integrated in the electronic device 1 may be stored in a computer-readable storage medium.
  • the computer-readable storage medium may be volatile or non-volatile.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, and a read-only memory (ROM, Read-Only Memory).
  • the present application also provides a computer-readable storage medium, where the readable storage medium stores a computer program, and the computer program, when executed by a processor of an electronic device, can realize:
  • obtaining the audio to be detected, and performing acoustic feature extraction on the audio to be detected to obtain a speech frame feature sequence;
  • performing feature analysis on the speech frame feature sequence by using a classification model based on a self-attention mechanism to obtain a hidden state sequence to be identified;
  • performing event identification on the hidden state sequence to be identified by using the classification model to obtain an event label sequence; and
  • smoothing the event label sequence to obtain a speech event detection result corresponding to the speech to be detected.
  • the computer-usable storage medium may mainly include a stored program area and a stored data area, wherein the stored program area may store an operating system, an application program required for at least one function, and the like, and the stored data area may store data created during use, etc.
  • modules described as separate components may or may not be physically separated, and components shown as modules may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional module in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware, or can be implemented in the form of hardware plus software function modules.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • Blockchain, essentially a decentralized database, is a series of data blocks associated with each other using cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of its information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Telephonic Communication Services (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Disclosed are a voice event detection method, a voice event detection apparatus (100), an electronic device (1), and a computer-readable storage medium, relating to artificial intelligence technology. The method comprises: obtaining audio to be detected and performing acoustic feature extraction on the audio to obtain a speech frame feature sequence (S1); performing feature analysis on the speech frame feature sequence by using a classification model based on a self-attention mechanism to obtain a hidden state sequence to be identified (S2); performing event identification on the hidden state sequence by using the classification model to obtain an event label sequence (S3); and performing smoothing processing on the event label sequence to obtain a voice event detection result corresponding to the speech under detection (S4). The present invention relates to blockchain technology, and the audio to be detected is stored in a blockchain node. The stability and accuracy of voice event detection are improved.
PCT/CN2021/082872 2020-12-01 2021-03-25 Voice event detection method and apparatus, electronic device, and computer storage medium WO2022116420A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011381842.8 2020-12-01
CN202011381842.8A CN112447189A (zh) 2020-12-01 Voice event detection method and apparatus, electronic device, and computer storage medium

Publications (1)

Publication Number Publication Date
WO2022116420A1 true WO2022116420A1 (fr) 2022-06-09

Family

ID=74740231

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/082872 WO2022116420A1 (fr) Voice event detection method and apparatus, electronic device, and computer storage medium

Country Status (2)

Country Link
CN (1) CN112447189A (fr)
WO (1) WO2022116420A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115623531A (zh) * 2022-11-29 2023-01-17 浙大城市学院 Hidden surveillance device discovery and localization method using radio-frequency signals
CN117316184A (zh) * 2023-12-01 2023-12-29 常州分音塔科技有限公司 Event detection feedback processing system based on audio signals

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112447189A (zh) * 2020-12-01 2021-03-05 平安科技(深圳)有限公司 Voice event detection method and apparatus, electronic device, and computer storage medium
CN113140226B (zh) * 2021-04-28 2022-06-21 桂林电子科技大学 Sound event annotation and recognition method using dual-token labels
CN113239872B (zh) * 2021-06-01 2024-03-19 平安科技(深圳)有限公司 Event recognition method, apparatus, device, and storage medium
CN113782051B (zh) * 2021-07-28 2024-03-19 北京中科模识科技有限公司 Broadcast effect classification method and system, electronic device, and storage medium
CN113707175B (zh) * 2021-08-24 2023-12-19 上海师范大学 Acoustic event detection system based on a feature decomposition classifier and adaptive post-processing
CN113724734B (zh) * 2021-08-31 2023-07-25 上海师范大学 Sound event detection method and apparatus, storage medium, and electronic apparatus
CN113555037B (zh) * 2021-09-18 2022-01-11 中国科学院自动化研究所 Tampered-region detection method and apparatus for tampered audio, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827804A (zh) * 2019-11-14 2020-02-21 福州大学 Sound event annotation method from audio frame sequences to event label sequences
CN110929092A (zh) * 2019-11-19 2020-03-27 国网江苏省电力工程咨询有限公司 Multi-event video description method based on a dynamic attention mechanism
US10783434B1 (en) * 2019-10-07 2020-09-22 Audio Analytic Ltd Method of training a sound event recognition system
CN111753549A (zh) * 2020-05-22 2020-10-09 江苏大学 Multi-modal emotional feature learning and recognition method based on an attention mechanism
CN112447189A (zh) * 2020-12-01 2021-03-05 平安科技(深圳)有限公司 Voice event detection method and apparatus, electronic device, and computer storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10783434B1 (en) * 2019-10-07 2020-09-22 Audio Analytic Ltd Method of training a sound event recognition system
CN110827804A (zh) * 2019-11-14 2020-02-21 福州大学 Sound event annotation method from audio frame sequences to event label sequences
CN110929092A (zh) * 2019-11-19 2020-03-27 国网江苏省电力工程咨询有限公司 Multi-event video description method based on a dynamic attention mechanism
CN111753549A (zh) * 2020-05-22 2020-10-09 江苏大学 Multi-modal emotional feature learning and recognition method based on an attention mechanism
CN112447189A (zh) * 2020-12-01 2021-03-05 平安科技(深圳)有限公司 Voice event detection method and apparatus, electronic device, and computer storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115623531A (zh) * 2022-11-29 2023-01-17 浙大城市学院 Hidden surveillance device discovery and localization method using radio-frequency signals
CN115623531B (zh) * 2022-11-29 2023-03-31 浙大城市学院 Hidden surveillance device discovery and localization method using radio-frequency signals
CN117316184A (zh) * 2023-12-01 2023-12-29 常州分音塔科技有限公司 Event detection feedback processing system based on audio signals
CN117316184B (zh) * 2023-12-01 2024-02-09 常州分音塔科技有限公司 Event detection feedback processing system based on audio signals

Also Published As

Publication number Publication date
CN112447189A (zh) 2021-03-05

Similar Documents

Publication Publication Date Title
WO2022116420A1 Voice event detection method and apparatus, electronic device, and computer storage medium
WO2021232594A1 Speech emotion recognition apparatus and method, electronic device, and storage medium
WO2022078346A1 Text intent recognition method and apparatus, electronic device, and storage medium
WO2022121176A1 Speech synthesis method and apparatus, electronic device, and readable storage medium
WO2022213465A1 Neural network-based image recognition method and apparatus, electronic device, and medium
CN112001175A Process automation method and apparatus, electronic device, and storage medium
CA3060822A1 Method and apparatus for acquiring tag information, electronic device, and computer-readable medium
WO2022105179A1 Biological feature image recognition method and apparatus, electronic device, and readable storage medium
CN112667805B Work order category determination method, apparatus, device, and medium
CN112527994A Emotion analysis method, apparatus, device, and readable storage medium
CN109947924B Dialogue system training data construction method and apparatus, electronic device, and storage medium
CN109299227B Speech recognition-based information query method and apparatus
WO2022227190A1 Speech synthesis method and apparatus, electronic device, and storage medium
WO2021189903A1 Audio-based user state identification method and apparatus, electronic device, and storage medium
WO2022194062A1 Disease marker detection method and apparatus, electronic device, and storage medium
WO2022178933A1 Context-based speech emotion detection method and apparatus, device, and storage medium
CN113205814B Voice data annotation method and apparatus, electronic device, and storage medium
WO2021208700A1 Voice data selection method and apparatus, electronic device, and storage medium
CN113434542B Data relationship identification method and apparatus, electronic device, and storage medium
CN113254814A Online course video tagging method and apparatus, electronic device, and medium
WO2022141867A1 Speech recognition method and apparatus, electronic device, and readable storage medium
CN113221990B Information entry method and apparatus, and related device
WO2022222228A1 Method and apparatus for recognizing bad textual information, electronic device, and storage medium
CN115631748A Speech dialogue-based emotion recognition method and apparatus, electronic device, and medium
CN111859985B AI customer service model testing method and apparatus, electronic device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21899464

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21899464

Country of ref document: EP

Kind code of ref document: A1