WO2021159756A1 - Method for response obligation detection based on multiple modes, and system and apparatus - Google Patents

Publication number
WO2021159756A1
WO2021159756A1 PCT/CN2020/125140 CN2020125140W WO2021159756A1 WO 2021159756 A1 WO2021159756 A1 WO 2021159756A1 CN 2020125140 W CN2020125140 W CN 2020125140W WO 2021159756 A1 WO2021159756 A1 WO 2021159756A1
Authority
WO
WIPO (PCT)
Prior art keywords
response
response obligation
obligation
detection model
training
Prior art date
Application number
PCT/CN2020/125140
Other languages
French (fr)
Chinese (zh)
Inventor
罗剑
王健宗
程宁
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021159756A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0638 Interactive procedures
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225 Feedback of the input speech

Definitions

  • the present invention relates to the technical field of speech recognition in artificial intelligence, and in particular to a method, system, device and storage medium for detecting response obligations based on multimodality.
  • Response Obligation Detection (ROD)
  • the dialogue system is set to respond to every sentence detected.
  • some specific sentences do not need to be answered, such as talking to oneself, a public statement, or a sentence spoken after the addressee has changed.
  • response obligation detection is widely used. Its purpose is to distinguish whether it is necessary to reply to the detected sentence, so as to improve user experience and complete more natural and effective dialogue interactions.
  • the user needs to use a specific keyword, similar to an input command, to wake up the system (such as Xiao Ai, Siri, etc.) before the system will reply to the detected sentence.
  • This approach requires users to know the wake-up keywords in advance, is relatively rigid to use, and is not suitable for first-time use by large user groups.
  • the sentences in the environment where the dialogue system is used (that is, the target domain) often differ from the sentences in the system's training database (that is, the source domain).
  • as a result, the sentences that need a response cannot be correctly identified in the application scenario. For example, the model may be trained on a corpus recorded under relatively quiet conditions, while actual applications may contain various background noises, which causes the system to fail to perform speech recognition correctly.
  • the present invention provides a multi-modal response obligation detection method, system, electronic device and computer storage medium, whose main purpose is to solve the low efficiency and poor quality of existing response obligation detection methods.
  • the present invention provides a multi-modality-based response obligation detection method, which includes the following steps:
  • obtain training data samples, and save the training data samples to a training sample data set;
  • use the training sample data set to train a preset response obligation detection model so that the response obligation detection model achieves a preset accuracy; wherein the response obligation detection model is used to perform acoustic feature extraction and semantic feature extraction on input data information, and to perform response obligation detection on the input data information according to the extracted acoustic features and semantic features;
  • the trained response obligation detection model is used to detect the target domain data to be detected to determine whether the system needs to respond to the target domain data to be detected.
  • the present invention also provides a multi-modal response obligation detection system, which includes:
  • the sample set establishment unit is used to obtain training data samples and save the training data samples to the training sample data set;
  • the model training unit is configured to use the training sample data set to train a preset response obligation detection model, so that the response obligation detection model achieves a preset accuracy; wherein the response obligation detection model is used to perform acoustic feature extraction and semantic feature extraction on input data information, and to perform response obligation detection on the input data information according to the extracted acoustic features and semantic features;
  • the model application unit is configured to use the trained response obligation detection model to detect the target domain data to be detected to determine whether the system needs to respond to the target domain data to be detected.
  • the present invention also provides an electronic device including a memory, a processor, and a multi-modal response obligation detection program that is stored in the memory and runs on the processor; when the multi-modal response obligation detection program is executed by the processor, the following steps are implemented:
  • obtain training data samples, and save the training data samples to a training sample data set;
  • use the training sample data set to train a preset response obligation detection model so that the response obligation detection model achieves a preset accuracy; wherein the response obligation detection model is used to perform acoustic feature extraction and semantic feature extraction on input data information, and to perform response obligation detection on the input data information according to the extracted acoustic features and semantic features;
  • the trained response obligation detection model is used to detect the target domain data to be detected to determine whether the system needs to respond to the target domain data to be detected.
  • the present invention also provides a computer-readable storage medium in which a multi-modal response obligation detection program is stored.
  • when the program is executed by a processor, the steps of the above-mentioned multi-modal response obligation detection method are implemented:
  • obtain training data samples, and save the training data samples to a training sample data set;
  • use the training sample data set to train a preset response obligation detection model so that the response obligation detection model achieves a preset accuracy; wherein the response obligation detection model is used to perform acoustic feature extraction and semantic feature extraction on input data information, and to perform response obligation detection on the input data information according to the extracted acoustic features and semantic features;
  • the trained response obligation detection model is used to detect the target domain data to be detected to determine whether the system needs to respond to the target domain data to be detected.
  • the multi-modal response obligation detection method, electronic device, and computer-readable storage medium proposed in the present invention design a response obligation detection model, which is a multi-modal fusion algorithm based on voice features and semantic information. Embedded in an automatic dialogue system, it realizes response obligation detection in the dialogue. Unlike traditional response obligation detection, this algorithm attends to the semantic information of the received sentence in addition to the speech signal: automatic speech recognition converts the speech signal into text, and semantic understanding is performed on the text. When judging whether the received sentence needs a reply, the acoustic characteristics and the semantic information of the sample are considered together.
  • the present invention proposes to use an adversarial network to reduce the difference in feature distribution between the target domain and the source domain, and at the same time to use self-supervised learning, with the consistency of the two modalities as the learning target, to further enhance the domain adaptability of the features. That is, the model detects whether two features from different modalities were extracted from the same sample and uses the prediction result as part of the loss function, which supervises the model in learning to understand semantic information and improves its accuracy.
  • FIG. 1 is a flowchart of a preferred embodiment of a method for detecting a response obligation based on a multi-modality according to an embodiment of the present invention
  • FIG. 2 is a schematic structural diagram of a preferred embodiment of an electronic device according to an embodiment of the present invention.
  • Fig. 3 is a schematic diagram of internal logic of a multi-modal response obligation detection program according to an embodiment of the present invention.
  • FIG. 1 shows the flow of the multi-modality-based response obligation detection method provided according to the present invention.
  • the multi-modal response obligation detection method provided by the present invention includes:
  • S110 Obtain a training data sample, and save the training data sample to a training sample data set.
  • the training data samples are historical data that have been reviewed by a technician and marked with the corresponding response labels; they serve as training data for the subsequent training and use of the response obligation detection model. For example, a training data sample can be a segment of historical voice information that a technician has marked with a response label (such as reply or no reply).
  • the corresponding historical data can be obtained from the two data domains of the target domain and the source domain as the training data sample.
  • the training data samples include target domain data samples and source domain data samples.
  • the target domain data samples are sentences in the real environment used by the dialogue system, and the source domain data samples are sentences in a traditional preset training database.
  • the present invention uses both the target domain data samples and the source domain data samples to train the purpose-built response obligation detection model described below, thereby significantly improving the recognition accuracy of the model.
  • the training sample data set can be stored in a node of the blockchain.
  • S120 Use the training sample data set to train a preset response obligation detection model so that the response obligation detection model achieves a corresponding preset accuracy; wherein the response obligation detection model is used to perform acoustic feature extraction and semantic feature extraction on input data information, and to perform response obligation detection and labeling on the input data information based on the extracted acoustic features and semantic features.
  • the response obligation detection model mainly includes a multi-modal fusion module, which is used to perform acoustic information feature extraction and semantic feature extraction on source domain data samples and target domain data samples in the training sample data set.
  • the acoustic features are extracted using Mel-frequency cepstral coefficients (MFCC) or perceptual linear prediction (PLP), both of which are commonly used acoustic feature extraction methods.
  • these methods extract the short-time frequency-domain features of the speech signal and thereby obtain combined time-frequency information about the sample, which is an important feature for distinguishing phonemes. Since MFCC and PLP are common techniques for acoustic feature extraction, their specific data processing is not repeated here.
  • the original signal (training data sample) can first be divided into frames.
  • for example, every 20 ms of signal can be treated as one frame within which the speech signal is regarded as a stationary time series, so that frequency-domain information can be extracted from the signal in that interval.
  • commonly used feature extraction methods include MFCC and PLP, both of which model the human auditory system. Generally speaking, PLP is more robust to noise, while MFCC is faster to compute.
  • the specific features to adopt can be selected according to the business scenario.
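The framing step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the 16 kHz sample rate, the non-overlapping frames, the Hamming window, and the function names `frame_signal` and `frame_spectra` are all assumptions made for the example. A full MFCC or PLP pipeline would add filter banks and further steps on top of these per-frame spectra.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=20.0):
    """Split a 1-D signal into consecutive, non-overlapping frames of frame_ms milliseconds."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    n_frames = len(signal) // frame_len  # the incomplete tail frame is dropped
    return np.reshape(signal[: n_frames * frame_len], (n_frames, frame_len))

def frame_spectra(frames):
    """Magnitude spectrum of each frame after applying a Hamming window (an assumed choice)."""
    window = np.hamming(frames.shape[1])
    return np.abs(np.fft.rfft(frames * window, axis=1))

# Hypothetical input: one second of a 1 kHz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 1000.0 * t)

frames = frame_signal(signal, sample_rate)   # 50 frames of 320 samples each
spectra = frame_spectra(frames)              # one short-time spectrum per frame
```

Each row of `spectra` is the frequency-domain view of one 20 ms frame, which is the raw material that MFCC or PLP would then compress into a compact feature vector.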
  • an automatic speech recognition (ASR) network is used to process the input data information or the acoustic features to obtain the semantic features of the input data information.
  • automatic speech recognition mainly includes two parts: acoustic model processing and decoding search processing.
  • to improve the recognition rate, most acoustic models in current use are end-to-end models; the decoding search part includes the classic connectionist temporal classification (CTC) method or the current mainstream RNN-T and Transformer networks.
  • because speech recognition is an existing technology with many specific implementations, and the present invention mainly uses it to obtain the required semantic features, the specific data processing is not repeated here.
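The text names CTC without describing it. As a small illustration of how a CTC-style decoder turns per-frame predictions into a label sequence, the sketch below applies the standard greedy collapsing rule: merge consecutive repeats, then drop blanks. The function name, the integer label ids, and the use of `0` as the blank symbol are assumptions for this example; the patent does not specify its decoder.

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse per-frame label ids: merge consecutive repeats, then remove blanks."""
    decoded = []
    previous = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame and is not blank.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded

# Hypothetical per-frame argmax output; a blank between the two 3s keeps both of them.
tokens = ctc_greedy_decode([0, 3, 3, 0, 3, 5, 5, 0])
```

The resulting token ids would then be mapped to characters or words, giving the text on which semantic features are computed.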
  • the response obligation detection model provided by the present invention also includes an adversarial network module, which performs adversarial training on the target domain data samples and the source domain data samples to improve the multi-modal fusion module's ability to extract acoustic and semantic features.
  • the adversarial network module includes a first adversarial network and a second adversarial network. In the process of using the training sample data set to train the preset response obligation detection model:
  • the first adversarial network performs adversarial training on the target domain acoustic features and source domain acoustic features extracted by the response obligation detection model, so that the acoustic feature extraction accuracy of the response obligation detection model reaches a preset accuracy;
  • the second adversarial network performs adversarial training on the target domain semantic features and source domain semantic features extracted by the response obligation detection model, so that the semantic feature extraction accuracy of the response obligation detection model reaches a preset accuracy.
  • the present invention uses the above-mentioned multi-modal fusion adversarial network.
  • the domain classification loss of the domain classifier can be calculated on the source domain and the target domain respectively. This prevents the domain classifier from focusing only on the less robust modality during optimization, which in turn improves the feature extraction accuracy of the model.
  • the loss function of the domain classifier is:
  • F_m and D_m denote the feature matrix and the domain classifier, respectively, for the target domain and the source domain, and d is the domain label, which indicates whether the current sample belongs to the target domain or the source domain.
  • the domain classifier updates the network parameters by minimizing the domain classification loss L_d, and the final output D_m(F_m(x)) is the domain that the domain classifier predicts for the input data.
  • the label classifier minimizes the label classification loss L_y, thereby improving the model's ability to predict sample labels.
  • the domain classification loss L_d can be maximized with respect to the feature extractor so that the extracted features are as relevant as possible to judging the response obligation of a sentence and independent of the specific domain; that is, when the model makes a judgment it is not affected by a change of sample domain and focuses only on whether the sample itself needs a response. The two classifiers iterate continuously to reduce the influence of the domain on the recognition of response obligations.
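The domain classification loss formula itself did not survive in this text. One plausible form, assuming the usual binary cross-entropy over the domain label summed across modalities, is sketched below. The summation over modalities m and the (1 - d) term are assumptions, not taken from the source; only the symbols F_m, D_m, d and the sample set (S, T) come from the surrounding description.

```latex
L_d = \sum_{m} \sum_{x \in (S,T)}
      \Bigl[ -d \,\log D_m\bigl(F_m(x)\bigr)
             - (1 - d)\,\log\bigl(1 - D_m(F_m(x))\bigr) \Bigr]
```

Under this reading, the domain classifier D_m is updated to minimize L_d while the feature extractor is updated to maximize it, which matches the min-max behaviour described in the text.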
  • the response obligation detection model also includes a total classifier network, and the total classification network is used to calculate the final response obligation probability according to the acoustic feature and the semantic feature.
  • the acoustic response obligation probability and the semantic response obligation probability are calculated from the acoustic feature and the semantic feature respectively, and the final response obligation probability is then calculated from the acoustic response obligation probability and the semantic response obligation probability;
  • the loss function for calculating the probability of the acoustic response obligation is:
  • P(x1) is the acoustic response obligation probability
  • y is the true value (label) of the training data sample
  • {S} is the training sample data set
  • x1 is the training data sample in {S} used to calculate the acoustic response obligation probability
  • the loss function for calculating the probability of the semantic response obligation is:
  • P(x2) is the semantic response obligation probability
  • y is the true value (label) of the training data sample
  • {S} is the training sample data set
  • x2 is the training data sample in {S} used to calculate the semantic response obligation probability
  • the loss function for calculating the probability of the final response obligation is:
  • after the multimodal fusion module extracts the acoustic features and the semantic features, the acoustic features calculated by MFCC or PLP are input into the deep learning network (RNN/CNN/Transducer, etc., that is, the total classifier network) to calculate the probability P(x1) that the sentence needs a reply and the classification loss L_{y-speech}.
  • the deep learning network (that is, the total classifier network) can be modeled as a binary classifier, with labels 0 and 1 marking whether a response is required. When data information is input, the value output by the network is the probability that a response is required.
  • the classification loss of this part of the total classifier network can be obtained by the following formula
  • L_{y-speech} can be differentiated with respect to the parameters of the network model, and backpropagation is used to update those parameters to optimize the network model.
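The classification loss formula is not reproduced in this text. The sketch below assumes it is the standard binary cross-entropy between the predicted reply probability and the 0/1 label, and shows one gradient update for a single linear layer with a sigmoid output. The function names, the tiny data set, and the learning rate are hypothetical; the patent's actual network is an RNN/CNN/Transducer, which this toy layer only stands in for.

```python
import numpy as np

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Summed binary cross-entropy between predicted probabilities and 0/1 labels."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    labels = np.asarray(labels, dtype=float)
    return float(np.sum(-labels * np.log(probs) - (1.0 - labels) * np.log(1.0 - probs)))

def predict(weights, features):
    """Sigmoid output of a single linear layer: the probability that a reply is needed."""
    return 1.0 / (1.0 + np.exp(-(features @ weights)))

def gradient_step(weights, features, labels, learning_rate=0.1):
    """One gradient-descent update; for this layer the loss gradient is X^T (p - y)."""
    grad = features.T @ (predict(weights, features) - labels)
    return weights - learning_rate * grad

# Hypothetical acoustic feature vectors with labels (1 = reply, 0 = no reply).
features = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = np.array([1.0, 0.0])
weights = np.zeros(2)

loss_before = binary_cross_entropy(predict(weights, features), labels)
weights = gradient_step(weights, features, labels)
loss_after = binary_cross_entropy(predict(weights, features), labels)
```

In a deep network the same gradient is propagated backwards through every layer; the single-layer case is used here only because its gradient can be written in one line.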
  • the total classifier network is also used to process the semantic features extracted by the multimodal fusion module through word embedding.
  • word embedding is the general term for representation learning techniques for language models in natural language processing. Conceptually, it refers to embedding a high-dimensional space, whose dimension is the number of all words, into a continuous vector space of much lower dimension; each word or phrase is mapped to a vector over the real numbers to facilitate subsequent calculations.
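As a concrete picture of the mapping described above, the sketch below looks up rows of an embedding table by word id. The vocabulary, the embedding dimension of 8, and the random initialisation are purely illustrative assumptions; in practice the table is learned during training.

```python
import numpy as np

# Hypothetical toy vocabulary; index 0 is reserved for unknown words.
vocab = {"<unk>": 0, "turn": 1, "on": 2, "the": 3, "light": 4}

# One row per word; randomly initialised here only for illustration.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 8))

def embed(tokens):
    """Map each token to its dense vector; unknown tokens share the <unk> row."""
    ids = [vocab.get(token, vocab["<unk>"]) for token in tokens]
    return embedding_table[ids]

sentence_vectors = embed(["turn", "on", "the", "light"])  # shape (4, 8)
```

The resulting sequence of vectors, one per word, is the kind of input a recurrent network can consume step by step.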
  • the result of the word embedding processing is used as the input of a recurrent neural network (LSTM/GRU, etc.) to calculate the probability P(x2) that the sentence needs a reply and the classification loss L_{y-semantic}.
  • LSTM/GRU, etc. are special recurrent neural networks. Compared with ordinary RNNs, this kind of network can perform long and short-term memory and deal with long-term dependencies.
  • the LSTM structure includes a forget gate responsible for determining the forgetting information of the current step, an input gate determining updated data, and an output gate determining output information; and GRU is a variant of LSTM, which combines the forget gate and the input gate into an update gate.
  • the resulting model is simpler than the standard LSTM model.
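The gate structure of a GRU can be made concrete with a single recurrence step. The sketch below follows the common textbook formulation with an update gate and a reset gate; bias terms are omitted for brevity, and the dimensions and random weights are assumptions chosen only so the example runs. Some references swap the roles of z and 1 - z in the final line, which is an equivalent convention.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, params):
    """One GRU step (biases omitted): returns the new hidden state."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x + Uz @ h_prev)             # update gate: how much of the state to rewrite
    r = sigmoid(Wr @ x + Ur @ h_prev)             # reset gate: how much past state feeds the candidate
    h_cand = np.tanh(Wh @ x + Uh @ (r * h_prev))  # candidate hidden state
    return (1.0 - z) * h_prev + z * h_cand        # blend old state and candidate

# Hypothetical sizes: 8-dimensional word vectors, 4-dimensional hidden state.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 4
params = [rng.normal(scale=0.1, size=(hidden_dim, d)) for d in (input_dim, hidden_dim) * 3]

h = np.zeros(hidden_dim)
for x in rng.normal(size=(5, input_dim)):  # a 5-word sentence of random vectors
    h = gru_step(x, h, params)
```

The final hidden state `h` summarises the whole sentence; a classifier layer on top of it would produce the semantic reply probability.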
  • when updating network parameters, LSTM/GRU also follows the backpropagation rule and uses the loss function to calculate the derivatives with respect to the model coefficients to update the model. It should be noted that the classification loss of this part of the total classifier network is calculated in the same way as L_{y-speech}, so it is not repeated here.
  • the label classifier when the label classifier finally predicts the sample, it needs to combine the probability and loss of acoustic features and semantic information to calculate the response obligation probability and loss L_y of the sample.
  • the commonly used calculation method is:
  • P(x) is the probability of the final response obligation
  • L_y is the loss function of the probability of the final response obligation
  • a + b = 1
  • a and b are the preset weights of the acoustic features and the semantic information, respectively.
  • the total classification network model obtains the final response obligation probability through the above series of processing, which can be used to perform response obligation detection on the input data information. If the final response obligation probability is 0, the system does not respond; if the final response obligation probability is 1, the system responds.
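The combining formula is missing from this text, but the definitions (weights a and b with a + b = 1) suggest a weighted sum of the two modality probabilities. The sketch below assumes exactly that, P(x) = a·P(x1) + b·P(x2), and adds a 0.5 decision threshold for turning the probability into a respond or do-not-respond decision. Both the weighted-sum form and the threshold are assumptions for illustration.

```python
def fuse_probabilities(p_acoustic, p_semantic, a=0.5):
    """Weighted fusion of the two modality probabilities, with b = 1 - a."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("weight a must lie in [0, 1] so that a + b = 1")
    b = 1.0 - a
    return a * p_acoustic + b * p_semantic

def should_respond(p_final, threshold=0.5):
    """Turn the fused probability into a binary decision (threshold is an assumed value)."""
    return p_final >= threshold

# Hypothetical branch outputs: the acoustic branch is confident, the semantic branch is not.
p_final = fuse_probabilities(0.9, 0.4, a=0.6)
decision = should_respond(p_final)
```

Raising a gives more say to acoustic cues; raising b favours the semantic reading of the sentence. The same weights could be applied to the two branch losses to form the overall label loss.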
  • the present invention also uses a modal consistency detector C, that is, a multi-modal self-supervised learning module, which takes as input the acoustic features and semantic information extracted from samples in the source and target domains, randomly pairs features of different modalities, and detects whether the label classification of the two modalities is consistent.
  • $L_C = \sum_{x \in (S,T)} -c \log C(F_0(x), \ldots, F_m(x))$
  • c indicates whether the input modalities are consistent.
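How the training pairs for the consistency detector might be built can be sketched as follows. The example pairs each acoustic feature vector either with the semantic vector of the same sample (c = 1) or with that of a different sample (c = 0), and evaluates the loss term as written above. The 50/50 sampling, the concatenation of the two vectors, and the function names are assumptions; the patent does not specify these details.

```python
import numpy as np

def make_consistency_pairs(acoustic, semantic, rng):
    """Pair acoustic and semantic features; label c = 1 if both come from the same sample."""
    n = len(acoustic)  # requires n >= 2 so that a mismatched partner exists
    pairs, labels = [], []
    for i in range(n):
        if rng.random() < 0.5:
            j, c = i, 1                                # matched pair from one sample
        else:
            j, c = int((i + rng.integers(1, n)) % n), 0  # partner from a different sample
        pairs.append(np.concatenate([acoustic[i], semantic[j]]))
        labels.append(c)
    return np.stack(pairs), np.array(labels)

def consistency_loss(predicted, labels, eps=1e-12):
    """Sum of -c * log C(...) over the pairs, following the formula in the text."""
    predicted = np.clip(np.asarray(predicted, dtype=float), eps, 1.0)
    return float(np.sum(-np.asarray(labels, dtype=float) * np.log(predicted)))

# Hypothetical features for 6 samples: 3-dim acoustic and 2-dim semantic vectors.
rng = np.random.default_rng(0)
acoustic = rng.normal(size=(6, 3))
semantic = rng.normal(size=(6, 2))
pairs, labels = make_consistency_pairs(acoustic, semantic, rng)
```

No manual annotation is needed for c, since it is known from how each pair was constructed; this is what makes the task self-supervised.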
  • S130 Use the trained response obligation detection model to detect the target domain data to be detected to determine whether the system needs to respond to the target domain data to be detected.
  • the accuracy of the response obligation detection model can be significantly improved through the loss functions, adversarial training and other methods described above. The trained response obligation detection model can then be used to test the target domain data to be detected.
  • the model can be applied to services related to automatic conversations, such as intelligent customer service systems.
  • the system often cannot see the user's facial expressions and can only judge from the voice whether the user is talking to the system. By detecting the response obligation, the customer service system can keep waiting while the user is talking to others. If the system does not detect a response obligation for a long time, it can also prompt the user to end the conversation.
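The waiting and prompting behaviour described for the customer service case can be sketched as a small decision function. The action names, the per-utterance boolean flags, and the limit of three consecutive utterances without a response obligation are hypothetical choices; the text only says "a long time".

```python
def customer_service_action(obligation_flags, max_idle_turns=3):
    """Choose an action from the per-utterance response obligation decisions so far."""
    if obligation_flags and obligation_flags[-1]:
        return "respond"  # the latest utterance is addressed to the system
    # Count how many of the most recent utterances carried no response obligation.
    idle_turns = 0
    for flag in reversed(obligation_flags):
        if flag:
            break
        idle_turns += 1
    return "prompt_end" if idle_turns >= max_idle_turns else "wait"

# Hypothetical history: the user asked something, then talked to someone else three times.
action = customer_service_action([True, False, False, False])
```

A time-based limit could replace the turn count if utterance timestamps are available.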
  • the model can also be applied to smart home devices, such as Tmall Genie, Xiao Ai, etc., to provide users with more user-friendly services. For example, users do not need to use specific keywords to wake up the system; they can speak their needs directly, and the system receives the instruction and serves the user.
  • the multi-modal response obligation detection method proposed by the present invention designs a response obligation detection model, which is a multi-modal fusion algorithm based on speech features and semantic information; embedded in an automatic dialogue system, the algorithm realizes response obligation detection in the dialogue.
  • unlike traditional response obligation detection, this algorithm attends to the semantic information of the received sentence in addition to the speech signal: automatic speech recognition converts the speech signal into text, and semantic understanding is performed on the text.
  • the present invention proposes to use an adversarial network to reduce the difference in feature distribution between the target domain and the source domain, and at the same time to use self-supervised learning, with the consistency of the two modalities as the learning target, to further enhance the domain adaptability of the features. That is, the model detects whether two features from different modalities were extracted from the same sample and uses the prediction result as part of the loss function, which supervises the model in learning to understand semantic information and improves its accuracy.
  • this application also provides a multi-modal response obligation detection system, which includes:
  • the sample set establishment unit is used to obtain training data samples and save the training data samples to the training sample data set;
  • the model training unit is configured to use the training sample data set to train a preset response obligation detection model, so that the response obligation detection model achieves a preset accuracy; wherein the response obligation detection model is used to perform acoustic feature extraction and semantic feature extraction on input data information, and to perform response obligation detection on the input data information according to the extracted acoustic features and semantic features;
  • the model application unit is configured to use the trained response obligation detection model to detect the target domain data to be detected to determine whether the system needs to respond to the target domain data to be detected.
  • the invention also provides an electronic device 70.
  • FIG. 2 is a schematic structural diagram of a preferred embodiment of an electronic device 70 provided by the present invention.
  • the electronic device 70 may be a terminal device with a computing function, such as a server, a smart phone, a tablet computer, a portable computer, a desktop computer, and the like.
  • the electronic device 70 includes a processor 71 and a memory 72.
  • the memory 72 includes at least one type of readable storage medium.
  • the at least one type of readable storage medium may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, a card-type memory, etc., and the above-mentioned readable storage medium may also be a volatile storage medium.
  • the readable storage medium may be an internal storage unit of the electronic device 70, such as a hard disk of the electronic device 70.
  • the readable storage medium may also be an external memory of the electronic device 70, such as a plug-in hard disk equipped on the electronic device 70, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), etc.
  • the readable storage medium of the memory 72 is generally used to store the multi-modal response duty detection program 73 installed in the electronic device 70.
  • the memory 72 can also be used to temporarily store data that has been output or will be output.
  • the processor 71 may be a central processing unit (CPU), a microprocessor or another data processing chip, which is used to run program code or process data stored in the memory 72, for example, to execute the multi-modal response obligation detection program 73.
  • the electronic device 70 is a terminal device such as a smart phone, a tablet computer, and a portable computer. In other embodiments, the electronic device 70 may be a server.
  • FIG. 2 only shows the electronic device 70 with the components 71-73, but it should be understood that it is not required to implement all the illustrated components, and more or fewer components may be implemented instead.
  • the electronic device 70 may also include a user interface.
  • the user interface may include an input unit such as a keyboard, a voice input device such as a microphone or another device with a voice recognition function, and a voice output device such as a speaker or earphones.
  • the user interface may also include a standard wired interface and a wireless interface.
  • the electronic device 70 may further include a display, and the display may also be referred to as a display screen or a display unit.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) touch device, and the like.
  • the display is used for displaying information processed in the electronic device 70 and for displaying a visualized user interface.
  • the electronic device 70 may also include a touch sensor.
  • the area provided by the touch sensor for the user to perform touch operations is called the touch area.
  • the touch sensor here may be a resistive touch sensor, a capacitive touch sensor, or the like.
  • the touch sensor includes not only a contact type touch sensor, but also a proximity type touch sensor and the like.
  • the touch sensor may be a single sensor, or may be, for example, a plurality of sensors arranged in an array.
  • the area of the display of the electronic device 70 may be the same as or different from the area of the touch sensor.
  • the display and the touch sensor are stacked to form a touch display screen. The device detects the touch operation triggered by the user based on the touch screen.
  • the electronic device 70 may also include a radio frequency (RF) circuit, a sensor, an audio circuit, etc., which will not be repeated here.
  • the memory 72, which is a computer storage medium, may include an operating system and the multi-modal response obligation detection program 73; when the processor 71 executes the multi-modal response obligation detection program 73 stored in the memory 72, the following steps are implemented:
  • obtaining training data samples, and saving the training data samples to a training sample data set;
  • using the training sample data set to train a preset response obligation detection model so that the response obligation detection model achieves a preset accuracy; wherein the response obligation detection model is used to perform acoustic feature extraction and semantic feature extraction on input data information, and to perform response obligation detection on the input data information according to the extracted acoustic features and semantic features;
  • the trained response obligation detection model is used to detect the target domain data to be detected to determine whether the system needs to respond to the target domain data to be detected.
  • FIG. 3 is an internal logic diagram of a multi-modal response obligation detection program according to an embodiment of the present invention.
  • the multi-modal response obligation detection program 73 can also be divided into one or more modules, which are stored in the memory 72 and executed by the processor 71 to carry out the present invention.
  • the module referred to in the present invention is a series of computer program instruction segments capable of completing a specific function.
  • FIG. 3 is a program module diagram of a preferred embodiment of the multi-modal response obligation detection program 73 in FIG. 2.
  • the multi-modal response obligation detection program 73 can be divided into: a sample set establishment module 74, a model training module 75, and a model application module 76.
  • the functions or operation steps implemented by modules 74-76 are similar to those described above and are not detailed again here. Illustratively:
  • the sample set establishment module 74 is configured to obtain training data samples and save the training data samples to the training sample data set;
  • the model training module 75 is configured to use the training sample data set to train a preset response obligation detection model, so that the response obligation detection model reaches a corresponding preset accuracy; wherein the response obligation detection model is used to perform acoustic feature extraction and semantic feature extraction on the input data information, and to perform response obligation detection on the input data information according to the extracted acoustic features and semantic features;
  • the model application module 76 is configured to use the trained response obligation detection model to detect the target domain data to be detected to determine whether the system needs to respond to the target domain data to be detected.
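The division into modules 74-76 can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names are hypothetical, each sample is reduced to one acoustic score and one semantic score, and a tiny logistic model stands in for the response obligation detection model, whose real architecture is not specified here.

```python
import math

def build_sample_set(raw_samples):
    """Module 74 (sketch): store labelled samples as (acoustic, semantic, label)."""
    return [(float(a), float(s), int(y)) for a, s, y in raw_samples]

def predict_proba(weights, acoustic, semantic):
    """Probability that a reply is needed, from a toy logistic model (assumption)."""
    w_a, w_s, bias = weights
    z = w_a * acoustic + w_s * semantic + bias
    return 1.0 / (1.0 + math.exp(-z))

def accuracy(weights, samples):
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = sum((predict_proba(weights, a, s) >= 0.5) == bool(y) for a, s, y in samples)
    return hits / len(samples)

def train_model(samples, target_accuracy=0.9, lr=0.5, max_epochs=1000):
    """Module 75 (sketch): train until the preset accuracy is reached."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(max_epochs):
        if accuracy(weights, samples) >= target_accuracy:
            break
        for a, s, y in samples:
            err = predict_proba(weights, a, s) - y  # gradient of the log loss w.r.t. z
            weights[0] -= lr * err * a
            weights[1] -= lr * err * s
            weights[2] -= lr * err
    return weights

def needs_response(weights, acoustic, semantic):
    """Module 76 (sketch): decide whether the system should reply."""
    return predict_proba(weights, acoustic, semantic) >= 0.5

samples = build_sample_set([(1.0, 1.0, 1), (0.9, 0.8, 1), (-1.0, -0.9, 0), (-0.8, -1.0, 0)])
weights = train_model(samples)
print(needs_response(weights, 1.0, 0.9), needs_response(weights, -1.0, -1.0))
```

The stopping condition in `train_model` mirrors the "preset accuracy" requirement; a real system would measure it on held-out data rather than on the training set.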
  • the present invention also provides a computer-readable storage medium in which a multi-modality-based response obligation detection program 73 is stored.
  • when the multi-modality-based response obligation detection program 73 is executed by a processor, the following operations are implemented:
  • obtaining training data samples, and saving the training data samples to a training sample data set;
  • using the training sample data set to train a preset response obligation detection model so that the response obligation detection model achieves a preset accuracy; wherein the response obligation detection model is used to perform acoustic feature extraction and semantic feature extraction on input data information, and to perform response obligation detection on the input data information according to the extracted acoustic features and semantic features;
  • the trained response obligation detection model is used to detect the target domain data to be detected to determine whether the system needs to respond to the target domain data to be detected.
  • a blockchain, essentially a decentralized database, is a series of data blocks linked by cryptographic methods; each data block contains a batch of network transaction information, which is used to verify the validity of that information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
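As a rough illustration of how data blocks can be linked by cryptographic methods so that tampering is detectable, the following Python sketch chains records with SHA-256 hashes. The hash function, the block fields, and the function names are assumptions chosen for the example; the patent does not specify a blockchain platform or block format.

```python
import hashlib
import json

GENESIS_PREV = "0" * 64  # placeholder "previous hash" for the first block (assumption)

def block_hash(index, prev_hash, payload):
    """Hash the block contents deterministically with SHA-256."""
    body = json.dumps({"index": index, "prev": prev_hash, "payload": payload},
                      sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def append_block(chain, payload):
    """Create a block that commits to the hash of the previous block."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_PREV
    index = len(chain)
    block = {"index": index, "prev": prev_hash, "payload": payload,
             "hash": block_hash(index, prev_hash, payload)}
    chain.append(block)
    return block

def is_valid(chain):
    """Recompute every hash and check that each block points at its predecessor."""
    prev_hash = GENESIS_PREV
    for i, block in enumerate(chain):
        if block["index"] != i or block["prev"] != prev_hash:
            return False
        if block["hash"] != block_hash(block["index"], block["prev"], block["payload"]):
            return False
        prev_hash = block["hash"]
    return True

chain = []
append_block(chain, {"sample_id": 1, "label": "reply"})
append_block(chain, {"sample_id": 2, "label": "no reply"})
print(is_valid(chain))
```

Changing any stored sample after the fact alters its recomputed hash, so `is_valid` returns `False`; this is the anti-counterfeiting property the text refers to.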

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

A method for response obligation detection based on multiple modes, relating to artificial intelligence, and comprising: acquiring a training data sample, and storing the training data sample in a training sample dataset (S110); using the training sample dataset to train a pre-set response obligation detection model so as to cause the response obligation detection model to attain a pre-set accuracy (S120); by means of the response obligation detection model, performing detection on target-domain data to undergo detection, so as to determine whether the system needs to respond to the target-domain data to undergo detection (S130). The described method further relates to blockchain technology, and the training sample dataset is stored in a blockchain, thus effectively solving the low efficiency and poor quality of existing response obligation detection methods.

Description

Multi-modality-based response obligation detection method, system and device
This application claims priority to Chinese patent application No. 2020109217599, filed with the China Patent Office on September 4, 2020 and entitled "Multi-modality-based response obligation detection method, system and device", the entire contents of which are incorporated herein by reference.
Technical field
The present invention relates to the technical field of speech recognition in artificial intelligence, and in particular to a multi-modality-based response obligation detection method, system, device and storage medium.
Background
Response Obligation Detection (ROD) is an important component of intelligent voice products such as automatic dialogue systems. In traditional voice dialogue interaction, the dialogue system is set to respond to every sentence it detects. However, the inventors realized that in natural human-to-human communication certain sentences do not need a reply, such as talking to oneself, public statements, or sentences spoken after the addressee has changed. For an automatic dialogue system, such sentences easily trigger unnecessary, erroneous replies, which lowers the accuracy of the system and degrades the user experience. To address this, response obligation detection is widely used; its purpose is to decide whether a detected sentence needs a reply, thereby improving the user experience and enabling more natural and effective dialogue interaction.
To improve accuracy, traditional dialogue systems strictly limit the conditions under which they respond. On the one hand, the user must wake the system with a specific keyword, similar to entering a command (for example, 小爱同学 (Xiao Ai) or Siri), before the system replies to a detected sentence. This requires the user to know the wake-up keyword in advance, is rather rigid, and is unsuitable for first-time use by a large user base. On the other hand, sentences in the environment where the dialogue system is used (the target domain) usually differ considerably from those in the system's training database (the source domain), so that although the system identifies response obligations accurately during training, it fails to correctly identify the sentences that need a response in real application scenarios. For example, the model may be trained on corpora recorded in relatively quiet conditions, whereas real applications may contain various kinds of background noise, which prevents the system from performing speech recognition correctly.
Because of these two limitations, it is difficult for traditional dialogue systems to provide natural and fluent dialogue interaction while maintaining high accuracy. In real business scenarios, a dialogue system should fully understand user intent across many scenarios and accurately judge whether a detected sentence needs a reply, while lowering the barrier to use by not requiring a wake-up keyword; only then can it communicate effectively with a large number of users. Otherwise it disrupts the coherence of the dialogue, degrades the user experience, and hampers business operations. Therefore, a response obligation detection algorithm with higher accuracy is urgently needed to improve the accuracy of automatic dialogue systems.
Technical problem
The present invention provides a multi-modality-based response obligation detection method, system, electronic device and computer storage medium, whose main purpose is to solve the low efficiency and poor quality of existing response obligation detection methods.
Technical solution
To achieve the above objective, the present invention provides a multi-modality-based response obligation detection method, comprising the following steps:
obtaining training data samples, and saving the training data samples to a training sample data set;
training a preset response obligation detection model using the training sample data set, so that the response obligation detection model reaches a preset accuracy; wherein the response obligation detection model is used to perform acoustic feature extraction and semantic feature extraction on input data information, and to perform response obligation detection on the input data information according to the extracted acoustic features and semantic features;
detecting target domain data to be detected using the trained response obligation detection model, to determine whether the system needs to respond to the target domain data to be detected.
In addition, the present invention also provides a multi-modality-based response obligation detection system, the system comprising:
a sample set establishment unit, configured to obtain training data samples and save the training data samples to a training sample data set;
a model training unit, configured to train a preset response obligation detection model using the training sample data set, so that the response obligation detection model reaches a preset accuracy; wherein the response obligation detection model is used to perform acoustic feature extraction and semantic feature extraction on input data information, and to perform response obligation detection on the input data information according to the extracted acoustic features and semantic features;
a model application unit, configured to detect target domain data to be detected using the trained response obligation detection model, to determine whether the system needs to respond to the target domain data to be detected.
In addition, to achieve the above objective, the present invention also provides an electronic device, comprising: a memory, a processor, and a multi-modality-based response obligation detection program stored in the memory and executable on the processor, wherein the multi-modality-based response obligation detection program, when executed by the processor, implements the following steps:
obtaining training data samples, and saving the training data samples to a training sample data set;
training a preset response obligation detection model using the training sample data set, so that the response obligation detection model reaches a preset accuracy; wherein the response obligation detection model is used to perform acoustic feature extraction and semantic feature extraction on input data information, and to perform response obligation detection on the input data information according to the extracted acoustic features and semantic features;
detecting target domain data to be detected using the trained response obligation detection model, to determine whether the system needs to respond to the target domain data to be detected.
In addition, to achieve the above objective, the present invention also provides a computer-readable storage medium storing a multi-modality-based response obligation detection program which, when executed by a processor, implements the steps of the multi-modality-based response obligation detection method described above:
obtaining training data samples, and saving the training data samples to a training sample data set;
training a preset response obligation detection model using the training sample data set, so that the response obligation detection model reaches a preset accuracy; wherein the response obligation detection model is used to perform acoustic feature extraction and semantic feature extraction on input data information, and to perform response obligation detection on the input data information according to the extracted acoustic features and semantic features;
detecting target domain data to be detected using the trained response obligation detection model, to determine whether the system needs to respond to the target domain data to be detected.
Beneficial effects
In the multi-modality-based response obligation detection method, electronic device and computer-readable storage medium proposed by the present invention, a response obligation detection model is designed. The model is a multi-modal fusion algorithm based on speech features and semantic information; embedding it in an automatic dialogue system enables response obligation detection within a dialogue. Unlike traditional response obligation detection, the algorithm attends to the semantic information of the received sentence as well as to the speech signal: after receiving the speech signal, it analyzes the sound signal with an acoustic feature extraction method and, in parallel, converts the speech signal to text through automatic speech recognition and performs semantic understanding on the text. When judging whether a received sentence needs a reply, it considers the acoustic features and the semantic information of the sample together.
In addition, to address the large difference between the target domain and the source domain, the present invention proposes using an adversarial network to reduce the difference in feature distribution between the two domains. It also uses self-supervised learning, taking the consistency of the two modalities as a learning objective, to further strengthen the domain adaptability of the features: the model detects whether two features from different modalities were extracted from the same sample, and the result of that prediction is used as part of the loss function. This supervises the model in learning to understand semantic information and improves its accuracy.
Description of the drawings
FIG. 1 is a flowchart of a preferred embodiment of the multi-modality-based response obligation detection method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a preferred embodiment of the electronic device according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the internal logic of the multi-modality-based response obligation detection program according to an embodiment of the present invention.
The realization of the objectives, functional features and advantages of the present invention will be further described with reference to the embodiments and the accompanying drawings.
Best mode of the invention
Specific embodiments of the present application are described in detail below with reference to the accompanying drawings.
Embodiment 1
To illustrate the multi-modality-based response obligation detection method provided by the present invention, FIG. 1 shows the flow of the method.
As shown in FIG. 1, the multi-modality-based response obligation detection method provided by the present invention includes:
S110: Obtain training data samples, and save the training data samples to a training sample data set.
It should be noted that the training data samples are historical data that have been verified by technicians and labeled with the corresponding response labels; they serve as training data samples for the subsequent training of the response obligation detection model. For example, the training data samples may be segments of historical voice information that, after verification by technicians, have been given a response label (for example, "reply" or "do not reply").
In addition, to make the training data samples model real data more accurately, and thereby improve the accuracy of the response obligation detection model described below, the historical data used as training data samples may be drawn from two data domains, the target domain and the source domain. That is, the training data samples include target domain data samples and source domain data samples, where the target domain data samples are sentences from the real environment in which the dialogue system is used, and the source domain data samples are sentences from a traditional preset training database.
The samples of the target domain and the source domain differ considerably. For example, the model may be trained on corpora recorded in relatively quiet conditions, whereas real applications may contain various kinds of background noise, so the system cannot perform speech recognition correctly; as a result, although the dialogue system identifies response obligations accurately during training, it cannot correctly identify the sentences that need a response in real application scenarios. The present invention therefore uses both target domain data samples and source domain data samples to train the purpose-designed response obligation detection model described later, thereby significantly improving its recognition accuracy.
It should also be emphasized that, to further ensure the privacy and security of the data in the training sample data set, the training sample data set may be stored in a node of a blockchain.
S120: Train a preset response obligation detection model using the training sample data set, so that the response obligation detection model reaches a corresponding preset accuracy; wherein the response obligation detection model is used to perform acoustic feature extraction and semantic feature extraction on input data information, and to perform response obligation detection and labeling on the input data information according to the extracted acoustic features and semantic features.
Specifically, the response obligation detection model mainly includes a multi-modal fusion module, which performs acoustic feature extraction and semantic feature extraction on the source domain data samples and target domain data samples in the training sample data set.
Specifically, acoustic features are extracted using Mel-frequency cepstral coefficients (MFCC) or perceptual linear prediction (PLP). MFCC and PLP are commonly used existing acoustic feature extraction methods: by extracting the frequency-domain features of the speech signal over short time windows, they yield combined time-frequency information about the sample, which is an important feature for distinguishing different phonemes. Because MFCC and PLP are common existing techniques for acoustic feature extraction, their data processing is not described in detail here.
It should be noted that, in actual processing, for acoustic feature extraction the original signal (the training data sample) may be divided into frames, for example with every 20 ms of signal forming one frame. Within such a period the speech signal can be regarded as a stationary time series, so frequency-domain information can be extracted from the signal in that period using common feature extraction methods such as computing MFCC or PLP. Both methods are modeled on the human auditory system. Generally speaking, PLP is more robust to noise, while MFCC is faster to compute; which feature to use can be chosen according to the business scenario.
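The framing step described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `frame_signal` is hypothetical, the optional overlap (`hop_ms`) is a common practice that the text does not mention, and the MFCC or PLP computation that would follow on each frame is omitted.

```python
def frame_signal(samples, sample_rate, frame_ms=20, hop_ms=None):
    """Split a 1-D sequence of audio samples into fixed-length frames.

    frame_ms is the frame length in milliseconds (20 ms in the text above).
    hop_ms is the step between frame starts; None means no overlap (assumption).
    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = frame_len if hop_ms is None else int(sample_rate * hop_ms / 1000)
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame and hop lengths must be positive")
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append(samples[start:start + frame_len])
    return frames

# One second of placeholder audio at an assumed 16 kHz sampling rate.
signal = [0.0] * 16000
frames = frame_signal(signal, sample_rate=16000)
overlapping = frame_signal(signal, sample_rate=16000, hop_ms=10)
print(len(frames), len(frames[0]), len(overlapping))
```

At 16 kHz a 20 ms frame holds 320 samples, so one second of audio yields 50 non-overlapping frames; each frame is short enough to be treated as stationary when its spectrum is computed.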
Specifically, semantic features are extracted by processing the input data information or the acoustic features with an automatic speech recognition (ASR) network to obtain the semantic features of the input data information. For example, when extracting semantic information, automatic speech recognition may first be performed on the acoustic features. ASR mainly comprises two parts, acoustic model processing and decoding search. The acoustic model is the basis for improving the recognition rate, and the acoustic models most widely used at present are end-to-end models; the decoding search part includes the classic connectionist temporal classification (CTC) method or the currently mainstream RNN-T and Transformer networks. After this processing, the predicted text is output for the acoustic features; this text is the speech recognition result, from which the corresponding semantic features are obtained.
Of course, the semantic features may also be extracted by applying speech recognition technology directly to the input data information. It should be noted that speech recognition is existing technology with many possible implementations; the present invention merely uses it to obtain the required semantic features, so its data processing is not described in detail here.
In addition, to improve the ability of the multi-modal fusion module to extract acoustic and semantic features, the response obligation detection model provided by the present invention further includes an adversarial network module, which performs adversarial training on the target domain data samples and source domain data samples to improve that feature extraction ability.
Specifically, the adversarial network module includes a first adversarial network and a second adversarial network, and in the process of training the preset response obligation detection model using the training sample data set:
the first adversarial network performs adversarial training on the target domain acoustic features and source domain acoustic features extracted by the response obligation detection model, so that the acoustic feature extraction accuracy of the response obligation detection model reaches a preset accuracy;
the second adversarial network performs adversarial training on the target domain semantic features and source domain semantic features extracted by the response obligation detection model, so that the semantic feature extraction accuracy of the response obligation detection model reaches a preset accuracy.
It should be noted that, to reduce the effect that the difference in feature distribution between the target domain and the source domain has on the accuracy of the algorithm, the present invention uses the multi-modal fusion adversarial network described above. In the adversarial network module, the domain classification loss of the domain classifier can be calculated separately on the source domain and the target domain. This effectively prevents the domain classifier from attending only to the less robust modality during optimization, thereby improving the feature extraction accuracy of the model. Specifically, the loss function of the domain classifier is:
$$L_d = \sum_{x \in (S,T)} \Big[ -d \log\big(D_m(F_m(x))\big) - (1-d) \log\big(1 - D_m(F_m(x))\big) \Big]$$
where $F_m$ and $D_m$ denote the feature matrix and the domain classifier, respectively, for the target domain and the source domain, and $d$ is the domain label, indicating whether the current sample belongs to the target domain or the source domain. The domain classifier updates its network parameters by minimizing the domain classification loss $L_d$, and its final output $D_m(F_m(x))$ is the domain classifier's prediction of the domain to which the input data belongs. The label classifier, in turn, minimizes the label classification loss $L_y$, which improves the model's ability to predict sample labels. Ultimately, the aim is for the feature extractor to maximize the domain classification loss $L_d$, so that the features it extracts are as relevant as possible to judging the response obligation of a sentence and independent of the specific domain. In other words, when the model makes a judgment it is not affected by a change in the sample's domain and attends only to whether the sample itself needs a response. The two classifiers are iterated continually to reduce the influence of the domain on response obligation recognition.
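The domain classification loss above is a binary cross-entropy summed over source and target samples, and can be evaluated with a short Python sketch. The function name, the clamping constant `eps`, and the convention that d = 1 marks the target domain are assumptions for illustration; the text does not state which domain maps to which label value.

```python
import math

def domain_classification_loss(probs, domain_labels, eps=1e-12):
    """Sum of -d*log(p) - (1-d)*log(1-p) over all samples.

    probs:         outputs D_m(F_m(x)) of the domain classifier, each in [0, 1]
    domain_labels: d for each sample (assumed: 1 = target domain, 0 = source domain)
    eps:           clamp that keeps log() finite at p = 0 or p = 1 (assumption)
    """
    total = 0.0
    for p, d in zip(probs, domain_labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -d * math.log(p) - (1 - d) * math.log(1.0 - p)
    return total

# A classifier that cannot tell the domains apart outputs 0.5 for every sample.
loss_uncertain = domain_classification_loss([0.5, 0.5], [1, 0])
# A classifier that separates the domains well gets a much lower loss.
loss_confident = domain_classification_loss([0.9, 0.1], [1, 0])
print(loss_uncertain, loss_confident)
```

The domain classifier is trained to push this value down, while the feature extractor is trained to push it up (for example via a gradient reversal layer, a common but here assumed mechanism); features are domain-invariant when the classifier does no better than the 0.5 case.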
In addition, the response obligation detection model further includes a total classifier network, which computes the final response obligation probability from the acoustic features and the semantic features.
Specifically, when computing the final response obligation probability from the acoustic features and the semantic features, the acoustic response obligation probability and the semantic response obligation probability are first computed from the acoustic features and the semantic features, respectively; the final response obligation probability is then computed from these two probabilities, where:
The loss function for computing the acoustic response obligation probability is:
Figure PCTCN2020125140-appb-000001
where P(x1) is the acoustic response obligation probability, y is the ground-truth value of the training data sample, {S} is the training sample data set, and x1 is the training data sample in {S} used to compute the acoustic response obligation probability;
The loss function for computing the semantic response obligation probability is:
Figure PCTCN2020125140-appb-000002
where P(x2) is the semantic response obligation probability, y is the ground-truth value of the training data sample, {S} is the training sample data set, and x2 is the training data sample in {S} used to compute the semantic response obligation probability;
The loss function for computing the final response obligation probability is:
$$L_y = a \cdot L_y^{\text{speech}} + b \cdot L_y^{\text{semantic}}$$
where a + b = 1, and a and b are the preset weights of the acoustic features and the semantic features, respectively.
More specifically, after the multimodal fusion module extracts the acoustic features and the semantic features, the acoustic features computed by MFCC or PLP are fed into a deep learning network (RNN, CNN, Transducer, etc., i.e., the total classifier network) to compute the probability P(x1) that the utterance needs a reply and the classification loss L_y^speech.
It should be noted that this deep learning network (i.e., the total classifier network) can be modeled as a binary classifier, with labels 0 and 1 marking whether a response is required. Given input data, the value output by the network is the probability that a response is required.
The classification loss of this part of the total classifier network can be obtained by the following formula:
Figure PCTCN2020125140-appb-000003
where P(x1) is the probability, as judged by the model, that a response is required, and y is the ground-truth value of the sample label. L_y^speech can subsequently be differentiated with respect to the parameters of the network model, and the parameters updated through backpropagation, to optimize the network model.
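The formula itself survives only as a figure reference, so the following Python sketch rests on an assumption: that the classification loss is the standard binary cross-entropy between P(x1) and y, which fits the binary 0/1 modeling described above. The one-feature logistic model, the learning rate, and all function names are hypothetical and serve only to show how a loss of this kind is differentiated and used to update parameters.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(p, y, eps=1e-12):
    """Binary cross-entropy for one sample (assumed form of the loss)."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -y * math.log(p) - (1 - y) * math.log(1.0 - p)


def sgd_step(w, b, x, y, lr=0.1):
    """One gradient-descent update for a one-feature logistic model."""
    p = sigmoid(w * x + b)
    grad_z = p - y  # derivative of bce with respect to z = w * x + b
    return w - lr * grad_z * x, b - lr * grad_z


w, b = 0.0, 0.0
x, y = 2.0, 1  # one sample labelled "a response is required"
loss_before = bce(sigmoid(w * x + b), y)
w, b = sgd_step(w, b, x, y)
loss_after = bce(sigmoid(w * x + b), y)
```

After the single update, `loss_after` is smaller than `loss_before`, which is the behaviour the backpropagation step is meant to produce on each iteration.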
In addition, the total classifier network applies word embedding to the semantic features extracted by the multimodal fusion module. Word embedding is the general term for language-model and representation-learning techniques in natural language processing. Conceptually, it embeds a high-dimensional space whose dimension equals the number of all words into a continuous vector space of much lower dimension; each word or phrase is mapped to a vector over the real numbers, which simplifies subsequent computation.
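As a rough illustration of the mapping from words to low-dimensional real vectors, the sketch below builds a small lookup table in Python. The vocabulary, the dimension, the random initialization, and the zero vector for unknown words are all illustrative assumptions; in a trained model the vectors would be learned rather than drawn at random.

```python
import random


def build_embedding_table(vocab, dim, seed=0):
    """Assign each word a dense vector (random here, learned in practice)."""
    rng = random.Random(seed)
    return {word: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for word in vocab}


def embed(tokens, table, dim):
    """Map a token sequence to a sequence of vectors; unknown words map to zeros."""
    unknown = [0.0] * dim
    return [table.get(token, unknown) for token in tokens]


vocab = ["turn", "on", "the", "light"]
table = build_embedding_table(vocab, dim=3)
vectors = embed(["turn", "on", "lamp"], table, dim=3)
```

Each of the three tokens becomes a 3-dimensional vector, instead of a one-hot vector whose length equals the vocabulary size.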
The result of the word embedding is used as the input to a recurrent neural network (LSTM, GRU, etc.), which computes the probability P(x2) that the utterance needs a reply and the classification loss L_y^semantic. LSTM and GRU are special recurrent neural networks: compared with an ordinary RNN, they maintain both long-term and short-term memory and can handle long-range dependencies. The LSTM structure, for example, includes a forget gate that decides what information to discard at the current step, an input gate that decides what data to update, and an output gate that decides what information to output. GRU is a variant of LSTM that merges the forget gate and the input gate into a single update gate, so the resulting model is simpler than the standard LSTM.
In addition, when updating network parameters, LSTM and GRU also follow the backpropagation rule: the model is updated using the derivative of the loss function with respect to the model coefficients. It should be noted that the classification loss for this part of the total classifier network is computed in the same way as L_y^speech, so the details are not repeated here.
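The role of the three LSTM gates can be made concrete with a single-step sketch in Python. It uses a hidden size of 1 so that every quantity is a scalar, and the parameter values are arbitrary; both choices are simplifying assumptions for illustration and do not reflect the configuration used in the patent.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell (hidden size 1, for illustration only)."""

    def gate(name, activation):
        w_x, w_h, bias = params[name]
        return activation(w_x * x + w_h * h_prev + bias)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much new information to write
    g = gate("candidate", math.tanh)   # candidate values to write
    o = gate("output", sigmoid)        # how much of the cell state to expose
    c = f * c_prev + i * g             # updated long-term memory
    h = o * math.tanh(c)               # updated hidden state
    return h, c


names = ("forget", "input", "candidate", "output")
params = {name: (0.5, 0.5, 0.0) for name in names}  # arbitrary weights
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, params)
```

A GRU step would follow the same pattern with one update gate in place of the separate forget and input gates.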
In addition, when the label classifier finally makes its prediction for the sample, the probabilities and losses of the acoustic features and the semantic information are combined to obtain the sample's response obligation probability and loss L_y. A commonly used calculation is:
$$P(x) = a \cdot P(x1) + b \cdot P(x2)$$
$$L_y = a \cdot L_y^{\text{speech}} + b \cdot L_y^{\text{semantic}}$$
where P(x) is the final response obligation probability, L_y is the loss function of the final response obligation probability, a + b = 1, and a and b are the preset weights of the acoustic features and the semantic information, respectively.
Through the above series of steps, the total classifier network model obtains the final response obligation probability, which is used to perform response obligation detection on the input data information: if the final response obligation probability is 0, the system does not respond; if it is 1, the system responds.
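The weighted fusion and the resulting decision can be sketched in Python as follows. The text only describes the outcomes for probabilities of 0 and 1, so the 0.5 threshold used for intermediate values is an assumption, as are the function names and the sample numbers.

```python
def fuse(p_speech, p_semantic, loss_speech, loss_semantic, a=0.5):
    """Combine the two branches with preset weights a and b, where a + b = 1."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1]")
    b = 1.0 - a
    p = a * p_speech + b * p_semantic
    loss = a * loss_speech + b * loss_semantic
    return p, loss


def should_respond(p, threshold=0.5):
    """Hypothetical decision rule: respond when P(x) reaches the threshold."""
    return p >= threshold


# Illustrative values only: acoustic branch 0.9, semantic branch 0.6.
p, loss = fuse(0.9, 0.6, 0.105, 0.511, a=0.4)
decision = should_respond(p)
```

Because a and b sum to 1, the fused probability always stays between the two branch probabilities.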
In addition, to use the consistency between the acoustic features and the semantic signal of a sample as the classification label for self-supervised representation learning, the present invention also uses a modality consistency detector C, i.e., the multimodal self-supervised learning module. It takes as input the acoustic features and semantic information extracted from samples in the source domain and the target domain, randomly pairs features from different modalities, and detects whether the label classifications of the two modalities are consistent. This self-supervised learning further strengthens the representational power of the features. The loss function of C is:
$$L_c = \sum_{x \in (S,T)} -c \log C\big(F_0(x), \ldots, F_m(x)\big)$$
where c indicates whether the input modalities are consistent.
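A minimal Python sketch of the consistency loss, implemented exactly as the formula is written, is given below. The detector outputs are supplied directly as probabilities; how C computes them from the features is not specified here, so the inputs and the function name are hypothetical.

```python
import math


def consistency_loss(consistency_probs, consistency_labels, eps=1e-12):
    """Sum of -c * log C(...) over all sampled feature pairs.

    consistency_probs[i]  -- detector output C(F_0(x), ..., F_m(x)) for pair i
    consistency_labels[i] -- c, 1 if the paired modalities are consistent, else 0
    """
    total = 0.0
    for p, c in zip(consistency_probs, consistency_labels):
        p = max(p, eps)  # avoid log(0)
        total += -c * math.log(p)
    return total


# Two pairs: one consistent (c = 1), one inconsistent (c = 0).
loss_c = consistency_loss([0.8, 0.3], [1, 0])
```

With the formula in this form, only pairs with c = 1 contribute to the sum.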
It should be noted that the present invention finally combines the response obligation detection loss with the domain classification loss and the modality consistency classification loss to train the entire network. The loss function used is L = L_y + λ_d·L_d + λ_c·L_c, where λ_d and λ_c are the weights of the domain classifier loss and the modality consistency detector loss, respectively. The smaller the model's loss function L, the more accurate the prediction. Therefore, following backpropagation, the loss function L is differentiated with respect to the model parameters, and the derivatives are used to update the network parameters and optimize the model.
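The combination of the three losses and the parameter update can be sketched in Python. The operators of the combined loss did not survive extraction, so the plain weighted sum used below is an assumption; the default weights, the learning rate, and the function names are likewise illustrative.

```python
def total_loss(l_y, l_d, l_c, lambda_d=0.1, lambda_c=0.1):
    """Weighted combination of the three losses (assumed to be a plain sum)."""
    return l_y + lambda_d * l_d + lambda_c * l_c


def gradient_step(params, grads, lr=0.01):
    """Move each parameter against its gradient, as in gradient descent."""
    return [p - lr * g for p, g in zip(params, grads)]


# Illustrative values only.
loss = total_loss(0.35, 1.2, 0.22, lambda_d=0.1, lambda_c=0.5)
new_params = gradient_step([1.0, -2.0], [0.5, -0.5], lr=0.1)
```

In a full implementation the gradients would come from differentiating the combined loss by backpropagation rather than being supplied by hand.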
S130: Use the trained response obligation detection model to examine the target-domain data to be detected, to determine whether the system needs to respond to that data.
It should be noted that, after the sample training described above, the loss functions, adversarial training, and related techniques significantly improve the accuracy of the response obligation detection model. The model can then be used to examine the target-domain data to be detected.
Specifically, in terms of application scenarios, the model can be used in services involving automatic dialogue, such as intelligent customer service systems. During user interaction the system usually cannot see the user's facial expression and must judge from speech alone whether the user is talking to it. By judging the response obligation, the customer service system can keep waiting while the user is talking to someone else; if the system detects no response obligation for a long time, it can prompt the user to end the conversation. The model can also be used in smart home products such as Tmall Genie and Xiao Ai to provide a more natural service: the user does not need a specific wake word and can state a request directly, and the system receives the instruction and serves the user.
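The customer service behaviour described above (respond, keep waiting, or prompt the user to end the conversation) can be sketched as a small Python policy. The representation of history as a list of booleans, the action names, and the limit of three idle turns are hypothetical choices for illustration; the patent does not specify how "a long time" is measured.

```python
def next_action(obligation_history, max_idle_turns=3):
    """Choose a dialogue action from per-utterance detection results.

    obligation_history -- list of booleans, oldest first; True means the
                          detector found a response obligation for that utterance
    """
    if obligation_history and obligation_history[-1]:
        return "respond"
    idle = 0
    for flag in reversed(obligation_history):
        if flag:
            break
        idle += 1  # count trailing utterances without a response obligation
    return "prompt_end" if idle >= max_idle_turns else "wait"


first = next_action([True])
second = next_action([True, False])
third = next_action([True, False, False, False])
```

Here `first` is "respond", `second` is "wait", and `third` is "prompt_end".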
As the above description of the technical solution shows, the multimodal response obligation detection method proposed by the present invention designs a response obligation detection model that is a multimodal fusion algorithm based on speech features and semantic information; embedding this algorithm in an automatic dialogue system enables response obligation detection in dialogue. Unlike traditional response obligation detection, the algorithm attends to the semantic information of the received utterance as well as to the speech signal. After a speech signal is received, the sound signal is analyzed by an acoustic feature extraction method on the one hand; on the other hand, the speech signal is converted to text by automatic speech recognition, and semantic understanding is performed on the text. When judging whether the received utterance needs a reply, the acoustic features and the semantic information of the sample are considered together.
In addition, to address large differences between the target domain and the source domain, the present invention uses an adversarial network to reduce the distribution difference of features between the two domains. It also uses self-supervised learning, with the consistency of the two modalities as the learning target, to further strengthen the domain adaptability of the features: it detects whether two features from different modalities were extracted from the same sample and uses the detection result as part of the loss function, thereby supervising the model in learning to understand semantic information and improving the model's accuracy.
It should be understood that the sequence numbers of the steps in the above embodiment do not imply an order of execution. The execution order of each process is determined by its function and internal logic and does not limit the implementation of the embodiments of the present application in any way.
Embodiment 2
Corresponding to the above method, the present application also provides a multimodal response obligation detection system, which includes:
a sample set establishment unit, configured to obtain training data samples and save the training data samples to a training sample data set;
a model training unit, configured to train a preset response obligation detection model using the training sample data set, so that the response obligation detection model reaches a preset accuracy; wherein the response obligation detection model is used to perform acoustic feature extraction and semantic feature extraction on input data information, and to perform response obligation detection on the input data information according to the extracted acoustic features and semantic features;
a model application unit, configured to use the trained response obligation detection model to examine target-domain data to be detected, to determine whether the system needs to respond to the target-domain data to be detected.
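The division into three units can be sketched as a small Python class. Everything here is hypothetical scaffolding: the class and method names are invented for illustration, and the keyword-based trainer is a trivial stand-in for the multimodal model, used only so that the example runs end to end.

```python
class ResponseObligationSystem:
    """Toy wiring of the three units; `train_fn` is a hypothetical callable."""

    def __init__(self, train_fn):
        self.train_fn = train_fn  # stands in for the model training logic
        self.dataset = []         # training sample data set
        self.model = None

    def add_samples(self, samples):
        """Sample set establishment unit: store (text, label) pairs."""
        self.dataset.extend(samples)

    def train(self):
        """Model training unit: fit a detector on the stored samples."""
        self.model = self.train_fn(self.dataset)

    def needs_response(self, item):
        """Model application unit: decide whether the system should respond."""
        if self.model is None:
            raise RuntimeError("train() must be called first")
        return bool(self.model(item))


def keyword_trainer(dataset):
    """Placeholder trainer: keeps words seen only in positive samples."""
    positive = {w for text, label in dataset if label for w in text.split()}
    negative = {w for text, label in dataset if not label for w in text.split()}
    cues = positive - negative
    return lambda text: any(w in cues for w in text.split())


system = ResponseObligationSystem(keyword_trainer)
system.add_samples([("please play music", 1), ("he said play later", 0)])
system.train()
```

Calling `system.needs_response("please stop")` then returns `True`, while `system.needs_response("he said hello")` returns `False`.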
Embodiment 3
The present invention also provides an electronic device 70. FIG. 2 is a schematic structural diagram of a preferred embodiment of the electronic device 70 provided by the present invention.
In this embodiment, the electronic device 70 may be a terminal device with computing capability, such as a server, a smartphone, a tablet computer, a portable computer, or a desktop computer.
The electronic device 70 includes a processor 71 and a memory 72.
The memory 72 includes at least one type of readable storage medium. The at least one type of readable storage medium may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, or a card-type memory; the readable storage medium may also be a volatile storage medium. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 70, such as a hard disk of the electronic device 70. In other embodiments, the readable storage medium may be an external memory of the electronic device 70, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the electronic device 70.
In this embodiment, the readable storage medium of the memory 72 is generally used to store the multimodal response obligation detection program 73 installed on the electronic device 70. The memory 72 may also be used to temporarily store data that has been output or is to be output.
In some embodiments, the processor 71 may be a central processing unit (CPU), a microprocessor, or another data processing chip, used to run program code or process data stored in the memory 72, for example to execute the multimodal response obligation detection program 73.
In some embodiments, the electronic device 70 is a terminal device such as a smartphone, a tablet computer, or a portable computer. In other embodiments, the electronic device 70 may be a server.
FIG. 2 shows only the electronic device 70 with components 71-73. It should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead.
Optionally, the electronic device 70 may further include a user interface. The user interface may include an input unit such as a keyboard, a voice input device such as a microphone or another device with a voice recognition function, and a voice output device such as a speaker or headphones. Optionally, the user interface may also include a standard wired interface and a wireless interface.
Optionally, the electronic device 70 may further include a display, which may also be called a display screen or display unit. In some embodiments it may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) touch device, or the like. The display is used to show information processed in the electronic device 70 and to show a visual user interface.
Optionally, the electronic device 70 may further include a touch sensor. The area that the touch sensor provides for the user's touch operations is called the touch area. The touch sensor may be a resistive touch sensor, a capacitive touch sensor, or the like, and may be a contact-type or a proximity-type touch sensor. It may be a single sensor or multiple sensors arranged, for example, in an array.
In addition, the area of the display of the electronic device 70 may be the same as or different from the area of the touch sensor. Optionally, the display and the touch sensor are stacked to form a touch display screen, through which the device detects touch operations triggered by the user.
Optionally, the electronic device 70 may further include a radio frequency (RF) circuit, sensors, an audio circuit, and so on, which are not described further here.
In the device embodiment shown in FIG. 2, the memory 72, as a computer storage medium, may contain an operating system and the multimodal response obligation detection program 73. When the processor 71 executes the multimodal response obligation detection program 73 stored in the memory 72, the following steps are implemented:
obtaining training data samples, and saving the training data samples to a training sample data set;
training a preset response obligation detection model using the training sample data set, so that the response obligation detection model reaches a preset accuracy; wherein the response obligation detection model is used to perform acoustic feature extraction and semantic feature extraction on input data information, and to perform response obligation detection on the input data information according to the extracted acoustic features and semantic features;
using the trained response obligation detection model to examine target-domain data to be detected, to determine whether the system needs to respond to the target-domain data to be detected.
In this embodiment, FIG. 3 is a schematic diagram of the internal logic of the multimodal response obligation detection program according to an embodiment of the present invention. As shown in FIG. 3, the multimodal response obligation detection program 73 may be divided into one or more modules, which are stored in the memory 72 and executed by the processor 71 to carry out the present invention. A module, as the term is used in the present invention, is a series of computer program instruction segments capable of performing a specific function. FIG. 3 is a program module diagram of a preferred embodiment of the multimodal response obligation detection program 73 in FIG. 2. The program 73 may be divided into a sample set establishment module 74, a model training module 75, and a model application module 76. The functions or operation steps implemented by modules 74-76 are similar to those described above and are not detailed again here. For example:
the sample set establishment module 74 is configured to obtain training data samples and save the training data samples to a training sample data set;
the model training module 75 is configured to train a preset response obligation detection model using the training sample data set, so that the response obligation detection model reaches the corresponding preset accuracy; wherein the response obligation detection model is used to perform acoustic feature extraction and semantic feature extraction on input data information, and to perform response obligation detection on the input data information according to the extracted acoustic features and semantic features;
the model application module 76 is configured to use the trained response obligation detection model to examine target-domain data to be detected, to determine whether the system needs to respond to the target-domain data to be detected.
Embodiment 4
The present invention also provides a computer-readable storage medium that stores the multimodal response obligation detection program 73. When executed by a processor, the multimodal response obligation detection program 73 implements the following operations:
obtaining training data samples, and saving the training data samples to a training sample data set;
training a preset response obligation detection model using the training sample data set, so that the response obligation detection model reaches a preset accuracy; wherein the response obligation detection model is used to perform acoustic feature extraction and semantic feature extraction on input data information, and to perform response obligation detection on the input data information according to the extracted acoustic features and semantic features;
using the trained response obligation detection model to examine target-domain data to be detected, to determine whether the system needs to respond to the target-domain data to be detected.
The specific implementation of the computer-readable storage medium provided by the present invention is substantially the same as that of the multimodal response obligation detection method and the electronic device described above, and is not repeated here.
It should be noted that the blockchain referred to in the present invention is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, in which each block contains the information of a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, and an application service layer.
The above are only preferred embodiments of the present invention and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of the present invention.

Claims (20)

  1. A multimodal response obligation detection method, applied to an electronic device, wherein the method comprises:
    obtaining training data samples, and saving the training data samples to a training sample data set;
    training a preset response obligation detection model using the training sample data set, so that the response obligation detection model reaches a preset accuracy; wherein the response obligation detection model is used to perform acoustic feature extraction and semantic feature extraction on input data information, and to perform response obligation detection on the input data information according to the extracted acoustic features and semantic features;
    using the trained response obligation detection model to examine target-domain data to be detected, to determine whether the system needs to respond to the target-domain data to be detected.
  2. The multimodal response obligation detection method according to claim 1, wherein, in the process of the response obligation detection model performing acoustic feature extraction on the input data information:
    the input data information is processed using a Mel cepstrum network or a perceptual linear prediction network to obtain the acoustic features of the input data information.
  3. The multimodal response obligation detection method according to claim 2, wherein, in the process of the response obligation detection model performing semantic feature extraction on the input data information:
    the input data information or the acoustic features are processed using an ASR network to obtain the semantic features of the input data information.
  4. The multimodal response obligation detection method according to claim 3, wherein:
    the training sample data set is stored in a blockchain; and
    the training data samples include target-domain data samples and source-domain data samples, and, in the process of training the preset response obligation detection model using the training sample data set, the target-domain data samples and the source-domain data samples are used to train the response obligation detection model.
  5. The multimodal response obligation detection method according to claim 4, wherein the response obligation detection model further includes a first adversarial network and a second adversarial network, and, in the process of training the preset response obligation detection model using the training sample data set:
    the first adversarial network is used to perform adversarial training on the target-domain acoustic features and the source-domain acoustic features extracted by the response obligation detection model, so that the acoustic feature extraction accuracy of the response obligation detection model reaches a preset accuracy;
    the second adversarial network is used to perform adversarial training on the target-domain semantic features and the source-domain semantic features extracted by the response obligation detection model, so that the semantic feature extraction accuracy of the response obligation detection model reaches a preset accuracy.
  6. The multimodal response obligation detection method according to claim 5, wherein the response obligation detection model further includes a total classifier network, and the total classifier network is used to compute the final response obligation probability from the acoustic features and the semantic features.
  7. The multimodal response obligation detection method according to claim 6, wherein the method of calculating the final response obligation probability according to the acoustic features and the semantic features comprises:
    first calculating an acoustic response obligation probability and a semantic response obligation probability according to the acoustic features and the semantic features, respectively, and then calculating the final response obligation probability according to the acoustic response obligation probability and the semantic response obligation probability; wherein
    the loss function for calculating the acoustic response obligation probability is:
    Figure PCTCN2020125140-appb-100001
    where P(x1) is the acoustic response obligation probability, y is the true value of the training data sample, {S} is the training sample data set, and x1 is the training data sample in {S} used when calculating the acoustic response obligation probability;
    the loss function for calculating the semantic response obligation probability is:
    Figure PCTCN2020125140-appb-100002
    where P(x2) is the semantic response obligation probability, y is the true value of the training data sample, {S} is the training sample data set, and x2 is the training data sample in {S} used when calculating the semantic response obligation probability;
    the loss function for calculating the final response obligation probability is:
    $L_y = a \cdot L_y^{\text{speech}} + b \cdot L_y^{\text{semantic}}$
    where a + b = 1, and a and b are the preset weights of the acoustic features and the semantic features, respectively.
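The weighted combination in claim 7 can be sketched as follows. The two branch losses appear only as formula images in the source, so their exact form is not reproduced here; this sketch assumes, purely for illustration, that each branch uses a mean binary cross-entropy over the samples. The function names and the default weight are hypothetical.

```python
import math


def bce_loss(probs, labels, eps=1e-12):
    """Mean binary cross-entropy (an assumed form of each branch loss)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def combined_loss(p_speech, p_semantic, labels, a=0.5):
    """L_y = a * L_y^speech + b * L_y^semantic with b = 1 - a."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1] so that a + b = 1 with b >= 0")
    b = 1.0 - a
    return a * bce_loss(p_speech, labels) + b * bce_loss(p_semantic, labels)


# Toy batch: 1 means the system is obliged to respond, 0 means it is not.
labels = [1, 0, 1]
p_speech = [0.8, 0.3, 0.6]     # acoustic response obligation probabilities
p_semantic = [0.9, 0.2, 0.7]   # semantic response obligation probabilities
loss = combined_loss(p_speech, p_semantic, labels, a=0.4)
```

Deriving b from a keeps the constraint a + b = 1 true by construction, so only one weight needs to be preset.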
  8. A multimodal response obligation detection system, wherein the system comprises:
    a sample set establishment unit, configured to obtain training data samples and save the training data samples to a training sample data set;
    a model training unit, configured to train a preset response obligation detection model using the training sample data set, so that the response obligation detection model reaches a preset accuracy; wherein the response obligation detection model is used to perform acoustic feature extraction and semantic feature extraction on input data information, and to perform response obligation detection on the input data information according to the extracted acoustic features and semantic features; and
    a model application unit, configured to use the trained response obligation detection model to detect target domain data to be detected, so as to determine whether the system needs to respond to the target domain data to be detected.
  9. An electronic device, wherein the electronic device comprises: a memory, a processor, and a multimodal response obligation detection program that is stored in the memory and executable on the processor, wherein the multimodal response obligation detection program, when executed by the processor, implements the following steps:
    obtaining training data samples, and saving the training data samples to a training sample data set;
    training a preset response obligation detection model using the training sample data set, so that the response obligation detection model reaches a preset accuracy; wherein the response obligation detection model is used to perform acoustic feature extraction and semantic feature extraction on input data information, and to perform response obligation detection on the input data information according to the extracted acoustic features and semantic features; and
    using the trained response obligation detection model to detect target domain data to be detected, so as to determine whether the system needs to respond to the target domain data to be detected.
  10. The electronic device according to claim 9, wherein, in the process in which the response obligation detection model performs acoustic feature extraction on the input data information:
    a Mel-frequency cepstrum network or a perceptual linear prediction network is used to process the input data information to obtain the acoustic features of the input data information.
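The claim names Mel-frequency cepstrum processing without giving details. As background only, the sketch below shows the frequency warping that Mel-based features rely on, using one commonly used form of the mel scale (mel = 2595 * log10(1 + f / 700)). Other variants exist and the patent does not say which one is intended, so the formula choice, the function names, and the band-edge helper are assumptions.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (one common variant of the scale)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(low_hz, high_hz, n_bands):
    """Return n_bands + 2 edge frequencies (Hz) equally spaced on the mel scale.

    Consecutive triples of edges would define the triangular filters of a
    mel filterbank, which precedes the log and DCT steps of cepstral features.
    """
    low, high = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high - low) / (n_bands + 1)
    return [mel_to_hz(low + i * step) for i in range(n_bands + 2)]


# Edges for a hypothetical 10-band filterbank covering 0 Hz to 8 kHz.
edges = mel_band_edges(0.0, 8000.0, 10)
```

Because the spacing is uniform in mels, the resulting bands are narrow at low frequencies and wide at high frequencies.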
  11. The electronic device according to claim 10, wherein, in the process in which the response obligation detection model performs semantic feature extraction on the input data information:
    an ASR network is used to process the input data information or the acoustic features to obtain the semantic features of the input data information.
  12. The electronic device according to claim 11, wherein:
    the training sample data set is stored in a blockchain; and
    the training data samples comprise target domain data samples and source domain data samples, and, in the process of training the preset response obligation detection model using the training sample data set, the target domain data samples and the source domain data samples are used to train the response obligation detection model.
  13. The electronic device according to claim 12, wherein the response obligation detection model further comprises a first adversarial network and a second adversarial network, and, in the process of training the preset response obligation detection model using the training sample data set:
    the first adversarial network is used to perform adversarial training on the target-domain acoustic features and the source-domain acoustic features extracted by the response obligation detection model, so that the acoustic feature extraction accuracy of the response obligation detection model reaches a preset accuracy; and
    the second adversarial network is used to perform adversarial training on the target-domain semantic features and the source-domain semantic features extracted by the response obligation detection model, so that the semantic feature extraction accuracy of the response obligation detection model reaches a preset accuracy.
  14. The electronic device according to claim 13, wherein the response obligation detection model further comprises an overall classifier network, and the overall classifier network is used to calculate a final response obligation probability according to the acoustic features and the semantic features.
  15. The electronic device according to claim 14, wherein the method of calculating the final response obligation probability according to the acoustic features and the semantic features comprises:
    first calculating an acoustic response obligation probability and a semantic response obligation probability according to the acoustic features and the semantic features, respectively, and then calculating the final response obligation probability according to the acoustic response obligation probability and the semantic response obligation probability; wherein
    the loss function for calculating the acoustic response obligation probability is:
    Figure PCTCN2020125140-appb-100003
    where P(x1) is the acoustic response obligation probability, y is the true value of the training data sample, {S} is the training sample data set, and x1 is the training data sample in {S} used when calculating the acoustic response obligation probability;
    the loss function for calculating the semantic response obligation probability is:
    Figure PCTCN2020125140-appb-100004
    where P(x2) is the semantic response obligation probability, y is the true value of the training data sample, {S} is the training sample data set, and x2 is the training data sample in {S} used when calculating the semantic response obligation probability;
    the loss function for calculating the final response obligation probability is:
    $L_y = a \cdot L_y^{\text{speech}} + b \cdot L_y^{\text{semantic}}$
    where a + b = 1, and a and b are the preset weights of the acoustic features and the semantic features, respectively.
  16. A computer-readable storage medium, wherein a multimodal response obligation detection program is stored in the computer-readable storage medium, and the multimodal response obligation detection program, when executed by a processor, implements the following steps of a multimodal response obligation detection method:
    obtaining training data samples, and saving the training data samples to a training sample data set;
    training a preset response obligation detection model using the training sample data set, so that the response obligation detection model reaches a preset accuracy; wherein the response obligation detection model is used to perform acoustic feature extraction and semantic feature extraction on input data information, and to perform response obligation detection on the input data information according to the extracted acoustic features and semantic features; and
    using the trained response obligation detection model to detect target domain data to be detected, so as to determine whether the system needs to respond to the target domain data to be detected.
  17. The computer-readable storage medium according to claim 16, wherein, in the process in which the response obligation detection model performs acoustic feature extraction on the input data information:
    a Mel-frequency cepstrum network or a perceptual linear prediction network is used to process the input data information to obtain the acoustic features of the input data information.
  18. The computer-readable storage medium according to claim 17, wherein, in the process in which the response obligation detection model performs semantic feature extraction on the input data information:
    an ASR network is used to process the input data information or the acoustic features to obtain the semantic features of the input data information.
  19. The computer-readable storage medium according to claim 18, wherein:
    the training sample data set is stored in a blockchain; and
    the training data samples comprise target domain data samples and source domain data samples, and, in the process of training the preset response obligation detection model using the training sample data set, the target domain data samples and the source domain data samples are used to train the response obligation detection model.
  20. The computer-readable storage medium according to claim 19, wherein the response obligation detection model further comprises a first adversarial network and a second adversarial network, and, in the process of training the preset response obligation detection model using the training sample data set:
    the first adversarial network is used to perform adversarial training on the target-domain acoustic features and the source-domain acoustic features extracted by the response obligation detection model, so that the acoustic feature extraction accuracy of the response obligation detection model reaches a preset accuracy; and
    the second adversarial network is used to perform adversarial training on the target-domain semantic features and the source-domain semantic features extracted by the response obligation detection model, so that the semantic feature extraction accuracy of the response obligation detection model reaches a preset accuracy.
PCT/CN2020/125140 2020-09-04 2020-10-30 Method for response obligation detection based on multiple modes, and system and apparatus WO2021159756A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010921759.9 2020-09-04
CN202010921759.9A CN112037772B (en) 2020-09-04 2020-09-04 Response obligation detection method, system and device based on multiple modes

Publications (1)

Publication Number Publication Date
WO2021159756A1 true WO2021159756A1 (en) 2021-08-19

Family

ID=73590563

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/125140 WO2021159756A1 (en) 2020-09-04 2020-10-30 Method for response obligation detection based on multiple modes, and system and apparatus

Country Status (2)

Country Link
CN (1) CN112037772B (en)
WO (1) WO2021159756A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117076957A (en) * 2023-10-16 2023-11-17 湖南智警公共安全技术研究院有限公司 Personnel identity association method and system based on multi-mode information

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105654943A (en) * 2015-10-26 2016-06-08 乐视致新电子科技(天津)有限公司 Voice wakeup method, apparatus and system thereof
CN108320733A (en) * 2017-12-18 2018-07-24 上海科大讯飞信息科技有限公司 Voice data processing method and device, storage medium, electronic equipment
CN109326285A (en) * 2018-10-23 2019-02-12 出门问问信息科技有限公司 Voice information processing method, device and non-transient computer readable storage medium
CN109360554A (en) * 2018-12-10 2019-02-19 广东潮庭集团有限公司 A kind of language identification method based on language deep neural network
US20190266998A1 (en) * 2017-06-12 2019-08-29 Ping An Technology(Shenzhen) Co., Ltd. Speech recognition method and device, computer device and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108257600B (en) * 2016-12-29 2020-12-22 中国移动通信集团浙江有限公司 Voice processing method and device
CN108334496B (en) * 2018-01-30 2020-06-12 中国科学院自动化研究所 Man-machine conversation understanding method and system for specific field and related equipment
JP2020024310A (en) * 2018-08-08 2020-02-13 株式会社日立製作所 Speech processing system and speech processing method

Also Published As

Publication number Publication date
CN112037772A (en) 2020-12-04
CN112037772B (en) 2024-04-02

Legal Events

Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20918667; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20918667; Country of ref document: EP; Kind code of ref document: A1)