WO2021159756A1 - Method for response obligation detection based on multiple modes, and system and apparatus

Info

Publication number: WO2021159756A1
Application number: PCT/CN2020/125140
Authority: WO
Inventors: 罗剑; 王健宗; 程宁
Original assignee: 平安科技（深圳）有限公司
Priority date: 2020-09-04
Filing date: 2020-10-30
Publication date: 2021-08-19
Also published as: CN112037772A; CN112037772B

Abstract

A method for response obligation detection based on multiple modes, relating to artificial intelligence, and comprising: acquiring a training data sample, and storing the training data sample in a training sample dataset (S110); using the training sample dataset to train a pre-set response obligation detection model so that the response obligation detection model attains a pre-set accuracy (S120); and, by means of the trained response obligation detection model, performing detection on target-domain data to be detected, so as to determine whether the system needs to respond to that data (S130). The method further relates to blockchain technology: the training sample dataset is stored in a blockchain. The method can effectively solve the problems of low efficiency and poor quality in existing response obligation detection methods.

Description

Multi-modality-based response duty detection method, system and device

This application claims priority to the Chinese patent application filed with the Chinese Patent Office on September 4, 2020, with application number 2020109217599 and the invention title "Multimodal-based response obligation detection method, system and device", the entire content of which is incorporated herein by reference.

Technical field

The present invention relates to the technical field of speech recognition in artificial intelligence, and in particular to a method, system, device and storage medium for detecting response obligations based on multimodality.

Background technique

Response Obligation Detection (ROD) is an important part of intelligent voice products such as automatic dialogue systems. In traditional voice dialogue interaction, the dialogue system is set to respond to every sentence it detects. However, the inventor realized that in natural communication between people, some sentences do not need to be answered, such as talking to oneself, a public statement, or a sentence spoken after the addressee has changed. For automatic dialogue systems, these sentences are likely to trigger unnecessary or erroneous replies, which reduces the accuracy of the dialogue system and degrades the user experience. Response obligation detection is widely used to address this. Its purpose is to determine whether a detected sentence needs a reply, so as to improve the user experience and achieve more natural and effective dialogue interactions.

To improve accuracy, traditional dialogue systems strictly limit the conditions under which they respond. On the one hand, the user must use a specific keyword, similar to an input command, to wake up the system (such as Xiao Ai, Siri, etc.) before the system will reply to a detected sentence. This approach requires users to know the wake-up keywords in advance, is relatively rigid to use, and is not suitable for first-time use by large user groups. On the other hand, the sentences in the environment where the dialogue system is used (that is, the target domain) usually differ considerably from the system's training database (that is, the source domain), so sentences that need a response cannot be correctly identified in the application scenario. For example, the model may be trained on a corpus recorded under relatively quiet conditions, while actual applications may contain various background noises, which causes the system to fail to perform speech recognition correctly.

Because of these two limitations, it is difficult for traditional dialogue systems to provide users with natural and smooth dialogue interactions while maintaining high accuracy. In actual business scenarios, the dialogue system should fully understand the user's intentions in a variety of scenarios and accurately determine whether it needs to respond to a detected sentence. At the same time, it should lower the threshold for use and not require wake-up keywords, so that it can communicate effectively with large numbers of users; otherwise it will disrupt the continuity of the dialogue, degrade the user experience, and affect business development. Therefore, there is an urgent need for a response obligation detection algorithm with higher accuracy to improve the response accuracy of automatic dialogue systems.

Technical problem

The present invention provides a multi-modal response duty detection method, system, electronic device and computer storage medium, and its main purpose is to solve the problem of low efficiency and poor quality of the existing response duty checking method.

Technical solutions

In order to achieve the above objective, the present invention provides a multi-modality-based response obligation detection method, which includes the following steps:

Obtaining training data samples, and saving the training data samples to a training sample data set;

Using the training sample data set to train a preset response obligation detection model so that the response obligation detection model achieves a preset accuracy; wherein the response obligation detection model is used to perform acoustic feature extraction and semantic feature extraction on input data information, and to perform response obligation detection on the input data information according to the extracted acoustic features and semantic features;

The trained response obligation detection model is used to detect the target domain data to be detected to determine whether the system needs to respond to the target domain data to be detected.

In addition, the present invention also provides a multi-modal response obligation detection system, which includes:

The sample set establishment unit is used to obtain training data samples and save the training data samples to the training sample data set;

The model training unit is configured to use the training sample data set to train a preset response obligation detection model, so that the response obligation detection model achieves a preset accuracy; wherein the response obligation detection model is used to perform acoustic feature extraction and semantic feature extraction on input data information, and to perform response obligation detection on the input data information according to the extracted acoustic features and semantic features;

The model application unit is configured to use the trained response obligation detection model to detect the target domain data to be detected to determine whether the system needs to respond to the target domain data to be detected.

In addition, in order to achieve the above objective, the present invention also provides an electronic device including: a memory, a processor, and a multi-modality-based response obligation detection program stored in the memory and runnable on the processor. When the multi-modality-based response obligation detection program is executed by the processor, the following steps are implemented:

In addition, in order to achieve the above objective, the present invention also provides a computer-readable storage medium in which a multi-modality-based response obligation detection program is stored. When the multi-modality-based response obligation detection program is executed by a processor, the steps of the above-mentioned multi-modal response obligation detection method are realized.

Beneficial effect

The multi-modal response obligation detection method, electronic device, and computer-readable storage medium proposed in the present invention provide a response obligation detection model, which is a multi-modal fusion algorithm based on speech features and semantic information. Embedded in an automatic dialogue system, the algorithm realizes response obligation detection in dialogue. Unlike traditional response obligation detection, the algorithm attends to the semantic information of the received sentence as well as to the speech signal: automatic speech recognition converts the speech signal into text, and semantic understanding is carried out on that text. When judging whether the received sentence needs a reply, the acoustic features and semantic information of the sample are considered together. In addition, in view of the large difference between the target domain and the source domain, the present invention uses an adversarial network to reduce the difference in feature distribution between the target domain and the source domain. At the same time, it uses self-supervised learning with the consistency of the two modalities as the learning target to further enhance the domain adaptability of the features: the model detects whether two features from different modalities were extracted from the same sample, and the result of this prediction forms part of the loss function, which supervises the model in learning semantic information and improves its accuracy.

Description of the drawings

FIG. 1 is a flowchart of a preferred embodiment of a method for detecting a response obligation based on a multi-modality according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a preferred embodiment of an electronic device according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of internal logic of a multi-modal response obligation detection program according to an embodiment of the present invention.

The realization of the objectives, functional characteristics and advantages of the present invention will be further described in conjunction with the embodiments and with reference to the accompanying drawings.

The best mode of the present invention

The specific embodiments of the present application will be described in detail below with reference to the accompanying drawings.

Example 1

In order to explain the multi-modality-based response obligation detection method provided by the present invention, FIG. 1 shows the flow of the multi-modality-based response obligation detection method provided according to the present invention.

As shown in Figure 1, the multi-modal response obligation detection method provided by the present invention includes:

S110: Obtain a training data sample, and save the training data sample to a training sample data set.

It should be noted that the training data samples are historical data that have been confirmed by a technician and marked with the corresponding response labels. These labeled data serve as training data samples for the subsequent training and use of the response obligation detection model. For example, a training data sample can be a segment of historical voice information that a technician has marked with a response label (such as reply or no reply).

In addition, in order to make the training data samples more representative of real data, and thereby improve the accuracy of the response obligation detection model described later, historical data can be obtained from two data domains, the target domain and the source domain, as training data samples. The training data samples thus include target domain data samples and source domain data samples. The target domain data samples are sentences from the real environment in which the dialogue system is used, and the source domain data samples are sentences from a traditional preset training database.
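As a minimal sketch of how such a labeled, two-domain training sample data set might be represented, the following uses a small record type. The class name `TrainingSample`, its fields, and the helper `count_by_domain` are hypothetical and are not taken from the application; real samples would also carry the audio itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSample:
    utterance_id: str    # hypothetical identifier for one recorded sentence
    domain: str          # "source" or "target"
    needs_response: int  # response label: 1 = reply, 0 = no reply


def count_by_domain(samples):
    """Count how many samples come from each of the two data domains."""
    counts = {"source": 0, "target": 0}
    for sample in samples:
        if sample.domain not in counts:
            raise ValueError(f"unknown domain: {sample.domain!r}")
        counts[sample.domain] += 1
    return counts


# A tiny illustrative data set mixing both domains.
training_set = [
    TrainingSample("utt-001", "source", 1),
    TrainingSample("utt-002", "source", 0),
    TrainingSample("utt-003", "target", 1),
]
domain_counts = count_by_domain(training_set)
```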

Because the samples in the target domain and the source domain differ considerably (for example, the model may be trained on a corpus recorded under relatively quiet conditions while actual applications contain various background noises), the system may fail to perform speech recognition correctly. That is, although the dialogue system identifies response obligations with high accuracy during training, it cannot correctly identify the sentences that need a response in the actual application scenario. Therefore, the present invention uses both target domain data samples and source domain data samples to train the response obligation detection model described below, thereby significantly improving its recognition accuracy.

In addition, it should be emphasized that, in order to further ensure the privacy and security of the data in the training sample data set, the training sample data set can be stored in a node of the blockchain.

S120: Use the training sample data set to train a preset response obligation detection model so that the response obligation detection model achieves a corresponding preset accuracy; wherein the response obligation detection model is used to perform acoustic feature extraction and semantic feature extraction on input data information, and to perform response obligation detection and labeling on the input data information based on the extracted acoustic features and semantic features.

Specifically, the response obligation detection model mainly includes a multi-modal fusion module, which is used to perform acoustic information feature extraction and semantic feature extraction on source domain data samples and target domain data samples in the training sample data set.

Specifically, in the process of extracting acoustic features, the acoustic features are extracted through Mel Frequency Cepstral Coefficients (MFCC) or Perceptual Linear Prediction (PLP). MFCC and PLP are commonly used existing acoustic feature extraction methods: they extract the frequency-domain features of the speech signal over short time windows and thus obtain combined time-domain and frequency-domain information about the sample. This information is an important feature for distinguishing different phonemes. Since MFCC and PLP are common techniques for acoustic feature extraction, their specific data processing is not repeated here.

It should be noted that, in the model's actual processing, the original signal (training data sample) can be divided into frames for acoustic feature extraction. For example, every 20 ms of the signal can be treated as one frame, within which the speech signal is regarded as a stationary time-series signal, so that frequency-domain information can be extracted from the signal within that period. Commonly used feature extraction methods include calculating MFCC or PLP, both of which model the human auditory system. Generally speaking, PLP has stronger noise robustness, while MFCC is faster to compute. The specific features to adopt can be selected according to the business scenario.
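The framing step can be illustrated with a minimal sketch. It assumes non-overlapping frames and a 16 kHz sample rate, neither of which is specified in the application (practical front ends often use overlapping windows), and the function name `split_into_frames` is hypothetical. Computing MFCC or PLP on each frame is not shown.

```python
def split_into_frames(samples, sample_rate, frame_ms=20):
    """Split a 1-D sequence of audio samples into equal, non-overlapping frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    # Drop any trailing partial frame so that every frame has the same length.
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# One second of a dummy signal at an assumed 16 kHz sample rate:
# 20 ms corresponds to 320 samples, giving 50 frames.
signal = [0.0] * 16000
frames = split_into_frames(signal, sample_rate=16000)
```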

Specifically, in the process of extracting semantic features, an ASR network is used to process the input data information or the acoustic features to obtain the semantic features of the input data information. For example, when extracting semantic information, automatic speech recognition (ASR) can first be performed on the acoustic features. Automatic speech recognition mainly includes two parts: acoustic model processing and decoding search processing. To improve the recognition rate, the acoustic model is nowadays often an end-to-end model; the decoding search part includes the classic connectionist temporal classification (CTC) method or the currently mainstream RNN-T and Transformer networks. After the above automatic speech recognition processing, the acoustic features finally yield the predicted text, which is the speech recognition result, from which the corresponding semantic features are obtained.

Of course, for the extraction of semantic features, speech recognition technology can also be applied directly to the input data information to obtain the corresponding semantic features. It should be noted that speech recognition is an existing technology with many specific implementations; the present invention mainly uses it to obtain the required semantic features, so the specific data processing is not repeated here.

In addition, in order to improve the feature extraction capabilities of the multi-modal fusion module for acoustic and semantic features, the response obligation detection model provided by the present invention also includes an adversarial network module, which performs adversarial training on the target domain data samples and source domain data samples.

Specifically, the adversarial network module includes a first adversarial network and a second adversarial network. In the process of using the training sample data set to train the preset response obligation detection model:

The first adversarial network is used to perform adversarial training on the target domain acoustic features and source domain acoustic features extracted by the response obligation detection model, so that the acoustic feature extraction accuracy of the response obligation detection model reaches a preset accuracy;

The second adversarial network is used to perform adversarial training on the target domain semantic features and source domain semantic features extracted by the response obligation detection model, so that the semantic feature extraction accuracy of the response obligation detection model reaches a preset accuracy.

It should be noted that, in order to reduce the influence of the difference in feature distribution between the target domain and the source domain on the accuracy of the algorithm, the present invention uses the above multi-modal fusion adversarial network. In the adversarial network module, the domain classification loss of the domain classifier can be calculated on the source domain and the target domain respectively. This prevents the domain classifier from focusing only on the less robust modalities during optimization, thereby improving the feature extraction accuracy of the model. Specifically, the loss function of the domain classifier is:

$$L_d = \sum_{x \in (S,T)} \left[ -d \log D^m\left(F^m(x)\right) - (1-d) \log\left(1 - D^m\left(F^m(x)\right)\right) \right]$$

Among them, F^m and D^m represent the feature matrix and the domain classifier in the target domain and the source domain, respectively, and d is the domain label, which indicates whether the current sample belongs to the target domain or the source domain. The domain classifier updates its network parameters by minimizing the domain classification loss L_d, and its output D^m(F^m(x)) is the domain that the classifier predicts for the input data. The label classifier minimizes the label classification loss L_y, thereby improving the model's ability to predict sample labels. The feature extractor, in contrast, aims to maximize the domain classification loss L_d, so that the extracted features are as relevant as possible to judging the response obligation of a sentence and independent of the specific domain. In other words, the model's judgment is not affected by changes in the sample domain and depends only on whether the sample itself needs a response. The two classifiers iterate continuously to reduce the influence of the domain on the recognition of response obligations.
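The domain classification loss above has the form of a summed binary cross-entropy, which can be sketched as follows. The sketch takes the classifier outputs D^m(F^m(x)) as given numbers rather than computing them with a network; the function name and the clamping constant `eps` are illustrative assumptions, as is the choice of which domain is encoded as d = 1.

```python
import math


def domain_classification_loss(domain_probs, domain_labels, eps=1e-12):
    """Sum over samples of -d*log(p) - (1-d)*log(1-p), with p = D(F(x))."""
    total = 0.0
    for p, d in zip(domain_probs, domain_labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -d * math.log(p) - (1 - d) * math.log(1.0 - p)
    return total


# A domain classifier that separates the two domains well has a low loss ...
confident = domain_classification_loss([0.9, 0.1], [1, 0])
# ... while domain-independent features leave it guessing, with a higher loss.
confused = domain_classification_loss([0.5, 0.5], [1, 0])
```

The comparison shows why the feature extractor benefits from a large L_d: when the classifier can do no better than 0.5 on every sample, the features carry little information about the domain.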

In addition, the response obligation detection model also includes a total classifier network, which is used to calculate the final response obligation probability according to the acoustic features and the semantic features.

Specifically, in the process of calculating the final response obligation probability according to the acoustic features and the semantic features, the acoustic response obligation probability and the semantic response obligation probability are first calculated from the acoustic features and the semantic features respectively, and the final response obligation probability is then calculated from these two probabilities; where,

The loss function for calculating the probability of the acoustic response obligation is:

Among them, P(x1) is the acoustic response obligation probability, y is the true value of the training data sample, {S} is the training sample data set, and x1 is the training data sample in {S} used to calculate the acoustic response obligation probability;

The loss function for calculating the probability of the semantic response obligation is:

Among them, P(x2) is the semantic response obligation probability, y is the true value of the training data sample, {S} is the training sample data set, and x2 is the training data sample in {S} used to calculate the semantic response obligation probability;

The loss function for calculating the probability of the final response obligation is:

$$L_y = a \cdot L_y^{\text{speech}} + b \cdot L_y^{\text{semantic}}$$

Among them, a + b = 1, and a and b are the preset weights of the acoustic features and the semantic features, respectively.

More specifically, after the multi-modal fusion module extracts the acoustic features and semantic features, the acoustic features calculated by MFCC or PLP are input into a deep learning network (RNN/CNN/Transducer, etc., that is, the total classifier network) to calculate the probability P(x1) that a reply is needed and the classification loss L_y^speech.

It should be noted that the deep learning network (that is, the total classifier network) can be modeled as a binary classifier, with labels 0 and 1 marking whether a response is required. When data information is input, the value output by the network is the probability that a response is required.

Among them, the classification loss of this part of the total classifier network can be obtained by the following formula

Among them, P(x1) is the probability, as judged by the model, that a response is required, and y is the true value of the sample label. L_y^speech can then be differentiated with respect to the parameters of the network model, and back propagation is used to update those parameters and optimize the network model.

In addition, the total classifier network also processes the semantic features extracted by the multi-modal fusion module through word embedding. Word embedding is the general term for representation learning techniques used by language models in natural language processing. Conceptually, it refers to embedding a high-dimensional space, whose dimension is the number of all words, into a continuous vector space of much lower dimension, so that each word or phrase is mapped to a vector over the real numbers to facilitate subsequent calculations.
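The mapping from words to low-dimensional real vectors can be illustrated with a minimal lookup-table sketch. The table here is randomly initialized and untrained, whereas a real embedding would be learned; the function names, the four-dimensional size, and the zero vector for unknown words are all illustrative assumptions.

```python
import random


def build_embedding_table(vocabulary, dim, seed=0):
    """Assign each word a small random real-valued vector (untrained)."""
    rng = random.Random(seed)
    return {word: [rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for word in vocabulary}


def embed(tokens, table, dim):
    """Map each token to its vector; unknown tokens map to a zero vector."""
    unknown = [0.0] * dim
    return [table.get(token, unknown) for token in tokens]


vocab = ["turn", "on", "the", "light"]
table = build_embedding_table(vocab, dim=4)
# "radio" is not in the vocabulary, so it receives the zero vector.
vectors = embed(["turn", "on", "the", "radio"], table, dim=4)
```

The resulting sequence of vectors is the kind of input that a recurrent network would consume one step at a time.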

The result of the word embedding processing is used as the input of a recurrent neural network (LSTM/GRU, etc.) to calculate the probability P(x2) that the sentence needs a reply and the classification loss L_y^semantic. LSTM and GRU are special recurrent neural networks. Compared with ordinary RNNs, they can retain both long-term and short-term memory and handle long-term dependencies. For example, the LSTM structure includes a forget gate that determines what information to forget at the current step, an input gate that determines what data to update, and an output gate that determines the output information. GRU is a variant of LSTM that combines the forget gate and the input gate into a single update gate, so the resulting model is simpler than the standard LSTM model.

In addition, when updating network parameters, LSTM/GRU also follows the back-propagation rule and uses the derivative of the loss function with respect to the model parameters to update the model. It should be noted that the classification loss of this part of the total classifier network is calculated in the same way as L_y^speech, and the calculation is not repeated here.

In addition, when the label classifier finally predicts the sample, it needs to combine the probabilities and losses of the acoustic features and the semantic information to calculate the response obligation probability and the loss L_y of the sample. The commonly used calculation method is:

$$P(x) = a\,P(x_1) + b\,P(x_2)$$

$$L_y = a \cdot L_y^{\text{speech}} + b \cdot L_y^{\text{semantic}}$$

Among them, P(x) is the final response obligation probability, L_y is the loss function of the final response obligation probability, a + b = 1, and a and b are the preset weights of the acoustic features and the semantic information, respectively.
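The weighted fusion of the two modalities can be sketched as follows. The function name `fuse_modalities` and the example weight a = 0.75 are illustrative assumptions; the application only requires preset weights with a + b = 1.

```python
def fuse_modalities(p_speech, p_semantic, loss_speech, loss_semantic, a=0.5):
    """Combine acoustic and semantic probabilities and losses with a + b = 1."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("weight a must lie in [0, 1]")
    b = 1.0 - a
    probability = a * p_speech + b * p_semantic   # P(x) = a*P(x1) + b*P(x2)
    loss = a * loss_speech + b * loss_semantic    # L_y = a*L_speech + b*L_semantic
    return probability, loss


# Weighting the acoustic branch at 0.75 and the semantic branch at 0.25.
final_probability, label_loss = fuse_modalities(
    p_speech=0.8, p_semantic=0.4, loss_speech=0.2, loss_semantic=0.9, a=0.75
)
```

With these inputs the fused probability is approximately 0.7 and the fused loss approximately 0.375, so the more heavily weighted acoustic branch dominates both values.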

Through the above series of processing, the total classifier network obtains the final response obligation probability, which can be used to perform response obligation detection on the input data information. If the final response obligation probability is 0, the system does not respond; if the final response obligation probability is 1, the system responds.
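Since a network normally outputs a value between 0 and 1 rather than exactly 0 or 1, some rule is needed to map the probability to the two outcomes. The following sketch assumes a cut-off of 0.5, which is not stated in the application; the function name `should_respond` is likewise hypothetical.

```python
def should_respond(final_probability, threshold=0.5):
    """Return True when the system should reply to the detected sentence."""
    if not 0.0 <= final_probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    # The 0.5 default threshold is an assumption made for this sketch.
    return final_probability >= threshold


respond_to_command = should_respond(1.0)
respond_to_aside = should_respond(0.0)
```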

In addition, in order to use the consistency of the acoustic features and the semantic signal within a sample as the classification label for self-supervised representation learning, the present invention also uses a modal consistency detector C, that is, a multi-modal self-supervised learning module. It takes as input the acoustic features and semantic information extracted from samples in the source and target domains, randomly pairs features of different modalities, and detects whether the label classification of the two modalities is consistent. This self-supervised learning can further enhance the representational ability of the features.

The loss function of C is:

$$L_c = \sum_{x \in (S,T)} -c \log C\left(F^0(x), \ldots, F^m(x)\right)$$

Among them, c indicates whether the input modalities are consistent.
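The pairing and the loss can be sketched as follows. The sketch interprets consistency as "both features come from the same sample", which matches the description elsewhere in this application, and it takes the detector outputs C(...) as given numbers instead of computing them with a network. The function names, the 50% mismatch rate, and the clamping constant `eps` are illustrative assumptions.

```python
import math
import random


def make_consistency_pairs(acoustic_feats, semantic_feats, rng):
    """Pair each acoustic feature with a semantic one; c = 1 if same sample."""
    n = len(acoustic_feats)
    pairs = []
    for i in range(n):
        if n > 1 and rng.random() < 0.5:
            # Mismatched pair: semantic feature drawn from a different sample.
            j = rng.choice([k for k in range(n) if k != i])
        else:
            j = i  # matched pair
        pairs.append((acoustic_feats[i], semantic_feats[j], 1 if i == j else 0))
    return pairs


def consistency_loss(predicted_probs, labels, eps=1e-12):
    """Sum of -c * log C(...) over the pairs, following the form of L_c."""
    return sum(-c * math.log(max(p, eps))
               for p, c in zip(predicted_probs, labels))


rng = random.Random(0)
pairs = make_consistency_pairs(["a0", "a1", "a2"], ["s0", "s1", "s2"], rng)
# Only the pair labeled c = 1 contributes to the loss in this form.
example_loss = consistency_loss([0.5, 0.9], [1, 0])
```

Because the labels c are generated by the pairing itself, no manual annotation is needed, which is what makes this objective self-supervised.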

It should be noted that the present invention finally combines the response obligation detection loss with the domain classification loss and the modal consistency classification loss to train the entire network. The loss function used is L = L_y + λ_d L_d + λ_c L_c, where λ_d and λ_c are the weights of the domain classifier loss and the modal consistency detector loss, respectively. The smaller the loss function L of the model, the more accurate the prediction. Therefore, following back propagation, the loss function L is differentiated with respect to the model parameters, and the derivatives are used to update the network parameters and optimize the model.
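The combined objective and a parameter update can be sketched as follows. The weight values, the learning rate, and the use of plain gradient descent are assumptions made for illustration; the application states only that derivatives of L are used to update the parameters. The gradients are supplied as numbers here rather than obtained by back propagation.

```python
def total_loss(l_y, l_d, l_c, lambda_d, lambda_c):
    """L = L_y + lambda_d * L_d + lambda_c * L_c."""
    return l_y + lambda_d * l_d + lambda_c * l_c


def gradient_step(params, grads, learning_rate):
    """Move each parameter against its gradient (plain gradient descent)."""
    return [w - learning_rate * g for w, g in zip(params, grads)]


# Placeholder loss values and weights, chosen only to show the arithmetic.
combined = total_loss(l_y=0.375, l_d=1.0, l_c=0.5, lambda_d=0.1, lambda_c=0.2)
updated = gradient_step(params=[1.0, -2.0], grads=[0.5, -1.0], learning_rate=0.1)
```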

S130: Use the trained response obligation detection model to detect the target domain data to be detected to determine whether the system needs to respond to the target domain data to be detected.

It should be noted that, after the above sample training, the response accuracy of the response obligation detection model is significantly improved through the loss function, adversarial training and other means. The response obligation detection model can then be used to perform detection on the target domain data to be detected.

Specifically, in terms of application scenarios, the model can be applied to services related to automatic conversation, such as intelligent customer service systems. During user interaction, the system usually cannot see the user's facial expressions and can only judge from the voice whether the user is talking to the system. By judging the response obligation, the customer service system can keep waiting while the user is talking to someone else. If the system detects no response obligation for a long time, it can also prompt the user to end the conversation. In addition, the model can be applied to smart home devices, such as Tmall Genie and Xiao Ai, to provide users with more user-friendly services. For example, users do not need to use specific keywords to wake up the system and can state their needs directly; the system receives the instructions and serves the user.

It can be seen from the above technical solution that the multi-modal response obligation detection method proposed by the present invention provides a response obligation detection model, which is a multi-modal fusion algorithm based on speech features and semantic information. Embedded in an automatic dialogue system, the algorithm realizes response obligation detection in dialogue. Unlike traditional response obligation detection, the algorithm attends to the semantic information of the received sentence as well as to the speech signal: automatic speech recognition converts the speech signal into text, and semantic understanding is carried out on that text. When judging whether the received sentence needs a reply, the acoustic features and semantic information of the sample are considered together. In addition, in view of the large difference between the target domain and the source domain, the present invention uses an adversarial network to reduce the difference in feature distribution between the target domain and the source domain. At the same time, it uses self-supervised learning with the consistency of the two modalities as the learning target to further enhance the domain adaptability of the features: the model detects whether two features from different modalities were extracted from the same sample, and the result of this prediction forms part of the loss function, which supervises the model in learning semantic information and improves its accuracy.

It should be understood that the size of the sequence number of each step in the foregoing embodiment does not mean the order of execution. The execution sequence of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiment of the present application.

Example 2

Corresponding to the above method, this application also provides a multi-modal response obligation detection system, which includes:

Example 3

The invention also provides an electronic device 70. Referring to FIG. 2, this figure is a schematic structural diagram of a preferred embodiment of an electronic device 70 provided by the present invention.

In this embodiment, the electronic device 70 may be a terminal device with a computing function, such as a server, a smart phone, a tablet computer, a portable computer, a desktop computer, and the like.

The electronic device 70 includes a processor 71 and a memory 72.

The memory 72 includes at least one type of readable storage medium. The at least one type of readable storage medium may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, or a card-type memory; the readable storage medium may also be a volatile storage medium. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 70, such as a hard disk of the electronic device 70. In other embodiments, the readable storage medium may also be an external memory of the electronic device 70, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) equipped on the electronic device 70.

In this embodiment, the readable storage medium of the memory 72 is generally used to store the multi-modal response obligation detection program 73 installed in the electronic device 70. The memory 72 can also be used to temporarily store data that has been output or will be output.

In some embodiments, the processor 71 may be a central processing unit (CPU), a microprocessor, or another data processing chip, and is used to run program code or process data stored in the memory 72, for example, to execute the multi-modal response obligation detection program 73.

In some embodiments, the electronic device 70 is a terminal device such as a smart phone, a tablet computer, and a portable computer. In other embodiments, the electronic device 70 may be a server.

FIG. 2 only shows the electronic device 70 with the components 71-73, but it should be understood that it is not required to implement all the illustrated components, and more or fewer components may be implemented instead.

Optionally, the electronic device 70 may also include a user interface. The user interface may include an input unit such as a keyboard, a voice input device such as a microphone or another device with a voice recognition function, and a voice output device such as a speaker or earphones. Optionally, the user interface may also include a standard wired interface and a wireless interface.

Optionally, the electronic device 70 may further include a display, and the display may also be referred to as a display screen or a display unit. In some embodiments, it may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) touch device, and the like. The display is used for displaying information processed in the electronic device 70 and for displaying a visualized user interface.

Optionally, the electronic device 70 may also include a touch sensor. The area provided by the touch sensor for the user to perform touch operations is called the touch area. In addition, the touch sensor here may be a resistive touch sensor, a capacitive touch sensor, or the like. Moreover, the touch sensor includes not only a contact type touch sensor, but also a proximity type touch sensor and the like. In addition, the touch sensor may be a single sensor, or may be, for example, a plurality of sensors arranged in an array.

In addition, the area of the display of the electronic device 70 may be the same as or different from the area of the touch sensor. Optionally, the display and the touch sensor are stacked to form a touch display screen. The device detects the touch operation triggered by the user based on the touch screen.

Optionally, the electronic device 70 may also include a radio frequency (RF) circuit, a sensor, an audio circuit, etc., which will not be repeated here.

In the device embodiment shown in FIG. 2, the memory 72, which is a computer storage medium, may include an operating system and a multi-modal response obligation detection program 73; when the processor 71 executes the multi-modal response obligation detection program 73 stored in the memory 72, the following steps are implemented:

In this embodiment, FIG. 3 is an internal logic diagram of the multi-modal response obligation detection program according to an embodiment of the present invention. As shown in FIG. 3, the multi-modal response obligation detection program 73 can also be divided into one or more modules, which are stored in the memory 72 and executed by the processor 71 to complete the present invention. A module, as referred to in the present invention, is a series of computer program instruction segments capable of completing a specific function. Referring to FIG. 3, which is a program module diagram of a preferred embodiment of the multi-modal response obligation detection program 73 in FIG. 2, the program 73 can be divided into a sample set establishment module 74, a model training module 75, and a model application module 76. The functions or operation steps implemented by modules 74-76 are similar to those described above and are not detailed again here. Illustratively:

The sample set establishment module 74 is configured to obtain training data samples and save the training data samples to the training sample data set;

The model training module 75 is configured to use the training sample data set to train a preset response obligation detection model, so that the response obligation detection model reaches a corresponding preset accuracy; wherein the response obligation detection model is used for performing acoustic feature extraction and semantic feature extraction on input data information, and performing response obligation detection on the input data information according to the extracted acoustic features and semantic features;

The model application module 76 is configured to use the trained response obligation detection model to detect the target domain data to be detected to determine whether the system needs to respond to the target domain data to be detected.
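The division of labor between modules 74-76 can be sketched roughly as follows. This is only an illustration under stated assumptions: a small logistic regression stands in for the response obligation detection model, each sample is reduced to a hypothetical pair of numbers (an acoustic score and a semantic score), and all function names and default values are invented for the example.

```python
import math

def build_sample_set(raw_samples):
    """Module 74 (sketch): collect (features, label) training samples into a data set."""
    return [(list(features), int(label)) for features, label in raw_samples]

def _predict(weights, bias, features):
    """Probability that a reply is required, from a logistic model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def train_model(sample_set, target_accuracy=0.95, lr=0.5, max_epochs=1000):
    """Module 75 (sketch): train until the preset accuracy is reached or epochs run out."""
    dim = len(sample_set[0][0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(max_epochs):
        # One pass of stochastic gradient descent on the logistic loss.
        for features, label in sample_set:
            error = _predict(weights, bias, features) - label
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
        # Stop as soon as training accuracy meets the preset target.
        correct = sum((_predict(weights, bias, f) >= 0.5) == bool(y)
                      for f, y in sample_set)
        if correct / len(sample_set) >= target_accuracy:
            break
    return weights, bias

def needs_response(model, features, threshold=0.5):
    """Module 76 (sketch): decide whether the system should reply to an utterance."""
    weights, bias = model
    return _predict(weights, bias, features) >= threshold

# Hypothetical toy data: [acoustic score, semantic score] -> 1 if a reply is required.
raw = [([0.9, 0.8], 1), ([0.8, 0.9], 1), ([0.1, 0.2], 0), ([0.2, 0.1], 0)]
model = train_model(build_sample_set(raw))
```

The stopping rule mirrors the "preset accuracy" condition in the text: training ends once accuracy on the sample set reaches the target, rather than after a fixed number of passes.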

Example 4

The present invention also provides a computer-readable storage medium in which a multi-modality-based response obligation detection program 73 is stored. When the multi-modality-based response obligation detection program 73 is executed by a processor, the following operations are implemented:

The specific implementation of the computer-readable storage medium provided by the present invention is substantially the same as the specific implementation of the above-mentioned multi-modal response obligation detection method and electronic device, and will not be repeated here.

It should be noted that the blockchain referred to in the present invention is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked to one another by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain can include an underlying blockchain platform, a platform product service layer, and an application service layer.
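The idea of data blocks linked by cryptographic methods can be illustrated with a minimal hash-chain sketch in Python. It covers only the linking and tamper-detection aspect and assumes SHA-256; it has no distribution, consensus, or networking, and the block layout, function names, and file names are hypothetical.

```python
import hashlib
import json

GENESIS_PREV_HASH = "0" * 64  # placeholder predecessor hash for the first block

def make_block(samples, prev_hash):
    """Build a block whose hash covers its samples and the previous block's hash."""
    payload = json.dumps({"samples": samples, "prev_hash": prev_hash}, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"samples": samples, "prev_hash": prev_hash, "hash": digest}

def verify_chain(chain):
    """Return True if every block's hash is valid and links to its predecessor."""
    prev_hash = GENESIS_PREV_HASH
    for block in chain:
        expected = make_block(block["samples"], block["prev_hash"])["hash"]
        if block["prev_hash"] != prev_hash or block["hash"] != expected:
            return False
        prev_hash = block["hash"]
    return True

# Two blocks, each holding identifiers of stored training samples.
genesis = make_block(["utterance_001.wav"], GENESIS_PREV_HASH)
second = make_block(["utterance_002.wav"], genesis["hash"])
chain = [genesis, second]
```

Because each hash covers the previous block's hash, altering a stored training sample in any block changes that block's recomputed hash and causes `verify_chain` to fail.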

The above are only preferred embodiments of the present invention and do not limit its scope. Any equivalent structure or equivalent process transformation made using the content of the description and drawings of the present invention, or any direct or indirect application of that content in other related technical fields, is likewise included in the scope of patent protection of the present invention.

Claims

A multi-modality-based response obligation detection method applied to an electronic device, wherein the method includes:

Obtaining training data samples, and saving the training data samples to a training sample data set;

Use the training sample data set to train a preset response obligation detection model so that the response obligation detection model achieves a preset accuracy; wherein the response obligation detection model is used for performing acoustic feature extraction and semantic feature extraction on input data information, and performing response obligation detection on the input data information according to the extracted acoustic features and semantic features;

The trained response obligation detection model is used to detect the target domain data to be detected to determine whether the system needs to respond to the target domain data to be detected.
The method for detecting a response obligation based on a multi-modality according to claim 1, wherein, in the process of extracting acoustic features of the input data information by the response obligation detection model:

Use a Mel cepstrum network or a perceptual linear prediction network to process the input data information to obtain the acoustic characteristics of the input data information.
The method for detecting response obligations based on multimodality according to claim 2, wherein, in the process of extracting semantic features of the input data information by the response obligation detection model:

The ASR network is used to process the input data information or acoustic features to obtain the semantic features of the input data information.
The method for detecting a response obligation based on a multi-modality according to claim 3, wherein:

The training sample data set is stored in the blockchain; and,

The training data sample includes a target domain data sample and a source domain data sample. In the process of using the training sample data set to train a preset response obligation detection model, the target domain data sample and the source domain data sample are used to train the response obligation detection model.
The method for detecting a response obligation based on a multi-modality according to claim 4, wherein the response obligation detection model further includes a first adversarial network and a second adversarial network, and, in the process of training the response obligation detection model using the training sample data set,

The first adversarial network is used to conduct adversarial training on the target domain acoustic features and the source domain acoustic features extracted by the response obligation detection model, so that the acoustic feature extraction accuracy of the response obligation detection model reaches a preset accuracy;

The second adversarial network is used to conduct adversarial training on the target domain semantic features and the source domain semantic features extracted by the response obligation detection model, so that the semantic feature extraction accuracy of the response obligation detection model reaches a preset accuracy.
The method for detecting response obligations based on multimodality according to claim 5, wherein the response obligation detection model further comprises a total classifier network, and the total classifier network is used to calculate the final response obligation probability according to the acoustic features and the semantic features.
The multimodal response duty detection method according to claim 6, wherein the method of calculating the final response duty probability according to the acoustic feature and the semantic feature comprises:

First calculate the acoustic response obligation probability and the semantic response obligation probability respectively according to the acoustic feature and the semantic feature, and then calculate the final response obligation probability according to the acoustic response obligation probability and the semantic response obligation probability; wherein,

The loss function for calculating the probability of the acoustic response obligation is:

Where P(x1) is the acoustic response obligation probability, y is the true value of the training data sample, {S} is the training sample data set, and x1 is the training data sample in {S} used to calculate the acoustic response obligation probability;

The loss function for calculating the probability of the semantic response obligation is:

Where P(x2) is the semantic response obligation probability, y is the true value of the training data sample, {S} is the training sample data set, and x2 is the training data sample in {S} used to calculate the semantic response obligation probability;

The loss function for calculating the probability of the final response obligation is:

$$L_y = a \cdot L_y^{\text{speech}} + b \cdot L_y^{\text{semantic}}$$

Where a + b = 1, and a and b are the preset weights of the acoustic features and the semantic features, respectively.
A response obligation detection system based on multi-modality, wherein the system includes:

The sample set establishment unit is used to obtain training data samples and save the training data samples to the training sample data set;

The model training unit is configured to use the training sample data set to train a preset response obligation detection model, so that the response obligation detection model achieves a preset accuracy; wherein the response obligation detection model is used for performing acoustic feature extraction and semantic feature extraction on input data information, and performing response obligation detection on the input data information according to the extracted acoustic features and semantic features;

The model application unit is configured to use the trained response obligation detection model to detect the target domain data to be detected to determine whether the system needs to respond to the target domain data to be detected.
An electronic device, wherein the electronic device comprises: a memory, a processor, and a multi-modality-based response obligation detection program that is stored in the memory and can run on the processor, and when the multi-modality-based response obligation detection program is executed by the processor, the following steps are implemented:

Obtaining training data samples, and saving the training data samples to a training sample data set;

Use the training sample data set to train a preset response obligation detection model so that the response obligation detection model achieves a preset accuracy; wherein the response obligation detection model is used for performing acoustic feature extraction and semantic feature extraction on input data information, and performing response obligation detection on the input data information according to the extracted acoustic features and semantic features;

The trained response obligation detection model is used to detect the target domain data to be detected to determine whether the system needs to respond to the target domain data to be detected.
The electronic device according to claim 9, wherein, in the process of extracting acoustic features of the input data information by the response obligation detection model:

Use a Mel cepstrum network or a perceptual linear prediction network to process the input data information to obtain the acoustic characteristics of the input data information.
The electronic device according to claim 10, wherein, in the process of extracting semantic features of the input data information by the response obligation detection model:

The ASR network is used to process the input data information or acoustic features to obtain the semantic features of the input data information.
The electronic device according to claim 11, wherein:

The training sample data set is stored in the blockchain; and,

The training data sample includes a target domain data sample and a source domain data sample. In the process of using the training sample data set to train a preset response obligation detection model, the target domain data sample and the source domain data sample are used to train the response obligation detection model.
The electronic device according to claim 12, wherein the response obligation detection model further comprises a first adversarial network and a second adversarial network, and, in the process of training the preset response obligation detection model using the training sample data set,

The first adversarial network is used to conduct adversarial training on the target domain acoustic features and the source domain acoustic features extracted by the response obligation detection model, so that the acoustic feature extraction accuracy of the response obligation detection model reaches a preset accuracy;

The second adversarial network is used to conduct adversarial training on the target domain semantic features and the source domain semantic features extracted by the response obligation detection model, so that the semantic feature extraction accuracy of the response obligation detection model reaches a preset accuracy.
The electronic device according to claim 13, wherein the response obligation detection model further comprises a total classifier network, and the total classifier network is configured to calculate a final response obligation probability based on the acoustic feature and the semantic feature.
The electronic device according to claim 14, wherein the method of calculating the probability of a final response obligation based on the acoustic feature and the semantic feature comprises:

First calculate the acoustic response obligation probability and the semantic response obligation probability respectively according to the acoustic feature and the semantic feature, and then calculate the final response obligation probability according to the acoustic response obligation probability and the semantic response obligation probability; wherein,

The loss function for calculating the probability of the acoustic response obligation is:

Where P(x1) is the acoustic response obligation probability, y is the true value of the training data sample, {S} is the training sample data set, and x1 is the training data sample in {S} used to calculate the acoustic response obligation probability;

The loss function for calculating the probability of the semantic response obligation is:

Where P(x2) is the semantic response obligation probability, y is the true value of the training data sample, {S} is the training sample data set, and x2 is the training data sample in {S} used to calculate the semantic response obligation probability;

The loss function for calculating the probability of the final response obligation is:

$$L_y = a \cdot L_y^{\text{speech}} + b \cdot L_y^{\text{semantic}}$$

Where a + b = 1, and a and b are the preset weights of the acoustic features and the semantic features, respectively.
A computer-readable storage medium, wherein a multi-modality-based response obligation detection program is stored in the computer-readable storage medium, and when the multi-modality-based response obligation detection program is executed by a processor, the following steps of the multi-modality-based response obligation detection method are implemented:

Obtaining training data samples, and saving the training data samples to a training sample data set;

Use the training sample data set to train a preset response obligation detection model so that the response obligation detection model achieves a preset accuracy; wherein the response obligation detection model is used for performing acoustic feature extraction and semantic feature extraction on input data information, and performing response obligation detection on the input data information according to the extracted acoustic features and semantic features;

The trained response obligation detection model is used to detect the target domain data to be detected to determine whether the system needs to respond to the target domain data to be detected.
The computer-readable storage medium according to claim 16, wherein, in the process of extracting acoustic features of the input data information by the response obligation detection model:

Use a Mel cepstrum network or a perceptual linear prediction network to process the input data information to obtain the acoustic characteristics of the input data information.
The computer-readable storage medium according to claim 17, wherein, in the process of extracting semantic features of the input data information by the response obligation detection model:

The ASR network is used to process the input data information or acoustic features to obtain the semantic features of the input data information.
The computer-readable storage medium of claim 18, wherein:

The training sample data set is stored in the blockchain; and,

The training data sample includes a target domain data sample and a source domain data sample. In the process of using the training sample data set to train a preset response obligation detection model, the target domain data sample and the source domain data sample are used to train the response obligation detection model.
The computer-readable storage medium according to claim 19, wherein the response obligation detection model further includes a first adversarial network and a second adversarial network, and, in the process of training the preset response obligation detection model using the training sample data set,

The first adversarial network is used to conduct adversarial training on the target domain acoustic features and the source domain acoustic features extracted by the response obligation detection model, so that the acoustic feature extraction accuracy of the response obligation detection model reaches a preset accuracy;

The second adversarial network is used to conduct adversarial training on the target domain semantic features and the source domain semantic features extracted by the response obligation detection model, so that the semantic feature extraction accuracy of the response obligation detection model reaches a preset accuracy.
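As an illustration of the weighted loss recited in the claims, where the final loss combines an acoustic loss and a semantic loss with preset weights a and b satisfying a + b = 1, the following is a minimal Python sketch and not the applicant's implementation. The individual loss formulas are not reproduced in this text, so binary cross-entropy is assumed here for both modalities; the function names and the default a = 0.5 are hypothetical.

```python
import math

def binary_cross_entropy(probs, labels):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels (assumed loss)."""
    eps = 1e-12
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

def final_loss(acoustic_probs, semantic_probs, labels, a=0.5):
    """Weighted sum a * L_speech + b * L_semantic with b = 1 - a."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1] so that a + b = 1 with b >= 0")
    b = 1.0 - a
    loss_speech = binary_cross_entropy(acoustic_probs, labels)
    loss_semantic = binary_cross_entropy(semantic_probs, labels)
    return a * loss_speech + b * loss_semantic

# Hypothetical per-sample probabilities P(x1), P(x2) and true values y.
labels = [1, 0, 1]
acoustic_probs = [0.8, 0.3, 0.6]
semantic_probs = [0.9, 0.2, 0.7]
loss = final_loss(acoustic_probs, semantic_probs, labels, a=0.4)
```

Setting a = 1 reduces the result to the acoustic loss alone, and a = 0 to the semantic loss alone, which is a quick way to check how strongly each modality drives training.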