CN111899738A - Dialogue generating method, device and storage medium
- Publication number
- CN111899738A (application number CN202010742806.3A)
- Authority
- CN
- China
- Prior art keywords
- signal
- features
- neural network
- modal
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3343—Query execution using phonetics
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/56—Extraction of image or video features relating to colour
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/57—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
Abstract
The application provides a dialog generation method, apparatus, and storage medium. The method acquires a multi-modal signal in a target dialog scene, determines the signal features of the multi-modal signal, performs feature enhancement on those features, inputs the enhanced features into a preset neural network for high-level feature extraction, and inputs the extracted high-level features into another neural network that generates the target dialog sentence. Because the multi-modal signal includes a plurality of a voice signal, an image signal, and a text signal, the acquired information is more comprehensive; because the signal features of the multi-modal signal are enhanced, the enhanced features carry richer information; and because high-level features are extracted by one neural network, the other neural network's ability to understand and reason over the multi-modal information is improved. The generated dialog sentences therefore have higher accuracy and relevance, and the performance of a dialog system based on the embodiments of the application is improved.
Description
Technical Field
The present application relates to computer technologies, and in particular, to a method and an apparatus for generating a dialog, and a storage medium.
Background
With rapid scientific, technological, and economic development, society is gradually shifting toward a service-oriented model that better serves users. The currently popular intelligent dialog systems arose from this idea. After receiving a question from a user, an intelligent dialog system answers it automatically; each round of question and answer forms a dialog between a person and a machine.
In the related art, during a man-machine dialog the intelligent dialog system generally generates reply content based on voice information alone. Taking car navigation as an example, a user asks "What is the route to location A?", and the dialog system in the navigation unit generates a reply from the voice information: it performs semantic analysis on the voice information, extracts the two pieces of entity information "location A" and "route", and then replies accordingly.
However, in this man-machine dialog process, the intelligent dialog system generates a reply from the voice information alone. The information it obtains is limited, and the features extracted from the voice information carry little information, so the generated reply is prone to errors, which degrades the performance of the dialog system.
Disclosure of Invention
In order to solve the problems in the prior art, the present application provides a dialog generation method, apparatus, and storage medium.
In a first aspect, an embodiment of the present application provides a dialog generation method, including:
acquiring multi-modal signals in a target conversation scene, wherein the multi-modal signals comprise a plurality of voice signals, image signals and text signals;
determining signal features of the multi-modal signal;
performing feature enhancement on the signal features to obtain enhanced features;
inputting the enhanced features into a first preset neural network, wherein the first preset neural network is obtained through training of signal features and dialogue sentences of multi-modal signals in a dialogue scene;
and acquiring the target dialogue statement output by the first preset neural network.
In one possible implementation, the performing feature enhancement on the signal feature includes:
and if the multi-modal signal comprises a speech signal, performing speech feature enhancement on signal features of the speech signal, wherein the speech feature enhancement comprises one or more of time domain warping, frequency domain masking and time domain masking.
In one possible implementation, the performing feature enhancement on the signal feature includes:
and if the multi-modal signal comprises an image signal, performing image feature enhancement on the signal features of the image signal, wherein the image feature enhancement comprises one or more of picture cropping, Gaussian blur processing, contrast adjustment, Gaussian noise processing and affine transformation.
In one possible implementation, the performing feature enhancement on the signal feature includes:
performing text feature enhancement on signal features of the text signal if the multi-modal signal comprises a text signal, wherein the text feature enhancement comprises one or more of synonym replacement and context-based word replacement.
In one possible implementation manner, before the inputting the enhanced features into the first preset neural network, the method further includes:
inputting the enhanced features into a second preset neural network, wherein the second preset neural network is obtained through signal feature and high-level feature training;
acquiring a target high-level feature output by the second preset neural network;
the inputting the enhanced features into a first preset neural network comprises:
and inputting the target high-level features into the first preset neural network.
In one possible implementation, the high-level features include one or more of VGGish features of speech, I3D Red Green Blue (RGB) features and I3D Flow features of images, and word vectors of text.
In one possible implementation, the determining signal features of the multi-modal signal includes:
and if the multi-modal signal comprises a voice signal, performing voice preprocessing on the voice signal to obtain the signal features of the voice signal, wherein the voice preprocessing comprises one or more of Voice Activity Detection (VAD), Short-Time Fourier Transform (STFT) and filter-bank (F-BANK) feature extraction.
In one possible implementation, the determining signal features of the multi-modal signal includes:
and if the multi-modal signal comprises an image signal, performing image preprocessing on the image signal to obtain the signal characteristics of the image signal, wherein the image preprocessing comprises one or more of image enhancement and normalization.
In one possible implementation, the determining signal features of the multi-modal signal includes:
if the multi-modal signal comprises a voice signal, inputting the voice signal into a third preset neural network, wherein the third preset neural network is obtained through training on voice signals and the signal features of those voice signals;
and acquiring the signal characteristics of the voice signal output by the third preset neural network.
In one possible implementation, the determining signal features of the multi-modal signal includes:
if the multi-modal signal comprises an image signal, inputting the image signal into a fourth preset neural network, wherein the fourth preset neural network is obtained through training on image signals and the signal features of those image signals;
and acquiring the signal characteristics of the image signal output by the fourth preset neural network.
In a second aspect, an embodiment of the present application provides a dialog generating apparatus, including:
a first acquisition module, configured to acquire multi-modal signals in a target conversation scene, wherein the multi-modal signals comprise a plurality of a voice signal, an image signal, and a text signal;
a determination module to determine signal characteristics of the multi-modal signal;
the enhancement module is used for carrying out feature enhancement on the signal features to obtain enhanced features;
the first input module is used for inputting the enhanced features into a first preset neural network, wherein the first preset neural network is obtained by training signal features and dialogue sentences of multi-modal signals in a dialogue scene;
and the second acquisition module is used for acquiring the target dialogue statement output by the first preset neural network.
In a possible implementation manner, the enhancing module is specifically configured to:
and if the multi-modal signal comprises a speech signal, performing speech feature enhancement on signal features of the speech signal, wherein the speech feature enhancement comprises one or more of time domain warping, frequency domain masking and time domain masking.
In a possible implementation manner, the enhancing module is specifically configured to:
and if the multi-modal signal comprises an image signal, performing image feature enhancement on the signal features of the image signal, wherein the image feature enhancement comprises one or more of picture cropping, Gaussian blur processing, contrast adjustment, Gaussian noise processing and affine transformation.
In a possible implementation manner, the enhancing module is specifically configured to:
performing text feature enhancement on signal features of the text signal if the multi-modal signal comprises a text signal, wherein the text feature enhancement comprises one or more of synonym replacement and context-based word replacement.
In a possible implementation manner, the apparatus further includes:
the second input module is used for inputting the enhanced features into a second preset neural network before the first input module inputs the enhanced features into the first preset neural network, wherein the second preset neural network is obtained through signal feature and high-level feature training;
the third acquisition module is used for acquiring the target high-level features output by the second preset neural network;
the first input module is specifically configured to:
and inputting the target high-level features into the first preset neural network.
In one possible implementation, the high-level features include one or more of VGGish features for speech, I3D RGB features and I3D Flow features for images, and word vectors for text.
In a possible implementation manner, the determining module is specifically configured to:
and if the multi-modal signal comprises a voice signal, performing voice preprocessing on the voice signal to obtain the signal features of the voice signal, wherein the voice preprocessing comprises one or more of VAD, STFT and F-BANK feature extraction.
In a possible implementation manner, the determining module is specifically configured to:
and if the multi-modal signal comprises an image signal, performing image preprocessing on the image signal to obtain the signal characteristics of the image signal, wherein the image preprocessing comprises one or more of image enhancement and normalization.
In a possible implementation manner, the determining module is specifically configured to:
if the multi-modal signal comprises a voice signal, inputting the voice signal into a third preset neural network, wherein the third preset neural network is obtained through training on voice signals and the signal features of those voice signals;
and acquiring the signal characteristics of the voice signal output by the third preset neural network.
In a possible implementation manner, the determining module is specifically configured to:
if the multi-modal signal comprises an image signal, inputting the image signal into a fourth preset neural network, wherein the fourth preset neural network is obtained through training on image signals and the signal features of those image signals;
and acquiring the signal characteristics of the image signal output by the fourth preset neural network.
In a third aspect, an embodiment of the present application provides a server, including:
a processor;
a memory; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor, the computer program comprising instructions for performing the method of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored, and the computer program causes a server to execute the method according to the first aspect.
Compared with the prior art, which acquires only voice information, the dialog generation method and apparatus provided by the embodiments of the present application acquire more comprehensive information. In addition, the signal features of the multi-modal signal are enhanced, so the enhanced features carry richer information, which improves the neural network's ability to understand and reason over the multi-modal information. The generated dialog sentences therefore have higher accuracy and relevance, and the performance of a dialog system based on the embodiments of the present application is improved.
Drawings
To more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present application; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a dialog generation system architecture provided in an embodiment of the present application;
fig. 2 is a schematic flowchart of a dialog generation method according to an embodiment of the present application;
fig. 3 is a schematic flowchart of another dialog generation method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of another dialog generation method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a dialog generation provided by an embodiment of the present application;
fig. 6 is a schematic structural diagram of a dialog generating device according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of another dialog generating device according to an embodiment of the present application;
FIG. 8A is a diagram of one possible basic hardware architecture of a dialog generating device according to an embodiment of the present application;
fig. 8B is another possible basic hardware architecture diagram of a dialog generating device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," and "fourth," if any, in the description and claims of this application and the above-described figures are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Dialog generation in the embodiments of the present application refers to acquiring a multi-modal signal in a dialog scene, where the multi-modal signal includes a plurality of a voice signal, an image signal, and a text signal, and then performing feature enhancement on the signal features of the multi-modal signal so that a neural network generates dialog sentences based on the enhanced features. This improves the neural network's ability to understand and reason over the multi-modal information, and the generated dialog sentences have higher accuracy and relevance.
The dialog generation method provided by the embodiments of the present application can be applied to scenarios such as intelligent terminal assistant systems, car navigation, smart speakers, and human-computer interaction robots; the embodiments of the present application do not specifically limit this.
Optionally, fig. 1 is a schematic diagram of a dialog generation system architecture. Taking car navigation as an example, the architecture in fig. 1 includes a processing device 11 and a plurality of information acquiring devices, such as a voice acquiring device, an image acquiring device, and a text acquiring device; the embodiments of the present application do not specifically limit these. Here, the processing device 11 may be disposed in the navigation system of a car, and the information acquiring devices are exemplified by a voice acquiring device 12, an image acquiring device 13, and a text acquiring device 14.
It is to be understood that the illustrated structure of the embodiments of the present application does not constitute a specific limitation on the dialog generation architecture. In other possible embodiments of the present application, the foregoing architecture may include more or less components than those shown in the drawings, or combine some components, or split some components, or arrange different components, which may be determined according to practical application scenarios, and is not limited herein. The components shown in fig. 1 may be implemented in hardware, software, or a combination of software and hardware.
In a specific implementation process, the number and placement of the voice acquiring device 12, the image acquiring device 13, and the text acquiring device 14 may be determined according to the actual situation; the embodiments of the present application do not specifically limit this. In this application scenario, when a user talks with the car's navigation system while driving, the processing device 11 in the navigation system acquires a multi-modal signal in the dialog scene. Taking a multi-modal signal that includes a voice signal, an image signal, and a text signal as an example, the processing device 11 acquires the voice signal through the voice acquiring device 12, the image signal through the image acquiring device 13, and the text signal through the text acquiring device 14. The processing device 11 then performs feature enhancement on the signal features of the multi-modal signal and generates a dialog sentence from the enhanced features through a neural network. Because the processing device 11 acquires a multi-modal signal, it obtains the information in the dialog scene more comprehensively; because it enhances the signal features, the enhanced features carry richer information, which improves the neural network's ability to understand and reason over the multi-modal information. The generated dialog sentences therefore have higher accuracy and relevance, the dialog performance of the navigation system improves, and the user obtains accurate navigation information and a better experience from the dialog.
In addition, the system architecture and the service scenario described in the embodiment of the present application are for more clearly illustrating the technical solution of the embodiment of the present application, and do not constitute a limitation to the technical solution provided in the embodiment of the present application, and it can be known by a person skilled in the art that along with the evolution of the system architecture and the appearance of a new service scenario, the technical solution provided in the embodiment of the present application is also applicable to similar technical problems.
The technical solutions of the present application are described below with several embodiments as examples, and the same or similar concepts or processes may not be described in detail in some embodiments.
Fig. 2 is a flowchart illustrating a dialog generating method according to an embodiment of the present application, where the dialog generating method according to the embodiment of the present application may be executed by the processing device 11 in fig. 1, and the device may be implemented by software and/or hardware. As shown in fig. 2, the dialog generating method provided in the embodiment of the present application includes the following steps:
s201: a multi-modal signal in a target dialog scene is acquired, the multi-modal signal including a plurality of a speech signal, an image signal, and a text signal.
The target dialogue scene may be determined according to an actual situation, for example, in fig. 1, the user has a dialogue with a navigation system on an automobile during driving, which is not particularly limited in the embodiment of the present application.
Modality refers to the way things happen or exist, such as sound, images, text, etc. Here, the above-mentioned multi-modal signal includes a plurality of voice signals, image signals, and text signals, wherein the image signals include pictures and/or video signals, and the like.
For example, the manner of acquiring the multi-modal signal in the target dialog scene may be determined according to actual situations, for example, in fig. 1, the processing device 11 acquires the voice signal in the dialog scene through the voice acquiring device 12, acquires the image signal in the dialog scene through the image acquiring device 13, and acquires the text signal in the dialog scene through the text acquiring device 14, which is not limited in particular by the embodiment of the present application.
S202: signal features of the multi-modal signal are determined.
Here, taking a multi-modal signal that includes a speech signal as an example, the signal features of the speech signal include a spectrogram, F-BANK features, and the like.
In one possible implementation, if the multi-modal signal includes a speech signal, the speech signal may be subjected to speech pre-processing to obtain signal characteristics of the speech signal, wherein the speech pre-processing includes one or more of VAD, STFT, and F-BANK.
In addition, if the multi-modal signal includes a speech signal, the speech signal may be input into a third preset neural network trained on speech signals and their signal features, so as to obtain the signal features of the speech signal output by the third preset neural network.
In the embodiments of the present application, the signal features of the voice signal may be extracted through VAD, STFT, F-BANK, and the like, or through a deep-learning method such as a neural network; this may be determined according to the situation and is not specifically limited in the embodiments of the present application. A sketch of the non-neural route follows.
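As an illustration only, the following is a minimal sketch of the STFT and F-BANK preprocessing named above; the frame length, hop size, sample rate, and the simplified mel filter-bank construction are assumptions for the example, not parameters specified by the patent.

```python
import numpy as np
from scipy.signal import stft

def speech_features(wave: np.ndarray, sr: int = 16000, n_mels: int = 40):
    # Short-time Fourier transform -> magnitude spectrogram (freq, time).
    _, _, spec = stft(wave, fs=sr, nperseg=400, noverlap=240)  # 25 ms frames, 10 ms hop
    mag = np.abs(spec)

    # Triangular mel filter bank (F-BANK) applied to the magnitude spectrogram.
    n_freq = mag.shape[0]
    mel_pts = np.linspace(0.0, 2595 * np.log10(1 + (sr / 2) / 700), n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor(hz_pts / (sr / 2) * (n_freq - 1)).astype(int)
    fbank = np.zeros((n_mels, n_freq))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return np.log(fbank @ mag + 1e-8)  # log F-BANK features, shape (n_mels, time)
```

A VAD stage would typically run before this, dropping silent frames so that only speech regions are featurized.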
In one possible implementation, if the multi-modal signal includes an image signal, the image signal may be subjected to image pre-processing to obtain a signal characteristic of the image signal, wherein the image pre-processing includes one or more of image enhancement and normalization.
In addition, if the multi-modal signal includes an image signal, the image signal may be input into a fourth preset neural network trained on image signals and their signal features, so as to obtain the signal features of the image signal output by the fourth preset neural network.
Here, the signal features of the image signal may be extracted through methods such as image enhancement and normalization, or through a neural network such as VGGish or an ImageNet-pretrained model; this may be determined according to the situation and is not specifically limited in the embodiments of the present application. A sketch of the enhancement-plus-normalization route follows.
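As an illustration only, the following is a minimal sketch of image enhancement followed by normalization as named above; the contrast-stretch style of enhancement and the per-channel statistics are assumptions for the example.

```python
import numpy as np

def preprocess_frame(frame: np.ndarray) -> np.ndarray:
    """frame: HxWx3 uint8 image -> normalized float32 feature map."""
    img = frame.astype(np.float32)
    # Enhancement: stretch each channel to the full [0, 255] range.
    lo = img.min(axis=(0, 1), keepdims=True)
    hi = img.max(axis=(0, 1), keepdims=True)
    img = (img - lo) / np.maximum(hi - lo, 1e-6) * 255.0
    # Normalization: zero mean, unit variance per channel.
    mean = img.mean(axis=(0, 1), keepdims=True)
    std = img.std(axis=(0, 1), keepdims=True)
    return (img - mean) / np.maximum(std, 1e-6)
```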
S203: and performing characteristic enhancement on the signal characteristics to obtain enhanced characteristics.
Illustratively, taking a multi-modal signal that includes a voice signal, an image signal, and a text signal as an example: after determining the signal features of the voice, image, and text signals, the processing device 11 enhances those features to obtain enhanced voice, image, and text features, i.e., the enhanced features. The enhanced features carry richer information, which improves the subsequent neural network's ability to understand and reason over the multi-modal information and yields more accurate dialog sentences.
S204: and inputting the enhanced features into a first preset neural network, wherein the first preset neural network is obtained by training signal features and dialogue sentences of multi-modal signals in a dialogue scene.
S205: and acquiring a target dialogue statement output by the first preset neural network.
The processing device 11 trains a first preset neural network by using a large number of signal features and dialogue sentences of multi-modal signals in a dialogue scene, and inputs the enhanced features into the first preset neural network after the training is completed, so as to obtain a target dialogue sentence output by the first preset neural network.
In the embodiments of the present application, a multi-modal signal in a target dialog scene is acquired, its signal features are determined, feature enhancement is performed on those features to obtain enhanced features, and the enhanced features are input into a first preset neural network, so that the target dialog sentence output by the first preset neural network is obtained. Because the multi-modal signal includes a plurality of a voice signal, an image signal, and a text signal, the acquired information is more comprehensive than in the prior art, which acquires only voice information. In addition, enhancing the signal features makes the enhanced features carry richer information, which improves the neural network's ability to understand and reason over the multi-modal information. The generated dialog sentences therefore have higher accuracy and relevance, and the performance of a dialog system based on the embodiments of the present application is improved.
When performing feature enhancement on the signal features, the embodiments of the present application consider how to do so when the multi-modal signal includes a speech signal, when it includes an image signal, and when it includes a text signal. Fig. 3 is a flowchart illustrating another dialog generation method according to an embodiment of the present application. As shown in fig. 3, the method includes:
s301: a multi-modal signal in a target dialog scene is acquired, the multi-modal signal including a plurality of a speech signal, an image signal, and a text signal.
S302: signal features of the multi-modal signal are determined.
The steps S301 to S302 are the same as the steps S201 to S202, and are not described herein again.
S303: and if the multi-modal signal comprises a voice signal, performing voice feature enhancement on the signal feature of the voice signal, wherein the voice feature enhancement comprises one or more of time domain distortion, frequency domain mask and time domain mask.
Here, time domain warping randomly applies a nonlinear deformation to the signal features of the speech signal along the time axis, thereby enhancing them.
Frequency domain masking applies a mask along the frequency axis of the signal features, with randomly chosen window size and position; for example, with a window length of 5 and 1-2 windows, the features within the selected frequency range are set to 0, erasing that part of the signal features and thereby enhancing them. Similarly, time domain masking applies a mask along the time axis, again with randomly chosen window size and position; for example, with a window length of 10 ms and 1-2 windows, the features within the selected time range are set to 0, erasing those features and thereby enhancing the signal features of the speech signal. A sketch of both masks follows.
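As an illustration only, the following is a minimal sketch of the frequency and time masking described above, in the SpecAugment style; the window widths and mask counts mirror the illustrative values in the text (a frequency window of 5 bins, a ~10 ms time window at a 10 ms hop, 1-2 windows each) and are assumptions, not fixed parameters of the patent.

```python
import numpy as np

def mask_features(feat: np.ndarray, f_width: int = 5, t_width: int = 1) -> np.ndarray:
    """feat: (freq_bins, time_frames) speech feature matrix."""
    out = feat.copy()
    rng = np.random.default_rng()
    for _ in range(rng.integers(1, 3)):                  # 1-2 frequency masks
        f0 = rng.integers(0, max(out.shape[0] - f_width, 1))
        out[f0:f0 + f_width, :] = 0.0                    # erase a frequency band
    for _ in range(rng.integers(1, 3)):                  # 1-2 time masks
        t0 = rng.integers(0, max(out.shape[1] - t_width, 1))
        out[:, t0:t0 + t_width] = 0.0                    # erase a time span
    return out
```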
S304: if the multi-modal signal comprises an image signal, performing image feature enhancement on signal features of the image signal, wherein the image feature enhancement comprises one or more of picture cropping, Gaussian blur processing, contrast adjustment, Gaussian noise processing and affine change.
Here, taking an image signal that includes a video signal as an example, picture cropping crops each frame of the video with a certain probability, and contrast adjustment changes the contrast of each frame, thereby enhancing the signal features of the image signal.
Gaussian blur processing adds Gaussian blur to each frame with a certain probability (for example, 50%); similarly, Gaussian noise processing adds Gaussian noise to each frame, again to enhance the signal features of the image signal.
An affine transformation applies changes including translation, rotation, scaling, and shear to each frame, thereby enhancing the signal features of the image signal. A sketch combining these operations follows.
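As an illustration only, the following is a minimal per-frame sketch of the image enhancements listed above, assuming torchvision; every probability, kernel size, and parameter range is an assumption for the example.

```python
import torch
from torchvision import transforms

frame_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),          # picture cropping
    transforms.RandomApply([transforms.GaussianBlur(5)], p=0.5),  # Gaussian blur at 50%
    transforms.ColorJitter(contrast=0.4),                         # contrast adjustment
    transforms.RandomAffine(degrees=10, translate=(0.1, 0.1),
                            scale=(0.9, 1.1), shear=5),           # affine transformation
    transforms.ToTensor(),
    transforms.Lambda(lambda t: t + 0.01 * torch.randn_like(t)),  # Gaussian noise
])
```

Applied independently to each video frame (as a PIL image), this yields a randomly perturbed tensor per frame.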
S305: and if the multi-modal signal comprises a text signal, performing text feature enhancement on the signal feature of the text signal, wherein the text feature enhancement comprises one or more of synonym replacement and context-based word replacement.
Here, the synonym replacement means performing synonym replacement on the text signal, and the context-based word replacement means performing word replacement on the text signal based on the context content of the text signal, thereby performing feature enhancement on the signal feature of the text signal.
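As an illustration only, the following is a minimal sketch of synonym replacement; the synonym table is a hypothetical stand-in for whatever lexicon (for example, a WordNet-style resource) an implementation would actually consult.

```python
import random

SYNONYMS = {  # hypothetical lexicon for the example
    "route": ["path", "way"],
    "location": ["place", "spot"],
}

def synonym_replace(tokens: list, p: float = 0.2) -> list:
    """Replace each token with a random synonym with probability p."""
    return [random.choice(SYNONYMS[t]) if t in SYNONYMS and random.random() < p
            else t
            for t in tokens]

# e.g. synonym_replace(["the", "route", "to", "location", "a"])
```

Context-based word replacement would instead choose the substitute using a model conditioned on the surrounding words of the text signal.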
In addition to the above-mentioned manners of feature enhancement, the embodiments of the present application may also enhance the signal features with other techniques, which may be determined according to the actual situation; the embodiments of the present application do not specifically limit this.
S306: after the feature enhancement is carried out, an enhanced feature is obtained, and the enhanced feature is input into a first preset neural network, wherein the first preset neural network is obtained through training of signal features and dialogue sentences of multi-modal signals in a dialogue scene.
S307: and acquiring a target dialogue statement output by the first preset neural network.
The implementation of steps S306 to S307 is similar to that of steps S204 to S205, and is not described herein again.
In the embodiments of the present application, the signal features of the different modalities are enhanced in modality-specific ways, which meets the different requirements of various application scenarios in which the multi-modal signal comprises a plurality of a voice signal, an image signal, and a text signal.
In addition, before the enhanced features are input into the first preset neural network, the enhanced features are also input into the second preset neural network, and high-level features are extracted. Fig. 4 is a flowchart illustrating another dialog generation method according to an embodiment of the present application. As shown in fig. 4, the method includes:
s401: a multi-modal signal in a target dialog scene is acquired, the multi-modal signal including a plurality of a speech signal, an image signal, and a text signal.
S402: signal features of the multi-modal signal are determined.
S403: and performing characteristic enhancement on the signal characteristics to obtain enhanced characteristics.
The steps S401 to S403 are the same as the steps S201 to S203, and are not described herein again.
S404: and inputting the enhanced features into a second preset neural network, wherein the second preset neural network is obtained by training signal features and high-level features, and the high-level features comprise one or more of VGGish features of voice, I3D RGB features and I3DFlow features of images and word vectors of texts.
In the embodiment of the application, after the signal characteristics of the multi-modal signal are determined and the signal characteristics are subjected to characteristic enhancement, high-level characteristic extraction is performed through a second preset neural network, wherein the high-level characteristics include but are not limited to VGGish characteristics of voice, I3D RGB characteristics and I3D Flow characteristics of images, word vectors of texts and the like, so that information contained in the characteristics input into a subsequent first neural network is richer, the understanding and reasoning capability of the first neural network on the multi-modal information is improved, and an accurate dialogue sentence is generated.
S405: and acquiring the high-level target features output by the second preset neural network, and inputting the high-level target features into the first preset neural network, wherein the first preset neural network is obtained by training the signal features of the multi-mode signals in the dialogue scene and the dialogue sentences.
S406: and acquiring a target dialogue statement output by the first preset neural network.
Illustratively, as shown in fig. 5, taking as an example target high-level features that include the VGGish feature of speech, the I3D RGB feature and I3D Flow feature of an image, and word vectors of text, the target high-level features are input into a first preset neural network, which may be a multi-layer attention model, and the first preset neural network outputs the target dialogue sentence. A sketch of such a network follows.
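As an illustration only, the following is a minimal sketch, assuming PyTorch, of a first preset neural network in the multi-layer attention style described above: each modality's high-level features are projected to a shared width, fused by stacked attention over all modalities, and decoded into response tokens. The feature widths (128 for VGGish, 1024 for I3D, 300 for word vectors), layer counts, vocabulary size, and omission of positional encodings are all assumptions for the example.

```python
import torch
import torch.nn as nn

class MultiModalDialogNet(nn.Module):
    def __init__(self, feat_dims: dict, d_model: int = 256, n_layers: int = 2,
                 vocab_size: int = 10000):
        super().__init__()
        # One linear projection per modality into the shared model width.
        self.proj = nn.ModuleDict({k: nn.Linear(d, d_model) for k, d in feat_dims.items()})
        enc = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.fuse = nn.TransformerEncoder(enc, num_layers=n_layers)   # cross-modal attention
        dec = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decode = nn.TransformerDecoder(dec, num_layers=n_layers)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, feats: dict, reply_tokens: torch.Tensor) -> torch.Tensor:
        # feats: {modality: (batch, seq_k, dim_k)} target high-level features.
        memory = torch.cat([self.proj[k](v) for k, v in feats.items()], dim=1)
        memory = self.fuse(memory)                  # attention across all modalities
        tgt = self.embed(reply_tokens)              # (batch, len, d_model)
        causal = torch.triu(torch.full((tgt.size(1), tgt.size(1)), float("-inf"),
                                       device=tgt.device), diagonal=1)
        return self.out(self.decode(tgt, memory, tgt_mask=causal))  # token logits

net = MultiModalDialogNet({"vggish": 128, "i3d_rgb": 1024, "i3d_flow": 1024, "text": 300})
```

Training such a network on pairs of multi-modal signal features and reference dialogue sentences (for example, cross-entropy over the shifted reply tokens) matches the training data described for the first preset neural network.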
Compared with the prior art, which acquires only voice information, the embodiments of the present application acquire more comprehensive information, enhance the signal features of the multi-modal signal, and extract high-level features through the second preset neural network. The features input into the subsequent first neural network therefore carry richer information, which improves the first neural network's ability to understand and reason over the multi-modal information and yields accurate dialog sentences.
Corresponding to the dialog generation method of the foregoing embodiments, fig. 6 is a schematic structural diagram of a dialog generating device according to an embodiment of the present application. For convenience of explanation, only the portions related to the embodiments of the present application are shown. The dialog generating device 60 includes: a first obtaining module 601, a determining module 602, an enhancing module 603, a first input module 604, and a second obtaining module 605. The dialog generating device here may be the processing device described above, or a chip or an integrated circuit that implements the functions of the processing device. It should be noted that the division into the first obtaining module, the determining module, the enhancing module, the first input module, and the second obtaining module is only a division of logical functions; physically, they may be integrated or separate.
The first obtaining module 601 is configured to obtain a multi-modal signal in a target dialog scene, where the multi-modal signal includes a plurality of a voice signal, an image signal, and a text signal.
A determining module 602 for determining signal features of the multi-modal signal.
An enhancing module 603, configured to perform feature enhancement on the signal feature to obtain an enhanced feature.
The first input module 604 is configured to input the enhanced features into a first preset neural network, where the first preset neural network is obtained by training signal features of a multi-modal signal in a dialog scene and a dialog statement.
A second obtaining module 605, configured to obtain the target dialog statement output by the first preset neural network.
In one possible design, the enhancement module 603 is specifically configured to:
and if the multi-modal signal comprises a speech signal, performing speech feature enhancement on signal features of the speech signal, wherein the speech feature enhancement comprises one or more of time domain warping, frequency domain masking and time domain masking.
In one possible design, the enhancement module 603 is specifically configured to:
and if the multi-modal signal comprises an image signal, performing image feature enhancement on the signal features of the image signal, wherein the image feature enhancement comprises one or more of picture cropping, Gaussian blur processing, contrast adjustment, Gaussian noise processing and affine transformation.
In one possible design, the enhancement module 603 is specifically configured to:
performing text feature enhancement on signal features of the text signal if the multi-modal signal comprises a text signal, wherein the text feature enhancement comprises one or more of synonym replacement and context-based word replacement.
In one possible design, the determining module 602 is specifically configured to:
and if the multi-modal signal comprises a voice signal, performing voice preprocessing on the voice signal to obtain the signal features of the voice signal, wherein the voice preprocessing comprises one or more of VAD, STFT and F-BANK feature extraction.
In one possible design, the determining module 602 is specifically configured to:
and if the multi-modal signal comprises an image signal, performing image preprocessing on the image signal to obtain the signal characteristics of the image signal, wherein the image preprocessing comprises one or more of image enhancement and normalization.
In a possible implementation manner, the determining module 602 is specifically configured to:
if the multi-modal signal comprises a voice signal, inputting the voice signal into a third preset neural network, wherein the third preset neural network is obtained through training on voice signals and the signal features of those voice signals;
and acquiring the signal characteristics of the voice signal output by the third preset neural network.
In a possible implementation manner, the determining module 602 is specifically configured to:
if the multi-modal signal comprises an image signal, inputting the image signal into a fourth preset neural network, wherein the fourth preset neural network is obtained through training on image signals and the signal features of those image signals;
and acquiring the signal characteristics of the image signal output by the fourth preset neural network.
The apparatus provided in the embodiment of the present application may be configured to implement the technical solution of the method embodiment, and the implementation principle and the technical effect are similar, which are not described herein again in the embodiment of the present application.
Fig. 7 is a schematic structural diagram of another dialog generating device according to an embodiment of the present application. As shown in fig. 7, in addition to fig. 6, the dialog generating device 60 further includes: a second input module 606 and a third acquisition module 607.
The second input module 606 is configured to input the enhanced features into a second preset neural network before the first input module 604 inputs the enhanced features into the first preset neural network, where the second preset neural network is obtained through signal feature and high-level feature training.
A third obtaining module 607, configured to obtain the target high-level feature output by the second preset neural network.
The first input module 604 is specifically configured to:
and inputting the target high-level features into the first preset neural network.
In one possible design, the high-level features include one or more of VGGish features for speech, I3D RGB features and I3D Flow features for images, and word vectors for text.
The apparatus provided in the embodiment of the present application may be configured to implement the technical solution of the method embodiment, and the implementation principle and the technical effect are similar, which are not described herein again in the embodiment of the present application.
Alternatively, fig. 8A and 8B each schematically provide one possible basic hardware architecture of the dialog generating device described in the present application.
Referring to fig. 8A and 8B, a dialog generating device 800 comprises at least one processor 801 and a communication interface 803. Further optionally, a memory 802 and a bus 804 may also be included.
The dialog generating device 800 may be the processing device described above; the present application is not limited to this. In the dialog generating device 800, there may be one or more processors 801; fig. 8A and 8B illustrate only one. Alternatively, the processor 801 may be a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), or a Digital Signal Processor (DSP). If the dialog generating device 800 has multiple processors 801, they may be of different types or of the same type. Alternatively, the multiple processors 801 of the dialog generating device 800 may be integrated as a multi-core processor.
The communication interface 803 may provide information input/output for the at least one processor. It may also include any one or any combination of the following devices with network access functions: a network interface (for example, an Ethernet interface), a wireless network card, and the like.
Optionally, the communication interface 803 may also be used for the dialog generating device 800 to communicate data with other computing devices or terminals.
Optionally, as shown by the thick line in fig. 8A and 8B, a bus 804 may connect the processor 801 with the memory 802 and the communication interface 803. Via the bus 804, the processor 801 can access the memory 802 and can also interact with other computing devices or terminals through the communication interface 803.
In the present application, the dialog generating device 800 executes computer instructions in the memory 802, so that the dialog generating device 800 implements the dialog generating method provided by the present application, or so that the dialog generating device 800 deploys the dialog generating means described above.
From the perspective of logical functional division, illustratively, as shown in fig. 8A, the memory 802 may include therein a first obtaining module 601, a determining module 602, an enhancing module 603, a first input module 604, and a second obtaining module 605. "Include" here merely means that the instructions stored in the memory, when executed, can implement the functions of the first obtaining module, the determining module, the enhancing module, the first input module, and the second obtaining module, respectively; it does not limit the physical structure.
The first obtaining module 601 is configured to obtain a multi-modal signal in a target dialog scene, where the multi-modal signal includes a plurality of a voice signal, an image signal, and a text signal.
A determining module 602 for determining signal features of the multi-modal signal.
An enhancing module 603, configured to perform feature enhancement on the signal feature to obtain an enhanced feature.
The first input module 604 is configured to input the enhanced features into a first preset neural network, where the first preset neural network is obtained by training signal features of a multi-modal signal in a dialog scene and a dialog statement.
A second obtaining module 605, configured to obtain the target dialog statement output by the first preset neural network.
In one possible design, the enhancement module 603 is specifically configured to:
and if the multi-modal signal comprises a speech signal, performing speech feature enhancement on signal features of the speech signal, wherein the speech feature enhancement comprises one or more of time domain warping, frequency domain masking and time domain masking.
In one possible design, the enhancement module 603 is specifically configured to:
and if the multi-modal signal comprises an image signal, performing image feature enhancement on the signal features of the image signal, wherein the image feature enhancement comprises one or more of picture cropping, Gaussian blur processing, contrast adjustment, Gaussian noise processing and affine transformation.
In one possible design, the enhancement module 603 is specifically configured to:
performing text feature enhancement on signal features of the text signal if the multi-modal signal comprises a text signal, wherein the text feature enhancement comprises one or more of synonym replacement and context-based word replacement.
In one possible design, the determining module 602 is specifically configured to:
and if the multi-modal signal comprises a voice signal, performing voice preprocessing on the voice signal to obtain the signal features of the voice signal, wherein the voice preprocessing comprises one or more of VAD, STFT and F-BANK feature extraction.
In one possible design, the determining module 602 is specifically configured to:
and if the multi-modal signal comprises an image signal, performing image preprocessing on the image signal to obtain the signal characteristics of the image signal, wherein the image preprocessing comprises one or more of image enhancement and normalization.
In a possible implementation manner, the determining module 602 is specifically configured to:
if the multi-modal signal comprises a voice signal, inputting the voice signal into a third preset neural network, wherein the third preset neural network is obtained through training on voice signals and the signal features of those voice signals;
and acquiring the signal characteristics of the voice signal output by the third preset neural network.
In a possible implementation manner, the determining module 602 is specifically configured to:
if the multi-modal signal comprises an image signal, inputting the image signal into a fourth preset neural network, wherein the fourth preset neural network is obtained through training on image signals and the signal features of those image signals;
and acquiring the signal characteristics of the image signal output by the fourth preset neural network.
Illustratively, as shown in fig. 8B, the memory 802 may further include a second input module 606 and a third obtaining module 607. "Include" here merely means that the instructions stored in the memory, when executed, can implement the functions of the second input module and the third obtaining module, respectively; it does not limit the physical structure.
The second input module 606 is configured to input the enhanced features into a second preset neural network before the first input module 604 inputs the enhanced features into the first preset neural network, wherein the second preset neural network is trained on signal features and high-level features.
A third obtaining module 607, configured to obtain the target high-level feature output by the second preset neural network.
The first input module 604 is specifically configured to:
input the target high-level features into the first preset neural network.
In one possible design, the high-level features include one or more of VGGish features for speech, I3D RGB features and I3D Flow features for images, and word vectors for text.
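For illustration, the per-modality high-level features could be fused by concatenation before entering the first preset neural network. The dimensions below follow commonly published sizes (VGGish: 128, I3D RGB and Flow: 1024 each, word vectors: 300); both the sizes and the fusion scheme are assumptions, since this application leaves the combination method open:

```python
import torch
import torch.nn as nn

class HighLevelFusion(nn.Module):
    """Concatenates per-modality high-level features and projects them to a
    shared hidden size; a hypothetical sketch, not the disclosed design."""

    def __init__(self, hidden_dim=512):
        super().__init__()
        self.project = nn.Linear(128 + 1024 + 1024 + 300, hidden_dim)

    def forward(self, vggish, i3d_rgb, i3d_flow, word_vec):
        fused = torch.cat([vggish, i3d_rgb, i3d_flow, word_vec], dim=-1)
        return torch.relu(self.project(fused))
```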
In addition, the dialogue generating device may be implemented in software, as in figs. 8A and 8B, or in hardware, as a hardware module or a circuit unit.
The present application further provides a computer-readable storage medium storing computer instructions that, when executed, cause a computing device to perform the dialogue generation method provided herein.
The present application further provides a chip comprising at least one processor and a communication interface, the communication interface providing information input and/or output for the at least one processor. The chip may further include at least one memory for storing computer instructions. The at least one processor is configured to call and execute the computer instructions to perform the dialogue generation method provided in the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is only one logical division, and other divisions are possible in practice; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, devices, or units, and may be electrical, mechanical, or of another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Claims (20)
1. A dialog generation method, comprising:
acquiring a multi-modal signal in a target dialogue scene, wherein the multi-modal signal comprises a plurality of signal types among a speech signal, an image signal, and a text signal;
determining signal features of the multi-modal signal;
performing feature enhancement on the signal features to obtain enhanced features;
inputting the enhanced features into a first preset neural network, wherein the first preset neural network is trained on signal features of multi-modal signals in a dialogue scene and the corresponding dialogue sentences;
and acquiring the target dialogue sentence output by the first preset neural network.
2. The method of claim 1, wherein the feature enhancing the signal feature comprises:
if the multi-modal signal comprises a speech signal, performing speech feature enhancement on the signal features of the speech signal, wherein the speech feature enhancement comprises one or more of time-domain warping, frequency-domain masking, and time-domain masking.
3. The method of claim 1, wherein the feature enhancing the signal feature comprises:
if the multi-modal signal comprises an image signal, performing image feature enhancement on the signal features of the image signal, wherein the image feature enhancement comprises one or more of image cropping, Gaussian blur processing, contrast adjustment, Gaussian noise processing, and affine transformation.
4. The method of claim 1, wherein the feature enhancing the signal feature comprises:
if the multi-modal signal comprises a text signal, performing text feature enhancement on the signal features of the text signal, wherein the text feature enhancement comprises one or more of synonym replacement and context-based word replacement.
5. The method of any one of claims 1 to 4, further comprising, prior to said inputting the enhanced features into a first preset neural network:
inputting the enhanced features into a second preset neural network, wherein the second preset neural network is trained on signal features and high-level features;
acquiring a target high-level feature output by the second preset neural network;
the inputting the enhanced features into a first preset neural network comprises:
inputting the target high-level features into the first preset neural network.
6. The method of claim 5, wherein the high-level features comprise one or more of VGGish features for speech, I3D red-green-blue (RGB) features and I3D Flow features for images, and word vectors for text.
7. The method of any of claims 1 to 4, wherein said determining signal features of the multi-modal signal comprises:
if the multi-modal signal comprises a speech signal, performing speech preprocessing on the speech signal to obtain the signal features of the speech signal, wherein the speech preprocessing comprises one or more of voice activity detection (VAD), short-time Fourier transform (STFT), and filter-bank (F-BANK) feature extraction.
8. The method of any of claims 1 to 4, wherein said determining signal features of the multi-modal signal comprises:
if the multi-modal signal comprises an image signal, performing image preprocessing on the image signal to obtain the signal features of the image signal, wherein the image preprocessing comprises one or more of image enhancement and normalization.
9. The method of any of claims 1 to 4, wherein said determining signal features of the multi-modal signal comprises:
if the multi-modal signal comprises a speech signal, inputting the speech signal into a third preset neural network, wherein the third preset neural network is trained on speech signals and the signal features of those speech signals;
and acquiring the signal features of the speech signal output by the third preset neural network.
10. The method of any of claims 1 to 4, wherein said determining signal features of the multi-modal signal comprises:
if the multi-modal signal comprises an image signal, inputting the image signal into a fourth preset neural network, wherein the fourth preset neural network is trained on image signals and the signal features of those image signals;
and acquiring the signal features of the image signal output by the fourth preset neural network.
11. A dialog generation device, comprising:
a first obtaining module, configured to acquire a multi-modal signal in a target dialogue scene, wherein the multi-modal signal comprises a plurality of signal types among a speech signal, an image signal, and a text signal;
a determining module, configured to determine signal features of the multi-modal signal;
an enhancement module, configured to perform feature enhancement on the signal features to obtain enhanced features;
a first input module, configured to input the enhanced features into a first preset neural network, wherein the first preset neural network is trained on signal features of multi-modal signals in a dialogue scene and the corresponding dialogue sentences;
and a second obtaining module, configured to acquire the target dialogue sentence output by the first preset neural network.
12. The apparatus according to claim 11, wherein the enhancement module is specifically configured to:
if the multi-modal signal comprises a speech signal, perform speech feature enhancement on the signal features of the speech signal, wherein the speech feature enhancement comprises one or more of time-domain warping, frequency-domain masking, and time-domain masking.
13. The apparatus according to claim 11, wherein the enhancement module is specifically configured to:
if the multi-modal signal comprises an image signal, perform image feature enhancement on the signal features of the image signal, wherein the image feature enhancement comprises one or more of image cropping, Gaussian blur processing, contrast adjustment, Gaussian noise processing, and affine transformation.
14. The apparatus according to claim 11, wherein the enhancement module is specifically configured to:
if the multi-modal signal comprises a text signal, perform text feature enhancement on the signal features of the text signal, wherein the text feature enhancement comprises one or more of synonym replacement and context-based word replacement.
15. The apparatus of any one of claims 11 to 14, further comprising:
a second input module, configured to input the enhanced features into a second preset neural network before the first input module inputs the enhanced features into the first preset neural network, wherein the second preset neural network is trained on signal features and high-level features;
and a third obtaining module, configured to acquire the target high-level features output by the second preset neural network;
the first input module is specifically configured to:
input the target high-level features into the first preset neural network.
16. The apparatus of claim 15, wherein the high-level features comprise one or more of VGGish features for speech, I3D RGB features and I3D Flow features for images, and word vectors for text.
17. The apparatus according to any one of claims 11 to 14, wherein the determining module is specifically configured to:
if the multi-modal signal comprises a speech signal, perform speech preprocessing on the speech signal to obtain the signal features of the speech signal, wherein the speech preprocessing comprises one or more of VAD, STFT, and F-BANK feature extraction.
18. The apparatus according to any one of claims 11 to 14, wherein the determining module is specifically configured to:
if the multi-modal signal comprises an image signal, perform image preprocessing on the image signal to obtain the signal features of the image signal, wherein the image preprocessing comprises one or more of image enhancement and normalization.
19. A dialog generating device, comprising:
a processor;
a memory; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor, the computer program comprising instructions for performing the method of any one of claims 1 to 10.
20. A computer-readable storage medium storing a computer program that causes a server to execute the method of any one of claims 1 to 10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010742806.3A CN111899738A (en) | 2020-07-29 | 2020-07-29 | Dialogue generating method, device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111899738A (en) | 2020-11-06 |
Family
ID=73182515
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010742806.3A Pending CN111899738A (en) | 2020-07-29 | 2020-07-29 | Dialogue generating method, device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111899738A (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130132308A1 (en) * | 2011-11-22 | 2013-05-23 | Gregory Jensen Boss | Enhanced DeepQA in a Medical Environment |
CN105574133A (en) * | 2015-12-15 | 2016-05-11 | 苏州贝多环保技术有限公司 | Multi-mode intelligent question answering system and method |
US20170155631A1 (en) * | 2015-12-01 | 2017-06-01 | Integem, Inc. | Methods and systems for personalized, interactive and intelligent searches |
CN109949821A (en) * | 2019-03-15 | 2019-06-28 | 慧言科技(天津)有限公司 | A method of far field speech dereverbcration is carried out using the U-NET structure of CNN |
CN110196930A (en) * | 2019-05-22 | 2019-09-03 | 山东大学 | A kind of multi-modal customer service automatic reply method and system |
CN110263217A (en) * | 2019-06-28 | 2019-09-20 | 北京奇艺世纪科技有限公司 | A kind of video clip label identification method and device |
CN111061847A (en) * | 2019-11-22 | 2020-04-24 | 中国南方电网有限责任公司 | Dialogue generation and corpus expansion method and device, computer equipment and storage medium |
CN111312292A (en) * | 2020-02-18 | 2020-06-19 | 北京三快在线科技有限公司 | Emotion recognition method and device based on voice, electronic equipment and storage medium |
CN111344717A (en) * | 2019-12-31 | 2020-06-26 | 深圳市优必选科技股份有限公司 | Interactive behavior prediction method, intelligent device and computer-readable storage medium |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116383365A (en) * | 2023-06-01 | 2023-07-04 | 广州里工实业有限公司 | Learning material generation method and system based on intelligent manufacturing and electronic equipment |
CN116383365B (en) * | 2023-06-01 | 2023-09-08 | 广州里工实业有限公司 | Learning material generation method and system based on intelligent manufacturing and electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106887225B (en) | Acoustic feature extraction method and device based on convolutional neural network and terminal equipment | |
US20240021202A1 (en) | Method and apparatus for recognizing voice, electronic device and medium | |
CN108492818B (en) | Text-to-speech conversion method and device and computer equipment | |
CN110174942B (en) | Eye movement synthesis method and device | |
WO2022062800A1 (en) | Speech separation method, electronic device, chip and computer-readable storage medium | |
CN112365878A (en) | Speech synthesis method, device, equipment and computer readable storage medium | |
CN103514882A (en) | Voice identification method and system | |
US11776563B2 (en) | Textual echo cancellation | |
CN110781329A (en) | Image searching method and device, terminal equipment and storage medium | |
CN116797695A (en) | Interaction method, system and storage medium of digital person and virtual whiteboard | |
CN110970030A (en) | Voice recognition conversion method and system | |
CN113012680B (en) | Speech technology synthesis method and device for speech robot | |
CN111899738A (en) | Dialogue generating method, device and storage medium | |
CN114694654A (en) | Audio processing method and device, terminal equipment and computer readable storage medium | |
US20230081543A1 (en) | Method for synthetizing speech and electronic device | |
CN111916057A (en) | Language identification method and device, electronic equipment and computer readable storage medium | |
CN111813989B (en) | Information processing method, apparatus and storage medium | |
CN114121010A (en) | Model training, voice generation, voice interaction method, device and storage medium | |
CN114566156A (en) | Keyword speech recognition method and device | |
CN112951274A (en) | Voice similarity determination method and device, and program product | |
CN114595314A (en) | Emotion-fused conversation response method, emotion-fused conversation response device, terminal and storage device | |
CN113051426A (en) | Audio information classification method and device, electronic equipment and storage medium | |
CN112885366A (en) | Active noise reduction method and device, storage medium and terminal | |
CN112331209A (en) | Method and device for converting voice into text, electronic equipment and readable storage medium | |
CN114467141A (en) | Voice processing method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20201106 |