CN113127708B - Information interaction method, device, equipment and storage medium - Google Patents

Information interaction method, device, equipment and storage medium Download PDF

Info

Publication number
CN113127708B
CN113127708B (application CN202110423719.6A)
Authority
CN
China
Prior art keywords
information
scene
generation model
knowledge graph
interaction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110423719.6A
Other languages
Chinese (zh)
Other versions
CN113127708A
Inventor
王永超
苏志铭
刘权
陈志刚
刘聪
胡国平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
iFlytek Co Ltd
Original Assignee
University of Science and Technology of China USTC
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC, iFlytek Co Ltd filed Critical University of Science and Technology of China USTC
Priority to CN202110423719.6A priority Critical patent/CN113127708B/en
Priority to PCT/CN2021/105959 priority patent/WO2022222286A1/en
Publication of CN113127708A publication Critical patent/CN113127708A/en
Application granted granted Critical
Publication of CN113127708B publication Critical patent/CN113127708B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/9032Query formulation
    • G06F16/90332Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3343Query execution using phonetics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses an information interaction method, device, equipment and storage medium. A reply generation model is first obtained through unsupervised training on cross-language, cross-scene multi-modal training data together with a scene knowledge graph library. After multi-modal data in the current interaction scene is acquired, the scene knowledge graph library can be consulted and the multi-modal data processed with the reply generation model, which outputs reply information for interaction and thereby realizes the man-machine interaction process. Because the reply generation model is trained with cross-language and cross-scene multi-modal training data, it is applicable to cross-language and cross-scene interaction; different interaction systems do not need to be built separately for different languages and different scenes, so the difficulty of system development and deployment is avoided.

Description

Information interaction method, device, equipment and storage medium
Technical Field
The present application relates to the field of man-machine interaction technologies, and in particular, to an information interaction method, apparatus, device, and storage medium.
Background
With progress in speech recognition and natural language understanding technologies, man-machine interaction terminals have moved from the laboratory into practical use in many scenes, such as in-vehicle, enterprise and medical scenes.
However, current interactive systems are all built for a single scene and a single language: for different interaction scenes, such as in-vehicle and medical, a separate interaction system must be designed for each scene. Likewise, a voice interaction system is not universal across the habits of different populations in different countries, so an interaction system has to be designed separately for a specific scene of a specific population speaking a specific language, which makes the development workload large and deployment difficult.
Disclosure of Invention
In view of the above problems, the present application provides an information interaction method, apparatus, device and storage medium, so as to support man-machine interaction across languages and across scenes. The specific scheme is as follows:
an information interaction method, comprising:
Acquiring multi-mode data in a current interaction scene, wherein the multi-mode data comprises video information, audio information and/or text information in a man-machine interaction process;
referring to a pre-configured scene knowledge graph library, processing the multi-mode data based on a pre-trained reply generation model, and outputting reply information for interaction, wherein the scene knowledge graph library comprises scene knowledge graphs corresponding to different scenes one by one;
The reply generation model is obtained by training the multi-mode training data of cross languages and cross scenes and the scene knowledge graph base in an unsupervised mode.
Preferably, the training process of the reply generation model includes:
acquiring cross-language and cross-scene multi-mode training data and a pre-configured scene knowledge graph library;
aligning video information, audio information and text information contained in the multi-modal training data;
and taking the aligned multi-mode training data as a sample to be input, referring to the scene knowledge graph library, and training a reply generation model by taking the characters which are blocked in the text information contained in the multi-mode training data as targets.
Preferably, the aligning the video information, the audio information and the text information contained in the multimodal training data includes:
Extracting the characteristics of each video frame in the video information to obtain a video characteristic vector corresponding to the video information;
performing discretization representation on the video feature vector to obtain a video feature vector aligned with each character in the text information one by one;
extracting the characteristics of each voice frame in the audio information to obtain an audio characteristic vector corresponding to the audio information;
and carrying out discretization representation on the audio feature vector to obtain the audio feature vector aligned with each character in the text information one by one.
Preferably, the taking the aligned multi-modal training data as sample input, referring to the scene knowledge graph library, and training the reply generation model with the goal of predicting the blocked characters in the text information contained in the multi-modal training data includes:
splicing video information, audio information and text information contained in the input aligned multi-mode training data by using a reply generation model to obtain splicing characteristics;
selecting an adapted scene knowledge graph from the scene knowledge graph base based on the splicing features, and representing the selected scene knowledge graph as a knowledge graph vector feature;
predicting the blocked characters in the text information based on the splicing characteristics and the knowledge-graph vector characteristics by utilizing a reply generation model;
And training a reply generation model by taking the blocked character predicted by the reply generation model as a target that the blocked character approaches to the real blocked character in the text information.
Preferably, the multimodal training data further includes location information, and the using the reply generation model to splice video information, audio information and text information included in the input aligned multimodal training data to obtain a splice feature includes:
And splicing the video information, the audio information, the text information and the position information contained in the input aligned multi-mode training data by utilizing a reply generation model to obtain splicing characteristics.
Preferably, the processing the multi-mode data based on the pre-trained reply generation model by referring to the pre-configured scene knowledge graph library, and outputting reply information for interaction, includes:
splicing video information, audio information and/or text information contained in the multi-mode data by using a reply generation model to obtain splicing characteristics;
selecting an adapted scene knowledge graph from the scene knowledge graph base based on the splicing features, and representing the selected scene knowledge graph as a knowledge graph vector feature;
and predicting and outputting reply information for interaction based on the splicing characteristics and the knowledge graph vector characteristics by using a reply generation model.
Preferably, the multi-modal data further includes location information;
The method for splicing the video information, the audio information and/or the text information contained in the multi-mode data by utilizing the reply generation model to obtain splicing characteristics comprises the following steps:
and splicing the position information, the video information, the audio information and/or the text information contained in the multi-mode data by utilizing a reply generation model to obtain splicing characteristics.
An information interaction device, comprising:
The multi-mode data acquisition unit is used for acquiring multi-mode data in the current interaction scene, wherein the multi-mode data comprises video information, audio information and/or text information in the man-machine interaction process;
the reply information generation unit is used for referring to a pre-configured scene knowledge graph library, processing the multi-mode data based on a pre-trained reply generation model and outputting reply information for interaction, wherein the scene knowledge graph library comprises scene knowledge graphs corresponding to different scenes one by one;
The reply generation model is obtained by training the multi-mode training data of cross languages and cross scenes and the scene knowledge graph base in an unsupervised mode.
An information interaction device, comprising: a memory and a processor;
the memory is used for storing programs;
The processor is configured to execute the program to implement the steps of the information interaction method described above.
A storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the information interaction method as described above.
By means of the above technical scheme, the information interaction method of the application first uses cross-language, cross-scene multi-modal training data and a scene knowledge graph library to obtain a reply generation model through unsupervised training. After multi-modal data in the current interaction scene is acquired, the scene knowledge graph library can be consulted and the multi-modal data processed with the reply generation model, which outputs reply information for interaction and thereby realizes the man-machine interaction process. Because the reply generation model is trained with cross-language and cross-scene multi-modal training data, it is applicable to cross-language and cross-scene interaction; different interaction systems do not need to be built separately for different languages and different scenes, so the difficulty of system development and deployment is avoided.
Meanwhile, the application further introduces a scene knowledge graph library containing scene knowledge graphs that correspond one-to-one to different scenes. Because the reply generation model also refers to this library during training, the model can automatically select the scene knowledge graph matching the current interaction scene from the input multi-modal data and generate reply information based on it, so the generated reply information is better adapted to the current man-machine interaction scene and the accuracy of man-machine interaction is improved.
Still further, the reply generation model is trained on cross-language, cross-scene multi-modal training data in an unsupervised manner. Unlike the existing supervised training mode, this naturally allows all available multi-modal data to be used for training, which greatly increases the amount of training data, removes the need for manual labeling, and thus saves labor cost.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a schematic flow chart of an information interaction method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a training process of a reply generation model according to an embodiment of the present application;
FIG. 3 illustrates a schematic diagram of a multi-modal pre-training model training process;
FIG. 4 illustrates a schematic diagram of a language model training process;
FIG. 5 is a schematic flow chart of generating reply information by a reply generation model according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an information interaction device according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an information interaction device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The application provides an information interaction scheme which can be applied to machine generation of reply information for interaction under a man-machine interaction scene. The man-machine interaction scene can be a plurality of different scenes, such as a vehicle-mounted scene, an enterprise scene, a medical scene and the like.
The scheme of the application can be realized based on the terminal with the data processing capability, and the terminal can be a mobile phone, a computer, a server, a cloud terminal and the like.
Next, as described in connection with fig. 1, the information interaction method of the present application may include the following steps:
Step S100, multi-mode data in the current interaction scene is obtained.
The multimodal data may include video information, audio information, and/or text information of a human-machine interaction process.
The application can collect multi-modal data during the man-machine interaction process with sensing devices, for example collecting video information with a camera, audio information with a recording device, and text information with an input device. Of course, the text information may also be obtained by extracting subtitles from the collected video information or by performing speech recognition on the collected video or audio information.
The video information may include feature information of the interaction object as well as information about the surrounding environment. The feature information of the interaction object may be an image of the interaction object together with personal attribute information analyzed from the image, such as country, occupation, age and interest preferences. Taking country as an example, collecting the country information of the interaction object allows the current interaction scene to be determined more accurately: if the video information shows that the interaction object is Indian and is currently eating, the current interaction scene can be determined to be an Indian person eating, and the scene knowledge graph corresponding to that scene can then be conveniently found in the scene knowledge graph library so as to optimize the output interaction content.
The audio information includes voice information of the interactive object, such as voice content, tone, timbre, etc.
The text information may include text content entered by the interactive object during the interaction.
It should be noted that the multi-modal data under the current interaction scenario obtained in this step may include multi-modal data at the current moment and historical multi-modal data, such as multi-modal data of historical multi-round interaction. The historical multi-mode data can assist in accurately generating the reply information.
Step S110, referring to a pre-configured scene knowledge graph library, processing the multi-mode data based on a pre-trained reply generation model, and outputting reply information for interaction.
The scene knowledge graph library comprises scene knowledge graphs corresponding to different scenes one by one. According to the method, different scenes where the man-machine interaction is located are collected and summarized in advance, so that a scene knowledge graph corresponding to each scene one by one is constructed, the scene knowledge graph contains priori knowledge of the corresponding scene and is used for assisting a reply generation model to accurately generate reply information matched with the scene, and the aim of disambiguation of the interaction scene is achieved.
Disambiguation of the interaction scene means that when the interaction object says the same sentence in different interaction scenes, the reply generation model should give the best reply in combination with the current scene rather than simply feeding back the same interaction result.
For example, if a user in an in-vehicle interaction scene says "I want to play basketball", the interaction result should take the current scene into account and help the user navigate to a nearby basketball court. If instead the user says this at home, the voice interaction result should first report today's weather and wait for the user's feedback; if the weather is bad but the user still insists on playing basketball, the final voice interaction result is to navigate the user to an indoor basketball court.
For another example, the interaction scene also reflects the country of the interaction object: interaction objects from different countries are in different interaction scenes.
The same word may have different meanings in different countries. Take the word "dump" as an example: in the United Kingdom it may be understood in the sense of a "collision" or "crash", while in Sweden it carries the meaning of "dumping". If only the interaction speech is considered and the interaction scene is ignored, the given interaction result may contain a large error. For instance, a Swedish person carrying a garbage bag to throw away garbage asks the man-machine interaction system: "I want to dump, please tell me what kind of garbage batteries belong to?" If the man-machine interaction system does not take into account that the sentence is spoken by a Swede, it may interpret the word in the conventional sense of "crash" and conclude that the user wants to crash, in which case a correct interaction response cannot be given. With the present scheme, the multi-modal data (for example, video and images of the interacting user) show that the current user is Swedish and that the image contains a garbage bag, so the current interaction scene can be determined to be a Swedish person throwing away garbage; "dump" can then be accurately understood as "dumping garbage", and the correct interaction response information can be given.
The reply generation model can avoid such ambiguity only by using different scene knowledge graphs in different interaction scenes; avoiding ambiguity and improving the user's interaction experience is precisely the purpose of using the scene knowledge graphs.
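For illustration only, a scene knowledge graph library could be held as a mapping from scene identifiers to sets of prior-knowledge triples; the patent does not prescribe a storage format, and the scene names and triples in the following Python sketch are hypothetical examples drawn from the basketball scenario above.

```python
# Hypothetical sketch of a scene knowledge graph library: one graph per scene,
# each graph a set of (head, relation, tail) triples of prior knowledge.
scene_knowledge_graph_library = {
    "in_vehicle": {
        ("play basketball", "requires_place", "basketball court"),
        ("basketball court", "found_via", "navigation"),
    },
    "at_home": {
        ("outdoor sport", "depends_on", "weather"),
        ("bad weather", "suggests", "indoor basketball court"),
    },
}

def lookup(scene: str):
    """Return the scene knowledge graph adapted to the given scene."""
    return scene_knowledge_graph_library[scene]

print(lookup("in_vehicle"))
```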
In this embodiment, the reply generation model is obtained by training the multi-mode training data of cross languages and cross scenes and the scene knowledge graph base in an unsupervised manner.
A large amount of multi-modal data exists on the Internet, naturally covering multiple languages and multiple scenes; because it is unlabeled, it can be used directly for unsupervised training. In addition, the video, audio and text information in multi-modal data obtained from the Internet are aligned with each other, so the amount of information is very high. By introducing this multilingual, multi-scene, unlabeled multi-modal training data and fusing it for unsupervised learning, the differences and commonalities among different countries and different populations can be learned better.
According to the application, the multi-modal training data of cross languages and cross scenes and the scene knowledge graph library are used for training the reply generation model, so that the same reply generation model can be used for generating reply information for interaction aiming at different interaction scenes, different countries and different crowds, the development cost is greatly reduced, and the model deployment is convenient.
Meanwhile, the multi-mode training data is used in the reply generation model training, namely the reply generation model training has the same semantic space capable of representing voice, images and characters, so that more accurate reply information can be generated by inputting the multi-mode data in the current interaction scene in the reply information generation stage, and the man-machine interaction experience is improved.
Of course, in practice only part of the multi-modal data may be available, for example text information only, without audio and video; the reply generation model then simply has fewer information sources, and model decoding can still be performed. Because the reply generation model is trained on multi-modal training data and has implicitly learned multi-modal knowledge from a large amount of unsupervised multi-modal training data, the generated reply remains more reasonable and reliable than that of existing reply generation systems even when the input lacks data of some modalities.
In one embodiment of the application, the training process of the reply generation model described above is introduced.
The application can collect a large amount of cross-language and cross-scene multi-mode training data. The multi-mode training data can be obtained through a network, a large amount of multi-mode data exist on the Internet, the multi-mode data naturally comprise multiple languages and multiple scenes, the multi-mode data comprise at least one or more of audio, video and text, and the data are non-supervision sample data without labels and can be directly used as the multi-mode training data in model training. The model can learn the general information of each language and each scene.
In order to better utilize the multi-modal training data to train the reply generation model, the method can align the multi-modal training data, namely, align video information, audio information and text information contained in the multi-modal training data.
Specifically, taking video stream data obtained from the internet as an example, audio, video and text information may be extracted from the video stream, and it is further necessary to align the audio, video and text information.
For video information:
The video information contains multi-frame video. The application can extract the characteristics of each video frame in the video information, and the extracted characteristics form the video characteristic vector corresponding to the video information. For example, for each video frame, feature extraction may be performed by convolutional neural network CNN or other means to obtain the corresponding features for each video frame.
In order to align the video information with the text information, the video feature vector may be represented in a discretized manner based on the number of characters included in the text information, so as to obtain a video feature vector aligned with each character in the text information one by one. The process of discretizing the video feature vector may use a clustering manner, that is, using the number of characters contained in the text information as a clustering number, to cluster the continuous video feature vector, so as to obtain a video feature vector aligned to each character in the text information.
For audio information:
The audio information contains multi-frame speech. The application can extract the characteristics of each voice frame in the audio information, and the extracted characteristics form the audio characteristic vector corresponding to the audio information. For example, for each voice frame, a waveform diagram of the voice frame is first obtained, and then feature extraction may be performed on the waveform diagram through a recurrent neural network RNN or other manners to obtain features corresponding to each voice frame.
In order to align the voice information with the text information, the audio feature vector may be represented in a discretized manner based on the number of characters contained in the text information, so as to obtain an audio feature vector aligned with each character in the text information one by one. The process of discretizing the audio feature vector may use a clustering manner, that is, using the number of characters contained in the text information as a clustering number, to cluster continuous audio feature vectors, so as to obtain an audio feature vector aligned with each character in the text information.
For text information:
To obtain text information, the text information may be extracted from the video stream data. If subtitles exist in the video stream data, the subtitles can be extracted by using OCR technology to obtain text information. If no subtitle exists in the video stream data, the dialogue information in the video stream can be identified by using a voice identification technology to obtain text information.
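For illustration only, the discretized alignment described above for video (and, analogously, audio) features could be sketched as follows. The use of K-means, the feature dimensions and the function name align_features_to_characters are assumptions of this sketch, not details disclosed by the patent.

```python
import numpy as np
from sklearn.cluster import KMeans

def align_features_to_characters(frame_features, num_chars):
    """Discretize a (num_frames x dim) sequence of per-frame feature vectors into
    num_chars vectors, one aligned with each character of the text information.

    Sketch assumption: the continuous frame features are clustered with K-means
    (K = number of characters), and the cluster centroids are ordered by the mean
    frame index of their members, preserving the temporal order of the utterance."""
    kmeans = KMeans(n_clusters=num_chars, n_init=10, random_state=0)
    labels = kmeans.fit_predict(frame_features)
    order = np.argsort([np.mean(np.where(labels == k)[0]) for k in range(num_chars)])
    return kmeans.cluster_centers_[order]                     # (num_chars, dim)

# Example: 120 video frames with 256-dim CNN features aligned to a 6-character sentence;
# the same routine applies to per-frame audio features extracted with an RNN.
video_features = np.random.randn(120, 256).astype(np.float32)
aligned_video = align_features_to_characters(video_features, num_chars=6)
print(aligned_video.shape)   # (6, 256)
```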
Furthermore, in order to enable the reply generation model to comprehensively consider the current interaction scene to generate the reply information adapted to the current interaction scene when generating the reply information, in the embodiment of the application, a scene knowledge graph base can be configured in advance. The scene knowledge graph library comprises scene knowledge graphs corresponding to different interaction scenes one by one.
After video information, audio information and text information contained in the multimodal training data are aligned, the aligned multimodal training data can be used as a sample to be input, the scene knowledge graph library is referred to, characters which are blocked in the text information contained in the multimodal training data are predicted as targets, and a reply generation model is trained.
Referring to FIG. 2, which illustrates the training process of the reply generation model, the method may specifically include the steps of:
And step 200, splicing video information, audio information and text information contained in the input aligned multi-mode training data by using a reply generation model to obtain splicing characteristics.
Step S210, selecting an adaptive scene knowledge graph from the scene knowledge graph base based on the splicing features, and representing the selected scene knowledge graph as a knowledge graph vector feature.
Specifically, through iterative training the reply generation model acquires the capability of selecting, based on the splicing features, the scene knowledge graph that matches the current man-machine interaction scene from the scene knowledge graph library. That is, selecting the adapted scene knowledge graph based on the splicing features does not need to be implemented with hand-written rules; it is an inherent capability of the trained reply generation model. Compared with selecting the adapted scene knowledge graph according to preset rules, this avoids the problem that a faulty or incomplete rule set leads to a selected scene knowledge graph that does not match the current man-machine interaction scene.
And S220, predicting the blocked characters in the text information based on the splicing features and the knowledge-graph vector features by utilizing a reply generation model.
And step S230, training a reply generation model by taking the blocked character predicted by the reply generation model as a target that the blocked character approaches to the real blocked character in the text information.
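For illustration only, one training step covering steps S200-S230 could be sketched in PyTorch as follows. The module name ReplyGenerationModel, the soft (differentiable) graph selection, the Transformer encoder and all dimensions are assumptions of this sketch rather than details disclosed by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReplyGenerationModel(nn.Module):
    """Illustrative sketch of steps S200-S230; names and sizes are assumptions."""

    def __init__(self, dim=256, vocab_size=6000, num_graphs=32):
        super().__init__()
        self.graph_embeddings = nn.Embedding(num_graphs, dim)   # one vector per scene knowledge graph
        self.graph_scorer = nn.Linear(dim, num_graphs)          # scores graphs from the splicing feature
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.char_head = nn.Linear(dim, vocab_size)             # predicts the blocked (masked-out) characters

    def forward(self, video, audio, text):
        # Step S200: splice the aligned video, audio and text features (each B x L x dim).
        spliced = torch.cat([video, audio, text], dim=1)
        # Step S210: select the adapted scene knowledge graph and represent it as a vector feature;
        # a soft selection is used here so the scorer can be trained end to end.
        weights = F.softmax(self.graph_scorer(spliced.mean(dim=1)), dim=-1)
        graph_vec = (weights @ self.graph_embeddings.weight).unsqueeze(1)
        # Step S220: predict the blocked characters from the splicing and graph vector features.
        hidden = self.encoder(torch.cat([graph_vec, spliced], dim=1))
        return self.char_head(hidden[:, 1:1 + text.size(1)])

# Step S230: train so the predicted characters approach the real blocked characters.
model = ReplyGenerationModel()
video, audio, text = (torch.randn(4, 10, 256) for _ in range(3))
true_chars = torch.randint(0, 6000, (4, 10))
loss = F.cross_entropy(model(video, audio, text).reshape(-1, 6000), true_chars.reshape(-1))
loss.backward()
```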
In this embodiment, the reply generation model is trained with reference to the scene knowledge graph library, which ensures that it holds the common knowledge of each scene and can use that knowledge to assist the final reply generation. For example, if a user in an electronics-mall interaction scene says "I want to learn about apples", the interaction should recommend relevant attributes of the Apple phone to the user; if the user says the same sentence in a fruit-store interaction scene, the interaction should recommend relevant attributes of the fruit apple. Only by using different scene knowledge graphs in different scenes can such ambiguity be avoided; avoiding ambiguity is exactly the purpose of using the scene knowledge graphs, and it improves the user's interaction experience.
In some embodiments of the present application, the multimodal training data may include, in addition to the video information, the audio information, and the text information described above, location information, that is, location information where a human-machine interaction is located. By further introducing the position information, the reply generation model can be more accurately assisted to generate proper reply information.
On the basis that the multimodal training data further includes location information, the process of performing feature stitching in the step S210 may specifically include:
And splicing the video information, the audio information, the text information and the position information contained in the input aligned multi-mode training data by utilizing a reply generation model to obtain splicing characteristics.
Specifically, the aligned video information, audio information and text information may be first spliced to obtain the preliminary splice feature. Further, the position information and the preliminary splicing feature are subjected to secondary splicing, and a final splicing feature is obtained.
After the final stitching feature is obtained, the reply generation model may select an adapted scene knowledge-graph based on the final stitching feature, as well as predict occluded characters in the text information.
By further adding the position information to the multi-modal training data, accuracy of reply information generated by the reply generation model can be improved.
In some embodiments of the application, the reply generation model may include a multi-modal pre-training model. The multi-modal pre-training model is similar to a masked language model; its inputs include video information, audio information and the aligned text information, and may further include location information. In the training stage of the multi-modal pre-training model, part of the characters in the text information are randomly masked out, and the model predicts the masked characters by combining the input multi-modal training data, as shown in fig. 3.
For example, suppose the text information in the input multi-modal training data is "dinner today" and the two characters of "dinner" are randomly masked out; the video information and audio information aligned with the text information are input into the model together, and the model predicts the masked characters.
It should be noted that the multimodal training data illustrated in fig. 3 includes only video information, audio information, and text information, but may include location information.
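For illustration only, the random masking used to build unsupervised training samples could look like the following sketch; the mask token, mask ratio and function name are assumptions, not values given by the patent.

```python
import random

MASK_TOKEN = "[MASK]"

def randomly_occlude(chars, mask_ratio=0.15, seed=None):
    """Illustrative sketch: randomly mask out part of the characters so the
    multi-modal pre-training model can be trained, without manual labels, to
    recover them from the remaining text and the aligned video/audio features."""
    rng = random.Random(seed)
    num_to_mask = max(1, int(len(chars) * mask_ratio))
    masked_positions = sorted(rng.sample(range(len(chars)), num_to_mask))
    masked_chars = [MASK_TOKEN if i in masked_positions else c for i, c in enumerate(chars)]
    targets = {i: chars[i] for i in masked_positions}   # ground truth for the prediction loss
    return masked_chars, targets

masked, targets = randomly_occlude(list("dinner today"), mask_ratio=0.2, seed=0)
print(masked, targets)
```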
Because the multi-modal data is cross-lingual, the neural networks for video, audio and text are shared across all languages. Compared with text and audio, video is only weakly correlated with language; therefore, if some videos are similar, the corresponding learned audio and text representations are also constrained to be similar, which gives the model the ability to describe multilingual semantics.
The multi-modal pre-training process is unsupervised and the training data covers various languages and scenes, which greatly relieves the requirement for a large supervised corpus. Using the multi-modal pre-training model as the reply generation model allows the method to adapt effectively to each interaction scene.
Furthermore, the most important part of the man-machine interaction process is language, and in multilingual interaction the language part is even more important. Therefore, in the embodiment of the application, a language model can be further added to the reply generation model to model the text information separately. A masked language model based on the Transformer structure can be used: part of the characters in the text information are masked, and the model is trained to predict the masked characters from the remaining unmasked context. Since the text information contained in the obtained multi-modal training data includes both monolingual corpora in different languages and mixed-language corpora, the pre-trained language model can describe multilingual semantics. Fig. 4 illustrates the training process of the language model.
In the training stage the input data consists of text information only. For example, for the text "I want to play basketball, navigate to nearby courts", the two characters of "courts" can be randomly masked, so the text fed to the language model is "I want to play basketball, navigate to nearby [MASK][MASK]". The language model predicts the characters hidden by the masks, which realizes the unsupervised training process.
Of course, the text in the multi-modal training data may also include mixed-language corpora, for example a sentence that states the Chinese translation of the English word "apple". Because the language model is trained on this type of corpus, it can describe different languages in the same semantic space; the trained language model can therefore generate appropriate reply information for input texts in different languages.
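A minimal masked-language-model sketch based on a Transformer encoder is shown below, assuming a PyTorch implementation; the layer count, dimensions and class name are illustrative choices, not architecture details disclosed by the patent.

```python
import torch
import torch.nn as nn

class MaskedLanguageModel(nn.Module):
    """Illustrative sketch: a Transformer encoder reads text in which some
    positions were replaced by a mask token and predicts the hidden characters
    from the unmasked context."""

    def __init__(self, vocab_size: int, dim: int = 256, max_len: int = 128):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        hidden = self.encoder(self.token_emb(token_ids) + self.pos_emb(positions))
        return self.lm_head(hidden)   # logits for every position, including masked ones

# Unsupervised training: the loss would be taken only at the masked positions.
model = MaskedLanguageModel(vocab_size=8000)
token_ids = torch.randint(0, 8000, (2, 16))
logits = model(token_ids)            # (2, 16, 8000)
```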
It should be noted that, when the reply generation model includes both the multi-mode pre-training model and the language model, the final reply information may be comprehensively determined based on the output results of the multi-mode pre-training model and the language model. For example, if the multimodal data acquired in the current interaction scenario only includes text information, the text information may be input into the language model, and the reply information output by the language model may be used as a final result. If the multi-modal data acquired in the current interaction scene comprises video information, audio information and text information at the same time, the multi-modal data can be input into a multi-modal pre-training model, the text information is input into a language model, and final reply information is determined based on output results of the multi-modal pre-training model and the language model.
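For illustration only, the dispatch between the two models described above could be sketched as follows; the generate() and score() methods are hypothetical interfaces, and the re-ranking rule is one possible way of combining the two outputs, which the patent does not specify.

```python
def generate_reply(multimodal_data, multimodal_model, language_model):
    """Illustrative dispatch sketch: text-only input goes to the language model,
    full multi-modal input is handled by both models and the outputs combined."""
    text = multimodal_data.get("text")
    video = multimodal_data.get("video")
    audio = multimodal_data.get("audio")

    # Only text available: the language model alone produces the final reply.
    if video is None and audio is None:
        return language_model.generate(text)

    # Multi-modal input: determine the final reply from both models' outputs,
    # here by keeping the candidate with the higher language-model score.
    candidates = [multimodal_model.generate(video=video, audio=audio, text=text),
                  language_model.generate(text)]
    return max(candidates, key=language_model.score)
```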
Based on the foregoing description of the reply generation model, this embodiment now describes the process in step S110 of referring to the pre-configured scene knowledge graph library, processing the multi-modal data based on the pre-trained reply generation model, and outputting reply information for interaction. As shown in fig. 5, the process specifically includes the following steps:
And step S300, splicing video information, audio information and/or text information contained in the multi-mode data by utilizing the reply generation model to obtain splicing characteristics.
Specifically, when the multi-modal data acquired during man-machine interaction includes video information, audio information and text information at the same time, the three can be spliced directly to obtain the splicing feature. If part of the multi-modal data cannot be acquired, whatever is acquired is spliced: for example, if only audio information and text information are acquired, those two are spliced; if only text information is acquired, the text information itself serves as the splicing feature.
Furthermore, the multi-mode data may further include location information, and in this step, the location information, the video information, the audio information, and/or the text information may be spliced to obtain a splice feature.
And step S310, selecting an adaptive scene knowledge graph from the scene knowledge graph base based on the splicing features, and representing the selected scene knowledge graph as a knowledge graph vector feature.
Specifically, the reply generation model has the capability of selecting the scene knowledge graph matched with the current human-computer interaction scene based on the splicing characteristics after training, so that the adaptive scene knowledge graph can be directly selected based on the splicing characteristics in the step, and the process belongs to the process of processing data in the model and is invisible to a user.
And step 320, predicting and outputting reply information for interaction based on the splicing characteristics and the knowledge-graph vector characteristics by using a reply generation model.
Specifically, the reply generation model is trained by using cross-language and cross-scene multi-mode training data in advance, so that reply information for interaction can be output according to multi-mode data of any language and any scene acquired under the current interaction scene in the embodiment, and one reply generation model is suitable for a multi-language and multi-scene man-machine interaction process.
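For illustration only, the splicing in step S300, including missing modalities and the optional location information, could be sketched as follows; the function name, tensor shapes and the way the location feature is appended are assumptions of this sketch, not details disclosed by the patent.

```python
import torch

def build_splice_feature(video=None, audio=None, text=None, position=None):
    """Illustrative sketch of step S300: concatenate whichever modality features were
    actually collected (text alone is allowed); location information, when present,
    is appended in a second splicing step."""
    parts = [m for m in (video, audio, text) if m is not None]
    if not parts:
        raise ValueError("at least one modality is required")
    spliced = torch.cat(parts, dim=1)                       # preliminary splicing feature
    if position is not None:
        position = position.unsqueeze(1).expand(-1, spliced.size(1), -1)
        spliced = torch.cat([spliced, position], dim=-1)    # final splicing feature
    return spliced

# Text-only input still yields a valid splicing feature; full input yields a longer, wider one.
text_only = build_splice_feature(text=torch.randn(2, 10, 256))
full = build_splice_feature(torch.randn(2, 10, 256), torch.randn(2, 10, 256),
                            torch.randn(2, 10, 256), position=torch.randn(2, 8))
print(text_only.shape, full.shape)   # torch.Size([2, 10, 256]) torch.Size([2, 30, 264])
```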
The information interaction device provided by the embodiment of the present application is described below, and the information interaction device described below and the information interaction method described above may be referred to correspondingly.
Referring to fig. 6, fig. 6 is a schematic structural diagram of an information interaction device according to an embodiment of the present application.
As shown in fig. 6, the apparatus may include:
A multi-mode data obtaining unit 11, configured to obtain multi-mode data in a current interaction scene, where the multi-mode data includes video information, audio information, and/or text information in a man-machine interaction process;
The reply information generation unit 12 is configured to refer to a pre-configured scene knowledge graph library, process the multi-modal data based on a pre-trained reply generation model, and output reply information for interaction, where the scene knowledge graph library includes scene knowledge graphs corresponding one-to-one to different scenes;
The reply generation model is obtained by training the multi-mode training data of cross languages and cross scenes and the scene knowledge graph base in an unsupervised mode.
Optionally, the information interaction device of the present application may further include a model training unit, configured to train to obtain a reply generation model, where the training process may include:
acquiring cross-language and cross-scene multi-mode training data and a pre-configured scene knowledge graph library;
aligning video information, audio information and text information contained in the multi-modal training data;
and taking the aligned multi-mode training data as a sample to be input, referring to the scene knowledge graph library, and training a reply generation model by taking the characters which are blocked in the text information contained in the multi-mode training data as targets.
Optionally, the process of aligning the video information, the audio information and the text information included in the multimodal training data by the model training unit may include:
Extracting the characteristics of each video frame in the video information to obtain a video characteristic vector corresponding to the video information;
performing discretization representation on the video feature vector to obtain a video feature vector aligned with each character in the text information one by one;
extracting the characteristics of each voice frame in the audio information to obtain an audio characteristic vector corresponding to the audio information;
and carrying out discretization representation on the audio feature vector to obtain the audio feature vector aligned with each character in the text information one by one.
Optionally, the process in which the model training unit takes the aligned multi-modal training data as sample input, refers to the scene knowledge graph library, and trains the reply generation model with the goal of predicting the blocked characters in the text information contained in the multi-modal training data may include:
splicing video information, audio information and text information contained in the input aligned multi-mode training data by using a reply generation model to obtain splicing characteristics;
selecting an adapted scene knowledge graph from the scene knowledge graph base based on the splicing features, and representing the selected scene knowledge graph as a knowledge graph vector feature;
predicting the blocked characters in the text information based on the splicing characteristics and the knowledge-graph vector characteristics by utilizing a reply generation model;
And training a reply generation model by taking the blocked character predicted by the reply generation model as a target that the blocked character approaches to the real blocked character in the text information.
Optionally, the multi-modal training data may further include location information, and the process of the model training unit splicing the video information, the audio information and the text information included in the input aligned multi-modal training data by using the reply generation model to obtain the splicing feature may include:
And splicing the video information, the audio information, the text information and the position information contained in the input aligned multi-mode training data by utilizing a reply generation model to obtain splicing characteristics.
Optionally, the process in which the reply information generation unit refers to the pre-configured scene knowledge graph library, processes the multi-modal data based on the pre-trained reply generation model, and outputs the reply information for interaction may include:
splicing video information, audio information and/or text information contained in the multi-mode data by using a reply generation model to obtain splicing characteristics;
selecting an adapted scene knowledge graph from the scene knowledge graph base based on the splicing features, and representing the selected scene knowledge graph as a knowledge graph vector feature;
and predicting and outputting reply information for interaction based on the splicing characteristics and the knowledge graph vector characteristics by using a reply generation model.
Optionally, the multi-modal data may further include location information, and the process of the reply information generating unit splicing the video information, the audio information and/or the text information included in the multi-modal data by using a reply generating model to obtain the splicing feature may include:
and splicing the position information, the video information, the audio information and/or the text information contained in the multi-mode data by utilizing a reply generation model to obtain splicing characteristics.
The information interaction device provided by the embodiment of the application can be applied to information interaction equipment, such as a terminal: a mobile phone, a computer, etc. Optionally, fig. 7 shows a block diagram of the hardware structure of the information interaction equipment; referring to fig. 7, the hardware structure may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete the communication with each other through the communication bus 4;
the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention;
the memory 3 may comprise a high-speed RAM memory and may further comprise a non-volatile memory, such as at least one magnetic disk memory;
Wherein the memory stores a program, the processor is operable to invoke the program stored in the memory, the program operable to:
Acquiring multi-mode data in a current interaction scene, wherein the multi-mode data comprises video information, audio information and/or text information in a man-machine interaction process;
Referring to a pre-configured scene knowledge graph library, processing the multi-modal data based on a pre-trained reply generation model and outputting reply information for interaction, wherein the scene knowledge graph library comprises scene knowledge graphs corresponding one-to-one to different scenes;
The reply generation model is obtained by training the multi-mode training data of cross languages and cross scenes and the scene knowledge graph base in an unsupervised mode.
Alternatively, the refinement function and the extension function of the program may be described with reference to the above.
The embodiment of the present application also provides a storage medium storing a program adapted to be executed by a processor, the program being configured to:
Acquiring multi-mode data in a current interaction scene, wherein the multi-mode data comprises video information, audio information and/or text information in a man-machine interaction process;
Referring to a pre-configured scene knowledge graph library, processing the multi-modal data based on a pre-trained reply generation model and outputting reply information for interaction, wherein the scene knowledge graph library comprises scene knowledge graphs corresponding one-to-one to different scenes;
The reply generation model is obtained by training the multi-mode training data of cross languages and cross scenes and the scene knowledge graph base in an unsupervised mode.
Alternatively, the refinement function and the extension function of the program may be described with reference to the above.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the present specification, each embodiment is described in a progressive manner, and each embodiment focuses on the difference from other embodiments, and may be combined according to needs, and the same similar parts may be referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. An information interaction method, comprising:
acquiring multimodal data in a current interaction scene, wherein the multimodal data comprises video information, audio information and/or text information in a human-machine interaction process;
referring to a pre-configured scene knowledge graph library, processing the multimodal data based on a pre-trained reply generation model, and outputting reply information for interaction, wherein the scene knowledge graph library comprises scene knowledge graphs in one-to-one correspondence with different scenes;
wherein the reply generation model is obtained by training on cross-language, cross-scene multimodal training data and the scene knowledge graph library in an unsupervised manner;
and the training process of the reply generation model comprises:
acquiring the cross-language, cross-scene multimodal training data and the pre-configured scene knowledge graph library;
aligning the video information, audio information and text information contained in the multimodal training data;
splicing, by using the reply generation model, the video information, audio information and text information contained in the input aligned multimodal training data to obtain spliced features;
selecting an adapted scene knowledge graph from the scene knowledge graph library based on the spliced features, and representing the selected scene knowledge graph as a knowledge graph vector feature;
predicting, by using the reply generation model, the masked characters in the text information based on the spliced features and the knowledge graph vector feature;
and training the reply generation model with the objective of making the masked characters predicted by the reply generation model approach the real masked characters in the text information.
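Purely as an illustration of the unsupervised training procedure recited in claim 1, the following Python sketch masks characters in the text stream and trains a toy model to recover them from the spliced multimodal features and a knowledge graph vector. The Transformer backbone, all dimensions, and names such as ReplyGenerationSketch are assumptions for illustration only; the patent does not prescribe a concrete architecture.

```python
# Minimal sketch of the unsupervised training objective described in claim 1.
# Dimensions, the Transformer backbone, and all names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, D_TXT, D_VID, D_AUD, D_KG, MASK_ID = 6000, 128, 64, 64, 32, 0

class ReplyGenerationSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.char_emb = nn.Embedding(VOCAB, D_TXT)
        # "Splicing" = concatenating per-character video, audio and text
        # features plus the knowledge graph vector, then projecting.
        self.proj = nn.Linear(D_TXT + D_VID + D_AUD + D_KG, 256)
        enc = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=2)
        self.lm_head = nn.Linear(256, VOCAB)

    def forward(self, char_ids, vid_feats, aud_feats, kg_vec):
        # char_ids: (B, T) with some positions replaced by MASK_ID
        # vid_feats / aud_feats: (B, T, D) features aligned per character
        # kg_vec: (B, D_KG) vector of the selected scene knowledge graph
        kg = kg_vec.unsqueeze(1).expand(-1, char_ids.size(1), -1)
        spliced = torch.cat([self.char_emb(char_ids), vid_feats, aud_feats, kg], dim=-1)
        return self.lm_head(self.encoder(self.proj(spliced)))

def training_step(model, char_ids, vid_feats, aud_feats, kg_vec, mask_prob=0.15):
    # Randomly mask characters, then train the model to recover them
    # (the masked characters referred to in the claim).
    mask = torch.rand(char_ids.shape) < mask_prob
    if not mask.any():
        mask[..., 0] = True                      # ensure at least one target
    masked_ids = char_ids.masked_fill(mask, MASK_ID)
    logits = model(masked_ids, vid_feats, aud_feats, kg_vec)
    return F.cross_entropy(logits[mask], char_ids[mask])

model = ReplyGenerationSketch()
B, T = 2, 16
loss = training_step(model, torch.randint(1, VOCAB, (B, T)),
                     torch.randn(B, T, D_VID), torch.randn(B, T, D_AUD),
                     torch.randn(B, D_KG))
loss.backward()
```

Because the prediction targets are the original characters of the input text itself, no manually labelled replies are needed, which is what makes the training unsupervised.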
2. The method of claim 1, wherein aligning the video information, the audio information and the text information contained in the multimodal training data comprises:
extracting features of each video frame in the video information to obtain a video feature vector corresponding to the video information;
discretizing the video feature vector to obtain video feature vectors aligned one-to-one with the characters in the text information;
extracting features of each speech frame in the audio information to obtain an audio feature vector corresponding to the audio information;
and discretizing the audio feature vector to obtain audio feature vectors aligned one-to-one with the characters in the text information.
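As a rough sketch of the alignment step in claim 2 (per-frame feature extraction followed by a discretized representation aligned one-to-one with the text characters), the snippet below resamples toy frame features to the number of characters and quantizes them against a codebook. The frame-feature extractor and the 256-entry codebook are placeholders, not the components actually used.

```python
# Illustrative alignment/discretization sketch; extractor and codebook are toys.
import torch

def extract_frame_features(frames):               # frames: (N_frames, C, H, W)
    # Toy per-frame feature: mean pixel value broadcast to a 64-d vector.
    return frames.flatten(1).mean(dim=1, keepdim=True).expand(-1, 64)

def align_to_characters(frame_feats, num_chars):
    # Give each character position the temporally nearest frame feature,
    # yielding a one-to-one character/frame alignment.
    n = frame_feats.size(0)
    idx = torch.linspace(0, n - 1, steps=num_chars).round().long()
    return frame_feats[idx]                        # (num_chars, D)

def discretize(feats, codebook):
    # "Discretized representation": replace each vector by its nearest
    # codebook entry (a simple vector-quantization stand-in).
    dists = torch.cdist(feats, codebook)           # (num_chars, K)
    return codebook[dists.argmin(dim=1)]

frames = torch.rand(30, 3, 32, 32)                 # 30 video frames
codebook = torch.randn(256, 64)                    # assumed 256-entry codebook
aligned = discretize(align_to_characters(extract_frame_features(frames), 12), codebook)
print(aligned.shape)                               # torch.Size([12, 64])
```

The audio information would be handled analogously, with per-speech-frame features in place of per-video-frame features.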
3. The method of claim 1, wherein the multimodal training data further comprises position information, and splicing, by using the reply generation model, the video information, audio information and text information contained in the input aligned multimodal training data to obtain spliced features comprises:
splicing, by using the reply generation model, the video information, the audio information, the text information and the position information contained in the input aligned multimodal training data to obtain the spliced features.
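A minimal sketch of the splicing described in claim 3, assuming the position information is realized as a learned positional embedding concatenated with the per-character text, video and audio features; the concrete form of the position information is not fixed by the claim.

```python
# Splicing (concatenation) of text, video, audio and assumed positional features.
import torch
import torch.nn as nn

T, D_TXT, D_VID, D_AUD, D_POS = 16, 128, 64, 64, 32
pos_emb = nn.Embedding(512, D_POS)                 # assumed positional embedding

def splice(txt, vid, aud):
    pos = pos_emb(torch.arange(txt.size(0)))       # (T, D_POS)
    return torch.cat([txt, vid, aud, pos], dim=-1) # (T, D_TXT+D_VID+D_AUD+D_POS)

spliced = splice(torch.randn(T, D_TXT), torch.randn(T, D_VID), torch.randn(T, D_AUD))
print(spliced.shape)                               # torch.Size([16, 288])
```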
4. The method of claim 1, wherein referring to the pre-configured scene knowledge graph library, processing the multimodal data based on the pre-trained reply generation model, and outputting reply information for interaction comprises:
splicing, by using the reply generation model, the video information, audio information and/or text information contained in the multimodal data to obtain spliced features;
selecting an adapted scene knowledge graph from the scene knowledge graph library based on the spliced features, and representing the selected scene knowledge graph as a knowledge graph vector feature;
and predicting and outputting the reply information for interaction based on the spliced features and the knowledge graph vector feature by using the reply generation model.
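For the inference path of claim 4, the sketch below selects the adapted scene knowledge graph by cosine similarity between the pooled spliced features and one vector per scene graph, then decodes a reply greedily. Both the similarity-based selection and the decoding loop are assumptions for illustration; `toy_model` merely stands in for the trained reply generation model.

```python
# Illustrative inference sketch: knowledge graph selection + greedy decoding.
import torch
import torch.nn.functional as F

def select_scene_graph(spliced, kg_library):
    # kg_library: (num_scenes, D), one vector per pre-built scene knowledge graph
    query = spliced.mean(dim=0, keepdim=True)      # pool over time: (1, D)
    sims = F.cosine_similarity(query, kg_library)  # (num_scenes,)
    return kg_library[sims.argmax()]               # adapted scene graph vector

def generate_reply(model, spliced, kg_vec, bos_id=1, eos_id=2, max_len=20):
    # Hypothetical greedy decoding; `model` maps (prefix_ids, spliced, kg_vec)
    # to next-token logits.
    ids = [bos_id]
    for _ in range(max_len):
        logits = model(torch.tensor(ids), spliced, kg_vec)
        nxt = int(logits[-1].argmax())
        if nxt == eos_id:
            break
        ids.append(nxt)
    return ids[1:]

toy_model = lambda ids, spliced, kg: torch.randn(len(ids), 6000)  # stand-in decoder
kg_library = torch.randn(8, 288)                   # assumed 8 scene graphs
spliced = torch.randn(16, 288)
reply_ids = generate_reply(toy_model, spliced, select_scene_graph(spliced, kg_library))
```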
5. The method of claim 4, wherein the multimodal data further comprises position information;
and splicing, by using the reply generation model, the video information, the audio information and/or the text information contained in the multimodal data to obtain spliced features comprises:
splicing, by using the reply generation model, the position information, the video information, the audio information and/or the text information contained in the multimodal data to obtain the spliced features.
6. An information interaction device, comprising:
a multimodal data acquisition unit, configured to acquire multimodal data in a current interaction scene, wherein the multimodal data comprises video information, audio information and/or text information in a human-machine interaction process;
and a reply information generation unit, configured to process the multimodal data based on a pre-trained reply generation model with reference to a pre-configured scene knowledge graph library and to output reply information for interaction, wherein the scene knowledge graph library comprises scene knowledge graphs in one-to-one correspondence with different scenes;
wherein the reply generation model is obtained by training on cross-language, cross-scene multimodal training data and the scene knowledge graph library in an unsupervised manner;
and the training process of the reply generation model comprises:
acquiring the cross-language, cross-scene multimodal training data and the pre-configured scene knowledge graph library;
aligning the video information, audio information and text information contained in the multimodal training data;
splicing, by using the reply generation model, the video information, audio information and text information contained in the input aligned multimodal training data to obtain spliced features;
selecting an adapted scene knowledge graph from the scene knowledge graph library based on the spliced features, and representing the selected scene knowledge graph as a knowledge graph vector feature;
predicting, by using the reply generation model, the masked characters in the text information based on the spliced features and the knowledge graph vector feature;
and training the reply generation model with the objective of making the masked characters predicted by the reply generation model approach the real masked characters in the text information.
7. Information interaction equipment, comprising: a memory and a processor;
wherein the memory is configured to store a program;
and the processor is configured to execute the program to implement the steps of the information interaction method according to any one of claims 1 to 5.
8. A storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the information interaction method of any of claims 1 to 5.
CN202110423719.6A 2021-04-20 2021-04-20 Information interaction method, device, equipment and storage medium Active CN113127708B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110423719.6A CN113127708B (en) 2021-04-20 2021-04-20 Information interaction method, device, equipment and storage medium
PCT/CN2021/105959 WO2022222286A1 (en) 2021-04-20 2021-07-13 Information interaction method, apparatus and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110423719.6A CN113127708B (en) 2021-04-20 2021-04-20 Information interaction method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113127708A CN113127708A (en) 2021-07-16
CN113127708B true CN113127708B (en) 2024-06-07

Family

ID=76777917

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110423719.6A Active CN113127708B (en) 2021-04-20 2021-04-20 Information interaction method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN113127708B (en)
WO (1) WO2022222286A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112148836A (en) * 2020-09-07 2020-12-29 北京字节跳动网络技术有限公司 Multi-modal information processing method, device, equipment and storage medium
CN112200317A (en) * 2020-09-28 2021-01-08 西南电子技术研究所(中国电子科技集团公司第十研究所) Multi-modal knowledge graph construction method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11226673B2 (en) * 2018-01-26 2022-01-18 Institute Of Software Chinese Academy Of Sciences Affective interaction systems, devices, and methods based on affective computing user interface
CN110110169A (en) * 2018-01-26 2019-08-09 上海智臻智能网络科技股份有限公司 Man-machine interaction method and human-computer interaction device
CN111221984B (en) * 2020-01-15 2024-03-01 北京百度网讯科技有限公司 Multi-mode content processing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN113127708A (en) 2021-07-16
WO2022222286A1 (en) 2022-10-27

Similar Documents

Publication Publication Date Title
CN108287858B (en) Semantic extraction method and device for natural language
US11322153B2 (en) Conversation interaction method, apparatus and computer readable storage medium
CN111968649B (en) Subtitle correction method, subtitle display method, device, equipment and medium
CN106406806B (en) Control method and device for intelligent equipment
CN111444326B (en) Text data processing method, device, equipment and storage medium
CN113076433B (en) Retrieval method and device for retrieval object with multi-modal information
CN105931644A (en) Voice recognition method and mobile terminal
CN113011186B (en) Named entity recognition method, named entity recognition device, named entity recognition equipment and computer readable storage medium
CN101681365A (en) Method and apparatus for distributed voice searching
CN108920450B (en) Knowledge point reviewing method based on electronic equipment and electronic equipment
CN110187780B (en) Long text prediction method, long text prediction device, long text prediction equipment and storage medium
CN109558513A (en) A kind of content recommendation method, device, terminal and storage medium
CN110895656B (en) Text similarity calculation method and device, electronic equipment and storage medium
CN111178056A (en) Deep learning based file generation method and device and electronic equipment
CN110032734B (en) Training method and device for similar meaning word expansion and generation of confrontation network model
CN110955818A (en) Searching method, searching device, terminal equipment and storage medium
CN113157959A (en) Cross-modal retrieval method, device and system based on multi-modal theme supplement
CN114339450A (en) Video comment generation method, system, device and storage medium
CN111508497B (en) Speech recognition method, device, electronic equipment and storage medium
CN110020429A (en) Method for recognizing semantics and equipment
CN113051384A (en) User portrait extraction method based on conversation and related device
CN113127708B (en) Information interaction method, device, equipment and storage medium
KR101916781B1 (en) Method and system for providing translated result
CN116955591A (en) Recommendation language generation method, related device and medium for content recommendation
CN111477212A (en) Content recognition, model training and data processing method, system and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
Effective date of registration: 20230516
Address after: 230026 Jinzhai Road, Baohe District, Hefei, Anhui Province, No. 96
Applicant after: University of Science and Technology of China
Applicant after: IFLYTEK Co.,Ltd.
Address before: NO.666, Wangjiang West Road, hi tech Zone, Hefei City, Anhui Province
Applicant before: IFLYTEK Co.,Ltd.
GR01 Patent grant