CN117592564A - Question-answer interaction method, device, equipment and medium - Google Patents


Publication number
CN117592564A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311516077.XA
Other languages
Chinese (zh)
Inventor
孙景洲
谭韬
李娜
吴文哲
陈又新
王磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Chuangke Technology Beijing Co ltd
Original Assignee
Ping An Chuangke Technology Beijing Co ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/041 Abduction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention relates to the technical field of artificial intelligence and intelligent medical treatment, and discloses a question-answer interaction method, a question-answer interaction device, question-answer interaction equipment and question-answer interaction medium, wherein the question-answer interaction method comprises the following steps of: receiving questioning information to be processed; determining an information processing mode corresponding to the questioning information according to the information type corresponding to the questioning information; based on the information processing mode and a preset information processing architecture, converting the questioning information into a questioning text with a preset format; and processing the question text through a preset question and answer interaction model to obtain an answer text corresponding to the question text. The scheme can improve the generalization capability of the question-answer interaction model, and further can improve the accuracy of question-answer interaction.

Description

Question-answer interaction method, device, equipment and medium
Technical Field
The invention relates to the technical field of intelligent vision and semantic recognition, in particular to a question-answer interaction method, a question-answer interaction device, question-answer interaction equipment and question-answer interaction medium.
Background
Intelligent customer service is now used across many industries. It helps enterprises build intelligent human-machine collaboration systems for online service, improves the efficiency of customer service agents, and reduces labor costs and the cost of learning new business knowledge. As large models penetrate vertical domains, users' questions become more specialized and the knowledge required becomes more specific.
Current intelligent question-answering systems interact with users mainly through a question-answer interaction model whose main data source is text data. Although such a model has strong language understanding and generation capability, it generalizes poorly to complex questioning information such as voice and pictures, so the accuracy of the resulting question-answer interaction is poor.
Disclosure of Invention
The invention provides an artificial intelligence question-answering interaction method, an artificial intelligence question-answering interaction device, computer equipment and a medium, which can improve the generalization capability of a question-answering interaction model and further improve the accuracy of question-answering interaction.
In a first aspect, a question-answer interaction method is provided, including:
receiving questioning information to be processed;
determining an information processing mode corresponding to the questioning information according to the information type corresponding to the questioning information;
based on the information processing mode and a preset information processing architecture, converting the questioning information into a questioning text with a preset format;
and processing the question text through a preset question and answer interaction model to obtain an answer text corresponding to the question text.
In a second aspect, a question-answering interaction device is provided, including:
The receiving module is used for receiving the questioning information to be processed;
the determining module is used for determining an information processing mode corresponding to the questioning information according to the information type corresponding to the questioning information;
the conversion module is used for converting the question information into a question text with a preset format based on the information processing mode and a preset information processing architecture;
and the processing module is used for processing the question text through a preset question-answer interaction model to obtain an answer text corresponding to the question text.
In a third aspect, a computer device is provided, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the question-answer interaction method described above when the computer program is executed.
In a fourth aspect, a computer readable storage medium is provided, the computer readable storage medium storing a computer program, which when executed by a processor, implements the steps of the question-answer interaction method described above.
According to the scheme implemented by the question-answer interaction method, the question-answer interaction device, the computer equipment and the storage medium, the questioning information to be processed is received; the information processing mode corresponding to the questioning information is determined according to the information type corresponding to the questioning information; the questioning information is converted into a question text in a preset format based on the information processing mode and a preset information processing architecture; and the question text is processed through a preset question-answer interaction model to obtain the answer text corresponding to the question text.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic view of an application environment of a question-answering interaction method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for interaction of questions and answers according to an embodiment of the invention;
FIG. 3 is a schematic diagram of a question-answer interaction system in accordance with an embodiment of the invention;
FIG. 4 is a schematic diagram of a full decoder in a question-answering interactive system according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating learning of images and texts in a question-answering interactive system according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a device for interaction of questions and answers according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of another structure of a question-answering interaction device according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a computer device according to an embodiment of the invention;
FIG. 9 is a schematic diagram of another configuration of a computer device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The question-answer interaction method provided by the embodiment of the invention can be applied to scenes such as intelligent diagnosis and treatment, remote consultation, intelligent vision, semantic recognition and the like, and the application environment is shown in figure 1, wherein a client communicates with a server through a network. The client can receive the questioning information to be processed, then, the client determines an information processing mode corresponding to the questioning information according to the information type corresponding to the questioning information, then, the questioning information is converted into a questioning text in a preset format based on the information processing mode and a preset information processing architecture, and finally, the client can process the questioning text through a preset questioning and answering interaction model to obtain an answer text corresponding to the questioning text.
According to the invention, based on the preset information processing architecture and the information processing mode corresponding to the question information, the question information is converted into the question text in the preset format, and the question text is processed through the preset question-answer interaction model, so that the corresponding answer text is obtained, and the problem that the question-answer interaction model cannot accurately generate the answer text corresponding to the question information under the condition of insufficient question-answer corpus can be avoided, so that the generalization capability of the question-answer interaction model can be improved, and the accuracy of question-answer interaction is further improved.
The question-answer interaction method can be applied to the scene of information inquiry, for example, in the medical field, medical record information required by a user can be inquired from massive electronic medical records, and the method is beneficial to providing medical record reference for the user. For another example, in the internet domain, answer information queried by a user may be output from a vast amount of internet data.
The invention provides a question-answer interaction method, which relates to a natural language processing direction in the field of artificial intelligence.
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields and includes both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Natural language processing (Natural Language Processing, NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between people and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics. Research in this field involves natural language, that is, the language people use every day, so it is closely related to linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robot question answering, knowledge graph techniques, and the like.
The following is a detailed description.
Referring to fig. 2, fig. 2 is a schematic flow chart of a question-answer interaction method according to an embodiment of the invention, which includes the following steps:
101. Receiving the questioning information to be processed.
The embodiment of the invention can be applied to an intelligent customer service system, which is an industry application-oriented system developed on the basis of large-scale knowledge processing, is applicable to the technical industries of large-scale knowledge processing, natural language understanding, knowledge management, automatic question-answering systems and the like, and can establish a quick and effective technical means based on natural language for communication between users and staff. It may be appreciated that the question information in the embodiment of the present invention may be collected by a question-answering device, which may be a smart phone, a personal computer, or a server. The voice questioning information can be obtained through a microphone, the image questioning information can be obtained through a camera, and the text or the image input by a user can be obtained, so that the text questioning information or the image questioning information can be obtained. The setting may be specifically performed according to actual situations, which will not be described herein.
102. Determining an information processing mode corresponding to the questioning information according to the information type corresponding to the questioning information.
From the foregoing, the question information of the present invention may include different types of question information, such as speech type question information, image type question information, and text type question information, so, in order to facilitate subsequent question-answer interaction, in the present invention, it is necessary to determine the information type corresponding to the question information, thereby determining the information processing mode corresponding to the question information.
103. Based on the information processing mode and a preset information processing architecture, the questioning information is converted into questioning texts in a preset format.
The invention provides an information processing architecture comprising three sub-networks, each of which corresponds to one information processing mode: the image processing mode corresponds to a first sub-network, the text processing mode to a second sub-network, and the speech processing mode to a third sub-network. Within the preset information processing architecture, the target information processing network corresponding to the information processing mode is determined first, and the questioning information is then converted by that network into a question text in a preset format for the subsequent question-answer interaction. That is, optionally, in some embodiments, the step of "converting the questioning information into a question text in a preset format based on the information processing mode and the preset information processing architecture" may specifically include:
(11) In a preset information processing architecture, determining a target information processing network corresponding to an information processing mode;
(12) And converting the questioning information into a questioning text with a preset format according to the target information processing network.
As noted, the image processing mode corresponds to the first sub-network, the text processing mode to the second sub-network, and the speech processing mode to the third sub-network. Because the information processing architecture comprises three different sub-networks, questioning information of each information type is converted into a question text in the preset format by its own sub-network, and the corresponding answer text is then generated by the question-answer interaction model.
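The selection of a target information processing network in steps (11) and (12) can be pictured as a lookup from information type to sub-network. The following is only a minimal sketch: the function names, the string type labels and the bracketed "preset format" are hypothetical stand-ins, not the patent's actual networks or format.

```python
from typing import Callable, Dict

# Hypothetical stand-ins for the three sub-networks. Each returns question
# text in one assumed "preset format" (a plain string with a type tag).
def image_subnetwork(payload: bytes) -> str:
    return f"[image] <{len(payload)} bytes of picture data>"

def text_subnetwork(payload: str) -> str:
    return f"[text] {payload.strip()}"

def speech_subnetwork(payload: bytes) -> str:
    return f"[speech] <{len(payload)} bytes of audio data>"

# Information type -> information processing mode -> target network.
SUBNETWORKS: Dict[str, Callable] = {
    "image": image_subnetwork,
    "text": text_subnetwork,
    "speech": speech_subnetwork,
}

def to_question_text(info_type: str, payload) -> str:
    """Select the target sub-network for the type and convert the payload."""
    try:
        network = SUBNETWORKS[info_type]
    except KeyError:
        raise ValueError(f"unsupported information type: {info_type!r}")
    return network(payload)

print(to_question_text("text", "  What does this report mean?  "))
```

Keeping the mapping in a table means a further modality could be supported by registering one more entry, without changing the calling code.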
Processing of the different types of questioning information covers the following three cases:
Case one:
For questioning information of the image type, the questioning information may be partitioned into blocks and encoded to obtain a series of embedded features, and the embedded features are then processed (e.g. decoded) by the target information processing network to predict the question text corresponding to the questioning information. That is, optionally, in some embodiments, the step of "converting the questioning information into a question text in a preset format according to the target information processing network" may specifically include:
(21) Partitioning the questioning information, and encoding the partitioned questioning information;
(22) And processing the coded information based on the target information processing network to obtain a question text corresponding to the question information.
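Step (21) can be illustrated with a small sketch that splits a picture into non-overlapping blocks and encodes each block with a linear projection. The patch size, the projection weights and the pure-Python list representation of the picture are all assumptions chosen for illustration; the text does not specify the encoder.

```python
from typing import List

Image = List[List[float]]

def partition(image: Image, patch: int) -> List[List[float]]:
    """Split an H x W image into non-overlapping patch x patch blocks,
    each flattened row by row. H and W must be multiples of `patch`."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    blocks = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            blocks.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return blocks

def encode(blocks, weights):
    """Linear projection: one embedding per block (weights: dim x patch*patch)."""
    return [[sum(w * x for w, x in zip(row, block)) for row in weights]
            for block in blocks]

# Toy 4 x 4 picture whose pixel at (r, c) has the value r * 4 + c.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
blocks = partition(image, 2)
# Toy projection to 2 dimensions: (sum of pixels, first pixel).
weights = [[1.0, 1.0, 1.0, 1.0], [1.0, 0.0, 0.0, 0.0]]
embeddings = encode(blocks, weights)
print(len(blocks), embeddings[0])  # 4 [10.0, 0.0]
```

The resulting list of embeddings plays the role of the "series of embedded features" that the target information processing network would then decode.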
Case two:
For questioning information of the text type, the questioning information may first be encoded, and the encoded information is then decoded by the target information processing network to obtain the word vector sequence corresponding to the questioning information. That is, optionally, in some embodiments, the step of "converting the questioning information into a question text in a preset format according to the target information processing network" may specifically include:
(31) Encoding the target information (i.e. the text-type questioning information) to obtain a semantic representation of the target information with a preset length;
(32) Decoding the semantic representation by using the target information processing network to obtain a word vector sequence corresponding to the target information.
For example, the target information may be encoded by an encoder, that is, converted into a vector that captures its semantic relationships; the semantic representation is then decoded by the target information processing network, which outputs the word vector sequence with the highest predicted probability. The sequence contains a plurality of word vectors in a definite order. Suppose the target information (i.e. the text) is encoded into a vector X = (X1, X2, X3) containing semantic relationships, and X is decoded by the target information processing network. If the first word vector is predicted to be X1', and the word vector following X1' is predicted to be X2' with probability 30% and X3' with probability 70%, then X3' is chosen as the second word vector, X2' takes the remaining position, and the word vector sequence is (X1', X3', X2').
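The worked example can be reproduced with a greedy decoding sketch. Two points are assumptions made only so the sketch runs: each word vector is used once, and the probability row for X3' is invented (the text gives only the probabilities that follow X1').

```python
def greedy_decode(first, next_probs, length):
    """Pick the most probable not-yet-used word at every step."""
    sequence = [first]
    while len(sequence) < length:
        candidates = {word: prob
                      for word, prob in next_probs.get(sequence[-1], {}).items()
                      if word not in sequence}
        if not candidates:
            break  # nothing left to choose from
        sequence.append(max(candidates, key=candidates.get))
    return sequence

# Probabilities after X1' come from the example in the text; the row for X3'
# is an assumed value, added only so the sketch can finish the sequence.
next_probs = {
    "X1'": {"X2'": 0.3, "X3'": 0.7},
    "X3'": {"X2'": 1.0},
}
decoded = greedy_decode("X1'", next_probs, 3)
print(decoded)  # ["X1'", "X3'", "X2'"]
```

Greedy selection is only one possible decoding strategy; a beam search over the same table would compare whole sequences rather than one step at a time.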
Case three:
For questioning information of the voice type, converting the questioning information directly into a voice text loses the emotion, speech rate and prosody information of the voice. The invention therefore uses a prompt learner to combine the converted voice text with the answer text corresponding to the voice text, so that the questioning information is converted into a question text in a preset format. That is, optionally, in some embodiments, the step of "converting the questioning information into a question text in a preset format according to the target information processing network" may include the following steps:
(41) Extracting characteristics of the questioning information to obtain a characteristic frame sequence corresponding to the questioning information;
(42) Coding the characteristic frame sequence to obtain high-order characteristics corresponding to the questioning information;
(43) Decoding the high-order features to obtain a text sequence corresponding to the question information;
(44) Based on a preset audio prompt learner and a text sequence, the question information is converted into a question text in a preset format.
When data is scarce (low-resource or small-sample settings), the model parameters cannot be fine-tuned separately for a large number of downstream tasks. Prompt learning (PL) arose to address this situation: rather than adapting the model to each downstream task, the downstream tasks are recast into a natural language form so that the model itself can be transferred to them.
For example, features are first extracted from the questioning information by digital signal processing techniques such as the fast Fourier transform and Mel-frequency cepstral coefficients, which converts the questioning information into a feature frame sequence. The feature frame sequence is then encoded to obtain the high-order features corresponding to the questioning information, and the high-order features are decoded to obtain the text sequence corresponding to the questioning information. Furthermore, it should be noted that, in the present invention, the audio prompt learner encodes the text sequence into a soft prompt of the voice modality, and the questioning information is converted into a question text in a preset format based on the soft prompt and the text sequence. That is, optionally, in some embodiments, the step of "converting the questioning information into a question text in a preset format based on the preset audio prompt learner and the text sequence" may specifically include:
(51) Encoding the text sequence into a soft prompt of a voice modality based on a preset audio prompt learner;
(52) And decoding the soft prompt to obtain a question text corresponding to the question information.
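The feature extraction in step (41) can be sketched as framing the signal and taking a power spectrum per frame. This is a simplified illustration only: it uses a naive discrete Fourier transform from the standard library in place of an FFT routine, omits the Mel filter bank and cepstral steps of MFCC, and the frame size, hop and test signal are assumed values.

```python
import cmath
import math

def frames(signal, size, hop):
    """Cut the signal into overlapping frames of `size` samples."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]

def power_spectrum(frame):
    """Naive DFT power spectrum of one frame (bins 0 .. size // 2)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum

# Toy input: a sinusoid with exactly 2 cycles per 16-sample frame.
signal = [math.sin(2 * math.pi * 2 * t / 16) for t in range(64)]
feature_frames = [power_spectrum(f) for f in frames(signal, size=16, hop=8)]
# Index of the strongest frequency bin in every frame.
peak_bins = [max(range(len(s)), key=s.__getitem__) for s in feature_frames]
print(len(feature_frames), peak_bins)
```

Each entry of `feature_frames` corresponds to one element of the feature frame sequence; a real front end would add windowing, Mel filtering and a cepstral transform before the encoder in step (42).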
104. Processing the question text through a preset question-answer interaction model to obtain an answer text corresponding to the question text.
After the question text with the preset format is obtained, the question text is processed through a preset question-answer interaction model, so that an answer text corresponding to the question text is obtained. Wherein the question-answer interaction model may be a large language model (Large Language Model, LLM), which refers to a deep learning model trained using a large amount of text data, which may generate natural language text or understand the meaning of language text. The large language model can process various natural language tasks, such as text classification, question-answering, dialogue and the like, and is an important path to artificial intelligence.
For questioning information of the voice type, after the text sequence is encoded into a soft prompt of the voice modality by the preset audio prompt learner, the soft prompt and the text sequence are input into the large language model, which outputs a reference answer text corresponding to the target information. The reference answer text is then segmented into words, and the segmented text is encoded to obtain segmented text features. The segmented text features are then decoded using the soft prompt encoded from the text sequence, finally yielding the answer text corresponding to the question text.
For questioning information of the text type, the questioning information is encoded to obtain a semantic representation of the target information with a preset length; the semantic representation is decoded by the target information processing network to obtain the word vector sequence corresponding to the target information; the word vector sequence is then input into the large language model, which predicts the candidate answer word corresponding to each word vector in the sequence, finally yielding the answer text corresponding to the question text.
For questioning information of the image type, the encoded information is processed by the target information processing network to obtain the question text corresponding to the questioning information; the question text is likewise input into the large language model, which outputs the image-related text corresponding to the target information.
In addition, the question-answer interaction model may be pre-trained. Specifically, voice training samples, text training samples and picture-text pair training samples may be obtained; the sub-network of the basic interaction model for the speech processing mode is then trained with the voice training samples, the sub-network for the text processing mode with the text training samples, and the sub-network for the image processing mode with the picture-text pair training samples, so that the question-answer interaction model has multi-modal question-answer interaction capability in actual use.
It should be noted that each picture-text pair training sample takes the form "picture - descriptive text of the picture", for example "picture a - a bird flying over water" or "picture b - traffic light"; that is, the descriptive text is associated with the picture.
To further explain the question-answer interaction scheme of the present invention, the question-answer interaction system is described in detail below. Referring to FIG. 3, the present invention provides a question-answer interaction model comprising a speech processing model S1, a text processing model S2 and a picture processing model S3, specifically as follows:
The speech processing model S1 mainly comprises an automatic speech recognition (Automatic Speech Recognition, ASR) module and a text-to-speech (Text To Speech, TTS) module, both of which use a sequence-to-sequence (Seq2Seq) encoder-decoder architecture. The automatic speech recognition module first extracts features through digital signal processing techniques such as preprocessing, the fast Fourier transform (FFT) and Mel-frequency cepstral coefficients (MFCC), converting the speech signal into a feature frame sequence $O = \{o_1, o_2, \ldots, o_T\}$. The feature frame sequence is input to the encoder, which extracts the high-order features of the audio, $h = \mathrm{Encoder}(O)$. The decoder decodes the current text autoregressively from the high-order features and the historical inputs, $w_n = \mathrm{Decoder}(w_{1:n-1}, h)$. Assuming the possible sequences are $W = \{w_1, w_2, \ldots, w_N\}$, the decoder solves for the text sequence with the highest posterior probability:
$$W^{*} = \arg\max_{W} P(W \mid O)$$
The text-to-speech module segments the text into words through a text analysis module and converts the input into phonemes; the acoustic model predicts intermediate representations such as Mel spectrograms (Mels) from the phonemes, and the vocoder restores them to audio.
Together, the two modules handle conversion between audio signals and text. To avoid losing the emotion, speech rate and prosody information of the voice during conversion, an audio prompt learner (Prompt Learner, PL) encodes the low-level feature frame sequence produced by the automatic speech recognition module into soft prompts (Soft Prompt, SP) of the voice modality:
$$sp_i = \mathrm{PL}(o_i)$$
For the entire input, mean pooling (Mean pooling) is used to obtain a global low-level feature frame representation:
$$sp = \mathrm{MeanPooling}(sp_1, \ldots, sp_T)$$
Compared with the converted text sequence, this representation retains the emotion, speech rate and prosody information of the speech signal more fully and does not introduce noise from the high-level encoder and decoder. The representation is provided to the large language model and the text-to-speech module to inject audio emotion information into them.
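The per-frame soft prompts and their mean pooling can be sketched in a few lines. The prompt learner is modeled here as a single linear projection with toy weights, which is an assumption; the text does not describe its internal structure.

```python
def prompt_learner(frame, weights):
    """Assumed PL: a linear map from a feature frame o_i to a soft prompt sp_i."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]

def mean_pooling(vectors):
    """Average the per-frame soft prompts into one global vector."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

feature_frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # o_1 .. o_T with T = 3
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]          # toy 2 -> 3 projection
soft_prompts = [prompt_learner(o, weights) for o in feature_frames]
global_prompt = mean_pooling(soft_prompts)
print(global_prompt)  # [3.0, 4.0, 7.0]
```

Because the pooled vector has a fixed size regardless of the number of frames, it can be passed to downstream models for utterances of any length.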
For the text processing model S2, in order to better perform the generation task, a decoder-only (Decoder Only) model structure is employed, as shown in FIG. 4.
Given a candidate word set $U = (u_1, \ldots, u_n)$ and a historical input window size $k$, the high-order feature $h_i$ at position $i$ of the language model and the candidate word probability $p(u)$ are calculated as follows:
$h_i = \mathrm{Model}(h_{l-1}), \quad l \in [1, n]$
$p(u) = \mathrm{softmax}(\mathrm{FC}(h_i))$
The training objective is to maximize the likelihood function:
$L(U) = \sum_{i} \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \theta)$
where $\theta$ denotes the trainable model parameters.
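The candidate word probability and the windowed likelihood can be sketched in a few lines. The example assumes the FC layer is a plain matrix product and that the first token is skipped because it has no history; `fc`, `next_word_probs`, `log_likelihood`, and `uniform_model` are hypothetical names, and the toy model is not trained.

```python
import math
from typing import Callable, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fc(h: Sequence[float], weights: Sequence[Sequence[float]]) -> List[float]:
    """Fully connected layer without bias: one logit per candidate word."""
    return [sum(w * x for w, x in zip(row, h)) for row in weights]


def next_word_probs(h: Sequence[float], weights: Sequence[Sequence[float]]) -> List[float]:
    """p(u) = softmax(FC(h))."""
    return softmax(fc(h, weights))


def log_likelihood(
    tokens: Sequence[int],
    k: int,
    model: Callable[[Sequence[int]], Sequence[float]],
) -> float:
    """Sum of log P(u_i | u_{i-k}, ..., u_{i-1}) over the sequence."""
    total = 0.0
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - k):i]   # at most k previous tokens
        total += math.log(model(context)[tokens[i]])
    return total


def uniform_model(context: Sequence[int]) -> List[float]:
    """Toy model: zero features give equal probability to 4 candidate words."""
    return next_word_probs([0.0, 0.0], [[1.0, 0.0]] * 4)


ll = log_likelihood([0, 1, 2, 3], k=2, model=uniform_model)
print(ll)
```

Training would adjust the parameters so that this sum increases on the training text.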
For the picture processing model S3, to enhance the visual capability of the system, a picture encoder (Image Encoder) partitions the picture into blocks and encodes them to obtain a series of embeddings, and a visual prompt learner (Prompt Learner) then learns task-specific picture representations that serve as visual soft prompts for the large language model (Large Language Model, LLM), injecting visual information into it. During training, aligned picture-text pairs are used to train the picture encoder together with the large language model on two tasks: unsupervised picture-text generation and picture-text contrastive learning. The former helps the prompt learner align the modalities, and the latter enables the language model to generate the corresponding text from the visual prompts provided to it; the picture-text learning process is shown in fig. 5. The contrastive learning loss uses the InfoNCE loss: for the i-th picture-text pair, the matched picture and text serve as the positive sample and the other picture-text pairs in the training batch serve as negative samples, and the picture-to-text loss and the text-to-picture loss are as follows:
the final loss is
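The contrastive objective described above can be sketched in its commonly used form. The exact formulas of the source are not reproduced in this text, so the dot-product similarity, the temperature value, and the averaging of the two directions are all assumptions; `info_nce` and `symmetric_loss` are hypothetical names.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def info_nce(queries: Sequence[Vector], keys: Sequence[Vector], temperature: float = 0.07) -> float:
    """Mean InfoNCE loss: pair (i, i) is the positive, pairs (i, j != i) are negatives."""
    losses: List[float] = []
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


def symmetric_loss(pictures: Sequence[Vector], texts: Sequence[Vector], temperature: float = 0.07) -> float:
    """Combine picture-to-text and text-to-picture losses (averaging is an assumption)."""
    picture_to_text = info_nce(pictures, texts, temperature)
    text_to_picture = info_nce(texts, pictures, temperature)
    return 0.5 * (picture_to_text + text_to_picture)


pictures = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_loss(pictures, [[1.0, 0.0], [0.0, 1.0]])
mismatched = symmetric_loss(pictures, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < mismatched)
```

The loss is low when each picture embedding is closest to its own text embedding within the batch, and high when the pairs are swapped.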
The system provided by the invention supports voice input and voice dialogue through speech recognition and text-to-speech generation, and the audio prompt learner uses the low-level features to obtain an emotion representation of the voice signal, enriching the representation of the audio modality. With the text modality as a bridge and the semantic understanding and generation capability of a large language model, the voice dialogue requirements of the user are supported. As a framework, the system can integrate different language models and picture encoders, and a vertical-domain dialogue model can be trained with domain data. Depending on the scenario, alignment and combination between modalities can be used to fine-tune the language model and the picture encoder, or only the prompt learners can be trained, so that noise is not introduced from one modality into another, the parameters and resources required for training are reduced, and the application scenarios and value are greatly expanded.
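Training only the prompt learner while keeping the other components fixed can be illustrated with a minimal sketch. It assumes plain gradient descent on named parameter groups; the group names, gradient values, and the `sgd_step` helper are hypothetical and serve only to show the freezing idea.

```python
from typing import Dict, List, Set


def sgd_step(
    params: Dict[str, List[float]],
    grads: Dict[str, List[float]],
    trainable: Set[str],
    lr: float = 0.1,
) -> Dict[str, List[float]]:
    """Update only the groups named in `trainable`; all other groups stay frozen."""
    updated: Dict[str, List[float]] = {}
    for name, values in params.items():
        if name in trainable:
            updated[name] = [v - lr * g for v, g in zip(values, grads[name])]
        else:
            updated[name] = list(values)  # frozen: copied unchanged
    return updated


params = {
    "language_model": [1.0, 2.0],
    "image_encoder": [0.5],
    "prompt_learner": [0.0, 0.0],
}
grads = {
    "language_model": [1.0, 1.0],
    "image_encoder": [1.0],
    "prompt_learner": [1.0, -1.0],
}
new_params = sgd_step(params, grads, trainable={"prompt_learner"})
print(new_params)
```

Only the prompt learner group changes, which is why this mode needs far fewer trainable parameters than fine-tuning the language model or the picture encoder.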
It can be seen that the invention provides a question-answer interaction method. After question information to be processed is received, an information processing mode corresponding to the question information is determined according to the information type corresponding to the question information. The question information is then converted into a question text in a preset format based on the information processing mode and a preset information processing architecture. Finally, the question text is processed through a preset question-answer interaction model to obtain an answer text corresponding to the question text.
It should be understood that the sequence numbers of the steps in the foregoing embodiment do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and should not limit the implementation process of the embodiment of the present invention.
In an embodiment, a question-answer interaction device is provided, which corresponds one-to-one to the question-answer interaction method in the foregoing embodiment. As shown in fig. 6, the question-answer interaction device includes a receiving module 201, a determining module 202, a converting module 203, and a processing module 204. The functional modules are described in detail as follows:
The receiving module 201 is configured to receive the question information to be processed.
The determining module 202 is configured to determine an information processing mode corresponding to the question information according to an information type corresponding to the question information.
The question information of the invention may be of different types, such as voice-type, image-type, and text-type question information. To facilitate the subsequent question-answer interaction, the information type corresponding to the question information therefore needs to be determined, so that the information processing mode corresponding to the question information can be determined.
The conversion module 203 is configured to convert the question information into a question text in a preset format based on an information processing mode and a preset information processing architecture.
The information processing architecture comprises three sub-networks, and each sub-network corresponds to one information processing mode. For example, the image processing mode corresponds to a first sub-network, the text processing mode corresponds to a second sub-network, and the speech processing mode corresponds to a third sub-network. It can be understood that in the preset information processing architecture, a target information processing network corresponding to the information processing mode can be determined, and then, the question information is converted into a question text in a preset format through the target information processing network, so that the question and answer interaction can be performed subsequently.
And the processing module 204 is used for processing the question text through a preset question-answer interaction model to obtain an answer text corresponding to the question text.
After the question text with the preset format is obtained, the question text is processed through a preset question-answer interaction model, so that an answer text corresponding to the question text is obtained. Wherein the question-answer interaction model may be a large language model (Large Language Model, LLM), which refers to a deep learning model trained using a large amount of text data, which may generate natural language text or understand the meaning of language text. The large language model can process various natural language tasks, such as text classification, question-answering, dialogue and the like, and is an important path to artificial intelligence.
For voice-type question information, the text sequence is first encoded into a soft prompt of the voice modality by a preset audio prompt learner; the soft prompt and the text sequence are then input into the large language model, which outputs a reference answer text corresponding to the target information. Next, the reference answer text is segmented into words, and the segmented text is encoded to obtain segmented text features; these features are decoded using the soft prompt encoded from the text sequence, finally yielding the answer text corresponding to the question text.
For text-type question information, the question information is encoded to obtain a semantic representation of the target information with a preset length; the semantic representation is then decoded by the target information processing network to obtain a word vector sequence corresponding to the target information; finally, the word vector sequence is input into the large language model, which predicts the candidate answer word corresponding to each word vector in the sequence, yielding the answer text corresponding to the question text.
For image-type question information, the encoded information is processed by the target information processing network to obtain the question text corresponding to the question information; the question text is likewise input into the large language model, which outputs the text, related to the image and the text, that corresponds to the target information.
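The routing from information type to sub-network described above can be sketched as a simple dispatch table. Everything here is a hypothetical illustration: the `Question` class, the three stand-in network functions, and the bracketed string format are assumptions, since the source does not define the preset format or the network interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Question:
    info_type: str   # "voice", "image", or "text"
    payload: object  # raw question information


# Hypothetical stand-ins for the three sub-networks of the architecture.
def image_network(payload: object) -> str:
    return f"[image question] {payload}"


def text_network(payload: object) -> str:
    return f"[text question] {payload}"


def voice_network(payload: object) -> str:
    return f"[voice question] {payload}"


ARCHITECTURE: Dict[str, Callable[[object], str]] = {
    "image": image_network,
    "text": text_network,
    "voice": voice_network,
}


def convert_to_question_text(question: Question) -> str:
    """Select the target network for the processing mode and convert the input."""
    try:
        network = ARCHITECTURE[question.info_type]
    except KeyError:
        raise ValueError(f"unsupported information type: {question.info_type!r}") from None
    return network(question.payload)


def answer(question: Question, qa_model: Callable[[str], str]) -> str:
    """Convert the question, then pass the question text to the QA model."""
    return qa_model(convert_to_question_text(question))


print(answer(Question("text", "what is ASR"), qa_model=str.upper))
```

Here `str.upper` is only a placeholder for the question-answer interaction model; any callable that maps a question text to an answer text fits the same slot.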
Optionally, in some embodiments, the conversion module 203 may specifically include:
a determining unit, configured to determine, in a preset information processing architecture, a target information processing network corresponding to the information processing mode;
a conversion unit, configured to convert the question information into a question text in a preset format according to the target information processing network.
For image-type question information, the question information may be partitioned into blocks and encoded to obtain a series of embedded features, and the embedded features are then processed by the target information processing network to estimate the question text corresponding to the question information. That is, optionally, in some embodiments, the conversion unit may specifically be configured to:
partitioning the questioning information, and encoding the partitioned questioning information;
and processing the coded information based on the target information processing network to obtain a question text corresponding to the question information.
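The partitioning step can be sketched as follows. The example assumes non-overlapping square blocks and a picture whose height and width are multiples of the block size; the source does not state the block size or layout, and `split_into_patches` is a hypothetical helper. The encoder that turns each block into an embedding is omitted.

```python
from typing import List, Sequence


def split_into_patches(image: Sequence[Sequence[int]], patch: int) -> List[List[int]]:
    """Cut a 2-D image into non-overlapping patch x patch blocks, each flattened
    row by row, returned in row-major block order."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    patches: List[List[int]] = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append(
                [image[r][c]
                 for r in range(top, top + patch)
                 for c in range(left, left + patch)]
            )
    return patches


image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
patches = split_into_patches(image, patch=2)
print(patches)
```

Each flattened block would then be passed to the picture encoder to produce one embedding per block.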
For text-type question information, the question information may first be encoded, and the encoded information is then processed by the target information processing network, so that a word vector sequence corresponding to the question information is obtained through decoding. That is, optionally, in some embodiments, the conversion unit may specifically be configured to:
encoding the target information to obtain semantic representation of the target information in a preset length;
and decoding the semantic representation by using the target information processing network to obtain a word vector sequence corresponding to the target information.
For voice-type question information, directly converting the question information into voice text may lose the emotion, speech rate, and prosody information of the voice during conversion. The invention therefore proposes using the prompt learner to combine the converted voice text with the answer text corresponding to the voice text, so that the question information is converted into a question text in the preset format. That is, optionally, in some embodiments, the conversion unit may be specifically configured to:
extracting characteristics of the questioning information to obtain a characteristic frame sequence corresponding to the questioning information;
coding the characteristic frame sequence to obtain high-order characteristics corresponding to the questioning information;
decoding the high-order features to obtain a text sequence corresponding to the question information;
based on a preset audio prompt learner and a text sequence, the question information is converted into a question text in a preset format.
Alternatively, in some embodiments, the conversion unit may be specifically configured to:
encoding the text sequence into a soft prompt of a voice modality based on a preset audio prompt learner;
and decoding the soft prompt to obtain a question text corresponding to the question information.
In addition, it may be understood that the question-answer interaction model may be pre-trained. Specifically, a voice training sample, a text training sample, and a picture-text pair training sample may be obtained; the first sub-network of the basic interaction model is then trained through the voice training sample, the second sub-network through the text training sample, and the third sub-network through the picture-text pair training sample, so that the question-answer interaction model has multi-modal question-answer interaction capability in actual use. Optionally, referring to fig. 7, the question-answer interaction device of the present invention may further include a training module 205, which may specifically be configured to: acquire a voice training sample, a text training sample, and a picture-text pair training sample; and train a first sub-network of the basic interaction model through the voice training sample, train a second sub-network of the basic interaction model through the text training sample, and train a third sub-network of the basic interaction model through the picture-text pair training sample, to obtain the question-answer interaction model.
The invention provides a question-answer interaction device. After the receiving module 201 receives the question information to be processed, the determining module 202 determines an information processing mode corresponding to the question information according to the information type corresponding to the question information; the converting module 203 then converts the question information into a question text in a preset format based on the information processing mode and a preset information processing architecture; and finally the processing module 204 processes the question text through a preset question-answer interaction model to obtain an answer text corresponding to the question text.
For specific limitations of the question-answer interaction device, reference may be made to the limitations of the question-answer interaction method above, which are not repeated here. The modules in the question-answer interaction device may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call them and execute the corresponding operations.
Alternatively, in some embodiments, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 8. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes non-volatile and/or volatile storage media and internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is for communicating with an external client via a network connection. The computer program, when executed by a processor, implements the functions or steps of a server side of a question-answer interaction method.
Alternatively, in some embodiments, a computer device is provided, which may be a client, and the internal structure of which may be as shown in fig. 9. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is for communicating with an external server via a network connection. The computer program is executed by a processor to perform functions or steps of a client side of a question-answer interaction method.
Optionally, in some embodiments, a computer device is provided, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
acquiring a preset question-answer corpus;
determining the corresponding relation between each question corpus and each answer corpus, and respectively marking a plurality of question corpora and a plurality of answer corpora based on the corresponding relation to obtain a marked corpus set;
carrying out corpus expansion on the marked corpus based on a preset language generation model to obtain an expanded corpus;
training the initial question-answer model according to the expanded corpus to obtain a target question-answer model so as to conduct question-answer interaction through the target question-answer model.
After a preset question-answer corpus is obtained, the correspondence between each question corpus and each answer corpus is determined, and a plurality of question corpora and a plurality of answer corpora are respectively marked based on the correspondence to obtain a marked corpus set. The marked corpus set is then expanded based on a preset language generation model to obtain an expanded corpus. Finally, an initial question-answer model is trained according to the expanded corpus to obtain a target question-answer model, through which question-answer interaction is conducted. In the question-answer interaction scheme provided by the invention, the marked question-answer corpora are expanded through the preset language generation model and the initial question-answer model is trained with the expanded corpus, which improves the generalization capability of the target question-answer model and, in turn, the accuracy of the question-answer interaction.
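The marking and expansion steps above can be sketched as follows. The example assumes the correspondence is given as a mapping from question index to answer index and that the language generation model returns rephrased questions; `label_pairs`, `expand_corpus`, and `toy_generator` are hypothetical, and the toy generator is a string manipulation rather than a real language model.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Pair = Tuple[str, str]


def label_pairs(
    questions: Sequence[str],
    answers: Sequence[str],
    correspondence: Dict[int, int],
) -> List[Pair]:
    """Mark each question with its corresponding answer."""
    return [(questions[q], answers[a]) for q, a in sorted(correspondence.items())]


def expand_corpus(
    labeled: Sequence[Pair],
    generate_variants: Callable[[str], Sequence[str]],
) -> List[Pair]:
    """Add generated question variants, each keeping the original answer."""
    expanded = list(labeled)
    for question, answer in labeled:
        for variant in generate_variants(question):
            if (variant, answer) not in expanded:
                expanded.append((variant, answer))
    return expanded


def toy_generator(question: str) -> List[str]:
    """Hypothetical stand-in for a language generation model."""
    return [question.lower(), question.rstrip("?") + ", please?"]


labeled = label_pairs(
    ["What is ASR?"],
    ["Automatic speech recognition."],
    correspondence={0: 0},
)
expanded = expand_corpus(labeled, toy_generator)
print(expanded)
```

The expanded list would then serve as training data for the initial question-answer model.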
Optionally, in some embodiments, a computer readable storage medium is provided, having stored thereon a computer program which when executed by a processor performs the steps of:
acquiring a preset question-answer corpus;
determining the corresponding relation between each question corpus and each answer corpus, and respectively marking a plurality of question corpora and a plurality of answer corpora based on the corresponding relation to obtain a marked corpus set;
carrying out corpus expansion on the marked corpus based on a preset language generation model to obtain an expanded corpus;
training the initial question-answer model according to the expanded corpus to obtain a target question-answer model so as to conduct question-answer interaction through the target question-answer model.
After a preset question-answer corpus is obtained, the corresponding relation between each question-answer corpus and each answer corpus is determined, a plurality of question-answer corpuses and a plurality of answer corpuses are respectively marked based on the corresponding relation to obtain a marked corpus, the marked corpus is expanded based on a preset language generation model to obtain an expanded corpus, and finally, an initial question-answer model is trained according to the expanded corpus to obtain a target question-answer model so as to conduct question-answer interaction through the target question-answer model.
It should be noted that, the functions or steps implemented by the computer readable storage medium or the computer device may correspond to the relevant descriptions of the server side and the client side in the foregoing method embodiments, and are not described herein for avoiding repetition.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium, and that the program, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include Read-Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM), among others.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention.

Claims (10)

1. A question-answer interaction method is characterized by comprising the following steps:
receiving questioning information to be processed;
determining an information processing mode corresponding to the questioning information according to the information type corresponding to the questioning information;
based on the information processing mode and a preset information processing architecture, converting the questioning information into a questioning text with a preset format;
and processing the question text through a preset question and answer interaction model to obtain an answer text corresponding to the question text.
2. The method of claim 1, wherein the converting the question information into a question text in a preset format based on the information processing mode and a preset information processing architecture, comprises:
in a preset information processing architecture, determining a target information processing network corresponding to the information processing mode;
and according to the target information processing network, converting the questioning information into a questioning text with a preset format.
3. The method according to claim 2, wherein the converting the question information into a question text in a preset format according to the target information processing network includes:
partitioning the questioning information, and encoding the partitioned questioning information;
and processing the coded information based on the target information processing network to obtain a question text corresponding to the question information.
4. The method according to claim 2, wherein the converting the question information into a question text in a preset format according to the target information processing network includes:
encoding the target information to obtain semantic representation of the target information in a preset length;
and decoding the semantic representation by using the target information processing network to obtain a word vector sequence corresponding to the target information.
5. The method according to claim 2, wherein the converting the question information into a question text in a preset format according to the target information processing network includes:
extracting characteristics of the questioning information to obtain a characteristic frame sequence corresponding to the questioning information;
coding the characteristic frame sequence to obtain high-order characteristics corresponding to the questioning information;
decoding the high-order features to obtain a text sequence corresponding to the question information;
and converting the questioning information into a questioning text with a preset format based on a preset audio prompt learner and the text sequence.
6. The method of claim 5, wherein the converting the question information into question text in a preset format based on the preset audio prompt learner and the text sequence comprises:
encoding the text sequence into a soft prompt of a voice mode based on a preset audio prompt learner;
and decoding the soft prompt to obtain a question text corresponding to the question information.
7. The method according to any one of claims 1 to 6, further comprising, prior to receiving the questioning information to be processed:
acquiring a voice training sample, a text training sample and a picture text pair training sample;
and training a first sub-network of a basic interaction model through the voice training sample, training a second sub-network of the basic interaction model through the text training sample, and training a third sub-network of the basic interaction model through the picture text pair training sample, to obtain a question-answer interaction model.
8. A question-answering interaction device, comprising:
the receiving module is used for receiving the questioning information to be processed;
the determining module is used for determining an information processing mode corresponding to the questioning information according to the information type corresponding to the questioning information;
the conversion module is used for converting the question information into a question text with a preset format based on the information processing mode and a preset information processing architecture;
and the processing module is used for processing the question text through a preset question-answer interaction model to obtain an answer text corresponding to the question text.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the question-answer interaction method of any one of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the steps of the question-answer interaction method of any one of claims 1 to 7.
CN202311516077.XA 2023-11-14 2023-11-14 Question-answer interaction method, device, equipment and medium Pending CN117592564A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311516077.XA CN117592564A (en) 2023-11-14 2023-11-14 Question-answer interaction method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN117592564A true CN117592564A (en) 2024-02-23

Family

ID=89919286

Country Status (1)

Country Link
CN (1) CN117592564A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination