WO2020143130A1 - Autonomous evolution intelligent dialogue method, system and device based on physical environment game - Google Patents

Autonomous evolution intelligent dialogue method, system and device based on physical environment game Download PDF

Info

Publication number
WO2020143130A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
picture
dialogue
text
physical environment
Prior art date
Application number
PCT/CN2019/083354
Other languages
English (en)
French (fr)
Inventor
许家铭
姚轶群
徐波
Original Assignee
中国科学院自动化研究所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院自动化研究所 filed Critical 中国科学院自动化研究所
Priority to US16/641,256 priority Critical patent/US11487950B2/en
Publication of WO2020143130A1 publication Critical patent/WO2020143130A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • G06F40/35Discourse or dialogue representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/42Data-driven translation
    • G06F40/44Statistical methods, e.g. probability models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/55Rule-based translation
    • G06F40/56Natural language generation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/086Learning methods using evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/778Active pattern-learning, e.g. online learning of image or video features
    • G06V10/7784Active pattern-learning, e.g. online learning of image or video features based on feedback from supervisors
    • G06V10/7788Active pattern-learning, e.g. online learning of image or video features based on feedback from supervisors the supervisor being a human, e.g. interactive learning with a human teacher
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions

Definitions

  • the invention belongs to the field of artificial intelligence and visual dialogue, and in particular relates to an autonomous evolution intelligent dialogue method, system and device based on physical environment games.
  • Visual dialogue generation is an important issue in the field of natural language processing.
  • In its common form, a real-world picture, several rounds of dialogue history text about that picture, and the sentence input from outside in the current round are given, and the dialogue system generates a response sentence to the external input sentence of the current round.
  • Existing methods based on reinforcement learning and generative adversarial learning can improve the quality of visual dialogue to a certain extent, but their computational cost is excessive, the policy-gradient algorithm based on feedback signals converges slowly, and they either do not consider a game with the physical world or realize it only through single-sample goal driving, so the quality of visual dialogue still needs to be improved.
  • the present invention provides an autonomous evolution intelligent dialogue method based on physical environment game, including:
  • Step S10 Obtain the image to be processed and the corresponding question text
  • Step S20 an optimized dialogue model is used to generate the response text of the image to be processed and the corresponding question text;
  • Step S30 output the response text
  • the dialogue model includes a picture coding model, a text coding model, a state coding model, and a decoder;
  • the picture coding model is constructed based on a pre-trained convolutional neural network
  • the text encoding model, state encoding model, and decoder are language models based on recurrent neural networks
  • the text encoding model includes a question encoder and a fact encoder.
  • the optimized dialogue model requires a discriminator to be introduced in its optimization process.
  • the dialogue model and the discriminator are alternately optimized until the values of the mixed loss function of the dialogue model and the loss function of the discriminator no longer decrease or fall below a preset value; the steps are:
  • Step M10 Obtain a picture set representing the physical environment and dialogue text corresponding to the picture as a first picture set and a first dialogue text set;
  • the first dialogue text set includes a first question text set and a first answer text set ;
  • Step M20 encoding each picture in the first picture set using a picture coding model, generating a first picture vector, and obtaining a first picture vector set;
  • Step M30 integrating the first picture vector set, using the question encoder, fact encoder and state coding model to encode all rounds of dialogue text in the first dialogue text set into the state vector of the corresponding round, to obtain First state vector set;
  • Step M40 the decoder generates response sentences of the corresponding rounds from the first state vector set to obtain a second response text set; a single-layer perceptual mapping function generates the second picture vector set from the first state vector set;
  • Step M50 Calculate the probability that all picture vectors in the second picture vector set belong to the physical environment vector through the discriminator, and use the probability and the first response text set to optimize the dialogue model to obtain the first optimized dialogue model;
  • step M60 the first picture vector set and the second picture vector set are sampled to generate an adversarial training sample pool, and the discriminator is optimized to obtain a first optimized discriminator.
  • the construction of the picture coding model is further provided with a pre-training step.
  • the steps are:
  • step T10 a picture set containing a physical environment is selected as a pre-training picture set
  • step T20 a convolutional neural network model is used, and the object category of each picture in the pre-training picture set is used as a label for pre-training, and the pre-trained convolutional neural network is a picture coding model.
  • the first picture vector is: I = CNN_pre(Img)
  • I is the first picture vector
  • CNN_pre is the picture coding model
  • Img is each picture in the picture set.
  • step M20 "encode each picture in the first picture set using a picture coding model to generate a first picture vector", the method is:
  • Each picture of the first picture set is input into the picture coding model, which outputs the fully connected layer vector of its last layer.
  • this vector encodes information of every level of the input picture; the vectors of all pictures form the first picture vector set.
  • step M30 "integrate the first picture vector set, and use the question encoder, fact encoder and state encoding model to encode all rounds of dialogue text in the first dialogue text set into the state vector of the corresponding round", the steps are:
  • Step M31 through the word mapping method, encode each word in the dialogue text of all rounds into a word vector to obtain a set of word vectors;
  • Step M32 in the round-t dialogue text, based on the word vector set, use the question encoder to encode the question text into a question vector; use the fact encoder to jointly encode the question text and the response text into a fact vector; use the state encoder to fuse and encode the question vector, the fact vector, the first picture vector corresponding to the fact vector and the state vector of round t-1 into the round-t state vector; 1 ≤ t ≤ T, T is the total number of dialogue rounds;
  • step M33 each round of state vectors obtained in step M32 is constructed as a second state vector set.
  • the text encoding model includes a question encoder and a fact encoder; the word vector, question vector, fact vector and state vector are calculated as:
  • e = W_e · w, W_e ∈ R^(b×v)
  • e is the word vector
  • b is the dimension of the word vector
  • v is the size of the vocabulary of all words in the data set
  • w is the one-hot representation of each word, and W_e is the word mapping matrix.
  • q_t = Enc_q({e_1, ..., e_n}_t)
  • q_t is the question vector of the question text
  • Enc_q is the question encoder
  • {e_1, ..., e_n}_t is the sequence of question word vectors.
  • f_t = Enc_f({e_1, ..., e_(m+n)}_t), where f_t is the fact vector, Enc_f is the fact encoder, and {e_1, ..., e_(m+n)}_t is the concatenation of the question and answer word vector sequences of round t.
  • s_t = LSTM_s([q_t, f_(t-1), I], s_(t-1))
  • s_t is the state vector of the current round
  • LSTM_s is the state encoder, and only one step is performed in each dialogue round t
  • s_(t-1) is the hidden layer state of round t-1
  • q_t is the question vector of the current round
  • f_(t-1) is the fact vector of the previous round
  • I is the first picture vector on which the dialogue is based.
  • step M40 "the decoder generates response sentences of the corresponding rounds from the first state vector set to obtain a second response text set; the single-layer perceptual mapping function generates the second picture vector set from the first state vector set", the method is:
  • a decoder is used, with each round's state vector in the first state vector set as the initial state, to generate each word of the predicted answer in turn as the response sentence of the corresponding round, obtaining the second response text set; the single-layer perceptual mapping function maps the state vector of each round in the first state vector set into the picture vector of the corresponding round, obtaining the second picture vector set.
  • the second picture vector s_t' is: s_t' = ReLU(W_p · s_t), s_t' ∈ R^D
  • s_t' is the second picture vector
  • D is the dimension of the second picture vector, which is also the dimension of the first picture vector I
  • W_p is the connection weight of the single-layer perceptron
  • ReLU is the activation function used by the single-layer perceptron.
  • step M50 "calculate the probability that all picture vectors in the second picture vector set belong to the physical environment vector by the discriminator, and use the probability and the first response text set to optimize the dialogue model", the steps are:
  • Step M51 input each picture vector in the second picture vector set into the discriminator to obtain the probability that the picture vector belongs to the physical environment vector; compare the second response text set with the first response text set to calculate the supervised training loss function and the physical environment game loss function;
  • Step M52 combine these loss functions with the probability that the second picture vector set belongs to real physical environment vectors to calculate a mixed loss function;
  • Step M53 calculate the gradient of the mixed loss function with respect to the parameters of the encoders, decoder and mapping function, and update the parameters of the encoders, decoder and single-layer perceptual mapping function to obtain the first optimized dialogue model.
  • the probability that the second picture vector belongs to the physical environment vector is calculated as: p_t = DBot(s_t')
  • DBot() is the discriminator
  • s_t' is the second picture vector.
  • the supervised training loss function, physical environment game loss function and mixed loss function are calculated as:
  • L_su = -Σ_{t=1..T} Σ_{i=1..N} log p(y_i^t)
  • L_adv = -(1/T) Σ_{t=1..T} DBot(s_t')
  • L_G = L_su + λ·L_adv
  • L_su is the loss function of supervised training
  • L_adv is the physical environment game loss function
  • L_G is the mixed loss function
  • N is the length of the true dialogue response sentence of round t, and p(y_i^t) is the generation probability of its i-th word
  • T is the total number of rounds of dialogue
  • λ is a hyperparameter.
  • step M60 "sampling the first picture vector set and the second picture vector set, generating an adversarial training sample pool, and optimizing the discriminator", the steps are:
  • Step M61 select some samples from the first picture vector set and mark them as true; select some samples from the second picture vector set and mark them as false; all the marked vectors form the discriminator training sample pool;
  • Step M62 the loss function of the discriminator is calculated, so that the probability of the discriminator outputting to the true sample is as high as possible, and the probability of outputting the fake sample is as low as possible, and the parameter of the discriminator is updated to obtain an optimized discriminator.
  • the discriminator loss function is calculated as: L_D = E_{s_t'}[DBot(s_t')] - E_{I~p(I)}[DBot(I)]
  • L_D is the discriminator loss function
  • I is the first picture vector
  • s_t' is the second picture vector
  • DBot() is the discriminator
  • E_{s_t'}[DBot(s_t')] is the average value of the probability that the second picture vectors belong to the physical environment vector
  • E_{I~p(I)}[DBot(I)] is the average value of the probability output for the true samples.
  • a self-evolving intelligent dialogue system based on a physical environment game is proposed, comprising an acquisition module, a dialogue model, and an output module;
  • the acquisition module is configured to acquire and input images to be processed and corresponding problem information
  • the dialogue model is configured to use the optimized dialogue model to generate the response information of the image to be processed and corresponding question information;
  • the output module is configured to output response information
  • the dialogue model includes an image encoding module, a text encoding module, and a decoding module;
  • the image encoding module is configured to encode each picture in the acquired first picture set using the constructed picture encoding model, generate a first picture vector, and obtain the first picture vector set;
  • the text encoding module is configured to incorporate the first picture vector set and use the text encoding and state encoding models to encode all rounds of dialogue text in the first dialogue text set into state vectors of the corresponding rounds, obtaining the first state vector set;
  • the decoding module is configured to generate a response text corresponding to a round based on the first state vector set.
  • a storage device in which a plurality of programs are stored, and the programs are adapted to be loaded and executed by a processor to implement the above-mentioned autonomous evolution intelligent dialogue method based on physical environment games.
  • a processing device including a processor and a storage device; the processor is suitable for executing various programs; the storage device is suitable for storing multiple programs; the program is suitable for Loaded and executed by the processor to realize the above-mentioned autonomous evolution intelligent dialogue method based on physical environment game.
  • the self-evolving intelligent dialogue method based on physical environment game of the present invention can, through the combined use of adversarial training and supervised training, make the state vectors generated by the encoder-decoder model and the picture vectors of the physical world closely related in distribution, thus achieving a ternary game between agent and human and between agent and physical environment, improving the accuracy and fluency of dialogue responses while avoiding the large computational burden caused by the use of reinforcement learning.
  • the self-evolutionary intelligent dialogue method based on the physical environment game of the present invention introduces a wide range of real physical world information into the self-evolving artificial intelligence method. Compared with existing methods, the method of the present invention can make fuller use of the extensive and easily acquired physical environment information, enabling the model to obtain more general and extensible knowledge through autonomous evolution in the game with the physical environment.
  • the autonomous evolution intelligent system of the present invention is completed through an interactive game with the physical environment, which can better simulate the learning process of human beings, rely on more easily obtained resources, and acquire more general knowledge.
  • the physical environmental resources are unsupervised information, the amount of data is more sufficient, and it is easier to obtain.
  • FIG. 1 is a schematic flowchart of an autonomous evolution intelligent dialogue method based on physical environment game of the present invention
  • FIG. 2 is a schematic diagram of a question encoder and a fact encoder module in a round of dialogue in an embodiment of an autonomous evolution intelligent dialogue method based on a physical environment game of the present invention
  • FIG. 3 is a schematic diagram of a loss function generation process of supervision and confrontation training based on an embodiment of an autonomous evolution intelligent dialogue method based on a physical environment game of the present invention.
  • existing natural language processing methods that generate dialogue are mainly based on reinforcement learning and generative adversarial learning. Such methods can improve the quality of the dialogue to a certain extent, but often have two defects: first, a large amount of sampling and trial and error is required when generating each word or sentence in order to accurately estimate the expectation-based loss function from the feedback signal,
  • and the policy gradient algorithm based on the feedback signal itself converges slowly, resulting in excessive computational cost; second, the game with the physical world is not considered and the task is completed only through the text itself and simple goal driving, resulting in low accuracy of the processed information.
  • the present invention introduces a universal method of playing a game with the physical environment to realize the ternary game of human, machine and physical world, improving the system's ability to integrate multi-modal information without introducing excessive computational complexity; the computational cost is low and the convergence is fast, which further improves the accuracy of processing information.
  • a self-evolving intelligent dialogue method based on physical environment game of the present invention includes:
  • Step S10 Obtain the image to be processed and the corresponding question text
  • Step S20 an optimized dialogue model is used to generate the response text of the image to be processed and the corresponding question text;
  • Step S30 output the response text
  • the dialogue model includes a picture coding model, a text coding model, a state coding model, and a decoder;
  • the picture coding model is constructed based on a pre-trained convolutional neural network
  • the text encoding model, state encoding model, and decoder are language models based on recurrent neural networks
  • the text encoding model includes a question encoder and a fact encoder.
  • a self-evolving intelligent dialogue method based on a physical environment game includes steps S10-S30, and each step is described in detail as follows:
  • Step S10 Obtain the image to be processed and the corresponding question text.
  • step S20 the optimized dialogue model is used to generate the response text of the image to be processed and the corresponding question text.
  • the dialogue model includes a picture coding model, a text coding model, a state coding model, and a decoder.
  • the text encoding model, state encoding model, and decoder are language models based on recurrent neural networks.
  • the text encoding model includes a question encoder and a fact encoder.
  • the image coding model is based on a pre-trained convolutional neural network. The steps are:
  • step T10 a picture set containing a physical environment is selected as a pre-training picture set
  • step T20 a convolutional neural network model is used, and the object category of each picture in the pre-training picture set is used as a label for pre-training, and the pre-trained convolutional neural network is a picture coding model.
  • ImageNet is selected as a large-scale data set containing a large number of real-world images, and the mature convolutional neural network model VGG16 is selected, and the object class in each image in the data set is used for pre-training to obtain the image coding model CNN pre .
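  • As a rough illustration of how such a pre-trained image encoder can be reused, the sketch below assumes PyTorch/torchvision (tools not named in the patent) and keeps the 4096-dimensional last fully connected activations as the picture vector I = CNN_pre(Img); all names are illustrative.

    # Sketch only: an ImageNet-pretrained VGG16 whose last fully connected layer output
    # serves as the frozen picture coding model CNN_pre.
    import torch
    import torchvision

    cnn_pre = torchvision.models.vgg16(pretrained=True)   # ImageNet-pretrained VGG16
    cnn_pre.classifier[6] = torch.nn.Identity()           # drop the 1000-way output layer,
                                                          # keep the 4096-dim fc activations
    cnn_pre.eval()                                        # CNN_pre parameters stay frozen

    def encode_picture(img_batch: torch.Tensor) -> torch.Tensor:
        """img_batch: (B, 3, 224, 224) preprocessed pictures -> (B, 4096) first picture vectors."""
        with torch.no_grad():                             # never updated during dialogue training
            return cnn_pre(img_batch)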
  • In the process of optimizing the dialogue model, a discriminator needs to be introduced.
  • the dialogue model and the discriminator are alternately optimized until the values of the mixed loss function of the dialogue model and the loss function of the discriminator no longer drop or fall below the preset value.
  • the steps are:
  • Step M10 Obtain a picture set representing the physical environment and dialogue text corresponding to the picture as a first picture set and a first dialogue text set; the first dialogue text set includes a first question text set and a first answer text set .
  • Step M20 encoding the first picture set using a picture coding model, generating a first picture vector, and obtaining the first picture vector set.
  • the picture coding model CNN pre can output the fully connected layer vector of the last layer of the picture for an input picture.
  • This vector encodes the information of each level of the input picture, which is the first picture vector I, as shown in equation (1):
  • I is the first picture vector
  • CNN pre is the picture coding model
  • Img is each picture in the picture set.
  • for each picture in the first picture set, a picture vector is obtained according to the above method; together they form the first picture vector set.
  • the parameters of the CNN pre model are not updated as the model is trained.
  • Step M30 integrating the first picture vector set, using the question encoder, fact encoder and state coding model to encode all rounds of dialogue text in the first dialogue text set into the state vector of the corresponding round, to obtain First state vector set.
  • Step M31 through the word mapping method, encode each word in the dialogue text of all rounds into a word vector to obtain a set of word vectors.
  • Step M32 in the round-t dialogue text, based on the word vector set, use the question encoder to encode the question text into a question vector; use the fact encoder to jointly encode the question text and the response text into a fact vector; use the state encoder to fuse and encode the question vector, the fact vector, the first picture vector corresponding to the fact vector and the state vector of round t-1 into the round-t state vector; 1 ≤ t ≤ T, T is the total number of dialogue rounds.
  • FIG. 2 it is a schematic diagram of a question encoder and a fact encoder module according to an embodiment of the present invention.
  • step M33 each round of state vectors obtained in step M32 is constructed as a second state vector set.
  • each word w ∈ {x_1, ..., x_n, y_1, ..., y_m}_t in the question and answer sentences is a one-hot vector, which can be mapped into the word vector e by the word mapping matrix, as shown in equation (2): e = W_e · w, W_e ∈ R^(b×v)
  • b is the dimension of the word vector
  • v is the size of the vocabulary of all words in the data set
  • w is the unique hot code representation of each word.
  • the LSTM model (Long Short-Term Memory model) is used as the question encoder Enc_q; an LSTM is a recurrent neural network: for each input word vector, the network computes the hidden layer state of the new moment from the input word vector and its own hidden layer state of the previous moment; the word vector sequence {e_1, ..., e_n}_t of the question is input into the question encoder, and the hidden layer state of the last moment serves as the question vector q_t, as shown in equation (3): q_t = Enc_q({e_1, ..., e_n}_t)
  • an LSTM model is likewise used as the fact encoder Enc_f: the question and answer word vector sequences of round t are concatenated into {e_1, ..., e_(m+n)}_t and input into the fact encoder, and the hidden layer state of the last moment serves as the fact vector f_t, as shown in equation (4); the fact vector records the question and answer information of the current round of dialogue and is used as historical information input in the next round (t+1) of dialogue.
  • step M40 the decoder generates the first state vector set into corresponding rounds of response sentences to obtain a second response text set; and generates a second picture vector set from the first state vector set through a single-layer perceptual mapping function.
  • a decoder is used, with each round of state vectors in the first state vector set as the initial state, each word of the predicted answer is generated in turn, and the second response text set is obtained for the corresponding round of response sentences; using a single-layer perception mapping function to The state vectors of each round in the first state vector set are mapped into picture vectors of the corresponding round, and the second picture vector set is obtained.
  • the single-layer perceptron model is used as the mapping function f, and the state vector s_t is mapped to the second picture vector s_t', as shown in equation (6): s_t' = ReLU(W_p · s_t), s_t' ∈ R^D
  • D is the dimension of the second picture vector and the dimension of the first picture vector I;
  • W p is the connection weight of the single-layer perceptron;
  • ReLU is the activation function used by the single-layer perceptron.
  • Step M50 Calculate the probability that all picture vectors in the second picture vector set belong to the physical environment vector through the discriminator, and use the probability and the first response text set to optimize the dialogue model to obtain the first optimized dialogue model.
  • Step M51 input each picture vector in the second picture vector set into the discriminator to obtain the probability that the picture vector belongs to the physical environment vector; compare the second response text set with the first response text set to calculate the supervised training loss function and the physical environment game loss function.
  • FIG. 3 it is a schematic diagram of a process of generating a loss function for supervision and adversarial training in an embodiment of the present invention.
  • the LSTM model is used as the decoder Decoder, and the state vector s t is used as the initial state to sequentially generate each predicted answer word.
  • the structure of the LSTM model used by the decoder is the same as the structure of the encoder Enc q shown in FIG. 2.
  • the decoded word is encoded into a new hidden layer vector. Based on the new hidden layer vector, through a single-layer perceptron model with softmax activation function, for each word in the vocabulary, the probability of generating the word in the time slice is calculated.
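  • A minimal sketch of such a decoder, assuming PyTorch (the class name, teacher-forcing input of reference answer words, and layer sizes b=300/d=512 from the embodiment are illustrative assumptions), is:

    # Sketch of the decoder: an LSTM initialized with the round-t state vector s_t, whose
    # hidden state feeds a single-layer perceptron over the vocabulary (softmax gives word
    # probabilities at each time step).
    import torch
    import torch.nn as nn

    class Decoder(nn.Module):
        def __init__(self, vocab_size: int, emb_dim: int = 300, hid_dim: int = 512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.lstm = nn.LSTMCell(emb_dim, hid_dim)     # same structure as the encoder Enc_q
            self.out = nn.Linear(hid_dim, vocab_size)     # single-layer perceptron over the vocab

        def forward(self, s_t: torch.Tensor, answer_tokens: torch.Tensor) -> torch.Tensor:
            """s_t: (B, hid_dim) state vector used as the initial hidden state.
            answer_tokens: (B, N) reference answer words (teacher forcing during training).
            Returns per-step vocabulary logits (B, N, vocab_size)."""
            h, c = s_t, torch.zeros_like(s_t)
            logits = []
            for i in range(answer_tokens.size(1)):
                h, c = self.lstm(self.embed(answer_tokens[:, i]), (h, c))
                logits.append(self.out(h))                # score of each word at this time step
            return torch.stack(logits, dim=1)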
  • a single-layer perceptron with a ReLU activation function is used as the discriminator DBot(); for each second picture vector, the discriminator outputs the probability that the vector belongs to the physical environment vector, as shown in equation (7): p_t = DBot(s_t')
  • DBot() is the discriminator and s_t' is the second picture vector.
  • the response sentence in the first dialogue text of round t is a word sequence {y_1^t, ..., y_N^t}, where N is the sentence length and T is the number of historical dialogue rounds.
  • after all T rounds of dialogue are completed, cross-entropy is used to calculate the supervised training loss function L_su over all whole-sentence response sentences in the dialogue, as shown in equation (8): L_su = -Σ_{t=1..T} Σ_{i=1..N} log p(y_i^t)
  • Step M52 combine these loss functions with the probability that the second picture vector set belongs to the real physical environment vector (the physical environment game loss L_adv, the negated average of the discriminator outputs) to calculate the mixed loss function L_G = L_su + λ·L_adv, as shown in equation (10),
  • where λ is a hyperparameter weighting the adversarial loss.
  • Step M53 Calculate the gradient of the mixed function to the parameters of the encoder, decoder, and mapping function, and update the parameters of the encoder, decoder, and single-layer perceptual mapping function to obtain the first optimized dialogue model.
  • the Adam algorithm is used to update the parameters of the encoder, decoder, and mapping function to reduce the value of the loss function.
  • step M60 the first picture vector set and the second picture vector set are sampled to generate an adversarial training sample pool, and the discriminator is optimized to obtain a first optimized discriminator.
  • Step M61 select some samples from the first picture vector set and mark them as true; select some samples from the second picture vector set and mark them as false; all the marked vectors form the discriminator training sample pool.
  • a sample subset of one batch (usually 32) in size is sampled from the dialogue data, the dialogue texts of the sample subset are encoded with the current encoder parameters to generate second picture vectors, and these vectors are labelled and marked as false.
  • the same number of first picture vectors are sampled from the first picture vector set (they need not correspond to the sampled subset of dialogue data), and these vectors are labelled and marked as true.
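  • A small sketch of this sample-pool construction (step M61), assuming PyTorch; encode_dialogs stands in for the current encoder plus mapping function f and is a hypothetical placeholder:

    # Build the discriminator's training pool: generated (fake) picture vectors labelled 0,
    # and the same number of real (first) picture vectors labelled 1.
    import torch

    def build_sample_pool(dialog_batch, real_picture_vectors: torch.Tensor, encode_dialogs):
        fake = encode_dialogs(dialog_batch)                          # second picture vectors, marked false
        idx = torch.randint(0, real_picture_vectors.size(0), (fake.size(0),))
        real = real_picture_vectors[idx]                             # real vectors need not match the dialogues
        vectors = torch.cat([real, fake], dim=0)
        labels = torch.cat([torch.ones(real.size(0)), torch.zeros(fake.size(0))])
        return vectors, labels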
  • Step M62 the loss function of the discriminator is calculated, so that the probability of the discriminator outputting to the true sample is as high as possible, and the probability of outputting the fake sample is as low as possible, and the parameter of the discriminator is updated to obtain an optimized discriminator.
  • the discriminator loss function is L_D = E_{s_t'}[DBot(s_t')] - E_{I~p(I)}[DBot(I)], as shown in equation (11), where:
  • I is the first picture vector
  • s_t' is the second picture vector
  • DBot() is the discriminator
  • E_{s_t'}[DBot(s_t')] is the average value of the probability that the false picture vectors belong to the physical environment vector
  • E_{I~p(I)}[DBot(I)] is the average value of the probability output for the true samples.
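  • The alternating optimization of the dialogue model and the discriminator described above can be summarized by the following high-level sketch, assuming PyTorch; dialogue_model, dbot, mixed_loss, discriminator_loss and loader are hypothetical placeholders for the components described in steps M10-M62, not code from the patent:

    # Alternate between optimizing the dialogue model (mixed loss, Adam) and the
    # discriminator (RMSProp with weight clipping), as described in the embodiment.
    import torch

    def train_alternating(dialogue_model, dbot, loader, mixed_loss, discriminator_loss,
                          epochs: int = 20, clip: float = 0.01, lr: float = 5e-5):
        gen_opt = torch.optim.Adam(dialogue_model.parameters(), lr=lr)    # encoders/decoder/mapping
        disc_opt = torch.optim.RMSprop(dbot.parameters(), lr=lr)          # discriminator
        for _ in range(epochs):                                           # converges within ~20 rounds
            for dialog_batch, picture_batch in loader:
                # 1) dialogue model step: L_G = L_su + lambda * L_adv
                gen_opt.zero_grad()
                mixed_loss(dialogue_model, dbot, dialog_batch, picture_batch).backward()
                gen_opt.step()
                # 2) discriminator step on a pool of real and generated picture vectors
                disc_opt.zero_grad()
                discriminator_loss(dbot, dialogue_model, dialog_batch, picture_batch).backward()
                disc_opt.step()
                for p in dbot.parameters():                               # clip weights to [-c, c]
                    p.data.clamp_(-clip, clip)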
  • Step S30 output the response text
  • the embodiment of the present invention selects VisDial v0.5 multiple rounds of question and answer data sets for evaluation.
  • the typical form of the data in the VisDial dataset is: a picture and the corresponding 10 rounds of natural language dialogue are given; in each round the dialogue system is required to read the picture and all previous dialogue history, predict the response to the question of that round, and compare it with the true response sentence.
  • Each response sentence has 100 candidate sentences, and the system must give the probability of producing each candidate sentence.
  • the test index of the data set is related to the ranking of the probability of the true answer among all the candidate answers.
  • MRR: Mean Reciprocal Rank
  • Recall@1/5/10: the recall rate of the correct answer within the top 1/5/10 probabilities
  • Mean Rank: the average ranking of the correct answers (lower is better; for the other metrics, higher is better)
  • the parameters are set as n=20, b=300, d=512, D=4096, lr=5e-5, lr_pre=1e-3, bs=32, λ=10, c=0.01, where n is the maximum sentence length in all training data
  • b is the dimension of the word vector mapping
  • d is the dimension of the vectors generated by all LSTM recurrent neural networks in the codec, and D is the dimension of the picture vector and the second picture vector.
  • lr is the learning rate for supervised and adversarial training; lr_pre is the learning rate used when supervised learning alone is used for pre-training. During pre-training, the learning rate gradually decays from 1e-3 to 5e-5, and the pre-training process is conducted for a total of 30 rounds.
  • bs is the size of the data batch sampled during each training step.
  • λ is the weight of the adversarial training loss function when calculating the mixed loss function.
  • c is the size of the clipping interval for discriminator weights during adversarial training.
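  • For reference, the same settings collected as a plain configuration dict (a sketch; the dict and key names are illustrative, the values are those stated above):

    config = dict(
        n=20,         # maximum sentence length in the training data
        b=300,        # word-embedding dimension
        d=512,        # hidden size of all LSTMs in the encoder/decoder
        D=4096,       # picture-vector dimension (first and second picture vectors)
        lr=5e-5,      # learning rate for supervised + adversarial training
        lr_pre=1e-3,  # initial pre-training learning rate (decays to 5e-5 over 30 rounds)
        bs=32,        # mini-batch size
        lam=10,       # weight of the adversarial loss in the mixed loss
        c=0.01,       # clipping interval for discriminator weights
    )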
  • the embodiment of the present invention adds adversarial training to the game against the physical environment, and can converge within 20 rounds.
  • the resulting codec parameters are used as the final visual dialogue system.
  • Comparison method 1 SL-pretrain: the purely supervised training version (SL-pretrain) of the codec described in the present invention; the model is a hierarchical text sentence and dialogue state encoder trained only with a supervised loss function and does not involve adversarial learning in physical environment games.
  • Comparison method 2 Answer-Prior: this baseline model directly uses a long short-term memory (LSTM) neural network to encode each candidate answer and then outputs a score through a single-layer perceptron.
  • the model is trained directly on all the answers in the training set, without considering the picture information.
  • Comparison method 3 MN: Memory Network model (MN); this model stores each round of dialogue history as discrete vectors, retrieves the history vectors by dot-product similarity and weighted sum when generating an answer, integrates the picture vector information, and decodes with a recurrent neural network.
  • Comparison method 4 LF Late Fusion Encoder (LF), this model treats all rounds of question and answer sentences in the dialogue history as a long sequence, and uses an LSTM recurrent neural network to encode; for the current round of questions, use another An LSTM is used for encoding. After stitching together historical coding, question coding and picture vectors, a multi-layer perceptron is used for fusion mapping, and a recurrent neural network is used for decoding based on the vectors obtained by the perceptron.
  • Comparison method 5 HREA: Hierarchical attention coding (HREA), the structure of this model is similar to the SL-pretrain model described in comparison method 1. Referring to step S1024, the only difference is that the fact vector of the input state encoder is no longer from the t-1th round, but a parameterized attention mechanism is used to calculate the similar weight of the current question vector and the fact vector of each round of dialogue history , The weighted sum of the fact vectors of each round is input as a new fact vector to the state encoder.
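  • A short sketch of the parameterized attention step attributed to HREA above, assuming PyTorch (the class and parameter names are illustrative): the current question vector is scored against the fact vector of every history round, and the weighted sum of those fact vectors replaces the previous-round fact vector fed to the state encoder.

    import torch
    import torch.nn as nn

    class FactAttention(nn.Module):
        def __init__(self, dim: int = 512):
            super().__init__()
            self.score = nn.Linear(2 * dim, 1)               # parameterized similarity

        def forward(self, q_t: torch.Tensor, facts: torch.Tensor) -> torch.Tensor:
            """q_t: (B, d) current question vector; facts: (B, t-1, d) history fact vectors.
            Returns the attention-weighted sum of the fact vectors, shape (B, d)."""
            q_exp = q_t.unsqueeze(1).expand_as(facts)
            weights = torch.softmax(self.score(torch.cat([q_exp, facts], dim=-1)).squeeze(-1), dim=1)
            return (weights.unsqueeze(-1) * facts).sum(dim=1)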
  • Comparison method 6 RL-Multi: goal-driven reinforcement learning (RL-Multi); this method uses a model similar to the SL-pretrain described in comparison method 1, the difference being that after encoding the question, a single-layer perceptron produces a prediction of the picture vector from the question vector; the Euclidean distance between the predicted vector and the picture vector on which the dialogue is based is used as the reward signal, the loss function is calculated by the reinforcement learning method, and it is weighted and summed with the loss function of supervised training.
  • Comparing the first three rows of Table 2 shows that, using exactly the same codec structure, the improvement brought by the physical environment game in the present invention is much more pronounced than that of the previously best comparison method 6, RL-Multi.
  • the reason is that the comparison method 6 uses goal-driven reinforcement learning, but the rewards of reinforcement learning are only related to the only picture involved in the current sample. Due to the high abstraction of human language, the real world picture that can be correctly described by the 10 rounds of dialogue in the text is not limited to this one in real data. Therefore, choosing the Euclidean distance from this picture as a reward and punishment signal is not a very reliable auxiliary training method.
  • the idea of the present invention is to make the state code generated by the encoder closer to the real world picture through the confrontation training of the physical environment game, so as to integrate the prior knowledge from the multi-modality at the overall data distribution level. Comparing the last three rows of Table 2 can show that the mixed loss function of adversarial learning and supervised learning involved in the present invention can steadily bring performance improvement to different codec models, and is a more efficient and universal visual dialogue method.
  • any physical environment picture data set can be directly used to participate in the game of the model, and the game process of the model is also applicable to any target task that requires knowledge from visual information. Therefore, the data used by the model is easier to obtain, and solves the problem of lack of universality of other autonomous evolution methods.
  • the self-evolving intelligent dialogue system based on the physical environment game of the second embodiment of the present invention includes an acquisition module, a dialogue model, and an output module;
  • the acquisition module is configured to acquire and input images to be processed and corresponding problem information
  • the dialogue model is configured to use the optimized dialogue model to generate the response information of the image to be processed and corresponding question information;
  • the output module is configured to output response information
  • the dialogue model includes an image encoding module, a text encoding module, and a decoding module;
  • the image encoding module is configured to encode each picture in the acquired first picture set using the constructed picture encoding model, generate a first picture vector, and obtain the first picture vector set;
  • the text encoding module is configured to integrate into the first picture vector set, and use a question encoder, a fact encoder, and a state encoding model to encode all rounds of dialogue text in the first dialogue text set into corresponding rounds State vector to get the first state vector set;
  • the decoding module is configured to generate a response text corresponding to a round based on the first state vector set.
  • the self-evolving intelligent dialogue system based on the physical environment game provided in the above embodiment is only illustrated by the division into the above functional modules.
  • in practical applications, the above functions can be assigned to different functional modules as needed,
  • that is, the modules or steps in the embodiments of the present invention may be further decomposed or combined.
  • for example, the modules in the above embodiments can be combined into one module, or can be further split into multiple sub-modules, to complete all or part of the functions described above.
  • the names of the modules and steps involved in the embodiments of the present invention are only for distinguishing each module or step, and are not regarded as an improper limitation of the present invention.
  • a storage device wherein a plurality of programs are stored, the programs are suitable to be loaded and executed by a processor to implement the above-mentioned autonomous evolution intelligent dialogue method based on physical environment games.
  • a processing device includes a processor and a storage device; the processor is suitable for executing various programs; the storage device is suitable for storing multiple programs; the program is suitable for being loaded and executed by the processor In order to realize the above-mentioned autonomous evolution intelligent dialogue method based on physical environment game.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Physiology (AREA)
  • Machine Translation (AREA)
  • Image Analysis (AREA)

Abstract

An autonomous evolution intelligent dialogue method, system and device based on a physical environment game. The method comprises: obtaining an image to be processed and the corresponding question text; using an optimized dialogue model to encode the picture into a picture vector and the question text into a question vector; generating a state vector from the picture vector and the question vector; and decoding the state vector to obtain and output the response text. A discriminator is introduced during the optimization of the dialogue model, and the dialogue model and the discriminator are optimized alternately until the values of the mixed loss function of the dialogue model and the loss function of the discriminator no longer decrease or fall below preset values, completing the model optimization. The method belongs to the field of artificial intelligence and visual dialogue and can solve the problems of high computational cost, slow convergence and low information-processing accuracy of intelligent systems; compared with conventional methods it has low computational cost and fast convergence, and further improves the accuracy of processing information.

Description

Autonomous evolution intelligent dialogue method, system and device based on physical environment game
Technical field
The present invention belongs to the field of artificial intelligence and visual dialogue, and in particular relates to an autonomous evolution intelligent dialogue method, system and device based on a physical environment game.
Background art
In the field of artificial intelligence, the most common way to train a model that can understand data is supervised training. Supervised training designs an objective function from the viewpoint of statistical distributions by maximizing the probability of the sample data together with the corresponding labels, and updates the model parameters accordingly. Supervised training requires a large amount of data and takes "explaining the data statistically" as its sole objective, which differs from how humans actually learn; supervised learning therefore has an obvious defect: when important parts of the target task lack labels and references, its performance drops markedly.
In actual human learning, besides imitation, the process of autonomous updating through interaction and games with the outside world under limited supervision is indispensable. The current method that can simulate this process to some extent is reinforcement learning, whose goal is to keep generating actions by trial and error so as to maximize the expected reward of each decision step. Reinforcement learning requires a complete action space and external rewards to be defined, so it is usually applied only to single problems, for example achieving a higher score in an interactive game. Human intelligence, however, evolves autonomously through extensive interaction and games with the physical-world environment, whereas existing methods generally consider only interaction games between agents under artificially defined tasks. Such methods are effective only for specific tasks, and different tasks require different action spaces and reward mechanisms, so they lack generality and are complex to design and hard to extend or transfer.
Visual dialogue generation is an important problem in natural language processing. In its common form, a real-world picture, several rounds of dialogue history text about that picture, and the sentence input from outside in the current round are given, and the dialogue system generates a response to the current external input sentence. Existing methods based on reinforcement learning and generative adversarial learning can improve the quality of visual dialogue to a certain extent, but their computational cost is excessive, the policy-gradient algorithm based on feedback signals converges slowly, and they either do not consider a game with the physical world or realize it only through single-sample goal driving, so the quality of visual dialogue still needs to be improved.
Therefore, an important problem in the field of artificial intelligence and visual dialogue is how to introduce, during model training, a general method of playing a game with the physical environment so as to realize a three-party game among humans, machines and the physical world, improving the system's ability to integrate multi-modal information such as vision and text without introducing excessive computational complexity.
Summary of the invention
In order to solve the above problems in the prior art, namely the high computational cost, slow convergence and low information-processing accuracy of intelligent systems, the present invention provides an autonomous evolution intelligent dialogue method based on a physical environment game, comprising:
Step S10, obtaining an image to be processed and the corresponding question text;
Step S20, using an optimized dialogue model to generate the response text for the image to be processed and the corresponding question text;
Step S30, outputting the response text;
wherein the dialogue model comprises a picture coding model, a text coding model, a state coding model and a decoder;
the picture coding model is constructed on the basis of a pre-trained convolutional neural network;
the text coding model, the state coding model and the decoder are language models based on recurrent neural networks;
the text coding model comprises a question encoder and a fact encoder.
In some preferred embodiments, a discriminator is further introduced in the optimization of the dialogue model; the dialogue model and the discriminator are optimized alternately until the values of the mixed loss function of the dialogue model and the loss function of the discriminator no longer decrease or fall below preset values; the steps are:
Step M10, obtaining a picture set representing the physical environment and the dialogue texts corresponding to the pictures as a first picture set and a first dialogue text set; the first dialogue text set comprises a first question text set and a first response text set;
Step M20, encoding each picture in the first picture set with the picture coding model to generate first picture vectors and obtain a first picture vector set;
Step M30, incorporating the first picture vector set and using the question encoder, the fact encoder and the state coding model to encode all rounds of dialogue text in the first dialogue text set into state vectors of the corresponding rounds, obtaining a first state vector set;
Step M40, generating, by the decoder, response sentences of the corresponding rounds from the first state vector set to obtain a second response text set, and generating a second picture vector set from the first state vector set through a single-layer perceptual mapping function;
Step M50, calculating, by the discriminator, the probability that each picture vector in the second picture vector set belongs to a physical-environment vector, and optimizing the dialogue model with this probability and the first response text set to obtain a first optimized dialogue model;
Step M60, sampling the first picture vector set and the second picture vector set to generate an adversarial training sample pool, and optimizing the discriminator to obtain a first optimized discriminator.
In some preferred embodiments, the construction of the picture coding model further comprises a pre-training step:
Step T10, selecting a picture set containing the physical environment as a pre-training picture set;
Step T20, using a convolutional neural network model and pre-training it with the object category of each picture in the pre-training picture set as the label; the pre-trained convolutional neural network is the picture coding model.
In some preferred embodiments, the first picture vector is:
I = CNN_pre(Img)
where I is the first picture vector, CNN_pre is the picture coding model, and Img is each picture in the picture set.
In some preferred embodiments, in step M20, "encoding each picture in the first picture set with the picture coding model to generate first picture vectors" is performed as follows:
Each picture of the first picture set is input into the picture coding model, which outputs the fully connected layer vector of its last layer; this vector encodes information of every level of the input picture, and the vectors of all pictures form the first picture vector set.
In some preferred embodiments, step M30, "incorporating the first picture vector set and using the question encoder, the fact encoder and the state coding model to encode all rounds of dialogue text in the first dialogue text set into state vectors of the corresponding rounds", comprises:
Step M31, encoding each word of all rounds of dialogue text into a word vector by word mapping, obtaining a word vector set;
Step M32, in the round-t dialogue text, on the basis of the word vector set, encoding the question text into a question vector with the question encoder; jointly encoding the question text and the response text into a fact vector with the fact encoder; and fusing and encoding the question vector, the fact vector, the first picture vector corresponding to the fact vector and the state vector of round t-1 into the round-t state vector with the state encoder; 1 ≤ t ≤ T, where T is the total number of dialogue rounds;
Step M33, constructing the state vectors of the rounds obtained in step M32 into a second state vector set.
In some preferred embodiments, the text coding model comprises the question encoder and the fact encoder; the word vector, question vector, fact vector and state vector are calculated as:
e = W_e · w,  W_e ∈ R^(b×v)
where e is the word vector, b is the word vector dimension, v is the size of the vocabulary of all words in the data set, w is the one-hot representation of each word, and W_e is the word mapping matrix.
q_t = Enc_q({e_1, ..., e_n}_t)
where q_t is the question vector of the question text, Enc_q is the question encoder, and {e_1, ..., e_n}_t is the word vector sequence of the question.
f_t = Enc_f({e_1, ..., e_(m+n)}_t)
where f_t is the fact vector, Enc_f is the fact encoder, and {e_1, ..., e_(m+n)}_t is the concatenation of the question and answer word vector sequences of round t.
s_t = LSTM_s([q_t, f_(t-1), I], s_(t-1))
where s_t is the state vector of the current round; LSTM_s is the state encoder, which performs only one step per dialogue round t; s_(t-1) is the hidden layer state of round t-1; q_t is the question vector of the current round; f_(t-1) is the fact vector of the previous round; and I is the first picture vector on which the dialogue is based.
In some preferred embodiments, step M40, "generating, by the decoder, response sentences of the corresponding rounds from the first state vector set to obtain a second response text set, and generating a second picture vector set from the first state vector set through a single-layer perceptual mapping function", is performed as follows:
The decoder takes each round's state vector in the first state vector set as its initial state and generates each word of the predicted answer in turn as the response sentence of the corresponding round, yielding the second response text set; the single-layer perceptual mapping function maps the state vector of each round in the first state vector set into the picture vector of the corresponding round, yielding the second picture vector set.
The second picture vector s_t' is:
s_t' = ReLU(W_p · s_t),  s_t' ∈ R^D
where s_t' is the second picture vector; D is the dimension of the second picture vector, which is also the dimension of the first picture vector I; W_p is the connection weight of the single-layer perceptron; and ReLU is the activation function used by the single-layer perceptron.
In some preferred embodiments, step M50, "calculating, by the discriminator, the probability that each picture vector in the second picture vector set belongs to a physical-environment vector, and optimizing the dialogue model with this probability and the first response text set", comprises:
Step M51, inputting each picture vector of the second picture vector set into the discriminator to obtain the probability that the picture vector belongs to a physical-environment vector; comparing the second response text set with the first response text set to calculate the supervised-training loss function and the physical-environment game loss function;
Step M52, combining these loss functions with the probability that the second picture vector set belongs to real physical-environment vectors to calculate a mixed loss function;
Step M53, calculating the gradients of the mixed loss function with respect to the parameters of the encoders, the decoder and the mapping function, and updating the parameters of the encoders, the decoder and the single-layer perceptual mapping function to obtain the first optimized dialogue model.
In some preferred embodiments, the probability that a second picture vector belongs to a physical-environment vector is calculated as:
p_t = DBot(s_t')
where p_t is the probability that the second picture vector belongs to a physical-environment vector, DBot() is the discriminator, and s_t' is the second picture vector.
In some preferred embodiments, the supervised-training loss function, the physical-environment game loss function and the mixed loss function are calculated as:
L_su = -Σ_{t=1..T} Σ_{i=1..N} log p(y_i^t)
L_adv = -(1/T) Σ_{t=1..T} DBot(s_t')
L_G = L_su + λ·L_adv
where L_su is the supervised-training loss function, L_adv is the physical-environment game loss function, and L_G is the mixed loss function; N is the length of the true dialogue response sentence of round t, {y_1^t, ..., y_N^t} is the word sequence of the first response text, T is the total number of dialogue rounds, p(y_i^t) is the generation probability of each word of that sequence, (1/T) Σ_t DBot(s_t') is the average of the probabilities that the second picture vectors belong to physical-environment vectors, and λ is a hyperparameter.
In some preferred embodiments, step M60, "sampling the first picture vector set and the second picture vector set to generate an adversarial training sample pool and optimizing the discriminator", comprises:
Step M61, selecting a number of samples from the first picture vector set and marking them as true; selecting a number of samples from the second picture vector set and marking them as false; all marked vectors constitute the training sample pool of the discriminator;
Step M62, calculating the loss function of the discriminator so that the probability the discriminator outputs for true samples is as high as possible and the probability it outputs for false samples is as low as possible, and updating the parameters of the discriminator to obtain an optimized discriminator.
In some preferred embodiments, the discriminator loss function is calculated as:
L_D = E_{s_t'}[DBot(s_t')] - E_{I~p(I)}[DBot(I)]
where L_D is the discriminator loss function, I is the first picture vector, s_t' is the second picture vector, DBot() is the discriminator, E_{s_t'}[DBot(s_t')] is the average of the probabilities that the second picture vectors belong to physical-environment vectors, and E_{I~p(I)}[DBot(I)] is the average of the probabilities output for the true samples.
In another aspect of the present invention, an autonomous evolution intelligent dialogue system based on a physical environment game is proposed, comprising an acquisition module, a dialogue model and an output module;
the acquisition module is configured to acquire and input an image to be processed and the corresponding question information;
the dialogue model is configured to generate, with the optimized dialogue model, the response information for the image to be processed and the corresponding question information;
the output module is configured to output the response information;
wherein the dialogue model comprises an image encoding module, a text encoding module and a decoding module;
the image encoding module is configured to encode each picture in the acquired first picture set with the constructed picture coding model, generate first picture vectors and obtain the first picture vector set;
the text encoding module is configured to incorporate the first picture vector set and encode, with the text coding and state coding models, all rounds of dialogue text in the first dialogue text set into state vectors of the corresponding rounds, obtaining the first state vector set;
the decoding module is configured to generate the response text of the corresponding round on the basis of the first state vector set.
In a third aspect of the present invention, a storage device is proposed, in which a plurality of programs are stored, the programs being adapted to be loaded and executed by a processor to implement the above autonomous evolution intelligent dialogue method based on a physical environment game.
In a fourth aspect of the present invention, a processing device is proposed, comprising a processor and a storage device; the processor is adapted to execute programs; the storage device is adapted to store a plurality of programs; the programs are adapted to be loaded and executed by the processor to implement the above autonomous evolution intelligent dialogue method based on a physical environment game.
Beneficial effects of the invention:
(1) By combining adversarial training and supervised training, the autonomous evolution intelligent dialogue method based on a physical environment game of the present invention makes the state vectors produced by the encoder-decoder model and the picture vectors of the physical world closely related in distribution, thereby realizing a three-party game between agent and human and between agent and physical environment, improving the accuracy and fluency of dialogue responses while avoiding the heavy computational burden caused by reinforcement learning.
(2) The autonomous evolution intelligent dialogue method based on a physical environment game of the present invention introduces extensive real physical-world information into autonomous-evolution artificial intelligence. Compared with existing methods, the method of the present invention makes fuller use of broad and easily obtained physical-environment information, allowing the model to acquire more general and extensible knowledge through autonomous evolution in the game with the physical environment.
(3) The autonomous evolution of the intelligent system of the present invention is accomplished through an interactive game with the physical environment, which better simulates the human learning process, relies on more easily obtained resources and acquires more general knowledge. Moreover, physical-environment resources are unsupervised information, so the amount of data is more ample and easier to obtain.
Brief description of the drawings
Other features, objects and advantages of the present application will become more apparent from the detailed description of non-limiting embodiments made with reference to the following drawings:
FIG. 1 is a schematic flowchart of the autonomous evolution intelligent dialogue method based on a physical environment game of the present invention;
FIG. 2 is a schematic diagram of the question encoder and fact encoder modules in one round of dialogue in an embodiment of the autonomous evolution intelligent dialogue method based on a physical environment game of the present invention;
FIG. 3 is a schematic diagram of the generation process of the supervised and adversarial training loss functions in an embodiment of the autonomous evolution intelligent dialogue method based on a physical environment game of the present invention.
Detailed description of the embodiments
The present application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the related invention and do not limit it. It should also be noted that, for ease of description, only the parts related to the invention are shown in the drawings.
It should be noted that, in the absence of conflict, the embodiments of the present application and the features in the embodiments may be combined with one another. The present application will be described in detail below with reference to the drawings and in conjunction with the embodiments.
Existing natural language processing methods that generate dialogue are mainly based on reinforcement learning and generative adversarial learning. Such methods can improve dialogue quality to some extent, but often have two defects: first, a large amount of sampling and trial and error is required when generating every word or sentence in order to accurately estimate the expectation-based loss function from feedback signals, and the policy-gradient algorithm based on feedback signals itself converges slowly, leading to excessive computational cost; second, the game with the physical world is not considered and the task is completed only through the text itself and simple goal driving, resulting in low information-processing accuracy. The present invention introduces a general method of playing a game with the physical environment to realize a three-party game among humans, machines and the physical world, improving the system's ability to integrate multi-modal information without introducing excessive computational complexity; the computational cost is low, convergence is fast, and the accuracy of processing information is further improved.
An autonomous evolution intelligent dialogue method based on a physical environment game of the present invention comprises:
Step S10, obtaining an image to be processed and the corresponding question text;
Step S20, using an optimized dialogue model to generate the response text for the image to be processed and the corresponding question text;
Step S30, outputting the response text;
wherein the dialogue model comprises a picture coding model, a text coding model, a state coding model and a decoder;
the picture coding model is constructed on the basis of a pre-trained convolutional neural network;
the text coding model, the state coding model and the decoder are language models based on recurrent neural networks;
the text coding model comprises a question encoder and a fact encoder.
In order to describe the autonomous evolution intelligent dialogue method based on a physical environment game of the present invention more clearly, each step of an embodiment of the method is detailed below with reference to FIG. 1.
The autonomous evolution intelligent dialogue method based on a physical environment game of one embodiment of the present invention comprises steps S10 to S30, each described in detail as follows:
Step S10, obtaining the image to be processed and the corresponding question text.
Step S20, using the optimized dialogue model to generate the response text for the image to be processed and the corresponding question text.
The dialogue model comprises a picture coding model, a text coding model, a state coding model and a decoder.
The text coding model, the state coding model and the decoder are language models based on recurrent neural networks.
The text coding model comprises a question encoder and a fact encoder.
The picture coding model is constructed on the basis of a pre-trained convolutional neural network; the steps are:
Step T10, selecting a picture set containing the physical environment as a pre-training picture set;
Step T20, using a convolutional neural network model and pre-training it with the object category of each picture in the pre-training picture set as the label; the pre-trained convolutional neural network is the picture coding model.
In this embodiment of the present invention, ImageNet is selected as a large-scale data set containing a large number of real-world pictures, the mature convolutional neural network model VGG16 is chosen, and pre-training is performed with the object category of each picture in the data set as the label, yielding the picture coding model CNN_pre.
In the optimization of the dialogue model, a discriminator also needs to be introduced; the dialogue model and the discriminator are optimized alternately until the values of the mixed loss function of the dialogue model and the loss function of the discriminator no longer decrease or fall below preset values. The steps are:
Step M10, obtaining a picture set representing the physical environment and the dialogue texts corresponding to the pictures as the first picture set and the first dialogue text set; the first dialogue text set comprises a first question text set and a first response text set.
Step M20, encoding the first picture set with the picture coding model to generate first picture vectors and obtain the first picture vector set.
For an input picture, the picture coding model CNN_pre outputs the fully connected layer vector of the picture's last layer; this vector encodes information of every level of the input picture and is the first picture vector I, as shown in equation (1):
I = CNN_pre(Img)        (1)
where I is the first picture vector, CNN_pre is the picture coding model, and Img is each picture in the picture set.
A picture vector is obtained for every picture in the first picture set in this way, forming the first picture vector set; the parameters of the CNN_pre model are not updated during training of the model.
Step M30, incorporating the first picture vector set and using the question encoder, the fact encoder and the state coding model to encode all rounds of dialogue text in the first dialogue text set into state vectors of the corresponding rounds, obtaining the first state vector set.
Step M31, encoding each word of all rounds of dialogue text into a word vector by word mapping, obtaining a word vector set.
Step M32, in the round-t dialogue text, on the basis of the word vector set, encoding the question text into a question vector with the question encoder; jointly encoding the question text and the response text into a fact vector with the fact encoder; and fusing and encoding the question vector, the fact vector, the first picture vector corresponding to the fact vector and the state vector of round t-1 into the round-t state vector with the state encoder; 1 ≤ t ≤ T, where T is the total number of dialogue rounds. FIG. 2 is a schematic diagram of the question encoder and fact encoder modules of this embodiment of the present invention.
Step M33, constructing the state vectors of the rounds obtained in step M32 into a second state vector set.
In round t of the dialogue history, {x_1, ..., x_n}_t is called the question, and the standard answer {y_1, ..., y_m}_t to that question given by the data set is called the answer. Each word w ∈ {x_1, ..., x_n, y_1, ..., y_m}_t of the question and answer is a one-hot vector, which can be mapped into a word vector e by the word mapping matrix, as shown in equation (2):
e = W_e · w,  W_e ∈ R^(b×v)        (2)
where b is the word vector dimension, v is the size of the vocabulary of all words in the data set, and w is the one-hot representation of each word.
In this embodiment, an LSTM model (Long Short-Term Memory network) is used as the question encoder Enc_q. An LSTM is a recurrent neural network: for each input word vector, the network computes a new hidden-layer state from the input word vector and its own hidden-layer state at the previous moment. The word vector sequence {e_1, ..., e_n}_t of the question is fed into the question encoder, and the hidden-layer state of the last moment is taken as the question vector q_t, as shown in equation (3):
q_t = Enc_q({e_1, ..., e_n}_t)        (3)
An LSTM model is used as the fact encoder Enc_f. The question and answer word vector sequences of round t are concatenated into {e_1, ..., e_(m+n)}_t and fed into the fact encoder, and the hidden-layer state of the last moment is taken as the fact vector f_t, as shown in equation (4):
f_t = Enc_f({e_1, ..., e_(m+n)}_t)        (4)
The fact vector records the question and answer information of the current round of dialogue and is used as historical information input in the next round (t+1) of dialogue.
An LSTM model is also used as the state encoder LSTM_s, which sits above the question encoder Enc_q and the fact encoder Enc_f and performs only one step per dialogue round t. Its inputs are the fact vector f_(t-1) and hidden-layer state s_(t-1) of round t-1, the question vector q_t of the current moment, and the first picture vector I on which the whole dialogue is based; it outputs the state vector s_t of the current round, as shown in equation (5):
s_t = LSTM_s([q_t, f_(t-1), I], s_(t-1))        (5)
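As a rough illustration of equations (2)-(5), the following sketch assumes PyTorch; the class name, method names and the concatenation-based fusion of the state-encoder inputs are illustrative assumptions, and the dimensions follow the embodiment (b = 300, d = 512, D = 4096).

    import torch
    import torch.nn as nn

    class HierEncoder(nn.Module):
        def __init__(self, vocab_size: int, b: int = 300, d: int = 512, D: int = 4096):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, b)              # word mapping matrix, eq. (2)
            self.enc_q = nn.LSTM(b, d, batch_first=True)          # question encoder Enc_q, eq. (3)
            self.enc_f = nn.LSTM(b, d, batch_first=True)          # fact encoder Enc_f, eq. (4)
            self.lstm_s = nn.LSTMCell(d + d + D, d)               # state encoder LSTM_s, eq. (5)

        def encode_question(self, question_tokens):
            _, (h, _) = self.enc_q(self.embed(question_tokens))   # last hidden state = q_t
            return h.squeeze(0)

        def encode_fact(self, qa_tokens):
            _, (h, _) = self.enc_f(self.embed(qa_tokens))         # concatenated Q+A words -> fact vector f_t
            return h.squeeze(0)

        def state_step(self, q_t, f_prev, I, state_prev=None):
            """One step per round t: fuse q_t, f_{t-1} and the picture vector I with s_{t-1}."""
            h, c = self.lstm_s(torch.cat([q_t, f_prev, I], dim=-1), state_prev)
            return h, (h, c)                                       # h is the round-t state vector s_t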
Step M40, generating, by the decoder, response sentences of the corresponding rounds from the first state vector set to obtain the second response text set, and generating the second picture vector set from the first state vector set through the single-layer perceptual mapping function.
The decoder takes each round's state vector in the first state vector set as its initial state and generates each word of the predicted answer in turn as the response sentence of the corresponding round, yielding the second response text set; the single-layer perceptual mapping function maps the state vector of each round in the first state vector set into the picture vector of the corresponding round, yielding the second picture vector set.
In dialogue round t, a single-layer perceptron model is used as the mapping function f, which maps the state vector s_t into the second picture vector s_t', as shown in equation (6):
s_t' = ReLU(W_p · s_t),  s_t' ∈ R^D        (6)
where D is the dimension of the second picture vector, which is also the dimension of the first picture vector I; W_p is the connection weight of the single-layer perceptron; and ReLU is the activation function used by the single-layer perceptron.
Step M50, calculating, by the discriminator, the probability that each picture vector in the second picture vector set belongs to a physical-environment vector, and optimizing the dialogue model with this probability and the first response text set to obtain the first optimized dialogue model.
Step M51, inputting each picture vector of the second picture vector set into the discriminator to obtain the probability that the picture vector belongs to a physical-environment vector; comparing the second response text set with the first response text set to calculate the supervised-training loss function and the physical-environment game loss function. FIG. 3 is a schematic diagram of the generation process of the supervised and adversarial training loss functions of this embodiment of the present invention.
In dialogue round t, an LSTM model is used as the decoder, with the state vector s_t as its initial state, to generate each predicted answer word in turn. The LSTM used by the decoder has the same structure as the encoder Enc_q shown in FIG. 2; at each time step, the word already decoded is encoded into a new hidden-layer vector, and on the basis of this new hidden-layer vector a single-layer perceptron model with a softmax activation function computes, for each word in the vocabulary, the probability of generating that word at that time step.
In this embodiment, a single-layer perceptron with a ReLU activation function is used as the discriminator DBot(). For each second picture vector, the discriminator outputs the probability p_t that the vector belongs to a physical-environment vector, as shown in equation (7):
p_t = DBot(s_t')        (7)
where DBot() is the discriminator and s_t' is the second picture vector.
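The mapping function f of equation (6) and the discriminator DBot() of equation (7) can be sketched as follows, assuming PyTorch; the class names are illustrative. Both are single-layer perceptrons with ReLU as stated above, so DBot's output is an unbounded non-negative score rather than a sigmoid probability, which matches the clipping-based adversarial training used later.

    import torch
    import torch.nn as nn

    class StateToPicture(nn.Module):              # f: s_t -> s_t' = ReLU(W_p · s_t), eq. (6)
        def __init__(self, d: int = 512, D: int = 4096):
            super().__init__()
            self.W_p = nn.Linear(d, D, bias=False)

        def forward(self, s_t):
            return torch.relu(self.W_p(s_t))

    class DBot(nn.Module):                        # discriminator score of being a physical-environment vector, eq. (7)
        def __init__(self, D: int = 4096):
            super().__init__()
            self.layer = nn.Linear(D, 1)

        def forward(self, picture_vec):
            return torch.relu(self.layer(picture_vec)).squeeze(-1)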
In dialogue round t, the response sentence in the first dialogue text is a word sequence {y_1^t, ..., y_N^t}, where N is the sentence length and T is the number of historical dialogue rounds. After all T rounds of dialogue are completed, the supervised-training loss function L_su over all whole-sentence response sentences of the dialogue is calculated with cross-entropy, as shown in equation (8):
L_su = -Σ_{t=1..T} Σ_{i=1..N} log p(y_i^t)        (8)
where p(y_i^t) is the generation probability of each word of the sequence.
After the predictions for all T rounds of dialogue in a sample are finished, for the second picture vector s_t' produced in each round, the probability DBot(s_t') produced by the discriminator is taken; the negative of the average of these probabilities is used as the loss function L_adv of the game with the physical environment, as shown in equation (9):
L_adv = -(1/T) Σ_{t=1..T} DBot(s_t')        (9)
where (1/T) Σ_t DBot(s_t') is the average of the probabilities that the second picture vectors belong to physical-environment vectors.
The smaller L_adv is, the closer the generated second picture vectors are to the distribution of the first picture vectors.
Step M52, combining these loss functions with the probability that the second picture vector set belongs to real physical-environment vectors to calculate the mixed loss function.
The mixed loss function L_G is obtained as the weighted sum of the supervised-training loss function and the physical-environment game loss function with parameter λ, as shown in equation (10):
L_G = L_su + λ·L_adv        (10)
where λ is a hyperparameter.
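A minimal sketch of the loss computation of equations (8)-(10), assuming PyTorch; the function name and the inputs (per-step vocabulary logits from the decoder and the reference answer words) are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def mixed_loss(step_logits, target_tokens, fake_picture_vecs, dbot, lam: float = 10.0):
        """step_logits: (B, N, V); target_tokens: (B, N); fake_picture_vecs: (B*T, D)."""
        # Equation (8): cross-entropy summed over every word of every reference answer.
        l_su = F.cross_entropy(step_logits.reshape(-1, step_logits.size(-1)),
                               target_tokens.reshape(-1), reduction="sum")
        # Equation (9): negated average of the discriminator's scores for generated vectors.
        l_adv = -dbot(fake_picture_vecs).mean()
        # Equation (10): weighted sum with hyperparameter lambda.
        return l_su + lam * l_adv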
Step M53, calculating the gradients of the mixed loss function with respect to the parameters of the encoders, the decoder and the mapping function, and updating the parameters of the encoders, the decoder and the single-layer perceptual mapping function to obtain the first optimized dialogue model.
In this embodiment, based on the computed parameter gradients, the Adam algorithm is used to update the parameters of the encoders, the decoder and the mapping function so as to reduce the value of the loss function.
Step M60, sampling the first picture vector set and the second picture vector set to generate the adversarial training sample pool, and optimizing the discriminator to obtain the first optimized discriminator.
Step M61, selecting a number of samples from the first picture vector set and marking them as true; selecting a number of samples from the second picture vector set and marking them as false; all marked vectors constitute the training sample pool of the discriminator.
In this embodiment, a sample subset of one batch (usually 32) is sampled from the dialogue data; the dialogue texts of the sample subset are encoded with the current encoder parameters to generate second picture vectors, and these vectors are labelled as false.
The same number of first picture vectors are sampled from the first picture vector set (they need not correspond to the sampled subset of dialogue data), and these vectors are labelled as true.
All picture vectors labelled true and false constitute the training sample pool of the discriminator.
Step M62, calculating the loss function of the discriminator so that the probability the discriminator outputs for true samples is as high as possible and the probability it outputs for false samples is as low as possible, and updating the parameters of the discriminator to obtain an optimized discriminator.
The loss function L_D of the discriminator is shown in equation (11):
L_D = E_{s_t'}[DBot(s_t')] - E_{I~p(I)}[DBot(I)]        (11)
where I is the first picture vector, s_t' is the second picture vector, DBot() is the discriminator, E_{s_t'}[DBot(s_t')] is the average of the probabilities that the fake picture vectors belong to physical-environment vectors, and E_{I~p(I)}[DBot(I)] is the average of the probabilities output for the true samples.
The gradient of the discriminator loss function L_D with respect to the parameters of the discriminator DBot() is computed; based on this gradient, the RMSProp algorithm is used to update the discriminator parameters so as to reduce the value of this loss function.
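A sketch of one discriminator update under equation (11), assuming PyTorch; the function name is illustrative. It drives DBot's output up on real picture vectors and down on generated ones, then clips the weights to [-c, c] as described for the adversarial training (c = 0.01).

    import torch

    def update_discriminator(dbot, optimizer, real_vecs, fake_vecs, c: float = 0.01):
        optimizer.zero_grad()
        l_d = dbot(fake_vecs).mean() - dbot(real_vecs).mean()   # eq. (11): low on fakes, high on reals
        l_d.backward()
        optimizer.step()
        for p in dbot.parameters():                             # compress weights into [-c, c]
            p.data.clamp_(-c, c)
        return l_d.item()

    # Usage (illustrative): optimizer = torch.optim.RMSprop(dbot.parameters(), lr=5e-5)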
Step S30, outputting the response text.
To further illustrate the performance of the autonomous evolution intelligent dialogue method based on a physical environment game of the present invention, the VisDial v0.5 multi-round question-answering data set is selected for evaluation in the embodiment of the present invention. A typical item of the VisDial data set gives one picture and the corresponding 10 rounds of natural-language dialogue; in each round the dialogue system is required to read the picture and all previous dialogue history, predict the response to the question of that round, and compare it with the true response sentence. Each response sentence has 100 candidate sentences, and the system must give the probability of producing each candidate sentence. The test metrics of the data set are related to the rank of the probability of the true answer among all candidate answers and fall into five categories: MRR (Mean Reciprocal Rank), the recall of the correct answer within the top 1/5/10 generation probabilities (Recall@1/5/10), and the average rank of the correct answer (Mean Rank). A lower mean rank indicates higher accuracy; for the other four metrics, higher values indicate higher accuracy.
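Given the 1-based rank of the true answer among the 100 candidates for every test question, these metrics can be computed as in the following sketch (plain Python; the input format is an illustrative assumption).

    def ranking_metrics(true_answer_ranks):
        """true_answer_ranks: list of 1-based ranks of the correct answer per question."""
        n = len(true_answer_ranks)
        mrr = sum(1.0 / r for r in true_answer_ranks) / n
        recall_at = {k: sum(r <= k for r in true_answer_ranks) / n for k in (1, 5, 10)}
        mean_rank = sum(true_answer_ranks) / n      # lower is better; the others: higher is better
        return mrr, recall_at, mean_rank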
The parameters are set as: n = 20, b = 300, d = 512, D = 4096, lr = 5e-5, lr_pre = 1e-3, bs = 32, λ = 10, c = 0.01. Here n is the maximum sentence length in all training data, b is the dimension of the word vector mapping, d is the dimension of the vectors produced by all LSTM recurrent neural networks in the encoder-decoder, and D is the dimension of the picture vector and the second picture vector. lr is the learning rate used for supervised and adversarial training, and lr_pre is the learning rate used when pre-training with supervised learning only; during pre-training, the learning rate decays gradually from 1e-3 to 5e-5, and pre-training is carried out for 30 rounds in total. bs is the size of the data batch sampled for each training step. λ is the weight of the adversarial-training loss function when computing the mixed loss function. c is the size of the clipping interval for the discriminator weights during adversarial training.
Under the above settings, the embodiment of the present invention, after pre-training, adds the adversarial training of the game with the physical environment and converges within 20 rounds; the resulting encoder-decoder parameters are used as the final visual dialogue system.
本发明实施例采用以下对比方法:
对比方法一SL-pretrain:本发明中所描述编解码器的纯监督训练版本(SL-pretrain),该模型是一个层级的文本句子和对话状态编码器,只使用监督式的损失函数进行训练,不涉及物理环境博弈中的对抗学习。
对比方法二Answer-Prior:答句先验(Answer-Prior),该模型是一个基线模型,直接使用一个长短期记忆(LSTM)神经网络对每一条候选答句进行编码,再通过单层感知机输出一个分数。该模型直接在训练集的所有答句上进行训练,不考虑图片信息。
对比方法三MN:记忆网络模型(MN),该模型对每一轮对话历史进行离散的向量式存储,并在产生回答时对历史向量以点积相似度计算和加权和的形式进行检索,同时整合图片向量信息,使用循环神经网络进行解码。
对比方法四LF:后期融合编码器(LF),该模型将对话历史中的所有轮的问答句子视为一个长序列,并用一个LSTM循环神经网络 进行编码;对于当前轮次的问句,用另一个LSTM进行编码。将历史编码、问句编码和图片向量进行拼接后,用多层感知机融合映射,在感知机所得向量的基础上使用循环神经网络进行解码。
对比方法五HREA:层级注意力编码(HREA),该模型的结构与对比方法一所述SL-pretrain模型类似。参照步骤M32,唯一的区别是输入状态编码器的事实向量不再是来自第t-1轮,而是使用参数化注意力机制计算当前问句向量与每一轮对话历史的事实向量的相似权重,对每一轮的事实向量进行加权和,作为新的事实向量输入状态编码器。
对比方法六RL-Multi:目标驱动的强化学习(RL-Multi),该方法使用与对比方法一所述SL-pretrain类似的模型,区别在于,对问句进行编码后,该方法在问句向量基础上通过单层感知机产生一个对图片向量的预测,使用该预测向量与对话所基于的图片向量之间的欧式距离作为奖励信号,通过强化学习方法计算损失函数,并与监督训练的损失函数进行加权求和。
本发明实施例和对比方法实验结果如表1所示:
表1
方法 MRR R@1(%) R@5(%) R@10(%) Mean
SL-pretrain 0.436 33.02 53.41 60.09 21.83
Answer-Prior 0.311 19.85 39.14 44.28 31.56
MN 0.443 34.62 53.74 60.18 21.69
LF 0.430 33.27 51.96 58.09 23.04
HREA 0.442 34.47 53.43 59.73 21.83
RL-Multi 0.437 33.22 53.67 60.48 21.13
本发明方法 0.446 34.55 54.29 61.15 20.72
表1中的实验结果表明,本发明的基于物理环境博弈的自主进化智能对话方法的实施例对模型在数据集各项指标上的表现均有明显的提升作用。一个结构简单、不包含对文本和图像的任何注意力机制的模型(SL-pretrain)经过本发明所述的训练过程后,在所有指标上明显超越了大多数其他模型。
此外,为了验证本发明中基于物理环境博弈的对抗训练不但是一种较为理想的提升视觉对话系统性能的途径,且这一提升是稳定、鲁棒的,与编解码器本身的结构无关,在对比方法LF的模型基础上加入了本发明所述的物理环境博弈进行混合损失函数训练。不同训练方法对不同模型的性能提升比较如表2所示:
表2
方法 MRR R@1(%) R@5(%) R@10(%) Mean
SL-pretrain 0.436 33.02 53.41 60.09 21.83
RL-Multi提升 0.001 0.20 0.26 0.39 0.70
本发明提升 0.010 1.53 0.88 1.06 1.11
LF 0.430 33.27 51.96 58.09 23.04
本发明提升 0.009 1.54 0.90 1.14 1.11
对比表2的前三行可以表明,使用完全相同的编解码器结构,本发明中物理环境博弈所带来的提升远比之前最佳的对比方法六RL-Multi提升明显。原因在于,对比方法六使用目标驱动的强化学习方式,但强化学习的奖励只与当前样本中所涉及的唯一一张图片有关。由于人类语言的高度抽象性,文本中的10轮对话能够正确描述的现实世界图片并不限于真实数据中的这一张。因此,选择与这一张图片的欧式距离作为奖惩信号并不是一种很可靠的辅助训练方式。与此相反,本发明的思路是通过物理环境博弈的对抗训练,使编码器产生的状态编码从分布上更贴近于现实世界图片,从而在整体数据分布层面整合来自多模态的先验知识。对比表2的后三行可以表明,本发明所涉及的对抗学习与监督学习混合损失函数能够稳定地为不同的编解码器模型带来性能提升,是一种较为高效和通用的视觉对话方法。同时,任何物理环境图片数据 集都可以直接被用来参与该模型的博弈,而该模型的博弈过程也适用于任何需要从视觉信息中获取知识的目标任务。因此,该模型所使用的数据更容易获得,并且解决了其他自主进化方法缺少泛用性的问题。
本发明第二实施例的基于物理环境博弈的自主进化智能对话系统,包括获取模块、对话模型、输出模块;
所述获取模块,配置为获取待处理的图像及对应的问题信息并输入;
所述对话模型,配置为采用优化的对话模型生成所述待处理的图像和对应的问题信息的应答信息;
所述输出模块,配置为输出应答信息;
其中,所述对话模型,包括图像编码模块、文本编码模块、解码模块;
所述图像编码模块,配置为采用构建的图片编码模型对获取的第一图片集中每一个图片进行编码,生成第一图片向量,获得第一图片向量集;
所述文本编码模块,配置为融入第一图片向量集,利用问句编码器、事实编码器和状态编码模型将所述第一对话文本集中对话文本的所有轮次的对话文本编码为对应轮次的状态向量,得到第一状态向量集;
所述解码模块,配置为以第一状态向量集为基础,生成对应轮次的应答文本。
所属技术领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统的具体工作过程及有关说明,可以参考前述方法实施例中的对应过程,在此不再赘述。
需要说明的是,上述实施例提供的基于物理环境博弈的自主进化智能对话系统,仅以上述各功能模块的划分进行举例说明,在实际 应用中,可以根据需要而将上述功能分配由不同的功能模块来完成,即将本发明实施例中的模块或者步骤再分解或者组合,例如,上述实施例的模块可以合并为一个模块,也可以进一步拆分成多个子模块,以完成以上描述的全部或者部分功能。对于本发明实施例中涉及的模块、步骤的名称,仅仅是为了区分各个模块或者步骤,不视为对本发明的不当限定。
本发明第三实施例的一种存储装置,其中存储有多条程序,所述程序适于由处理器加载并执行以实现上述的基于物理环境博弈的自主进化智能对话方法。
本发明第四实施例的一种处理装置,包括处理器、存储装置;处理器,适于执行各条程序;存储装置,适于存储多条程序;所述程序适于由处理器加载并执行以实现上述的基于物理环境博弈的自主进化智能对话方法。
所属技术领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的存储装置、处理装置的具体工作过程及有关说明,可以参考前述方法实施例中的对应过程,在此不再赘述。
本领域技术人员应该能够意识到,结合本文中所公开的实施例描述的各示例的模块、方法步骤,能够以电子硬件、计算机软件或者二者的结合来实现,软件模块、方法步骤对应的程序可以置于随机存储器(RAM)、内存、只读存储器(ROM)、电可编程ROM、电可擦除可编程ROM、寄存器、硬盘、可移动磁盘、CD-ROM、或技术领域内所公知的任意其它形式的存储介质中。为了清楚地说明电子硬件和软件的可互换性,在上述说明中已经按照功能一般性地描述了各示例的组成及步骤。这些功能究竟以电子硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。本领域技术人员可以对每个特定的应用来 使用不同方法来实现所描述的功能,但是这种实现不应认为超出本发明的范围。
术语“第一”、“第二”等是用于区别类似的对象,而不是用于描述或表示特定的顺序或先后次序。
术语“包括”或者任何其它类似用语旨在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备/装置不仅包括那些要素,而且还包括没有明确列出的其它要素,或者还包括这些过程、方法、物品或者设备/装置所固有的要素。
至此,已经结合附图所示的优选实施方式描述了本发明的技术方案,但是,本领域技术人员容易理解的是,本发明的保护范围显然不局限于这些具体实施方式。在不偏离本发明的原理的前提下,本领域技术人员可以对相关技术特征作出等同的更改或替换,这些更改或替换之后的技术方案都将落入本发明的保护范围之内。

Claims (17)

  1. 一种基于物理环境博弈的自主进化智能对话方法,其特征在于,包括:
    步骤S10,获取待处理图像及对应问题文本;
    步骤S20,采用优化的对话模型生成所述待处理图像和对应问题文本的应答文本;
    步骤S30,输出应答文本;
    其中,所述对话模型包括图片编码模型、文本编码模型、状态编码模型、解码器;
    所述图片编码模型基于预训练的卷积神经网络构建;
    所述文本编码模型、状态编码模型、解码器为基于循环神经网络的语言模型;
    所述文本编码模型包括问句编码器、事实编码器。
  2. 根据权利要求1所述的基于物理环境博弈的自主进化智能对话方法,其特征在于,所述优化的对话模型,其优化过程还需引入判别器,对话模型与判别器交替优化直至对话模型的混合损失函数和判别器的损失函数值不再下降或低于预设值,其步骤为:
    步骤M10,获取代表物理环境的图片集及所述图片集对应的对话文本,作为第一图片集和第一对话文本集;所述第一对话文本集包括第一问题文本集、第一应答文本集;
    步骤M20,采用图片编码模型对所述第一图片集中每一个图片分别进行编码,生成第一图片向量,获得第一图片向量集;
    步骤M30,融入第一图片向量集,利用问句编码器、事实编码器和状态编码模型将所述第一对话文本集中对话文本的所有轮次的对话编码为 对应轮次的状态向量,得到第一状态向量集;
    步骤M40,通过解码器将所述第一状态向量集生成对应轮次的应答语句,获得第二应答文本集;通过单层感知映射函数将所述第一状态向量集生成第二图片向量集;
    步骤M50,通过判别器对第二图片向量集中所有图片向量属于物理环境向量的概率进行计算,利用所述概率以及第一应答文本集,优化对话模型,得到第一优化对话模型;
    步骤M60,对第一图片向量集和第二图片向量集进行采样,生成对抗训练样本池,对判别器进行优化,得到第一优化判别器。
  3. 根据权利要求1所述的基于物理环境博弈的自主进化智能对话方法,其特征在于,所述图片编码模型的构建,还设置有预训练步骤,其步骤为:
    步骤T10,选取包含物理环境的图片集,作为预训练图片集;
    步骤T20,采用卷积神经网络模型,以所述预训练图片集中每一张图片的物体类别为标签进行预训练,预训练所得的卷积神经网络为图片编码模型。
  4. 根据权利要求2所述的基于物理环境博弈的自主进化智能对话方法,其特征在于,所述第一图片向量为:
    I = CNN_pre(Img)
    其中,I为第一图片向量,CNN_pre为图片编码模型,Img为图片集中每一个图片。
  5. 根据权利要求2或4所述的基于物理环境博弈的自主进化智能对话方法,其特征在于,步骤M20中“采用图片编码模型对所述第一图片集进 行编码,生成第一图片向量”,其方法为:
    将所述第一图片集的每一张图片分别输入图片编码模型,输出对应图片最后一层的全连接层向量,所述向量编码了所述输入图片的各个层级的信息,获得第一图片向量集。
  6. 根据权利要求2所述的基于物理环境博弈的自主进化智能对话方法,其特征在于,步骤M30中“融入第一图片向量集,利用问句编码器、事实编码器和状态编码模型将所述第一对话文本集中对话文本的所有轮次的对话编码为对应轮次的状态向量”,其步骤为:
    步骤M31,通过词映射的方法,将所有轮次对话文本中每个词编码为词向量,获得词向量集;
    步骤M32,在t轮对话文本中,基于所述词向量集,使用问句编码器将问题文本编码成为问句向量;使用事实编码器将问题文本和应答文本联合编码成为事实向量;使用状态编码器将所述问句向量、事实向量、所述事实向量对应的第一图片向量和t-1轮的状态向量融合编码为第t轮状态向量;1≤t≤T,T为对话总轮次数;
    步骤M33,将通过步骤M32得到的各轮状态向量构建为第二状态向量集。
  7. 根据权利要求6所述的基于物理环境博弈的自主进化智能对话方法,其特征在于,词向量、问句向量、事实向量和状态向量,计算方法为:
    e = W_e·w, W_e∈R^{b×v}
    其中,e为词向量,b为词向量维度,v为数据集中所有单词构成的词表的大小,w为每个词的独热码表示,W_e为由独热码到词向量的映射矩阵;
    q_t = Enc_q({e_1,…,e_n}_t)
    其中,q_t为问句向量,Enc_q为问句编码器,{e_1,…,e_n}_t为问句词向量序列;
    f_t = Enc_f({e_1,…,e_{m+n}}_t)
    其中,f_t为事实向量,Enc_f为事实编码器;{e_1,…,e_{m+n}}_t为第t轮的问句和答句词向量序列的拼接序列;
    s_t = LSTM_s(q_t, f_{t-1}, I, s_{t-1})
    其中,s_t为当前轮次的状态向量;LSTM_s为状态编码器,每一轮对话t内只进行一步运算;s_{t-1}为第t-1轮的隐藏层状态;q_t为当前轮次的问题文本问句向量;f_{t-1}为上一轮次事实向量;I为对话所基于的第一图片向量。
  8. 根据权利要求2所述的基于物理环境博弈的自主进化智能对话方法,其特征在于,步骤M40中“通过解码器将所述第一状态向量集生成对应轮次的应答语句,获得第二应答文本集;通过单层感知映射函数将所述第一状态向量集生成第二图片向量集”,其方法为:
    采用解码器,以所述第一状态向量集中每一轮状态向量为初始状态,依次生成预测答案的每个词,为对应轮次的应答语句,获得第二应答文本集;使用单层感知映射函数将所述第一状态向量集中每一轮的状态向量映射成为对应轮次的图片向量,获得第二图片向量集。
  9. 根据权利要求8所述基于物理环境博弈的自主进化智能对话方法, 其特征在于,第二图片向量为:
    s_t' = ReLU(W_p s_t), s_t'∈R^D
    其中,s_t'为第二图片向量;D为第二图片向量维度,也是第一图片向量I的维度;W_p是单层感知机的连接权重;ReLU是单层感知机所使用的激活函数。
  10. 根据权利要求2所述的基于物理环境博弈的自主进化智能对话方法,其特征在于,步骤M50中“通过判别器对第二图片向量集中所有图片向量属于物理环境向量的概率进行计算,利用所述概率以及第一应答文本集,优化对话模型”,其步骤为:
    步骤M51,将所述第二图片向量集中每一个图片向量输入判别器,获得图片向量属于物理环境向量的概率;将所述第二应答文本集与第一应答文本集比较,计算监督训练的损失函数和物理环境博弈损失函数;
    步骤M52,将所述损失函数与第二图片向量集属于真实物理环境向量的概率相结合,计算混合损失函数;
    步骤M53,计算所述混合损失函数对所述编码器、解码器和映射函数的参数的梯度,对所述编码器、解码器和单层感知映射函数的参数更新,得到第一优化对话模型。
  11. 根据权利要求10所述的基于物理环境博弈的自主进化智能对话方法,其特征在于,所述第二图片向量属于物理环境向量的概率,计算方法为:
    p_t = DBot(s_t')
    其中,p_t为第二图片向量属于物理环境向量的概率,DBot()为判别器,s_t'为第二图片向量。
  12. 根据权利要求10所述的基于物理环境博弈的自主进化智能对话方法,其特征在于,所述监督训练的损失函数、物理环境博弈损失函数和混合损失函数,计算方法为:
    L_su = −∑_{t=1}^{T} ∑_{i=1}^{N} log p(y_i^t)
    L_adv = −(1/T) ∑_{t=1}^{T} DBot(s_t')
    L_G = L_su + λL_adv
    其中,L_su为监督训练的损失函数、L_adv为物理环境博弈损失函数、L_G为混合损失函数,N为轮次t真实对话应答语句长度,{y_1^t,…,y_N^t}为第一应答文本词序列,T为对话总轮次数,p(y_i^t)为该序列中的每个词的生成概率,(1/T) ∑_{t=1}^{T} DBot(s_t')为第二图片向量属于物理环境向量的概率的平均值,λ为超参数。
  13. 根据权利要求2所述的基于物理环境博弈的自主进化智能对话方法,其特征在于,步骤M60中“对第一图片向量集和第二图片向量集进行采样,生成对抗训练样本池,对判别器进行优化”,其步骤为:
    步骤M61,从所述第一图片向量集中选取若干样本,标记为真;从所述第二图片向量集中选取若干样本,标记为假;所有带有标记的向量构成判别器的训练样本池;
    步骤M62,计算判别器的损失函数,使判别器对真样本输出的概率尽可能高,对假样本输出的概率尽可能低,对判别器进行参数更新,得到优化的判别器。
  14. 根据权利要求13所述的基于物理环境博弈的自主进化智能对话方法,其特征在于,所述判别器损失函数,计算方法为:
    L_D = (1/T) ∑_{t=1}^{T} DBot(s_t') − E_{I~p(I)}[DBot(I)]
    其中,L_D为判别器损失函数,I为第一图片向量,s_t'为第二图片向量,DBot()为判别器,(1/T) ∑_{t=1}^{T} DBot(s_t')为第二图片向量属于物理环境向量的概率的平均值,E_{I~p(I)}[DBot(I)]为真样本输出的概率的平均值。
  15. 一种基于物理环境博弈的自主进化智能对话系统,其特征在于,包括获取模块、对话模型、输出模块;
    所述获取模块,配置为获取待处理的图像及对应的问题信息并输入;
    所述对话模型,配置为采用优化的对话模型生成所述待处理的图像和对应的问题信息的应答信息;
    所述输出模块,配置为输出应答信息;
    其中,所述对话模型,包括图像编码模块、文本编码模块、解码模块;
    所述图像编码模块,配置为采用构建的图片编码模型对获取的第一图片集进行编码,生成第一图片向量,获得第一图片向量集;
    所述文本编码模块,配置为融入第一图片向量集,利用文本编码模型和状态编码模型将所述第一对话文本集中对话文本的所有轮次的对话文本编码为对应轮次的状态向量,得到第一状态向量集;
    所述解码模块,配置为以第一状态向量集为基础,生成对应轮次的应答文本。
  16. 一种存储装置,其中存储有多条程序,其特征在于,所述程序适于由处理器加载并执行以实现权利要求1-14任一项所述的基于物理环境博弈的自主进化智能对话方法。
  17. 一种处理装置,包括
    处理器,适于执行各条程序;以及
    存储装置,适于存储多条程序;
    其特征在于,所述程序适于由处理器加载并执行以实现:
    权利要求1-14任一项所述的基于物理环境博弈的自主进化智能对话方法。



