CN116680380A - Visual intelligent question-answering method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN116680380A
CN116680380A (application CN202310654872.9A)
Authority
CN
China
Prior art keywords
vector
text
feature
information
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310654872.9A
Other languages
Chinese (zh)
Inventor
唐小初
黎铭
舒畅
陈又新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202310654872.9A priority Critical patent/CN116680380A/en
Publication of CN116680380A publication Critical patent/CN116680380A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to the field of machine learning and discloses a visual intelligent question-answering method applicable to intelligent question-answering scenarios, comprising the following steps: obtaining user data, and performing word segmentation and vector conversion on the text data in the user data to obtain text word vectors; extracting features from the picture data in the user data and from the text word vectors, respectively, to obtain a multi-level feature map and text feature word vectors; constructing a first relation graph among the feature maps in the multi-level feature map, a second relation graph among the word vectors in the text feature word vectors, and a third relation graph between the multi-level feature map and the text feature word vectors, and constructing a final relation graph of the user data from these three relation graphs; performing information aggregation on the final relation graph to obtain an aggregated information vector; performing dimension reduction on the aggregated information vector to obtain a dimension-reduced information vector; and analyzing the answer to the question in the user data according to the dimension-reduced information vector. The invention can improve the accuracy of answers to user questions.

Description

Visual intelligent question-answering method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of machine learning, and in particular, to a visual intelligent question-answering method, a visual intelligent question-answering device, an electronic device, and a computer readable storage medium.
Background
Visual question answering (VQA) is a multi-modal learning task spanning computer vision and natural language processing. Its main objective is to make a computer output an answer that conforms to natural-language rules and is reasonable in content, given an input picture and an open natural-language question posed about that picture. Medical VQA combines medical artificial intelligence with the general VQA challenge: given a medical image and a clinically relevant natural-language question, a medical VQA system is expected to predict a trustworthy and convincing answer. At present, a convolutional neural network and a recurrent neural network are mainly used to extract features from the medical image and from the text of the clinically relevant question, respectively; a fusion module then performs joint reasoning over the two kinds of information, and finally the answer to the question is output. However, these methods do not extract the relations between the scene in the medical picture and the objects mentioned in the question text, so much useful information is lost and the accuracy of the predicted answers is low.
Disclosure of Invention
The invention provides a visual intelligent question-answering method, a visual intelligent question-answering device, an electronic device and a computer-readable storage medium, and aims to improve the accuracy of predicted answers to questions.
In order to achieve the above purpose, the invention provides a visual intelligent question-answering method, which comprises the following steps:
obtaining user data, performing word segmentation on text data in the user data to obtain text word segmentation, and performing vector conversion on the text word segmentation to obtain text word vectors;
performing multi-level feature extraction on the picture data in the user data by using a preset picture encoder to obtain a multi-level feature map, and performing feature extraction on the text word vector by using a preset text encoder to obtain a text feature word vector;
constructing a first relation diagram among feature diagrams in the multi-level feature diagrams, constructing a second relation diagram among word vectors in the text feature word vectors, constructing a third relation diagram among the multi-level feature diagrams and the text feature word vectors, and constructing a final relation diagram of the user data according to the first relation diagram, the second relation diagram and the third relation diagram;
carrying out information aggregation processing on the final relation graph by using an information aggregation function in a graph neural network in the trained intelligent question-answering model to obtain an aggregation information vector of the final relation graph;
performing dimension reduction processing on the aggregate information vector by using a dimension reduction layer in the trained intelligent question-answering model to obtain a dimension reduction information vector;
and analyzing answers to the questions of the user data by utilizing a full connection layer in the trained intelligent question-answering model according to the dimension reduction information vector.
Optionally, the performing multi-level feature extraction on the picture data in the user data by using a preset picture encoder to obtain a multi-level feature map includes:
performing multi-layer convolution operation on the picture data by utilizing a multi-layer convolution layer in the preset picture encoder to obtain a multi-layer convolution picture;
calculating a multi-layer linear picture of the multi-layer convolution picture according to the shared weights and shared biases of the multi-layer convolution layers; and determining a multi-level feature map of the picture data according to the multi-layer linear picture.
Optionally, the extracting features of the text word vector by using a preset text encoder to obtain a text feature word vector includes:
acquiring text data corresponding to the text word vector;
extracting description object information of the text data;
determining a description object vector corresponding to the description object information;
performing position coding on the text word vector by using the preset text encoder to obtain a position vector;
constructing a text initial word vector of the text word vector according to the position vector;
calculating cosine similarity between the description object vector and the text initial word vector;
and when the cosine similarity is not smaller than a preset threshold, taking the text initial word vector corresponding to the cosine similarity as the text characteristic word vector.
Optionally, the constructing a first relation graph between feature graphs in the multi-level feature graph includes:
carrying out vector conversion on the feature images in the multi-level feature images by using a preset image vector conversion function to obtain a same-dimensional feature vector;
normalizing the same-dimensional feature vector to obtain a graph feature vector;
calculating the vector similarity between any two image vectors in the image feature vectors, and determining the connectivity between any two feature images corresponding to the any two image vectors according to the vector similarity;
and constructing a first relation graph among the feature graphs in the multi-level feature graph according to the connectivity.
Optionally, the constructing a final relationship graph of the user problem according to the first relationship graph, the second relationship graph and the third relationship graph includes:
performing node full connection on the first relationship diagram, the second relationship diagram and the third relationship diagram to obtain an initial relationship diagram;
identifying any node characteristic and adjacent node characteristic in the initial relation diagram;
calculating cosine similarity between the arbitrary node characteristics and the adjacent node characteristics;
normalizing the cosine similarity to obtain normalized similarity;
and determining the final relation diagram according to the normalized similarity and the initial relation diagram.
Optionally, the information aggregation processing is performed on the final relationship graph by using an information aggregation function in a graph neural network in the trained intelligent question-answering model to obtain an aggregated information vector of the final relationship graph, which includes:
taking the characteristic vector of each node in the final relation diagram as an initial information vector, and calculating a target aggregation vector of the initial information vector by utilizing an information aggregation function in the graphic neural network;
and determining the aggregation information vector according to the target aggregation vector.
Optionally, the information aggregation function includes:
wherein

$$h_v^{(k)} = \sigma\left(W_k \cdot \frac{1}{|N(v)|}\sum_{u\in N(v)} h_u^{(k-1)} + B_k\, h_v^{(k-1)}\right),\qquad k=1,\dots,m$$

where $h_v^{(k)}$ denotes the aggregated information vector of the current information aggregation node v after the k-th round of information aggregation; v denotes the current information aggregation node; σ denotes a nonlinear activation function; $W_k$ denotes the weight coefficient applied to the neighbor information vectors produced by the (k-1)-th round of aggregation at the neighbors of the current node; u denotes a neighbor node of the current information aggregation node; $B_k$ denotes the weight coefficient applied to the (k-1)-th aggregated information vector of the current node v; N(v) denotes the set of neighbor nodes of node v; |N(v)| denotes the number of neighbor nodes of node v; k denotes the aggregation round; and m denotes the maximum number of aggregation iterations.
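The rule described by these variables (neighbor-mean aggregation per round, a GraphSAGE-style aggregator) can be sketched in plain Python. The toy relation graph, the scalar weights W_k and B_k (matrices in practice), and the choice of ReLU as the activation σ are illustrative assumptions, not values from the patent:

```python
def relu(x):
    # nonlinear activation sigma; ReLU is chosen here purely as an example
    return [max(0.0, v) for v in x]

def mean_neighbors(h, neighbors):
    # average of the neighbors' round-(k-1) information vectors
    dim = len(next(iter(h.values())))
    out = [0.0] * dim
    for u in neighbors:
        for i, v in enumerate(h[u]):
            out[i] += v
    return [v / len(neighbors) for v in out]

def aggregate(h0, adj, W, B, m):
    """Run m rounds of h_v^k = sigma(W_k * mean_u h_u^{k-1} + B_k * h_v^{k-1})."""
    h = dict(h0)
    for k in range(1, m + 1):
        new_h = {}
        for v, nbrs in adj.items():
            agg = mean_neighbors(h, nbrs)  # fuse the neighbors' information
            new_h[v] = relu([W[k] * a + B[k] * s for a, s in zip(agg, h[v])])
        h = new_h  # round k becomes the input of round k+1
    return h

# Toy 3-node relation graph: edges 0-1 and 1-2 (undirected).
adj = {0: [1], 1: [0, 2], 2: [1]}
h0 = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}  # initial node features
W = {1: 0.5, 2: 0.5}  # scalar stand-ins for the W_k weight matrices
B = {1: 1.0, 2: 1.0}  # scalar stand-ins for the B_k weight matrices
out = aggregate(h0, adj, W, B, m=2)
```

After m rounds, each node's vector fuses information from its m-hop neighborhood, which is what makes the aggregated vectors suitable for answer prediction.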
In order to solve the above problems, the present invention further provides a visual intelligent question-answering device, which includes:
the text word vector generation module is used for acquiring user data, performing word segmentation on text data in the user data to obtain text word segmentation, and performing vector conversion on the text word segmentation to obtain a text word vector;
the feature extraction module is used for carrying out multi-level feature extraction on the picture data in the user data by utilizing a preset picture encoder to obtain a multi-level feature map, and carrying out feature extraction on the text word vector by utilizing a preset text encoder to obtain a text feature word vector;
The relation diagram construction module is used for constructing a first relation diagram among feature diagrams in the multi-level feature diagrams, a second relation diagram among word vectors in the text feature word vectors, a third relation diagram among the multi-level feature diagrams and the text feature word vectors, and a final relation diagram of the user data according to the first relation diagram, the second relation diagram and the third relation diagram;
the information aggregation module is used for carrying out information aggregation processing on the final relation graph by utilizing an information aggregation function in a graph neural network in the trained intelligent question-answering model to obtain an aggregated information vector of the final relation graph;
the information dimension reduction module is used for carrying out dimension reduction processing on the aggregate information vector by utilizing a dimension reduction layer in the trained intelligent question-answering model to obtain a dimension reduction information vector;
and the question answer analysis module is used for analyzing the question answers of the user data by utilizing the full connection layer in the trained intelligent question answer model according to the dimension reduction information vector.
In order to solve the above-mentioned problems, the present invention also provides an electronic apparatus including:
at least one processor; the method comprises the steps of,
A memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to implement the visual intelligent question-answering method described above.
In order to solve the above-mentioned problems, the present invention also provides a computer-readable storage medium having stored therein at least one computer program that is executed by a processor in an electronic device to implement the visual intelligent question-answering method described above.
It can be seen that, in the embodiment of the invention, obtaining the user data provides the operation object for the subsequent steps, while performing word segmentation on the text data in the user data and vector conversion on the resulting words preprocesses the question text and thereby supports the subsequent extraction of its feature word vectors. Next, multi-level feature extraction on the picture data with the preset picture encoder yields the multi-level feature map, which supports generating the first relation graph used later to produce fusion information, and feature extraction on the text word vectors with the preset text encoder yields the text feature word vectors, which likewise supports generating the second relation graph. The first, second and third relation graphs are all essential components of the final relation graph, and constructing the final relation graph of the user question from them allows information aggregation to be carried out subsequently, leading to a more accurate final answer prediction. Further, performing information aggregation on the final relation graph with the information aggregation function of the graph neural network in the trained intelligent question-answering model produces an aggregated information vector that fuses all associated information of the question picture and the question text, supporting the generation of the final answer; performing dimension reduction on the aggregated information vector with the dimension-reduction layer of the trained model reduces the computational cost and complexity of the method and yields the dimension-reduced information vector from which the final answer vector can be determined; and, according to the dimension-reduced information vector, the fully connected layer of the trained model can fully connect this vector, which fuses all associated information of the picture data and text data in the user data, so that a more accurate answer to the visual question is obtained. Therefore, the visual intelligent question-answering method, device, electronic equipment and storage medium of the invention can improve the accuracy of answers to user questions. In an intelligent medical-inquiry scenario in particular, they can help patients quickly locate the relevant disease category, facilitate further medical diagnosis, greatly relieve patients' confusion about which department to visit before seeking treatment, make the direction of treatment clearer, and improve the efficiency of patients' medical care.
Drawings
FIG. 1 is a flow chart of a visual intelligent question-answering method according to an embodiment of the present invention;
fig. 2 is a schematic block diagram of a visual intelligent question-answering device according to an embodiment of the present invention;
fig. 3 is a schematic diagram of an internal structure of an electronic device for implementing a visual intelligent question-answering method according to an embodiment of the present invention;
the achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The embodiment of the invention provides a visual intelligent question-answering method. The execution subject of the method includes, but is not limited to, at least one of a server, a terminal, or other devices that can be configured to execute the method provided by the embodiment of the invention. In other words, the method may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server side includes, but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like. The server may be an independent server, or may be a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data and artificial intelligence platforms.
Referring to fig. 1, a flow chart of a visual intelligent question-answering method according to an embodiment of the invention is shown. In the embodiment of the invention, the visual intelligent question answering method comprises the following steps:
s1, obtaining user data, performing word segmentation on text data in the user data to obtain text word segmentation, and performing vector conversion on the text word segmentation to obtain text word vectors.
The embodiment of the invention provides the operation object for the subsequent method steps by acquiring the user data. The user data refers to the question conditions and content on which the computer relies to complete the visual question answering; it comprises picture data and text data, both of which can be acquired by a data script, and the data script can be written in the JS scripting language. The user data is generated in different business scenarios; in a medical scenario, for example, the picture data and text data can be medical pictures and medical text data.
Further, according to the embodiment of the invention, word segmentation is performed on the text data in the user data and vector conversion is performed on the resulting words, so that the question text can be preprocessed, providing a guarantee for the subsequent extraction of its feature word vectors. Text word segmentation refers to cutting a sequence of Chinese characters into individual words; for example, the sentence "the medical level of the hospital is high" would be cut into the words "the hospital / medical level / is high". A text word vector maps a word in the natural-language vocabulary to a vector in real space; conceptually, this is a mathematical embedding of each word from a one-dimensional space into a multi-dimensional continuous vector space.
Alternatively, the text words may be obtained by segmenting the question text with a word segmentation tool such as jieba, SnowNLP or pkuseg. For example, the word segmentation tool can be applied to a question text such as "what is the type of the tumor and what is the location of the tumor?" to obtain the text words, namely "what type of tumor" and "where the tumor is located". The vector conversion of the text words may be implemented with a vector conversion model such as CBOW, LBL, Skip-gram, NNLM, GloVe or C&W.
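A minimal sketch of this segment-then-embed step: the whitespace tokenizer and the tiny embedding table below are self-contained stand-ins for real tools such as jieba and a trained CBOW/GloVe model, which the patent assumes but this example does not require:

```python
# Stand-in for a real segmenter such as jieba.lcut(); splitting on whitespace
# keeps the example self-contained (real Chinese segmentation is harder).
def segment(text):
    return text.split()

# Stand-in for a trained word-vector model (CBOW, Skip-gram, GloVe, ...):
# a lookup table from word to a low-dimensional real-valued vector.
EMBEDDINGS = {
    "what": [0.1, 0.2],
    "type": [0.3, 0.1],
    "tumor": [0.9, 0.7],
    "location": [0.4, 0.8],
}
UNK = [0.0, 0.0]  # vector for out-of-vocabulary words

def text_to_word_vectors(text):
    # word segmentation followed by vector conversion of each text word
    words = segment(text)
    return words, [EMBEDDINGS.get(w, UNK) for w in words]

words, vectors = text_to_word_vectors("what type tumor location")
```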
S2, carrying out multi-level feature extraction on the picture data in the user data by using a preset picture encoder to obtain a multi-level feature map, and carrying out feature extraction on the text word vector by using a preset text encoder to obtain a text feature word vector.
According to the embodiment of the invention, multi-level feature extraction is performed on the picture data in the user data by the preset picture encoder, which supports generating the first relation graph used later to produce fusion information. The picture encoder is a component that encodes picture data into a signal form suitable for communication, transmission and storage. The multi-level feature map refers to the feature maps in multiple dimensions generated when the picture data is convolved by the convolution layers of a convolutional neural network, such as the medical feature maps of different dimensions generated by convolving a clinical B-ultrasound image or a nuclear magnetic resonance image.
Further, in an optional embodiment of the present invention, the performing multi-level feature extraction on the picture data in the user data by using a preset picture encoder to obtain a multi-level feature map includes: performing multi-layer convolution operation on the picture data by utilizing a multi-layer convolution layer in the preset picture encoder to obtain a multi-layer convolution picture; calculating a multi-layer linear picture of the multi-layer convolution picture according to the sharing weight and the sharing deviation of the multi-layer convolution layer; and determining a multi-level feature map of the picture data according to the multi-level linear picture.
Here, weight sharing means that the same weight coefficients are applied throughout the convolution computation of a given layer, in order to reduce the complexity of the weight parameters during training of the convolutional neural network. Likewise, bias sharing (the shared deviation) applies the same bias coefficient throughout the convolution computation of a given layer, in order to reduce the complexity of the bias parameters during training.
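Weight and bias sharing can be illustrated with a one-dimensional convolution, where every output position reuses the same kernel weights and the same bias; the kernels and input signal below are illustrative values, not taken from the patent:

```python
def conv1d(signal, kernel, bias):
    """Valid 1-D convolution: the same kernel (shared weight) and the same
    bias (shared deviation) are applied at every output position."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k)) + bias
        for i in range(len(signal) - k + 1)
    ]

# Stacking layers yields features at multiple levels, analogous to the
# multi-level feature map extracted from the picture data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
level1 = conv1d(x, kernel=[0.5, 0.5], bias=0.1)        # smoothing-like level
level2 = conv1d(level1, kernel=[1.0, -1.0], bias=0.0)  # difference-like level
```

Because the kernel is shared across positions, each layer needs only `len(kernel) + 1` parameters regardless of the input length, which is exactly the complexity reduction the passage describes.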
Further, the embodiment of the invention performs feature extraction on the text word vector by using the preset text encoder, so that the text feature word vector can provide support for generating a second relation diagram for generating fusion information. The text encoder refers to equipment for compiling and converting text data into a signal form which can be used for communication, transmission and storage.
Further, in an optional embodiment of the present invention, the feature extraction of the text word vector by using a preset text encoder to obtain a text feature word vector includes: acquiring text data corresponding to the text word vector; extracting description object information of the text data; determining a description object vector corresponding to the description object information; performing position coding on the text word vector by using the preset text encoder to obtain a position vector; constructing a text initial word vector of the text word vector according to the position vector; calculating cosine similarity between the description object vector and the text initial word vector; and when the cosine similarity is not smaller than a preset threshold, taking the text initial word vector corresponding to the cosine similarity as the text characteristic word vector.
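The steps above can be sketched as follows. The simplified sinusoidal position code, the additive combination of word vector and position vector, and the toy vectors are illustrative assumptions rather than the patent's actual encoder:

```python
import math

def position_encoding(pos, dim):
    # Simplified sinusoidal position code (Transformer-style), assumed here
    # as the position coding performed by the text encoder.
    return [
        math.sin(pos / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def select_feature_vectors(word_vectors, object_vector, threshold):
    """Build each text initial word vector (word vector + position vector),
    then keep only those whose cosine similarity to the description-object
    vector is not smaller than the preset threshold."""
    kept = []
    for pos, wv in enumerate(word_vectors):
        pe = position_encoding(pos, len(wv))
        initial = [w + p for w, p in zip(wv, pe)]
        if cosine(initial, object_vector) >= threshold:
            kept.append(initial)
    return kept

# Toy example: two word vectors, a description-object vector, threshold 0.6.
selected = select_feature_vectors([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0], 0.6)
```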
S3, constructing a first relation diagram among feature diagrams in the multi-level feature diagrams, constructing a second relation diagram among word vectors in the text feature word vectors, constructing a third relation diagram among the multi-level feature diagrams and the text feature word vectors, and constructing a final relation diagram of the user problem according to the first relation diagram, the second relation diagram and the third relation diagram.
It can be understood that, in the embodiment of the present invention, the first relationship diagram, the second relationship diagram, and the third relationship diagram are all important components for subsequently establishing the final relationship diagrams of the three relationship diagrams.
Further, in an optional embodiment of the present invention, the constructing a first relationship graph between feature graphs in the multi-level feature graph includes: carrying out vector conversion on the feature images in the multi-level feature images by using a preset image vector conversion function to obtain a same-dimensional feature vector; normalizing the same-dimensional feature vector to obtain a graph feature vector; calculating the vector similarity between any two image vectors in the image feature vectors, and determining the connectivity between any two feature images corresponding to the any two image vectors according to the vector similarity; and constructing a first relation graph among the feature graphs in the multi-level feature graph according to the connectivity.
The picture vector conversion function refers to a method for converting picture-type data into vector-type data to meet the needs of practical applications in database processing; for example, the multi-level feature map of a B-mode picture of a clinical tumor or pathological condition can be converted into corresponding multi-level feature vectors through functions such as ToTensor() and image2vector(). The vector similarity is a measure used to evaluate the similarity between vectors.
Optionally, normalizing the same-dimension feature vectors to obtain the graph feature vectors may be implemented with a normalization function. A normalization function performs the data normalization needed to make data indicators comparable by eliminating the influence of their different dimensions; examples include the arctangent normalization function, L2-norm normalization, the z-score normalization function, and the like.
Optionally, the calculating the vector similarity between any two of the map feature vectors may be implemented by a cosine similarity method.
Optionally, the determining, according to the vector similarity, the connectivity between the two feature maps corresponding to any two picture vectors may be implemented through a comparison between the vector similarity and a preset similarity threshold. For example, when the vector similarity is not less than the preset similarity threshold, the two feature maps are determined to be connected; when the vector similarity is smaller than the preset similarity threshold, the two feature maps are determined to be not connected. The preset similarity threshold is a critical value for judging the connectivity between two feature maps.
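The four steps above (vector conversion, normalization, similarity calculation, thresholded connectivity) can be sketched as follows. This is an illustrative reading only: the flattening-based vector conversion, the L2 normalization, and the default threshold value are assumptions, not the patent's actual functions.

```python
import numpy as np

def build_first_relation_graph(feature_maps, sim_threshold=0.5):
    """Hypothetical sketch of first-relation-graph construction:
    flatten each feature map to a same-dimensional vector, L2-normalize
    to obtain graph feature vectors, then connect any two maps whose
    cosine similarity reaches the preset threshold."""
    # Vector conversion: flatten each map to a common fixed length
    dim = min(m.size for m in feature_maps)
    vectors = np.stack([m.ravel()[:dim] for m in feature_maps])
    # Normalization (L2) to obtain the graph feature vectors
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    # Vector similarity between every pair of graph feature vectors
    sim = vectors @ vectors.T
    # Connectivity: connected iff similarity >= preset threshold
    adjacency = (sim >= sim_threshold).astype(int)
    np.fill_diagonal(adjacency, 0)  # no self-connections
    return adjacency
```

For instance, two identical feature maps produce an edge, while a map with opposite sign does not.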
Further, in an optional embodiment of the present invention, the implementation principle of constructing the second relationship diagram between word vectors in the text feature word vectors and constructing the third relationship diagram between the multi-level feature diagram and the text feature word vectors is similar to the principle of constructing the first relationship diagram between feature diagrams in the multi-level feature diagram, which is not described herein again.
Further, according to the embodiment of the invention, the final relation diagram of the user question can be constructed according to the first relation diagram, the second relation diagram and the third relation diagram, so that information aggregation can be continuously carried out for follow-up, and a final answer prediction result can be obtained.
Further, in an optional embodiment of the present invention, the constructing a final relationship diagram of the user problem according to the first relationship diagram, the second relationship diagram, and the third relationship diagram includes: performing node full connection on the first relationship diagram, the second relationship diagram and the third relationship diagram to obtain an initial relationship diagram; identifying any node characteristic and adjacent node characteristic in the initial relation diagram; calculating cosine similarity between the arbitrary node characteristics and the adjacent node characteristics; normalizing the cosine similarity to obtain normalized similarity; and determining the final relation diagram according to the normalized similarity and the initial relation diagram.
Wherein, the full connection means that any two nodes are connected in the graph structure. The arbitrary node feature and the adjacent node feature both comprise a graph feature vector and a text feature word vector. The cosine similarity is the similarity between two vectors evaluated by calculating the cosine value of the included angle of the two vectors.
Optionally, the identifying the arbitrary node feature and the neighboring node feature in the initial relationship graph may be implemented by identifying a node type of the arbitrary node and the neighboring node, and extracting a feature vector corresponding to the node type. The node type refers to the category to which the node belongs, and comprises text nodes and image nodes.
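The final-relation-graph construction described above (node full connection, cosine similarity between each node and its neighbors, then normalization) can be sketched as below. The min-max normalization and the use of similarity values as edge weights are assumptions for illustration; the patent does not fix a particular normalization.

```python
import numpy as np

def build_final_relation_graph(node_features):
    """Illustrative sketch: fully connect all nodes (image and text
    nodes alike), weight each edge by the cosine similarity of the
    endpoint feature vectors, and normalize the similarities."""
    feats = np.asarray(node_features, dtype=float)
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    unit = feats / np.clip(norms, 1e-12, None)
    cos = unit @ unit.T  # cosine similarity between any node and neighbors
    # Normalize similarities to [0, 1] (assumed min-max normalization)
    weights = (cos - cos.min()) / (cos.max() - cos.min() + 1e-12)
    np.fill_diagonal(weights, 0.0)  # drop self-loops of the full connection
    return weights
```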
Further, in an optional embodiment of the present invention, the calculating the cosine similarity between the arbitrary node feature and the neighboring node feature includes:
and calculating cosine similarity between the arbitrary node characteristic and the adjacent node characteristic by using the following formula:

$$\rho = \cos\theta = \frac{a \cdot b}{|a|\,|b|} = \frac{\sum_{j} a_j b_j}{\sqrt{\sum_{j} a_j^2}\,\sqrt{\sum_{j} b_j^2}}$$

wherein ρ represents the cosine similarity between the arbitrary node feature and the neighboring node feature, θ represents the included angle between the vector corresponding to the arbitrary node feature and the vector corresponding to the neighboring node feature, a represents the vector corresponding to the arbitrary node feature, b represents the vector corresponding to the neighboring node feature, |a| represents the modulus of the vector corresponding to the arbitrary node feature, |b| represents the modulus of the vector corresponding to the neighboring node feature, a_j represents the j-th component of the vector corresponding to the arbitrary node feature, and b_j represents the j-th component of the vector corresponding to the neighboring node feature.
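As a quick numerical check of the cosine-similarity computation, the following plain-Python sketch (illustrative values only) computes ρ for two parallel vectors, for which the cosine similarity is 1:

```python
import math

# Two parallel vectors: b = 2a, so the angle between them is 0
a = [1.0, 2.0, 2.0]
b = [2.0, 4.0, 4.0]

# rho = (sum_j a_j * b_j) / (|a| * |b|)
dot = sum(x * y for x, y in zip(a, b))
rho = dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
# rho is 1.0 here: parallel vectors have maximal cosine similarity
```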
And S4, performing information aggregation processing on the final relation graph by using an information aggregation function in a graph neural network in the trained intelligent question-answering model to obtain an aggregated information vector of the final relation graph.
According to the embodiment of the invention, the final relation graph is subjected to information aggregation processing by utilizing the information aggregation function in the graph neural network in the trained intelligent question-answering model, so that fusion information associating the question picture and the question text can be obtained, providing support for generating a final answer. For example, given a question picture showing a clinical fracture and the question text "where is the fracture position?", information aggregation on the final relation graph yields a final aggregate information vector that represents and locates the specific position of the fracture.
Further, in an optional embodiment of the present invention, the information aggregation processing is performed on the final relationship graph by using an information aggregation function in a graph neural network in the trained intelligent question-answering model to obtain an aggregated information vector of the final relationship graph, including: taking the characteristic vector of each node in the final relation diagram as an initial information vector, and calculating a target aggregation vector of the initial information vector by utilizing an information aggregation function in the graphic neural network; and determining the aggregation information vector according to the target aggregation vector.
Further, in an alternative embodiment of the present invention, the information aggregation function includes:

$$h_v^{k} = \sigma\!\left(W_k \cdot \frac{\sum_{u \in N(v)} h_u^{k-1}}{|N(v)|} + B_k\, h_v^{k-1}\right),\quad k = 1, 2, \ldots, m$$

wherein h_v^k represents the aggregate information vector after the k-th information aggregation of the current information aggregation node v, v represents the current information aggregation node, σ represents a nonlinear activation function, W_k represents the weight coefficient of the neighbor information vectors corresponding to the (k-1)-th information aggregation of the neighbor nodes of the current information aggregation node, u represents a neighbor node of the current information aggregation node, B_k represents the weight coefficient of the (k-1)-th aggregate information vector of the current information aggregation node v, N(v) represents the neighbor node set of node v, |N(v)| represents the number of neighbor nodes of node v, k represents the number of information aggregations, and m represents the maximum number of iterations of the information aggregation.
Optionally, the process of determining the aggregate information vector according to the target aggregate vector may be to use the target aggregate vector as the aggregate information vector of the final relationship graph when the number of iterations of calculating the target aggregate vector reaches a preset threshold.
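The aggregation loop can be sketched as below. This is a hedged, mean-aggregation reading of the formula above (in the spirit of GraphSAGE): the choice of ReLU for σ, the identity weight matrices used in the test, and all names are assumptions, not the patent's trained parameters.

```python
import numpy as np

def aggregate(features, adjacency, W, B, m=3):
    """Illustrative sketch of the information aggregation function:
    each node v repeatedly combines the mean of its neighbors'
    previous vectors (weighted by W) with its own previous vector
    (weighted by B), through a nonlinear activation sigma."""
    h = np.asarray(features, dtype=float)
    adj = np.asarray(adjacency)
    for _ in range(m):  # m: maximum number of aggregation iterations
        new_h = np.empty_like(h)
        for v in range(len(h)):
            nbrs = np.nonzero(adj[v])[0]  # N(v), the neighbor set of v
            mean_nbr = h[nbrs].mean(axis=0) if len(nbrs) else np.zeros_like(h[v])
            # sigma = ReLU here, as an assumed nonlinear activation
            new_h[v] = np.maximum(0.0, W @ mean_nbr + B @ h[v])
        h = new_h
    return h  # aggregate information vectors after m iterations
```

Stopping after m iterations matches the "preset threshold of iterations" criterion described above for taking the target aggregation vector as the final aggregate information vector.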
And S5, performing dimension reduction processing on the aggregate information vector by using a dimension reduction layer in the trained intelligent question-answer model to obtain a dimension reduction information vector.
According to the embodiment of the invention, the dimension reduction layer in the trained intelligent question-answering model is utilized to carry out dimension reduction processing on the aggregate information vector, so that a dimension reduction information vector can be obtained for determining the final question answer vector; for example, an aggregate information vector with tens of thousands of dimensions corresponding to a medical image in medical diagnosis can be reduced to a dimension reduction information vector with tens or hundreds of dimensions.
Further, in an optional embodiment of the present invention, the dimension reduction processing of the aggregate information vector by the dimension reduction layer in the trained intelligent question-answering model to obtain the dimension reduction information vector may be implemented by a pooling function of the dimension reduction layer.
The pooling function is a method for performing dimension reduction sampling on features obtained through convolution extraction, such as average pooling, maximum pooling, global average pooling, global adaptive pooling, RoI pooling, pyramid pooling, overlapping pooling, random pooling, bilinear pooling, and the like.
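A minimal sketch of pooling-based dimension reduction over a flat aggregate information vector follows. Chunk-wise average/maximum pooling is just one of the variants listed above, chosen here as an assumption; the chunking scheme and function names are illustrative.

```python
import numpy as np

def pool_reduce(vector, out_dim, mode="average"):
    """Illustrative pooling-based dimension reduction: split the long
    aggregate information vector into out_dim contiguous chunks and
    keep one statistic (mean or max) per chunk."""
    v = np.asarray(vector, dtype=float)
    chunks = np.array_split(v, out_dim)
    if mode == "average":
        return np.array([c.mean() for c in chunks])  # average pooling
    return np.array([c.max() for c in chunks])       # maximum pooling
```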
S6, analyzing the answers of the questions of the user data by utilizing the full connection layer in the trained intelligent question-answer model according to the dimension reduction information vector.
According to the embodiment of the invention, the final answer of the visual response can be obtained by analyzing the question answer of the user data by utilizing the full-connection layer in the trained intelligent question-answer model according to the dimension reduction information vector, and the accuracy of the response result is improved. The full connection layer is a neural network layer, wherein each neuron node in the intelligent question-answering model is connected with all neuron nodes of the upper layer and used for integrating the features extracted in the previous step.
Further, in an optional embodiment of the present invention, analyzing the question answer of the user data by using the full connection layer in the trained intelligent question-answering model according to the dimension reduction information vector may convert the dimension reduction information vector into a question-answering text through a preset vector text conversion function, and use the question-answering text as the question answer of the user data. For example, for a clinical medical image in the user data and the question text "where is the fracture position?", the dimension reduction information vector corresponding to the aggregate information vector is fully connected to obtain a question answer vector, which is then converted into text to obtain the final question reply text.
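A minimal sketch of the final fully connected layer follows. Mapping the dimension-reduced vector to answer-class logits and taking the argmax is an assumed reading for illustration; the vector-to-text conversion function mentioned above is left out, and the weight shapes and names are hypothetical.

```python
import numpy as np

def answer_index(dim_reduced, W_fc, b_fc):
    """Illustrative final fully connected layer: one affine map from
    the dimension reduction information vector to answer logits, then
    argmax selects the predicted answer index (to be converted to
    text by a separate vector-text conversion step)."""
    logits = W_fc @ np.asarray(dim_reduced, dtype=float) + b_fc
    return int(np.argmax(logits))
```

In practice the selected index would key into an answer vocabulary before the text conversion step.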
It can be seen that the embodiment of the present invention acquires user data, where the user data includes picture data and text data, thereby providing an operation object for the subsequent implementation of the method; performing word segmentation on the text data in the user data and vector conversion on the text word segmentation preprocesses the question text and provides a guarantee for the subsequent extraction of the feature word vectors of the question text. Secondly, the embodiment of the invention utilizes a preset picture encoder to carry out multi-level feature extraction on the picture data in the user data to obtain a multi-level feature map, which provides support for generating the first relation graph for the subsequent generation of fusion information, and utilizes a preset text encoder to carry out feature extraction on the text word vector to obtain a text feature word vector, which provides support for generating the second relation graph. The first relation graph, the second relation graph and the third relation graph are all important components for establishing the final relation graph, and constructing the final relation graph of the user question from the three relation graphs enables subsequent information aggregation so as to obtain a more accurate final answer prediction result. Further, the embodiment of the invention utilizes the information aggregation function in the graph neural network in the trained intelligent question-answering model to conduct information aggregation processing on the final relation graph; the obtained aggregate information vector fuses all associated information of the question picture and the question text and provides support for generating the final answer. The dimension reduction layer in the trained intelligent question-answering model is then utilized to conduct dimension reduction processing on the aggregate information vector, which reduces the calculation amount and complexity of the method and yields the dimension reduction information vector for determining the final answer vector of the question. Finally, according to the dimension reduction information vector, the fully connected layer in the trained intelligent question-answering model fully connects the dimension reduction information vector, which fuses all associated information of the picture data and text data in the user data, so that a more accurate answer of the visual question-answering can be obtained. Therefore, the visual intelligent question-answering method, apparatus, electronic device and storage medium can improve the accuracy of answers to user questions. In particular, in an intelligent medical question-asking scenario, patients can quickly locate disease categories, which provides convenience for further medical diagnosis, greatly relieves the confusion of patients about which department to visit before medical treatment, makes the direction of medical treatment more definite, and improves the medical treatment efficiency of patients.
As shown in FIG. 2, a functional block diagram of the visual intelligent answering device of the present invention is shown.
The visual intelligent question answering apparatus 100 of the present invention can be installed in an electronic device. Depending on the implemented functions, the visual intelligent question-answering apparatus may include a text word vector generation module 101, a feature extraction module 102, a relationship diagram construction module 103, an information aggregation module 104, an information dimension reduction module 105, and a question answer analysis module 106. A module according to the invention, which may also be referred to as a unit, refers to a series of computer program segments that are stored in the memory of the electronic device, can be executed by the processor of the electronic device, and perform a fixed function.
In the present embodiment, the functions concerning the respective modules/units are as follows:
the text word vector generation module 101 is configured to obtain user data, perform word segmentation on text data in the user data to obtain text words, and perform vector conversion on the text words to obtain text word vectors;
the feature extraction module 102 is configured to perform multi-level feature extraction on picture data in the user data by using a preset picture encoder to obtain a multi-level feature map, and perform feature extraction on the text word vector by using a preset text encoder to obtain a text feature word vector;
The relationship diagram construction module 103 is configured to construct a first relationship diagram between feature diagrams in the multi-level feature diagram, and construct a second relationship diagram between word vectors in the text feature word vector, and construct a third relationship diagram between the multi-level feature diagram and the text feature word vector, and construct a final relationship diagram of the user data according to the first relationship diagram, the second relationship diagram, and the third relationship diagram;
the information aggregation module 104 is configured to perform information aggregation processing on the final relationship graph by using an information aggregation function in the graph neural network in the trained intelligent question-answering model, so as to obtain an aggregated information vector of the final relationship graph;
the information dimension reduction module 105 is configured to perform dimension reduction processing on the aggregate information vector by using a dimension reduction layer in the trained intelligent question-answer model to obtain a dimension reduction information vector;
the question answer analysis module 106 is configured to analyze the question answer of the user data according to the dimension reduction information vector by using a full connection layer in the trained intelligent question answer model.
In detail, the modules in the visual intelligent question-answering device 100 in the embodiment of the present invention use the same technical means as the visual intelligent question-answering method described in fig. 1, and can produce the same technical effects, which are not described herein.
As shown in fig. 3, a schematic structural diagram of an electronic device 1 implementing a visual intelligent question-answering method according to the present invention is shown.
The electronic device 1 may comprise a processor 10, a memory 11, a communication bus 12 and a communication interface 13, and may further comprise a computer program, such as a visual intelligent questioning and answering program, stored in the memory 11 and executable on the processor 10.
The processor 10 may in some embodiments be formed by an integrated circuit, for example a single packaged integrated circuit, or may be formed by a plurality of integrated circuits packaged with the same or different functions, including one or more central processing units (Central Processing Unit, CPU), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and so on. The processor 10 is the control unit of the electronic device 1; it connects the respective components of the entire electronic device 1 using various interfaces and lines, executes programs or modules stored in the memory 11 (for example, executes the visual intelligent question-answering program, etc.), and invokes data stored in the memory 11 to perform various functions of the electronic device 1 and process data.
The memory 11 includes at least one type of readable storage medium including flash memory, a removable hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a removable hard disk of the electronic device 1. The memory 11 may in other embodiments also be an external storage device of the electronic device 1, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only for storing application software installed in the electronic device 1 and various types of data, such as codes of a visual intelligent question-answering program, etc., but also for temporarily storing data that has been output or is to be output.
The communication bus 12 may be a peripheral component interconnect standard (peripheral component interconnect, PCI) bus, or an extended industry standard architecture (extended industry standard architecture, EISA) bus, among others. The bus may be classified as an address bus, a data bus, a control bus, etc. The bus is arranged to enable a connection communication between the memory 11 and at least one processor 10 etc.
The communication interface 13 is used for communication between the electronic device 1 and other devices, and includes a network interface and a user interface. Optionally, the network interface may comprise a wired interface and/or a wireless interface (e.g. a WI-FI interface, a Bluetooth interface, etc.), typically used to establish a communication connection between the electronic device 1 and other electronic devices. The user interface may be a display (Display) or an input unit such as a keyboard (Keyboard); optionally, it may be a standard wired interface or a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display may also be referred to as a display screen or display unit, and is used for displaying information processed in the electronic device 1 and for displaying a visual user interface.
Fig. 3 shows only an electronic device 1 with components, it being understood by a person skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or may combine certain components, or may be arranged in different components.
For example, although not shown, the electronic device 1 may further include a power source (such as a battery) for supplying power to each component, and preferably, the power source may be logically connected to the at least one processor 10 through a power management device, so that functions of charge management, discharge management, power consumption management, and the like are implemented through the power management device. The power supply may also include one or more of any of a direct current or alternating current power supply, recharging device, power failure detection circuit, power converter or inverter, power status indicator, etc. The electronic device 1 may further include various sensors, bluetooth modules, wi-Fi modules, etc., which will not be described herein.
It should be understood that the embodiments described are for illustrative purposes only and are not limited in scope by this configuration.
The visual intelligent question-answering program stored in the memory 11 of the electronic device 1 is a combination of a plurality of computer programs, which, when run in the processor 10, can implement:
obtaining user data, performing word segmentation on text data in the user data to obtain text word segmentation, and performing vector conversion on the text word segmentation to obtain text word vectors;
Performing multi-level feature extraction on the picture data in the user data by using a preset picture encoder to obtain a multi-level feature map, and performing feature extraction on the text word vector by using a preset text encoder to obtain a text feature word vector;
constructing a first relation diagram among feature diagrams in the multi-level feature diagrams, constructing a second relation diagram among word vectors in the text feature word vectors, constructing a third relation diagram among the multi-level feature diagrams and the text feature word vectors, and constructing a final relation diagram of the user data according to the first relation diagram, the second relation diagram and the third relation diagram;
carrying out information aggregation processing on the final relation graph by using an information aggregation function in a graph neural network in the trained intelligent question-answering model to obtain an aggregation information vector of the final relation graph;
performing dimension reduction processing on the aggregate information vector by using a dimension reduction layer in the trained intelligent question-answering model to obtain a dimension reduction information vector;
and analyzing answers to the questions of the user data by utilizing a full connection layer in the trained intelligent question-answering model according to the dimension reduction information vector.
In particular, the specific implementation method of the processor 10 on the computer program may refer to the description of the relevant steps in the corresponding embodiment of fig. 1, which is not repeated herein.
Further, the integrated modules/units of the electronic device 1 may be stored in a non-volatile computer readable storage medium if implemented in the form of software functional units and sold or used as a stand alone product. The computer readable storage medium may be volatile or nonvolatile. For example, the computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM).
The present invention also provides a computer readable storage medium storing a computer program which, when executed by a processor of an electronic device 1, may implement:
obtaining user data, performing word segmentation on text data in the user data to obtain text word segmentation, and performing vector conversion on the text word segmentation to obtain text word vectors;
Performing multi-level feature extraction on the picture data in the user data by using a preset picture encoder to obtain a multi-level feature map, and performing feature extraction on the text word vector by using a preset text encoder to obtain a text feature word vector;
constructing a first relation diagram among feature diagrams in the multi-level feature diagrams, constructing a second relation diagram among word vectors in the text feature word vectors, constructing a third relation diagram among the multi-level feature diagrams and the text feature word vectors, and constructing a final relation diagram of the user data according to the first relation diagram, the second relation diagram and the third relation diagram;
carrying out information aggregation processing on the final relation graph by using an information aggregation function in a graph neural network in the trained intelligent question-answering model to obtain an aggregation information vector of the final relation graph;
performing dimension reduction processing on the aggregate information vector by using a dimension reduction layer in the trained intelligent question-answering model to obtain a dimension reduction information vector;
and analyzing answers to the questions of the user data by utilizing a full connection layer in the trained intelligent question-answering model according to the dimension reduction information vector.
In the several embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be other manners of division when actually implemented.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units can be realized in a form of hardware or a form of hardware and a form of software functional modules.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The embodiment of the invention can acquire and process the related data based on the artificial intelligence technology. Among these, artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and extend human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results.
Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units or means recited in the system claims can also be implemented by one unit or means through software or hardware. Terms such as first and second are used to denote names and do not denote any particular order.
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

Claims (10)

1. A visual intelligent question-answering method, the method comprising:
obtaining user data, performing word segmentation on text data in the user data to obtain text word segmentation, and performing vector conversion on the text word segmentation to obtain text word vectors;
performing multi-level feature extraction on the picture data in the user data by using a preset picture encoder to obtain a multi-level feature map, and performing feature extraction on the text word vector by using a preset text encoder to obtain a text feature word vector;
constructing a first relation diagram among feature diagrams in the multi-level feature diagrams, constructing a second relation diagram among word vectors in the text feature word vectors, constructing a third relation diagram among the multi-level feature diagrams and the text feature word vectors, and constructing a final relation diagram of the user data according to the first relation diagram, the second relation diagram and the third relation diagram;
carrying out information aggregation processing on the final relation graph by using an information aggregation function in a graph neural network in the trained intelligent question-answering model to obtain an aggregation information vector of the final relation graph;
performing dimension reduction processing on the aggregate information vector by using a dimension reduction layer in the trained intelligent question-answering model to obtain a dimension reduction information vector;
And analyzing answers to the questions of the user data by utilizing a full connection layer in the trained intelligent question-answering model according to the dimension reduction information vector.
2. The visual intelligent question-answering method according to claim 1, wherein the step of performing multi-level feature extraction on the picture data in the user data by using a preset picture encoder to obtain a multi-level feature map comprises:
performing multi-layer convolution operation on the picture data by utilizing a multi-layer convolution layer in the preset picture encoder to obtain a multi-layer convolution picture;
calculating a multi-layer linear picture from the multi-layer convolution picture according to the shared weights and shared biases of the multi-layer convolution layers;
and determining a multi-level feature map of the picture data according to the multi-layer linear picture.
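The multi-layer convolution of claim 2 can be sketched as follows; the single-channel input, the kernel sizes, and the ReLU activation are illustrative assumptions, since the claim fixes none of them:

```python
import numpy as np

def conv2d(image, kernel, bias):
    """Valid 2-D convolution of a single-channel image with one shared kernel and bias."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel) + bias
    return out

def multi_level_features(image, kernels, biases):
    """Pass the image through successive convolution layers, keeping every
    intermediate activation as one level of the multi-level feature map."""
    feature_maps, x = [], image
    for kernel, bias in zip(kernels, biases):
        x = np.maximum(conv2d(x, kernel, bias), 0.0)  # convolution + ReLU
        feature_maps.append(x)
    return feature_maps
```

Each level of the returned list is one feature map; a 6x6 input passed through two 3x3 layers yields a 4x4 and then a 2x2 map.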
3. The visual intelligent question-answering method according to claim 1, wherein the feature extraction of the text word vector by using a preset text encoder to obtain a text feature word vector comprises:
acquiring text data corresponding to the text word vector;
extracting description object information of the text data;
determining a description object vector corresponding to the description object information;
performing position encoding on the text word vector by using the preset text encoder to obtain a position vector;
constructing a text initial word vector of the text word vector according to the position vector;
calculating cosine similarity between the description object vector and the text initial word vector;
and when the cosine similarity is not smaller than a preset threshold, taking the text initial word vector corresponding to the cosine similarity as a text feature word vector.
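The threshold step of claim 3 reduces to filtering initial word vectors by their cosine similarity to the description-object vector; a minimal sketch (the 0.5 threshold is an assumed example value):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_word_vectors(object_vector, initial_word_vectors, threshold=0.5):
    """Keep the initial word vectors whose cosine similarity to the
    description-object vector is not smaller than the threshold."""
    return [v for v in initial_word_vectors
            if cosine_similarity(object_vector, v) >= threshold]
```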
4. The visual intelligent question-answering method according to claim 1, wherein the constructing a first relationship graph between feature graphs in the multi-level feature graph comprises:
performing vector conversion on the feature maps in the multi-level feature map by using a preset image vector conversion function to obtain same-dimensional feature vectors;
normalizing the same-dimensional feature vectors to obtain graph feature vectors;
calculating the vector similarity between any two vectors among the graph feature vectors, and determining the connectivity between the two feature maps corresponding to the two vectors according to the vector similarity;
and constructing the first relation graph among the feature maps in the multi-level feature map according to the connectivity.
5. The visual intelligent question-answering method according to claim 1, wherein the constructing a final relation diagram of the user data according to the first relation diagram, the second relation diagram and the third relation diagram comprises:
performing node full connection on the first relationship diagram, the second relationship diagram and the third relationship diagram to obtain an initial relationship diagram;
identifying any node feature and its adjacent node features in the initial relation diagram;
calculating the cosine similarity between the node feature and the adjacent node features;
normalizing the cosine similarity to obtain normalized similarity;
and determining the final relation diagram according to the normalized similarity and the initial relation diagram.
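Claim 5 leaves the normalisation scheme open; one plausible reading weights each edge by a softmax over the cosine similarities of a node's neighbours (the softmax choice is an assumption):

```python
import numpy as np

def normalised_edge_weights(node_features, adjacency):
    """For each node, cosine similarity to every adjacent node, normalised
    with a softmax over that node's neighbours so the weights sum to one."""
    feats = np.asarray(node_features, dtype=float)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    n = len(feats)
    weights = np.zeros((n, n))
    for v in range(n):
        neighbours = np.nonzero(adjacency[v])[0]
        if neighbours.size == 0:
            continue
        sims = feats[neighbours] @ feats[v]   # cosine similarities
        exp = np.exp(sims - sims.max())       # numerically stable softmax
        weights[v, neighbours] = exp / exp.sum()
    return weights
```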
6. The visual intelligent question-answering method according to claim 1, wherein the information aggregation processing is performed on the final relationship graph by using an information aggregation function in a graph neural network in the trained intelligent question-answering model to obtain an aggregated information vector of the final relationship graph, and the method comprises:
taking the feature vector of each node in the final relation diagram as an initial information vector, and calculating a target aggregation vector of the initial information vector by using the information aggregation function in the graph neural network;
and determining the aggregation information vector according to the target aggregation vector.
7. The visual intelligent question-answering method according to claim 6, wherein the information aggregation function includes:
wherein:

h_v^{(k)} = σ( W_k · ( Σ_{u ∈ N(v)} h_u^{(k-1)} ) / |N(v)| + B_k · h_v^{(k-1)} ),  k = 1, 2, …, m

where h_v^{(k)} represents the aggregation information vector of the current information aggregation node v after the k-th information aggregation, v represents the current information aggregation node, σ represents a nonlinear activation function, W_k represents the weight coefficient of the neighbor information vectors obtained by the neighbor nodes of the current information aggregation node in the (k-1)-th information aggregation, u represents a neighbor node of the current information aggregation node, B_k represents the weight coefficient of the (k-1)-th aggregation information vector of the current information aggregation node v, N(v) represents the neighbor node set of the node v, |N(v)| represents the number of neighbor nodes of the node v, k represents the number of information aggregations, and m represents the maximum number of iterations of the information aggregation.
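A minimal NumPy rendering of this aggregation function; mean pooling over neighbours follows from the |N(v)| divisor, and ReLU is an assumed stand-in for the unspecified nonlinearity σ:

```python
import numpy as np

def aggregate_information(H, adjacency, W, B, m):
    """m rounds of the claimed aggregation rule:
    h_v^(k) = sigma(W_k . mean of neighbour h_u^(k-1) + B_k . h_v^(k-1)),
    with ReLU standing in for the unspecified nonlinearity sigma."""
    H = np.asarray(H, dtype=float)
    for k in range(m):
        H_next = np.zeros_like(H)
        for v in range(len(H)):
            neighbours = np.nonzero(adjacency[v])[0]
            mean_neigh = (H[neighbours].mean(axis=0)
                          if neighbours.size else np.zeros(H.shape[1]))
            H_next[v] = np.maximum(W[k] @ mean_neigh + B[k] @ H[v], 0.0)
        H = H_next
    return H
```

With identity weights and one round, each node's new vector is simply ReLU(mean of its neighbours plus itself).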
8. A visual intelligent question-answering device, the device comprising:
the text word vector generation module is used for obtaining user data, performing word segmentation on the text data in the user data to obtain text tokens, and performing vector conversion on the text tokens to obtain a text word vector;
the feature extraction module is used for carrying out multi-level feature extraction on the picture data in the user data by utilizing a preset picture encoder to obtain a multi-level feature map, and carrying out feature extraction on the text word vector by utilizing a preset text encoder to obtain a text feature word vector;
the relation diagram construction module is used for constructing a first relation diagram among the feature maps in the multi-level feature map, constructing a second relation diagram among the word vectors in the text feature word vectors, constructing a third relation diagram between the multi-level feature map and the text feature word vectors, and constructing a final relation diagram of the user data according to the first relation diagram, the second relation diagram and the third relation diagram;
the information aggregation module is used for carrying out information aggregation processing on the final relation graph by utilizing an information aggregation function in a graph neural network in the trained intelligent question-answering model to obtain an aggregated information vector of the final relation graph;
the information dimension reduction module is used for carrying out dimension reduction processing on the aggregate information vector by utilizing a dimension reduction layer in the trained intelligent question-answering model to obtain a dimension reduction information vector;
and the question answer analysis module is used for analyzing an answer to the question of the user data by using the full connection layer in the trained intelligent question-answering model according to the dimension reduction information vector.
9. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the visual intelligent question-answering method according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the visual intelligent question-answering method according to any one of claims 1 to 7.
CN202310654872.9A 2023-06-05 2023-06-05 Visual intelligent question-answering method and device, electronic equipment and storage medium Pending CN116680380A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310654872.9A CN116680380A (en) 2023-06-05 2023-06-05 Visual intelligent question-answering method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN116680380A true CN116680380A (en) 2023-09-01

Family

ID=87786710

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310654872.9A Pending CN116680380A (en) 2023-06-05 2023-06-05 Visual intelligent question-answering method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116680380A (en)

Similar Documents

Publication Publication Date Title
CN112949786B (en) Data classification identification method, device, equipment and readable storage medium
CN113159147B (en) Image recognition method and device based on neural network and electronic equipment
CN110659723B (en) Data processing method and device based on artificial intelligence, medium and electronic equipment
CN111932547B (en) Method and device for segmenting target object in image, electronic device and storage medium
CN113705462B (en) Face recognition method, device, electronic equipment and computer readable storage medium
CN113723513A (en) Multi-label image classification method and device and related equipment
CN114612902A (en) Image semantic segmentation method, device, equipment, storage medium and program product
CN115205225A (en) Training method, device and equipment of medical image recognition model and storage medium
CN114840684A (en) Map construction method, device and equipment based on medical entity and storage medium
CN115238670A (en) Information text extraction method, device, equipment and storage medium
CN112990374B (en) Image classification method, device, electronic equipment and medium
CN114022841A (en) Personnel monitoring and identifying method and device, electronic equipment and readable storage medium
CN113762331A (en) Relational self-distillation method, apparatus and system, and storage medium
CN116705304A (en) Multi-mode task processing method, device, equipment and medium based on image text
CN116431827A (en) Information processing method, information processing device, storage medium and computer equipment
CN116383766A (en) Auxiliary diagnosis method, device, equipment and storage medium based on multi-mode data
CN113705686B (en) Image classification method, device, electronic equipment and readable storage medium
CN113850207B (en) Micro-expression classification method and device based on artificial intelligence, electronic equipment and medium
CN116644801A (en) Medical visual question-answering method and device based on transfer learning and electronic equipment
CN116486097B (en) Remote automatic feeding method and system applied to rodent feeding scene
CN114926561A (en) Observation data generation method and device, electronic equipment and storage medium
CN116486181A (en) Image classification method, device, equipment and medium for simulating radial vision
CN116525054A (en) Medical learning method, device, equipment and medium based on global and local comparison
CN114881153A (en) Disease category analysis method, device, equipment and medium based on multi-task combination

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination