CN114780701B - Automatic question-answer matching method, device, computer equipment and storage medium - Google Patents

Automatic question-answer matching method, device, computer equipment and storage medium

Info

Publication number
CN114780701B
CN114780701B (application CN202210419404.9A)
Authority
CN
China
Prior art keywords
text
vector
question
segmentation
word segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210419404.9A
Other languages
Chinese (zh)
Other versions
CN114780701A (en)
Inventor
姚海申
孙行智
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202210419404.9A priority Critical patent/CN114780701B/en
Publication of CN114780701A publication Critical patent/CN114780701A/en
Application granted granted Critical
Publication of CN114780701B publication Critical patent/CN114780701B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F 16/3329: Natural language query formulation or dialogue systems (under G06F 16/00 Information retrieval; 16/30 unstructured textual data; 16/33 querying; 16/332 query formulation)
    • G06F 18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting (under G06F 18/00 Pattern recognition; 18/20 analysing; 18/21 design or setup of recognition systems)
    • G06F 18/253: Fusion techniques of extracted features (under G06F 18/25 Fusion techniques)
    • G06F 40/284: Lexical analysis, e.g. tokenisation or collocates (under G06F 40/00 Handling natural language data; 40/20 natural language analysis; 40/279 recognition of textual entities)
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G16H 50/20: ICT specially adapted for medical diagnosis, medical simulation or medical data mining, for computer-aided diagnosis, e.g. based on medical expert systems
    • G16H 80/00: ICT specially adapted for facilitating communication between medical practitioners or patients, e.g. for collaborative diagnosis, therapy or health monitoring
    • Y02A 90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Pathology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Epidemiology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Primary Health Care (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Image Processing (AREA)

Abstract

The embodiments of this application belong to the field of artificial intelligence and are applied in the medical field. They relate to an automatic question-answer matching method that comprises: obtaining image data, a question text and a reference text, and preprocessing each of them to obtain a segmented image, a first word segmentation text and a second word segmentation text; performing position coding, segmentation coding and embedding coding on the segmented image, the first word segmentation text and the second word segmentation text to obtain a position vector, a segmentation vector and an embedding vector; and splicing the obtained vectors into a fusion characterization vector, which is input into a multi-modal model to compute a target reply text. The application also provides an automatic question-answer matching apparatus, a computer device and a storage medium. The target reply text may further be stored in a blockchain. The application realizes automatic question answering based on images and texts and improves question-answering efficiency and accuracy.

Description

Automatic question-answer matching method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to an automatic question-answer matching method, apparatus, computer device, and storage medium.
Background
In recent years, with the great increase in computing power and data volume, artificial intelligence technology has developed further, and applying artificial intelligence to problems in the medical field has become a hotspot. However, when diagnosing a patient, a doctor often needs to manually inspect and judge the patient's pictures, a process that is complicated and time-consuming and ultimately leads to low accuracy in medical inquiry.
Disclosure of Invention
The embodiment of the application aims to provide an automatic question-answer matching method, an automatic question-answer matching device, computer equipment and a storage medium, so as to solve the technical problem of low question-answer accuracy.
In order to solve the above technical problems, the embodiment of the present application provides an automatic question-answer matching method, which adopts the following technical scheme:
Acquiring image data, a question text and a reference text, and respectively preprocessing the image data, the question text and the reference text to obtain a segmented image, a first word segmentation text and a second word segmentation text;
Performing position coding on the segmented image, the first word segmentation text and the second word segmentation text to obtain a position vector, performing segmentation coding on the segmented image, the first word segmentation text and the second word segmentation text to obtain a segmentation vector, and performing embedding coding on the segmented image, the first word segmentation text and the second word segmentation text to obtain an embedding vector;
Performing vector splicing on the embedded vector, the position vector and the segmentation vector to obtain a fusion characterization vector, obtaining a trained multi-modal model, and inputting the fusion characterization vector into the multi-modal model for calculation to obtain a target reply text corresponding to the question text.
Further, the step of performing embedded coding on the segmented image, the first segmented text and the second segmented text to obtain an embedded vector includes:
Respectively calculating coding information of the first word segmentation text and the second word segmentation text to obtain a first text embedded code corresponding to the first word segmentation text and a second text embedded code corresponding to the second word segmentation text;
The vector dimension of the first text embedded code or the second text embedded code is obtained, and the segmented image is mapped to a target dimension according to the vector dimension to obtain an image embedded code;
and splicing the first text embedded code, the second text embedded code and the image embedded code to obtain the embedded vector.
Further, the step of performing segmentation encoding on the segmented image, the first segmented text and the second segmented text to obtain a segmentation vector includes:
Acquiring role marks of the segmentation image, the first word segmentation text and the second word segmentation text;
And splicing and encoding the role marks according to a preset sequence to obtain the segmentation vector.
Further, the step of obtaining the trained multi-modal model includes:
collecting a plurality of groups of medical inquiry data, and splitting the medical inquiry data to obtain a training sample;
and inputting the training sample into a basic prediction model, calculating to obtain a target loss function, training the basic prediction model according to the target loss function, and determining that the basic prediction model is trained when the target loss function is the minimum value to obtain the multi-modal model.
Further, the step of inputting the training samples into a basic prediction model to calculate a target loss function includes:
and calculating text mask loss, picture loss and prediction loss of the training sample, and generating the target loss function according to the text mask loss, the picture loss and the prediction loss.
Further, the step of collecting a plurality of sets of medical inquiry data and splitting the medical inquiry data to obtain a training sample includes:
Obtaining sample categories in the medical inquiry data, wherein the sample categories comprise picture categories, question categories and answer categories, and marking and splitting the medical inquiry data according to the picture categories, the question categories and the answer categories to obtain a plurality of sample data;
carrying out mask marking on the sample data to obtain mask prediction samples, and respectively carrying out position coding, segmentation coding and embedded coding on the mask prediction samples to obtain a first coding result, a second coding result and a third coding result;
And splicing the first coding result, the second coding result and the third coding result to obtain the training sample.
Further, the step of performing mask marking on the sample data to obtain a mask prediction sample includes:
Acquiring a target mask word, and determining whether a target picture corresponding to the target mask word exists in data corresponding to a picture category in the sample data, and whether a target text corresponding to the target mask word exists in data corresponding to a question category and an answer category in the sample data;
and when the target picture and the target text are determined to exist simultaneously, selecting the target picture or the target text to mask, and obtaining the mask prediction sample.
In order to solve the above technical problems, the embodiment of the present application further provides an automatic question-answer matching device, which adopts the following technical scheme:
The acquisition module is used for acquiring image data, a question text and a reference text, and respectively preprocessing the image data, the question text and the reference text to obtain a segmented image, a first word segmentation text and a second word segmentation text;
The encoding module is used for carrying out position encoding on the segmented image, the first word segmentation text and the second word segmentation text to obtain a position vector, carrying out segmentation encoding on the segmented image, the first word segmentation text and the second word segmentation text to obtain a segmentation vector, and carrying out embedding encoding on the segmented image, the first word segmentation text and the second word segmentation text to obtain an embedding vector;
The prediction module is used for carrying out vector splicing on the embedded vector, the position vector and the segmentation vector to obtain a fusion characterization vector, obtaining a trained multi-modal model, and inputting the fusion characterization vector to the multi-modal model for calculation to obtain a target reply text corresponding to the question text.
In order to solve the above technical problems, the embodiment of the present application further provides a computer device, which adopts the following technical schemes:
Acquiring image data, a question text and a reference text, and respectively preprocessing the image data, the question text and the reference text to obtain a segmented image, a first word segmentation text and a second word segmentation text;
Performing position coding on the segmented image, the first word segmentation text and the second word segmentation text to obtain a position vector, performing segmentation coding on the segmented image, the first word segmentation text and the second word segmentation text to obtain a segmentation vector, and performing embedding coding on the segmented image, the first word segmentation text and the second word segmentation text to obtain an embedding vector;
Performing vector splicing on the embedded vector, the position vector and the segmentation vector to obtain a fusion characterization vector, obtaining a trained multi-modal model, and inputting the fusion characterization vector into the multi-modal model for calculation to obtain a target reply text corresponding to the question text.
In order to solve the above technical problems, an embodiment of the present application further provides a computer readable storage medium, which adopts the following technical schemes:
Acquiring image data, a question text and a reference text, and respectively preprocessing the image data, the question text and the reference text to obtain a segmented image, a first word segmentation text and a second word segmentation text;
Performing position coding on the segmented image, the first word segmentation text and the second word segmentation text to obtain a position vector, performing segmentation coding on the segmented image, the first word segmentation text and the second word segmentation text to obtain a segmentation vector, and performing embedding coding on the segmented image, the first word segmentation text and the second word segmentation text to obtain an embedding vector;
Performing vector splicing on the embedded vector, the position vector and the segmentation vector to obtain a fusion characterization vector, obtaining a trained multi-modal model, and inputting the fusion characterization vector into the multi-modal model for calculation to obtain a target reply text corresponding to the question text.
According to the automatic question-answer matching method, image data, a question text and a reference text are obtained and preprocessed respectively to obtain a segmented image, a first word segmentation text and a second word segmentation text. Position coding, segmentation coding and embedding coding are performed on the segmented image, the first word segmentation text and the second word segmentation text to obtain a position vector, a segmentation vector and an embedding vector, realizing an accurate representation of the image and the texts. The embedding vector, the position vector and the segmentation vector are then spliced into a fusion characterization vector, so that the image and the texts can be fully fused. Finally, a trained multi-modal model is obtained and the fusion characterization vector is input into it for calculation, yielding a target reply text corresponding to the question text. This realizes automatic question answering based on images and texts, improves question-answering efficiency and accuracy, further improves inquiry accuracy in the medical field, and avoids wasting medical resources.
Drawings
In order to more clearly illustrate the solution of the present application, a brief description will be given below of the drawings required for the description of the embodiments of the present application, it being apparent that the drawings in the following description are some embodiments of the present application, and that other drawings may be obtained from these drawings without the exercise of inventive effort for a person of ordinary skill in the art.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow chart of one embodiment of an automatic question-answer matching method according to the present application;
FIG. 3 is a schematic diagram of an embodiment of an automatic question-answer matching apparatus according to the present application;
FIG. 4 is a schematic structural diagram of one embodiment of a computer device in accordance with the present application.
Reference numerals: an automatic question-answer matching device 300, an acquisition module 301, an encoding module 302 and a prediction module 303.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the applications herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "comprising" and "having" and any variations thereof in the description of the application and the claims and the description of the drawings above are intended to cover a non-exclusive inclusion. The terms first, second and the like in the description and in the claims or in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
In order to make the person skilled in the art better understand the solution of the present application, the technical solution of the embodiment of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that, the automatic question-answer matching method provided by the embodiment of the present application is generally executed by a server/terminal device, and accordingly, the automatic question-answer matching device is generally disposed in the server/terminal device.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow chart of one embodiment of an automatic question-answer matching method according to the present application is shown. The automatic question-answer matching method comprises the following steps:
Step S201, obtaining image data, a question text and a reference text, and respectively preprocessing the image data, the question text and the reference text to obtain a segmented image, a first word segmentation text and a second word segmentation text.
In this embodiment, the image data is an image (such as a pathological picture submitted during a user inquiry, etc.) associated with the question text submitted by the user; the questioning text is the text of voice conversion of the questioning party in the dialogue process; the reference text is a stored reference answer text, and the question text and the reference text constitute text data. And acquiring image data, a question text and a reference text, and respectively preprocessing the image data, the question text and the reference text to obtain a segmented image, a first word segmentation text and a second word segmentation text. Specifically, preprocessing of image data comprises operations such as segmentation, pixel adjustment and the like, and preprocessing of question text and reference text comprises operations such as word segmentation, error correction and the like; when image data, a question text and a reference text are obtained, preprocessing the image data to obtain a segmented image; and preprocessing the question text and the reference text to obtain a first word segmentation text and a second word segmentation text. The segmentation image is an image obtained by segmenting image data and adjusting pixels, the first word segmentation text is a text obtained by segmenting and correcting a question text, and the second word segmentation text is a text obtained by segmenting and correcting a reference text.
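As a rough illustration of the preprocessing described above (not part of the patent text; the patch size and the use of the jieba segmenter are assumptions), a minimal sketch might look like this:

```python
from typing import List

import jieba  # a common Chinese word-segmentation library (an assumed choice)
import numpy as np

def split_image(image: np.ndarray, patch: int = 32) -> List[np.ndarray]:
    """Split an H x W x C image into fixed-size patches (the 'segmented image')."""
    h, w = image.shape[:2]
    return [image[i:i + patch, j:j + patch]
            for i in range(0, h - patch + 1, patch)
            for j in range(0, w - patch + 1, patch)]

def segment_text(text: str) -> List[str]:
    """Word-segment a question or reference text; error correction would precede this step."""
    return list(jieba.cut(text))
```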
Step S202, performing position coding on the segmented image, the first word segmentation text and the second word segmentation text to obtain a position vector, performing segmentation coding on the segmented image, the first word segmentation text and the second word segmentation text to obtain a segmentation vector, and performing embedding coding on the segmented image, the first word segmentation text and the second word segmentation text to obtain an embedding vector.
In this embodiment, when the segmented image, the first word segmentation text and the second word segmentation text are obtained, position coding is performed on them to obtain the position vector, segmentation coding is performed on them to obtain the segmentation vector, and embedding coding is performed on them to obtain the embedded vector. Specifically, coding is performed according to the relative position of each segment within the image data to obtain the image position code of the segmented image; the relative positions within the question text and the reference text are coded according to the first and second word segmentation texts respectively to obtain the first text position code and the second text position code; and the image position code, the first text position code and the second text position code are spliced in sequence (that is, the text position codes are concatenated either after or before the image position code) to obtain the position vector.
In order to distinguish the image data from the text data, so that the finally generated text better matches the expected characteristics, segmentation coding is performed on the segmented image, the first word segmentation text and the second word segmentation text using segment numbers, yielding the segmentation vector. Specifically, preset segment numbers are obtained; segmented images from different sources are identified based on the segment numbers to obtain the image segmentation code, and word-segmented texts from different sources are identified based on the segment numbers to obtain the text segmentation code. The segmentation vector is obtained by concatenating the image segmentation code and the text segmentation code (that is, the text segmentation code is concatenated either after or before the image segmentation code).
For the embedding coding, vector conversion is performed on the segmented image, the first word segmentation text and the second word segmentation text respectively to obtain an image embedded code and text embedded codes, which are then concatenated in sequence to obtain the embedded vector. The splicing order of the embedded vector is the same as that of the segmentation vector and the position vector.
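The following sketch illustrates how the position and segmentation codes described above could be assembled for one spliced input; the image-first order and the 0/1/2 segment numbering are assumptions:

```python
import torch

def build_position_ids(n_img: int, n_q: int, n_ref: int) -> torch.Tensor:
    # Positions are relative within each source; sources are spliced image-first.
    return torch.cat([torch.arange(n_img),   # image patch positions
                      torch.arange(n_q),     # question token positions
                      torch.arange(n_ref)])  # reference token positions

def build_segment_ids(n_img: int, n_q: int, n_ref: int) -> torch.Tensor:
    # Segment numbers distinguish the sources: 0 = image, 1 = question, 2 = reference.
    return torch.cat([torch.full((n_img,), 0),
                      torch.full((n_q,), 1),
                      torch.full((n_ref,), 2)])
```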
Step S203, performing vector splicing on the embedded vector, the position vector and the segmentation vector to obtain a fusion characterization vector, obtaining a trained multi-modal model, and inputting the fusion characterization vector into the multi-modal model for calculation to obtain a target reply text corresponding to the question text.
In this embodiment, when the embedded vector, the position vector and the segmentation vector are obtained, vector splicing is performed on them, that is, the embedded vector, the position vector and the segmentation vector are added element-wise to obtain the fusion characterization vector. Then, a trained multi-modal model is obtained, the fusion characterization vector is input into the multi-modal model, and the target reply text is calculated based on the multi-modal model. Specifically, the multi-modal model is composed of a plurality of bidirectional Transformer models, where each Transformer model comprises a multi-head attention layer, a regularization layer and a feed-forward layer; the bidirectional Transformer models in the multi-modal model compute bidirectional attention over the input fusion characterization vector, so that the image and the texts can be fully fused. Finally, the target reply text is output by the multi-modal model.
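To make the fusion and model steps concrete, here is a hedged sketch in PyTorch: the three vectors (as batch x sequence x dimension tensors) are added element-wise and fed into a stack of bidirectional Transformer encoder layers. The layer count and sizes are illustrative, not taken from the patent:

```python
import torch
import torch.nn as nn

class MultiModalEncoder(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 12, layers: int = 6):
        super().__init__()
        # Each layer contains multi-head attention, layer normalization and a feed-forward block.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, embed_vec, pos_vec, seg_vec):
        fused = embed_vec + pos_vec + seg_vec  # "vector splicing" here means element-wise addition
        return self.encoder(fused)             # bidirectional attention over image and text jointly
```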
It is emphasized that, to further ensure the privacy and security of the target reply text, the target reply text may also be stored in a blockchain node.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, encryption algorithm and the like. The blockchain (Blockchain), essentially a de-centralized database, is a string of data blocks that are generated in association using cryptographic methods, each of which contains information from a batch of network transactions for verifying the validity (anti-counterfeit) of its information and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
The embodiment realizes automatic question and answer based on the images and the texts, improves the question and answer efficiency and accuracy, further improves the question and diagnosis accuracy in the medical field, and avoids the waste of medical resources.
In some optional implementations of this embodiment, the step of performing embedded encoding on the segmented image, the first segmented text, and the second segmented text to obtain an embedded vector includes:
Respectively calculating coding information of the first word segmentation text and the second word segmentation text to obtain a first text embedded code corresponding to the first word segmentation text and a second text embedded code corresponding to the second word segmentation text;
The vector dimension of the first text embedded code or the second text embedded code is obtained, and the segmented image is mapped to a target dimension according to the vector dimension to obtain an image embedded code;
and splicing the first text embedded code, the second text embedded code and the image embedded code to obtain the embedded vector.
In this embodiment, when the segmented image, the first word segmentation text and the second word segmentation text are embedded-coded, the coding information of the first and second word segmentation texts is calculated, where the coding information is the token embedding corresponding to each word-segmented text; the first text embedded code and the second text embedded code corresponding to the first and second word segmentation texts can be calculated via WordPiece tokenization. Then, the vector dimension of the first or second text embedded code is obtained, and the segmented image is mapped to the target dimension through a fully connected layer according to this vector dimension, yielding the image embedded code; the target dimension has the same size as the text vector dimension. For example, if the vector dimension of the first and second text embedded codes is 768, each segmented image is mapped through the fully connected layer into a vector of the same dimension, yielding a 768-dimensional image embedded code. The first text embedded code, the second text embedded code and the image embedded code are then spliced in sequence, for example the first and second text embedded codes after the image embedded code, to obtain the embedded vector.
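A minimal sketch of the dimension-matching step described above (the 768-dimension figure comes from the example; the flattened-patch input size is an assumption):

```python
import torch
import torch.nn as nn

class ImageEmbedder(nn.Module):
    def __init__(self, patch_feat_dim: int = 3 * 32 * 32, text_dim: int = 768):
        super().__init__()
        # Fully connected layer mapping each flattened image patch to the text embedding dimension.
        self.proj = nn.Linear(patch_feat_dim, text_dim)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (num_patches, patch_feat_dim) -> image embedded code: (num_patches, 768)
        return self.proj(patches)
```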
According to the embodiment, the split images, the first word segmentation text and the second word segmentation text are embedded and encoded, so that the image features are accurately extracted and fused, the accurate inquiry can be performed by combining the images provided by the patient in the inquiry process in the medical field, and the accuracy of inquiry and answering is improved.
In some optional implementations of this embodiment, the step of performing segmentation encoding on the segmented image, the first segmented text, and the second segmented text to obtain a segmentation vector includes:
Acquiring role marks of the segmentation image, the first word segmentation text and the second word segmentation text;
And splicing and encoding the role marks according to a preset sequence to obtain the segmentation vector.
In this embodiment, images and texts belong to two different role categories, and the first and second word segmentation texts can further be divided into different role categories according to questioner and responder. Role marks of the segmented image, the first word segmentation text and the second word segmentation text are acquired, one role mark per role category; for example, the segmented image, the first word segmentation text and the second word segmentation text may be marked V, S1 and S2 respectively. A preset sequence is then acquired, and the role marks are spliced and encoded according to the preset sequence to obtain the segmentation vector. The preset sequence is the order of the role marks, which can follow the order of the image or text segments; different role marks can also be spliced according to a preset order such as the dialogue order. For example, the role marks corresponding to one inquiry dialogue may be spliced as V, V, V, S1, S2, S1. The spliced role marks are encoded to obtain the segmentation vector.
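For instance, the role-mark splicing above can be reduced to a small lookup (an illustrative sketch; the V/S1/S2 labels follow the example in this paragraph):

```python
ROLE_IDS = {"V": 0, "S1": 1, "S2": 2}  # image, questioner, responder

def encode_roles(role_marks):
    """Splice role marks in the preset (dialogue) order and encode them as ids."""
    # e.g. ["V", "V", "V", "S1", "S2", "S1"] -> [0, 0, 0, 1, 2, 1]
    return [ROLE_IDS[r] for r in role_marks]
```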
According to the method, the device and the system, the images and the text roles are distinguished by performing segmentation coding on the segmented images, the first segmented text and the second segmented text, so that the finally generated target reply text is more in line with the role characteristics, and the accuracy of question and answer is further improved.
In some optional implementations of this embodiment, the step of obtaining the trained multimodal model includes:
collecting a plurality of groups of medical inquiry data, and splitting the medical inquiry data to obtain a training sample;
and inputting the training sample into a basic prediction model, calculating to obtain a target loss function, training the basic prediction model according to the target loss function, and determining that the basic prediction model is trained when the target loss function is the minimum value to obtain the multi-modal model.
In this embodiment, the medical inquiry data is acquired multiple inquiry dialogue data, and one inquiry dialogue includes image data and multiple rounds of dialogue text. Collecting a plurality of groups of medical inquiry data, and splitting the medical inquiry data to obtain a training sample; and inputting the training sample into a basic prediction model for calculation to obtain a target loss function. And finally, training the basic prediction model according to the target loss function, and determining that the basic prediction model is trained when the target loss function is the minimum value to obtain the multi-mode model. Specifically, the basic prediction model is the model with the same structure as the multi-modal model, the parameters of the basic prediction model are adjusted through the target loss function, the next calculation of the target loss function is performed based on the adjusted basic prediction model until the calculated target loss function is the minimum, namely, the adjusted basic prediction model is determined to be the multi-modal model.
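A hedged sketch of this training procedure follows; the optimizer, learning rate and the compute_target_loss method are assumptions (in practice "the minimum value" of the loss is approximated by convergence or early stopping):

```python
import torch

def train_multimodal(model, batches, epochs: int = 10, lr: float = 1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in batches:
            loss = model.compute_target_loss(batch)  # hypothetical: returns L_MLM + L_MRM + L_GP
            opt.zero_grad()
            loss.backward()
            opt.step()  # adjust the basic prediction model's parameters via the target loss
    return model    # the trained multi-modal model
```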
According to the method and the device for training the multi-modal model, the basic prediction model is trained through the target loss function, so that efficient training and adjustment of the model are achieved, and the multi-modal model obtained through training can accurately answer questions.
In some optional implementations of this embodiment, the step of inputting the training samples into the basic prediction model, and calculating the target loss function includes:
and calculating text mask loss, picture loss and prediction loss of the training sample, and generating the target loss function according to the text mask loss, the picture loss and the prediction loss.
In this embodiment, the target loss function comprises a text mask loss, a picture loss and a prediction loss. The text mask loss is the loss incurred by the text when performing mask prediction; the picture loss is the loss incurred when predicting the category of the input image (such as the corresponding disease, body part or skin manifestation); and the prediction loss is the loss between the target reply text finally generated by the model and the real reply text. The target loss function is generated based on the text mask loss, the picture loss and the prediction loss.
Specifically, the text mask loss is calculated as $L_{MLM}(\theta) = -\sum \log P_\theta(w_m \mid v, w_{\backslash m})$, where $w_m$ is the masked word vector and $w_{\backslash m}$ denotes the remaining word vectors other than the masked one. The picture loss is a cross-entropy (CE) loss over the image classes, $L_{MRM}(\theta) = CE(y, \hat{y}) = -\sum_{i=1}^{M} y_i \log \hat{y}_i$, where $y$ is the class one-hot vector, $\hat{y}$ is the class probability distribution output by the model, and $M$ is the number of classes. The prediction loss is calculated as $L_{GP}(\theta) = -\sum_i \log P_\theta(r_i \mid v, w, r_1, r_2, \ldots, r_{i-1})$, where $v$ is the image, $w$ is the text, $(r_1, r_2, \ldots, r_{i-1})$ is the reply predicted so far, and $r_i$ is the target reply token. Based on the text mask loss, the picture loss and the prediction loss, the target loss function is $L = L_{MLM}(\theta) + L_{MRM}(\theta) + L_{GP}(\theta)$.
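Under the reconstruction above, the three terms reduce to cross-entropy computations; the following sketch assumes the model exposes the relevant logits and labels (names and shapes are illustrative):

```python
import torch.nn.functional as F

def target_loss(mask_logits, mask_labels,    # (N, vocab) logits / (N,) labels for masked tokens
                img_logits, img_labels,      # (N, M) class logits / (N,) labels for the image
                reply_logits, reply_labels): # (N, T, vocab) logits / (N, T) labels for the reply
    l_mlm = F.cross_entropy(mask_logits, mask_labels)                   # text mask loss L_MLM
    l_mrm = F.cross_entropy(img_logits, img_labels)                     # picture loss L_MRM
    l_gp = F.cross_entropy(reply_logits.transpose(1, 2), reply_labels)  # prediction loss L_GP
    return l_mlm + l_mrm + l_gp                                         # L = L_MLM + L_MRM + L_GP
```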
According to the method and the device, the target loss function is generated through text mask loss, picture loss and prediction loss, so that accurate calculation of model loss is achieved, accurate training of the model through the target loss function is achieved, and accuracy of model question-answering is further improved.
In some optional implementations of this embodiment, the step of collecting a plurality of sets of medical inquiry data and splitting the medical inquiry data to obtain a training sample includes:
Obtaining sample categories in the medical inquiry data, wherein the sample categories comprise picture categories, question categories and answer categories, and marking and splitting the medical inquiry data according to the picture categories, the question categories and the answer categories to obtain a plurality of sample data;
carrying out mask marking on the sample data to obtain mask prediction samples, and respectively carrying out position coding, segmentation coding and embedded coding on the mask prediction samples to obtain a first coding result, a second coding result and a third coding result;
And splicing the first coding result, the second coding result and the third coding result to obtain the training sample.
In this embodiment, when the medical inquiry data are split, the sample categories in the medical inquiry data are acquired; the sample categories may be divided into a picture category, a question category and an answer category, the question and answer categories being collectively referred to as role categories. The medical inquiry data are split according to the picture, question and answer categories to obtain a plurality of sample data. Specifically, the medical inquiry data comprise multiple inquiry dialogues, and one inquiry dialogue contains image data and multiple rounds of dialogue text. When the sample categories of the inquiry dialogue data are obtained, the data within the same inquiry dialogue are marked according to the picture, question and answer categories, and the data of the same inquiry dialogue are then split based on the marks. The image data of the picture category, a question text of the question category and the next answer text of the answer category corresponding to that question form the first round of dialogue data; the image data, the question text, the corresponding next answer text and the next question text corresponding to that answer form the second round of dialogue data, and so on: a new round of dialogue data is constructed each time the role category switches. In this way, one inquiry dialogue is split, and each round of dialogue data is one of the split sample data.
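A sketch of this splitting rule (the field names and the turn representation are assumptions):

```python
def split_dialogue(image, turns):
    """turns: list of (role, text) pairs with role in {"question", "answer"}.
    A new sample is closed each time the role category switches."""
    samples, history, prev_role = [], [], None
    for role, text in turns:
        history.append((role, text))
        if prev_role is not None and role != prev_role:
            # each role switch yields one round: the image plus the dialogue so far
            samples.append({"image": image, "dialogue": list(history)})
        prev_role = role
    return samples
```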
When sample data is obtained, carrying out mask marking on the sample data, namely masking the sample data through a mask to obtain a mask prediction sample; and then, respectively performing position coding, segmentation coding and embedded coding on the mask prediction samples to obtain a first coding result, a second coding result and a third coding result. Vector splicing (i.e., vector summation) is performed on the first encoding result, the second encoding result and the third encoding result, so as to obtain training samples. In addition, the split sample data may be directly encoded, and the encoded sample data may be used as the training sample.
According to the embodiment, the training sample is obtained by splitting the medical inquiry data and marking and encoding the mask, so that the model can be fully trained through the training sample, and the model training efficiency is further improved.
In some optional implementations of this embodiment, the step of performing mask marking on the sample data to obtain a mask prediction sample includes:
Acquiring a target mask word, and determining whether a target picture corresponding to the target mask word exists in data corresponding to a picture category in the sample data, and whether a target text corresponding to the target mask word exists in data corresponding to a question category and an answer category in the sample data;
and when the target picture and the target text are determined to exist simultaneously, selecting the target picture or the target text to mask, and obtaining the mask prediction sample.
In this embodiment, in order not to lose image and text information, when mask marking is performed on sample data, a target mask word (such as an apple) is obtained, and whether a target picture corresponding to the target mask word exists in data corresponding to a picture category in the sample data or not and whether a target text corresponding to the target mask word exists in data corresponding to a question category and an answer category in the sample data or not are determined; and when the target picture and the target text exist at the same time, selecting the target picture or the target text to mask, and obtaining a mask prediction sample. Thus, information corresponding to the same target mask word in the picture and the text is prevented from being masked at the same time.
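A sketch of this selection rule; the random choice between masking the picture or the text is one reasonable reading, and the two helpers are simple placeholders:

```python
import random

def mask_picture(sample, word):
    sample["masked_picture"] = word   # placeholder: blank out the region matching the word
    return sample

def mask_text(sample, word):
    sample["masked_text"] = word      # placeholder: replace the word occurrence with [MASK]
    return sample

def mask_sample(sample, word, has_picture, has_text):
    """Mask either the picture or the text for a target mask word, never both at once."""
    if has_picture and has_text:
        return mask_picture(sample, word) if random.random() < 0.5 else mask_text(sample, word)
    if has_text:
        return mask_text(sample, word)
    if has_picture:
        return mask_picture(sample, word)
    return sample
```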
According to the method and the device, the image and the text are independently masked, so that information loss caused by masking is avoided, the relation between the image and the text can be fully learned by the model, and the relevance of the target reply text and the image is further improved.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by computer readable instructions stored in a computer readable storage medium which, when executed, may comprise the steps of the embodiments of the methods described above. The storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk or a read-only memory (ROM), or may be a random access memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.
With further reference to fig. 3, as an implementation of the method shown in fig. 2, the present application provides an embodiment of an automatic question-answer matching apparatus, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 3, the automatic question-answer matching apparatus 300 according to this embodiment includes: an acquisition module 301, a coding module 302 and a prediction module 303. Wherein:
The obtaining module 301 is configured to obtain image data, a question text and a reference text, and respectively pre-process the image data, the question text and the reference text to obtain a segmented image, a first word segmentation text and a second word segmentation text;
In this embodiment, the image data is an image (such as a pathological picture submitted during a user inquiry, etc.) associated with the question text submitted by the user; the questioning text is the text of voice conversion of the questioning party in the dialogue process; the reference text is a stored reference answer text, and the question text and the reference text constitute text data. And acquiring image data, a question text and a reference text, and respectively preprocessing the image data, the question text and the reference text to obtain a segmented image, a first word segmentation text and a second word segmentation text. Specifically, preprocessing of image data comprises operations such as segmentation, pixel adjustment and the like, and preprocessing of question text and reference text comprises operations such as word segmentation, error correction and the like; when image data, a question text and a reference text are obtained, preprocessing the image data to obtain a segmented image; and preprocessing the question text and the reference text to obtain a first word segmentation text and a second word segmentation text. The segmentation image is an image obtained by segmenting image data and adjusting pixels, the first word segmentation text is a text obtained by segmenting and correcting a question text, and the second word segmentation text is a text obtained by segmenting and correcting a reference text.
The encoding module 302 is configured to perform position encoding on the segmented image, the first word segmentation text, and the second word segmentation text to obtain a position vector, perform segmentation encoding on the segmented image, the first word segmentation text, and the second word segmentation text to obtain a segmented vector, and perform embedding encoding on the segmented image, the first word segmentation text, and the second word segmentation text to obtain an embedded vector;
in some alternative implementations of the present embodiment, the encoding module 302 includes:
The first coding unit is used for respectively calculating coding information of the first word segmentation text and the second word segmentation text to obtain a first text embedded code corresponding to the first word segmentation text and a second text embedded code corresponding to the second word segmentation text;
the mapping unit is used for obtaining the vector dimension of the first text embedded code or the second text embedded code, and mapping the segmented image to a target dimension according to the vector dimension to obtain an image embedded code;
and the first splicing unit is used for splicing the first text embedded code, the second text embedded code and the image embedded code to obtain the embedded vector.
In some alternative implementations of the present embodiment, the encoding module 302 further includes:
An acquisition unit configured to acquire character marks of the divided image, the first word segmentation text, and the second word segmentation text;
And the second coding unit is used for splicing and coding the role marks according to a preset sequence to obtain the segmentation vector.
In this embodiment, when the segmented image, the first word segmentation text and the second word segmentation text are obtained, position coding is performed on them to obtain the position vector, segmentation coding is performed on them to obtain the segmentation vector, and embedding coding is performed on them to obtain the embedded vector. Specifically, coding is performed according to the relative position of each segment within the image data to obtain the image position code of the segmented image; the relative positions within the question text and the reference text are coded according to the first and second word segmentation texts respectively to obtain the first text position code and the second text position code; and the image position code, the first text position code and the second text position code are spliced in sequence (that is, the text position codes are concatenated either after or before the image position code) to obtain the position vector.
In order to distinguish the image data from the text data, so that the finally generated text better matches the expected characteristics, segmentation coding is performed on the segmented image, the first word segmentation text and the second word segmentation text using segment numbers, yielding the segmentation vector. Specifically, preset segment numbers are obtained; segmented images from different sources are identified based on the segment numbers to obtain the image segmentation code, and word-segmented texts from different sources are identified based on the segment numbers to obtain the text segmentation code. The segmentation vector is obtained by concatenating the image segmentation code and the text segmentation code (that is, the text segmentation code is concatenated either after or before the image segmentation code).
For the embedding coding, vector conversion is performed on the segmented image, the first word segmentation text and the second word segmentation text respectively to obtain an image embedded code and text embedded codes, which are then concatenated in sequence to obtain the embedded vector. The splicing order of the embedded vector is the same as that of the segmentation vector and the position vector.
And the prediction module 303 is configured to perform vector stitching on the embedded vector, the position vector and the segmentation vector to obtain a fusion characterization vector, obtain a trained multimodal model, and input the fusion characterization vector to the multimodal model for calculation to obtain a target reply text corresponding to the question text.
In some alternative implementations of the present embodiment, the prediction module 303 includes:
the acquisition unit is used for acquiring a plurality of groups of medical inquiry data and splitting the medical inquiry data to obtain training samples;
The training unit is used for inputting the training sample into a basic prediction model, calculating to obtain a target loss function, training the basic prediction model according to the target loss function, and determining that the basic prediction model training is completed when the target loss function is the minimum value to obtain the multi-modal model.
In some optional implementations of this embodiment, the training unit includes:
And the generating unit is used for calculating text mask loss, picture loss and prediction loss of the training sample and generating the target loss function according to the text mask loss, the picture loss and the prediction loss.
In some optional implementations of this embodiment, the acquisition unit includes:
The splitting unit is used for acquiring sample categories in the medical inquiry data, wherein the sample categories comprise picture categories, question categories and answer categories, and marking and splitting the medical inquiry data according to the picture categories, the question categories and the answer categories to obtain a plurality of sample data;
the third coding unit is used for carrying out mask marking on the sample data to obtain mask prediction samples, and respectively carrying out position coding, segmentation coding and embedded coding on the mask prediction samples to obtain a first coding result, a second coding result and a third coding result;
And the second splicing unit is used for splicing the first coding result, the second coding result and the third coding result to obtain the training sample.
In some optional implementations of this embodiment, the third encoding unit includes:
The confirming unit is used for acquiring target mask words and determining whether target pictures corresponding to the target mask words exist in data corresponding to picture categories in the sample data and whether target texts corresponding to the target mask words exist in data corresponding to question categories and answer categories in the sample data;
And the selecting unit is used for selecting the target picture or the target text to mask when the target picture and the target text are determined to exist simultaneously, so as to obtain the mask prediction sample.
In this embodiment, when the embedded vector, the position vector and the segmentation vector are obtained, vector splicing is performed on them, that is, the embedded vector, the position vector and the segmentation vector are added element-wise to obtain the fusion characterization vector. Then, a trained multi-modal model is obtained, the fusion characterization vector is input into the multi-modal model, and the target reply text is calculated based on the multi-modal model. Specifically, the multi-modal model is composed of a plurality of bidirectional Transformer models, where each Transformer model comprises a multi-head attention layer, a regularization layer and a feed-forward layer; the bidirectional Transformer models in the multi-modal model compute bidirectional attention over the input fusion characterization vector, so that the image and the texts can be fully fused. Finally, the target reply text is output by the multi-modal model.
It is emphasized that, to further ensure the privacy and security of the target reply text, the target reply text may also be stored in a blockchain node.
Blockchain is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association using cryptographic methods, each block containing a batch of network transaction information used to verify the validity (anti-counterfeiting) of its information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
The automatic question-answer matching device provided by this embodiment realizes automatic question answering based on images and texts, improves question-answering efficiency and accuracy, further improves the accuracy of inquiry and diagnosis in the medical field, and avoids wasting medical resources.
In order to solve the technical problems, the embodiment of the application also provides computer equipment. Referring specifically to fig. 4, fig. 4 is a basic structural block diagram of a computer device according to the present embodiment.
The computer device 6 comprises a memory 61, a processor 62 and a network interface 63 that are communicatively connected to each other via a system bus. Note that only a computer device 6 having components 61-63 is shown in the figure, but it should be understood that not all of the illustrated components must be implemented, and more or fewer components may be implemented instead. As will be appreciated by those skilled in the art, the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device may be a desktop computer, a notebook computer, a palmtop computer, a cloud server or other computing device. The computer device can perform human-computer interaction with a user through a keyboard, a mouse, a remote controller, a touch pad, a voice control device or the like.
The memory 61 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), Random Access Memory (RAM), Static Random Access Memory (SRAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Programmable Read-Only Memory (PROM), magnetic memory, magnetic disk, optical disk, and the like. In some embodiments, the memory 61 may be an internal storage unit of the computer device 6, such as a hard disk or memory of the computer device 6. In other embodiments, the memory 61 may also be an external storage device of the computer device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card provided on the computer device 6. Of course, the memory 61 may also comprise both an internal storage unit of the computer device 6 and an external storage device. In this embodiment, the memory 61 is typically used to store the operating system and various application software installed on the computer device 6, such as the computer readable instructions of the automatic question-answer matching method. Further, the memory 61 may be used to temporarily store various types of data that have been output or are to be output.
The processor 62 may, in some embodiments, be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 62 is typically used to control the overall operation of the computer device 6. In this embodiment, the processor 62 is configured to execute the computer readable instructions stored in the memory 61 or to process data, for example to execute the computer readable instructions of the automatic question-answer matching method.
The network interface 63 may comprise a wireless network interface or a wired network interface, and is typically used to establish communication connections between the computer device 6 and other electronic devices.
The computer device provided by this embodiment realizes automatic question answering based on images and texts, improves question-answering efficiency and accuracy, further improves the accuracy of inquiry and diagnosis in the medical field, and avoids wasting medical resources.
The present application also provides another embodiment, namely, a computer-readable storage medium storing computer-readable instructions executable by at least one processor to cause the at least one processor to perform the steps of the automatic question-answer matching method as described above.
The computer readable storage medium provided by this embodiment realizes automatic question answering based on images and texts, improves question-answering efficiency and accuracy, further improves question-answering accuracy in the medical field, and avoids wasting medical resources.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or by hardware alone, although in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present application, or the part of it that contributes over the prior art, may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device or the like) to perform the method according to the embodiments of the present application.
The above-described embodiments are clearly only some of the embodiments of the present application, not all of them; the preferred embodiments are shown in the drawings, which do not limit the scope of the claims. The application may be embodied in many different forms; these embodiments are provided so that the disclosure will be thorough and complete. Although the application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their features. Any equivalent structure made using the content of the specification and drawings of the application, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of protection of the application.

Claims (7)

1. An automatic question-answer matching method is characterized by comprising the following steps:
Acquiring image data, a question text and a reference text, and respectively preprocessing the image data, the question text and the reference text to obtain a segmented image, a first word segmentation text and a second word segmentation text;
Performing position coding on the segmented image, the first word segmentation text and the second word segmentation text to obtain a position vector, performing segmentation coding on the segmented image, the first word segmentation text and the second word segmentation text to obtain a segmentation vector, and performing embedding coding on the segmented image, the first word segmentation text and the second word segmentation text to obtain an embedding vector;
Vector splicing is carried out on the embedded vector, the position vector and the segmentation vector to obtain a fusion characterization vector, a trained multi-modal model is obtained, and the fusion characterization vector is input to the multi-modal model for calculation to obtain a target reply text corresponding to the question text;
The step of performing embedded coding on the segmented image, the first segmented text and the second segmented text to obtain an embedded vector comprises the following steps:
Respectively calculating coding information of the first word segmentation text and the second word segmentation text to obtain a first text embedded code corresponding to the first word segmentation text and a second text embedded code corresponding to the second word segmentation text;
The vector dimension of the first text embedded code or the second text embedded code is obtained, and the segmented image is mapped to a target dimension according to the vector dimension to obtain an image embedded code;
Splicing the first text embedded code, the second text embedded code and the image embedded code to obtain the embedded vector;
the step of obtaining the trained multi-modal model comprises the following steps:
collecting a plurality of groups of medical inquiry data, and splitting the medical inquiry data to obtain a training sample;
Inputting the training sample into a basic prediction model, calculating to obtain a target loss function, training the basic prediction model according to the target loss function, and determining that the basic prediction model is trained when the target loss function is the minimum value to obtain the multi-modal model;
The step of inputting the training sample into a basic prediction model and calculating to obtain a target loss function comprises the following steps:
calculating text mask loss, picture loss and prediction loss of the training sample, and generating the target loss function according to the text mask loss, the picture loss and the prediction loss;
The calculation formula of the text mask loss is: L_MLM(θ) = -∑ log P_θ(w_m | v, w_\m), where w_m is the masked word vector and w_\m is the remaining word vectors other than the masked word vector; the calculation formula of the picture loss is: L_MRM(θ) = CE(y, ŷ) = -∑_{m=1}^{M} y_m log ŷ_m, where CE is the cross-entropy loss, y is the category one-hot vector, ŷ is the class probability distribution output by the model, and M is the number of classes; the calculation formula of the prediction loss is: L_GP(θ) = -∑_i log P_θ(r_i | v, w, r_1, r_2, …, r_{i-1}), where v is the image, w is the text, r_1, r_2, …, r_{i-1} is the already-predicted reply vector and r_i is the target predicted reply vector; and the calculation formula of the target loss function generated based on the text mask loss, the picture loss and the prediction loss is: L = L_MLM(θ) + L_MRM(θ) + L_GP(θ).
2. The automatic question-answering matching method according to claim 1, wherein the step of performing segmentation encoding on the segmented image, the first segmented text and the second segmented text to obtain segmentation vectors includes:
Acquiring role marks of the segmentation image, the first word segmentation text and the second word segmentation text;
And splicing and encoding the role marks according to a preset sequence to obtain the segmentation vector.
3. The automatic question-answer matching method according to claim 1, wherein the step of collecting a plurality of sets of medical question data and splitting the medical question data to obtain training samples comprises:
Obtaining sample categories in the medical inquiry data, wherein the sample categories comprise picture categories, question categories and answer categories, and marking and splitting the medical inquiry data according to the picture categories, the question categories and the answer categories to obtain a plurality of sample data;
carrying out mask marking on the sample data to obtain mask prediction samples, and respectively carrying out position coding, segmentation coding and embedded coding on the mask prediction samples to obtain a first coding result, a second coding result and a third coding result;
And splicing the first coding result, the second coding result and the third coding result to obtain the training sample.
4. The automatic question-answer matching method according to claim 3, wherein said step of masking said sample data to obtain masked predicted samples comprises:
Acquiring a target mask word, and determining whether a target picture corresponding to the target mask word exists in data corresponding to a picture category in the sample data, and whether a target text corresponding to the target mask word exists in data corresponding to a question category and an answer category in the sample data;
and when the target picture and the target text are determined to exist simultaneously, selecting the target picture or the target text to mask, and obtaining the mask prediction sample.
5. An automatic question-answer matching device, characterized in that it performs the steps of the automatic question-answer matching method according to any one of claims 1 to 4, comprising:
The acquisition module is used for acquiring image data, a question text and a reference text, and respectively preprocessing the image data, the question text and the reference text to obtain a segmented image, a first word segmentation text and a second word segmentation text;
The encoding module is used for carrying out position encoding on the segmented image, the first word segmentation text and the second word segmentation text to obtain a position vector, carrying out segmentation encoding on the segmented image, the first word segmentation text and the second word segmentation text to obtain a segmentation vector, and carrying out embedding encoding on the segmented image, the first word segmentation text and the second word segmentation text to obtain an embedding vector;
The prediction module is used for carrying out vector splicing on the embedded vector, the position vector and the segmentation vector to obtain a fusion characterization vector, obtaining a trained multi-modal model, and inputting the fusion characterization vector to the multi-modal model for calculation to obtain a target reply text corresponding to the question text.
6. A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, implement the steps of the automatic question-answer matching method of any one of claims 1 to 4.
7. A computer readable storage medium having stored thereon computer readable instructions which, when executed by a processor, implement the steps of the automatic question-answer matching method of any one of claims 1 to 4.
CN202210419404.9A 2022-04-20 2022-04-20 Automatic question-answer matching method, device, computer equipment and storage medium Active CN114780701B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210419404.9A CN114780701B (en) 2022-04-20 2022-04-20 Automatic question-answer matching method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210419404.9A CN114780701B (en) 2022-04-20 2022-04-20 Automatic question-answer matching method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114780701A CN114780701A (en) 2022-07-22
CN114780701B true CN114780701B (en) 2024-07-02

Family

ID=82431867

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210419404.9A Active CN114780701B (en) 2022-04-20 2022-04-20 Automatic question-answer matching method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114780701B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116630386B (en) * 2023-06-12 2024-02-20 新疆生产建设兵团医院 CTA scanning image processing method and system thereof
CN117216312B (en) * 2023-11-06 2024-01-26 长沙探月科技有限公司 Method and device for generating questioning material, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113688813A (en) * 2021-10-27 2021-11-23 长沙理工大学 Multi-scale feature fusion remote sensing image segmentation method, device, equipment and storage
CN114201592A (en) * 2021-12-02 2022-03-18 重庆邮电大学 Visual question-answering method for medical image diagnosis

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10977530B2 (en) * 2019-01-03 2021-04-13 Beijing Jingdong Shangke Information Technology Co., Ltd. ThunderNet: a turbo unified network for real-time semantic segmentation
CN114238587A (en) * 2021-12-30 2022-03-25 中科讯飞互联(北京)信息科技有限公司 Reading understanding method and device, storage medium and computer equipment

Also Published As

Publication number Publication date
CN114780701A (en) 2022-07-22

Similar Documents

Publication Publication Date Title
CN114780701B (en) Automatic question-answer matching method, device, computer equipment and storage medium
CN112287069B (en) Information retrieval method and device based on voice semantics and computer equipment
CN112988963B (en) User intention prediction method, device, equipment and medium based on multi-flow nodes
CN112395390B (en) Training corpus generation method of intention recognition model and related equipment thereof
CN113627395B (en) Text recognition method, device, medium and electronic equipment
CN112949320B (en) Sequence labeling method, device, equipment and medium based on conditional random field
CN112836521A (en) Question-answer matching method and device, computer equipment and storage medium
CN114610855B (en) Dialogue reply generation method and device, electronic equipment and storage medium
CN113947095A (en) Multilingual text translation method and device, computer equipment and storage medium
CN114358023B (en) Intelligent question-answer recall method, intelligent question-answer recall device, computer equipment and storage medium
CN114420107A (en) Speech recognition method based on non-autoregressive model and related equipment
CN115757725A (en) Question and answer processing method and device, computer equipment and storage medium
CN112434746B (en) Pre-labeling method based on hierarchical migration learning and related equipment thereof
CN117112829B (en) Medical data cross-modal retrieval method and device and related equipment
CN112507141B (en) Investigation task generation method, investigation task generation device, computer equipment and storage medium
CN117932009A (en) ChatGLM model-based insurance customer service dialogue generation method, chatGLM model-based insurance customer service dialogue generation device, chatGLM model-based insurance customer service dialogue generation equipment and ChatGLM model-based insurance customer service dialogue generation medium
CN114627170B (en) Three-dimensional point cloud registration method, three-dimensional point cloud registration device, computer equipment and storage medium
CN116450797A (en) Emotion classification method, device, equipment and medium based on multi-modal dialogue
CN113515931B (en) Text error correction method, device, computer equipment and storage medium
CN115346095A (en) Visual question answering method, device, equipment and storage medium
CN114913995A (en) Self-service diagnosis method, device, equipment and storage medium
CN114548114A (en) Text emotion recognition method, device, equipment and storage medium
CN113657104A (en) Text extraction method and device, computer equipment and storage medium
CN113327691A (en) Query method and device based on language model, computer equipment and storage medium
CN112396111A (en) Text intention classification method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant