CN114780701A - Automatic question-answer matching method, device, computer equipment and storage medium - Google Patents

Automatic question-answer matching method, device, computer equipment and storage medium Download PDF

Info

Publication number
CN114780701A
Authority
CN
China
Prior art keywords
text
segmentation
vector
question
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210419404.9A
Other languages
Chinese (zh)
Inventor
姚海申
孙行智
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202210419404.9A
Publication of CN114780701A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H80/00 ICT specially adapted for facilitating communication between medical practitioners or patients, e.g. for collaborative diagnosis, therapy or health monitoring

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Pathology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Epidemiology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Primary Health Care (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Image Processing (AREA)

Abstract

The embodiment of the application belongs to the field of artificial intelligence and is applied to the medical field, and relates to an automatic question-answer matching method. The method comprises: acquiring image data, a question text and a reference text, and preprocessing them respectively to obtain a segmented image, a first segmentation text and a second segmentation text; performing position coding, segmentation coding and embedded coding on the segmented image, the first segmentation text and the second segmentation text to obtain a position vector, a segmentation vector and an embedded vector; and splicing the obtained vectors into a fusion characterization vector, inputting the fusion characterization vector into a multi-modal model for calculation, and obtaining a target reply text. The application also provides an automatic question-answer matching device, a computer device and a storage medium. In addition, the target reply text may be stored in a blockchain. The method and the device realize automatic question answering based on images and texts, and improve question-answering efficiency and accuracy.

Description

Automatic question-answer matching method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to an automatic question and answer matching method and apparatus, a computer device, and a storage medium.
Background
In recent years, with substantial growth in computing power and data volume, artificial intelligence technology has developed further, and applying artificial intelligence to problems in the medical field has become a research hotspot. However, when diagnosing a patient, a doctor often needs to manually examine the patient's pictures to make a judgment. This process is complex and time-consuming, and ultimately leads to low accuracy in medical inquiry.
Disclosure of Invention
The embodiment of the application aims to provide an automatic question-answer matching method, an automatic question-answer matching device, computer equipment and a storage medium, so as to solve the technical problem of low inquiry accuracy.
In order to solve the above technical problem, an embodiment of the present application provides an automatic question-answer matching method, which adopts the following technical solutions:
acquiring image data, a question text and a reference text, and preprocessing the image data, the question text and the reference text respectively to obtain a segmented image, a first segmentation text and a second segmentation text;
performing position coding on the segmented image, the first segmentation text and the second segmentation text to obtain a position vector, performing segmentation coding on the segmented image, the first segmentation text and the second segmentation text to obtain a segmentation vector, and performing embedded coding on the segmented image, the first segmentation text and the second segmentation text to obtain an embedded vector;
and performing vector splicing on the embedded vector, the position vector and the segmentation vector to obtain a fusion characterization vector, acquiring a trained multi-modal model, and inputting the fusion characterization vector into the multi-modal model for calculation to obtain a target reply text corresponding to the question text.
Further, the step of performing embedded coding on the segmented image, the first segmentation text and the second segmentation text to obtain an embedded vector includes:
respectively calculating the coding information of the first segmentation text and the second segmentation text to obtain a first text embedded code corresponding to the first segmentation text and a second text embedded code corresponding to the second segmentation text;
obtaining the vector dimension of the first text embedded code or the second text embedded code, and mapping the segmented image to a target dimension according to the vector dimension to obtain an image embedded code;
and splicing the first text embedded code, the second text embedded code and the image embedded code to obtain the embedded vector.
Further, the step of performing segmentation coding on the segmentation image, the first segmentation text and the second segmentation text to obtain a segmentation vector includes:
acquiring role marks of the segmentation image, the first segmentation text and the second segmentation text;
and splicing and coding the role marks according to a preset sequence to obtain the segmentation vectors.
Further, the step of obtaining the trained multi-modal model comprises:
collecting a plurality of groups of medical inquiry data, and splitting the medical inquiry data to obtain a training sample;
inputting the training samples into a basic prediction model, calculating to obtain a target loss function, training the basic prediction model according to the target loss function, and determining that the basic prediction model is trained to obtain the multi-modal model when the target loss function is the minimum value.
Further, the step of inputting the training samples into a basic prediction model and calculating to obtain a target loss function includes:
and calculating the text mask loss, the picture loss and the prediction loss of the training sample, and generating the target loss function according to the text mask loss, the picture loss and the prediction loss.
Further, the step of acquiring a plurality of groups of medical inquiry data, and splitting the medical inquiry data to obtain a training sample comprises:
obtaining sample types in the medical inquiry data, wherein the sample types comprise a picture type, a question type and an answer type, and marking and splitting the medical inquiry data according to the picture type, the question type and the answer type to obtain a plurality of sample data;
mask marking is carried out on the sample data to obtain a mask prediction sample, and position coding, segmentation coding and embedded coding are respectively carried out on the mask prediction sample to obtain a first coding result, a second coding result and a third coding result;
and splicing the first coding result, the second coding result and the third coding result to obtain the training sample.
Further, the step of performing mask marking on the sample data to obtain a mask prediction sample includes:
acquiring a target mask word, and determining whether a target picture corresponding to the target mask word exists in data corresponding to a picture category in the sample data, and whether a target text corresponding to the target mask word exists in data corresponding to a question category and an answer category in the sample data;
and when the target picture and the target text are determined to exist at the same time, selecting the target picture or the target text for masking to obtain the mask prediction sample.
In order to solve the above technical problem, an embodiment of the present application further provides an automatic question-answer matching device, which adopts the following technical solutions:
the system comprises an acquisition module, a query module and a query module, wherein the acquisition module is used for acquiring image data, a question text and a reference text, and respectively preprocessing the image data, the question text and the reference text to obtain a segmentation image, a first segmentation text and a second segmentation text;
the encoding module is used for performing position coding on the segmented image, the first segmentation text and the second segmentation text to obtain a position vector, performing segmentation coding on the segmented image, the first segmentation text and the second segmentation text to obtain a segmentation vector, and performing embedded coding on the segmented image, the first segmentation text and the second segmentation text to obtain an embedded vector;
and the prediction module is used for carrying out vector splicing on the embedded vector, the position vector and the segmentation vector to obtain a fusion characterization vector, obtaining a trained multi-modal model, inputting the fusion characterization vector to the multi-modal model for calculation, and obtaining a target reply text corresponding to the question text.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which adopts the following technical solutions:
acquiring image data, a question text and a reference text, and respectively preprocessing the image data, the question text and the reference text to obtain a segmentation image, a first segmentation text and a second segmentation text;
performing position coding on the segmented image, the first segmentation text and the second segmentation text to obtain a position vector, performing segmentation coding on the segmented image, the first segmentation text and the second segmentation text to obtain a segmentation vector, and performing embedded coding on the segmented image, the first segmentation text and the second segmentation text to obtain an embedded vector;
and carrying out vector splicing on the embedded vector, the position vector and the segmentation vector to obtain a fusion characterization vector, obtaining a trained multi-modal model, inputting the fusion characterization vector to the multi-modal model for calculation, and obtaining a target reply text corresponding to the question text.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, which adopts the following technical solutions:
acquiring image data, a question text and a reference text, and respectively preprocessing the image data, the question text and the reference text to obtain a segmentation image, a first segmentation text and a second segmentation text;
performing position coding on the segmented image, the first segmentation text and the second segmentation text to obtain a position vector, performing segmentation coding on the segmented image, the first segmentation text and the second segmentation text to obtain a segmentation vector, and performing embedded coding on the segmented image, the first segmentation text and the second segmentation text to obtain an embedded vector;
and carrying out vector splicing on the embedded vector, the position vector and the segmentation vector to obtain a fusion characterization vector, obtaining a trained multi-modal model, inputting the fusion characterization vector to the multi-modal model for calculation, and obtaining a target reply text corresponding to the question text.
According to the automatic question-answer matching method, image data, a question text and a reference text are acquired and preprocessed respectively to obtain a segmented image, a first segmentation text and a second segmentation text. Position coding, segmentation coding and embedded coding are performed on the segmented image, the first segmentation text and the second segmentation text to obtain a position vector, a segmentation vector and an embedded vector, so that the image and the text are represented accurately. The embedded vector, the position vector and the segmentation vector are then spliced into a fusion characterization vector, so that the image and the text can be fully fused. A trained multi-modal model is acquired, and the fusion characterization vector is input into the multi-modal model for calculation to obtain a target reply text corresponding to the question text. Automatic question answering based on images and texts is thereby realized, which improves question-answering efficiency and accuracy, further improves inquiry accuracy in the medical field, and avoids waste of medical resources.
Drawings
In order to more clearly illustrate the solution of the present application, the drawings needed for describing the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of an automatic question-answer matching method according to the present application;
FIG. 3 is a schematic diagram of an embodiment of an automatic question-answering matching device according to the present application;
FIG. 4 is a schematic block diagram of one embodiment of a computer device according to the present application.
Reference numerals are as follows: the automatic question-answer matching device 300, an obtaining module 301, an encoding module 302 and a prediction module 303.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the foregoing drawings are used for distinguishing between different objects and not for describing a particular sequential order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein may be combined with other embodiments.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various communication client applications, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that the automatic question-answer matching method provided in the embodiment of the present application is generally executed by a server/terminal device, and accordingly, the automatic question-answer matching apparatus is generally disposed in the server/terminal device.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow diagram of one embodiment of an automatic question-answer matching method according to the present application is shown. The automatic question-answer matching method comprises the following steps:
step S201, acquiring image data, a question text and a reference text, and respectively preprocessing the image data, the question text and the reference text to obtain a segmentation image, a first segmentation text and a second segmentation text.
In this embodiment, the image data is an image associated with a question text submitted by a user (e.g., a pathological picture submitted by the user during a consultation); the question text is text converted from the questioner's speech during the conversation; the reference text is stored reference reply text, and the question text and the reference text constitute the text data. The image data, the question text and the reference text are acquired and preprocessed respectively to obtain a segmented image, a first segmentation text and a second segmentation text. Specifically, preprocessing the image data includes operations such as segmentation and pixel adjustment, and preprocessing the question text and the reference text includes operations such as word segmentation and error correction. When the image data, the question text and the reference text are acquired, the image data is preprocessed to obtain the segmented image, and the question text and the reference text are preprocessed to obtain the first segmentation text and the second segmentation text. The segmented image is an image obtained by segmenting the image data and adjusting its pixels, the first segmentation text is a text obtained by word-segmenting and error-correcting the question text, and the second segmentation text is a text obtained by word-segmenting and error-correcting the reference text.
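The embodiment does not prescribe a concrete preprocessing implementation. As a rough illustration only, the sketch below resizes the image and splits it into fixed-size patches as one plausible form of "segmentation and pixel adjustment", and stands in a trivial tokenizer for word segmentation; all helper names, the patch size and the synthetic input are assumptions, not part of the disclosure.

```python
# Illustrative preprocessing sketch only; the patented method does not
# prescribe these exact operations, sizes or helper names.
from typing import List
import numpy as np
from PIL import Image

def preprocess_image(img: Image.Image, size: int = 224, patch: int = 32) -> np.ndarray:
    """Pixel adjustment (resize, normalise) followed by a patch-wise split."""
    img = img.convert("RGB").resize((size, size))
    arr = np.asarray(img, dtype=np.float32) / 255.0                 # size x size x 3
    n = size // patch
    patches = arr.reshape(n, patch, n, patch, 3).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, patch * patch * 3)                   # n*n patches

def preprocess_text(text: str) -> List[str]:
    """Placeholder word segmentation; a Chinese segmenter plus error
    correction would be used in a real system."""
    return text.split()

# Synthetic stand-ins for the user's picture, question text and reference text.
image_data = Image.fromarray(np.random.randint(0, 255, (300, 400, 3), dtype=np.uint8))
segmented_image = preprocess_image(image_data)                      # 49 x 3072
first_segmentation_text = preprocess_text("does this rash need medication")
second_segmentation_text = preprocess_text("stored reference reply text")
```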
Step S202, position coding is carried out on the segmentation image, the first segmentation text and the second segmentation text to obtain a position vector, segmentation coding is carried out on the segmentation image, the first segmentation text and the second segmentation text to obtain a segmentation vector, and embedding coding is carried out on the segmentation image, the first segmentation text and the second segmentation text to obtain an embedding vector.
In this embodiment, when the segmented image, the first segmentation text and the second segmentation text are obtained, position coding is performed on them to obtain a position vector, segmentation coding is performed on them to obtain a segmentation vector, and embedded coding is performed on them to obtain an embedded vector. Specifically, coding is performed according to the relative position of the segmented image in the image data to obtain the image position code of the segmented image; the relative positions of the first segmentation text and the second segmentation text in the question text and the reference text are coded respectively to obtain a first text position code and a second text position code; and the image position code, the first text position code and the second text position code are spliced front to back (that is, the first and second text position codes are spliced after the image position code, or before it) to obtain the position vector.
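A minimal sketch of this position vector is shown below, assuming a learned position-embedding table and an illustrative 768-dimensional space; the table size, dimension and PyTorch usage are assumptions rather than details of the disclosure.

```python
# Sketch: relative positions of image patches and text tokens are encoded
# and spliced image-first; dimensions and the embedding table are assumed.
import torch
import torch.nn as nn

dim = 768
pos_table = nn.Embedding(512, dim)  # shared learned position-embedding table

def position_vector(n_patches: int, n_question: int, n_reference: int) -> torch.Tensor:
    img_pos = pos_table(torch.arange(n_patches))     # image position code
    q_pos = pos_table(torch.arange(n_question))      # first text position code
    ref_pos = pos_table(torch.arange(n_reference))   # second text position code
    return torch.cat([img_pos, q_pos, ref_pos], dim=0)

pos_vec = position_vector(49, 12, 20)
print(pos_vec.shape)  # torch.Size([81, 768])
```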
In order to distinguish image data from text data and enable the finally generated text to be more in line with expected characteristics, segmentation coding is carried out on the segmentation image, the first segmentation text and the second segmentation text through segmentation numbers to obtain segmentation vectors. Specifically, a preset segmentation number is obtained, and segmentation images from different sources are identified based on the segmentation number to obtain an image segmentation code; and identifying the word segmentation texts from different sources based on the segmentation numbers to obtain text segmentation codes. And splicing the image segmentation codes and the text segmentation codes in front and back (namely splicing the text segmentation codes after the image segmentation codes or splicing the text segmentation codes before the image segmentation codes) to obtain segmentation vectors.
The embedded coding performs vector conversion on the segmented image, the first segmentation text and the second segmentation text respectively to obtain an image embedded code and text embedded codes, which are then spliced front to back to obtain the embedded vector. The splicing order of the embedded vector is the same as that of the segmentation vector and the position vector.
And S203, carrying out vector splicing on the embedded vector, the position vector and the segmentation vector to obtain a fusion characterization vector, obtaining a trained multi-modal model, and inputting the fusion characterization vector to the multi-modal model to obtain a target reply text corresponding to the question text.
In this embodiment, when the embedded vector, the position vector and the segmentation vector are obtained, vector splicing is performed on them, that is, the embedded vector, the position vector and the segmentation vector are added together to obtain a fusion characterization vector. A trained multi-modal model is then acquired, the fusion characterization vector is input into the multi-modal model, and a target reply text is calculated based on the multi-modal model. Specifically, the multi-modal model is composed of a plurality of stacked bidirectional Transformer blocks, where each Transformer block comprises a multi-head attention layer, a normalization layer and a feed-forward layer; bidirectional attention calculation can be performed on the input fusion characterization vector through the bidirectional Transformer blocks in the multi-modal model, so that the image and the text are fully fused, and the target reply text is finally output based on the multi-modal model.
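The sketch below illustrates the fusion by vector addition and the stack of bidirectional Transformer encoder blocks. PyTorch's nn.TransformerEncoder is used as a generic stand-in for the multi-modal model; the layer count, head count and the omitted reply-generation head are assumptions.

```python
# Sketch of fusion (vector addition) and bidirectional attention; all sizes
# are illustrative and the reply-decoding head is omitted.
import torch
import torch.nn as nn

dim, seq_len = 768, 81
embed_vec = torch.randn(seq_len, dim)   # embedded vector
pos_vec = torch.randn(seq_len, dim)     # position vector
seg_vec = torch.randn(seq_len, dim)     # segmentation vector

fusion_vec = embed_vec + pos_vec + seg_vec   # fusion characterization vector

layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, dim_feedforward=3072,
                                   batch_first=True)  # attention + norm + feed-forward
multimodal_model = nn.TransformerEncoder(layer, num_layers=12)

hidden = multimodal_model(fusion_vec.unsqueeze(0))     # bidirectional attention
print(hidden.shape)                                    # torch.Size([1, 81, 768])
```

A language-model head over the resulting hidden states would then generate the target reply text token by token; that step is not shown here.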
It is emphasized that, in order to further ensure the privacy and security of the target reply text, the target reply text may also be stored in a node of a blockchain.
The block chain referred by the application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
According to the embodiment, automatic question answering based on images and texts is realized, the question answering efficiency and accuracy are improved, the question diagnosing accuracy in the medical field is further improved, and the waste of medical resources is avoided.
In some optional implementation manners of this embodiment, the step of performing embedded coding on the segmented image, the first segmentation text and the second segmentation text to obtain an embedded vector includes:
respectively calculating coding information of the first segmentation text and the second segmentation text to obtain a first text embedded code corresponding to the first segmentation text and a second text embedded code corresponding to the second segmentation text;
obtaining the vector dimension of the first text embedded code or the second text embedded code, and mapping the segmented image to a target dimension according to the vector dimension to obtain an image embedded code;
and splicing the first text embedded code, the second text embedded code and the image embedded code to obtain the embedded vector.
In this embodiment, when embedded coding is performed on the segmented image, the first segmentation text and the second segmentation text, the coding information of the first segmentation text and of the second segmentation text is calculated, where the coding information is the token embedding corresponding to the segmented text; the first text embedded code and the second text embedded code corresponding to the first segmentation text and the second segmentation text respectively can be calculated through WordPiece. Then, the vector dimension of the first text embedded code or the second text embedded code is obtained, and the segmented image is mapped to a target dimension through a fully connected layer according to that vector dimension to obtain an image embedded code; the target dimension is a dimension equal in size to the vector dimension. For example, if the vectors of the first text embedded code and the second text embedded code are 768-dimensional, the segmented image is mapped through the fully connected layer into a vector of the same dimension, so that an image embedded code of the same dimension is obtained. The first text embedded code, the second text embedded code and the image embedded code are then spliced front to back in sequence to obtain the embedded vector.
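A sketch of this embedded coding is shown below: token embeddings for the two texts and a fully connected layer that maps the image patches to the same 768 dimensions, followed by front-to-back splicing. The vocabulary size, patch dimension and splicing order are illustrative assumptions.

```python
# Sketch of embedded coding; vocabulary size, patch dimension and the exact
# splicing order are assumptions, not taken from the disclosure.
import torch
import torch.nn as nn

dim = 768
token_embedding = nn.Embedding(30000, dim)      # WordPiece-style token embeddings
image_projection = nn.Linear(32 * 32 * 3, dim)  # fully connected layer to 768 dims

first_ids = torch.randint(0, 30000, (12,))   # first segmentation text (token ids)
second_ids = torch.randint(0, 30000, (20,))  # second segmentation text (token ids)
patches = torch.randn(49, 32 * 32 * 3)       # flattened patches of the segmented image

first_text_code = token_embedding(first_ids)    # first text embedded code
second_text_code = token_embedding(second_ids)  # second text embedded code
image_code = image_projection(patches)          # image embedded code (same dimension)

embedded_vector = torch.cat([image_code, first_text_code, second_text_code], dim=0)
print(embedded_vector.shape)  # torch.Size([81, 768])
```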
According to the method and the device, the segmented image, the first segmentation text and the second segmentation text are embedded and coded, so that the image features are accurately extracted and fused, the image provided by the patient can be combined for accurate inquiry in the inquiry process in the medical field, and the inquiry and answer accuracy rate is improved.
In some optional implementation manners of this embodiment, the step of performing segmentation coding on the segmented image, the first segmentation text, and the second segmentation text to obtain a segmentation vector includes:
acquiring role marks of the segmentation image, the first segmentation text and the second segmentation text;
and splicing and coding the role marks according to a preset sequence to obtain the segmentation vectors.
In this embodiment, the image and the text belong to two different role categories, and the first and second segmentation texts can further be divided into different role categories according to the division between questioner and answerer. The role marks of the segmented image, the first segmentation text and the second segmentation text are acquired, where one role category corresponds to one role mark; for example, the segmented image, the first segmentation text and the second segmentation text can be marked V, S1 and S2 respectively according to their role categories. A preset order is acquired, and the role marks are spliced and coded according to the preset order to obtain the segmentation vector. The preset order is the arrangement order of the role marks: for the same role mark, splicing can follow the order in which the image or text was segmented, while different role marks can be spliced according to a preset order such as the conversation order. For example, the role marks corresponding to one inquiry dialogue may be spliced into V, V, V, S1, S2, S1, S2, S1 according to the preset order. The spliced role marks are then coded to obtain the segmentation vector.
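As a small illustration of this role-mark splicing (the 768-dimensional embedding table and the example dialogue layout are assumptions, not prescribed by the embodiment):

```python
# Sketch: role marks V / S1 / S2 are spliced in the preset (dialogue) order
# and looked up in an embedding table to form the segmentation vector.
import torch
import torch.nn as nn

ROLE_IDS = {"V": 0, "S1": 1, "S2": 2}
segment_embedding = nn.Embedding(len(ROLE_IDS), 768)

# e.g. three image patches followed by alternating questioner / answerer turns
role_marks = ["V", "V", "V", "S1", "S2", "S1", "S2", "S1"]
segment_ids = torch.tensor([ROLE_IDS[r] for r in role_marks])

segmentation_vector = segment_embedding(segment_ids)
print(segmentation_vector.shape)  # torch.Size([8, 768])
```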
In the embodiment, the image and the text roles are distinguished by carrying out segmentation coding on the segmented image, the first segmentation text and the second segmentation text, so that the finally generated target reply text is more in line with the role characteristics, and the question and answer accuracy is further improved.
In some optional implementations of this embodiment, the step of obtaining the trained multi-modal model includes:
collecting a plurality of groups of medical inquiry data, and splitting the medical inquiry data to obtain a training sample;
inputting the training samples into a basic prediction model, calculating to obtain a target loss function, training the basic prediction model according to the target loss function, and determining that the basic prediction model is trained to obtain the multi-modal model when the target loss function is the minimum value.
In this embodiment, the medical inquiry data is acquired multiple inquiry dialogue data, and one inquiry dialogue includes image data and multiple rounds of dialogue text. Collecting a plurality of groups of medical inquiry data, and splitting the medical inquiry data to obtain a training sample; and inputting the training sample into a basic prediction model for calculation to obtain a target loss function. And finally, training a basic prediction model according to the target loss function, and determining that the basic prediction model is trained to obtain a multi-modal model when the target loss function is the minimum value. Specifically, the basic prediction model is the model with the same structure as the multi-modal model, parameters of the basic prediction model are adjusted through a target loss function, next target loss function calculation is carried out on the basis of the adjusted basic prediction model until the target loss function obtained through calculation is the minimum value, and the adjusted basic prediction model is determined to be the multi-modal model.
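A minimal training-loop sketch under these assumptions is given below; the optimizer, learning rate and the early-stopping criterion used to approximate "the target loss function is the minimum value" are illustrative choices, not part of the disclosure.

```python
# Illustrative training loop; optimizer, hyper-parameters and the stopping
# rule that approximates "loss at its minimum" are assumptions.
import torch

def train_multimodal(model, training_samples, target_loss_fn,
                     epochs=10, patience=3, lr=1e-4):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    best, stale = float("inf"), 0
    for _ in range(epochs):
        for sample in training_samples:
            loss = target_loss_fn(model, sample)  # text-mask + picture + prediction loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if loss.item() < best - 1e-4:
                best, stale = loss.item(), 0
            else:
                stale += 1
            if stale >= patience:        # treated as the minimum being reached
                return model
    return model
```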
In the embodiment, the basic prediction model is trained through the target loss function, so that efficient training and adjustment of the model are realized, and the multi-modal model obtained through training can accurately answer the question.
In some optional implementation manners of this embodiment, the step of inputting the training samples into the basic prediction model and calculating to obtain the target loss function includes:
and calculating the text mask loss, the picture loss and the prediction loss of the training sample, and generating the target loss function according to the text mask loss, the picture loss and the prediction loss.
In the present embodiment, the target loss function includes a text mask loss, a picture loss, and a prediction loss. The text mask loss is the loss caused when the mask prediction is carried out on the text; the picture loss is the loss caused by performing category prediction on an input image (such as a corresponding disease, a part or skin expression); the predicted loss is the loss of the target reply text and the real reply text which are finally generated by the model. And generating a target loss function according to the text mask loss, the picture loss and the prediction loss.
Specifically, the text mask loss is calculated by the following formula: L_MLM(θ) = -∑ log P_θ(w_m | v, w_\m), where w_m is the masked word vector and w_\m denotes the remaining word vectors other than the masked word vector. The picture loss is calculated as: L_MRM(θ) = CE(c, ĉ), where CE is the cross-entropy loss function, c is the one-hot vector of the picture category, ĉ is the category probability distribution output by the model, and M is the number of categories. The prediction loss is calculated as: L_GP(θ) = -∑_i log P_θ(r_i | v, w, r_1, r_2, …, r_{i-1}), where v is the image, w is the text, (r_1, r_2, …, r_{i-1}) is the predicted reply vector and r_i is the target predicted reply vector. Based on the text mask loss, the picture loss and the prediction loss, the generated target loss function is calculated as: L = L_MLM(θ) + L_MRM(θ) + L_GP(θ).
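Expressed in code, the three losses and their sum might look as follows; the tensor shapes, vocabulary size and the use of cross-entropy for the negative log-likelihood terms are assumptions consistent with the formulas above.

```python
# Sketch of L = L_MLM + L_MRM + L_GP; shapes and sizes are illustrative.
import torch
import torch.nn.functional as F

def target_loss(mask_logits, mask_targets,      # masked-token prediction
                picture_logits, picture_label,  # picture category prediction
                reply_logits, reply_targets):   # autoregressive reply prediction
    l_mlm = F.cross_entropy(mask_logits, mask_targets)       # text mask loss
    l_mrm = F.cross_entropy(picture_logits, picture_label)   # picture (CE) loss
    l_gp = F.cross_entropy(reply_logits.reshape(-1, reply_logits.size(-1)),
                           reply_targets.reshape(-1))         # prediction loss
    return l_mlm + l_mrm + l_gp

loss = target_loss(torch.randn(4, 30000), torch.randint(0, 30000, (4,)),
                   torch.randn(1, 20), torch.tensor([3]),
                   torch.randn(1, 16, 30000), torch.randint(0, 30000, (1, 16)))
print(loss.item())
```

Cross-entropy against the target indices is equivalent to the negative log-likelihood terms in the formulas, which is why it is used for all three parts of this sketch.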
According to the method and the device, the target loss function is generated through the text mask loss, the picture loss and the prediction loss, and the model loss is accurately calculated, so that the model is accurately trained through the target loss function, and the accuracy of the model question answering is further improved.
In some optional implementation manners of this embodiment, the acquiring multiple sets of medical inquiry data, and the splitting the medical inquiry data to obtain the training sample includes:
obtaining sample types in the medical inquiry data, wherein the sample types comprise a picture type, a question type and an answer type, and marking and splitting the medical inquiry data according to the picture type, the question type and the answer type to obtain a plurality of sample data;
mask marking is carried out on the sample data to obtain a mask prediction sample, and position coding, segmentation coding and embedded coding are respectively carried out on the mask prediction sample to obtain a first coding result, a second coding result and a third coding result;
and splicing the first coding result, the second coding result and the third coding result to obtain the training sample.
In this embodiment, when the medical inquiry data is split, the sample categories in the medical inquiry data are obtained, where the sample categories can be divided into a picture category, a question category and an answer category, and the question category and the answer category are collectively referred to as role categories. The medical inquiry data is marked and split according to the picture category, the question category and the answer category to obtain a plurality of sample data. Specifically, the medical inquiry data includes a plurality of inquiry dialogue data, and one inquiry dialogue includes image data and multiple rounds of dialogue text. When the sample categories of the inquiry dialogue data are obtained, the data in the same inquiry dialogue are marked according to the picture category, the question category and the answer category, and the data of the same inquiry dialogue are then split based on the marks. The image data of the picture category, the question text of one question category and the answer text of the following answer category corresponding to that question text form the first round of dialogue data; the image data, that question text, the corresponding answer text and the next question text corresponding to that answer text form the second round of dialogue data, and so on: each time the role category is switched, one round of dialogue data is constructed. In this way, one inquiry dialogue is split, and each round of dialogue data is one piece of sample data obtained through splitting.
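A sketch of this round-by-round splitting is given below; the dictionary layout of a dialogue and the role names are assumptions used only for illustration.

```python
# Sketch: each switch of role category closes one round of dialogue data,
# and every round keeps the image plus all dialogue turns so far.
from typing import Dict, List

def split_dialogue(dialogue: Dict) -> List[Dict]:
    """dialogue = {"image": ..., "turns": [("question", txt), ("answer", txt), ...]}"""
    samples, history, last_role = [], [], None
    for role, text in dialogue["turns"]:
        history.append((role, text))
        if last_role is not None and role != last_role:   # role category switched
            samples.append({"image": dialogue["image"], "turns": list(history)})
        last_role = role
    return samples

demo = {"image": "IMG", "turns": [("question", "q1"), ("answer", "a1"),
                                  ("question", "q2"), ("answer", "a2")]}
print([len(s["turns"]) for s in split_dialogue(demo)])  # [2, 3, 4]
```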
When sample data is obtained, mask marking is carried out on the sample data, namely, the mask is carried out on the sample data through the mask, and a mask prediction sample is obtained; and then, respectively carrying out position coding, partition coding and embedded coding on the mask prediction sample to obtain a first coding result, a second coding result and a third coding result. And carrying out vector splicing (namely vector summation) on the first coding result, the second coding result and the third coding result to obtain a training sample. In addition, the split sample data can also be directly encoded, and the encoded sample data is used as the training sample.
According to the embodiment, the medical inquiry data are split, and the mask marks and codes are used for obtaining the training sample, so that the model can be sufficiently trained through the training sample, and the model training efficiency is further improved.
In some optional implementation manners of this embodiment, the masking marking the sample data to obtain a mask prediction sample includes:
acquiring a target mask word, and determining whether a target picture corresponding to the target mask word exists in data corresponding to a picture category in the sample data, and whether a target text corresponding to the target mask word exists in data corresponding to a question category and an answer category in the sample data;
and when the target picture and the target text are determined to exist at the same time, selecting the target picture or the target text for masking to obtain the mask prediction sample.
In this embodiment, in order to avoid losing image and text information, when sample data is masked, a target mask word (such as an apple) is obtained, and it is determined whether a target picture corresponding to the target mask word exists in data corresponding to a picture category in the sample data, and whether a target text corresponding to the target mask word exists in data corresponding to a question category and an answer category in the sample data; and when the target picture and the target text exist at the same time, selecting the target picture or the target text for masking to obtain a mask prediction sample. Therefore, the information corresponding to the same target mask word in the picture and the text is prevented from being masked at the same time.
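The rule can be sketched as follows; how a target picture corresponding to the mask word is detected is not specified by the embodiment, so the picture labels and helper names here are assumptions.

```python
# Sketch of the masking rule: if the target mask word is present in both the
# picture and the text, only one of the two is masked so no information is
# lost from both modalities at once.
import random

MASK = "[MASK]"

def mask_sample(sample, target_word, picture_labels):
    in_picture = target_word in picture_labels
    in_text = any(target_word in text for _, text in sample["turns"])
    masked = {"image": sample["image"], "turns": list(sample["turns"]),
              "masked_picture": False}

    def mask_text():
        masked["turns"] = [(r, t.replace(target_word, MASK))
                           for r, t in masked["turns"]]

    if in_picture and in_text:
        if random.random() < 0.5:
            masked["masked_picture"] = True   # mask only the picture side
        else:
            mask_text()                        # mask only the text side
    elif in_picture:
        masked["masked_picture"] = True
    elif in_text:
        mask_text()
    return masked
```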
According to the embodiment, the image and the text are masked independently, so that information loss caused by masking is avoided, the model can fully learn the relation between the image and the text, and the relevance between the target reply text and the image is further improved.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by computer readable instructions instructing relevant hardware; the instructions can be stored in a computer readable storage medium, and when executed, the processes of the embodiments of the methods described above can be included. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk or a read-only memory (ROM), or a random access memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited to the order shown, and they may be performed in other orders. Moreover, at least some of the steps in the flowchart may include multiple sub-steps or multiple stages, which are not necessarily completed at the same moment but may be executed at different moments, and their execution order is not necessarily sequential; they may be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
With further reference to fig. 3, as an implementation of the method shown in fig. 2, the present application provides an embodiment of an automatic question-answer matching device, where the embodiment of the device corresponds to the embodiment of the method shown in fig. 2, and the device may be specifically applied to various electronic devices.
As shown in fig. 3, the automatic question-answer matching device 300 according to the present embodiment includes: an acquisition module 301, an encoding module 302, and a prediction module 303. Wherein:
an obtaining module 301, configured to obtain image data, a question text, and a reference text, and perform preprocessing on the image data, the question text, and the reference text, respectively, to obtain a segmented image, a first segmentation text, and a second segmentation text;
in this embodiment, the image data is an image associated with a question text submitted by a user (e.g., a pathological picture submitted by the user during a consultation); the questioning text is a text converted by the voice of the questioner in the conversation process; the reference text is stored reference reply text, and the question text and the reference text constitute text data. The method comprises the steps of obtaining image data, a question text and a reference text, and respectively preprocessing the image data, the question text and the reference text to obtain a segmentation image, a first segmentation text and a second segmentation text. Specifically, the preprocessing of the image data comprises operations such as segmentation and pixel adjustment, and the preprocessing of the question text and the reference text comprises operations such as word segmentation and error correction; preprocessing the image data to obtain a segmentation image when obtaining the image data, the question text and the reference text; and preprocessing the question text and the reference text to obtain a first segmentation text and a second segmentation text. The segmentation image is an image obtained by segmenting and adjusting pixels of image data, the first segmentation text is a text obtained by segmenting and correcting a question text, and the second segmentation text is a text obtained by segmenting and correcting a reference text.
The encoding module 302 is configured to perform position coding on the segmented image, the first segmentation text and the second segmentation text to obtain a position vector, perform segmentation coding on the segmented image, the first segmentation text and the second segmentation text to obtain a segmentation vector, and perform embedded coding on the segmented image, the first segmentation text and the second segmentation text to obtain an embedded vector;
in some optional implementations of this embodiment, the encoding module 302 includes:
the first coding unit is used for respectively calculating coding information of the first segmentation text and the second segmentation text to obtain a first text embedded code corresponding to the first segmentation text and a second text embedded code corresponding to the second segmentation text;
the mapping unit is used for acquiring the vector dimension of the first text embedded code or the second text embedded code, and mapping the segmented image to a target dimension according to the vector dimension to obtain an image embedded code;
and the first splicing unit is used for splicing the first text embedded code, the second text embedded code and the image embedded code to obtain the embedded vector.
In some optional implementations of this embodiment, the encoding module 302 further includes:
an obtaining unit, configured to obtain role labels of the segmented image, the first segmentation text, and the second segmentation text;
and the second coding unit is used for splicing and coding the role marks according to a preset sequence to obtain the segmentation vectors.
In this embodiment, when the segmented image, the first segmentation text and the second segmentation text are obtained, position coding is performed on them to obtain a position vector, segmentation coding is performed on them to obtain a segmentation vector, and embedded coding is performed on them to obtain an embedded vector. Specifically, coding is performed according to the relative position of the segmented image in the image data to obtain the image position code of the segmented image; the relative positions of the first segmentation text and the second segmentation text in the question text and the reference text are coded respectively to obtain a first text position code and a second text position code; and the image position code, the first text position code and the second text position code are spliced front to back (that is, the first and second text position codes are spliced after the image position code, or before it) to obtain the position vector.
In order to distinguish the image data from the text data and enable the finally generated text to be more in line with the expected characteristics, the segmentation image, the first participle text and the second participle text are segmented and coded through the segmentation number to obtain a segmentation vector. Specifically, a preset segmentation number is obtained, and segmentation images from different sources are identified based on the segmentation number to obtain image segmentation codes; and identifying the word segmentation texts from different sources based on the segmentation numbers to obtain text segmentation codes. The image segmentation code and the text segmentation code are spliced in front and back (namely the text segmentation code is spliced after the image segmentation code or the text segmentation code is spliced before the image segmentation code) to obtain a segmentation vector.
The embedded coding performs vector conversion on the segmented image, the first segmentation text and the second segmentation text respectively to obtain an image embedded code and text embedded codes, which are then spliced front to back to obtain the embedded vector. The splicing order of the embedded vector is the same as that of the segmentation vector and the position vector.
The prediction module 303 is configured to perform vector stitching on the embedded vector, the position vector, and the segmentation vector to obtain a fusion characterization vector, obtain a trained multi-modal model, input the fusion characterization vector to the multi-modal model, and calculate to obtain a target reply text corresponding to the question text.
In some optional implementations of this embodiment, the prediction module 303 includes:
the system comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a plurality of groups of medical inquiry data and splitting the medical inquiry data to obtain training samples;
and the training unit is used for inputting the training samples into a basic prediction model, calculating to obtain a target loss function, training the basic prediction model according to the target loss function, and determining that the basic prediction model is trained to obtain the multi-modal model when the target loss function is the minimum value.
In some optional implementations of this embodiment, the training unit includes:
and the generating unit is used for calculating the text mask loss, the picture loss and the prediction loss of the training sample and generating the target loss function according to the text mask loss, the picture loss and the prediction loss.
In some optional implementations of this embodiment, the acquisition unit includes:
the splitting unit is used for obtaining sample types in the medical inquiry data, wherein the sample types comprise a picture type, a question type and an answer type, and the medical inquiry data is marked and split according to the picture type, the question type and the answer type to obtain a plurality of sample data;
the third coding unit is used for carrying out mask marking on the sample data to obtain a mask prediction sample, and respectively carrying out position coding, segmentation coding and embedded coding on the mask prediction sample to obtain a first coding result, a second coding result and a third coding result;
and the second splicing unit is used for splicing the first coding result, the second coding result and the third coding result to obtain the training sample.
In some optional implementations of this embodiment, the third encoding unit includes:
a confirming unit, configured to acquire a target mask word, determine whether a target picture corresponding to the target mask word exists in data corresponding to a picture category in the sample data, and determine whether a target text corresponding to the target mask word exists in data corresponding to a question category and an answer category in the sample data;
and the selecting unit is used for selecting the target picture or the target text to carry out mask when the target picture and the target text are determined to exist at the same time, so as to obtain the mask prediction sample.
In this embodiment, when the embedded vector, the position vector and the segmentation vector are obtained, vector splicing is performed on them, that is, the embedded vector, the position vector and the segmentation vector are added together to obtain a fusion characterization vector. A trained multi-modal model is then acquired, the fusion characterization vector is input into the multi-modal model, and a target reply text is calculated based on the multi-modal model. Specifically, the multi-modal model is composed of a plurality of stacked bidirectional Transformer blocks, where each Transformer block comprises a multi-head attention layer, a normalization layer and a feed-forward layer; bidirectional attention calculation can be performed on the input fusion characterization vector through the bidirectional Transformer blocks in the multi-modal model, so that the image and the text are fully fused, and the target reply text is finally output based on the multi-modal model.
It is emphasized that, in order to further ensure the privacy and security of the target reply text, the target reply text may also be stored in a node of a blockchain.
The block chain referred by the application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The automatic question-answering device provided by the embodiment realizes automatic question-answering based on images and texts, improves the efficiency and accuracy of question-answering, further improves the accuracy of inquiry in the medical field, and avoids the waste of medical resources.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 4, fig. 4 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 6 includes a memory 61, a processor 62, and a network interface 63 communicatively connected to each other via a system bus. It is noted that only a computer device 6 having components 61-63 is shown, but it should be understood that not all of the shown components are required to be implemented, and more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device may be a desktop computer, a notebook computer, a palmtop computer, a cloud server or another computing device. The computer device can interact with a user through a keyboard, a mouse, a remote control, a touch panel or a voice control device.
The memory 61 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 61 may be an internal storage unit of the computer device 6, such as a hard disk or a memory of the computer device 6. In other embodiments, the memory 61 may also be an external storage device of the computer device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the computer device 6. Of course, the memory 61 may also comprise both an internal storage unit of the computer device 6 and an external storage device thereof. In this embodiment, the memory 61 is generally used for storing an operating system installed in the computer device 6 and various application software, such as computer readable instructions of an automatic question and answer matching method. Further, the memory 61 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 62 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 62 is typically arranged to control the overall operation of the computer device 6. In this embodiment, the processor 62 is configured to execute computer-readable instructions stored in the memory 61 or process data, such as computer-readable instructions for executing the automatic question and answer matching method.
The network interface 63 may comprise a wireless network interface or a wired network interface, and the network interface 63 is typically used for establishing a communication connection between the computer device 6 and other electronic devices.
The computer equipment provided by the embodiment realizes automatic question answering based on images and texts, improves the efficiency and accuracy of question answering, further improves the accuracy of inquiry in the medical field, and avoids the waste of medical resources.
The present application further provides another embodiment, which is to provide a computer-readable storage medium storing computer-readable instructions executable by at least one processor to cause the at least one processor to perform the steps of the automatic question-answer matching method as described above.
The computer-readable storage medium provided by the embodiment realizes automatic question answering based on images and texts, improves the efficiency and accuracy of question answering, further improves the accuracy of inquiry in the medical field, and avoids the waste of medical resources.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
It is to be understood that the above-described embodiments are merely illustrative of some, but not all, embodiments of the present application, and that the appended drawings illustrate preferred embodiments without limiting the scope of the application. This application may be embodied in many different forms, and these embodiments are provided so that the disclosure of the application will be thorough and complete. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art will appreciate that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalents can be substituted for some of their features. All equivalent structures made by using the contents of the specification and the drawings of the present application, whether applied directly or indirectly in other related technical fields, fall within the protection scope of the present application.

Claims (10)

1. An automatic question-answer matching method is characterized by comprising the following steps:
acquiring image data, a question text and a reference text, and respectively preprocessing the image data, the question text and the reference text to obtain a segmentation image, a first segmentation text and a second segmentation text;
performing position coding on the segmentation image, the first segmentation text and the second segmentation text to obtain a position vector, performing segmentation coding on the segmentation image, the first segmentation text and the second segmentation text to obtain a segmentation vector, and performing embedded coding on the segmentation image, the first segmentation text and the second segmentation text to obtain an embedded vector;
and carrying out vector splicing on the embedded vector, the position vector and the segmentation vector to obtain a fusion characterization vector, obtaining a trained multi-modal model, inputting the fusion characterization vector to the multi-modal model for calculation, and obtaining a target reply text corresponding to the question text.
2. The automatic question-answer matching method according to claim 1, wherein the step of performing embedded coding on the segmentation image, the first segmentation text and the second segmentation text to obtain the embedded vector comprises:
respectively calculating the coding information of the first segmentation text and the second segmentation text to obtain a first text embedded code corresponding to the first segmentation text and a second text embedded code corresponding to the second segmentation text;
obtaining the vector dimension of the first text embedded code or the second text embedded code, and mapping the segmentation image to a target dimension according to the vector dimension to obtain an image embedded code;
and splicing the first text embedded code, the second text embedded code and the image embedded code to obtain the embedded vector.
3. The automatic question-answer matching method according to claim 1, wherein the step of performing segmentation coding on the segmentation image, the first segmentation text and the second segmentation text to obtain a segmentation vector comprises:
acquiring role marks of the segmentation image, the first segmentation text and the second segmentation text;
and splicing and coding the role marks according to a preset sequence to obtain the segmentation vectors.
4. The automated question-answer matching method according to claim 1, wherein the step of obtaining a trained multi-modal model comprises:
collecting a plurality of groups of medical inquiry data, and splitting the medical inquiry data to obtain a training sample;
inputting the training samples into a basic prediction model, calculating to obtain a target loss function, training the basic prediction model according to the target loss function, and determining that the basic prediction model is trained to obtain the multi-modal model when the target loss function is the minimum value.
5. The automated question-answer matching method according to claim 4, wherein the step of inputting the training samples into a basic prediction model and calculating a target loss function comprises:
and calculating the text mask loss, the picture loss and the prediction loss of the training sample, and generating the target loss function according to the text mask loss, the picture loss and the prediction loss.
6. The automatic question-answer matching method according to claim 4, wherein the step of collecting a plurality of sets of medical inquiry data and splitting the medical inquiry data to obtain training samples comprises:
obtaining sample types in the medical inquiry data, wherein the sample types comprise a picture type, a question type and an answer type, and marking and splitting the medical inquiry data according to the picture type, the question type and the answer type to obtain a plurality of sample data;
performing mask marking on the sample data to obtain a mask prediction sample, and respectively performing position coding, segmentation coding and embedded coding on the mask prediction sample to obtain a first coding result, a second coding result and a third coding result;
and splicing the first coding result, the second coding result and the third coding result to obtain the training sample.
7. The automatic question-answer matching method according to claim 6, wherein the step of mask-marking the sample data to obtain a mask prediction sample comprises:
acquiring a target mask word, and determining whether a target picture corresponding to the target mask word exists in data corresponding to a picture category in the sample data, and whether a target text corresponding to the target mask word exists in data corresponding to a question category and an answer category in the sample data;
and when the target picture and the target text are determined to exist at the same time, selecting the target picture or the target text for masking to obtain the mask prediction sample.
8. An automatic question-answer matching device, comprising:
the acquisition module is used for acquiring image data, a question text and a reference text, and respectively preprocessing the image data, the question text and the reference text to obtain a segmentation image, a first segmentation text and a second segmentation text;
the encoding module is used for carrying out position encoding on the segmentation image, the first segmentation text and the second segmentation text to obtain a position vector, carrying out segmentation encoding on the segmentation image, the first segmentation text and the second segmentation text to obtain a segmentation vector, and carrying out embedded encoding on the segmentation image, the first segmentation text and the second segmentation text to obtain an embedded vector;
and the prediction module is used for carrying out vector splicing on the embedded vector, the position vector and the segmentation vector to obtain a fusion characterization vector, obtaining a trained multi-modal model, inputting the fusion characterization vector to the multi-modal model for calculation, and obtaining a target reply text corresponding to the question text.
9. A computer device, comprising a memory and a processor, wherein the memory stores computer readable instructions which, when executed by the processor, implement the steps of the automatic question-answer matching method according to any one of claims 1 to 7.
10. A computer-readable storage medium having computer-readable instructions stored thereon which, when executed by a processor, implement the steps of the automatic question-answer matching method according to any one of claims 1 to 7.
CN202210419404.9A 2022-04-20 2022-04-20 Automatic question-answer matching method, device, computer equipment and storage medium Pending CN114780701A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210419404.9A CN114780701A (en) 2022-04-20 2022-04-20 Automatic question-answer matching method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210419404.9A CN114780701A (en) 2022-04-20 2022-04-20 Automatic question-answer matching method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114780701A 2022-07-22

Family

ID=82431867

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210419404.9A Pending CN114780701A (en) 2022-04-20 2022-04-20 Automatic question-answer matching method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114780701A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116630386A (en) * 2023-06-12 2023-08-22 新疆生产建设兵团医院 CTA scanning image processing method and system thereof
CN116630386B (en) * 2023-06-12 2024-02-20 新疆生产建设兵团医院 CTA scanning image processing method and system thereof
CN117216312A (en) * 2023-11-06 2023-12-12 长沙探月科技有限公司 Method and device for generating questioning material, electronic equipment and storage medium
CN117216312B (en) * 2023-11-06 2024-01-26 长沙探月科技有限公司 Method and device for generating questioning material, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110705301A (en) Entity relationship extraction method and device, storage medium and electronic equipment
CN112287069B (en) Information retrieval method and device based on voice semantics and computer equipment
CN112988963B (en) User intention prediction method, device, equipment and medium based on multi-flow nodes
CN114780701A (en) Automatic question-answer matching method, device, computer equipment and storage medium
CN113627395B (en) Text recognition method, device, medium and electronic equipment
CN112395390B (en) Training corpus generation method of intention recognition model and related equipment thereof
CN113821622B (en) Answer retrieval method and device based on artificial intelligence, electronic equipment and medium
CN112836521A (en) Question-answer matching method and device, computer equipment and storage medium
CN113707299A (en) Auxiliary diagnosis method and device based on inquiry session and computer equipment
CN113947095A (en) Multilingual text translation method and device, computer equipment and storage medium
CN112906361A (en) Text data labeling method and device, electronic equipment and storage medium
CN115757731A (en) Dialogue question rewriting method, device, computer equipment and storage medium
CN112434746B (en) Pre-labeling method based on hierarchical migration learning and related equipment thereof
CN117112829B (en) Medical data cross-modal retrieval method and device and related equipment
CN116680580A (en) Information matching method and device based on multi-mode training, electronic equipment and medium
CN115346095A (en) Visual question answering method, device, equipment and storage medium
CN114358023A (en) Intelligent question-answer recall method and device, computer equipment and storage medium
CN113657104A (en) Text extraction method and device, computer equipment and storage medium
CN114637831A (en) Data query method based on semantic analysis and related equipment thereof
CN114610854A (en) Intelligent question and answer method, device, equipment and storage medium
CN112396111A (en) Text intention classification method and device, computer equipment and storage medium
CN112949320A (en) Sequence labeling method, device, equipment and medium based on conditional random field
CN111835861A (en) Examination system data processing method and device, computer equipment and storage medium
CN114372889A (en) Method and device for monitoring underwriting, computer equipment and storage medium
CN117235260A (en) Text labeling method, device, equipment and storage medium based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination