CN114443822B - Method, system and computing device for multimodal question-answering in the building field - Google Patents

Method, system and computing device for multimodal question-answering in the building field

Info

Publication number
CN114443822B
CN114443822B (application CN202111599500.8A)
Authority
CN
China
Prior art keywords
entity
user
question
vector
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111599500.8A
Other languages
Chinese (zh)
Other versions
CN114443822A (en)
Inventor
吴瑞萦
李直旭
陈志刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Iflytek Suzhou Technology Co Ltd
Original Assignee
Iflytek Suzhou Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Iflytek Suzhou Technology Co Ltd filed Critical Iflytek Suzhou Technology Co Ltd
Priority to CN202111599500.8A
Publication of CN114443822A
Application granted
Publication of CN114443822B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G06F16/3326 Reformulation based on results of preceding query using relevance feedback from the user, e.g. relevance feedback on documents, document sets, document terms or passages
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology

Abstract

Disclosed are a method, a system, and a computing device for multimodal question-answering in the construction field. The method comprises: determining, based on a user question and a building entity image acquired from a user, a target entity in a stored multi-modal ontology tree that corresponds to the user question; and screening the stored specifications in a specification set based on the user question and the target entity to determine a target specification corresponding to the user question. Through image-text interaction, the invention can acquire relevant information more accurately, and, based on a multi-layer attention mechanism, can accurately identify and extract the key information input by the user, thereby screening out the information the user really wants and returning it to the user.

Description

Method, system and computing device for multimodal question-answering in the building field
Technical Field
The present invention relates to the fields of image processing and natural language processing, and in particular to a method, system and computing device for multimodal question-answering in the construction field.
Background
With the vigorous development of artificial intelligence technology, question-answering systems (Question Answering System, QA for short) have found their way into every aspect of life; they can be seen in mobile phones, smart home appliances, and shopping malls. A question-answering system is a system that can respond to natural language questions posed by people. Such responses may be non-factual statements, as produced by a common chit-chat robot, or judgments of fact, as when asking a mall navigation robot how to reach a restaurant. Question-answering systems can be classified into "professional-field" and "general-field" systems according to their knowledge domain. A "professional-field" question-answering system focuses on resolving the knowledge of a specific field, such as construction, medicine, or sports. Such question-answering systems typically require expert knowledge in the field as technical support.
Most traditional question-answering systems use an answer-screening mechanism based on information retrieval. The main flow is to analyze the question posed by the user, find the keywords in the question, clarify the user's intention, and then match the most suitable document fragments or candidate answers from a related document or answer library to return to the user. However, as question-answering systems have continued to develop, users' questions are no longer limited to simple knowledge questions and answers, so the conventional answer-screening mechanism cannot directly and accurately hit the correct candidates. To address this, knowledge-graph-based question-answering methods have been proposed. A knowledge graph aggregates individual pieces of knowledge into a large knowledge base and stores them in a structured way, making the knowledge convenient for a computer to understand and compute.
However, whether based on retrieval or on a knowledge graph, the system must analyze the natural language questions posed by the user. This not only requires the user to describe the question accurately and concisely, but also requires the question-answering model to accurately understand the semantic information of the question. Yet natural language has many modes of expression, and some professional fields have a certain knowledge threshold, so users often cannot accurately express their own needs. For example, in the construction field, non-professionals may not know the specific name of an entity, or which attributes an entity possesses. Most existing question-answering systems focus on improving the accuracy of answer screening while ignoring factors such as the diversity of questions in real scenarios, information loss, and the knowledge domain. As a result, no matter how high the performance of the question-answering system, it may fail to find an answer that satisfies the user.
Accordingly, there is a need for a new multimodal question-answering method, system and computing device for the construction field to address the above problems.
Disclosure of Invention
This summary introduces a series of concepts in simplified form that are described in further detail in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
According to one embodiment of the present invention, there is provided a method for multimodal question-answering in the construction field, the method comprising: determining, based on a user question and a building entity image acquired from a user, a target entity in a stored multi-modal ontology tree that corresponds to the user question; and screening the stored specifications in a specification set based on the user question and the target entity to determine a target specification corresponding to the user question.
In one embodiment, determining the target entity in the stored multi-modal ontology tree corresponding to the user question based on the user question and the building entity image acquired from the user comprises: obtaining an image-text vector representation based on the question text of the user question and the building entity image, wherein the image-text vector representation characterizes both the question text and the building entity image; and comparing the image-text vector representation with the entity feature vectors of each top-level entity in the multi-modal ontology tree to determine a target entity feature vector of the target entity corresponding to the user question.
In one embodiment, obtaining the image-text vector representation based on the question text of the user question and the building entity image comprises: obtaining a text feature vector of the question text and an image feature vector of the building entity image based on the question text and the building entity image, respectively; and fusing the text feature vector and the image feature vector to obtain the image-text vector representation.
In one embodiment, determining the target specification corresponding to the user question by screening the stored specifications in the specification set based on the user question and the target entity comprises: screening the stored specifications in the specification set based on the user question and the target entity using a multi-layer attention mechanism, and determining the target specification corresponding to the user question.
In one embodiment, determining the target specification corresponding to the user question by screening the stored specifications in the specification set comprises: screening the stored specifications in the specification set with the target entity feature vector and the text feature vector, respectively, using a multi-layer attention mechanism to determine the target specification corresponding to the user question.
In one embodiment, the method further comprises: fusing the text feature vector and the image feature vector based on a co-attention mechanism to obtain the image-text vector representation.
In one embodiment, each top-level entity in the multi-modal ontology tree includes the name of the top-level entity, attribute information, and at least one image of the top-level entity.
In one embodiment, the entity feature vector of the top-level entity is obtained by fusing an image vector representation of the top-level entity, a structure vector representation, and a text vector representation, wherein the image vector representation is used to characterize an image of the top-level entity, the structure vector representation is used to characterize a path from a root node to the top-level entity in the multi-modal ontology tree, and the text vector representation is used to characterize name and attribute information of the top-level entity.
In one embodiment, wherein the image vector representation of the top-level entity is obtained by: extracting feature vectors of each image of the top-level entity; and calculating an image vector representation of the top-level entity from the feature vectors of the respective images based on the noise values.
In one embodiment, comparing the image-text vector representation with the entity feature vectors of each top-level entity in the stored multi-modal ontology tree to determine the target entity feature vector of the target entity corresponding to the user question comprises: calculating similarity scores between the image-text vector representation and the entity feature vectors of each top-level entity in the multi-modal ontology tree, and taking the entity feature vectors whose similarity scores are greater than or equal to a preset threshold similarity score as candidate entity feature vectors; and comparing the similarity scores of the candidate entity feature vectors, with the candidate entity feature vector having the maximum similarity score taken as the target entity feature vector.
In one embodiment, screening the specifications in the specification set with the target entity feature vector and the text feature vector, respectively, using a multi-layer attention mechanism to determine the target specification corresponding to the user question comprises: encoding each specification in the specification set to obtain a specification vector representation of the specification; combining the specification vector representation with the target entity feature vector using a first attention mechanism of the multi-layer attention mechanism to obtain a first intermediate vector; combining the first intermediate vector with the text feature vector using a second attention mechanism of the multi-layer attention mechanism to obtain a second intermediate vector; and obtaining, based on the second intermediate vector, a matching score of each specification in the specification set for the user question, and taking the specifications whose matching scores are higher than a preset matching score threshold as the target specification.
According to another embodiment of the present invention, there is provided a system for multimodal question-answering in the construction field, the system comprising: a processor configured to use one or more neural networks to: determine, based on a user question and a building entity image acquired from a user, a target entity in a stored multi-modal ontology tree that corresponds to the user question; and screen the stored specifications in a specification set based on the user question and the target entity to determine a target specification corresponding to the user question; and a memory for storing network parameters of the neural networks.
According to yet another embodiment of the present invention, a computing device is provided, comprising a memory and a processor, the memory having stored thereon a computer program which, when executed by the processor, causes the processor to perform the method as described above.
According to a further embodiment of the present invention, a computer readable medium is provided, having stored thereon a computer program which, when executed, performs the method as described above.
According to the multimodal question-answering method, system and computing device for the construction field provided by embodiments of the present invention, through image-text interaction, more information relevant to the queried entity can be acquired more accurately, and, based on a multi-layer attention mechanism, the key information input by the user can be accurately identified and extracted, so that the information the user really wants is screened out and returned to the user.
Drawings
The following drawings are included to provide an understanding of the invention and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention.
In the accompanying drawings:
FIG. 1 is a schematic block diagram of an electronic device implementing a method, system, and computing device for multimodal question-answering in the field of construction according to one embodiment of the present invention.
Fig. 2 is a flowchart of exemplary steps of a method for multimodal question-answering in the field of construction, according to one embodiment of the present invention.
FIG. 3 shows a schematic diagram of an exemplary multi-modal ontology tree in the field of construction according to one embodiment of the present invention.
Fig. 4 is a schematic block diagram of a system for multimodal question-answering in the field of construction according to one embodiment of the present invention.
FIG. 5 shows a schematic block diagram of a computing device, according to one embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, exemplary embodiments according to the present invention will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some, not all, embodiments of the present invention, and it should be understood that the present invention is not limited by the example embodiments described herein. Based on the embodiments of the invention described in the present application, all other embodiments obtained by a person skilled in the art without inventive effort shall fall within the scope of the invention.
As described above, existing question-answering systems do not consider problems such as the diversity of actual questions and missing information, making it difficult to find an answer that satisfies the user.
Therefore, in order to accurately screen out the information users need, the invention provides a multimodal question-answering method for the construction field, which comprises the following steps: determining, based on a user question and a building entity image acquired from a user, a target entity in a stored multi-modal ontology tree that corresponds to the user question; and screening the stored specifications in a specification set based on the user question and the target entity to determine a target specification corresponding to the user question.
According to the multimodal question-answering method for the construction field, through image-text interaction, more information relevant to the queried entity can be acquired more accurately, and, based on a multi-layer attention mechanism, the key information input by the user can be accurately identified and extracted, so that the information the user really wants is screened out and returned to the user.
The method, system and computing device for multimodal question-answering in the construction field according to the present invention are described in detail below in connection with specific embodiments.
First, an electronic device 100 for implementing a method, system and computing device for multimodal question-answering in the field of construction according to an embodiment of the present invention is described with reference to fig. 1.
In one embodiment, the electronic device 100 may be, for example, a notebook computer, a desktop computer, a tablet computer, a learning machine, a mobile device (such as a smartphone, a phone watch, etc.), an embedded computer, a tower server, a rack server, a blade server, or any other suitable electronic device.
In one embodiment, the electronic device 100 may include at least one processor 102 and at least one memory 104.
The memory 104 may be volatile memory, such as Random Access Memory (RAM), cache memory (cache), dynamic Random Access Memory (DRAM) (including stacked DRAM), or High Bandwidth Memory (HBM), etc., or nonvolatile memory, such as Read Only Memory (ROM), flash memory, 3D Xpoint, etc. In one embodiment, some portions of memory 104 may be volatile memory while other portions may be non-volatile memory (e.g., using a two-level memory hierarchy). The memory 104 is used to store a computer program that, when executed, is capable of performing client functions (implemented by a processor) and/or other desired functions in embodiments of the invention described below.
The processor 102 may be a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a microprocessor, or other processing unit having data processing capabilities and/or instruction execution capabilities. The processor 102 may be communicatively coupled to any suitable number or variety of components, peripheral devices, modules, or devices via a communication bus. In one embodiment, the communication bus may be implemented using any suitable protocol, such as Peripheral Component Interconnect (PCI), peripheral component interconnect express (PCIe), accelerated Graphics Port (AGP), hyperTransport, or any other bus or one or more point-to-point communication protocols.
The electronic device 100 may also include an input device 106 and an output device 108. The input device 106 is a device for receiving user input, and may include a keyboard, a mouse, a touch pad, a microphone, and the like. In addition, the input device 106 may be any interface that receives information. The output device 108 may output various information (e.g., images or sounds) to the outside (e.g., a user), which may include one or more of a display, speakers, etc. The output device 108 may be any other device having an output function, such as a printer.
An exemplary step flow diagram of a method 200 for multimodal question-answering in the field of construction according to one embodiment of the present invention is described below with reference to fig. 2.
As shown in fig. 2, a method 200 for multimodal question-answering in the field of construction may include the steps of:
In step S210, a target entity in the stored multi-modal ontology tree corresponding to the user question is determined based on the user question and the building entity image acquired from the user.
In step S220, the stored specifications in the specification set are screened based on the user question and the target entity to determine the target specification corresponding to the user question.
In one embodiment, the method 200 may be implemented using a trained neural network model.
In one embodiment, determining the target entity in the stored multi-modal ontology tree corresponding to the user question based on the user question and the building entity image acquired from the user in step S210 may include: obtaining an image-text vector representation based on the question text of the user question and the building entity image, wherein the image-text vector representation characterizes both the question text and the building entity image; and comparing the image-text vector representation with the entity feature vectors of each top-level entity in the multi-modal ontology tree to determine the target entity feature vector of the target entity corresponding to the user question.
In one embodiment, the step of obtaining the image-text vector representation based on the question text of the user question and the building entity image may include: obtaining a text feature vector of the question text and an image feature vector of the building entity image based on the question text and the building entity image, respectively; and fusing the text feature vector and the image feature vector to obtain the image-text vector representation.
In one embodiment, screening the stored specifications in the specification set based on the user question and the target entity in step S220 to determine the target specification corresponding to the user question may include: screening the stored specifications in the specification set based on the user question and the target entity using a multi-layer attention mechanism to determine the target specification corresponding to the user question.
In one embodiment, screening the stored specifications in the specification set based on the user question and the target entity in step S220 to determine the target specification corresponding to the user question may further include: screening the stored specifications in the specification set with the target entity feature vector and the text feature vector, respectively, using a multi-layer attention mechanism, and determining the target specification corresponding to the user question.
In one embodiment, the user question may be obtained by the user entering it via the question-answering system. The user question may be entered in the form of, for example, speech or text, as the invention is not limited in this regard. When a user inputs a question via speech, any suitable speech recognition model known in the art (e.g., a GMM-HMM (Gaussian mixture model / hidden Markov model) or a Seq2Seq model) may be used to convert the user's speech to text, thereby obtaining the question text of the user question.
In one embodiment, a building model diagram may be pre-stored in the question-answering system, and the building entity image the user wants to query may be obtained by the user selecting a certain building entity on the building model diagram.
In one embodiment, the question text may be encoded using any suitable neural network model (e.g., a pre-trained model) known in the art to obtain a text feature vector q for the question text. The pre-trained model may be any suitable model known in the art, such as BERT (Bidirectional Encoder Representations from Transformers), Word2Vec, or ELMo (Embeddings from Language Models), as the invention is not limited in this regard.
In one embodiment, the image feature vector p of the building entity image may be extracted using any suitable feature extraction model known in the art, such as Faster R-CNN, SPP-Net, DSSD, or YOLOv2, as the invention is not limited in this regard.
In one embodiment, fusing the text feature vector and the image feature vector may include concatenating or adding the two vectors, which is not limited by the present invention. In one embodiment, the text feature vector q and the image feature vector p may be fused based on a co-attention mechanism to obtain the image-text vector representation A. The process can be formulated as follows:
A = Co-Attention(q, p)
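The fusion A = Co-Attention(q, p) can be sketched in code. This is an illustrative toy implementation only: the patent does not fix the exact co-attention form, so the affinity matrix, softmax pooling, and concatenation used here are assumptions.

```python
import numpy as np

def _softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(q, p):
    """Toy co-attention fusion: q holds text token features (Lq, d),
    p holds image region features (Lp, d); returns one fused vector A of size 2d."""
    affinity = q @ p.T                                        # (Lq, Lp) text/image affinities
    q_ctx = (_softmax(affinity, axis=1) @ p).mean(axis=0)     # image-aware text summary, (d,)
    p_ctx = (_softmax(affinity.T, axis=1) @ q).mean(axis=0)   # text-aware image summary, (d,)
    return np.concatenate([q_ctx, p_ctx])                     # fused image-text representation A
```

In practice both attention directions would be learned; here they are parameter-free purely to show the data flow.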
in one embodiment, a multi-modal ontology tree may be pre-established and stored. Referring to fig. 3, fig. 3 shows a schematic diagram of an exemplary multi-modal ontology tree in the field of construction according to one embodiment of the present invention. As shown in fig. 3, the multi-modal ontology tree may include multi-level entities including root entities (e.g., building blocks), intermediate entities (e.g., walls, doors, beams, ladder beams, swing doors, shock walls, etc.), and top-level entities (e.g., two-leaf swing doors, one-leaf swing doors, etc.), wherein the top-level entities, also referred to as leaf entities, are the finest-grained entities in the multi-modal ontology tree. Wherein, inheritance relationship exists among entities of all levels, for example, the child entity 'evacuation door' and the vertical hinged door inherit the attribute of the parent entity 'door'. In one embodiment, each top-level entity and intermediate entity in the multi-modal ontology tree may include the name of the entity (e.g., evacuation door, swing door, structural beam, etc.) and attribute information (e.g., double swing door includes clear width, clear height, width on opening, etc.). Because images can help find an entity to be queried faster and more accurately, in one embodiment, the top-level entity also includes at least one image of the entity. Wherein the image may be crawled from the network, for example by crawler technology, or manually pre-stored, as the invention is not limited in this regard.
In one embodiment, the entity feature vector g_i of a top-level entity can be obtained by fusing the image vector representation p_i, the structure vector representation e_i, and the text vector representation t_i of the top-level entity.
The image vector representation p_i characterizes the images of the top-level entity. In the multi-modal ontology tree, most top-level entities include more than one image, so the feature information of multiple images needs to be fused to obtain the image vector representation of the top-level entity.
In one embodiment, the image vector representation p_i of a top-level entity can be obtained as follows: extracting the feature vectors of each image of the top-level entity; and computing the image vector representation of the entity from the feature vectors of the respective images based on noise values.
In one embodiment, the feature vectors of the respective images may be extracted using any suitable feature extraction model known in the art, such as, for example, the Faster-RCNN model, the Spp-Net model, the DSSD model, the YOLOv2 model, etc., as the invention is not limited in this regard.
In one embodiment, assuming that each top-level entity i contains n images, the image vector representation p_i of entity i can be expressed as follows:
p_i = ∑_{k=1}^{n} α_{ik} p_{ik}
where α_{ik} denotes the noise value of the k-th image of entity i, and p_{ik} is the feature vector of the k-th image of entity i.
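The weighted fusion above can be sketched as follows. Normalizing the noise values so they sum to one is an assumption for illustration, since the patent leaves the weighting scheme open.

```python
import numpy as np

def image_vector(feats, alphas):
    """p_i = sum_k alpha_ik * p_ik: fuse the n per-image feature vectors of a
    top-level entity using its per-image noise values as weights."""
    feats = np.asarray(feats, dtype=float)    # (n, d) feature vectors p_ik
    alphas = np.asarray(alphas, dtype=float)  # (n,) noise values alpha_ik
    alphas = alphas / alphas.sum()            # normalize (assumption)
    return alphas @ feats                     # weighted sum -> p_i, shape (d,)
```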
The structure vector representation e_i characterizes the path from the root node to the top-level entity in the multi-modal ontology tree. In one embodiment, the structure vector representation e_i can be obtained as follows:
construct an entity set S and define a path p = [s_1, s_2, ..., s_l] (p ∈ R^{l×2d}), where s_i ∈ S, l is the path length, and d is a preset hyper-parameter indicating the dimension of the feature vectors of path p. The structure vector representation e_i is then the sum of the word tokens after a residual attention operation, which can be formulated as follows:
e_i = ∑_{j=1}^{l} (softmax((pW_1)(pW_2)^T)(pW_3) + p)_j
where W_1, W_2 and W_3 are three transformation matrices.
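A plausible sketch of this residual attention computation, under the assumption of standard scaled dot-product attention with square transformation matrices (the source does not preserve the exact equation):

```python
import numpy as np

def structure_vector(path_emb, W1, W2, W3):
    """Residual self-attention over the path embeddings [s_1, ..., s_l],
    summed into a single structure vector e_i (illustrative sketch)."""
    Q, K, V = path_emb @ W1, path_emb @ W2, path_emb @ W3
    logits = Q @ K.T / np.sqrt(Q.shape[-1])            # scaled token affinities
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)              # row-wise softmax
    attended = w @ V + path_emb                        # residual connection
    return attended.sum(axis=0)                        # sum of tokens -> e_i
```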
The text vector representation t_i characterizes the name and attribute information of the top-level entity. In one embodiment, the text vector representation t_i can be obtained as follows: extracting word vectors for the words in the name and attributes of each top-level entity using a neural network model (for example, a pre-trained model such as BERT), and concatenating the word vectors to obtain the text vector representation t_i of the top-level entity.
In one embodiment, fusing the image vector representation p_i, the structure vector representation e_i, and the text vector representation t_i of the top-level entity may include concatenating or adding them, which is not limited by the present invention. For example, concatenating them can be formulated as follows:
g_i = [e_i, t_i, p_i]
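The concatenation fusion g_i = [e_i, t_i, p_i] is a one-liner:

```python
import numpy as np

def entity_feature_vector(e_i, t_i, p_i):
    """g_i = [e_i, t_i, p_i]: build the top-level entity's feature vector by
    concatenating its structure, text, and image representations."""
    return np.concatenate([e_i, t_i, p_i])
```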
in one embodiment, comparing the graphic vector representation with the entity feature vectors of each top-level entity in the stored multi-modal ontology tree to obtain a target entity feature vector of a target entity corresponding to the user problem may include the steps of:
step a: calculating entity characteristic vector g of graphic vector representation A and each top-level entity in multi-mode ontology tree i Is equal to or greater than a predetermined threshold similarity score, and is a physical feature vector g i As candidate entity feature vectors.
Step b: and comparing the similarity scores of the candidate entity feature vectors, and taking the candidate entity feature vector with the maximum similarity score as the target entity feature vector.
In one embodiment, steps a and b may be implemented using any suitable neural network model known in the art (e.g., fully connected layer FFNN and classification layer of a neural network), as the invention is not limited in this regard. The process can be formulated as follows:
score(i)=softmax(FFNN(A,G))
[Equation image not reproduced in the source: the entity feature vector whose similarity score score(i) is greater than or equal to τ is selected as the target entity feature vector g(i).]
wherein score(i) represents the binary similarity score of the entity feature vector of the i-th top-level entity, G represents the multi-modal ontology tree, g(i) represents the target entity feature vector, and τ represents the threshold similarity score.
In one embodiment, the threshold similarity score may be preset as desired, e.g., 0.5, 0.6, etc., as the invention is not limited in this regard.
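Steps a and b could be sketched as follows, with a plain dot product standing in for the FFNN-based scorer described in the text (an assumption made to keep the snippet self-contained):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def select_target_entity(A, G, tau=0.5):
    # step a: softmax similarity scores of the image-text representation A
    # against every entity feature vector g_i; candidates must reach tau
    scores = softmax(np.array([A @ g for g in G]))
    candidates = [i for i, s in enumerate(scores) if s >= tau]
    # step b: among the candidates, keep the one with the largest score
    return max(candidates, key=lambda i: scores[i]) if candidates else None

A = np.array([1.0, 0.0])
G = [np.array([10.0, 0.0]), np.array([0.0, 10.0]), np.array([1.0, 1.0])]
best = select_target_entity(A, G)   # index 0 dominates the softmax
```

Returning `None` when no score reaches the threshold is one way to signal that no entity in the tree matches the question.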
Since the multi-modal ontology tree contains all the information of each entity, while a user is often concerned with only some specific information rather than all of it, it is not enough to accurately find the specifications related to the entity; the specifications the user really wants must also be filtered out from them. Thus, in one embodiment, screening the specifications in the specification set with the target entity feature vector and the text feature vector, respectively, using a multi-layer attention mechanism to obtain the target specification corresponding to the user question may include the following steps:
Step c: encode each specification in the specification set to obtain a specification vector representation of each specification.
Step d: combine the specification vector representation with the target entity feature vector using the first attention mechanism of the multi-layer attention mechanism to obtain a first intermediate vector. This step filters the specifications mainly from the entity aspect.
Step e: combine the first intermediate vector with the text feature vector using the second attention mechanism of the multi-layer attention mechanism to obtain a second intermediate vector. This step filters the specifications mainly in terms of entity attributes.
Step f: obtain, based on the second intermediate vector, a matching score of each specification in the specification set for the user question, and take the specification whose matching score is higher than a preset matching score threshold as the target specification.
In one embodiment, each specification in the specification set D may be encoded using any suitable neural network model known in the art (e.g., a BERT model) to obtain a specification vector representation d of each specification. Assuming the length of the specification is k, the specification vector representation d can be formulated as: d = [d_1, d_2, ..., d_k].
In step d, combining the specification vector representation with the target entity feature vector using the first attention mechanism of the multi-layer attention mechanism to obtain the first intermediate vector can be formulated as follows:
h=Attention(d,g)
where h represents the first intermediate vector, g represents the target entity feature vector, and d represents the specification vector representation.
In step e, combining the first intermediate vector with the text feature vector using the second attention mechanism of the multi-layer attention mechanism to obtain the second intermediate vector can be formulated as follows:
o=Attention(h,q)
Where o represents a second intermediate vector and q represents a text feature vector of the question text of the user question.
Obtaining the matching score of each specification in the specification set for the user question based on the second intermediate vector in step f can be formulated as follows:
score=softmax(Wo+b)
where score is the binary matching score, W is a transformation matrix, and b is a bias term.
In one embodiment, the matching score threshold may be preset as needed, for example, 0.5, 0.6, etc., which is not limited by the present invention.
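Assuming a simple dot-product form for the Attention(·,·) operator (the text does not spell it out), steps d through f can be sketched as:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reweight(tokens, query):
    # Dot-product attention sketch: reweight the specification tokens by
    # their affinity to the query vector. The concrete form of
    # Attention(.,.) is an assumption here.
    w = softmax(tokens @ query)
    return w[:, None] * tokens

def match_score(d_tokens, g, q, W, b):
    h = reweight(d_tokens, g)      # step d: filter by entity vector g
    o = softmax(h @ q) @ h         # step e: pool by question text vector q
    return softmax(W @ o + b)      # step f: binary score softmax(Wo + b)

rng = np.random.default_rng(1)
d_tokens = rng.normal(size=(5, 8))  # one specification of length k = 5
g, q = rng.normal(size=8), rng.normal(size=8)
W, b = rng.normal(size=(2, 8)), rng.normal(size=2)
score = match_score(d_tokens, g, q, W, b)  # two probabilities summing to 1
```

A specification would then be kept as a target specification when its positive-class probability exceeds the preset matching score threshold.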
In one embodiment, after the target specification is obtained, the target specification may be presented to the user in any suitable form, such as by text, voice, video, combinations thereof, or the like, as the invention is not limited in this regard.
According to the multimodal question-answering method for the building field provided by the present invention, more relevant information about the queried entity can be obtained more accurately through the image-text interaction mode, and the key information input by the user can be accurately identified and extracted based on the multi-layer attention mechanism, so that the information the user really wants is screened out accordingly and returned to the user.
In yet another embodiment, the present invention provides a system for multimodal question-answering in the field of construction. A schematic block diagram of a system 400 for multimodal question-answering in the field of construction according to one embodiment of the present invention is described below with reference to fig. 4. As shown in fig. 4, a system 400 for multimodal question-answering in the building field may include a processor 410 and a memory 420.
The processor 410 is configured to use one or more trained neural networks to implement the following processing steps: determining, based on a user question and a building entity diagram obtained from the user, a target entity corresponding to the user question in a stored multi-modal ontology tree; and screening the specifications in a stored specification set based on the user question and the target entity to determine a target specification corresponding to the user question.
Illustratively, the processor 410 may be any processing device known in the art, such as, for example, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a microprocessor, a microcontroller, a Field Programmable Gate Array (FPGA), etc., as the invention is not limited in this regard.
The memory 420 is used to store the network parameters of the one or more neural networks. Memory 420 may be, for example, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by processor 410.
The system 400 for multimodal questioning and answering in the construction field according to the embodiment of the present invention may implement the method 200 for multimodal questioning and answering in the construction field according to the embodiment of the present invention described above. Those skilled in the art can understand the specific operation of the system 400 for multimodal question-answering in the construction field according to the embodiment of the present invention in combination with the foregoing, and for brevity, will not be described herein again.
According to the multimodal question-answering system for the building field provided by the present invention, more relevant information about the queried entity can be obtained more accurately through the image-text interaction mode, and the key information input by the user can be accurately identified and extracted based on the multi-layer attention mechanism, so that the information the user really wants is screened out accordingly and returned to the user.
In yet another embodiment, the present invention provides a computing device. Referring to fig. 5, fig. 5 shows a schematic block diagram of a computing device 500 according to one embodiment of the invention. As shown in fig. 5, computing device 500 may include a memory 510 and a processor 520, wherein memory 510 has stored thereon a computer program that, when executed by the processor 520, causes the processor 520 to perform the method 200 for multimodal question-answering for the building area as described above.
Those skilled in the art will understand the specific operation of the computing device 500 according to embodiments of the present invention in conjunction with the foregoing description; for brevity, only the main operations of the processor 520 are described here: determining, based on a user question and a building entity diagram obtained from the user, a target entity corresponding to the user question in a stored multi-modal ontology tree; and screening the specifications in a stored specification set based on the user question and the target entity to determine a target specification corresponding to the user question.
The computing device 500 according to embodiments of the present invention may implement the method 200 for multimodal question-answering in the building field according to embodiments of the present invention described previously. Those skilled in the art will appreciate the specific operation of the computing device 500 according to embodiments of the present invention in conjunction with the foregoing description, and for brevity, will not be described in detail herein.
According to the computing device provided by the present invention, more relevant information about the queried entity can be obtained more accurately through the image-text interaction mode, and the key information input by the user can be accurately identified and extracted based on the multi-layer attention mechanism, so that the information the user really wants is screened out accordingly and returned to the user.
In yet another embodiment, the present invention provides a computer readable medium having a computer program stored thereon, which when executed performs the method 200 for multimodal question-answering in the building field as described in the above embodiments. Any tangible, non-transitory computer readable medium may be used, including magnetic storage devices (hard disks, floppy disks, etc.), optical storage devices (CD-ROMs, DVDs, blu-ray discs, etc.), flash memory, and/or the like. These computer program instructions may be loaded onto a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions which execute on the computer or other programmable data processing apparatus create means for implementing the functions specified. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including means which implement the function specified. The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified.
Although the illustrative embodiments have been described herein with reference to the accompanying drawings, it is to be understood that the above illustrative embodiments are merely illustrative and are not intended to limit the scope of the present invention thereto. Various changes and modifications may be made therein by one of ordinary skill in the art without departing from the scope and spirit of the invention. All such changes and modifications are intended to be included within the scope of the present invention as set forth in the appended claims.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in order to streamline the invention and aid in understanding one or more of the various inventive aspects, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the description of exemplary embodiments of the invention. However, the method of the present invention should not be construed as reflecting the following intent: i.e., the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
It will be understood by those skilled in the art that all of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be combined in any combination, except combinations where the features are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. do not denote any order. These words may be interpreted as names.
The foregoing description is merely illustrative of specific embodiments of the present invention and the scope of the present invention is not limited thereto, and any person skilled in the art can easily think about variations or substitutions within the scope of the present invention. The protection scope of the invention is subject to the protection scope of the claims.

Claims (12)

1. A method for multimodal question-answering in the field of construction, the method comprising:
determining a target entity corresponding to the user problem in a stored multi-modal ontology tree based on the user problem and a building entity graph acquired from a user, wherein the building entity graph is acquired by selecting a certain building entity on a pre-stored building model graph by the user, the multi-modal ontology tree comprises a root entity, an intermediate entity and a top-level entity, the top-level entity comprises a name, attribute information and at least one image of the top-level entity, wherein the top-level entity is characterized by an entity feature vector, the entity feature vector is obtained by fusing an image vector representation, a structure vector representation and a text vector representation of the top-level entity, wherein the image vector representation is used for characterizing an image of the top-level entity, the structure vector representation is used for characterizing a path from the root entity to the top-level entity in the multi-modal ontology tree, and the text vector representation is used for characterizing the name and attribute information of the top-level entity; and
And screening the specifications in the stored specification set based on the user problem and the target entity, and determining a target specification corresponding to the user problem.
2. The method of claim 1, wherein determining a target entity in a stored multi-modal ontology tree corresponding to a user problem based on the user problem and a building entity map obtained from a user, comprises:
obtaining a graphic vector representation based on the question text of the user question and the building entity graph, wherein the graphic vector representation is used for representing both the question text and the building entity graph; and
and comparing the graphic vector representation with the entity feature vectors of all top-level entities in the multi-mode ontology tree to determine a target entity feature vector of a target entity corresponding to the user problem.
3. The method of claim 2, wherein deriving an image-text vector representation based on the question text of the user question and the building entity map comprises:
based on the question text and the building entity diagram respectively, obtaining a text feature vector of the question text and an image feature vector of the building entity diagram; and
And fusing the text feature vector and the image feature vector to obtain the image-text vector representation.
4. The method of claim 1, wherein determining a target specification corresponding to the user question based on the user question and the target entity screening the stored specification of the set of specifications comprises:
and screening the stored specifications in the specification set based on the user problem and the target entity by utilizing a multi-layer attention mechanism, and determining a target specification corresponding to the user problem.
5. The method of claim 3, wherein determining a target specification corresponding to the user question based on the user question and the target entity screening the stored specification of the set of specifications comprises:
and screening the stored specifications in the specification set by using the target entity feature vector and the text feature vector by using a multi-layer attention mechanism respectively to determine a target specification corresponding to the user problem.
6. A method as claimed in claim 3, wherein the method further comprises: and fusing the text feature vector and the image feature vector based on a cooperative attention mechanism to obtain the image-text vector representation.
7. The method of claim 1, wherein the image vector representation of the top-level entity is obtained by:
extracting feature vectors of each image of the top-level entity; and
an image vector representation of the top-level entity is calculated from the feature vectors of the respective images based on the noise values.
8. The method of claim 2, wherein comparing the image-text vector representation with the entity feature vectors of each top-level entity in a stored multi-modal ontology tree to determine a target entity feature vector for a target entity corresponding to the user question, comprises:
calculating similarity scores of the image-text vector representation and the entity feature vectors of all top-level entities in the multi-mode ontology tree, and taking the entity feature vectors with the similarity scores being greater than or equal to a preset threshold similarity score as candidate entity feature vectors; and
and comparing the similarity scores of the candidate entity feature vectors, and taking the candidate entity feature vector with the maximum similarity score as the target entity feature vector.
9. The method of claim 5, wherein using a multi-layer attention mechanism to filter the specification in the specification set with the target entity feature vector and the text feature vector, respectively, to determine a target specification corresponding to the user question, comprises:
Encoding each specification in the specification set to obtain a specification vector representation of each specification;
combining the canonical vector representation with the target entity feature vector by using a first attention mechanism in the multi-layer attention mechanism to obtain a first intermediate vector;
combining the first intermediate vector with the text feature vector by using a second attention mechanism in the multi-layer attention mechanism to obtain a second intermediate vector; and
and obtaining a matching score of each rule in the rule set for the user problem based on the second intermediate vector, and taking a rule corresponding to the matching score higher than a preset matching score threshold as the target rule.
10. A system for multimodal question-answering in the field of construction, the system comprising:
a processor for using one or more neural networks to:
determining a target entity corresponding to the user problem in a stored multi-modal ontology tree based on the user problem and a building entity graph acquired from a user, wherein the building entity graph is acquired by selecting a certain building entity on a pre-stored building model graph by the user, the multi-modal ontology tree comprises a root entity, an intermediate entity and a top-level entity, the top-level entity comprises a name, attribute information and at least one image of the top-level entity, wherein the top-level entity is characterized by an entity feature vector, the entity feature vector is obtained by fusing an image vector representation, a structure vector representation and a text vector representation of the top-level entity, wherein the image vector representation is used for characterizing an image of the top-level entity, the structure vector representation is used for characterizing a path from the root entity to the top-level entity in the multi-modal ontology tree, and the text vector representation is used for characterizing the name and attribute information of the top-level entity;
Screening the stored specifications in the specification set based on the user question and the target entity, determining a target specification corresponding to the user question, and
and the memory is used for storing network parameters of the neural network.
11. A computing device comprising a memory and a processor, the memory having stored thereon a computer program which, when executed by the processor, causes the processor to perform the method of any of claims 1-9.
12. A computer readable medium, characterized in that it has stored thereon a computer program which, when executed, performs the method according to any of claims 1-9.
CN202111599500.8A 2021-12-24 2021-12-24 Method, system and computing device for multimodal question-answering in the building field Active CN114443822B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111599500.8A CN114443822B (en) 2021-12-24 2021-12-24 Method, system and computing device for multimodal question-answering in the building field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111599500.8A CN114443822B (en) 2021-12-24 2021-12-24 Method, system and computing device for multimodal question-answering in the building field

Publications (2)

Publication Number Publication Date
CN114443822A CN114443822A (en) 2022-05-06
CN114443822B true CN114443822B (en) 2023-05-26

Family

ID=81363416

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111599500.8A Active CN114443822B (en) 2021-12-24 2021-12-24 Method, system and computing device for multimodal question-answering in the building field

Country Status (1)

Country Link
CN (1) CN114443822B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680173A (en) * 2020-05-31 2020-09-18 西南电子技术研究所(中国电子科技集团公司第十研究所) CMR model for uniformly retrieving cross-media information
CN111860653A (en) * 2020-07-22 2020-10-30 苏州浪潮智能科技有限公司 Visual question answering method and device, electronic equipment and storage medium
CN113435203A (en) * 2021-08-30 2021-09-24 华南师范大学 Multi-modal named entity recognition method and device and electronic equipment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10095689B2 (en) * 2014-12-29 2018-10-09 International Business Machines Corporation Automated ontology building
US11379738B2 (en) * 2019-09-18 2022-07-05 International Business Machines Corporation Using higher order actions to annotate a syntax tree with real data for concepts used to generate an answer to a question
CN110895561B (en) * 2019-11-13 2022-04-01 中国科学院自动化研究所 Medical question and answer retrieval method, system and device based on multi-mode knowledge perception
CN111767368B (en) * 2020-05-27 2022-08-23 重庆邮电大学 Question-answer knowledge graph construction method based on entity link and storage medium
CN112836120B (en) * 2021-01-27 2024-03-22 深圳大学 Movie recommendation method, system and terminal based on multi-mode knowledge graph
CN113297370B (en) * 2021-07-27 2021-11-16 国网电子商务有限公司 End-to-end multi-modal question-answering method and system based on multi-interaction attention


Also Published As

Publication number Publication date
CN114443822A (en) 2022-05-06

Similar Documents

Publication Publication Date Title
CN106328147B (en) Speech recognition method and device
US11017178B2 (en) Methods, devices, and systems for constructing intelligent knowledge base
US10176804B2 (en) Analyzing textual data
US20170169008A1 (en) Method and electronic device for sentiment classification
TW202009749A (en) Human-machine dialog method, device, electronic apparatus and computer readable medium
US20150243279A1 (en) Systems and methods for recommending responses
US20210034981A1 (en) Method and apparatus for training image caption model, and storage medium
WO2016092406A1 (en) Inferred facts discovered through knowledge graph derived contextual overlays
WO2021103712A1 (en) Neural network-based voice keyword detection method and device, and system
WO2024045444A1 (en) Processing method and apparatus for visual question answering task, and device and non-volatile readable storage medium
JP6729095B2 (en) Information processing device and program
CN112861522B (en) Aspect-level emotion analysis method, system and model based on dual-attention mechanism
CN111798279A (en) Dialog-based user portrait generation method and apparatus
JP2016201112A (en) Web page processing device and web page processing method
CN111090771A (en) Song searching method and device and computer storage medium
CN112581327A (en) Knowledge graph-based law recommendation method and device and electronic equipment
Sun et al. Topic shift detection in online discussions using structural context
CN112307364B (en) Character representation-oriented news text place extraction method
CN114443822B (en) Method, system and computing device for multimodal question-answering in the building field
CN110895656A (en) Text similarity calculation method and device, electronic equipment and storage medium
CN108197100B (en) Emotion analysis method and device, computer readable storage medium and electronic equipment
CN115905575A (en) Semantic knowledge graph construction method, electronic equipment and storage medium
CN113742445B (en) Text recognition sample obtaining method and device and text recognition method and device
CN114330319A (en) Entity processing method, entity processing device, electronic equipment and storage medium
CN110457455B (en) Ternary logic question-answer consultation optimization method, system, medium and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant