CN114265924A - Method and device for retrieving associated table according to question - Google Patents

Method and device for retrieving associated table according to question

Info

Publication number
CN114265924A
CN114265924A (application CN202111586986.1A)
Authority
CN
China
Prior art keywords
question
frequency
vector
title
participle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111586986.1A
Other languages
Chinese (zh)
Inventor
刘星光
程振波
肖刚
孟航程
李琴
孙力
张皓鑫
王亚明
徐雪松
陆佳炜
张元鸣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202111586986.1A priority Critical patent/CN114265924A/en
Publication of CN114265924A publication Critical patent/CN114265924A/en
Pending legal-status Critical Current

Abstract

Aimed at the table-retrieval component of a table question-answering system, a method and a device are provided for retrieving the table most relevant to a question. The method comprises the following steps: according to the question and the table, respectively calculating the frequency with which words in the question occur in the table and the frequency with which words in the table occur in the question; obtaining word-embedding vector representations of the question and the table; fusing the word-embedding representations of the question and the table with the calculated frequencies to obtain a fused vector of the question and the table; and finally using the fused vector to calculate the similarity between the question and the table.

Description

Method and device for retrieving associated table according to question
Technical Field
The application relates to the fields of computer natural-language question answering and information retrieval, and more particularly to a method of retrieving the table most relevant to a question described in natural language.
Background
To represent information more precisely, tables are often used to organize it. A table generally comprises a title, a header, and contents. The title is a sentence describing the function of the table. The header represents the attributes of the stored content and is typically described by short phrases. The contents are the information stored in the table, i.e., instantiations of the header attributes, often expressed as numbers or short phrases. Tables appear in many kinds of documents and reports, and professional documents, such as design standards in the construction and machinery fields, often include many tables. When constructing a question-answering system for professional documents, it is often necessary to obtain the answer s of an input question Q from a table. To do so, one must first determine, from a set of tables T = [T_1, T_2, …, T_j], the table T_j most closely related to the question Q, and then determine the answer s of the question from that table, where j denotes the number of tables and s ∈ T_j. Retrieving the table most relevant to a question Q is therefore one of the most important steps in the design of a table-retrieval-based question-answering system. Existing methods, such as those of publications CN109670028A, CN110737671A and CN107203528B, implement retrieval by keyword matching. Retrieval of this type simply treats each keyword as an independent object, and loses the semantic information of the sentence and of the table contents associated with the keyword.
Therefore, the invention proposes a new representation that semantically fuses the question Q with the table title, header, and contents, thereby improving the accuracy of retrieving the associated table from the question.
Disclosure of Invention
The invention provides a method and a device for retrieving an associated table from a question, aimed at the table-retrieval problem in table question-answering systems, so as to improve the accuracy of matching a question to its associated table.
The general flow of the invention comprises: calculating a pre-attention vector between a question and a table; obtaining word-embedding vector representations of the question and the table; fusing the word-embedding representations of the question and the table with the pre-attention vector to obtain a fused vector representation of the question and the table; and calculating the similarity between the question and the table from the fused vector.
A method for retrieving an association table based on a question, comprising the steps of:
Step 1: respectively calculate the frequency with which words in the question occur in the table and the frequency with which words in the table occur in the question, and record the calculated frequencies as the pre-attention vector;
Step 2: obtain word-embedding representation vectors of the question and the table;
Step 3: fuse the word-embedding representation vectors of the question and the table with the calculated frequencies to obtain a fused vector representation of the question and the table;
Step 4: calculate the similarity between the question and the table from the fused vector.
Preferably, calculating the pre-attention vector between the question and the table in step 1 specifically comprises:
Step 2.1: a question is denoted by the symbol Q. First, the question is segmented into participles to obtain Q = {q_1, q_2, …, q_n}, where q_n denotes each participle in the question and n denotes the number of participles in the question.
Step 2.2: construct the feature information of the table, which comprises the title and the header of the table and is denoted T_j, where j denotes the index of the table. The title of the table is denoted title_j and is segmented into participles title_j = {t_1, t_2, …, t_m}, where t_m denotes each participle in the title and m denotes the number of participles in the title. The header of the table is denoted head_j = {s_1, s_2, …, s_k}, where s_k denotes each attribute participle in the header and k denotes the number of attribute participles in the header. Finally, the title and the header information are concatenated to obtain the table feature information T_j = {t_1, t_2, …, t_m, s_1, s_2, …, s_k}.
Step 2.3: calculate the pre-attention vector between the question Q and the table. A function Match(x, y) is defined to compute the frequency of occurrence of a participle x in a text y: the function returns 1 when the text y contains the participle x and 0 otherwise, and it returns 0 when x is a common stop word. Using another method to compute the frequency of occurrence of participles in text does not affect the result of the invention. The specific method of computing the pre-attention vector is as follows:
2.3.1) First calculate the frequency with which each participle q_n of Q appears in the table title title_j. According to the definition of Match(x, y), the frequencies of the participles of Q in title_j are expressed as:
F_Q^title = [Match(q_1, title_j), Match(q_2, title_j), …, Match(q_n, title_j)]  (1)
In addition, the frequency with which each participle t_m of title_j appears in Q is calculated:
F_title^Q = [Match(t_1, Q), Match(t_2, Q), …, Match(t_m, Q)]  (2)
2.3.2) Calculate the frequency with which each participle q_n of Q appears in the table header head_j:
F_Q^head = [Match(q_1, head_j), Match(q_2, head_j), …, Match(q_n, head_j)]  (3)
Also calculate the frequency with which the participles of Q appear in the table contents corresponding to head_j. The table contents corresponding to head_j are denoted C_j; the contents of all the headers of the table are denoted C_j = {c_1, c_2, …, c_k}, where c_k denotes the content of each cell and k denotes the number of cells. The frequencies of the participles of Q in C_j are expressed as:
F_Q^C = [Match(q_1, C_j), Match(q_2, C_j), …, Match(q_n, C_j)]  (4)
The frequency with which each participle q_n of Q appears in head_j is finally expressed as the element-wise maximum of (3) and (4):
F_Q^head' = max(F_Q^head, F_Q^C)  (5)
Further, the frequencies with which the attribute participles of head_j appear in Q are obtained:
F_head^Q = [Match(s_1, Q), Match(s_2, Q), …, Match(s_k, Q)]  (6)
2.3.3) From the frequencies computed in steps 2.3.1) and 2.3.2), namely F_Q^title, the frequency with which the participles of question Q appear in the table title, and F_Q^head', the frequency with which they appear in the table header, the frequency with which the participles of Q appear in the whole table is obtained as the element-wise maximum:
F_Q^T = max(F_Q^title, F_Q^head')  (7)
2.3.4) The pre-attention vector between the question Q and the table, obtained by the above steps, is the concatenation of the frequency with which each participle of the question appears in the table and the frequencies with which the title participles and header attribute participles of the table appear in the question; it is denoted M_ij:
M_ij = [F_Q^T ; F_title^Q ; F_head^Q]
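As a minimal sketch of step 2.3, with an illustrative stop-word list and toy question/table data (neither is specified by the invention), the Match function and the concatenated pre-attention vector can be written as:

```python
# Hedged sketch of Match(x, y) and the pre-attention vector M_ij of step 2.3.
# The stop-word list and the toy question/table below are illustrative only.
STOP_WORDS = {"of", "what"}

def match(x, y) -> int:
    """Return 1 if the text/token collection y contains participle x, else 0;
    stop words always score 0."""
    if x in STOP_WORDS:
        return 0
    return int(x in y)

def pre_attention(q, title, head, contents):
    """Concatenate the frequency vectors of eqs. (1)-(7) into M_ij."""
    # eq (7): each question token, matched against title, header, or cell contents
    f_q_table = [max(match(x, title), match(x, head), match(x, contents)) for x in q]
    f_title_q = [match(t, q) for t in title]   # eq (2): title tokens in the question
    f_head_q = [match(s, q) for s in head]     # eq (6): header tokens in the question
    return f_q_table + f_title_q + f_head_q    # length n + m + k

M = pre_attention(q=["alpha", "beta", "of"],
                  title=["alpha"],
                  head=["gamma"],
                  contents=["beta"])
# M = [1, 1, 0] + [1] + [0]
```

The length n + m + k of M matches, token for token, the length of the embedded sequence [Q : T_j] used in step 2, which is what makes the element-wise fusion of step 3 possible.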
Preferably, obtaining the word-embedding representation vectors of the question and the table in step 2 specifically comprises:
The question Q and the table feature information T_j are directly concatenated and represented as [Q : T_j] = [q_1, q_2, …, q_n, t_1, t_2, …, t_m, s_1, s_2, …, s_k]. Then, using a generic word-embedding model, [Q : T_j] is represented as a vector sequence Z_ij of length n + m + k, where the embedding length l of each participle is determined by the word-embedding model (the size of l does not affect the result of the invention). The self-attention score of Z_ij is then obtained through an existing self-attention mechanism and denoted A_ij. The attention mechanism can quickly extract the important features of sparse data and is widely used in natural-language-processing tasks; the self-attention mechanism is an improvement of the attention mechanism that reduces the reliance on external information and is better at capturing the internal correlations of data or features.
Preferably, fusing the word-embedding representation vectors of the question and the table with the calculated frequencies in step 3 to obtain the fused vector representation of the question and the table specifically comprises:
The pre-attention vector M_ij and the self-attention score A_ij obtained in step 2 are fused according to the following formula:
A_ij * M_ij + M_ij  (8)
The fused vector representation obtained is denoted M'_ij:
M'_ij = A_ij * M_ij + M_ij  (9)
Preferably, calculating the similarity between the question and the table from the fused vector in step 4 specifically comprises:
The similarity is calculated using the existing sigmoid activation function, which maps a real number x to the interval (0, 1) and can be used for classification. The sigmoid function is defined as:
S(x) = 1 / (1 + e^(-x))  (10)
The sigmoid function maps the fused vector to the interval (0, 1) to represent the degree of similarity between the question and the table: a mapped value close to 1 indicates that the question and the table are very similar, and a value close to 0 indicates that they are not similar.
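A sketch of step 4, under the assumption that the fused vector is pooled into a single logit by summation before the sigmoid (the text fixes the sigmoid but not the pooling):

```python
import math

# Sigmoid of formula (10) applied to a pooled fused vector; summing the fused
# vector into one logit is an assumption of this sketch, not fixed by the text.
def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def similarity(fused_vector) -> float:
    return sigmoid(sum(fused_vector))

score = similarity([1.3, 1.5, 0.0, 0.0, 1.4])  # large positive logit -> near 1
```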
The invention also relates to a device for retrieving an associated table from a question, comprising: one or more processors; and a storage device for storing one or more programs. When the one or more programs are executed by the one or more processors, the one or more processors implement the method described above; the processor implements the steps of the method of the invention when executing the computer program.
The advantage of the invention is that the method and device for retrieving an associated table from a question make full use of the association information between the question and the table. They overcome the problem of existing keyword-based table-retrieval methods, which treat each keyword as an independent object and so lose the semantic information of the sentence and of the table contents associated with the keyword, and thereby improve the accuracy of matching a question to its associated table.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a fused self attention score and pre-attention vector flow diagram of the method of the present invention;
FIG. 3 is a flow chart of a fusion vector representation of a question and a table of the method of the present invention;
FIG. 4 is a schematic of the process of the present invention;
FIG. 5 is an example of a process for performing pre-attention vector calculation between a natural question and a form according to the present invention;
FIG. 6 is a schematic diagram of the calculation and fusion of self-attention scores and pre-attention vectors according to the present invention;
fig. 7 is a block diagram of the apparatus of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The invention uses the international ASME Boiler and Pressure Vessel Code as the reference document and builds a training data set from its design-standard tables: the design-by-analysis tables of its fifth chapter were extracted, and mechanical-engineering professionals were asked to pose questions against these tables (using other documents as reference data does not affect the result of the invention). A question posed against its corresponding table is treated as a positive sample; 1500 positive samples were created, and in the experiment each question was additionally assigned 2 randomly chosen tables as negative samples, yielding a training data set of 4500 examples in total. The experimental task matches each question against all tables, and the method described herein matches the most relevant table to the question. The specific experimental procedure is as follows:
the structures, proportions, and dimensions shown in the drawings and detailed description are for understanding and reading only, and are not intended to limit the scope of the invention, which is defined in the claims, but are not necessarily essential to the skilled in the art, and any structural modifications, changes in proportions, or adjustments in size, without affecting the efficacy and attainment of the same, are intended to be included within the scope of the invention. The specific values given for the parameters in the present invention are only for clarity of description, and are not intended to limit the scope of the invention, and the changes or adjustments of the relative relationship should be considered as the scope of the invention without substantial technical changes.
A method for retrieving an association table based on a question, comprising the steps of:
1. Calculate the pre-attention vector M_ij between a question Q and a table, as shown in FIG. 2. Specifically:
Step 1.1: as shown at reference numeral S41 in FIG. 4, the question is segmented to obtain the participle set Q = {q_1, q_2, …, q_n} of the question Q, where q_n denotes each participle and n denotes the number of participles in the question.
Step 1.2: as shown at reference numeral S42 in FIG. 4, the title participles of the table are denoted title_j = {t_1, t_2, …, t_m} and the header attribute participles of the table are denoted head_j = {s_1, s_2, …, s_k}. The title and head_j of the table are concatenated to obtain the table feature information T_j = {t_1, t_2, …, t_m, s_1, s_2, …, s_k}.
Step 1.3: from the Q and T_j obtained in steps 1.1 and 1.2, the pre-attention vector between the question and the table is calculated. The invention defines a function Match(x, y) that computes the frequency of occurrence of a participle x in a text y: the function returns 1 when the text y contains the participle x and 0 otherwise, and it returns 0 when x is a common stop word.
1.3.1) As shown at reference numeral S43 in FIG. 4, the frequencies of occurrence of the participles between title_j and Q are calculated in both directions. According to the definition of Match(x, y), the frequencies of the participles of Q in title_j are expressed as:
F_Q^title = [Match(q_1, title_j), Match(q_2, title_j), …, Match(q_n, title_j)]  (1)
Likewise, the frequencies of the participles t_m of title_j in Q are obtained:
F_title^Q = [Match(t_1, Q), Match(t_2, Q), …, Match(t_m, Q)]  (2)
1.3.2) As shown at reference numeral S44 in FIG. 4, the frequencies of occurrence of the participles between Q and the table header head_j are calculated. Using the function Match(x, y) defined in step 1.3, the frequencies of the participles of Q in head_j are first calculated:
F_Q^head = [Match(q_1, head_j), Match(q_2, head_j), …, Match(q_n, head_j)]  (3)
Considering that some information in the question may appear in the table contents corresponding to the header, the invention also calculates the frequency with which the participles of Q appear in the contents corresponding to head_j. The table contents corresponding to head_j are denoted C_j; the contents of all the headers of the table are denoted C_j = {c_1, c_2, …, c_k}, where c_k denotes the content of each cell and k denotes the number of cells. The frequencies of the participles of Q in C_j are expressed as:
F_Q^C = [Match(q_1, C_j), Match(q_2, C_j), …, Match(q_n, C_j)]  (4)
The frequency with which each participle q_n of Q appears in head_j is finally expressed as the element-wise maximum of (3) and (4):
F_Q^head' = max(F_Q^head, F_Q^C)  (5)
In the same way, the frequencies of the attribute participles of head_j in Q are obtained:
F_head^Q = [Match(s_1, Q), Match(s_2, Q), …, Match(s_k, Q)]  (6)
1.3.3) From F_Q^title, the frequency with which the participles of question Q appear in the table title, computed in 1.3.1, and F_Q^head', the frequency with which they appear in the table header, computed in 1.3.2, the frequency with which the participles of Q appear in the whole table is obtained as the element-wise maximum:
F_Q^T = max(F_Q^title, F_Q^head')  (7)
1.3.4) As shown at reference numeral S45 in FIG. 4, the pre-attention vector between the question Q and the table is expressed as the concatenation of the frequency with which each participle of the question appears in the table and the frequencies with which the title participles and header attribute participles of the table appear in the question, and is denoted M_ij:
M_ij = [F_Q^T ; F_title^Q ; F_head^Q]
2. Obtain the word-embedding representation vectors of the question and the table, as shown in FIG. 3. Specifically:
As shown at S46 in FIG. 4, the question Q and the table feature information T_j described above are directly concatenated and represented as [Q : T_j] = [q_1, q_2, …, q_n, t_1, t_2, …, t_m, s_1, s_2, …, s_k]. A generic word-embedding model is then used, such as the word2vec word-vector representation proposed by T. Mikolov et al. in "Efficient Estimation of Word Representations in Vector Space", or the BERT model proposed by J. Devlin et al. in "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", either of which can vectorize a text or word; using any word-embedding representation method here does not affect the result of the invention. [Q : T_j] is represented as a vector sequence Z_ij of length n + m + k, where the embedding length l of each participle is determined by the word-embedding model (the size of l does not affect the result of the invention). The self-attention score of Z_ij is then obtained by an existing self-attention mechanism and denoted A_ij.
3. As shown at reference numeral S47 in FIG. 4, the self-attention feature vector and the pre-attention vector of the question and the table are fused to obtain the fused vector representation of the question and the table:
The pre-attention vector M_ij and the self-attention score A_ij are fused according to the following formula:
A_ij * M_ij + M_ij  (8)
The fused vector representation obtained is denoted M'_ij:
M'_ij = A_ij * M_ij + M_ij  (9)
4. Calculate the similarity between the question and the table from the fused vector:
The invention uses the existing sigmoid activation function to map the fused vector to the interval (0, 1) to represent the degree of similarity between the question and the table; a mapped value close to 1 indicates similarity and a value close to 0 indicates dissimilarity.
A device for retrieving an associated table from a question comprises: one or more processors; and a storage device for storing one or more programs. When the one or more programs are executed by the one or more processors, the one or more processors implement the method described above; the processor implements the method when executing the computer program.
Examples of the invention are as follows:
1. The experiment takes as the analysis target a table titled "Examples of stress classification" and poses the natural question "What classification does a concave head whose stress type is membrane stress belong to?". As shown at reference numeral S51 in FIG. 5, the natural question is segmented; the experiment uses jieba, a Chinese word-segmentation tool commonly used in natural-language processing, to obtain the participle set Q of the question:
Q = {stress type; membrane stress; of; concave head; belongs to; what; classification}  (11)
2. As shown at reference numeral S52 in FIG. 5, the table feature information T of the table "Examples of stress classification" is obtained: the title of the table is denoted title = {stress classification; of; examples}, and the header information of the table is denoted head = {container part; location; origin of stress; stress type; classification}. The table feature information is therefore:
T = {stress classification; of; examples; container part; location; origin of stress; stress type; classification}  (12)
3. The pre-attention vector between Q and T is computed crosswise, as shown at reference numeral S53 in FIG. 5. The frequencies of occurrence of the participles between the question and the table are calculated with the function Match(x, y) defined in the invention. The word "of" in the participle sets Q and T and the word "what" can be regarded as stop words in the text, so these words score 0 in the Match computation.
As shown at reference numerals S53 and S54 in FIG. 5, take the question participle q_3 = "membrane stress" as an example. Since q_3 is not contained in the title,
Match(q_3, title) = 0  (13)
As shown at reference numeral S54 in FIG. 5, the header information head = {container part; location; origin of stress; stress type; classification} does not contain q_3 either, i.e.,
Match(q_3, head) = 0  (14)
However, the table contents under the header attribute s_4 "stress type" contain "membrane stress", and thus Match(q_3, s_4) = 1, i.e.,
Match(q_3, C) = 1  (15)
The final frequency of occurrence of the participle q_3 in T_j is therefore:
max(Match(q_3, title), Match(q_3, head), Match(q_3, C)) = 1  (16)
Calculating in this way between the natural question Q = {stress type; membrane stress; of; concave head; belongs to; what; classification} and the table T = {stress classification; of; examples; container part; location; origin of stress; stress type; classification} yields the pre-attention vector:
M = [F_Q^T ; F_title^Q ; F_head^Q]  (17)
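The computation of eqs. (13)-(16) for q_3 can be checked with a few lines of code; English glosses of the Chinese participles are used, and the cell contents under the "stress type" header are assumed for illustration:

```python
# Reproducing eqs. (13)-(16) for q_3 = "membrane stress". English glosses of the
# Chinese participles are used; the cell contents shown are assumed, not the
# full table of the patent's figure.
def match(x, texts):
    return int(any(x in t for t in texts))

title = ["stress classification", "of", "examples"]
head = ["container part", "location", "origin of stress", "stress type", "classification"]
cells_s4 = ["membrane stress", "bending stress"]  # assumed contents under s_4 "stress type"

q3 = "membrane stress"
f_title = match(q3, title)               # eq (13): 0, q_3 is not in the title
f_head = match(q3, head)                 # eq (14): 0, q_3 is not a header attribute
f_cells = match(q3, cells_s4)            # eq (15): 1, a cell under s_4 contains q_3
f_table = max(f_title, f_head, f_cells)  # eq (16): 1
```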
4. The fusion of the self-attention score and the pre-attention vector in the method of the application is shown in FIG. 6. The natural question Q and the table feature information T are concatenated as [Q : T], as shown at reference numeral S61 in FIG. 6. As shown at reference numeral S62 in FIG. 6, the pre-attention vector M of [Q : T] is computed. As shown at reference numeral S63 in FIG. 6, word2vec performs word-vector embedding of [Q : T]. As shown at reference numeral S64 in FIG. 6, the experiment uses an existing Bi-LSTM neural network to further extract feature information from the vector-represented text, so that the vector representation of the text attends to context information; using another semantic-information extraction method does not affect the result of the invention. This yields the semantic information Z of [Q : T]. As shown at reference numeral S65 in FIG. 6, the self-attention score A of Z is computed in the self-attention layer. Finally, as shown at reference numeral S66 in FIG. 6, the obtained self-attention score is combined with the pre-attention vector M to obtain the fused vector:
M' = A * M + M
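Formula (8) itself reduces to an element-wise multiply-and-add; a sketch with made-up per-token attention scores (the real A comes from the Bi-LSTM and self-attention layers above):

```python
import numpy as np

# Sketch of formula (8): fusing per-token self-attention scores A with the
# binary pre-attention vector M; all numbers are made up for illustration.
M = np.array([1, 1, 0, 0, 0, 0, 1], dtype=float)   # pre-attention for 7 question tokens
A = np.array([0.3, 0.5, 0.1, 0.2, 0.1, 0.1, 0.4])  # stand-in self-attention scores
fused = A * M + M   # tokens matched across question and table are amplified;
                    # unmatched tokens stay at 0
```

The "+ M" term keeps every matched token alive even when its attention score is small, which is the point of fusing the two signals rather than using the attention score alone.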
5. The experiment trains and tests on the 4500 examples built from the international ASME Boiler and Pressure Vessel Code. The accuracy of the present method in matching tables reaches 90.6%. In a comparative experiment, a semantic-matching model that only extracts feature vectors with the self-attention mechanism and then computes similarity achieves an accuracy of 73.7% on the table-matching task. The experiment also compares a method that performs statistical keyword matching over the tables: the number of occurrences of the question keywords in each table is counted, all tables are ranked by this count, and the top-ranked table is returned as the most relevant table for the question. This keyword-statistics method reaches an accuracy of only 87.3%. The results show that the present method is effective for computing the similarity between a question and a table.
A device for retrieving an associated table from a question comprises: one or more processors; and a storage device for storing one or more programs. When the one or more programs are executed by the one or more processors, the one or more processors implement the method described above. A block diagram of the device for retrieving an associated table from a question is shown in FIG. 7; the processor implements the method when executing the computing program.
It should be understood that the data-processing device corresponds one-to-one with the data-processing method, which is therefore not described again here. It will be apparent to those skilled in the art that, for convenience and brevity of description, only the division into the above functional units and modules is illustrated; in practical applications, the above functions may be distributed to different functional units and modules as needed, that is, the internal structure of the device may be divided into different functional units or modules to perform all or part of the functions described above.

Claims (6)

1. A method for retrieving an association table based on a question, comprising the steps of:
step 1: respectively calculating the frequency of the words in the question in the table and the frequency of the words in the table in the question, and recording the calculated frequencies as pre-attention vectors;
step 2: obtaining word embedding expression vectors of a question and a table;
and step 3: the method comprises the steps of fusing word embedding expression vectors and calculation frequency of a question and a table to obtain fusion vector expression of the question and the table;
and 4, step 4: and calculating the similarity between the question and the table according to the fusion vector.
2. The method for retrieving an associated table from a question according to claim 1, characterized in that calculating the pre-attention vector between the question and the table in step 1 specifically comprises:
Step 2.1: a question is denoted by the symbol Q. First, the question is segmented into participles to obtain Q = {q_1, q_2, …, q_n}, where q_n denotes each participle in the question and n denotes the number of participles in the question.
Step 2.2: construct the feature information of the table, which comprises the title and the header of the table and is denoted T_j, where j denotes the index of the table. The title of the table is denoted title_j and is segmented into participles title_j = {t_1, t_2, …, t_m}, where t_m denotes each participle in the title and m denotes the number of participles in the title. The header of the table is denoted head_j = {s_1, s_2, …, s_k}, where s_k denotes each attribute participle in the header and k denotes the number of attribute participles in the header. Finally, the title and the header information are concatenated to obtain the table feature information T_j = {t_1, t_2, …, t_m, s_1, s_2, …, s_k}.
Step 2.3: calculate the pre-attention vector between the question Q and the table. A function Match(x, y) is defined to compute the frequency of occurrence of a participle x in a text y: the function returns 1 when the text y contains the participle x and 0 otherwise, and it returns 0 when x is a common stop word. Using another method to compute the frequency of occurrence of participles in text does not affect the result of the invention. The specific method of computing the pre-attention vector is as follows:
2.3.1) First calculate the frequency with which each participle q_n of Q appears in the table title title_j. According to the definition of Match(x, y), the frequencies of the participles of Q in title_j are expressed as:
F_Q^title = [Match(q_1, title_j), Match(q_2, title_j), …, Match(q_n, title_j)]  (1)
In addition, the frequency with which each participle t_m of title_j appears in Q is calculated:
F_title^Q = [Match(t_1, Q), Match(t_2, Q), …, Match(t_m, Q)]  (2)
2.3.2) calculating the participles Q in QnHead at table headjThe frequency of occurrence of (1):
Figure FDA0003428218650000022
Next, calculate the frequency with which the participles in Q appear in the table contents corresponding to head_j. The table contents corresponding to head_j are denoted C_j; the contents of all cells under the header are denoted C_j = {c_1, c_2, …, c_k}, where c_k represents the content of each cell and k represents the number of cells. The frequency with which the participle q_n in Q appears in C_j is expressed as:

f(q_n, C_j) = Match(q_n, C_j)
The frequency with which the participle q_n in Q appears in head_j is finally expressed by combining its frequency in head_j with its frequency in the corresponding table contents C_j [the combining formula is an equation image in the original].
Further, the frequency with which each attribute participle in head_j appears in Q is obtained:

f(s_k, Q) = Match(s_k, Q)
2.3.3) By combining the frequency with which the participles in question Q appear in the table title, calculated in step 2.3.1), with the frequency with which they appear in the table header, calculated in step 2.3.2), the frequency with which the participles in question Q appear in the entire table is obtained [equation images in the original].
2.3.4) According to the above steps, the pre-attention vector between the question Q and the table is obtained: the frequency with which each participle in the question appears in the table, spliced with the frequencies with which the title participles and the header attribute participles of the table appear in the question. The pre-attention vector is recorded as M_ij [equation image in the original].
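The frequency calculations of step 2.3 can be sketched as follows. The binary Match function and the concatenation into M_ij follow the claim text; the stop-word list is hypothetical, and the way the header and cell-content frequencies are merged (here, a maximum) is an assumption, since the original formulas are equation images:

```python
# Binary containment function of step 2.3: returns 1 when text y contains
# participle x, 0 otherwise; common stop words always return 0.
STOP_WORDS = {"the", "of", "a"}  # hypothetical stop-word list

def match(x, y):
    return 0 if x in STOP_WORDS else int(x in y)

def pre_attention_vector(question, title, header, contents):
    """Concatenate question-side and table-side frequencies into M_ij."""
    q_freqs = []
    for q in question:
        f_title = match(q, title)
        # merging header and cell-content frequencies via max is an assumption
        f_head = max(match(q, header), match(q, contents))
        q_freqs.append(max(f_title, f_head))  # frequency in the whole table
    t_freqs = [match(t, question) for t in title]   # title participles in Q
    s_freqs = [match(s, question) for s in header]  # header participles in Q
    return q_freqs + t_freqs + s_freqs              # length n + m + k
```

The returned vector has one entry per question participle followed by one entry per title and header participle, matching the splicing described in 2.3.4).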
3. The method of retrieving an association table according to a question as claimed in claim 2, characterized in that obtaining the word-embedding representation vectors of the question and the table in step 2 specifically comprises:
The question Q and the table characteristic information T_j are spliced directly: [Q; T_j] = [q_1, q_2, …, q_n, t_1, t_2, …, t_m, s_1, s_2, …, s_k]. Then, using a generic word-embedding model, [Q; T_j] is represented as the vector Z_ij, where each Z_ij sequence has length n + m + k. The self-attention feature vector of Z_ij, denoted A_ij, is then obtained through the existing self-attention mechanism. The attention mechanism can quickly extract important features from sparse data and is widely used in natural language processing tasks, while the self-attention mechanism is an improvement of the attention mechanism that reduces reliance on external information and is better at capturing the internal correlations of data or features.
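A minimal single-head scaled dot-product self-attention over the spliced sequence [Q; T_j] can be sketched as follows; the random matrix stands in for the "generic word embedding model", which the claim does not fix to any particular implementation:

```python
import numpy as np

def self_attention(Z):
    """Z: (seq_len, d) embedding of [Q; T_j]; returns the attention output A_ij."""
    d = Z.shape[1]
    scores = Z @ Z.T / np.sqrt(d)                    # pairwise token similarities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
    return weights @ Z                               # contextualized token vectors

rng = np.random.default_rng(0)
Z_ij = rng.standard_normal((7, 8))   # n + m + k = 7 tokens, embedding size 8
A_ij = self_attention(Z_ij)
```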
4. The method of retrieving an association table according to a question as claimed in claim 3, characterized in that in step 3, fusing the word-embedding representation vectors of the question and the table with the calculated frequencies to obtain the fusion vector representation of the question and the table specifically comprises:
The pre-attention vector M_ij and the self-attention feature vector A_ij obtained in claim 3 are fused according to the following formula:

A_ij * M_ij + M_ij (8)
A fused vector representation is obtained and is recorded here as F_ij [the original notation is an equation image]:

F_ij = A_ij * M_ij + M_ij (9)
5. The method of retrieving an association table according to a question as claimed in claim 1 or 4, characterized in that calculating the similarity between the question and the table according to the fusion vector in step 4 specifically comprises:
Similarity calculation is carried out using the existing sigmoid activation function. The sigmoid function maps a real number x into the interval (0, 1) and can be used for classification; the formula of the sigmoid function is defined as:

S(x) = 1 / (1 + e^(-x)) (10)

The sigmoid function is used to map the fusion vector into the (0, 1) interval to represent the degree of similarity between the question and the table: a mapped value close to 1 indicates that the question and the table are very similar, and a mapped value close to 0 indicates that they are not similar.
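The sigmoid mapping of step 4 can be sketched as follows; reducing the fused vector to a single real score by summation is an assumption of this sketch, since the claim does not specify the reduction:

```python
import math

def sigmoid(x):
    """S(x) = 1 / (1 + e^-x), maps any real x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def similarity(fused_vector):
    # reducing the fused vector to one score by summation is an assumption
    return sigmoid(sum(fused_vector))
```

A score near 1 marks the table as relevant to the question; a score near 0 marks it as irrelevant.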
6. An apparatus for retrieving an association table according to a question, characterized by comprising: one or more processors; and a storage device for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described above, the processor implementing the steps of claim 1 when executing the computer program.
CN202111586986.1A 2021-12-23 2021-12-23 Method and device for retrieving associated table according to question Pending CN114265924A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111586986.1A CN114265924A (en) 2021-12-23 2021-12-23 Method and device for retrieving associated table according to question


Publications (1)

Publication Number Publication Date
CN114265924A true CN114265924A (en) 2022-04-01

Family

ID=80828953

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111586986.1A Pending CN114265924A (en) 2021-12-23 2021-12-23 Method and device for retrieving associated table according to question

Country Status (1)

Country Link
CN (1) CN114265924A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116049354A (en) * 2023-01-28 2023-05-02 北京原子回声智能科技有限公司 Multi-table retrieval method and device based on natural language


Similar Documents

Publication Publication Date Title
CN107436864B (en) Chinese question-answer semantic similarity calculation method based on Word2Vec
CN109344236B (en) Problem similarity calculation method based on multiple characteristics
JP6618735B2 (en) Question answering system training apparatus and computer program therefor
Mollá et al. Question answering in restricted domains: An overview
US10503828B2 (en) System and method for answering natural language question
US20150227505A1 (en) Word meaning relationship extraction device
CN112069298A (en) Human-computer interaction method, device and medium based on semantic web and intention recognition
CN111506721A (en) Question-answering system and construction method for domain knowledge graph
CN111797245B (en) Knowledge graph model-based information matching method and related device
JP2011118689A (en) Retrieval method and system
CN112434533A (en) Entity disambiguation method, apparatus, electronic device, and computer-readable storage medium
Hassani et al. LVTIA: A new method for keyphrase extraction from scientific video lectures
Sukumar et al. Semantic based sentence ordering approach for multi-document summarization
Gasmi Medical text classification based on an optimized machine learning and external semantic resource
CN112711666B (en) Futures label extraction method and device
Albeer et al. Automatic summarization of YouTube video transcription text using term frequency-inverse document frequency
CN114265924A (en) Method and device for retrieving associated table according to question
Schirmer et al. A new dataset for topic-based paragraph classification in genocide-related court transcripts
Karpagam et al. Deep learning approaches for answer selection in question answering system for conversation agents
Alwaneen et al. Stacked dynamic memory-coattention network for answering why-questions in Arabic
CN111858885B (en) Keyword separation user question intention identification method
CN114298020A (en) Keyword vectorization method based on subject semantic information and application thereof
Bulfamante Generative enterprise search with extensible knowledge base using AI
CN112905752A (en) Intelligent interaction method, device, equipment and storage medium
Sheikh et al. Improved neural bag-of-words model to retrieve out-of-vocabulary words in speech recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination