CN112632256A - Information query method and device based on question-answering system, computer equipment and medium - Google Patents

Information query method and device based on question-answering system, computer equipment and medium

Info

Publication number
CN112632256A
Authority
CN
China
Prior art keywords
query
text
data
document
word segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011590805.8A
Other languages
Chinese (zh)
Inventor
史文鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202011590805.8A priority Critical patent/CN112632256A/en
Publication of CN112632256A publication Critical patent/CN112632256A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G06F 16/3331 Query processing
    • G06F 16/3332 Query translation
    • G06F 16/3334 Selection or weighting of terms from queries, including natural language queries
    • G06F 16/334 Query execution
    • G06F 16/3347 Query execution using vector based model
    • G06F 16/35 Clustering; Classification
    • G06F 16/353 Clustering; Classification into predefined classes
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06N 3/088 Non-supervised learning, e.g. competitive learning

Abstract

The embodiments of the application belong to the field of artificial intelligence and are applied to the field of intelligent banking. They relate to an information query method based on a question-answering system, which includes: if query data are received, querying a document database to obtain at least one query document corresponding to the query data; vectorizing the word segmentation texts obtained by word segmentation to obtain word segmentation vectors; inputting the word segmentation vectors into a frame selection model to obtain a data vector sequence based on the query data; performing convolution processing on the data vector sequence through a plurality of expansion operation units of the frame selection model to obtain a first query result; and screening the first query result according to a screening algorithm to obtain a final information query result. The application also provides an information query device, a computer device and a storage medium based on the question-answering system. In addition, the application relates to block chain technology: the query document may also be stored in a block chain. The method solves the technical problems of low model feature coverage and excessively large models in the prior art.

Description

Information query method and device based on question-answering system, computer equipment and medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to an information query method and apparatus, a computer device, and a storage medium based on a question-answering system.
Background
Intelligent customer service has been widely used across industries as a medium through which customers seek help. When a user asks a question, the intelligent customer service algorithm matches the question against expanded questions one by one, infers the corresponding standard question, and returns its answer to the user. The quality of the document database therefore directly determines the service quality of the intelligent customer service. However, building the document database is complex, especially in the banking industry, because banking business is broad and highly specialized. When a user's question cannot be answered, the system needs to supplement questions item by item through a traversal algorithm: a large amount of data must be traversed, the corresponding industry instructions learned, and simplified answers extracted and added to the knowledge base, so the operation efficiency is extremely low and the task is difficult.
In the conventional technology, extracted answers are added to the knowledge base using custom templates for commonly asked questions, such as templates of the form "… time", "… for example", or "… annual interest rate". Although this can extract answers into the knowledge base, it requires too much manual intervention and its coverage is low. Alternatively, a BERT pre-trained model can be used to output answers, but such a model is too large and its sequence length is limited to at most 512, which imposes certain limitations.
In summary, an information query scheme is needed that can solve the technical problems of low feature coverage and excessively large models in the conventional technology.
Disclosure of Invention
Based on this, the present application provides an information query method and apparatus, a computer device, and a storage medium based on a question-answering system, so as to solve the technical problems in the prior art that model feature coverage is low and the model is too large.
An information query method based on a question-answering system, the method comprises the following steps:
if receiving query data, querying from a document database to obtain at least one query document corresponding to the query data;
performing word segmentation on the query document, and performing vectorization on a plurality of word segmentation texts obtained by word segmentation to obtain word segmentation vectors;
inputting the word segmentation vectors into a frame selection model, and obtaining a data vector sequence based on the query data;
performing convolution processing on the data vector sequence through a plurality of expansion operation units of the frame selection model to obtain a first query result;
and screening the first query result according to a screening algorithm to obtain a final information query result.
An information inquiry apparatus based on a question-answering system, the apparatus comprising:
the rough query module is used for querying from a document database to obtain at least one query document corresponding to the query data if the query data is received;
the vector module is used for performing word segmentation processing on the query document and performing vectorization processing on a plurality of word segmentation texts obtained by word segmentation to obtain word segmentation vectors;
the coding module is used for inputting the word segmentation vectors into a frame selection model, generating position codes for the word segmentation vectors, and summing the problem codes of the query data and the word segmentation vectors after the position codes are generated to obtain a data vector sequence;
the convolution module is used for performing convolution processing on the data vector sequence through a plurality of expansion operation units of the frame selection model to obtain a first query result;
and the screening module is used for screening the first query result according to a screening algorithm to obtain a final information query result.
A computer device, comprising a memory and a processor, and computer readable instructions stored in the memory and executable on the processor, wherein the processor implements the steps of the above information query method based on question-answering system when executing the computer readable instructions.
A computer readable storage medium, which stores computer readable instructions, and when the computer readable instructions are executed by a processor, the steps of the above information query method based on the question-answering system are implemented.
According to the information query method and apparatus, computer device, and storage medium based on the question-answering system, at least one query document whose first similarity to the query data meets the first threshold is obtained by querying the document database, which yields a screened result and reduces the amount of data computation. After word segmentation of the obtained query document, a corresponding position code is generated for each word segmentation text, summed with the question encoding, and an answer frame selection operation is performed with a custom frame selection model to obtain a first query result. To make the obtained information query result more accurate, the application further screens the first query result through a screening algorithm to obtain the final information query result. The custom frame selection model performs convolution processing on the vector sequence by adding a gate mechanism and a residual structure combined with dilated CNNs, so that the dilated-CNN-based frame selection model can capture longer-range dependencies without increasing its parameters or computation, giving the frame selection model wider coverage and solving the technical problems in the prior art that model feature coverage is low and the model is too large.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a schematic diagram of an application environment of an information query method based on a question-answering system;
FIG. 2 is a schematic flow chart of an information query method based on a question-answering system;
FIG. 3 is a schematic diagram of the structure of a frame selection model;
FIG. 4 is a schematic diagram of sequence integration;
FIG. 5 is a schematic diagram of an information query device based on a question-answering system;
FIG. 6 is a diagram of a computer device in one embodiment.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The information query method based on the question-answering system provided by the embodiment of the invention can be applied to the application environment shown in FIG. 1. The application environment may include a terminal 102, a server 104, and a network that provides a communication link medium between the terminal 102 and the server 104, wherein the network may include various connection types, such as wired or wireless communication links, or fiber optic cables.
A user may use the terminal 102 to interact with the server 104 over a network to receive or send messages, etc. The terminal 102 may have installed thereon various communication client applications, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal 102 may be various electronic devices having a display screen and supporting web browsing, including but not limited to a smart phone, a tablet computer, an e-book reader, an MP3 player (Moving Picture Experts Group Audio Layer III, mpeg compression standard Audio Layer 3), an MP4 player (Moving Picture Experts Group Audio Layer IV, mpeg compression standard Audio Layer 4), a laptop portable computer, a desktop computer, and the like.
The server 104 may be a server that provides various services, such as a background server that provides support for pages displayed on the terminal 102.
It should be noted that the information query method based on the question-answering system provided in the embodiment of the present application is generally executed by the server/terminal, and accordingly, the information query apparatus based on the question-answering system is generally disposed in the server/terminal device.
The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The method and the device can be applied to the field of smart cities, particularly to the field of smart banks, and accordingly construction of the smart cities is promoted.
It should be understood that the number of terminals, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Wherein, the terminal 102 communicates with the server 104 through the network. The server 104 receives the query data sent by the terminal 102, acquires the query document corresponding to the query data from the document database, selects an answer matched with the query data from the query document through the improved frame selection model, screens the answer, and sends the screened answer to the terminal 102. The terminal 102 and the server 104 are connected through a network, the network may be a wired network or a wireless network, the terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, and the server 104 may be implemented by an independent server or a server cluster formed by a plurality of servers.
In an embodiment, as shown in fig. 2, an information query method based on a question-answering system is provided, which is described by taking the method applied to the server in fig. 1 as an example, and includes the following steps:
step 202, if receiving query data, querying from a document database to obtain at least one query document corresponding to the query data.
The technical solution of the application can be applied to a question-answering system that includes a document retrieval module and an answer frame selection module, so as to expand question-answer data. The query data refer to a question for which a simplified answer needs to be matched, for example: "What day of the month is the credit card repayment date?". The document database refers to a database that collects corpus data or other document data for the corresponding business scenario; for example, in a bank question-answering system, the document database may be the bank's internal knowledge base together with documents such as the bank's papers and publications.
If the user cannot obtain a desired answer to the question "What day of the month is the credit card repayment date?", the rough search sub-module of the document retrieval module can perform a coarse document search through the full-text search engine Elasticsearch (ES), which can store and search data in near real time. The full-text search engine ES uses the BM25 algorithm, based on a probabilistic retrieval model, to evaluate a first similarity between the search term (the question) and each document. In the present embodiment, the documents whose first similarity ranks in the top 5 are retained.
Optionally, in order to improve the computational efficiency of the coarse screening, the method may further include:
dividing a document to be queried in the document database to obtain at least one text paragraph; calculating a first proportion of stop words in each text paragraph according to the stop word list; calculating, according to the BM25 algorithm, a first similarity between the query data and the text paragraph corresponding to the minimum first proportion; and taking the documents to be queried in which the text paragraphs whose first similarity is greater than the first threshold are located as the query documents obtained by the query.
Here, a document to be queried is divided according to its paragraph marks to obtain at least one text paragraph, where a text paragraph refers to a paragraph in the document. Stop words are words or phrases that are automatically filtered out before or after processing natural language data (or text) in information retrieval, in order to save storage space and improve search efficiency; the stop word list is a data list containing such stop words.
The first proportion is the proportion of stop words in each text paragraph; the number of stop words reflects, to a certain extent, how specialized the document is. The method therefore improves on full-text search by directly calculating the relevance between a specific text paragraph and the query data, which greatly reduces the amount of data computation and improves information query efficiency. The specific text paragraph is the text paragraph whose stop-word proportion is minimal.
Then, the first similarity between the query data and the text paragraph is calculated according to the BM25 algorithm. The BM25 algorithm is usually used to score search relevance: the text paragraph is parsed into morphemes; for the query data, a relevance score between each morpheme and the query data is calculated; finally, the morphemes' relevance scores with respect to the query data are weighted and summed to obtain the relevance score between the text paragraph and the query data, which is the first similarity.
Further, the value of the first threshold may be set through training for different scenarios or obtained from historical experience; in this embodiment, the first threshold may be equal to 0.8. More than one query document may be obtained.
Preferably, in order to improve the efficiency of subsequent information query, the five documents whose first similarity ranks highest are retained.
Text paragraphs that are most likely to be relevant to the query data are selected from the documents to be queried for similarity calculation, and the calculated result then determines whether a document to be queried is one of the query documents we need. This query mode greatly reduces the amount of data to be processed and improves information query efficiency.
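As a concrete illustration of this coarse query, the following is a minimal sketch in Python, assuming the jieba and rank_bm25 packages; the stop-word list, the helper names, and the treatment of the 0.8 threshold as a cut-off on the raw BM25 score are illustrative assumptions rather than the patented implementation (BM25 scores are not naturally bounded to [0, 1], so the threshold may need normalization in practice).

    import jieba
    from rank_bm25 import BM25Okapi

    FIRST_THRESHOLD = 0.8   # first threshold of this embodiment (see note above on BM25 score scale)
    TOP_K = 5               # number of query documents retained

    def stopword_ratio(paragraph, stopwords):
        tokens = jieba.lcut(paragraph)
        if not tokens:
            return 1.0
        return sum(tok in stopwords for tok in tokens) / len(tokens)

    def coarse_query(query_data, documents, stopwords):
        query_tokens = jieba.lcut(query_data)
        targets = []
        for doc in documents:
            paragraphs = [p for p in doc.split("\n") if p.strip()]
            if not paragraphs:
                continue
            # the specific text paragraph: the one with the minimum stop-word proportion
            targets.append((doc, min(paragraphs, key=lambda p: stopword_ratio(p, stopwords))))
        if not targets:
            return []
        bm25 = BM25Okapi([jieba.lcut(t) for _, t in targets])
        scores = bm25.get_scores(query_tokens)          # first similarity per document
        ranked = sorted(zip(scores, range(len(targets))), reverse=True)
        return [targets[i][0] for score, i in ranked[:TOP_K] if score > FIRST_THRESHOLD]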
And 204, performing word segmentation on the query document, and performing vectorization on a plurality of word segmentation texts obtained by word segmentation to obtain word segmentation vectors.
After at least one query document is obtained, the contents of the 5 retrieved query documents are processed in the answer frame selection module, i.e., the custom model. For efficiency, in some embodiments, the LSTM commonly used for NLP tasks can be replaced with a CNN model, and a "pointer network" is used to represent the beginning and end of the answer.
First, the query document is segmented into words. In some embodiments, the query document can be accurately segmented through jieba word segmentation without redundant words; further, to match a search-engine mode, after the query document has been accurately segmented, word segmentation texts whose length exceeds a set value can be found and segmented again.
Specifically, a jieba tool may be employed to tokenize the question data and the query documents, and a pre-trained word vector model may be used as the vocabulary for model input. The vector of each input word segmentation text is obtained through a table lookup, and word segmentation texts outside the vocabulary are represented by 0, yielding the vectors of the query document and thus a vector sequence.
Further, the dimension of the vectors can be chosen as 200. The word vector model is pre-trained with Word2Vec provided by Gensim on corpora including the bank's internal data sets, the WebQA corpus, Wikipedia, and encyclopedia question-answering corpora; the Word2Vec model is Skip-gram with a window of 6 and 8 negative samples. Gensim is an open-source third-party Python toolkit used for unsupervised learning of the topic vector representation of the hidden text layer from raw, unstructured text.
Further, to give the frame selection model more inputs, a char embedding may be trained for each word, also with dimension 200; the token vectors and the position codes are added to obtain a vector sequence with a maximum length of 100. If some samples in a batch involve padding (the batch size is a hyper-parameter that defines the number of samples processed before the internal model parameters are updated), the padded part is masked for the convolution (masking is a very common operation in NLP, with many application scenarios and forms).
Optionally, in some embodiments, the THULAC (THU Lexical Analyzer for Chinese) toolkit may also be used for word segmentation; the tool provides Chinese word segmentation and part-of-speech tagging with high accuracy and speed.
Further, word segmentation of the query document is not limited to the above manners; any prior-art technique that can segment the query document into word segmentation texts can be applied in this embodiment. Performing word segmentation in the above manner simply allows fast segmentation to be achieved with mature techniques, improving the overall efficiency of answer frame selection in this application.
Further, the word segmentation processing on the query data can also be implemented in the above manner, and is not described herein again.
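A minimal sketch of the word segmentation and vectorization described above, assuming jieba for segmentation and a Gensim KeyedVectors word vector model pre-trained as in this embodiment; the function names are illustrative, while the 200-dimension setting and the zero vector for out-of-vocabulary tokens follow the text:

    import jieba
    import numpy as np
    from gensim.models import KeyedVectors

    EMB_DIM = 200

    def segment(text, search_mode=False):
        # cut_for_search re-cuts long segments, matching the search-engine mode described above
        return list(jieba.cut_for_search(text)) if search_mode else jieba.lcut(text)

    def vectorize(tokens, kv: KeyedVectors):
        vectors = []
        for tok in tokens:
            if tok in kv:                       # table lookup for in-vocabulary tokens
                vectors.append(kv[tok])
            else:                               # out-of-vocabulary tokens represented by 0
                vectors.append(np.zeros(EMB_DIM, dtype=np.float32))
        return np.stack(vectors) if vectors else np.zeros((0, EMB_DIM), dtype=np.float32)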
And step 206, inputting the word segmentation vectors into a frame selection model, generating position codes for the word segmentation vectors, and summing the problem codes of the query data and the word segmentation vectors after the position codes are generated to obtain a data vector sequence.
In order to add the position information of each word segmentation text in the query document to the CNN, position coding is added to the frame selection model. A text sentence segment refers to a sentence unit in the query document: the position code of each word segmentation text within the text sentence segments of the query document is used as a feature input. Position codes make it easier to locate the beginning and end of answers in long documents and improve the efficiency of answer frame selection. The position code is generated with sine and cosine functions of different frequencies and then added to the word vector at the corresponding position as a new combined feature; note that the dimension of the PE vector must be consistent with the dimension of the word vector. Expression (1) is as follows:
PE(pos, 2i) = sin(pos / 10000^(2i / d_pos))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_pos))
(1)
where pos corresponds to the input position, i refers to the dimension, and d_pos refers to the length of the position vector.
Further, to make it easier for the frame selection model to select the corresponding answer from the query document according to the query data, the query data and the document vectors may be fed into the frame selection model together for convolution processing. Specifically, the query data are vectorized to obtain the corresponding question encoding; the position code of each word segmentation vector is obtained, and the question encoding is summed with the position-encoded word segmentation vectors to obtain the data vector sequence. Vector summation is a vector calculation method.
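The following sketch illustrates expression (1) and the summation of step 206, assuming NumPy and an even embedding dimension; broadcasting a single pooled question encoding over every position of the document sequence is an assumption made for illustration:

    import numpy as np

    def position_encoding(seq_len, dim):
        pe = np.zeros((seq_len, dim), dtype=np.float32)
        pos = np.arange(seq_len)[:, None]              # input positions
        i = np.arange(0, dim, 2)[None, :]              # even dimensions
        angle = pos / np.power(10000.0, i / dim)
        pe[:, 0::2] = np.sin(angle)                    # PE(pos, 2i)
        pe[:, 1::2] = np.cos(angle)                    # PE(pos, 2i+1)
        return pe

    def build_data_vector_sequence(doc_vectors, question_vector):
        # doc_vectors: (seq_len, dim) word segmentation vectors of the query document
        # question_vector: (dim,) question encoding of the query data
        seq_len, dim = doc_vectors.shape
        return doc_vectors + position_encoding(seq_len, dim) + question_vector[None, :]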
And 208, performing convolution processing on the data vector sequence through a plurality of expansion operation units of the frame selection model to obtain a first query result.
In order to obtain a more lightweight frame selection model that can output longer answers, in some embodiments, the frame selection model is refined.
A convolution operation unit is defined by adding a gating mechanism. Specifically, assume that the data vector sequence to be processed is X = [x1, x2, ..., xn], where x1, x2, ..., xn are the word vectors corresponding to each word segmentation text in the query document; for example, if the text to be processed is [credit card application conditions], then x1, x2, ..., xn are the vector representations of its words. We add a gate mechanism to an ordinary one-dimensional convolution to obtain expression (2):
Y=Conv1D1(X)×σ(Conv1D2(X))
(2)
where Y is the output of the convolution operation unit; the two one-dimensional convolutions have the same form but do not share weights, and the sigmoid σ adds a valve to each output to control the flow of information. For example, multiplying by a number σ less than 1 reduces the amount of information that flows through.
Further, during back-propagation of the convolution processing, vanishing gradients make it harder for the network to learn long-distance dependencies. To avoid vanishing gradients while passing along more feature information, and noting that X is transformed continuously after being input into the frame selection model, expression (2) can be improved on this basis to obtain expression (3):
Y=X+Conv1D1(X)×σ(Conv1D2(X))
(3)
where X is added in expression (3) as a residual term, so that the gradient does not vanish when the convolution output is 0. Expressions (3) and (2) form a progressive relationship, in which X denotes the data vector sequence.
Further, existing BERT-like pre-trained models are too large and their sequence length is limited, and the receptive field of the frame selection model depends on the size of the CNN convolution kernel. To enable the CNN-based frame selection model to capture longer distances without increasing its parameters, in some embodiments, the data may be processed using dilated convolution (dilated CNN).
As shown in the structural diagram of the frame selection model in FIG. 3, with convolution kernels of the same size, dilated convolution has a larger receptive field, and the dilation rates of the dilated convolutions can grow in the geometric progression 1, 2, 4, 8, … so as to cover features more comprehensively. Dilated convolution is also called atrous (hole) convolution, and the dilation rate represents the number of skipped positions. In FIG. 4, the question code is the sequence encoding of the query data, the document code is the sequence encoding of the query document, the document vector is the vector corresponding to each text sentence segment or word segmentation text in the query document, the position code is the position corresponding to the text sentence segment, the start position refers to the probability value that the answer starts at a certain position in the query document, and the end position refers to the probability value that the answer ends at a certain position in the query document.
Specifically, in this embodiment, the frame selection model is structured as multiple convolution layers, each using a different convolution kernel. The data vector sequence is processed by adding a gate mechanism and a residual structure, combined with a fully connected layer, an attention mechanism, and the like, until the queried answer is obtained. For example, the data vector sequence is convolved at least twice by the dilated CNN, and each convolution result is processed based on the gate mechanism and residual structure to obtain the first query result, where the dilated CNN is a dilated convolutional neural network.
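A sketch of the expansion operation unit of expressions (2) and (3), stacked with geometrically growing dilation rates, assuming PyTorch; the kernel size, the dilation schedule (1, 2, 4, 8) and the class names are illustrative, not the patented architecture:

    import torch
    import torch.nn as nn

    class GatedDilatedConv1D(nn.Module):
        def __init__(self, dim, kernel_size=3, dilation=1):
            super().__init__()
            pad = (kernel_size - 1) // 2 * dilation     # keep the sequence length unchanged
            # the two one-dimensional convolutions have the same form but do not share weights
            self.conv1 = nn.Conv1d(dim, dim, kernel_size, padding=pad, dilation=dilation)
            self.conv2 = nn.Conv1d(dim, dim, kernel_size, padding=pad, dilation=dilation)

        def forward(self, x):                           # x: (batch, dim, seq_len)
            gate = torch.sigmoid(self.conv2(x))         # valve controlling the flow of information
            return x + self.conv1(x) * gate             # residual term avoids vanishing gradients

    class DilatedEncoder(nn.Module):
        def __init__(self, dim, dilations=(1, 2, 4, 8)):
            super().__init__()
            self.blocks = nn.ModuleList(GatedDilatedConv1D(dim, dilation=d) for d in dilations)

        def forward(self, x):
            for block in self.blocks:
                x = block(x)
            return x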
The first query result may be one or a combination of different text segments corresponding to different position codes in the same query document, the same text segment corresponding to different position codes in different query documents, or different text segments corresponding to different position codes in different query documents.
For example, the first query result may be "e.g., with the 20th of each month as the billing day, the 8th of the next month is the final repayment date (regardless of the length of the month). The interest-free repayment period runs from the bank's accounting date to the final repayment date; the bank credit card's interest-free period is at most 50 days and at least 19 days, and purchases made the day after the billing day enjoy the longest interest-free period.", together with "the final repayment day of the bank credit card is the 18th day after the billing day." and "the repayment period is 18 days."
The text segments may be descriptive texts located at different positions on a certain query document, or descriptive texts located at different positions on different query documents.
Specifically, the implementation mode is as follows: performing convolution processing on each text sequence in the data vector sequence through expansion CNNs of a plurality of different convolution kernels to obtain semantic correlation among each participle text in the text sequence, wherein the text sequence is represented by a vector of a text sentence segment in the query document; generating question weights corresponding to the query data for each participle text according to an attention mechanism based on the semantic relevance; integrating the question weight, the participle text corresponding to the question weight and the position code of the participle text to obtain the first query result, wherein the first query result comprises a text segment and the position code of the text segment in the corresponding query document.
Specifically, as shown in the structural diagram of the frame selection model in FIG. 3, after the data vector sequence is processed multiple times by the expansion operation units and the fully connected layer, the model outputs the start position of the answer in the query document and the probability value corresponding to that position. Specifically, the frame selection model performs a binary classification prediction for each word segmentation text in the query document, determining the probability values that the answer corresponding to the query data starts and ends at each position in the query document, as in expression (4):
p_i^start = σ(W1·x_i + β1)
p_i^end = σ(W2·x_i + β2)
(4)
where p_i^start is the probability value that the answer in the first query result starts at position i of the corresponding query document, p_i^end is the probability value that the answer ends at that position, σ is the sigmoid function, W1, W2, β1, β2, a1 and a2 are trainable parameters, and x_i is the vector representation of the word segmentation text.
In some embodiments, the frame selection model of the application introduces an attention mechanism instead of simple pooling to integrate the output sequence obtained after convolution processing, encoding the vector sequence corresponding to the query data into an overall question vector and the sequence corresponding to the query document into an overall query vector, as in expression (5):
X̄ = Σ_i λ_i·x_i
λ_i = softmax(a^T·activation(W·x_i))
(5)
where X̄ is the output of the frame selection model, x_i is the vector corresponding to each word segmentation text, a and W are trainable parameters, λ_i is the weight of word segmentation text x_i with respect to the query data, and activation is an activation function, here chosen as tanh. As shown in the sequence integration diagram of FIG. 4, a matrix of position code + document vector + question code is obtained and used as the output of the frame selection model. In FIG. 4, a, b, c, and d represent word segmentation texts, the position code is the position of word segmentation text a in the query document, y is the question encoding of the query data, and 0.1, 0.3, 0.15, and 0.55 are the weights of the query data with respect to each word segmentation text.
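A sketch of the attention-based sequence integration of expression (5), assuming PyTorch; the masking of padded positions and all names are illustrative assumptions:

    import torch
    import torch.nn as nn

    class AttentionPooling(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.W = nn.Linear(dim, dim, bias=False)
            self.a = nn.Linear(dim, 1, bias=False)

        def forward(self, x, mask=None):                      # x: (batch, seq_len, dim)
            scores = self.a(torch.tanh(self.W(x))).squeeze(-1)   # a^T tanh(W x_i)
            if mask is not None:                              # mask: True for real tokens
                scores = scores.masked_fill(~mask, float("-inf"))
            lam = torch.softmax(scores, dim=-1)               # λ_i = softmax(...)
            return torch.sum(lam.unsqueeze(-1) * x, dim=1)    # Σ λ_i x_i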
Further, some query documents, or even all query documents, may contain no answer, i.e., the first query result may be empty. To better handle the output in this case, a control item can be set for the frame selection model to determine whether the first query result exists, i.e., whether an answer is contained in the first query result.
Specifically, an output p_global may be set, and the model determines whether there is an answer based on this output; if this part is 0, there is no answer, as in expression (6):
p_global = σ(W·o + b)
(6)
where o is the pooled sequence representation obtained as in expression (5). If there is no answer in the query document, p_global is 0, and the outputs p_i^start and p_i^end in expression (4) are all 0, i.e., there is no text segment in the current query document that matches the query data and there is no answer. This approach greatly improves the efficiency with which the frame selection model queries information.
Before the data vector sequence is input into the frame selection model, the frame selection model needs to be trained. When labeling the start and end positions of answers for the frame selection model, a binary labeling scheme is used: each character or word in the query document is classified to determine whether it is the start or the end position of the answer, and a probability value is output. For example, for the query data "What is the boiling point of water?" and the query document "At normal atmospheric pressure, the boiling point of water is 100 degrees Celsius.", the binary labels are:
Beginning: 000000000000100000
Ending:    000000000000000001
Here the first "1" indicates the beginning of the answer and the second "1" indicates its end. When the algorithm processes the answer, each character (word) receives two binary classifications: whether it is the start, and whether it is the end. Also, given that the positive and negative classes are unbalanced (there are far more 0s than 1s), focal loss can be used as the loss function, as in expression (7):
L_fl = −α·(1 − ŷ)^γ·y·log(ŷ) − (1 − α)·ŷ^γ·(1 − y)·log(1 − ŷ)
(7)
where L_fl is the loss value, α is 0.25, γ is 2, y is the actual label, and ŷ is the predicted probability.
Specifically, the output of the frame selection model has two parts, so the loss function loss1 at the start position and the loss function loss2 at the end position can be computed separately, and the total loss is then calculated as in expression (8):
loss = (loss1 + loss2) × λ
(8)
where λ is a hyper-parameter, and both loss1 at the start position and loss2 at the end position can be calculated with L_fl. To make the value of loss easier to observe, we choose λ = 200. The frame selection model is trained with the adam optimizer using a warm-start strategy and a learning rate of 10^-3; the warm_start parameter defaults to False during model training, and, understood literally, warm start means continuing to train the model from a "warm" state: if warm_start = True, training continues from the result of the previous stage, while if warm_start = False, the model is trained from scratch. The optimal frame selection model is then loaded, and the learning rate is reduced to continue training the frame selection model to the optimum.
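A sketch of the focal loss of expression (7) applied to the two outputs and combined as in expression (8), assuming PyTorch; α = 0.25, γ = 2 and λ = 200 follow the text, while the numerical clamping is an implementation detail added here:

    import torch

    ALPHA, GAMMA, LAMBDA = 0.25, 2.0, 200.0

    def focal_loss(pred, target, eps=1e-7):
        pred = pred.clamp(eps, 1.0 - eps)
        pos = -ALPHA * (1.0 - pred) ** GAMMA * target * torch.log(pred)
        neg = -(1.0 - ALPHA) * pred ** GAMMA * (1.0 - target) * torch.log(1.0 - pred)
        return (pos + neg).mean()

    def total_loss(p_start, p_end, y_start, y_end):
        loss1 = focal_loss(p_start, y_start)     # loss at the start position
        loss2 = focal_loss(p_end, y_end)         # loss at the end position
        return (loss1 + loss2) * LAMBDA          # expression (8)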
Further, to make the trained frame selection model more stable, an exponential moving average (EMA) of the weights is used, which can improve performance at almost zero additional cost.
Specifically, in the weight moving average of expression (9), θ is the maintained "shadow variable", θ′ is the newly trained weight, and a is the attenuation factor, which generally takes the value 0.999:
θ_(n+1) = a·θ_n + (1 − a)·θ′_(n+1)
(9)
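A minimal sketch of the weight moving average of expression (9), assuming PyTorch; keeping only floating-point entries of the state dict is an implementation detail added here:

    import torch

    class WeightEMA:
        """Maintains 'shadow variables' θ of the model weights, with decay a = 0.999 as in the text."""
        def __init__(self, model, decay=0.999):
            self.decay = decay
            self.shadow = {k: v.detach().clone()
                           for k, v in model.state_dict().items()
                           if v.dtype.is_floating_point}

        def update(self, model):
            # θ_(n+1) = a·θ_n + (1 - a)·θ'_(n+1), applied to every floating-point weight
            with torch.no_grad():
                for k, v in model.state_dict().items():
                    if k in self.shadow:
                        self.shadow[k].mul_(self.decay).add_(v.detach(), alpha=1.0 - self.decay)

        def copy_to(self, model):
            # load the averaged weights before evaluation or export
            model.load_state_dict(self.shadow, strict=False)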
further, because the data volume of the bank is small, sample data can be amplified to increase the diversity of the data, specifically, the sample document is cut, the cut query document is randomly spliced to obtain a new sample document, and the new sample document is added into the sample document.
And obtaining a new material (the number and the position of answers are changed) from the same section of material by repeated splicing and random cutting, doubling the internal data of the bank, and dividing a training set and a testing set according to the ratio of 8: 2.
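A sketch of this sample amplification: sample documents are cut into pieces, the pieces are randomly re-spliced into new sample documents (changing the number and positions of answers), and the result is split 8:2 into training and test sets; cutting at paragraph boundaries and the group size are illustrative assumptions:

    import random

    def augment(sample_documents, group_size=3, seed=42):
        rng = random.Random(seed)
        # cut every sample document into paragraphs
        pieces = [p for doc in sample_documents for p in doc.split("\n") if p.strip()]
        rng.shuffle(pieces)                                    # random splicing order
        new_docs = ["\n".join(pieces[i:i + group_size])        # spliced into new sample documents
                    for i in range(0, len(pieces), group_size)]
        augmented = list(sample_documents) + new_docs          # roughly doubles the data
        rng.shuffle(augmented)
        split = int(len(augmented) * 0.8)                      # 8:2 train / test split
        return augmented[:split], augmented[split:]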
Further, the probability that the position of a given word segmentation text is the start (or end) position is obtained through the sigmoid of the two outputs of the trained frame selection model, i.e., p_i^start and p_i^end as in expression (4), where σ is the sigmoid function and yields the probability value. The answer is a segment of the query document whose start and end positions are selected from these [0, 1] probabilities. How, then, is a segment determined to be an appropriate result? In this embodiment, a score combining the start and end probabilities of the segment is used as the ranking indicator. In practice, when question data are input, the frame selection model may predict over several query documents, and when multiple answers exist, the answer is selected by voting.
Further, in order to obtain the most appropriate answer from the plurality of query documents, in some embodiments, the obtained first query result is further filtered.
And step 210, screening the first query result according to a screening algorithm to obtain a final information query result.
The obtained first query result may be one or a combination of different text segments corresponding to different position codes in the same query document, the same text segment corresponding to different position codes in different query documents, or different text segments corresponding to different position codes in different query documents.
The first query result needs to be screened in order to obtain the most accurate answer. If the first query result is not empty, the position code of each text segment in the first query result is acquired; if the position codes of the text segments correspond to different query documents, the score of each text segment is calculated and the text segment with the maximum score is taken as the first query result; all the first query results are then combined to obtain the final information query result.
The position codes of the text segments corresponding to different query documents means that different text segments come from different query documents, among which there cannot be multiple standard answers, i.e., simplified answers. By re-scoring the text segments, the most suitable one can be selected from the plurality of candidate text segments.
Specifically, the score of each text segment can be obtained by acquiring the probability values of the start and end positions of the text segment in the corresponding query document and then calculating from those probability values; the text segment with the maximum score is taken as the final information query result.
In detail, expression (10) can be used to calculate the score of each text segment, where F refers to the final score of text segment v, n refers to the number of query documents in which the text segment appears, and S refers to the segment's score derived from its start and end probabilities. For example, if 10 query documents are retrieved and the text segment "100 degrees Celsius" corresponding to the query data "What is the boiling point of water?" appears in 3 of them, then n is 3.
Further, when the position codes of the text segments all correspond to the same query document, i.e., the same query document contains a plurality of different text segments, the frequency of occurrence of each text segment in the query document is counted, the text segments are sorted by frequency, and the text segment with the highest frequency is taken as the final result.
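A sketch of this screening step, assuming each candidate carries its source document and its start/end probabilities; combining the probabilities as a product and summing them across supporting documents is an illustrative assumption, since expression (10) is not reproduced here:

    from collections import defaultdict

    def screen(first_query_results):
        # first_query_results: list of (text_segment, document_id, start_prob, end_prob)
        if not first_query_results:
            return None                                    # empty result: retrain or enlarge the database
        grouped = defaultdict(list)
        for segment, doc_id, p_start, p_end in first_query_results:
            grouped[segment].append((doc_id, p_start * p_end))
        best_segment, best_score = None, float("-inf")
        for segment, hits in grouped.items():
            n = len({doc_id for doc_id, _ in hits})        # number of documents containing the segment
            score = sum(s for _, s in hits)                # voting: more supporting documents, higher score
            if n > 0 and score > best_score:
                best_segment, best_score = segment, score
        return best_segment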
Optionally, if the first query result is empty, the data screening step is skipped; the frame selection model may be retrained and answer frame selection continued with the retrained model, or the number of documents in the document database may be increased.
It should be emphasized that, in order to further ensure the privacy and security of the bank data, the query document may also be stored in a node of a block chain.
In the information query method based on the question-answering system, at least one query document whose first similarity to the query data meets the first threshold is obtained by querying the document database, which yields a screened result and reduces the amount of data computation. After word segmentation of the obtained query document, a corresponding position code is generated for each word segmentation text, summed with the question encoding, and an answer frame selection operation is performed with the custom frame selection model to obtain the first query result. To make the obtained information query result more accurate, the application further screens the first query result through a screening algorithm to obtain the final information query result. The custom frame selection model performs convolution processing on the vector sequence by adding a gate mechanism and a residual structure combined with dilated CNNs, so that the dilated-CNN-based frame selection model can capture longer distances without increasing its parameters or computation, giving the frame selection model wider coverage and solving the technical problems in the prior art of low coverage and an excessively large model. In addition, the position codes of all word segmentation texts in the query document are used as a feature input, so that the frame selection model can accurately locate the start and end positions of the answer, greatly improving the efficiency of information query.
It should be understood that, although the steps in the flowchart of fig. 2 are shown in order as indicated by the arrows, the steps are not necessarily performed in order as indicated by the arrows. The steps are not performed in the exact order shown and described, and may be performed in other orders, unless explicitly stated otherwise. Moreover, at least a portion of the steps in fig. 2 may include multiple sub-steps or multiple stages that are not necessarily performed at the same time, but may be performed at different times, and the order of performance of the sub-steps or stages is not necessarily sequential, but may be performed in turn or alternately with other steps or at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 5, an information query device based on a question-answering system is provided, and the information query device based on the question-answering system corresponds to the information query method based on the question-answering system in the above embodiments one to one. The information inquiry device based on the question-answering system comprises:
a rough query module 502, configured to query from a document database to obtain at least one query document corresponding to query data if the query data is received;
a vector module 504, configured to perform word segmentation on the query document, and perform vectorization on multiple word segmentation texts obtained by word segmentation to obtain word segmentation vectors;
the encoding module 506 is configured to input the word segmentation vectors into a frame selection model, generate position codes for the word segmentation vectors, and sum the problem codes of the query data and the word segmentation vectors after the position codes are generated to obtain a data vector sequence;
a convolution module 508, configured to perform convolution processing on the data vector sequence through the multiple expansion operation units of the frame selection model to obtain a first query result;
and a screening module 510, configured to screen the first query result according to a screening algorithm to obtain a final information query result.
Further, the rough query module 502 includes:
the dividing submodule is used for dividing the document to be inquired in the document database to obtain at least one text paragraph;
the occupation ratio submodule is used for calculating a first occupation ratio of stop words in each text paragraph according to the stop word list; and are
The similarity submodule is used for calculating a first similarity between the query data and the text paragraph corresponding to the minimum first proportion according to the BM25 algorithm;
and the query submodule is used for taking the document to be queried where the text paragraphs with the first similarity larger than the first threshold value are as the queried document obtained by querying.
Further, the encoding module 506 includes:
the vector submodule is used for vectorizing the query data to obtain a corresponding problem code;
and the summation submodule is used for acquiring the position code of each participle vector, and summing the problem code and the participle vector subjected to the position code to obtain a data vector sequence.
Further, the convolution module 508 includes:
and the convolution submodule is used for performing convolution processing on the data vector sequence at least twice through expansion CNN, and processing the convolution result obtained each time based on a door mechanism and a residual error structure to obtain the first query result.
Further, the convolution submodule includes:
the correlation unit is used for performing convolution processing on each text sequence in the data vector sequence through expansion CNNs of a plurality of different convolution kernels to obtain semantic correlation among each participle text in the text sequence, wherein the text sequence is represented by a vector of a text sentence segment in the query document;
the weighting unit is used for generating question weights corresponding to the query data for each participle text according to an attention mechanism based on the semantic relevance;
and the integration unit is used for integrating the question weight, the participle text corresponding to the question weight and the position code of the participle text to obtain the first query result, wherein the first query result comprises a text segment and the position code of the text segment in the corresponding query document.
Further, the screening module 510 includes:
the position sub-module is used for acquiring the position code of each text segment in the first query result if the first query result is not empty;
the score sub-module is used for calculating the score of each text segment if the position code of each text segment corresponds to different query documents, and taking the text segment corresponding to the maximum score as the first query result;
and the synthesis sub-module is used for synthesizing each first query result to obtain a final information query result.
Further, the score submodule includes:
a probability unit, configured to obtain probability values of starting positions and ending positions of the text segments in the corresponding query documents;
the score unit is used for calculating the score of each text segment according to the probability value;
and the screening unit is used for taking the text segment corresponding to the maximum score as a final information query result.
It should be emphasized that, in order to further ensure the privacy and security of the bank data, the query document may also be stored in a node of a block chain.
According to the information query device based on the question-answering system, at least one query document whose first similarity to the query data meets the first threshold is obtained by querying the document database, which yields a screened result and reduces the amount of data computation. After word segmentation of the obtained query document, a corresponding position code is generated for each word segmentation text, summed with the question encoding, and an answer frame selection operation is performed with the custom frame selection model to obtain the first query result. To make the obtained information query result more accurate, the device further screens the first query result through a screening algorithm to obtain the final information query result. The custom frame selection model performs convolution processing on the vector sequence by adding a gate mechanism and a residual structure combined with dilated CNNs, so that the dilated-CNN-based frame selection model can capture longer distances without increasing its parameters or computation, giving the frame selection model wider coverage and solving the technical problems in the prior art of low coverage and an excessively large model. In addition, the position codes of all word segmentation texts in the query document are used as a feature input, so that the frame selection model can accurately locate the start and end positions of the answer, greatly improving the efficiency of information query.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 6. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer readable instructions, and a database. The internal memory provides an environment for the operating system and execution of computer-readable instructions in the non-volatile storage medium. The database of the computer device is used to store query documents. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer readable instructions are executed by a processor to realize an information query method based on a question-answering system.
In this embodiment, at least one query document whose first similarity to the query data meets the first threshold is obtained by querying the document database, so that a screening result is obtained and the amount of data calculation is reduced. After word segmentation is performed on the obtained query document, a corresponding position code is generated for each word segmentation text, the position codes are summed with the question data, and the answer frame selection operation is performed according to a self-defined frame selection model to obtain the first query result. In order to make the obtained information query result more accurate, the present application further screens the first query result through a screening algorithm to obtain the final information query result. The self-defined frame selection model performs convolution processing on the vector sequence by adding a gate mechanism and a residual structure in combination with the expanded CNN, so that the frame selection model based on the expanded CNN can capture longer distances without increasing the model parameters or the amount of calculation, and the coverage of the frame selection model is wider, which solves the technical problems of low coverage and oversized models in the prior art. In addition, the position code of each word segmentation text in the query document is used as a feature of the input, so that the frame selection model can accurately locate the start position and the end position of the answer, which greatly improves the efficiency of information query.
As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
In one embodiment, a computer-readable storage medium is provided, on which computer-readable instructions are stored; when executed by a processor, the computer-readable instructions implement the steps of the information query method based on the question-answering system in the above embodiment, such as steps 202 to 210 shown in fig. 2, or implement the functions of the modules/units of the information query device based on the question-answering system in the above embodiment, such as modules 502 to 510 shown in fig. 5.
In this embodiment, at least one query document whose first similarity to the query data meets the first threshold is obtained by querying the document database, so that a screening result is obtained and the amount of data calculation is reduced. After word segmentation is performed on the obtained query document, a corresponding position code is generated for each word segmentation text, the position codes are summed with the question data, and the answer frame selection operation is performed according to a self-defined frame selection model to obtain the first query result. In order to make the obtained information query result more accurate, the present application further screens the first query result through a screening algorithm to obtain the final information query result. The self-defined frame selection model performs convolution processing on the vector sequence by adding a gate mechanism and a residual structure in combination with the expanded CNN, so that the frame selection model based on the expanded CNN can capture longer distances without increasing the model parameters or the amount of calculation, and the coverage of the frame selection model is wider, which solves the technical problems of low coverage and oversized models in the prior art. In addition, the position code of each word segmentation text in the query document is used as a feature of the input, so that the frame selection model can accurately locate the start position and the end position of the answer, which greatly improves the efficiency of information query.
It will be understood by those of ordinary skill in the art that all or part of the processes of the methods of the above embodiments can be implemented by instructing the relevant hardware through computer-readable instructions, which can be stored in a non-volatile computer-readable storage medium and which, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks generated in association using cryptographic methods, where each data block contains information about a batch of network transactions and is used to verify the validity (anti-counterfeiting) of the information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction between the combinations of these technical features, they should be considered as within the scope of this specification.
The above-mentioned embodiments only express several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that, for those skilled in the art, several changes, modifications and equivalent substitutions of some technical features may be made without departing from the spirit and scope of the present invention, and these changes or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. An information query method based on a question-answering system is characterized by comprising the following steps:
if receiving query data, querying from a document database to obtain at least one query document corresponding to the query data;
performing word segmentation on the query document, and performing vectorization on a plurality of word segmentation texts obtained by word segmentation to obtain word segmentation vectors;
inputting the word segmentation vectors into a frame selection model, and obtaining a data vector sequence based on the query data;
performing convolution processing on the data vector sequence through a plurality of expansion operation units of the frame selection model to obtain a first query result;
and screening the first query result according to a screening algorithm to obtain a final information query result.
2. The method of claim 1, wherein querying from a document database to obtain at least one query document corresponding to the query data comprises:
dividing a document to be queried in a document database to obtain at least one text paragraph;
calculating a first ratio of stop words in each text paragraph according to a stop word list; and calculating a first similarity between the query data and the text paragraph corresponding to the minimum first ratio according to a BM25 algorithm;
and taking the document to be queried in which the text paragraph with the first similarity greater than the first threshold is located as the query document obtained by the query.
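As a rough illustration of the screening recited in claim 2, the sketch below scores each candidate document by the BM25 similarity between the query and that document's paragraph with the lowest stop-word ratio. The rank_bm25 package, the whitespace tokenizer, the placeholder stop-word list and the paragraph-splitting rule are all assumptions not taken from the patent.

```python
# Hypothetical sketch of the coarse screening in claim 2.
from typing import List
from rank_bm25 import BM25Okapi

STOP_WORDS = {"the", "a", "of", "is", "and"}          # placeholder stop-word list

def stop_word_ratio(tokens: List[str]) -> float:
    return sum(t in STOP_WORDS for t in tokens) / max(len(tokens), 1)

def screen_documents(query: str, documents: List[str], threshold: float) -> List[str]:
    # 1. split each document into paragraphs and keep the paragraph with the
    #    minimum stop-word ratio as that document's representative
    representatives = []
    for doc in documents:
        paragraphs = [p.split() for p in doc.lower().split("\n\n") if p.strip()]
        representatives.append(min(paragraphs, key=stop_word_ratio) if paragraphs else [])
    # 2. BM25 similarity between the query data and every representative paragraph
    bm25 = BM25Okapi([r if r else ["<empty>"] for r in representatives])
    scores = bm25.get_scores(query.lower().split())
    # 3. keep the documents whose representative paragraph exceeds the threshold
    return [doc for doc, s in zip(documents, scores) if s > threshold]
```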
3. The method of claim 1, wherein the inputting the word segmentation vector into a frame selection model, and obtaining a data vector sequence based on the query data comprises:
vectorizing the query data to obtain a corresponding question code;
and acquiring a position code of each word segmentation vector, and summing the question code and the position-coded word segmentation vectors to obtain the data vector sequence.
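A minimal NumPy sketch of the summation recited in claim 3 is given below. The sinusoidal position code and the mean-pooled question code are assumptions; the claim only requires that each word segmentation vector carries a position code and is summed with a question code derived from the query data.

```python
# Hypothetical sketch of claim 3: position code + question code summation.
import numpy as np

def position_codes(seq_len: int, dim: int) -> np.ndarray:
    pos = np.arange(seq_len)[:, None]                          # (seq_len, 1)
    i = np.arange(dim)[None, :]                                # (1, dim)
    angles = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))  # (seq_len, dim)

def build_data_vector_sequence(word_vectors: np.ndarray,
                               question_vectors: np.ndarray) -> np.ndarray:
    # word_vectors: (seq_len, dim) word segmentation vectors of the query document
    # question_vectors: (q_len, dim) vectors of the query data
    question_code = question_vectors.mean(axis=0)              # (dim,) assumed pooling
    positioned = word_vectors + position_codes(*word_vectors.shape)
    return positioned + question_code                          # broadcast over seq_len

seq = build_data_vector_sequence(np.random.randn(50, 128), np.random.randn(7, 128))
```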
4. The method of claim 1, wherein the convolving the data vector sequence through the plurality of expansion operation units of the frame selection model to obtain the first query result comprises:
and performing convolution processing on the data vector sequence at least twice through an expanded CNN, and processing each convolution result based on a gate mechanism and a residual structure to obtain the first query result.
5. The method according to claim 4, wherein the performing convolution processing on the data vector sequence at least twice through the expanded CNN and processing each convolution result based on the gate mechanism and the residual structure to obtain the first query result comprises:
performing convolution processing on each text sequence in the data vector sequence through expanded CNNs with a plurality of different convolution kernels to obtain the semantic relevance among the word segmentation texts in the text sequence, wherein the text sequence is a vector representation of a text sentence segment in the query document;
generating, based on the semantic relevance and according to an attention mechanism, a question weight corresponding to the query data for each word segmentation text;
and integrating the question weights, the word segmentation texts corresponding to the question weights, and the position codes of the word segmentation texts to obtain the first query result, wherein the first query result comprises a text segment and the position code of the text segment in the corresponding query document.
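To make the attention step of claim 5 concrete, the sketch below turns the convolved token features into question weights with a dot-product attention and then frames the contiguous span whose total weight is largest. The dot-product scoring, the softmax, and the fixed window length are illustrative assumptions, not the claimed procedure itself.

```python
# Hypothetical sketch of the question-weight and framing step of claim 5.
import numpy as np

def question_weights(token_features: np.ndarray, question_code: np.ndarray) -> np.ndarray:
    # token_features: (seq_len, dim) output of the expanded convolutions
    # question_code:  (dim,) encoding of the query data
    scores = token_features @ question_code                    # (seq_len,)
    scores -= scores.max()                                     # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()                             # softmax attention weights

def frame_answer(tokens, weights: np.ndarray, span_len: int = 5):
    # slide a fixed-length window and keep the span with the largest total weight
    span_len = min(span_len, len(tokens))
    totals = [weights[i:i + span_len].sum() for i in range(len(tokens) - span_len + 1)]
    start = int(np.argmax(totals))
    end = start + span_len
    return " ".join(tokens[start:end]), (start, end)           # text segment + position code
```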
6. The method of claim 1, wherein the screening the first query result according to a screening algorithm to obtain a final information query result comprises:
if the first query result is not empty, acquiring the position code of each text segment in the first query result;
if the position codes of the text segments correspond to different query documents, calculating the scores of the text segments, and taking the text segment corresponding to the maximum score as the first query result;
and synthesizing all the first query results to obtain a final information query result.
7. The method of claim 6, wherein said calculating a score for each of said text segments comprises:
obtaining probability values of the start positions and the end positions of the text segments in the corresponding query documents;
calculating the score of each text segment according to the probability value;
and taking the text segment corresponding to the maximum score as a final information query result.
8. An information inquiry device based on a question-answering system is characterized by comprising:
the rough query module is used for querying from a document database to obtain at least one query document corresponding to the query data if the query data is received;
the vector module is used for performing word segmentation processing on the query document and performing vectorization processing on a plurality of word segmentation texts obtained by word segmentation to obtain word segmentation vectors;
the coding module is used for inputting the word segmentation vectors into a frame selection model, generating a position code for each word segmentation vector, and summing the question code of the query data and the position-coded word segmentation vectors to obtain a data vector sequence;
the convolution module is used for performing convolution processing on the data vector sequence through a plurality of expansion operation units of the frame selection model to obtain a first query result;
and the screening module is used for screening the first query result according to a screening algorithm to obtain a final information query result.
9. A computer device comprising a memory and a processor, the memory storing computer readable instructions, wherein the processor when executing the computer readable instructions implements the steps of the method of any one of claims 1 to 7.
10. A computer readable storage medium having computer readable instructions stored thereon, which when executed by a processor implement the steps of the method of any one of claims 1 to 7.
CN202011590805.8A 2020-12-29 2020-12-29 Information query method and device based on question-answering system, computer equipment and medium Pending CN112632256A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011590805.8A CN112632256A (en) 2020-12-29 2020-12-29 Information query method and device based on question-answering system, computer equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011590805.8A CN112632256A (en) 2020-12-29 2020-12-29 Information query method and device based on question-answering system, computer equipment and medium

Publications (1)

Publication Number Publication Date
CN112632256A true CN112632256A (en) 2021-04-09

Family

ID=75285944

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011590805.8A Pending CN112632256A (en) 2020-12-29 2020-12-29 Information query method and device based on question-answering system, computer equipment and medium

Country Status (1)

Country Link
CN (1) CN112632256A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2015133069A (en) * 2014-01-15 2015-07-23 日本放送協会 Image processor, image processing program and imaging device
WO2020215236A1 (en) * 2019-04-24 2020-10-29 哈尔滨工业大学(深圳) Image semantic segmentation method and system
CN111241244A (en) * 2020-01-14 2020-06-05 平安科技(深圳)有限公司 Big data-based answer position acquisition method, device, equipment and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Su Jianlin: "A CNN-based reading-comprehension question-answering model: DGCNN", pages 1 - 9, Retrieved from the Internet <URL:https://zhuanlan.zhihu.com/p/35755367> *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112949637A (en) * 2021-05-14 2021-06-11 中南大学 Bidding text entity identification method based on IDCNN and attention mechanism
CN117271755A (en) * 2023-11-21 2023-12-22 青岛海尔乐信云科技有限公司 Custom closed-loop rule engine management control method based on artificial intelligence
CN117271755B (en) * 2023-11-21 2024-03-08 青岛海尔乐信云科技有限公司 Custom closed-loop rule engine management control method based on artificial intelligence

Similar Documents

Publication Publication Date Title
US10832139B2 (en) Neural network acceleration and embedding compression systems and methods with activation sparsification
CN110598206B (en) Text semantic recognition method and device, computer equipment and storage medium
Swathi et al. An optimal deep learning-based LSTM for stock price prediction using twitter sentiment analysis
US11501080B2 (en) Sentence phrase generation
CN111161740A (en) Intention recognition model training method, intention recognition method and related device
CN109493199A (en) Products Show method, apparatus, computer equipment and storage medium
US11074412B1 (en) Machine learning classification system
US11720761B2 (en) Systems and methods for intelligent routing of source content for translation services
CN112085565B (en) Deep learning-based information recommendation method, device, equipment and storage medium
KR102155768B1 (en) Method for providing question and answer data set recommendation service using adpative learning from evoloving data stream for shopping mall
CN111666376B (en) Answer generation method and device based on paragraph boundary scan prediction and word shift distance cluster matching
US11694034B2 (en) Systems and methods for machine-learned prediction of semantic similarity between documents
CN111695591A (en) AI-based interview corpus classification method, device, computer equipment and medium
CN115269512A (en) Object recommendation method, device and storage medium for realizing IA by combining RPA and AI
CN112632256A (en) Information query method and device based on question-answering system, computer equipment and medium
CN111582932A (en) Inter-scene information pushing method and device, computer equipment and storage medium
CN109360072B (en) Insurance product recommendation method and device, computer equipment and storage medium
CN109344246B (en) Electronic questionnaire generating method, computer readable storage medium and terminal device
CN112801425B (en) Method and device for determining information click rate, computer equipment and storage medium
CN114117048A (en) Text classification method and device, computer equipment and storage medium
CN116089605A (en) Text emotion analysis method based on transfer learning and improved word bag model
CN114328894A (en) Document processing method, document processing device, electronic equipment and medium
CN112115258B (en) Credit evaluation method and device for user, server and storage medium
CN113869068A (en) Scene service recommendation method, device, equipment and storage medium
CN113886539A (en) Method and device for recommending dialect, customer service equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination