CN112883182A - Question-answer matching method and device based on machine reading - Google Patents


Info

Publication number
CN112883182A
Application CN202110244992.2A · Publication CN112883182A
Authority
CN
China
Prior art keywords
question
text
answer
associated document
real
Prior art date
Legal status
Pending
Application number
CN202110244992.2A
Other languages
Chinese (zh)
Inventor
李俊彦
芮智琦
柳志德
Current Assignee
Hisense Electronic Technology Wuhan Co ltd
Original Assignee
Hisense Electronic Technology Wuhan Co ltd
Priority date
Filing date
Publication date
Application filed by Hisense Electronic Technology Wuhan Co ltd filed Critical Hisense Electronic Technology Wuhan Co ltd
Priority to CN202110244992.2A priority Critical patent/CN112883182A/en
Publication of CN112883182A publication Critical patent/CN112883182A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The embodiments of the present application provide a question-answer matching method and device based on machine reading. The method comprises the following steps: after the user's question text is obtained, an associated document text related to the question text is retrieved from a real-time database. The question text and the associated document text are input into a trained reading comprehension model, which determines the starting position and ending position of the answer, thereby yielding the question answer matched to the question text. Because the document texts in the real-time database are all unstructured text data collected from a big data platform in real time, the candidate question-answer texts can be expanded in a timely manner. When the user's question concerns a real-time hot topic, or even a breaking news event, this question-answer matching method can reply to the user promptly, thereby improving the user experience.

Description

Question-answer matching method and device based on machine reading
Technical Field
The application relates to the technical field of machine reading, in particular to a question and answer matching method and device based on machine reading.
Background
With the development of artificial intelligence, machine reading comprehension technology has come into wide use. It is applied in scenarios such as web search, question-answering robots, and intelligent voice assistants, and smart devices such as smart televisions and smart speakers now offer question-answering functions.
In current smart devices, question-answering frameworks based on retrieval and on knowledge graphs are the most widely used.
However, such retrieval-based and knowledge-graph-based frameworks rely on collecting data as structured-format text, and therefore depend on operators to expand the data in a timely manner. This is time-consuming and labor-intensive, and if the data is not expanded promptly, the question-answering system may respond late, or not at all, to questions about real-time hot topics, resulting in a poor user experience.
Disclosure of Invention
To solve the problems that retrieval-based and knowledge-graph-based question-answering frameworks depend on operators to expand data in a timely manner, that this is time-consuming and labor-intensive, and that untimely data expansion leaves the question-answering system unable to reply promptly, or at all, to real-time hot questions, resulting in a poor user experience, the present application provides a question-answer matching method and device based on machine reading.
In a first aspect, an embodiment of the present application provides a question-answer matching method based on machine reading, including:
acquiring a question text of a user;
retrieving an associated document text related to the question text from a real-time database, wherein the document texts stored in the real-time database are all unstructured text data collected in real time from a big data platform;
inputting the question text and the associated document text into a trained reading understanding model, determining a starting position and an ending position of an answer corresponding to the question text in the associated document text, and determining a text between the starting position and the ending position as a question answer matched with the question text.
In a second aspect, an embodiment of the present application provides a question-answer matching device based on machine reading, including:
the acquisition module is configured to acquire a question text of a user;
a lookup module configured to: retrieve an associated document text related to the question text from a real-time database, wherein the document texts stored in the real-time database are all unstructured text data collected in real time from a big data platform;
a location determination module configured to: inputting the question text and the associated document text into a trained reading understanding model, determining a starting position and an ending position of an answer corresponding to the question text in the associated document text, and determining a text between the starting position and the ending position as a question answer matched with the question text.
The technical scheme provided by the application comprises the following beneficial effects: and after the problem text of the user is obtained, searching the associated document text related to the problem text from the real-time database. After the question text and the associated document text are input into a trained reading understanding model, the starting position and the ending position of the answer are determined, and therefore the question answer matched with the question text is obtained. The document texts in the real-time database are unstructured text data acquired from a big data platform in real time, so that question and answer candidate texts can be expanded in time. When the question text of the user is related to the real-time hot spot, even the breaking news time, the question-answer matching method can be used for replying the user in time, and therefore the use experience of the user is improved.
Drawings
In order to explain the technical solution of the present application more clearly, the drawings needed in the embodiments are briefly described below. It is obvious that other drawings can be derived from these drawings by those skilled in the art without creative effort.
Fig. 1 is a schematic flowchart illustrating a question-answer matching method based on machine reading according to an embodiment of the present disclosure;
FIG. 2 is a block diagram of a machine-reading based question-answer matching system according to an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating a real-time data collection process provided by an embodiment of the present application;
fig. 4 is a block diagram illustrating a question-answer matching device based on machine reading according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Reference throughout this specification to "embodiments," "some embodiments," "one embodiment," or "an embodiment," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. Thus, appearances of the phrases "in various embodiments," "in some embodiments," "in at least one other embodiment," or "in an embodiment" or the like throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. Thus, the particular features, structures, or characteristics shown or described in connection with one embodiment may be combined, in whole or in part, with the features, structures, or characteristics of one or more other embodiments, without limitation. Such modifications and variations are intended to be included within the scope of the present application.
For the purpose of clearly illustrating the embodiments of the present application, some explanations of related terms are given below.
Machine reading comprehension technology: the ability of a computer to read natural-language text, understand it as a person would, and then accurately answer questions about its content.
Inverted-index technology based on ElasticSearch: mature inverted-index and semantic-retrieval techniques are mainly used. ElasticSearch is a distributed, scalable, real-time search and analytics engine built on top of the full-text search library Apache Lucene (an open-source search project). ElasticSearch provides distributed real-time document storage, indexing every field so that it can be searched, and serves as a distributed search engine for real-time analysis. It can scale out to hundreds of servers and handle petabytes of structured or unstructured data.
An inverted index is also called a reverse index, in contrast to a forward index. A forward index looks up a value (the target text) by key; an inverted index looks up keys by value. In the data structure of an inverted index, each word is followed by a posting list that stores the ids of the documents containing that word. With this structure, all documents containing a given word can be found quickly. Finally, intersecting the document sets of all the words in a sentence yields the set of documents associated with that sentence.
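The lookup-and-intersect procedure described above can be sketched in a few lines of Python. This is a toy illustration of the data structure only, not the ElasticSearch implementation; the sample documents and whitespace tokenisation are illustrative assumptions:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.split():
            index[token].add(doc_id)
    return index

def lookup(index, query_tokens):
    """Intersect the posting sets of all query tokens."""
    postings = [index.get(t, set()) for t in query_tokens]
    if not postings:
        return set()
    result = set(postings[0])
    for p in postings[1:]:
        result &= p
    return result

docs = ["maradona dies at sixty",
        "argentina mourns maradona",
        "new song released today"]
index = build_inverted_index(docs)
single = lookup(index, ["maradona"])             # documents 0 and 1
joint = lookup(index, ["maradona", "argentina"])  # only document 1
```

The intersection at the end is exactly the "documents associated with the sentence" step described above.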
Core word extraction technology based on ELMO and SIF: the basic idea of ELMO (Embeddings from Language Models) is to pre-train a language-model objective on a large amount of text using a bidirectional LSTM (Long Short-Term Memory) structure and obtain word vectors from the LSTM layers. The lower LSTM layer captures simpler syntactic information, while the upper layer captures context-dependent semantic information. For downstream tasks, the vectors of the different layers are linearly combined and then used in supervised learning.
The main goal of ELMO is to capture the different features of a word (syntax and semantics) and to resolve polysemy. In static embeddings, by contrast, words and vectors are in one-to-one correspondence and do not change with part of speech or sense. When the same word appears at two different positions in the ELMO input, the two output vectors, each a linear combination of the outputs of the 2-layer LSTM, are different, because ELMO conditions on the context of the input sentence.
The ELMO network consists of a single input layer and two bidirectional LSTMs, where the input layer can be regarded as an embedding layer. In ELMO this embedding layer is obtained by character convolution rather than by a word-embedding matrix lookup. Since the outputs of the LSTM layers may be differently distributed, Layer Normalization is applied after the output of each bidirectional LSTM, and a residual connection is added between the two bidirectional LSTM layers.
Weighted bag-of-words model SIF (Smooth Inverse Frequency): a bag-of-words model ignores the contextual relationships between words in a text and considers only word weights (related to the frequency with which words appear in the text); each word is treated independently, as if all words were put in a bag. The weighted bag-of-words model adds to this a method called smooth inverse frequency, which computes a weighting coefficient for each word.
The SIF computation has two steps. First, the weighting coefficient of each word is computed by the smooth-inverse-frequency method: the lower a word's frequency, the greater its importance in the sentence and the larger its weighting coefficient. Second, the common information shared by all sentences (i.e., the less important words) is removed according to these coefficients; the retained sentence vector then represents the sentence and is distinguishable from other sentence vectors.
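The two SIF steps can be sketched as follows. This is a minimal sketch assuming pre-computed word counts and word vectors are available; the smoothing constant `a = 1e-3` is the value commonly used in the SIF literature, and the common-component removal uses the first right-singular vector of the sentence-vector matrix:

```python
import numpy as np

def sif_weights(word_counts, a=1e-3):
    """Step 1: smooth inverse frequency a/(a + p(w)); rarer words weigh more."""
    total = sum(word_counts.values())
    return {w: a / (a + c / total) for w, c in word_counts.items()}

def sif_sentence_vectors(sentences, word_vecs, weights):
    """Step 2: weighted-average word vectors, then remove the common component."""
    vecs = np.array([
        np.mean([weights[w] * word_vecs[w] for w in s], axis=0)
        for s in sentences
    ])
    # first right-singular vector = direction shared by all sentence vectors
    u = np.linalg.svd(vecs, full_matrices=False)[2][0]
    return vecs - vecs @ np.outer(u, u)
```

After the common component is projected out, the remaining vectors emphasize what makes each sentence different from the others, which is what the core-word extraction relies on.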
Word segmentation technology: a word is the smallest meaningful language unit that can stand on its own. Word segmentation is the first step of natural language processing and a core technique. Unlike English, where words are separated by spaces or punctuation, word boundaries in Chinese sentences are hard to delimit. Mainstream word segmentation techniques currently fall into three categories: rule-based, statistics-based, and understanding-based. For example, a rule-based technique segments words using a forward maximum matching algorithm over a dictionary.
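The rule-based forward maximum matching approach mentioned above can be sketched as follows. The dictionary contents and the maximum word length of 4 characters are illustrative assumptions:

```python
def fmm_segment(text, vocab, max_word_len=4):
    """Forward maximum matching: at each position greedily take the longest
    dictionary word; fall back to a single character when nothing matches."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(max_word_len, len(text) - i), 0, -1):
            word = text[i:i + j]
            if j == 1 or word in vocab:
                tokens.append(word)
                i += j
                break
    return tokens

# Segmenting the example question from this application with a toy dictionary.
vocab = {"马拉多纳", "怎么", "去世"}
tokens = fmm_segment("马拉多纳怎么去世的", vocab)
```

Greedy longest-match is simple and fast, but its accuracy depends entirely on dictionary coverage, which is why the application pairs it with statistical core-word extraction.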
At present, most smart devices have a question-answering function, typically based on knowledge graphs or on retrieval. Both knowledge-graph and retrieval-based question answering rely on collecting data as structured-format text, and therefore depend on operators to expand the data in a timely manner. This is time-consuming and labor-intensive, and if the data is not expanded promptly, the question-answering system may reply late, or not at all, to real-time hot questions, resulting in a poor user experience.
To solve these problems, the present application collects hot news events and popular website data in real time through a big data platform, expands the candidate question-answer document texts in a timely manner, analyzes the candidate documents expanded in real time, and obtains answers from them, thereby avoiding untimely replies to real-time hot questions.
As shown in the flowchart of the machine-reading-based question-answer matching method in fig. 1 and the framework diagram of the machine-reading-based question-answer matching system in fig. 2, the method includes the following steps:
step S101, obtaining a question text input by a user.
Illustratively, the user enters the question text "how did Maradona die", either directly or through recognition of speech data input by the user.
Step S102, searching the relevant document text related to the question text from the real-time database.
As shown in the flowchart of real-time data collection in fig. 3, the document texts stored in the real-time database are all unstructured text data obtained from the big data platform in real time. Specifically, the big data platform may comprise various major websites, such as popular information sites like Weibo, Baidu, and Tencent. In the present application, hot search terms can be obtained by a crawler program from the Weibo hot-search list and the Baidu hot-search list.
Illustratively, keyword entries such as "Maradona's death" and "Wang Yibo releases a new song" are obtained from the Weibo hot-search list, and the corresponding news data for these keyword entries is then retrieved from the Baidu information site and the Tencent news site.
Corresponding web-page collection tools can be written to gather data from these big data platforms at regular intervals, for example fetching network hot-news data every 10 minutes, and then transmit the collected data to an online data service engine. The question-answer matching process can obtain the network hot-news data in real time through the online data service engine, thereby achieving timely replies to real-time hot questions.
In some embodiments, the collected network hot-news data can be analyzed and organized. For example, taking the current time as a reference, news data older than one month is filtered out; characters irrelevant to the news content (such as reporter and editor bylines) are removed; sensitive words are filtered out (according to a pre-collected sensitive-word lexicon); and data segmentation, word segmentation, core word extraction, and the like are performed. The processed news data helps improve retrieval accuracy and efficiency.
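The filtering steps above can be sketched as follows. The byline pattern, record field names, and one-month cutoff are illustrative assumptions, not the application's actual implementation:

```python
import re
from datetime import datetime, timedelta

def clean_news(items, sensitive_words, now, max_age_days=30):
    """Drop stale items, strip reporter/editor bylines, mask sensitive words."""
    cutoff = now - timedelta(days=max_age_days)
    cleaned = []
    for item in items:
        if item["time"] < cutoff:
            continue  # older than the cutoff: filtered out
        # remove byline noise irrelevant to the news content
        text = re.sub(r"\(Reporter[^)]*\)", "", item["text"])
        # mask words found in the sensitive-word lexicon
        for w in sensitive_words:
            text = text.replace(w, "*" * len(w))
        cleaned.append({**item, "text": text.strip()})
    return cleaned
```

Word segmentation and core word extraction would then run on the cleaned `text` fields before indexing.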
In some embodiments, before the associated document text related to the question text is looked up in the real-time database, the method further comprises: performing word segmentation and core word extraction on the question text to obtain question core words. The associated document text obtained is then text related to the question core words.
Illustratively, for the question text "how did Maradona die", word segmentation is performed first, yielding the tokens "Maradona / how / die / of". Core word extraction is then applied to these tokens using the ELMO and SIF algorithms, yielding the question core words "Maradona, death". Other core-word extraction algorithms, such as TF-IDF (Term Frequency-Inverse Document Frequency), could also be used, but the ELMO and SIF algorithms extract core words more accurately.
In some embodiments, after the problem core word is extracted, it may be further determined whether the problem core word includes a sensitive word, and if not, a subsequent matching search stage is performed. And if the sensitive words are contained, further performing sensitive word filtering processing on the question core words.
After the question core words are extracted, the associated document texts related to them, namely the relevant news data, are looked up in the real-time database.
Illustratively, unstructured text data related to the question core words "Maradona, death" can be found in the real-time database using the ElasticSearch-based inverted-index technique.
In some embodiments, multiple associated document texts related to the question text may be found in the real-time database. These can be scored for relevance, for example using the TF-IDF algorithm, to obtain a score ranking, and the highest-scoring associated document text is taken as the one in which to search for the answer. This further improves the accuracy of answer search.
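The TF-IDF relevance ranking of candidate documents can be sketched as follows. This is a minimal sketch using a smoothed idf, and it assumes the documents have already been tokenised by the word-segmentation step:

```python
import math
from collections import Counter

def tfidf_score(query_terms, doc_tokens, corpus_tokens):
    """Sum of smoothed tf-idf weights of the query terms in one document."""
    n = len(corpus_tokens)
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        df = sum(1 for toks in corpus_tokens if term in toks)
        idf = math.log((n + 1) / (df + 1)) + 1  # smoothed idf
        score += tf[term] / len(doc_tokens) * idf
    return score

def rank_documents(query_terms, corpus_tokens):
    """Return candidate documents sorted by relevance, best first."""
    return sorted(corpus_tokens,
                  key=lambda d: tfidf_score(query_terms, d, corpus_tokens),
                  reverse=True)
```

The highest-ranked document is the one handed to the reading comprehension model in the next step.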
At present, the deep learning field widely solves problems by fine-tuning pre-trained models. This involves two steps: first, a good model is pre-trained on a large dataset; second, the pre-trained model is adapted to a different task, i.e., fine-tuned on the dataset of the new task. The advantage is that, starting from a well-trained model, the downstream task converges faster, the training cost is lower, and model performance is effectively improved.
The reading comprehension model in the embodiments of the present application may be a reading comprehension model fine-tuned from the ALBERT model. ALBERT (A Lite BERT) is a lightweight bidirectional Transformer encoder. Like BERT, its encoder masks part of the input so that the Transformer encoder learns the mutual information between words at different positions. Two tasks are introduced in pre-training: the MLM task (predicting the masked tokens) and the NSP task (given two concatenated sentences, predicting whether the second follows the first). A reading comprehension model is then introduced during fine-tuning to perform machine reading comprehension; the introduced model may be a BiDAF model.
Step S103: after the question core words and the associated document text are input into the fine-tuned reading comprehension model, the starting position and the ending position of the answer in the associated document text can be determined. The text between the starting position and the ending position is the final answer.
Specifically, determining the starting and ending positions of the answer in the associated text with the reading comprehension model comprises the following steps:
The question core words and the associated document text are concatenated in order and input into the reading comprehension model fine-tuned from the ALBERT model. The vector representation of the concatenated input is obtained, and the probability distributions of the answer's starting and ending positions are obtained through a fully connected layer, expressed as: logits = Wx + b, where W is the weight matrix of the fully connected layer, x is the hidden vector output by the encoder, and b is the bias.
The model can output the starting probability that each word in the associated document text is the starting position of the answer, and the formula is as follows:
P_begin = softmax(start_logits)
wherein softmax is a normalization function.
And simultaneously, the end probability that each word in the associated document text is the answer end position can be output, and the formula is as follows:
P_end = softmax(end_logits)
Finally, the word whose starting probability attains the maximum of the product of the starting and ending probabilities is taken as the starting position of the answer, and the word whose ending probability attains that maximum is taken as the ending position. The text between the starting position and the ending position is the final answer to the question.
Illustratively, for the question text "how did Maradona die", the associated document text found in the real-time database reads: "World football mourns the Argentine star Maradona, one of its most outstanding figures. On the morning of the 25th local time, the Argentine football star Maradona died suddenly at his home in Argentina, at the age of 60." After the question text and this associated document text are input into the reading comprehension model of the above embodiment, the combination of values maximizing the product of starting and ending probabilities is {start: 42, prob: 0.9, end: 62, prob: 0.8}. The answer is therefore determined to start at the 42nd word and end at the 62nd word, and the text between them is the final answer: "Maradona died of sudden cardiac death at his home in Argentina".
In some embodiments, if the maximum product of the starting and ending probabilities is greater than or equal to a preset product threshold, the corresponding word positions are taken as the starting and ending positions of the answer. If the maximum product is smaller than the preset threshold, the user's question may not yet have been covered by related news reports, and the user can be prompted to try question-answer matching again, thereby improving the accuracy of question-answer matching.
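The start/end selection and the product-threshold check described above can be sketched together as follows. This is a toy sketch operating on raw logits; in the actual system the logits come from the fully connected layer of the fine-tuned ALBERT model:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def extract_answer(tokens, start_logits, end_logits, threshold=0.0):
    """Choose the span maximising P_begin(s) * P_end(e) with s <= e;
    return None when the best product falls below the confidence threshold."""
    p_begin = softmax(np.asarray(start_logits, dtype=float))
    p_end = softmax(np.asarray(end_logits, dtype=float))
    best, span = -1.0, None
    for s in range(len(tokens)):
        for e in range(s, len(tokens)):
            prod = p_begin[s] * p_end[e]
            if prod > best:
                best, span = prod, (s, e)
    if best < threshold:
        return None  # not confident enough: prompt the user to rephrase
    return "".join(tokens[span[0]:span[1] + 1])
```

The `s <= e` constraint keeps the span well-formed, and returning `None` below the threshold corresponds to prompting the user to try question-answer matching again.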
In some embodiments, if the text length of the final question answer is less than or equal to a preset length threshold, the answer is shown directly on the display. If its length exceeds the threshold, the answer is not shown on the display but may instead be delivered to the user by voice broadcast. This avoids the situation where an overlong answer text disrupts the display and prevents the user from seeing the complete answer.
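The display-versus-voice decision can be sketched as follows. The function name and the length threshold of 50 characters are illustrative assumptions, not values given in this application:

```python
def deliver_answer(answer, max_display_len=50):
    """Route the answer: show short answers on the display,
    speak long answers via voice broadcast (TTS) instead."""
    if len(answer) <= max_display_len:
        return ("display", answer)
    return ("tts", answer)
```

The tuple's first element tells the device which output channel to use; the answer text itself is passed through unchanged either way.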
An embodiment of the present application provides a question-answer matching device based on machine reading, configured to execute the embodiment corresponding to fig. 1. As shown in fig. 4, the device provided by the present application includes:
an acquisition module 201 configured to: acquiring a question text of a user;
a lookup module 202 configured to: retrieve an associated document text related to the question text from a real-time database, wherein the document texts stored in the real-time database are all unstructured text data collected in real time from a big data platform;
a position determination module 203 configured to: inputting the question text and the associated document text into a trained reading understanding model, determining a starting position and an ending position of an answer corresponding to the question text in the associated document text, and determining a text between the starting position and the ending position as a question answer matched with the question text.
In some embodiments, the apparatus further comprises:
a question text pre-processing module 204 configured to: before the associated document text related to the question text is looked up in the real-time database, perform word segmentation and core word extraction on the question text to obtain question core words;
the lookup module 202 is specifically configured to: and matching and searching in the real-time database by using the question core words to obtain associated document texts related to the question core words.
In some embodiments, determining a starting position and an ending position of an answer in the associated document text corresponding to the question text specifically includes:
calculating the starting probability of each word in the associated document text being the answer starting position and calculating the ending probability of each word in the associated document text being the answer ending position;
and determining the position of the corresponding word as the starting position of the corresponding answer when the product of the starting probability and the ending probability is the maximum value, and determining the position of the corresponding word as the ending position of the corresponding answer when the product of the starting probability and the ending probability is the maximum value.
What has been described above includes examples of implementations of the invention. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the claimed subject matter, but it is to be appreciated that many further combinations and permutations of the subject innovation are possible. Accordingly, the claimed subject matter is intended to embrace all such alterations, modifications and variations that fall within the spirit and scope of the appended claims. Moreover, the foregoing description of illustrated implementations of the present application, including what is described in the "abstract," is not intended to be exhaustive or to limit the disclosed implementations to the precise forms disclosed. While specific implementations and examples are described herein for illustrative purposes, various modifications are possible which are considered within the scope of such implementations and examples, as those skilled in the relevant art will recognize.
In particular and in regard to the various functions performed by the above described components, devices, circuits, systems and the like, the terms used to describe such components are intended to correspond, unless otherwise indicated, to any component which performs the specified function of the described component (e.g., that is functionally equivalent), even though not structurally equivalent to the disclosed structure which performs the function in the herein illustrated exemplary aspects of the claimed subject matter. In this regard, it will also be recognized that the innovation includes a system as well as a computer-readable storage medium having computer-executable instructions for performing the acts and/or events of the various methods of the claimed subject matter.
The above-described systems/circuits/modules have been described with respect to interaction between several components/blocks. It can be appreciated that such systems/circuits and components/blocks can include those components or specified sub-components, some of the specified components or sub-components, and/or additional components, according to various permutations and combinations of the foregoing. Sub-components can also be implemented as components communicatively coupled to other components rather than included within parent components (hierarchical). Additionally, it should be noted that one or more components may be combined into a single component providing aggregate functionality or divided into several separate sub-components, and any one or more middle layers (e.g., a management layer) may be provided to communicatively couple to such sub-components in order to provide integrated functionality. Any components described herein may also interact with one or more other components not specifically described herein but known to those of skill in the art.
Notwithstanding that the numerical ranges and parameters setting forth the broad scope of the invention are approximations, the numerical values set forth in the specific examples are reported as precisely as possible. Any numerical value, however, inherently contains certain errors necessarily resulting from the standard deviation found in its respective testing measurements. Moreover, all ranges disclosed herein are to be understood to encompass any and all subranges subsumed therein. For example, a range of "less than or equal to 11" can include any and all subranges between (and including) the minimum value of zero and the maximum value of 11, i.e., any and all subranges having a minimum value equal to or greater than zero and a maximum value equal to or less than 11 (e.g., 1 to 5). In some cases, the values as described for the parameters can have negative values.
In addition, while a particular feature of the subject innovation may have been disclosed with respect to only one of several implementations, such feature may be combined with one or more other features of the other implementations as may be desired and advantageous for any given or particular application. Furthermore, to the extent that the terms "includes," "including," "has," "incorporates," variants thereof, and other similar words are used in either the detailed description or the claims, these terms are intended to be inclusive in a manner similar to the term "comprising" as an open transition word without precluding any additional or other elements.
Reference throughout this specification to "one implementation" or "an implementation" means that a particular feature, structure, or characteristic described in connection with the implementation is included in at least one implementation. Thus, the appearances of the phrases "in one implementation" or "in an implementation" in various places throughout this specification are not necessarily all referring to the same implementation. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more implementations.
Furthermore, reference throughout this specification to "an item" or "a file" means that a particular structure, feature, or object described in connection with the implementation is not necessarily the same object. Further, "file" or "item" can refer to objects in various formats.
The terms "node," "component," "module," "system," and the like as used herein are generally intended to refer to a computer-related entity, either hardware (e.g., circuitry), a combination of hardware and software, or an entity associated with an operating machine having one or more specific functionalities. For example, a component may be, but is not limited to being, a process running on a processor (e.g., a digital signal processor), a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a controller and the controller can be a component. One or more components can reside within a process and/or thread of execution and a component can be localized on one computer and/or distributed between two or more computers. Although individual components are depicted in various implementations, it is to be appreciated that the components can be represented using one or more common components. Further, the design of each implementation can include different component placements, component selections, etc. to achieve optimal performance. Furthermore, "means" can take the form of specially designed hardware; generalized hardware specialized by the execution of software thereon (which enables the hardware to perform specific functions); software stored on a computer readable medium; or a combination thereof.
Moreover, the word "exemplary" is used herein to mean "serving as an example, instance, or illustration". Any aspect or design described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other aspects or designs. Rather, use of the word "exemplary" is intended to present concepts in a concrete fashion. As used herein, the term "or" is intended to mean an inclusive "or" rather than an exclusive "or". That is, unless specified otherwise, or clear from context, "X employs A or B" is intended to mean any of the natural inclusive permutations. That is, if X employs A; X employs B; or X employs both A and B, then "X employs A or B" is satisfied under any of the foregoing instances. In addition, the articles "a" and "an" as used in this application and the appended claims should generally be construed to mean "one or more" unless specified otherwise or clear from context to be directed to a singular form.

Claims (10)

1. A question-answer matching method based on machine reading, the method comprising:
acquiring a question text of a user;
searching a real-time database for associated document text related to the question text, wherein the document texts stored in the real-time database are all unstructured text data acquired in real time from a big data platform;
inputting the question text and the associated document text into a trained reading understanding model, determining a starting position and an ending position of an answer corresponding to the question text in the associated document text, and determining a text between the starting position and the ending position as a question answer matched with the question text.
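The pipeline of claim 1 (retrieve an associated document, then extract an answer span with a reading understanding model) can be sketched as below; `search_db` and `reading_model` are hypothetical placeholders for the real-time database lookup and the trained model, not names used in this patent:

```python
def answer_question(question, search_db, reading_model):
    # Retrieve the associated document text for the question.
    doc = search_db(question)
    # The model returns the start and end positions of the answer in doc.
    start, end = reading_model(question, doc)
    # The text between the two positions is the matched answer.
    return doc[start:end + 1]
```

With stub implementations of the two callables, the function simply slices the predicted span out of the retrieved document.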
2. The machine-reading-based question-answer matching method according to claim 1, wherein before searching for associated document text related to the question text from a real-time database, the method further comprises:
performing word segmentation processing and core word extraction processing on the question text to obtain question core words;
the step of searching the real-time database for associated document text related to the question text specifically comprises: performing matching retrieval in the real-time database using the question core words to obtain associated document text related to the question core words.
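As an illustrative sketch (not part of the claimed method), the core-word matching retrieval of claim 2 could look like the following, where an in-memory `documents` list stands in for the real-time database:

```python
def retrieve_associated_docs(core_words, documents):
    # Return documents containing at least one question core word,
    # ordered by how many core words they contain (most hits first).
    scored = []
    for doc in documents:
        hits = sum(1 for word in core_words if word in doc)
        if hits:
            scored.append((hits, doc))
    scored.sort(key=lambda pair: -pair[0])
    return [doc for _, doc in scored]
```

A real implementation over a Chinese-language database would first segment both query and documents into words (e.g., with a segmenter such as jieba) rather than relying on substring matching.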
3. The machine-reading-based question-answer matching method according to claim 2, wherein before the matching retrieval in the real-time database using the question core words, the method further comprises: filtering sensitive words out of the question core words.
4. The machine-reading-based question-answer matching method according to claim 1, wherein a plurality of associated document texts related to the question text are found in the real-time database, and before the question text and the associated document text are input into the trained reading understanding model, the method further comprises:
performing relevance scoring on the plurality of associated document texts using a TF-IDF algorithm, and taking the associated document text with the highest score as the associated document text used for determining the answer to the question.
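The TF-IDF relevance scoring of claim 4 can be sketched as follows; `tfidf_score` and `best_document` are illustrative names, the IDF smoothing is one common choice among several, and a production system would more likely use a library implementation:

```python
import math

def tfidf_score(query_terms, doc, corpus):
    # Score one candidate document against the question terms:
    # term frequency in the document, weighted by inverse document frequency.
    n = len(corpus)
    words = doc.split()
    score = 0.0
    for term in query_terms:
        tf = words.count(term) / max(len(words), 1)
        df = sum(1 for d in corpus if term in d.split())
        idf = math.log((n + 1) / (df + 1)) + 1  # smoothed IDF
        score += tf * idf
    return score

def best_document(query_terms, corpus):
    # Keep only the highest-scoring associated document text.
    return max(corpus, key=lambda d: tfidf_score(query_terms, d, corpus))
```

Documents sharing rare question terms score higher than those sharing only common terms, which is the behavior the claim relies on when picking a single document for the reading model.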
5. The machine-reading-based question-answer matching method according to any one of claims 1 to 4, wherein the reading understanding model is a reading understanding model fine-tuned from an ALBERT model.
6. The machine-reading-based question-answer matching method according to claim 1, wherein determining a start position and an end position of an answer in the associated document text corresponding to the question text specifically comprises:
calculating the starting probability of each word in the associated document text being the answer starting position and calculating the ending probability of each word in the associated document text being the answer ending position;
and, for the pair of words whose product of the starting probability and the ending probability is the maximum, determining the position of the start word of the pair as the starting position of the answer and the position of the end word of the pair as the ending position of the answer.
7. The machine-reading-based question-answer matching method according to claim 1, further comprising: when the text length of the question answer is less than or equal to a preset length threshold, displaying the question answer on a display;
and when the text length of the question answer is greater than the preset length threshold, not displaying the question answer on a display.
8. A question-answer matching device based on machine reading, the device comprising:
an acquisition module configured to: acquiring a question text of a user;
a lookup module configured to: search a real-time database for associated document text related to the question text, wherein the document texts stored in the real-time database are all unstructured text data acquired in real time from a big data platform;
a location determination module configured to: inputting the question text and the associated document text into a trained reading understanding model, determining a starting position and an ending position of an answer corresponding to the question text in the associated document text, and determining a text between the starting position and the ending position as a question answer matched with the question text.
9. The machine-reading-based question-answer matching device of claim 8, wherein said device further comprises:
a question text preprocessing module configured to: before the real-time database is searched for associated document text related to the question text, perform word segmentation processing and core word extraction processing on the question text to obtain question core words;
the lookup module is specifically configured to: perform matching retrieval in the real-time database using the question core words to obtain associated document text related to the question core words.
10. The machine-reading-based question-answer matching device according to claim 8, wherein determining the start position and the end position of the answer corresponding to the question text in the associated document text specifically comprises:
calculating the starting probability of each word in the associated document text being the answer starting position and calculating the ending probability of each word in the associated document text being the answer ending position;
and, for the pair of words whose product of the starting probability and the ending probability is the maximum, determining the position of the start word of the pair as the starting position of the answer and the position of the end word of the pair as the ending position of the answer.
CN202110244992.2A 2021-03-05 2021-03-05 Question-answer matching method and device based on machine reading Pending CN112883182A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110244992.2A CN112883182A (en) 2021-03-05 2021-03-05 Question-answer matching method and device based on machine reading


Publications (1)

Publication Number Publication Date
CN112883182A true CN112883182A (en) 2021-06-01

Family

ID=76055595

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110244992.2A Pending CN112883182A (en) 2021-03-05 2021-03-05 Question-answer matching method and device based on machine reading

Country Status (1)

Country Link
CN (1) CN112883182A (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107493353A (en) * 2017-10-11 2017-12-19 宁波感微知著机器人科技有限公司 A kind of intelligent robot cloud computing method based on contextual information
CN109766423A (en) * 2018-12-29 2019-05-17 上海智臻智能网络科技股份有限公司 Answering method and device neural network based, storage medium, terminal
CN109918487A (en) * 2019-01-28 2019-06-21 平安科技(深圳)有限公司 Intelligent answer method and system based on network encyclopedia
CN111368042A (en) * 2020-02-13 2020-07-03 平安科技(深圳)有限公司 Intelligent question and answer method and device, computer equipment and computer storage medium
CN111949798A (en) * 2019-05-15 2020-11-17 北京百度网讯科技有限公司 Map construction method and device, computer equipment and storage medium
CN112417104A (en) * 2020-12-04 2021-02-26 山西大学 Machine reading understanding multi-hop inference model and method with enhanced syntactic relation
CN112434142A (en) * 2020-11-20 2021-03-02 海信电子科技(武汉)有限公司 Method for marking training sample, server, computing equipment and storage medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG, Chaoran et al.: "A Survey of Machine Reading Comprehension Research Based on Pre-trained Models", Computer Engineering and Applications *
ZHU, Chenguang, China Machine Press *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114444488A (en) * 2022-01-26 2022-05-06 中国科学技术大学 Reading understanding method, system, device and storage medium for few-sample machine
CN115828893A (en) * 2022-11-28 2023-03-21 北京海致星图科技有限公司 Method, device, storage medium and equipment for question answering of unstructured document
CN115828893B (en) * 2022-11-28 2023-11-17 北京海致星图科技有限公司 Unstructured document question-answering method, device, storage medium and equipment

Similar Documents

Publication Publication Date Title
CN110633409B (en) Automobile news event extraction method integrating rules and deep learning
CN110427463B (en) Search statement response method and device, server and storage medium
JP6309644B2 (en) Method, system, and storage medium for realizing smart question answer
CN103544255B (en) Text semantic relativity based network public opinion information analysis method
CN106844658A (en) A kind of Chinese text knowledge mapping method for auto constructing and system
CN107391614A (en) A kind of Chinese question and answer matching process based on WMD
Mills et al. Graph-based methods for natural language processing and understanding—A survey and analysis
CN106570180A (en) Artificial intelligence based voice searching method and device
CN110929038A (en) Entity linking method, device, equipment and storage medium based on knowledge graph
CN108681574A (en) A kind of non-true class quiz answers selection method and system based on text snippet
CN110765277B (en) Knowledge-graph-based mobile terminal online equipment fault diagnosis method
CN113282711B (en) Internet of vehicles text matching method and device, electronic equipment and storage medium
CN109522396B (en) Knowledge processing method and system for national defense science and technology field
US20220414463A1 (en) Automated troubleshooter
CN109271524A (en) Entity link method in knowledge base question answering system
CN115203421A (en) Method, device and equipment for generating label of long text and storage medium
CN112883182A (en) Question-answer matching method and device based on machine reading
CN116244448A (en) Knowledge graph construction method, device and system based on multi-source data information
CN115470313A (en) Information retrieval and model training method, device, equipment and storage medium
CN116628173B (en) Intelligent customer service information generation system and method based on keyword extraction
CN114722774B (en) Data compression method, device, electronic equipment and storage medium
CN107818078B (en) Semantic association and matching method for Chinese natural language dialogue
CN113392245B (en) Text abstract and image-text retrieval generation method for public testing task release
CN110442759B (en) Knowledge retrieval method and system, computer equipment and readable storage medium
CN114239555A (en) Training method of keyword extraction model and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210601