CN113918686A - Intelligent question-answering model construction method and device, computer equipment and storage medium

Publication number
CN113918686A
Authority
CN
China
Prior art keywords
document
text
event
structured data
intelligent question
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111003318.1A
Other languages
Chinese (zh)
Inventor
高鹏
康维鹏
袁兰
吴飞
周伟华
高峰
潘晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Mjoys Big Data Technology Co ltd
Original Assignee
Hangzhou Mjoys Big Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Mjoys Big Data Technology Co ltd
Priority to CN202111003318.1A
Publication of CN113918686A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention discloses an intelligent question-answering model construction method, an intelligent question-answering model construction device, computer equipment and a storage medium. The method comprises the following steps: acquiring a document; parsing the document to obtain structured data; extracting knowledge from the structured data to obtain event types and event elements; and constructing an intelligent question-answering model according to the event types and the event elements so as to provide answers to questions by using the intelligent question-answering model. By implementing the method provided by the embodiment of the invention, corpus data can be automatically extracted by machine from documents in formats such as PDF, Word and Excel, a question-answering knowledge base can be constructed from it, and an intelligent question-answering model can be built, thereby improving question-answering efficiency and accuracy.

Description

Intelligent question-answering model construction method and device, computer equipment and storage medium
Technical Field
The invention relates to the field of computers, in particular to an intelligent question-answering model construction method, an intelligent question-answering model construction device, computer equipment and a storage medium.
Background
Intelligent question answering is a question-and-answer mode that accurately locates the knowledge a website user is asking about and provides personalized information services through interaction with that user.
However, in the financial field and in general office scenarios there are large numbers of documents in formats such as PDF, Word and Excel. These documents range from a few pages to several hundred pages, and the organization of their content is complex and variable. The prior art cannot automatically extract the corpus data of such documents by machine and build a question-answering knowledge base from it to form an intelligent question-answering model, and therefore cannot provide accurate answers based on existing documents, so question-answering efficiency and accuracy are low.
Therefore, it is necessary to design a new method that automatically extracts, by machine, the corpus data of documents in formats such as PDF, Word and Excel, constructs a question-answering knowledge base, and builds an intelligent question-answering model, so as to improve question-answering efficiency and accuracy.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides an intelligent question-answering model construction method, an intelligent question-answering model construction device, computer equipment and a storage medium.
In order to achieve the purpose, the invention adopts the following technical scheme: the intelligent question-answering model construction method comprises the following steps:
acquiring a document;
analyzing the document to obtain structured data;
extracting knowledge from the structured data to obtain event types and event elements;
and constructing an intelligent question-answering model according to the event type and the event elements so as to provide answers to questions by utilizing the intelligent question-answering model.
The further technical scheme is as follows: the parsing the document to obtain structured data includes:
carrying out structured analysis on the data of the text type in the document to obtain text structured data;
carrying out structured analysis on the table objects in the document to obtain table structured data;
carrying out structured analysis on the picture objects in the document to obtain image structured data;
wherein the structured data includes text structured data, table structured data, and image structured data.
The further technical scheme is as follows: the performing structured analysis on the data of the text type in the document to obtain text structured data includes:
identifying text body content within the document;
removing text character styles from the text body content;
performing word segmentation on the text body content after the removal by adopting an NLP word segmentation tool to obtain a word segmentation result;
and performing general entity recognition on the word segmentation result to obtain text structured data.
The further technical scheme is as follows: the performing structured analysis on the table objects in the document to obtain table structured data includes:
performing element extraction on the table objects in the document to obtain table elements;
and carrying out structured transformation on the table elements to obtain table structured data.
The further technical scheme is as follows: the extracting knowledge of the structured data to obtain event types and event elements includes:
adopting a deep learning classification model to determine the event relation of the structured data to obtain an event type;
and according to the specific event type, performing sequence labeling on the corresponding structured data by adopting a BiLSTM + CRF event extraction model to obtain event elements.
The further technical scheme is as follows: the constructing of the intelligent question-answering model according to the event type and the event elements so as to provide answers to questions by using the intelligent question-answering model comprises the following steps:
constructing a language model, and training the language model according to the event type and the event elements to obtain an intelligent question-answering model;
performing coarse-grained retrieval on the content of the question by using the intelligent question-answering model to obtain potential content paragraphs;
and carrying out fine-grained ranking on the potential content paragraphs to obtain fine-grained paragraph answers.
The invention also provides an intelligent question-answering model constructing device, which comprises:
a document acquisition unit configured to acquire a document;
the analysis unit is used for analyzing the document to obtain structured data;
the knowledge extraction unit is used for extracting knowledge from the structured data to obtain an event type and an event element;
and the model construction unit is used for constructing an intelligent question-answering model according to the event type and the event elements so as to provide answers to questions by utilizing the intelligent question-answering model.
The further technical scheme is as follows: the analysis unit includes:
the text analysis subunit is used for performing structured analysis on the data of the text type in the document to obtain text structured data;
the table analysis subunit is used for carrying out structured analysis on the table objects in the document to obtain table structured data;
and the picture analysis subunit is used for carrying out structured analysis on the picture objects in the document to obtain image structured data.
The invention also provides computer equipment which comprises a memory and a processor, wherein the memory is stored with a computer program, and the processor realizes the method when executing the computer program.
The invention also provides a storage medium storing a computer program which, when executed by a processor, is operable to carry out the method as described above.
Compared with the prior art, the invention has the following beneficial effects: by parsing the document, extracting knowledge from it and finally constructing the question-answering model, the invention lets people ask questions quickly in natural language and obtain relatively accurate content fragments from the document. Corpus data is automatically extracted by machine from documents in formats such as PDF, Word and Excel, a question-answering knowledge base is constructed, and an intelligent question-answering model is built, so that question-answering efficiency and accuracy are improved.
The invention is further described below with reference to the accompanying drawings and specific embodiments.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show some embodiments of the present invention, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 is a schematic view of an application scenario of an intelligent question-answering model construction method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a method for constructing an intelligent question-answering model according to an embodiment of the present invention;
FIG. 3 is a sub-flow diagram of a method for constructing an intelligent question-answering model according to an embodiment of the present invention;
FIG. 4 is a sub-flow diagram of a method for constructing an intelligent question-answering model according to an embodiment of the present invention;
FIG. 5 is a sub-flow diagram of a method for constructing an intelligent question-answering model according to an embodiment of the present invention;
FIG. 6 is a sub-flow diagram of a method for constructing an intelligent question-answering model according to an embodiment of the present invention;
FIG. 7 is a sub-flow diagram of a method for constructing an intelligent question-answering model according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a BiLSTM + CRF event extraction model according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a fine similarity ranking provided by an embodiment of the present invention;
fig. 10 is a schematic block diagram of an intelligent question-answering model building device according to an embodiment of the present invention;
fig. 11 is a schematic block diagram of an analysis unit of the intelligent question-answering model building apparatus according to the embodiment of the present invention;
fig. 12 is a schematic block diagram of a text parsing subunit of the intelligent question-answering model building apparatus according to the embodiment of the present invention;
fig. 13 is a schematic block diagram of a table parsing subunit of the intelligent question-answering model building apparatus according to the embodiment of the present invention;
fig. 14 is a schematic block diagram of a knowledge extraction unit of the intelligent question-answering model building device according to the embodiment of the present invention;
fig. 15 is a schematic block diagram of a model building unit of the intelligent question answering model building device according to the embodiment of the present invention;
FIG. 16 is a schematic block diagram of a computer device provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1 and fig. 2, fig. 1 is a schematic view of an application scenario of an intelligent question-answering model construction method according to an embodiment of the present invention. Fig. 2 is a schematic flow chart of a method for constructing an intelligent question-answering model according to an embodiment of the present invention. The intelligent question-answering model construction method is applied to a server. The server exchanges data with a terminal: documents are input through the terminal, and the server constructs an intelligent question-answering model from the documents. When a question needs to be answered, the server uses the intelligent question-answering model to retrieve an accurate answer according to the content of the question and feeds the answer back to the terminal.
As shown in fig. 2, the method includes the following steps S110 to S140.
S110, acquiring a document.
In this embodiment, the documents include documents in various formats such as Word, PDF and Excel, and mainly come from data related to the financial field and general office scenarios.
S120, parsing the document to obtain structured data.
In this embodiment, the structured data refers to well-structured data in JSON format. The structured data includes text structured data, table structured data, and image structured data.
Document parsing means structured parsing of documents in various formats, such as Word, PDF and Excel, and converting them into structured data to facilitate subsequent content extraction and knowledge construction. Because the representation structure of each document format differs, the approach and the specific processing techniques required are not exactly the same for each format. However, formats can be converted into one another: for example, Word documents and Excel documents can be converted to PDF, and PDF documents are the more standardized and formal format in the financial field, other vertical fields and general fields. Therefore, the structured parsing is explained here using PDF documents as the example; Word document parsing is the same as PDF parsing in principle and technical difficulty.
A PDF document consists of four parts: a file header, which gives the PDF version number in the first line of the file; a file body, which is the content of the PDF file and is called the object set, commonly containing content objects such as tables, text and pictures; a cross-reference table, which is an address index table set up for fast access to objects; and a file trailer, which declares the position of the cross-reference table and also stores information such as PDF encryption and security settings. This description focuses on the PDF file body, that is, on the structured parsing of its content. The content of a PDF document includes text, tables, pictures and attachments, and the parsing is explained below for each of these object types.
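The four-part layout described above can be illustrated with a minimal sketch. The function name `describe_pdf_layout` and the hand-built sample bytes are assumptions made for illustration; the sketch only handles classic cross-reference tables (not cross-reference streams) and counts objects with a naive pattern, so it is not a substitute for a real PDF parser.

```python
import re

def describe_pdf_layout(data: bytes) -> dict:
    """Locate header, body objects, cross-reference table and trailer (hypothetical helper)."""
    # File header: the first line carries the version, e.g. %PDF-1.4
    header = re.match(rb"%PDF-(\d+\.\d+)", data)
    # File trailer: ends with "startxref <byte offset of xref table> %%EOF"
    startxref = re.search(rb"startxref\s+(\d+)\s+%%EOF\s*$", data)
    if header is None or startxref is None:
        raise ValueError("not a well-formed PDF byte string")
    xref_offset = int(startxref.group(1))
    return {
        "version": header.group(1).decode("ascii"),
        # File body: indirect objects of the form "<num> <gen> obj ... endobj"
        "object_count": len(re.findall(rb"\d+ \d+ obj\b", data)),
        "xref_offset": xref_offset,
        # Classic cross-reference tables start with the keyword "xref"
        "xref_found": data[xref_offset:xref_offset + 4] == b"xref",
    }

# A tiny hand-built sample with one object, used only to exercise the function.
body = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
xref = b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
trailer = (b"trailer\n<< /Size 2 /Root 1 0 R >>\nstartxref\n"
           + str(len(body)).encode("ascii") + b"\n%%EOF\n")
info = describe_pdf_layout(body + xref + trailer)
print(info)
```

Reading the trailer first and then jumping to the cross-reference table mirrors how PDF readers typically locate objects without scanning the whole file.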
In an embodiment, referring to fig. 3, the step S120 may include steps S121 to S123.
S121, performing structured parsing on the text-type data in the document to obtain text structured data.
In this embodiment, the text structured data refers to data obtained by structuring text-type data.
In an embodiment, referring to fig. 4, the step S121 may include steps S1211 to S1214.
S1211, identifying the body text content in the document;
S1212, removing text character styles from the body text content;
S1213, performing word segmentation on the body text content after the removal by adopting an NLP word segmentation tool to obtain a word segmentation result;
S1214, performing general entity recognition on the word segmentation result to obtain text structured data.
A PDF text object consists of operators that display text strings, position text, and set the text state and other parameters. Normally a text parameter affects all subsequent text objects, but three parameters are exceptions: they describe only one text object and do not carry over from one text object to the next. These are Tm (the text matrix), Tlm (the text line matrix), and Trm (the text rendering matrix, which is actually only an intermediate result combining the effects of the text state parameters, the text matrix Tm and the current transformation matrix).
For the text objects of a PDF document, parsing mainly identifies the body text, removes information such as text character styles, and uses an NLP (natural language processing) word segmentation tool to perform word segmentation and recognition of general entities (including persons, places, organizations, numerical values, times and the like). In addition, the structuring process also needs to handle document paragraph classification, association with the index directory, indexing of PDF line and view positions and similar processing, so that the content structure information of the original PDF file is retained as far as possible and the original can be restored.
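Steps S1211 to S1214 can be sketched as follows. This is only an illustration under stated assumptions: tag-style markup stands in for PDF style information, a regular expression stands in for the NLP word segmentation tool, and the two patterns in `ENTITY_PATTERNS` stand in for a trained general entity recognizer. All names and the record layout are hypothetical.

```python
import re

# Hypothetical rule-based patterns standing in for a trained general-entity recognizer.
ENTITY_PATTERNS = {
    "TIME": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "VALUE": re.compile(r"\b\d+(?:\.\d+)?%"),
}
# Simplified tokenizer standing in for a real word segmentation tool.
TOKEN_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}|\d+(?:\.\d+)?%?|\w+")

def structure_paragraph(raw: str, page: int, paragraph_index: int) -> dict:
    # Remove style markup, keeping only the body text.
    text = re.sub(r"</?[A-Za-z][^>]*>", "", raw).strip()
    # Word segmentation.
    tokens = TOKEN_PATTERN.findall(text)
    # General entity recognition (rule-based placeholder).
    entities = [
        {"type": label, "text": m.group(0), "start": m.start()}
        for label, pattern in ENTITY_PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    # Keep position information so the paragraph can be traced back to the source.
    return {
        "type": "text",
        "page": page,
        "paragraph_index": paragraph_index,
        "text": text,
        "tokens": tokens,
        "entities": entities,
    }

record = structure_paragraph("<b>Revenue</b> rose 12.5% on 2021-08-30.", page=3, paragraph_index=7)
print(record)
```

Storing the page and paragraph index alongside the tokens reflects the requirement above that the original content structure be retained for later restoration.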
S122, performing structured parsing on the table objects in the document to obtain table structured data.
In this embodiment, the table structured data refers to data obtained by structured parsing of a table.
In one embodiment, referring to fig. 5, the step S122 may include steps S1221 to S1222.
S1221, extracting elements of the table objects in the document to obtain table elements;
In this embodiment, the table elements refer to the result of extracting elements from the tables in the PDF document body objects, covering cases such as bordered tables, borderless tables, merged cells and cross-page tables.
For a bordered table, the table structure is identified from the table's horizontal and vertical lines: the intersecting lines form a two-dimensional grid, operations such as cell merging are then performed according to cell positions, the text, numerical values and other content of each cell are extracted, and long text is split into sentences or merged as needed.
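The grid construction for bordered tables can be sketched as below. The function `build_grid`, the input shapes and the convention that y grows downward are assumptions for illustration; merged cells and sentence splitting are deliberately left out.

```python
from bisect import bisect_right

def build_grid(h_lines, v_lines, text_blocks):
    """Assign text blocks to the cells formed by ruling lines.

    h_lines: y coordinates of horizontal lines; v_lines: x coordinates of vertical lines;
    text_blocks: (x, y, text) tuples giving each block's centre point.
    """
    ys = sorted(set(h_lines))
    xs = sorted(set(v_lines))
    rows, cols = len(ys) - 1, len(xs) - 1
    grid = [["" for _ in range(cols)] for _ in range(rows)]
    for x, y, text in text_blocks:
        # The cell index is the number of lines at or before the point, minus one.
        r = bisect_right(ys, y) - 1
        c = bisect_right(xs, x) - 1
        if 0 <= r < rows and 0 <= c < cols:
            # Blocks landing in the same cell are joined into one cell text.
            grid[r][c] = (grid[r][c] + " " + text).strip()
    return grid

grid = build_grid(
    h_lines=[0, 20, 40],
    v_lines=[0, 50, 100],
    text_blocks=[(25, 10, "Item"), (75, 10, "Amount"), (25, 30, "Revenue"), (75, 30, "120")],
)
print(grid)
```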
For borderless tables, the two-dimensional table structure is identified mainly by dividing the text blocks within the table area into rows and columns. Specifically, text blocks first need to be merged semantically to resolve problems caused by line breaks and the like; the row and column positions of the table are then determined from the spatial distribution of the blocks, and text at special positions is judged comprehensively. In the concrete implementation of semantic block merging, a combination of rules and models can be used to decide whether two text blocks belong to the same table cell.
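A rule-only sketch of determining rows and columns from the spatial distribution of text blocks follows. The gap-based clustering, the tolerance values and the function names are assumptions; the combination of rules and models mentioned above for semantic merging is not reproduced here.

```python
def cluster_1d(values, tolerance):
    """Group coordinates; a new group starts when the gap to the previous value exceeds tolerance."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= tolerance:
            groups[-1].append(v)
        else:
            groups.append([v])
    # Represent each group by its mean coordinate.
    return [sum(g) / len(g) for g in groups]

def nearest(centers, v):
    """Index of the cluster centre closest to v."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - v))

def infer_table(blocks, row_tol=3.0, col_tol=10.0):
    """blocks: (x, y, text) tuples; returns a row-major list of cell texts."""
    row_centers = cluster_1d([y for _, y, _ in blocks], row_tol)
    col_centers = cluster_1d([x for x, _, _ in blocks], col_tol)
    table = [["" for _ in col_centers] for _ in row_centers]
    for x, y, text in blocks:
        r, c = nearest(row_centers, y), nearest(col_centers, x)
        table[r][c] = (table[r][c] + " " + text).strip()
    return table

table = infer_table([(10, 100, "Year"), (80, 101, "Profit"), (11, 120, "2020"), (82, 119, "3.1")])
print(table)
```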
In addition, once single-page table identification and extraction is solved, cross-page tables must also be identified and extracted. This mainly means joining, end to end, the parts of one table that are spread over two adjacent pages, so that a table scattered across different pages is finally merged into one. Whether two adjacent tables should be merged is decided mainly by model classification using table features such as row and column positions, cell distribution and row and column text types.
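The cross-page merging decision can be sketched with hand-written rules over the kinds of features listed above. Note that the text describes a model classifier; the fixed rules in `is_continuation` are a simplified stand-in, and the table dictionary layout (`pages`, `rows`, `at_page_top`, `at_page_bottom`) is hypothetical.

```python
def column_types(row):
    """Coarse per-column text type: 'num' for plain numbers, otherwise 'text'."""
    return tuple(
        "num" if cell.replace(".", "", 1).replace("-", "", 1).isdigit() else "text"
        for cell in row
    )

def is_continuation(prev, nxt):
    """Rule-based stand-in for the classifier deciding whether nxt continues prev."""
    return (
        nxt["pages"][0] == prev["pages"][-1] + 1          # adjacent pages
        and prev["at_page_bottom"] and nxt["at_page_top"]  # positions on the page
        and len(prev["rows"][-1]) == len(nxt["rows"][0])   # same column count
        and column_types(prev["rows"][-1]) == column_types(nxt["rows"][0])
    )

def merge_tables(tables):
    """Merge consecutive table fragments that are judged to be one table."""
    merged = []
    for t in tables:
        if merged and is_continuation(merged[-1], t):
            last = merged[-1]
            merged[-1] = {
                "pages": last["pages"] + t["pages"],
                "rows": last["rows"] + t["rows"],
                "at_page_top": last["at_page_top"],
                "at_page_bottom": t["at_page_bottom"],
            }
        else:
            merged.append(dict(t))
    return merged

part1 = {"pages": [3], "rows": [["Item", "Amount"], ["Revenue", "120"]],
         "at_page_top": False, "at_page_bottom": True}
part2 = {"pages": [4], "rows": [["Cost", "80"]],
         "at_page_top": True, "at_page_bottom": False}
result = merge_tables([part1, part2])
print(result)
```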
S1222, performing structured transformation on the table elements to obtain table structured data.
The final aim of table element extraction is to transform the elements of the tables into structured form. Besides the data content in the table, metadata such as the PDF directory structure, the start and end positions, and the explanatory text below the table also needs to be extracted and stored.
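One possible JSON shape for table structured data is sketched below. The field names (`section_path`, `start_page`, `end_page`, `note`) are hypothetical choices for the metadata mentioned above, not a schema given in the source.

```python
import json

def table_to_record(grid, *, section_path, page_range, note):
    """Turn a row-major grid (first row = header) into a JSON-ready record."""
    header, *body = grid
    return {
        "type": "table",
        "section_path": section_path,   # position in the PDF directory structure
        "start_page": page_range[0],    # start/end position of the table
        "end_page": page_range[1],
        "note": note,                   # explanatory text below the table
        "columns": header,
        "rows": [dict(zip(header, row)) for row in body],
    }

record = table_to_record(
    [["Item", "Amount"], ["Revenue", "120"], ["Cost", "80"]],
    section_path=["3 Financial data", "3.1 Income"],
    page_range=(12, 12),
    note="Unit: million CNY",
)
serialized = json.dumps(record, ensure_ascii=False, indent=2)
print(serialized)
```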
S123, performing structured parsing on the picture objects in the document to obtain image structured data.
In this embodiment, the image structured data refers to the result of structured parsing of data such as pictures and attachments.
Specifically, besides PDF text and table elements there is other information such as pictures and attachments. For picture data, only identification and extraction of information such as the picture file as a whole, the picture's introductory text, the picture's subject type, its association with the PDF index directory, and its start and end positions are performed. The association between a picture and the caption text that introduces it is identified mainly by classification: features are extracted from the text content, font style, position and other information of the document paragraphs immediately before and after the picture, and a model classifies them. In research reports by financial securities institutions on specific industries, in addition to identifying specific elements, the content paragraphs of pictures (charts) also require structured extraction, label classification and so on of the picture description paragraphs, the text of accompanying tables, and the start and end positions of the picture within the document paragraphs, to facilitate subsequent display and analysis traceability.
S130, knowledge extraction is carried out on the structured data to obtain event types and event elements.
In this embodiment, the event type refers to an event relationship, and the event element refers to an event and a corresponding element in the structured data.
After the PDF document has been converted into well-structured JSON data, knowledge needs to be further extracted from the document. Knowledge extraction mainly means extracting events and event elements, or specific relations and relation elements, from text paragraphs. The specific extraction method combines event (relation) type classification with sequence labeling.
In an embodiment, referring to fig. 6, the step S130 may include steps S131 to S132.
S131, determining the event relation of the structured data by adopting a deep learning classification model to obtain the event type.
In this embodiment, a classification model based on deep learning is used to determine the event relation; specifically, a BERT model is used to determine the event type. The input of BERT is the original word vector of each character/word in the text, which can be randomly initialized or pre-trained with Word2Vec or a similar method; the output is a vector representation of each character/word after the semantic information of the full text has been fused. The BERT model is trained jointly on an MLM (Masked Language Model) task and a next sentence prediction task, and uses the Transformer, a module based entirely on the attention mechanism, so it can capture the semantic relations between important words in long sentences. The vector representation of each character/word output by the model therefore describes the overall information of the input text (a single sentence or a sentence pair) as completely and accurately as possible, and provides better initial model parameters for the subsequent fine-tuning task. For classification, a fine-tuning layer is added on top of the BERT output layer to perform classification and prediction.
S132, according to the specific event type, performing sequence labeling on the corresponding structured data by adopting a BiLSTM + CRF event extraction model to obtain event elements.
Event and relation extraction from text first determines the event type of the text by text classification, and then performs sequence labeling with the BiLSTM + CRF event extraction model according to the specific event type, which is essentially similar to CRF-based entity recognition. For event relation extraction, a joint recognition model integrating entity recognition, event type determination and event element extraction is constructed, namely the BiLSTM + CRF event extraction model, whose deep network structure is shown in FIG. 8.
Event element extraction has three input sources: the Bi-LSTM intermediate semantic information of the original text, the embedding of the optimal label sequence from NER-CRF, and the fully connected output layer of event type recognition. In this way entity recognition, event classification, event element extraction and the like can be fused and trained jointly as a whole. For example, take the text "On September 13, 2020, NVIDIA announced that it would spend 40 billion US dollars to acquire ARM, the chip maker owned by SoftBank, the largest acquisition in the history of the global semiconductor field." The event type is determined to be an acquisition event. The event/relation extraction model then identifies "NVIDIA" as the subject of the "acquisition" and "ARM" as the acquired party; "September 13, 2020" is the acquisition time, and "40 billion US dollars" is the acquisition funds, which also indicates that the acquisition mode is "cash".
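Once a sequence labeling model has assigned a tag to each token, the tags have to be turned into event elements. The sketch below assumes a BIO tagging scheme and hypothetical label names (`TIME`, `ACQUIRER`, `TARGET`, `AMOUNT`); the source does not state which tagging scheme the BiLSTM + CRF model uses, and the tags here are written by hand rather than predicted.

```python
def decode_bio(tokens, tags):
    """Collect (label, text) spans from BIO tags."""
    elements, label, span = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new span begins; close any span that was open.
            if label is not None:
                elements.append((label, " ".join(span)))
            label, span = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            # Continuation of the current span.
            span.append(token)
        else:
            # "O" or an inconsistent I- tag closes the current span.
            if label is not None:
                elements.append((label, " ".join(span)))
            label, span = None, []
    if label is not None:
        elements.append((label, " ".join(span)))
    return elements

tokens = ["On", "2020-09-13", "NVIDIA", "agreed", "to", "acquire", "ARM",
          "for", "40", "billion", "dollars"]
tags = ["O", "B-TIME", "B-ACQUIRER", "O", "O", "O", "B-TARGET",
        "O", "B-AMOUNT", "I-AMOUNT", "I-AMOUNT"]
event = {"event_type": "acquisition", "elements": dict(decode_bio(tokens, tags))}
print(event)
```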
S140, an intelligent question-answering model is built according to the event types and the event elements, and answers of the questions are provided through the intelligent question-answering model.
After the PDF document has been converted into structured data and knowledge, question-answering model training is carried out on the related structured text data and the knowledge-type event relation data. The training includes several stages, such as language model training, content indexing and fine ranking.
In an embodiment, referring to fig. 7, the step S140 may include steps S141 to S143.
S141, a language model is constructed, and the language model is trained according to the event type and the event elements to obtain an intelligent question-answering model.
In this embodiment, word vectors are an important technical means of semantic representation. Word vector training is performed on the extracted text, table field data and event relation data using a Word2Vec pre-training model. The specific training method is as follows. For text data, a word segmentation tool first segments the text uniformly and converts it into content separated by spaces, and BERT model training data is prepared line by line following the original text paragraphs. For table data, the content of each row is aligned with the header and then expanded according to the row-column relationship into the form "column name 1, column value 1, column name 2, column value 2, column name 3, column value 3 …", and BERT model training data is prepared row by row. Events are handled like table data: each event/relation record is expanded into the event type, event subject value, event element name 1, event element value 1, event element name 2, event element value 2, and so on. For example, for the acquisition event in the foregoing text, "On September 13, 2020, NVIDIA announced that it would spend 40 billion US dollars to acquire ARM, the chip maker owned by SoftBank, the largest acquisition in the history of the global semiconductor field.", the structured data extracted as Word2Vec training corpus is: [acquisition; acquirer: NVIDIA; acquired party: ARM; acquisition time: September 13, 2020; acquisition funds: 40 billion US dollars; acquisition mode: cash]. The training tool used is the Word2Vec training tool in the Gensim toolkit (Word2Vec itself originated at Google).
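The expansion of text, table rows and events into space-separated corpus lines can be sketched as follows. The function names are hypothetical, and the actual training call is left out; token lists like `sentences` below are the shape of input that Gensim's `Word2Vec` class accepts through its `sentences` argument.

```python
def text_to_line(tokens):
    """Segmented text becomes one space-separated line."""
    return " ".join(tokens)

def table_rows_to_lines(header, rows):
    """Each row is expanded as: column name 1, column value 1, column name 2, column value 2 ..."""
    return [" ".join(f"{name} {value}" for name, value in zip(header, row)) for row in rows]

def event_to_line(event_type, elements):
    """Events are expanded like table rows: type, then element name/value pairs."""
    return " ".join([event_type] + [f"{name} {value}" for name, value in elements.items()])

lines = [text_to_line(["NVIDIA", "acquires", "ARM"])]
lines += table_rows_to_lines(["year", "revenue"], [["2020", "120"], ["2021", "150"]])
lines.append(event_to_line("acquisition", {"acquirer": "NVIDIA", "target": "ARM", "mode": "cash"}))

# Token lists, one per corpus line, ready to be handed to a word-vector trainer.
sentences = [line.split() for line in lines]
print(lines)
```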
S142, performing coarse-grained retrieval on the content of the question by using the intelligent question-answering model to obtain potential content paragraphs.
In this embodiment, a potential content paragraph is a paragraph that may contain the answer corresponding to the content of the question.
Intelligent question answering provides the client with a relatively accurate answer together with source-tracing information for the related answer content. In this embodiment, question answering is processed in two stages: coarse-grained retrieval of potential candidates is performed first, followed by fine-grained ranking of accurate answers. Therefore, a search tool such as Solr is first used to build an index over the text, table, and event relationship data. The specific index and stored information fields include: content type (text, table, event, picture, etc.), keywords of the content segment (subject words, entity words, specific event element values, specific table row and column values, etc.), original text of the content segment, cascading index of the content segment, location information of the content segment, and so on.
For a user's question, word segmentation and classification recognition are first performed on the question (determining whether it concerns an event, chart data, text, an image paragraph, or the like) to determine the content type, keywords, entity words, and so on that the user wants to search for, and the Top-N potential content paragraphs are determined by preliminary ranking.
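The coarse-grained stage can be illustrated with a small in-memory stand-in for the search index. This sketch does not use Solr; the `Segment` class, the `coarse_retrieve` function, the hard filter on content type, and the keyword-overlap score are all assumptions chosen to mirror the stored fields and the Top-N preliminary ranking described above.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Segment:
    content_type: str    # "text", "table", "event", "picture", ...
    keywords: Set[str]   # subject words, entity words, element values, ...
    original_text: str   # original text of the content segment
    location: str        # location information of the content segment


def coarse_retrieve(index: List[Segment], query_type: str,
                    query_terms: Set[str], top_n: int = 3
                    ) -> List[Tuple[int, Segment]]:
    """Keep segments of the requested type and rank them by keyword overlap."""
    scored = []
    for seg in index:
        if seg.content_type != query_type:
            continue
        overlap = len(seg.keywords & query_terms)
        if overlap > 0:
            scored.append((overlap, seg))
    # Stable sort: ties keep their original index order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


# Hypothetical index content.
index = [
    Segment("event", {"NVIDIA", "ARM", "acquisition"},
            "NVIDIA announced the acquisition of ARM.", "page 3, paragraph 2"),
    Segment("text", {"ARM", "chip"},
            "ARM designs chips.", "page 1, paragraph 1"),
    Segment("event", {"ARM", "SoftBank"},
            "SoftBank held ARM.", "page 3, paragraph 4"),
]

# The question is assumed to have been classified as an event question
# and reduced to the key terms below.
results = coarse_retrieve(index, "event", {"NVIDIA", "ARM"}, top_n=2)
for score, seg in results:
    print(score, seg.location, seg.original_text)
```

In a real deployment the overlap score would be replaced by the relevance score returned by the search engine; the location field is what allows the answer to be traced back to its source.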
S143, performing fine-grained ranking on the potential content paragraphs to obtain fine-grained paragraph answers.
In this embodiment, the fine-grained paragraph answers are the answers obtained by using the BERT model to rank candidates by fine-grained similarity, treating the question sentence and each text paragraph together as a whole.
Specifically, after the preliminary retrieval, the candidate Top-N potential content paragraphs are determined; these may include text, tables, event relationship data, chart paragraphs, and so on. These paragraphs then need to be ranked at fine granularity to determine the fine-grained paragraph answers. The specific fine-grained ranking method is as follows: using a pre-trained BERT model, the original question string and each candidate answer text paragraph are concatenated with the [SEP] separator to form a new "question [SEP] candidate text" input; the BERT model then encodes this input semantically; finally, a fully connected layer for similarity computation is added on top of the BERT output, which yields the ranking of the candidate contents. Fig. 9 shows the fine-grained similarity ranking of the question sentence plus text paragraph as a whole using the BERT model.
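The input construction and ranking loop can be sketched as below. Only the "question [SEP] candidate text" pairing and the sort by score follow the description; the scorer `toy_score` is a lexical-overlap placeholder standing in for the BERT encoder plus fully connected similarity layer, and all names are hypothetical.

```python
from typing import Callable, List, Tuple


def build_pair(question: str, candidate: str) -> str:
    """Concatenate question and candidate with the [SEP] separator."""
    return f"{question} [SEP] {candidate}"


def toy_score(pair: str) -> float:
    """Placeholder scorer (Jaccard word overlap), NOT a BERT model."""
    question, candidate = pair.split(" [SEP] ", 1)
    q = set(question.lower().split())
    c = set(candidate.lower().split())
    if not q or not c:
        return 0.0
    return len(q & c) / len(q | c)


def rerank(question: str, candidates: List[str],
           score_fn: Callable[[str], float] = toy_score
           ) -> List[Tuple[float, str]]:
    """Score every 'question [SEP] candidate' pair and sort descending."""
    scored = [(score_fn(build_pair(question, c)), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


candidates = [
    "Revenue grew in 2021",
    "NVIDIA acquired ARM in 2020",
]
ranked = rerank("who acquired ARM", candidates)
for score, text in ranked:
    print(round(score, 3), text)
```

Because `score_fn` is a parameter, a trained pair classifier can be substituted for the placeholder without changing the ranking loop.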
In the above intelligent question-answering model construction method, documents are parsed and knowledge is extracted to construct the final question-answering model, so that users can ask questions quickly with natural-language question sentences and obtain relatively accurate content fragments from the documents. Starting from document corpus data in media such as PDF, Word, and Excel, a question-answering knowledge base is built through automatic machine extraction and an intelligent question-answering model is constructed on it, which improves the efficiency and accuracy of question answering.
Fig. 10 is a schematic block diagram of an intelligent question-answering model building apparatus 300 according to an embodiment of the present invention. As shown in fig. 10, the present invention further provides an intelligent question-answering model constructing apparatus 300 corresponding to the above intelligent question-answering model constructing method. The intelligent question-and-answer model construction apparatus 300 includes a unit for executing the above-described intelligent question-and-answer model construction method, and the apparatus may be configured in a server. Specifically, referring to fig. 10, the intelligent question-answering model building device 300 includes a document obtaining unit 301, an analyzing unit 302, a knowledge extracting unit 303, and a model building unit 304.
A document acquisition unit 301 for acquiring a document; an analyzing unit 302, configured to analyze the document to obtain structured data; a knowledge extraction unit 303, configured to perform knowledge extraction on the structured data to obtain an event type and an event element; a model building unit 304, configured to build an intelligent question-answering model according to the event type and the event element, so as to provide answers to questions by using the intelligent question-answering model.
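The way the four units chain together can be sketched as a small pipeline. The class `QAModelBuilder`, its field names, and the toy stand-in functions are hypothetical; they only illustrate the order of acquisition, parsing, knowledge extraction, and model construction, not the internals of any unit.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class QAModelBuilder:
    acquire: Callable[[str], str]              # document acquisition unit
    parse: Callable[[str], List[str]]          # parsing unit
    extract: Callable[[List[str]], Dict[str, Any]]   # knowledge extraction unit
    build: Callable[[Dict[str, Any]], Dict[str, Any]]  # model construction unit

    def run(self, source: str) -> Dict[str, Any]:
        document = self.acquire(source)
        structured = self.parse(document)
        knowledge = self.extract(structured)
        return self.build(knowledge)


# Toy stand-ins so the sketch runs end to end.
def acquire(source: str) -> str:
    return "NVIDIA acquires ARM"


def parse(document: str) -> List[str]:
    return document.split()


def extract(tokens: List[str]) -> Dict[str, Any]:
    return {"event_type": "acquisition",
            "elements": {"acquirer": tokens[0], "acquired": tokens[-1]}}


def build(knowledge: Dict[str, Any]) -> Dict[str, Any]:
    return {"knowledge_base": [knowledge]}


model = QAModelBuilder(acquire, parse, extract, build).run("report.pdf")
print(model)
```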
In one embodiment, as shown in fig. 11, the parsing unit 302 includes a text parsing sub-unit 3021, a table parsing sub-unit 3022, and a picture parsing sub-unit 3023.
A text parsing subunit 3021, configured to perform structured parsing on the data of the text type in the document to obtain text structured data; a table parsing subunit 3022, configured to perform structured parsing on the table object in the document to obtain table structured data; a picture parsing subunit 3023, configured to perform structural parsing on the picture object in the document to obtain image structural data.
In one embodiment, as shown in fig. 12, the text parsing subunit 3021 includes a content recognition module 30211, a culling module 30212, a word segmentation module 30213, and an entity recognition module 30214.
A content recognition module 30211, configured to identify text body content within the document; a culling module 30212, configured to remove text character styles from the text body content; a word segmentation module 30213, configured to segment the text body content remaining after removal by using an NLP word segmentation tool, so as to obtain a word segmentation result; an entity recognition module 30214, configured to perform general entity recognition on the word segmentation result to obtain text structured data.
In one embodiment, as shown in fig. 13, the table parsing subunit 3022 includes an element extraction module 30221 and an element conversion module 30222.
An element extraction module 30221, configured to perform element extraction on the table object in the document to obtain table elements; an element conversion module 30222, configured to perform structured conversion on the table elements to obtain table structured data.
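As a rough illustration of the structured conversion step, extracted table cells can be turned into one record per row keyed by the header. The function name, the padding rule for short rows, and the sample cells are assumptions made for this sketch.

```python
from typing import Dict, List


def table_to_records(cells: List[List[str]]) -> List[Dict[str, str]]:
    """Treat the first row as the header and map every later row onto it.

    Rows shorter than the header are padded with empty strings (an
    assumption for this sketch); extra cells beyond the header are dropped.
    """
    if not cells:
        return []
    header, *rows = cells
    records = []
    for row in rows:
        padded = row + [""] * (len(header) - len(row))
        records.append(dict(zip(header, padded)))
    return records


# Hypothetical table elements as extracted cell text.
cells = [
    ["company", "year", "revenue"],
    ["ExampleCo", "2020", "120"],
    ["ExampleCo", "2021"],
]
records = table_to_records(cells)
for record in records:
    print(record)
```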
In one embodiment, as shown in fig. 14, the knowledge extracting unit 303 includes a relationship judgment subunit 3031 and a labeling subunit 3032.
A relationship judgment subunit 3031, configured to perform event relationship judgment on the structured data by using a deep learning classification model to obtain an event type; and a labeling subunit 3032, configured to perform sequence labeling on the corresponding structured data by using a BiLSTM + CRF event extraction model according to the specific event type, so as to obtain event elements.
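Sequence labeling produces one tag per token, and the tags still have to be decoded into event elements. The sketch below assumes a BIO tagging scheme, which this document does not specify; the function name, tag names, and sample sentence are hypothetical, and the tags are written by hand rather than predicted by a BiLSTM + CRF model.

```python
from typing import Dict, List


def decode_bio(tokens: List[str], tags: List[str]) -> Dict[str, str]:
    """Group B-/I- tagged tokens into element name -> element value.

    If an element name occurs in several spans, the last span wins.
    """
    elements: Dict[str, str] = {}
    name = None
    span: List[str] = []

    def flush() -> None:
        if name is not None:
            elements[name] = " ".join(span)

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()                      # close any open span
            name, span = tag[2:], [token]
        elif tag.startswith("I-") and name == tag[2:]:
            span.append(token)           # continue the open span
        else:
            flush()                      # "O" or inconsistent tag ends it
            name, span = None, []
    flush()
    return elements


tokens = ["NVIDIA", "will", "acquire", "ARM", "for",
          "40", "billion", "dollars", "in", "cash"]
tags = ["B-acquirer", "O", "O", "B-acquired", "O",
        "B-amount", "I-amount", "I-amount", "O", "B-method"]

elements = decode_bio(tokens, tags)
print(elements)
```

The decoded dictionary has the same name/value shape as the event element pairs used when preparing the word vector training corpus.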
In an embodiment, as shown in fig. 15, the model building unit 304 includes a training subunit 3041, a retrieving subunit 3042, and a sorting subunit 3043.
A training subunit 3041, configured to construct a language model, and train the language model according to the event type and the event elements to obtain an intelligent question-answering model; a retrieval subunit 3042, configured to perform coarse-grained retrieval on the content of the question by using the intelligent question-answering model to obtain potential content paragraphs; a ranking subunit 3043, configured to perform fine-grained ranking on the potential content paragraphs to obtain fine-grained paragraph answers.
It should be noted that, as can be clearly understood by those skilled in the art, the specific implementation processes of the intelligent question-answering model constructing apparatus 300 and each unit may refer to the corresponding descriptions in the foregoing method embodiments, and for convenience and brevity of description, no further description is provided herein.
The above-described intelligent question-answering model construction apparatus 300 can be implemented in the form of a computer program that can be run on a computer device as shown in fig. 16.
Referring to fig. 16, fig. 16 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 500 may be a terminal or a server, where the terminal may be an electronic device with a communication function, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant, and a wearable device. The server may be an independent server or a server cluster composed of a plurality of servers.
Referring to fig. 16, the computer device 500 includes a processor 502, memory, and a network interface 505 connected by a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 may store an operating system 5031 and a computer program 5032. The computer programs 5032 include program instructions that, when executed, cause the processor 502 to perform an intelligent question-answering model construction method.
The processor 502 is used to provide computing and control capabilities to support the operation of the overall computer device 500.
The internal memory 504 provides an environment for the operation of the computer program 5032 in the non-volatile storage medium 503, and when the computer program 5032 is executed by the processor 502, the processor 502 can be caused to execute an intelligent question-answering model construction method.
The network interface 505 is used for network communication with other devices. Those skilled in the art will appreciate that the configuration shown in fig. 16 is a block diagram of only the part of the configuration relevant to the present application and does not limit the computer device 500 to which the present application is applied; a particular computer device 500 may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
Wherein the processor 502 is configured to run the computer program 5032 stored in the memory to implement the following steps:
acquiring a document; analyzing the document to obtain structured data; extracting knowledge from the structured data to obtain event types and event elements; and constructing an intelligent question-answering model according to the event type and the event elements so as to provide answers to questions by utilizing the intelligent question-answering model.
In an embodiment, when the processor 502 implements the step of parsing the document to obtain the structured data, the following steps are specifically implemented:
carrying out structured analysis on the data of the text type in the document to obtain text structured data; carrying out structured analysis on the table objects in the document to obtain table structured data; carrying out structural analysis on the picture objects in the document to obtain image structural data;
wherein the structured data includes text structured data, table structured data, and image structured data.
In an embodiment, when implementing the step of performing the structured parsing on the data of the text type in the document to obtain the text structured data, the processor 502 specifically implements the following steps:
identifying text body content within the document; removing text character styles from the text body content; segmenting the text body content remaining after removal by using an NLP word segmentation tool to obtain a word segmentation result; and performing general entity recognition on the word segmentation result to obtain text structured data.
In an embodiment, when implementing the step of performing the structural analysis on the table object in the document to obtain the table structured data, the processor 502 specifically implements the following steps:
performing element extraction on the table objects in the document to obtain table elements; and carrying out structural transformation on the table elements to obtain table structural data.
In an embodiment, when implementing the step of extracting knowledge from the structured data to obtain an event type and an event element, the processor 502 specifically implements the following steps:
performing event relationship judgment on the structured data by using a deep learning classification model to obtain an event type; and, according to the specific event type, performing sequence labeling on the corresponding structured data by using a BiLSTM + CRF event extraction model to obtain event elements.
In an embodiment, when implementing the step of constructing an intelligent question-answering model according to the event type and the event elements to provide answers to questions by using the intelligent question-answering model, the processor 502 specifically implements the following steps:
constructing a language model, and training the language model according to the event type and the event elements to obtain an intelligent question-answering model; performing coarse-grained retrieval on the content of the question by using the intelligent question-answering model to obtain potential content paragraphs; and performing fine-grained ranking on the potential content paragraphs to obtain fine-grained paragraph answers.
It should be understood that in the embodiment of the present application, the processor 502 may be a central processing unit (CPU), and the processor 502 may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.
It will be understood by those skilled in the art that all or part of the flow of the method implementing the above embodiments may be implemented by a computer program instructing associated hardware. The computer program includes program instructions, and the computer program may be stored in a storage medium, which is a computer-readable storage medium. The program instructions are executed by at least one processor in the computer system to implement the flow steps of the embodiments of the method described above.
Accordingly, the present invention also provides a storage medium. The storage medium may be a computer-readable storage medium. The storage medium stores a computer program, wherein the computer program, when executed by a processor, causes the processor to perform the steps of:
acquiring a document; analyzing the document to obtain structured data; extracting knowledge from the structured data to obtain event types and event elements; and constructing an intelligent question-answering model according to the event type and the event elements so as to provide answers to questions by utilizing the intelligent question-answering model.
In an embodiment, when the processor executes the computer program to realize the step of parsing the document to obtain the structured data, the following steps are specifically realized:
carrying out structured analysis on the data of the text type in the document to obtain text structured data; carrying out structured analysis on the table objects in the document to obtain table structured data; carrying out structural analysis on the picture objects in the document to obtain image structural data;
wherein the structured data includes text structured data, table structured data, and image structured data.
In an embodiment, when the processor executes the computer program to implement the step of performing the structured parsing on the data of the text type in the document to obtain the text structured data, the following steps are specifically implemented:
identifying text body content within the document; removing text character styles from the text body content; segmenting the text body content remaining after removal by using an NLP word segmentation tool to obtain a word segmentation result; and performing general entity recognition on the word segmentation result to obtain text structured data.
In an embodiment, when the processor executes the computer program to implement the step of performing the structured parsing on the table object in the document to obtain the table structured data, the following steps are specifically implemented:
performing element extraction on the table objects in the document to obtain table elements; and carrying out structural transformation on the table elements to obtain table structural data.
In an embodiment, when the processor executes the computer program to perform the step of extracting knowledge from the structured data to obtain an event type and an event element, the following steps are specifically performed:
performing event relationship judgment on the structured data by using a deep learning classification model to obtain an event type; and, according to the specific event type, performing sequence labeling on the corresponding structured data by using a BiLSTM + CRF event extraction model to obtain event elements.
In an embodiment, when the processor executes the computer program to implement the step of constructing a smart question-and-answer model according to the event type and the event element so as to provide answers to questions by using the smart question-and-answer model, the following steps are specifically implemented:
constructing a language model, and training the language model according to the event type and the event elements to obtain an intelligent question-answering model; performing coarse-grained retrieval on the content of the question by using the intelligent question-answering model to obtain potential content paragraphs; and performing fine-grained ranking on the potential content paragraphs to obtain fine-grained paragraph answers.
The storage medium may be any of various computer-readable storage media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disk.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both. To illustrate the interchangeability of hardware and software clearly, the components and steps of the examples have been described above generally in terms of their functions. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative. For example, the division of each unit is only one logic function division, and there may be another division manner in actual implementation. For example, various elements or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented.
The steps in the method of the embodiment of the invention can be sequentially adjusted, combined and deleted according to actual needs. The units in the device of the embodiment of the invention can be merged, divided and deleted according to actual needs. In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a storage medium. Based on such understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, can be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a terminal, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. An intelligent question-answering model construction method, characterized by comprising the following steps:
acquiring a document;
analyzing the document to obtain structured data;
extracting knowledge from the structured data to obtain event types and event elements;
and constructing an intelligent question-answering model according to the event type and the event elements so as to provide answers to questions by utilizing the intelligent question-answering model.
2. The method according to claim 1, wherein the parsing the document to obtain structured data comprises:
carrying out structured analysis on the data of the text type in the document to obtain text structured data;
carrying out structured analysis on the table objects in the document to obtain table structured data;
carrying out structural analysis on the picture objects in the document to obtain image structural data;
wherein the structured data includes text structured data, table structured data, and image structured data.
3. The method for constructing an intelligent question-answering model according to claim 2, wherein the structured parsing of the data of text types in the document to obtain text structured data comprises:
identifying text body content within the document;
removing text character styles from the text body content;
segmenting the text body content remaining after removal by using an NLP word segmentation tool to obtain a word segmentation result;
and performing general entity recognition on the word segmentation result to obtain text structured data.
4. The method according to claim 2, wherein the performing structural analysis on the table object in the document to obtain table structural data includes:
performing element extraction on the table objects in the document to obtain table elements;
and carrying out structural transformation on the table elements to obtain table structural data.
5. The method according to claim 1, wherein the extracting knowledge of the structured data to obtain event types and event elements comprises:
performing event relationship judgment on the structured data by using a deep learning classification model to obtain an event type;
and, according to the specific event type, performing sequence labeling on the corresponding structured data by using a BiLSTM + CRF event extraction model to obtain event elements.
6. The method for constructing an intelligent question-answering model according to claim 1, wherein the constructing an intelligent question-answering model according to the event types and the event elements so as to provide answers to questions by using the intelligent question-answering model comprises the following steps:
constructing a language model, and training the language model according to the event type and the event elements to obtain an intelligent question-answering model;
performing coarse-grained retrieval on the content of the question by using the intelligent question-answering model to obtain potential content paragraphs;
and performing fine-grained ranking on the potential content paragraphs to obtain fine-grained paragraph answers.
7. An intelligent question-answering model construction device, characterized by comprising:
a document acquisition unit configured to acquire a document;
the analysis unit is used for analyzing the document to obtain structured data;
the knowledge extraction unit is used for extracting knowledge from the structured data to obtain an event type and an event element;
and the model construction unit is used for constructing an intelligent question-answering model according to the event type and the event elements so as to provide answers to questions by utilizing the intelligent question-answering model.
8. The intelligent question-answering model building device according to claim 7, wherein the parsing unit includes:
the text analysis subunit is used for performing structured analysis on the data of the text type in the document to obtain text structured data;
the table analysis subunit is used for carrying out structured analysis on the table objects in the document to obtain table structured data;
and the picture analysis subunit is used for carrying out structural analysis on the picture objects in the document to obtain image structural data.
9. A computer device, characterized in that the computer device comprises a memory, on which a computer program is stored, and a processor, which when executing the computer program implements the method according to any of claims 1 to 6.
10. A storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 6.
CN202111003318.1A 2021-08-30 2021-08-30 Intelligent question-answering model construction method and device, computer equipment and storage medium Pending CN113918686A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111003318.1A CN113918686A (en) 2021-08-30 2021-08-30 Intelligent question-answering model construction method and device, computer equipment and storage medium


Publications (1)

Publication Number Publication Date
CN113918686A true CN113918686A (en) 2022-01-11

Family

ID=79233394


Country Status (1)

Country Link
CN (1) CN113918686A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114743204A (en) * 2022-04-11 2022-07-12 平安科技(深圳)有限公司 Automatic question answering method, system, equipment and storage medium for table
JP2023010805A (en) * 2022-05-20 2023-01-20 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Method for training document information extraction model and extracting document information, device, electronic apparatus, storage medium and computer program
CN117408631A (en) * 2023-10-18 2024-01-16 江苏泰坦智慧科技有限公司 Operation ticket generation method, device and storage medium
CN117520549A (en) * 2023-11-20 2024-02-06 北京中关村科金技术有限公司 Document segmentation method, device, equipment and readable storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105701253A (en) * 2016-03-04 2016-06-22 南京大学 Chinese natural language interrogative sentence semantization knowledge base automatic question-answering method
CN108153729A (en) * 2017-12-22 2018-06-12 武汉数博科技有限责任公司 A kind of Knowledge Extraction Method towards financial field
CN110147436A (en) * 2019-03-18 2019-08-20 清华大学 A kind of mixing automatic question-answering method based on padagogical knowledge map and text
CN111292731A (en) * 2018-11-21 2020-06-16 深圳绿米联创科技有限公司 Voice information processing method and device, electronic equipment and storage medium
CN111460831A (en) * 2020-03-27 2020-07-28 科大讯飞股份有限公司 Event determination method, related device and readable storage medium
CN112183030A (en) * 2020-10-10 2021-01-05 深圳壹账通智能科技有限公司 Event extraction method and device based on preset neural network, computer equipment and storage medium
CN112507700A (en) * 2020-11-26 2021-03-16 北京百度网讯科技有限公司 Event extraction method and device, electronic equipment and storage medium
CN112579666A (en) * 2020-12-15 2021-03-30 深港产学研基地(北京大学香港科技大学深圳研修院) Intelligent question-answering system and method and related equipment
CN112949303A (en) * 2021-03-01 2021-06-11 山东健康医疗大数据有限公司 Text word segmentation analysis method and system for medical history text data structuralization



Jasmonts et al. New Information Extracting and Analysis Methodology for the Terminology Research Purposes: The Field of Biology.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination