CN116226349A - Question and answer method and system based on table semantic fastText question analysis - Google Patents

Question and answer method and system based on table semantic fastText question analysis

Info

Publication number
CN116226349A
Authority
CN
China
Prior art keywords
question
dictionary
analysis
answer
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310196713.9A
Other languages
Chinese (zh)
Inventor
谭培波
李建康
刘弦弦
付林虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhitong Yunlian Technology Co ltd
Original Assignee
Beijing Zhitong Yunlian Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhitong Yunlian Technology Co ltd
Priority to CN202310196713.9A
Publication of CN116226349A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06F40/35 Discourse or dialogue representation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a question-answering method and system based on table semantic fastText question analysis. The method comprises the following stages. Data preparation stage: rewriting a table into a chapter semantic text with a hierarchical structure, obtaining analysis dictionaries for question analysis, establishing a question-answer labeling corpus, and establishing a domain-related fastText similarity calculation model, thereby building the data required by the method. Question analysis stage: analyzing the question through the analysis dictionaries, and then performing semantic analysis against the question-answer labeling corpus according to the fastText similarity calculation model. Answer acquisition stage: performing operations according to the semantically parsed question to construct a complete answer sentence. The invention rewrites a complex table into natural language text with a chapter hierarchy, which overcomes the difficulty of normalizing a complex table structure into a database; the fastText question analysis model obtains a multi-attribute solution for a general question by comparison with an approximate solution, which largely meets engineering requirements.

Description

Question and answer method and system based on table semantic fastText question analysis
Technical Field
This document relates to the technical field of question analysis, and in particular to a question-answering method and system based on table semantic fastText question analysis.
Background
Question answering is a typical application of natural language processing in artificial intelligence. The main practical problem is the contradiction between the arbitrariness of how questions are phrased and the uniqueness of answers required in engineering. The prior art struggles to guarantee answer accuracy, which is unacceptable in engineering, and has the following drawbacks:
(1) Ambiguity in question semantic understanding. Frame-semantics-based methods define a question-answer frame for each question, and because the frames are defined manually they are arbitrary and ambiguous. For example, "which wells in the Puguang gas field are deeper than 6000 meters" can be given various semantic names such as "Puguang + attribute" or "Puguang + well + attribute constraint". Each person may define different semantics, so no authoritative semantic definition exists, corpus labeling becomes difficult, and the engineering requirements of answer uniqueness and accuracy cannot be met. If questions are analyzed according to the everyday 5W1H scheme, many non-question sentences and multi-constraint questions in engineering are hard to analyze. For example, "yesterday's daily output" is not a question but a numerical query. "Which wells in the Puguang gas field are deeper than 6000 meters" contains a "what" question at a first layer and another "what" question at a second layer, which makes interpretation with a single-layer 5W1H sequence difficult.
(2) Complex hierarchies in questions are hard to handle. An engineering question-answering system is built within the space-time scope of a particular business of a particular company. For readability and disambiguation, engineering tables generally have complex spatial and temporal structures that are hard to decompose downward into simpler diagrams or small tables. A table typically contains complex business, structural, and temporal relations, and can be understood as a summary table in which various small tables or diagrams are organically spliced together. For example, an oilfield daily production report covers all businesses such as production wells, drilling wells, purification plants, and production service centers; the spatial relations among facilities such as main bodies at all levels, lines, wells, and purification units; and the year-month-day time hierarchy. Objects in this structure are laid out both horizontally and vertically within a single table, rather than in the conventional vertical object-attribute arrangement. Following the traditional simple question-analysis approach would require decomposing the complex engineering table into many small tables or reconstructing a complex knowledge graph for it. This reverse-engineering approach complicates a simple problem and adds much wasted work, so neither the soundness of the engineering logic nor the implementation schedule can be guaranteed.
(3) Too many dictionaries are required, which is impractical for engineering. For example, an oilfield question-answering system needs 39 dictionaries to convert words in a question into entities (time and equipment at various levels), then into concepts (attributes), and then into sentence semantics (question forms), which is a huge engineering effort. The core problem is that after a complexly structured table is decomposed to a finer granularity, the content described by these dictionaries, namely the structure at each level of the engineering table, is merely restated explicitly through many small tables or diagrams, and the implicit knowledge in the engineering table is not used in the most effective and direct way.
(4) Computation across multiple hidden nodes takes too long. In knowledge-graph-based methods, answering a question across nodes takes too long, and cross-node calculation at a large time granularity can even hang the machine. For example, with daily production reports, computing an annual value takes too long to answer in time. For the question "what is the cumulative yield of line XX", there is an implicit containment relation between lines and wells, and year, month, and day also form a time containment hierarchy. Because each daily report covers one day, the database tables are called repeatedly and then read for accumulation, which directly causes the computer to hang.
Therefore, a question-answering method is needed to solve the above-mentioned drawbacks.
Disclosure of Invention
The invention provides a question-answering method and system based on table semantic fastText question analysis, and aims to solve the above problems.
The invention provides a question-answering method based on table semantic fastText question analysis, which comprises the following steps:
S1, data preparation stage: rewriting a table into a chapter semantic text with a hierarchical structure, obtaining analysis dictionaries for question analysis, establishing a question-answer labeling corpus, and establishing a domain-related fastText similarity calculation model, thereby building the data required by the table semantic fastText question-analysis question-answering method;
S2, question analysis stage: analyzing the question through the analysis dictionaries, and performing semantic analysis on the analyzed question against the question-answer labeling corpus according to the fastText similarity calculation model;
S3, answer acquisition stage: performing operations according to the semantically parsed question to construct a complete answer sentence.
The invention provides a question-answering system based on table semantic fastText question analysis, which comprises:
a data module, used for rewriting a table into a chapter semantic text with a hierarchical structure, obtaining analysis dictionaries for question analysis, establishing a question-answer labeling corpus, and establishing a domain-related fastText similarity calculation model, thereby building the data required by the table semantic fastText question-analysis question-answering method;
a processing module, used for analyzing the question through the analysis dictionaries and then performing semantic analysis on the question according to the fastText similarity calculation model;
and an application module, used for obtaining answer sentences from the question-answer labeling corpus according to the semantically parsed question, and assembling them into a complete answer sentence.
According to the embodiments of the invention, question answering is built directly on the complex engineering table, and the complete meaning and complex structure of the engineering table are used to overcome ambiguity in question understanding and the cost of multi-level cross-node calculation. The complex table is rewritten into natural language text with a chapter hierarchy, which overcomes the difficulty of normalizing a complex table structure into a database. In addition, the fastText question analysis model obtains a multi-attribute solution for a general question by comparison with an approximate solution, which largely meets the engineering requirement for data accuracy.
Drawings
To describe one or more embodiments of this specification or the prior-art solutions more clearly, the drawings needed for that description are briefly introduced below. The drawings described below are only some of the embodiments in this specification; a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart of a question-answering method based on table semantic fastText question analysis according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a question-answering system based on table semantic fastText question analysis according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a specific form-based query system according to an embodiment of the present system;
FIG. 4 is a schematic diagram of a daily report structure according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a chapter structure of a daily report according to an embodiment of the present invention;
FIG. 6 is a diagram of a daily knowledge base in accordance with an embodiment of the present invention;
FIG. 7 is a schematic diagram of a table name dictionary in accordance with an embodiment of the present invention;
FIG. 8 is a schematic diagram of a spoken dictionary according to an embodiment of the present invention;
FIG. 9 is a diagram of an object dictionary in accordance with an embodiment of the present invention;
FIG. 10 is a diagram of an object property dictionary in accordance with an embodiment of the present invention;
FIG. 11 is a diagram of an attribute value dictionary according to an embodiment of the present invention;
FIG. 12 is a schematic diagram of a time dictionary according to an embodiment of the present invention;
FIG. 13 is a schematic diagram of a query dictionary in accordance with an embodiment of the present invention;
FIG. 14 is a schematic diagram of a question labeling corpus according to an embodiment of the present invention;
FIG. 15 is a flowchart of a question-answering technique based on table semantic fastText question analysis according to an embodiment of the present invention;
FIG. 16 is a schematic diagram of questions according to an embodiment of the present invention;
FIG. 17 is a schematic diagram of a question analysis result according to an embodiment of the present invention.
Detailed Description
To enable a person skilled in the art to better understand the technical solutions in one or more embodiments of this specification, those solutions are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of this specification. All other embodiments obtained by a person skilled in the art from one or more embodiments of this specification without inventive effort fall within the scope of protection of this document.
Method embodiment
An embodiment of the invention provides a question-answering method based on table semantic fastText question analysis. FIG. 1 is a flowchart of the method. As shown in FIG. 1, the method comprises the following steps:
Step S101, data preparation stage: rewriting a table into a chapter semantic text with a hierarchical structure, obtaining analysis dictionaries for question analysis, establishing a question-answer labeling corpus, and establishing a domain-related fastText similarity calculation model, thereby building the data required by the table semantic fastText question-analysis question-answering method. Step S101 specifically includes:
A query knowledge base is established, and the data related to questions and answers is prepared. The table is rewritten, from whole to part, into a chapter semantic text with a hierarchical structure that serves as the knowledge base (an inverted index) for question search. A large table is converted into a document with a chapter structure, and the chapter structure serves as the template for large tables of the same type. Each small table is a chapter: the header structure of the small table is the chapter name, and the records of the small table are the paragraph text under the chapter. When a structured small table is transformed into paragraph sentences, the semantic information of the small table must be added. Building the text description of the whole table from whole to part, by converting the engineering large table, small tables, table rows, and table records into the documents, chapters, sentences, and words needed for search, is a one-to-one process. Querying runs bottom-up, from part to whole, and is a one-to-many (1-to-n) process: multiple candidate search combinations may be generated from the bottom up, and dictionary-based exclusion conditions at different levels are needed to disambiguate them and obtain the unique answer required in engineering. To support search, all words in the text must be put into a word segmentation dictionary, which serves as the inverted vocabulary for search.
In step S101, rewriting the table into the chapter semantic text with a hierarchical structure specifically includes:
The table of the daily report is converted into a text knowledge base with a hierarchical structure according to the daily report template, where the template defines how the table structure is split. The text knowledge base has four levels: words, constructed sentences, small-table names, and table names, which correspond respectively to the word, sentence, chapter, and document levels of natural language.
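As a rough illustration of the four-level structure described above, the following sketch models a rewritten report as nested Python data and builds an inverted index from words to sentence locations. The sample values and the `build_inverted_index` helper are assumptions made for illustration and are not taken from the disclosure.

```python
from collections import defaultdict

# Hypothetical knowledge base: table name -> small-table name -> sentences.
# Each sentence is a dict of attribute -> value (one table record).
knowledge_base = {
    "daily production report": {
        "gas well production": [
            {"station name": "Station 1", "well number": "W-1", "daily gas": 12.5},
            {"station name": "Station 1", "well number": "W-2", "daily gas": 9.0},
        ],
    },
}


def build_inverted_index(kb):
    """Map every word (attribute name or text value) to the sentences containing it."""
    index = defaultdict(list)
    for table_name, chapters in kb.items():
        for chapter_name, sentences in chapters.items():
            for position, sentence in enumerate(sentences):
                location = (table_name, chapter_name, position)
                for key, value in sentence.items():
                    index[key].append(location)
                    if isinstance(value, str):
                        index[value].append(location)
    return index


index = build_inverted_index(knowledge_base)
```

Looking up a word such as "W-2" in `index` then returns the table, chapter, and sentence position where it occurs, which is the bottom-up direction of the query.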
In step S101, obtaining the analysis dictionaries for question analysis specifically includes:
A table name dictionary is obtained for document-level and chapter-level analysis in question parsing. Table names in the dictionary can be specified by manual intervention according to business requirements or determined from a specific question, and the table name dictionary is updated according to specific questions;
A spoken dictionary, an object dictionary, an object attribute dictionary, an attribute value dictionary, a time dictionary, and a query word dictionary are obtained for word-level analysis in question parsing. The spoken dictionary converts spoken words in a question into standard names. The object dictionary corresponds to the index of the table and holds the semantic objects, including objects with a hierarchy. The object attribute dictionary collects the names of all horizontal attributes in the table, and the attribute value dictionary collects the concrete values of all horizontal attributes in the table. The time dictionary describes the general time structure. The query word dictionary describes the query words that appear in questions and the method used to obtain the answer for each.
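The roles of these dictionaries can be sketched as follows. This is a minimal illustration, assuming tiny hand-written dictionaries; the entries and the `normalize` and `semantic_type` helpers are hypothetical and only show how a spoken phrase might be mapped to a standard name and then to a dictionary type.

```python
# Hypothetical miniature dictionaries; real ones would be built from the report tables.
spoken_dict = {"how deep": "well depth", "gas made": "daily gas production"}
typed_dicts = {
    "object": {"Well W-1", "Line A"},
    "attribute": {"well depth", "daily gas production"},
    "time": {"yesterday", "2011", "year"},
    "query_word": {"how much", "which"},
}


def normalize(phrase):
    """Convert a spoken phrase to its standard name, if one is registered."""
    return spoken_dict.get(phrase, phrase)


def semantic_type(phrase):
    """Return the name of the first dictionary that contains the normalized phrase."""
    standard = normalize(phrase)
    for dict_name, entries in typed_dicts.items():
        if standard in entries:
            return dict_name
    return None
```

For instance, `semantic_type("how deep")` resolves through the spoken dictionary to the attribute "well depth", while an unregistered phrase yields `None`.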
In step S101, establishing the question-answer labeling corpus specifically includes:
All known questions, together with the manually checked correct analysis results for those questions, are collected to form the question-answer labeling corpus. The corpus grows incrementally: each new question is first parsed automatically using the corpus as a reference and then checked manually to obtain the final correct analysis result, and all associated analysis dictionaries are extended at the same time.
In step S101, establishing the domain-related fastText similarity calculation model specifically includes:
The input question and the output table names, objects, and attributes are converted into tensors by word-to-vector (w2v) embedding, and a multi-level neural network mapping is then established between the input tensor and the output attribute tensor, so as to obtain the analysis parameters of the question.
Step S102, question analysis stage: the question is analyzed through the analysis dictionaries, and semantic analysis is performed on the analyzed question against the question-answer labeling corpus according to the fastText similarity calculation model. Step S102 specifically includes:
The question is analyzed as y = f(x) using the fastText model. Following the principle of best character match, the similarity between the question and the known questions in the question-answer corpus is calculated to find the nearest reference analysis result y0 = f0(x0). Question analysis proceeds at two levels, the character level and the semantic level. The question is compared with the reference question and decomposed at the character level as a linear sequence with non-overlapping labels, without yet considering semantic phenomena such as fusion and borrowing between words (fusion: for example, "how much gas is produced" = "daily gas production" + "how much"; borrowing: for example, "how deep" = "well depth" + "how much"). The decomposed characters are then mapped to semantic elements through the dictionaries, realizing the analysis from characters to semantics, that is, the characters are matched to attributes and attribute values in the table; semantic phenomena such as fusion and borrowing are resolved directly by dictionary lookup at this step. If y0, f0, and x0 appear in the question, then y = y0, f = f0, and x = x0; otherwise, dictionary matching is performed for whichever of y0, f0, and x0 does not appear in the question. This completes the analysis y = f(x).
Step S103, answer acquisition stage: operations are performed according to the semantically parsed question to construct a complete answer sentence. Step S103 specifically includes:
The answer is calculated from the analysis result y = f(x) and then constructed. Valid knowledge items containing complete semantics and data are obtained through y and x (the query -> knowledge base step), which is equivalent to loading a table header and its data; the final result is obtained through f (the knowledge base -> answer step), which is equivalent to processing the data. Most typically, f is a query operation on the table data; more complex operations include constrained queries, statistical analysis, and so on. An answer template is constructed according to how people habitually answer questions, the acquired data is filled into the template, and the complete answer is output.
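The roles of x, y, and f in this stage can be sketched as follows, under the assumption that x is a set of constraints, y is the attribute being asked about, and f is an aggregation selected by the query word. The records, the operation table, the parameter name `op`, and the answer template are all hypothetical.

```python
# Hypothetical knowledge items: each is one table record with its header fields.
records = [
    {"well number": "W-1", "year": 2011, "daily gas": 12.5},
    {"well number": "W-2", "year": 2011, "daily gas": 9.0},
    {"well number": "W-3", "year": 2012, "daily gas": 7.0},
]

# f: operations that a query word may map to (an illustrative subset).
operations = {"sum": sum, "max": max, "count": len}


def answer(x, y, op, template="The {op} of {y} for {x} is {value}."):
    """Select records matching constraints x, read attribute y, apply operation op."""
    selected = [r[y] for r in records if all(r.get(k) == v for k, v in x.items())]
    value = operations[op](selected)
    constraint_text = ", ".join(f"{k} {v}" for k, v in x.items())
    return template.format(op=op, y=y, x=constraint_text, value=value)


result = answer({"year": 2011}, "daily gas", "sum")
```

Keeping the template separate from the calculation means the wording of the answer can be adjusted without touching the query logic.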
Referring to fig. 15, a specific embodiment of the method embodiment includes 3 large flows, knowledge base construction, question parsing and answer generation, which are described in detail as follows:
step 151: knowledge base construction
The more than 1000 production-related tables used in the project are rewritten into sentences for question answering, converting structured data queries into natural language search.
Step 151-1: reading report corpus
The report is a complete engineering table, as shown in FIG. 4. To suit engineering reading habits, its layout combines date information with tables of different specifications, and it also includes some text that is not part of any table. Therefore, when the report is read with pandas, the table header should not be identified automatically.
Step 151-2: report form reading template
The report template is shown in FIG. 5. It locates the text structure of the report, is general, and is the same for all reports of the same type, so it is the basis for automatically recognizing and converting reports.
Step 151-3: multi-line processing in report form
Line breaks inside cells of the Excel table carry business meaning and generally appear within quoted strings; for example, "oil temperature\r\n°C" expresses the relation between the oil temperature and its unit. The \r\n is replaced so that each cell holds a single string without \r\n; otherwise \r\n causes many display problems during text processing.
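A minimal sketch of this cleanup, assuming a row has already been read into a plain Python list; the `clean_cell` helper and the sample row are hypothetical. It removes both carriage returns and line feeds, which is slightly broader than replacing the exact "\r\n" pair.

```python
def clean_cell(cell):
    """Remove carriage returns and line feeds from a text cell; leave other values alone."""
    if isinstance(cell, str):
        return cell.replace("\r", "").replace("\n", "")
    return cell


row = ["wellhead pressure\r\n(MPa)", 45.2, "casing\npressure"]
cleaned = [clean_cell(cell) for cell in row]
```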
Step 151-4: report form line splicing
The report is organized by column, which does not match the habit of processing natural language sentence by sentence. Therefore, for each row, all numeric cells are first removed and only the non-numeric cells are kept; the remaining cells are then sorted with the sort() method, which ensures that tables with the same header but different column positions produce the same semantic structure; finally, the text cells of each record are joined with "@@". For example, "station name@@well number@@production time h@@wellhead temperature@@wellhead pressure (MPa)@@production allocation@@daily@@monthly@@annual cumulative@@cumulative" represents a table header.
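The splicing step can be sketched in plain Python as follows. The `is_numeric` and `splice_row` helpers are assumptions for illustration; in particular, treating numeric-looking strings as numeric cells is a choice made here, not something the text specifies.

```python
def is_numeric(cell):
    """True for numbers and for strings that parse as a number."""
    if isinstance(cell, bool):
        return False
    if isinstance(cell, (int, float)):
        return True
    try:
        float(cell)
        return True
    except (TypeError, ValueError):
        return False


def splice_row(cells, sep="@@"):
    """Drop empty and numeric cells, sort the rest, and join them into one string."""
    text_cells = [str(c) for c in cells if c not in (None, "") and not is_numeric(c)]
    return sep.join(sorted(text_cells))


row_a = splice_row(["well number", None, "station name", 3.5, "12"])
row_b = splice_row(["station name", 7, "well number"])
```

Because the cells are sorted before joining, `row_a` and `row_b` come out identical even though their columns appear in different orders.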
Step 151-5: report text structure identification
The row text of the report is compared with the template to mark the positions of the table headers; if the report is regarded as a document, each small-table header is identified as a chapter heading of that document.
Step 151-6: multi-line header processing
A multi-row header expresses the hierarchical structure of the attributes, and an upper-level header is filled in only once, in the first cell it spans. Therefore, for the rows marked as small-table headers, every row is filled using the ffill() method; the multi-level headers are then merged, with the levels joined by "@@"; finally the redundant header rows are deleted so that only the single merged header row is kept. For example, wellhead pressure (MPa)@@oil pressure, wellhead pressure (MPa)@@casing pressure, production allocation@@10^4 m^3, daily@@wellhead 10^4 m^3, daily@@liquid m^3, monthly@@wellhead 10^4 m^3, monthly@@liquid m^3, annual cumulative@@wellhead 10^8 m^3, annual cumulative@@liquid 10^4 m^3, cumulative@@wellhead 10^8 m^3, cumulative@@liquid 10^4 m^3, cumulative@@wellhead 10^4 m^3, cumulative@@liquid 10^4 m^3, and so on represent a set of two-level indicators.
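The fill-and-merge idea can be illustrated without pandas as below. The `forward_fill` helper imitates what a forward fill does on one header row, and `merge_header_rows` joins the levels column by column; both names and the sample header are hypothetical. As a simplification, only the upper levels are forward-filled here, which is an assumption of this sketch.

```python
def forward_fill(row):
    """Copy the last non-empty value into following empty cells (like a forward fill)."""
    filled, last = [], None
    for cell in row:
        if cell not in (None, ""):
            last = cell
        filled.append(last)
    return filled


def merge_header_rows(header_rows, sep="@@"):
    """Forward-fill every level except the last, then join the levels column by column."""
    levels = [forward_fill(r) for r in header_rows[:-1]] + [list(header_rows[-1])]
    merged = []
    for column in zip(*levels):
        parts = [p for p in column if p not in (None, "")]
        merged.append(sep.join(parts))
    return merged


top = ["well number", "wellhead pressure (MPa)", None, "daily", None]
bottom = [None, "oil pressure", "casing pressure", "gas", "liquid"]
header = merge_header_rows([top, bottom])
```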
Step 151-7: single line text, dictionary generation
Single-line text refers to the case where attribute-value pairs are arranged in a row. The single-line text is first identified; all empty columns are then removed with the pandas DataFrame method dropna(how='all', axis=1, inplace=True), giving a structure in which attributes and values alternate; the attribute-value pairs are then taken out in order and written in key-value dictionary form, and the dictionary is used directly as a sentence. For example, s = {'measure production (10^4 m^3)': 0.0, 'new well production (10^4 m^3)': 0.0, 'old well production (10^4 m^3)': 2433.0} is one single-line record text.
At the same time, all table cells that contain no numbers are output as a word dictionary. For example, c = {'measure production (10^4 m^3)', 'old well production (10^4 m^3)', 'new well production (10^4 m^3)'} is a word dictionary. After the words of an input question are processed, if they intersect with dictionary c, the corresponding sentence is s, where s is a table record with complete semantics.
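A small sketch of this step, assuming the row has already been read into a list with alternating attribute and value cells and possible blanks. The `pairs_to_sentence` and `lookup` helpers and the shortened attribute names are hypothetical.

```python
def pairs_to_sentence(cells):
    """Turn [attr, value, attr, value, ...] (blank cells allowed) into a dict."""
    non_empty = [cell for cell in cells if cell not in (None, "")]
    return dict(zip(non_empty[0::2], non_empty[1::2]))


row = ["measure production", 0.0, None, "new well production", 0.0,
       "old well production", 2433.0]
s = pairs_to_sentence(row)  # the record "sentence"
c = set(s)                  # word dictionary: the attribute names of the record


def lookup(question_words):
    """Return the sentence s if any question word appears in the word dictionary c."""
    return s if c & set(question_words) else None
```

With this structure, a question that mentions "old well production" intersects `c`, so the whole record `s` is retrieved together with its values.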
Step 151-8: small form text, dictionary generation
Because a small table has a clear table structure, the header information must be carried along when it is rewritten into sentences. First, all small-table entries are taken out, and all empty columns are removed with the pandas DataFrame method dropna(how='all', axis=1, inplace=True). The row numbers of the small-table headers are then taken out, and between two consecutive header row numbers two tables are constructed, one with the header and one without. Empty columns are removed from the table without the header, and its remaining columns are used to select the corresponding columns of the table with the header, whose header is stored in its first row. Each record row is then converted into a dictionary-form sentence carrying the header fields with the to_dict('records') method; the first row is the header row itself and must be removed.
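The effect of this conversion can be imitated in plain Python as follows. This is only a sketch of the outcome (empty columns dropped, one dict per record keyed by header field), not the pandas-based procedure described above; the `table_to_sentences` helper and the sample table are hypothetical.

```python
def table_to_sentences(rows):
    """rows[0] is the header; return one dict per record, skipping empty columns."""
    header, records = rows[0], rows[1:]
    # Keep only columns that have a header and at least one non-empty value.
    keep = [i for i, name in enumerate(header)
            if name not in (None, "")
            and any(r[i] not in (None, "") for r in records)]
    return [{header[i]: r[i] for i in keep} for r in records]


small_table = [
    ["station name", "well number", None, "daily gas"],
    ["Station 1", "W-1", None, 12.5],
    ["Station 1", "W-2", None, 9.0],
]
sentences = table_to_sentences(small_table)
```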
The attribute dictionary shown in fig. 10 and the attribute value dictionary shown in fig. 11 are simultaneously output, and the positions of the attributes and the values in the text structure are also indicated.
Step 152: question parsing
A reference analysis result y0 = f0(x0) is obtained from the known question-answer corpus by fastText-based similarity calculation on the question, and the final analysis result y = f(x) is obtained by comparison with the reference result.
Step 152-1: reading the question-answer labeling corpus and the dictionaries
The dictionaries shown in FIGS. 9-13 are read, as well as the corpus shown in FIG. 14.
Step 152-2: reading questions
The questions shown in FIG. 16 are read. Some are identical to questions in the corpus and represent repeated questions; others are new questions, which are the focus of question analysis.
Step 152-3: calculating similarity between question and corpus
Similarity based on the fastText model is calculated between the question and each question in the corpus, using the cosine distance from scipy.spatial.distance as follows:
diss=1-scipy.spatial.distance.cosine(model.get_word_vector(word),model.get_word_vector(word1))
The diss values are then sorted in descending order, and the top-ranked corpus question is taken as the reference question most similar to the input question.
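The ranking step can be sketched with a hand-written cosine similarity. The three-dimensional vectors below are made-up stand-ins for fastText embeddings, and the `cosine_similarity` and `most_similar` helpers are assumptions for illustration; in practice the vectors would come from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(question_vec, corpus_vecs):
    """Return (similarity, question) pairs sorted from most to least similar."""
    scored = [(cosine_similarity(question_vec, vec), text)
              for text, vec in corpus_vecs.items()]
    return sorted(scored, reverse=True)


# Toy 3-dimensional vectors standing in for fastText embeddings (made-up values).
corpus_vecs = {
    "how many wells were put into production in 2012": [1.0, 0.0, 1.0],
    "what was yesterday's daily output": [0.0, 1.0, 0.0],
}
ranking = most_similar([1.0, 0.0, 1.0], corpus_vecs)
reference_question = ranking[0][1]
```

The similarity computed here equals 1 minus the cosine distance, so sorting it in descending order matches the ordering of diss described above.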
Step 152-4: obtaining the reference y0 = f0(x0)
The corresponding character analysis result and semantic analysis result are extracted from the question labeling corpus as y0 = f0(x0), that is, the analysis result to be referenced.
Step 152-5: calculating the part of y0, f0, x0 that appears in the question
If the characters of y0, f0, or x0 appear in the question, that part of the reference analysis result is taken directly as part of the final analysis result y, f, x.
The known elements are then replaced in the original question: q1 = q.replace(j, '#', 1). The original question is thus narrowed step by step, which reduces interference when searching for new elements.
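A minimal sketch of this masking step, assuming the reference elements are plain strings; the `mask_known` helper, the sample question, and the element list are hypothetical.

```python
def mask_known(question, reference_elements, mask="#"):
    """Keep reference elements found in the question and mask them out of it."""
    found, remaining = [], question
    for element in reference_elements:
        if element in remaining:
            found.append(element)
            remaining = remaining.replace(element, mask, 1)  # first occurrence only
    return found, remaining


found, q1 = mask_known(
    "how many production wells did Block A have in 2011",
    ["Block A", "2012", "production wells"],
)
```

Here the reference element "2012" is not in the question, so it is left for the dictionary matching of the next step, while the two elements that were found no longer interfere with that search.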
Step 152-6: resolving the remaining y0, f0, x0
For each element of y0, f0, x0 that does not appear in the question, the dictionary type of the semantic element is first determined from the reference semantics; the corresponding dictionary is then used to search q1, and the longest match is taken as the analyzed semantic element. The spoken dictionary need not be queried in q1.
After the semantic attribute of the newly found character string is confirmed, it directly replaces the corresponding element of y0 in y as the semantic analysis result.
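A minimal sketch of step 152-6, assuming plain-string dictionaries keyed by dictionary type. The helper names, the dictionary contents and the masked question are hypothetical; only the "longest result wins" rule comes from the description.

```python
def longest_match(q1, dictionary):
    """Return the longest dictionary entry that occurs in q1, or None if none does."""
    hits = [entry for entry in dictionary if entry in q1]
    return max(hits, key=len) if hits else None

def resolve_missing(q1, missing, reference_types, dictionaries):
    """Map each unmatched reference element to the new element found in q1."""
    resolved = {}
    for element in missing:
        dictionary = dictionaries[reference_types[element]]
        match = longest_match(q1, dictionary)
        if match is not None:
            resolved[element] = match
            q1 = q1.replace(match, "#", 1)  # mask it so it is not matched twice
    return resolved, q1

# Hypothetical data: the reference element "2012" is typed as an attribute value.
dictionaries = {"attribute value": ["20", "2011", "2010", "201"]}
reference_types = {"2012": "attribute value"}
resolved, remainder = resolve_missing(
    "how many # in # in 2011", ["2012"], reference_types, dictionaries
)
```

The mapping in `resolved` is what lets the reference element be swapped for the new one while its semantic label is kept.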
Step 152-7: obtaining the complete y = f(x)
After steps 152-5 and 152-6, the complete analysis results y = f(x) of the questions are as shown in fig. 17. Questions with a similarity of 1 are repeated questions, whose analysis results can be copied directly from the corpus labeling results.
For the question "how many production wells the general subject 2011 is in the year", with the known question "what the general subject 2012 is in the year" as reference, the analysis result is as follows:
Character analysis: plain body@@@2011@@@annual@@@number of wells (ports) put into production
Semantic parsing: plain body block/object Obj@@@2011 number/attribute value@@@annual/attribute t@@@number of wells (ports)/attribute y
It can be seen that "2011" is a time object converted from the aligned character string.
Step 153: sentence generation
Data are looked up in the knowledge base according to the character analysis result of the question, the value of y is calculated according to the semantic analysis result, and the answer sentence is finally assembled and output.
Step 153-1: searching data according to x, y
The parsed semantic elements are formed into a set and matched against the words in the knowledge base; the knowledge item with the most matching items is the required data (see fig. 17). For example, the sentence matched by the set {'number of production wells (mouths)', 'block', 'general light body', '2011', 'year'} evidently provides all the information required for the general light body in 2011.
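The "most matching items" rule of step 153-1 can be sketched with set intersection. The record layout (a `sentence` plus a `words` set) and the sample data are assumptions made for illustration only.

```python
def best_record(semantic_elements, knowledge_base):
    """Return the record sharing the most words with the parsed element set."""
    return max(
        knowledge_base,
        key=lambda record: len(semantic_elements & record["words"]),
    )

# Hypothetical knowledge items: one sentence per complete table record.
knowledge_base = [
    {"sentence": "block A, year 2011, production wells 35",
     "words": {"block", "A", "year", "2011", "production wells", "35"}},
    {"sentence": "block A, year 2012, production wells 40",
     "words": {"block", "A", "year", "2012", "production wells", "40"}},
    {"sentence": "block B, year 2011, sulfur output 12",
     "words": {"block", "B", "year", "2011", "sulfur output", "12"}},
]
query = {"production wells", "block", "A", "2011", "year"}
record = best_record(query, knowledge_base)
```

With ties, `max` returns the first best record; a real system would need an explicit tie-breaking rule.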
Step 153-2: data recovery
Because each retrieved item is a complete record while an answer usually needs only a small part of it, when a statistical calculation is involved the data must be cut down according to the header given by the semantics to obtain a small table for the calculation.
Step 153-3: data calculation from f
The method f is the calculation used to obtain the y data, such as a query or a statistical operation; each f corresponds to a semantic frame or template.
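Steps 153-2 and 153-3 can be read as "select a column, then dispatch on the query word". The sketch below assumes a hypothetical `METHODS` table standing in for the query word dictionary and records stored as Python dicts; none of these names come from the patent.

```python
# Hypothetical query-word dictionary: each query word selects a calculation method f.
METHODS = {
    "what is": lambda values: values[0],                  # plain lookup of one value
    "total": sum,                                         # statistical operations
    "average": lambda values: sum(values) / len(values),
    "maximum": max,
}

def intercept(records, header):
    """Cut complete records down to the single column named by the parsed header."""
    return [record[header] for record in records if header in record]

def compute(query_word, records, header):
    """Apply the method f chosen by the query word to the intercepted column."""
    return METHODS[query_word](intercept(records, header))

records = [
    {"block": "A", "year": 2011, "production wells": 35},
    {"block": "B", "year": 2011, "production wells": 15},
]
total = compute("total", records, "production wells")
average = compute("average", records, "production wells")
```

Keeping f in a lookup table means a new query word only needs a new table entry, which fits the statement that each f corresponds to one semantic frame or template.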
Step 153-4: obtaining the answer sentence template
The answer sentence template is taken directly from the reference label.
Step 153-5: answer sentence assembly
The corresponding words in the answer sentence template are replaced by the parsed semantic words and the calculation results to obtain the final answer sentence.
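A minimal sketch of answer assembly, assuming the template is simply the reference answer and that each reference word is swapped once for its new counterpart. The function name and example sentence are hypothetical.

```python
def assemble_answer(reference_answer, replacements):
    """Swap each reference word for the newly parsed or computed value, once each."""
    answer = reference_answer
    for old, new in replacements.items():
        answer = answer.replace(old, str(new), 1)
    return answer

# Reference answer labeled for the known 2012 question (hypothetical wording).
template = "In 2012, block A had 40 production wells."
answer = assemble_answer(template, {"2012": "2011", "40": 35})
```

Plain substring replacement can collide when one value is contained in another (for example "40" inside "2040"), so a production version would more likely use explicit placeholders.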
By adopting the embodiment of the invention, the following beneficial effects are achieved:
Question answering is built directly on complex engineering forms, and the complete meaning and complex structure of the forms are used to resolve ambiguity in understanding questions and to handle multi-level cross-point node calculation. Rewriting a complex form into natural language text with a chapter hierarchy overcomes the difficulty of normalizing complex form structures into database tasks. Meanwhile, the fastText question analysis model obtains multi-attribute solutions for general questions by comparison with an approximate solution, which largely meets engineering requirements. In practice, question answering over 13 engineering forms at an oilfield production site reached an accuracy of 98%, meeting the engineering requirement for data accuracy.
System embodiment
The embodiment of the invention provides a question-answering system based on table semantic fastText question analysis. Fig. 2 is a schematic diagram of this system according to the embodiment of the invention; as shown in fig. 2, the system specifically comprises:
the data module 20, used for rewriting the table into a chapter semantic text with a hierarchical structure, obtaining an analysis dictionary for question analysis, establishing a question-answer labeling corpus, and establishing a domain-specific fastText similarity calculation model, so as to build the data required by the question-answering method based on table semantic fastText question analysis;
the processing module 22, configured to analyze the question through the analysis dictionary and then analyze it semantically according to the fastText similarity calculation model;
and the application module 24, used for obtaining the answer sentence through the question-answer labeling corpus according to the semantically analyzed question, and obtaining the complete answer sentence after assembly.
Fig. 3 is a schematic diagram of a specific question-answering system based on a table query mode in this system embodiment. It consists of a data layer 1, a processing layer 2 and an application layer 3, where the data layer 1 corresponds to the data module 20, the processing layer 2 to the processing module 22, and the application layer 3 to the application module 24 of the question-answering system based on table semantic fastText question analysis in this embodiment. In a specific embodiment of the invention, the data layer provides file storage, reading, writing and modification, and holds the engineering materials required for question answering, namely the reports prepared in advance, together with the dictionaries, the model and the labeled question-answer corpus used for natural language analysis of questions. The processing layer performs question analysis, table lookup and answer sentence generation, which makes the whole question-answering system equivalent to a query system. The application layer handles interaction with the user and comprises a question input part and an answer output part.
The data layer 1 consists of 4 parts: a report knowledge base 1-1, a dictionary 1-2, a question-answer labeling corpus 1-3 and a w2v vector model 1-4 for the petrochemical field; the data layer as a whole provides the various data required for data output. The report knowledge base 1-1 consists of the daily report 1-1-1, the daily report template 1-1-2 and the daily report knowledge base 1-1-3, and converts the daily reports used in engineering into the natural language knowledge base required for queries. A typical structure of the daily report 1-1-1 is shown in fig. 4 (the structure is not limited to this one). Fig. 4 is a summary of 7 different services; apart from the conventional tables with objects arranged vertically and attributes horizontally, the 6th is a form in which only object attributes are arranged side by side, which is used when there are few total indexes or little data. The forms are views generated from several databases, and their layout follows engineering reading habits, so performing question-answer processing directly on the forms matches engineering practice.
As shown in fig. 5, the daily report template 1-1-2 treats a table as an article and each small table as a section; the head of the small table is used as the name of the section, and the table header fields are spliced together. The daily report knowledge base 1-1-3 is shown in fig. 6: one engineering daily report is rewritten into a text knowledge base with a hierarchical structure. The knowledge base comprises 4 levels, namely words, constructed sentences, small table names and table names, which correspond to the 4 levels of words, sentences, chapters and documents in a text. The words are used for searching, and candidate materials are selected through the query. A sentence is a complete form record; it contains the object attribute and the object attribute value of each field and is an instantiation of the small table semantics. The small table names each represent part of a semantic frame, and sentences belonging to the same small table name share the same table structure or frame semantics. The table name is the name of the entire table and represents a complete engineering scenario.
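The four-level rewriting described above (words, sentences, small table name, table name) can be sketched as follows. The input layout, the sentence wording produced by joining "header value" pairs, and all sample names are assumptions for illustration; the actual template of fig. 5 is not reproduced here.

```python
def table_to_knowledge_base(table_name, small_tables):
    """Rewrite small tables into sentence records tagged with chapter and document names.

    small_tables maps a small table name to (headers, rows).
    """
    records = []
    for small_name, (headers, rows) in small_tables.items():
        for row in rows:
            parts = []
            words = [table_name, small_name]       # document and chapter level
            for header, value in zip(headers, row):
                parts.append(f"{header} {value}")
                words += [header, str(value)]      # word level, used for searching
            records.append({
                "table": table_name,
                "small_table": small_name,
                "sentence": ", ".join(parts),      # one complete table record
                "words": words,
            })
    return records

kb = table_to_knowledge_base(
    "gas field daily report",
    {"production summary": (["block", "year", "production wells"],
                            [["A", 2011, 35], ["B", 2011, 15]])},
)
```

Every sentence generated from the same small table shares the same headers, which is the sense in which a small table name stands for one semantic frame.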
The dictionary 1-2 comprises the table name dictionary 1-2-1, used for document- and chapter-level analysis of questions, together with the spoken dictionary 1-2-2, the object dictionary 1-2-3, the object attribute dictionary 1-2-4, the attribute value dictionary 1-2-5, the time dictionary 1-2-6 and the query word dictionary 1-2-7, used for word-level analysis of sentences; together they realize the conversion between the levels of meaningless characters, standard terms, table attributes and operation methods. If all possible questions are regarded as one search text, the word-level dictionaries are the vocabulary from which that text can be retrieved. The table name dictionary 1-2-1 consists of table names, as shown in fig. 7. It generally covers all data tables and views involved in the engineering, which may number more than 1000, although usually no more than 10 are related to a given project, so the scope of the project can be kept in focus. Table names are mainly designated according to service requirements: the same question may correspond to several services, i.e. several tables, but only one table needs to be selected for a specific scenario, so the table name must be designated by manual intervention. The table name is determined by the specific question and is identified mainly from the objects and attributes that appear in it (for example, sulfur belongs only to the daily report), so the table name dictionary is updated along with the specific questions; automatic identification of table names is realized by classifying the questions.
The small table is defined as the basic structure of a query. Its basic semantic structure is the 3-tuple object + attribute + value, in which objects are arranged vertically, attributes are arranged horizontally, and the intersection of an object row and an attribute column determines the position of the value; exceptions exist, however, and the value is not necessarily located at the intersection. The spoken dictionary 1-2-2 collects the everyday spoken expressions for objects and attributes. Because spoken language is strongly regional and not widely agreed upon, it is handled in a separate dictionary that can be extended or trimmed at any time to suit different application scenarios, as shown in fig. 8. The object dictionary 1-2-3 corresponds to the index in a table and holds the semantic objects, or objects with a hierarchy, as shown in fig. 9. The object attribute dictionary 1-2-4 holds the content of all horizontal attributes in the tables, as shown in fig. 10; for the multi-level attributes that occur in many tables, the hierarchical relationship is also listed, denoted by @@. The attribute value dictionary 1-2-5 is shown in fig. 11; dictionaries 1-2-3, 1-2-4 and 1-2-5 together describe the elements of a table. The time dictionary 1-2-6, shown in fig. 12, contains many regular expressions describing general time structures. The query word dictionary 1-2-7, shown in fig. 13, describes the query words appearing in questions and the methods used to obtain the answers, which correspond directly to methods in the program.
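As an illustration of a regular-expression time dictionary, the sketch below tries a few patterns from most to least specific. The patterns, labels and date formats are hypothetical and are not the entries of fig. 12.

```python
import re

# Hypothetical time-dictionary entries, ordered from most to least specific.
TIME_PATTERNS = [
    ("date", re.compile(r"\d{4}-\d{1,2}-\d{1,2}")),
    ("month", re.compile(r"\d{4}-\d{1,2}")),
    ("year", re.compile(r"(?:19|20)\d{2}")),
]

def find_time(question):
    """Return (label, matched text) for the first pattern that matches, else None."""
    for label, pattern in TIME_PATTERNS:
        match = pattern.search(question)
        if match:
            return label, match.group(0)
    return None

year_hit = find_time("production wells of block A in 2011")
date_hit = find_time("gas output on 2011-03-05")
```

Ordering the patterns by specificity plays the same role as the "longest result" rule used for the plain-string dictionaries.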
The question-answer labeling corpus 1-3 collects all known questions together with their correct analysis results obtained through manual correction; its format is shown in fig. 14. The labels express meaningless characters, standard terms, table attributes and question semantics. For example, the label 'number of production wells (mouth)/attribute/y' indicates that the standard term for the character string 'number of production wells' is 'number of production wells (mouth)', that it is an attribute of a table, i.e. a table header, and that in the question it denotes the required result y. In a limited scenario, the corpus tends to converge to a certain amount of data as questioning goes deeper. The corpus is built up continuously in incremental form: the automatically analyzed result of each new sentence is used as a reference, and the final correct analysis result is then obtained through manual correction; at the same time, these labels are added to all associated dictionaries 1-2. Corpus labeling is analysis at the sentence level, and its growth directly affects the content of the word-level analysis dictionaries below it, so the upper and lower levels are automatically linked.
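The label format in the example above ('standard term/dictionary type/role') can be parsed with a few lines of code. The sketch assumes that elements are joined by "@@@" and that every element has exactly three slash-separated fields; the exact serialization of fig. 14 may differ, and the sample line is hypothetical.

```python
from typing import NamedTuple

class Element(NamedTuple):
    term: str        # standard term
    dict_type: str   # which dictionary the term belongs to
    role: str        # role in the question, e.g. 'y' for the requested result

def parse_annotation(line, separator="@@@"):
    """Split one labeled question into its (term, dictionary type, role) elements."""
    elements = []
    for chunk in line.split(separator):
        # rsplit keeps any '/' inside the term itself intact.
        term, dict_type, role = chunk.strip().rsplit("/", 2)
        elements.append(Element(term, dict_type, role))
    return elements

line = "block A/object/x@@@2011/attribute value/x@@@number of production wells (mouth)/attribute/y"
elements = parse_annotation(line)
requested = [e.term for e in elements if e.role == "y"]
```

Because each label carries its dictionary type, adding a labeled sentence also tells the system which word-level dictionary should receive the new term.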
1-4 is the fastText w2v vector data model for the petrochemical field. It was trained on 2.66 million articles of professional petrochemical literature (titles, abstracts and body text), 22 million sentences in total, with a vector dimension of 120 and a training time of 5.5 hours. Because the table names, objects and attributes corresponding to the question-answer corpus are numerous and change dynamically, the corpus cannot be classified by a classification model with a fixed number of classes. Instead, the input question and the output table names, objects and attributes are converted into data and tensors in the w2v manner, and a multi-level neural network mapping is then established between the input tensor and the output attribute tensor, so that the analysis parameters of various questions can be calculated from the input question.
The processing layer 2 comprises 3 modules, namely knowledge base construction 2-1, question analysis 2-2 and answer generation 2-3, and carries out the analysis, disambiguation and calculation of questions and the final generation of answer sentences. Knowledge base construction 2-1 comprises a daily report knowledge base construction module 2-1-1 and a dictionary construction module 2-1-2, which respectively build the knowledge base bottom-up from the tables in the database and output the corresponding dictionaries. Question analysis 2-2 comprises question fastText similarity calculation 2-2-1, question character comparison 2-2-2 and question semantic element query 2-2-3, which together analyze the question from characters to semantics. Answer generation 2-3 comprises two modules, question result calculation 2-3-1 and answer assembly 2-3-2, which process the data and assemble the answer according to the template to produce the complete answer sentence.
The application layer 3 comprises two modules, namely question input 3-1 and answer output 3-2, and is used for realizing the input of a question and the presentation of a final answer on an interface.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims (10)

1. A question-answering method based on table semantic fastText question analysis, characterized by comprising the following steps:
S1, data preparation stage: rewriting a table into a chapter semantic text with a hierarchical structure, obtaining an analysis dictionary for question analysis, establishing a question-answer labeling corpus, and establishing a domain-specific fastText similarity calculation model, so as to build the data required by the question-answering method based on table semantic fastText question analysis;
S2, question analysis stage: analyzing the question through the analysis dictionary, and semantically analyzing the analyzed question against the question-answer labeling corpus according to the fastText similarity calculation model;
S3, answer sentence acquisition stage: performing the calculation according to the semantically analyzed question and constructing a complete answer sentence.
2. The method according to claim 1, wherein the step S1 of rewriting the table into the chapter semantic text having the hierarchical structure specifically includes:
converting the table of the daily report into a text knowledge base with a hierarchical structure according to a daily report template, wherein the daily report template implements the splitting of the table structure, and the text knowledge base comprises four levels, namely words, constructed sentences, small table names and table names, which correspond respectively to the four levels of words, sentences, chapters and documents in natural language.
3. The method according to claim 1, wherein the step S1 of obtaining the parsing dictionary for question parsing specifically includes:
acquiring a table name dictionary for document- and chapter-level analysis of questions, wherein the table name in the table name dictionary is designated by manual intervention according to service requirements, the table name is determined by the specific question, and the table name dictionary is updated according to the specific question;
acquiring a spoken dictionary, an object dictionary, an object attribute dictionary, an attribute value dictionary, a time dictionary and a query word dictionary for word-level analysis of questions, wherein the spoken dictionary realizes the conversion from spoken words in questions to standard names; the object dictionary corresponds to the index in a table and holds the semantic objects or objects with a hierarchy; the object attribute dictionary collects the content of all transverse attributes in the table, and the attribute value dictionary collects the specific values of all transverse attributes in the table; the time dictionary is used for describing general time structures; the query word dictionary is used for describing the query words appearing in questions and the methods adopted for obtaining answers.
4. The method according to claim 1, wherein the creating a question-answer labeling corpus in step S1 specifically includes:
collecting all known questions and the manually checked correct analysis results of the corresponding questions to form the question-answer labeling corpus, wherein the question-answer labeling corpus is built up continuously in incremental form: the automatic analysis result of each new question is used as a reference, and the final correct analysis result is then obtained through manual checking; the labels are added to all associated analysis dictionaries at the same time.
5. The method according to claim 1, wherein the establishing a domain-related fasttext similarity calculation model in step S1 specifically includes:
converting the input question and the output table names, objects and attributes into data and tensors in a w2v manner, and then establishing a multi-level neural network mapping between the input tensor and the output attribute tensor, so as to obtain the analysis parameters of the question.
6. The method according to claim 1, wherein the step S2 specifically comprises:
obtaining a reference analysis result by calculating the similarity between the question and the known questions in the question-answer labeling corpus according to the principle of best matching of question characters.
7. A question-answering system based on table semantic fastText question analysis, characterized by comprising:
a data module, used for rewriting the table into a chapter semantic text with a hierarchical structure, acquiring an analysis dictionary for question analysis, establishing a question-answer labeling corpus, and establishing a domain-specific fastText similarity calculation model, so as to build the data required by the question-answering method based on table semantic fastText question analysis;
a processing module, used for analyzing the question through the analysis dictionary and then semantically analyzing it according to the fastText similarity calculation model;
and an application module, used for acquiring the answer sentence through the question-answer labeling corpus according to the semantically analyzed question, and obtaining the complete answer sentence after assembly.
8. The system according to claim 7, wherein the data module specifically comprises:
a report knowledge base module, used for converting the table of the daily report into a text knowledge base with a hierarchical structure according to a daily report template, wherein the daily report template implements the splitting of the table structure, and the text knowledge base comprises four levels, namely words, constructed sentences, small table names and table names, which correspond respectively to the four levels of words, sentences, chapters and documents in natural language;
a dictionary module, used for acquiring the dictionaries for word-, chapter- and document-level analysis of questions;
a corpus module, used for collecting all known questions and the manually checked correct analysis results of the corresponding questions to form the question-answer labeling corpus, wherein the corpus is built up continuously in incremental form, the automatic analysis result of each new question is used as a reference, and the final correct analysis result is then obtained through manual checking; the labels are added to all associated analysis dictionaries at the same time;
a fastText module, used for converting the input question and the output table names, objects and attributes into data and tensors in a w2v manner, and then establishing a multi-level neural network mapping between the input tensor and the output attribute tensor, so as to obtain the analysis parameters of the question.
9. The system of claim 8, wherein the dictionary module specifically comprises:
a table name dictionary, used for document- and chapter-level analysis of questions, wherein the table name in the table name dictionary is designated by manual intervention according to service requirements, the table name is determined by the specific question, and the table name dictionary is updated according to the specific question;
a spoken dictionary, an object dictionary, an object attribute dictionary, an attribute value dictionary, a time dictionary and a query word dictionary, used for word-level analysis of questions, wherein:
the spoken dictionary is used for realizing the conversion from spoken words in questions to standard names;
the object dictionary corresponds to the index in a table and holds the semantic objects or objects with a hierarchy;
the object attribute dictionary collects the content of all transverse attributes in the table;
the attribute value dictionary collects the specific values of all transverse attributes in the table;
the time dictionary is used for describing a general time structure;
the query word dictionary is used for describing query words appearing in the question sentences and a method adopted for obtaining answers.
10. The system of claim 9, wherein the processing module is specifically configured to:
obtain a reference analysis result by calculating the similarity between the question and the known questions in the question-answer labeling corpus according to the principle of best matching of question characters.
CN202310196713.9A 2023-03-02 2023-03-02 Question and answer method and system based on table semantic fasttet question analysis Pending CN116226349A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310196713.9A CN116226349A (en) 2023-03-02 2023-03-02 Question and answer method and system based on table semantic fasttet question analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310196713.9A CN116226349A (en) 2023-03-02 2023-03-02 Question and answer method and system based on table semantic fasttet question analysis

Publications (1)

Publication Number Publication Date
CN116226349A true CN116226349A (en) 2023-06-06

Family

ID=86578334

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310196713.9A Pending CN116226349A (en) 2023-03-02 2023-03-02 Question and answer method and system based on table semantic fasttet question analysis

Country Status (1)

Country Link
CN (1) CN116226349A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination