CN116628172B - Dialogue method for multi-strategy fusion in government service field based on knowledge graph - Google Patents
- Publication number
- CN116628172B
- Authority
- CN
- China
- Prior art keywords
- government
- entity
- model
- knowledge
- service
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2468—Fuzzy queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/283—Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/01—Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to the technical field of government services and discloses a knowledge-graph-based multi-strategy fusion dialogue method for the government service field. The method fuses five strategies: a government service graph construction strategy, a multi-round question-answering strategy over the government service graph, an FAQ knowledge base question-answering strategy based on multi-stage word-level and semantic recall and ranking, an extractive reading-comprehension question-answering strategy with multi-document retrieval, and a local knowledge base strategy for a large language model (LLM) based on a trusted knowledge mechanism. A government service graph is constructed from government domain knowledge and the government service process system, so that government service processes are restructured and government services are optimized. The dialogue model structures are built according to the characteristics of government domain data, and knowledge from the government graph is incorporated during model training, which further improves model accuracy in the government domain. The invention answers government service queries more accurately, improving the user experience and optimizing government services.
Description
Technical Field
The invention relates to the technical field of government services, and in particular to a knowledge-graph-based dialogue method with multi-strategy fusion for the government service field.
Background
In the practice of promoting "Internet + government services", the Internet is the most important setting for innovation in how public services are delivered, and local exploratory innovation is a new driving force for reform. Under the "one-net office" model, the service items on existing government service websites are disordered and fragmented, their knowledge associations are complicated, and administrative approval items are cumbersome. Members of the public who have a question often need a long time to find a relevant answer, they are unfamiliar with the handling procedures, and their complaints about the so-called "questionable matters" problem are increasingly prominent. Meanwhile, government service provision suffers from unclear data ownership and a lack of overall coordination among departments. Building one large, all-encompassing database achieves only the physical concentration of information resources; in essence it merely turns scattered small information islands into one large information island that is even more disordered. Without a fundamental change supported by new ideas and technology, the problems of "one-net office" are difficult to solve effectively.
Existing government question-answering methods include word-based knowledge base question answering, knowledge base question answering based on semantic vector retrieval, and question answering based on general-purpose knowledge graphs. Existing question-answering algorithms cannot fully exploit the characteristics of government domain text and government domain knowledge when building models, and they lack the necessary government service knowledge fusion and reasoning capabilities. They do not combine a government service graph construction strategy, a multi-round question-answering strategy over the government service graph, an FAQ knowledge base strategy based on multi-stage word-level and semantic recall and ranking, an extractive reading-comprehension strategy with multi-document retrieval, and an LLM local knowledge base strategy based on a trusted knowledge mechanism. Each question-answering method on its own generalizes poorly across application scenarios, is not systematic, and cannot meet government service needs well. A scheme that makes up for the defects of the existing approaches is therefore needed.
In view of the above problems, the invention provides a knowledge-graph-based dialogue method with multi-strategy fusion for the government service field.
Disclosure of Invention
The invention aims to provide a knowledge-graph-based dialogue method with multi-strategy fusion for the government service field. The method fuses a government service graph construction strategy, a multi-round question-answering strategy over the government service graph, an FAQ knowledge base question-answering strategy based on multi-stage word-level and semantic recall and ranking, an extractive reading-comprehension question-answering strategy with multi-document retrieval, and an LLM local knowledge base strategy based on a trusted knowledge mechanism. A government service graph is constructed from government domain knowledge and government service processes, so that government service processes are restructured and government services are optimized. The AI model structures are built according to the characteristics of government domain data, and knowledge from the government graph is incorporated during model training, which further improves model accuracy in the government domain. Exact question answering for government services is carried out against the constructed government graph. When the exact query fails, a fuzzy query is performed with the FAQ knowledge base strategy, for which a knowledge base strategy based on multi-stage word-level and semantic recall and ranking is provided. For the problem of migrating FAQ data across scenarios, an extractive reading-comprehension question-answering strategy with multi-document retrieval is provided. For the problem of insufficient question-answer corpora in the government FAQ knowledge base, an LLM local knowledge base strategy based on a trusted knowledge mechanism is provided.
The knowledge-graph-based multi-strategy fusion dialogue method therefore changes the way government question answering is delivered and addresses the problems of "one-net office". It fits the characteristics of the government domain, is more systematic, answers government service queries with high accuracy, and improves the user experience while optimizing government services.
The invention is realized as follows. The invention provides a knowledge-graph-based dialogue method with multi-strategy fusion for the government service field:
S1, constructing a graph based on government services: constructing a government domain knowledge system and a government service process knowledge system, and standardizing government domain knowledge and government service processes;
S1.1, acquiring semi-structured and unstructured data: specifically, a high-quality site evaluation model evaluates data along two dimensions, quality grade and activity level, and produces a comprehensive score; the intelligent content extraction algorithm GNE and XPath-based parsing and extraction are applied to the parsing of government content; and a hierarchical storage scheme for government files is provided to guarantee the consistency and integrity of downloaded data;
S1.2, performing knowledge extraction, including structured data extraction: according to the constructed government domain ontology, triple data (entity-attribute-value and entity-relation-entity) are extracted from structured data sources in the government domain and stored in the graph. For a government service database, ontology mapping supports intelligent matching between the attributes of an ontology object and the fields of a data set, and the value conversion logic from the data set to graph data is determined by configuring mapping rules. An ontology relation can be configured by selecting either an object data set or a relation data set. For the problem of updating and storing structured data in the data warehouse, a zipper table solution is provided; the zipper table design allows historical states to be queried while reducing storage space;
Knowledge extraction is first performed on structured data, specifically as follows:
according to the constructed government domain ontology and government service process knowledge system, triple data (entity-attribute-value and entity-relation-entity) are extracted from structured data sources in the government domain and stored in the graph. For a government service database, ontology mapping supports intelligent matching between the attributes of an ontology object and the fields of a data set, and the value conversion logic from the data set to graph data is determined by configuring mapping rules. An ontology relation can be configured by selecting either an object data set or a relation data set. For the problem of updating and storing structured data in the data warehouse, a zipper table solution is provided; the zipper table design allows historical states to be queried while reducing storage space;
For a traditional government service database, ontology mapping supports intelligent matching between the attributes of an ontology object and the fields of a data set, and the value conversion logic from the data set to graph data is determined by configuring mapping rules. An ontology relation can be configured by selecting either an object data set or a relation data set. For structured data files, automatic ingestion can be realized with tools such as an Excel upload plug-in and a CSV upload plug-in. A user-defined logical view can serve as a data set, so that information from several tables can be merged into one table according to business needs and extracted in a customized way. These operations are offered through an intuitive and convenient interface, and merging the information of several tables makes query operations easier;
for the problem of updating and storing structured data in the data warehouse, a zipper table solution is provided. The zipper table design allows historical states to be queried while reducing storage usage, whereas directly overwriting data on update would make historical states impossible to query;
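The zipper table idea described above can be illustrated with a minimal in-memory sketch. The row layout (`key`, `value`, `start`, `end`), the open-ended sentinel date and the function names are assumptions made for illustration; the patent does not specify a schema.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # assumed sentinel meaning "this row is still current"

def apply_update(zipper, key, new_value, load_date):
    """Close the current row for `key` if its value changed, then open a new one."""
    for row in zipper:
        if row["key"] == key and row["end"] == OPEN_END:
            if row["value"] == new_value:
                return zipper          # unchanged: no new row, which saves storage
            row["end"] = load_date     # close the old version instead of overwriting it
            break
    zipper.append({"key": key, "value": new_value,
                   "start": load_date, "end": OPEN_END})
    return zipper

def as_of(zipper, key, when):
    """Return the value that was valid for `key` on date `when`, or None."""
    for row in zipper:
        if row["key"] == key and row["start"] <= when < row["end"]:
            return row["value"]
    return None

table = []
apply_update(table, "permit-001", "pending", date(2023, 1, 1))
apply_update(table, "permit-001", "pending", date(2023, 1, 2))   # no change, no new row
apply_update(table, "permit-001", "approved", date(2023, 1, 5))
```

Only changed records produce a new row, and `as_of` can still answer what the state was on any earlier date, which a direct overwrite could not.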
Semi-structured and unstructured data are then parsed and intelligent form filling is performed. Intelligent form filling is realized through OCR (optical character recognition) and extraction techniques; it automates routine document processing and improves the document-handling efficiency of business personnel. Files in doc, docx, pdf and chm format uploaded or downloaded by the user are first parsed to obtain the pictures, tables, text, titles and other data they contain. A doc file is first converted to docx, and docx files are parsed directly with python-docx, a third-party Python package. A chm file is first converted to pdf and then parsed as a pdf. Scanned pdf files are processed with OCR. The OCR process is divided into layout analysis, text detection and text recognition. Layout analysis analyzes the structure of the document and automatically extracts its structured information. Text detection marks each text region in the image with a four-point rectangular box. Text recognition converts the text regions selected in the previous step into characters. To meet document recognition requirements, the one-stage object detection algorithms of the YOLO series, the regression-based text detection algorithms CTPN and EAST, the segmentation-based text detection algorithms PSENet and DBNet, and the CRNN+CTC, Attention and SVTR text recognition algorithms are used. Element extraction combines extraction techniques with rules to extract elements automatically and thereby realize intelligent form filling. The element extraction task is divided into entity recognition and relation extraction. For extracted entities, entity linking is completed through entity alignment and entity disambiguation. Before entities are stored, they undergo model review and manual review. Knowledge tracing locates the source of the knowledge or results returned to government staff and users, improving their confidence in the knowledge and its interpretability. Relation extraction uses a multi-feature fusion relation extraction model: the entity type, the entity itself and the context information are fed to the model input, features are extracted with the BERT pre-trained model, pooling strategies are set dynamically according to entity length, and the model extracts the relation;
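The format-dependent parsing route described above can be summarized as a small dispatch sketch. The function name `plan_parsing`, the `scanned` flag and the step labels are hypothetical; the sketch only lists the steps and does not call any conversion, parsing or OCR library.

```python
from pathlib import Path

def plan_parsing(filename, scanned=False):
    """Return the ordered processing steps for a file, following the routing in the text."""
    ext = Path(filename).suffix.lower().lstrip(".")
    steps = []
    if ext == "doc":                      # doc is converted to docx first
        steps.append("convert doc to docx")
        ext = "docx"
    elif ext == "chm":                    # chm is converted to pdf first
        steps.append("convert chm to pdf")
        ext = "pdf"
    if ext == "docx":
        steps.append("parse docx")        # the source names python-docx for this step
    elif ext == "pdf":
        if scanned:                       # scanned pdf goes through the three OCR stages
            steps += ["layout analysis", "text detection", "text recognition"]
        else:
            steps.append("parse pdf")
    else:
        raise ValueError(f"unsupported format: {filename}")
    return steps
```

For example, `plan_parsing("notice.chm")` yields a conversion step followed by pdf parsing, while a scanned pdf is routed through layout analysis, text detection and text recognition.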
Entity linking is the task of linking a mention in the text to an entity in the knowledge base. It has two difficulties: entity alignment, where the same entity has different mention texts, and entity disambiguation, where the same mention text can refer to different entities.
For the case where the same entity has different mention texts, the solution is to train a SimCSE text-matching model on the established entity names and aliases and to store the trained entity word vectors online in a FAISS vector library. For an entity identified in the first pass, a vector is obtained through the encoder of the SimCSE model and used for vector retrieval in the FAISS library, and the entity word with the highest score is returned;
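The retrieval step can be sketched as follows. This is not the SimCSE or FAISS implementation: a toy character-bigram vector stands in for the trained encoder and a brute-force cosine scan stands in for the vector library, so the sketch runs with the standard library only. All function names and the English sample names are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for the SimCSE encoder: a character-bigram count vector."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(names):
    """Stand-in for loading entity vectors into the vector library."""
    return [(name, embed(name)) for name in names]

def best_match(index, mention):
    """Return the (entity name, score) pair with the highest similarity to the mention."""
    query = embed(mention)
    return max(((name, cosine(query, vec)) for name, vec in index), key=lambda p: p[1])

index = build_index(["business license", "residence permit", "social security card"])
name, score = best_match(index, "busines licence")
```

In a real deployment the encoder output would be a dense vector and the scan would be replaced by an index lookup; the highest-scoring entity word is taken as the aligned entity either way.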
Entity linking is then performed. For the case where the same mention text corresponds to different entities, an entity corrected by the first-round dictionary may correspond to several entities in the government graph; several candidate entity IDs are found by entity name, and entity disambiguation is performed. Entity disambiguation is treated as binary classification. During training, the entity actually linked to is used as the positive example, and two negative examples are selected from the candidate entities. The input text and the description text of the entity to be disambiguated are concatenated and fed into a BERT model. The output vector at the CLS position is concatenated with the feature vectors at the start and end positions of the candidate entity; the resulting concatenation of three vectors passes through a fully connected layer and a final sigmoid activation to give the probability score of the candidate. The scores of all candidates are ranked, and the candidate with the highest probability is selected as the correct entity;
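The scoring head described above (concatenate three vectors, apply a fully connected layer and a sigmoid, rank the candidates) can be sketched in plain Python. The vectors, weights and bias below are hand-made toy values rather than BERT outputs or trained parameters, and the function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def score_candidate(cls_vec, start_vec, end_vec, weights, bias):
    """Concatenate the three vectors and apply one linear layer plus sigmoid."""
    features = cls_vec + start_vec + end_vec      # list concatenation
    logit = sum(w * f for w, f in zip(weights, features)) + bias
    return sigmoid(logit)

def disambiguate(candidates, weights, bias):
    """Score every candidate entity ID and return the best ID with all scores."""
    scored = {cid: score_candidate(*vecs, weights, bias)
              for cid, vecs in candidates.items()}
    return max(scored, key=scored.get), scored

# Toy 2-dimensional vectors per candidate: (CLS, start position, end position).
candidates = {
    "E1": ([0.2, 0.1], [0.5, 0.3], [0.4, 0.2]),
    "E2": ([0.9, 0.8], [0.7, 0.6], [0.8, 0.9]),
}
best, scores = disambiguate(candidates, weights=[1.0] * 6, bias=-2.0)
```

Each candidate is scored independently as a yes/no decision, which matches the binary-classification framing; the final choice is simply the candidate with the largest probability.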
Model review and manual review of entities are then carried out. Entities are reviewed by a model to ensure that they meet business requirements. The check does not use context information and focuses on how the characters themselves are combined, so it is in essence a short-text binary classification problem. A short-text task does not need to capture long-distance dependencies, so a traditional RNN model is used, which balances performance and effectiveness well. The data set is constructed as follows. Positive samples are standard named entities that have passed manual review; label 1 is the positive label and indicates that the text is a named entity, with government entity names taken from the database. Label 0 is the negative label and indicates that the text is not a named entity. Non-entity samples are produced by reversing character strings, and the ratio of positive to negative samples is 1:1;
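The data set construction for the review model can be sketched as below. Skipping names that read the same in both directions is an assumption added here so that a string never receives both labels; the source does not say how such names are handled. The function name and sample names are illustrative.

```python
def build_review_dataset(entity_names):
    """Pair each reviewed entity name (label 1) with its reversed string (label 0)."""
    samples = []
    for name in entity_names:
        reversed_name = name[::-1]
        if reversed_name == name:
            continue  # assumed handling: a palindrome would get contradictory labels
        samples.append((name, 1))            # positive: a real named entity
        samples.append((reversed_name, 0))   # negative: same characters, invalid order
    return samples

dataset = build_review_dataset(["市场监督管理局", "居住证"])
```

Because every positive sample contributes exactly one negative sample, the 1:1 ratio stated in the text holds by construction.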
A human-machine collaboration mechanism is adopted. The triples (entity-attribute-value and entity-relation-entity) are identified by the algorithms above, and government staff can also intervene in the knowledge extraction process for manual review, giving a human-machine collaboration mechanism integrated with knowledge extraction. If data conflict during extraction, manually modified data take priority.
Knowledge tracing. Knowledge tracing locates the provenance of the knowledge or returned results in the questions and answers of government staff and users, improving their confidence in the knowledge and its interpretability. It allows knowledge to be traced quickly and accurately, so that source information can be grasped and relevant information confirmed promptly, reducing time, labor and material costs; this is significant for supporting decisions and problem solving by government staff.
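The priority rule for conflicting data can be sketched as a merge in which manual edits override model output. Treating two triples as conflicting when they share the same subject and predicate is an assumption made for this illustration, as are the function name and sample triples.

```python
def merge_triples(extracted, manual):
    """Merge model-extracted and manually edited triples; manual edits win on conflict."""
    merged = {(s, p): o for s, p, o in extracted}
    merged.update({(s, p): o for s, p, o in manual})   # manual data has higher priority
    return [(s, p, o) for (s, p), o in merged.items()]

extracted = [("居住证", "validity", "1 year"),
             ("居住证", "issuer", "public security bureau")]
manual = [("居住证", "validity", "2 years")]
result = merge_triples(extracted, manual)
```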
In the implementation of knowledge tracing, unique identification information is attached to the data through the knowledge graph so that it can be located quickly, and every node in the database has a source and a basis. When the knowledge graph is constructed, the input data are labeled with dimensions such as time and place. For example, when a piece of knowledge comes from a book, it is labeled with the book title, publisher, publication date and author, and the knowledge is traced back to its source through these labels. A document is uploaded to the system for entry; the information on the entities identified by OCR on the page is stored, and the document information corresponding to each entity is recorded in a Mongo database. Information about one entity attribute may appear in several documents, so one entity may correspond to several documents, and tracing an entity may return several documents. When a user searches for content, the document tracing function queries the Mongo database for the document information of all sources of the entity and ranks the documents using user ratings. Users can give star ratings to documents that have not been rated, and the ratings are stored in the user profile.
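The entity-to-documents tracing and rating-based ranking can be sketched with an in-memory store standing in for the Mongo database. The class name, field names and the choice to rank by mean star rating (with unrated documents last) are assumptions for illustration only.

```python
from collections import defaultdict

class ProvenanceStore:
    """In-memory stand-in for the document store that backs knowledge tracing."""

    def __init__(self):
        self._docs_by_entity = defaultdict(list)   # entity_id -> source documents
        self._ratings = defaultdict(list)          # doc_id -> list of star ratings

    def record(self, entity_id, doc):
        """Remember that `doc` is one of the sources of `entity_id`."""
        self._docs_by_entity[entity_id].append(doc)

    def rate(self, doc_id, stars):
        """Store one user's star rating for a document."""
        self._ratings[doc_id].append(stars)

    def trace(self, entity_id):
        """Return all source documents of an entity, best-rated first."""
        def mean_rating(doc):
            stars = self._ratings.get(doc["doc_id"], [])
            return sum(stars) / len(stars) if stars else 0.0   # unrated documents rank last
        return sorted(self._docs_by_entity.get(entity_id, []),
                      key=mean_rating, reverse=True)

store = ProvenanceStore()
store.record("E1", {"doc_id": "d1", "title": "Notice A"})
store.record("E1", {"doc_id": "d2", "title": "Notice B"})
store.record("E1", {"doc_id": "d3", "title": "Notice C"})
store.rate("d1", 3)
store.rate("d1", 4)
store.rate("d2", 5)
ranked_ids = [d["doc_id"] for d in store.trace("E1")]
```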
Entity recognition for semi-structured and unstructured data. Unstructured data are extracted with a joint approach. In the government domain, the BERT-CRF entity recognition subtask suffers from wrong entity boundaries and incomplete entities, and it does not use the information in the existing government knowledge base. To address this, a solution is provided that improves entity recognition accuracy using the description text in the knowledge base. First, an entity name dictionary is built from the entity names and alias information in the knowledge base. Vector embeddings of the entity names are then obtained by mining the description texts of the entities in the knowledge base. Candidate entities in the text are obtained by name dictionary matching, and finally the entity recognition model screens the results to complete entity recognition. The data preparation flow consists of building an alias dictionary from the entity names and alias information of the knowledge base, building entity description texts, and building mapping dictionaries, as follows:
Constructing the entity alias dictionary. The alias dictionary is built from the entity names and alias information in the government knowledge base. Government entity names may fail to match the entity library for three reasons: error one, special characters appear in the middle of the text; error two, the entity name in the input text is wrong; error three, the alias is not in the knowledge base. For error one, special symbols are normalized and the processed name is added to the aliases of the corresponding entity, for example by replacing all Chinese punctuation marks with English punctuation marks. For error three, the entity recognition model can solve the problem. For errors two and three, the number of times each knowledge base entity fails to match is counted; when the total number of times an entity in the training set fails to match is greater than 4 and the corresponding character string occurs more than 3 times, the character string is added to the aliases of the entity.
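The two alias-building rules above can be sketched as follows. The punctuation table covers only a few common marks, and `frequent_aliases` encodes one reading of the thresholds (more than 4 unmatched occurrences in total, and more than 3 occurrences of the specific string); both the reading and the function names are assumptions.

```python
from collections import Counter

# Partial, illustrative mapping from Chinese to English punctuation.
PUNCT_MAP = str.maketrans({"，": ",", "。": ".", "（": "(", "）": ")",
                           "：": ":", "；": ";", "！": "!", "？": "?"})

def normalize(name):
    """Replace Chinese punctuation marks with their English counterparts."""
    return name.translate(PUNCT_MAP)

def add_normalized_aliases(alias_dict):
    """Error one: add the punctuation-normalized form of each name as an alias."""
    for entity, aliases in alias_dict.items():
        for name in [entity, *aliases]:       # list copy, so the set can be updated safely
            cleaned = normalize(name)
            if cleaned != entity:
                aliases.add(cleaned)
    return alias_dict

def frequent_aliases(unmatched_strings, min_total=4, min_count=3):
    """Errors two and three: promote strings that fail to match often enough."""
    total = sum(unmatched_strings.values())
    if total <= min_total:
        return []
    return sorted(s for s, n in unmatched_strings.items() if n > min_count)

aliases = add_normalized_aliases({"居住证（临时）": set()})
promoted = frequent_aliases(Counter({"暂住证": 4, "居住": 2}))
```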
constructing the entity description texts and the mapping dictionaries: the entity description text is obtained by splicing the entity-attribute-value and entity-relation-entity triple data in the constructed government graph. Mapping dictionaries are built for use by the later models; the common dictionaries map entity names to entity id lists, entity ids to entity names, entity ids to entity description texts, entity ids to entity types, and entity types to entity ids;
an entity name dictionary is constructed from the entity names and entity alias information of the knowledge base, and entity description texts are constructed from the knowledge base. Using a BERT pre-training model, the vector output at the CLS position is taken as the vector embedding of the entity name; candidate entities in a short text are obtained by dictionary matching; and finally the matched results are screened by the constructed named entity recognition model. The construction flow is as follows:
a dictionary tree is combined with forward maximum matching of entities: entity names are inserted into the dictionary tree, and entities in the text are matched using forward maximum matching. The entity library contains many single-character entities, which would produce a large number of matching results, so single-character entities are not inserted. Maximum matching can also yield overlapping or repeated entities; the occurrence counts of such entities are tallied, and the handling is decided according to the counts: keep the largest entity, keep the smallest entity, or keep all entities. Matching is performed by maximum matching, and only the entities that need to be split are split after matching is complete.
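Dictionary-tree matching with forward maximum matching can be sketched as below. This is a minimal illustration under stated assumptions: the class and function names are hypothetical, the entity names are made-up examples, and the post-processing by occurrence counts is omitted.

```python
class EntityTrie:
    """Dictionary tree of entity names; single-character names are skipped."""

    END = "$end"  # multi-character marker, cannot collide with a one-character key

    def __init__(self):
        self.root = {}

    def insert(self, name):
        if len(name) < 2:  # single-character entities are not inserted
            return
        node = self.root
        for ch in name:
            node = node.setdefault(ch, {})
        node[self.END] = True

    def longest_match(self, text, start):
        """Return the end index of the longest entity starting at start, or -1."""
        node, end = self.root, -1
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if self.END in node:
                end = i + 1
        return end


def forward_max_match(text, trie):
    """Scan left to right, always taking the longest entity at each position."""
    matches, i = [], 0
    while i < len(text):
        end = trie.longest_match(text, i)
        if end == -1:
            i += 1
        else:
            matches.append((i, end, text[i:end]))
            i = end
    return matches


trie = EntityTrie()
for entity in ["身份证", "居民身份证", "户口", "证"]:
    trie.insert(entity)
result = forward_max_match("办理居民身份证需要户口本", trie)
print(result)
```

Because the longer name "居民身份证" is preferred over the nested "身份证", the candidate boundaries always coincide with an entry of the dictionary.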
To perform binary classification on the matched entities, each entity name must be represented by a vector. Because BERT is used in the subsequent model, the embedding of the entity name is also obtained with BERT: the entity description text of the knowledge base is fed to a BERT pre-training model, and the vector output at the CLS position is taken as the vector embedding of the entity name. To construct the training data, candidate entities in the text are obtained by dictionary matching with the maximum matching algorithm and are given the corresponding labels.
The matched results are screened by the constructed entity recognition model: the government text passes through a BERT layer and a bidirectional LSTM, the output is spliced with the embedding corresponding to the entity name, and the prediction is made through convolution and fully connected layers. Because the model is based on dictionary matching, every candidate entity it returns can be found in the knowledge base, and no boundary errors occur. The model removes single-character entities during dictionary matching, and the single-character entities are predicted by the BERT-CRF model.
S1.3, performing knowledge extraction and intelligent form filling on semi-structured and unstructured data through an OCR (optical character recognition) method and an extraction method. A relation extraction model based on multi-feature fusion is used: the entity types, the entities and the context information are input at the input end of the model, and feature extraction is performed by a BERT pre-training model. A disambiguation model resolves the problem that the same mention text corresponds to different entities during entity linking; graph entity information is fused into the disambiguation model to improve its accuracy. Entities are audited both by the model and manually, ensuring that the entity-attribute-value and entity-relation-entity triples meet the service requirements;
S1.4, performing government ontology construction, constructing a government domain knowledge system and a government service flow knowledge system, specifically comprising: defining the domain and scope of the government ontology, collecting government concepts and data resources, establishing the reusability of existing ontologies, analyzing and expressing the ontology, constructing the ontology, integrating and instantiating the ontology, and evaluating and verifying the ontology;
the method comprises the following steps:
S2.1, defining the domain and scope of the government ontology: determining the business function domain corresponding to the government field, the purpose of the ontology, the information content it describes, and the government objects that use and maintain the ontology;
S2.2, collecting government concepts and data resources: collecting structured data such as government service databases, together with semi-structured and unstructured data from documents in the various government fields and from government service networks, for example government manuals and web page data of government service websites, and performing consistency processing on data that does not meet the standard;
establishing the reusability of existing ontologies: existing government ontologies are analyzed and refined to improve their reusability; analyzing and expressing the ontology: for existing ontologies that cannot be reused, text information such as government manuals, government documents and web page data of government service websites is extracted, and the core concepts, concept attributes and relations among concepts are extracted from it;
constructing the ontology: first, the inheritance relationships of classes are defined by a top-down method, that is, starting from the most basic concepts in the government field and refining layer by layer; integrating and instantiating the ontology: the government ontologies are integrated, and a consistency reconciliation method is used to redefine and semantically process them so that data sharing and fusion are not affected; after the ontology is confirmed, data is extracted for instantiation;
through the above steps, a preliminary government domain knowledge ontology is established; it is then evaluated and verified in terms of correctness, consistency, extensibility, effectiveness, scale and descriptive capacity through ontology evaluation and verification, multi-party investigation and the invited participation of domain experts.
S2, performing accurate question answering for government services based on the constructed government service graph.
Further, the evaluation and verification are specifically performed as follows:
S3.1, evaluating and verifying the basic ontology of the government field, which represents general knowledge concepts and contains no service-domain characteristics; feature modeling is performed on structured and unstructured data such as texts and databases, where the text ontology mainly describes attributes such as file format, file size and keywords;
S3.2, evaluating and verifying the government domain knowledge ontology and the government service flow knowledge system. Based on analysis of government corpora such as government documents and government news, five categories are planned: personnel role, policy document, news, comprehensive government role and administrative reply, together with the corresponding triples (entity-attribute-value and entity-relation-entity) under each category. With the content of official documents as the core, relations are defined with a view to improving the efficiency of subsequent search and question answering by government staff, covering the entities contained in the five categories of documents. The specific relations include personnel role authority relations, institution-to-personnel role name relations and institution-to-former-employee role name relations;
S3.2, for government service flow analysis, a government service flow knowledge system is constructed. The main concepts of the service flow comprise government matters, certificate materials, laws and regulations, administrative departments, service objects, administrative regions, matter subjects and authority levels, of which the government matter is the core concept. Each type of entity has its own attribute characteristics, and semantic relations exist among the entities; together, the attribute characteristics and the semantic relations cover essentially all information about government matters and meet the public's needs for government consultation question answering;
S3.3, the core attribute characteristics of the government matter entity are: right source, authority type, exercise level, handling type, service object and field, legal time limit, promised time limit, setting basis, organization nature, application conditions and application materials. Semantic relations are defined with a view to improving government service efficiency: the relation between a government matter and an authority level is "exercise level"; the relation between a government matter and certificate materials is "handling materials"; the relations between a government matter and an administrative department are "supervising body" and "joint handling body"; the relations between a government matter and laws and regulations are "setting basis" and "implementation basis"; and the relation between a government matter and a service object is "service subject".
Further, in step S1.1, a high-quality site evaluation model is provided for semi-structured and unstructured data acquisition; sites are evaluated along two dimensions, quality level and activity level, and given a comprehensive score. For government content parsing, the intelligent content extraction GNE algorithm and XPath parsing extraction are used, and government files are downloaded and stored. To ensure consistency and integrity of the data, a hierarchical storage scheme is provided. The steps are specifically as follows:
First, initial high-quality government website sites are selected and intelligent sniffing is performed along network links; through quality evaluation and content link relations, new high-quality sites are discovered in a fission-like manner and are automatically crawled and stored, without secondary collection and storage of content. The high-quality site evaluation model comprises a quality level scoring algorithm and an activity level scoring algorithm. The quality level scoring algorithm determines the relevance of a website to the government field by applying binary classification prediction and normalization to the information the website publishes, which helps filter out websites of poor relevance. The activity level scoring algorithm performs a normalized weighted summation over three dimensions of the website's pushed messages, including the number of pushed messages and their release time, to obtain a quantifiable activity level, which helps configure an optimized crawl cycle strategy. The websites discovered from the initial high-quality government websites are passed through the evaluation model based on these scoring algorithms, finally yielding the quality level and activity level of each new government website. Site configuration, scheduling and task monitoring functions solve the difficulty of managing a large number of crawlers; multi-modal data in the government field is crawled, and value is mined from the massive data in combination with the capabilities of the government knowledge graph product.
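The normalized weighted summation behind the activity level can be sketched as follows. This is only an illustration: the text does not give the exact three dimensions, the normalization or the weights, so the min-max normalization, the feature tuple and the weights (0.4, 0.3, 0.3) are all assumptions.

```python
def min_max(values):
    """Min-max normalize a list of numbers into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def activity_scores(sites, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of per-dimension normalized features for each site.

    sites maps a site name to a tuple of three raw features, for example
    (messages pushed, messages pushed recently, recency of the last push).
    Both the features and the weights are hypothetical.
    """
    names = list(sites)
    columns = list(zip(*(sites[name] for name in names)))
    normalized = [min_max(list(column)) for column in columns]
    return {
        name: sum(w * normalized[d][i] for d, w in enumerate(weights))
        for i, name in enumerate(names)
    }


scores = activity_scores({
    "site_a": (120, 30, 0.9),
    "site_b": (40, 5, 0.2),
    "site_c": (0, 0, 0.0),
})
print(scores)
```

A site with a higher score would be scheduled with a shorter crawl cycle, which is how the activity level feeds into the crawl strategy described above.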
A hierarchical storage scheme is provided. Related information about files is acquired by crawlers and stored in MongoDB. The file download services are managed through ZooKeeper, and download tasks are distributed evenly across the services according to a partition allocation algorithm. Each service manages its threads through a thread pool and periodically checks whether the blocking queue of the current thread pool is less than half of the set value, triggering the file download method if the blocking queue is not less than half of the set value. During file download, the data of the corresponding partition is first read from MongoDB, and it is judged whether the address is downloadable and whether the file address already exists in Redis; the address is then requested, the file stream is obtained, the MD5 value is calculated, the OSS address is spliced, the file is stored to OSS, and a record is written to HBase and MongoDB. If a web page is required, a PDF is generated and stored. The file information is stored afterwards and kept consistent: successfully processed files are recorded in HBase and MongoDB, and the file information is consulted when a file is not available.
Further, the intelligent discovery method for high-quality sites for semi-structured and unstructured data comprises the following specific steps:
S5.1: GNE is an extraction algorithm based on web page text density and symbol density. After the HTML text of the web page is obtained, a DOM tree is generated through Jso analysis; the DOM tree is then preprocessed, for example by removing JS scripts and CSS styles; the text density and the symbol density are computed by the text density algorithm and the symbol density algorithm respectively; a score is calculated from these two values, and the highest-scoring content is taken as the main text of the web page.
Further, in step S2, accurate question answering for government services is performed based on the constructed government service graph; fuzzy query is performed based on the FAQ knowledge base strategy, for which an FAQ knowledge base strategy based on word-level and semantic multi-stage recall and ranking is provided; for the scene migration problem of the FAQ knowledge base, an extractive reading comprehension question-answering strategy with multi-document retrieval is provided; and for the problem of insufficient government-domain dialogue corpus in the FAQ knowledge base, a local knowledge base strategy for an LLM (large language model) based on a trusted knowledge mechanism is provided. The method comprises the following steps:
S9.1, a government service dialogue service is constructed using the multi-round question-answering strategy over the government service graph, and the user identifier and message content are passed to the main logic service. The top-level intention recognition module classifies the top-level intention into chat intention and government intention; because the corpus is well separated, a traditional GBDT-LR machine learning model is used, which has few parameters while still guaranteeing accuracy. The chat intentions are divided into greeting, acceptance, rejection and clarification intentions; the government intentions are divided according to the needs of government services. Subdivided government intentions and entities are extracted with a DIET joint extraction model, and government service dialogue management is completed based on intention inheritance and slot inheritance. The specific steps are as follows:
S9.2, establishing slot semantic templates. An intention strategy is designed according to the entities, relations and attributes constructed in the ontology, ensuring that each intention is bound to its slot list information. With the government matter as the core entity, intentions are constructed around the associated triples (entity-attribute-value and entity-relation-entity): the relation between a government matter and an authority level is "exercise level"; the relation between a government matter and certificate materials is "handling materials"; the relations between a government matter and an administrative department are "supervising body" and "joint handling body"; the relations between a government matter and laws and regulations are "setting basis" and "implementation basis"; and the relation between a government matter and a service object is "service subject". For each intention, a reply template, a rejection template, a clarification template and a Cypher template are set. The Cypher template is filled dynamically with the data obtained from intention and entity extraction, and depending on how the Cypher template is configured for the business requirement, several combined records may be returned at once. Global template data is also set, including an intention-unrecognized template and a rejection template. Intention thresholds are set: above a certain threshold, the intention is confirmed; within a certain range, the intention is clarified; below a certain threshold, the FAQ knowledge base question-answering strategy is entered;
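The threshold-based routing and the binding of an intention to its slots and Cypher template can be sketched as below. The threshold values, the intention name, the node label and the relationship type are all hypothetical; the text only states that three confidence ranges exist and that each intention has a Cypher template.

```python
# Hypothetical thresholds; the text does not give numeric values.
CONFIRM_THRESHOLD = 0.8
CLARIFY_THRESHOLD = 0.5

# Hypothetical intention bound to a slot list and a parameterized Cypher template.
INTENT_TEMPLATES = {
    "handling_materials": {
        "slots": ["matter"],
        "cypher": (
            "MATCH (m:Matter {name: $matter})-[:HANDLING_MATERIAL]->(c) "
            "RETURN c.name"
        ),
    },
}


def route_intent(confidence):
    """Map an intention confidence to the next dialogue action."""
    if confidence >= CONFIRM_THRESHOLD:
        return "confirm"
    if confidence >= CLARIFY_THRESHOLD:
        return "clarify"
    return "faq"


def build_query(intent, slot_values):
    """Return the Cypher text and parameters, or None if a slot is missing."""
    template = INTENT_TEMPLATES[intent]
    if any(slot not in slot_values for slot in template["slots"]):
        return None
    params = {slot: slot_values[slot] for slot in template["slots"]}
    return template["cypher"], params


query = build_query("handling_materials", {"matter": "居民身份证申领"})
print(route_intent(0.65), query)
```

Passing slot values as query parameters instead of formatting them into the Cypher string keeps user input out of the query text.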
S9.3, joint extraction of intention and entities based on the DIET joint extraction model. Intention recognition with the conventional pre-trained BERT-TextCNN model and entity recognition with the BERT-CRF model are slow in both training and inference; moreover, entity recognition and intention extraction are strongly correlated in subdivided government scenes, and performing them as two-stage tasks accumulates errors. Joint extraction of intention and entities in subdivided government service scenes based on the DIET joint extraction model is therefore proposed. Once the intention and the slots are obtained, they are bound together: against the semantic slot template, it is judged whether the type of each recognized entity matches a slot type under the corresponding intention; if so, slot filling is completed, and if not, a no-answer result is returned. Entity linking ensures that the recognized entities can be linked to entities in the graph database. Where slots are missing, the slots of the previous round are inherited to complete them; where the previous round had a clear intention and the intention strength of the current round is low, the intention of the previous round is inherited and the related slots are extracted. The entities extracted by the model are merged with the results of dictionary matching. Finally, entities irrelevant to the service are filtered out by rules, and the results are de-duplicated and returned in the required entity format;
S9.4, government service dialogue management: multi-round government service dialogue is completed based on intention inheritance and slot inheritance. Slot inheritance: based on the user input and the unique user identifier, it is judged whether the current input contains an entity; if it does not, the entity of the previous round is inherited. Intention inheritance: if the previous round had a clear intention and the intention strength of the current round is below a certain threshold, the intention of the previous round is inherited within the slots related to that intention. Intentions below a certain threshold require a clarification and confirmation step. To support multi-round QA, a multi-round state machine is designed and maintained and the DST result of each round is stored. A multi-round window strategy is set: the round window is limited to 5 rounds and the time window to 5 minutes. History information is stored in Redis with an expiration time, so that context information can be retrieved quickly during the conversation. For the same intention in dialogue state management, only the most recent entry is kept;
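The slot and intention inheritance with a 5-round, 5-minute window can be sketched as follows. To stay self-contained the sketch keeps history in an in-process dictionary instead of Redis, and the class name, the state layout and the 0.5 intention threshold are assumptions; only the window sizes come from the text.

```python
import time
from collections import deque

MAX_ROUNDS = 5          # round window from the text
TTL_SECONDS = 5 * 60    # time window from the text


class DialogueStore:
    """In-memory stand-in for the Redis-backed history (an assumption)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._history = {}

    def _valid_rounds(self, user_id):
        now = self._clock()
        rounds = self._history.get(user_id, ())
        return [r for r in rounds if now - r["ts"] <= TTL_SECONDS]

    def update(self, user_id, intent, entities, intent_score, threshold=0.5):
        """Store the current round, inheriting from the previous valid round."""
        previous = self._valid_rounds(user_id)
        last = previous[-1] if previous else None
        if last is not None:
            if not entities:                 # slot inheritance
                entities = last["entities"]
            if intent_score < threshold:     # intention inheritance
                intent = last["intent"]
        state = {"intent": intent, "entities": list(entities), "ts": self._clock()}
        rounds = self._history.setdefault(user_id, deque(maxlen=MAX_ROUNDS))
        rounds.append(state)
        return state


fake_now = [0.0]
store = DialogueStore(clock=lambda: fake_now[0])
first = store.update("u1", "handling_materials", ["居民身份证申领"], 0.9)
fake_now[0] = 60.0
second = store.update("u1", "unknown", [], 0.2)
fake_now[0] = 1000.0
third = store.update("u1", "unknown", [], 0.2)
print(second["intent"], third["intent"])
```

The injectable clock exists only so the expiry behaviour can be exercised without waiting; with Redis the same effect would come from a key expiration time.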
S9.5, processing the returned result: the query statement is converted and the database is queried. Different Cypher templates are executed according to the intention strategy, the graph database is queried, the answer template is filled with the entities and the answer is returned; if no data is found, the FAQ dialogue strategy is entered.
FAQ knowledge base question-answering strategy based on word-level and semantic multi-stage recall and ranking: accurate question answering for government services is first performed based on the constructed government service graph; when the accurate query fails, fuzzy query is performed based on the FAQ knowledge base strategy, for which an FAQ knowledge base question-answering strategy based on word-level and semantic multi-stage recall and ranking is provided. The method comprises the following steps:
S9.6, the data preprocessing module performs data cleaning and filters invalid samples, constructs the training set, validation set and test set, and uses new word discovery to extend the custom domain dictionary so as to improve word segmentation accuracy.
S9.7, the recall module: recall is divided into word-level recall and semantic recall. For the failure of TF-IDF word frequency on short texts in word-level recall, a solution based on Word2Vec word co-occurrence distance is provided; for the embedding collapse of the BERT pre-training model in matching, a SimCSE model based on contrastive learning is applied to further improve model accuracy;
S9.8, word-level recall. For the failure of TF-IDF word frequency on short texts in word-level recall, a solution based on Word2Vec word co-occurrence distance is provided. The weight of a word is closely tied to the semantic distance between its word vector and those of the other words, so non-core words can be identified by outlier calculation. Word2Vec is trained on the full corpus to obtain a word vector for each word; the training corpus comprises the question-answer pairs and is expanded with a homonym corpus. If the vector of a word is close to the vectors of the other words, its computed score is relatively high; a word that is not a core word may occur very frequently but lies far from the other words, so its computed score is low and it can be filtered out as a non-core word. The word distance is computed with cosine similarity, that is, from the inner product of the word vectors: the closer two word vectors are in the vector space, the higher the cosine similarity. For each word, the distances to the other words are accumulated into a total distance, from which the weight of the word is obtained.
Hundreds of candidate question-answer pairs that may satisfy the question are found from the million-scale corpus, and 50 pairs are recalled by the recall model, so that the subsequent ranking task can use a more complex model;
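The word weighting of S9.8 can be sketched as below: each word is scored by its average cosine similarity to the other words of the question, so a word that sits far from the rest receives a low weight. The toy two-dimensional vectors stand in for trained Word2Vec vectors, and the use of the mean (rather than another aggregate of the distances) is an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def word_weights(words, vectors):
    """Weight each word by its mean cosine similarity to the other words."""
    weights = {}
    for word in words:
        others = [o for o in words if o != word]
        if not others:
            weights[word] = 1.0
            continue
        total = sum(cosine(vectors[word], vectors[o]) for o in others)
        weights[word] = total / len(others)
    return weights


# Toy vectors, purely illustrative; real vectors would come from Word2Vec.
toy_vectors = {"社保": (1.0, 0.0), "缴费": (0.9, 0.1), "请问": (0.0, 1.0)}
weights = word_weights(["请问", "社保", "缴费"], toy_vectors)
print(weights)
```

In this toy question the filler word "请问" ends up with the lowest weight even though such words are frequent, which is the failure of raw word frequency that the step is meant to address.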
S9.9, semantic recall, based on question-question semantic similarity calculation. For the embedding collapse of the BERT pre-training model in matching, a SimCSE model based on contrastive learning is applied to further improve the accuracy of the model. Positive examples are constructed by exploiting the random dropout of BERT: the same text is passed through the BERT encoder twice to obtain two different sentence vectors, which form a similar text pair. Negative examples are constructed by randomly sampling other samples in the same batch. For online service in the question-question scenario, both sides of the dual tower lie in the same semantic space, so only one tower needs to be kept; the trained vectors are computed offline and stored in Faiss. The online service obtains the question vector through the network and queries Faiss for vector retrieval, returning the most similar indexed vectors; this avoids large-scale real-time computation online, and the 50 most similar question-answer pairs are recalled from the million-scale corpus;
S9.10, ranking by question-question similarity. Given the characteristics of government corpora, the ranking considers both the question-question similarity score and whether the sentences contain government entity names, combining the two by weighted summation to further improve the accuracy of the ranking result. The word-level recall and semantic recall results are merged and de-duplicated; government-related entities in the query are identified; the intersection-over-union between the entities in the query and the entities in each recalled question is calculated; the sentence-level semantic matching similarity between the query and each recalled question is calculated with the contrastive-learning SimCSE model; and the candidates are ranked by weighted score, with a weight of 0.8 for semantic matching and 0.2 for entity intersection-over-union. Finally, the top 50 are returned;
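The weighted ranking of S9.10 can be sketched as follows. The weights 0.8 and 0.2 and the top-50 cut-off come from the text; the candidate layout, the function names and the precomputed semantic scores (which would come from the SimCSE model) are assumptions.

```python
def entity_iou(query_entities, candidate_entities):
    """Intersection-over-union of two entity collections."""
    a, b = set(query_entities), set(candidate_entities)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(query_entities, candidates, w_sem=0.8, w_ent=0.2, top_k=50):
    """Sort recalled questions by 0.8 * semantic score + 0.2 * entity IoU."""
    scored = []
    for cand in candidates:
        score = (w_sem * cand["semantic"]
                 + w_ent * entity_iou(query_entities, cand["entities"]))
        scored.append((score, cand["question"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


recalled = [
    {"question": "如何办理居住证", "semantic": 0.90, "entities": ["居住证"]},
    {"question": "社保卡怎么补办", "semantic": 0.80, "entities": ["社保卡"]},
]
ranking = rank_candidates(["社保卡"], recalled)
print(ranking)
```

In this example the candidate that shares the query entity overtakes one with a slightly higher semantic score, which is the effect the entity term is meant to have.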
S9.11, calculating question-answer similarity. To address low-quality question-answer pairs and answers that do not address the question, and to further improve the final matching effect, the best answer is selected from the 50 pairs, or, to add diversity, one of the top-k question-answer pairs is returned to the user at random. Since the question-question similarity calculation module finally recalls only 50 corpus pairs, a more complex model can be used here, not limited to the dual tower. The dual tower has the drawback that its two sides do not interact fully: interaction occurs only at the last layer, which places an upper limit on the effect. The model is constructed as follows: a question and an answer are paired and input into a BERT network, which outputs a number between 0 and 1 representing the degree of match between the question and the answer. BERT lets the question and the answer interact repeatedly across its layers and learn more, which improves the accuracy of the model;
S9.12, for FAQ multi-scene migration, an extractive reading comprehension question-answering strategy with multi-document retrieval is provided, comprising multi-document coarse recall based on the BM25 algorithm, multi-task fine recall based on a BERT-MRC model, and multi-document ranking and answer return;
S9.13, multi-document coarse recall based on the BM25 algorithm. The 50 documents most relevant to each question are recalled, which avoids large-scale online computation. To avoid paragraphs being truncated during training, the maximum paragraph length is set to 400, so that the combined length of the question and the paragraph does not exceed the maximum input length of BERT, 512; model training can then fully benefit from sentence-level segmentation and is not affected by truncated sentences. When training data is generated, a dynamic negative sampling method is used to address the imbalance of positive and negative samples in the reading comprehension task. Each question corresponds to a positive sample set and a negative sample set: positive samples are the fragments of the question's corresponding documents that contain the answer; after all documents are segmented, the 5 fragments with the highest BM25 scores that do not contain the answer are taken as candidate negative samples. When a batch is generated, for each positive sample one negative sample is drawn at random from the candidate negatives. This keeps the positive and negative samples of the re-ranking (sorting) task balanced and makes it likely that the negative samples differ from epoch to epoch. In addition, hard negative sampling is used: negative samples are scored with the trained model, and those with a high degree of interference are added to the training set for retraining, so that the model better distinguishes positive from negative samples.
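The dynamic negative sampling described in S9.13 can be sketched as below. The pool size of 5 and the one-negative-per-positive rule come from the text; the function names, the substring test for "contains the answer" and the precomputed BM25 scores are assumptions.

```python
import random


def build_negative_pool(fragments, bm25_scores, answer, pool_size=5):
    """Keep the highest-scoring fragments that do not contain the answer."""
    ranked = sorted(zip(bm25_scores, fragments), key=lambda p: p[0], reverse=True)
    negatives = [frag for _, frag in ranked if answer not in frag]
    return negatives[:pool_size]


def make_batch(positives, negative_pool, rng):
    """Pair every positive with one randomly drawn negative (labels 1 and 0)."""
    batch = []
    for pos in positives:
        batch.append((pos, 1))
        if negative_pool:
            batch.append((rng.choice(negative_pool), 0))
    return batch


fragments = [f"fragment {i}" for i in range(8)] + ["answer: 15 working days"]
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.6, 0.5, 1.0]
pool = build_negative_pool(fragments, scores, "15 working days")
batch = make_batch(["answer: 15 working days"], pool, random.Random(0))
print(pool, batch)
```

Because the negative is redrawn every time a batch is built, successive epochs tend to see different negatives while the positive to negative ratio stays at one to one.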
S9.14, multi-task fine recall based on the BERT-MRC model. The extraction model is trained with a multi-task method, taking the content of the question and the paragraph as the model input: a detection task judges whether an answer exists, and an extraction task extracts the answer to the question; multi-task training improves the overall effect of the model. The first task predicts whether the paragraph contains the answer, that is, the relevance of the paragraph to the question; the second task predicts the start and end positions of the answer, which is the goal of answer extraction itself. Because the two tasks converge at different rates during training, their loss weights are adjusted so that both converge together. The loss function is 0.01 × MRC_loss + 0.99 × CLS_loss: the model first ensures that an answer exists and then extracts it, so that it achieves a good convergence effect;
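The weighted multi-task loss of S9.14 can be written out as a small sketch. The weights 0.01 and 0.99 come from the text; computing each term as a cross-entropy and averaging the start and end terms into MRC_loss are assumptions, since the text does not define the individual losses.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target class."""
    return -math.log(probs[target_index])


def multitask_loss(cls_probs, cls_label, start_probs, start_idx,
                   end_probs, end_idx, w_mrc=0.01, w_cls=0.99):
    """0.01 * MRC_loss + 0.99 * CLS_loss, as stated in the text."""
    cls_loss = cross_entropy(cls_probs, cls_label)
    mrc_loss = 0.5 * (cross_entropy(start_probs, start_idx)
                      + cross_entropy(end_probs, end_idx))
    return w_mrc * mrc_loss + w_cls * cls_loss


loss = multitask_loss(
    cls_probs=[0.5, 0.5], cls_label=1,           # "paragraph contains an answer"
    start_probs=[0.25, 0.25, 0.5], start_idx=2,  # predicted answer start
    end_probs=[0.25, 0.25, 0.5], end_idx=2,      # predicted answer end
)
print(loss)
```

With these weights the classification term dominates the gradient, which matches the stated priority of first deciding whether an answer exists.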
S9.15, multi-document ranking and answer return. First, it is judged whether the classification probability values of the label, the start and the end are below a certain threshold; if they are below the threshold, the local government knowledge base strategy of the LLM based on a trusted knowledge mechanism is entered. Otherwise, a document score is calculated for each document that has an answer according to the ranking rule, and the paragraph with the highest ranking score is returned as the reading comprehension answer;
S9.16, local knowledge base strategy of the LLM large language model based on a trusted knowledge mechanism: to address the shortage of dialogue corpus for the FAQ knowledge base in the government field, a local government knowledge base strategy of the LLM large language model based on a trusted knowledge mechanism is provided. When an LLM is applied inside an enterprise, it must be supplied with trusted knowledge to be effective; trusted knowledge is the key to making generative artificial intelligence truly usable in an enterprise environment. Without a stable enterprise knowledge base behind it, generative artificial intelligence tends to return convincing but wrong answers, and because the resulting text is coherent, such answers are hard to verify. For knowledge workers with strict requirements on answer correctness to trust the system, it must be rooted in the enterprise's trusted knowledge mechanism, let users see where an answer comes from, and comply with the enterprise's data management policy. First, a government field knowledge base is built offline, in which each knowledge item has two string fields: the question title and the answer text. The government knowledge is vectorized with the vectorization model BERT, and the vectors are stored in the vector database Faiss. For a user question, the question is vectorized with the same BERT model and the government vector database is queried to obtain the top-n matching results; each search result contains a vector and a payload, and the payload contains the title and the text. Because the directly matched top-n results are highly redundant, an MMR-based matching strategy is used to obtain the top-n data, further improving the accuracy and diversity of the matching results;
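A minimal sketch of MMR (maximal marginal relevance) re-selection over retrieved vectors is given below, assuming cosine similarity and a trade-off parameter `lambda_weight`; neither the similarity measure nor the parameter value is specified in the text, and the toy two-dimensional vectors stand in for BERT embeddings returned by the vector database.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mmr_select(query_vec, candidates, top_n=3, lambda_weight=0.5):
    """candidates: list of (vector, payload); returns payloads in selection order."""
    selected, remaining = [], list(range(len(candidates)))
    while remaining and len(selected) < top_n:
        def mmr_score(i):
            relevance = cosine(query_vec, candidates[i][0])
            # Penalize similarity to anything already selected.
            redundancy = max(
                (cosine(candidates[i][0], candidates[j][0]) for j in selected),
                default=0.0)
            return lambda_weight * relevance - (1.0 - lambda_weight) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [candidates[i][1] for i in selected]

query = [1.0, 1.0]
hits = [([1.0, 0.9], "a"), ([1.0, 0.88], "a_near_duplicate"), ([0.3, 1.0], "c")]
diverse = mmr_select(query, hits, top_n=2, lambda_weight=0.5)
plain = mmr_select(query, hits, top_n=2, lambda_weight=1.0)
```

With `lambda_weight=1.0` the selection reduces to plain relevance ranking and keeps the near-duplicate; lowering it trades a little relevance for a more varied result set.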
S10.1: a Prompt is constructed so that the LLM better understands the task it is to perform; the Prompt comprises a prefix_prompt, the user's question, and the payload. For the prefix_prompt, the trusted knowledge mechanism obtains information related to the content and titles from the government knowledge graph and obtains the role information and activity information of the logged-in user, so that the generated content better meets the requirements of the role in the specific government field and the direction in which the LLM large language model generates government content is better specified. A semantic network is established among the elements of the three pillars (content, people and activities), capturing the context and relations among the staff in an organization, the users, the work they do and the work results, so that the results returned from interaction with the LLM have higher credibility. For the government field, the following prompt fragment is added to the prefix_prompt of each piece of data: if the paragraph content is not in the government field, return that no relevant information was found, which tells the LLM not to answer questions outside the government field. An index number is assigned to each question to distinguish the questions, and the content comprises the title and the text of the payload matched to the user question. Because the length of the prompt is limited, only the first 300 characters of each matched abstract are taken; to include more of each abstract, 300 can be changed to a larger value. For the same reason, only the first three search results are taken; to include more search results, limit can be set to a larger value.
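The prompt assembly in S10.1 could be sketched as follows. The wording of the prefix text, the line layout and the function name `build_prompt` are hypothetical; only the 300-character cut-off, the limit of three search results and the three parts of the prompt are taken from the description.

```python
def build_prompt(question, results, prefix_prompt, limit=3, max_chars=300):
    """Assemble prefix_prompt, numbered payload excerpts and the user question."""
    lines = [prefix_prompt]
    for index, payload in enumerate(results[:limit], start=1):
        excerpt = payload["text"][:max_chars]  # keep the prompt short
        lines.append(f"[{index}] {payload['title']}: {excerpt}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)

# Hypothetical prefix and search results.
prefix = ("Answer only from the numbered passages. If the content is not in the "
          "government field, reply that no relevant information was found.")
results = [{"title": f"Title {i}", "text": "x" * 500} for i in range(5)]
prompt = build_prompt("How do I renew a residence permit?", results, prefix)
```

Raising `max_chars` or `limit` includes more context at the cost of a longer prompt.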
S10.2: after the Prompt is built, the LLM index building tool calls the LLM API to access the LLM large language model, and the model's result is returned to the trusted knowledge mechanism for post-processing. Because the LLM large language model does not explain its output, citations are added to the returned result, and source information is supplemented based on the traceability system and the search engine. To address factual errors and reasoning errors of the LLM large language model, the returned result is verified and reasoned over based on the government service graph. For any query script returned by the LLM large language model, the necessary external queries and calls are executed. Finally, the result is assembled and returned in the required format. The trusted knowledge mechanism returns the post-processed result to the LLM index building tool, which returns the answer to the user.
To achieve this goal, an enterprise needs to build a knowledge graph, which is the basis of the enterprise trusted knowledge mechanism, to capture the semantic relationships among content, people and activities. Content includes personal assets, files, messages, related business objects and the like; people includes identities and roles, teams, departments, groups and the like; activities include content creation, editing history, comments, searches, clicks and the like. After the data is mapped, the enterprise knowledge graph establishes a semantic network among the elements of these three pillars, capturing the context and relations among the staff in an organization, the work they do and the work results, so that the results returned from interaction with the LLM have higher credibility. The trusted knowledge mechanism performs pre-processing and post-processing uniformly: it refines the prompt and supplements context data and domain knowledge, so that the generated text, after auditing and processing, can be accurately connected to application commands. The trusted knowledge mechanism does not have the large language model LLM fully learn personal or enterprise data and domain knowledge; instead, the model is placed in different expert environments to work cooperatively, and these expert environments are the enterprise-specific trusted knowledge mechanism.
Compared with the prior art, the invention has the beneficial effects that:
1. The dialogue method for multi-strategy fusion in the government service field based on a knowledge graph fundamentally changes how government questions and answers are supplied, effectively addresses the various problems of 'one-network general handling', and optimizes government services. It better fits the business characteristics of the government service field, makes government service queries more systematic and more accurate, and greatly improves the user experience.
2. A government field knowledge system and a government service flow knowledge system are built, which standardizes government field knowledge and government service flows, improves the retrieval efficiency of government staff, and meets the public's need for government consultation question answering.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some examples of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is an FAQ flow chart of the present invention;
FIG. 2 is a question-answering flow chart of the present invention;
FIG. 3 is a diagram of a government map construction overall framework of the present invention;
FIG. 4 is a diagram of a government domain knowledge ontology according to the present invention;
FIG. 5 is a diagram of a model of the present invention for government service ontology construction;
FIG. 6 is a graph of an evaluation algorithm model of the present invention;
FIG. 7 is a flow chart of the GNE algorithm of the present invention;
FIG. 8 is a flow chart of file downloading according to the present invention;
FIG. 9 is a knowledge extraction flow chart of the present invention;
FIG. 10 is a flow chart of the entity identification dataset construction of the present invention;
FIG. 11 is a relationship extraction flow chart of the present invention;
FIG. 12 is a semantic matching flow chart of the present invention;
FIG. 13 is a flow chart of enterprise trusted knowledge mechanism establishment of the present invention;
FIG. 14 is a flow chart of semi-structured and unstructured data parsing of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently some, but not all, embodiments of the present invention. The following detailed description of the embodiments, as presented in the figures, is therefore not intended to limit the scope of the invention as claimed but merely represents selected embodiments. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the invention without inventive effort fall within the scope of protection of the invention.
Referring to FIGS. 1-14, a dialogue method for multi-strategy fusion in the government service field based on a knowledge graph comprises the following steps:
S1, referring to FIGS. 2-3, based on graph construction for government services, a government field knowledge system and a government service flow knowledge system are constructed, and the government field knowledge and the government service flows are standardized;
S1.1, semi-structured and unstructured data are acquired: a high-quality site evaluation model evaluates sites along two dimensions, quality grade and liveness, and produces a comprehensive score; the intelligent content extraction GNE algorithm and XPath parsing are used for government content analysis; and a hierarchical storage scheme for government files is provided to guarantee the consistency and integrity of downloaded data;
S1.2, knowledge extraction is performed, starting with structured data: according to the constructed government domain ontology, triple data is extracted from structured data sources in the government field, and entity-attribute value and entity-relation-entity triples are stored in the graph. For government service databases, ontology mapping supports intelligent matching between the attributes of ontology objects and data set fields, and the value conversion logic from data set to graph data is determined by configuring mapping rules; an ontology relation can be configured by selecting an object data set or a relation data set. To address the update and storage of structured data in the data warehouse, a zipper table solution is provided, which allows historical states to be viewed while reducing storage space;
S1.3, knowledge is extracted from semi-structured and unstructured data and intelligent form filling is performed through OCR (optical character recognition) and extraction methods. A relation extraction model based on multi-feature fusion takes the entity types, the entities and the context information as input and extracts features with a BERT pre-training model. A disambiguation model resolves the problem of the same mention text corresponding to different entities during entity linking; graph entity information is fused into the disambiguation model to improve its accuracy. Entities are audited by the model and manually to ensure that the entity-attribute value and entity-relation-entity triples meet service requirements;
S1.4, the government ontology is constructed, building the government domain knowledge system and the government service flow knowledge system. This specifically comprises: defining the domain and scope of the government ontology; collecting government concepts and data resources; assessing the reusability of existing ontologies; analyzing and expressing the ontology; constructing the ontology; integrating and instantiating the ontology; and evaluating and verifying the ontology;
and S2, performing accurate question and answer of the government affair service based on the constructed government affair service map.
In this embodiment, in step S1.4, the following steps are specifically performed:
S2.1, defining the domain and scope of the government ontology: the business function domain and purpose that the ontology corresponds to in the government field, the information content it describes, and the government objects that use and maintain the ontology are determined;
S2.2, collecting government concepts and data resources: data is collected from structured sources such as government service databases and from semi-structured and unstructured sources such as documents in the various government fields and government service websites, including, for example, government manuals and web page data of government service websites; data that does not meet the standard is processed for consistency;
S2.3, assessing the reusability of existing ontologies: existing government ontologies are analyzed and refined to improve reusability;
S2.4, analyzing and expressing the ontology: where existing ontologies cannot be reused, text information such as government manuals, official government documents and government service website page data is extracted to obtain core concepts, concept attributes and the relations among concepts;
S2.5, constructing the ontology: classes and their inheritance relations are defined top-down, that is, starting from the most basic concepts in the government field and refining the classes layer by layer;
S2.6, integrating and instantiating the ontology: the government ontologies are integrated, and a consistency protocol method is used to redefine and semantically process them so that data sharing and fusion are not affected; after the ontology is confirmed, data is extracted for instantiation;
S2.7, through the above steps a preliminary government domain knowledge ontology is established; through multi-party investigation and the participation of invited domain experts, the ontology is evaluated and verified in terms of correctness, consistency, extensibility, effectiveness, scale and descriptive capability.
In this embodiment, in step S2.7, the following steps are specifically performed:
S3.1, first, the basic ontology of the government field is evaluated and verified. As shown in FIGS. 5-6, the basic ontology represents general knowledge concepts without business domain characteristics and models the features of structured and unstructured data such as texts and databases; the text ontology mainly describes attributes such as file format, file size and keywords;
S3.2, the government domain knowledge ontology and the government service flow knowledge system are evaluated and verified. Based on an analysis of government corpora such as government documents and government news, five categories are planned: personnel role, policy document, news, comprehensive government role and administrative reply, together with the corresponding entity-attribute value and entity-relation-entity triples under each category. Taking the content of official documents as the core, relations are defined among the entities contained in these five categories of documents with the aim of improving the efficiency of subsequent search and question answering by government staff; the specific relations include personnel role authority relations, institution-to-personnel role name relations and institution-to-former-employee role name relations;
S3.2, for the analysis of government service flows, a government service flow knowledge system is constructed. The main concepts of the service flow include government matters, certificate materials, laws and regulations, administrative departments, service objects, administrative regions, matter subjects and authority levels, of which the government matter is the core concept. Each type of entity has its own attribute features, and semantic relations exist among the entities; together the attribute features and semantic relations cover essentially all the information of a government matter and meet the public's need for government consultation question answering;
S3.3, the core attribute features of the government matter entity are the source of authority, authority type, exercise level, handling type, service object and field, statutory time limit, promised time limit, basis for establishment, institution nature, application conditions and application materials. Semantic relations are defined with the aim of improving government service efficiency: the relation between a government matter and an authority level is the exercise level; the relation between a government matter and certificate materials is handling materials; the relation between a government matter and an administrative department is supervising body or joint handling; the relation between a government matter and laws and regulations is basis for establishment or basis for implementation; and the relation between a government matter and a service object is service subject.
In this embodiment, in step S1.1, for semi-structured and unstructured data acquisition, a high-quality site evaluation model is provided that evaluates sites along two dimensions, quality grade and liveness, and produces a comprehensive score; the intelligent content extraction GNE algorithm and XPath parsing are used for government content analysis; government files are downloaded and stored, and a hierarchical storage scheme is provided to guarantee data consistency and integrity. The steps are specifically as follows:
S4.1, first, an initial set of high-quality government websites is selected and network links are intelligently sniffed; through quality evaluation and content link relations, new high-quality websites are discovered in a fission-like manner and are automatically crawled and stored, without a second round of content acquisition and persistence;
S4.2, to guarantee the consistency and integrity of downloaded data, a hierarchical storage scheme is provided. File information is obtained by crawlers and stored in MongoDB. The file download services are managed through ZooKeeper, and download tasks are distributed evenly across the services by a partition allocation algorithm. Each service manages its threads through a thread pool and periodically checks whether the blocking queue of the current thread pool holds fewer than half of the configured number of tasks; when it does, the file download method is triggered. The method first reads the data of the corresponding partition in MongoDB and judges whether the address is downloadable, then checks in Redis whether the file address already exists. If it does not exist, the address is requested, the file stream is obtained, the md5 value is calculated, the OSS address is assembled, the file is stored in OSS, and a record is saved to HBase and MongoDB. Where a PDF has to be generated instead, it is stored in the same way so that the subsequent file information stays consistent. If the download fails, the failure is recorded and written back to MongoDB so that the download service can process the record again.
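As a rough sketch of the de-duplication and md5 steps in S4.2, the fragment below uses an in-memory set in place of Redis and a dictionary in place of OSS; both substitutions, the key layout `prefix/xx/md5` and the function name `store_file` are assumptions made for illustration only.

```python
import hashlib

def store_file(url, content, seen_urls, object_store, prefix="gov-files"):
    """Skip known URLs; otherwise hash the content, store it and return a record."""
    if url in seen_urls:          # stands in for the Redis existence check
        return None
    digest = hashlib.md5(content).hexdigest()
    key = f"{prefix}/{digest[:2]}/{digest}"   # hypothetical object key layout
    object_store[key] = content   # stands in for the upload to object storage
    seen_urls.add(url)
    return {"url": url, "md5": digest, "oss_key": key}

seen_urls, object_store = set(), {}
first = store_file("https://example.gov/notice.pdf", b"hello", seen_urls, object_store)
second = store_file("https://example.gov/notice.pdf", b"hello", seen_urls, object_store)
```

The returned record is the kind of metadata that would then be written to the metadata stores named in the step.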
In this embodiment, in step S4.1, the high-quality website evaluation model comprises a quality grade scoring algorithm and a liveness scoring algorithm, as shown in FIG. 6. The quality grade scoring algorithm applies binary classification prediction and normalization to the information published by a government website to determine the website's relevance to the government field, which helps filter out websites with poor relevance. The liveness scoring algorithm normalizes three dimensions, namely the website's document count, follower count and document publishing time, and takes their weighted sum to obtain a quantifiable liveness grade, which helps configure an optimized crawl cycle strategy. Finally, the quality evaluation model yields the quality grade and liveness grade of each newly discovered government website. Functions such as configuring and scheduling government website crawlers solve the difficulty of managing a large number of crawlers; combined with the product capability of the government knowledge graph, multi-modal data in the government field is crawled and value is extracted from massive data.
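The liveness score described above (a weighted sum of three normalized dimensions) might be sketched as follows. The weights, the upper bounds used for normalization, and the choice to treat publishing time as "days since the last document" are all assumptions; the text gives no concrete values.

```python
def normalize(value, upper):
    """Scale a non-negative value into [0, 1] against a chosen upper bound."""
    if upper <= 0:
        return 0.0
    return max(0.0, min(1.0, value / upper))

def liveness_score(doc_count, follower_count, days_since_last_post,
                   weights=(0.4, 0.3, 0.3),
                   max_docs=10000, max_followers=100000, max_days=365):
    # More recent publishing means a higher recency component.
    recency = 1.0 - normalize(days_since_last_post, max_days)
    parts = (normalize(doc_count, max_docs),
             normalize(follower_count, max_followers),
             recency)
    return sum(w * p for w, p in zip(weights, parts))

active = liveness_score(doc_count=8000, follower_count=50000, days_since_last_post=2)
stale = liveness_score(doc_count=200, follower_count=300, days_since_last_post=300)
```

A score of this kind could then drive the crawl cycle, for example by revisiting high-liveness sites more often.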
The intelligent discovery method for high-quality sites for semi-structured and unstructured data comprises the following specific steps:
S5.1: referring to FIG. 7, GNE is an extraction algorithm based on the text density and symbol density of a web page. After the HTML text of a web page is obtained, a DOM tree is generated by jso parsing; pre-processing such as removing js scripts and css styles is then applied to the DOM tree. The text density and the symbol density are computed by the text density algorithm and the symbol density algorithm respectively, a score is calculated from these two values for the candidate text content, and the highest-scoring content is taken as the main text of the web page.
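To illustrate the idea of scoring candidate blocks by text density and symbol density, the sketch below works on blocks that are assumed to have been extracted from the DOM tree already. The density definitions and the way they are combined are deliberate simplifications chosen for this example and are not the exact GNE scoring formula; the block fields are hypothetical.

```python
import math

PUNCTUATION = set(",.;:!?，。；：！？、")

def density_score(block):
    """Score a block by non-link text per tag and punctuation per tag."""
    non_link_chars = len(block["text"]) - len(block["link_text"])
    non_link_tags = max(block["tag_count"] - block["link_tag_count"], 1)
    text_density = non_link_chars / non_link_tags
    symbols = sum(1 for ch in block["text"] if ch in PUNCTUATION)
    symbol_density = symbols / non_link_tags
    # Simplified combination: navigation-like blocks score near zero.
    return text_density * math.log(1.0 + symbol_density)

def extract_main_text(blocks):
    return max(blocks, key=density_score)["text"]

nav = {"text": "Home News Contact", "link_text": "Home News Contact",
       "tag_count": 4, "link_tag_count": 3}
body = {"text": "The bureau issued a notice. Applicants must submit an ID card, "
                "a form, and a photo.",
        "link_text": "", "tag_count": 2, "link_tag_count": 0}
main_text = extract_main_text([nav, body])
```

Link-heavy blocks with little punctuation, such as menus and footers, end up with low scores, which is the intuition behind using both densities.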
In this embodiment, in step S1.2, knowledge extraction from structured data is carried out first, specifically according to the following steps:
S6.1, according to the constructed government domain ontology and government service flow knowledge system, triple data is extracted from structured data sources in the government field, and entity-attribute value and entity-relation-entity triples are stored in the graph. For government service databases, ontology mapping supports intelligent matching between the attributes of ontology objects and data set fields, and the value conversion logic from data set to graph data is determined by configuring mapping rules; an ontology relation can be configured by selecting an object data set or a relation data set. To address the update and storage of structured data in the data warehouse, a zipper table solution is provided, which allows historical states to be viewed while reducing storage space;
For a traditional government service database, ontology mapping supports intelligent matching between the attributes of ontology objects and data set fields, and the value conversion logic from data set to graph data is determined by configuring mapping rules; an ontology relation can be configured by selecting an object data set or a relation data set. Data files are constructed, and structured data can be ingested automatically through tools such as an Excel upload plug-in and a CSV upload plug-in. A custom logical view can also serve as a data set: information from several tables is merged into one table according to service requirements for extraction and customization, with a visual and convenient operation interface, which simplifies query operations. To address the update and storage of structured data in the data warehouse, a zipper table solution is provided, which allows historical states to be viewed while reducing storage space. If data updates directly overwrite the previous state, the historical state can no longer be queried; if a full copy of the data is stored in every partition, a large amount of unchanged data is stored repeatedly. The zipper table records the state of updated data only and does not store unchanged data again; the life cycle of each state is marked by time, the latest state is represented by default with the maximum date 9999-12-31, and at query time the data of the state within the specified time range is obtained as required.
Two fields are added, the data effective date dw_begin_date and the data expiration date dw_end_date. The effective date records when the record took effect, and the expiration date records when it ceased to be valid (9999-12-31 indicates that the record is still valid). Newly added record: the effective date is the current day and the expiration date is 9999-12-31. Unchanged record: the effective date and the expiration date stay as they were. Changed record: the old record is kept and its expiration date is changed to the current day; for the new record, the effective date is the current day and the expiration date is 9999-12-31. Deleted record: the record is closed, and its expiration date becomes the current day. To improve query performance, indexes are added on the two fields dw_begin_date and dw_end_date, as shown in FIG. 9.
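The four record cases above (new, unchanged, changed, deleted) can be illustrated with a small in-memory zipper table. The row layout with `key` and `value`, the function names, and the half-open validity interval [dw_begin_date, dw_end_date) used for querying are assumptions of this sketch; only the two date fields and the 9999-12-31 convention come from the description.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # marks the currently valid state

def apply_snapshot(history, snapshot, today):
    """Merge today's full snapshot (key -> value) into the zipper table in place."""
    current = {row["key"]: row for row in history if row["dw_end_date"] == OPEN_END}
    for key, row in current.items():
        # Changed or deleted records are closed with today's date.
        if key not in snapshot or snapshot[key] != row["value"]:
            row["dw_end_date"] = today
    for key, value in snapshot.items():
        old = current.get(key)
        # New or changed records get a fresh open-ended row; unchanged rows are untouched.
        if old is None or old["value"] != value:
            history.append({"key": key, "value": value,
                            "dw_begin_date": today, "dw_end_date": OPEN_END})
    return history

def as_of(history, day):
    """Return the state that was valid on the given day."""
    return {row["key"]: row["value"] for row in history
            if row["dw_begin_date"] <= day < row["dw_end_date"]}

history = []
apply_snapshot(history, {"m1": "v1", "m2": "x"}, date(2023, 1, 1))
apply_snapshot(history, {"m1": "v2"}, date(2023, 1, 2))  # m1 changed, m2 deleted
```

Only three rows are stored for the two days, yet either day's state can be reconstructed with `as_of`.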
S6.2, to address the update and storage of structured data in the data warehouse, a zipper table solution is provided. The zipper table design allows historical states to be viewed and reduces storage usage, whereas letting data updates directly overwrite the previous state would make the historical state impossible to query;
S6.3: semi-structured and unstructured data are parsed and intelligent form filling is performed. Intelligent form filling is realized through OCR (optical character recognition) and extraction techniques, which automates routine document processing and improves the document processing efficiency of business staff. Files in doc, docx, pdf and chm format uploaded or downloaded by the user are first parsed to obtain the pictures, tables, text, titles and other data in the files: doc files are first converted to docx, docx files are parsed directly with the third-party Python package python-docx, and chm files are first converted to pdf and then parsed as pdf. Scanned pdf files undergo OCR recognition, and the whole OCR process is divided into layout analysis, text detection and text recognition. Layout analysis analyzes the structure of the document and automatically extracts its structured information; text detection marks the text regions in the image with four-point rectangular boxes; text recognition converts the text selected in the previous step into characters. To meet the recognition requirements of official documents, the one-stage object detection algorithms of the YOLO series, the regression-based text detection algorithms CTPN and EAST, the segmentation-based text detection algorithms PSENet and DBNet, and the text recognition algorithms CRNN+CTC, Attention and SVTR are used. Element extraction combines extraction techniques with rules to extract elements automatically and thus realize intelligent form filling; the element extraction task is divided into entity recognition and relation extraction. Entity linking, including entity alignment and entity disambiguation, is performed on the extracted entities. Before entities are stored, they undergo model audit and manual audit, and the knowledge is traced: the sources of the knowledge searched by government staff and users, or of the returned results, are located, which improves users' confidence in the knowledge and its interpretability;
S6.4, entity recognition on parsed semi-structured and unstructured data: unstructured data is first extracted jointly. The entity recognition subtask based on BERT-CRF suffers in the government field from wrong entity boundaries and incomplete entity recognition, and it does not use the information in the existing government knowledge base. A solution is therefore provided that improves entity recognition accuracy using the description texts in the knowledge base: first, an entity name dictionary is built from the entity names and entity aliases in the knowledge base; vector embeddings of the entity names are obtained by mining the entity description texts in the knowledge base; candidate entities in the text are then obtained by name dictionary matching; and finally the results are screened with an entity recognition model to complete the entity recognition task;
S6.5, step S6.4 specifically comprises a data preparation flow, in which an alias dictionary is built from the entity names and alias information in the knowledge base, entity description texts are constructed, and mapping dictionaries are built. The specific flow is as follows:
Constructing the entity alias dictionary: the alias dictionary is built from the entity names and alias information in the government knowledge base. The errors that prevent an entity name in government data from being matched in the entity base are: error one, special characters in the middle of the text; error two, a wrong entity name in the input text; and error three, an alias that is not in the knowledge base. For error one, special symbols are normalized, for example by replacing all Chinese punctuation marks with English punctuation marks, and the processed name is added to the aliases of the corresponding entity. For error three, the entity recognition model can solve the problem. For errors two and three, the number of times an entity in the knowledge base fails to be matched is counted; when an entity in the training set fails to be matched more than 4 times in total and the corresponding character string occurs more than 3 times, the character string is added to the aliases of the entity.
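A minimal sketch of the alias handling described above is given below. The punctuation table covers only a few common marks, and the function names and the dictionary layout (entity id mapped to a set of aliases) are assumptions; the thresholds of more than 4 unmatched occurrences and more than 3 occurrences of the string are taken from the description.

```python
# Partial table of Chinese punctuation and its English counterpart (illustrative).
PUNCT_MAP = str.maketrans({
    "，": ",", "。": ".", "（": "(", "）": ")",
    "：": ":", "；": ";", "！": "!", "？": "?",
})

def normalize_name(name):
    """Error one: normalize special symbols before adding the name as an alias."""
    return name.translate(PUNCT_MAP).strip()

def add_normalized_alias(aliases, entity_id, name):
    aliases.setdefault(entity_id, set()).update({name, normalize_name(name)})

def maybe_add_frequent_alias(aliases, entity_id, surface, unmatched_total, surface_count):
    """Errors two and three: promote a frequent unmatched string to an alias."""
    if unmatched_total > 4 and surface_count > 3:
        aliases.setdefault(entity_id, set()).add(surface)
        return True
    return False

aliases = {}
add_normalized_alias(aliases, "E1", "居住证（暂住）")
promoted = maybe_add_frequent_alias(aliases, "E1", "暂住证", unmatched_total=6, surface_count=4)
rejected = maybe_add_frequent_alias(aliases, "E1", "暂住", unmatched_total=6, surface_count=2)
```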
Constructing the entity description text and the mapping dictionaries. The entity description text is obtained by splicing the triplet data (entity-attribute value and entity-relation-entity) in the constructed government map. Mapping dictionaries are built for use by the later models; the common dictionaries are: entity name to list of entity ids, entity id to entity name, entity id to entity description text, entity id to entity type, and entity type to entity ids;
An entity name dictionary is constructed from the entity names and entity alias information of the knowledge base; using the entity description text of the knowledge base and a BERT pre-trained model, the vector output at the CLS position of the model is selected as the vector embedding of the entity name; candidate entities in short text are obtained by dictionary matching; and finally the matched results are screened by the constructed named entity recognition model. The construction flow is as follows:
A dictionary tree (trie) combined with forward maximum matching is used to match entities in the text. Entity names are inserted into the dictionary tree. The entity base contains many single-character entities, and matching them would produce many spurious results, so single-character entities are not inserted. Under maximum matching, some entities overlap or repeat; the number of occurrences of such entities is counted and the handling is decided according to the count: keep the longest entity, keep the shortest entity, or keep all of them. Matching is performed by maximum matching, and only the entities that need to be split are split after matching is finished.
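The trie with forward maximum matching described above can be sketched as follows. The class and function names are illustrative assumptions; the sketch always keeps the longest match and does not implement the count-based choice between longest, shortest and all entities.

```python
class Trie:
    """Dictionary tree of entity names; single-character names are skipped."""

    def __init__(self):
        self.root = {}

    def insert(self, word: str) -> None:
        if len(word) < 2:  # single-character entities are not inserted
            return
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$end"] = True  # multi-character key, cannot clash with a character

    def longest_match(self, text: str, start: int) -> int:
        """Return the end index of the longest entity starting at `start`, or -1."""
        node, end = self.root, -1
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if "$end" in node:
                end = i + 1
        return end


def forward_max_match(trie: Trie, text: str) -> list:
    """Scan left to right, always taking the longest dictionary entry."""
    results, i = [], 0
    while i < len(text):
        end = trie.longest_match(text, i)
        if end == -1:
            i += 1
        else:
            results.append((text[i:end], i, end))
            i = end
    return results
```

Because every returned span is a dictionary entry, the candidates have no boundary errors; whether a candidate is a true entity in context is left to the screening model.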
To perform binary classification on the matched entities, each entity name must be represented by a vector. Because BERT is used by the subsequent model, the entity name embedding is also obtained with BERT: the entity description text of the knowledge base is fed to the BERT pre-trained model, and the vector output at the CLS position is selected as the embedding of the entity name. Training data is constructed by obtaining candidate entities in the text through dictionary matching with the maximum matching algorithm and marking them with the corresponding labels.
The matched results are screened by the constructed entity recognition model. As shown in figure 10, the government text passes through the BERT layer and a bidirectional LSTM, the embedding corresponding to the entity name is spliced on, and the prediction is made through convolution and a fully connected layer. Because the model works by dictionary matching, every result can be found among the candidate entities in the knowledge base and has no boundary errors. Single-character entities are removed during dictionary matching and are predicted by the BERT-CRF model instead.
In this embodiment, in step S6.3, a relation extraction model based on multi-feature fusion is provided, as in fig. 11. The input of the model is the entity types, the entities themselves and the context information; features are extracted by the BERT pre-trained model, and different pooling strategies are set dynamically for different entity lengths;
Entity linking is the task of linking a mention in the text to an entity in the knowledge base. Entity linking has two difficulties: entity alignment, where the same entity has different mention texts, and entity disambiguation, where the same mention text refers to different entities.
For the case where the same entity has different mention texts, the solution is to train a SimCSE text matching model on the established entity names and aliases and to store the trained entity word vectors online in a FAISS vector library. For an entity obtained in the first round of checking, a vector is obtained through the encoder of the SimCSE model, vector retrieval is performed in the FAISS library, and the entity word with the highest score is obtained;
Entity linking is then performed. For the case where the same mention text corresponds to different entities, an entity corrected by the first-round dictionary may correspond to several entities in the government map; several candidate entity IDs are found by entity name, and entity disambiguation is performed. Entity disambiguation is implemented as binary classification. During training, the entity actually linked to is taken as the positive example, and two negative examples are selected from the candidate entities. The input text and the description text of the entity to be disambiguated are concatenated and input into a BERT model; the CLS position vector is taken from the output and concatenated with the feature vectors at the start and end positions of the candidate entity; the three concatenated vectors pass through a fully connected layer and a final sigmoid activation to give the probability score of the candidate entity. The probability scores of all candidate entities are sorted, and the entity with the highest probability is selected as the correct entity;
Entity model auditing and manual auditing are then carried out. Entities are audited by a model to ensure that they meet the service requirements. The check does not use context information and focuses on how the characters themselves combine; in essence it is a short-text binary classification model. A short-text task does not need to capture long-distance relations, so a traditional RNN model is used, which balances performance and effect well. The data set is constructed as follows: positive samples are standard named entities that have passed manual audit; label 1 is the positive label and indicates that the text is a named entity, and the government entity names come from the database. Label 0 is the negative label and indicates that the text is not a named entity. Non-entities are produced by reversing the character string, and the ratio of positive to negative samples is 1:1;
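The data set construction for the audit model can be sketched as follows. The function name is an assumption, and skipping names that read the same in both directions is an added safeguard not stated in the description (such a name would otherwise receive both labels).

```python
def build_audit_dataset(entity_names: list) -> list:
    """Build (text, label) pairs: 1 for an audited entity name, 0 for its reversal."""
    samples = []
    for name in entity_names:
        reversed_name = name[::-1]
        if reversed_name == name:
            # Assumption: palindromic names are skipped to avoid contradictory labels.
            continue
        samples.append((name, 1))           # positive: manually audited entity
        samples.append((reversed_name, 0))  # negative: reversed character string
    return samples
```

Each positive contributes exactly one negative, which gives the 1:1 ratio described above.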
A human-machine collaboration mechanism is adopted. The triples (entity-attribute value and entity-relation-entity) are identified by the algorithms above, and government staff can intervene with manual auditing during knowledge extraction, thereby realizing a human-machine collaboration mechanism integrated into knowledge extraction. If data content conflicts during extraction, the manually modified data takes priority.
Knowledge tracing. Knowledge tracing technology locates the provenance of the knowledge in question-and-answer results or returned results for government staff and users, and improves their confidence in the knowledge and its interpretability. Its advantages are that knowledge can be traced quickly and accurately, source information can be grasped and related information determined at the first opportunity, and time, manpower and material costs are reduced as far as possible; it is of significance for assisting government staff in making decisions and solving problems.
In the specific implementation of knowledge tracing, unique identification information is created on the data through the knowledge graph to enable rapid positioning, so that nodes in the database have sources and bases. When the knowledge graph is constructed, dimensions such as time and place are marked on the input data. For example, when a piece of knowledge is derived from a book, the book name, publisher, publication time, author and similar dimensions are marked, and the knowledge is traced back to its source based on these marks. A document is uploaded to the system for entry; the pages are recognized by OCR, the information related to the identified entities is stored, and the document information corresponding to each entity is recorded in a mongo database. Information about one entity attribute may exist in several documents, so one entity may correspond to several documents, and tracing an entity may return several documents. When a user searches for content, the document tracing function queries the mongo database for the document information of all sources of the entity and ranks it in combination with user ratings. For documents that have not been rated, the user can give a star rating, which is stored in the user portrait.
In this embodiment, in step S2, as shown in fig. 1, accurate question answering for government services is performed based on the constructed government service map; fuzzy query is performed based on the FAQ knowledge base strategy, for which an FAQ knowledge base strategy based on multi-stage word and semantic recall and sorting is provided; for the scene migration problem of the FAQ knowledge base, an extractive reading comprehension question-answering strategy with multi-document retrieval is provided; and for the shortage of government-field dialogue corpus in the FAQ knowledge base, a local knowledge base strategy for an LLM large language model based on a trusted knowledge mechanism is provided. The steps are as follows:
S9.1, a government service dialogue service is constructed with the multi-round question-answering strategy of the government service map, and the user identification and message content are passed to the main logic service. Because the corpus is well separated, the top-level intention recognition module uses a traditional GBDT-LR machine learning model to classify the top-level intention into chat intention and government intention; the model has few parameters and its accuracy is guaranteed. Chat intentions are divided into greeting, accepting, rejecting and clarifying intentions. Government intentions are divided according to government service requirements; the subdivided government intentions and the entities are extracted with a DIET joint extraction model, and government service dialogue management is completed based on intention inheritance and slot inheritance. The specific steps are as follows:
S9.2, a slot semantic template is established. Intention strategies are designed according to the entities, relations and attributes constructed in the ontology, so that each intention is bound to slot list information. With the government matter as the core entity, intentions are constructed around the associated triples, entity-attribute value and entity-relation-entity. The relations listed are: between government matters and the authority level, the driving level; between government matters and the authority, the office material; between government matters and the authority, the supervision body; between government matters and the authority, the joint agency; between government matters and laws and regulations, the setting basis; between government matters and laws and regulations, the implementation basis; and between government matters and the service object, the service body. For each intention, a reply template, a rejection template, a clarification template and a cypher template are set. The cypher template is filled dynamically with the data obtained by intention and entity extraction, and, depending on how the cypher template is configured for the business requirement, several combined records may be returned at once. Global template data is set, including a template for unrecognized intentions and a rejection template. Intention thresholds are set: above a certain threshold the intention is confirmed; within a certain threshold range the intention is clarified; below a certain threshold the FAQ knowledge base question-answering strategy is entered;
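The threshold rule at the end of step S9.2 can be sketched as a three-way routing function. The numeric thresholds 0.8 and 0.5 and the route names are assumptions for illustration only; the description speaks only of "a certain threshold".

```python
def route_by_confidence(confidence: float,
                        confirm_threshold: float = 0.8,
                        clarify_threshold: float = 0.5) -> str:
    """Decide how the dialogue proceeds from the intention confidence.

    Both threshold values are hypothetical defaults.
    """
    if confidence >= confirm_threshold:
        return "confirm"   # intention accepted; fill and run the cypher template
    if confidence >= clarify_threshold:
        return "clarify"   # reply with the clarification template
    return "faq"           # fall back to the FAQ knowledge base strategy
```

Keeping the thresholds as parameters lets each intention, or the global template data, carry its own values.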
S9.3, intention and entity joint extraction is performed based on the DIET joint extraction model. With the existing pre-trained models, BERT-TEXTCNN for intention recognition and BERT-CRF for entity recognition, model training and inference are slow; moreover, entity recognition and intention extraction are strongly correlated in subdivided government scenes, and two-stage tasks accumulate errors. Joint extraction of intention and entities in subdivided government service scenes based on the DIET joint extraction model is therefore proposed. Once the intention and the slots have been obtained, they are bound together: against the semantic slot template, it is judged whether the type of the identified entity matches the slot type under the corresponding intention. If it matches, slot filling is completed; if not, a no-answer result is returned. Entity linking ensures that the identified entity can be linked to an entity in the graph database. Where needed, the slots of the previous round are inherited to complete the slots; if the previous round has a clear intention and the intention strength of the current round is low, the intention of the previous round is inherited within its related slots. The entities extracted by the model are merged with the entities obtained by dictionary matching. Finally, entities irrelevant to the service are filtered out by rule filtering.
Finally, the results are de-duplicated and returned in the related entity format;
S9.4, government service dialogue management: multi-round government service dialogue is completed based on intention inheritance and slot inheritance. Slot inheritance: based on the user input and the user's unique identification, it is judged whether the current input contains an entity; if not, the entity of the previous round is inherited. Intention inheritance: if the previous round has a clear intention and the intention strength of the current round is below a certain threshold, the intention is inherited within the slots related to the previous round's intention. Intentions below a certain threshold require a clarification and confirmation process. To support multi-round QA, a multi-round state machine is designed and maintained, and the dst result of each round is stored. A multi-round window strategy is set: the round window is limited to 5 rounds and the time window to 5 minutes. History information is stored in redis with an expiration time, so that context information can be retrieved quickly during a conversation. For the same dialogue state management intention, only the most recent one is taken;
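The window strategy and slot inheritance can be sketched with an in-memory stand-in for the redis store. This is an illustration under assumptions: the class and field names are hypothetical, timestamps are passed in explicitly so that the behaviour is deterministic, and a real deployment would rely on the expiration time set in redis rather than on the filter shown here.

```python
from collections import deque


class DialogueStateStore:
    """Keeps at most `max_rounds` recent states per user, each valid for `ttl_seconds`."""

    def __init__(self, max_rounds: int = 5, ttl_seconds: int = 300):
        self.max_rounds = max_rounds
        self.ttl_seconds = ttl_seconds
        self._store = {}  # user_id -> deque of (timestamp, state)

    def history(self, user_id: str, now: float) -> list:
        """Return the states still inside the time window, oldest first."""
        rounds = self._store.get(user_id, deque())
        return [state for ts, state in rounds if now - ts <= self.ttl_seconds]

    def update(self, user_id: str, intent: str, entities: list, now: float) -> dict:
        """Record one round; inherit the previous round's entities if none were given."""
        previous = self.history(user_id, now)
        if not entities and previous:
            entities = previous[-1]["entities"]  # slot inheritance
        state = {"intent": intent, "entities": entities}
        rounds = self._store.setdefault(user_id, deque(maxlen=self.max_rounds))
        rounds.append((now, state))
        return state
```

The `deque(maxlen=...)` enforces the 5-round window, and the timestamp filter in `history` enforces the 5-minute window.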
S9.5, processing the returned result. The query statement is converted and the database is queried: different cypher templates are executed according to the different intention strategies to query the graph database, the answer template is filled with the entity, and the answer is returned. If no data is found, the FAQ dialogue strategy is entered.
FAQ knowledge base question-answering strategy based on multi-stage word and semantic recall and sorting. Accurate question answering for government services is performed based on the constructed government service map; if the accurate query fails, fuzzy query is performed based on the FAQ knowledge base strategy, for which an FAQ knowledge base question-answering strategy based on multi-stage word and semantic recall and sorting is provided. The steps are as follows:
S9.6, a data preprocessing module cleans the data and filters invalid samples, constructs the training set, validation set and test set, and adds the words found by new word discovery to a custom domain dictionary to improve word segmentation accuracy.
S9.7, a recall module. The recall model is divided into word-level recall and semantic recall. For the failure of TFIDF term frequency on short text during word-level recall, a word co-occurrence distance solution based on Word2Vec is provided; for the embedding collapse problem of the BERT pre-trained model during matching, a SimCSE model based on contrastive learning is applied to further improve model accuracy;
S9.8, word-level recall. For the failure of TFIDF term frequency on short text during word-level recall, a word co-occurrence distance solution based on Word2Vec is provided. The idea is that if a word's vector is semantically close to the other words, its weight is large, while non-core words become outliers. Word2Vec is trained on the corpus to obtain a word vector for each word; the question-answer pair corpus is used for training and can be extended with corpora from the same industry. A core word is close to the other words, so its computed score is relatively high; a non-core word may occur very frequently but is far from the other words, so its computed score is very low and it can be filtered out. Word distance is computed as the inner product, that is, cos similarity; Euclidean distance can also be used. Cos similarity measures the angle between two word vectors: the smaller the angle, the higher the cos score. When Euclidean distance is used, the computed distance d is substituted into 1 - d/2 to obtain the similarity. For each word of a document, the scores against the other word vectors are summed and divided by the document length, giving a weight that replaces the failed TF. The information quantity IDF of the word is then used to suppress high-frequency but unimportant words, and finally the inverted index is built with the product of the word co-occurrence distance and IDF.
From a corpus of millions of entries, the hundreds of question-answer pairs that may satisfy the question are found, and 50 corpus pairs are recalled by the recall model, so that the subsequent sorting task can use a more complex model;
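The weighting in step S9.8 can be sketched as follows. The sketch uses hand-written toy vectors in place of trained Word2Vec vectors, and the smoothed IDF formula is an assumed variant; the description does not fix either. All function names are illustrative.

```python
import math


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cooccurrence_weights(doc_words: list, vectors: dict) -> dict:
    """Sum each word's similarity to the other words, divided by document length."""
    n = len(doc_words)
    return {
        w: sum(cosine(vectors[w], vectors[o]) for o in doc_words if o != w) / n
        for w in doc_words
    }


def idf(word: str, corpus: list) -> float:
    """Smoothed inverse document frequency (assumed variant)."""
    df = sum(1 for doc in corpus if word in doc)
    return math.log((1 + len(corpus)) / (1 + df)) + 1.0


def term_scores(doc_words: list, corpus: list, vectors: dict) -> dict:
    """Co-occurrence weight times IDF, used in place of TF-IDF for the inverted index."""
    weights = cooccurrence_weights(doc_words, vectors)
    return {w: weights[w] * idf(w, corpus) for w in doc_words}
```

With vectors in which a filler word points away from the topic words, the filler receives a low weight even when it is frequent, which is the behaviour the step relies on.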
S9.9, as shown in FIG. 12, semantic recall computes question-question semantic similarity. For the embedding collapse problem of the BERT pre-trained model during matching, the contrastive learning SimCSE model is applied to further improve model accuracy. In the SimCSE contrastive learning model, positive examples are constructed with the random Dropout of BERT: the same text passes through the BERT encoder twice to give two different sentence vectors, which form a similar pair; negative examples are other samples randomly sampled from the same batch. For online serving in the question-question scene, a two-tower model is used: the trained vectors are stored offline in the semantic space, the question embeddings are stored in Faiss, and the online service performs Faiss vector retrieval to find the nearest index vectors, which avoids large-scale real-time computation online; the query is thereby narrowed from millions of question pairs to 50 recalled pairs;
S9.10, sorting by question-question similarity. Given the characteristics of government corpora, the sort considers both the question-question similarity score and whether the sentences contain government entity names, and sums the two by weighted voting to further improve the accuracy of the sorted results. The word-level recall and semantic recall results are merged and de-duplicated; the government-related entities in the query are identified; the intersection-over-union between the entities in the query and the entities in each recalled question is computed; the sentence-level semantic matching similarity between the query and each recalled question is computed with the contrastive learning SimCSE model; and the two are combined by weighted sorting, with weight 0.8 for semantic matching and weight 0.2 for entity intersection-over-union. Finally the top 50 are returned;
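The weighted sort of step S9.10 can be sketched as follows. The candidate record layout (`question`, `semantic_sim`, `entities`) is an assumption; the semantic similarity would come from the SimCSE model and is taken as given here.

```python
def entity_iou(a: list, b: list) -> float:
    """Intersection-over-union of two entity collections."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def rerank(query_entities: list, candidates: list,
           w_sem: float = 0.8, w_ent: float = 0.2, top_k: int = 50) -> list:
    """Merge, de-duplicate and sort recalled questions by the weighted score.

    Each candidate is a dict with keys 'question', 'semantic_sim', 'entities'.
    Returns (score, question) tuples, best first.
    """
    seen, scored = set(), []
    for cand in candidates:
        if cand["question"] in seen:  # same question recalled by both channels
            continue
        seen.add(cand["question"])
        score = (w_sem * cand["semantic_sim"]
                 + w_ent * entity_iou(query_entities, cand["entities"]))
        scored.append((score, cand["question"]))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]
```

A candidate with slightly lower semantic similarity can overtake another when it shares the government entity named in the query, which is the effect the entity term is meant to have.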
S9.11, computing question-answer similarity. To address low-quality question-answer pairs and answers that do not address the question, and to further improve the final matching effect, the best answer is selected from the 50 pairs, or, to add diversity, one of the top k question-answer pairs is chosen at random and returned to the user. Because the question-question similarity module finally recalls only 50 corpus pairs, a more complex model can be used that is not limited to the two-tower structure. The two-tower model has the defect that the two sides do not interact fully: interaction happens only in the last layer, which puts an upper limit on the effect. The model is constructed by forming a question-answer pair, inputting it into a BERT network and outputting a number between 0 and 1 that represents the degree of match between question and answer; the BERT model lets the question and the answer interact many times and learn more, which improves the accuracy of the model;
S9.12, for FAQ multi-scene migration, an extractive reading comprehension question-answering strategy with multi-document retrieval is provided. The strategy comprises multi-document coarse recall based on the bm25 algorithm, multi-task fine recall based on a BERT-MRC model, and multi-document sorting and answer return;
S9.13, multi-document coarse recall based on the bm25 algorithm. The 50 documents most relevant to each question are recalled, which avoids large-scale computation online. To avoid paragraphs being truncated during training, the maximum paragraph length is set to 400, so that the combined question and paragraph do not exceed the maximum input length of 512 of BERT; model training can then take full advantage of sentence segmentation and avoid the effect of truncated sentences. When training data is generated, a dynamic negative sampling method is used to address the imbalance between positive and negative samples in the reading comprehension task. Each question has a positive sample set and a negative sample set. Positive samples are the fragments of the question's documents that contain the answer; negative candidates are selected, after all documents are segmented, as the 5 fragments with the highest bm25 score and 5 fragments that do not contain the answer. When a batch is generated, each positive sample is taken together with one negative sample drawn at random from the candidate negatives. This keeps the positive and negative samples of the reordering (sorting) task balanced and makes it very likely that the negatives differ from epoch to epoch. In addition, hard negative sampling is used: the trained model is run over the negative samples, and the negatives with a high degree of interference are added to the training set for retraining, so that the model distinguishes positive and negative samples better.
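The dynamic negative sampling step can be sketched as follows. The function name and the use of an explicit `random.Random` instance are assumptions made so that the example is reproducible; how the candidate negatives are chosen (bm25 ranking) is outside the sketch.

```python
import random


def make_epoch_pairs(positives: list, negative_candidates: list,
                     rng: random.Random) -> list:
    """Pair every positive fragment with one randomly drawn negative fragment.

    Returns (fragment, label) tuples with label 1 for positives and 0 for
    negatives, so each epoch is balanced 1:1. Calling this once per epoch
    redraws the negatives, which is what makes the sampling dynamic.
    """
    pairs = []
    for pos in positives:
        pairs.append((pos, 1))
        pairs.append((rng.choice(negative_candidates), 0))
    return pairs
```

Calling `make_epoch_pairs` again at the start of each epoch, with the same generator, yields a fresh draw from the candidate negatives while keeping the classes balanced.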
S9.14, multi-task fine recall based on the BERT-MRC model. The extraction model is trained with a multi-task method, taking the question and the paragraph content as the input of the model. A detection task judges whether an answer exists, and an extraction task extracts the answer to the question; multi-task training improves the overall effect of the model. The first task predicts the label, that is, whether the paragraph contains the answer (the relevance of the paragraph to the question); the second task predicts the start and end positions of the answer, which is the target of answer extraction itself. Because the two tasks converge at different rates during training, their loss weights are adjusted so that both converge together. The loss function is 0.01 x MRC_loss + 0.99 x CLS_loss, which first ensures that an answer exists and then extracts it, so that the model converges well;
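The weighted loss can be written out for a single example as follows. Only the weights 0.01 and 0.99 come from the description; taking CLS_loss as binary cross-entropy on the label and MRC_loss as the mean cross-entropy of the start and end positions is an assumption, as are the function names.

```python
import math

EPS = 1e-12  # guards against log(0)


def binary_cross_entropy(prob: float, target: int) -> float:
    return -(target * math.log(prob + EPS) + (1 - target) * math.log(1 - prob + EPS))


def cross_entropy(probs: list, target_index: int) -> float:
    return -math.log(probs[target_index] + EPS)


def joint_loss(has_answer_prob: float, has_answer: int,
               start_probs: list, end_probs: list, start: int, end: int,
               mrc_weight: float = 0.01, cls_weight: float = 0.99) -> float:
    """0.01 * MRC_loss + 0.99 * CLS_loss for one (question, paragraph) example."""
    cls_loss = binary_cross_entropy(has_answer_prob, has_answer)
    if has_answer:
        # Assumed form of MRC_loss: mean of start and end cross-entropy.
        mrc_loss = 0.5 * (cross_entropy(start_probs, start)
                          + cross_entropy(end_probs, end))
    else:
        mrc_loss = 0.0  # no span to extract when the paragraph has no answer
    return mrc_weight * mrc_loss + cls_weight * cls_loss
```

With these weights the gradient is dominated by the detection task, which matches the stated intent of first ensuring that an answer exists.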
S9.15, sorting the documents and returning the answer. First, it is judged whether the binary classification probability values of the label, start and end are below a certain threshold; if they are, the local government knowledge base strategy of the LLM large model based on the trusted knowledge mechanism is entered. Otherwise, a document score is computed by the sorting rule for the documents that have answers, and the paragraph with the highest score is returned as the reading comprehension answer;
S9.16, local knowledge base strategy for the LLM large language model based on a trusted knowledge mechanism. For the shortage of government-field dialogue corpus in the FAQ knowledge base, a local government knowledge base strategy for the LLM large language model based on a trusted knowledge mechanism is provided. When an LLM large language model is applied inside an enterprise, it must be supplied with trusted knowledge in order to be useful; trusted knowledge is the key to making generative artificial intelligence truly usable in an enterprise environment. Without a stable enterprise knowledge base to support it, generative artificial intelligence often returns wrong answers that appear credible, and because the generated text is coherent, these answers are hard to verify. For knowledge workers with strict requirements on correct answers to trust the system, it must be rooted in the enterprise's trusted knowledge mechanism, let users know where the answers come from, and comply with the enterprise's data management policy;
To achieve this goal, as in FIG. 13, the enterprise needs to build a knowledge graph, which is the basis of the enterprise's trusted knowledge mechanism and realizes the semantic relations among "content, people, and activities". Content includes personal assets, files, messages, related business objects and the like; people includes identities and roles, teams, departments, groups and the like; activities include content creation, editing history, comments, searching, clicking and the like. After the data is mapped, the enterprise knowledge graph establishes a semantic network among all elements of these three pillars, so as to understand the context and relations between the staff in an organization and the work results they produce, which gives the results returned from interaction with the LLM higher credibility. The trusted knowledge mechanism performs pre-processing and post-processing uniformly: the prompt is refined and supplemented with context data and domain knowledge, and the generated text is audited and processed so that it can be connected accurately to application commands. The trusted knowledge mechanism does not make the large language model LLM fully learn personal or enterprise data and domain knowledge; instead, the model is placed in different expert environments to cooperate with them, and these expert environments are the enterprise-specific trusted knowledge mechanism.
In this embodiment, in step S9.16, a government field knowledge base is first built offline. Each piece of knowledge has two character strings: a question title and an answer text. The government field knowledge is vectorized with the vectorization model BERT and stored in the vector database Faiss. A user question is vectorized with the same vectorization model BERT and used to query the government vector database, giving the top n matching results; each search result contains a vector and a payload, and the payload contains a title and a text. Because the direct top n matches are highly redundant, a matching strategy based on MMR is provided to obtain the top n data, further improving both the accuracy and the diversity of the matching results;
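MMR (maximal marginal relevance) selection can be sketched as follows. The function names, the trade-off parameter `lambda_` and its default of 0.5 are assumptions; the description names MMR but gives no parameters. Plain tuples stand in for the BERT vectors retrieved from Faiss.

```python
import math


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mmr_select(query_vec, doc_vecs: list, top_n: int = 3, lambda_: float = 0.5) -> list:
    """Return indices of documents chosen by maximal marginal relevance.

    Each step picks the document maximizing
    lambda_ * sim(query, doc) - (1 - lambda_) * max sim(doc, already selected).
    """
    selected = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < top_n:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_ * relevance - (1 - lambda_) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Setting `lambda_` to 1.0 reduces the selection to plain relevance ranking, so the parameter can be tuned between accuracy and diversity.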
S10.1: a Prompt is constructed to guide the LLM model to better understand the task to be performed. The Prompt consists of the prefix_prompt, the user's question and the payload. For the prefix_prompt, the trusted knowledge mechanism obtains information related to the content and titles through the government map and obtains the role information and activity information of the logged-in user, so that the generated content better meets the needs of roles in the specific government field and the direction in which the LLM large language model generates government content is better specified; a semantic network among the elements of the three pillars is established to understand the context and relations among the staff and users in an organization and the work and work results they produce, so that the results returned from interaction with the LLM have higher credibility. For the government field, the following prompt fragment is added to the prefix_prompt of each piece of data: if the paragraph content is not in the government field, return that no related information was found. This tells the LLM model not to answer outside the government field. An index number is established for each question to distinguish the questions. The content comprises the user question and, from the payload, the title and the text of the corresponding answer. Because the length of the prompt is limited, only the first 300 characters of each matched abstract are taken; if longer abstracts are wanted, 300 can be changed to a larger value. For the same reason, only the first three search results are taken; if more search results are wanted, limit can be set to a larger value.
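The Prompt assembly in step S10.1 can be sketched as follows. The wording of `PREFIX_PROMPT`, the line layout and the function signature are assumptions; only the 300-character truncation, the limit of three results, the index numbers and the instruction to refuse non-government questions come from the description.

```python
# Hypothetical wording; the description gives the intent, not the literal text.
PREFIX_PROMPT = (
    "Answer using only the numbered government-service passages below. "
    "If the passages do not cover the question, reply that no related "
    "information was found."
)


def build_prompt(question: str, results: list, role: str = "",
                 limit: int = 3, max_chars: int = 300) -> str:
    """Assemble prefix_prompt, retrieved payloads and the user question.

    `results` is a list of payload dicts with 'title' and 'text' keys,
    ordered best first. Only the first `limit` results are used, and each
    text is cut to `max_chars` characters to respect the prompt length.
    """
    lines = [PREFIX_PROMPT]
    if role:
        lines.append(f"User role: {role}")
    for index, payload in enumerate(results[:limit], start=1):
        lines.append(f"[{index}] {payload['title']}: {payload['text'][:max_chars]}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)
```

Raising `limit` or `max_chars` gives the model more context at the cost of a longer prompt, as the step notes.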
S10.2: after the Prompt is built, the LLM large language model index building tool calls the LLM API to access the LLM large language model, and the LLM large language model returns its result to the trusted knowledge mechanism for post-processing. Because the LLM large language model does not explain its answers, citations are added to the returned result, and source information is supplemented based on the tracing system and a search engine. For factual errors and reasoning errors of the LLM large language model, the returned result is verified and reasoned over based on the government service map. For query scripts returned by the LLM large language model, the necessary external queries and calls are executed. Finally, the result is assembled and returned in the required format. The trusted knowledge mechanism returns the post-processed result to the LLM large language model index building tool, which returns the answer to the user.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, and various modifications and variations may be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (9)
1. A dialogue method for multi-strategy fusion in the government service field based on a knowledge graph, characterized in that the method is carried out according to the following steps:
S1, constructing a government service knowledge graph: constructing a government domain knowledge system and a government service process knowledge system, and standardizing the government domain knowledge and the government service processes;
S1.1, acquiring semi-structured and unstructured data: specifically, a high-quality site evaluation model is adopted that evaluates sites along two dimensions, quality grade and activity level, and produces a comprehensive score; the GNE intelligent content extraction algorithm and XPath parsing and extraction are applied to government content parsing; and a hierarchical storage scheme for government files is provided to guarantee the consistency and integrity of downloaded data;
S1.2, performing knowledge extraction on structured data: triple data are extracted from structured data sources in the government domain according to the constructed government domain ontology, and entity-attribute-value and entity-relation-entity triples are stored in the graph; for government service databases, ontology mapping supports intelligent matching between the attributes of ontology objects and dataset fields, and the value conversion logic from dataset to graph data is determined by configuring mapping rules; an ontology relation may be configured against either an object dataset or a relation dataset; to address the updating and storage of structured data in the data warehouse, a zipper table solution is provided, the zipper table design allowing historical states to be viewed while reducing storage space;
S1.3, performing knowledge extraction on semi-structured and unstructured data and intelligent form filling through OCR (optical character recognition) and extraction methods: a relation extraction model based on multi-feature fusion is used, with the entity types, entities and context information fed to the input of the model and feature extraction performed by a BERT pre-trained model; a disambiguation model resolves the problem that the same mention text corresponds to different entities during entity linking, and graph entity information is fused into the disambiguation model to improve its accuracy; entities are audited both by the model and manually to ensure that the entity-attribute-value and entity-relation-entity triples meet service requirements;
S1.4, constructing the government ontology and building the government domain knowledge system and the government service process knowledge system, which specifically comprises: defining the domain and scope of the government ontology, collecting government concepts and data resources, considering reuse of existing ontologies, analyzing and expressing the ontology, constructing the ontology, integrating and instantiating the ontology, and evaluating and verifying the ontology;
S2, performing accurate question answering for government services based on the constructed government service graph;
based on the constructed government service graph, accurate question answering for government services is performed; a fuzzy query is performed based on the FAQ knowledge base strategy, for which a knowledge base strategy based on multi-stage word-level and semantic recall and ranking is provided; for the scene migration problem of the FAQ knowledge base, an extractive reading-comprehension question-answering strategy over multi-document retrieval is provided; and for the problem of insufficient government-domain dialogue corpus in the FAQ knowledge base, an LLM (large language model) local knowledge base strategy based on a trusted knowledge mechanism is provided; the method comprises the following steps:
S9.1, constructing the multi-round question-answering strategy over the government service graph: a government service dialogue service is built, and the user identifier and message content are sent to the main logic service; a top-level intent recognition module is used, and because the corpus is well separated, a traditional GBDT-LR machine learning model is adopted, which has few parameters while maintaining accuracy; the top-level intent is divided into chit-chat intents and government intents; the chit-chat intents are divided into greeting, acceptance, rejection and clarification intents; the government intents are divided according to the requirements of government services; the subdivided government service intents and the entities are extracted jointly based on the DIET joint extraction model; and government service dialogue management is completed based on intent inheritance and slot inheritance; the specific steps are as follows:
S9.2: establishing slot semantic templates: an intent strategy is designed according to the entities, relations and attributes defined by the ontology, ensuring that each intent is bound to its slot list information; government matters are taken as the core entity, and intents are constructed around the associated entity-attribute-value and entity-relation-entity triples, wherein the relation between government matters and authority levels is exercise level, the relation between government matters and certificate materials is handling materials, the relations between government matters and administrative departments are supervision body and joint office, the relations between government matters and laws and regulations are setting basis and implementation basis, and the relation between government matters and service objects is service body; for each intent, a reply template, rejection and clarification templates, and a Cypher template are set, and when the Cypher template is filled dynamically, several combined records can be returned at once according to its configuration; global templates are also set, the remaining templates including a supervision template, a rejection template and the like; and a threshold is set for the intent, with confirmation performed when the intent confidence falls within a certain range of the threshold;
S9.3, performing joint intent and entity extraction based on the DIET joint extraction model: the existing pre-trained models, BERT-TextCNN for intent recognition and BERT-CRF for entity recognition, are slow in training and inference; entity recognition and intent extraction are strongly correlated in subdivided government scenarios; and errors accumulate across the two-stage tasks; joint intent and entity extraction for subdivided government service scenarios based on the DIET joint extraction model is therefore proposed; once the intent and the slots are obtained, they are bound together; against the semantic slot template, it is judged whether the recognized entity has the same type as a slot under the corresponding intent, and if so slot filling is completed, while if the types differ a response result is returned; entity linking ensures that the recognized entity can be linked to an entity in the graph database; if a recognized slot is empty, the slot value of the previous round is inherited to complete the slot; if the intent of the current round is unclear, with a confidence below a certain threshold, the intent of the previous round is inherited together with its related slots; and finally, questions not related to government services are filtered out using a dictionary and related rules;
S9.4, government service dialogue management: multi-round government service dialogue is completed based on intent inheritance and slot inheritance; from the user input and the user's unique identifier, it is judged whether the input contains an entity, and if not, the entity of the previous round is inherited; if the confidence of the current round's intent is below a certain threshold, the intent of the previous round is inherited together with its related slots; clarification and confirmation are performed when the intent confidence is below a certain threshold; to support multi-round QA, a multi-round state machine is designed and maintained, and the DST (dialogue state tracking) result of each round is saved; a multi-round window strategy is set, limiting the round window to 5 rounds and the time window to 5 minutes; an expiration time is set in Redis and the history is stored in Redis so that context can be looked up quickly during the dialogue; and for intents in the same dialogue state, only the most recent one is kept;
S9.5, processing the returned result, including query statement conversion and result checking: different Cypher templates are executed according to different intent strategies to query the graph database, the returned entities are filled into the answer template, and the answer is returned; if no data is found, the FAQ dialogue strategy is entered;
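The dialogue management rules of S9.4 (inherit the previous entity when none is given, inherit the previous intent when confidence is low, keep at most 5 rounds within a 5-minute window) can be sketched as below. This is an illustrative sketch only: the class name, the in-memory dictionary standing in for Redis, and the threshold value 0.5 are assumptions, and the clarification step is left out.

```python
import time


class DialogueStateTracker:
    """In-memory stand-in for the Redis-backed multi-round state (assumption)."""

    def __init__(self, max_rounds=5, ttl_seconds=300, intent_threshold=0.5):
        self.max_rounds = max_rounds              # round window from S9.4
        self.ttl_seconds = ttl_seconds            # 5-minute time window from S9.4
        self.intent_threshold = intent_threshold  # hypothetical value
        self._sessions = {}  # user_id -> list of (timestamp, intent, entity)

    def _history(self, user_id, now):
        # Drop turns older than the time window, then keep the last N rounds.
        turns = [t for t in self._sessions.get(user_id, [])
                 if now - t[0] <= self.ttl_seconds]
        return turns[-self.max_rounds:]

    def update(self, user_id, intent, confidence, entity, now=None):
        """Record one turn and return the (intent, entity) actually used."""
        now = time.time() if now is None else now
        history = self._history(user_id, now)
        previous = history[-1] if history else None
        if previous is not None:
            if entity is None:
                entity = previous[2]   # slot (entity) inheritance
            if confidence < self.intent_threshold:
                intent = previous[1]   # intent inheritance
        history.append((now, intent, entity))
        self._sessions[user_id] = history[-self.max_rounds:]
        return intent, entity
```

In a deployment along the lines of the text, the per-user history would live in Redis with a key expiry matching the time window, so stale context disappears without an explicit sweep.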
FAQ knowledge base question-answering strategy based on multi-stage word-level and semantic recall and ranking: accurate question answering is first performed based on the constructed government service graph; when the accurate query fails, a fuzzy query is performed based on the FAQ knowledge base strategy, for which a question-answering strategy based on multi-stage word-level and semantic recall and ranking is provided; the method comprises the following steps:
S9.6, a data preprocessing module, which completes data cleaning and filters invalid samples, constructs the training, validation and test sets, and adds a custom domain dictionary through new-word discovery to improve word segmentation accuracy;
S9.7, a recall module, in which recall is divided into word-level recall and semantic recall; for the failure of TF-IDF term frequency on short texts during word-level recall, a word co-occurrence distance solution based on Word2Vec is provided; and for the embedding collapse of the BERT pre-trained model during matching, the contrastive-learning-based SimCSE model is applied to further improve model accuracy;
S9.8: word-level recall: for the failure of TF-IDF term frequency on short texts, a word co-occurrence distance solution based on Word2Vec is provided; if a word's vector is semantically close to the vectors of the other words, its weight is large, whereas non-core words tend to be outliers; Word2Vec is trained on the full corpus, namely the question-answer corpus, optionally expanded with corpora from the same industry, to obtain a vector for each word; a word that is close to the other words receives a relatively high score, while a word that occurs frequently but lies far from the other words receives a very low score, so that words that are not core keywords can be filtered out; the word distance is computed from the cosine similarity, that is, the inner product of the word vectors, where a high cosine score means that the angle between the two word vectors is small; the cosine score is converted to a Euclidean distance d, which is substituted into 1 - d/2 to obtain a similarity, and the sum over the other words is divided by the number of words; in this way each document is segmented into words, word vectors are trained, and the score of each word with respect to the other word vectors in the document gives that word's weight within the document, replacing the failed TF; the information content of the word, its IDF, is then used to suppress high-frequency but unimportant words; finally, an inverted index is built using the product of the word co-occurrence distance and the IDF, and on the order of a hundred candidate answer pairs are found from among millions; this stage must be fast and have a high recall rate, the recall model recalling 50 corpus pairs, with a relatively complex model reserved for the subsequent ranking task;
S9.9, semantic recall based on question-question semantic similarity: for the embedding collapse of the BERT pre-trained model during matching, the contrastive-learning-based SimCSE model is applied to further improve accuracy; for positive example construction, the random dropout of BERT is exploited, the same text being passed through the BERT encoder twice to obtain two different sentence vectors that form a similar-text pair; for negative example construction, other samples in the same batch are randomly sampled as negatives; for online serving in the question-question scenario, both sides of the two-tower model lie in the same semantic space, the tower model is saved offline, and the trained vectors are computed offline and stored in Faiss; the online service obtains the question vector through the network and queries Faiss for vector retrieval, returning the most similar indexed vectors, which avoids large-scale real-time online computation; by this approximate search, 50 answer pairs are recalled from among millions of answer pairs;
S9.10, ranking by question-question similarity: given the characteristics of government corpora, the ranking considers both the question-question similarity score and whether the sentences contain government entity names, combining them by weighted voting to further improve the accuracy of the ranking; the word-level recall and semantic recall results are merged and de-duplicated; government-related entities in the query are identified; the intersection-over-union between the entities in the query and those in each recalled question is calculated; the sentence-level semantic matching similarity between the query and each recalled question is calculated with the contrastive-learning-based SimCSE model; the two are combined in a weighted ranking, with a weight of 0.8 for semantic matching and 0.2 for entity intersection-over-union; and finally the top 50 are returned;
S9.11, calculating question-answer similarity: because some question-answer pairs are of low quality or contain dirty data, the similarity between the question and the answer is calculated to avoid irrelevant answers and further improve the final matching effect; the best answer is selected from the 50 pairs, or, to add diversity, one of the top-k question-answer pairs is returned at random, and the answer is returned to the user;
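The weighted ranking of S9.10 can be sketched as follows. The weights 0.8 and 0.2 and the top-50 cut-off come from the text; the function names, the tuple layout of the candidates, and the choice to keep the best score when both recall channels return the same question are assumptions made for illustration.

```python
def entity_iou(query_entities, candidate_entities):
    """Intersection-over-union of two entity-name sets (0.0 when both are empty)."""
    a, b = set(query_entities), set(candidate_entities)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(query_entities, candidates,
                    w_semantic=0.8, w_entity=0.2, top_k=50):
    """Rank recalled questions.

    candidates: iterable of (question, semantic_similarity, entities), where
    semantic_similarity would come from a sentence encoder such as SimCSE.
    """
    best = {}
    for question, semantic, entities in candidates:
        score = (w_semantic * semantic
                 + w_entity * entity_iou(query_entities, entities))
        # De-duplicate questions returned by both recall channels,
        # keeping the higher score (an assumed tie-breaking rule).
        if question not in best or score > best[question]:
            best[question] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]
```

With these weights, a candidate that shares the query's government entity can overtake one that is slightly closer semantically but names a different matter.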
S9.12, for FAQ multi-scenario migration, an extractive reading-comprehension question-answering strategy over multi-document retrieval is provided, comprising multi-document coarse recall based on the BM25 algorithm, multi-task fine recall based on the BERT-MRC model, and multi-document ranking and answer return;
S9.13, multi-document coarse recall based on the BM25 algorithm: the 50 most relevant documents are recalled for each question, avoiding large-scale online computation; to address the imbalance between positive and negative samples in the reading comprehension task, a dynamic negative sampling method is provided; to avoid paragraphs being truncated during training, the maximum paragraph length is set to 400, so that the question and the paragraph together do not exceed BERT's maximum input length of 512; splitting by sentence lets the model make full use of complete sentences in training and avoids the effect of sentence truncation; when the training data are generated, the dynamic negative sampling method is used to address the imbalance between positive and negative samples, each question corresponding to a positive sample set and a negative sample set; positive samples are defined as the fragments that contain the answer to the question; as candidate negative samples, the 5 highest-scoring fragments that do not contain the answer are taken from all passages after segmentation; for each positive sample, one negative sample is drawn at random from the candidate negative samples, so that the negative samples are better utilized during training;
S9.14, multi-task fine recall based on the BERT-MRC model: the extraction model is trained with a multi-task method, taking the question and the paragraph content as model input; a detection task judges whether an answer exists, and an extraction task extracts the answer to the question, the multi-task setup improving the overall effect of the model; the first task predicts the label, namely whether the paragraph contains the answer, which reflects the relevance of the paragraph to the question; the second task predicts start and end, the start and end positions of the answer, which is the target of answer extraction itself; because the two tasks converge at different speeds, the multi-task extraction module adjusts their loss weights during training so that both converge at the same time, the loss function being 0.01 × MRC_loss + 0.99 × CLS_loss, so that the model converges well on answer extraction;
S9.15, ranking the multiple documents and returning the answer: first, it is judged whether the binary classification probability values of the label, the start and the end are below a certain threshold, and if so, the LLM local government knowledge base strategy based on the trusted knowledge mechanism is entered; otherwise, a document score is calculated for the documents containing answers according to the ranking rule, and the paragraph with the highest ranking score is returned as the reading-comprehension answer;
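The dynamic negative sampling of S9.13 and the weighted loss of S9.14 can be sketched as follows. The numbers (5 candidate negatives, one random negative per positive, weights 0.01 and 0.99) come from the text; the function names, the use of a plain substring test to decide whether a fragment contains the answer, and the (text, score) layout are assumptions.

```python
import random


def candidate_negatives(fragments, answer, top_n=5):
    """Keep the top_n highest-scoring fragments that do not contain the answer.

    fragments: list of (text, retrieval_score) pairs, e.g. scored by BM25.
    """
    without_answer = [f for f in fragments if answer not in f[0]]
    without_answer.sort(key=lambda f: f[1], reverse=True)
    return [text for text, _ in without_answer[:top_n]]


def sample_training_pairs(question, fragments, answer, rng):
    """Return (question, fragment, label) triples, one random negative per positive."""
    positives = [text for text, _ in fragments if answer in text]
    negatives = candidate_negatives(fragments, answer)
    pairs = []
    for positive in positives:
        pairs.append((question, positive, 1))
        if negatives:
            # Re-drawn on every call, so each epoch can see a different negative.
            pairs.append((question, rng.choice(negatives), 0))
    return pairs


def multitask_loss(mrc_loss, cls_loss, w_mrc=0.01, w_cls=0.99):
    """Weighted sum of the span-extraction loss and the has-answer classification loss."""
    return w_mrc * mrc_loss + w_cls * cls_loss
```

Calling sample_training_pairs once per epoch keeps the positive-to-negative ratio at one to one while still exposing the model to all five hard negatives over time.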
S9.16, the LLM local knowledge base strategy based on the trusted knowledge mechanism: for the problem of insufficient government-domain dialogue corpus in the FAQ knowledge base, a local government knowledge base strategy for the LLM based on a trusted knowledge mechanism is provided; the trusted knowledge mechanism performs pre-processing and post-processing uniformly; contextual data and domain knowledge are refined and supplemented into the prompt; the generated text is audited and processed before instructions can be correctly docked; and the trusted knowledge mechanism does not let the LLM fully learn personal or enterprise data and domain knowledge, thereby forming an enterprise-specific trusted knowledge mechanism.
2. The dialogue method for multi-strategy fusion in the government service field based on the knowledge graph according to claim 1, wherein in step S1.4, the method is specifically implemented as follows:
S2.1, defining the domain and scope of the government ontology: specifying the business function domain to which the government domain corresponds, the purpose of the ontology, the information content it describes, and the government parties that use and maintain the ontology;
S2.2, collecting government concepts and data resources: data are collected from structured sources such as government service databases and from semi-structured and unstructured sources such as documents of various government fields and government service websites, including for example government manuals and the web pages of government service websites, and data that do not meet the standard are processed for consistency;
S2.3, considering reuse of existing ontologies: existing government ontologies are analyzed and improved to increase reusability;
S2.4, analyzing and expressing the ontology: where existing ontologies cannot be reused, text such as government manuals, official government documents and the web pages of government service websites is analyzed to extract the core concepts, the concept attributes and the relations among the concepts;
S2.5, constructing the ontology: classes and their inheritance relations are first defined top-down, that is, starting from the most basic concepts of the government domain and refining the classes layer by layer;
S2.6, integrating and instantiating the ontology: the government ontologies are integrated, redefined and semantically processed using a consistency protocol method so that data sharing and fusion are not affected, and once the ontology is confirmed, data are extracted for instantiation;
S2.7, a preliminary government domain knowledge ontology having been established through the above steps, the ontology is evaluated and verified in terms of its correctness, consistency, extensibility, effectiveness, scale and descriptive capacity, through multi-party investigation and the participation of invited domain experts.
3. The dialog method for multi-strategy fusion in the government service domain based on the knowledge graph according to claim 2, wherein in step S2.7, the method is specifically implemented as follows:
S3.1, evaluating and verifying the basic ontology of the government domain, which represents general knowledge concepts and contains no business-domain characteristics, and which models the features of structured and unstructured data such as texts and databases, the text ontology mainly describing attributes such as file format, file size and keywords;
S3.2, evaluating and verifying the government domain knowledge ontology and the government service process knowledge system: based on analysis of government corpora such as official documents and government news, five categories are planned, namely personnel role, policy document, news, comprehensive government role and administrative reply, together with the corresponding entity-attribute-value and entity-relation-entity triples under each category; taking the content of official documents as the core, relations are defined with a view to improving the efficiency of subsequent search and question answering by government staff, namely the relations among the entities contained in the five categories of documents (personnel role, policy notification, news, comprehensive government role and administrative reply); the specific relations include the personnel role authority relation, the institution-to-personnel role name relation and the institution-to-former-employee role name relation;
S3.2, for government service process analysis, a government service process knowledge system is constructed; the main concepts of the service process comprise government matters, certificate materials, laws and regulations, administrative departments, service objects, administrative regions, matter subjects and authority levels, of which government matters are the core concept; each type of entity has its own attribute features, and semantic relations exist among the entities; the attribute features and the semantic relations among the entities essentially cover all information about government matters and meet the public's needs for government consultation and question answering;
S3.3, the core attribute features of the government matter entity are the source of authority, authority type, exercise level, handling type, service objects and fields, statutory time limit, promised time limit, setting basis, organization nature, application conditions and application materials; semantic relations are defined with a view to improving government service efficiency: the relation between government matters and authority levels is exercise level, the relation between government matters and certificate materials is handling materials, the relations between government matters and administrative departments are supervision body and joint office, the relations between government matters and laws and regulations are setting basis and implementation basis, and the relation between government matters and service objects is service body.
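The semantic relations listed above lend themselves to a small entity-relation-entity store. The sketch below covers a subset of those relations only; the identifiers, the TripleStore class and the example matter name are hypothetical, and a graph database queried with Cypher templates, as the method describes elsewhere, would replace this in practice.

```python
from collections import defaultdict

# Relation name -> target entity type, for a subset of the relations in S3.3.
# The snake_case identifiers are illustrative, not taken from the source.
MATTER_RELATIONS = {
    "handling_materials": "certificate material",
    "supervision_body": "administrative department",
    "joint_office": "administrative department",
    "setting_basis": "law or regulation",
    "implementation_basis": "law or regulation",
    "service_body": "service object",
}


class TripleStore:
    """Minimal (head, relation, tail) store keyed on the government matter."""

    def __init__(self):
        self._index = defaultdict(list)  # (head, relation) -> [tail, ...]

    def add(self, head, relation, tail):
        if relation not in MATTER_RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self._index[(head, relation)].append(tail)

    def tails(self, head, relation):
        # .get avoids creating empty entries for unknown keys.
        return list(self._index.get((head, relation), []))
```

Because two distinct relations connect a matter to an administrative department (and two to laws and regulations), the relation name, not the entity type, has to be the lookup key.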
4. The dialogue method for multi-strategy fusion in the government service field based on the knowledge graph according to claim 1, wherein in step S1.1, for semi-structured and unstructured data acquisition, a high-quality site evaluation model is provided that evaluates sites along two dimensions, quality grade and activity level, and produces a comprehensive score; the GNE intelligent content extraction algorithm and XPath parsing and extraction are applied to government content parsing; government files are downloaded and stored, with a hierarchical storage scheme provided to guarantee the consistency and integrity of the data; the method specifically comprises the following steps:
S4.1, first selecting initial high-quality government website sites, performing intelligent sniffing along network links, and, through quality evaluation and content link relations, discovering new high-quality websites in a fission-like manner and automatically crawling and warehousing them, without secondary collection, landing and warehousing of content;
S4.2, guaranteeing the consistency and integrity of downloaded data through a hierarchical storage scheme: file information is collected by crawlers and stored in MongoDB; the file download services are managed through ZooKeeper, and download tasks are distributed evenly to the services according to a partition allocation algorithm; each service manages its threads through a thread pool and periodically checks whether the blocking queue of the current thread pool is less than half of its set value, and if not, triggers the file download method; the data of the corresponding partition are first read from MongoDB and it is judged whether the address is downloadable; Redis is checked for whether the file address already exists, and if not, the address is requested, the file stream is obtained, its MD5 value is calculated, the OSS address is assembled, the file is stored to OSS, and a record is written to HBase and MongoDB; if so, a PDF is generated and stored so that subsequent file information stays consistent; if the download fails, the failure is recorded and sent to MongoDB; and if it succeeds, the data are passed to the service that processes the file.
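The de-duplication and storage part of S4.2 (check whether the address was already handled, compute the MD5 value, assemble the storage address, record metadata) can be sketched as below. In-memory structures stand in for Redis, OSS, HBase and MongoDB; the class name, the bucket prefix and the key layout are assumptions.

```python
import hashlib


class DownloadRegistry:
    """In-memory stand-ins for Redis (seen URLs) and OSS (object storage)."""

    def __init__(self, bucket_prefix="oss://gov-files"):  # hypothetical prefix
        self.bucket_prefix = bucket_prefix
        self.seen_urls = set()   # plays the role of the Redis address check
        self.objects = {}        # storage key -> file bytes
        self.records = []        # metadata rows, as would go to HBase/MongoDB

    def store(self, url, content, extension="pdf"):
        """Store one downloaded file; return its key, or None if the URL was seen."""
        if url in self.seen_urls:
            return None
        digest = hashlib.md5(content).hexdigest()
        # Content-addressed key: identical files map to the same object.
        key = f"{self.bucket_prefix}/{digest[:2]}/{digest}.{extension}"
        self.objects[key] = content
        self.records.append({"url": url, "md5": digest, "oss_key": key})
        self.seen_urls.add(url)
        return key
```

Keying the object on the MD5 digest means the same file reached through two addresses is stored once, while each address still gets its own metadata record.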
5. The method is characterized in that, in step S4.1, the high-quality site evaluation model specifically comprises a quality grade scoring algorithm and an activity scoring algorithm; the quality grade scoring algorithm determines the degree of relevance of a website to the government field and assists in filtering out websites with poor relevance; the activity scoring algorithm performs a normalized weighted summation over three dimensions (the number of website pushes, the number of pushes and the push time) to obtain a quantifiable activity level, which assists in configuring the optimal crawl cycle strategy; the websites discovered from the initial high-quality government websites are passed through the quality evaluation model to obtain the quality grade and activity level of each new government website; the difficulty of managing a large number of crawlers is addressed through functions such as site configuration, task scheduling and monitoring; and, combined with the mass data of the government graph, value is mined through government knowledge product capabilities;
the intelligent discovery method for the high-quality site aiming at the semi-structured and unstructured data comprises the following specific steps:
S5.1: after the HTML text of a web page is obtained, a DOM tree is generated by Jso parsing; preprocessing operations such as removing JS scripts and CSS styles are applied to the DOM tree; and the text density and the symbol density are extracted according to a text density algorithm and a symbol density algorithm respectively;
S5.2: a score is calculated for the web page text from these two values, and the text with the higher score is taken as the main text of the web page.
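Steps S5.1 and S5.2 can be illustrated with a simplified scoring function. The source does not give the density formulas, so the definitions below (non-link characters per non-link tag, punctuation marks per non-link character, and their combination) are assumptions chosen for readability and are not the GNE formulas; the Block fields are assumed to have been counted from the DOM tree beforehand.

```python
from dataclasses import dataclass

PUNCTUATION = set("，。；、！？：,.;!?:")


@dataclass
class Block:
    text: str            # all text under a DOM node
    link_text: str       # text inside <a> descendants of that node
    tag_count: int       # number of descendant tags
    link_tag_count: int  # number of <a> descendants


def text_density(block):
    """Non-link characters per non-link tag (assumed definition)."""
    chars = len(block.text) - len(block.link_text)
    tags = block.tag_count - block.link_tag_count
    return chars / max(tags, 1)


def symbol_density(block):
    """Punctuation marks per non-link character (assumed definition)."""
    chars = len(block.text) - len(block.link_text)
    symbols = sum(1 for ch in block.text if ch in PUNCTUATION)
    return symbols / max(chars, 1)


def score(block):
    # Assumed combination: body text is both dense and punctuated.
    return text_density(block) * (1.0 + symbol_density(block))


def main_text(blocks):
    """Return the block with the highest score, taken as the page's main text."""
    return max(blocks, key=score)
```

Navigation bars score near zero because almost all of their text sits inside links, while article bodies combine long runs of plain text with sentence punctuation.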
6. The knowledge-graph-based dialog method for multi-policy fusion in the field of government services according to claim 1, wherein in step S1.2: the knowledge extraction is carried out firstly on the structured data knowledge extraction specifically according to the following steps:
s6.1, extracting triple data from a structured data source in the administrative domain according to a constructed administrative domain ontology and an administrative service flow knowledge system, storing entity-attribute value and entity-relation-entity into a map, supporting intelligent matching of the attribute of an ontology object and a data set field through ontology mapping aiming at a government service database, and determining value conversion logic from the data set to map data in a form of configuration mapping rules; supporting the body relation to select an object data set or a relation data set to carry out relevant configuration, aiming at the problem of updating and storing structured data in a log bin, providing a zipper table solution, adopting a zipper table design to check a history state, and simultaneously reducing a storage space;
Aiming at a traditional government service database, through ontology mapping, supporting intelligent matching of attributes of an ontology object and data set fields, and determining value conversion logic from the data set to map data in a form of configuration mapping rules; the method has the advantages that the body relation is supported to select an object data set or a relation data set to carry out relevant configuration, a data file is structured, automatic access of structured data can be realized through an excel uploading plug-in, a csv uploading plug-in and other tools, a user-defined logic view is used as a function of the data set, a plurality of pieces of table information can be integrated into one table according to service requirements to carry out extraction customization, relevant operation provides an intuitive and convenient operation interface, and the information of the plurality of tables is extracted and customized to facilitate query operation;
S6.2: for the problem of updating and storing structured data in the data warehouse, providing a zipper table solution; the zipper table design allows historical states to be viewed and reduces the storage space occupied, whereas updating data by directly overwriting it makes the historical state impossible to query;
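The zipper table idea in step S6.2 can be illustrated with a minimal Python sketch. It is not the patented implementation: the row layout, the `OPEN_END` sentinel, and the function names `apply_snapshot` and `state_as_of` are assumptions made for illustration. Each version of a record keeps a validity interval, an unchanged record adds no new row, and a changed record closes the old row instead of overwriting it.

```python
from dataclasses import dataclass
from typing import Dict, List

OPEN_END = "9999-12-31"  # assumed sentinel marking the currently valid version


@dataclass
class ZipperRow:
    key: str         # business key of the record
    value: str       # tracked attribute value
    start_date: str  # ISO date on which this version became valid (inclusive)
    end_date: str    # ISO date on which this version stopped being valid (exclusive)


def apply_snapshot(table: List[ZipperRow], snapshot: Dict[str, str], day: str) -> None:
    """Merge one day's snapshot into the zipper table in place."""
    current = {row.key: row for row in table if row.end_date == OPEN_END}
    for key, value in snapshot.items():
        row = current.get(key)
        if row is not None and row.value == value:
            continue  # unchanged: no new row is stored, which saves space
        if row is not None:
            row.end_date = day  # close the old version instead of overwriting it
        table.append(ZipperRow(key, value, day, OPEN_END))


def state_as_of(table: List[ZipperRow], day: str) -> Dict[str, str]:
    """Return key -> value as it was valid on the given ISO date."""
    return {r.key: r.value for r in table if r.start_date <= day < r.end_date}
```

Because ISO dates compare correctly as strings, `state_as_of` can answer historical queries without keeping a full copy of the data for every day.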
S6.3: parsing semi-structured and unstructured data and performing intelligent form filling, the intelligent form filling being realized by OCR (optical character recognition) technology and extraction technology so as to automate daily document processing and improve the document processing efficiency of business personnel; files in doc, docx, pdf and chm format uploaded or downloaded by a user are first parsed to obtain data such as pictures, tables, text and titles: doc files are first converted into docx files, docx files are parsed directly with python-docx, a third-party Python package, and chm files are first converted into pdf files and then parsed as pdf; scanned pdf files undergo OCR recognition, the whole OCR process being divided into layout analysis, text detection and text recognition, wherein layout analysis analyzes the structure of the document and automatically extracts its structured information, text detection marks the text regions in the image with four-point rectangular boxes, and text recognition converts the text regions selected in the previous step into characters; to meet document recognition requirements, a one-stage object detection algorithm of the YOLO series is adopted, together with the regression-based CTPN and EAST and segmentation-based PSENet and DBNet text detection algorithms and the CRNN+CTC, Attention and SVTR text recognition algorithms; element extraction combines extraction technology with rules to extract elements automatically and thereby realize intelligent form filling; the element extraction task is divided into entity recognition and relation extraction; entity linking, comprising entity alignment and entity disambiguation, is completed for the extracted entities; model auditing and manual auditing are carried out on entities before they are stored; and knowledge is traced so that the source of the knowledge searched by, or the results returned to, government staff and users can be located, improving users' confidence in the knowledge and its interpretability;
S6.4: performing entity recognition on the parsed semi-structured and unstructured data, first applying joint extraction to the unstructured data; to address the problems that the BERT-CRF entity recognition subtask produces wrong entity boundaries in the government domain, recognizes entities incompletely, and does not use the information in the existing government knowledge base, a solution is provided that improves entity recognition accuracy using the description texts of the knowledge base: first, an entity name dictionary is constructed from the entity names and entity alias information of the knowledge base, and vector embeddings of the entity names are obtained by mining the entity description texts in the knowledge base; then candidate entities in the text are obtained by name dictionary matching; finally, the results are screened with an entity recognition model to complete the entity recognition task;
S6.5: step S6.4 specifically comprises a data preparation flow in which an alias dictionary is constructed from the entity names and entity alias information of the knowledge base, entity description texts are constructed, and mapping dictionaries are constructed; the specific flow is as follows:
constructing an entity alias dictionary: an alias dictionary is constructed from the entity names and entity alias information of the government knowledge base; all character strings in the training set that cannot be matched to an entity are collected together with their occurrence counts, the counts being set to more than 4 in total, and a character string is added to the aliases of the entity if its corresponding occurrence count is more than 3;
constructing entity description texts and mapping dictionaries: entity description texts are obtained by concatenating the entity-attribute-value and entity-relation-entity triple data in the constructed government knowledge graph, and mapping dictionaries are constructed for use by the later models, the common dictionaries mapping entity name to entity id list, entity id to entity name, entity id to entity description text, entity id to entity type, and entity type to entity ids;
constructing the entity recognition dataset: first, an entity name dictionary is constructed from the entity names and entity alias information of the knowledge base; the entity description texts of the knowledge base are fed to a BERT pre-trained model, and the vector output at the CLS position is taken as the vector embedding of the entity name; candidate entities in the short text are obtained by dictionary matching; finally, the matched results are screened with the constructed named entity recognition model; the construction process is as follows:
entities are added to a dictionary tree (trie) for forward maximum matching: the entity names are inserted into the dictionary tree, and entities in the text are matched following the forward maximum matching approach.
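The dictionary-tree matching described at the end of claim 6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's code: the class name `Trie`, the function `forward_max_match`, and the sample entity names are hypothetical. At each position the scan keeps the longest dictionary entry that starts there, then jumps past it.

```python
from typing import Dict, List, Tuple


class Trie:
    """Character-level dictionary tree holding entity names and aliases."""

    def __init__(self) -> None:
        self.children: Dict[str, "Trie"] = {}
        self.is_end = False  # True if a dictionary entry ends at this node

    def insert(self, word: str) -> None:
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_end = True

    def longest_match(self, text: str, start: int) -> int:
        """Return the end index of the longest entry starting at `start`, or -1."""
        node = self
        best = -1
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.is_end:
                best = i + 1  # remember the longest complete entry seen so far
        return best


def forward_max_match(text: str, trie: Trie) -> List[Tuple[int, int, str]]:
    """Scan left to right, always taking the longest dictionary match."""
    matches = []
    i = 0
    while i < len(text):
        end = trie.longest_match(text, i)
        if end == -1:
            i += 1  # no entry starts here; move one character forward
        else:
            matches.append((i, end, text[i:end]))
            i = end  # continue after the matched entity
    return matches
```

With the entries "residence permit", "residence permit renewal" and "permit" in the trie, the text "apply for residence permit renewal today" yields the single candidate "residence permit renewal", since the longest entry wins over its prefixes.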
7. The knowledge-graph-based dialogue method for multi-strategy fusion in the field of government services according to claim 6, wherein: in order to perform binary classification on the matched entities, each entity name must be represented by a vector; because the subsequent model uses BERT, the embedding of the entity name is also obtained with BERT: the entity description text of the knowledge base is fed to a BERT pre-trained model, and the vector output at the CLS position is taken as the vector embedding of the entity name; training data is constructed by obtaining the candidate entities in the text through dictionary matching with a maximum matching algorithm and assigning the corresponding labels;
the matched results are screened by the constructed entity recognition model: the government text is passed through a BERT layer and a bidirectional LSTM, the embedding corresponding to the entity name is concatenated, and the prediction is made through convolution and fully connected layers.
8. The knowledge-graph-based dialogue method for multi-strategy fusion in the field of government services according to claim 6, wherein in step S6.3 relation extraction uses a relation extraction model based on multi-feature fusion: the entity type, the entity and the context information are fed to the input of the model, features are extracted by a BERT pre-trained model, and different pooling strategies are set dynamically for different entity lengths;
entity linking is the task of linking a mention in the text to an entity in the knowledge base; its difficulties are entity alignment, where the same entity has different mention texts, and entity disambiguation, where the same mention text corresponds to different entities;
for the case where the same entity has different mention texts, the solution is to train a SimCSE text matching model on the established entity names and aliases and to store the trained entity word vectors online in a FAISS vector library; for an entity extracted in the first pass, a vector is obtained through the encoder of the SimCSE model and used for vector retrieval in the FAISS library, and the entity word with the highest score is obtained;
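The encode-then-retrieve flow for alias alignment can be sketched in plain Python. Everything here is a stand-in chosen so the example runs without external models: `encode` uses character-bigram counts in place of the SimCSE encoder, and the hypothetical `AliasIndex` class does a brute-force cosine search in place of the FAISS vector library. Only the overall flow (index the aliases, encode the mention, return the best-scoring entity) follows the claim.

```python
import math
from collections import Counter
from typing import List, Tuple


def encode(text: str) -> Counter:
    """Toy encoder: character-bigram counts (stand-in for the SimCSE encoder)."""
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class AliasIndex:
    """In-memory stand-in for the vector library: rows of (entity id, vector)."""

    def __init__(self) -> None:
        self.rows: List[Tuple[str, Counter]] = []

    def add(self, alias: str, entity_id: str) -> None:
        self.rows.append((entity_id, encode(alias)))

    def search(self, mention: str, top_k: int = 1) -> List[Tuple[str, float]]:
        """Return the top_k (entity id, similarity) pairs for a mention."""
        query = encode(mention)
        scored = [(eid, cosine(query, vec)) for eid, vec in self.rows]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]
```

For example, after indexing "resident identity card" under a hypothetical id `E1` and "social insurance" under `E2`, searching for the mention "identity card" ranks `E1` first.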
for the case where the same mention text corresponds to different entities, entity disambiguation is performed: after the first round of dictionary correction, a mention may correspond to a plurality of entities in the government knowledge graph, so a plurality of candidate entity IDs are found according to the entity name and then disambiguated; entity disambiguation is implemented as binary classification: during training, the linked entity among the candidate entities is taken as the positive example and two negative examples are selected; the input text and the description text of the entity to be disambiguated are concatenated and fed into a BERT model; the output vector at the CLS position and the feature vectors at the start and end positions of the candidate entity are taken, the three vectors are concatenated and passed through a fully connected layer, and a final sigmoid activation gives the probability score of the candidate entity; the probability scores of all candidate entities are sorted, and the candidate with the highest probability is taken as the correct entity;
entity model auditing and manual auditing are performed: entities are audited by a model to ensure that they meet the service requirements, the problem being solved with an RNN model; positive samples are standard named entities confirmed by manual auditing, namely the names of government entities from the database, and carry the positive label 1, indicating that the text is a named entity; negative samples carry the negative label 0, indicating that the text is not a named entity, the non-named-entity character strings being obtained by reversing character strings; the ratio of positive to negative samples is 1:1;
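The sample construction for the audit model can be sketched as below, assuming that each negative sample is the character-reversed form of an audited entity name (one reading of the claim). The function name `build_audit_samples` is hypothetical, and skipping palindromic names is an added assumption so that every negative differs from its positive while the 1:1 ratio is preserved.

```python
from typing import List, Tuple


def build_audit_samples(entity_names: List[str]) -> List[Tuple[str, int]]:
    """Build (text, label) pairs: label 1 for named entities, 0 for reversed strings."""
    samples: List[Tuple[str, int]] = []
    for name in entity_names:
        reversed_name = name[::-1]
        if reversed_name == name:
            continue  # assumption: skip palindromes, which give no distinct negative
        samples.append((name, 1))           # positive: manually audited entity name
        samples.append((reversed_name, 0))  # negative: same characters, reversed
    return samples
```

Each kept entity name contributes exactly one positive and one negative sample, so the resulting training set is balanced.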
a human-machine collaboration mechanism is adopted: the identified entity-attribute-entity and entity-relation-entity triples are produced by the algorithm, and government staff can intervene in the knowledge extraction process through manual auditing, thereby realizing a human-machine collaboration mechanism integrated into knowledge extraction; when data content conflicts arise during knowledge extraction, manually modified data takes priority.
9. The knowledge-graph-based dialogue method for multi-strategy fusion in the field of government services according to claim 1, wherein in step S9.16 a government-domain knowledge base is first built offline, each knowledge item having two character strings, one being the question title and the other the answer text; the government-domain knowledge is vectorized by the vectorization model BERT and stored in the vector database Faiss; a user question is vectorized by the vectorization model BERT and used to query the government vector database to obtain the top-n matching results, each search result comprising a vector and a payload, the payload comprising the title and the text; because the directly matched top-n results are highly redundant, an MMR-based matching strategy is provided for obtaining the top-n data, so as to further improve the accuracy and diversity of the matching results;
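The MMR (maximal marginal relevance) re-ranking mentioned in step S9.16 can be sketched as follows. The claim does not give a formula, so this uses the standard MMR criterion, score = λ · sim(query, d) − (1 − λ) · max sim(d, selected), with cosine similarity; the function names, the plain-list vectors, and the default `lam=0.7` are assumptions for illustration.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr(query: Sequence[float], candidates: List[Sequence[float]],
        k: int, lam: float = 0.7) -> List[int]:
    """Return indices of up to k candidates, balancing relevance and diversity."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))

    def score(i: int) -> float:
        relevance = cosine(query, candidates[i])
        # Redundancy is the highest similarity to anything already selected.
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected),
            default=0.0,
        )
        return lam * relevance - (1.0 - lam) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Setting `lam=1.0` reduces the procedure to plain similarity ranking, while a lower `lam` pushes near-duplicate passages down the list, which is the redundancy problem the claim describes for direct top-n matching.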
S10.1: constructing a Prompt that guides the LLM model to better understand the task to be performed, the Prompt specifically comprising a prefix_prompt, the user's question and a payload; a trusted knowledge mechanism obtains information related to the content and titles through the government knowledge graph and obtains the role information and activity information of the logged-in user, so as to generate content that better matches the needs of roles in the specific government domain and to better direct the LLM large language model toward generating government content; a semantic network is established among all elements in the three posts in order to understand the context and the relations between staff and users and the work results they produce, so that the results returned from interaction with the LLM have higher credibility; for the government domain, a Prompt fragment is added to the prefix_prompt which informs the LLM model that it need not answer when the paragraph content does not cover the government-domain question; the question is the user's question itself, and the payload comprises the text retrieved for the user's question; a length threshold of 300 is set for the corresponding text, because the retrieved relevant text can exceed what the answer requires;
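A minimal sketch of how the three parts named in step S10.1 (prefix_prompt, question, payload) might be assembled is shown below. Several things are assumptions rather than details from the claim: the English wording of `PREFIX_PROMPT`, the reading of the "300" limit as a per-paragraph truncation length in characters, the `build_prompt` function, and the way the user's role is included.

```python
from typing import Dict, List

MAX_PARAGRAPH_CHARS = 300  # assumed reading of the "300" limit in the claim

# Hypothetical wording; the claim only says that a fragment telling the model
# not to answer when the paragraphs do not cover the question is added.
PREFIX_PROMPT = (
    "Answer using only the government-service paragraphs below. "
    "If the paragraphs do not cover the question, say that you cannot answer."
)


def build_prompt(question: str, payloads: List[Dict[str, str]], role: str) -> str:
    """Assemble prefix_prompt, role context, retrieved payloads and the question."""
    parts = [PREFIX_PROMPT, f"User role: {role}"]
    for i, item in enumerate(payloads, start=1):
        text = item["text"][:MAX_PARAGRAPH_CHARS]  # keep each paragraph short
        parts.append(f"[{i}] {item['title']}\n{text}")
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)
```

Numbering the retrieved paragraphs makes it possible to ask the model to cite the passage it relied on, which fits the source tracing goal stated elsewhere in the claims.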
S10.2: after the Prompt is built, the LLM large language model index building tool calls the LLM API to access the LLM large language model, and the LLM large language model returns its results to the trusted knowledge mechanism for post-processing: for the LLM large language model's lack of interpretability, citations are added to the returned results based on the tracing system and on source information supplemented by the search engine; for the LLM large language model's factual errors and reasoning errors, the returned results are checked and reasoned over based on the government service knowledge graph; for query scripts returned by the LLM large language model, the necessary external queries and calls are executed; finally, the results are assembled in the required format, and the trusted knowledge mechanism returns the post-processed results to the LLM large language model index building tool, which returns them to the user as the answer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310909706.9A CN116628172B (en) | 2023-07-24 | 2023-07-24 | Dialogue method for multi-strategy fusion in government service field based on knowledge graph |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310909706.9A CN116628172B (en) | 2023-07-24 | 2023-07-24 | Dialogue method for multi-strategy fusion in government service field based on knowledge graph |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116628172A (en) | 2023-08-22 |
CN116628172B (en) | 2023-09-19 |
Family
ID=87621644
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310909706.9A Active CN116628172B (en) | 2023-07-24 | 2023-07-24 | Dialogue method for multi-strategy fusion in government service field based on knowledge graph |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116628172B (en) |
Families Citing this family (73)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116821712B (en) * | 2023-08-25 | 2023-12-19 | 中电科大数据研究院有限公司 | Semantic matching method and device for unstructured text and knowledge graph |
CN116821103B (en) * | 2023-08-29 | 2023-12-19 | 腾讯科技(深圳)有限公司 | Data processing method, device, equipment and computer readable storage medium |
CN116842126B (en) * | 2023-08-29 | 2023-12-19 | 青岛网信信息科技有限公司 | Method, medium and system for realizing accurate output of knowledge base by using LLM |
CN116821376B (en) * | 2023-08-30 | 2024-03-08 | 北京华琦远航国际咨询有限公司 | Knowledge graph construction method and system in coal mine safety production field |
CN116842128B (en) * | 2023-09-01 | 2023-11-21 | 合肥机数量子科技有限公司 | Text relation extraction method and device, computer equipment and storage medium |
CN116860951B (en) * | 2023-09-04 | 2023-11-14 | 贵州中昂科技有限公司 | Information consultation service management method and management system based on artificial intelligence |
CN116861014B (en) * | 2023-09-05 | 2024-01-26 | 联通在线信息科技有限公司 | Image information extraction method and device based on pre-training language model |
CN117056493B (en) * | 2023-09-07 | 2024-07-16 | 四川大学 | Large language model medical question-answering system based on medical record knowledge graph |
CN117312493A (en) * | 2023-09-08 | 2023-12-29 | 中国中医科学院中医药信息研究所 | Multi-strategy knowledge extraction system |
CN117194637B (en) * | 2023-09-18 | 2024-04-30 | 深圳市大数据研究院 | Multi-level visual evaluation report generation method and device based on large language model |
CN116955579B (en) * | 2023-09-21 | 2023-12-29 | 武汉轻度科技有限公司 | Chat reply generation method and device based on keyword knowledge retrieval |
CN117235281B (en) * | 2023-09-22 | 2024-07-05 | 武汉贝塔世纪科技有限公司 | Multi-element data management method and system based on knowledge graph technology |
CN117056495B (en) * | 2023-10-08 | 2024-01-12 | 吉奥时空信息技术股份有限公司 | Automatic question-answering method and system for government affair consultation |
CN117271740A (en) * | 2023-10-11 | 2023-12-22 | 中国电子科技集团公司第十五研究所 | Large language model time sequence knowledge question-answering method based on sentence granularity prompt |
CN117076719B (en) * | 2023-10-12 | 2024-04-19 | 北京枫清科技有限公司 | Database joint query method, device and equipment based on large language model |
CN117057683B (en) * | 2023-10-13 | 2023-12-22 | 四川中电启明星信息技术有限公司 | Staff portrait management system based on knowledge graph and multi-source application data |
CN117111917B (en) * | 2023-10-23 | 2024-02-27 | 福建自贸试验区厦门片区Manteia数据科技有限公司 | Interaction method and device of medical auxiliary system, electronic equipment and storage medium |
CN117131181B (en) * | 2023-10-24 | 2024-04-05 | 国家电网有限公司 | Construction method of heterogeneous knowledge question-answer model, information extraction method and system |
CN117151429B (en) * | 2023-10-27 | 2024-01-26 | 中电科大数据研究院有限公司 | Government service flow arranging method and device based on knowledge graph |
CN117151122B (en) * | 2023-10-30 | 2024-03-22 | 湖南三湘银行股份有限公司 | Bank customer service session question-answering processing method and system based on natural language processing |
CN117194616A (en) * | 2023-11-06 | 2023-12-08 | 湖南四方天箭信息科技有限公司 | Knowledge query method and device for vertical domain knowledge graph, computer equipment and storage medium |
CN117194730B (en) * | 2023-11-06 | 2024-02-20 | 北京枫清科技有限公司 | Intention recognition and question answering method and device, electronic equipment and storage medium |
CN117786103B (en) * | 2023-11-07 | 2024-10-18 | 任拓数据科技(上海)有限公司 | Method for establishing content labels based on electronic commerce data and social media marketing content data |
CN117708277B (en) * | 2023-11-10 | 2024-10-01 | 广州宝露软件开发有限公司 | AIGC-based question and answer system and application method |
CN117235238B (en) * | 2023-11-13 | 2024-03-08 | 广东蘑菇物联科技有限公司 | Question answering method, question answering device, storage medium and computer equipment |
CN117556054B (en) * | 2023-11-14 | 2024-07-30 | 哈尔滨工业大学 | Knowledge graph construction method and management system based on large language model |
CN117235243A (en) * | 2023-11-16 | 2023-12-15 | 青岛民航凯亚系统集成有限公司 | Training optimization method for large language model of civil airport and comprehensive service platform |
CN117725222B (en) * | 2023-11-20 | 2024-07-02 | 中国科学院成都文献情报中心 | Method for extracting document complex knowledge object by integrating knowledge graph and large language model |
CN117290563B (en) * | 2023-11-22 | 2024-07-30 | 北京小米移动软件有限公司 | Vertical type searching method and device, searching system and storage medium |
CN117313748B (en) * | 2023-11-24 | 2024-03-12 | 中电科大数据研究院有限公司 | Multi-feature fusion semantic understanding method and device for government affair question and answer |
CN117290489B (en) * | 2023-11-24 | 2024-02-23 | 烟台云朵软件有限公司 | Method and system for quickly constructing industry question-answer knowledge base |
CN117407514B (en) * | 2023-11-28 | 2024-07-09 | 星环信息科技(上海)股份有限公司 | Solution plan generation method, device, equipment and storage medium |
CN117312534B (en) * | 2023-11-28 | 2024-02-23 | 南京中孚信息技术有限公司 | Intelligent question-answering implementation method, device and medium based on secret knowledge base |
CN117312535B (en) * | 2023-11-28 | 2024-06-28 | 中国平安财产保险股份有限公司 | Method, device, equipment and medium for processing problem data based on artificial intelligence |
CN117573834B (en) * | 2023-11-30 | 2024-04-16 | 北京快牛智营科技有限公司 | Multi-robot dialogue method and system for software-oriented instant service platform |
CN117371973A (en) * | 2023-12-06 | 2024-01-09 | 武汉科技大学 | Knowledge-graph-retrieval-based enhanced language model graduation service system |
CN117371534B (en) * | 2023-12-07 | 2024-02-27 | 同方赛威讯信息技术有限公司 | Knowledge graph construction method and system based on BERT |
CN117763160A (en) * | 2023-12-11 | 2024-03-26 | 江苏思行达信息技术有限公司 | Knowledge graph-based power marketing policy file analysis method |
CN117633252B (en) * | 2023-12-14 | 2024-06-18 | 广州华微明天软件技术有限公司 | Auxiliary retrieval method integrating knowledge graph and large language model |
CN117494717B (en) * | 2023-12-27 | 2024-03-19 | 卓世科技(海南)有限公司 | Context construction method and system based on AI large language model |
CN117828050B (en) * | 2023-12-29 | 2024-07-09 | 北京智谱华章科技有限公司 | Traditional Chinese medicine question-answering method, equipment and medium based on long-document retrieval enhancement generation |
CN117520524B (en) * | 2024-01-04 | 2024-03-29 | 北京环球医疗救援有限责任公司 | Intelligent question-answering method and system for industry |
CN117608545B (en) * | 2024-01-17 | 2024-05-10 | 之江实验室 | Standard operation program generation method based on knowledge graph |
CN117851573B (en) * | 2024-01-17 | 2024-06-25 | 广州大麦信息科技有限公司 | Virtual anchor intelligent chatting system based on dynamic knowledge graph |
CN118035461A (en) * | 2024-01-18 | 2024-05-14 | 广州市城市规划勘测设计研究院有限公司 | Knowledge-graph question-answering method, system, equipment and medium for field batch report |
CN117609477B (en) * | 2024-01-22 | 2024-05-07 | 亚信科技(中国)有限公司 | Large model question-answering method and device based on domain knowledge |
CN117910571A (en) * | 2024-01-23 | 2024-04-19 | 成都成电金盘健康数据技术有限公司 | Cervical cancer knowledge graph construction method based on big data |
CN117611254A (en) * | 2024-01-23 | 2024-02-27 | 口碑(上海)信息技术有限公司 | Large language model-based text generation method, device, equipment and storage medium |
CN117633518B (en) * | 2024-01-25 | 2024-04-26 | 北京大学 | Industrial chain construction method and system |
CN117743564B (en) * | 2024-01-30 | 2024-05-10 | 广东省华南技术转移中心有限公司 | Automatic extraction and recommendation method and system for technological policy information |
CN117689963B (en) * | 2024-02-02 | 2024-04-09 | 南京邮电大学 | Visual entity linking method based on multi-mode pre-training model |
CN117688165B (en) * | 2024-02-04 | 2024-04-30 | 湘江实验室 | Multi-edge collaborative customer service method, device, equipment and readable storage medium |
CN117743390B (en) * | 2024-02-20 | 2024-05-28 | 证通股份有限公司 | Query method and system for financial information and storage medium |
CN117743315B (en) * | 2024-02-20 | 2024-05-14 | 浪潮软件科技有限公司 | Method for providing high-quality data for multi-mode large model system |
CN117827847B (en) * | 2024-03-04 | 2024-05-28 | 国网山东省电力公司信息通信公司 | Training sample construction method, system, equipment and medium combined with large language model |
CN117851577B (en) * | 2024-03-06 | 2024-05-14 | 海乂知信息科技(南京)有限公司 | Government service question-answering method based on knowledge graph enhanced large language model |
CN117875725B (en) * | 2024-03-13 | 2024-08-02 | 湖南三湘银行股份有限公司 | Information processing system based on knowledge graph |
CN117931898B (en) * | 2024-03-25 | 2024-06-07 | 成都同步新创科技股份有限公司 | Multidimensional database statistical analysis method based on large model |
CN118093844A (en) * | 2024-04-26 | 2024-05-28 | 山东鼎高信息技术有限公司 | Government intelligent customer service implementation method based on artificial intelligent large model |
CN118154055B (en) * | 2024-05-08 | 2024-08-09 | 京东科技信息技术有限公司 | Public data element construction method and device |
CN118135592B (en) * | 2024-05-09 | 2024-09-13 | 支付宝(杭州)信息技术有限公司 | User service method and device based on medical LLM model |
CN118193714B (en) * | 2024-05-17 | 2024-07-30 | 山东浪潮科学研究院有限公司 | Dynamic adaptation question-answering system and method based on hierarchical structure and retrieval enhancement |
CN118227656B (en) * | 2024-05-24 | 2024-08-13 | 浙江大学 | Query method and device based on data lake |
CN118279113B (en) * | 2024-05-28 | 2024-08-16 | 湖南百姓田园数字科技有限公司 | Digital intelligence social public service management method and system based on large model |
CN118245590B (en) * | 2024-05-29 | 2024-07-26 | 福建拓尔通软件有限公司 | Answer selection method and system based on multi-view image contrast learning and meta-learning feature purification network |
CN118332092A (en) * | 2024-06-07 | 2024-07-12 | 清华大学 | Construction industry safety question-answering method and equipment based on large language model technology |
CN118312167B (en) * | 2024-06-11 | 2024-09-10 | 冠骋信息技术(苏州)有限公司 | Method and system for realizing suite mechanism based on low-code platform |
CN118364091B (en) * | 2024-06-19 | 2024-08-16 | 杭州艾草信息服务有限公司 | Question-answering processing method and system based on large model |
CN118377723B (en) * | 2024-06-20 | 2024-10-18 | 北京电科智芯科技有限公司 | Test case generation method, device and equipment |
CN118445419B (en) * | 2024-07-05 | 2024-09-17 | 北方健康医疗大数据科技有限公司 | Method, device, equipment and medium for constructing prompt automatic labeling medical text based on large model |
CN118503404A (en) * | 2024-07-17 | 2024-08-16 | 阿里云飞天(杭州)云计算技术有限公司 | Information processing method and device, electronic equipment and computer program product |
CN118552358B (en) * | 2024-07-25 | 2024-09-24 | 贵州中汇科技发展有限公司 | Legal document oriented graph knowledge enhanced paraphrasing generation method and system |
CN118643366B (en) * | 2024-08-14 | 2024-10-18 | 广州平云信息科技有限公司 | Platform user portrait generation method and system applying deep learning model |
Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TW501046B (en) * | 1999-06-11 | 2002-09-01 | Ind Tech Res Inst | A portable dialogue manager |
WO2019050968A1 (en) * | 2017-09-05 | 2019-03-14 | Forgeai, Inc. | Methods, apparatus, and systems for transforming unstructured natural language information into structured computer- processable data |
CN111078897A (en) * | 2019-12-26 | 2020-04-28 | 国衡智慧城市科技研究院(北京)有限公司 | System for generating six-dimensional knowledge map |
CN111324691A (en) * | 2020-01-06 | 2020-06-23 | 大连民族大学 | Intelligent question-answering method for minority nationality field based on knowledge graph |
CN111460125A (en) * | 2020-05-09 | 2020-07-28 | 山东舜网传媒股份有限公司 | Intelligent question and answer method and system for government affair service |
CN112579796A (en) * | 2020-12-30 | 2021-03-30 | 南京云起网络科技有限公司 | Knowledge graph construction method for teaching resources of online education classroom |
WO2021138163A1 (en) * | 2019-12-30 | 2021-07-08 | Kpmg Llp | System and method for analysis and determination of relationships from a variety of data sources |
EP3855320A1 (en) * | 2020-01-27 | 2021-07-28 | Cuddle Artificial Intelligence Private Limited | Systems and methods for adaptive question answering related applications |
CN113449114A (en) * | 2020-12-31 | 2021-09-28 | 中国科学技术大学智慧城市研究院(芜湖) | Method for constructing natural human life cycle holographic image based on knowledge graph |
CN113569050A (en) * | 2021-09-24 | 2021-10-29 | 湖南大学 | Method and device for automatically constructing government affair field knowledge map based on deep learning |
CN113672599A (en) * | 2020-09-30 | 2021-11-19 | 华斌 | Visual aid decision-making method for realizing government affair informatization project construction management by creating domain knowledge graph |
CN114119317A (en) * | 2021-11-22 | 2022-03-01 | 浪潮软件股份有限公司 | Knowledge graph construction method based on government affair service scene |
KR20220118680A (en) * | 2021-02-19 | 2022-08-26 | (주)아와소프트 | Chatbot service providing system for considering user personaand method thereof |
CN115438199A (en) * | 2022-11-08 | 2022-12-06 | 眉山环天智慧科技有限公司 | Knowledge platform system based on smart city scene data middling platform technology |
CN115510025A (en) * | 2022-09-14 | 2022-12-23 | 上海市大数据中心 | Construction method of government affair industry knowledge base based on natural language and user behavior analysis |
CN116450834A (en) * | 2022-12-31 | 2023-07-18 | 云南电网有限责任公司信息中心 | Archive knowledge graph construction method based on multi-mode semantic features |
2023-07-24: application CN202310909706.9A filed in CN; patent CN116628172B, status Active
Patent Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TW501046B (en) * | 1999-06-11 | 2002-09-01 | Ind Tech Res Inst | A portable dialogue manager |
WO2019050968A1 (en) * | 2017-09-05 | 2019-03-14 | Forgeai, Inc. | Methods, apparatus, and systems for transforming unstructured natural language information into structured computer- processable data |
CN111078897A (en) * | 2019-12-26 | 2020-04-28 | 国衡智慧城市科技研究院(北京)有限公司 | System for generating six-dimensional knowledge map |
WO2021138163A1 (en) * | 2019-12-30 | 2021-07-08 | Kpmg Llp | System and method for analysis and determination of relationships from a variety of data sources |
CN111324691A (en) * | 2020-01-06 | 2020-06-23 | 大连民族大学 | Intelligent question-answering method for minority nationality field based on knowledge graph |
EP3855320A1 (en) * | 2020-01-27 | 2021-07-28 | Cuddle Artificial Intelligence Private Limited | Systems and methods for adaptive question answering related applications |
CN111460125A (en) * | 2020-05-09 | 2020-07-28 | 山东舜网传媒股份有限公司 | Intelligent question and answer method and system for government affair service |
CN113672599A (en) * | 2020-09-30 | 2021-11-19 | 华斌 | Visual aid decision-making method for realizing government affair informatization project construction management by creating domain knowledge graph |
CN112579796A (en) * | 2020-12-30 | 2021-03-30 | 南京云起网络科技有限公司 | Knowledge graph construction method for teaching resources of online education classroom |
CN113449114A (en) * | 2020-12-31 | 2021-09-28 | 中国科学技术大学智慧城市研究院(芜湖) | Method for constructing natural human life cycle holographic image based on knowledge graph |
KR20220118680A (en) * | 2021-02-19 | 2022-08-26 | (주)아와소프트 | Chatbot service providing system for considering user personaand method thereof |
CN113569050A (en) * | 2021-09-24 | 2021-10-29 | 湖南大学 | Method and device for automatically constructing government affair field knowledge map based on deep learning |
CN114119317A (en) * | 2021-11-22 | 2022-03-01 | 浪潮软件股份有限公司 | Knowledge graph construction method based on government affair service scene |
CN115510025A (en) * | 2022-09-14 | 2022-12-23 | 上海市大数据中心 | Construction method of government affair industry knowledge base based on natural language and user behavior analysis |
CN115438199A (en) * | 2022-11-08 | 2022-12-06 | 眉山环天智慧科技有限公司 | Knowledge platform system based on smart city scene data middling platform technology |
CN116450834A (en) * | 2022-12-31 | 2023-07-18 | 云南电网有限责任公司信息中心 | Archive knowledge graph construction method based on multi-mode semantic features |
Non-Patent Citations (2)
Title |
---|
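Design and application of an artificial-intelligence-based intelligent service robot in the field of government services; Zhang He et al.; 网络安全技术与应用 (Network Security Technology & Application); full text *
Research on the construction of a knowledge-graph-driven big data management system for scientific research archives; Lei Jie et al.; 数字图书馆论坛 (Digital Library Forum); full text *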
Also Published As
Publication number | Publication date |
---|---|
CN116628172A (en) | 2023-08-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN116628172B (en) | Dialogue method for multi-strategy fusion in government service field based on knowledge graph | |
CN110298033B (en) | Keyword corpus labeling training extraction system | |
CN111026842B (en) | Natural language processing method, natural language processing device and intelligent question-answering system | |
CN110765257B (en) | Knowledge-graph-driven intelligent legal consulting system | |
CN110727779A (en) | Question-answering method and system based on multi-model fusion | |
CN111967761B (en) | Knowledge graph-based monitoring and early warning method and device and electronic equipment | |
CN110377715A (en) | Reasoning type accurate intelligent answering method based on legal knowledge map | |
CN113806563B (en) | Architect knowledge graph construction method for multi-source heterogeneous building humanistic historical material | |
CN109271506A (en) | Construction method of a deep-learning-based knowledge graph question answering system for the power communication field | |
US20190392035A1 (en) | Information object extraction using combination of classifiers analyzing local and non-local features | |
RU2679988C1 (en) | Extracting information objects with the help of a classifier combination | |
US20040163043A1 (en) | System method and computer program product for obtaining structured data from text | |
CN108922633A (en) | Method and system for standardizing disease names | |
CN110097278B (en) | Intelligent sharing and fusion training system and application system for scientific and technological resources | |
RU2646380C1 (en) | Using verified by user data for training models of confidence | |
CN112328800A (en) | System and method for automatically generating programming specification question answers | |
CN112417170B (en) | Relationship linking method for incomplete knowledge graph | |
CN113157859A (en) | Event detection method based on upper concept information | |
CN114238653A (en) | Method for establishing, complementing and intelligently asking and answering knowledge graph of programming education | |
CN115390806A (en) | Software design mode recommendation method based on bimodal joint modeling | |
CN114661872A (en) | Beginner-oriented API self-adaptive recommendation method and system | |
CN113742498B (en) | Knowledge graph construction and updating method | |
Tallapragada et al. | Improved Resume Parsing based on Contextual Meaning Extraction using BERT | |
CN114911893A (en) | Method and system for automatically constructing knowledge base based on knowledge graph | |
Mustafa et al. | Optimizing document classification: Unleashing the power of genetic algorithms |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||