CN117407505A - Question-answer retrieval method and system for integrating document knowledge and question-answer data - Google Patents


Info

Publication number
CN117407505A
Authority
CN
China
Prior art keywords
question
document
answer
type data
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311397843.5A
Other languages
Chinese (zh)
Inventor
李纪波
丁一凡
郑伟航
王延东
周祥国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur General Software Co Ltd
Original Assignee
Inspur General Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur General Software Co Ltd filed Critical Inspur General Software Co Ltd
Priority to CN202311397843.5A priority Critical patent/CN117407505A/en
Publication of CN117407505A publication Critical patent/CN117407505A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/332 - Query formulation
    • G06F16/3329 - Natural language query formulation or dialogue systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/243 - Classification techniques relating to the number of classes
    • G06F18/2431 - Multiple classes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/0455 - Auto-encoder networks; Encoder-decoder networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a question-answer retrieval method and system that fuse document knowledge and question-answer data. The method receives question-answer pair data and document data entered through a front-end interactive interface, fuses the two types of data, and numbers them separately to distinguish their vector libraries. It then obtains a question entered by the user and retrieves against the question-answer pair data: when the retrieval similarity is above a threshold, the question-answer pair content is returned directly and no recall ranking is performed on the vector library that stores the documents; when the similarity is below the threshold, the document vector library is searched and the document content is returned once retrieval succeeds. When neither the question-answer pair data nor the document data is hit, question recommendation is performed and several highly similar questions are returned to the front-end interactive interface for the user to select the most suitable one. The invention realizes efficient knowledge retrieval and question answering, returns the corresponding answer or document paragraph according to question similarity, and improves question-answering accuracy.

Description

Question-answer retrieval method and system for integrating document knowledge and question-answer data
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a question-answer retrieval method and system that fuse document knowledge and question-answer data.
Background
Intelligent question-answering systems are an important application of artificial intelligence technology: they answer user questions quickly, accurately and concisely, and by integrating scattered knowledge they improve working efficiency and save human resources.
Question-answer pair data must be curated manually, which consumes considerable manpower and material resources; however, this type of question-answer data is more accurate and stable, so a balance must be found between cost and accuracy. Enterprises also hold large amounts of unstructured text in documents such as DOC, TXT and PDF files, and directly extracting passage answers from these documents greatly increases the preprocessing cost of the question-answer data.
A question-answering system that integrates question-answer pairs and document knowledge can better balance data preprocessing cost against prediction accuracy, allowing enterprises to build a question-answering knowledge base more efficiently, deploy the system more quickly, and further reduce its cost of use.
Current question-answering systems generally support only one of question-answer pairs or document question answering. Recall accuracy on question-answer pair data is high, but a large amount of manpower is needed up front to clean the data. Current document question answering based on large models extracts knowledge from a single document and combines it with the model's own knowledge base to summarize and generate an answer, but the cost of using a large model is too high and answers that miss the question can occur.
Most current research targets a single scenario, improving recall either for question-answer pairs or for document-understanding-based question answering; research on fusing the two is scarce. Existing approaches suffer from low precision and inaccurate recalled content, most typically confused answer content: because document-based question answering has no fixed question set, an input question may be similar to document content, which degrades the recall precision of question-answer pair answers.
Disclosure of Invention
In view of the above, the invention provides a question-answer retrieval method and system that fuse document knowledge and question-answer data. Separate vector libraries are built and stored under a unified algorithmic scale, realizing efficient knowledge retrieval and question answering. The user can upload both question-answer pair data and document data; a unified algorithm vectorizes the data and builds indexes over it, the user then queries through an interactive interface, and the system returns the corresponding answer or document paragraph according to question similarity, improving question-answering accuracy.
Based on the above object, in a first aspect, the present invention provides a question-answer retrieval method that fuses document knowledge and question-answer data, comprising the following steps:
receiving question-answer pair data and document data entered through a front-end interactive interface, fusing the two types of data, and numbering them separately to distinguish their vector libraries;
storing the questions of the question-answer pair data as index vectors, with the answers stored in a knowledge base; slicing the uploaded documents in the document data first, and storing the sliced documents uniformly in a vector database;
obtaining the question entered by the user and retrieving against the question-answer pair data; when the retrieval similarity is above a threshold, returning the question-answer pair content directly without recall ranking on the vector library that stores the documents; when the similarity is below the threshold, searching the document vector library and returning the document content once retrieval succeeds;
when neither the question-answer pair data nor the document data is hit, performing question recommendation and returning several highly similar questions to the front-end interactive interface for the user to select the most suitable one.
As a further scheme of the invention, after the user's question is obtained, the system automatically processes it, including word segmentation and vectorization, and selects the appropriate vector library for retrieval according to the question's vectorized representation.
As a further scheme of the invention, the question-answer pair data and the document data are fused, separate vector libraries are built and stored under a unified algorithmic scale, and the two types of data are numbered so that the different vector libraries can be distinguished.
As a further scheme of the invention, the input question-answer pair data is segmented into words and vectorized to build a question vector library; the input document data is sliced, and the sliced documents are segmented and vectorized paragraph by paragraph to build a document vector library. Both vector libraries are constructed with Faiss to ensure that the two types of question answering share a consistent recall standard.
As a further aspect of the present invention, in the question retrieval stage the system first searches the vector library of question-answer pair data, and if the similarity is above a predetermined threshold it returns the answer of the similar question directly.
As a further aspect of the present invention, if a question misses in the vector library of question-answer pair data, the system attempts retrieval in the vector library of document data; if the question misses in both vector libraries, the system performs question recommendation and returns a list of similar questions.
As a further scheme of the invention, when question-answer pair data and document data are received from the front-end interactive interface, the user enters the corresponding knowledge, comprising document data and question-answer pair data, through separate entry interfaces provided for each. The question-answer pair data is recorded as qa and the document question-answer data as dqa. After the question-answer pair data is uploaded, a mapping file is built from each question and its corresponding answer, structured as two dictionaries.
As a further aspect of the present invention, after the data is stored, constructing the vector library includes the following steps (a sketch follows the list):
performing word segmentation on all question texts, using a segmentation tool such as jieba or HanLP;
producing sentence-level embeddings based on BERT, representing each question as a 512-dimensional vector, thereby completing text segmentation and vectorization;
after the questions are converted into vectors, building a vector index library based on Faiss;
question-answer pairs can be entered one by one through the interface or collated into a document for batch import.
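The construction step above can be pictured with a short Python sketch. It assumes the sentence-transformers package as the BERT-style encoder (the distiluse multilingual model happens to output 512-dimensional vectors, matching the size mentioned above) and the faiss library; the model name and function names are illustrative assumptions, not part of the patent.

```python
# Sketch of the question-vector-library construction described above.
# Assumes sentence-transformers and faiss are installed; the model name
# and variable names are illustrative assumptions.
import faiss
from sentence_transformers import SentenceTransformer

def build_question_index(questions):
    """Embed each question as a fixed-size sentence vector and index it with Faiss."""
    # distiluse-base-multilingual-cased-v1 outputs 512-dim vectors, matching the size above.
    encoder = SentenceTransformer("distiluse-base-multilingual-cased-v1")
    vectors = encoder.encode(questions, convert_to_numpy=True).astype("float32")
    index = faiss.IndexFlatL2(vectors.shape[1])   # exact L2 index over the question vectors
    index.add(vectors)                            # row i corresponds to question ID i
    return index

# Question IDs follow insertion order, matching the q_id_map numbering that starts at 0.
index = build_question_index(["如何重置密码?", "发票如何开具?"])
```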
As a further scheme of the invention, after a document is uploaded the system generates an ID by default, names it dqa_ID, and segments the document at the character level. The number of characters per slice is set through a configuration file provided by the system, which the user can adjust according to the document's size and type. After segmentation, a knowledge base is built: key-value pairs of the form {paragraph ID: paragraph} are stored in dictionary form in the knowledge base. The segmented paragraphs are then tokenized with a word segmentation tool, stop words are removed, the paragraphs are vectorized with BERT and represented as vectors, and a vector library is built and stored with Faiss, as sketched below.
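A minimal sketch of the document-side pipeline just described, assuming jieba for word segmentation; the chunk size, stop-word list and helper names are illustrative assumptions rather than values from the patent.

```python
# Sketch of character-level slicing, the {paragraph ID: paragraph} knowledge base,
# and jieba-based stop-word removal before vectorization.
import jieba

STOP_WORDS = {"的", "了", "是", "和"}   # placeholder stop-word list

def slice_document(text, chunk_chars=300):
    """Cut the raw document text into fixed-length character chunks (paragraphs)."""
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]

def build_knowledge_base(doc_id, text, chunk_chars=300):
    paragraphs = slice_document(text, chunk_chars)
    kb = {f"{doc_id}_{i}": p for i, p in enumerate(paragraphs)}   # {paragraph ID: paragraph}
    # Tokenize and drop stop words before handing the paragraphs to the BERT encoder.
    tokenized = {pid: [w for w in jieba.cut(p) if w not in STOP_WORDS]
                 for pid, p in kb.items()}
    return kb, tokenized
```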
In a second aspect, the present invention provides a question-answer retrieval system that fuses document knowledge and question-answer data, the system comprising:
the data uploading module, used for receiving question-answer pair data and document data entered through the front-end interactive interface;
the vector library construction module, used for fusing the question-answer pair data and the document data and numbering them separately to distinguish their vector libraries; the questions of the question-answer pair data are stored as index vectors and the answers are stored in a knowledge base; the uploaded documents in the document data are sliced first and the sliced documents are stored uniformly in a vector database;
the knowledge query module, used for obtaining the user's question and retrieving against the question-answer pair data; when the retrieval similarity is above a threshold, the question-answer pair content is returned directly without recall ranking on the document vector library; when the similarity is below the threshold, the document vector library is searched and the document content is returned once retrieval succeeds;
and the recall module, used for performing question recommendation when neither the question-answer pair data nor the document data is hit, returning several highly similar questions to the front-end interactive interface for the user to select the most suitable one.
In yet another aspect of the present invention, there is provided a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, performs any of the above question-answer retrieval methods of the present invention that fuse document knowledge and question-answer data.
In yet another aspect of the present invention, there is provided a computer readable storage medium storing computer program instructions which, when executed, implement any of the above question-answer retrieval methods of the present invention that fuse document knowledge and question-answer data.
Compared with the prior art, the question-answer retrieval method and system that fuse document knowledge and question-answer data have the following beneficial effects:
1. Improved retrieval accuracy: the method allows question-answer pair data and document data to be fused and recalled according to question similarity, improving the accuracy of knowledge retrieval. Users can more easily find the answer or document related to their question.
2. Resource utilization: by using Faiss to build the vector index library and a queue-style data structure, the system manages hardware resources efficiently, improves resource utilization and reduces system maintenance cost.
3. Flexibility and customizability: the invention provides configurable parameters, such as the score threshold for recall results and the number of returned results, to accommodate different application scenarios and user requirements. This lets the system perform well in different situations while reducing the user's workload.
4. Data fusion: question-answer pair data and document data are integrated and stored in the same system, reducing data dispersion, improving the completeness of the knowledge base and making knowledge management more efficient.
5. Fast problem solving: by selecting the appropriate vector library for question retrieval, the system can quickly provide users with answers or related documents, speeding up problem resolution.
6. Automatic question recommendation: when a question misses in both the question-answer pair and document vector libraries, the system provides a question recommendation function, letting the user pick the most suitable question from similar ones and helping them find the answer.
7. Multi-format support: the system supports multiple document formats, including doc, docx, txt and pdf, making it suitable for different types of document data.
In summary, the invention improves the accuracy and efficiency of knowledge retrieval, provides better user experience, and reduces the complexity of knowledge management.
These and other aspects of the present application will be more readily apparent from the following description of the embodiments. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings needed for describing the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the invention, and that a person skilled in the art can obtain other drawings from them without inventive effort.
In the figure:
FIG. 1 is a flow chart of a question-answer search method for fusing document knowledge and question-answer data according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the structure of a question-answer search system that fuses document knowledge and question-answer data according to an embodiment of the present invention.
Detailed Description
The present application will be further described below with reference to the drawings and the detailed description. It should be understood that, provided no conflict arises, the following embodiments or technical features may be combined arbitrarily to form new embodiments.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention will be described in further detail with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
It should be noted that, in the embodiments of the present invention, all the expressions "first" and "second" are used to distinguish two non-identical entities with the same name or non-identical parameters, and it is noted that the "first" and "second" are only used for convenience of expression, and should not be construed as limiting the embodiments of the present invention. Furthermore, the terms "comprise" and "have," and any variations thereof, are intended to cover a non-exclusive inclusion, such as a process, method, system, article, or other step or unit that comprises a list of steps or units.
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The flow diagrams depicted in the figures are merely illustrative and not necessarily all of the elements and operations/steps are included or performed in the order described. For example, some operations/steps may be further divided, combined, or partially combined, so that the order of actual execution may be changed according to actual situations.
Some embodiments of the present application are described in detail below with reference to the accompanying drawings. The following embodiments and features of the embodiments may be combined with each other without conflict.
Current question-answering systems generally support only one of question-answer pairs or document question answering. Recall accuracy on question-answer pair data is high, but a large amount of manpower is needed up front to clean the data. Current document question answering based on large models extracts knowledge from a single document and combines it with the model's own knowledge base to summarize and generate an answer, but the cost of using a large model is too high and answers that miss the question can occur.
Most current research targets a single scenario, improving recall either for question-answer pairs or for document-understanding-based question answering; research on fusing the two is scarce. Existing approaches suffer from low precision and inaccurate recalled content, most typically confused answer content: because document-based question answering has no fixed question set, an input question may be similar to document content, which degrades the recall precision of question-answer pair answers.
The invention provides a question-answer retrieval method and system that fuse document knowledge and question-answer data. Separate vector libraries are built and stored under a unified algorithmic scale, realizing efficient knowledge retrieval and question answering. The user can upload both question-answer pair data and document data; a unified algorithm vectorizes the data and builds indexes over it, the user then queries through an interactive interface, and the system returns the corresponding answer or document paragraph according to question similarity, improving question-answering accuracy.
Referring to fig. 1, an embodiment of the present invention provides a question-answer retrieval method that fuses document knowledge and question-answer data, the method comprising the following steps:
Step S10, receiving question-answer pair data and document data entered through the front-end interactive interface, fusing the two types of data and numbering them separately to distinguish their vector libraries;
Step S20, storing the questions of the question-answer pair data as index vectors, with the answers stored in a knowledge base; slicing the uploaded documents in the document data first, and storing the sliced documents uniformly in a vector database;
Step S30, obtaining the question entered by the user and retrieving against the question-answer pair data; when the retrieval similarity is above a threshold, returning the question-answer pair content directly without recall ranking on the vector library that stores the documents; when the similarity is below the threshold, searching the document vector library and returning the document content once retrieval succeeds;
Step S40, when neither the question-answer pair data nor the document data is hit, performing question recommendation and returning several highly similar questions to the front-end interactive interface for the user to select the most suitable one. A sketch of this tiered recall follows.
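A minimal sketch of the tiered recall in steps S30 and S40, assuming three injected helpers (search_qa, search_doc, recommend) and an illustrative threshold; none of these names or values come from the patent.

```python
# Sketch of the tiered recall: search the QA vector library first, fall back to the
# document library, and finally return question recommendations.
QA_THRESHOLD = 0.8  # assumed similarity threshold read from the system configuration

def answer(user_question, search_qa, search_doc, recommend):
    qa_hit, qa_score = search_qa(user_question)          # best match in the QA vector library
    if qa_score >= QA_THRESHOLD:
        return {"type": "qa", "answer": qa_hit}          # direct answer, skip the document library
    doc_hit, doc_score = search_doc(user_question)       # fall back to the document vector library
    if doc_score >= QA_THRESHOLD:
        return {"type": "document", "answer": doc_hit}   # return the matching paragraph
    return {"type": "recommend", "questions": recommend(user_question)}  # similar-question list
```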
In this embodiment, after the user's question is obtained, the system automatically processes it, including word segmentation and vectorization, and selects the appropriate vector library for retrieval according to the question's vectorized representation.
In step S10, the question-answer pair data and the document data are fused, separate vector libraries are built and stored under a unified algorithmic scale, and the two types of data are numbered to distinguish the different vector libraries.
The input question-answer pair data is segmented into words and vectorized to build a question vector library; the input document data is sliced, and the sliced documents are segmented and vectorized paragraph by paragraph to build a document vector library. Both vector libraries are constructed with Faiss to ensure that the two types of question answering share a consistent recall standard.
In this embodiment, in the question retrieval stage the system first searches the vector library of question-answer pair data, and if the similarity is above a predetermined threshold it returns the answer of the similar question directly.
If the question misses in the vector library of question-answer pair data, the system attempts retrieval in the vector library of document data; if the question misses in both vector libraries, the system performs question recommendation and returns a list of similar questions.
In this embodiment, when question-answer pair data and document data are received from the front-end interactive interface, the user enters the corresponding knowledge, comprising document data and question-answer pair data, through separate entry interfaces provided for each. The question-answer pair data is recorded as qa and the document question-answer data as dqa. After the question-answer pair data is uploaded, a mapping file is built from each question and its corresponding answer, structured as two dictionaries.
After the data is stored, constructing the vector library includes the following steps:
performing word segmentation on all question texts, using a segmentation tool such as jieba or HanLP;
producing sentence-level embeddings based on BERT, representing each question as a 512-dimensional vector, thereby completing text segmentation and vectorization;
after the questions are converted into vectors, building a vector index library based on Faiss;
question-answer pairs can be entered one by one through the interface or collated into a document for batch import.
After a document is uploaded the system generates an ID by default, names it dqa_ID, and segments the document at the character level. The number of characters per slice is set through a configuration file provided by the system, which the user can adjust according to the document's size and type. After segmentation, a knowledge base is built: key-value pairs of the form {paragraph ID: paragraph} are stored in dictionary form in the knowledge base. The segmented paragraphs are then tokenized with a word segmentation tool, stop words are removed, the paragraphs are vectorized with BERT and represented as vectors, and a vector library is built and stored with Faiss.
In this question-answer retrieval method that fuses document knowledge and question-answer data, the provided front-end interface is used to enter and upload question-answer pairs and documents separately. The two types of data are fused, separate vector libraries are built and stored under a unified algorithmic scale, and the question-answer pair data and the document data are numbered, mainly so that the different vector libraries can be distinguished. The questions of the question-answer pair data are stored as index vectors and the answers are stored in a knowledge base; the document data is sliced first and the sliced documents are stored uniformly in a vector database. Both the question-answer pair and document vector libraries are built with Faiss, which keeps the two types of question-answer recall standards consistent and improves question-answering accuracy.
After the user enters a question, the question-answer pair data is retrieved first. If the similarity is high enough, the question-answer pair content is returned directly and no recall ranking is performed on the vector library that stores the documents. When the similarity against the question-answer pair data is low, the document vector library is searched and the document content is returned once retrieval succeeds. If neither the question-answer pairs nor the document data is hit, question recommendation is performed: the system returns several questions with the highest similarity to the front-end interactive interface and the user selects the most suitable one.
In this embodiment, when building the vector library and uploading data, the user first enters the corresponding knowledge, including document data and question-answer pair data, through the provided front-end interface.
After the question-answer pair data is uploaded, a mapping file is built from the questions and their corresponding answers, structured as two dictionaries, namely:
q_id_map = {question ID: question} and q_a_map = {question: answer}, where question IDs are numbered incrementally from 0 and this numbering matches the question ordering used when the vector library is built. The two dictionaries are kept as a tuple, i.e.:
(q_id_map, q_a_map), which is stored in the knowledge base under the file name qa_ID, where ID is a unique number generated automatically by the system when the data is uploaded; this ID is the key marker for distinguishing the answer vector library from the vector library built from documents.
Similar questions of a given question are not processed separately; they are merged with the other questions and stored in the q_id_map dictionary, and the mapping from questions to answers is many-to-one, i.e. a question and its similar questions are stored in q_a_map against the same answer as the knowledge index content. Taking one question-answer pair record as an example (a sketch follows the example):
data = {"question": "sentence", "answer": "answer_text", "similar": ["sentence_1", "sentence_2", …, "sentence_n"]}
The dictionary data built from this record is: q_id_map = {"question id_1": "sentence", "question id_2": "sentence_1", "question id_3": "sentence_2", …, "question id_n+1": "sentence_n"} and q_a_map = {"sentence": "answer_text", "sentence_1": "answer_text", …, "sentence_n": "answer_text"}.
After the data is stored, a vector library must be built. First, all question texts are segmented into words; the segmentation tool can be jieba, HanLP or similar. Second, sentence-level embeddings are produced: this patent uses BERT-based embedding to represent each question as a 512-dimensional vector, completing text segmentation and vectorization. This step follows current mainstream methods to obtain the best results for Chinese text.
After the questions are converted into vectors, a vector index library is built based on Faiss.
Question-answer pairs can be entered one by one through the interface, or collated into documents (Excel spreadsheets) for batch import.
In this embodiment, for document data, after a document is uploaded the system generates an ID by default, names it dqa_ID, and then segments the document at the character level. The system provides a configuration file for setting the number of characters per slice, which the user can adjust according to the document's size and type. After segmentation, a knowledge base is built: key-value pairs of the form {paragraph ID: paragraph} are stored in dictionary form in the knowledge base.
After the document is segmented, a word segmentation tool tokenizes the text and removes stop words, BERT-based vectorization represents each segmented paragraph as a vector, and a vector library is then built and stored with Faiss.
The document can also be segmented by paragraph and stored in paragraph form. Text vectorization is not limited to BERT; it can also be based on techniques such as Word2Vec or GloVe.
Supported upload formats include doc, docx, txt and pdf documents.
When data is uploaded, the question-answer type must be selected: the front-end interface provided by the system contains a question-answer type selection box, and the user chooses which data type to upload, so that the system's back-end service can preprocess the data with the corresponding module and then generate the corresponding vector library.
In this embodiment, during knowledge query and recall the system provides an interface for assembling question-answer types. After data is uploaded and a vector library is built, a corresponding question-answer type is generated, and each question-answer type has a unique ID. Different question-answer types can be assembled at the robot level, where the robot is the chat interface through which the user interacts, i.e. the entry point for user interaction. The user configures this through the configuration interface provided by the system; what is configured is the ID of a dialogue type, which is generated automatically when the user creates a question-answer type and is synchronously persisted in a database. When the robot configuration interface is opened, the system automatically loads the corresponding dialogue IDs, and the user completes the addition by ticking the box next to a dialogue.
Before configuring the robot level, it must be ensured that the model for the dialogue type, whether question-answer or document question-answer, has been started. The system also provides front-end interface operations: after data is uploaded, the user can click the build button in the interface and the corresponding vector library is built automatically; at this point the vector model is only persisted, not loaded into memory. The user loads the corresponding question-answer library into memory by clicking the start button next to the data of that question-answer type in the interface. For loading question-answer vector models the system uses a queue-style data structure: after the user clicks start, the corresponding dialogue ID is placed in the queue. The user can also delete data through the interface provided by the front end; after the delete button is clicked, the system by default removes the corresponding dialogue type from the queue and releases its memory, so that resources are freed more effectively and hardware utilization improves. A sketch of this queue-style management follows.
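A minimal sketch of the queue-style management of loaded dialogue vector libraries described above; the class and method names are illustrative assumptions.

```python
# "start" loads a dialogue ID and its persisted Faiss index into memory;
# "delete" drops it again so the memory can be reclaimed.
from collections import OrderedDict

class DialogueRegistry:
    def __init__(self):
        self._loaded = OrderedDict()          # dialogue ID -> in-memory Faiss index

    def start(self, dialogue_id, load_index):
        if dialogue_id not in self._loaded:
            self._loaded[dialogue_id] = load_index(dialogue_id)   # load the persisted index

    def delete(self, dialogue_id):
        self._loaded.pop(dialogue_id, None)   # remove from the queue and free the memory

    def get(self, dialogue_id):
        return self._loaded.get(dialogue_id)
```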
After the user enters a question in the robot chat box, the system first performs word segmentation automatically; the segmentation tool is not limited to jieba. Stop words and phrases are removed, and the question is converted into a sentence vector with BERT in the same way as during vector library construction, which keeps the word vectors and the quantization scale consistent. Once the question is converted into a sentence vector, the corresponding dialogue model, i.e. the vector library, is taken from the queue based on parameters such as the robot ID, dialogue ID and dialogue type. By default the model finds the 5 nearest questions, i.e. recalls them through the similar-vector search library Faiss, and stores them in a list as dictionaries with the structure {"a": sentence, "score": score}, where score is the similarity (0-100) computed with the Faiss library and a lower score indicates a question more similar to the input. All similar questions are then ranked by score in ascending order, as sketched below.
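A short sketch of this recall-and-rank step, assuming a q_id_map keyed by the stringified Faiss row ID and an encoder from the earlier construction sketch; the names and the top-k value mirror the defaults described above but are otherwise assumptions.

```python
# Embed the query, take the 5 nearest neighbours from the Faiss index, and sort
# them by score in ascending order (a lower score meaning a closer match).
def recall_similar(index, encoder, q_id_map, user_question, top_k=5):
    query = encoder.encode([user_question], convert_to_numpy=True).astype("float32")
    scores, ids = index.search(query, top_k)              # Faiss returns distances and row IDs
    hits = [{"a": q_id_map[str(i)], "score": float(s)}    # {"a": sentence, "score": score}
            for s, i in zip(scores[0], ids[0]) if i != -1]
    return sorted(hits, key=lambda h: h["score"])         # ascending: most similar first
```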
For question-answer pair data, recalling an answer takes two steps: first, the list of nearest question IDs is obtained with Faiss and the corresponding questions are looked up in q_id_map; second, the answer is looked up in the q_a_map dictionary by question and returned in the data structure {"a": answer, "score": score}, where answer is the answer text.
For the document question-answer type, answer recall takes only one step: the list of nearest sentence IDs is obtained with Faiss, the corresponding content is looked up in q_id_map and returned in the data structure {"a": content, "score": score}, where content is the nearest paragraph found in the document; after sorting, the content with the lowest score is taken and returned to the interactive interface. A sketch covering both answer-assembly paths follows.
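A short sketch of the two answer-assembly paths, reusing the hit structure from the recall sketch above; the function and its arguments are illustrative assumptions.

```python
# qa needs the extra question -> answer lookup through q_a_map,
# while dqa returns the recalled paragraph directly.
def assemble_answer(hit, qa_type, q_a_map=None):
    if qa_type == "qa":
        # Two steps: Faiss gave us the nearest question, q_a_map maps it to its answer.
        return {"a": q_a_map[hit["a"]], "score": hit["score"]}
    # "dqa": one step, the recalled entry is already the nearest document paragraph.
    return {"a": hit["a"], "score": hit["score"]}
```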
The score threshold for returned results can also be set through the system configuration file; if no result meeting the threshold is found, the nearest questions or similar paragraphs are returned directly, with 5 nearest questions or paragraphs returned by default so that the user can choose according to the actual situation. A minimal configuration sketch follows.
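A minimal sketch of what such a configuration entry might look like; the keys and default values are assumptions, not values taken from the patent.

```python
# Illustrative recall configuration; in practice these would be read from the
# system configuration file mentioned above.
RECALL_CONFIG = {
    "score_threshold": 60,   # only hits scoring better than this are returned directly
    "top_k": 5,              # number of nearest questions/paragraphs to fall back on
    "chunk_chars": 300,      # character count used when slicing uploaded documents
}
```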
When a question query is issued, parameters such as the robot ID and dialogue type are included at the same time. These parameters are the key basis on which the back end assembles the dialogue type; when the user chats through the robot chat interface they are added by default, and they are also the key basis for ensuring the uniqueness of the dialogue. Document question answering and question-answer pairs are distinguished by their ID numbers.
It is noted that the above-described figures are only schematic illustrations of processes involved in a method according to an exemplary embodiment of the invention, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.
It should be understood that although described in a certain order, the steps are not necessarily performed sequentially in that order. Unless explicitly stated herein, the steps are not strictly limited in their order of execution and may be executed in other orders. Moreover, some steps of the present embodiment may include multiple sub-steps or stages, which are not necessarily performed at the same time and may be performed at different times; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with at least part of the sub-steps or stages of other steps.
Referring to fig. 1 to 2, the invention further provides a question-answer retrieval system that fuses document knowledge and question-answer data, which comprises:
the data uploading module, used for receiving question-answer pair data and document data entered through the front-end interactive interface;
the vector library construction module, used for fusing the question-answer pair data and the document data and numbering them separately to distinguish their vector libraries; the questions of the question-answer pair data are stored as index vectors and the answers are stored in a knowledge base; the uploaded documents in the document data are sliced first and the sliced documents are stored uniformly in a vector database;
the knowledge query module, used for obtaining the user's question and retrieving against the question-answer pair data; when the retrieval similarity is above a threshold, the question-answer pair content is returned directly without recall ranking on the document vector library; when the similarity is below the threshold, the document vector library is searched and the document content is returned once retrieval succeeds;
and the recall module, used for performing question recommendation when neither the question-answer pair data nor the document data is hit, returning several highly similar questions to the front-end interactive interface for the user to select the most suitable one.
In this embodiment, after the user enters a question, the question-answer pair data is retrieved first. If the similarity is high enough, the question-answer pair content is returned directly and no recall ranking is performed on the vector library that stores the documents. When the similarity against the question-answer pair data is low, the document vector library is searched and the document content is returned once retrieval succeeds. If neither the question-answer pairs nor the document data is hit, question recommendation is performed: the system returns several questions with the highest similarity to the front-end interactive interface and the user selects the most suitable one.
In this embodiment, when building the vector library and uploading data, the user first enters the corresponding knowledge, including document data and question-answer pair data, through the provided front-end interface.
After the question-answer pair data is uploaded, a mapping file is built from the questions and their corresponding answers, structured as two dictionaries, namely:
q_id_map = {question ID: question} and q_a_map = {question: answer}, where question IDs are numbered incrementally from 0 and this numbering matches the question ordering used when the vector library is built. The two dictionaries are kept as a tuple, i.e.:
(q_id_map, q_a_map), which is stored in the knowledge base under the file name qa_ID, where ID is a unique number generated automatically by the system when the data is uploaded; this ID is the key marker for distinguishing the answer vector library from the vector library built from documents.
Similar questions of a given question are not processed separately; they are merged with the other questions and stored in the q_id_map dictionary, and the mapping from questions to answers is many-to-one, i.e. a question and its similar questions are stored in q_a_map against the same answer as the knowledge index content. Taking one question-answer pair record as an example:
data = {"question": "sentence", "answer": "answer_text", "similar": ["sentence_1", "sentence_2", …, "sentence_n"]}
The dictionary data built from this record is: q_id_map = {"question id_1": "sentence", "question id_2": "sentence_1", "question id_3": "sentence_2", …, "question id_n+1": "sentence_n"} and q_a_map = {"sentence": "answer_text", "sentence_1": "answer_text", …, "sentence_n": "answer_text"}.
After the data is stored, a vector library must be built. First, all question texts are segmented into words; the segmentation tool can be jieba, HanLP or similar. Second, sentence-level embeddings are produced: this patent uses BERT-based embedding to represent each question as a 512-dimensional vector, completing text segmentation and vectorization. This step follows current mainstream methods to obtain the best results for Chinese text.
After the questions are converted into vectors, a vector index library is built based on Faiss.
Question-answer pairs can be entered one by one through the interface, or collated into documents (Excel spreadsheets) for batch import.
In this embodiment, for document data, after a document is uploaded the system generates an ID by default, names it dqa_ID, and then segments the document at the character level. The system provides a configuration file for setting the number of characters per slice, which the user can adjust according to the document's size and type. After segmentation, a knowledge base is built: key-value pairs of the form {paragraph ID: paragraph} are stored in dictionary form in the knowledge base.
After the document is segmented, a word segmentation tool tokenizes the text and removes stop words, BERT-based vectorization represents each segmented paragraph as a vector, and a vector library is then built and stored with Faiss.
The document can also be segmented by paragraph and stored in paragraph form. Text vectorization is not limited to BERT; it can also be based on techniques such as Word2Vec or GloVe.
Supported upload formats include doc, docx, txt and pdf documents.
When data is uploaded, the question-answer type must be selected: the front-end interface provided by the system contains a question-answer type selection box, and the user chooses which data type to upload, so that the system's back-end service can preprocess the data with the corresponding module and then generate the corresponding vector library.
In this embodiment, during knowledge query and recall the system provides an interface for assembling question-answer types. After data is uploaded and a vector library is built, a corresponding question-answer type is generated, and each question-answer type has a unique ID. Different question-answer types can be assembled at the robot level, where the robot is the chat interface through which the user interacts, i.e. the entry point for user interaction. The user configures this through the configuration interface provided by the system; what is configured is the ID of a dialogue type, which is generated automatically when the user creates a question-answer type and is synchronously persisted in a database. When the robot configuration interface is opened, the system automatically loads the corresponding dialogue IDs, and the user completes the addition by ticking the box next to a dialogue.
Before configuring the robot level, it must be ensured that the model for the dialogue type, whether question-answer or document question-answer, has been started. The system also provides front-end interface operations: after data is uploaded, the user can click the build button in the interface and the corresponding vector library is built automatically; at this point the vector model is only persisted, not loaded into memory. The user loads the corresponding question-answer library into memory by clicking the start button next to the data of that question-answer type in the interface. For loading question-answer vector models the system uses a queue-style data structure: after the user clicks start, the corresponding dialogue ID is placed in the queue. The user can also delete data through the interface provided by the front end; after the delete button is clicked, the system by default removes the corresponding dialogue type from the queue and releases its memory, so that resources are freed more effectively and hardware utilization improves.
After the user enters a question in the robot chat box, the system first performs word segmentation automatically; the segmentation tool is not limited to jieba. Stop words and phrases are removed, and the question is converted into a sentence vector with BERT in the same way as during vector library construction, which keeps the word vectors and the quantization scale consistent. Once the question is converted into a sentence vector, the corresponding dialogue model, i.e. the vector library, is taken from the queue based on parameters such as the robot ID, dialogue ID and dialogue type. By default the model finds the 5 nearest questions, i.e. recalls them through the similar-vector search library Faiss, and stores them in a list as dictionaries with the structure {"a": sentence, "score": score}, where score is the similarity (0-100) computed with the Faiss library and a lower score indicates a question more similar to the input. All similar questions are then ranked by score in ascending order.
For question-answer pair data, recalling an answer takes two steps: first, the list of nearest question IDs is obtained with Faiss and the corresponding questions are looked up in q_id_map; second, the answer is looked up in the q_a_map dictionary by question and returned in the data structure {"a": answer, "score": score}, where answer is the answer text.
For the document question-answer type, answer recall takes only one step: the list of nearest sentence IDs is obtained with Faiss, the corresponding content is looked up in q_id_map and returned in the data structure {"a": content, "score": score}, where content is the nearest paragraph found in the document; after sorting, the content with the lowest score is taken and returned to the interactive interface.
The score threshold for returned results can also be set through the system configuration file; if no result meeting the threshold is found, the nearest questions or similar paragraphs are returned directly, with 5 nearest questions or paragraphs returned by default so that the user can choose according to the actual situation.
When a question query is issued, parameters such as the robot ID and dialogue type are included at the same time. These parameters are the key basis on which the back end assembles the dialogue type; when the user chats through the robot chat interface they are added by default, and they are also the key basis for ensuring the uniqueness of the dialogue. Document question answering and question-answer pairs are distinguished by their ID numbers.
In conclusion, the question-answer retrieval method and system for integrating document knowledge and question-answer pair data have wide application prospects. The method combines knowledge data of different types, and improves accuracy and efficiency of problem solving through vectorization and similarity comparison. The flexibility, configurability and resource utilization of the system increase the effectiveness of knowledge management while providing automatic problem recommendation functionality so that users can more easily obtain the required information. The invention brings new solutions to the field of knowledge retrieval, and is expected to have beneficial effects in various fields including online help documents, customer support, education, research and the like.
In a third aspect of the embodiments of the present invention, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, implements the method of any of the embodiments described above.
The computer device comprises a processor and a memory, and may further include an input system and an output system. The processor, memory, input system and output system may be connected by a bus or in other ways. The input system may receive input numeric or character information and generate signal inputs related to the question-answer retrieval that fuses document knowledge and question-answer data. The output system may include a display device such as a display screen.
As a non-volatile computer-readable storage medium, the memory may be used to store non-volatile software programs, non-volatile computer-executable programs and modules, such as the program instructions/modules corresponding to the question-answer retrieval method for fusing document knowledge and question-answer data in the embodiments of the present application. The memory may include a storage program area and a storage data area, where the storage program area may store an operating system and at least one application program required for a function, and the storage data area may store data created by the use of the question-answer retrieval method for fusing document knowledge and question-answer data, and the like. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory optionally includes memory located remotely relative to the processor, and such remote memory may be connected to the local module through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
In some embodiments, the processor may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor is typically used to control the overall operation of the computer device. In this embodiment, the processor is configured to run the program code stored in the memory or to process data. By running the non-volatile software programs, instructions and modules stored in the memory, the processor of the computer device of this embodiment executes the various functional applications and data processing of the server, that is, implements the steps of the question-answer retrieval method for fusing document knowledge and question-answer data of the above method embodiments.
It should be appreciated that all of the embodiments, features and advantages set forth above for the question-answer retrieval method for fusing document knowledge and question-answer data according to the present invention apply equally to the question-answer retrieval system and storage medium according to the present invention, provided that they do not conflict with one another.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
Finally, it should be noted that the computer-readable storage media (e.g., memory) herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. By way of example, and not limitation, nonvolatile memory can include Read Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM), which acts as external cache memory. By way of example, and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The storage devices of the disclosed aspects are intended to comprise, without being limited to, these and other suitable types of memory.
The various illustrative logical blocks, modules, and circuits described in connection with the disclosure herein may be implemented or performed with the following components designed to perform the functions herein: a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP and/or any other such configuration.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that as used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly supports an exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items. The serial numbers of the foregoing embodiments of the present invention are used for description only and do not represent the advantages or disadvantages of the embodiments.
Those of ordinary skill in the art will appreciate that the above discussion of any embodiment is merely exemplary and is not intended to imply that the scope of the disclosure of the embodiments of the invention, including the claims, is limited to such examples; within the idea of the embodiments of the invention, features of the above embodiments or of different embodiments may also be combined, and many other variations of the different aspects of the embodiments of the invention as described above exist, which are not provided in detail for the sake of brevity. Therefore, any omission, modification, equivalent replacement or improvement made to the embodiments should be included in the protection scope of the embodiments of the present invention.

Claims (10)

1. A question-answer retrieval method for fusing document knowledge and question-answer data is characterized by comprising the following steps:
Receiving question-answer pair type data and document type data input by a front-end interactive interface, fusing the question-answer pair type data and the document type data, and numbering the question-answer pair type data and the document type data respectively to distinguish a vector library;
storing the questions in the question-answer pair type data as index vectors and storing the answers in a knowledge base; for the document type data, slicing the uploaded documents first and storing the sliced documents uniformly in a vector database;
acquiring a question input by a user and retrieving the question-answer pair type data; when the retrieval similarity is higher than a threshold value, directly returning the content of the question-answer pair without performing recall ordering on the vector library storing the documents; when the retrieval similarity is lower than the threshold value, retrieving the document-type storage vector library and returning the document content after a successful retrieval;
when neither the question-answer pair type data nor the document type data is hit, performing question recommendation and returning a plurality of questions with high similarity to the front-end interactive interface for the user to select the most suitable question.
2. The question-answer retrieval method for fusing document knowledge and question-answer data according to claim 1, wherein after obtaining the question input by the user, the system automatically performs question processing including word segmentation and vectorization, and selects a suitable vector library for question retrieval according to the vectorized representation of the question.
3. The question-answer retrieval method for fusing document knowledge and question-answer data according to claim 2, wherein the two types of data, question-answer pair type data and document type data, are fused, different vector libraries are constructed and stored with a unified algorithm scale, and the question-answer pair type data and the document type data are numbered at the same time so as to distinguish the different vector libraries.
4. The question-answer retrieval method for fusing document knowledge and question-answer data according to claim 3, wherein the entered question-answer pair type data is subjected to word segmentation and vectorization to construct a question vector library; the entered document type data is sliced, and the sliced documents are segmented and vectorized in units of paragraphs to construct a document vector library; the vector libraries for the question-answer pair type data and the document type data are both constructed based on faiss, so as to ensure consistency of the recall standards for the two types of question answering.
5. The question-answer retrieval method for fusing document knowledge and question-answer data according to claim 4, wherein in the question retrieval phase, the system attempts to retrieve in the vector library of question-answer pair type data, and if the similarity is higher than a predetermined threshold, the answer to the similar question is returned directly.
6. The question-answer retrieval method for fusing document knowledge and question-answer data according to claim 5, wherein if a question is not hit in the vector library of question-answer pair type data, the system attempts to retrieve in the vector library of document type data; if the question is missed in both the vector library of question-answer pair type data and that of document type data, the system recommends questions and returns a list of similar questions.
7. The question-answer retrieval method for fusing document knowledge and question-answer data according to claim 6, wherein, when receiving the question-answer pair type data and the document type data input by the front-end interactive interface, a user enters the corresponding knowledge from the provided front-end interactive interface, the knowledge comprising document type data and question-answer pair type data, with separate interfaces provided for entering each; the question-answer pair type data is marked as qa and the document question-answer type data is marked as dqa; after the question-answer pair type data is uploaded, a mapping file is constructed from the questions and the corresponding answers, the structure being two groups of dictionary type data.
8. The question-answer retrieval method of fusing document knowledge and question-answer data according to claim 7, wherein after the data storage is completed, constructing a vector library comprises:
Performing word segmentation on all the question texts, where the word segmentation tool is jieba or HanLP;
performing word embedding at the sentence level based on Bert, representing each question as a vector of size 512, and thereby completing text word segmentation and vectorization;
after converting the questions into vectors, constructing a vector index library based on faiss;
wherein the question-answer pairs can be entered one by one through an interface or collated into a document for batch import.
9. The question-answer retrieval method for fusing document knowledge and question-answer data according to claim 8, wherein the system generates an ID by default after a document is uploaded and names the document with dqa_id, and the document is segmented, the segmentation being performed at the character level; the number of characters per segment is set through a configuration file provided by the system and can be set by the user according to the size and type of the document; after the document is segmented, a knowledge base is constructed, and key-value pairs, namely { paragraph ID, paragraph }, are stored in dictionary form in the knowledge base; after the document is segmented, a word segmentation tool is used to segment the words, stop words are removed, vectorization is performed based on Bert, the segmented paragraphs are represented as vectors, and a word vector library is constructed and stored based on faiss.
10. A question-answer retrieval system for fusing document knowledge and question-answer data, for executing the question-answer retrieval method for fusing document knowledge and question-answer data of any one of claims 1 to 9, the question-answer retrieval system for fusing document knowledge and question-answer data comprising:
the data uploading module is used for receiving question-answer pair type data and document type data input by the front-end interactive interface;
the vector library construction module is used for fusing the question-answer pair type data and the document type data and numbering them respectively so as to distinguish the vector libraries; the questions in the question-answer pair type data are stored as index vectors, and the answers are stored in a knowledge base; the uploaded documents in the document type data are first sliced, and the sliced documents are uniformly stored in a vector database;
the knowledge query module is used for acquiring the question input by the user and retrieving the question-answer pair type data; when the retrieval similarity is higher than a threshold value, the content of the question-answer pair is returned directly without performing recall ordering on the vector library storing the documents; when the retrieval similarity is lower than the threshold value, the document-type storage vector library is retrieved and the document content is returned after a successful retrieval;
and the recall module is used for performing question recommendation when neither the question-answer pair type data nor the document type data is hit, and returning a plurality of questions with high similarity to the front-end interactive interface for the user to select the most suitable question.
CN202311397843.5A 2023-10-26 2023-10-26 Question-answer retrieval method and system for integrating document knowledge and question-answer data Pending CN117407505A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311397843.5A CN117407505A (en) 2023-10-26 2023-10-26 Question-answer retrieval method and system for integrating document knowledge and question-answer data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311397843.5A CN117407505A (en) 2023-10-26 2023-10-26 Question-answer retrieval method and system for integrating document knowledge and question-answer data

Publications (1)

Publication Number Publication Date
CN117407505A true CN117407505A (en) 2024-01-16

Family

ID=89499604

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311397843.5A Pending CN117407505A (en) 2023-10-26 2023-10-26 Question-answer retrieval method and system for integrating document knowledge and question-answer data

Country Status (1)

Country Link
CN (1) CN117407505A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117875433A (en) * 2024-03-12 2024-04-12 科沃斯家用机器人有限公司 Question answering method, device, equipment and readable storage medium
CN117875433B (en) * 2024-03-12 2024-06-07 科沃斯家用机器人有限公司 Question answering method, device, equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN111241241B (en) Case retrieval method, device, equipment and storage medium based on knowledge graph
CN109240901B (en) Performance analysis method, performance analysis device, storage medium, and electronic apparatus
CN112307762B (en) Search result sorting method and device, storage medium and electronic device
US20190129942A1 (en) Methods and systems for automatically generating reports from search results
CN108108426B (en) Understanding method and device for natural language question and electronic equipment
CN108846138B (en) Question classification model construction method, device and medium fusing answer information
CN110059085B (en) Web 2.0-oriented JSON data analysis and modeling method
CN116521893A (en) Control method and control device of intelligent dialogue system and electronic equipment
CN112883030A (en) Data collection method and device, computer equipment and storage medium
CN117407505A (en) Question-answer retrieval method and system for integrating document knowledge and question-answer data
CN110895586A (en) Method and device for generating news page, computer equipment and storage medium
CN110442702A (en) Searching method, device, readable storage medium storing program for executing and electronic equipment
CN110737432A (en) script aided design method and device based on root list
CN116226494B (en) Crawler system and method for information search
CN112966076A (en) Intelligent question and answer generating method and device, computer equipment and storage medium
CN112202889A (en) Information pushing method and device and storage medium
CN117150107A (en) Recommendation method and device based on knowledge graph, computer equipment and storage medium
CN111859042A (en) Retrieval method and device and electronic equipment
CN107577690B (en) Recommendation method and recommendation device for mass information data
CN110362694A (en) Data in literature search method, equipment and readable storage medium storing program for executing based on artificial intelligence
CN114281983B (en) Hierarchical text classification method, hierarchical text classification system, electronic device and storage medium
CN114417010A (en) Knowledge graph construction method and device for real-time workflow and storage medium
CN114461813A (en) Data pushing method, system and storage medium based on knowledge graph
CN113609166A (en) Search method, search device, computer equipment and computer-readable storage medium
CN110941765A (en) Search intention identification method, information search method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination