CN116595122A - Method, device and equipment for searching computer field document in question-answering system - Google Patents

Method, device and equipment for searching computer field document in question-answering system

Info

Publication number
CN116595122A
Authority
CN
China
Prior art keywords
candidate
document
documents
question
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310338689.8A
Other languages
Chinese (zh)
Inventor
王越
张治国
赵逢波
周成标
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baolande Software Co ltd
Original Assignee
Beijing Baolande Software Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baolande Software Co ltd
Priority to CN202310338689.8A
Publication of CN116595122A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/319 Inverted lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a method, a device and equipment for searching documents in the computer field in a question-answering system, and relates to the technical field of document searching. The method comprises the following steps: querying, based on a query statement, a mixed inverted index corresponding to a plurality of documents in a knowledge base to obtain a candidate document set comprising at least one candidate document; for each candidate document in the candidate document set, determining a candidate matching vector corresponding to the candidate document based on the matching features, in the candidate document, of the keywords corresponding to the query statement; determining the similarity between the candidate document and the query statement based on a preset standard matching vector and the candidate matching vector corresponding to the candidate document; and sorting the candidate documents based on the similarity between each candidate document in the candidate document set and the query statement to obtain a sorting result of the candidate documents. The invention improves the accuracy of technical document retrieval in the computer field.

Description

Method, device and equipment for searching computer field document in question-answering system
Technical Field
The present invention relates to the field of document retrieval technology, and in particular, to a method, an apparatus, and a device for retrieving documents in a computer field in a question-answering system.
Background
Existing question-answering systems in the intelligent operation and maintenance field generally integrate several question-answering functions, such as task-oriented question answering, chit-chat question answering and document question answering. The document question-answering function helps a user search a knowledge base attached to the question-answering system for a required document, using query keywords or key sentences input by the user, and ranks the results from high to low by similarity.
Common retrieval techniques employ either a keyword-based Boolean model or a machine-learning-based vector space model. The vector space model uses a machine learning algorithm to convert the document content and the query statement into feature vectors, and calculates the similarity between the document and the query statement to obtain the query result.
The Boolean model cannot rank documents at a fine granularity, so a further, time-consuming fine-ranking step is required. The upper limit of the vector space model's effectiveness depends on the quality and quantity of manually annotated data, and the model is easily affected by specialized domain vocabulary in the document content, which makes the vector conversion result inaccurate. Existing question-answering systems therefore suffer from low accuracy and low efficiency when retrieving documents in the computer field.
Disclosure of Invention
The invention provides a method, a device and equipment for searching a computer field document in a question-answering system, which are used for solving the problems of low efficiency and low accuracy in the search of the computer field document in the question-answering system in the prior art and improving the search efficiency and accuracy of the computer field document in the question-answering system.
The invention provides a retrieval method of a computer field document in a question-answering system, which comprises the following steps:
based on the query statement, querying a mixed inverted index corresponding to a plurality of documents in a knowledge base to obtain a candidate document set comprising at least one candidate document;
for each candidate document in the candidate document set, determining a candidate matching vector corresponding to the candidate document based on the matching characteristics of the keywords corresponding to the query sentences in the candidate document;
determining the similarity between the candidate document and the query sentence based on a preset standard matching vector and a candidate matching vector corresponding to the candidate document;
and sorting the candidate documents based on the similarity between the candidate documents and the query statement in the candidate document set to obtain a sorting result of the candidate documents.
According to the method for searching the computer field document in the question-answering system, provided by the invention, the matching characteristics comprise title characteristics and text characteristics;
the determining a candidate matching vector corresponding to the candidate document based on the matching characteristics of the keywords corresponding to the query sentence in the candidate document comprises:
and determining candidate matching vectors corresponding to the candidate documents based on the title features and the text features of the keywords corresponding to the query sentences in the candidate documents and the title weight coefficients and the text weight coefficients.
According to the method for searching the computer field document in the question-answering system, which is provided by the invention, the method further comprises the following steps:
adding a pre-acquired keyword dictionary in the computer field into the tokenizer (word segmenter) of Elasticsearch;
automatically constructing, through Elasticsearch, a mixed inverted index for a plurality of documents in the knowledge base based on the query statement;
wherein the mixed inverted index comprises: conventional inverted indexes and inverted indexes based on the computer domain keyword dictionary.
According to the method for searching the computer field document in the question-answering system, which is provided by the invention, the method further comprises the following steps:
collecting keywords in the history document by using a TF-IDF algorithm;
collecting key phrases in the history document by using an SIFRank algorithm;
and constructing the preset computer domain keyword dictionary based on the keywords and the key phrases.
According to the method for searching the computer field document in the question-answering system, which is provided by the invention, the method further comprises the following steps:
preprocessing the history document; the preprocessing includes deleting code fragments and structured query language SQL statements in the history document.
According to the method for searching the computer field document in the question-answering system, which is provided by the invention, the method further comprises the following steps:
and updating the sorting result based on the updating time and/or the clicking times of each candidate document in the candidate document set.
According to the method for searching the documents in the computer field in the question-answering system provided by the invention, the mixed inverted index corresponding to the documents in the query knowledge base comprises the following steps:
and based on the mixed inverted index, using Elasticsearch to perform distributed cluster search on a plurality of documents in the knowledge base.
The invention also provides a retrieval device of the computer field document in the question-answering system, which comprises:
the document set acquisition module is used for querying, based on a query statement, a mixed inverted index corresponding to a plurality of documents in the knowledge base to obtain a candidate document set comprising at least one candidate document;
the matching vector determining module is used for determining, for each candidate document in the candidate document set, a candidate matching vector corresponding to the candidate document based on the matching features, in the candidate document, of the keywords corresponding to the query statement;
the similarity determining module is used for determining the similarity between the candidate document and the query statement based on a preset standard matching vector and a candidate matching vector corresponding to the candidate document;
and the document ordering module is used for ordering the candidate documents based on the similarity between the candidate documents in the candidate document set and the query statement to obtain an ordering result of the candidate documents.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor realizes the searching method of the computer field document in the question-answering system when executing the program.
The invention also provides a non-transitory computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the method for searching documents in the computer domain in the question-answering system.
According to the method, the device and the equipment for searching documents in the computer field in the question-answering system, a mixed inverted index corresponding to a plurality of documents in the knowledge base is queried with a query statement to obtain a candidate document set; for each candidate document in the candidate document set, a candidate matching vector is determined based on the matching features, in the candidate document, of the keywords corresponding to the query statement; the similarity between the candidate document and the query statement is determined based on a preset standard matching vector and the candidate matching vector corresponding to the candidate document; and the candidate documents are sorted based on the similarity and the sorting result is output. By designing the document matching vector method, the coarse screening results of the mixed inverted index are re-ranked at a finer granularity, which improves the accuracy of document retrieval.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a method for searching documents in the computer domain in the question-answering system provided by the invention;
FIG. 2 is a schematic diagram of the results of a method for retrieving documents in the computer domain in the question-answering system provided by the invention;
FIG. 3 is a second schematic diagram of the results of the method for retrieving documents in the computer domain in the question-answering system provided by the present invention;
FIG. 4 is a schematic diagram of a search device for documents in the computer domain in the question-answering system provided by the invention;
FIG. 5 is a schematic diagram of the physical structure of the electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The method, apparatus and device for searching documents in computer domain in question-answering system of the present invention are described below with reference to fig. 1 to 5.
FIG. 1 is a schematic flow chart of the method for searching documents in the computer field in the question-answering system. As shown in FIG. 1, the method comprises:
S1, querying, based on a query statement, a mixed inverted index corresponding to a plurality of documents in a knowledge base to obtain a candidate document set comprising at least one candidate document;
In a specific implementation, the query statement that the user types into the input box of the question-answering system is obtained; it is typically a number of keywords or a query sentence. Unlike conventional document retrieval systems, which support the AND, OR and NOT logic of Boolean queries, the question-answering system supports only AND and OR logic. Examples of query statements are "H5 System architecture diagram" and "H5, BOSS, interface".
S2, for each candidate document in the candidate document set, determining a candidate matching vector corresponding to the candidate document based on the matching features, in the candidate document, of the keywords corresponding to the query statement;
S3, determining the similarity between the candidate document and the query statement based on a preset standard matching vector and the candidate matching vector corresponding to the candidate document;
In a specific implementation, the similarity may be calculated using cosine similarity, the Jaccard similarity coefficient, the Pearson correlation coefficient, and the like.
S4, sorting the candidate documents based on the similarity between each candidate document in the candidate document set and the query statement to obtain a sorting result of the candidate documents.
In a specific implementation, the candidate documents are returned to the front end of the question-answering system in the sorted order, and the front-end interface highlights the words in the displayed documents that match the query statement.
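Step S3 names three possible similarity measures. As a minimal sketch only, they could be written for two equal-length numeric vectors as below. The function names are illustrative, and the Jaccard variant shown is the generalized (weighted) form for non-negative real-valued vectors, which is an assumption; the source does not say which form is intended.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def weighted_jaccard(a, b):
    """Generalized Jaccard for non-negative vectors: sum(min) / sum(max).
    Assumption: the source only says 'Jaccard similarity coefficient'."""
    denom = sum(max(x, y) for x, y in zip(a, b))
    if denom == 0:
        return 0.0
    return sum(min(x, y) for x, y in zip(a, b)) / denom

def pearson_correlation(a, b):
    """Pearson correlation coefficient (0.0 if either vector is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)
```

Note that cosine similarity and the weighted Jaccard form stay within 0 to 1 for non-negative vectors, whereas the Pearson coefficient can be negative.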
According to the method for searching documents in the computer field in the question-answering system provided by this embodiment, a mixed inverted index corresponding to a plurality of documents in the knowledge base is queried with a query statement to obtain a candidate document set; for each candidate document in the candidate document set, a candidate matching vector is determined based on the matching features, in the candidate document, of the keywords corresponding to the query statement; the similarity between the candidate document and the query statement is determined based on a preset standard matching vector and the candidate matching vector corresponding to the candidate document; and the candidate documents are sorted based on the similarity and the sorting result is output. By designing the document matching vector method, the coarse screening results of the mixed inverted index are re-ranked at a finer granularity, which improves the accuracy of document retrieval.
In an optional embodiment, in the method for searching a computer domain document in the question-answering system, the matching features include a title feature and a text feature;
In a specific implementation, following the principle of Boolean queries and the way keywords in the query statement match a document, six-dimensional matching features are designed for the candidate documents, as shown in Table 1:
TABLE 1
The title of a candidate document is treated differently from the document body; the title comprises both the title of the document and the title of each chapter. Features 1, 2 and 3 are "title" features, and features 4, 5 and 6 are "body" features.
Feature 1 indicates whether the keywords in the query statement appear in the title. This is the Boolean model: the value is 1 if they appear and 0 if they do not.
Feature 2, the matching ratio (title), is the ratio of the number of query words matched in the title to the total number of query words. It represents how many of the terms in the query statement appear in the title; considered on its own, the higher this value, the higher the document should rank.
Feature 3, the number of matching words (title), is the total number of word matches in the title. This value is used for further fine-grained ranking: when feature 1 and feature 2 have the same values, feature 3 is used to discriminate further. The more matches, the higher the rank. To keep every feature normalized, in a specific implementation the number of matches is mapped according to the rules shown in Table 2:
TABLE 2
The meanings of features 4, 5 and 6 are similar to those of the corresponding first three features. In feature 6, however, the mapping of the number of matches differs slightly from that of feature 3; it can be treated as a hyperparameter and adjusted freely according to the actual use case.
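The six features described above could be computed roughly as in the following sketch. It is an illustration under stated assumptions, not the patented implementation: the rules of Table 2 are not reproduced in this text, so the count normalization here (`min(count, cap) / cap`, with hypothetical caps `title_cap` and `body_cap`) is a stand-in, and keyword matching is simplified to case-insensitive substring counting.

```python
def field_features(keywords, text, cap):
    """Return [presence, matching ratio, normalized match count] for one field."""
    lowered = text.lower()
    counts = [lowered.count(k.lower()) for k in keywords]
    matched = sum(1 for c in counts if c > 0)
    presence = 1.0 if matched > 0 else 0.0                 # features 1 / 4
    ratio = matched / len(keywords) if keywords else 0.0   # features 2 / 5
    # Hypothetical normalization standing in for the Table 2 rules.
    normalized_count = min(sum(counts), cap) / cap         # features 3 / 6
    return [presence, ratio, normalized_count]

def matching_features(keywords, title, body, title_cap=5, body_cap=10):
    """Six-dimensional raw (unweighted) feature list: title features, then body."""
    return (field_features(keywords, title, title_cap)
            + field_features(keywords, body, body_cap))

features = matching_features(
    ["H5", "BOSS", "interface"],
    title="H5 system architecture",
    body="The BOSS interface calls the H5 interface.",
)
```

The separate caps for title and body mirror the remark that the count mapping for feature 6 may differ from that of feature 3.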
Based on the matching characteristics of the keywords corresponding to the query sentences in the candidate documents, determining candidate matching vectors corresponding to the candidate documents comprises the following steps:
and determining candidate matching vectors corresponding to the candidate documents based on the title features and the text features of the keywords corresponding to the query sentences in the candidate documents and the title weight coefficients and the text weight coefficients.
In a specific implementation, analysis of users' query habits shows that users frequently query with keywords of a summarizing nature, so title terms account for a large share of the matches. The document is therefore divided into two fields, title and body, which carry different weights. The values of the first three features are multiplied by a weight coefficient of 0.55, and the values of the last three features are multiplied by a weight coefficient of 0.45.
In a specific implementation, determining the similarity between the candidate document and the query statement based on a preset standard matching vector and the candidate matching vector corresponding to the candidate document means that the query statement input by the user and the candidate document form a matching vector, whose similarity to the standard vector is then calculated. Table 3 illustrates how the standard matching vector is constructed:
TABLE 3
The original feature values represent the case in which the query statement matches strongly in every dimension, so each value is 1: the query keywords all appear in both the title and the body, and the number of matches reaches the highest configured matching value. The original feature values are unweighted; multiplying them by the weight coefficients 0.55 and 0.45 gives the final standard matching vector: [0.55, 0.55, 0.55, 0.45, 0.45, 0.45].
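The weighting step and the standard matching vector can be sketched as follows. The weights 0.55 and 0.45 come from the description; the helper name `apply_weights` is illustrative.

```python
TITLE_WEIGHT = 0.55  # applied to features 1-3 (title)
BODY_WEIGHT = 0.45   # applied to features 4-6 (body)

def apply_weights(raw):
    """Scale a six-dimensional raw feature list by the title and body weights."""
    if len(raw) != 6:
        raise ValueError("expected six features: three for title, three for body")
    return ([v * TITLE_WEIGHT for v in raw[:3]]
            + [v * BODY_WEIGHT for v in raw[3:]])

# The standard matching vector is the weighted form of an all-ones feature list.
standard_vector = apply_weights([1, 1, 1, 1, 1, 1])
```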
In a specific implementation, assume that the candidate document set contains 6 candidate documents. Following the feature construction described above, the query statement input by the user forms one matching vector with each of the 6 candidate documents in turn. The cosine similarity between each of the 6 matching vectors and the standard matching vector is then calculated, and the documents are sorted from high to low. To verify that the feature design is reasonable, take Table 4 as an example: the first column describes the degree of matching between the query statement and the candidate document, the second column is the generated candidate matching vector, and the third column is the cosine similarity calculated against the standard matching vector.
TABLE 4
The settings in Table 2 correspond, for example, to the following calculation:
the first 3 features are multiplied by the weight coefficient 0.55: (1, 0.5, 0.6) × 0.55 = (0.55, 0.275, 0.33);
the last 3 features are multiplied by the weight coefficient 0.45: (1, 0.5, 0.6) × 0.45 = (0.45, 0.225, 0.27);
the combined candidate matching vector is [0.55, 0.275, 0.33, 0.45, 0.225, 0.27];
the remaining 5 original vectors are as follows, with the first three components of each multiplied by 0.55 and the last three multiplied by 0.45: [1, 0.3, 0.2, 1, 0.3, 0.2], [1, 1, 1, 0, 0, 0], [1, 0.5, 0.6, 0, 0, 0], [0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 0.5, 0.6].
In this embodiment, the similarity between each candidate matching vector and the standard matching vector is calculated with cosine similarity. As shown in Table 4, the similarity value lies between 0 and 1, and the closer it is to 1, the higher the degree of matching between the query and the document.
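As a worked sketch of this ranking step, the six original vectors listed above can be weighted, compared with the standard matching vector, and sorted. The similarity values of Table 4 are not reproduced in this text, so the scores below are simply what this code computes; the helper names are illustrative.

```python
import math

TITLE_WEIGHT, BODY_WEIGHT = 0.55, 0.45

def weight(raw):
    """Multiply the three title features by 0.55 and the three body features by 0.45."""
    return ([v * TITLE_WEIGHT for v in raw[:3]]
            + [v * BODY_WEIGHT for v in raw[3:]])

def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

standard = weight([1, 1, 1, 1, 1, 1])

# The six original (unweighted) vectors from the description.
raw_vectors = [
    [1, 0.5, 0.6, 1, 0.5, 0.6],
    [1, 0.3, 0.2, 1, 0.3, 0.2],
    [1, 1, 1, 0, 0, 0],
    [1, 0.5, 0.6, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 0.5, 0.6],
]

scores = [cosine(weight(raw), standard) for raw in raw_vectors]
ranking = sorted(range(len(raw_vectors)), key=lambda i: scores[i], reverse=True)

for i in ranking:
    print(raw_vectors[i], round(scores[i], 4))
```

With these weights, a perfect title-only match scores higher than a perfect body-only match, which reflects the larger title weight.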
In the embodiment, the cosine similarity is calculated through the matching features in the designed documents, so that the candidate documents can be subjected to fine granularity sorting, and the accuracy of the document retrieval output result is improved.
In an optional embodiment, the method for searching the computer domain document in the question-answering system further includes:
adding a pre-acquired keyword dictionary in the computer field into the tokenizer (word segmenter) of the distributed search and analysis engine Elasticsearch;
automatically constructing, through the distributed search and analysis engine Elasticsearch, a mixed inverted index for a plurality of documents in the knowledge base based on the query statement;
wherein the mixed inverted index comprises: a conventional inverted index and an inverted index based on the keyword dictionary in the computer field.
In this embodiment, the mixed inverted index, comprising the conventional inverted index and the inverted index based on the preset keyword dictionary in the computer field, is constructed automatically through the distributed search and analysis engine Elasticsearch, which improves search efficiency.
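To make the idea of a mixed inverted index concrete, here is a small in-memory sketch in plain Python. It does not use Elasticsearch and is not how Elasticsearch builds its indexes; it only illustrates combining a conventional word-level index with a second index keyed by domain dictionary phrases, and answering AND/OR queries over both. All names and sample documents are hypothetical.

```python
import re
from collections import defaultdict

def build_mixed_index(documents, domain_dictionary):
    """Return (conventional index, domain-phrase index), each mapping term -> doc ids."""
    conventional = defaultdict(set)
    domain = defaultdict(set)
    for doc_id, text in documents.items():
        lowered = text.lower()
        for token in re.findall(r"\w+", lowered):      # plain word tokens
            conventional[token].add(doc_id)
        for phrase in domain_dictionary:               # multi-word domain terms
            if phrase.lower() in lowered:
                domain[phrase.lower()].add(doc_id)
    return conventional, domain

def search(query_terms, conventional, domain, mode="OR"):
    """Look each term up in both indexes and combine postings with AND or OR."""
    postings = []
    for term in query_terms:
        key = term.lower()
        postings.append(conventional.get(key, set()) | domain.get(key, set()))
    if not postings:
        return set()
    if mode == "AND":
        return set.intersection(*postings)
    return set.union(*postings)

documents = {
    1: "H5 system architecture diagram",
    2: "BOSS interface specification",
    3: "H5 page calls the BOSS interface",
}
conventional, domain = build_mixed_index(
    documents, ["system architecture", "BOSS interface"]
)
and_hits = search(["H5", "BOSS interface"], conventional, domain, mode="AND")
or_hits = search(["H5", "BOSS interface"], conventional, domain, mode="OR")
```

The phrase "BOSS interface" is only findable as a single term because it is in the domain dictionary; the conventional index alone would split it into two words.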
In an optional embodiment, the method for searching the computer domain document in the question-answering system further includes:
collecting keywords in the history document by using a TF-IDF algorithm;
collecting key phrases in the history document by using an SIFRank algorithm;
and constructing the preset computer domain keyword dictionary based on the keywords and the key phrases.
In a specific implementation, TF in the TF-IDF algorithm denotes term frequency; it is used to balance the influence of word counts and sentence lengths. IDF denotes inverse document frequency and can be understood as the amount of information carried by each word: the less often a word occurs, the larger its IDF value. The final TF-IDF value is TF × IDF. The specific formula is as follows:
FIG. 2 is a schematic diagram of the results of the method for searching documents in the computer field in the question-answering system; it shows the result of extracting keywords from the knowledge base with the TF-IDF algorithm.
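As a rough illustration of TF-IDF keyword scoring, the sketch below uses one common textbook definition, TF = count / document length and IDF = log(N / document frequency). This is an assumption: the patent's own formula is not reproduced in this text and may differ (for example, by smoothing). Documents are assumed to be already tokenized.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))          # count each term once per document
    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

docs = [
    ["redis", "cache", "cache"],
    ["redis", "cluster"],
    ["redis", "index"],
]
scores = tf_idf(docs)
top_keyword = max(scores[0], key=scores[0].get)
```

A word that occurs in every document, such as "redis" here, receives a score of zero under this definition, so it would not be collected as a keyword.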
Key phrases are collected with the SIFRank algorithm. The algorithm first performs word segmentation and part-of-speech tagging on the sentences, and then identifies noun phrases using regular expressions.
FIG. 3 is a second schematic diagram of the results of the method for searching documents in the computer field in the question-answering system; it shows the result of extracting key phrases from the knowledge base with the SIFRank algorithm.
In a specific implementation, the top-ranked phrases are selected and added to the dictionary.
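The noun-phrase identification step described above (tagging first, then a regular expression over the tags) can be sketched as follows. This covers candidate extraction only; the scoring and ranking stage of SIFRank is not shown. The input is assumed to be already segmented and tagged, the Penn Treebank style tags (`NN*`, `JJ*`) and the pattern "adjectives followed by nouns" are illustrative assumptions, and a Chinese tagger would use a different tag set.

```python
import re

def noun_phrase_candidates(tagged_tokens):
    """Find runs of optional adjectives followed by one or more nouns."""
    def code(tag):
        if tag.startswith("NN"):
            return "N"      # noun
        if tag.startswith("JJ"):
            return "J"      # adjective
        return "O"          # anything else
    # One character per token, so regex match offsets equal token offsets.
    tag_string = "".join(code(tag) for _, tag in tagged_tokens)
    phrases = []
    for match in re.finditer(r"J*N+", tag_string):
        words = [word for word, _ in tagged_tokens[match.start():match.end()]]
        phrases.append(" ".join(words))
    return phrases

tagged = [
    ("the", "DT"), ("distributed", "JJ"), ("search", "NN"), ("engine", "NN"),
    ("builds", "VBZ"), ("an", "DT"), ("inverted", "JJ"), ("index", "NN"),
]
candidates = noun_phrase_candidates(tagged)
```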
In the embodiment, through the preset processing, the keyword dictionary of the computer field document is established more scientifically, so that the subsequent retrieval processing is facilitated, and the accuracy of the computer field document retrieval is improved.
In an optional embodiment, the method for searching the computer domain document in the question-answering system further includes:
preprocessing a history document; preprocessing includes deleting code fragments and structured query language SQL statements in the history document.
In this embodiment, preprocessing the history documents reduces the influence of code fragments and SQL statements on subsequent vectorization, and improves the accuracy of subsequent document retrieval in the computer field.
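One possible shape for this preprocessing is sketched below. The source does not say how code fragments or SQL statements are detected, so both heuristics are assumptions: code is taken to be wrapped in `<pre>...</pre>` tags, and an SQL statement is taken to be a single line that starts with a common SQL keyword and ends with a semicolon.

```python
import re

# Assumption: code fragments are delimited by <pre> ... </pre> in the source text.
CODE_BLOCK = re.compile(r"<pre>.*?</pre>", re.DOTALL | re.IGNORECASE)

# Assumption: an SQL statement occupies one line, starts with a keyword, ends with ';'.
SQL_LINE = re.compile(
    r"^[ \t]*(?:SELECT|INSERT|UPDATE|DELETE|CREATE|ALTER|DROP)\b.*;[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)

def preprocess(text):
    """Remove code blocks and SQL lines, then drop the blank lines left behind."""
    text = CODE_BLOCK.sub("", text)
    text = SQL_LINE.sub("", text)
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)

sample = (
    "Restart the service after changing the pool size.\n"
    "<pre>pool.setMaxSize(20);</pre>\n"
    "SELECT * FROM users WHERE id = 1;\n"
    "The change takes effect immediately."
)
cleaned = preprocess(sample)
```

Requiring the trailing semicolon keeps ordinary prose lines that merely begin with a word such as "Update" from being deleted.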
In an optional embodiment, the method for searching the computer domain document in the question-answering system further includes:
and updating the sorting result based on the updating time and/or the clicking times of each candidate document in the candidate document set.
In the embodiment, based on two dimensions of the update time and the click times of the candidate documents, the candidate documents are further optimized and ordered, and the document retrieval result in the computer field is optimized.
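The source does not specify how update time and click count change the ordering. One simple possibility, shown purely as an assumption, is to use them as tie-breakers: documents whose similarity is equal after rounding are ordered by click count and then by most recent update. The `Candidate` fields and the `precision` parameter are hypothetical.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Candidate:
    doc_id: str
    similarity: float
    clicks: int
    updated: date

def rerank(candidates, precision=2):
    """Sort by rounded similarity, then clicks, then update date, all descending."""
    return sorted(
        candidates,
        key=lambda c: (round(c.similarity, precision), c.clicks, c.updated),
        reverse=True,
    )

candidates = [
    Candidate("a", 0.956, clicks=3, updated=date(2023, 1, 10)),
    Candidate("b", 0.958, clicks=10, updated=date(2022, 5, 1)),
    Candidate("c", 0.810, clicks=50, updated=date(2023, 3, 1)),
]
order = [c.doc_id for c in rerank(candidates)]
```

In this design, popularity and freshness never override a clear difference in similarity; they only separate near-ties.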
In an optional embodiment, in the method for searching a computer domain document in the question-answering system, the mixed inverted index corresponding to a plurality of documents in the query knowledge base includes:
based on the mixed inverted index, using the distributed search and analysis engine Elasticsearch to perform distributed cluster search on a plurality of documents in the knowledge base.
In this embodiment, the distributed, highly scalable and highly real-time characteristics of the distributed search and analysis engine Elasticsearch are used to search the distributed cluster quickly, which improves the efficiency of document retrieval in the computer field and broadens the application scenarios.
FIG. 4 is a schematic structural diagram of the retrieval device for documents in the computer field in the question-answering system provided by the present invention. As shown in FIG. 4, the device includes:
a document set obtaining module 41, configured to query, based on the query statement, a mixed inverted index corresponding to a plurality of documents in the knowledge base, to obtain a candidate document set including at least one candidate document;
a matching vector determining module 42, configured to determine, for each candidate document in the candidate document set, a candidate matching vector corresponding to the candidate document based on the matching features, in the candidate document, of the keywords corresponding to the query statement;
a similarity determining module 43, configured to determine a similarity between the candidate document and the query sentence based on a preset standard matching vector and a candidate matching vector corresponding to the candidate document;
the document ordering module 44 is configured to order the candidate documents based on the similarity between each candidate document in the candidate document set and the query sentence, so as to obtain an ordering result of each candidate document.
In the retrieval device for documents in the computer field in the question-answering system, the modules cooperate as follows: the document set acquisition module queries, based on a query statement, a mixed inverted index corresponding to a plurality of documents in the knowledge base to obtain a candidate document set; the matching vector determining module determines, for each candidate document in the candidate document set, a candidate matching vector based on the matching features, in the candidate document, of the keywords corresponding to the query statement; the similarity determining module determines the similarity between the candidate document and the query statement based on a preset standard matching vector and the candidate matching vector corresponding to the candidate document; and the document ordering module sorts the candidate documents based on the similarity and outputs the sorting result. The coarse screening results of the mixed inverted index are thereby re-ranked at a finer granularity, which improves the accuracy of document retrieval.
FIG. 5 is a schematic diagram of the physical structure of an electronic device according to the present invention. As shown in FIG. 5, the electronic device may include: a processor 510, a communication interface (Communications Interface) 520, a memory 530 and a communication bus 540, wherein the processor 510, the communication interface 520 and the memory 530 communicate with each other through the communication bus 540. The processor 510 may invoke logic instructions in the memory 530 to perform a method for searching documents in the computer field in a question-answering system, the method comprising: querying, based on a query statement, a mixed inverted index corresponding to a plurality of documents in a knowledge base to obtain a candidate document set comprising at least one candidate document; for each candidate document in the candidate document set, determining a candidate matching vector corresponding to the candidate document based on the matching features, in the candidate document, of the keywords corresponding to the query statement; determining the similarity between the candidate document and the query statement based on a preset standard matching vector and the candidate matching vector corresponding to the candidate document; and sorting the candidate documents based on the similarity between each candidate document in the candidate document set and the query statement to obtain a sorting result of the candidate documents.
Further, the logic instructions in the memory 530 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part thereof that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and comprises several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method for retrieving computer-domain documents in a question-answering system provided above, the method comprising: based on the query statement, querying a mixed inverted index corresponding to a plurality of documents in a knowledge base to obtain a candidate document set comprising at least one candidate document; for each candidate document in the candidate document set, determining a candidate matching vector corresponding to the candidate document based on the matching features, in the candidate document, of the keywords corresponding to the query statement; determining the similarity between the candidate document and the query statement based on a preset standard matching vector and the candidate matching vector corresponding to the candidate document; and sorting the candidate documents in the candidate document set based on the similarity between each candidate document and the query statement to obtain a sorting result of the candidate documents.
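As a non-limiting illustration of the first step, the following Python sketch builds a small in-memory mixed inverted index and queries it to obtain a candidate document set. It is a stand-in under stated assumptions and does not use Elasticsearch: the dictionary contents, the substring-based phrase matching, the union-based candidate selection, and all names (DOMAIN_DICTIONARY, build_mixed_index, query_candidates) are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical computer-domain keyword dictionary (illustrative entries only).
DOMAIN_DICTIONARY = {"connection pool", "thread dump", "out of memory"}


def tokenize(text):
    """Conventional tokenization: lowercase alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def domain_terms(text):
    """Dictionary phrases that occur in the text (simple substring match)."""
    lowered = text.lower()
    return {phrase for phrase in DOMAIN_DICTIONARY if phrase in lowered}


def build_mixed_index(documents):
    """Return (conventional_index, domain_index), each mapping term -> doc ids."""
    conventional, domain = defaultdict(set), defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            conventional[token].add(doc_id)
        for phrase in domain_terms(text):
            domain[phrase].add(doc_id)
    return conventional, domain


def query_candidates(query, conventional, domain):
    """Coarse screening: union of postings hit by query tokens or domain phrases."""
    candidates = set()
    for token in tokenize(query):
        candidates |= conventional.get(token, set())
    for phrase in domain_terms(query):
        candidates |= domain.get(phrase, set())
    return candidates
```

The candidate set returned by query_candidates is deliberately broad; the finer-grained ordering is left to the subsequent matching-vector and similarity steps.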
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or of course by means of hardware. Based on this understanding, the foregoing technical solution, in essence, or the part thereof that contributes to the prior art, may be embodied in the form of a software product. The software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and that such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for searching a computer field document in a question-answering system is characterized by comprising the following steps:
based on the query statement, querying a mixed inverted index corresponding to a plurality of documents in a knowledge base to obtain a candidate document set comprising at least one candidate document;
for each candidate document in the candidate document set, determining a candidate matching vector corresponding to the candidate document based on the matching features, in the candidate document, of the keywords corresponding to the query statement;
determining the similarity between the candidate document and the query statement based on a preset standard matching vector and the candidate matching vector corresponding to the candidate document;
and sorting the candidate documents based on the similarity between the candidate documents and the query statement in the candidate document set to obtain a sorting result of the candidate documents.
2. The method for retrieving documents in a computer domain in a question-answering system according to claim 1, wherein the matching features include a title feature and a text feature;
the determining a candidate matching vector corresponding to the candidate document based on the matching characteristics of the keywords corresponding to the query sentence in the candidate document comprises:
determining the candidate matching vector corresponding to the candidate document based on the title features and the text features, in the candidate document, of the keywords corresponding to the query statement, together with a title weight coefficient and a text weight coefficient.
3. The method for retrieving documents in a computer domain in a question-answering system according to claim 1, further comprising:
adding a pre-acquired computer-domain keyword dictionary to the word segmenter (tokenizer) of Elasticsearch;
automatically constructing, through Elasticsearch, a mixed inverted index for a plurality of documents in the knowledge base based on the query statement;
wherein the mixed inverted index comprises: conventional inverted indexes and inverted indexes based on the computer domain keyword dictionary.
4. A method for retrieving computer domain documents in a question-answering system according to claim 3, wherein the method further comprises:
collecting keywords in the history document by using a TF-IDF algorithm;
collecting key phrases in the history document by using an SIFRank algorithm;
and constructing the preset computer domain keyword dictionary based on the keywords and the key phrases.
5. The method for retrieving documents in a computer domain in a question-answering system according to claim 4, further comprising:
preprocessing the history document; the preprocessing includes deleting code fragments and structured query language SQL statements in the history document.
6. The method for retrieving documents in a computer domain in a question-answering system according to claim 1, further comprising:
updating the sorting result based on the update time and/or the number of clicks of each candidate document in the candidate document set.
7. The method for retrieving documents in computer domain in question-answering system according to claim 1, wherein the querying the mixed inverted index corresponding to the plurality of documents in the knowledge base includes:
performing, based on the mixed inverted index, a distributed cluster search over the plurality of documents in the knowledge base using Elasticsearch.
8. A search device for a computer domain document in a question-answering system, comprising:
the document set acquisition module is used for querying, based on a query statement, a mixed inverted index corresponding to a plurality of documents in the knowledge base to obtain a candidate document set comprising at least one candidate document;
the matching vector determining module is used for determining, for each candidate document in the candidate document set, a candidate matching vector corresponding to the candidate document according to the matching features, in the candidate document, of the keywords corresponding to the query statement;
the similarity determining module is used for determining the similarity between the candidate document and the query statement based on a preset standard matching vector and a candidate matching vector corresponding to the candidate document;
and the document ordering module is used for ordering the candidate documents based on the similarity between the candidate documents in the candidate document set and the query statement to obtain an ordering result of the candidate documents.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements a method for retrieving a computer domain document in a question-answering system according to any one of claims 1 to 7.
10. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements a method of retrieving a computer domain document in a question-answering system according to any one of claims 1 to 7.
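As a non-limiting illustration of the dictionary construction recited in claims 4 and 5, the following Python sketch preprocesses history documents and collects keywords with a plain TF-IDF score. It is a simplified sketch under stated assumptions: code fragments are assumed to be wrapped in <code>...</code> tags, SQL statements are assumed to start with SELECT, INSERT, UPDATE or DELETE and to end with a semicolon, the key-phrase step (SIFRank) is omitted because it requires a pretrained language model, and all names (preprocess, tfidf_keywords, build_keyword_dictionary) are hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: code fragments are wrapped in <code>...</code> tags.
CODE_FRAGMENT = re.compile(r"<code>.*?</code>", re.DOTALL)
# Assumption: SQL statements start with a DML verb and end with a semicolon.
SQL_STATEMENT = re.compile(
    r"\b(?:select|insert|update|delete)\b[^;]*;", re.IGNORECASE
)


def preprocess(text):
    """Delete code fragments and simple SQL statements from a history document."""
    text = CODE_FRAGMENT.sub(" ", text)
    return SQL_STATEMENT.sub(" ", text)


def tokenize(text):
    """Lowercase alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(documents, top_k=3):
    """Return, per document, the top_k terms ranked by TF-IDF (ties by term)."""
    token_lists = [tokenize(preprocess(doc)) for doc in documents]
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    n_docs = len(documents)
    keywords = []
    for tokens in token_lists:
        counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        keywords.append(sorted(scores, key=lambda t: (-scores[t], t))[:top_k])
    return keywords


def build_keyword_dictionary(documents, top_k=3):
    """Union of the per-document TF-IDF keywords."""
    return {term for kws in tfidf_keywords(documents, top_k) for term in kws}
```

The SQL pattern is intentionally naive: an ordinary sentence that contains the word "update" followed later by a semicolon would also be removed, so a production preprocessing step would need a stricter rule.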
CN202310338689.8A 2023-03-31 2023-03-31 Method, device and equipment for searching computer field document in question-answering system Pending CN116595122A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310338689.8A CN116595122A (en) 2023-03-31 2023-03-31 Method, device and equipment for searching computer field document in question-answering system

Publications (1)

Publication Number Publication Date
CN116595122A true CN116595122A (en) 2023-08-15

Family

ID=87597926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310338689.8A Pending CN116595122A (en) 2023-03-31 2023-03-31 Method, device and equipment for searching computer field document in question-answering system

Country Status (1)

Country Link
CN (1) CN116595122A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination