CN116595122A - Method, device and equipment for searching computer field document in question-answering system

Info

Publication number: CN116595122A
Application number: CN202310338689.8A
Authority: CN
Inventors: 王越; 张治国; 赵逢波; 周成标
Original assignee: Beijing Baolande Software Co ltd
Current assignee: Beijing Baolande Software Co ltd
Priority date: 2023-03-31
Filing date: 2023-03-31
Publication date: 2023-08-15

Abstract

The invention provides a method, a device and equipment for retrieving computer-domain documents in a question-answering system, and relates to the technical field of document retrieval. The method comprises the following steps: based on a query statement, querying a mixed inverted index corresponding to a plurality of documents in a knowledge base to obtain a candidate document set comprising at least one candidate document; for each candidate document in the candidate document set, determining a candidate matching vector corresponding to the candidate document based on the matching features, in the candidate document, of the keywords corresponding to the query statement; determining the similarity between the candidate document and the query statement based on a preset standard matching vector and the candidate matching vector corresponding to the candidate document; and sorting the candidate documents based on the similarity between each candidate document in the candidate document set and the query statement to obtain a sorting result of the candidate documents. The invention improves the accuracy of technical document retrieval in the computer field.

Description

Method, device and equipment for searching computer field document in question-answering system

Technical Field

The present invention relates to the field of document retrieval technology, and in particular, to a method, an apparatus, and a device for retrieving documents in a computer field in a question-answering system.

Background

Existing question-answering systems in the intelligent operation and maintenance field generally integrate several question-answering functions, such as task-oriented question answering, chit-chat question answering and document question answering. The document question-answering function helps a user retrieve a required document from the knowledge base attached to the question-answering system by means of query keywords or key sentences input by the user, and the results are ranked from high to low by similarity.

Common retrieval techniques may employ a boolean model based on keywords or a vector space model based on machine learning. The vector space model adopts a machine learning algorithm to convert the document content and the query statement into feature vectors, and the similarity between the document and the query statement is calculated to obtain a query result.

The Boolean model cannot rank documents at a fine granularity, so a further, time-consuming refined ranking step is required. For the vector space model, the upper limit of the algorithm's effectiveness depends on the quality and quantity of manually annotated data, and the model is easily affected by the professional domain vocabulary in the document content, which makes the vector conversion result inaccurate. Therefore, existing question-answering systems suffer from low accuracy and low efficiency when retrieving documents in the computer field.

Disclosure of Invention

The invention provides a method, a device and equipment for searching a computer field document in a question-answering system, which are used for solving the problems of low efficiency and low accuracy in the search of the computer field document in the question-answering system in the prior art and improving the search efficiency and accuracy of the computer field document in the question-answering system.

The invention provides a retrieval method of a computer field document in a question-answering system, which comprises the following steps:

based on the query statement, querying a mixed inverted index corresponding to a plurality of documents in a knowledge base to obtain a candidate document set comprising at least one candidate document;

for each candidate document in the candidate document set, determining a candidate matching vector corresponding to the candidate document based on the matching characteristics of the keywords corresponding to the query sentences in the candidate document;

determining the similarity between the candidate document and the query sentence based on a preset standard matching vector and a candidate matching vector corresponding to the candidate document;

and sorting the candidate documents based on the similarity between the candidate documents and the query statement in the candidate document set to obtain a sorting result of the candidate documents.

According to the method for searching the computer field document in the question-answering system, provided by the invention, the matching characteristics comprise title characteristics and text characteristics;

the determining a candidate matching vector corresponding to the candidate document based on the matching characteristics of the keywords corresponding to the query sentence in the candidate document comprises:

and determining candidate matching vectors corresponding to the candidate documents based on the title features and the text features of the keywords corresponding to the query sentences in the candidate documents and the title weight coefficients and the text weight coefficients.

According to the method for searching the computer field document in the question-answering system, which is provided by the invention, the method further comprises the following steps:

adding a pre-acquired computer-domain keyword dictionary to the tokenizer (word segmenter) of Elasticsearch;

automatically constructing a mixed inverted index for a plurality of documents in the knowledge base through Elasticsearch based on the query statement;

wherein the mixed inverted index comprises: conventional inverted indexes and inverted indexes based on the computer domain keyword dictionary.

collecting keywords in the history document by using a TF-IDF algorithm;

collecting key phrases in the history document by using an SIFRank algorithm;

and constructing the preset computer domain keyword dictionary based on the keywords and the key phrases.

preprocessing the history document; the preprocessing includes deleting code fragments and structured query language SQL statements in the history document.

and updating the sorting result based on the updating time and/or the clicking times of each candidate document in the candidate document set.

According to the method for searching the documents in the computer field in the question-answering system provided by the invention, the mixed inverted index corresponding to the documents in the query knowledge base comprises the following steps:

and based on the mixed inverted index, using Elasticsearch to perform a distributed cluster search on a plurality of documents in the knowledge base.

The invention also provides a retrieval device of the computer field document in the question-answering system, which comprises:

the document set acquisition module is used for querying, based on a query statement, a mixed inverted index corresponding to a plurality of documents in the knowledge base to obtain a candidate document set comprising at least one candidate document;

the matching vector determining module is used for determining candidate matching vectors corresponding to the candidate documents according to the matching characteristics of the keywords corresponding to the query sentences in the candidate documents aiming at the candidate documents in the candidate document set;

the similarity determining module is used for determining the similarity between the candidate document and the query statement based on a preset standard matching vector and a candidate matching vector corresponding to the candidate document;

and the document ordering module is used for ordering the candidate documents based on the similarity between the candidate documents in the candidate document set and the query statement to obtain an ordering result of the candidate documents.

The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor realizes the searching method of the computer field document in the question-answering system when executing the program.

The invention also provides a non-transitory computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the method for searching documents in the computer domain in the question-answering system.

According to the method, the device and the equipment for retrieving computer-domain documents in the question-answering system, the mixed inverted index corresponding to the documents in the knowledge base is queried with the query statement to obtain a candidate document set; for each candidate document in the candidate document set, a candidate matching vector is determined based on the matching features, in the candidate document, of the keywords corresponding to the query statement; the similarity between the candidate document and the query statement is determined based on a preset standard matching vector and the candidate matching vector corresponding to the candidate document; and the candidate documents are sorted based on the similarity and the sorting result is output. The document matching vector method re-ranks the coarse screening results of the mixed inverted index at a finer granularity, which improves the accuracy of document retrieval.

Drawings

In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow chart of a method for searching documents in the computer domain in the question-answering system provided by the invention;

FIG. 2 is a schematic diagram of the results of a method for retrieving documents in the computer domain in the question-answering system provided by the invention;

FIG. 3 is a second schematic diagram of the results of the method for retrieving documents in the computer domain in the question-answering system provided by the present invention;

FIG. 4 is a schematic diagram of a search device for documents in the computer domain in the question-answering system provided by the invention;

FIG. 5 is a schematic diagram of the physical structure of the electronic device provided by the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The method, apparatus and device for searching documents in computer domain in question-answering system of the present invention are described below with reference to fig. 1 to 5.

FIG. 1 is a schematic flow chart of a method for searching documents in computer domain in question-answering system, as shown in FIG. 1, the method comprises:

S1, querying, based on a query statement, a mixed inverted index corresponding to a plurality of documents in a knowledge base to obtain a candidate document set comprising at least one candidate document;

In a specific implementation, the query statement that the user enters in the input box of the question-answering system is obtained; it typically consists of several keywords or a sentence. Unlike conventional document retrieval systems, which support the AND, OR and NOT logic of Boolean queries, the question-answering system supports only AND and OR logic. Examples of query statements are "H5 System architecture diagram" and "H5, BOSS, interface".

S2, determining candidate matching vectors corresponding to the candidate documents according to the matching characteristics of the keywords corresponding to the query sentences in the candidate documents aiming at the candidate documents in the candidate document set;

S3, determining the similarity between the candidate document and the query statement based on a preset standard matching vector and the candidate matching vector corresponding to the candidate document;

In a specific implementation, the similarity may be calculated using cosine similarity, the Jaccard similarity coefficient, the Pearson correlation coefficient, and the like.
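The three similarity measures named above can be sketched in plain Python as follows. This is an illustrative sketch, not the patented implementation; the function names are assumptions, and the Jaccard variant shown here operates on token sets rather than numeric vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Jaccard coefficient of two token collections, treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def pearson_correlation(a, b):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # undefined for constant vectors; 0.0 is a convention
    return cov / math.sqrt(var_a * var_b)
```

The embodiment described later uses cosine similarity; the other two are listed only as alternatives.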

S4, sorting the candidate documents based on the similarity between the candidate documents and the query statement in the candidate document set to obtain a sorting result of the candidate documents.

In a specific implementation, the candidate documents are returned to the front end of the question-answering system in the order of the sorting result, and the front-end interface highlights the words in the displayed documents that match the query statement.

According to the method for retrieving computer-domain documents in the question-answering system provided by this embodiment, the mixed inverted index corresponding to the documents in the knowledge base is queried with the query statement to obtain a candidate document set; for each candidate document in the candidate document set, a candidate matching vector is determined based on the matching features, in the candidate document, of the keywords corresponding to the query statement; the similarity between the candidate document and the query statement is determined based on a preset standard matching vector and the candidate matching vector corresponding to the candidate document; and the candidate documents are sorted based on the similarity and the sorting result is output. The document matching vector method re-ranks the coarse screening results of the mixed inverted index at a finer granularity, which improves the accuracy of document retrieval.
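The flow of steps S1 to S4 can be sketched as a single function. This is a minimal sketch under stated assumptions: the inverted index is modelled as an in-memory dictionary from term to document identifiers, candidates are collected with OR logic, and `match_vector_fn` is a hypothetical callback that returns the candidate matching vector for a document. None of these names come from the original text.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_and_rank(query_terms, inverted_index, match_vector_fn, standard_vector):
    """Return (doc_id, similarity) pairs sorted from most to least similar."""
    # S1: coarse screening, here the union (OR) of the posting sets.
    candidates = set()
    for term in query_terms:
        candidates |= inverted_index.get(term, set())

    scored = []
    for doc_id in candidates:
        # S2: build the candidate matching vector for this document.
        vector = match_vector_fn(query_terms, doc_id)
        # S3: compare it with the preset standard matching vector.
        scored.append((doc_id, cosine(vector, standard_vector)))

    # S4: sort by similarity, highest first; doc_id breaks ties deterministically.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored
```

Keeping the vector construction behind a callback separates the coarse screening from the fine-grained re-ranking, which mirrors the two-stage structure described in the text.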

In an optional embodiment, in the method for searching a computer domain document in the question-answering system, the matching features include a title feature and a text feature;

In a specific implementation, following the principle of Boolean queries and the way the keywords in the query statement match a document, six matching features are designed for each candidate document, as shown in Table 1:

TABLE 1

The title of a candidate document is treated differently from the document body; the title comprises the title of the document and the title of each chapter. Features 1, 2 and 3 are "title" features, and features 4, 5 and 6 are "body" features.

Feature 1 indicates whether the keywords in the query statement appear in the title. It follows the Boolean model: the value is 1 if they appear and 0 if they do not.

Feature 2, the matching ratio (title), is the ratio of the number of query words matched in the title to the total number of query words. It represents how many of the terms in the query statement appear in the title; considered on its own, a higher value should yield a higher rank.

Feature 3, the number of matching words (title), is the total number of word matches in the title. It is used for finer-grained ranking: when feature 1 and feature 2 have the same values, feature 3 is used to distinguish the documents, and the more matches there are, the higher the rank. To keep every feature normalized, in a specific implementation the matching count is mapped to a value according to the rules shown in Table 2:

TABLE 2

Features 4, 5 and 6 have meanings similar to those of the corresponding first three features, but for feature 6 the matching count setting differs slightly from that of feature 3. This setting can be treated as a hyperparameter and adjusted freely according to actual use.
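The six raw features could be computed along the following lines. This is a sketch, not the patented implementation: the threshold tables that map a raw match count to a normalized value are hypothetical placeholders, because the actual values of Table 2 are not reproduced in this text, and the function names are illustrative.

```python
# Hypothetical (minimum count, normalized value) pairs, checked from the top down.
# The real settings belong to Table 2 and are tunable hyperparameters.
TITLE_THRESHOLDS = ((3, 1.0), (2, 0.6), (1, 0.3))
BODY_THRESHOLDS = ((5, 1.0), (3, 0.6), (1, 0.3))


def bucket_match_count(count, thresholds):
    """Map a raw match count to a value in [0, 1]."""
    for minimum, value in thresholds:
        if count >= minimum:
            return value
    return 0.0


def field_features(query_terms, field_tokens, thresholds):
    """Return [matched?, matching ratio, bucketed match count] for one field."""
    terms = list(dict.fromkeys(query_terms))  # de-duplicate, keep order
    present = set(field_tokens)
    matched_terms = [t for t in terms if t in present]
    hit = 1.0 if matched_terms else 0.0
    ratio = len(matched_terms) / len(terms) if terms else 0.0
    term_set = set(terms)
    count = sum(1 for token in field_tokens if token in term_set)
    return [hit, ratio, bucket_match_count(count, thresholds)]


def raw_match_features(query_terms, title_tokens, body_tokens):
    """Features 1-3 describe the title, features 4-6 describe the body."""
    return (field_features(query_terms, title_tokens, TITLE_THRESHOLDS)
            + field_features(query_terms, body_tokens, BODY_THRESHOLDS))
```

For example, with the query terms `["h5", "interface"]`, a title containing only `h5` once yields a title ratio of 0.5 and, under the placeholder thresholds, a bucketed count of 0.3.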

Based on the matching characteristics of the keywords corresponding to the query sentences in the candidate documents, determining candidate matching vectors corresponding to the candidate documents comprises the following steps:

In a specific implementation, analysis of user query habits shows that users often enter summary-style keywords, and a large proportion of matches occur in the title. The document is therefore divided into two fields, title and body, which carry different weights: the values of the first three features are multiplied by a weight coefficient of 0.55, and the values of the last three features are multiplied by a weight coefficient of 0.45.

In a specific implementation, determining the similarity between the candidate document and the query statement based on the preset standard matching vector and the candidate matching vector proceeds as follows: a matching vector is formed from the query statement input by the user and the candidate document, and its similarity to the standard matching vector is then calculated. Table 3 illustrates the construction of the standard matching vector:

TABLE 3

The original feature values represent the case in which the query statement matches the document fully in every dimension: each value is 1, that is, the query words appear completely in both the title and the body, and the number of matches reaches the configured maximum matching value. The original feature values are unweighted; multiplying them by the weight coefficients 0.55 and 0.45 gives the standard matching vector: [0.55, 0.55, 0.55, 0.45, 0.45, 0.45].
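The weighting step and the construction of the standard matching vector can be written out as a short sketch. The constant and function names are illustrative assumptions; the weights 0.55 and 0.45 and the all-ones original feature values are taken from the text.

```python
TITLE_WEIGHT = 0.55  # applied to features 1-3 (title)
BODY_WEIGHT = 0.45   # applied to features 4-6 (body)


def apply_field_weights(raw_features, title_weight=TITLE_WEIGHT, body_weight=BODY_WEIGHT):
    """Scale the three title features and the three body features separately."""
    if len(raw_features) != 6:
        raise ValueError("expected exactly six matching features")
    title_part = [value * title_weight for value in raw_features[:3]]
    body_part = [value * body_weight for value in raw_features[3:]]
    return title_part + body_part


# A perfect match in every dimension, weighted the same way as any candidate.
STANDARD_MATCHING_VECTOR = apply_field_weights([1.0] * 6)
```

Building the standard vector with the same function that weights candidate vectors keeps the two consistent if the weights are later tuned.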

In a specific implementation, assuming that the candidate document set contains 6 candidate documents, 6 matching vectors are formed in turn from the query statement input by the user and each of the 6 candidate documents, following the feature construction described above. The cosine similarity between each of the 6 matching vectors and the standard matching vector is then calculated, and the documents are sorted from high to low. To verify that the feature construction is reasonable, Table 4 is taken as an example: the first column describes how well the query statement matches the candidate document, the second column is the generated candidate matching vector, and the third column is the cosine similarity calculated with the standard matching vector.

TABLE 4

The settings in table 2 correspond, for example, to the following:

the first 3 features are multiplied by the weight coefficient 0.55: (1, 0.5, 0.6) × 0.55 = (0.55, 0.275, 0.33);

the last 3 features are multiplied by the weight coefficient 0.45: (1, 0.5, 0.6) × 0.45 = (0.45, 0.225, 0.27);

the combined candidate matching vector is [0.55, 0.275, 0.33, 0.45, 0.225, 0.27];

the remaining 5 original vectors are as follows, where in each vector the first three components are multiplied by 0.55 and the last three by 0.45: [1, 0.3, 0.2, 1, 0.3, 0.2], [1, 1, 1, 0, 0, 0], [1, 0.5, 0.6, 0, 0, 0], [0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 0.5, 0.6].

The similarity between each candidate matching vector and the standard matching vector is then calculated; this embodiment uses cosine similarity. As shown in Table 4, the similarity value lies between 0 and 1, and the closer it is to 1, the higher the degree of matching between the query statement and the document.
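The worked example can be reproduced with a few lines of Python. The six original vectors and the two weights come from the text above; the function names are illustrative, and the scores printed by the loop are simply whatever this sketch computes, not a reproduction of Table 4.

```python
import math

TITLE_WEIGHT, BODY_WEIGHT = 0.55, 0.45

# The six original (unweighted) vectors listed in the example.
RAW_VECTORS = [
    [1, 0.5, 0.6, 1, 0.5, 0.6],
    [1, 0.3, 0.2, 1, 0.3, 0.2],
    [1, 1, 1, 0, 0, 0],
    [1, 0.5, 0.6, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 0.5, 0.6],
]


def weight(raw):
    """Multiply the title half by 0.55 and the body half by 0.45."""
    return [v * TITLE_WEIGHT for v in raw[:3]] + [v * BODY_WEIGHT for v in raw[3:]]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


STANDARD = weight([1.0] * 6)


def rank_by_similarity(raw_vectors):
    """Return (index, cosine similarity) pairs, most similar first."""
    scored = [(i, cosine(weight(raw), STANDARD)) for i, raw in enumerate(raw_vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for index, score in rank_by_similarity(RAW_VECTORS):
    print(index, round(score, 4))
```

Because the title half carries the larger weight, a vector that matches only in the title scores higher than the corresponding vector that matches only in the body.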

In the embodiment, the cosine similarity is calculated through the matching features in the designed documents, so that the candidate documents can be subjected to fine granularity sorting, and the accuracy of the document retrieval output result is improved.

In an optional embodiment, the method for searching the computer domain document in the question-answering system further includes:

adding a pre-acquired computer-domain keyword dictionary to the tokenizer (word segmenter) of the distributed search and analysis engine Elasticsearch;

automatically constructing a mixed inverted index for a plurality of documents in the knowledge base through the distributed search and analysis engine Elasticsearch based on the query statement;

wherein the mixed inverted index comprises: a conventional inverted index and an inverted index based on the computer-domain keyword dictionary.

In the embodiment, the mixed inverted index, which comprises the conventional inverted index and the inverted index based on the preset computer-domain keyword dictionary, is constructed automatically by the distributed search and analysis engine Elasticsearch, which improves retrieval efficiency.
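The idea of a mixed inverted index can be illustrated without Elasticsearch. The sketch below is a plain in-memory stand-in, not the Elasticsearch mechanism the text relies on: it keeps one index over whitespace tokens and a second index over phrases from a domain dictionary. The function names and the simple substring matching are assumptions made for illustration.

```python
from collections import defaultdict


def build_mixed_inverted_index(documents, domain_dictionary):
    """Build (conventional_index, domain_index) from a doc_id -> text mapping."""
    conventional = defaultdict(set)
    domain = defaultdict(set)
    for doc_id, text in documents.items():
        lowered = text.lower()
        # Conventional index: one entry per whitespace-separated token.
        for token in lowered.split():
            conventional[token].add(doc_id)
        # Domain index: one entry per dictionary phrase found in the text.
        # Substring matching is a simplification and may match inside words.
        for phrase in domain_dictionary:
            key = phrase.lower()
            if key in lowered:
                domain[key].add(doc_id)
    return dict(conventional), dict(domain)


def lookup(term, conventional, domain):
    """Union of the postings for a term in both indexes."""
    key = term.lower()
    return conventional.get(key, set()) | domain.get(key, set())
```

The second index lets a multi-word domain term such as a product or component name be retrieved as one unit, even though a generic tokenizer would split it into separate tokens.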

collecting keywords in the history document by using a TF-IDF algorithm;

collecting key phrases in the history document by using an SIFRank algorithm;

In a specific implementation, TF in the TF-IDF algorithm denotes term frequency; it is used to balance the influence of word counts and sentence lengths on the final value. IDF denotes inverse document frequency and can be understood as the amount of information carried by each word: the less frequently a word occurs, the larger its IDF value. The final TF-IDF value is TF × IDF. The specific formula is as follows:
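The formula itself is not reproduced in this text. One common formulation, given here as an assumption rather than as the formula used in the invention, takes TF as the count of a term in a document divided by the document length, and IDF as the natural logarithm of the number of documents divided by the number of documents containing the term. A minimal sketch:

```python
import math
from collections import Counter


def tf_idf(documents):
    """Score every term of every document.

    documents: list of token lists. Returns one dict (term -> score) per document.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    all_scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores = {}
        for term, count in counts.items():
            tf = count / total                      # length-normalized frequency
            idf = math.log(n_docs / doc_freq[term])  # rarer terms score higher
            scores[term] = tf * idf
        all_scores.append(scores)
    return all_scores


def top_keywords(scores, k):
    """Return the k highest-scoring terms of one document."""
    return [term for term, _ in sorted(scores.items(), key=lambda p: (-p[1], p[0]))[:k]]
```

With this variant a term that occurs in every document receives a score of zero, so it is never selected as a keyword; smoothed variants of IDF behave differently.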

FIG. 2 is a schematic diagram of the result of the method for searching documents in the computer domain in the question-answering system, and the extraction result of extracting keywords in the knowledge base by using the TF-IDF algorithm is shown in FIG. 2.

The key phrase extraction uses the SIFRank algorithm for collection. The algorithm firstly carries out word segmentation and part-of-speech tagging on sentences, and then determines noun phrases by using regular expressions.

FIG. 3 is a second schematic diagram of the result of the method for searching documents in the computer domain in the question-answering system, and the extraction result of extracting key phrases in the knowledge base by using the SIFRank algorithm is shown in FIG. 3.

In a specific implementation, a number of the top-ranked phrases are selected and added to the dictionary.
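The candidate extraction step described above (tagging followed by a regular expression over part-of-speech tags) can be sketched as follows. This covers only that step: the input is assumed to be already segmented and tagged with a hypothetical three-letter tag set (`J` adjective, `N` noun, `O` other), the pattern is an illustrative choice, and the frequency ranking in `top_phrases` is a simple stand-in for SIFRank's own scoring, which is not reproduced here.

```python
import re
from collections import Counter

# Hypothetical noun-phrase pattern: optional adjectives followed by nouns.
# Each tag is one character followed by a space, so a token spans 2 characters.
NOUN_PHRASE_PATTERN = re.compile(r"(?:J )*(?:N )+")
ALLOWED_TAGS = {"J", "N", "O"}


def candidate_noun_phrases(tagged_tokens):
    """Extract noun phrases from a list of (word, tag) pairs."""
    for _, tag in tagged_tokens:
        if tag not in ALLOWED_TAGS:
            raise ValueError(f"unsupported tag: {tag!r}")
    tag_string = "".join(tag + " " for _, tag in tagged_tokens)
    phrases = []
    for match in NOUN_PHRASE_PATTERN.finditer(tag_string):
        start = match.start() // 2  # convert character offsets to token offsets
        end = match.end() // 2
        phrases.append(" ".join(word for word, _ in tagged_tokens[start:end]))
    return phrases


def top_phrases(phrases, k):
    """Keep the k most frequent phrases (placeholder for SIFRank ranking)."""
    return [phrase for phrase, _ in Counter(phrases).most_common(k)]
```

For instance, the tagged sequence "distributed/J search/N engine/N uses/O inverted/J index/N" yields the candidates "distributed search engine" and "inverted index".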

In the embodiment, through the preset processing, the keyword dictionary of the computer field document is established more scientifically, so that the subsequent retrieval processing is facilitated, and the accuracy of the computer field document retrieval is improved.

preprocessing a history document; preprocessing includes deleting code fragments and structured query language SQL statements in the history document.

In the embodiment, preprocessing the history documents reduces the influence of code fragments and SQL statements in the documents on subsequent vectorization, and improves the accuracy of subsequent computer-domain document retrieval.
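A preprocessing step of this kind might look like the sketch below. The text does not say how code fragments and SQL statements are recognized, so both patterns are assumptions: code is taken to be anything between a pair of triple-backtick fences, and an SQL statement is taken to start with a common keyword combination and end at the next semicolon. Heuristics like these can miss or over-match in real documents.

```python
import re

# Build the fence marker programmatically to keep this listing readable.
FENCE = "`" * 3
CODE_BLOCK = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)

# Assumed SQL shapes: keyword combination ... up to the next semicolon.
SQL_STATEMENT = re.compile(
    r"\b(?:SELECT\b[^;]*?\bFROM|INSERT\s+INTO|UPDATE\s+\w+\s+SET|DELETE\s+FROM"
    r"|CREATE\s+TABLE|ALTER\s+TABLE|DROP\s+TABLE)\b[^;]*;",
    re.IGNORECASE,
)


def strip_code_and_sql(text):
    """Remove fenced code blocks and SQL statements, then tidy whitespace."""
    text = CODE_BLOCK.sub(" ", text)
    text = SQL_STATEMENT.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()
```

Requiring keyword pairs such as `SELECT ... FROM` or `UPDATE <name> SET` reduces the chance of deleting ordinary prose that merely contains the word "select" or "update".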

In the embodiment, the ranking of the candidate documents is further refined along two dimensions, the update time and the click count of each candidate document, which optimizes the computer-domain document retrieval results.
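How update time and click count enter the ranking is not specified in this text. One possible reading, shown here purely as an assumption, is to use them as tie-breakers: documents whose similarities are equal after rounding are ordered by the most recent update first and then by the higher click count. The field names and the rounding precision are hypothetical.

```python
from datetime import date


def rerank(candidates, precision=2):
    """Re-sort candidates by similarity, then recency, then popularity.

    candidates: list of dicts with keys "doc_id", "similarity",
    "updated" (datetime.date) and "clicks" (int).
    """
    return sorted(
        candidates,
        key=lambda c: (
            -round(c["similarity"], precision),  # primary: similarity bucket
            -c["updated"].toordinal(),           # then: newer documents first
            -c["clicks"],                        # then: more clicks first
            c["doc_id"],                         # finally: stable, deterministic
        ),
    )


example = [
    {"doc_id": "a", "similarity": 0.801, "updated": date(2022, 1, 1), "clicks": 50},
    {"doc_id": "b", "similarity": 0.804, "updated": date(2023, 3, 1), "clicks": 3},
    {"doc_id": "c", "similarity": 0.95, "updated": date(2020, 1, 1), "clicks": 0},
]
print([c["doc_id"] for c in rerank(example)])
```

A weighted blend of the three signals would be an equally plausible design; the tie-breaker form has the advantage that it never overrides a clear difference in similarity.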

In an optional embodiment, in the method for searching a computer domain document in the question-answering system, the mixed inverted index corresponding to a plurality of documents in the query knowledge base includes:

based on the mixed inverted index, the distributed search and analysis engine Elasticsearch is used to perform a distributed cluster search on a plurality of documents in the knowledge base.

In the embodiment, Elasticsearch, a distributed search and analysis engine, performs fast retrieval over a distributed cluster; its distributed, highly scalable and near-real-time characteristics improve the efficiency of computer-domain document retrieval and broaden the application scenarios.

Fig. 4 is a schematic structural diagram of a search device for documents in computer domain in question-answering system provided by the present invention, as shown in fig. 4, including:

a document set obtaining module 41, configured to query, based on the query statement, a mixed inverted index corresponding to a plurality of documents in the knowledge base, to obtain a candidate document set including at least one candidate document;

a matching vector determining module 42, configured to determine, for each candidate document in the candidate document set, a candidate matching vector corresponding to the candidate document based on a matching feature of the keyword corresponding to the query term in the candidate document;

a similarity determining module 43, configured to determine a similarity between the candidate document and the query sentence based on a preset standard matching vector and a candidate matching vector corresponding to the candidate document;

the document ordering module 44 is configured to order the candidate documents based on the similarity between each candidate document in the candidate document set and the query sentence, so as to obtain an ordering result of each candidate document.

According to the retrieval device of the documents in the computer field in the question-answering system, based on mutual coordination among the modules, the document collection acquisition module is used for inquiring mixed inverted indexes corresponding to a plurality of documents in a knowledge base based on inquiry sentences to obtain a candidate document collection; determining candidate matching vectors by a matching vector determining module according to the matching characteristics of keywords corresponding to query sentences in the candidate documents aiming at each candidate document in a candidate document set; determining the similarity between the candidate document and the query sentence based on a preset standard matching vector and a candidate matching vector corresponding to the candidate document through a similarity determining module; and sequencing the candidate documents based on the similarity through a document sequencing module and outputting sequencing results. The method has the advantages that the coarse screening results of the mixed inverted index are reordered in finer granularity, and the accuracy of document retrieval is improved.

Fig. 5 is a schematic physical structure of an electronic device according to the present invention, as shown in fig. 5, the electronic device may include: processor 510, communication interface (Communications Interface) 520, memory 530, and communication bus 540, wherein processor 510, communication interface 520, memory 530 complete communication with each other through communication bus 540. Processor 510 may invoke logic instructions in memory 530 to perform a method for retrieving a computer domain document in a question-answering system, the method comprising: based on the query statement, querying a mixed inverted index corresponding to a plurality of documents in a knowledge base to obtain a candidate document set comprising at least one candidate document; for each candidate document in the candidate document set, determining a candidate matching vector corresponding to the candidate document based on the matching characteristics of the keywords corresponding to the query sentences in the candidate document; determining the similarity between the candidate document and the query sentence based on a preset standard matching vector and a candidate matching vector corresponding to the candidate document; and sorting the candidate documents based on the similarity between the candidate documents and the query statement in the candidate document set to obtain a sorting result of the candidate documents.

Further, the logic instructions in the memory 530 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art or in a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.

In another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor, is implemented to perform a method for searching a computer domain document in a question-answering system provided by the above methods, the method comprising: based on the query statement, querying a mixed inverted index corresponding to a plurality of documents in a knowledge base to obtain a candidate document set comprising at least one candidate document; for each candidate document in the candidate document set, determining a candidate matching vector corresponding to the candidate document based on the matching characteristics of the keywords corresponding to the query sentences in the candidate document; determining the similarity between the candidate document and the query sentence based on a preset standard matching vector and a candidate matching vector corresponding to the candidate document; and sorting the candidate documents based on the similarity between the candidate documents and the query statement in the candidate document set to obtain a sorting result of the candidate documents.

The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for searching a computer field document in a question-answering system is characterized by comprising the following steps:

2. The method for retrieving documents in a computer domain in a question-answering system according to claim 1, wherein the matching features include a title feature and a text feature;

3. The method for retrieving documents in a computer domain in a question-answering system according to claim 1, further comprising:

4. A method for retrieving computer domain documents in a question-answering system according to claim 3, wherein the method further comprises:

collecting keywords in the history document by using a TF-IDF algorithm;

collecting key phrases in the history document by using an SIFRank algorithm;

5. The method for retrieving documents in a computer domain in a question-answering system according to claim 4, further comprising:

6. The method for retrieving documents in a computer domain in a question-answering system according to claim 1, further comprising:

7. The method for retrieving documents in computer domain in question-answering system according to claim 1, wherein the querying the mixed inverted index corresponding to the plurality of documents in the knowledge base includes:

8. A search device for a computer domain document in a question-answering system, comprising:

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements a method for retrieving a computer domain document in a question-answering system according to any one of claims 1 to 7 when the program is executed by the processor.

10. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements a method of retrieving a computer domain document in a question-answering system according to any one of claims 1 to 7.