CN115630144B

CN115630144B - Document searching method and device and related equipment

Info

Publication number: CN115630144B
Application number: CN202211646790.1A
Authority: CN
Inventors: 王哲; 刘殊玥; 余怡然; 舒光斌; 岳丰; 杨思喆; 史勇; 罗俊; 贾智杰; 方兴; 宋群力
Original assignee: Citic Securities Co ltd
Current assignee: Citic Securities Co ltd
Priority date: 2022-12-21
Filing date: 2022-12-21
Publication date: 2023-04-28
Anticipated expiration: 2042-12-21
Also published as: CN115630144A

Abstract

A document searching method, comprising: a search engine acquires query content input by a user; searches among a plurality of documents according to the query content to obtain a target document and interpretability information corresponding to the target document, wherein the interpretability information comprises a target matching element corresponding to the target document and a weight corresponding to the target matching element, and the target document is at least one document, among the plurality of documents, whose relevance to the query content meets a preset condition; and outputs the target document and the interpretability information. The user can thus determine, from the interpretability information output by the search engine, the basis on which the search engine fed back the target document, which improves the user's trust in the target document fed back by the search engine and improves the user's search experience. In addition, the application also provides a corresponding document searching device and related equipment.

Description

Document searching method and device and related equipment

Technical Field

The present disclosure relates to the field of data retrieval technologies, and in particular, to a document searching method, apparatus, and related devices.

Background

At present, organizations such as enterprises usually store a large amount of document data, such as brokerage research reports in the financial field. Because this document data is stored in unstructured form, retrieving valid documents that meet users' expectations from it has become a key issue for such organizations.

Although current search engine technology is widely applied to searching document information, in practical application scenarios the document information fed back by a search engine often does not meet users' expectations. For example, when some of the retrieved documents have low relevance to the query content input by the user, the user may believe that the search engine has failed to feed back some documents with high relevance to the query content, which affects the user's search experience. Therefore, how to improve the credibility of the search results fed back by a search engine is an important problem to be solved.

Disclosure of Invention

The present application provides a document searching method for improving the credibility of the search results fed back by a search engine, thereby improving the user's search experience. In addition, the application provides a corresponding document searching apparatus, a computing device, a computer readable storage medium and a computer program product.

In a first aspect, the present application provides a document searching method, the method being applied to a search engine, the method comprising:

acquiring query content input by a user;

searching among a plurality of documents according to the query content to obtain a target document and interpretability information corresponding to the target document, wherein the interpretability information comprises a target matching element corresponding to the target document and a weight corresponding to the target matching element, and the target document is at least one document, among the plurality of documents, of which the correlation with the query content meets a preset condition;

and outputting the target document and the interpretability information.

In one possible implementation manner, the searching among the plurality of documents according to the query content to obtain the target document and the interpretability information corresponding to the target document includes:

searching the target document and the relevance scores of the target document from the plurality of documents according to the query content, wherein the relevance scores of the target document are higher than those of the rest documents in the plurality of documents;

determining weights corresponding to a plurality of candidate matching elements respectively according to the target document and the relevance scores of the target document, wherein the deviation between the scores calculated based on the weights corresponding to the candidate matching elements and the relevance scores of the target document is smaller than a preset range;

and determining the target matching element from the plurality of candidate matching elements, and determining the weight corresponding to the target matching element, wherein the target matching element meets a preset element determination condition.

In one possible implementation, the plurality of candidate matching elements includes any of word matching, n-gram matching, synonym matching, semantic vector matching, topic keyword matching, multimodal information matching, metadata attribute matching, document full-text length, non-text modality data included in the document, timeliness data of the document, historical access data of the document.

In one possible embodiment, the method further comprises:

determining target fragments in the target document according to the interpretability information, wherein the matching degree between the target fragments and the query content is higher than that between the rest fragments in the target document and the query content;

outputting the target fragment.

In one possible embodiment, the plurality of documents includes a predicted document in which predicted data of an object to be evaluated is recorded, and the method further includes:

determining a predicted document related to the object to be evaluated from the plurality of documents;

acquiring actual data matched with the predicted content in the predicted document;

determining a prediction error rate corresponding to the object to be evaluated according to the prediction content and the actual data;

and generating evaluation information aiming at the object to be evaluated according to the prediction error rate and the number of the prediction documents.

In one possible implementation, the target document is a multi-modal document, and the multi-modal document refers to a document including any of a plurality of types of information in text, figures, and tables.

In a second aspect, an embodiment of the present application provides a document searching apparatus, which is applied to a search engine, including:

the acquisition module is used for acquiring query content input by a user;

the searching module is used for searching among a plurality of documents according to the query content to obtain a target document and interpretability information corresponding to the target document, wherein the interpretability information comprises a target matching element corresponding to the target document and a weight corresponding to the target matching element, and the target document is at least one document which is among the plurality of documents and has correlation with the query content and meets a preset condition;

and the output module is used for outputting the target document and the interpretability information.

In a possible implementation manner, the searching module is specifically configured to:

In a possible implementation manner, the searching module is further configured to determine, according to the interpretability information, a target segment in the target document, where a matching degree between the target segment and the query content is higher than a matching degree between the rest of segments in the target document and the query content;

the output module is further used for outputting the target segment.

In one possible embodiment, the plurality of documents includes a prediction document in which prediction data of an object to be evaluated is recorded;

the search module is further configured to:

In a third aspect, the present application provides a computing device comprising a processor and a memory. The processor is configured to execute instructions stored in the memory to cause the computing device to perform a document searching method as in the first aspect or any implementation of the first aspect. It should be noted that the memory may be integrated into the processor or may be independent of the processor. The computing device may also include a bus, through which the processor is connected with the memory. The memory may include a read-only memory and a random access memory, among others.

In a fourth aspect, the present application provides a computer readable storage medium having instructions stored therein which, when run on a computing device, cause the computing device to perform the steps of the document searching method of the first aspect or any implementation of the first aspect.

In a fifth aspect, the present application provides a computer program product comprising instructions which, when run on a computing device, cause the computing device to perform the operational steps of the document searching method of the first aspect or any implementation of the first aspect.

Further combinations of the present application may be made to provide further implementations based on the implementations provided in the above aspects.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings may be obtained according to these drawings for a person having ordinary skill in the art.

FIG. 1 is a schematic diagram of an exemplary application scenario provided herein;

FIG. 2 is a schematic view of another exemplary application scenario provided herein;

FIG. 3 is a schematic flow chart of a document searching method provided in the present application;

FIG. 4 is a schematic diagram of an exemplary interaction interface provided herein;

FIG. 5 is a schematic view of an exemplary sliding window provided herein;

FIG. 6 is a schematic diagram of a document searching apparatus according to the present application;

FIG. 7 is a schematic diagram of the hardware structure of a computing device provided in the present application.

Detailed Description

Currently, when a user performs information retrieval, a search engine generally retrieves, according to the query content input by the user, one or more documents with high relevance to the query content and feeds them back. The user can see only the document information and cannot know the basis on which the search engine fed back those documents, so the user's trust in the documents fed back by the search engine is low. In particular, when the documents fed back by the search engine do not match the user's expectations, the user may believe that the search engine has failed to feed back some important documents, namely documents with higher relevance to the query content, which affects the user's search experience.

Based on this, the embodiments of the present application provide a document searching method for improving the user's search experience. When the user performs information retrieval, the search engine first acquires the query content input by the user, and searches among a plurality of documents according to the query content to obtain a target document and interpretability information corresponding to the target document. The interpretability information comprises a target matching element corresponding to the target document and a weight corresponding to the target matching element, and the target document is at least one document, among the plurality of documents, whose relevance to the query content meets a preset condition. The search engine then outputs the target document and the interpretability information. Because the search engine outputs not only the target document related to the query content but also the interpretability information corresponding to the target document, the user can determine from the interpretability information the basis on which the search engine fed back the target document, which improves the user's trust in the target document fed back by the search engine and improves the user's search experience.

As an implementation example, the embodiment of the present application may be applied to the application scenario shown in FIG. 1. In this application scenario, the user 101 may conduct an information search on the terminal device 102. Specifically, the terminal device 102 may present an interactive interface to the user 101, so that the user 101 may input query content on the interactive interface and trigger the terminal device 102 to perform information retrieval according to the query content. The terminal device 102 is configured with a search engine 1021. The search engine 1021 can retrieve a document with higher relevance to the query content from a database according to the query content, determine the interpretability information corresponding to the document, and present the document and the interpretability information together on the interactive interface of the terminal device 102. When viewing the retrieved document, the user 101 can therefore also view the interpretability information fed back by the terminal device 102, which improves the user 101's trust in the document.

It should be noted that the application scenario shown in FIG. 1 is only an implementation example and does not limit the application scenarios in which the solution can be implemented. For example, in another possible application scenario, shown in FIG. 2, after the user 101 inputs the query content on the terminal device 102, the terminal device 102 may generate a search request including the query content and send the search request to the cloud server 103 in the cloud, on which the search engine 1031 runs. In response to the search request, the cloud server 103 may use the search engine 1031 to retrieve a document with higher relevance to the query content according to the query content, determine the interpretability information corresponding to the document, and then feed back the document and the interpretability information to the terminal device 102. The terminal device 102 may then present the document and the interpretability information to the user 101 on an interactive interface.

For ease of understanding, embodiments of the document searching method provided in the present application are described below with reference to the accompanying drawings.

Referring to fig. 3, fig. 3 is a flowchart of a document searching method according to an embodiment of the present application, where the method may be applied to the application scenario shown in fig. 1 or fig. 2, or may be applied to other applicable application scenarios. For convenience of explanation, the present embodiment is exemplified by application to the application scenario shown in fig. 1.

The document searching method shown in fig. 3 may be performed by the search engine 1021 in fig. 1, and the method may specifically include:

S301: the search engine 1021 obtains the query content input by the user.

In an actual application scenario, the user 101 may input query content through an interactive interface of the terminal device 102, so as to trigger the search engine 1021 to execute a corresponding data search process according to the query content.

Illustratively, the query content entered by the user may be text content in the form of words, sentences, and the like. Alternatively, the query content input by the user may be non-text content such as a picture or voice. At this time, after the search engine 1021 obtains the query content input by the user, the non-text query content may be converted into a corresponding text, for example, the image may be converted into a text capable of expressing the image content by means of image recognition, analysis, or the like, or the voice input by the user may be converted into a text expressing the semantics of the voice by means of voice recognition, or the like. In this embodiment, the specific implementation manner of the query content input by the user is not limited.

S302: the search engine 1021 searches a plurality of documents according to the query content to obtain a target document and interpretability information corresponding to the target document, wherein the interpretability information comprises a target matching element corresponding to the target document and a weight corresponding to the target matching element, and the target document is at least one document in the plurality of documents, and the correlation between the target document and the query content meets a preset condition.

Typically, the search engine 1021 has access to a database that stores a plurality of documents, for example research report documents within an enterprise, so that the search engine 1021 can retrieve from the plurality of documents one or more documents that have higher relevance to the query content. For convenience of distinction and description, the retrieved document is referred to as a target document hereinafter.

In a specific implementation, the search engine 1021 may determine the relevance between each of the plurality of documents and the query content, for example by keyword matching, and determine from them a target document with higher relevance. The relevance between the target document and the query content may be expressed as a score: the higher the score, the higher the relevance; the lower the score, the lower the relevance. Because implementations in which a search engine retrieves related documents according to query content such as keywords already exist in practical application scenarios, the search process is not described in detail here.

Note that the document in this embodiment may be a single-modal document, such as a document including only text information. Alternatively, the document in this embodiment may be a multi-modal document, such as a document including several types of information among text, figures, and tables. In that case, the non-text information in the multi-modal document (such as figures and tables) can be converted, for example by semantic recognition, into text information with the same semantics, and the text information is placed at the position in the document where the original non-text information was located (such as the corresponding paragraph or chapter).

It will be appreciated that if the search engine 1021 fed back only the target document, the lack of any explanation of the basis on which the search engine 1021 determined the target document could lower the user 101's trust in the target document fed back by the search engine 1021. Therefore, in this embodiment, the search engine 1021 also determines the interpretability information corresponding to the target document, which explains the basis on which the search engine 1021 selected the target document from the plurality of documents. Because the technology by which the search engine 1021 retrieves the target document according to the query content is mature, this embodiment does not need to change the logic by which the search engine 1021 determines the target document; the interpretability information is generated for the target document after the target document has been determined.

In one possible implementation, the search engine 1021 may first perform a search among a plurality of documents according to the query content, determine a relevance score between the query content and each document, and search the first M documents (M is a positive integer) with the highest relevance from the plurality of documents according to the relevance score corresponding to each document, that is, the target document, where the relevance scores of the remaining documents are lower than the relevance score of the target document. Then, the search engine 1021 determines weights corresponding to the plurality of candidate matching elements according to the target document and the relevance scores of the target document, wherein the deviation between the scores calculated based on the weights corresponding to the plurality of candidate matching elements and the relevance scores of the target document is smaller than a preset range, and further determines a target matching element and the weight corresponding to the target matching element from the plurality of candidate matching elements, and the determined target matching element meets a preset element determination condition. And the determined target matching element and the weight corresponding to the target matching element can be used as the interpretability information of the target document.
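The two selection steps in this implementation can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, and the "preset element determination condition" is assumed here to be "absolute weight above a small threshold", which the text does not specify.

```python
def top_m_documents(scored_docs, m):
    """Return the M (document, relevance score) pairs with the highest scores."""
    return sorted(scored_docs, key=lambda pair: pair[1], reverse=True)[:m]


def select_target_elements(weights, min_abs_weight=1e-6):
    """Keep candidate matching elements whose weight passes the assumed condition.

    The condition (|weight| > min_abs_weight) is an assumption made for this
    sketch; the kept elements are ordered by descending absolute weight.
    """
    kept = {name: w for name, w in weights.items() if abs(w) > min_abs_weight}
    return dict(sorted(kept.items(), key=lambda item: abs(item[1]), reverse=True))


docs = top_m_documents([("a", 0.2), ("b", 0.9), ("c", 0.5)], m=2)
explanation = select_target_elements({"word": 0.7, "ngram": 0.0, "timeliness": -0.9})
```

The returned dictionary of element names and weights plays the role of the interpretability information that accompanies each target document.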

Specifically, for any query content Q, assume that the query content includes N keywords, such as $Q = [word_1, \ldots, word_N]$. The search engine 1021 can retrieve, according to the query content Q, a list of documents related to Q, $D(Q) = [D_1, \ldots, D_k, \ldots, D_M]$; that is, the target documents are retrieved. Documents in the list are not duplicated and may be arranged in descending order according to some similarity algorithm, so that the nearer a document is to the front of the list, the higher the relevance between that document and the query content Q. For document $D_k$ in the document list (ranked k-th in the list), the relevance score is $Score(Q, D_k)$. Then, an interpretable approximation function of $(Q, D_k)$ may be constructed based on a plurality of candidate matching elements and the weight corresponding to each candidate matching element. Illustratively, the interpretable approximation function may be constructed as shown in equation (1) below.
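Equation (1) does not survive in this text. From the definitions that accompany it (a sum, over all candidate matching elements, of each element multiplied by its weight), it can be reconstructed as follows, where the symbol $\widehat{\mathrm{Score}}$ for the approximate score is notation assumed here:

```latex
\widehat{\mathrm{Score}}(Q, D_k) = \sum_{n=1}^{N} w_n \, f_n(Q, D_k) \tag{1}
```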

Where $Q$ is the query content; $D_k$ is a document in the document list; $f_n(Q, D_k)$ is a candidate matching element; $w_n$ is the weight corresponding to the candidate matching element; and $N$ is the number of candidate matching elements, a positive integer. That is, the constructed interpretable approximation function may be a weighted sum: each candidate matching element is multiplied by its weight, and the products are summed.

In the present embodiment, the following implementation examples of candidate matching elements are provided.

In a first implementation example, the candidate matching elements may be features composed based on the document and the query content. For example, the candidate matching elements may specifically include one or more features of word matching, n-gram matching, synonym matching, semantic vector matching, topic keyword matching, multimodal information matching, metadata attribute matching.

Word matching refers to the words that the query content Q and a single document $D_k$ have in common; the term frequency-inverse document frequency (TF-IDF) of such a word may be used as the feature of the candidate matching element for word matching. The term frequency TF is the frequency with which the word occurs in document $D_k$, and IDF is the inverse of the number of documents in the database that include the word.
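The word-matching feature can be sketched as below. The function name and the toy corpus are hypothetical, and IDF is taken literally from the text as 1 / (number of documents containing the word); log-scaled IDF variants are common in practice but are not what the text states.

```python
from collections import Counter


def tf_idf_word_features(query_words, doc_words, corpus):
    """Return {word: TF * IDF} for words shared by the query and the document.

    TF is the occurrence count of the word in the document. IDF follows the
    text literally: the inverse of the number of corpus documents that
    contain the word.
    """
    tf = Counter(doc_words)
    features = {}
    for word in set(query_words) & set(doc_words):
        doc_freq = sum(1 for other in corpus if word in other)
        # Guard against a document that is not itself part of the corpus.
        features[word] = tf[word] / max(doc_freq, 1)
    return features


corpus = [["bank", "profit", "bank"], ["bank", "risk"], ["real", "estate"]]
features = tf_idf_word_features(["bank", "profit"], corpus[0], corpus)
```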

n-gram matching refers to the n-gram items that the query content Q and document $D_k$ have in common, where an n-gram item is a phrase formed by n consecutive words in a fixed order; the TF-IDF of the n-gram item is used as the feature of the candidate matching element for n-gram matching. Illustratively, the value of n may be 2 or 3.
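As a small illustration of finding the n-gram items shared by a query and a document, the sketch below assumes both are already tokenized into word lists; the function names are hypothetical.

```python
def ngrams(words, n):
    """Return all runs of n consecutive words, in order, as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def shared_ngrams(query_words, doc_words, n=2):
    """Return the distinct n-grams of the query that also occur in the document."""
    doc_set = set(ngrams(doc_words, n))
    seen, shared = set(), []
    for gram in ngrams(query_words, n):
        if gram in doc_set and gram not in seen:
            seen.add(gram)
            shared.append(gram)
    return shared


query = ["real", "estate", "stock", "analysis"]
document = ["analysis", "of", "real", "estate", "stock"]
common = shared_ngrams(query, document, n=2)
```

Each shared n-gram could then be scored with TF-IDF in the same way as a single word.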

Synonym matching means that synonyms of the words or n-gram items in the query content Q are determined according to a synonym dictionary; if such a synonym appears in document $D_k$, the TF-IDF of the synonym may be used as the feature of the candidate matching element for synonym matching. The synonym dictionary may be a general-domain synonym dictionary based on HowNet, or a domain synonym dictionary mined from text in a single domain (e.g., the financial domain).

Semantic vector matching refers to determining, according to word-level semantic vector representations, other words or phrases $word^{sim}$ with high semantic similarity to the words or n-gram items in the query content Q (e.g., semantic similarity greater than 0.9). If $word^{sim}$ appears in document $D_k$, the product of the TF-IDF of $word^{sim}$ and the semantic similarity sim may be used as the feature of the candidate matching element for semantic vector matching. Here the semantic similarity sim is the semantic similarity between the word or n-gram item in the query content Q and $word^{sim}$. Illustratively, word-level semantic vectors may be inferred based on deep learning models such as word2vec, GloVe, BERT, or GPT.
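A minimal sketch of this feature follows. It assumes cosine similarity as the similarity measure (the text does not name one) and uses hand-written toy vectors in place of embeddings from a trained model; the function names and the `vectors` / `doc_tfidf` inputs are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_match_features(query_word, doc_tfidf, vectors, threshold=0.9):
    """Return {word_sim: TF-IDF * sim} for document words similar to query_word.

    doc_tfidf maps each document word to its TF-IDF value; vectors maps words
    to semantic vectors. Only words whose similarity exceeds the threshold
    contribute a feature.
    """
    features = {}
    q_vec = vectors[query_word]
    for word, tfidf in doc_tfidf.items():
        if word == query_word or word not in vectors:
            continue
        sim = cosine(q_vec, vectors[word])
        if sim > threshold:
            features[word] = tfidf * sim
    return features


vectors = {"capital": [1.0, 0.0], "Beijing": [1.0, 0.0], "bank": [0.0, 1.0]}
features = semantic_match_features("capital", {"Beijing": 2.0, "bank": 3.0}, vectors)
```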

Topic keyword matching refers to matching between the query content Q and the topic keywords of document $D_k$; the matching result may be used as the feature of the candidate matching element for topic keyword matching. The topic keywords of document $D_k$ may be determined based on a topic model such as latent Dirichlet allocation (LDA), optionally combined with manual labeling results.

Multimodal information matching refers to matching between the query content Q and the multimodal attributes of document $D_k$; the probability of a matched multimodal attribute is used as the feature of the candidate matching element for multimodal information matching. When document $D_k$ is a multimodal document, its multimodal attributes can be determined by analyzing and understanding the content of $D_k$. Such attributes include, for example, table/picture subcategories (balance sheet, market trend chart, etc.), associated subject objects (the corresponding stock, the corresponding industry, etc.), descriptive keywords of a table relating to a specific indicator definition, and various picture content information. A probability, such as an occurrence probability, corresponding to each multimodal attribute may also be determined.

Metadata attribute matching refers to matching between the query content Q and the metadata attributes of document $D_k$; the intent probability corresponding to a matched metadata attribute is used as the feature of the candidate matching element for metadata attribute matching. The metadata attributes of document $D_k$ may be, for example, the type to which the document content belongs (industry analysis, individual stock analysis, morning presentation, etc.), author information, release time, home industry, and so on. When the query content Q matches the intent of one or more metadata attribute dimensions, the intent probability is used as the feature of the candidate matching element for metadata attribute matching. For example, assuming that the query content Q is "individual stock analysis of the real estate industry", intent recognition determines that the query matches the intent of two metadata attributes of document $D_k$, "home industry" and "type of report", e.g., <home industry: real estate industry, probability = 0.9> and <type of report: individual stock analysis, probability = 0.85>. The intent probabilities corresponding to these two metadata attributes are then used as the features of the candidate matching elements.

In a second implementation example, the candidate matching element may be a feature of document $D_k$ itself. For example, the candidate matching element may be one or more of the full-text length of the document, the non-text modal data included in the document, the timeliness data of the document, and the historical access data of the document.

The full-text length of the document may be measured by one or more dimensions, such as the number of words or the number of pages of the document.

The non-text modal data included in the document refers to the amount of non-text modal information in document $D_k$, such as the number of tables (the total number or the number in specified categories) and the number of pictures.

Timeliness data of a document refers to a feature measuring the timeliness of document $D_k$. For example, the negative exponential function $e^{-(t_s - t_p)}$ of the interval between the current search date $t_s$ and the document release date $t_p$ may be used as the feature; the larger the time interval, the smaller the feature value.

Historical access data of a document refers to a feature measuring how document $D_k$ has been accessed, for example the historical click rate of document $D_k$.
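The timeliness feature can be computed directly from the two dates, as sketched below. The text does not state the unit of the interval, so the sketch assumes a hypothetical `unit_days` parameter that expresses the interval in years by default; the function name is also an assumption.

```python
import math
from datetime import date


def timeliness(search_date, publish_date, unit_days=365.0):
    """Return exp(-(t_s - t_p)), with the interval measured in units of unit_days.

    The time unit is an assumption of this sketch; the feature is 1.0 for a
    document published on the search date and decays toward 0 with age.
    """
    interval = (search_date - publish_date).days / unit_days
    return math.exp(-interval)


fresh = timeliness(date(2023, 1, 1), date(2022, 12, 1))
stale = timeliness(date(2023, 1, 1), date(2020, 1, 1))
```

With this choice of unit, a document that is one year old receives a feature value of $e^{-1}$, so older research reports contribute progressively less to the score.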

In an actual application scenario, the candidate matching element may be any of the above various implementation examples, or may be other applicable features, which is not limited in this embodiment.

Then, for the query content Q and the document list $D(Q) = [D_1, \ldots, D_k, \ldots, D_M]$, a score is calculated for each document in the document list D(Q) based on the plurality of candidate matching elements, and the weight values of the plurality of candidate matching elements are calculated such that the deviation between the score given by the interpretable approximation function and the relevance score of each document in D(Q) is as small as possible.

Specifically, a loss function as shown in the following formula (2) can be constructed:
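Formula (2) does not survive in this text. A form consistent with the term definitions given for it would be the following, assuming the deviation is measured as a squared error (the text does not state the exact deviation measure) and writing $\widehat{\mathrm{Score}}$ for the score of the interpretable approximation function (assumed notation):

```latex
\mathrm{Loss}(W^Q) = \sum_{k=1}^{M} \left( \mathrm{Score}(Q, D_k) - \widehat{\mathrm{Score}}(Q, D_k) \right)^2 + \mu \, \lVert W^Q \rVert_1 \tag{2}
```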

Where $Loss(W^Q)$ is the loss function; $W^Q$ is the vector formed by the weight values corresponding to the plurality of candidate matching factors; $Score(Q, D_k)$ is the relevance score between the query content Q and document $D_k$; the approximation term for $(Q, D_k)$ is the score calculated based on the interpretable approximation function; and $\lVert W^Q \rVert_1$ represents the $L_1$ norm of the weight vector $W^Q$. The $L_1$ norm is chosen because it induces sparse feature coefficients, so that a search explanation can be given based on the few coefficients that are not 0. $\mu > 0$ is the regularization coefficient; the larger its value, the sparser the resulting weight vector. In practical application, the sparser the weight vector, the more strongly the candidate matching factors with non-zero weights determine the relevance of the document, that is, the more they are worth using as the interpretability information of the document.
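The text does not say how the weights are solved for. One standard way to minimize a squared-error loss with an $L_1$ penalty is cyclic coordinate descent with soft thresholding, sketched below in pure Python. The squared-error form, the solver, and all names are assumptions of this sketch rather than the method of the application.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values within it become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_sparse_weights(features, scores, mu=0.1, n_iter=200):
    """Minimize sum_k (scores[k] - w . features[k])^2 + mu * ||w||_1.

    features[k][n] is the value of candidate matching element n for document k;
    scores[k] is the search engine's relevance score for document k.
    """
    n_docs, n_feats = len(features), len(features[0])
    weights = [0.0] * n_feats
    for _ in range(n_iter):
        for j in range(n_feats):
            rho = 0.0  # correlation of element j with the residual excluding j
            z = 0.0    # squared norm of element j across documents
            for k in range(n_docs):
                pred_without_j = sum(
                    weights[n] * features[k][n] for n in range(n_feats) if n != j
                )
                rho += features[k][j] * (scores[k] - pred_without_j)
                z += features[k][j] ** 2
            weights[j] = soft_threshold(rho, mu / 2.0) / z if z > 0 else 0.0
    return weights


# Scores depend only on the first element, so the second weight is driven to 0.
weights = fit_sparse_weights(
    [[1.0, 0.3], [2.0, 0.1], [3.0, 0.2]], [2.0, 4.0, 6.0], mu=0.1
)
```

In this toy data the second element carries no information about the scores, so its weight is exactly zero after fitting; only the first element would be reported as an explanation.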

In other possible embodiments, since in practical application scenarios the search engine 1021 may also implement "fuzzy search" by using natural language processing (NLP) techniques such as query rewriting (Query Rewrite) and semantic matching models (Semantic Matching), the search engine 1021 may further determine the neighborhood of the query content Q, for example through the search logs and result lists in the search engine 1021, to obtain a set N(Q) that includes the query content Q and other query contents that fuzzily match the query content Q. For example, the query content Q is $Q_0$ in the set N(Q). Assuming that the query content entered is "capital", the search engine 1021 may retrieve not only documents that include the keyword "capital" but also documents that include the keyword "Beijing" (but not "capital"). Thus, the search engine 1021 may generate the set N(Q) = {"capital", "Beijing"} by fuzzy matching or the like.

The search engine 1021 may then calculate, based on the interpretable approximation function, scores between the plurality of query contents in the set N(Q) and the respective documents in the document list D(Q), and calculate the weight values of the plurality of candidate matching elements that minimize the deviation between these scores and the relevance scores of the documents. In this case, a loss function as shown in the following formula (3) can be constructed:

$$\mathrm{Loss}(W^Q) = \sum_{Q_j \in N(Q)} \sum_{k} \pi(Q, Q_j) \Big( \mathrm{Score}(Q_j, D_k^{(j)}) - \sum_{n=1}^{N} w_n f_n(Q_j, D_k^{(j)}) \Big)^2 + \mu \lVert W^Q \rVert_1 \tag{3}$$

where Loss(W^Q) is the loss function; W^Q is the vector formed by the weight values corresponding to the plurality of candidate matching elements; Q_j is the j-th query content in the set N(Q); D_k^{(j)} is the k-th document in the document list D(Q) that is scored against the query content Q_j; Score(Q_j, D_k^{(j)}) is the relevance score between the query content Q_j and the document D_k^{(j)}; Σ_n w_n f_n(Q_j, D_k^{(j)}) is the score calculated based on the interpretable approximation function; ‖W^Q‖_1 is the L1 norm of the weight vector W^Q; and μ > 0 is the regularization coefficient. π(Q, Q_j) ≥ 0 is the weight between Q_j and Q: the higher the text similarity between Q_j and Q, the greater the weight. Illustratively, the weight between Q_j and Q may be calculated using a Gaussian kernel function as shown in the following formula (4):

$$\pi(Q, Q_j) = \exp\!\Big( -\frac{\mathrm{distance}(Q, Q_j)^2}{\sigma^2} \Big) \tag{4}$$

where distance(Q, Q_j) is the distance between Q_j and Q, which can be obtained in several ways. For example, the nearest neighbors of Q can be found quickly with vector index tools such as Faiss or HNSW, which also yields the distances between them; alternatively, the bucket containing Q and its neighboring buckets can be found with a locality-sensitive hashing algorithm such as SimHash or MinHash, both commonly used for text de-duplication, which likewise yields the distances. σ is a coefficient that controls the range of action of the Gaussian kernel: the larger its value, the larger the local range of influence of the kernel.
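The effect of σ on the query weights can be seen with a small sketch. It assumes a Gaussian kernel of the form exp(−distance² / σ²); the exact scaling of the exponent is an assumption, and the function name is hypothetical. The distance itself is taken as given, however it was obtained.

```python
import math


def gaussian_kernel_weight(distance: float, sigma: float) -> float:
    """Weight between Q and a neighboring query Q_j.

    Assumed form: exp(-distance^2 / sigma^2). A distance of 0 gives weight 1,
    and the weight decays towards 0 as the distance grows.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return math.exp(-(distance ** 2) / (sigma ** 2))


# The same neighbor counts for more when sigma (the kernel range) is larger.
narrow = gaussian_kernel_weight(distance=1.0, sigma=1.0)
wide = gaussian_kernel_weight(distance=1.0, sigma=2.0)
```

With a small σ only near-duplicates of Q influence the fitted weights; with a large σ more distant rewrites contribute as well.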

The search engine 1021 may then solve for the weights of the candidate matching elements at which the loss function reaches its minimum, that is, solve an optimization problem with respect to W^Q. Taking as an example the W^Q at which the loss function shown in formula (3) reaches its minimum, the optimization problem is as shown in the following formula (5):

$$\min_{W^Q \in R^N} \mathrm{Loss}(W^Q) \tag{5}$$

where R^N is the N-dimensional Euclidean space.

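For a given query Q, π(Q, Q_j), Score(Q_j, D_k^{(j)}) and f_n(Q_j, D_k^{(j)}) are all non-negative constants in the optimization problem. Therefore, π(Q, Q_j) is abbreviated as π_j, Score(Q_j, D_k^{(j)}) as S_{j,k}, f_n(Q_j, D_k^{(j)}) as f_{n,j,k}, and W^Q as W, and the optimization problem is abbreviated as the following formula (6):

$$\min_{W \in R^N} \; \sum_{j} \sum_{k} \pi_j \Big( S_{j,k} - \sum_{n=1}^{N} w_n f_{n,j,k} \Big)^2 + \mu \lVert W \rVert_1 \tag{6}$$

where the first term is a quadratic function of W, and the second term, μ‖W‖_1, is a non-differentiable function.

Then, the unconstrained optimization problem shown in formula (6) can be converted, through a variable decoupling technique, into an optimization problem with an equality constraint, as shown in the following formula (7):

$$\min_{W, V \in R^N} \; \sum_{j} \sum_{k} \pi_j \Big( S_{j,k} - \sum_{n=1}^{N} w_n f_{n,j,k} \Big)^2 + \mu \lVert V \rVert_1 \quad \text{s.t.} \quad V - W = 0 \tag{7}$$

where "s.t." stands for "subject to", that is, the constraint V − W = 0 must be satisfied.

Thus, an augmented Lagrangian function as shown in the following formula (8) can be defined:

$$L_\rho(W, V, Y) = \sum_{j} \sum_{k} \pi_j \Big( S_{j,k} - \sum_{n=1}^{N} w_n f_{n,j,k} \Big)^2 + \mu \lVert V \rVert_1 + Y^{T}(V - W) + \frac{\rho}{2} \lVert V - W \rVert_2^2 \tag{8}$$

where the N-dimensional variable Y is the Lagrange multiplier, and ρ > 0 is the penalty term coefficient.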

The search engine 1021 may then solve the optimization problem by alternating updates within the alternating direction method of multipliers (ADMM) algorithm framework, as shown in the following formula (9):

From basic mathematical derivation, each of the above optimization sub-problems has an explicit solution, as shown in the following formula (10):

The search engine 1021 performs the alternating optimization based on the ADMM algorithm and stops when V^{k+1}, W^{k+1}, Y^{k+1} are sufficiently close to V^k, W^k, Y^k respectively, or when the number of iterations is large enough (for example, the number of iterations may be limited to 50), and returns the feature weights W^{k+1} as the estimate of W^Q. In this manner, the search engine 1021 may calculate the weights of the plurality of candidate matching elements.
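The alternating updates can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the weighted least-squares plus L1 objective with the constraint V − W = 0, a multiplier term Y·(V − W) and a penalty (ρ/2)‖V − W‖². Under those assumptions the W step is a linear solve, the V step is soft-thresholding, and the Y step is a dual update. Flattening each (query, document) pair into one sample, all function names, and the defaults for `rho` and `tol` are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def solve_linear(a: Matrix, b: Vector) -> Vector:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def soft_threshold(x: float, t: float) -> float:
    """Closed-form minimizer of t*|v| + 0.5*(v - x)^2."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def admm_l1(features: Matrix, scores: Vector, pis: Vector, mu: float,
            rho: float = 1.0, max_iter: int = 50, tol: float = 1e-8) -> Vector:
    """Minimize sum_i pi_i * (s_i - f_i . w)^2 + mu * ||w||_1 with ADMM."""
    n = len(features[0])
    # Constant parts of the W step: (2*A + rho*I) and 2*b, where
    # A = sum_i pi_i f_i f_i^T and b = sum_i pi_i s_i f_i.
    lhs = [[rho if r == c else 0.0 for c in range(n)] for r in range(n)]
    b2 = [0.0] * n
    for f, s, p in zip(features, scores, pis):
        for r in range(n):
            b2[r] += 2.0 * p * s * f[r]
            for c in range(n):
                lhs[r][c] += 2.0 * p * f[r] * f[c]
    w, v, y = [0.0] * n, [0.0] * n, [0.0] * n
    for _ in range(max_iter):
        w_new = solve_linear(lhs, [b2[i] + y[i] + rho * v[i] for i in range(n)])
        v_new = [soft_threshold(w_new[i] - y[i] / rho, mu / rho) for i in range(n)]
        y_new = [y[i] + rho * (v_new[i] - w_new[i]) for i in range(n)]
        change = max(abs(p - q) for p, q in zip(w_new + v_new + y_new, w + v + y))
        w, v, y = w_new, v_new, y_new
        if change < tol:  # iterates are sufficiently close: stop early
            break
    return w


weights = admm_l1(features=[[1.0, 0.0], [0.0, 1.0]], scores=[3.0, 0.1],
                  pis=[1.0, 1.0], mu=1.0)
```

In this toy run the first weight settles near 2.5 and the second is driven to approximately 0, which is the sparsity the L1 term is meant to produce. The sketch returns W as the text describes; V carries the exactly thresholded values and agrees with W at convergence.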

The search engine 1021 may then determine, from the plurality of candidate matching elements, the target matching elements that serve as the interpretability information of the target document, together with the weights corresponding to the target matching elements. This embodiment provides the following two implementation examples:

Example one: the search engine 1021 may calculate, from the plurality of candidate matching elements and their corresponding weights, the contribution of each candidate matching element, contribution(n) = w_n f_n(Q, D_k), and select the K largest contributions from among them. The K candidate matching elements corresponding to these K contributions are determined to be the target matching elements serving as the interpretability information, and the weights corresponding to the target matching elements are determined accordingly.

Example two: the search engine 1021 may calculate, from the plurality of candidate matching elements and their corresponding weights, the contribution of each candidate matching element, contribution(n) = w_n f_n(Q, D_k). The search engine 1021 may then sort the features in descending order of the absolute value of their contributions to obtain a reordering of the feature vector.

Then, using the following formula (11), the search engine 1021 may calculate the minimum value of K for which the similarity difference approaches 1+ε, so that the candidate matching elements corresponding to the first K features in the ranking are determined to be the target matching elements serving as the interpretability information, and the weights corresponding to the target matching elements are determined accordingly.

where ε is a constant and ε > 0; for example, ε may be 0.1.
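Example one above amounts to a top-K selection over the per-element contributions. The following is a minimal sketch of that selection; the element names, the function name, and the tuple layout of the result are illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def top_k_matching_elements(
    names: Sequence[str],
    weights: Sequence[float],
    feature_values: Sequence[float],
    k: int,
) -> List[Tuple[str, float, float]]:
    """Return (name, weight, contribution) for the K largest contributions,
    where contribution(n) = w_n * f_n(Q, D_k)."""
    scored = [(name, w, w * f) for name, w, f in zip(names, weights, feature_values)]
    scored.sort(key=lambda item: item[2], reverse=True)  # largest contribution first
    return scored[:k]


explanation = top_k_matching_elements(
    names=["word matching", "synonym matching", "document length"],
    weights=[0.5, 0.2, 0.0],
    feature_values=[2.0, 1.0, 10.0],
    k=2,
)
```

An element with a large raw value but zero weight, such as the document length here, never enters the explanation, which is the practical benefit of fitting sparse weights first.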

S303: after determining the target document and the interpretability information corresponding to the target document, the search engine 1021 outputs the target document and the interpretability information.

In this way, the terminal device 102 may present the target document and the interpretability information output by the search engine 1021 to the user 101, so that the user 101 can view the corresponding interpretability information while viewing the target document. The user 101 can thus determine, from the interpretability information, the basis on which the search engine 1021 returned the target document, which increases the trust of the user 101 in the returned target document and improves the search experience of the user.

For example, the terminal device 102 may present an interactive interface as shown in fig. 4, where relevant information of a plurality of documents may be included in the interactive interface, such as names of the documents or content summaries, etc. And, for each document, the preview button for the document can be presented in the interactive interface, so that after the user 101 clicks the preview button, the interpretability information corresponding to the document can be viewed in the new pop-up window. Of course, the presentation manner shown in fig. 4 is merely an exemplary illustration, and the terminal device 102 may present the target document and the corresponding interpretability information in other manners during actual application.

In a further possible embodiment, the terminal device 102 may also support the user 101 to reorder or rescreen the target document based on the presented interpretability information. In particular, when the user 101 views the interpretability information, it may consider that the importance of a part of the target matching elements for document retrieval is higher than that of the other target matching elements, and then the user 101 may screen from the presented multiple target matching elements. In this way, the search engine 1021 may reorder the target documents based on the user screened target matching elements and present the reordered target documents to the user 101 on the terminal device 102. Or, the search engine 1021 can rescreen the target document based on the target matching element screened by the user, and present the rescreened target document to the user 101 on the terminal device 102 according to the ranking before rescreening or the ranking after rescreening, so that the user 101 can customize document retrieval, and the retrieval experience of the user 101 is further improved.
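One way to picture the reordering step is to rescore each target document using only the matching elements the user kept. This is a hedged sketch under that assumption; the source does not specify how the reordering is computed, and the data layout and names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Each target document carries its per-element contributions (w_n * f_n).
Document = Tuple[str, Dict[str, float]]


def rerank_by_selected_elements(documents: List[Document], selected: Set[str]) -> List[str]:
    """Order document ids by the summed contribution of the selected elements only."""

    def partial_score(doc: Document) -> float:
        return sum(value for name, value in doc[1].items() if name in selected)

    ranked = sorted(documents, key=partial_score, reverse=True)
    return [doc_id for doc_id, _ in ranked]


docs = [
    ("report_A", {"word matching": 0.9, "timeliness": 0.1}),
    ("report_B", {"word matching": 0.3, "timeliness": 0.8}),
]
by_timeliness = rerank_by_selected_elements(docs, {"timeliness"})
by_words = rerank_by_selected_elements(docs, {"word matching"})
```

The same partial score could also serve as a filter for rescreening, for example by dropping documents whose partial score falls below a threshold.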

In practical applications, when the user is a developer or a technician, the user 101 may also analyze and determine whether there is an incorrect matching element in the interpretability information or whether there is a missing matching element based on the interpretability information presented by the terminal device 102. When there is an error or missing matching element, the user 101 corrects the matching element on the terminal device 102, or configures a new matching element for the search engine 1021, and triggers the search engine 1021 to re-interpret the retrieved target document. In this way, the search engine 1021 may re-generate corresponding interpretability information for the target document based on the revised matching element of the user 101 and the newly added matching element, and re-present the updated interpretability information to the user 101 on the terminal device 102, so that the user 101 determines whether the interpretability of the search engine 1021 for retrieving and feeding back the target document meets the expectations based on the newly presented interpretability information, so as to provide a more reasonable and correct retrieval interpretation for the user 101 or other users in the following.

In this embodiment, a document content segment having a high degree of correlation with the query input provided by the user 101 may also be presented. For this purpose, the present embodiment may further include the steps of:

S304: the search engine 1021 determines a target segment in the target document based on the interpretability information, wherein the target segment matches the query content more than the remaining segments in the target document.

S305: the search engine 1021 outputs the target segment.

As an implementation example, for a document D_k, the search engine 1021 may slide a window of preset length and width through the document D_k, starting from its first page, with each slide generating a new sliding window, as shown in fig. 5. During each slide, the search engine 1021 calculates an interpretability score for the window region, where a higher interpretability score indicates a stronger association between the document content within the sliding window region (which may include text, graphs, tables, or the like) and the query input.

For example, for each window region Windows (p), the search engine 1021 may count the set of K target matching elements (or interpretable features) that match within the sliding window coordinate range, i.e.

Then, the search engine 1021 calculates an interpretability score of each sliding window region Windows (p) according to the following formula (12):

where Rate(f_n, p) represents the ratio of the number of occurrences of the interpretable feature f_n within the sliding window p of D_k to the number of occurrences of that feature in the whole document D_k.

In this way, the search engine 1021 may select, according to the interpretability score of each sliding window region, a preset number of sliding window regions with the highest interpretability scores, and output the document content in the selected sliding window regions as the target segment.
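The window scan can be sketched on a tokenized document. The scoring rule used here, the weighted sum of Rate(f_n, p) over the target matching elements, is an assumption made for illustration, as are the one-dimensional token window (instead of a window with length and width on a page) and all names.

```python
from typing import Dict, List, Tuple


def score_windows(tokens: List[str], weights: Dict[str, float],
                  size: int, step: int) -> List[Tuple[int, float]]:
    """Return (window start, score) pairs.

    Assumed score: sum over matched terms of weight * Rate, where Rate is the
    term's count inside the window divided by its count in the whole document.
    """
    totals = {term: tokens.count(term) for term in weights}
    results = []
    for start in range(0, max(len(tokens) - size, 0) + 1, step):
        window = tokens[start:start + size]
        score = 0.0
        for term, weight in weights.items():
            if totals[term]:  # skip terms that never occur in the document
                score += weight * window.count(term) / totals[term]
        results.append((start, score))
    return results


def best_windows(scored: List[Tuple[int, float]], top: int) -> List[Tuple[int, float]]:
    """Keep the preset number of windows with the highest scores."""
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top]


tokens = "a b x x c d x e".split()
scored = score_windows(tokens, {"x": 1.0}, size=3, step=1)
target = best_windows(scored, top=1)
```

Here the window starting at token 1 covers two of the three occurrences of the matched term, so it is returned as the target segment.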

Further, the terminal device 102 may present the target segment output by the search engine 1021 to the user 101. For example, after the user 101 clicks the preview button corresponding to a document on the display interface of the terminal device 102, the target segment content in the document can be displayed on the display interface, so that the user 101 can conveniently and quickly locate the content with high correlation with the query content in the document, and the search experience of the user can be further improved.

In the present embodiment, the document searching method is described using the example of execution by the search engine 1021. In other embodiments, the document searching method may be executed by another entity independent of the search engine 1021, such as a processor or an interpretation engine separately disposed in the terminal device 102, which is not limited here.

In a further possible embodiment, among the plurality of documents stored in the database, part of the content of some documents may be predictions made by a researcher, such as a predicted change in a stock price over a period of time or the predicted output of a certain product. In this case, in addition to determining the target document and the interpretability information fed back to the user, the search engine 1021 (or another entity independent of the search engine 1021) may evaluate the predictive ability of the researcher (or other object), so as to assess the prediction accuracy of the researcher and, further, the professional competence of the researcher.

In particular, the search engine 1021 may determine a predicted document associated with an object to be evaluated from among a plurality of documents stored in a database, in which predicted data of the object to be evaluated is recorded. The object to be evaluated may be, for example, the above-mentioned researcher, or may be a research and development department, or may be an AI model, etc., which is not limited in this embodiment. Then, the search engine 1021 may obtain actual data matching with the predicted content in the predicted document, for example, the predicted content is a change of the stock price of XX in one week, and the actual data is a real change of XX in one week, so that the search engine 1021 may determine a prediction error rate corresponding to the object to be evaluated according to the predicted content and the actual data, thereby generating evaluation information for the object to be evaluated according to the prediction error rate and the number of predicted documents.

Taking a researcher as the object to be evaluated, and assuming that the predicted document is a research report on an individual stock, the real values of all numerical indicators can be recorded as a list, and the predicted values that the researcher P_1 gives for these numerical indicators can be recorded as a corresponding list. For a numerical indicator that the prediction does not cover, the corresponding position is recorded as ∅. Then, the error rate of the numerical indicators can be defined using the following formula (13):

where prediction_i(P_1) is the value predicted by the researcher P_1 for the i-th numerical indicator; value_i is the real value of the i-th numerical indicator; Count({i | prediction_i(P_1) ∈ ∅}) counts the numerical indicators that the researcher did not predict; and the parameter θ > 0 is a coverage penalty term: the larger its value, the lower the tolerance for incomplete prediction of the numerical indicators by the researcher P_1.
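To make the roles of the coverage penalty and the per-indicator error concrete, here is a hedged sketch. The specific combination used, the mean relative error over predicted indicators plus θ times the number of unpredicted indicators, is an assumption chosen for illustration and is not taken from the source; `None` stands in for ∅, and the function name is hypothetical.

```python
from typing import List, Optional


def numeric_error_rate(predictions: List[Optional[float]],
                       values: List[float], theta: float) -> float:
    """Assumed form: mean relative error on covered indicators
    plus theta for every indicator the researcher did not predict.

    Real values are assumed to be non-zero.
    """
    covered = [(p, v) for p, v in zip(predictions, values) if p is not None]
    missing = len(predictions) - len(covered)
    relative = sum(abs(p - v) / abs(v) for p, v in covered)
    mean_error = relative / len(covered) if covered else 0.0
    return mean_error + theta * missing


# Two indicators predicted with 10% error each, one indicator left out.
rate = numeric_error_rate([110.0, None, 45.0], [100.0, 20.0, 50.0], theta=0.5)
```

With a larger `theta`, a report that skips indicators is penalized more heavily than one that predicts them imprecisely.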

In addition, if the predicted document contains enumerated indicators, their values (assuming K possible values) need to be mapped into a set of positive integers, such as [1, 2, …, K], and the monotonicity of the enumerated indicator must be preserved in the mapping. For example, the enumerated indicator [buy, hold, sell] can be mapped to [1, 2, 3], rather than to some other arbitrary arrangement of {1, 2, 3}. The real values of all enumerated indicators and the values predicted by the researcher P_1 are each recorded as a list, with both the real and the predicted values mapped into the positive integer set. Then, the error rate of the enumerated indicators can be defined using the following formula (14):

where K_j is a positive integer representing the number of possible values of the j-th enumerated indicator.
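The monotone mapping matters because the error is measured as a distance between ranks. A minimal sketch follows; the normalization by K_j − 1 is an assumption made so that the error lies between 0 and 1, and the names are hypothetical.

```python
from typing import Sequence

RATING_SCALE = ["buy", "hold", "sell"]  # ordered, so ranks 1..K are monotone


def to_rank(label: str, scale: Sequence[str]) -> int:
    """Map an enumerated value to its 1-based position in the ordered scale."""
    return list(scale).index(label) + 1


def enum_error(predicted: str, actual: str, scale: Sequence[str]) -> float:
    """Assumed form: rank distance divided by (K - 1), giving a value in [0, 1]."""
    k = len(scale)
    return abs(to_rank(predicted, scale) - to_rank(actual, scale)) / (k - 1)


near_miss = enum_error("buy", "hold", RATING_SCALE)
opposite = enum_error("buy", "sell", RATING_SCALE)
```

If the scale were mapped in an arbitrary order, such as buy to 1, sell to 2 and hold to 3, predicting "buy" when the truth is "sell" would look like a smaller mistake than predicting "buy" when the truth is "hold".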

Thus, the search engine 1021 can obtain the overall error rate of the researcher P_1 for this document as:

Then, for the predicted documents associated with the researcher P_1, the search engine 1021 can calculate the error rate corresponding to each document and obtain a total error rate by weighted summation, averaging, or the like, thereby achieving a quantitative evaluation of the research accuracy of the researcher P_1 over a period of time. A lower total error rate indicates a higher research level (in terms of predictive accuracy).

Since the prediction error rate mainly reflects the accuracy of the predictions of the researcher P_1, the search engine 1021, when evaluating the researcher P_1, may also take into account the number of predicted documents produced by the researcher P_1, so as to avoid using too few evaluation dimensions, which would affect the fairness and impartiality of the evaluation.

Specifically, the output and the error rate of the researcher can each be monotonically mapped and normalized to an indicator with a value in (0, 1). The monotone normalization function may be, for example, the hyperbolic tangent function shown in the following formula (16); in other embodiments, it may instead be a piecewise mapping applied after ranking the researchers, including the researcher P_1.

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{16}$$

Thus, based on formulas (14) to (16), the composite score of the researcher P_1 can be calculated using the following formula (17):

where StdProduction(P_1) is the output indicator of the researcher P_1, with a value between 0 and 1; StdErrorrate(P_1) is the comprehensive error rate indicator of the researcher P_1, with a value between 0 and 1; and Score(P_1) is the composite score of the researcher P_1.
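The monotone normalization step described above can be illustrated with the hyperbolic tangent. This sketch covers only the normalization, not the composite score; the `scale` parameter is a hypothetical addition that keeps large raw values, such as document counts, from saturating at 1.

```python
import math


def normalize(raw: float, scale: float = 1.0) -> float:
    """Monotonically map a non-negative raw value into [0, 1) with tanh.

    `scale` is an illustrative assumption: raw values around `scale`
    land near tanh(1), roughly 0.76.
    """
    if raw < 0 or scale <= 0:
        raise ValueError("raw must be non-negative and scale positive")
    return math.tanh(raw / scale)


# Normalized output of researchers who produced 5, 20 and 40 predicted documents.
std_production = [normalize(count, scale=20.0) for count in (5, 20, 40)]
```

Because tanh is strictly increasing, the ordering of researchers by raw output or raw error rate is preserved after normalization.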

It should be noted that the above manner in which the search engine 1021 generates the evaluation information for the object to be evaluated is merely an implementation example. In other possible implementations, the search engine 1021 may generate the evaluation information according to the prediction error rate alone, or according to other applicable reference information. Alternatively, taking the time horizon of predictions into account, the error rate of researchers who mainly make long-term predictions may be multiplied by a parameter less than 1, so as to balance the effect of different time horizons across researchers. Alternatively, when a plurality of researchers predict the same content in the same predicted document, the error rates of the researchers may be adjusted using the Shapley value method or other methods, which is not limited in this embodiment.

In addition, the embodiment of the application also provides a document searching device. Referring to fig. 6, fig. 6 shows a schematic structure of a document searching apparatus in an embodiment of the present application, and the document searching apparatus 600 shown in fig. 6 is applied to a search engine, such as the search engine 1021 in the foregoing embodiment, and the document searching apparatus 600 includes:

an obtaining module 601, configured to obtain query content input by a user;

the searching module 602 is configured to search among a plurality of documents according to the query content to obtain a target document and interpretability information corresponding to the target document, where the interpretability information includes a target matching element corresponding to the target document and a weight corresponding to the target matching element, and the target document is at least one document in the plurality of documents, where a correlation between the target document and the query content meets a preset condition;

and an output module 603, configured to output the target document and the interpretability information.

In one possible implementation, the search module 602 is specifically configured to:

In a possible implementation manner, the search module 602 is further configured to determine, according to the interpretability information, a target segment in the target document, where a matching degree between the target segment and the query content is higher than a matching degree between the rest of segments in the target document and the query content;

The output module 603 is further configured to output the target segment.

the search module 602 is further configured to:

It should be noted that, because the information interaction and execution processes between the modules and units of the above apparatus are based on the same concept as the method embodiments of the present application, they bring the same technical effects as the method embodiments. For details, reference may be made to the description in the foregoing method embodiments, which is not repeated here.

In addition, the embodiment of the application also provides a computing device. Referring to fig. 7, fig. 7 illustrates a schematic hardware architecture of a computing device 700 in an embodiment of the present application, where the computing device 700 may include a processor 701 and a memory 702.

Wherein the memory 702 is configured to store a computer program;

the processor 701 is configured to execute the following steps according to the computer program:

acquiring query content input by a user;

and outputting the target document and the interpretability information.

The processor 701 may be a CPU, or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.

The memory 702 may be, for example, volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DR RAM).

In a possible implementation manner, the processor 701 is specifically configured to perform the following steps according to the computer program:

In a possible implementation manner, the processor 701 is further configured to perform the following steps according to the computer program:

outputting the target fragment.

In a possible implementation manner, the plurality of documents include a predicted document, in which predicted data of the object to be evaluated is recorded, and the processor 701 is further configured to execute the following steps according to the computer program:

In addition, the embodiment of the application also provides a computer readable storage medium for storing a computer program for executing the document searching method described in the above embodiment of the method.

From the above description of embodiments, it will be apparent to those skilled in the art that all or part of the steps of the above described example methods may be implemented in software plus general hardware platforms. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which may be stored in a storage medium, such as a read-only memory (ROM)/RAM, a magnetic disk, an optical disk, or the like, including several instructions for causing a computer device (which may be a personal computer, a server, or a network communication device such as a router) to perform the methods described in the embodiments or some parts of the embodiments of the present application.

In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points. The apparatus embodiments described above are merely illustrative, in which the modules illustrated as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the objective of the embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.

The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application.

Claims

1. A document searching method, the method being applied to a search engine, the method comprising:

acquiring query content input by a user;

outputting the target document and the interpretability information;

searching among a plurality of documents according to the query content to obtain a target document and explanatory information corresponding to the target document, wherein the method comprises the following steps:

Determining weights respectively corresponding to a plurality of candidate matching elements by using an interpretable approximation function according to the target document and the relevance scores of the target document, wherein the interpretable approximation function is constructed based on the plurality of candidate matching elements and the weights corresponding to each candidate matching element, and the deviation between the scores calculated by using the interpretable approximation function based on the weights respectively corresponding to the plurality of candidate matching elements and the relevance scores of the target document is smaller than a preset range;

calculating the contribution degree corresponding to each candidate matching element according to the weights corresponding to the candidate matching elements and the candidate matching elements respectively;

and determining the target matching element from the plurality of candidate matching elements according to the contribution degree corresponding to each candidate matching element, and determining the weight corresponding to the target matching element, wherein the target matching element meets a preset element determination condition.

2. The method of claim 1, wherein the plurality of candidate matching elements comprises any of word matching, n-gram matching, synonym matching, semantic vector matching, topic keyword matching, multimodal information matching, metadata attribute matching, document full-text length, non-text modality data included in a document, timeliness data of a document, historical access data of a document.

3. The method according to claim 1, wherein the method further comprises:

outputting the target fragment.

4. The method according to claim 1, wherein the plurality of documents includes a predicted document in which predicted data of the object to be evaluated is recorded, the method further comprising:

5. The method according to any one of claims 1 to 4, wherein the target document is a multi-modal document, and the multi-modal document refers to a document including any of a plurality of types of information in text, figures, and tables.

6. A document searching apparatus, the document searching apparatus being applied to a search engine, the document searching apparatus comprising:

the acquisition module is used for acquiring query content input by a user;

the output module is used for outputting the target document and the interpretability information;

the search module is specifically configured to:

7. The apparatus of claim 6, wherein the plurality of candidate matching elements comprises any of word matching, n-gram matching, synonym matching, semantic vector matching, topic keyword matching, multimodal information matching, metadata attribute matching, document full-text length, non-text modality data included in a document, timeliness data of a document, historical access data of a document.

8. The apparatus of claim 6, wherein:

the searching module is further used for determining target fragments in the target document according to the interpretability information, and the matching degree between the target fragments and the query content is higher than that between the rest fragments in the target document and the query content;

The output module is further used for outputting the target segment.

9. The apparatus according to claim 6, wherein the plurality of documents includes a predictive document in which predictive data of an object to be evaluated is recorded;

the search module is further configured to:

10. The apparatus according to any one of claims 6 to 9, wherein the target document is a multi-modal document, the multi-modal document being a document including any of a plurality of types of information in text, figures, tables.

11. A computing device comprising a processor, a memory;

the processor is configured to execute instructions stored in the memory to cause the computing device to perform the steps of the method of any one of claims 1 to 5.

12. A computer readable storage medium comprising instructions which, when run on a computing device, cause the computing device to perform the steps of the method of any of claims 1 to 5.