CN115630144A

CN115630144A - Document searching method and device and related equipment

Info

Publication number: CN115630144A
Application number: CN202211646790.1A
Authority: CN
Inventors: 王哲; 刘殊玥; 余怡然; 舒光斌; 岳丰; 杨思喆; 史勇; 罗俊; 贾智杰; 方兴; 宋群力
Original assignee: Citic Securities Co ltd
Current assignee: Citic Securities Co ltd
Priority date: 2022-12-21
Filing date: 2022-12-21
Publication date: 2023-01-20
Anticipated expiration: 2042-12-21
Also published as: CN115630144B

Abstract

A document searching method, comprising: a search engine acquires query content input by a user; searches a plurality of documents according to the query content to obtain a target document and interpretability information corresponding to the target document, wherein the interpretability information comprises a target matching element corresponding to the target document and a weight corresponding to the target matching element, and the target document is at least one document, among the plurality of documents, whose relevance to the query content meets a preset condition; and outputs the target document and the interpretability information. In this way, the user can determine, from the interpretability information output by the search engine, the basis on which the search engine fed back the target document, which can improve the user's trust in the target document fed back by the search engine and thereby improve the user's search experience. In addition, the application also provides a corresponding document searching device and related equipment.

Description

Document searching method and device and related equipment

Technical Field

The present application relates to the field of data retrieval technologies, and in particular, to a document search method, an apparatus, and a related device.

Background

At present, organizations such as enterprises usually store large amounts of document data, such as brokerage research reports in the financial field. Because this document data is stored in unstructured form, retrieving valid documents that meet users' expectations from it has become a key concern for these organizations.

Although search engine technology is now widely used to search for document information, in practical application scenarios the document information fed back by a search engine often does not meet the user's expectations. For example, when some of the retrieved documents have low relevance to the query content input by the user, the user may believe that the search engine has failed to feed back documents that are highly relevant to the query content, which affects the user's search experience. Therefore, how to improve the credibility of the search results fed back by a search engine has become an important problem to be solved urgently.

Disclosure of Invention

The application provides a document searching method, which aims to improve the credibility of a search engine for feeding back a search result and further improve the search experience of a user. In addition, the application also provides a corresponding document searching device, a computing device, a computer readable storage medium and a computer program product.

In a first aspect, the present application provides a document searching method, which is applied to a search engine, and includes:

acquiring query content input by a user;

searching in a plurality of documents according to the query content to obtain a target document and interpretability information corresponding to the target document, wherein the interpretability information comprises a target matching element corresponding to the target document and a weight corresponding to the target matching element, and the target document is at least one document of the plurality of documents, the relevance between the target document and the query content of which meets a preset condition;

and outputting the target document and the interpretability information.

In a possible implementation manner, the searching among a plurality of documents according to the query content to obtain a target document and interpretability information corresponding to the target document includes:

searching the target document and the relevance score of the target document from the plurality of documents according to the query content, wherein the relevance score of the target document is higher than the relevance scores of the rest documents in the plurality of documents;

determining weights corresponding to a plurality of candidate matching elements according to the target document and the relevance scores of the target document, wherein the deviation between the scores calculated based on the weights corresponding to the candidate matching elements and the relevance scores of the target document is smaller than a preset range;

and determining the target matching element from the candidate matching elements, and determining the weight corresponding to the target matching element, wherein the target matching element meets the preset element determination condition.

In one possible implementation, the plurality of candidate matching elements includes any of word matching, n-gram matching, synonym matching, semantic vector matching, topic keyword matching, multi-modal information matching, metadata attribute matching, document full-text length, non-textual modality data included with the document, time-sensitive data of the document, and historical access data of the document.

In one possible embodiment, the method further comprises:

determining a target segment in the target document according to the interpretability information, wherein the matching degree between the target segment and the query content is higher than that between the rest segments in the target document and the query content;

and outputting the target segment.

In a possible embodiment, the plurality of documents includes a prediction document in which prediction data of an object to be evaluated is recorded, and the method further includes:

determining a prediction document related to the object to be evaluated from the plurality of documents;

acquiring actual data matched with the predicted content in the predicted document;

determining a prediction error rate corresponding to the object to be evaluated according to the prediction content and the actual data;

and generating evaluation information aiming at the object to be evaluated according to the prediction error rate and the number of the prediction documents.

In one possible implementation, the target document is a multi-modal document, and the multi-modal document refers to a document including multiple types of information among text, figures, and tables.

In a second aspect, an embodiment of the present application provides a document searching apparatus, which is applied to a search engine, and includes:

the acquisition module is used for acquiring query contents input by a user;

the search module is used for searching in a plurality of documents according to the query content to obtain a target document and interpretability information corresponding to the target document, wherein the interpretability information comprises a target matching element corresponding to the target document and a weight corresponding to the target matching element, and the target document is at least one document, of the plurality of documents, of which the correlation with the query content meets a preset condition;

and the output module is used for outputting the target document and the interpretability information.

In a possible implementation manner, the search module is specifically configured to:

In a possible implementation manner, the search module is further configured to determine a target segment in the target document according to the interpretability information, wherein the matching degree between the target segment and the query content is higher than the matching degree between the rest segments in the target document and the query content;

the output module is further configured to output the target segment.

In a possible implementation manner, the plurality of documents include a prediction document in which prediction data of an object to be evaluated is recorded;

the search module is further configured to:

In one possible implementation, the target document is a multi-modal document, which refers to a document including multiple types of information among text, figures, and tables.

In a third aspect, the present application provides a computing device comprising a processor and a memory. The processor is configured to execute instructions stored in the memory to cause the computing device to perform a document search method as in the first aspect or any implementation manner of the first aspect. It should be noted that the memory may be integrated into the processor or may be independent from the processor. The computing device may also include a bus, through which the processor is connected with the memory. The memory may include a read-only memory and a random access memory, among others.

In a fourth aspect, the present application provides a computer-readable storage medium having stored therein instructions, which, when executed on a computing device, cause the computing device to perform the operational steps of the document search method according to the first aspect or any one of the implementations of the first aspect.

In a fifth aspect, the present application provides a computer program product comprising instructions which, when run on a computing device, cause the computing device to perform the operational steps of the document search method according to the first aspect or any one of the implementations of the first aspect.

The implementations provided in the above aspects can be further combined to provide more implementations.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings can be obtained by those skilled in the art according to these drawings.

FIG. 1 is a schematic diagram of an exemplary application scenario provided herein;

FIG. 2 is a schematic diagram of another exemplary application scenario provided herein;

FIG. 3 is a schematic flow chart of a document searching method provided by the present application;

FIG. 4 is a schematic diagram of an exemplary interaction interface provided herein;

FIG. 5 is a schematic view of an exemplary sliding window provided herein;

FIG. 6 is a schematic structural diagram of a document searching apparatus provided in the present application;

FIG. 7 is a schematic diagram of a hardware structure of a computing device provided in the present application.

Detailed Description

Currently, when a user performs information retrieval, a search engine usually searches, according to the query content input by the user, for one or more documents with high relevance to the query content and feeds them back. The user can see only the document information and cannot know the basis on which the search engine fed back the documents, so the user's trust in the documents fed back by the search engine is low. In particular, when the documents fed back by the search engine do not meet the user's expectations, the user may believe that the search engine has failed to feed back important documents, that is, documents with higher relevance to the query content, which affects the user's search experience.

Based on this, the embodiment of the application provides a document searching method, which is used for improving the search experience of a user. In a specific implementation, when a user performs information retrieval, a search engine first acquires query content input by the user and searches a plurality of documents according to the query content to obtain a target document and interpretability information corresponding to the target document, wherein the interpretability information comprises a target matching element corresponding to the target document and a weight corresponding to the target matching element, and the target document is at least one document, among the plurality of documents, whose relevance to the query content meets a preset condition. The search engine then outputs the target document and the interpretability information. In the process of retrieving documents according to the query content, the search engine outputs not only the target document relevant to the query content but also the interpretability information corresponding to the target document, so that the user can determine from the interpretability information the basis on which the search engine fed back the target document. This can improve the user's trust in the target document fed back by the search engine and thereby improve the user's search experience.

As an implementation example, the embodiment of the application can be applied to an application scenario as shown in FIG. 1. In this application scenario, the user 101 may perform an information search on the terminal device 102. Specifically, the terminal device 102 may present an interactive interface to the user 101, so that the user 101 may input query content on the interactive interface and trigger the terminal device 102 to perform information retrieval according to the query content. The terminal device 102 is provided with a search engine 1021, and the search engine 1021 can retrieve a document with higher relevance to the query content from the database according to the query content, determine interpretability information corresponding to the document, and present the document and the interpretability information together on the interactive interface of the terminal device 102. In this way, the user 101 can also view the interpretability information fed back by the terminal device 102 when viewing the retrieved document, which improves the trust of the user 101 in the document.

It is noted that the application scenario shown in FIG. 1 is only an implementation example, and is not used to limit the application scenarios of the embodiment. For example, in other possible application scenarios, as shown in FIG. 2, after the user 101 inputs query content on the terminal device 102, the terminal device 102 may generate a search request including the query content and send the search request to the cloud server 103 in the cloud. A search engine 1031 runs in the cloud server 103, so that the cloud server 103 may respond to the search request, retrieve a document with high relevance to the query content by using the search engine 1031, determine interpretability information corresponding to the document, and then feed back the document and the interpretability information to the terminal device 102. Thus, the terminal device 102 may present the document and the interpretability information to the user 101 on the interactive interface.

For the sake of understanding, the following describes an embodiment of a document searching method provided in the present application with reference to the accompanying drawings.

Referring to FIG. 3, FIG. 3 is a flowchart illustrating a document searching method provided in an embodiment of the present application, where the method may be applied to the application scenarios shown in FIG. 1 or FIG. 2, or may be applied to other applicable application scenarios. For convenience of explanation, the present embodiment is exemplified by being applied to the application scenario shown in FIG. 1.

The document searching method shown in FIG. 3 may be executed by the search engine 1021 in FIG. 1, and the method may specifically include:

S301: The search engine 1021 acquires query content input by a user.

In an actual application scenario, the user 101 may input query content through the interactive interface of the terminal device 102 to trigger the search engine 1021 to perform a corresponding data search process according to the query content.

The query content input by the user can be text content in the form of characters, words, sentences and the like. Alternatively, the content of the query input by the user may be a non-text content such as a picture or voice. At this time, after acquiring the query content input by the user, the search engine 1021 may convert the query content in a non-text form into a corresponding text, for example, may convert an image into characters capable of expressing the image content through image recognition and analysis, or may convert the voice input by the user into characters expressing the semantic meaning of the voice through voice recognition, or the like. In this embodiment, a specific implementation manner of the query content input by the user is not limited.

S302: the search engine 1021 searches among a plurality of documents according to the query content to obtain a target document and interpretability information corresponding to the target document, wherein the interpretability information includes a target matching element corresponding to the target document and a weight corresponding to the target matching element, and the target document is at least one document among the plurality of documents, wherein the relevance between the target document and the query content meets a preset condition.

In general, the search engine 1021 may access a database in which a plurality of documents, such as research documents inside a business, are stored, so that the search engine 1021 may retrieve one or more documents having a higher relevance to the query content from the plurality of documents according to the query content. For the sake of distinction and description, the retrieved document will be referred to as a target document hereinafter.

In a specific implementation, the search engine 1021 may determine the relevance between each of the plurality of documents and the query content, for example by means of keyword matching, and determine from the plurality of documents a target document with higher relevance. The relevance between the target document and the query content can be expressed as a score: the higher the score, the higher the relevance, and the lower the score, the lower the relevance. Since the specific process by which the search engine 1021 retrieves relevant documents according to query content such as keywords is already in use in practical application scenarios, the retrieval process is not described in detail here.

It should be noted that the document in the present embodiment may be a single-modal document, such as a document that only includes text information. Alternatively, the document in the present embodiment may also be a multi-modal document, such as a document including multiple types of information among text, figures, and tables. In this case, the non-text information in the multi-modal document (such as figures and tables) can be converted, by means such as semantic recognition, into text information with the same meaning, and the text information is anchored to the position in the document where the original non-text information is located (such as the corresponding paragraph or chapter).

It is understood that if the search engine 1021 feeds back only the target document, without an explanation of the basis on which the search engine 1021 determined the target document, the user 101 may place little trust in the target document fed back by the search engine 1021. Therefore, in this embodiment, the search engine 1021 also determines interpretability information corresponding to the target document, where the interpretability information can explain the basis on which the search engine 1021 screened out the target document from the plurality of documents. Since the technology by which the search engine 1021 searches for the target document according to the query content is mature, in this embodiment the logic by which the search engine 1021 determines the target document does not need to be changed; the interpretability information is generated for the target document after the target document has been determined.

In one possible implementation, the search engine 1021 may first search a plurality of documents according to the query content, determine relevance scores between the query content and each document, and search, according to the relevance score corresponding to each document, the top M documents (M is a positive integer) with the highest relevance from the plurality of documents, that is, target documents, where the relevance scores of the remaining documents are lower than the relevance score of the target document. Then, the search engine 1021 determines weights corresponding to the candidate matching elements according to the target document and the relevance scores of the target document, wherein a deviation between a score calculated based on the weights corresponding to the candidate matching elements and the relevance score of the target document is smaller than a preset range, and further determines a target matching element and a weight corresponding to the target matching element from the candidate matching elements, and the determined target matching element meets a preset element determination condition. Then, the determined target matching element and the weight corresponding to the target matching element can be used as the interpretability information of the target document.
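The two selection steps in this paragraph (taking the top M documents by relevance score, then keeping only the candidate matching elements that qualify as target matching elements) can be illustrated with a minimal Python sketch. The function names, the example scores, and the use of a small absolute-weight threshold as the "preset element determination condition" are assumptions made for illustration; the text does not specify that condition.

```python
def top_m_documents(scores, m):
    """Return the m (doc_id, score) pairs with the highest relevance score."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:m]


def select_target_elements(weights, min_abs_weight=1e-6):
    """Keep candidate matching elements whose weight is non-negligible.

    The threshold is a stand-in (an assumption) for the preset element
    determination condition, which the text leaves unspecified.
    """
    kept = {name: w for name, w in weights.items() if abs(w) > min_abs_weight}
    # Order the explanation by decreasing absolute weight.
    return dict(sorted(kept.items(), key=lambda item: abs(item[1]), reverse=True))


# Hypothetical relevance scores returned by the underlying search engine.
scores = {"doc_a": 0.42, "doc_b": 0.91, "doc_c": 0.77}
targets = top_m_documents(scores, m=2)

# Hypothetical fitted weights for three candidate matching elements.
weights = {"word_match": 0.6, "synonym_match": 0.0, "timeliness": 0.25}
explanation = select_target_elements(weights)
```

Here `explanation` plays the role of the interpretability information: the surviving matching elements together with their weights.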

Specifically, for any query content $Q$, it is assumed that the query content includes N keywords, such as $Q = [word_1, \dots, word_N]$. Then the search engine 1021 can retrieve, based on the query content $Q$, a document list $D(Q) = [D_1, \dots, D_k, \dots, D_M]$ associated with $Q$, that is, the target documents are retrieved. The documents in the list are not repeated and may be sorted in descending order according to some similarity algorithm, where the nearer a document is to the front of the list, the higher the relevance between that document and the query content $Q$. For the document $D_k$ in the document list (ranked at the k-th position in the list), its relevance score is $Score(Q, D_k)$. Then, an interpretable approximation function $\widehat{Score}(Q, D_k)$ can be constructed; the interpretable approximation function may be constructed based on a plurality of candidate matching elements and a weight corresponding to each candidate matching element. Illustratively, an interpretable approximation function as shown in the following equation (1) may be constructed.

$$\widehat{Score}(Q, D_k) = \sum_{n=1}^{N} w_n \, f_n(Q, D_k) \tag{1}$$

Wherein $Q$ is the query content; $D_k$ is a document in the document list; $f_n(Q, D_k)$ is a candidate matching element; $w_n$ is the weight corresponding to the candidate matching element; N is the number of candidate matching elements and is a positive integer. That is, the constructed interpretable approximation function may be a function that sums all of the candidate matching elements multiplied by their weight values.
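As a minimal sketch of the weighted sum described above, the following Python snippet computes the approximate score from per-element feature values and weights. The feature names and numeric values are invented for illustration only.

```python
def approximate_score(features, weights):
    """Sum of w_n * f_n(Q, D_k) over all candidate matching elements."""
    return sum(weights[name] * value for name, value in features.items())


def contributions(features, weights):
    """Per-element share of the approximate score, usable as an explanation."""
    return {name: weights[name] * value for name, value in features.items()}


# Hypothetical feature values f_n(Q, D_k) for one query/document pair.
features = {"word_match": 0.5, "ngram_match": 0.25, "timeliness": 1.0}
# Hypothetical weights w_n for the same candidate matching elements.
weights = {"word_match": 2.0, "ngram_match": 4.0, "timeliness": 0.5}

score = approximate_score(features, weights)
parts = contributions(features, weights)
```

Because the function is linear, each element's contribution can be read off directly, which is what makes the approximation interpretable.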

In the present embodiment, the following implementation examples of candidate matching elements are provided.

In a first example, the candidate matching elements may be features based on the document and the query content. For example, the candidate matching elements may specifically include one or more features of word matching, n-gram matching, synonym matching, semantic vector matching, topic keyword matching, multi-modal information matching, and metadata attribute matching.

Word matching refers to words that the query content $Q$ and a single document $D_k$ contain in common; the term frequency-inverse document frequency (TF-IDF) of such a word can be used as the feature of the word matching candidate matching element. The term frequency TF refers to the frequency with which the word occurs in the document $D_k$, and the IDF is the inverse of the number of documents in the database that contain the word.

n-gram matching refers to n-gram items that the query content $Q$ and the document $D_k$ contain in common, that is, n consecutive words forming a phrase in a fixed order; the TF-IDF of the n-gram item is used as the feature of the n-gram matching candidate matching element. Illustratively, the value of n may be 2 or 3.
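The word matching and n-gram matching features can be sketched together, since a single word is a 1-gram. The snippet below follows the definitions given in the text (TF as the occurrence count in the document, IDF as the inverse of the number of documents containing the item). Summing TF-IDF over all shared items, the function names, and the toy corpus are assumptions for illustration.

```python
def ngrams(tokens, n):
    """All runs of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf_match(query_tokens, doc_tokens, corpus, n=1):
    """Sum TF * IDF over the n-grams shared by the query and the document.

    TF  = number of occurrences of the n-gram in the document.
    IDF = 1 / (number of corpus documents containing the n-gram).
    """
    doc_grams = ngrams(doc_tokens, n)
    shared = set(ngrams(query_tokens, n)) & set(doc_grams)
    total = 0.0
    for gram in shared:
        tf = doc_grams.count(gram)
        df = sum(1 for other in corpus if gram in set(ngrams(other, n)))
        if df:  # guard against a document that is not part of the corpus
            total += tf / df
    return total


# Toy tokenised corpus (hypothetical).
corpus = [
    ["real", "estate", "stock", "analysis"],
    ["stock", "market", "report"],
    ["real", "estate", "policy"],
]
query = ["real", "estate", "stock"]

word_feature = tf_idf_match(query, corpus[0], corpus, n=1)
bigram_feature = tf_idf_match(query, corpus[0], corpus, n=2)
```

In practice a logarithmic IDF is common; the plain inverse is used here only because it is the form the text describes.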

Synonym matching means that synonyms of the words or n-gram items in the query content $Q$ are determined according to a synonym dictionary and searched for in the document $D_k$; if a synonym appears in the document $D_k$, the TF-IDF of the synonym can be used as the feature of the synonym matching candidate matching element. The synonym dictionary may be a general-domain synonym dictionary based on HowNet or the like, or may be a domain synonym dictionary mined from texts in a single domain (e.g., the financial domain).

Semantic vector matching means that other words or phrases $word^{sim}$ with high semantic similarity to the words or n-gram items in the query content $Q$ (e.g., semantic similarity greater than 0.9) are determined according to word-level semantic vector representations; if $word^{sim}$ appears in the document $D_k$, the product of the TF-IDF of $word^{sim}$ and the semantic similarity sim can be used as the feature of the semantic vector matching candidate matching element. The semantic similarity sim refers to the semantic similarity between the word or n-gram item in the query content $Q$ and $word^{sim}$. Illustratively, word-level semantic vectors can be inferred based on deep learning models such as word2vec, GloVe, BERT, and GPT.
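A minimal sketch of the semantic vector matching feature follows. It assumes cosine similarity as the semantic similarity measure and uses tiny hand-written two-dimensional vectors in place of embeddings from a model such as word2vec; the function names, vectors, and TF-IDF values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_match(query_word, doc_tf_idf, vectors, threshold=0.9):
    """Sum TF-IDF * sim over document words semantically close to query_word."""
    total = 0.0
    query_vec = vectors[query_word]
    for word, tf_idf in doc_tf_idf.items():
        if word == query_word or word not in vectors:
            continue  # exact matches belong to the word matching element
        sim = cosine(query_vec, vectors[word])
        if sim > threshold:
            total += tf_idf * sim
    return total


# Hypothetical word vectors; real ones would come from a trained model.
vectors = {
    "capital": [1.0, 0.0],
    "beijing": [0.96, 0.28],
    "weather": [0.0, 1.0],
}
# Hypothetical TF-IDF values of words appearing in the document.
doc_tf_idf = {"beijing": 0.5, "weather": 0.3}

feature = semantic_match("capital", doc_tf_idf, vectors)
```

Only "beijing" clears the 0.9 threshold here, so the feature is its TF-IDF scaled by the similarity.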

Topic keyword matching means that the query content $Q$ matches topic keywords of the document $D_k$; the matching result can be used as the feature of the topic keyword matching candidate matching element. The topic keywords of the document $D_k$ can be determined based on a topic model such as latent Dirichlet allocation (LDA), and may also be determined in combination with manual annotation results.

Multimodal information matching means that the query content $Q$ matches multi-modal attributes of the document $D_k$, in which case the probability of the matched multi-modal attribute is used as the feature of the multi-modal information matching candidate matching element. When the document $D_k$ is a multi-modal document, the multi-modal attributes in the document $D_k$ can be determined by analyzing and understanding its content, for example the fine-grained category of a table or picture (balance sheet, market trend chart, etc.), the associated subject (the corresponding stock, the corresponding industry, etc.), the specific indicator definitions involved in a table, and keywords describing the content of each picture; the probability corresponding to each multi-modal attribute, such as its occurrence probability, can then be determined.

Metadata attribute matching means that the query content $Q$ matches metadata attributes of the document $D_k$, in which case the intention probability corresponding to the metadata attribute is used as the feature of the metadata attribute matching candidate matching element. The metadata attributes (metadata) of the document $D_k$ may be, for example, the type to which the document content belongs (industry analysis, individual stock analysis, morning briefing, etc.), author information, publication time, the industry to which it belongs, and so on. When the query content $Q$ matches the intention of one or more metadata attribute dimensions, the intention probability is used as the feature of the metadata attribute matching candidate matching element. For example, suppose the query content $Q$ is "individual stock analysis in the real estate industry"; intention recognition finds that it matches the intentions of the "industry" and "research report type" metadata attributes of the document $D_k$, for example <industry: real estate industry, probability = 0.9> and <research report type: individual stock analysis, probability = 0.85>. The intention probabilities corresponding to these two metadata attributes are then taken as features of the candidate matching element.
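The metadata example above can be sketched as follows. The intent recognition step itself is not shown; its output is assumed to be a list of (attribute, value, probability) triples, and the attribute names and the rule of assigning 0.0 to non-matching attributes are illustrative assumptions.

```python
def metadata_features(intents, doc_metadata):
    """Map each recognised attribute to its intent probability when the
    document's metadata carries the same value, and to 0.0 otherwise."""
    return {
        attr: (prob if doc_metadata.get(attr) == value else 0.0)
        for attr, value, prob in intents
    }


# Assumed output of intent recognition for the query
# "individual stock analysis in the real estate industry".
intents = [
    ("industry", "real estate", 0.9),
    ("report_type", "individual stock analysis", 0.85),
]

# Hypothetical metadata of two documents.
matching_doc = {"industry": "real estate", "report_type": "individual stock analysis"}
partial_doc = {"industry": "real estate", "report_type": "morning briefing"}

full = metadata_features(intents, matching_doc)
partial = metadata_features(intents, partial_doc)
```

A document that matches both recognised intents receives both probabilities as features, while a partially matching document receives only the one that applies.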

In a second example, the candidate matching elements may be features of the document $D_k$ itself. For example, the candidate matching elements may include one or more features among the full-text length of the document, non-text modal data included in the document, timeliness data of the document, and historical access data of the document.

The full-text length of the document can include one or more dimensions of features for measurement, such as the number of characters, the number of pages of the document, and the like.

The non-text modal data included in the document refers to the non-text content contained in the document $D_k$, such as the number of tables (which may be the total number or the number in specified categories) and the number of pictures.

The timeliness data of the document is used for measuring the timeliness of the document $D_k$. For example, the negative exponential function $e^{-(t_s - t_p)}$ of the interval between the current search date $t_s$ and the document release date $t_p$ can be used as the feature; the larger the time interval, the smaller the feature value.

Historical access data of the document is used for measuring how often the document $D_k$ has been accessed, and may be, for example, the historical click rate of the document $D_k$.
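The timeliness feature can be sketched in a few lines of Python. The text does not state the unit of the time interval, so this sketch assumes days and adds a hypothetical `scale_days` parameter to control how quickly the value decays; with `scale_days=1.0` it reduces to the negative exponential of the interval in days.

```python
import math
from datetime import date


def timeliness(search_date, publish_date, scale_days=1.0):
    """Negative exponential of the interval between search and release dates.

    scale_days is an assumed tuning parameter, not part of the original text.
    """
    interval = (search_date - publish_date).days / scale_days
    return math.exp(-interval)


# A document released on the search date versus one released 30 days earlier.
fresh = timeliness(date(2023, 1, 20), date(2023, 1, 20))
stale = timeliness(date(2023, 1, 20), date(2022, 12, 21), scale_days=30.0)
```

Without some scaling, an interval measured in days drives the feature to nearly zero within about a week, which is why a scale parameter is included in the sketch.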

In an actual application scenario, the candidate matching elements may be any of the various implementation examples described above, or may be other applicable features, which is not limited in this embodiment.

Then, for the query content $Q$ and the document list $D(Q) = [D_1, \dots, D_k, \dots, D_M]$, a score is calculated for each document in the document list $D(Q)$ based on the above plurality of candidate matching elements, and the weight values of the plurality of candidate matching elements are solved so that the deviation between the score given by the interpretable approximation function and the relevance score of each document in $D(Q)$ is as small as possible.

Specifically, a loss function as shown in the following formula (2) may be constructed:

$$Loss(W^Q) = \sum_{k=1}^{M} \left( Score(Q, D_k) - \widehat{Score}(Q, D_k) \right)^2 + \mu \lVert W^Q \rVert_1 \tag{2}$$

wherein $Loss(W^Q)$ is the loss function; $W^Q$ is the vector formed by the weight values corresponding to the plurality of candidate matching elements; $Score(Q, D_k)$ is the relevance score between the query content $Q$ and the document $D_k$; $\widehat{Score}(Q, D_k)$ is the score calculated based on the interpretable approximation function; $\lVert W^Q \rVert_1$ represents the $L_1$ norm of the weight vector $W^Q$. The $L_1$ norm is chosen because it induces sparse feature coefficients, so that the search explanation is given based on the few features whose coefficients are not 0. $\mu > 0$ is a regularization coefficient, and the larger its value, the sparser the obtained weight vector. In practical application, the sparser the weight vector, the more strongly the candidate matching elements corresponding to the remaining non-zero weights account for the relevance of the document, that is, the more suitable they are as interpretability information of the document.
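Fitting the weights under an $L_1$ penalty can be sketched in pure Python. The sketch assumes a squared-error deviation term and solves the problem by cyclic coordinate descent with soft thresholding; both choices are assumptions made for illustration, since the text does not name a solver. The feature matrix and scores are toy values.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold, clamping at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_sparse_weights(features, scores, mu, sweeps=50):
    """Minimise sum_k (score_k - sum_n w_n * f_kn)^2 + mu * sum_n |w_n|.

    features[k][n] is the value of candidate matching element n for document k.
    Uses cyclic coordinate descent (an assumed solver, not from the text).
    """
    n_docs, n_feats = len(features), len(features[0])
    w = [0.0] * n_feats
    for _ in range(sweeps):
        for n in range(n_feats):
            z = sum(features[k][n] ** 2 for k in range(n_docs))
            if z == 0.0:
                continue  # element never fires, so its weight stays 0
            rho = 0.0
            for k in range(n_docs):
                # Prediction for document k without element n.
                partial = sum(
                    w[m] * features[k][m] for m in range(n_feats) if m != n
                )
                rho += features[k][n] * (scores[k] - partial)
            w[n] = soft_threshold(rho, mu / 2.0) / z
    return w


# Toy data: 4 documents, 3 candidate matching elements (hypothetical values).
features = [
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
scores = [2.0, 2.0, 0.05, 1.0]

weights = fit_sparse_weights(features, scores, mu=0.2)
```

In this toy case the second element explains only a tiny part of the scores, so the penalty drives its weight to exactly zero and it drops out of the explanation.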

In other possible embodiments, in an actual application scenario the search engine 1021 may also use natural language processing (NLP) techniques such as query rewriting (Query Rewrite) and semantic matching models (Semantic Matching) to perform "fuzzy search". The search engine 1021 may therefore further determine a neighborhood of the query content $Q$, for example through search logs, result lists, and the like in the search engine 1021, and obtain a set $N(Q)$ that includes the query content $Q$ and other query contents fuzzy-matched with it, where the query content $Q$ itself is $Q_0$ in the set $N(Q)$. For example, assuming that the query content input by the user is "capital", the search engine 1021 can retrieve not only documents that include the keyword "capital" but also documents that include the keyword "Beijing" (but not "capital"). Therefore, the search engine 1021 may generate the set $N(Q)$ = {"capital", "Beijing"} by fuzzy matching or the like.

Then, the search engine 1021 may calculate scores between the plurality of query contents in the set N (Q) and the respective documents in the document list D (Q) based on the interpretable approximation function, and calculate weight values of the plurality of candidate matching elements when a deviation between the scores and relevance scores of the documents is minimized. At this time, a loss function shown in the following formula (3) may be constructed:

wherein $Loss(W^Q)$ is the loss function; $W^Q$ is the vector formed by the weight values corresponding to the plurality of candidate matching elements; $Q_j$ is the $j$-th query content in the set $N(Q)$; $D_k$ is the $k$-th document in the document list $D(Q)$, against which the query content $Q_j$ is scored; $Score(Q_j, D_k)$ is the relevance score between the query content $Q_j$ and the document $D_k$; the corresponding term for $(Q_j, D_k)$ is the score calculated based on the interpretable approximation function; $\|W^Q\|_1$ denotes the $L_1$ norm of the weight vector $W^Q$; and $\mu > 0$ is the regularization coefficient. In addition, $\pi(Q, Q_j) \ge 0$ is the weight of $Q_j$, and the higher the textual similarity between $Q_j$ and $Q$, the greater the weight. Illustratively, the weight between $Q_j$ and $Q$ can be calculated by using a Gaussian kernel function, as shown in the following formula (4):

wherein $distance(Q, Q_j)$ is the distance between $Q_j$ and $Q$, which may be chosen, for example, as the cosine distance between the word vectors of the two. The neighbors of $Q$ can be found quickly with vector indexing tools such as Faiss or HNSW to obtain these distances; alternatively, locality-sensitive hashing algorithms commonly used for text deduplication, such as SimHash or MinHash, can be used to look up the bucket containing $Q$ and its adjacent buckets to obtain the distances. $\sigma$ is a coefficient that controls the range of action of the Gaussian kernel function; the larger its value, the larger the local range of influence of the Gaussian kernel function.
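The neighborhood weighting can be illustrated with a short sketch. The exact form of formula (4) is not reproduced here; the code assumes the common Gaussian kernel $\exp(-distance^2 / \sigma^2)$ and a brute-force cosine distance in place of an index such as Faiss. The vectors and the names `Q1` and `Q2` are hypothetical.

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine distance 1 - cos(u, v) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def gaussian_kernel_weight(distance: float, sigma: float) -> float:
    """Weight exp(-distance^2 / sigma^2); an assumed form of the Gaussian kernel."""
    return math.exp(-(distance ** 2) / (sigma ** 2))


# Hypothetical word vectors for Q and two neighbouring queries.
q = [1.0, 0.0]
neighbours = {"Q1": [1.0, 0.0], "Q2": [0.0, 1.0]}
weights = {name: gaussian_kernel_weight(cosine_distance(q, vec), sigma=0.5)
           for name, vec in neighbours.items()}
```

A query identical to $Q$ receives weight 1, and the weight falls off as the distance grows; a larger `sigma` lets more distant queries contribute to the fit.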

Then, the search engine 1021 may solve for the weights of the candidate matching elements at which the loss function reaches its minimum value, which yields an optimization problem over the weight vector $W^Q$ of the candidate matching elements. Taking the $W^Q$ that minimizes the loss function shown in formula (3) as an example, the optimization problem is shown in the following formula (5):

$$W^Q = \underset{W \in \mathbb{R}^N}{\arg\min}\; Loss(W)$$

wherein $\mathbb{R}^N$ is the $N$-dimensional Euclidean space.

For a given query $Q$, $\pi(Q, Q_j)$, $Score(Q_j, D_k)$ and $f_n(Q_j, D_k)$ are all non-negative constants in the optimization problem. Therefore, $\pi(Q, Q_j)$ is abbreviated as $\pi_j$, $Score(Q_j, D_k)$ is abbreviated as $S_{j,k}$, and $f_n(Q_j, D_k)$ is abbreviated in the same indexed manner, so that the optimization problem can be abbreviated as shown in the following formula (6):

wherein the weighted squared-deviation term is a quadratic function, and the $L_1$ regularization term is a non-differentiable convex function.

Then, through the variable decoupling technique, the unconstrained optimization problem shown in the above formula (6) can be converted into an equality-constrained optimization problem, as shown in the following formula (7):

where "s.t." (subject to) introduces the constraint to be satisfied, i.e. the constraint $V - W = 0$.

Thus, an augmented Lagrangian function can be defined as shown in formula (8) below:

wherein the $N$-dimensional variable $Y$ is the Lagrange multiplier, and $\rho > 0$ is the penalty term coefficient.
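The splitting and the augmented Lagrangian described here follow the standard ADMM pattern. As a sketch, writing $g$ for the quadratic data-fitting term and $h(W) = \mu \lVert W \rVert_1$ for the regularizer (the names $g$ and $h$, and the assignment of $V$ to the quadratic term, are assumptions made for illustration), formulas (7) and (8) would read roughly:

```latex
\min_{V,\, W \in \mathbb{R}^N} \; g(V) + h(W) \qquad \text{s.t.} \quad V - W = 0

L_\rho(V, W, Y) \;=\; g(V) + h(W) + Y^{\top}(V - W) + \frac{\rho}{2}\, \lVert V - W \rVert_2^2
```

Separating the smooth term from the non-differentiable term in this way lets each sub-problem be minimized in closed form.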

Then, the search engine 1021 may use an Alternating Direction Method of Multipliers (ADMM) algorithm framework to alternately solve the optimization problem, as shown in the following formula (9):

From basic mathematical derivation, the above optimization problem has an explicit solution, as shown in formula (10) below:

The search engine 1021 performs alternating optimization based on the ADMM algorithm. When $V^{k+1}$, $W^{k+1}$, $Y^{k+1}$ are sufficiently close to $V^k$, $W^k$, $Y^k$ respectively, or when the number of iterations is large enough (e.g., the iterations may be limited to 50), the feature weight $W^{k+1}$ is returned as the estimate of $W^Q$. In this way, the search engine 1021 can calculate the weights corresponding to the multiple candidate matching elements.
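The alternating updates can be sketched with NumPy as follows. This is an illustrative solver, not the patent's exact formulas (9) and (10): it assumes the loss is the weighted squared error $\sum_i \pi_i (S_i - F_i \cdot W)^2 + \mu \lVert W \rVert_1$, with one row of the hypothetical matrix `F` per $(Q_j, D_k)$ pair, and it uses the standard closed-form ADMM steps for that objective.

```python
import numpy as np


def soft_threshold(x: np.ndarray, tau: float) -> np.ndarray:
    """Element-wise soft-thresholding, the proximal operator of tau * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)


def admm_l1_weights(F, S, pi, mu=0.1, rho=1.0, max_iter=50, tol=1e-6):
    """Minimise sum_i pi_i * (S_i - F_i . W)^2 + mu * ||W||_1 by ADMM.

    F:  (samples, N) values of the candidate matching elements.
    S:  (samples,) relevance scores returned by the search engine.
    pi: (samples,) non-negative neighbourhood weights.
    """
    F, S, pi = np.asarray(F, float), np.asarray(S, float), np.asarray(pi, float)
    n = F.shape[1]
    A = 2.0 * F.T @ (pi[:, None] * F) + rho * np.eye(n)  # fixed matrix of the V-update
    b = 2.0 * F.T @ (pi * S)
    V, W, Y = np.zeros(n), np.zeros(n), np.zeros(n)
    for _ in range(max_iter):
        V_new = np.linalg.solve(A, b - Y + rho * W)        # quadratic sub-problem
        W_new = soft_threshold(V_new + Y / rho, mu / rho)  # L1 sub-problem
        Y_new = Y + rho * (V_new - W_new)                  # multiplier update
        change = max(np.abs(V_new - V).max(), np.abs(W_new - W).max(),
                     np.abs(Y_new - Y).max())
        V, W, Y = V_new, W_new, Y_new
        if change < tol:  # iterates are "close enough"
            break
    return W


# Toy data: the score depends only on the first of three candidate matching elements.
F = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
S = [2, 0, 0, 2]
W = admm_l1_weights(F, S, pi=[1, 1, 1, 1], mu=0.1, rho=1.0)
```

On this toy data only the first weight stays away from zero, which is the sparsity that the $L_1$ term is meant to produce. The default `max_iter=50` mirrors the iteration cap mentioned in the text.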

Then, the search engine 1021 may determine a target matching element, which is interpretable information of the target document, and a weight corresponding to the target matching element from the plurality of candidate matching elements. The following two implementation examples are provided in this embodiment:

Example one: the search engine 1021 may calculate the contribution degree constraint(n) = $w_n f_n(Q, D_k)$ corresponding to each candidate matching element according to the candidate matching elements and their corresponding weights, and select the K largest contribution degrees from among the contribution degrees of the plurality of candidate matching elements. The K candidate matching elements corresponding to those K contribution degrees are determined as the target matching elements serving as interpretability information, and the weights corresponding to the target matching elements are determined accordingly.

Example two: the search engine 1021 may calculate the contribution degree constraint(n) = $w_n f_n(Q, D_k)$ corresponding to each candidate matching element according to the candidate matching elements and their corresponding weights. Then, the search engine 1021 may sort the contribution degrees in descending order of absolute value to obtain a reordering of the feature vector. Then, the search engine 1021 may use the following formula (11) to calculate the minimum value of K for which the similarity difference approaches $1 + \varepsilon$, so that the search engine 1021 may determine the candidate matching elements corresponding to the top K entries of the reordered feature vector as the target matching elements serving as interpretability information, and further determine the weights corresponding to the target matching elements.

where $\varepsilon > 0$ is a constant and may be, for example, 0.1.
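Example one above can be sketched as a simple top-K selection over the contribution degrees $w_n f_n(Q, D_k)$. The element names, weights and feature values below are hypothetical and serve only to illustrate the selection; Example two is not covered because formula (11) is not reproduced here.

```python
from typing import Dict, List, Tuple


def top_k_matching_elements(weights: Dict[str, float],
                            features: Dict[str, float],
                            k: int) -> List[Tuple[str, float, float]]:
    """Return up to k (element, weight, contribution) triples with the largest w_n * f_n."""
    contributions = [(name, w, w * features[name]) for name, w in weights.items()]
    contributions.sort(key=lambda item: item[2], reverse=True)
    return contributions[:k]


# Hypothetical weights and feature values for one (Q, D_k) pair.
weights = {"word_match": 0.8, "synonym_match": 0.3, "timeliness": 0.0, "click_rate": 0.5}
features = {"word_match": 1.0, "synonym_match": 0.5, "timeliness": 0.9, "click_rate": 0.2}
explanation = top_k_matching_elements(weights, features, k=2)
```

Each returned triple carries both the element and its weight, which is the pair that makes up the interpretability information.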

S303: After determining the target document and the interpretability information corresponding to the target document, the search engine 1021 outputs the target document and the interpretability information.

In this way, the terminal device 102 may present the target document and the interpretability information output by the search engine 1021 to the user 101, so that the user 101 can view the interpretability information corresponding to the target document while viewing the target document. The user 101 can thus determine, from the interpretability information, the basis on which the search engine 1021 fed back the target document, which improves the user 101's trust in the target documents fed back by the search engine 1021 and improves the user's search experience.

Illustratively, the terminal device 102 may present an interactive interface as shown in fig. 4, which may include related information of a plurality of documents, such as names or content summaries of the documents. Moreover, for each document, a preview button for the document may also be presented in the interactive interface, so that after the user 101 clicks the preview button, the interpretability information corresponding to the document may be viewed in a new pop-up window. Of course, the presentation manner shown in fig. 4 is only an exemplary illustration, and in practical applications, the terminal device 102 may also present the target document and the interpretability information thereof in other manners.

In a further possible implementation, the terminal device 102 may also enable the user 101 to re-rank or re-screen the target documents based on the presented interpretability information. In a specific implementation, after viewing the interpretability information, the user 101 may consider some of the target matching elements to be more important for document retrieval than the rest, and may then select those elements from the presented target matching elements. In this way, the search engine 1021 may re-rank the target documents based on the target matching elements selected by the user and present the re-ranked target documents to the user 101 on the terminal device 102. Alternatively, the search engine 1021 may re-screen the target documents based on the target matching elements selected by the user, and present the re-screened target documents to the user 101 on the terminal device 102 in either the ranking used before re-screening or a new ranking, so that the user 101 can customize the document retrieval, which further improves the retrieval experience of the user 101.

In practical applications, when the user 101 is a developer or a technician, the user 101 may further analyze, based on the interpretability information presented by the terminal device 102, whether the interpretability information contains an incorrect matching element or omits a matching element. When there is an incorrect or missing matching element, the user 101 may correct the matching element on the terminal device 102, or configure a new matching element for the search engine 1021, and trigger the search engine 1021 to re-interpret the retrieved target document. The search engine 1021 may then regenerate the interpretability information for the target document based on the matching elements modified or added by the user 101 and present the updated interpretability information to the user 101 on the terminal device 102, so that the user 101 can determine whether the explanation given by the search engine 1021 for retrieving and feeding back the target document meets expectations, and so that more reasonable and accurate retrieval explanations can subsequently be provided to the user 101 or other users.

Document content segments with a higher degree of relevance to the query input provided by the user 101 may also be presented in this embodiment. For this reason, the present embodiment may further include the following steps:

S304: The search engine 1021 determines a target segment in the target document according to the interpretability information, wherein the degree of matching between the target segment and the query content is higher than the degree of matching between the rest of the segments in the target document and the query content.

S305: the search engine 1021 outputs the target segment.

As an implementation example, for document $D_k$, the search engine 1021 may use a sliding window of preset length and width and scan from the first page of $D_k$ toward the end of the document, each slide producing a new sliding window region, as shown in fig. 5. During each slide, the search engine 1021 computes an interpretability score for the window region. The higher the interpretability score, the stronger the association between the document content within the sliding window region (which may include text, figures, tables, etc.) and the query input.

For example, for each window region $Windows(p)$, the search engine 1021 may determine which of the $K$ target matching elements (or interpretable features) are matched within the coordinate range of the sliding window.

Then, the search engine 1021 calculates the interpretability score of each sliding window region $Windows(p)$ according to the following formula (12):

wherein $rate(f_n, p)$ represents the proportion of all occurrences of the interpretable feature $f_n$ in the document $D_k$ that fall within the sliding window $p$ of $D_k$.

In this way, the search engine 1021 can select a preset number of sliding window regions with the highest interpretability scores and output the document content in the selected sliding window regions as the target segment.
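The window scoring can be sketched as follows. Since formula (12) is not reproduced here, the code assumes the score of a window is the weighted sum $\sum_n w_n \cdot rate(f_n, p)$ over the target matching elements, and it simplifies the two-dimensional page window to a one-dimensional window over token positions. The occurrence positions and weights are hypothetical.

```python
from typing import Dict, List, Tuple


def window_scores(occurrences: Dict[str, List[int]], weights: Dict[str, float],
                  doc_length: int, size: int, stride: int) -> List[Tuple[int, float]]:
    """Score each window [start, start + size) as sum_n w_n * rate(f_n, p)."""
    scores = []
    for start in range(0, max(doc_length - size, 0) + 1, stride):
        score = 0.0
        for name, positions in occurrences.items():
            if not positions:
                continue  # the feature never occurs, so its rate is taken as 0
            inside = sum(1 for pos in positions if start <= pos < start + size)
            score += weights.get(name, 0.0) * inside / len(positions)
        scores.append((start, score))
    return scores


def top_windows(scores: List[Tuple[int, float]], count: int) -> List[Tuple[int, float]]:
    """Return the `count` windows with the highest interpretability scores."""
    return sorted(scores, key=lambda item: item[1], reverse=True)[:count]


# Hypothetical token positions at which each target matching element is matched.
occurrences = {"word_match": [2, 3, 40], "synonym_match": [4]}
weights = {"word_match": 0.8, "synonym_match": 0.3}
best = top_windows(window_scores(occurrences, weights, doc_length=50, size=10, stride=10), 1)
```

Here the first window contains two of the three word matches and the only synonym match, so it is selected as the target segment.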

Further, the terminal device 102 may present the target segment output by the search engine 1021 to the user 101. For example, after the user 101 clicks a preview button corresponding to a certain document on the display interface of the terminal device 102, the target segment content in the document may be displayed on the display interface, so that the user 101 can conveniently and quickly locate the content with higher relevance to the query content in the document, and thus the search experience of the user can be further improved.

In this embodiment, the document searching method performed by the search engine 1021 is taken as an example for illustration, but in other embodiments, the document searching method may be performed by another entity independent of the search engine 1021, such as a separately configured processor in the terminal device 102 or an interpretable engine, and the like, which is not limited thereto.

In a further possible embodiment, some of the plurality of documents stored in the database may contain predictions made by researchers, such as a prediction of the change in a stock price over a period of time or a prediction of the output of a certain product. In this case, in addition to feeding back the target document and the interpretability information to the user, the search engine 1021 (or another entity independent of the search engine 1021) may also evaluate the prediction capability of a researcher (or other object), so as to assess the researcher's prediction accuracy and, in turn, the researcher's business capability.

Specifically, the search engine 1021 may determine, from the plurality of documents stored in the database, the prediction documents related to the object to be evaluated, where a prediction document records the prediction data of the object to be evaluated. The object to be evaluated may be, for example, the aforementioned researcher, or may be a research and development department, or may be an AI model, and the like, which is not limited in this embodiment. Then, the search engine 1021 may obtain actual data matching the predicted content in the prediction document; for example, if the predicted content is the change in the price of stock XX within one week, the actual data is the real change in the price of stock XX within that week. The search engine 1021 may then determine the prediction error rate corresponding to the object to be evaluated according to the predicted content and the actual data, and generate evaluation information for the object to be evaluated according to the prediction error rate and the number of prediction documents.

Taking a researcher as the object to be evaluated as an example, if the prediction document is a research report on an individual stock, the list of real values of all numerical indicators can be recorded as $[value_1, value_2, \dots]$. The predicted values given by researcher $P_1$ for the numerical indicators are recorded as $[predict_1(P_1), predict_2(P_1), \dots]$; for a numerical indicator not covered by the prediction, the corresponding position is marked and counted into the set $\varnothing$. Then, the error rate of the numerical indicators can be defined using the following formula (13):

wherein $predict_i(P_1)$ is the value predicted by researcher $P_1$ for the $i$-th numerical indicator; $value_i$ is the actual value of the $i$-th numerical indicator; $count(\{i \mid predict_i(P_1) \in \varnothing\})$ is the number of numerical indicators not predicted by the researcher; and the parameter $\theta > 0$ is a coverage penalty term: the larger its value, the less tolerant the measure is of researcher $P_1$'s incomplete coverage of the numerical indicators.
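One way to picture such an error rate is the following sketch. The exact form of formula (13) is not reproduced here, so the code assumes a mean relative error over the indicators that were predicted, plus a penalty of `theta` for each indicator left unpredicted; unpredicted positions are represented by `None`, and actual values are assumed to be non-zero.

```python
from typing import List, Optional


def numeric_error_rate(predicted: List[Optional[float]], actual: List[float],
                       theta: float = 0.1) -> float:
    """Mean relative error over predicted indicators plus theta per unpredicted one (assumed form)."""
    covered = [(p, v) for p, v in zip(predicted, actual) if p is not None]
    missing = len(predicted) - len(covered)
    mean_error = (sum(abs(p - v) / abs(v) for p, v in covered) / len(covered)
                  if covered else 0.0)
    return mean_error + theta * missing


# Hypothetical report: two indicators predicted with 10% error each, one left unpredicted.
rate = numeric_error_rate([110.0, None, 45.0], [100.0, 20.0, 50.0], theta=0.1)
```

Raising `theta` penalizes incomplete coverage more heavily, which matches the role described for the coverage penalty term.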

Moreover, if the prediction document contains enumerated indicators, their values (assuming $K$ possible values) need to be mapped into a set of positive integers, such as $[1, 2, \dots, K]$, and the monotonicity of the enumerated indicator must be preserved by the mapping. For example, the enumerated indicator [buy, hold, sell] can be mapped to [1, 2, 3] rather than to some other arbitrary permutation of {1, 2, 3}. The list of real values of all enumerated indicators and the list of predicted values given by researcher $P_1$ are recorded in the same way as for the numerical indicators, with the values of both the real indicators and the predicted indicators mapped into the positive integer set. Then, the error rate of the enumerated indicators can be defined using the following formula (14):

wherein $K_j$ is a positive integer representing the number of possible values of the $j$-th enumerated indicator.
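The monotone mapping of enumerated indicators can be sketched as below. The mapping itself follows the text; the per-indicator error measure, a rank distance normalized by $K - 1$, is an assumption standing in for formula (14), and it requires $K \ge 2$.

```python
from typing import Dict, List


def monotone_mapping(ordered_values: List[str]) -> Dict[str, int]:
    """Map enumerated values, listed in their natural order, to 1..K."""
    return {value: rank for rank, value in enumerate(ordered_values, start=1)}


def enumerated_error(predicted: str, actual: str, mapping: Dict[str, int]) -> float:
    """Normalised rank distance |pred - actual| / (K - 1); an assumed error measure."""
    k = len(mapping)
    return abs(mapping[predicted] - mapping[actual]) / (k - 1)


rating = monotone_mapping(["buy", "hold", "sell"])
error = enumerated_error("buy", "sell", rating)
```

Because the mapping preserves order, predicting "buy" when the outcome is "sell" counts as a larger error than predicting "hold", which an arbitrary permutation of {1, 2, 3} would not guarantee.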

Thus, the search engine 1021 can obtain the overall error rate of researcher $P_1$ for this document:

Then, for the multiple prediction documents of researcher $P_1$, the search engine 1021 may calculate the error rate of each document and obtain the total error rate by weighted summation or averaging, thereby quantitatively assessing the accuracy of the research of researcher $P_1$ over a period of time; a lower total error rate indicates a higher research level (in terms of prediction accuracy).

Since the prediction error rate mainly reflects the prediction accuracy of researcher $P_1$, the search engine 1021, when evaluating researcher $P_1$, may also take into account the number of prediction documents produced by researcher $P_1$, so as to avoid compromising the fairness and impartiality of the evaluation by relying on too few evaluation dimensions.

Specifically, the output and the error rate of the researcher may each be monotonically mapped and normalized to an index with a value in (0, 1), where the monotonic normalization function may be, for example, the hyperbolic tangent function shown in the following formula (16):

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

In other embodiments, the predicted contents of researcher $P_1$ may instead be sorted and then mapped piecewise, and so on.
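The normalization step can be sketched in a few lines. The `scale` parameter is hypothetical: the source does not say whether the raw values are rescaled before the hyperbolic tangent is applied, so it is included here only to show how the saturation point could be tuned.

```python
import math


def tanh_normalize(x: float, scale: float = 1.0) -> float:
    """Map a non-negative value monotonically into [0, 1) using tanh(x / scale)."""
    return math.tanh(x / scale)


# Hypothetical outputs: 3 and 12 prediction documents, with saturation tuned around 10 documents.
low_output = tanh_normalize(3, scale=10.0)
high_output = tanh_normalize(12, scale=10.0)
```

The mapping preserves order, so a researcher with more prediction documents always receives a higher output index, while very large counts saturate near 1.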

Thus, based on formulas (14) to (16), the following formula (17) can be used to calculate the total score for researcher $P_1$:

wherein $stdProduction(P_1)$ is the output index of researcher $P_1$, with a value between 0 and 1; $stdErrorRate(P_1)$ is the comprehensive error rate index of researcher $P_1$, with a value between 0 and 1; and $Score(P_1)$ is the composite score of researcher $P_1$.

It should be noted that the above manner in which the search engine 1021 generates evaluation information for an object to be evaluated is only an implementation example. In other possible implementations, the search engine 1021 may also generate the evaluation information according to the prediction error rate alone, or according to other applicable reference information. Alternatively, for researchers who mainly make long-term forecasts, the error rate may be multiplied by a factor less than 1 to account for the effect of years of service, so as to balance the influence of the different lengths of service of different researchers. Alternatively, when the same content in the same prediction document is predicted jointly by a plurality of researchers, the error rate attributed to each researcher may be adjusted by the Shapley value method or another method; this embodiment is not limited thereto.

In addition, the embodiment of the application also provides a document searching device. Referring to fig. 6, fig. 6 is a schematic structural diagram illustrating a document searching apparatus according to an embodiment of the present application, and the document searching apparatus 600 shown in fig. 6 is applied to a search engine, such as the search engine 1021 in the previous embodiment, and the document searching apparatus 600 includes:

an obtaining module 601, configured to obtain query content input by a user;

a searching module 602, configured to search, according to the query content, in multiple documents to obtain a target document and interpretability information corresponding to the target document, where the interpretability information includes a target matching element corresponding to the target document and a weight corresponding to the target matching element, and the target document is at least one document, of the multiple documents, where a correlation between the target document and the query content meets a preset condition;

an output module 603, configured to output the target document and the interpretability information.

In a possible implementation, the search module 602 is specifically configured to:

In a possible implementation manner, the search module 602 is further configured to determine a target segment in the target document according to the interpretability information, where a degree of matching between the target segment and the query content is higher than a degree of matching between the remaining segments in the target document and the query content;

the output module 603 is further configured to output the target segment.

the search module 602 is further configured to:

It should be noted that, since the information interaction and execution processes between the modules and units of the above apparatus are based on the same concept as the method embodiments of the present application, they bring the same technical effects as the method embodiments. For details, reference may be made to the descriptions in the foregoing method embodiments, which are not repeated here.

In addition, an embodiment of the present application further provides a computing device. Referring to fig. 7, fig. 7 is a schematic diagram of the hardware structure of a computing device in an embodiment of the present application; the computing device 700 may include a processor 701 and a memory 702.

Wherein, the memory 702 is used for storing computer programs;

the processor 701 is configured to execute the following steps according to the computer program:

acquiring query content input by a user;

and outputting the target document and the interpretability information.

The processor 701 may be a CPU, or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and the like. The general-purpose processor may be a microprocessor or any conventional processor or the like.

The memory 702 may be, for example, volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which acts as an external cache. By way of example, but not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DR RAM).

In a possible implementation, the processor 701 is specifically configured to execute the following steps according to the computer program:

In one possible implementation, the plurality of candidate matching elements includes any of word matching, n-gram matching, synonym matching, semantic vector matching, topic keyword matching, multi-modal information matching, metadata attribute matching, document full-text length, non-textual modal data included with the document, time-sensitive data of the document, and historical access data of the document.

In a possible implementation, the processor 701 is further configured to perform the following steps according to the computer program:

and outputting the target segment.

In a possible implementation manner, the plurality of documents includes a prediction document, and the prediction document records therein prediction data of an object to be evaluated, and the processor 701 is further configured to execute the following steps according to the computer program:

In addition, an embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium is used to store a computer program, and the computer program is used to execute the document searching method in the foregoing method embodiment.

As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps in the above embodiment methods can be implemented by software plus a general hardware platform. Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a storage medium, such as a read-only memory (ROM)/RAM, a magnetic disk, an optical disk, or the like, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network communication device such as a router) to execute the method according to the embodiments or some parts of the embodiments of the present application.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, it is relatively simple to describe, and reference may be made to some descriptions of the method embodiment for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the modules described as separate parts may or may not be physically separate, and the parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the goal of the solution of the embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

The above description is only an exemplary embodiment of the present application, and is not intended to limit the scope of the present application.

Claims

1. A document searching method is applied to a search engine, and the method comprises the following steps:

acquiring query content input by a user;

and outputting the target document and the interpretability information.

2. The method of claim 1, wherein searching among a plurality of documents according to the query content to obtain a target document and interpretability information corresponding to the target document comprises:

3. The method of claim 2, wherein the plurality of candidate matching elements comprises any of word matching, n-gram matching, synonym matching, semantic vector matching, topic keyword matching, multi-modal information matching, metadata attribute matching, document full-text length, document included non-textual modal data, document timeliness data, document historical access data.

4. The method of claim 2, further comprising:

and outputting the target segment.

5. The method according to claim 1, wherein the plurality of documents includes a prediction document in which prediction data of an object to be evaluated is recorded, the method further comprising:

6. The method according to any one of claims 1 to 5, wherein the target document is a multi-modal document, and the multi-modal document refers to a document comprising any of a plurality of types of information in words, figures and tables.

7. A document searching apparatus applied to a search engine, the document searching apparatus comprising:

the acquisition module is used for acquiring query contents input by a user;

8. The apparatus of claim 7, wherein the search module is specifically configured to:

9. The apparatus of claim 8, wherein the plurality of candidate matching elements comprises any of word matching, n-gram matching, synonym matching, semantic vector matching, topic keyword matching, multi-modal information matching, metadata attribute matching, document full-text length, document included non-textual modal data, document timeliness data, document historical access data.

10. The apparatus of claim 8,

the search module is further used for determining a target segment in the target document according to the interpretability information, and the matching degree between the target segment and the query content is higher than that between the rest segments in the target document and the query content;

the output module is further configured to output the target segment.

11. The apparatus according to claim 7, wherein the plurality of documents include a prediction document in which prediction data of an object to be evaluated is recorded;

the search module is further configured to:

12. The apparatus according to any one of claims 7 to 11, wherein the target document is a multi-modal document, and the multi-modal document refers to a document comprising any multiple types of information in characters, figures and tables.

13. A computing device comprising a processor, a memory;

the processor is configured to execute instructions stored in the memory to cause the computing device to perform the steps of the method of any of claims 1 to 6.

14. A computer-readable storage medium comprising instructions which, when executed on a computing device, cause the computing device to perform the steps of the method of any one of claims 1 to 6.