WO2022120975A1

WO2022120975A1 - Document searching method and apparatus, and electronic device

Info

Publication number: WO2022120975A1
Application number: PCT/CN2020/139255
Authority: WO
Inventors: 吴嘉澍; 王洋; 须成忠
Original assignee: 中国科学院深圳先进技术研究院
Priority date: 2020-12-10
Filing date: 2020-12-25
Publication date: 2022-06-16
Also published as: CN112463950A; CN112463950B

Abstract

The present invention is applicable to the technical field of document searching, and provides a document searching method and apparatus, and an electronic device. The method comprises: obtaining entries in all documents of scholars, and performing entry extension on all the documents according to a hierarchical relationship of the entries; for each scholar, according to appearance conditions and entry extension conditions of entries in documents written by the scholar, performing different weights of scoring on the entries to construct a scholar-entry matrix; using a machine learning algorithm to train the scholar-entry matrix so as to generate a search model; and performing operation on a search keyword in the search model to output a document searching result. During entry extension, an extension score is obtained by globally considering all documents of a scholar according to extension conditions in other documents of the scholar, and the extension score may further be given different values according to a hierarchical gap between an extended entry and an original entry in the hierarchical relationship of entries, such that reasonable extension for documents is achieved, and the problem of search quality when a search keyword does not explicitly appear is effectively solved.

Description

Literature search method, device and electronic device

Technical field

The invention belongs to the technical field of document search, and in particular relates to a document search method, device and electronic device.

Background technique

With the surge in the amount of data in the era of big data, efficiently finding information relevant to one's own needs in massive collections has become more and more important. As one of the application technologies of information retrieval, literature search for scholars allows users to enter keywords of interest and search for scholars related to those keywords within an institution, a school or a wider range of scholars, with the results sorted in descending order of relevance. To achieve this function, the retrieval system typically stores the academic documents published by various scholars, such as papers and journal articles, so that it can generate search results and rank them according to the documents of each scholar.

However, search systems for scholars face a problem. For example, when a user searches for "computer science", scholars engaged in "natural language processing" are highly related to computer science, yet they are either not returned or are ranked very low. The reason is that the vast majority of scholars do not mention higher-level concepts and keywords such as "computer science" in every "natural language processing" document. That is, the search term "computer science" typed by the user does not appear explicitly in the literature written by these scholars, which lowers the quality of the search results.

Technical problem

The purpose of the present invention is to provide a document search method, device and electronic device, which aims to solve the technical problem of low search quality in the prior art when search keywords do not appear explicitly.

Technical solutions

In a first aspect, the present invention provides a literature search method, applied to electronic equipment, including:

Obtain the entries in all the literatures of various scholars;

Entries are expanded for all documents according to the hierarchical relationship of entries;

For each scholar, according to the appearance of each entry in the literature written by the scholar and the expansion of the entry, assign different weights to the entry to construct a scholar-entry matrix;

A machine learning algorithm is used to train the scholar-entry matrix to generate a search model;

A matching operation is performed on the search keywords in the search model, and a document search result is output according to the matching degree.

Further, the step of obtaining the entries in all the documents of each scholar includes:

Obtain all the literature of each scholar;

Preprocess each document to obtain the entries in each document.

Further, the hierarchical relationship of entries is an academic vocabulary hierarchical relationship, and the step of performing entry expansion on all documents according to the hierarchical relationship of entries includes:

For each entry in the literature, look up the corresponding academic vocabulary in the academic vocabulary hierarchical relationship;

According to the hierarchical relationship of the academic vocabulary, the academic vocabulary is expanded to the upper-level entry.

Further, the step of assigning, for each scholar, different weights to the entries according to the appearance of each entry in the literature written by the scholar and the expansion of the entry, to construct a scholar-entry matrix, includes:

For each scholar, according to the appearance of each entry in the literature written by the scholar and the expansion of the entry, the entries are assigned different weights;

According to the scores of each document and the entry in the document, a document-entry matrix is formed;

Transform the document-entry matrix into a scholar-entry matrix.

Further, the occurrence status includes the number of occurrences and the occurrence position, and the step of assigning, for each scholar, different weights to the entries according to the occurrence status of each entry in the literature written by the scholar and the expansion of the entry includes:

For each scholar, according to the occurrence frequency and occurrence position of each entry in the literature written by the scholar, the corresponding occurrence frequency score and occurrence position score are assigned.

Further, for each scholar, according to the appearance of each entry in the literature written by the scholar and the expansion of the entry, the steps of assigning different weights to the entry include:

For each scholar, assign a corresponding scholar score to the entry according to the average number of times the entry is expanded in all the literature written by the scholar; and

According to the distance of the hierarchical relationship of the entry when the entry is expanded, the corresponding expansion score is assigned to the entry.

Further, the step of using a machine learning algorithm to train the scholar-entry matrix to generate a search model includes:

Use the XGBoost algorithm to train and learn the scholar-entry matrix, and obtain the training loss value on the search ranking data set;

When the training loss value does not converge, a Bayesian optimization grid search algorithm is used to optimize parameters, and the scholar-entry matrix is updated until the training loss value converges.

In a second aspect, the present invention provides a document search device, comprising:

The entry acquisition module is used to acquire entries in all literatures of various scholars;

Entry expansion module, which is used to expand entries of all documents according to the hierarchical relationship of entries;

The matrix building module is used for each scholar to assign different weights to the entries according to the occurrence of each entry in the literature written by the scholar and the expansion of the entry to construct a scholar-entry matrix;

a training module, used for using a machine learning algorithm to train the scholar-entry matrix to generate a search model;

The search module is used to perform matching operation on the search keywords in the search model, and output the document search results according to the matching degree.

In a third aspect, the present invention also provides an electronic device, comprising:

processor; and

a memory communicatively connected to the processor; wherein,

The memory stores readable instructions that, when executed by the processor, implement the method of the first aspect.

In a fourth aspect, the present invention provides a computer-readable storage medium having stored thereon a computer program that, when executed, implements the method of the first aspect.

Beneficial effect

In the literature search method, device and electronic device provided by the present invention, when the hierarchical relationship of entries is used to expand the entries of a document, the expansion score takes all the literature written by the scholar into overall consideration, based on how the entry is expanded in the scholar's other literature. It fully considers factors such as the expansion of the entry in the scholar's other literature, the distance between the expanded entry and the original entry in the lexical hierarchy, and the position of the original entry in the literature, so as to achieve a reasonable expansion of the literature. Literature search for scholars can therefore be carried out better, which effectively solves the problem of search quality when the search keywords do not appear explicitly.

Description of drawings

FIG. 1 is a flow chart of the realization of the document search method shown in the first embodiment.

Fig. 2 is a schematic diagram of a knowledge tree containing the hierarchical relationship of academic vocabulary.

FIG. 3 is a schematic diagram illustrating text entry expansion according to an exemplary embodiment.

FIG. 4 is a specific application flowchart of Embodiment 1 according to an exemplary embodiment.

FIG. 5 is a structural block diagram of the document search apparatus shown in the second embodiment.

BEST MODE FOR CARRYING OUT THE INVENTION

In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present invention, but not to limit the present invention.

The specific implementation of the present invention is described in detail below in conjunction with specific embodiments:

Embodiment 1:

FIG. 1 is a flow chart of the realization of the document search method shown in the first embodiment. The document search method shown in the first embodiment is applicable to an electronic device, and a processor is set in the electronic device to perform a document search according to a search keyword. For the convenience of description, only the parts related to the embodiments of the present invention are shown, and the details are as follows:

In step S110, the entries in all documents of each scholar are acquired.

Step S120, performing entry expansion on all documents according to the hierarchical relationship of entries.

Step S130, for each scholar, according to the occurrence of each entry in the scholar's literature and the expansion of the entry, assign different weights to the entry to construct a scholar-entry matrix.

Step S140, using a machine learning algorithm to train the scholar-entry matrix to generate a search model.

In step S150, a matching operation is performed on the search keywords in the search model, and a document search result is output according to the matching degree.

Due to the rapid increase in the amount of literature data, how to retrieve the information most relevant to the search keywords from the massive resources and sort them reasonably to meet the information needs of users will become more and more important.

Existing information retrieval systems can retrieve a variety of entities, such as text, audio and video, games, and scholars and experts. When searching, these retrieval systems all face the above-mentioned problem to some degree: the search keywords do not appear explicitly in the text.

In the present invention, a knowledge tree (as shown in FIG. 2) containing the hierarchical relationship of academic words is used to assist document search, so as to deal with the problem that search keywords do not appear explicitly in documents. When the knowledge tree is used to expand the entries of a document, the expansion score takes all the literature written by the scholar into global consideration, based on how the entry is expanded in the scholar's other literature. At the same time, the expansion score is adjusted according to which part of the document is being expanded, and it is also assigned different values according to the level gap between the expanded entry and the original entry in the knowledge tree, so as to achieve a reasonable expansion of the literature and therefore better literature search for scholars.

Specifically, when acquiring the entries in all documents of each scholar, all documents of each scholar are acquired first, and each document is then preprocessed to obtain its entries. For example, for certain languages such as English and French, the text needs to be converted to lowercase, whereas this step is not required for languages such as Chinese. Repeated spaces, punctuation and the like are then removed. Next, sentence segmentation is performed, and finally word segmentation is performed on each document with the aid of a thesaurus.
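The preprocessing steps above can be illustrated with a minimal Python sketch. The thesaurus contents, the function name `preprocess` and the greedy longest-match segmentation are assumptions made for illustration; the source does not specify how word segmentation is implemented. As a simplification, the sketch splits sentences before dropping the remaining punctuation.

```python
import re

# Hypothetical thesaurus of academic entries (assumption, not from the source).
THESAURUS = {"machine translation", "natural language processing", "computer science"}

def preprocess(text, thesaurus=THESAURUS, lowercase=True):
    """Return the thesaurus entries found in `text`, in order of appearance."""
    if lowercase:
        # Needed for languages such as English or French; skip for Chinese.
        text = text.lower()
    # Collapse repeated whitespace.
    text = re.sub(r"\s+", " ", text).strip()
    # Naive sentence segmentation on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Longest entry length in words, used to bound the matching window.
    max_len = max((len(e.split()) for e in thesaurus), default=1)
    entries = []
    for sentence in sentences:
        # Drop the remaining punctuation inside the sentence.
        words = re.sub(r"[^\w\s]", "", sentence).split()
        i = 0
        while i < len(words):
            # Greedy longest match against the thesaurus (assumed strategy).
            for n in range(min(max_len, len(words) - i), 0, -1):
                candidate = " ".join(words[i:i + n])
                if candidate in thesaurus:
                    entries.append(candidate)
                    i += n
                    break
            else:
                i += 1  # No entry starts at this word.
    return entries

found = preprocess("We study  Machine Translation. It is part of natural language processing!")
```

Words that are not in the thesaurus are simply skipped here; a real system would likely rely on a language-specific segmenter.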

Specifically, when the entry expansion is performed, for each entry in the document, the corresponding academic vocabulary is searched in the academic vocabulary hierarchical relationship, and then the academic vocabulary is expanded to the upper level entry according to the academic vocabulary hierarchical relationship.

When training the model, a dataset containing search terms and the corresponding correct rankings is needed, alongside the preprocessed, word-segmented documents. In this dataset, the data is presented in the following form:

Search Item 1: Computer Science; Ranking of Related Scholars: 1. Zhang San, 2. Li Si, …

Search Item 2: Natural Language Processing; Ranking of Related Scholars: 1. Wang Wu, 2. Zhang San, …

Figure 2 is a schematic diagram of a knowledge tree containing the hierarchical relationship of academic vocabulary. As shown in Figure 2, the academic vocabulary from top to bottom in the knowledge tree are: engineering, computer science, natural language processing, machine translation, Neural Machine Translation, etc. FIG. 3 is a schematic diagram illustrating text entry expansion according to an exemplary embodiment.

The knowledge trees in Figures 2 and 3 contain knowledge such as the fact that "machine translation" is a sub-branch of "natural language processing". The literature is therefore expanded with the aid of the knowledge tree. If a document contains the entry "machine translation", then with this entry as the original entry, higher-level keywords such as "natural language processing", "computer science" and "engineering" are expanded. As a result, when a user searches for "computer science", a scholar can still be found by this algorithm, and may even be ranked highly, as long as "computer science" is expanded many times in the scholar's literature, even if the term "computer science" is never mentioned in the literature written by the scholar. It is worth noting that literature on "machine translation" is not necessarily related to "statistical machine translation", so the algorithm only expands upward. When expanding upward with "machine translation" as the original entry, "natural language processing" is expanded, and its level gap with "machine translation" in the knowledge tree is one level. "Computer science" is also expanded, and its level gap with "machine translation" in the knowledge tree is two levels.
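The upward expansion and its level gaps can be sketched as follows. The child-to-parent dictionary encodes the chain described for FIG. 2; the dictionary representation and the function name `expand_upward` are illustrative assumptions rather than the data structure used by the invention.

```python
# Child -> parent links for the chain described for FIG. 2.
# Representing the knowledge tree as a dict is an assumption of this sketch.
PARENT = {
    "neural machine translation": "machine translation",
    "machine translation": "natural language processing",
    "natural language processing": "computer science",
    "computer science": "engineering",
}

def expand_upward(entry, parent=PARENT):
    """Return [(ancestor, level_gap), ...] for `entry`, nearest ancestor first.

    Expansion only goes upward: children of `entry` are never added.
    """
    expanded = []
    gap = 1
    current = entry
    while current in parent:
        current = parent[current]
        expanded.append((current, gap))
        gap += 1
    return expanded

result = expand_upward("machine translation")
# result == [("natural language processing", 1), ("computer science", 2), ("engineering", 3)]
```

The root entry ("engineering" in this chain) has no parent and therefore expands to an empty list.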

After the entries are expanded, they are assigned different weights according to the appearance of each entry in the literature written by the scholar and the expansion of the entry. A document-entry matrix is formed from the scores of each document and the entries in the document, the document-entry matrix is converted into a scholar-entry matrix, and the scholar-entry matrix is then used as the model input for training.

When assigning a score to an entry, the following factors are fully considered: the expansion of the entry in the scholar's other literature, the distance between the expanded entry and the original entry in the lexical hierarchy, and the position in the literature where the original entry appears.

Expanded entries with different level gaps from the original entry should be assigned different scores, reflecting different degrees of matching and, in turn, the pertinence of the search. An equal scoring method cannot reflect the distance between words in the knowledge tree.

Unlike ordinary text search, search for scholars needs to consider all the literature written by each scholar as a whole, so when the knowledge tree is used to expand an entry, it is necessary to consider how the expanded entry is expanded in the scholar's other literature. For each expanded entry in the literature written by a scholar, a "scholar score" of the entry under that scholar is calculated: its numerator is the number of times the entry is expanded across all of the scholar's literature, and its denominator is the number of the scholar's documents in which the entry is expanded. The scholar score of an entry is therefore the average number of times the entry is expanded per document, taken over the scholar's documents in which the entry has been expanded. That is to say, entries that are frequently expanded in the literature written by the scholar receive a higher score. This scoring method fully takes into account the expansion of the entry in all the literature written by the scholar. Compared with traditional text expansion methods, this expansion method is more reasonable and more suitable for algorithms that search for scholars.
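As a hedged illustration of the scholar score, the following Python sketch computes, for one scholar, the total number of expansions of each entry divided by the number of that scholar's documents in which the entry was expanded. The input layout (one list of expanded entries per document) and the name `scholar_scores` are assumptions of this sketch.

```python
from collections import Counter

def scholar_scores(expanded_per_document):
    """Compute the scholar score of every expanded entry for one scholar.

    expanded_per_document: one list per document, holding each entry as many
    times as it was expanded in that document (assumed input layout).
    """
    total_expansions = Counter()   # numerator: expansions over all documents
    documents_with_entry = Counter()  # denominator: documents where it was expanded
    for expanded in expanded_per_document:
        counts = Counter(expanded)
        total_expansions.update(counts)
        documents_with_entry.update(counts.keys())
    return {
        entry: total_expansions[entry] / documents_with_entry[entry]
        for entry in total_expansions
    }

docs = [
    ["computer science", "computer science", "engineering"],
    ["computer science"],
    [],  # a document in which nothing was expanded
]
scores = scholar_scores(docs)
# "computer science": 3 expansions over 2 documents -> 1.5
# "engineering": 1 expansion over 1 document -> 1.0
```

Documents in which an entry was never expanded do not enter its denominator, which matches the averaging described above.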

In addition, an expanded entry is scored differently according to the level gap between it and the original entry and according to the part of the document in which the original entry appears.

The scoring is determined by the number of levels separating the original entry and the expanded entry in the entry hierarchy. For example, if the entry hierarchy has six levels, five parameters are generated, corresponding to level gaps of one through five. At the same time, the score should also differ for each part of the literature (title, abstract, main text, etc.). Therefore, the final algorithm has "(knowledge tree height - 1) * number of parts" scoring parameters.
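The parameter count can be checked with a small sketch that builds one expansion-score parameter per (level gap, document part) pair. The part names, the initial value and the function name `make_expansion_params` are placeholders assumed for illustration; in the described method the values would be tuned rather than fixed.

```python
from itertools import product

def make_expansion_params(tree_height, parts, initial=1.0):
    """Create one tunable expansion score per (level_gap, part) pair.

    Level gaps range from 1 to tree_height - 1, so the table holds
    (tree_height - 1) * len(parts) parameters.
    """
    return {
        (gap, part): initial
        for gap, part in product(range(1, tree_height), parts)
    }

params = make_expansion_params(6, ["title", "abstract", "body"])
# A six-level tree and three parts give (6 - 1) * 3 = 15 parameters.
```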

Ultimately, for each entry in each document of a scholar, the score is:

Term score (term_score) = number of occurrences + scholar score * extension score

The more times the entry appears in the document, the higher the entry score. The more the entry is expanded in the literature written by the scholar, that is, the higher the scholar score, the higher the entry score. The higher the expansion score, the higher the entry score. The expansion score should be considered together with how the entry is expanded in the scholar's literature, so the product of the two is taken.
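A direct transcription of the term score formula into Python is shown below, with made-up input values used only to illustrate the arithmetic. The function name `term_score` follows the identifier given in the formula; the sample numbers are hypothetical.

```python
def term_score(occurrences, scholar_score, expansion_score):
    """term_score = number of occurrences + scholar score * expansion score."""
    return occurrences + scholar_score * expansion_score

# Entry that appears 3 times and is also expanded (hypothetical values).
explicit_and_expanded = term_score(3, 1.5, 0.5)   # 3 + 1.5 * 0.5 = 3.75

# Entry that never appears explicitly but is reached by expansion only.
expanded_only = term_score(0, 2.0, 0.25)          # 0 + 2.0 * 0.25 = 0.5
```

The second case is the one the method targets: an entry such as "computer science" still receives a non-zero score even though it never appears explicitly in the document.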

The document-entry matrix can be formed from the scores of each document and the entries in the document. In order to consider all the documents written by each scholar as a whole, the document-entry matrix is transformed into a scholar-entry matrix.

For each entry in the literature written by a scholar, its score in the scholar-entry matrix is the sum of the final scores of the entry across all of that scholar's literature, multiplied by the logarithm of the number of articles written by the scholar that contain the entry, divided by the logarithm of the number of articles written by the scholar. Therefore, the higher the cumulative score of an entry across the scholar's literature, the higher its score in the scholar-entry matrix. The more of the scholar's articles contain the entry, the higher the entry scores in the scholar-entry matrix. Writing more articles increases the probability of a high cumulative score for an entry, so the number of articles written by the scholar is used in the denominator during the matrix transformation.
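The document-to-scholar aggregation can be sketched for a single scholar as follows. The source specifies only "the logarithm"; this sketch uses log(1 + n) in both numerator and denominator so that a scholar with a single article does not cause a division by zero. That smoothing, the input layout and the name `scholar_entry_row` are assumptions, not part of the described method.

```python
import math

def scholar_entry_row(doc_scores):
    """Aggregate per-document entry scores into one scholar-entry matrix row.

    doc_scores: list of {entry: term_score} dicts, one per document written
    by a single scholar (assumed input layout).
    """
    n_docs = len(doc_scores)
    score_sum = {}     # sum of the entry's scores over all documents
    containing = {}    # number of documents that contain the entry
    for scores in doc_scores:
        for entry, score in scores.items():
            score_sum[entry] = score_sum.get(entry, 0.0) + score
            containing[entry] = containing.get(entry, 0) + 1
    # ASSUMPTION: log1p (log(1 + n)) replaces the plain logarithm of the
    # source to avoid log(1) = 0 in the denominator.
    return {
        entry: score_sum[entry] * math.log1p(containing[entry]) / math.log1p(n_docs)
        for entry in score_sum
    }

row = scholar_entry_row([
    {"computer science": 2.0, "natural language processing": 1.0},
    {"computer science": 4.0},
])
```

With these hypothetical inputs, "computer science" appears in both documents, so its log ratio is 1 and its score equals the plain sum of 6.0, while "natural language processing" is scaled down because it appears in only one of the two documents.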

After the scholar-entry matrix is constructed, a machine learning algorithm will be used to train the scholar-entry matrix to generate a search model. When performing literature search, the search keywords are matched in the search model, and the literature search results can be accurately output according to the matching degree.

Specifically, the XGBoost algorithm is used to train and learn the scholar-entry matrix, and the training loss value of the model on the search ranking data set is obtained. When the training loss value does not converge, the Bayesian optimization grid search algorithm is used to optimize the parameters, and the scholar-entry matrix is updated until the training loss value converges.

As used here, XGBoost serves as a pairwise learning-to-rank algorithm: given a pair of search results A and B, it converts the ranking problem into a binary classification problem of whether result A should be ranked higher than result B. Finally, the algorithm outputs the pairwise ranking binary classification error rate of the trained model.
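To make the pairwise formulation concrete, the following sketch computes a pairwise ranking error rate from a correct ranking and a set of model scores. It does not call XGBoost; it only illustrates the metric under stated assumptions: the function name `pairwise_error_rate`, the sample scores, and the choice to count ties as errors are all hypothetical.

```python
from itertools import combinations

def pairwise_error_rate(true_ranking, predicted_scores):
    """Fraction of result pairs that the predicted scores order incorrectly.

    true_ranking: scholars ordered from most to least relevant.
    predicted_scores: {scholar: score}, higher meaning more relevant.
    """
    # Every pair (a, b) where a is correctly ranked above b.
    pairs = list(combinations(true_ranking, 2))
    if not pairs:
        return 0.0
    # ASSUMPTION: a tie counts as an error, since a is not ranked above b.
    errors = sum(1 for a, b in pairs if predicted_scores[a] <= predicted_scores[b])
    return errors / len(pairs)

ranking = ["Zhang San", "Li Si", "Wang Wu"]
scores = {"Zhang San": 0.9, "Li Si": 0.2, "Wang Wu": 0.5}
rate = pairwise_error_rate(ranking, scores)
# Only the pair (Li Si, Wang Wu) is misordered, so the rate is 1/3.
```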

In order to find the optimal configuration of the parameters involved in the algorithm proposed by the present invention, the present invention adopts the Bayesian optimization grid search algorithm to quickly optimize and select the parameters, and the optimization goal is to minimize the pairwise ranking binary classification error rate produced by the XGBoost model.

Bayesian optimization grid search is a parameter optimization algorithm used to optimize parameters such as the weights and scores in the trained model. The algorithm first tests a combination of parameters; each new round of parameter selection is then guided by the experimental results of the previous round, with the goal of minimizing the loss of the XGBoost model, and the parameter selection is optimized iteratively until convergence. Compared with the traditional grid search optimization algorithm, the Bayesian optimization grid search algorithm can dynamically adjust the parameter selection of the next iteration based on the training results of the previous selection, so that the parameters are optimized faster.

FIG. 4 is a specific application flowchart of Embodiment 1 according to an exemplary embodiment. As shown in Figure 4, the text is first preprocessed to extract the entries in the literature, and the knowledge tree is then used to expand the entries of the literature. After the expansion, the entries are assigned different weights according to the appearance of each entry in the literature written by the scholar and the expansion of the entry. A document-entry matrix is formed from the scores of each document and the entries in the document, and the document-entry matrix is converted into a scholar-entry matrix. The scholar-entry matrix is then used as the input of the XGBoost model for training, and the training loss value is calculated. When the training loss value does not converge, the Bayesian optimization grid search algorithm performs parameter optimization and updates the scholar-entry matrix until the training loss value converges. Finally, when a literature search is performed, the search keywords are matched in the search model, and the literature search results are output according to the matching degree.

Embodiment 2:

As shown in FIG. 5, Embodiment 2 of the present invention provides a document search apparatus, and the apparatus can execute all or part of the steps of any of the above-mentioned document search methods. The apparatus includes:

Entry acquisition module 1, used to acquire entries in all documents of scholars;

Entry expansion module 2, used to expand entries of all documents according to the hierarchical relationship of entries;

Matrix building module 3 is used for each scholar to assign different weights to the entries according to the appearance of each entry in the literature written by the scholar and the expansion of the entry to construct a scholar-entry matrix;

The training module 4 is used to train the scholar-entry matrix by using a machine learning algorithm to generate a search model;

The search module 5 is used to perform a matching operation on the search keywords in the search model, and output the document search results according to the matching degree.

Embodiment 3:

The third embodiment of the present invention provides an electronic device, and the electronic device can execute all or part of the steps of any of the above-mentioned document search methods. The electronic equipment includes:

processor; and

a memory communicatively coupled to the processor; wherein,

the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the method described in any of the above exemplary embodiments, which will not be described in detail here.

In this embodiment, a storage medium is also provided, and the storage medium is a computer-readable storage medium, for example, a transitory or non-transitory computer-readable storage medium including instructions. The storage medium includes, for example, a memory storing instructions that can be executed by a processor of the server system to implement the above-mentioned document searching method.

The above descriptions are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modifications, equivalent replacements and improvements made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

A literature search method, applied to electronic equipment, characterized in that the method comprises:

Obtain the entries in all the literatures of various scholars;

Entries are expanded for all documents according to the hierarchical relationship of entries;

For each scholar, according to the appearance of each entry in the literature written by the scholar and the expansion of the entry, assign different weights to the entry to construct a scholar-entry matrix;

A machine learning algorithm is used to train the scholar-entry matrix to generate a search model;

A matching operation is performed on the search keywords in the search model, and a document search result is output according to the matching degree.
The method of claim 1, wherein the step of obtaining the entries in all documents of each scholar comprises:

Obtain all the literature of each scholar;

Preprocess each document to obtain the entries in each document.
The method according to claim 1, wherein the hierarchical relationship of terms is an academic vocabulary hierarchical relationship, and the step of performing term expansion on all documents according to the hierarchical relationship of terms comprises:

For each entry in the literature, look up the corresponding academic vocabulary in the academic vocabulary hierarchical relationship;

According to the hierarchical relationship of the academic vocabulary, the academic vocabulary is expanded to the upper-level entry.
The method according to claim 1, wherein the step of assigning, for each scholar, different weights to the entries according to the appearance of each entry in the literature written by the scholar and the expansion of the entry, to construct a scholar-entry matrix, comprises:

For each scholar, according to the appearance of each entry in the literature written by the scholar and the expansion of the entry, the entries are assigned different weights;

According to the scores of each document and the entry in the document, a document-entry matrix is formed;

Transform the document-entry matrix into a scholar-entry matrix.
The method according to claim 4, wherein the occurrence status includes the number of occurrences and the occurrence position, and the step of assigning, for each scholar, different weights to the entries according to the occurrence status of each entry in the literature written by the scholar and the expansion of the entry comprises:

For each scholar, according to the occurrence frequency and occurrence position of each entry in the literature written by the scholar, the corresponding occurrence frequency score and occurrence position score are assigned.
The method according to claim 4, wherein the step of assigning, for each scholar, different weights to the entries according to the appearance of each entry in the literature written by the scholar and the expansion of the entry comprises:

For each scholar, assign a corresponding scholar score to the entry according to the average number of times the entry is expanded in all the literature written by the scholar; and

According to the distance of the hierarchical relationship of the entry when the entry is expanded, the corresponding expansion score is assigned to the entry.
The method according to claim 1, wherein the step of using a machine learning algorithm to train the scholar-entry matrix to generate a search model comprises:

Use the XGBoost algorithm to train and learn the scholar-entry matrix, and obtain the training loss value on the search ranking data set;

When the training loss value does not converge, a Bayesian optimization grid search algorithm is used to optimize parameters, and the scholar-entry matrix is updated until the training loss value converges.
A document search device, characterized in that the device comprises:

The entry acquisition module is used to acquire the entries in all the literatures of various scholars;

Entry expansion module, which is used to expand entries of all documents according to the hierarchical relationship of entries;

The matrix building module is used for each scholar to assign different weights to the entries according to the occurrence of each entry in the literature written by the scholar and the expansion of the entry to construct a scholar-entry matrix;

a training module, used for using a machine learning algorithm to train the scholar-entry matrix to generate a search model;

The search module is used to perform matching operation on the search keywords in the search model, and output the document search results according to the matching degree.
An electronic device, characterized in that the electronic device comprises:

processor; and

a memory communicatively connected to the processor; wherein,

The memory stores readable instructions that, when executed by the processor, implement the method of any one of claims 1-7.
A computer-readable storage medium having stored thereon a computer program which, when executed, implements the method of any one of claims 1-7.