WO2022116324A1 - Search model training method, apparatus, terminal device, and storage medium

Info

Publication number: WO2022116324A1
Application number: PCT/CN2020/140016
Authority: WO
Inventors: 吴嘉澍; 王洋; 须成忠; 叶可江
Original assignee: 中国科学院深圳先进技术研究院
Priority date: 2020-12-04
Filing date: 2020-12-28
Publication date: 2022-06-09
Also published as: CN112434134B; CN112434134A

Abstract

A search model training method, an apparatus, a terminal device, and a storage medium, applicable to the technical field of information retrieval. The search model training method comprises: preprocessing document data to obtain a document-keyword matrix (S202); performing two rounds of matrix transformation on the document-keyword matrix on the basis of a search keyword to obtain a scholar-keyword matrix (S203); inputting the scholar-keyword matrix into a search model to be trained for training, and outputting an average training loss (S204); if the average training loss does not reach a preset loss threshold, using a Bayesian optimization network to update the parameters of the search model to be trained (S205); and retraining the updated search model, stopping training when the average training loss reaches the preset loss threshold, and using the search model at that point as a scholar search model (S206). This enables high-quality and efficient search over the publications of a scholar and increases the accuracy of search results.

Description

Search model training method, device, terminal device and storage medium

Technical field

The invention belongs to the technical field of information retrieval, and in particular relates to a search model training method, device, terminal equipment and storage medium.

Background

With the rapid development of big data, information retrieval technology has become increasingly important: it must retrieve relevant information from massive amounts of data according to the needs of users. An information retrieval system can retrieve a variety of entities, such as text, audio, games, and videos. Taking text search as an example, the retrieval system determines how relevant each text is to the searched keyword and ranks the texts by relevance to produce satisfactory search results. In ordinary text retrieval, each text is considered independently for its relevance to the search keywords. Unlike the search for entities such as text, however, the search for scholars often requires that all the documents written by a scholar be considered collectively and comprehensively, rather than individually.

At the same time, as the amount of data grows, it becomes increasingly difficult to manually label the data sets required for training search systems, which include search keywords and their correct rankings. Training an algorithm with limited supervision information is therefore harder, so existing search methods are poorly suited to search model training and suffer from low search quality and low accuracy.

SUMMARY OF THE INVENTION

In view of this, the embodiments of the present invention provide a search model training method, apparatus, terminal device and storage medium to solve the problems of low search quality and low accuracy in the search model training in the prior art.

A first aspect of the embodiments of the present invention provides a search model training method, including:

acquiring a data set, wherein the data set includes document data and search key data, the document data includes scholars and documents, each scholar has at least two documents, and the search key data includes search keywords;

preprocessing the document data to obtain a document-keyword matrix;

performing two rounds of matrix transformation on the document-keyword matrix according to the search keywords to obtain a scholar-keyword matrix;

inputting the scholar-keyword matrix into the search model to be trained for training, and outputting the average training loss;

if the average training loss does not reach the preset loss threshold, using a Bayesian optimization network to update the parameters of the search model to be trained;

retraining the updated search model to be trained, stopping training when the average training loss reaches the preset loss threshold, and using the search model to be trained at that point as the scholar search model.

A second aspect of the embodiments of the present invention provides a search model training device, including:

an acquisition module, configured to acquire a data set, wherein the data set includes document data and search key data, the document data includes scholars and documents, each scholar has at least two documents, and the search key data includes search keywords;

a preprocessing module, configured to preprocess the document data to obtain a document-keyword matrix;

a matrix transformation module, configured to perform two rounds of matrix transformation on the document-keyword matrix to obtain a scholar-keyword matrix;

a training module, configured to input the scholar-keyword matrix into the search model to be trained for training, and to output the average training loss;

a parameter adjustment module, configured to update the parameters of the search model to be trained by using a Bayesian optimization network if the average training loss does not reach the preset loss threshold;

a training completion module, configured to retrain the updated search model to be trained, stop training when the average training loss reaches the preset loss threshold, and use the search model to be trained at that point as the scholar search model.

A third aspect of the embodiments of the present invention provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above search model training method when executing the computer program.

A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the above search model training method are implemented.

The beneficial effects that the embodiment of the present invention has compared with the prior art are:

In the present invention, the document-keyword matrix is obtained by preprocessing the document data, and two rounds of matrix transformation are performed on the document-keyword matrix according to the search keywords to obtain the scholar-keyword matrix. The scholar-keyword matrix is input into the search model to be trained for training, and the average training loss is output. If the average training loss does not reach the preset loss threshold, the Bayesian optimization network is used to update the parameters of the search model to be trained, and the updated search model is retrained. Training stops when the average training loss reaches the preset loss threshold, and the search model at that point is used as the scholar search model. In other words, matrix transformation aggregates the information of each document in the document-keyword matrix into the scholar-keyword matrix on a per-scholar basis, so that a scholar search considers all the documents written by a scholar collectively and accurately reflects the relevance between scholars and search keywords. The scholar search task is thereby completed better, with more accurate results and higher search quality. At the same time, because the present invention relies only on a small number of labeled search keywords and on unlabeled document data for training, it reduces the dependence on large amounts of labeled ranking supervision and makes it possible to train and learn with limited supervision information.

Description of drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the following briefly introduces the accompanying drawings that need to be used in the description of the embodiments or the prior art. Obviously, the drawings in the following description show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from these drawings without any creative effort.

FIG. 1 is a schematic diagram of a process flow and an optimization framework of a search model training method according to an embodiment of the present invention;

FIG. 2 is a schematic flowchart of a search model training method according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of another embodiment of a process and an optimization framework of a search model training method according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of the document-keyword matrix, the scholar-keyword matrix and the LSA matrix decomposition according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a search model training device provided by an embodiment of the present invention;

FIG. 6 is a schematic diagram of a hardware structure of a terminal device provided by an embodiment of the present invention.

Detailed description

In the following description, for the purpose of illustration rather than limitation, specific details such as specific system structures and technologies are set forth in order to provide a thorough understanding of the embodiments of the present invention. However, it will be apparent to those skilled in the art that the present invention may be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.

In order to illustrate the technical solutions of the present invention, the following specific embodiments are used for description.

In the embodiment of the present invention, the execution subject of the process is a terminal device, and the terminal device includes but is not limited to a notebook computer, a computer, a server, a tablet computer, a smart phone, and other terminal devices with software development functions. In particular, when the terminal device executes the process in the implementation of the present invention, it can be used to provide the native application with the bearing function of the desktop application or the display of the front-end interface, and provide the interface assembly framework for the desktop application.

Further, unlike traditional text search, which considers each text independently, the search for a scholar requires a unified and comprehensive consideration of all the documents written by the scholar, which makes the method of transforming the document-keyword matrix into a scholar-keyword matrix very important. FIG. 1 is a schematic diagram of a process and an optimization framework of a search model training method according to an embodiment of the present invention, which specifically includes: preprocessing the document text to remove punctuation and obtain clean text; extracting entries from the text to obtain the document entries; expanding the document entries using a knowledge tree; and transforming the expanded entries together with the document entries into a document-entry matrix.

Further, the document-entry matrix is transformed into a scholar-entry matrix by a first round of matrix transformation, pseudo-relevance feedback is calculated from this scholar-entry matrix, and the pseudo-relevance feedback guides a second round of matrix transformation. The scholar-entry matrix obtained after the second round is used as training data for an XGBoost model (a gradient boosting tree model), and the average loss of the XGBoost model is calculated. When the average loss does not meet the convergence condition, a Bayesian optimization network search is used to update the parameters of the XGBoost model; training stops when the average training loss converges, and the updated XGBoost model at that point is used as the scholar search model. Because the two matrix transformations aggregate the information of each document in the document-entry matrix into the scholar-entry matrix on a per-scholar basis, the algorithm considers all the documents written by a scholar in a unified and comprehensive way when searching for scholars. The present invention also uses pseudo-relevance feedback to assist the matrix transformation, which makes the search more accurate and efficient.

Continuing to refer to FIG. 2, FIG. 2 is a schematic flowchart of a search model training method according to an embodiment of the present invention, which is described in detail as follows:

S201. Acquire a data set, wherein the data set includes document data and search key data, the document data includes scholars and documents, each scholar includes at least two documents, and the search key data includes search keywords.

In the embodiment of the present invention, the document data includes a data set of each scholar and the academic documents written by them. In the literature data, each article includes the title of the article, the abstract of the article, the text of the article, and the name of the academic conference/journal in which the article was published.

Further, the search key data also includes result information corresponding to the search keywords, that is, the result information is documents marked by the search keywords, and the search key data can be used for the training of the search model, which is convenient for comparing the quality of the training results.

S202, preprocessing the document data to obtain a document-keyword matrix.

Specifically, the document data is in a text format, and the text data is converted from the text format into a corresponding matrix by preprocessing, which specifically includes:

Clean literature data to get text;

Extract the entry from the text to get the initial entry;

Expand the initial entry to get the expanded entry;

Convert the initial and extended terms into a document-keyword matrix.

In the embodiment of the present invention, cleaning the document data includes lowercasing the text, removing redundant spaces and punctuation, and splitting the text into clauses, so as to obtain plain text. Entries are extracted from the text by a preset word segmentation algorithm, which can be a dictionary-based method such as forward maximum matching (MM), reverse maximum matching (RMM), or bi-directional matching (BM). Each initial entry is then matched against candidate words, and the words whose matching degree meets the requirement are used as the expanded entries of that entry.
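The cleaning and dictionary-based segmentation steps can be illustrated with a minimal sketch. The function names, the regular expressions and the `max_len` parameter are assumptions made for illustration; the source does not specify an implementation, and a production dictionary would be far larger than the toy one used in the usage note.

```python
import re


def clean_text(raw: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    lowered = raw.lower()
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    return re.sub(r"\s+", " ", no_punct).strip()


def forward_max_match(text: str, dictionary: set[str], max_len: int = 4) -> list[str]:
    """Forward maximum matching (MM) over a string without word delimiters."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary hit.
        # A single character is always accepted so the scan cannot stall.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens
```

For example, with the toy dictionary `{"研究", "研究生", "生命", "起源"}` and `max_len=3`, the string `研究生命起源` is segmented greedily from the left, so the longer entry `研究生` wins over `研究`. Reverse maximum matching scans from the right instead and can produce a different split, which is why bi-directional matching compares the two.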

Academic documents follow a rigorous logic, and the four parts above (the title, the abstract, the body text, and the name of the academic conference or journal in which the document was published) differ in importance and in how concisely they summarize the content. It is therefore common practice to assign different weights to different parts of a document when searching. The present invention adopts the commonly used BM25F measure to score keywords in documents.

BM25F is an improved version of the classic BM25 algorithm. BM25 treats a document as a whole when calculating relevance. In practice, however, and especially in vertical search, a document can be divided into multiple independent fields; a web page, for example, may be divided into fields such as title, content, and subject words. These fields do not contribute equally to the topic of the document, so they should be weighted differently. BM25 does not take this into account. BM25F improves on it by no longer treating a document as an undivided whole: it scores each field separately, so that the BM25F score of a keyword is a weighted sum of the keyword's scores in each field.
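A field-weighted score of this kind can be sketched as follows. This follows one common formulation of BM25F (length-normalised term frequencies are weighted per field, summed, and then saturated); the field weights, the `k1` and `b` defaults and the smoothed IDF are assumptions for illustration, since the source does not give the exact variant or parameter values it uses.

```python
import math


def bm25f_score(term, doc_fields, field_weights, avg_field_len,
                doc_freq, n_docs, k1=1.2, b=0.75):
    """Score one term for one document whose text is split into fields.

    doc_fields:    {field name: list of tokens in that field}
    field_weights: {field name: weight}, e.g. title weighted above body
    avg_field_len: {field name: average token count of that field}
    doc_freq:      number of documents containing the term
    n_docs:        total number of documents
    """
    weighted_tf = 0.0
    for field, tokens in doc_fields.items():
        tf = tokens.count(term)
        if tf == 0:
            continue
        # Per-field length normalisation, as in BM25.
        norm = 1.0 - b + b * len(tokens) / avg_field_len[field]
        weighted_tf += field_weights[field] * tf / norm
    # Smoothed IDF that stays non-negative (an assumed variant).
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    return idf * weighted_tf / (k1 + weighted_tf)
```

With a title weight of 3 and a body weight of 1, a keyword that occurs once in the title scores higher than the same keyword occurring once in an equally long body, which is the behaviour the field weighting is meant to produce.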

Further, a document-keyword matrix is constructed from the initial entries and the expanded entries. The number of rows of the matrix is the total number of documents, and the number of columns is the total number of keywords across all documents, that is, the total number of initial entries and expanded entries. The value stored in row i, column j is the BM25F score of the j-th keyword in the i-th document of the data set. The document-keyword matrix is shown in Figure 3.
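The row and column layout can be made concrete with a short sketch. The `score` argument stands for whatever keyword scoring function is used (BM25F in the source); the `count_score` helper below is only a hypothetical stand-in so that the example runs on its own.

```python
def build_doc_keyword_matrix(docs, keywords, score):
    """Row i is document i, column j is keyword j, entry is score(keyword, doc)."""
    return [[score(kw, doc) for kw in keywords] for doc in docs]


def count_score(keyword, doc_tokens):
    """Hypothetical stand-in for BM25F: raw occurrence count."""
    return doc_tokens.count(keyword)


docs = [["graph", "search", "graph"], ["ranking"]]
keywords = ["graph", "ranking", "search"]
matrix = build_doc_keyword_matrix(docs, keywords, count_score)
```

Here `matrix` has two rows (one per document) and three columns (one per keyword).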

S203, according to the search keywords, perform two matrix transformation processes on the document-keyword matrix to obtain a scholar-keyword matrix.

In the embodiment of the present invention, in order to obtain pseudo-relevance feedback information, two rounds of matrix transformation are performed on the document-keyword matrix. After the first round, the transformed scholar-keyword matrix is used to retrieve the scholars most relevant to the search keywords and their most relevant documents, and to generate the pseudo-relevance feedback information. During the second round, the pseudo-relevance feedback information is used to guide the matrix transformation and make it more reasonable. The first and second matrix transformations are illustrated in the schematic diagram of the embodiment.

Further, performing two rounds of matrix transformation on the document-keyword matrix according to the search keywords to obtain the scholar-keyword matrix includes:

Convert the literature-keyword matrix into the initial scholar-keyword matrix;

Determine pseudo-relevant feedback information according to the initial scholar-keyword matrix and search keywords;

The initial scholar-keyword matrix is transformed into a scholar-keyword matrix based on pseudo-relevant feedback information.

Pseudo-relevance feedback, also known as blind relevance feedback, automates the manual part of relevance feedback so that users no longer need to perform additional interactions. A normal retrieval pass is performed first and the most relevant documents are returned as the initial set; the top k documents are then assumed to be relevant; finally, relevance feedback is carried out as usual on this assumption. When pseudo-relevance feedback information is exploited, the transformation can assign higher scores to scholars whose relevance exceeds the feedback benchmark and lower scores to scholars whose relevance falls below it, with a larger gap leading to a lower score. This feedback-based matrix transformation makes each keyword score in the matrix more reasonable.

Further, the initial scholar-keyword matrix is constructed according to formula (1). The score of a keyword for a scholar equals the sum of the scores of that keyword in all the documents written by the scholar (the first term of the numerator), multiplied by the logarithm of the number of the scholar's documents in which the keyword appears (the second term of the numerator), and divided by the logarithm of the total number of documents written by the scholar (the denominator). In formula (1), the parameters ω₁, ω₂ and ω₃ are assigned to these three terms respectively to balance their importance.

In formula (1), the left-hand side is the score of each element in the initial scholar-keyword matrix; the terms on the right-hand side are the score of the keyword in all documents written by the scholar, the number of the scholar's documents in which the keyword appears, and the total number of documents written by the scholar. Here i indexes the scholars and J denotes the total number of keywords in the document-keyword matrix. Calculating the score of each element yields the initial scholar-keyword matrix, whose number of rows is the number of scholars and whose number of columns is the total number of keywords across all documents, that is, the total number of initial entries and expanded entries.
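The symbols of formula (1) are not preserved in this text. One reading that is consistent with the verbal description is sketched below. It assumes that ω₁, ω₂ and ω₃ act as multiplicative coefficients, and the symbols S, s, n, N and D are placeholders introduced here, not the notation of the original formula.

```latex
S^{(1)}_{i,j}
  \;=\;
  \frac{\bigl(\omega_1 \sum_{d \in D_i} s_{d,j}\bigr)\,\bigl(\omega_2 \log n_{i,j}\bigr)}
       {\omega_3 \log N_i},
  \qquad j = 1, \dots, J
```

Here D_i is the set of documents written by scholar i, s_{d,j} is the score of keyword j in document d, n_{i,j} is the number of documents in D_i that contain keyword j, and N_i is the size of D_i. Under this reading the requirement that each scholar has at least two documents guarantees that log N_i is positive, so the denominator is never zero.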

Further, determining the pseudo-relevance feedback information according to the initial scholar-keyword matrix and the search keywords includes:

Perform matrix decomposition on the initial scholar-keyword matrix to obtain the scholar vector;

Calculate the first cosine similarity between the search keyword and the scholar vector;

According to the first cosine similarity, the search scholars are selected from the scholars, and the search scholars are the top n scholars with the highest correlation with the search keywords;

Perform matrix decomposition on the document-keyword matrix to obtain the document vector;

Calculate the second cosine similarity between the search keyword and the document vector;

According to the second cosine similarity, the top n documents of each search scholar with the highest relevance to the search keywords are selected from the documents;

Calculate the third cosine similarity between the top n documents of each search scholar and the search keywords respectively;

The mean value of the third cosine similarity is calculated, and the obtained first mean value is used as pseudo-correlation feedback information.

In the embodiment of the present invention, the initial scholar-keyword matrix is decomposed by an LSA (Latent Semantic Analysis) model to obtain the scholar vectors. The LSA model is an existing algorithm and is not explained further here. The decomposition is shown in Figure 4, a schematic diagram of the document-keyword matrix, the scholar-keyword matrix and the LSA matrix decomposition. The LSA model decomposes the initial scholar-keyword matrix (the scholar-keyword matrix in Figure 4) into the product of three matrices: the keyword latent space matrix, the latent space matrix (uncertain variable matrix), and the scholar matrix, which consists of the scholar vectors.
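LSA is commonly implemented as a truncated singular value decomposition, and the following is a minimal sketch under that assumption using NumPy. The function names and the `rank` parameter are illustrative; the source does not state the latent dimension or the library it uses.

```python
import numpy as np


def lsa_vectors(matrix, rank):
    """Truncated SVD: map each row of `matrix` to a `rank`-dimensional vector.

    Returns (row_vectors, keyword_basis). Each row of `row_vectors` is the
    latent vector of one scholar (or document); `keyword_basis` projects
    keyword-space vectors into the same latent space.
    """
    m = np.asarray(matrix, dtype=float)
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    k = min(rank, len(s))
    row_vectors = u[:, :k] * s[:k]   # U_k scaled by the singular values
    keyword_basis = vt[:k, :]        # V_k transposed
    return row_vectors, keyword_basis


def fold_in_query(query_vec, keyword_basis):
    """Project a keyword-space query vector into the latent space."""
    return keyword_basis @ np.asarray(query_vec, dtype=float)
```

A search keyword expressed as a vector over the keyword columns can be folded into the latent space with `fold_in_query` and then compared with the scholar or document vectors by cosine similarity.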

Wherein, n is a positive integer, and in the embodiment of the present invention, n is 5.

In the embodiment of the present invention, LSA is used to decompose the initial scholar-keyword matrix. After the decomposition, the cosine formula is used to calculate the first cosine similarity between the search keyword (in vector form) and each scholar vector. The scholars are sorted by the first cosine similarity in descending order, and the top 5 scholars most relevant to the search keywords are selected.

Further, the document-keyword matrix is also decomposed with LSA to obtain the document vectors. After the decomposition, the cosine formula is used to calculate the second cosine similarity between the search keyword (in vector form) and each document vector, and, in descending order of the second cosine similarity, the top n documents of each search scholar that are most relevant to the search keywords are selected. Since n is set to 5 above, it can also be set to 5 here, that is, the setting of n remains the same.

Further, formula (1) reflects three observations. The higher the sum of the scores of a keyword in all the documents written by a scholar, the higher the score in the scholar-keyword matrix. The more of the scholar's documents contain the keyword, the higher the score in the scholar-keyword matrix. And if the total number of documents written by the scholar is large, the probability that the keyword appears is naturally high, so the total number of documents is used in the denominator to account for this. Through this transformation, the information of each document in the document-keyword matrix is aggregated into the scholar-keyword matrix on a per-scholar basis, so that a scholar search considers all the documents written by the scholar comprehensively.
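The first-round aggregation can be sketched in code. This follows the verbal description of formula (1) literally and assumes the three weights are multiplicative coefficients; the function names, the default weights of 1.0 and the data layout are illustrative assumptions, not the source's implementation.

```python
import math


def first_round_score(doc_scores, w1=1.0, w2=1.0, w3=1.0):
    """Aggregate one keyword's scores over all documents of one scholar.

    doc_scores: the keyword's score in each of the scholar's documents.
    """
    total_docs = len(doc_scores)
    if total_docs < 2:
        # The method requires at least two documents per scholar,
        # which also keeps log(total_docs) strictly positive.
        raise ValueError("each scholar is expected to have at least two documents")
    docs_with_keyword = sum(1 for s in doc_scores if s > 0)
    if docs_with_keyword == 0:
        return 0.0
    # Literal reading: a keyword found in exactly one document gets log(1) = 0.
    numerator = (w1 * sum(doc_scores)) * (w2 * math.log(docs_with_keyword))
    return numerator / (w3 * math.log(total_docs))


def first_round_matrix(doc_matrix, scholar_docs, w1=1.0, w2=1.0, w3=1.0):
    """scholar_docs[i] lists the row indices in doc_matrix of scholar i's documents."""
    n_keywords = len(doc_matrix[0])
    return [
        [first_round_score([doc_matrix[d][j] for d in docs], w1, w2, w3)
         for j in range(n_keywords)]
        for docs in scholar_docs
    ]
```

Note that under this literal reading a keyword appearing in only one of a scholar's documents receives a score of zero; whether the original formula smooths the logarithm to avoid this is not recoverable from the text.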

Therefore, the third cosine similarity between the top n documents of each search scholar and the search keywords is calculated, and the first mean obtained by averaging it represents the degree of similarity that documents highly relevant to the search keywords should have. The first mean can thus be used as the pseudo-relevance feedback information that guides the second matrix transformation. For example, if the average similarity between a scholar's top 5 relevant documents and the search keywords is higher than the first mean, a higher score is assigned to that scholar in the scholar-keyword matrix. This feedback-guided matrix transformation transforms the matrix more effectively and improves search quality when the transformed matrix is used for scholar search.
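The computation of the first mean can be sketched as follows, assuming the query, scholar vectors and document vectors already live in the same vector space (for instance after an LSA projection). The function names and the `scholar_docs` mapping are hypothetical helpers introduced for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_n(query, vectors, n):
    """Indices of the n vectors most similar to the query, best first."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine(query, vectors[i]), reverse=True)
    return ranked[:n]


def pseudo_relevance_feedback(query, scholar_vecs, doc_vecs, scholar_docs, n=5):
    """First mean: average similarity of the top-n documents of the top-n scholars.

    scholar_docs[s] lists the indices in doc_vecs of the documents of scholar s.
    """
    sims = []
    for s in top_n(query, scholar_vecs, n):
        own = scholar_docs[s]
        best = top_n(query, [doc_vecs[d] for d in own], n)
        sims.extend(cosine(query, doc_vecs[own[k]]) for k in best)
    return sum(sims) / len(sims) if sims else 0.0
```

The returned value plays the role of the benchmark against which each scholar's own average similarity is later compared.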

Further, converting the initial scholar-keyword matrix into a scholar-keyword matrix according to the pseudo-relevant feedback information includes:

Obtain the top n literatures of each scholar that are most relevant to the search keywords;

Calculate the fourth cosine similarity between each scholar's top n literature and search keywords;

Calculate the mean of the fourth cosine similarity to obtain the second mean;

According to the first mean and the second mean, the initial scholar-keyword matrix is transformed into a scholar-keyword matrix.

Specifically, formula (2) is used to calculate the score of each element of the scholar-keyword matrix. Compared with formula (1), the denominator of formula (2) adds a term that compares the average relevance of the scholar's more relevant documents (the second mean) with the obtained pseudo-relevance feedback information (the first mean above).

In formula (2), the left-hand side is the score of each element of the scholar-keyword matrix, AVG_τ denotes the second mean, avg_τ denotes the first mean, and the remaining symbols are the same as in formula (1) and are not repeated here.

Further, the top n documents of each scholar that are most relevant to the search keywords are obtained in the same way as the search scholars are selected according to the first cosine similarity: the cosine similarity between the search keywords and each of the scholar's documents is calculated, and the documents are sorted in descending order of similarity to select the top n. The fourth cosine similarity between each scholar's top n documents and the search keywords is calculated with the same cosine formula, and the fourth cosine similarities of each scholar are summed and averaged to obtain the second mean.

According to formula (2), if the average relevance of the scholar's more relevant documents is greater than or equal to the pseudo-relevance feedback information (that is, AVG_τ ≥ avg_τ), the comparison term in the denominator takes a smaller value. Conversely, if the average relevance of the scholar's more relevant documents is less than the pseudo-relevance feedback, the larger the gap, the larger the value of this term and the smaller the overall value of formula (2), so the scholar is assigned a smaller score. The pseudo-relevance feedback information therefore serves as a benchmark: the further a scholar falls below the benchmark, the lower the score, which makes the scores in the resulting scholar-keyword matrix more reasonable.
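The behaviour described for formula (2) can be sketched under explicit assumptions. The exact form of the comparison term is not preserved in this text; the sketch assumes it is max(avg_τ − AVG_τ, 0) and that it is added to the logarithmic denominator of formula (1). Both choices, along with the function and parameter names, are assumptions that merely reproduce the described behaviour.

```python
import math


def feedback_penalty(second_mean, first_mean):
    """Assumed comparison term: zero when the scholar meets the benchmark,
    otherwise the size of the shortfall."""
    return max(first_mean - second_mean, 0.0)


def second_round_score(numerator, total_docs, second_mean, first_mean, w3=1.0):
    """Formula-(1)-style score with an assumed additive penalty in the denominator.

    numerator:   the numerator of formula (1) for this scholar and keyword
    total_docs:  number of documents written by the scholar (at least 2)
    second_mean: AVG_tau, the scholar's own average top-n similarity
    first_mean:  avg_tau, the pseudo-relevance feedback benchmark
    """
    denominator = w3 * math.log(total_docs) + feedback_penalty(second_mean, first_mean)
    return numerator / denominator
```

A scholar at or above the benchmark keeps the formula (1) score unchanged, while a scholar below it is penalised in proportion to the gap.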

S204, the scholar-keyword matrix is input into the search model to be trained for training, and the average training loss is output.

In the embodiment of the present invention, the search model to be trained is an XGBoost (eXtreme Gradient Boosting) model. The XGBoost model converts the ranking problem of scholar search into a binary classification problem over the order of results: given result A and result B, deciding whether A should be ranked ahead of B is a binary decision. This reduces the demand for data, so training can still be performed when supervised ranking data is limited. For each scholar-keyword entry in the scholar-keyword matrix, a corresponding loss value is obtained during XGBoost training, and all the loss values are averaged to obtain the current average training loss of the XGBoost model.
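The reduction from ranking to binary classification can be illustrated without any particular model. The sketch below turns a labelled ranking into pairwise examples and averages a per-example loss; the use of feature differences and of log loss are assumptions for illustration, since the source does not state how pairs are encoded or which loss XGBoost is configured with.

```python
import math


def to_pairwise(features, relevance):
    """Turn a ranked list into binary examples.

    Each example is (feature difference A - B, label), where the label is 1
    if result A should be ranked ahead of result B and 0 otherwise.
    Pairs with equal relevance carry no ordering information and are skipped.
    """
    examples = []
    for a in range(len(features)):
        for b in range(len(features)):
            if a == b or relevance[a] == relevance[b]:
                continue
            diff = [x - y for x, y in zip(features[a], features[b])]
            examples.append((diff, 1 if relevance[a] > relevance[b] else 0))
    return examples


def average_log_loss(probs, labels, eps=1e-12):
    """Mean binary cross-entropy over all pairwise examples."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)
```

A handful of ranked results yields a quadratic number of pairs, which is one way to see why the pairwise formulation needs fewer labelled rankings. XGBoost itself offers a pairwise ranking objective (`rank:pairwise`); whether the source uses that objective is not stated.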

S205 , if the average training loss does not reach the preset loss threshold, a Bayesian optimization network is used to update the parameters of the search model to be trained.

The preset loss threshold can be set according to the actual training situation, for example, the preset loss threshold is 0.1. When the average training loss is greater than the preset loss threshold, a Bayesian optimization network (Bayesian Optimization) is used to optimize and select the parameters of the current search model to be trained, thereby minimizing the loss value of the search model to be trained.

The Bayesian optimization network is a black-box optimization algorithm used to find the extremum of a function whose expression is unknown. It first generates an initial set of candidate solutions, then uses these points to search for the next point most likely to be the extremum, adds that point to the set, and repeats this step until the iteration terminates; finally, the point with the largest function value among these points is taken as the solution of the extremum problem. It is more efficient than grid search and random search because it uses the information from previously searched points. Because the Bayesian optimization network considers the effect of previous parameter choices before deciding on the next one, it selects and optimizes the parameters of the search model more efficiently, enabling more efficient, higher-quality scholar search.
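The search loop described above can be sketched with a Gaussian-process surrogate and a lower-confidence-bound acquisition rule. This is one standard way to realise Bayesian optimization, not necessarily the one used in the source; the RBF kernel, its length scale, `kappa`, the one-dimensional candidate grid and all function names are assumptions. The sketch minimises a loss, which is equivalent to maximising its negative.

```python
import numpy as np


def rbf(a, b, length=0.2):
    """Squared-exponential kernel between two 1-D arrays of points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)


def gp_posterior(x_seen, y_seen, x_cand, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP at x_cand."""
    k = rbf(x_seen, x_seen) + noise * np.eye(len(x_seen))
    k_s = rbf(x_seen, x_cand)
    solve = np.linalg.solve(k, k_s)           # K^{-1} K_s
    mean = solve.T @ y_seen
    var = 1.0 - np.sum(k_s * solve, axis=0)   # prior variance is 1 for this kernel
    return mean, np.sqrt(np.clip(var, 0.0, None))


def bayes_opt_minimize(loss, candidates, n_init=3, n_iter=10, kappa=2.0, seed=0):
    """Sequentially pick the candidate with the lowest confidence bound."""
    rng = np.random.default_rng(seed)
    x_seen = rng.choice(candidates, size=n_init, replace=False)
    y_seen = np.array([loss(x) for x in x_seen])
    for _ in range(n_iter):
        # Standardise the observed losses so the unit-variance prior is sensible.
        scale = y_seen.std() or 1.0
        y_norm = (y_seen - y_seen.mean()) / scale
        mean, std = gp_posterior(x_seen, y_norm, candidates)
        lcb = mean - kappa * std              # optimistic estimate of the loss
        x_next = candidates[np.argmin(lcb)]
        x_seen = np.append(x_seen, x_next)
        y_seen = np.append(y_seen, loss(x_next))
    best = np.argmin(y_seen)
    return x_seen[best], y_seen[best]
```

In the setting of the source, `loss` would train the search model with a given parameter value and return its average training loss, and `candidates` would be the grid of parameter values under consideration. Each new evaluation is chosen using all earlier evaluations, which is what distinguishes this procedure from grid search and random search.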

S206, retrain the updated search model to be trained, stop training until the average training loss reaches a preset loss threshold, and use the search model to be trained at this time as a scholar search model.

The search model to be trained after the parameters are updated in the above step S205 is used as the current training model, and the training is repeated, and the training average loss of each training output is compared with the preset loss threshold until the training average loss reaches the preset loss threshold. Stop training and use the current parameters as the final parameters of the search model to be trained, and then form a scholar search model
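The train, compare, update and retrain cycle of steps S204 to S206 can be sketched as a simple loop. The callables `train_once` and `propose_params` are hypothetical placeholders for model training and for the Bayesian optimization step, and the `max_rounds` cap is an added safeguard that the source does not mention.

```python
def train_until_threshold(train_once, propose_params, params,
                          loss_threshold=0.1, max_rounds=50):
    """Repeat training until the average loss reaches the threshold.

    train_once(params)      -> average training loss for these parameters
    propose_params(history) -> next parameters, given all (params, loss) so far
    """
    history = []
    for _ in range(max_rounds):
        avg_loss = train_once(params)
        history.append((params, avg_loss))
        if avg_loss <= loss_threshold:
            # Threshold reached: the current parameters become final.
            return params, avg_loss, history
        params = propose_params(history)
    raise RuntimeError("loss threshold not reached within max_rounds")
```

Passing the full history to `propose_params` mirrors the point that each new parameter choice takes the effect of earlier choices into account.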

It should be noted that the data sets used for training in the present invention are Chinese literature data, a thesaurus, and a search ranking data set; that is, the trained scholar search model can be used to search for Chinese scholars with keywords expressed in Chinese. Similarly, if the data sets used are in French, the scholar search model can also operate, and the training and optimization frameworks remain unchanged. The method is therefore general across languages, which reduces the limitations of the search model.

In the embodiment of the present invention, the document data is preprocessed to obtain a document-keyword matrix, and two matrix transformations are performed on the document-keyword matrix according to the search keywords to obtain a scholar-keyword matrix. The scholar-keyword matrix is input into the search model to be trained for training, and the average training loss is output. If the average training loss does not reach the preset loss threshold, the Bayesian optimization network is used to update the parameters of the search model to be trained, and the updated search model is retrained; training stops when the average training loss reaches the preset loss threshold, and the search model at that point is used as the scholar search model. In other words, the matrix transformations integrate the information of each document in the document-keyword matrix into the scholar-keyword matrix with the scholar as the unit, so that a scholar search considers all the documents written by a scholar together and accurately reflects the correlation between the scholar and the search keywords. The scholar search task can therefore be completed better, with more accurate results and a more efficient, higher-quality search. At the same time, because the present invention relies only on a small number of labeled search keywords and on unlabeled document data for training, it reduces the dependence on large amounts of labeled ranking supervision information, so that training and learning can proceed with limited supervision information.

It should be understood that the size of the sequence numbers of the steps in the above embodiments does not mean the sequence of execution, and the execution sequence of each process should be determined by its functions and internal logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.

In the embodiment of the present invention, a search model training apparatus is also provided, and each module included in the search model training apparatus is used to execute a corresponding step in the embodiment corresponding to FIG. 2. For details, please refer to the relevant description in the embodiment corresponding to FIG. 2. FIG. 5 shows a schematic structural diagram of the first embodiment of the search model training apparatus of the present invention, including an acquisition module 51, a preprocessing module 52, a matrix transformation module 53, a training module 54, a parameter adjustment module 55, and a training completion module 56:

The acquisition module 51 is used to acquire a data set, wherein the data set includes document data and search key data, the document data includes scholars and documents, each scholar includes at least two documents, and the search key data includes search keywords;

The preprocessing module 52 is used for preprocessing the document data to obtain a document-keyword matrix;

The matrix transformation module 53 is used to perform two matrix transformation processes on the document-keyword matrix according to the search keywords, so as to obtain the scholar-keyword matrix;

The training module 54 is used to input the scholar-keyword matrix into the search model to be trained for training, and output the average training loss;

The parameter adjustment module 55 is used to update the parameters of the search model to be trained by using a Bayesian optimization network if the average training loss does not reach the preset loss threshold;

The training completion module 56 is used to retrain the updated search model to be trained, stop training until the average training loss reaches a preset loss threshold, and use the search model to be trained at this time as a scholar search model.

Further, the preprocessing module 52 includes:

The cleaning unit is used to clean the literature data to obtain text;

The extraction unit is used to extract the entry from the text to obtain the initial entry;

The expansion unit is used to expand the initial entry to obtain the expanded entry;

The transformation unit is used to transform the initial entry and the expanded entry into a document-keyword matrix.
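The four units of the preprocessing module can be pictured with the short Python sketch below. It is a simplified illustration under several assumptions that are not taken from the description: the text is English and split on whitespace (Chinese text would need a word segmenter), the stop-word list and the `thesaurus` dictionary are hypothetical, and each matrix entry is a plain term count.

```python
import re
from collections import Counter

def clean(raw):
    """Cleaning unit: lower-case and drop everything except letters, digits and spaces."""
    return re.sub(r"[^a-z0-9 ]+", " ", raw.lower())

def extract_terms(text, stopwords=frozenset({"a", "the", "of", "for", "and"})):
    """Extraction unit: split into initial entries, skipping assumed stop words."""
    return [t for t in text.split() if t not in stopwords]

def expand_terms(terms, thesaurus):
    """Expansion unit: look up synonyms of each initial entry in a thesaurus."""
    expanded = []
    for t in terms:
        expanded.extend(thesaurus.get(t, []))
    return expanded

def document_keyword_matrix(raw_docs, thesaurus):
    """Transformation unit: one row per document, one column per keyword."""
    per_doc = []
    for raw in raw_docs:
        initial = extract_terms(clean(raw))
        per_doc.append(Counter(initial + expand_terms(initial, thesaurus)))
    vocab = sorted(set().union(*per_doc))
    matrix = [[counts[term] for term in vocab] for counts in per_doc]
    return vocab, matrix

vocab, doc_kw = document_keyword_matrix(
    ["Deep Learning for Search!", "Search ranking of the scholars"],
    thesaurus={"search": ["retrieval"]},
)
```

In this toy run both documents gain a count for "retrieval" through the expansion step, even though neither contains the word itself.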

Further, the matrix transformation module 53 includes:

The first conversion unit is used to convert the document-keyword matrix into an initial scholar-keyword matrix;

The pseudo-correlation unit is used to determine pseudo-relevant feedback information according to the initial scholar-keyword matrix and the search keywords;

The second conversion unit is configured to convert the initial scholar-keyword matrix into a scholar-keyword matrix according to the pseudo-relevant feedback information.
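As a rough illustration of the first conversion, the sketch below merges the rows of a document-keyword matrix that belong to the same scholar. The aggregation rule (a plain sum), the assumption of one scholar per document, and the `doc_authors` list are all assumptions made for this example; this passage does not specify how the rows are combined.

```python
def initial_scholar_keyword_matrix(doc_kw, doc_authors):
    """Sum the document rows of each scholar into one scholar row (assumed rule)."""
    scholars = sorted(set(doc_authors))
    index = {s: i for i, s in enumerate(scholars)}
    n_terms = len(doc_kw[0]) if doc_kw else 0
    matrix = [[0] * n_terms for _ in scholars]
    for row, author in zip(doc_kw, doc_authors):
        target = matrix[index[author]]
        for j, value in enumerate(row):
            target[j] += value
    return scholars, matrix

# Three documents over three keywords; the first two share an author.
scholars, scholar_kw = initial_scholar_keyword_matrix(
    [[1, 0, 2], [0, 1, 1], [3, 0, 0]],
    ["wang", "wang", "ye"],
)
```

The result has one row per scholar, so later steps can score a scholar against the search keywords as a single unit.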

Further, the pseudo-correlation unit includes:

The first decomposition subunit is used to perform matrix decomposition on the initial scholar-keyword matrix to obtain a scholar vector;

The first calculation subunit is used to calculate the first cosine similarity between the search keyword and the scholar vector;

The first screening subunit is used to screen out the search scholars from the scholars according to the first cosine similarity, wherein the search scholars are the top n scholars with the highest correlation with the search keywords;

The second decomposition subunit is used to perform matrix decomposition on the document-keyword matrix to obtain a document vector;

The second calculation subunit is used to calculate the second cosine similarity between the search keyword and the document vector;

The second screening subunit is used to screen out the top n documents of each search scholar with the highest correlation with the search keywords from the documents according to the second cosine similarity;

The third calculation subunit is used to calculate the third cosine similarity between the top n documents of each search scholar and the search keywords respectively;

The first mean value subunit is used to calculate the mean value of the third cosine similarity, and use the obtained first mean value as the pseudo-relevant feedback information.
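The chain of subunits above can be sketched in Python as follows. This is a simplified illustration with labeled assumptions: the matrix decomposition step is omitted and raw matrix rows are used directly as scholar and document vectors, the search keywords are represented as a vector in the same keyword space, each document has one scholar, and the first mean is taken over all retained documents pooled together.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_n(items, scores, n):
    """Return the n items with the highest scores."""
    ranked = sorted(zip(scores, items), key=lambda p: p[0], reverse=True)
    return [item for _, item in ranked[:n]]

def pseudo_relevance_mean(query, scholar_vecs, doc_vecs, doc_authors, n=2):
    # First cosine similarity: search keywords against each scholar vector.
    scholars = list(scholar_vecs)
    s_scores = [cosine(query, scholar_vecs[s]) for s in scholars]
    search_scholars = top_n(scholars, s_scores, n)
    sims = []
    for s in search_scholars:
        # Second cosine similarity: search keywords against this scholar's documents.
        own = [i for i, a in enumerate(doc_authors) if a == s]
        d_scores = [cosine(query, doc_vecs[i]) for i in own]
        best_docs = top_n(own, d_scores, n)
        # Third cosine similarity: search keywords against the retained top-n documents.
        sims.extend(cosine(query, doc_vecs[i]) for i in best_docs)
    # First mean: used as the pseudo-relevant feedback information.
    return sum(sims) / len(sims) if sims else 0.0

query = [1, 0]
scholar_vecs = {"a": [1, 0], "b": [0, 1], "c": [1, 1]}
doc_vecs = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 0], [0, 1]]
doc_authors = ["a", "a", "b", "b", "c", "c"]
first_mean = pseudo_relevance_mean(query, scholar_vecs, doc_vecs, doc_authors, n=2)
```

No relevance labels are needed here: the top-ranked scholars and documents are simply treated as relevant, which is why the feedback is called pseudo-relevant.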

Further, the second conversion unit includes:

The acquisition subunit is used to acquire the top n documents of each scholar that are most relevant to the search keywords;

The fourth calculation subunit is used to calculate the fourth cosine similarity between the top n documents of each scholar and the search keywords;

The second mean subunit is used to perform mean calculation on the fourth cosine similarity to obtain the second mean;

The conversion subunit is used to convert the initial scholar-keyword matrix into a scholar-keyword matrix according to the first mean value and the second mean value.

The function implementation of each module/unit in the above search model training apparatus corresponds to each step in the above search model training method embodiment, and the functions and implementation process thereof will not be repeated here.

FIG. 6 is a schematic diagram of a terminal device provided by an embodiment of the present invention. As shown in FIG. 6, the terminal device 6 of this embodiment includes a processor 60, a memory 61, and a computer program 62 stored in the memory 61 and executable on the processor 60, such as a search model training program. When the processor 60 executes the computer program 62, the steps in each of the foregoing search model training method embodiments are implemented, for example, steps S201 to S206 shown in FIG. 2. Alternatively, when the processor 60 executes the computer program 62, the functions of the modules/units in the foregoing apparatus embodiments are implemented, for example, the functions of the modules 51 to 56 shown in FIG. 5.

Exemplarily, the computer program 62 may be divided into one or more modules/units, which are stored in the memory 61 and executed by the processor 60 to implement the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer program 62 in the search model training apparatus/terminal device 6. For example, the computer program 62 can be divided into an acquisition module, a preprocessing module, a matrix transformation module, a training module, a parameter adjustment module, and a training completion module (modules in a virtual device), and the specific functions of each module are as described above, which will not be repeated here.

The terminal device 6 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The terminal device 6 may include, but is not limited to, a processor 60 and a memory 61. Those skilled in the art can understand that FIG. 6 is only an example of the terminal device 6 and does not constitute a limitation on the terminal device 6, which may include more or fewer components than shown, or combine some components, or use different components. For example, the terminal device 6 may further include an input and output device, a network access device, a bus, and the like.

The processor 60 may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

The memory 61 may be an internal storage unit of the terminal device 6, such as a hard disk or a memory of the terminal device 6. The memory 61 may also be an external storage device of the terminal device 6, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) equipped on the terminal device 6. Further, the memory 61 may also include both an internal storage unit of the terminal device 6 and an external storage device. The memory 61 is used to store the computer program and other programs and data required by the terminal device. The memory 61 can also be used to temporarily store data that has been output or will be output.

Those skilled in the art can clearly understand that, for convenience and brevity of description, only the above division of functional units and modules is used as an example. In practical applications, the above functions may be allocated to different functional units and modules as required; that is, the internal structure of the system may be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments may be integrated in one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only for the convenience of distinguishing them from each other, and are not used to limit the protection scope of the present invention. For the specific working process of the units and modules in the above system, reference may be made to the corresponding process in the foregoing method embodiments, which will not be repeated here.

In the foregoing embodiments, the description of each embodiment has its own emphasis. For parts that are not described or described in detail in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.

Those of ordinary skill in the art can realize that the units and algorithm steps of each example described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of the present invention.

In the embodiments provided by the present invention, it should be understood that the disclosed system/terminal device and method may be implemented in other manners. For example, the system/terminal device embodiments described above are only illustrative. For example, the division of the modules or units is only a logical function division; in actual implementation, there may be other division methods, for example, multiple units or components may be combined or integrated into another device, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical, mechanical, or other forms.

The units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.

In addition, each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.

The integrated modules/units, if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of the present invention may also be completed by a computer program instructing the relevant hardware; the computer program may be stored in a computer-readable storage medium, and when the program is executed by a processor, the steps of the foregoing method embodiments can be implemented. The computer program includes computer program code, and the computer program code may be in the form of source code, object code, an executable file, some intermediate form, and the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electrical carrier signal, a telecommunication signal, a software distribution medium, etc. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.

The above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all fall within the protection scope of the present invention.

Claims

A method for training a search model, comprising:

acquiring a data set, wherein the data set includes document data and search key data, the document data includes scholars and documents, each scholar includes at least two documents, and the search key data includes search keywords;

preprocessing the document data to obtain a document-keyword matrix;

performing, according to the search keywords, two matrix transformation processes on the document-keyword matrix to obtain a scholar-keyword matrix;

inputting the scholar-keyword matrix into a search model to be trained for training, and outputting an average training loss;

if the average training loss does not reach a preset loss threshold, updating the parameters of the search model to be trained by using a Bayesian optimization network;

retraining the updated search model to be trained, stopping training when the average training loss reaches the preset loss threshold, and using the search model to be trained at that point as a scholar search model.
The search model training method according to claim 1, wherein the preprocessing of the document data to obtain a document-keyword matrix comprises:

cleaning the document data to obtain text;

extracting entries from the text to obtain initial entries;

expanding the initial entries to obtain expanded entries;

converting the initial entries and the expanded entries into the document-keyword matrix.
The search model training method according to claim 1, wherein the performing, according to the search keywords, of two matrix transformation processes on the document-keyword matrix to obtain the scholar-keyword matrix comprises:

converting the document-keyword matrix into an initial scholar-keyword matrix;

determining pseudo-relevant feedback information according to the initial scholar-keyword matrix and the search keywords;

converting the initial scholar-keyword matrix into the scholar-keyword matrix according to the pseudo-relevant feedback information.
The search model training method according to claim 3, wherein the determining of the pseudo-relevant feedback information according to the initial scholar-keyword matrix and the search keywords comprises:

performing matrix decomposition on the initial scholar-keyword matrix to obtain a scholar vector;

calculating a first cosine similarity between the search keyword and the scholar vector;

selecting search scholars from the scholars according to the first cosine similarity, wherein the search scholars are the top n scholars with the highest correlation with the search keywords;

performing matrix decomposition on the document-keyword matrix to obtain a document vector;

calculating a second cosine similarity between the search keyword and the document vector;

selecting from the documents, according to the second cosine similarity, the top n documents of each of the search scholars with the highest correlation with the search keywords;

calculating a third cosine similarity between the top n documents of each search scholar and the search keyword respectively;

performing mean value calculation on the third cosine similarity, and using the obtained first mean value as the pseudo-relevant feedback information.
The search model training method according to claim 4, wherein the converting of the initial scholar-keyword matrix into a scholar-keyword matrix according to the pseudo-relevant feedback information comprises:

obtaining the top n documents of each scholar that are most relevant to the search keywords;

calculating a fourth cosine similarity between each scholar's top n documents and the search keyword;

performing mean value calculation on the fourth cosine similarity to obtain a second mean value;

converting the initial scholar-keyword matrix into the scholar-keyword matrix according to the first mean value and the second mean value.
A search model training device, comprising:

an acquisition module, configured to acquire a data set, wherein the data set includes document data and search key data, the document data includes scholars and documents, each scholar includes at least two documents, and the search key data includes search keywords;

a preprocessing module, configured to preprocess the document data to obtain a document-keyword matrix;

a matrix transformation module, configured to perform two matrix transformation processes on the document-keyword matrix according to the search keywords to obtain a scholar-keyword matrix;

a training module, configured to input the scholar-keyword matrix into a search model to be trained for training, and output an average training loss;

a parameter adjustment module, configured to update the parameters of the search model to be trained by using a Bayesian optimization network if the average training loss does not reach a preset loss threshold;

a training completion module, configured to retrain the updated search model to be trained, stop training when the average training loss reaches the preset loss threshold, and use the search model to be trained at that point as a scholar search model.
The search model training device according to claim 6, wherein the preprocessing module comprises:

a cleaning unit, configured to clean the document data to obtain text;

an extraction unit, configured to extract entries from the text to obtain initial entries;

an expansion unit, configured to expand the initial entries to obtain expanded entries;

a conversion unit, configured to convert the initial entries and the expanded entries into the document-keyword matrix.
The search model training device according to claim 7, wherein the matrix transformation module comprises:

a first conversion unit, configured to convert the document-keyword matrix into an initial scholar-keyword matrix;

a pseudo-correlation unit, configured to determine pseudo-relevant feedback information according to the initial scholar-keyword matrix and the search keywords;

a second conversion unit, configured to convert the initial scholar-keyword matrix into the scholar-keyword matrix according to the pseudo-relevant feedback information.
A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, the steps of the search model training method according to any one of claims 1 to 5 are implemented.
A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, the steps of the search model training method according to any one of claims 1 to 5 are implemented.