CN112434134A - Search model training method and device, terminal equipment and storage medium - Google Patents

Search model training method and device, terminal equipment and storage medium

Info

Publication number
CN112434134A
Authority
CN
China
Prior art keywords
search
learner
matrix
keyword
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011403845.7A
Other languages
Chinese (zh)
Other versions
CN112434134B (en)
Inventor
吴嘉澍
王洋
须成忠
叶可江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202011403845.7A priority Critical patent/CN112434134B/en
Priority to PCT/CN2020/140016 priority patent/WO2022116324A1/en
Publication of CN112434134A publication Critical patent/CN112434134A/en
Application granted granted Critical
Publication of CN112434134B publication Critical patent/CN112434134B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3346Query execution using probabilistic model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention is applicable to the technical field of information retrieval and provides a search model training method and apparatus, a terminal device, and a storage medium. The search model training method comprises the following steps: preprocessing document data to obtain a document-keyword matrix; performing matrix conversion processing on the document-keyword matrix twice according to the search keywords to obtain a learner-keyword matrix; inputting the learner-keyword matrix into a search model to be trained and outputting the training average loss; if the training average loss does not reach a preset loss threshold, updating the parameters of the search model to be trained with a Bayesian optimization network; and retraining the updated search model until the training average loss reaches the preset loss threshold, at which point training stops and the model is taken as the learner search model. As a result, the learner document search quality is high and the search results are more accurate.

Description

Search model training method and device, terminal equipment and storage medium
Technical Field
The invention belongs to the technical field of information retrieval, and particularly relates to a search model training method and device, a terminal device and a storage medium.
Background
Information retrieval technology has become increasingly important with the rapid development of big data: it must be able to retrieve relevant information from massive volumes of data according to users' needs. An information retrieval system can retrieve various kinds of entities, such as text, audio, games, and video. For example, a text retrieval system judges whether each text is relevant to the searched keywords and ranks the texts by relevance to generate a satisfactory search result. Ordinary text retrieval, however, considers the relevance of each text to the search keywords independently. Searching for a learner is different: it requires considering all documents written by the learner collectively and comprehensively, rather than considering each document separately.
Meanwhile, as the amount of data keeps growing, manually labeling the correctly ordered data sets of search keywords that a search system needs for training becomes ever more difficult, so algorithms must be able to train and learn under limited supervision information. As a result, existing search methods are ill-suited to search model training, and their search quality and accuracy are low.
Disclosure of Invention
In view of this, embodiments of the present invention provide a search model training method and apparatus, a terminal device, and a storage medium, so as to solve the problems of low search quality and low accuracy in search model training in the prior art.
A first aspect of an embodiment of the present invention provides a search model training method, including:
acquiring a data set, wherein the data set comprises literature data and search key data, the literature data comprises scholars and documents, each scholar has at least two documents, and the search key data comprises search keywords;
preprocessing document data to obtain a document-keyword matrix;
performing matrix conversion processing twice on the document-keyword matrix according to the search keywords to obtain a learner-keyword matrix;
inputting the learner-keyword matrix into a search model to be trained for training, and outputting training average loss;
if the training average loss does not reach the preset loss threshold, updating the parameters of the search model to be trained by adopting a Bayesian optimization network;
and retraining the updated search model to be trained until the training average loss reaches a preset loss threshold, and taking the search model to be trained at the moment as a learner search model.
A second aspect of an embodiment of the present invention provides a search model training apparatus, including:
the acquisition module is used for acquiring a data set, wherein the data set comprises literature data and search key data, the literature data comprises scholars and documents, each scholar has at least two documents, and the search key data comprises search keywords;
the preprocessing module is used for preprocessing the document data to obtain a document-keyword matrix;
the matrix conversion module is used for performing matrix conversion processing twice on the document-keyword matrix according to the search keywords to obtain a learner-keyword matrix;
the training module is used for inputting the learner-keyword matrix into a search model to be trained for training and outputting training average loss;
the parameter adjusting module is used for updating the parameters of the search model to be trained by adopting a Bayesian optimization network if the training average loss does not reach the preset loss threshold;
and the training completion module is used for retraining the updated search model to be trained, stopping training until the training average loss reaches a preset loss threshold value, and taking the search model to be trained at the moment as a learner search model.
A third aspect of the embodiments of the present invention provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the search model training method when executing the computer program.
A fourth aspect of embodiments of the present invention provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the search model training method as described above.
Compared with the prior art, the embodiment of the invention has the following beneficial effects:
in the invention, a document-keyword matrix is obtained by preprocessing document data, and two rounds of matrix conversion are performed on the document-keyword matrix according to the search keywords to obtain a learner-keyword matrix. The learner-keyword matrix is input into a search model to be trained, and the training average loss is output. If the training average loss does not reach a preset loss threshold, the parameters of the search model to be trained are updated with a Bayesian optimization network and the updated model is retrained; training stops once the training average loss reaches the preset loss threshold, and the model at that point is taken as the learner search model. In other words, the information of each document in the document-keyword matrix is aggregated into the learner-keyword matrix, with the learner as the unit, through matrix conversion. When searching for a learner, all documents written by the learner are therefore considered collectively and comprehensively, so the correlation between the learner and the search keywords is reflected accurately, the learner search task is completed better, the search results are more accurate, and the search quality is high. Moreover, because the invention relies only on a small amount of labeled search keywords and unlabeled document data for training, dependence on large amounts of ranking-label supervision information is reduced, and training and learning can proceed under limited supervision information.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
FIG. 1 is a schematic diagram of a search model training method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart diagram of a search model training method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a search model training method according to an embodiment of the present invention;
FIG. 4 is a schematic decomposition diagram of the document-keyword matrix, the learner-keyword matrix, and the LSA matrix according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a search model training apparatus according to an embodiment of the present invention;
fig. 6 is a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
In the embodiment of the present invention, the main execution body of the process is a terminal device; the terminal device includes, but is not limited to, a notebook computer, a server, a tablet computer, a smartphone, and other terminal devices with software development capability. In particular, when executing the process in the implementation of the present invention, the terminal device can provide a bearing function or a front-end interface display for the application, and provide an interface assembly framework for it.
Further, unlike traditional text search methods that consider each text independently, searching for a learner requires uniformly and comprehensively considering all documents authored by the learner, which makes the method of converting a document-keyword matrix into a learner-keyword matrix very important. Fig. 1 is a schematic diagram of the flow and optimization framework of a search model training method according to an embodiment of the present invention. Specifically: the document text is preprocessed to remove punctuation symbols, yielding plain text; entries are extracted from the text to obtain the document entries; the document entries are expanded using a knowledge tree; and the expanded entries and document entries are converted to generate a document-entry matrix.
Further, the document-entry matrix passes through a first round of matrix conversion to generate a learner-entry matrix; pseudo-relevance feedback is computed from this learner-entry matrix, and the feedback guides a second round of matrix conversion. The learner-entry matrix after the second round is used as training data for an XGBoost model (a gradient boosted tree model), and the model's average loss value is calculated. When the average loss has not reached the convergence condition, the parameters of the XGBoost model are updated through Bayesian optimization network search; when the training average loss converges, training stops, and the XGBoost model as updated at that point is taken as the learner search model. Through the two matrix conversions, the information of each document in the document-entry matrix is aggregated into the learner-entry matrix with the learner as the unit, so the algorithm uniformly and comprehensively considers all documents written by a learner during learner search; in the conversion process, pseudo-relevance feedback is skillfully used to assist the matrix conversion, making the search more accurate and efficient.
With continuing reference to fig. 2, fig. 2 is a schematic flow chart of a search model training method according to an embodiment of the present invention, which is detailed as follows:
s201, a data set is obtained, wherein the data set comprises literature data and search key data, the literature data comprises scholars and documents, each scholars comprises at least two documents, and the search key data comprises search keywords.
In an embodiment of the present invention, the literature data includes data sets of individual scholars and academic literature they write. In the literature data, each document includes the title of the document, the abstract of the document, the text of the document, and the name of the academic conference/journal in which the document is published.
Furthermore, the search key data also comprises result information corresponding to the search keywords, i.e., documents labeled for the search keywords; this information can be used to train the search model and facilitates assessing the quality of the training results.
S202, preprocessing the document data to obtain a document-keyword matrix.
Specifically, the document data is in a text format, and the text data is converted into a corresponding matrix from the text format in a preprocessing mode, specifically including:
cleaning the literature data to obtain a text;
performing entry extraction on the text to obtain an initial entry;
expanding the initial entry to obtain an expanded entry;
and converting the initial entries and the expanded entries into a document-keyword matrix.
In the embodiment of the invention, cleaning the document data comprises lowercasing the text data, deleting redundant spaces, and removing punctuation and splitting sentences, to obtain plain-character text. Entry extraction is then performed on the text using a preset word segmentation algorithm; the segmentation algorithm may be a dictionary-based method, such as forward maximum matching (MM), reverse maximum matching (RMM), or bidirectional maximum matching (BM). Finally, the extracted entries are matched against the words in a preset vocabulary, and words meeting the required matching degree are taken as expanded entries of the original entries.
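For illustration only, the cleaning and dictionary-based segmentation steps described above could be sketched as follows in Python; the regular expressions, the toy vocabulary, and the maximum word length are assumptions, not the patent's reference implementation:

```python
import re

def clean_text(raw: str) -> str:
    # Lowercase, strip punctuation, and collapse redundant spaces so that
    # only plain characters remain.
    text = raw.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def forward_max_match(text: str, vocab: set, max_len: int = 5) -> list:
    # Dictionary-based forward maximum matching (MM): at each position,
    # greedily take the longest vocabulary word; fall back to one character.
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens
```

The reverse maximum matching (RMM) and bidirectional maximum matching (BM) variants differ only in the scan direction and in how the two scan results are reconciled.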
Since academic documents are strictly structured and the four parts of a document (i.e., the document title, the abstract, the body text, and the name of the academic conference/journal where it was published) differ in importance and in degree of refinement, common practice in search is to assign different scores to different parts of a document; the widely used BM25F index is adopted here to measure the score of a keyword in a document.
BM25F is an improved version of the classic BM25 algorithm. BM25 treats a document as a single whole when computing relevance, but in practice, especially in verticalized search, a document can be divided into multiple independent fields. For example, a web page can be divided into fields such as title, content, and subject words, and these fields do not contribute equally to the topic of the article, so they should be weighted differently. BM25 does not take this into account; BM25F improves on it by scoring each keyword within each field (region) separately, so the BM25F score is the weighted sum of a keyword's scores in the respective fields.
Further, a document-keyword matrix is constructed from the initial entries and the expanded entries. The number of rows of the document-keyword matrix is the total number of documents, and the number of columns is the total number of keywords across all documents, i.e., the total number of initial entries and expanded entries. The value stored in row i and column j is the BM25F score of the jth keyword in the ith document of the whole data set. The document-keyword matrix is shown in fig. 3.
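A minimal sketch of this construction, assuming tokenized fields and the common BM25F simplification in which per-field term frequencies are combined by field weights before saturation and IDF are applied (the field names, field weights, and constants k1 and b below are illustrative assumptions):

```python
import math
import numpy as np

FIELDS = {"title": 3.0, "abstract": 2.0, "body": 1.0, "venue": 1.5}  # assumed weights

def bm25f(keyword, doc, avg_len, df, n_docs, k1=1.2, b=0.75):
    # Accumulate term frequency across fields with per-field weights,
    # then apply BM25-style length normalization, saturation, and IDF.
    tf = sum(w * doc[f].count(keyword) for f, w in FIELDS.items())
    doc_len = sum(len(doc[f]) for f in FIELDS)
    tf_norm = tf / (1.0 - b + b * doc_len / avg_len)
    idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
    return idf * tf_norm / (k1 + tf_norm)

def doc_keyword_matrix(docs, keywords):
    # docs: list of dicts mapping field name to token list.
    # Entry (i, j) holds the BM25F score of keyword j in document i.
    n_docs = len(docs)
    avg_len = sum(sum(len(d[f]) for f in FIELDS) for d in docs) / n_docs
    M = np.zeros((n_docs, len(keywords)))
    for j, kw in enumerate(keywords):
        df = sum(any(kw in d[f] for f in FIELDS) for d in docs)
        for i, d in enumerate(docs):
            M[i, j] = bm25f(kw, d, avg_len, df, n_docs)
    return M
```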
S203, according to the search keywords, performing matrix conversion processing twice on the document-keyword matrix to obtain a learner-keyword matrix.
In the embodiment of the invention, the document-keyword matrix must undergo two matrix conversions in order to obtain the pseudo-relevance feedback information. After the first round of matrix conversion, the resulting learner-keyword matrix is used to retrieve the learners most relevant to the search keywords and the most relevant documents written by those learners, generating pseudo-relevance feedback information. In the second round, the obtained pseudo-relevance feedback information guides the second matrix conversion so that the conversion is more reasonable. Specifically, fig. 3, a further embodiment of the flow and optimization framework of the search model training method of the present invention, depicts the first and second matrix conversion processes.
Further, performing two matrix transformation processes on the document-keyword matrix according to the search keyword to obtain a learner-keyword matrix comprises:
converting the document-keyword matrix into an initial learner-keyword matrix;
determining pseudo-relevant feedback information according to the initial learner-keyword matrix and the search keywords;
the initial learner-keyword matrix is converted to a learner-keyword matrix based on the pseudo-relevant feedback information.
Pseudo-relevance feedback, also called blind relevance feedback, automates the manual part of relevance feedback so that the user needs no additional interaction: a normal search is performed first and the most relevant documents form an initial set; the top k documents are then assumed to be relevant; and relevance feedback is finally carried out on that assumption as before. Using pseudo-relevance feedback information, the conversion process can assign higher scores to learners whose relevance exceeds the feedback baseline, and lower scores to learners whose relevance falls below it, with the score dropping as the gap widens. Matrix conversion based on pseudo-relevance feedback thus makes the score of each keyword in the matrix more reasonable.
Further, an initial learner-keyword matrix is constructed according to formula (1): the score of a keyword for a learner is equal to the sum of that keyword's scores in all documents written by the learner (the first numerator term), multiplied by the logarithm of the number of the learner's documents in which the keyword appears (the second numerator term), divided by the logarithm of the overall length of the documents written by the learner (the denominator term). In formula (1), weights ω1, ω2 and ω3 are assigned to the three terms to balance their importance:

$$S_{i,j} = \frac{\omega_1 \sum_{d \in D_i} s_{d,j} \cdot \omega_2 \log c_{i,j}}{\omega_3 \log L_i} \qquad (1)$$

wherein $S_{i,j}$ is the score of each element in the initial learner-keyword matrix, $s_{d,j}$ is the score of document-keyword $j$ in document $d$, $D_i$ is the set of documents written by learner $i$, $c_{i,j}$ is the number of documents written by learner $i$ in which keyword $j$ appears, $L_i$ is the overall length of the documents written by learner $i$, $i$ denotes the index of the learner, and $J$ denotes the total number of document keywords in the document-keyword matrix ($j = 1, \dots, J$). By calculating each $S_{i,j}$, the score of every element of the initial learner-keyword matrix is obtained; that is, the number of rows of the initial learner-keyword matrix is the number of learners, and the number of columns is the total number of keywords in all documents, i.e., the total number of initial entries and expanded entries.
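Under the reconstruction of formula (1) above, the first-round conversion could be sketched as follows; the log1p offset guarding against log 0 and the default weights are implementation assumptions:

```python
import numpy as np

def initial_learner_keyword_matrix(doc_kw, doc_learner, doc_len, w1=1.0, w2=1.0, w3=1.0):
    # doc_kw: (n_docs, n_keywords) BM25F matrix; doc_learner[i] is the index
    # of the learner who wrote document i; doc_len[i] is document i's length.
    n_learners = int(max(doc_learner)) + 1
    S = np.zeros((n_learners, doc_kw.shape[1]))
    for a in range(n_learners):
        rows = [i for i, s in enumerate(doc_learner) if s == a]
        score_sum = doc_kw[rows].sum(axis=0)          # sum of BM25F scores
        doc_count = (doc_kw[rows] > 0).sum(axis=0)    # documents containing keyword
        total_len = sum(doc_len[i] for i in rows)
        S[a] = (w1 * score_sum) * (w2 * np.log1p(doc_count)) / (w3 * np.log1p(total_len))
    return S
```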
Further, determining pseudo-relevant feedback information according to the initial learner-keyword matrix and the search keyword comprises:
carrying out matrix decomposition on the initial learner-keyword matrix to obtain a learner vector;
calculating the first cosine similarity of the search keyword and the learner vector;
screening out a search learner from the learners according to the first cosine similarity, wherein the search learner is the top n learners with highest correlation with the search keywords;
performing matrix decomposition on the document-keyword matrix to obtain a document vector;
calculating the second cosine similarity of the search keyword and the document vector;
screening out, according to the second cosine similarity, the top n documents of each search learner with the highest correlation to the search keywords;
respectively calculating the third cosine similarity between the top n documents of each search learner and the search keywords;
and calculating the mean value of the third cosine similarity, and taking the obtained first mean value as pseudo-correlation feedback information.
In the embodiment of the present invention, the initial learner-keyword matrix is decomposed using an LSA (Latent Semantic Analysis) model to obtain learner vectors; the LSA model is an existing algorithm model and is not explained here. The matrix decomposition process is shown in fig. 4, a schematic diagram of the decomposition of the document-keyword matrix, the learner-keyword matrix, and the LSA matrices. The initial learner-keyword matrix (i.e., the learner-keyword matrix in fig. 4) is decomposed by the LSA model into the product of three matrices: a keyword latent-space matrix, a latent-space matrix (the diagonal matrix of latent factors), and a learner matrix, the learner matrix being composed of the learner vectors.
Wherein n is a positive integer, and n is 5 in the embodiment of the invention.
In the embodiment of the invention, LSA is used to perform the matrix decomposition of the initial learner-keyword matrix. After the decomposition, the cosine formula is used to compute the first cosine similarity between the search keyword (in vector form) and each learner vector; the learners are sorted in descending order of the first cosine similarity, and the top 5 learners most relevant to the search keyword are selected.
Further, the document-keyword matrix is likewise decomposed with LSA to obtain document vectors. After the decomposition, the cosine formula is used to compute the second cosine similarity between the search keyword (in vector form) and each document vector, and, in descending order of the second cosine similarity, the top n documents of each search learner with the highest correlation to the search keyword are screened out. Since n was set to 5 above, the same value of 5 can be used here, i.e., the setting of n is kept consistent.
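The pseudo-relevance feedback computation of this step could be sketched with scikit-learn's TruncatedSVD standing in for the LSA decomposition; the library choice, the latent dimensionality, and the projection of the query into each latent space are assumptions (NumPy arrays are assumed throughout):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def prf_first_mean(learner_kw, doc_kw, doc_learner, query_kw, n=5, dim=64):
    # Project learners, documents, and the query into LSA latent spaces.
    svd_l = TruncatedSVD(n_components=dim).fit(learner_kw)
    svd_d = TruncatedSVD(n_components=dim).fit(doc_kw)
    q_l = svd_l.transform(query_kw.reshape(1, -1))
    q_d = svd_d.transform(query_kw.reshape(1, -1))

    # First cosine similarity: pick the top-n learners for the query.
    sims_l = cosine_similarity(svd_l.transform(learner_kw), q_l).ravel()
    top_learners = np.argsort(sims_l)[::-1][:n]

    # Second/third cosine similarity: each top learner's n best documents.
    sims_d = cosine_similarity(svd_d.transform(doc_kw), q_d).ravel()
    third = []
    for a in top_learners:
        rows = np.where(doc_learner == a)[0]
        best = rows[np.argsort(sims_d[rows])[::-1][:n]]
        third.extend(sims_d[best])
    return float(np.mean(third))   # the first mean, i.e. avg_tau
```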
Further, the intuition behind the conversion is as follows: the higher the sum of a document-keyword's scores in all documents written by a learner, the higher the score in the learner-keyword matrix; if more of the learner's documents contain the keyword, the score in the learner-keyword matrix is also higher; and if the overall length of the learner's documents is large, the keyword will naturally appear with higher probability, so the overall length serves as the denominator to offset this fact. Through this conversion, the information of each document in the document-keyword matrix is aggregated into the learner-keyword matrix with the learner as the unit, so that all documents written by the learner are considered comprehensively during learner search.
Accordingly, the third cosine similarity between each search learner's top n documents and the search keyword is computed, and the first mean obtained by averaging the third cosine similarities represents how similar highly relevant documents are to the search keyword; this first mean can thus serve as pseudo-relevance feedback information to guide the second matrix conversion. For example, when a learner's top-5 relevant documents have an average similarity to the search keyword that is higher than the fed-back first mean, a higher score is assigned in the learner-keyword matrix. The matrix conversion method with pseudo-relevance feedback therefore converts the matrix more effectively, and the converted matrix improves search quality when used for learner search.
Further, converting the initial learner-keyword matrix into a learner-keyword matrix based on the pseudo-relevant feedback information comprises:
acquiring the top n documents of each learner with the highest relevance to the search keywords;
calculating the fourth cosine similarity between the top n documents of each learner and the search keyword;
performing mean value calculation on the fourth cosine similarity to obtain a second mean value;
and converting the initial learner-keyword matrix into a learner-keyword matrix according to the first mean value and the second mean value.
Specifically, the score of each element of the learner-keyword matrix is calculated using formula (2). Compared with formula (1), the denominator of formula (2) adds a term comparing the average relevance of the learner's most relevant documents (i.e., the second mean) with the obtained pseudo-relevance feedback information (i.e., the first mean):

$$S'_{i,j} = \frac{\omega_1 \sum_{d \in D_i} s_{d,j} \cdot \omega_2 \log c_{i,j}}{\omega_3 \log L_i + \omega_4 \exp\left(avg_\tau - AVG_\tau^{(i)}\right)} \qquad (2)$$

wherein $S'_{i,j}$ denotes the score of each element of the learner-keyword matrix, $AVG_\tau^{(i)}$ denotes the second mean of learner $i$, $avg_\tau$ denotes the first mean, ω4 weights the added term, and the remaining symbols are the same as in formula (1) and are not described again here.
Furthermore, the method of obtaining each learner's top n documents most relevant to the search keywords is the same as the method used above to screen out the search learners by the first cosine similarity: the cosine similarity between the search keywords and each document is computed, the documents are sorted in descending order of similarity, and the top n are kept. Likewise, the fourth cosine similarity between each learner's top n documents and the search keywords is computed, and the fourth cosine similarities of each learner are accumulated and averaged to obtain the second mean.
As can be seen from formula (2), if the average relevance of a learner's most relevant documents is greater than or equal to the pseudo-relevance feedback information (i.e., $AVG_\tau \geq avg_\tau$), the added term $\exp(avg_\tau - AVG_\tau^{(i)})$ takes a smaller value. Conversely, if the average relevance of the learner's most relevant documents is less than the pseudo-relevance feedback, then the larger the gap, the larger the value of the term and the smaller the overall value of formula (2), i.e., the learner is assigned a lower score. The obtained pseudo-relevance feedback information can thus serve as a benchmark: the further a learner falls below it, the lower the score, so the scores in the learner-keyword matrix obtained by this conversion are more reasonable.
And S204, inputting the learner-keyword matrix into a search model to be trained for training, and outputting training average loss.
In the embodiment of the invention, the search model to be trained is an XGBoost (eXtreme Gradient Boosting) model. The XGBoost model converts the learner search-ranking problem into a binary classification problem on the ordering of results, i.e., judging whether result A should be placed before result B, which reduces the demand for data, so that training and learning remain possible even when supervised ranking data is limited. Each learner-keyword in the learner-keyword matrix is trained in the XGBoost model to obtain a corresponding loss value, and all loss values are averaged to obtain the training average loss of the XGBoost model for the current round of training.
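The pairwise reduction described here could be sketched with the xgboost Python package and a binary logistic objective; the feature construction (a difference of per-learner feature vectors) and the hyperparameters are illustrative assumptions:

```python
import numpy as np
import xgboost as xgb

def train_pairwise(learner_feats, labels, num_round=100):
    # Reduce ranking to binary classification: for learners a, b with
    # different relevance labels, the example is their feature difference
    # and the target is whether a should be ranked before b.
    X, y = [], []
    for a in range(len(labels)):
        for b in range(len(labels)):
            if labels[a] != labels[b]:
                X.append(learner_feats[a] - learner_feats[b])
                y.append(int(labels[a] > labels[b]))
    X, y = np.array(X), np.array(y)
    dtrain = xgb.DMatrix(X, label=y)
    params = {"objective": "binary:logistic", "eval_metric": "logloss"}
    booster = xgb.train(params, dtrain, num_boost_round=num_round)

    # Training average loss: mean log-loss over the pairwise examples.
    p = booster.predict(dtrain)
    eps = 1e-9
    avg_loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return booster, float(avg_loss)
```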
And S205, if the training average loss does not reach the preset loss threshold, updating the parameters of the search model to be trained by adopting a Bayesian optimization network.
The preset loss threshold may be set according to the actual training situation; for example, it may be 0.1. When the training average loss is greater than the preset loss threshold, a Bayesian optimization network (Bayesian Optimization) is used to optimize and select the parameters of the current search model to be trained, thereby minimizing the loss value of the search model.
A Bayesian optimization network is a black-box optimization algorithm for finding the extremum of a function whose expression is unknown. Its search principle is to first generate an initial set of candidate solutions, then search for the next point most likely to be an extremum based on those points, add that point to the set, and repeat this step until the iteration terminates; finally, the point with the best function value among them is taken as the solution of the extremum problem. Because it exploits the information of previously evaluated points during the search, it is more effective than grid search or random search. Before making a new parameter-selection decision, the Bayesian optimization network considers the effect of the previous parameter selections to optimize the next one, so the parameters of the search model to be trained are selected and optimized more efficiently, enabling higher-quality learner search.
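One plausible realization of this parameter search uses the scikit-optimize package's Gaussian-process minimizer; the search space below is an assumption for illustration, and train_fn stands for a hypothetical routine that trains the model with the given parameters and returns its training average loss:

```python
from skopt import gp_minimize
from skopt.space import Integer, Real

def bayesian_tune(train_fn, n_calls=30):
    # Each trial trains the model once; the Gaussian-process surrogate uses
    # the losses of previous trials to pick the next most promising point.
    space = [
        Real(0.01, 0.3, name="eta"),         # learning rate
        Integer(3, 10, name="max_depth"),    # tree depth
        Real(0.5, 1.0, name="subsample"),    # row subsampling
    ]

    def objective(x):
        eta, max_depth, subsample = x
        return train_fn({"objective": "binary:logistic",
                         "eta": float(eta),
                         "max_depth": int(max_depth),
                         "subsample": float(subsample)})

    result = gp_minimize(objective, space, n_calls=n_calls, random_state=0)
    return result.x, result.fun   # best parameters and the loss they achieved
```

Each call to the objective is one full training run, so n_calls bounds the number of retraining rounds spent on parameter search.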
And S206, retraining the updated search model to be trained until the training average loss reaches a preset loss threshold, and taking the search model to be trained at the moment as a learner search model.
The search model with the parameters updated in step S205 is taken as the current training model and training is performed again; the training average loss output by each round is compared with the preset loss threshold, training stops once the training average loss reaches the preset loss threshold, and the current parameters are taken as the final parameters of the search model to be trained, thereby forming the learner search model.
It should be noted that the data set used in training the present invention consists of Chinese literature data, a Chinese word stock, and a Chinese search-ranking data set; that is, the trained learner search model can be used to search for Chinese scholars with keywords expressed in Chinese. Similarly, if the data set were French, the learner search model would operate just as well, with no change to the training and optimization framework; the method is thus language-general, which reduces the limitations of the search model.
In the embodiment of the invention, a document-keyword matrix is obtained by preprocessing document data, and two rounds of matrix conversion are performed on the document-keyword matrix according to the search keywords to obtain a learner-keyword matrix. The learner-keyword matrix is input into the search model to be trained, and the training average loss is output. If the training average loss does not reach the preset loss threshold, the parameters of the search model to be trained are updated with a Bayesian optimization network and the updated model is retrained; training stops once the training average loss reaches the preset loss threshold, and the model at that point is taken as the learner search model. That is, the information of each document in the document-keyword matrix is aggregated into the learner-keyword matrix, with the learner as the unit, through matrix conversion, so that when searching for a learner, all documents written by the learner are considered collectively and comprehensively; the correlation between the learner and the search keywords is thus reflected accurately, the learner search task is completed better, the search results are more accurate, and the search quality is high. Meanwhile, because the invention relies only on a small amount of labeled search keywords and unlabeled document data for training, dependence on large amounts of ranking-label supervision information is reduced, and training and learning can proceed under limited supervision information.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In the embodiment of the present invention, a search model training apparatus is further provided; each module included in the apparatus is used to execute the corresponding step in the embodiment of fig. 2. Please refer to the description of the embodiment corresponding to fig. 2 for details. Fig. 5 is a schematic structural diagram of a first embodiment of the search model training apparatus of the present invention, which includes an acquisition module 51, a preprocessing module 52, a matrix conversion module 53, a training module 54, a parameter adjusting module 55, and a training completion module 56:
the acquisition module 51 is configured to acquire a data set, where the data set includes literature data and search key data, the literature data includes scholars and documents, each scholar has at least two documents, and the search key data includes search keywords;
a preprocessing module 52, configured to preprocess the document data to obtain a document-keyword matrix;
a matrix transformation module 53, configured to perform matrix transformation processing on the document-keyword matrix twice according to the search keyword to obtain a learner-keyword matrix;
a training module 54, configured to input the learner-keyword matrix into a search model to be trained for training, and output a training average loss;
the parameter adjusting module 55 is configured to update a parameter of the search model to be trained by using a bayesian optimization network if the training average loss does not reach the preset loss threshold;
and a training completion module 56, configured to retrain the updated search model to be trained, stop training until the training average loss reaches a preset loss threshold, and use the search model to be trained at this time as the learner search model.
Further, the preprocessing module 52 includes:
the cleaning unit is used for cleaning the document data to obtain a text;
the extraction unit is used for extracting the entries of the text to obtain initial entries;
the expansion unit is used for expanding the initial entries to obtain expanded entries;
and the conversion unit is used for converting the initial entries and the expanded entries into a document-keyword matrix.
Further, the matrix conversion module 53 includes:
a first conversion unit for converting the document-keyword matrix into an initial learner-keyword matrix;
the pseudo correlation unit is used for determining pseudo correlation feedback information according to the initial learner-keyword matrix and the search keywords;
and the second conversion unit is used for converting the initial learner-keyword matrix into the learner-keyword matrix according to the pseudo-related feedback information.
Further, the pseudo correlation unit includes:
the first decomposition subunit is used for carrying out matrix decomposition on the initial learner-keyword matrix to obtain a learner vector;
the first calculating subunit is used for calculating the first cosine similarity between the search keyword and the learner vector;
the first screening subunit is used for screening out the search learner from the learners according to the first cosine similarity, wherein the search learner is the top n learners with the highest correlation with the search keywords;
the second decomposition subunit is used for performing matrix decomposition on the document-keyword matrix to obtain a document vector;
the second calculating subunit is used for calculating the second cosine similarity between the search keyword and the document vector;
the second screening subunit is used for screening out, according to the second cosine similarity, the top n documents of each search learner with the highest correlation to the search keywords;
the third calculation subunit is used for respectively calculating the third cosine similarity between the first n documents of each search learner and the search keyword;
and the first mean value subunit is used for performing mean value calculation on the third cosine similarity, and taking the obtained first mean value as pseudo-correlation feedback information.
Further, the second conversion unit includes:
the acquisition subunit is used for acquiring the top n documents of each learner with the highest relevance to the search keywords;
the fourth calculating subunit is used for calculating the fourth cosine similarity between the top n documents of each learner and the search keyword;
the second mean value subunit is used for carrying out mean value calculation on the fourth cosine similarity to obtain a second mean value;
and the conversion subunit is used for converting the initial learner-keyword matrix into the learner-keyword matrix according to the first mean value and the second mean value.
The function implementation of each module/unit in the search model training device corresponds to each step in the search model training method embodiment, and the function and implementation process are not described in detail here.
Fig. 6 is a schematic diagram of a terminal device according to an embodiment of the present invention. As shown in fig. 6, the terminal device 6 of this embodiment includes: a processor 60, a memory 61, and a computer program 62 stored in the memory 61 and executable on the processor 60, such as a search model training program. The processor 60, when executing the computer program 62, implements the steps in the search model training method embodiments described above, such as steps S201 to S206 shown in fig. 2. Alternatively, the processor 60, when executing the computer program 62, implements the functions of the modules/units in the apparatus embodiments, such as the functions of modules 51 to 56 shown in fig. 5.
Illustratively, the computer program 62 may be partitioned into one or more modules/units, which are stored in the memory 61 and executed by the processor 60 to implement the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, used to describe the execution process of the computer program 62 in the search model training apparatus/terminal device 6. For example, the computer program 62 may be divided into an acquisition module, a preprocessing module, a matrix conversion module, a training module, a parameter adjusting module, and a training completion module; the specific functions of these modules are as described above and are not repeated here.
The terminal device 6 may be a desktop computer, a notebook, a palm computer, a cloud server, or another computing device. The terminal device 6 may include, but is not limited to, a processor 60 and a memory 61. Those skilled in the art will understand that fig. 6 is only an example of the terminal device 6 and does not constitute a limitation on it; the terminal device 6 may include more or fewer components than shown, or combine certain components, or use different components; for example, it may further include input-output devices, network access devices, a bus, etc.
The Processor 60 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 61 may be an internal storage unit of the terminal device 6, such as a hard disk or a memory of the terminal device 6. The memory 61 may also be an external storage device of the terminal device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the terminal device. Further, the memory 61 may also include both an internal storage unit and an external storage device of the terminal device 6. The memory 61 is used for storing the computer program and other programs and data required by the terminal device. The memory 61 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the system is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed system/terminal device and method can be implemented in other ways. For example, the above-described system/terminal device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another device, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments may be implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A search model training method, comprising:
acquiring a data set, wherein the data set comprises literature data and search key data, the literature data comprises scholars and literature, each scholar comprises at least two pieces of literature, and the search key data comprises search keywords;
preprocessing the literature data to obtain a literature-keyword matrix;
according to the search keywords, performing matrix conversion processing twice on the literature-keyword matrix to obtain a learner-keyword matrix;
inputting the learner-keyword matrix into a search model to be trained for training, and outputting training average loss;
if the training average loss does not reach a preset loss threshold value, updating the parameters of the search model to be trained by adopting a Bayesian optimization network;
and retraining the updated search model to be trained until the training average loss reaches the preset loss threshold, and taking the search model to be trained at the moment as a learner search model.
2. The search model training method of claim 1, wherein said preprocessing the document data to obtain a document-keyword matrix comprises:
cleaning the literature data to obtain a text;
performing entry extraction on the text to obtain an initial entry;
expanding the initial entry to obtain an expanded entry;
and converting the initial entries and the expanded entries into a document-keyword matrix.
3. The method of claim 1, wherein performing two matrix transformations on the document-keyword matrix to obtain a learner-keyword matrix based on the search keywords comprises:
converting the document-keyword matrix to an initial learner-keyword matrix;
determining pseudo-relevant feedback information according to the initial learner-keyword matrix and the search keyword;
and converting the initial learner-keyword matrix into a learner-keyword matrix according to the pseudo-related feedback information.
4. The search model training method of claim 3, wherein said determining pseudo-relevant feedback information based on said initial learner-keyword matrix and said search keyword comprises:
performing matrix decomposition on the initial learner-keyword matrix to obtain a learner vector;
calculating the first cosine similarity of the search keyword and the learner vector;
screening out a search learner from the learners according to the first cosine similarity, wherein the search learner is the top n learners with highest correlation with the search keywords;
performing matrix decomposition on the literature-keyword matrix to obtain a literature vector;
calculating second cosine similarity of the search keyword and the literature vector;
screening out, according to the second cosine similarity, the top n documents of each search learner with the highest correlation to the search keywords;
respectively calculating the third cosine similarity between the top n documents of each search learner and the search keywords;
and calculating the mean value of the third cosine similarity, and taking the obtained first mean value as pseudo-correlation feedback information.
5. The search model training method of claim 4, wherein said converting said initial learner-keyword matrix into a learner-keyword matrix based on said pseudo-relevant feedback information comprises:
acquiring the top n documents of each learner with the highest relevance to the search keywords;
calculating the fourth cosine similarity between the top n documents of each learner and the search keywords;
performing mean value calculation on the fourth cosine similarity to obtain a second mean value;
and converting the initial learner-keyword matrix into a learner-keyword matrix according to the first mean value and the second mean value.
6. A search model training apparatus, comprising:
the system comprises an acquisition module, a search module and a search module, wherein the acquisition module is used for acquiring a data set, the data set comprises literature data and search key data, the literature data comprises scholars and literature, each scholar comprises at least two pieces of literature, and the search key data comprises search keywords;
the preprocessing module is used for preprocessing the literature data to obtain a literature-keyword matrix;
the matrix conversion module is used for performing matrix conversion processing on the document-keyword matrix twice according to the search keywords so as to obtain a learner-keyword matrix;
the training module is used for inputting the learner-keyword matrix into a search model to be trained for training and outputting training average loss;
the parameter adjusting module is used for updating the parameters of the search model to be trained by adopting a Bayesian optimization network if the training average loss does not reach a preset loss threshold;
and the training completion module is used for retraining the updated search model to be trained until the training average loss reaches the preset loss threshold value, stopping training and taking the search model to be trained at the moment as a learner search model.
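For illustration, the control flow of the training and parameter-adjustment modules in claim 6: train, compare the average loss against the preset threshold, and propose new parameters until the threshold is reached. The quadratic toy loss and the uniform random proposal (standing in for the patent's undisclosed Bayesian optimization network, which would fit a surrogate model to past trials) are assumptions.

```python
import random

LOSS_THRESHOLD = 0.05   # preset loss threshold (illustrative value)
MAX_ROUNDS = 200

def train_and_average_loss(lr: float) -> float:
    """Toy stand-in for training the search model and returning the
    average training loss; a real model would be fitted here."""
    return (lr - 0.1) ** 2  # minimal at lr = 0.1

def propose_params() -> float:
    """Stand-in for the Bayesian optimization network: a real optimizer
    would condition on past (params, loss) pairs; this samples uniformly."""
    return random.uniform(0.0, 1.0)

lr = propose_params()
avg_loss = train_and_average_loss(lr)
rounds = 0
while avg_loss > LOSS_THRESHOLD and rounds < MAX_ROUNDS:
    lr = propose_params()                  # update parameters
    avg_loss = train_and_average_loss(lr)  # retrain the updated model
    rounds += 1

print(f"stopped after {rounds} proposals: lr={lr:.3f}, loss={avg_loss:.4f}")
```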
7. The search model training apparatus of claim 6, wherein the preprocessing module comprises:
a cleaning unit, configured to clean the document data to obtain text;
an extraction unit, configured to perform term extraction on the text to obtain initial terms;
an expansion unit, configured to expand the initial terms to obtain expanded terms;
and a conversion unit, configured to convert the initial terms and the expanded terms into the document-keyword matrix.
8. The search model training apparatus of claim 7, wherein the matrix conversion module comprises:
a first conversion unit, configured to convert the document-keyword matrix into an initial learner-keyword matrix;
a pseudo-relevance unit, configured to determine pseudo-relevance feedback information according to the initial learner-keyword matrix and the search keyword;
and a second conversion unit, configured to convert the initial learner-keyword matrix into the learner-keyword matrix according to the pseudo-relevance feedback information.
9. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the search model training method according to any one of claims 1 to 5.
10. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the search model training method according to any one of claims 1 to 5.
CN202011403845.7A 2020-12-04 2020-12-04 Search model training method, device, terminal equipment and storage medium Active CN112434134B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011403845.7A CN112434134B (en) 2020-12-04 2020-12-04 Search model training method, device, terminal equipment and storage medium
PCT/CN2020/140016 WO2022116324A1 (en) 2020-12-04 2020-12-28 Search model training method, apparatus, terminal device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011403845.7A CN112434134B (en) 2020-12-04 2020-12-04 Search model training method, device, terminal equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112434134A 2021-03-02
CN112434134B CN112434134B (en) 2023-10-20

Family

ID=74691140

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011403845.7A Active CN112434134B (en) 2020-12-04 2020-12-04 Search model training method, device, terminal equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112434134B (en)
WO (1) WO2022116324A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117891959B * 2024-03-15 2024-05-10 China National Institute of Standardization Document metadata storage method and system based on Bayesian network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105550216A * 2015-12-03 2016-05-04 Baidu Online Network Technology (Beijing) Co., Ltd. Searching method and device of academic research information and mining method and device of academic research information
CN106951420A * 2016-01-06 2017-07-14 Fujitsu Ltd. Literature search method and apparatus, author search method and equipment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105447169A * 2015-12-07 2016-03-30 Baidu Online Network Technology (Beijing) Co., Ltd. Document normalization method, document searching method and corresponding apparatus
WO2017096777A1 * 2015-12-07 2017-06-15 Baidu Online Network Technology (Beijing) Co., Ltd. Document normalization method, document searching method, corresponding apparatuses, device, and storage medium
CN108763354A * 2018-05-16 2018-11-06 Zhejiang University of Technology Personalized academic document recommendation method
CN110807101A * 2019-10-15 2020-02-18 Institute of Scientific and Technical Information of China Scientific and technical literature big data classification method
CN111581949A * 2020-05-12 2020-08-25 Shanghai R&D Public Service Platform Management Center Method and device for disambiguating scholar names, storage medium and terminal

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ni Na; Liu Kai; Li Yaodong: "Research on automatic keyword annotation algorithm for scientific and technical literature", Computer Science (计算机科学), no. 09 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113934872A * 2021-10-29 2022-01-14 Beijing Dajia Internet Information Technology Co., Ltd. Search result sorting method, device, equipment and storage medium
CN117909491A * 2024-03-18 2024-04-19 China National Institute of Standardization Document metadata analysis method and system based on Bayesian network
CN117909491B * 2024-03-18 2024-05-14 China National Institute of Standardization Document metadata analysis method and system based on Bayesian network

Also Published As

Publication number Publication date
WO2022116324A1 (en) 2022-06-09
CN112434134B (en) 2023-10-20

Similar Documents

Publication Publication Date Title
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
KR102577514B1 (en) Method, apparatus for text generation, device and storage medium
CN109829104B (en) Semantic similarity based pseudo-correlation feedback model information retrieval method and system
Alami et al. Unsupervised neural networks for automatic Arabic text summarization using document clustering and topic modeling
CN110019732B (en) Intelligent question answering method and related device
US10482146B2 (en) Systems and methods for automatic customization of content filtering
CN112434134B (en) Search model training method, device, terminal equipment and storage medium
CN111444320A (en) Text retrieval method and device, computer equipment and storage medium
CN109863487A Non-factoid question answering system and method, and computer program therefor
CN110737756B (en) Method, apparatus, device and medium for determining answer to user input data
CN106708929B (en) Video program searching method and device
CN110598078B (en) Data retrieval method and device, computer-readable storage medium and electronic device
CN110162771B (en) Event trigger word recognition method and device and electronic equipment
US20210103622A1 (en) Information search method, device, apparatus and computer-readable medium
Banik et al. GRU-based named entity recognition system for Bangla online newspapers
D’Silva et al. Unsupervised automatic text summarization of Konkani texts using K-means with Elbow method
CN112818091A (en) Object query method, device, medium and equipment based on keyword extraction
CN111753514B (en) Automatic generation method and device of patent application text
CN112926340A (en) Semantic matching model for knowledge point positioning
CN106570196B (en) Video program searching method and device
CN113590811A (en) Text abstract generation method and device, electronic equipment and storage medium
CN116775846A (en) Domain knowledge question and answer method, system, equipment and medium
CN117076636A (en) Information query method, system and equipment for intelligent customer service
CN112559711B (en) Synonymous text prompting method and device and electronic equipment
CN112667797B (en) Question-answer matching method, system and storage medium for self-adaptive transfer learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant