WO2022116324A1 - Search model training method, apparatus, terminal device and storage medium - Google Patents

Search model training method, apparatus, terminal device and storage medium

Info

Publication number: WO2022116324A1
Authority: WO, WIPO (PCT)
Prior art keywords: search, scholar, matrix, keyword, training
Prior art date: 2020-12-04
Application number: PCT/CN2020/140016
Other languages: English (en), Chinese (zh)
Inventors: 吴嘉澍, 王洋, 须成忠, 叶可江
Original Assignee: 中国科学院深圳先进技术研究院
Priority date: 2020-12-04 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2020-12-28
Publication date: 2022-06-09
Application filed by 中国科学院深圳先进技术研究院
Publication of WO2022116324A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the invention belongs to the technical field of information retrieval, and in particular relates to a search model training method, device, terminal equipment and storage medium.
  • the information retrieval system can retrieve a variety of entities, such as text, audio, games, and videos. Taking text search as an example, the retrieval system ranks each text by its relevance to the searched keywords, resulting in satisfactory search results.
  • in such a search, each text is considered independently for its relevance to the search keywords. Unlike the search for entities such as text, however, the search for scholars requires that all the literature written by a scholar be considered in a centralized and comprehensive manner, rather than considering each document individually.
  • the embodiments of the present invention provide a search model training method, apparatus, terminal device and storage medium to solve the problems of low search quality and low accuracy in search model training in the prior art.
  • a first aspect of the embodiments of the present invention provides a search model training method, including:
  • acquiring a data set, wherein the data set includes literature data and search key data, the literature data includes scholars and literature, each scholar includes at least two pieces of literature, and the search key data includes search keywords;
  • preprocessing the literature data to obtain a document-keyword matrix;
  • performing two rounds of matrix transformation on the document-keyword matrix according to the search keywords to obtain a scholar-keyword matrix;
  • inputting the scholar-keyword matrix into the search model to be trained for training, and outputting an average training loss;
  • if the average training loss does not reach a preset loss threshold, using a Bayesian optimization network to update the parameters of the search model to be trained;
  • retraining the updated search model to be trained, stopping training when the average training loss reaches the preset loss threshold, and using the search model to be trained at that time as the scholar search model.
  • a second aspect of the embodiments of the present invention provides a search model training device, including:
  • an acquisition module configured to acquire a data set, wherein the data set includes literature data and search key data, the literature data includes scholars and literature, each scholar includes at least two pieces of literature, and the search key data includes search keywords;
  • the preprocessing module is used to preprocess the document data to obtain a document-keyword matrix
  • the matrix transformation module is used to perform two matrix transformation processing on the document-keyword matrix to obtain the scholar-keyword matrix
  • the training module is used to input the scholar-keyword matrix into the search model to be trained for training, and output the average training loss;
  • the parameter adjustment module is used to update the parameters of the search model to be trained by using the Bayesian optimization network if the average training loss does not reach the preset loss threshold;
  • the training completion module is used to retrain the updated search model to be trained until the average training loss reaches the preset loss threshold, at which point training stops and the search model to be trained at that time is used as the scholar search model.
  • a third aspect of the embodiments of the present invention provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above search model training method when executing the computer program.
  • a fourth aspect of the embodiments of the present invention provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the above search model training method are implemented.
  • the document-keyword matrix is obtained by preprocessing the document data, and two rounds of matrix transformation are applied to the document-keyword matrix according to the search keywords to obtain the scholar-keyword matrix, which is input into the search model to be trained. Training stops when the average training loss reaches the preset loss threshold, and the search model to be trained at that time is used as the scholar search model; that is, the information of each document in the document-keyword matrix is aggregated into the scholar-keyword matrix by matrix transformation, with scholars as the unit.
  • FIG. 1 is a schematic diagram of a process flow and an optimization framework of a search model training method according to an embodiment of the present invention
  • FIG. 2 is a schematic flowchart of a search model training method according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of another embodiment of a process and an optimization framework of a search model training method according to an embodiment of the present invention
  • FIG. 4 is a schematic diagram of the document-keyword matrix, the scholar-keyword matrix, and the LSA matrix decomposition according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of a search model training device provided by an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of a hardware structure of a terminal device provided by an embodiment of the present invention.
  • the execution subject of the process is a terminal device
  • the terminal device includes but is not limited to a notebook computer, a computer, a server, a tablet computer, a smart phone, and other terminal devices with software development functions.
  • when the terminal device executes the process in the implementation of the present invention, it can provide native applications with the hosting functions of a desktop application or the display of a front-end interface, and provide an interface assembly framework for desktop applications.
  • FIG. 1 is a schematic diagram of the process and optimization framework of a search model training method according to an embodiment of the present invention, which specifically includes: preprocessing the document text to remove punctuation and obtain plain text, performing entry extraction on the text to obtain the document entries of the document text, expanding the document entries with a knowledge tree, and transforming the expanded entries together with the document entries to generate a document-entry matrix.
  • the document-entry matrix is transformed into a scholar-entry matrix through the first round of matrix transformation; pseudo-relevance feedback is calculated according to the scholar-entry matrix, and the second round of matrix transformation is guided by this pseudo-relevance feedback.
  • the scholar-entry matrix after the second round of transformation is used as the training data for training the XGBoost model (a gradient boosting tree model), and the average loss value of the XGBoost model is calculated.
  • if the average loss has not converged, a Bayesian optimization network search is used to update the parameters of the XGBoost model; training stops once the average training loss converges, and the updated XGBoost model at that time is used as the scholar search model.
  • through the two matrix transformations, the present invention aggregates the information of each document in the document-entry matrix into the scholar-entry matrix with researchers as the unit, so that in the process of searching for scholars the algorithm considers all the literature written by a scholar in a unified and comprehensive way.
  • in addition, the present invention cleverly uses pseudo-relevance feedback to assist the matrix transformation during the transformation process, making the search more accurate and efficient.
  • FIG. 2 is a schematic flowchart of a search model training method according to an embodiment of the present invention, which is described in detail as follows:
  • the document data includes a data set of each scholar and the academic documents written by them.
  • each article includes the title of the article, the abstract of the article, the text of the article, and the name of the academic conference/journal in which the article was published.
  • the search key data also includes result information corresponding to the search keywords, that is, documents marked as relevant to the search keywords; the search key data can thus be used for training the search model, making it convenient to compare the quality of the training results.
  • the document data is in text format and is converted into a corresponding matrix by preprocessing, which specifically includes:
  • cleaning the document data includes lowercasing the text, deleting redundant spaces and punctuation, and splitting clauses, so as to obtain plain text; entries are then extracted from the text by a preset word segmentation algorithm. The algorithm can be a dictionary-based method, such as forward maximum matching (MM), reverse maximum matching (RMM), or bi-directional matching (BM). The initial entries are then matched against the knowledge tree, and the words whose matching degree is sufficient are used as the expanded entries of each entry.
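  • as a concrete illustration, the following is a minimal sketch of dictionary-based forward maximum matching (MM); the toy dictionary, the maximum word length, and the sample string are illustrative assumptions, not part of the original disclosure:

    # Minimal sketch of dictionary-based forward maximum matching (MM).
    # The dictionary, maximum word length, and sample text are illustrative assumptions.
    def forward_max_match(text: str, dictionary: set, max_len: int = 4) -> list:
        """Segment text by greedily matching the longest dictionary word."""
        tokens = []
        i = 0
        while i < len(text):
            # Try the longest candidate first, shrinking until a match is found.
            for size in range(min(max_len, len(text) - i), 0, -1):
                candidate = text[i:i + size]
                if size == 1 or candidate in dictionary:
                    tokens.append(candidate)
                    i += size
                    break
        return tokens

    # Usage: segment a short Chinese string with a toy dictionary.
    toy_dict = {"信息", "检索", "信息检索", "模型"}
    print(forward_max_match("信息检索模型", toy_dict))  # ['信息检索', '模型']

  • reverse maximum matching scans from the end of the string instead, and bi-directional matching runs both passes and keeps the less ambiguous segmentation.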
  • BM25F is an improved version of the classic BM25 algorithm.
  • BM25 treats a document as a whole when calculating relevance, yet a document can be divided into multiple independent fields, especially in vertical search. For example, web pages may be divided into fields such as title, content, and subject words; the contributions of these fields to the topic of the article cannot be treated equally, so their weights should differ. BM25 does not consider this, so BM25F improves on it: instead of considering words only as individuals, it scores a document field by field, and the BM25F score is, for each keyword, the weighted sum of its scores over the fields.
  • a document-keyword matrix is constructed from the initial entries and the expanded entries, where the number of rows of the document-keyword matrix is the total number of documents, the number of columns is the total number of keywords in all documents (that is, the total number of initial entries and expanded entries), and the value stored at row i, column j is the BM25F score of the j-th keyword in the i-th document of the entire dataset.
  • the document-keyword matrix is shown in Figure 3.
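  • to make the construction concrete, below is a hedged sketch that builds the document-keyword matrix with a simplified BM25F score; the field weights, the k1 constant, the idf formula, and the omission of per-field length normalization are all simplifying assumptions rather than the exact scoring used here:

    # Hedged sketch: document-keyword matrix with a simplified BM25F score
    # (weighted field term frequencies plus BM25-style saturation).
    import math
    import numpy as np

    def bm25f_score(term, doc_fields, field_weights, idf, k1=1.5):
        """Simplified BM25F: weighted sum of field term frequencies, then saturation."""
        tf = sum(w * doc_fields.get(f, []).count(term)
                 for f, w in field_weights.items())
        return idf * tf / (k1 + tf) if tf > 0 else 0.0

    def build_doc_keyword_matrix(docs, keywords, field_weights):
        """Rows: documents; columns: keywords (initial + expanded entries)."""
        n_docs = len(docs)
        # document frequency and idf computed over whole documents
        df = {t: sum(any(t in fld for fld in d.values()) for d in docs)
              for t in keywords}
        idf = {t: math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1)
               for t in keywords}
        M = np.zeros((n_docs, len(keywords)))
        for i, d in enumerate(docs):
            for j, t in enumerate(keywords):
                M[i, j] = bm25f_score(t, d, field_weights, idf[t])
        return M

    # Usage with toy documents split into title/abstract fields.
    docs = [
        {"title": ["search", "model"], "abstract": ["training", "search", "model"]},
        {"title": ["image", "retrieval"], "abstract": ["retrieval", "index"]},
    ]
    weights = {"title": 2.0, "abstract": 1.0}  # title weighted higher, as in vertical search
    print(build_doc_keyword_matrix(docs, ["search", "model", "retrieval"], weights).round(3))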
  • in order to obtain pseudo-relevance feedback information, the present invention performs two rounds of matrix transformation on the document-keyword matrix. After the first round, the transformed scholar-keyword matrix is used to retrieve the scholars most relevant to the search keywords and their most relevant literature, generating pseudo-relevance feedback information. During the second round of transformation, the obtained pseudo-relevance feedback information is used to guide the matrix transformation to make it more reasonable. Specifically, the first matrix transformation and the second matrix transformation proceed as described in the schematic diagram of the embodiment.
  • the initial scholar-keyword matrix is transformed into the scholar-keyword matrix based on the pseudo-relevance feedback information.
  • pseudo-relevance feedback, also known as blind relevance feedback, automates part of the manual operation of relevance feedback so that users no longer need to perform additional interactions: the normal retrieval process is performed first and the most relevant documents are returned to form an initial set; the top k documents are then assumed to be relevant; finally, relevance feedback is performed as before under this assumption.
  • the transformation process can assign higher scores to scholars whose relevance exceeds the relevance feedback, and lower scores to scholars whose relevance falls further below it; this matrix transformation based on pseudo-relevance feedback makes each keyword score in the matrix more reasonable.
  • the initial scholar-keyword matrix is constructed according to formula (1): the score of each keyword for a scholar equals the sum of that keyword's document-keyword scores over all the documents written by the scholar (the first term in the numerator), multiplied by the logarithm of the number of the scholar's papers in which the keyword appears (the second term in the numerator), and divided by the logarithm of the total number of papers written by the scholar (the denominator term).
  • parameters λ1, λ2 and λ3 are assigned to the above three terms respectively to balance their importance.
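  • since formula (1) itself is not reproduced in this text, the following is only one plausible LaTeX rendering of the description above, where S_{a,k} is the score of keyword k for scholar a, D_a is the set of documents written by the scholar, s_{d,k} is the document-keyword (BM25F) score, and n_{a,k} is the number of the scholar's papers containing k:

    % Assumed reconstruction of formula (1); the exact symbols and the placement
    % of \lambda_1, \lambda_2, \lambda_3 are not confirmed by the source text.
    S_{a,k} \;=\;
    \frac{\lambda_1 \Big( \sum_{d \in D_a} s_{d,k} \Big)
          \cdot \lambda_2 \log\!\big( 1 + n_{a,k} \big)}
         {\lambda_3 \log\!\big( 1 + \lvert D_a \rvert \big)}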
  • determining the pseudo-relevance feedback information includes:
  • the search scholars are selected from the scholars, where the search scholars are the top n scholars with the highest correlation with the search keywords;
  • the top n documents of each search scholar with the highest relevance to the search keywords are selected from the documents;
  • the mean value of the third cosine similarities is calculated, and the obtained first mean is used as the pseudo-relevance feedback information.
  • the initial scholar-keyword matrix is decomposed by an LSA (Latent Semantic Analysis) model to obtain the scholar vectors; the LSA model is an existing algorithm model and will not be explained here.
  • the matrix decomposition process is shown in Figure 4, which is a schematic diagram of the decomposition of the literature-keyword matrix, the scholar-keyword matrix and the LSA matrix.
  • the initial scholar-keyword matrix (that is, the scholar-keyword matrix in Figure 4) is decomposed by the LSA model into the product of three matrices: the keyword latent-space matrix, the latent-space matrix (the matrix of latent variables), and the scholar matrix, which consists of the scholar vectors.
  • n is a positive integer, and in the embodiment of the present invention, n is 5.
  • the initial scholar-keyword matrix is decomposed using LSA; after the decomposition, the cosine formula is used to calculate the first cosine similarity between the search keyword (in vector form) and each scholar vector.
  • the scholars are sorted by the first cosine similarity in descending order, and the top 5 scholars most relevant to the search keywords are selected.
  • the document-keyword matrix is likewise matrix-decomposed to obtain the document vectors.
  • LSA is also used to perform this matrix decomposition; after the decomposition, the cosine formula is used to calculate the second cosine similarity between the search keyword (in vector form) and each document vector.
  • sorting the second cosine similarities in descending order, the top n documents of each search scholar with the highest correlation with the search keywords are selected. Since n was set to 5 above, it is also set to 5 here; that is, the setting of n remains the same.
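  • a hedged sketch of this selection step is shown below, using a truncated SVD in place of a full LSA implementation; the latent rank r and the representation of the query as a keyword-space vector are assumptions, while n = 5 follows the embodiment:

    # Hedged sketch: LSA via truncated SVD, then cosine-similarity screening.
    import numpy as np

    def lsa_vectors(matrix, r=50):
        """Truncated SVD (the LSA decomposition): matrix ~ U_r S_r V_r^T."""
        U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
        return U[:, :r] * s[:r], Vt[:r]        # row-entity vectors, keyword basis

    def top_n_by_cosine(query_keywords, matrix, n=5, r=50):
        """Indices of the n matrix rows (scholars or documents) most similar to the query."""
        entity_vecs, keyword_basis = lsa_vectors(matrix, r)
        query = keyword_basis @ query_keywords  # project query into the latent space
        sims = entity_vecs @ query / (
            np.linalg.norm(entity_vecs, axis=1) * np.linalg.norm(query) + 1e-12)
        return np.argsort(sims)[::-1][:n]       # descending cosine similarity

    # Usage: top 5 scholars from a toy scholar-keyword matrix; the same helper,
    # applied to the document-keyword matrix, yields each scholar's top documents.
    rng = np.random.default_rng(0)
    scholar_keyword = rng.random((20, 30))      # toy 20 scholars x 30 keywords
    query = np.zeros(30)
    query[[2, 7]] = 1.0                         # toy query hitting keywords 2 and 7
    print(top_n_by_cosine(query, scholar_keyword, n=5, r=10))

  • the same helper serves both rounds of screening: applied to the scholar-keyword matrix it yields the first cosine similarities, and applied to the document-keyword matrix it yields the second cosine similarities.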
  • formula (1) reflects the following facts: if the document-keyword scores of a scholar's documents are high, the score in the scholar-keyword matrix will also be high; the greater the number of the scholar's documents containing the keyword, the higher the score in the scholar-keyword matrix; and since a scholar who has written many documents naturally has a higher probability of any keyword appearing, the total number of documents written by the scholar is used as the denominator to reflect this fact.
  • in this way, the information of each document in the document-keyword matrix is aggregated into the scholar-keyword matrix on a scholar-by-scholar basis, so that when a scholar is searched, all documents written by that scholar receive comprehensive consideration.
  • the first mean, obtained by averaging the third cosine similarities, represents the degree of similarity that literature highly relevant to the search keywords should have; the first mean can therefore be used as the pseudo-relevance feedback information, which guides the second matrix transformation. For example, if the average similarity between a scholar's top 5 relevant documents and the search keywords is higher than the first mean of the feedback, a higher score is assigned in the initial scholar-keyword matrix. This matrix transformation based on pseudo-relevance feedback therefore transforms the matrix more effectively and improves search quality when the transformed matrix is used for scholar search.
  • converting the initial scholar-keyword matrix into the scholar-keyword matrix according to the pseudo-relevance feedback information includes:
  • the initial scholar-keyword matrix is transformed into a scholar-keyword matrix.
  • formula (2) is used to calculate the score of each element of the scholar-keyword matrix.
  • the denominator of formula (2) compares the average relevance of the scholar's most relevant literature (i.e., the second mean) with the obtained pseudo-relevance feedback information (i.e., the above-mentioned first mean).
  • the top n documents of each scholar with the highest correlation with the search keywords are obtained in the same way as the search scholars were selected by the first cosine similarity: the cosine similarities between the search keywords and each scholar's documents are calculated and sorted in descending order to select the top n documents.
  • the fourth cosine similarity between each scholar's top n documents and the search keywords is then calculated, and the fourth cosine similarities of each scholar are accumulated and averaged to obtain the second mean.
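  • formula (2) itself is not reproduced in this text, so the following is only one plausible reading of the comparison it describes: each scholar's row is rescaled by the ratio of that scholar's second mean to the global first mean, boosting scholars whose top n documents beat the pseudo-relevance feedback and damping the rest; the multiplicative form is an assumption:

    # Hedged, assumed reading of the formula (2) comparison.
    import numpy as np

    def prf_guided_transform(initial_matrix, second_means, first_mean):
        """Assumed reading of formula (2): row-wise rescale by (second mean / first mean)."""
        ratios = second_means / (first_mean + 1e-12)   # > 1: scholar beats the PRF mean
        return initial_matrix * ratios[:, np.newaxis]  # boost or damp each scholar's row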
  • the scholar-keyword matrix is input into the search model to be trained for training, and the average training loss is output.
  • the search model to be trained is an XGBoost model (eXtreme Gradient Boosting). The XGBoost model converts the ranking problem of scholar search into a binary classification problem over pairs of results: given results A and B, whether result A should be ranked before result B is a binary judgment. This reduces the demand for data, so training and learning can still be performed even when supervised ranking information is limited.
  • corresponding loss values are obtained during the XGBoost model training, and all loss values are averaged to obtain the current average training loss of the XGBoost model.
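  • below is a hedged sketch of this pairwise reduction with XGBoost; the feature construction (row differences of the scholar-keyword matrix), the toy labels, and the hyperparameters are assumptions:

    # Hedged sketch: pairwise reduction of ranking to binary classification.
    import numpy as np
    from xgboost import XGBClassifier
    from sklearn.metrics import log_loss

    def make_pairs(features, relevance):
        """'Should result A rank above result B?' as a binary label on A - B."""
        X, y = [], []
        for i in range(len(features)):
            for j in range(len(features)):
                if relevance[i] != relevance[j]:
                    X.append(features[i] - features[j])
                    y.append(int(relevance[i] > relevance[j]))  # 1: i ranks above j
        return np.array(X), np.array(y)

    # Toy stand-ins: rows of the scholar-keyword matrix for candidate results,
    # with graded relevance labels from the marked search-keyword data.
    rng = np.random.default_rng(0)
    features = rng.random((8, 30))
    relevance = np.array([2, 1, 0, 1, 2, 0, 0, 1])

    X_pairs, y_pairs = make_pairs(features, relevance)
    model = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)
    model.fit(X_pairs, y_pairs)
    avg_loss = log_loss(y_pairs, model.predict_proba(X_pairs))  # average training loss
    print(avg_loss)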
  • a Bayesian optimization network is used to update the parameters of the search model to be trained.
  • the preset loss threshold can be set according to the actual training situation, for example, the preset loss threshold is 0.1.
  • a Bayesian optimization network (Bayesian Optimization) is used to optimize and select the parameters of the current search model to be trained, thereby minimizing the loss value of the search model to be trained.
  • the Bayesian optimization network is a black-box optimization algorithm, which is used to solve the extreme value problem of the function with unknown expression.
  • the search principle of the Bayesian optimization network is to first generate an initial set of candidate solutions, then search for the point most likely to be an extremum, add this point to the set, and repeat this step until the iterations terminate; finally, the point with the largest function value among these points is taken as the solution of the extremum problem. It is more efficient than grid search and random search because it utilizes information from previously searched points: before making a new parameter selection decision, the Bayesian optimization network considers the effect of the previous parameter selections to optimize the next selection, thereby selecting and optimizing the parameters of the search model to be trained more efficiently and achieving a higher-quality scholar search.
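  • a hedged sketch of this outer optimization loop is shown below, using scikit-optimize's Gaussian-process minimizer as one common choice (not necessarily the implementation used here); the search space and call budget are illustrative, and the objective reuses the pairwise-training names from the previous sketch:

    # Hedged sketch: Bayesian optimization of the XGBoost hyperparameters.
    from skopt import gp_minimize
    from skopt.space import Integer, Real

    def average_training_loss(params):
        """Train the pairwise XGBoost model with these hyperparameters, return avg loss."""
        n_estimators, max_depth, learning_rate = params
        model = XGBClassifier(n_estimators=int(n_estimators), max_depth=int(max_depth),
                              learning_rate=float(learning_rate))
        model.fit(X_pairs, y_pairs)              # names from the previous sketch
        return log_loss(y_pairs, model.predict_proba(X_pairs))

    space = [Integer(50, 400),                   # n_estimators
             Integer(2, 10),                     # max_depth
             Real(0.01, 0.3, prior="log-uniform")]  # learning_rate
    result = gp_minimize(average_training_loss, space, n_calls=25, random_state=0)
    print("best average loss:", result.fun)      # stop once below the threshold, e.g. 0.1
    print("best parameters:", result.x)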
  • the search model to be trained whose parameters were updated in the above step S205 is used as the current training model, and training is repeated; the average training loss of each training run is compared with the preset loss threshold until the average training loss reaches the threshold, at which point training stops, the current parameters are used as the final parameters of the search model to be trained, and the scholar search model is thereby formed.
  • the data sets used in the training of the present invention are Chinese literature data, a thesaurus, and search ranking data sets; that is, the trained scholar search model can be used with keywords expressed in Chinese to search for Chinese scholars.
  • with data sets in other languages, the scholar search model also operates, and the training and optimization framework remains unchanged; the method is thus general with respect to language, which reduces the limitations of the search model.
  • a document-keyword matrix is obtained by preprocessing the document data, and two rounds of matrix transformation are applied to the document-keyword matrix according to the search keywords to obtain a scholar-keyword matrix.
  • the scholar-keyword matrix is input into the search model to be trained, and the average training loss is output.
  • if the average training loss does not reach the preset loss threshold, a Bayesian optimization network is used to update the parameters of the search model to be trained, and the updated search model is retrained; training stops when the average training loss reaches the preset loss threshold, and the search model to be trained at that time is used as the scholar search model. That is, the information of each document in the document-keyword matrix is aggregated by matrix transformation into the scholar-keyword matrix with scholars as the unit, so that when conducting a scholar search, all literature written by a scholar is comprehensively considered and the correlation between scholars and search keywords is accurately reflected. The scholar search task can therefore be completed better, with more accurate search results and efficient search quality.
  • since the present invention relies only on a small amount of marked search keywords and unmarked document data for training, it reduces the dependence on large amounts of labeled ranking supervision information, enabling training and learning under the condition of limited supervision information.
  • a search model training apparatus is also provided, and each module included in the search model training apparatus is used to execute each step in the embodiment corresponding to FIG. 2 .
  • FIG. 5 shows a schematic structural diagram of the first embodiment of the search model training device of the present invention, including an acquisition module 51, a preprocessing module 52, a matrix transformation module 53, a training module 54, a parameter adjustment module 55, and a training completion module 56:
  • the acquisition module 51 is used to acquire a data set, wherein the data set includes document data and search key data, the document data includes scholars and documents, each scholar includes at least two documents, and the search key data includes search keywords;
  • the preprocessing module 52 is used for preprocessing the document data to obtain a document-keyword matrix
  • the matrix transformation module 53 is used to perform two matrix transformation processes on the document-keyword matrix according to the search keywords, so as to obtain the scholar-keyword matrix;
  • the training module 54 is used to input the scholar-keyword matrix into the search model to be trained for training, and output the average training loss;
  • the parameter adjustment module 55 is used to update the parameters of the search model to be trained by using a Bayesian optimization network if the average training loss does not reach the preset loss threshold;
  • the training completion module 56 is used to retrain the updated search model to be trained until the average training loss reaches the preset loss threshold, at which point training stops and the search model to be trained at that time is used as the scholar search model.
  • the preprocessing module 52 includes:
  • the cleaning unit is used to clean the literature data to obtain text
  • the extraction unit is used to extract the entry from the text to obtain the initial entry
  • the expansion unit is used to expand the initial entry to obtain the expanded entry
  • the transformation unit is used to transform the initial entry and the expanded entry into a document-keyword matrix.
  • the matrix conversion module 53 includes:
  • a first conversion unit used to convert a document-keyword matrix into an initial scholar-keyword matrix
  • the pseudo-relevance unit is used to determine the pseudo-relevance feedback information according to the initial scholar-keyword matrix and the search keywords;
  • the second conversion unit is configured to convert the initial scholar-keyword matrix into a scholar-keyword matrix according to the pseudo-relevant feedback information.
  • the pseudo-relevance unit includes:
  • the first decomposition subunit is used to perform matrix decomposition on the initial scholar-keyword matrix to obtain a scholar vector
  • the first calculation subunit is used to calculate the first cosine similarity between the search keyword and the scholar vector
  • the first screening subunit is used to screen out the search scholars from the scholars according to the first cosine similarity, wherein the search scholars are the top n scholars with the highest correlation with the search keywords;
  • the second decomposition subunit is used to perform matrix decomposition on the document-keyword matrix to obtain a document vector
  • the second calculation subunit is used to calculate the second cosine similarity between the search keyword and the document vector
  • the second screening subunit is used to screen out the top n documents of each search scholar with the highest correlation with the search keywords from the documents according to the second cosine similarity;
  • the third calculation subunit is used to calculate the third cosine similarity between the top n documents of each search scholar and the search keywords respectively;
  • the first mean subunit is used to calculate the mean of the third cosine similarities and use the obtained first mean as the pseudo-relevance feedback information.
  • the second conversion unit includes:
  • the acquisition subunit is used to acquire the top n documents of each scholar that are most relevant to the search keywords;
  • the fourth calculation subunit is used to calculate the fourth cosine similarity between the first n documents of each scholar and the search keywords;
  • the second mean subunit is used to perform mean calculation on the fourth cosine similarity to obtain the second mean
  • the conversion subunit is used to convert the initial scholar-keyword matrix into a scholar-keyword matrix according to the first mean value and the second mean value.
  • each module/unit in the above search model training apparatus corresponds to each step in the above search model training method embodiment, and the functions and implementation process thereof will not be repeated here.
  • FIG. 6 is a schematic diagram of a terminal device provided by an embodiment of the present invention.
  • the terminal device 6 of this embodiment includes: a processor 60, a memory 61, and a computer program 62 stored in the memory 61 and executable on the processor 60, such as a search model training program.
  • when the processor 60 executes the computer program 62, the steps in each of the foregoing search model training method embodiments are implemented, for example, the steps shown in FIG. 2.
  • alternatively, when the processor 60 executes the computer program 62, the functions of the modules/units in the above-mentioned apparatus embodiments are implemented, for example, the functions of the modules 51 to 56 shown in FIG. 5.
  • the computer program 62 may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 61 and executed by the processor 60 to complete the present invention.
  • the one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer program 62 in the search model training apparatus/terminal device 6 .
  • for example, the computer program 62 can be divided into an acquisition module, an execution module, and a generation module (modules in a virtual device); the specific functions of each module are as described above and will not be repeated here.
  • the terminal device 6 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the terminal device 6 may include, but is not limited to, a processor 60 and a memory 61.
  • FIG. 6 is only an example of the terminal device 6 and does not constitute a limitation on the terminal device 6, which may include more or fewer components than shown, or combine some components, or have different components;
  • the terminal device 6 may further include an input and output device, a network access device, a bus, and the like.
  • the so-called processor 60 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the memory 61 may be an internal storage unit of the terminal device 6 , such as a hard disk or a memory of the terminal device 6 .
  • the memory 61 may also be an external storage device of the terminal device 6, such as a plug-in hard disk equipped on the terminal device, a smart media card (SMC), a secure digital (SD) card, a flash card, and so on.
  • the memory 61 may also include both an internal storage unit of the terminal device 6 and an external storage device.
  • the memory 61 is used to store the computer program and other programs and data required by the terminal device.
  • the memory 61 can also be used to temporarily store data that has been output or will be output.
  • the disclosed system/terminal device and method may be implemented in other manners.
  • the system/terminal device embodiments described above are only illustrative.
  • the division of the modules or units is only a logical function division; in actual implementation, there may be other division methods, for example, multiple units or components may be combined or integrated into another device, or some features may be omitted or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be realized through some interfaces, or through indirect coupling or communication connection between devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
  • the integrated modules/units if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium.
  • the present invention can implement all or part of the processes in the methods of the above embodiments, which can also be accomplished by instructing relevant hardware through a computer program; the computer program can be stored in a computer-readable storage medium.
  • the computer program includes computer program code, and the computer program code may be in the form of source code, object code, executable file or some intermediate form, and the like.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electric carrier signal, a telecommunication signal, a software distribution medium, and so on. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electric carrier signals and telecommunication signals.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed are a search model training method, apparatus, terminal device and storage medium, applicable to the technical field of information retrieval. The search model training method comprises: preprocessing document data to obtain a document-keyword matrix (S202); performing two rounds of matrix transformation on the document-keyword matrix according to a search keyword to obtain a scholar-keyword matrix (S203); inputting the scholar-keyword matrix into a search model to be trained for training, and outputting an average training loss (S204); if the average training loss does not reach a preset loss threshold, using a Bayesian optimization network to update a parameter of the search model to be trained (S205); and retraining the updated search model to be trained, stopping training when the average training loss reaches the preset loss threshold, and using the search model to be trained at that time as a scholar search model (S206), thereby enabling efficient, high-quality searches over the literature written by a scholar and improving the accuracy of search results.
PCT/CN2020/140016 2020-12-04 2020-12-28 Search model training method, apparatus, terminal device and storage medium WO2022116324A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011403845.7 2020-12-04
CN202011403845.7A CN112434134B (zh) 2020-12-04 2020-12-04 Search model training method, apparatus, terminal device and storage medium

Publications (1)

Publication Number Publication Date
WO2022116324A1 (fr) 2022-06-09

Family

ID=74691140

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/140016 WO2022116324A1 (fr) 2020-12-04 2020-12-28 Procédé de formation de modèle de recherche, appareil, dispositif terminal et support de stockage

Country Status (2)

Country Link
CN (1) CN112434134B (fr)
WO (1) WO2022116324A1 (fr)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113934872A (zh) * 2021-10-29 2022-01-14 北京达佳互联信息技术有限公司 Search result ranking method, apparatus, device and storage medium
CN117909491B (zh) * 2024-03-18 2024-05-14 中国标准化研究院 Bayesian network-based document metadata parsing method and system


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105550216A (zh) * 2015-12-03 2016-05-04 百度在线网络技术(北京)有限公司 Search method, mining method and apparatus for academic research information
WO2017096777A1 (fr) * 2015-12-07 2017-06-15 百度在线网络技术(北京)有限公司 Document standardization method, document search method, corresponding apparatuses, device and storage medium
CN106951420A (zh) * 2016-01-06 2017-07-14 富士通株式会社 Document search method and device, and author search method and device
CN108763354A (zh) * 2018-05-16 2018-11-06 浙江工业大学 Personalized academic document recommendation method
CN110807101A (zh) * 2019-10-15 2020-02-18 中国科学技术信息研究所 Big-data classification method for scientific and technical documents
CN111581949A (zh) * 2020-05-12 2020-08-25 上海市研发公共服务平台管理中心 Scholar name disambiguation method, apparatus, storage medium and terminal

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117891959A (zh) * 2024-03-15 2024-04-16 中国标准化研究院 Bayesian network-based document metadata storage method and system
CN117891959B (zh) * 2024-03-15 2024-05-10 中国标准化研究院 Bayesian network-based document metadata storage method and system

Also Published As

Publication number Publication date
CN112434134A (zh) 2021-03-02
CN112434134B (zh) 2023-10-20

Similar Documents

Publication Publication Date Title
CN110162593B (zh) Search result processing and similarity model training method and apparatus
CN109101479B (zh) Clustering method and apparatus for Chinese sentences
WO2022116324A1 (fr) Search model training method, apparatus, terminal device and storage medium
Alami et al. Unsupervised neural networks for automatic Arabic text summarization using document clustering and topic modeling
US20200081899A1 (en) Automated database schema matching
CN105183833B (zh) User-model-based microblog text recommendation method and recommendation apparatus
Qian et al. Social event classification via boosted multimodal supervised latent dirichlet allocation
US8694303B2 (en) Systems and methods for tuning parameters in statistical machine translation
US20100205202A1 (en) Visual and Textual Query Suggestion
US10482146B2 (en) Systems and methods for automatic customization of content filtering
CN108647322B (zh) Method for identifying the similarity of large amounts of Web text information based on a word network
CN108509521B (zh) Image retrieval method for automatically generating text indexes
CN110569289B (zh) Big-data-based column data processing method, device and medium
D’Silva et al. Unsupervised automatic text summarization of Konkani texts using K-means with Elbow method
Papadopoulos et al. Image clustering through community detection on hybrid image similarity graphs
CN110705247A (zh) Text similarity calculation method based on χ2-C
CN113962293A (zh) Name disambiguation method and system based on LightGBM classification and representation learning
CN115203421A (zh) Label generation method, apparatus, device and storage medium for long texts
CN106570196B (zh) Video program search method and apparatus
Jia et al. A Chinese unknown word recognition method for micro-blog short text based on improved FP-growth
CN111723179B (zh) Concept-graph-based feedback model information retrieval method, system and medium
CN114298020A (zh) Keyword vectorization method based on topic semantic information and application thereof
Ma et al. Chinese social media entity linking based on effective context with topic semantics
Meiyu et al. Image semantic description and automatic semantic annotation
CN113761125A (zh) Dynamic summary determination method and apparatus, computing device and computer storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20964169

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20964169

Country of ref document: EP

Kind code of ref document: A1