CN112084150A - Model training method, data retrieval method, device, equipment and storage medium

Info

Publication number
CN112084150A
Authority
CN
China
Prior art keywords: search, document, target, model, sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010939453.6A
Other languages
Chinese (zh)
Inventor
潘秋桐
和为
刘准
何伯磊
李雅楠
巩江传
李瑞高
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010939453.6A
Publication of CN112084150A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/148 File search processing
    • G06F16/17 Details of further file system functions
    • G06F16/1734 Details of monitoring file system events, e.g. by the use of hooks, filter drivers, logs
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles

Abstract

The application discloses a model training method, a data retrieval method, a device, equipment and a storage medium, relating to the fields of information retrieval and knowledge sharing. The specific implementation scheme is as follows: acquiring a historical search click log generated by a knowledge sharing system; generating a sample set according to the historical search click log; and training a model by using the sample set to obtain a target model. In this implementation, the search click logs generated by users in an enterprise-level wiki are analyzed and used to train the target model, and the required knowledge can then be found quickly and accurately through the target model.

Description

Model training method, data retrieval method, device, equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, specifically to the fields of information retrieval and knowledge sharing, and in particular to a model training method, a data retrieval method, an apparatus, a device, and a storage medium.
Background
A common problem in the development of enterprises of all sizes is that, as a company grows, projects accumulate and employees come and go, producing a large number of documents that contain valuable experience and knowledge. If these documents are not managed uniformly online, it is difficult to systematize and standardize this knowledge, and part of it is lost when key employees leave. Therefore, most enterprises introduce an enterprise-level wiki that concentrates the office knowledge documents accumulated by the enterprise in one place and serves as an internal search engine.
At the same time, a new problem arises: once a huge amount of knowledge is available, how can the needed knowledge be found quickly and accurately? Most enterprise-level wikis are weak at meeting this user demand, which hurts the efficiency of knowledge transfer and of work itself.
Disclosure of Invention
A model training and data retrieval method, apparatus, device and storage medium are provided.
According to a first aspect, there is provided a model training method comprising: acquiring a historical search click log generated by a knowledge sharing system; generating a sample set according to the historical search click log; and training a model by using the sample set to obtain a target model.
According to a second aspect, there is provided a data retrieval method comprising: receiving a target search statement input by a user through a terminal; determining a target feature vector according to the target search statement and the target model described in the first aspect; determining a feature vector of each document in the target search result for the target search statement; and sorting the documents in the target search result according to the feature vectors and the target feature vector.
According to a third aspect, there is provided a model training apparatus comprising: a log obtaining unit configured to obtain a historical search click log generated by the knowledge sharing system; a sample generating unit configured to generate a sample set according to the historical search click log; and a model training unit configured to train a model by using the sample set to obtain the target model.
According to a fourth aspect, there is provided a data retrieval apparatus comprising: a search sentence receiving unit configured to receive a target search statement input by a user through a terminal; a first vector determination unit configured to determine a target feature vector according to the target search statement and the target model described in the first aspect; a second vector determination unit configured to determine a feature vector of each document in the target search result for the target search statement; and a document sorting unit configured to sort the documents in the target search result according to the feature vectors and the target feature vector.
According to a fifth aspect, there is provided a model training electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in the first aspect.
According to a sixth aspect, there is provided an electronic device for data retrieval, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in the second aspect.
According to a seventh aspect, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method as described in the first aspect.
According to an eighth aspect, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method as described in the second aspect.
According to the technology of the application, the search click logs generated by users in the enterprise-level wiki are analyzed and used to train the target model, and the required knowledge can then be found quickly and accurately through the target model.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a model training method according to the present application;
FIG. 3 is a flow diagram of another embodiment of a model training method according to the present application;
FIG. 4 is a flow diagram of one embodiment of a data retrieval method according to the present application;
FIG. 5 is a schematic diagram of an application scenario of a model training method, a data retrieval method according to the present application;
FIG. 6 is a schematic block diagram of one embodiment of a model training apparatus according to the present application;
FIG. 7 is a schematic block diagram of one embodiment of a data retrieval device according to the present application;
fig. 8 is a block diagram of an electronic device for implementing the model training method and the data retrieval method according to the embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding; these details should be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
FIG. 1 illustrates an exemplary system architecture 100 to which embodiments of the model training method, data retrieval method, model training apparatus, or data retrieval apparatus of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various communication client applications, such as a browser application, a social platform application, and the like, may be installed on the terminal devices 101, 102, and 103.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices, including but not limited to smart phones, tablet computers, e-book readers, car computers, laptop portable computers, desktop computers, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the electronic apparatuses listed above, implemented either as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. This is not specifically limited here.
The server 105 may be a server that provides various services, such as a background retrieval server that handles search statements sent by the terminal devices 101, 102, 103. The background retrieval server may rank the retrieval results and feed the ranked results back to the terminal devices 101, 102, and 103.
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. This is not specifically limited here.
It should be noted that the model training method and the data retrieval method provided in the embodiments of the present application are generally executed by the server 105. Accordingly, the model training device and the data retrieval device are generally provided in the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a model training method according to the present application is shown. The model training method of the embodiment comprises the following steps:
Step 201, obtaining a historical search click log generated by a knowledge sharing system.
In this embodiment, an executing body of the model training method (for example, the server 105 shown in fig. 1, or another electronic device not shown in fig. 1) may obtain the historical search click log generated by the knowledge sharing system. Here, the knowledge sharing system may be an enterprise-level wiki, through which employees within an enterprise may share knowledge, and may also search for knowledge and obtain it by browsing the search results. During searching and browsing, users generate search click logs. The executing body may take the search click logs generated by the knowledge sharing system in the previous day or previous week as the historical search click log.
Step 202, generating a sample set according to the historical search click log.
After obtaining the historical search click log, the executing body may analyze it to obtain a sample set. Specifically, the executing body may extract from the historical search click log the search statements input by users, as well as the click information of users on each document in the corresponding search results. The executing body may take a search statement together with a clicked document as a positive sample, and the search statement together with a document that was not clicked as a negative sample. In this way, the executing body obtains a plurality of positive samples and a plurality of negative samples, i.e., a sample set.
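For illustration only, a minimal sketch of this step follows. The record layout (`query`, `results`, `clicked`) is an assumed structure for parsed log entries, not a format defined by the application.

```python
# Minimal sketch of sample construction, assuming each parsed log record looks
# like {"query": str, "results": [doc_id, ...], "clicked": {doc_id, ...}}.
# These field names are illustrative assumptions, not from the application.

def build_samples(log_records):
    positives, negatives = [], []
    for rec in log_records:
        for doc in rec["results"]:
            pair = (rec["query"], doc)
            if doc in rec["clicked"]:
                positives.append(pair)   # search statement + clicked document
            else:
                negatives.append(pair)   # search statement + un-clicked document
    return positives, negatives
```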
Step 203, training the model by using the sample set to obtain a target model.
After obtaining the sample set, the executing body can train a model using it to obtain the target model. Specifically, the executing body may take the search statements in the sample set as input, take the documents in the positive and negative samples as expected output, and train to obtain the target model.
According to the model training method provided by this embodiment of the application, the search click logs generated by users in the enterprise-level wiki are analyzed and used to train the target model, and the required knowledge can be found quickly and accurately through the target model.
With continued reference to FIG. 3, a flow 300 of another embodiment of a model training method according to the present application is shown. In the embodiment shown in fig. 3, the above method may include the steps of:
Step 301, obtaining a historical search click log generated by a knowledge sharing system.
In this embodiment, the historical search click log includes at least one search statement and click information for each document in the search results for the at least one search statement. It will be appreciated that the log of historical search clicks may also include an identification of the user who entered each search statement, and may also include information for each user click.
Step 302, filtering the historical search click log to obtain filtered data.
In this embodiment, the executing body may filter the historical search click log to obtain the filtered data. This improves the accuracy of the samples in the sample set, and thus the accuracy of the target model's output. Specifically, the executing body may filter out of the historical search click log the search statements whose search frequency is below a preset value, or the documents whose click count is below a preset value.
In some optional implementations of this embodiment, the executing body may filter the historical search click log as follows:
step 3021, determining the number of searches in each search term and the number of clicks in each document according to the history search click log.
Step 3022, filtering the search sentences of which the search times are smaller than a first preset threshold and/or the documents of which the click times are smaller than a second preset threshold to obtain filtered data.
In this implementation, the executing body may first determine the search count of each search statement and the click count of each document from the historical search click log. If a search statement is searched too few times, it appears rarely and contributes little to learning, so it may be filtered out. If a document is clicked too few times, it is likely a poor match for its search statement and may likewise be filtered out.
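A minimal sketch of this filtering step, using the same assumed record layout as the sketch above; the threshold values are illustrative assumptions, since the description only calls them preset thresholds.

```python
from collections import Counter

def filter_log(log_records, min_searches=5, min_clicks=2):
    # Count how often each search statement was searched and each document clicked.
    search_counts = Counter(rec["query"] for rec in log_records)
    click_counts = Counter(doc for rec in log_records for doc in rec["clicked"])
    filtered = []
    for rec in log_records:
        if search_counts[rec["query"]] < min_searches:
            continue  # drop rarely searched statements: little learning value
        # Drop clicks on rarely clicked documents: likely poor matches.
        kept = {d for d in rec["clicked"] if click_counts[d] >= min_clicks}
        filtered.append({**rec, "clicked": kept})
    return filtered
```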
Step 303, for each search statement, determining a clicked document set of the search statement from the search result according to click information of each document in the search result of the search statement.
In this embodiment, the executing body may analyze each search statement contained in the historical search click log and determine the click information of each document in the search results returned to users for that search statement. That is, the executing body may determine which documents in the search results were clicked by users and which were not, resulting in a clicked document set.
Step 304, determining a positive sample set according to the search statement and the clicked document set.
The executing body may determine a positive sample set from the search statement and the resulting clicked document set. Specifically, the executing body may treat the search statement paired with a single clicked document as one positive sample, thereby obtaining the positive sample set.
Step 305, determining a negative sample set of the search statement according to the search result and the clicked document set.
After the clicked document set is obtained, the executing body can also determine the un-clicked documents from the search results. The executing body may treat the search statement paired with a single un-clicked document as one negative sample, resulting in the negative sample set.
In some optional implementations of the embodiment, the search result may include a plurality of documents and a ranking of the documents. That is, documents have been ranked in the search results. The execution subject may obtain the set of negative examples by:
Step 3051, determining, according to the ranking, a plurality of un-clicked documents adjacent to each clicked document in the clicked document set, to obtain an un-clicked document set.
Step 3052, obtaining the negative sample set according to the search statement and the un-clicked document set.
In this implementation, the executing body may determine, within the ranking, a plurality of un-clicked documents adjacent to each clicked document, to obtain the un-clicked document set. For example, if the search statement is abc and the user clicks the document in item 4 of the search results, then the documents in items 1, 2, 3, 5 and 6 are adjacent un-clicked documents. The executing body may pair the search statement with each un-clicked document to form a negative sample, obtaining the negative sample set.
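The neighbor-based negative sampling described above might look like the following sketch. The window width is an assumption, since the description only says "a plurality of adjacent un-clicked documents".

```python
def adjacent_negatives(query, ranked_results, clicked, window=3):
    # For each clicked document, take up to `window` neighboring un-clicked
    # documents on each side of it in the ranking as negatives.
    negatives = set()
    for i, doc in enumerate(ranked_results):
        if doc not in clicked:
            continue
        lo, hi = max(0, i - window), min(len(ranked_results), i + window + 1)
        negatives.update(d for d in ranked_results[lo:hi] if d not in clicked)
    return [(query, d) for d in negatives]
```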
In some optional implementations of this embodiment, a ratio of the number of negative samples to the number of positive samples in the sample set is a preset value.
In this implementation, to preserve the model's learning ability while keeping the amount of computation during training manageable, the ratio of the number of negative samples to the number of positive samples can be held at a preset value. The preset value can be set according to the actual application scenario.
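One way to hold that ratio, sketched under the assumption that simple random downsampling of negatives is acceptable:

```python
import random

def downsample_negatives(positives, negatives, ratio=4, seed=0):
    # Keep at most `ratio` negatives per positive; the ratio value here is an
    # assumption, since the description only calls it "a preset value".
    rng = random.Random(seed)
    k = min(len(negatives), ratio * len(positives))
    return rng.sample(negatives, k)
```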
Step 306, for each sample of the training set, determining a feature vector corresponding to a document in the sample.
In this embodiment, the sample set may include a training set and a test set. The samples in the training set are used to train the model, and the samples in the test set are used to test the trained model. For each sample in the training set, the executing body may determine the feature vector corresponding to the document in that sample. Specifically, the executing body may determine the feature vector through an existing feature extraction algorithm. Alternatively, the executing body may compute certain features of the document to form its feature vector.
In some optional implementations of this embodiment, the executing body may determine the feature vector corresponding to the document by obtaining the feature vector based on the search statement and document in the sample and at least one pre-trained model.
In this implementation, the executing body may obtain the feature vector based on the search statement and document in the sample and at least one pre-trained model. For example, the executing body may calculate at least one of the following items of information to determine the feature vector: the relevance score of the document, obtained by inputting the search statement and the document into a pre-trained relevance model; the click weight of the document under the search statement in the last week; the proportion of times the document is the first click under the search statement; the proportion of times the document is the last click under the search statement; the proportion of times the document is a satisfied click under the search statement; the proportion of times the document is a long click under the search statement; and the proportion of times the document is a short click under the search statement.
Among the above items, the relevance model is used to calculate the relevance between a search statement and a document. The click weight may be calculated using the Wilson score. This algorithm can be used for quality ranking: given data with positive and negative reviews, it computes a score that takes both the number of reviews and the positive-review rate into account, and the higher the score, the higher the quality of the data. For example, suppose doctor A has 100 evaluations, 99 positive and 1 negative, for a 99% positive rate, while doctor B has 2 evaluations, both positive, for a 100% positive rate. Which should be ranked first? Using the Wilson score, doctor A scores 0.9440 and doctor B scores 0.3333, so doctor A is ranked first.
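For reference, the lower bound of the Wilson score interval can be computed as in this sketch; choosing z = 2 (an assumption, since the description does not give z) reproduces the scores in the example above.

```python
from math import sqrt

def wilson_lower_bound(positive, total, z=2.0):
    # Lower bound of the Wilson score interval for a Bernoulli proportion.
    if total == 0:
        return 0.0
    p = positive / total
    centre = p + z * z / (2 * total)
    spread = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - spread) / (1 + z * z / total)

print(round(wilson_lower_bound(99, 100), 4))  # 0.944  (doctor A)
print(round(wilson_lower_bound(2, 2), 4))     # 0.3333 (doctor B)
```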
Among the above items, the first click refers to the first click a user makes while browsing the search results, and the last click refers to the final click a user makes while browsing the search results. A satisfied click refers to the last click under the final search statement issued with the same search intent. Whether two search statements share the same search intent is determined from the statements themselves: the executing body may calculate the similarity between the search statements and compare it with a preset threshold; if the statements are similar enough, the search intent is considered the same. The similarity between search statements may be obtained from their edit distance or through semantic analysis of the statements. A long click may include a satisfied click. In addition, the last click under a non-final search statement with the same search intent is a long click if the time difference between that click and the next search statement is greater than a first preset time period (e.g., 40s) and less than a second preset time period (e.g., 3600s), and a short click if the difference is greater than a third preset time period (e.g., 0s) and less than a fourth preset time period (e.g., 5s). Similarly, for two adjacent clicks, if the time difference between the earlier click and the later click is greater than the first preset time period (e.g., 40s) and less than the second preset time period (e.g., 3600s), the earlier click is a long click; if the difference is greater than the third preset time period (e.g., 0s) and less than the fourth preset time period (e.g., 5s), it is a short click.
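A sketch of the dwell-time rule for long and short clicks, using the example thresholds quoted above; here the dwell time is the gap between a click and the next click or next search statement.

```python
def classify_click(dwell_seconds,
                   long_min=40, long_max=3600,   # first/second preset periods
                   short_min=0, short_max=5):    # third/fourth preset periods
    # Threshold values are the example values quoted in the description.
    if long_min < dwell_seconds < long_max:
        return "long"
    if short_min < dwell_seconds < short_max:
        return "short"
    return "other"
```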
In addition, the executing body may also determine at least one of the following items of information for the document: the authority score of the document, calculated through a pre-trained authority model; the quality score of the document, calculated through a pre-trained quality model; the timeliness of the document (determined from the current time, the creation time of the document, and a preset time period); the authority features of the document, computed through a PageRank model (the web page ranking algorithm first proposed and used by Google; essentially, it estimates the importance of web pages mainly from the number and quality of hyperlinks between them); and the search frequency of the search statement in the last week.
The executing body may combine the above items of information into the feature vector of the document in the sample.
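Put together, the feature vector might be assembled as in the sketch below. The `stats` and `models` lookups are hypothetical helper objects introduced for illustration; none of these names come from the application.

```python
def document_features(query, doc, stats, models):
    # Assemble the signals listed above into one feature vector.
    # `stats` and `models` are assumed, hypothetical helper objects.
    return [
        models["relevance"].score(query, doc),   # relevance model score
        stats.click_weight(query, doc),          # Wilson click weight, last week
        stats.first_click_rate(query, doc),
        stats.last_click_rate(query, doc),
        stats.satisfied_click_rate(query, doc),
        stats.long_click_rate(query, doc),
        stats.short_click_rate(query, doc),
        models["authority"].score(doc),          # authority model score
        models["quality"].score(doc),            # quality model score
        stats.timeliness(doc),                   # from creation time vs. now
        stats.pagerank(doc),                     # PageRank-style authority
        stats.query_frequency(query),            # searches in the last week
    ]
```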
Step 307, taking the search statements in each sample of the training set as input, taking the feature vectors corresponding to the input search statements as expected output, and training to obtain the target model.
The executing body may take the search statement in each sample of the training set as the model's input, take the feature vector corresponding to the input search statement as the expected output, and train to obtain the target model.
Step 308, determining the search effect of the target model by using the test set.
In this embodiment, the executing body may further determine the search effect of the target model by using the test set. Specifically, the executing body may take the search statement in each sample of the test set as input to the target model and compare the output of the target model with the document corresponding to the input search statement. It can be understood that if the two are similar, the search effect of the target model is considered good; if they are not similar, the search effect is considered poor.
In some specific implementations, the executing body may evaluate the effect of the target model through objective and subjective indicators. The objective indicators include Accuracy (ACC) and the Area Under the ROC Curve (AUC). The subjective indicator may be GSB (Good, Same, Bad).
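The two objective indicators can be computed with standard tooling, as in this sketch; scores and labels are per (search statement, document) test pair, and the 0.5 decision threshold is an assumption.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

def objective_metrics(pair_scores, pair_labels, threshold=0.5):
    # pair_labels: 1 if the document was clicked for the statement, else 0.
    # The subjective GSB comparison requires human annotators and is not shown.
    auc = roc_auc_score(pair_labels, pair_scores)
    acc = accuracy_score(pair_labels, [s >= threshold for s in pair_scores])
    return acc, auc
```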
According to the model training method provided by this embodiment of the application, the historical search click log can be analyzed in detail, invalid data in it can be filtered out, and the model can be trained on the remaining valid data, thereby improving the accuracy of the model.
Referring to FIG. 4, a flow 400 of one embodiment of a data retrieval method according to the present application is shown. As shown in fig. 4, the data retrieval method of the present embodiment may include the following steps:
step 401, receiving a target search statement input by a user through a terminal.
In this embodiment, an executing body of the data retrieval method (for example, the server 105 shown in fig. 1) may receive a target search statement input by a user through a terminal (for example, the terminals 101, 102, 103 shown in fig. 1). The executing body of this embodiment may be the same as or different from that of the embodiments shown in fig. 2 and fig. 3. A user may access the enterprise-level wiki through the terminal and enter a target search statement in it.
Step 402, determining a target feature vector according to the target search statement and the target model.
After receiving the target search statement, the execution body may input the target search statement into the target model to obtain a target feature vector of the target search statement. Here, the target model may be the target model obtained by the embodiment of fig. 2 or fig. 3.
Step 403, determining the feature vector of each document in the target search result of the target search statement.
The executing body may further determine a feature vector for each document in the target search result of the target search statement. The executing body may obtain the feature vector by analyzing each document, for example by obtaining at least one of the following items of information for each document: the relevance score of the document, obtained by inputting the search statement and the document into a pre-trained relevance model; the click weight of the document under the search statement in the last week; the proportion of times the document is the first click under the search statement; the proportion of times the document is the last click under the search statement; the proportion of times the document is a satisfied click under the search statement; the proportion of times the document is a long click under the search statement; the proportion of times the document is a short click under the search statement; the authority score of the document, calculated through a pre-trained authority model; the quality score of the document, calculated through a pre-trained quality model; the timeliness of the document (determined from the current time, the creation time of the document, and a preset time period); the authority features of the document, computed through a PageRank model (the web page ranking algorithm first proposed and used by Google; essentially, it estimates the importance of web pages mainly from the number and quality of hyperlinks between them); and the search frequency of the search statement in the last week.
Step 404, sorting the documents in the target search result according to the feature vectors and the target feature vector.
The executing body may sort the documents in the target search result according to the feature vector of each document and the target feature vector. Specifically, the executing body may place the documents whose feature vectors are most similar to the target feature vector at the front of the search results.
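As one concrete reading of "similar", the sketch below sorts documents by cosine similarity between each document's feature vector and the target feature vector; the metric choice is an assumption, since the description does not fix one.

```python
import numpy as np

def sort_by_similarity(target_vec, doc_vecs):
    # doc_vecs: {doc_id: feature_vector}; returns doc_ids, most similar first.
    t = np.asarray(target_vec, dtype=float)
    t = t / (np.linalg.norm(t) + 1e-12)
    scored = []
    for doc_id, vec in doc_vecs.items():
        v = np.asarray(vec, dtype=float)
        sim = float(t @ (v / (np.linalg.norm(v) + 1e-12)))  # cosine similarity
        scored.append((doc_id, sim))
    return [d for d, _ in sorted(scored, key=lambda x: x[1], reverse=True)]
```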
According to the data retrieval method provided by the embodiment of the application, the relevant documents can be quickly and accurately retrieved by using the trained target model.
With continued reference to fig. 5, a schematic diagram of an application scenario of the model training method and the data retrieval method according to the present application is shown. In the application scenario shown in fig. 5, a user accesses an enterprise-level wiki through terminal 501 and enters the search statement "abc" in it. Server 502 may host the target model locally; upon receiving the search statement, it obtains a plurality of search results, sorts them in combination with the target model, and finally outputs the sorted results.
With further reference to fig. 6, as an implementation of the method shown in the above figures, the present application provides an embodiment of a model training apparatus, which corresponds to the embodiment of the method shown in fig. 2, and which can be applied in various electronic devices.
As shown in fig. 6, the model training apparatus 600 of the present embodiment includes: log acquisition unit 601, sample generation unit 602, and model training unit 603.
A log obtaining unit 601 configured to obtain a history search click log generated by the knowledge sharing system.
A sample generation unit 602 configured to generate a sample set according to the history search click log.
A model training unit 603 configured to train a model using the sample set, resulting in a target model.
In some optional implementations of this embodiment, the historical search click log includes at least one search statement and click information for each document in the search results for the at least one search statement. The sample generation unit 602 may be further configured to: for each search statement, determining a clicked document set of the search statement from the search result according to click information of each document in the search result of the search statement; determining a positive sample set according to the search statement and the clicked document set; and determining a negative sample set of the search statement according to the search result and the clicked document set.
In some alternative implementations of the present embodiment, the search results include a plurality of documents and a ranking of the documents. The sample generation unit 602 may be further configured to: determining a plurality of un-clicked documents adjacent to each clicked document in the clicked document set according to the sequence to obtain an un-clicked document set; and obtaining a negative sample set according to the search statement and the un-clicked document set.
In some optional implementations of this embodiment, a ratio of the number of negative samples to the number of positive samples in the sample set is a preset value.
In some optional implementations of this embodiment, the set of samples includes a training set. The model training unit 603 may be further configured to: for each sample of the training set, determining a feature vector corresponding to the document in the sample; and taking the search sentences in each sample in the training set as input, taking the feature vectors corresponding to the input search sentences as expected output, and training to obtain the target model.
In some optional implementations of this embodiment, the model training unit 603 may be further configured to: and obtaining a feature vector based on the search sentence, the document and at least one pre-trained model in the sample.
In some optional implementations of this embodiment, the apparatus 600 may further include a data filtering unit, not shown in fig. 6, configured to filter the history search click log to obtain filtered data.
In some optional implementations of this embodiment, the data filtering unit is further configured to: determine the search count of each search statement and the click count of each document according to the historical search click log; and filter out the search statements whose search count is less than a first preset threshold and/or the documents whose click count is less than a second preset threshold, to obtain the filtered data.
In some optional implementations of this embodiment, the sample set includes a test set. The apparatus 600 may further comprise a test unit, not shown in fig. 6, configured to: and determining the searching effect of the target model by using the test set.
It should be understood that units 601 to 603 recited in the model training apparatus 600 correspond to respective steps in the method described with reference to fig. 2. Thus, the operations and features described above with respect to the model training method are equally applicable to the apparatus 600 and the units included therein, and are not described in detail here.
With further reference to fig. 7, as an implementation of the method shown in the above figures, the present application provides an embodiment of a data retrieval device, which corresponds to the embodiment of the method shown in fig. 4, and which can be applied to various electronic devices.
As shown in fig. 7, the data retrieval apparatus 700 of the present embodiment includes: a search sentence receiving unit 701, a first vector determination unit 702, a second vector determination unit 703, and a document sorting unit 704.
A search sentence receiving unit 701 configured to receive a target search sentence input by a user through a terminal.
A first vector determination unit 702 configured to determine a target feature vector according to a target search statement and a target model as described in the embodiments of fig. 2 and 3.
A second vector determination unit 703 configured to determine a feature vector for each document in the target search result for the target search sentence.
And a document sorting unit 704 configured to sort the documents in the target result according to the feature vectors and the target feature vector.
It should be understood that the units 701 to 704 recited in the data retrieval apparatus 700 correspond to the respective steps in the method described with reference to fig. 4. Thus, the operations and features described above for the data retrieval method are equally applicable to the apparatus 700 and the units included therein, and are not described in detail here.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 8 is a block diagram of an electronic device that executes a model training method and a data retrieval method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 8, the electronic apparatus includes: one or more processors 801, a memory 802, and interfaces for connecting the various components, including a high speed interface and a low speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing a portion of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). Fig. 8 illustrates an example with one processor 801.
The memory 802 is a non-transitory computer readable storage medium as provided herein. The memory 802 stores instructions executable by at least one processor, so that the at least one processor executes the model training method and the data retrieval method provided by the present application. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the model training method and the data retrieval method provided by the present application.
The memory 802, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the model training method and the data retrieval method in the embodiments of the present application (for example, the log acquisition unit 601, the sample generation unit 602, and the model training unit 603 shown in fig. 6, or the search sentence receiving unit 701, the first vector determination unit 702, the second vector determination unit 703, and the document sorting unit 704 shown in fig. 7). The processor 801 executes various functional applications of the server and performs data processing by running the non-transitory software programs, instructions, and modules stored in the memory 802, that is, implements the model training method and the data retrieval method of the above method embodiments.
The memory 802 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of an electronic device that performs a model training method, a data retrieval method, and the like. Further, the memory 802 may include high speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 802 optionally includes memory located remotely from the processor 801, and such remote memory may be connected over a network to an electronic device that performs the model training method, the data retrieval method. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device executing the model training method and the data retrieval method may further include: an input device 803 and an output device 804. The processor 801, the memory 802, the input device 803, and the output device 804 may be connected by a bus or other means, and are exemplified by a bus in fig. 8.
The input device 803 may receive input numeric or character information and generate key signal inputs related to user settings and function control of an electronic apparatus performing the model training method, the data retrieval method, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or the like. The output devices 804 may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical scheme of the embodiment of the application, the search click logs generated by the user in the enterprise-level wiki are analyzed and trained to obtain the target model, and the required knowledge can be quickly and accurately found through the target model.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present invention is not limited thereto as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (24)

1. A model training method, comprising:
acquiring a historical search click log generated by a knowledge sharing system;
generating a sample set according to the historical search click log;
and training a model by using the sample set to obtain a target model.
2. The method of claim 1, wherein the historical search click log comprises at least one search statement and click information for each document in search results for the at least one search statement; and
the generating a sample set according to the historical search click log comprises:
for each search statement, determining a clicked document set of the search statement from the search result according to click information of each document in the search result of the search statement;
determining a positive sample set according to the search statement and the clicked document set;
and determining a negative sample set of the search statement according to the search result and the clicked document set.
3. The method of claim 2, wherein the search results include a plurality of documents and a ranking of the documents; and
the determining a negative sample set of the search statement according to the search result and the clicked document set comprises:
determining a plurality of non-clicked documents adjacent to each clicked document in the clicked document set according to the sequence to obtain a non-clicked document set;
and obtaining the negative sample set according to the search statement and the un-clicked document set.
4. The method of claim 3, wherein a ratio of a number of negative samples to a number of positive samples in the set of samples is a preset value.
5. The method of claim 1, wherein the sample set comprises a training set; and
the training of the model by using the sample set to obtain the target model comprises the following steps:
for each sample of the training set, determining a feature vector corresponding to a document in the sample;
and taking the search sentences in the samples in the training set as input, taking the feature vectors corresponding to the input search sentences as expected output, and training to obtain a target model.
6. The method of claim 5, wherein the determining the feature vector corresponding to the document in the sample comprises:
and obtaining the feature vector based on the search sentence, the document and at least one pre-trained model in the sample.
7. The method of claim 1, wherein the method further comprises:
and filtering the historical search click log to obtain filtering data.
8. The method of claim 7, wherein the filtering the log of historical search clicks to obtain filtered data comprises:
determining the search count of each search statement and the click count of each document according to the historical search click log;
and filtering out the search statements whose search count is less than a first preset threshold and/or the documents whose click count is less than a second preset threshold, to obtain the filtered data.
9. The method of claim 1, wherein the sample set comprises a test set; and
the method further comprises the following steps:
and determining the search effect of the target model by using the test set.
10. A method of data retrieval, comprising:
receiving a target search statement input by a user through a terminal;
determining a target feature vector from the target search statement and the target model of claims 1-8;
determining a feature vector of each document in a target search result for the target search statement;
and sequencing the documents in the target result according to the feature vectors and the target feature vector.
11. A model training apparatus comprising:
a log obtaining unit configured to obtain a history search click log generated by the knowledge sharing system;
a sample generation unit configured to generate a sample set according to the historical search click log;
and the model training unit is configured to train a model by using the sample set to obtain a target model.
12. The apparatus of claim 11, wherein the historical search click log comprises at least one search statement and click information for each document in search results for the at least one search statement; and
the sample generation unit is further configured to:
for each search statement, determining a clicked document set of the search statement from the search result according to click information of each document in the search result of the search statement;
determining a positive sample set according to the search statement and the clicked document set;
and determining a negative sample set of the search statement according to the search result and the clicked document set.
13. The apparatus of claim 12, wherein the search results comprise a plurality of documents and a ranking of the documents; and
the sample generation unit is further configured to:
determining a plurality of non-clicked documents adjacent to each clicked document in the clicked document set according to the sequence to obtain a non-clicked document set;
and obtaining the negative sample set according to the search statement and the un-clicked document set.
14. The apparatus of claim 13, wherein a ratio of a number of negative samples to a number of positive samples in the set of samples is a preset value.
15. The apparatus of claim 11, wherein the sample set comprises a training set; and
the model training unit is further configured to:
for each sample of the training set, determining a feature vector corresponding to a document in the sample;
and taking the search sentences in the samples in the training set as input, taking the feature vectors corresponding to the input search sentences as expected output, and training to obtain a target model.
16. The apparatus of claim 15, wherein the model training unit is further configured to:
and obtaining the feature vector based on the search sentence, the document and at least one pre-trained model in the sample.
17. The apparatus of claim 11, wherein the apparatus further comprises:
and the data filtering unit is configured to filter the historical search click log to obtain filtered data.
18. The apparatus of claim 17, wherein the data filtering unit is further configured to:
determining the search count of each search statement and the click count of each document according to the historical search click log;
and filtering out the search statements whose search count is less than a first preset threshold and/or the documents whose click count is less than a second preset threshold, to obtain the filtered data.
19. The apparatus of claim 11, wherein the set of samples comprises a test set; and
the apparatus further comprises a test unit configured to:
and determining the search effect of the target model by using the test set.
20. A data retrieval apparatus comprising:
a search sentence receiving unit configured to receive a target search sentence input by a user through a terminal;
a first vector determination unit configured to determine a target feature vector from the target search statement and the target model of claims 1-8;
a second vector determination unit configured to determine a feature vector for each document in a target search result for the target search sentence;
and the document sorting unit is configured to sort the documents in the target result according to the feature vectors and the target feature vector.
21. A model training electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
22. A data retrieval electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of claim 10.
23. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-9.
24. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of claim 10.
CN202010939453.6A 2020-09-09 2020-09-09 Model training method, data retrieval method, device, equipment and storage medium Pending CN112084150A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010939453.6A CN112084150A (en) 2020-09-09 2020-09-09 Model training method, data retrieval method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010939453.6A CN112084150A (en) 2020-09-09 2020-09-09 Model training method, data retrieval method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112084150A 2020-12-15

Family

ID=73732209

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010939453.6A Pending CN112084150A (en) 2020-09-09 2020-09-09 Model training method, data retrieval method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112084150A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10394915B1 (en) * 2016-08-24 2019-08-27 Amazon Technologies, Inc. Architecture and techniques to search logging information
CN106339756A (en) * 2016-08-25 2017-01-18 北京百度网讯科技有限公司 Training data generation method and device and searching method and device
CN107832432A (en) * 2017-11-15 2018-03-23 北京百度网讯科技有限公司 A kind of search result ordering method, device, server and storage medium
CN108460085A (en) * 2018-01-19 2018-08-28 北京奇艺世纪科技有限公司 A kind of video search sequence training set construction method and device based on user journal
CN110727785A (en) * 2019-09-11 2020-01-24 北京奇艺世纪科技有限公司 Recommendation method, device and storage medium for training recommendation model and recommending search text

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343028A (en) * 2021-05-31 2021-09-03 北京达佳互联信息技术有限公司 Method and device for training intention determination model
CN113609841A (en) * 2021-06-25 2021-11-05 北京齐尔布莱特科技有限公司 Training method and computing device for topic word generation model
CN114676227A (en) * 2022-04-06 2022-06-28 北京百度网讯科技有限公司 Sample generation method, model training method and search method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination