CN114186015A - Information retrieval method, device and computer readable storage medium - Google Patents


Info

Publication number
CN114186015A
Authority
CN
China
Prior art keywords
query
information retrieval
model
data
training data
Prior art date
Legal status
Pending
Application number
CN202010970977.1A
Other languages
Chinese (zh)
Inventor
丁磊
童毅轩
董滨
姜珊珊
张永伟
Current Assignee
Ricoh Co Ltd
Original Assignee
Ricoh Co Ltd
Priority date
Filing date
Publication date
Application filed by Ricoh Co Ltd
Priority to CN202010970977.1A
Priority to JP2021149311A (JP7230979B2)
Publication of CN114186015A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9538 Presentation of query results
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an information retrieval method, an information retrieval device, and a computer-readable storage medium. The information retrieval method comprises the following steps: acquiring first training data, wherein the first training data comprises a query instruction and a query result corresponding to the query instruction; removing noise from the first training data to obtain second training data; initializing an information retrieval model using the second training data; and performing information retrieval using the information retrieval model. The technical scheme of the invention improves both the accuracy of information retrieval results and the efficiency of information retrieval.

Description

Information retrieval method, device and computer readable storage medium
Technical Field
The invention relates to the field of information retrieval, and in particular to an information retrieval method, an information retrieval device, and a computer-readable storage medium.
Background
Information retrieval is an important technology that is widely applied in search engines, question answering systems, recommendation systems, and other intelligent services. With better information retrieval technology, vendors can more accurately understand customers' intentions and provide appropriate products or services.
Currently, the main approach to information retrieval is to judge the semantic relevance between a user query and a document with a large-scale neural network model. Training such a model requires a large amount of labeled data, but manual labeling is costly. The related art proposes constructing labeled training data with a generation method. However, the generated data usually contains noise, and the negative samples in the generated data are insufficiently relevant (i.e., too easy to distinguish), both of which degrade the effect of information retrieval.
Disclosure of Invention
The technical problem to be solved by the embodiments of the present invention is to provide an information retrieval method, an information retrieval device, and a computer-readable storage medium, which can improve the accuracy of an information retrieval result and improve the efficiency of information retrieval.
According to an aspect of an embodiment of the present invention, there is provided an information retrieval method, including:
acquiring first training data, wherein the first training data comprises a query instruction and a query result corresponding to the query instruction;
removing noise in the first training data to obtain second training data;
initializing an information retrieval model using the second training data;
and utilizing the information retrieval model to perform information retrieval.
Further, in accordance with at least one embodiment of the present invention, after initializing the information retrieval model, the method further comprises:
the information retrieval model is optimized through adversarial queries.
Further in accordance with at least one embodiment of the present invention, the obtaining first training data includes:
acquiring open data, wherein the open data comprises a query instruction and a query result corresponding to the query instruction;
training a query data generation model with the open data, wherein the query data generation model can generate, from an input query result, a query instruction corresponding to that query result;
and inputting the documents in the specific field into the query data generation model to generate the first training data.
Further in accordance with at least one embodiment of the present invention, the removing noise in the first training data includes:
initializing a noise classification model using the first training data;
training the noise classification model;
and removing the noise in the first training data by using the trained noise classification model.
Further in accordance with at least one embodiment of the present invention, the training the noise classification model includes:
carrying out N iterations to obtain a trained noise classification model, wherein N is a positive integer;
in each iteration, the noise classification model is used to remove noise from the first training data, the information retrieval model is trained on the denoised data, and the parameters of the noise classification model are updated using the loss function of the trained information retrieval model.
Further in accordance with at least one embodiment of the present invention, the optimizing the information retrieval model through adversarial queries includes:
initializing an irrelevant query generation model with the second training data, wherein the input of the irrelevant query generation model is a query result together with a first query instruction relevant to the query result, and the output is a second query instruction irrelevant to the query result;
and inputting the output result of the information retrieval model into the irrelevant query generation model, and training the information retrieval model by using the output result of the irrelevant query generation model.
Further, in accordance with at least one embodiment of the present invention, the objective function of the irrelevant query generation model includes:
the relevance between a second query instruction generated by the irrelevant query generation model and the query result;
the text similarity between the second query instruction generated by the irrelevant query generation model and the first query instruction.
According to another aspect of the embodiments of the present invention, there is provided an information retrieval apparatus including:
an acquisition unit, used for acquiring first training data, wherein the first training data comprises a query instruction and a query result corresponding to the query instruction;
the noise removing unit is used for removing noise in the first training data to obtain second training data;
an initialization unit configured to initialize an information retrieval model using the second training data;
and the information retrieval unit is used for retrieving information by using the information retrieval model.
Further, in accordance with at least one embodiment of the present invention, the apparatus further comprises:
and the optimization unit is used for optimizing the information retrieval model through adversarial queries.
Further, according to at least one embodiment of the present invention, the acquisition unit includes:
the system comprises an acquisition subunit, a query execution subunit and a query execution subunit, wherein the acquisition subunit is used for acquiring open data, and the open data comprises a query instruction and a query result corresponding to the query instruction;
the first processing subunit is used for generating a query data generation model by utilizing the open data training, and the query data generation model can generate a query instruction corresponding to an input query result according to the input query result;
and the second processing subunit is used for inputting the documents in the specific field into the query data generation model to generate the first training data.
Further, according to at least one embodiment of the present invention, the noise removing unit includes:
a first initialization subunit, configured to initialize a noise classification model using the first training data;
the training subunit is used for training the noise classification model;
and the clearing subunit is used for clearing the noise in the first training data by using the trained noise classification model.
Furthermore, in accordance with at least one embodiment of the present invention, the optimization unit includes:
a second initialization subunit, configured to initialize an irrelevant query generation model with the second training data, where the input of the irrelevant query generation model is a query result together with a first query instruction relevant to the query result, and the output is a second query instruction irrelevant to the query result;
and the adversarial training subunit is used for inputting the output result of the information retrieval model into the irrelevant query generation model and training the information retrieval model with the output result of the irrelevant query generation model.
An embodiment of the present invention further provides an information retrieval apparatus, including: memory, processor and computer program stored on the memory and executable on the processor, which computer program, when executed by the processor, carries out the steps of the information retrieval method as described above.
An embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the information retrieval method described above are implemented.
Compared with the prior art, in the information retrieval method, device, and computer-readable storage medium provided by the embodiments of the present invention, after the first training data for training the information retrieval model is obtained, the information retrieval model is not generated directly from the first training data; instead, noise in the first training data is first removed, and the information retrieval model is initialized with the denoised second training data. This optimizes the performance of the information retrieval model, improves the accuracy of the information retrieval results, and improves the efficiency of information retrieval.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained based on these drawings without inventive labor.
FIG. 1 is a schematic flow chart of an information retrieval method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart illustrating a process of acquiring first training data according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating an embodiment of removing noise from first training data;
FIG. 4 is a diagram illustrating training a noise classification model according to an embodiment of the present invention;
FIG. 5 is a schematic flow chart illustrating the optimization of an information retrieval model according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of generating irrelevant queries, in accordance with an embodiment of the present invention;
FIG. 7 is a diagram illustrating adversarial training of the information retrieval model and the irrelevant query generation model according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of an information retrieval apparatus according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of another structure of an information retrieval apparatus according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of an obtaining unit according to an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of a noise removing unit according to an embodiment of the present invention;
FIG. 12 is a schematic structural diagram of an optimization unit according to an embodiment of the present invention;
fig. 13 is a schematic structural diagram of an information retrieval apparatus according to another embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments. In the following description, specific details such as specific configurations and components are provided only to help the full understanding of the embodiments of the present invention. Thus, it will be apparent to those skilled in the art that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the invention. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
In various embodiments of the present invention, it should be understood that the sequence numbers of the following processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
To address the problems that an information retrieval system requires a large amount of labeled data and that manually labeled data is costly, a query data generation model can be trained on open data; this model then generates queries for documents in the target field, and "query-result" data pairs constructed from the generated queries are used to train the information retrieval model.
However, this approach has two problems: first, the generated data contains noise; second, the generated query is used as the relevant query while irrelevant queries are constructed by randomly selecting queries of other documents, so neither the quantity nor the quality of the irrelevant queries meets the requirements. High-quality irrelevant queries, which are textually similar to the relevant query but irrelevant to the content of the query result, can effectively improve the effect of the information retrieval system. For example, if the query result is "iphoneX produced by apple inc", the corresponding relevant query is "Who is the manufacturer of iphoneX?", a corresponding high-quality irrelevant query is "What is the color of iphoneX?", and a low-quality irrelevant query is "Who is Trump?".
Embodiments of the present invention provide an information retrieval method, an information retrieval device, and a computer-readable storage medium, which can improve accuracy of an information retrieval result and improve efficiency of information retrieval.
Example one
An embodiment of the present invention provides an information retrieval method, as shown in fig. 1, including:
step 101: acquiring first training data, wherein the first training data comprises a query instruction and a query result corresponding to the query instruction;
for example, the query instruction is "Who is the manufacturer of iphoneX?", and the query result corresponding to it is "iphoneX produced by apple inc". The first training data are "query instruction-query result" data pairs of a specific target field.
As shown in fig. 2, acquiring the first training data comprises the following steps:
step 1011: acquiring open data, wherein the open data comprises a query instruction and a query result corresponding to the query instruction;
the open data may be a data set disclosed by a network, or may be acquired from the network. For example, the "question-answer" data on the question-answer website may be regarded as a query instruction and a query result corresponding to the query instruction, and the "question-answer" data may be collected as training data to generate a query data generation model.
Unlike the first training data, the acquired open data need not belong to the specific target field; for example, the specific target field may be the medical field while the acquired open data come from other fields, such as the mechanical field.
Step 1012: training and generating a query data generation model by utilizing the open data, wherein the query data generation model can generate a query instruction corresponding to a query result according to the input query result;
the query data generation model is a neural network model, the query data generation model is generated through training of acquired open data, and the query data generation model can generate a query instruction corresponding to an input query result according to the input query result, for example, the input query result is ' delapt is american president ', and the generated output is the query instruction ' who is delapt? ".
Step 1013: and inputting the documents in the specific field into the query data generation model to generate the first training data.
Documents in a specific field can be input into the query data generation model as needed, the specific fields including but not limited to the medical field, the machine manufacturing field, and so on. Inputting domain-specific documents into the query data generation model produces a "query instruction-query result" data set, so a large amount of domain-specific "query instruction-query result" data can be generated with the model, alleviating the problems that an information retrieval system requires a large amount of labeled data and that manually labeled data is costly.
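The data-generation step above can be sketched as follows. This is a minimal, runnable illustration only: the real query data generation model is a trained neural network, and the template-based toy_generate_query merely stands in for it; all function and variable names are illustrative, not from the patent.

```python
from typing import Callable

def build_first_training_data(domain_documents: list,
                              generate_query: Callable) -> list:
    """Pair each domain-specific document (a query result) with the
    query instruction produced for it by the query data generation model."""
    pairs = []
    for doc in domain_documents:
        query = generate_query(doc)   # real model inference would go here
        pairs.append((query, doc))    # "query instruction-query result" pair
    return pairs

def toy_generate_query(doc: str) -> str:
    """Template-based stand-in for the trained neural generation model."""
    subject = doc.split()[0]
    return f"What is known about {subject}?"
```

For instance, feeding the single document "iphoneX produced by apple inc" through this pipeline yields one "query instruction-query result" pair whose query is produced by the toy template.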
Step 102: removing noise in the first training data to obtain second training data;
the noise in the domain-specific "query instruction-query result" data set generated in step 101, if any (i.e., incorrect data), may adversely affect the accuracy of the information retrieval model, and therefore, the noise in the data needs to be removed before the information retrieval model is initialized and trained. In this embodiment, noise in the first training data may be removed by using a noise classification model, and the noise classification model may be any text classification model, and may be capable of distinguishing whether a piece of data is noise.
As shown in fig. 3, in this embodiment, the removing noise in the first training data includes the following steps:
step 1021: initializing a noise classification model using the first training data;
step 1022: training the noise classification model;
in this embodiment, N iterations may be performed to obtain a trained noise classification model, where N is a positive integer.
As shown in fig. 4, in each iteration the noise classification model is optimized as follows: noise is removed from the first training data using the noise classification model, the information retrieval model is trained on the denoised data, and the parameters of the noise classification model are updated using the loss function of the trained information retrieval model.
In this embodiment, the noise classification model predicts the probability p_j that a piece of data is noise, as follows:

p_j = π(a = 0 | θ)

where π is the function computed by the noise classification model, a = 1 indicates that the data is not noise, and a = 0 indicates that the data is noise.
The parameters θ of the noise classification model can be updated with a loss function of the form

L(θ) = (F(U_{i-1}) - F(U_i)) · Σ_{j ∈ U_i} log π(a_j = 1 | θ)

where U_i is the data remaining after the noise classification model removes noise from the first training data at the i-th iteration, U_{i-1} is the corresponding remaining data at the (i-1)-th iteration, and F is a performance index evaluating the information retrieval model.
N may be a preset value, for example 50 or 100, or may be determined from the performance of the information retrieval model after each iteration: if the performance after the N-th iteration improves only marginally over that after the (N-1)-th iteration, the iteration stops.
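As a sketch, the iteration-with-early-stopping logic described above might look as follows; remove_noise and train_and_score are placeholders for the real noise classification and retrieval models, and the toy score sequence below exists only so the loop can be exercised end to end. All names are illustrative.

```python
def train_noise_classifier(data, remove_noise, train_and_score,
                           max_iters=50, min_gain=1e-3):
    """Each iteration filters the first training data with the current
    classifier, then trains/evaluates the retrieval model on the kept
    data; iteration stops early once the performance index F improves
    by less than min_gain, or after max_iters iterations."""
    prev_score = None
    for i in range(1, max_iters + 1):
        kept = remove_noise(data)        # U_i: data the classifier keeps
        score = train_and_score(kept)    # F(U_i): retrieval performance
        if prev_score is not None and score - prev_score < min_gain:
            return i, prev_score         # limited improvement: stop
        prev_score = score
        # A real system would also update the classifier parameters theta
        # here, using F(U_i) - F(U_{i-1}) as the reward signal.
    return max_iters, prev_score

# Toy stand-in: a fixed, illustrative score sequence.
_scores = [0.50, 0.60, 0.6001]
_calls = {"n": 0}

def toy_score(kept):
    s = _scores[min(_calls["n"], len(_scores) - 1)]
    _calls["n"] += 1
    return s
```

With the toy scores above, the third iteration improves F by only 0.0001, which is below min_gain, so the loop stops early.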
Step 1023: and removing the noise in the first training data by using the trained noise classification model.
Removing the noise from the first training data with the trained noise classification model eliminates incorrect data and improves the accuracy of the information retrieval model.
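A minimal sketch of this filtering step, assuming the trained classifier exposes a noise-probability function; the word-overlap heuristic below only stands in for the real model, and all names are illustrative.

```python
def filter_noise(pairs, noise_prob, threshold=0.5):
    """Second training data = first training data minus the pairs whose
    predicted noise probability p_j = pi(a = 0 | theta) exceeds threshold.
    noise_prob stands in for the trained noise classification model."""
    return [p for p in pairs if noise_prob(p) <= threshold]

def toy_noise_prob(pair):
    """Toy stand-in: call a pair noisy if query and result share no word."""
    query, result = pair
    shared = set(query.lower().rstrip("?").split()) & set(result.lower().split())
    return 0.0 if shared else 1.0
```

Under the toy heuristic, a pair whose query shares no word with its query result is treated as noise and dropped; the real classifier would of course learn a far richer criterion.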
Step 103: initializing an information retrieval model using the second training data;
in this embodiment, the information retrieval model is initialized by using the second training data from which the noise is removed, so that the accuracy of the information retrieval result can be improved, and the efficiency of information retrieval can be improved.
To further improve performance, after initializing the information retrieval model, the method further comprises: optimizing the information retrieval model through adversarial queries. As shown in fig. 5, optimizing the information retrieval model through adversarial queries includes the following steps:
step 1051: initializing an irrelevant query generation model with the second training data, wherein the input of the irrelevant query generation model is a query result together with a first query instruction relevant to the query result, and the output is a second query instruction irrelevant to the query result;
the second query instruction is a high-quality irrelevant query, and the second query instruction needs to be irrelevant to the query result and has text similarity with the first query instruction. Wherein the objective function of the irrelevant query generation model comprises: the relevance of a second query instruction generated by the irrelevant query generation model and a query result; the second query instruction generated by the irrelevant query generation model has text similarity with the first query instruction. The objective function is as small as possible.
As shown in fig. 6, the input of the irrelevant query generation model is a query result, such as "iphoneX produced by apple inc", together with a query instruction relevant to that result, such as "Who is the manufacturer of iphoneX?"; the output is another query instruction irrelevant to the query result, such as "What is the color of iphoneX?". This irrelevant query is textually similar to "Who is the manufacturer of iphoneX?", i.e., it is a high-quality irrelevant query.
During initialization, the following objective function may be employed so that the generated query is irrelevant to the query result but textually similar to the relevant query instruction:

p(a = 1 | result, generated irrelevant query) + λ · d(relevant query, generated irrelevant query)

where p(a = 1 | result, generated irrelevant query) represents the relevance of the generated irrelevant query to the query result and can be obtained from the initialized information retrieval model, and d(relevant query, generated irrelevant query) can be a text-similarity measure such as the edit distance. Both parts should be as small as possible. λ is a weighting coefficient adjusting the importance of the second part, and its value can be tuned as needed.
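This initialization objective can be sketched concretely; here the edit distance serves as d, and the relevance term is passed in as a plain number (in the real system it would come from the initialized information retrieval model). All names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[len(b)]

def generation_objective(relevance: float, relevant_query: str,
                         generated_query: str, lam: float = 0.1) -> float:
    """p(a = 1 | result, generated query) + lambda * d(relevant, generated);
    both terms should be as small as possible. relevance stands in for the
    score produced by the initialized information retrieval model."""
    return relevance + lam * edit_distance(relevant_query, generated_query)
```

Minimizing this quantity pushes the generator toward queries the retrieval model scores as irrelevant (small first term) that are nevertheless close in surface form to the relevant query (small second term).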
Step 1052: and inputting the output result of the information retrieval model into the irrelevant query generation model, and training the information retrieval model by using the output result of the irrelevant query generation model.
As shown in fig. 7, the information retrieval model is trained using the output of the irrelevant query generation model as training data, the result output by the information retrieval model is fed back into the irrelevant query generation model, and adversarial training is then performed to optimize the information retrieval model.
The input of the information retrieval model is a "query instruction-query result" pair, and its output is a probability: the probability that the query result is the correct query result corresponding to the query instruction. If the information retrieval model were trained only on relevant query instructions and their corresponding query results, its accuracy would be limited. In this embodiment, the information retrieval model is additionally trained on the "irrelevant query instruction-query result" data output by the irrelevant query generation model; the information retrieval model and the irrelevant query generation model are trained against each other, and through iteration the two models mutually improve each other's effect.
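The alternating scheme described above can be sketched as a simple loop; both step functions are placeholders for real training code, and the toy stand-ins exist only so the loop runs end to end. All names are illustrative.

```python
def adversarial_train(generation_step, retrieval_step, rounds=3):
    """Alternate between the two models: the irrelevant query generation
    model produces hard negative queries that train the retrieval model,
    and the retrieval model's output is fed back to the generation model."""
    feedback = None
    for _ in range(rounds):
        negatives = generation_step(feedback)  # generated irrelevant queries
        feedback = retrieval_step(negatives)   # retrieval output as feedback
    return feedback

def toy_generation_step(feedback):
    """Toy: generate one more irrelevant query each round."""
    n = 1 if feedback is None else feedback + 1
    return [("irrelevant query", "result")] * n

def toy_retrieval_step(negatives):
    """Toy: the feedback is just the number of negatives seen."""
    return len(negatives)
```

In practice each round would run gradient updates on both neural models; the loop structure, with the retrieval output feeding the generator and vice versa, is the point of the sketch.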
Step 104: and utilizing the information retrieval model to perform information retrieval.
In the embodiment, the semantic relevance between the query instruction of the user and the document can be accurately judged by using the information retrieval model.
In this embodiment, after the first training data is obtained, the information retrieval model is not generated directly by using the first training data, but noise in the first training data is removed first, and the information retrieval model is initialized by using the second training data after the noise is removed, so that the performance of the information retrieval model can be optimized, the accuracy of the information retrieval result is improved, and the efficiency of information retrieval is improved.
Example two
An embodiment of the present invention further provides an information retrieval apparatus, as shown in fig. 8, including:
an obtaining unit 21, configured to obtain first training data, where the first training data includes a query instruction and a query result corresponding to the query instruction;
for example, the query instruction is "Who is the manufacturer of iphoneX?", and the query result corresponding to it is "iphoneX produced by apple inc". The first training data are "query instruction-query result" data pairs of a specific target field.
A noise removing unit 22, configured to remove noise in the first training data to obtain second training data;
If the generated domain-specific "query instruction-query result" data set contains noise (i.e., incorrect data), the accuracy of the information retrieval model is adversely affected; therefore, the noise needs to be removed before the information retrieval model is initialized and trained. In this embodiment, noise in the first training data can be removed with a noise classification model, which can be any text classification model capable of distinguishing whether a piece of data is noise.
An initialization unit 23 configured to initialize an information retrieval model using the second training data;
in this embodiment, the information retrieval model is initialized by using the second training data from which the noise is removed, so that the accuracy of the information retrieval result can be improved, and the efficiency of information retrieval can be improved.
And an information retrieval unit 24 for performing information retrieval using the information retrieval model.
In the embodiment, the semantic relevance between the query instruction of the user and the document can be accurately judged by using the information retrieval model.
In this embodiment, after the first training data is obtained, the information retrieval model is not generated directly by using the first training data, but noise in the first training data is removed first, and the information retrieval model is initialized by using the second training data after the noise is removed, so that the performance of the information retrieval model can be optimized, the accuracy of the information retrieval result is improved, and the efficiency of information retrieval is improved.
In some embodiments, as shown in fig. 9, the apparatus further comprises:
an optimization unit 25 for optimizing the information retrieval model through adversarial queries.
In some embodiments, as shown in fig. 10, the obtaining unit 21 includes:
an obtaining subunit 211, configured to obtain open data, where the open data includes a query instruction and a query result corresponding to the query instruction;
a first processing subunit 212, configured to generate, by using the open data training, a query data generation model, where the query data generation model is capable of generating, according to an input query result, a query instruction corresponding to the query result;
The open data may be a publicly disclosed data set or may be collected from the network. For example, the question-answer pairs on a Q&A website can be regarded as query instructions and their corresponding query results, and such data can be collected as training data for generating the query data generation model.
Unlike the first training data, the acquired open data need not belong to the specific target field. For example, if the specific target field is the medical field, the acquired open data may come from other fields, such as the mechanical field.
The query data generation model is a neural network model generated by training on the acquired open data; given an input query result, it generates the corresponding query instruction. For example, if the input query result is "Trump is the American president", the generated output is the query instruction "Who is Trump?".
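The generation step can be sketched as follows. This is a toy stand-in that assumes only the interface: the patent's model is a trained neural network, while this version simply returns the query attached to the most lexically similar result seen in the open data. All names are illustrative.

```python
class QueryDataGenerator:
    """Toy stand-in for the neural query data generation model."""

    def __init__(self):
        self.memory = []  # list of (result_token_set, query)

    def train(self, open_data):
        # open_data: list of (query, result) pairs collected from Q&A sites
        self.memory = [(set(r.lower().split()), q) for q, r in open_data]

    def generate(self, result):
        # return the query of the most similar known result (Jaccard overlap);
        # a real system would decode a new query with a trained seq2seq model
        toks = set(result.lower().split())
        best = max(self.memory,
                   key=lambda m: len(m[0] & toks) / (len(m[0] | toks) or 1))
        return best[1]

    def build_training_pairs(self, domain_documents):
        # turn each domain-specific document into a (query, result) pair,
        # i.e. the first training data of the embodiment
        return [(self.generate(doc), doc) for doc in domain_documents]
```

Feeding domain-specific documents through `build_training_pairs` yields the "query instruction-query result" data set described below.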
And a second processing subunit 213, configured to input the domain-specific document into the query data generation model, and generate the first training data.
Documents of a specific field can be input into the query data generation model as required; the specific field includes, but is not limited to, the medical field, the machinery manufacturing field, and so on. Inputting domain-specific documents into the query data generation model yields a "query instruction-query result" data set, so a large amount of domain-specific query instruction-query result data can be generated with the model, alleviating both the large amount of labeled data required by an information retrieval system and the high cost of manual labeling.
In some embodiments, as shown in fig. 11, the noise removing unit 22 includes:
a first initialization subunit 221, configured to initialize a noise classification model with the first training data;
a training subunit 222, configured to train the noise classification model;
in this embodiment, N iterations may be performed to obtain a trained noise classification model, where N is a positive integer.
As shown in fig. 4, in each iteration, the noise classification model is optimized by removing noise from the first training data using the noise classification model, training the information retrieval model using the data from which noise is removed, and updating parameters of the noise classification model using a loss function of the trained information retrieval model.
In this embodiment, the noise classification model may predict the probability p that the data is noisejAs follows:
pj=π(a=0|θ);
where pi is a function of the noise classification model, a-1 indicates that the data is not noise, and a-0 indicates that the data is noise.
The parameters θ of the noise classification model can be updated by the following loss function:
[Loss function equation, shown as an image in the original publication.]
where U_i is the data remaining after the noise classification model removes noise from the first training data in the i-th iteration, U_{i-1} is the corresponding remaining data from the (i-1)-th iteration, and F is a performance metric for evaluating the information retrieval model.
N may be a preset value, for example 50 or 100, or may be determined from the performance of the information retrieval model after each iteration: if the performance after the N-th iteration improves only marginally over that of the (N-1)-th iteration, the iteration stops.
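The iteration described above can be sketched as follows. The exact loss function appears only as an equation image in the original publication, so the REINFORCE-style reward F(U_i) - F(U_{i-1}) used below is an assumption, as are the `p_noise`, `update_theta`, `train_retrieval_model`, and `evaluate_F` helpers.

```python
def train_noise_classifier(first_data, p_noise, update_theta,
                           train_retrieval_model, evaluate_F,
                           N=50, min_gain=1e-3):
    """Run up to N filter/train/update iterations (steps as in the text)."""
    prev_F = None
    model = None
    for i in range(N):
        # 1) remove pairs the classifier currently deems noise -> U_i
        U_i = [d for d in first_data if p_noise(d) < 0.5]
        # 2) train the information retrieval model on the cleaned data
        model = train_retrieval_model(U_i)
        # 3) update classifier parameters theta from the retrieval model's
        #    performance F (assumed reward: improvement over the last round)
        F_i = evaluate_F(model)
        if prev_F is not None:
            update_theta(reward=F_i - prev_F, kept=U_i)
            # stop early when the improvement over the previous iteration
            # is limited, as the text allows
            if F_i - prev_F < min_gain:
                break
        prev_F = F_i
    return model
```

With stub models, the loop filters the flagged pair, trains on the remainder, and stops once F no longer improves.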
A cleaning subunit 223, configured to clean the noise in the first training data by using the trained noise classification model.
After the noise in the first training data is eliminated by using the trained noise classification model, incorrect data in the first training data can be eliminated, and the accuracy of the information retrieval model is improved.
In some embodiments, as shown in fig. 12, the optimization unit 25 includes:
a second initializing subunit 251, configured to initialize an irrelevant query generation model with the second training data, where the irrelevant query generation model has an input of a query result and a first query instruction relevant to the query result, and an output of a second query instruction irrelevant to the query result;
The second query instruction is a high-quality irrelevant query: it must be irrelevant to the query result yet textually similar to the first query instruction. The objective function of the irrelevant query generation model comprises: the relevance between the second query instruction generated by the model and the query result; and the text distance between the generated second query instruction and the first query instruction. The objective function should be as small as possible.
As shown in fig. 6, the input of the irrelevant query generation model is a query result, such as "iphoneX is produced by Apple Inc.", together with a query instruction relevant to that result, such as "Who is the manufacturer of iphoneX?"; the output is another query instruction irrelevant to the query result, such as "What is the color of iphoneX?". This irrelevant query is textually similar to "Who is the manufacturer of iphoneX?", i.e., it is a high-quality irrelevant query.
During initialization, the following objective function may be employed so that the generated query is irrelevant to the query result yet textually similar to the relevant query instruction:

p(a = 1 | result, generated irrelevant query) + λ·d(relevant query, generated irrelevant query)

where p(a = 1 | result, generated irrelevant query) represents the relevance between the generated irrelevant query and the query result and can be obtained from the initialized information retrieval model, and d(relevant query, irrelevant query) can be any text-similarity metric such as edit distance. Both parts should be as small as possible. λ is a weighting coefficient that adjusts the importance of the second part; its value can be tuned as needed.
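A sketch of this initialization objective, assuming plain Levenshtein edit distance for d and a caller-supplied relevance function for p (both helper names are illustrative, not from the patent):

```python
def edit_distance(a, b):
    """Standard dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def generator_objective(relevance, result, relevant_q, irrelevant_q, lam=0.01):
    # J = p(a=1 | result, irrelevant_q) + lambda * d(relevant_q, irrelevant_q);
    # both terms, and hence J, should be as small as possible
    return relevance(result, irrelevant_q) + lam * edit_distance(relevant_q, irrelevant_q)
```

A generated query that the retrieval model scores as irrelevant, yet that stays close in edit distance to the relevant query, receives a lower objective value.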
An adversarial training subunit 252, configured to input the output result of the information retrieval model into the irrelevant query generation model, and to train the information retrieval model using the output result of the irrelevant query generation model.

As shown in fig. 7, the information retrieval model is trained using the output of the irrelevant query generation model as training data, the result output by the information retrieval model is fed back to the irrelevant query generation model, and adversarial training is then performed to optimize the information retrieval model.
The input of the information retrieval model is a "query instruction-query result" pair, and the output is a probability representing how likely the query result is the correct result for the query instruction. If the information retrieval model were trained only on relevant query instructions and their corresponding query results, its accuracy would be limited. In this embodiment, the information retrieval model is additionally trained on the "irrelevant query instruction-query result" data output by the irrelevant query generation model; the two models are trained against each other, and through iteration each optimizes the other's effect.
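The adversarial loop of fig. 7 can be sketched as follows. Only the data flow follows the text; the interfaces of the two models (`generate`, `train`, `score`, `update`) are assumptions.

```python
def adversarial_train(retrieval_model, generator, positives, rounds=10):
    """Alternate between the retrieval model and the irrelevant query generator.

    positives: list of (query, result) pairs known to be relevant.
    """
    for _ in range(rounds):
        # the generator produces hard negatives: queries irrelevant to the
        # result but textually similar to the relevant query
        negatives = [(generator.generate(r, q), r) for q, r in positives]
        # the retrieval model learns: positives -> 1, generated negatives -> 0
        retrieval_model.train(positives, negatives)
        # the retrieval model's scores are fed back to the generator, which
        # tries to produce negatives the retrieval model still scores highly
        feedback = [retrieval_model.score(q, r) for q, r in negatives]
        generator.update(negatives, feedback)
    return retrieval_model
```

Each round, the generator probes the retrieval model's weaknesses and the retrieval model corrects them, so the two models optimize each other iteratively.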
EXAMPLE III
An embodiment of the present invention further provides an information retrieval apparatus 30, as shown in fig. 13, including:
a processor 32; and
a memory 34, in which memory 34 computer program instructions are stored,
wherein the computer program instructions, when executed by the processor, cause the processor 32 to perform the steps of:
acquiring first training data, wherein the first training data comprises a query instruction and a query result corresponding to the query instruction;
removing noise in the first training data to obtain second training data;
initializing an information retrieval model using the second training data;
and utilizing the information retrieval model to perform information retrieval.
Further, as shown in fig. 13, the information retrieval apparatus 30 further includes a network interface 31, an input device 33, a hard disk 35, and a display device 36.
The various interfaces and devices described above may be interconnected by a bus architecture. A bus architecture may include any number of interconnected buses and bridges. Various circuits of one or more Central Processing Units (CPUs), represented in particular by processor 32, and one or more memories, represented by memory 34, are coupled together. The bus architecture may also connect various other circuits such as peripherals, voltage regulators, power management circuits, and the like. It will be appreciated that a bus architecture is used to enable communications among the components. The bus architecture includes a power bus, a control bus, and a status signal bus, in addition to a data bus, all of which are well known in the art and therefore will not be described in detail herein.
The network interface 31 may be connected to a network (e.g., the internet, a local area network, etc.), and may obtain relevant data, such as public data, from the network, and may store the relevant data in the hard disk 35.
The input device 33 can receive various commands input by the operator and send the commands to the processor 32 for execution. The input device 33 may comprise a keyboard or a pointing device (e.g., a mouse, trackball, touch pad, touch screen, etc.).
The display device 36 may display the results of the instructions executed by the processor 32.
The memory 34 is used for storing programs and data necessary for operating the operating system, and data such as intermediate results in the calculation process of the processor 32.
It will be appreciated that memory 34 in embodiments of the invention may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), a Programmable Read Only Memory (PROM), an Erasable Programmable Read Only Memory (EPROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), or a flash memory. Volatile memory can be Random Access Memory (RAM), which acts as external cache memory. The memory 34 of the apparatus and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some embodiments, memory 34 stores the following elements, executable modules or data structures, or a subset thereof, or an expanded set thereof: an operating system 341 and application programs 342.
The operating system 341 includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application 342 includes various applications, such as a Browser (Browser), and the like, for implementing various application services. A program implementing the method of an embodiment of the present invention may be included in the application 342.
The processor 32, when calling and executing the application program and data stored in the memory 34, specifically, the application program or instruction stored in the application program 342, obtains first training data, where the first training data includes a query instruction and a query result corresponding to the query instruction; removing noise in the first training data to obtain second training data; initializing an information retrieval model using the second training data; and utilizing the information retrieval model to perform information retrieval.
Further, the processor 32, when calling and executing the application program and data stored in the memory 34, specifically the program or instructions stored in the application program 342, optimizes the information retrieval model through adversarial queries.
Further, the processor 32 obtains open data when calling and executing the application program and data stored in the memory 34, specifically, the application program or the data may be a program or an instruction stored in the application program 342, where the open data includes a query instruction and a query result corresponding to the query instruction; training and generating a query data generation model by utilizing the open data, wherein the query data generation model can generate a query instruction corresponding to a query result according to the input query result; and inputting the documents in the specific field into the query data generation model to generate the first training data.
Further, the processor 32 initializes the noise classification model by using the first training data when calling and executing the application program and data stored in the memory 34, specifically, the program or the instruction stored in the application program 342; training the noise classification model; and removing the noise in the first training data by using the trained noise classification model.
Further, when the processor 32 calls and executes the application program and the data stored in the memory 34, specifically, the application program or the instruction stored in the application program 342, N iterations are performed to obtain a trained noise classification model, where N is a positive integer; in each iteration, the noise classification model is used for eliminating noise in the first training data, the information retrieval model is trained by the data after noise elimination, and the parameters of the noise classification model are updated by the loss function of the trained information retrieval model.
Further, the processor 32, when calling and executing the application program and data stored in the memory 34, specifically the program or instructions stored in the application program 342, initializes an irrelevant query generation model with the second training data, the irrelevant query generation model having as input a query result and a first query instruction relevant to the query result and as output a second query instruction irrelevant to the query result; and inputs the output result of the information retrieval model into the irrelevant query generation model, and trains the information retrieval model using the output result of the irrelevant query generation model.
The objective function of the irrelevant query generation model comprises:
the relevance of a second query instruction generated by the irrelevant query generation model and a query result;
the second query instruction generated by the irrelevant query generation model has text similarity with the first query instruction.
The methods disclosed in the above embodiments of the present invention may be implemented in the processor 32 or by the processor 32. The processor 32 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 32. The processor 32 may be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the embodiments of the present invention may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in storage media well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory 34, and the processor 32 reads the information in the memory 34 and completes the steps of the method in combination with its hardware.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units designed to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Example four
An embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the processor is caused to execute the following steps:
acquiring first training data, wherein the first training data comprises a query instruction and a query result corresponding to the query instruction;
removing noise in the first training data to obtain second training data;
initializing an information retrieval model using the second training data;
and utilizing the information retrieval model to perform information retrieval.
The foregoing is a preferred embodiment of the present invention, and it should be noted that it is obvious to those skilled in the art that various modifications and improvements can be made without departing from the principle of the present invention, and these modifications and improvements should be construed as the protection scope of the present invention.

Claims (13)

1. An information retrieval method, comprising:
acquiring first training data, wherein the first training data comprises a query instruction and a query result corresponding to the query instruction;
removing noise in the first training data to obtain second training data;
initializing an information retrieval model using the second training data;
and utilizing the information retrieval model to perform information retrieval.
2. The information retrieval method of claim 1, wherein after initializing the information retrieval model, the method further comprises:
optimizing the information retrieval model through an adversarial query.
3. The information retrieval method according to claim 1, wherein the acquiring of the first training data includes:
acquiring open data, wherein the open data comprises a query instruction and a query result corresponding to the query instruction;
training and generating a query data generation model by utilizing the open data, wherein the query data generation model can generate a query instruction corresponding to a query result according to the input query result;
and inputting the documents in the specific field into the query data generation model to generate the first training data.
4. The information retrieval method of claim 1, wherein the removing noise in the first training data comprises:
initializing a noise classification model using the first training data;
training the noise classification model;
and removing the noise in the first training data by using the trained noise classification model.
5. The information retrieval method of claim 4, wherein the training the noise classification model comprises:
carrying out N iterations to obtain a trained noise classification model, wherein N is a positive integer;
in each iteration, the noise classification model is used for eliminating noise in the first training data, the information retrieval model is trained by the data after noise elimination, and the parameters of the noise classification model are updated by the loss function of the trained information retrieval model.
6. The information retrieval method of claim 2, wherein the optimizing the information retrieval model through an adversarial query comprises:
initializing an irrelevant query generation model by using the second training data, wherein the irrelevant query generation model has the input of a query result and a first query instruction relevant to the query result and has the output of a second query instruction irrelevant to the query result;
and inputting the output result of the information retrieval model into the irrelevant query generation model, and training the information retrieval model by using the output result of the irrelevant query generation model.
7. The information retrieval method of claim 6, wherein the objective function of the irrelevant query generation model comprises:
the relevance of a second query instruction generated by the irrelevant query generation model and a query result;
the second query instruction generated by the irrelevant query generation model has text similarity with the first query instruction.
8. An information retrieval apparatus, characterized by comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring first training data, and the first training data comprises a query instruction and a query result corresponding to the query instruction;
the noise removing unit is used for removing noise in the first training data to obtain second training data;
an initialization unit configured to initialize an information retrieval model using the second training data;
and the information retrieval unit is used for retrieving information by using the information retrieval model.
9. The information retrieval device according to claim 8, characterized by further comprising:
and the optimization unit is used for optimizing the information retrieval model through adversarial queries.
10. The information retrieval device according to claim 8, wherein the acquisition unit includes:
the system comprises an acquisition subunit, a query execution subunit and a query execution subunit, wherein the acquisition subunit is used for acquiring open data, and the open data comprises a query instruction and a query result corresponding to the query instruction;
the first processing subunit is used for generating a query data generation model by utilizing the open data training, and the query data generation model can generate a query instruction corresponding to an input query result according to the input query result;
and the second processing subunit is used for inputting the documents in the specific field into the query data generation model to generate the first training data.
11. The information retrieval device according to claim 8, wherein the noise removal unit includes:
a first initialization subunit, configured to initialize a noise classification model using the first training data;
the training subunit is used for training the noise classification model;
and the clearing subunit is used for clearing the noise in the first training data by using the trained noise classification model.
12. The information retrieval device according to claim 9, wherein the optimization unit includes:
a second initialization subunit, configured to initialize an irrelevant query generation model with the second training data, where the irrelevant query generation model has an input of a query result and a first query instruction relevant to the query result, and an output of a second query instruction irrelevant to the query result;
and the adversarial training subunit is used for inputting the output result of the information retrieval model into the irrelevant query generation model and training the information retrieval model using the output result of the irrelevant query generation model.
13. A computer-readable storage medium, characterized in that a computer program is stored thereon, which computer program, when being executed by a processor, carries out the steps of the information retrieval method according to one of claims 1 to 7.
CN202010970977.1A 2020-09-15 2020-09-15 Information retrieval method, device and computer readable storage medium Pending CN114186015A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010970977.1A CN114186015A (en) 2020-09-15 2020-09-15 Information retrieval method, device and computer readable storage medium
JP2021149311A JP7230979B2 (en) 2020-09-15 2021-09-14 Information retrieval method, device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010970977.1A CN114186015A (en) 2020-09-15 2020-09-15 Information retrieval method, device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN114186015A true CN114186015A (en) 2022-03-15

Family

ID=80539270

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010970977.1A Pending CN114186015A (en) 2020-09-15 2020-09-15 Information retrieval method, device and computer readable storage medium

Country Status (2)

Country Link
JP (1) JP7230979B2 (en)
CN (1) CN114186015A (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3944159B2 (en) * 2003-12-25 2007-07-11 株式会社東芝 Question answering system and program
US20170132638A1 (en) * 2014-12-26 2017-05-11 Hitachi, Ltd. Relevant information acquisition method and apparatus, and storage medium
US10657259B2 (en) * 2017-11-01 2020-05-19 International Business Machines Corporation Protecting cognitive systems from gradient based attacks through the use of deceiving gradients
JP2020098521A (en) * 2018-12-19 2020-06-25 富士通株式会社 Information processing device, data extraction method, and data extraction program

Also Published As

Publication number Publication date
JP2022049010A (en) 2022-03-28
JP7230979B2 (en) 2023-03-01

Similar Documents

Publication Publication Date Title
Cambronero et al. When deep learning met code search
US11468233B2 (en) Intention identification method, intention identification apparatus, and computer-readable recording medium
CN110222045B (en) Data report acquisition method and device, computer equipment and storage medium
CN111488137B (en) Code searching method based on common attention characterization learning
US11031009B2 (en) Method for creating a knowledge base of components and their problems from short text utterances
CN111914097A (en) Entity extraction method and device based on attention mechanism and multi-level feature fusion
CN110674306B (en) Knowledge graph construction method and device and electronic equipment
US11861308B2 (en) Mapping natural language utterances to operations over a knowledge graph
US20220129448A1 (en) Intelligent dialogue method and apparatus, and storage medium
JP7303195B2 (en) Facilitate subject area and client-specific application program interface recommendations
US11526512B1 (en) Rewriting queries
CN113434858B (en) Malicious software family classification method based on disassembly code structure and semantic features
WO2021015936A1 (en) Word-overlap-based clustering cross-modal retrieval
JP2022145623A (en) Method and device for presenting hint information and computer program
Tan et al. Fine-grained image classification with factorized deep user click feature
CN113220996B (en) Scientific and technological service recommendation method, device, equipment and storage medium based on knowledge graph
CN114186015A (en) Information retrieval method, device and computer readable storage medium
CN112800314B (en) Method, system, storage medium and equipment for search engine query automatic completion
CN115221284A (en) Text similarity calculation method and device, electronic equipment and storage medium
CN115718791A (en) Specific ordering of text elements and applications thereof
CN110309258A (en) A kind of input checking method, server and computer readable storage medium
Srinivasan et al. Model-assisted machine-code synthesis
CN113806510B (en) Legal provision retrieval method, terminal equipment and computer storage medium
US20230162031A1 (en) Method and system for training neural network for generating search string
CN114970875A (en) Machine learning pipeline skeleton instantiation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination