CN111813930B - Similar document retrieval method and device - Google Patents


Publication number
CN111813930B
Authority
CN
China
Prior art keywords
document
similarity
document set
documents
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010543812.6A
Other languages
Chinese (zh)
Other versions
CN111813930A (en)
Inventor
毛红保
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Iol Wuhan Information Technology Co ltd
Original Assignee
Iol Wuhan Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Iol Wuhan Information Technology Co ltd
Priority to CN202010543812.6A
Publication of CN111813930A
Priority to PCT/CN2021/078813 (WO2021253873A1)
Application granted
Publication of CN111813930B
Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiments of the invention provide a similar document retrieval method and device. The method comprises: retrieving a first document set, together with the similarity of each document, based on a word frequency search model, and retrieving a second document set, together with the similarity of each document, based on a document vectorization model; adding together the similarities of documents that appear in both the first document set and the second document set, and selecting a preset number of documents in descending order of similarity to obtain a candidate document set; and determining a retrieval result according to the candidate document set. Because the method considers the results of both word frequency search and document vectorization search and combines them through their similarities, semantic inertia is eliminated to a certain extent, a multi-dimensional retrieval result is obtained, and the limitations of a retrieval result obtained from a single model are avoided.

Description

Similar document retrieval method and device
Technical Field
The invention relates to the field of natural language analysis, in particular to a similar document retrieval method and device.
Background
Similar document retrieval means that, given a document to be retrieved, the documents most similar to it in content are automatically retrieved from a massive document library. In the translation field, when a manuscript to be translated is received, documents similar to it in topic and content need to be retrieved from a historical manuscript library so that a suitable translator can be matched quickly, thereby improving translation quality and efficiency.
Conventional document retrieval methods are mainly keyword-based, such as TF-IDF (term frequency-inverse document frequency). These methods meet the requirements in most cases but have the drawback of ignoring word order. For example, if a document contains many occurrences of the phrase "machine learning", the phrase is split into the two keywords "machine" and "learning" for retrieval; if "machine learning" in the document is replaced by "learning machine", the retrieval result is unaffected. To solve this problem, deep-learning-based semantic representations of documents, such as the document vectorization model Doc2vec, have been applied to document retrieval. A document vectorization model is sensitive to word order and represents a document better at the semantic level, but semantic inertia can arise in practice. For example, suppose the 5 documents that best match "motorcycle production" are to be retrieved, and the document library contains a large number of documents about "motorcycle sales" and "automobile production". If a semantic representation method is used, the top 5 documents retrieved are likely all to be about "automobile production", because a semantic representation method is more sensitive to the global semantics of a document than to any particular keyword. The user, however, probably wants the top 5 documents to cover both "automobile production" and "motorcycle sales". Search results obtained with current methods are therefore often limited, and accurate retrieval results cannot be obtained.
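The word-order limitation described above can be illustrated with a minimal Python sketch. It assumes a simple whitespace tokenizer and a plain bag-of-words count (the helper name `bag_of_words` is hypothetical and not part of the patent); it only shows that a keyword representation cannot distinguish "machine learning" from "learning machine".

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase whitespace-separated tokens; word order is discarded."""
    return Counter(text.lower().split())


doc_a = "machine learning improves translation"
doc_b = "learning machine improves translation"

# Both documents contain exactly the same keywords with the same counts,
# so any purely keyword-based score treats them identically.
same_bag = bag_of_words(doc_a) == bag_of_words(doc_b)
print(same_bag)
```

Since the two bags are equal, any score computed only from keyword counts is the same for both documents.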
Disclosure of Invention
In order to solve the above problems, the embodiment of the invention provides a similar document retrieval method and device.
In a first aspect, an embodiment of the present invention provides a similar document retrieval method, including: retrieving a first document set and the similarity of each document based on a word frequency search model, and retrieving a second document set and the similarity of each document based on a document vectorization model; adding together the similarities of documents that appear in both the first document set and the second document set, and selecting a preset number of documents in descending order of similarity to obtain a candidate document set; and determining a retrieval result according to the candidate document set.
Further, determining the retrieval result according to the candidate document set includes: selecting, from the second document set, a first preset proportion of documents in descending order of similarity as a third document set; and updating the similarity of each document in the third document set that also appears in the candidate document set with its similarity in the candidate document set, and then selecting, according to similarity, a second preset proportion of documents from the third document set as the retrieval result.
Further, before the similarities of the same documents in the first document set and the second document set are added together, the method further includes: normalizing the document similarities in the first document set and in the second document set separately.
Further, the number of documents in the first set of documents, the second set of documents, and the candidate set of documents remain consistent.
Further, the word frequency search model is a TF-IDF model.
Further, the document vectorization model is a Doc2vec model.
Further, the first preset ratio is 2/3, and the second preset ratio is 1/2.
In a second aspect, an embodiment of the present invention provides a similar document retrieval device, including: a classification acquisition module, configured to retrieve a first document set and the similarity of each document based on a word frequency search model, and to retrieve a second document set and the similarity of each document based on a document vectorization model; a similarity superposition module, configured to add together the similarities of documents that appear in both the first document set and the second document set, and to select a preset number of documents in descending order of similarity to obtain a candidate document set; and a retrieval result determining module, configured to determine a retrieval result according to the candidate document set.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the similar document retrieval method of the first aspect of the present invention when the program is executed by the processor.
In a fourth aspect, embodiments of the present invention provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the similar document retrieval method of the first aspect of the present invention.
According to the similar document retrieval method and device provided by the embodiments of the invention, the similarities of documents that appear in both the first document set and the second document set are added together, and a preset number of documents are selected in descending order of similarity to obtain a candidate document set. Because the results of both the word frequency search method and the document vectorization search method are considered and combined through their similarities, semantic inertia is eliminated to a certain extent, a multi-dimensional retrieval result is obtained, and the limitations of a retrieval result obtained from a single model are avoided.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a similar document retrieval method provided by an embodiment of the present invention;
FIG. 2 is a flowchart of a method for retrieving similar documents according to another embodiment of the present invention;
FIG. 3 is a block diagram of a similar document retrieval device according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
FIG. 1 is a flowchart of a similar document retrieval method provided in an embodiment of the present invention, and as shown in FIG. 1, the embodiment of the present invention provides a similar document retrieval method, including:
101. Retrieve a first document set and the similarity of each document based on a word frequency search model, and retrieve a second document set and the similarity of each document based on a document vectorization model.
A word frequency search model generally refers to a model that searches based on the word frequency of keywords, such as a TF-IDF model. A document vectorization model generally refers to a class of models that perform semantic retrieval based on keyword vectors, such as the Doc2vec model and the word2vec model.
In a specific implementation, keyword retrieval is performed for the document to be retrieved based on the word frequency search model; the resulting keyword retrieval result is the first document set, denoted Result_TF-IDF. The document to be retrieved is also given a semantic vectorized representation, and retrieval is performed based on the document vectorization model; the resulting semantic retrieval result is the second document set, denoted Result_Doc2vec. In addition to the retrieved documents, the similarity of each retrieved document is obtained; it represents how similar the retrieved document is to the document to be retrieved.
102. Add together the similarities of the same documents in the first document set and the second document set, and select a preset number of documents in descending order of similarity to obtain a candidate document set, denoted Result_combination.
Because the same document may appear in both the first document set and the second document set, the two similarities of such a document are added together, while the similarities of the other documents in the two sets remain unchanged. All documents are then sorted by similarity, and a preset number of them are selected as the candidate document set.
As an alternative embodiment, the numbers of documents in the first document set, the second document set, and the candidate document set are kept consistent; the numbers may be the same or close to one another. For example, each of the three sets contains N documents, which keeps the word-frequency-based search and the document-vectorization-based search in balance.
103. Determine a retrieval result according to the candidate document set.
The candidate document set reflects both the word frequency search and the semantic search, so determining the final retrieval result from it avoids the limitations of a retrieval result obtained from a single model. For example, a portion of the candidate document set may be selected as the retrieval result, or the retrieval result may be further determined from the candidate document set together with the first document set and the second document set.
According to the similar document retrieval method of this embodiment, the similarities of documents that appear in both the first document set and the second document set are added together, and a preset number of documents are selected in descending order of similarity to obtain a candidate document set. Because the results of both the word frequency search method and the document vectorization search method are considered and combined through their similarities, semantic inertia is eliminated to a certain extent, a multi-dimensional retrieval result is obtained, and the limitations of a retrieval result obtained from a single model are avoided.
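The combination in step 102 can be sketched in Python as follows. This is a minimal illustration, assuming each result set is a dictionary mapping a document identifier to its similarity; the function name `combine_similarities`, the document identifiers, and the scores are hypothetical and chosen only for the example.

```python
def combine_similarities(result_tfidf, result_doc2vec, top_n):
    """Add the scores of documents found in both sets, then keep the top_n."""
    combined = dict(result_tfidf)
    for doc_id, score in result_doc2vec.items():
        # Documents present in both sets get the sum of their two scores;
        # documents present in only one set keep their original score.
        combined[doc_id] = combined.get(doc_id, 0.0) + score
    ranked = sorted(combined.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:top_n])


# Hypothetical scores, assumed to be on a comparable scale already.
result_tfidf = {"d1": 0.9, "d2": 0.5, "d3": 0.2}
result_doc2vec = {"d2": 0.8, "d4": 0.75, "d5": 0.25}

result_combination = combine_similarities(result_tfidf, result_doc2vec, top_n=3)
print(list(result_combination))
```

In this example only "d2" appears in both sets, so its two scores are summed and it moves to the top of the candidate set, ahead of "d1" and "d4".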
Based on the content of the above embodiment, as an alternative embodiment, determining the retrieval result according to the candidate document set includes: selecting, from the second document set, a first preset proportion of documents in descending order of similarity as a third document set; and updating the similarity of each document in the third document set that also appears in the candidate document set with its similarity in the candidate document set, and then selecting, according to similarity, a second preset proportion of documents from the third document set as the retrieval result.
FIG. 2 is a flowchart of a similar document retrieval method according to another embodiment of the present invention. As shown in FIG. 2, a first preset proportion of documents is selected from the second document set in descending order of similarity. For example, suppose the first document set, the second document set, and the candidate document set each contain 3N documents. If the first preset proportion is 2/3, the selected third document set contains 2N documents. For each document in the third document set, if the document also exists in the candidate document set, its similarity in the third document set is updated to its similarity in the candidate document set; the similarities of the other documents in the third document set remain unchanged. The updated third document set is then re-sorted by similarity, and a second preset proportion of documents is selected as the retrieval result. For example, if the second preset proportion is 1/2, the top N documents in descending order of similarity are selected as the final retrieval result, denoted Result_merge.
According to the similar document retrieval method of this embodiment, the semantic retrieval result of the document vectorization model serves as the basis and is adjusted by the keyword retrieval result, so that semantic inertia is eliminated to a certain extent, a multi-dimensional retrieval result is obtained, and the accuracy of the retrieval result is ensured.
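The re-ranking flow of FIG. 2 can be sketched as below. This is an illustration under stated assumptions, not the patented implementation: result sets are dictionaries of document identifier to similarity, the function name `rerank` and all scores are hypothetical, and set sizes that are not exact multiples of the ratios are rounded down (the source does not say how to round).

```python
from fractions import Fraction


def rerank(result_doc2vec, result_combination,
           first_ratio=Fraction(2, 3), second_ratio=Fraction(1, 2)):
    """Keep the top share of the semantic results, overwrite their scores with
    the combined scores where available, re-sort, and keep the top share."""
    ranked = sorted(result_doc2vec.items(), key=lambda item: item[1], reverse=True)
    # Third document set: the first_ratio best semantic results (rounded down).
    third_set = dict(ranked[: int(len(ranked) * first_ratio)])
    for doc_id in third_set:
        if doc_id in result_combination:
            third_set[doc_id] = result_combination[doc_id]
    reranked = sorted(third_set.items(), key=lambda item: item[1], reverse=True)
    return reranked[: int(len(reranked) * second_ratio)]


# Hypothetical data with N = 2, so each input set holds 3N = 6 documents.
result_doc2vec = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6, "e": 0.5, "f": 0.4}
result_combination = {"c": 1.5, "d": 1.2, "x": 1.1, "a": 0.9, "y": 0.85, "b": 0.8}

result_merge = rerank(result_doc2vec, result_combination)
print(result_merge)
```

Here the third set is {a, b, c, d} (2N = 4 documents); after the update, "c" and "d" carry their combined scores and overtake "a" and "b", so the final N = 2 results are "c" and "d".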
Based on the content of the above embodiment, as an alternative embodiment, before overlapping the same document similarity in the first document set and the second document set, the method further includes: and respectively carrying out normalization processing on the document similarity in the first document set and the second document set.
The document similarities in the first document set, obtained by keyword retrieval, and the document similarities in the second document set, obtained by semantic retrieval, are normalized separately, and then the similarities of documents that exist in both sets are added together. Normalizing the two sets separately avoids the effect of an imbalance between the similarity scales of the first document set and the second document set.
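The source does not state which normalization is used. As one possible choice, the sketch below assumes min-max scaling to the range [0, 1], applied to each result set independently; the function name `min_max_normalize` and the scores are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale one result set to [0, 1]; min-max scaling is an assumption here."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        # All scores equal: treat every document as equally (fully) similar.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


# Hypothetical raw scores on very different scales.
tfidf_scores = {"d1": 12.0, "d2": 6.0, "d3": 3.0}
doc2vec_scores = {"d2": 0.75, "d4": 0.5, "d5": 0.25}

norm_tfidf = min_max_normalize(tfidf_scores)
norm_doc2vec = min_max_normalize(doc2vec_scores)
print(norm_tfidf, norm_doc2vec)
```

After scaling, both sets span the same range, so adding the two scores of a shared document such as "d2" no longer lets the set with the larger raw scale dominate.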
Based on the foregoing embodiment, as an alternative embodiment, the word frequency search model is a TF-IDF model.
TF-IDF is a weighting technique commonly used in information retrieval and data mining. TF is the term frequency and IDF is the inverse document frequency.
A TF-IDF model of the document library may be trained using the Python-based gensim tool; keyword vectorized representation and retrieval are then performed for the document to be retrieved based on this model to obtain its keyword retrieval result.
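To make the keyword retrieval step concrete, the sketch below computes TF-IDF vectors and cosine similarities using only the Python standard library rather than a third-party toolkit. The whitespace tokenizer, the smoothed IDF formula, the function names, and the three-document library are all assumptions made for illustration.

```python
import math
from collections import Counter


def build_idf(tokenized_docs):
    """Smoothed inverse document frequency (one common variant, assumed here)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector; terms unknown to the library are ignored."""
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return {t: (c / total) * idf[t] for t, c in counts.items() if t in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query, library, top_n):
    """Return the top_n (doc_id, similarity) pairs for the query document."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in library.items()}
    idf = build_idf(list(tokenized.values()))
    query_vec = tfidf_vector(query.lower().split(), idf)
    scores = {d: cosine(query_vec, tfidf_vector(toks, idf)) for d, toks in tokenized.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]


library = {
    "d1": "motorcycle production line upgrade",
    "d2": "automobile production report",
    "d3": "motorcycle sales forecast",
}
result_tfidf = retrieve("motorcycle production", library, top_n=2)
print(result_tfidf)
```

The document that shares both query keywords ("d1") ranks first, while documents sharing only one keyword receive lower similarities; the returned pairs play the role of the first document set and its similarities.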
Based on the foregoing embodiment, as an alternative embodiment, the document vectorization model is a Doc2vec model. Doc2vec is an unsupervised algorithm that produces vector representations of texts and is an extension of word2vec. The learned vectors can be used to find the similarity between texts by computing distances, can be used for text clustering, and, for labeled data, can be used for text classification with supervised learning, as in classic sentiment analysis.
A Doc2vec model of the document library may be trained using the Python-based gensim tool; semantic vectorized representation and retrieval are then performed for the document to be retrieved based on this model to obtain its semantic retrieval result.
Based on the foregoing embodiments, as an alternative embodiment, the first preset ratio is 2/3, and the second preset ratio is 1/2. The foregoing embodiments have been illustrated and will not be described in detail herein.
FIG. 3 is a block diagram of a similar document retrieval device according to an embodiment of the present invention. As shown in FIG. 3, the similar document retrieval device includes a classification acquisition module 301, a similarity superposition module 302, and a retrieval result determining module 303. The classification acquisition module 301 is configured to retrieve a first document set and the similarity of each document based on a word frequency search model, and to retrieve a second document set and the similarity of each document based on a document vectorization model; the similarity superposition module 302 is configured to add together the similarities of documents that appear in both the first document set and the second document set, and to select a preset number of documents in descending order of similarity to obtain a candidate document set; the retrieval result determining module 303 is configured to determine a retrieval result according to the candidate document set.
Based on the content of the above embodiment, as an alternative embodiment, the retrieval result determining module 303 is specifically configured to: select, from the second document set, a first preset proportion of documents in descending order of similarity as a third document set; and update the similarity of each document in the third document set that also appears in the candidate document set with its similarity in the candidate document set, and then select, according to similarity, a second preset proportion of documents from the third document set as the retrieval result.
The device embodiment provided by the embodiments of the present invention implements the above method embodiments; for the specific flow and details, refer to the method embodiments, which are not repeated here.
According to the similar document retrieval device provided by the embodiments of the invention, the similarities of documents that appear in both the first document set and the second document set are added together, and a preset number of documents are selected in descending order of similarity to obtain a candidate document set. Because the results of both the word frequency search method and the document vectorization search method are considered and combined through their similarities, semantic inertia is eliminated to a certain extent, a multi-dimensional retrieval result is obtained, and the limitations of a retrieval result obtained from a single model are avoided.
FIG. 4 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present invention. As shown in FIG. 4, the electronic device may include a processor 401, a communications interface 402, a memory 403, and a bus 404, wherein the processor 401, the communications interface 402, and the memory 403 communicate with one another through the bus 404. The communications interface 402 may be used for information transfer of the electronic device. The processor 401 may call logic instructions in the memory 403 to perform a method comprising: retrieving a first document set and the similarity of each document based on a word frequency search model, and retrieving a second document set and the similarity of each document based on a document vectorization model; adding together the similarities of documents that appear in both the first document set and the second document set, and selecting a preset number of documents in descending order of similarity to obtain a candidate document set; and determining a retrieval result according to the candidate document set.
Further, the logic instructions in the memory 403 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In another aspect, embodiments of the present invention further provide a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the similar document retrieval method provided in the above embodiments, for example including: retrieving a first document set and the similarity of each document based on a word frequency search model, and retrieving a second document set and the similarity of each document based on a document vectorization model; adding together the similarities of documents that appear in both the first document set and the second document set, and selecting a preset number of documents in descending order of similarity to obtain a candidate document set; and determining a retrieval result according to the candidate document set.
The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A similar document retrieval method, comprising:
retrieving a first document set and the similarity of each document based on a word frequency search model, and retrieving a second document set and the similarity of each document based on a document vectorization model;
adding together the similarities of documents that appear in both the first document set and the second document set, and selecting a preset number of documents in descending order of similarity to obtain a candidate document set;
determining a retrieval result according to the candidate document set;
wherein the determining a retrieval result according to the candidate document set includes:
selecting, from the second document set, a first preset proportion of documents in descending order of similarity as a third document set;
and updating the similarity of each document in the third document set that also appears in the candidate document set with its similarity in the candidate document set, and selecting, according to similarity, a second preset proportion of documents from the third document set as the retrieval result.
2. The similar document retrieval method according to claim 1, wherein before the similarities of the same documents in the first document set and the second document set are added together, the method further comprises:
normalizing the document similarities in the first document set and in the second document set separately.
3. The similar document retrieval method according to claim 1, wherein the number of documents in the first set of documents, the second set of documents, and the candidate set of documents remain the same.
4. The method of claim 1, wherein the term-frequency search model is a TF-IDF model.
5. The method of claim 1, wherein the document vectorization model is a Doc2vec model.
6. The similar document retrieving method according to claim 1, wherein the first preset ratio is 2/3 and the second preset ratio is 1/2.
7. A similar document retrieval apparatus, comprising:
a classification acquisition module, configured to retrieve a first document set and the similarity of each document based on a word frequency search model, and to retrieve a second document set and the similarity of each document based on a document vectorization model;
a similarity superposition module, configured to add together the similarities of documents that appear in both the first document set and the second document set, and to select a preset number of documents in descending order of similarity to obtain a candidate document set;
a retrieval result determining module, configured to determine a retrieval result according to the candidate document set;
wherein the determining a retrieval result according to the candidate document set includes:
selecting, from the second document set, a first preset proportion of documents in descending order of similarity as a third document set;
and updating the similarity of each document in the third document set that also appears in the candidate document set with its similarity in the candidate document set, and selecting, according to similarity, a second preset proportion of documents from the third document set as the retrieval result.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the similar document retrieval method of any one of claims 1 to 6 when the program is executed by the processor.
9. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the similar document retrieval method according to any one of claims 1 to 6.
CN202010543812.6A 2020-06-15 2020-06-15 Similar document retrieval method and device Active CN111813930B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010543812.6A CN111813930B (en) 2020-06-15 2020-06-15 Similar document retrieval method and device
PCT/CN2021/078813 WO2021253873A1 (en) 2020-06-15 2021-03-03 Method and apparatus for retrieving similar document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010543812.6A CN111813930B (en) 2020-06-15 2020-06-15 Similar document retrieval method and device

Publications (2)

Publication Number Publication Date
CN111813930A CN111813930A (en) 2020-10-23
CN111813930B CN111813930B (en) 2024-02-20

Family

ID=72845178

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010543812.6A Active CN111813930B (en) 2020-06-15 2020-06-15 Similar document retrieval method and device

Country Status (2)

Country Link
CN (1) CN111813930B (en)
WO (1) WO2021253873A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111813930B (en) * 2020-06-15 2024-02-20 语联网(武汉)信息技术有限公司 Similar document retrieval method and device
CN112632907A (en) * 2021-01-04 2021-04-09 北京明略软件系统有限公司 Document marking method, device and equipment
CN113094519B (en) * 2021-05-07 2023-04-14 超凡知识产权服务股份有限公司 Method and device for searching based on document
CN114780690B (en) * 2022-06-20 2022-09-09 成都信息工程大学 Patent text retrieval method and device based on multi-mode matrix vector representation

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105302793A (en) * 2015-10-21 2016-02-03 南方电网科学研究院有限责任公司 Method for automatically evaluating scientific and technical literature novelty by utilizing computer
CN107562824A (en) * 2017-08-21 2018-01-09 昆明理工大学 A kind of text similarity detection method
CN107704469A (en) * 2016-08-08 2018-02-16 中国科学院文献情报中心 The mapping method and device of patent data and industry data
CN109858028A (en) * 2019-01-30 2019-06-07 神思电子技术股份有限公司 A kind of short text similarity calculating method based on probabilistic model

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104281583B (en) * 2013-07-02 2018-01-12 索意互动(北京)信息技术有限公司 Information retrieval method and device
CN106095737A (en) * 2016-06-07 2016-11-09 杭州凡闻科技有限公司 Documents Similarity computational methods and similar document the whole network retrieval tracking
CN107220307B (en) * 2017-05-10 2020-09-25 清华大学 Webpage searching method and device
CN107491547B (en) * 2017-08-28 2020-11-10 北京百度网讯科技有限公司 Search method and device based on artificial intelligence
CN111813930B (en) * 2020-06-15 2024-02-20 语联网(武汉)信息技术有限公司 Similar document retrieval method and device

Also Published As

Publication number Publication date
CN111813930A (en) 2020-10-23
WO2021253873A1 (en) 2021-12-23

Similar Documents

Publication Publication Date Title
CN111444320B (en) Text retrieval method and device, computer equipment and storage medium
CN111813930B (en) Similar document retrieval method and device
CN109918673B (en) Semantic arbitration method and device, electronic equipment and computer-readable storage medium
US8918348B2 (en) Web-scale entity relationship extraction
CN105045781B (en) Query term similarity calculation method and device and query term search method and device
EP3579125A1 (en) System, computer-implemented method and computer program product for information retrieval
CN112667794A (en) Intelligent question-answer matching method and system based on twin network BERT model
CN111078837B (en) Intelligent question-answering information processing method, electronic equipment and computer readable storage medium
EP3314461A1 (en) Learning entity and word embeddings for entity disambiguation
JP2005122533A (en) Question-answering system and question-answering processing method
US20190340503A1 (en) Search system for providing free-text problem-solution searching
CN109948140B (en) Word vector embedding method and device
CN108875065B (en) Indonesia news webpage recommendation method based on content
US11429792B2 (en) Creating and interacting with data records having semantic vectors and natural language expressions produced by a machine-trained model
CN111159359A (en) Document retrieval method, document retrieval device and computer-readable storage medium
CN109791570B (en) Efficient and accurate named entity recognition method and device
WO2016015267A1 (en) Rank aggregation based on markov model
JP2020091857A (en) Classification of electronic document
CN116932730A (en) Document question-answering method and related equipment based on multi-way tree and large-scale language model
CN117435685A (en) Document retrieval method, document retrieval device, computer equipment, storage medium and product
JP7256357B2 (en) Information processing device, control method, program
CN117076636A (en) Information query method, system and equipment for intelligent customer service
US20220318318A1 (en) Systems and methods for automated information retrieval
WO2016210203A1 (en) Learning entity and word embeddings for entity disambiguation
CN109684357A (en) Information processing method and device, storage medium, terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant