CN111813930B - Similar document retrieval method and device - Google Patents
- Publication number
- CN111813930B
- Authority
- CN
- China
- Prior art keywords
- document
- similarity
- document set
- documents
- candidate
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
An embodiment of the invention provides a similar document retrieval method and device. The method comprises: retrieving a first document set and the similarity of each of its documents based on a word frequency search model, and retrieving a second document set and the similarity of each of its documents based on a document vectorization model; superimposing the similarities of documents that appear in both the first document set and the second document set, and selecting a preset number of documents in descending order of similarity to obtain a candidate document set; and determining a retrieval result according to the candidate document set. The method takes into account the results of both the word frequency search method and the document vectorization search method and combines them through their similarities, so that semantic inertia is eliminated to a certain extent, a multi-dimensional retrieval result is obtained, and the limitations of a retrieval result obtained from a single model are avoided.
Description
Technical Field
The invention relates to the field of natural language analysis, in particular to a similar document retrieval method and device.
Background
Similar document retrieval means that, given a document to be retrieved, the documents whose content is most similar to it are automatically retrieved from a massive document library. In the field of translation, when a manuscript to be translated is received, documents similar to it in topic and content need to be retrieved from a historical manuscript library so that a suitable translator can be matched quickly, thereby improving the quality and efficiency of translation.
Conventional document retrieval methods are mainly keyword-based, such as TF-IDF (term frequency-inverse document frequency). Such methods meet the requirements in most cases, but they ignore the order of words. For example, if a document contains many occurrences of the phrase "machine learning", the phrase is split into the two keywords "machine" and "learning" for retrieval; if "machine learning" in the document is replaced by "learning machine", the retrieval result is not affected. To solve this problem, deep-learning-based semantic representations of documents, such as the document vectorization model Doc2vec, have been applied to document retrieval. A document vectorization model is sensitive to word order and represents a document better at the semantic level, but semantic inertia can arise in practice. For example, suppose the 5 documents that best match "motorcycle production" are to be retrieved, and the document library contains a large number of documents about "motorcycle sales" and "automobile production". If a semantic representation method is used, it is likely that all 5 retrieved documents are about "automobile production". This is because a semantic representation method is more sensitive to the global semantics of a document than to any particular keyword. The user, however, is likely to want the 5 documents to cover both "automobile production" and "motorcycle sales". The retrieval results obtained with current methods are therefore often limited, and accurate retrieval results cannot be obtained.
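As an illustrative, non-limiting sketch of the word-order problem described above, the following Python snippet shows that a bag-of-words representation cannot distinguish "machine learning" from "learning machine". The helper name `bag_of_words` and the sample sentences are assumptions introduced only for this illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Count keyword occurrences, discarding word order (hypothetical helper)."""
    return Counter(text.lower().split())


a = bag_of_words("machine learning improves document retrieval")
b = bag_of_words("learning machine improves document retrieval")

# Both sentences yield identical keyword counts, so any retrieval method that
# relies only on these counts treats them as the same query.
print(a == b)
```

Because the two counters are equal, a purely keyword-based score computed from them is also equal, which is the limitation the background attributes to keyword-related methods.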
Disclosure of Invention
In order to solve the above problems, the embodiment of the invention provides a similar document retrieval method and device.
In a first aspect, an embodiment of the present invention provides a similar document retrieval method, including: retrieving a first document set and the similarity of each of its documents based on a word frequency search model, and retrieving a second document set and the similarity of each of its documents based on a document vectorization model; superimposing the similarities of documents that appear in both the first document set and the second document set, and selecting a preset number of documents in descending order of similarity to obtain a candidate document set; and determining a retrieval result according to the candidate document set.
Further, determining the retrieval result according to the candidate document set includes: selecting, from the second document set, a first preset proportion of documents in descending order of similarity as a third document set; and updating the similarities of the documents in the third document set that also appear in the candidate document set with their similarities in the candidate document set, and selecting a second preset proportion of documents from the third document set according to similarity as the retrieval result.
Further, before superimposing the similarities of documents that appear in both the first document set and the second document set, the method further includes: normalizing the document similarities in the first document set and in the second document set separately.
Further, the numbers of documents in the first document set, the second document set, and the candidate document set are kept consistent.
Further, the word frequency search model is a TF-IDF model.
Further, the document vectorization model is a Doc2vec model.
Further, the first preset ratio is 2/3, and the second preset ratio is 1/2.
In a second aspect, an embodiment of the present invention provides a similar document retrieval device, including: a classification acquisition module, configured to retrieve a first document set and the similarity of each of its documents based on a word frequency search model, and to retrieve a second document set and the similarity of each of its documents based on a document vectorization model; a similarity superposition module, configured to superimpose the similarities of documents that appear in both the first document set and the second document set, and to select a preset number of documents in descending order of similarity to obtain a candidate document set; and a retrieval result determining module, configured to determine a retrieval result according to the candidate document set.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the similar document retrieval method of the first aspect of the present invention when executing the program.
In a fourth aspect, embodiments of the present invention provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the similar document retrieval method of the first aspect of the present invention.
According to the similar document retrieval method and device provided by the embodiments of the invention, the similarities of documents that appear in both the first document set and the second document set are superimposed, and a preset number of documents are selected in descending order of similarity to obtain a candidate document set. The results of both the word frequency search method and the document vectorization search method are thus taken into account and combined through their similarities, so that semantic inertia is eliminated to a certain extent, a multi-dimensional retrieval result is obtained, and the limitations of a retrieval result obtained from a single model are avoided.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a similar document retrieval method provided by an embodiment of the present invention;
FIG. 2 is a flowchart of a method for retrieving similar documents according to another embodiment of the present invention;
FIG. 3 is a block diagram of a similar document retrieval device according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
FIG. 1 is a flowchart of a similar document retrieval method provided in an embodiment of the present invention. As shown in FIG. 1, the embodiment of the present invention provides a similar document retrieval method, including:
101. Retrieving a first document set and the similarity of each of its documents based on a word frequency search model, and retrieving a second document set and the similarity of each of its documents based on a document vectorization model.
A word frequency search model generally refers to a model that retrieves documents based on the frequency of keywords, such as a TF-IDF model. A document vectorization model generally refers to a model that performs semantic retrieval based on vector representations, such as a Doc2vec model or a word2vec model.
In a specific implementation, keyword retrieval is performed for the document to be retrieved based on the word frequency search model, and the keyword retrieval result is taken as the first document set, denoted Result_TF-IDF. The document to be retrieved is given a semantic vectorized representation, retrieval is performed based on the document vectorization model, and the semantic retrieval result is taken as the second document set, denoted Result_Doc2vec. In addition to the retrieved documents, the similarity of each retrieved document is obtained; the similarity represents how similar the retrieved document is to the document to be retrieved.
102. Superimposing the similarities of documents that appear in both the first document set and the second document set, and selecting a preset number of documents in descending order of similarity to obtain a candidate document set, denoted Result_combination.
Since the same document may appear in both the first document set and the second document set, the similarities of such a document are superimposed, while the similarities of the other documents in the two sets remain unchanged. All documents are then sorted by similarity, and a preset number of them are selected as the candidate document set.
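As an illustrative, non-limiting sketch, step 102 could be written in Python as follows. The function name `combine_results`, the dictionary representation (document id mapped to similarity), and the sample scores are assumptions introduced for this example; in particular, "superimposing" is interpreted here as addition.

```python
from collections import defaultdict


def combine_results(result_tfidf, result_doc2vec, top_n):
    """Superimpose similarities of shared documents and keep the top_n.

    Each argument maps a document id to its similarity. Superposition is
    implemented as addition, which is an assumption of this sketch.
    """
    combined = defaultdict(float)
    for results in (result_tfidf, result_doc2vec):
        for doc_id, score in results.items():
            # A document found by both models accumulates both scores;
            # a document found by only one model keeps its single score.
            combined[doc_id] += score
    ranked = sorted(combined.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:top_n])


# Hypothetical similarities returned by the two models.
result_tfidf = {"d1": 0.9, "d2": 0.4, "d3": 0.2}
result_doc2vec = {"d2": 0.8, "d4": 0.7, "d5": 0.1}

result_combination = combine_results(result_tfidf, result_doc2vec, top_n=3)
print(result_combination)
```

In this example document d2 appears in both sets, so its superimposed similarity moves it ahead of d1, which only the keyword search returned.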
As an alternative embodiment, the numbers of documents in the first document set, the second document set, and the candidate document set are kept consistent; they may be identical or close to one another. For example, the first document set, the second document set, and the candidate document set each contain N documents, which keeps the word-frequency-based search and the document-vectorization-based search in balance.
103. And determining a retrieval result according to the candidate document set.
The candidate document set reflects both the word frequency search and the semantic search, so determining the final retrieval result from the candidate document set avoids the limitations of a retrieval result obtained from a single model. For example, a portion of the candidate document set may be selected as the retrieval result, or the retrieval result may be further determined based on the candidate document set, the first document set, and the second document set.
According to the similar document retrieval method of this embodiment, the similarities of documents that appear in both the first document set and the second document set are superimposed, and a preset number of documents are selected in descending order of similarity to obtain a candidate document set. The results of both the word frequency search method and the document vectorization search method are thus taken into account and combined through their similarities, so that semantic inertia is eliminated to a certain extent, a multi-dimensional retrieval result is obtained, and the limitations of a retrieval result obtained from a single model are avoided.
Based on the content of the above embodiment, as an alternative embodiment, determining the search result according to the candidate document set includes: according to the second document set, selecting documents with a first preset proportion from large to small according to the similarity, and taking the documents as a third document set; and updating the similarity of the same documents in the third document set by using the similarity in the candidate document set, and selecting the documents with a second preset proportion from the third document set as search results according to the similarity.
FIG. 2 is a flowchart of a similar document retrieval method according to another embodiment of the present invention. As shown in FIG. 2, a first preset proportion of documents is selected from the second document set according to similarity. For example, the first document set, the second document set, and the candidate document set each contain 3N documents; if the first preset proportion is 2/3, the selected third document set contains 2N documents. For each document in the third document set, if the document also exists in the candidate document set, its similarity in the third document set is updated with its similarity in the candidate document set; the similarities of the other documents in the third document set remain unchanged. The updated third document set is then re-sorted by similarity, and the second preset proportion of documents is selected as the retrieval result. For example, if the second preset proportion is 1/2, the top N documents in descending order of similarity are selected as the final retrieval result, denoted Result_merge.
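As an illustrative, non-limiting sketch, the refinement of FIG. 2 could be written in Python as follows. The function name `refine_results`, the use of `round` to turn a proportion into a document count, and the sample scores are assumptions introduced for this example.

```python
def refine_results(result_doc2vec, result_combination,
                   first_ratio=2 / 3, second_ratio=1 / 2):
    """Derive the final result from the semantic results and the candidates."""
    # Third document set: the top first_ratio share of the second document set.
    ranked = sorted(result_doc2vec.items(), key=lambda kv: kv[1], reverse=True)
    third_size = round(len(ranked) * first_ratio)  # rounding is an assumption
    third = dict(ranked[:third_size])

    # Overwrite the similarity of every document that is also a candidate;
    # documents absent from the candidate set keep their semantic similarity.
    for doc_id in third:
        if doc_id in result_combination:
            third[doc_id] = result_combination[doc_id]

    # Re-sort and keep the top second_ratio share as the retrieval result.
    reranked = sorted(third.items(), key=lambda kv: kv[1], reverse=True)
    final_size = round(len(reranked) * second_ratio)
    return reranked[:final_size]


# Hypothetical data with 3N = 6 documents per set, so N = 2.
result_doc2vec = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6, "e": 0.5, "f": 0.4}
result_combination = {"d": 1.5, "x": 1.4, "a": 1.3, "y": 1.0, "b": 0.8, "z": 0.75}

result_merge = refine_results(result_doc2vec, result_combination)
print(result_merge)
```

Here the third document set is {a, b, c, d}; after the update, d overtakes a because its superimposed similarity in the candidate set is higher, and the top N = 2 documents are returned.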
According to the similar document retrieval method, the semantic retrieval result of the document vectorization model is taken as the main part, and the semantic retrieval result is adjusted by keyword retrieval, so that semantic inertia can be eliminated to a certain extent, a multi-dimensional retrieval result is obtained, and the accuracy of the retrieval result is ensured.
Based on the content of the above embodiment, as an alternative embodiment, before superimposing the similarities of documents that appear in both the first document set and the second document set, the method further includes: normalizing the document similarities in the first document set and in the second document set separately.
The document similarities in the first document set, obtained by keyword retrieval, and the document similarities in the second document set, obtained by semantic retrieval, are normalized separately, and then the similarities of documents present in both sets are superimposed. Normalizing the two sets separately avoids the influence of their similarities being on unbalanced scales.
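The text does not specify which normalization is used. As an illustrative, non-limiting sketch, the following Python snippet assumes min-max normalization, which maps each set's similarities onto the range [0, 1]; the function name `min_max_normalize` and the sample scores are likewise assumptions.

```python
def min_max_normalize(results):
    """Scale similarities to [0, 1]; min-max scaling is an assumed choice."""
    if not results:
        return {}
    low, high = min(results.values()), max(results.values())
    if high == low:
        # All scores are equal, so no ordering information is lost.
        return {doc_id: 1.0 for doc_id in results}
    return {doc_id: (score - low) / (high - low)
            for doc_id, score in results.items()}


# Hypothetical raw scores on a scale much larger than cosine similarity.
raw_scores = {"d1": 12.0, "d2": 6.0, "d3": 3.0}
normalized = min_max_normalize(raw_scores)
print(normalized)
```

Applying the same function to each set before superposition keeps one model's larger raw scores from dominating the combined similarity.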
Based on the foregoing embodiment, as an alternative embodiment, the word frequency search model is a TF-IDF model.
TF-IDF is a weighting technique commonly used in information retrieval and data mining. TF is the term frequency and IDF is the inverse document frequency.
A TF-IDF model of the document library can be trained using the Python-based gensim toolkit. Based on this model, the document to be retrieved is given a keyword vectorized representation and retrieval is performed, yielding the keyword retrieval result for the document to be retrieved.
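As an illustrative, non-limiting sketch of TF-IDF retrieval, the following dependency-free Python code builds TF-IDF vectors for a small library and ranks the library by cosine similarity to a query. It is a stand-in for, not a reproduction of, the Python toolkit the text mentions; the smoothed IDF formula `log(N / df) + 1`, all function names, and the sample library are assumptions.

```python
import math
from collections import Counter


def tfidf_vector(tokens, idf):
    """Weight each known term by relative frequency times its IDF."""
    counts = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in counts.items() if t in idf}


def build_tfidf(corpus):
    """corpus is a list of token lists; returns (idf table, document vectors)."""
    n_docs = len(corpus)
    doc_freq = Counter(term for tokens in corpus for term in set(tokens))
    # Smoothed IDF variant (an assumption); rarer terms receive larger weights.
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    return idf, [tfidf_vector(tokens, idf) for tokens in corpus]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(query_tokens, idf, vectors, top_n):
    """Return (document index, similarity) pairs in descending similarity."""
    query_vec = tfidf_vector(query_tokens, idf)
    scores = {i: cosine(query_vec, vec) for i, vec in enumerate(vectors)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Hypothetical document library.
corpus = [
    "motorcycle production line".split(),
    "automobile production plant".split(),
    "motorcycle sales report".split(),
]
idf, vectors = build_tfidf(corpus)
result_tfidf = search("motorcycle production".split(), idf, vectors, top_n=3)
print(result_tfidf)
```

The first document shares both query keywords and therefore ranks first, while the other two each share one keyword and receive lower scores.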
Based on the foregoing embodiment, as an alternative embodiment, the document vectorization model is a Doc2vec model. Doc2vec is an unsupervised algorithm that produces a vector representation of a text and is an extension of word2vec. The similarity between texts can be found by computing the distance between the learned vectors; the vectors can be used for text clustering and, for labeled data, for text classification by supervised learning, such as classic sentiment analysis.
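As an illustrative, non-limiting sketch of finding similar texts by vector distance, the following Python code ranks documents by the cosine similarity of their vectors to a query vector. The vectors shown are hand-written placeholders rather than vectors learned by Doc2vec, and cosine similarity is an assumed choice of measure; the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query_vec, doc_vecs, top_n):
    """Return (document id, similarity) pairs in descending similarity."""
    scores = {doc_id: cosine_similarity(query_vec, vec)
              for doc_id, vec in doc_vecs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Placeholder vectors standing in for learned document embeddings.
doc_vecs = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.1, 0.9, 0.2],
    "d3": [0.8, 0.3, 0.1],
}
query_vec = [1.0, 0.2, 0.0]

result_doc2vec = most_similar(query_vec, doc_vecs, top_n=2)
print(result_doc2vec)
```

With real embeddings, the query vector would be inferred from the document to be retrieved by the trained model before this ranking step.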
The Doc2vec model of the document library can be trained using the Python-based gensim toolkit. Based on this model, the document to be retrieved is given a semantic vectorized representation and retrieval is performed, yielding the semantic retrieval result for the document to be retrieved.
Based on the foregoing embodiments, as an alternative embodiment, the first preset ratio is 2/3, and the second preset ratio is 1/2. The foregoing embodiments have been illustrated and will not be described in detail herein.
FIG. 3 is a block diagram of a similar document retrieval device according to an embodiment of the present invention. As shown in FIG. 3, the similar document retrieval device includes a classification acquisition module 301, a similarity superposition module 302, and a retrieval result determining module 303. The classification acquisition module 301 is configured to retrieve a first document set and the similarity of each of its documents based on a word frequency search model, and to retrieve a second document set and the similarity of each of its documents based on a document vectorization model. The similarity superposition module 302 is configured to superimpose the similarities of documents that appear in both the first document set and the second document set, and to select a preset number of documents in descending order of similarity to obtain a candidate document set. The retrieval result determining module 303 is configured to determine a retrieval result according to the candidate document set.
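As an illustrative, non-limiting sketch, the three modules of FIG. 3 could be arranged in Python as follows. The class name `SimilarDocumentRetriever`, its method names, the injected search functions, and the stub scores are all assumptions introduced for this example; superposition is implemented as addition, and the result determining step simply returns the candidates in ranked order.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Scores = Dict[str, float]
SearchFn = Callable[[str, int], Scores]


@dataclass
class SimilarDocumentRetriever:
    keyword_search: SearchFn   # stands in for the word frequency search model
    semantic_search: SearchFn  # stands in for the document vectorization model

    def acquire(self, query: str, n: int) -> Tuple[Scores, Scores]:
        """Role of module 301: obtain both document sets with similarities."""
        return self.keyword_search(query, n), self.semantic_search(query, n)

    def superimpose(self, first: Scores, second: Scores, n: int) -> Scores:
        """Role of module 302: add scores of shared documents, keep top n."""
        combined = dict(first)
        for doc_id, score in second.items():
            combined[doc_id] = combined.get(doc_id, 0.0) + score
        ranked = sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
        return dict(ranked[:n])

    def determine(self, candidates: Scores) -> List[str]:
        """Role of module 303: here, simply return the ranked candidates."""
        return sorted(candidates, key=candidates.get, reverse=True)

    def retrieve(self, query: str, n: int) -> List[str]:
        first, second = self.acquire(query, n)
        return self.determine(self.superimpose(first, second, n))


# Stub search functions returning fixed, hypothetical similarities.
retriever = SimilarDocumentRetriever(
    keyword_search=lambda query, n: {"d1": 0.9, "d2": 0.5},
    semantic_search=lambda query, n: {"d2": 0.6, "d3": 0.8},
)
result = retriever.retrieve("motorcycle production", 2)
print(result)
```

Passing the two search functions in as parameters keeps the sketch independent of any particular model implementation.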
Based on the content of the above embodiment, as an alternative embodiment, the search result determining module 303 is specifically configured to: according to the second document set, selecting documents with a first preset proportion from large to small according to the similarity, and taking the documents as a third document set; and updating the similarity of the same documents in the third document set by using the similarity in the candidate document set, and selecting the documents with a second preset proportion from the third document set as search results according to the similarity.
The embodiment of the device provided by the embodiment of the present invention is for implementing the above embodiments of the method, and specific flow and details refer to the above embodiments of the method, which are not repeated herein.
According to the similar document retrieval device provided by the embodiment of the invention, the similarities of documents that appear in both the first document set and the second document set are superimposed, and a preset number of documents are selected in descending order of similarity to obtain a candidate document set. The results of both the word frequency search method and the document vectorization search method are thus taken into account and combined through their similarities, so that semantic inertia is eliminated to a certain extent, a multi-dimensional retrieval result is obtained, and the limitations of a retrieval result obtained from a single model are avoided.
FIG. 4 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present invention. As shown in FIG. 4, the electronic device may include a processor 401, a communication interface 402, a memory 403, and a bus 404, wherein the processor 401, the communication interface 402, and the memory 403 communicate with one another through the bus 404. The communication interface 402 may be used for information transfer of the electronic device. The processor 401 may call logic instructions in the memory 403 to perform a method comprising: retrieving a first document set and the similarity of each of its documents based on a word frequency search model, and retrieving a second document set and the similarity of each of its documents based on a document vectorization model; superimposing the similarities of documents that appear in both the first document set and the second document set, and selecting a preset number of documents in descending order of similarity to obtain a candidate document set; and determining a retrieval result according to the candidate document set.
Further, the logic instructions in the memory 403 may be implemented in the form of software functional units and, when sold or used as a standalone product, stored in a computer readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes over the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and comprises several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the above-described method embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In another aspect, embodiments of the present invention further provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the similar document retrieval method provided in the above embodiments, for example including: retrieving a first document set and the similarity of each of its documents based on a word frequency search model, and retrieving a second document set and the similarity of each of its documents based on a document vectorization model; superimposing the similarities of documents that appear in both the first document set and the second document set, and selecting a preset number of documents in descending order of similarity to obtain a candidate document set; and determining a retrieval result according to the candidate document set.
The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (9)
1. A similar document retrieval method, comprising:
searching to obtain a first document set and the similarity of each document based on the word frequency searching model, and searching to obtain a second document set and the similarity of each document based on the document vectorization model;
overlapping the same document similarity in the first document set and the second document set, and selecting a preset number of documents according to the similarity from large to small to obtain a candidate document set;
determining a retrieval result according to the candidate document set;
wherein, the determining the search result according to the candidate document set includes:
selecting the documents with the first preset proportion as a third document set according to the second document set from large to small in similarity;
and updating the similarity of the same documents in the third document set by using the similarity in the candidate document set, and selecting a second preset proportion of documents from the third document set as a retrieval result according to the similarity.
2. The method for searching similar documents according to claim 1, wherein before said superimposing the same document similarity in said first document set and said second document set, further comprising:
and respectively carrying out normalization processing on the document similarity in the first document set and the second document set.
3. The similar document retrieval method according to claim 1, wherein the number of documents in the first set of documents, the second set of documents, and the candidate set of documents remain the same.
4. The method of claim 1, wherein the term-frequency search model is a TF-IDF model.
5. The method of claim 1, wherein the document vectorization model is a Doc2vec model.
6. The similar document retrieving method according to claim 1, wherein the first preset ratio is 2/3 and the second preset ratio is 1/2.
7. A similar document retrieval apparatus, comprising:
the classification acquisition module is used for searching to obtain a first document set and the similarity of each document based on the word frequency search model, and searching to obtain a second document set and the similarity of each document based on the document vectorization model;
the similarity stacking module is used for stacking the same document similarity in the first document set and the second document set, and selecting a preset number of documents from large to small according to the similarity to obtain a candidate document set;
the search result determining module is used for determining a search result according to the candidate document set;
wherein, the determining the search result according to the candidate document set includes:
selecting the documents with the first preset proportion as a third document set according to the second document set from large to small in similarity;
and updating the similarity of the same documents in the third document set by using the similarity in the candidate document set, and selecting a second preset proportion of documents from the third document set as a retrieval result according to the similarity.
8. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the similar document retrieval method according to any one of claims 1 to 6.
9. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the similar document retrieval method according to any one of claims 1 to 6.
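The selection steps recited in claims 6 and 7 can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that the similarities of a document found by both models are combined as a plain sum, that the candidate set is drawn from the union of both result sets, and that proportions are rounded down. The names `top_k`, `fuse`, `tfidf_hits`, and `doc2vec_hits` are hypothetical, and the scores are made-up inputs rather than the output of real TF-IDF or Doc2vec models.

```python
from fractions import Fraction


def top_k(scores, k):
    """Return the k highest-scoring documents as a dict, best first (ties broken by id)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:k])


def fuse(first, second, preset_number,
         first_ratio=Fraction(2, 3), second_ratio=Fraction(1, 2)):
    """Combine two {doc_id: similarity} result sets along the lines of claim 7."""
    # Candidate set: sum the similarities of documents found by both models
    # (assumption: documents found by only one model keep their single score).
    combined = dict(first)
    for doc, sim in second.items():
        combined[doc] = combined.get(doc, 0.0) + sim
    candidates = top_k(combined, preset_number)

    # Third set: the top first_ratio of the second (vector-model) result set.
    # Rounding down with int() is an assumption; the claims do not specify it.
    third = top_k(second, int(len(second) * first_ratio))

    # Overwrite scores of third-set documents that also appear in the candidate set.
    for doc in third:
        if doc in candidates:
            third[doc] = candidates[doc]

    # Result: the top second_ratio of the updated third set.
    return top_k(third, int(len(third) * second_ratio))


# Hypothetical scores; both sets hold six documents, as in claim 3.
tfidf_hits = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6, "e": 0.5, "f": 0.4}
doc2vec_hits = {"c": 0.9, "g": 0.8, "h": 0.7, "a": 0.5, "b": 0.3, "i": 0.2}

result = fuse(tfidf_hits, doc2vec_hits, preset_number=6)
print(list(result))  # ['c', 'a']
```

In this toy input, document `a` ranks only fourth in the vector-model results, but its combined score lifts it above `g` and `h` in the final result. Without the update step the result would have been `c` and `g`.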
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010543812.6A CN111813930B (en) | 2020-06-15 | 2020-06-15 | Similar document retrieval method and device |
PCT/CN2021/078813 WO2021253873A1 (en) | 2020-06-15 | 2021-03-03 | Method and apparatus for retrieving similar document |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010543812.6A CN111813930B (en) | 2020-06-15 | 2020-06-15 | Similar document retrieval method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111813930A (en) | 2020-10-23 |
CN111813930B (en) | 2024-02-20 |
Family
ID=72845178
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010543812.6A Active CN111813930B (en) | 2020-06-15 | 2020-06-15 | Similar document retrieval method and device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111813930B (en) |
WO (1) | WO2021253873A1 (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111813930B (en) * | 2020-06-15 | 2024-02-20 | 语联网(武汉)信息技术有限公司 | Similar document retrieval method and device |
CN112632907A (en) * | 2021-01-04 | 2021-04-09 | 北京明略软件系统有限公司 | Document marking method, device and equipment |
CN113094519B (en) * | 2021-05-07 | 2023-04-14 | 超凡知识产权服务股份有限公司 | Method and device for searching based on document |
CN114780690B (en) * | 2022-06-20 | 2022-09-09 | 成都信息工程大学 | Patent text retrieval method and device based on multi-mode matrix vector representation |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105302793A (en) * | 2015-10-21 | 2016-02-03 | 南方电网科学研究院有限责任公司 | Method for automatically evaluating scientific and technical literature novelty by utilizing computer |
CN107562824A (en) * | 2017-08-21 | 2018-01-09 | 昆明理工大学 | A kind of text similarity detection method |
CN107704469A (en) * | 2016-08-08 | 2018-02-16 | 中国科学院文献情报中心 | The mapping method and device of patent data and industry data |
CN109858028A (en) * | 2019-01-30 | 2019-06-07 | 神思电子技术股份有限公司 | A kind of short text similarity calculating method based on probabilistic model |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104281583B (en) * | 2013-07-02 | 2018-01-12 | 索意互动(北京)信息技术有限公司 | Information retrieval method and device |
CN106095737A (en) * | 2016-06-07 | 2016-11-09 | 杭州凡闻科技有限公司 | Documents Similarity computational methods and similar document the whole network retrieval tracking |
CN107220307B (en) * | 2017-05-10 | 2020-09-25 | 清华大学 | Webpage searching method and device |
CN107491547B (en) * | 2017-08-28 | 2020-11-10 | 北京百度网讯科技有限公司 | Search method and device based on artificial intelligence |
CN111813930B (en) * | 2020-06-15 | 2024-02-20 | 语联网(武汉)信息技术有限公司 | Similar document retrieval method and device |
Also Published As
Publication number | Publication date |
---|---|
CN111813930A (en) | 2020-10-23 |
WO2021253873A1 (en) | 2021-12-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111444320B (en) | Text retrieval method and device, computer equipment and storage medium | |
CN111813930B (en) | Similar document retrieval method and device | |
CN109918673B (en) | Semantic arbitration method and device, electronic equipment and computer-readable storage medium | |
US8918348B2 (en) | Web-scale entity relationship extraction | |
CN105045781B (en) | Query term similarity calculation method and device and query term search method and device | |
EP3579125A1 (en) | System, computer-implemented method and computer program product for information retrieval | |
CN112667794A (en) | Intelligent question-answer matching method and system based on twin network BERT model | |
CN111078837B (en) | Intelligent question-answering information processing method, electronic equipment and computer readable storage medium | |
EP3314461A1 (en) | Learning entity and word embeddings for entity disambiguation | |
JP2005122533A (en) | Question-answering system and question-answering processing method | |
US20190340503A1 (en) | Search system for providing free-text problem-solution searching | |
CN109948140B (en) | Word vector embedding method and device | |
CN108875065B (en) | Indonesia news webpage recommendation method based on content | |
US11429792B2 (en) | Creating and interacting with data records having semantic vectors and natural language expressions produced by a machine-trained model | |
CN111159359A (en) | Document retrieval method, document retrieval device and computer-readable storage medium | |
CN109791570B (en) | Efficient and accurate named entity recognition method and device | |
WO2016015267A1 (en) | Rank aggregation based on markov model | |
JP2020091857A (en) | Classification of electronic document | |
CN116932730A (en) | Document question-answering method and related equipment based on multi-way tree and large-scale language model | |
CN117435685A (en) | Document retrieval method, document retrieval device, computer equipment, storage medium and product | |
JP7256357B2 (en) | Information processing device, control method, program | |
CN117076636A (en) | Information query method, system and equipment for intelligent customer service | |
US20220318318A1 (en) | Systems and methods for automated information retrieval | |
WO2016210203A1 (en) | Learning entity and word embeddings for entity disambiguation | |
CN109684357A (en) | Information processing method and device, storage medium, terminal | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||