CN101625680B - Document retrieval method in patent field - Google Patents

Info

Publication number
CN101625680B
Authority
CN
China
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN200810012248A
Other languages
Chinese (zh)
Other versions
CN101625680A (en)
Inventor
朱靖波
王会珍
曹菲菲
肖桐
李天宁
宋国龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China
Priority to CN200810012248A
Publication of CN101625680A
Application granted
Publication of CN101625680B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a document retrieval method for the patent field, comprising the following steps: preprocessing the query text and the patent texts; retrieving the patent texts related to the query text, computing several similarity values with different similarity calculation methods, combining these values into a new similarity, and ranking the patent texts by the new similarity; using several decision methods to map the similarity ranking of the patent texts into different relevance rankings of patent categories; integrating the resulting category relevance rankings and re-ranking them to obtain a new relevance ranking of patent categories; and selecting from the new ranking the patent category most relevant to the query text. Because the final measure of relevance between the query text and the patent texts draws on several similarity calculation methods, the method uses feature information from multiple angles, and combining several systems lets them complement one another and improves system performance.

Description

Patent field-oriented document retrieval method
Technical Field
The invention relates to a data retrieval method, in particular to a document retrieval method oriented to the patent field.
Background
With the rapid development of science and technology and the massive growth of documents recording scientific and technological achievements, patents are increasingly regarded as one of the most important means of intellectual property protection. Patent documents describe the technical solutions of the newest inventions, but the documents that describe scientific and technological achievements also include non-patent documents such as research papers and technical reports. Patents and non-patent documents are related; for example, studying the relation between research papers and patents can help predict technological trends. Studying patent documents together with non-patent scientific documents makes it possible to follow the latest technologies in each field, which avoids duplicated development and infringement and even allows the development of a whole technical industry to be analyzed; the technical status and strategy of competitors can be analyzed; and invalidity searches for patents can be carried out. Searching across patent documents and non-patent documents is a relatively new problem in the field of patent research.
Patent documents usually cite related patents or research papers, but studying the relation between non-patent and patent documents only through these citation links is very limited. Moreover, a patent database holds millions of patent documents, and working through patents purely by hand is time-consuming and laborious. How to retrieve relevant patents from a huge patent database and obtain useful patent information is a difficult problem in patent research.
Patent search and classification methods fall into two kinds: classified patent search based on a patent database, and search methods based on natural language processing technology.
Most early patent retrieval methods are based on patent databases, for example the patent with publication number CN1996290A. They mainly use the structured text information of patents to extract patent citation relations and construct a patent association graph. Then, given a patent query condition such as application number, patent number, application date, announcement date, inventor, or patentee, the matching patent is looked up in the patent association graph. This approach depends on the fixed structured text of the patent, is not intelligent enough, and does not analyze the patent content.
A method based on natural language processing analyzes the content of a patent text with natural language processing technology, acquires useful features that represent the patent from texts such as its title, abstract, description, and claims, assigns weights to the features, and searches for relevant patent texts. For example, the article "Some Issues in the Automatic Classification of U.S. Patents" (by Leah S. Larkey, an invited report at the AAAI-98 workshop on learning for text classification) introduces a method for classifying patents with natural language processing technology. The article "POSTECH at NTCIR-5 Patent Retrieval: Smoothing Experiments in a Language Modeling Approach to Patent Retrieval" (by In-Su Kang, Seung-Hoon Na, Jungi Kim, and Jong-Hyeok Lee, published in Proceedings of the NTCIR-5 Workshop Meeting, December 6-9, 2005, Tokyo, Japan) implements patent retrieval with natural language processing techniques.
However, the existing methods are limited to keyword search and only address search among patent texts. They do not consider the relations among non-patent texts or between non-patent texts and patent categories, and they cannot realize intelligent full-text retrieval across non-patent texts and patent texts.
Disclosure of Invention
Document retrieval in the patent field in the prior art does not consider the relation between non-patent texts and patent texts, the relations among non-patent texts, or the relation between non-patent texts and patent categories, and cannot realize intelligent full-text retrieval across non-patent texts and patent texts. In view of these defects, the technical problem to be solved by the invention is to provide a patent retrieval method that represents patent texts as feature vectors, calculates the similarity between a non-patent text and the related patent texts, and retrieves the most relevant patent texts.
To solve this technical problem, the patent retrieval method of the invention, based on natural language processing technology, adopts a technical scheme comprising the following steps:
preprocessing the query text and the patent text;
retrieving patent texts related to the query text, obtaining values of different similarities by adopting a plurality of methods for calculating different similarities, combining the values of different similarities, recalculating the similarities, and sequencing the patent texts according to the new values of the similarities;
adopting a plurality of different decision methods to map the similarity ranking of the patent texts into different rankings of the relevance of the patent categories; integrating the relevance sorting results of a plurality of different patent categories, and re-sorting to obtain new relevance sorting of the patent categories;
and selecting the patent category most relevant to the query text from the relevance ranking of the new patent categories.
The text preprocessing extracts feature-word candidates from the text, gathers statistics on them, selects features with a feature selection method, and converts the text into a vector representation. Specifically: tags that are not part of the patent text are removed and the patent text information is extracted, yielding the patent number, IPC category label, patent title, abstract, claims, and description; for English text, words in all capitals are retained; words containing digits are removed; stop words are removed; English words are reduced to their base forms, giving a feature candidate word list; the candidate list is counted to obtain term frequency, document frequency, and word-category frequency information; and a feature word list is selected from the candidates, the feature weight of each feature word in the list is calculated, and the patent text and the query text are converted into computable vectors according to the feature words and their weights.
The different similarity calculation methods each yield a similarity value between the query text and the patent text, and these values are integrated with a log-linear model, calculated as follows:

$$\mathrm{Sim}(\vec{D}_1, \vec{D}_2) = \frac{\exp\left(\vec{\theta} \cdot \vec{S}(\vec{D}_1, \vec{D}_2)\right)}{\sum_{k=0}^{n} \exp\left(\vec{\theta} \cdot \vec{S}(\vec{D}_1, \vec{d}_k)\right)}$$

where $\vec{S}(\vec{D}_1, \vec{D}_2)$ is the feature vector formed from the similarity values between the query text $\vec{D}_1$ and the patent text $\vec{D}_2$ obtained by the different similarity calculation methods, $\vec{\theta}$ is the weight vector for those similarity values, $n$ is the total number of patent texts related to the query text, and $\vec{d}_k$ is the $k$-th related patent text vector.
The decision methods comprise a similarity summation method weighted by patent category, a similarity summation method weighted by position in the patent text similarity ranking, and a plain patent text similarity summation method. The category-weighted similarity summation is calculated as:

$$\mathrm{score}(x) = \sum_{i=1}^{k} (k_r)^{c_i} \times \mathrm{ICF} \times \mathrm{score}_{d_i} \times \mathrm{role}(x, i)$$

$$\mathrm{ICF} = \log\left(\frac{N + 0.5}{C_x + 0.5}\right)$$

where $k_r$ is a penalty factor constant; $k$ is the number of candidate patent texts in the patent text similarity ranking result; $c_i$ is the position, in the ranking of patent categories by similarity, of the category to which candidate patent text $i$ belongs; $\mathrm{score}_{d_i}$ is the similarity value between the query text and the patent text $d_i$; ICF is the inverse category frequency, in which $C_x$ is the number of texts in category $x$ and $N$ is the total number of texts; $\mathrm{score}(x)$ is the relevance value between the query text and patent category $x$; and $\mathrm{role}(x, i)$ determines whether the patent text $d_i$ belongs to patent category $x$.
The similarity summation weighted by position in the patent text similarity ranking is calculated as:

$$\mathrm{score}(x) = \sum_{i=1}^{k} (k_t)^{i} \times \mathrm{score}_{d_i} \times \mathrm{role}(x, i)$$
Figure S2008100122484D00037
The relevance ranking results of the different patent categories may be integrated by taking the category relevance rankings produced by each pairing of similarity value and category decision method as features of the patent category positions, and combining the rankings with a Rank-SVM model.
Alternatively, the integration adds up the position values at which each category appears in the different patent category relevance rankings, and uses the sum as the new relevance value of that category.
The invention has the following beneficial effects and advantages:
1. The method uses natural language processing technology and takes several similarity calculation methods together as the final measure of relevance between the query text and the patent texts, making full use of feature information from multiple angles. It also combines several systems so that they complement one another, which improves system performance.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a text pre-processing flow diagram;
FIG. 3 is a flow chart of similarity calculation between query text and patent text;
FIG. 4 is a flowchart of the query text to patent category correlation calculation;
Detailed Description
The method of the invention is further described below with reference to the embodiment and the drawings.
As shown in FIG. 1, the document retrieval method for the patent field includes the following steps:
preprocessing the query text and the patent text; retrieving patent texts related to the query text, computing several similarity values with different similarity calculation methods, combining these values into a new similarity, and ranking the patent texts by the new similarity; using several different decision methods to map the similarity ranking of the patent texts into different relevance rankings of patent categories, integrating these category relevance rankings, and re-ranking to obtain a new relevance ranking of patent categories; and selecting from the new ranking the patent category most relevant to the query text.
As shown in FIG. 2, preprocessing the query text and the patent text includes the following steps:
a) Remove tags that are not part of the patent text, extract the patent text information, and obtain the patent number, IPC category label, patent title, abstract, claims, and description; remove non-letter or non-Chinese characters inside the words of the extracted information, such as '-', '(', and ')'; for English text, retain words in all capitals; remove words containing digits; remove stop words, such as "mail" and "said" in English patents and "step" and "feature" in Chinese patents, as well as prepositions, adverbs, articles, and the like; reduce English words to their base forms to obtain the feature candidate word list;
b) Count the feature candidate word list to obtain term frequency, document frequency, and word-category frequency information;
c) Select a feature word list from the feature candidates, calculate the feature weight of each feature word in the list, and convert the patent text and the query text into computable vectors according to the feature words and their weights;
d) Using the feature words of the patents as index terms, build and store an inverted index over the patent documents and patent text vectors.
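Steps a) to d) can be sketched roughly as follows. This is a minimal illustration for English text only, not the patented implementation: the names `STOP_WORDS`, `tokenize`, `build_inverted_index`, and `candidates` are hypothetical, the stop-word list is a placeholder, raw term frequency stands in for the feature weight (which the text does not specify), and base-form reduction and feature selection are omitted.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; the description only names a few examples.
STOP_WORDS = {"the", "a", "an", "of", "in", "said", "and"}

def tokenize(text):
    """Turn raw text into candidate feature words (step a, simplified)."""
    tokens = []
    for raw in text.split():
        if any(ch.isdigit() for ch in raw):
            continue  # drop words containing digits
        word = re.sub(r"[^A-Za-z]", "", raw)  # strip non-letter characters
        if not word:
            continue
        if not word.isupper():
            word = word.lower()  # all-capital words are kept unchanged
        if word.lower() in STOP_WORDS:
            continue  # drop stop words
        tokens.append(word)
    return tokens

def term_frequencies(tokens):
    """Count occurrences of each feature word (part of step b)."""
    tf = defaultdict(int)
    for t in tokens:
        tf[t] += 1
    return dict(tf)

def build_inverted_index(docs):
    """docs maps doc_id -> text. Returns (index, vectors) as in steps c and d."""
    index = defaultdict(set)   # feature word -> ids of documents containing it
    vectors = {}               # doc_id -> {feature word: term frequency}
    for doc_id, text in docs.items():
        vec = term_frequencies(tokenize(text))
        vectors[doc_id] = vec
        for term in vec:
            index[term].add(doc_id)
    return index, vectors

def candidates(query, index):
    """Ids of patent texts sharing at least one feature word with the query."""
    found = set()
    for term in set(tokenize(query)):
        found |= index.get(term, set())
    return found
```

Looking up the query's feature words in the index yields the set of patent texts that share at least one feature word with it, so later similarity calculations only touch those candidates rather than the whole library.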
As shown in FIG. 3, the similarity calculation with several different methods includes the following steps:
Patent texts that share feature words with the query text are found in the patent text library to form a related patent text set.
The similarity between each patent in the related patent text set and the query text is then calculated. This embodiment uses several similarity calculation methods, as follows:
1. Vector cosine method
The query text $\vec{D}_1$ and the patent text $\vec{D}_2$ are represented in the vector space model, and the cosine of the two vectors is calculated as:

$$\cos(\vec{D}_1, \vec{D}_2) = \frac{\vec{D}_1 \cdot \vec{D}_2}{\|\vec{D}_1\| \cdot \|\vec{D}_2\|}$$
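As an illustration of the cosine formula, the sketch below works on sparse vectors stored as dictionaries from feature word to weight. The function name `cosine` and the choice to return 0 for an empty vector are assumptions, not part of the patent text.

```python
import math

def cosine(d1, d2):
    """Cosine of two sparse vectors given as {feature_word: weight} dicts."""
    # Only feature words present in both vectors contribute to the dot product.
    dot = sum(w * d2[t] for t, w in d1.items() if t in d2)
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumption: an empty vector is treated as dissimilar
    return dot / (norm1 * norm2)
```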
2. BM25 method
BM25 has many variants; in this embodiment, BM25 is calculated as follows:

$$\mathrm{score}(\vec{D}_1, \vec{D}_2) = \sum_{i=1}^{n} \mathrm{IDF}(t_i) \cdot \frac{f(t_i, \vec{D}_2) \cdot (k_1 + 1)}{f(t_i, \vec{D}_2) + k_1 \cdot \left(1 - b + b \cdot \frac{|\vec{D}_2|}{\mathrm{avgdl}}\right)}$$

where $n$ is the number of feature words in the query text $\vec{D}_1$; $f(t_i, \vec{D}_2)$ is the number of occurrences of the feature word $t_i$ in the patent text $\vec{D}_2$; $|\vec{D}_2|$ is the text length of the patent text $\vec{D}_2$; avgdl is the average text length in the set of patent texts related to the query text; $k_1$ and $b$ are free parameters, set in this embodiment to $k_1 = 2.0$ and $b = 0.75$; and $\mathrm{IDF}(t_i)$ is the inverse document frequency of the search term $t_i$, calculated as:

$$\mathrm{IDF}(t_i) = \log \frac{N - n(t_i) + 0.5}{n(t_i) + 0.5}$$

where $N$ is the total number of documents in the entire data set and $n(t_i)$ is the number of documents containing the search term $t_i$.
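The BM25 variant above can be sketched as follows, with the embodiment's values $k_1 = 2.0$ and $b = 0.75$ as defaults. The function names and the data layout (term-frequency and document-frequency dictionaries) are illustrative assumptions, and the natural logarithm is assumed since the text does not state a base.

```python
import math

def idf(n_t, N):
    """Inverse document frequency: log((N - n(t) + 0.5) / (n(t) + 0.5))."""
    return math.log((N - n_t + 0.5) / (n_t + 0.5))

def bm25(query_terms, doc_tf, doc_len, avgdl, doc_freq, N, k1=2.0, b=0.75):
    """BM25 score of one patent text for the query's feature words.

    query_terms: feature words of the query text
    doc_tf:      {term: occurrences in the patent text}
    doc_len:     length of the patent text
    avgdl:       average text length over the related patent text set
    doc_freq:    {term: number of documents containing the term}
    N:           total number of documents in the data set
    """
    score = 0.0
    for t in query_terms:
        f = doc_tf.get(t, 0)
        if f == 0:
            continue  # a term absent from the document contributes nothing
        denom = f + k1 * (1 - b + b * doc_len / avgdl)
        score += idf(doc_freq[t], N) * f * (k1 + 1) / denom
    return score
```

With this form of IDF, a term that occurs in more than half of the documents receives a negative weight, which is worth keeping in mind when the scores are later normalized.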
3. SMART method
The SMART algorithm is calculated as follows:

$$\mathrm{Sim}_{\mathrm{SMART}} = \sum_{t \in T} \left(\vec{D}_1 \times \vec{D}_2\right)$$

The weight $w_i$ of each feature dimension in the query text vector $\vec{D}_1$ is calculated as:

$$w_i = \left(1 + \log(tf_i)\right) \times \log \frac{N + 1}{n}$$

The weight $w_i$ of each feature dimension in the patent text vector $\vec{D}_2$ is calculated as:

$$w_i = \frac{1 + \log(tf_i)}{1 + \log(\mathrm{avtf})} \times \frac{1}{0.8 + 0.2 \, \frac{\mathrm{utf}}{\mathrm{pivot}}}$$

where $T$ is the set of feature words that co-occur in the query text $\vec{D}_1$ and the patent text $\vec{D}_2$; $tf_i$ is the term frequency of the $i$-th feature word in the text vector; $N$ is the number of texts in the whole patent text set, and $n$ is the number of patent texts containing the $i$-th feature; avtf is the average per-document term frequency of the feature words in the related patent text set; utf is the number of feature words in the patent text vector $\vec{D}_2$; and pivot is the average number of feature words per document in the entire patent text collection.
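A rough sketch of the SMART weighting follows. It reads the sum over $T$ as the sum, over co-occurring feature words, of the product of the query-side weight and the patent-side weight; that reading, the function names, and the use of the natural logarithm are assumptions. The value of avtf is passed in rather than computed, because the text does not pin down exactly how it is averaged.

```python
import math

def query_weight(tf, N, n):
    """Query-side weight: (1 + log tf) * log((N + 1) / n)."""
    return (1 + math.log(tf)) * math.log((N + 1) / n)

def patent_weight(tf, avtf, utf, pivot):
    """Patent-side weight with pivoted normalization (slope 0.2)."""
    return (1 + math.log(tf)) / (1 + math.log(avtf)) / (0.8 + 0.2 * utf / pivot)

def smart_similarity(query_tf, patent_tf, N, doc_freq, avtf, pivot):
    """Sum of weight products over feature words shared by both texts.

    query_tf, patent_tf: {feature word: term frequency}
    doc_freq:            {feature word: number of patent texts containing it}
    """
    utf = len(patent_tf)  # number of feature words in the patent text vector
    shared = set(query_tf) & set(patent_tf)
    return sum(
        query_weight(query_tf[t], N, doc_freq[t])
        * patent_weight(patent_tf[t], avtf, utf, pivot)
        for t in shared
    )
```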
The three methods each yield a similarity value between the query text and each patent text.
The similarity values obtained by the different methods are normalized so that each lies between 0 and 1.
The logarithm of each normalized similarity value is then taken.
The log-transformed similarity values serve as the features of the log-linear model, which is calculated as:

$$\mathrm{Sim}(\vec{D}_1, \vec{D}_2) = \frac{\exp\left(\vec{\theta} \cdot \vec{S}(\vec{D}_1, \vec{D}_2)\right)}{\sum_{k=0}^{n} \exp\left(\vec{\theta} \cdot \vec{S}(\vec{D}_1, \vec{d}_k)\right)}$$

where $\vec{S}(\vec{D}_1, \vec{D}_2)$ is the feature vector formed from the similarity values between the query text $\vec{D}_1$ and the patent text $\vec{D}_2$ obtained by the different similarity calculation methods, $\vec{\theta}$ is the weight vector for those similarity values, $n$ is the total number of patent texts related to the query text, and $\vec{d}_k$ is the $k$-th related patent text vector.
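The normalize, log, and combine sequence can be sketched as below. Several details are assumptions because the text leaves them open: normalization divides by the per-method maximum, a small floor `EPS` guards against taking the logarithm of zero, and the weights `theta` are supplied by the caller (the text does not say how they are trained). The function names are hypothetical.

```python
import math

EPS = 1e-9  # assumption: floor used to avoid log(0); not specified in the text

def normalize(scores):
    """Scale one method's scores into [0, 1] by dividing by the maximum (assumed scheme)."""
    top = max(scores)
    if top <= 0:
        return [0.0 for _ in scores]
    return [max(s, 0.0) / top for s in scores]

def log_linear(score_table, theta):
    """Combine several similarity methods with a log-linear model.

    score_table[m][k]: similarity of candidate k under method m
    theta[m]:          weight of method m
    Returns the combined similarity of every candidate; the values sum to 1.
    """
    logs = [[math.log(max(v, EPS)) for v in normalize(row)] for row in score_table]
    n = len(score_table[0])
    # theta . S(D1, d_k) for every candidate k
    linear = [sum(theta[m] * logs[m][k] for m in range(len(theta))) for k in range(n)]
    shift = max(linear)  # subtracting the max leaves the ratio unchanged but avoids overflow
    expd = [math.exp(v - shift) for v in linear]
    total = sum(expd)
    return [e / total for e in expd]
```

Since the denominator is the same for every candidate of a given query, it does not change the ranking; it only rescales the combined values so that they sum to 1 over the candidates.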
As shown in FIG. 4, several different patent category decision methods are applied to the patent text similarity ranking results to compute relevance rankings between the query text and the patent categories. The patent category decision methods adopted in this embodiment are calculated as follows:
1. The similarity summation method is calculated as:

$$\mathrm{score}(x) = \sum_{i=1}^{k} \mathrm{score}_{d_i} \times \mathrm{role}(x, i)$$

where $x$ is an IPC category, $k$ is the number of candidate patent texts in the patent text similarity ranking result, and $\mathrm{score}_{d_i}$ is the similarity value of the $i$-th candidate patent text. $\mathrm{role}(x, i)$ determines whether the patent text $d_i$ belongs to patent category $x$.
2. The patent category weight summation method is calculated as:

$$\mathrm{score}(x) = \sum_{i=1}^{k} (k_r)^{c_i} \times \mathrm{ICF} \times \mathrm{score}_{d_i} \times \mathrm{role}(x, i)$$

$$\mathrm{ICF} = \log\left(\frac{N + 0.5}{C_x + 0.5}\right)$$
Figure S2008100122484D000611
where $k_r$ is a penalty factor constant; $k$ is the number of candidate patent texts in the patent text similarity ranking result; $c_i$ is the position, in the ranking of patent categories by similarity, of the category to which candidate patent text $i$ belongs; $\mathrm{score}_{d_i}$ is the similarity value between the query text and the patent text $d_i$; ICF is the inverse category frequency, in which $C_x$ is the number of texts under category $x$ and $N$ is the total number of texts; and $\mathrm{score}(x)$ is the relevance value between the query text and patent category $x$. $\mathrm{role}(x, i)$ determines whether the patent text $d_i$ belongs to patent category $x$.
3. The patent text similarity position weight summation method is calculated as:

$$\mathrm{score}(x) = \sum_{i=1}^{k} (k_t)^{i} \times \mathrm{score}_{d_i} \times \mathrm{role}(x, i)$$

where $k_t$ is a penalty factor constant, $k$ is the number of candidate patent texts in the patent text similarity ranking result, and $\mathrm{score}_{d_i}$ is the similarity value between the query text and the patent text $d_i$. $\mathrm{role}(x, i)$ determines whether the patent text $d_i$ belongs to patent category $x$.
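The three category decision methods can be sketched side by side as follows. This is an illustration under stated assumptions: `role` is modelled as a 0/1 indicator, the penalty constants default to a hypothetical 0.9 because the text gives no values, and the category positions $c_i$ are passed in as `class_pos` rather than derived, since the text does not spell out how they are obtained. All function names are hypothetical.

```python
import math

def role(category, doc_categories):
    """Indicator: 1 if the document carries the category, else 0 (assumed form)."""
    return 1 if category in doc_categories else 0

def similarity_sum(x, ranked):
    """Method 1. ranked: list of (similarity, categories) pairs, best first."""
    return sum(s * role(x, cats) for s, cats in ranked)

def icf(N, C_x):
    """Inverse category frequency: log((N + 0.5) / (C_x + 0.5))."""
    return math.log((N + 0.5) / (C_x + 0.5))

def category_weight_sum(x, ranked, class_pos, N, C_x, k_r=0.9):
    """Method 2. class_pos[i] holds c_i for the i-th candidate (supplied by the caller)."""
    return sum(
        (k_r ** class_pos[i]) * icf(N, C_x) * s * role(x, cats)
        for i, (s, cats) in enumerate(ranked)
    )

def position_weight_sum(x, ranked, k_t=0.9):
    """Method 3. The rank position i starts at 1, as in the formula."""
    return sum(
        (k_t ** i) * s * role(x, cats)
        for i, (s, cats) in enumerate(ranked, start=1)
    )
```

With a penalty constant below 1, methods 2 and 3 discount candidates that sit lower in the ranking, so a category supported by a few top-ranked patents can outscore one supported by many low-ranked patents.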
The patent category relevance rankings obtained from methods 1-3 are combined, and the categories are re-ranked. Many combination methods are possible; this embodiment uses two:
The category relevance rankings obtained from each pairing of similarity value and category decision method are taken as features of the patent category positions, and the rankings are combined with a Rank-SVM model.
The position values at which each category appears in the different patent category relevance rankings are added up, and the sum is used as the new relevance value of that category.
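The second combination method, summing positions across rankings, can be sketched as below (the Rank-SVM variant is not shown). Two points are assumptions rather than statements from the text: a smaller position sum is taken to mean a more relevant category, and a category missing from a ranking is placed just past the end of that ranking. The function name is hypothetical.

```python
def combine_by_position(rankings):
    """Merge several category rankings by summing each category's positions.

    rankings: list of category lists, each ordered most relevant first.
    Returns (merged_order, totals), where merged_order lists categories by
    ascending position sum and totals maps each category to its sum.
    """
    categories = sorted({c for r in rankings for c in r})
    totals = {}
    for c in categories:
        total = 0
        for r in rankings:
            # assumption: a category absent from a ranking is placed just past its end
            total += (r.index(c) + 1) if c in r else (len(r) + 1)
        totals[c] = total
    # Ties keep alphabetical order because sorted() is stable.
    merged = sorted(categories, key=lambda c: totals[c])
    return merged, totals
```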
The above steps yield the similarity values between the query text and the patent texts and, from them, the ranking of patent categories, from which the patent category most relevant to the query text is selected.
The method of the invention is not limited to the examples described in the specific embodiments; other embodiments that those skilled in the art derive from the technical solution of the invention also fall within the scope of its technical innovation.

Claims (4)

1. A patent-field-oriented document retrieval method, comprising the following steps:
preprocessing the query text and the patent texts;
retrieving patent texts related to the query text, obtaining different similarity values by applying a plurality of different similarity calculation methods, combining the different similarity values to recalculate the similarity, and ranking the patent texts according to the new similarity values;
applying a plurality of different decision methods to map the similarity ranking of the patent texts into different patent category relevance rankings; integrating the multiple different patent category relevance rankings and re-ranking them to obtain a new patent category relevance ranking;
selecting, from the new patent category relevance ranking, the patent category most relevant to the query text;
wherein the similarity values between the query text and the patent texts are obtained by the multiple different similarity calculation methods, and the multiple different similarity values are integrated using a log-linear model according to the following formula:
$$\mathrm{Sim}(\vec{D}_1,\vec{D}_2)=\frac{\exp\left(\vec{\theta}\cdot\vec{S}(\vec{D}_1,\vec{D}_2)\right)}{\sum_{k=0}^{n}\exp\left(\vec{\theta}\cdot\vec{S}(\vec{D}_1,\vec{d}_k)\right)}$$
wherein $\vec{S}(\vec{D}_1,\vec{D}_2)$ is the vector whose features are the similarity values between the query text $\vec{D}_1$ and the patent text $\vec{D}_2$ obtained by the different similarity calculation methods, $\vec{\theta}$ is the weight vector for those similarity values, $n$ is the total number of patent texts related to the query text, and $\vec{d}_k$ denotes the k-th related patent text vector;
the multiple different decision methods comprise a similarity summation method weighted by patent category, a similarity summation method weighted by patent text similarity ranking position, and a patent text similarity summation method, wherein the similarity summation weighted by patent category is calculated as:
$$\mathrm{score}(x)=\sum_{i=1}^{k}(k_r)^{c_i}\times \mathrm{ICF}\times \mathrm{score}_{d_i}\times \mathrm{role}(x,i)$$
$$\mathrm{ICF}=\log\left(\frac{N+0.5}{C_x+0.5}\right)$$
wherein $k_r$ is a penalty factor constant; $k$ is the number of candidate patent texts in the patent text similarity ranking result; $c_i$ is the position, obtained by sorting according to similarity, of the patent category to which candidate patent text $i$ belongs; $\mathrm{score}_{d_i}$ is the similarity value between the query text and patent text $d_i$; ICF is the inverse category frequency, in which $C_x$ is the number of texts under category $x$ and $N$ is the total number of texts; $\mathrm{score}(x)$ is the relevance value between the query text and patent category $x$; and $\mathrm{role}(x,i)$ indicates whether patent text $d_i$ belongs to patent category $x$;
the similarity summation weighted by patent text similarity ranking position is calculated as:
$$\mathrm{score}(x)=\sum_{i=1}^{k}(k_t)^{i}\times \mathrm{score}_{d_i}\times \mathrm{role}(x,i)$$
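The scoring steps of claim 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the normalization runs over the supplied candidate list, role(x, i) is taken to be 1 when text i belongs to category x and 0 otherwise, each text is assumed to carry a single category label, k_t is treated as a constant analogous to k_r, and c_i is passed in by the caller because the claim does not spell out how it is computed. All function names and sample values are hypothetical.

```python
import math
from collections import defaultdict


def log_linear_similarity(feature_rows, theta):
    """Combine several similarity values per candidate with a log-linear model.

    feature_rows[k] holds the similarity values S(D1, d_k) for candidate k.
    """
    logits = [sum(t * s for t, s in zip(theta, row)) for row in feature_rows]
    peak = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def icf(n_total, n_in_category):
    """Inverse category frequency: log((N + 0.5) / (C_x + 0.5))."""
    return math.log((n_total + 0.5) / (n_in_category + 0.5))


def category_weight_scores(ranked, k_r, n_total, category_sizes):
    """ranked: (category, similarity, c_i) tuples in descending similarity order."""
    scores = defaultdict(float)
    for category, sim, c_i in ranked:
        # role(x, i) = 1 only for the text's own category (assumption).
        scores[category] += (k_r ** c_i) * icf(n_total, category_sizes[category]) * sim
    return dict(scores)


def position_weight_scores(ranked, k_t):
    """Weight each text's similarity by k_t raised to its 1-based rank position."""
    scores = defaultdict(float)
    for i, (category, sim, _c_i) in enumerate(ranked, start=1):
        scores[category] += (k_t ** i) * sim
    return dict(scores)


# Hypothetical data: three candidate texts, two similarity measures each.
feature_rows = [[0.9, 0.8], [0.4, 0.5], [0.7, 0.6]]
theta = [1.0, 1.0]
categories = ["A", "B", "A"]
c_of = {"A": 1, "B": 2}  # hypothetical c_i value for each category

sims = log_linear_similarity(feature_rows, theta)
order = sorted(range(len(sims)), key=lambda k: sims[k], reverse=True)
ranked = [(categories[k], sims[k], c_of[categories[k]]) for k in order]

cat_scores = category_weight_scores(
    ranked, k_r=0.5, n_total=100, category_sizes={"A": 10, "B": 40}
)
pos_scores = position_weight_scores(ranked, k_t=0.5)
```

Each decision method yields one score per category; sorting categories by these scores gives the separate category rankings that the later integration step combines.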
2. The patent-field-oriented document retrieval method as recited in claim 1, wherein preprocessing the text comprises obtaining feature-word candidates, collecting statistics on the feature words, selecting features using a feature selection method, and converting the text into a vector representation, specifically comprising the following steps:
removing non-patent-text tags from the patent texts and extracting the patent text information, namely the patent number, patent IPC category label, patent title, abstract, claims, and description; for English texts, retaining words written entirely in capitals; removing words that contain digits; removing stop words; and lemmatizing the English text to obtain a feature candidate word list;
computing statistics over the feature candidate word list to obtain term frequency, document frequency, and word category frequency information;
and selecting a feature word list from the feature candidate words, calculating the feature weight of each feature word in the list, and converting the patent text and the query text into computable vectors according to the feature words and their feature weights.
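The preprocessing steps of claim 2 can be sketched in Python as shown below. The sketch makes several assumptions that the claim does not state: the stop-word list is a tiny illustrative one, all-capital words are kept as written while other words are lowercased, lemmatization is omitted because it needs an external library, feature selection is a simple document-frequency threshold, and the feature weight is a TF-IDF variant. All names are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; a real list would be larger).
STOP_WORDS = {"a", "an", "the", "of", "and", "for", "in", "is", "to"}


def candidate_words(text):
    """Tokenize and filter a text into feature candidate words."""
    kept = []
    for tok in re.findall(r"[A-Za-z0-9]+", text):
        if any(ch.isdigit() for ch in tok):
            continue  # remove words that contain digits
        # Keep all-capital words (e.g. acronyms) as written; lowercase the rest.
        word = tok if tok.isupper() and len(tok) > 1 else tok.lower()
        if word.lower() in STOP_WORDS:
            continue  # remove stop words
        kept.append(word)
    return kept


def collect_statistics(docs):
    """docs: list of (category, text). Returns term counts per document,
    document frequency, and the number of categories each word occurs in."""
    term_counts, doc_freq, cat_sets = [], Counter(), {}
    for category, text in docs:
        counts = Counter(candidate_words(text))
        term_counts.append(counts)
        for word in counts:
            doc_freq[word] += 1
            cat_sets.setdefault(word, set()).add(category)
    cat_freq = {word: len(cats) for word, cats in cat_sets.items()}
    return term_counts, doc_freq, cat_freq


def to_vector(counts, vocabulary, doc_freq, n_docs):
    """TF-IDF style weights (assumption: the claim names no weighting scheme)."""
    return [
        counts[w] * math.log((n_docs + 1) / (doc_freq[w] + 1)) for w in vocabulary
    ]


docs = [
    ("G06F", "A method for retrieval of patent texts using SVM ranking"),
    ("H04L", "The network method of claim 2 uses packet ranking"),
]
term_counts, doc_freq, cat_freq = collect_statistics(docs)
# Feature selection by minimum document frequency (assumption).
vocabulary = sorted(w for w in doc_freq if doc_freq[w] >= 1)
vectors = [to_vector(c, vocabulary, doc_freq, len(docs)) for c in term_counts]
```

A query text would go through the same `candidate_words` and `to_vector` calls with the same vocabulary, so that query and patent vectors share one feature space.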
3. The patent-field-oriented document retrieval method as claimed in claim 1, wherein integrating the multiple different patent category relevance ranking results comprises taking the patent category relevance rankings obtained from the multiple different similarity values and the multiple different category decision methods as features of each patent category's position, and combining them using a ranking support vector machine model.
4. The patent-field-oriented document retrieval method as claimed in claim 1, wherein integrating the multiple different patent category relevance ranking results comprises adding together the position values at which each category appears in the multiple different patent category relevance rankings to calculate a new relevance value for the patent category.
CN200810012248A 2008-07-09 2008-07-09 Document retrieval method in patent field Expired - Fee Related CN101625680B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200810012248A CN101625680B (en) 2008-07-09 2008-07-09 Document retrieval method in patent field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200810012248A CN101625680B (en) 2008-07-09 2008-07-09 Document retrieval method in patent field

Publications (2)

Publication Number Publication Date
CN101625680A CN101625680A (en) 2010-01-13
CN101625680B true CN101625680B (en) 2012-08-29

Family

ID=41521531

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200810012248A Expired - Fee Related CN101625680B (en) 2008-07-09 2008-07-09 Document retrieval method in patent field

Country Status (1)

Country Link
CN (1) CN101625680B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9110971B2 (en) * 2010-02-03 2015-08-18 Thomson Reuters Global Resources Method and system for ranking intellectual property documents using claim analysis
US9189563B2 (en) 2011-11-02 2015-11-17 Microsoft Technology Licensing, Llc Inheritance of rules across hierarchical levels
US9558274B2 (en) * 2011-11-02 2017-01-31 Microsoft Technology Licensing, Llc Routing query results
CN102768679B (en) * 2012-06-25 2015-04-22 深圳市汉络计算机技术有限公司 Searching method and searching system
CN103577462B (en) * 2012-08-02 2018-10-16 北京百度网讯科技有限公司 Document classification method and device
CN103455609B (en) * 2013-09-05 2017-06-16 江苏大学 Patent document similarity detection method based on kernel function Luke cores
CN104778276A (en) * 2015-04-29 2015-07-15 北京航空航天大学 Multi-index combining and sequencing algorithm based on improved TF-IDF (term frequency-inverse document frequency)
US10073890B1 (en) 2015-08-03 2018-09-11 Marca Research & Development International, Llc Systems and methods for patent reference comparison in a combined semantical-probabilistic algorithm
US10621499B1 (en) 2015-08-03 2020-04-14 Marca Research & Development International, Llc Systems and methods for semantic understanding of digital information
CN107193814B (en) * 2016-03-14 2020-07-31 北京京东尚科信息技术有限公司 Method and device for realizing automatic book sorting in digital reading
US10540439B2 (en) 2016-04-15 2020-01-21 Marca Research & Development International, Llc Systems and methods for identifying evidentiary information
CN107153689A (en) * 2017-04-29 2017-09-12 安徽富驰信息技术有限公司 Case search method based on topic similarity
CN108090047B (en) * 2018-01-10 2022-05-24 华南师范大学 Text similarity determination method and equipment
CN110633407B (en) 2018-06-20 2022-05-24 百度在线网络技术(北京)有限公司 Information retrieval method, device, equipment and computer readable medium
CN109726401B (en) * 2019-01-03 2022-09-23 中国联合网络通信集团有限公司 Patent combination generation method and system
CN109960757A (en) * 2019-02-27 2019-07-02 北京搜狗科技发展有限公司 Web search method and device
CN110334269B (en) * 2019-07-11 2021-05-07 中国船舶工业综合技术经济研究院 Information retrieval method and system
CN110516062B (en) * 2019-08-26 2022-11-04 腾讯科技(深圳)有限公司 Method and device for searching and processing document

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1758244A (en) * 2004-04-30 2006-04-12 微软公司 Method and system for ranking documents of a search result to improve diversity and information richness
CN101030217A (en) * 2007-03-22 2007-09-05 华中科技大学 Method for indexing and acquiring semantic net information

Also Published As

Publication number Publication date
CN101625680A (en) 2010-01-13

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120829
