CN101625680B

CN101625680B - Document retrieval method in patent field

Info

Publication number: CN101625680B
Application number: CN200810012248A
Authority: CN
Inventors: 朱靖波; 王会珍; 曹菲菲; 肖桐; 李天宁; 宋国龙
Original assignee: Northeastern University China
Current assignee: Northeastern University China
Priority date: 2008-07-09
Filing date: 2008-07-09
Publication date: 2012-08-29
Anticipated expiration: 2028-07-09
Also published as: CN101625680A

Abstract

The invention relates to a document retrieval method for the patent field, comprising the following steps: preprocessing the query text and the patent texts; retrieving the patent texts related to the query text, computing several similarity values with different similarity calculation methods, combining these values into a new similarity, and ranking the patent texts by the new similarity; applying several decision methods to map the similarity ranking of the patent texts into different relevance rankings of patent categories; integrating the resulting category relevance rankings and re-ranking them to obtain a new category relevance ranking; and selecting from the new ranking the patent category most relevant to the query text. The method uses several similarity calculation methods to weigh the relevance between the query text and the patent texts, exploits feature information from multiple angles, and combines several systems so that they complement one another and improve system performance.

Description

A Document Retrieval Method Oriented to the Patent Field

Technical Field

The invention relates to an information retrieval method, in particular to a document retrieval method oriented to the patent field.

Background Art

With the rapid development of science and technology, the literature recording scientific and technological achievements has grown substantially, and patents, as one of the most important means of protecting intellectual property, are receiving more and more attention. Patent texts record the technical solutions of the newest inventions. However, besides patents, scientific and technological achievements are also recorded in non-patent texts such as research papers and technical reports. Patents and non-patent texts are related; for example, studying the relationship between research papers and patents can help predict trends in technological development. Studying patent literature together with non-patent scientific literature makes it possible to understand the latest technology in each field, thereby avoiding duplicated development and infringement, and even to analyze the development of an entire technical industry; to analyze the research and development status and strategy of competitors; and to carry out invalidity searches for patents. Retrieval across patent and non-patent literature is a relatively new topic in patent research.

Patent texts usually cite related patents or research papers, but studying the relationship between non-patent literature and patent texts purely through these citation links is very limited. Moreover, patent databases hold several million patent documents, so handling patents purely by hand is time-consuming and laborious. How to retrieve relevant patents from a huge patent database and obtain useful patent information is a difficult problem in patent research.

There are currently two kinds of patent retrieval and classification methods: one retrieves already classified patents on the basis of a patent database, and the other is based on natural language processing technology.

Most early patent retrieval methods were database-based. For example, the patent with publication number CN1996290A mainly uses the structured text information of patents to extract patent citation relationships and build a patent association graph. Patents are then searched for in the patent association graph according to query conditions such as application number, patent number, application date, announcement date, inventor, and patentee. This method depends on the fixed structured text of the patent itself, is not intelligent enough, and does not analyze the content of the patent.

Methods based on natural language processing analyze the content of patent texts: useful features that characterize a patent are obtained from its title, abstract, description, and claims, weights are assigned to the features, and relevant patent texts are retrieved. For example, the article "Some Issues in the Automatic Classification of U.S. Patents" (by Leah S. Larkey, an invited talk at the AAAI-98 workshop on learning for text categorization) introduces a method of classifying patents with natural language processing techniques. The article "POSTECH at NTCIR-5 Patent Retrieval: Smoothing Experiments in a Language Modeling Approach to Patent Retrieval" (by In-Su Kang, Seung-Hoon Na, Jun-Ki Kim, and Jong-Hyeok Lee, published in Proceedings of NTCIR-5 Workshop Meeting, December 6-9, 2005, Tokyo, Japan) applies natural language processing techniques to patent retrieval.

However, the existing methods are limited to keyword retrieval and only retrieve patent texts against other patent texts. They do not consider the relationship between non-patent texts and patent texts or between non-patent texts and patent categories, and therefore cannot provide intelligent full-text retrieval across non-patent and patent texts.

Summary of the Invention

In the prior art, document retrieval for the patent field does not consider the relationship between non-patent texts and patent texts or between non-patent texts and patent categories, and cannot provide intelligent full-text retrieval across non-patent and patent texts. To address these shortcomings, the technical problem to be solved by the invention is to provide a patent retrieval method that represents patent texts as feature vectors, calculates the similarity between a non-patent text and related patent texts, and retrieves the most relevant patent texts.

To solve the above technical problem, the invention adopts a patent retrieval method based on natural language processing technology, comprising the following steps:

preprocessing the query text and the patent texts;

retrieving the patent texts related to the query text, obtaining several similarity values with several different similarity calculation methods, combining the different similarity values to recalculate the similarity, and ranking the patent texts by the new similarity value;

applying several different decision methods to map the similarity ranking of the patent texts into different relevance rankings of patent categories; integrating the several patent category relevance rankings and re-ranking them to obtain a new patent category relevance ranking;

selecting, from the new patent category relevance ranking, the patent category most relevant to the query text.

The text processing comprises preprocessing the text to obtain candidate feature words, collecting statistics on the feature words, selecting features with a feature selection method, and converting the text into a vector representation. Specifically: tags that are not part of the patent text are removed and the patent text information is extracted, yielding the patent number, IPC category label, patent title, abstract, claims, and description; for English text, all uppercase words are retained; words containing digits are removed; stop words are removed; English text is lemmatized to obtain the candidate feature vocabulary; statistics are collected on the candidate feature vocabulary to obtain term frequency, document frequency, and per-term category frequency; the feature vocabulary is selected from the candidate feature words, the feature weight of each feature word is calculated, and the patent texts and the query text are converted into computable vectors according to the feature words and their weights.

The several similarity calculation methods yield several similarity values between the query text and a patent text, and these values are combined with a log-linear model according to the following formula:

$Sim(\vec{D}_1, \vec{D}_2) = \frac{\exp(\vec{\theta} \cdot \vec{S}(\vec{D}_1, \vec{D}_2))}{\sum_{k=0}^{n} \exp(\vec{\theta} \cdot \vec{S}(\vec{D}_1, \vec{d}_k))}$

where $\vec{S}(\vec{D}_1, \vec{D}_2)$ is the vector whose features are the similarity values between the query text $\vec{D}_1$ and the patent text $\vec{D}_2$ obtained with the different similarity calculation methods, $\vec{\theta}$ is the weight vector of those similarity values, n is the total number of patent texts related to the query text, and $\vec{d}_k$ denotes the k-th related patent text vector.

The several decision methods comprise a similarity summation weighted by patent category, a similarity summation weighted by the ranking position of the patent text, and a plain summation of patent text similarities. The category-weighted similarity summation is calculated as follows:

$score(x) = \sum_{i=1}^{k} (k_r)^{c_i} \times ICF \times score_{d_i} \times role(x, i)$

$ICF = \log\left(\frac{N + 0.5}{C_x + 0.5}\right)$

where k_r is a penalty factor constant, k is the number of candidate patent texts in the patent text similarity ranking, c_i is the position, obtained from the similarity ranking, of the patent category to which candidate patent text i belongs, $score_{d_i}$ is the similarity value between the query text and patent text d_i, ICF is the inverse category text frequency, C_x is the number of texts under category x, N is the total number of texts, score(x) is the relevance value between the query text and patent category x, and role(x, i) indicates whether patent text d_i belongs to patent category x.

The similarity summation weighted by the ranking position of the patent text is calculated as follows:

$score(x) = \sum_{i=1}^{k} (k_t)^{i} \times score_{d_i} \times role(x, i)$

The integration of the several patent category relevance rankings is performed by taking the category relevance rankings obtained from combining the different similarity values with the different category decision methods as features of the category position, and combining the several rankings with a Rank-SVM model.

The integration of the several patent category relevance rankings is performed by summing, for each category, the position values at which the category appears in the several patent category relevance rankings, and thereby calculating a new patent category relevance value.

The invention has the following beneficial effects and advantages:

1. The method adopts natural language processing technology and uses several similarity calculation methods to weigh the relevance between the query text and the patent texts, making full use of feature information from multiple angles. Finally, several systems are combined so that they complement one another, which improves system performance.

Brief Description of the Drawings

Fig. 1 is a flow chart of the method of the invention;

Fig. 2 is a flow chart of text preprocessing;

Fig. 3 is a flow chart of calculating the similarity between the query text and the patent texts;

Fig. 4 is a flow chart of calculating the relevance between the query text and the patent categories.

Detailed Description

The method of the invention is further explained below with reference to an embodiment and the accompanying drawings.

As shown in Fig. 1, a document retrieval method oriented to the patent field comprises the following steps:

The query text and the patent texts are preprocessed. The patent texts related to the query text are retrieved, several similarity values are obtained with several different similarity calculation methods, the different similarity values are combined to recalculate the similarity, and the patent texts are ranked by the new similarity value. Several different decision methods map the similarity ranking of the patent texts into different relevance rankings of patent categories; the several category relevance rankings are integrated and re-ranked to obtain a new patent category relevance ranking. From the new ranking, the patent category most relevant to the query text is selected.

As shown in Fig. 2, preprocessing the query text and the patent texts comprises the following steps:

a) Remove tags that are not part of the patent text, extract the patent text information, and obtain the patent number, IPC category label, patent title, abstract, claims, and description; remove non-letter or non-Chinese-character symbols inside words of the extracted text, for example '-', ',', '(' and ')'; for English text, keep all uppercase words; remove words containing digits; remove stop words, for example "claim" and "said" in English patents and "步骤" (step) and "特征" (feature) in Chinese patents, as well as prepositions, adverbs, and articles; lemmatize the English text to obtain the candidate feature vocabulary;

b) Collect statistics on the candidate feature vocabulary to obtain term frequency, document frequency, and per-term category frequency;

c) Select the feature vocabulary from the candidate feature words, calculate the feature weight of each feature word in the feature vocabulary, and convert the patent texts and the query text into computable vectors according to the feature words and their weights;

d) Using the feature words of the patents as index terms, build and store an inverted index over the patent documents and the patent text vectors.
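Steps a), b), and d) can be sketched roughly as follows for English text. This is a minimal illustration under stated assumptions, not the patented implementation: the stop-word list, the function names `tokenize` and `build_statistics`, and the input layout are hypothetical; "all uppercase words" is read here as words written entirely in capitals; lemmatization, Chinese text handling, and the feature selection and weighting of step c) are omitted because the description does not specify them.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list; the description only names "claim" and "said" as examples.
STOP_WORDS = {"claim", "said", "the", "a", "an", "of", "in", "and", "is", "to"}


def tokenize(text):
    """Turn raw English text into candidate feature words (step a, simplified)."""
    tokens = []
    for raw in text.split():
        if any(ch.isdigit() for ch in raw):
            continue  # remove words containing digits
        word = re.sub(r"[^A-Za-z]", "", raw)  # strip symbols such as '-', ',', '(', ')'
        if not word:
            continue
        if not word.isupper():
            word = word.lower()  # words written entirely in capitals are kept unchanged
        if word.lower() in STOP_WORDS:
            continue  # remove stop words
        tokens.append(word)
    return tokens


def build_statistics(docs):
    """docs maps doc_id -> (ipc_class, text). Returns tf, df, class frequency, inverted index."""
    tf = {}                          # term frequency per document (step b)
    df = Counter()                   # number of documents containing each term (step b)
    class_sets = defaultdict(set)    # IPC classes in which each term occurs
    index = defaultdict(set)         # inverted index: term -> document ids (step d)
    for doc_id, (ipc, text) in docs.items():
        counts = Counter(tokenize(text))
        tf[doc_id] = counts
        for term in counts:
            df[term] += 1
            class_sets[term].add(ipc)
            index[term].add(doc_id)
    cf = {term: len(classes) for term, classes in class_sets.items()}
    return tf, df, cf, index


docs = {
    "p1": ("G06F", "A retrieval method (said method) uses an inverted-index."),
    "p2": ("H04L", "Network retrieval with 2 routers and XML tags."),
}
tf, df, cf, index = build_statistics(docs)
```

In this toy collection the word "retrieval" occurs in both documents and in both IPC classes, while "XML" survives in its uppercase form and is indexed only under `p2`.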

As shown in Fig. 3, the calculation of the several similarities comprises the following steps:

Find in the patent text collection the patent texts that share feature words with the query text; these form the set of related patent texts.
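Selecting the related patent texts can be sketched as a lookup in an inverted index. The function name `candidate_patents` and the toy index below are hypothetical; the sketch only assumes that the index maps each feature word to the set of patent identifiers containing it.

```python
def candidate_patents(query_terms, inverted_index):
    """Return the ids of all patents sharing at least one feature word with the query."""
    candidates = set()
    for term in set(query_terms):
        # Terms missing from the index contribute no candidates.
        candidates |= inverted_index.get(term, set())
    return candidates


# Toy inverted index (illustrative data only).
inverted_index = {
    "retrieval": {"p1", "p2"},
    "network": {"p2"},
    "battery": {"p3"},
}
related = candidate_patents(["retrieval", "network", "query"], inverted_index)
```

Only the patents returned here need to be scored by the similarity methods, which keeps the later steps from touching the whole collection.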

Calculate the similarity between the query text and each related patent in the set of related patent texts. This embodiment uses several similarity calculation methods, namely the vector cosine method, the BM25 method, and the SMART method, calculated as follows:

1. Vector cosine method

The query text $\vec{D}_1$ and the patent text $\vec{D}_2$ are represented in the vector space model, and the cosine of the two vectors is calculated as:

$\cos(\vec{D}_1, \vec{D}_2) = \frac{\vec{D}_1 \cdot \vec{D}_2}{\|\vec{D}_1\| \cdot \|\vec{D}_2\|}$
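A minimal sketch of the cosine formula for sparse vectors is given below. Storing each vector as a dictionary from feature word to weight is an assumption made for illustration, as is returning 0.0 when one of the vectors is empty.

```python
import math


def cosine(d1, d2):
    """Cosine of two sparse vectors given as {feature word: weight} dictionaries."""
    dot = sum(w * d2.get(term, 0.0) for term, w in d1.items())
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumed convention for empty vectors
    return dot / (norm1 * norm2)


query_vec = {"retrieval": 1.0, "patent": 1.0}
patent_vec = {"retrieval": 1.0}
similarity = cosine(query_vec, patent_vec)
```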

2. BM25 method

BM25 has many variants; the formula used in this embodiment is:

$score(\vec{D}_1, \vec{D}_2) = \sum_{i=1}^{n} IDF(t_i) \cdot \frac{f(t_i, \vec{D}_2) \cdot (k_1 + 1)}{f(t_i, \vec{D}_2) + k_1 \cdot \left(1 - b + b \cdot \frac{|\vec{D}_2|}{avgdl}\right)}$

where n is the number of feature words in the query text $\vec{D}_1$; $f(t_i, \vec{D}_2)$ is the number of times feature word t_i occurs in the patent text $\vec{D}_2$; $|\vec{D}_2|$ is the length of the patent text $\vec{D}_2$; avgdl is the average text length in the set of patent texts related to the query text; k₁ and b are free parameters, set in this embodiment to k₁ = 2.0 and b = 0.75; IDF(t_i) is the inverse document frequency, which serves as the weight of the search term t_i and is calculated as:

$IDF(t_i) = \log \frac{N - n(t_i) + 0.5}{n(t_i) + 0.5}$

where N is the total number of documents in the whole data set and n(t_i) is the number of documents containing the search term t_i.
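The BM25 variant above can be sketched as follows, with k₁ = 2.0 and b = 0.75 as stated for the embodiment. The function names and the argument layout (a term-frequency dictionary per patent and a document-frequency dictionary for the collection) are assumptions for illustration.

```python
import math


def idf(total_docs, docs_with_term):
    """IDF(t) = log((N - n(t) + 0.5) / (n(t) + 0.5))."""
    return math.log((total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))


def bm25(query_terms, doc_tf, doc_len, avgdl, total_docs, doc_freq, k1=2.0, b=0.75):
    """BM25 score of one patent text for the distinct feature words of the query."""
    score = 0.0
    for term in set(query_terms):
        f = doc_tf.get(term, 0)
        if f == 0:
            continue  # a term absent from the patent contributes nothing
        length_norm = k1 * (1 - b + b * doc_len / avgdl)
        score += idf(total_docs, doc_freq[term]) * f * (k1 + 1) / (f + length_norm)
    return score


score = bm25(
    query_terms=["retrieval", "patent"],
    doc_tf={"retrieval": 2},
    doc_len=50,
    avgdl=50.0,
    total_docs=100,
    doc_freq={"retrieval": 10, "patent": 20},
)
```

With this IDF definition, a term that occurs in more than half of the documents receives a negative weight, so scores are not guaranteed to be non-negative.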

3. SMART method

The SMART algorithm is calculated as follows:

$Sim_{SMART} = \sum_{t \in T} (\vec{D}_1 \times \vec{D}_2)$

The weight w_i of each feature dimension of the query text vector $\vec{D}_1$ is calculated as:

$w_i = (1 + \log(tf_i)) \times \log \frac{N + 1}{n}$

The weight w_i of each feature dimension of the patent text vector $\vec{D}_2$ is calculated as:

$w_i = \frac{1 + \log(tf_i)}{1 + \log(avtf)} \times \frac{1}{0.8 + 0.2 \frac{utf}{pivot}}$

where T is the set of feature words that occur in both the query text $\vec{D}_1$ and the patent text $\vec{D}_2$; tf_i is the term frequency of the i-th feature word in the text vector; N is the number of texts in the whole patent text collection and n is the number of patent texts in which the i-th feature occurs; avtf is the average term frequency of the feature words over the documents in the set of related patent texts; utf is the number of feature words in the patent text vector $\vec{D}_2$; pivot is the average number of feature words per document in the whole patent text collection.
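The SMART weighting can be sketched as below. The sketch reads the sum over T as the sum, over shared feature words, of the product of the query-side weight and the patent-side weight; that reading, the function names, and the choice to pass avtf and pivot in as precomputed numbers are assumptions made for illustration.

```python
import math


def query_weight(tf, total_docs, docs_with_term):
    """Query-side weight: (1 + log tf) * log((N + 1) / n)."""
    return (1 + math.log(tf)) * math.log((total_docs + 1) / docs_with_term)


def patent_weight(tf, avtf, utf, pivot):
    """Patent-side weight with pivoted normalization on the number of feature words."""
    return (1 + math.log(tf)) / (1 + math.log(avtf)) / (0.8 + 0.2 * utf / pivot)


def smart_similarity(query_tf, patent_tf, total_docs, doc_freq, avtf, pivot):
    """Sum of weight products over the feature words shared by query and patent."""
    utf = len(patent_tf)  # number of feature words in the patent text vector
    shared = set(query_tf) & set(patent_tf)
    return sum(
        query_weight(query_tf[t], total_docs, doc_freq[t])
        * patent_weight(patent_tf[t], avtf, utf, pivot)
        for t in shared
    )


sim = smart_similarity(
    query_tf={"retrieval": 1},
    patent_tf={"retrieval": 1, "index": 1},
    total_docs=9,
    doc_freq={"retrieval": 1},
    avtf=1.0,
    pivot=2.0,
)
```

In the example the patent-side weight equals 1 (tf = avtf = 1 and utf = pivot), so the similarity reduces to the query-side weight log(10).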

The three methods each yield a different similarity value between the query text and a patent text.

The similarity values obtained by the above methods are normalized to values between 0 and 1.

The logarithm of each normalized similarity value is taken.

The log similarity values serve as the features of the log-linear model, calculated as follows:

$Sim(\vec{D}_1, \vec{D}_2) = \frac{\exp(\vec{\theta} \cdot \vec{S}(\vec{D}_1, \vec{D}_2))}{\sum_{k=0}^{n} \exp(\vec{\theta} \cdot \vec{S}(\vec{D}_1, \vec{d}_k))}$

where $\vec{S}(\vec{D}_1, \vec{D}_2)$ is the feature vector built from the similarity values between the query text $\vec{D}_1$ and the patent text $\vec{D}_2$, $\vec{\theta}$ is the weight vector of those similarity values, n is the total number of patent texts related to the query text, and $\vec{d}_k$ denotes the k-th related patent text vector.
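The normalize, take-logarithm, and combine steps can be sketched as follows. Several details are assumptions, since the description does not fix them: scores are normalized by dividing by the maximum of each method, a small floor `EPS` avoids taking the logarithm of zero, and the weights `theta` are supplied by the caller (how they are trained is not stated).

```python
import math

EPS = 1e-9  # assumed floor so that log(0) never occurs


def normalize(scores):
    """Scale one method's scores into [0, 1] by dividing by the maximum (an assumption)."""
    top = max(scores)
    if top <= 0:
        return [0.0 for _ in scores]
    return [s / top for s in scores]


def log_linear_combine(score_lists, theta):
    """score_lists[m][k] is the score of candidate k under method m; theta[m] weights method m."""
    normalized = [normalize(scores) for scores in score_lists]
    logits = []
    for k in range(len(score_lists[0])):
        features = [math.log(max(norm[k], EPS)) for norm in normalized]
        logits.append(sum(t * f for t, f in zip(theta, features)))
    shift = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Two hypothetical methods scoring three candidate patents.
combined = log_linear_combine([[0.9, 0.3, 0.6], [12.0, 3.0, 6.0]], theta=[1.0, 1.0])
```

The combined values sum to 1 over the candidate set, and sorting the candidates by them gives the new patent text ranking.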

As shown in Fig. 4, several patent category decision methods are applied to the different patent text similarity rankings to calculate relevance rankings between the query text and the patent categories. The decision methods used in this embodiment are the similarity summation method, the patent text similarity position-weighted summation method, and the patent category-weighted summation method, calculated as follows:

1. Similarity summation method, calculated as follows:

$score(x) = \sum_{i=1}^{k} score_{d_i} \times role(x, i)$

where x is an IPC category, k is the number of candidate patent texts in the patent text similarity ranking, and $score_{d_i}$ is the similarity value of the i-th candidate patent text. role(x, i) indicates whether patent text d_i belongs to patent category x.
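The similarity summation can be sketched as below. The input layout (a list of similarity values paired with the set of IPC categories of each candidate) and the function name are assumptions; letting a patent carry several categories is a generalization in which role(x, i) is 1 for every category of patent i.

```python
from collections import defaultdict


def similarity_sum(ranked):
    """ranked: list of (similarity, set of IPC categories) for the top-k candidate patents."""
    scores = defaultdict(float)
    for sim, categories in ranked:
        for x in categories:  # role(x, i) = 1 exactly for the categories of patent i
            scores[x] += sim
    return dict(scores)


ranked = [(0.5, {"G06F"}), (0.3, {"H04L"}), (0.2, {"G06F", "H04L"})]
category_scores = similarity_sum(ranked)
```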

2. Patent category-weighted summation method, calculated as follows:

$score(x) = \sum_{i=1}^{k} (k_r)^{c_i} \times ICF \times score_{d_i} \times role(x, i)$

$ICF = \log\left(\frac{N + 0.5}{C_x + 0.5}\right)$

where k_r is a penalty factor constant, k is the number of candidate patent texts in the patent text similarity ranking, c_i is the position, obtained from the similarity ranking, of the patent category to which candidate patent text i belongs, $score_{d_i}$ is the similarity value between the query text and patent text d_i, ICF is the inverse category text frequency, C_x is the number of texts under category x, N is the total number of texts, and score(x) is the relevance value between the query text and patent category x. role(x, i) indicates whether patent text d_i belongs to patent category x.
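A sketch of the category-weighted summation, score(x) = Σ (k_r)^{c_i} × ICF × score_{d_i} × role(x, i), follows. The description leaves the exact definition of c_i open; this sketch assumes c_i is the 1-based position at which the patent's category is first seen when walking down the similarity ranking. That reading, the default k_r = 0.9, and the function name are all assumptions.

```python
import math


def category_weight_sum(ranked, class_sizes, total_docs, k_r=0.9):
    """ranked: [(similarity, ipc_category)] sorted by descending similarity."""
    class_position = {}
    for _, x in ranked:
        if x not in class_position:
            class_position[x] = len(class_position) + 1  # assumed reading of c_i
    scores = {}
    for sim, x in ranked:
        # ICF = log((N + 0.5) / (C_x + 0.5)): categories with few texts weigh more.
        icf = math.log((total_docs + 0.5) / (class_sizes[x] + 0.5))
        scores[x] = scores.get(x, 0.0) + (k_r ** class_position[x]) * icf * sim
    return scores


ranked = [(0.5, "A"), (0.4, "B"), (0.3, "A")]
scores = category_weight_sum(ranked, class_sizes={"A": 10, "B": 5}, total_docs=100, k_r=0.5)
```

With a penalty factor below 1, categories that first appear lower in the ranking are damped more strongly, while the ICF term favors categories with fewer texts.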

3. Patent text similarity position-weighted summation method, calculated as follows:

$score(x) = \sum_{i=1}^{k} (k_t)^{i} \times score_{d_i} \times role(x, i)$

where k_t is a penalty factor constant, k is the number of candidate patent texts in the patent text similarity ranking, and $score_{d_i}$ is the similarity value between the query text and patent text d_i. role(x, i) indicates whether patent text d_i belongs to patent category x.
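The position-weighted summation, score(x) = Σ (k_t)^i × score_{d_i} × role(x, i), can be sketched as follows. The default penalty factor 0.9, the function name, and the single-category input layout are assumptions for illustration.

```python
def position_weight_sum(ranked, k_t=0.9):
    """ranked: [(similarity, ipc_category)] sorted by descending similarity."""
    scores = {}
    for i, (sim, x) in enumerate(ranked, start=1):
        # The i-th ranked patent is damped by k_t to the power of its position i.
        scores[x] = scores.get(x, 0.0) + (k_t ** i) * sim
    return scores


ranked = [(0.5, "A"), (0.4, "B"), (0.3, "A")]
scores = position_weight_sum(ranked, k_t=0.5)
```

With k_t below 1, patents far down the similarity ranking contribute little to their category, so a category is rewarded mainly for patents near the top.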

The patent category relevance rankings produced by methods 1 to 3 are combined, and the categories are re-ranked. Many combinations are possible; this embodiment uses the following two:

The patent category relevance rankings obtained from combining the different similarity values with the different category decision methods are used as features of the category position, and the rankings are combined with a Rank-SVM model.

For each category, the position values at which it appears in the several patent category relevance rankings are summed to calculate a new patent category relevance value.
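The second combination method, summing positions, can be sketched as below (the Rank-SVM variant needs a trained ranking model and is not sketched). Two details are assumptions: a category missing from a ranking is given the position one past the end of that ranking, and ties are broken alphabetically. A smaller position sum means a more relevant category.

```python
def fuse_by_position(rankings):
    """rankings: list of category lists, best first. Returns categories by ascending position sum."""
    all_categories = set().union(*rankings)
    totals = {}
    for x in all_categories:
        total = 0
        for ranking in rankings:
            # Position is 1-based; missing categories get len(ranking) + 1 (an assumption).
            total += ranking.index(x) + 1 if x in ranking else len(ranking) + 1
        totals[x] = total
    return sorted(totals, key=lambda x: (totals[x], x))


rankings = [["A", "B", "C"], ["B", "A", "C"], ["B", "C"]]
fused = fuse_by_position(rankings)
```

Here category B has position sum 4, A has 6, and C has 8, so B is selected as the most relevant category.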

The above steps yield the similarity values between the query text and the patent texts; ranking is performed according to these values, and the patent category most relevant to the query text is selected.

The method of the invention is not limited to the embodiment described in the detailed description. Other embodiments derived by those skilled in the art from the technical solution of the invention likewise fall within the scope of technical innovation of the invention.

Claims

1. A document retrieval method oriented to the patent field, comprising the following steps:

preprocessing the query text and the patent texts;

retrieving the patent texts related to the query text, obtaining several similarity values with several different similarity calculation methods, combining the different similarity values to recalculate the similarity, and ranking the patent texts by the new similarity value;

applying several different decision methods to map the similarity ranking of the patent texts into different relevance rankings of patent categories; integrating the several patent category relevance rankings and re-ranking them to obtain a new patent category relevance ranking;

selecting, from the new patent category relevance ranking, the patent category most relevant to the query text;

wherein the several similarity calculation methods yield similarity values between the query text and the patent text, and these similarity values are integrated on the basis of a log-linear model according to the following formula:

$Sim(\vec{D}_1, \vec{D}_2) = \frac{\exp(\vec{\theta} \cdot \vec{S}(\vec{D}_1, \vec{D}_2))}{\sum_{k=0}^{n} \exp(\vec{\theta} \cdot \vec{S}(\vec{D}_1, \vec{d}_k))}$

where $\vec{S}(\vec{D}_1, \vec{D}_2)$ is the feature vector built from the similarity values between the query text $\vec{D}_1$ and the patent text $\vec{D}_2$, and $\vec{d}_k$ denotes the k-th related patent text vector;

the several decision methods comprise a similarity summation weighted by patent category, a similarity summation weighted by the ranking position of the patent text, and a summation of patent text similarities, wherein the category-weighted similarity summation is calculated as follows:

$score(x) = \sum_{i=1}^{k} (k_r)^{c_i} \times ICF \times score_{d_i} \times role(x, i)$

$ICF = \log\left(\frac{N + 0.5}{C_x + 0.5}\right)$

where k_r is a penalty factor constant, k is the number of candidate patent texts in the patent text similarity ranking, c_i is the position, obtained from the similarity ranking, of the patent category to which candidate patent text i belongs, $score_{d_i}$ is the similarity value between the query text and patent text d_i, ICF is the inverse category text frequency, C_x is the number of texts under category x, N is the total number of texts, score(x) is the relevance value between the query text and patent category x, and role(x, i) indicates whether patent text d_i belongs to patent category x;

the similarity summation weighted by the ranking position of the patent text is calculated as follows:

$score(x) = \sum_{i=1}^{k} (k_t)^{i} \times score_{d_i} \times role(x, i)$

2. The document retrieval method oriented to the patent field as claimed in claim 1, characterized in that the text processing comprises preprocessing the text to obtain candidate feature words, collecting statistics on the feature words, selecting features with a feature selection method, and converting the text into a vector representation, specifically:

removing tags that are not part of the patent text, extracting the patent text information, and obtaining the patent number, IPC category label, patent title, abstract, claims, and description; for English text, keeping all uppercase words; removing words containing digits; removing stop words; lemmatizing the English text to obtain the candidate feature vocabulary;

collecting statistics on the candidate feature vocabulary to obtain term frequency, document frequency, and per-term category frequency;

selecting the feature vocabulary from the candidate feature words, calculating the feature weight of each feature word in the feature vocabulary, and converting the patent texts and the query text into computable vectors according to the feature words and their weights.

3. The document retrieval method oriented to the patent field as claimed in claim 1, characterized in that the integration of the several patent category relevance rankings takes the category relevance rankings obtained from combining the different similarity values with the different category decision methods as features of the category position, and combines the several rankings on the basis of a ranking support vector machine (Rank-SVM) model.

4. The document retrieval method oriented to the patent field as claimed in claim 1, characterized in that the integration of the several patent category relevance rankings sums, for each category, the position values at which the category appears in the several patent category relevance rankings, and thereby calculates a new patent category relevance value.