CN105335351B - A method for automatically mining synonyms based on user behavior in patent search logs

Info

Publication number: CN105335351B
Application number: CN201510701365.1A
Authority: CN
Inventors: 吕学强; 周建设; 董志安; 李雪伟
Original assignee: Capital Normal University; Beijing Information Science and Technology University
Current assignee: Capital Normal University; Beijing Information Science and Technology University
Priority date: 2015-10-27
Filing date: 2015-10-27
Publication date: 2018-08-28
Anticipated expiration: 2035-10-27
Also published as: CN105335351A

Abstract

The present invention relates to a method for automatically mining synonyms based on user behavior in patent search logs, comprising the following steps: Step 1) preprocess the patent search log and obtain a candidate synonym set using structural templates of synonym sets in the patent search log; Step 2) extract the literal features, pronunciation features, and query features of the candidate synonyms in the candidate synonym set. By selecting literal, pronunciation, and query features, the method provided by the invention can effectively improve the accuracy of synonym identification in the patent search log domain and meets the needs of practical application well.

Description

A method for automatically mining synonyms based on user behavior in patent search logs

Technical field

The invention belongs to the technical field of Chinese information retrieval, and in particular relates to a method for automatically mining synonyms based on user behavior in patent search logs.

Background technology

With the rapid development of science and technology, emerging high-tech products increasingly flood the market, and patent information, as a specialized information resource that combines legal, technical, and economic content, has attracted wide attention. Patent search engines are widely used as a basic means of querying patent information. Whether users can retrieve satisfactory information is closely tied to the search engine's thesaurus, and synonyms are one component of that thesaurus. Research on synonym mining is therefore particularly important for letting users retrieve more comprehensive and detailed patent information.

Patent search logs contain a large number of misspelled words, some of which are widely used; such a word and its correct form are also treated as synonyms, for example carbon nanotube and carbon nanotube, or Yoga and yoga. In addition, patent search logs contain many out-of-vocabulary words, so existing synonym resources such as 《HowNet》 and 《Chinese Thesaurus》 cannot be used for synonym mining in patent search logs. Traditionally, synonyms are defined as different expressions of the same thing. Based on an analysis of the vocabulary in patent search logs, synonyms in the patent domain can be roughly divided into the following eight categories:

1) Chinese-English: two different expressions of the same concept, e.g. Zinc-Zn, Email-email;

2) Scientific name-common name: the written term and the everyday term for the same thing, e.g. ethyl alcohol-alcohol;

3) Full name-abbreviation: the original name and the shortened name of the same thing, e.g. Peking University-Beijing University, short message-short message, time stab-time stamp;

4) Homophone synonyms: words mainly produced by frequently used misspellings, e.g. Yoga-yoga, gamma-gamma, Bezafibrate-bezafibrate, automobile-gas vehicle;

5) New name-former name: two names used for the same concept in different periods, e.g. bicycle-bicycle;

6) Traditional synonyms: words that express the same concept and do not belong to the categories above, e.g. chitin-chitin, threshold value-thresholding;

7) Antonyms: words whose concepts are diametrically opposite, e.g. go out-enter, increase-reduction, left turn-right turn;

8) Synonyms caused by translation: different renderings of an English term whose pronunciations are roughly the same, e.g. Epcos AG-Epcos AG, Rosemount Inc-Rosemount Inc.

Synonym resources are now widely used in many fields, such as information retrieval, semantic disambiguation, query expansion, keyword extraction, and machine translation. As these applications have spread, methods for automatically mining synonyms have emerged one after another. At present there are two main approaches: corpus-based synonym mining and dictionary-based synonym mining. Both have certain defects: corpus-based methods are prone to matrix sparsity problems, and dictionary-based methods are easily limited by domain and cannot work well outside it.

Invention content

In view of the above problems in the prior art, the object of the present invention is to provide a method for automatically mining synonyms based on user behavior in patent search logs that avoids the above technical defects.

To achieve this object, the invention adopts the following technical solution:

A method for automatically mining synonyms based on user behavior in patent search logs, comprising the following steps:

Step 1) preprocess the patent search log and obtain a candidate synonym set using structural templates of synonym sets in the patent search log;

Step 2) extract the literal features, pronunciation features, and query features of the candidate synonyms in the candidate synonym set.

Further, step 1) specifically comprises:

Step A: filter out useless query strings, using regular expressions to remove from the patent search log the queries made by application number, publication number, or classification number;

Step B: convert full-width characters to half-width and traditional Chinese characters to simplified in the patent search log;

Step C: extract the synonym structures in the patent search log according to the structural templates of candidate synonym sets;

Step D: filter out personal-name information according to the name identifier rules to obtain the candidate synonym set.
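Steps A and B can be sketched as follows. This is a minimal illustration, not the invention's actual rules: the two number patterns and all function names are hypothetical, and the traditional-to-simplified conversion is left out because it needs a character mapping table that the text does not specify.

```python
import re
import unicodedata

# Hypothetical patterns for queries that only look up a document by number.
# The real log format and the invention's actual expressions are not given.
APP_OR_PUB_NUMBER = re.compile(r"^(?:CN)?\d{8,}(?:\.\d|[A-Z])?$", re.IGNORECASE)
IPC_CLASS = re.compile(r"^[A-H]\d{2}[A-Z]\s*\d{1,4}/\d{2,}$", re.IGNORECASE)


def is_number_lookup(query: str) -> bool:
    """Step A: True when the query is just an application, publication, or class number."""
    return bool(APP_OR_PUB_NUMBER.match(query) or IPC_CLASS.match(query))


def to_half_width(text: str) -> str:
    """Step B (first half): fold full-width forms to half-width.

    NFKC normalization maps full-width letters, digits, punctuation, and the
    ideographic space to their ASCII counterparts. It also folds other
    compatibility characters, which may or may not be wanted.
    """
    return unicodedata.normalize("NFKC", text)


def preprocess(queries):
    """Normalize each query, then drop empty ones and pure number lookups."""
    kept = []
    for raw in queries:
        query = to_half_width(raw).strip()
        if query and not is_number_lookup(query):
            kept.append(query)
    return kept
```

Normalizing before filtering means that a number typed in full-width digits is still recognized as a number lookup.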

Further, the literal features comprise five features: maximum similarity, minimum similarity, center-of-gravity back-shift similarity, whether the two words share a prefix, and whether they share a suffix, wherein the maximum similarity is calculated as follows:

The minimum similarity is calculated as follows:

The center-of-gravity back-shift similarity is calculated as follows:

Here, Sim_zimian_max(w₁, w₂) denotes the maximum similarity of the word pair (w₁, w₂); Sim_zimian_min(w₁, w₂) denotes the minimum similarity of the word pair (w₁, w₂); Sim_zimian_zhongxin(w₁, w₂) denotes the center-of-gravity back-shift similarity of the word pair (w₁, w₂); same(w₁, w₂) denotes the number of identical characters in the word pair (w₁, w₂); min(|w₁|, |w₂|) denotes the length of the shorter word in the pair; max(|w₁|, |w₂|) denotes the length of the longer word in the pair; |w₁| denotes the length of w₁; |w₂| denotes the length of w₂; the weighted-sum term is the sum of the weights of the identical characters at their different positions within a word; k denotes the number of characters in a word; same(w₁, m) denotes the position of an identical character; and α=0.6, β=0.4, γ=1.

Further, the pronunciation similarity of the pronunciation feature is calculated as follows:

Here, the terms of the formula denote, respectively, the pronunciation of w₁, the minimum edit distance between the pronunciations of the word pair (w₁, w₂), the length of the longer pronunciation in the word pair (w₁, w₂), and the pronunciation similarity of the word pair (w₁, w₂).

Further, words that appear in the same row of the patent search log are used as a query feature, and the query feature value is calculated using the following formula:

(w₁, w₂) ∈ row indicates that the word pair (w₁, w₂) appears in the same row of the patent search log, and (w₁, w₂) ∉ row indicates that the word pair (w₁, w₂) does not appear in the same row of the patent search log.

By selecting literal, pronunciation, and query features, the method for automatically mining synonyms based on user behavior in patent search logs provided by the invention can effectively improve the accuracy of synonym identification in the patent search log domain and meets the needs of practical application well.

Description of the drawings

Fig. 1 is the flow chart of the present invention；

Fig. 2 is a flow chart of the specific steps of step 1)；

Fig. 3 shows the linearly separable samples obtained by transforming linearly inseparable data with a Gaussian kernel function, where the circled points are the support vectors.

Specific implementation mode

To make the purpose, technical solution, and advantages of the present invention clearer, the invention is further described below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein only explain the present invention and are not intended to limit it.

As shown in Fig. 1, a method for automatically mining synonyms based on user behavior in patent search logs comprises the following steps:

Most query strings in the patent search log contain several ways of describing one thing. These descriptions are connected by logical operators such as "or", "and", and "not", and some of the words connected by these operators stand in a coordinate relationship. Table 1 illustrates how synonyms are distributed in the patent search log.

Table 1: Processed patent search log corpus

The constructed synonym-set structural templates are mainly of the following five kinds:

1. Template 1

“'words1' OR 'words2' OR 'words3'”, where words1, words2, and words3 form a candidate synonym set. The terms are connected by "OR" or "or"; this is the simplest synonym-set structural template, as shown in row 18 of Table 1;

2. Template 2

“('words1' pre/2 'words2') OR 'words3'”, where “'words1' pre/2 'words2'” indicates that words1 and words2 form a phrase, and this phrase and the words3 connected to it by "OR" are candidate synonyms; that is, words1+words2 and words3 are candidate synonyms, as shown in rows 19, 24, and 26 of Table 1;

3. Template 3

“'words1' OR ('words2' AND 'words3' AND 'words4')”, where the query statement “'words2' AND 'words3' AND 'words4'” indicates that words2, words3, and words4 form a phrase, and this phrase and the words1 connected to it by "OR" are candidate synonyms; that is, words1 and words2+words3+words4 are candidate synonyms, as shown in row 27 of Table 1;

4. Template 4

“'words1' OR ('words2' and/sen 'words3')”, where words1 and words2+words3 are candidate synonyms, as shown in row 29 of Table 1;

5. Template 5

“identifier='words1' OR identifier='words2' OR identifier='words3'”, where the identifier is usually "DESC1", "KWRF", "TICN", "ABS", "TI, KW+", etc., each referring to a different query field. Here words1, words2, and words3 form a candidate synonym set whose members share the same field, as shown in rows 17, 20, 22, 23, and 28 of Table 1.

The candidate synonym set is obtained using the structural templates of synonym sets in the patent search log. First, regular expressions are used to remove from the patent search log the queries made by application number, publication number, or classification number. Because the input methods and fonts of the recorded queries are not uniform, the log is converted from full-width to half-width characters and from traditional to simplified Chinese. The patent search log is then split according to synonym-set structural templates 1 to 5. The flow chart for mining the candidate synonym set in step 1) is shown in Fig. 2; step 1) specifically comprises:

When useless query strings are filtered out, queries that look up patents by title, address, applicant, inventor, patent agency, and the like are retained. Analysis of the patent search log shows that name information cannot be filtered out using the synonym templates alone. To improve the accuracy and recall of the synonym table, identifier rules that cover personal names, such as "INV", "ATCN", and "GK_IN", are extracted, and name information is filtered out according to these rules, for example the content of rows 17 and 23 in Table 1. After processing, the patent search log contains nearly 30,000 query words, including Chinese, English, and Japanese vocabulary.

For row 18 of Table 1, the candidate synonym set is: chitin, chitin, chitosan, so there are 3 candidate synonym pairs, namely: chitin-chitin; chitin-chitosan; chitin-chitosan. Because this approach makes full use of the way synonyms are distributed in the patent search log, the accuracy of the resulting candidate synonym set is relatively high.
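The extraction for Template 1 and the expansion of a candidate set into pairs can be sketched as below. The regular expression and the function names are illustrative assumptions, since the text does not give the expressions actually used; only the simplest template is covered.

```python
import re
from itertools import combinations

# A quoted term such as 'words1'; the capture group holds the term itself.
QUOTED = r"'([^']+)'"
# Template 1: two or more quoted terms joined only by OR (any letter case).
TEMPLATE_1 = re.compile(rf"^\s*{QUOTED}(?:\s*OR\s*{QUOTED})+\s*$", re.IGNORECASE)


def match_template_1(query):
    """Return the quoted terms if the whole query has the Template 1 shape, else []."""
    if TEMPLATE_1.match(query) is None:
        return []
    return re.findall(QUOTED, query)


def candidate_pairs(synonym_set):
    """Expand one candidate synonym set into all unordered word pairs."""
    return list(combinations(synonym_set, 2))


terms = match_template_1("'w1' OR 'w2' or 'w3'")
pairs = candidate_pairs(terms)
```

A set of n terms yields n(n-1)/2 pairs, which matches the three pairs obtained from the three-term set in the example above.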

The basic idea of the support vector machine (SVM) model is to define an optimal hyperplane and to reduce the search for that hyperplane to solving a convex programming problem. Then, based on the Mercer kernel expansion theorem, a nonlinear mapping maps the sample space into a high-dimensional or even infinite-dimensional feature space (a Hilbert space), so that linear learning methods in the feature space can solve regression, density function estimation, and highly nonlinear classification problems in the sample space. SVMs perform especially well on text classification problems, where the recall and accuracy obtained with this method are better than those of other methods.

Classification, also known as pattern recognition, consists of finding the classification relationship inherent in the data from existing observations and then using the resulting classification model y=M(x) to predict the data to be classified. Synonym identification is a two-class problem: find a suitable function y=f(x) that assigns each x_i with f(x_i) ≥ 0 to the positive class and each x_i with f(x_i) < 0 to the negative class.
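The decision rule can be written out directly. In this sketch the labels 1 and -1 follow the labeling used for the training corpus, while `example_f`, its weights, and its bias are made-up values for illustration only and are not a trained model.

```python
def classify(f, x):
    """Return 1 (positive class) when f(x) >= 0, otherwise -1 (negative class)."""
    return 1 if f(x) >= 0 else -1


def example_f(x):
    """Hypothetical linear decision function over three feature values."""
    weights, bias = (1.0, 1.0, 1.0), -1.5
    return sum(w * v for w, v in zip(weights, x)) + bias
```

Note that a point lying exactly on the boundary, where f(x) = 0, falls into the positive class under this rule.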

Commonly used kernel functions for the support vector machine model mainly include the following:

1. Polynomial kernel: K(x, x_i) = (a(x · x_i) + c)^d (1)

Fig. 3 shows the linearly separable samples obtained by transforming linearly inseparable data with a Gaussian kernel function, where the circled points are the support vectors.
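Both kernels mentioned here are short to compute. The polynomial kernel follows formula (1); the Gaussian kernel is written in its common form exp(-‖x - y‖² / (2σ²)), which is an assumption because the text only names the kernel. The default parameter values are arbitrary.

```python
import math


def polynomial_kernel(x, y, a=1.0, c=1.0, d=2):
    """(a * <x, y> + c) ** d, with <x, y> the dot product of the two vectors."""
    dot = sum(p * q for p, q in zip(x, y))
    return (a * dot + c) ** d


def gaussian_kernel(x, y, sigma=1.0):
    """exp(-||x - y||^2 / (2 * sigma^2)); equals 1 when x == y, decays with distance."""
    squared_distance = sum((p - q) ** 2 for p, q in zip(x, y))
    return math.exp(-squared_distance / (2 * sigma ** 2))
```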

In machine learning methods, feature selection is very important for classification. The candidate synonyms selected by the present invention are all word pairs with similar meanings, so simple features alone can hardly separate the classes. The present invention therefore considers not only literal features but also pronunciation similarity features and user query behavior features.

An obvious characteristic of synonyms is that they usually share morphemes, for example: Peking University and Beijing University, Running shoes and running shoe, timestamp value and time stamp. Literal similarity features are therefore taken into account when using the support vector machine. The literal features mainly comprise five features: maximum similarity, minimum similarity, center-of-gravity back-shift similarity, whether the two words share a prefix, and whether they share a suffix. The similarity formulas for the first three features are as follows:

The maximum similarity is calculated as

The minimum similarity is calculated as

The center-of-gravity back-shift similarity is calculated as

Here, Sim_zimian_max(w₁, w₂) denotes the maximum similarity of the word pair (w₁, w₂); Sim_zimian_min(w₁, w₂) denotes the minimum similarity of the word pair (w₁, w₂); Sim_zimian_zhongxin(w₁, w₂) denotes the center-of-gravity back-shift similarity of the word pair (w₁, w₂); same(w₁, w₂) denotes the number of identical characters in the word pair (w₁, w₂); min(|w₁|, |w₂|) denotes the length of the shorter word in the pair; max(|w₁|, |w₂|) denotes the length of the longer word in the pair; |w₁| denotes the length of w₁; |w₂| denotes the length of w₂; the weighted-sum term is the sum of the weights of the identical characters at their different positions within a word; k denotes the number of characters in a word; same(w₁, m) denotes the position of an identical character; and α=0.6, β=0.4, γ=1. Table 2 below gives an example of the literal features:

Table 2：Literal feature
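Four of the five literal features can be sketched from the symbol definitions alone. Because the formulas themselves are not reproduced in this text, the sketch assumes that the maximum similarity is same(w₁, w₂) / min(|w₁|, |w₂|), that the minimum similarity is same(w₁, w₂) / max(|w₁|, |w₂|), that same(w₁, w₂) counts shared characters with multiplicity, and that prefix and suffix mean the first and last character. The center-of-gravity back-shift similarity is omitted because its weighting cannot be recovered from the definitions. The word pair in the usage line is an illustration, not a row of Table 2.

```python
from collections import Counter


def same_count(w1, w2):
    """Number of characters shared by the two words, counted with multiplicity."""
    return sum((Counter(w1) & Counter(w2)).values())


def literal_features(w1, w2):
    """Assumed forms of four literal features for two non-empty words."""
    same = same_count(w1, w2)
    return {
        "sim_max": same / min(len(w1), len(w2)),  # normalized by the shorter word
        "sim_min": same / max(len(w1), len(w2)),  # normalized by the longer word
        "same_prefix": int(w1[0] == w2[0]),
        "same_suffix": int(w1[-1] == w2[-1]),
    }


features = literal_features("北京大学", "北大")
```

Under these assumptions an abbreviation that keeps only characters of the full name always scores 1.0 on the maximum similarity, which is why the minimum similarity is useful as a second, stricter signal.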

The logs contain many misspelled words, some of which are heavily used, so such word pairs are also treated as synonyms. These word pairs have one thing in common: their pronunciations are similar, for example: Fourier and Fourier, saltcake and mirabilite, Yoga and yoga. The pronunciation of a word is obtained by parsing Sogou cell dictionaries, and the pronunciation similarity of the pronunciation feature in step 2) is calculated as follows:

Here, the terms of the formula denote, respectively, the pronunciation of w₁, the minimum edit distance between the pronunciations of the word pair (w₁, w₂), the length of the longer pronunciation in the word pair (w₁, w₂), and the pronunciation similarity of the word pair (w₁, w₂). Table 3 below gives an example of the pronunciation feature:

Table 3：Pronunciation feature
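The quantities named above (minimum edit distance and maximum pronunciation length) suggest a normalized edit-distance similarity. The sketch below assumes the form 1 - distance / maximum length, which is a common normalization but is not confirmed by this text, and it takes pronunciations as ready-made sequences (a pinyin string or a list of syllables) rather than parsing any dictionary. The unit of length, letters or syllables, is likewise an assumption left to the caller.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        current = [i]
        for j, item_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # delete item_a
                current[j - 1] + 1,                    # insert item_b
                previous[j - 1] + (item_a != item_b),  # substitute or keep
            ))
        previous = current
    return previous[-1]


def pronunciation_similarity(p1, p2):
    """Assumed form: 1 - edit_distance / length of the longer pronunciation."""
    longest = max(len(p1), len(p2))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(p1, p2) / longest
```

Two words written with different characters but read identically get a similarity of 1.0, which is the case the homophone category is meant to capture.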

Query words that appear in the same row of the patent search log are similar or related words, because they are all different ways of describing the same patent.

Words that appear in the same row of the patent search log are used as a query feature. Part of the processed patent search log query information is shown in Table 4.

Table 4: Part of the processed patent search log query strings

As can be seen from Table 4, query words in the same row are more likely to be synonyms. The query feature described in step 2) is calculated as follows:

(w₁, w₂) ∈ row indicates that the word pair (w₁, w₂) appears in the same row of the patent search log, and (w₁, w₂) ∉ row indicates that the word pair (w₁, w₂) does not appear in the same row of the patent search log. Table 5 shows the query feature values of some of the word pairs in Table 4:

Table 5: Query features
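The query feature is a co-occurrence indicator over log rows. In the sketch below the values 1 and 0 are an assumption, since the formula is not reproduced in this text, and each row is represented as a set of the query words extracted from one log line.

```python
def query_feature(w1, w2, rows):
    """Return 1 if w1 and w2 occur together in at least one row, else 0.

    rows is an iterable of sets, one set of query words per log line.
    """
    return int(any(w1 in row and w2 in row for row in rows))


# Hypothetical rows; real rows would come from the processed search log.
rows = [{"a", "b"}, {"c", "d", "e"}]
```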

The following embodiment uses the patent search log provided by a patent search system, with a total size of 10G. The patent search log is first preprocessed, and the candidate synonym set is extracted according to the way synonyms appear in the patent search log. The literal, pronunciation, and query features of the word pairs in the processed log are then extracted. 4741 manually labeled word pairs are used as the training corpus, of which 2108 are synonym pairs and 2633 are non-synonym pairs; synonym pairs and non-synonym pairs are labeled "1" and "-1" respectively.

The literal, pronunciation, and query features are added in turn for the experiments. The change in the feature weight factors for each feature combination is shown in Table 6:

Table 6：Feature weight factor variations table

Here, feature combination 1 refers to the literal features; feature combination 2 refers to the literal features plus the pronunciation feature; feature combination 3 refers to the literal features plus the pronunciation feature plus the query feature. The results for each feature combination are shown in Table 7:

Table 7：SVM model experiment results

As can be seen from Table 7, the accuracy, recall, and F value all increase with feature combination 3, so the method herein adopts feature combination 3. The method of the present invention was compared with methods commonly used in the prior art; the experimental results are shown in Tables 8 and 9.

Table 8：Experiment statistics result

Table 9：Contrast and experiment

Here, the number of identified word pairs refers to the number of word pairs in the mined synonym table.

Tables 8 and 9 show that, as each feature is added, the accuracy, recall, and F value of synonym identification using the method of the present invention are higher than those of the prior art. The present invention can therefore effectively improve the accuracy of synonym identification in the patent search log domain by selecting literal, pronunciation, and query features.
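The three evaluation measures can be computed from pair counts as sketched below. This assumes that "accuracy" here denotes precision (correct pairs divided by identified pairs) and that the F value is the balanced F1 score; the text does not state either definition. The argument names are hypothetical.

```python
def precision_recall_f(identified, correct, actual):
    """Compute precision, recall, and F1 from pair counts.

    identified: number of word pairs in the mined synonym table
    correct:    number of those pairs that are true synonym pairs
    actual:     number of true synonym pairs in the test data
    """
    precision = correct / identified if identified else 0.0
    recall = correct / actual if actual else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value
```

Because F1 is the harmonic mean of the two rates, it rises only when precision and recall improve together, which is why it is a useful single number for comparing feature combinations.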

The above embodiments express only some implementations of the present invention. Although they are described specifically and in detail, they should not be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and these all fall within the protection scope of the present invention. The protection scope of this patent shall therefore be determined by the appended claims.

Claims

1. A method for automatically mining synonyms based on user behavior in patent search logs, characterized by comprising the following steps:

Step 1) preprocess the patent search log and obtain a candidate synonym set using structural templates of synonym sets in the patent search log;

Step 2) extract the literal features, pronunciation features, and query features of the candidate synonyms in the candidate synonym set;

Step 3) use a number of manually labeled word pairs as the training corpus, labeling synonym pairs and non-synonym pairs with "1" and "-1" respectively;

Step 4) feed the literal features, pronunciation features, and query features into an SVM model to identify word pairs;

wherein words that appear in the same row of the patent search log are used as a query feature, and the query feature value is calculated using the following formula:

wherein

(w₁, w₂) ∈ row indicates that the word pair (w₁, w₂) appears in the same row of the patent search log, and (w₁, w₂) ∉ row indicates that the word pair (w₁, w₂) does not appear in the same row of the patent search log.

2. The method for automatically mining synonyms based on user behavior in patent search logs according to claim 1, characterized in that step 1) specifically comprises:

Step A: filter out useless query strings, using regular expressions to remove from the patent search log the queries made by application number, publication number, or classification number；

3. The method for automatically mining synonyms based on user behavior in patent search logs according to claim 1, characterized in that the pronunciation similarity of the pronunciation feature is calculated as follows:

wherein the terms of the formula denote, respectively, the pronunciation of w₁, the minimum edit distance between the pronunciations of the word pair (w₁, w₂), the length of the longer pronunciation in the word pair (w₁, w₂), and the pronunciation similarity of the word pair (w₁, w₂).