CN109376352A - A patent text modeling method based on word2vec and semantic similarity - Google Patents


Info

Publication number
CN109376352A
CN109376352A
Authority
CN
China
Prior art keywords
word
patent text
text
word2vec
similarity
Prior art date
Legal status
Granted
Application number
CN201810991083.3A
Other languages
Chinese (zh)
Other versions
CN109376352B (en)
Inventor
路永和
刘小桦
Current Assignee
National Sun Yat Sen University
Original Assignee
National Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by National Sun Yat Sen University
Priority to CN201810991083.3A
Publication of CN109376352A
Application granted
Publication of CN109376352B
Active legal status
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G06F40/247 Thesauruses; Synonyms
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Semantic analysis
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/18 Legal services; Handling legal documents
    • G06Q50/184 Intellectual property management

Abstract

The present invention relates to the field of text modeling and proposes a patent text modeling method based on word2vec and semantic similarity, comprising the following steps: crawl a patent text set and preprocess it; compute the TF-IDF value of each word in the patent text set, sort, and select the feature word set; import the text set into the word2vec model and train it to obtain word vectors; compute cosine similarities to obtain the close word set wordC_1; compute word2vec similarities to obtain the close word set textC_1; import the text set into a text processing system, train it, obtain semantic similarities, and select the close word set wordC_2; compute semantic similarities to obtain the close word set textC_2; compute hybrid similarities to obtain the expansion word set textC_f; compute weights to form the new text representation, completing the modeling. From the statistical perspective of word2vec and the semantic perspective of semantic similarity, the invention adds inter-word information to the traditional vector space model, reducing the sparsity of its text matrix to a certain extent; the clustering effect is markedly stable, and the method has a stronger text representation ability.

Description

A patent text modeling method based on word2vec and semantic similarity
Technical field
The present invention relates to the field of text modeling, and more particularly to a patent text modeling method based on word2vec and semantic similarity.
Background technique
In the text modeling of patent texts, scholars have tried a variety of different methods to improve on traditional text modeling: representing a patent text as a text vector carrying both patent semantic weight information and word frequency weight information, proposing a patent term extraction scheme based on conditional random fields (CRF), and using a latent semantic indexing (LSI) model to realize a multilingual vector space, among others. Besides improving the traditional vector space model, many scholars have also constructed text modeling methods different from the vector space model in order to improve the text representation of patent texts.
However, the problems of sparse feature dimensions and missing semantic information in the traditional vector space model have still not been well solved, and existing patent text analysis methods fail to consider the entire content of a patent text, so they cannot deeply mine the inherent laws of patent texts within the same field.
Summary of the invention
To overcome the above defect of the prior art, namely the failure to consider the entire content of a patent text, the present invention provides a patent text modeling method based on word2vec and semantic similarity that can automatically cluster large-scale patent text data.
In order to solve the above technical problems, the technical scheme of the present invention is as follows:
A patent text modeling method based on word2vec and semantic similarity, comprising the following steps:
S1: crawl a patent text set of the designated field and preprocess it;
S2: count the frequency of each word in the patent text set to obtain a word-frequency document;
S3: compute the TF-IDF value of each word in the patent text set, sort the words in descending order of TF-IDF value, and select the top n words as feature words, forming the feature word set of the patent text set;
S4: import the preprocessed patent text set into the word2vec model, set the model parameters, and train to obtain word vectors;
S5: compute the cosine similarity between the word vector of each feature word and the word vectors of all other feature words in the patent text set, arrange the feature words in descending order of cosine similarity, and select a series of feature words as close words to form the close word set wordC_1 of the corresponding feature word;
S6: compute the word2vec similarity between each close word of each feature word and the patent text, sort the close words in descending order of word2vec similarity, and select the top m close words to form the close word set textC_1 of the corresponding patent text;
S7: import the preprocessed patent text set into the training module of the text processing system for training;
S8: input each feature word together with the other feature words of the patent text set into the semantic similarity computation module of the text processing system to obtain the corresponding semantic similarities, arrange the feature words in descending order of semantic similarity, and select a series of words to form the close word set wordC_2 of the feature word;
S9: compute the semantic similarity between each semantic-similarity-based close word of each feature word and the patent text, sort the semantic-similarity-based close words in descending order of semantic similarity, and select the top m close words to form the semantic-similarity-based close word set textC_2 of the patent text;
S10: compute the hybrid similarity between every close word in the word2vec-based close word set textC_1 and the semantic-similarity-based close word set textC_2 of each patent text and the corresponding patent text, sort all close words of the corresponding patent text in descending order of hybrid similarity, and select the top m close words to form the expansion word set textC_f of the patent text;
S11: compute the weight, in the patent text, of each word in the text and in its corresponding expansion word set textC_f, forming the new text representation and completing the modeling of the patent text based on word2vec and semantic similarity.
Because a patent text contains a large number of specific terms with semantic relations, the present invention proposes a patent text modeling method based on word2vec and semantic similarity: word2vec and a text processing module are used to find the close words of each vocabulary item in the text, and the close words of the text are then used as an expansion of the text features in the vector space model. From the statistical perspective of word2vec and the semantic perspective of semantic similarity, the method adds inter-word information to the traditional vector space model, considers the entire content of the patent text, and clusters and models large-scale patent text data.
Preferably, the specific steps of step S1 include:
S1.1: crawl the patent text set of the designated field by crawler and manual download;
S1.2: extract keywords from the patent text set with the keyword extraction function of the NLPIR/ICTCLAS2016 word segmentation system, deduplicate the extracted keywords, manually screen the keywords relevant to the designated field, and build a user dictionary for segmentation;
S1.3: import the user dictionary into the NLPIR/ICTCLAS2016 word segmentation system and segment the patent text set with its segmentation function;
S1.4: remove the stop words in the patent texts according to an existing stop-word list, and remove words without practical meaning, such as pronouns, prepositions and localizers, according to part of speech.
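The S1.4 filtering step can be sketched as follows. This is a minimal illustration that assumes the segmenter (NLPIR/ICTCLAS2016 in the patent) has already produced (word, POS-tag) pairs; the stop list, the ICTCLAS-style tag initials, and the helper name `filter_tokens` are illustrative assumptions, not the patent's actual code.

```python
# Sketch of the S1.4 filtering step on already-segmented text.
# Input is assumed to be (word, pos) pairs from the segmenter;
# the stop list and POS tags below are illustrative only.
STOPWORDS = {"of", "the", "in"}          # placeholder stop-word list
DROP_POS = {"r", "p", "f"}               # pronoun, preposition, localizer (ICTCLAS-style)

def filter_tokens(tagged_tokens):
    """Keep content words: drop stop words and function-word POS tags."""
    return [w for w, pos in tagged_tokens
            if w not in STOPWORDS and pos[0] not in DROP_POS]

sample = [("wireless", "n"), ("of", "p"), ("communication", "n"), ("it", "r")]
print(filter_tokens(sample))  # -> ['wireless', 'communication']
```

In practice the surviving tokens per document would then feed the word-frequency count of S2 and the TF-IDF ranking of S3.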
Preferably, the calculation formula of the word2vec similarity in step S6 is:
sim_w(t, d) = sim_w(t, w1) + sim_w(t, w2) + sim_w(t, w3) + ... + sim_w(t, wn)
where w1, w2, ..., wn are vocabulary items in patent text d, t is a common close word of w1, w2, ..., wn, sim_w(t, d) is the word2vec similarity between close word t and patent text d, and sim_w(t, wi) is the cosine similarity between the word vectors of feature word wi in patent text d and close word t. In a given patent text, if the close word sets wordC_1 of several feature words contain the same close word t, the cosine similarities of t in each wordC_1 are added together and taken as the word2vec similarity between t and the patent text; otherwise, if the close word set wordC_1 of only one feature word contains close word t, the cosine similarity between t and that feature word is taken as its word2vec similarity with the patent text.
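The summation just described can be sketched as a small aggregation routine. The helper name `text_similarity` and the toy similarity values are assumptions for illustration, not part of the patent; the same aggregation applies to the semantic similarity of step S9, with wordC_2 in place of wordC_1.

```python
# Sketch of the S6 aggregation: the word2vec similarity of a close
# word t to a patent text d is the sum of its cosine similarities to
# every feature word of d whose close-word set wordC_1 contains t.
def text_similarity(text_features, close_sets):
    """close_sets: {feature_word: {close_word: cosine_sim}}.
    Returns {close_word: summed similarity to the text}."""
    scores = {}
    for f in text_features:
        for t, sim in close_sets.get(f, {}).items():
            scores[t] = scores.get(t, 0.0) + sim
    return scores

# "router" is close to two feature words, so its two similarities sum;
# "antenna" appears in only one set and keeps its single score.
scores = text_similarity(
    ["network", "signal"],
    {"network": {"router": 0.8}, "signal": {"router": 0.6, "antenna": 0.9}},
)
```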
Preferably, the calculation formula of the semantic similarity in step S9 is:
sim_s(t, d) = sim_s(t, w1) + sim_s(t, w2) + sim_s(t, w3) + ... + sim_s(t, wn)
where sim_s(t, d) is the semantic similarity between close word t and patent text d, and sim_s(t, wi) is the semantic similarity between feature word wi in patent text d and close word t. In a given patent text, if the close word sets wordC_2 of several feature words contain the same close word t, the semantic similarities of t in each wordC_2 are added together and taken as the semantic similarity between t and the patent text; otherwise, if the close word set wordC_2 of only one feature word contains close word t, the semantic similarity between t and that feature word is taken as its semantic similarity with the patent text.
Preferably, the calculation formula of the hybrid similarity in step S10 is:
sim_m(t, d) = sim_w(t, d) + sim_s(t, d)
where sim_m(t, d) is the hybrid similarity between close word t and patent text d. For a given patent text, if both the word2vec-based close word set textC_1 and the semantic-similarity-based close word set textC_2 contain the same close word t, its word2vec similarity and its semantic similarity with the text are added together as the hybrid similarity between t and the text; otherwise, if only textC_1 or only textC_2 contains close word t, the word2vec similarity or the semantic similarity between t and the text is taken as the hybrid similarity.
Preferably, the weight in step S11 is computed by the following formula:
where pTFIDF(t, d) is the weight of feature word t in patent text d based on word2vec and semantic similarity, and W(t, d) is the TF-IDF weight of feature word t in patent text d after the expansion word set textC_f has been added.
Compared with the prior art, the beneficial effect of the technical solution of the present invention is: from the statistical perspective of word2vec and the semantic perspective of semantic similarity, the invention adds inter-word information to the traditional vector space model, reducing the sparsity of its text matrix to a certain extent. Compared with the traditional vector space model, the present invention performs markedly better in text clustering experiments, has a stronger text representation ability, yields a stable clustering effect, and is little affected by the choice of feature dimension.
Detailed description of the invention
Fig. 1 is the flow chart of the patent text modeling method based on word2vec and semantic similarity of the present invention.
Fig. 2 is a line chart of the DB index for different cluster numbers under different feature dimensions for the traditional vector space model.
Fig. 3 is a line chart of the DB index for different cluster numbers under different feature dimensions for the present embodiment.
Fig. 4 is a histogram of the DB index of the traditional vector space model and the modeling method of the present embodiment.
Specific embodiment
The attached figures are only for illustrative purposes and shall not be understood as limiting this patent.
In order to better illustrate this embodiment, certain components in the figures are omitted, enlarged or reduced, and do not represent the size of the actual product.
Those skilled in the art will understand that certain known structures and their explanations may be omitted from the figures.
The technical solution of the present invention is further described below with reference to the figures and embodiments.
As shown in Fig. 1, the flow chart of the patent text modeling method based on word2vec and semantic similarity of the present invention, the present embodiment models Chinese patent texts of the communications field according to this flow chart.
Step 1: crawl the patent text set of the communications field and preprocess it. In this embodiment, the patent text set of the communications field is grabbed from Google Patents by crawler and manual download, using "communication" as the retrieval keyword and crawling the patent texts in the search results from 2013 to 2017, 11230 patent texts in total. After rejecting the small number of patent texts in the retrieval results that mention the keyword "communication" but actually take other themes as their main content, 8528 patent texts remain and are used as the patent text set; each patent text is a txt document, and the set is stored in a MySQL database.
Keywords are extracted from the patent text set with the NLPIR/ICTCLAS2016 word segmentation system, the extracted keyword set is deduplicated, words relevant to the communications field are manually screened out, and a user dictionary for segmentation is built. The user dictionary is loaded into the NLPIR/ICTCLAS2016 word segmentation system, and the patent text set is segmented with its segmentation function. Stop words are removed according to an existing stop-word list, and words without practical meaning, such as pronouns, prepositions and localizers, are removed according to part of speech, yielding the preprocessed patent text set.
Step 2: count the frequency of each word in the patent text set to obtain a word-frequency document.
Step 3: compute the TF-IDF value of each word in the patent text set and sort the words in descending order of TF-IDF value. The top 300 to 2000 words, in intervals of 100, are selected as feature words, forming the feature word sets of the patent text set that serve as the feature attributes of the text representation; this is used to test whether the text modeling method proposed by the present invention suits text clustering tasks under different dimensions.
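Step 3's TF-IDF ranking can be sketched with the standard library alone. The function name `tfidf_feature_set`, the use of each word's maximum TF-IDF across documents as its ranking score, and the raw logarithmic IDF are illustrative assumptions; the patent does not specify its exact TF-IDF variant.

```python
# Minimal TF-IDF feature-selection sketch for step 3; n (number of
# feature words kept) is the parameter the patent varies from 300 to
# 2000 in steps of 100.
import math
from collections import Counter

def tfidf_feature_set(docs, n):
    """docs: list of token lists. Returns the top-n words by max TF-IDF."""
    df = Counter()                       # document frequency per word
    for doc in docs:
        df.update(set(doc))
    n_docs = len(docs)
    best = {}                            # best TF-IDF seen for each word
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        for w, c in tf.items():
            score = (c / total) * math.log(n_docs / df[w])
            best[w] = max(best.get(w, 0.0), score)
    return [w for w, _ in sorted(best.items(), key=lambda kv: -kv[1])[:n]]
```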
Step 4: import the preprocessed patent text set into the word2vec model, set the model parameters, and train to obtain word vectors. Under Linux, the preprocessed patent text set is fed as input data into Google's open-source word2vec project. The word2vec parameters are: -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1, i.e. the Skip-Gram model with a training window size of 5; training yields a 200-dimensional word vector for each word.
Step 5: compute the cosine similarity between the word vector of each feature word and the word vectors of all other feature words in the patent text set. The cosine similarity between each pair of feature-word vectors is obtained through the distance function of word2vec; the words are arranged in descending order of cosine similarity, and a series of words is selected to form the close word set wordC_1 of the word. Table 1 shows some feature words with their close words and the cosine similarities of the word pairs.
Table 1: close words and cosine similarities of some feature words obtained with the word2vec method
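The cosine computation of step 5 can be sketched as follows. The patent obtains these values from word2vec's distance tool over 200-dimensional trained vectors; this illustration uses toy 3-dimensional vectors, and the helper names `cosine` and `closest_words` are assumptions.

```python
# Sketch of the step-5 cosine computation between feature-word vectors.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def closest_words(word, vectors, k):
    """Rank all other words by cosine similarity to `word` (the wordC_1 set)."""
    sims = sorted(((cosine(vectors[word], vec), w)
                   for w, vec in vectors.items() if w != word),
                  reverse=True)
    return [w for _, w in sims[:k]]

vectors = {"a": [1.0, 0.0, 0.0], "b": [1.0, 1.0, 0.0], "c": [0.0, 0.0, 1.0]}
print(closest_words("a", vectors, 1))  # -> ['b']
```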
Step 6: for each word2vec-based close word of the feature words in each patent text, compute its word2vec similarity with the patent text. The calculation formula of the word2vec similarity is:
sim_w(t, d) = sim_w(t, w1) + sim_w(t, w2) + sim_w(t, w3) + ... + sim_w(t, wn)
where w1, w2, ..., wn are vocabulary items in patent text d, t is a common close word of w1, w2, ..., wn, sim_w(t, d) is the word2vec similarity between close word t and patent text d, and sim_w(t, wi) is the cosine similarity between the word vectors of feature word wi in patent text d and close word t. All word2vec-based close words of each patent text are sorted in descending order of their similarity with that text; considering the scale of each text and the time cost of the computation, the top 50 close words are chosen to form the word2vec-based close word set textC_1 of the text.
Step 7: import the preprocessed patent text set and the word-frequency document into the training module of the text processing system for training.
Step 8: for each feature word in each patent text, input it together with all other feature words of the patent text set into the semantic similarity computation module of the text processing system to obtain the semantic similarity between each pair of words; arrange the words in descending order of semantic similarity and select a series of words to form the close word set wordC_2 of the word.
Step 9: compute the semantic similarity between each semantic-similarity-based close word of each feature word and the patent text. The calculation formula of the semantic similarity is:
sim_s(t, d) = sim_s(t, w1) + sim_s(t, w2) + sim_s(t, w3) + ... + sim_s(t, wn)
where sim_s(t, d) is the semantic similarity between close word t and patent text d, and sim_s(t, wi) is the semantic similarity between feature word wi in patent text d and close word t. All semantic-similarity-based close words of each patent text are sorted in descending order of their similarity with that text; considering the scale of each text and the time cost of the computation, the top 50 close words are chosen to form the semantic-similarity-based close word set textC_2 of the text.
Step 10: for each patent text, combine the word2vec-based close word set textC_1 and the semantic-similarity-based close word set textC_2, and compute the hybrid similarity between every close word in the two sets and the patent text. The calculation formula of the hybrid similarity is:
sim_m(t, d) = sim_w(t, d) + sim_s(t, d)
where sim_m(t, d) is the hybrid similarity between close word t and patent text d. All close words of each patent text are sorted in descending order of hybrid similarity, and the top 50 close words are chosen to form the final expansion word set textC_f of the text.
Step 11: compute the weight, in the patent text, of each word in the text and in its corresponding expansion word set textC_f. In the weight formula, pTFIDF(t, d) is the weight of feature word t in patent text d based on word2vec and semantic similarity, and W(t, d) is the TF-IDF weight of feature word t in patent text d after the expansion word set textC_f has been added. Through the weight calculation, a new text representation is formed, completing the modeling of each patent text based on word2vec and semantic similarity.
The present embodiment verifies the validity and practicability of the patent text modeling method based on word2vec and semantic similarity by means of text clustering. K-means clustering is used, the cluster number is set from 2 to 25, the random seed is uniformly set to 10, and the corresponding DB index is computed; the trend of the DB index is observed to determine the final cluster number.
The calculation formula of the DB index is:
DB = (1/k) * sum_{i=1..k} max_{j != i} (s_i + s_j) / d_ij
where k is the cluster number, d_ij is the distance between the centres of clusters i and j, and s_i is the average distance from the samples in cluster i to the cluster centre.
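Under the stated definitions, the DB index computation can be sketched directly. `davies_bouldin` here is an illustrative standard-library implementation exercised on toy one-dimensional clusters; scikit-learn's `davies_bouldin_score` computes the same metric in practice.

```python
# Sketch of the Davies-Bouldin index used to pick the cluster number:
# DB = (1/k) * sum_i max_{j != i} (s_i + s_j) / d_ij, where s_i is the
# mean distance of cluster i's samples to its centre and d_ij the
# distance between the centres of clusters i and j.
import math

def davies_bouldin(clusters):
    """clusters: list of clusters, each a list of points (tuples of floats)."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    centres = [tuple(sum(x) / len(c) for x in zip(*c)) for c in clusters]
    scatter = [sum(dist(p, centre) for p in c) / len(c)
               for c, centre in zip(clusters, centres)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max((scatter[i] + scatter[j]) / dist(centres[i], centres[j])
                     for j in range(k) if j != i)
    return total / k
```

Lower values indicate tighter, better-separated clusters, which is why the embodiment looks for a concentrated minimum across cluster numbers.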
As shown in Fig. 2, a line chart of the DB index for different cluster numbers under different feature dimensions for the traditional vector space model, and Fig. 3, the corresponding line chart for the present embodiment: under different feature dimensions, the differences among the DB index values of the text modeling method combining word2vec and semantic similarity are small compared with the traditional vector space modeling method. This shows that the combined method obtains a more stable clustering effect, i.e. its representation ability for text is more stable and does not fluctuate obviously as the feature dimension increases. In addition, the overall trend of the DB index across cluster numbers under different feature dimensions for the combined method is almost the same as for the traditional vector space model: the curves decline, the fluctuation range narrows, an obvious concentrated inflection point appears when the cluster number is 22, and the DB index values of the feature dimensions at cluster number 22 are also more concentrated. Although the curves of Fig. 3 show that, under the combined modeling method, the DB index is at a low level when the dimension is 1800 to 2000 and the cluster number is 2, the DB index trend for cluster numbers 3 to 25 is essentially similar to the other dimensions; this experiment therefore regards the low DB index here as an accidental result of the K-means clustering algorithm, an outlier that cannot serve as the basis for the optimal clustering result of the modeling method.
In summary, when the cluster number is 22 the clustering evaluation index DB index is hardly affected by the feature dimension, the within-cluster dispersion and between-cluster separation remain relatively stable, and the DB index is concentrated, has an obvious inflection point, and reaches a minimum; therefore the cluster number 22 is selected as the optimal cluster number, and the effects of the two text modeling methods are compared on this basis.
When the cluster number is 22, the evaluation index DB index of the clustering results of the two text modeling methods under feature dimensions of 300 to 2000 is shown in Table 2.
Table 2: DB index of the text modeling methods under different feature dimensions when the cluster number is 22
As shown in Fig. 4, a histogram of the DB index of the traditional vector space model and the modeling method of the present embodiment, and from Table 2: when the cluster number is 22 and the feature dimension is 300 to 2000, the DB index obtained by the patent text modeling method combining word2vec and semantic similarity is lower than that of the traditional vector space model. Although the difference between the two is small at 400, 600, 1100 and 1300 dimensions, with the decrease ranging between 0.180 and 0.280, under the other dimensions the decrease is 0.530 or more, reaching a maximum of 2.143 at 2000 dimensions. Generally speaking, from 300 to 1300 dimensions the patent text modeling method based on word2vec and semantic similarity differs little from the traditional vector space model in clustering effect, but from 1400 to 2000 dimensions the gap between the two increases significantly. This shows that the clustering effect obtained by the patent text modeling method based on word2vec and semantic similarity is better than that of the traditional vector space model, and further illustrates that, in text representation ability, the method combining word2vec and semantic similarity is superior to the traditional vector space model; for long texts with higher feature dimensions, the improvement brought by the method is more obvious.
A paired-sample t-test is used to verify the significance of the difference between the clustering evaluation index DB index of the two methods; the results are shown in Table 3. At a significance level of 0.05, the p-value is less than 0.05, so the null hypothesis is rejected; that is, there is a significant difference between the DB index values of the two methods, and the mean DB index of the traditional vector space model is 1.07190 higher than that of the patent text modeling method based on word2vec and semantic similarity, indicating that the latter obtains a better clustering effect than the traditional vector space model.
Table 3: paired-sample t-test of the traditional vector space model and the modeling method of the present embodiment
The values of the clustering evaluation index DB index show that the clustering result of the text modeling method based on word2vec and semantic similarity is best when the cluster number is 22 and the feature dimension is 1200. The specific themes and their corresponding theme feature words are shown in Table 4.
Table 4: theme descriptions of the clustering result and the number of texts included
As Table 4 shows, the feature words and patent text contents within the same theme are close, indicating high cohesion within a theme, while the feature words and patent text contents of different themes differ considerably, indicating a high degree of distinction between themes. Among all themes, the "wireless communication" theme contains the most patent texts, 3291; "mobile communication" follows with 971 texts; third is "optical fiber communication" with 947 texts. This shows that, in the communications field, Chinese patents are mainly concentrated in "wireless communication", "mobile communication" and "optical fiber communication". On the other hand, "communication testing", "digital communication" and "automobile communication" contain fewer texts, with the "automobile communication" theme containing the fewest, only 54.
Although the numbers of texts in the theme categories of the clustering result are unevenly distributed, this can be addressed by merging theme fields that contain fewer texts and have overlapping clusters, or by further decomposing clusters of theme fields that contain more texts.
It can be seen that the patent text modeling method based on word2vec and semantic similarity proposed by the present invention can effectively improve the clustering effect and is suitable for tasks such as patent text clustering, topic identification and hotspot analysis.
The same or similar labels correspond to the same or similar components.
The positional relationships described in the figures are for illustration only and shall not be understood as limiting this patent.
Obviously, the above embodiment of the present invention is merely an example given to clearly illustrate the invention and is not a limitation on its embodiments. For those of ordinary skill in the art, other variations or changes in different forms may be made on the basis of the above description. It is neither necessary nor possible to exhaust all embodiments. Any modification, equivalent replacement or improvement made within the spirit and principle of the invention shall be included within the protection scope of the claims of the present invention.

Claims (6)

1. A patent text modeling method based on word2vec and semantic similarity, characterized by comprising the following steps:
S1: crawling the patent text set of a designated field and preprocessing the patent text set;
S2: counting the frequency of each word in the patent text set to obtain a word-frequency document;
S3: computing the TF-IDF value of each word in the patent text set, sorting the words in descending order of TF-IDF value, and selecting the top n words as feature words, which form the feature word set of the patent text set;
S4: importing the preprocessed patent text set into a word2vec model, setting the model parameters, and training on the patent text set to obtain word vectors;
S5: computing the cosine similarity between the word vector of each feature word and the word vectors of all other feature words in the patent text set, sorting the feature words in descending order of cosine similarity, and selecting a series of feature words as close words to form the close word set wordC_1 of the corresponding feature word;
S6: computing the word2vec similarity between the close words corresponding to each feature word and the patent text, sorting the close words in descending order of word2vec similarity, and selecting the top m close words to form the close word set textC_1 of the corresponding patent text;
S7: importing the preprocessed patent text set and the word-frequency document into the training module of a text processing system for training;
S8: inputting each feature word and the other feature words of the patent text set into the semantic similarity computation module of the text processing system to obtain the corresponding semantic similarities, sorting the feature words in descending order of semantic similarity, and selecting a series of words to form the close word set wordC_2 of the given feature word;
S9: computing the semantic similarity between each feature word's semantic-similarity-based close words and the patent text, sorting these close words in descending order of semantic similarity, and selecting the top m close words to form the semantic-similarity-based close word set textC_2 of the patent text;
S10: for every patent text, computing the hybrid similarity between the corresponding patent text and every close word in the word2vec-based close word set textC_1 and the semantic-similarity-based close word set textC_2, sorting all close words of the corresponding patent text in descending order of hybrid similarity, and selecting the top m close words to form the expansion word set textC_f of the patent text;
S11: computing, for every patent text, the weight in the patent text of each word in the patent text and of each word in the corresponding expansion word set textC_f, forming a new text representation and completing the modeling of the patent text based on word2vec and semantic similarity.
2. The patent text modeling method based on word2vec and semantic similarity according to claim 1, characterized in that step S1 comprises the following specific steps:
S1.1: crawling the patent text set of the designated field by means of a crawler and manual downloading;
S1.2: performing keyword extraction on the patent text set with the keyword extraction function of the NLPIR/ICTCLAS2016 word segmentation system, deduplicating the extracted keywords, manually screening the keywords relevant to the designated field, and building a user dictionary for word segmentation;
S1.3: importing the user dictionary into the NLPIR/ICTCLAS2016 word segmentation system and segmenting the patent text set with the segmentation function of the system;
S1.4: removing the stop words in the patent texts according to an existing stop word list, and removing the pronouns, prepositions, and nouns of locality in the patent texts according to their parts of speech.
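A minimal sketch of the filtering in step S1.4, assuming the segmenter has already produced (word, POS) pairs; the POS codes r, p, and f follow the ICTCLAS/PKU tag set for pronouns, prepositions, and nouns of locality, and the function name `clean_tokens` and the sample tokens are invented for illustration:

```python
def clean_tokens(tokens, stopwords, drop_pos=("r", "p", "f")):
    # Step S1.4: drop stop words, plus pronouns (r), prepositions (p)
    # and nouns of locality (f), given (word, POS) pairs from the segmenter.
    return [w for w, pos in tokens if w not in stopwords and pos not in drop_pos]

tokens = [("他", "r"), ("在", "p"), ("上面", "f"), ("专利", "n"), ("的", "u")]
kept = clean_tokens(tokens, stopwords={"的"})   # only the noun "专利" survives
```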
3. The patent text modeling method based on word2vec and semantic similarity according to claim 1, characterized in that the word2vec similarity in step S6 is computed as:
sim_w(t, d) = sim_w(t, w1) + sim_w(t, w2) + sim_w(t, w3) + ... + sim_w(t, wn)
where w1, w2, ..., wn are the words in patent text d, t is a common close word of w1, w2, ..., wn, sim_w(t, d) is the word2vec similarity between close word t and patent text d, and sim_w(t, wi) is the cosine similarity between the word vectors of feature word wi in patent text d and close word t.
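Under claim 3, the word2vec similarity of a close word t to a text d is the sum of the cosine similarities between t's vector and each feature word vector of d. A minimal sketch with toy two-dimensional vectors (in practice the vectors would come from the trained word2vec model of step S4; `cosine` and `sim_w` are illustrative names):

```python
import math

def cosine(u, v):
    # Cosine similarity between two word vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def sim_w(t_vec, feature_vecs):
    # Claim 3: sim_w(t, d) = sum over the feature words w_i of d of cos(t, w_i).
    return sum(cosine(t_vec, w) for w in feature_vecs)

# t is identical to the first feature word and orthogonal to the second.
score = sim_w([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```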
4. The patent text modeling method based on word2vec and semantic similarity according to claim 3, characterized in that the semantic similarity in step S9 is computed as:
sim_s(t, d) = sim_s(t, w1) + sim_s(t, w2) + sim_s(t, w3) + ... + sim_s(t, wn)
where sim_s(t, d) is the semantic similarity between close word t and patent text d, and sim_s(t, wi) is the semantic similarity between feature word wi in patent text d and close word t.
5. The patent text modeling method based on word2vec and semantic similarity according to claim 4, characterized in that the hybrid similarity in step S10 is computed as:
sim_m(t, d) = sim_w(t, d) + sim_s(t, d)
where sim_m(t, d) is the hybrid similarity between close word t and patent text d.
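Claims 3 to 5 combine into sim_m(t, d) = sim_w(t, d) + sim_s(t, d), after which step S10 keeps the top m candidates as the expansion word set. A minimal sketch, assuming the two candidate pools are given as word-to-score dictionaries and that a word appearing in only one pool keeps its single score (the function name `expansion_words` and the toy scores are invented for illustration):

```python
def expansion_words(sim_w_scores, sim_s_scores, m):
    # Claim 5 / step S10: sim_m(t, d) = sim_w(t, d) + sim_s(t, d);
    # merge both candidate pools, add the scores, keep the m best words.
    scores = {}
    for pool in (sim_w_scores, sim_s_scores):
        for t, s in pool.items():
            scores[t] = scores.get(t, 0.0) + s
    return sorted(scores, key=scores.get, reverse=True)[:m]

# "a" is scored by both measures (0.9 + 0.3 = 1.2) and ranks first.
top = expansion_words({"a": 0.9, "b": 0.5}, {"a": 0.3, "c": 0.8}, 2)
```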
6. The patent text modeling method based on word2vec and semantic similarity according to claim 5, characterized in that the weight in step S11 is computed by the following formula:
where pTFIDF(t, d) is the weight of feature word t in patent text d based on word2vec and semantic similarity, and W(t, d) is the TF-IDF weight of feature word t in the patent text d to which the expansion word set textC_f has been added.
CN201810991083.3A 2018-08-28 2018-08-28 Patent text modeling method based on word2vec and semantic similarity Active CN109376352B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810991083.3A CN109376352B (en) 2018-08-28 2018-08-28 Patent text modeling method based on word2vec and semantic similarity


Publications (2)

Publication Number Publication Date
CN109376352A true CN109376352A (en) 2019-02-22
CN109376352B CN109376352B (en) 2022-11-29

Family

ID=65404084

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810991083.3A Active CN109376352B (en) 2018-08-28 2018-08-28 Patent text modeling method based on word2vec and semantic similarity

Country Status (1)

Country Link
CN (1) CN109376352B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109933670A (en) * 2019-03-19 2019-06-25 中南大学 A kind of file classification method calculating semantic distance based on combinatorial matrix
CN110020430A (en) * 2019-03-01 2019-07-16 新华三信息安全技术有限公司 A kind of fallacious message recognition methods, device, equipment and storage medium
CN110175224A (en) * 2019-06-03 2019-08-27 安徽大学 Patent recommended method and device based on semantic interlink Heterogeneous Information internet startup disk
CN110795533A (en) * 2019-10-22 2020-02-14 王帅 Long text-oriented theme detection method
CN111680131A (en) * 2020-06-22 2020-09-18 平安银行股份有限公司 Document clustering method and system based on semantics and computer equipment
CN111950264A (en) * 2020-08-05 2020-11-17 广东工业大学 Text data enhancement method and knowledge element extraction method
CN112836010A (en) * 2020-10-22 2021-05-25 长城计算机软件与系统有限公司 Patent retrieval method, storage medium and device
CN113486176A (en) * 2021-07-08 2021-10-08 桂林电子科技大学 News classification method based on secondary feature amplification
WO2023173546A1 (en) * 2022-03-15 2023-09-21 平安科技(深圳)有限公司 Method and apparatus for training text recognition model, and computer device and storage medium
CN111950264B (en) * 2020-08-05 2024-04-26 广东工业大学 Text data enhancement method and knowledge element extraction method

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104881401A (en) * 2015-05-27 2015-09-02 大连理工大学 Patent literature clustering method
CN106372064A (en) * 2016-11-18 2017-02-01 北京工业大学 Characteristic word weight calculating method for text mining
US20170039176A1 (en) * 2015-08-03 2017-02-09 BlackBoiler, LLC Method and System for Suggesting Revisions to an Electronic Document
CN106528588A (en) * 2016-09-14 2017-03-22 厦门幻世网络科技有限公司 Method and apparatus for matching resources for text information
CN106599037A (en) * 2016-11-04 2017-04-26 焦点科技股份有限公司 Recommendation method based on label semantic normalization
CN106844346A (en) * 2017-02-09 2017-06-13 北京红马传媒文化发展有限公司 Short text Semantic Similarity method of discrimination and system based on deep learning model Word2Vec
KR20170094063A (en) * 2016-02-05 2017-08-17 한국과학기술원 Apparatus and method for computing noun similarities using semantic contexts
CN108052593A (en) * 2017-12-12 2018-05-18 山东科技大学 A kind of subject key words extracting method based on descriptor vector sum network structure


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
路永和: "文本分类中受词性影响的特征权重计算方法", 《现代图书情报技术》 *


Also Published As

Publication number Publication date
CN109376352B (en) 2022-11-29

Similar Documents

Publication Publication Date Title
CN109376352A (en) A kind of patent text modeling method based on word2vec and semantic similarity
CN108874878B (en) Knowledge graph construction system and method
CN104765769B (en) The short text query expansion and search method of a kind of word-based vector
CN106844658B (en) Automatic construction method and system of Chinese text knowledge graph
CN104834735B (en) A kind of documentation summary extraction method based on term vector
CN105808526B (en) Commodity short text core word extracting method and device
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
CN102289522B (en) Method of intelligently classifying texts
CN108052593A (en) A kind of subject key words extracting method based on descriptor vector sum network structure
CN102799577B (en) A kind of Chinese inter-entity semantic relation extraction method
CN107193801A (en) A kind of short text characteristic optimization and sentiment analysis method based on depth belief network
CN110020189A (en) A kind of article recommended method based on Chinese Similarity measures
CN103617290B (en) Chinese machine-reading system
CN107992542A (en) A kind of similar article based on topic model recommends method
CN103049569A (en) Text similarity matching method on basis of vector space model
CN103970729A (en) Multi-subject extracting method based on semantic categories
CN101231634A (en) Autoabstract method for multi-document
CN103970730A (en) Method for extracting multiple subject terms from single Chinese text
CN103678275A (en) Two-level text similarity calculation method based on subjective and objective semantics
CN105279264A (en) Semantic relevancy calculation method of document
CN111309925A (en) Knowledge graph construction method of military equipment
CN102054029A (en) Figure information disambiguation treatment method based on social network and name context
CN110362678A (en) A kind of method and apparatus automatically extracting Chinese text keyword
CN111625622B (en) Domain ontology construction method and device, electronic equipment and storage medium
CN105787097A (en) Distributed index establishment method and system based on text clustering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant