CN109376352A - A patent text modeling method based on word2vec and semantic similarity - Google Patents


Info

Publication number
CN109376352A
CN109376352A
Authority
CN
China
Prior art keywords
word
patent text
text
word2vec
similarity
Prior art date
Legal status
Granted
Application number
CN201810991083.3A
Other languages
Chinese (zh)
Other versions
CN109376352B (en)
Inventor
路永和
刘小桦
Current Assignee
National Sun Yat Sen University
Original Assignee
National Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by National Sun Yat Sen University
Priority to CN201810991083.3A
Publication of CN109376352A
Application granted
Publication of CN109376352B
Active legal status
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G06F40/247 Thesauruses; Synonyms
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Semantic analysis
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/18 Legal services; Handling legal documents
    • G06Q50/184 Intellectual property management

Abstract

The present invention relates to the field of text modeling and proposes a patent text modeling method based on word2vec and semantic similarity, comprising the following steps: crawl a patent text set and preprocess it; compute the TF-IDF value of each word in the patent text set, sort, and select the feature word set; import the text set into the word2vec model and train it to obtain word vectors; compute cosine similarities to obtain the close word set wordC_1; compute word2vec similarities to obtain the close word set textC_1; import the text set into a text processing system, train it, obtain semantic similarities, and select the close word set wordC_2; compute semantic similarities to obtain the close word set textC_2; compute hybrid similarities to obtain the expansion word set textC_f; compute weights to form the new text representation, completing the modeling. From the statistical perspective of word2vec and the semantic perspective of semantic similarity, the invention adds inter-word information to the traditional vector space model, reducing the sparsity of its text matrix to a certain extent; the clustering effect is markedly stable, and the method has a stronger text representation ability.

Description

A patent text modeling method based on word2vec and semantic similarity
Technical field
The present invention relates to the field of text modeling, and more particularly to a patent text modeling method based on word2vec and semantic similarity.
Background technique
In the text modeling of patent texts, scholars have tried a variety of different methods to improve on traditional text modeling: representing a patent text as a text vector carrying both patent semantic weight information and word frequency weight information, proposing a patent term extraction scheme based on conditional random fields (CRF), and using a latent semantic indexing (LSI) model to realize a multilingual vector space, among others. Besides improving the traditional vector space model, many scholars have also constructed text modeling methods different from the vector space model in order to improve the text representation of patent texts.
However, the problems of sparse feature dimensions and missing semantic information in the traditional vector space model have still not been well solved, and existing patent text analysis methods fail to consider the entire content of a patent text, so they cannot deeply mine the inherent laws of patent texts within the same field.
Summary of the invention
To overcome the above defect of the prior art, namely the failure to consider the entire content of a patent text, the present invention provides a patent text modeling method based on word2vec and semantic similarity that can automatically cluster large-scale patent text data.
In order to solve the above technical problems, the technical scheme of the present invention is as follows:
A patent text modeling method based on word2vec and semantic similarity, comprising the following steps:
S1: crawl a patent text set of the designated field and preprocess it;
S2: count the frequency of each word in the patent text set to obtain a word-frequency document;
S3: compute the TF-IDF value of each word in the patent text set, sort the words in descending order of TF-IDF value, and select the top n words as feature words, forming the feature word set of the patent text set;
S4: import the preprocessed patent text set into the word2vec model, set the model parameters, and train to obtain word vectors;
S5: compute the cosine similarity between the word vector of each feature word and the word vectors of all other feature words in the patent text set, arrange the feature words in descending order of cosine similarity, and select a series of feature words as close words to form the close word set wordC_1 of the corresponding feature word;
S6: compute the word2vec similarity between each close word of each feature word and the patent text, sort the close words in descending order of word2vec similarity, and select the top m close words to form the close word set textC_1 of the corresponding patent text;
S7: import the preprocessed patent text set into the training module of the text processing system for training;
S8: input each feature word together with the other feature words of the patent text set into the semantic similarity computation module of the text processing system to obtain the corresponding semantic similarities, arrange the feature words in descending order of semantic similarity, and select a series of words to form the close word set wordC_2 of the feature word;
S9: compute the semantic similarity between each semantic-similarity-based close word of each feature word and the patent text, sort the semantic-similarity-based close words in descending order of semantic similarity, and select the top m close words to form the semantic-similarity-based close word set textC_2 of the patent text;
S10: compute the hybrid similarity between every close word in the word2vec-based close word set textC_1 and the semantic-similarity-based close word set textC_2 of each patent text and the corresponding patent text, sort all close words of the corresponding patent text in descending order of hybrid similarity, and select the top m close words to form the expansion word set textC_f of the patent text;
S11: compute the weight, in the patent text, of each word in the text and in its corresponding expansion word set textC_f, forming the new text representation and completing the modeling of the patent text based on word2vec and semantic similarity.
Because a patent text contains a large number of specific terms with semantic relations, the present invention proposes a patent text modeling method based on word2vec and semantic similarity: word2vec and a text processing module are used to find the close words of each vocabulary item in the text, and the close words of the text are then used as an expansion of the text features in the vector space model. From the statistical perspective of word2vec and the semantic perspective of semantic similarity, the method adds inter-word information to the traditional vector space model, considers the entire content of the patent text, and clusters and models large-scale patent text data.
Preferably, the specific steps of step S1 include:
S1.1: crawl the patent text set of the designated field by crawler and manual download;
S1.2: extract keywords from the patent text set with the keyword extraction function of the NLPIR/ICTCLAS2016 word segmentation system, deduplicate the extracted keywords, manually screen the keywords relevant to the designated field, and build a user dictionary for segmentation;
S1.3: import the user dictionary into the NLPIR/ICTCLAS2016 word segmentation system and segment the patent text set with its segmentation function;
S1.4: remove the stop words in the patent texts according to an existing stop-word list, and remove words without practical meaning, such as pronouns, prepositions and localizers, according to part of speech.
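The S1.4 filtering step can be sketched as follows. This is a minimal illustration that assumes the segmenter (NLPIR/ICTCLAS2016 in the patent) has already produced (word, POS-tag) pairs; the stop list, the ICTCLAS-style tag initials, and the helper name `filter_tokens` are illustrative assumptions, not the patent's actual code.

```python
# Sketch of the S1.4 filtering step on already-segmented text.
# Input is assumed to be (word, pos) pairs from the segmenter;
# the stop list and POS tags below are illustrative only.
STOPWORDS = {"of", "the", "in"}          # placeholder stop-word list
DROP_POS = {"r", "p", "f"}               # pronoun, preposition, localizer (ICTCLAS-style)

def filter_tokens(tagged_tokens):
    """Keep content words: drop stop words and function-word POS tags."""
    return [w for w, pos in tagged_tokens
            if w not in STOPWORDS and pos[0] not in DROP_POS]

sample = [("wireless", "n"), ("of", "p"), ("communication", "n"), ("it", "r")]
print(filter_tokens(sample))  # -> ['wireless', 'communication']
```

In practice the surviving tokens per document would then feed the word-frequency count of S2 and the TF-IDF ranking of S3.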
Preferably, the calculation formula of the word2vec similarity in step S6 is:
sim_w(t, d) = sim_w(t, w1) + sim_w(t, w2) + sim_w(t, w3) + ... + sim_w(t, wn)
where w1, w2, ..., wn are vocabulary items in patent text d, t is a common close word of w1, w2, ..., wn, sim_w(t, d) is the word2vec similarity between close word t and patent text d, and sim_w(t, wi) is the cosine similarity between the word vectors of feature word wi in patent text d and close word t. In a given patent text, if the close word sets wordC_1 of several feature words contain the same close word t, the cosine similarities of t in each wordC_1 are added together and taken as the word2vec similarity between t and the patent text; otherwise, if the close word set wordC_1 of only one feature word contains close word t, the cosine similarity between t and that feature word is taken as its word2vec similarity with the patent text.
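The summation just described can be sketched as a small aggregation routine. The helper name `text_similarity` and the toy similarity values are assumptions for illustration, not part of the patent; the same aggregation applies to the semantic similarity of step S9, with wordC_2 in place of wordC_1.

```python
# Sketch of the S6 aggregation: the word2vec similarity of a close
# word t to a patent text d is the sum of its cosine similarities to
# every feature word of d whose close-word set wordC_1 contains t.
def text_similarity(text_features, close_sets):
    """close_sets: {feature_word: {close_word: cosine_sim}}.
    Returns {close_word: summed similarity to the text}."""
    scores = {}
    for f in text_features:
        for t, sim in close_sets.get(f, {}).items():
            scores[t] = scores.get(t, 0.0) + sim
    return scores

# "router" is close to two feature words, so its two similarities sum;
# "antenna" appears in only one set and keeps its single score.
scores = text_similarity(
    ["network", "signal"],
    {"network": {"router": 0.8}, "signal": {"router": 0.6, "antenna": 0.9}},
)
```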
Preferably, the calculation formula of the semantic similarity in step S9 is:
sim_s(t, d) = sim_s(t, w1) + sim_s(t, w2) + sim_s(t, w3) + ... + sim_s(t, wn)
where sim_s(t, d) is the semantic similarity between close word t and patent text d, and sim_s(t, wi) is the semantic similarity between feature word wi in patent text d and close word t. In a given patent text, if the close word sets wordC_2 of several feature words contain the same close word t, the semantic similarities of t in each wordC_2 are added together and taken as the semantic similarity between t and the patent text; otherwise, if the close word set wordC_2 of only one feature word contains close word t, the semantic similarity between t and that feature word is taken as its semantic similarity with the patent text.
Preferably, the calculation formula of the hybrid similarity in step S10 is:
sim_m(t, d) = sim_w(t, d) + sim_s(t, d)
where sim_m(t, d) is the hybrid similarity between close word t and patent text d. For a given patent text, if both the word2vec-based close word set textC_1 and the semantic-similarity-based close word set textC_2 contain the same close word t, its word2vec similarity and its semantic similarity with the text are added together as the hybrid similarity between t and the text; otherwise, if only textC_1 or only textC_2 contains close word t, the word2vec similarity or the semantic similarity between t and the text is taken as the hybrid similarity.
Preferably, the weight in step S11 is computed by the following formula:
where pTFIDF(t, d) is the weight of feature word t in patent text d based on word2vec and semantic similarity, and W(t, d) is the TF-IDF weight of feature word t in patent text d after the expansion word set textC_f has been added.
Compared with the prior art, the beneficial effect of the technical solution of the present invention is: from the statistical perspective of word2vec and the semantic perspective of semantic similarity, the invention adds inter-word information to the traditional vector space model, reducing the sparsity of its text matrix to a certain extent. Compared with the traditional vector space model, the present invention performs markedly better in text clustering experiments, has a stronger text representation ability, yields a stable clustering effect, and is little affected by the choice of feature dimension.
Detailed description of the invention
Fig. 1 is the flow chart of the patent text modeling method based on word2vec and semantic similarity of the present invention.
Fig. 2 is a line chart of the DB index for different cluster numbers under different feature dimensions for the traditional vector space model.
Fig. 3 is a line chart of the DB index for different cluster numbers under different feature dimensions for the present embodiment.
Fig. 4 is a histogram of the DB index of the traditional vector space model and the modeling method of the present embodiment.
Specific embodiment
The attached figures are only for illustrative purposes and shall not be understood as limiting this patent.
In order to better illustrate this embodiment, certain components in the figures are omitted, enlarged or reduced, and do not represent the size of the actual product.
Those skilled in the art will understand that certain known structures and their explanations may be omitted from the figures.
The technical solution of the present invention is further described below with reference to the figures and embodiments.
As shown in Fig. 1, the flow chart of the patent text modeling method based on word2vec and semantic similarity of the present invention, the present embodiment models Chinese patent texts of the communications field according to this flow chart.
Step 1: crawl the patent text set of the communications field and preprocess it. In this embodiment, the patent text set of the communications field is grabbed from Google Patents by crawler and manual download, using "communication" as the retrieval keyword and crawling the patent texts in the search results from 2013 to 2017, 11230 patent texts in total. After rejecting the small number of patent texts in the retrieval results that mention the keyword "communication" but actually take other themes as their main content, 8528 patent texts remain and are used as the patent text set; each patent text is a txt document, and the set is stored in a MySQL database.
Keywords are extracted from the patent text set with the NLPIR/ICTCLAS2016 word segmentation system, the extracted keyword set is deduplicated, words relevant to the communications field are manually screened out, and a user dictionary for segmentation is built. The user dictionary is loaded into the NLPIR/ICTCLAS2016 word segmentation system, and the patent text set is segmented with its segmentation function. Stop words are removed according to an existing stop-word list, and words without practical meaning, such as pronouns, prepositions and localizers, are removed according to part of speech, yielding the preprocessed patent text set.
Step 2: count the frequency of each word in the patent text set to obtain a word-frequency document.
Step 3: compute the TF-IDF value of each word in the patent text set and sort the words in descending order of TF-IDF value. The top 300 to 2000 words, in intervals of 100, are selected as feature words, forming the feature word sets of the patent text set that serve as the feature attributes of the text representation; this is used to test whether the text modeling method proposed by the present invention suits text clustering tasks under different dimensions.
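Step 3's TF-IDF ranking can be sketched with the standard library alone. The function name `tfidf_feature_set`, the use of each word's maximum TF-IDF across documents as its ranking score, and the raw logarithmic IDF are illustrative assumptions; the patent does not specify its exact TF-IDF variant.

```python
# Minimal TF-IDF feature-selection sketch for step 3; n (number of
# feature words kept) is the parameter the patent varies from 300 to
# 2000 in steps of 100.
import math
from collections import Counter

def tfidf_feature_set(docs, n):
    """docs: list of token lists. Returns the top-n words by max TF-IDF."""
    df = Counter()                       # document frequency per word
    for doc in docs:
        df.update(set(doc))
    n_docs = len(docs)
    best = {}                            # best TF-IDF seen for each word
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        for w, c in tf.items():
            score = (c / total) * math.log(n_docs / df[w])
            best[w] = max(best.get(w, 0.0), score)
    return [w for w, _ in sorted(best.items(), key=lambda kv: -kv[1])[:n]]
```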
Step 4: import the preprocessed patent text set into the word2vec model, set the model parameters, and train to obtain word vectors. Under Linux, the preprocessed patent text set is fed as input data into Google's open-source word2vec project. The word2vec parameters are: -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1, i.e. the Skip-Gram model with a training window size of 5; training yields a 200-dimensional word vector for each word.
Step 5: compute the cosine similarity between the word vector of each feature word and the word vectors of all other feature words in the patent text set. The cosine similarity between each pair of feature-word vectors is obtained through the distance function of word2vec; the words are arranged in descending order of cosine similarity, and a series of words is selected to form the close word set wordC_1 of the word. Table 1 shows some feature words with their close words and the cosine similarities of the word pairs.
Table 1: close words and cosine similarities of some feature words obtained with the word2vec method
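The cosine computation of step 5 can be sketched as follows. The patent obtains these values from word2vec's distance tool over 200-dimensional trained vectors; this illustration uses toy 3-dimensional vectors, and the helper names `cosine` and `closest_words` are assumptions.

```python
# Sketch of the step-5 cosine computation between feature-word vectors.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def closest_words(word, vectors, k):
    """Rank all other words by cosine similarity to `word` (the wordC_1 set)."""
    sims = sorted(((cosine(vectors[word], vec), w)
                   for w, vec in vectors.items() if w != word),
                  reverse=True)
    return [w for _, w in sims[:k]]

vectors = {"a": [1.0, 0.0, 0.0], "b": [1.0, 1.0, 0.0], "c": [0.0, 0.0, 1.0]}
print(closest_words("a", vectors, 1))  # -> ['b']
```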
Step 6: for each word2vec-based close word of the feature words in each patent text, compute its word2vec similarity with the patent text. The calculation formula of the word2vec similarity is:
sim_w(t, d) = sim_w(t, w1) + sim_w(t, w2) + sim_w(t, w3) + ... + sim_w(t, wn)
where w1, w2, ..., wn are vocabulary items in patent text d, t is a common close word of w1, w2, ..., wn, sim_w(t, d) is the word2vec similarity between close word t and patent text d, and sim_w(t, wi) is the cosine similarity between the word vectors of feature word wi in patent text d and close word t. All word2vec-based close words of each patent text are sorted in descending order of their similarity with that text; considering the scale of each text and the time cost of the computation, the top 50 close words are chosen to form the word2vec-based close word set textC_1 of the text.
Step 7: import the preprocessed patent text set and the word-frequency document into the training module of the text processing system for training.
Step 8: for each feature word in each patent text, input it together with all other feature words of the patent text set into the semantic similarity computation module of the text processing system to obtain the semantic similarity between each pair of words; arrange the words in descending order of semantic similarity and select a series of words to form the close word set wordC_2 of the word.
Step 9: compute the semantic similarity between each semantic-similarity-based close word of each feature word and the patent text. The calculation formula of the semantic similarity is:
sim_s(t, d) = sim_s(t, w1) + sim_s(t, w2) + sim_s(t, w3) + ... + sim_s(t, wn)
where sim_s(t, d) is the semantic similarity between close word t and patent text d, and sim_s(t, wi) is the semantic similarity between feature word wi in patent text d and close word t. All semantic-similarity-based close words of each patent text are sorted in descending order of their similarity with that text; considering the scale of each text and the time cost of the computation, the top 50 close words are chosen to form the semantic-similarity-based close word set textC_2 of the text.
Step 10: for each patent text, combine the word2vec-based close word set textC_1 and the semantic-similarity-based close word set textC_2, and compute the hybrid similarity between every close word in the two sets and the patent text. The calculation formula of the hybrid similarity is:
sim_m(t, d) = sim_w(t, d) + sim_s(t, d)
where sim_m(t, d) is the hybrid similarity between close word t and patent text d. All close words of each patent text are sorted in descending order of hybrid similarity, and the top 50 close words are chosen to form the final expansion word set textC_f of the text.
Step 11: compute the weight, in the patent text, of each word in the text and in its corresponding expansion word set textC_f. In the weight formula, pTFIDF(t, d) is the weight of feature word t in patent text d based on word2vec and semantic similarity, and W(t, d) is the TF-IDF weight of feature word t in patent text d after the expansion word set textC_f has been added. Through the weight calculation, a new text representation is formed, completing the modeling of each patent text based on word2vec and semantic similarity.
The present embodiment verifies the validity and practicability of the patent text modeling method based on word2vec and semantic similarity by means of text clustering. K-means clustering is used, the cluster number is set from 2 to 25, the random seed is uniformly set to 10, and the corresponding DB index is computed; the trend of the DB index is observed to determine the final cluster number.
The calculation formula of the DB index is:
DB = (1/k) * sum_{i=1..k} max_{j != i} (s_i + s_j) / d_ij
where k is the cluster number, d_ij is the distance between the centres of clusters i and j, and s_i is the average distance from the samples in cluster i to the cluster centre.
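Under the stated definitions, the DB index computation can be sketched directly. `davies_bouldin` here is an illustrative standard-library implementation exercised on toy one-dimensional clusters; scikit-learn's `davies_bouldin_score` computes the same metric in practice.

```python
# Sketch of the Davies-Bouldin index used to pick the cluster number:
# DB = (1/k) * sum_i max_{j != i} (s_i + s_j) / d_ij, where s_i is the
# mean distance of cluster i's samples to its centre and d_ij the
# distance between the centres of clusters i and j.
import math

def davies_bouldin(clusters):
    """clusters: list of clusters, each a list of points (tuples of floats)."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    centres = [tuple(sum(x) / len(c) for x in zip(*c)) for c in clusters]
    scatter = [sum(dist(p, centre) for p in c) / len(c)
               for c, centre in zip(clusters, centres)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max((scatter[i] + scatter[j]) / dist(centres[i], centres[j])
                     for j in range(k) if j != i)
    return total / k
```

Lower values indicate tighter, better-separated clusters, which is why the embodiment looks for a concentrated minimum across cluster numbers.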
As shown in Fig. 2, a line chart of the DB index for different cluster numbers under different feature dimensions for the traditional vector space model, and Fig. 3, the corresponding line chart for the present embodiment: under different feature dimensions, the differences among the DB index values of the text modeling method combining word2vec and semantic similarity are small compared with the traditional vector space modeling method. This shows that the combined method obtains a more stable clustering effect, i.e. its representation ability for text is more stable and does not fluctuate obviously as the feature dimension increases. In addition, the overall trend of the DB index across cluster numbers under different feature dimensions for the combined method is almost the same as for the traditional vector space model: the curves decline, the fluctuation range narrows, an obvious concentrated inflection point appears when the cluster number is 22, and the DB index values of the feature dimensions at cluster number 22 are also more concentrated. Although the curves of Fig. 3 show that, under the combined modeling method, the DB index is at a low level when the dimension is 1800 to 2000 and the cluster number is 2, the DB index trend for cluster numbers 3 to 25 is essentially similar to the other dimensions; this experiment therefore regards the low DB index here as an accidental result of the K-means clustering algorithm, an outlier that cannot serve as the basis for the optimal clustering result of the modeling method.
In summary, when the cluster number is 22 the clustering evaluation index DB index is hardly affected by the feature dimension, the within-cluster dispersion and between-cluster separation remain relatively stable, and the DB index is concentrated, has an obvious inflection point, and reaches a minimum; therefore the cluster number 22 is selected as the optimal cluster number, and the effects of the two text modeling methods are compared on this basis.
When the cluster number is 22, the evaluation index DB index of the clustering results of the two text modeling methods under feature dimensions of 300 to 2000 is shown in Table 2.
Table 2: DB index of the text modeling methods under different feature dimensions when the cluster number is 22
As shown in Fig. 4, a histogram of the DB index of the traditional vector space model and the modeling method of the present embodiment, and from Table 2: when the cluster number is 22 and the feature dimension is 300 to 2000, the DB index obtained by the patent text modeling method combining word2vec and semantic similarity is lower than that of the traditional vector space model. Although the difference between the two is small at 400, 600, 1100 and 1300 dimensions, with the decrease ranging between 0.180 and 0.280, under the other dimensions the decrease is 0.530 or more, reaching a maximum of 2.143 at 2000 dimensions. Generally speaking, from 300 to 1300 dimensions the patent text modeling method based on word2vec and semantic similarity differs little from the traditional vector space model in clustering effect, but from 1400 to 2000 dimensions the gap between the two increases significantly. This shows that the clustering effect obtained by the patent text modeling method based on word2vec and semantic similarity is better than that of the traditional vector space model, and further illustrates that, in text representation ability, the method combining word2vec and semantic similarity is superior to the traditional vector space model; for long texts with higher feature dimensions, the improvement brought by the method is more obvious.
A paired-sample t-test is used to verify the significance of the difference between the clustering evaluation index DB index of the two methods; the results are shown in Table 3. At a significance level of 0.05, the p-value is less than 0.05, so the null hypothesis is rejected; that is, there is a significant difference between the DB index values of the two methods, and the mean DB index of the traditional vector space model is 1.07190 higher than that of the patent text modeling method based on word2vec and semantic similarity, indicating that the latter obtains a better clustering effect than the traditional vector space model.
Table 3: paired-sample t-test of the traditional vector space model and the modeling method of the present embodiment
The values of the clustering evaluation index DB index show that the clustering result of the text modeling method based on word2vec and semantic similarity is best when the cluster number is 22 and the feature dimension is 1200. The specific themes and their corresponding theme feature words are shown in Table 4.
Table 4: theme descriptions of the clustering result and the number of texts included
As Table 4 shows, the feature words and patent text contents within the same theme are close, indicating high cohesion within a theme, while the feature words and patent text contents of different themes differ considerably, indicating a high degree of distinction between themes. Among all themes, the "wireless communication" theme contains the most patent texts, 3291; "mobile communication" follows with 971 texts; third is "optical fiber communication" with 947 texts. This shows that, in the communications field, Chinese patents are mainly concentrated in "wireless communication", "mobile communication" and "optical fiber communication". On the other hand, "communication testing", "digital communication" and "automobile communication" contain fewer texts, with the "automobile communication" theme containing the fewest, only 54.
Although the numbers of texts in the theme categories of the clustering result are unevenly distributed, this can be addressed by merging theme fields that contain fewer texts and have overlapping clusters, or by further decomposing clusters of theme fields that contain more texts.
It can be seen that the patent text modeling method based on word2vec and semantic similarity proposed by the present invention can effectively improve the clustering effect and is suitable for tasks such as patent text clustering, topic identification and hotspot analysis.
The same or similar labels correspond to the same or similar components.
The positional relationships described in the figures are for illustration only and shall not be understood as limiting this patent.
Obviously, the above embodiment of the present invention is merely an example given to clearly illustrate the invention and is not a limitation on its embodiments. For those of ordinary skill in the art, other variations or changes in different forms may be made on the basis of the above description. It is neither necessary nor possible to exhaust all embodiments. Any modification, equivalent replacement or improvement made within the spirit and principle of the invention shall be included within the protection scope of the claims of the present invention.

Claims (6)

1. A patent text modeling method based on word2vec and semantic similarity, characterized by comprising the following steps:
S1: crawling the patent text set of a designated field and preprocessing the patent text set;
S2: counting the frequency of each word in the patent text set to obtain a word-frequency document;
S3: computing the TF-IDF value of each word in the patent text set, sorting the words in descending order of TF-IDF value, and selecting the top n words as feature words, which form the feature word set of the patent text set;
S4: importing the preprocessed patent text set into a word2vec model, setting the model parameters, and training on the patent text set to obtain word vectors;
S5: computing the cosine similarity between the word vector of each feature word and the word vectors of all other feature words in the patent text set, sorting the feature words in descending order of cosine similarity, and selecting a series of feature words as close words to form the close word set wordC_1 of the corresponding feature word;
S6: computing the word2vec similarity between the close words corresponding to each feature word and the patent text, sorting the close words in descending order of word2vec similarity, and selecting the top m close words to form the close word set textC_1 of the corresponding patent text;
S7: importing the preprocessed patent text set and the word-frequency document into the training module of a text processing system for training;
S8: inputting each feature word and the other feature words of the patent text set into the semantic similarity computation module of the text processing system to obtain the corresponding semantic similarities, sorting the feature words in descending order of semantic similarity, and selecting a series of words to form the close word set wordC_2 of the given feature word;
S9: computing the semantic similarity between each feature word's semantic-similarity-based close words and the patent text, sorting these close words in descending order of semantic similarity, and selecting the top m close words to form the semantic-similarity-based close word set textC_2 of the patent text;
S10: for every patent text, computing the hybrid similarity between the corresponding patent text and every close word in the word2vec-based close word set textC_1 and the semantic-similarity-based close word set textC_2, sorting all close words of the corresponding patent text in descending order of hybrid similarity, and selecting the top m close words to form the expansion word set textC_f of the patent text;
S11: computing, for every patent text, the weight in the patent text of each word in the patent text and of each word in the corresponding expansion word set textC_f, forming a new text representation and completing the modeling of the patent text based on word2vec and semantic similarity.
2. The patent text modeling method based on word2vec and semantic similarity according to claim 1, characterized in that step S1 comprises the following specific steps:
S1.1: crawling the patent text set of the designated field by means of a crawler and manual downloading;
S1.2: performing keyword extraction on the patent text set with the keyword extraction function of the NLPIR/ICTCLAS2016 word segmentation system, deduplicating the extracted keywords, manually screening the keywords relevant to the designated field, and building a user dictionary for word segmentation;
S1.3: importing the user dictionary into the NLPIR/ICTCLAS2016 word segmentation system and segmenting the patent text set with the segmentation function of the system;
S1.4: removing the stop words in the patent texts according to an existing stop word list, and removing the pronouns, prepositions, and nouns of locality in the patent texts according to their parts of speech.
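A minimal sketch of the filtering in step S1.4, assuming the segmenter has already produced (word, POS) pairs; the POS codes r, p, and f follow the ICTCLAS/PKU tag set for pronouns, prepositions, and nouns of locality, and the function name `clean_tokens` and the sample tokens are invented for illustration:

```python
def clean_tokens(tokens, stopwords, drop_pos=("r", "p", "f")):
    # Step S1.4: drop stop words, plus pronouns (r), prepositions (p)
    # and nouns of locality (f), given (word, POS) pairs from the segmenter.
    return [w for w, pos in tokens if w not in stopwords and pos not in drop_pos]

tokens = [("他", "r"), ("在", "p"), ("上面", "f"), ("专利", "n"), ("的", "u")]
kept = clean_tokens(tokens, stopwords={"的"})   # only the noun "专利" survives
```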
3. The patent text modeling method based on word2vec and semantic similarity according to claim 1, characterized in that the word2vec similarity in step S6 is computed as:
sim_w(t, d) = sim_w(t, w1) + sim_w(t, w2) + sim_w(t, w3) + ... + sim_w(t, wn)
where w1, w2, ..., wn are the words in patent text d, t is a common close word of w1, w2, ..., wn, sim_w(t, d) is the word2vec similarity between close word t and patent text d, and sim_w(t, wi) is the cosine similarity between the word vectors of feature word wi in patent text d and close word t.
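Under claim 3, the word2vec similarity of a close word t to a text d is the sum of the cosine similarities between t's vector and each feature word vector of d. A minimal sketch with toy two-dimensional vectors (in practice the vectors would come from the trained word2vec model of step S4; `cosine` and `sim_w` are illustrative names):

```python
import math

def cosine(u, v):
    # Cosine similarity between two word vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def sim_w(t_vec, feature_vecs):
    # Claim 3: sim_w(t, d) = sum over the feature words w_i of d of cos(t, w_i).
    return sum(cosine(t_vec, w) for w in feature_vecs)

# t is identical to the first feature word and orthogonal to the second.
score = sim_w([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```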
4. The patent text modeling method based on word2vec and semantic similarity according to claim 3, characterized in that the semantic similarity in step S9 is computed as:
sim_s(t, d) = sim_s(t, w1) + sim_s(t, w2) + sim_s(t, w3) + ... + sim_s(t, wn)
where sim_s(t, d) is the semantic similarity between close word t and patent text d, and sim_s(t, wi) is the semantic similarity between feature word wi in patent text d and close word t.
5. The patent text modeling method based on word2vec and semantic similarity according to claim 4, characterized in that the hybrid similarity in step S10 is computed as:
sim_m(t, d) = sim_w(t, d) + sim_s(t, d)
where sim_m(t, d) is the hybrid similarity between close word t and patent text d.
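Claims 3 to 5 combine into sim_m(t, d) = sim_w(t, d) + sim_s(t, d), after which step S10 keeps the top m candidates as the expansion word set. A minimal sketch, assuming the two candidate pools are given as word-to-score dictionaries and that a word appearing in only one pool keeps its single score (the function name `expansion_words` and the toy scores are invented for illustration):

```python
def expansion_words(sim_w_scores, sim_s_scores, m):
    # Claim 5 / step S10: sim_m(t, d) = sim_w(t, d) + sim_s(t, d);
    # merge both candidate pools, add the scores, keep the m best words.
    scores = {}
    for pool in (sim_w_scores, sim_s_scores):
        for t, s in pool.items():
            scores[t] = scores.get(t, 0.0) + s
    return sorted(scores, key=scores.get, reverse=True)[:m]

# "a" is scored by both measures (0.9 + 0.3 = 1.2) and ranks first.
top = expansion_words({"a": 0.9, "b": 0.5}, {"a": 0.3, "c": 0.8}, 2)
```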
6. The patent text modeling method based on word2vec and semantic similarity according to claim 5, characterized in that the weight in step S11 is computed by the following formula:
where pTFIDF(t, d) is the weight of feature word t in patent text d based on word2vec and semantic similarity, and W(t, d) is the TF-IDF weight of feature word t in the patent text d to which the expansion word set textC_f has been added.
CN201810991083.3A 2018-08-28 2018-08-28 Patent text modeling method based on word2vec and semantic similarity Active CN109376352B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810991083.3A CN109376352B (en) 2018-08-28 2018-08-28 Patent text modeling method based on word2vec and semantic similarity


Publications (2)

Publication Number Publication Date
CN109376352A true CN109376352A (en) 2019-02-22
CN109376352B CN109376352B (en) 2022-11-29

Family

ID=65404084

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810991083.3A Active CN109376352B (en) 2018-08-28 2018-08-28 Patent text modeling method based on word2vec and semantic similarity

Country Status (1)

Country Link
CN (1) CN109376352B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109933670A (en) * 2019-03-19 2019-06-25 中南大学 A kind of file classification method calculating semantic distance based on combinatorial matrix
CN110020430A (en) * 2019-03-01 2019-07-16 新华三信息安全技术有限公司 A kind of fallacious message recognition methods, device, equipment and storage medium
CN110175224A (en) * 2019-06-03 2019-08-27 安徽大学 Patent recommended method and device based on semantic interlink Heterogeneous Information internet startup disk
CN110795533A (en) * 2019-10-22 2020-02-14 王帅 Long text-oriented theme detection method
CN111680131A (en) * 2020-06-22 2020-09-18 平安银行股份有限公司 Document clustering method and system based on semantics and computer equipment
CN111950264A (en) * 2020-08-05 2020-11-17 广东工业大学 Text data enhancement method and knowledge element extraction method
CN112836010A (en) * 2020-10-22 2021-05-25 长城计算机软件与系统有限公司 Patent retrieval method, storage medium and device
CN113486176A (en) * 2021-07-08 2021-10-08 桂林电子科技大学 News classification method based on secondary feature amplification
WO2023173546A1 (en) * 2022-03-15 2023-09-21 平安科技(深圳)有限公司 Method and apparatus for training text recognition model, and computer device and storage medium
CN111950264B (en) * 2020-08-05 2024-04-26 广东工业大学 Text data enhancement method and knowledge element extraction method

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104881401A (en) * 2015-05-27 2015-09-02 大连理工大学 Patent literature clustering method
CN106372064A (en) * 2016-11-18 2017-02-01 北京工业大学 Characteristic word weight calculating method for text mining
US20170039176A1 (en) * 2015-08-03 2017-02-09 BlackBoiler, LLC Method and System for Suggesting Revisions to an Electronic Document
CN106528588A (en) * 2016-09-14 2017-03-22 厦门幻世网络科技有限公司 Method and apparatus for matching resources for text information
CN106599037A (en) * 2016-11-04 2017-04-26 焦点科技股份有限公司 Recommendation method based on label semantic normalization
CN106844346A (en) * 2017-02-09 2017-06-13 北京红马传媒文化发展有限公司 Short text Semantic Similarity method of discrimination and system based on deep learning model Word2Vec
KR20170094063A (en) * 2016-02-05 2017-08-17 한국과학기술원 Apparatus and method for computing noun similarities using semantic contexts
CN108052593A (en) * 2017-12-12 2018-05-18 山东科技大学 A kind of subject key words extracting method based on descriptor vector sum network structure


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
路永和: "文本分类中受词性影响的特征权重计算方法", 《现代图书情报技术》 *


Also Published As

Publication number Publication date
CN109376352B (en) 2022-11-29

Similar Documents

Publication Publication Date Title
CN109376352A (en) A kind of patent text modeling method based on word2vec and semantic similarity
CN108874878B (en) Knowledge graph construction system and method
CN104765769B (en) The short text query expansion and search method of a kind of word-based vector
CN106844658B (en) Automatic construction method and system of Chinese text knowledge graph
CN104834735B (en) A kind of documentation summary extraction method based on term vector
CN105808526B (en) Commodity short text core word extracting method and device
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
CN102289522B (en) Method of intelligently classifying texts
CN108052593A (en) A kind of subject key words extracting method based on descriptor vector sum network structure
CN102799577B (en) A kind of Chinese inter-entity semantic relation extraction method
CN107193801A (en) A kind of short text characteristic optimization and sentiment analysis method based on depth belief network
CN110020189A (en) A kind of article recommended method based on Chinese Similarity measures
CN103617290B (en) Chinese machine-reading system
CN107992542A (en) A kind of similar article based on topic model recommends method
CN103049569A (en) Text similarity matching method on basis of vector space model
CN103970729A (en) Multi-subject extracting method based on semantic categories
CN101231634A (en) Autoabstract method for multi-document
CN103970730A (en) Method for extracting multiple subject terms from single Chinese text
CN103678275A (en) Two-level text similarity calculation method based on subjective and objective semantics
CN105279264A (en) Semantic relevancy calculation method of document
CN111309925A (en) Knowledge graph construction method of military equipment
CN102054029A (en) Figure information disambiguation treatment method based on social network and name context
CN110362678A (en) A kind of method and apparatus automatically extracting Chinese text keyword
CN111625622B (en) Domain ontology construction method and device, electronic equipment and storage medium
CN105787097A (en) Distributed index establishment method and system based on text clustering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant