CN108804421A - Text similarity analysis method, device, electronic equipment and computer storage media

Info

Publication number: CN108804421A
Application number: CN201810522854.4A
Authority: CN
Inventors: 高影繁; 姚长青; 刘志辉; 崔笛; 李岩; 郑明�
Original assignee: INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA
Current assignee: INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA
Priority date: 2018-05-28
Filing date: 2018-05-28
Publication date: 2018-11-13
Anticipated expiration: 2038-05-28
Also published as: CN108804421B

Abstract

This application relates to the field of text processing and discloses a text similarity analysis method, device, electronic equipment and computer-readable storage medium. The text similarity analysis method includes: determining a first preset number of basic feature words of a target text; expanding each of these basic feature words based on a trained text word vector library, to obtain a second preset number of expansion words for each basic feature word; and determining similar texts of the target text from a preset text database based on each basic feature word, each expansion word and the weight value of each word. The method of the embodiments greatly increases the number of extracted professional terms that characterize the target text and improves the statistical properties of the feature-word frequencies that characterize the target text, so that patents similar to the target text can be selected quickly and accurately from the preset text database, which greatly improves the accuracy of patent similarity analysis.

Description

Text similarity analysis method, device, electronic equipment and computer storage media

Technical field

This application relates to the technical field of identity identification and, specifically, to a text similarity analysis method, device, electronic equipment and computer storage medium.

Background

Text (such as the text of a paper or a patent) is a carrier of natural language and usually exists in unstructured or semi-structured form. With the rapid development of computer network technology, text similarity analysis is widely applied in many fields; in information retrieval, text classification, text clustering and automatic question answering, for example, it is a basic and important task.

Taking patent text as an example, patent similarity analysis first converts the unstructured patent text into structured information that a computer can readily recognize and process, then extracts features from the structured information and analyzes patent similarity based on the extracted features. Common patent similarity analysis methods include patent semantic analysis, patent trees and text mining. Although these methods bring some improvement in analysis quality, the accuracy of patent similarity analysis remains low.

Summary of the invention

The purpose of this application is to solve at least one of the above technical defects, in particular the low accuracy of similarity analysis.

In a first aspect, a text similarity analysis method is provided, including:

Determining a first preset number of basic feature words of a target text;

Expanding each of the first preset number of basic feature words based on a trained text word vector library, to obtain a second preset number of expansion words for each basic feature word;

Determining similar texts of the target text from a preset text database based on each basic feature word, each expansion word and the weight value of each word.

In a second aspect, a text similarity analysis device is provided, including:

A first determining module, configured to determine a first preset number of basic feature words of a target text;

An expansion module, configured to expand each of the first preset number of basic feature words based on a trained text word vector library, to obtain a second preset number of expansion words for each basic feature word;

A second determining module, configured to determine similar texts of the target text from a preset text database based on each basic feature word, each expansion word and the weight value of each word.

In a third aspect, electronic equipment is provided, including a memory, a processor and a computer program stored in the memory and executable on the processor; the processor implements the above text similarity analysis method when executing the program.

In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored; the program implements the above text similarity analysis method when executed by a processor.

The text similarity analysis method provided by the embodiments of the application determines a first preset number of basic feature words of the target text, thereby extracting text feature words that characterize the target text; this is the prerequisite for subsequently expanding each of these basic feature words based on the trained text word vector library. Expanding each of the first preset number of basic feature words based on the trained text word vector library yields a second preset number of expansion words for each basic feature word, which greatly increases the number of extracted professional terms that characterize the target text and improves the statistical properties of the feature-word frequencies that characterize the target text; this lays the foundation for subsequently determining the similar texts of the target text quickly and accurately. Determining the similar texts of the target text from the preset text database based on each basic feature word, each expansion word and the weight value of each word makes it possible to select patents similar to the target text quickly and accurately from the preset text database, and then to identify, from those similar patents, the technical competitors of the enterprise or institution that owns the target text. This greatly improves the accuracy of both patent similarity analysis and patent competitor identification.

Additional aspects and advantages of the application will be set forth in part in the description that follows; they will become apparent from that description or be learned through practice of the application.

Description of the drawings

The above and/or additional aspects and advantages of the application will become apparent and easy to understand from the following description of the embodiments with reference to the accompanying drawings, in which:

Fig. 1 is a flowchart of the text similarity analysis method of an embodiment of the application;

Fig. 2 is a schematic diagram of the weight distribution of text feature words in an embodiment of the application;

Fig. 3 is a schematic diagram of the text similarity analysis process of an embodiment of the application;

Fig. 4 is a schematic diagram of the basic structure of the text similarity analysis device of an embodiment of the application;

Fig. 5 is a schematic diagram of the detailed structure of the text similarity analysis device of an embodiment of the application;

Fig. 6 is a schematic structural diagram of the electronic equipment of an embodiment of the application.

Detailed description

Embodiments of the application are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements, or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the application and are not to be construed as limiting it.

Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "said" and "the" used herein may also include the plural. It should be further understood that the word "comprising" used in this description indicates the presence of the stated features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It should be understood that when an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intervening elements may be present. In addition, "connected" or "coupled" as used herein may include wireless connection or wireless coupling. The term "and/or" as used herein includes all or any unit and all combinations of one or more of the associated listed items.

To make the purpose, technical solutions and advantages of the application clearer, the embodiments of the application are described in further detail below with reference to the drawings.

Patent text is an important carrier for recording scientific research activities and research methods, and an important literature source from which researchers obtain scientific and technological experience and learn about cutting-edge technology in an industry. Given the massive volume of patent resources, automated methods are needed to quickly select the patents similar to those of a given enterprise or institution, and thereby identify that enterprise's or institution's technical competitors. At present, methods that identify enterprise competitors by mining patent data all extract feature words from data such as the patent title and abstract, represent the patent text as a vector using the VSM (Vector Space Model) on the basis of the extracted feature words, and then analyze patent similarity. However, a patent's title and abstract are short, so the statistical properties of the feature-word frequencies that characterize the patented technology are not pronounced, and too few professional terms characterizing the patent are extracted. The VSM vector of the patent text obtained in this way therefore carries insufficient information and has limited ability to characterize the original patent text, which lowers the accuracy of patent similarity analysis and in turn the accuracy of patent competitor identification.

The text similarity analysis method, device, electronic equipment and computer-readable storage medium provided by this application are intended to solve the above technical problems of the prior art.

The technical solution of the application, and how it solves the above technical problems, are described in detail below with specific embodiments. The following specific embodiments may be combined with one another, and the same or similar concepts or processes may not be repeated in some embodiments. The embodiments of the application are described below with reference to the drawings.

Embodiment one

An embodiment of the application provides a text similarity analysis method which, as shown in Fig. 1, includes:

Step S100: determine a first preset number of basic feature words of the target text.

Specifically, a first preset number of feature words of the target text are extracted from text information such as the title and abstract of the target text (for example, a patent text). The first preset number can be set according to actual needs during extraction, for example to 5, 10 or 15; that is, 5, 10, 15 or some other number of feature words are extracted from the title and abstract of the target text. The extracted feature words serve as the basic feature words of the target text, i.e. the professional terms that characterize it.

Step S200: based on the trained text word vector library, expand each of the first preset number of basic feature words, to obtain a second preset number of expansion words for each basic feature word.

Specifically, the title and abstract of the target text are short, so the number of professional basic feature words that can be extracted from them to characterize the target text is very limited and is insufficient to reflect the statistical properties of the feature-word frequencies of the target text. Expanding each of the extracted basic feature words, to obtain a second preset number of expansion words for each of them, greatly increases the number of extracted professional terms that characterize the target text, improves the statistical properties of the feature-word frequencies that characterize the target text, and lays the foundation for subsequently determining the similar texts of the target text quickly and accurately.

Further, the second preset number can be set according to actual needs during expansion. It may be the same as or different from the first preset number; for example, it can be set to 5, 15 or 30, meaning that each basic feature word is expanded into 5, 15, 30 or some other number of expansion words.

For example, when the basic feature word is "installation program" and the second preset number is 6, the expansion words may be "driver", "installation file", "software", "installation package", "configuration file" and "client program".

Step S300: based on each basic feature word, each expansion word and the weight value of each word, determine the similar texts of the target text from the preset text database.

Specifically, based on each basic feature word of the target text, each expansion word and the weight value of each word, the similar texts of the target text are selected quickly and accurately from the large volume of text resources in the preset text database.

For example, when the target text is a patent text titled "air purifier", patents similar to it can be selected quickly and accurately from the patent resources in the preset text database, such as similar patents titled "electronic air cleaner" and "an electric-bag composite dust collector".

Further, after the similar texts of the target text have been determined, clicking a similar text to view its related information yields information such as the enterprise or institution to which the similar patent belongs. From this information, the technical competitors of the enterprise or institution that owns the target text can be identified; for example, the competitor is the enterprise or institution to which the similar patent belongs.

Compared with the prior art, the text similarity analysis method provided by the embodiments of the application determines a first preset number of basic feature words of the target text, thereby extracting text feature words that characterize the target text; this is the prerequisite for subsequently expanding each of these basic feature words based on the trained text word vector library. Expanding each of the first preset number of basic feature words based on the trained text word vector library yields a second preset number of expansion words for each basic feature word, which greatly increases the number of extracted professional terms that characterize the target text and improves the statistical properties of the feature-word frequencies that characterize the target text; this lays the foundation for subsequently determining the similar texts of the target text quickly and accurately. Determining the similar texts of the target text from the preset text database based on each basic feature word, each expansion word and the weight value of each word makes it possible to select patents similar to the target text quickly and accurately from the preset text database, and then to identify, from those similar patents, the technical competitors of the enterprise or institution that owns the target text. This greatly improves the accuracy of both patent similarity analysis and patent competitor identification.

Embodiment two

An embodiment of the application provides another possible implementation which, on the basis of Embodiment one, further includes the method shown in Embodiment two, wherein

In step S100, the first preset number of basic feature words of the target text is determined by the TextRank algorithm.

Specifically, taking the case where the target text is a patent text as an example, step S100 is explained as follows:

Existing methods usually determine the feature words of a patent by word frequency, on top of common steps such as word segmentation and part-of-speech tagging. Words extracted this way include some with high frequency but low professional specificity, so they do not characterize the patent well. To solve this problem, the embodiments of the application use the TextRank algorithm to extract the basic feature words of a patent. The basic feature words extracted in this way are more professionally specific, which lays the foundation for building the VSM model of the patent text.

TextRank is a graph-based ranking algorithm for text whose basic idea derives from Google's PageRank algorithm. It splits the text into component units (such as words or sentences), builds a graph model, and ranks the important components of the text with a voting mechanism, so keyword extraction can be performed using only the information in a single document. Unlike models such as LDA (Latent Dirichlet Allocation, a document topic generation model) and HMM (Hidden Markov Model), TextRank does not need to be trained on multiple documents in advance, and it is widely used because it is simple and effective. TextRank ranks keywords using the relations between local words (the co-occurrence window) and extracts them directly from the text itself.

Further, determining the first preset number of basic feature words of the target text by the TextRank algorithm includes the following steps:

1) Split the given target text into complete sentences;

2) For each sentence, perform word segmentation and part-of-speech tagging, filter out stop words, and keep only words of the specified parts of speech, such as nouns, verbs and adjectives; the retained words are the candidate keywords;

3) Construct the candidate keyword graph G = (V, E), where V is the set of nodes and E is the set of edges. V consists of the candidate keywords generated in 2), and the edge between any two nodes is constructed from the co-occurrence relation: an edge exists between two nodes only when the words corresponding to the two nodes co-occur within a window of length K, where K is the window size, i.e. at most K words co-occur;

4) On the graph G = (V, E) above, iteratively propagate the weight of each node until convergence;

5) Sort the node weights in descending order to obtain the T most important words, which serve as the candidate keywords, i.e. the basic feature words in the embodiments of the application;

6) Mark the T most important words obtained in 5) in the original text; if they form adjacent phrases, combine them into multi-word keywords.
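Steps 1) to 5) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the input sentences are assumed to be already segmented, part-of-speech filtered and stripped of stop words; the graph is undirected and unweighted; and the names textrank_keywords, top_t and window, as well as the damping factor 0.85 (the value commonly used in PageRank-style ranking), are introduced here for illustration only. Step 6) (merging adjacent keywords into phrases) is omitted.

```python
from collections import defaultdict


def textrank_keywords(sentences, top_t=5, window=5, damping=0.85,
                      max_iter=100, tol=1e-6):
    """Rank candidate words with an undirected, unweighted TextRank.

    `sentences` is a list of token lists that are assumed to be already
    segmented, POS-filtered and free of stop words (steps 1 and 2).
    """
    # Step 3: link two words if they co-occur within `window` tokens.
    neighbors = defaultdict(set)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for other in tokens[i + 1:i + window]:
                if other != word:
                    neighbors[word].add(other)
                    neighbors[other].add(word)
    if not neighbors:
        return []

    # Step 4: repeat the TextRank update until the scores stop changing.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(max_iter):
        new_scores = {}
        for w in neighbors:
            rank = sum(scores[n] / len(neighbors[n]) for n in neighbors[w])
            new_scores[w] = (1 - damping) + damping * rank
        delta = max(abs(new_scores[w] - scores[w]) for w in neighbors)
        scores = new_scores
        if delta < tol:
            break

    # Step 5: sort in descending order of score and keep the top T words.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_t]
```

In this sketch, top_t plays the role of the first preset number and window plays the role of K; a production system would add the segmentation, tagging and phrase-merging steps with a tokenizer suited to the language of the target text.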

In the embodiments of the application, the basic feature words of the target text are extracted with the TextRank algorithm. The extracted words are more professionally specific, and no prior training on multiple documents is needed, so the approach is simpler and more efficient.

Embodiment three

An embodiment of the application provides another possible implementation which, on the basis of Embodiment two, further includes the method shown in Embodiment three, wherein

Before step S200, the method further includes step S101 (not shown in the figure): train on the texts in a preset database with a continuous bag-of-words neural network model, to obtain the trained text word vector library.

Step S200 includes step S2001 (not shown in the figure), step S2002 (not shown in the figure) and step S2003 (not shown in the figure), wherein

Step S2001: obtain the first word vector of any basic feature word by querying the trained text word vector library.

Step S2002: calculate the cosine similarity between the first word vector and each second word vector, where the second word vectors are the word vectors in the trained text word vector library other than the first word vector.

Step S2003: determine the words corresponding to a second preset number of second word vectors whose cosine similarity exceeds a first preset threshold, and take them as the expansion words of that basic feature word.

Specifically, the embodiments of the application use deep learning technology to expand the basic feature words. The steps of the method are as follows:

1) Train the text word vector library using the Word2Vec (word vector) method

Representing the words of a text as word vectors is the core technique for introducing deep learning algorithms into natural language processing. Word2vec is an excellent modeling tool for obtaining word vectors that Google open-sourced in 2013; it mainly uses the CBOW (Continuous Bag-of-Words) and Skip-gram models. The embodiments of the application use the more efficient CBOW neural network model to train on the texts in the preset database and obtain the trained text word vector library.
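For reference, the CBOW training objective can be written as follows in its standard full-softmax form. This is a sketch under stated assumptions rather than the applicant's exact training setup: the symbols T (number of training positions), c (context half-width), v and u (input and output word vectors) and \mathcal{W} (the vocabulary) are introduced here for illustration, and practical Word2vec training usually replaces the full softmax with hierarchical softmax or negative sampling.

```latex
\[
\max_{\theta}\ \frac{1}{T}\sum_{t=1}^{T}
  \log p\left(w_t \mid w_{t-c},\dots,w_{t-1},w_{t+1},\dots,w_{t+c}\right),
\]
\[
p\left(w_t \mid \text{context}\right)
  = \frac{\exp\left(u_{w_t}^{\top} h_t\right)}
         {\sum_{w \in \mathcal{W}} \exp\left(u_{w}^{\top} h_t\right)},
\qquad
h_t = \frac{1}{2c}\sum_{\substack{-c \le j \le c \\ j \ne 0}} v_{w_{t+j}}
\]
```

After training, the input vector v_w of each word w is what a word vector library of this kind would store.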

For example, when the texts are patent texts, the embodiments of the application train on about 10 GB of data comprising 20 million patent texts to obtain the trained patent word vector library. The patent texts include text fields such as the patent title and abstract, the generated word vectors have 100 dimensions, and the trained patent word vector library contains about 1 million words and is about 990 MB in size.

2) Expand the basic feature words based on the trained text word vector library

Specifically, when the target text is a patent text, the basic feature words extracted from each patent text are expanded as follows. The first preset number of basic feature words obtained by the TextRank algorithm are looked up one by one in the patent word vector library to obtain the word vector of each basic feature word (the first word vector in step S2001), and the cosine similarity calculation is then carried out. The cosine similarity calculation is: compute the cosine similarity between the word vector of any basic feature word and each of the other word vectors in the patent word vector library (the second word vectors in step S2002), then determine the expansion words of that basic feature word by comparing the cosine similarity values with the first preset threshold and applying the second preset number.

Further, the above cosine similarity calculation is performed for every basic feature word that has been determined, so that the expansion words of each basic feature word are obtained.
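Steps S2001 to S2003 can be sketched in Python as follows. This is a hedged illustration, not the applicant's code: the word vector library is modeled as a plain dictionary mapping words to lists of floats, the function names cosine and expand_word and the parameters top_n and min_sim are hypothetical, and taking the most similar words first when more than top_n words pass the threshold is an assumption, since the text does not say how ties or surpluses are resolved.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_word(base_word, vectors, top_n=6, min_sim=0.5):
    """Return up to `top_n` words whose similarity to `base_word` exceeds `min_sim`.

    `vectors` stands in for the trained word vector library;
    `top_n` is the second preset number and `min_sim` the first preset threshold.
    """
    if base_word not in vectors:
        return []
    first = vectors[base_word]                      # step S2001
    scored = [(cosine(first, vec), word)            # step S2002
              for word, vec in vectors.items() if word != base_word]
    scored = [(s, w) for s, w in scored if s > min_sim]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [w for _, w in scored[:top_n]]           # step S2003
```

A brute-force scan like this is linear in the vocabulary size for every basic feature word; with a library of about a million words, a vectorized or approximate nearest-neighbor search would be the more practical choice.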

For example, when the basic feature words are "installation program", "cheap", "water reuse", "decontamination", "high-speed railway" and "partially fall", and the second preset number is 6, the expansion words obtained for each basic feature word are as shown in Table 1:

Table 1: Basic feature words and their corresponding expansion words

The embodiments of the application give the detailed procedure and operating steps for determining the expansion words of each basic feature word based on the trained text word vector library, so that those skilled in the art can follow these steps to expand the basic feature words quickly and accurately. This greatly increases the number of extracted professional terms that characterize the target text, improves the statistical properties of the feature-word frequencies that characterize the target text, and lays the foundation for subsequently determining the similar texts of the target text quickly and accurately.

Embodiment four

An embodiment of the application provides another possible implementation which, on the basis of Embodiment three, further includes the method shown in Embodiment four, wherein

Before step S300, the method further includes step S201 (not shown in the figure): filter out the stop words among the expansion words of any basic feature word; and/or filter out the words among the expansion words of any basic feature word whose inverse document frequency is below a second preset threshold.

Before step S300, the method further includes step S202 (not shown in the figure): determine the weight value of each word. Determining the weight value of each word includes:

Determining the weight value of any word by the following formula:

w_i = idf_i * (p_tf_i + c_tf_i)

where w_i denotes the weight value of the word, idf_i denotes the inverse document frequency of the word, p_tf_i denotes the frequency of the word in the text title and text abstract of the target text, and c_tf_i denotes the frequency of the word in other texts besides the target text.

Specifically, after steps S2001, S2002 and S2003 have produced the second preset number of expansion words for each basic feature word, the expansion words need to be filtered further. Depending on need, only the stop words may be filtered out, only the words whose inverse document frequency is below the second preset threshold may be filtered out, or both may be filtered out at the same time. Filtering the expansion words in this way lets them characterize the target text better.

For example, when the second preset threshold is 4.0, filtering the expansion words may remove only the stop words, only the words whose inverse document frequency is below 4.0, or both the stop words and the words whose inverse document frequency is below 4.0. The set of words that remains is the set of expansion words of the basic feature words in the embodiments of the application.
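The filtering in step S201 can be sketched as follows. This is a minimal illustration with hypothetical names (filter_expansion_words, stop_words, idf, min_idf); the stop-word list and the table of inverse document frequencies are assumed to be supplied by the caller, and treating a word missing from the table as having an inverse document frequency of 0.0 is an assumption of this sketch, not something the text specifies.

```python
def filter_expansion_words(words, stop_words, idf, min_idf=4.0,
                           drop_stop_words=True, drop_low_idf=True):
    """Filter expansion words by stop-word list and/or IDF threshold.

    The two flags mirror the "and/or" in step S201: either filter can be
    applied on its own, or both together.
    """
    kept = []
    for word in words:
        if drop_stop_words and word in stop_words:
            continue
        # Assumption: a word absent from the IDF table counts as IDF 0.0.
        if drop_low_idf and idf.get(word, 0.0) < min_idf:
            continue
        kept.append(word)
    return kept
```

Here min_idf corresponds to the second preset threshold, 4.0 in the example above.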

Further, assume that the basic feature words and their expansion words obtained through the above steps are w₁, w₂, …, w_N, and that the target text in the above steps is a patent text. The weight value of each word (covering every basic feature word and every expansion word) can then be calculated with formula (1):

w_i = idf_i * (p_tf_i + c_tf_i) (1)

where w_i denotes the weight value of the i-th word, idf_i denotes the inverse document frequency of that word, p_tf_i denotes the frequency of that word in the patent title and abstract, and c_tf_i denotes the frequency with which that word occurs in texts other than patent texts (such as paper texts). In addition, p_tf_i can be calculated as: (number of occurrences of the word in the patent title and patent abstract + 1) / (total number of words among the basic feature words and their expansion words + 1). For words that do not occur in the patent title and abstract, adding 1 provides smoothing.
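Formula (1) and the smoothed p_tf_i can be sketched as follows. This is an illustration under stated assumptions: the function and parameter names (p_tf, word_weight, title_abstract_tokens, vocabulary) are hypothetical, the idf and c_tf tables are assumed to be precomputed elsewhere, and a word absent from the c_tf table is assumed to have frequency 0.0.

```python
def p_tf(word, title_abstract_tokens, vocabulary):
    """Smoothed frequency of `word` in the patent title and abstract.

    `vocabulary` is the list of all basic feature words and expansion words;
    the +1 in numerator and denominator is the smoothing described above.
    """
    count = title_abstract_tokens.count(word)
    return (count + 1) / (len(vocabulary) + 1)


def word_weight(word, title_abstract_tokens, vocabulary, idf, c_tf):
    """Formula (1): w_i = idf_i * (p_tf_i + c_tf_i)."""
    return idf[word] * (p_tf(word, title_abstract_tokens, vocabulary)
                        + c_tf.get(word, 0.0))
```

With this smoothing, an expansion word that never appears in the title or abstract still receives a small positive p_tf_i and therefore a non-zero weight.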

Further, after the weight value w_i of each word has been obtained, the weight values are normalized to give the weight distribution over the words of the patent, as shown in Fig. 2.
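One way to carry out the normalization is sketched below. The text does not state which normalization is used, so dividing each weight by the sum of all weights (so that the weights add up to 1) is an assumption of this sketch, as is the function name normalize_weights.

```python
def normalize_weights(weights):
    """Scale a {word: weight} mapping so that the weights sum to 1.

    Assumption: sum-to-one normalization; other schemes (for example
    dividing by the maximum or by the Euclidean norm) are equally possible.
    """
    total = sum(weights.values())
    if total == 0:
        return {word: 0.0 for word in weights}
    return {word: value / total for word, value in weights.items()}
```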

In the embodiments of the application, filtering out the stop words and the words whose inverse document frequency is below the second preset threshold lets the expansion words characterize the target text better and prevents such words from degrading the accuracy of text similarity analysis. In addition, the method given for determining the weight value of each word lets those skilled in the art determine these weight values quickly, which is the prerequisite for subsequently determining the similar texts of the target text from the preset text database.

Embodiment five

An embodiment of the application provides another possible implementation which, on the basis of Embodiment four, further includes the method shown in Embodiment five, wherein

Step S300 includes step S3001 (not shown in the figure), step S3002 (not shown in the figure), step S3003 (not shown in the figure) and step S3004 (not shown in the figure), wherein

Step S3001: for each of multiple texts to be screened in the preset text database, perform the steps of determining a first preset number of basic feature words, expanding each of these basic feature words based on the trained text word vector library to obtain a second preset number of expansion words for each, and determining the weight value of each word, so as to obtain, for each text to be screened, its basic feature words, the weight values of the basic feature words, the expansion words of the basic feature words and the weight values of the expansion words.

Step S3002: detect whether the basic feature words and expansion words of any text to be screened contain words identical to the basic feature words and expansion words of the target text.

Step S3003: for any text to be screened, if such identical words exist, calculate for each identical word the product of its weight value in the text to be screened and its weight value in the target text, and calculate the sum of these products over all identical words.

Step S3004: among the multiple texts to be screened, select those whose calculated sum of products exceeds a third preset threshold as the similar texts of the target text.

Specifically, a large number of texts such as patents and papers are stored in the pre-set text database. When the Similar Text of the target text is screened from the pre-set text database, the multiple texts to be screened in the pre-set text database are processed by the steps of Embodiments one to four above, namely step S100 (determining the first predetermined number of foundation characteristic words), step S200 (extending each of the first predetermined number of foundation characteristic words based on the text term vector library after training, to obtain the second predetermined number of expansion words corresponding to each foundation characteristic word), step S201 (filtering out the stop words in the expansion words of any foundation characteristic word, and/or filtering out the words in the expansion words of any foundation characteristic word whose reverse document-frequency is less than the second predetermined threshold value) and step S202 (determining the weighted value of each word), so as to obtain, for each text to be screened, its foundation characteristic words, the weighted values of the foundation characteristic words, the expansion words of the foundation characteristic words and the weighted values of the expansion words.

Further, when the Similar Text of the target text is searched for among the texts to be screened in the pre-set text database, the texts to be screened can be traversed according to the foundation characteristic words and expansion words of the target text. Specifically, each text to be screened is checked in turn to detect whether its foundation characteristic words and expansion words contain a word identical to a foundation characteristic word or expansion word of the target text. Texts to be screened that contain no such identical word are filtered out, and only the texts to be screened that share at least one word with the foundation characteristic words and expansion words of the target text are retained for further processing.

Further, when a text to be screened contains words identical to the foundation characteristic words and expansion words of the target text, the product of the weighted value of each identical word in the text to be screened and the weighted value of that word in the target text is calculated. When there are multiple identical words, the products corresponding to these words are added up, that is, the sum of the products over all identical words is calculated; when there is only one identical word, its product is used directly as the final sum of products.
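Steps S3002 to S3004 can be sketched as follows. This is an illustration under stated assumptions rather than the implementation of the application: each text is assumed to be represented as a dictionary mapping its foundation characteristic words and expansion words to weighted values, and the names `product_sum` and `select_similar`, the sample data and the descending sort of the result are all hypothetical.

```python
def product_sum(target_weights, candidate_weights):
    """Sum, over the identical words, of the product of the two weighted values."""
    shared = target_weights.keys() & candidate_weights.keys()
    return sum(target_weights[w] * candidate_weights[w] for w in shared)


def select_similar(target_weights, candidates, threshold):
    """Return (text_id, score) pairs whose sum of products exceeds the threshold."""
    results = []
    for text_id, candidate_weights in candidates.items():
        # Step S3002: skip texts that share no word with the target text.
        if not (target_weights.keys() & candidate_weights.keys()):
            continue
        # Step S3003: sum of products over all identical words.
        score = product_sum(target_weights, candidate_weights)
        # Step S3004: keep texts above the third predetermined threshold value.
        if score > threshold:
            results.append((text_id, score))
    # Sorting by score is a convenience added here, not stated in the source.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Hypothetical weighted values for a target text and three texts to be screened.
target = {"battery": 0.5, "electrode": 0.3, "anode": 0.2}
candidates = {
    "A": {"battery": 0.4, "anode": 0.5, "cell": 0.1},
    "B": {"engine": 0.9, "piston": 0.1},
    "C": {"electrode": 0.1, "resin": 0.9},
}
similar = select_similar(target, candidates, threshold=0.1)
```

In this toy data, text B shares no word with the target text and is filtered out before scoring, while text C shares one word but its sum of products stays below the threshold.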

Further, from the texts to be screened that contain words identical to the foundation characteristic words and expansion words of the target text, the texts closest to the target text are selected as the Similar Text of the target text. Specifically, the texts to be screened whose sum of products is greater than the third predetermined threshold value can be selected as the Similar Text of the target text, and the value of the third predetermined threshold value can be set dynamically according to actual needs. Table 2 shows an example of the relevant information of a target text and its corresponding Similar Text.

Table 2: Target text and its corresponding Similar Text information

Further, combining the methods of Embodiments one to five of the application, Fig. 3 takes a patent text as the target text and gives the basic process of searching for the similar patents of a target patent. In Fig. 3, step S1 (extracting the patent foundation characteristic words based on TextRank) is carried out first, followed by step S2 (determining the deep learning algorithm), step S3 (training the patent term vector library), step S4 (extending the foundation characteristic words based on the patent term vector library), step S5 (filtering the patent characteristic expansion words), step S6 (calculating the patent characteristic word weights) and finally step S7 (outputting the similar patents and the corresponding patentees).

This embodiment of the application gives the detailed process and operating steps for determining the Similar Text of the target text from the pre-set text database based on each foundation characteristic word, each expansion word and the weighted value of each word, so that those skilled in the art can, by following the steps of this embodiment, quickly and accurately select the Similar Text of the target text from the pre-set text database, and then identify, according to the similar patents, the technology competitors of the enterprise or institution that owns the target text.

Embodiment six

Fig. 4 is a structural schematic diagram of a text similarity analysis device provided by an embodiment of the application. As shown in Fig. 4, the text similarity analysis device 40 may include a first determining module 41, an expansion module 42 and a second determining module 43, wherein:

First determining module 41 is used to determine the foundation characteristic word of the first predetermined number of target text；

Expansion module 42 is used to extend each of the first predetermined number of foundation characteristic words based on the text term vector library after training, to obtain the second predetermined number of expansion words corresponding to each foundation characteristic word；

Second determining module 43 is used to determine the Similar Text of the target text from the pre-set text database based on each foundation characteristic word, each expansion word and the weighted value of each word.

Specifically, the first determining module 41 is used to determine the first predetermined number of foundation characteristic words of the target text by the TextRank algorithm.
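For orientation, the keyword extraction performed by the first determining module can be sketched with a simplified, unweighted TextRank. The sketch assumes that the text has already been segmented into tokens and that unwanted tokens have been removed; the function name `textrank_keywords`, the window size, the damping factor and the iteration count are illustrative choices, not values taken from the application.

```python
from collections import defaultdict


def textrank_keywords(tokens, top_n, window=2, damping=0.85, iterations=50):
    """Rank words with a simplified TextRank and return the top_n words."""
    # Build an undirected co-occurrence graph: two words are linked when
    # they appear within `window` positions of each other.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            other = tokens[j]
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Iterate the PageRank-style update a fixed number of times.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        updated = {}
        for word in neighbors:
            rank = sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            updated[word] = (1 - damping) + damping * rank
        scores = updated

    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]


# Hypothetical pre-tokenized text; top_n plays the role of the first
# predetermined number.
tokens = "battery electrode battery separator battery electrolyte battery anode".split()
keywords = textrank_keywords(tokens, top_n=1)
```

A word that co-occurs with many other words, such as "battery" in this toy input, accumulates the highest score and is returned first.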

Further, the device also includes a training module 44, as shown in Fig. 5, wherein the training module 44 is used to train on the texts in a preset database through a continuous bag-of-words (CBOW) neural network model, to obtain the text term vector library after training.

Further, expansion module 42 includes an acquisition submodule 421, a computational submodule 422 and an expansion word determination sub-module 423, as shown in Fig. 5, wherein the acquisition submodule 421 is used to obtain the first term vector of any foundation characteristic word by querying the text term vector library after training；

Computational submodule 422 is used to calculate the cosine similarity value between the first term vector and each second term vector, a second term vector being a term vector in the text term vector library after training other than the first term vector；

Expansion word determination sub-module 423 is used to determine the words corresponding to the second predetermined number of second term vectors whose cosine similarity value is greater than the first predetermined threshold value, and to take these words as the expansion words of the foundation characteristic word.
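The cooperation of the three submodules can be sketched as below. The term vector library is assumed to be a plain dictionary from word to vector, and "the second predetermined number of second term vectors whose cosine similarity value is greater than the first predetermined threshold value" is read here as "the most similar `top_n` words among those above the threshold"; that reading, the names `cosine_similarity` and `expand_word`, and the toy vectors are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def expand_word(word, vectors, top_n, min_similarity):
    """Return up to top_n words whose vectors are closest to that of `word`."""
    first_vector = vectors[word]  # acquisition submodule: look up the first term vector
    scored = []
    for other, second_vector in vectors.items():
        if other == word:
            continue
        # computational submodule: cosine similarity to every other vector
        similarity = cosine_similarity(first_vector, second_vector)
        if similarity > min_similarity:
            scored.append((other, similarity))
    # expansion word determination: keep the most similar candidates
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [candidate for candidate, _ in scored[:top_n]]


# Hypothetical two-dimensional term vectors.
toy_vectors = {
    "battery": [1.0, 0.0],
    "cell": [0.9, 0.1],
    "accumulator": [0.8, 0.3],
    "piston": [0.0, 1.0],
}
expansion = expand_word("battery", toy_vectors, top_n=2, min_similarity=0.5)
```

Comparing against every vector in the library is linear in the vocabulary size, which is acceptable for a sketch; a large library would normally use a vectorized or indexed nearest-neighbor search instead.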

Further, the device also includes a filtering module 45, as shown in Fig. 5, wherein the filtering module 45 is used to filter out the stop words in the expansion words of any foundation characteristic word, and/or to filter out the words in the expansion words of any foundation characteristic word whose reverse document-frequency is less than the second predetermined threshold value.
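A minimal sketch of this filtering is given below for the case where both filters are applied. It assumes that a stop word list and a table of precomputed reverse document-frequency values are available; the name `filter_expansion_words`, the sample values, and the choice to drop words that are missing from the table are assumptions.

```python
def filter_expansion_words(expansion_words, stop_words, idf, min_idf):
    """Drop stop words and words whose reverse document-frequency is below min_idf."""
    kept = []
    for word in expansion_words:
        if word in stop_words:
            continue
        # Assumption: a word without a known value is treated as 0.0 and dropped.
        if idf.get(word, 0.0) < min_idf:
            continue
        kept.append(word)
    return kept


# Hypothetical expansion words, stop word list and reverse document-frequency table;
# min_idf plays the role of the second predetermined threshold value.
expansion_words = ["cell", "the", "device", "accumulator"]
stop_words = {"the", "of", "and"}
idf = {"cell": 2.3, "device": 0.2, "accumulator": 3.1}
filtered = filter_expansion_words(expansion_words, stop_words, idf, min_idf=1.0)
```

Here "the" is removed as a stop word and "device" is removed because it appears in too many documents to discriminate between texts.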

Further, the device also includes a weight determination module 46, as shown in Fig. 5, wherein the weight determination module 46 is used to determine the weighted value of each word, and is specifically used to determine the weighted value of any word by the following formula：

w_i=idf_i*(p_tf_i+c_tf_i)

Wherein, w_i indicates the weighted value, idf_i indicates the reverse document-frequency of any word, p_tf_i indicates the frequency of the word in the text title and text abstract of the target text, and c_tf_i indicates the frequency of the word in the other texts besides the target text.
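The formula can be evaluated directly once the three quantities are known for each word. The sketch below assumes they have already been counted and are supplied as dictionaries; the function names and the sample numbers are hypothetical.

```python
def word_weight(idf, p_tf, c_tf):
    """w_i = idf_i * (p_tf_i + c_tf_i)."""
    return idf * (p_tf + c_tf)


def all_weights(idf, p_tf, c_tf):
    """Apply the formula to every word that has a reverse document-frequency value."""
    return {
        word: word_weight(idf[word], p_tf.get(word, 0), c_tf.get(word, 0))
        for word in idf
    }


# Hypothetical values for two words.
idf = {"battery": 2.0, "electrode": 1.5}
p_tf = {"battery": 3}                   # frequency p_tf_i
c_tf = {"battery": 5, "electrode": 4}   # frequency c_tf_i
weights = all_weights(idf, p_tf, c_tf)
```

A word with no recorded frequency in one of the two tables contributes 0 for that term, so "electrode" receives 1.5 * (0 + 4) in this example.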

Further, the second determining module 43 includes a pretreatment submodule 431, a detection sub-module 432, a product computational submodule 433 and a screening submodule 434, wherein

Pretreatment submodule 431 is used to perform, for each of the multiple texts to be screened in the pre-set text database, the steps of obtaining the first predetermined number of foundation characteristic words, extending each of these foundation characteristic words based on the text term vector library after training to obtain the second predetermined number of expansion words corresponding to each foundation characteristic word, and determining the weighted value of each word, so as to obtain, for each text to be screened, its foundation characteristic words, the weighted values of the foundation characteristic words, the expansion words of the foundation characteristic words and the weighted values of the expansion words；

Detection sub-module 432 is used to detect whether the foundation characteristic words and expansion words of any text to be screened contain a word identical to a foundation characteristic word or expansion word of the target text；

Product computational submodule 433 is used, for any text to be screened, if identical words exist, to calculate for each identical word the product of its weighted value in the text to be screened and its weighted value in the target text, and to calculate the sum of the products over all identical words；

Screening submodule 434 is used to select, from the multiple texts to be screened, the texts to be screened whose calculated sum of products is greater than the third predetermined threshold value as the Similar Text of the target text.

Compared with the prior art, the device provided by the embodiments of the application determines the first predetermined number of foundation characteristic words of the target text, thereby extracting text feature words that can characterize the target text and providing the precondition for subsequently extending these foundation characteristic words based on the text term vector library after training. Extending each of the first predetermined number of foundation characteristic words based on the text term vector library after training, to obtain the second predetermined number of expansion words corresponding to each foundation characteristic word, greatly expands the number of extracted professional terms that can characterize the target text and effectively improves the statistical properties of the text feature word frequencies that characterize the target text, laying the foundation for subsequently determining the Similar Text of the target text quickly and accurately. Determining the Similar Text of the target text from the pre-set text database based on each foundation characteristic word, each expansion word and the weighted value of each word allows the similar patents of the target text to be selected from the pre-set text database quickly and accurately, and the technology competitors of the enterprise or institution that owns the target text to be identified according to the similar patents, which greatly improves the accuracy of patent similarity analysis and of patent competitor identification.

Embodiment seven

The embodiment of the application provides an electronic equipment. As shown in Fig. 6, the electronic equipment 600 includes a processor 601 and a memory 603, wherein the processor 601 is connected with the memory 603, for example through a bus 602. Further, the electronic equipment 600 can also include a transceiver 604. It should be noted that in practical applications the transceiver 604 is not limited to one, and the structure of the electronic equipment 600 does not constitute a restriction on the embodiments of the application.

Wherein, the processor 601 is applied in the embodiment of the application to realize the functions of the first determining module, the expansion module and the second determining module shown in Fig. 4. The transceiver 604 includes a receiver and a transmitter, and is applied in the embodiment of the application to realize the function of the acquisition submodule shown in Fig. 5.

Processor 601 can be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various illustrative logic blocks, modules and circuits described in connection with the disclosure of the application. Processor 601 can also be a combination that realizes computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.

Bus 602 may include a path that transfers information between the above components. Bus 602 can be a PCI bus, an EISA bus, or the like. Bus 602 can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in Fig. 6, but this does not mean that there is only one bus or one type of bus.

Memory 603 can be a ROM or another type of static storage device that can store static information and instructions, a RAM or another type of dynamic storage device that can store information and instructions, or an EEPROM, a CD-ROM or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.

Memory 603 is used to store the application code that executes the scheme of the application, and its execution is controlled by the processor 601. Processor 601 is used to execute the application code stored in the memory 603, to realize the actions of the text similarity analysis device provided by the embodiment shown in Fig. 4.

The embodiment of the application provides a computer readable storage medium on which a computer program is stored, and the method shown in Embodiment one is realized when the program is executed by a processor. Compared with the prior art, the first predetermined number of foundation characteristic words of the target text are determined, thereby extracting text feature words that can characterize the target text and providing the precondition for subsequently extending these foundation characteristic words based on the text term vector library after training. Extending each of the first predetermined number of foundation characteristic words based on the text term vector library after training, to obtain the second predetermined number of expansion words corresponding to each foundation characteristic word, greatly expands the number of extracted professional terms that can characterize the target text and effectively improves the statistical properties of the text feature word frequencies that characterize the target text, laying the foundation for subsequently determining the Similar Text of the target text quickly and accurately. Determining the Similar Text of the target text from the pre-set text database based on each foundation characteristic word, each expansion word and the weighted value of each word allows the similar patents of the target text to be selected from the pre-set text database quickly and accurately, and the technology competitors of the enterprise or institution that owns the target text to be identified according to the similar patents, which greatly improves the accuracy of patent similarity analysis and of patent competitor identification.

The computer readable storage medium provided by the embodiments of the application is applicable to any embodiment of the above method, and details are not repeated here.

It should be understood that although the steps in the flow charts of the accompanying drawings are shown in sequence as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless expressly stated herein, there is no strict restriction on the order in which these steps are executed, and they can be executed in other orders. Moreover, at least some of the steps in the flow charts may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same moment but can be executed at different times, and they are not necessarily executed one after another but can be executed in turn or alternately with other steps, or with at least part of the sub-steps or stages of other steps.

The above are only some embodiments of the application. It should be noted that those of ordinary skill in the art can make several improvements and modifications without departing from the principle of the application, and these improvements and modifications shall also be regarded as falling within the protection scope of the application.

Claims

1. A text similarity analysis method, characterized by comprising：

Determining the first predetermined number of foundation characteristic words of a target text；

Extending each of the first predetermined number of foundation characteristic words based on the text term vector library after training, to obtain the second predetermined number of expansion words corresponding to each foundation characteristic word；

Determining the Similar Text of the target text from a pre-set text database based on each foundation characteristic word, each expansion word and the weighted value of each word.

2. The method according to claim 1, characterized in that determining the first predetermined number of foundation characteristic words of the target text comprises：

Determining the first predetermined number of foundation characteristic words of the target text by the TextRank algorithm.

3. The method according to claim 1, characterized in that, before extending each of the first predetermined number of foundation characteristic words based on the text term vector library after training, the method further comprises：

Training on the texts in a preset database through a continuous bag-of-words neural network model, to obtain the text term vector library after training.

4. The method according to any one of claims 1-3, characterized in that extending each of the first predetermined number of foundation characteristic words based on the text term vector library after training, to obtain the second predetermined number of expansion words corresponding to each foundation characteristic word, comprises：

Obtaining the first term vector of any foundation characteristic word by querying the text term vector library after training；

Calculating the cosine similarity value between the first term vector and each second term vector, a second term vector being a term vector in the text term vector library after training other than the first term vector；

Determining the words corresponding to the second predetermined number of second term vectors whose cosine similarity value is greater than the first predetermined threshold value, and taking these words as the expansion words of the foundation characteristic word.

5. The method according to claim 1, characterized in that, before determining the Similar Text of the target text from the pre-set text database based on each foundation characteristic word, each expansion word and the weighted value of each word, the method further comprises：

Filtering out the stop words in the expansion words of any foundation characteristic word；and/or

Filtering out the words in the expansion words of any foundation characteristic word whose reverse document-frequency is less than the second predetermined threshold value.

6. The method according to claim 5, characterized in that, before determining the Similar Text of the target text from the pre-set text database based on each foundation characteristic word, each expansion word and the weighted value of each word, the method further comprises：determining the weighted value of each word；

Wherein determining the weighted value of each word comprises：

Determining the weighted value of any word by the following formula：

w_i=idf_i*(p_tf_i+c_tf_i)

Wherein, w_i indicates the weighted value, idf_i indicates the reverse document-frequency of any word, p_tf_i indicates the frequency of the word in the text title and text abstract of the target text, and c_tf_i indicates the frequency of the word in the other texts besides the target text.

7. The method according to claim 6, characterized in that determining the Similar Text of the target text from the pre-set text database based on each foundation characteristic word, each expansion word and the weighted value of each word comprises：

Performing, for each of multiple texts to be screened in the pre-set text database, the steps of obtaining the first predetermined number of foundation characteristic words, extending each of these foundation characteristic words based on the text term vector library after training to obtain the second predetermined number of expansion words corresponding to each foundation characteristic word, and determining the weighted value of each word, so as to obtain, for each text to be screened, its foundation characteristic words, the weighted values of the foundation characteristic words, the expansion words of the foundation characteristic words and the weighted values of the expansion words；

Detecting whether the foundation characteristic words and expansion words of any text to be screened contain a word identical to a foundation characteristic word or expansion word of the target text；

For any text to be screened, if identical words exist, calculating for each identical word the product of its weighted value in the text to be screened and its weighted value in the target text, and calculating the sum of the products over all identical words；

Selecting, from the multiple texts to be screened, the texts to be screened whose calculated sum of products is greater than the third predetermined threshold value as the Similar Text of the target text.

8. A text similarity analysis device, characterized by comprising：

A first determining module, used to determine the first predetermined number of foundation characteristic words of a target text；

An expansion module, used to extend each of the first predetermined number of foundation characteristic words based on the text term vector library after training, to obtain the second predetermined number of expansion words corresponding to each foundation characteristic word；

A second determining module, used to determine the Similar Text of the target text from a pre-set text database based on each foundation characteristic word, each expansion word and the weighted value of each word.

9. The device according to claim 8, characterized in that the first determining module is specifically used to determine the first predetermined number of foundation characteristic words of the target text by the TextRank algorithm.

10. The device according to claim 8, characterized by further comprising a training module；

The training module is used to train on the texts in a preset database through a continuous bag-of-words neural network model, to obtain the text term vector library after training.

11. The device according to any one of claims 8-10, characterized in that the expansion module comprises an acquisition submodule, a computational submodule and an expansion word determination sub-module；

The acquisition submodule is used to obtain the first term vector of any foundation characteristic word by querying the text term vector library after training；

The computational submodule is used to calculate the cosine similarity value between the first term vector and each second term vector, a second term vector being a term vector in the text term vector library after training other than the first term vector；

The expansion word determination sub-module is used to determine the words corresponding to the second predetermined number of second term vectors whose cosine similarity value is greater than the first predetermined threshold value, and to take these words as the expansion words of the foundation characteristic word.

12. The device according to claim 8, characterized by further comprising a filtering module；

The filtering module is used to filter out the stop words in the expansion words of any foundation characteristic word；and/or to filter out the words in the expansion words of any foundation characteristic word whose reverse document-frequency is less than the second predetermined threshold value.

13. The device according to claim 12, characterized by further comprising a weight determination module；

The weight determination module is used to determine the weighted value of each word, and is specifically used to determine the weighted value of any word by the following formula：

w_i=idf_i*(p_tf_i+c_tf_i)

14. The device according to claim 13, characterized in that the second determining module comprises a pretreatment submodule, a detection sub-module, a product computational submodule and a screening submodule；

The pretreatment submodule is used to perform, for each of multiple texts to be screened in the pre-set text database, the steps of obtaining the first predetermined number of foundation characteristic words, extending each of these foundation characteristic words based on the text term vector library after training to obtain the second predetermined number of expansion words corresponding to each foundation characteristic word, and determining the weighted value of each word, so as to obtain, for each text to be screened, its foundation characteristic words, the weighted values of the foundation characteristic words, the expansion words of the foundation characteristic words and the weighted values of the expansion words；

The detection sub-module is used to detect whether the foundation characteristic words and expansion words of any text to be screened contain a word identical to a foundation characteristic word or expansion word of the target text；

The product computational submodule is used, for any text to be screened, if identical words exist, to calculate for each identical word the product of its weighted value in the text to be screened and its weighted value in the target text, and to calculate the sum of the products over all identical words；

The screening submodule is used to select, from the multiple texts to be screened, the texts to be screened whose calculated sum of products is greater than the third predetermined threshold value as the Similar Text of the target text.

15. An electronic equipment, comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, characterized in that the processor realizes the text similarity analysis method according to any one of claims 1-7 when executing the program.

16. A computer readable storage medium, characterized in that a computer program is stored on the computer readable storage medium, and the text similarity analysis method according to any one of claims 1-7 is realized when the program is executed by a processor.