CN108804421A - Text similarity analysis method, device, electronic equipment and computer storage media - Google Patents

Info

Publication number
CN108804421A
Authority
CN
China
Prior art keywords
word
text
foundation characteristic
expansion
term vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810522854.4A
Other languages
Chinese (zh)
Other versions
CN108804421B (en
Inventor
高影繁
姚长青
刘志辉
崔笛
李岩
郑明�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA
Original Assignee
INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA
Priority to CN201810522854.4A
Publication of CN108804421A
Application granted
Publication of CN108804421B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/18 Legal services
    • G06Q50/184 Intellectual property management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Technology Law (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Tourism & Hospitality (AREA)
  • Probability & Statistics with Applications (AREA)
  • Operations Research (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application relates to the field of text processing and discloses a text similarity analysis method, device, electronic equipment and computer-readable storage medium. The text similarity analysis method includes: determining a first predetermined number of foundation characteristic words of a target text; then, based on a trained text term vector library, expanding each of the first predetermined number of foundation characteristic words to obtain a second predetermined number of expansion words for each foundation characteristic word; and then, based on the foundation characteristic words, the expansion words and the weight of each word, determining similar texts of the target text from a preset text database. The method of the embodiments of this application greatly increases the number of extracted professional words that can characterize the target text and effectively improves the statistical properties of the frequencies of the text feature words that characterize the target text, so that similar patents of the target text can be screened out of the preset text database quickly and accurately, which greatly improves the accuracy of patent similarity analysis.

Description

Text similarity analysis method, device, electronic equipment and computer storage media
Technical field
This application relates to the technical field of text processing, and in particular to a text similarity analysis method, device, electronic equipment and computer storage medium.
Background
As a carrier of natural language, text (such as paper text or patent text) usually exists in an unstructured or semi-structured form. With the rapid development of computer and Internet technology, text similarity analysis is widely applied in many fields; for example, in information retrieval, text classification, text clustering and automatic question answering, text similarity analysis is a basic and important task.
Taking patent text as an example, patent similarity analysis first converts the unstructured patent text into structured information that a computer can readily identify and process, then extracts features from the structured information, and performs the similarity analysis of patents based on the extracted features. Common patent similarity analysis methods include patent semantic analysis, patent trees and text mining. Although these methods have improved analysis quality to some extent, patent similarity analysis still suffers from low accuracy.
Summary of the invention
The purpose of this application is to solve at least one of the above technical defects, in particular the technical defect of low similarity analysis accuracy.
In a first aspect, a text similarity analysis method is provided, including:
determining a first predetermined number of foundation characteristic words of a target text;
based on a trained text term vector library, expanding each of the first predetermined number of foundation characteristic words to obtain a second predetermined number of expansion words for each foundation characteristic word;
based on the foundation characteristic words, the expansion words and the weight of each word, determining similar texts of the target text from a preset text database.
In a second aspect, a text similarity analysis device is provided, including:
a first determining module, configured to determine a first predetermined number of foundation characteristic words of a target text;
an expansion module, configured to expand, based on a trained text term vector library, each of the first predetermined number of foundation characteristic words to obtain a second predetermined number of expansion words for each foundation characteristic word;
a second determining module, configured to determine similar texts of the target text from a preset text database based on the foundation characteristic words, the expansion words and the weight of each word.
In a third aspect, electronic equipment is provided, including a memory, a processor and a computer program that is stored in the memory and can run on the processor, where the processor implements the above text similarity analysis method when executing the program.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, where the program implements the above text similarity analysis method when executed by a processor.
In the text similarity analysis method provided by the embodiments of this application, a first predetermined number of foundation characteristic words of the target text are determined, so that text feature words capable of characterizing the target text are extracted; this provides the precondition for subsequently expanding each of the first predetermined number of foundation characteristic words based on the trained text term vector library. Based on the trained text term vector library, each of the first predetermined number of foundation characteristic words is expanded to obtain a second predetermined number of expansion words for each foundation characteristic word; this greatly increases the number of extracted professional words that can characterize the target text, effectively improves the statistical properties of the frequencies of the text feature words that characterize the target text, and lays the foundation for subsequently determining similar texts of the target text quickly and accurately. Based on the foundation characteristic words, the expansion words and the weight of each word, similar texts of the target text are determined from the preset text database, so that similar patents of the target text are screened out of the preset text database quickly and accurately, and the technology competitors of the enterprise or institution that owns the target text are then identified from the similar patents; this greatly improves the accuracy of patent similarity analysis and of patent competitor identification.
Additional aspects and advantages of this application will be set forth in part in the following description, will in part become apparent from that description, or will be learned through practice of this application.
Brief description of the drawings
The above and/or additional aspects and advantages of this application will become apparent and easy to understand from the following description of the embodiments in conjunction with the accompanying drawings, in which:
Fig. 1 is the flow diagram of the text similarity analysis method of the embodiment of the present application;
Fig. 2 is the weight distribution schematic diagram of the text feature word of the embodiment of the present application;
Fig. 3 is the schematic diagram of the text similarity analysis process of the embodiment of the present application;
Fig. 4 is the basic structure schematic diagram of the text similarity analysis device of the embodiment of the present application;
Fig. 5 is the detailed structure schematic diagram of the text similarity analysis device of the embodiment of the present application;
Fig. 6 is the structural schematic diagram of the electronic equipment of the embodiment of the present application.
Detailed description
Embodiments of this application are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain this application, and shall not be construed as limiting it.
Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "said" and "the" used herein may also include the plural forms. It should be further understood that the word "comprising" used in the description of this application refers to the presence of the stated features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It should be understood that when an element is referred to as being "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intervening elements may be present. In addition, "connected" or "coupled" as used herein may include wireless connection or wireless coupling. The wording "and/or" as used herein includes all or any unit of, and all combinations of, one or more of the associated listed items.
To make the purpose, technical solutions and advantages of this application clearer, the embodiments of this application are described in further detail below with reference to the accompanying drawings.
As an important carrier for recording scientific research activities and research methods, patent text is an important literature source from which researchers obtain scientific and technological experience and learn about cutting-edge technology in an industry. Faced with massive patent resources, automated methods are needed to quickly screen out the similar patents of a given enterprise or institution and then identify its technology competitors. At present, methods that identify enterprise competitors by mining patent data all extract feature words from data such as the title and abstract of a patent, use a VSM (Vector Space Model) to represent the patent text as a vector on the basis of the extracted feature words, and then perform patent similarity analysis. However, the title and abstract of a patent are short, so the statistical properties of the frequencies of the text feature words that characterize the patented technology are not pronounced, and too few professional words that can characterize the patent are extracted. The VSM vector of the patent text obtained on this basis therefore carries insufficient information and has limited ability to characterize the original patent text, which lowers the accuracy of the patent correlation analysis results and in turn affects the accuracy of patent competitor identification.
The text similarity analysis method, device, electronic equipment and computer-readable storage medium provided by this application are intended to solve the above technical problems of the prior art.
The technical solution of this application, and how it solves the above technical problems, are described in detail below with specific embodiments. The following specific embodiments can be combined with one another, and identical or similar concepts or processes may not be repeated in some embodiments. The embodiments of this application are described below with reference to the accompanying drawings.
Embodiment one
This embodiment of the application provides a text similarity analysis method which, as shown in Fig. 1, includes:
Step S100: determine a first predetermined number of foundation characteristic words of the target text.
Specifically, a first predetermined number of feature words of the target text are extracted from text information such as the title and abstract of the target text (for example, a patent text). The first predetermined number can be set according to actual needs during extraction, for example to 5, 10 or 15; that is, 5, 10, 15 or some other number of feature words are extracted from the title and abstract of the target text, and the extracted feature words are used as the foundation characteristic words of the target text, i.e. the professional words that characterize the target text.
Step S200: based on the trained text term vector library, expand each of the first predetermined number of foundation characteristic words to obtain a second predetermined number of expansion words for each foundation characteristic word.
Specifically, the title and abstract of the target text are short, so the number of professional foundation characteristic words that can be extracted from them to characterize the target text is extremely limited and is not enough to reflect the statistical properties of the frequencies of the text feature words that characterize the target text. Expanding each of the extracted first predetermined number of foundation characteristic words to obtain a second predetermined number of expansion words for each foundation characteristic word greatly increases the number of extracted professional words that can characterize the target text, effectively improves the statistical properties of the frequencies of the text feature words that characterize the target text, and lays the foundation for subsequently determining similar texts of the target text quickly and accurately.
Further, the second predetermined number can be set according to actual needs during expansion, and it may be the same as or different from the first predetermined number. For example, the second predetermined number can be set to 5, 15 or 30; that is, each foundation characteristic word is expanded to obtain 5, 15, 30 or some other number of expansion words for that foundation characteristic word.
As an example, when the foundation characteristic word is "installation program" and the second predetermined number is 6, the expansion words can be "driver", "installation file", "software", "installation package", "configuration file" and "client program".
Step S300: based on the foundation characteristic words, the expansion words and the weight of each word, determine similar texts of the target text from the preset text database.
Specifically, based on the foundation characteristic words of the target text, the expansion words and the weight of each word, similar texts of the target text are screened out quickly and accurately from the large number of text resources in the preset text database.
As an example, when the target text is a patent text and the patent is titled "air purifier", similar patents of that patent can be screened out quickly and accurately from the patent resources in the preset text database, for example similar patents titled "electronic air cleaner" or "an electric-bag composite dust collector".
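Step S300 can be pictured with a minimal Python sketch. The text above does not spell out how the weighted words are matched against the database, so the sketch assumes that the target text and every database text are represented as word-to-weight mappings (in the spirit of the VSM mentioned in this description) and are compared by cosine similarity. The function names `cosine_similarity` and `find_similar_texts` and the toy weights are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word -> weight mappings."""
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_similar_texts(target_weights, database, top_n=2):
    """Return the titles of up to `top_n` database texts most similar to the target.

    `database` maps a text title to its word -> weight mapping (assumed layout).
    Texts that share no weighted word with the target are skipped.
    """
    scored = [(cosine_similarity(target_weights, doc), title)
              for title, doc in database.items()]
    scored = [(score, title) for score, title in scored if score > 0.0]
    # Highest similarity first; ties are broken alphabetically by title.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [title for _, title in scored[:top_n]]


# Toy data: weights are invented for illustration only.
target = {"air": 0.5, "purifier": 0.3, "filter": 0.2}
database = {
    "electronic air cleaner": {"air": 0.6, "cleaner": 0.4},
    "electric-bag composite dust collector": {"dust": 0.7, "filter": 0.3},
    "bicycle brake": {"brake": 1.0},
}
similar = find_similar_texts(target, database, top_n=2)
```

In this toy run the "bicycle brake" text shares no word with the target and is never returned, whatever value `top_n` takes.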
Further, after the similar texts of the target text are determined, the related information of a similar text can be viewed by clicking on it, which yields information such as the enterprise or institution to which the similar patent belongs. From this information, the technology competitors of the enterprise or institution that owns the target text can be identified; for example, the competitor is the enterprise or institution to which the similar patent belongs.
Compared with the prior art, in the text similarity analysis method provided by this embodiment of the application, a first predetermined number of foundation characteristic words of the target text are determined, so that text feature words capable of characterizing the target text are extracted; this provides the precondition for subsequently expanding each of the first predetermined number of foundation characteristic words based on the trained text term vector library. Based on the trained text term vector library, each of the first predetermined number of foundation characteristic words is expanded to obtain a second predetermined number of expansion words for each foundation characteristic word; this greatly increases the number of extracted professional words that can characterize the target text, effectively improves the statistical properties of the frequencies of the text feature words that characterize the target text, and lays the foundation for subsequently determining similar texts of the target text quickly and accurately. Based on the foundation characteristic words, the expansion words and the weight of each word, similar texts of the target text are determined from the preset text database, so that similar patents of the target text are screened out of the preset text database quickly and accurately, and the technology competitors of the enterprise or institution that owns the target text are then identified from the similar patents; this greatly improves the accuracy of patent similarity analysis and of patent competitor identification.
Embodiment two
This embodiment of the application provides another possible implementation which, on the basis of embodiment one, further includes the method shown in embodiment two, wherein
in step S100, the first predetermined number of foundation characteristic words of the target text are determined by the TextRank algorithm.
Specifically, taking a patent text as the target text, step S100 is explained as follows:
Existing methods usually determine the feature words of a patent by word frequency on the basis of common word segmentation, part-of-speech tagging and similar methods. Because these methods can extract words that have a high frequency but are poorly specialized, the extracted words do not characterize the patent well. To solve this problem, this embodiment uses the TextRank algorithm to extract the foundation characteristic words of the patent; the foundation characteristic words extracted in this way are more strongly specialized and lay the foundation for building the patent text VSM model.
TextRank is a graph-based ranking algorithm for text whose basic idea derives from Google's PageRank algorithm. It splits the text into component units (such as words or sentences), builds a graph model, and uses a voting mechanism to rank the important components of the text, so keyword extraction can be achieved using only the information in a single document. Unlike models such as LDA (Latent Dirichlet Allocation, a document topic generation model) and HMM (Hidden Markov Model), TextRank does not need to be trained on multiple documents in advance, and it is widely used because it is simple and effective. The TextRank algorithm ranks candidate keywords using the relations between local words (a co-occurrence window) and extracts them directly from the text itself.
Further, determining the first predetermined number of foundation characteristic words of the target text by the TextRank algorithm includes the following steps:
1) split the given target text into complete sentences;
2) for each sentence, perform word segmentation and part-of-speech tagging, filter out stop words, and retain only words of the specified parts of speech, such as nouns, verbs and adjectives; the retained words are the candidate keywords;
3) build the candidate keyword graph G = (V, E), where V is the set of nodes and E is the set of edges. The node set V consists of the candidate keywords generated in 2); an edge between any two nodes is then constructed using the co-occurrence relation: an edge exists between two nodes only when the words corresponding to the two nodes co-occur within a window of length K, where K is the window size, i.e. at most K words co-occur;
4) iteratively propagate the weight of each node over the graph G = (V, E) built above until convergence;
5) sort the node weights in descending order to obtain the T most important words as the candidate keywords, i.e. the foundation characteristic words in this embodiment;
6) mark the T most important words obtained in 5) in the original text and, if they form adjacent phrases, combine them into multi-word keywords.
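Steps 3) to 5) above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in this application: it assumes the sentences have already been segmented and filtered as in steps 1) and 2), it uses the common unweighted update score(v) = (1 - d) + d * Σ score(u) / degree(u) over the neighbours u of v with damping factor d = 0.85, and it omits the phrase merging of step 6). The function name `textrank_keywords` and its parameters are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def textrank_keywords(sentences, top_t=5, window=5, damping=0.85,
                      max_iter=100, tol=1e-6):
    """Return the `top_t` highest-ranked words from pre-filtered token lists."""
    # Step 3: undirected co-occurrence graph; two words are linked when they
    # appear together inside a sliding window of `window` tokens.
    neighbors = defaultdict(set)
    for tokens in sentences:
        for start in range(len(tokens)):
            span = tokens[start:start + window]
            for a, b in combinations(span, 2):
                if a != b:
                    neighbors[a].add(b)
                    neighbors[b].add(a)
    if not neighbors:
        return []

    # Step 4: propagate node weights until the largest change is below `tol`
    # (or `max_iter` iterations have run).
    scores = {word: 1.0 for word in neighbors}
    for _ in range(max_iter):
        new_scores = {}
        for word in neighbors:
            rank = sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            new_scores[word] = (1 - damping) + damping * rank
        delta = max(abs(new_scores[w] - scores[w]) for w in scores)
        scores = new_scores
        if delta < tol:
            break

    # Step 5: sort by weight in descending order (ties broken alphabetically).
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_t]


# Toy input: "filter" co-occurs with every other word, so it ranks first.
toy_sentences = [["filter", "air"], ["filter", "dust"], ["filter", "fan"]]
top_words = textrank_keywords(toy_sentences, top_t=2, window=2)
```

Here `top_t` plays the role of T (the first predetermined number) and `window` the role of K.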
In this embodiment of the application, the foundation characteristic words of the target text are extracted with the TextRank algorithm; the extracted words are more strongly specialized, and no training on multiple documents is needed in advance, so the method is simpler and more efficient.
Embodiment three
This embodiment of the application provides another possible implementation which, on the basis of embodiment two, further includes the method shown in embodiment three, wherein
before step S200, the method further includes step S101 (not marked in the figure): training on the texts in a preset database with a continuous bag-of-words neural network model to obtain the trained text term vector library.
Step S200 includes step S2001 (not marked in the figure), step S2002 (not marked in the figure) and step S2003 (not marked in the figure), wherein
Step S2001: obtain the first term vector of any foundation characteristic word by querying the trained text term vector library.
Step S2002: calculate the cosine similarity between the first term vector and each second term vector, where the second term vectors are the term vectors in the trained text term vector library other than the first term vector.
Step S2003: determine the words corresponding to a second predetermined number of second term vectors whose cosine similarity is greater than a first predetermined threshold, and use them as the expansion words of that foundation characteristic word.
Specifically, this embodiment of the application expands the foundation characteristic words using deep learning technology, with the following method steps:
1) Train the text term vector library with the Word2Vec (term vector) method
Representing the words in a text as term vectors is a core technique for introducing deep learning algorithms into natural language processing. Word2vec is an excellent modeling tool for obtaining term vectors that Google open-sourced in 2013; it mainly uses the CBOW (Continuous Bag-of-Words) and Skip-gram models. This embodiment uses the more efficient CBOW neural network model to train on the texts in the preset database and obtain the trained text term vector library.
As an example, when the texts are patent texts, this embodiment trains on about 20 million patent texts totalling about 10 GB to obtain the trained patent term vector library, where each patent text includes text fields such as the patent title and abstract, the dimension of the generated term vectors is 100, and the trained patent term vector library contains about 1 million words and is about 990 MB in size.
2) Expand the foundation characteristic words based on the trained text term vector library
Specifically, when the target text is a patent text, the foundation characteristic words extracted from each patent text are expanded as follows: for each of the first predetermined number of foundation characteristic words obtained above by the TextRank algorithm, the patent term vector library is queried to obtain the term vector of that foundation characteristic word (i.e. the first term vector in step S2001), and the cosine similarity calculation is then carried out. In the cosine similarity calculation, the cosine similarity is computed between the term vector of a foundation characteristic word and every other term vector in the patent term vector library (i.e. the second term vectors in step S2002), and the expansion words of that foundation characteristic word are determined by comparing the cosine similarity values with the first predetermined threshold and by the second predetermined number.
Further, the above cosine similarity calculation is performed for every determined foundation characteristic word, so that the expansion words of each foundation characteristic word are determined.
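The expansion of steps S2001 to S2003 can be sketched in Python as below. This is a minimal illustration under stated assumptions: the trained term vector library is modelled as a plain dictionary from word to vector, and when more than the second predetermined number of words exceed the threshold, the sketch keeps the ones with the highest cosine similarity, a selection rule that the text above does not state explicitly. The names `cosine`, `expand_word` and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_word(word, vectors, max_expansions, threshold):
    """Return up to `max_expansions` words whose similarity to `word` exceeds `threshold`."""
    if word not in vectors:
        return []
    first = vectors[word]  # S2001: look up the first term vector
    # S2002: compare against every other (second) term vector in the library.
    scored = [(cosine(first, vec), other)
              for other, vec in vectors.items() if other != word]
    # S2003: keep words above the threshold, most similar first.
    kept = sorted(((s, o) for s, o in scored if s > threshold),
                  key=lambda item: (-item[0], item[1]))
    return [other for _, other in kept[:max_expansions]]


# Toy two-dimensional vectors, invented for illustration only.
toy_vectors = {
    "installation program": [1.0, 0.0],
    "driver": [0.9, 0.1],
    "software": [0.8, 0.3],
    "high-speed railway": [0.0, 1.0],
}
expansions = expand_word("installation program", toy_vectors,
                         max_expansions=2, threshold=0.5)
```

A real library of about one million 100-dimensional vectors would call for a vectorized or indexed nearest-neighbour search rather than this linear scan.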
As an example, when the foundation characteristic words are "installation program", "cheap", "water reuse", "decontamination", "high-speed railway" and "partially fall", and the second predetermined number is 6, the expansion words obtained for each foundation characteristic word are as shown in Table 1:
Table 1: Foundation characteristic words and their corresponding expansion words
This embodiment of the application gives the detailed process and operating steps for determining the expansion words of each foundation characteristic word based on the trained text term vector library, so that those skilled in the art can complete the expansion of the foundation characteristic words quickly and accurately by following the steps in this embodiment. This greatly increases the number of extracted professional words that can characterize the target text, effectively improves the statistical properties of the frequencies of the text feature words that characterize the target text, and lays the foundation for subsequently determining similar texts of the target text quickly and accurately.
Embodiment four
This embodiment of the application provides another possible implementation which, on the basis of embodiment three, further includes the method shown in embodiment four, wherein
before step S300, the method further includes step S201 (not marked in the figure): filtering out the stop words among the expansion words of any foundation characteristic word; and/or filtering out, from the expansion words of any foundation characteristic word, the words whose inverse document frequency is less than a second predetermined threshold.
Before step S300, the method further includes step S202 (not marked in the figure): determining the weight of each word, where determining the weight of each word includes:
determining the weight of any word by the following formula:
$$w_i = \mathrm{idf}_i \times \left(\mathrm{p\_tf}_i + \mathrm{c\_tf}_i\right)$$
where $w_i$ denotes the weight of the word, $\mathrm{idf}_i$ denotes the inverse document frequency of the word, $\mathrm{p\_tf}_i$ denotes the frequency of the word in the text title and text abstract of the target text, and $\mathrm{c\_tf}_i$ denotes the frequency of the word in texts other than the target text.
Specifically, after the second predetermined number of expansion words of each foundation characteristic word are obtained through steps S2001, S2002 and S2003, the obtained expansion words need to be filtered further. As required, only the stop words may be filtered out, only the words whose inverse document frequency is less than the second predetermined threshold may be filtered out, or both may be filtered out at the same time. Filtering the obtained expansion words allows them to characterize the target text better.
As an example, when the second predetermined threshold is set to 4.0, the obtained expansion words can be filtered by removing only the stop words, by removing only the words whose inverse document frequency is less than 4.0, or by removing both the stop words and the words whose inverse document frequency is less than 4.0; the resulting word set is the set of expansion words of the foundation characteristic word in this embodiment.
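The filtering of step S201 can be sketched in Python as follows. The sketch assumes that inverse document frequencies have been computed beforehand and are available as a lookup table, since the text does not state how they are calculated; words missing from the table are kept, which is also an assumption. The function name `filter_expansions`, its parameters and the toy values are hypothetical.

```python
def filter_expansions(expansions, stop_words, idf, min_idf=4.0,
                      drop_stop_words=True, drop_low_idf=True):
    """Filter expansion words by stop-word list and/or an IDF threshold.

    `idf` maps a word to its precomputed inverse document frequency.
    Either filter can be switched off, mirroring the "and/or" of step S201.
    """
    kept = []
    for word in expansions:
        if drop_stop_words and word in stop_words:
            continue
        # Words without an IDF entry are treated as rare and therefore kept.
        if drop_low_idf and idf.get(word, float("inf")) < min_idf:
            continue
        kept.append(word)
    return kept


# Toy values, invented for illustration only.
toy_expansions = ["software", "the", "installation package", "device"]
toy_stop_words = {"the"}
toy_idf = {"software": 5.2, "the": 0.1, "installation package": 7.9, "device": 2.3}

both = filter_expansions(toy_expansions, toy_stop_words, toy_idf)
stop_only = filter_expansions(toy_expansions, toy_stop_words, toy_idf,
                              drop_low_idf=False)
idf_only = filter_expansions(toy_expansions, toy_stop_words, toy_idf,
                             drop_stop_words=False)
```

The default `min_idf=4.0` matches the example threshold given above.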
Further, assume that the foundation characteristic words and their expansion words obtained through the above steps are $w_1, w_2, \ldots, w_N$, and that the target text in the above steps is a patent text. The weighted value of each word (including each foundation characteristic word and each of its expansion words) can then be determined with formula (1):
$w_i = \mathrm{idf}_i \cdot (\mathrm{p\_tf}_i + \mathrm{c\_tf}_i)$ (1)
where $w_i$ denotes the weighted value of the word, $\mathrm{idf}_i$ denotes the reverse document frequency of the word, $\mathrm{p\_tf}_i$ denotes the frequency of the word in the patent title and the abstract of the specification, and $\mathrm{c\_tf}_i$ denotes the frequency with which the word occurs in texts other than the patent text (for example, paper texts). In addition, $\mathrm{p\_tf}_i$ may be calculated as: (number of occurrences of the word in the patent title and patent abstract + 1) / (total number of foundation characteristic words and expansion words of the foundation characteristic words + 1). For a word that does not occur in the patent title or abstract, adding 1 has a smoothing effect.
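As a worked illustration of formula (1) and of the add-one smoothing of the title and abstract frequency, the following sketch computes the weights for a toy vocabulary. The function names, the token lists and the numeric IDF and c_tf values are hypothetical; how c_tf itself is computed is not specified in the text, so it is simply passed in as a table.

```python
def smoothed_title_abstract_tf(word, title_abstract_tokens, vocabulary_size):
    """p_tf: (occurrences in title and abstract + 1) / (vocabulary size + 1)."""
    occurrences = title_abstract_tokens.count(word)
    return (occurrences + 1) / (vocabulary_size + 1)


def word_weights(vocabulary, idf, title_abstract_tokens, c_tf):
    """Apply w_i = idf_i * (p_tf_i + c_tf_i) to every word in the vocabulary.

    vocabulary: foundation characteristic words plus their expansion words
    idf, c_tf: dicts mapping word -> value (missing words default to 0.0,
               which is an assumption of this sketch)
    """
    weights = {}
    for word in vocabulary:
        p_tf = smoothed_title_abstract_tf(word, title_abstract_tokens, len(vocabulary))
        weights[word] = idf.get(word, 0.0) * (p_tf + c_tf.get(word, 0.0))
    return weights


# Hypothetical inputs, for illustration only.
vocabulary = ["battery", "electrode", "anode"]
title_abstract_tokens = ["battery", "electrode", "battery", "pack"]
idf = {"battery": 2.0, "electrode": 4.0, "anode": 5.0}
c_tf = {"battery": 0.5, "electrode": 0.25, "anode": 0.0}

weights = word_weights(vocabulary, idf, title_abstract_tokens, c_tf)
print(weights)
```

In this toy input "anode" never appears in the title or abstract, yet its p_tf is (0 + 1) / (3 + 1) rather than zero, which is the smoothing effect mentioned above.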
Further, after the weighted value $w_i$ of each word is obtained, the weighted values are normalized to obtain the weight distribution over the words of the patent, as shown in Fig. 2.
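The text does not state which normalization is used. One common choice, assumed here purely for illustration, is to divide each weighted value by the sum of all weighted values so that the results form a distribution that sums to 1. The function name and the sample weights are hypothetical.

```python
def normalize_weights(weights):
    """Scale weights so they sum to 1 (sum normalization, an assumed choice)."""
    total = sum(weights.values())
    if total == 0:
        # Degenerate case: no word carries any weight.
        return {word: 0.0 for word in weights}
    return {word: value / total for word, value in weights.items()}


# Hypothetical raw weighted values, for illustration only.
raw_weights = {"battery": 2.5, "electrode": 3.0, "anode": 1.25}
distribution = normalize_weights(raw_weights)
print(distribution)
```

Sum normalization preserves the ranking of the words, so the most heavily weighted word before normalization remains the most heavily weighted one afterwards.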
In the embodiment of the present application, filtering out the stop words and the words whose reverse document frequency is less than the second predetermined threshold from the expansion words allows the expansion words to characterize the target text better, and prevents such words from degrading the accuracy of the text similarity analysis. In addition, the method provided for determining the weighted value of each word lets those skilled in the art determine the weighted values quickly, which is a precondition for subsequently determining the similar text of the target text from the preset text database.
Embodiment five
The embodiment of the present application provides another possible implementation which, on the basis of Embodiment four, further includes the method shown in Embodiment five, wherein
Step S300 includes step S3001 (not marked in the figure), step S3002 (not marked in the figure), step S3003 (not marked in the figure) and step S3004 (not marked in the figure), wherein
Step S3001: for each of multiple texts to be screened in the preset text database, performing the steps of determining the foundation characteristic words of the first predetermined number, extending the foundation characteristic words of the first predetermined number respectively based on the trained text term vector library to obtain the expansion words of the second predetermined number corresponding to each foundation characteristic word, and determining the weighted value of each word, so as to obtain, for each text to be screened, its foundation characteristic words, the weighted values of the foundation characteristic words, the expansion words of the foundation characteristic words, and the weighted values of the expansion words.
Step S3002: detecting whether the foundation characteristic words and expansion words of any text to be screened contain words identical to the foundation characteristic words and expansion words of the target text.
Step S3003: for any text to be screened, if such identical words exist, calculating, for each identical word, the product of its weighted value in the text to be screened and its weighted value in the target text, and calculating the sum of the products over all identical words.
Step S3004: among the multiple texts to be screened, selecting the texts to be screened whose calculated sum of products is greater than the third predetermined threshold as the similar texts of the target text.
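Steps S3002 to S3004 can be sketched as follows, assuming that step S3001 has already produced a word-to-weight table for the target text and for every text to be screened. The function names, the text identifiers and all numeric values are hypothetical, and sorting the result by score is a convenience added by this sketch, not something the steps require.

```python
def similarity_score(target_weights, candidate_weights):
    """Sum, over the shared words, of (weight in candidate * weight in target)."""
    shared_words = target_weights.keys() & candidate_weights.keys()
    return sum(target_weights[w] * candidate_weights[w] for w in shared_words)


def select_similar_texts(target_weights, candidates, threshold):
    """Return (text_id, score) pairs whose score exceeds the third threshold."""
    similar = []
    for text_id, candidate_weights in candidates.items():
        # S3002: skip texts that share no word with the target text.
        if not (target_weights.keys() & candidate_weights.keys()):
            continue
        # S3003: sum of the weight products over all shared words.
        score = similarity_score(target_weights, candidate_weights)
        # S3004: keep only texts above the threshold.
        if score > threshold:
            similar.append((text_id, score))
    # Added for readability of the output; not part of the described steps.
    similar.sort(key=lambda pair: pair[1], reverse=True)
    return similar


# Hypothetical weight tables, for illustration only.
target = {"battery": 0.5, "electrode": 0.3, "anode": 0.2}
texts_to_screen = {
    "A": {"battery": 0.4, "electrode": 0.4, "cell": 0.2},
    "B": {"engine": 0.7, "piston": 0.3},
    "C": {"anode": 0.1, "paint": 0.9},
}

result = select_similar_texts(target, texts_to_screen, threshold=0.1)
print(result)
```

Text "B" shares no word with the target and is dropped at S3002, text "C" shares one word but its sum of products stays below the threshold, and only text "A" is returned.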
Specifically, a large number of texts such as patents and papers are stored in the preset text database. When the similar text of the target text is screened from the preset text database, multiple texts to be screened in the preset text database are processed through step S100 (determining the foundation characteristic words of the first predetermined number), step S200 (extending the foundation characteristic words of the first predetermined number respectively, based on the trained text term vector library, to obtain the expansion words of the second predetermined number corresponding to each foundation characteristic word), step S201 (filtering out the stop words among the expansion words of any foundation characteristic word; and/or filtering out the words among the expansion words of any foundation characteristic word whose reverse document frequency is less than the second predetermined threshold), step S202 (determining the weighted value of each word) and so on, as described in Embodiments one to four above. This yields, for each text to be screened, its foundation characteristic words, the weighted values of the foundation characteristic words, the expansion words of the foundation characteristic words, and the weighted values of the expansion words.
Further, when searching the texts to be screened in the preset text database for the similar text of the target text, the texts to be screened can be traversed according to the foundation characteristic words and expansion words of the target text. Specifically, each text to be screened can be checked in turn by detecting whether its foundation characteristic words and expansion words contain words identical to the foundation characteristic words and expansion words of the target text. Texts to be screened that contain no such identical words are filtered out, and only the texts to be screened that contain identical words are retained for further processing.
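The traversal described above checks each text to be screened in turn. When the database is large, one way to obtain the same set of retained texts without scanning every text is an inverted index from word to text identifiers. This is a swapped-in technique offered only as a sketch, not the procedure of the application; all names and data below are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(candidate_words):
    """Map each word to the set of text ids whose feature/expansion words contain it."""
    index = defaultdict(set)
    for text_id, words in candidate_words.items():
        for word in words:
            index[word].add(text_id)
    return index


def texts_sharing_words(target_words, index):
    """Return ids of texts that share at least one word with the target text."""
    hits = set()
    for word in target_words:
        # .get avoids creating empty entries in the defaultdict.
        hits |= index.get(word, set())
    return hits


# Hypothetical word sets (foundation characteristic words plus expansion words).
candidate_words = {
    "A": {"battery", "electrode", "cell"},
    "B": {"engine", "piston"},
    "C": {"anode", "paint"},
}
index = build_inverted_index(candidate_words)
retained = texts_sharing_words({"battery", "electrode", "anode"}, index)
print(sorted(retained))
```

The index is built once for the database and can then be reused for every target text, whereas the direct traversal repeats the comparison for each query.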
Further, when a text to be screened contains words identical to the foundation characteristic words and expansion words of the target text, the product of the weighted value of each identical word in the text to be screened and its weighted value in the target text is calculated. When there are multiple identical words, their products are added together, that is, the sum of the products over all identical words is calculated. When there is only one identical word, its product is used directly as the final sum of products.
Further, from the texts to be screened that contain words identical to the foundation characteristic words and expansion words of the target text, the texts closest to the target text are selected as the similar texts of the target text. The texts to be screened whose sum of products is greater than the third predetermined threshold can be selected as the similar texts of the target text, and the value of the third predetermined threshold can be set dynamically according to actual needs. Table 2 gives an example of how the relevant information of a target text and its corresponding similar texts is displayed.
Table 2: Target text and its corresponding similar text information
Further, combining the methods of Embodiments one to five of the present application and taking a patent text as the target text, Fig. 3 gives the basic process of searching for patents similar to a target patent. In Fig. 3, step S1 (extracting the patent foundation characteristic words based on TextRank) is carried out first, followed by step S2 (determining the deep learning algorithm), step S3 (training the patent term vector library), step S4 (extending the foundation characteristic words based on the patent term vector library), step S5 (filtering the patent feature expansion words), step S6 (calculating the weights of the patent feature words), and finally step S7 (outputting the similar patents and the corresponding patentees).
The embodiment of the present application gives the detailed process and operating steps for determining the similar text of the target text from the preset text database based on each foundation characteristic word, each expansion word and the weighted value of each word. Following these steps, those skilled in the art can quickly and accurately select the similar text of the target text from the preset text database, and then, according to the similar patents, identify the technology competitors of the enterprise or institution to which the target text belongs.
Embodiment six
Fig. 4 is a structural schematic diagram of a text similarity analysis device provided by the embodiment of the present application. As shown in Fig. 4, the text similarity analysis device 40 may include a first determining module 41, an expansion module 42 and a second determining module 43, wherein:
The first determining module 41 is used for determining the foundation characteristic words of the first predetermined number of the target text;
The expansion module 42 is used for extending the foundation characteristic words of the first predetermined number respectively, based on the trained text term vector library, to obtain the expansion words of the second predetermined number corresponding to each foundation characteristic word;
The second determining module 43 is used for determining the similar text of the target text from the preset text database based on each foundation characteristic word, each expansion word and the weighted value of each word.
Specifically, the first determining module 41 is specifically used for determining the foundation characteristic words of the first predetermined number of the target text by the TextRank algorithm.
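To make the role of the first determining module more concrete, the following is a simplified TextRank-style keyword ranking written in plain Python. It is a sketch only: the co-occurrence window, the damping factor of 0.85, the iteration count, the alphabetical tie-break and the toy token list are all assumptions, and the text does not specify which parameters or preprocessing (word segmentation, part-of-speech filtering) the application uses.

```python
from collections import defaultdict


def textrank_keywords(tokens, top_n=5, window=2, damping=0.85, iterations=50):
    """Rank words on an undirected co-occurrence graph with PageRank-style updates."""
    # Connect every pair of distinct words that occur within `window` positions.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != word:
                neighbors[word].add(tokens[j])
                neighbors[tokens[j]].add(word)

    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        new_scores = {}
        for word in neighbors:
            # Each neighbor passes on its score split evenly over its own edges.
            rank = sum(scores[other] / len(neighbors[other]) for other in neighbors[word])
            new_scores[word] = (1 - damping) + damping * rank
        scores = new_scores

    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n]


# Hypothetical, already tokenized text, for illustration only.
tokens = "battery electrode battery anode battery cathode battery separator".split()
foundation_words = textrank_keywords(tokens, top_n=3)
print(foundation_words)
```

Here `top_n` plays the role of the first predetermined number: the function returns that many top-ranked words as candidate foundation characteristic words.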
Further, the device further includes a training module 44, as shown in Fig. 5, wherein the training module 44 is used for training on the texts in the preset database by a continuous bag-of-words (CBOW) neural network model to obtain the trained text term vector library.
Further, the expansion module 42 includes an acquisition submodule 421, a computational submodule 422 and an expansion word determination submodule 423, as shown in Fig. 5, wherein the acquisition submodule 421 is used for obtaining the first term vector of any foundation characteristic word by querying the trained text term vector library;
The computational submodule 422 is used for calculating the cosine similarity value between the first term vector and each second term vector, a second term vector being a term vector in the trained text term vector library other than the first term vector;
The expansion word determination submodule 423 is used for determining the words corresponding to the second predetermined number of second term vectors whose cosine similarity value is greater than the first predetermined threshold, and taking them as the expansion words of that foundation characteristic word.
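The three submodules can be illustrated with a short sketch in which the trained text term vector library is stood in for by a plain dictionary. The function names and the toy two-dimensional vectors are hypothetical. The text does not say which words are kept when more than the second predetermined number exceed the threshold, so keeping the most similar ones is an assumption of this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def expand_word(base_word, vectors, min_similarity, max_words):
    """Return up to `max_words` words whose vectors exceed `min_similarity`.

    vectors: dict mapping word -> term vector (stand-in for the trained library)
    min_similarity: the first predetermined threshold
    max_words: the second predetermined number
    """
    first_vector = vectors[base_word]  # acquisition submodule
    scored = []
    for word, second_vector in vectors.items():
        if word == base_word:
            continue
        similarity = cosine_similarity(first_vector, second_vector)  # computational submodule
        if similarity > min_similarity:
            scored.append((similarity, word))
    # Assumption: prefer the most similar words when too many pass the threshold.
    scored.sort(reverse=True)
    return [word for _, word in scored[:max_words]]  # expansion word determination


# Hypothetical 2-dimensional vectors, for illustration only.
vectors = {
    "battery": [1.0, 0.0],
    "cell": [0.9, 0.1],
    "accumulator": [0.8, 0.3],
    "engine": [0.0, 1.0],
}
expansion = expand_word("battery", vectors, min_similarity=0.9, max_words=2)
print(expansion)
```

With real term vectors the loop over the whole vocabulary is usually replaced by a vectorized or approximate nearest-neighbor search, but the selection rule stays the same.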
Further, the device further includes a filtering module 45, as shown in Fig. 5, wherein the filtering module 45 is used for filtering out the stop words among the expansion words of any foundation characteristic word; and/or for filtering out, from the expansion words of any foundation characteristic word, the words whose reverse document frequency is less than the second predetermined threshold.
Further, the device further includes a weight determination module 46, as shown in Fig. 5, wherein the weight determination module 46 is used for determining the weighted value of each word, and is specifically used for determining the weighted value of any word by the following formula:
$w_i = \mathrm{idf}_i \cdot (\mathrm{p\_tf}_i + \mathrm{c\_tf}_i)$
where $w_i$ denotes the weighted value of the word, $\mathrm{idf}_i$ denotes the reverse document frequency of the word, $\mathrm{p\_tf}_i$ denotes the frequency of the word in the text title and text abstract of the target text, and $\mathrm{c\_tf}_i$ denotes the frequency of the word in other texts besides the target text.
Further, the second determining module 43 includes a pretreatment submodule 431, a detection submodule 432, a product computational submodule 433 and a screening submodule 434, wherein
The pretreatment submodule 431 is used for performing, on each of multiple texts to be screened in the preset text database, the steps of obtaining the foundation characteristic words of the first predetermined number, extending the foundation characteristic words of the first predetermined number respectively based on the trained text term vector library to obtain the expansion words of the second predetermined number corresponding to each foundation characteristic word, and determining the weighted value of each word, so as to obtain, for each text to be screened, its foundation characteristic words, the weighted values of the foundation characteristic words, the expansion words of the foundation characteristic words, and the weighted values of the expansion words;
The detection submodule 432 is used for detecting whether the foundation characteristic words and expansion words of any text to be screened contain words identical to the foundation characteristic words and expansion words of the target text;
The product computational submodule 433 is used for, for any text to be screened, if such identical words exist, calculating for each identical word the product of its weighted value in the text to be screened and its weighted value in the target text, and calculating the sum of the products over all identical words;
The screening submodule 434 is used for selecting, among the multiple texts to be screened, the texts to be screened whose calculated sum of products is greater than the third predetermined threshold as the similar texts of the target text.
Compared with the prior art, the device provided by the embodiment of the present application determines the foundation characteristic words of the first predetermined number of the target text, thereby extracting text feature words that can characterize the target text and providing the precondition for subsequently extending those foundation characteristic words based on the trained text term vector library. Based on the trained text term vector library, the device extends the foundation characteristic words of the first predetermined number respectively to obtain the expansion words of the second predetermined number corresponding to each foundation characteristic word. This greatly increases the number of extracted professional terms that can characterize the target text, effectively improves the statistical properties of the frequencies of the text feature words characterizing the target text, and lays the foundation for subsequently determining the similar text of the target text quickly and accurately. Based on each foundation characteristic word, each expansion word and the weighted value of each word, the device determines the similar text of the target text from the preset text database, so that the similar patents of the target text are selected from the preset text database quickly and accurately and the technology competitors of the enterprise or institution to which the target text belongs are identified according to the similar patents. The accuracy of patent similarity analysis and of patent competitor identification is thereby greatly improved.
Embodiment seven
The embodiment of the present application provides an electronic device. As shown in Fig. 6, the electronic device 600 includes a processor 601 and a memory 603, wherein the processor 601 is connected to the memory 603, for example through a bus 602. Further, the electronic device 600 may also include a transceiver 604. It should be noted that in practical applications the transceiver 604 is not limited to one, and the structure of the electronic device 600 does not constitute a limitation on the embodiment of the present application.
The processor 601 is applied in the embodiment of the present application to implement the functions of the first determining module, the expansion module and the second determining module shown in Fig. 4. The transceiver 604 includes a receiver and a transmitter, and is applied in the embodiment of the present application to implement the function of the acquisition submodule shown in Fig. 5.
The processor 601 may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various illustrative logic blocks, modules and circuits described in connection with the present disclosure. The processor 601 may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The bus 602 may include a path that transfers information between the above components. The bus 602 may be a PCI bus, an EISA bus, or the like. The bus 602 may be divided into an address bus, a data bus, a control bus and so on. For ease of representation, only one thick line is used in Fig. 6, but this does not mean that there is only one bus or only one type of bus.
The memory 603 may be a ROM or another type of static storage device that can store static information and instructions, or a RAM or another type of dynamic storage device that can store information and instructions. It may also be an EEPROM, a CD-ROM or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs and the like), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but it is not limited to these.
The memory 603 is used for storing the application program code for executing the solution of the present application, and its execution is controlled by the processor 601. The processor 601 is used for executing the application program code stored in the memory 603, so as to implement the actions of the text similarity analysis device provided by the embodiment shown in Fig. 4.
The embodiment of the present application provides a computer readable storage medium on which a computer program is stored, and the method shown in Embodiment one is implemented when the program is executed by a processor. Compared with the prior art, the foundation characteristic words of the first predetermined number of the target text are determined, thereby extracting text feature words that can characterize the target text and providing the precondition for subsequently extending those foundation characteristic words based on the trained text term vector library. Based on the trained text term vector library, the foundation characteristic words of the first predetermined number are extended respectively to obtain the expansion words of the second predetermined number corresponding to each foundation characteristic word. This greatly increases the number of extracted professional terms that can characterize the target text, effectively improves the statistical properties of the frequencies of the text feature words characterizing the target text, and lays the foundation for subsequently determining the similar text of the target text quickly and accurately. Based on each foundation characteristic word, each expansion word and the weighted value of each word, the similar text of the target text is determined from the preset text database, so that the similar patents of the target text are selected from the preset text database quickly and accurately and the technology competitors of the enterprise or institution to which the target text belongs are identified according to the similar patents. The accuracy of patent similarity analysis and of patent competitor identification is thereby greatly improved.
The computer readable storage medium provided by the embodiment of the present application is applicable to any embodiment of the above method, and details are not repeated here.
It should be understood that although the steps in the flowcharts of the drawings are shown in the sequence indicated by the arrows, these steps are not necessarily executed in that sequence. Unless expressly stated otherwise herein, there is no strict restriction on the order in which these steps are executed, and they may be executed in other orders. Moreover, at least some of the steps in the flowcharts of the drawings may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same moment but may be executed at different times, and they are not necessarily executed one after another but may be executed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
The above are only some embodiments of the present application. It should be noted that those of ordinary skill in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be regarded as falling within the protection scope of the present application.

Claims (16)

1. A text similarity analysis method, characterized by comprising:
Determining the foundation characteristic words of a first predetermined number of a target text;
Extending the foundation characteristic words of the first predetermined number respectively, based on a trained text term vector library, to obtain expansion words of a second predetermined number corresponding to each foundation characteristic word;
Determining a similar text of the target text from a preset text database based on each foundation characteristic word, each expansion word and the weighted value of each word.
2. The method according to claim 1, characterized in that determining the foundation characteristic words of the first predetermined number of the target text comprises:
Determining the foundation characteristic words of the first predetermined number of the target text by a TextRank algorithm.
3. The method according to claim 1, characterized in that, before extending the foundation characteristic words of the first predetermined number respectively based on the trained text term vector library, the method further comprises:
Training on the texts in a preset database by a continuous bag-of-words neural network model to obtain the trained text term vector library.
4. The method according to any one of claims 1-3, characterized in that extending the foundation characteristic words of the first predetermined number respectively, based on the trained text term vector library, to obtain the expansion words of the second predetermined number corresponding to each foundation characteristic word comprises:
Obtaining a first term vector of any foundation characteristic word by querying the trained text term vector library;
Calculating the cosine similarity value between the first term vector and each second term vector, a second term vector being a term vector in the trained text term vector library other than the first term vector;
Determining the words corresponding to the second predetermined number of second term vectors whose cosine similarity value is greater than a first predetermined threshold, and taking them as the expansion words of that foundation characteristic word.
5. The method according to claim 1, characterized in that, before determining the similar text of the target text from the preset text database based on each foundation characteristic word, each expansion word and the weighted value of each word, the method further comprises:
Filtering out the stop words among the expansion words of any foundation characteristic word; and/or
Filtering out, from the expansion words of any foundation characteristic word, the words whose reverse document frequency is less than a second predetermined threshold.
6. The method according to claim 5, characterized in that, before determining the similar text of the target text from the preset text database based on each foundation characteristic word, each expansion word and the weighted value of each word, the method further comprises: determining the weighted value of each word;
Wherein determining the weighted value of each word comprises:
Determining the weighted value of any word by the following formula:
$w_i = \mathrm{idf}_i \cdot (\mathrm{p\_tf}_i + \mathrm{c\_tf}_i)$
where $w_i$ denotes the weighted value of the word, $\mathrm{idf}_i$ denotes the reverse document frequency of the word, $\mathrm{p\_tf}_i$ denotes the frequency of the word in the text title and text abstract of the target text, and $\mathrm{c\_tf}_i$ denotes the frequency of the word in other texts besides the target text.
7. The method according to claim 6, characterized in that determining the similar text of the target text from the preset text database based on each foundation characteristic word, each expansion word and the weighted value of each word comprises:
Performing, on each of multiple texts to be screened in the preset text database, the steps of obtaining the foundation characteristic words of the first predetermined number, extending the foundation characteristic words of the first predetermined number respectively based on the trained text term vector library to obtain the expansion words of the second predetermined number corresponding to each foundation characteristic word, and determining the weighted value of each word, so as to obtain, for each text to be screened, its foundation characteristic words, the weighted values of the foundation characteristic words, the expansion words of the foundation characteristic words, and the weighted values of the expansion words;
Detecting whether the foundation characteristic words and expansion words of any text to be screened contain words identical to the foundation characteristic words and expansion words of the target text;
For any text to be screened, if such identical words exist, calculating for each identical word the product of its weighted value in the text to be screened and its weighted value in the target text, and calculating the sum of the products over all identical words;
Selecting, among the multiple texts to be screened, the texts to be screened whose calculated sum of products is greater than a third predetermined threshold as the similar texts of the target text.
8. A text similarity analysis device, characterized by comprising:
A first determining module, used for determining the foundation characteristic words of a first predetermined number of a target text;
An expansion module, used for extending the foundation characteristic words of the first predetermined number respectively, based on a trained text term vector library, to obtain expansion words of a second predetermined number corresponding to each foundation characteristic word;
A second determining module, used for determining a similar text of the target text from a preset text database based on each foundation characteristic word, each expansion word and the weighted value of each word.
9. The device according to claim 8, characterized in that the first determining module is specifically used for determining the foundation characteristic words of the first predetermined number of the target text by a TextRank algorithm.
10. The device according to claim 8, characterized by further comprising a training module;
The training module is used for training on the texts in a preset database by a continuous bag-of-words neural network model to obtain the trained text term vector library.
11. The device according to any one of claims 8-10, characterized in that the expansion module comprises an acquisition submodule, a computational submodule and an expansion word determination submodule;
The acquisition submodule is used for obtaining a first term vector of any foundation characteristic word by querying the trained text term vector library;
The computational submodule is used for calculating the cosine similarity value between the first term vector and each second term vector, a second term vector being a term vector in the trained text term vector library other than the first term vector;
The expansion word determination submodule is used for determining the words corresponding to the second predetermined number of second term vectors whose cosine similarity value is greater than a first predetermined threshold, and taking them as the expansion words of that foundation characteristic word.
12. The device according to claim 8, characterized by further comprising a filtering module;
The filtering module is used for filtering out the stop words among the expansion words of any foundation characteristic word; and/or for filtering out, from the expansion words of any foundation characteristic word, the words whose reverse document frequency is less than a second predetermined threshold.
13. The device according to claim 12, characterized by further comprising a weight determination module;
The weight determination module is used for determining the weighted value of each word, and is specifically used for determining the weighted value of any word by the following formula:
$w_i = \mathrm{idf}_i \cdot (\mathrm{p\_tf}_i + \mathrm{c\_tf}_i)$
where $w_i$ denotes the weighted value of the word, $\mathrm{idf}_i$ denotes the reverse document frequency of the word, $\mathrm{p\_tf}_i$ denotes the frequency of the word in the text title and text abstract of the target text, and $\mathrm{c\_tf}_i$ denotes the frequency of the word in other texts besides the target text.
14. The device according to claim 13, characterized in that the second determining module comprises a pretreatment submodule, a detection submodule, a product computational submodule and a screening submodule;
The pretreatment submodule is used for performing, on each of multiple texts to be screened in the preset text database, the steps of obtaining the foundation characteristic words of the first predetermined number, extending the foundation characteristic words of the first predetermined number respectively based on the trained text term vector library to obtain the expansion words of the second predetermined number corresponding to each foundation characteristic word, and determining the weighted value of each word, so as to obtain, for each text to be screened, its foundation characteristic words, the weighted values of the foundation characteristic words, the expansion words of the foundation characteristic words, and the weighted values of the expansion words;
The detection submodule is used for detecting whether the foundation characteristic words and expansion words of any text to be screened contain words identical to the foundation characteristic words and expansion words of the target text;
The product computational submodule is used for, for any text to be screened, if such identical words exist, calculating for each identical word the product of its weighted value in the text to be screened and its weighted value in the target text, and calculating the sum of the products over all identical words;
The screening submodule is used for selecting, among the multiple texts to be screened, the texts to be screened whose calculated sum of products is greater than a third predetermined threshold as the similar texts of the target text.
15. An electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the text similarity analysis method according to any one of claims 1-7 when executing the program.
16. A computer readable storage medium, characterized in that a computer program is stored on the computer readable storage medium, and the text similarity analysis method according to any one of claims 1-7 is implemented when the program is executed by a processor.
CN201810522854.4A 2018-05-28 2018-05-28 Text similarity analysis method and device, electronic equipment and computer storage medium Active CN108804421B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810522854.4A CN108804421B (en) 2018-05-28 2018-05-28 Text similarity analysis method and device, electronic equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810522854.4A CN108804421B (en) 2018-05-28 2018-05-28 Text similarity analysis method and device, electronic equipment and computer storage medium

Publications (2)

Publication Number Publication Date
CN108804421A (en) 2018-11-13
CN108804421B (en) 2022-04-15

Family

ID=64090466

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810522854.4A Active CN108804421B (en) 2018-05-28 2018-05-28 Text similarity analysis method and device, electronic equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN108804421B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109558481A (en) * 2018-12-03 2019-04-02 中国科学技术信息研究所 Patent and business relevancy measurement method, device, equipment and readable storage medium
CN109614478A (en) * 2018-12-18 2019-04-12 北京中科闻歌科技股份有限公司 Construction method of term vector model, keyword matching method and device
CN109885813A (en) * 2019-02-18 2019-06-14 武汉瓯越网视有限公司 Operation method, system, server and storage medium for text similarity based on word coverage
CN110427330A (en) * 2019-08-13 2019-11-08 腾讯科技(深圳)有限公司 Method and related apparatus for code analysis
CN111199148A (en) * 2019-12-26 2020-05-26 东软集团股份有限公司 Text similarity determination method and device, storage medium and electronic equipment
CN111414753A (en) * 2020-03-09 2020-07-14 中国美术学院 Method and system for extracting perceptual image vocabulary of product
CN112215008A (en) * 2020-10-23 2021-01-12 中国平安人寿保险股份有限公司 Entity recognition method and device based on semantic understanding, computer equipment and medium
CN113033197A (en) * 2021-03-24 2021-06-25 中新国际联合研究院 Building construction contract rule query method and device
CN113064979A (en) * 2021-03-10 2021-07-02 国网河北省电力有限公司 Keyword retrieval-based method for judging construction period and price reasonability
CN114331766A (en) * 2022-01-05 2022-04-12 中国科学技术信息研究所 Method and device for determining patent technology core degree, electronic equipment and storage medium
CN115358221A (en) * 2022-08-12 2022-11-18 维正知识产权科技有限公司 Enterprise patent data comparison method and device, electronic equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070143322A1 (en) * 2005-12-15 2007-06-21 International Business Machines Corporation Document comparision using multiple similarity measures
CN103377226A (en) * 2012-04-25 2013-10-30 中国移动通信集团公司 Intelligent search method and system thereof
CN105320772A (en) * 2015-11-02 2016-02-10 武汉大学 Associated paper query method for patent duplicate checking
CN107247780A (en) * 2017-06-12 2017-10-13 北京理工大学 Ontology-based patent document similarity measurement method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070143322A1 (en) * 2005-12-15 2007-06-21 International Business Machines Corporation Document comparision using multiple similarity measures
CN103377226A (en) * 2012-04-25 2013-10-30 中国移动通信集团公司 Intelligent search method and system thereof
CN105320772A (en) * 2015-11-02 2016-02-10 武汉大学 Associated paper query method for patent duplicate checking
CN107247780A (en) * 2017-06-12 2017-10-13 北京理工大学 Ontology-based patent document similarity measurement method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
许晓阳、郑彦宁、刘志辉: "论文和专利相结合的研究前沿识别方法研究", 《图书情报工作》 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109558481A (en) * 2018-12-03 2019-04-02 中国科学技术信息研究所 Patent and business relevancy measurement method, device, equipment and readable storage medium
CN109614478A (en) * 2018-12-18 2019-04-12 北京中科闻歌科技股份有限公司 Construction method of term vector model, keyword matching method and device
CN109885813A (en) * 2019-02-18 2019-06-14 武汉瓯越网视有限公司 Operation method, system, server and storage medium for text similarity based on word coverage
CN110427330A (en) * 2019-08-13 2019-11-08 腾讯科技(深圳)有限公司 Method and related apparatus for code analysis
CN110427330B (en) * 2019-08-13 2023-09-26 腾讯科技(深圳)有限公司 Code analysis method and related device
CN111199148B (en) * 2019-12-26 2023-01-20 东软集团股份有限公司 Text similarity determination method and device, storage medium and electronic equipment
CN111199148A (en) * 2019-12-26 2020-05-26 东软集团股份有限公司 Text similarity determination method and device, storage medium and electronic equipment
CN111414753A (en) * 2020-03-09 2020-07-14 中国美术学院 Method and system for extracting perceptual image vocabulary of product
CN112215008A (en) * 2020-10-23 2021-01-12 中国平安人寿保险股份有限公司 Entity recognition method and device based on semantic understanding, computer equipment and medium
CN112215008B (en) * 2020-10-23 2024-04-16 中国平安人寿保险股份有限公司 Entity identification method, device, computer equipment and medium based on semantic understanding
CN113064979A (en) * 2021-03-10 2021-07-02 国网河北省电力有限公司 Keyword retrieval-based method for judging construction period and price reasonability
CN113033197A (en) * 2021-03-24 2021-06-25 中新国际联合研究院 Building construction contract rule query method and device
CN114331766A (en) * 2022-01-05 2022-04-12 中国科学技术信息研究所 Method and device for determining patent technology core degree, electronic equipment and storage medium
CN114331766B (en) * 2022-01-05 2022-07-08 中国科学技术信息研究所 Method and device for determining patent technology core degree, electronic equipment and storage medium
CN115358221A (en) * 2022-08-12 2022-11-18 维正知识产权科技有限公司 Enterprise patent data comparison method and device, electronic equipment and medium

Also Published As

Publication number Publication date
CN108804421B (en) 2022-04-15

Similar Documents

Publication Publication Date Title
CN108804421A (en) Text similarity analysis method, device, electronic equipment and computer storage media
CA2423033C (en) A document categorisation system
US9256649B2 (en) Method and system of filtering and recommending documents
CN102609407B (en) Fine-grained semantic detection method of harmful text contents in network
CN102332028A (en) Webpage-oriented unhealthy Web content identifying method
CN111191022A (en) Method and device for generating short titles of commodities
CN107015961A (en) Text similarity comparison method
CN106649849A (en) Text information base building method and device and searching method, device and system
CN109446423B (en) System and method for judging sentiment of news and texts
CN110134777A (en) Question deduplication method, device, electronic equipment and computer readable storage medium
KR101638535B1 (en) Method of detecting issue patten associated with user search word, server performing the same and storage medium storing the same
CN110019820A (en) Method for detecting temporal consistency of symptoms between chief complaint and history of present illness in medical records
CN111897955B (en) Comment generation method, device, equipment and storage medium based on encoding and decoding
CN110795930A (en) Article title optimization method, system, medium and equipment
CN107305555A (en) Data processing method and device
CN107291686B (en) Method and system for identifying emotion identification
US20090319514A1 (en) Method and system for assigning scores
Syn et al. Using latent semantic analysis to identify quality in use (qu) indicators from user reviews
EP1197884A2 (en) Method and apparatus for authoring and viewing audio documents
CN107229654A (en) Hot search word acquisition method and system
CN110019702B (en) Data mining method, device and equipment
CN110633466B (en) Short message crime identification method and system based on semantic analysis and readable storage medium
CN109558481B (en) Method, device and equipment for measuring correlation between patent and enterprise and readable storage medium
CN112308453A (en) Risk identification model training method, user risk identification method and related device
JP2002251590A (en) Document analyzer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant