CN108804421A - Text similarity analysis method, device, electronic equipment and computer storage media - Google Patents
- Publication number
- CN108804421A (application CN201810522854.4A)
- Authority
- CN
- China
- Prior art keywords
- word
- text
- foundation characteristic
- expansion
- term vector
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/18—Legal services
- G06Q50/184—Intellectual property management
Abstract
This application relates to the field of text processing and discloses a text similarity analysis method, apparatus, electronic device and computer-readable storage medium. The text similarity analysis method includes: determining a first preset number of basic feature words of a target text; then expanding each of the first preset number of basic feature words, based on a trained text word-vector library, to obtain a second preset number of expansion words for each basic feature word; and then determining the texts similar to the target text from a preset text database, based on the basic feature words, the expansion words and the weight value of each word. The method of the embodiments greatly expands the number of extracted professional terms that can characterize the target text, improves the statistical properties of the feature-word frequencies that characterize the target text, allows the patents similar to the target text to be selected quickly and accurately from the preset text database, and greatly improves the accuracy of patent similarity analysis.
Description
Technical field
This application relates to the technical field of identity recognition and, specifically, to a text similarity analysis method, apparatus, electronic device and computer storage medium.
Background
Text (such as the text of a paper or a patent) is a carrier of natural language and usually exists in unstructured or semi-structured form. With the rapid development of computer and Internet technology, text similarity analysis is widely applied in many fields. In information retrieval, text classification, text clustering and automatic question answering, for example, text similarity analysis is a basic and important task.
Taking patent text as an example, patent similarity analysis first converts the unstructured patent text into structured information that a computer can readily recognize and process, then extracts features from the structured information, and finally performs the similarity analysis of the patents according to the extracted features. Common patent similarity analysis methods include patent semantic analysis, patent trees and text mining. Although these methods improve the quality of the analysis to some extent, the accuracy of patent similarity analysis remains low.
Summary of the invention
The purpose of this application is to solve at least one of the technical deficiencies above, in particular the low accuracy of similarity analysis.
In a first aspect, a text similarity analysis method is provided, including:
determining a first preset number of basic feature words of a target text;
expanding each of the first preset number of basic feature words, based on a trained text word-vector library, to obtain a second preset number of expansion words for each basic feature word;
determining the texts similar to the target text from a preset text database, based on the basic feature words, the expansion words and the weight value of each word.
In a second aspect, a text similarity analysis apparatus is provided, including:
a first determining module, configured to determine a first preset number of basic feature words of a target text;
an expansion module, configured to expand each of the first preset number of basic feature words, based on a trained text word-vector library, to obtain a second preset number of expansion words for each basic feature word;
a second determining module, configured to determine the texts similar to the target text from a preset text database, based on the basic feature words, the expansion words and the weight value of each word.
In a third aspect, an electronic device is provided, including a memory, a processor and a computer program stored in the memory and executable on the processor, where the processor implements the text similarity analysis method above when executing the program.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, where the program implements the text similarity analysis method above when executed by a processor.
The text similarity analysis method provided by the embodiments of this application determines a first preset number of basic feature words of the target text, thereby extracting feature words that can characterize the target text; this is the precondition for subsequently expanding each of these basic feature words based on the trained text word-vector library. Expanding each of the first preset number of basic feature words based on the trained text word-vector library yields a second preset number of expansion words for each basic feature word; this greatly expands the number of extracted professional terms that can characterize the target text, improves the statistical properties of the feature-word frequencies that characterize the target text, and lays the foundation for determining the similar texts quickly and accurately. Determining the texts similar to the target text from the preset text database, based on the basic feature words, the expansion words and the weight value of each word, allows the patents similar to the target text to be selected quickly and accurately from the preset text database. The technology competitors of the enterprise or institution that owns the target text can then be identified from the similar patents, which greatly improves the accuracy of both patent similarity analysis and patent competitor identification.
Additional aspects and advantages of this application will be set forth in part in the description below; they will become apparent from that description or be learned through practice of the application.
Description of the drawings
The above and/or additional aspects and advantages of this application will become apparent and easy to understand from the following description of the embodiments in conjunction with the accompanying drawings, in which:
Fig. 1 is a flow diagram of the text similarity analysis method of an embodiment of this application;
Fig. 2 is a schematic diagram of the weight distribution of the text feature words of an embodiment of this application;
Fig. 3 is a schematic diagram of the text similarity analysis process of an embodiment of this application;
Fig. 4 is a schematic diagram of the basic structure of the text similarity analysis apparatus of an embodiment of this application;
Fig. 5 is a schematic diagram of the detailed structure of the text similarity analysis apparatus of an embodiment of this application;
Fig. 6 is a schematic structural diagram of the electronic device of an embodiment of this application.
Detailed description
Embodiments of this application are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements, or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the application and must not be construed as limiting it.
Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "said" and "the" used herein may also include the plural forms. It should further be understood that the word "comprising" used in the description of this application indicates the presence of the stated features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It should be understood that when an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intermediate elements may be present. In addition, "connected" or "coupled" as used herein may include a wireless connection or wireless coupling. The wording "and/or" as used herein includes all or any unit of, and all combinations of, one or more of the associated listed items.
To make the purpose, technical solutions and advantages of this application clearer, the embodiments of the application are described in further detail below with reference to the accompanying drawings.
Patent texts are an important carrier for recording research activities and research methods, and an important body of literature from which researchers obtain scientific and technological experience and learn about cutting-edge industry technology. Faced with massive patent resources, automated methods are needed to quickly select the patents similar to those of a given enterprise or institution and thereby identify the technology competitors of that enterprise or institution. At present, the methods that identify enterprise competitors by mining patent data all extract feature words from data such as the patent title and abstract, use the VSM (Vector Space Model) to represent the patent text as a vector on the basis of the extracted feature words, and then perform the patent similarity analysis. However, patent titles and abstracts are short, so the statistical properties of the feature-word frequencies that characterize the patented technology are not pronounced, and too few professional terms that can characterize the patent are extracted. The VSM vector of the patent text obtained in this way therefore carries insufficient information and has limited ability to characterize the original patent text. This lowers the accuracy of the patent similarity analysis results and, in turn, the accuracy of patent competitor identification.
The text similarity analysis method, apparatus, electronic device and computer-readable storage medium provided by this application are intended to solve the above technical problems of the prior art.
The technical solution of this application, and how it solves the above technical problems, are described in detail below with specific embodiments. These specific embodiments may be combined with one another, and the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of this application are described below with reference to the accompanying drawings.
Embodiment one
This embodiment of the application provides a text similarity analysis method which, as shown in Fig. 1, includes:
Step S100: determine a first preset number of basic feature words of a target text.
Specifically, a first preset number of feature words of the target text (for example, a patent text) are extracted from text information such as the title and abstract of the target text. The first preset number can be set according to actual needs during extraction, for example to 5, 10 or 15; that is, 5, 10, 15 or some other number of feature words are extracted from the title and abstract of the target text. The extracted feature words are used as the basic feature words of the target text, i.e. the professional terms that characterize the target text.
Step S200: based on the trained text word-vector library, expand each of the first preset number of basic feature words to obtain a second preset number of expansion words for each basic feature word.
Specifically, the title and abstract of the target text are short, so the number of professional basic feature words that can be extracted from them to characterize the target text is extremely limited and is not enough to capture the statistical properties of the feature-word frequencies that characterize the target text. Expanding each of the extracted basic feature words, and obtaining a second preset number of expansion words for each of them, greatly expands the number of extracted professional terms that can characterize the target text, improves the statistical properties of the feature-word frequencies that characterize the target text, and lays the foundation for subsequently determining the similar texts of the target text quickly and accurately.
Further, the second preset number can be set according to actual needs during expansion. It may be the same as the first preset number or different from it. For example, the second preset number can be set to 5, 15 or 30; that is, each basic feature word is expanded to obtain 5, 15, 30 or some other number of expansion words.
For example, when the basic feature word is "installation program" and the second preset number is 6, the expansion words may be "driver program", "installation file", "software", "installation package", "configuration file" and "client program".
Step S300: based on the basic feature words, the expansion words and the weight value of each word, determine the texts similar to the target text from a preset text database.
Specifically, based on the basic feature words of the target text, the expansion words and the weight value of each word, the texts similar to the target text are selected quickly and accurately from the large volume of text resources in the preset text database.
For example, when the target text is a patent text and the patent is titled "air purifier", the patents similar to it can be selected quickly and accurately from the patent resources in the preset text database, such as similar patents titled "electronic air cleaner" or "an electric-bag composite dust collector".
Further, after the similar texts of the target text are determined, information such as the enterprise or institution that owns a similar patent can be obtained by clicking to view the related information of that similar text. From this information, the technology competitors of the enterprise or institution that owns the target text can be identified; for example, the competitor is the enterprise or institution that owns the similar patent.
Compared with the prior art, the text similarity analysis method provided by this embodiment determines a first preset number of basic feature words of the target text, thereby extracting feature words that can characterize the target text; this is the precondition for subsequently expanding each of these basic feature words based on the trained text word-vector library. Expanding each of the first preset number of basic feature words based on the trained text word-vector library yields a second preset number of expansion words for each basic feature word; this greatly expands the number of extracted professional terms that can characterize the target text, improves the statistical properties of the feature-word frequencies that characterize the target text, and lays the foundation for determining the similar texts quickly and accurately. Determining the texts similar to the target text from the preset text database, based on the basic feature words, the expansion words and the weight value of each word, allows the patents similar to the target text to be selected quickly and accurately. The technology competitors of the enterprise or institution that owns the target text can then be identified from the similar patents, which greatly improves the accuracy of both patent similarity analysis and patent competitor identification.
Embodiment two
This embodiment of the application provides another possible implementation which, on the basis of embodiment one, further includes the method shown in embodiment two, in which:
In step S100, the first preset number of basic feature words of the target text are determined by the TextRank algorithm.
Specifically, taking a patent text as the target text, step S100 is explained as follows:
Existing methods usually determine the feature words of a patent by word frequency, on top of common word segmentation and part-of-speech tagging. Because these methods can extract words that are frequent but not professional, the words they extract do not characterize the patent well. To solve this problem, this embodiment uses the TextRank algorithm to extract the basic feature words of the patent. The basic feature words extracted in this way are more professional and lay the foundation for building the VSM model of the patent text.
The TextRank algorithm is a graph-based ranking algorithm for text. Its basic idea derives from Google's PageRank algorithm: the text is split into several component units (such as words or sentences), a graph model is built over them, and a voting mechanism ranks the important components of the text, so that keywords can be extracted using only the information in a single document. Unlike models such as LDA (Latent Dirichlet Allocation, a document topic generation model) and HMM (Hidden Markov Model), TextRank does not need to be trained on multiple documents in advance, and it is widely used because it is simple and effective. TextRank ranks the candidate keywords using the relationships between nearby words (a co-occurrence window) and extracts them directly from the text itself.
Further, determining the first preset number of basic feature words of the target text by the TextRank algorithm includes the following steps:
1) Split the given target text into complete sentences.
2) For each sentence, perform word segmentation and part-of-speech tagging, filter out the stop words and keep only words of the specified parts of speech, such as nouns, verbs and adjectives; the retained words are the candidate keywords.
3) Build the candidate keyword graph G = (V, E), where V is the set of nodes and E is the set of edges. The node set consists of the candidate keywords generated in 2), and an edge between any two nodes is then constructed from the co-occurrence relation: an edge exists between two nodes only when the corresponding words co-occur within a window of length K, where K is the window size, i.e. at most K words co-occur.
4) Based on the graph G = (V, E) above, iteratively propagate the weight of each node until convergence.
5) Sort the node weights in descending order to obtain the T most important words as candidate keywords, i.e. the basic feature words of this embodiment.
6) Mark the T most important words obtained in 5) in the original text; if they form adjacent phrases, combine them into multi-word keywords.
In this embodiment, the basic feature words of the target text extracted with the TextRank algorithm are not only more professional; the extraction also needs no prior training on multiple documents and is therefore simpler and more efficient.
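Steps 3) to 5) above can be sketched in Python as follows. This is a minimal illustration, not the implementation of the application: it assumes that sentence splitting, word segmentation, part-of-speech filtering and stop-word removal (steps 1 and 2) have already produced lists of candidate words, it uses the update rule and the damping factor 0.85 from the published TextRank algorithm on an unweighted graph, and it omits the phrase merging of step 6). The function name and parameters are hypothetical.

```python
from collections import defaultdict


def textrank_keywords(sentences, top_t=5, window=3, damping=0.85,
                      max_iter=100, tol=1e-6):
    """Rank candidate words with an unweighted TextRank and return the top T.

    sentences: list of lists of candidate words (steps 1 and 2 already done).
    window:    K, the co-occurrence window length in words (assumed value).
    """
    # Step 3: undirected graph; an edge joins two words that co-occur
    # within a window of K consecutive candidate words.
    neighbors = defaultdict(set)
    for words in sentences:
        for i, w in enumerate(words):
            neighbors[w]  # make sure isolated words are nodes too
            for u in words[i + 1:i + window]:
                if u != w:
                    neighbors[w].add(u)
                    neighbors[u].add(w)
    if not neighbors:
        return []

    # Step 4: iterate score(w) = (1 - d) + d * sum(score(u) / degree(u))
    # over the neighbors u of w, until the scores stop changing.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(max_iter):
        new_scores = {
            w: (1 - damping) + damping * sum(
                scores[u] / len(neighbors[u]) for u in neighbors[w])
            for w in neighbors
        }
        delta = max(abs(new_scores[w] - scores[w]) for w in scores)
        scores = new_scores
        if delta < tol:
            break

    # Step 5: sort by weight in descending order (ties broken alphabetically).
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_t]


if __name__ == "__main__":
    candidate_words = [["air", "purifier", "filter"],
                       ["filter", "dust"],
                       ["filter", "fan"]]
    print(textrank_keywords(candidate_words, top_t=3))
```

Here T corresponds to the first preset number of embodiment one; a production system would more likely call an existing keyword-extraction library than reimplement the iteration.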
Embodiment three
This embodiment of the application provides another possible implementation which, on the basis of embodiment two, further includes the method shown in embodiment three, in which:
Before step S200, the method further includes step S101 (not marked in the figure): train on the texts in a preset database with a continuous bag-of-words neural network model to obtain the trained text word-vector library.
Step S200 includes step S2001 (not marked in the figure), step S2002 (not marked in the figure) and step S2003 (not marked in the figure), in which:
Step S2001: obtain the first word vector of any basic feature word by querying the trained text word-vector library.
Step S2002: calculate the cosine similarity value between the first word vector and each second word vector, a second word vector being any word vector in the trained text word-vector library other than the first word vector.
Step S2003: determine the words corresponding to the second preset number of second word vectors whose cosine similarity value is greater than a first preset threshold, and use them as the expansion words of that basic feature word.
Specifically, this embodiment expands the basic feature words using deep learning technology. The steps of the method are as follows:
1) Train the text word-vector library using the Word2Vec (word vector) method
Representing the words of a text as word vectors is the core technique that brings deep learning algorithms into natural language processing. Word2vec is an excellent modeling tool for obtaining word vectors that Google open-sourced in 2013; it mainly uses the CBOW (Continuous Bag-Of-Words) model and the Skip-gram model. This embodiment uses the more efficient CBOW neural network model to train on the texts in the preset database and obtain the trained text word-vector library.
For example, when the texts are patent texts, this embodiment trains on about 10 GB of text from 20 million patents to obtain the trained patent word-vector library. The patent text includes text fields such as the patent title and abstract, the generated word vectors have 100 dimensions, and the trained patent word-vector library contains about 1 million words and is about 990 MB in size.
2) Expand the basic feature words based on the trained text word-vector library
Specifically, when the target text is a patent text, the basic feature words extracted from each patent text are expanded as follows. The first preset number of basic feature words obtained above by the TextRank algorithm are looked up one by one in the patent word-vector library to obtain the word vector of each basic feature word (i.e. the first word vector in step S2001), and the cosine similarity calculation is then carried out. The cosine similarity calculation computes the cosine similarity value between the word vector of a basic feature word and every other word vector in the patent word-vector library (i.e. the second word vectors in step S2002), and determines the expansion words of that basic feature word by comparing the cosine similarity values with the first preset threshold and applying the second preset number.
Further, the cosine similarity calculation above is carried out for each basic feature word that has been determined, so that the expansion words of every basic feature word are determined.
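Steps S2001 to S2003 can be sketched as follows. This is a minimal sketch under stated assumptions: the trained word-vector library is represented as a plain dictionary mapping words to lists of floats, and, among the words whose similarity exceeds the first preset threshold, the ones with the highest similarity are kept (the source does not say how the second preset number of words is chosen among those above the threshold). The names `cosine` and `expand_word`, and the toy vectors, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def expand_word(base_word, vectors, count, threshold):
    """Return up to `count` expansion words for `base_word`.

    vectors:   dict word -> vector, standing in for the trained library.
    count:     the second preset number.
    threshold: the first preset threshold on cosine similarity.
    """
    # Step S2001: look up the first word vector.
    first = vectors.get(base_word)
    if first is None:
        return []
    # Step S2002: similarity to every other (second) word vector.
    scored = [(cosine(first, vec), word)
              for word, vec in vectors.items() if word != base_word]
    # Step S2003: keep words above the threshold; taking the most similar
    # ones first is an assumption of this sketch.
    scored = [(s, w) for s, w in scored if s > threshold]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in scored[:count]]


if __name__ == "__main__":
    toy_vectors = {
        "installer": [1.0, 0.0],
        "driver": [0.9, 0.1],
        "software": [0.8, 0.3],
        "railway": [0.0, 1.0],
    }
    print(expand_word("installer", toy_vectors, count=2, threshold=0.5))
```

Scanning the whole library for every basic feature word is linear in the vocabulary size; for a library of about one million words, a vectorized or approximate nearest-neighbor search would be the more practical choice.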
For example, when the basic feature words are "installation program", "cheap", "water reuse", "decontamination", "high-speed railway" and "partially fall", and the second preset number is 6, the expansion words obtained for each basic feature word are as shown in Table 1:
Table 1: Basic feature words and their corresponding expansion words
This embodiment gives the detailed process and operating steps for determining the expansion words of each basic feature word based on the trained text word-vector library, so that those skilled in the art can follow these steps to complete the expansion of the basic feature words quickly and accurately. This greatly expands the number of extracted professional terms that can characterize the target text, improves the statistical properties of the feature-word frequencies that characterize the target text, and lays the foundation for subsequently determining the similar texts of the target text quickly and accurately.
Embodiment four
The embodiment of the present application provides another possible implementation which, on the basis of embodiment three, further includes the method shown in embodiment four, wherein
Before step S300, the method further includes step S201 (not marked in the figure): filtering out the stop words in the expansion words of any foundation characteristic word; and/or filtering out the words in the expansion words of any foundation characteristic word whose inverse document frequency is less than the second predetermined threshold.
Before step S300, the method further includes step S202 (not marked in the figure): determining the weighted value of each word. Determining the weighted value of each word includes:
determining the weighted value of any word by the following formula:
$w_i = \mathrm{idf}_i \times (\mathrm{p\_tf}_i + \mathrm{c\_tf}_i)$
where $w_i$ denotes the weighted value of the word, $\mathrm{idf}_i$ denotes the inverse document frequency of the word, $\mathrm{p\_tf}_i$ denotes the frequency of the word in the text title and text abstract of the target text, and $\mathrm{c\_tf}_i$ denotes the frequency of the word in other texts besides the target text.
Specifically, after the second predetermined number of expansion words corresponding to each foundation characteristic word has been obtained through steps S2001, S2002 and S2003 above, the obtained expansion words need to be filtered further. As needed, only the stop words may be filtered out, only the words whose inverse document frequency is less than the second predetermined threshold may be filtered out, or both may be filtered out at the same time. Filtering the obtained expansion words enables them to characterize the target text better.
For example, when the second predetermined threshold is set to 4.0, filtering the obtained expansion words may remove only the stop words, only the words whose inverse document frequency is less than 4.0, or both the stop words and the words whose inverse document frequency is less than 4.0. The resulting word set is the set of expansion words of the foundation characteristic words in the embodiment of the present application.
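The three filtering options above can be illustrated with a short Python sketch. The function name `filter_expansion_words`, the stop word list and the inverse document frequency values are all hypothetical; treating a word with no known inverse document frequency as 0.0 is also an assumption, not something the text specifies.

```python
def filter_expansion_words(expansion_words, idf, stop_words=None, min_idf=None):
    """Filter expansion words by stop word list and/or minimum IDF.

    Pass stop_words=None to skip stop word filtering, and min_idf=None to
    skip the inverse document frequency filter.
    """
    kept = []
    for word in expansion_words:
        if stop_words is not None and word in stop_words:
            continue  # drop stop words
        # Assumption: a word missing from the IDF table is treated as IDF 0.0.
        if min_idf is not None and idf.get(word, 0.0) < min_idf:
            continue  # drop words below the second predetermined threshold
        kept.append(word)
    return kept


# Made-up example data; 4.0 is the threshold used in the example above.
words = ["cleaning", "the", "device", "descaling"]
toy_idf = {"cleaning": 5.2, "the": 0.3, "device": 2.1, "descaling": 6.0}
toy_stop_words = {"the"}

only_stop = filter_expansion_words(words, toy_idf, stop_words=toy_stop_words)
only_idf = filter_expansion_words(words, toy_idf, min_idf=4.0)
both = filter_expansion_words(words, toy_idf, stop_words=toy_stop_words, min_idf=4.0)
```

In this toy data the stop word also has a very low inverse document frequency, so the IDF filter alone already removes it; in general the two filters remove different words.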
Further, assume that the foundation characteristic words and their expansion words obtained through the above steps are $w_1, w_2, \ldots, w_N$, and that the target text in the above steps is a patent text. The weighted value of each word (including each foundation characteristic word and each of its expansion words) can then be calculated with formula (1):
$w_i = \mathrm{idf}_i \times (\mathrm{p\_tf}_i + \mathrm{c\_tf}_i)$ (1)
where $w_i$ denotes the weighted value of the word, $\mathrm{idf}_i$ denotes the inverse document frequency of the word, $\mathrm{p\_tf}_i$ denotes the frequency of the word in the patent title and patent abstract, and $\mathrm{c\_tf}_i$ denotes the frequency of occurrence of the word in other texts besides the patent text (such as paper texts). In addition, $\mathrm{p\_tf}_i$ may be calculated as: (number of occurrences of the word in the patent title and patent abstract + 1) / (total number of foundation characteristic words and their expansion words + 1). For words that do not occur in the patent title or patent abstract, adding 1 provides a smoothing effect.
Further, after the weighted value $w_i$ of each word is obtained, the weighted values are normalized to obtain the weight distribution over the words of the patent, as shown in Fig. 2.
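As a worked illustration of formula (1) with the add-one smoothing and the normalization step, consider the following Python sketch. The function name `word_weights` and all numeric inputs are hypothetical. The text does not say which normalization is used, so dividing each weight by the sum of all weights is an assumption, and the frequency in other texts is simply taken as a given input because its calculation is not described.

```python
def word_weights(words, idf, title_abstract_counts, c_tf):
    """Compute normalized weights w_i = idf_i * (p_tf_i + c_tf_i).

    words: foundation characteristic words plus their expansion words
    idf: word -> inverse document frequency
    title_abstract_counts: word -> occurrences in title and abstract
    c_tf: word -> frequency in other texts (taken as given)
    """
    total = len(words)
    raw = {}
    for word in words:
        # Add-one smoothing, so words absent from title/abstract keep p_tf > 0.
        p_tf = (title_abstract_counts.get(word, 0) + 1) / (total + 1)
        raw[word] = idf[word] * (p_tf + c_tf.get(word, 0.0))
    norm = sum(raw.values())
    if norm == 0.0:
        return raw
    # Assumption: normalize so that the weights sum to 1.
    return {word: value / norm for word, value in raw.items()}


# Made-up example values.
toy_words = ["decontamination", "cleaning", "washing"]
toy_idf = {"decontamination": 4.0, "cleaning": 5.0, "washing": 6.0}
toy_counts = {"decontamination": 3, "cleaning": 1}  # "washing" never occurs
toy_c_tf = {"decontamination": 0.25, "cleaning": 0.5, "washing": 0.0}

weights = word_weights(toy_words, toy_idf, toy_counts, toy_c_tf)
```

Here the unnormalized weights are 4.0 × (1.0 + 0.25) = 5.0, 5.0 × (0.5 + 0.5) = 5.0 and 6.0 × (0.25 + 0.0) = 1.5, so "washing" still receives a nonzero weight thanks to the smoothing term.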
In the embodiment of the present application, filtering out the stop words and the words whose inverse document frequency is less than the second predetermined threshold from the expansion words enables the expansion words to characterize the target text better, and effectively prevents such words from degrading the accuracy of the text similarity analysis. In addition, the provided method for determining the weighted value of each word allows those skilled in the art to determine the weighted values quickly, which is a precondition for subsequently determining the similar texts of the target text from the preset text database.
Embodiment five
The embodiment of the present application provides another possible implementation which, on the basis of embodiment four, further includes the method shown in embodiment five, wherein
Step S300 includes step S3001, step S3002, step S3003 and step S3004 (none of which is marked in the figure), wherein
Step S3001: For each of multiple texts to be screened in the preset text database, perform the steps of determining the first predetermined number of foundation characteristic words, extending each of them based on the trained text term vector library to obtain the second predetermined number of expansion words for each foundation characteristic word, and determining the weighted value of each word, so as to obtain, for each text to be screened, its foundation characteristic words, the weighted values of the foundation characteristic words, the expansion words of the foundation characteristic words, and the weighted values of the expansion words.
Step S3002: Detect whether the foundation characteristic words and expansion words of any text to be screened contain words identical to the foundation characteristic words and expansion words of the target text.
Step S3003: For any text to be screened, if such identical words exist, calculate, for each identical word, the product of its weighted value in the text to be screened and its weighted value in the target text, and calculate the sum of the products over all identical words.
Step S3004: Among the multiple texts to be screened, select the texts to be screened whose calculated sum of products is greater than the third predetermined threshold as the similar texts of the target text.
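Steps S3002 to S3004 amount to a weighted overlap score between two word sets. A minimal Python sketch follows, assuming the per-text word weights from step S3001 are already available as dictionaries; the function names, text identifiers and weight values are hypothetical, and sorting the results by score is an added convenience that the text does not require.

```python
def similarity_score(target_weights, candidate_weights):
    """Sum, over shared words, of the product of the two weighted values."""
    shared = target_weights.keys() & candidate_weights.keys()  # step S3002
    return sum(target_weights[w] * candidate_weights[w] for w in shared)  # S3003


def select_similar_texts(target_weights, candidates, threshold):
    """Return (text_id, score) pairs whose score exceeds threshold (step S3004)."""
    results = []
    for text_id, weights in candidates.items():
        if not (target_weights.keys() & weights.keys()):
            continue  # no identical word: this text is filtered out directly
        score = similarity_score(target_weights, weights)
        if score > threshold:  # third predetermined threshold
            results.append((text_id, score))
    results.sort(key=lambda pair: -pair[1])  # most similar first (convenience)
    return results


# Made-up weights for a target text and three texts to be screened.
target = {"decontamination": 0.5, "cleaning": 0.3, "washing": 0.2}
candidates = {
    "A": {"cleaning": 0.6, "washing": 0.4},
    "B": {"railway": 1.0},
    "C": {"decontamination": 0.1, "bridge": 0.9},
}
similar = select_similar_texts(target, candidates, threshold=0.1)
```

In this toy data, text A shares two words with the target (score 0.3 × 0.6 + 0.2 × 0.4 = 0.26), text B shares none, and text C shares one word but scores only 0.05, so only A passes a threshold of 0.1.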
Specifically, a large number of texts such as patents and papers are stored in the preset text database. When screening the similar texts of the target text from the preset text database, the multiple texts to be screened in the preset text database are processed by step S100 in embodiments one to four above (determining the first predetermined number of foundation characteristic words), step S200 (extending each of the first predetermined number of foundation characteristic words based on the trained text term vector library to obtain the second predetermined number of expansion words for each foundation characteristic word), step S201 (filtering out the stop words in the expansion words of any foundation characteristic word; and/or filtering out the words in the expansion words of any foundation characteristic word whose inverse document frequency is less than the second predetermined threshold), step S202 (determining the weighted value of each word), and so on, so as to obtain, for each text to be screened, its foundation characteristic words, the weighted values of the foundation characteristic words, the expansion words of the foundation characteristic words, and the weighted values of the expansion words.
Further, when searching for the similar texts of the target text among the texts to be screened in the preset text database, the texts to be screened can be traversed according to the foundation characteristic words and expansion words of the target text. Specifically, each text to be screened is checked in turn to detect whether its foundation characteristic words and expansion words contain words identical to the foundation characteristic words and expansion words of the target text. Texts to be screened that contain no such identical words are filtered out, and only the texts to be screened that do contain identical words are retained for further processing.
Further, when a text to be screened contains words identical to the foundation characteristic words and expansion words of the target text, the product of the weighted value of each identical word in the text to be screened and its weighted value in the target text is calculated. When there are multiple identical words, their products are added up, that is, the sum of the products over all identical words is calculated; when there is only one identical word, its product is used directly as the final sum of products.
Further, from the texts to be screened that contain words identical to the foundation characteristic words and expansion words of the target text, the texts closest to the target text are screened out as the similar texts of the target text. The texts to be screened whose sum of products is greater than the third predetermined threshold can be selected as the similar texts of the target text, and the value of the third predetermined threshold can be set dynamically according to actual needs. Table 2 gives an example of how the relevant information of a target text and its corresponding similar texts can be displayed.
Table 2: Target text and its corresponding similar text information
Further, combining the methods of embodiments one to five of the present application, Fig. 3 takes a patent text as the target text and shows the basic process of searching for patents similar to a target patent. In Fig. 3, step S1 (extraction of patent foundation characteristic words based on TextRank) is performed first, followed by step S2 (determining the deep learning algorithm), step S3 (training the patent term vector library), step S4 (extending the foundation characteristic words based on the patent term vector library), step S5 (filtering the patent feature expansion words), step S6 (calculating the patent feature word weights), and finally step S7 (outputting the similar patents and the corresponding patentees).
The embodiment of the present application gives the detailed process and operating steps for determining the similar texts of the target text from the preset text database based on the foundation characteristic words, the expansion words and the weighted value of each word, so that those skilled in the art can quickly and accurately select the similar texts of the target text from the preset text database by following the steps in the embodiment, and then identify, according to the similar patents, the technology competitors of the enterprise or institution that owns the target text.
Embodiment six
Fig. 4 is a structural schematic diagram of a text similarity analysis device provided by the embodiment of the present application. As shown in Fig. 4, the text similarity analysis device 40 may include a first determining module 41, an expansion module 42 and a second determining module 43, wherein:
The first determining module 41 is used to determine the first predetermined number of foundation characteristic words of the target text;
The expansion module 42 is used to extend each of the first predetermined number of foundation characteristic words based on the trained text term vector library, obtaining the second predetermined number of expansion words corresponding to each foundation characteristic word;
The second determining module 43 is used to determine the similar texts of the target text from the preset text database based on the foundation characteristic words, the expansion words and the weighted value of each word.
Specifically, the first determining module 41 is used to determine the first predetermined number of foundation characteristic words of the target text by the TextRank algorithm.
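For readers unfamiliar with TextRank, the idea of ranking words by their position in a co-occurrence graph can be sketched as follows. This is a simplified, unweighted illustration and not necessarily the variant used in the application: the function name `textrank_keywords`, the window size, the damping factor of 0.85, the fixed iteration count and the toy token list are all assumptions, and real use would first segment the text and remove stop words.

```python
def textrank_keywords(tokens, top_n, window=2, damping=0.85, iterations=50):
    """Rank tokens with a simplified, unweighted TextRank and return the top_n."""
    # Build an undirected co-occurrence graph: two distinct tokens are linked
    # when they appear within `window` positions of each other.
    neighbors = {token: set() for token in tokens}
    for i, token in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            other = tokens[j]
            if other != token:
                neighbors[token].add(other)
                neighbors[other].add(token)

    # PageRank-style iteration: each word shares its score among its neighbors.
    scores = {token: 1.0 for token in neighbors}
    for _ in range(iterations):
        updated = {}
        for token in neighbors:
            incoming = sum(scores[n] / len(neighbors[n]) for n in neighbors[token])
            updated[token] = (1.0 - damping) + damping * incoming
        scores = updated

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda token: (-scores[token], token))
    return ranked[:top_n]


# Toy token sequence in which "water" co-occurs with every other word.
toy_tokens = ["water", "reuse", "water", "filter", "water", "pump", "water", "valve"]
top_word = textrank_keywords(toy_tokens, top_n=1, window=1)
```

In this toy sequence "water" is linked to every other token, so it receives the highest score; `top_n` plays the role of the first predetermined number.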
Further, the device further includes a training module 44, as shown in Fig. 5, wherein the training module 44 is used to train on the texts in the preset database by means of a continuous bag-of-words neural network model, obtaining the trained text term vector library.
Further, the expansion module 42 includes an acquisition submodule 421, a calculation submodule 422 and an expansion word determination submodule 423, as shown in Fig. 5, wherein the acquisition submodule 421 is used to obtain the first term vector of any foundation characteristic word by querying the trained text term vector library;
The calculation submodule 422 is used to calculate the cosine similarity values between the first term vector and the second term vectors, the second term vectors being the term vectors in the trained text term vector library other than the first term vector;
The expansion word determination submodule 423 is used to determine the words corresponding to the second predetermined number of second term vectors whose cosine similarity values are greater than the first predetermined threshold, and to take them as the expansion words of that foundation characteristic word.
Further, the device further includes a filtering module 45, as shown in Fig. 5, wherein the filtering module 45 is used to filter out the stop words in the expansion words of any foundation characteristic word; and/or to filter out the words in the expansion words of any foundation characteristic word whose inverse document frequency is less than the second predetermined threshold.
Further, the device further includes a weight determination module 46, as shown in Fig. 5, wherein the weight determination module 46 is used to determine the weighted value of each word; specifically, it determines the weighted value of any word by the following formula:
$w_i = \mathrm{idf}_i \times (\mathrm{p\_tf}_i + \mathrm{c\_tf}_i)$
where $w_i$ denotes the weighted value of the word, $\mathrm{idf}_i$ denotes the inverse document frequency of the word, $\mathrm{p\_tf}_i$ denotes the frequency of the word in the text title and text abstract of the target text, and $\mathrm{c\_tf}_i$ denotes the frequency of the word in other texts besides the target text.
Further, the second determining module 43 includes a preprocessing submodule 431, a detection submodule 432, a product calculation submodule 433 and a screening submodule 434, wherein
The preprocessing submodule 431 is used to perform, for each of multiple texts to be screened in the preset text database, the steps of obtaining the first predetermined number of foundation characteristic words, extending each of them based on the trained text term vector library to obtain the second predetermined number of expansion words for each foundation characteristic word, and determining the weighted value of each word, so as to obtain, for each text to be screened, its foundation characteristic words, the weighted values of the foundation characteristic words, the expansion words of the foundation characteristic words, and the weighted values of the expansion words;
The detection submodule 432 is used to detect whether the foundation characteristic words and expansion words of any text to be screened contain words identical to the foundation characteristic words and expansion words of the target text;
The product calculation submodule 433 is used, for any text to be screened, to calculate, if identical words exist, the product of the weighted value of each identical word in the text to be screened and its weighted value in the target text, and to calculate the sum of the products over all identical words;
The screening submodule 434 is used to select, among the multiple texts to be screened, the texts to be screened whose calculated sum of products is greater than the third predetermined threshold as the similar texts of the target text.
Compared with the prior art, the device provided by the embodiment of the present application determines the first predetermined number of foundation characteristic words of the target text, thereby extracting the text feature words that can characterize the target text and providing the precondition for subsequently extending each of these foundation characteristic words based on the trained text term vector library. Extending each of the first predetermined number of foundation characteristic words based on the trained text term vector library to obtain the second predetermined number of expansion words for each of them greatly increases the number of extracted professional words that can characterize the target text, effectively improves the statistical properties of the word frequencies of the text feature words characterizing the target text, and lays the foundation for subsequently determining the similar texts of the target text quickly and accurately. Determining the similar texts of the target text from the preset text database based on the foundation characteristic words, the expansion words and the weighted value of each word makes it possible to select the similar patents of the target text from the preset text database quickly and accurately, and then to identify, according to the similar patents, the technology competitors of the enterprise or institution that owns the target text, which greatly improves the accuracy of patent similarity analysis and of patent competitor identification.
Embodiment seven
The embodiment of the present application provides an electronic device. As shown in Fig. 6, the electronic device 600 includes a processor 601 and a memory 603, wherein the processor 601 is connected to the memory 603, for example through a bus 602. Further, the electronic device 600 may also include a transceiver 604. It should be noted that, in practical applications, the number of transceivers 604 is not limited to one, and the structure of the electronic device 600 does not constitute a limitation on the embodiment of the present application.
In the embodiment of the present application, the processor 601 is used to realize the functions of the first determining module, the expansion module and the second determining module shown in Fig. 4. The transceiver 604 includes a receiver and a transmitter; in the embodiment of the present application, the transceiver 604 is used to realize the function of the acquisition submodule shown in Fig. 5.
The processor 601 may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various illustrative logic blocks, modules and circuits described in connection with the present disclosure. The processor 601 may also be a combination that realizes computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The bus 602 may include a path for transferring information between the above components. The bus 602 may be a PCI bus, an EISA bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in Fig. 6, but this does not mean that there is only one bus or only one type of bus.
The memory 603 may be a ROM or another type of static storage device that can store static information and instructions, a RAM or another type of dynamic storage device that can store information and instructions, an EEPROM, a CD-ROM or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
The memory 603 is used to store the application program code for executing the solution of the present application, and execution is controlled by the processor 601. The processor 601 is used to execute the application program code stored in the memory 603, so as to realize the actions of the text similarity analysis device provided by the embodiment shown in Fig. 4.
The embodiment of the present application provides a computer readable storage medium on which a computer program is stored; when the program is executed by a processor, the method shown in embodiment one is realized. Compared with the prior art, determining the first predetermined number of foundation characteristic words of the target text extracts the text feature words that can characterize the target text and provides the precondition for subsequently extending each of these foundation characteristic words based on the trained text term vector library. Extending each of the first predetermined number of foundation characteristic words based on the trained text term vector library to obtain the second predetermined number of expansion words for each of them greatly increases the number of extracted professional words that can characterize the target text, effectively improves the statistical properties of the word frequencies of the text feature words characterizing the target text, and lays the foundation for subsequently determining the similar texts of the target text quickly and accurately. Determining the similar texts of the target text from the preset text database based on the foundation characteristic words, the expansion words and the weighted value of each word makes it possible to select the similar patents of the target text from the preset text database quickly and accurately, and then to identify, according to the similar patents, the technology competitors of the enterprise or institution that owns the target text, which greatly improves the accuracy of patent similarity analysis and of patent competitor identification.
The computer readable storage medium provided by the embodiment of the present application is applicable to any embodiment of the above method, and details are not repeated here.
It should be understood that although the steps in the flow charts of the drawings are shown in sequence as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless expressly stated otherwise herein, there is no strict order restriction on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in the flow charts of the drawings may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same moment but may be executed at different times, and they are not necessarily executed one after another but may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
The above are only some embodiments of the present application. It should be noted that those of ordinary skill in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be regarded as falling within the protection scope of the present application.
Claims (16)
1. A text similarity analysis method, characterized by comprising:
determining a first predetermined number of foundation characteristic words of a target text;
extending each of the first predetermined number of foundation characteristic words based on a trained text term vector library, to obtain a second predetermined number of expansion words corresponding to each foundation characteristic word;
determining similar texts of the target text from a preset text database based on the foundation characteristic words, the expansion words and the weighted value of each word.
2. The method according to claim 1, characterized in that determining the first predetermined number of foundation characteristic words of the target text comprises:
determining the first predetermined number of foundation characteristic words of the target text by the TextRank algorithm.
3. The method according to claim 1, characterized in that, before extending each of the first predetermined number of foundation characteristic words based on the trained text term vector library, the method further comprises:
training on the texts in a preset database by means of a continuous bag-of-words neural network model, to obtain the trained text term vector library.
4. The method according to any one of claims 1-3, characterized in that extending each of the first predetermined number of foundation characteristic words based on the trained text term vector library, to obtain the second predetermined number of expansion words corresponding to each foundation characteristic word, comprises:
obtaining a first term vector of any foundation characteristic word by querying the trained text term vector library;
calculating cosine similarity values between the first term vector and second term vectors, the second term vectors being the term vectors in the trained text term vector library other than the first term vector;
determining the words corresponding to the second predetermined number of second term vectors whose cosine similarity values are greater than a first predetermined threshold, and taking them as the expansion words of that foundation characteristic word.
5. The method according to claim 1, characterized in that, before determining the similar texts of the target text from the preset text database based on the foundation characteristic words, the expansion words and the weighted value of each word, the method further comprises:
filtering out the stop words in the expansion words of any foundation characteristic word; and/or
filtering out the words in the expansion words of any foundation characteristic word whose inverse document frequency is less than a second predetermined threshold.
6. The method according to claim 5, characterized in that, before determining the similar texts of the target text from the preset text database based on the foundation characteristic words, the expansion words and the weighted value of each word, the method further comprises: determining the weighted value of each word;
wherein determining the weighted value of each word comprises:
determining the weighted value of any word by the following formula:
$w_i = \mathrm{idf}_i \times (\mathrm{p\_tf}_i + \mathrm{c\_tf}_i)$
where $w_i$ denotes the weighted value, $\mathrm{idf}_i$ denotes the inverse document frequency of the word, $\mathrm{p\_tf}_i$ denotes the frequency of the word in the text title and text abstract of the target text, and $\mathrm{c\_tf}_i$ denotes the frequency of the word in other texts besides the target text.
7. The method according to claim 6, characterized in that determining the similar texts of the target text from the preset text database based on the foundation characteristic words, the expansion words and the weighted value of each word comprises:
performing, for each of multiple texts to be screened in the preset text database, the steps of obtaining the first predetermined number of foundation characteristic words, extending each of them based on the trained text term vector library to obtain the second predetermined number of expansion words corresponding to each foundation characteristic word, and determining the weighted value of each word, so as to obtain, for each text to be screened, its foundation characteristic words, the weighted values of the foundation characteristic words, the expansion words of the foundation characteristic words, and the weighted values of the expansion words;
detecting whether the foundation characteristic words and expansion words of any text to be screened contain words identical to the foundation characteristic words and expansion words of the target text;
for any text to be screened, if identical words exist, calculating the product of the weighted value of each identical word in the text to be screened and its weighted value in the target text, and calculating the sum of the products over all identical words;
selecting, among the multiple texts to be screened, the texts to be screened whose calculated sum of products is greater than a third predetermined threshold as the similar texts of the target text.
8. A text similarity analysis device, characterized by comprising:
a first determining module, used to determine a first predetermined number of foundation characteristic words of a target text;
an expansion module, used to extend each of the first predetermined number of foundation characteristic words based on a trained text term vector library, to obtain a second predetermined number of expansion words corresponding to each foundation characteristic word;
a second determining module, used to determine similar texts of the target text from a preset text database based on the foundation characteristic words, the expansion words and the weighted value of each word.
9. The device according to claim 8, characterized in that the first determining module is specifically used to determine the first predetermined number of foundation characteristic words of the target text by the TextRank algorithm.
10. The device according to claim 8, characterized by further comprising a training module;
the training module is used to train on the texts in a preset database by means of a continuous bag-of-words neural network model, to obtain the trained text term vector library.
11. The device according to any one of claims 8-10, characterized in that the expansion module comprises an acquisition submodule, a calculation submodule and an expansion word determination submodule;
the acquisition submodule is used to obtain a first term vector of any foundation characteristic word by querying the trained text term vector library;
the calculation submodule is used to calculate cosine similarity values between the first term vector and second term vectors, the second term vectors being the term vectors in the trained text term vector library other than the first term vector;
the expansion word determination submodule is used to determine the words corresponding to the second predetermined number of second term vectors whose cosine similarity values are greater than a first predetermined threshold, and to take them as the expansion words of that foundation characteristic word.
12. The device according to claim 8, characterized by further comprising a filtering module;
the filtering module is used to filter out the stop words in the expansion words of any foundation characteristic word; and/or to filter out the words in the expansion words of any foundation characteristic word whose inverse document frequency is less than a second predetermined threshold.
13. The device according to claim 12, characterized by further comprising a weight determination module;
the weight determination module is used to determine the weighted value of each word; specifically, it is used to determine the weighted value of any word by the following formula:
$w_i = \mathrm{idf}_i \times (\mathrm{p\_tf}_i + \mathrm{c\_tf}_i)$
where $w_i$ denotes the weighted value, $\mathrm{idf}_i$ denotes the inverse document frequency of the word, $\mathrm{p\_tf}_i$ denotes the frequency of the word in the text title and text abstract of the target text, and $\mathrm{c\_tf}_i$ denotes the frequency of the word in other texts besides the target text.
14. The device according to claim 13, characterized in that the second determining module comprises a pretreatment submodule, a detection sub-module, a product computational submodule and a screening submodule;
The pretreatment submodule is configured to perform, for each of multiple texts to be screened in a preset text database, the steps of obtaining a first predetermined number of foundation characteristic words, extending each of the first predetermined number of foundation characteristic words based on the trained text term vector library to obtain the second predetermined number of expansion words corresponding to each foundation characteristic word, and determining the weighted value of each word, thereby obtaining, for each text to be screened, its foundation characteristic words, the weighted values of the foundation characteristic words, the expansion words of the foundation characteristic words, and the weighted values of the expansion words;
The detection sub-module is configured to detect whether the foundation characteristic words and expansion words of any text to be screened include words identical to the foundation characteristic words and expansion words of the target text;
The product computational submodule is configured, for any text to be screened in which such identical words exist, to calculate for each identical word the product of its weighted value in the text to be screened and its weighted value in the target text, and to calculate the sum of the products over all identical words;
The screening submodule is configured to select, from the multiple texts to be screened, the texts to be screened whose calculated sum of products exceeds the third predetermined threshold value, as the similar texts of the target text.
15. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the text similarity analysis method according to any one of claims 1-7.
16. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the program, when executed by a processor, implements the text similarity analysis method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810522854.4A CN108804421B (en) | 2018-05-28 | 2018-05-28 | Text similarity analysis method and device, electronic equipment and computer storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810522854.4A CN108804421B (en) | 2018-05-28 | 2018-05-28 | Text similarity analysis method and device, electronic equipment and computer storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108804421A (en) | 2018-11-13 |
CN108804421B (en) | 2022-04-15 |
Family
ID=64090466
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810522854.4A Active CN108804421B (en) | 2018-05-28 | 2018-05-28 | Text similarity analysis method and device, electronic equipment and computer storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108804421B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070143322A1 (en) * | 2005-12-15 | 2007-06-21 | International Business Machines Corporation | Document comparision using multiple similarity measures |
CN103377226A (en) * | 2012-04-25 | 2013-10-30 | 中国移动通信集团公司 | Intelligent search method and system thereof |
CN105320772A (en) * | 2015-11-02 | 2016-02-10 | 武汉大学 | Associated paper query method for patent duplicate checking |
CN107247780A (en) * | 2017-06-12 | 2017-10-13 | 北京理工大学 | A kind of patent document method for measuring similarity of knowledge based body |
Non-Patent Citations (1)
Title |
---|
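Xu Xiaoyang, Zheng Yanning, Liu Zhihui: "Research on a method for identifying research fronts by combining papers and patents", Library and Information Service * |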
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109558481A (en) * | 2018-12-03 | 2019-04-02 | 中国科学技术信息研究所 | Patent and Business Relevancy Measurement Method, device, equipment and readable storage medium storing program for executing |
CN109614478A (en) * | 2018-12-18 | 2019-04-12 | 北京中科闻歌科技股份有限公司 | Construction method, key word matching method and the device of term vector model |
CN109885813A (en) * | 2019-02-18 | 2019-06-14 | 武汉瓯越网视有限公司 | A kind of operation method, system, server and the storage medium of the text similarity based on word coverage |
CN110427330A (en) * | 2019-08-13 | 2019-11-08 | 腾讯科技(深圳)有限公司 | A kind of method and relevant apparatus of code analysis |
CN110427330B (en) * | 2019-08-13 | 2023-09-26 | 腾讯科技(深圳)有限公司 | Code analysis method and related device |
CN111199148B (en) * | 2019-12-26 | 2023-01-20 | 东软集团股份有限公司 | Text similarity determination method and device, storage medium and electronic equipment |
CN111199148A (en) * | 2019-12-26 | 2020-05-26 | 东软集团股份有限公司 | Text similarity determination method and device, storage medium and electronic equipment |
CN111414753A (en) * | 2020-03-09 | 2020-07-14 | 中国美术学院 | Method and system for extracting perceptual image vocabulary of product |
CN112215008A (en) * | 2020-10-23 | 2021-01-12 | 中国平安人寿保险股份有限公司 | Entity recognition method and device based on semantic understanding, computer equipment and medium |
CN112215008B (en) * | 2020-10-23 | 2024-04-16 | 中国平安人寿保险股份有限公司 | Entity identification method, device, computer equipment and medium based on semantic understanding |
CN113064979A (en) * | 2021-03-10 | 2021-07-02 | 国网河北省电力有限公司 | Keyword retrieval-based method for judging construction period and price reasonability |
CN113033197A (en) * | 2021-03-24 | 2021-06-25 | 中新国际联合研究院 | Building construction contract rule query method and device |
CN114331766A (en) * | 2022-01-05 | 2022-04-12 | 中国科学技术信息研究所 | Method and device for determining patent technology core degree, electronic equipment and storage medium |
CN114331766B (en) * | 2022-01-05 | 2022-07-08 | 中国科学技术信息研究所 | Method and device for determining patent technology core degree, electronic equipment and storage medium |
CN115358221A (en) * | 2022-08-12 | 2022-11-18 | 维正知识产权科技有限公司 | Enterprise patent data comparison method and device, electronic equipment and medium |
Also Published As
Publication number | Publication date |
---|---|
CN108804421B (en) | 2022-04-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108804421A (en) | Text similarity analysis method, device, electronic equipment and computer storage media | |
CA2423033C (en) | A document categorisation system | |
US9256649B2 (en) | Method and system of filtering and recommending documents | |
CN102609407B (en) | Fine-grained semantic detection method of harmful text contents in network | |
CN102332028A (en) | Webpage-oriented unhealthy Web content identifying method | |
CN111191022A (en) | Method and device for generating short titles of commodities | |
CN107015961A (en) | A kind of text similarity comparison method | |
CN106649849A (en) | Text information base building method and device and searching method, device and system | |
CN109446423B (en) | System and method for judging sentiment of news and texts | |
CN110134777A (en) | Problem De-weight method, device, electronic equipment and computer readable storage medium | |
KR101638535B1 (en) | Method of detecting issue patten associated with user search word, server performing the same and storage medium storing the same | |
CN110019820A (en) | Main suit and present illness history symptom Timing Coincidence Detection method in a kind of case history | |
CN111897955B (en) | Comment generation method, device, equipment and storage medium based on encoding and decoding | |
CN110795930A (en) | Article title optimization method, system, medium and equipment | |
CN107305555A (en) | Data processing method and device | |
CN107291686B (en) | Method and system for identifying emotion identification | |
US20090319514A1 (en) | Method and system for assigning scores | |
Syn et al. | Using latent semantic analysis to identify quality in use (qu) indicators from user reviews | |
EP1197884A2 (en) | Method and apparatus for authoring and viewing audio documents | |
CN107229654A (en) | A kind of heat searches word acquisition methods and system | |
CN110019702B (en) | Data mining method, device and equipment | |
CN110633466B (en) | Short message crime identification method and system based on semantic analysis and readable storage medium | |
CN109558481B (en) | Method, device and equipment for measuring correlation between patent and enterprise and readable storage medium | |
CN112308453A (en) | Risk identification model training method, user risk identification method and related device | |
JP2002251590A (en) | Document analyzer |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||