CN103049569A - Text similarity matching method on basis of vector space model - Google Patents

Text similarity matching method on basis of vector space model

Info

Publication number
CN103049569A
CN103049569A CN2012105931481A CN201210593148A
Authority
CN
China
Prior art keywords
keyword
text
similarity
vector space
space model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012105931481A
Other languages
Chinese (zh)
Inventor
江潮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WUHAN TRANSN INFORMATION TECHNOLOGY Co Ltd
Original Assignee
WUHAN TRANSN INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WUHAN TRANSN INFORMATION TECHNOLOGY Co Ltd filed Critical WUHAN TRANSN INFORMATION TECHNOLOGY Co Ltd
Priority to CN2012105931481A priority Critical patent/CN103049569A/en
Publication of CN103049569A publication Critical patent/CN103049569A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text similarity matching method on the basis of a vector space model. The method includes extracting the keywords of texts, clustering all the keywords, and generating a keyword concept tree; computing the similarity of texts according to the keyword concept tree built from the keywords of the text to be translated; and acquiring, according to the similarity, the texts in the translation reference library that match the text to be translated. With this technical scheme, the text similarity matching method reflects the relations among texts comparatively accurately, so that the similarity of texts can be reflected more fully.

Description

Text similarity matching method based on vector space model
Technical field
The present invention relates to computer technology, and in particular to a text similarity matching method based on a vector space model.
Background technology
Commonly used text retrieval models include literal (text-based) retrieval models and structure-based retrieval models. Text-based retrieval models include the vector space model, approximate models, probability models and statistical language retrieval models; structure-based retrieval models include internal-structure and external-structure retrieval models.
The similarity of texts is a numerical measure of the degree of similarity between two texts. For two texts D1 and D2 it can be taken as (D1 ∩ D2)/(D1 ∪ D2); the closer the value is to 1, the more similar the two texts are, and vice versa. In text retrieval, similarity computation is mainly used to measure the degree of similarity between text objects and is a basic computation in data mining and natural language processing. The key techniques are two parts: the feature representation of objects and the similarity relation between feature sets. Information retrieval, webpage duplicate detection, recommender systems and the like all involve computing the similarity between objects, or between an object and a set of objects. For different application scenarios, constrained by data scale, space-time overhead and so on, the choice of similarity computation method also differs.
The commonly used method of computing similarity is the VSM (vector space model). This model extracts keywords from a text and assigns them weights, represents the text as a vector of weighted keywords, and then obtains the similarity of two texts by computing the distance between their vectors.
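For contrast with the improved method introduced later, the following is a minimal sketch of this conventional VSM computation in Python, using plain term-frequency weights. The weighting scheme, the function names and the toy keyword vectors are illustrative assumptions only; the patent does not prescribe them.

```python
# Minimal sketch of the conventional VSM similarity described above.
# Keyword weights here are plain term frequencies; the text only speaks of
# "weighted keywords", so another weighting (e.g. TF-IDF) could be used.
import math
from collections import Counter

def keyword_vector(keywords):
    """Represent a text as a keyword -> weight mapping (term frequency)."""
    return Counter(keywords)

def cosine_similarity(v1, v2):
    """Classic VSM cosine: zero when the two texts share no keywords."""
    shared = set(v1) & set(v2)
    dot = sum(v1[k] * v2[k] for k in shared)
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

d1 = keyword_vector(["translation", "memory", "retrieval"])
d2 = keyword_vector(["interpreter", "matching", "corpus"])
print(cosine_similarity(d1, d2))  # 0.0: disjoint keyword sets give zero similarity
```

As the example output shows, two texts with no keyword in common always score 0 under this scheme, which is exactly the weakness discussed next.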
Because keywords are likely to exhibit synonymy, polysemy and similar phenomena, the precision of similarity computation with the traditional vector space model method is not high and the results are often unsatisfactory. The keyword weighting algorithm only captures the relation between a text and its keywords; it cannot capture, across different texts, the relations between keywords, which brings the following problems to text retrieval:
(1) Keywords cannot accurately express the user's need.
It is difficult for the user to select accurate search keywords, because this involves the semantic mapping between the query and the underlying concepts. The search keywords provided by the user may not reflect the user's intention well.
(2) Keywords cannot fully reflect the content of a text.
If a keyword is too broad in meaning, it becomes difficult or impossible to retrieve the related texts.
(3) Polysemy.
Because keyword matching has difficulty handling polysemy, a large amount of irrelevant information tends to be retrieved.
(4) Keywords appear in the text in the form of synonyms.
Sometimes the user's search keyword does not appear in the text directly but appears as a synonym, a near-synonym or another word formed from the keyword; such a text cannot be retrieved. When the search keyword and the feature words of the text stand in a concept hyponymy relation, retrieval is even more difficult.
Summary of the invention
The technical problem solved by the invention is to provide a text similarity matching method based on a vector space model that reflects the relations between texts comparatively accurately and can therefore reflect the similarity of texts more fully.
Technical scheme is as follows:
A text similarity matching method based on a vector space model comprises:
extracting the keywords of texts, clustering all the keywords, and generating a keyword concept tree;
computing the similarity of texts according to the keyword concept tree built from the keywords of the text to be translated, and obtaining, by the magnitude of the similarity, the matching texts in the translation reference library.
Further, the step of generating the keyword concept tree comprises:
extracting all keywords in the documents to be classified and in the reference library to obtain a keyword set;
clustering the keywords in the keyword set, aggregating keywords of the same concept into a concept class set, and generating the keyword concept tree from the concept class sets.
Further, if the probability p(k_i) that keyword k_i occurs satisfies p(k_i) > P1, and the conditional probability p(k_j|k_i) that keyword k_j also appears in a text in which k_i appears satisfies p(k_j|k_i) > P2, then keywords k_j and k_i are considered to express the same concept, P1 and P2 being preset probability thresholds.
Further, the specific steps of generating the keyword concept tree comprise:
extracting all keywords in the documents to be classified and in the reference library to obtain a keyword set C = {k_1, k_2, ..., k_n}, and computing for each keyword k in C its probability of occurrence p(k) in the reference library, i.e. the ratio of the number of texts in which keyword k appears to the total number of texts in the set;
filtering the keywords according to set thresholds, keeping the keywords with p_min < p(k) < p_max as entries to be merged, the number of qualifying keywords being m, where p_max and p_min are the preset upper and lower bound thresholds;
sorting the filtered keywords in descending order of p(k) and treating each keyword as a set, thus obtaining m initial sets to be merged, denoted {k_1}, {k_2}, ..., {k_m};
for these m keywords, computing the probability that keyword k_j also appears in a text in which keyword k_i appears, denoted p(k_j|k_i), for a total of m(m−1) conditional probabilities (1 ≤ i, j ≤ m; i ≠ j), where p(k_j|k_i) = p(k_j k_i)/p(k_i) and p(k_j k_i) is the probability that k_j and k_i appear in the same text;
merging the sets to be merged, generating a keyword concept tree whose root node is the keyword set C.
Further, for two keyword sets C1 and C2 to be merged, the merging conditions are: there exist k_i belonging to C1 and k_j belonging to C2 such that p(k_i) > P1 and p(k_j|k_i) > P2, and when p(k_i) and p(k_j|k_i) are greater than the set thresholds, keywords k_i and k_j express the same concept and one of the merging conditions of the sets they belong to is satisfied; and, for any keyword k_i in the merged set, more than half of the keywords k_j in the set satisfy p(k_j|k_i) > P2. If two sets satisfy both conditions, their concepts are sufficiently similar, the sets can be merged, and merging them generates a concept class set of the next higher layer.
Further, the process of finding matching texts in the reference library comprises: extracting the keywords of all documents in the reference library to form a keyword set; and, according to the structure of the keyword concept tree and using the improved text similarity formula, computing the similarity between the text to be classified and each text in the reference library, and returning the result texts in descending order of similarity.
Further, the specific steps of finding matching texts in the translation reference library comprise:
defining H as the height of the generated concept tree, and depth(k) as the depth of node k in the tree, i.e. the number of edges traversed from the root node to that node;
defining com(k_i, k_j) as the nearest common parent node of nodes k_i and k_j, any two nodes necessarily having a common parent node, namely the root node;
defining the product of any two keywords as k_i × k_j = depth(com(k_i, k_j)) / H;
letting A = {a_1, a_2, ..., a_n} and B = {b_1, b_2, ..., b_n} be vectors and defining the vector product A × B = Σ_{i=1..n} Σ_{j=1..n} (a_i × b_j);
and computing the similarity of two texts as Sim(d1, d2) = (d1 × d2) / √((d1 × d1)(d2 × d2)), where d1 and d2 denote text vectors.
Compared with the prior art, the technical effects include:
In the prior art, when similarity is computed for texts with the vector space model method, if the vector representations of two texts are d1 = {k1, k2, k3} and d2 = {k4, k5, k6}, the two text vectors are orthogonal and their similarity is 0. Since the keywords being compared in the two texts may be synonyms or stand in a concept hyponymy relation, a computation that only matches identical keywords cannot effectively capture the relation between the texts.
Therefore, in the present invention, the keywords are clustered by concept so that conceptually similar keywords are grouped together, and with an improved vector cosine computation the similarity of mutually orthogonal vectors is not necessarily 0. The relations between texts are reflected comparatively accurately, so that, compared with the traditional vector space method, the similarity of texts can be reflected more fully.
Description of drawings
Fig. 1 is a schematic diagram of a 4-layer concept tree constructed in the present invention.
Embodiment
The present invention mainly relates to the text similarity technology within text retrieval. Text retrieval is an interdisciplinary field: in terms of broad disciplines it spans computer science, information science and mathematical statistics, and in terms of concrete research directions it covers technologies such as text retrieval, natural language processing, data mining and machine learning.
The translation reference library (reference library for short) is a huge resource library holding massive amounts of text. Performing similarity retrieval for a text to be translated in such a library with a complicated similarity retrieval method, in order to find a set of similar reference texts, is very slow and can hardly achieve fast retrieval; yet using the comparatively simple VSM vector space method for similarity retrieval gives very low precision. This method uses an improved VSM method which, while keeping the retrieval speed of the VSM method, can considerably improve the retrieval precision and obtain a comparatively accurate set of similar reference documents.
The present invention provides a text similarity computation method based on a vector space model.
Step 1: extract all keywords of the text to be classified, extract the keywords of all documents in the reference library to form a keyword set, cluster all the keywords, and generate a keyword concept tree.
The technical solution of the present invention provides a suitable clustering algorithm, and the generation of the keyword concept tree is described in detail.
Step 11: extract all keywords in the text to be classified and in the reference library to obtain the keyword set C = {k_1, k_2, ..., k_n};
Step 12: cluster the keywords in the keyword set, aggregating keywords of the same concept into the same concept set.
If two keywords often occur in the same text, i.e. the probability that they appear in the same text is greater than a certain threshold, we consider that they express the same concept and belong to concepts that can be merged. That is, if the probability p(k_i) that keyword k_i occurs in the text set satisfies p(k_i) > P1, and the conditional probability p(k_j|k_i) that keyword k_j also appears in a text in which k_i appears satisfies p(k_j|k_i) > P2, then keywords k_j and k_i are considered to express the same concept and are merged (P1 and P2 are the preset probability thresholds).
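The following is a minimal sketch of this criterion, assuming the corpus is available as a list of keyword sets, one per text. The function names and the example threshold values P1 = 0.05 and P2 = 0.6 are illustrative assumptions rather than values given by the patent.

```python
# Hedged sketch of the co-occurrence criterion above: two keywords are treated
# as expressing the same concept when p(ki) > P1 and p(kj|ki) > P2.
def occurrence_prob(keyword, corpus):
    """p(k): fraction of texts in which the keyword appears."""
    return sum(1 for doc in corpus if keyword in doc) / len(corpus)

def conditional_prob(kj, ki, corpus):
    """p(kj|ki): among the texts containing ki, the fraction also containing kj."""
    docs_with_ki = [doc for doc in corpus if ki in doc]
    if not docs_with_ki:
        return 0.0
    return sum(1 for doc in docs_with_ki if kj in doc) / len(docs_with_ki)

def same_concept(ki, kj, corpus, P1=0.05, P2=0.6):
    """Merge criterion of step 12 with illustrative thresholds."""
    return occurrence_prob(ki, corpus) > P1 and conditional_prob(kj, ki, corpus) > P2

corpus = [{"translation", "interpreter"}, {"translation", "corpus"},
          {"interpreter", "translation"}]
print(same_concept("translation", "interpreter", corpus))  # True with these toy data
```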
Similarly, two keyword sets C1 and C2 to be merged are merged if the following two conditions are satisfied:
Condition 1: there exist k_i belonging to C1 and k_j belonging to C2 such that p(k_i) > P1 and p(k_j|k_i) > P2.
When p(k_i) and p(k_j|k_i) are greater than the respective thresholds, we consider that keywords k_i and k_j express the same concept, and one of the merging conditions of the sets they belong to is satisfied.
Condition 2: for any keyword k_i in the merged set, more than half of the keywords k_j in the set satisfy p(k_j|k_i) > P2.
If condition 1 and condition 2 are satisfied simultaneously, we consider that the concepts of the two sets are sufficiently similar, the sets can be merged, and merging them generates a concept class set of the next higher layer.
When no two remaining keyword sets satisfy the above conditions, merging stops, and the remaining sets become the children of the set C formed by all keywords.
The steps of keyword clustering are as follows:
Step 121: extract all keywords to obtain the keyword set C = {k_1, k_2, ..., k_n};
for each keyword k in C, compute its probability of occurrence, i.e. the ratio of the number of texts in which keyword k appears to the total number of texts, denoted p(k).
Step 122: filter the keywords according to the set thresholds;
keep the keywords with p_min < p(k) < p_max as entries to be merged, and let the number of qualifying keywords be m (p_max and p_min are the preset upper and lower bound thresholds, used to remove extremely high-frequency and extremely low-frequency words).
Step 123: sort the filtered keywords in descending order of p(k) and treat each keyword as a set, obtaining m initial sets to be merged, denoted {k_1}, {k_2}, ..., {k_m};
Step 124: for these m keywords, compute the probability that keyword k_j also appears in a text in which keyword k_i appears, denoted p(k_j|k_i), for a total of m(m−1) conditional probabilities (1 ≤ i, j ≤ m; i ≠ j);
p(k_j|k_i) is computed as p(k_j|k_i) = p(k_j k_i)/p(k_i), where p(k_j k_i) is the probability that k_j and k_i appear in the same text.
Step 125: merge sets I and J (I and J being sets to be merged);
the merge is performed when the following two conditions are satisfied simultaneously:
i. there exist k_i ∈ I and k_j ∈ J satisfying p(k_i) > P1 and p(k_j|k_i) > P2;
ii. for any k_i ∈ I ∪ J, |{k_j ∈ I ∪ J | p(k_j|k_i) > P2}| ≥ (|I| + |J|)/2, where |X| denotes the number of elements in set X.
Step 126: when no two sets satisfy these two conditions, merging ends, and the first-layer clustered keyword sets C = {C1, C2, ..., Cq} are obtained;
Step 127: for C = {C1, C2, ..., Cq}, take a threshold P3 < P2 and repeat the clustering of the above steps (steps 125 and 126) to generate the concept sets of the next higher layer.
Repeat this process until the clustered sets can no longer be clustered; the concept sets that can no longer be clustered become the children of the root node C, and a keyword concept tree whose root node is the keyword set C is thus generated.
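The following condensed sketch strings steps 121 to 127 together in Python. The corpus format, the numeric thresholds, and the fixed decay factor used to lower the threshold from layer to layer (standing in for choosing P3 < P2) are assumptions made for illustration; the probability helpers repeat those of the earlier sketch so the block stays self-contained.

```python
# Self-contained sketch of the bottom-up clustering of steps 121-127.
def keyword_probs(corpus):
    """p(k) for every keyword: fraction of texts in which k appears."""
    keys = {k for doc in corpus for k in doc}
    return {k: sum(1 for doc in corpus if k in doc) / len(corpus) for k in keys}

def cond_prob(kj, ki, corpus):
    """p(kj|ki): among the texts containing ki, the fraction also containing kj."""
    docs_ki = [doc for doc in corpus if ki in doc]
    return sum(1 for doc in docs_ki if kj in doc) / len(docs_ki) if docs_ki else 0.0

def mergeable(I, J, corpus, p, P1, P2):
    """Conditions i and ii of step 125 (condition i checked as stated, ki from I)."""
    union = I | J                           # |I| + |J| == len(union): sets are disjoint
    cond_i = any(p[ki] > P1 and cond_prob(kj, ki, corpus) > P2
                 for ki in I for kj in J)
    cond_ii = all(sum(1 for kj in union if cond_prob(kj, ki, corpus) > P2)
                  >= len(union) / 2
                  for ki in union)
    return cond_i and cond_ii

def merge_layer(sets, corpus, p, P1, P2):
    """Steps 125-126: keep merging pairs that satisfy both conditions."""
    sets, changed = list(sets), True
    while changed:
        changed = False
        for a in range(len(sets)):
            for b in range(a + 1, len(sets)):
                if mergeable(sets[a], sets[b], corpus, p, P1, P2):
                    merged = sets[a] | sets[b]
                    sets = [s for i, s in enumerate(sets) if i not in (a, b)]
                    sets.append(merged)
                    changed = True
                    break
            if changed:
                break
    return sets

def build_concept_layers(corpus, P1=0.05, P2=0.6, p_min=0.01, p_max=0.8, decay=0.8):
    p = keyword_probs(corpus)
    # Steps 121-123: filter by p_min/p_max, sort by p(k) descending, singleton sets.
    kept = sorted((k for k in p if p_min < p[k] < p_max), key=lambda k: -p[k])
    layer = [frozenset([k]) for k in kept]
    layers, threshold = [layer], P2
    while len(layer) > 1:
        merged = merge_layer(layer, corpus, p, P1, threshold)
        if len(merged) == len(layer):   # no pair could be merged: stop; the remaining
            break                       # sets become the children of the root node C
        layer = merged
        layers.append(layer)
        threshold *= decay              # Step 127: lower the threshold (P3 < P2)
    return layers                       # bottom layer first, root's children last

corpus = [{"translation", "interpreter"}, {"translation", "corpus"},
          {"interpreter", "matching"}, {"translation", "interpreter", "matching"}]
print(build_concept_layers(corpus, P2=0.5))
```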
Fig. 1 shows a schematic diagram of a 4-layer concept tree constructed in the present invention.
Step 2: according to the keyword concept tree built from the keywords of the text to be translated, find the matching texts in the translation reference library.
The present invention defines a vector cosine computation based on the keyword concept tree, i.e. a new method of text similarity computation.
Step 21: according to the structure of the keyword concept tree, compute the similarity of different keywords with the new method;
Step 22: with the improved cosine similarity method, compute the similarity between the text to be translated and the texts in the translation reference library;
In the VSM vector space model, any two keywords k_i and k_j are completely orthogonal and their product is 0. In the concept tree of the present invention, any two concepts k_i and k_j are not necessarily orthogonal; their product is determined by the distance of their common parent node from the root node. In Fig. 1, for example, the nearest common parent node of k_1 and k_2 is C11, its distance from the root node is 2, and the height of the tree is 3, so k_1 × k_2 = 2/3.
1. Define H as the height of the generated concept tree.
2. Define depth(k) as the depth of node k in the tree, i.e. the number of edges traversed from the root node to that node.
3. Define com(k_i, k_j) as the nearest common parent node of nodes k_i and k_j; any two nodes necessarily have a common parent node, namely the root node.
4. The product of any two keywords: k_i × k_j = depth(com(k_i, k_j)) / H.
5. Let A = {a_1, a_2, ..., a_n} and B = {b_1, b_2, ..., b_n} be vectors; define the vector product A × B = Σ_{i=1..n} Σ_{j=1..n} (a_i × b_j).
6. The similarity of two texts: Sim(d1, d2) = (d1 × d2) / √((d1 × d1)(d2 × d2)), where d1 and d2 denote text vectors.
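The following sketch implements items 1 to 6 above in Python. The concept tree is represented as a simple child-to-parent map; the sample tree and the sample text vectors are illustrative only, although they are chosen so that the product of k_1 and k_2 reproduces the value 2/3 of the worked example above.

```python
# Hedged sketch of the concept-tree-based keyword product and text similarity.
import math

parent = {                      # child -> parent; "C" is the root keyword set
    "C1": "C", "C2": "C",
    "C11": "C1", "C12": "C1",
    "k1": "C11", "k2": "C11", "k3": "C12", "k4": "C2",
}

def depth(node):
    """depth(k): number of edges from the root to the node."""
    d = 0
    while node in parent:
        node, d = parent[node], d + 1
    return d

H = max(depth(n) for n in parent)   # height of the concept tree (3 for this sample)

def common_ancestor(ki, kj):
    """com(ki, kj): nearest common parent node; the root is the last resort."""
    ancestors, n = set(), ki
    while True:
        ancestors.add(n)
        if n not in parent:
            break
        n = parent[n]
    n = kj
    while n not in ancestors:
        n = parent[n]
    return n

def kw_product(ki, kj):
    """ki x kj = depth(com(ki, kj)) / H."""
    return depth(common_ancestor(ki, kj)) / H

def vec_product(A, B):
    """A x B: sum of keyword products over all pairs."""
    return sum(kw_product(a, b) for a in A for b in B)

def sim(d1, d2):
    """Sim(d1, d2) = (d1 x d2) / sqrt((d1 x d1)(d2 x d2))."""
    return vec_product(d1, d2) / math.sqrt(vec_product(d1, d1) * vec_product(d2, d2))

print(kw_product("k1", "k2"))           # 2/3, matching the worked example above
print(sim(["k1", "k3"], ["k2", "k4"]))  # nonzero although the keyword sets are disjoint
```

The second printed value illustrates the stated technical effect: two texts with no keyword in common still receive a nonzero similarity when their keywords are conceptually close in the tree.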
Step 23: return the result texts in descending order of similarity.
Concrete applications of the technical solution of the present invention are described below.
Application 1: optimizing translator retrieval by similarity matching against translators' past work
Each translator has many documents that he or she has translated; these translated documents constitute the translator's document library, and the document libraries of many translators constitute a huge "translator achievement document library". When a suitable translator is to be found for a document to be translated, the document is matched for similarity against the translator achievement document library, the documents with high similarity are retrieved from the library, and the translators corresponding to those documents are the suitable translators; ranking by similarity gives a ranking of translator suitability. Because such a translator has previously translated similar documents, the translation can be done faster and better.
Application 2: automatic document classification by similarity matching against a classified document library
A standard document library classified according to a given classification standard is built, in which each category has a number of sample documents. An unclassified document is matched for similarity against this library, and all documents in the classified library whose similarity exceeds a predetermined value are retrieved. The category distribution of these similar documents is tabulated and fed into a computation model for weighted scoring, yielding a category score for the document; the category with the highest score is the most probable category of the document. If the score of the second-ranked category is close to that of the first, it can be used as an auxiliary category.
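The patent does not fix the weighting model for this tabulation; as one reasonable reading, the sketch below scores each category by the summed similarity of its matched sample documents. The threshold and margin parameters, and all names, are illustrative assumptions.

```python
# Illustrative sketch of the weighted classification described above.
from collections import defaultdict

def classify(doc, library, sim, match_threshold=0.5, runner_up_margin=0.1):
    """library: list of (sample_doc, category) pairs; sim: a similarity function."""
    scores = defaultdict(float)
    for sample, category in library:
        s = sim(doc, sample)
        if s > match_threshold:          # keep only sufficiently similar samples
            scores[category] += s        # weighted tally of their categories
    if not scores:
        return None, None
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    best_cat, best_score = ranked[0]
    secondary = None
    if len(ranked) > 1 and best_score - ranked[1][1] <= runner_up_margin * best_score:
        secondary = ranked[1][0]         # close runner-up kept as an auxiliary category
    return best_cat, secondary
```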
Application 3: a manuscript-splitting strategy combining domain division with similarity retrieval
When a large document translation task is carried out, breaking the large manuscript into several smaller translation fragments is a common way to promote division of labour and improve translation efficiency, but the strategy for "breaking up" the manuscript becomes the key link. The method adopted here does not break up the content by simple paragraphs; instead, the domain of each paragraph is judged from its keywords, the content of the manuscript is preliminarily divided by domain, and then similarity retrieval is performed in the historical achievement document library with the divided fragments to determine which translators suit each fragment. The fragments are then consolidated by translator: fragments suited to the same translator or the same class of translators are merged wholly or partly. The resulting fragmentation of the manuscript is ideal, making task assignment convenient and helping to guarantee translation quality.

Claims (7)

1. A text similarity matching method based on a vector space model, comprising:
extracting the keywords of texts, clustering all the keywords, and generating a keyword concept tree;
computing the similarity of texts according to the keyword concept tree built from the keywords of the text to be translated, and obtaining, by the magnitude of the similarity, the matching texts in the translation reference library.
2. The text similarity matching method based on a vector space model of claim 1, wherein the step of generating the keyword concept tree comprises:
extracting all keywords in the documents to be classified and in the reference library to obtain a keyword set;
clustering the keywords in the keyword set, aggregating keywords of the same concept into a concept class set, and generating the keyword concept tree from the concept class sets.
3. The text similarity matching method based on a vector space model of claim 2, wherein, if the probability p(k_i) that keyword k_i occurs satisfies p(k_i) > P1, and the conditional probability p(k_j|k_i) that keyword k_j also appears in a text in which k_i appears satisfies p(k_j|k_i) > P2, then keywords k_j and k_i are considered to express the same concept, P1 and P2 being preset probability thresholds.
4. The text similarity matching method based on a vector space model of claim 3, wherein the specific steps of generating the keyword concept tree comprise:
extracting all keywords in the documents to be classified and in the reference library to obtain a keyword set C = {k_1, k_2, ..., k_n}, and computing for each keyword k in C its probability of occurrence, i.e. the ratio of the number of texts in which keyword k appears to the total number of texts, denoted p(k);
filtering the keywords according to set thresholds, keeping the keywords with p_min < p(k) < p_max as entries to be merged, the number of qualifying keywords being m, where p_max and p_min are the preset upper and lower bound thresholds;
sorting the filtered keywords in descending order of p(k) and treating each keyword as a set, thus obtaining m initial sets to be merged, denoted {k_1}, {k_2}, ..., {k_m};
for these m keywords, computing the probability that keyword k_j also appears in a text in which keyword k_i appears, denoted p(k_j|k_i), for a total of m(m−1) conditional probabilities (1 ≤ i, j ≤ m; i ≠ j), where p(k_j|k_i) = p(k_j k_i)/p(k_i) and p(k_j k_i) is the probability that k_j and k_i appear in the same text;
merging the sets to be merged, generating a keyword concept tree whose root node is the keyword set C.
5. The text similarity matching method based on a vector space model of claim 4, wherein, for two keyword sets C1 and C2 to be merged, the merging conditions are: there exist k_i belonging to C1 and k_j belonging to C2 such that p(k_i) > P1 and p(k_j|k_i) > P2, and when p(k_i) and p(k_j|k_i) are greater than the set thresholds, keywords k_i and k_j express the same concept and one of the merging conditions of the sets they belong to is satisfied; and, for any keyword k_i in the merged set, more than half of the keywords k_j in the set satisfy p(k_j|k_i) > P2; if two sets satisfy both conditions, their concepts are sufficiently similar, the sets can be merged, and merging them generates a concept class set of the next higher layer.
6. The text similarity matching method based on a vector space model of claim 1, wherein the process of finding matching texts in the translation reference library comprises: extracting the keywords of all documents in the translation reference library to form a keyword set; and, according to the structure of the keyword concept tree and using the improved text similarity formula, computing the similarity between the text to be classified and each text in the reference library, and returning the result texts in descending order of similarity.
7. The text similarity matching method based on a vector space model of claim 6, wherein the specific steps of finding matching texts in the translation reference library comprise:
defining H as the height of the generated concept tree, and depth(k) as the depth of node k in the tree, i.e. the number of edges traversed from the root node to that node;
defining com(k_i, k_j) as the nearest common parent node of nodes k_i and k_j, any two nodes necessarily having a common parent node, namely the root node;
defining the product of any two keywords as k_i × k_j = depth(com(k_i, k_j)) / H;
letting A = {a_1, a_2, ..., a_n} and B = {b_1, b_2, ..., b_n} be vectors and defining the vector product A × B = Σ_{i=1..n} Σ_{j=1..n} (a_i × b_j);
and computing the similarity of two texts as Sim(d1, d2) = (d1 × d2) / √((d1 × d1)(d2 × d2)), where d1 and d2 denote text vectors.
CN2012105931481A 2012-12-31 2012-12-31 Text similarity matching method on basis of vector space model Pending CN103049569A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012105931481A CN103049569A (en) 2012-12-31 2012-12-31 Text similarity matching method on basis of vector space model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2012105931481A CN103049569A (en) 2012-12-31 2012-12-31 Text similarity matching method on basis of vector space model

Publications (1)

Publication Number Publication Date
CN103049569A true CN103049569A (en) 2013-04-17

Family

ID=48062209

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012105931481A Pending CN103049569A (en) 2012-12-31 2012-12-31 Text similarity matching method on basis of vector space model

Country Status (1)

Country Link
CN (1) CN103049569A (en)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103678277A (en) * 2013-12-04 2014-03-26 东软集团股份有限公司 Theme-vocabulary distribution establishing method and system based on document segmenting
CN103678287A (en) * 2013-11-30 2014-03-26 武汉传神信息技术有限公司 Method for unifying keyword translation
CN103761264A (en) * 2013-12-31 2014-04-30 浙江大学 Concept hierarchy establishing method based on product review document set
CN104424279A (en) * 2013-08-30 2015-03-18 腾讯科技(深圳)有限公司 Text relevance calculating method and device
CN104572645A (en) * 2013-10-11 2015-04-29 高德软件有限公司 Method and device for POI (Point Of Interest) data association
CN104778158A (en) * 2015-03-04 2015-07-15 新浪网技术(中国)有限公司 Method and device for representing text
CN104866631A (en) * 2015-06-18 2015-08-26 北京京东尚科信息技术有限公司 Method and device for aggregating counseling problems
CN105138521A (en) * 2015-08-27 2015-12-09 武汉传神信息技术有限公司 General translator recommendation method for risk project in translation industry
CN105279147A (en) * 2015-09-29 2016-01-27 武汉传神信息技术有限公司 Translator document quick matching method
CN106250412A (en) * 2016-07-22 2016-12-21 浙江大学 The knowledge mapping construction method merged based on many source entities
CN106372122A (en) * 2016-08-23 2017-02-01 温州大学瓯江学院 Wiki semantic matching-based document classification method and system
CN106503457A (en) * 2016-10-26 2017-03-15 清华大学 The integrated technical data introduction method of clinical data based on translational medicine analysis platform
CN106776563A (en) * 2016-12-21 2017-05-31 语联网(武汉)信息技术有限公司 A kind of is the method for treating manuscript of a translation part matching interpreter
CN106802881A (en) * 2016-12-25 2017-06-06 语联网(武汉)信息技术有限公司 A kind of is to treat the method that manuscript of a translation part matches interpreter based on vocabulary is disabled
CN106844304A (en) * 2016-12-26 2017-06-13 语联网(武汉)信息技术有限公司 It is a kind of to be categorized as treating the method that manuscript of a translation part matches interpreter based on the manuscript of a translation
CN106844303A (en) * 2016-12-23 2017-06-13 语联网(武汉)信息技术有限公司 A kind of is to treat the method that manuscript of a translation part matches interpreter based on similarity mode algorithm
CN107562854A (en) * 2017-08-28 2018-01-09 云南大学 A kind of modeling method of quantitative analysis Party building data
CN108182182A (en) * 2017-12-27 2018-06-19 传神语联网网络科技股份有限公司 Document matching process, device and computer readable storage medium in translation database
CN109284486A (en) * 2018-08-14 2019-01-29 重庆邂智科技有限公司 Text similarity measure, device, terminal and storage medium
CN109636199A (en) * 2018-12-14 2019-04-16 语联网(武汉)信息技术有限公司 A kind of method and system to match interpreter to manuscript of a translation part
CN110019785A (en) * 2017-09-29 2019-07-16 北京国双科技有限公司 A kind of file classification method and device
CN110196906A (en) * 2019-01-04 2019-09-03 华南理工大学 Towards financial industry based on deep learning text similarity detection method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1828610A (en) * 2006-04-13 2006-09-06 北大方正集团有限公司 Improved file similarity measure method based on file structure
CN101004761A (en) * 2007-01-10 2007-07-25 复旦大学 Hierarchy clustering method of successive dichotomy for document in large scale
US20110213777A1 (en) * 2010-02-01 2011-09-01 Alibaba Group Holding Limited Method and Apparatus of Text Classification

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1828610A (en) * 2006-04-13 2006-09-06 北大方正集团有限公司 Improved file similarity measure method based on file structure
CN101004761A (en) * 2007-01-10 2007-07-25 复旦大学 Hierarchy clustering method of successive dichotomy for document in large scale
US20110213777A1 (en) * 2010-02-01 2011-09-01 Alibaba Group Holding Limited Method and Apparatus of Text Classification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
吕月娥 (Lü Yue'e): "Classification and Retrieval of Documents in the Chinese Scientific and Technical Journal Database", Journal of Linyi Normal University (《临沂师范学院学报》) *

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104424279A (en) * 2013-08-30 2015-03-18 腾讯科技(深圳)有限公司 Text relevance calculating method and device
CN104424279B (en) * 2013-08-30 2018-11-20 腾讯科技(深圳)有限公司 A kind of correlation calculations method and apparatus of text
CN104572645A (en) * 2013-10-11 2015-04-29 高德软件有限公司 Method and device for POI (Point Of Interest) data association
CN103678287B (en) * 2013-11-30 2016-12-07 语联网(武汉)信息技术有限公司 A kind of method that keyword is unified
CN103678287A (en) * 2013-11-30 2014-03-26 武汉传神信息技术有限公司 Method for unifying keyword translation
CN103678277A (en) * 2013-12-04 2014-03-26 东软集团股份有限公司 Theme-vocabulary distribution establishing method and system based on document segmenting
CN103761264A (en) * 2013-12-31 2014-04-30 浙江大学 Concept hierarchy establishing method based on product review document set
CN103761264B (en) * 2013-12-31 2017-01-18 浙江大学 Concept hierarchy establishing method based on product review document set
CN104778158A (en) * 2015-03-04 2015-07-15 新浪网技术(中国)有限公司 Method and device for representing text
CN104778158B (en) * 2015-03-04 2018-07-17 新浪网技术(中国)有限公司 A kind of document representation method and device
CN104866631A (en) * 2015-06-18 2015-08-26 北京京东尚科信息技术有限公司 Method and device for aggregating counseling problems
CN105138521A (en) * 2015-08-27 2015-12-09 武汉传神信息技术有限公司 General translator recommendation method for risk project in translation industry
CN105138521B (en) * 2015-08-27 2017-12-22 武汉传神信息技术有限公司 A kind of translation industry risk project general recommendations interpreter's method
CN105279147A (en) * 2015-09-29 2016-01-27 武汉传神信息技术有限公司 Translator document quick matching method
CN105279147B (en) * 2015-09-29 2018-02-23 语联网(武汉)信息技术有限公司 A kind of interpreter's contribution fast matching method
CN106250412B (en) * 2016-07-22 2019-04-23 浙江大学 Knowledge mapping construction method based on the fusion of multi-source entity
CN106250412A (en) * 2016-07-22 2016-12-21 浙江大学 The knowledge mapping construction method merged based on many source entities
CN106372122A (en) * 2016-08-23 2017-02-01 温州大学瓯江学院 Wiki semantic matching-based document classification method and system
CN106503457B (en) * 2016-10-26 2018-12-11 清华大学 Clinical data based on translational medicine analysis platform integrates technical data introduction method
CN106503457A (en) * 2016-10-26 2017-03-15 清华大学 The integrated technical data introduction method of clinical data based on translational medicine analysis platform
CN106776563A (en) * 2016-12-21 2017-05-31 语联网(武汉)信息技术有限公司 A kind of is the method for treating manuscript of a translation part matching interpreter
CN106844303A (en) * 2016-12-23 2017-06-13 语联网(武汉)信息技术有限公司 A kind of is to treat the method that manuscript of a translation part matches interpreter based on similarity mode algorithm
CN106802881A (en) * 2016-12-25 2017-06-06 语联网(武汉)信息技术有限公司 A kind of is to treat the method that manuscript of a translation part matches interpreter based on vocabulary is disabled
CN106844304A (en) * 2016-12-26 2017-06-13 语联网(武汉)信息技术有限公司 It is a kind of to be categorized as treating the method that manuscript of a translation part matches interpreter based on the manuscript of a translation
CN107562854B (en) * 2017-08-28 2020-09-22 云南大学 Modeling method for quantitatively analyzing party building data
CN107562854A (en) * 2017-08-28 2018-01-09 云南大学 A kind of modeling method of quantitative analysis Party building data
CN110019785A (en) * 2017-09-29 2019-07-16 北京国双科技有限公司 A kind of file classification method and device
CN110019785B (en) * 2017-09-29 2022-03-01 北京国双科技有限公司 Text classification method and device
CN108182182A (en) * 2017-12-27 2018-06-19 传神语联网网络科技股份有限公司 Document matching process, device and computer readable storage medium in translation database
CN109284486A (en) * 2018-08-14 2019-01-29 重庆邂智科技有限公司 Text similarity measure, device, terminal and storage medium
CN109284486B (en) * 2018-08-14 2023-08-22 重庆邂智科技有限公司 Text similarity measurement method, device, terminal and storage medium
CN109636199A (en) * 2018-12-14 2019-04-16 语联网(武汉)信息技术有限公司 A kind of method and system to match interpreter to manuscript of a translation part
CN110196906A (en) * 2019-01-04 2019-09-03 华南理工大学 Towards financial industry based on deep learning text similarity detection method

Similar Documents

Publication Publication Date Title
CN103049569A (en) Text similarity matching method on basis of vector space model
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
US10437867B2 (en) Scenario generating apparatus and computer program therefor
CN107122413A (en) A kind of keyword extracting method and device based on graph model
CN109376352B (en) Patent text modeling method based on word2vec and semantic similarity
CN102591988B (en) Short text classification method based on semantic graphs
CN108681557B (en) Short text topic discovery method and system based on self-expansion representation and similar bidirectional constraint
CN106776562A (en) A kind of keyword extracting method and extraction system
US10095685B2 (en) Phrase pair collecting apparatus and computer program therefor
CN106156272A (en) A kind of information retrieval method based on multi-source semantic analysis
WO2021051518A1 (en) Text data classification method and apparatus based on neural network model, and storage medium
CN105653518A (en) Specific group discovery and expansion method based on microblog data
CN103970729A (en) Multi-subject extracting method based on semantic categories
CN102637192A (en) Method for answering with natural language
CN107122349A (en) A kind of feature word of text extracting method based on word2vec LDA models
CN102495892A (en) Webpage information extraction method
CN103970730A (en) Method for extracting multiple subject terms from single Chinese text
CN101097570A (en) Advertisement classification method capable of automatic recognizing classified advertisement type
CN102253982A (en) Query suggestion method based on query semantics and click-through data
CN110362678A (en) A kind of method and apparatus automatically extracting Chinese text keyword
CN111221968B (en) Author disambiguation method and device based on subject tree clustering
CN102054029A (en) Figure information disambiguation treatment method based on social network and name context
CN104899188A (en) Problem similarity calculation method based on subjects and focuses of problems
CN106484797A (en) Accident summary abstracting method based on sparse study

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130417