CN103049569A - Text similarity matching method on basis of vector space model - Google Patents
- Publication number
- CN103049569A, CN2012105931481A, CN201210593148A
- Authority
- CN
- China
- Prior art keywords
- keyword
- text
- similarity
- vector space
- space model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a text similarity matching method based on a vector space model. The method extracts keywords from texts, clusters all the keywords, and generates a keyword concept tree; it then computes text similarity from the concept tree built over the keywords of the text to be translated, and retrieves texts from a translation reference library ranked by that similarity, the retrieved texts matching the text to be translated. This scheme reflects the relations among texts relatively accurately, and thus captures text similarity more fully.
Description
Technical field
The present invention relates to computer technology, and specifically to a text similarity matching method based on a vector space model.
Background art
Commonly used text retrieval models fall into two families: retrieval models based on the literal text and retrieval models based on structure. Text-based retrieval models include the vector space model, approximate models, probabilistic models, and statistical language retrieval models; structure-based retrieval models include internal-structure and external-structure retrieval models.
The similarity of texts is a numerical measure of how alike two texts are. For two texts D1 and D2, take (D1 ∩ D2)/(D1 ∪ D2); the closer this value is to 1, the more similar the two texts, and vice versa. In text retrieval, similarity computation is mainly used to measure the degree of resemblance between text objects, and is a basic operation in data mining and natural language processing. Its two key techniques are the feature representation of objects and the similarity relation between feature sets. Information retrieval, web-page duplicate detection, recommender systems, and the like all involve computing similarity between objects or between an object and an object set. Different application scenarios, constrained by data scale, time-space overhead, and so on, call for different choices of similarity computation method.
The most commonly used similarity computation method is the vector space model (VSM). This model extracts keywords from a text and assigns them weights, representing the text as a vector of weighted keywords; the similarity of two texts is then obtained by computing the distance between their vectors.
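As a reference point for the discussion that follows, a minimal Python sketch of this traditional VSM computation (the vocabulary, keywords, and raw-count weights are illustrative assumptions, not from the patent):

```python
from collections import Counter
from math import sqrt

def keyword_vector(keywords, vocabulary):
    """Represent a text as the vector of weights (here raw counts) of each vocabulary keyword."""
    counts = Counter(keywords)
    return [counts[k] for k in vocabulary]

def cosine_similarity(a, b):
    """Classic VSM similarity: cosine of the angle between two weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

vocab = ["translation", "memory", "retrieval", "neural"]
d1 = keyword_vector(["translation", "memory", "memory"], vocab)
d2 = keyword_vector(["translation", "retrieval"], vocab)
print(round(cosine_similarity(d1, d2), 3))  # → 0.316
```

Texts that share no keyword at all score exactly 0 here, which is precisely the weakness the rest of the document addresses.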
Because keywords commonly exhibit synonymy, polysemy, and similar phenomena, the similarity computed with the traditional vector space model is not precise, and the results are often unsatisfactory. Moreover, the keyword weighting algorithm only captures the relation between a text and its keywords; it cannot relate keywords laterally across different texts. This causes the following problems for text retrieval:
(1) Keywords cannot accurately express the user's need.
It is hard for users to choose precise search keywords, because this involves the semantic mapping between a query and its concepts; the keywords a user supplies often fail to reflect the user's intent.
(2) Keywords cannot fully reflect text content.
If a keyword's extension is too broad, related texts become hard or impossible to retrieve.
(3) Polysemy.
Keyword matching has difficulty resolving words with multiple senses, so it tends to retrieve large amounts of irrelevant information.
(4) A keyword may occur in a text only through synonyms.
A user's search keyword sometimes does not appear in the text directly but appears as a synonym, a near-synonym, or another word formation of the keyword; such texts cannot be retrieved. When the search keyword and the text's feature words form a hypernym-hyponym relation, retrieval is harder still.
Summary of the invention
The technical problem solved by the invention is to provide a text similarity matching method based on a vector space model that reflects the relations among texts relatively accurately and therefore captures text similarity more fully.
The technical scheme is as follows:
A text similarity matching method based on a vector space model comprises:
extracting the keywords of texts, clustering all the keywords, and generating a keyword concept tree;
computing text similarity from the concept tree built over the keywords of the text to be translated, and retrieving matching texts from the translation reference library ranked by similarity.
Further, the step of generating the keyword concept tree comprises:
extracting all keywords of the document to be classified and of the reference library to obtain a keyword set;
clustering the keywords in the keyword set, aggregating keywords of the same concept into concept-class sets, and generating the keyword concept tree from these concept-class sets.
Further, if the probability p(k_i) with which keyword k_i occurs satisfies p(k_i) ≥ P1, and the conditional probability p(k_j|k_i) that keyword k_j also appears in a text containing k_i satisfies p(k_j|k_i) ≥ P2, then k_j and k_i are taken to express the same concept; P1 and P2 are preset probability thresholds.
Further, the concrete steps of generating the keyword concept tree comprise:
extracting all keywords of the document to be classified and of the reference library to obtain the keyword set C = {k1, k2, ..., kn}, and computing for each keyword k in C the probability p(k) with which it occurs in the reference library, namely the ratio of the number of texts containing k to the total number of texts in the set;
filtering the keywords by preset thresholds, keeping those with p_min < p(k) < p_max as entries of the sets to be merged, the number of qualifying keywords being m, where p_max and p_min are the preset upper and lower bound thresholds;
sorting the filtered keywords by p(k) in descending order and taking each keyword as a set of its own, which yields m initial sets to be merged, denoted {k_1}, {k_2}, ..., {k_m};
among these m keywords, computing the probability that keyword k_j also occurs in a text in which keyword k_i occurs, denoted p(k_j|k_i), for a total of m(m-1) conditional probabilities (1 ≤ i, j ≤ m; i ≠ j), where p(k_j|k_i) = p(k_j k_i)/p(k_i) and p(k_j k_i) is the probability that k_j and k_i appear in the same text simultaneously;
merging the sets to be merged, generating the keyword concept tree whose root node is the keyword set C.
Further, for two keyword sets C1 and C2 to be merged, the merging conditions are: there exist k_i belonging to C1 and k_j belonging to C2 with p(k_i) ≥ P1 and p(k_j|k_i) ≥ P2; when p(k_i) and p(k_j|k_i) reach the preset thresholds, keywords k_i and k_j express the same concept, satisfying one merging condition of their sets; and, for any keyword k_i in the merged set, more than half of the keywords in the set satisfy p(k_j|k_i) ≥ P2. If two sets satisfy both conditions, their concepts are highly similar and the sets can be merged; merging them generates a concept-class set of the next layer up.
Further, the process of finding matching texts in the reference library comprises: extracting the keywords of all documents in the reference library to form a keyword set; and, from the structure of the keyword concept tree, computing the similarity between the text to be classified and each text in the reference library with the improved text similarity formula, returning result texts in descending order of similarity.
Further, the concrete steps of finding matching texts in the translation reference library comprise:
defining H as the height of the generated concept tree, and depth(k) as the depth of node k in the tree, namely the number of edges traversed from the root node to that node;
defining com(k_i, k_j) as the nearest common parent node of nodes k_i and k_j, any two nodes having at least the root node as a common parent;
computing the product of any two keywords as k_i × k_j = depth(com(k_i, k_j))/H;
given vectors A = {a_1, a_2, ..., a_n} and B = {b_1, b_2, ..., b_n}, defining the vector product A × B = Σ_{i=1..n} Σ_{j=1..n} a_i · b_j · (k_i × k_j);
computing the similarity of two texts as sim(d1, d2) = (d1 × d2)/(√(d1 × d1) · √(d2 × d2)), where d1 and d2 denote the text vectors.
Compared with the prior art, the technical effects include:
In the prior art, when similarity is computed with the vector space model, two texts represented as d1 = {k1, k2, k3} and d2 = {k4, k5, k6} have similarity 0, because the two vectors are orthogonal. Yet the keywords being compared may stand in synonymy or hypernym-hyponym relations, so matching only identical keywords cannot effectively capture the relation between the texts.
In the present invention, therefore, keywords are clustered by concept so that conceptually similar keywords are grouped together, and with an improved vector cosine computation the similarity of mutually orthogonal vectors need not be 0. This reflects the relations among texts relatively accurately and, compared with the traditional vector space method, captures text similarity more fully.
Description of drawings
Fig. 1 is a schematic diagram of a 4-layer concept tree constructed in the present invention.
Embodiment
The present invention mainly concerns text similarity technology within text retrieval. Text retrieval is an interdisciplinary field: at the broad level it spans computer science, information science, and mathematical statistics, and its concrete research directions include text retrieval, natural language processing, data mining, and machine learning.
A translation reference library (reference library for short) is a huge resource base holding massive amounts of text. Running an elaborate similarity retrieval over it for a text to be translated, in order to find a set of similar reference texts, is very slow and hard to make fast. Yet a comparatively simple VSM vector-space retrieval has very low precision. The present method therefore uses an improved VSM which, while preserving the retrieval speed of the VSM method, raises retrieval precision substantially and yields a relatively accurate set of similar reference documents.
The present invention provides a text similarity computation method based on the vector space model.
Step 1: extract all keywords of the text to be classified and the keywords of all documents in the reference library to form a keyword set; cluster all the keywords and generate the keyword concept tree.
The technical scheme supplies a suitable clustering algorithm; the generation of the keyword concept tree is described in detail below.
Step 11: extract all keywords of the text to be classified and of the reference library, obtaining the keyword set C = {k1, k2, ..., kn}.
Step 12: cluster the keywords in the keyword set, aggregating keywords of the same concept into the same concept set.
If two keywords often occur in the same text simultaneously, that is, if the probability of their co-occurring in the same text exceeds a certain threshold, we take them to express the same concept and treat them as mergeable. Specifically, if the probability p(k_i) with which keyword k_i occurs in the text set satisfies p(k_i) ≥ P1, and the conditional probability p(k_j|k_i) that keyword k_j also appears in a text containing k_i satisfies p(k_j|k_i) ≥ P2, then k_j and k_i are taken to express the same concept and are merged (P1 and P2 are preset probability thresholds).
Similarly, for two keyword sets C1 and C2 to be merged, the following two conditions must hold:
Condition 1: there exist k_i belonging to C1 and k_j belonging to C2 with p(k_i) ≥ P1 and p(k_j|k_i) ≥ P2. When p(k_i) and p(k_j|k_i) reach their respective thresholds, we take k_i and k_j to express the same concept, which satisfies one merging condition of their sets.
Condition 2: for any keyword k_i in the merged set, more than half of the keywords in the set satisfy p(k_j|k_i) ≥ P2.
If Condition 1 and Condition 2 hold simultaneously, we take the concepts of the two sets to be sufficiently similar; the sets are mergeable, and merging them generates a concept-class set of the next layer up.
When no two remaining keyword sets satisfy the conditions above, merging stops; the parent node of the remaining sets is the set C formed by all keywords.
The keyword clustering steps are as follows:
Step 121: extract all keywords, obtaining the keyword set C = {k1, k2, ..., kn}; compute for each keyword k in C the probability with which it occurs, namely the ratio of the number of texts containing k to the total number of texts, denoted p(k).
Step 122: filter the keywords by the preset thresholds.
Keep the keywords with p_min < p(k) < p_max as entries of the sets to be merged, the number of qualifying keywords being m (p_max and p_min are the preset upper and lower bound thresholds, used to remove extremely high-frequency and extremely low-frequency words).
Step 123: sort the filtered keywords by p(k) in descending order and take each keyword as a set of its own, yielding m initial sets to be merged, denoted {k_1}, {k_2}, ..., {k_m}.
Step 124: among these m keywords, compute the probability that keyword k_j also occurs in the texts in which keyword k_i occurs, denoted p(k_j|k_i), for a total of m(m-1) conditional probabilities (1 ≤ i, j ≤ m; i ≠ j).
p(k_j|k_i) is computed as p(k_j|k_i) = p(k_j k_i)/p(k_i), where p(k_j k_i) is the probability that k_j and k_i appear in the same text simultaneously.
Step 125: merge sets I and J (I and J being sets to be merged) when the following two conditions hold simultaneously:
i. there exist k_i ∈ I and k_j ∈ J satisfying p(k_i) ≥ P1 and p(k_j|k_i) ≥ P2;
ii. for any k_i in I ∪ J, |{k_j ∈ I ∪ J : p(k_j|k_i) ≥ P2}| ≥ (|I| + |J|)/2, where |X| denotes the number of elements of set X.
Step 126: when no two sets satisfy these two conditions, merging finishes, and the first-layer clustered keyword sets C = {C1, C2, ..., Cq} are obtained.
Step 127: for C = {C1, C2, ..., Cq}, take a threshold P3 < P2 and cluster again following the steps above (steps 125 and 126), generating the concept sets of the next layer up.
Repeat this process until the cluster sets can no longer be clustered; these final concept sets become the child nodes of the root node C, which yields a keyword concept tree whose root node is the keyword set C.
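The repeated merging of steps 125 to 127 amounts to a bottom-up agglomeration loop; a simplified sketch follows (a single merge predicate stands in for the per-layer threshold schedule P1/P2/P3, which is an assumption):

```python
def build_concept_tree(leaf_sets, mergeable):
    """Repeatedly merge any two mergeable keyword sets; when no pair merges
    any more, the survivors become children of the root (the full set C)."""
    layer = [frozenset(s) for s in leaf_sets]
    children = {}                        # node -> its child nodes
    merged = True
    while merged:
        merged = False
        for i in range(len(layer)):
            for j in range(i + 1, len(layer)):
                if mergeable(layer[i], layer[j]):
                    parent = layer[i] | layer[j]
                    children[parent] = [layer[i], layer[j]]
                    layer = [s for k, s in enumerate(layer) if k not in (i, j)]
                    layer.append(parent)
                    merged = True
                    break
            if merged:
                break
    root = frozenset().union(*layer)
    if len(layer) > 1:
        children[root] = layer
    return root, children

# Toy predicate: only {"a"} and {"b"} are considered the same concept.
root, children = build_concept_tree([{"a"}, {"b"}, {"c"}],
                                    lambda x, y: x | y == frozenset({"a", "b"}))
print(sorted(root))  # the root node is the whole keyword set
```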
Fig. 1 shows a schematic diagram of a 4-layer concept tree constructed in the present invention.
Step 2: from the concept tree built over the keywords of the text to be translated, find matching texts in the translation reference library.
The present invention defines a vector cosine computation based on the keyword concept tree, that is, a new text similarity computation method.
Step 21: from the structure of the keyword concept tree, compute the similarity of different keywords with the new method.
Step 22: with the improved cosine similarity method, compute the similarity between the text to be translated and the texts in the translation reference library.
In the VSM vector space model, any two keywords k_i and k_j are completely orthogonal, so their product is 0. In the concept tree of the present invention, by contrast, any two concepts k_i and k_j need not be orthogonal; their product is determined by the distance of their nearest common parent node from the root node. For example, in Fig. 1 the nearest common parent of k_1 and k_2 is C11, whose distance from the root is 2; the height of the tree is 3, so k_1 × k_2 = 2/3.
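The Fig. 1 example can be reproduced with a small parent-pointer tree (the node names follow the figure; the dictionary encoding is an illustrative assumption):

```python
def keyword_product(parent, height, a, b):
    """k_i x k_j = depth(com(k_i, k_j)) / H: depth of the nearest common
    ancestor (in edges from the root) divided by the tree height."""
    def chain(n):                       # node and all its ancestors up to the root
        nodes = [n]
        while n in parent:
            n = parent[n]
            nodes.append(n)
        return nodes
    on_a_path = set(chain(a))
    node = b
    while node not in on_a_path:        # climb from b until we hit a's path
        node = parent[node]
    return (len(chain(node)) - 1) / height

# Fig. 1: root C -> C1 -> C11 -> {k1, k2}; C11 is the nearest common parent,
# its depth is 2 and the tree height H is 3, so k1 x k2 = 2/3.
parent = {"k1": "C11", "k2": "C11", "C11": "C1", "C1": "C"}
print(keyword_product(parent, 3, "k1", "k2"))  # 2/3
```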
1. Define H as the height of the generated concept tree.
2. Define depth(k) as the depth of node k in the tree, namely the number of edges traversed from the root node to that node.
3. Define com(k_i, k_j) as the nearest common parent node of nodes k_i and k_j; any two nodes have at least the root node as a common parent.
4. The product of any two keywords is computed as k_i × k_j = depth(com(k_i, k_j))/H.
5. Given vectors A = {a_1, a_2, ..., a_n} and B = {b_1, b_2, ..., b_n}, define the vector product A × B = Σ_{i=1..n} Σ_{j=1..n} a_i · b_j · (k_i × k_j).
6. The similarity of two texts is computed as sim(d1, d2) = (d1 × d2)/(√(d1 × d1) · √(d2 × d2)), where d1 and d2 denote the text vectors.
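Putting the definitions above together, the improved cosine can be sketched as follows (the 2-keyword product matrix, with diagonal 1.0 for a keyword paired with itself, is an illustrative assumption):

```python
from math import sqrt

def soft_dot(a, b, product):
    """Generalized inner product: sum of a_i * b_j * (k_i x k_j) over all pairs."""
    return sum(ai * bj * product[i][j]
               for i, ai in enumerate(a) for j, bj in enumerate(b))

def tree_cosine(d1, d2, product):
    """Improved cosine: vectors with disjoint keywords can still score above 0
    when their keywords are related through the concept tree."""
    den = sqrt(soft_dot(d1, d1, product)) * sqrt(soft_dot(d2, d2, product))
    return soft_dot(d1, d2, product) / den if den else 0.0

product = [[1.0, 2/3],
           [2/3, 1.0]]                 # product[i][j] = depth(com(k_i, k_j)) / H
d1, d2 = [1, 0], [0, 1]                # disjoint keywords: plain cosine gives 0
print(round(tree_cosine(d1, d2, product), 3))  # → 0.667
```

This shows the effect claimed above: the plain VSM cosine of d1 and d2 is 0, while the tree-based cosine is 2/3 because the two keywords share a nearby concept node.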
Step 23: return result texts in descending order of similarity.
Concrete applications of the technical scheme are described below.
Application 1: optimizing translator retrieval through similarity matching against translators' past work
Each translator has many documents he or she has translated; these form the translator's document library, and the libraries of many translators form a huge "translator achievement library". When a suitable translator must be found for a document to be translated, the document is similarity-matched against the achievement library; the documents with high similarity point to their translators, who are the suitable candidates, and the similarity ranking yields a ranking of translator suitability. Because such a translator has already translated similar documents, the translation can be done faster and better.
Application 2: automated document classification through similarity matching against a classified document library
Build a standard document library organized by a fixed classification scheme, with a number of sample documents per class. Similarity-match an unclassified document against the library, collect all library documents whose similarity exceeds a predetermined value, tally the class membership of these similar documents, and feed the tallies into a weighted computation model to score each candidate class; the highest-scoring class is the document's most probable class. If the second-highest score is close to the first, it can serve as a secondary class.
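The voting scheme of Application 2 can be sketched as follows (the similarity threshold, the runner-up margin, and the class labels are illustrative assumptions; the patent does not specify its weighted computation model further):

```python
from collections import defaultdict

def classify(matches, threshold=0.5, margin=0.1):
    """matches: (class_label, similarity) pairs for library documents that
    matched the unclassified document. Documents above the threshold vote for
    their class, weighted by similarity; a close runner-up becomes the
    secondary class."""
    scores = defaultdict(float)
    for label, sim in matches:
        if sim >= threshold:
            scores[label] += sim
    if not scores:
        return None, None
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    best = ranked[0][0]
    second = (ranked[1][0]
              if len(ranked) > 1 and ranked[0][1] - ranked[1][1] <= margin
              else None)
    return best, second

matches = [("law", 0.9), ("law", 0.7), ("finance", 0.8),
           ("finance", 0.75), ("sport", 0.3)]
print(classify(matches))  # ('law', 'finance'): finance is a close secondary class
```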
Application 3: a manuscript-splitting strategy combining subject-domain division with similarity retrieval
In a large translation task, breaking a large manuscript into several smaller translation fragments is a common way to divide labor and raise translation efficiency, but the strategy for "smashing" the manuscript is the key link. The method adopted here does not split the content by simple paragraphs; instead, it judges the subject domain of each paragraph from its keywords, partitions the manuscript content preliminarily by domain, then runs similarity retrieval of the divided fragments against the historical achievement library to determine which translators suit each fragment, and finally consolidates the fragments by translator: fragments suited to the same translator, or the same class of translators, are merged wholly or partly. The resulting fragmentation is well suited to task assignment and helps guarantee translation quality.
Claims (7)
1. A text similarity matching method based on a vector space model, comprising:
extracting the keywords of texts, clustering all the keywords, and generating a keyword concept tree; and
computing text similarity from the concept tree built over the keywords of the text to be translated, and retrieving matching texts from the translation reference library ranked by similarity.
2. The text similarity matching method based on a vector space model of claim 1, characterized in that the step of generating the keyword concept tree comprises:
extracting all keywords of the document to be classified and of the reference library to obtain a keyword set;
clustering the keywords in the keyword set, aggregating keywords of the same concept into concept-class sets, and generating the keyword concept tree from these concept-class sets.
3. The text similarity matching method based on a vector space model of claim 2, characterized in that, if the probability p(k_i) with which keyword k_i occurs satisfies p(k_i) ≥ P1, and the conditional probability p(k_j|k_i) that keyword k_j also appears in a text containing k_i satisfies p(k_j|k_i) ≥ P2, then k_j and k_i are taken to express the same concept; P1 and P2 are preset probability thresholds.
4. The text similarity matching method based on a vector space model of claim 3, characterized in that the concrete steps of generating the keyword concept tree comprise:
extracting all keywords of the document to be classified and of the reference library to obtain the keyword set C = {k1, k2, ..., kn}, and computing for each keyword k in C the probability with which it occurs, namely the ratio of the number of texts containing k to the total number of texts, denoted p(k);
filtering the keywords by the preset thresholds, keeping those with p_min < p(k) < p_max as entries of the sets to be merged, the number of qualifying keywords being m, where p_max and p_min are the preset upper and lower bound thresholds;
sorting the filtered keywords by p(k) in descending order and taking each keyword as a set of its own, which yields m initial sets to be merged, denoted {k_1}, {k_2}, ..., {k_m};
among these m keywords, computing the probability that keyword k_j also occurs in a text in which keyword k_i occurs, denoted p(k_j|k_i), for a total of m(m-1) conditional probabilities (1 ≤ i, j ≤ m; i ≠ j), where p(k_j|k_i) = p(k_j k_i)/p(k_i) and p(k_j k_i) is the probability that k_j and k_i appear in the same text simultaneously;
merging the sets to be merged, generating the keyword concept tree whose root node is the keyword set C.
5. The text similarity matching method based on a vector space model of claim 4, characterized in that, for two keyword sets C1 and C2 to be merged, the merging conditions are: there exist k_i belonging to C1 and k_j belonging to C2 with p(k_i) ≥ P1 and p(k_j|k_i) ≥ P2; when p(k_i) and p(k_j|k_i) reach the preset thresholds, keywords k_i and k_j express the same concept, satisfying one merging condition of their sets; and, for any keyword k_i in the merged set, more than half of the keywords in the set satisfy p(k_j|k_i) ≥ P2; if two sets satisfy both conditions, their concepts are highly similar and the sets can be merged, merging them generating a concept-class set of the next layer up.
6. The text similarity matching method based on a vector space model of claim 1, characterized in that the process of finding matching texts in the translation reference library comprises: extracting the keywords of all documents in the translation reference library to form a keyword set; and, from the structure of the keyword concept tree, computing the similarity between the text to be classified and each text in the reference library with the improved text similarity formula, returning result texts in descending order of similarity.
7. The text similarity matching method based on a vector space model of claim 6, characterized in that the concrete steps of finding matching texts in the translation reference library comprise:
defining H as the height of the generated concept tree, and depth(k) as the depth of node k in the tree, namely the number of edges traversed from the root node to that node;
defining com(k_i, k_j) as the nearest common parent node of nodes k_i and k_j, any two nodes having at least the root node as a common parent;
computing the product of any two keywords as k_i × k_j = depth(com(k_i, k_j))/H;
given vectors A = {a_1, a_2, ..., a_n} and B = {b_1, b_2, ..., b_n}, defining the vector product A × B = Σ_{i=1..n} Σ_{j=1..n} a_i · b_j · (k_i × k_j);
computing the similarity of two texts as sim(d1, d2) = (d1 × d2)/(√(d1 × d1) · √(d2 × d2)), where d1 and d2 denote the text vectors.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2012105931481A CN103049569A (en) | 2012-12-31 | 2012-12-31 | Text similarity matching method on basis of vector space model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103049569A true CN103049569A (en) | 2013-04-17 |
Family
ID=48062209
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103678277A (en) * | 2013-12-04 | 2014-03-26 | 东软集团股份有限公司 | Theme-vocabulary distribution establishing method and system based on document segmenting |
CN103678287A (en) * | 2013-11-30 | 2014-03-26 | 武汉传神信息技术有限公司 | Method for unifying keyword translation |
CN103761264A (en) * | 2013-12-31 | 2014-04-30 | 浙江大学 | Concept hierarchy establishing method based on product review document set |
CN104424279A (en) * | 2013-08-30 | 2015-03-18 | 腾讯科技(深圳)有限公司 | Text relevance calculating method and device |
CN104572645A (en) * | 2013-10-11 | 2015-04-29 | 高德软件有限公司 | Method and device for POI (Point Of Interest) data association |
CN104778158A (en) * | 2015-03-04 | 2015-07-15 | 新浪网技术(中国)有限公司 | Method and device for representing text |
CN104866631A (en) * | 2015-06-18 | 2015-08-26 | 北京京东尚科信息技术有限公司 | Method and device for aggregating counseling problems |
CN105138521A (en) * | 2015-08-27 | 2015-12-09 | 武汉传神信息技术有限公司 | General translator recommendation method for risk project in translation industry |
CN105279147A (en) * | 2015-09-29 | 2016-01-27 | 武汉传神信息技术有限公司 | Translator document quick matching method |
CN106250412A (en) * | 2016-07-22 | 2016-12-21 | 浙江大学 | The knowledge mapping construction method merged based on many source entities |
CN106372122A (en) * | 2016-08-23 | 2017-02-01 | 温州大学瓯江学院 | Wiki semantic matching-based document classification method and system |
CN106503457A (en) * | 2016-10-26 | 2017-03-15 | 清华大学 | The integrated technical data introduction method of clinical data based on translational medicine analysis platform |
CN106776563A (en) * | 2016-12-21 | 2017-05-31 | 语联网(武汉)信息技术有限公司 | A kind of is the method for treating manuscript of a translation part matching interpreter |
CN106802881A (en) * | 2016-12-25 | 2017-06-06 | 语联网(武汉)信息技术有限公司 | A kind of is to treat the method that manuscript of a translation part matches interpreter based on vocabulary is disabled |
CN106844304A (en) * | 2016-12-26 | 2017-06-13 | 语联网(武汉)信息技术有限公司 | It is a kind of to be categorized as treating the method that manuscript of a translation part matches interpreter based on the manuscript of a translation |
CN106844303A (en) * | 2016-12-23 | 2017-06-13 | 语联网(武汉)信息技术有限公司 | A kind of is to treat the method that manuscript of a translation part matches interpreter based on similarity mode algorithm |
CN107562854A (en) * | 2017-08-28 | 2018-01-09 | 云南大学 | A kind of modeling method of quantitative analysis Party building data |
CN108182182A (en) * | 2017-12-27 | 2018-06-19 | 传神语联网网络科技股份有限公司 | Document matching process, device and computer readable storage medium in translation database |
CN109284486A (en) * | 2018-08-14 | 2019-01-29 | 重庆邂智科技有限公司 | Text similarity measure, device, terminal and storage medium |
CN109636199A (en) * | 2018-12-14 | 2019-04-16 | 语联网(武汉)信息技术有限公司 | A kind of method and system to match interpreter to manuscript of a translation part |
CN110019785A (en) * | 2017-09-29 | 2019-07-16 | 北京国双科技有限公司 | A kind of file classification method and device |
CN110196906A (en) * | 2019-01-04 | 2019-09-03 | 华南理工大学 | Towards financial industry based on deep learning text similarity detection method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1828610A (en) * | 2006-04-13 | 2006-09-06 | 北大方正集团有限公司 | Improved file similarity measure method based on file structure |
CN101004761A (en) * | 2007-01-10 | 2007-07-25 | 复旦大学 | Hierarchy clustering method of successive dichotomy for document in large scale |
US20110213777A1 (en) * | 2010-02-01 | 2011-09-01 | Alibaba Group Holding Limited | Method and Apparatus of Text Classification |
Non-Patent Citations (1)
Title |
---|
Lü Yue'e, "Document Classification and Retrieval in the Chinese Sci-Tech Periodical Database", Journal of Linyi Normal University * |
Cited By (33)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104424279A (en) * | 2013-08-30 | 2015-03-18 | 腾讯科技(深圳)有限公司 | Text relevance calculation method and device |
CN104424279B (en) * | 2013-08-30 | 2018-11-20 | 腾讯科技(深圳)有限公司 | Text relevance calculation method and device |
CN104572645A (en) * | 2013-10-11 | 2015-04-29 | 高德软件有限公司 | Method and device for POI (Point Of Interest) data association |
CN103678287B (en) * | 2013-11-30 | 2016-12-07 | 语联网(武汉)信息技术有限公司 | Method for unifying keyword translation |
CN103678287A (en) * | 2013-11-30 | 2014-03-26 | 武汉传神信息技术有限公司 | Method for unifying keyword translation |
CN103678277A (en) * | 2013-12-04 | 2014-03-26 | 东软集团股份有限公司 | Topic-vocabulary distribution construction method and system based on document segmentation |
CN103761264A (en) * | 2013-12-31 | 2014-04-30 | 浙江大学 | Concept hierarchy construction method based on a product review document set |
CN103761264B (en) * | 2013-12-31 | 2017-01-18 | 浙江大学 | Concept hierarchy construction method based on a product review document set |
CN104778158A (en) * | 2015-03-04 | 2015-07-15 | 新浪网技术(中国)有限公司 | Method and device for representing text |
CN104778158B (en) * | 2015-03-04 | 2018-07-17 | 新浪网技术(中国)有限公司 | Text representation method and device |
CN104866631A (en) * | 2015-06-18 | 2015-08-26 | 北京京东尚科信息技术有限公司 | Method and device for aggregating consulting questions |
CN105138521A (en) * | 2015-08-27 | 2015-12-09 | 武汉传神信息技术有限公司 | General translator recommendation method for risk projects in the translation industry |
CN105138521B (en) * | 2015-08-27 | 2017-12-22 | 武汉传神信息技术有限公司 | General translator recommendation method for risk projects in the translation industry |
CN105279147A (en) * | 2015-09-29 | 2016-01-27 | 武汉传神信息技术有限公司 | Method for quickly matching translators to documents |
CN105279147B (en) * | 2015-09-29 | 2018-02-23 | 语联网(武汉)信息技术有限公司 | Method for quickly matching translators to documents |
CN106250412B (en) * | 2016-07-22 | 2019-04-23 | 浙江大学 | Knowledge graph construction method based on multi-source entity fusion |
CN106250412A (en) * | 2016-07-22 | 2016-12-21 | 浙江大学 | Knowledge graph construction method based on multi-source entity fusion |
CN106372122A (en) * | 2016-08-23 | 2017-02-01 | 温州大学瓯江学院 | Wiki semantic matching-based document classification method and system |
CN106503457B (en) * | 2016-10-26 | 2018-12-11 | 清华大学 | Method for importing integrated clinical data based on a translational medicine analysis platform |
CN106503457A (en) * | 2016-10-26 | 2017-03-15 | 清华大学 | Method for importing integrated clinical data based on a translational medicine analysis platform |
CN106776563A (en) * | 2016-12-21 | 2017-05-31 | 语联网(武汉)信息技术有限公司 | Method for matching a translator to a manuscript to be translated |
CN106844303A (en) * | 2016-12-23 | 2017-06-13 | 语联网(武汉)信息技术有限公司 | Method for matching a translator to a manuscript to be translated based on a similarity matching algorithm |
CN106802881A (en) * | 2016-12-25 | 2017-06-06 | 语联网(武汉)信息技术有限公司 | Method for matching a translator to a manuscript to be translated based on stop words |
CN106844304A (en) * | 2016-12-26 | 2017-06-13 | 语联网(武汉)信息技术有限公司 | Method for matching a translator to a manuscript to be translated based on manuscript classification |
CN107562854B (en) * | 2017-08-28 | 2020-09-22 | 云南大学 | Modeling method for quantitative analysis of Party building data |
CN107562854A (en) * | 2017-08-28 | 2018-01-09 | 云南大学 | Modeling method for quantitative analysis of Party building data |
CN110019785A (en) * | 2017-09-29 | 2019-07-16 | 北京国双科技有限公司 | Text classification method and device |
CN110019785B (en) * | 2017-09-29 | 2022-03-01 | 北京国双科技有限公司 | Text classification method and device |
CN108182182A (en) * | 2017-12-27 | 2018-06-19 | 传神语联网网络科技股份有限公司 | Document matching method and device in a translation database, and computer-readable storage medium |
CN109284486A (en) * | 2018-08-14 | 2019-01-29 | 重庆邂智科技有限公司 | Text similarity measurement method, device, terminal and storage medium |
CN109284486B (en) * | 2018-08-14 | 2023-08-22 | 重庆邂智科技有限公司 | Text similarity measurement method, device, terminal and storage medium |
CN109636199A (en) * | 2018-12-14 | 2019-04-16 | 语联网(武汉)信息技术有限公司 | Method and system for matching a translator to a manuscript to be translated |
CN110196906A (en) * | 2019-01-04 | 2019-09-03 | 华南理工大学 | Deep-learning-based text similarity detection method for the financial industry |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103049569A (en) | Text similarity matching method on basis of vector space model | |
CN104391942B (en) | Short text feature expansion method based on semantic graphs | |
CN106649260B (en) | Product feature structure tree construction method based on review text mining | |
US10437867B2 (en) | Scenario generating apparatus and computer program therefor | |
CN107122413A (en) | Keyword extraction method and device based on a graph model | |
CN109376352B (en) | Patent text modeling method based on word2vec and semantic similarity | |
CN102591988B (en) | Short text classification method based on semantic graphs | |
CN108681557B (en) | Short text topic discovery method and system based on self-expansion representation and similar bidirectional constraint | |
CN106776562A (en) | Keyword extraction method and system | |
US10095685B2 (en) | Phrase pair collecting apparatus and computer program therefor | |
CN106156272A (en) | Information retrieval method based on multi-source semantic analysis | |
WO2021051518A1 (en) | Text data classification method and apparatus based on neural network model, and storage medium | |
CN105653518A (en) | Specific group discovery and expansion method based on microblog data | |
CN103970729A (en) | Multi-topic extraction method based on semantic categories | |
CN102637192A (en) | Natural language question answering method | |
CN107122349A (en) | Text feature word extraction method based on word2vec-LDA models | |
CN102495892A (en) | Webpage information extraction method | |
CN103970730A (en) | Method for extracting multiple topic terms from a single Chinese text | |
CN101097570A (en) | Advertisement classification method capable of automatically recognizing classified advertisement types | |
CN102253982A (en) | Query suggestion method based on query semantics and click-through data | |
CN110362678A (en) | Method and apparatus for automatically extracting Chinese text keywords | |
CN111221968B (en) | Author disambiguation method and device based on topic tree clustering | |
CN102054029A (en) | Person name disambiguation method based on social networks and name context | |
CN104899188A (en) | Question similarity calculation method based on question topics and focuses | |
CN106484797A (en) | Event summary extraction method based on sparse learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 2013-04-17 |