CN106598940A - Text similarity solution algorithm based on global optimization of keyword quality - Google Patents
- Publication number
- CN106598940A CN106598940A CN201610939853.0A CN201610939853A CN106598940A CN 106598940 A CN106598940 A CN 106598940A CN 201610939853 A CN201610939853 A CN 201610939853A CN 106598940 A CN106598940 A CN 106598940A
- Authority
- CN
- China
- Prior art keywords
- word
- text
- similarity
- weight
- vocabulary
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The invention provides a text similarity computation algorithm based on global optimization of keyword quality. The algorithm comprises the following steps: perform word segmentation and stop-word removal on a text; jointly consider a keyword's weight, density, depth, part of speech, word position, relevance to core vocabulary, and other factors in the text corpus; regard each keyword (formula as shown in the specification) as a multi-dimensional vector; reduce the dimensionality of the keyword set in the multi-dimensional space under constraint conditions; extract and compute the similarity (formula as shown in the specification) of the two highest-weight keywords in the two texts' keyword sets; and set a threshold condition to extract the feature vocabulary vectors of the two texts. Compared with the traditional term frequency-inverse document frequency method, the algorithm is more accurate, and it overcomes the drawback that the information gain method can extract features for only one category. Its constraint conditions are precise enough to compute the contribution of different words to a text accurately, so the individual quality of each keyword is guaranteed, the overall quality of the keyword set is optimized globally, and the similarity computed between texts is more accurate.
Description
Technical field
The present invention relates to the field of Semantic Web technology, and in particular to a text similarity computation algorithm based on global optimization of keyword quality.
Background technology
Text similarity computation applies to text classification, text clustering, information retrieval, question answering systems, duplicate-webpage removal, and many other fields. At present, many similarity computation methods have been proposed and put into practice in different fields, such as the vector space model, the Boolean model, string-matching models such as the latent semantic indexing statistical model, and models based on semantic understanding. More and more researchers now study text similarity computation, because an effective computation can improve retrieval precision, help detect article plagiarism, and save storage space. Many problems in the field remain to be solved, especially for Chinese text similarity, where research has not yet reached a satisfactory level. For natural language understanding by computer, Chinese is harder to process than English: unlike English, Chinese words have no explicit separators, a single meaning is often expressed by several consecutive characters, and differences in context easily cause ambiguity. To improve the effectiveness and accuracy of text similarity computation, the present invention provides a text similarity algorithm based on global optimization of keyword quality.
The content of the invention
To address the insufficient effectiveness and accuracy of text similarity computation, the invention provides a text similarity algorithm based on global optimization of keyword quality.
To solve the above problems, the present invention is realized through the following technical solution:
Step 1: Apply a Chinese word segmentation algorithm to the two texts (W1, W2).
Step 2: Remove stop words from the text vocabulary according to a stop-word list.
Step 3: Denote the keyword set after stop-word removal as C = (C1, C2, ..., Cn); regard the contribution of each keyword Ci to the text as a multi-dimensional vector.
Step 4: Apply constraint conditions to reduce the dimensionality of the keyword feature set in the multi-dimensional space, and extract the optimized text keyword sets J1 and J2.
Step 5: Compute the similarity between the two maximum-weight words of the keyword sets J1 and J2.
Step 6: Compute the pairwise similarity between the words of J1 and J2, set a threshold on word-pair similarity, and compute the similarity sim(W1, W2) between the two texts from the number of word pairs that satisfy the condition.
The present invention has the following advantages:
1. The feature vocabulary set it obtains is more accurate than that of the traditional term frequency-inverse document frequency method.
2. It overcomes the drawback that the information gain method is only suited to extracting text features of a single category.
3. The feature words selected by this algorithm are more valuable.
4. It accurately computes the contribution of different words to the idea of the text within the feature vocabulary.
5. Its conditions are stricter than those of earlier methods, so its results are more precise.
6. It not only guarantees the individual quality of each keyword but also optimizes the overall quality of the keyword set globally.
7. The similarity results between texts are more accurate and better match empirical values.
Description of the drawings
Fig. 1: Flow chart of the text similarity algorithm based on global optimization of keyword quality.
Fig. 2: Illustration of the n-gram segmentation network.
Fig. 3: Flow chart of the Chinese text preprocessing process.
Specific embodiment
To address the insufficient effectiveness and accuracy of text similarity computation, the present invention is described in detail with reference to Fig. 1. The specific implementation steps are as follows:
Step 1: Apply Chinese word segmentation to the two texts (W1, W2). The segmentation process is as follows:
Step 1.1: Scan the character string to be segmented once, looking up candidates in the system's segmentation dictionary. Every substring found in the dictionary is identified as a word; if a character has no match in the dictionary, it is split off as a single-character word. Continue until the character string is exhausted.
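The dictionary scan of Step 1.1 can be sketched as forward maximum matching. The dictionary and sentence below are illustrative assumptions, not from the patent:

```python
# A minimal sketch of dictionary-based segmentation (Step 1.1): at each position,
# emit the longest dictionary word, falling back to a single character.

def forward_max_match(sentence, dictionary, max_len=4):
    """Scan the string once; longest dictionary match wins, else a single char."""
    words = []
    i = 0
    while i < len(sentence):
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

dictionary = {"文本", "相似", "相似度", "计算"}
print(forward_max_match("文本相似度计算", dictionary))  # → ['文本', '相似度', '计算']
```

Note that plain maximum matching is greedy; the patent resolves ambiguity in Steps 1.2-1.4 by scoring all candidate paths instead.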
Step 1.2: Based on probability statistics, split the sentence to be segmented into a network (lattice) structure, yielding the n possible sentence compositions. The sequential nodes of this structure are defined in order as S, M1, M2, M3, M4, M5, E; the structure is shown in Fig. 2.
Step 1.3: Using information theory, assign a weight to each edge of the above network. The computation is as follows:
According to the segmentation dictionary, count the matched dictionary words and the unmatched single characters: the i-th path contains n_i words, so the word counts of the n paths form the set (n1, n2, ..., nn).
Obtain min(·) = min(n1, n2, ..., nn).
For the remaining (n − m) paths, compute the weight of every adjacent-word edge.
In the statistical corpus, compute the information quantity X(C_i) of each word, and then the co-occurrence information quantity X(C_i, C_{i+1}) of adjacent words along a path. The formulas are:

X(C_i) = |x(C_i)_1 − x(C_i)_2|

where x(C_i)_1 is the information quantity of word C_i in the text corpus and x(C_i)_2 is the information quantity of the texts containing C_i:

x(C_i)_1 = −p(C_i)_1 ln p(C_i)_1

where p(C_i)_1 is the probability of C_i in the text corpus and n is the number of corpus texts containing C_i;

x(C_i)_2 = −p(C_i)_2 ln p(C_i)_2

where p(C_i)_2 is the proportion of texts containing C_i and N is the total number of texts in the statistical corpus.

Similarly, X(C_i, C_{i+1}) = |x(C_i, C_{i+1})_1 − x(C_i, C_{i+1})_2|

where x(C_i, C_{i+1})_1 is the co-occurrence information quantity of the pair (C_i, C_{i+1}) in the text corpus and x(C_i, C_{i+1})_2 is the information quantity of the texts in which the adjacent words co-occur:

x(C_i, C_{i+1})_1 = −p(C_i, C_{i+1})_1 ln p(C_i, C_{i+1})_1

where p(C_i, C_{i+1})_1 is the co-occurrence probability of (C_i, C_{i+1}) in the corpus and m is the number of corpus texts in which they co-occur;

x(C_i, C_{i+1})_2 = −p(C_i, C_{i+1})_2 ln p(C_i, C_{i+1})_2

where p(C_i, C_{i+1})_2 is the proportion of texts in which the adjacent words (C_i, C_{i+1}) co-occur.

In summary, the weight of each adjacent-word edge is

w(C_i, C_{i+1}) = X(C_i) + X(C_{i+1}) − 2 X(C_i, C_{i+1})
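The edge-weight formulas of Step 1.3 can be sketched as follows. The probabilities are taken directly as inputs here (the patent derives them from the corpus counts n, m, and N); all numbers are invented for illustration:

```python
import math

# Sketch of Step 1.3: word information quantities and the adjacent-word edge weight.

def info(p):
    """Self-information-style quantity -p ln p (0 when p == 0)."""
    return -p * math.log(p) if p > 0 else 0.0

def word_quantity(p_corpus, p_docs):
    """X(C_i) = |x(C_i)_1 - x(C_i)_2| from the two probability estimates."""
    return abs(info(p_corpus) - info(p_docs))

def edge_weight(Xi, Xj, Xij):
    """w(C_i, C_{i+1}) = X(C_i) + X(C_{i+1}) - 2 X(C_i, C_{i+1})."""
    return Xi + Xj - 2 * Xij

Xi = word_quantity(0.02, 0.10)   # illustrative corpus / document-frequency probabilities
Xj = word_quantity(0.01, 0.05)
Xij = word_quantity(0.005, 0.02)
print(edge_weight(Xi, Xj, Xij))
```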
Step 1.4: Find the path of maximum weight; it gives the segmentation result of the sentence. The computation is as follows:
There are n paths of different lengths; let the path set be (L1, L2, ..., Ln).
After the minimum-word-count operation above, m paths are eliminated (m < n), leaving (n − m) paths, whose lengths form a set.
The weight of each remaining path then combines the weight values of its edges, computed one by one as in Step 1.3, where S_j is the length of the j-th remaining path (formula as shown in the specification).
The path of maximum weight is selected (formula as shown in the specification).
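Selecting the maximum-weight path in Step 1.4 amounts to a best-path search over the segmentation lattice. The sketch below uses a plain additive dynamic program over an invented edge list (the patent's exact formula, including the path-length term S_j, is not reproduced here):

```python
# Sketch of Step 1.4: pick the segmentation path of maximum total edge weight.
# Edges are (start, end, weight) over character positions of the sentence.

def best_path(n_chars, edges):
    """DP over the lattice: best[i] = max-weight path covering characters [0, i)."""
    best = [float("-inf")] * (n_chars + 1)
    back = [None] * (n_chars + 1)
    best[0] = 0.0
    for i in range(n_chars):
        if best[i] == float("-inf"):
            continue
        for (s, e, w) in edges:
            if s == i and best[i] + w > best[e]:
                best[e] = best[i] + w
                back[e] = s
    # Recover the word boundaries of the winning path.
    cuts, i = [], n_chars
    while i > 0:
        cuts.append((back[i], i))
        i = back[i]
    return best[n_chars], list(reversed(cuts))

edges = [(0, 2, 1.5), (0, 1, 0.2), (1, 2, 0.3), (2, 4, 2.0), (2, 3, 0.4), (3, 4, 0.9)]
print(best_path(4, edges))  # → (3.5, [(0, 2), (2, 4)])
```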
Step 2: Remove stop words from the text vocabulary according to the stop-word list, as follows:
Stop words are words that occur frequently in a text but contribute little to characterizing it. Removing them means comparing each feature term against the stop-word list and deleting the term if it matches.
Combining segmentation and stop-word deletion, the Chinese text preprocessing flow is shown in Fig. 3.
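The stop-word comparison of Step 2 is a simple set-membership filter. The stop list below is an illustrative fragment, not the list used in the patent:

```python
# Sketch of Step 2: drop tokens that appear in the stop-word list.

STOP_WORDS = {"的", "了", "和", "是", "在"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word list."""
    return [t for t in tokens if t not in stop_words]

print(remove_stop_words(["文本", "的", "相似度", "是", "计算"]))  # → ['文本', '相似度', '计算']
```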
Step 3: Denote the keyword set after stop-word removal as C = (C1, C2, ..., Cn). The contribution of each keyword Ci to the text is regarded as a multi-dimensional object vector, computed as follows:
Object vector: (formula as shown in the specification)
Here f_i is the weighting function of keyword C_i in the text corpus (formula as shown in the specification), in which the occurrence count of C_i in the text, the total length L_total of the text, the total number N_total of texts in the corpus, the information quantity I_j(C_i) of C_i in the j-th corpus text, and the average information quantity of C_i in the corpus all appear.
In the object vector, s_i is the maximum depth of the ontology concept corresponding to C_i in the ontology network structure.
In the object vector, m_i is the maximum density of the ontology concept corresponding to C_i in the ontology network structure.
In the object vector, x_i is the part-of-speech weight of C_i: by experience, the weights of nouns, verbs, adjectives, and adverbs are β1, β2, β3, and β4 respectively, with β1 > β2 > β3 > β4.
In the object vector, w_i1 is the position weight of the first occurrence of C_i in the text; statistical investigation yields a series of position weight values (α1, α2, ..., αn), with the weight of the title largest and the first paragraph next.
In the object vector, n_1 is the number of occurrences of C_i in the paragraph of its first appearance in the text.
The final component of the object vector is the co-occurrence probability of C_i with the maximum-weight word in the text.
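The factors of Step 3 can be gathered into one record per keyword. The field names below paraphrase the patent's factors and the values are invented:

```python
# Sketch of the Step 3 object vector: one record bundling a keyword's quality factors.
from dataclasses import dataclass

@dataclass
class KeywordVector:
    weight: float          # f_i: corpus weighting function
    depth: float           # s_i: max ontology-concept depth
    density: float         # m_i: max ontology-concept density
    pos_weight: float      # x_i: part-of-speech weight (noun > verb > adj > adverb)
    position: float        # w_i1: first-occurrence position weight
    first_par_count: int   # n_1: occurrences in the paragraph of first appearance
    cooccur: float         # co-occurrence probability with the max-weight word

v = KeywordVector(0.8, 3, 0.4, 0.9, 1.0, 2, 0.35)
print(v.weight)  # → 0.8
```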
Step 4: Using constraint conditions, reduce the dimensionality of the keyword feature set in the multi-dimensional space, and extract the optimized text keyword sets J1 and J2. The computation is as follows:
The keyword vectors above are mapped into the multi-dimensional space, and keywords that satisfy the following constraints are merged in order of decreasing weight:
distance between two points: d < γ;
similarity between two points: (formula as shown in the specification).
γ and α are thresholds set by experts; suitable values can be obtained by experimental testing.
The keyword and weight after merging are those of the maximum-weight point (vector) in the merged region.
The merged keywords are sorted by weight in descending order, and a high-quality keyword feature set is extracted using the following constraint: the number of extracted words is capped by a threshold k, so that only the top k high-quality keywords are kept and redundancy is avoided.
The result is the high-quality keyword feature sets, each ordered by descending weight.
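The reduction of Step 4 can be sketched as a greedy merge: walk the keywords in descending weight and keep a point only if it is at least γ away from everything already kept, then retain the top k. Only the distance constraint is shown (the patent also applies a similarity constraint with threshold α); vectors and thresholds are illustrative:

```python
import math

# Sketch of Step 4: merge nearby keyword vectors, keep the heavier point, top-k cap.

def dist(a, b):
    """Euclidean distance between two keyword vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def reduce_keywords(items, gamma=0.5, k=2):
    """items: list of (word, vector, weight). Greedy merge by descending weight."""
    kept = []
    for word, vec, w in sorted(items, key=lambda t: -t[2]):
        if all(dist(vec, v) >= gamma for _, v, _ in kept):
            kept.append((word, vec, w))
    return kept[:k]

items = [
    ("net", (0.9, 0.1), 0.9),
    ("network", (0.85, 0.15), 0.7),  # within gamma of "net": merged away
    ("graph", (0.1, 0.9), 0.6),
]
print([w for w, _, _ in reduce_keywords(items)])  # → ['net', 'graph']
```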
Step 5: Compute the similarity between the two maximum-weight words of the keyword sets J1 and J2. The computation is as follows:
Each keyword is described by the vector factors defined in Step 3. The similarity between the two maximum-weight words is computed with the following formula (as shown in the specification), in which A and B are the combined weight proportions assigned to concept depth and density and to the other vector elements respectively, with in general A > B and A + B = 1, together with a distance factor (as shown in the specification).
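Step 5 combines a depth/density term (weight A) with the remaining terms (weight B), A + B = 1. The combination below is a sketch; the patent's exact factor definitions are in its unreproduced formulas:

```python
# Sketch of the Step 5 weighted combination with A > B and A + B = 1.

def word_similarity(depth_density_term, other_term, A=0.6, B=0.4):
    """Weighted mix of the concept depth/density factor and the other factors."""
    assert A > B and abs(A + B - 1.0) < 1e-12  # constraints stated in the patent
    return A * depth_density_term + B * other_term

print(word_similarity(0.8, 0.5))  # → 0.68
```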
Step 6: Compute the pairwise similarity between the words of the keyword sets J1 and J2, set a threshold on word-pair similarity, and compute the similarity sim(W1, W2) between the two texts from the number of word pairs meeting the condition. The computation is as follows:
The keyword sets are written in matrix form (as shown in the specification). As in Step 5, the pairwise similarity between the words of the two keyword sets is computed, giving the following similarity matrix (as shown in the specification).
Words satisfying the following threshold condition on word-pair similarity are extracted (formula as shown in the specification), where j ∈ (2, 3, ..., k) and C is a threshold set by experts; each word pair meeting the condition increments the counter n′, whose initial value is 0, and C can be tuned iteratively by experiment.
In summary, the similarity between the two texts is computed (formula as shown in the specification), where x and y are the respective weights, with x > y and x + y = 1; n′ is the number of keywords meeting the threshold condition, and k is the number of high-quality keywords after optimization.
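Since the final formula is not reproduced in this extraction, the sketch below assumes (as the stated weights x > y, x + y = 1 suggest) that the text similarity mixes the top-pair similarity with the fraction n′/k of word pairs that clear the threshold:

```python
# Hedged sketch of the Step 6 combination: top-pair similarity mixed with the
# matched-pair fraction n'/k, under weights x > y with x + y = 1 (an assumption).

def text_similarity(sim_top_pair, n_matched, k, x=0.6, y=0.4):
    assert x > y and abs(x + y - 1.0) < 1e-12  # constraints stated in the patent
    return x * sim_top_pair + y * (n_matched / k)

print(text_similarity(0.9, 3, 5))  # 0.6*0.9 + 0.4*(3/5)
```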
Claims (3)
1. A text similarity algorithm based on global optimization of keyword quality, relating to the field of Semantic Web technology and specifically to a text similarity algorithm based on global optimization of keyword quality, characterized by comprising the following steps:
Step 1: Apply Chinese word segmentation to the two texts (W1, W2); the segmentation process is as follows:
Step 1.1: Scan the character string to be segmented once against the system's segmentation dictionary; every substring found in the dictionary is identified as a word, and a character with no dictionary match is split off as a single-character word, until the character string is exhausted.
Step 1.2: Based on probability statistics, split the sentence to be segmented into a network structure, yielding the n possible sentence compositions; the sequential nodes of this structure are defined in order as S, M1, M2, M3, M4, M5, E, as shown in Fig. 2.
Step 1.3: Using information theory, assign a weight to each edge of the network: according to the segmentation dictionary, count the matched dictionary words and the unmatched single characters, so that the i-th path contains n_i words and the word counts of the n paths form a set; for the remaining (n − m) paths, compute the weight of every adjacent-word edge. In the statistical corpus, compute the information quantity X(C_i) = |x(C_i)_1 − x(C_i)_2| of each word and the co-occurrence information quantity X(C_i, C_{i+1}) of adjacent words, where x(C_i)_1 = −p(C_i)_1 ln p(C_i)_1 with p(C_i)_1 the probability of C_i in the text corpus and n the number of corpus texts containing C_i; x(C_i)_2 = −p(C_i)_2 ln p(C_i)_2 with p(C_i)_2 the proportion of texts containing C_i and N the total number of corpus texts; analogously, x(C_i, C_{i+1})_1 and x(C_i, C_{i+1})_2 are computed from the co-occurrence probabilities, m being the number of corpus texts in which (C_i, C_{i+1}) co-occur. In summary, the weight of each adjacent-word edge is w(C_i, C_{i+1}) = X(C_i) + X(C_{i+1}) − 2 X(C_i, C_{i+1}).
Step 1.4: Find the path of maximum weight, which gives the segmentation result: of the n paths of different lengths, m are eliminated by the minimum-word-count operation (m < n), leaving (n − m) paths whose lengths form a set; the weight of each remaining path combines the weight values of its edges, computed one by one as in Step 1.3, with S_j the length of the j-th remaining path; the path of maximum weight is selected.
Step 2: Remove stop words from the text vocabulary according to the stop-word list: stop words occur frequently in a text but contribute little to characterizing it; each feature term is compared against the stop-word list and deleted if it matches; combining segmentation and stop-word deletion, the Chinese text preprocessing flow is shown in Fig. 3.
Step 3: Denote the keyword set after stop-word removal as C = (C1, C2, ..., Cn), and regard the contribution of each keyword Ci to the text as a multi-dimensional object vector whose components are: the weighting function f_i of C_i in the corpus, in which the occurrence count of C_i in the text, the total text length L_total, the total number N_total of corpus texts, the information quantity I_j(C_i) of C_i in the j-th corpus text, and its average information quantity all appear; the maximum depth s_i and maximum density m_i of the corresponding ontology concept in the ontology network structure; the part-of-speech weight x_i, with the empirical weights of nouns, verbs, adjectives, and adverbs being β1, β2, β3, β4 and β1 > β2 > β3 > β4; the position weight w_i1 of the first occurrence in the text, drawn from statistically determined position weight values with the title largest and the first paragraph next; the occurrence count n_1 in the paragraph of first appearance; and the co-occurrence probability of C_i with the maximum-weight word in the text.
Step 4: Using constraint conditions, reduce the dimensionality of the keyword feature set in the multi-dimensional space and extract the optimized text keyword sets J1, J2: map the keyword vectors into the space and merge, in order of decreasing weight, keywords satisfying the constraints that the distance d between two points is below a threshold γ and the similarity between two points meets a threshold α, where γ and α are expert-set thresholds for which suitable values can be obtained by experimental testing; the keyword and weight after merging are those of the maximum-weight point; sort the merged keywords by descending weight and keep the top k high-quality keywords, k being an expert-set threshold on the number of extracted words, avoiding redundancy, to obtain the high-quality keyword feature sets ordered by descending weight.
Step 5: Compute the similarity between the two maximum-weight words of the keyword sets J1 and J2.
Step 6: Compute the pairwise similarity between the words of J1 and J2, set a threshold on word-pair similarity, and compute the similarity sim(W1, W2) between the two texts from the number of word pairs satisfying the condition.
2. The text similarity algorithm based on global optimization of keyword quality according to claim 1, characterized in that the computation of Step 5 is as follows:
Step 5: Compute the similarity between the two maximum-weight words of the keyword sets J1 and J2: each keyword is described by the vector factors defined in Step 3, and the similarity between the two maximum-weight words is computed with a formula in which A and B are the combined weight proportions assigned to concept depth and density and to the other vector elements respectively, with in general A > B and A + B = 1, together with a distance factor.
3. The text similarity algorithm based on global optimization of keyword quality according to claim 1, characterized in that the computation of Step 6 is as follows:
Step 6: Write the keyword sets J1, J2 in matrix form and, as in Step 5, compute the pairwise similarity between their words, giving a similarity matrix; extract the words satisfying a threshold C on word-pair similarity, where j ∈ (2, 3, ..., k) and C is set by experts; each word pair meeting the condition increments the counter n′, whose initial value is 0, and C can be tuned iteratively by experiment; in summary, the similarity between the two texts is computed with the respective weights x and y, where x > y and x + y = 1, n′ is the number of keywords satisfying the threshold condition, and k is the number of high-quality keywords after optimization.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610939853.0A CN106598940A (en) | 2016-11-01 | 2016-11-01 | Text similarity solution algorithm based on global optimization of keyword quality |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610939853.0A CN106598940A (en) | 2016-11-01 | 2016-11-01 | Text similarity solution algorithm based on global optimization of keyword quality |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106598940A true CN106598940A (en) | 2017-04-26 |
Family
ID=58589621
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610939853.0A Pending CN106598940A (en) | 2016-11-01 | 2016-11-01 | Text similarity solution algorithm based on global optimization of keyword quality |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106598940A (en) |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107357916A (en) * | 2017-07-19 | 2017-11-17 | 北京金堤科技有限公司 | Data processing method and system |
CN108228546A (en) * | 2018-01-19 | 2018-06-29 | 北京中关村科金技术有限公司 | A kind of text feature, device, equipment and readable storage medium storing program for executing |
CN108804512A (en) * | 2018-04-20 | 2018-11-13 | 平安科技(深圳)有限公司 | Generating means, method and the computer readable storage medium of textual classification model |
CN108829799A (en) * | 2018-06-05 | 2018-11-16 | 中国人民公安大学 | Based on the Text similarity computing method and system for improving LDA topic model |
CN109086262A (en) * | 2017-06-14 | 2018-12-25 | 财团法人资讯工业策进会 | Lexical analysis device, method and its computer storage medium |
CN109977196A (en) * | 2019-03-29 | 2019-07-05 | 云南电网有限责任公司电力科学研究院 | A kind of detection method and device of magnanimity document similarity |
CN110245118A (en) * | 2019-06-27 | 2019-09-17 | 重庆市筑智建信息技术有限公司 | BIM data information three-dimensional gridding retrieval filing method and filing system thereof |
CN110956039A (en) * | 2019-12-04 | 2020-04-03 | 中国太平洋保险(集团)股份有限公司 | Text similarity calculation method and device based on multi-dimensional vectorization coding |
CN111353301A (en) * | 2020-02-24 | 2020-06-30 | 成都网安科技发展有限公司 | Auxiliary secret fixing method and device |
CN111625468A (en) * | 2020-06-05 | 2020-09-04 | 中国银行股份有限公司 | Test case duplicate removal method and device |
CN111898380A (en) * | 2020-08-17 | 2020-11-06 | 上海熙满网络科技有限公司 | Text matching method and device, electronic equipment and storage medium |
CN112035621A (en) * | 2020-09-03 | 2020-12-04 | 江苏经贸职业技术学院 | Enterprise name similarity detection method based on statistics |
CN112256843A (en) * | 2020-12-22 | 2021-01-22 | 华东交通大学 | News keyword extraction method and system based on TF-IDF method optimization |
CN112348535A (en) * | 2020-11-04 | 2021-02-09 | 新华中经信用管理有限公司 | Traceability application method and system based on block chain technology |
CN113392637A (en) * | 2021-06-24 | 2021-09-14 | 青岛科技大学 | TF-IDF-based subject term extraction method, device, equipment and storage medium |
CN113743090A (en) * | 2021-09-08 | 2021-12-03 | 度小满科技(北京)有限公司 | Keyword extraction method and device |
CN113836942A (en) * | 2021-02-08 | 2021-12-24 | 宏龙科技(杭州)有限公司 | Text matching method based on hidden keywords |
CN116244496A (en) * | 2022-12-06 | 2023-06-09 | 山东紫菜云数字科技有限公司 | Resource recommendation method based on industrial chain |
CN117669513A (en) * | 2024-01-30 | 2024-03-08 | 江苏古卓科技有限公司 | Data management system and method based on artificial intelligence |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101937471A (en) * | 2010-09-21 | 2011-01-05 | 上海大学 | Multidimensional space evaluation method of keyword extraction algorithm |
CN103617157A (en) * | 2013-12-10 | 2014-03-05 | 东北师范大学 | Text similarity calculation method based on semantics |
CN103970733A (en) * | 2014-04-10 | 2014-08-06 | 北京大学 | New Chinese word recognition method based on graph structure |
CN104239436A (en) * | 2014-08-27 | 2014-12-24 | 南京邮电大学 | Network hot event detection method based on text classification and clustering analysis |
Non-Patent Citations (2)
Title |
---|
BECK_ZHOU: ""中文分词语言模型和动态规划"", 《CSDN博客HTTPS://BLOG.CSDN.BET/ZHOUBL668/ARTICLE/DETAILS/68964》 * |
蒋健洪 等: ""词典与统计方法结合的中文分词模型研究及应用"", 《计算机工程与设计》 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106598940A (en) | Text similarity solution algorithm based on global optimization of keyword quality | |
CN107133213B (en) | Method and system for automatically extracting text abstract based on algorithm | |
CN110413986B (en) | Text clustering multi-document automatic summarization method and system for improving word vector model | |
Dos Santos et al. | Deep convolutional neural networks for sentiment analysis of short texts | |
CN106610951A (en) | Improved text similarity solving algorithm based on semantic analysis | |
CN106970910B (en) | Keyword extraction method and device based on graph model | |
CN106776562A (en) | A kind of keyword extracting method and extraction system | |
CN106598941A (en) | Algorithm for globally optimizing quality of text keywords | |
CN106611041A (en) | New text similarity solution method | |
Alwehaibi et al. | A study of the performance of embedding methods for Arabic short-text sentiment analysis using deep learning approaches | |
CN107423282A (en) | Semantic Coherence Sexual Themes and the concurrent extracting method of term vector in text based on composite character | |
CN106528621A (en) | Improved density text clustering algorithm | |
CN106570112A (en) | Improved ant colony algorithm-based text clustering realization method | |
Suleiman et al. | Comparative study of word embeddings models and their usage in Arabic language applications | |
Lytvyn et al. | Analysis of the developed quantitative method for automatic attribution of scientific and technical text content written in Ukrainian | |
CN110362678A (en) | Method and apparatus for automatically extracting Chinese text keywords | |
CN109815400A (en) | Person interest extraction method based on long text | |
CN106610952A (en) | Mixed text feature word extraction method | |
CN107102985A (en) | Improved multi-threaded keyword extraction technique in documents | |
CN109344403A (en) | Document representation method with enhanced semantic feature embedding | |
CN106610949A (en) | Text feature extraction method based on semantic analysis | |
CN106446147A (en) | Emotion analysis method based on structuring features | |
CN106570120A (en) | Method for realizing search engine optimization through improved keyword optimization | |
CN110705247A (en) | Text similarity calculation method based on χ²-C | |
CN106610954A (en) | Text feature word extraction method based on statistics |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | | Application publication date: 20170426 |