CN106598940A - Text similarity solution algorithm based on global optimization of keyword quality - Google Patents

Text similarity solution algorithm based on global optimization of keyword quality

Info

Publication number
CN106598940A
CN106598940A (application CN201610939853.0A)
Authority
CN
China
Prior art keywords
word
text
similarity
weight
vocabulary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610939853.0A
Other languages
Chinese (zh)
Inventor
金平艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Yonglian Information Technology Co Ltd
Original Assignee
Sichuan Yonglian Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Yonglian Information Technology Co Ltd filed Critical Sichuan Yonglian Information Technology Co Ltd
Priority to CN201610939853.0A priority Critical patent/CN106598940A/en
Publication of CN106598940A publication Critical patent/CN106598940A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention provides a text similarity solution algorithm based on global optimization of keyword quality. The algorithm performs word segmentation and stop-word removal on a text; comprehensively considers factors such as the weight, density, depth, part of speech, word position, and core-vocabulary relevancy of each keyword in the text library; treats each keyword (the formula is as shown in the specification) as a multi-dimensional vector; applies constraint conditions to reduce the dimensionality of the keyword set in the multi-dimensional space; extracts and computes the similarity (the formula is as shown in the specification) of the two highest-weight keywords in the two texts' keyword sets; and sets a threshold condition to extract the feature vocabulary vectors of the two texts. Compared with the traditional term frequency-inverse document frequency method, the algorithm is more accurate, and it overcomes the drawback that the information gain method can extract features for only one category. Its constraint conditions are precise enough to compute the contribution of different vocabularies to the texts accurately, so the individual quality of each keyword is guaranteed while the overall quality of the keyword set is globally optimized, yielding higher text-to-text similarity accuracy.

Description

Text similarity derivation algorithm based on global optimization of keyword quality
Technical field
The present invention relates to the field of Semantic Web technology, and in particular to a text similarity derivation algorithm based on global optimization of keyword quality.
Background technology
Text similarity computation is applied in many fields, such as text classification, text clustering, information retrieval, question answering systems, and duplicate web page removal. At present, many similarity calculation methods have been proposed and put into practice in different fields, for example the vector space model, the Boolean model, the latent semantic indexing statistical model, string matching models, and models based on semantic understanding. More and more experts and scholars now study text similarity calculation, because effective computation of text similarity improves retrieval efficiency, helps detect plagiarized articles, and saves storage space. Many problems in the field remain to be solved; in particular, research on Chinese text similarity has not yet developed to a satisfactory degree. For natural language understanding by computer, Chinese is harder to process than English: unlike English, Chinese has no explicit separators between words, a single meaning is expressed by several consecutive characters, and differences in context easily cause ambiguity. To improve the effectiveness and accuracy of text similarity calculation, the present invention provides a text similarity derivation algorithm based on global optimization of keyword quality.
The content of the invention
To address the insufficient effectiveness and accuracy of text similarity calculation, the invention provides a text similarity derivation algorithm based on global optimization of keyword quality.
To solve the above problems, the present invention is achieved by the following technical solution:
Step 1: Perform word segmentation on the two texts (W_1, W_2) using a Chinese word segmentation method.
Step 2: Remove stop words from the text vocabulary according to a stop-word list.
Step 3: Denote the text keyword set after stop-word removal as C = (C_1, C_2, ..., C_n); the contribution of each keyword C_i to the text is regarded as a multi-dimensional vector.
Step 4: Using constraint conditions, reduce the dimensionality of the keyword feature set in the multi-dimensional space, and finally extract the optimized text keyword sets J_1 and J_2.
Step 5: Calculate the similarity between the two highest-weight words in the two keyword sets J_1 and J_2.
Step 6: Solve the pairwise similarity between vocabularies in the two keyword sets J_1 and J_2, set a threshold on the inter-vocabulary similarity, and calculate the similarity sim(W_1, W_2) between the two texts from the number of vocabularies that satisfy the condition.
The present invention has the following advantages:
1. The method obtains a feature vocabulary set with higher accuracy than the traditional term frequency-inverse document frequency method.
2. The method overcomes the shortcoming that the information gain method is only suitable for extracting the text features of a single category.
3. The algorithm has greater practical value.
4. The method accurately calculates the contribution of different vocabularies to the text theme within the feature vocabulary.
5. The method applies stricter conditions than previous methods, so the results obtained are more precise.
6. It not only guarantees the individual quality of each keyword, but also optimizes the overall quality of the keyword set globally.
7. The similarity result between texts is more accurate and agrees better with empirical values.
Description of the drawings
Fig. 1: structural flow chart of the text similarity derivation algorithm based on global optimization of keyword quality.
Fig. 2: illustration of the n-gram segmentation method.
Fig. 3: flow chart of the Chinese text preprocessing process.
Specific embodiment
To address the insufficient effectiveness and accuracy of text similarity calculation, the present invention is described in detail with reference to Fig. 1. Its specific implementation steps are as follows:
Step 1: Perform word segmentation on the two texts (W_1, W_2) using a Chinese word segmentation method. The concrete segmentation process is as follows:
Step 1.1: Using the segmentation dictionary, find the words in the sentence to be segmented that match entries in the dictionary. The Chinese character string to be segmented is scanned completely once, looking up matches in the system's dictionary; every word found in the dictionary is marked. If there is no match in the dictionary, a single character is simply split off as a word. This continues until the character string is empty.
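Step 1.1 can be sketched as a forward maximum-matching scan. The following minimal Python sketch is illustrative only: the toy dictionary stands in for the patent's segmentation dictionary, and the `max_word_len` cap is an assumption, not part of the specification.

```python
def forward_max_match(sentence, dictionary, max_word_len=4):
    """Scan the string once; at each position take the longest dictionary
    match, falling back to a single character, until the string is empty."""
    words = []
    i = 0
    while i < len(sentence):
        match = sentence[i]  # fallback: split off a single character as a word
        for length in range(min(max_word_len, len(sentence) - i), 1, -1):
            candidate = sentence[i:i + length]
            if candidate in dictionary:
                match = candidate  # longest dictionary word wins
                break
        words.append(match)
        i += len(match)
    return words
```

For example, `forward_max_match("中文分词算法", {"中文", "分词", "算法"})` yields `["中文", "分词", "算法"]`.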
Step 1.2: According to probability statistics, the sentence to be segmented is expanded into a network (lattice) structure, yielding the n possible sentence structures that can be combined. The sequential nodes of this structure are defined in order as S, M_1, M_2, M_3, M_4, M_5, E; its structure is shown in Fig. 2.
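The lattice of step 1.2 amounts to enumerating every candidate path from S to E. The recursive enumeration below is a hedged sketch of that idea, not the patent's own implementation; `all_segmentations` and `max_word_len` are illustrative names.

```python
def all_segmentations(sentence, dictionary, max_word_len=4):
    """Enumerate every path S -> M1 -> ... -> E through the word lattice:
    each path is one candidate segmentation (dictionary words or single chars)."""
    if not sentence:
        return [[]]  # one empty path: the sentence is fully consumed
    paths = []
    for length in range(1, min(max_word_len, len(sentence)) + 1):
        head = sentence[:length]
        # a single character is always a legal node; longer spans must be words
        if length == 1 or head in dictionary:
            for rest in all_segmentations(sentence[length:], dictionary, max_word_len):
                paths.append([head] + rest)
    return paths
```

For a two-character string with one dictionary word, `all_segmentations("ab", {"ab"})` produces the two candidate paths `[["a", "b"], ["ab"]]`.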
Step 1.3: Based on information theory, assign a weight to each edge of the above network structure. The concrete calculation process is as follows:
From the segmentation dictionary, the matched dictionary words and the unmatched single characters are known; the i-th path contains n_i words, so the set of word counts over the n paths is (n_1, n_2, ..., n_n).
Obtain min(·) = min(n_1, n_2, ..., n_n).
For the remaining (n - m) retained paths, solve the weight of every pair of adjacent words on each path.
In the statistical corpus, calculate the information amount X(C_i) of each word, then solve the co-occurrence information amount X(C_i, C_{i+1}) of adjacent words on a path. The following formulas hold:
X(C_i) = |x(C_i)_1 - x(C_i)_2|
where x(C_i)_1 is the information amount of the word C_i in the text corpus and x(C_i)_2 is the information amount of the texts containing C_i.
x(C_i)_1 = -p(C_i)_1 ln p(C_i)_1
where p(C_i)_1 is the probability of C_i in the text corpus and n is the number of corpus texts containing C_i.
x(C_i)_2 = -p(C_i)_2 ln p(C_i)_2
where p(C_i)_2 is the probability value over the number of texts containing C_i, and N is the total number of texts in the statistical corpus.
Similarly, X(C_i, C_{i+1}) = |x(C_i, C_{i+1})_1 - x(C_i, C_{i+1})_2|
where x(C_i, C_{i+1})_1 is the co-occurrence information amount of the adjacent words (C_i, C_{i+1}) in the text corpus and x(C_i, C_{i+1})_2 is the information amount of the texts in which (C_i, C_{i+1}) co-occur.
Likewise, x(C_i, C_{i+1})_1 = -p(C_i, C_{i+1})_1 ln p(C_i, C_{i+1})_1
where p(C_i, C_{i+1})_1 is the co-occurrence probability of (C_i, C_{i+1}) in the text corpus and m is the number of texts in the library in which (C_i, C_{i+1}) co-occur.
x(C_i, C_{i+1})_2 = -p(C_i, C_{i+1})_2 ln p(C_i, C_{i+1})_2
where p(C_i, C_{i+1})_2 is the probability over the number of texts in which the adjacent words (C_i, C_{i+1}) co-occur.
In summary, the weight of each pair of adjacent words on a path is
w(C_i, C_{i+1}) = X(C_i) + X(C_{i+1}) - 2X(C_i, C_{i+1})
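The information amounts and edge weight of step 1.3 follow directly from the formulas above. In this hedged Python sketch, `info_amount` and `edge_weight` are hypothetical helper names, and the probability arguments are assumed to be precomputed from the corpus.

```python
import math

def info_amount(p1, p2):
    """X(C) = |x1 - x2| with x_k = -p_k ln p_k.
    p1: probability of the word (or word pair) in the corpus;
    p2: probability over the number of texts containing it."""
    x1 = -p1 * math.log(p1) if p1 > 0 else 0.0
    x2 = -p2 * math.log(p2) if p2 > 0 else 0.0
    return abs(x1 - x2)

def edge_weight(p_i, pt_i, p_j, pt_j, p_ij, pt_ij):
    """w(Ci, Ci+1) = X(Ci) + X(Ci+1) - 2 * X(Ci, Ci+1)."""
    return info_amount(p_i, pt_i) + info_amount(p_j, pt_j) \
        - 2 * info_amount(p_ij, pt_ij)
```

Note that when the two probabilities coincide, the two entropy terms cancel and the information amount is zero.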
Step 1.4: Find the path of maximum weight, which gives the segmentation result of the sentence. The concrete calculation process is as follows:
There are n paths of different lengths; assume the path set is (L_1, L_2, ..., L_n).
Using the minimum-word-count operation on the paths, eliminate m paths (m < n), leaving (n - m) paths; denote their path set accordingly.
The weight of each remaining path is then computed from the weights of its edges, obtained one by one as in step 1.3, together with the path length L_{S_j} of the j-th remaining path.
The path of maximum weight is selected as the segmentation.
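The selection of step 1.4 might look as follows. Because the per-path weight formula is not reproduced in the source text, dividing the edge-weight sum by the number of words in the path is one plausible reading of the normalization by L_{S_j}, not the patent's exact formula; `best_path` is a hypothetical name.

```python
def best_path(paths, edge_weight_fn):
    """Score each remaining candidate path by its adjacent-edge weights
    (computed as in step 1.3) and keep the path of maximum weight."""
    def score(path):
        if len(path) < 2:
            return 0.0  # a single-word path has no adjacent edges
        total = sum(edge_weight_fn(a, b) for a, b in zip(path, path[1:]))
        return total / len(path)  # assumed normalization by path length
    return max(paths, key=score)
```

With a constant positive edge weight, the multi-word path outscores a single-word path, as expected of a sum over adjacent edges.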
Step 2: Remove stop words from the text vocabulary according to the stop-word list. Details are as follows:
Stop words are words that occur frequently in a text but contribute little to characterizing it. The stop-word removal process compares each feature item with the words in the stop-word list and deletes the feature entry if it matches.
Combining segmentation and stop-word deletion, the Chinese text preprocessing flow is shown in Fig. 3.
Step 3: Denote the text keyword set after stop-word removal as C = (C_1, C_2, ..., C_n); the contribution of each keyword C_i to the text is regarded as a multi-dimensional object vector.
The concrete calculation process is as follows:
Object vector: (the formula is as shown in the specification)
where f_i is the weighting function of keyword C_i in the text library:
In the above formula, n_{C_i} is the number of times the keyword occurs in the text, L_total is the total length of the text, N_total is the total number of texts in the library, I_j(C_i) is the information amount of keyword C_i in the j-th text of the library, and Ī(C_i) is its average information amount in the library,
Ī(C_i) = (1/n) Σ_{j=1}^{n} I_j(C_i), with n the number of library texts containing C_i.
In the object vector, s_i is the maximum depth of the ontology concept corresponding to keyword C_i in the ontology network structure;
m_i is the maximum density of the ontology concept corresponding to keyword C_i in the ontology network structure;
x_i is the part-of-speech weight of keyword C_i; from experience, the weights of nouns, verbs, adjectives, and adverbs are β_1, β_2, β_3, and β_4 in turn, with β_1 > β_2 > β_3 > β_4;
w_{i1} is the position weight of the first occurrence of keyword C_i in the text; a series of position weight values (α_1, α_2, ..., α_n) can be drawn from statistical investigation, with the weight in the title largest and the first paragraph second;
n_1 is the number of times keyword C_i occurs in the paragraph of its first appearance in the text;
and the final component is the co-occurrence probability, within the text, of keyword C_i with the maximum-weight vocabulary.
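The object vector of step 3 can be captured as a small record type. This is a hedged sketch: the field names are illustrative, and the weighting function f_i is assumed to be precomputed by the formula in the specification rather than derived here.

```python
from dataclasses import dataclass

@dataclass
class KeywordVector:
    """One keyword's multi-dimensional contribution vector (step 3)."""
    word: str
    f: float     # f_i: weighting function of C_i in the text library
    s: float     # s_i: maximum ontology-concept depth
    m: float     # m_i: maximum ontology-concept density
    x: float     # x_i: part-of-speech weight (beta_1 > beta_2 > beta_3 > beta_4)
    w1: float    # w_i1: position weight of the first occurrence
    n1: int      # n_1: occurrences in the paragraph of first appearance
    p_co: float  # co-occurrence probability with the max-weight vocabulary

    def as_tuple(self):
        return (self.f, self.s, self.m, self.x, self.w1, self.n1, self.p_co)
```

A keyword is then a point in a seven-dimensional space, ready for the dimension reduction of step 4.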
Step 4: Using constraint conditions, perform dimension reduction on the keyword feature set in the multi-dimensional space, and finally extract the optimized text keyword sets J_1 and J_2. The concrete calculation process is as follows:
The above key vocabularies are mapped into the multi-dimensional space, and keywords satisfying the following constraints are merged successively from large to small:
distance between two points: d < γ
similarity between two points: greater than α
where γ and α are thresholds set by experts; suitable values can be drawn from experimental testing.
The keyword and weight after merging are those of the key vocabulary corresponding to the maximum point in the space; the weight is likewise the weight of that key vocabulary.
The merged keywords are sorted from large to small according to the above constraints, and the high-quality key vocabulary feature set is extracted using a further constraint:
the number of extracted vocabularies is capped by a threshold k, so that the top k high-quality keywords are extracted and redundancy is avoided.
Finally, the high-quality key vocabulary feature sets, arranged by weight from large to small, are J_1 and J_2 respectively.
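Step 4's merge-then-extract procedure might be sketched as follows. `distance_fn` and `similarity_fn` stand in for the constraint formulas (not reproduced in the source text), and absorbing a weaker keyword into a nearby stronger one is one reading of "the keyword after merging is the maximum point"; treat the whole function as an assumption-laden sketch.

```python
def merge_and_extract(vectors, gamma, alpha, k, distance_fn, similarity_fn):
    """Merge keyword points whose distance is below gamma and whose
    similarity exceeds alpha (expert-set thresholds), keeping the
    higher-weight keyword of each merged pair; then keep the top-k.
    `vectors` is a list of (word, weight, point) triples."""
    kept = []
    for word, weight, point in sorted(vectors, key=lambda t: t[1], reverse=True):
        absorbed = False
        for _, _, kp in kept:
            if distance_fn(point, kp) < gamma and similarity_fn(point, kp) > alpha:
                absorbed = True  # merged into a stronger, nearby keyword
                break
        if not absorbed:
            kept.append((word, weight, point))
    return kept[:k]  # top k high-quality keywords, avoiding redundancy
```

A Euclidean distance and any monotone similarity can be plugged in for experimentation with γ, α, and k.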
Step 5: Calculate the similarity between the two highest-weight words in the two keyword sets J_1 and J_2. The concrete calculation process is as follows:
Here each keyword is described by the vector factors defined above, and the similarity between the two highest-weight words is computed with the formula given in the specification,
where A and B are the comprehensive weight allocation proportions of, respectively, the concept depth and density elements and the other elements of the vector, generally with A > B and A + B = 1,
and a distance factor between the two vectors is included.
This yields the similarity of the two highest-weight words.
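Because the similarity formula of step 5 is not reproduced in the source text, the sketch below only follows its description: an A-weighted depth/density term plus a B-weighted term over the remaining vector elements, with A > B and A + B = 1. The reciprocal-distance factors are illustrative assumptions, not the patent's formula. Vectors are tuples (f, s, m, x, w1, n1, p_co) as in step 3.

```python
def keyword_similarity(v1, v2, A=0.6, B=0.4):
    """Weighted similarity of two keyword vectors: A weights the concept
    depth/density part, B the remaining elements (A > B, A + B = 1)."""
    f1, s1, m1, x1, w1, n1, p1 = v1
    f2, s2, m2, x2, w2, n2, p2 = v2
    # distance-factor style terms: 1 when identical, decaying with distance
    depth_density = 1.0 / (1.0 + abs(s1 - s2) + abs(m1 - m2))
    rest = 1.0 / (1.0 + abs(f1 - f2) + abs(x1 - x2) + abs(w1 - w2)
                  + abs(n1 - n2) + abs(p1 - p2))
    return A * depth_density + B * rest
```

Identical vectors score exactly A + B = 1, and any difference in a component lowers the corresponding term.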
Step 6: Solve the pairwise similarity between vocabularies in the two keyword sets J_1 and J_2, set a threshold on the inter-vocabulary similarity, and calculate the similarity sim(W_1, W_2) between the two texts from the number of vocabularies satisfying the condition. The concrete calculation process is as follows:
Writing the keyword sets in matrix form and proceeding as in step 5, the pairwise similarity between vocabularies in the two keyword sets can be calculated, giving a similarity matrix.
The vocabularies satisfying the following threshold condition are extracted:
a threshold C on the inter-vocabulary similarity is set, as follows:
for j ∈ (2, 3, ..., k), with C the expert-set threshold, the count n' is increased by 1 whenever the condition is met; the initial value of n' is 0, and C can be iterated out by experiment.
In summary, the similarity between the two texts is sim(W_1, W_2),
where x and y are weights with x > y and x + y = 1, n' is the number of keywords satisfying the threshold condition, and k is the number of high-quality keywords after optimization.
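The final combination of step 6 can be sketched as follows. The exact formula is not reproduced in the source, so combining the qualifying-count ratio n'/k with the mean qualifying similarity under weights x > y, x + y = 1 is one plausible reading, offered only as an assumption.

```python
def text_similarity(sim_matrix, C, k, x=0.6, y=0.4):
    """Count similarity-matrix entries above the expert-set threshold C (n'),
    then combine the coverage n'/k with the mean qualifying similarity
    using weights x > y with x + y = 1."""
    qualifying = [s for row in sim_matrix for s in row if s > C]
    n_prime = len(qualifying)
    if n_prime == 0:
        return 0.0  # no keyword pair clears the threshold
    coverage = min(n_prime / k, 1.0)  # clamp: up to k*k entries may qualify
    avg_sim = sum(qualifying) / n_prime
    return x * coverage + y * avg_sim
```

For a 2x2 matrix [[0.9, 0.1], [0.2, 0.95]] with C = 0.8 and k = 2, two entries qualify, so the score is 0.6 · 1.0 + 0.4 · 0.925 = 0.97.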

Claims (3)

1. A text similarity derivation algorithm based on global optimization of keyword quality, relating to the field of Semantic Web technology and in particular to a text similarity derivation algorithm based on global optimization of keyword quality, characterized by comprising the following steps:
Step 1: Perform word segmentation on the two texts using a Chinese word segmentation method. The concrete segmentation process is as follows:
Step 1.1: Using the segmentation dictionary, find the words in the sentence to be segmented that match entries in the dictionary. The character string to be segmented is scanned completely once, looking up matches in the system's dictionary; every word found in the dictionary is marked. If there is no match in the dictionary, a single character is simply split off as a word, until the character string is empty.
Step 1.2: According to probability statistics, the sentence to be segmented is expanded into a network structure, yielding the n possible sentence structures; the sequential nodes of this structure are defined in order, and its structure is shown in Fig. 2.
Step 1.3: Based on information theory, assign a weight to each edge of the above network structure. The concrete calculation process is as follows:
From the segmentation dictionary, the matched dictionary words and the unmatched single characters are known; the i-th path contains a certain number of words, giving the set of word counts over the n paths.
For the remaining (n - m) retained paths, solve the weight of every pair of adjacent words on each path.
In the statistical corpus, calculate the information amount of each word, then solve the co-occurrence information amount of adjacent words on a path, using the formulas of the description,
where the first term is the information amount of the word in the text corpus and the second is the information amount of the texts containing the word;
the corresponding probability is that of the word in the text corpus, n being the number of corpus texts containing the word;
the second probability is taken over the number of texts containing the word, N being the total number of texts in the statistical corpus.
Likewise for the adjacent-word pair: its co-occurrence information amount in the corpus and the information amount of the texts in which it co-occurs,
with its co-occurrence probability in the corpus, m being the number of texts in the library in which the pair co-occurs,
and the probability over the number of texts in which the adjacent pair co-occurs.
In summary, the weight of each pair of adjacent words on a path is obtained.
Step 1.4: Find the path of maximum weight, which gives the segmentation result of the sentence. The concrete calculation process is as follows:
There are n paths of different lengths; assume the path set is given.
Using the minimum-word-count operation on the paths, eliminate m paths (m < n), leaving (n - m) paths with their path-length set.
The weight of each remaining path is then computed from the weights of its edges, obtained one by one as in step 1.3, together with the length of the corresponding path.
The path of maximum weight is selected.
Step 2: Remove stop words from the text vocabulary according to the stop-word list. Details are as follows:
Stop words are words that occur frequently in a text but contribute little to characterizing it. The stop-word removal process compares each feature item with the words in the stop-word list and deletes the feature entry if it matches.
Combining segmentation and stop-word deletion, the Chinese text preprocessing flow is shown in Fig. 3.
Step 3: The text keyword set after stop-word removal is obtained; the contribution of each keyword to the text is regarded as a multi-dimensional object vector. The concrete calculation process is as follows:
Object vector: as given in the description,
where the first component is the keyword's weighting function in the text library, defined in terms of the number of times the keyword occurs in the text, the total length of the text, the total number of texts in the library, the information amount of the keyword in each text of the library, and its average information amount in the library.
The further components of the object vector are: the maximum depth of the corresponding ontology concept in the ontology network structure; the maximum density of the corresponding ontology concept in the ontology network structure; the part-of-speech weight, where from experience the weights of nouns, verbs, adjectives, and adverbs decrease in turn; the position weight of the keyword's first occurrence in the text, where a series of position weight values can be drawn from statistical investigation, the weight in the title being largest and the first paragraph second; the number of times the keyword occurs in the paragraph of its first appearance; and the co-occurrence probability, within the text, of the keyword with the maximum-weight vocabulary.
Step 4: Using constraint conditions, perform dimension reduction on the keyword feature set in the multi-dimensional space, and finally extract the optimized text keyword sets. The concrete calculation process is as follows:
The above key vocabularies are mapped into the multi-dimensional space, and keywords satisfying the following constraints are merged successively from large to small:
the distance d between two points is below its threshold, and
the similarity between two points satisfies its constraint,
where the thresholds are set by experts and suitable values can be drawn from experimental testing.
The keyword and weight after merging are those of the key vocabulary corresponding to the maximum point in the space, the weight likewise being that of the key vocabulary.
The merged keywords are sorted from large to small according to the above constraints, and the high-quality key vocabulary feature set is extracted using a further constraint: the number of extracted vocabularies is capped by a threshold k, so that the top k high-quality keywords are extracted and redundancy is avoided; finally the high-quality key vocabulary feature sets arranged by weight from large to small are obtained.
Step 5: Calculate the similarity between the two highest-weight words in the two keyword sets.
Step 6: Solve the pairwise similarity between vocabularies in the two keyword sets, set a threshold on the inter-vocabulary similarity, and calculate the similarity between the two texts from the number of vocabularies satisfying the condition.
2. The text similarity derivation algorithm based on global optimization of keyword quality according to claim 1, characterized in that the concrete calculation process of step 5 is as follows:
Step 5: Calculate the similarity between the two highest-weight words in the two keyword sets. Here each keyword is described by the vector factors defined above, and the similarity between the two highest-weight words is computed with the formula of the description,
where A and B are the comprehensive weight allocation proportions of, respectively, the concept depth and density elements and the other elements of the vector, generally with A > B and A + B = 1, and a distance factor between the two vectors is included;
this yields the similarity of the two highest-weight words.
3. The text similarity derivation algorithm based on global optimization of keyword quality according to claim 1, characterized in that the concrete calculation process of step 6 is as follows:
Step 6: Solve the pairwise similarity between vocabularies in the two keyword sets, set a threshold on the inter-vocabulary similarity, and calculate the similarity between the two texts from the number of vocabularies satisfying the condition. The concrete calculation process is as follows:
Writing the keyword sets in matrix form and proceeding as in step 5, the pairwise similarity between vocabularies in the two keyword sets can be calculated, giving a similarity matrix.
The vocabularies satisfying the following threshold condition are extracted: a threshold C on the inter-vocabulary similarity is set by experts; whenever the condition is met, the count n' is increased by 1, with the initial value of n' being 0 and C iterated out by experiment.
In summary, the similarity between the two texts is obtained,
where x and y are weights with x > y and x + y = 1, n' is the number of keywords satisfying the threshold condition, and k is the number of high-quality keywords after optimization.
CN201610939853.0A 2016-11-01 2016-11-01 Text similarity solution algorithm based on global optimization of keyword quality Pending CN106598940A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610939853.0A CN106598940A (en) 2016-11-01 2016-11-01 Text similarity solution algorithm based on global optimization of keyword quality

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610939853.0A CN106598940A (en) 2016-11-01 2016-11-01 Text similarity solution algorithm based on global optimization of keyword quality

Publications (1)

Publication Number Publication Date
CN106598940A true CN106598940A (en) 2017-04-26

Family

ID=58589621

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610939853.0A Pending CN106598940A (en) 2016-11-01 2016-11-01 Text similarity solution algorithm based on global optimization of keyword quality

Country Status (1)

Country Link
CN (1) CN106598940A (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107357916A (en) * 2017-07-19 2017-11-17 北京金堤科技有限公司 Data processing method and system
CN108228546A (en) * 2018-01-19 2018-06-29 北京中关村科金技术有限公司 A kind of text feature, device, equipment and readable storage medium storing program for executing
CN108804512A (en) * 2018-04-20 2018-11-13 平安科技(深圳)有限公司 Generating means, method and the computer readable storage medium of textual classification model
CN108829799A (en) * 2018-06-05 2018-11-16 中国人民公安大学 Based on the Text similarity computing method and system for improving LDA topic model
CN109086262A (en) * 2017-06-14 2018-12-25 财团法人资讯工业策进会 Lexical analysis device, method and its computer storage medium
CN109977196A (en) * 2019-03-29 2019-07-05 云南电网有限责任公司电力科学研究院 A kind of detection method and device of magnanimity document similarity
CN110245118A (en) * 2019-06-27 2019-09-17 重庆市筑智建信息技术有限公司 BIM data information three-dimensional gridding retrieval filing method and filing system thereof
CN110956039A (en) * 2019-12-04 2020-04-03 中国太平洋保险(集团)股份有限公司 Text similarity calculation method and device based on multi-dimensional vectorization coding
CN111353301A (en) * 2020-02-24 2020-06-30 成都网安科技发展有限公司 Auxiliary secret fixing method and device
CN111625468A (en) * 2020-06-05 2020-09-04 中国银行股份有限公司 Test case duplicate removal method and device
CN111898380A (en) * 2020-08-17 2020-11-06 上海熙满网络科技有限公司 Text matching method and device, electronic equipment and storage medium
CN112035621A (en) * 2020-09-03 2020-12-04 江苏经贸职业技术学院 Enterprise name similarity detection method based on statistics
CN112256843A (en) * 2020-12-22 2021-01-22 华东交通大学 News keyword extraction method and system based on TF-IDF method optimization
CN112348535A (en) * 2020-11-04 2021-02-09 新华中经信用管理有限公司 Traceability application method and system based on block chain technology
CN113392637A (en) * 2021-06-24 2021-09-14 青岛科技大学 TF-IDF-based subject term extraction method, device, equipment and storage medium
CN113743090A (en) * 2021-09-08 2021-12-03 度小满科技(北京)有限公司 Keyword extraction method and device
CN113836942A (en) * 2021-02-08 2021-12-24 宏龙科技(杭州)有限公司 Text matching method based on hidden keywords
CN116244496A (en) * 2022-12-06 2023-06-09 山东紫菜云数字科技有限公司 Resource recommendation method based on industrial chain
CN117669513A (en) * 2024-01-30 2024-03-08 江苏古卓科技有限公司 Data management system and method based on artificial intelligence

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101937471A (en) * 2010-09-21 2011-01-05 上海大学 Multidimensional space evaluation method of keyword extraction algorithm
CN103617157A (en) * 2013-12-10 2014-03-05 东北师范大学 Text similarity calculation method based on semantics
CN103970733A (en) * 2014-04-10 2014-08-06 北京大学 New Chinese word recognition method based on graph structure
CN104239436A (en) * 2014-08-27 2014-12-24 南京邮电大学 Network hot event detection method based on text classification and clustering analysis

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101937471A (en) * 2010-09-21 2011-01-05 上海大学 Multidimensional space evaluation method of keyword extraction algorithm
CN103617157A (en) * 2013-12-10 2014-03-05 东北师范大学 Text similarity calculation method based on semantics
CN103970733A (en) * 2014-04-10 2014-08-06 北京大学 New Chinese word recognition method based on graph structure
CN104239436A (en) * 2014-08-27 2014-12-24 南京邮电大学 Network hot event detection method based on text classification and clustering analysis

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BECK_ZHOU: "Chinese word segmentation language models and dynamic programming", CSDN blog, HTTPS://BLOG.CSDN.BET/ZHOUBL668/ARTICLE/DETAILS/68964 *
蒋健洪 et al.: "Research and application of a Chinese word segmentation model combining dictionary and statistical methods", Computer Engineering and Design *

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109086262A (en) * 2017-06-14 2018-12-25 财团法人资讯工业策进会 Lexical analysis device, method and its computer storage medium
CN107357916A (en) * 2017-07-19 2017-11-17 北京金堤科技有限公司 Data processing method and system
CN108228546A (en) * 2018-01-19 2018-06-29 北京中关村科金技术有限公司 A kind of text feature, device, equipment and readable storage medium storing program for executing
CN108804512B (en) * 2018-04-20 2020-11-24 平安科技(深圳)有限公司 Text classification model generation device and method and computer readable storage medium
CN108804512A (en) * 2018-04-20 2018-11-13 平安科技(深圳)有限公司 Generating means, method and the computer readable storage medium of textual classification model
CN108829799A (en) * 2018-06-05 2018-11-16 中国人民公安大学 Based on the Text similarity computing method and system for improving LDA topic model
CN109977196A (en) * 2019-03-29 2019-07-05 云南电网有限责任公司电力科学研究院 A kind of detection method and device of magnanimity document similarity
CN110245118A (en) * 2019-06-27 2019-09-17 重庆市筑智建信息技术有限公司 BIM data information three-dimensional gridding retrieval filing method and filing system thereof
CN110245118B (en) * 2019-06-27 2021-05-14 重庆市筑智建信息技术有限公司 BIM data information three-dimensional gridding retrieval filing method and filing system thereof
CN110956039A (en) * 2019-12-04 2020-04-03 中国太平洋保险(集团)股份有限公司 Text similarity calculation method and device based on multi-dimensional vectorization coding
CN111353301A (en) * 2020-02-24 2020-06-30 成都网安科技发展有限公司 Auxiliary secret fixing method and device
CN111625468A (en) * 2020-06-05 2020-09-04 中国银行股份有限公司 Test case duplicate removal method and device
CN111625468B (en) * 2020-06-05 2024-04-16 中国银行股份有限公司 Test case duplicate removal method and device
CN111898380A (en) * 2020-08-17 2020-11-06 上海熙满网络科技有限公司 Text matching method and device, electronic equipment and storage medium
CN112035621A (en) * 2020-09-03 2020-12-04 江苏经贸职业技术学院 Enterprise name similarity detection method based on statistics
CN112348535A (en) * 2020-11-04 2021-02-09 新华中经信用管理有限公司 Traceability application method and system based on block chain technology
CN112348535B (en) * 2020-11-04 2023-09-12 新华中经信用管理有限公司 Traceability application method and system based on blockchain technology
CN112256843A (en) * 2020-12-22 2021-01-22 华东交通大学 News keyword extraction method and system based on TF-IDF method optimization
CN113836942A (en) * 2021-02-08 2021-12-24 宏龙科技(杭州)有限公司 Text matching method based on hidden keywords
CN113836942B (en) * 2021-02-08 2022-09-20 宏龙科技(杭州)有限公司 Text matching method based on hidden keywords
CN113392637A (en) * 2021-06-24 2021-09-14 青岛科技大学 TF-IDF-based subject term extraction method, device, equipment and storage medium
CN113743090A (en) * 2021-09-08 2021-12-03 度小满科技(北京)有限公司 Keyword extraction method and device
CN113743090B (en) * 2021-09-08 2024-04-12 度小满科技(北京)有限公司 Keyword extraction method and device
CN116244496A (en) * 2022-12-06 2023-06-09 山东紫菜云数字科技有限公司 Resource recommendation method based on industrial chain
CN116244496B (en) * 2022-12-06 2023-12-01 山东紫菜云数字科技有限公司 Resource recommendation method based on industrial chain
CN117669513A (en) * 2024-01-30 2024-03-08 江苏古卓科技有限公司 Data management system and method based on artificial intelligence
CN117669513B (en) * 2024-01-30 2024-04-12 江苏古卓科技有限公司 Data management system and method based on artificial intelligence

Similar Documents

Publication Publication Date Title
CN106598940A (en) Text similarity solution algorithm based on global optimization of keyword quality
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN110413986B (en) Text clustering multi-document automatic summarization method and system for improving word vector model
Dos Santos et al. Deep convolutional neural networks for sentiment analysis of short texts
CN106610951A (en) Improved text similarity solving algorithm based on semantic analysis
CN106970910B (en) Keyword extraction method and device based on graph model
CN106776562A (en) Keyword extraction method and extraction system
CN106598941A (en) Algorithm for globally optimizing quality of text keywords
CN106611041A (en) New text similarity solution method
Alwehaibi et al. A study of the performance of embedding methods for Arabic short-text sentiment analysis using deep learning approaches
CN107423282A (en) Semantic Coherence Sexual Themes and the concurrent extracting method of term vector in text based on composite character
CN106528621A (en) Improved density text clustering algorithm
CN106570112A (en) Improved ant colony algorithm-based text clustering realization method
Suleiman et al. Comparative study of word embeddings models and their usage in Arabic language applications
Lytvyn et al. Analysis of the developed quantitative method for automatic attribution of scientific and technical text content written in Ukrainian
CN110362678A (en) Method and apparatus for automatically extracting Chinese text keywords
CN109815400A (en) Person's interest extraction method based on long text
CN106610952A (en) Mixed text feature word extraction method
CN107102985A (en) Multi-threaded keyword extraction techniques in improved document
CN109344403A (en) Document representation method with enhanced semantic feature embedding
CN106610949A (en) Text feature extraction method based on semantic analysis
CN106446147A (en) Emotion analysis method based on structuring features
CN106570120A (en) Method for realizing search engine optimization through improved keyword optimization
CN110705247A (en) Text similarity calculation method based on χ²-C
CN106610954A (en) Text feature word extraction method based on statistics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170426
