CN106610951A - Improved text similarity solving algorithm based on semantic analysis - Google Patents

Improved text similarity solving algorithm based on semantic analysis

Info

Publication number
CN106610951A
CN106610951A (application CN201610864853.9A)
Authority
CN
China
Prior art keywords
text
word
information
similarity
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610864853.9A
Other languages
Chinese (zh)
Inventor
金平艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Yonglian Information Technology Co Ltd
Original Assignee
Sichuan Yonglian Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Yonglian Information Technology Co Ltd filed Critical Sichuan Yonglian Information Technology Co Ltd
Priority to CN201610864853.9A priority Critical patent/CN106610951A/en
Publication of CN106610951A publication Critical patent/CN106610951A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/284 - Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an improved text similarity solving algorithm based on semantic analysis. The algorithm performs word segmentation and stop-word removal on two texts; computes the weight of each word in its text with an improved information-theoretic method; obtains additional weights from each word's position and part of speech; constructs, from these three factors, an objective function for extracting the feature words of each text; and finally reduces the dimensionality of the two feature-word sets according to semantic similarity, yielding two feature-word vectors from which the text similarity sim(W1, W2) between the texts (W1, W2) is computed with the Pearson correlation coefficient. Compared with traditional text similarity computation methods, the algorithm is more accurate, more widely applicable, and of higher practical value: it accurately computes the contribution of different words to the idea of a text, handles polysemy and synonymy, agrees more closely with empirical judgments, and provides a good theoretical basis for subsequent text clustering.

Description

Improved text similarity solving algorithm based on semantic analysis
Technical field
The present invention relates to the field of semantic web technology, and in particular to an improved text similarity solving algorithm based on semantic analysis.
Background technology
At present, there are two main approaches to computing text similarity: methods based on mathematical statistics and methods based on semantic analysis. Statistical methods compute similarity from word forms and word frequencies, while semantic analysis exploits the inherent semantic relations among the words within a text. The vector space model (VSM) is the classical method for computing text similarity, but it considers neither the semantic information of words nor the semantic links between them, so it cannot truly reflect the similarity between texts; moreover, VSM ignores the semantic status of a word within its text and the word's contribution to expressing the text's central idea. Computing text similarity with the vector space model is therefore flawed. To improve the accuracy of text similarity computation and to address polysemy (one word with many senses) and synonymy (one sense expressed by many words), the present invention provides an improved text similarity solving algorithm based on semantic analysis.
Content of the invention
Addressing the differing importance of different feature words to a text, the problems of polysemy and synonymy, and the accuracy of text similarity computation, the invention provides an improved text similarity solving algorithm based on semantic analysis.
To solve the above problems, the present invention is achieved through the following technical scheme:
Step 1: Initialize the text corpus module and preprocess the texts to be compared (W1, W2).
Step 2: Using an information-theoretic method, compute each word's weight w_i in the text.
Step 3: From each word's position information and part of speech, compute its position-and-POS weight in the text.
Step 4: Considering the above three factors together, construct the objective function for extracting the feature words of texts (W1, W2), and extract the feature words of each text.
Step 5: Using the word similarity sim(c_1i, c_1(i+1)), reduce the dimensionality of the feature-word sets obtained above.
Step 6: Compute the text similarity sim(W1, W2) between the texts to be compared (W1, W2) according to the Pearson correlation coefficient.
The present invention has the following advantages:
1. The method is more accurate than traditional text similarity computation methods and agrees better with manually extracted results.
2. The method is better suited to fields such as information retrieval, machine translation, and automatic question answering.
3. The algorithm has greater practical value.
4. The method precisely computes the contribution of each feature word to the text's idea.
5. The contributions of different feature words to the text's idea are computed with higher accuracy.
6. The method provides a good theoretical basis for subsequent text clustering.
7. The method handles the problems of polysemy and synonymy.
8. The method computes the similarity between two texts from the angle of semantic analysis, which agrees better with empirical judgments.
Description of the drawings
Fig. 1: Structural flow chart of the improved text similarity solving algorithm based on semantic analysis.
Fig. 2: Flow chart of Chinese text preprocessing.
Fig. 3: Diagram of the n-gram segmentation method.
Specific embodiment
To solve the differing importance of different feature words to a text, the problems of polysemy and synonymy, and the accuracy of text similarity computation, the present invention is described in detail with reference to Figs. 1-3. Its specific implementation steps are as follows:
Step 1: Initialize the text corpus module and preprocess the texts to be compared (W1, W2). The specific process is as follows:
Combine word segmentation with stop-word removal; the Chinese text preprocessing flow is shown in Fig. 2.
The segmentation method used here is a Chinese word segmentation algorithm based on information theory. Its segmentation and stop-word removal steps are as follows:
Step 1.1: Remove stop words from the texts (W1, W2) using a stop-word list.
Step 1.2: Using the segmentation dictionary, find the words in the sentence to be segmented that match dictionary entries. Specifically:
Scan the character string to be segmented from start to end, looking up matches in the system dictionary; whenever a dictionary word is encountered, it is marked as identified; characters with no dictionary match are split off as single-character words; this continues until the character string is exhausted.
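As an illustration, this scan can be sketched in Python; the toy dictionary and the greedy longest-first matching order are assumptions, since the text does not fix how competing matches are resolved:

def dictionary_scan(sentence, dictionary, max_word_len=4):
    """Split `sentence` into dictionary words, splitting off single
    characters whenever no dictionary entry matches."""
    tokens, i = [], 0
    while i < len(sentence):
        # try the longest candidate first, down to a single character
        for length in range(min(max_word_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens

print(dictionary_scan("研究生命起源", {"研究", "研究生", "生命", "起源"}))
# ['研究生', '命', '起源'] - greedy matching can err, which is exactly
# why Steps 1.3-1.5 weigh all candidate segmentations against each other.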
Step 1.3: According to probability statistics, expand the sentence to be segmented into a network structure, which yields the n possible sentence segmentations. Each sequence of nodes of this structure is defined in turn as S M_1 M_2 M_3 M_4 M_5 E; the structure is shown in Fig. 3.
Step 1.4: Using the information-theoretic method, assign a weight to each edge of the above network structure. The calculation is as follows:
From the dictionary words matched against the segmentation dictionary and the unmatched single characters, the i-th path contains n_i words, so the word counts of the n paths form the set (n_1, n_2, ..., n_n).
Obtain min(.) = min(n_1, n_2, ..., n_n).
Among the remaining (n - m) retained paths, solve for the weight of each adjacent edge.
In the statistical corpus, compute the information content X(C_i) of each word and then the co-occurrence information X(C_i, C_(i+1)) of adjacent words on a path:
X(C_i) = |x(C_i)_1 - x(C_i)_2|
where x(C_i)_1 is the information content of the word C_i in the text corpus and x(C_i)_2 is the information content of the texts containing C_i:
x(C_i)_1 = -p(C_i)_1 ln p(C_i)_1
where p(C_i)_1 is the probability of C_i in the text corpus and n is the number of corpus texts containing C_i;
x(C_i)_2 = -p(C_i)_2 ln p(C_i)_2
where p(C_i)_2 is the proportion of texts containing C_i and N is the total number of texts in the statistical corpus.
Similarly, X(C_i, C_(i+1)) = |x(C_i, C_(i+1))_1 - x(C_i, C_(i+1))_2|
where x(C_i, C_(i+1))_1 is the co-occurrence information of the adjacent words (C_i, C_(i+1)) in the text corpus and x(C_i, C_(i+1))_2 is the information content of the texts in which they co-occur:
x(C_i, C_(i+1))_1 = -p(C_i, C_(i+1))_1 ln p(C_i, C_(i+1))_1
where p(C_i, C_(i+1))_1 is the co-occurrence probability of (C_i, C_(i+1)) in the text corpus and m is the number of texts in which the pair co-occurs;
x(C_i, C_(i+1))_2 = -p(C_i, C_(i+1))_2 ln p(C_i, C_(i+1))_2
where p(C_i, C_(i+1))_2 is the proportion of texts in which the adjacent words (C_i, C_(i+1)) co-occur.
Summing up, the weight of each adjacent edge is
w(C_i, C_(i+1)) = X(C_i) + X(C_(i+1)) - 2X(C_i, C_(i+1))
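These quantities can be sketched in Python as follows; the corpus is assumed to be a list of tokenized texts, and estimating p(.)_1 as a token-level relative frequency and p(.)_2 as a document-level proportion is one plausible reading of the definitions above:

import math

def info(p):
    """Shannon self-information term -p*ln(p); 0 when p == 0."""
    return -p * math.log(p) if p > 0 else 0.0

def word_info(word, corpus):
    total = sum(len(doc) for doc in corpus)
    p1 = sum(doc.count(word) for doc in corpus) / total        # p(C_i)_1
    p2 = sum(word in doc for doc in corpus) / len(corpus)      # p(C_i)_2
    return abs(info(p1) - info(p2))                            # X(C_i)

def pair_info(w1, w2, corpus):
    total = sum(max(len(doc) - 1, 0) for doc in corpus)
    adj = sum(sum(1 for a, b in zip(doc, doc[1:]) if (a, b) == (w1, w2))
              for doc in corpus)                               # adjacency count
    cotexts = sum((w1 in doc and w2 in doc) for doc in corpus)
    return abs(info(adj / total) - info(cotexts / len(corpus)))  # X(C_i, C_(i+1))

def edge_weight(w1, w2, corpus):
    # w(C_i, C_(i+1)) = X(C_i) + X(C_(i+1)) - 2 X(C_i, C_(i+1))
    return word_info(w1, corpus) + word_info(w2, corpus) - 2 * pair_info(w1, w2, corpus)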
Step 1.5: Find the path of maximum weight; it gives the segmentation result of the sentence. The calculation is as follows:
There are n paths, each of a different length; suppose the path set is (L_1, L_2, ..., L_n).
Through the operation of keeping the paths with the fewest words, m paths are eliminated (m < n), leaving (n - m) paths; suppose their path-length set is (S_1, S_2, ..., S_(n-m)).
The weight of each remaining path is then the sum of the weights of its edges, each computed one by one according to Step 1.4, divided by the path length S_j of the j-th path among the remaining (n - m) paths.
The path of maximum weight is taken as the segmentation.
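A sketch of the path selection, under the assumption suggested by the symbol definitions (the printed formula is not reproduced in this text) that a path's weight is the mean of its edge weights:

def best_segmentation(paths, corpus):
    """paths: candidate segmentations (token lists) left after the
    fewest-words filter; edge_weight() is the Step 1.4 sketch above."""
    def score(path):
        edges = list(zip(path, path[1:]))
        if not edges:
            return 0.0
        return sum(edge_weight(a, b, corpus) for a, b in edges) / len(edges)
    return max(paths, key=score)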
Step 2: Using the information-theoretic method, compute each word's weight w_i in the text. The calculation is as follows:
Based on information theory, the word-frequency information of a word follows the same -p ln p form as in Step 1.4: it is the information the word carries in a document by virtue of its word frequency, where p(c_1), p(c_2) are the probabilities of the words c_1, c_2 in the text.
Likewise, the document-frequency information is the information the word carries in the document library by virtue of its document frequency, where n_1, n_2 are the numbers of documents containing c_1, c_2 respectively, and N is the total number of documents in the library.
In sum, the information-theoretic term-weight function is obtained from these two quantities and normalized.
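A sketch of the Step 2 weight; since the printed formulas are not reproduced here, the word-frequency and document-frequency information terms are assumed to follow the -p*ln(p) form of Step 1.4 and to combine additively before normalization:

def term_weights(text_tokens, corpus):
    """Information-theoretic in-text weights; info() is the -p*ln(p)
    helper from the Step 1.4 sketch above."""
    raw = {}
    for w in set(text_tokens):
        p_tf = text_tokens.count(w) / len(text_tokens)        # in-text prob.
        p_df = sum(w in doc for doc in corpus) / len(corpus)  # doc. frequency
        raw[w] = info(p_tf) + info(p_df)
    norm = sum(raw.values()) or 1.0
    return {w: v / norm for w, v in raw.items()}              # weights sum to 1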
Step 3: From each word's position information and part of speech, compute its position-and-POS weight in the text. The calculation is as follows:
Survey data show that the earlier a feature word appears in a text, the better it represents the text's central idea, and the more often a feature word occurs in the text, the more representative it is of the text's meaning. With the in-text weights obtained in Step 2, take the top n feature words and assign position weights to them.
Each feature word appears at least once in the text; the text feature words c_(1,2)i form a position vector.
From the viewpoint of part of speech, nouns typically act as subject or object, verbs as predicate, and adjectives and adverbs as modifiers. Because of these differences, words of different parts of speech differ in their power to express the content of a text or sentence. Through surveys of experts in the relevant fields, in-text weight coefficients a_i for parts of speech such as nouns, verbs, adjectives, and adverbs can be obtained.
The combined position-and-POS weighting function of each feature word c_i is then
w(c_i) = a_i * SUM_(h=1..k) q_h n_h
where k is the number of paragraphs in which the feature word c_i appears in the text, q_h is the contribution of the h-th paragraph containing c_i to the text's idea, a_i is the contribution of the part of speech to the text's idea (the values of a_i and q_h are determined through surveys by experts in the corresponding text field), and n_h is the number of times c_i appears in paragraph h.
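A sketch of this weighting function; the coefficients a_i and q_h are expert-supplied, so the values below are placeholders only:

POS_COEFF = {"noun": 0.9, "verb": 0.7, "adj": 0.5, "adv": 0.4}  # a_i (assumed)

def position_pos_weight(word, pos, paragraphs, para_contrib):
    """paragraphs: list of token lists; para_contrib[h]: expert value q_h.
    Paragraphs not containing the word contribute 0 via count() == 0."""
    a_i = POS_COEFF.get(pos, 0.3)
    return a_i * sum(q_h * para.count(word)                    # q_h * n_h
                     for para, q_h in zip(paragraphs, para_contrib))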
Step 4: Considering the above three factors together, construct the objective function for extracting the feature words of texts (W1, W2) and extract the feature words of each text. The calculation is as follows:
The objective function for extracting the feature words of texts (W1, W2) is constructed from the three weights above.
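The printed objective function is not reproduced in this text; one plausible reading of "considering the three factors together" is the product of the Step 2 information weight and the Step 3 position-and-POS weight, as in this sketch, which keeps the top n words by the combined score:

def extract_feature_words(text_tokens, pos_of, paragraphs, para_contrib,
                          corpus, n=20):
    """pos_of: word -> POS tag mapping (a hypothetical input); reuses
    term_weights() and position_pos_weight() from the sketches above."""
    w_info = term_weights(text_tokens, corpus)                 # Step 2
    def combined(word):
        return w_info[word] * position_pos_weight(word, pos_of.get(word, "noun"),
                                                  paragraphs, para_contrib)
    return sorted(set(text_tokens), key=combined, reverse=True)[:n]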
Step 5: Using the word similarity sim(c_1i, c_1(i+1)), reduce the dimensionality of the feature-word sets obtained above. The similarity sim(g_1, g_2) between concepts must be computed first. The calculation is as follows:
Using the HowNet database, suppose the feature words (c_1i, c_1(i+1)) correspond to two concept sets. Compare these concepts pairwise and find the two concepts of maximum similarity; that maximum is taken as the similarity sim(c_1i, c_1(i+1)) between the feature words.
Step 5.1) Compute the similarity sim(g_1, g_2) between concepts with the information-theoretic method.
Information-content similarity measures similarity by the amount of information the concepts contain. A concept inherits from its ancestor nodes and refines them, so the information shared by two concepts can be measured by the information content of their common ancestor.
Solve for the information content I(pr) of the common parent in the tree hierarchy.
From Fig. 2, obtain the probability of the common parent of the two ontology concepts (g_1, g_2) at each level of the tree hierarchy:
p(pr) = (p_1(pr), p_2(pr), ..., p_k(pr))
where k is the number of levels of the common parent of the two ontology concepts (g_1, g_2) in the tree hierarchy, and E[p(pr)] is the probability mean of the common parent in the tree hierarchy.
Solve the information contents I(g_1), I(g_2) of the two ontology concepts (g_1, g_2) in the tree hierarchy. The solution process is as follows:
In the same way, from Fig. 2, obtain the per-level probabilities of the two ontology concepts (g_1, g_2) in the tree hierarchy:
p(g_1) = (p_1(g_1), p_2(g_1), ..., p_i(g_1))
p(g_2) = (p_1(g_2), p_2(g_2), ..., p_j(g_2))
where i is the number of levels of the ontology concept g_1 in the tree hierarchy and, likewise, j is the number of levels of g_2; E[p(g_1)], E[p(g_2)] are the probability means of the two ontology concepts (g_1, g_2) in the tree hierarchy.
From these, the information contents I(g_1), I(g_2) of the two ontology concepts in the tree hierarchy are obtained.
From the information contents, the semantic similarity sim(g_1, g_2) between the two ontology concepts follows. The calculation is as follows:
The information content of the common parent of the two ontology concepts (g_1, g_2) represents exactly the information the two concepts share. From experience, the semantic similarity sim(g_1, g_2) between the two ontology concepts (g_1, g_2) can then be obtained.
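The exact I(.) and sim(.) expressions are not reproduced in this text; the sketch below substitutes the classic information-content similarity of Lin, which is built from the same ingredients: the information content of the two concepts and of their common parent. The lcp function and the concept probabilities are assumed inputs, e.g. estimated over the HowNet hierarchy:

import math

def information_content(concept, concept_prob):
    """concept_prob: corpus-estimated probability of each concept node."""
    return -math.log(concept_prob[concept])

def concept_similarity(g1, g2, common_parent, concept_prob):
    # Lin similarity: shared information relative to total information
    i_pr = information_content(common_parent, concept_prob)
    denom = (information_content(g1, concept_prob) +
             information_content(g2, concept_prob))
    return 2 * i_pr / denom if denom else 1.0

def word_similarity(concepts1, concepts2, lcp, concept_prob):
    # max pairwise concept similarity, as prescribed for sim(c_1i, c_1(i+1));
    # lcp(a, b) returns the lowest common parent of two concepts (assumed)
    return max(concept_similarity(a, b, lcp(a, b), concept_prob)
               for a in concepts1 for b in concepts2)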
Step 5.2) From Step 5.1, the concept similarity matrix can be drawn up.
Dimensionality reduction is then applied to the feature-word set, using the rule
sim(c_1i, c_1(i+1)) >= α
When the pairwise similarity of two feature words meets the set threshold α, they are merged into a single word, namely one of the two most-similar words, and its weight value must be redistributed accordingly.
In the same way, the dimension-reduced vector of the feature-word set of text 2 is obtained.
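A sketch of this reduction; since the redistribution formula is not reproduced here, it is assumed that the lower-weighted word of a merged pair folds its weight into the higher-weighted one. The sim argument would be the word similarity of Step 5.1:

def reduce_dimensions(words, weights, sim, alpha=0.8):
    """Merge feature words whose pairwise similarity reaches alpha;
    returns the kept words and the redistributed weights."""
    words = sorted(words, key=lambda w: weights[w], reverse=True)
    kept = []
    for w in words:
        match = next((k for k in kept if sim(k, w) >= alpha), None)
        if match is not None:
            weights[match] += weights.pop(w)   # redistribute merged weight
        else:
            kept.append(w)
    return kept, weights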
Step 6: Compute the text similarity sim(W1, W2) between the texts to be compared (W1, W2) according to the Pearson correlation coefficient.
From the feature-word weights computed in Step 4, experts in the relevant field select the top m keywords (here m < 20), so that each of the texts (W1, W2) has a corresponding feature-word vector.
The average weight function of the feature words of text W1 is computed, and likewise the average weight function of the feature words of text W2.
From the Pearson correlation coefficient, the text similarity sim(W1, W2) between the texts (W1, W2) is then obtained.
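A sketch of the final step: the Pearson correlation of the two feature-weight vectors, aligned over the union of their keywords (the alignment rule is an assumption):

import math

def pearson_similarity(weights1, weights2):
    """sim(W1, W2) as the Pearson correlation of two weight dicts."""
    keys = sorted(set(weights1) | set(weights2))
    x = [weights1.get(k, 0.0) for k in keys]
    y = [weights2.get(k, 0.0) for k in keys]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0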
The pseudo-code of the improved text similarity solving algorithm based on semantic analysis:
Input: the texts to be compared (W1, W2).
Output: the similarity sim(W1, W2) between the texts (W1, W2).

Claims (4)

1. An improved text similarity solving algorithm based on semantic analysis, relating to the field of semantic web technology and in particular to an improved text similarity solving algorithm based on semantic analysis, characterized in that it comprises the following steps:
Step 1: Initialize the text corpus module and preprocess the texts to be compared (W1, W2). The specific process is as follows:
Combine word segmentation with stop-word removal; the Chinese text preprocessing flow is shown in Fig. 2.
The segmentation method used here is a Chinese word segmentation algorithm based on information theory; its segmentation and stop-word removal steps are as follows:
Step 1.1: Remove stop words from the texts (W1, W2) using a stop-word list.
Step 1.2: Using the segmentation dictionary, find the words in the sentence to be segmented that match dictionary entries. Specifically: scan the character string to be segmented from start to end, looking up matches in the system dictionary; whenever a dictionary word is encountered, it is marked as identified; characters with no dictionary match are split off as single-character words; this continues until the character string is exhausted.
Step 1.3: According to probability statistics, expand the sentence to be segmented into a network structure, which yields the n possible sentence segmentations; each sequence of nodes of this structure is defined in turn, as shown in Fig. 3.
Step 1.4: Using the information-theoretic method, assign a weight to each edge of the above network structure. The calculation is as follows:
From the dictionary words matched against the segmentation dictionary and the unmatched single characters, the i-th path contains a certain number of words; the word counts of the n paths form a set.
Among the remaining (n - m) retained paths, solve for the weight of each adjacent edge.
In the statistical corpus, compute the information content of each word and then the co-occurrence information of adjacent words on a path, where x(C_i)_1 is the information content of the word C_i in the text corpus and x(C_i)_2 is the information content of the texts containing C_i; p(C_i)_1 is the probability of C_i in the text corpus and n is the number of corpus texts containing C_i; p(C_i)_2 is the proportion of texts containing C_i and N is the total number of texts in the statistical corpus; similarly, x(C_i, C_(i+1))_1 is the co-occurrence information of the adjacent words (C_i, C_(i+1)) in the text corpus and x(C_i, C_(i+1))_2 is the information content of the texts in which they co-occur; p(C_i, C_(i+1))_1 is the co-occurrence probability of the pair in the text corpus and m is the number of texts in which the pair co-occurs; p(C_i, C_(i+1))_2 is the proportion of texts in which the adjacent words co-occur.
Summing up, the weight of each adjacent edge is obtained.
Step 1.5: Find the path of maximum weight; it gives the segmentation result of the sentence. The calculation is as follows:
There are n paths, each of a different length; suppose the path set is given.
Through the operation of keeping the paths with the fewest words, m paths are eliminated (m < n), leaving (n - m) paths with a given path-length set.
The weight of each remaining path is then the sum of the weights of its edges, each computed one by one according to Step 1.4, divided by the length of that path.
The path of maximum weight is taken as the segmentation.
Step 2: Using the information-theoretic method, compute each word's weight in the text. The calculation is as follows:
Based on information theory, the word-frequency information is computed, namely the information a word carries in a document by virtue of its word frequency, from the probabilities of the words in the text.
Based on information theory, the document-frequency information is computed, namely the information a word carries in the document library by virtue of its document frequency, from the numbers of documents containing the words and the total number N of documents in the library.
In sum, the information-theoretic term-weight function is obtained and normalized.
Step 3: From each word's position information and part of speech, compute its position-and-POS weight in the text.
Step 4: Considering the above three factors together, construct the objective function for extracting the feature words of texts (W1, W2) and extract the feature words of each text.
Step 5: Using the word similarity, reduce the dimensionality of the feature-word sets obtained above.
Step 6: Compute the text similarity sim(W1, W2) between the texts to be compared (W1, W2) according to the Pearson correlation coefficient. The calculation is as follows:
From the feature-word weights computed in Step 4, experts in the relevant field select the top m keywords (here m < 20), so that each of the texts (W1, W2) has a corresponding feature-word vector.
The average weight function of the feature words of text W1 is computed, and likewise that of text W2.
From the Pearson correlation coefficient, the text similarity sim(W1, W2) between the texts (W1, W2) is then obtained.
2. The improved text similarity solving algorithm based on semantic analysis according to claim 1, characterized in that the specific calculation process of Step 3 above is as follows:
Step 3: From each word's position information and part of speech, compute its position-and-POS weight in the text. The calculation is as follows:
Survey data show that the earlier a feature word appears in a text, the better it represents the text's central idea, and the more often a feature word occurs in the text, the more representative it is of the text's meaning. With the in-text weights obtained in Step 2, take the top n feature words and assign position weights to them.
Each feature word appears at least once in the text; the text feature words form a position vector.
From the viewpoint of part of speech, nouns typically act as subject or object, verbs as predicate, and adjectives and adverbs as modifiers; because of these differences, words of different parts of speech differ in their power to express the content of a text or sentence. Through surveys of experts in the relevant fields, in-text weight coefficients for parts of speech such as nouns, verbs, adjectives, and adverbs can be obtained.
The combined position-and-POS weighting function of each feature word is then given, where k is the number of paragraphs in which the feature word appears in the text, q_h is the contribution of the h-th paragraph containing the feature word to the text's idea, a_i is the contribution of the part of speech to the text's idea (the values of a_i and q_h being determined through surveys by experts in the corresponding text field), and n_h is the number of times the feature word appears in paragraph h.
3. The improved text similarity solving algorithm based on semantic analysis according to claim 1, characterized in that the specific calculation process of Step 4 above is as follows:
Step 4: Considering the above three factors together, construct the objective function for extracting the feature words of texts (W1, W2) and extract the feature words of each text. The calculation is as follows:
The objective function for extracting the feature words of texts (W1, W2) is constructed from the three weights above.
4. The improved text similarity solving algorithm based on semantic analysis according to claim 1, characterized in that the specific calculation process of Step 5 above is as follows:
Step 5: Using the word similarity, reduce the dimensionality of the feature-word set obtained above. The similarity between concepts must be computed first. The calculation is as follows:
Using the HowNet database, suppose the feature words correspond to two concept sets. Compare these concepts pairwise and find the two concepts of maximum similarity; that maximum is taken as the similarity between the feature words.
Step 5.1) Compute the similarity between concepts with the information-theoretic method.
Information-content similarity measures similarity by the amount of information the concepts contain; a concept inherits from its ancestor nodes and refines them, so the information shared by two concepts can be measured by the information content of their common ancestor.
Solve for the information content of the common parent in the tree hierarchy.
From Fig. 2, obtain the probability of the common parent of the two ontology concepts at each level of the tree hierarchy, where k is the number of levels of the common parent of the two ontology concepts in the tree hierarchy, and E[p(pr)] is the probability mean of the common parent in the tree hierarchy.
Solve the information contents of the two ontology concepts in the tree hierarchy. The solution process is as follows:
In the same way, from Fig. 2, obtain the per-level probabilities of the two ontology concepts in the tree hierarchy, where i is the number of levels of the first ontology concept in the tree hierarchy and, likewise, j is the number of levels of the second; E[p(g_1)], E[p(g_2)] are the probability means of the two ontology concepts in the tree hierarchy.
From these, the information contents of the two ontology concepts in the tree hierarchy are obtained.
From the information contents, the semantic similarity between the two ontology concepts follows. The calculation is as follows:
The information content of the common parent of the two ontology concepts represents exactly the information the two concepts share; from experience, the semantic similarity between the two ontology concepts can then be obtained.
Step 5.2) From Step 5.1, the concept similarity matrix can be drawn up.
Dimensionality reduction is then applied to the feature-word set: when the pairwise similarity of two feature words meets the set threshold, they are merged into a single word, namely one of the two most-similar words, and its weight value must be redistributed accordingly.
In the same way, the dimension-reduced vector of the feature-word set of text 2 is obtained.
CN201610864853.9A 2016-09-29 2016-09-29 Improved text similarity solving algorithm based on semantic analysis Pending CN106610951A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610864853.9A CN106610951A (en) 2016-09-29 2016-09-29 Improved text similarity solving algorithm based on semantic analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610864853.9A CN106610951A (en) 2016-09-29 2016-09-29 Improved text similarity solving algorithm based on semantic analysis

Publications (1)

Publication Number Publication Date
CN106610951A true CN106610951A (en) 2017-05-03

Family

ID=58615303

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610864853.9A Pending CN106610951A (en) 2016-09-29 2016-09-29 Improved text similarity solving algorithm based on semantic analysis

Country Status (1)

Country Link
CN (1) CN106610951A (en)


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BECK_ZHOU: "Chinese word segmentation language models and dynamic programming", CSDN blog, HTTPS://BLOG.CSDN.BET/ZHOUBL668/ARTICLE/DETAILS/68964 *
刘景方 et al.: "Research on an improved semantic similarity algorithm for ontology concepts", Journal of Wuhan University of Technology (《武汉理工大学学报》) *
蒋健洪 et al.: "Research and application of a Chinese word segmentation model combining dictionary and statistical methods", Computer Engineering and Design (《计算机工程与设计》) *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291808A (en) * 2017-05-16 2017-10-24 南京邮电大学 It is a kind of that big data sorting technique is manufactured based on semantic cloud
WO2019056692A1 (en) * 2017-09-25 2019-03-28 平安科技(深圳)有限公司 News sentence clustering method based on semantic similarity, device, and storage medium
CN109697452A (en) * 2017-10-23 2019-04-30 北京京东尚科信息技术有限公司 Processing method, processing unit and the processing system of data object
CN107943965A (en) * 2017-11-27 2018-04-20 福建中金在线信息科技有限公司 Similar article search method and device
CN108153730A (en) * 2017-12-25 2018-06-12 北京奇艺世纪科技有限公司 A kind of polysemant term vector training method and device
CN108563636A (en) * 2018-04-04 2018-09-21 广州杰赛科技股份有限公司 Extract method, apparatus, equipment and the storage medium of text key word
CN109165291B (en) * 2018-06-29 2021-07-09 厦门快商通信息技术有限公司 Text matching method and electronic equipment
CN109165291A (en) * 2018-06-29 2019-01-08 厦门快商通信息技术有限公司 A kind of text matching technique and electronic equipment
CN110232185A (en) * 2019-01-07 2019-09-13 华南理工大学 Towards financial industry software test knowledge based map semantic similarity calculation method
CN110232185B (en) * 2019-01-07 2023-09-19 华南理工大学 Knowledge graph semantic similarity-based computing method for financial industry software testing
CN110222192A (en) * 2019-05-20 2019-09-10 国网电子商务有限公司 Corpus method for building up and device
CN110309263A (en) * 2019-06-06 2019-10-08 中国人民解放军军事科学院军事科学信息研究中心 A kind of semantic-based working attributes content of text judgement method for confliction detection and device
CN110705248A (en) * 2019-10-09 2020-01-17 厦门今立方科技有限公司 Text similarity calculation method, terminal device and storage medium
CN110874392A (en) * 2019-11-20 2020-03-10 中山大学 Text network information fusion embedding method based on deep bidirectional attention mechanism
CN110874392B (en) * 2019-11-20 2023-10-24 中山大学 Text network information fusion embedding method based on depth bidirectional attention mechanism
CN112231439A (en) * 2020-09-27 2021-01-15 中国人民解放军军事科学院军事科学信息研究中心 Text semantic analysis and characteristic value extraction method
CN112348535A (en) * 2020-11-04 2021-02-09 新华中经信用管理有限公司 Traceability application method and system based on block chain technology
CN112348535B (en) * 2020-11-04 2023-09-12 新华中经信用管理有限公司 Traceability application method and system based on blockchain technology
CN112580352A (en) * 2021-03-01 2021-03-30 腾讯科技(深圳)有限公司 Keyword extraction method, device and equipment and computer storage medium
CN112580352B (en) * 2021-03-01 2021-06-04 腾讯科技(深圳)有限公司 Keyword extraction method, device and equipment and computer storage medium
CN115858765A (en) * 2023-01-08 2023-03-28 山东谷联网络技术有限公司 Automatic grading intelligent examination platform based on data contrast analysis
CN115905506A (en) * 2023-02-21 2023-04-04 江西省科技事务中心 Basic theory file pushing method and system, computer and readable storage medium
CN117725146A (en) * 2023-12-22 2024-03-19 中信出版集团股份有限公司 Network information processing system and method based on artificial intelligence

Similar Documents

Publication Publication Date Title
CN106610951A (en) Improved text similarity solving algorithm based on semantic analysis
CN106598940A (en) Text similarity solution algorithm based on global optimization of keyword quality
CN109635297B (en) Entity disambiguation method and device, computer device and computer storage medium
CN106611041A (en) New text similarity solution method
WO2008107305A2 (en) Search-based word segmentation method and device for language without word boundary tag
CN106570112A (en) Improved ant colony algorithm-based text clustering realization method
CN109002473A (en) A kind of sentiment analysis method based on term vector and part of speech
Rahimi et al. An overview on extractive text summarization
CN106528621A (en) Improved density text clustering algorithm
CN110362678A (en) A kind of method and apparatus automatically extracting Chinese text keyword
CN106598941A (en) Algorithm for globally optimizing quality of text keywords
CN106610952A (en) Mixed text feature word extraction method
CN106610953A (en) Method for solving text similarity based on Gini index
CN110705247A (en) Based on x2-C text similarity calculation method
CN106610954A (en) Text feature word extraction method based on statistics
Chang et al. A METHOD OF FINE-GRAINED SHORT TEXT SENTIMENT ANALYSIS BASED ON MACHINE LEARNING.
CN106610949A (en) Text feature extraction method based on semantic analysis
Gupta Hybrid algorithm for multilingual summarization of Hindi and Punjabi documents
Al-Azzawy et al. Arabic words clustering by using K-means algorithm
CN111428031A (en) Graph model filtering method fusing shallow semantic information
CN110929518A (en) Text sequence labeling algorithm using overlapping splitting rule
CN107038155A (en) The extracting method of text feature is realized based on improved small-world network model
CN107092595A (en) New keyword extraction techniques
CN112632272A (en) Microblog emotion classification method and system based on syntactic analysis
CN107102986A (en) Multi-threaded keyword extraction techniques in document

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170503