CN111814456A - Verb-based Chinese text similarity calculation method - Google Patents


Publication number: CN111814456A (application CN202010450674.7A)
Authority: CN (China)
Prior art keywords: text, feature, semantic, similarity, vector
Legal status: Pending
Application number: CN202010450674.7A
Other languages: Chinese (zh)
Inventors: 陈凯玲, 顾闻, 史松峰, 韩东, 徐雪莲
Current Assignee: State Grid Shanghai Electric Power Co Ltd
Original Assignee: State Grid Shanghai Electric Power Co Ltd
Application filed by State Grid Shanghai Electric Power Co Ltd
Priority application: CN202010450674.7A
Publication: CN111814456A

Classifications

    • G06F40/247 Thesauruses; Synonyms
    • G06F40/253 Grammatical analysis; Style critique
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Semantic analysis
    • G06Q50/18 Legal services; Handling legal documents

Abstract

The invention relates to a verb-based Chinese text similarity calculation method, which comprises the following steps: S1: acquiring a first text and a second text requiring similarity calculation, and preprocessing the first text and the second text; S2: respectively extracting verb sequences of the preprocessed first text and the preprocessed second text; S3: calculating the grammatical similarity f1 of the first text and the second text based on the verb sequences; S4: calculating the semantic similarity f2 of the first text and the second text based on the preprocessed texts; S5: combining the grammatical similarity and the semantic similarity to calculate the similarity between the first text and the second text. Compared with the prior art, the method improves both the accuracy and the speed of the calculation.

Description

Verb-based Chinese text similarity calculation method
Technical Field
The invention relates to the technical field of semantic analysis, in particular to a verb-based Chinese text similarity calculation method.
Background
In information processing, text similarity calculation is widely applied in fields such as information retrieval, machine translation, automatic question answering and text mining. It is a fundamental and key problem that has long been both a focus and a difficulty of research.
In recent years, some methods have proposed using text similarity calculation for intelligent review of signed contracts, enabling automatic early warning of potential legal risks in contract texts. This has further expanded the application of Chinese text similarity calculation and imposed new requirements on it.
Existing text similarity calculation methods include string-based, ontology-based and corpus-based methods. String-based methods consider only the literal matching or co-occurrence of character strings and ignore the semantic information contained in the text. Ontology-based methods are limited by the scale of the manually constructed ontology and cannot compute similarity for words outside it. Corpus-based methods train word vectors with a neural network to represent sentences as vectors, and can capture grammatical and semantic information in the text to a certain extent.
However, none of these methods effectively combines the rules and experience of Chinese linguistics with natural language processing, so they cannot calculate the similarity of Chinese texts both efficiently and accurately. Contract review concerns the vital interests of both signing parties. In power grid engineering construction, for example, the formulation of contract terms is a very important link; if the terms leave responsibilities unclear, there are risks of disputes and losses, so accurate and careful review is required. The conventional Chinese text similarity calculation methods are therefore unsuitable for intelligent contract review, and a new method is needed to calculate the similarity of Chinese texts efficiently and accurately.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a verb-based Chinese text similarity calculation method for improving the calculation accuracy and calculation speed.
The purpose of the invention can be realized by the following technical scheme:
a Chinese text similarity calculation method based on verbs comprises the following steps:
s1: acquiring a first text and a second text which need to be subjected to similarity calculation, and preprocessing the first text and the second text;
s2: respectively extracting verb sequences of the preprocessed first text and the preprocessed second text;
s3: calculating the grammatical similarity f1 of the first text and the second text based on the verb sequences;
S4: calculating the semantic similarity f2 of the first text and the second text based on the preprocessed first text and the preprocessed second text;
S5: and calculating the similarity f between the texts of the first text and the second text by combining the grammar similarity and the semantic similarity.
Further, the pretreatment specifically comprises: and segmenting the first text and the second text, and removing stop words.
In the process of word segmentation, one finds words, symbols and punctuation marks that carry little meaning for the text content but occur frequently. Words such as "this, too, right, did" appear in essentially any Chinese article, yet contribute little to it; their presence is dispensable, and removing them affects neither the specific meaning the article expresses nor its readability. Such words are therefore treated as stop words during preprocessing: the stop-word list of the Machine Intelligence Laboratory of Sichuan University is adopted, and the meaningless words are filtered out by constructing a removal word list.
Further, the step S3 specifically includes:
s31: respectively taking verb sequences of the first text and the second text as a first text characteristic character string and a second text characteristic character string;
s32: acquiring the number of common substrings from the first text characteristic character string to the second text characteristic character string, and recording the number as the number of the first common substrings;
s33: acquiring the number of common substrings from the second text characteristic character string to the first text characteristic character string, and recording the number as the number of the second common substrings;
s34: selecting the maximum public substring number from the first public substring number and the second public substring number as the actual public substring number;
s35: calculating the grammatical similarity f1 of the first text and the second text using the number of actual common substrings.
Further, the grammatical similarity f1 is calculated as:

f1 = 2c / (a + b)
wherein c is the number of actual common substrings, a is the number of verbs in the verb sequence of the first text, and b is the number of verbs in the verb sequence of the second text.
Further, the step S4 specifically includes:
s41: constructing a feature item vector table in a semantic topic space P based on a semantic vector space model;
s42: respectively extracting all feature items in the first text and the second text to obtain a first text feature item set and a second text feature item set;
s43: respectively counting the occurrence times of each feature item in the first text feature item set and the second text feature item set;
s44: acquiring feature item vectors corresponding to feature items in the first text feature item set and the second text feature item set by using a feature item vector table;
s45: calculating a feature vector corresponding to the first text and a feature vector corresponding to the second text according to the feature item vector, and respectively carrying out standardization processing to obtain a first text feature vector and a second text feature vector;
s46: calculating the semantic similarity f2 of the first text and the second text according to the first text feature vector and the second text feature vector.
Furthermore, the feature vector Ti corresponding to the first text is calculated as:

Ti = Σ(k=1..n) fi,k · vk

wherein fi,k is the number of occurrences of the kth feature item in the first text feature item set, n is the number of feature items in the first text, and vk is the feature item vector in the semantic topic space P corresponding to the kth feature item in the first text feature item set;
the feature vector Tj corresponding to the second text is calculated as:

Tj = Σ(k=1..m) fj,k · vk

wherein fj,k is the number of occurrences of the kth feature item in the second text feature item set, m is the number of feature items in the second text, and vk is the feature item vector in the semantic topic space P corresponding to the kth feature item in the second text feature item set.
Further, the semantic similarity f2 is calculated as:

f2 = cos wi,j = (Ti · Tj) / (|Ti| · |Tj|)

wherein Ti is the first text feature vector, Tj is the second text feature vector, and wi,j is the included angle between the first text feature vector and the second text feature vector.
Further, the step S41 specifically includes:
s411: determining the semantic topic set VT = {τ1, τ2, …, τd} used in the semantic vector space model, thereby determining the semantic topic space P;
s412: determining text characteristic items of non-semantic subjects in a semantic vector space model, and recording the text characteristic items as a set VN
S413: expressing semantic subjects and feature items as a set V, taking elements of the set as nodes, taking semantic relations among the elements as edges, and organizing a semantic relation graph G (V, E);
s414: determining vectors corresponding to all semantic topics according to the semantic association graph G ═ V, E >;
s415: and calculating the vector representation of each feature item, and constructing a feature item vector table in the semantic topic space P.
Further preferably, the feature items are words in the text.
Further, the calculation formula of the similarity between texts is as follows:
f=α*f1+β*f2
wherein α is the grammar weighting coefficient, preferably 0.4, and β is the semantic weighting coefficient, preferably 0.6; the values are determined according to the relative weight of the grammatical structure and the semantic structure in text similarity measurement.
Compared with the prior art, the invention has the following advantages:
1) by introducing the concept of the "verb head word", the invention extends the range of stop words, takes the verb sequence remaining after stop-word removal as the text feature string, and calculates the grammatical similarity f1 between Chinese texts with a string-matching algorithm; the algorithm is simple and the calculation speed is improved;
2) the invention extracts the feature items of the two texts and performs weight calculation according to the TF-IDF method, uses semantic topics as the dimensions of the vector space to extract the text feature vectors, and calculates the semantic similarity f2; this effectively remedies the problem that taking words alone as the feature items of a text overlooks near-synonym substitution and synonyms written in different forms, and effectively improves the accuracy of the calculation result;
3) the invention combines the grammatical similarity f1 and the semantic similarity f2 between texts to obtain the inter-text similarity f as the final result, taking both grammar and semantics into account and thereby improving the accuracy of text similarity calculation.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a diagram illustrating a syntax similarity calculation process;
FIG. 3 is a schematic diagram of a semantic similarity calculation process;
FIG. 4 is a diagram illustrating the number of common substrings from text A to text B in the embodiment;
FIG. 5 is a diagram illustrating the number of common substrings from text B to text A in the embodiment.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.
Examples
As shown in FIG. 1, the invention provides a Chinese text similarity calculation method based on verbs, which comprises the following steps:
s1: acquiring a first text and a second text which need to be subjected to similarity calculation, and preprocessing the first text and the second text;
s2: respectively extracting verb sequences of the preprocessed first text and the preprocessed second text;
s3: calculating grammar similarity f of the first text and the second text based on the verb sequence1
S4: calculating semantic similarity f of the first text and the second text based on the preprocessed first text and the preprocessed second text2
S5: and calculating the similarity f between the texts of the first text and the second text by combining the grammar similarity and the semantic similarity.
Wherein the pretreatment specifically comprises: and segmenting the first text and the second text, and removing stop words.
In the embodiment, an open-source Chinese word segmentation component is used, and the two texts are segmented with this third-party segmentation library. During segmentation, the meaningless and dispensable words of the text are first placed into a pre-constructed removal word list, which facilitates the subsequent loading of the Sogou lexicon for calculating word frequencies and weights.
One finds some words, symbols and punctuation marks that carry little meaning for the text content but occur frequently. Words such as "this, too, right, did" appear in essentially any Chinese article, yet contribute little to it; their presence is dispensable, and removing them affects neither the specific meaning the article expresses nor its readability. Such words are therefore treated as stop words during preprocessing. In the embodiment, the stop-word list of the Machine Intelligence Laboratory of Sichuan University is adopted, and the meaningless words are filtered out by constructing a removal word list (Remove Words List).
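The preprocessing step above can be sketched in Python. This is a minimal illustration, not the patent's actual implementation: the tiny stop-word set and the sample tokens are made-up stand-ins for the Sichuan University stop-word list and a real segmenter's output.

```python
# Illustrative subset of a stop-word list ("removal word list").
STOP_WORDS = {"的", "了", "太", "对", "吗", "呢"}

def preprocess(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop-word list,
    preserving the order of the remaining tokens."""
    return [t for t in tokens if t not in stop_words]

# Tokens as they might come out of a Chinese word segmenter.
tokens = ["合同", "的", "条款", "了", "明确"]
print(preprocess(tokens))  # ['合同', '条款', '明确']
```

Removing stop words before the later steps shrinks the feature sets without changing the meaning the texts express.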
The method comprises three major parts: first, the grammatical similarity f1 of the two texts is calculated by extracting verbs; second, the semantic similarity f2 is calculated by extracting feature items and applying TF-IDF weighting; finally, the grammatical similarity f1 and the semantic similarity f2 are combined to obtain the inter-text similarity f. The three parts are described in detail below.
(I) Calculation of the grammatical similarity f1 of two texts by extracting verbs
Lü Shuxiang built a verb-centered syntactic model in his representative work, A Brief Outline of Chinese Grammar. When analyzing a sentence, the center of the sentence is the verb expressing the action; the nouns expressing the initiator, the end point and other aspects of the action all supplement the verb, and so may be collectively called "supplementary words". Besides the verb center, a sentence thus contains various supplementary words such as the initiator word, the end word, the recipient word and other complements. In other words, the meaning a sentence expresses is reflected in its central verb, so the sequence of the central verbs of all sentences in a paragraph reflects the central meaning of the paragraph; likewise, the sequence of the central verbs of all sentences in a text summarizes the central meaning of the full text. The verb sequence not only reflects the actions that occur in the text but also records the order in which they occur, so it can serve as the feature string of an article. The similarity of the feature strings of two texts then reflects the similarity between the texts.
The method comprises the following specific steps:
s31: respectively taking verb sequences of the first text and the second text as a first text characteristic character string and a second text characteristic character string;
s32: acquiring the number of common substrings from the first text characteristic character string to the second text characteristic character string, and recording the number as the number of the first common substrings;
s33: acquiring the number of common substrings from the second text characteristic character string to the first text characteristic character string, and recording the number as the number of the second common substrings;
s34: selecting the maximum public substring number from the first public substring number and the second public substring number as the actual public substring number;
s35: calculating the grammar similarity f of the first text and the second text by using the number of the actual common substrings1
As shown in fig. 2, suppose the two texts are text A and text B. After the verb sequence of each text is obtained, it can be regarded as a character string, giving the text A feature string and the text B feature string; the similarity of the two verb sequences is then obtained by counting the common substrings of the two feature strings. Suppose the verb sequence of text A is V1, V2, V3, V2, V4 and that of text B is V1, V3, V2, V4. The number of common substrings from the text A feature string to the text B feature string is shown in fig. 4, and the number from the text B feature string to the text A feature string in fig. 5. As figs. 4 and 5 show, the former is 3 and the latter is 4; the larger of the two is taken as the actual number of common substrings, which is therefore 4.
Finally, the grammatical similarity f1 is calculated as:

f1 = 2c / (a + b)
wherein c is the number of actual common substrings, a is the number of verbs in the verb sequence of the first text, and b is the number of verbs in the verb sequence of the second text.
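The grammatical-similarity step can be sketched in Python on the patent's own example (text A verbs V1 V2 V3 V2 V4, text B verbs V1 V3 V2 V4). Two points here are assumptions rather than things the patent states outright: `ordered_match_count` is one plausible reading of "number of common substrings from one feature string to the other" (a greedy in-order scan, which reproduces the counts 3 and 4 shown in figs. 4 and 5), and the Dice-style formula f1 = 2c / (a + b) is reconstructed from the variable definitions around the formula image.

```python
def ordered_match_count(src, dst):
    """Greedily match src tokens against dst left to right,
    never moving the dst pointer backwards; count the matches."""
    count, j = 0, 0
    for tok in src:
        for k in range(j, len(dst)):
            if dst[k] == tok:
                count += 1
                j = k + 1
                break
    return count

def grammar_similarity(verbs_a, verbs_b):
    # c = larger of the two directional counts (S34),
    # a, b = verb counts of the two texts.
    c = max(ordered_match_count(verbs_a, verbs_b),
            ordered_match_count(verbs_b, verbs_a))
    return 2 * c / (len(verbs_a) + len(verbs_b))

A = ["V1", "V2", "V3", "V2", "V4"]
B = ["V1", "V3", "V2", "V4"]
print(ordered_match_count(A, B))  # 3
print(ordered_match_count(B, A))  # 4
print(round(grammar_similarity(A, B), 3))  # 0.889  (2*4 / (5+4))
```

With the example's counts, c = 4, a = 5 and b = 4, giving f1 = 8/9 ≈ 0.889 under the reconstructed formula.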
(II) Calculation of the semantic similarity f2 by extracting feature items and applying the TF-IDF weighting method
The measurement of semantic similarity can draw on the vector model used in information retrieval. The basic idea of the vector space model is to represent texts as vectors; characters, words or phrases can be selected as the feature items.
When TF-IDF similarity is calculated in the VSM with words as the feature items of the text, the substitution of near-synonyms and of synonyms written in different forms is ignored, which reduces the accuracy of the calculation result. This problem can be solved effectively with a semantic dictionary: the commonly used semantic dictionaries, chiefly the synonym thesaurus Tongyici Cilin and HowNet, provide information on the related word concepts and thereby serve as the measure of word similarity. The method takes semantic topics as the dimensions of the vector space to extract feature vectors, adopting an approach based on corpus statistics: a group of word features is first selected, each word is then compared against these features to obtain its feature vector, and similarity is computed as the cosine of the angle between vectors. The specific steps are as follows:
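The TF-IDF weighting named above can be sketched with the textbook definition tf · log(N / df). The patent does not give its exact weighting formula, so this hand-rolled version over a made-up two-document corpus is an illustrative assumption.

```python
import math

# Toy corpus: two already-segmented, stop-word-free documents.
docs = [["签订", "合同", "条款"], ["审查", "合同"]]

def tf_idf(term, doc, corpus):
    """Textbook TF-IDF: term frequency within the document times
    the log of inverse document frequency across the corpus."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df)
    return tf * idf

print(tf_idf("条款", docs[0], docs))  # > 0: occurs in only one document
print(tf_idf("合同", docs[0], docs))  # 0.0: occurs in every document
```

A term appearing in every document gets weight zero, which is why TF-IDF favours feature items that discriminate between the two texts being compared.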
s41: constructing a feature item vector table in a semantic topic space P based on a semantic vector space model;
wherein S41 specifically includes:
s411: determining the semantic topic set VT = {τ1, τ2, …, τd} used in the semantic vector space model, thereby determining the semantic topic space P;
s412: determining text characteristic items of non-semantic subjects in a semantic vector space model, and recording the text characteristic items as a set VN
S413: expressing semantic subjects and feature items as a set V, taking elements of the set as nodes, taking semantic relations among the elements as edges, and organizing a semantic relation graph G (V, E);
s414: determining vectors corresponding to all semantic topics according to the semantic association graph G ═ V, E >;
s415: and calculating the vector representation of each feature item, and constructing a feature item vector table in the semantic topic space P.
S42: respectively extracting all feature items in the first text and the second text to obtain a first text feature item set and a second text feature item set;
s43: respectively counting the occurrence times of each feature item in the first text feature item set and the second text feature item set;
s44: acquiring feature item vectors corresponding to feature items in the first text feature item set and the second text feature item set by using a feature item vector table;
s45: calculating a feature vector corresponding to the first text and a feature vector corresponding to the second text according to the feature item vector, and respectively carrying out standardization processing to obtain a first text feature vector and a second text feature vector;
The feature vector Ti corresponding to the first text is calculated as:

Ti = Σ(k=1..n) fi,k · vk

wherein fi,k is the number of occurrences of the kth feature item in the first text feature item set, n is the number of feature items in the first text, and vk is the feature item vector in the semantic topic space P corresponding to the kth feature item in the first text feature item set;
the feature vector Tj corresponding to the second text is calculated as:

Tj = Σ(k=1..m) fj,k · vk

wherein fj,k is the number of occurrences of the kth feature item in the second text feature item set, m is the number of feature items in the second text, and vk is the feature item vector in the semantic topic space P corresponding to the kth feature item in the second text feature item set.
S46: calculating the semantic similarity f2 of the first text and the second text according to the first text feature vector and the second text feature vector.
The semantic similarity f2 is calculated as:

f2 = cos wi,j = (Ti · Tj) / (|Ti| · |Tj|)

wherein Ti is the first text feature vector, Tj is the second text feature vector, and wi,j is the included angle between the first text feature vector and the second text feature vector.
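Steps S45 and S46 can be sketched as follows: build each text's feature vector as the occurrence-weighted sum of its feature-item vectors in the semantic topic space, then take the cosine of the angle between the two vectors as f2. The 3-dimensional topic space and the item vectors below are made-up illustrative values, not the patent's feature-item vector table.

```python
import math

def text_vector(counts, item_vectors):
    """Weighted sum over feature items: sum of count * item-vector."""
    dim = len(next(iter(item_vectors.values())))
    vec = [0.0] * dim
    for item, cnt in counts.items():
        for d in range(dim):
            vec[d] += cnt * item_vectors[item][d]
    return vec

def cosine(u, v):
    """Cosine of the included angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical feature-item vectors in a 3-dimensional topic space P.
item_vectors = {"合同": [1.0, 0.2, 0.0],
                "条款": [0.8, 0.4, 0.1],
                "审查": [0.1, 0.9, 0.3]}

t1 = text_vector({"合同": 2, "条款": 1}, item_vectors)  # first text
t2 = text_vector({"合同": 1, "审查": 1}, item_vectors)  # second text
f2 = cosine(t1, t2)
print(round(f2, 3))
```

Because both texts share the item "合同", the cosine lands well above zero even though the remaining items differ; this is the effect of comparing texts in the topic space rather than on raw word overlap.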
(III) Combining the grammatical similarity f1 and the semantic similarity f2 into the inter-text similarity f
After the semantic similarity f2 and the grammatical similarity f1 of the two texts are obtained, the total similarity, i.e. the inter-text similarity f, is calculated as:
f=α*f1+β*f2
wherein α is the grammar weighting coefficient, preferably 0.4, and β is the semantic weighting coefficient, preferably 0.6; the values are determined according to the relative weight of the grammatical structure and the semantic structure in text similarity measurement.
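The final combination step is a one-line weighted sum; the sketch below uses the preferred weights α = 0.4 and β = 0.6 from the text, with arbitrary example values for f1 and f2.

```python
def combined_similarity(f1, f2, alpha=0.4, beta=0.6):
    """Inter-text similarity f = alpha * f1 + beta * f2."""
    return alpha * f1 + beta * f2

# Example: grammatical similarity 0.8, semantic similarity 0.9.
print(combined_similarity(0.8, 0.9))  # 0.4*0.8 + 0.6*0.9 = 0.86
```

Since α + β = 1 and both similarities lie in [0, 1], the combined f also lies in [0, 1].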
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and those skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A verb-based Chinese text similarity calculation method is characterized by comprising the following steps:
s1: acquiring a first text and a second text which need to be subjected to similarity calculation, and preprocessing the first text and the second text;
s2: respectively extracting verb sequences of the preprocessed first text and the preprocessed second text;
s3: calculating grammar similarity f of the first text and the second text based on the verb sequence1
S4: calculating semantic similarity f of the first text and the second text based on the preprocessed first text and the preprocessed second text2
S5: and calculating the similarity f between the texts of the first text and the second text by combining the grammar similarity and the semantic similarity.
2. The verb-based Chinese text similarity calculation method according to claim 1, wherein the preprocessing specifically comprises:
and segmenting the first text and the second text, and removing stop words.
3. The method for calculating similarity of Chinese text based on verbs as claimed in claim 1, wherein said step S3 further includes:
s31: respectively taking verb sequences of the first text and the second text as a first text characteristic character string and a second text characteristic character string;
s32: acquiring the number of common substrings from the first text characteristic character string to the second text characteristic character string, and recording the number as the number of the first common substrings;
s33: acquiring the number of common substrings from the second text characteristic character string to the first text characteristic character string, and recording the number as the number of the second common substrings;
s34: selecting the maximum public substring number from the first public substring number and the second public substring number as the actual public substring number;
s35: calculating the grammatical similarity f1 of the first text and the second text using the number of actual common substrings.
4. The verb-based Chinese text similarity calculation method according to claim 3, wherein the grammatical similarity f1 is calculated as:

f1 = 2c / (a + b)
wherein c is the number of actual common substrings, a is the number of verbs in the verb sequence of the first text, and b is the number of verbs in the verb sequence of the second text.
5. The verb-based Chinese text similarity calculation method according to claim 4, wherein said step S4 specifically comprises:
s41: constructing a feature item vector table in a semantic topic space P based on a semantic vector space model;
s42: respectively extracting all feature items in the first text and the second text to obtain a first text feature item set and a second text feature item set;
s43: respectively counting the occurrence times of each feature item in the first text feature item set and the second text feature item set;
s44: acquiring feature item vectors corresponding to feature items in the first text feature item set and the second text feature item set by using a feature item vector table;
s45: calculating a feature vector corresponding to the first text and a feature vector corresponding to the second text according to the feature item vector, and respectively carrying out standardization processing to obtain a first text feature vector and a second text feature vector;
s46: calculating the semantic similarity f2 of the first text and the second text according to the first text feature vector and the second text feature vector.
6. The method of claim 5, wherein the feature vector Ti corresponding to the first text is calculated as:

Ti = Σ(k=1..n) fi,k · vk

wherein fi,k is the number of occurrences of the kth feature item in the first text feature item set, n is the number of feature items in the first text, and vk is the feature item vector in the semantic topic space P corresponding to the kth feature item in the first text feature item set;
the feature vector V_j corresponding to the second text is calculated as:

V_j = Σ_{k=1}^{m} f_{j,k} * v_{j,k}

wherein f_{j,k} is the number of occurrences of the k-th feature item in the second text feature item set, m is the number of all feature items in the second text, and v_{j,k} is the feature item vector corresponding to the k-th feature item of the second text feature item set in the semantic topic space P.
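A minimal sketch of the feature-vector construction of claim 6, under the assumption that the equation (an image in the original) is the weighted sum of feature-item vectors, followed by the length normalization ("standardization") of step S45. The vector table and counts here are hypothetical.

```python
import numpy as np

def text_feature_vector(feature_counts, vector_table):
    """V = sum_k f_k * v_k over the text's feature items (claim 6),
    then length-normalized as in step S45."""
    dim = len(next(iter(vector_table.values())))
    v = np.zeros(dim)
    for item, count in feature_counts.items():
        v += count * np.asarray(vector_table[item], dtype=float)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

# Hypothetical 3-dimensional semantic topic space P.
table = {"grid": [1.0, 0.0, 0.0], "repair": [0.0, 1.0, 0.0]}
v_i = text_feature_vector({"grid": 2, "repair": 1}, table)
```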
7. The verb-based Chinese text similarity calculation method according to claim 6, wherein the semantic similarity f2 is calculated as:

f2 = cos(w_{i,j}) = (V_i · V_j) / (|V_i| * |V_j|)

wherein V_i is the first text feature vector, V_j is the second text feature vector, and w_{i,j} is the included angle between the first text feature vector and the second text feature vector.
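The semantic similarity of claim 7 is the standard cosine of the angle between the two feature vectors; a minimal sketch (not part of the claims):

```python
import numpy as np

def semantic_similarity(v_i, v_j):
    """f2 = cos(w_ij) = (V_i . V_j) / (|V_i| * |V_j|)."""
    denom = np.linalg.norm(v_i) * np.linalg.norm(v_j)
    return float(np.dot(v_i, v_j) / denom) if denom else 0.0
```

If the feature vectors were already length-normalized in step S45, the denominator is 1 and f2 reduces to the dot product.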
8. The verb-based Chinese text similarity calculation method according to claim 5, wherein said step S41 specifically comprises:
s411: determining the semantic topic set V_T = {τ_1, τ_2, …, τ_d} used in the semantic vector space model, thereby determining the semantic topic space P;
s412: determining the text feature items in the semantic vector space model that are not semantic topics, recorded as a set V_N;
s413: expressing the semantic topics and the feature items together as a set V, taking the elements of the set as nodes and the semantic relations among the elements as edges, and organizing a semantic association graph G = <V, E>;
s414: determining the vectors corresponding to all semantic topics according to the semantic association graph G = <V, E>;
s415: calculating the vector representation of each feature item, and constructing the feature item vector table in the semantic topic space P.
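One plausible realization of steps S411-S415 (the claim does not fix how the topic vectors or item vectors are derived): each semantic topic spans one orthogonal axis of P, and each non-topic feature item is placed at the mean of the topic vectors it is linked to in G = <V, E>. All names and edges below are hypothetical.

```python
import numpy as np

# S411/S414: each semantic topic spans one axis of the topic space P.
topics = ["equipment", "safety", "planning"]   # hypothetical topic set V_T
topic_vec = {t: np.eye(len(topics))[k] for k, t in enumerate(topics)}

# S412/S413: non-topic feature items (V_N) and their semantic relations,
# i.e. the edges E of the association graph G = <V, E> (hypothetical data).
edges = {"overhaul": ["equipment", "safety"], "schedule": ["planning"]}

# S415: place each non-topic item at the mean of its topic neighbours,
# yielding the feature item vector table over P.
feature_table = dict(topic_vec)
for item, neighbours in edges.items():
    feature_table[item] = np.mean([topic_vec[t] for t in neighbours], axis=0)
```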
9. The method of claim 8, wherein the feature items are words in the text.
10. The verb-based Chinese text similarity calculation method according to claim 7, wherein the inter-text similarity is calculated as:

f = α * f1 + β * f2

wherein α is the grammar weighting coefficient and β is the semantic weighting coefficient.
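A minimal sketch of the weighted combination in claim 10. The weight values α = 0.4 and β = 0.6 are hypothetical; the claim leaves them unspecified.

```python
def text_similarity(f1, f2, alpha=0.4, beta=0.6):
    """f = alpha * f1 + beta * f2 (claim 10): grammar similarity f1 and
    semantic similarity f2 blended by hypothetical weights."""
    return alpha * f1 + beta * f2
```

Choosing alpha + beta = 1 keeps f on the same [0, 1] scale as f1 and f2.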
CN202010450674.7A 2020-05-25 2020-05-25 Verb-based Chinese text similarity calculation method Pending CN111814456A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010450674.7A CN111814456A (en) 2020-05-25 2020-05-25 Verb-based Chinese text similarity calculation method


Publications (1)

Publication Number Publication Date
CN111814456A true CN111814456A (en) 2020-10-23

Family

ID=72848023

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010450674.7A Pending CN111814456A (en) 2020-05-25 2020-05-25 Verb-based Chinese text similarity calculation method

Country Status (1)

Country Link
CN (1) CN111814456A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112883165A (en) * 2021-03-16 2021-06-01 山东亿云信息技术有限公司 Intelligent full-text retrieval method and system based on semantic understanding

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012043294A (en) * 2010-08-20 2012-03-01 Kddi Corp Binomial relationship categorization program, method, and device for categorizing semantically similar word pair by binomial relationship
CN108549634A (en) * 2018-04-09 2018-09-18 北京信息科技大学 A kind of Chinese patent text similarity calculating method


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIU, XIAOJUN; ZHAO, DONG; YAO, WEIDONG: "A dual-factor similarity algorithm for Chinese text duplicate checking" (一种用于中文文本查重的双因子相似度算法), Computer Simulation (计算机仿真), no. 12, pages 2 - 3 *
HUANG, JU: "An assignment duplicate-checking algorithm based on a semantic vector space model" (一种基于语义向量空间模型的作业查重算法), Electronic Science & Technology (电子科学技术), no. 06, pages 2 - 3 *


Similar Documents

Publication Publication Date Title
Suleiman et al. Deep learning based technique for plagiarism detection in Arabic texts
Oudah et al. NERA 2.0: Improving coverage and performance of rule-based named entity recognition for Arabic
Ulčar et al. High quality ELMo embeddings for seven less-resourced languages
CN108073571B (en) Multi-language text quality evaluation method and system and intelligent text processing system
Gao et al. Named entity recognition method of Chinese EMR based on BERT-BiLSTM-CRF
CN113761890B (en) Multi-level semantic information retrieval method based on BERT context awareness
Ren et al. Detecting the scope of negation and speculation in biomedical texts by using recursive neural network
Al-Harbi et al. Lexical disambiguation in natural language questions (nlqs)
Wadud et al. Text coherence analysis based on misspelling oblivious word embeddings and deep neural network
Zhang et al. Chinese-English mixed text normalization
Sornlertlamvanich et al. Thai Named Entity Recognition Using BiLSTM-CNN-CRF Enhanced by TCC
Aejas et al. Named entity recognition for cultural heritage preservation
CN111814456A (en) Verb-based Chinese text similarity calculation method
pal Singh et al. Naive Bayes classifier for word sense disambiguation of Punjabi language
Khoufi et al. Chunking Arabic texts using conditional random fields
Tongtep et al. Multi-stage automatic NE and pos annotation using pattern-based and statistical-based techniques for thai corpus construction
Abdolahi et al. A new method for sentence vector normalization using word2vec
Rebala et al. Natural language processing
Jamwal Named entity recognition for Dogri using ML
Prasad et al. Lexicon based extraction and opinion classification of associations in text from Hindi weblogs
Golubev et al. Use of augmentation and distant supervision for sentiment analysis in Russian
Bafna et al. BaSa: A Technique to Identify Context based Common Tokens for Hindi Verses and Proses
Yuan et al. Semantic based chinese sentence sentiment analysis
Bharti et al. Sarcasm as a contradiction between a tweet and its temporal facts: a pattern-based approach
Liu et al. Domain phrase identification using atomic word formation in Chinese text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination