CN111814456A - Verb-based Chinese text similarity calculation method - Google Patents


Publication number: CN111814456A (application CN202010450674.7A)
Authority: CN (China)
Prior art keywords: text, feature, semantic, similarity, vector
Legal status: Pending
Application number: CN202010450674.7A
Other languages: Chinese (zh)
Inventors: 陈凯玲, 顾闻, 史松峰, 韩东, 徐雪莲
Current Assignee: State Grid Shanghai Electric Power Co Ltd
Original Assignee: State Grid Shanghai Electric Power Co Ltd
Application filed by State Grid Shanghai Electric Power Co Ltd
Priority application: CN202010450674.7A
Publication: CN111814456A

Classifications

    • G06F40/247 Thesauruses; Synonyms
    • G06F40/253 Grammatical analysis; Style critique
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Semantic analysis
    • G06Q50/18 Legal services; Handling legal documents

Abstract

The invention relates to a verb-based Chinese text similarity calculation method, which comprises the following steps: S1: acquiring a first text and a second text requiring similarity calculation, and preprocessing the first text and the second text; S2: respectively extracting verb sequences of the preprocessed first text and the preprocessed second text; S3: calculating the grammatical similarity f1 of the first text and the second text based on the verb sequences; S4: calculating the semantic similarity f2 of the first text and the second text based on the preprocessed texts; S5: combining the grammatical similarity and the semantic similarity to calculate the similarity between the first text and the second text. Compared with the prior art, the method improves both the accuracy and the speed of the calculation.

Description

Verb-based Chinese text similarity calculation method
Technical Field
The invention relates to the technical field of semantic analysis, in particular to a verb-based Chinese text similarity calculation method.
Background
In information processing, text similarity calculation is widely applied in fields such as information retrieval, machine translation, automatic question answering and text mining. It is a fundamental and key problem that has long been both a focus and a difficulty of research.
In recent years, some methods have proposed using text similarity calculation for intelligent review of signed contracts, enabling automatic early warning of potential legal risks in contract texts. This has further expanded the application of Chinese text similarity calculation and imposed new requirements on it.
Existing text similarity calculation methods include string-based, ontology-based and corpus-based methods. String-based methods consider only the literal matching or co-occurrence of character strings and ignore the semantic information contained in the text. Ontology-based methods are limited by the scale of the manually constructed ontology and cannot compute similarity for words outside it. Corpus-based methods train word vectors with a neural network to represent sentences as vectors, and can capture grammatical and semantic information in the text to a certain extent.
However, none of these methods effectively combines the rules and experience of Chinese linguistics with natural language processing, so they cannot calculate the similarity of Chinese texts both efficiently and accurately. Contract review concerns the vital interests of both signing parties. In power grid engineering construction, for example, the formulation of contract terms is a very important link; if the terms leave responsibilities unclear, there are risks of disputes and losses, so accurate and careful review is required. The conventional Chinese text similarity calculation methods are therefore unsuitable for intelligent contract review, and a new method is needed to calculate the similarity of Chinese texts efficiently and accurately.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a verb-based Chinese text similarity calculation method for improving the calculation accuracy and calculation speed.
The purpose of the invention can be realized by the following technical scheme:
a Chinese text similarity calculation method based on verbs comprises the following steps:
s1: acquiring a first text and a second text which need to be subjected to similarity calculation, and preprocessing the first text and the second text;
s2: respectively extracting verb sequences of the preprocessed first text and the preprocessed second text;
s3: calculating the grammatical similarity f1 of the first text and the second text based on the verb sequences;
S4: calculating the semantic similarity f2 of the first text and the second text based on the preprocessed first text and the preprocessed second text;
S5: and calculating the similarity f between the texts of the first text and the second text by combining the grammar similarity and the semantic similarity.
Further, the pretreatment specifically comprises: and segmenting the first text and the second text, and removing stop words.
In the process of word segmentation, one finds words, symbols and punctuation marks that carry little meaning for the text content but occur frequently. Words such as "this, too, right, did" appear in essentially any Chinese article, yet contribute little to it; their presence is dispensable, and removing them affects neither the specific meaning the article expresses nor its readability. Such words are therefore treated as stop words during preprocessing: the stop-word list of the Machine Intelligence Laboratory of Sichuan University is adopted, and the meaningless words are filtered out by constructing a removal word list.
Further, the step S3 specifically includes:
s31: respectively taking verb sequences of the first text and the second text as a first text characteristic character string and a second text characteristic character string;
s32: acquiring the number of common substrings from the first text characteristic character string to the second text characteristic character string, and recording the number as the number of the first common substrings;
s33: acquiring the number of common substrings from the second text characteristic character string to the first text characteristic character string, and recording the number as the number of the second common substrings;
s34: selecting the maximum public substring number from the first public substring number and the second public substring number as the actual public substring number;
s35: calculating the grammatical similarity f1 of the first text and the second text using the number of actual common substrings.
Further, the grammatical similarity f1 is calculated as:

f1 = 2c / (a + b)
wherein c is the number of actual common substrings, a is the number of verbs in the verb sequence of the first text, and b is the number of verbs in the verb sequence of the second text.
Further, the step S4 specifically includes:
s41: constructing a feature item vector table in a semantic topic space P based on a semantic vector space model;
s42: respectively extracting all feature items in the first text and the second text to obtain a first text feature item set and a second text feature item set;
s43: respectively counting the occurrence times of each feature item in the first text feature item set and the second text feature item set;
s44: acquiring feature item vectors corresponding to feature items in the first text feature item set and the second text feature item set by using a feature item vector table;
s45: calculating a feature vector corresponding to the first text and a feature vector corresponding to the second text according to the feature item vector, and respectively carrying out standardization processing to obtain a first text feature vector and a second text feature vector;
s46: calculating the semantic similarity f2 of the first text and the second text according to the first text feature vector and the second text feature vector.
Furthermore, the feature vector Ti corresponding to the first text is calculated as:

Ti = Σ(k=1..n) fi,k · vk

wherein fi,k is the number of occurrences of the kth feature item in the first text feature item set, n is the number of feature items in the first text, and vk is the feature item vector in the semantic topic space P corresponding to the kth feature item in the first text feature item set;
the feature vector Tj corresponding to the second text is calculated as:

Tj = Σ(k=1..m) fj,k · vk

wherein fj,k is the number of occurrences of the kth feature item in the second text feature item set, m is the number of feature items in the second text, and vk is the feature item vector in the semantic topic space P corresponding to the kth feature item in the second text feature item set.
Further, the semantic similarity f2 is calculated as:

f2 = cos wi,j = (Ti · Tj) / (|Ti| · |Tj|)

wherein Ti is the first text feature vector, Tj is the second text feature vector, and wi,j is the included angle between the first text feature vector and the second text feature vector.
Further, the step S41 specifically includes:
s411: determining the semantic topic set VT = {τ1, τ2, …, τd} used in the semantic vector space model, thereby determining the semantic topic space P;
s412: determining text characteristic items of non-semantic subjects in a semantic vector space model, and recording the text characteristic items as a set VN
S413: expressing semantic subjects and feature items as a set V, taking elements of the set as nodes, taking semantic relations among the elements as edges, and organizing a semantic relation graph G (V, E);
s414: determining vectors corresponding to all semantic topics according to the semantic association graph G ═ V, E >;
s415: and calculating the vector representation of each feature item, and constructing a feature item vector table in the semantic topic space P.
Further preferably, the feature items are words in the text.
Further, the calculation formula of the similarity between texts is as follows:
f=α*f1+β*f2
wherein α is the grammar weighting coefficient, preferably 0.4, and β is the semantic weighting coefficient, preferably 0.6; the values are determined according to the relative weight of the grammatical structure and the semantic structure in text similarity measurement.
Compared with the prior art, the invention has the following advantages:
1) by introducing the concept of the "verb head word", the invention extends the range of stop words, takes the verb sequence remaining after stop-word removal as the text feature string, and calculates the grammatical similarity f1 between Chinese texts with a string-matching algorithm; the algorithm is simple and the calculation speed is improved;
2) the invention extracts the feature items of the two texts and performs weight calculation according to the TF-IDF method, uses semantic topics as the dimensions of the vector space to extract the text feature vectors, and calculates the semantic similarity f2; this effectively remedies the problem that taking words alone as the feature items of a text overlooks near-synonym substitution and synonyms written in different forms, and effectively improves the accuracy of the calculation result;
3) the invention combines the grammatical similarity f1 and the semantic similarity f2 between texts to obtain the inter-text similarity f as the final result, taking both grammar and semantics into account and thereby improving the accuracy of text similarity calculation.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a diagram illustrating a syntax similarity calculation process;
FIG. 3 is a schematic diagram of a semantic similarity calculation process;
FIG. 4 is a diagram illustrating the number of common substrings from text A to text B in the embodiment;
FIG. 5 is a diagram illustrating the number of common substrings from text B to text A in the embodiment.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.
Examples
As shown in FIG. 1, the invention provides a Chinese text similarity calculation method based on verbs, which comprises the following steps:
s1: acquiring a first text and a second text which need to be subjected to similarity calculation, and preprocessing the first text and the second text;
s2: respectively extracting verb sequences of the preprocessed first text and the preprocessed second text;
s3: calculating grammar similarity f of the first text and the second text based on the verb sequence1
S4: calculating semantic similarity f of the first text and the second text based on the preprocessed first text and the preprocessed second text2
S5: and calculating the similarity f between the texts of the first text and the second text by combining the grammar similarity and the semantic similarity.
Wherein the pretreatment specifically comprises: and segmenting the first text and the second text, and removing stop words.
In the embodiment, an open-source Chinese word segmentation component is used, and the two texts are segmented with this third-party segmentation library. During segmentation, the meaningless and dispensable words of the text are first placed into a pre-constructed removal word list, which facilitates the subsequent loading of the Sogou lexicon for calculating word frequencies and weights.
One finds some words, symbols and punctuation marks that carry little meaning for the text content but occur frequently. Words such as "this, too, right, did" appear in essentially any Chinese article, yet contribute little to it; their presence is dispensable, and removing them affects neither the specific meaning the article expresses nor its readability. Such words are therefore treated as stop words during preprocessing. In the embodiment, the stop-word list of the Machine Intelligence Laboratory of Sichuan University is adopted, and the meaningless words are filtered out by constructing a removal word list (Remove Words List).
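The preprocessing step above can be sketched in Python. This is a minimal illustration, not the patent's actual implementation: the tiny stop-word set and the sample tokens are made-up stand-ins for the Sichuan University stop-word list and a real segmenter's output.

```python
# Illustrative subset of a stop-word list ("removal word list").
STOP_WORDS = {"的", "了", "太", "对", "吗", "呢"}

def preprocess(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop-word list,
    preserving the order of the remaining tokens."""
    return [t for t in tokens if t not in stop_words]

# Tokens as they might come out of a Chinese word segmenter.
tokens = ["合同", "的", "条款", "了", "明确"]
print(preprocess(tokens))  # ['合同', '条款', '明确']
```

Removing stop words before the later steps shrinks the feature sets without changing the meaning the texts express.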
The method comprises three major parts: first, the grammatical similarity f1 of the two texts is calculated by extracting verbs; second, the semantic similarity f2 is calculated by extracting feature items and applying TF-IDF weighting; finally, the grammatical similarity f1 and the semantic similarity f2 are combined to obtain the inter-text similarity f. The three parts are described in detail below.
(I) Calculation of the grammatical similarity f1 of two texts by extracting verbs
Lü Shuxiang built a verb-centered syntactic model in his representative work, A Brief Outline of Chinese Grammar. When analyzing a sentence, the center of the sentence is the verb expressing the action; the nouns expressing the initiator, the end point and other aspects of the action all supplement the verb, and so may be collectively called "supplementary words". Besides the verb center, a sentence thus contains various supplementary words such as the initiator word, the end word, the recipient word and other complements. In other words, the meaning a sentence expresses is reflected in its central verb, so the sequence of the central verbs of all sentences in a paragraph reflects the central meaning of the paragraph; likewise, the sequence of the central verbs of all sentences in a text summarizes the central meaning of the full text. The verb sequence not only reflects the actions that occur in the text but also records the order in which they occur, so it can serve as the feature string of an article. The similarity of the feature strings of two texts then reflects the similarity between the texts.
The method comprises the following specific steps:
s31: respectively taking verb sequences of the first text and the second text as a first text characteristic character string and a second text characteristic character string;
s32: acquiring the number of common substrings from the first text characteristic character string to the second text characteristic character string, and recording the number as the number of the first common substrings;
s33: acquiring the number of common substrings from the second text characteristic character string to the first text characteristic character string, and recording the number as the number of the second common substrings;
s34: selecting the maximum public substring number from the first public substring number and the second public substring number as the actual public substring number;
s35: calculating the grammar similarity f of the first text and the second text by using the number of the actual common substrings1
As shown in fig. 2, suppose the two texts are text A and text B. After the verb sequence of each text is obtained, it can be regarded as a character string, giving the text A feature string and the text B feature string; the similarity of the two verb sequences is then obtained by counting the common substrings of the two feature strings. Suppose the verb sequence of text A is V1, V2, V3, V2, V4 and that of text B is V1, V3, V2, V4. The number of common substrings from the text A feature string to the text B feature string is shown in fig. 4, and the number from the text B feature string to the text A feature string in fig. 5. As figs. 4 and 5 show, the former is 3 and the latter is 4; the larger of the two is taken as the actual number of common substrings, which is therefore 4.
Finally, the grammatical similarity f1 is calculated as:

f1 = 2c / (a + b)
wherein c is the number of actual common substrings, a is the number of verbs in the verb sequence of the first text, and b is the number of verbs in the verb sequence of the second text.
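The grammatical-similarity step can be sketched in Python on the patent's own example (text A verbs V1 V2 V3 V2 V4, text B verbs V1 V3 V2 V4). Two points here are assumptions rather than things the patent states outright: `ordered_match_count` is one plausible reading of "number of common substrings from one feature string to the other" (a greedy in-order scan, which reproduces the counts 3 and 4 shown in figs. 4 and 5), and the Dice-style formula f1 = 2c / (a + b) is reconstructed from the variable definitions around the formula image.

```python
def ordered_match_count(src, dst):
    """Greedily match src tokens against dst left to right,
    never moving the dst pointer backwards; count the matches."""
    count, j = 0, 0
    for tok in src:
        for k in range(j, len(dst)):
            if dst[k] == tok:
                count += 1
                j = k + 1
                break
    return count

def grammar_similarity(verbs_a, verbs_b):
    # c = larger of the two directional counts (S34),
    # a, b = verb counts of the two texts.
    c = max(ordered_match_count(verbs_a, verbs_b),
            ordered_match_count(verbs_b, verbs_a))
    return 2 * c / (len(verbs_a) + len(verbs_b))

A = ["V1", "V2", "V3", "V2", "V4"]
B = ["V1", "V3", "V2", "V4"]
print(ordered_match_count(A, B))  # 3
print(ordered_match_count(B, A))  # 4
print(round(grammar_similarity(A, B), 3))  # 0.889  (2*4 / (5+4))
```

With the example's counts, c = 4, a = 5 and b = 4, giving f1 = 8/9 ≈ 0.889 under the reconstructed formula.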
(II) Calculation of the semantic similarity f2 by extracting feature items and applying the TF-IDF weighting method
The measurement of semantic similarity can draw on the vector model used in information retrieval. The basic idea of the vector space model is to represent texts as vectors; characters, words or phrases can be selected as the feature items.
When TF-IDF similarity is calculated in the VSM with words as the feature items of the text, the substitution of near-synonyms and of synonyms written in different forms is ignored, which reduces the accuracy of the calculation result. This problem can be solved effectively with a semantic dictionary: the commonly used semantic dictionaries, chiefly the synonym thesaurus Tongyici Cilin and HowNet, provide information on the related word concepts and thereby serve as the measure of word similarity. The method takes semantic topics as the dimensions of the vector space to extract feature vectors, adopting an approach based on corpus statistics: a group of word features is first selected, each word is then compared against these features to obtain its feature vector, and similarity is computed as the cosine of the angle between vectors. The specific steps are as follows:
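The TF-IDF weighting named above can be sketched with the textbook definition tf · log(N / df). The patent does not give its exact weighting formula, so this hand-rolled version over a made-up two-document corpus is an illustrative assumption.

```python
import math

# Toy corpus: two already-segmented, stop-word-free documents.
docs = [["签订", "合同", "条款"], ["审查", "合同"]]

def tf_idf(term, doc, corpus):
    """Textbook TF-IDF: term frequency within the document times
    the log of inverse document frequency across the corpus."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df)
    return tf * idf

print(tf_idf("条款", docs[0], docs))  # > 0: occurs in only one document
print(tf_idf("合同", docs[0], docs))  # 0.0: occurs in every document
```

A term appearing in every document gets weight zero, which is why TF-IDF favours feature items that discriminate between the two texts being compared.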
s41: constructing a feature item vector table in a semantic topic space P based on a semantic vector space model;
wherein S41 specifically includes:
s411: determining the semantic topic set VT = {τ1, τ2, …, τd} used in the semantic vector space model, thereby determining the semantic topic space P;
s412: determining text characteristic items of non-semantic subjects in a semantic vector space model, and recording the text characteristic items as a set VN
S413: expressing semantic subjects and feature items as a set V, taking elements of the set as nodes, taking semantic relations among the elements as edges, and organizing a semantic relation graph G (V, E);
s414: determining vectors corresponding to all semantic topics according to the semantic association graph G ═ V, E >;
s415: and calculating the vector representation of each feature item, and constructing a feature item vector table in the semantic topic space P.
S42: respectively extracting all feature items in the first text and the second text to obtain a first text feature item set and a second text feature item set;
s43: respectively counting the occurrence times of each feature item in the first text feature item set and the second text feature item set;
s44: acquiring feature item vectors corresponding to feature items in the first text feature item set and the second text feature item set by using a feature item vector table;
s45: calculating a feature vector corresponding to the first text and a feature vector corresponding to the second text according to the feature item vector, and respectively carrying out standardization processing to obtain a first text feature vector and a second text feature vector;
The feature vector Ti corresponding to the first text is calculated as:

Ti = Σ(k=1..n) fi,k · vk

wherein fi,k is the number of occurrences of the kth feature item in the first text feature item set, n is the number of feature items in the first text, and vk is the feature item vector in the semantic topic space P corresponding to the kth feature item in the first text feature item set;
the feature vector Tj corresponding to the second text is calculated as:

Tj = Σ(k=1..m) fj,k · vk

wherein fj,k is the number of occurrences of the kth feature item in the second text feature item set, m is the number of feature items in the second text, and vk is the feature item vector in the semantic topic space P corresponding to the kth feature item in the second text feature item set.
S46: calculating the semantic similarity f2 of the first text and the second text according to the first text feature vector and the second text feature vector.
The semantic similarity f2 is calculated as:

f2 = cos wi,j = (Ti · Tj) / (|Ti| · |Tj|)

wherein Ti is the first text feature vector, Tj is the second text feature vector, and wi,j is the included angle between the first text feature vector and the second text feature vector.
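Steps S45 and S46 can be sketched as follows: build each text's feature vector as the occurrence-weighted sum of its feature-item vectors in the semantic topic space, then take the cosine of the angle between the two vectors as f2. The 3-dimensional topic space and the item vectors below are made-up illustrative values, not the patent's feature-item vector table.

```python
import math

def text_vector(counts, item_vectors):
    """Weighted sum over feature items: sum of count * item-vector."""
    dim = len(next(iter(item_vectors.values())))
    vec = [0.0] * dim
    for item, cnt in counts.items():
        for d in range(dim):
            vec[d] += cnt * item_vectors[item][d]
    return vec

def cosine(u, v):
    """Cosine of the included angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical feature-item vectors in a 3-dimensional topic space P.
item_vectors = {"合同": [1.0, 0.2, 0.0],
                "条款": [0.8, 0.4, 0.1],
                "审查": [0.1, 0.9, 0.3]}

t1 = text_vector({"合同": 2, "条款": 1}, item_vectors)  # first text
t2 = text_vector({"合同": 1, "审查": 1}, item_vectors)  # second text
f2 = cosine(t1, t2)
print(round(f2, 3))
```

Because both texts share the item "合同", the cosine lands well above zero even though the remaining items differ; this is the effect of comparing texts in the topic space rather than on raw word overlap.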
(III) Combining the grammatical similarity f1 and the semantic similarity f2 into the inter-text similarity f
After the semantic similarity f2 and the grammatical similarity f1 of the two texts are obtained, the total similarity, i.e. the inter-text similarity f, is calculated as:
f=α*f1+β*f2
wherein α is the grammar weighting coefficient, preferably 0.4, and β is the semantic weighting coefficient, preferably 0.6; the values are determined according to the relative weight of the grammatical structure and the semantic structure in text similarity measurement.
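The final combination step is a one-line weighted sum; the sketch below uses the preferred weights α = 0.4 and β = 0.6 from the text, with arbitrary example values for f1 and f2.

```python
def combined_similarity(f1, f2, alpha=0.4, beta=0.6):
    """Inter-text similarity f = alpha * f1 + beta * f2."""
    return alpha * f1 + beta * f2

# Example: grammatical similarity 0.8, semantic similarity 0.9.
print(combined_similarity(0.8, 0.9))  # 0.4*0.8 + 0.6*0.9 = 0.86
```

Since α + β = 1 and both similarities lie in [0, 1], the combined f also lies in [0, 1].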
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and those skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A verb-based Chinese text similarity calculation method is characterized by comprising the following steps:
s1: acquiring a first text and a second text which need to be subjected to similarity calculation, and preprocessing the first text and the second text;
s2: respectively extracting verb sequences of the preprocessed first text and the preprocessed second text;
s3: calculating grammar similarity f of the first text and the second text based on the verb sequence1
S4: calculating semantic similarity f of the first text and the second text based on the preprocessed first text and the preprocessed second text2
S5: and calculating the similarity f between the texts of the first text and the second text by combining the grammar similarity and the semantic similarity.
2. The verb-based Chinese text similarity calculation method according to claim 1, wherein the preprocessing specifically comprises:
and segmenting the first text and the second text, and removing stop words.
3. The method for calculating similarity of Chinese text based on verbs as claimed in claim 1, wherein said step S3 further includes:
s31: respectively taking verb sequences of the first text and the second text as a first text characteristic character string and a second text characteristic character string;
s32: acquiring the number of common substrings from the first text characteristic character string to the second text characteristic character string, and recording the number as the number of the first common substrings;
s33: acquiring the number of common substrings from the second text characteristic character string to the first text characteristic character string, and recording the number as the number of the second common substrings;
s34: selecting the maximum public substring number from the first public substring number and the second public substring number as the actual public substring number;
s35: calculating the grammatical similarity f1 of the first text and the second text using the number of actual common substrings.
4. The verb-based Chinese text similarity calculation method according to claim 3, wherein the grammatical similarity f1 is calculated as:

f1 = 2c / (a + b)
wherein c is the number of actual common substrings, a is the number of verbs in the verb sequence of the first text, and b is the number of verbs in the verb sequence of the second text.
5. The verb-based Chinese text similarity calculation method according to claim 4, wherein said step S4 specifically comprises:
s41: constructing a feature item vector table in a semantic topic space P based on a semantic vector space model;
s42: respectively extracting all feature items in the first text and the second text to obtain a first text feature item set and a second text feature item set;
s43: respectively counting the occurrence times of each feature item in the first text feature item set and the second text feature item set;
s44: acquiring feature item vectors corresponding to feature items in the first text feature item set and the second text feature item set by using a feature item vector table;
s45: calculating a feature vector corresponding to the first text and a feature vector corresponding to the second text according to the feature item vector, and respectively carrying out standardization processing to obtain a first text feature vector and a second text feature vector;
s46: calculating the semantic similarity f2 of the first text and the second text according to the first text feature vector and the second text feature vector.
6. The method of claim 5, wherein the feature vector Ti corresponding to the first text is calculated as:

Ti = Σ(k=1..n) fi,k · vk

wherein fi,k is the number of occurrences of the kth feature item in the first text feature item set, n is the number of feature items in the first text, and vk is the feature item vector in the semantic topic space P corresponding to the kth feature item in the first text feature item set;
the feature vector V_j corresponding to the second text is calculated as:

V_j = Σ_{k=1}^{m} f_{j,k} * v_{j,k}

wherein f_{j,k} is the number of occurrences of the k-th feature item in the second text feature item set, m is the number of all feature items in the second text, and v_{j,k} is the feature item vector corresponding to the k-th feature item of the second text feature item set in the semantic topic space P.
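A minimal sketch of the feature-vector construction of claim 6, under the assumption that the equation (an image in the original) is the weighted sum of feature-item vectors, followed by the length normalization ("standardization") of step S45. The vector table and counts here are hypothetical.

```python
import numpy as np

def text_feature_vector(feature_counts, vector_table):
    """V = sum_k f_k * v_k over the text's feature items (claim 6),
    then length-normalized as in step S45."""
    dim = len(next(iter(vector_table.values())))
    v = np.zeros(dim)
    for item, count in feature_counts.items():
        v += count * np.asarray(vector_table[item], dtype=float)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

# Hypothetical 3-dimensional semantic topic space P.
table = {"grid": [1.0, 0.0, 0.0], "repair": [0.0, 1.0, 0.0]}
v_i = text_feature_vector({"grid": 2, "repair": 1}, table)
```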
7. The verb-based Chinese text similarity calculation method according to claim 6, wherein the semantic similarity f2 is calculated as:

f2 = cos(w_{i,j}) = (V_i · V_j) / (|V_i| * |V_j|)

wherein V_i is the first text feature vector, V_j is the second text feature vector, and w_{i,j} is the included angle between the first text feature vector and the second text feature vector.
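The semantic similarity of claim 7 is the standard cosine of the angle between the two feature vectors; a minimal sketch (not part of the claims):

```python
import numpy as np

def semantic_similarity(v_i, v_j):
    """f2 = cos(w_ij) = (V_i . V_j) / (|V_i| * |V_j|)."""
    denom = np.linalg.norm(v_i) * np.linalg.norm(v_j)
    return float(np.dot(v_i, v_j) / denom) if denom else 0.0
```

If the feature vectors were already length-normalized in step S45, the denominator is 1 and f2 reduces to the dot product.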
8. The verb-based Chinese text similarity calculation method according to claim 5, wherein said step S41 specifically comprises:
s411: determining the semantic topic set V_T = {τ_1, τ_2, …, τ_d} used in the semantic vector space model, thereby determining the semantic topic space P;
s412: determining the text feature items in the semantic vector space model that are not semantic topics, recorded as a set V_N;
s413: expressing the semantic topics and the feature items together as a set V, taking the elements of the set as nodes and the semantic relations among the elements as edges, and organizing a semantic association graph G = <V, E>;
s414: determining the vectors corresponding to all semantic topics according to the semantic association graph G = <V, E>;
s415: calculating the vector representation of each feature item, and constructing the feature item vector table in the semantic topic space P.
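One plausible realization of steps S411-S415 (the claim does not fix how the topic vectors or item vectors are derived): each semantic topic spans one orthogonal axis of P, and each non-topic feature item is placed at the mean of the topic vectors it is linked to in G = <V, E>. All names and edges below are hypothetical.

```python
import numpy as np

# S411/S414: each semantic topic spans one axis of the topic space P.
topics = ["equipment", "safety", "planning"]   # hypothetical topic set V_T
topic_vec = {t: np.eye(len(topics))[k] for k, t in enumerate(topics)}

# S412/S413: non-topic feature items (V_N) and their semantic relations,
# i.e. the edges E of the association graph G = <V, E> (hypothetical data).
edges = {"overhaul": ["equipment", "safety"], "schedule": ["planning"]}

# S415: place each non-topic item at the mean of its topic neighbours,
# yielding the feature item vector table over P.
feature_table = dict(topic_vec)
for item, neighbours in edges.items():
    feature_table[item] = np.mean([topic_vec[t] for t in neighbours], axis=0)
```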
9. The method of claim 8, wherein the feature items are words in the text.
10. The verb-based Chinese text similarity calculation method according to claim 7, wherein the inter-text similarity is calculated as:

f = α * f1 + β * f2

wherein α is the grammar weighting coefficient and β is the semantic weighting coefficient.
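A minimal sketch of the weighted combination in claim 10. The weight values α = 0.4 and β = 0.6 are hypothetical; the claim leaves them unspecified.

```python
def text_similarity(f1, f2, alpha=0.4, beta=0.6):
    """f = alpha * f1 + beta * f2 (claim 10): grammar similarity f1 and
    semantic similarity f2 blended by hypothetical weights."""
    return alpha * f1 + beta * f2
```

Choosing alpha + beta = 1 keeps f on the same [0, 1] scale as f1 and f2.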
CN202010450674.7A 2020-05-25 2020-05-25 Verb-based Chinese text similarity calculation method Pending CN111814456A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010450674.7A CN111814456A (en) 2020-05-25 2020-05-25 Verb-based Chinese text similarity calculation method


Publications (1)

Publication Number Publication Date
CN111814456A true CN111814456A (en) 2020-10-23

Family

ID=72848023

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010450674.7A Pending CN111814456A (en) 2020-05-25 2020-05-25 Verb-based Chinese text similarity calculation method

Country Status (1)

Country Link
CN (1) CN111814456A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112883165A (en) * 2021-03-16 2021-06-01 山东亿云信息技术有限公司 Intelligent full-text retrieval method and system based on semantic understanding

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012043294A (en) * 2010-08-20 2012-03-01 Kddi Corp Binomial relationship categorization program, method, and device for categorizing semantically similar word pair by binomial relationship
CN108549634A (en) * 2018-04-09 2018-09-18 北京信息科技大学 A kind of Chinese patent text similarity calculating method


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIU, XIAOJUN; ZHAO, DONG; YAO, WEIDONG: "A dual-factor similarity algorithm for Chinese text duplicate checking" (一种用于中文文本查重的双因子相似度算法), Computer Simulation (计算机仿真), no. 12, pages 2 - 3 *
HUANG, JU: "An assignment duplicate-checking algorithm based on a semantic vector space model" (一种基于语义向量空间模型的作业查重算法), Electronic Science & Technology (电子科学技术), no. 06, pages 2 - 3 *


Similar Documents

Publication Publication Date Title
Suleiman et al. Deep learning based technique for plagiarism detection in Arabic texts
Oudah et al. NERA 2.0: Improving coverage and performance of rule-based named entity recognition for Arabic
Ulčar et al. High quality ELMo embeddings for seven less-resourced languages
CN108073571B (en) Multi-language text quality evaluation method and system and intelligent text processing system
Gao et al. Named entity recognition method of Chinese EMR based on BERT-BiLSTM-CRF
CN113761890B (en) Multi-level semantic information retrieval method based on BERT context awareness
Ren et al. Detecting the scope of negation and speculation in biomedical texts by using recursive neural network
Al-Harbi et al. Lexical disambiguation in natural language questions (nlqs)
Wadud et al. Text coherence analysis based on misspelling oblivious word embeddings and deep neural network
Zhang et al. Chinese-English mixed text normalization
Sornlertlamvanich et al. Thai Named Entity Recognition Using BiLSTM-CNN-CRF Enhanced by TCC
Aejas et al. Named entity recognition for cultural heritage preservation
CN111814456A (en) Verb-based Chinese text similarity calculation method
pal Singh et al. Naive Bayes classifier for word sense disambiguation of Punjabi language
Khoufi et al. Chunking Arabic texts using conditional random fields
Tongtep et al. Multi-stage automatic NE and pos annotation using pattern-based and statistical-based techniques for thai corpus construction
Abdolahi et al. A new method for sentence vector normalization using word2vec
Rebala et al. Natural language processing
Jamwal Named entity recognition for Dogri using ML
Prasad et al. Lexicon based extraction and opinion classification of associations in text from Hindi weblogs
Golubev et al. Use of augmentation and distant supervision for sentiment analysis in Russian
Bafna et al. BaSa: A Technique to Identify Context based Common Tokens for Hindi Verses and Proses
Yuan et al. Semantic based chinese sentence sentiment analysis
Bharti et al. Sarcasm as a contradiction between a tweet and its temporal facts: a pattern-based approach
Liu et al. Domain phrase identification using atomic word formation in Chinese text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination