CN111814456A - Verb-based Chinese text similarity calculation method - Google Patents
- Publication number
- CN111814456A (application CN202010450674.7A)
- Authority
- CN
- China
- Prior art keywords
- text
- feature
- semantic
- similarity
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/253—Grammatical analysis; Style critique
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/18—Legal services; Handling legal documents
Abstract
The invention relates to a verb-based Chinese text similarity calculation method, which comprises the following steps. S1: acquiring a first text and a second text whose similarity is to be calculated, and preprocessing the first text and the second text; S2: extracting the verb sequences of the preprocessed first text and the preprocessed second text respectively; S3: calculating the grammar similarity f1 of the first text and the second text based on the verb sequences; S4: calculating the semantic similarity f2 of the first text and the second text based on the preprocessed first text and the preprocessed second text; S5: combining the grammar similarity and the semantic similarity to calculate the inter-text similarity f of the first text and the second text. Compared with the prior art, the method improves both calculation accuracy and calculation speed.
Description
Technical Field
The invention relates to the technical field of semantic analysis, in particular to a verb-based Chinese text similarity calculation method.
Background
In information processing, text similarity calculation is widely applied in information retrieval, machine translation, automatic question-answering systems, text mining and related fields. It is a fundamental and key problem that has long been a focus and difficulty of research.
In recent years, some methods have proposed using text similarity calculation to review signed contracts intelligently and to warn automatically of potential legal risks in contract texts. This further extends the applications of Chinese text similarity calculation and places new requirements on it.
Existing text similarity calculation methods include string-based, ontology-based and corpus-based methods. String-based methods consider only literal matching or co-occurrence of character strings and ignore the semantic information contained in a text. Ontology-based methods are limited by the scale of the manually constructed ontology and cannot compute similarity for words outside it. Corpus-based methods train word vectors through a neural network to represent sentences in vector form, and can capture grammatical and semantic information in a text to a certain extent.
However, none of these methods incorporates the rules and experience of Chinese linguistics into natural language processing, nor are the methods combined effectively, so they cannot calculate Chinese text similarity both efficiently and accurately. Contract review concerns the vital interests of both signing parties. In power grid engineering construction, for example, the formulation of contract terms is a crucial link: if the terms leave responsibilities undefined, there are risks of disputes and losses, so accurate and rigorous review is required. The conventional Chinese text similarity calculation methods are therefore unsuitable for intelligent contract review, and a new method is needed that calculates Chinese text similarity efficiently and accurately.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a verb-based Chinese text similarity calculation method for improving the calculation accuracy and calculation speed.
The purpose of the invention can be realized by the following technical scheme:
a Chinese text similarity calculation method based on verbs comprises the following steps:
S1: acquiring a first text and a second text whose similarity is to be calculated, and preprocessing the first text and the second text;
S2: extracting the verb sequences of the preprocessed first text and the preprocessed second text respectively;
S3: calculating the grammar similarity f1 of the first text and the second text based on the verb sequences;
S4: calculating the semantic similarity f2 of the first text and the second text based on the preprocessed first text and the preprocessed second text;
S5: combining the grammar similarity and the semantic similarity to calculate the inter-text similarity f of the first text and the second text.
Further, the preprocessing specifically comprises: segmenting the first text and the second text into words, and removing stop words.
During word segmentation, many words, symbols and punctuation marks can be found that contribute little to the content of the text but occur with high frequency. Words such as "this", "too", "right" and "did" appear in essentially every Chinese article, yet carry almost no meaning there; their presence is optional, and removing them affects neither the specific meaning the article expresses nor its readability. Such words are therefore removed as stop words during preprocessing: the stop-word list of the Machine Intelligence Laboratory of Sichuan University is adopted, and the meaningless words are filtered out by constructing a removal word list.
Further, the step S3 specifically includes:
S31: taking the verb sequences of the first text and the second text as the first text feature string and the second text feature string respectively;
S32: obtaining the number of common substrings from the first text feature string to the second text feature string, recorded as the first common-substring count;
S33: obtaining the number of common substrings from the second text feature string to the first text feature string, recorded as the second common-substring count;
S34: selecting the larger of the first common-substring count and the second common-substring count as the actual common-substring count;
S35: calculating the grammar similarity f1 of the first text and the second text from the actual common-substring count.
Further, the calculation formula of the grammar similarity f1 is as follows:
where c is the actual common-substring count, a is the number of verbs in the verb sequence of the first text, and b is the number of verbs in the verb sequence of the second text.
Further, the step S4 specifically includes:
S41: constructing a feature-item vector table in the semantic topic space P based on a semantic vector space model;
S42: extracting all feature items in the first text and the second text respectively to obtain the first text feature item set and the second text feature item set;
S43: counting the number of occurrences of each feature item in the first text feature item set and the second text feature item set respectively;
S44: obtaining, from the feature-item vector table, the feature-item vectors corresponding to the feature items in the first text feature item set and the second text feature item set;
S45: calculating the feature vector corresponding to the first text and the feature vector corresponding to the second text from the feature-item vectors, and normalizing each to obtain the first text feature vector and the second text feature vector;
S46: calculating the semantic similarity f2 of the first text and the second text from the first text feature vector and the second text feature vector.
Furthermore, the calculation formula of the feature vector corresponding to the first text is as follows:
where f_{i,k} is the number of occurrences of the k-th feature item in the first text feature item set, n is the number of feature items in the first text, and the corresponding vector is the feature-item vector of the k-th feature item of the first text feature item set in the semantic topic space P;
where f_{j,k} is the number of occurrences of the k-th feature item in the second text feature item set, m is the number of feature items in the second text, and the corresponding vector is the feature-item vector of the k-th feature item of the second text feature item set in the semantic topic space P.
Further, the calculation formula of the semantic similarity f2 is as follows (the cosine of the angle between the two feature vectors):
where the two vectors in the formula are the first text feature vector and the second text feature vector, and w_{i,j} is the angle between the first text feature vector and the second text feature vector.
Further, the step S41 specifically includes:
S411: determining the semantic topic set V_T = {τ1, τ2, …, τd} used in the semantic vector space model, thereby determining the semantic topic space P;
S412: determining the non-topic text feature items in the semantic vector space model, recorded as the set V_N;
S413: representing the semantic topics and feature items together as a set V, taking the elements of V as nodes and the semantic relations between the elements as edges, and organizing a semantic relation graph G = <V, E>;
S414: determining the vector corresponding to each semantic topic from the semantic relation graph G = <V, E>;
S415: calculating the vector representation of each feature item and constructing the feature-item vector table in the semantic topic space P.
Further preferably, the feature items are words in the text.
Further, the calculation formula of the similarity between texts is as follows:
f=α*f1+β*f2
where α is the grammar weighting coefficient, whose value is preferably 0.4, and β is the semantic weighting coefficient, whose value is preferably 0.6; the values are determined by the relative weight of grammatical structure and semantic structure in text similarity measurement.
Compared with the prior art, the invention has the following advantages:
1) By introducing the concept of the "verb head word", the invention extends the range of stop words; the verb sequence remaining after stop-word removal serves as the text feature string, and the grammar similarity f1 between Chinese texts is calculated with a string-matching algorithm. The algorithm is simple and the calculation speed is improved;
2) The invention extracts the feature items of the two texts by the TF-IDF method and performs weight calculation; taking semantic topics as the dimensions of the vector space, it extracts the feature vectors of the texts and calculates the semantic similarity f2. This effectively avoids the problem that taking words alone as text feature items overlooks near-synonym substitution and synonymous variant forms, and effectively improves the accuracy of the calculation result;
3) The invention combines the grammar similarity f1 and the semantic similarity f2 between texts to obtain the inter-text similarity f as the final result, taking grammar and semantics into account simultaneously and thereby improving the accuracy of text similarity calculation.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a diagram illustrating a syntax similarity calculation process;
FIG. 3 is a schematic diagram of a semantic similarity calculation process;
FIG. 4 is a diagram illustrating the number of common substrings from text A to text B in the embodiment;
FIG. 5 is a diagram illustrating the number of common substrings from text B to text A in the embodiment.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.
Examples
As shown in FIG. 1, the invention provides a Chinese text similarity calculation method based on verbs, which comprises the following steps:
S1: acquiring a first text and a second text whose similarity is to be calculated, and preprocessing the first text and the second text;
S2: extracting the verb sequences of the preprocessed first text and the preprocessed second text respectively;
S3: calculating the grammar similarity f1 of the first text and the second text based on the verb sequences;
S4: calculating the semantic similarity f2 of the first text and the second text based on the preprocessed first text and the preprocessed second text;
S5: combining the grammar similarity and the semantic similarity to calculate the inter-text similarity f of the first text and the second text.
The preprocessing specifically comprises: segmenting the first text and the second text into words, and removing stop words.
In this embodiment, word segmentation is performed with an open-source Chinese word segmentation component (the jieba segmenter), applied to both texts through its third-party library. During segmentation, the meaningless and optional words of the text are first placed into a pre-constructed removal word list, which facilitates subsequently loading the Sogou lexicon to calculate word frequency and weight.
Some words, symbols and punctuation marks can be found that contribute little to the content of the text but occur with high frequency. Words such as "this", "too", "right" and "did" appear in essentially every Chinese article, yet carry almost no meaning there; their presence is optional, and removing them affects neither the specific meaning the article expresses nor its readability. Such words are therefore removed as stop words during preprocessing. In this embodiment, the stop-word list of the Machine Intelligence Laboratory of Sichuan University is adopted, and the meaningless words are filtered out by constructing a removal word list (Remove Words List).
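The stop-word filtering step can be sketched as follows. The removal word list here is a tiny hypothetical subset standing in for the Sichuan University stop-word list, and the token list stands in for real segmenter output:

```python
# Minimal sketch of stop-word removal after word segmentation.
# REMOVE_WORDS_LIST is a tiny hypothetical subset; the patent uses the
# full stop-word list of the Sichuan University Machine Intelligence Laboratory.
REMOVE_WORDS_LIST = {"的", "了", "太", "对", "吗", "呢", "，", "。"}

def remove_stop_words(tokens):
    """Drop high-frequency, low-content tokens from a segmented text."""
    return [t for t in tokens if t not in REMOVE_WORDS_LIST]

# Tokens as they might come out of a word segmenter (illustrative only).
tokens = ["合同", "条款", "约定", "了", "双方", "的", "责任", "。"]
print(remove_stop_words(tokens))  # ['合同', '条款', '约定', '双方', '责任']
```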
The method comprises three main parts. First, the grammar similarity f1 of the two texts is calculated by extracting verbs; second, the semantic similarity f2 is calculated by extracting feature items and applying TF-IDF weighting; finally, f1 and f2 are combined to obtain the inter-text similarity f. The three parts are described in detail below.
(I) Calculating the grammar similarity f1 of the two texts by extracting verbs
The linguist Lü Shuxiang built a verb-centred syntactic model in his representative work on Chinese grammar. When a sentence is analysed, its centre is the verb expressing an action; the nouns expressing the initiator, the endpoint and the related aspects of the action are all supplements to the verb, and may generally be called "supplementary words". Besides the verb centre, a sentence therefore contains various supplementary words, such as the starting word, the stopping word, the receiving word and several kinds of complement. In other words, the meaning a sentence expresses is concentrated on its central verb, so the sequence of central verbs of all sentences in a paragraph reflects the central meaning of the paragraph; likewise, the sequence of central verbs of all sentences in a text summarises the central meaning of the full text. The verb sequence thus reflects not only the actions that occur in the text but also the order in which they occur, and can therefore serve as the feature string of an article. The similarity of the feature strings of two texts then reflects the similarity of the texts.
The method comprises the following specific steps:
S31: taking the verb sequences of the first text and the second text as the first text feature string and the second text feature string respectively;
S32: obtaining the number of common substrings from the first text feature string to the second text feature string, recorded as the first common-substring count;
S33: obtaining the number of common substrings from the second text feature string to the first text feature string, recorded as the second common-substring count;
S34: selecting the larger of the first common-substring count and the second common-substring count as the actual common-substring count;
S35: calculating the grammar similarity f1 of the first text and the second text from the actual common-substring count.
As shown in FIG. 2, suppose the two texts are text A and text B. After the verb sequence of each is obtained, it can be treated as a character string, giving the text-A feature string and the text-B feature string; the similarity of the two verb sequences is then obtained by counting the common substrings of the two feature strings. Suppose the verb sequence of text A is V1, V2, V3, V2, V4 and that of text B is V1, V3, V2, V4. The number of common substrings from the text-A feature string to the text-B feature string is shown in FIG. 4, and the number from the text-B feature string to the text-A feature string in FIG. 5. As can be seen from FIGS. 4 and 5, the former count is 3 and the latter is 4; the larger of the two is taken as the actual common-substring count, which is therefore 4.
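The directional counts in FIGS. 4 and 5 are consistent with a left-to-right greedy match of one verb sequence against the other. The sketch below is one such interpretation (the exact matching rule is defined by the figures, which are not reproduced here); under this assumption it reproduces the example counts of 3 and 4:

```python
def directional_common_count(src, tgt):
    """Count src verbs that can be matched, in order, against tgt.

    A cursor scans tgt left to right; each src verb found at or after
    the cursor counts once and advances the cursor past the match.
    """
    cursor, count = 0, 0
    for verb in src:
        k = cursor
        while k < len(tgt) and tgt[k] != verb:
            k += 1
        if k < len(tgt):
            count += 1
            cursor = k + 1
    return count

# Verb sequences of the example texts A and B.
A = ["V1", "V2", "V3", "V2", "V4"]
B = ["V1", "V3", "V2", "V4"]

c_ab = directional_common_count(A, B)  # 3, as in FIG. 4
c_ba = directional_common_count(B, A)  # 4, as in FIG. 5
c = max(c_ab, c_ba)                    # actual common-substring count: 4
```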
Finally, the grammar similarity f1 is obtained; its calculation formula is as follows:
where c is the actual common-substring count, a is the number of verbs in the verb sequence of the first text, and b is the number of verbs in the verb sequence of the second text.
(II) Calculating the semantic similarity f2 by extracting feature items and applying TF-IDF weighting
The measurement of semantic similarity can draw on the vector model used in information retrieval. The basic idea of the vector space model is to represent texts as vectors; characters, words or phrases can be selected as the feature items.
In the TF-IDF similarity calculation of the VSM, words are used as the feature items of a text, which overlooks near-synonym substitution and synonymous variant forms and thus lowers the accuracy of the result. This problem can be solved effectively with a semantic dictionary: the commonly used semantic dictionaries, chiefly the synonym thesaurus Cilin and HowNet, measure word similarity from the information about word concepts that they provide. The present method takes semantic topics as the dimensions of the vector space for extracting feature vectors. Using a corpus-statistics approach, a group of word features is first selected; each word is then compared with this group of features to obtain its feature vector, and similarity is calculated as the cosine of the angle between vectors. The specific steps are as follows:
S41: constructing a feature-item vector table in the semantic topic space P based on a semantic vector space model.
Step S41 specifically comprises:
S411: determining the semantic topic set V_T = {τ1, τ2, …, τd} used in the semantic vector space model, thereby determining the semantic topic space P;
S412: determining the non-topic text feature items in the semantic vector space model, recorded as the set V_N;
S413: representing the semantic topics and feature items together as a set V, taking the elements of V as nodes and the semantic relations between the elements as edges, and organizing a semantic relation graph G = <V, E>;
S414: determining the vector corresponding to each semantic topic from the semantic relation graph G = <V, E>;
S415: calculating the vector representation of each feature item and constructing the feature-item vector table in the semantic topic space P.
S42: extracting all feature items in the first text and the second text respectively to obtain the first text feature item set and the second text feature item set;
S43: counting the number of occurrences of each feature item in the first text feature item set and the second text feature item set respectively;
S44: obtaining, from the feature-item vector table, the feature-item vectors corresponding to the feature items in the two sets;
S45: calculating the feature vector corresponding to the first text and the feature vector corresponding to the second text from the feature-item vectors, and normalizing each to obtain the first text feature vector and the second text feature vector;
where f_{i,k} is the number of occurrences of the k-th feature item in the first text feature item set, n is the number of feature items in the first text, and the corresponding vector is the feature-item vector of the k-th feature item of the first text feature item set in the semantic topic space P;
where f_{j,k} is the number of occurrences of the k-th feature item in the second text feature item set, m is the number of feature items in the second text, and the corresponding vector is the feature-item vector of the k-th feature item of the second text feature item set in the semantic topic space P.
S46: calculating the semantic similarity f2 of the first text and the second text from the first text feature vector and the second text feature vector.
The semantic similarity f2 is calculated as the cosine of the angle between the two text feature vectors:
where the two vectors in the formula are the first text feature vector and the second text feature vector, and w_{i,j} is the angle between the first text feature vector and the second text feature vector.
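Steps S42 to S46 can be sketched end to end. The feature-item vector table below is a toy stand-in for the table built in S41 — all words and vector values are hypothetical — and f2 is computed as the cosine of the angle between the two normalized text feature vectors:

```python
import math

# Hypothetical feature-item vector table: each word mapped to a vector in a
# 3-dimensional semantic topic space P (values are illustrative only).
VECTOR_TABLE = {
    "合同": [1.0, 0.0, 0.0],
    "协议": [0.9, 0.1, 0.0],   # near-synonym of 合同, so it shares a topic direction
    "违约": [0.0, 1.0, 0.0],
    "赔偿": [0.0, 0.8, 0.2],
}

def text_feature_vector(tokens):
    """TF-weighted sum of feature-item vectors, L2-normalized (steps S43-S45)."""
    vec = [0.0, 0.0, 0.0]
    for t in tokens:
        for i, x in enumerate(VECTOR_TABLE.get(t, [0.0, 0.0, 0.0])):
            vec[i] += x
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def semantic_similarity(tokens_a, tokens_b):
    """f2 = cosine of the angle between the two normalized feature vectors (S46)."""
    va, vb = text_feature_vector(tokens_a), text_feature_vector(tokens_b)
    return sum(x * y for x, y in zip(va, vb))

# Near-synonymous texts score high even without literal word overlap.
f2 = semantic_similarity(["合同", "违约"], ["协议", "赔偿"])
```

Because near-synonyms share topic-space directions, two texts with no words in common can still receive a high f2, which is exactly the weakness of plain word-level TF-IDF that this step addresses.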
(III) Combining the grammar similarity f1 and the semantic similarity f2 into the inter-text similarity f
After the semantic similarity f2 and the grammar similarity f1 of the two texts are obtained, the total similarity, i.e. the inter-text similarity f, is calculated by:
f=α*f1+β*f2
where α is the grammar weighting coefficient, whose value is preferably 0.4, and β is the semantic weighting coefficient, whose value is preferably 0.6; the values are determined by the relative weight of grammatical structure and semantic structure in text similarity measurement.
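The final combination in S5 is then a one-line weighted sum, using the preferred weights α = 0.4 and β = 0.6:

```python
def text_similarity(f1, f2, alpha=0.4, beta=0.6):
    """Inter-text similarity f = alpha*f1 + beta*f2 (preferred weights from the patent)."""
    return alpha * f1 + beta * f2

f = text_similarity(0.5, 0.8)  # 0.4*0.5 + 0.6*0.8 = 0.68
```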
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and those skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A verb-based Chinese text similarity calculation method, characterized by comprising the following steps:
S1: acquiring a first text and a second text whose similarity is to be calculated, and preprocessing the first text and the second text;
S2: extracting the verb sequences of the preprocessed first text and the preprocessed second text respectively;
S3: calculating the grammar similarity f1 of the first text and the second text based on the verb sequences;
S4: calculating the semantic similarity f2 of the first text and the second text based on the preprocessed first text and the preprocessed second text;
S5: combining the grammar similarity and the semantic similarity to calculate the inter-text similarity f of the first text and the second text.
2. The verb-based Chinese text similarity calculation method according to claim 1, wherein the preprocessing specifically comprises:
segmenting the first text and the second text into words, and removing stop words.
3. The verb-based Chinese text similarity calculation method according to claim 1, characterized in that said step S3 specifically comprises:
S31: taking the verb sequences of the first text and the second text as the first text feature string and the second text feature string respectively;
S32: obtaining the number of common substrings from the first text feature string to the second text feature string, recorded as the first common-substring count;
S33: obtaining the number of common substrings from the second text feature string to the first text feature string, recorded as the second common-substring count;
S34: selecting the larger of the first common-substring count and the second common-substring count as the actual common-substring count;
S35: calculating the grammar similarity f1 of the first text and the second text from the actual common-substring count.
4. The verb-based Chinese text similarity calculation method according to claim 3, characterized in that the calculation formula of the grammar similarity f1 is as follows:
where c is the actual common-substring count, a is the number of verbs in the verb sequence of the first text, and b is the number of verbs in the verb sequence of the second text.
5. The verb-based Chinese text similarity calculation method according to claim 4, characterized in that said step S4 specifically comprises:
S41: constructing a feature-item vector table in the semantic topic space P based on a semantic vector space model;
S42: extracting all feature items in the first text and the second text respectively to obtain the first text feature item set and the second text feature item set;
S43: counting the number of occurrences of each feature item in the first text feature item set and the second text feature item set respectively;
S44: obtaining, from the feature-item vector table, the feature-item vectors corresponding to the feature items in the first text feature item set and the second text feature item set;
S45: calculating the feature vector corresponding to the first text and the feature vector corresponding to the second text from the feature-item vectors, and normalizing each to obtain the first text feature vector and the second text feature vector;
S46: calculating the semantic similarity f2 of the first text and the second text from the first text feature vector and the second text feature vector.
6. The verb-based Chinese text similarity calculation method according to claim 5, characterized in that the calculation formula of the feature vector corresponding to the first text is as follows:
where f_{i,k} is the number of occurrences of the k-th feature item in the first text feature item set, n is the number of feature items in the first text, and the corresponding vector is the feature-item vector of the k-th feature item of the first text feature item set in the semantic topic space P;
where f_{j,k} is the number of occurrences of the k-th feature item in the second text feature item set, m is the number of feature items in the second text, and the corresponding vector is the feature-item vector of the k-th feature item of the second text feature item set in the semantic topic space P.
7. The verb-based Chinese text similarity calculation method according to claim 6, characterized in that the calculation formula of the semantic similarity f2 is as follows:
8. The verb-based Chinese text similarity calculation method according to claim 5, characterized in that said step S41 specifically comprises:
S411: determining the semantic topic set V_T = {τ1, τ2, …, τd} used in the semantic vector space model, thereby determining the semantic topic space P;
S412: determining the non-topic text feature items in the semantic vector space model, recorded as the set V_N;
S413: representing the semantic topics and feature items together as a set V, taking the elements of V as nodes and the semantic relations between the elements as edges, and organizing a semantic relation graph G = <V, E>;
S414: determining the vector corresponding to each semantic topic from the semantic relation graph G = <V, E>;
S415: calculating the vector representation of each feature item and constructing the feature-item vector table in the semantic topic space P.
9. The method of claim 8, wherein the feature items are words in the text.
10. The verb-based Chinese text similarity calculation method according to claim 7, characterized in that the calculation formula of the inter-text similarity is:
f=α*f1+β*f2
where α is the grammar weighting coefficient and β is the semantic weighting coefficient.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010450674.7A CN111814456A (en) | 2020-05-25 | 2020-05-25 | Verb-based Chinese text similarity calculation method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010450674.7A CN111814456A (en) | 2020-05-25 | 2020-05-25 | Verb-based Chinese text similarity calculation method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111814456A true CN111814456A (en) | 2020-10-23 |
Family
ID=72848023
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010450674.7A Pending CN111814456A (en) | 2020-05-25 | 2020-05-25 | Verb-based Chinese text similarity calculation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111814456A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112883165A (en) * | 2021-03-16 | 2021-06-01 | 山东亿云信息技术有限公司 | Intelligent full-text retrieval method and system based on semantic understanding |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2012043294A (en) * | 2010-08-20 | 2012-03-01 | Kddi Corp | Binomial relationship categorization program, method, and device for categorizing semantically similar word pair by binomial relationship |
CN108549634A (en) * | 2018-04-09 | 2018-09-18 | 北京信息科技大学 | A kind of Chinese patent text similarity calculating method |
Non-Patent Citations (2)
Title |
---|
Liu Xiaojun; Zhao Dong; Yao Weidong: "A two-factor similarity algorithm for Chinese text duplicate checking", Computer Simulation (计算机仿真), no. 12, pages 2 - 3 *
Huang Ju: "An assignment duplicate-checking algorithm based on a semantic vector space model", Electronic Science and Technology (电子科学技术), no. 06, pages 2 - 3 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Suleiman et al. | Deep learning based technique for plagiarism detection in Arabic texts | |
Oudah et al. | NERA 2.0: Improving coverage and performance of rule-based named entity recognition for Arabic | |
Ulčar et al. | High quality ELMo embeddings for seven less-resourced languages | |
CN108073571B (en) | Multi-language text quality evaluation method and system and intelligent text processing system | |
Gao et al. | Named entity recognition method of Chinese EMR based on BERT-BiLSTM-CRF | |
CN113761890B (en) | Multi-level semantic information retrieval method based on BERT context awareness | |
Ren et al. | Detecting the scope of negation and speculation in biomedical texts by using recursive neural network | |
Al-Harbi et al. | Lexical disambiguation in natural language questions (nlqs) | |
Wadud et al. | Text coherence analysis based on misspelling oblivious word embeddings and deep neural network | |
Zhang et al. | Chinese-English mixed text normalization | |
Sornlertlamvanich et al. | Thai Named Entity Recognition Using BiLSTM-CNN-CRF Enhanced by TCC | |
Aejas et al. | Named entity recognition for cultural heritage preservation | |
CN111814456A (en) | Verb-based Chinese text similarity calculation method | |
pal Singh et al. | Naive Bayes classifier for word sense disambiguation of Punjabi language | |
Khoufi et al. | Chunking Arabic texts using conditional random fields | |
Tongtep et al. | Multi-stage automatic NE and pos annotation using pattern-based and statistical-based techniques for thai corpus construction | |
Abdolahi et al. | A new method for sentence vector normalization using word2vec | |
Rebala et al. | Natural language processing | |
Jamwal | Named entity recognition for Dogri using ML | |
Prasad et al. | Lexicon based extraction and opinion classification of associations in text from Hindi weblogs | |
Golubev et al. | Use of augmentation and distant supervision for sentiment analysis in Russian | |
Bafna et al. | BaSa: A Technique to Identify Context based Common Tokens for Hindi Verses and Proses | |
Yuan et al. | Semantic based chinese sentence sentiment analysis | |
Bharti et al. | Sarcasm as a contradiction between a tweet and its temporal facts: a pattern-based approach | |
Liu et al. | Domain phrase identification using atomic word formation in Chinese text |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||