CN112380830B - Matching method, system and computer readable storage medium for related sentences in different documents - Google Patents

Matching method, system and computer readable storage medium for related sentences in different documents

Info

Publication number
CN112380830B
CN112380830B CN202010559644.XA CN202010559644A
Authority
CN
China
Prior art keywords
sentences
matching
sentence
score
shallow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010559644.XA
Other languages
Chinese (zh)
Other versions
CN112380830A (en)
Inventor
王忠萌
陈运文
王文广
贺梦洁
胡盟
纪达麒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Daguan Data Co ltd
Original Assignee
Daguan Data Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Daguan Data Co ltd filed Critical Daguan Data Co ltd
Priority to CN202010559644.XA priority Critical patent/CN112380830B/en
Publication of CN112380830A publication Critical patent/CN112380830A/en
Application granted granted Critical
Publication of CN112380830B publication Critical patent/CN112380830B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for matching related sentences in different documents, which matches a reference sentence in a reference document against candidate sentences in a comparison document. The method comprises: at three levels, namely shallow semantics, statistical information and deep semantics, computing a shallow score, a statistical score and a deep score of the candidate sentence relative to the reference sentence, each representing the degree of matching; and fitting the shallow score, the statistical score and the deep score with a linear regression model to obtain a final score representing the degree of matching of the candidate sentence relative to the reference sentence. The invention improves the accuracy of document matching.

Description

Matching method, system and computer readable storage medium for related sentences in different documents
Technical Field
The invention belongs to the field of computer natural language processing, and in particular relates to a method, a system and a computer-readable storage medium for matching related sentences in different documents.
Background
With the growth of information in recent years, the amount of text that computers must process has increased sharply. Faced with massive text collections, having machines process text automatically has become a major trend. Demand for matching document content is growing accordingly: automatic matching by machine makes it easy to find the differences and connections between documents, which supports public-opinion comparison, decision assistance and similar applications, and is of great value in fields such as economics and law.
A common approach such as the TF-IDF algorithm computes the similarity of two documents by calculating the TF-IDF value of every word in each document and then applying a similarity measure (typically cosine similarity). TF-IDF assumes that a term's importance in an article is independent of where it appears in the article. The core idea of the algorithm is that, within an article, a term's importance is positively correlated with how often it appears in that article, and negatively correlated with how many articles in the whole corpus contain it.
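As a concrete illustration of this baseline (a sketch only; the scikit-learn API and the toy sentences are assumptions, not part of the patent), the whole TF-IDF pipeline fits in a few lines of Python:

# Minimal sketch of the TF-IDF baseline: vectorize both documents,
# then compare them with cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc_a = "gross domestic product grew from 54 to 82.7 trillion yuan"
doc_b = "gross domestic product grew by 6.6 percent this year"

vectors = TfidfVectorizer().fit_transform([doc_a, doc_b])
score = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"TF-IDF cosine similarity: {score:.3f}")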
Meanwhile, deep learning methods have become popular, and deep neural networks are widely used for sentence modeling. Deep models represent sentences as vectors in a semantic space, so that the distance between vectors describes the semantic relation between two sentences more precisely; convolutional neural networks are good at extracting abstract features within sentences, while recurrent neural networks are good at retaining and exploiting long-distance information. A representative example is the DSSM algorithm, a deep semantic matching model that uses user click data to train semantic-level matching in a search scenario: DSSM substitutes click-through rate for relevance, and the click data contain large numbers of user queries together with the clicked documents that link those queries to matching documents. Google proposed the BERT pre-training model, which uses a Transformer structure for bidirectional encoding and is pre-trained on massive data with the Masked LM and Next Sentence Prediction objectives; it can then be fine-tuned for downstream tasks. For a text-similarity task, for example, the output layer is adjusted and a linear layer is used for fine-tuning to obtain the final result.
Document matching currently faces several difficulties. First, sentence matching itself is hard: different descriptions of the same thing make it difficult for a computer to judge two texts as similar, lowering recall, and semantic structures are diverse (a phrase such as "socioeconomic" may serve as the subject being described or as a modifier, as in "socioeconomic law" versus "socioeconomic culture"). Second, matching systems face cross-domain text: the criteria for a match differ between text domains, for instance in whether a phrase is the subject being described, which hinders fast and accurate migration. Finally, the matching score of an isolated sentence may be inconsistent with the matching result over the whole document, hurting the readability of the results. All of these are challenges for current text-matching systems.
Disclosure of Invention
To address the problems in the prior art, the invention provides a method for matching related sentences in different documents; some embodiments of the invention can improve document-matching accuracy.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A method for matching related sentences in different documents, for matching a reference sentence in a reference document with candidate sentences in a comparison document, the method comprising: at the three levels of shallow semantics, statistical information and deep semantics, computing a shallow score, a statistical score and a deep score of the candidate sentence relative to the reference sentence, each representing the degree of matching; and fitting the shallow score, the statistical score and the deep score with a linear regression model to obtain a final score representing the degree of matching of the candidate sentence relative to the reference sentence.
Preferably, obtaining the shallow semantics comprises obtaining three parallel indexes, namely: characters, word segments, and trunk components.
Preferably, obtaining the trunk components comprises: finding the nouns in a sentence and the adjectives that form a modifier-head (attributive) structure with those nouns; and, starting from the adjective nearest the noun and stacking further adjectives outward in sequence, combining each stack with the noun to obtain several trunk components.
Preferably, the shallow score is calculated by: obtaining the recall rates, in the candidate sentence, of the characters, word segments and trunk components of the reference sentence; and forming a first vector from these recall rates as the shallow score.
Preferably, the statistical score is calculated by: computing TF-IDF scores of the trunk components over a set of documents in the specific field containing the reference document and the comparison document; taking the several trunk components with the highest TF-IDF scores; taking whether a trunk component appears in both the reference sentence and the candidate sentence as a first variable and whether the reference sentence and the candidate sentence truly match as a second variable, performing a chi-square test on the two variables, and keeping the trunk components that pass the test as key components; and constructing a second vector, according to whether each key component appears in both the reference sentence and the candidate sentence, as the statistical score.
Preferably, the deep score is calculated by: building a BERT classification network from a BERT model pre-trained on a broad corpus; and computing, with this classification network, the deep semantic similarity of the candidate sentence relative to the reference sentence to form a third vector, which serves as the deep score.
Preferably, the linear regression model comprises weights and a preset bias value, and different domain-specific weights are trained on documents from different specific fields.
Preferably, the matching method comprises: if the candidate sentence and the reference sentence occupy the same position within their paragraphs, raising the final score by a predetermined amount.
A matching system for related sentences in different documents, for matching a reference sentence in a reference document with candidate sentences in a comparison document, the system comprising: a calculation module that, at the three levels of shallow semantics, statistical information and deep semantics, calculates a shallow score, a statistical score and a deep score of the candidate sentence relative to the reference sentence, each representing the degree of matching; and a fitting module that fits the shallow score, the statistical score and the deep score with a linear regression model to obtain a final score representing the degree of matching of the candidate sentence relative to the reference sentence.
A computer-readable storage medium storing computer instructions which, when executed by a processor, implement any of the above matching methods.
Compared with the prior art, the invention has the following beneficial effects: the accuracy of document matching is improved, and the method can be applied to, and targeted at, text matching in different fields.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings required by the embodiments or by the description of the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of an overall framework of a document matching method according to an embodiment of the present invention.
Fig. 2 is a schematic flow chart of a sentence matching method in an embodiment.
FIG. 3 is a schematic diagram of the structure of modifier-head (attributive) relation words and core words in the embodiment.
Fig. 4 is a flowchart of a method for calculating a deep semantic level similarity in an embodiment.
Detailed Description
The following describes the embodiments of the present invention clearly and completely with reference to the accompanying drawings. The described embodiments are obviously only some, not all, embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
In the description of the present invention, it should be understood that the terms "longitudinal," "transverse," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate orientations or positional relationships based on the orientation or positional relationships shown in the drawings, merely to facilitate describing the present invention and simplify the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and therefore should not be construed as limiting the present invention.
As shown in FIGS. 1-4, the input to the system is a pair of documents, such as policy reports, product specifications, or legal documents. The method is especially suited to parallel corpora, such as policy reports from different years: given the high content overlap between such document pairs, a sentence may correspond to several related sentences in the comparison document, or may have no correspondence at all (newly added content). When multiple correspondences exist in the comparison document, a policy may be adopted of preferring the sentence whose content is closest, or the sentence whose surrounding context is closest. After the two documents are designated as the reference document and the comparison document, the system outputs the matching status of every sentence of the reference document, i.e. the corresponding sentence found in the comparison document or the absence of one; along the way, it can also output intermediate matching information, such as chapter-level and paragraph-level matches.
Taking the 2018 government work report as the reference document and the 2019 government work report as the comparison document, with the full Chinese sentence as the basic unit, the sentence matching results shown in Table 1 can be obtained.
TABLE 1
In a specific implementation, matching is not limited to full sentences: clauses delimited by punctuation such as commas and semicolons can also serve as the document's basic matching unit, with similar results.
The implementation of the invention can be divided into modules, executed in the following order:
1. When the documents of the specific field have a chapter structure, chapters are divided first and sentences are screened at the chapter level. The text within each chapter is then split into paragraphs, using line breaks as separators. Paragraphs are then matched according to the objects they describe, so that every sentence in a paragraph has a candidate set, namely all sentences of the matched paragraph.
2. Sentence matching. Shallow semantic extraction, statistical information extraction and deep semantic extraction are carried out in turn to extract similarity features between two sentences. First, the text is preprocessed: the subject-predicate structure is roughly segmented, obviously irrelevant information is removed, and the candidate shallow information is retained. Second, statistical descriptors are computed for the segmented text within the specific field. Third, the text is vectorized and deep information is computed with a deep neural network such as BERT. Finally, threshold-based pre-screening is applied, linear regression is performed, and different weights are used according to the field of the text to obtain the final result. A field-awareness module also allows text-similarity judgments across different fields.
3. After the above two steps, every sentence of the reference document is associated with zero or more candidate sentences from the other document. The results are then post-processed according to the scores, whether candidates were already matched, and so on, so that the most probable matching sentence is returned.
The sentence matching module can be divided into the following sub-modules: preprocessing, shallow information extraction, statistical information extraction, deep information extraction and similarity judgment.
1. Preliminary sentence screening
This module has two functions: first, it shrinks the candidate space of sentences by removing obviously unmatched ones; second, it supports the addition of manual experience rules, helping to delimit the matching range of a sentence.
First, chapter matching. Documents in some specific fields, such as government reports, legal documents and agreements, have a complete hierarchy, so corresponding rules must be formulated to segment these documents and put their sections in correspondence. Documents in other fields may lack a chapter hierarchy, in which case this step can be omitted. Continuing the example above, government work reports can be divided into three sections, which correspond in order.
Next, paragraph matching. Sentences within each paragraph are first word-segmented, yielding for each paragraph its set of nouns, i.e. all possible objects that the paragraph's sentences may describe. For each paragraph p of the reference document, let S_p denote the set of all sentences inside it. Among the paragraphs of the corresponding chapter (or of the full text), the top K_p paragraphs with the highest noun recall are found, and all sentences in those paragraphs are recorded as the set S_q. The candidate matching set of every sentence in S_p is then S_q.
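A minimal sketch of this paragraph-matching step is given below; the function names are illustrative, and it assumes the nouns of each paragraph have already been extracted by a segmenter and POS tagger, which the description leaves unspecified.

# Rank comparison paragraphs by how well they recall the reference
# paragraph's nouns, and keep the indices of the top K_p of them.
def noun_recall(ref_nouns: set, cand_nouns: set) -> float:
    return len(ref_nouns & cand_nouns) / len(ref_nouns) if ref_nouns else 0.0

def top_matching_paragraphs(ref_nouns: set, cand_paragraph_nouns: list, k: int) -> list:
    scored = [(noun_recall(ref_nouns, nouns), i)
              for i, nouns in enumerate(cand_paragraph_nouns)]
    scored.sort(reverse=True)          # highest recall first
    return [i for _, i in scored[:k]]  # S_q = all sentences of these paragraphs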
2. Sentence matching
This module is the key module of the whole framework: it performs the actual matching of sentences and the related intra-sentence processing, easing both the difficulty of matching sentences in domain-specific documents and the problem of migrating the model between documents of different specific fields. It consists of the following sub-steps:
(1) Sentence preprocessing.
The purpose of this part is to clean irrelevant noise out of the sentences and to restructure them using simple manual experience. Let s1 and s2 be the reference sentence and the comparison sentence to be matched; a new sentence pair is obtained with the configured preprocessing function preprocess, as follows. Every later processing step operates on the preprocessed sentences.
s1′=preprocess(s1)
s2′=preprocess(s2)
In a specific implementation, conversions such as traditional-to-simplified Chinese, full-width to half-width punctuation, and letter case can be applied. The text also needs to be word-segmented, and generic stop words as well as components to be discarded in the specific field (e.g. date components that some fields do not care about) are filtered out. The preprocessed sentence is thus obtained.
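A minimal sketch of such a preprocess function is shown below, assuming the task is insensitive to concrete numbers, as in the example that follows; NFKC normalization handles the full-width to half-width conversion, while traditional-to-simplified conversion would require a dedicated library (e.g. OpenCC) and is omitted here.

import re
import unicodedata

def preprocess(sentence: str) -> str:
    s = unicodedata.normalize("NFKC", sentence)  # full-width -> half-width forms
    s = s.lower()                                # case folding
    s = re.sub(r"\d+(\.\d+)?", "0", s)           # mask concrete numbers with "0"
    return s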
Continuing the government work report example, two representative original sentences, sentence 1 and sentence 2, are taken from the reference report and the comparison report respectively, as follows.
Sentence 1: the total domestic production value is increased from 54 trillion yuan to 82.7 trillion yuan, and the annual growth rate is 7.1 percent.
Sentence 2: the total value of domestic production is increased by 6.6%, and the total amount breaks through 90 trillion yuan.
In this example, where the setting is insensitive to specific numbers, sentence 1 and sentence 2 can be preprocessed as:
Sentence 1: the total domestic production value is increased from 0 yuan to 0 yuan, and the annual growth is 0 percent.
Sentence 2: the total value of domestic production is increased by 0 percent, and the total value breaks through 0 yuan.
When matching is to be performed at the clause level, the omitted subject can be supplemented where necessary, as follows.
Half sentence (1) of sentence 1: the domestic production total value increased from 0 yuan to 0 yuan
Half sentence (2) of sentence 1: the domestic production total value grew by 0 percent annually
Half sentence (1) of sentence 2: the domestic production total value grew by 0 percent
Half sentence (2) of sentence 2: the total of the domestic production total value broke through 0 yuan
(2) Shallow semantic acquisition.
The purpose of this part is a preliminary extraction of the shallow semantic information of a given text: trunk components that may need to be judged are pre-extracted and synonymous components are replaced, yielding the shallow secondary feature information.
First, the trunk components of the sentence are extracted preliminarily: the nouns, together with the adjectives standing in a modifier-head (attributive) structure with them, are found, yielding several possible trunk components.
In implementation, if one core word has several attributive modifiers, they are stacked starting from the modifier nearest the core word, so several trunk-component words are obtained. For a core word w1 with two modifiers a1 and a2 (a1 outermost), this yields w1, a2w1 and a1a2w1; for the example phrase "domestic production total value", the three trunk components "domestic production total value", "production total value" and "total value" are extracted.
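A sketch of this modifier-stacking rule follows; the function name and plain string concatenation are assumptions, and a real implementation would read the modifiers off a dependency parse.

def trunk_components(modifiers: list, head: str) -> list:
    """modifiers are ordered outermost first, e.g. ["国内", "生产"] for head "总值"."""
    components = [head]
    for i in range(len(modifiers) - 1, -1, -1):  # start from the modifier nearest the head
        components.append("".join(modifiers[i:]) + head)
    return components

print(trunk_components(["国内", "生产"], "总值"))
# -> ['总值', '生产总值', '国内生产总值']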
Furthermore, in practice a rule-base vocabulary can be used to replace synonyms; e.g. "GDP" and "domestic production total value" can be substituted for one another, so that the machine can handle synonyms very conveniently.
Finally, the shallow matching score of the reference sentence and a candidate sentence is calculated with the following indexes, ordered by sentence granularity from small to large: at the character level, the character-level recall rate, F-value and ROUGE score; at the word level, the word-segment recall rate and F-value, plus the n-gram recall rate and F-value; at the trunk-component level, the trunk-component recall rate and F-value. These parallel indexes together form the score vector.
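As a sketch of one of these indexes, character-level recall and F-value can be computed as below; the multiset intersection via Counter is an implementation assumption, and the word-level and trunk-component-level indexes have the same shape over different units.

from collections import Counter

def recall_f1(reference: str, candidate: str) -> tuple:
    overlap = sum((Counter(reference) & Counter(candidate)).values())
    recall = overlap / len(reference) if reference else 0.0
    precision = overlap / len(candidate) if candidate else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, f1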
Continuing the example, half sentence (1) of sentence 1 is taken as the reference half sentence, and half sentences (1) and (2) of sentence 2 are examined as candidate matches. Their score vectors are [0.67, 0.57, 0.70, 0.50, 0.40, 0.25, 0.78, 1, 1] and [0.63, 0.57, 0.64, 0.62, 0.55, 0.38, 0.3, 1, 1], respectively.
(3) Statistical information acquisition
The purpose of this part is to learn, from text statistics of the specific field, how much different components influence the judgment of sentence similarity in that field, and to support automatic evaluation of text similarity in the field through the co-occurrence of key components.
First, TF-IDF is computed over the whole document set of the field for the sentence trunk components obtained in the previous step. Let N_w be the number of occurrences of term w in a given text and N the total number of terms in that text; let Y be the total number of documents in the corpus and Y_w the number of documents containing term w. The term frequency is then

TF_w = N_w / N

and the inverse document frequency is

IDF_w = log(Y / Y_w)

giving the final TF-IDF score

TF-IDF_w = TF_w * IDF_w
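A direct transcription of these formulas into Python might read as follows; the +1 guard against unseen terms is an assumption.

import math

def tf_idf(term: str, doc_terms: list, corpus: list) -> float:
    """doc_terms: one document as a list of terms; corpus: a list of such lists."""
    tf = doc_terms.count(term) / len(doc_terms)    # TF_w = N_w / N
    y_w = sum(1 for doc in corpus if term in doc)  # Y_w: documents containing w
    idf = math.log(len(corpus) / (y_w + 1))        # IDF_w = log(Y / Y_w)
    return tf * idf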
Second, for the top K_w components with the largest TF-IDF weight: in every record of the domain-specific document data set, a component is marked 1 if it co-occurs in the two sentences and 0 if it does not.
Let K be the size of the data set, A the co-occurrence variable and E the true-match variable. A chi-square test of independence between A and E is then computed over their 2x2 contingency table:

chi^2 = sum_ij (O_ij - E_ij)^2 / E_ij

where O_ij is the observed count of cell (i, j) over the K records and E_ij its expected count under independence.
Taking a threshold p_w, the components that pass the chi-square test are recorded as the set W_key, representing the components that are important for semantic similarity calculation.
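The filtering step can be sketched as follows, with scipy's chi2_contingency standing in for the hand-computed statistic; the function name and the 0.05 default threshold are assumptions.

from scipy.stats import chi2_contingency

def is_key_component(cooccur: list, matched: list, p_threshold: float = 0.05) -> bool:
    """cooccur/matched: parallel 0-1 lists over the K labelled sentence pairs."""
    table = [[0, 0], [0, 0]]
    for a, e in zip(cooccur, matched):
        table[a][e] += 1                   # 2x2 contingency table of A against E
    _, p_value, _, _ = chi2_contingency(table)
    return p_value < p_threshold           # significant dependence -> key component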
Finally, for each sentence pair it is recorded whether the key components of the specific field co-occur, giving the statistical-information result as a score vector.
Continuing the example, suppose five words such as "domestic production total value" are filtered out as key components; the score vectors of the two candidate matches are then [1, 0, 0, 0, 0] and [1, 0, 0, 0, 0]. In this example only one key component co-occurs, so the first element of each score vector is 1 and the remaining elements are 0.
(4) Deep semantic acquisition
The purpose of this part is to obtain domain-independent deep information about the text through a deep neural network, to assist the next module in calculating text similarity in the specific field.
A BERT model pre-trained on a broad corpus is extended with an output-layer network to obtain a BERT classification network, which is trained to compute text similarity.
Let C be the output matrix returned by the pre-trained model, W a trainable parameter matrix, and P the probability of the classification result:

P = softmax(C W^T)

The deep score is the probability that the model assigns to result class 1, i.e. to the two sentences matching.
Continuing the example, the similarity scores of the two candidates are [0.85] and [0.90], respectively.
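A hedged sketch of this deep-score computation with the Hugging Face transformers library is given below; the checkpoint name "bert-base-chinese" is an assumption, and in practice the classification head would first be fine-tuned on matched/unmatched sentence pairs.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)

def deep_score(ref: str, cand: str) -> float:
    inputs = tokenizer(ref, cand, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits                # logits correspond to C W^T
    return torch.softmax(logits, dim=-1)[0, 1].item()  # probability of class 1 ("match")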
(5) Specific text perception
This module integrates the results of the preceding score-calculation parts, makes the similarity judgment for the specific field, and enables cross-scenario migration of the model.
First, some of the scores are pre-screened. In practice, thresholds can be placed on certain scores depending on the field. If two sentences share no character at all, they can be judged dissimilar directly; likewise, if the deep network's returned score for the two texts is extremely low, the sentences can be judged dissimilar directly and filtered out.
Next, a linear regression model is constructed to make the judgment. Let x be the concatenation of the score vectors above, w the weight vector and b the bias value; the final result is

Y_match = w · x + b
The weights inside the model can be trained with a data set from a specific field, after which documents of that field can be judged. When switching scenarios, only this linear model needs to be retrained.
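A sketch of this fusion step, with scikit-learn's LinearRegression standing in for whatever regressor the implementation actually uses:

import numpy as np
from sklearn.linear_model import LinearRegression

def fit_domain_model(score_vectors: np.ndarray, labels: np.ndarray) -> LinearRegression:
    """score_vectors: (n_pairs, d) concatenated shallow/statistical/deep scores;
    labels: 1 for truly matched pairs, 0 otherwise."""
    return LinearRegression().fit(score_vectors, labels)

# Switching domains only requires refitting this lightweight model:
# model = fit_domain_model(X_legal, y_legal)
# y_match = model.predict(x.reshape(1, -1))  # computes w · x + b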
Continuing the example, the trained model gives matching scores of 0.91 and 0.94 for the two candidate matches, respectively.
3. Post-processing of results
This module reconciles the isolated matching results of the individual sentences and screens the candidate sets from the perspective of whole-document matching accuracy, improving readability and making the matching results friendlier.
First, the position information of every sentence is obtained: within its paragraph, a sentence is classified as paragraph-initial, paragraph-internal or paragraph-final. The first N_start sentences of a paragraph are paragraph-initial, the last N_end sentences are paragraph-final, and the remainder are paragraph-internal. For a reference sentence s1, if its paragraph position coincides with that of a candidate sentence s2, the score increases by Y_pos. The demarcation scheme is not fixed; manual knowledge can also be used to delimit the position information of sentences. In this way, positional correspondence improves the result.
Second, repeated-occurrence information of the comparison sentences is obtained. Given the general structure of an article, a single comparison-document sentence may be matched by several reference-document sentences, so it must be allocated sensibly to the most correct corresponding reference sentence. For a comparison sentence that has already been matched, every further match should have its score reduced by Y_repeat. In a specific implementation, repeated matching can also be corrected with manual knowledge: in contracts, for example, where repeated clauses are rare, the maximum value can be taken instead of applying the repetition-reduction principle.
Finally, the final score is obtained as Y_final = Y_match + Y_pos - Y_repeat. The sentences of each reference document are then ranked by score over their respective candidate sets to give the final matching result.
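The whole post-processing step can be sketched as below; the concrete bonus and penalty values are assumptions, since the description leaves Y_pos and Y_repeat unspecified.

Y_POS, Y_REPEAT = 0.05, 0.10  # illustrative bonus/penalty values

def final_score(y_match: float, same_position: bool, times_already_matched: int) -> float:
    y = y_match
    if same_position:
        y += Y_POS                           # position-agreement bonus
    y -= Y_REPEAT * times_already_matched    # discourage reusing a comparison sentence
    return y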
Continuing the example, both candidate matches are paragraph-internal like the reference and neither has been matched before, so the final scores remain 0.91 and 0.94; half sentence (2) of sentence 2 is thus the better match for half sentence (1) of sentence 1.
While the invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications and improvements can be made on the basis of this disclosure without departing from the spirit and scope of the invention.

Claims (8)

1. A method for matching related sentences in different documents, for matching a reference sentence in a reference document with candidate sentences in a comparison document, the method comprising:
at the three levels of shallow semantics, statistical information and deep semantics, calculating a shallow score, a statistical score and a deep score of the candidate sentence relative to the reference sentence, each representing the degree of matching;
fitting the shallow score, the statistical score and the deep score with a linear regression model to obtain a final score representing the degree of matching of the candidate sentence relative to the reference sentence;
wherein obtaining the shallow semantics comprises obtaining three parallel indexes, namely: characters, word segments and trunk components;
and wherein the shallow score is calculated by:
obtaining the recall rates, in the candidate sentence, of the characters, word segments and trunk components of the reference sentence;
forming a first vector from the recall rates as the shallow score.
2. The method for matching related sentences in different documents according to claim 1, wherein obtaining the trunk components comprises:
finding the nouns in a sentence and the adjectives standing in a modifier-head (attributive) structure with those nouns;
starting from the adjective nearest the noun and stacking further adjectives outward in sequence, combining each stack with the noun to obtain several trunk components.
3. The method for matching related sentences in different documents according to claim 2, wherein the statistical score is calculated by:
computing TF-IDF scores of the trunk components over a set of documents in the specific field containing the reference document and the comparison document;
taking the several trunk components with the highest TF-IDF scores;
taking whether a trunk component appears in both the reference sentence and the candidate sentence as a first variable and whether the reference sentence and the candidate sentence truly match as a second variable, performing a chi-square test on the two variables, and keeping the trunk components that pass the test as key components;
constructing a second vector, according to whether each key component appears in both the reference sentence and the candidate sentence, as the statistical score.
4. The method for matching related sentences in different documents according to claim 3, wherein the deep score is calculated by:
building a BERT classification network from a BERT model pre-trained on a broad corpus;
computing, with the classification network, the deep semantic similarity of the candidate sentence relative to the reference sentence to form a third vector as the deep score.
5. The method of claim 4, wherein the linear regression model comprises weights and a preset bias value, and different domain-specific weights are trained on documents from different specific fields.
6. The method for matching related sentences in different documents according to claim 1, wherein the matching method comprises:
if the candidate sentence and the reference sentence occupy the same position within their paragraphs, raising the final score by a predetermined amount.
7. A matching system for related sentences in different documents, for matching a reference sentence in a reference document with candidate sentences in a comparison document, the system comprising:
a calculation module that, at the three levels of shallow semantics, statistical information and deep semantics, calculates a shallow score, a statistical score and a deep score of the candidate sentence relative to the reference sentence, each representing the degree of matching, wherein obtaining the shallow semantics comprises obtaining three parallel indexes, namely: characters, word segments and trunk components;
a fitting module that fits the shallow score, the statistical score and the deep score with a linear regression model to obtain a final score representing the degree of matching of the candidate sentence relative to the reference sentence; wherein the shallow score is calculated by:
obtaining the recall rates, in the candidate sentence, of the characters, word segments and trunk components of the reference sentence;
forming a first vector from the recall rates as the shallow score.
8. A computer-readable storage medium, characterized in that the storage medium stores computer instructions which, when executed by a processor, implement the matching method of any one of claims 1-6.
CN202010559644.XA 2020-06-18 2020-06-18 Matching method, system and computer readable storage medium for related sentences in different documents Active CN112380830B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010559644.XA CN112380830B (en) 2020-06-18 2020-06-18 Matching method, system and computer readable storage medium for related sentences in different documents

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010559644.XA CN112380830B (en) 2020-06-18 2020-06-18 Matching method, system and computer readable storage medium for related sentences in different documents

Publications (2)

Publication Number Publication Date
CN112380830A CN112380830A (en) 2021-02-19
CN112380830B 2024-05-17

Family

ID=74586338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010559644.XA Active CN112380830B (en) 2020-06-18 2020-06-18 Matching method, system and computer readable storage medium for related sentences in different documents

Country Status (1)

Country Link
CN (1) CN112380830B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7295965B2 (en) * 2001-06-29 2007-11-13 Honeywell International Inc. Method and apparatus for determining a measure of similarity between natural language sentences

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002029618A1 (en) * 2000-09-30 2002-04-11 Intel Corporation (A Corporation Of Delaware) A method and apparatus for determining text passage similarity
RU2014112241A (en) * 2014-03-31 2015-12-20 Limited Liability Company "Аби ИнфоПоиск" (ABBYY InfoPoisk) BUILDING A CASE OF COMPARATIVE DOCUMENTS BASED ON A UNIVERSAL SIMILARITY
CN105868187A (en) * 2016-03-25 2016-08-17 北京语言大学 A multi-translation version parallel corpus establishing method
JP2017188039A (en) * 2016-04-08 2017-10-12 Kddi株式会社 Program, device and method for estimating score of text by calculating multiple similarity degrees
CN106502987A (en) * 2016-11-02 2017-03-15 深圳市空谷幽兰人工智能科技有限公司 The method and apparatus that a kind of sentence template based on seed sentence is recalled
CN107992472A (en) * 2017-11-23 2018-05-04 浪潮金融信息技术有限公司 Sentence similarity computational methods and device, computer-readable storage medium and terminal
CN108287824A (en) * 2018-03-07 2018-07-17 北京云知声信息技术有限公司 Semantic similarity calculation method and device
CN108549634A (en) * 2018-04-09 2018-09-18 北京信息科技大学 A kind of Chinese patent text similarity calculating method
CN109344236A (en) * 2018-09-07 2019-02-15 暨南大学 One kind being based on the problem of various features similarity calculating method
KR102085217B1 (en) * 2019-10-14 2020-03-04 (주)디앤아이파비스 Method, apparatus and system for determining similarity of patent documents
CN110851599A (en) * 2019-11-01 2020-02-28 中山大学 Automatic scoring method and teaching and assisting system for Chinese composition

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
A Short Texts Matching Method Using Shallow Features and Deep Features; Longbiao Kang et al.; Natural Language Processing and Chinese Computing; 150-159 *
Deep and shallow features learning for short texts matching; Ziliang Wang et al.; 2017 International Conference on Progress in Informatics and Computing (PIC); 51-55 *
Research on Hierarchical Chinese Sentence Similarity (基于分层的中文句子相似度的研究); 陈学智; Wanfang degree thesis; 27-28 *
Sentence Similarity Calculation Based on Syntactic Structure and Modifiers (基于句法结构与修饰词的句子相似度计算); 邓涵, 朱新华, 李奇, 彭琦; Computer Engineering (09); 246-250+255 *
Chinese Short Text Similarity Calculation Based on a Hybrid Strategy (基于混合策略的中文短文本相似度计算); 宋冬云, 郑瑾, 张祖平; Computer Engineering and Applications (12); 121-125+210 *
Research on a Subjective-Question Scoring Algorithm Based on Sentence Structure and Semantic Similarity (基于语句结构及语义相似度计算主观题评分算法的研究); 贾电如, 李阳明; 信息化纵横 (05); 8-10 *
A Short Text Matching Method with Multi-level Feature Fusion (多层次特征融合的短文本匹配方法); 康龙彪; Wanfang degree thesis; 1-67 *
Application of Chinese Sentence Similarity Calculation in FAQ (汉语句子相似度计算在FAQ中的应用); 裴婧, 包宏; Computer Engineering (17); 52-54 *

Also Published As

Publication number Publication date
CN112380830A (en) 2021-02-19

Similar Documents

Publication Publication Date Title
CN110442760B (en) Synonym mining method and device for question-answer retrieval system
Wang et al. K-adapter: Infusing knowledge into pre-trained models with adapters
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN106844658B (en) Automatic construction method and system of Chinese text knowledge graph
CN106484664B (en) Similarity calculating method between a kind of short text
CN104765769B (en) The short text query expansion and search method of a kind of word-based vector
US8751218B2 (en) Indexing content at semantic level
CN102663129A (en) Medical field deep question and answer method and medical retrieval system
EP3203383A1 (en) Text generation system
CN112307182B (en) Question-answering system-based pseudo-correlation feedback extended query method
CN113761890B (en) Multi-level semantic information retrieval method based on BERT context awareness
CN103473280A (en) Method and device for mining comparable network language materials
CN107895000A (en) A kind of cross-cutting semantic information retrieval method based on convolutional neural networks
CN111241824B (en) Method for identifying Chinese metaphor information
CN112883182A (en) Question-answer matching method and device based on machine reading
Nugraha et al. Typographic-based data augmentation to improve a question retrieval in short dialogue system
CN115525763A (en) Emotion analysis method based on improved SO-PMI algorithm and fusion word vector
CN112883199A (en) Collaborative disambiguation method based on deep semantic neighbor and multi-entity association
CN115905487A (en) Document question and answer method, system, electronic equipment and storage medium
Breja et al. Analyzing linguistic features for answer re-ranking of why-questions
CN114064901B (en) Book comment text classification method based on knowledge graph word meaning disambiguation
CN114626367A (en) Sentiment analysis method, system, equipment and medium based on news article content
CN112100382B (en) Clustering method and device, computer readable storage medium and processor
CN112446217B (en) Emotion analysis method and device and electronic equipment
CN112182332A (en) Emotion classification method and system based on crawler collection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Country or region after: China
Address after: 201203 rooms 301, 303 and 304, block B, 112 liangxiu Road, Pudong New Area, Shanghai
Applicant after: Daguan Data Co.,Ltd.
Address before: 201203 rooms 301, 303 and 304, block B, 112 liangxiu Road, Pudong New Area, Shanghai
Applicant before: DATAGRAND INFORMATION TECHNOLOGY (SHANGHAI) Co.,Ltd.
Country or region before: China
GR01 Patent grant