CN112380830B - Matching method, system and computer readable storage medium for related sentences in different documents - Google Patents

Matching method, system and computer readable storage medium for related sentences in different documents

Info

Publication number
CN112380830B
CN112380830B CN202010559644.XA CN202010559644A
Authority
CN
China
Prior art keywords
sentences
matching
sentence
score
shallow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010559644.XA
Other languages
Chinese (zh)
Other versions
CN112380830A (en)
Inventor
王忠萌
陈运文
王文广
贺梦洁
胡盟
纪达麒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Daguan Data Co ltd
Original Assignee
Daguan Data Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Daguan Data Co ltd filed Critical Daguan Data Co ltd
Priority to CN202010559644.XA priority Critical patent/CN112380830B/en
Publication of CN112380830A publication Critical patent/CN112380830A/en
Application granted granted Critical
Publication of CN112380830B publication Critical patent/CN112380830B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for matching related sentences in different documents, which matches a reference sentence in a reference document against candidate sentences in a comparison document. The method comprises: at three levels, namely shallow semantics, statistical information and deep semantics, computing a shallow score, a statistical score and a deep score of the candidate sentence relative to the reference sentence, each representing the degree of matching; and fitting the shallow score, the statistical score and the deep score with a linear regression model to obtain a final score representing the degree of matching of the candidate sentence relative to the reference sentence. The invention improves the accuracy of document matching.

Description

Matching method, system and computer readable storage medium for related sentences in different documents
Technical Field
The invention belongs to the field of computer natural language processing, and in particular relates to a method, a system and a computer-readable storage medium for matching related sentences in different documents.
Background
With the growth of information in recent years, the amount of text that computers must process has increased sharply. Faced with massive text collections, having machines process text automatically has become a major trend. Demand for matching document content is growing accordingly: automatic matching by machine makes it easy to find the differences and connections between documents, which supports public-opinion comparison, decision assistance and similar applications, and is of great value in fields such as economics and law.
A common approach such as the TF-IDF algorithm computes the similarity of two documents by calculating the TF-IDF value of every word in each document and then applying a similarity measure (typically cosine similarity). TF-IDF assumes that a term's importance in an article is independent of where it appears in the article. The core idea of the algorithm is that, within an article, a term's importance is positively correlated with how often it appears in that article, and negatively correlated with how many articles in the whole corpus contain it.
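As a concrete illustration of this baseline (a sketch only; the scikit-learn API and the toy sentences are assumptions, not part of the patent), the whole TF-IDF pipeline fits in a few lines of Python:

# Minimal sketch of the TF-IDF baseline: vectorize both documents,
# then compare them with cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc_a = "gross domestic product grew from 54 to 82.7 trillion yuan"
doc_b = "gross domestic product grew by 6.6 percent this year"

vectors = TfidfVectorizer().fit_transform([doc_a, doc_b])
score = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"TF-IDF cosine similarity: {score:.3f}")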
Meanwhile, deep learning methods have become popular, and deep neural networks are widely used for sentence modeling. Deep models represent sentences as vectors in a semantic space, so that the distance between vectors describes the semantic relation between two sentences more precisely; convolutional neural networks are good at extracting abstract features within sentences, while recurrent neural networks are good at retaining and exploiting long-distance information. A representative example is the DSSM algorithm, a deep semantic matching model that uses user click data to train semantic-level matching in a search scenario: DSSM substitutes click-through rate for relevance, and the click data contain large numbers of user queries together with the clicked documents that link those queries to matching documents. Google proposed the BERT pre-training model, which uses a Transformer structure for bidirectional encoding and is pre-trained on massive data with the Masked LM and Next Sentence Prediction objectives; it can then be fine-tuned for downstream tasks. For a text-similarity task, for example, the output layer is adjusted and a linear layer is used for fine-tuning to obtain the final result.
Document matching currently faces several difficulties. First, sentence matching itself is hard: different descriptions of the same thing make it difficult for a computer to judge two texts as similar, lowering recall, and semantic structures are diverse (a phrase such as "socioeconomic" may serve as the subject being described or as a modifier, as in "socioeconomic law" versus "socioeconomic culture"). Second, matching systems face cross-domain text: the criteria for a match differ between text domains, for instance in whether a phrase is the subject being described, which hinders fast and accurate migration. Finally, the matching score of an isolated sentence may be inconsistent with the matching result over the whole document, hurting the readability of the results. All of these are challenges for current text-matching systems.
Disclosure of Invention
To address the problems in the prior art, the invention provides a method for matching related sentences in different documents; some embodiments of the invention can improve document-matching accuracy.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A method for matching related sentences in different documents, for matching a reference sentence in a reference document with candidate sentences in a comparison document, the method comprising: at the three levels of shallow semantics, statistical information and deep semantics, computing a shallow score, a statistical score and a deep score of the candidate sentence relative to the reference sentence, each representing the degree of matching; and fitting the shallow score, the statistical score and the deep score with a linear regression model to obtain a final score representing the degree of matching of the candidate sentence relative to the reference sentence.
Preferably, obtaining the shallow semantics comprises obtaining three parallel indexes, namely: characters, word segments, and trunk components.
Preferably, obtaining the trunk components comprises: finding the nouns in a sentence and the adjectives that form a modifier-head (attributive) structure with those nouns; and, starting from the adjective nearest the noun and stacking further adjectives outward in sequence, combining each stack with the noun to obtain several trunk components.
Preferably, the shallow score is calculated by: obtaining the recall rates, in the candidate sentence, of the characters, word segments and trunk components of the reference sentence; and forming a first vector from these recall rates as the shallow score.
Preferably, the statistical score is calculated by: computing TF-IDF scores of the trunk components over a set of documents in the specific field containing the reference document and the comparison document; taking the several trunk components with the highest TF-IDF scores; taking whether a trunk component appears in both the reference sentence and the candidate sentence as a first variable and whether the reference sentence and the candidate sentence truly match as a second variable, performing a chi-square test on the two variables, and keeping the trunk components that pass the test as key components; and constructing a second vector, according to whether each key component appears in both the reference sentence and the candidate sentence, as the statistical score.
Preferably, the deep score is calculated by: building a BERT classification network from a BERT model pre-trained on a broad corpus; and computing, with this classification network, the deep semantic similarity of the candidate sentence relative to the reference sentence to form a third vector, which serves as the deep score.
Preferably, the linear regression model comprises weights and a preset bias value, and different domain-specific weights are trained on documents from different specific fields.
Preferably, the matching method comprises: if the candidate sentence and the reference sentence occupy the same position within their paragraphs, raising the final score by a predetermined amount.
A matching system for related sentences in different documents, for matching a reference sentence in a reference document with candidate sentences in a comparison document, the system comprising: a calculation module that, at the three levels of shallow semantics, statistical information and deep semantics, calculates a shallow score, a statistical score and a deep score of the candidate sentence relative to the reference sentence, each representing the degree of matching; and a fitting module that fits the shallow score, the statistical score and the deep score with a linear regression model to obtain a final score representing the degree of matching of the candidate sentence relative to the reference sentence.
A computer-readable storage medium storing computer instructions which, when executed by a processor, implement any of the above matching methods.
Compared with the prior art, the invention has the following beneficial effects: the accuracy of document matching is improved, and the method can be applied to, and targeted at, text matching in different fields.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings required by the embodiments or by the description of the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of an overall framework of a document matching method according to an embodiment of the present invention.
Fig. 2 is a schematic flow chart of a sentence matching method in an embodiment.
FIG. 3 is a schematic diagram of the structure of modifier-head (attributive) relation words and core words in the embodiment.
Fig. 4 is a flowchart of a method for calculating a deep semantic level similarity in an embodiment.
Detailed Description
The following describes the embodiments of the present invention clearly and completely with reference to the accompanying drawings. The described embodiments are obviously only some, not all, embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
In the description of the present invention, it should be understood that the terms "longitudinal," "transverse," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate orientations or positional relationships based on the orientation or positional relationships shown in the drawings, merely to facilitate describing the present invention and simplify the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and therefore should not be construed as limiting the present invention.
As shown in FIGS. 1-4, the input to the system is a pair of documents, such as policy reports, product specifications, or legal documents. The method is especially suited to parallel corpora, such as policy reports from different years: given the high content overlap between such document pairs, a sentence may correspond to several related sentences in the comparison document, or may have no correspondence at all (newly added content). When multiple correspondences exist in the comparison document, a policy may be adopted of preferring the sentence whose content is closest, or the sentence whose surrounding context is closest. After the two documents are designated as the reference document and the comparison document, the system outputs the matching status of every sentence of the reference document, i.e. the corresponding sentence found in the comparison document or the absence of one; along the way, it can also output intermediate matching information, such as chapter-level and paragraph-level matches.
Taking the 2018 government work report as the reference document and the 2019 government work report as the comparison document, with the full Chinese sentence as the basic unit, the sentence matching results shown in Table 1 can be obtained.
TABLE 1
In a specific implementation, matching is not limited to full sentences: clauses delimited by punctuation such as commas and semicolons can also serve as the document's basic matching unit, with similar results.
The implementation of the invention can be divided into modules, executed in the following order:
1. When the documents of the specific field have a chapter structure, chapters are divided first and sentences are screened at the chapter level. The text within each chapter is then split into paragraphs, using line breaks as separators. Paragraphs are then matched according to the objects they describe, so that every sentence in a paragraph has a candidate set, namely all sentences of the matched paragraph.
2. Sentence matching. Shallow semantic extraction, statistical information extraction and deep semantic extraction are carried out in turn to extract similarity features between two sentences. First, the text is preprocessed: the subject-predicate structure is roughly segmented, obviously irrelevant information is removed, and the candidate shallow information is retained. Second, statistical descriptors are computed for the segmented text within the specific field. Third, the text is vectorized and deep information is computed with a deep neural network such as BERT. Finally, threshold-based pre-screening is applied, linear regression is performed, and different weights are used according to the field of the text to obtain the final result. A field-awareness module also allows text-similarity judgments across different fields.
3. After the above two steps, every sentence of the reference document is associated with zero or more candidate sentences from the other document. The results are then post-processed according to the scores, whether candidates were already matched, and so on, so that the most probable matching sentence is returned.
The sentence matching module can be divided into the following sub-modules: preprocessing, shallow information extraction, statistical information extraction, deep information extraction and similarity judgment.
1. Preliminary sentence screening
This module has two functions: first, it shrinks the candidate space of sentences by removing obviously unmatched ones; second, it supports the addition of manual experience rules, helping to delimit the matching range of a sentence.
First, chapter matching. Documents in some specific fields, such as government reports, legal documents and agreements, have a complete hierarchy, so corresponding rules must be formulated to segment these documents and put their sections in correspondence. Documents in other fields may lack a chapter hierarchy, in which case this step can be omitted. Continuing the example above, government work reports can be divided into three sections, which correspond in order.
Next, paragraph matching. Sentences within each paragraph are first word-segmented, yielding for each paragraph its set of nouns, i.e. all possible objects that the paragraph's sentences may describe. For each paragraph p of the reference document, let S_p denote the set of all sentences inside it. Among the paragraphs of the corresponding chapter (or of the full text), the top K_p paragraphs with the highest noun recall are found, and all sentences in those paragraphs are recorded as the set S_q. The candidate matching set of every sentence in S_p is then S_q.
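A minimal sketch of this paragraph-matching step is given below; the function names are illustrative, and it assumes the nouns of each paragraph have already been extracted by a segmenter and POS tagger, which the description leaves unspecified.

# Rank comparison paragraphs by how well they recall the reference
# paragraph's nouns, and keep the indices of the top K_p of them.
def noun_recall(ref_nouns: set, cand_nouns: set) -> float:
    return len(ref_nouns & cand_nouns) / len(ref_nouns) if ref_nouns else 0.0

def top_matching_paragraphs(ref_nouns: set, cand_paragraph_nouns: list, k: int) -> list:
    scored = [(noun_recall(ref_nouns, nouns), i)
              for i, nouns in enumerate(cand_paragraph_nouns)]
    scored.sort(reverse=True)          # highest recall first
    return [i for _, i in scored[:k]]  # S_q = all sentences of these paragraphs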
2. Sentence matching
This module is the key module of the whole framework: it performs the actual matching of sentences and the related intra-sentence processing, easing both the difficulty of matching sentences in domain-specific documents and the problem of migrating the model between documents of different specific fields. It consists of the following sub-steps:
(1) Sentence preprocessing.
The purpose of this part is to clean irrelevant noise out of the sentences and to restructure them using simple manual experience. Let s1 and s2 be the reference sentence and the comparison sentence to be matched; a new sentence pair is obtained with the configured preprocessing function preprocess, as follows. Every later processing step operates on the preprocessed sentences.
s1′=preprocess(s1)
s2′=preprocess(s2)
In a specific implementation, conversions such as traditional-to-simplified Chinese, full-width to half-width punctuation, and letter case can be applied. The text also needs to be word-segmented, and generic stop words as well as components to be discarded in the specific field (e.g. date components that some fields do not care about) are filtered out. The preprocessed sentence is thus obtained.
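A minimal sketch of such a preprocess function is shown below, assuming the task is insensitive to concrete numbers, as in the example that follows; NFKC normalization handles the full-width to half-width conversion, while traditional-to-simplified conversion would require a dedicated library (e.g. OpenCC) and is omitted here.

import re
import unicodedata

def preprocess(sentence: str) -> str:
    s = unicodedata.normalize("NFKC", sentence)  # full-width -> half-width forms
    s = s.lower()                                # case folding
    s = re.sub(r"\d+(\.\d+)?", "0", s)           # mask concrete numbers with "0"
    return s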
Continuing the government work report example, two representative original sentences, sentence 1 and sentence 2, are taken from the reference report and the comparison report respectively, as follows.
Sentence 1: the total domestic production value is increased from 54 trillion yuan to 82.7 trillion yuan, and the annual growth rate is 7.1 percent.
Sentence 2: the total value of domestic production is increased by 6.6%, and the total amount breaks through 90 trillion yuan.
In this example, where the setting is insensitive to specific numbers, sentence 1 and sentence 2 can be preprocessed as:
Sentence 1: the total domestic production value is increased from 0 yuan to 0 yuan, and the annual growth is 0 percent.
Sentence 2: the total value of domestic production is increased by 0 percent, and the total value breaks through 0 yuan.
When matching is to be performed at the clause level, the omitted subject can be supplemented where necessary, as follows.
Half sentence (1) of sentence 1: the domestic production total value increased from 0 yuan to 0 yuan
Half sentence (2) of sentence 1: the domestic production total value grew by 0 percent annually
Half sentence (1) of sentence 2: the domestic production total value grew by 0 percent
Half sentence (2) of sentence 2: the total of the domestic production total value broke through 0 yuan
(2) Shallow semantic acquisition.
The purpose of this part is a preliminary extraction of the shallow semantic information of a given text: trunk components that may need to be judged are pre-extracted and synonymous components are replaced, yielding the shallow secondary feature information.
First, the trunk components of the sentence are extracted preliminarily: the nouns, together with the adjectives standing in a modifier-head (attributive) structure with them, are found, yielding several possible trunk components.
In implementation, if one core word has several attributive modifiers, they are stacked starting from the modifier nearest the core word, so several trunk-component words are obtained. For a core word w1 with two modifiers a1 and a2 (a1 outermost), this yields w1, a2w1 and a1a2w1; for the example phrase "domestic production total value", the three trunk components "domestic production total value", "production total value" and "total value" are extracted.
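A sketch of this modifier-stacking rule follows; the function name and plain string concatenation are assumptions, and a real implementation would read the modifiers off a dependency parse.

def trunk_components(modifiers: list, head: str) -> list:
    """modifiers are ordered outermost first, e.g. ["国内", "生产"] for head "总值"."""
    components = [head]
    for i in range(len(modifiers) - 1, -1, -1):  # start from the modifier nearest the head
        components.append("".join(modifiers[i:]) + head)
    return components

print(trunk_components(["国内", "生产"], "总值"))
# -> ['总值', '生产总值', '国内生产总值']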
Furthermore, in practice a rule-base vocabulary can be used to replace synonyms; e.g. "GDP" and "domestic production total value" can be substituted for one another, so that the machine can handle synonyms very conveniently.
Finally, the shallow matching score of the reference sentence and a candidate sentence is calculated with the following indexes, ordered by sentence granularity from small to large: at the character level, the character-level recall rate, F-value and ROUGE score; at the word level, the word-segment recall rate and F-value, plus the n-gram recall rate and F-value; at the trunk-component level, the trunk-component recall rate and F-value. These parallel indexes together form the score vector.
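As a sketch of one of these indexes, character-level recall and F-value can be computed as below; the multiset intersection via Counter is an implementation assumption, and the word-level and trunk-component-level indexes have the same shape over different units.

from collections import Counter

def recall_f1(reference: str, candidate: str) -> tuple:
    overlap = sum((Counter(reference) & Counter(candidate)).values())
    recall = overlap / len(reference) if reference else 0.0
    precision = overlap / len(candidate) if candidate else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, f1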
Continuing the example, half sentence (1) of sentence 1 is taken as the reference half sentence, and half sentences (1) and (2) of sentence 2 are examined as candidate matches. Their score vectors are [0.67, 0.57, 0.70, 0.50, 0.40, 0.25, 0.78, 1, 1] and [0.63, 0.57, 0.64, 0.62, 0.55, 0.38, 0.3, 1, 1], respectively.
(3) Statistical information acquisition
The purpose of this part is to learn, from text statistics of the specific field, how much different components influence the judgment of sentence similarity in that field, and to support automatic evaluation of text similarity in the field through the co-occurrence of key components.
First, TF-IDF is computed over the whole document set of the field for the sentence trunk components obtained in the previous step. Let N_w be the number of occurrences of term w in a given text and N the total number of terms in that text; let Y be the total number of documents in the corpus and Y_w the number of documents containing term w. The term frequency is then

TF_w = N_w / N

and the inverse document frequency is

IDF_w = log(Y / Y_w)

giving the final TF-IDF score

TF-IDF_w = TF_w * IDF_w
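A direct transcription of these formulas into Python might read as follows; the +1 guard against unseen terms is an assumption.

import math

def tf_idf(term: str, doc_terms: list, corpus: list) -> float:
    """doc_terms: one document as a list of terms; corpus: a list of such lists."""
    tf = doc_terms.count(term) / len(doc_terms)    # TF_w = N_w / N
    y_w = sum(1 for doc in corpus if term in doc)  # Y_w: documents containing w
    idf = math.log(len(corpus) / (y_w + 1))        # IDF_w = log(Y / Y_w)
    return tf * idf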
Second, for the top K_w components with the largest TF-IDF weight: in every record of the domain-specific document data set, a component is marked 1 if it co-occurs in the two sentences and 0 if it does not.
Let K be the size of the data set, A the co-occurrence variable and E the true-match variable. A chi-square test of independence between A and E is then computed over their 2x2 contingency table:

chi^2 = sum_ij (O_ij - E_ij)^2 / E_ij

where O_ij is the observed count of cell (i, j) over the K records and E_ij its expected count under independence.
Taking a threshold p_w, the components that pass the chi-square test are recorded as the set W_key, representing the components that are important for semantic similarity calculation.
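The filtering step can be sketched as follows, with scipy's chi2_contingency standing in for the hand-computed statistic; the function name and the 0.05 default threshold are assumptions.

from scipy.stats import chi2_contingency

def is_key_component(cooccur: list, matched: list, p_threshold: float = 0.05) -> bool:
    """cooccur/matched: parallel 0-1 lists over the K labelled sentence pairs."""
    table = [[0, 0], [0, 0]]
    for a, e in zip(cooccur, matched):
        table[a][e] += 1                   # 2x2 contingency table of A against E
    _, p_value, _, _ = chi2_contingency(table)
    return p_value < p_threshold           # significant dependence -> key component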
Finally, for each sentence pair it is recorded whether the key components of the specific field co-occur, giving the statistical-information result as a score vector.
Continuing the example, suppose five words such as "domestic production total value" are filtered out as key components; the score vectors of the two candidate matches are then [1, 0, 0, 0, 0] and [1, 0, 0, 0, 0]. In this example only one key component co-occurs, so the first element of each score vector is 1 and the remaining elements are 0.
(4) Deep semantic acquisition
The purpose of this part is to obtain domain-independent deep information about the text through a deep neural network, to assist the next module in calculating text similarity in the specific field.
A BERT model pre-trained on a broad corpus is extended with an output-layer network to obtain a BERT classification network, which is trained to compute text similarity.
Let C be the output matrix returned by the pre-trained model, W a trainable parameter matrix, and P the probability of the classification result:

P = softmax(C W^T)

The deep score is the probability that the model assigns to result class 1, i.e. to the two sentences matching.
Continuing the example, the similarity scores of the two candidates are [0.85] and [0.90], respectively.
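A hedged sketch of this deep-score computation with the Hugging Face transformers library is given below; the checkpoint name "bert-base-chinese" is an assumption, and in practice the classification head would first be fine-tuned on matched/unmatched sentence pairs.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)

def deep_score(ref: str, cand: str) -> float:
    inputs = tokenizer(ref, cand, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits                # logits correspond to C W^T
    return torch.softmax(logits, dim=-1)[0, 1].item()  # probability of class 1 ("match")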
(5) Specific text perception
This module integrates the results of the preceding score-calculation parts, makes the similarity judgment for the specific field, and enables cross-scenario migration of the model.
First, some of the scores are pre-screened. In practice, thresholds can be placed on certain scores depending on the field. If two sentences share no character at all, they can be judged dissimilar directly; likewise, if the deep network's returned score for the two texts is extremely low, the sentences can be judged dissimilar directly and filtered out.
Next, a linear regression model is constructed to make the judgment. Let x be the concatenation of the score vectors above, w the weight vector and b the bias value; the final result is

Y_match = w · x + b
The weights inside the model can be trained with a data set from a specific field, after which documents of that field can be judged. When switching scenarios, only this linear model needs to be retrained.
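A sketch of this fusion step, with scikit-learn's LinearRegression standing in for whatever regressor the implementation actually uses:

import numpy as np
from sklearn.linear_model import LinearRegression

def fit_domain_model(score_vectors: np.ndarray, labels: np.ndarray) -> LinearRegression:
    """score_vectors: (n_pairs, d) concatenated shallow/statistical/deep scores;
    labels: 1 for truly matched pairs, 0 otherwise."""
    return LinearRegression().fit(score_vectors, labels)

# Switching domains only requires refitting this lightweight model:
# model = fit_domain_model(X_legal, y_legal)
# y_match = model.predict(x.reshape(1, -1))  # computes w · x + b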
Continuing the example, the trained model gives matching scores of 0.91 and 0.94 for the two candidate matches, respectively.
3. Post-processing of results
This module reconciles the isolated matching results of the individual sentences and screens the candidate sets from the perspective of whole-document matching accuracy, improving readability and making the matching results friendlier.
First, the position information of every sentence is obtained: within its paragraph, a sentence is classified as paragraph-initial, paragraph-internal or paragraph-final. The first N_start sentences of a paragraph are paragraph-initial, the last N_end sentences are paragraph-final, and the remainder are paragraph-internal. For a reference sentence s1, if its paragraph position coincides with that of a candidate sentence s2, the score increases by Y_pos. The demarcation scheme is not fixed; manual knowledge can also be used to delimit the position information of sentences. In this way, positional correspondence improves the result.
Second, repeated-occurrence information of the comparison sentences is obtained. Given the general structure of an article, a single comparison-document sentence may be matched by several reference-document sentences, so it must be allocated sensibly to the most correct corresponding reference sentence. For a comparison sentence that has already been matched, every further match should have its score reduced by Y_repeat. In a specific implementation, repeated matching can also be corrected with manual knowledge: in contracts, for example, where repeated clauses are rare, the maximum value can be taken instead of applying the repetition-reduction principle.
Finally, the final score is obtained as Y_final = Y_match + Y_pos - Y_repeat. The sentences of each reference document are then ranked by score over their respective candidate sets to give the final matching result.
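The whole post-processing step can be sketched as below; the concrete bonus and penalty values are assumptions, since the description leaves Y_pos and Y_repeat unspecified.

Y_POS, Y_REPEAT = 0.05, 0.10  # illustrative bonus/penalty values

def final_score(y_match: float, same_position: bool, times_already_matched: int) -> float:
    y = y_match
    if same_position:
        y += Y_POS                           # position-agreement bonus
    y -= Y_REPEAT * times_already_matched    # discourage reusing a comparison sentence
    return y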
Continuing the example, both candidate matches are paragraph-internal like the reference and neither has been matched before, so the final scores remain 0.91 and 0.94; half sentence (2) of sentence 2 is thus the better match for half sentence (1) of sentence 1.
While the invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications and improvements can be made on the basis of this disclosure without departing from the spirit and scope of the invention.

Claims (8)

1. A method for matching related sentences in different documents, for matching a reference sentence in a reference document with candidate sentences in a comparison document, the method comprising:
at the three levels of shallow semantics, statistical information and deep semantics, calculating a shallow score, a statistical score and a deep score of the candidate sentence relative to the reference sentence, each representing the degree of matching;
fitting the shallow score, the statistical score and the deep score with a linear regression model to obtain a final score representing the degree of matching of the candidate sentence relative to the reference sentence;
wherein obtaining the shallow semantics comprises obtaining three parallel indexes, namely: characters, word segments and trunk components;
and wherein the shallow score is calculated by:
obtaining the recall rates, in the candidate sentence, of the characters, word segments and trunk components of the reference sentence;
forming a first vector from the recall rates as the shallow score.
2. The method for matching related sentences in different documents according to claim 1, wherein obtaining the trunk components comprises:
finding the nouns in a sentence and the adjectives standing in a modifier-head (attributive) structure with those nouns;
starting from the adjective nearest the noun and stacking further adjectives outward in sequence, combining each stack with the noun to obtain several trunk components.
3. The method for matching related sentences in different documents according to claim 2, wherein the statistical score is calculated by:
computing TF-IDF scores of the trunk components over a set of documents in the specific field containing the reference document and the comparison document;
taking the several trunk components with the highest TF-IDF scores;
taking whether a trunk component appears in both the reference sentence and the candidate sentence as a first variable and whether the reference sentence and the candidate sentence truly match as a second variable, performing a chi-square test on the two variables, and keeping the trunk components that pass the test as key components;
constructing a second vector, according to whether each key component appears in both the reference sentence and the candidate sentence, as the statistical score.
4. The method for matching related sentences in different documents according to claim 3, wherein the deep score is calculated by:
building a BERT classification network from a BERT model pre-trained on a broad corpus;
computing, with the classification network, the deep semantic similarity of the candidate sentence relative to the reference sentence to form a third vector as the deep score.
5. The method of claim 4, wherein the linear regression model comprises weights and a preset bias value, and different domain-specific weights are trained on documents from different specific fields.
6. The method for matching related sentences in different documents according to claim 1, wherein the matching method comprises:
if the candidate sentence and the reference sentence occupy the same position within their paragraphs, raising the final score by a predetermined amount.
7. A matching system for related sentences in different documents, for matching a reference sentence in a reference document with candidate sentences in a comparison document, the system comprising:
a calculation module that, at the three levels of shallow semantics, statistical information and deep semantics, calculates a shallow score, a statistical score and a deep score of the candidate sentence relative to the reference sentence, each representing the degree of matching, wherein obtaining the shallow semantics comprises obtaining three parallel indexes, namely: characters, word segments and trunk components;
a fitting module that fits the shallow score, the statistical score and the deep score with a linear regression model to obtain a final score representing the degree of matching of the candidate sentence relative to the reference sentence; wherein the shallow score is calculated by:
obtaining the recall rates, in the candidate sentence, of the characters, word segments and trunk components of the reference sentence;
forming a first vector from the recall rates as the shallow score.
8. A computer-readable storage medium, characterized in that the storage medium stores computer instructions which, when executed by a processor, implement the matching method of any one of claims 1-6.
CN202010559644.XA 2020-06-18 2020-06-18 Matching method, system and computer readable storage medium for related sentences in different documents Active CN112380830B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010559644.XA CN112380830B (en) 2020-06-18 2020-06-18 Matching method, system and computer readable storage medium for related sentences in different documents

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010559644.XA CN112380830B (en) 2020-06-18 2020-06-18 Matching method, system and computer readable storage medium for related sentences in different documents

Publications (2)

Publication Number Publication Date
CN112380830A CN112380830A (en) 2021-02-19
CN112380830B 2024-05-17

Family

ID=74586338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010559644.XA Active CN112380830B (en) 2020-06-18 2020-06-18 Matching method, system and computer readable storage medium for related sentences in different documents

Country Status (1)

Country Link
CN (1) CN112380830B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7295965B2 (en) * 2001-06-29 2007-11-13 Honeywell International Inc. Method and apparatus for determining a measure of similarity between natural language sentences

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002029618A1 (en) * 2000-09-30 2002-04-11 Intel Corporation (A Corporation Of Delaware) A method and apparatus for determining text passage similarity
RU2014112241A (en) * 2014-03-31 2015-12-20 Limited Liability Company "Аби ИнфоПоиск" (ABBYY InfoPoisk) BUILDING A CASE OF COMPARATIVE DOCUMENTS BASED ON A UNIVERSAL SIMILARITY
CN105868187A (en) * 2016-03-25 2016-08-17 北京语言大学 A multi-translation version parallel corpus establishing method
JP2017188039A (en) * 2016-04-08 2017-10-12 Kddi株式会社 Program, device and method for estimating score of text by calculating multiple similarity degrees
CN106502987A (en) * 2016-11-02 2017-03-15 深圳市空谷幽兰人工智能科技有限公司 The method and apparatus that a kind of sentence template based on seed sentence is recalled
CN107992472A (en) * 2017-11-23 2018-05-04 浪潮金融信息技术有限公司 Sentence similarity computational methods and device, computer-readable storage medium and terminal
CN108287824A (en) * 2018-03-07 2018-07-17 北京云知声信息技术有限公司 Semantic similarity calculation method and device
CN108549634A (en) * 2018-04-09 2018-09-18 北京信息科技大学 A kind of Chinese patent text similarity calculating method
CN109344236A (en) * 2018-09-07 2019-02-15 暨南大学 One kind being based on the problem of various features similarity calculating method
KR102085217B1 (en) * 2019-10-14 2020-03-04 (주)디앤아이파비스 Method, apparatus and system for determining similarity of patent documents
CN110851599A (en) * 2019-11-01 2020-02-28 中山大学 Automatic scoring method and teaching and assisting system for Chinese composition

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
A Short Texts Matching Method Using Shallow Features and Deep Features; Longbiao Kang et al.; Natural Language Processing and Chinese Computing; 150-159 *
Deep and shallow features learning for short texts matching; Ziliang Wang et al.; 2017 International Conference on Progress in Informatics and Computing (PIC); 51-55 *
Research on Hierarchical Chinese Sentence Similarity (基于分层的中文句子相似度的研究); 陈学智; Wanfang degree thesis; 27-28 *
Sentence Similarity Calculation Based on Syntactic Structure and Modifiers (基于句法结构与修饰词的句子相似度计算); 邓涵, 朱新华, 李奇, 彭琦; Computer Engineering (09); 246-250+255 *
Chinese Short Text Similarity Calculation Based on a Hybrid Strategy (基于混合策略的中文短文本相似度计算); 宋冬云, 郑瑾, 张祖平; Computer Engineering and Applications (12); 121-125+210 *
Research on a Subjective-Question Scoring Algorithm Based on Sentence Structure and Semantic Similarity (基于语句结构及语义相似度计算主观题评分算法的研究); 贾电如, 李阳明; 信息化纵横 (05); 8-10 *
A Short Text Matching Method with Multi-level Feature Fusion (多层次特征融合的短文本匹配方法); 康龙彪; Wanfang degree thesis; 1-67 *
Application of Chinese Sentence Similarity Calculation in FAQ (汉语句子相似度计算在FAQ中的应用); 裴婧, 包宏; Computer Engineering (17); 52-54 *

Also Published As

Publication number Publication date
CN112380830A (en) 2021-02-19

Similar Documents

Publication Publication Date Title
CN110442760B (en) Synonym mining method and device for question-answer retrieval system
Wang et al. K-adapter: Infusing knowledge into pre-trained models with adapters
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN106844658B (en) Automatic construction method and system of Chinese text knowledge graph
CN106484664B (en) Similarity calculating method between a kind of short text
CN104765769B (en) The short text query expansion and search method of a kind of word-based vector
US8751218B2 (en) Indexing content at semantic level
CN102663129A (en) Medical field deep question and answer method and medical retrieval system
EP3203383A1 (en) Text generation system
CN112307182B (en) Question-answering system-based pseudo-correlation feedback extended query method
CN113761890B (en) Multi-level semantic information retrieval method based on BERT context awareness
CN103473280A (en) Method and device for mining comparable network language materials
CN107895000A (en) A kind of cross-cutting semantic information retrieval method based on convolutional neural networks
CN111241824B (en) Method for identifying Chinese metaphor information
CN112883182A (en) Question-answer matching method and device based on machine reading
Nugraha et al. Typographic-based data augmentation to improve a question retrieval in short dialogue system
CN115525763A (en) Emotion analysis method based on improved SO-PMI algorithm and fusion word vector
CN112883199A (en) Collaborative disambiguation method based on deep semantic neighbor and multi-entity association
CN115905487A (en) Document question and answer method, system, electronic equipment and storage medium
Breja et al. Analyzing linguistic features for answer re-ranking of why-questions
CN114064901B (en) Book comment text classification method based on knowledge graph word meaning disambiguation
CN114626367A (en) Sentiment analysis method, system, equipment and medium based on news article content
CN112100382B (en) Clustering method and device, computer readable storage medium and processor
CN112446217B (en) Emotion analysis method and device and electronic equipment
CN112182332A (en) Emotion classification method and system based on crawler collection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Country or region after: China
Address after: 201203 rooms 301, 303 and 304, block B, 112 liangxiu Road, Pudong New Area, Shanghai
Applicant after: Daguan Data Co.,Ltd.
Address before: 201203 rooms 301, 303 and 304, block B, 112 liangxiu Road, Pudong New Area, Shanghai
Applicant before: DATAGRAND INFORMATION TECHNOLOGY (SHANGHAI) Co.,Ltd.
Country or region before: China
GR01 Patent grant