CN117688399A - Text similarity calculation method and device and electronic equipment - Google Patents

Text similarity calculation method and device and electronic equipment

Info

Publication number
CN117688399A
Authority
CN
China
Prior art keywords
text
similarity
segment
sentence
aligned
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311778507.5A
Other languages
Chinese (zh)
Inventor
李林钦
冯小琴
李维
丁辉
吴玉虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Mobvoi Information Technology Co ltd
Original Assignee
Shanghai Mobvoi Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Mobvoi Information Technology Co ltd
Priority to CN202311778507.5A
Publication of CN117688399A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Artificial Intelligence (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Algebra (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a text similarity calculation method, a text similarity calculation device and electronic equipment. The method comprises: obtaining a first text and a second text; segmenting the first text and the second text to obtain a first segment corresponding to the first text and a second segment corresponding to the second text; determining a similarity matrix according to the first similarity of the first segment and the second segment; obtaining a first aligned paragraph pair according to the similarity matrix; and obtaining the similarity of the first text and the second text according to the first aligned paragraph pair. The similarity can therefore be judged more comprehensively and reliably, and fine-grained evaluation analysis can be performed.

Description

Text similarity calculation method and device and electronic equipment
Technical Field
The present invention relates to the field of text similarity, and in particular, to a text similarity calculation method, a text similarity calculation device, and an electronic device.
Background
With the rapid development of the internet and digital technology, copying and disseminating works has become easier and more convenient, which has also led to an increase in plagiarism. To protect the intellectual property of original works and safeguard the rights and interests of creators, similarity must be evaluated and identified accurately.
The similarity evaluation methods in common use today rely on similarity algorithms that are generally applied to complete text paragraphs. They are therefore coarse-grained: the calculated similarity is not sufficiently comprehensive or reliable, and fine-grained evaluation analysis is lacking.
Disclosure of Invention
In view of this, the embodiment of the invention provides a text similarity calculation method, a text similarity calculation device and an electronic device, which can calculate the similarity more comprehensively and reliably and provide fine-grained evaluation analysis.
In a first aspect, an embodiment of the present invention provides a text similarity calculation method, where the method includes:
acquiring a first text and a second text;
segmenting the first text and the second text to obtain a first segment corresponding to the first text and a second segment corresponding to the second text;
determining a similarity matrix according to the first similarity of the first segment and the second segment;
acquiring a first aligned paragraph pair according to the similarity matrix;
and obtaining the similarity of the first text and the second text according to the first aligned paragraph pair.
In some embodiments, the segmenting the first text and the second text to obtain a first segment corresponding to the first text and a second segment corresponding to the second text includes:
splitting the first text and the second text according to at least one of punctuation, word count, word segmentation and prosody results to obtain a first sentence corresponding to the first text and a second sentence corresponding to the second text;
and merging the first sentence and the second sentence respectively to obtain a first segment corresponding to the first text and a second segment corresponding to the second text.
In some embodiments, the merging the first sentence and the second sentence, respectively, includes:
acquiring a target sentence from the first sentence or the second sentence;
acquiring a first confusion degree corresponding to the combined sentence of the target sentence and the previous sentence of the target sentence;
acquiring a second confusion degree corresponding to the combined sentence of the target sentence and the next sentence of the target sentence;
and selecting one of the previous sentence and the next sentence to be combined with the target sentence according to the first confusion degree and the second confusion degree.
In some embodiments, the determining the similarity matrix according to the first similarity of the first segment and the second segment is specifically:
and determining the first similarity of each first segment and each second segment through a preset calculation mode so as to obtain a similarity matrix.
In some embodiments, the similarity matrix is an N×M matrix, N being the number of first segments and M being the number of second segments;
the step of obtaining the first aligned paragraph pair according to the similarity matrix specifically includes iteratively performing the following steps:
determining the second segment corresponding to the maximum value element of the i-th row in the similarity matrix and the i-th first segment as a first aligned paragraph pair;
and deleting the row and the column corresponding to the maximum value element to adjust the similarity matrix.
In some embodiments, the obtaining the similarity of the first text and the second text according to the first aligned paragraph pair includes:
merging the plurality of first aligned paragraph pairs according to word count rules or sentence numbers to obtain a second aligned paragraph pair, wherein the second aligned paragraph pair comprises a third segment and a fourth segment, the third segment is obtained by merging first segments in the plurality of first aligned paragraph pairs, and the fourth segment is obtained by merging second segments in the plurality of first aligned paragraph pairs;
calculating a second similarity of the third segment and the fourth segment in the second aligned paragraph pair through a bilingual evaluation understudy (BLEU) similarity index;
and determining the similarity of the first text and the second text according to the second similarity and a preset reference index.
In some embodiments, the predetermined reference index includes a confidence level.
In some embodiments, the determining the similarity between the first text and the second text according to the second similarity and a predetermined reference index is specifically:
and calculating the mean or the median of the second similarities that fall within the confidence range as the similarity of the first text and the second text.
In a second aspect, an embodiment of the present invention provides a text similarity calculation apparatus, including:
an acquisition unit configured to acquire a first text and a second text;
a segmentation unit, configured to perform segmentation processing on the first text and the second text to obtain a first segment corresponding to the first text and a second segment corresponding to the second text;
a similarity matrix calculation unit, configured to determine a similarity matrix according to first similarities of the first segment and the second segment;
an alignment unit, configured to obtain a first aligned paragraph pair according to the similarity matrix;
and a text similarity calculation unit, configured to obtain the similarity of the first text and the second text according to the first aligned paragraph pair.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory for storing one or more computer program instructions, and a processor, wherein the one or more computer program instructions are executed by the processor to implement the method of any of the first aspects.
According to the technical scheme, the first text and the second text are segmented to obtain a first segment corresponding to the first text and a second segment corresponding to the second text, a similarity matrix is determined according to the first similarity of the first segment and the second segment, a first aligned paragraph pair is obtained according to the similarity matrix, and the similarity of the first text and the second text is obtained according to the first aligned paragraph pair. Therefore, the similarity can be judged more comprehensively and reliably, and fine-grained evaluation analysis can be performed.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent from the following description of embodiments of the present invention with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of a similarity calculation method according to an embodiment of the present invention;
FIG. 2 is a flow chart of segmenting a first text and a second text according to an embodiment of the invention;
FIG. 3 is a flow chart of a merge statement of an embodiment of the invention;
FIG. 4 is a schematic diagram of an embodiment of the present invention incorporating a first statement;
FIG. 5 is a flow chart of obtaining similarity of the first text and the second text according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a text similarity calculation device according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an electronic device according to an embodiment of the invention.
Detailed Description
The present application is described below based on examples, but is not limited to these examples. In the following detailed description of the present application, certain specific details are set forth. Those skilled in the art can fully understand the present application even without these details. Well-known methods, procedures, flows, components and circuits are not described in detail so as not to obscure the essence of the present application.
Moreover, those of ordinary skill in the art will appreciate that the drawings are provided herein for illustrative purposes and that the drawings are not necessarily drawn to scale.
Unless the context clearly requires otherwise, the words "comprise," "comprising," and the like throughout the application are to be construed as including but not being exclusive or exhaustive; that is, it is the meaning of "including but not limited to".
In the description of the present application, it should be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Furthermore, in the description of the present application, unless otherwise indicated, the meaning of "a plurality" is two or more.
There are many existing text similarity calculation methods; common ones include cosine similarity, Jaccard similarity, the bag-of-words model, and BERT-based models. Cosine similarity measures the similarity between two text vectors by calculating the angle between them. Jaccard similarity is a set-based measure commonly used to compare the word overlap of texts: it is the ratio of the number of terms common to the two texts to the total number of distinct terms in the two texts. The bag-of-words model represents a text as a collection of terms, ignoring word order and context; on top of it, a vector space model may be used to calculate the similarity between texts. A BERT-based model encodes the texts with a pre-trained BERT model and then measures the similarity of the two texts by the cosine similarity of their encodings.
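As a rough illustration of two of the baseline measures mentioned above, the following sketch computes Jaccard similarity over sets of terms and shows that a bag-of-words representation discards word order. It assumes whitespace tokenization, and the function names are hypothetical and are not part of the described method.

```python
from collections import Counter


def jaccard_similarity(tokens_a, tokens_b):
    """Jaccard similarity: |A intersect B| / |A union B| over distinct terms."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty texts
    return len(set_a & set_b) / len(union)


def bag_of_words(text):
    """Represent a text as term counts; word order and context are lost."""
    return Counter(text.split())  # assumption: whitespace tokenization


a = "the cat sat on the mat".split()
b = "the cat lay on the rug".split()
print(jaccard_similarity(a, b))  # 3 shared terms out of 7 distinct terms

# Two texts with different word order map to the same bag of words.
print(bag_of_words("dog bites man") == bag_of_words("man bites dog"))  # True
```

Both measures operate on whole texts at once, which is the coarse-grained behaviour the following paragraph criticizes.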
However, the above-mentioned similarity calculation scheme generally performs calculation for complete text paragraphs, that is, calculates the similarity between one complete paragraph and another complete paragraph, belongs to a coarse-grained evaluation method, and the calculation result is not comprehensive and reliable enough.
Fig. 1 is a flowchart of a similarity calculation method according to an embodiment of the present invention. The similarity calculation method shown in fig. 1 may be applied to various data processing devices, for example a mobile phone, a desktop computer, a notebook computer, a tablet computer, a server, or another terminal device with data processing functions, and may be used to implement functions such as information retrieval, data mining, and plagiarism discrimination. Specifically, the similarity calculation method includes the following steps:
Step S100, acquiring a first text and a second text.
In this embodiment, the first text and the second text are two texts that need to be subjected to similarity calculation, and the first text or the second text may be a complete text, for example, a complete paper, a complete article, or the like. Alternatively, the first text or the second text may be a part of a complete text, for example, a piece of text or a plurality of pieces of text in a complete text.
The text may be obtained in various ways, for example by user input, file reading, or database query. The specific way may be determined according to the actual application scenario and is not limited by the embodiment of the present invention.
In a specific scenario, the similarity calculation method of the embodiment of the invention can be used for realizing plagiarism discrimination, and at this time, one of the first text and the second text is a text to be detected, and the other is a comparison text. The text to be detected can be input by a user, and the comparison text can be input by the user or can be acquired in a preset database by the data processing equipment.
Step S200, performing segmentation processing on the first text and the second text to obtain a first segment corresponding to the first text and a second segment corresponding to the second text.
The first segment and the second segment refer to paragraphs. That is, the first text is subjected to segmentation processing to obtain a plurality of first segments corresponding to the first text, and the second text is subjected to segmentation processing to obtain a plurality of second segments corresponding to the second text.
The first text and the second text are split into first segments and second segments so that the similarity of each segment can be calculated in the subsequent steps and the results finally combined into the overall text similarity. The text must therefore be segmented in a reasonable way for the final result to be accurate.
FIG. 2 is a flow chart of acquiring a first segment and a second segment according to an embodiment of the present invention. As shown in fig. 2, the segmentation processing of the first text and the second text to obtain a first segment corresponding to the first text and a second segment corresponding to the second text includes the following steps:
Step S210, splitting the first text and the second text according to at least one of punctuation, word count, word segmentation and prosody results to obtain a first sentence corresponding to the first text and a second sentence corresponding to the second text.
The first sentences are the short clauses obtained by splitting the first text, and the second sentences are the short clauses obtained by splitting the second text, where splitting means dividing the first text or the second text in a specific manner.
In this embodiment, the first text is split according to one or more of punctuation, word count, word segmentation, and prosody result to obtain a first sentence corresponding to the first text, and the second text is split according to one or more of punctuation, word count, word segmentation, and prosody result to obtain a second sentence corresponding to the second text. There may be one or multiple first sentences, and one or multiple second sentences. When dividing by punctuation marks, for example, the text between two predetermined punctuation marks is one sentence, where the predetermined punctuation marks include the period, comma, semicolon, question mark, exclamation mark, and the like.
For example, when dividing text by word count, the word count of each resulting sentence is a predetermined value or falls within a predetermined range. For example, the sentence "The weather is clear today and the sunshine is bright, I went for a walk in a park near my home, and saw many beautiful flowers, plants and trees." is split by word count into "The weather is clear today and the sunshine is bright, / I went for a walk in a park near my home, / and saw many beautiful flowers, plants and trees."
For another example, when the text is split by punctuation, each resulting sentence contains only one predetermined punctuation mark, where the predetermined punctuation marks include the period, comma, semicolon, question mark, exclamation mark, and the like. For example, the text "This is a very long sentence. It needs to be split into multiple paragraphs." is split by punctuation into "This is a very long sentence. / It needs to be split into multiple paragraphs."
For another example, splitting may be performed using multiple criteria, for example word count and punctuation together; the embodiments of the present invention are not limited in this regard. For example, a text of 100 words is split according to word count and punctuation. Assuming that the preset word count is 20, the text is first divided into five sections of 20 words each, and each section is then divided into sentences at periods, commas, semicolons, question marks and exclamation marks. Alternatively, the text may first be split by punctuation and the resulting sentences then split further, which is not limited by the embodiment of the invention.
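The punctuation-based, word-count-based and combined splitting strategies described above might be sketched as follows. This is only an illustration under stated assumptions: "word count" is taken to mean whitespace-separated words (for Chinese text, characters would more likely be counted), and the function names and the exact punctuation set are hypothetical.

```python
import re

# Predetermined punctuation marks named in the text: period, comma, semicolon,
# question mark, exclamation mark (ASCII and full-width forms; an assumption).
PUNCTUATION = ".,;?!。，；？！"
_MARKS = re.escape(PUNCTUATION)
_SENTENCE = re.compile("[^" + _MARKS + "]+[" + _MARKS + "]?")


def split_by_punctuation(text):
    """Split text so that each piece ends with at most one predetermined mark."""
    pieces = (m.group().strip() for m in _SENTENCE.finditer(text))
    return [p for p in pieces if p]


def split_by_word_count(text, max_words):
    """Split text into chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def split_by_count_then_punctuation(text, max_words):
    """Combined strategy: split by word count first, then by punctuation."""
    sentences = []
    for chunk in split_by_word_count(text, max_words):
        sentences.extend(split_by_punctuation(chunk))
    return sentences


text = "This is a very long sentence. It needs to be split into multiple paragraphs."
print(split_by_punctuation(text))
print(split_by_count_then_punctuation(text, 5))
```

Splitting by word count first can cut a sentence in the middle, which is one reason the later merging step is needed to restore semantically coherent segments.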
When splitting by word segmentation, the sentence, which is composed of multiple words, is split into a word sequence. For example, the sentence "The weather today is good, suitable for going out to play." is split by word segmentation into "today / weather / good / , / suitable / go out / play / ."
When splitting by prosody, the fact that poems and lyrics follow a certain prosody is used to split them. Splitting may be performed in one manner or in several manners combined. For example, if poems or lyrics need to be split, the sentences may first be split into word sequences by word segmentation and then split according to prosody. For example, the sentence "Spring arrives, flowers bloom, birds call, and butterflies fly." is split into "spring / arrives / , / flowers / bloom / , / birds / call / , / butterflies / fly / ."
Step S220, merging the first sentence and the second sentence to obtain a first segment corresponding to the first text and a second segment corresponding to the second text.
The first segments are obtained by splitting the first text into first sentences and then merging those first sentences, and the second segments are obtained by splitting the second text into second sentences and then merging those second sentences. The resulting first and second segments are semantically coherent, so that each expresses a complete meaning.
FIG. 3 is a flow chart of a merge statement of an embodiment of the invention. As shown in fig. 3, the merging of the first sentence and the second sentence, respectively, includes the following four steps:
Step S221, obtaining a target sentence in the first sentence or the second sentence.
The target sentence may be any one of the first sentence or the second sentence.
In a specific implementation, the target sentence may be selected from the first sentences or the second sentences in a predetermined order. When selecting in order, the first target sentence selected is the second of the first sentences (or of the second sentences), because the target sentence is to be merged with either its previous sentence or its next sentence.
Step S222, obtaining a first confusion degree corresponding to the combined sentence of the target sentence and the previous sentence of the target sentence.
Specifically, the target sentence is combined with the previous sentence to generate a new sentence, and the confusion degree of the new sentence is then evaluated by a model. The confusion degree (more commonly called perplexity) is an index for evaluating a language model and measures the quality of the language: the smaller the confusion degree, the less the model is confused by the sentence. Its basic idea is to measure the probabilistic predictive power of a model on test data: an ideal model should assign a high probability to the test set, resulting in a low confusion degree. The definition of the confusion degree is based on the inverse of the probability and involves an exponential. Specifically, for a test set D, the confusion degree PP(D) is calculated as follows:
$$PP(D) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(w_i)\right)$$
where the sum runs over the logarithms of the probabilities of all words in the test set, $N$ is the total number of words in the test set, $w_i$ is the $i$-th word in the test set, and $p(w_i)$ is the probability that the model assigns to $w_i$. As can be seen from the formula, if the model gives a high probability to each word in the test set, the overall confusion degree is low; conversely, a high confusion degree is obtained. In the embodiment of the invention, the model with the lowest confusion degree is selected as the final model.
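The definition just described (the exponential of the negative average log-probability of the words) can be checked numerically with a small sketch. The function name and the input format, a plain list of per-word probabilities, are assumptions made for illustration; in practice the probabilities would come from a language model.

```python
import math


def perplexity(word_probabilities):
    """Confusion degree (perplexity): exp(-(1/N) * sum(log p(w_i)))."""
    n = len(word_probabilities)
    if n == 0:
        raise ValueError("at least one word probability is required")
    log_sum = sum(math.log(p) for p in word_probabilities)
    return math.exp(-log_sum / n)


# A model that assigns probability 0.25 to every word is as uncertain as a
# uniform choice among 4 words, so its perplexity is 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))

# Higher word probabilities give a lower perplexity.
print(perplexity([0.9, 0.8, 0.95]) < perplexity([0.2, 0.1, 0.3]))  # True
```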
Step S223, obtaining a second confusion degree corresponding to the combined statement of the target statement and the next statement of the target statement.
The specific manner is similar to that in step S222: the target sentence is first combined with the next sentence to generate a new sentence, and the confusion degree of the new sentence is then evaluated by the model.
Step S224, selecting one of the previous sentence and the next sentence to merge with the target sentence according to the first confusion degree and the second confusion degree.
Specifically, the confusion degrees calculated in S222 and S223 are compared, and the combination with the smaller confusion degree, that is, the one by which the model is less confused, is selected to form the merged sentence. The resulting merged sentence is easier for the model to understand, that is, its semantics are more coherent.
After step S224 is completed, the process returns to step S221, the next sentence is selected as the target sentence, and steps S222 to S224 are executed to merge it. In the same way, the sentence following the most recently merged sentence is selected each time, and one round of merging of the first sentences or the second sentences is complete once the (Q-1)-th sentence has been merged, where Q is the number of first sentences or second sentences. The merging result of one round may in turn be merged in the next round according to the same steps, until the number of merging rounds reaches a preset value.
Taking fig. 4 as an example for illustration, it is assumed that the first sentences are merged, the number q=6 of the first sentences, and the six first sentences are T11 to T16, respectively. As shown in the figure:
At the first selection, the second sentence is selected as the target sentence, that is, the target sentence is T12. The confusion degree of T12 combined with the previous sentence T11 and the confusion degree of T12 combined with the next sentence T13 are calculated, and the combination with the lower confusion degree is merged. Assuming that the combination of T12 and the next sentence T13 has the lower confusion degree, the merged sentence obtained from these two sentences is T21.
At the second selection, the sentence following the last merged sentence is selected as the target sentence. After the first merging there is only one merged sentence, T21; the next sentence T13 of the merged sentence T21 is determined as the target sentence, and the confusion degree of T13 combined with the previous sentence T21 and the confusion degree of T13 combined with the next sentence T14 are calculated. The combination with the lower confusion degree is merged. Assuming that the combination of T13 and the next sentence T14 has the lower confusion degree, the merged sentence obtained from these two sentences is T31.
At the third selection, the sentence following the last merged sentence is again selected as the target sentence. After the second merging there are two merged sentences, T21 and T31; the next sentence T15 of the last merged sentence T31 is determined as the target sentence, and the confusion degree of T15 combined with the previous sentence T31 and the confusion degree of T15 combined with the next sentence T16 are calculated. The combination with the lower confusion degree is merged. Assuming that the combination of T15 and the previous sentence T31 has the lower confusion degree, the merged sentence obtained from these two sentences is T41. At this point the (Q-1)-th sentence (i.e., T15) has been merged, and this round of merging is complete.
Meanwhile, the next round of merging can be performed on the sentences (T11, T21, T41, T16) obtained by the previous round of merging according to actual requirements, and the specific merging mode can refer to steps S221-S224, and the embodiments of the present invention are not described herein again.
Finally, after the merging step is completed, a first segment obtained by merging the first sentence and a second segment obtained by merging the second sentence can be obtained.
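One possible reading of a single merging round (steps S221 to S224) is sketched below. It is a simplified interpretation, not the definitive procedure: the confusion degree is supplied by a caller-provided function `perplexity_fn` (hypothetical; in practice a language model would compute it), sentences are joined with a space, and only sentences that have both a previous and a next neighbour are used as targets.

```python
def merge_round(sentences, perplexity_fn):
    """One simplified merging round over a list of sentences."""
    items = list(sentences)
    i = 1  # the first target is the second sentence, so it has a previous one
    while i < len(items) - 1:  # a target needs a previous and a next sentence
        with_prev = items[i - 1] + " " + items[i]
        with_next = items[i] + " " + items[i + 1]
        if perplexity_fn(with_prev) <= perplexity_fn(with_next):
            # Merge with the previous sentence; the merged sentence sits at
            # index i - 1, so index i now holds the sentence following it.
            items[i - 1:i + 1] = [with_prev]
        else:
            # Merge with the next sentence; the merged sentence sits at
            # index i, so the following sentence is at index i + 1.
            items[i:i + 2] = [with_next]
            i += 1
    return items


# Made-up scores standing in for a language model (illustration only).
made_up_scores = {"a b": 5.0, "b c": 1.0}


def lookup_perplexity(text):
    return made_up_scores.get(text, 10.0)


print(merge_round(["a", "b", "c", "d"], lookup_perplexity))  # ['a', 'b c', 'd']
```

Further rounds could be run by calling `merge_round` again on its own output until a preset number of rounds is reached.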
Step S300, determining a similarity matrix according to the first similarity of the first segment and the second segment.
That is, the similarity between each first segment and each second segment is calculated, and the results are combined into a similarity matrix. The similarity matrix characterizes the similarity between the first segments and the second segments.
Specifically, the similarity matrix is obtained by determining the first similarity of each first segment and each second segment in a preset calculation manner: text vectors of the first segment and the second segment are obtained first, and semantic similarity is then calculated by means of cosine similarity and Jaccard similarity. When the vectors differ in length, the texts are first mapped to the same dimension before the vector operation is carried out.
Assuming that the text vector corresponding to a certain first segment is a first vector A and the text vector corresponding to a certain second segment is a second vector B, the cosine similarity F of the first segment and the second segment is calculated as follows:
$$F = \frac{A \cdot B}{|A| \times |B|}$$
where $|A|$ denotes the length (norm) of vector A, $|B|$ denotes the length (norm) of vector B, and $A \cdot B$ denotes the scalar (dot) product of A and B.
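The cosine similarity of two text vectors can be sketched in a few lines. The vectors are assumed to have already been mapped to the same dimension, and returning 0.0 for a zero vector is a convention chosen here rather than something the text specifies.

```python
import math


def cosine_similarity(a, b):
    """F = (A . B) / (|A| * |B|) for two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


print(cosine_similarity([3, 4], [4, 3]))  # 24 / 25
print(cosine_similarity([1, 0], [0, 1]))  # orthogonal vectors give 0.0
```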
Specifically, when there are N first segments and M second segments, the similarity between each first segment and each second segment is calculated, forming an N×M similarity matrix G:
$$G = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1M} \\ a_{21} & a_{22} & \cdots & a_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ a_{N1} & a_{N2} & \cdots & a_{NM} \end{pmatrix}$$
The elements of the similarity matrix have the following meaning: the element $a_{ij}$ in the i-th row and j-th column represents the similarity of the i-th first segment and the j-th second segment. Thus, the similarity between any one of the first segments and any one of the second segments can be obtained.
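Building the N×M matrix amounts to evaluating a pairwise similarity for every (first segment, second segment) combination. The sketch below keeps the similarity function pluggable; the token-level Jaccard function used in the demonstration is only a stand-in for whatever preset calculation is chosen, and all names are hypothetical.

```python
def similarity_matrix(first_segments, second_segments, similarity_fn):
    """Return G where G[i][j] is the similarity of first i and second j."""
    return [[similarity_fn(f, s) for s in second_segments]
            for f in first_segments]


def token_jaccard(text_a, text_b):
    """Stand-in similarity: Jaccard overlap of whitespace-separated tokens."""
    set_a, set_b = set(text_a.split()), set(text_b.split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


first = ["a b", "c d", "e f"]   # N = 3 first segments
second = ["a b", "c x"]         # M = 2 second segments
G = similarity_matrix(first, second, token_jaccard)
for row in G:
    print(row)
```

The result has one row per first segment and one column per second segment, so `G[i][j]` corresponds to the element in row i+1 and column j+1 of the matrix described above (Python indices start at 0).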
Step S400, obtaining a first aligned paragraph pair according to the similarity matrix.
Obtaining the first aligned paragraph pair according to the similarity matrix specifically includes iteratively performing the following steps:
Step S410, determining the second segment corresponding to the maximum value element of the i-th row in the similarity matrix and the i-th first segment as a first aligned paragraph pair.
Step S420, deleting the row and the column corresponding to the maximum value element to adjust the similarity matrix.
The similarity matrix G is assumed to be a 4×4 matrix, specifically as follows:
It is assumed that the second segments corresponding to the respective first segments are acquired in the order of the first segments.
For the first segment, the maximum value element of the first row of the similarity matrix is 0.65, the corresponding second segment is the third second segment, the first segment and the third second segment are determined to be an aligned segment pair, then the first row and the third column of the matrix are deleted, and a new 3×3 matrix is obtained as shown in the following figure:
and performing the operation in an iterative mode until the number of rows or columns in the similarity matrix is 0, so as to obtain a first aligned paragraph pair.
It should be noted that the first segment for which a pairing is sought may be selected arbitrarily.
For example, the second of the first segments, i.e., the second row of the matrix, may be selected first. In this case, the second segment corresponding to the maximum-value element of the second row of the similarity matrix, namely the fourth second segment, and the second first segment are determined to be an aligned paragraph pair; the second row and the fourth column of the matrix are then deleted, yielding a new 3×3 matrix.
The third or fourth first segment may also be selected to find a pairing among the second segments; embodiments of the invention are not limited in this respect.
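The iterative procedure of steps S410 and S420 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `greedy_align` is an assumption, and the 4×4 matrix is invented, chosen only so that the first row peaks at 0.65 in the third column and the second row peaks in the fourth column, as in the description. Instead of physically deleting rows and columns, the sketch tracks which columns are still available, which has the same effect.

```python
from typing import List, Optional, Sequence, Tuple


def greedy_align(
    matrix: Sequence[Sequence[float]],
    row_order: Optional[Sequence[int]] = None,
) -> List[Tuple[int, int, float]]:
    """Pair each row with its best remaining column, in the given row order.

    Returns (row index, column index, similarity) triples with 0-based indices.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if n_rows else 0
    order = list(row_order) if row_order is not None else list(range(n_rows))
    free_cols = set(range(n_cols))
    pairs: List[Tuple[int, int, float]] = []
    for i in order:
        if not free_cols:  # no columns left: the matrix is exhausted
            break
        # Maximum-value element of row i among the columns not yet deleted;
        # sorting makes ties resolve to the lowest column index.
        j = max(sorted(free_cols), key=lambda c: matrix[i][c])
        pairs.append((i, j, matrix[i][j]))
        free_cols.remove(j)  # "delete" column j; row i is never visited again
    return pairs


# Hypothetical similarity matrix (values invented for illustration).
G = [
    [0.20, 0.40, 0.65, 0.10],
    [0.30, 0.25, 0.50, 0.70],
    [0.60, 0.35, 0.45, 0.15],
    [0.10, 0.55, 0.30, 0.20],
]
print(greedy_align(G))                          # rows processed in order
print(greedy_align(G, row_order=[1, 0, 2, 3]))  # starting from the second row
```

Because each row takes the best column still available, the result can depend on the order in which rows are processed; this is a property of the greedy strategy, not a guarantee of a globally optimal assignment.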
Step S500, obtaining the similarity of the first text and the second text according to the first aligned paragraph pair.
The similarity of the first text and the second text is the overall similarity between the two texts; specifically, it is calculated from the similarities obtained for the individual first aligned paragraph pairs.
Fig. 5 is a flowchart of acquiring similarity between the first text and the second text according to an embodiment of the present invention. As shown in fig. 5, the obtaining the similarity between the first text and the second text according to the first aligned paragraph pair includes the following steps:
Step S510, merging the plurality of first aligned paragraph pairs according to a word count rule or a sentence count to obtain a second aligned paragraph pair, where the second aligned paragraph pair includes a third segment and a fourth segment, the third segment is obtained by merging the first segments in the plurality of first aligned paragraph pairs, and the fourth segment is obtained by merging the second segments in the plurality of first aligned paragraph pairs.
Merging according to the word count rule or the sentence count follows the principle of keeping the lengths of all paragraphs approximately the same, so that the paragraphs whose similarity is calculated are as balanced in length as possible and deviations caused by differing lengths are avoided. In this way, the first aligned paragraph pairs are merged into second aligned paragraph pairs of similar length. The third segment is obtained by merging the first segments of the first aligned paragraph pairs according to the grouping of the second aligned paragraph pair, and the fourth segment is obtained by merging the second segments of the first aligned paragraph pairs according to the same grouping.
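One possible reading of the word count rule is sketched below: consecutive first aligned paragraph pairs are accumulated until the first-side text reaches a minimum number of words, and a short remainder is folded into the last group. The function name `merge_aligned_pairs`, the `min_words` threshold, and whitespace-based word counting are all assumptions for this example; for Chinese text a character count or a word segmenter would be needed instead.

```python
from typing import List, Tuple


def merge_aligned_pairs(
    pairs: List[Tuple[str, str]], min_words: int = 50
) -> List[Tuple[str, str]]:
    """Merge consecutive (first segment, second segment) pairs into longer
    (third segment, fourth segment) pairs of roughly balanced length."""
    merged: List[Tuple[str, str]] = []
    buf_first: List[str] = []
    buf_second: List[str] = []
    count = 0
    for first, second in pairs:
        buf_first.append(first)
        buf_second.append(second)
        count += len(first.split())  # assumption: whitespace-delimited words
        if count >= min_words:
            merged.append((" ".join(buf_first), " ".join(buf_second)))
            buf_first, buf_second, count = [], [], 0
    if buf_first:
        tail = (" ".join(buf_first), " ".join(buf_second))
        if merged:
            # Fold a short remainder into the previous group so that no
            # very short pair is left at the end.
            last_first, last_second = merged[-1]
            merged[-1] = (last_first + " " + tail[0], last_second + " " + tail[1])
        else:
            merged.append(tail)
    return merged


pairs = [("a b", "x y"), ("c d", "z w"), ("e", "v")]
print(merge_aligned_pairs(pairs, min_words=2))
```

The first and second sides are always grouped identically, so each merged pair still contains text that was aligned in the previous step.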
Step S520, calculating a second similarity of the third segment and the fourth segment in the second aligned paragraph pair using the bilingual evaluation understudy (BLEU) similarity index.
BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of machine translation of natural language; it is an index that measures the difference between sentences generated by a model and reference sentences. The BLEU score ranges from 0 to 1 and represents the similarity of the machine translation result to the reference translation. In general, the higher the BLEU score, the closer the machine translation result is to the reference translation. The second similarity of each of the plurality of second aligned paragraph pairs is obtained using BLEU.
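To make the scoring step concrete, the sketch below implements a simplified BLEU-style score (clipped n-gram precision up to bigrams, combined by geometric mean, with a brevity penalty). It is an illustrative simplification, not the full metric and not necessarily the variant used in the embodiment: the function name `simple_bleu`, the choice of `max_n = 2`, the absence of smoothing, and the decision of which segment plays the role of "reference" are all assumptions.

```python
import math
from collections import Counter
from typing import Counter as CounterType, List, Tuple


def _ngrams(tokens: List[str], n: int) -> CounterType[Tuple[str, ...]]:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate: List[str], reference: List[str], max_n: int = 2) -> float:
    """Simplified BLEU in [0, 1]; returns 0.0 if any n-gram order has no match."""
    if not candidate or not reference:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = _ngrams(candidate, n), _ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)


third = "the cat sat on mat".split()       # hypothetical third segment
fourth = "the cat sat on the mat".split()  # hypothetical fourth segment
print(simple_bleu(third, fourth))
```

Note that the score is not symmetric: swapping candidate and reference generally changes the result, so an implementation has to fix one direction or combine both.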
Step S530, determining the similarity between the first text and the second text according to the second similarity and a predetermined reference index.
Specifically, the predetermined reference index includes a confidence level. Determining the similarity of the first text and the second text according to the second similarity and the predetermined reference index specifically means calculating the mean or the median of the second similarity values that lie within the confidence range, and using it as the similarity of the first text and the second text. The second similarity values within the confidence range are the points that fall inside the confidence interval selected according to the preset confidence level; these points serve as the data to be processed. All the data to be processed are then averaged, or their median is taken, and the result is used as the similarity of the first text and the second text. For example, when the data to be processed consist of the five values 0.25, 0.30, 0.35, 0.50 and 0.60, the similarity obtained from the mean is 0.40 and the similarity obtained from the median is 0.35. Whether the mean or the median is used to characterize the similarity depends on the application scenario.
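The aggregation step can be sketched as follows, reusing the five values from the example above. The source does not define how the confidence interval is constructed; this sketch assumes an interval of mean ± z standard deviations (z = 1.96, roughly a 95% level under a normal assumption). The function name `text_similarity` and the fallback for an empty interval are likewise assumptions.

```python
import statistics
from typing import Sequence


def text_similarity(
    scores: Sequence[float], z: float = 1.96, use_median: bool = False
) -> float:
    """Mean or median of the second similarity values inside the interval."""
    if not scores:
        raise ValueError("at least one second similarity value is required")
    center = statistics.mean(scores)
    spread = statistics.pstdev(scores)  # population standard deviation
    low, high = center - z * spread, center + z * spread
    inside = [s for s in scores if low <= s <= high]
    if not inside:
        # Assumption: if the interval excludes every point, use all of them.
        inside = list(scores)
    return statistics.median(inside) if use_median else statistics.mean(inside)


scores = [0.25, 0.30, 0.35, 0.50, 0.60]
print(round(text_similarity(scores), 2))         # mean of the retained values
print(text_similarity(scores, use_median=True))  # median of the retained values
```

With these five values every point lies inside the assumed interval, so the mean and median agree with the figures given in the text (0.40 and 0.35).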
Points outside the interval are outliers of the second similarity of the aligned paragraph pairs. Because they lie outside the confidence interval, when an outlier occurs it is necessary to determine whether an anomaly exists. An outlier is a data point in the data set that differs greatly from the other data points. In the list of second similarity scores, an outlier indicates that the similarity of some text paragraphs is anomalous relative to that of the other paragraphs. Such outliers may arise from special circumstances, for example because certain paragraphs are indeed plagiarized, or because the scores of certain paragraphs are abnormally high or abnormally low for particular reasons.
In document similarity evaluation, fine-grained anomalies may occur. A fine-grained anomaly refers to anomalous similarity in some smaller text unit. Fine-grained evaluation can help detect more subtle plagiarism, such as misspellings or word-for-word copying. Outliers can serve as reference values for fine-grained plagiarism anomalies, since they may indicate anomalous plagiarism behavior.
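A minimal sketch of flagging such outliers for closer inspection is shown below. As before, the interval rule (mean ± z standard deviations), the function name `flag_outliers`, and the sample scores are assumptions made for illustration; the source only states that points outside the confidence interval are treated as outliers.

```python
import statistics
from typing import List, Sequence


def flag_outliers(scores: Sequence[float], z: float = 1.96) -> List[int]:
    """Return the 0-based indices of paragraph pairs whose second similarity
    lies outside the interval mean ± z * standard deviation."""
    if len(scores) < 2:
        return []  # a single score cannot be compared with others
    center = statistics.mean(scores)
    spread = statistics.pstdev(scores)
    low, high = center - z * spread, center + z * spread
    return [i for i, s in enumerate(scores) if s < low or s > high]


# Hypothetical scores: the last paragraph pair is far more similar than the rest.
scores = [0.30, 0.32, 0.35, 0.33, 0.95]
print(flag_outliers(scores, z=1.5))
```

The returned indices identify which aligned paragraph pairs to examine at a finer granularity, for instance sentence by sentence.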
According to the embodiment of the invention, the first text and the second text are segmented to obtain the first segment corresponding to the first text and the second segment corresponding to the second text, the similarity matrix is determined according to the first similarity of the first segment and the second segment, the first aligned paragraph pair is obtained according to the similarity matrix, and the similarity of the first text and the second text is obtained according to the first aligned paragraph pair. Therefore, the similarity can be judged more comprehensively and reliably, and fine-grained evaluation analysis can be performed.
Fig. 6 is a schematic diagram of a text similarity calculation device according to an embodiment of the present invention. As shown in fig. 6, the text similarity calculation device includes an acquisition unit 61, a segmentation unit 62, a similarity matrix calculation unit 63, an alignment unit 64, and a text similarity calculation unit 65. The acquisition unit 61 is configured to acquire a first text and a second text. The segmentation unit 62 is configured to segment the first text and the second text to obtain a first segment corresponding to the first text and a second segment corresponding to the second text. The similarity matrix calculation unit 63 is configured to determine a similarity matrix according to a first similarity of the first segment and the second segment. The alignment unit 64 is configured to obtain a first aligned paragraph pair according to the similarity matrix. The text similarity calculation unit 65 is configured to obtain the similarity of the first text and the second text according to the first aligned paragraph pair.
In some embodiments, the segmentation unit comprises:
a splitting subunit, configured to split the first text and the second text according to at least one of punctuation, word count, word segmentation and prosody results, so as to obtain first sentences corresponding to the first text and second sentences corresponding to the second text; and
a merging subunit, configured to merge the first sentences and the second sentences respectively, so as to obtain the first segment corresponding to the first text and the second segment corresponding to the second text.
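As a rough illustration of the punctuation-based option of the splitting subunit, the sketch below cuts text at sentence-ending punctuation. The function name `split_sentences` and the chosen punctuation set (Western and Chinese full stops, exclamation marks and question marks) are assumptions; splitting by word count, word segmentation or prosody results is not covered.

```python
import re
from typing import List

# Assumed set of sentence-ending marks (ASCII and full-width Chinese forms).
_SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]*")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences, keeping the closing punctuation."""
    parts = (m.strip() for m in _SENTENCE.findall(text))
    return [p for p in parts if p]


print(split_sentences("Hello world. How are you? Fine."))
print(split_sentences("你好。再见！"))
```

A rule this simple will also split at decimal points and abbreviations, which is one reason the description allows other splitting criteria to be combined with punctuation.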
In some embodiments, the merging subunit comprises:
a sentence acquisition module, configured to acquire a target sentence from the first sentences or the second sentences;
a first perplexity acquisition module, configured to acquire a first perplexity (confusion degree) corresponding to the combined sentence formed by the target sentence and the sentence preceding it;
a second perplexity acquisition module, configured to acquire a second perplexity corresponding to the combined sentence formed by the target sentence and the sentence following it; and
a sentence merging module, configured to select, according to the first perplexity and the second perplexity, one of the preceding sentence and the following sentence to be merged with the target sentence.
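The selection made by the sentence merging module can be sketched as below. The rule used here, merging with the neighbor whose combined sentence has the lower perplexity, is an assumption about how the two values are compared. The function name `choose_merge_neighbor` is hypothetical, and the `perplexity` argument stands in for a language-model scorer that the caller must supply; the toy scorer in the usage lines is a placeholder, not a real perplexity.

```python
from typing import Callable, List, Optional


def choose_merge_neighbor(
    sentences: List[str],
    index: int,
    perplexity: Callable[[str], float],
) -> Optional[int]:
    """Return the index of the neighbor to merge with sentences[index],
    or None if the target sentence has no neighbor."""
    prev_ppl = next_ppl = None
    if index > 0:
        # First perplexity: previous sentence combined with the target.
        prev_ppl = perplexity(sentences[index - 1] + " " + sentences[index])
    if index < len(sentences) - 1:
        # Second perplexity: target combined with the next sentence.
        next_ppl = perplexity(sentences[index] + " " + sentences[index + 1])
    if prev_ppl is None and next_ppl is None:
        return None
    if next_ppl is None or (prev_ppl is not None and prev_ppl <= next_ppl):
        return index - 1  # assumption: lower perplexity means a better fit
    return index + 1


def toy_perplexity(text: str) -> float:
    """Placeholder scorer for demonstration only (not a real language model)."""
    return float(len(text))


print(choose_merge_neighbor(["aa", "b", "cccc"], 1, toy_perplexity))
```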
In some embodiments, the similarity matrix calculation unit is specifically configured to:
determine the first similarity of each first segment and each second segment by a preset calculation method, so as to obtain the similarity matrix.
In some embodiments, the similarity matrix is an N×M matrix, N being the number of first segments and M being the number of second segments;
wherein the alignment unit specifically performs the following steps iteratively:
determining the second segment corresponding to the maximum-value element of the i-th row in the similarity matrix, together with the i-th first segment, as a first aligned paragraph pair; and
deleting the row and the column corresponding to the maximum-value element to adjust the similarity matrix.
In some embodiments, the text similarity calculation unit includes:
a paragraph pair merging subunit, configured to merge a plurality of first aligned paragraph pairs according to a word count rule or a sentence count to obtain a second aligned paragraph pair, where the second aligned paragraph pair includes a third segment and a fourth segment, the third segment is obtained by merging the first segments in the plurality of first aligned paragraph pairs, and the fourth segment is obtained by merging the second segments in the plurality of first aligned paragraph pairs;
a second similarity calculation subunit, configured to calculate a second similarity of the third segment and the fourth segment in the second aligned paragraph pair using the bilingual evaluation understudy (BLEU) similarity index; and
a similarity determining subunit, configured to determine the similarity of the first text and the second text according to the second similarity and a predetermined reference index.
In some embodiments, the predetermined reference index includes a confidence level.
In some embodiments, the similarity determining subunit is specifically configured to:
calculate the mean or the median of the second similarity values within the confidence range as the similarity of the first text and the second text.
According to the embodiment of the invention, the first text and the second text are segmented to obtain the first segment corresponding to the first text and the second segment corresponding to the second text, the similarity matrix is determined according to the first similarity of the first segment and the second segment, the first aligned paragraph pair is obtained according to the similarity matrix, and the similarity of the first text and the second text is obtained according to the first aligned paragraph pair. Therefore, the similarity can be judged more comprehensively and reliably, and fine-grained evaluation analysis can be performed.
Fig. 7 is a schematic diagram of an electronic device according to an embodiment of the invention. The electronic device shown in fig. 7 is a general address query device with a general-purpose computer hardware structure that includes at least a processor 71 and a memory 72. The processor 71 and the memory 72 are connected by a bus 73. The memory 72 is adapted to store instructions or programs executable by the processor 71. The processor 71 may be a single microprocessor or a collection of one or more microprocessors. Thus, by executing the instructions stored in the memory 72, the processor 71 performs the method flow of the embodiments of the present invention described above, so as to process data and control other devices. The bus 73 connects the above components together and also connects them to a display controller 74, a display device, and input/output (I/O) devices 75. The input/output (I/O) devices 75 may be a mouse, keyboard, modem, network interface, touch input device, somatosensory input device, printer, or other devices known in the art. Typically, the input/output devices 75 are connected to the system through an input/output (I/O) controller 76.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, apparatus (device) or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may employ a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations of methods, apparatus (devices) and computer program products according to embodiments of the application. It will be understood that each of the flows in the flowchart may be implemented by computer program instructions.
These computer program instructions may be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows.
These computer program instructions may also be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows.
Another embodiment of the present invention is directed to a non-volatile storage medium storing a computer readable program for causing a computer to perform some or all of the method embodiments described above.
That is, those skilled in the art will understand that all or part of the steps of the methods of the embodiments described above may be implemented by a program instructing the relevant hardware, where the program is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip or the like) or a processor to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
The foregoing describes only preferred embodiments of the present application and is not intended to limit it; those skilled in the art may make various modifications and variations. Any modifications, equivalent substitutions, improvements, etc. that fall within the spirit and principles of the present application are intended to be included within the scope of the present application.

Claims (10)

1. A text similarity calculation method, the method comprising:
acquiring a first text and a second text;
segmenting the first text and the second text to obtain a first segment corresponding to the first text and a second segment corresponding to the second text;
determining a similarity matrix according to the first similarity of the first segment and the second segment;
acquiring a first aligned paragraph pair according to the similarity matrix;
and obtaining the similarity of the first text and the second text according to the first aligned paragraph pair.
2. The method of claim 1, wherein segmenting the first text and the second text to obtain a first segment corresponding to the first text and a second segment corresponding to the second text comprises:
splitting the first text and the second text according to at least one of punctuation, word count, word segmentation and prosody results to obtain a first sentence corresponding to the first text and a second sentence corresponding to the second text;
and merging the first sentence and the second sentence respectively to obtain a first segment corresponding to the first text and a second segment corresponding to the second text.
3. The method of claim 2, wherein the merging the first sentence and the second sentence, respectively, comprises:
acquiring a target sentence from the first sentence or the second sentence;
acquiring a first perplexity (confusion degree) corresponding to a combined sentence of the target sentence and the previous sentence of the target sentence;
acquiring a second perplexity corresponding to a combined sentence of the target sentence and the next sentence of the target sentence;
and selecting, according to the first perplexity and the second perplexity, one of the previous sentence and the next sentence to be merged with the target sentence.
4. The method according to claim 1, wherein said determining a similarity matrix from the first similarity of the first segment and the second segment is specifically:
and determining the first similarity of each first segment and each second segment through a preset calculation mode so as to obtain a similarity matrix.
5. The method of claim 1, wherein the similarity matrix is a matrix of N x M, N being the number of first segments and M being the number of second segments;
wherein obtaining the first aligned paragraph pair according to the similarity matrix specifically includes performing the following steps iteratively:
determining the second segment corresponding to the maximum-value element of the i-th row in the similarity matrix, together with the i-th first segment, as a first aligned paragraph pair;
and deleting the row and the column corresponding to the maximum value element to adjust the similarity matrix.
6. The method of claim 1, wherein the obtaining the similarity of the first text and the second text from the first aligned paragraph pair comprises:
merging the plurality of first aligned paragraph pairs according to a word count rule or a sentence count to obtain a second aligned paragraph pair, wherein the second aligned paragraph pair comprises a third segment and a fourth segment, the third segment is obtained by merging the first segments in the plurality of first aligned paragraph pairs, and the fourth segment is obtained by merging the second segments in the plurality of first aligned paragraph pairs;
calculating a second similarity of the third segment and the fourth segment in the second aligned paragraph pair using the bilingual evaluation understudy (BLEU) similarity index;
and determining the similarity of the first text and the second text according to the second similarity and a preset reference index.
7. The method of claim 6, wherein the predetermined reference indicator comprises a confidence level.
8. The method according to claim 7, wherein the determining the similarity of the first text and the second text according to the second similarity and a predetermined reference index is specifically:
and calculating a mean value or a median value of the second similarity values in the confidence coefficient range as the similarity between the first text and the second text.
9. A text similarity calculation device, the device comprising:
an acquisition unit, configured to acquire a first text and a second text;
a segmentation unit, configured to segment the first text and the second text to obtain a first segment corresponding to the first text and a second segment corresponding to the second text;
a similarity matrix calculation unit, configured to determine a similarity matrix according to a first similarity of the first segment and the second segment;
an alignment unit, configured to obtain a first aligned paragraph pair according to the similarity matrix;
and a text similarity calculation unit, configured to obtain the similarity of the first text and the second text according to the first aligned paragraph pair.
10. An electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method of any of claims 1-8.
CN202311778507.5A 2023-12-21 2023-12-21 Text similarity calculation method and device and electronic equipment Pending CN117688399A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311778507.5A CN117688399A (en) 2023-12-21 2023-12-21 Text similarity calculation method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311778507.5A CN117688399A (en) 2023-12-21 2023-12-21 Text similarity calculation method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN117688399A true CN117688399A (en) 2024-03-12

Family

ID=90138798

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311778507.5A Pending CN117688399A (en) 2023-12-21 2023-12-21 Text similarity calculation method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN117688399A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination