CN113673216B

CN113673216B - Text infringement detection method and device and electronic equipment

Info

Publication number: CN113673216B
Application number: CN202111222905.XA
Authority: CN
Inventors: 黄凯明; 李泽昌; 徐军; 张伟; 张晓博; 杨磊
Original assignee: Alipay Hangzhou Information Technology Co Ltd
Current assignee: Alipay Hangzhou Information Technology Co Ltd
Priority date: 2021-10-20
Filing date: 2021-10-20
Publication date: 2022-02-01
Anticipated expiration: 2041-10-20
Also published as: CN113673216A

Abstract

A text infringement detection method, a text infringement detection device, and electronic equipment are provided. The method comprises: extracting key sentences from a target text, and vectorizing the key sentences to obtain key sentence vectors corresponding to the key sentences; calculating the vector similarity between the key sentence vectors and original sentence vectors, and determining candidate sentences similar to the key sentences based on the vector similarity, wherein the original sentence vectors comprise sentence vectors obtained by vectorizing original sentences in original texts; and calculating, based on the vector similarity between the key sentence vectors and the original sentence vectors corresponding to the candidate sentences, the text similarity between the target text and the candidate text to which the candidate sentences belong, determining whether the target text is an infringing text of the candidate text based on the text similarity, and, when the target text is an infringing text of the candidate text, publishing the candidate sentences to a blockchain for evidence storage as the infringement details of the target text with respect to the candidate text.

Description

Text infringement detection method and device and electronic equipment

Technical Field

One or more embodiments of the present disclosure relate to the technical field of computer applications, and in particular, to a text infringement detection method and apparatus, and an electronic device.

Background

With the popularization of the internet, information spreads ever faster, and the creators of online news, online novels, self-media, and the like continuously produce more and newer content, such as press releases, novels, and popular science articles, which is typically disseminated over the internet in text form. At the same time, however, plagiarism of original texts and the text infringement it causes emerge endlessly. Under these circumstances, in order to protect the rights of original authors, how to perform text infringement detection and how to improve its accuracy have become problems to be solved urgently.

Disclosure of Invention

The present specification proposes a text infringement detection method, including:

extracting key sentences from a target text to be detected, and carrying out vectorization processing on the key sentences to obtain key sentence vectors corresponding to the key sentences;

calculating the vector similarity between the key sentence vector and the original sentence vector, and determining candidate sentences similar to the key sentences based on the vector similarity; the original sentence vector comprises a sentence vector corresponding to an original sentence obtained by vectorizing an original sentence in a preset original text;

based on the vector similarity between the key sentence vector and the original sentence vector corresponding to the candidate sentence, further calculating the text similarity between the target text and the candidate text to which the candidate sentence belongs, and determining, based on the text similarity, whether the target text is an infringing text of the candidate text, so that when the target text is an infringing text of the candidate text, the candidate sentence is published to a blockchain for evidence storage as the infringement detail of the target text with respect to the candidate text.

This specification also proposes a text infringement detection device, the device comprising:

the extraction module is used for extracting key sentences from a target text to be detected and vectorizing the key sentences to obtain key sentence vectors corresponding to the key sentences;

the first calculation module is used for calculating the vector similarity between the key sentence vector and the original sentence vector and determining candidate sentences similar to the key sentences based on the vector similarity; the original sentence vector comprises a sentence vector corresponding to an original sentence obtained by vectorizing an original sentence in a preset original text;

and the second calculation module is used for further calculating the text similarity between the target text and the candidate text to which the candidate sentence belongs based on the vector similarity between the key sentence vector and the original sentence vector corresponding to the candidate sentence, and determining, based on the text similarity, whether the target text is an infringing text of the candidate text, so that when the target text is an infringing text of the candidate text, the candidate sentence is published to a blockchain for evidence storage as the infringement detail of the target text with respect to the candidate text.

This specification also proposes an electronic device including:

a processor;

a memory for storing processor-executable instructions;

wherein the processor implements the steps of the above method by executing the executable instructions.

The present specification also contemplates a computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the above-described method.

In the above technical solution, the vector similarity between the key sentence vector corresponding to each key sentence in the target text to be detected and the original sentence vector corresponding to each original sentence in each original text is calculated first; the text similarity between the target text and each original text is then calculated based on the calculated vector similarity; and finally, whether the target text is an infringing text of each original text is determined based on the calculated text similarity. Text infringement detection is thus performed at sentence granularity and relies on the vector similarity between sentence vectors, which effectively addresses the difficulty of detecting infringement modes such as word order adjustment, sentence pattern change, and synonym replacement.

Drawings

FIG. 1 is a flowchart of a text infringement detection method shown in an exemplary embodiment of the present specification;

FIG. 2 is a schematic diagram of a sentence similarity detection model shown in an exemplary embodiment of the present specification;

FIG. 3 is a schematic structural diagram of an electronic device shown in an exemplary embodiment of the present specification;

FIG. 4 is a block diagram of a text infringement detection apparatus shown in an exemplary embodiment of the present specification.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of one or more embodiments of the specification, as detailed in the claims which follow.

It should be noted that: in other embodiments, the steps of the corresponding methods are not necessarily performed in the order shown and described herein. In some other embodiments, the method may include more or fewer steps than those described herein. Moreover, a single step described in this specification may be broken down into multiple steps for description in other embodiments; multiple steps described in this specification may be combined into a single step in other embodiments.

In practical applications, the text infringement mode is various, and includes full-text plagiarism, paragraph interception, word order adjustment, sentence pattern modification, synonym replacement and the like.

In the related art, when text infringement detection is performed, a detection mechanism of hash matching or character matching is generally adopted.

Specifically, when a hash matching detection mechanism is used for text infringement detection, hash calculation may be performed on the text to be detected and on the original text to obtain their respective hash values (for example, SHA256 or MD5 values). The two hash values are then compared, and if they are the same, the text to be detected is considered to be an infringing text of the original text. This detection mechanism can therefore only detect infringing text whose content is identical to that of the original text.
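The hash matching mechanism described above can be illustrated with a minimal Python sketch. The function names and sample sentences are hypothetical and chosen only for illustration; the sketch assumes SHA256 as the hash function, which is one of the options named above.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of a text as a hexadecimal string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def is_exact_copy(candidate: str, original: str) -> bool:
    """Hash matching: flag the candidate only when both digests are equal."""
    return sha256_hex(candidate) == sha256_hex(original)


# Hypothetical sample texts.
original = "The quick brown fox jumps over the lazy dog."
verbatim = "The quick brown fox jumps over the lazy dog."
reworded = "The fast brown fox leaps over the lazy dog."

exact_hit = is_exact_copy(verbatim, original)     # identical content
reworded_hit = is_exact_copy(reworded, original)  # synonym replacement
```

Because any change to the input yields a different digest, the reworded text is not flagged, which is the limitation noted above.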

When a locality-sensitive hash matching detection mechanism is used for text infringement detection, locality-sensitive hash calculation may first be performed on the text to be detected and on the original text to obtain their respective locality-sensitive hash values (for example, SimHash or MinHash values). The two hash values are then compared, and if they are the same or differ only slightly, the text to be detected is considered to be an infringing text of the original text. When only a small part of the text content changes, the locality-sensitive hash value of the text does not change or changes only in a small part. This detection mechanism can therefore detect infringing text obtained by changing a small part of the original text, but cannot detect infringing text produced in other ways.
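As a rough illustration of locality-sensitive hash matching, the following Python sketch computes a 64-bit SimHash fingerprint over character trigrams and compares two fingerprints by Hamming distance. The choice of trigram features, MD5 as the per-feature hash, and the threshold `max_distance=3` are assumptions made for this sketch and are not specified in this description.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint over character trigrams (assumed features)."""
    shingles = [text[i:i + 3] for i in range(len(text) - 2)] or [text]
    weights = [0] * bits
    for shingle in shingles:
        digest = hashlib.md5(shingle.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # Each feature votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: str, b: str, max_distance: int = 3) -> bool:
    """Flag two texts whose fingerprints are the same or only slightly different."""
    return hamming_distance(simhash(a), simhash(b)) <= max_distance
```

Identical texts always have distance 0; how far the fingerprint moves under a given edit depends on the features and on the length of the text, so the threshold would need tuning in practice.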

When a character matching detection mechanism is used for text infringement detection, the characters in the text to be detected may be compared with the characters in the original text, and when the number of consecutive identical characters shared by the two texts reaches a certain number, the text to be detected is determined to be an infringing text of the original text. Infringement modes such as word order adjustment, sentence pattern change, and synonym replacement can therefore bypass this detection mechanism, so that text that actually constitutes infringement goes undetected.
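The character matching mechanism can be sketched in Python with the standard library's `difflib.SequenceMatcher`, which finds the longest run of consecutive identical characters shared by two strings. The threshold `min_run=15` and the sample texts are hypothetical.

```python
from difflib import SequenceMatcher


def longest_common_run(candidate: str, original: str) -> int:
    """Length of the longest run of consecutive characters shared by both texts."""
    matcher = SequenceMatcher(None, candidate, original, autojunk=False)
    match = matcher.find_longest_match(0, len(candidate), 0, len(original))
    return match.size


def is_flagged(candidate: str, original: str, min_run: int = 15) -> bool:
    """Character matching: flag when the shared run reaches the threshold."""
    return longest_common_run(candidate, original) >= min_run


original = "the cat sat on the mat"
verbatim = "yesterday the cat sat on the mat again"
reordered = "on the mat the cat sat"

verbatim_flagged = is_flagged(verbatim, original)    # copies the whole sentence
reordered_flagged = is_flagged(reordered, original)  # word order adjusted
```

In this example, the verbatim copy shares a 22-character run with the original and is flagged, while the reordered sentence falls below the threshold and is missed.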

In order to solve the above-mentioned problem and improve the accuracy of the text infringement detection, the present specification proposes a technical solution of determining whether or not a target text constitutes an infringement text of an original text based on a vector similarity between a sentence vector corresponding to a sentence in the text (referred to as the target text) to which the text infringement detection is to be performed and a sentence vector corresponding to a sentence in the original text.

In a specific implementation, for a text (called a target text) to be subjected to text infringement detection, several sentences (called key sentences) may be extracted from the target text, and each extracted key sentence is subjected to vectorization processing to obtain a key sentence vector corresponding to the key sentence.

Similar to the target text, several sentences (called original sentences) may be extracted from the preset original text in advance, and each extracted original sentence is vectorized to obtain a sentence vector (called original sentence vector) corresponding to the original sentence.

When the text infringement detection is executed, under the condition that the key sentence vectors are obtained, the vector similarity between each key sentence vector and each original sentence vector can be calculated, and a plurality of original sentences similar to the key sentences are determined based on all the calculated vector similarities, and the original sentences can be used as candidate sentences which are possibly infringed.

In the case that the candidate sentences are determined, the original texts to which the candidate sentences belong can be further used as the candidate texts which can be infringed.

For each candidate text, the text similarity between the target text and the candidate text is further calculated based on the vector similarity between each key sentence vector and the original sentence vector corresponding to each candidate sentence, and whether the target text is an infringing text of the candidate text is determined based on the calculated text similarity. If the target text is determined to be an infringing text of the candidate text, the candidate sentences belonging to the candidate text can be published to the blockchain for evidence storage as the infringement details of the target text with respect to the candidate text.

In this specification, after a sentence vector corresponding to a sentence in a text is calculated through a sentence similarity detection model, whether the text to be detected constitutes an infringing text of the original text may be determined through a vector similarity between the sentence vector corresponding to the sentence in the text to be detected and the sentence vector corresponding to the sentence in the original text.

Referring to FIG. 1, FIG. 1 is a flowchart illustrating a text infringement detection method according to an exemplary embodiment of the present specification.

The text infringement detection method can be applied to a device that provides a text infringement detection service, such as a server, a server cluster, or a computer. The device may be connected to a database that stores original texts (called an original text library), so that the original texts stored in the original text library can be obtained.

In practical applications, both the text to be subjected to infringement detection and the original text may be any type of text. For example, the text may be a complete article, such as a news release, a novel, or a popular science article, or it may be a text fragment in units of paragraphs; this is not limited by the present specification.

Further, the text infringement detection method may include the following steps:

Step 102, extracting key sentences from the target text to be detected, and vectorizing the key sentences to obtain key sentence vectors corresponding to the key sentences.

For a text (called a target text) to be subjected to text infringement detection, a plurality of sentences (called key sentences) can be extracted from the target text, and vectorization processing is performed on each extracted key sentence to obtain a key sentence vector corresponding to the key sentence.

Step 104, calculating the vector similarity between the key sentence vector and the original sentence vector, and determining candidate sentences similar to the key sentences based on the vector similarity; the original sentence vector comprises a sentence vector corresponding to an original sentence obtained by vectorizing an original sentence in a preset original text.

Similar to the target text, several sentences (called original sentences) may be extracted from the preset original text in advance, and each extracted original sentence is vectorized to obtain a sentence vector (called original sentence vector) corresponding to the original sentence. The original texts may include original texts obtained from the original text library, and may also include texts preset by technicians; this is not limited by the present description.

Step 106, based on the vector similarity between the key sentence vector and the original sentence vector corresponding to the candidate sentence, further calculating the text similarity between the target text and the candidate text to which the candidate sentence belongs, and determining, based on the text similarity, whether the target text is an infringing text of the candidate text, so that when the target text is an infringing text of the candidate text, the candidate sentence is published to a blockchain for evidence storage as the infringement detail of the target text with respect to the candidate text.

The text infringement detection method shown in fig. 1 will be described in detail below in terms of constructing a sentence similarity detection model, obtaining an original sentence vector, obtaining a key sentence vector, calculating a vector similarity between the key sentence vector and the original sentence vector, determining candidate sentences, calculating a text similarity between a target text and candidate texts, and determining whether the target text is an infringement text of the candidate texts.

(1) Constructing sentence similarity detection model

In one embodiment, in order to obtain a sentence vector that can represent the similarity between sentences and improve the accuracy of the sentence vectorization process, a sentence similarity detection model may be constructed in advance, and a sentence vector corresponding to a sentence in the text may be calculated based on the sentence similarity detection model.

Referring to FIG. 2, FIG. 2 is a schematic diagram of a sentence similarity detection model according to an exemplary embodiment of the present specification.

The model architecture of the sentence similarity detection model may include a language model pair serving as a feature extraction layer, and a classification layer. The language model pair includes two language models that share model parameters, that is, the two language models have the same model structure and the same model parameters. Both language models in the pair can be used to extract features from an input sentence to obtain a feature vector (called a sentence vector) corresponding to the sentence; the classification layer may be configured to determine whether the sentence vectors respectively output by the two language models in the pair are similar.

The language model may be a BERT model, an XLNet model, a GPT (Generative Pre-Training) model, or another language model; this is not limited by the present specification.

The sentence similarity detection model can be trained on a plurality of sentence pair samples (each containing two sentences) labeled with similarity labels.

Specifically, a plurality of sentence pair samples may be obtained first, each labeled with a label (i.e., a similarity label) indicating whether the two sentences in the pair are similar. For each sentence pair sample, one sentence may then be input into one language model in the language model pair and the other sentence into the other language model. The two language models output the two sentence vectors corresponding to the two sentences, and the classification layer determines whether the two sentence vectors are similar. The model parameters of the sentence similarity detection model may then be adjusted according to the deviation between the output of the classification layer and the similarity label of the sentence pair sample, until training of the sentence similarity detection model is completed.

In order to improve the computation speed of the sentence similarity detection model, in an illustrated embodiment, after training of the sentence similarity detection model is completed, some of the encoders in each of the two language models in the language model pair may be removed to reduce the number of encoders in the two language models. That is, in the sentence similarity detection model used to calculate the sentence vectors corresponding to sentences in a text, the number of encoders in each language model is smaller than the standard number (i.e., the number of encoders in a language model that retains all of its encoders).

In practical applications, a part of the encoders removed from the two language models respectively can be kept consistent, so as to ensure that the functions of the two language models after the part of the encoders is removed are still completely the same.

In the above case, since the number of encoders in the language model is small, the calculation amount of the language model is reduced, and therefore, the calculation rate of the language model can be increased, and the calculation rate of the sentence similarity detection model including the language model can be increased.
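The idea of removing the same encoders from both language models can be illustrated with a toy Python sketch. The "encoders" here are trivial stand-in functions rather than real Transformer encoders, and keeping the first `keep` encoders is an assumption made for this sketch; this description does not specify which encoders are removed.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Encoder = Callable[[Vector], Vector]


def make_toy_encoder(scale: float) -> Encoder:
    """Build a stand-in 'encoder' (not a real Transformer encoder)."""
    def encode(x: Vector) -> Vector:
        return [scale * v + 1.0 for v in x]
    return encode


def run_encoders(encoders: Sequence[Encoder], x: Vector) -> Vector:
    """Apply the encoder stack in order."""
    for encoder in encoders:
        x = encoder(x)
    return x


def prune_encoders(encoders: Sequence[Encoder], keep: int) -> List[Encoder]:
    """Keep only the first `keep` encoders (assumed pruning rule)."""
    return list(encoders[:keep])


# A hypothetical 12-encoder stack, shared by both branches of the pair.
standard_stack = [make_toy_encoder(0.5) for _ in range(12)]
pruned_stack = prune_encoders(standard_stack, keep=4)

# Both branches use the same pruned stack, so they remain functionally identical.
branch_a = run_encoders(pruned_stack, [1.0, 2.0])
branch_b = run_encoders(pruned_stack, [1.0, 2.0])
```

Because the same pruned stack serves both branches, the same input still yields the same output on either branch, while each forward pass applies 4 encoders instead of 12.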

In order to reduce the feature dimension of the sentence vector output by the sentence similarity detection model, and thus reduce the amount of vector similarity calculation, in an illustrated embodiment, the sentence similarity detection model may further include a pooling layer. The pooling layer may be configured to pool the sentence vectors output by the two language models in the language model pair and input the pooled sentence vectors into the classification layer, which determines whether the pooled sentence vectors are similar.

In order to further reduce the feature dimension of the sentence vector output by the sentence similarity detection model, in an embodiment shown, the sentence similarity detection model may further include a multi-layer Perceptron (MLP) in addition to the pooling layer. The multi-layer perceptron can be used for performing dimensionality reduction processing on the two sentence vectors after the pooling, which are output by the pooling layer, inputting the two sentence vectors after the dimensionality reduction into the classification layer, and determining whether the two sentence vectors after the dimensionality reduction are similar or not by the classification layer.
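The pooling and dimensionality reduction steps can be sketched in plain Python as follows. Mean pooling over token vectors and a two-layer perceptron with a ReLU activation are assumptions for this sketch, since the pooling type and the MLP structure are not fixed in this description; the weights are arbitrary illustrative values, not trained parameters.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def mean_pool(token_vectors: Matrix) -> Vector:
    """Average the token vectors of a sentence into one sentence vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(tok[j] for tok in token_vectors) / n for j in range(dim)]


def dense(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """One fully connected layer: y = W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def mlp_reduce(x: Vector, w1: Matrix, b1: Vector,
               w2: Matrix, b2: Vector) -> Vector:
    """Two-layer perceptron that maps a vector to a lower dimension."""
    return dense(relu(dense(x, w1, b1)), w2, b2)


# Three token vectors of dimension 4 (hypothetical language model output).
tokens = [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0], [2.0, 2.0, 2.0, 2.0]]
pooled = mean_pool(tokens)                      # dimension 4

w1 = [[0.5, 0.0, 0.0, 0.0], [0.0, 0.5, 0.0, 0.0], [0.0, 0.0, 0.25, 0.25]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 1.0, 0.0], [0.0, 1.0, -1.0]]
b2 = [0.0, 0.0]
reduced = mlp_reduce(pooled, w1, b1, w2, b2)    # dimension 2
```

The lower-dimensional vector is what would be compared downstream, which is why the reduction lowers the cost of each vector similarity calculation.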

(2) Obtaining original sentence vector

In one embodiment, for each original text, in order to obtain original sentence vectors respectively corresponding to a plurality of original sentences in the original text, a plurality of original sentences may be extracted from the original text.

Specifically, the original text may first be split into sentences to obtain a plurality of sentences corresponding to the original text, and all of the sentences obtained by the splitting may then be extracted as the plurality of original sentences. Alternatively, for a part of the original text that is more critical and more likely to be plagiarized, the sentences obtained by splitting that part may be extracted as the plurality of original sentences; this part of the original text may be preset by a technician according to the actual situation.

In practical applications, before sentence splitting is performed on each original text, the original text may be preprocessed, for example by language identification, word segmentation, word removal, and regularization, so that the characters in the original text are normalized and the efficiency of subsequent processing is improved.
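A minimal Python sketch of sentence splitting is given below. It assumes that sentences end with one of a small set of Western or Chinese punctuation marks and that whitespace normalization is the only preprocessing; language identification and word segmentation are outside the scope of this sketch. The function name is hypothetical.

```python
import re
from typing import List

# Split after sentence-ending punctuation, consuming any following whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?。！？])\s*")


def split_sentences(text: str) -> List[str]:
    """Normalize whitespace, then split a text into sentences."""
    normalized = re.sub(r"\s+", " ", text).strip()
    parts = _SENTENCE_END.split(normalized)
    return [part.strip() for part in parts if part.strip()]


sentences = split_sentences("First one.  Second one!\nThird?")
```

A rule this simple also splits at decimal points and abbreviations, so a production system would likely use a more careful splitter.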

After the plurality of original sentences are extracted, the extracted original sentences can be input into the sentence similarity detection model, and the sentence similarity detection model is used for calculating the original sentences, so that original sentence vectors corresponding to the original sentences and output by any language model in the language model pair can be obtained.

For each original sentence, when the original sentence is input into the sentence similarity detection model, the original sentence can be specifically input into any language model in the language model pair, and the original sentence is subjected to feature extraction by the language model; or, the original sentence may be simultaneously input into two language models in the language model pair, the two language models perform feature extraction on the original sentence respectively, and since the model parameters are shared between the two language models, the original sentence vectors corresponding to the original sentence output by the two language models respectively are the same.

For example, assume that there are 3 original texts, i.e., original text 1, original text 2, and original text 3.

2 original sentences are extracted from the original text 1, and are respectively an original sentence 11 and an original sentence 12, the original sentence 11 can be input into the sentence similarity detection model to obtain an original sentence vector 11 which is output by the language model and corresponds to the original sentence 11, and the original sentence 12 can be input into the sentence similarity detection model to obtain an original sentence vector 12 which is output by the language model and corresponds to the original sentence 12.

Extracting 3 original sentences from the original text 2, wherein the three original sentences are an original sentence 21, an original sentence 22 and an original sentence 23, respectively, the original sentence 21 can be input into the sentence similarity detection model to obtain an original sentence vector 21 corresponding to the original sentence 21 output by the language model, the original sentence 22 can be input into the sentence similarity detection model to obtain an original sentence vector 22 corresponding to the original sentence 22 output by the language model, and the original sentence 23 can be input into the sentence similarity detection model to obtain an original sentence vector 23 corresponding to the original sentence 23 output by the language model.

2 original sentences are extracted from the original text 3, which are the original sentence 31 and the original sentence 32, respectively, then the original sentence 31 can be input into the sentence similarity detection model to obtain an original sentence vector 31 corresponding to the original sentence 31 output by the language model, and the original sentence 32 can be input into the sentence similarity detection model to obtain an original sentence vector 32 corresponding to the original sentence 32 output by the language model.

In practical application, the original sentence vector can be obtained in advance, so that the obtained original sentence vector can be directly utilized when text infringement detection is carried out, and the original sentence vector does not need to be calculated by the sentence similarity detection model.

(2) Obtaining key sentence vectors

Similarly to the above-mentioned process of obtaining the original sentence vector, in order to obtain the key sentence vectors respectively corresponding to the key sentences in the target text, a plurality of key sentences may be first extracted from the target text.

Specifically, the target text may first be split into sentences to obtain a plurality of sentences (referred to as target sentences) corresponding to the target text. Then, in order to reduce the number of key sentences extracted from the target text and thus the amount of similarity calculation, each target sentence may be scored to obtain a sentence score corresponding to the target sentence. After the sentence scores corresponding to the target sentences are obtained, the N sentences with the highest sentence scores may be extracted from the target sentences as the key sentences; alternatively, the sentences whose sentence scores are greater than a preset threshold (referred to as a first threshold) may be extracted from the target sentences as the key sentences. The value of N and the first threshold may each be preset by a technician according to actual needs.
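The two selection rules described above (top N by score, or score above the first threshold) can be sketched in Python as follows. The function names, sentences, and scores are hypothetical placeholders.

```python
from typing import List, Sequence


def top_n_sentences(sentences: Sequence[str], scores: Sequence[float],
                    n: int) -> List[str]:
    """Return the n sentences with the highest sentence scores."""
    ranked = sorted(zip(sentences, scores), key=lambda pair: pair[1],
                    reverse=True)
    return [sentence for sentence, _ in ranked[:n]]


def sentences_above_threshold(sentences: Sequence[str],
                              scores: Sequence[float],
                              first_threshold: float) -> List[str]:
    """Return the sentences whose score is greater than the first threshold."""
    return [sentence for sentence, score in zip(sentences, scores)
            if score > first_threshold]


target_sentences = ["s1", "s2", "s3", "s4"]
sentence_scores = [0.1, 0.7, 0.4, 0.9]

by_rank = top_n_sentences(target_sentences, sentence_scores, n=2)
by_threshold = sentences_above_threshold(target_sentences, sentence_scores,
                                         first_threshold=0.3)
```

The top-N rule bounds the number of key sentences per text, whereas the threshold rule lets that number vary with the score distribution.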

In practical applications, before sentence splitting is performed on the target text, the target text may be preprocessed, for example by language identification, word segmentation, word removal, and regularization, so that the characters in the target text are normalized and the efficiency of subsequent processing is improved.

In order to reduce the amount of computation for extracting key sentences from the target text, sentences whose length exceeds a certain value may be extracted from the target text as the key sentences. Wherein the length of the sentence may be the number of characters contained in the sentence.

Further, in one embodiment shown, for each target sentence, when the target sentence is subjected to the scoring processing, the target sentence may be specifically subjected to the scoring processing based on the TextRank algorithm; alternatively, the target sentence may be scored based on the position of the target sentence in the target text, wherein the numerical value of the score of the target sentence is inversely proportional to the distance between the target sentence and the head or tail of the target text, i.e., the closer to the head or tail of the target text, the higher the score is, and the farther from the head or tail of the target text, the lower the score is.

Specifically, when the target sentence is scored based on the TextRank algorithm, since the TextRank algorithm is a graph-based ranking algorithm for a text and can be used to extract keywords and key sentences from the text, the keyword score corresponding to the target sentence calculated based on the TextRank algorithm can be used as the sentence score corresponding to the target sentence.
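For readers unfamiliar with TextRank, the following Python sketch scores sentences with a graph-based ranking over a word-overlap similarity, in the spirit of the original TextRank formulation. It is a simplified illustration under stated assumptions (whitespace tokenization, damping factor 0.85, a fixed number of iterations, a normalized teleport term) and is not necessarily the variant used in this specification.

```python
import math
from typing import List, Sequence


def overlap_similarity(a: str, b: str) -> float:
    """Word-overlap similarity between two sentences, normalized by length."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if len(tokens_a) < 2 or len(tokens_b) < 2:
        return 0.0  # avoids a zero denominator from log(1)
    shared = len(tokens_a & tokens_b)
    return shared / (math.log(len(tokens_a)) + math.log(len(tokens_b)))


def textrank_scores(sentences: Sequence[str], damping: float = 0.85,
                    iterations: int = 50) -> List[float]:
    """Rank sentences by iterating a weighted PageRank over the sentence graph."""
    n = len(sentences)
    if n == 0:
        return []
    weights = [[overlap_similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            # Each neighbor j passes on a share of its score proportional
            # to the weight of the edge from j to i.
            rank = sum(weights[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if out_sum[j] > 0)
            updated.append((1 - damping) / n + damping * rank)
        scores = updated
    return scores


example_sentences = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply today",
]
example_scores = textrank_scores(example_sentences)
```

In this example the third sentence shares no words with the others, so it receives only the baseline score and ranks lowest.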

Alternatively, when the target sentence is scored based on its position in the target text, the distance between the target sentence and the head of the target text and the distance between the target sentence and the tail of the target text may be determined, and the sentence score corresponding to the target sentence may then be calculated based on the shorter of the two distances, where the sentence score is inversely proportional to the distance. For example, assume that the target text contains 100 characters in total, the target sentence itself contains 10 characters, the number of characters before the target sentence is 20, and the number of characters after the target sentence is 70 (10 + 20 + 70 = 100). The distance between the target sentence and the head of the target text may then be taken as 20/100 = 0.2, and the distance between the target sentence and the tail of the target text as 70/100 = 0.7. Since 0.2 < 0.7, the sentence score corresponding to the target sentence may be calculated based on the distance between the target sentence and the head of the target text.
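The position-based scoring can be sketched in Python as follows. The distance calculation follows the worked example above; the concrete scoring formula `1 / (distance + eps)` and the value of `eps` are assumptions made for this sketch, since the text only states that the score is inversely proportional to the distance.

```python
def edge_distance(chars_before: int, sentence_len: int,
                  chars_after: int) -> float:
    """Shorter of the normalized distances to the head and tail of the text."""
    total = chars_before + sentence_len + chars_after
    return min(chars_before / total, chars_after / total)


def position_score(chars_before: int, sentence_len: int, chars_after: int,
                   eps: float = 0.05) -> float:
    """Hypothetical score that decreases as the distance to the nearer edge grows.

    eps keeps the score finite for the first and last sentences.
    """
    return 1.0 / (edge_distance(chars_before, sentence_len, chars_after) + eps)


# Worked example from the text: 20 characters before, 10 in the sentence, 70 after.
distance = edge_distance(20, 10, 70)        # min(0.2, 0.7)
near_head = position_score(20, 10, 70)
in_middle = position_score(45, 10, 45)
```

With these values, the sentence near the head of the text scores higher than a sentence in the middle of the text.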

After the key sentences are extracted, the extracted key sentences can be input into the sentence similarity detection model, and the sentence similarity detection model calculates the key sentences, so that a key sentence vector corresponding to the key sentence and output by any language model in the language model pair can be obtained.

For example, if 2 key sentences, namely, a key sentence 1 and a key sentence 2, are extracted from the target text, the key sentence 1 may be input into the sentence similarity detection model for calculation to obtain a key sentence vector 1 corresponding to the key sentence 1 output by any one of the language models in the language model pair, and the key sentence 2 may be input into the sentence similarity detection model for calculation to obtain a key sentence vector 2 corresponding to the key sentence 2 output by any one of the language models in the language model pair.

(3) Calculating vector similarity between key sentence vector and original sentence vector

In the case where the plurality of key sentence vectors are acquired and the plurality of original sentence vectors are acquired in advance, the vector similarity between each key sentence vector and each original sentence vector may be calculated.

Specifically, for each key sentence vector and each original sentence vector, the key sentence vector and the original sentence vector may be regarded as one sentence pair, and the vector similarity between the key sentence vector and the original sentence vector may be calculated as the vector similarity corresponding to the sentence pair.

In one embodiment shown, the above vector similarity may be characterized by cosine similarity between vectors, or euclidean distance between vectors.
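Both measures can be computed with the standard library alone, as in the minimal sketch below. The function names are hypothetical. Note that cosine similarity grows as vectors become more alike, whereas Euclidean distance shrinks, so an implementation using the latter would treat a smaller value as a higher similarity.

```python
import math


def cosine_similarity(u: list, v: list) -> float:
    """Cosine of the angle between two sentence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def euclidean_distance(u: list, v: list) -> float:
    """Straight-line distance between two sentence vectors."""
    return math.dist(u, v)
```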

Continuing with the example of the 3 original texts and the target text, the vector similarity between the key sentence vector 1, the key sentence vector 2, and the original sentence vector 11, the original sentence vector 12, the original sentence vector 21, the original sentence vector 22, the original sentence vector 23, the original sentence vector 31, and the original sentence vector 32 can be calculated, as shown in table 1:

Sentence pair	Sentence vectors	Vector similarity
Sentence pair 111	(Key sentence vector 1, original sentence vector 11)	Vector similarity 111
Sentence pair 112	(Key sentence vector 1, original sentence vector 12)	Vector similarity 112
Sentence pair 211	(Key sentence vector 2, original sentence vector 11)	Vector similarity 211
Sentence pair 212	(Key sentence vector 2, original sentence vector 12)	Vector similarity 212
Sentence pair 121	(Key sentence vector 1, original sentence vector 21)	Vector similarity 121
Sentence pair 122	(Key sentence vector 1, original sentence vector 22)	Vector similarity 122
Sentence pair 123	(Key sentence vector 1, original sentence vector 23)	Vector similarity 123
Sentence pair 221	(Key sentence vector 2, original sentence vector 21)	Vector similarity 221
Sentence pair 222	(Key sentence vector 2, original sentence vector 22)	Vector similarity 222
Sentence pair 223	(Key sentence vector 2, original sentence vector 23)	Vector similarity 223
Sentence pair 131	(Key sentence vector 1, original sentence vector 31)	Vector similarity 131
Sentence pair 132	(Key sentence vector 1, original sentence vector 32)	Vector similarity 132
Sentence pair 231	(Key sentence vector 2, original sentence vector 31)	Vector similarity 231
Sentence pair 232	(Key sentence vector 2, original sentence vector 32)	Vector similarity 232

TABLE 1

(4) Determining candidate sentences

In the case of calculating the vector similarity between each key sentence vector and each original sentence vector, several candidate sentences similar to the above-mentioned several key sentences may be determined based on all the calculated vector similarities.

In one embodiment shown, N original sentence vectors with the highest vector similarity to the plurality of key sentences may be determined, and the original sentence corresponding to the determined original sentence vector may be determined as a plurality of candidate sentences similar to the plurality of key sentences; alternatively, an original sentence vector having a vector similarity with the key sentences greater than a preset threshold (referred to as a second threshold) may be determined, and the original sentence corresponding to the determined original sentence vector may be determined as candidate sentences similar to the key sentences. The value of N may be preset by a technician according to actual requirements, and the second threshold may also be preset by the technician according to actual requirements.

Continuing to take the 3 original texts and the target text as an example, assume that the preset value of N is 4 and that, among the 14 vector similarities obtained by calculation, the 4 largest are the vector similarity 111, the vector similarity 212, the vector similarity 122, and the vector similarity 222. In this case, the original sentence 11 corresponding to the original sentence vector 11, the original sentence 12 corresponding to the original sentence vector 12, and the original sentence 22 corresponding to the original sentence vector 22 (both the vector similarity 122 and the vector similarity 222 correspond to the original sentence vector 22) can be determined as candidate sentences (hereinafter referred to as candidate sentence 11, candidate sentence 12, and candidate sentence 22).
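The two selection rules can be sketched as follows. The pair similarities are assumed to be held in a dictionary keyed by (key sentence id, original sentence id); that layout, the identifiers, and the function names are hypothetical. As in the example, an original sentence that appears in several selected pairs is returned only once.

```python
import heapq


def select_top_n_candidates(pair_similarities: dict, n: int) -> list:
    """Original sentences taken from the N most similar sentence pairs."""
    top_pairs = heapq.nlargest(
        n, pair_similarities.items(), key=lambda item: item[1]
    )
    candidates = []
    for (_key_id, original_id), _similarity in top_pairs:
        if original_id not in candidates:  # deduplicate, keep rank order
            candidates.append(original_id)
    return candidates


def select_candidates_above_threshold(pair_similarities: dict,
                                      second_threshold: float) -> list:
    """Original sentences from pairs whose similarity exceeds the threshold."""
    candidates = []
    for (_key_id, original_id), similarity in pair_similarities.items():
        if similarity > second_threshold and original_id not in candidates:
            candidates.append(original_id)
    return candidates
```

With N = 4 and the four largest similarities belonging to pairs 111, 212, 122, and 222, the first function returns three candidates, because pairs 122 and 222 share the same original sentence.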

(5) Calculating text similarity between target text and candidate text

In a case where the candidate sentence is determined, for candidate texts to which the candidate sentences belong, text similarity between the target text and the candidate texts may be further calculated based on vector similarity between each key sentence vector and an original sentence vector corresponding to the candidate sentence.

In one illustrated embodiment, for each sentence pair, the vector similarity between the key sentence vector in the sentence pair and the original sentence vector in the sentence pair may be mapped to a similarity score corresponding to the sentence pair. Further, for candidate sentences belonging to the same candidate text, a sum of similarity scores corresponding to pairs of sentences respectively containing the candidate sentences may be calculated, and the calculated sum value may be determined as the text similarity between the target text and the candidate text.

Continuing to take the 3 original texts and the target text as an example, the vector similarity 111 may be mapped to a similarity score 111, the vector similarity 212 may be mapped to a similarity score 212, the vector similarity 122 may be mapped to a similarity score 122, and the vector similarity 222 may be mapped to a similarity score 222; since the candidate sentences 11 and 12 belong to the original text 1 (hereinafter referred to as candidate text 1) and the candidate sentence 22 belongs to the original text 2 (hereinafter referred to as candidate text 2), the sum of the similarity score 111 and the similarity score 212 may be determined as the text similarity 1 between the target text and the candidate text 1, and the sum of the similarity score 122 and the similarity score 222 may be determined as the text similarity 2 between the target text and the candidate text 2.
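The summation rule, in which the similarity scores of all sentence pairs whose candidate sentence belongs to the same candidate text are added together, can be sketched as follows. The dictionary layouts, identifiers, and the function name `text_similarities` are hypothetical, and the numeric scores in the usage example are made up for illustration.

```python
from collections import defaultdict


def text_similarities(pair_scores: dict, sentence_to_text: dict) -> dict:
    """Sum the similarity scores of sentence pairs per candidate text."""
    totals = defaultdict(float)
    for (_key_id, candidate_id), score in pair_scores.items():
        totals[sentence_to_text[candidate_id]] += score
    return dict(totals)


# Hypothetical scores for the selected sentence pairs.
pair_scores = {
    ("key_1", "candidate_11"): 0.9,
    ("key_2", "candidate_12"): 0.8,
    ("key_1", "candidate_22"): 0.7,
    ("key_2", "candidate_22"): 0.6,
}
sentence_to_text = {
    "candidate_11": "candidate_text_1",
    "candidate_12": "candidate_text_1",
    "candidate_22": "candidate_text_2",
}
result = text_similarities(pair_scores, sentence_to_text)
```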

Further, in one illustrated embodiment, for each sentence pair, when mapping the vector similarity between the key sentence vector in the sentence pair and the original sentence vector in the sentence pair to the similarity score corresponding to the sentence pair, the following may specifically be performed: based on the length of the key sentence corresponding to the key sentence vector, scoring the key sentence to obtain a first sub-score; based on the position of the original sentence corresponding to the original sentence vector in the original text to which the original sentence belongs, scoring the original sentence to obtain a second sub-score; and mapping the vector similarity between the key sentence vector and the original sentence vector to a third sub-score. In this case, the product of the first sub-score, the second sub-score, and the third sub-score may be calculated, and the calculated product may be determined as the similarity score corresponding to the sentence pair.

In practical applications, in a first aspect, the following formula can be adopted to implement the scoring processing on the sentence based on the length of the sentence:

wherein x represents a sentence; x < 50 indicates that the length of the sentence is less than 50 characters; x ≥ 50 indicates that the length of the sentence is greater than or equal to 50 characters; sentence_length represents the first sub-score corresponding to the sentence.

In a second aspect, the following formula may be used to implement the scoring of a sentence based on its position in the text to which it belongs:

wherein x represents a sentence and L represents the length of the text to which the sentence belongs; L/3 < x < 2L/3 indicates that the sentence is located between positions 1/3 and 2/3 of the text to which it belongs, for example: the proportion of the text taken up by all characters before the sentence is greater than 1/3, and the proportion taken up by all characters after the sentence is also greater than 1/3; x ≤ L/3 or x ≥ 2L/3 indicates that the sentence is located either within the first 1/3 or beyond 2/3 of the text to which it belongs, for example: the proportion of the text taken up by all characters before the sentence is less than or equal to 1/3, or the proportion taken up by all characters after the sentence is less than or equal to 1/3; sentence_location represents the second sub-score corresponding to the sentence.

In a third aspect, the mapping of vector similarity to a third sub-score may be implemented using the following formula:

wherein i represents the ith key sentence; j represents the jth original sentence, similarity (i, j) represents the vector similarity between the key sentence vector corresponding to the ith key sentence and the original sentence vector corresponding to the jth original sentence; the button _ filter represents a third sub-score corresponding to the vector similarity.

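In the above case, the similarity score corresponding to the sentence pair composed of the ith key sentence and the jth original sentence is the product of the first sub-score corresponding to the ith key sentence, the second sub-score corresponding to the jth original sentence, and the third sub-score corresponding to similarity(i, j).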
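The combination of the three sub-scores into one similarity score can be sketched as follows. Only the 50-character threshold, the 1/3 and 2/3 position boundaries, and the product rule come from the description; the concrete sub-score values (0.5 and 1.0), which side of each boundary scores higher, the identity-style mapping of the vector similarity, and all function names are assumptions made purely for illustration.

```python
def length_sub_score(key_sentence_length: int) -> float:
    """First sub-score; the values 0.5 and 1.0 are hypothetical."""
    return 0.5 if key_sentence_length < 50 else 1.0


def location_sub_score(chars_before: int, chars_after: int,
                       total_chars: int) -> float:
    """Second sub-score; the values 0.5 and 1.0 are hypothetical."""
    before_ratio = chars_before / total_chars
    after_ratio = chars_after / total_chars
    in_middle_third = before_ratio > 1 / 3 and after_ratio > 1 / 3
    return 0.5 if in_middle_third else 1.0


def similarity_sub_score(vector_similarity: float) -> float:
    """Third sub-score; assumed mapping that clips negatives to zero."""
    return max(0.0, vector_similarity)


def pair_similarity_score(key_sentence_length: int, chars_before: int,
                          chars_after: int, total_chars: int,
                          vector_similarity: float) -> float:
    """Product of the three sub-scores for one sentence pair."""
    return (
        length_sub_score(key_sentence_length)
        * location_sub_score(chars_before, chars_after, total_chars)
        * similarity_sub_score(vector_similarity)
    )
```

Because the score is a product, a low value in any one sub-score lowers the similarity score of the whole sentence pair.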

(6) Determining whether target text is infringing text of candidate text

After the text similarity between the target text and each candidate text is obtained through calculation, whether the target text is an infringing text of the candidate text or not can be determined based on the calculated text similarity.

In an illustrated embodiment, N candidate texts with the highest text similarity to the target text may be determined, and the target text is determined to be an infringing text of the determined candidate text; alternatively, a candidate text having a text similarity greater than a preset threshold (referred to as a third threshold) with the target text may be determined, and the target text may be determined as an infringing text of the determined candidate text. The numerical value of N may be preset by a technician according to actual requirements, and the third threshold may also be preset by the technician according to actual requirements.

Continuing to take the 3 original texts and the target text as an example, assuming that the preset value of the N is 1; and if the text similarity 1 is greater than the text similarity 2 in the 2 text similarities obtained through calculation, determining that the target text is an infringing text of the candidate text 1.
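Both decision rules can be sketched in a few lines. The dictionary layout mapping a candidate text identifier to its text similarity, the identifiers, and the function names are hypothetical, and the numeric values in the usage example are made up for illustration.

```python
def infringed_texts_top_n(text_similarity: dict, n: int) -> list:
    """The N candidate texts most similar to the target text."""
    ranked = sorted(text_similarity, key=text_similarity.get, reverse=True)
    return ranked[:n]


def infringed_texts_above_threshold(text_similarity: dict,
                                    third_threshold: float) -> list:
    """Candidate texts whose text similarity exceeds the third threshold."""
    return [
        text_id
        for text_id, similarity in text_similarity.items()
        if similarity > third_threshold
    ]


# Hypothetical values in which text similarity 1 exceeds text similarity 2.
similarities = {"candidate_text_1": 1.7, "candidate_text_2": 1.3}
top = infringed_texts_top_n(similarities, 1)
```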

In practical applications, the operations executed on the target text and the original texts, as well as the data generated during the whole text infringement detection process, can be issued to the blockchain for evidence storage. Because data stored on the blockchain cannot be arbitrarily tampered with, the authenticity and reliability of the text infringement detection can be ensured, thereby protecting original works. For example, in the case that the target text is determined to be an infringing text of a certain candidate text, the candidate sentences belonging to that candidate text may be issued to the blockchain for evidence storage as the infringing details of the target text with respect to the candidate text.

Continuing with the example of the 3 original texts and the target text, in the case that the target text is determined to be an infringing text of the candidate text 1, the candidate sentences 11 and 12, which belong to the candidate text 1, may be issued to the blockchain for evidence storage as infringing details of the target text with respect to the candidate text 1.

In one illustrated embodiment, the above-described infringement details further include one or more of the following: the above key sentence; vector similarity between the key sentence vector and the original sentence vector corresponding to the candidate sentence; the candidate text; the text similarity between the target text and the candidate text.
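One possible shape for an infringement detail record is sketched below. The specification does not define a record format or a blockchain interface, so the `InfringementDetail` fields, the JSON serialization, and the SHA-256 digest are all assumptions for illustration; the actual submission to a blockchain is deliberately left out.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class InfringementDetail:
    """Hypothetical record of infringing details for one candidate text."""
    candidate_text_id: str
    candidate_sentences: list
    key_sentences: list = field(default_factory=list)      # optional
    vector_similarities: dict = field(default_factory=dict)  # optional
    text_similarity: Optional[float] = None                  # optional


def to_evidence_payload(detail: InfringementDetail) -> tuple:
    """Serialize the record deterministically and fingerprint it."""
    body = json.dumps(asdict(detail), ensure_ascii=False, sort_keys=True)
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return body, digest
```

Sorting the keys keeps the serialized form stable, so the same record always yields the same digest.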

Corresponding to the embodiments of the text infringement detection method, the present specification also provides embodiments of a text infringement detection apparatus.

The embodiments of the text infringement detection device can be applied to electronic equipment. The device embodiments may be implemented by software, by hardware, or by a combination of hardware and software. Taking a software implementation as an example, the device, as a logical device, is formed by the processor of the electronic device in which it is located reading the corresponding computer program instructions from the nonvolatile memory into the memory and running them. From a hardware perspective, fig. 3 is a hardware structure diagram of the electronic device in which the text piracy detection apparatus is located; in addition to the processor, the memory, the network interface, and the nonvolatile memory shown in fig. 3, the electronic device in which the apparatus is located may also include other hardware according to the actual function of the text piracy detection, which is not described in detail here.

Referring to fig. 4, fig. 4 is a block diagram of a text piracy detection apparatus according to an exemplary embodiment of the present disclosure. The text piracy detection apparatus 40 may be applied to an electronic device as shown in fig. 3; the text piracy detection apparatus 40 may include:

an extraction module 401, configured to extract a key sentence from a target text to be detected, and perform vectorization processing on the key sentence to obtain a key sentence vector corresponding to the key sentence;

a first calculating module 402, configured to calculate a vector similarity between the key sentence vector and the original sentence vector, and determine a candidate sentence similar to the key sentence based on the vector similarity; the original sentence vector comprises a sentence vector corresponding to an original sentence obtained by vectorizing an original sentence in a preset original text;

a second calculating module 403, configured to further calculate a text similarity between the target text and a candidate text to which the candidate sentence belongs based on a vector similarity between the key sentence vector and an original sentence vector corresponding to the candidate sentence, and determine whether the target text is an infringing text of the candidate text based on the text similarity, so that when the target text is the infringing text of the candidate text, the candidate sentence is issued to a blockchain for evidence as the infringing detail of the target text for the candidate text.

The implementation process of the functions and actions of each module in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.

For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, wherein the modules described as separate parts may or may not be physically separate, and the parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution in the specification. One of ordinary skill in the art can understand and implement it without inventive effort.

The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.

In a typical configuration, a computer includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage media or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, etc. may be used in one or more embodiments of the present description to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of one or more embodiments herein. The word "if" as used herein may be interpreted as "upon" or "when" or "in response to a determination", depending on the context.

The above description is only for the purpose of illustrating the preferred embodiments of the one or more embodiments of the present disclosure, and is not intended to limit the scope of the one or more embodiments of the present disclosure, and any modifications, equivalent substitutions, improvements, etc. made within the spirit and principle of the one or more embodiments of the present disclosure should be included in the scope of the one or more embodiments of the present disclosure.

Claims

1. A method of text piracy detection, the method comprising:

extracting key sentences from a target text to be detected, and inputting the key sentences into a sentence similarity detection model; the sentence similarity detection model comprises a language model pair serving as a feature extraction layer, and a classification layer; the classification layer is used for determining whether sentence vectors output by the language models in the language model pair are similar; the language models in the language model pair share model parameters; the sentence similarity detection model is obtained by training based on sentence samples labeled with similarity labels;

obtaining a key sentence vector corresponding to the key sentence and output by any language model in the language model pair;

2. The method of claim 1, wherein extracting key sentences from target text to be detected comprises:

performing sentence division processing on a target text to be detected to obtain a target sentence corresponding to the target text;

scoring the target sentence to obtain a sentence score corresponding to the target sentence;

extracting a preset first number of target sentences with highest sentence score from the target sentences to serve as key sentences; or extracting the target sentence with the sentence score larger than a preset first threshold value from the target sentence to serve as a key sentence.

3. The method of claim 2, the scoring of the target sentence comprising:

scoring the target sentence based on a TextRank algorithm; and/or,

based on the position of the target sentence in the target text, performing scoring processing on the target sentence; wherein a numerical size of the score for the target sentence is inversely proportional to a distance between the target sentence and a head or a tail of the target text.

4. The method of claim 1, wherein the number of encoders in the language model is less than a standard number.

5. The method of claim 1, the sentence similarity detection model further comprising a pooling layer; and the pooling layer is used for pooling sentence vectors output by the language model and inputting the pooled sentence vectors into the classification layer.

6. The method of claim 5, the sentence similarity detection model further comprising a multi-layer perceptron; and the multilayer perceptron is used for carrying out dimensionality reduction on the pooled sentence vectors and inputting the sentence vectors subjected to dimensionality reduction into the classification layer.

7. The method of claim 1, further comprising:

extracting an original sentence from a preset original text;

and inputting the original sentence into the sentence similarity detection model, and acquiring a sentence vector which is output by any language model in the language model pair and corresponds to the original sentence as the original sentence vector.

8. The method of claim 1, the vector similarity characterized by cosine similarity between vectors, or Euclidean distance between vectors.

9. The method of claim 1, the determining candidate sentences that are similar to the key sentence based on the vector similarity comprising:

determining a preset second number of original sentence vectors with the highest vector similarity to the key sentence vector, and determining original sentences corresponding to the determined original sentence vectors as candidate sentences similar to the key sentences; or,

and determining original sentence vectors with the vector similarity between the original sentence vectors and the key sentence vectors being larger than a preset second threshold, and determining original sentences corresponding to the determined original sentence vectors as candidate sentences similar to the key sentences.

10. The method of claim 1, the further calculating a text similarity between the target text and a candidate text to which the candidate sentence belongs based on a vector similarity between the key sentence vector and an original sentence vector corresponding to the candidate sentence, comprising:

mapping vector similarity between the key sentence vector and the original sentence vector corresponding to the candidate sentence into a similarity score;

and calculating the sum of the similarity scores corresponding to the candidate sentences belonging to the same candidate text, and determining the calculated sum as the text similarity between the target text and the candidate text.

11. The method of claim 10, the mapping vector similarities between the key sentence vector and an original sentence vector corresponding to the candidate sentence to a similarity score, comprising:

based on the length of a key sentence corresponding to the key sentence vector, performing scoring processing on the key sentence to obtain a first sub-score;

based on the position of the candidate sentence in the candidate text to which the candidate sentence belongs, scoring the candidate sentence to obtain a second sub-score;

mapping vector similarity between the key sentence vector and the original sentence vector corresponding to the candidate sentence into a third sub-score;

and calculating the product of the first sub-score, the second sub-score and the third sub-score, and determining the calculated product as a similarity score.

12. The method of claim 1, the determining whether the target text is an infringing text of the candidate text based on the text similarity, comprising:

determining a preset third number of candidate texts with the highest text similarity to the target text, and determining that the target text is an infringing text of the determined candidate texts; or,

determining a candidate text whose text similarity to the target text is greater than a preset third threshold, and determining that the target text is an infringing text of the determined candidate text.

13. The method of claim 1, the infringement details further comprising one or more of the following: the key sentence; vector similarity between the key sentence vector and an original sentence vector corresponding to the candidate sentence; the candidate text; a text similarity between the target text and the candidate text.

14. A text piracy detection apparatus, the apparatus comprising:

an extraction module, configured to extract key sentences from the target text to be detected and input the key sentences into a sentence similarity detection model; the sentence similarity detection model comprises a language model pair serving as a feature extraction layer, and a classification layer; the classification layer is used for determining whether sentence vectors output by the language models in the language model pair are similar; the language models in the language model pair share model parameters; the sentence similarity detection model is obtained by training based on sentence samples labeled with similarity labels; and to acquire a key sentence vector corresponding to the key sentence output by any language model in the language model pair;

15. An electronic device, comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor implements the method of any one of claims 1 to 13 by executing the executable instructions.

16. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the method of any one of claims 1 to 13.