CN114818672A - Text duplicate removal method and device, electronic equipment and readable storage medium


Info

Publication number
CN114818672A
CN114818672A
Authority
CN
China
Prior art keywords
text
text data
vector representation
result
knowledge base
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210356716.XA
Other languages
Chinese (zh)
Inventor
陈伟
王盛辉
周海波
成功
赵领杰
王磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd filed Critical Beijing Sankuai Online Technology Co Ltd
Priority to CN202210356716.XA priority Critical patent/CN114818672A/en
Publication of CN114818672A publication Critical patent/CN114818672A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/237 - Lexical tools
    • G06F40/247 - Thesauruses; Synonyms
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text deduplication method and apparatus, an electronic device, and a readable storage medium. The method comprises: acquiring a plurality of result texts corresponding to a query input; matching the plurality of result texts against a pre-constructed synonym knowledge base, wherein the synonym knowledge base is generated from the prediction results of a pre-trained text deduplication model, and the text deduplication model performs semantic duplicate prediction based on text features, context features, and extension features of the result texts; and filtering duplicate texts out of the plurality of result texts according to the matching result from the synonym knowledge base. The invention solves the technical problems in the related art of poor accuracy and timeliness caused by real-time online model inference.

Description

Text duplicate removal method and device, electronic equipment and readable storage medium
Technical Field
The invention relates to the technical field of databases, and in particular to a text deduplication method and apparatus, an electronic device, and a readable storage medium.
Background
With the development of the internet, users can easily obtain a large amount of information online, but the cost of filtering out invalid information rises accordingly. Search recommendation technology intelligently pushes effective information based on a user's search keywords or desensitized user information. A life-service platform aggregates massive amounts of life-service information for its users, so search recommendation plays an important role on such a platform. Because information on a life-service platform is highly homogeneous, the text information matched by search recommendation must be deduplicated so that users have a good experience. In this scenario, the common existing technical schemes are of the following types:
1. Hash-signature duplicate judgment: the matched texts are segmented into words, a hash value is computed for each word, a weighted digit string is computed from the per-word hash values to serve as the hash signature of each matched text, and finally whether two texts are duplicates is determined by computing the distance between their hash signatures, thereby deduplicating the text information.
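The hash-signature scheme above is essentially the SimHash algorithm. The following is a minimal sketch, assuming MD5 as the per-word hash and uniform token weights (the "weighted digit string" in the scheme above would scale each bit vote by a term weight):

```python
import hashlib

def simhash(tokens, bits=64):
    """Bit-voting fingerprint: each token's hash votes +1/-1 per bit position."""
    v = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    # Positive vote totals become 1-bits of the final signature.
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming_distance(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")

def is_duplicate(tokens_a, tokens_b, threshold=3):
    """Texts are duplicates when their signatures are within a small Hamming distance."""
    return hamming_distance(simhash(tokens_a), simhash(tokens_b)) <= threshold
```

A threshold of 3 is a common choice for 64-bit signatures; identical token lists always yield distance 0.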
2. Text-matching-model deduplication: a text matching model is trained on manually labeled duplicate text data so that it acquires a certain duplicate-judgment capability, and is then used to infer whether two texts among the search-recommendation results are duplicates, thereby deduplicating the text information.
In the course of carrying out the present invention, the applicant found that the related art has at least the following technical problems:
1. The existing schemes reason over the text alone and cannot make good use of additional context and other information, which limits their applicability.
2. The existing schemes rely on real-time inference by an online model, which performs poorly, struggles to meet the differing performance requirements of scenarios such as related search, guess-you-like search, and search suggestion (sug), and thus has low universality.
3. After an existing scheme goes online, its model is updated only during development iterations, so its timeliness is poor. In the life-service platform scenario, where merchants and their goods change day by day, this defect is especially pronounced.
It can be seen that no effective solution to the above problems has been proposed in the related art.
Disclosure of Invention
The embodiments of the invention provide a text deduplication method and apparatus, an electronic device, and a readable storage medium, which at least solve the technical problems in the related art of poor accuracy and timeliness caused by real-time online model inference.
According to one aspect of the embodiments of the present invention, a text deduplication method is provided, including: acquiring a plurality of result texts corresponding to a query input; matching the plurality of result texts against a pre-constructed synonym knowledge base, wherein the synonym knowledge base is generated from the prediction results of a pre-trained text deduplication model, and the text deduplication model performs semantic duplicate prediction based on text features, context features, and extension features of the result texts; and filtering duplicate texts out of the plurality of result texts according to the matching result from the synonym knowledge base.
Further, before the acquiring of the plurality of result texts corresponding to the query input, the method further includes: performing, by the text deduplication model, semantic duplicate prediction based on the text features, context features, and extension features respectively corresponding to first text data and second text data, to obtain a prediction result for the first text data and the second text data; and if the prediction result is that the text semantics are the same, adding the first text data and the second text data into the synonym knowledge base.
Further, the text deduplication model includes a text processing sub-module and a compressed interaction layer, and performing semantic duplicate prediction by the text deduplication model based on the text features, context features, and extension features respectively corresponding to the first text data and the second text data includes: determining, by the text processing sub-module, a first vector representation from a first text feature of the first text data and a second text feature of the second text data; determining, by the compressed interaction layer, a second vector representation from the context features and the extension features; determining a third vector representation from the text features, context features, and extension features respectively corresponding to the first text data and the second text data; and determining the prediction result from the first vector representation, the second vector representation, and the third vector representation.
Further, the text deduplication model includes a classification layer and a feature enhancement layer, and determining the prediction result from the first vector representation, the second vector representation, and the third vector representation includes: vector-summing the first, second, and third vector representations to obtain a fourth vector representation; performing feature enhancement on the fourth vector representation through the feature enhancement layer to obtain a fifth vector representation; and classifying the fifth vector representation through the classification layer to determine the prediction result for the first text and the second text.
Further, if the prediction result is that the text semantics are the same, adding the first text data and the second text data into the synonym knowledge base includes: determining the synonym knowledge base corresponding to the text semantics of the first text data and the second text data, wherein the semantic distance between text pairs within the synonym knowledge base is smaller than a preset semantic distance threshold; and adding the first text data and the second text data into that synonym knowledge base.
According to another aspect of the embodiments of the present invention, a text deduplication apparatus is also provided, including: an acquisition module for acquiring a plurality of result texts corresponding to a query input; a matching module for matching the plurality of result texts against a pre-constructed synonym knowledge base, wherein the synonym knowledge base is generated from the prediction results of a pre-trained text deduplication model, and the text deduplication model performs semantic duplicate prediction based on text features, context features, and extension features of the result texts; and a deduplication module for filtering duplicate texts out of the plurality of result texts according to the matching result from the synonym knowledge base.
Further, the apparatus also includes: a classification module for performing, by the text deduplication model, semantic duplicate prediction based on the text features, context features, and extension features respectively corresponding to first text data and second text data before the plurality of result texts corresponding to the query input are acquired, to obtain a prediction result for the first text data and the second text data; and a storage module for adding the first text data and the second text data into the synonym knowledge base if the prediction result is that the text semantics are the same.
Further, the text deduplication model includes a text processing sub-module and a compressed interaction layer, and the classification module includes: a first determining sub-module for determining, by the text processing sub-module, a first vector representation from a first text feature of the first text data and a second text feature of the second text data; a second determining sub-module for determining, by the compressed interaction layer, a second vector representation from the context features and the extension features; a third determining sub-module for determining a third vector representation from the text features, context features, and extension features respectively corresponding to the first text data and the second text data; and a fourth determining sub-module for determining the prediction result from the first vector representation, the second vector representation, and the third vector representation.
Further, the fourth determining sub-module includes: a processing unit for vector-summing the first, second, and third vector representations to obtain a fourth vector representation; a feature enhancement unit for performing feature enhancement on the fourth vector representation through the feature enhancement layer to obtain a fifth vector representation; and a determining unit for classifying the fifth vector representation through the classification layer to determine the prediction result for the first text and the second text.
Further, the storage module is configured to: determine the synonym knowledge base corresponding to the text semantics of the first text data and the second text data, wherein the semantic distance between text pairs within the synonym knowledge base is smaller than a preset semantic distance threshold; and add the first text data and the second text data into that synonym knowledge base.
According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including a processor, a memory, and a program or instructions stored on the memory and executable on the processor, wherein the program or instructions, when executed by the processor, implement the steps of the text deduplication method as described above.
According to another aspect of the embodiments of the present invention, there is also provided a readable storage medium, on which a program or instructions are stored, which when executed by a processor implement the steps of the text deduplication method as described above.
In the embodiments of the invention, a plurality of result texts corresponding to a query input are acquired; the plurality of result texts are matched against a pre-constructed synonym knowledge base, wherein the synonym knowledge base is generated from the prediction results of a pre-trained text deduplication model, and the text deduplication model performs semantic duplicate prediction based on text features, context features, and extension features of the result texts; and duplicate texts are filtered out of the plurality of result texts according to the matching result from the synonym knowledge base. The synonym knowledge base, constructed from the prediction results of the pre-trained text deduplication model, enables fast online text deduplication, thereby improving the timeliness of recommended search terms and the accuracy of the recommended results, and solving the technical problems in the related art of poor accuracy and timeliness caused by real-time online model inference.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a schematic diagram of an alternative text deduplication method according to an embodiment of the present invention;
FIG. 2 is a diagram of an alternative text deduplication model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of yet another alternative text deduplication model in accordance with an embodiment of the present invention;
FIG. 4 is a schematic diagram of an alternative text deduplication apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
According to an embodiment of the present invention, there is provided a text deduplication method, as shown in fig. 1, the method including:
s102, acquiring a plurality of result texts corresponding to query input;
the result text in this embodiment may be a recommended text, a search result, and the like corresponding to the user query input in the network platform. The query input in this embodiment may be a query term and/or a selected query condition input by the user through the search page, or a query term and/or a query condition generated by the platform according to the behavior of the user. In addition, the query input can also be the selection operation of the user on the platform for the related recommended words or the notification information, and the like.
In this embodiment, a query input of a user for an online platform is received, and then a plurality of result texts corresponding to the query input are obtained, for example, a query input for a current query or current browsing of the user may be obtained through a query entry of the platform. Further, the platform performs search query according to the obtained query input, and recalls a plurality of result texts matched with the query input. For example, when the user clicks "roast," each recalled referral may be considered a result text.
S104, matching a plurality of result texts in a pre-constructed synonym knowledge base, wherein the synonym knowledge base is generated according to a prediction result of a pre-trained text duplication elimination model, and the text duplication elimination model is used for performing semantic duplication prediction according to text features, context features and expansion features of the result texts;
in this embodiment, synonym knowledge bases are pre-constructed, each synonym knowledge base corresponds to one text semantic, in one synonym knowledge base, the semantics of the texts are similar, and the similarity between the semantics is greater than a preset threshold. For example, when the text a and the text B are repeated and the text B and the text C are repeated, it can be reasonably determined that the text a and the text C are repeated. Therefore, the multiple result texts can be matched quickly through the pre-constructed synonym knowledge base.
The synonym knowledge base in the embodiment is constructed in advance by the prediction result of the text duplication removal model, the text duplication removal model carries out classification prediction on the texts in the text data source in a pairwise grouping mode in sequence, and finally the synonym knowledge base is constructed. In addition, after the synonym knowledge base is constructed, and after a new text is obtained subsequently, the new text is added to the synonym knowledge base with similar or same semantics as a text increment.
And S106, screening out repeated texts in the plurality of result texts according to the matching result of the synonym knowledge base.
In this embodiment, by matching the result texts against the synonym knowledge base, duplicate texts with the same or similar semantics can be identified and removed from the result texts, and the deduplicated result texts are then displayed to the user who issued the query input.
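Since the knowledge base is maintained as key-value pairs (as described later in this example), the online matching step reduces to one dictionary lookup per result text. A minimal sketch, assuming the knowledge base maps each known text to the key of its synonym group:

```python
def dedup_results(result_texts, synonym_kb):
    """Keep the first result text from each synonym group, preserving order.

    synonym_kb: dict mapping a text to its synonym-group key; texts absent
    from the knowledge base are treated as their own group.
    """
    seen, kept = set(), []
    for text in result_texts:
        key = synonym_kb.get(text, text)
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept
```

The lookup is O(1) per result text, which is what makes knowledge-base matching faster than online pairwise model inference.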
It should be noted that in this embodiment a plurality of result texts corresponding to a query input are acquired; the plurality of result texts are matched against a pre-constructed synonym knowledge base, wherein the synonym knowledge base is generated from the prediction results of a pre-trained text deduplication model, and the text deduplication model performs semantic duplicate prediction based on text features, context features, and extension features of the result texts; and duplicate texts are filtered out of the plurality of result texts according to the matching result from the synonym knowledge base. The synonym knowledge base, constructed from the prediction results of the pre-trained text deduplication model, enables fast online text deduplication, thereby improving the timeliness of recommended search terms and the accuracy of the recommended results, and solving the technical problems in the related art of poor accuracy and timeliness caused by real-time online model inference.
Optionally, in this embodiment, before the plurality of result texts corresponding to the query input are acquired, the method further includes, but is not limited to: performing, by the text deduplication model, semantic duplicate prediction based on the text features, context features, and extension features respectively corresponding to first text data and second text data, to obtain a prediction result for the first text data and the second text data; and if the prediction result is that the text semantics are the same, adding the first text data and the second text data into the synonym knowledge base.
In the implementation of this embodiment, the text deduplication model needs to be trained first.
In some embodiments, a training sample set is constructed from the text data of texts recalled for query inputs; each training sample in the set includes at least the following information: text information, context information, and extension information. The text information includes, but is not limited to, the query input and the result text; the context information is constructed from the correspondence, on the online platform, between a query input and the texts it recalls; and the extension information includes, but is not limited to, user information, user interaction information, user preferences, and the like.
In this embodiment, each training sample is constructed from a group of two training text data items, and includes a first text, a second text, a first context, a second context, first extension information, second extension information, and the semantic similarity between the first text and the second text. In some embodiments, each training sample is represented as a seven-tuple <first text, second text, first context, second context, first extension information, second extension information, semantic similarity>.
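The seven-tuple can be represented directly, for example as a small data class (the field names and types here are illustrative, not taken from the patent):

```python
from dataclasses import dataclass

@dataclass
class TrainingSample:
    """One pairwise training sample for the text deduplication model."""
    first_text: str
    second_text: str
    first_context: str
    second_context: str
    first_extension: dict   # e.g. user info, interaction info, preferences
    second_extension: dict
    semantic_similarity: float  # label: 1.0 for duplicates, 0.0 otherwise
```

A batch of such samples is then what the model training loop consumes.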
In addition, as an optional implementation, since positive samples are usually far fewer than negative samples in the text duplicate-judgment task, initial threshold-based filtering is performed using edit distance and the Jaro-Winkler string-similarity algorithm, screening similar samples and filtering out most of the negative samples.
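The pre-filtering step might look like the following sketch, using plain Levenshtein edit distance (the Jaro-Winkler variant mentioned above would substitute a string-similarity score and threshold):

```python
def edit_distance(a, b):
    """Levenshtein distance with a rolling single-row DP table."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def candidate_pairs(texts, max_dist=3):
    """Keep only text pairs close enough to be plausible duplicates."""
    pairs = []
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if edit_distance(texts[i], texts[j]) <= max_dist:
                pairs.append((texts[i], texts[j]))
    return pairs
```

Distant pairs are almost certainly negatives, so dropping them rebalances the training set before labeling.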
Next, the text deduplication model is trained on the constructed training sample set: the text information, context information, and extension information respectively corresponding to the first and second training text data serve as the model input, and the semantic similarity of the first and second training text data serves as the training target, until the model converges or reaches a preset number of iterations.
Then, the text features, context features, and extension features corresponding to the first text data and the second text data are fed into the pre-trained text deduplication model to obtain a semantic similarity score, or a judgment of whether the semantics of the two are similar. If the semantics of the first text data and the second text data are similar, the two are added into the synonym knowledge base of the same semantics as synonyms or near-synonyms.
If the semantics of the first text data and the second text data are not similar, both texts are retained.
Through this embodiment, semantic similarity prediction is performed in advance by the text deduplication model on the result texts corresponding to query inputs, so that the result texts of a query input can be semantically judged quickly, improving the timeliness of text deduplication.
Optionally, in this embodiment, the text deduplication model includes a text processing sub-module and a compressed interaction layer, and performing semantic duplicate prediction by the text deduplication model based on the text features, context features, and extension features respectively corresponding to the first text data and the second text data includes, but is not limited to: determining, by the text processing sub-module, a first vector representation from a first text feature of the first text data and a second text feature of the second text data; determining, by the compressed interaction layer, a second vector representation from the context features and the extension features; determining a third vector representation from the text features, context features, and extension features respectively corresponding to the first and second text data; and determining the prediction result from the first, second, and third vector representations.
Specifically, as shown in the structural diagram of fig. 2, the text deduplication model 20 includes a text processing sub-module 210, a compressed interaction layer 220, a feature enhancement layer 230, and a classification layer 240. The first text feature of the first text data and the second text feature of the second text data are input into the text processing sub-module 210 to obtain the first vector representation; the context features and extension features corresponding to the first and second text data are input into the compressed interaction layer 220 to obtain the second vector representation. The text features, context features, and extension features respectively corresponding to the first and second text data are then vector-concatenated to obtain the third vector representation; finally, the prediction result for the first and second text data is determined through the feature enhancement layer 230 and the classification layer 240.
Optionally, in this embodiment, the text deduplication model includes a classification layer and a feature enhancement layer, and determining the prediction result from the first, second, and third vector representations includes, but is not limited to: vector-summing the first, second, and third vector representations to obtain a fourth vector representation; performing feature enhancement on the fourth vector representation through the feature enhancement layer to obtain a fifth vector representation; and classifying the fifth vector representation through the classification layer to determine the prediction result for the first text and the second text.
In a specific application scenario, as shown in fig. 3, the overall structure of the text deduplication model is xDeepFM: the context features and other extension features are processed by a feature processing layer and then input into the compressed interaction network (CIN) for explicit high-order feature combination. The deep neural network (DNN) in xDeepFM is replaced by BERT (Bidirectional Encoder Representations from Transformers), and the first text and the second text are input into the BERT model to obtain the first vector representation.
In addition, the compressed interaction layer of the text deduplication model comprises an embedding-vector layer and a Compressed Interaction Network (CIN): the context features and extension features are first input into the embedding-vector layer, and the output of the embedding-vector layer is then input into the CIN to obtain the second vector representation.
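The CIN builds explicit high-order interactions by taking element-wise products between the field vectors of each layer and the base embeddings, then sum-pooling each layer's output. A NumPy sketch of the forward pass (dimensions and weight shapes are illustrative, not the patent's):

```python
import numpy as np

def cin_layer(x_prev, x0, w):
    """One CIN layer.

    x_prev: (H_prev, D) previous layer's field vectors
    x0:     (m, D)      base field embeddings
    w:      (H, H_prev, m) feature maps for H output fields
    """
    # Pairwise element-wise interactions along the embedding dimension.
    z = np.einsum('id,jd->ijd', x_prev, x0)      # (H_prev, m, D)
    return np.einsum('hij,ijd->hd', w, z)        # (H, D)

def cin_forward(x0, weights):
    """Stack CIN layers; sum-pool each layer over D and concatenate."""
    x, pooled = x0, []
    for w in weights:
        x = cin_layer(x, x0, w)
        pooled.append(x.sum(axis=1))             # (H,)
    return np.concatenate(pooled)
```

The concatenated pooled vector is what would serve as the second vector representation here.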
On one hand, a first vector representation is obtained through text features corresponding to the first text and the second text respectively, a second vector representation is obtained through the context features and the expansion features, and the first text, the second text, the context features and the expansion features are input to a Linear layer to obtain a third vector representation.
On the other hand, the first vector representation, the second vector representation, and the third vector representation are vector-summed in the Add layer of the text deduplication model to obtain the fourth vector representation, on which data enhancement is then performed by the feature enhancement (Mix Up) layer to obtain the fifth vector representation.
The fifth vector representation is then classified by the Output Unit to obtain the semantic similarity (or a similarity score) between the first text and the second text. Optionally, the classification layer 240 may be implemented as a Multi-Layer Perceptron (MLP), which is not limited in this embodiment.
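The Add, Mix Up, and classification steps can be sketched as follows. This is an illustrative NumPy version under the standard Mix Up formulation (Beta-sampled interpolation of features and labels); the dimensions and the randomly initialized MLP weights are assumptions for the example, not the embodiment's trained parameters.

```python
import numpy as np

def add_and_mixup(reps_a, reps_b, y_a, y_b, alpha=0.2, rng=None):
    """Add layer + Mix Up sketch for two training pairs.

    reps_a / reps_b: the three vector representations (first, second,
    third) of two text pairs; y_a / y_b: their duplicate labels (0/1).
    Returns the mixed feature vector and its soft label.
    """
    rng = rng or np.random.default_rng()
    fourth_a = sum(reps_a)               # Add layer: element-wise vector sum
    fourth_b = sum(reps_b)
    lam = rng.beta(alpha, alpha)         # Mix Up interpolation coefficient
    mixed = lam * fourth_a + (1.0 - lam) * fourth_b
    soft_label = lam * y_a + (1.0 - lam) * y_b
    return mixed, soft_label

def mlp_classify(x, w1, b1, w2, b2):
    """Two-layer perceptron head producing a duplicate-probability score."""
    h = np.maximum(0.0, w1 @ x + b1)             # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))  # sigmoid output

rng = np.random.default_rng(1)
d = 16
reps_a = [rng.normal(size=d) for _ in range(3)]  # pair A's three representations
reps_b = [rng.normal(size=d) for _ in range(3)]  # pair B's three representations
mixed, soft_label = add_and_mixup(reps_a, reps_b, y_a=1.0, y_b=0.0, rng=rng)
score = mlp_classify(mixed, rng.normal(size=(8, d)), np.zeros(8),
                     rng.normal(size=8), 0.0)
print(mixed.shape)                               # (16,)
```

Mix Up regularizes training by interpolating between examples; at inference time the enhancement is skipped and the summed vector goes directly to the classifier.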
In this embodiment, the semantic similarity between the first text data and the second text data is determined by the text deduplication model from the text features, context features, and extension features respectively corresponding to the first text data and the second text data, which improves the accuracy of the prediction result.
Optionally, in this embodiment, if the prediction result is that the text semantics are the same, adding the first text data and the second text data to the synonym knowledge base includes, but is not limited to: determining the synonym knowledge base corresponding to the text semantics according to the text semantics of the first text data and the second text data, wherein the semantic distance between text pairs in the synonym knowledge base is smaller than a preset semantic distance threshold; and adding the first text data and the second text data to the synonym knowledge base.
Specifically, in this embodiment, the output of the text deduplication model indicates whether the text pair formed by the first text data and the second text data is a duplicate. Duplication is transitive: for example, when text A duplicates text B and text B duplicates text C, it can reasonably be inferred that text A duplicates text C. Since the construction of text pairs is filtered by a distance threshold and therefore does not cover all pair combinations, more duplicate texts can be discovered through this transitivity. However, naively traversing the duplicate chain on every deduplication incurs a significant time cost. To improve duplicate-detection performance and knowledge base update efficiency, this embodiment adopts a union-find (disjoint-set) algorithm, maintains the synonym knowledge base in key-value form, and applies path compression optimization during offline updates.
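The key-value union-find scheme with path compression can be sketched as follows; the class and method names are illustrative, not from this embodiment.

```python
class SynonymUnionFind:
    """Union-find over texts with path compression, mirroring the
    key-value synonym knowledge base described above: each text maps to a
    representative, and duplication is transitive (A~B, B~C => A~C).
    """

    def __init__(self):
        self.parent = {}  # text -> representative text (key-value store)

    def find(self, text):
        self.parent.setdefault(text, text)
        root = text
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root,
        # so subsequent look-ups are O(1) amortized.
        while self.parent[text] != root:
            self.parent[text], text = root, self.parent[text]
        return root

    def union(self, a, b):
        """Record that the model predicted a and b to be duplicates."""
        self.parent[self.find(a)] = self.find(b)

    def is_duplicate(self, a, b):
        return self.find(a) == self.find(b)

kb = SynonymUnionFind()
kb.union("text A", "text B")
kb.union("text B", "text C")
print(kb.is_duplicate("text A", "text C"))   # True: duplication is transitive
```

After an offline update pass, every text points (nearly) directly at its group representative, so online duplicate checks reduce to two dictionary look-ups instead of traversing the whole duplicate chain.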
It should be noted that if the parameters of the text deduplication model are updated or adjusted, the synonym knowledge base is fully rebuilt; otherwise, the synonym knowledge base is incrementally updated with synonyms and near-synonyms on a regular schedule, ensuring that the knowledge base maintains good coverage and captures newly emerging popular texts in time.
In this embodiment, a plurality of result texts corresponding to a query input are obtained; the result texts are matched against a pre-constructed synonym knowledge base, where the knowledge base is generated from the prediction results of a pre-trained text deduplication model that performs semantic duplication prediction based on the text features, context features, and extension features of the result texts; and duplicate texts among the result texts are screened out according to the matching result. Because the synonym knowledge base is built offline from the prediction results of the pre-trained model, texts can be deduplicated online by fast look-ups rather than real-time model inference, which improves both the timeliness of recommended search terms and the accuracy of recommendation results, thereby solving the accuracy and timeliness problems caused by real-time online model inference in the related art.
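A minimal sketch of the online lookup-based screening step might look like this, assuming the knowledge base has been flattened offline into a plain text-to-representative mapping; all texts shown are hypothetical examples.

```python
def deduplicate_results(result_texts, representative_of):
    """Screen out repeated texts from a query's result list using a
    pre-built synonym knowledge base.

    representative_of: dict mapping each known text to the representative
    of its synonym group; texts missing from the base are treated as
    their own representative. Keeps the first text seen from each
    synonym group, preserving result order. No model inference is run.
    """
    seen_groups = set()
    kept = []
    for text in result_texts:
        rep = representative_of.get(text, text)   # O(1) knowledge base look-up
        if rep not in seen_groups:
            seen_groups.add(rep)
            kept.append(text)
    return kept

synonym_kb = {"spicy hotpot": "hotpot", "hot pot": "hotpot"}
results = ["hot pot", "milk tea", "spicy hotpot", "noodles"]
print(deduplicate_results(results, synonym_kb))
# ['hot pot', 'milk tea', 'noodles'] — "spicy hotpot" filtered as a duplicate
```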
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 2
According to an embodiment of the present invention, there is also provided a text deduplication device for implementing the text deduplication method, as shown in fig. 4, the device includes:
1) an obtaining module 40, configured to obtain a plurality of result texts corresponding to query inputs;
2) a matching module 42, configured to match the multiple result texts in a pre-constructed synonym knowledge base, where the synonym knowledge base is generated according to a prediction result of a pre-trained text deduplication model, and the text deduplication model is used for performing semantic deduplication prediction according to text features, context features, and expansion features of the result texts;
3) and the duplicate removal module 44 is configured to screen out duplicate texts in the multiple result texts according to the matching result of the synonym knowledge base.
Optionally, in this embodiment, the method further includes:
1) the classification module is used for performing semantic repeated prediction according to text features, context features and expansion features respectively corresponding to the first text data and the second text data through the text duplication removal model before the plurality of result texts corresponding to the query input are obtained, so that prediction results of the first text data and the second text data are obtained;
2) and the storage module is used for adding the first text data and the second text data into the synonym knowledge base if the prediction result is that the text semantics are the same.
Optionally, in this embodiment, the text deduplication model includes a text processing sub-module and a compression interaction layer, where the classification module includes:
1) the first determining submodule is used for determining a first vector representation according to a first text feature of the first text data and a second text feature of the second text data through the text processing submodule;
2) a second determining submodule, configured to determine, by the compression interaction layer, a second vector representation according to the context feature and the extension feature;
3) the third determining submodule is used for determining a third vector representation according to the text feature, the context feature and the expansion feature which respectively correspond to the first text data and the second text data;
4) a fourth determination submodule for determining the prediction result from the first vector representation, the second vector representation and the third vector representation.
Optionally, in this embodiment, the fourth determining sub-module includes:
1) a processing unit configured to perform vector summation on the first vector representation, the second vector representation, and the third vector representation to obtain a fourth vector representation;
2) a feature enhancement unit, configured to perform feature enhancement on the fourth vector representation through the feature enhancement layer to obtain a fifth vector representation;
3) a determining unit, configured to classify, by the classification layer, the fifth vector representation to determine the prediction result of the first text and the second text.
Optionally, in this embodiment, the storage module includes:
1) determining the synonym knowledge base corresponding to the text semantics according to the text semantics corresponding to the first text data and the second text data, wherein the semantic distance between text pairs in the synonym knowledge base is smaller than a preset semantic distance threshold;
2) and adding the first text data and the second text data into the synonym knowledge base.
In this embodiment, a plurality of result texts corresponding to a query input are obtained; the result texts are matched against a pre-constructed synonym knowledge base, where the knowledge base is generated from the prediction results of a pre-trained text deduplication model that performs semantic duplication prediction based on the text features, context features, and extension features of the result texts; and duplicate texts among the result texts are screened out according to the matching result. Because the synonym knowledge base is built from the prediction results of the pre-trained model, texts can be deduplicated online quickly, which improves the timeliness of recommended search terms and the accuracy of recommendation results, thereby solving the accuracy and timeliness problems caused by real-time online model inference in the related art.
Example 3
There is also provided, according to an embodiment of the present invention, an electronic device, including a processor, a memory, and a program or instructions stored on the memory and executable on the processor, where the program or instructions, when executed by the processor, implement the steps of the text deduplication method as described above.
Optionally, in this embodiment, the memory is configured to store program code for performing the following steps:
s1, acquiring a plurality of result texts corresponding to the query input;
s2, matching the plurality of result texts in a pre-constructed synonym knowledge base, wherein the synonym knowledge base is generated according to a prediction result of a pre-trained text duplication elimination model, and the text duplication elimination model is used for performing semantic duplication prediction according to text features, context features and expansion features of the result texts;
s3, screening out repeated texts in the multiple result texts according to the matching result of the synonym knowledge base.
Optionally, for a specific example in this embodiment, reference may be made to the example described in embodiment 1 above, and this embodiment is not described herein again.
Example 4
Embodiments of the present invention also provide a readable storage medium on which a program or instructions are stored, which when executed by a processor implement the steps of the text deduplication method as described above.
Optionally, in this embodiment, the readable storage medium is configured to store program code for performing the following steps:
s1, acquiring a plurality of result texts corresponding to the query input;
s2, matching the result texts in a pre-constructed synonym knowledge base, wherein the synonym knowledge base is generated according to a prediction result of a pre-trained text duplication elimination model, and the text duplication elimination model is used for performing semantic duplication prediction according to text features, context features and expansion features of the result texts;
s3, screening out repeated texts in the multiple result texts according to the matching result of the synonym knowledge base.
Optionally, the readable storage medium is further configured to store program codes for executing the steps included in the method in embodiment 1, which is not described in detail in this embodiment.
Optionally, in this embodiment, the readable storage medium may include, but is not limited to: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media capable of storing program code.
Optionally, the specific example in this embodiment may refer to the example described in embodiment 1 above, and this embodiment is not described again here.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be personal computers, servers, network devices, etc.) to execute all or part of the steps of the method according to the embodiments of the present invention.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications shall also fall within the protection scope of the present invention.

Claims (12)

1. A method for text deduplication, comprising:
acquiring a plurality of result texts corresponding to query input;
matching the result texts in a pre-constructed synonym knowledge base, wherein the synonym knowledge base is generated according to a prediction result of a pre-trained text duplication elimination model, and the text duplication elimination model is used for performing semantic duplication prediction according to text features, context features and expansion features of the result texts;
and screening out repeated texts in the plurality of result texts according to the matching result of the synonym knowledge base.
2. The method of claim 1, prior to obtaining a plurality of result texts corresponding to the query input, further comprising:
performing semantic repeated prediction according to text features, context features and expansion features respectively corresponding to the first text data and the second text data through the text duplication removal model to obtain prediction results of the first text data and the second text data;
and if the prediction result is that the text semantics are the same, adding the first text data and the second text data into the synonym knowledge base.
3. The method of claim 2, wherein the text deduplication model comprises a text processing sub-module and a compression interaction layer, wherein,
performing semantic repeat prediction according to the text features, the context features and the extension features respectively corresponding to the first text data and the second text data through the text deduplication model, wherein the semantic repeat prediction comprises the following steps:
determining, by the text processing sub-module, a first vector representation according to a first text feature of the first text data and a second text feature of the second text data;
determining, by the compression interaction layer, a second vector representation from the context feature and the extension feature;
determining a third vector representation according to the text feature, the context feature and the expansion feature respectively corresponding to the first text data and the second text data;
determining the prediction result from the first vector representation, the second vector representation, and the third vector representation.
4. The method of claim 3, wherein the text deduplication model comprises a classification layer and a feature enhancement layer, and wherein determining the prediction result according to the first vector representation, the second vector representation, and the third vector representation comprises:
vector-summing the first, second, and third vector representations to obtain a fourth vector representation;
performing feature enhancement on the fourth vector representation by the feature enhancement layer to obtain a fifth vector representation;
classifying, by the classification layer, the fifth vector representation to determine a prediction result of the first text and the second text.
5. The method according to claim 2, wherein if the prediction result is that the text semantics are the same, adding the first text data and the second text data to the synonym knowledge base comprises:
determining the synonym knowledge base corresponding to the text semantics according to the text semantics corresponding to the first text data and the second text data, wherein the semantic distance between text pairs in the synonym knowledge base is smaller than a preset semantic distance threshold;
and adding the first text data and the second text data into the synonym knowledge base.
6. A text deduplication apparatus, comprising:
the acquisition module is used for acquiring a plurality of result texts corresponding to the query input;
the matching module is used for matching the result texts in a pre-constructed synonym knowledge base, wherein the synonym knowledge base is generated according to a prediction result of a pre-trained text duplication elimination model, and the text duplication elimination model is used for performing semantic duplication prediction according to text features, context features and expansion features of the result texts;
and the duplication removing module is used for screening out repeated texts in the result texts according to the matching result of the synonym knowledge base.
7. The apparatus of claim 6, further comprising:
the classification module is used for performing semantic repeated prediction according to text features, context features and expansion features respectively corresponding to the first text data and the second text data through the text duplication removal model before the plurality of result texts corresponding to the query input are obtained, so that prediction results of the first text data and the second text data are obtained;
and the storage module is used for adding the first text data and the second text data into the synonym knowledge base if the prediction result is that the text semantics are the same.
8. The apparatus of claim 7, wherein the text de-duplication model comprises a text processing sub-module and a compression interaction layer, and wherein the classification module comprises:
the first determining submodule is used for determining a first vector representation according to a first text feature of the first text data and a second text feature of the second text data through the text processing submodule;
a second determining submodule, configured to determine, by the compression interaction layer, a second vector representation according to the context feature and the extension feature;
the third determining submodule is used for determining a third vector representation according to the text feature, the context feature and the expansion feature which respectively correspond to the first text data and the second text data;
a fourth determination submodule for determining the prediction result from the first vector representation, the second vector representation and the third vector representation.
9. The apparatus of claim 8, wherein the fourth determination submodule comprises:
a processing unit configured to perform vector summation on the first vector representation, the second vector representation, and the third vector representation to obtain a fourth vector representation;
a feature enhancement unit, configured to perform feature enhancement on the fourth vector representation through the feature enhancement layer to obtain a fifth vector representation;
a determining unit, configured to classify, by the classification layer, the fifth vector representation to determine a prediction result of the first text and the second text.
10. The apparatus of claim 7, wherein the storage module comprises:
determining the synonym knowledge base corresponding to the text semantics according to the text semantics corresponding to the first text data and the second text data, wherein the semantic distance between text pairs in the synonym knowledge base is smaller than a preset semantic distance threshold;
and adding the first text data and the second text data into the synonym knowledge base.
11. An electronic device comprising a processor, a memory, and a program or instructions stored on the memory and executable on the processor, the program or instructions, when executed by the processor, implementing the steps of the text deduplication method of any one of claims 1 to 5.
12. A readable storage medium, on which a program or instructions are stored, which, when executed by a processor, implement the steps of the text deduplication method of any one of claims 1 to 5.
CN202210356716.XA 2022-04-06 2022-04-06 Text duplicate removal method and device, electronic equipment and readable storage medium Pending CN114818672A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210356716.XA CN114818672A (en) 2022-04-06 2022-04-06 Text duplicate removal method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210356716.XA CN114818672A (en) 2022-04-06 2022-04-06 Text duplicate removal method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN114818672A true CN114818672A (en) 2022-07-29

Family

ID=82532256

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210356716.XA Pending CN114818672A (en) 2022-04-06 2022-04-06 Text duplicate removal method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN114818672A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116167455A (en) * 2022-12-27 2023-05-26 北京百度网讯科技有限公司 Model training and data deduplication method, device, equipment and storage medium
CN116167455B (en) * 2022-12-27 2023-12-22 北京百度网讯科技有限公司 Model training and data deduplication method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN108334533B (en) Keyword extraction method and device, storage medium and electronic device
WO2021027362A1 (en) Information pushing method and apparatus based on data analysis, computer device, and storage medium
KR101644817B1 (en) Generating search results
CN111797210A (en) Information recommendation method, device and equipment based on user portrait and storage medium
CN106844640B (en) Webpage data analysis processing method
CN106599160B (en) Content rule library management system and coding method thereof
CN104601438A (en) Friend recommendation method and device
TW201214168A (en) Sort method and device of searching results
US11775767B1 (en) Systems and methods for automated iterative population of responses using artificial intelligence
WO2020228182A1 (en) Big data-based data deduplication method and apparatus, device, and storage medium
CN112100396A (en) Data processing method and device
CN111552767A (en) Search method, search device and computer equipment
CN111538903B (en) Method and device for determining search recommended word, electronic equipment and computer readable medium
CN112115342A (en) Search method, search device, storage medium and terminal
CN114818672A (en) Text duplicate removal method and device, electronic equipment and readable storage medium
CN114490923A (en) Training method, device and equipment for similar text matching model and storage medium
CN111930949B (en) Search string processing method and device, computer readable medium and electronic equipment
CN110399464B (en) Similar news judgment method and system and electronic equipment
US11748435B2 (en) Content-free system and method to recommend news and articles
CN112507709B (en) Document matching method, electronic equipment and storage device
CN115292478A (en) Method, device, equipment and storage medium for recommending search content
CN114741606A (en) Enterprise recommendation method and device, computer readable medium and electronic equipment
CN113868481A (en) Component acquisition method and device, electronic equipment and storage medium
CN112532692A (en) Information pushing method and device and storage medium
CN102541857A (en) Webpage sorting method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination