CN117764052A - Method, device, equipment and medium for checking text similarity degree

Info

Publication number: CN117764052A
Application number: CN202311841001.4A
Authority: CN
Inventors: 孙武; 昝云飞; 徐红; 高翔; 纪达麒; 陈运文
Original assignee: Daguan Technology Beijing Co ltd
Current assignee: Daguan Technology Beijing Co ltd
Priority date: 2023-12-28
Filing date: 2023-12-28
Publication date: 2024-03-26

Abstract

The invention discloses a method, a device, equipment and a medium for checking text similarity. A method of verifying text similarity comprising: vectorizing texts in the sample set to be tested to obtain a text vector feature set; according to a first similarity algorithm, a similarity threshold value and a text vector feature set, carrying out similar sample rejection on each sample subset to be tested in the sample set to be tested to obtain each primary screening sample subset to be tested; performing word segmentation processing on the sample set to be tested to obtain a text word segmentation sample set, and performing similar sample rejection on each primary screening sample subset to be tested according to the text word segmentation sample set, a second similarity algorithm and a similarity threshold value to obtain each target cleaning sample subset. The technical scheme of the embodiment of the invention can effectively improve the cleaning effect of the text data set.

Description

Method, device, equipment and medium for checking text similarity degree

Technical Field

The present invention relates to the field of machine learning technologies, and in particular, to a method, an apparatus, a device, and a medium for checking text similarity.

Background

In recent years, as text classification is applied to more and more data sets, the quality requirements for the collected data have been rising.

The data to be collected at present come from real-world scenarios and internal data files. Much of this data is repetitive and highly similar; such data brings little improvement to a deep learning model and can cause overfitting when a deep learning model is trained for text classification. To avoid this, the degree of repetition in a data set needs to be checked and the data set cleaned according to similarity. However, current data set cleaning performs poorly and still needs improvement.

Disclosure of Invention

The invention provides a method, a device, equipment and a medium for checking text similarity, which are used for solving the problem of poor cleaning effect of the existing text data set.

According to an aspect of the present invention, there is provided a method of checking a degree of similarity of texts, including:

vectorizing texts in the sample set to be tested to obtain a text vector feature set;

according to a first similarity algorithm, a similarity threshold value and a text vector feature set, carrying out similar sample rejection on each sample subset to be tested in the sample set to be tested to obtain each primary screening sample subset to be tested;

performing word segmentation processing on the sample set to be tested to obtain a text word segmentation sample set, and performing similar sample rejection on each primary screening sample subset to be tested according to the text word segmentation sample set, a second similarity algorithm and a similarity threshold value to obtain each target cleaning sample subset.

According to another aspect of the present invention, there is provided an apparatus for checking a degree of similarity of texts, comprising:

the vectorization processing module is used for vectorizing texts in the sample set to be tested to obtain a text vector feature set;

the first similar sample eliminating module is used for eliminating similar samples of all sample subsets to be tested in the sample set to be tested according to a first similarity algorithm, a similarity threshold value and a text vector feature set to obtain all primary screening sample subsets to be tested;

the second similar sample eliminating module is used for carrying out word segmentation processing on the sample set to be tested to obtain a text word segmentation sample set, and carrying out similar sample elimination on each primary screening sample subset to be tested according to the text word segmentation sample set, a second similarity algorithm and a similarity threshold value to obtain each target cleaning sample subset.

According to another aspect of the present invention, there is provided an electronic apparatus including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method of verifying text similarity according to any of the embodiments of the present invention.

According to another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to execute a method of verifying text similarity according to any embodiment of the present invention.

According to the technical scheme, the text in the sample set to be tested is vectorized to obtain the text vector feature set, so that similar sample rejection is carried out on each sample subset to be tested in the sample set to be tested according to a first similarity algorithm, a similarity threshold value and the text vector feature set to obtain each primary screening sample subset to be tested, word segmentation is carried out on the sample set to be tested to obtain a text word segmentation sample set, and similar sample rejection is carried out on each primary screening sample subset to be tested according to the text word segmentation sample set, a second similarity algorithm and the similarity threshold value to obtain each target cleaning sample subset. According to the method and the device, the recognition and elimination of the similar samples are carried out on the sample set to be tested from the two characteristic dimensions of the text vector and the text word segmentation, so that the similar samples in the sample set to be tested are effectively cleaned, the problem that the cleaning effect of the existing text data set is poor is solved, and the cleaning effect of the text data set can be effectively improved.

It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flowchart of a method for checking text similarity according to a first embodiment of the present invention;

FIG. 2 is a flowchart of a method for checking text similarity according to a second embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a device for checking text similarity according to a third embodiment of the present invention;

FIG. 4 is a schematic structural diagram of an electronic device that may be used to implement an embodiment of the invention.

Detailed Description

In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Example 1

Fig. 1 is a flowchart of a method for checking text similarity according to an embodiment of the present invention, where the method may be performed by a device for checking text similarity, and the device for checking text similarity may be implemented in hardware and/or software, and the device for checking text similarity may be configured in an electronic device.

As shown in fig. 1, the method includes:

Step 110, vectorizing texts in the sample set to be tested to obtain a text vector feature set.

The sample set to be tested may be a sample set that requires data cleaning to remove similar texts. By way of example, the sample set to be tested may include, but is not limited to, a sample set from the power communication industry, the news industry, a recommendation scenario, or the financial industry. The text vector feature set may be a feature set obtained by vectorizing the texts in the sample set to be tested.

In the embodiment of the invention, the text in the sample set to be tested can be vectorized based on the vectorization processing tool to obtain the text vector feature set.

Optionally, the vectorization processing tool may include, but is not limited to, a RoFormer model.
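As a rough illustration of what a text vector feature set looks like as a data structure, the sketch below maps each sample identifier to a fixed-length vector. The function `toy_vectorize`, the bigram hashing scheme, the dimension 16 and the sample texts are all hypothetical stand-ins chosen so that the example runs with the standard library only; they are not the RoFormer model named above, and a real setup would replace `toy_vectorize` with an embedding model.

```python
import zlib
from typing import Dict, List


def toy_vectorize(text: str, dim: int = 16) -> List[float]:
    """Hash character bigrams into a fixed-length count vector.

    This is a toy stand-in for a learned embedding model, used only to
    show the shape of the data.
    """
    vec = [0.0] * dim
    for i in range(len(text) - 1):
        bigram = text[i:i + 2].encode("utf-8")
        # crc32 is deterministic across runs, unlike the built-in hash().
        vec[zlib.crc32(bigram) % dim] += 1.0
    return vec


# Hypothetical sample set to be tested: sample id -> text.
samples = {
    "s1": "power grid outage report",
    "s2": "power grid outage reports",
}

# Text vector feature set: sample id -> vector, in the same order as the samples.
text_vector_feature_set: Dict[str, List[float]] = {
    sample_id: toy_vectorize(text) for sample_id, text in samples.items()
}
```

Keeping the feature set keyed by sample identifier makes it straightforward to look up the features that match any subset of samples in the later steps.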

Step 120, according to the first similarity algorithm, the similarity threshold value and the text vector feature set, performing similar sample rejection on each sample subset to be tested in the sample set to be tested to obtain each primary screening sample subset to be tested.

The first similarity algorithm may be an algorithm for calculating the similarity between vectorized texts. The first similarity algorithm may include, but is not limited to, a cosine similarity algorithm. The similarity threshold may be a preset upper limit value of the similarity between texts. A sample subset to be tested may be a subset divided from the sample set to be tested. A primary screening sample subset to be tested may be the sample subset obtained after the first similar sample rejection is performed on a sample subset to be tested in the sample set to be tested.
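A minimal sketch of cosine similarity as one possible first similarity algorithm is given below. The function name `cosine_similarity`, the convention of returning 0.0 for an all-zero vector, and the example vectors are illustrative assumptions; the threshold value 0.8 is taken from the specific example later in this description.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 when either vector is all zeros (an assumed convention).
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


SIMILARITY_THRESHOLD = 0.8  # value used in the specific example of this description

# Two nearly parallel vectors are treated as similar texts.
is_similar = cosine_similarity([1.0, 0.0], [1.0, 0.1]) > SIMILARITY_THRESHOLD
```

Because cosine similarity depends only on direction, two text vectors that differ only in magnitude score 1.0 and would be treated as duplicates under any threshold below 1.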

Optionally, each sample to be tested other than the last one, together with all samples that follow it in the sample set to be tested, may form one sample subset to be tested. For example, assuming that the sample set to be tested is {sample 1, sample 2, sample 3, sample 4}, the sample subsets to be tested are {sample 1, sample 2, sample 3, sample 4}, {sample 2, sample 3, sample 4}, and {sample 3, sample 4}, respectively.
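The subset division described above can be sketched as follows. The function name `build_subsets` is a hypothetical choice; the input and the resulting subsets reproduce the four-sample example in the preceding paragraph.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def build_subsets(samples: Sequence[T]) -> List[List[T]]:
    """Return one subset per sample except the last.

    Each subset holds a sample followed by every sample after it.
    """
    return [list(samples[i:]) for i in range(len(samples) - 1)]


subsets = build_subsets(["sample 1", "sample 2", "sample 3", "sample 4"])
# subsets[0] starts at sample 1, subsets[1] at sample 2, subsets[2] at sample 3.
```

With n samples this yields n - 1 subsets, so each unordered pair of samples is compared exactly once when the first sample of every subset is compared with the rest of that subset.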

In the embodiment of the invention, the text vector features in the text vector feature set that match each sample subset to be tested may be determined. The similarity between samples of each sample subset to be tested is then calculated according to the matched text vector features and the first similarity algorithm, and samples whose calculated similarity is greater than the similarity threshold are rejected as similar samples, so that each primary screening sample subset to be tested is obtained.
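A hedged sketch of the first rejection pass on a single sample subset to be tested follows. It assumes that the first sample of the subset acts as the reference against which the others are compared, and that the similarity function is supplied by the caller; `reject_similar`, `dot` and the example vectors are hypothetical names and values. The vectors are unit length, so the dot product used here equals their cosine similarity.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def reject_similar(
    subset: List[str],
    features: Dict[str, Vector],
    similarity: Callable[[Vector, Vector], float],
    threshold: float,
) -> Tuple[List[str], List[str]]:
    """Split a subset into (kept, rejected) relative to its first sample."""
    reference = subset[0]
    kept: List[str] = [reference]
    rejected: List[str] = []
    for sample in subset[1:]:
        score = similarity(features[reference], features[sample])
        if score > threshold:
            rejected.append(sample)  # too similar to the reference
        else:
            kept.append(sample)
    return kept, rejected


def dot(u: Vector, v: Vector) -> float:
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(a * b for a, b in zip(u, v))


features = {"a": (1.0, 0.0), "b": (0.96, 0.28), "c": (0.0, 1.0)}
kept, rejected = reject_similar(["a", "b", "c"], features, dot, threshold=0.8)
```

Here sample b scores 0.96 against the reference a and is rejected, while sample c scores 0.0 and is kept.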

Step 130, performing word segmentation processing on the sample set to be tested to obtain a text word segmentation sample set, and performing similar sample rejection on each primary screening sample subset to be tested according to the text word segmentation sample set, a second similarity algorithm and a similarity threshold value to obtain each target cleaning sample subset.

The text word segmentation sample set may be a sample set obtained by performing word segmentation on the sample set to be tested. The second similarity algorithm may be an algorithm that calculates the similarity between text word segments. A target cleaning sample subset may be the sample subset obtained after similar sample rejection is performed on a primary screening sample subset to be tested.

In the embodiment of the invention, after each primary screening sample subset to be tested is obtained, word segmentation may further be performed on the sample set to be tested to obtain the text word segmentation sample set, and the text word segments matching each primary screening sample subset to be tested are determined from the text word segmentation sample set. The similarity between samples of each primary screening sample subset to be tested is then calculated again according to the second similarity algorithm and the matched text word segments, and samples whose similarity is greater than the similarity threshold are rejected as similar samples, so that each target cleaning sample subset is obtained.

Optionally, the similarity between samples may uniformly be understood as the similarity between the reference test sample in a sample subset to be tested and each of the other samples in the same subset.

Example 2

Fig. 2 is a flowchart of a method for checking the similarity degree of texts according to a second embodiment of the present invention, which is embodied based on the above embodiment, and shows a specific alternative implementation manner of obtaining a sample set to be checked before vectorizing texts in the sample set to be checked. As shown in fig. 2, the method includes:

Step 210, acquiring a target data set.

Wherein the target data set may be a text data set that is determined based on model training needs.

In embodiments of the present invention, the target data set may be obtained based on model training requirements.

Step 220, screening a sample set to be tested from the target data set according to the label type of the target data set.

In the embodiment of the invention, the label type of the samples that currently require data cleaning may be determined from the label types of the target data set, and samples are screened from the target data set according to the determined label type to obtain the sample set to be tested.
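The label-based screening can be sketched as a simple filter. The `Sample` type, the function `select_by_label`, and the example texts and label names are hypothetical and serve only to illustrate the step.

```python
from typing import List, NamedTuple


class Sample(NamedTuple):
    text: str
    label: str


def select_by_label(dataset: List[Sample], label: str) -> List[Sample]:
    """Keep only the samples whose label type matches the one to be cleaned."""
    return [sample for sample in dataset if sample.label == label]


# Hypothetical target data set with two label types.
target_data_set = [
    Sample("stocks rally", "finance"),
    Sample("film premiere tonight", "entertainment"),
    Sample("film premieres tonight", "entertainment"),
]

sample_set_to_test = select_by_label(target_data_set, "entertainment")
```

Screening by label first means that similarity is only checked among samples of the same category, which keeps the number of pairwise comparisons down.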

Step 230, vectorizing the texts in the sample set to be tested to obtain a text vector feature set.

Step 240, according to the first similarity algorithm, the similarity threshold value and the text vector feature set, performing similar sample rejection on each sample subset to be tested in the sample set to be tested to obtain each primary screening sample subset to be tested.

In an optional embodiment of the present invention, performing similar sample rejection on each sample subset to be tested in the sample set to be tested according to the first similarity algorithm, the similarity threshold and the text vector feature set, to obtain each primary screening sample subset to be tested, may include: calculating the sample comparison first similarity of each sample subset to be tested according to the first similarity algorithm and the text vector feature set; and performing similar sample rejection on each sample subset to be tested in the sample set to be tested according to the sample comparison first similarity of each sample subset to be tested and the similarity threshold, to obtain each primary screening sample subset to be tested.

The sample comparison first similarity may be the similarity between samples of a sample subset to be tested, determined based on the first similarity algorithm and the text vector feature set.

In the embodiment of the invention, after the text vector features in the text vector feature set that match each sample subset to be tested are determined, the similarity between the matched text vector features may be calculated using the first similarity algorithm to obtain the sample comparison first similarity of each sample subset to be tested. The sample comparison first similarity of each sample subset to be tested is then compared with the similarity threshold, and the samples in each sample subset to be tested whose similarity to the corresponding reference test sample is greater than the similarity threshold are rejected, so that each primary screening sample subset to be tested is obtained.

In an optional embodiment of the present invention, calculating the sample comparison first similarity of each sample subset to be tested according to the first similarity algorithm and the text vector feature set may include: determining the text vector feature subset matching each sample subset to be tested according to each sample subset to be tested and the text vector feature set; and calculating, according to the first similarity algorithm, the similarity between the target text vector feature and each text vector feature to be compared in the text vector feature subset matching each sample subset to be tested, to obtain the sample comparison first similarity of each sample subset to be tested.

The text vector feature subset may be a feature set formed by the text vector features matching a sample subset to be tested. Illustratively, assuming that the text vector feature set is {sample 1 text vector, sample 2 text vector, sample 3 text vector, sample 4 text vector}, the text vector feature subset matching the sample subset to be tested {sample 2, sample 3, sample 4} is {sample 2 text vector, sample 3 text vector, sample 4 text vector}. The target text vector feature may be the text vector feature of the reference test sample in the sample subset to be tested. A text vector feature to be compared may be a text vector feature for which the inter-sample similarity with the target text vector feature is calculated. The sample corresponding to the target text vector feature and the sample corresponding to a text vector feature to be compared belong to the same sample subset to be tested.

According to the embodiment of the invention, the text vector feature corresponding to each sample in each sample subset to be tested may be determined according to the correspondence between the sample set to be tested and the text vector feature set, so that the text vector feature subset matching each sample subset to be tested is determined from the text vector features matching the samples in that subset. After the text vector feature subset matching each sample subset to be tested is obtained, the similarity between the target text vector feature and each text vector feature to be compared in that text vector feature subset is calculated by the first similarity algorithm, to obtain the sample comparison first similarity of each sample subset to be tested.
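The lookup of a text vector feature subset and the calculation of the sample comparison first similarity can be sketched as below. It assumes that the feature set is keyed by sample name, that the first sample of a subset is the reference test sample, and that cosine similarity is the first similarity algorithm; the names `feature_subset`, `first_similarities` and `cosine`, and the two-dimensional vectors, are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; 0.0 if either vector has zero norm (assumed convention)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def feature_subset(subset: List[str], feature_set: Dict[str, Vector]) -> Dict[str, Vector]:
    """Pick out the text vector features that match the samples of one subset."""
    return {name: feature_set[name] for name in subset}


def first_similarities(subset: List[str], feature_set: Dict[str, Vector]) -> Dict[str, float]:
    """Similarity of the reference (first) sample to every other sample in the subset."""
    matched = feature_subset(subset, feature_set)
    target = matched[subset[0]]  # target text vector feature
    return {name: cosine(target, matched[name]) for name in subset[1:]}


feature_set = {
    "sample 1": (1.0, 0.0),
    "sample 2": (1.0, 0.0),
    "sample 3": (0.0, 1.0),
    "sample 4": (1.0, 1.0),
}
scores = first_similarities(["sample 2", "sample 3", "sample 4"], feature_set)
```

The returned scores are what would then be compared with the similarity threshold to decide which samples of the subset are rejected.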

Step 250, word segmentation processing is carried out on the sample set to be tested to obtain a text word segmentation sample set, and similar sample rejection is carried out on each primary screening sample subset to be tested according to the text word segmentation sample set, a second similarity algorithm and a similarity threshold value to obtain each target cleaning sample subset.

In an optional embodiment of the present invention, performing similar sample rejection on each primary screening sample subset to be tested according to the text word segmentation sample set, the second similarity algorithm and the similarity threshold, to obtain each target cleaning sample subset, may include: calculating the sample comparison second similarity of each primary screening sample subset to be tested according to the text word segmentation sample set and the second similarity algorithm; and performing similar sample rejection on each primary screening sample subset to be tested according to the sample comparison second similarity of each primary screening sample subset to be tested and the similarity threshold, to obtain each target cleaning sample subset.

The sample comparison second similarity may be a sample-to-sample similarity of a preliminary screening sample subset to be tested, which is determined based on the text word segmentation sample set and the second similarity algorithm.

In the embodiment of the invention, after text segmentation matched with each primary screening sample subset is determined, the similarity between text segmentation corresponding to each primary screening sample subset can be calculated by using a second similarity algorithm, namely, the sample comparison second similarity of each primary screening sample subset is obtained, the sample comparison second similarity of each primary screening sample subset is compared with a similarity threshold value, and similar samples, of which the similarity with the corresponding reference test sample is larger than the similarity threshold value, in each primary screening sample subset are removed, so that each target cleaning sample subset is obtained.

In an optional embodiment of the present invention, calculating the sample comparison second similarity of each primary screening sample subset to be tested according to the text word segmentation sample set and the second similarity algorithm may include: determining the text word segmentation sample subset matching each primary screening sample subset to be tested according to the text word segmentation sample set and each primary screening sample subset to be tested; and calculating, according to the second similarity algorithm, the similarity between the target text word segmentation feature and each text word segmentation feature to be compared in the text word segmentation sample subset matching each primary screening sample subset to be tested, to obtain the sample comparison second similarity of each primary screening sample subset to be tested.

The text word segmentation sample subset may be a feature set formed by the text word segmentation features matching a primary screening sample subset to be tested. The target text word segmentation feature may be the text word segmentation feature of the reference test sample in the primary screening sample subset to be tested. A text word segmentation feature to be compared may be a text word segmentation feature for which the inter-sample similarity with the target text word segmentation feature is calculated. The sample corresponding to the target text word segmentation feature and the sample corresponding to a text word segmentation feature to be compared belong to the same primary screening sample subset to be tested.

According to the embodiment of the invention, the text word segmentation feature corresponding to each sample in each primary screening sample subset to be tested may be determined according to the correspondence between the sample set to be tested and the text word segmentation sample set, so that the text word segmentation sample subset matching each primary screening sample subset to be tested is determined from the text word segmentation features matching the samples in that subset. After the text word segmentation sample subset matching each primary screening sample subset to be tested is obtained, the similarity between the target text word segmentation feature and each text word segmentation feature to be compared in that text word segmentation sample subset is calculated by the second similarity algorithm, to obtain the sample comparison second similarity of each primary screening sample subset to be tested.

In an optional embodiment of the present invention, after each target cleaning sample subset is obtained, the method may further include: when the number of samples in the sample set to be tested is lower than the deduplication lower limit sample amount, acquiring a target supplementary sample set according to the label type of the sample set to be tested; and balancing the target data set according to each sample subset to be tested and the target supplementary sample set.

The deduplication lower limit sample amount may be a lower limit value of a preset sample amount. The target supplemental sample set may be a sample set having the same tag type as the sample set to be inspected.

In the embodiment of the invention, if the number of samples in the sample set to be tested is detected to be lower than the deduplication lower limit sample amount, the sample set to be tested does not contain enough samples. In this case, the target supplementary sample set may be acquired according to the label type of the sample set to be tested and added to each sample subset to be tested to balance the target data set, thereby preventing overfitting after the target data set is cleaned.
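One possible reading of this supplementing step is sketched below. The description only states that a target supplementary sample set with the same label type is acquired and added when the sample count falls below the lower limit; topping the set up exactly to the lower limit is an assumption of this sketch, as are the function name `supplement_if_needed` and the example values.

```python
from typing import List


def supplement_if_needed(
    samples: List[str],
    supplementary: List[str],
    lower_limit: int,
) -> List[str]:
    """Add same-label supplementary samples when too few samples remain.

    Assumption of this sketch: only as many supplementary samples as are
    needed to reach the lower limit are added.
    """
    if len(samples) >= lower_limit:
        return list(samples)
    needed = lower_limit - len(samples)
    return list(samples) + supplementary[:needed]


# Hypothetical values: two samples remain, the lower limit is four.
balanced = supplement_if_needed(["a", "b"], ["c", "d", "e"], lower_limit=4)
```

If fewer supplementary samples are available than needed, the result simply stays below the limit, which a caller may want to detect and report.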

By way of example, assume 5000 documents from the news industry in 4 categories: 2000 real-time news documents, 500 military documents, 500 financial documents and 2000 entertainment documents. When a BERT model is used for classification, the F1 score is only 75%. Data analysis shows that the real-time news and entertainment categories, with 2000 documents each, exceed the military and financial categories by 1500 documents each, which is a serious sample imbalance and causes the model to overfit. An additional 1500 military and financial documents are not easy to acquire in a short time; when 300 military and financial documents are newly added, classification training and evaluation show that the F1 score improves to 80%. With the present scheme, the text similarity check is performed on the entertainment documents and the real-time news documents, and the number of these documents is reduced to 1000 after data cleaning is completed. When the BERT model is then used for classification training and evaluation, the F1 score improves to 85%, a considerable improvement in model performance. It has further been verified that, for sample sets to be tested with fewer than five thousand samples, after the data is cleaned in the manner of this scheme, data can be added in a targeted manner based on the target supplementary sample set, which improves data quality and the training effect of the model; and for sample sets to be tested with more than one hundred thousand samples, cleaning the data reduces the data volume, improves data quality and improves the training efficiency of the model.
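For reference, the F1 score quoted in this example is conventionally defined from precision P and recall R as shown below, where TP, FP and FN denote true positives, false positives and false negatives for a given category. The description does not state how the per-category scores of the four categories were averaged, so the averaging scheme is left open here.

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2\,P\,R}{P + R}
```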

In a specific example, a sample set to be tested is obtained according to the label types annotated in the target data set, and the text portion is vectorized by a RoFormer model (a variant of BERT with certain improvements aimed at text similarity) and stored in a file. The text vector feature of the first sample is taken, and the cosine similarity between it and the text vector feature of the second sample and of every subsequent sample is calculated. If the similarity is greater than the threshold 0.8, the texts are considered highly similar and the highly similar samples are removed from the original file. This completes the first cleaning of the first sample subset to be tested and yields the first primary screening sample subset to be tested: the dissimilar samples (the first primary screening sample subset to be tested at this point) are saved in a file new0.0, and the rejected samples are saved in another file new0.1. Next, the text vector feature of the second sample is taken, and the cosine similarity between it and the text vector feature of the third sample and of every subsequent sample is calculated. If the similarity is greater than the threshold 0.8, the texts are considered highly similar and the highly similar samples are removed from the original file. This completes the first cleaning of the second sample subset to be tested and yields the second primary screening sample subset to be tested: the dissimilar samples (the second primary screening sample subset to be tested at this point) are saved in a file new1.0, and the rejected samples are saved in another file new1.1. By analogy, the first cleaning of all sample subsets to be tested and the storage of the data are completed.
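The first cleaning loop of this example might look roughly like the following sketch. An in-memory dictionary stands in for the files new0.0, new0.1, new1.0 and so on; whether each kept file also contains the reference sample itself, and whether samples rejected in an earlier round are skipped in later rounds, is not spelled out above, so this sketch includes the reference sample and processes every subset independently. The names `cosine` and `first_cleaning` and the example vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence

THRESHOLD = 0.8  # threshold used in the example above


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero norm (assumed convention)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def first_cleaning(ids: List[str], vectors: Dict[str, Sequence[float]]) -> Dict[str, List[str]]:
    """Run the first cleaning for every subset; keys mimic the file names newI.0 / newI.1."""
    files: Dict[str, List[str]] = {}
    for i in range(len(ids) - 1):
        reference = ids[i]
        kept, rejected = [reference], []
        for other in ids[i + 1:]:
            if cosine(vectors[reference], vectors[other]) > THRESHOLD:
                rejected.append(other)  # highly similar to the reference
            else:
                kept.append(other)
        files[f"new{i}.0"] = kept      # primary screening sample subset
        files[f"new{i}.1"] = rejected  # rejected samples
    return files


vectors = {"s1": (1.0, 0.0), "s2": (0.9, 0.1), "s3": (0.0, 1.0)}
files = first_cleaning(["s1", "s2", "s3"], vectors)
```

Keeping the rejected samples alongside the kept ones, as the example does with the `.1` files, makes it possible to audit what the threshold removed.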

Jieba word segmentation is performed on all texts in the sample set to be tested. An edit distance similarity calculation is performed between the text word segmentation feature of the first sample in the first primary screening sample subset to be tested and the text word segmentation features of the second sample and every subsequent sample in that subset. If the similarity is greater than the threshold 0.8, the texts are considered highly similar; the highly similar samples are removed from new0.0 and stored in new0.1. Then, an edit distance similarity calculation is performed between the text word segmentation feature of the first sample in the second primary screening sample subset to be tested and the text word segmentation features of the second sample and every subsequent sample in that subset. If the similarity is greater than the threshold 0.8, the texts are considered highly similar; the highly similar samples are removed from new1.0 and stored in new1.1. By analogy, the cleaning of all primary screening sample subsets to be tested and the storage of the data are completed. The cleaned sample subsets can be used directly for subsequent data analysis or text classification model training, with some improvement in effect compared with before cleaning.
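A minimal sketch of an edit distance similarity over word segments is given below. The description does not specify how the edit distance is turned into a similarity, so normalizing by the length of the longer token sequence is an assumption of this sketch. Whitespace splitting is used as a stand-in for Jieba segmentation so that the example runs with the standard library only; the function names and example sentences are hypothetical.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (unit cost per edit)."""
    previous = list(range(len(b) + 1))
    for i, token_a in enumerate(a, start=1):
        current = [i]
        for j, token_b in enumerate(b, start=1):
            cost = 0 if token_a == token_b else 1
            current.append(min(
                previous[j] + 1,         # delete token_a
                current[j - 1] + 1,      # insert token_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def edit_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """1 - distance / longer length (assumed normalization); 1.0 for two empty inputs."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


# Whitespace split stands in for Jieba word segmentation in this sketch.
tokens_1 = "the stock market rallied strongly today".split()
tokens_2 = "the stock market rallied sharply today".split()
tokens_3 = "rain expected across the north tomorrow".split()

highly_similar = edit_similarity(tokens_1, tokens_2) > 0.8
unrelated = edit_similarity(tokens_1, tokens_3) > 0.8
```

Working on word segments rather than characters means that a single changed word counts as one edit regardless of its length, which complements the vector-based check of the first pass.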

According to the technical scheme, the target data set is obtained, so that the sample set to be tested is screened out from the target data set according to the label type of the target data set, the text in the sample set to be tested is vectorized to obtain the text vector feature set, the similar samples of all sample subsets to be tested in the sample set to be tested are removed according to the first similarity algorithm, the similarity threshold and the text vector feature set to obtain all primary screened sample subsets to be tested, word segmentation is further carried out on the sample set to be tested to obtain the text word segmentation sample set, and the similar samples of all primary screened sample subsets to be tested are removed according to the text word segmentation sample set, the second similarity algorithm and the similarity threshold to obtain all target cleaning sample subsets. According to the method and the device, the recognition and elimination of the similar samples are carried out on the sample set to be tested from the two characteristic dimensions of the text vector and the text word segmentation, so that the similar samples in the sample set to be tested are effectively cleaned, the problem that the cleaning effect of the existing text data set is poor is solved, and the cleaning effect of the text data set can be effectively improved.

Example III

Fig. 3 is a schematic structural diagram of a device for checking text similarity according to a third embodiment of the present invention. As shown in fig. 3, the apparatus includes:

the vectorization processing module 310 is configured to perform vectorization processing on the text in the sample set to be tested to obtain a text vector feature set;

the first similar sample rejection module 320 is configured to reject similar samples from each subset of samples to be tested in the set of samples to be tested according to a first similarity algorithm, a similarity threshold, and a text vector feature set, so as to obtain each subset of samples to be tested through preliminary screening;

the second similar sample rejection module 330 is configured to perform word segmentation on the sample set to be tested to obtain a text word segmentation sample set, and to perform similar sample rejection on each preliminary screened sample subset to be tested according to the text word segmentation sample set, the second similarity algorithm and the similarity threshold, so as to obtain each target cleaning sample subset.

Optionally, the device for checking text similarity further comprises a to-be-tested sample set acquisition module, which is configured to acquire a target data set, and to screen the sample set to be tested from the target data set according to the label type of the target data set.

Optionally, the first similar sample rejection module 320 is configured to calculate, according to the first similarity algorithm and the text vector feature set, the sample comparison first similarity of each sample subset to be tested; and to perform similar sample rejection on each sample subset to be tested in the sample set to be tested according to the sample comparison first similarity of each sample subset to be tested and the similarity threshold, so as to obtain each preliminary screened sample subset to be tested.

Optionally, the second similar sample rejection module 330 is configured to calculate, according to the text word segmentation sample set and the second similarity algorithm, the sample comparison second similarity of each preliminary screened sample subset to be tested; and to perform similar sample rejection on each preliminary screened sample subset to be tested according to the sample comparison second similarity of each preliminary screened sample subset to be tested and the similarity threshold, so as to obtain each target cleaning sample subset.

Optionally, the first similar sample rejection module 320 is configured to determine, according to each of the sample subsets to be tested and the text vector feature set, a text vector feature subset that is matched with each of the sample subsets to be tested; and calculating the similarity between the target text vector feature and the text vector feature to be compared in the text vector feature subsets matched with the sample subsets to be tested according to the first similarity algorithm to obtain the sample comparison first similarity of the sample subsets to be tested.
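The comparison of a target text vector feature with the text vector features to be compared can be illustrated with the sketch below. It assumes cosine similarity as the first similarity algorithm, which this passage does not specify, and it assumes that the features to be compared are the vectors that follow the target within the same subset; `cosine_similarity` and `first_similarities` are hypothetical names.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def first_similarities(vectors: List[Sequence[float]],
                       target_index: int) -> Dict[int, float]:
    """Similarity of the target vector to every later vector in the subset.

    The result maps the index of each compared vector to its similarity
    with the target, ready to be checked against the similarity threshold.
    """
    target = vectors[target_index]
    return {
        j: cosine_similarity(target, vectors[j])
        for j in range(target_index + 1, len(vectors))
    }
```

Any entry whose value exceeds the similarity threshold would mark the corresponding sample as a candidate for rejection from the subset.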

Optionally, the second similar sample rejection module 330 is configured to determine, according to the text word segmentation sample set and each preliminary screened sample subset to be tested, the text word segmentation sample subset matched with each preliminary screened sample subset to be tested; and to calculate, according to the second similarity algorithm, the similarity between the target text word segmentation feature and the text word segmentation features to be compared in the text word segmentation sample subset matched with each preliminary screened sample subset to be tested, so as to obtain the sample comparison second similarity of each preliminary screened sample subset to be tested.

Optionally, the device for checking text similarity further comprises a sample set balancing module, which is configured to acquire a target supplementary sample set according to the label type of the sample set to be tested when the number of samples in the sample set to be tested is lower than the de-duplication lower limit sample number; and to balance the target data set according to each sample subset to be tested and the target supplementary sample set.
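One way the balancing step could work is sketched below. The text does not say how the supplementary samples are merged, so topping a subset back up to the lower limit with non-duplicate supplementary samples of the same label is purely an assumption, and `balance_subset` is a hypothetical name.

```python
from typing import List, Sequence


def balance_subset(cleaned: Sequence[str],
                   supplementary: Sequence[str],
                   lower_limit: int) -> List[str]:
    """Top up a cleaned subset when it falls below the lower limit.

    `supplementary` is assumed to hold extra samples with the same label.
    Samples already present in `cleaned` are skipped so that the top-up
    does not reintroduce exact duplicates.
    """
    result = list(cleaned)
    if len(result) >= lower_limit:
        return result  # enough samples remain; nothing to add
    existing = set(result)
    # dict.fromkeys removes repeats inside the supplementary set, keeping order.
    extras = [s for s in dict.fromkeys(supplementary) if s not in existing]
    needed = lower_limit - len(result)
    return result + extras[:needed]
```

If the supplementary set holds too few new samples, the subset simply stays below the limit; a real system would presumably need to collect more data for that label.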

The device for checking the text similarity provided by the embodiment of the invention can execute the method for checking the text similarity provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.

Example IV

Fig. 4 shows a schematic diagram of the structure of an electronic device that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.

As shown in fig. 4, the electronic device 10 includes at least one processor 11, and a memory, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, etc., communicatively connected to the at least one processor 11, in which the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data required for the operation of the electronic device 10 may also be stored. The processor 11, the ROM 12 and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.

Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.

The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the respective methods and processes described above, for example, the method of checking text similarity.

In some embodiments, the method of verifying text similarity may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the method of verifying text similarity described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the method of verifying the degree of text similarity in any other suitable manner (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.

A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.

The computing system may include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and which overcomes the high management difficulty and weak service scalability of traditional physical hosts and VPS services.

It should be appreciated that steps may be reordered, added, or deleted using the various forms of the flows shown above. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and no limitation is imposed herein.

The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims

1. A method for verifying text similarity comprising:

vectorizing texts in the sample set to be tested to obtain a text vector feature set;

according to a first similarity algorithm, a similarity threshold value and the text vector feature set, carrying out similar sample rejection on each sample subset to be tested in the sample set to be tested to obtain each primary screening sample subset to be tested;

performing word segmentation on the sample set to be tested to obtain a text word segmentation sample set, and removing similar samples from all the primary screening sample subsets to be tested according to the text word segmentation sample set, a second similarity algorithm and the similarity threshold value to obtain all target cleaning sample subsets.

2. The method of claim 1, further comprising, prior to vectorizing text in the set of samples to be tested:

acquiring a target data set;

and screening the sample set to be inspected from the target data set according to the label type of the target data set.

3. The method according to claim 1, wherein the performing, according to a first similarity algorithm, a similarity threshold, and the text vector feature set, similar sample culling on each sample subset to be tested in the sample set to be tested to obtain each preliminary screened sample subset to be tested includes:

according to the first similarity algorithm and the text vector feature set, respectively calculating sample comparison first similarity of each sample subset to be tested;

and performing similar sample rejection on each sample subset to be tested in the sample set to be tested according to the sample comparison first similarity of each sample subset to be tested and the similarity threshold, to obtain each preliminary screened sample subset to be tested.

4. The method according to claim 1, wherein the performing similar sample rejection on each of the preliminary screened sample subsets to be tested according to the text word segmentation sample set, the second similarity algorithm and the similarity threshold to obtain each target cleaning sample subset includes:

according to the text word segmentation sample set and the second similarity algorithm, respectively calculating the sample comparison second similarity of each preliminary screened sample subset to be tested;

and performing similar sample rejection on each preliminary screened sample subset to be tested according to the sample comparison second similarity of each preliminary screened sample subset to be tested and the similarity threshold, to obtain each target cleaning sample subset.

5. The method according to claim 3, wherein the calculating the sample comparison first similarity for each of the sample subsets to be tested according to the first similarity algorithm and the text vector feature set, respectively, comprises:

determining a text vector feature subset matched with each sample subset to be tested according to each sample subset to be tested and the text vector feature set;

and calculating the similarity between the target text vector feature and the text vector feature to be compared in the text vector feature subsets matched with the sample subsets to be tested according to the first similarity algorithm to obtain the sample comparison first similarity of the sample subsets to be tested.

6. The method of claim 4, wherein the calculating the sample comparison second similarity for each of the preliminary screened sample subsets to be tested according to the text word segmentation sample set and the second similarity algorithm, respectively, comprises:

determining the text word segmentation sample subset matched with each preliminary screened sample subset to be tested according to the text word segmentation sample set and each preliminary screened sample subset to be tested;

and calculating the similarity between the target text word segmentation characteristics and the text word segmentation characteristics to be compared in the text word segmentation sample subsets matched with the sample subsets to be inspected through the primary screening according to the second similarity algorithm, and obtaining the sample comparison second similarity of the sample subsets to be inspected through the primary screening.

7. The method of claim 2, further comprising, after said obtaining each target cleaning sample subset:

when the sample number of the sample set to be tested is lower than the de-duplication lower limit sample number, acquiring a target supplementary sample set according to the label type of the sample set to be tested;

and balancing the target data set according to each sample subset to be tested and the target supplementary sample set.

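8. An apparatus for verifying text similarity, comprising:

the vectorization processing module is used for vectorizing texts in the sample set to be tested to obtain a text vector feature set;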

the first similar sample eliminating module is used for eliminating similar samples of all sample subsets to be tested in the sample set to be tested according to a first similarity algorithm, a similarity threshold value and the text vector feature set to obtain all sample subsets to be tested through primary screening;

and the second similar sample removing module is used for performing word segmentation on the sample set to be tested to obtain a text word segmentation sample set, and performing similar sample removing on each of the initially screened sample subsets to be tested according to the text word segmentation sample set, a second similarity algorithm and the similarity threshold value to obtain each target cleaning sample subset.

9. An electronic device, the electronic device comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method of verifying text similarity of any one of claims 1-7.

10. A computer readable storage medium storing computer instructions for causing a processor to perform the method of checking for text similarity of any one of claims 1-7.