CN111539196A - Text duplicate checking method and device, text management system and electronic equipment - Google Patents


Info

Publication number
CN111539196A
Authority
CN
China
Prior art keywords
word
text
determining
checked
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010297125.0A
Other languages
Chinese (zh)
Inventor
孟庆典
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BOE Technology Group Co Ltd
Beijing BOE Optoelectronics Technology Co Ltd
Original Assignee
BOE Technology Group Co Ltd
Beijing BOE Optoelectronics Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BOE Technology Group Co Ltd, Beijing BOE Optoelectronics Technology Co Ltd
Priority to CN202010297125.0A
Publication of CN111539196A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Abstract

The embodiments of the application provide a text duplicate checking method, a text duplicate checking device, a text management system and electronic equipment. The text duplicate checking method comprises: acquiring a plurality of texts to be checked; performing word division on each text to be checked according to a word database, and determining the word frequency of each word in the text to be checked; determining word vectors, and vector weights of the word vectors, in each text to be checked according to the words and the word frequencies corresponding to the words; and determining the repetition rate between any two texts to be checked according to the vector weights. In this method, the text is divided into words by means of the word database, the words that appear in the word database and their word frequencies in the text are determined, the vector weights are calculated from the word vectors obtained from the words and their word frequencies, and the repetition rate of any two texts is then determined by comparing the vector weights, so that any two texts can be compared efficiently and accurately.

Description

Text duplicate checking method and device, text management system and electronic equipment
Technical Field
The application relates to the technical field of computers, in particular to a text duplicate checking method and device, a text management system and electronic equipment.
Background
Bidding is a relatively frequent activity in enterprise management. To select the partner that best serves the interests of all parties, the tenderer must read and examine the bidding documents submitted by the bidders and weigh up the optimal partner. However, one tenderer often deals with many bidders, many bidding documents are received, and each document is long, so it is difficult to find a suitable bidder in a short time. Reading and screening bidding documents is tedious work that consumes a large amount of manpower and material resources. Moreover, because the bidding documents address the same project, their contents are often similar; comparing the similarity of the documents manually and selecting a business partner on that basis is time-consuming and labor-intensive. Text duplicate checking technology is therefore commonly used.
In the prior art, word segmentation and semantic analysis are generally adopted, and the longest string matching principle is used to calculate the character repetition rate of a document. However, semantic analysis needs higher-performance hardware to support semantic recognition of the bidding documents and often occupies excessive computing resources; the technology is also relatively closed, so its range of application and popularization is narrow. Methods based on the longest string matching principle often omit many keywords, so the accuracy of document duplicate checking is not high enough.
Disclosure of Invention
The application provides a text duplicate checking method, a text duplicate checking device, a text management system and electronic equipment aiming at the defects of the prior art, and aims to solve the technical problem that the text duplicate checking efficiency or accuracy is not high in the prior art.
In a first aspect, an embodiment of the present application provides a text duplicate checking method, including:
acquiring a plurality of parts of duplicate texts to be checked;
performing word division on each duplicate text to be checked according to a word database, and determining the word frequency of each word in the duplicate text to be checked;
determining word vectors and vector weights of the word vectors in each text to be checked according to the words and the word frequencies corresponding to the words;
and determining the repetition rate between any two texts to be checked according to the vector weight.
In some implementations of the first aspect, performing word division on each duplicate text to be checked according to a word database, and determining a word frequency of each word in the duplicate text to be checked includes:
dividing the text to be checked into a plurality of words according to a preset semantic strategy;
if the words exist in the word database, determining the words as first-class words;
if the word does not exist in the word database, determining the word as a second class word;
and determining the word frequency of each first-class word in the text to be checked.
With reference to the first aspect and the foregoing implementation manners, in some implementation manners of the first aspect, determining a word frequency of each word in the text to be checked includes:
determining the total number of words of the text to be checked and the repetition times of each word;
and determining the word frequency of each word in the text to be checked according to the total number of the words and the repetition times of each word.
With reference to the first aspect and the foregoing implementation manners, in some implementation manners of the first aspect, determining a word vector and a vector weight of the word vector in each text to be checked according to a word and a word frequency corresponding to the word includes:
constructing a text vector space according to all words of the text to be checked;
determining a word vector of each word according to each word of the text to be checked and the corresponding word frequency of each word;
determining the weight of the word vector in the text vector space according to the text vector space and the word vector of each word.
With reference to the first aspect and the foregoing implementation manners, in some implementation manners of the first aspect, determining similarity between any two texts to be checked according to vector weights includes:
determining all the same words in the two texts to be checked;
determining the coincidence rate of the same words in the two documents to be checked according to the vector weights of the same words in the two documents to be checked;
and determining the repetition rate of any two texts to be checked according to the coincidence rate of all the same words.
In a second aspect, the present application provides a text duplication checking apparatus, including:
the acquisition module is used for acquiring a plurality of parts of duplicate texts to be checked;
the word segmentation module is used for performing word division on each text to be checked according to the word database and determining the word frequency of each word in the text to be checked;
the vector module is used for determining the word vectors and the vector weights of the word vectors in each text to be checked according to the words and the word frequencies corresponding to the words;
and the duplication checking module is used for determining the repetition rate between any two texts to be checked according to the vector weights.
In a third aspect, the present application provides a text management method, including:
acquiring identity authentication information and determining an identity authentication result;
processing a text corresponding to the identity authentication information according to the identity authentication result;
checking the text for duplicates according to the text duplicate checking method described in the first aspect of the present application.
In a fourth aspect, the present application provides a text management system, comprising:
the account management device is used for acquiring the authentication information and determining the authentication result;
the file transmission device processes the text corresponding to the identity authentication information according to the identity authentication result;
the text duplication checking device is used for checking the duplication of the text according to the text duplication checking method described in the first aspect of the application.
In a fifth aspect, the present application provides an electronic device, comprising:
a processor;
a memory electrically connected to the processor;
at least one program stored in the memory and configured to be executed by the processor, the at least one program being configured to perform the text duplicate checking method described in the first aspect of the present application.
In a sixth aspect, the present application provides a computer readable storage medium for storing computer instructions which, when executed on a computer, implement a method for text duplication checking as described above in the first aspect of the present application.
The technical scheme provided by the embodiment of the application has the following beneficial technical effects:
according to the text duplication checking method, text words are divided through the word database, words appearing in the word database and word frequency of the words in the text are determined, vector weights of the words are calculated according to the words and word vectors obtained according to the corresponding word frequencies, and similarity between two texts is determined through comparison of the vector weights.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic flowchart of a text management method according to an embodiment of the present application;
fig. 2 is a schematic structural framework diagram of a text management system according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a text duplicate checking method according to an embodiment of the present application;
fig. 4 is a schematic flow chart of a method for performing word division on each duplicate text to be checked according to a word database and determining a word frequency of each word in the duplicate text to be checked according to an embodiment of the present application;
fig. 5 is a schematic flowchart of a method for determining word vectors and vector weights of the word vectors in each text to be checked according to words and word frequencies corresponding to the words according to an embodiment of the present application;
fig. 6 is a schematic flowchart of a method for determining a repetition rate of every two texts to be checked according to a vector weight according to an embodiment of the present application;
fig. 7 is a schematic structural framework diagram of a text duplicate checking device according to an embodiment of the present application;
fig. 8 is a schematic structural framework diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the present application, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar parts or parts having the same or similar functions throughout. In addition, if a detailed description of the known art is not necessary for illustrating the features of the present application, it is omitted. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It is to be understood that the term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
In the business activities of an enterprise, the enterprise often needs to cooperate with other enterprises on a given piece of work, so bidding management is one of its necessary activities. To quickly determine the repetition rate or similarity of numerous bidding documents, the text in the bidding documents needs to be analyzed. However, the duplicate checking methods provided in the prior art either require high-performance hardware to reach sufficient accuracy, so that duplicate checking efficiency is low, or are not accurate enough for their results to serve as a practical reference.
In order to solve the problems, the application provides a text duplicate checking method, a text duplicate checking device, a document management system and electronic equipment.
First, the embodiment of the present application provides a text management method from a general perspective, the text management method can be used for managing bidding documents or texts, a flow diagram of the text management method is shown in fig. 1, and the method at least includes the following steps:
S100: acquiring the identity authentication information and determining the identity authentication result.
S200: processing the text corresponding to the identity authentication information according to the identity authentication result.
S300: checking the text for duplicates according to the text duplicate checking method described in the application.
The method for text duplication checking described in the present application will be described in detail later.
Based on the same inventive concept, embodiments of the present application correspondingly provide a text management system, which is used to implement the text management method described above, and specifically, as shown in fig. 2, the text management system at least includes the following devices: the device comprises an account management device, a file transmission device and a text duplication checking device. The account management device is used for acquiring the identity authentication information and determining the identity authentication result. And the file transmission device is used for processing the text corresponding to the identity authentication information according to the identity authentication result. The text duplicate checking device is used for checking the duplicate of the text according to the text duplicate checking method described in the application.
Specifically, the account management device is mainly responsible for implementing account management functions of the text management system provided by the embodiment of the present application, where the account management functions include account registration, account management, account login, and the like. For example, the account registration means that bidder entry information is collected, and a bidder account is generated after the examination is passed. The account management means that information of the successfully registered bidder accounts is recorded into a database for management. The account login means that whether the current login account is a registered account is verified, and a bidder operation interface can be accessed after the verification is successful. Of course, the account management device can also distinguish the bidders and the tenderers corresponding to different system authorities.
Accordingly, the file transmission device is responsible for realizing the file or text transmission functions of the system, including text uploading and text downloading functions. The text upload means permits the bidder to upload text to a designated directory, and the text download means permits the tenderer or bidder to download the uploaded document. That is, a bidder who has successfully logged in the text management system through the account management device can upload his or her own bidding document or download the bidding document and other files corresponding to the right of the bidder.
The text duplication checking device is responsible for the duplicate checking function: it checks the relevant texts in the text management system, for example all texts uploaded by the bidders in a given bidding activity, and gives the result in the form of the similarity of any two texts. The detailed duplicate checking method is described later.
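The "similarity of any two texts" output can be pictured with a minimal Python sketch. The names `pairwise_report` and `word_overlap` and the sample bids are hypothetical and do not come from the application. `word_overlap` is only a simple stand-in score (shared distinct words over all distinct words), not the vector-weight method the application describes later.

```python
from itertools import combinations


def pairwise_report(texts, similarity):
    """Return {(name_a, name_b): score} for every unordered pair of texts."""
    return {
        (a, b): similarity(texts[a], texts[b])
        for a, b in combinations(sorted(texts), 2)
    }


def word_overlap(a, b):
    """Stand-in similarity (assumption): shared distinct words / all distinct words."""
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


# Hypothetical texts uploaded in one bidding activity.
texts = {
    "bid_a": "supply install maintain",
    "bid_b": "supply install train",
    "bid_c": "design build",
}
report = pairwise_report(texts, word_overlap)
```

For n uploaded texts the report holds n(n-1)/2 entries, one per unordered pair, so the similarity function is the only part that would need to change to use a different scoring method.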
In addition, the text management system provided by the embodiment of the application further comprises a text reading device, the text reading device is a document reader utilizing a local text editing tool, such as a Word program, and the operation of reading and editing the text is processed by starting a silent Word process, and the system can process the files in the formats of doc and docx. The text reading device can decode the text by means of the existing Word program of the device, thereby reducing the program load.
Through the overall design of the text management system, the design of each entity and the attribute thereof is completed, the attribute of the enterprise bidding database is determined, and the related work of constructing the bidding database can be guided and completed. The text management system provided by the embodiment of the application mainly comprises three main function devices, the modular design structure is presented, the whole text management system is clear in construction thought, clear and concise in structure, management of tendering and bidding activities can be conveniently carried out, and text analysis work of the tendering and bidding activities is guaranteed to be efficiently carried out.
The following describes in detail the technical scheme of text duplication checking according to the present application and how to solve the problem of efficient text duplication checking according to specific embodiments.
An embodiment of an aspect of the present application provides a method for text duplicate checking, as shown in fig. 3, the method for text duplicate checking at least includes the following steps:
S310: acquiring a plurality of texts to be checked.
S320: performing word division on each text to be checked according to a word database, and determining the word frequency of each word in the text to be checked.
S330: determining the word vectors and the vector weights of the word vectors in each text to be checked according to the words and the word frequencies corresponding to the words.
S340: determining the repetition rate between any two texts to be checked according to the vector weights.
According to the text duplication checking method, text words are divided through the word database, words appearing in the word database and word frequency of the words in the text are determined, vector weights of the words are calculated according to the words and word vectors obtained according to the corresponding word frequencies, and then the duplication rate between the two texts is determined through comparison of the vector weights.
In a feasible implementation of the foregoing embodiment of the present application, step S320, performing word division on each text to be checked according to a word database and determining the word frequency of each word in the text to be checked, specifically includes the following steps, as shown in fig. 4:
S311: dividing the text to be checked into a plurality of words according to a preset semantic strategy.
S312: if a word exists in the word database, determining the word as a first-class word; if the word does not exist in the word database, determining the word as a second-class word.
S313: determining the word frequency of each first-class word in the text to be checked.
Common word segmentation technologies include character-based full-text indexing and word-based full-text indexing. Word-based full-text indexing greatly improves the accuracy of word segmentation and extracts keywords as index items, so the accuracy of duplicate checking is greatly improved. However, owing to the particularity of the Chinese language, a single character may be a word, and a word may also consist of several characters, for example three or more. The preset semantic strategy therefore scans and segments the text according to a determined semantic division rule: some preset semantic strategies treat a single character as a word, some treat two characters as a word, and so on.
Because the bidding document is a text file generated aiming at a certain project main body, the text file has a large number of normalized characters, and each text has a large number of repeated words in the aspect of professional sentences of the purchasing scheme or the engineering construction scheme. Therefore, the text duplicate checking method provided by the application adopts the word database to correct the words divided by the preset semantic strategy, improves the word dividing efficiency through the method, and further improves the text duplicate checking efficiency and the accuracy. The term database is an industry dictionary, and comprises enough keywords counted by enterprises in long-term bidding work, and can be continuously updated and iterated, and can also be updated and iterated in a manual editing mode.
When S320 is executed, after the text to be checked has been divided into words according to a given preset semantic strategy, it exists in the form of numerous scattered words. Many of these words are identical, and some are not needed in the bidding activity. For example, the content "me si" may be divided by the preset semantic strategy into two independent words, "me" and "si"; if neither exists in the word database, both are classified as second-class words. In that case, the system running the text duplicate checking method changes the preset semantic strategy and treats "me si" as a single word; if the word "me si" exists in the word database, it is a first-class word, and its frequency of occurrence in the whole text, namely the word frequency, is counted.
Different text words and word frequencies corresponding to the words can be obtained through word division of different preset semantic strategies. In order to improve the word division efficiency, the semantic strategy types of word division in the word database to be used can be consulted and counted, and the semantic strategy with a large proportion is used as the preset semantic strategy in a targeted manner, so that repeated probing of the preset semantic strategy is avoided.
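The word division and classification described above can be sketched roughly as follows, in Python. This is an illustration under stated assumptions, not the application's implementation: the function names are hypothetical, the "preset semantic strategy" is modelled as cutting the text into consecutive chunks of n characters, and the strategy is chosen as the most common word length in the word database.

```python
from collections import Counter


def choose_strategy(word_db):
    """Pick the word length that accounts for the largest share of the word database."""
    lengths = Counter(len(word) for word in word_db)
    return lengths.most_common(1)[0][0]


def divide(text, n):
    """Assumed strategy: cut the text into consecutive chunks of n characters."""
    return [text[i:i + n] for i in range(0, len(text), n)]


def classify(words, word_db):
    """Split words into first-class (in the database) and second-class (not in it)."""
    first = [w for w in words if w in word_db]
    second = [w for w in words if w not in word_db]
    return first, second


# Hypothetical word database and text, using Latin letters as stand-in characters.
word_db = {"ab", "cd", "xyz"}
n = choose_strategy(word_db)
words = divide("abcdef", n)
first_class, second_class = classify(words, word_db)
```

Here two of the three database entries have two characters, so the two-character strategy is tried first; only the first-class words would go on to the word frequency step.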
In a feasible implementation of the foregoing embodiment, determining the word frequency of each word in the text to be checked in S320 specifically includes: determining the total number of words of the text to be checked and the number of repetitions of each word; and determining the word frequency of each word in the text to be checked according to the total number of words and the number of repetitions of each word.
After the text to be checked has been divided into a data file consisting of many words, the total number of words it contains can be counted. Many of these words are repeated, so each word has a repetition count. To count the occurrence of words more objectively and make texts comparable, a relative word frequency can be obtained as the ratio of a word's repetition count to the total number of words; this avoids large differences in the duplicate checking results caused by bidding documents of different lengths.
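A minimal Python sketch of the relative word frequency, assuming the text has already been divided into a list of words; the function name and sample words are hypothetical.

```python
from collections import Counter


def word_frequencies(words):
    """Relative frequency of each word: repetition count / total number of words."""
    total = len(words)
    if total == 0:
        return {}
    counts = Counter(words)
    return {word: count / total for word, count in counts.items()}


# Hypothetical divided text.
frequencies = word_frequencies(["bid", "price", "bid", "term"])
```

Because each count is divided by the total, a short and a long bidding document that use a word in the same proportion receive the same frequency for it.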
In a feasible implementation manner of the foregoing embodiment of the present application, S330: determining word vectors and vector weights of the word vectors in each text to be checked according to the words and the word frequencies corresponding to the words, as shown in fig. 5, the method specifically comprises the following steps:
S331: constructing a text vector space according to all words of the text to be checked.
S332: determining a word vector of each word according to each word of the text to be checked and the corresponding word frequency of each word.
S333: determining the weight of the word vector in the text vector space according to the text vector space and the word vector of each word.
To obtain the repetition rate of different texts statistically, the words and their corresponding word frequencies need to be processed. Specifically, an SVM (Support Vector Machine) algorithm can be adopted to vectorize all words of the divided text to be checked. The SVM algorithm converts the text to be checked into a specific vector space, in which each vector consists of a word and the word frequency corresponding to that word. Steps S331 and S332 need not be performed in any particular order. The weight of each vector in the vector space is then calculated by a weight calculation method; a commonly used one is the TF-IDF (Term Frequency-Inverse Document Frequency) function.
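The weighting step can be illustrated with a short Python sketch. It assumes one common TF-IDF variant, weight = tf × ln(N / N_i), where N is the number of texts and N_i is the number of texts containing the word. The exact formulas used by the application are not reproduced in this text, so this variant, the function name and the sample data are assumptions. Each text is represented as a dictionary mapping a word to its relative word frequency; the vectorization algorithm named above is not reproduced.

```python
import math


def tfidf_weights(tf_by_text):
    """Weight every word in every text as tf * ln(N / N_i) (assumed variant)."""
    n_texts = len(tf_by_text)
    # N_i: number of texts in which each word appears.
    doc_count = {}
    for tf in tf_by_text:
        for word in tf:
            doc_count[word] = doc_count.get(word, 0) + 1
    return [
        {word: freq * math.log(n_texts / doc_count[word]) for word, freq in tf.items()}
        for tf in tf_by_text
    ]


# Hypothetical word frequencies for two texts.
tf_by_text = [{"bid": 0.5, "price": 0.5}, {"bid": 1.0}]
weights = tfidf_weights(tf_by_text)
```

Under this variant a word that appears in every text gets weight zero, which is why smoothed variants are often preferred for small collections.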
In a feasible implementation manner of the foregoing embodiment of the present application, S340: determining the repetition rate of any two texts to be checked according to the vector weight, as shown in fig. 6, the step specifically includes:
S341: determining all the same words in the two texts to be checked.
S342: determining the coincidence rate of the same word in the two texts to be checked according to the vector weight of the same word in the two texts to be checked.
S343: determining the repetition rate of any two texts to be checked according to the coincidence rate of all the same words.
Through the method steps, after the system determines the weight of each word vector in each text vector space, the system can specifically calculate the similarity between a certain text and each text except the text according to a cosine algorithm. The detailed calculation process may be:
First, the divided text is characterized by an SVM model as a feature vector $V(d) = (t_1, \omega_1(d); \ldots; t_s, \omega_s(d))$, where $t_i$ ($i = 1, 2, \ldots, s$) is a list of mutually different entries, $s$ is a positive integer, and $\omega_i(d)$ is the weight of $t_i$ in the text vector space $d$, generally defined as a function of the frequency of occurrence $tf_i(d)$ of $t_i$ in the text vector space $d$: $\omega_i(d) = \psi(tf_i(d))$. Specifically, $\omega_i(d)$ is calculated by formula (1) and formula (2). The similarity of the two texts to be checked, that is, their repetition degree or repetition rate, is then obtained through the cosine similarity formula (3).
[Formula (1): equation image BDA0002452608640000101, not reproduced in this text]
[Formula (2): equation image BDA0002452608640000102, not reproduced in this text]
In formula (1) and formula (2), $N$ is the number of all texts and $N_i$ is the number of texts containing the entry $t_i$.
[Formula (3), cosine similarity: equation image BDA0002452608640000103, not reproduced in this text]
In the cosine similarity formula (3), $d_i$ represents the text vector space corresponding to the first text, $d_j$ represents the text vector space corresponding to the second text, and $m$ is the number of entries in the text vector space.
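For orientation, the textbook cosine similarity of two weight vectors is $\mathrm{Sim}(d_i, d_j) = \sum_{k=1}^{m} \omega_k(d_i)\,\omega_k(d_j) \big/ \big(\sqrt{\sum_{k=1}^{m} \omega_k(d_i)^2}\,\sqrt{\sum_{k=1}^{m} \omega_k(d_j)^2}\big)$. Formula (3) itself is not reproduced in this text and may be written differently. The Python sketch below applies this standard form to sparse weight dictionaries; the function name and sample weights are hypothetical.

```python
import math


def cosine_similarity(weights_a, weights_b):
    """Standard cosine similarity of two sparse {word: weight} vectors."""
    shared = set(weights_a) & set(weights_b)  # only shared words contribute to the dot product
    dot = sum(weights_a[w] * weights_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in weights_a.values()))
    norm_b = math.sqrt(sum(v * v for v in weights_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction; report no similarity
    return dot / (norm_a * norm_b)


# Hypothetical weight vectors for two texts.
score = cosine_similarity({"bid": 3.0, "price": 4.0}, {"bid": 3.0})
```

Only words present in both texts add to the numerator, which matches the step of first determining the same words in the two texts; words unique to one text still lower the score through the norms.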
Based on the same inventive concept, another embodiment of the present application provides a text duplication checking apparatus 10, as shown in fig. 7, the text duplication checking apparatus 10 specifically includes an obtaining module 11, a word segmentation module 12, a vector module 13, and a duplication checking module 14.
The obtaining module 11 is configured to obtain a plurality of texts to be checked.
The word segmentation module 12 is configured to perform word segmentation on each text to be checked according to the word database, and to determine the word frequency of each word in the text to be checked.
The vector module 13 is configured to determine the word vectors and the vector weights of the word vectors in each text to be checked according to the words and the word frequencies corresponding to the words.
The duplication checking module 14 is configured to determine the repetition rate of any two texts to be checked according to the vector weights.
The text duplication checking device provided by the present application segments a text into words using the word database, determines each word that appears in the word database and its word frequency in the text, obtains a word vector from each word and its word frequency, and calculates the vector weight of that word vector. It then determines the repetition rate between two texts by comparing the vector weights, so that the repetition rates of any two texts can be compared efficiently and more accurately.
Optionally, the word segmentation module 12 performing word segmentation on each text to be checked according to the word database and determining the word frequency of each word in the text to be checked includes: dividing the text to be checked into a plurality of words according to a preset semantic strategy; if a word exists in the word database, determining the word as a first-class word; if a word does not exist in the word database, determining the word as a second-class word; and determining the word frequency of each first-class word in the text to be checked.
Optionally, the word segmentation module 12 determining the word frequency of each word in the text to be checked includes: determining the total number of words of the text to be checked and the number of repetitions of each word; and determining the word frequency of each word in the text to be checked according to the total number of words and the number of repetitions of each word.
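The classification and word-frequency steps described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the text has already been segmented into a list of words, and the names `classify_words`, `word_frequencies`, `word_database`, and `segmented` are hypothetical. It also assumes that word frequency is the number of repetitions of a word divided by the total number of words in the text, which is one plausible reading of the description.

```python
from collections import Counter


def classify_words(words, word_database):
    """Split segmented words into first-class words (present in the word
    database) and second-class words (absent from it)."""
    first_class = [w for w in words if w in word_database]
    second_class = [w for w in words if w not in word_database]
    return first_class, second_class


def word_frequencies(first_class, total_words):
    """Assumed definition: repetitions of a word / total words in the text."""
    if total_words == 0:
        return {}
    counts = Counter(first_class)
    return {word: count / total_words for word, count in counts.items()}


# Hypothetical word database and already-segmented text.
word_database = {"text", "vector", "weight"}
segmented = ["text", "vector", "text", "foo", "weight", "text"]

first_class, second_class = classify_words(segmented, word_database)
freqs = word_frequencies(first_class, total_words=len(segmented))
```

Only first-class words receive a frequency here; second-class words are set aside, which mirrors the description's statement that the word frequency is determined for each first-class word.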
Optionally, the vector module 13 determining the word vectors and the vector weights of the word vectors in each text to be checked according to the words and the word frequencies corresponding to the words includes: constructing a text vector space according to all words of the text to be checked; determining a word vector of each word according to each word of the text to be checked and the word frequency corresponding to each word; and determining the weight of the word vector in the text vector space according to the text vector space and the word vector of each word.
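One way to picture the weighting step is the sketch below, which turns per-text word frequencies into per-text weight vectors. It is an illustration under stated assumptions: the weight is taken to be the word frequency multiplied by a smoothed inverse-document-frequency factor `log(N / n) + 1`, where `N` is the number of texts and `n` the number of texts containing the word. The `+ 1` smoothing and the function names `document_counts` and `weight_vectors` are assumptions introduced here, not taken from the source formulas.

```python
import math


def document_counts(term_freqs):
    """For each word, count how many texts contain it."""
    counts = {}
    for tf in term_freqs:
        for word in tf:
            counts[word] = counts.get(word, 0) + 1
    return counts


def weight_vectors(term_freqs):
    """Map each text's word frequencies to weights in its text vector space.

    Assumed weighting: frequency * (log(N / n) + 1). The +1 keeps words that
    occur in every text from being weighted to zero.
    """
    n_texts = len(term_freqs)
    n_containing = document_counts(term_freqs)
    vectors = []
    for tf in term_freqs:
        vectors.append({
            word: freq * (math.log(n_texts / n_containing[word]) + 1.0)
            for word, freq in tf.items()
        })
    return vectors


# Hypothetical word frequencies for two texts to be checked.
term_freqs = [
    {"text": 0.5, "vector": 0.25, "weight": 0.25},
    {"text": 0.5, "cosine": 0.5},
]
vectors = weight_vectors(term_freqs)
```

Each text is represented as a sparse mapping from word to weight, so the dimensions of a text vector space are simply the distinct words of that text.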
Optionally, the duplication checking module 14 determining the repetition rate of any two texts to be checked according to the vector weights includes: determining all the same words in the two texts to be checked; determining the coincidence rate of the same words in the two texts to be checked according to the vector weights of the same words in the two texts to be checked; and determining the repetition rate of the two texts to be checked according to the coincidence rates of all the same words.
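The three sub-steps above can be sketched in a few lines. The source does not define "coincidence rate" precisely, so this sketch assumes it is the product of a shared word's weights in the two texts, and that the repetition rate is the sum of those products normalized by the lengths of the two weight vectors, which is ordinary cosine similarity. The function names and the sample weights are hypothetical.

```python
import math


def shared_words(weights_a, weights_b):
    """All words that occur in both texts."""
    return sorted(set(weights_a) & set(weights_b))


def coincidence_rates(weights_a, weights_b):
    """Assumed per-word coincidence: product of the word's two weights."""
    return {w: weights_a[w] * weights_b[w]
            for w in shared_words(weights_a, weights_b)}


def repetition_rate(weights_a, weights_b):
    """Cosine similarity of the two weight vectors."""
    norm_a = math.sqrt(sum(v * v for v in weights_a.values()))
    norm_b = math.sqrt(sum(v * v for v in weights_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares nothing with any other text
    dot = sum(coincidence_rates(weights_a, weights_b).values())
    return dot / (norm_a * norm_b)


# Hypothetical weight vectors for two texts to be checked.
text_a = {"text": 1.0, "vector": 1.0}
text_b = {"text": 1.0, "cosine": 1.0}
rate = repetition_rate(text_a, text_b)
```

Words that appear in only one text contribute nothing to the numerator, so restricting the sum to the shared words gives the same value as a cosine computed over the full vocabulary while touching fewer entries.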
Based on the same inventive concept, an embodiment of the present application provides an electronic device, including: a memory and a processor.
The memory is electrically connected with the processor.
At least one computer program is stored in the memory and is configured to, when executed by the processor, implement any one of the text duplicate checking methods provided by the embodiments of the present application, including their various alternative implementations.
Those skilled in the art will appreciate that the electronic devices provided by the embodiments of the present application may be specially designed and manufactured for the required purposes, or may comprise known devices in general-purpose computers. These devices have stored therein computer programs that are selectively activated or reconfigured. Such a computer program may be stored in a device (e.g., computer) readable medium or in any type of medium suitable for storing electronic instructions and respectively coupled to a bus.
Compared with the prior art, the electronic equipment provided by the application can efficiently and accurately compare the repetition rates of any two articles.
In an alternative embodiment, the present application provides an electronic device. As shown in fig. 8, the electronic device 1000 comprises a processor 1001 and a memory 1003. The processor 1001 and the memory 1003 are electrically coupled, for example through a bus 1002.
The processor 1001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application-Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various illustrative logical blocks, modules, and circuits described in connection with the disclosure of the present application. The processor 1001 may also be a combination that implements computing functions, for example, a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The bus 1002 may include a path that transfers information between the above components. The bus 1002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 1002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 8, but this does not mean that there is only one bus or only one type of bus.
The memory 1003 may be a ROM (Read-Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read-Only Memory), a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
Optionally, the electronic device 1000 may also include a transceiver 1004. The transceiver 1004 may be used for reception and transmission of signals. The transceiver 1004 may allow the electronic device 1000 to communicate with other devices, wirelessly or by wire, to exchange data. It should be noted that, in practical applications, the number of transceivers 1004 is not limited to one.
Optionally, the electronic device 1000 may further include an input unit 1005. The input unit 1005 may be used to receive input numeric, character, image, and/or sound information, or to generate key signal inputs related to user settings and function control of the electronic apparatus 1000. The input unit 1005 may include, but is not limited to, one or more of a touch screen, a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, a camera, a microphone, and the like.
Optionally, the electronic device 1000 may further include an output unit 1006. Output unit 1006 may be used to output or show information processed by processor 1001. The output unit 1006 may include, but is not limited to, one or more of a display device, a speaker, a vibration device, and the like.
While fig. 8 illustrates an electronic device 1000 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
Optionally, the memory 1003 is used for storing application program codes for executing the scheme of the present application, and the processor 1001 controls the execution. The processor 1001 is configured to execute the application program codes stored in the memory 1003 to implement any one of the methods for text duplication checking provided by the embodiments of the present application.
Based on the same inventive concept, the present application provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements any one of the text duplicate checking methods provided by the present application, including their various alternative implementations.
Embodiments of the present application provide a computer-readable storage medium that enables efficient and more accurate comparison of repetition rates of any two articles by a computer program stored therein, as compared to the prior art.
Those of skill in the art will appreciate that the various operations, methods, steps in the processes, acts, or solutions discussed in this application can be interchanged, modified, combined, or eliminated. Further, other steps, measures, or schemes in various operations, methods, or flows that have been discussed in this application can be alternated, altered, rearranged, broken down, combined, or deleted. Further, steps, measures, schemes in the prior art having various operations, methods, procedures disclosed in the present application may also be alternated, modified, rearranged, decomposed, combined, or deleted.
The terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "a plurality" means two or more unless otherwise specified.
In the description of the present application, it is to be noted that, unless otherwise explicitly specified or limited, the terms "mounted" and "connected" are to be construed broadly. For example, a connection may be a fixed connection, a removable connection, or an integral connection; it may be a direct connection or an indirect connection through an intermediate medium; or it may be an internal communication between two elements. The specific meaning of the above terms in the present application can be understood according to the specific case by those of ordinary skill in the art.
In the description herein, particular features, structures, materials, or characteristics may be combined in any suitable manner in any one or more embodiments or examples.
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may include multiple sub-steps or stages. These sub-steps or stages are not necessarily performed at the same time and may be performed at different times; nor are they necessarily performed in sequence, as they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present application. It should be noted that those skilled in the art can make several improvements and refinements without departing from the principle of the present application, and these improvements and refinements should also be regarded as falling within the protection scope of the present application.

Claims (10)

1. A method for text duplicate checking, comprising:
acquiring a plurality of texts to be checked;
performing word segmentation on each text to be checked according to a word database, and determining the word frequency of each word in the text to be checked;
determining word vectors and vector weights of the word vectors in each text to be checked according to the words and the word frequencies corresponding to the words; and
determining the repetition rate between any two texts to be checked according to the vector weights.
2. The method for text duplicate checking according to claim 1, wherein performing word segmentation on each text to be checked according to the word database and determining the word frequency of each word in the text to be checked comprises:
dividing the text to be checked into a plurality of words according to a preset semantic strategy;
if a word exists in the word database, determining that the word is a first-class word;
if a word does not exist in the word database, determining that the word is a second-class word; and
determining the word frequency of each first-class word in the text to be checked.
3. The method for text duplicate checking according to claim 1, wherein determining the word frequency of each word in the text to be checked comprises:
determining the total number of words of the text to be checked and the number of repetitions of each word; and
determining the word frequency of each word in the text to be checked according to the total number of words and the number of repetitions of each word.
4. The method for text duplicate checking according to claim 1, wherein the determining a word vector and a vector weight of the word vector in each text to be checked according to the word and the word frequency corresponding to the word comprises:
constructing a text vector space according to all words of the text to be checked;
determining a word vector of each word according to each word of the text to be checked and the word frequency corresponding to each word;
determining a weight of the word vector in the text vector space based on the text vector space and the word vector for each of the words.
5. The method for text duplicate checking according to claim 1, wherein determining the repetition rate of any two texts to be checked according to the vector weights comprises:
determining all the same words in the two texts to be checked;
determining the coincidence rate of the same words in the two texts to be checked according to the vector weights of the same words in the two texts to be checked; and
determining the repetition rate of the two texts to be checked according to the coincidence rates of all the same words.
6. A text duplication checking apparatus, comprising:
an acquisition module, configured to acquire a plurality of texts to be checked;
a word segmentation module, configured to perform word segmentation on each text to be checked according to a word database and to determine the word frequency of each word in the text to be checked;
a vector module, configured to determine word vectors and vector weights of the word vectors in each text to be checked according to the words and the word frequencies corresponding to the words; and
a duplication checking module, configured to determine the repetition rate of any two texts to be checked according to the vector weights.
7. A text management method, comprising:
acquiring identity authentication information and determining an identity authentication result;
processing a text corresponding to the identity authentication information according to the identity authentication result;
performing duplicate checking on the text according to the method for text duplicate checking of any one of claims 1-5.
8. A text management system, comprising:
the account management device is used for acquiring the authentication information and determining the authentication result;
the file transmission device is used for processing the text corresponding to the identity authentication information according to the identity authentication result;
the text duplicate checking device is used for checking the duplicate of the text according to the text duplicate checking method of any one of claims 1-5.
9. An electronic device, comprising:
a processor;
a memory electrically connected with the processor;
at least one program stored in the memory and configured to be executed by the processor, the at least one program being configured to implement the method for text duplicate checking according to any one of claims 1 to 5.
10. A computer-readable storage medium storing computer instructions which, when run on a computer, implement the method for text duplicate checking according to any one of claims 1 to 5.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010297125.0A CN111539196A (en) 2020-04-15 2020-04-15 Text duplicate checking method and device, text management system and electronic equipment

Publications (1)

Publication Number Publication Date
CN111539196A true CN111539196A (en) 2020-08-14

Family

ID=71978613

Country Status (1)

Country Link
CN (1) CN111539196A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102411583A (en) * 2010-09-20 2012-04-11 阿里巴巴集团控股有限公司 Method and device for matching texts
CN108536677A (en) * 2018-04-09 2018-09-14 北京信息科技大学 A kind of patent text similarity calculating method
CN108628825A (en) * 2018-04-10 2018-10-09 平安科技(深圳)有限公司 Text message Similarity Match Method, device, computer equipment and storage medium
CN109885813A (en) * 2019-02-18 2019-06-14 武汉瓯越网视有限公司 A kind of operation method, system, server and the storage medium of the text similarity based on word coverage
CN110390084A (en) * 2019-06-19 2019-10-29 平安国际智慧城市科技股份有限公司 Text duplicate checking method, apparatus, equipment and storage medium
CN110858217A (en) * 2018-08-23 2020-03-03 北大方正集团有限公司 Method and device for detecting microblog sensitive topics and readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
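王冲 (Wang Chong) et al.: "现代信息检索技术基本原理教程" (A Course in the Basic Principles of Modern Information Retrieval Technology), vol. 2013, 西安电子科技大学出版社 (Xidian University Press), pages: 103 - 104 *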

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination