CN113312319B - Mobile internet shared document duplicate checking early warning system and method - Google Patents

Publication number
CN113312319B
Authority
CN
China
Prior art keywords
document
real
early warning
shared
words
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110720405.2A
Other languages
Chinese (zh)
Other versions
CN113312319A (en)
Inventor
何成良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhiku Information Technology Co ltd
Original Assignee
Shenzhen Zhiku Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Shenzhen Zhiku Information Technology Co ltd
Priority to CN202110720405.2A
Publication of CN113312319A
Application granted
Publication of CN113312319B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/176 Support for shared access to files; File sharing support
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The invention discloses a mobile internet shared document duplicate checking early warning system and method. It addresses the technical problem in the prior art that document duplicate checking is inefficient because no standard text can be constructed in the database. A standard preset unit extracts keywords from the shared documents in the shared database and constructs a duplicate checking standard text: an importance coefficient Xi is obtained through a formula for each single Chinese character or word of the shared document, the preset duplicate checking standard texts are selected, semantic analysis is performed on them, and the resulting standard text set is subset-sorted and sent to a duplicate checking management platform. Subset sorting means that, if subsets have been removed from the set, the blank positions in the subset sequence left by the removed subsets are filled. Constructing a standard text for the shared document improves the accuracy of duplicate checking, and the semantic analysis eliminates the influence of mood words (modal particles) on the standard text, which improves the efficiency of duplicate checking.

Description

Mobile internet shared document duplicate checking early warning system and method
Technical Field
The invention relates to the technical field of shared document duplicate checking and early warning, in particular to a mobile internet shared document duplicate checking and early warning system and method.
Background
Human society has moved from the industrial economy era into the knowledge economy era, and knowledge has become an important enterprise resource. The value of knowledge grows as it is propagated and shared: the marginal cost stays constant while the marginal return increases, which lowers an enterprise's operating costs. Promoting internal knowledge sharing therefore brings an enterprise large gains for little investment;
the information security of documents is also vital. There are many ways to check documents for duplicates, but in the prior art no standard text can be constructed in the database during duplicate checking, so document duplicate checking is inefficient;
in view of the above technical drawbacks, a solution is proposed.
Disclosure of Invention
The invention aims to provide a mobile internet shared document duplicate checking early warning system and method. A standard preset unit extracts keywords from the shared documents in the shared database and constructs a duplicate checking standard text: an importance coefficient Xi is obtained through a formula for each single Chinese character or word of the shared document, the preset duplicate checking standard texts are selected, semantic analysis is performed on them, and the resulting standard text set is subset-sorted and sent to a duplicate checking management platform. Subset sorting means that, if subsets have been removed from the set, the blank positions in the subset sequence left by the removed subsets are filled. Constructing a standard text for the shared document improves the accuracy of duplicate checking, and the semantic analysis eliminates the influence of mood words (modal particles) on the standard text, which improves the efficiency of duplicate checking.
The purpose of the invention can be realized by the following technical scheme:
a mobile internet shared document duplicate checking and early warning system comprises a registration login unit, a database, an efficiency detection unit, a document identification unit, a standard preset unit, a duplicate checking management platform, a complaint unit and an early warning unit;
the standard preset unit is used for extracting keywords from the shared documents in the shared database and constructing a duplicate checking standard text, and the specific extraction and construction process is as follows:
step S1: acquiring a shared document from the shared database, removing its paragraph breaks so that the whole document is merged into one paragraph, dividing the document into single Chinese characters or words according to grammar, and labeling each single Chinese character or word with an index i, where i = 1, 2, ..., n and n is a positive integer;
step S2: acquiring the number of occurrences and the frequency of each single Chinese character or word, marking them as CSi and PLi respectively, and obtaining the importance coefficient Xi of the corresponding single Chinese character or word through the formula
[formula image 848989DEST_PATH_IMAGE001 not reproduced in this text]
wherein v1 and v2 are preset weight coefficients with v1 > v2 > 0, β1 is an error correction factor with a value of 2.03, and e is the natural constant; the importance coefficient is a numerical value, obtained by normalizing the parameters of a word in the shared document, that is used to analyze the probability of the word being selected; according to the formula, the larger the number of occurrences and the frequency of a word, the larger its importance coefficient and the higher the probability that the word is marked as a keyword;
step S3: comparing the importance coefficient Xi with an importance coefficient threshold: if Xi is greater than or equal to the importance coefficient threshold, setting the corresponding single Chinese character or word in the shared document as a preset duplicate checking standard text; if Xi is less than the importance coefficient threshold, marking the corresponding single Chinese character or word in the shared document as useless text;
step S4: acquiring the preset duplicate checking standard texts whose importance coefficients rank in the top ten, constructing a standard text set (X1, X2, ..., X10), and performing semantic analysis on the single Chinese character or word corresponding to each subset in the standard text set: if the character or word is a mood word (modal particle), removing it from the standard text set; otherwise keeping it; the semantic analysis means that the word meaning is analyzed by looking the word up in an Internet electronic dictionary;
step S5: performing subset sorting on the standard text set after semantic analysis and sending the sorted set to the duplicate checking management platform, wherein subset sorting means that, if subsets have been removed from the set, the blank positions in the subset sequence left by the removed subsets are filled; performing semantic analysis on the text eliminates the influence of mood words on the standard text and thereby improves the efficiency of duplicate checking.
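Steps S1 to S4 can be sketched roughly as follows. This is a minimal illustration, not the patented computation: the formula for Xi is not reproduced in this text, so `placeholder_score` is a hypothetical stand-in that only preserves the stated property that the score grows with both the number of occurrences and the frequency. The function names, the example weights, and the assumption that the document has already been segmented into tokens are all illustrative.

```python
from collections import Counter
from typing import Callable, List


def placeholder_score(count: int, freq: float) -> float:
    """Hypothetical stand-in for the Xi formula, which is not available here.

    It is only monotonically increasing in both inputs, as the text requires.
    """
    v1, v2 = 0.7, 0.65  # example weights satisfying v1 > v2 > 0
    return v1 * count + v2 * freq


def build_standard_text(
    tokens: List[str],
    threshold: float,
    score: Callable[[int, float], float] = placeholder_score,
    top_k: int = 10,
) -> List[str]:
    """Return up to top_k tokens whose score reaches the threshold (steps S2 to S4)."""
    counts = Counter(tokens)          # CSi: number of occurrences of each token
    total = len(tokens)
    scored = []
    for token, cs in counts.items():
        pl = cs / total               # PLi: relative frequency of the token
        x = score(cs, pl)
        if x >= threshold:            # step S3: keep only tokens at or above the threshold
            scored.append((token, x))
    # Step S4: rank by score (ties broken alphabetically) and keep the top ten.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return [token for token, _ in scored[:top_k]]


tokens = ["文档", "共享", "文档", "的", "文档", "共享"]
print(build_standard_text(tokens, threshold=1.0))
```

Passing the scoring function as a parameter keeps the sketch usable if the real formula becomes available: only `score` needs to be replaced.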
Further, the document identification unit is used for identifying the real-time document and checking it for duplicates, and the specific identification and duplicate checking process is as follows:
step SS1: labeling each single Chinese character or word in the real-time document with an index o, where o = 1, 2, ..., m and m is a positive integer; acquiring the number of occurrences and the frequency of each single Chinese character or word in the real-time document, marking them as CSo and PLo respectively, and obtaining the importance coefficient Xo through the formula
[formula image 584864DEST_PATH_IMAGE002 not reproduced in this text]
wherein s1 and s2 are preset weight coefficients with s1 > s2 > 0, and β2 is an error correction factor with a value of 1.65;
step SS2: comparing the importance coefficient Xo with an importance coefficient threshold: if Xo is greater than or equal to the importance coefficient threshold, judging that the corresponding single Chinese character or word is a key Chinese character or word of the real-time document; if Xo is less than the importance coefficient threshold, judging that the corresponding single Chinese character or word is a useless Chinese character or word of the real-time document;
step SS3: comparing the key Chinese characters or words of the real-time document with the subsets in the standard text set; if a key Chinese character or word of the real-time document is the same as a subset in the standard text set, comparing the average interval character number of that key Chinese character or word in the real-time document with the average interval character number of the corresponding subset in the standard text set: if the two average interval character numbers are the same, judging that the real-time document contains duplicated content and marking it as an overlapped document; if they differ, judging that the real-time document contains no duplicated content; if the key Chinese characters or words of the real-time document differ from the subsets in the standard text set, judging that the real-time document contains no duplicated content.
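The comparison in step SS3 can be illustrated with a short sketch. It rests on two assumptions that the text does not spell out: the "average interval character number" is read as the mean number of characters between consecutive occurrences of a keyword, and the value for the standard text set is measured on the shared document from which the set was built. The function names are hypothetical.

```python
from typing import Iterable, Optional, Set


def average_gap(text: str, keyword: str) -> Optional[float]:
    """Mean number of characters between consecutive occurrences of keyword.

    Returns None when the keyword occurs fewer than twice (no gap exists).
    """
    positions = []
    start = text.find(keyword)
    while start != -1:
        positions.append(start)
        start = text.find(keyword, start + len(keyword))
    if len(positions) < 2:
        return None
    gaps = [b - (a + len(keyword)) for a, b in zip(positions, positions[1:])]
    return sum(gaps) / len(gaps)


def is_overlapped(
    realtime_text: str,
    shared_text: str,
    realtime_keywords: Iterable[str],
    standard_set: Set[str],
) -> bool:
    """Step SS3: flag the document when a shared keyword has the same average gap."""
    for keyword in realtime_keywords:
        if keyword not in standard_set:
            continue  # keyword differs from every subset: no match for this keyword
        gap_rt = average_gap(realtime_text, keyword)
        gap_std = average_gap(shared_text, keyword)
        if gap_rt is not None and gap_rt == gap_std:
            return True
    return False


print(is_overlapped("abYYabYYab", "abXXabXXab", ["ab"], {"ab"}))
```

Exact equality follows the wording of step SS3; a practical system might prefer a tolerance, but the text does not describe one.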
Further, the early warning unit is used for analyzing the overlapped documents and issuing early warnings for them, and the specific analysis and early warning process is as follows:
step T1: acquiring the repeated key single Chinese characters or words in the overlapped document and their average interval character number; acquiring the average interval character number of the same key single Chinese characters or words in the standard text set corresponding to the overlapped document; computing the difference between the two average interval character numbers as the difference interval character number of the overlapped document, and marking it as CZ;
step T2: marking the sentences of the overlapped document that contain the repeated key single Chinese characters or words as repeated sentences, acquiring the maximum number of characters of the key single Chinese characters or words in the repeated sentences, and marking it as CD;
step T3: by the formula
[formula image 518185DEST_PATH_IMAGE003 not reproduced in this text]
acquiring the early warning coefficient YJ of the overlapped document, wherein b1 and b2 are preset weight coefficients with b1 > b2 > 0, taking the values 1.3 and 1.2 respectively; the early warning coefficient is a numerical value, obtained by normalizing the parameters of the words in the overlapped document, that is used to analyze the probability that the document warrants an early warning; according to the formula, the larger the difference interval character number and the number of repeated characters, the larger the early warning coefficient and the higher the probability that an early warning is issued for the document;
step T4: comparing the early warning coefficient YJ of the overlapped document with an early warning coefficient threshold.
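A rough sketch of steps T1 to T4 follows. Several parts are assumptions: the YJ formula is not reproduced in this text, so `placeholder_warning_coefficient` is a hypothetical weighted sum that merely increases with CZ and CD; the difference is taken as an absolute value; the rule that equal averages yield a difference of 1 is taken from the detailed description; and a warning is assumed to fire when YJ reaches the threshold, since step T4 only says the two values are compared.

```python
def difference_interval(avg_overlapped: float, avg_standard: float) -> float:
    """Step T1: difference interval character number CZ (absolute value assumed)."""
    diff = abs(avg_overlapped - avg_standard)
    # The detailed description takes equal averages as a difference of 1.
    return 1.0 if diff == 0 else diff


def placeholder_warning_coefficient(
    cz: float, cd: float, b1: float = 1.3, b2: float = 1.2
) -> float:
    """Hypothetical stand-in for the YJ formula; only monotonic in CZ and CD."""
    return b1 * cz + b2 * cd


def should_warn(cz: float, cd: float, threshold: float) -> bool:
    """Step T4, assuming a warning is raised when YJ >= threshold."""
    return placeholder_warning_coefficient(cz, cd) >= threshold


cz = difference_interval(2.0, 2.0)
print(cz, should_warn(cz, cd=10, threshold=13.0))
```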
Further, the efficiency detection unit is used for analyzing the duplicate checking efficiency information so as to detect the duplicate checking efficiency, wherein the duplicate checking efficiency information comprises speed data and accuracy data, the speed data being the speed of duplicate checking of the real-time shared document and the accuracy data being the accuracy rate of duplicate checking of the real-time shared document, and the specific analysis and detection process is as follows:
step TT1: acquiring the duplicate checking speed of the real-time shared document through a timer, and marking it as CSD;
step TT2: acquiring the duplicate checking accuracy rate of the real-time shared document through sampling analysis, and marking it as ZQL;
step TT3: by the formula
[formula image 921485DEST_PATH_IMAGE004 not reproduced in this text]
acquiring the duplicate checking efficiency detection coefficient JC of the shared document, wherein k1 and k2 are preset weight coefficients with k1 > k2 > 0, and α is an error correction factor with a value of 2.065; the duplicate checking efficiency detection coefficient is a numerical value, obtained by normalizing the duplicate checking parameters of the shared document, that reflects the duplicate checking efficiency; according to the formula, the greater the duplicate checking speed and accuracy rate of the shared document, the greater the duplicate checking efficiency detection coefficient and the higher the duplicate checking efficiency of the corresponding shared document;
step TT4: comparing the duplicate checking efficiency detection coefficient JC of the shared document with a duplicate checking efficiency detection coefficient threshold, thereby improving the working efficiency of document duplicate checking and reducing its error rate.
Further, the complaint unit is used for analyzing the rewritten document received by the administrator so as to determine whether a complaint can be filed for the rewritten document, and the specific analysis and determination process is as follows:
step P1: marking the document repetition determination proportion as g; acquiring the total word count of the rewritten document and marking it as ZS; dividing the rewritten document into a beginning part, a body part and an ending part, and marking their word counts as ZS1, ZS2 and ZS3 respectively;
step P2: acquiring the early warning coefficients corresponding to the beginning part, the body part and the ending part; if the early warning coefficient of either the beginning part or the ending part is greater than or equal to the early warning coefficient threshold, judging that no complaint can be filed for the corresponding rewritten document; if the early warning coefficients of the beginning part and the ending part are both less than the early warning coefficient threshold, marking the body part of the corresponding rewritten document as the abnormal part, generating a complaint signal and rechecking the corresponding rewritten document;
step P3: segmenting the body of the corresponding rewritten document such that the word counts of the segments differ by no more than 100; refining the body word count in this way prevents a large base for the body's early warning coefficient from making the early warning judgment of the document inaccurate; acquiring the early warning coefficient of each body segment, and if the early warning coefficient of any segment is greater than or equal to the early warning coefficient threshold, judging that the rewritten document fails the recheck;
step P4: judging that no complaint can be filed for the corresponding rewritten document.
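The decision flow of steps P2 to P4 can be sketched as below. The early warning coefficient is supplied as a caller-provided function `coeff`, because its formula is not available in this text. The segment count `parts`, the function names and the returned labels are hypothetical, and the `"recheck_passed"` outcome is an assumed complement of the failing case, which the text does not state explicitly. Splitting into near-equal segments is one simple way to keep segment sizes within the stated limit of 100.

```python
from typing import Callable, List


def split_evenly(body: str, parts: int) -> List[str]:
    """Split body into `parts` segments whose lengths differ by at most one character."""
    base, extra = divmod(len(body), parts)
    segments, start = [], 0
    for k in range(parts):
        end = start + base + (1 if k < extra else 0)
        segments.append(body[start:end])
        start = end
    return segments


def review_rewritten(
    head: str,
    body: str,
    tail: str,
    coeff: Callable[[str], float],
    threshold: float,
    parts: int = 4,
) -> str:
    """Steps P2 to P4: decide whether a rewritten document may be complained about."""
    # Step P2: a beginning or ending part at or above the threshold rules out a complaint.
    if coeff(head) >= threshold or coeff(tail) >= threshold:
        return "no_complaint"
    # Step P3: recheck the body segment by segment.
    if any(coeff(segment) >= threshold for segment in split_evenly(body, parts)):
        return "recheck_failed"
    return "recheck_passed"  # assumed outcome; not stated explicitly in the text


def count_x(text: str) -> float:
    """Toy coefficient used only for this demonstration."""
    return float(text.count("x"))


print(review_rewritten("a", "xxaaaaaa", "b", count_x, threshold=2))
```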
Further, a mobile internet shared document duplicate checking and early warning method specifically comprises the following steps:
step one, registration and login: users and managers register through the registration login unit;
step two, text presetting: keywords are extracted from the shared documents in the shared database through the standard preset unit, and a duplicate checking standard text is constructed;
step three, document identification: the real-time document is identified and checked for duplicates through the document identification unit;
step four, document early warning: the overlapped documents are analyzed through the early warning unit and early warnings are issued for them;
step five, efficiency detection: the duplicate checking efficiency information is analyzed through the efficiency detection unit so as to detect the duplicate checking efficiency.
Compared with the prior art, the invention has the beneficial effects that:
1. in the invention, the standard preset unit extracts keywords from the shared documents in the shared database and constructs a duplicate checking standard text: the importance coefficient Xi is obtained through a formula, the preset duplicate checking standard texts are selected and semantically analyzed, and the resulting standard text set is subset-sorted and sent to the duplicate checking management platform, where subset sorting means that, if subsets have been removed from the set, the blank positions in the subset sequence left by the removed subsets are filled; constructing a standard text for the shared document improves the accuracy of document duplicate checking, and the semantic analysis eliminates the influence of mood words (modal particles) on the standard text, which improves the efficiency of duplicate checking;
2. in the invention, the document identification unit identifies the real-time document and checks it for duplicates: the importance coefficient Xo is obtained through a formula, and if Xo is greater than or equal to the importance coefficient threshold, the corresponding single Chinese character or word is judged to be a key Chinese character or word of the real-time document; the average interval character numbers are then compared: if they are the same, the real-time document is judged to contain duplicated content and is marked as an overlapped document, and if they differ, the real-time document is judged to contain no duplicated content; identifying the document first and comparing characters afterwards improves the working efficiency of document duplicate checking and reduces its error rate.
Drawings
In order to facilitate understanding for those skilled in the art, the present invention will be further described with reference to the accompanying drawings;
fig. 1 is a schematic block diagram of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments, and it should be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, a mobile internet shared document duplicate checking and early warning system includes a registration login unit, a database, an efficiency detection unit, a document identification unit, a standard preset unit, a duplicate checking management platform, a complaint unit and an early warning unit, wherein the duplicate checking management platform, the registration login unit, the database, the efficiency detection unit, the document identification unit, the standard preset unit, the complaint unit and the early warning unit are all in bidirectional communication connection, and the registration login unit and the shared database are in bidirectional communication connection;
the registration login unit is used by managers and users to submit manager information and user information through a mobile phone terminal for registration, and sends the information of successfully registered managers and users to the database for storage, wherein the manager information comprises the manager's name, age, time of entry and real-name-authenticated mobile phone number, and the user information comprises the user's name, age, occupation and real-name-authenticated mobile phone number; both the manager information and the user information are real-name information, which prevents leakage of personnel data that would otherwise reduce the safety and reliability of the duplicate checking early warning system and the accuracy of early warning;
the standard preset unit is used for extracting keywords from the shared documents in the shared database and constructing a duplicate checking standard text; presetting the standard text improves the efficiency of checking real-time documents, avoids the loss of duplicate checking accuracy that results from constructing the standard text ad hoc while a real-time document is being checked, and reduces the risks of manual operation; the specific extraction and construction process is as follows:
step S1: acquiring a shared document from the shared database, removing its paragraph breaks so that the whole document is merged into one paragraph, dividing the document into single Chinese characters or words according to grammar, and labeling each single Chinese character or word with an index i, where i = 1, 2, ..., n and n is a positive integer; the shared document is acquired by network equipment, with grammar as the dividing condition;
step S2: acquiring the number of occurrences and the frequency of each single Chinese character or word, marking them as CSi and PLi respectively, and obtaining the importance coefficient Xi of the corresponding single Chinese character or word through the formula
[formula image 649269DEST_PATH_IMAGE001 not reproduced in this text]
wherein v1 and v2 are preset weight coefficients with v1 > v2 > 0, taking the values 0.7 and 0.65 respectively, β1 is an error correction factor with a value of 2.03, and e is the natural constant; the importance coefficient is a numerical value, obtained by normalizing the parameters of a word in the shared document, that is used to analyze the probability of the word being selected; according to the formula, the larger the number of occurrences and the frequency of a word, the larger its importance coefficient and the higher the probability that the word is marked as a keyword;
step S3: comparing the importance coefficient Xi with an importance coefficient threshold: if Xi is greater than or equal to the importance coefficient threshold, setting the corresponding single Chinese character or word in the shared document as a preset duplicate checking standard text; if Xi is less than the importance coefficient threshold, marking the corresponding single Chinese character or word in the shared document as useless text; dividing the text of the shared document and marking the useless text prevents real-time text from being compared with useless text in the standard text, which would increase the duplicate checking workload and reduce duplicate checking efficiency;
step S4: acquiring the preset duplicate checking standard texts whose importance coefficients rank in the top ten, constructing a standard text set (X1, X2, ..., X10), and performing semantic analysis on the single Chinese character or word corresponding to each subset in the standard text set: if the character or word is a mood word (modal particle), removing it from the standard text set; otherwise keeping it; the semantic analysis means that the word meaning is analyzed by looking the word up in an Internet electronic dictionary;
step S5: performing subset sorting on the standard text set after semantic analysis and sending the sorted set to the duplicate checking management platform, wherein subset sorting means that, if subsets have been removed from the set, the blank positions in the subset sequence left by the removed subsets are filled;
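The removal and gap filling described in steps S4 and S5 amount to filtering the set and renumbering what remains. The sketch below assumes a small fixed list of mood words in place of the Internet electronic dictionary lookup that the text describes; `MOOD_WORDS` and `compact_standard_set` are hypothetical names.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for the dictionary lookup: a small fixed list of modal particles.
MOOD_WORDS = {"吗", "呢", "吧", "啊"}


def compact_standard_set(
    standard_set: List[str],
    is_mood_word: Callable[[str], bool] = lambda word: word in MOOD_WORDS,
) -> List[Tuple[int, str]]:
    """Drop mood words (step S4) and renumber the survivors without gaps (step S5)."""
    kept = [word for word in standard_set if not is_mood_word(word)]
    # Positions restart at 1, so blanks left by removed entries are filled.
    return list(enumerate(kept, start=1))


print(compact_standard_set(["文档", "吗", "共享", "呢", "预警"]))
```

The relative order of the remaining entries is preserved, so the ranking by importance coefficient established in step S4 carries over to the compacted set.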
the document identification unit is used for identifying the real-time document and checking it for duplicates, and the specific identification and duplicate checking process is as follows:
step SS1: labeling each single Chinese character or word in the real-time document with an index o, where o = 1, 2, ..., m and m is a positive integer; acquiring the number of occurrences and the frequency of each single Chinese character or word in the real-time document, marking them as CSo and PLo respectively, and obtaining the importance coefficient Xo through the formula
[formula image 138019DEST_PATH_IMAGE002 not reproduced in this text]
wherein s1 and s2 are preset weight coefficients with s1 > s2 > 0, taking the values 1.1 and 1.03 respectively, and β2 is an error correction factor with a value of 1.65;
step SS2: comparing the importance coefficient Xo with an importance coefficient threshold: if Xo is greater than or equal to the importance coefficient threshold, judging that the corresponding single Chinese character or word is a key Chinese character or word of the real-time document; if Xo is less than the importance coefficient threshold, judging that the corresponding single Chinese character or word is a useless Chinese character or word of the real-time document; detecting the real-time text and marking its useless Chinese characters or words avoids duplicate checking of useless content and reduces the workload;
step SS3: comparing the key Chinese characters or words of the real-time document with the subsets in the standard text set; if a key Chinese character or word of the real-time document is the same as a subset in the standard text set, comparing the average interval character number of that key Chinese character or word in the real-time document with the average interval character number of the corresponding subset in the standard text set: if the two average interval character numbers are the same, judging that the real-time document contains duplicated content and marking it as an overlapped document; if they differ, judging that the real-time document contains no duplicated content; if the key Chinese characters or words of the real-time document differ from the subsets in the standard text set, judging that the real-time document contains no duplicated content;
the early warning unit is used for analyzing the overlapped documents, giving early warning to the overlapped documents, analyzing repeated keywords in the documents and judging whether to give early warning to the documents, and the specific analysis and early warning process is as follows:
step T1: acquiring repeated key single Chinese characters or words in the overlapped document, acquiring the average interval character number of the repeated key single Chinese characters or words, then acquiring the average interval character number of the repeated key single Chinese characters or words in the standard text set corresponding to the overlapped document, then calculating the difference interval character number of the overlapped document from the difference between the average interval character numbers of the overlapped document and the standard text set, and marking the difference interval character number as CZ; the average interval character number is the average of the numbers of characters separating all repeated key single Chinese characters or words in the document, and if the average interval character number of the overlapped document is equal to that of the standard document, the difference value of the average interval character number is taken as 1;
step T2: marking the sentences in the overlapped document that contain the repeated key single Chinese characters or words as repeated sentences, acquiring the maximum number of characters of the key single Chinese characters or words in the repeated sentences of the overlapped document, and marking this maximum number of characters as CD;
step T3: by the formula
[formula image not reproduced in the source text; it gives YJ in terms of CZ, CD, b1 and b2]
acquiring an early warning coefficient YJ of the overlapped document, wherein b1 and b2 are both preset weight coefficients with b1 > b2 > 0, taking the values 1.3 and 1.2 respectively; the early warning coefficient is a value, obtained by normalizing the corresponding parameters of the words in the overlapped document, that is used to analyze the probability of an early warning for the document; according to the formula, the larger the difference interval character number and the number of repeated characters of the document, the larger the early warning coefficient and the higher the probability that the corresponding document triggers an early warning;
step T4: comparing the early warning coefficient YJ in the overlapped document with an early warning coefficient threshold value:
if the early warning coefficient YJ in the overlapped document is larger than or equal to the early warning coefficient threshold value, marking the corresponding overlapped document as a rewritten document and sending the rewritten document to a mobile phone terminal of a manager;
if the early warning coefficient YJ in the overlapped document is smaller than the early warning coefficient threshold value, marking the corresponding overlapped document as a modified document and sending the modified document to a mobile phone terminal of a manager;
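Steps T1 to T4 can be sketched as follows. Because the formula for YJ is only available as an image, the sketch takes the formula as a caller-supplied function; `placeholder_formula` is a hypothetical stand-in chosen only so that YJ grows with CZ and CD, and it is not the patent's formula. The reading of CD as the largest count of keyword characters found in any one repeated sentence is likewise an assumption.

```python
from typing import Callable, Iterable

B1, B2 = 1.3, 1.2  # weight values stated in the text (b1 > b2 > 0)


def difference_interval(doc_avg: float, std_avg: float) -> float:
    """Step T1: CZ. The text sets the value to 1 when the averages are equal."""
    if doc_avg == std_avg:
        return 1.0
    return abs(doc_avg - std_avg)


def max_keyword_chars(sentences: Iterable[str], keywords: Iterable[str]) -> int:
    """Step T2 (one possible reading): CD is the largest number of keyword
    characters found in any single repeated sentence."""
    kws = list(keywords)
    best = 0
    for sentence in sentences:
        chars = sum(sentence.count(k) * len(k) for k in kws)
        best = max(best, chars)
    return best


def placeholder_formula(cz: float, cd: float, b1: float, b2: float) -> float:
    """Hypothetical stand-in for the missing formula image; NOT the source formula."""
    return b1 * cz + b2 * cd


def classify(cz: float, cd: float,
             formula: Callable[[float, float, float, float], float],
             threshold: float) -> str:
    """Steps T3 and T4: compute YJ and compare it with the threshold."""
    yj = formula(cz, cd, B1, B2)
    return "rewritten" if yj >= threshold else "modified"
```

For example, `classify(1.0, 4, placeholder_formula, 5.0)` yields a "rewritten" label under the placeholder, while a document with no keyword characters in its repeated sentences falls below the same threshold.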
the efficiency detection unit is used for analyzing duplicate checking efficiency information so as to detect the duplicate checking efficiency; the duplicate checking efficiency information comprises speed data and accuracy data, the speed data being the duplicate checking speed of the real-time shared document and the accuracy data being the duplicate checking accuracy of the real-time shared document; analyzing the duplicate checking efficiency helps improve both the accuracy and the speed of duplicate checking and allows the duplicate checking system to be improved; the specific analysis and detection process is as follows:
step TT 1: acquiring the duplicate checking speed of the real-time shared document through a timer, and marking the speed of the duplicate checking of the real-time shared document as CSD;
step TT 2: acquiring the accuracy rate of the duplicate checking of the real-time shared document through sampling analysis, and marking the accuracy rate of the duplicate checking of the real-time shared document as ZQL;
step TT 3: by the formula
[formula image not reproduced in the source text; it gives JC in terms of CSD, ZQL, k1, k2 and alpha]
acquiring a duplicate checking efficiency detection coefficient JC of the shared document, wherein k1 and k2 are both preset weight coefficients with k1 > k2 > 0, and alpha is an error correction factor with a value of 2.065; the duplicate checking efficiency detection coefficient is a value, obtained by normalizing the duplicate checking parameters of the shared document, that reflects the duplicate checking efficiency; according to the formula, the greater the duplicate checking speed and accuracy of the shared document, the greater the duplicate checking efficiency detection coefficient and the higher the duplicate checking efficiency of the corresponding shared document;
step TT 4: comparing a duplicate checking efficiency detection coefficient JC of the shared document with a duplicate checking efficiency detection coefficient threshold value:
if the duplicate checking efficiency detection coefficient JC of the shared document is larger than or equal to the duplicate checking efficiency detection coefficient threshold, judging that the duplicate checking efficiency of the shared document is qualified, generating a duplicate checking efficiency qualified signal and sending the duplicate checking efficiency qualified signal to a mobile phone terminal of a manager;
if the duplicate checking efficiency detection coefficient JC of the shared document is less than the duplicate checking efficiency detection coefficient threshold value, judging that the duplicate checking efficiency of the shared document is unqualified, generating a signal with unqualified duplicate checking efficiency and sending the signal with unqualified duplicate checking efficiency to a mobile phone terminal of a manager;
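The measurement side of steps TT 1, TT 2 and TT 4 might look like the sketch below. The unit chosen for CSD (documents checked per second), the use of manually labelled samples for ZQL, and all function names are assumptions for illustration; the formula that combines CSD and ZQL into JC is not reproduced in the text, so JC is passed in as an already computed value.

```python
import time
from typing import Callable, Sequence


def measure_speed(check: Callable[[str], bool], docs: Sequence[str]) -> float:
    """Step TT 1 sketch: documents checked per second, measured with a timer."""
    start = time.perf_counter()
    for doc in docs:
        check(doc)
    elapsed = time.perf_counter() - start
    return len(docs) / elapsed if elapsed > 0 else float("inf")


def sampled_accuracy(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Step TT 2 sketch: share of sampled documents whose duplicate verdict
    matches a reference label (the labelling procedure is an assumption)."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("need two non-empty samples of equal length")
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(predicted)


def efficiency_signal(jc: float, threshold: float) -> str:
    """Step TT 4: qualified when JC reaches the threshold, otherwise unqualified."""
    return "qualified" if jc >= threshold else "unqualified"
```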
the complaint unit is used for analyzing the rewritten documents received by the manager so as to judge whether a complaint can be made for a rewritten document; this complaint judgment prevents early warnings caused purely by proportion problems in the rewritten document, which would otherwise reduce the accuracy of the rewritten-document judgment, waste the content resources of the corresponding document and bring unnecessary trouble to writers; the specific analysis and judgment process is as follows:
step P1: marking a document repetition determination proportion as g, acquiring the total word number of the acquired rewritten document, marking the total word number of the rewritten document as ZS, dividing the rewritten document into a head part, a body part and a tail part, and respectively marking the word numbers of the head part, the body part and the tail part as ZS1, ZS2 and ZS 3;
step P2: acquiring the early warning coefficients corresponding to the beginning (head) part, the text (body) part and the end (tail) part; if the early warning coefficient of either the beginning part or the end part is greater than or equal to the early warning coefficient threshold, judging that no complaint can be made for the corresponding rewritten document; if the early warning coefficients of both the beginning part and the end part are smaller than the early warning coefficient threshold, marking the text part of the corresponding rewritten document as an abnormal part, generating a complaint signal and rechecking the corresponding rewritten document;
step P3: segmenting the text part of the corresponding rewritten document such that the word counts of the segments differ by no more than 100; refining the text word count in this way prevents the large base of the text part from making the early warning judgment of the document inaccurate; acquiring the early warning coefficient corresponding to each segment of the text part, and if the early warning coefficient of any segment is not less than the early warning coefficient threshold, judging that the rewritten document fails the recheck;
step P4: judging that the corresponding rewritten document cannot complain;
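The complaint flow of steps P2 to P4 can be sketched as below. The scoring function is supplied by the caller because the early warning formula is not reproduced in the text; lengths are counted in characters; `max_len` is a hypothetical segment size; and the outcome returned when no segment reaches the threshold is an assumption, since the text only states the failing case.

```python
from typing import Callable, List


def split_body(body: str, max_len: int) -> List[str]:
    """Split the body into near-equal segments of at most max_len characters.

    Segment lengths differ by at most 1, which satisfies the requirement
    that segment sizes differ by no more than 100.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    n = len(body)
    if n == 0:
        return []
    k = -(-n // max_len)          # ceiling division: number of segments
    base, extra = divmod(n, k)    # the first `extra` segments get one more char
    segments, pos = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        segments.append(body[pos:pos + size])
        pos += size
    return segments


def complaint_decision(head: str, body: str, tail: str,
                       score: Callable[[str], float],
                       threshold: float, max_len: int = 500) -> str:
    """Steps P2 to P4 sketch; score() stands in for the early warning coefficient."""
    # Step P2: a high coefficient in the head or tail rules out a complaint.
    if score(head) >= threshold or score(tail) >= threshold:
        return "cannot_complain"
    # Step P3: the body is the abnormal part; recheck it segment by segment.
    if any(score(seg) >= threshold for seg in split_body(body, max_len)):
        return "recheck_unqualified"
    # Assumption: the text does not name the outcome when every segment passes.
    return "recheck_no_segment_over_threshold"
```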
a mobile internet shared document duplicate checking early warning method specifically comprises the following steps:
step one, registering and logging in, wherein a user and a manager register through a registering and logging unit;
step two, text presetting, namely extracting keywords from shared documents in a shared database through a standard presetting unit and constructing a duplication checking standard text;
step three, document identification, namely identifying and duplicate checking the real-time document through a document identification unit;
step four, document early warning, namely analyzing the overlapped document through an early warning unit and giving early warning for the overlapped document;
step five, efficiency detection, namely analyzing the duplicate checking efficiency information through an efficiency detection unit so as to detect the duplicate checking efficiency;
the formulas are obtained by collecting a large amount of data and performing software simulation, and the coefficients in the formulas are set by those skilled in the art according to actual conditions; for example, for the formula
[formula image not reproduced in the source text; importance coefficient Xi]
those skilled in the art collect multiple groups of sample data and set a corresponding importance coefficient for each group of words; the set importance coefficients and the collected sample data are substituted into the formula, any five of the resulting equations form a system of linear equations in five unknowns, and the coefficients of that system are calculated through software simulation; this calculation is repeated for a plurality of such systems, and the calculated coefficients are screened and averaged to obtain values of 0.7 and 0.65 for v1 and v2 respectively and a value of 2.03 for beta 1; the other coefficients are all obtained by the same method;
the size of each coefficient serves to quantize the corresponding parameter into a specific numerical value that is convenient for subsequent comparison; the size of the coefficient depends on the amount of sample data and on the importance coefficient preliminarily set by those skilled in the art for each group of sample data; any value is acceptable as long as the proportional relationship between the parameters and the quantized numerical values is not affected, for example, the importance coefficient is proportional to the occurrence frequency of the words.
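The coefficient-setting procedure described above (solve several small linear systems built from sample data, then average the solutions) can be illustrated with the following sketch. It assumes that the equations are linear in the unknown coefficients and leaves out the screening step, whose criterion is not given in the text; `solve_linear` and `average_coefficients` are hypothetical helper names.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def solve_linear(a: Matrix, b: Sequence[float]) -> List[float]:
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [a | b] as floats.
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def average_coefficients(systems: Sequence[Tuple[Matrix, Sequence[float]]]) -> List[float]:
    """Solve every system and average the solutions coefficient by coefficient."""
    solutions = [solve_linear(a, b) for a, b in systems]
    count = len(solutions)
    return [sum(s[i] for s in solutions) / count for i in range(len(solutions[0]))]
```

The same helpers work for the five-unknown systems mentioned in the text; a two-unknown system such as `solve_linear([[2, 1], [1, 3]], [5, 10])` is simply easier to check by hand.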
When the system works, the steps run as follows. Registration and login: a user and a manager register through the registration and login unit. Text presetting: the standard presetting unit extracts keywords from the shared documents in the shared database and constructs the duplicate checking standard text, which improves the accuracy of document duplicate checking; semantic analysis of the text eliminates the influence of tone words on the standard text and thus improves the efficiency of duplicate checking. Document identification: the document identification unit identifies and duplicate checks the real-time document. Document early warning: the early warning unit analyzes the overlapped document and gives an early warning for it. Efficiency detection: the efficiency detection unit analyzes the duplicate checking efficiency information to detect the duplicate checking efficiency, which improves the working efficiency of document duplicate checking and reduces its error rate.
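The text presetting stage summarized above can be sketched as follows. The input is assumed to be already divided into single characters or words; the importance formula is passed in by the caller because it is not reproduced in the text; and the `TONE_WORDS` set is an illustrative assumption rather than a list taken from the source.

```python
from collections import Counter
from typing import Callable, Iterable, List, Set

# Illustrative tone words (modal particles); this list is an assumption.
TONE_WORDS = {"啊", "吗", "呢", "吧"}


def build_standard_set(tokens: Iterable[str],
                       importance: Callable[[int, float], float],
                       threshold: float,
                       tone_words: Set[str] = TONE_WORDS,
                       top_k: int = 10) -> List[str]:
    """Sketch of the presetting steps: score each token, keep those at or above
    the threshold, take the top ten by score, then drop tone words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    scored = []
    for token, cs in counts.items():
        pl = cs / total                 # assumed meaning of PLi: relative frequency
        x = importance(cs, pl)          # caller-supplied stand-in for Xi
        if x >= threshold:
            scored.append((x, token))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # highest score first
    top = [token for _, token in scored[:top_k]]
    return [token for token in top if token not in tone_words]
```

With `importance=lambda cs, pl: cs` and a threshold of 2, a token list in which "查重" appears twice and "啊" three times yields only "查重", since the tone word is removed after ranking.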
All of the above formulas are calculated on dimensionless numerical values; each formula is obtained by collecting a large amount of data and performing software simulation so as to approximate the real situation as closely as possible, and the preset parameters in the formulas are set by those skilled in the art according to the actual situation.
The foregoing is merely exemplary and illustrative of the present invention and various modifications, additions and substitutions may be made by those skilled in the art to the specific embodiments described without departing from the scope of the invention as defined in the following claims.

Claims (5)

1. A mobile internet shared document duplication checking early warning system, characterized by comprising an efficiency detection unit, a document identification unit, a standard presetting unit, a duplication checking management platform, a complaint unit and an early warning unit;
the standard presetting unit is used for extracting keywords of shared documents in the shared database and constructing a duplication checking standard text, and the specific extraction and construction process is as follows:
step S1: acquiring a shared document in a shared database, removing the paragraph breaks of the shared document so that all of its paragraphs are merged into one paragraph, dividing the shared document into single Chinese characters or words according to grammar, and assigning each single Chinese character or word of the shared document a label i, wherein i = 1, 2, ……, n and n is a positive integer;
step S2: acquiring the number of occurrences and the frequency of each single Chinese character or word, marking them as CSi and PLi respectively, and by the formula
[formula image not reproduced in the source text; it gives Xi in terms of CSi, PLi, v1, v2, beta 1 and e]
obtaining an importance coefficient Xi of the shared document, wherein v1 and v2 are both preset weight coefficients with v1 > v2 > 0, taking the values 0.7 and 0.65 respectively, beta 1 is an error correction factor with a value of 2.03, and e is the natural constant;
step S3: comparing the importance coefficient Xi of the shared document with an importance coefficient threshold: if the importance coefficient Xi of the shared document is larger than or equal to the importance coefficient threshold, setting a corresponding single Chinese character or word in the shared document as a preset duplication checking standard text; if the importance coefficient Xi of the shared document is less than the importance coefficient threshold value, marking the corresponding single Chinese character or word in the shared document as useless text;
step S4: acquiring the preset duplication checking standard texts whose importance coefficients rank in the top ten, constructing a standard text set (X1, X2, …, X10), performing semantic analysis on the single Chinese characters or words corresponding to the subsets in the standard text set, removing the corresponding single Chinese character or word from the standard text set if it is a tone Chinese character or word, and not removing it if it is not a tone Chinese character or word;
step S5: performing subset sorting on the standard text set after semantic analysis, and then sending the subset sorting to a duplication checking management platform;
the document identification unit is used for identifying and checking the real-time document, and the specific identification and checking process is as follows:
step SS 1: assigning each single Chinese character or word in the real-time document a label o, wherein o = 1, 2, ……, m and m is a positive integer, acquiring the number of occurrences and the frequency of each single Chinese character or word in the real-time document, marking them as CSo and PLo respectively, and by the formula
[formula image not reproduced in the source text; it gives Xo in terms of CSo, PLo, s1, s2 and beta 2]
obtaining an importance coefficient Xo of the real-time document, wherein s1 and s2 are both preset weight coefficients with s1 > s2 > 0, taking the values 1.1 and 1.03 respectively, and beta 2 is an error correction factor with a value of 1.65;
step SS 2: comparing the importance coefficient Xo of the real-time document with an importance coefficient threshold: if the importance coefficient Xo of the real-time document is larger than or equal to the importance coefficient threshold, judging that a single Chinese character or a word corresponding to the real-time document is a key Chinese character or a word of the real-time document; if the importance coefficient Xo of the real-time document is less than the importance coefficient threshold value, judging that the single Chinese character or the word corresponding to the real-time document is a useless Chinese character or word of the real-time document;
step SS 3: comparing the key Chinese characters or words of the real-time document with the subset in the standard text set, if the key Chinese characters or words of the real-time document are the same as the subset in the standard text set, comparing the average interval character number of the corresponding key Chinese characters or words in the real-time document with the average interval character number of the corresponding subset in the standard text set, if the average interval character numbers are the same, judging that the real-time document has repeated manuscripts, marking the real-time document as an overlapped document, and if the average interval character numbers are different, judging that the real-time document does not have repeated manuscripts; and if the key Chinese characters or words of the real-time document are different from the subset in the standard text set, judging that no repeated manuscript exists in the real-time document.
2. The mobile internet shared document duplication checking early warning system according to claim 1, wherein the early warning unit is configured to analyze the overlapped document and give early warning for the overlapped document, and the specific analysis and early warning process is as follows:
step T1: acquiring repeated key single Chinese characters or words in the overlapped document, acquiring the average interval character number of the repeated key single Chinese characters or words, then acquiring the average interval character number of the repeated key single Chinese characters or words in a standard text set corresponding to the overlapped document, then calculating and acquiring the difference interval character number of the overlapped document according to the average interval character number difference value of the overlapped document and the standard text set, and marking the difference interval character number as CZ;
step T2: marking the sentences in the overlapped document that contain the repeated key single Chinese characters or words as repeated sentences, acquiring the maximum number of characters of the key single Chinese characters or words in the repeated sentences of the overlapped document, and marking this maximum number of characters as CD;
step T3: and acquiring an early warning coefficient YJ in the overlapped document through a formula, and comparing the early warning coefficient YJ in the overlapped document with an early warning coefficient threshold value.
3. The mobile internet shared document duplication checking early warning system according to claim 1, wherein the efficiency detection unit is configured to analyze duplicate checking efficiency information so as to detect the duplicate checking efficiency, the duplicate checking efficiency information includes speed data and accuracy data, the speed data is the duplicate checking speed of the real-time shared document, the accuracy data is the duplicate checking accuracy of the real-time shared document, and the specific analysis and detection process is as follows:
step TT 1: acquiring the duplicate checking speed of the real-time shared document through a timer, and marking the speed of the duplicate checking of the real-time shared document as CSD;
step TT 2: acquiring the accuracy rate of the duplicate checking of the real-time shared document through sampling analysis, and marking the accuracy rate of the duplicate checking of the real-time shared document as ZQL;
step TT 3: by the formula
[formula image not reproduced in the source text; it gives JC in terms of CSD, ZQL, k1, k2 and alpha]
acquiring a duplicate checking efficiency detection coefficient JC of the shared document, wherein k1 and k2 are both preset weight coefficients with k1 > k2 > 0, and alpha is an error correction factor with a value of 2.065;
step TT 4: and comparing the duplicate checking efficiency detection coefficient JC of the shared document with the duplicate checking efficiency detection coefficient threshold.
4. The mobile internet shared document duplication checking early warning system according to claim 1, wherein the complaint unit is configured to analyze the rewritten document received by the manager so as to judge whether a complaint can be made for the rewritten document, and the specific analysis and judgment process is as follows:
step P1: marking a document repetition determination proportion as g, acquiring the total word number of the acquired rewritten document, marking the total word number of the rewritten document as ZS, dividing the rewritten document into a head part, a body part and a tail part, and respectively marking the word numbers of the head part, the body part and the tail part as ZS1, ZS2 and ZS 3;
step P2: acquiring the early warning coefficients corresponding to the beginning (head) part, the text (body) part and the end (tail) part; if the early warning coefficient of either the beginning part or the end part is greater than or equal to the early warning coefficient threshold, judging that no complaint can be made for the corresponding rewritten document; if the early warning coefficients of both the beginning part and the end part are smaller than the early warning coefficient threshold, marking the text part of the corresponding rewritten document as an abnormal part, generating a complaint signal and rechecking the corresponding rewritten document;
step P3: segmenting the text of the corresponding rewritten document, wherein the difference value of the word numbers of the segmented text is not more than 100, refining the word number of the text to obtain the early warning coefficient corresponding to each text part segment, and if any one segment of the early warning coefficient corresponding to each text part segment is not less than the early warning coefficient threshold value, judging that the rewritten document is unqualified for re-checking;
step P4: and judging that the corresponding rewritten document cannot complain.
5. A mobile internet shared document duplicate checking and early warning method, which is characterized by comprising the mobile internet shared document duplicate checking and early warning method of any claim from 1 to 4.
CN202110720405.2A 2021-06-28 2021-06-28 Mobile internet shared document duplicate checking early warning system and method Active CN113312319B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110720405.2A CN113312319B (en) 2021-06-28 2021-06-28 Mobile internet shared document duplicate checking early warning system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110720405.2A CN113312319B (en) 2021-06-28 2021-06-28 Mobile internet shared document duplicate checking early warning system and method

Publications (2)

Publication Number Publication Date
CN113312319A CN113312319A (en) 2021-08-27
CN113312319B true CN113312319B (en) 2021-11-26

Family

ID=77380637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110720405.2A Active CN113312319B (en) 2021-06-28 2021-06-28 Mobile internet shared document duplicate checking early warning system and method

Country Status (1)

Country Link
CN (1) CN113312319B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7904450B2 (en) * 2008-04-25 2011-03-08 Wilson Kelce S Public electronic document dating list
CN108073604A (en) * 2016-11-10 2018-05-25 北京国双科技有限公司 Text handling method and device
CN106649222B (en) * 2016-12-13 2019-07-16 浙江网新恒天软件有限公司 Based on semantic analysis repetition detection method approximate with the text of multiple Simhash

Also Published As

Publication number Publication date
CN113312319A (en) 2021-08-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant