CN113792119A - Article originality evaluation system, method, device and medium
- Publication number
- CN113792119A (application number CN202111091198.5A)
- Authority
- CN
- China
- Prior art keywords
- document
- originality
- similar
- candidate
- article
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present disclosure relates to a system, method, device and medium for evaluating the originality of an article. The system includes a text data preprocessing module, a stock document management subsystem, a word meaning similar document candidate submodule ES, a semantic similar document candidate submodule Milvus, a feature storage submodule Mongo and an article originality calculation subsystem. The stock document management subsystem includes a document storage module, and the article originality calculation subsystem is composed of a candidate similar document retrieval module and an originality calculation module. The text data preprocessing module is arranged between the article originality calculation subsystem and the stock document management subsystem. The word meaning similar document candidate submodule ES and the semantic similar document candidate submodule Milvus are arranged between the article originality calculation subsystem and the stock document management subsystem and are each connected to the document storage module and the candidate similar document retrieval module.
Description
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a system, a method, a device, and a medium for evaluating originality of an article.
Background
The rapid growth in the volume of data available to people is accompanied by large amounts of similar data. In some scenarios, the originality of a document must be measured in order to decide how the document should be handled. For example, an academic journal considers whether to accept a manuscript only after its originality has been preliminarily confirmed by a duplicate check; plagiarism, reprinting and similar phenomena are widespread on the internet and can be found efficiently only with a computational tool based on originality.
Current originality calculation tools measure the similarity between two documents mainly by character string matching, and therefore handle 'manuscript washing' (rewording a text to disguise copying) poorly.
Disclosure of Invention
The present disclosure aims to solve the technical problem that prior-art methods for evaluating the originality of an article cannot meet the requirements of users.
In order to achieve the technical purpose, the present disclosure provides an article originality evaluation method, which includes:
preprocessing the document to be put in storage and storing the preprocessed document in the document library;
performing word meaning similar document candidate processing, semantic similar document candidate processing and/or feature extraction and storage on the newly stored document;
recalling the stock documents in the document library that may be similar to the document to be evaluated;
and calculating the originality of the document to be evaluated based on its similarity to the stock documents, wherein a stock document is a document that has high originality in the service scenario and has been determined to need intellectual property protection, and the stock documents are stored in a document library.
Further, preprocessing the document to be put in storage specifically includes:
carrying out document cleaning and feature extraction on the document;
calculating the word features and the distributed representation of the document to be evaluated, namely segmenting the document to be evaluated into N paragraphs to obtain a paragraph set paras,
$paras = (p_1, p_2, \ldots, p_n, \ldots, p_N)$, where $p_n$, $n = 1, 2, \ldots, N$, denotes a document paragraph after segmentation, and N is an integer greater than or equal to 2.
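The segmentation step above can be illustrated with a minimal Python sketch. The function name `split_paragraphs` and the rule that paragraphs are separated by newlines are assumptions made for illustration; the disclosure does not state how paragraph boundaries are detected.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split a cleaned document into the non-empty paragraphs p_1..p_N."""
    # Assumption: paragraphs are separated by one or more newline characters.
    parts = re.split(r"\n+", text)
    return [part.strip() for part in parts if part.strip()]


doc = "First paragraph of the article.\n\nSecond paragraph.\nThird paragraph."
paras = split_paragraphs(doc)
print(len(paras), paras)
```

Each element of `paras` then plays the role of one $p_n$ in the later retrieval and scoring steps.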
Further, recalling the stock documents in the document library that may be similar to the document to be evaluated specifically includes:
denoting the word meaning candidate similar paragraph set retrieved for paragraph $p_n$ as $cand\_list_{wordbag} = (c_1, c_2, \ldots, c_i, \ldots, c_I)$, where $c_i$ denotes a word meaning similar paragraph retrieved for paragraph $p_n$, $i = 1, 2, \ldots, I$, and I is an integer greater than or equal to 2;
denoting the semantic candidate similar paragraph set retrieved for paragraph $p_n$ as $cand\_list_{distvec} = (d_1, d_2, \ldots, d_j, \ldots, d_J)$, where $d_j$ denotes a semantically similar paragraph retrieved for paragraph $p_n$, $j = 1, 2, \ldots, J$, and J is an integer greater than or equal to 2;
using a threshold to determine whether the word meaning candidate similar paragraph set $cand\_list_{wordbag}$ and the semantic candidate similar paragraph set $cand\_list_{distvec}$ are recalled, and denoting the set of all recalled candidate similar paragraphs as:
$cand\_list = cand\_list_{wordbag} \cup cand\_list_{distvec} = (s_1, s_2, \ldots, s_k, \ldots, s_K)$;
where $k = 1, 2, \ldots, K$, and K is an integer greater than or equal to 2.
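As an illustration of the threshold-and-union step, the following Python sketch merges the two candidate lists. The hit format (a paragraph id paired with a similarity value), the function name `recall_candidates` and the threshold values 0.3 and 0.8 are all assumptions; the disclosure does not give concrete thresholds.

```python
def recall_candidates(wordbag_hits, distvec_hits, wordbag_min=0.3, distvec_min=0.8):
    """Return the union of lexical and semantic candidates that pass a threshold.

    Each hit is a (paragraph_id, similarity) pair. The thresholds are
    illustrative values, not values taken from the disclosure.
    """
    recalled = []
    seen = set()
    for hits, threshold in ((wordbag_hits, wordbag_min), (distvec_hits, distvec_min)):
        for para_id, similarity in hits:
            # Keep a candidate once, even if both retrieval paths return it.
            if similarity >= threshold and para_id not in seen:
                seen.add(para_id)
                recalled.append(para_id)
    return recalled


wordbag_hits = [("c1", 0.5), ("c2", 0.1)]
distvec_hits = [("c1", 0.9), ("d2", 0.85), ("d3", 0.2)]
print(recall_candidates(wordbag_hits, distvec_hits))
```

The returned list corresponds to $(s_1, \ldots, s_K)$; deduplication reflects the set union in the formula.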
Further, calculating the originality of the document to be evaluated specifically includes:
calculating the originality of the article using the following formula;
degree of originality of the article: (formula not reproduced in the source text), where $score_n$ is the originality of the n-th paragraph of the document to be evaluated;
wherein,
$score_n = \min(score\_wordbag_n, score\_distvec_n)$,
where $score\_wordbag_n$ is the originality score of the n-th article paragraph under the bag-of-words model, and $score\_distvec_n$ is the originality score of the n-th article paragraph under the distributed representation.
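The per-paragraph rule $score_n = \min(score\_wordbag_n, score\_distvec_n)$ can be sketched in Python as follows. The input scores are made-up numbers, and the article-level aggregation is deliberately left out because its formula is not available in this text.

```python
def paragraph_originality(score_wordbag: float, score_distvec: float) -> float:
    """Originality of one paragraph: the lower of the lexical and semantic scores."""
    return min(score_wordbag, score_distvec)


# Hypothetical per-paragraph scores for a three-paragraph document.
wordbag_scores = [0.9, 0.2, 0.7]
distvec_scores = [0.8, 0.6, 0.3]
scores = [paragraph_originality(w, d) for w, d in zip(wordbag_scores, distvec_scores)]
print(scores)  # [0.8, 0.2, 0.3]
```

Taking the minimum means a paragraph counts as original only if it looks original in both the literal and the semantic view, which is what lets the semantic score catch reworded copies that the word-level score misses.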
Further, the originality score under the bag-of-words model is calculated by the following formula:
(formula not reproduced in the source text)
where the formula involves the set of words within paragraph $p_n$; the number of words that paragraphs $p_n$ and $s_k$ have in common; and, in the denominator, the absolute difference between the lengths of the two paragraphs. The coefficient β is the weight of the document length difference factor in calculating the text similarity, and is 0.5 by default.
Further, the originality score under the distributed representation is calculated by the following formula:
(formula not reproduced in the source text)
In order to achieve the above technical object, the present disclosure further provides an article originality evaluation method, including:
performing document cleaning and feature extraction on the document to be evaluated using the text data preprocessing module, and calculating the word features and distributed representation of the document to be evaluated;
recalling, using the candidate similar document retrieval module, the stock documents in the document library that may be similar to the document to be evaluated;
and calculating, using the originality calculation module, the originality of the document to be evaluated based on its similarity to the documents stored in the document library.
Further, the method also includes:
carrying out document cleaning and feature extraction on the document to be put in storage using the text data preprocessing module;
storing, using the document storage module, the preprocessed document data into the corresponding databases: the word meaning similar document candidate submodule ES, the semantic similar document candidate submodule Milvus and the feature storage submodule Mongo.
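The storage step can be sketched in Python with in-memory stand-ins. The class `InMemoryStores`, the function `warehouse_document`, the paragraph id scheme and the feature keys `words` and `vector` are hypothetical; no real ES, Milvus or Mongo client call is shown, and a real deployment would replace the dictionaries with those systems.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStores:
    """Hypothetical stand-ins for the ES, Milvus and Mongo submodules."""
    lexical_index: dict = field(default_factory=dict)  # role of ES: id -> words
    vector_index: dict = field(default_factory=dict)   # role of Milvus: id -> vector
    feature_store: dict = field(default_factory=dict)  # role of Mongo: id -> all features


def warehouse_document(stores, doc_id, paragraphs):
    """Write each preprocessed paragraph to all three stores and return its ids."""
    ids = []
    for n, para in enumerate(paragraphs, start=1):
        para_id = f"{doc_id}#p{n}"  # assumed id scheme, for illustration only
        stores.lexical_index[para_id] = para["words"]
        stores.vector_index[para_id] = para["vector"]
        stores.feature_store[para_id] = dict(para)
        ids.append(para_id)
    return ids


stores = InMemoryStores()
ids = warehouse_document(stores, "doc1", [{"words": ["article"], "vector": [0.1, 0.2]}])
print(ids)
```

Keeping the full feature record in a separate store mirrors the division described in the text: the two indexes serve candidate retrieval, while the feature store serves the originality calculation.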
To achieve the above technical objects, the present disclosure can also provide a computer storage medium having stored thereon a computer program for implementing the steps of the above-described article originality assessment method when executed by a processor.
In order to achieve the above technical object, the present disclosure further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the article originality assessment method when executing the computer program.
The beneficial effects of the present disclosure are as follows:
In the disclosed article originality evaluation system, the main calculation is completed offline, so inference is fast, and the system can also support real-time calculation.
The disclosed article originality evaluation system works in two dimensions, literal and semantic, and can effectively handle 'manuscript washing'.
Drawings
Fig. 1 shows a schematic structural diagram of a system of embodiment 1 of the present disclosure;
Fig. 2 shows a schematic structural diagram of a text data preprocessing module of the system of embodiment 1 of the present disclosure;
Fig. 3 shows a flow diagram of a method of embodiment 2 of the present disclosure;
Fig. 4 shows a schematic structural diagram of embodiment 4 of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
Various structural schematics according to embodiments of the present disclosure are shown in the figures. The figures are not drawn to scale, wherein certain details are exaggerated and possibly omitted for clarity of presentation. The shapes of various regions, layers, and relative sizes and positional relationships therebetween shown in the drawings are merely exemplary, and deviations may occur in practice due to manufacturing tolerances or technical limitations, and a person skilled in the art may additionally design regions/layers having different shapes, sizes, relative positions, as actually required.
Example one:
As shown in Fig. 1,
the present disclosure provides an article originality evaluation system, including:
a text data preprocessing module, used for preprocessing documents;
a stock document management subsystem, which includes a document storage module used for maintaining a document library in which stock documents are stored, wherein stock documents are documents that have high originality in the service scenario and have been determined to need intellectual property protection;
a word meaning similar document candidate submodule ES, used for providing candidate similar documents that are similar at the word level;
a semantic similar document candidate submodule Milvus, used for providing candidate similar documents that are semantically similar;
a feature storage submodule Mongo, used for storing all feature data of the documents;
wherein the document storage module is used for storing document data into the word meaning similar document candidate submodule ES, the semantic similar document candidate submodule Milvus and the feature storage submodule Mongo;
an article originality calculation subsystem, used for calculating the originality of the article being evaluated;
wherein the article originality calculation subsystem is composed of a candidate similar document retrieval module and an originality calculation module;
the candidate similar document retrieval module is used for recalling the stock documents in the document library that may be similar to the document to be evaluated;
the originality calculation module is used for calculating the originality of the document to be evaluated based on its similarity to the documents stored in the document library;
the text data preprocessing module is arranged between the article originality calculation subsystem and the stock document management subsystem;
the word meaning similar document candidate submodule ES and the semantic similar document candidate submodule Milvus are arranged between the article originality calculation subsystem and the stock document management subsystem and are each connected to the document storage module and the candidate similar document retrieval module;
the feature storage submodule Mongo is arranged between the article originality calculation subsystem and the stock document management subsystem and is connected to the document storage module and the originality calculation module.
As shown in Fig. 2,
further, the text data preprocessing module is specifically configured to:
carry out document cleaning and feature extraction on the document;
namely, segment the document to be evaluated into N paragraphs to obtain a paragraph set paras,
$paras = (p_1, p_2, \ldots, p_n, \ldots, p_N)$, where $p_n$, $n = 1, 2, \ldots, N$, denotes a document paragraph after segmentation, and N is an integer greater than or equal to 2;
and calculate the word features and distributed representation of the document to be evaluated:
performing word segmentation and stop-word removal on the segmented document to obtain word features, namely a bag-of-words model;
and performing sentence vector calculation on the segmented document to obtain a distributed representation.
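The two feature types can be sketched in Python as follows. Everything here is illustrative: the stop-word list is made up, the regular-expression tokenizer stands in for a real word segmenter, and `toy_sentence_vector` is a hashed bag-of-words vector used only as a placeholder for whatever sentence-vector model the system actually uses, which the disclosure does not name.

```python
import hashlib
import math
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}  # illustrative list


def word_features(paragraph: str) -> set[str]:
    """Bag-of-words features: lower-cased tokens with stop words removed."""
    tokens = re.findall(r"[a-z0-9]+", paragraph.lower())
    return {token for token in tokens if token not in STOP_WORDS}


def toy_sentence_vector(paragraph: str, dim: int = 16) -> list[float]:
    """Hashed token counts, L2-normalised; a placeholder for a sentence encoder."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", paragraph.lower()):
        # md5 gives a hash that is stable across runs, unlike the built-in hash().
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


print(word_features("The originality of an article"))
print(len(toy_sentence_vector("The originality of an article")))
```

The word set would feed the word-level retrieval path and the vector would feed the semantic retrieval path.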
Further, the candidate similar document retrieval module is specifically configured to:
denote the word meaning candidate similar paragraph set retrieved for paragraph $p_n$ from the word meaning similar document candidate submodule ES as $cand\_list_{wordbag} = (c_1, c_2, \ldots, c_i, \ldots, c_I)$, where $c_i$ denotes a word meaning similar paragraph retrieved for paragraph $p_n$, $i = 1, 2, \ldots, I$, and I is an integer greater than or equal to 2;
denote the semantic candidate similar paragraph set retrieved for paragraph $p_n$ from the semantic similar document candidate submodule Milvus as $cand\_list_{distvec} = (d_1, d_2, \ldots, d_j, \ldots, d_J)$, where $d_j$ denotes a semantically similar paragraph retrieved for paragraph $p_n$, $j = 1, 2, \ldots, J$, and J is an integer greater than or equal to 2;
use a threshold to determine whether the word meaning candidate similar paragraph set $cand\_list_{wordbag}$ and the semantic candidate similar paragraph set $cand\_list_{distvec}$ are recalled, and denote the set of all recalled candidate similar paragraphs as:
$cand\_list = cand\_list_{wordbag} \cup cand\_list_{distvec} = (s_1, s_2, \ldots, s_k, \ldots, s_K)$;
where $k = 1, 2, \ldots, K$, and K is an integer greater than or equal to 2.
Further, the originality calculation module is specifically configured to:
calculate the originality of the article using the following formula;
degree of originality of the article: (formula not reproduced in the source text), where $score_n$ is the originality of the n-th paragraph of the document to be evaluated;
wherein,
$score_n = \min(score\_wordbag_n, score\_distvec_n)$,
where $score\_wordbag_n$ is the originality score of the n-th article paragraph under the bag-of-words model, and $score\_distvec_n$ is the originality score of the n-th article paragraph under the distributed representation.
Further, the originality score under the bag-of-words model is calculated by the following formula:
(formula not reproduced in the source text)
where the formula involves the set of words within paragraph $p_n$; the number of words that paragraphs $p_n$ and $s_k$ have in common; and, in the denominator, the absolute difference between the lengths of the two paragraphs. The coefficient β is the weight of the document length difference factor in calculating the text similarity, and is 0.5 by default.
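Because the formula itself is missing from this text, the following Python sketch is only one possible form that is consistent with the description (shared-word count in the numerator, a β-weighted length difference in the denominator). The choice of the smaller word-set size as the other denominator term, the use of word-set sizes as paragraph lengths, and the conversion to originality as one minus the best similarity are all assumptions, not the patented formula.

```python
def wordbag_similarity(words_p: set, words_s: set, beta: float = 0.5) -> float:
    """Assumed similarity: shared words over (smaller set size + beta * size gap)."""
    shared = len(words_p & words_s)
    size_gap = abs(len(words_p) - len(words_s))
    denominator = min(len(words_p), len(words_s)) + beta * size_gap
    return shared / denominator if denominator else 0.0


def wordbag_originality(words_p: set, candidates: list, beta: float = 0.5) -> float:
    """Assumed originality: 1 minus the highest similarity to any recalled paragraph."""
    if not candidates:
        return 1.0
    return 1.0 - max(wordbag_similarity(words_p, s, beta) for s in candidates)


p_n = {"article", "originality", "evaluation", "system"}
recalled = [{"article", "originality"}, {"magnetic", "disk"}]
print(wordbag_originality(p_n, recalled))
```

With β = 0.5, a large length gap between two paragraphs lowers their similarity even when the shorter one is fully contained in the longer one.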
Further, the originality score under the distributed representation is calculated by the following formula:
(formula not reproduced in the source text)
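The formula for the distributed-representation score is not available in this text. As a hedged illustration only, the Python sketch below uses cosine similarity between sentence vectors, a common choice for comparing distributed representations, and takes one minus the highest similarity as the originality score. Both choices are assumptions and should not be read as the patented formula.

```python
import math


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def distvec_originality(vec_p, candidate_vecs) -> float:
    """Assumed originality: 1 minus the best cosine similarity, clipped to [0, 1]."""
    if not candidate_vecs:
        return 1.0
    best = max(cosine(vec_p, v) for v in candidate_vecs)
    return 1.0 - max(0.0, min(1.0, best))


print(distvec_originality([1.0, 0.0], [[0.6, 0.8], [0.0, 1.0]]))
```

A paragraph with no recalled candidates is treated as fully original in this sketch, which is also an assumption.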
In the disclosed article originality evaluation system, the main calculation is completed offline, so inference is fast, and the system can also support real-time calculation.
The disclosed article originality evaluation system works in two dimensions, literal and semantic, and can effectively handle 'manuscript washing'.
Example two:
As shown in Fig. 3,
the present disclosure further provides an article originality evaluation method, including:
S201: performing document cleaning and feature extraction on the document to be evaluated using the text data preprocessing module, and calculating the word features and distributed representation of the document to be evaluated;
Specifically,
denoting the word meaning candidate similar paragraph set retrieved for paragraph $p_n$ from the word meaning similar document candidate submodule ES as $cand\_list_{wordbag} = (c_1, c_2, \ldots, c_i, \ldots, c_I)$, where $c_i$ denotes a word meaning similar paragraph retrieved for paragraph $p_n$, $i = 1, 2, \ldots, I$, and I is an integer greater than or equal to 2;
denoting the semantic candidate similar paragraph set retrieved for paragraph $p_n$ from the semantic similar document candidate submodule Milvus as $cand\_list_{distvec} = (d_1, d_2, \ldots, d_j, \ldots, d_J)$, where $d_j$ denotes a semantically similar paragraph retrieved for paragraph $p_n$, $j = 1, 2, \ldots, J$, and J is an integer greater than or equal to 2;
S202: recalling, using the candidate similar document retrieval module, the stock documents in the document library that may be similar to the document to be evaluated;
Specifically,
using a threshold to determine whether the word meaning candidate similar paragraph set $cand\_list_{wordbag}$ and the semantic candidate similar paragraph set $cand\_list_{distvec}$ are recalled, and denoting the set of all recalled candidate similar paragraphs as:
$cand\_list = cand\_list_{wordbag} \cup cand\_list_{distvec} = (s_1, s_2, \ldots, s_k, \ldots, s_K)$;
where $k = 1, 2, \ldots, K$, and K is an integer greater than or equal to 2;
S203: calculating, using the originality calculation module, the originality of the document to be evaluated based on its similarity to the documents stored in the document library.
Specifically,
calculating the originality of the article using the following formula;
degree of originality of the article: (formula not reproduced in the source text), where $score_n$ is the originality of the n-th paragraph of the document to be evaluated;
wherein,
$score_n = \min(score\_wordbag_n, score\_distvec_n)$,
where $score\_wordbag_n$ is the originality score of the n-th article paragraph under the bag-of-words model, and $score\_distvec_n$ is the originality score of the n-th article paragraph under the distributed representation.
Further, the originality score under the bag-of-words model is calculated by the following formula:
(formula not reproduced in the source text)
where the formula involves the set of words within paragraph $p_n$; the number of words that paragraphs $p_n$ and $s_k$ have in common; and, in the denominator, the absolute difference between the lengths of the two paragraphs. The coefficient β is the weight of the document length difference factor in calculating the text similarity, and is 0.5 by default.
Further, the originality score under the distributed representation is calculated by the following formula:
(formula not reproduced in the source text)
Further, the method also includes:
carrying out document cleaning and feature extraction on the document to be put in storage using the text data preprocessing module;
storing, using the document storage module, the preprocessed document data into the corresponding databases: the word meaning similar document candidate submodule ES, the semantic similar document candidate submodule Milvus and the feature storage submodule Mongo.
Example three:
The present disclosure further provides a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the above-described article originality evaluation method.
The computer storage medium of the present disclosure may be implemented with a semiconductor memory, a magnetic core memory, a magnetic drum memory, or a magnetic disk memory.
Semiconductor memories are mainly used as the semiconductor memory elements of computers, and there are two types: MOS and bipolar memory elements. MOS elements have high integration and a simple process, but are slow. Bipolar elements have a complex process, high power consumption and low integration, but are fast. The introduction of NMOS and CMOS made MOS memory dominant among semiconductor memories. NMOS is fast; for example, Intel's 1K-bit SRAM has an access time of 45 ns. CMOS power consumption is low, and the access time of a 4K-bit CMOS static memory is 300 ns. The semiconductor memories described above are all random access memories (RAM), i.e. new contents can be read and written randomly during operation. Semiconductor read-only memory (ROM), which can be read randomly but cannot be written during operation, is used to store fixed programs and data. ROM is classified into non-rewritable fuse-type ROM (PROM) and rewritable EPROM.
Magnetic core memory has the characteristics of low cost and high reliability, and has more than 20 years of practical use. Magnetic core memories were widely used as main memories before the mid-1970s. The storage capacity can reach more than 10 bits, and the fastest access time is 300 ns. Typical international magnetic core memories have a capacity of 4 MS-8 MB and an access cycle of 1.0-1.5 μs. After semiconductor memory developed rapidly and replaced magnetic core memory as main memory, magnetic core memory could still be used as large-capacity expansion memory.
Drum memory is an external memory for magnetic recording. It is being replaced by disk memory, but because of its fast information access and stable, reliable operation it is still used as external memory for real-time process control computers and for medium and large computers. To meet the needs of small and micro computers, subminiature magnetic drums have emerged, which are small, lightweight, highly reliable and convenient to use.
Magnetic disk memory is an external memory for magnetic recording. It combines the advantages of drum and tape storage: its storage capacity is larger than that of a drum, its access speed is faster than that of tape storage, and it can be stored offline, so magnetic disks are widely used as large-capacity external storage in various computer systems. Magnetic disks are generally classified into two main categories, hard disk and floppy disk memories.
Hard disk memories come in a wide variety. Structurally they are divided into replaceable and fixed types: in the replaceable type the disk can be replaced, and in the fixed type it cannot. Both replaceable and fixed disks come in multi-platter and single-platter structures, and both are divided into fixed-head and movable-head types. A fixed-head disk has a small capacity, a low recording density, a high access speed and a high cost. A movable-head disk has a high recording density (up to 1000 to 6250 bits/inch) and thus a large capacity, but a lower access speed than a fixed-head disk. The storage capacity of a magnetic disk product can reach several hundred megabytes with a bit density of 6250 bits per inch and a track density of 475 tracks per inch. The disk pack of a multi-platter replaceable disk memory can be exchanged, so such a memory offers large offline capacity as well as large capacity and high speed, can store large volumes of information data, and is widely applied in online information retrieval systems and database management systems.
Example four:
The present disclosure also provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the steps of the above-mentioned article originality evaluation method are implemented.
Fig. 4 is a schematic diagram of the internal structure of the electronic device in one embodiment. As shown in Fig. 4, the electronic device includes a processor, a storage medium, a memory and a network interface connected through a system bus. The storage medium of the electronic device stores an operating system, a database and computer readable instructions; the database can store control information sequences, and the computer readable instructions, when executed by the processor, enable the processor to implement the article originality evaluation method. The processor of the electronic device provides computing and control capabilities and supports the operation of the entire device. The memory of the electronic device may store computer readable instructions that, when executed by the processor, cause the processor to execute the article originality evaluation method. The network interface of the electronic device is used for connecting to and communicating with a terminal. Those skilled in the art will appreciate that the architecture shown in Fig. 4 is merely a block diagram of the part of the structure relevant to the disclosed solution and does not limit the computing devices to which the solution applies; a particular computing device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
The electronic device includes, but is not limited to, a smart phone, a computer, a tablet, a wearable smart device, an artificial intelligence device, a mobile power source, and the like.
The processor may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The processor is a Control Unit of the electronic device, connects various components of the electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device by running or executing programs or modules (for example, executing remote data reading and writing programs, etc.) stored in the memory and calling data stored in the memory.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connected communication between the memory and at least one processor or the like.
Fig. 4 shows only an electronic device having components, and those skilled in the art will appreciate that the structure shown in fig. 4 does not constitute a limitation of the electronic device, and may include fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
For example, although not shown, the electronic device may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor through a power management device, so that functions such as charge management, discharge management, and power consumption management are implemented through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device may include a network interface. Optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface, a Bluetooth interface, etc.), which is generally used to establish a communication connection between the electronic device and other electronic devices.
Optionally, the electronic device may further comprise a user interface, which may be a display or an input unit (such as a keyboard), and optionally a standard wired interface or a wireless interface. In some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is used for displaying information processed in the electronic device and for displaying a visualized user interface.
Further, the computer usable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative: the division of the modules is only one logical functional division, and other divisions may be used in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.
Claims (10)
1. An article originality evaluation method is characterized by comprising the following steps:
preprocessing a document to be stored and storing the preprocessed document;
performing word-sense similar document candidate processing, semantically similar document candidate processing and/or feature extraction and storage on the newly stored documents;
recalling stock documents in the document library that may be similar to the document to be evaluated;
and calculating the originality of the document to be evaluated based on the similarity of the document to be evaluated and the stock document, wherein the stock document is a document which has higher originality in the service scene and is determined to need intellectual property protection, and the stock document is stored in a document library.
2. The method of claim 1, wherein the preprocessing specifically comprises:
carrying out document cleaning and feature extraction on the document;
calculating word features and a distributed representation of the document to be evaluated, namely segmenting the document to be evaluated into N paragraphs to obtain a paragraph set paras,
$paras = (p_1, p_2, \ldots, p_n, \ldots, p_N)$, where $p_n$ ($n = 1, 2, \ldots, N$) represents a document paragraph after segmentation, and $N$ is an integer greater than or equal to 2.
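By way of illustration only, the segmentation step of claim 2 might be sketched as follows. The blank-line paragraph delimiter, the lowercase word tokenization and the function names `split_paragraphs` and `word_features` are assumptions made for this sketch and are not specified in the claims; the distributed representation is omitted because it would depend on an embedding model that the text does not name.

```python
import re


def split_paragraphs(document: str) -> list[str]:
    """Split a cleaned document into paragraphs p_1..p_N.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    parts = re.split(r"\n\s*\n", document)
    return [p.strip() for p in parts if p.strip()]


def word_features(paragraph: str) -> set[str]:
    """Word-level feature of a paragraph: its set of lowercase word tokens."""
    return set(re.findall(r"\w+", paragraph.lower()))


doc = "First paragraph about cats.\n\nSecond paragraph about dogs.\n\n\nThird one."
paras = split_paragraphs(doc)            # the paragraph set (p_1, ..., p_N)
features = [word_features(p) for p in paras]
```

Here `paras` plays the role of the paragraph set and `features[n]` holds the word features of paragraph $p_{n+1}$.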
3. The method according to claim 2, wherein recalling the stock documents in the document library that may be similar to the document to be evaluated specifically comprises:
obtaining, by search, the word-sense candidate similar paragraph set of paragraph $p_n$, $cand\_list_{wordbag} = (c_1, c_2, \ldots, c_i, \ldots, c_I)$, where $c_i$ represents a word-sense similar paragraph found for paragraph $p_n$, $i = 1, 2, \ldots, I$, and $I$ is an integer greater than or equal to 2;
obtaining, by search, the semantic candidate similar paragraph set of paragraph $p_n$, $cand\_list_{distvec} = (d_1, d_2, \ldots, d_j, \ldots, d_J)$, where $d_j$ represents a semantically similar paragraph found for paragraph $p_n$, $j = 1, 2, \ldots, J$, and $J$ is an integer greater than or equal to 2;
using a threshold to determine whether the word-sense candidate similar paragraph set $cand\_list_{wordbag}$ and the semantic candidate similar paragraph set $cand\_list_{distvec}$ are recalled, and recording all the recalled candidate similar paragraphs as:
$cand\_list = cand\_list_{wordbag} \cup cand\_list_{distvec} = (s_1, s_2, \ldots, s_k, \ldots, s_K)$;
where $k = 1, 2, \ldots, K$, and $K$ is an integer greater than or equal to 2.
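The recall step of claim 3 can be sketched, under stated assumptions, as a thresholded union of two hit lists. The claim does not say whether the threshold applies to each candidate or to a whole set, nor what the threshold values are; the sketch below assumes a per-candidate similarity threshold, and the function name `recall_candidates` and the default values 0.3 and 0.8 are hypothetical.

```python
def recall_candidates(wordbag_hits, distvec_hits,
                      wordbag_threshold=0.3, distvec_threshold=0.8):
    """Return the union cand_list of both candidate sets, without duplicates.

    Each hit is a (paragraph_id, similarity) pair. A candidate is recalled
    when its similarity reaches the threshold of the list it came from
    (assumed interpretation; thresholds are illustrative values).
    """
    recalled, seen = [], set()
    for hits, threshold in ((wordbag_hits, wordbag_threshold),
                            (distvec_hits, distvec_threshold)):
        for paragraph_id, similarity in hits:
            if similarity >= threshold and paragraph_id not in seen:
                seen.add(paragraph_id)
                recalled.append(paragraph_id)
    return recalled


wordbag_hits = [("c1", 0.9), ("c2", 0.1)]                 # word-sense candidates
distvec_hits = [("d1", 0.85), ("c1", 0.95), ("d2", 0.5)]  # semantic candidates
cand_list = recall_candidates(wordbag_hits, distvec_hits)
```

In this example `c2` and `d2` fall below their thresholds, and `c1` appears in both lists but is recorded once, so `cand_list` contains `c1` and `d1`.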
4. The method of claim 3, wherein the calculating the originality of the document to be evaluated comprises:
calculating the originality of the article by using the following formula:
article originality = [formula not available], where $score_n$ is the originality of the $n$-th paragraph of the document to be evaluated;
wherein
$score_n = \min(score\_wordbag_n, score\_distvec_n)$,
where $score\_wordbag_n$ is the originality score of the $n$-th paragraph of the article under the bag-of-words model, and $score\_distvec_n$ is the originality score of the $n$-th paragraph of the article under the distributed representation.
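As a minimal sketch of the per-paragraph rule in claim 4, the following takes the smaller of the two scores for each paragraph. The function name `paragraph_originality` and the sample scores are illustrative assumptions; the article-level aggregation of the paragraph scores is deliberately not shown, since its formula is not given in the text.

```python
def paragraph_originality(score_wordbag: float, score_distvec: float) -> float:
    """score_n = min(score_wordbag_n, score_distvec_n)."""
    return min(score_wordbag, score_distvec)


# Illustrative per-paragraph scores under the two representations.
wordbag_scores = [0.9, 0.4, 1.0]
distvec_scores = [0.7, 0.6, 1.0]
scores = [paragraph_originality(w, d)
          for w, d in zip(wordbag_scores, distvec_scores)]
```

Taking the minimum is a conservative choice: a paragraph is treated as original only to the extent that both the bag-of-words view and the distributed-representation view agree that it is.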
5. The method of claim 4, wherein the originality score under the bag of words model is calculated by the following formula:
[formula not available]
where, in the formula, one term denotes the set of words within paragraph $p_n$; another term represents the number of words that paragraphs $p_n$ and $s_k$ have in common; the term in the denominator represents the absolute difference of the lengths of the two paragraphs; and the coefficient $\beta$ represents the weight of the document length difference factor in calculating the text similarity, and is 0.5 by default.
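Because the formula of claim 5 is not given in the text, the following is only a hypothetical sketch built from the ingredients the claim names: the number of shared words, the absolute length difference in the denominator, and the weight $\beta = 0.5$. The specific form assumed here, similarity $= |W(p_n) \cap W(s_k)| \,/\, (|W(p_n)| + \beta \cdot |len(p_n) - len(s_k)|)$ with originality taken as one minus the highest similarity over the recalled candidates, is an assumption and should not be read as the claimed formula. The function names are likewise hypothetical.

```python
import re


def tokens(text: str) -> list[str]:
    """Lowercase word tokens (assumed tokenization)."""
    return re.findall(r"\w+", text.lower())


def wordbag_similarity(p: str, s: str, beta: float = 0.5) -> float:
    """Assumed similarity: shared words over (own words + beta * length gap)."""
    tp, ts = tokens(p), tokens(s)
    wp, ws = set(tp), set(ts)
    if not wp:
        return 0.0
    shared = len(wp & ws)                 # words p and s have in common
    length_gap = abs(len(tp) - len(ts))   # absolute difference of lengths
    return shared / (len(wp) + beta * length_gap)


def score_wordbag(p: str, candidates: list[str], beta: float = 0.5) -> float:
    """Assumed originality: 1 minus the best similarity to any candidate."""
    if not candidates:
        return 1.0
    return 1.0 - max(wordbag_similarity(p, s, beta) for s in candidates)


p = "the quick brown fox"
near_copy = "the quick red fox jumps high"
score = score_wordbag(p, [near_copy])
```

Under this assumed form, a larger $\beta$ penalizes candidates whose length differs from that of $p_n$, so a long paragraph that merely contains the words of a short one is scored as less similar.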
7. An article originality evaluation system is characterized by comprising:
the text data preprocessing module is used for preprocessing documents;
the stock document management subsystem comprises a document storage module used for maintaining a document library in which stock documents are stored, wherein the stock documents are documents which have higher originality in the service scenario and are identified as needing intellectual property protection;
the word-sense similar document candidate submodule ES is used for providing candidate similar documents that are similar at the word level;
the semantically similar document candidate submodule Milvus is used for providing candidate similar documents that are semantically similar;
the feature storage submodule Mongo is used for storing all feature data of the documents;
the document storage module is used for storing document data into the word-sense similar document candidate submodule ES, the semantically similar document candidate submodule Milvus and the feature storage submodule Mongo;
the article originality calculation subsystem is used for calculating the originality of the article to be evaluated;
the article originality calculation subsystem is composed of a candidate similar document retrieval module and an originality calculation module;
the candidate similar document retrieval module is used for recalling stock documents in the document library that may be similar to the document to be evaluated;
the originality calculation module is used for calculating the originality of the document to be evaluated based on the similarity between the document to be evaluated and the stock documents stored in the document library;
the text data preprocessing module is arranged between the article originality calculation subsystem and the stock document management subsystem;
the word-sense similar document candidate submodule ES and the semantically similar document candidate submodule Milvus are arranged between the article originality calculation subsystem and the stock document management subsystem, and are each connected with the document storage module and the candidate similar document retrieval module;
the feature storage submodule Mongo is arranged between the article originality calculation subsystem and the stock document management subsystem, and is connected with the document storage module and the originality calculation module.
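The data flow between the modules of claim 7 can be illustrated with in-memory stand-ins. This sketch does not use any real search-engine, vector-database or document-database client: the dictionaries `word_index`, `vector_index` and `feature_store` merely play the roles of the submodules ES, Milvus and Mongo, and all class names, method names and thresholds are hypothetical.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class DocumentLibrary:
    word_index: dict = field(default_factory=dict)     # role of submodule ES
    vector_index: dict = field(default_factory=dict)   # role of submodule Milvus
    feature_store: dict = field(default_factory=dict)  # role of submodule Mongo


class DocumentStorageModule:
    """Writes each stock document into all three storage submodules."""

    def __init__(self, library):
        self.library = library

    def store(self, doc_id, words, vector):
        self.library.word_index[doc_id] = set(words)
        self.library.vector_index[doc_id] = list(vector)
        self.library.feature_store[doc_id] = {"words": set(words),
                                              "vector": list(vector)}


class CandidateRetrievalModule:
    """Recalls stock documents that are similar at the word or semantic level."""

    def __init__(self, library):
        self.library = library

    def recall(self, words, vector, min_shared=1, min_cosine=0.9):
        hits = [d for d, w in self.library.word_index.items()
                if len(w & set(words)) >= min_shared]
        hits += [d for d, v in self.library.vector_index.items()
                 if d not in hits and cosine(v, vector) >= min_cosine]
        return hits


library = DocumentLibrary()
storage = DocumentStorageModule(library)
storage.store("a", ["cat", "dog"], [1.0, 0.0])
storage.store("b", ["fish"], [0.0, 1.0])
retrieval = CandidateRetrievalModule(library)
recalled = retrieval.recall(["cat"], [0.0, 1.0])
```

In this sketch the storage module is the only writer and the retrieval module only reads the two candidate indexes, which mirrors the connections recited in the claim; an originality calculation module would then read the recalled documents' features from `feature_store`.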
8. The system of claim 7, wherein the text data preprocessing module is specifically configured to:
carrying out document cleaning and feature extraction on the document;
calculating word features and a distributed representation of the document to be evaluated, namely segmenting the document to be evaluated into N paragraphs to obtain a paragraph set paras,
$paras = (p_1, p_2, \ldots, p_n, \ldots, p_N)$, where $p_n$ ($n = 1, 2, \ldots, N$) represents a document paragraph after segmentation, and $N$ is an integer greater than or equal to 2.
9. An electronic device comprising a memory, a processor and a computer program stored in the memory and operable on the processor, wherein the processor implements the steps corresponding to the method for evaluating the originality of an article as claimed in any one of claims 1 to 6 when executing the computer program.
10. A computer storage medium having computer program instructions stored thereon, wherein the program instructions, when executed by a processor, are for implementing the steps corresponding to the article originality assessment method of any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111091198.5A CN113792119A (en) | 2021-09-17 | 2021-09-17 | Article originality evaluation system, method, device and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111091198.5A CN113792119A (en) | 2021-09-17 | 2021-09-17 | Article originality evaluation system, method, device and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113792119A true CN113792119A (en) | 2021-12-14 |
Family
ID=79183893
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111091198.5A Pending CN113792119A (en) | 2021-09-17 | 2021-09-17 | Article originality evaluation system, method, device and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113792119A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110347782A (en) * | 2019-07-18 | 2019-10-18 | 知者信息技术服务成都有限公司 | Article duplicate checking method, apparatus and electronic equipment |
US20200394186A1 (en) * | 2019-06-11 | 2020-12-17 | International Business Machines Corporation | Nlp-based context-aware log mining for troubleshooting |
CN113011194A (en) * | 2021-04-15 | 2021-06-22 | 电子科技大学 | Text similarity calculation method fusing keyword features and multi-granularity semantic features |
CN113377927A (en) * | 2021-06-28 | 2021-09-10 | 成都卫士通信息产业股份有限公司 | Similar document detection method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102541968B (en) | Indexing method | |
CN110162522B (en) | Distributed data search system and method | |
CN103678277A (en) | Theme-vocabulary distribution establishing method and system based on document segmenting | |
CN113449187A (en) | Product recommendation method, device and equipment based on double portraits and storage medium | |
CN111666415A (en) | Topic clustering method and device, electronic equipment and storage medium | |
CN105550354A (en) | Configuration file management method and system | |
CN102959548B (en) | Date storage method, lookup method and device | |
CN115018588A (en) | Product recommendation method and device, electronic equipment and readable storage medium | |
CN112766512A (en) | Deep learning framework diagnosis system, method, device, equipment and medium based on meta-operator | |
CN113971225A (en) | Image retrieval system, method and device | |
CN113360803A (en) | Data caching method, device and equipment based on user behavior and storage medium | |
CN114138784A (en) | Information tracing method and device based on storage library, electronic equipment and medium | |
CN113255682B (en) | Target detection system, method, device, equipment and medium | |
CN112308313A (en) | Method, device, medium and computer equipment for continuous point addressing of school | |
CN114754786A (en) | Truck navigation way finding method, device, equipment and medium | |
CN115878824A (en) | Image retrieval system, method and device | |
CN113792119A (en) | Article originality evaluation system, method, device and medium | |
US20210089539A1 (en) | Associating user-provided content items to interest nodes | |
CN113806539A (en) | Text data enhancement system, method, device and medium | |
US20160357822A1 (en) | Using locations to define moments | |
CN114692573A (en) | Text structuring method, apparatus, computer device, medium, and product | |
CN113537286B (en) | Image classification method, device, equipment and medium | |
CN107908724A (en) | A kind of data model matching process, device, equipment and storage medium | |
CN112989938A (en) | Real-time tracking and identifying method, device, medium and equipment for pedestrians | |
CN113866638A (en) | Battery parameter inference method, device, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||