CN113792119A - Article originality evaluation system, method, device and medium - Google Patents

Article originality evaluation system, method, device and medium

Info

Publication number
CN113792119A
Authority
CN
China
Prior art keywords
document
originality
similar
candidate
article
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111091198.5A
Other languages
Chinese (zh)
Inventor
李鹏宇
李剑锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202111091198.5A
Publication of CN113792119A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure relates to a system, method, device and medium for evaluating the originality of an article. The system includes: a text data preprocessing module; an inventory document management subsystem, which includes a document storage module; a word-sense similar document candidate submodule ES; a semantically similar document candidate submodule Milvus; a feature storage submodule Mongo; and an article originality calculation subsystem, which consists of a candidate similar document retrieval module and an originality calculation module. The text data preprocessing module is arranged between the article originality calculation subsystem and the inventory document management subsystem. The word-sense similar document candidate submodule ES and the semantically similar document candidate submodule Milvus are arranged between the article originality calculation subsystem and the inventory document management subsystem, and each is connected to the document storage module and to the candidate similar document retrieval module.

Description

Article originality evaluation system, method, device and medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a system, a method, a device, and a medium for evaluating originality of an article.
Background
The dramatic growth in the volume of data held by humans has been accompanied by large amounts of similar data. In some scenarios, the originality of a document must be measured in order to decide how the document should be handled. For example, an academic journal considers whether to accept a manuscript only after its originality has been preliminarily confirmed by a duplicate check; plagiarism, unauthorized reprinting and similar practices are widespread on the internet, and they can be detected efficiently only with a calculation tool based on originality.
Current originality calculation tools measure the similarity between two documents mainly by character-string matching, and they handle "manuscript washing" (rewording a copied text to disguise its source) poorly.
Disclosure of Invention
The present disclosure aims to solve the technical problem that prior-art methods for evaluating the originality of an article cannot meet the requirements of users.
To achieve the above technical object, the present disclosure provides an article originality evaluation method, which includes:
preprocessing a document to be stored and storing the preprocessed document in the document library;
performing word-sense similar document candidate processing, semantically similar document candidate processing and/or feature extraction and storage on the newly stored document;
recalling the inventory documents in the document library that may be similar to the document to be evaluated;
and calculating the originality of the document to be evaluated based on its similarity to the inventory documents, wherein an inventory document is a document that has high originality in the service scenario and has been determined to need intellectual property protection, and the inventory documents are stored in the document library.
Further, preprocessing the document to be stored specifically includes:
performing document cleaning and feature extraction on the document;
calculating word features and a distributed representation of the document to be evaluated; that is, segmenting the document to be evaluated into N paragraphs to obtain the paragraph set paras,
$paras = (p_1, p_2, \ldots, p_n, \ldots, p_N)$, where $p_n$ ($n = 1, 2, \ldots, N$) denotes a paragraph of the segmented document and N is an integer greater than or equal to 2.
Further, recalling the inventory documents in the document library that may be similar to the document to be evaluated specifically includes:
denoting the set of word-sense candidate similar paragraphs retrieved for paragraph $p_n$ as $cand\_list_{wordbag} = (c_1, c_2, \ldots, c_i, \ldots, c_I)$, where $c_i$ ($i = 1, 2, \ldots, I$) is a word-sense similar paragraph retrieved for $p_n$ and I is an integer greater than or equal to 2;
denoting the set of semantic candidate similar paragraphs retrieved for paragraph $p_n$ as $cand\_list_{distvec} = (d_1, d_2, \ldots, d_j, \ldots, d_J)$, where $d_j$ ($j = 1, 2, \ldots, J$) is a semantically similar paragraph retrieved for $p_n$ and J is an integer greater than or equal to 2;
using a threshold to decide whether the paragraphs in $cand\_list_{wordbag}$ and $cand\_list_{distvec}$ are recalled, and denoting the set of all recalled candidate similar paragraphs as:
$candlist = cand\_list_{wordbag} \cup cand\_list_{distvec} = (s_1, s_2, \ldots, s_k, \ldots, s_K)$;
where $k = 1, 2, \ldots, K$ and K is an integer greater than or equal to 2.
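The threshold-based recall and union described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the candidate shape (paragraph id plus retrieval score), the function name `recall_candidates`, and both threshold values are assumptions introduced here, since the text does not state how the threshold is applied or what its value is.

```python
from typing import Iterable, List, Tuple

# Hypothetical candidate shape: (paragraph id, retrieval score in [0, 1]).
Candidate = Tuple[str, float]


def recall_candidates(
    cand_list_wordbag: Iterable[Candidate],
    cand_list_distvec: Iterable[Candidate],
    wordbag_threshold: float = 0.3,   # assumed value, not from the source
    distvec_threshold: float = 0.7,   # assumed value, not from the source
) -> List[str]:
    """Return candlist: the union of both candidate sets after thresholding."""
    candlist: List[str] = []
    seen = set()
    sources = (
        (cand_list_wordbag, wordbag_threshold),
        (cand_list_distvec, distvec_threshold),
    )
    for candidates, threshold in sources:
        for paragraph_id, score in candidates:
            # Keep a candidate only if it passes its own threshold,
            # and keep each paragraph once even if both retrievers found it.
            if score >= threshold and paragraph_id not in seen:
                seen.add(paragraph_id)
                candlist.append(paragraph_id)
    return candlist
```

Keeping a separate threshold per retriever is a design choice of this sketch; word-level and vector-level scores are usually not on a comparable scale.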
Further, calculating the originality of the document to be evaluated specifically includes:
calculating the originality of the article using the following formula:
[Formula image BDA0003266979380000031: originality of the article]
where $score_n$ is the originality of the n-th paragraph of the document to be evaluated, and
$score_n = \min(score\_wordbag_n, score\_distvec_n)$,
where $score\_wordbag_n$ is the originality score of the n-th paragraph of the article under the bag-of-words model, and $score\_distvec_n$ is the originality score of the n-th paragraph of the article under the distributed representation.
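As a rough illustration of how the per-paragraph scores combine, the sketch below takes the minimum of the two scores for each paragraph, as stated in the text. The article-level formula is available only as an image, so the plain mean over paragraphs used in `article_originality` is an assumption made for illustration, and both function names are hypothetical.

```python
from typing import List, Sequence


def paragraph_scores(
    score_wordbag: Sequence[float],
    score_distvec: Sequence[float],
) -> List[float]:
    """score_n = min(score_wordbag_n, score_distvec_n) for every paragraph n."""
    if len(score_wordbag) != len(score_distvec):
        raise ValueError("both score lists must cover the same paragraphs")
    return [min(w, d) for w, d in zip(score_wordbag, score_distvec)]


def article_originality(scores: Sequence[float]) -> float:
    """Aggregate paragraph scores into one value.

    ASSUMPTION: a plain mean is used here; the source gives the
    aggregation only as a formula image that is not reproduced.
    """
    if not scores:
        raise ValueError("at least one paragraph score is required")
    return sum(scores) / len(scores)
```

Taking the minimum means a paragraph counts as original only if it looks original both literally and semantically, which is what lets the semantic score catch reworded copies.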
Further, the originality score under the bag-of-words model is calculated by the following formula:
[Formula image BDA0003266979380000032]
where
[Formula image BDA0003266979380000033]
In the formula,
[Formula image BDA0003266979380000034]
is the set of words in paragraph $p_n$,
[Formula image BDA0003266979380000035]
[Formula image BDA0003266979380000036]
represents the number of identical words contained in paragraphs $p_n$ and $s_k$; in the denominator,
[Formula image BDA0003266979380000037]
represents the absolute difference between the lengths of the two paragraphs; the coefficient β is the weight of the length-difference factor in the text similarity calculation and defaults to 0.5.
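Because the bag-of-words formulas survive only as images, the following sketch shows one hypothetical form that is consistent with the textual description: the number of shared words in the numerator, and a denominator that includes the absolute length difference weighted by β (default 0.5). The choice of `min(len(...))` as the other denominator term, the "one minus the best similarity" step, and the function names are all assumptions, not the disclosed formula.

```python
from typing import Iterable, Set


def wordbag_similarity(words_p: Set[str], words_s: Set[str], beta: float = 0.5) -> float:
    """Hypothetical bag-of-words similarity between paragraphs p_n and s_k."""
    shared = len(words_p & words_s)                 # number of identical words
    length_gap = abs(len(words_p) - len(words_s))   # absolute length difference
    # ASSUMPTION: the smaller set size is the base of the denominator.
    denominator = min(len(words_p), len(words_s)) + beta * length_gap
    if denominator == 0:
        return 0.0  # both paragraphs are empty
    return shared / denominator


def score_wordbag(words_p: Set[str], candidates: Iterable[Set[str]], beta: float = 0.5) -> float:
    """ASSUMPTION: originality is one minus the highest similarity to any candidate."""
    best = max((wordbag_similarity(words_p, c, beta) for c in candidates), default=0.0)
    return 1.0 - best
```

With this form the similarity stays in [0, 1], and a larger β penalizes pairs of paragraphs whose lengths differ more.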
Further, the originality score under the distributed representation is calculated by the following formula:
[Formula image BDA0003266979380000038]
where the cosine distance is
[Formula image BDA0003266979380000039]
In the formula,
[Formula image BDA00032669793800000310]
is the distributed representation of the n-th text paragraph.
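The cosine distance itself is standard (one minus the cosine similarity of two vectors), so it can be sketched directly. How the disclosed formula turns distances into a score is not recoverable from the text; taking the distance to the nearest candidate and clamping it to [0, 1] in `score_distvec` is an assumption, as are the function names and the handling of zero vectors.

```python
import math
from typing import Iterable, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine distance: 1 - (u . v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # convention of this sketch: a zero vector is unrelated
    return 1.0 - dot / (norm_u * norm_v)


def score_distvec(vec_p: Sequence[float], candidate_vecs: Iterable[Sequence[float]]) -> float:
    """ASSUMPTION: originality is the distance to the nearest candidate, clamped to [0, 1]."""
    nearest = min((cosine_distance(vec_p, c) for c in candidate_vecs), default=1.0)
    return max(0.0, min(1.0, nearest))
```

A paragraph with no recalled candidates gets the maximum score in this sketch, on the reasoning that nothing similar was found for it.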
To achieve the above technical object, the present disclosure also provides an article originality evaluation method, including:
performing document cleaning and feature extraction on the document to be evaluated by using the text data preprocessing module, and calculating word features and a distributed representation of the document to be evaluated;
recalling, by using the candidate similar document retrieval module, the inventory documents in the document library that may be similar to the document to be evaluated;
and calculating, by using the originality calculation module, the originality of the document to be evaluated based on its similarity to the inventory documents stored in the document library.
Further, the method also includes:
performing document cleaning and feature extraction on the document to be stored by using the text data preprocessing module;
storing, by using the document storage module, the preprocessed document data in the corresponding databases: the word-sense similar document candidate submodule ES, the semantically similar document candidate submodule Milvus and the feature storage submodule Mongo.
To achieve the above technical objects, the present disclosure can also provide a computer storage medium having stored thereon a computer program for implementing the steps of the above-described article originality assessment method when executed by a processor.
In order to achieve the above technical object, the present disclosure further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the article originality assessment method when executing the computer program.
The beneficial effects of the present disclosure are as follows:
In the article originality evaluation system of the present disclosure, the main calculation is completed offline, so inference is fast; the system can also support real-time calculation.
The article originality evaluation system of the present disclosure evaluates both the literal and the semantic dimension, and can therefore effectively handle "manuscript washing".
Drawings
Fig. 1 shows a schematic structural diagram of the system of embodiment 1 of the present disclosure;
Fig. 2 shows a schematic structural diagram of the text data preprocessing module of the system of embodiment 1 of the present disclosure;
Fig. 3 shows a flow diagram of the method of embodiment 2 of the present disclosure;
Fig. 4 shows a schematic structural diagram of embodiment 4 of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
Various structural schematics according to embodiments of the present disclosure are shown in the figures. The figures are not drawn to scale, wherein certain details are exaggerated and possibly omitted for clarity of presentation. The shapes of various regions, layers, and relative sizes and positional relationships therebetween shown in the drawings are merely exemplary, and deviations may occur in practice due to manufacturing tolerances or technical limitations, and a person skilled in the art may additionally design regions/layers having different shapes, sizes, relative positions, as actually required.
Embodiment 1:
As shown in Fig. 1,
the present disclosure provides an article originality evaluation system, including:
a text data preprocessing module, used for preprocessing documents;
an inventory document management subsystem, which includes a document storage module used for maintaining a document library; inventory documents are stored in the document library, an inventory document being a document that has high originality in the service scenario and has been determined to need intellectual property protection;
a word-sense similar document candidate submodule ES, used for providing candidate similar documents that are similar at the word level;
a semantically similar document candidate submodule Milvus, used for providing candidate similar documents that are semantically similar;
a feature storage submodule Mongo, used for storing all feature data of the documents;
the document storage module is used for storing document data into the word-sense similar document candidate submodule ES, the semantically similar document candidate submodule Milvus and the feature storage submodule Mongo;
an article originality calculation subsystem, used for calculating the originality of the article to be evaluated;
the article originality calculation subsystem consists of a candidate similar document retrieval module and an originality calculation module;
the candidate similar document retrieval module is used for recalling the inventory documents in the document library that may be similar to the document to be evaluated;
the originality calculation module is used for calculating the originality of the document to be evaluated based on its similarity to the inventory documents stored in the document library;
the text data preprocessing module is arranged between the article originality calculation subsystem and the inventory document management subsystem;
the word-sense similar document candidate submodule ES and the semantically similar document candidate submodule Milvus are arranged between the article originality calculation subsystem and the inventory document management subsystem, and each is connected to the document storage module and to the candidate similar document retrieval module;
the feature storage submodule Mongo is arranged between the article originality calculation subsystem and the inventory document management subsystem, and is connected to the document storage module and to the originality calculation module.
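The role of the document storage module, writing each preprocessed paragraph to three stores, can be sketched with in-memory stand-ins. The sketch deliberately avoids the real Elasticsearch, Milvus and MongoDB client APIs: the dictionaries below only mimic the three roles, and the names `ParagraphRecord`, `InMemoryStores` and `store_paragraph` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ParagraphRecord:
    """Features of one preprocessed paragraph (hypothetical record shape)."""
    paragraph_id: str
    words: List[str]      # bag-of-words features
    vector: List[float]   # distributed representation


@dataclass
class InMemoryStores:
    """Plain dictionaries standing in for the three submodules."""
    word_index: Dict[str, List[str]] = field(default_factory=dict)           # role of ES
    vector_index: Dict[str, List[float]] = field(default_factory=dict)       # role of Milvus
    feature_store: Dict[str, ParagraphRecord] = field(default_factory=dict)  # role of Mongo


def store_paragraph(stores: InMemoryStores, record: ParagraphRecord) -> None:
    """Write one paragraph to all three stores, as the document storage module does."""
    stores.word_index[record.paragraph_id] = list(record.words)
    stores.vector_index[record.paragraph_id] = list(record.vector)
    stores.feature_store[record.paragraph_id] = record
```

Splitting the data this way matches the connections described above: the retrieval module only needs the two indexes, while the originality calculation module reads full features from the feature store.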
As shown in Fig. 2,
further, the text data preprocessing module is specifically configured to:
perform document cleaning and feature extraction on the document;
that is, segment the document to be evaluated into N paragraphs to obtain the paragraph set paras,
$paras = (p_1, p_2, \ldots, p_n, \ldots, p_N)$, where $p_n$ ($n = 1, 2, \ldots, N$) denotes a paragraph of the segmented document and N is an integer greater than or equal to 2;
and calculate word features and a distributed representation of the document to be evaluated:
perform word segmentation and stop-word removal on the segmented document to obtain the word features, namely the bag-of-words model;
and perform sentence-vector calculation on the segmented document to obtain the distributed representation.
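The segmentation and bag-of-words steps can be sketched as below. Several simplifications are assumed: paragraphs are split on blank lines, tokens are taken with a regular expression suited to English rather than a Chinese word segmenter, and the stop-word list is a tiny hypothetical one. The sentence-vector step is omitted because it depends on an embedding model the text does not name.

```python
import re
from typing import FrozenSet, List, Set

# Tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS: FrozenSet[str] = frozenset({"the", "a", "an", "of", "and", "is", "to", "in"})


def split_paragraphs(document: str) -> List[str]:
    """Segment a document into paragraphs p_1..p_N (assumed: blank-line separated)."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]


def bag_of_words(paragraph: str, stop_words: FrozenSet[str] = STOP_WORDS) -> Set[str]:
    """Tokenize, lowercase and drop stop words to get the paragraph's word set."""
    tokens = re.findall(r"[a-z0-9]+", paragraph.lower())
    return {t for t in tokens if t not in stop_words}
```

For example, `split_paragraphs` applied to a two-paragraph text yields the list `paras`, and `bag_of_words` applied to each element yields the word features that would be indexed for word-level retrieval.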
Further, the candidate similar document retrieval module is specifically configured to:
denote the set of word-sense candidate similar paragraphs retrieved for paragraph $p_n$ from the word-sense similar document candidate submodule ES as $cand\_list_{wordbag} = (c_1, c_2, \ldots, c_i, \ldots, c_I)$, where $c_i$ ($i = 1, 2, \ldots, I$) is a word-sense similar paragraph retrieved for $p_n$ and I is an integer greater than or equal to 2;
denote the set of semantic candidate similar paragraphs retrieved for paragraph $p_n$ from the semantically similar document candidate submodule Milvus as $cand\_list_{distvec} = (d_1, d_2, \ldots, d_j, \ldots, d_J)$, where $d_j$ ($j = 1, 2, \ldots, J$) is a semantically similar paragraph retrieved for $p_n$ and J is an integer greater than or equal to 2;
use a threshold to decide whether the paragraphs in $cand\_list_{wordbag}$ and $cand\_list_{distvec}$ are recalled, and denote the set of all recalled candidate similar paragraphs as:
$candlist = cand\_list_{wordbag} \cup cand\_list_{distvec} = (s_1, s_2, \ldots, s_k, \ldots, s_K)$;
where $k = 1, 2, \ldots, K$ and K is an integer greater than or equal to 2.
Further, the originality calculation module is specifically configured to:
calculate the originality of the article using the following formula:
[Formula image BDA0003266979380000081: originality of the article]
where $score_n$ is the originality of the n-th paragraph of the document to be evaluated, and
$score_n = \min(score\_wordbag_n, score\_distvec_n)$,
where $score\_wordbag_n$ is the originality score of the n-th paragraph of the article under the bag-of-words model, and $score\_distvec_n$ is the originality score of the n-th paragraph of the article under the distributed representation.
Further, the originality score under the bag-of-words model is calculated by the following formula:
[Formula image BDA0003266979380000082]
where
[Formula image BDA0003266979380000083]
In the formula,
[Formula image BDA0003266979380000091]
is the set of words in paragraph $p_n$,
[Formula image BDA0003266979380000092]
[Formula image BDA0003266979380000093]
represents the number of identical words contained in paragraphs $p_n$ and $s_k$; in the denominator,
[Formula image BDA0003266979380000094]
represents the absolute difference between the lengths of the two paragraphs; the coefficient β is the weight of the length-difference factor in the text similarity calculation and defaults to 0.5.
Further, the originality score under the distributed representation is calculated by the following formula:
[Formula image BDA0003266979380000095]
where the cosine distance is
[Formula image BDA0003266979380000096]
In the formula,
[Formula image BDA0003266979380000097]
is the distributed representation of the n-th text paragraph.
In the article originality evaluation system of the present disclosure, the main calculation is completed offline, so inference is fast; the system can also support real-time calculation.
The article originality evaluation system of the present disclosure evaluates both the literal and the semantic dimension, and can therefore effectively handle "manuscript washing".
Embodiment 2:
As shown in Fig. 3,
the present disclosure also provides an article originality evaluation method, including:
S201: performing document cleaning and feature extraction on the document to be evaluated by using the text data preprocessing module, and calculating word features and a distributed representation of the document to be evaluated;
specifically,
denoting the set of word-sense candidate similar paragraphs retrieved for paragraph $p_n$ from the word-sense similar document candidate submodule ES as $cand\_list_{wordbag} = (c_1, c_2, \ldots, c_i, \ldots, c_I)$, where $c_i$ ($i = 1, 2, \ldots, I$) is a word-sense similar paragraph retrieved for $p_n$ and I is an integer greater than or equal to 2;
denoting the set of semantic candidate similar paragraphs retrieved for paragraph $p_n$ from the semantically similar document candidate submodule Milvus as $cand\_list_{distvec} = (d_1, d_2, \ldots, d_j, \ldots, d_J)$, where $d_j$ ($j = 1, 2, \ldots, J$) is a semantically similar paragraph retrieved for $p_n$ and J is an integer greater than or equal to 2;
S202: recalling, by using the candidate similar document retrieval module, the inventory documents in the document library that may be similar to the document to be evaluated;
specifically,
a threshold is used to decide whether the paragraphs in $cand\_list_{wordbag}$ and $cand\_list_{distvec}$ are recalled, and the set of all recalled candidate similar paragraphs is denoted:
$candlist = cand\_list_{wordbag} \cup cand\_list_{distvec} = (s_1, s_2, \ldots, s_k, \ldots, s_K)$;
where $k = 1, 2, \ldots, K$ and K is an integer greater than or equal to 2;
S203: calculating, by using the originality calculation module, the originality of the document to be evaluated based on its similarity to the inventory documents stored in the document library.
Specifically,
the originality of the article is calculated using the following formula:
[Formula image BDA0003266979380000101: originality of the article]
where $score_n$ is the originality of the n-th paragraph of the document to be evaluated, and
$score_n = \min(score\_wordbag_n, score\_distvec_n)$,
where $score\_wordbag_n$ is the originality score of the n-th paragraph of the article under the bag-of-words model, and $score\_distvec_n$ is the originality score of the n-th paragraph of the article under the distributed representation.
Further, the originality score under the bag-of-words model is calculated by the following formula:
[Formula image BDA0003266979380000111]
where
[Formula image BDA0003266979380000112]
In the formula,
[Formula image BDA0003266979380000113]
is the set of words in paragraph $p_n$,
[Formula image BDA0003266979380000114]
[Formula image BDA0003266979380000115]
represents the number of identical words contained in paragraphs $p_n$ and $s_k$; in the denominator,
[Formula image BDA0003266979380000116]
represents the absolute difference between the lengths of the two paragraphs; the coefficient β is the weight of the length-difference factor in the text similarity calculation and defaults to 0.5.
Further, the originality score under the distributed representation is calculated by the following formula:
[Formula image BDA0003266979380000117]
where the cosine distance is
[Formula image BDA0003266979380000118]
In the formula,
[Formula image BDA0003266979380000119]
is the distributed representation of the n-th text paragraph.
Further, the method also includes:
performing document cleaning and feature extraction on the document to be stored by using the text data preprocessing module;
storing, by using the document storage module, the preprocessed document data in the corresponding databases: the word-sense similar document candidate submodule ES, the semantically similar document candidate submodule Milvus and the feature storage submodule Mongo.
Embodiment 3:
The present disclosure also provides a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the above-described article originality evaluation method.
The computer storage medium of the present disclosure may be implemented with a semiconductor memory, a magnetic core memory, a magnetic drum memory, or a magnetic disk memory.
Semiconductor memory elements used in computers are mainly of two types, MOS and bipolar. MOS elements offer high integration and a simple process, but are slow. Bipolar elements involve a complex process, high power consumption and low integration, but are fast. The introduction of NMOS and CMOS made MOS memory dominant among semiconductor memories. NMOS is fast; for example, the access time of Intel's 1 Kbit static random access memory is 45 ns. CMOS consumes little power; the access time of a 4 Kbit CMOS static memory is 300 ns. The semiconductor memories described above are all random access memories (RAM), that is, new contents can be read and written randomly during operation. A semiconductor read-only memory (ROM), by contrast, can be read randomly but cannot be written during operation, and is used to store fixed programs and data. ROMs are classified into the non-rewritable fuse-type ROM (PROM) and the rewritable EPROM.
The magnetic core memory is low in cost and highly reliable, and has more than 20 years of practical use behind it. Magnetic core memories were widely used as main memories before the mid-1970s. The storage capacity can reach more than 10 bits, and the fastest access time is 300 ns. A typical international magnetic core memory has a capacity of 4 MB to 8 MB and an access cycle of 1.0 to 1.5 μs. After semiconductor memory developed rapidly and replaced magnetic core memory as the main memory, magnetic core memory could still be used as large-capacity expansion memory.
Drum memory is an external memory based on magnetic recording. Although it is being replaced by disk memory, it is still used as external memory for real-time process control computers and for medium and large computers because of its fast information access and its stable, reliable operation. To meet the needs of small and micro computers, subminiature magnetic drums have emerged, which are small, lightweight, highly reliable and convenient to use.
Magnetic disk memory is an external memory based on magnetic recording. It combines the advantages of drum and tape storage: its storage capacity is larger than that of a drum, its access speed is faster than that of tape storage, and it can be stored offline, so magnetic disks are widely used as large-capacity external storage in various computer systems. Magnetic disks are generally classified into two main categories, hard disks and floppy disks.
Hard disk memories come in many varieties. By structure they are divided into replaceable and fixed types: the disk of a replaceable type can be exchanged, whereas the disk of a fixed type cannot. Both replaceable and fixed magnetic disks exist as multi-platter assemblies and as single-platter structures, and both are further divided into fixed-head and movable-head types. A fixed-head magnetic disk has a small capacity, a low recording density, a high access speed and a high cost. A movable-head magnetic disk has a high recording density (1000 to 6250 bits/inch) and thus a large capacity, but a lower access speed than a fixed-head disk. The storage capacity of a magnetic disk product can reach several hundred megabytes, with a bit density of 6250 bits per inch and a track density of 475 tracks per inch. Because the disk pack of a multi-platter replaceable disk memory can be exchanged, such a memory has a large offline capacity as well as large capacity and high speed; it can store large volumes of information and is widely used in online information retrieval systems and database management systems.
Embodiment 4:
The present disclosure also provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the steps of the above-described article originality evaluation method are implemented.
Fig. 4 is a schematic diagram of the internal structure of the electronic device in one embodiment. As shown in Fig. 4, the electronic device includes a processor, a storage medium, a memory and a network interface connected through a system bus. The storage medium of the electronic device stores an operating system, a database and computer-readable instructions; the database can store control information sequences, and the computer-readable instructions, when executed by the processor, cause the processor to implement the article originality evaluation method. The processor of the electronic device provides computing and control capabilities and supports the operation of the entire device. The memory of the electronic device may store computer-readable instructions that, when executed by the processor, cause the processor to execute the article originality evaluation method. The network interface of the electronic device is used for connecting to and communicating with a terminal. Those skilled in the art will appreciate that the architecture shown in Fig. 4 is merely a block diagram of the part of the structure related to the disclosed solution and does not limit the devices to which the disclosed solution applies; a particular device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
The electronic device includes, but is not limited to, a smart phone, a computer, a tablet, a wearable smart device, an artificial intelligence device, a mobile power source, and the like.
The processor may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The processor is a Control Unit of the electronic device, connects various components of the electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device by running or executing programs or modules (for example, executing remote data reading and writing programs, etc.) stored in the memory and calling data stored in the memory.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connected communication between the memory and at least one processor or the like.
Fig. 4 shows only an electronic device having components, and those skilled in the art will appreciate that the structure shown in fig. 4 does not constitute a limitation of the electronic device, and may include fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
For example, although not shown, the electronic device may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor through a power management device, so that functions such as charge management, discharge management, and power consumption management are implemented through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device may include a network interface. Optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface or a Bluetooth interface), which is generally used to establish a communication connection between the electronic device and other electronic devices.
Optionally, the electronic device may further comprise a user interface, which may include a display and an input unit (such as a keyboard), and optionally a standard wired interface or a wireless interface. In some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is used for displaying information processed in the electronic device and for displaying a visualized user interface.
Further, the computer usable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative: the division of the modules is only one logical functional division, and other divisions may be used in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.

Claims (10)

1. An article originality evaluation method, characterized by comprising the following steps:
preprocessing a document to be stored and storing the preprocessed document in a document library;
performing word-sense similar document candidate processing, semantically similar document candidate processing and/or feature extraction and storage on the newly stored document;
recalling, from the stock documents in the document library, documents that may be similar to the document to be evaluated;
and calculating the originality of the document to be evaluated based on the similarity between the document to be evaluated and the stock documents, wherein a stock document is a document that has relatively high originality in the service scenario and has been determined to need intellectual property protection, and the stock documents are stored in the document library.
2. The method of claim 1, wherein the text data preprocessing module is specifically configured to:
carrying out document cleaning and feature extraction on the document;
calculating the word features and the distributed representation of the document to be evaluated, namely segmenting the document to be evaluated into N paragraphs to obtain a paragraph set paras,
paras = (p_1, p_2, ..., p_n, ..., p_N), where p_n (n = 1, 2, ..., N) denotes a document paragraph after segmentation, and N is an integer greater than or equal to 2.
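As an illustration of the segmentation step in claim 2, the following is a minimal sketch, not the patented implementation. It assumes that paragraphs are separated by newline characters and that word features are simple lowercase whitespace tokens; the function names `split_paragraphs` and `word_set` are hypothetical.

```python
import re


def split_paragraphs(document: str) -> list[str]:
    """Split a cleaned document into non-empty paragraphs p_1..p_N.

    Assumption: paragraphs are separated by one or more newline characters.
    """
    parts = re.split(r"\n+", document)
    return [p.strip() for p in parts if p.strip()]


def word_set(paragraph: str) -> set[str]:
    """Return the set of words in a paragraph.

    Assumption: lowercase whitespace tokenization; a production system
    for Chinese text would need a proper word segmenter instead.
    """
    return set(paragraph.lower().split())


paras = split_paragraphs("First paragraph here.\n\nSecond paragraph.\nThird one.")
features = [word_set(p) for p in paras]
print(len(paras), features[0])
```

The distributed representation of each paragraph would come from a separate embedding model, which the claim does not name and which is therefore left out of this sketch.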
3. The method according to claim 2, wherein recalling the stock documents in the document library that may be similar to the document to be evaluated specifically comprises:
obtaining, by label search, the word-sense candidate similar paragraph set of paragraph p_n: cand_list_wordbag = (c_1, c_2, ..., c_i, ..., c_I), where c_i (i = 1, 2, ..., I) denotes a word-sense similar paragraph retrieved for paragraph p_n, and I is an integer greater than or equal to 2;
obtaining, by label search, the semantic candidate similar paragraph set of paragraph p_n: cand_list_distvec = (d_1, d_2, ..., d_j, ..., d_J), where d_j (j = 1, 2, ..., J) denotes a semantically similar paragraph retrieved for paragraph p_n, and J is an integer greater than or equal to 2;
using a threshold to determine whether the paragraphs in the word-sense candidate similar paragraph set cand_list_wordbag and in the semantic candidate similar paragraph set cand_list_distvec are recalled, and recording the set of all recalled candidate similar paragraphs:
cand_list = cand_list_wordbag ∪ cand_list_distvec = (s_1, s_2, ..., s_k, ..., s_K);
where k = 1, 2, ..., K, and K is an integer greater than or equal to 2.
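The recall step of claim 3 can be sketched as a threshold filter followed by a set union. This is only an illustrative sketch: the claim does not state the threshold values or the form of the search results, so the `(paragraph_id, similarity)` tuples, the default thresholds of 0.5, and the function name `recall_candidates` are all assumptions.

```python
def recall_candidates(wordbag_hits, distvec_hits,
                      wordbag_threshold=0.5, distvec_threshold=0.5):
    """Return the union of recalled candidates, in first-seen order.

    Each hit is a (paragraph_id, similarity) tuple (assumed format).
    A hit is recalled when its similarity reaches the threshold of
    the channel it came from (assumed rule; values are placeholders).
    """
    recalled = []
    seen = set()
    channels = (
        (wordbag_hits, wordbag_threshold),   # cand_list_wordbag
        (distvec_hits, distvec_threshold),   # cand_list_distvec
    )
    for hits, threshold in channels:
        for paragraph_id, similarity in hits:
            if similarity >= threshold and paragraph_id not in seen:
                seen.add(paragraph_id)
                recalled.append(paragraph_id)
    return recalled


cand_list = recall_candidates(
    wordbag_hits=[("a", 0.9), ("b", 0.2)],
    distvec_hits=[("a", 0.8), ("c", 0.7)],
)
print(cand_list)
```

Deduplicating by paragraph identifier keeps a paragraph that is found by both channels from appearing twice in cand_list.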
4. The method of claim 3, wherein the calculating the originality of the document to be evaluated comprises:
calculating the originality of the article by using the following formula:
originality of the article
Figure FDA0003266979370000021
where score_n is the originality of the n-th paragraph of the document to be evaluated, and
score_n = min(score_wordbag_n, score_distvec_n),
where score_wordbag_n is the originality score of the n-th paragraph of the article under the bag-of-words model, and score_distvec_n is the originality score of the n-th paragraph of the article under the distributed representation.
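The per-paragraph rule in claim 4, score_n = min(score_wordbag_n, score_distvec_n), can be illustrated with a few lines of code. The sample scores below are invented for illustration, and the function name `paragraph_originality` is hypothetical. The formula that aggregates the paragraph scores into the article score survives only as an image reference in this text, so the sketch deliberately stops at the per-paragraph values.

```python
def paragraph_originality(score_wordbag_n: float, score_distvec_n: float) -> float:
    """Originality of one paragraph: the lower of its two channel scores."""
    return min(score_wordbag_n, score_distvec_n)


# Invented example scores for a three-paragraph document.
wordbag_scores = [0.9, 0.4, 1.0]   # bag-of-words channel
distvec_scores = [0.7, 0.6, 0.2]   # distributed-representation channel

scores = [
    paragraph_originality(w, d)
    for w, d in zip(wordbag_scores, distvec_scores)
]
print(scores)
```

Taking the minimum means a paragraph counts as original only if neither channel finds a close match: one strong hit in either the word-level or the semantic search is enough to lower its score.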
5. The method of claim 4, wherein the originality score under the bag of words model is calculated by the following formula:
Figure FDA0003266979370000022
wherein,
Figure FDA0003266979370000023
in the formula,
Figure FDA0003266979370000024
is the set of words in paragraph p_n,
Figure FDA0003266979370000025
Figure FDA0003266979370000026
represents the number of identical words contained in paragraphs p_n and s_k; in the denominator,
Figure FDA0003266979370000027
represents the absolute difference between the lengths of the two paragraphs; the coefficient β represents the weight of the document length difference factor in calculating the text similarity, and is 0.5 by default.
6. The method of claim 4, wherein the originality score under the distributed representation is calculated by:
Figure FDA0003266979370000031
wherein,
the cosine distance is
Figure FDA0003266979370000032
in the formula,
Figure FDA0003266979370000033
is the distributed representation of the n-th paragraph.
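Claim 6 relies on a cosine measure between distributed representations, but its exact formula survives only as an image reference. The sketch below therefore shows the standard cosine similarity between two vectors and makes no claim about how the patent maps it to an originality score. The function name `cosine_similarity` and the zero-vector convention are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Standard cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumed convention: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

In practice the vectors would be the distributed representations of paragraph p_n and of a recalled candidate paragraph s_k.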
7. An article originality evaluation system, characterized by comprising:
a text data preprocessing module, used for preprocessing documents;
a stock document management subsystem, comprising a document storage module used for maintaining a document library in which stock documents are stored, wherein the stock documents are documents that have relatively high originality in the service scenario and have been determined to need intellectual property protection;
a word-sense similar document candidate submodule ES, used for providing candidate similar documents that are similar at the word level;
a semantically similar document candidate submodule Milvus, used for providing candidate similar documents that are semantically similar;
a feature storage submodule Mongo, used for storing all feature data of the documents;
wherein the document storage module is used for storing document data into the word-sense similar document candidate submodule ES, the semantically similar document candidate submodule Milvus and the feature storage submodule Mongo;
an article originality calculation subsystem, used for calculating the originality of the article to be evaluated;
wherein the article originality calculation subsystem is composed of a candidate similar document retrieval module and an originality calculation module;
the candidate similar document retrieval module is used for recalling stock documents in the document library that may be similar to the document to be evaluated;
the originality calculation module is used for calculating the originality of the document to be evaluated based on the similarity between the document to be evaluated and the stock documents stored in the document library;
the text data preprocessing module is arranged between the article originality calculation subsystem and the stock document management subsystem;
the word-sense similar document candidate submodule ES and the semantically similar document candidate submodule Milvus are arranged between the article originality calculation subsystem and the stock document management subsystem, and are each connected with the document storage module and the candidate similar document retrieval module;
the feature storage submodule Mongo is arranged between the article originality calculation subsystem and the stock document management subsystem, and is connected with the document storage module and the originality calculation module.
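The storage fan-out described in claim 7, in which the document storage module writes each document to three back ends, can be sketched with in-memory stand-ins. This is a hedged illustration only: the dictionaries below merely play the roles of the ES, Milvus and Mongo submodules and do not use those products' client APIs, and the names `InMemoryStores` and `store_document` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStores:
    """Hypothetical stand-ins for the three submodules named in the claim."""
    word_index: dict = field(default_factory=dict)     # role of submodule ES
    vector_index: dict = field(default_factory=dict)   # role of submodule Milvus
    feature_store: dict = field(default_factory=dict)  # role of submodule Mongo


def store_document(stores: InMemoryStores, doc_id: str, words, vector) -> None:
    """Write one preprocessed document to all three stores."""
    unique_words = set(words)
    stores.word_index[doc_id] = unique_words           # for word-level recall
    stores.vector_index[doc_id] = list(vector)         # for semantic recall
    stores.feature_store[doc_id] = {                   # full feature record
        "words": sorted(unique_words),
        "vector": list(vector),
    }


stores = InMemoryStores()
store_document(stores, "d1", ["stock", "document", "stock"], [0.1, 0.2])
print(stores.feature_store["d1"])
```

Keeping the feature record separate from the two search indexes mirrors the claim: the retrieval module reads the indexes, while the originality calculation module reads the feature store.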
8. The system of claim 7, wherein the text data preprocessing module is specifically configured to:
carrying out document cleaning and feature extraction on the document;
calculating the word features and the distributed representation of the document to be evaluated, namely segmenting the document to be evaluated into N paragraphs to obtain a paragraph set paras,
paras = (p_1, p_2, ..., p_n, ..., p_N), where p_n (n = 1, 2, ..., N) denotes a document paragraph after segmentation, and N is an integer greater than or equal to 2.
9. An electronic device comprising a memory, a processor and a computer program stored in the memory and operable on the processor, wherein the processor implements the steps corresponding to the method for evaluating the originality of an article as claimed in any one of claims 1 to 6 when executing the computer program.
10. A computer storage medium having computer program instructions stored thereon, wherein the program instructions, when executed by a processor, are for implementing the steps corresponding to the article originality assessment method of any one of claims 1 to 6.
CN202111091198.5A 2021-09-17 2021-09-17 Article originality evaluation system, method, device and medium Pending CN113792119A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111091198.5A CN113792119A (en) 2021-09-17 2021-09-17 Article originality evaluation system, method, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111091198.5A CN113792119A (en) 2021-09-17 2021-09-17 Article originality evaluation system, method, device and medium

Publications (1)

Publication Number Publication Date
CN113792119A true CN113792119A (en) 2021-12-14

Family

ID=79183893

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111091198.5A Pending CN113792119A (en) 2021-09-17 2021-09-17 Article originality evaluation system, method, device and medium

Country Status (1)

Country Link
CN (1) CN113792119A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110347782A (en) * 2019-07-18 2019-10-18 知者信息技术服务成都有限公司 Article duplicate checking method, apparatus and electronic equipment
US20200394186A1 (en) * 2019-06-11 2020-12-17 International Business Machines Corporation Nlp-based context-aware log mining for troubleshooting
CN113011194A (en) * 2021-04-15 2021-06-22 电子科技大学 Text similarity calculation method fusing keyword features and multi-granularity semantic features
CN113377927A (en) * 2021-06-28 2021-09-10 成都卫士通信息产业股份有限公司 Similar document detection method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN102541968B (en) Indexing method
CN110162522B (en) Distributed data search system and method
CN103678277A (en) Theme-vocabulary distribution establishing method and system based on document segmenting
CN113449187A (en) Product recommendation method, device and equipment based on double portraits and storage medium
CN111666415A (en) Topic clustering method and device, electronic equipment and storage medium
CN105550354A (en) Configuration file management method and system
CN102959548B (en) Date storage method, lookup method and device
CN115018588A (en) Product recommendation method and device, electronic equipment and readable storage medium
CN112766512A (en) Deep learning framework diagnosis system, method, device, equipment and medium based on meta-operator
CN113971225A (en) Image retrieval system, method and device
CN113360803A (en) Data caching method, device and equipment based on user behavior and storage medium
CN114138784A (en) Information tracing method and device based on storage library, electronic equipment and medium
CN113255682B (en) Target detection system, method, device, equipment and medium
CN112308313A (en) Method, device, medium and computer equipment for continuous point addressing of school
CN114754786A (en) Truck navigation way finding method, device, equipment and medium
CN115878824A (en) Image retrieval system, method and device
CN113792119A (en) Article originality evaluation system, method, device and medium
US20210089539A1 (en) Associating user-provided content items to interest nodes
CN113806539A (en) Text data enhancement system, method, device and medium
US20160357822A1 (en) Using locations to define moments
CN114692573A (en) Text structuring method, apparatus, computer device, medium, and product
CN113537286B (en) Image classification method, device, equipment and medium
CN107908724A (en) A kind of data model matching process, device, equipment and storage medium
CN112989938A (en) Real-time tracking and identifying method, device, medium and equipment for pedestrians
CN113866638A (en) Battery parameter inference method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination