CN113792119A

CN113792119A - Article originality evaluation system, method, device and medium

Info

Publication number: CN113792119A
Application number: CN202111091198.5A
Authority: CN
Inventors: 李鹏宇; 李剑锋
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2021-09-17
Filing date: 2021-09-17
Publication date: 2021-12-14

Abstract

The present disclosure relates to an article originality evaluation system, method, device and medium. The system includes: a text data preprocessing module; an inventory document management subsystem, which includes a document storage module; a literally similar document candidate submodule ES; a semantically similar document candidate submodule Milvus; a feature storage submodule Mongo; and an article originality calculation subsystem, which consists of a candidate similar document retrieval module and an originality calculation module. The text data preprocessing module is located between the article originality calculation subsystem and the inventory document management subsystem. The literally similar document candidate submodule ES and the semantically similar document candidate submodule Milvus are located between the article originality calculation subsystem and the inventory document management subsystem, and are each connected to the document storage module and the candidate similar document retrieval module.

Description

An article originality evaluation system, method, equipment and medium

Technical Field

The present disclosure relates to the technical field of data processing, and more particularly to an article originality evaluation system, method, device and medium.

Background

The rapid growth in the volume of data held by humans is accompanied by a large amount of similar data. In some scenarios, the originality of a document must be measured in order to decide how to handle it. For example, an academic journal considers whether to accept a submission only after a "duplicate check" has preliminarily confirmed the manuscript's originality; plagiarism, reprinting and similar phenomena are widespread on the Internet and can be discovered efficiently only with originality calculation tools.

Current originality calculation tools measure the similarity between two documents mainly by string matching, and handle "article spinning" (洗稿, rewording copied content) poorly.

Summary of the Invention

The present disclosure aims to solve the technical problem that prior-art article originality evaluation methods cannot meet users' needs.

To achieve the above technical purpose, the present disclosure provides an article originality evaluation method, characterized in that it includes:

Preprocessing documents to be stored and storing them in the document library;

Performing literal-similarity document candidate processing, semantic-similarity document candidate processing, and/or feature extraction and storage on newly stored documents;

Recalling inventory documents in the document library that may be similar to the document to be evaluated;

Calculating the originality of the document to be evaluated based on its degree of similarity to the inventory documents, wherein an inventory document is a document whose content has high originality in the business scenario and has been identified as requiring intellectual property protection, and the inventory documents are stored in the document library.

Further, the preprocessing of the documents to be stored specifically includes:

Performing document cleaning and feature extraction on the document;

Calculating the word features and distributed representation of the document to be evaluated, wherein the document to be evaluated is split into N paragraphs to obtain the paragraph set paras,

paras = (p_1, p_2, ..., p_n, ..., p_N), where p_n, n = 1, 2, ..., N, denotes a paragraph of the split document, and N is an integer greater than or equal to 2.
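The splitting step can be illustrated with a minimal sketch. The disclosure does not state the splitting rule, so the blank-line delimiter used here is an assumption, and the function name split_paragraphs is hypothetical.

```python
import re


def split_paragraphs(document: str) -> list[str]:
    """Split cleaned text into non-empty paragraphs p_1..p_N.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    parts = re.split(r"\n\s*\n", document)
    return [part.strip() for part in parts if part.strip()]


# Example: three paragraphs, the last preceded by an extra blank line.
paras = split_paragraphs("First paragraph.\n\nSecond paragraph.\n\n\nThird.")
```

Each element of paras then plays the role of one p_n in the steps that follow.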

Further, recalling inventory documents in the document library that may be similar to the document to be evaluated specifically includes:

Denoting the set of literal candidate similar paragraphs retrieved for paragraph p_n as cand_list_wordbag = (c_1, c_2, ..., c_i, ..., c_I), where c_i denotes a literally similar paragraph retrieved for paragraph p_n, i = 1, 2, ..., I, and I is an integer greater than or equal to 2;

Denoting the set of semantic candidate similar paragraphs retrieved for paragraph p_n as cand_list_distvec = (d_1, d_2, ..., d_j, ..., d_J), where d_j denotes a semantically similar paragraph retrieved for paragraph p_n, j = 1, 2, ..., J, and J is an integer greater than or equal to 2;

Using a threshold to decide whether the candidates in the literal candidate similar paragraph set cand_list_wordbag and the semantic candidate similar paragraph set cand_list_distvec are recalled, and denoting the set of all recalled candidate similar paragraphs as:

cand_list = cand_list_wordbag ∪ cand_list_distvec = (s_1, s_2, ..., s_k, ..., s_K);

where k = 1, 2, ..., K, and K is an integer greater than or equal to 2.
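The thresholded union above might be sketched as follows. The Candidate type, its fields, the [0, 1] similarity scale and the threshold value 0.5 are all assumptions made for illustration; the disclosure does not specify them.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    para_id: str       # identifier of a stored paragraph (hypothetical field)
    similarity: float  # similarity reported by the retrieval backend, assumed in [0, 1]


def recall(cand_list_wordbag, cand_list_distvec, threshold=0.5):
    """Return the union of both candidate lists, keeping entries at or above the threshold."""
    recalled = {}
    for cand in list(cand_list_wordbag) + list(cand_list_distvec):
        if cand.similarity < threshold:
            continue
        best = recalled.get(cand.para_id)
        # A paragraph found by both routes is kept once, with its higher similarity.
        if best is None or cand.similarity > best.similarity:
            recalled[cand.para_id] = cand
    return list(recalled.values())


wordbag = [Candidate("a", 0.9), Candidate("b", 0.2)]
distvec = [Candidate("a", 0.7), Candidate("c", 0.6)]
cand_list = recall(wordbag, distvec)
```

Here paragraph "b" falls below the assumed threshold and is dropped, while "a" appears once even though both retrieval routes returned it.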

Further, calculating the originality of the document to be evaluated specifically includes:

Calculating the article originality using the following formula:

Article originality: (formula image not reproduced in this text)

where score_n is the originality of the n-th paragraph of the document to be evaluated, and

score_n = min(score_wordbag_n, score_distvec_n),

where score_wordbag_n is the originality score of the n-th paragraph under the bag-of-words model, and score_distvec_n is the originality score of the n-th paragraph under the distributed representation.
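The per-paragraph rule score_n = min(score_wordbag_n, score_distvec_n) can be shown directly. The input values below are made up for illustration, and the article-level aggregation is deliberately left out because its formula is not reproduced in this text.

```python
def paragraph_score(score_wordbag_n: float, score_distvec_n: float) -> float:
    """Originality of paragraph n: the lower of the literal and semantic scores."""
    return min(score_wordbag_n, score_distvec_n)


# (score_wordbag_n, score_distvec_n) pairs for three hypothetical paragraphs.
pairs = [(0.9, 0.4), (0.3, 0.8), (1.0, 1.0)]
scores = [paragraph_score(w, d) for w, d in pairs]
```

Taking the minimum means a paragraph counts as unoriginal as soon as either view finds a close match, which is how the first pair (literally novel but semantically close) ends up with the low score.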

Further, the originality score under the bag-of-words model is calculated by the following formula:

(formula image not reproduced in this text)

where the symbols, whose images are likewise not reproduced, denote: the set of words in paragraph p_n; the number of identical words contained in paragraphs p_n and s_k; and, in the denominator, the absolute value of the difference between the lengths of the two paragraphs. The coefficient β is the weight of the document length difference when calculating text similarity, and defaults to 0.5.
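Because the formula itself is not reproduced, the sketch below only shows how the named ingredients (word sets, shared-word count, absolute length difference, β = 0.5) could fit together. The way they are combined, the use of word-set size as paragraph length, and the "1 minus highest similarity" step are assumptions, not the formula of the disclosure.

```python
BETA = 0.5  # default weight of the length-difference term, as stated in the text


def wordbag_similarity(words_p, words_s, beta=BETA):
    """ASSUMED form: shared words divided by a length term penalised by the length gap."""
    set_p, set_s = set(words_p), set(words_s)
    common = len(set_p & set_s)                # number of identical words
    length_gap = abs(len(set_p) - len(set_s))  # absolute length difference (set sizes assumed)
    denom = min(len(set_p), len(set_s)) + beta * length_gap
    return common / denom if denom else 0.0


def score_wordbag(words_p, candidates, beta=BETA):
    """ASSUMED: originality is 1 minus the highest similarity to any recalled paragraph."""
    if not candidates:
        return 1.0  # nothing recalled, so the paragraph is treated as fully original
    return 1.0 - max(wordbag_similarity(words_p, s, beta) for s in candidates)
```

With this assumed form, two paragraphs sharing all their words score a similarity of 1.0, and a larger β lowers the similarity of paragraphs whose lengths differ.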

Further, the originality score under the distributed representation is calculated by the following formula:

(formula image not reproduced in this text)

where the formula uses the cosine distance, and the symbol whose image is not reproduced denotes the distributed representation of the n-th paragraph of text.
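Cosine distance between two paragraph vectors is a standard quantity and can be sketched without any library. How the disclosure turns it into score_distvec_n is not reproduced, so taking the smallest distance to any recalled paragraph is an assumption, as is the convention for zero vectors.

```python
import math


def cosine_distance(u, v):
    """1 - cos(u, v); ranges from 0 (same direction) to 2 (opposite direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # convention chosen here for zero vectors (assumption)
    return 1.0 - dot / (norm_u * norm_v)


def score_distvec(vec_n, candidate_vecs):
    """ASSUMED: originality is the smallest cosine distance to any recalled paragraph."""
    if not candidate_vecs:
        return 1.0
    return min(cosine_distance(vec_n, v) for v in candidate_vecs)
```

A paragraph whose vector points the same way as some stored paragraph gets a distance of 0 and, under this assumption, an originality score of 0.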

To achieve the above technical purpose, the present disclosure also provides an article originality evaluation method, including:

Using the text data preprocessing module to perform document cleaning and feature extraction on the document to be evaluated, and calculating the word features and distributed representation of the document to be evaluated;

Using the candidate similar document retrieval module to recall inventory documents in the document library that may be similar to the document to be evaluated;

Using the originality calculation module to calculate the originality of the document to be evaluated based on its degree of similarity to the inventory documents in the document library.

Further, the method also includes:

Using the text data preprocessing module to perform document cleaning and feature extraction on the documents to be stored;

Using the document storage module to store the preprocessed data of the documents to be stored in the corresponding databases: the literally similar document candidate submodule ES, the semantically similar document candidate submodule Milvus, and the feature storage submodule Mongo.
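The storage step writes each preprocessed paragraph to three places. The sketch below uses plain dictionaries as stand-ins for ES, Milvus and Mongo; it does not use or imitate the real client APIs of those products, and the class, function and key names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStores:
    """Dictionary stand-ins for the three submodules; NOT the real client APIs."""
    literal_index: dict = field(default_factory=dict)  # ES role: para_id -> word list
    vector_index: dict = field(default_factory=dict)   # Milvus role: para_id -> vector
    feature_store: dict = field(default_factory=dict)  # Mongo role: para_id -> all features


def store_document(stores, doc_id, paragraphs):
    """Store preprocessed paragraphs (dicts with assumed keys 'text', 'words', 'vector')."""
    para_ids = []
    for n, para in enumerate(paragraphs, start=1):
        para_id = f"{doc_id}#{n}"
        stores.literal_index[para_id] = para["words"]
        stores.vector_index[para_id] = para["vector"]
        stores.feature_store[para_id] = dict(para)
        para_ids.append(para_id)
    return para_ids


stores = InMemoryStores()
ids = store_document(
    stores, "doc1", [{"text": "a b", "words": ["a", "b"], "vector": [0.1, 0.2]}]
)
```

Keeping the full feature record in one store and only the searchable view in the other two mirrors the division of roles described for Mongo, ES and Milvus.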

To achieve the above technical purpose, the present disclosure also provides a computer storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the above article originality evaluation method.

To achieve the above technical purpose, the present disclosure also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor; the processor implements the steps of the above article originality evaluation method when executing the computer program.

The beneficial effects of the present disclosure are:

The main calculations of the article originality evaluation system of the present disclosure are completed offline, so inference is fast, and the system can support real-time calculation.

The article originality evaluation system of the present disclosure considers both the literal and the semantic dimension, and can effectively handle "article spinning" (洗稿).

Description of the Drawings

FIG. 1 shows a schematic structural diagram of the system of Embodiment 1 of the present disclosure;

FIG. 2 shows a schematic structural diagram of the text data preprocessing module of the system of Embodiment 1 of the present disclosure;

FIG. 3 shows a schematic flowchart of the method of Embodiment 2 of the present disclosure;

FIG. 4 shows a schematic structural diagram of Embodiment 4 of the present disclosure.

Detailed Description

Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood, however, that these descriptions are exemplary only and are not intended to limit the scope of the present disclosure. In addition, descriptions of well-known structures and techniques are omitted below to avoid unnecessarily obscuring the concepts of the present disclosure.

Various structural schematic diagrams according to embodiments of the present disclosure are shown in the accompanying drawings. The figures are not drawn to scale; some details are enlarged for clarity, and some details may be omitted. The shapes of the various regions and layers shown in the figures, and their relative sizes and positional relationships, are exemplary only and may deviate in practice due to manufacturing tolerances or technical limitations; those skilled in the art may additionally design regions or layers with different shapes, sizes and relative positions as actually required.

Embodiment 1:

As shown in FIG. 1:

The present disclosure provides an article originality evaluation system, including:

A text data preprocessing module, used to preprocess documents;

An inventory document management subsystem, including a document storage module, used to maintain a document library in which inventory documents are stored, wherein an inventory document is a document whose content has high originality in the business scenario and has been identified as requiring intellectual property protection;

A literally similar document candidate submodule ES, used to provide literally similar candidate documents;

A semantically similar document candidate submodule Milvus, used to provide semantically similar candidate documents;

A feature storage submodule Mongo, used to store all feature data of the documents;

The document storage module is used to store document data in the literally similar document candidate submodule ES, the semantically similar document candidate submodule Milvus, and the feature storage submodule Mongo;

An article originality calculation subsystem, used to calculate and evaluate the originality of an article;

wherein the article originality calculation subsystem consists of a candidate similar document retrieval module and an originality calculation module;

The candidate similar document retrieval module is used to recall inventory documents in the document library that may be similar to the document to be evaluated;

The originality calculation module is used to calculate the originality of the document to be evaluated based on its degree of similarity to the inventory documents in the document library;

The text data preprocessing module is located between the article originality calculation subsystem and the inventory document management subsystem;

The literally similar document candidate submodule ES and the semantically similar document candidate submodule Milvus are located between the article originality calculation subsystem and the inventory document management subsystem, and are each connected to the document storage module and the candidate similar document retrieval module;

The feature storage submodule Mongo is located between the article originality calculation subsystem and the inventory document management subsystem, and is connected to the document storage module and the originality calculation module respectively.

As shown in FIG. 2,

Further, the text data preprocessing module is specifically used for:

Splitting the document to be evaluated into N paragraphs to obtain the paragraph set paras;

Calculating the word features and distributed representation of the document to be evaluated:

Performing word segmentation and stop-word removal on the segmented document to obtain the word features, that is, the bag-of-words model;

Performing sentence vector calculation on the segmented document to obtain the distributed representation.
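The word-feature step (word segmentation followed by stop-word removal) can be sketched for whitespace-delimited text. The regular-expression tokenizer and the tiny stop-word list are illustrative assumptions; Chinese text would need a dedicated word segmenter, which the disclosure does not name. The sentence vector step is not shown because no model is specified.

```python
import re

# Illustrative stop-word list only; a real deployment would use a fuller list.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}


def word_features(paragraph: str, stop_words=STOP_WORDS) -> list[str]:
    """Tokenize a paragraph and drop stop words, giving its bag-of-words features."""
    tokens = re.findall(r"\w+", paragraph.lower())
    return [token for token in tokens if token not in stop_words]


features = word_features("The originality of an article is measured.")
```

The resulting word list is what the literal retrieval route and the bag-of-words score would operate on.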

Further, the candidate similar document retrieval module is specifically used for:

Denoting the set of literal candidate similar paragraphs retrieved for paragraph p_n from the literally similar document candidate submodule ES as cand_list_wordbag = (c_1, c_2, ..., c_i, ..., c_I), where c_i denotes a literally similar paragraph retrieved for paragraph p_n, i = 1, 2, ..., I, and I is an integer greater than or equal to 2;

Denoting the set of semantic candidate similar paragraphs retrieved for paragraph p_n from the semantically similar document candidate submodule Milvus as cand_list_distvec = (d_1, d_2, ..., d_j, ..., d_J), where d_j denotes a semantically similar paragraph retrieved for paragraph p_n, j = 1, 2, ..., J, and J is an integer greater than or equal to 2;

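Further, the originality calculation module is specifically used for calculating:

Article originality: (formula image not reproduced in this text)

where, in the formula for the bag-of-words score (image not reproduced in this text), the symbols denote: the set of words in paragraph p_n; the number of identical words contained in paragraphs p_n and s_k; and, in the denominator, the absolute value of the difference between the lengths of the two paragraphs, weighted by the coefficient β (default 0.5) as stated in the Summary;

and where the formula for the score under the distributed representation (image not reproduced in this text) uses the cosine distance, with the symbol not reproduced denoting the distributed representation of the n-th paragraph of text.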

Embodiment 2:

As shown in FIG. 3,

The present disclosure also provides an article originality evaluation method, including:

S201: Using the text data preprocessing module to perform document cleaning and feature extraction on the document to be evaluated, and calculating the word features and distributed representation of the document to be evaluated;

S202: Using the candidate similar document retrieval module to recall inventory documents in the document library that may be similar to the document to be evaluated;

Specifically, the recalled candidate similar paragraphs are denoted s_k, where k = 1, 2, ..., K, and K is an integer greater than or equal to 2;

S203: Using the originality calculation module to calculate the originality of the document to be evaluated based on its degree of similarity to the inventory documents in the document library.

Specifically,

Article originality: (formula image not reproduced in this text)

where

score_n = min(score_wordbag_n, score_distvec_n),

where, in the formula for score_wordbag_n (image not reproduced in this text), the symbols denote: the set of words in paragraph p_n; the number of identical words contained in paragraphs p_n and s_k; and, in the denominator, the absolute value of the difference between the lengths of the two paragraphs, weighted by the coefficient β (default 0.5) as stated in the Summary;

and where the formula for score_distvec_n (image not reproduced in this text) uses the cosine distance, with the symbol not reproduced denoting the distributed representation of the n-th paragraph of text.
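Steps S201 to S203 can be tied together in a short skeleton. Every callable passed in is a placeholder for a module described above, and the names and argument shapes are assumptions; the function stops at the per-paragraph scores because the article-level formula is not reproduced in this text.

```python
from typing import Callable, Sequence


def evaluate_paragraphs(
    document: str,
    preprocess: Callable[[str], Sequence[dict]],             # S201: clean, split, featurize
    retrieve: Callable[[dict], Sequence[dict]],              # S202: recall cand_list for p_n
    score_wordbag: Callable[[dict, Sequence[dict]], float],  # bag-of-words originality
    score_distvec: Callable[[dict, Sequence[dict]], float],  # distributed originality
) -> list[float]:
    """Return score_n for each paragraph of the document to be evaluated."""
    scores = []
    for para in preprocess(document):
        cand_list = retrieve(para)
        # S203, per paragraph: score_n = min(score_wordbag_n, score_distvec_n)
        scores.append(
            min(score_wordbag(para, cand_list), score_distvec(para, cand_list))
        )
    return scores


# Stub modules, for illustration only.
demo_scores = evaluate_paragraphs(
    "x\ny",
    preprocess=lambda doc: [{"text": t} for t in doc.split("\n")],
    retrieve=lambda para: [],
    score_wordbag=lambda para, cands: 0.8 if para["text"] == "x" else 0.2,
    score_distvec=lambda para, cands: 0.5,
)
```

Passing the modules in as callables keeps the skeleton independent of any particular retrieval backend or scoring formula.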

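Further, the method also includes the storage steps set out in the Summary: performing document cleaning and feature extraction on the documents to be stored, and storing the preprocessed data in ES, Milvus and Mongo.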

Embodiment 3:

The present disclosure also provides a computer storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps carried out by the above article originality evaluation system.

The computer storage medium of the present disclosure may be implemented using semiconductor memory, magnetic core memory, magnetic drum memory, or magnetic disk memory.

Semiconductor memory: the semiconductor memory elements used in computers are mainly of two types, MOS and bipolar. MOS elements offer high integration and a simple process but are slower. Bipolar elements have a complex process, high power consumption and low integration, but are fast. The advent of NMOS and CMOS led MOS memory to dominate semiconductor memory. NMOS is fast; for example, Intel's 1K-bit static random access memory has an access time of 45 ns. CMOS consumes little power; a 4K-bit CMOS static memory has an access time of 300 ns. The semiconductor memories above are all random access memories (RAM), that is, contents can be read and new contents written at random during operation. Semiconductor read-only memory (ROM), by contrast, can be read at random during operation but cannot be written; it is used to store fixed programs and data. ROM is divided into two types: non-rewritable fuse-type read-only memory (PROM) and rewritable read-only memory (EPROM).

Magnetic core memory is low in cost and high in reliability, and has more than 20 years of practical use behind it. Magnetic core memory was widely used as main memory before the mid-1970s. Its storage capacity can reach more than 10 bits, and its fastest access time is 300 ns. Typical magnetic core memory internationally has a capacity of 4MS to 8MB and an access cycle of 1.0 to 1.5 μs. After the rapid development of semiconductor memory displaced magnetic core memory as main memory, magnetic core memory could still be used as large-capacity expansion memory.

Magnetic drum memory is an external memory based on magnetic recording. Because its information access is fast and its operation is stable and reliable, it is still used as external memory for real-time process control computers and for medium and large computers, although its capacity is small and it is gradually being replaced by magnetic disk memory. To meet the needs of minicomputers and microcomputers, ultra-small magnetic drums have appeared, which are small, light, highly reliable and easy to use.

Magnetic disk memory is an external memory based on magnetic recording. It combines the advantages of magnetic drum and magnetic tape memory: its storage capacity is larger than that of a drum, its access speed is faster than that of tape, and it can be stored offline. Magnetic disks are therefore widely used as large-capacity external memory in all kinds of computer systems. Magnetic disks generally fall into two categories: hard disk and floppy disk memory.

There are many varieties of hard disk memory. Structurally, they are divided into replaceable and fixed types: the platters of a replaceable disk can be exchanged, while those of a fixed disk cannot. Both replaceable and fixed disks come in multi-platter and single-platter structures, and both can be further divided into fixed-head and movable-head types. Fixed-head disks have a small capacity, low recording density and high access speed, but are expensive. Movable-head disks have a high recording density (up to 1000 to 6250 bits per inch) and therefore a large capacity, but their access speed is lower than that of fixed-head disks. The storage capacity of magnetic disk products can reach several hundred megabytes, with a bit density of 6250 bits per inch and a track density of 475 tracks per inch. Multi-platter replaceable disk memory, because its disk packs can be exchanged, has a very large offline capacity as well as large capacity and high speed; it can store large volumes of information and is widely used in online information retrieval systems and database management systems.

Embodiment 4:

The present disclosure also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor; the processor implements the steps carried out by the above article originality evaluation system when executing the computer program.

FIG. 4 is a schematic diagram of the internal structure of the electronic device in one embodiment. As shown in FIG. 4, the electronic device includes a processor, a storage medium, a memory and a network interface connected through a system bus. The storage medium of the computer device stores an operating system, a database and computer-readable instructions; the database may store a sequence of control information, and the computer-readable instructions, when executed by the processor, cause the processor to implement an article originality evaluation system. The processor of the electronic device provides computing and control capabilities and supports the operation of the entire computer device. The memory of the computer device may store computer-readable instructions which, when executed by the processor, cause the processor to execute an article originality evaluation system. The network interface of the computer device is used to connect to and communicate with a terminal. Those skilled in the art will understand that the structure shown in FIG. 4 is only a block diagram of the part of the structure relevant to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.

The electronic device includes, but is not limited to, smart phones, computers, tablet computers, wearable smart devices, artificial intelligence devices, power banks, and the like.

In some embodiments the processor may be composed of integrated circuits, for example a single packaged integrated circuit, or multiple packaged integrated circuits with the same or different functions, including one or more central processing units (CPU), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The processor is the control unit of the electronic device: it connects the components of the entire electronic device through various interfaces and lines, and performs the functions of the electronic device and processes data by running or executing programs or modules stored in the memory (for example, executing remote data read/write programs) and calling data stored in the memory.

The bus may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. The bus is configured to enable connection and communication between the memory and at least one processor, among other components.

FIG. 4 shows only an electronic device with certain components. Those skilled in the art will understand that the structure shown in FIG. 4 does not limit the electronic device, which may include fewer or more components than shown, combine certain components, or arrange the components differently.

For example, although not shown, the electronic device may also include a power source (such as a battery) that supplies power to the components. Preferably, the power source is logically connected to the at least one processor through a power management device, so that charging management, discharging management and power consumption management are implemented through the power management device. The power source may also include one or more DC or AC power sources, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and other such components. The electronic device may further include various sensors, a Bluetooth module, a Wi-Fi module, and so on, which are not described further here.

Further, the electronic device may also include a network interface. Optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface or a Bluetooth interface), and is typically used to establish a communication connection between the electronic device and other electronic devices.

Optionally, the electronic device may also include a user interface, which may be a display or an input unit (such as a keyboard); optionally, the user interface may also be a standard wired or wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display may also be referred to as a display screen or display unit, and is used to display information processed in the electronic device and to display a visual user interface.

Further, the computer-usable storage medium may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function, and the like; the data storage area may store data created through the use of blockchain nodes, and the like.

In the several embodiments provided by the present invention, it should be understood that the disclosed device, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into modules is only a division by logical function, and other divisions are possible in an actual implementation.

The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional modules in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules.

Embodiments of the present disclosure have been described above. However, these embodiments are for illustrative purposes only and are not intended to limit the scope of the present disclosure. The scope of the present disclosure is defined by the appended claims and their equivalents. Those skilled in the art can make various substitutions and modifications without departing from the scope of the present disclosure, and all such substitutions and modifications fall within the scope of the present disclosure.

Claims

1. An article originality evaluation method, characterized by comprising the following steps:

preprocessing a document to be stored and storing the preprocessed document in a document library;

performing word-sense similar document candidate processing, semantically similar document candidate processing, and/or feature extraction and storage on the newly stored document;

recalling inventory documents in the document library that may be similar to a document to be evaluated;

and calculating the originality of the document to be evaluated based on the similarity between the document to be evaluated and the inventory documents, wherein an inventory document is a document that has relatively high originality in the service scenario and has been determined to need intellectual property protection, and the inventory documents are stored in the document library.

2. The method of claim 1, wherein the text data preprocessing module is specifically configured to:

perform document cleaning and feature extraction on the document;

compute word features and a distributed representation of the document to be evaluated, that is, segment the document to be evaluated into $N$ paragraphs to obtain a paragraph set paras,

$$\text{paras} = (p_1, p_2, \ldots, p_n, \ldots, p_N)$$

where $p_n$ ($n = 1, 2, \ldots, N$) denotes a paragraph of the document after segmentation, and $N$ is an integer greater than or equal to 2.
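The segmentation step can be illustrated with a minimal Python sketch. The claim does not state how paragraph boundaries are found, so splitting on newlines is an assumption made here for illustration, and the function name `split_into_paragraphs` is hypothetical.

```python
import re


def split_into_paragraphs(document: str) -> list[str]:
    """Split a document into the paragraph set paras = (p_1, ..., p_N).

    Assumption: paragraphs are separated by one or more newline characters.
    The claim itself does not specify the segmentation rule.
    """
    parts = re.split(r"\n+", document)
    # Drop empty fragments and surrounding whitespace.
    return [part.strip() for part in parts if part.strip()]


paras = split_into_paragraphs("First paragraph.\n\nSecond paragraph.\nThird.")
```

Each element of `paras` would then be passed on to feature extraction; the claim requires $N \geq 2$, which this sketch does not enforce.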

3. The method according to claim 2, wherein recalling inventory documents in the document library that may be similar to the document to be evaluated specifically comprises:

denoting the set of word-sense candidate similar paragraphs retrieved for paragraph $p_n$ as $\text{cand\_list}_{\text{wordbag}} = (c_1, c_2, \ldots, c_i, \ldots, c_I)$, where $c_i$ ($i = 1, 2, \ldots, I$) denotes a word-sense similar paragraph retrieved for paragraph $p_n$, and $I$ is an integer greater than or equal to 2;

denoting the set of semantic candidate similar paragraphs retrieved for paragraph $p_n$ as $\text{cand\_list}_{\text{distvec}} = (d_1, d_2, \ldots, d_j, \ldots, d_J)$, where $d_j$ ($j = 1, 2, \ldots, J$) denotes a semantically similar paragraph retrieved for paragraph $p_n$, and $J$ is an integer greater than or equal to 2;

using a threshold to determine whether the paragraphs in $\text{cand\_list}_{\text{wordbag}}$ and $\text{cand\_list}_{\text{distvec}}$ are recalled, and denoting the set of all recalled candidate similar paragraphs as:

$$\text{cand\_list} = \text{cand\_list}_{\text{wordbag}} \cup \text{cand\_list}_{\text{distvec}} = (s_1, s_2, \ldots, s_k, \ldots, s_K);$$

where $k = 1, 2, \ldots, K$, and $K$ is an integer greater than or equal to 2.
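The merging of the two candidate sets can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: each retrieval hit is assumed to be a `(paragraph_id, similarity)` pair where a higher similarity means a closer match, each candidate list is given its own threshold, and the function name `recall_candidates` is hypothetical.

```python
def recall_candidates(wordbag_hits, distvec_hits,
                      wordbag_threshold, distvec_threshold):
    """Build cand_list as the union of the thresholded candidate lists.

    Assumptions (not stated in the claim): hits are (paragraph_id, similarity)
    pairs, higher similarity is better, and each list has its own threshold.
    """
    recalled = []
    seen = set()
    sources = (
        (wordbag_hits, wordbag_threshold),
        (distvec_hits, distvec_threshold),
    )
    for hits, threshold in sources:
        for paragraph_id, similarity in hits:
            # Keep a candidate once, and only if it clears its threshold.
            if similarity >= threshold and paragraph_id not in seen:
                seen.add(paragraph_id)
                recalled.append(paragraph_id)
    return recalled


cand_list = recall_candidates(
    [("c1", 0.9), ("c2", 0.2)],
    [("c1", 0.8), ("d1", 0.7)],
    wordbag_threshold=0.5,
    distvec_threshold=0.5,
)
```

A paragraph found by both retrieval paths appears once in `cand_list`, which matches the set-union semantics of the formula.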

4. The method of claim 3, wherein calculating the originality of the document to be evaluated comprises:

calculating the originality of the article by using the following formula:

originality of the article = (formula not reproduced in this text)

where $\text{score}_n$ is the originality of the $n$-th paragraph of the document to be evaluated, given by

$$\text{score}_n = \min(\text{score\_wordbag}_n, \text{score\_distvec}_n),$$

where $\text{score\_wordbag}_n$ is the originality score of the $n$-th paragraph under the bag-of-words model, and $\text{score\_distvec}_n$ is the originality score of the $n$-th paragraph under the distributed representation.
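The per-paragraph rule $\text{score}_n = \min(\text{score\_wordbag}_n, \text{score\_distvec}_n)$ can be sketched in Python as below. The function name `paragraph_scores` is hypothetical, and the article-level aggregation is deliberately left out because its formula is not available in this text.

```python
def paragraph_scores(wordbag_scores, distvec_scores):
    """Combine the two per-paragraph originality scores with min().

    Both inputs hold one score per paragraph, in paragraph order.
    """
    if len(wordbag_scores) != len(distvec_scores):
        raise ValueError("both score lists must cover the same paragraphs")
    return [min(w, d) for w, d in zip(wordbag_scores, distvec_scores)]


scores = paragraph_scores([0.9, 0.4], [0.7, 0.8])
```

Taking the minimum means a paragraph is rated as original only when both the bag-of-words view and the distributed-representation view rate it as original.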

5. The method of claim 4, wherein the originality score under the bag-of-words model is calculated by a formula (not reproduced in this text) in which:

one symbol denotes the set of words in paragraph $p_n$;

one term represents the number of identical words contained in paragraphs $p_n$ and $s_k$;

a term in the denominator represents the absolute difference between the lengths of the two paragraphs; and

the coefficient $\beta$ represents the weight of the document length difference factor in calculating the text similarity, and is 0.5 by default.

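6. The method of claim 4, wherein the originality score under the distributed representation is calculated by a formula based on the cosine distance (neither the formula nor the cosine-distance expression is reproduced in this text), in which the vector symbol denotes the distributed representation of the $n$-th paragraph.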
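As an illustration of the cosine distance between two distributed representations, a minimal Python sketch follows. It assumes the common convention that cosine distance equals one minus cosine similarity; how the claim maps this distance to the originality score is not recoverable from this text and is not shown. The function name `cosine_distance` is hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors.

    Assumption: "cosine distance" follows the usual 1 - cos(theta) convention.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


distance = cosine_distance([1.0, 0.0], [0.0, 1.0])
```

Under this convention, vectors pointing in the same direction have a distance near 0 and orthogonal vectors have a distance of 1.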

7. An article originality evaluation system, characterized by comprising:

a text data preprocessing module, used for preprocessing documents;

an inventory document management subsystem, comprising a document storage module used for maintaining a document library in which inventory documents are stored, wherein an inventory document is a document that has relatively high originality in the service scenario and has been determined to need intellectual property protection;

a word-sense similar document candidate submodule ES, used for providing candidate similar documents that are similar at the word level;

a semantically similar document candidate submodule Milvus, used for providing candidate similar documents that are semantically similar;

a feature storage submodule Mongo, used for storing all feature data of the documents;

wherein the document storage module is used for storing document data into the word-sense similar document candidate submodule ES, the semantically similar document candidate submodule Milvus, and the feature storage submodule Mongo;

an article originality calculation subsystem, used for calculating the originality of an article to be evaluated;

wherein the article originality calculation subsystem is composed of a candidate similar document retrieval module and an originality calculation module;

the candidate similar document retrieval module is used for recalling inventory documents in the document library that may be similar to the document to be evaluated;

the originality calculation module is used for calculating the originality of the document to be evaluated based on the similarity between the document to be evaluated and the documents stored in the document library;

the text data preprocessing module is arranged between the article originality calculation subsystem and the inventory document management subsystem;

the word-sense similar document candidate submodule ES and the semantically similar document candidate submodule Milvus are arranged between the article originality calculation subsystem and the inventory document management subsystem, and are each connected with the document storage module and the candidate similar document retrieval module;

the feature storage submodule Mongo is arranged between the article originality calculation subsystem and the inventory document management subsystem, and is connected with the document storage module and the originality calculation module.
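The write path of the document storage module can be sketched as below. This is only an architectural illustration: `InMemoryStore` is a hypothetical in-memory stand-in for the ES, Milvus, and Mongo submodules and does not use any real client library, and all class, field, and method names are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for ES, Milvus, or Mongo; not a real client."""
    records: dict = field(default_factory=dict)

    def put(self, doc_id, value):
        self.records[doc_id] = value


@dataclass
class DocumentStorageModule:
    word_index: InMemoryStore     # plays the role of submodule ES
    vector_index: InMemoryStore   # plays the role of submodule Milvus
    feature_store: InMemoryStore  # plays the role of submodule Mongo

    def store(self, doc_id, words, vector):
        """Write one preprocessed document to all three submodules."""
        unique_words = set(words)
        self.word_index.put(doc_id, unique_words)
        self.vector_index.put(doc_id, list(vector))
        self.feature_store.put(
            doc_id,
            {"words": sorted(unique_words), "vector": list(vector)},
        )


storage = DocumentStorageModule(InMemoryStore(), InMemoryStore(), InMemoryStore())
storage.store("doc-1", ["originality", "article"], [0.1, 0.9])
```

In this sketch the retrieval module would read from `word_index` and `vector_index`, while the originality calculation module would read from `feature_store`, mirroring the connections recited in the claim.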

8. The system of claim 7, wherein the text data preprocessing module is specifically configured to:

perform document cleaning and feature extraction on the document;

9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable by the processor, wherein the processor, when executing the computer program, implements the steps of the article originality evaluation method of any one of claims 1 to 6.

10. A computer storage medium having computer program instructions stored thereon, wherein the program instructions, when executed by a processor, implement the steps of the article originality evaluation method of any one of claims 1 to 6.