CN108846031A - Project similarity comparison method for power industry - Google Patents

Project similarity comparison method for power industry

Publication number
CN108846031A
Authority
CN
China
Prior art keywords
sentence
similar
comparison
text
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810521004.2A
Other languages
Chinese (zh)
Other versions
CN108846031B (en)
Inventor
段飞虎
吕强
冯自强
张宏伟
邓春宇
季知祥
史梦洁
陈立斌
王冠群
徐翀
梁芙翠
王頔
魏冠元
付蓉
马铁群
朱承志
孙黎滢
谷记亭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongfang Knowledge Network Digital Publishing Technology Co ltd
State Grid Zhejiang Electric Power Co Ltd
China Electric Power Research Institute Co Ltd CEPRI
State Grid Energy Research Institute Co Ltd
Original Assignee
Tongfang Knowledge Network Digital Publishing Technology Co ltd
State Grid Zhejiang Electric Power Co Ltd
China Electric Power Research Institute Co Ltd CEPRI
State Grid Energy Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongfang Knowledge Network Digital Publishing Technology Co ltd, State Grid Zhejiang Electric Power Co Ltd, China Electric Power Research Institute Co Ltd CEPRI, State Grid Energy Research Institute Co Ltd
Priority to CN201810521004.2A
Publication of CN108846031A
Application granted
Publication of CN108846031B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The invention discloses a project similarity comparison method for the power industry, which comprises the following steps: fragmenting the text, unifying its format and storing it in a database; retrieving, through a KBase database, several texts most similar to the project under comparison; comparing each similar text with the comparison text; analyzing the comparison results of all similar texts and outputting the results in comparison order; and optimizing the similarity computation for the compared sentences, wherein the optimization adopts parallel computing and uses a plurality of threads to compute simultaneously. The method splits a text into sentences and segments the sentences into words to reach the minimum granularity of text representation, then carries out semantic analysis based on power-industry subject words, searches all projects in the database for similar text, and marks and outputs it. This improves the efficiency of duplicate checking for submitted projects and reduces the waste of manpower, material and other resources.

Description

A project similarity comparison method for the power industry
Technical field
The present invention relates to the fields of text mining and computer information processing, and in particular to a project similarity comparison method for the power industry.
Background
In the power industry, electric power organizations in every region submit project applications to the grid company each year, and a large number of projects must be reviewed within a fixed period. The grid company currently reviews them manually. Reviewers must not only weed out, from memory, projects that duplicate earlier ones, but also identify cutting-edge, innovative applications in light of current social developments. In other words, a reviewer has to remember the projects applied for in previous years in order to rule out similar ones, and also has to pick out valuable projects according to current social needs. This increases the reviewers' workload and consumes a great deal of effort.
Given this situation, a project similarity comparison technique for the power industry is urgently needed. Existing similarity comparison techniques could address the power industry's problem, but their hardware and software requirements are too high, and a non-technology industry finds it difficult to provide that level of hardware and software support. In addition, because power industry project applications are confidential, it is difficult to expose them to the general-purpose similarity comparison services currently available. A lightweight project similarity comparison technique for the power industry is therefore urgently needed to solve the industry's current problems.
Summary of the invention
To solve the above technical problems, the object of the present invention is to provide a project similarity comparison method for the power industry.
The object of the present invention is achieved by the following technical solution:
A project similarity comparison method for the power industry, comprising:
Step 10: fragment the text, unify its format, and store it in a database;
Step 20: retrieve, through the KBase database, several texts most similar to the project under comparison;
Step 30: compare each similar text with the comparison text;
Step 40: analyze the comparison results of all similar texts and output the results in comparison order;
Step 50: optimize the similarity computation for the compared sentences; the optimization uses parallel computing, with multiple threads computing simultaneously.
Compared with the prior art, one or more embodiments of the present invention may have the following advantages:
The method splits the text into sentences and segments the sentences into words to reach the minimum granularity of text representation, then performs semantic analysis based on power-industry subject words, searches all projects in the database for similar text, and marks and outputs it. This improves the efficiency of duplicate checking for submitted projects and reduces the waste of manpower, material and other resources.
Brief description of the drawings
Fig. 1 is a flowchart of the project similarity comparison method for the power industry;
Fig. 2 shows the data stored in the database after fragmentation;
Fig. 3 shows a one-to-many duplicate-check result;
Fig. 4 shows a one-to-one precise duplicate-check result.
Detailed description of the embodiments
To make the object, technical solution and advantages of the present invention clearer, the invention is described in further detail below with reference to embodiments and the drawings.
As shown in Fig. 1, the project similarity comparison method for the power industry includes the following steps:
Step 10: fragment the text, unify its format, and store it in a database;
Step 20: retrieve, through the KBase database, several texts most similar to the project under comparison;
The KBase database uses a high-order fingerprint technique, which can quickly find texts that are highly correlated with the original text and roughly calculate their similarity.
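The text does not describe how KBase's high-order fingerprint works, so the following is only a stand-in to illustrate the coarse retrieval idea of Step 20: hashed character n-gram fingerprints compared with Jaccard overlap. The function names, the n-gram length and the corpus are all hypothetical and are not part of KBase.

```python
import hashlib

def fingerprint(text, n=3):
    """Hash every character n-gram of the text into a set of integers.
    Stand-in technique; NOT the KBase fingerprint algorithm."""
    grams = {text[i:i + n] for i in range(max(1, len(text) - n + 1))}
    return {int(hashlib.md5(g.encode("utf-8")).hexdigest()[:8], 16) for g in grams}

def top_similar(query, corpus, top=2):
    """Rank stored texts by Jaccard overlap of fingerprints (a rough similarity)."""
    fq = fingerprint(query)
    scored = []
    for doc_id, text in corpus.items():
        fd = fingerprint(text)
        union = fq | fd
        score = len(fq & fd) / len(union) if union else 0.0
        scored.append((score, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [(doc_id, score) for score, doc_id in scored[:top]]

# Hypothetical stored projects.
corpus = {
    "p1": "distribution network fault location",
    "p2": "wind power forecasting",
    "p3": "fault location in distribution network",
}
candidates = top_similar("distribution network fault location method", corpus)
```

The candidates returned by such a coarse pass would then go through the sentence-level comparison of Step 30.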
Step 30: compare each similar text with the comparison text. This specifically includes:
Step 301: split the two texts into sentences at punctuation marks. Let the similar text and the comparison text be:
$D = \{d_1, d_2, d_3, \ldots, d_n\}$, $M = \{m_1, m_2, m_3, \ldots, m_k\}$
where D and M are the sentence sets of the original texts, d and m are the individual sentences obtained by splitting, and n and k are the numbers of sentences;
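Step 301 can be sketched as follows. The set of punctuation marks treated as sentence boundaries is an assumption, since the text only says "punctuation marks"; a plain split on "." would also break decimals, which a production splitter would need to handle.

```python
import re

# Assumed set of sentence-ending marks (Chinese and Western).
_SENTENCE_END = re.compile(r"[。！？；!?;.]+")

def split_sentences(text: str) -> list[str]:
    """Split a text into sentences at punctuation marks, dropping empty pieces."""
    parts = _SENTENCE_END.split(text)
    return [p.strip() for p in parts if p.strip()]

# D and M are the sentence sets of the similar text and the comparison text.
D = split_sentences("配电网故障定位。研究线路巡检方法！")
M = split_sentences("Fault location in distribution networks. Line inspection methods?")
```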
Step 302: segment the sentences in D and M into words and compare their similarity;
To find similar content in the two texts, the sentences are compared one by one, that is, n*k comparisons are made. First, all sentences are segmented into words. The similarity between two sentences is then computed as follows.
The words of the two sentences are matched against each other, with semantic analysis based on a power-industry subject dictionary and a near-synonym library. If identical or semantically similar words are found, their count is recorded, and the counts for all identical or similar words are added up. Here $LCS(d_n, m_k)$ is the count of identical or similar words in sentences $d_n$ and $m_k$, and Num records the total word count of $d_n$ and $m_k$. The proportion that the identical or similar words make up in each sentence is computed, and the smaller of the two proportions is taken as the similarity of the current pair of sentences;
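One reading of the description above is Sim(d, m) = min(LCS(d, m) / Num(d), LCS(d, m) / Num(m)). The sketch below implements that reading under stated assumptions: sentences are already segmented into word lists, matching ignores word order, counts are taken in words (the source's counting unit may differ), and the near-synonym library is a hypothetical dictionary mapping a word to a canonical form.

```python
from collections import Counter

def sentence_similarity(words_d, words_m, synonyms=None):
    """Smaller of the two per-sentence proportions of matched words."""
    synonyms = synonyms or {}
    if not words_d or not words_m:
        return 0.0

    def canon(word):
        # Map near-synonyms onto one canonical word (hypothetical library).
        return synonyms.get(word, word)

    remaining = Counter(canon(w) for w in words_m)
    matched = 0
    for w in words_d:
        c = canon(w)
        if remaining[c] > 0:      # each word of m can be matched only once
            remaining[c] -= 1
            matched += 1
    return min(matched / len(words_d), matched / len(words_m))

d = ["transformer", "fault", "diagnosis", "method"]
m = ["fault", "diagnosis", "approach", "for", "transformer"]
plain = sentence_similarity(d, m)                                   # 3 matches
with_syn = sentence_similarity(d, m, {"approach": "method"})        # 4 matches
```

Taking the minimum keeps a short sentence that is fully contained in a long one from scoring as a full match.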
Step 303: set a threshold similar, and compare it with the similarity obtained in step 302;
Sentences whose similarity is greater than the threshold similar are the similar sentences being sought. The threshold similar can be adjusted according to the actual situation.
Step 304: save the comparison results together with their context in the original text, then mark and output them.
The positions of the similar sentences in the two texts are marked, the original text is traced back from those positions to obtain the context of each original sentence, and the similar sentences and their context are extracted and saved.
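Steps 303 and 304 can be sketched together as follows. The default threshold, the context window of one sentence on each side, the record layout and the toy similarity function are all illustrative assumptions; the text fixes none of them.

```python
def toy_sim(a, b):
    """Toy word-overlap similarity, used only to make the example runnable."""
    wa, wb = set(a.split()), set(b.split())
    common = len(wa & wb)
    return min(common / len(wa), common / len(wb)) if wa and wb else 0.0

def find_similar_pairs(D, M, sim, threshold=0.5, window=1):
    """Keep sentence pairs whose similarity exceeds the threshold,
    recording their positions and surrounding context."""
    results = []
    for i, d in enumerate(D):
        for j, m in enumerate(M):
            score = sim(d, m)
            if score > threshold:
                results.append({
                    "d_index": i,
                    "m_index": j,
                    "score": score,
                    "d_context": D[max(0, i - window): i + window + 1],
                    "m_context": M[max(0, j - window): j + window + 1],
                })
    return results

D = ["grid planning study", "fault location in cables", "cost summary"]
M = ["intro text", "cable fault location in networks"]
hits = find_similar_pairs(D, M, toy_sim)
```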
Step 40: analyze the comparison results of all similar texts, mark the duplicate content, and output it in original-text order;
The comparison results of all similar articles are retrieved; where the same repeated sentence appears in several results, the entries are merged, and the repeated sentences in the result set are marked in red to form the final result.
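A minimal sketch of the merge in Step 40, assuming each comparison result is a record carrying the index of the matched sentence in the comparison text (the hypothetical key `m_index`). Plain text cannot show red, so the marking is represented by configurable tags.

```python
def mark_duplicates(sentences, results_per_text, open_tag="<red>", close_tag="</red>"):
    """Merge hits from all similar texts and mark repeated sentences,
    keeping the sentences in original-text order."""
    flagged = set()                      # a set merges identical repeats
    for results in results_per_text:
        for record in results:
            flagged.add(record["m_index"])
    return [
        f"{open_tag}{s}{close_tag}" if i in flagged else s
        for i, s in enumerate(sentences)
    ]

comparison_text = ["a", "b", "c"]
hits_text_1 = [{"m_index": 1}]
hits_text_2 = [{"m_index": 1}, {"m_index": 2}]
marked = mark_duplicates(comparison_text, [hits_text_1, hits_text_2])
```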
Step 50: optimize the similarity computation for the compared sentences; the optimization uses parallel computing, with multiple threads computing simultaneously.
Building the comparison matrix requires the pairwise similarity of all sentences, that is, n*k computations, and the computing time grows with the length of the texts being compared. A parallel computing method is therefore used, with multiple threads computing simultaneously. The number of threads is limited by limiting the number of sentences handled in a single comparison: if the number of sentences in a single comparison is t, the number of threads to open, v, is obtained with a Ceiling function.
Here the Ceiling function is a round-up function: whenever the value has a fractional part, it is rounded up to the next integer.
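The thread-count formula itself does not appear in this text. The sketch below assumes v = Ceiling(n*k / t), that is, the n*k sentence pairs are cut into batches of at most t and one thread is opened per batch; the function name and return shape are illustrative.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def parallel_similarity(D, M, sim, t):
    """Compute all n*k pair similarities in batches of at most t pairs,
    one thread per batch. Assumes v = ceil(n*k / t)."""
    pairs = list(product(range(len(D)), range(len(M))))
    if not pairs:
        return {}, 0
    v = math.ceil(len(pairs) / t)        # rounds up whenever there is a remainder
    batches = [pairs[b * t:(b + 1) * t] for b in range(v)]

    def run(batch):
        return [((i, j), sim(D[i], M[j])) for i, j in batch]

    scores = {}
    with ThreadPoolExecutor(max_workers=v) as pool:
        for chunk in pool.map(run, batches):
            scores.update(chunk)
    return scores, v

D = ["s1", "s2", "s3"]
M = ["s2", "s4"]
scores, v = parallel_similarity(D, M, lambda a, b: float(a == b), t=4)
```

In CPython, threads do not speed up pure-Python CPU-bound work because of the global interpreter lock; a process pool is the usual substitute when the similarity function is written in Python.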
Fig. 2 shows the data stored in the database after fragmentation; Fig. 3 shows a one-to-many duplicate-check result; Fig. 4 shows a one-to-one precise duplicate-check result.
Although the embodiments of the present invention are disclosed above, they are described only to facilitate understanding of the invention and are not intended to limit it. Any person skilled in the art to which the invention pertains may make modifications and changes in the form and details of implementation without departing from the spirit and scope disclosed by the invention, but the scope of patent protection of the invention shall still be defined by the appended claims.

Claims (6)

1. A project similarity comparison method for the power industry, characterized in that the method comprises:
Step 10: fragment the text, unify its format, and store it in a database;
Step 20: retrieve, through the KBase database, several texts most similar to the project under comparison;
Step 30: compare each similar text with the comparison text;
Step 40: analyze the comparison results of all similar texts, mark the duplicate content, and output it in original-text order;
Step 50: optimize the similarity computation for the compared sentences; the optimization uses parallel computing, with multiple threads computing simultaneously.
2. The project similarity comparison method for the power industry according to claim 1, characterized in that step 30 specifically includes:
Step 301: split the two texts into sentences at punctuation marks. Let the similar text and the comparison text be:
$D = \{d_1, d_2, d_3, \ldots, d_n\}$, $M = \{m_1, m_2, m_3, \ldots, m_k\}$
where D and M are the sentence sets of the original texts, d and m are the individual sentences obtained by splitting, and n and k are the numbers of sentences;
Step 302: segment the sentences in D and M into words and compare their similarity; the similarity between sentences is computed from the following quantities:
$LCS(d_n, m_k)$ is the count of identical or similar words in sentences $d_n$ and $m_k$, and Num records the total word count of $d_n$ and $m_k$; the proportion that the identical or similar words make up in each sentence is computed, and the smaller of the two proportions is taken as the similarity of the current pair of sentences;
Step 303: set a threshold similar, and compare it with the similarity obtained in step 302;
Step 304: save the comparison results together with their context in the original text, then mark and output them.
3. The project similarity comparison method for the power industry according to claim 1, characterized in that step 40 specifically includes: retrieving the comparison results of all similar articles, merging entries where the same repeated sentence appears in several results, and marking the repeated sentences in the result set in red to form the final result.
4. The project similarity comparison method for the power industry according to claim 1, characterized in that step 50 specifically includes: building the comparison matrix requires the pairwise similarity of all sentences, that is, n*k computations, and the computing time grows with the length of the texts being compared; a parallel computing method is therefore used, with multiple threads computing simultaneously; the number of threads is limited by limiting the number of sentences handled in a single comparison: if the number of sentences in a single comparison is t, the number of threads to open, v, is obtained with a Ceiling function;
where the Ceiling function is a round-up function: whenever the value has a fractional part, it is rounded up to the next integer.
5. The project similarity comparison method for the power industry according to claim 2, characterized in that, in step 302: the words of the two sentences are matched against each other, with semantic analysis based on a power-industry subject dictionary and a near-synonym library; if identical or semantically similar words are found, their count is recorded, and the counts for all identical or similar words are added up.
6. The project similarity comparison method for the power industry according to claim 2, characterized in that, in step 303: sentences whose similarity is greater than the threshold similar are the similar sentences being sought, wherein the threshold similar can be adjusted according to the actual situation.
CN201810521004.2A 2018-05-28 2018-05-28 Project similarity comparison method for power industry Active CN108846031B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810521004.2A CN108846031B (en) 2018-05-28 2018-05-28 Project similarity comparison method for power industry

Publications (2)

Publication Number Publication Date
CN108846031A true CN108846031A (en) 2018-11-20
CN108846031B CN108846031B (en) 2022-05-13

Family

ID=64213600

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810521004.2A Active CN108846031B (en) 2018-05-28 2018-05-28 Project similarity comparison method for power industry

Country Status (1)

Country Link
CN (1) CN108846031B (en)

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080077570A1 (en) * 2004-10-25 2008-03-27 Infovell, Inc. Full Text Query and Search Systems and Method of Use
CN102708090A (en) * 2012-05-16 2012-10-03 中国人民解放军国防科学技术大学 Verification method for shared storage multicore multithreading processor hardware lock
CN103455535A (en) * 2013-05-08 2013-12-18 深圳市明唐通信有限公司 Method for establishing knowledge base based on historical consultation data
CN103729422A (en) * 2013-12-23 2014-04-16 武汉传神信息技术有限公司 Information fragment associative output method and system
CN105630822A (en) * 2014-11-04 2016-06-01 上海兵飞软件有限公司 Method for marking similar contents in patent retrieval in red color
CN104699849A (en) * 2015-04-07 2015-06-10 同方知网数字出版技术股份有限公司 Digital library resource unified search system
CN105718506A (en) * 2016-01-04 2016-06-29 胡新伟 Duplicate-checking comparison method for science and technology projects
CN107015961A (en) * 2016-01-27 2017-08-04 中文在线数字出版集团股份有限公司 A kind of text similarity comparison method
CN106802884A (en) * 2017-02-17 2017-06-06 同方知网(北京)技术有限公司 A kind of method of format document text fragmentation
CN106844314A (en) * 2017-02-21 2017-06-13 北京焦点新干线信息技术有限公司 A kind of duplicate checking method and device of article
CN107122340A (en) * 2017-03-30 2017-09-01 浙江省科技信息研究院 A kind of similarity detection method for the science and technology item return analyzed based on synonym
CN107273350A (en) * 2017-05-16 2017-10-20 广东电网有限责任公司江门供电局 A kind of information processing method and its device for realizing intelligent answer
CN107357765A (en) * 2017-07-14 2017-11-17 北京神州泰岳软件股份有限公司 Word document flaking method and device
CN107391671A (en) * 2017-07-21 2017-11-24 华中科技大学 A kind of document leakage detection method and system
CN107992470A (en) * 2017-11-08 2018-05-04 中国科学院计算机网络信息中心 A kind of text duplicate checking method and system based on similarity
CN107908796A (en) * 2017-12-15 2018-04-13 广州市齐明软件科技有限公司 E-Government duplicate checking method, apparatus and computer-readable recording medium

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
QIANG LV 等: "Similarity Retrieval Algorithm based on Multilevel Fingerprint Comparison Matrix", 《PROCEEDINGS OF THE 2018 INTERNATIONAL SYMPOSIUM ON COMMUNICATION ENGINEERING & COMPUTER SCIENCE》 *
RAMAN A 等: "Speculative parallelization using software multi-threaded transactions", 《PROCEEDINGS OF THE FIFTEENTH INTERNATIONAL CONFERENCE ON ARCHITECTURAL SUPPORT FOR PROGRAMMING LANGUAGES AND OPERATING SYSTEMS》 *
吕强 等: "句子语义相似度计算", 《计算机工程与应用》 *
王小芳 等: "于最优化控制模型的文本主题域划分", 《吉林大学学报(理学版)》 *
肖鹏元: "基于GPU并行计算的重复文本检测系统", 《浙江大学》 *
陈佐 等: "一种多线程负载均衡分析方法研究", 《计算机应用研究》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109636352A (en) * 2018-12-20 2019-04-16 湖南晖龙集团股份有限公司 A kind of distributed content duplicate checking early warning system based on financial big data
CN110888920A (en) * 2019-12-06 2020-03-17 北京中电普华信息技术有限公司 Method and device for determining similarity of project functions
CN110888920B (en) * 2019-12-06 2022-10-11 北京中电普华信息技术有限公司 Method and device for determining similarity of project functions
CN114741474A (en) * 2022-04-20 2022-07-12 山东科迅信息技术有限公司 Data processing method applied to project declaration system

Also Published As

Publication number Publication date
CN108846031B (en) 2022-05-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant