CN108846031A - Project similarity comparison method for power industry - Google Patents
Project similarity comparison method for power industry
- Publication number: CN108846031A
- Application number: CN201810521004.2A
- Authority: CN (China)
- Prior art keywords: sentence, similar, comparison, text, similarity
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Abstract
The invention discloses a project similarity comparison method for the power industry, which comprises the following steps: fragmenting the text, unifying the format and storing the text in a database; retrieving several texts most similar to the comparison items through a KBase database; respectively comparing the similar texts with the comparison texts; analyzing comparison results of all similar texts, and forming result output according to a comparison sequence; and optimizing the similarity of the comparison statements, wherein the optimization adopts parallel computing and uses a plurality of threads to compute simultaneously. The method comprises the steps of splitting a text according to sentences, segmenting words to achieve the minimum granularity of text representation, then carrying out semantic analysis according to electric power subject words, searching similar text labels in all items of a database and outputting the similar text labels; the efficiency of the duplicate checking comparison of the declaration project is improved, and the waste of resources such as manpower and material resources is reduced.
Description
Technical field
The present invention relates to the fields of text mining and computer information processing, and more particularly to a project similarity comparison method for the power industry.
Background art
In the power industry, electric power organizations in every region submit project applications to the power grid company each year, so a large number of projects must be reviewed within a fixed period. The grid company currently reviews them manually. Reviewers must not only screen out submitted projects that resemble earlier ones, relying on memory, but also identify cutting-edge, innovative projects in light of current social developments. A reviewer therefore has to remember the projects submitted in previous years in order to exclude similar ones, and also has to find valuable projects according to current social needs. This increases the reviewers' workload and consumes a great deal of effort.
Given this situation, a project similarity comparison technique for the power industry is urgently needed. Although existing similarity comparison techniques could address the power industry's problem, they demand too much in hardware and software, and a non-technology industry has difficulty providing that support. In addition, because power industry project applications are confidential, it is difficult to submit them to general-purpose similarity comparison services. A lightweight project similarity comparison technique therefore needs to be developed for the power industry to solve these problems.
Summary of the invention
To solve the above technical problems, the object of the present invention is to provide a project similarity comparison method for the power industry.
The object of the present invention is achieved by the following technical solution:
A project similarity comparison method for the power industry, comprising:
Step 10: fragment the text, unify its format, and save it in a database;
Step 20: retrieve, through the KBase database, the several texts most similar to the project under comparison;
Step 30: compare each similar text with the comparison text;
Step 40: analyze the comparison results of all similar texts and output the results in comparison order;
Step 50: optimize the similarity computation for the compared sentences; the optimization uses parallel computing, with multiple threads computing simultaneously.
Compared with the prior art, one or more embodiments of the present invention may have the following advantages:
The method splits the text into sentences and then segments the sentences into words, reaching the minimum granularity of text representation. It then performs semantic analysis based on electric power subject words, searches all projects in the database for similar text, and marks and outputs it. This improves the efficiency of duplicate checking for submitted projects and reduces the waste of manpower, material, and other resources.
Brief description of the drawings
Fig. 1 is a flowchart of the project similarity comparison method for the power industry;
Fig. 2 is a diagram of the data saved in the database after fragmentation;
Fig. 3 shows a one-to-many duplicate-check result;
Fig. 4 shows a one-to-one precise duplicate-check result.
Detailed description of the embodiments
To make the object, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to the embodiments and drawings.
As shown in Fig. 1, the project similarity comparison method for the power industry includes the following steps:
Step 10: fragment the text, unify its format, and save it in a database;
Step 20: retrieve, through the KBase database, the several texts most similar to the project under comparison;
The KBase database uses a high-order fingerprint technique, which can quickly find texts that are highly correlated with the original text and roughly estimate their similarity.
Step 30: compare each similar text with the comparison text. This specifically includes:
Step 301: split the two texts into sentences at punctuation marks. Let the similar text and the comparison text be:
$D = \{d_1, d_2, d_3, \ldots, d_n\}$, $M = \{m_1, m_2, m_3, \ldots, m_k\}$
where D and M are the sentence sets of the original texts, d and m are the individual sentences obtained by splitting, and n and k are the numbers of sentences;
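Step 301 can be illustrated with a minimal sketch. The set of sentence-ending punctuation marks and the function name `split_sentences` are assumptions made for this example; the patent does not list the marks it splits on.

```python
import re

# ASSUMPTION: these Chinese and Western marks end a sentence.
# Splitting on "." also breaks decimals such as "3.5", which a real
# implementation would need to handle.
_SENTENCE_END = re.compile(r"[。！？；!?;.]+")


def split_sentences(text: str) -> list[str]:
    """Split a text into sentences at punctuation, dropping empty pieces."""
    return [piece.strip() for piece in _SENTENCE_END.split(text) if piece.strip()]


# D and M play the roles of the two sentence sets in step 301.
D = split_sentences("配电网自动化改造。提高供电可靠性！")
M = split_sentences("Upgrade the feeder automation. Improve supply reliability?")
```

Here `len(D)` and `len(M)` correspond to n and k.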
Step 302: segment the sentences in D and M into words and compare their similarity;
Finding similar content between the two texts requires sentence-by-sentence comparison, that is, n*k comparisons. First, all sentences are segmented into words. Written out from the definitions that follow, the similarity between two sentences is:
$$\mathrm{Sim}(d_n, m_k) = \min\left(\frac{\mathrm{LCS}(d_n, m_k)}{\mathrm{Num}(d_n)},\ \frac{\mathrm{LCS}(d_n, m_k)}{\mathrm{Num}(m_k)}\right)$$
The words of the two sentences are matched against each other, with semantic analysis based on an electric power subject dictionary and a near-synonym library. When identical or semantically similar words are found, their count is recorded, and the counts of all identical or similar words are accumulated. Here $\mathrm{LCS}(d_n, m_k)$ is the number of identical or similar words in sentences $d_n$ and $m_k$, and Num records the total word count of each sentence. The proportion of identical or similar words is computed for each sentence, and the smaller of the two proportions is taken as the similarity of the current pair of sentences;
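The calculation in step 302 can be sketched as follows. This is a hedged illustration, not the patented implementation: it assumes the sentences are already segmented into word lists, counts matches at word level, and models the near-synonym library as a plain dictionary that maps each word to a canonical form. The names `sentence_similarity` and `synonyms` are hypothetical.

```python
from collections import Counter


def sentence_similarity(words_d, words_m, synonyms=None):
    """Return min(matched / Num(d), matched / Num(m)) for two word lists.

    ASSUMPTION: `synonyms` maps a word to its canonical form, standing in
    for the electric power subject dictionary and near-synonym library.
    """
    synonyms = synonyms or {}
    if not words_d or not words_m:
        return 0.0
    canon_d = Counter(synonyms.get(w, w) for w in words_d)
    canon_m = Counter(synonyms.get(w, w) for w in words_m)
    # Multiset intersection: each word is matched at most once per occurrence.
    matched = sum((canon_d & canon_m).values())
    return min(matched / len(words_d), matched / len(words_m))


score = sentence_similarity(
    ["transformer", "fault", "diagnosis", "system"],
    ["transformer", "failure", "diagnosis"],
    synonyms={"failure": "fault"},
)
```

In this example three words match (counting "failure" as "fault"), so the two proportions are 3/4 and 3/3, and the smaller one, 0.75, is the similarity.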
Step 303: set a threshold `similar` and compare it with the similarity obtained in step 302;
A sentence whose similarity is greater than the threshold `similar` is a similar sentence to be found. The threshold `similar` can be adjusted according to the actual situation.
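The n*k comparison with a threshold can be sketched as below. The default threshold of 0.6, the function name `find_similar_pairs`, and the whitespace-based `overlap` scorer used for the demonstration are all assumptions; any sentence similarity function can be passed in.

```python
def find_similar_pairs(D, M, sim, similar=0.6):
    """Compare every sentence in D with every sentence in M (n*k calls)
    and keep the pairs whose similarity exceeds the threshold `similar`."""
    hits = []
    for i, d in enumerate(D):
        for j, m in enumerate(M):
            score = sim(d, m)
            if score > similar:
                hits.append((i, j, score))
    return hits


def overlap(d, m):
    # Stand-in scorer for the demo: share of common whitespace-separated
    # words, taking the smaller of the two proportions.
    a, b = d.split(), m.split()
    if not a or not b:
        return 0.0
    common = len(set(a) & set(b))
    return min(common / len(a), common / len(b))


D = ["replace aging transformer units", "train new staff"]
M = ["replace aging transformer", "build a substation", "train staff"]
hits = find_similar_pairs(D, M, overlap, similar=0.6)
```

Raising `similar` makes the check stricter; lowering it reports more borderline matches.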
Step 304: save the comparison results together with their context in the original text, and mark them for output.
The positions of the similar sentences in the two texts are marked. Using these positions, the method traces back to the original text, retrieves the context surrounding each original sentence, and saves the similar sentences together with that context.
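A minimal sketch of the context lookup in step 304 follows. The window of one sentence on each side and the record layout are assumptions; the patent does not say how much context is kept.

```python
def with_context(sentences, index, window=1):
    """Return the sentence at `index` with up to `window` neighbours on
    each side, plus its position so it can be traced back to the source."""
    start = max(0, index - window)
    end = min(len(sentences), index + window + 1)
    return {
        "position": index,
        "sentence": sentences[index],
        "before": sentences[start:index],
        "after": sentences[index + 1:end],
    }


text = ["first", "second", "third", "fourth"]
record = with_context(text, 2)
```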
Step 40: analyze the comparison results of all similar texts, mark the duplicated content, and output it in the order of the original text;
The comparison results of all similar articles are collected. Identical repeated sentences are merged, and the repeated sentences in the result set are marked in red to form the final result.
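The merge in step 40 can be sketched as follows. The input layout (one list of `(position, sentence)` hits per similar text) and the use of an HTML `span` for the red marking are assumptions chosen for illustration.

```python
def merge_results(results):
    """Merge hits from several similar texts, keeping each flagged
    position once and returning them in original-text order."""
    merged = {}
    for hits in results:
        for position, sentence in hits:
            merged.setdefault(position, sentence)
    return sorted(merged.items())


def mark_red(sentences, flagged_positions):
    # ASSUMPTION: the output is HTML, so "marking red" is a coloured span.
    return [
        f'<span style="color:red">{s}</span>' if i in flagged_positions else s
        for i, s in enumerate(sentences)
    ]


results = [[(2, "c"), (0, "a")], [(2, "c")]]
merged = merge_results(results)
marked = mark_red(["a", "b", "c"], {pos for pos, _ in merged})
```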
Step 50: optimize the similarity computation for the compared sentences; the optimization uses parallel computing, with multiple threads computing simultaneously.
Building the comparison matrix requires computing the similarity of every pair of sentences, that is, n*k computations, and the computation time grows with the length of the texts being compared. A parallel computing method is therefore used, with multiple threads computing simultaneously. The number of threads is bounded by limiting the number of sentences handled in a single comparison. If the number of sentences in a single comparison is t, the number of threads to start, v, is calculated with the Ceiling function, which rounds up: whenever the value has a fractional part, 1 is added to the integer part.
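The batching idea in step 50 can be sketched with Python's standard thread pool. The thread-count formula is not reproduced in this text, so the sketch assumes v = Ceiling(n*k / t), with t read as the number of sentence pairs per batch; this reading, and the names `thread_count` and `compare_parallel`, are assumptions.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def thread_count(n, k, t):
    # ASSUMPTION: v = Ceiling(n*k / t); the source formula is not shown.
    return math.ceil(n * k / t)


def compare_parallel(D, M, sim, t):
    """Score all n*k sentence pairs in batches of at most t pairs,
    one worker thread per batch."""
    pairs = [(i, j) for i in range(len(D)) for j in range(len(M))]
    if not pairs:
        return {}
    v = thread_count(len(D), len(M), t)
    batches = [pairs[b * t:(b + 1) * t] for b in range(v)]

    def run(batch):
        return [(i, j, sim(D[i], M[j])) for i, j in batch]

    scores = {}
    with ThreadPoolExecutor(max_workers=v) as pool:
        for part in pool.map(run, batches):
            for i, j, s in part:
                scores[(i, j)] = s
    return scores


def same(d, m):
    # Trivial stand-in scorer for the demo.
    return 1.0 if d == m else 0.0


scores = compare_parallel(["x", "y", "z"], ["x", "z"], same, t=4)
```

With n = 3, k = 2 and t = 4 there are 6 pairs, so two threads are started. In CPython, threads running pure-Python CPU-bound code are serialized by the global interpreter lock, so a process pool may be a better fit when the scorer is computation-heavy.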
Fig. 2 is a diagram of the data saved in the database after fragmentation; Fig. 3 shows a one-to-many duplicate-check result; Fig. 4 shows a one-to-one precise duplicate-check result.
Although the embodiments of the present invention are disclosed above, they are described only to facilitate understanding of the invention and are not intended to limit it. Any person skilled in the art to which the invention pertains may make modifications and changes to the form and details of implementation without departing from the spirit and scope disclosed by the invention, but the scope of patent protection of the invention shall still be defined by the appended claims.
Claims (6)
1. A project similarity comparison method for the power industry, characterized in that the method comprises:
Step 10: fragmenting the text, unifying its format, and saving it in a database;
Step 20: retrieving, through the KBase database, the several texts most similar to the project under comparison;
Step 30: comparing each similar text with the comparison text;
Step 40: analyzing the comparison results of all similar texts, marking the duplicated content, and outputting it in the order of the original text;
Step 50: optimizing the similarity computation for the compared sentences, wherein the optimization uses parallel computing, with multiple threads computing simultaneously.
2. The project similarity comparison method for the power industry according to claim 1, characterized in that step 30 specifically comprises:
Step 301: splitting the two texts into sentences at punctuation marks, the similar text and the comparison text being:
$D = \{d_1, d_2, d_3, \ldots, d_n\}$, $M = \{m_1, m_2, m_3, \ldots, m_k\}$
where D and M are the sentence sets of the original texts, d and m are the individual sentences obtained by splitting, and n and k are the numbers of sentences;
Step 302: segmenting the sentences in D and M into words and comparing their similarity; written out from the definitions that follow, the similarity between two sentences is:
$$\mathrm{Sim}(d_n, m_k) = \min\left(\frac{\mathrm{LCS}(d_n, m_k)}{\mathrm{Num}(d_n)},\ \frac{\mathrm{LCS}(d_n, m_k)}{\mathrm{Num}(m_k)}\right)$$
where $\mathrm{LCS}(d_n, m_k)$ is the number of identical or similar words in sentences $d_n$ and $m_k$, and Num records the total word count of each sentence; the proportion of identical or similar words is computed for each sentence, and the smaller of the two proportions is taken as the similarity of the current pair of sentences;
Step 303: setting a threshold `similar` and comparing it with the similarity obtained in step 302;
Step 304: saving the comparison results together with their context in the original text, and marking them for output.
3. The project similarity comparison method for the power industry according to claim 1, characterized in that step 40 specifically comprises: collecting the comparison results of all similar articles, merging identical repeated sentences, and marking the repeated sentences in the result set in red to form the final result.
4. The project similarity comparison method for the power industry according to claim 1, characterized in that step 50 specifically comprises: because building the comparison matrix requires computing the similarity of every pair of sentences, that is, n*k computations, and the computation time grows with the length of the texts being compared, a parallel computing method is used, with multiple threads computing simultaneously; the number of threads is bounded by limiting the number of sentences handled in a single comparison; if the number of sentences in a single comparison is t, the number of threads to start, v, is calculated with the Ceiling function, which rounds up: whenever the value has a fractional part, 1 is added to the integer part.
5. The project similarity comparison method for the power industry according to claim 2, characterized in that, in step 302: the words of the two sentences are matched against each other, with semantic analysis based on an electric power subject dictionary and a near-synonym library; when identical or semantically similar words are found, their count is recorded, and the counts of all identical or similar words are accumulated.
6. The project similarity comparison method for the power industry according to claim 2, characterized in that, in step 303: a sentence whose similarity is greater than the threshold `similar` is a similar sentence to be found, wherein the threshold `similar` can be adjusted according to the actual situation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810521004.2A CN108846031B (en) | 2018-05-28 | 2018-05-28 | Project similarity comparison method for power industry |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810521004.2A CN108846031B (en) | 2018-05-28 | 2018-05-28 | Project similarity comparison method for power industry |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108846031A (en) | 2018-11-20 |
CN108846031B (en) | 2022-05-13 |
Family
ID=64213600
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810521004.2A Active CN108846031B (en) | 2018-05-28 | 2018-05-28 | Project similarity comparison method for power industry |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108846031B (en) |
Patent Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080077570A1 (en) * | 2004-10-25 | 2008-03-27 | Infovell, Inc. | Full Text Query and Search Systems and Method of Use |
CN102708090A (en) * | 2012-05-16 | 2012-10-03 | 中国人民解放军国防科学技术大学 | Verification method for shared storage multicore multithreading processor hardware lock |
CN103455535A (en) * | 2013-05-08 | 2013-12-18 | 深圳市明唐通信有限公司 | Method for establishing knowledge base based on historical consultation data |
CN103729422A (en) * | 2013-12-23 | 2014-04-16 | 武汉传神信息技术有限公司 | Information fragment associative output method and system |
CN105630822A (en) * | 2014-11-04 | 2016-06-01 | 上海兵飞软件有限公司 | Method for marking similar contents in patent retrieval in red color |
CN104699849A (en) * | 2015-04-07 | 2015-06-10 | 同方知网数字出版技术股份有限公司 | Digital library resource unified search system |
CN105718506A (en) * | 2016-01-04 | 2016-06-29 | 胡新伟 | Duplicate-checking comparison method for science and technology projects |
CN107015961A (en) * | 2016-01-27 | 2017-08-04 | 中文在线数字出版集团股份有限公司 | A kind of text similarity comparison method |
CN106802884A (en) * | 2017-02-17 | 2017-06-06 | 同方知网(北京)技术有限公司 | A kind of method of format document text fragmentation |
CN106844314A (en) * | 2017-02-21 | 2017-06-13 | 北京焦点新干线信息技术有限公司 | A kind of duplicate checking method and device of article |
CN107122340A (en) * | 2017-03-30 | 2017-09-01 | 浙江省科技信息研究院 | A kind of similarity detection method for the science and technology item return analyzed based on synonym |
CN107273350A (en) * | 2017-05-16 | 2017-10-20 | 广东电网有限责任公司江门供电局 | A kind of information processing method and its device for realizing intelligent answer |
CN107357765A (en) * | 2017-07-14 | 2017-11-17 | 北京神州泰岳软件股份有限公司 | Word document flaking method and device |
CN107391671A (en) * | 2017-07-21 | 2017-11-24 | 华中科技大学 | A kind of document leakage detection method and system |
CN107992470A (en) * | 2017-11-08 | 2018-05-04 | 中国科学院计算机网络信息中心 | A kind of text duplicate checking method and system based on similarity |
CN107908796A (en) * | 2017-12-15 | 2018-04-13 | 广州市齐明软件科技有限公司 | E-Government duplicate checking method, apparatus and computer-readable recording medium |
Non-Patent Citations (6)
Title |
---|
QIANG LV 等: "Similarity Retrieval Algorithm based on Multilevel Fingerprint Comparison Matrix", 《PROCEEDINGS OF THE 2018 INTERNATIONAL SYMPOSIUM ON COMMUNICATION ENGINEERING & COMPUTER SCIENCE》 * |
RAMAN A 等: "Speculative parallelization using software multi-threaded transactions", 《PROCEEDINGS OF THE FIFTEENTH INTERNATIONAL CONFERENCE ON ARCHITECTURAL SUPPORT FOR PROGRAMMING LANGUAGES AND OPERATING SYSTEMS》 * |
吕强 等: "句子语义相似度计算", 《计算机工程与应用》 * |
王小芳 等: "于最优化控制模型的文本主题域划分", 《吉林大学学报(理学版)》 * |
肖鹏元: "基于GPU并行计算的重复文本检测系统", 《浙江大学》 * |
陈佐 等: "一种多线程负载均衡分析方法研究", 《计算机应用研究》 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109636352A (en) * | 2018-12-20 | 2019-04-16 | 湖南晖龙集团股份有限公司 | A kind of distributed content duplicate checking early warning system based on financial big data |
CN110888920A (en) * | 2019-12-06 | 2020-03-17 | 北京中电普华信息技术有限公司 | Method and device for determining similarity of project functions |
CN110888920B (en) * | 2019-12-06 | 2022-10-11 | 北京中电普华信息技术有限公司 | Method and device for determining similarity of project functions |
CN114741474A (en) * | 2022-04-20 | 2022-07-12 | 山东科迅信息技术有限公司 | Data processing method applied to project declaration system |
Also Published As
Publication number | Publication date |
---|---|
CN108846031B (en) | 2022-05-13 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||