CN108846031A

CN108846031A - Project similarity comparison method for power industry

Info

Publication number: CN108846031A
Application number: CN201810521004.2A
Authority: CN
Inventors: 段飞虎; 吕强; 冯自强; 张宏伟; 邓春宇; 季知祥; 史梦洁; 陈立斌; 王冠群; 徐翀; 梁芙翠; 王頔; 魏冠元; 付蓉; 马铁群; 朱承志; 孙黎滢; 谷记亭
Original assignee: Tongfang Knowledge Network Digital Publishing Technology Co ltd; State Grid Zhejiang Electric Power Co Ltd; China Electric Power Research Institute Co Ltd CEPRI; State Grid Energy Research Institute Co Ltd
Current assignee: Tongfang Knowledge Network Digital Publishing Technology Co ltd; State Grid Zhejiang Electric Power Co Ltd; China Electric Power Research Institute Co Ltd CEPRI; State Grid Energy Research Institute Co Ltd
Priority date: 2018-05-28
Filing date: 2018-05-28
Publication date: 2018-11-20
Anticipated expiration: 2038-05-28
Also published as: CN108846031B

Abstract

The invention discloses a project similarity comparison method for the power industry, which comprises the following steps: fragmenting the text, unifying its format and storing it in a database; retrieving, through a KBase database, several texts most similar to the project under comparison; comparing each similar text with the comparison text; analyzing the comparison results of all similar texts and forming the output in comparison order; and optimizing the similarity calculation for the compared sentences, wherein the optimization adopts parallel computing and uses a plurality of threads to compute simultaneously. The method splits a text into sentences and segments them into words to reach the minimum granularity of text representation, then carries out semantic analysis according to power subject terms, searches all projects in the database for similar text, and marks and outputs it. The efficiency of duplicate checking for declared projects is improved, and the waste of manpower, material and other resources is reduced.

Description

A project similarity comparison method for the power industry

Technical field

The present invention relates to the fields of text mining and computer information processing, and more particularly to a project similarity comparison method for the power industry.

Background art

In the power industry, power organizations in every region submit project declarations to the grid company each year, so a large number of projects must be reviewed within a fixed period. The grid company currently reviews them manually. Reviewers must not only rely on memory to exclude declared projects that are similar to earlier ones, but also identify cutting-edge, innovative projects in light of current social developments. In other words, a reviewer has to remember the projects declared in previous years in order to exclude similar ones, and also has to find valuable projects that meet current social needs. This increases the reviewers' workload and consumes a great deal of effort.

Given this situation, a project similarity comparison technique for the power industry is urgently needed. Existing similarity comparison techniques could solve the power industry's problem, but their hardware and software requirements are too high for a non-technology industry to support. In addition, because declared projects in the power industry are confidential, it is difficult to open them up to the generally available similarity comparison techniques. A lightweight project similarity comparison technique therefore needs to be developed for the power industry to solve the problems the industry currently faces.

Summary of the invention

To solve the above technical problem, the object of the present invention is to provide a project similarity comparison method for the power industry.

The object of the present invention is achieved by the following technical solution:

A project similarity comparison method for the power industry, comprising:

Step 10: fragmenting the text, unifying its format, and storing it in a database;

Step 20: retrieving, through the KBase database, several texts most similar to the project under comparison;

Step 30: comparing each similar text with the comparison text;

Step 40: analyzing the comparison results of all similar texts and forming the output in comparison order;

Step 50: optimizing the similarity calculation for the compared sentences, the optimization using parallel computing in which multiple threads compute simultaneously.

Compared with the prior art, one or more embodiments of the present invention may have the following advantages:

The method splits the text into sentences and segments them into words to reach the minimum granularity of text representation, then performs semantic analysis based on power subject terms, searches all projects in the database for similar text, and marks and outputs it. This improves the efficiency of duplicate checking for declared projects and reduces the waste of manpower, material and other resources.

Brief description of the drawings

Fig. 1 is a flowchart of the project similarity comparison method for the power industry;

Fig. 2 is a diagram of the data stored in the database after fragmentation;

Fig. 3 is a display of the one-to-many duplicate-checking results;

Fig. 4 is a display of the one-to-one precise duplicate-checking results.

Detailed description of the embodiments

To make the object, technical solution and advantages of the present invention clearer, the invention is described in further detail below with reference to the embodiments and the drawings.

As shown in Fig. 1, the project similarity comparison method for the power industry includes the following steps:

The KBase database uses a high-order fingerprint technique, which can quickly find the texts most relevant to the original text and roughly calculate their similarity.
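The coarse retrieval stage can be illustrated with a small sketch. KBase's fingerprint technique is not specified here, so the code below swaps in a generic stand-in: hashed word n-grams (shingles) compared by Jaccard overlap. The function names, the shingle size and the `corpus` layout are all assumptions made for illustration.

```python
import hashlib


def shingles(tokens, size=3):
    """Return a set of hashed token n-grams, used as a simple fingerprint.

    This is a generic stand-in, not KBase's actual fingerprint technique.
    """
    if not tokens:
        return set()
    if len(tokens) < size:
        grams = [tuple(tokens)]
    else:
        grams = [tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]
    return {hashlib.sha256(" ".join(g).encode("utf-8")).hexdigest() for g in grams}


def rough_similarity(a_tokens, b_tokens, size=3):
    """Jaccard overlap of the two fingerprints, a rough score in [0, 1]."""
    fa, fb = shingles(a_tokens, size), shingles(b_tokens, size)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)


def top_candidates(query_tokens, corpus, limit=5):
    """Rank stored texts (hypothetical layout: name -> token list) by rough score."""
    scored = [(rough_similarity(query_tokens, toks), name) for name, toks in corpus.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, score) for score, name in scored[:limit] if score > 0]
```

Only the texts returned by such a coarse pass would then go through the sentence-level comparison of step 30, which keeps the expensive comparison limited to a few candidates.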

Step 30: each similar text is compared with the comparison text, which specifically includes:

Step 301: the two texts are split into sentences at punctuation marks. Let the similar text and the comparison text be:

$$D=\{d_1,d_2,d_3,\ldots,d_n\},\qquad M=\{m_1,m_2,m_3,\ldots,m_k\}$$

where D and M are the sentence sets of the original texts, d and m are the individual sentences obtained by splitting, and n and k are the numbers of sentences;
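Splitting a text into its sentence set can be sketched as follows. The punctuation set and the function name are assumptions; a real system would likely need a more careful rule, since a plain full stop also appears in decimals and abbreviations.

```python
import re

# Sentence-ending punctuation, Chinese and Western. The exact set is an
# assumption; "." will also split decimals such as "3.5".
_SENTENCE_END = re.compile(r"[。！？；!?;.]+")


def split_sentences(text):
    """Split text at sentence-ending punctuation and drop empty pieces."""
    return [piece.strip() for piece in _SENTENCE_END.split(text) if piece.strip()]
```

Applying `split_sentences` to the similar text and to the comparison text yields the two lists that play the roles of D and M.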

Step 302: the sentences in D and M are segmented into words and compared for similarity;

Finding similar content between the two sets of sentences requires sentence-by-sentence comparison, that is, n*k comparisons. First, all sentences are segmented into words; the similarity between two sentences is then calculated as:

$$\mathrm{Sim}(d_n,m_k)=\min\left(\frac{\mathrm{LCS}(d_n,m_k)}{\mathrm{Num}(d_n)},\ \frac{\mathrm{LCS}(d_n,m_k)}{\mathrm{Num}(m_k)}\right)$$

The words of the two sentences are matched one by one, with semantic analysis based on the power subject dictionary and the near-synonym library. Whenever identical or semantically similar words are found, their word count is recorded, and the counts of all identical or similar words are added up. Here LCS(d_n, m_k) is the number of identical or similar words in sentences d_n and m_k, and Num records the total number of words in sentence d_n and in sentence m_k. The proportion of identical or similar words is calculated for each of the two sentences, and the smaller value is taken as the similarity of the current pair of sentences;
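The sentence similarity described above can be sketched in a few lines. The sketch assumes the sentences are already segmented into word lists, represents the near-synonym library as a hypothetical `synonyms` mapping from a word to a canonical form, and counts matches as a multiset intersection; the text does not say how repeated words are counted, so that choice is an assumption.

```python
from collections import Counter


def canonical(word, synonyms):
    """Map a word to its canonical form via the (hypothetical) synonym table."""
    return synonyms.get(word, word)


def matched_count(d_tokens, m_tokens, synonyms):
    """Number of identical or synonymous words shared by the two sentences."""
    d_counts = Counter(canonical(w, synonyms) for w in d_tokens)
    m_counts = Counter(canonical(w, synonyms) for w in m_tokens)
    # Counter intersection keeps the minimum count of each shared word.
    return sum((d_counts & m_counts).values())


def sentence_similarity(d_tokens, m_tokens, synonyms=None):
    """Smaller of the two match ratios, as described in step 302."""
    synonyms = synonyms or {}
    if not d_tokens or not m_tokens:
        return 0.0
    common = matched_count(d_tokens, m_tokens, synonyms)
    return min(common / len(d_tokens), common / len(m_tokens))
```

Taking the smaller ratio means a short sentence fully contained in a long one does not score as highly as two sentences of similar length that share most of their words.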

Step 303: a threshold similar is set and compared with the similarity obtained in step 302;

A sentence whose similarity is greater than the threshold similar is a similar sentence to be found. The threshold similar can be adjusted according to the actual situation.

Step 304: the comparison results and their context in the original text are saved, marked and output.

The positions of the similar sentences in the two texts are marked, the original text is traced back by position to obtain the context of each original sentence, and the similar sentences and their context are extracted and saved.
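Steps 303 and 304 can be sketched together: every sentence pair is scored, pairs above the threshold are kept, and each hit records its positions and surrounding sentences. The `window` parameter, the record layout and the toy `word_overlap` scorer are assumptions for illustration; any scorer with the same signature could be passed in.

```python
def word_overlap(a, b):
    """Toy scorer for the demo: smaller ratio of shared whitespace-split words."""
    wa, wb = a.split(), b.split()
    if not wa or not wb:
        return 0.0
    common = len(set(wa) & set(wb))
    return min(common / len(wa), common / len(wb))


def find_similar(d_sents, m_sents, score, threshold=0.6, window=1):
    """Return hits whose similarity exceeds the threshold, with context.

    Each hit stores the sentence positions in both texts and the
    neighbouring sentences (window on each side) taken from the originals.
    """
    hits = []
    for i, d in enumerate(d_sents):
        for j, m in enumerate(m_sents):
            s = score(d, m)
            if s > threshold:
                hits.append({
                    "d_index": i,
                    "m_index": j,
                    "similarity": s,
                    "d_context": d_sents[max(0, i - window): i + window + 1],
                    "m_context": m_sents[max(0, j - window): j + window + 1],
                })
    return hits
```

Raising `threshold` reports only near-identical sentences, while lowering it surfaces looser paraphrases at the cost of more false positives.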

Step 40: the comparison results of all similar texts are analyzed, the duplicated content is marked, and the result is output in the order of the original text;

The comparison results of all similar texts are collected; identical repeated sentences are merged, and the repeated sentences in the result set are marked in red to form the final result.
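The merging in step 40 can be sketched as follows. The input layout (`hits_by_text`, mapping each similar text's name to the indices of matched sentences in the original) and the `[DUP]` marker, used here in place of red highlighting, are assumptions for illustration.

```python
def merge_results(sentences, hits_by_text):
    """Merge per-text hits so each original sentence appears once.

    sentences: sentences of the original text, in order.
    hits_by_text: hypothetical layout, text name -> indices of matched sentences.
    """
    sources = {}
    for name, indices in hits_by_text.items():
        for index in indices:
            sources.setdefault(index, set()).add(name)
    report = []
    for index, sentence in enumerate(sentences):
        names = sorted(sources.get(index, ()))
        report.append({
            "index": index,
            "sentence": sentence,
            "duplicate": bool(names),
            "sources": names,
        })
    return report


def render(report):
    """Output in original order; '[DUP]' stands in for red marking."""
    return "\n".join(
        f"[DUP] {row['sentence']}" if row["duplicate"] else row["sentence"]
        for row in report
    )
```

Because hits are keyed by sentence index, a sentence matched by several similar texts is reported once, with all of its sources listed.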

Building the alignment matrix requires the similarity of every pair of sentences across the two texts, that is, n*k calculations, and the computation time grows with the length of the texts being compared. A parallel computing method is therefore used, with multiple threads calculating simultaneously. The number of threads is bounded by limiting the number of sentences handled in a single comparison: if the number of sentences in a single comparison is t, the number of threads v to open is calculated as:

where Ceiling is a rounding function that rounds up, adding 1 whenever there is a fractional part.
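The parallel scheme can be sketched with Python's standard thread pool. The sketch assumes one plausible reading of the thread count, v = Ceiling(n*k / t), where t is taken as the number of sentence pairs handled by one thread; that reading, the function names and the chunking strategy are assumptions, not details confirmed by the text.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def thread_count(n, k, t):
    """Assumed reading of the thread count: ceil(n * k / t), with t > 0."""
    return math.ceil(n * k / t)


def parallel_scores(d_sents, m_sents, score, t):
    """Score all n*k sentence pairs, t pairs per worker thread."""
    pairs = [(i, j) for i in range(len(d_sents)) for j in range(len(m_sents))]
    if not pairs:
        return {}
    v = thread_count(len(d_sents), len(m_sents), t)
    chunks = [pairs[start:start + t] for start in range(0, len(pairs), t)]

    def work(chunk):
        return [(i, j, score(d_sents[i], m_sents[j])) for i, j in chunk]

    with ThreadPoolExecutor(max_workers=v) as pool:
        parts = list(pool.map(work, chunks))
    return {(i, j): s for part in parts for i, j, s in part}
```

In CPython, threads do not speed up pure-Python CPU-bound scoring because of the global interpreter lock, so a process pool or a native scorer may be needed to see a real gain; the structure of the split stays the same.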

Fig. 2 is a diagram of the data stored in the database after fragmentation; Fig. 3 is a display of the one-to-many duplicate-checking results; Fig. 4 is a display of the one-to-one precise duplicate-checking results.

Although the embodiments of the present invention are disclosed above, they are described only to facilitate understanding of the invention and are not intended to limit it. Any person skilled in the art to which the invention pertains may make modifications and changes in the form and details of implementation without departing from the spirit and scope disclosed by the invention, but the scope of patent protection of the invention remains as defined by the appended claims.

Claims

1. A project similarity comparison method for the power industry, characterized in that the method comprises:

Step 30: comparing each similar text with the comparison text;

Step 40: analyzing the comparison results of all similar texts, marking the duplicated content, and outputting the result in the order of the original text;

Step 50: optimizing the similarity calculation for the compared sentences, the optimization using parallel computing in which multiple threads compute simultaneously.

2. The project similarity comparison method for the power industry according to claim 1, characterized in that step 30 specifically comprises:

$$D=\{d_1,d_2,d_3,\ldots,d_n\},\qquad M=\{m_1,m_2,m_3,\ldots,m_k\}$$

Step 302: segmenting the sentences in D and M into words and comparing them for similarity, the similarity between two sentences being calculated as:

$$\mathrm{Sim}(d_n,m_k)=\min\left(\frac{\mathrm{LCS}(d_n,m_k)}{\mathrm{Num}(d_n)},\ \frac{\mathrm{LCS}(d_n,m_k)}{\mathrm{Num}(m_k)}\right)$$

where LCS(d_n, m_k) is the number of identical or similar words in sentences d_n and m_k, and Num records the total number of words in sentence d_n and in sentence m_k; the proportion of identical or similar words is calculated for each of the two sentences, and the smaller value is taken as the similarity of the current pair of sentences;

Step 303: setting a threshold similar and comparing it with the similarity obtained in step 302;

3. The project similarity comparison method for the power industry according to claim 1, characterized in that step 40 specifically comprises: collecting the comparison results of all similar texts, merging identical repeated sentences, and marking the repeated sentences in the result set in red to form the final result.

4. The project similarity comparison method for the power industry according to claim 1, characterized in that step 50 specifically comprises: building the alignment matrix requires the similarity of every pair of sentences across the two texts, that is, n*k calculations, and the computation time grows with the length of the texts being compared; a parallel computing method is therefore used, with multiple threads calculating simultaneously, and the number of threads is bounded by limiting the number of sentences handled in a single comparison; if the number of sentences in a single comparison is t, the number of threads v to open is calculated as:

5. The project similarity comparison method for the power industry according to claim 2, characterized in that, in step 302: the words of the two sentences are matched one by one, with semantic analysis based on the power subject dictionary and the near-synonym library; whenever identical or semantically similar words are found, their word count is recorded, and the counts of all identical or similar words are added up.

6. The project similarity comparison method for the power industry according to claim 2, characterized in that, in step 303: a sentence whose similarity is greater than the threshold similar is a similar sentence to be found, wherein the threshold similar can be adjusted according to the actual situation.