CN106354708A - Client interaction information search engine system based on electricity information collection system

Info

Publication number: CN106354708A
Application number: CN201510409475.0A
Authority: CN
Inventors: 刘宣; 徐英辉; 祝恩国; 李造利; 窦健; 阿辽沙．叶; 章宏伟
Original assignee: Tianjin University; State Grid Corp of China SGCC; China Electric Power Research Institute Co Ltd CEPRI
Current assignee: Tianjin University; State Grid Corp of China SGCC; China Electric Power Research Institute Co Ltd CEPRI
Priority date: 2015-07-13
Filing date: 2015-07-13
Publication date: 2017-01-25

Abstract

The invention provides a client interaction information search engine system based on an electricity information collection system. The search engine system is constructed based on an open-source search engine solr and comprises an electric power word bank module, a document analysis module, a Chinese word segmentation module, an index library module and a retrieval interface module. The Chinese word segmentation technology and an electric power word bank are combined to establish a standard index; during search, the standard word bank is also used, so that search is accurate, comprehensive and fast.

Description

A client interaction information search engine system based on an electricity information collection system

Technical field

The present invention relates to the field of electric power, and in particular to a client interaction information search engine system based on an electricity information collection system.

Background art

According to China's smart grid development plan, the smart grid entered its full-scale construction stage in 2011 to 2015. In 2015, 41 smart grid innovation demonstration projects are to be completed, and the user interaction functions of the electricity information collection system in the smart grid will be basically realized, including information interaction, electric energy interaction and business interaction. On the basis of interaction information, measures such as time-of-use pricing, tiered pricing and two-way dispatching have produced a clear peak-shaving and valley-filling effect, and user loads have been controlled without cutting power, which advances orderly electricity use. Structured client interaction information is stored in server databases, while the future trend for unstructured interaction information is an electric power data center based on the Hadoop framework. The number of connected electricity users in a typical large or medium-sized city reaches the order of millions, which brings massive volumes of interaction information. Finding information efficiently, accurately and comprehensively has become the bottleneck in improving work efficiency and making full use of data resources.

Summary of the invention

To overcome the above deficiencies of the prior art, the present invention provides a client interaction information search engine system based on an electricity information collection system. It can search the contents of files currently stored on a single machine or shared on an intranet, and thereby look up customer information accurately.

The solution adopted to achieve the above purpose is as follows:

A client interaction information search engine system based on an electricity information collection system, wherein the search engine system is built on the open-source search engine Solr and comprises an electric power word bank module, a document analysis module, a Chinese word segmentation module, an index library module and a retrieval interface module.

Preferably, the electric power word bank module draws on two sources. First, with reference to national and power industry standards and the standards of the international power grid technical committee, the specialized vocabulary commonly used in user interaction information is added to the word bank. Second, keywords from core journals, together with related terms extracted from the abstracts of journals such as Proceedings of the CSEE by a domain term extraction algorithm based on regular distribution entropy, are added to the word bank.

Preferably, the document analysis module is responsible for parsing files. It extracts descriptive text, including the document title, author and main content, from unstructured documents in formats such as PDF, Word, Excel and PowerPoint. It then performs syntactic analysis and language processing: the words in the text are scored with the tf-idf (term frequency-inverse document frequency) weighting algorithm, words whose weight exceeds a threshold are extracted as the document's core vocabulary, and the information gain method is applied to refine the core vocabulary, producing a text that contains the content and the core vocabulary.
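The tf-idf thresholding step described above can be illustrated with a minimal Python sketch. It assumes that documents are already tokenized, that tf is the relative term frequency and that idf is the natural logarithm of (number of documents / document frequency); the source does not fix these choices, and the function name `tfidf_core_terms` and the threshold value are hypothetical. The information gain refinement is not shown.

```python
import math
from collections import Counter


def tfidf_core_terms(docs, threshold):
    """Return, per tokenized document, the set of terms whose tf-idf exceeds threshold."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs at least once.
    df = Counter(term for doc in docs for term in set(doc))
    core_terms = []
    for doc in docs:
        if not doc:
            core_terms.append(set())
            continue
        counts = Counter(doc)
        weights = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
        core_terms.append({t for t, w in weights.items() if w > threshold})
    return core_terms


# Illustrative data only (assumed, not taken from the source).
docs = [
    ["bill", "payment", "bill", "customer"],
    ["tiered_tariff", "customer", "bill"],
    ["tou_tariff", "customer"],
]
core = tfidf_core_terms(docs, threshold=0.15)
```

A term such as `customer`, which occurs in every document, gets an idf of zero and is therefore never selected as core vocabulary, whatever the threshold.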

Documents of different formats are processed with resources from several open-source libraries. For example, Apache POI can read and write files in Microsoft Office formats. Its components include: HSSF, which reads and writes Microsoft Excel XLS files; XSSF, which reads and writes Microsoft Excel OOXML XLSX files; HWPF, which reads and writes Microsoft Word DOC files; HSLF, which reads and writes Microsoft PowerPoint files; and HDGF, which reads and writes Microsoft Visio files. PDFBox provides functions for creating and processing PDF documents and extracting their content.

Preferably, the Chinese word segmentation module applies a Chinese word segmentation algorithm to segment the full text of each text file. The segmentation results are compared one by one with the standard terms in the electric power word bank; segments that do not appear in the word bank are deleted, and the standard words of the word bank are used to form the index file. Chinese word segmentation uses the "IK Analyzer" toolkit. Segmentation is compared against the standard electric power word bank both when the index database is built and when a search is performed, so an index database built in this way is easily searched by a search engine that uses the same standard word bank.
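To make the "segment, then keep only word bank terms" idea concrete, the following Python sketch uses forward maximum matching against a small word bank. This is a simplified stand-in chosen for illustration and is not the IK Analyzer algorithm; the function name `segment_with_word_bank`, the `max_len` parameter and the sample word bank are all assumptions.

```python
def segment_with_word_bank(text, word_bank, max_len=6):
    """Forward maximum matching that keeps only substrings found in the word bank."""
    terms = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first so "阶梯电价" wins over "电价".
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in word_bank:
                match = candidate
                break
        if match:
            terms.append(match)
            i += len(match)
        else:
            i += 1  # This character starts no standard term, so it is dropped.
    return terms


# Hypothetical miniature electric power word bank.
POWER_WORD_BANK = {"阶梯电价", "电价", "电费", "峰谷电价", "有序用电"}
standard_terms = segment_with_word_bank("本月电费按阶梯电价计算", POWER_WORD_BANK)
```

Because the same function and the same word bank would be applied at indexing time and at query time, both sides end up with the same standardized vocabulary, which is the property the paragraph above relies on.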

Preferably, the index library module preprocesses the interaction information data, removing duplicates with a digital signature algorithm, represents the feature information of each text with the vector space model (VSM), and builds the index database, which provides the retrieval source for user searches.

The index file of the index library module contains index terms and an index list.
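The source does not say which digital signature algorithm is used for deduplication. As a minimal sketch, assuming a SHA-256 content digest stands in for the signature and that differences in whitespace should be ignored (both are assumptions, as is the name `deduplicate`), duplicate records could be dropped before indexing like this:

```python
import hashlib


def deduplicate(records):
    """Keep the first occurrence of each record, comparing by content digest."""
    seen = set()
    unique = []
    for text in records:
        # Collapse runs of whitespace so trivially different copies collide.
        normalized = " ".join(text.split())
        fingerprint = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(text)
    return unique
```

An exact digest only catches identical content after normalization; near-duplicate detection would need a different technique, which the source does not describe.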

Unlike other topic-specific search engines, the system builds its index terms from the electric power word bank when forming the index library, which yields a standardized index library.
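An index file made of index terms and an index list can be pictured as an inverted index whose keys are restricted to word bank terms. The sketch below is a toy in-memory version under that assumption; it does not reflect how Solr stores its index, and `build_index` is a hypothetical name.

```python
from collections import defaultdict


def build_index(segmented_docs, word_bank):
    """Map each word bank term to the sorted IDs of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, terms in segmented_docs.items():
        for term in terms:
            if term in word_bank:  # Non-standard terms never become index terms.
                postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Illustrative data only (assumed, not taken from the source).
index = build_index(
    {"d1": ["bill", "hello"], "d2": ["bill", "tiered tariff"]},
    word_bank={"bill", "tiered tariff"},
)
```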

Preferably, the retrieval interface module is the interface the user works with; it accepts user input and outputs query results. During retrieval, the input search terms are segmented to form keywords; the segmenter analyzes and parses the keywords and compares them with the electric power word bank to form multiple search terms; the index file is then searched, and the results are sorted and output to the user.

Unlike other search engines, the system compares the segmented keywords with the electric power word bank at retrieval time, which yields standardized search terms.
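The query-side flow (segment the input, keep only standard terms, then look them up in the index) might look like the following Python sketch. It assumes the query has already been segmented into tokens and that the index is a plain dictionary from term to document IDs; `standardize_query` and `lookup` are hypothetical names.

```python
def standardize_query(query_tokens, word_bank):
    """Keep only tokens that are standard word bank terms, without repeats."""
    seen = set()
    terms = []
    for token in query_tokens:
        if token in word_bank and token not in seen:
            seen.add(token)
            terms.append(token)
    return terms


def lookup(search_terms, index):
    """Return {doc_id: number of search terms that the document contains}."""
    hits = {}
    for term in search_terms:
        for doc_id in index.get(term, []):
            hits[doc_id] = hits.get(doc_id, 0) + 1
    return hits
```

Counting matched terms is only a placeholder for scoring; the similarity measure the patent actually describes is the cosine between weighted feature vectors.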

Preferably, the retrieval comprises:

1) Establishing feature items: feature items are established from the characters, words and sentences of a document, document = d(t₁, t₂, …, t_k, …, t_n), where t_k denotes the k-th feature item and each feature item is represented as one dimension;

2) Calculating the weights of the feature items: in an object to be retrieved, each feature item is assigned a weight c_j that represents the importance of the feature item in the text;

3) Establishing the vector space model: once the order information between the feature items is discarded, a text is represented as a vector, i.e. a point in the feature space; text d_i is represented as V(d_i) = (w_i1, w_i2, …, w_ik, …, w_im), where w_ik = f(t_k, c_j) is the weight function, which reflects the degree to which feature item t_k, with weight c_j, determines whether document d_i belongs to the feature set;

4) Similarity calculation: the vector space model maps all documents into one vector space, which converts the document matching problem into a vector matching problem in that space. The distance between points in the n-dimensional space is measured by the cosine of the angle between the vectors, which expresses the degree of similarity between documents. Suppose the target document is u and an unknown document compared with u during the search is v_i; the smaller the angle, the higher the similarity between the documents. The similarity is calculated by formula (1):

sim(u, v_{i}) = \cos\theta = \frac{\sum_{k=1}^{m} w_{ik} w_{k}}{\sqrt{\sum_{k=1}^{m} w_{ik}^{2}} \sqrt{\sum_{k=1}^{m} w_{k}^{2}}} (1)

where w_ik is the weight function of the k-th feature item in the unknown document v_i, w_k is the weight function of the k-th feature item in the target document u, and there are m feature items, i.e. k ranges from 1 to m. The weight function is calculated from the term frequency as w_ik = tf_k(d_i)^{1/2} and normalized:

w_{ik} = \frac{w_{ik}}{\sqrt{\sum_{j \in d_{j}} w_{ij}^{2}}}

where tf_k(d_i) denotes the frequency with which the k-th feature item occurs in the unknown document v_i, j ranges over all feature items from 1 to m, and d_j denotes the j-th document. w_k is calculated in the same way as w_ik: in the target document u the weight function is calculated from the term frequency as w_k = tf_k(d)^{1/2} and normalized:

w_{k} = \frac{w_{k}}{\sqrt{\sum_{j \in d_{j}} w_{j}^{2}}}

where tf_k(d) denotes the frequency with which the k-th feature item occurs in the target document u, and j ranges over all feature items from 1 to m.

When the search results are returned to the user, they are sorted by similarity from high to low and presented as retrieval entries.
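The four retrieval steps and the final ranking can be sketched in Python as follows. The sketch assumes that each feature item is a word bank term, that the weight is the square root of the raw term frequency followed by L2 normalization, and that similarity is the cosine between the two weight vectors; the function names and the sample data are hypothetical.

```python
import math
from collections import Counter


def weight_vector(terms, features):
    """Weights w_k = sqrt(tf_k), normalized so the vector has unit length."""
    counts = Counter(terms)
    raw = [math.sqrt(counts[f]) for f in features]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw] if norm else raw


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(target_terms, docs, features):
    """Return (doc_id, similarity) pairs sorted from most to least similar."""
    u = weight_vector(target_terms, features)
    scored = [
        (doc_id, cosine(u, weight_vector(terms, features)))
        for doc_id, terms in docs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative data only (assumed, not taken from the source).
features = ["bill", "tariff", "payment"]
docs = {"a": ["bill", "bill", "payment"], "b": ["tariff"], "c": ["bill", "tariff"]}
ranked = rank(["bill", "payment"], docs, features)
```

Since the weight vectors are already normalized to unit length, the denominator in `cosine` is 1 for any non-zero vector; it is kept so the function also works on unnormalized input.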

Compared with the prior art, the present invention has the following advantages:

The present invention combines Chinese word segmentation technology with an electric power word bank to establish a standard index. During a search, the segmented keywords are compared with the electric power word bank to form standardized search terms, which makes the search accurate, comprehensive and fast.

Brief description of the drawings

Fig. 1 is a diagram of the search engine framework of the present invention;

Fig. 2 is a diagram of the index framework of the present invention;

Fig. 3 is a diagram of the retrieval framework of the present invention.

Detailed description of the embodiments

Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

The present invention is based on the open-source search engine toolkit Solr and builds a retrieval tool that searches the contents of files on a hard disk. It comprises an electric power word bank module, a document analysis module, a Chinese word segmentation module, an index library module and a retrieval interface module. The electric power word bank module establishes industry-standard terminology, in particular collating the specialized vocabulary commonly used in user interaction information. The document analysis module parses files. The Chinese word segmentation module applies a Chinese word segmentation algorithm to segment the full text of the file content and, together with the electric power word bank, builds the full-text index. The index library stores the data. The retrieval interface module is the interface the user works with; it accepts user input and outputs query results. The system framework is shown in Fig. 1.

In the document analysis module, descriptive text, including the document title, author and main content, is extracted from unstructured documents in formats such as PDF, Word, Excel and PowerPoint; syntactic analysis and language processing are then performed and the index is formed. Documents of different formats are processed with resources from several open-source libraries. For example, Apache POI can read and write files in Microsoft Office formats. Its components include: HSSF, which reads and writes Microsoft Excel XLS files; XSSF, which reads and writes Microsoft Excel OOXML XLSX files; HWPF, which reads and writes Microsoft Word DOC files; HSLF, which reads and writes Microsoft PowerPoint files; and HDGF, which reads and writes Microsoft Visio files. PDFBox provides functions for creating and processing PDF documents and extracting their content.

The content of the electric power word bank is drawn mainly from national and power industry standards and from the standards of the international power grid technical committee. Because the smart grid is a new and still-evolving development, some of its commonly used words need to be collected and added separately.

Chinese word segmentation uses the "IK Analyzer" toolkit. Segmentation is compared against the standard electric power word bank both when the index database is built and when a search is performed, so an index database built in this way is easily searched by a search engine that uses the same standard word bank.

The index framework is shown in Fig. 2. For the different types of files stored on the hard disk, such as Word, Excel, TXT and PDF, the corresponding toolkit extracts the file content to form text, which is passed to the segmenter. The segmenter builds the index file with the help of the electric power word bank; the index file contains the keywords established by extracting key information from the text and comparing it with the electric power word bank.

The retrieval framework is shown in Fig. 3. After the user enters keywords, the segmenter analyzes and parses them and compares them with the electric power word bank to form multiple search terms; the index file is then searched, and the results are sorted and output to the user.

The implementation is as follows:

1) Establishing feature items: feature items are established from the characters, words, sentences and so on of a document, document = d(t₁, t₂, …, t_k, …, t_n), where t_k denotes the k-th feature item and each feature item is represented as one dimension. Specifically, words in a customer's electricity payment record, such as paying entity, payment amount, customer number, customer address, item name, billing month, amount received this time and total in RMB, can each serve as one feature item of d(t₁, t₂, …, t_k, …, t_n).

2) Calculating the weights of the feature items: in an object to be retrieved (such as a text), each feature item is assigned a weight c_j that represents the importance of the feature item in the text. Specifically, feature items the user cares about, such as billing month, item name, amount receivable and account balance, are given heavier weights, while feature items only loosely related to the search, such as customer address, serial number and client, are given smaller weights.

3) Establishing the vector space model: once the order information between the feature items is discarded, a text is represented as a vector, i.e. a point in the feature space. For example, text d_i is represented as V(d_i) = (w_i1, w_i2, …, w_ik, …, w_im), where w_ik = f(t_k, c_j) is the weight function, which reflects how important feature t_k is in determining whether document d_i belongs to c_j.

4) Similarity calculation: the vector space model maps all documents into one vector space, which converts the document matching problem into a vector matching problem in that space. The distance between points in the n-dimensional space is measured by the cosine of the angle between the vectors, which expresses the degree of similarity between documents. Suppose the target document vector is u and an unknown document is v_i; the smaller the angle, the higher the similarity between the documents. The similarity is calculated by formula (1):

sim(u, v_{i}) = \cos\theta = \frac{\sum_{k=1}^{m} w_{ik} w_{k}}{\sqrt{\sum_{k=1}^{m} w_{ik}^{2}} \sqrt{\sum_{k=1}^{m} w_{k}^{2}}} (1)

The weight w_ik is a function of the frequency with which the feature item occurs in the document. With tf_k(d_i) denoting the frequency with which t_k occurs in document d_i, the weight function is calculated from the term frequency as w_ik = tf_k(d_i)^{1/2} and then normalized:

w_{ik} = \frac{w_{ik}}{\sqrt{\sum_{j \in d_{j}} w_{ij}^{2}}}

When the search results are returned to the user, they are sorted by similarity and presented as retrieval entries.

Finally, it should be noted that the above embodiments only illustrate the technical solution of the present application and do not limit its scope of protection. Although the application has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that, after reading the application, they can still make various changes, modifications or equivalent substitutions to its specific embodiments, and that all such changes, modifications or equivalent substitutions fall within the scope of the pending claims.

Claims

1. A client interaction information search engine system based on an electricity information collection system, characterized in that the search engine system is built on the open-source search engine Solr and comprises an electric power word bank module, a document analysis module, a Chinese word segmentation module, an index library module and a retrieval interface module.

2. The search engine system as claimed in claim 1, characterized in that the electric power word bank module draws on two sources: first, with reference to national and power industry standards and the standards of the international power grid technical committee, the specialized vocabulary commonly used in user interaction information is added to the word bank; second, keywords from core journals and related terms extracted from abstracts by a domain term extraction algorithm based on regular distribution entropy are added to the word bank.

3. The search engine system as claimed in claim 1, characterized in that the document analysis module is responsible for parsing files: it extracts descriptive text from unstructured documents and performs syntactic analysis and language processing, in which the words in the text are scored with the tf-idf weighting algorithm, words whose weight exceeds a threshold are extracted as the document's core vocabulary, and the information gain method is applied to refine the core vocabulary, producing a text that contains the content and the core vocabulary.

4. The search engine system as claimed in claim 1, characterized in that the Chinese word segmentation module applies a Chinese word segmentation algorithm to segment the full text of each text file, compares the segmentation results one by one with the standard terms in the electric power word bank, deletes segments that do not appear in the word bank, and uses the standard words of the word bank to form the index file.

5. The search engine system as claimed in claim 1, characterized in that the index library module preprocesses the interaction information data, removing duplicates with a digital signature algorithm, represents the feature information of each text with the vector space model, and builds the index database, which provides the retrieval source for user searches.

6. The search engine system as claimed in claim 1, characterized in that the retrieval interface module is the interface the user works with, accepting user input and outputting query results;

during retrieval, the input search terms are segmented to form keywords, which are compared with the electric power word bank to form standardized search terms.

7. The search engine system as claimed in claim 6, characterized in that the retrieval comprises:

1) Establishing feature items: feature items are established from the characters, words and sentences of a document, document = d(t₁, t₂, …, t_k, …, t_n), where t_k denotes the k-th feature item and each feature item is represented as one dimension;

where w_ik is the weight function of the k-th feature item in the unknown document v_i, w_k is the weight function of the k-th feature item in the target document u, and there are m feature items, i.e. k ranges from 1 to m; the weight function is calculated from the term frequency as w_ik = tf_k(d_i)^{1/2} and normalized:

w_{ik} = \frac{w_{ik}}{\sqrt{\sum_{j \in d_{j}} w_{ij}^{2}}}

where tf_k(d_i) denotes the frequency with which the k-th feature item occurs in the unknown document v_i, j ranges over all feature items from 1 to m, and d_j denotes the j-th document;

w_k is obtained in the same way:

w_{k} = \frac{w_{k}}{\sqrt{\sum_{j \in d_{j}} w_{j}^{2}}};