CN106354708A - Client interaction information search engine system based on electricity information collection system - Google Patents

Client interaction information search engine system based on electricity information collection system

Info

Publication number
CN106354708A
Authority
CN
China
Prior art keywords
document
search engine
module
characteristic item
word
Prior art date
Application number
CN201510409475.0A
Other languages
Chinese (zh)
Inventor
刘宣
徐英辉
祝恩国
李造利
窦健
阿辽沙.叶
章宏伟
Original Assignee
中国电力科学研究院
国家电网公司
天津大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国电力科学研究院, 国家电网公司, 天津大学 filed Critical 中国电力科学研究院
Priority to CN201510409475.0A priority Critical patent/CN106354708A/en
Publication of CN106354708A publication Critical patent/CN106354708A/en

Abstract

The invention provides a client interaction information search engine system based on an electricity information collection system. The search engine system is built on the open-source search engine Solr and comprises a power dictionary module, a document parsing module, a Chinese word segmentation module, an index library module and a retrieval interface module. Chinese word segmentation is combined with the power dictionary to build a standardized index; the same standard dictionary is used at search time, so that search is accurate, comprehensive and fast.

Description

A client interaction information search engine system based on an electricity information collection system

Technical field

The present invention relates to the field of electric power, and in particular to a client interaction information search engine system based on an electricity information collection system.

Background

According to China's smart grid development plan, the smart grid entered its full construction phase in the period from 2011 to 2015. By 2015, 41 smart grid innovation demonstration projects were to be completed, and the user interaction functions of the electricity information collection system in the smart grid were to be largely in place, covering information interaction, electric energy interaction and business interaction. Building on interaction information, measures such as time-of-use pricing, tiered pricing and bidirectional dispatch produce a clear peak-shaving and valley-filling effect, and make it possible to control user load without cutting power, which in turn promotes orderly electricity use. Structured client interaction information is stored in server databases, while the future trend for unstructured interaction information is a power data center based on the Hadoop framework. In a typical large or medium-sized city the number of connected electricity users reaches the millions, which brings massive volumes of interaction information. Finding information efficiently, accurately and comprehensively therefore becomes the bottleneck in improving work efficiency and making full use of data resources.

Summary of the invention

To overcome the above deficiencies of the prior art, the present invention provides a client interaction information search engine system based on an electricity information collection system. It can search the contents of files currently stored on a single machine or shared over an intranet, and thereby enables accurate lookup of customer information.

The solution adopted to achieve the above purpose is:

A client interaction information search engine system based on an electricity information collection system, wherein the search engine system is built on the open-source search engine Solr and comprises a power dictionary module, a document parsing module, a Chinese word segmentation module, an index library module and a retrieval interface module.

Preferably, the power dictionary module draws on two sources. First, with reference to national and power industry standards and the standards of the international power grid technical committee, the specialized vocabulary commonly used in user interaction information is added to the dictionary. Second, keywords from core journals, together with related terms extracted from abstracts (for example those of the "Proceedings of the CSEE") by a field term extraction algorithm based on regular distribution entropy, are added to the dictionary.

Preferably, the document parsing module is responsible for parsing files. It extracts the text that describes a document from unstructured data in formats such as PDF, Word, Excel and PowerPoint; this descriptive information includes the document title, author and main content. It then performs syntactic analysis and language processing: the tf-idf (term frequency-inverse document frequency) weighting algorithm scores the words in the text, words whose weight exceeds a threshold are extracted as the document's core vocabulary, and the information gain method is then applied to refine the core vocabulary. The result is a text containing the content and the core vocabulary.
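The thresholding step can be illustrated with a minimal Python sketch. It covers only the tf-idf selection, not the subsequent information gain refinement. The function name `tfidf_core_terms`, the sample tokens, the threshold value and the particular idf variant (`log(N / df)`, no smoothing) are all assumptions made for illustration; the patent does not specify them.

```python
import math
from collections import Counter

def tfidf_core_terms(docs, doc_index, threshold):
    """Return the terms of docs[doc_index] whose tf-idf weight exceeds threshold.

    docs is a list of non-empty token lists. idf = log(N / df) is one common
    variant, assumed here for illustration.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    tokens = docs[doc_index]
    tf = Counter(tokens)
    weights = {
        term: (count / len(tokens)) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    return {term: w for term, w in weights.items() if w > threshold}

# Hypothetical, already segmented documents.
docs = [
    ["electricity", "fee", "payment", "fee"],
    ["electricity", "meter", "reading"],
    ["electricity", "tariff", "peak"],
]
core = tfidf_core_terms(docs, 0, threshold=0.3)
```

In this toy corpus "electricity" occurs in every document, so its idf is zero and it is never selected, while "fee" occurs twice in one document only and passes the threshold.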

Documents in different formats are processed with resources from several open-source libraries. For example, Apache POI can read and write Microsoft Office format files. Its components include: HSSF, which reads and writes Microsoft Excel XLS files; XSSF, which reads and writes Microsoft Excel OOXML XLSX files; HWPF, which reads and writes Microsoft Word DOC files; HSLF, which reads and writes Microsoft PowerPoint files; and HDGF, which reads and writes Microsoft Visio files. PDFBox provides creation, processing and content extraction for PDF documents.
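The format-to-library assignment can be summarized as a lookup table. The Python sketch below only names the component that would handle each file type; it does not call Apache POI or PDFBox, which are Java libraries. The `.ppt` and `.vsd` extensions, the table name and the function name `parser_for` are assumptions added for illustration.

```python
from pathlib import Path

# Extension-to-component table built from the components named in the text.
# The ".ppt" and ".vsd" extensions are assumed; the text names only the products.
PARSER_FOR_EXTENSION = {
    ".xls": "Apache POI HSSF",
    ".xlsx": "Apache POI XSSF",
    ".doc": "Apache POI HWPF",
    ".ppt": "Apache POI HSLF",
    ".vsd": "Apache POI HDGF",
    ".pdf": "PDFBox",
}

def parser_for(filename):
    """Return the name of the parsing component for a file, or None if unsupported."""
    return PARSER_FOR_EXTENSION.get(Path(filename).suffix.lower())
```

For example, `parser_for("bill.XLSX")` selects the XSSF component regardless of the case of the extension, and an unlisted extension yields `None`.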

Preferably, the Chinese word segmentation module uses a Chinese word segmentation algorithm to segment the full text of the content. Each segmented word is compared with the standard terms in the power dictionary; words not found in the dictionary are deleted, and the dictionary's standard words are used to form the index file. Segmentation uses the "IK Analyzer" toolkit. Both when building the index database and when searching, segmented words are compared against the standard power dictionary, so an index database built this way is easily searched by a search engine that uses the same standard dictionary.
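The segment-then-filter idea can be sketched in Python as follows. Forward maximum matching is used here purely as a stand-in segmenter; it is not the algorithm of IK Analyzer, and the lexicon, the dictionary entries and the function names are hypothetical.

```python
def segment(text, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest lexicon word."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing in the lexicon matches.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens

def keep_standard_terms(tokens, power_dictionary):
    """Drop every token that is absent from the power dictionary."""
    return [t for t in tokens if t in power_dictionary]

# Hypothetical general lexicon and power dictionary.
general_lexicon = {"用户", "查询", "阶梯电价", "电价", "信息"}
power_dictionary = {"阶梯电价", "电价"}

tokens = segment("用户查询阶梯电价信息", general_lexicon)
index_terms = keep_standard_terms(tokens, power_dictionary)
```

Applying the same `keep_standard_terms` filter to both documents and queries is what keeps index terms and search terms in the same standardized vocabulary.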

Preferably, the index library module deduplicates the interaction information during data preprocessing using a digital signature algorithm, represents the feature information of the text with the vector space model (VSM), and builds the index database, which provides the retrieval source for user searches;

The index file of the index library module comprises index terms and index lists.
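A minimal Python sketch of signature-based deduplication followed by index construction is given below. The patent does not name the signature algorithm, so the SHA-256 digest is an assumed stand-in; the record layout, the function names and the sample terms are likewise hypothetical.

```python
import hashlib
from collections import defaultdict

def fingerprint(text):
    """Content fingerprint; SHA-256 is an assumed choice of digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def build_index(records, power_dictionary):
    """Deduplicate records, then map each index term to the ids of records containing it."""
    seen = set()
    index = defaultdict(list)
    for doc_id, tokens in records:
        digest = fingerprint(" ".join(tokens))
        if digest in seen:
            continue  # exact duplicate of an earlier record
        seen.add(digest)
        # Only terms found in the power dictionary become index terms.
        for term in sorted(set(tokens) & power_dictionary):
            index[term].append(doc_id)
    return dict(index)

records = [
    (1, ["user", "tiered tariff", "payment"]),
    (2, ["user", "tiered tariff", "payment"]),  # duplicate of record 1
    (3, ["peak shaving", "tiered tariff"]),
]
index = build_index(records, {"tiered tariff", "peak shaving"})
```

Each key of `index` plays the role of an index term and each value the role of its index list.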

Unlike other topic-specific search engines, this system builds its index terms from the power dictionary when forming the index library, which yields a standardized index library.

Preferably, the retrieval interface module is the interface used by the user; it accepts the user's input and outputs the query results. At retrieval time, the input search terms are segmented into keywords; the segmenter analyzes and parses the keywords and compares them with the power dictionary to form multiple search words; the index file is then searched, and the results are ranked and output to the user.

Unlike other search engines, this system compares the segmented keywords with the power dictionary at retrieval time, which yields standardized search terms.
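The retrieval flow can be sketched in Python as below, assuming the query has already been segmented. Ranking by the number of matched terms is only a simple placeholder for similarity-based ranking, and the function name, the index contents and the query tokens are hypothetical.

```python
def search(query_tokens, power_dictionary, index):
    """Keep dictionary terms from the query, then rank documents by matched-term count."""
    terms = [t for t in query_tokens if t in power_dictionary]
    hits = {}
    for term in terms:
        for doc_id in index.get(term, []):
            hits[doc_id] = hits.get(doc_id, 0) + 1
    # Highest match count first; ties are broken by document id for a stable order.
    ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
    return terms, ranked

# Hypothetical inverted index: index term -> list of document ids.
index = {"tiered tariff": [1, 3], "peak shaving": [3]}
power_dictionary = {"tiered tariff", "peak shaving"}

terms, ranked = search(["how", "tiered tariff", "peak shaving"], power_dictionary, index)
```

The word "how" is not in the dictionary and is dropped, so only standardized terms reach the index lookup.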

Preferably, the retrieval includes:

1) Establishing feature items: feature items are established for the characters, words and sentences of a document, document $= d(t_1, t_2, \dots, t_k, \dots, t_n)$, where $t_k$ denotes the k-th feature item and each feature item is represented as one dimension;

2) Calculating the weights of the feature items: within an object to be retrieved, each feature item is assigned a weight $c_j$ that expresses its importance in the text;

3) Establishing the vector space model: once the order of the feature items is disregarded, a text is represented as a vector, that is, a point in feature space. Text $d_i$ is represented as $v(d_i) = (w_{i1}, w_{i2}, \dots, w_{ik}, \dots, w_{im})$, where $w_{ik} = f(t_k, c_j)$ is the weighting function, reflecting the degree to which feature item $t_k$ with weight $c_j$ determines that document $d_i$ belongs to the feature set;

4) Similarity calculation: the vector space model maps every document to a vector in the space, which turns the document matching problem into a vector matching problem. The distance between points in the n-dimensional space is measured by the cosine of the angle between the vectors, which expresses the degree of similarity between documents. Let the target document be $u$, and let $v_i$ be an unknown document compared with $u$ during lookup; the smaller the angle, the higher the similarity of the documents. The similarity is calculated by formula (1):

$$\mathrm{sim}(u, v_i) = \cos\theta = \frac{\sum_{k=1}^{m} w_{ik}\, w_k}{\sqrt{\sum_{k=1}^{m} w_{ik}^2}\,\sqrt{\sum_{k=1}^{m} w_k^2}} \qquad (1)$$

where $w_{ik}$ is the weight of the k-th feature item in unknown document $v_i$, $w_k$ is the weight of the k-th feature item in target document $u$, and there are m feature items, so k runs from 1 to m. The weight is computed from term frequency as $w_{ik} = tf_k(d_i)^{1/2}$ and normalized: $w_{ik} = \dfrac{w_{ik}}{\sqrt{\sum_{j=1}^{m} w_{ij}^2}}$, where $tf_k(d_i)$ is the frequency with which the k-th feature item occurs in unknown document $v_i$, j runs over all feature items from 1 to m, and $d_j$ denotes the j-th document. $w_k$ is computed in the same way as $w_{ik}$: in target document $u$ the weight is computed from term frequency as $w_k = tf_k(d)^{1/2}$ and normalized: $w_k = \dfrac{w_k}{\sqrt{\sum_{j=1}^{m} w_j^2}}$, where $tf_k(d)$ is the frequency with which the k-th feature item occurs in target document $u$ and j runs over all feature items from 1 to m.

When retrieval results are returned to the user, they are sorted by similarity from high to low and presented as retrieval entries.
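The four steps can be sketched in Python as follows: weights are the square root of the term frequency, each vector is scaled to unit length, and candidates are sorted by the cosine of the angle to the target vector. The feature names, the token lists and the function names are hypothetical, and raw occurrence counts are assumed as the term frequency.

```python
import math
from collections import Counter

def weight_vector(tokens, features):
    """w_k = sqrt(tf_k) for each feature, then scaled to unit length."""
    tf = Counter(tokens)
    raw = [math.sqrt(tf[f]) for f in features]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw] if norm else raw

def cosine(u, v):
    """Cosine of the angle between two weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical feature items and documents.
features = ["fee", "month", "balance"]
target = weight_vector(["fee", "month"], features)
candidates = {
    "d1": weight_vector(["fee", "fee", "month", "month"], features),
    "d2": weight_vector(["balance"], features),
    "d3": weight_vector(["fee", "balance"], features),
}
# Sort candidate documents from most to least similar to the target.
ranking = sorted(candidates, key=lambda name: cosine(target, candidates[name]), reverse=True)
```

Document d1 repeats the target's terms in the same proportion, so its vector points in the same direction and it ranks first; d2 shares no feature with the target and ranks last.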

Compared with the prior art, the present invention has the following beneficial effects:

The present invention combines Chinese word segmentation with the power dictionary to build a standardized index. At search time the segmented keywords are compared with the power dictionary to form standardized search terms, which makes search accurate, comprehensive and fast.

Brief description of the drawings

Fig. 1 is the search engine framework diagram of the present invention;

Fig. 2 is the index framework diagram of the present invention;

Fig. 3 is the retrieval framework diagram of the present invention.

Detailed description of the embodiments

The specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

The present invention uses the open-source search engine toolkit Solr to build a retrieval tool for searching the contents of files on a hard disk. It comprises a power dictionary module, a document parsing module, a Chinese word segmentation module, an index library module and a retrieval interface module. The power dictionary module establishes standard industry terminology, in particular collecting the specialized vocabulary commonly used in user interaction information. The document parsing module is responsible for parsing files. The Chinese word segmentation module uses a Chinese word segmentation algorithm to segment the full text of the file content and, together with the power dictionary, builds a full-text index. The index library stores the data. The retrieval interface module is the interface used by the user; it accepts the user's input and outputs the query results. The system framework is shown in Fig. 1.

In the document parsing module, the text that describes a document is extracted from unstructured data in formats such as PDF, Word, Excel and PowerPoint; this descriptive information includes the document title, author and main content. Syntactic analysis and language processing are then performed to form the index. Documents in different formats are processed with resources from several open-source libraries. For example, Apache POI can read and write Microsoft Office format files. Its components include: HSSF, which reads and writes Microsoft Excel XLS files; XSSF, which reads and writes Microsoft Excel OOXML XLSX files; HWPF, which reads and writes Microsoft Word DOC files; HSLF, which reads and writes Microsoft PowerPoint files; and HDGF, which reads and writes Microsoft Visio files. PDFBox provides creation, processing and content extraction for PDF documents.

The content of the power dictionary is based mainly on national and power industry standards and on the standards of the international power grid technical committee. Because the smart grid is a new and still-evolving field, some of its common terms need to be collected and added individually.

Chinese word segmentation uses the "IK Analyzer" toolkit. Both when building the index database and when searching, segmented words are compared against the standard power dictionary, so an index database built this way is easily searched by a search engine that uses the same standard dictionary.

The index framework is shown in Fig. 2. For the different file types stored on the hard disk, such as Word, Excel, TXT and PDF, the corresponding toolkit extracts the document content from the file to form a text, which is passed to the segmenter. The segmenter, together with the specialized power dictionary, builds the index file. The index file contains the keywords established by extracting key information from the text and comparing it with the power dictionary.

Unlike other topic-specific search engines, this system builds its index terms from the power dictionary when forming the index library, which yields a standardized index library.

The retrieval framework is shown in Fig. 3. After the user enters keywords, the segmenter analyzes and parses them and compares them with the power dictionary to form multiple search words; the index file is then searched, and the results are ranked and output to the user.

Unlike other search engines, this system compares the segmented keywords with the power dictionary at retrieval time, which yields standardized search terms.

It is implemented as follows:

1) Establishing feature items: feature items are established for the characters, words, sentences and so on of a document, document $= d(t_1, t_2, \dots, t_k, \dots, t_n)$, where $t_k$ denotes the k-th feature item and each feature item is represented as one dimension. Specifically, for a customer's electricity payment record, terms such as payment unit, payment amount, user number, customer address, project name, billing month, amount paid this time and total (RMB) can each serve as one of the feature items in $d(t_1, t_2, \dots, t_k, \dots, t_n)$.

2) Calculating the weights of the feature items: within an object to be retrieved (such as a text), each feature item is assigned a weight $c_j$ that expresses its importance in the text. Specifically, feature items the user cares about, such as billing month, project name, fees payable and account balance, are given heavier weights, while feature items only loosely related to the current retrieval, such as customer address, serial number and customer, are given smaller weights.

3) Establishing the vector space model: once the order of the feature items is disregarded, a text is represented as a vector, that is, a point in feature space. For example, text $d_i$ is represented as $v(d_i) = (w_{i1}, w_{i2}, \dots, w_{ik}, \dots, w_{im})$, where $w_{ik} = f(t_k, c_j)$ is the weighting function, reflecting the importance of feature $t_k$ in determining whether document $d_i$ belongs to $c_j$.

4) Similarity calculation: the vector space model maps every document to a vector in the space, which turns the document matching problem into a vector matching problem. The distance between points in the n-dimensional space is measured by the cosine of the angle between the vectors, which expresses the degree of similarity between documents. Let the target document vector be $u$ and the unknown document be $v_i$; the smaller the angle, the higher the similarity of the documents. The similarity is calculated by formula (1):

$$\mathrm{sim}(u, v_i) = \cos\theta = \frac{\sum_{k=1}^{m} w_{ik}\, w_k}{\sqrt{\sum_{k=1}^{m} w_{ik}^2}\,\sqrt{\sum_{k=1}^{m} w_k^2}} \qquad (1)$$

The weight $w_{ik}$ is a function of the frequency with which the feature item occurs in the document. With $tf_k(d_i)$ denoting the frequency with which $t_k$ occurs in document $d_i$, the weight is computed from term frequency as $w_{ik} = tf_k(d_i)^{1/2}$ and then normalized:

$$w_{ik} = \frac{w_{ik}}{\sqrt{\sum_{j=1}^{m} w_{ij}^2}}$$

When retrieval results are returned to the user, they are sorted by similarity and presented as retrieval entries.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present application and not to limit its scope of protection. Although the application has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that, after reading the application, those skilled in the art may still make various changes, modifications or equivalent replacements to the specific embodiments of the application, and that all such changes, modifications or equivalent replacements fall within the scope of the pending claims.

Claims (7)

1. A client interaction information search engine system based on an electricity information collection system, characterized in that the search engine system is built on the open-source search engine Solr and comprises a power dictionary module, a document parsing module, a Chinese word segmentation module, an index library module and a retrieval interface module.
2. The search engine system according to claim 1, characterized in that the power dictionary module draws on two sources: first, with reference to national and power industry standards and the standards of the international power grid technical committee, the specialized vocabulary commonly used in user interaction information is added to the dictionary; second, keywords from core journals, together with related terms extracted from abstracts by a field term extraction algorithm based on regular distribution entropy, are added to the dictionary.
3. The search engine system according to claim 1, characterized in that the document parsing module is responsible for parsing files: it extracts the text that describes a document from unstructured documents and then performs syntactic analysis and language processing, in which the tf-idf weighting algorithm scores the words in the text, words whose weight exceeds a threshold are extracted as the document's core vocabulary, and the information gain method is then applied to refine the core vocabulary, thereby forming a text containing the content and the core vocabulary.
4. The search engine system according to claim 1, characterized in that the Chinese word segmentation module uses a Chinese word segmentation algorithm to segment the full text of the content, compares each segmented word with the standard terms in the power dictionary, deletes the words not found in the dictionary, and uses the standard words of the power dictionary to form the index file.
5. The search engine system according to claim 1, characterized in that the index library module deduplicates the interaction information during data preprocessing using a digital signature algorithm, represents the feature information of the text with the vector space model, and builds the index database, which provides the retrieval source for user searches;
the index file of the index library module comprises index terms and index lists.
6. The search engine system according to claim 1, characterized in that the retrieval interface module is the interface used by the user, and accepts the user's input and outputs the query results;
at retrieval time, the input search terms are segmented into keywords, which are compared with the power dictionary to form standardized search terms.
7. The search engine system according to claim 6, characterized in that the retrieval includes:
1) establishing feature items: feature items are established for the characters, words and sentences of a document, document $= d(t_1, t_2, \dots, t_k, \dots, t_n)$, where $t_k$ denotes the k-th feature item and each feature item is represented as one dimension;
2) calculating the weights of the feature items: within an object to be retrieved, each feature item is assigned a weight $c_j$ that expresses its importance in the text;
3) establishing the vector space model: once the order of the feature items is disregarded, a text is represented as a vector, that is, a point in feature space; text $d_i$ is represented as $v(d_i) = (w_{i1}, w_{i2}, \dots, w_{ik}, \dots, w_{im})$, where $w_{ik} = f(t_k, c_j)$ is the weighting function, reflecting the degree to which feature item $t_k$ with weight $c_j$ determines that document $d_i$ belongs to the feature set;
4) similarity calculation: the vector space model maps every document to a vector in the space, which turns the document matching problem into a vector matching problem; the distance between points in the n-dimensional space is measured by the cosine of the angle between the vectors, which expresses the degree of similarity between documents; let the target document be $u$, and let $v_i$ be an unknown document compared with $u$ during lookup; the smaller the angle, the higher the similarity of the documents; the similarity is calculated by formula (1):
$$\mathrm{sim}(u, v_i) = \cos\theta = \frac{\sum_{k=1}^{m} w_{ik}\, w_k}{\sqrt{\sum_{k=1}^{m} w_{ik}^2}\,\sqrt{\sum_{k=1}^{m} w_k^2}} \qquad (1)$$
where $w_{ik}$ is the weight of the k-th feature item in unknown document $v_i$, $w_k$ is the weight of the k-th feature item in target document $u$, and there are m feature items, so k runs from 1 to m; the weight is computed from term frequency as $w_{ik} = tf_k(d_i)^{1/2}$ and normalized: $w_{ik} = \dfrac{w_{ik}}{\sqrt{\sum_{j=1}^{m} w_{ij}^2}}$, where $tf_k(d_i)$ is the frequency with which the k-th feature item occurs in unknown document $v_i$, j runs over all feature items from 1 to m, and $d_j$ denotes the j-th document;
$w_k$ is obtained in the same way: $w_k = \dfrac{w_k}{\sqrt{\sum_{j \in d_j} w_j^2}}$;
when retrieval results are returned to the user, they are sorted by similarity from high to low and presented as retrieval entries.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510409475.0A CN106354708A (en) 2015-07-13 2015-07-13 Client interaction information search engine system based on electricity information collection system

Publications (1)

Publication Number Publication Date
CN106354708A true CN106354708A (en) 2017-01-25

Family

ID=57842063

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510409475.0A CN106354708A (en) 2015-07-13 2015-07-13 Client interaction information search engine system based on electricity information collection system

Country Status (1)

Country Link
CN (1) CN106354708A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106777405A (en) * 2017-04-05 2017-05-31 安徽机器猫电子商务股份有限公司 Promote the e-business method of low frequency class transaction based on SaaS services
CN106844700A (en) * 2017-02-03 2017-06-13 山东浪潮商用系统有限公司 It is a kind of to ask tax system based on Sorl
CN106997390A (en) * 2017-04-05 2017-08-01 安徽机器猫电子商务股份有限公司 A kind of equipment part or parts commodity transaction information search method

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103116588A (en) * 2011-11-17 2013-05-22 腾讯科技(深圳)有限公司 Method and system for personalized recommendation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
徐树振: "聚类反馈式电网资源分布搜索引擎研究与实现", 《中国优秀硕士学位论文全文数据库信息科技辑》 *
薛为民 等: "文本挖掘技术研究", 《北京联合大学学报(自然科学版)》 *

Similar Documents

Publication Publication Date Title
Zhang et al. Chinese comments sentiment classification based on word2vec and SVMperf
Deveaud et al. Accurate and effective latent concept modeling for ad hoc information retrieval
CN103440329B (en) Authority author and high-quality paper commending system and recommend method
Chuang et al. Termite: Visualization techniques for assessing textual topic models
Gupta et al. Query expansion for mixed-script information retrieval
Meij et al. Adding semantics to microblog posts
Zhang et al. “Term clumping” for technical intelligence: A case study on dye-sensitized solar cells
Sayyadi et al. A graph analytical approach for topic detection
CN103593792B (en) A kind of personalized recommendation method based on Chinese knowledge mapping and system
Leydesdorff et al. Co‐occurrence matrices and their applications in information science: Extending ACA to the Web environment
US20180082183A1 (en) Machine learning-based relationship association and related discovery and search engines
CN102890713B (en) A kind of music recommend method based on user's current geographic position and physical environment
CN101593200B (en) Method for classifying Chinese webpages based on keyword frequency analysis
Vercoustre et al. Entity ranking in Wikipedia
Gao et al. Ontology similarity measure by optimizing NDCG measure and application in physics education
Zhang et al. Do users rate or review? Boost phrase-level sentiment labeling with review-level sentiment classification
Liu et al. An improvement of TFIDF weighting in text categorization
Shi et al. A sentiment analysis model for hotel reviews based on supervised learning
Ding et al. Learning topical translation model for microblog hashtag suggestion
CN104239501B (en) Mass video semantic annotation method based on Spark
Al-Moslmi et al. Approaches to cross-domain sentiment analysis: A systematic literature review
Liu et al. Full‐text citation analysis: A new method to enhance scholarly networks
CN105468605B (en) A kind of entity information map generation method and device
CN102982153B (en) A kind of information retrieval method and device thereof
CN102597991A (en) Document analysis and association system and method

Legal Events

Date Code Title Description
PB01 Publication
C06 Publication
SE01 Entry into force of request for substantive examination