CN108427775A - A project cost inventory classification method based on multinomial Bayes - Google Patents

Publication number
CN108427775A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810564742.5A
Other languages
Chinese (zh)
Inventor
屈鸿
汤明松
廖兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Dazi Tong Technology Co Ltd
Original Assignee
Chengdu Dazi Tong Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2018-06-04
Filing date
2018-06-04
Publication date
2018-08-21
Application filed by Chengdu Dazi Tong Technology Co Ltd
Priority to CN201810564742.5A
Publication of CN108427775A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a project cost inventory classification method based on multinomial Bayes, relating to the field of intelligent classification of project cost inventories. The method comprises the following steps. S1: extract the key information from the training project cost inventory data and from the project cost inventory data to be classified, integrate it into training inventory text and inventory text to be classified, and preprocess both texts. S2: convert the preprocessed training inventory text and inventory text to be classified into text representations. S3: perform classification training on the represented training inventory text to construct an inventory text classifier. S4: apply the inventory text classifier to the preprocessed inventory text to be classified to obtain the classification results. The method addresses the problems of existing inventory classification methods that rely on summarized rules: strong dependence on human factors, poor generality of the rules, high labor and time costs, and difficulty in discovering hidden patterns.

Description

A project cost inventory classification method based on multinomial Bayes
Technical field
The present invention relates to the field of intelligent classification of project cost inventories, and in particular to a project cost inventory classification method based on multinomial Bayes.
Background
Research in the project cost field mainly concerns the project cost documents prepared by the relevant parties in each construction project. These documents contain a large amount of valuable information, and big-data mining and study of them offers useful guidance for China's construction industry. A key task is the classification of project cost inventories: assigning each inventory item to a category within an inventory classification system according to information such as its description and the materials used, so that the specific composition of a project can be understood and analyzed as a whole. However, China's project cost industry is still at an early stage of informatization, and work such as big-data analysis and mining of project cost inventories lags behind. Inventory classification still relies on traditional rule-matching methods. Such a method treats a cost inventory item as the sum of several attributes: project cost experts manually summarize, as rules, the attributes shared by similar inventory items in historical project cost inventories, and new inventory data in later projects is classified by matching it against the rules in the rule base.
However, this method is inevitably influenced by subjective human factors such as the professional level and experience of the cost experts. It can only capture explicit knowledge and cannot discover hidden associations in the data, so the summarized rules may be one-sided. In practice, rules are changed frequently and new constraints are added to existing rules, yet the actual classification results remain unsatisfactory. Although building the rule base consumes a great deal of labor and time, the rule base applies only to categories with relatively large sample sizes, and some items cannot be classified at all, so the approach has considerable limitations.
In summary, traditional rule-matching methods depend heavily on human factors, their summarized rules generalize poorly, they incur high labor and time costs, and they have difficulty discovering hidden patterns.
Summary of the invention
The object of the present invention is to provide a project cost inventory classification method based on multinomial Bayes that solves the problems of existing inventory classification methods that rely on summarized rules: strong dependence on human factors, poor generality of the rules, high labor and time costs, and difficulty in discovering hidden patterns.
The technical solution adopted by the present invention is as follows:
A project cost inventory classification method based on multinomial Bayes, comprising the following steps:
S1: Extract the key information from the training project cost inventory data and from the project cost inventory data to be classified, integrate it into training inventory text and inventory text to be classified, and preprocess both texts;
S2: Convert the preprocessed training inventory text and inventory text to be classified into text representations;
S3: Perform classification training on the represented training inventory text to construct an inventory text classifier;
S4: Apply the inventory text classifier to the preprocessed inventory text to be classified to obtain the classification results.
Further, step S3 performs classification training using the multinomial Bayes classification algorithm.
Further, step S1 specifically comprises:
S101: Extract the key information from the training project cost inventory data and from the project cost inventory data to be classified, and integrate it into training inventory text and inventory text to be classified;
S102: Segment the training inventory text and the inventory text to be classified into words, and build a proper-noun dictionary;
S103: Remove stop words from the vocabulary, and count the frequency with which each word in the vocabulary occurs;
S104: Remove low-frequency words from the vocabulary, and use the remaining words as the feature words for classifying the training inventory text and the inventory text to be classified, on which the text representation is based.
Further, step S2 specifically comprises:
S201: Use the TF-IDF algorithm to calculate the weight of each feature word in the training inventory text and the inventory text to be classified;
S202: Represent the feature-word weights of the training inventory text and of the inventory text to be classified as vectors.
Further, step S3 specifically comprises:
S301: Use the multinomial Bayes classification algorithm to calculate the probability that a training inventory text belongs to each category;
S302: Obtain the probabilities that the training inventory text belongs to all categories; the category with the highest probability is the category to which the training inventory text belongs.
In conclusion by adopting the above-described technical solution, the beneficial effects of the invention are as follows:
1, in the present invention, inventory text data to be sorted is carried out at classification using the sorting technique based on multinomial Bayes Reason solves the problems, such as to be affected by human factors existing for the rule-based matched sorting technique of tradition big.
2, in the present invention, by a large amount of acquisition process to training project cost listings data, it can be found that being deposited between data Hiding association, realize intelligence learning classification, keep inventory text data to be sorted classification processing more flexible, can adapt to each The classification item of kind of sample size, and convenient for comprehensively sum up there are the problem of, prevent in practical operation frequently change rule New constraint then and to original rule is added, it is time saving and energy saving.
3, in the present invention, during classifying to project cost inventory, the inventory classification time is (every hundred 0.18 shorter Second), and by inventory text classification accuracy from it is original 80% improve till now nearly 90%.
Brief description of the drawings
Fig. 1 is the overall flowchart of the project cost inventory classification method based on multinomial Bayes according to the present invention;
Fig. 2 is the detailed flowchart of the inventory data preprocessing of the present invention;
Fig. 3 is the detailed flowchart of the training of the multinomial Bayes inventory classifier of the present invention;
Fig. 4 is the detailed flowchart of the testing and practical application of the multinomial Bayes inventory classifier of the present invention.
Detailed description
To make the purpose, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
As shown in Fig. 1, a project cost inventory classification method based on multinomial Bayes comprises the following steps:
S1: Extract the key information from the training project cost inventory data and from the project cost inventory data to be classified, integrate it into training inventory text and inventory text to be classified, and preprocess both texts. As shown in Fig. 2, the preprocessing is as follows:
S101: Extract the key information (such as the inventory name, inventory description, and inventory materials) from the training project cost inventory data and from the project cost inventory data to be classified, and integrate it into training inventory text data and inventory text data to be classified;
S102: Segment the training inventory text and the inventory text to be classified into words. Because the project cost field has many proper nouns, build a proper-noun dictionary so that domain-specific terms are not split into multiple words;
S103: Remove stop words from the vocabulary, and count the frequency with which each word in the vocabulary occurs;
S104: Remove low-frequency words from the vocabulary, and use the remaining words as the feature words for classifying the training inventory text data and the inventory text data to be classified, on which the text representation is based.
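The preprocessing in S102 to S104 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the forward-maximum-matching segmenter, the function names `segment` and `build_feature_words`, the tiny dictionaries, and the `min_count` threshold are all assumptions introduced here. The example shows how adding a domain term to the dictionary keeps it from being split.

```python
from collections import Counter


def segment(text, dictionary, max_len=5):
    """Forward maximum matching: at each position take the longest dictionary word.

    Stand-in for a real Chinese word segmenter (assumption, not from the patent).
    """
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def build_feature_words(texts, dictionary, stop_words, min_count=2):
    """S103/S104: drop stop words, count frequencies, drop low-frequency words."""
    counts = Counter()
    for text in texts:
        counts.update(t for t in segment(text, dictionary) if t not in stop_words)
    return sorted(w for w, c in counts.items() if c >= min_count)


# Hypothetical dictionaries: cast-in-place, steel bar, concrete, rectangular column.
BASE = {"现浇", "钢筋", "混凝土", "矩形柱"}
# Adding the domain term "reinforced concrete" keeps it as a single token.
DOMAIN = BASE | {"钢筋混凝土"}

print(segment("现浇钢筋混凝土矩形柱", BASE))    # ['现浇', '钢筋', '混凝土', '矩形柱']
print(segment("现浇钢筋混凝土矩形柱", DOMAIN))  # ['现浇', '钢筋混凝土', '矩形柱']

features = build_feature_words(
    ["现浇钢筋混凝土矩形柱", "钢筋混凝土的矩形柱"], DOMAIN, stop_words={"的"})
print(features)
```

A production system would more likely load a full segmentation library with a user dictionary; the greedy matcher is used here only to keep the sketch self-contained.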
S2: Convert the preprocessed training inventory text and inventory text to be classified into text representations. The specific steps are as follows:
S201: Use the TF-IDF algorithm to calculate the weight of each feature word in the training inventory text and the inventory text to be classified. The symbols in the calculation formula are defined as follows:
tf_ij is the number of times the i-th feature word occurs in inventory text d_j (a training inventory text or an inventory text to be classified); N is the total number of inventory texts; N_i is the number of texts in the inventory text set that contain the i-th feature word; n is the number of feature words in the training inventory text; k is the summation index, running from 1 to n; tf_kj is the number of times the k-th feature word occurs in training inventory text d_j; and α_L is the Laplace smoothing parameter. Experiments show that α_L = 0.0001 gives good classification results;
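The weight formula itself does not appear in the text above. One form consistent with the listed symbols is a length-normalized TF-IDF weight, sketched below. The weight symbol w_ij and the placement of α_L inside the logarithm are assumptions based on a common TF-IDF variant, not details confirmed by the source.

```latex
w_{ij} = \frac{tf_{ij}\,\log\!\left(\dfrac{N}{N_i} + \alpha_L\right)}
              {\sqrt{\displaystyle\sum_{k=1}^{n}\left[tf_{kj}\,\log\!\left(\dfrac{N}{N_k} + \alpha_L\right)\right]^{2}}}
```

Under this reading the numerator is the raw TF-IDF score of feature word i in text d_j, and the denominator normalizes the scores of all n feature words in d_j to unit length.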
S202: Represent the feature-word weights of the training inventory text and of the inventory text to be classified as vectors, specifically:
v(d_i) = (t_1(d_i), t_2(d_i), ..., t_n(d_i))
where n is the total number of feature words of the inventory texts (training inventory text or inventory text to be classified), and t_j(d_i) is the weight of the j-th feature word in inventory text d_i, with j taking any value from 1 to n;
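Step S2 as a whole can be illustrated with the short sketch below, which turns tokenized inventory texts into weight vectors v(d_i). It assumes a length-normalized TF-IDF with the smoothing value added inside the logarithm; the function name `tfidf_vectors` and the toy tokens are hypothetical, and the exact formula used by the patent may differ.

```python
import math


def tfidf_vectors(docs, features, alpha=0.0001):
    """Return one weight vector per tokenized document, ordered like `features`.

    Assumed weighting: tf * log(N / N_i + alpha), then scaled to unit length.
    """
    n_docs = len(docs)
    # N_i: number of documents that contain feature word i.
    doc_freq = {w: sum(1 for d in docs if w in d) for w in features}
    vectors = []
    for doc in docs:
        raw = [
            doc.count(w) * math.log(n_docs / doc_freq[w] + alpha) if doc_freq[w] else 0.0
            for w in features
        ]
        norm = math.sqrt(sum(x * x for x in raw))
        vectors.append([x / norm if norm else 0.0 for x in raw])
    return vectors


# Toy tokenized inventory texts (hypothetical data).
docs = [["column", "concrete", "column"], ["concrete", "rebar"]]
features = ["column", "concrete", "rebar"]
vectors = tfidf_vectors(docs, features)
for vec in vectors:
    print([round(x, 4) for x in vec])
```

Because N / N_i is at least 1, every weight is non-negative, which matters later because the multinomial model treats the weights as fractional counts.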
S3: Perform classification training on the represented training inventory text to construct the inventory text classifier. As shown in Fig. 3, the classifier training is as follows:
S301: Use the multinomial Bayes classification algorithm to calculate the probability that a training inventory text belongs to each category. The Bayes formula for text classification is:
P(C_i|d) = P(d|C_i) P(C_i) / P(d)
In the probability calculation, an inventory text is represented as the set of all its feature words, so the probability of the inventory text is the joint probability of its feature words, i.e. P(d) = P(w_1, w_2, ..., w_n), where P(d) is the probability of the inventory text and w_i is the i-th feature word. Under the conditional independence assumption, the feature words that make up an inventory text are assumed to be mutually independent, so the formula above yields:
P(C_i|d) = P(C_i) × Π_{j=1..n} P(w_j|C_i) / P(w_1, w_2, ..., w_n)
where C_i is the i-th category, P(C_i|d) is the probability that inventory text d belongs to category C_i, P(C_i) is the probability with which category C_i occurs in the training inventory text data, and P(w_j|C_i) is the frequency of feature word w_j in category C_i.
P(w_j|C_i) is calculated from the feature-word weight vector v(d_i) of each inventory text obtained in step S202. In the corresponding formula, m is the number of inventory texts belonging to category C_i, t_{j,k} (for C = C_i) is the TF-IDF value of the j-th feature word of the k-th inventory text belonging to category C_i, and n is the total number of feature words of the inventory texts. Training yields P(w_j|C_i) for each feature word, and the probabilities of all feature words are stored as the model.
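The formula for P(w_j|C_i) is not reproduced in the text above. A standard smoothed multinomial estimate that matches the symbols m, t_{j,k}, and n would look like the following. The smoothing term α and its placement are assumptions, and whether α coincides with the α_L of step S201 cannot be determined from the text.

```latex
P(w_j \mid C_i) =
  \frac{\alpha + \displaystyle\sum_{k=1}^{m} t_{j,k}}
       {\alpha\,n + \displaystyle\sum_{l=1}^{n}\sum_{k=1}^{m} t_{l,k}}
```

The numerator sums the TF-IDF values of feature word j over the m texts of category C_i, and the denominator sums the values of all n feature words over the same texts, so the estimates for one category sum to 1.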
S302: Obtain the probabilities that the training inventory text belongs to all categories; the category with the highest probability is the category to which the training inventory text belongs;
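The training in step S3 might be sketched as below. This is an illustrative reading rather than the patent's code: the function name `train_multinomial_nb`, the model layout, the smoothed estimate, and the toy vectors and labels are assumptions. Probabilities are stored as logarithms to avoid numerical underflow.

```python
import math
from collections import defaultdict


def train_multinomial_nb(vectors, labels, alpha=0.0001):
    """Estimate log P(C_i) and log P(w_j | C_i) from TF-IDF weight vectors."""
    n_features = len(vectors[0])
    class_docs = defaultdict(int)                          # m for each category
    class_sums = defaultdict(lambda: [0.0] * n_features)   # sum over k of t_{j,k}
    for vec, label in zip(vectors, labels):
        class_docs[label] += 1
        sums = class_sums[label]
        for j, weight in enumerate(vec):
            sums[j] += weight
    model = {}
    for label, sums in class_sums.items():
        total = sum(sums) + alpha * n_features
        model[label] = {
            "log_prior": math.log(class_docs[label] / len(labels)),
            "log_likelihood": [math.log((s + alpha) / total) for s in sums],
        }
    return model


# Hypothetical weight vectors over two feature words, with made-up category labels.
vectors = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
labels = ["concrete works", "concrete works", "steel works"]
model = train_multinomial_nb(vectors, labels)
for label, params in model.items():
    print(label, [round(math.exp(x), 4) for x in params["log_likelihood"]])
```

Storing only the per-category prior and per-feature likelihoods corresponds to the statement that the probabilities of all feature words are saved as the model.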
S4: Apply the inventory text classifier to the preprocessed inventory text to be classified to obtain the classification results. As shown in Fig. 4, the inventory text to be classified is preprocessed to obtain its feature words, the probability that the inventory text belongs to each category is calculated from the P(w_j|C_i) of each feature word obtained by training in step S301, and the category with the highest probability is chosen as the final class label of the inventory text to be classified.
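The classification step can be illustrated with the following sketch, which scores a weight vector against a stored model and picks the highest-scoring category. The hand-written `model`, its probabilities, and the function name `classify` are hypothetical. Using each weight as a multiplier on the log-likelihood is the usual multinomial convention and is assumed here; the denominator P(d) is omitted because it is the same for every category.

```python
import math


def classify(vector, model):
    """Return the most probable category and the log-score of every category."""
    scores = {}
    for label, params in model.items():
        # log P(C_i) + sum over j of t_j * log P(w_j | C_i)
        scores[label] = params["log_prior"] + sum(
            weight * log_p for weight, log_p in zip(vector, params["log_likelihood"])
        )
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical trained model over two feature words.
model = {
    "concrete works": {
        "log_prior": math.log(0.5),
        "log_likelihood": [math.log(0.9), math.log(0.1)],
    },
    "steel works": {
        "log_prior": math.log(0.5),
        "log_likelihood": [math.log(0.2), math.log(0.8)],
    },
}

label_a, _ = classify([1.0, 0.0], model)
label_b, _ = classify([0.0, 1.0], model)
print(label_a, label_b)
```

Working in log space turns the product of feature-word probabilities into a sum, which keeps the scores stable even for texts with many feature words.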
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (5)

1. A project cost inventory classification method based on multinomial Bayes, characterized by comprising the following steps:
S1: Extract the key information from the training project cost inventory data and from the project cost inventory data to be classified, integrate it into training inventory text and inventory text to be classified, and preprocess both texts;
S2: Convert the preprocessed training inventory text and inventory text to be classified into text representations;
S3: Perform classification training on the represented training inventory text to construct an inventory text classifier;
S4: Apply the inventory text classifier to the preprocessed inventory text to be classified to obtain the classification results.
2. The project cost inventory classification method based on multinomial Bayes according to claim 1, characterized in that step S3 performs classification training using the multinomial Bayes classification algorithm.
3. The project cost inventory classification method based on multinomial Bayes according to claim 1, characterized in that step S1 specifically comprises:
S101: Extract the key information from the training project cost inventory data and from the project cost inventory data to be classified, and integrate it into training inventory text and inventory text to be classified;
S102: Segment the training inventory text and the inventory text to be classified into words, and build a proper-noun dictionary;
S103: Remove stop words from the vocabulary, and count the frequency with which each word in the vocabulary occurs;
S104: Remove low-frequency words from the vocabulary, and use the remaining words as the feature words for classifying the training inventory text and the inventory text to be classified, on which the text representation is based.
4. The project cost inventory classification method based on multinomial Bayes according to claim 1, characterized in that step S2 specifically comprises:
S201: Use the TF-IDF algorithm to calculate the weight of each feature word in the training inventory text and the inventory text to be classified;
S202: Represent the feature-word weights of the training inventory text and of the inventory text to be classified as vectors.
5. The project cost inventory classification method based on multinomial Bayes according to claim 1, characterized in that step S3 specifically comprises:
S301: Use the multinomial Bayes classification algorithm to calculate the probability that a training inventory text belongs to each category;
S302: Obtain the probabilities that the training inventory text belongs to all categories; the category with the highest probability is the category to which the training inventory text belongs.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810564742.5A CN108427775A (en) 2018-06-04 2018-06-04 A kind of project cost inventory sorting technique based on multinomial Bayes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810564742.5A CN108427775A (en) 2018-06-04 2018-06-04 A kind of project cost inventory sorting technique based on multinomial Bayes

Publications (1)

Publication Number Publication Date
CN108427775A (en) 2018-08-21

Family

ID=63164287

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810564742.5A Pending CN108427775A (en) 2018-06-04 2018-06-04 A kind of project cost inventory sorting technique based on multinomial Bayes

Country Status (1)

Country Link
CN (1) CN108427775A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180821
