CN112749278B - Classification method for building engineering change instructions - Google Patents

Publication number
CN112749278B
Authority
CN
China
Prior art keywords
word
text
words
building
dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011629638.3A
Other languages
Chinese (zh)
Other versions
CN112749278A (en)
Inventor
刘发贵
吴怡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Zhuwuzhilian Technology Co ltd
South China University of Technology SCUT
Original Assignee
Guangdong Zhuwuzhilian Technology Co ltd
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Zhuwuzhilian Technology Co ltd, South China University of Technology SCUT
Priority to CN202011629638.3A
Publication of CN112749278A
Application granted
Publication of CN112749278B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for classifying construction engineering change instructions. The method comprises the following steps: performing document conversion tailored to the characteristics of construction engineering documents, extracting the text content of each document, labeling the category of each document, and establishing a construction engineering corpus; constructing a building field dictionary, and performing word segmentation preprocessing on the texts in the construction engineering corpus by combining a word segmentation tool with the building field dictionary to obtain text word sequences; establishing a word vector training corpus and training distributed word vectors with a word vector model; representing the global word features and local syntactic features of each text with a distributed method and constructing a text fusion semantic representation; constructing a classification model with a supervised machine learning algorithm based on the text fusion semantic representation; and predicting the documents to be classified so as to quickly obtain the instructions related to engineering changes. By using natural language processing technology, the invention solves the problem of the low efficiency of manually classifying engineering documents in the building field and benefits the construction management of building engineering.

Description

Method for classifying building engineering change instructions
Technical Field
The invention belongs to the fields of natural language processing and construction engineering, and particularly relates to a method for classifying construction engineering change instructions.
Background
China's urbanization is already in a high-speed stage: the number of new municipal, housing, and road projects has increased markedly, the scale of individual new projects keeps growing, and building structural forms are becoming more diverse. Construction engineering is complex, involves many factors such as management, technology, and finance, and requires the cooperation of multiple parties. The work contact list serves as a communication document and construction basis among all units and departments, and records information such as scheme confirmation, progress coordination, construction feedback, and government documents during the construction process. One type of notification contact list relates to engineering changes; because its content involves changes to the construction object and increases or decreases in construction materials and personnel, the constructor needs to respond to the change instruction in time to avoid waste. Quickly and accurately screening engineering change instructions out of a large number of work contact lists is therefore of great significance for the successful completion of a construction project. At present, however, the construction industry still relies on manual review to find the engineering change instructions in engineering work contact lists, which is time-consuming and labor-intensive and is prone to omissions and even errors, causing considerable waste of manpower and material resources.
Natural language processing combines computer science and linguistics to study the interaction between computers and human natural language. Identifying construction engineering change instructions is a text classification task in natural language processing: through classification, the work contact lists in a construction project can be labeled automatically, so that engineering change information is acquired quickly and accurately and the level of construction project information management is improved.
At present, the application of natural language processing technology in the building field is still at an early stage. Zhong Botao et al. (2018) invented a method and system for classifying building quality complaint texts (application publication No. CN 1085631A), in which a classification model is built on a convolutional neural network. Considering the number of documents in an actual project, however, that method is not suitable for the problem addressed by the present invention. Text classification research in the building field currently relies mainly on shallow machine learning. Text representation is the key to machine learning; the classical bag-of-words model does not consider the semantic and syntactic relevance of words, and its text representation suffers from high dimensionality and sparsity.
Disclosure of Invention
The invention aims to solve the problem of the low efficiency of manually classifying construction engineering documents by using natural language processing technology. It provides a method for classifying construction engineering change instructions, which uses distributed word vectors to represent the global word features of a text and obtain a text global semantic representation; extracts text syntactic features based on dependency syntactic analysis and, through an attention mechanism, strengthens the features most closely associated with change trigger words to obtain a text local semantic representation; and concatenates the two parts to obtain the text fusion semantic representation. In view of the order of magnitude of the documents in an actual project, a classification model is built with a supervised machine learning algorithm to classify the construction engineering change instructions.
The purpose of the invention is realized by at least one of the following technical solutions.
A classification method of a construction engineering change instruction comprises the following steps:
S1. Performing document conversion tailored to the characteristics of construction engineering documents, extracting the text content of each document, labeling the category of each document, and establishing a construction engineering corpus;
S2. Constructing a building field dictionary, and performing word segmentation preprocessing on the texts in the construction engineering corpus by combining a word segmentation tool with the building field dictionary to obtain text word sequences;
S3. Establishing a word vector training corpus, and training distributed word vectors with a word vector model;
S4. Representing the global word features and local syntactic features of each text with a distributed method, and constructing a text fusion semantic representation;
S5. Based on the text fusion semantic representation obtained in step S4, constructing a classification model with a supervised machine learning algorithm; predicting the documents to be classified, and quickly obtaining the instructions related to engineering changes.
Further, step S1 comprises the steps of:
S1.1. For construction engineering documents that are image PDFs generated by scanning, recognizing the characters in each document with optical character recognition (OCR) technology and storing them in text format (.txt);
S1.2. Sorting and completing the recognized text of each document, removing the fixed content of the form header and footer, and correcting the text content of the form body;
S1.3. Manually labeling the category of each text according to whether its content relates to an engineering change instruction;
S1.4. Establishing the construction engineering corpus with samples in the form 'category text content'.
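The sample form of step S1.4 can be illustrated with a minimal sketch. It assumes one sample per line, a single space between the category label and the text, and a hypothetical second label `OTHER`; the helper names are also illustrative and are not taken from the patent.

```python
from typing import List, Tuple

def make_sample(category: str, text: str) -> str:
    """Join a category label and cleaned text into one 'category text' line."""
    # Collapse internal whitespace and newlines so each sample stays on one line.
    cleaned = " ".join(text.split())
    return f"{category} {cleaned}"

def parse_sample(line: str) -> Tuple[str, str]:
    """Split a corpus line back into (category, text content)."""
    category, _, text = line.partition(" ")
    return category, text

# Hypothetical documents after OCR and manual correction (steps S1.1 to S1.3).
documents = [
    ("ECN", "Modify the stairs on the second floor;\nsee the attachment."),
    ("OTHER", "Weekly progress meeting moved to Friday."),
]
corpus: List[str] = [make_sample(c, t) for c, t in documents]
```

Keeping the label as the first whitespace-delimited field means the corpus can be read back with a single `partition` call, whatever the text itself contains.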
Further, step S2 includes the steps of:
S2.1. Segmenting the texts in the construction engineering corpus based on the general dictionary of the word segmentation tool to obtain the word sequence of each segmented text;
S2.2. Splicing adjacent words in the word sequences obtained in step S2.1, following the idea of the forward maximum matching algorithm with the maximum length set to 4, and adding a combined word to the domain candidate word set if it appears in multiple texts of the construction engineering corpus;
S2.3. Manually checking the domain candidate word set, eliminating combined words that do not conform to language logic and combined words that refer to specific persons or objects, and generating the building field dictionary;
S2.4. Searching for an online building lexicon, and adding its words to the building field dictionary generated from the corpus in step S2.3;
S2.5. Combining the general dictionary with the building field dictionary constructed in step S2.4, and segmenting the texts in the construction engineering corpus again with the word segmentation tool to obtain the final text word sequence of each text in the preprocessing stage.
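The candidate generation of step S2.2 can be sketched as follows. This is a minimal illustration under stated assumptions: the maximum length of 4 is counted in characters of the combined word, "multiple texts" is taken to mean at least two, and the function name and toy tokens are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set

def candidate_words(segmented_texts: List[List[str]],
                    max_len: int = 4,
                    min_texts: int = 2) -> Set[str]:
    """Splice adjacent tokens into combined words and keep those seen in several texts."""
    seen_in: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, tokens in enumerate(segmented_texts):
        for start in range(len(tokens)):
            combined = tokens[start]
            for nxt in tokens[start + 1:]:
                combined += nxt
                if len(combined) > max_len:
                    break  # further splices only get longer, so stop here
                seen_in[combined].add(doc_id)
    # A combined word becomes a candidate only if several texts contain it.
    return {word for word, docs in seen_in.items() if len(docs) >= min_texts}

# Toy word sequences standing in for the output of step S2.1.
texts = [["a", "b", "c"], ["x", "a", "b"], ["c", "d"]]
candidates = candidate_words(texts)
```

Here only `"ab"` occurs in two different texts, so it is the only candidate passed on to the manual check of step S2.3.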
Further, step S3 includes the steps of:
S3.1. Establishing a word vector training corpus that comprises the construction engineering corpus established in step S1, a Chinese Wikipedia corpus, and the relevant current national standards of the building industry, and performing word segmentation preprocessing on it based on the building field dictionary established in step S2;
S3.2. Training on the corpus with a word vector model to obtain distributed word vectors.
Further, step S4 includes the steps of:
S4.1. Using the distributed word vectors to represent the global word features of the text, obtaining the text global semantic representation $C_g$;
S4.2. Using the distributed word vectors to represent the local syntactic features of the text, obtaining the text local semantic representation $C_l$;
S4.3. Combining the text global semantic representation of step S4.1 and the text local semantic representation of step S4.2 by concatenation to obtain the text fusion semantic representation:
$$C = C_g \oplus C_l$$
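The concatenation in step S4.3 can be sketched in a few lines. The function name and the toy three-dimensional vectors are assumptions for illustration only.

```python
from typing import List

def fuse(c_global: List[float], c_local: List[float]) -> List[float]:
    """Concatenate the global and local representations into one fusion vector."""
    return list(c_global) + list(c_local)

c_g = [0.2, 0.4, 0.1]   # toy global semantic representation
c_l = [0.7, 0.0, 0.3]   # toy local semantic representation
c = fuse(c_g, c_l)      # length 2d for d-dimensional inputs
```

Concatenation keeps both views of the text intact and leaves it to the classifier to weigh them, at the cost of doubling the input dimension.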
Further, the specific steps of step S4.1 are as follows:
S4.1.1. Representing the text word sequence with the distributed word vectors of step S3.2 to obtain the text word vector matrix $X = \{x_1, x_2, \dots, x_m\}$, where $m$ is the number of words in the text word sequence and $x_i$ is the word vector of the $i$-th word, $i = 1, \dots, m$;
S4.1.2. Averaging over the word sequence dimension to obtain the text global semantic representation:
$$C_g = \frac{1}{m} \sum_{i=1}^{m} x_i$$
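The averaging in step S4.1.2 amounts to mean pooling over the word vectors of a text. A minimal sketch, with word vectors held as plain Python lists and a hypothetical function name:

```python
from typing import List

def global_representation(word_vectors: List[List[float]]) -> List[float]:
    """Average the word vectors of a text along the word-sequence dimension."""
    if not word_vectors:
        raise ValueError("text has no word vectors")
    m = len(word_vectors)
    dim = len(word_vectors[0])
    return [sum(vec[d] for vec in word_vectors) / m for d in range(dim)]

# Toy matrix X with m = 3 words and 2-dimensional vectors.
X = [[1.0, 0.0], [0.0, 2.0], [2.0, 1.0]]
C_g = global_representation(X)
```

The result has the same dimension as a single word vector regardless of text length, which is what allows texts of different lengths to share one classifier input format.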
Further, the specific steps of step S4.2 are as follows:
S4.2.1. Identifying the dependency relations among the words in the text word sequence of each text with a dependency analysis tool, forming a dependency structure in which the predicate verb is the root node and the other words are child nodes that depend directly or indirectly on the root node;
S4.2.2. Using the dependency structure of the sentence, extracting the root node and the dependent words on the dependency arcs of its subject-verb relation (SBV) and verb-object relation (VOB); then taking each verb in a coordinate relation (COO) with the root node as a parent node and extracting that parent node and the dependent words on the dependency arcs of its subject-verb and verb-object relations; and finally arranging all extracted words in their original order as the local syntactic features of the text;
S4.2.3. Representing the syntactic features extracted in step S4.2.2 with the distributed word vectors of step S3.2 to obtain the word vector matrix of the syntactic features of each text, $X^s = \{x^s_1, x^s_2, \dots, x^s_n\}$, where $n$ is the number of extracted syntactic feature words and $x^s_j$ is the word vector of the $j$-th syntactic feature word, $j = 1, \dots, n$;
S4.2.4. Verbs occupy the dominant position in a sentence and establish the connections among the other words; from a semantic point of view, a verb that describes an action changing the state of the word on its verb-object dependency arc relates to a construction engineering change instruction and is called a change trigger word. Constructing attention based on a change trigger word dictionary, calculating the association degree $\alpha_j$ between each text syntactic feature word and the change trigger words, $j = 1, \dots, n$, and then obtaining the text local semantic representation:
$$C_l = \sum_{j=1}^{n} \alpha_j x^s_j$$
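The extraction rule of step S4.2.2 can be sketched on a parse that is already available. The sketch assumes the dependency tool's output has been converted to `(word, head index, relation)` triples with 1-based heads and 0 marking the root; the `ROOT` and `ADV` labels, the toy English sentence, and the function name are hypothetical, and every COO dependent of the root is treated as a verb because no part-of-speech tags are modeled.

```python
from typing import List, Tuple

# (word, head index, relation); heads are 1-based and 0 marks the root.
Arc = Tuple[str, int, str]

def syntactic_features(arcs: List[Arc]) -> List[str]:
    """Collect the root, words coordinated with it, and their SBV/VOB dependents."""
    root = next(i for i, (_, head, _) in enumerate(arcs, start=1) if head == 0)
    # Parent nodes: the root plus every word in a COO relation with the root.
    parents = {root} | {
        i for i, (_, head, rel) in enumerate(arcs, start=1)
        if head == root and rel == "COO"
    }
    keep = set(parents)
    for i, (_, head, rel) in enumerate(arcs, start=1):
        if head in parents and rel in ("SBV", "VOB"):
            keep.add(i)
    # Restore the original word order.
    return [arcs[i - 1][0] for i in sorted(keep)]

# Hypothetical parse of a toy sentence.
arcs = [
    ("architect", 2, "SBV"),
    ("confirms", 0, "ROOT"),
    ("change", 2, "VOB"),
    ("today", 2, "ADV"),
    ("contractor", 6, "SBV"),
    ("follows", 2, "COO"),
    ("file", 6, "VOB"),
]
features = syntactic_features(arcs)
```

The adverbial `today` is dropped because it sits on neither an SBV nor a VOB arc, while both verbs and their subjects and objects are kept in sentence order.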
Further, in step S4.2.4, the specific steps of calculating the association weight between the syntactic feature words and the change trigger words are as follows:
S4.2.4.1. Collecting the verbs that describe engineering changes in the texts of the construction engineering corpus, and establishing a change trigger word dictionary;
S4.2.4.2. Based on a Chinese synonym library, searching for the 5 words most similar to each verb in the change trigger word dictionary of step S4.2.4.1, and expanding the change trigger word dictionary with them;
S4.2.4.3. Using the distributed word vectors of step S3.2, representing each word in the change trigger word dictionary as a word vector serving as a key vector, and using each word vector in the text syntactic feature word vector matrix as a query vector; calculating the association degree between the syntactic feature words and the trigger words, so that words whose semantics are closer to expressing a state change receive larger weights. The attention score is calculated by the dot product:
$$e_{jk} = x^s_j \cdot t_k$$
where $x^s_j$ is the word vector of the $j$-th syntactic feature word and $t_k$ is the word vector of the $k$-th word in the change trigger word dictionary.
The maximum of the attention scores between a syntactic feature word and all words in the change trigger word dictionary is taken as its attention based on the change trigger word dictionary:
$$e_j = \max_k e_{jk}$$
The attention weight of the corresponding syntactic feature word based on the change trigger word dictionary is:
$$\alpha_j = \frac{\exp(e_j)}{\sum_{j'=1}^{n} \exp(e_{j'})}$$
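The trigger-word attention can be sketched as dot-product scoring, a maximum over the trigger dictionary, and a normalization over the feature words. The softmax normalization is an assumption about how the attention weights are normalized, and the function names and toy vectors are hypothetical.

```python
import math
from typing import List

Vector = List[float]

def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))

def attention_weights(features: List[Vector], triggers: List[Vector]) -> List[float]:
    """Softmax over each feature word's best dot-product score against the trigger words."""
    # Query = feature word vector, keys = trigger word vectors; keep the maximum score.
    scores = [max(dot(q, k) for k in triggers) for q in features]
    peak = max(scores)  # subtract the peak for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def local_representation(features: List[Vector], triggers: List[Vector]) -> Vector:
    """Attention-weighted sum of the syntactic feature word vectors."""
    alphas = attention_weights(features, triggers)
    dim = len(features[0])
    return [sum(a * f[d] for a, f in zip(alphas, features)) for d in range(dim)]

# Toy vectors: the first feature word points the same way as the trigger words.
features = [[1.0, 0.0], [0.0, 1.0]]
triggers = [[2.0, 0.0], [1.0, 0.0]]
alphas = attention_weights(features, triggers)
C_l = local_representation(features, triggers)
```

Taking the maximum rather than the mean over trigger words means a feature word only needs to resemble one change verb to receive a high weight.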
further, step S5 includes the steps of:
S5.1. Dividing the text fusion semantic vectors obtained in step S4.3 into a training set and a test set in a given proportion;
S5.2. Based on supervised machine learning algorithms, inputting the fusion semantic vector and category label of each text in the training set into different machine learning classifiers for model training; random search and cross-validation are used during training to obtain the optimal hyperparameters of each classification model;
S5.3. Testing each classification model obtained in step S5.2 on the text fusion semantic vectors of the test set, evaluating the classification models with the three indexes of accuracy, recall, and F1 measure, and selecting the classification model with the best performance as the final classification model for construction engineering change instructions;
S5.4. Preprocessing the documents to be classified according to steps S1 to S4, inputting them into the classification model for construction engineering change instructions obtained in step S5.3 for prediction, and quickly obtaining the construction engineering change instructions.
Further, in step S5.2, the supervised machine learning algorithms include support vector machine, naive Bayes, and K-nearest neighbor.
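The evaluation and model selection of step S5.3 can be sketched as below. The sketch computes precision alongside accuracy, recall, and F1 because F1 is defined from precision and recall; the positive label `ECN`, the label `OTHER`, the choice of F1 as the selection criterion, and the hard-coded predictions are assumptions for illustration, not results from the patent.

```python
from typing import Dict, List

def evaluate(y_true: List[str], y_pred: List[str], positive: str = "ECN") -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

y_true = ["ECN", "ECN", "OTHER", "OTHER", "ECN"]
# Hypothetical test-set predictions from two candidate classifiers.
predictions = {
    "svm": ["ECN", "ECN", "OTHER", "ECN", "ECN"],
    "knn": ["ECN", "OTHER", "OTHER", "ECN", "ECN"],
}
results = {name: evaluate(y_true, y_pred) for name, y_pred in predictions.items()}
# Select the model with the highest F1 (one possible reading of "best performance").
best = max(results, key=lambda name: results[name]["f1"])
```

Ranking by F1 balances missed change instructions against false alarms; a project that cannot afford to miss any change instruction might rank by recall instead.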
Compared with the prior art, the invention has the following advantages and effects:
1. The invention generates words with the special expressions of construction engineering from the construction corpus and, by also incorporating an online building lexicon, constructs a building field dictionary that contains both the special words of the construction engineering corpus and the terminology of the construction industry, which addresses the timeliness problem of dictionary construction. Segmenting texts based on the building field dictionary effectively improves the accuracy of the word segmentation results;
2. By training word vectors, the invention uses a distributed method to embed the word features of a text into a low-dimensional vector space, which reduces the dimension of the feature space and makes use of the semantics of words. Compared with the classical bag-of-words model, the contextual semantic relevance of words is taken into account;
3. The dependency structure is obtained by dependency syntactic analysis of the text, so that the words on the dependency arcs are extracted as local syntactic features and fused with the global features of the text, improving the semantic representation of the text.
Drawings
FIG. 1 is a flow chart of the present invention for building engineering change order classification;
FIG. 2 is a diagram of a dependency syntax structure for text in an embodiment of the method of the present invention;
FIG. 3 is a schematic diagram of obtaining a text fusion semantic vector in the embodiment of the method of the present invention.
Detailed Description
In order to make the technical solutions and advantages of the present invention more apparent, the following detailed description is made with reference to the accompanying drawings, but the present invention is not limited thereto.
The embodiment is as follows:
A method for classifying construction engineering change instructions, as shown in FIG. 1, includes the following steps:
S1. Performing document conversion tailored to the characteristics of construction engineering documents, extracting the text content of each document, labeling the category of each document, and establishing a construction engineering corpus, comprising the following steps:
S1.1. Recognizing the characters in each construction engineering document with optical character recognition (OCR) technology and storing them in text format (.txt); the construction engineering documents are generally image PDFs generated by scanning;
S1.2. Sorting and completing the recognized text of each document, removing the fixed content of the form header and footer, and correcting the text content of the form body;
S1.3. Manually labeling the category of each text according to whether its content relates to an engineering change instruction;
S1.4. Establishing the construction engineering corpus with samples in the form 'category text content'.
In this embodiment, the text content finally obtained after processing one construction document is "ECN The architect determined at the coordination meeting on May 8, 2018 to modify five second-floor fire ladders; see the attachment for details; construction will subsequently follow this document.", where "ECN" is the category label indicating that the text belongs to the engineering change instruction category.
S2. Constructing a building field dictionary, and performing word segmentation preprocessing on the texts in the construction engineering corpus by combining a word segmentation tool with the building field dictionary to obtain text word sequences, comprising the following steps:
S2.1. Segmenting the texts in the construction engineering corpus based on the general dictionary of the word segmentation tool to obtain the word sequence of each segmented text;
S2.2. Splicing adjacent words in the word sequences obtained in step S2.1, following the idea of the forward maximum matching algorithm with the maximum length set to 4, and adding a combined word to the domain candidate word set if it appears in multiple texts of the construction engineering corpus;
S2.3. Manually checking the domain candidate word set, eliminating combined words that do not conform to language logic and combined words that refer to specific persons or objects, such as combinations of the form 'architect + specific name', and generating the building field dictionary;
S2.4. Searching for an online building lexicon, in this embodiment the building dictionary of the Sogou cell lexicon, and adding its words to the building field dictionary generated from the corpus in step S2.3;
In this embodiment, segmenting the text before the building field dictionary is used gives the following result (rendered word by word):
"then | teacher | in | 2018 | 5 month | 8 day | in | harmonization | meeting | determine | modify | five | seat | two | building | walk | fire ladder | , | concrete | see | attachment | , | i take after | this | file | construction | ."
The word rendered here as "then" + "teacher" is the term for an architect in the Hong Kong and Macao region, and "walk" + "fire ladder" denotes an escape route used in case of fire; neither of these two domain words is segmented correctly. The preliminarily segmented word sequence is therefore spliced following the idea of the forward maximum matching algorithm: "then" and "teacher" are spliced into "then teacher", and "then teacher" is added to the candidate words of the building field dictionary if it also appears in other texts.
S2.5. Combining the general dictionary with the building field dictionary constructed in step S2.4, and segmenting the texts in the construction engineering corpus again with the word segmentation tool to obtain the final text word sequence of each text in the preprocessing stage.
In this embodiment, segmenting the text again in combination with the building field dictionary gives:
"then teacher | in | 2018 | 5 month | 8 day | in | harmonization | meeting | determine | modify | five | seat | two | building | walk fire ladder | , | concrete | see | attachment | , | i am out | follow this | file | construction | ."
S3. Establishing a word vector training corpus, and pre-training word vectors with a word vector training model, comprising the following steps:
S3.1. Establishing a word vector training corpus that comprises the construction engineering corpus established in step S1, a Chinese Wikipedia corpus, and the relevant current national standards of the building industry, and performing word segmentation preprocessing on it based on the building field dictionary established in step S2;
S3.2. Training on the corpus with a word vector model, in this embodiment the word2vec model, to obtain distributed word vectors.
S4. Representing the global word features and local syntactic features of each text with a distributed method, and constructing a text fusion semantic representation, comprising the following steps:
S4.1. Representing the text word sequence with the distributed word vectors of step S3.2 to obtain the text word vector matrix $X = \{x_1, x_2, \dots, x_m\}$, where $m$ is the number of words in the text word sequence and $x_i$ is the word vector of the $i$-th word, $i = 1, \dots, m$;
S4.2. Averaging over the word sequence dimension to obtain the text global semantic representation:
$$C_g = \frac{1}{m} \sum_{i=1}^{m} x_i$$
S4.3. Performing syntactic analysis on the text word sequence with a dependency analysis tool, extracting the words on the relevant dependency relations as the text syntactic features, and finally representing the text local syntactic features with distributed word vectors to obtain the text local semantic representation. FIG. 2 shows the dependency structure in this embodiment; as the figure shows, the predicate verb is the center of a natural language sentence: it governs the other components of the sentence and is not itself governed by any component. In construction engineering texts, engineering change instructions are mainly expressed with a subject-verb-object structure, in which the verb denotes the action performed, the subject denotes the party performing it, and the object denotes the target of the action; these correspond to the subject-verb relation (SBV) and the verb-object relation (VOB) among the dependency relations. The syntactic feature extraction of a text therefore includes the following steps:
S4.3.1. Identifying the dependency relations among the words in the text word sequence of each text with a dependency analysis tool, forming a dependency structure in which the predicate verb is the root node and the other words are child nodes that depend directly or indirectly on the root node;
S4.3.2. Using the dependency structure of the sentence, extracting the root node and the dependent words on the dependency arcs of its subject-verb relation (SBV) and verb-object relation (VOB); then taking each verb in a coordinate relation (COO) with the root node as a parent node and extracting that parent node and the dependent words on the dependency arcs of its subject-verb and verb-object relations; and finally arranging all extracted words in their original order to obtain the syntactic features of the text;
In this embodiment, FIG. 2 is the dependency syntax structure diagram, and the final text syntactic features are:
{then teacher, determine, modify, walk fire ladder, see, attachment, me, follow-up, file, construction};
S4.3.3. Representing the syntactic features extracted in step S4.3.2 with the distributed word vectors of step S3.2 to obtain the word vector matrix of the syntactic features of each text, $X^s = \{x^s_1, x^s_2, \dots, x^s_n\}$, where $n$ is the number of extracted syntactic feature words and $x^s_j$ is the word vector of the $j$-th syntactic feature word, $j = 1, \dots, n$;
s4.4, collecting verbs describing engineering changes in the construction engineering corpus text, and establishing a change trigger word dictionary;
s4.5, in the embodiment, based on a Chinese synonym toolkit Synonyms, searching for 5 words with the maximum similarity with each verb in the change trigger dictionary in the step S4.4, and expanding the change trigger dictionary;
s4.6, using the distributed word vectors in the step S3.2, representing each word in the change trigger word dictionary as a word vector and respectively serving as a key vector, using each word vector in a text syntactic characteristic word vector matrix as a query vector, calculating the association degree between the syntactic characteristic word and the trigger word, and enhancing the weight of the word with the semantic closer to representing the state change; the attention score is calculated by a dot product method, namely:
Figure BDA0002875923910000081
wherein the content of the first and second substances,
Figure BDA0002875923910000082
triggering a word vector of a kth word in the word dictionary for the change;
for each syntactic feature word, the maximum of its attention scores over all words in the change trigger word dictionary is taken as its attention based on the change trigger word dictionary [equation image BDA0002875923910000083];
the attention weight of each syntactic feature word based on the change trigger word dictionary is then [equation image BDA0002875923910000084];
and the local semantic representation of the text is obtained as [equation image BDA0002875923910000085];
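The attention computation of step S4.6 can be sketched as below. The dot-product score and the maximum over the trigger dictionary follow the text; the exact weight and aggregation formulas are given only as equation images, so the softmax normalization and the weighted sum used here are assumptions, as are the names `attention_weights` and `local_representation`.

```python
import math
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def attention_weights(features: List[Vector], triggers: List[Vector]) -> List[float]:
    """One weight per syntactic feature word (query) against all trigger words (keys)."""
    # Dot-product score against every trigger word, then the maximum per feature word.
    scores = [max(dot(q, k) for k in triggers) for q in features]
    # ASSUMPTION: softmax normalization; the source formula is not reproduced.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def local_representation(features: List[Vector], triggers: List[Vector]) -> Vector:
    """ASSUMPTION: attention-weighted sum of the syntactic feature word vectors."""
    weights = attention_weights(features, triggers)
    dim = len(features[0])
    return [sum(w * x[d] for w, x in zip(weights, features)) for d in range(dim)]


# Toy vectors: the first feature word points in the same direction as the trigger.
toy_features = [[1.0, 0.0], [0.0, 1.0]]
toy_triggers = [[2.0, 0.0]]
weights = attention_weights(toy_features, toy_triggers)
c_l = local_representation(toy_features, toy_triggers)
```

In the toy case the first feature word receives the larger weight, which matches the stated goal of emphasizing words that are semantically close to a change trigger.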
S4.7, concatenating the global semantic representation of the text from step S4.2 with the local semantic representation of the text from step S4.6 to obtain the fused semantic representation of the text [equation image BDA0002875923910000086];
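The fusion in step S4.7 is a plain concatenation, sketched below together with the dimension-wise average that the claims give for the global representation. The function names `global_representation` and `fuse` and the sample numbers are assumptions made for the example.

```python
from typing import List

Vector = List[float]


def global_representation(word_vectors: List[Vector]) -> Vector:
    """Dimension-wise mean of all word vectors of the text."""
    m = len(word_vectors)
    return [sum(column) / m for column in zip(*word_vectors)]


def fuse(c_g: Vector, c_l: Vector) -> Vector:
    """Concatenate the global and local representations into one vector."""
    return list(c_g) + list(c_l)


# Toy example: two 2-dimensional word vectors and an invented local representation.
c_g = global_representation([[1.0, 2.0], [3.0, 4.0]])
fused = fuse(c_g, [0.5, 0.5])
```

The fused vector has the combined length of both parts, so a classifier trained on it sees the whole-text average and the trigger-weighted syntactic summary side by side.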
S5, building a classification model with a supervised machine learning algorithm on the fused semantic vectors obtained in step S4, then predicting on the documents to be classified to quickly identify instructions related to engineering changes; this comprises the following steps:
S5.1, dividing the fused semantic vectors obtained in step S4.7 into a training set and a test set; in this embodiment the split ratio is 8:2;
S5.2, in this embodiment, three algorithms are used: support vector machine, naive Bayes and K-nearest neighbors; the fused semantic vector and class label of each text in the training set are fed to each machine learning classifier for model training; random search with cross-validation is used during training to find the best hyperparameters of each model;
S5.3, applying each classification model obtained in step S5.2 to the fused semantic vectors in the test set, evaluating the models by accuracy, recall and F1 score, and selecting the best-performing model as the final classification model for construction engineering change instructions;
S5.4, preprocessing each document to be classified according to steps S1 to S4, then feeding it to the classification model obtained in step S5.3 for prediction, so that construction engineering change instructions are quickly identified.
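A simplified view of steps S5.1 to S5.3 is sketched below in plain Python. It covers only the 8:2 split, a small K-nearest-neighbor classifier and the three evaluation metrics; the support vector machine, naive Bayes, random search and cross-validation of the embodiment are left out. All function names and the toy data are assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple

# (fused semantic vector, label), where label 1 marks a change instruction.
Sample = Tuple[List[float], int]


def split_train_test(samples: List[Sample], ratio: float = 0.8, seed: int = 0):
    """Shuffle with a fixed seed and split, 8:2 by default as in the embodiment."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]


def knn_predict(train: List[Sample], x: List[float], k: int = 3) -> int:
    """Majority vote among the k training vectors closest to x."""
    nearest = sorted(
        train, key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], x))
    )[:k]
    positives = sum(label for _, label in nearest)
    return 1 if 2 * positives > len(nearest) else 0


def evaluate(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Accuracy, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "recall": recall, "f1": f1}


# Toy data: two well-separated clusters standing in for fused semantic vectors.
positives = [([1.0 + 0.01 * i, 1.0], 1) for i in range(5)]
negatives = [([0.01 * i, 0.0], 0) for i in range(5)]
train_set, test_set = split_train_test(positives + negatives)
predictions = [knn_predict(train_set, x) for x, _ in test_set]
metrics = evaluate([y for _, y in test_set], predictions)
```

In a fuller version, each candidate classifier would be trained and scored the same way, and the one with the best test metrics would be kept as the final model, as step S5.3 describes.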
The above is a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto; any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed herein shall fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be defined by the claims.

Claims (5)

1. A classification method for construction engineering change instructions, characterized by comprising the following steps:
S1, converting the documents according to the characteristics of construction engineering documents, extracting the text content of each document, labeling the category of each document, and establishing a construction engineering corpus;
S2, constructing a building field dictionary, and segmenting the texts in the construction engineering corpus with a word segmentation tool combined with the building field dictionary to obtain text word sequences;
S3, establishing a word vector training corpus and training distributed word vectors with a word vector model; specifically comprising:
S3.1, establishing the word vector training corpus, which comprises the construction engineering corpus established in step S1, the Chinese Wikipedia corpus and the relevant current national standards of the construction industry, and segmenting it based on the building field dictionary constructed in step S2;
S3.2, training on the corpus with a word vector model to obtain distributed word vectors;
S4, representing the global word features and the local syntactic features of the text with distributed representations, and constructing the fused semantic representation of the text; comprising the following steps:
S4.1, representing the global word features of the text with the distributed word vectors to obtain the global semantic representation C_g of the text; step S4.1 specifically comprises:
S4.1.1, representing the text word sequence with the distributed word vectors from step S3.2 to obtain the text word vector matrix X = {x_1, x_2, …, x_m}, where m is the number of words in the text word sequence and x_i is the word vector of the i-th word, with i ranging from 1 to m;
S4.1.2, averaging the word vectors of the word sequence dimension by dimension to obtain the global semantic representation of the text:
$C_g = \frac{1}{m}\sum_{i=1}^{m} x_i$
S4.2, representing the local syntactic features of the text with the distributed word vectors to obtain the local semantic representation C_l of the text; step S4.2 specifically comprises:
S4.2.1, identifying the dependency relations among the words in the word sequence of each text with a dependency parsing tool, forming a dependency structure in which the predicate verb is the root node and all other words are child nodes depending directly or indirectly on the root node;
S4.2.2, using the dependency structure of each sentence, extracting the root node together with the words attached to it by subject-verb and verb-object dependency arcs; then taking each verb in a coordinate relation with the root node as a parent node, and extracting that parent node together with the words attached to it by subject-verb and verb-object dependency arcs; finally arranging all extracted words in their original order as the syntactic features of the text;
S4.2.3, representing the syntactic features extracted in step S4.2.2 with the distributed word vectors from step S3.2 to obtain the word vector matrix of the syntactic features of each text [equation image FDA0003819433810000012], where n is the number of extracted syntactic feature words and [symbol image FDA0003819433810000013] is the word vector of the j-th syntactic feature word, with j ranging from 1 to n;
S4.2.4, verbs occupy the dominant position in a sentence and establish the connections among the other words; semantically, a verb describing an action that changes the state of the word on its verb-object dependency arc relates to a construction engineering change instruction and is called a change trigger word; constructing attention based on a change trigger word dictionary and computing the degree of association α_j between each syntactic feature word of the text and the change trigger words, with j ranging from 1 to n, to obtain the local semantic representation of the text [equation image FDA0003819433810000021];
the specific steps are as follows:
S4.2.4.1, collecting the verbs that describe engineering changes in the texts of the construction engineering corpus, and establishing a change trigger word dictionary;
S4.2.4.2, using a Chinese synonym library, retrieving for each verb in the change trigger word dictionary from step S4.2.4.1 the 5 words most similar to it, and adding them to expand the dictionary;
S4.2.4.3, using the distributed word vectors from step S3.2, representing each word in the change trigger word dictionary as a word vector serving as a key vector, and taking each word vector in the syntactic feature word vector matrix of the text as a query vector; computing the degree of association between each syntactic feature word and the trigger words, so that words whose semantics are closer to expressing a change of state receive a higher weight; the attention score is computed as a dot product [equation image FDA0003819433810000022], where [symbol image FDA0003819433810000023] is the word vector of the k-th word in the change trigger word dictionary;
for each syntactic feature word, the maximum of its attention scores over all words in the change trigger word dictionary is taken as its attention based on the change trigger word dictionary [equation image FDA0003819433810000024];
the attention weight of each syntactic feature word based on the change trigger word dictionary is then [equation image FDA0003819433810000025];
S4.3, concatenating the global semantic representation of the text from step S4.1 with the local semantic representation of the text from step S4.2 to obtain the fused semantic representation of the text [equation image FDA0003819433810000026];
S5, building a classification model with a supervised machine learning algorithm on the fused semantic representations obtained in step S4, and predicting on the documents to be classified to quickly identify instructions related to engineering changes.
2. The classification method for construction engineering change instructions according to claim 1, wherein step S1 comprises the following steps:
S1.1, recognizing the characters in the construction engineering documents by optical character recognition and saving them in text format;
S1.2, tidying and completing the text recognized from each document, removing the fixed content of table headers and footers, and correcting the text content of the table body;
S1.3, manually labeling the category of each text according to whether its content relates to an engineering change instruction;
S1.4, establishing the construction engineering corpus with samples in the form 'category text content'.
3. The classification method for construction engineering change instructions according to claim 2, wherein step S2 comprises the following steps:
S2.1, segmenting the texts in the construction engineering corpus with the general dictionary of the word segmentation tool to obtain the word sequence of each text;
S2.2, processing the word sequences obtained in step S2.1 following the idea of the forward maximum matching algorithm, with the maximum length set to 4, to form combined words, and adding a combined word to the domain candidate word set if it appears in multiple texts of the construction engineering corpus;
S2.3, manually reviewing the domain candidate word set, removing combinations that do not conform to language logic and words that refer to specific persons or objects, to generate the building field dictionary;
S2.4, searching online building vocabulary libraries and adding their words to the corpus-based building field dictionary generated in step S2.3;
S2.5, combining the general dictionary with the building field dictionary constructed in step S2.4, and re-segmenting the texts in the construction engineering corpus with the word segmentation tool to obtain the final word sequence of each text for the preprocessing stage.
4. The classification method for construction engineering change instructions according to claim 1, wherein step S5 comprises the following steps:
S5.1, dividing the fused semantic vectors obtained in step S4.3 into a training set and a test set in a given proportion;
S5.2, based on supervised machine learning algorithms, feeding the fused semantic vector and class label of each text in the training set to different machine learning classifiers for model training, with random search and cross-validation used during training to find the best hyperparameters of each classification model;
S5.3, applying each classification model obtained in step S5.2 to the fused semantic vectors in the test set, evaluating the models by accuracy, recall and F1 score, and selecting the best-performing model as the final classification model for construction engineering change instructions;
S5.4, preprocessing each document to be classified according to steps S1 to S4, then feeding it to the classification model obtained in step S5.3 for prediction, so that construction engineering change instructions are quickly identified.
5. The classification method for construction engineering change instructions according to any one of claims 1 to 4, wherein in step S5.2 the supervised machine learning algorithms comprise support vector machine, naive Bayes and K-nearest neighbors.
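The candidate-word step S2.2 of claim 3 can be pictured with the following sketch. It is only loosely modeled on the claim: it assumes that "maximum length 4" means up to four adjacent segmented words and that "multiple texts" means at least two, and the function name `candidate_words` and the English sample tokens are hypothetical.

```python
from collections import Counter
from typing import List, Set


def candidate_words(texts: List[List[str]],
                    max_len: int = 4,
                    min_docs: int = 2) -> Set[str]:
    """Join runs of 2..max_len adjacent words and keep those seen in min_docs texts."""
    doc_freq: Counter = Counter()
    for words in texts:
        seen = set()  # each combination counts once per text
        for start in range(len(words)):
            for size in range(2, max_len + 1):
                if start + size > len(words):
                    break
                # Chinese words are joined without a separator.
                seen.add("".join(words[start:start + size]))
        doc_freq.update(seen)
    return {word for word, count in doc_freq.items() if count >= min_docs}


# Invented pre-segmented texts standing in for corpus documents.
segmented = [
    ["steel", "bar", "spacing"],
    ["steel", "bar", "diameter"],
]
candidates = candidate_words(segmented)
```

The resulting set would then go through the manual review of step S2.3 before any word is added to the building field dictionary.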
CN202011629638.3A 2020-12-30 2020-12-30 Classification method for building engineering change instructions Active CN112749278B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011629638.3A CN112749278B (en) 2020-12-30 2020-12-30 Classification method for building engineering change instructions

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011629638.3A CN112749278B (en) 2020-12-30 2020-12-30 Classification method for building engineering change instructions

Publications (2)

Publication Number Publication Date
CN112749278A CN112749278A (en) 2021-05-04
CN112749278B CN112749278B (en) 2022-10-18

Family

ID=75650745

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011629638.3A Active CN112749278B (en) 2020-12-30 2020-12-30 Classification method for building engineering change instructions

Country Status (1)

Country Link
CN (1) CN112749278B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113762589A (en) * 2021-07-16 2021-12-07 国家电网有限公司 Power transmission and transformation project change prediction system and method

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169086B (en) * 2017-05-12 2020-10-27 北京化工大学 Text classification method
CN108563791A (en) * 2018-04-29 2018-09-21 华中科技大学 A kind of construction quality complains the method and system of text classification
CN109902293B (en) * 2019-01-30 2020-11-24 华南理工大学 Text classification method based on local and global mutual attention mechanism
CN110135457B (en) * 2019-04-11 2021-04-06 中国科学院计算技术研究所 Event trigger word extraction method and system based on self-encoder fusion document information
CN110609897B (en) * 2019-08-12 2023-08-04 北京化工大学 Multi-category Chinese text classification method integrating global and local features
CN111414451A (en) * 2020-02-27 2020-07-14 中国平安财产保险股份有限公司 Information identification method and device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
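Research on the Expansion of Semantic Knowledge Resources Based on Distant Supervision; Lu Dawei et al.; Journal of Chinese Information Processing (《中文信息学报》); 2016-11-15 (No. 06); 165-173 *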

Also Published As

Publication number Publication date
CN112749278A (en) 2021-05-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant