CN114004222A - Chinese word segmentation boundary correction method based on frequent items

Info

Publication number: CN114004222A
Application number: CN202111297120.9A
Authority: CN
Inventors: 任晓春; 王玮; 谢斯; 张雨; 朱磊
Original assignee: China Railway First Survey and Design Institute Group Ltd
Current assignee: China Railway First Survey and Design Institute Group Ltd
Priority date: 2021-11-04
Filing date: 2021-11-04
Publication date: 2022-02-01
Anticipated expiration: 2041-11-04
Also published as: CN114004222B

Abstract

The invention relates to a Chinese word segmentation boundary correction method based on frequent items. Traditional dictionary-based Chinese word segmentation depends strictly on the quality of the dictionary and cannot identify unknown words. The invention extracts terms from standards to serve as a dictionary; takes a subway design specification text to be processed as an input text S1, performs reverse maximum matching to generate a first result set S2, and performs boundary correction in combination with prefix and suffix dictionaries to obtain a second result set S3; and extracts frequent item rules using the FP-growth algorithm, screens the second result set S3 with them, and deletes wrong segmentations. The method has the advantages that the flow is fully automatic, the problem that Chinese word segmentation cannot identify unknown words in subway design specifications is solved, boundary correction is performed on the segmentation results using prefix and suffix rules, and frequent items extracted by the FP-growth algorithm serve as evaluation indexes, so that the word segmentation results are more accurate.

Description

Chinese word segmentation boundary correction method based on frequent items

Technical Field

The invention relates to the technical field of subway design data information processing, and in particular to a Chinese word segmentation boundary correction method based on frequent items.

Background

Chinese word segmentation is an important subtask of natural language processing. In recent years it has been widely applied in related fields such as information extraction, question-answering systems and knowledge graphs, and its results directly influence the performance of Chinese information processing.

Design specifications serve as the benchmark of the building industry, and how to process them efficiently and intelligently is a current hot topic in the industry. Chinese word segmentation is the basis of such information processing and can be applied to design specifications. However, traditional dictionary-based Chinese word segmentation depends strictly on the quality of the dictionary and cannot identify unknown words, and traditional rule-based Chinese word segmentation cannot screen and verify the correction results.

Disclosure of Invention

The invention aims to provide a Chinese word segmentation boundary correction method based on frequent items, which solves the problem that dictionary-based Chinese word segmentation in the prior art cannot identify unknown words in subway design specifications.

The technical scheme adopted by the invention is as follows:

the Chinese word segmentation boundary correction method based on frequent items is characterized by comprising the following steps:

step 1: extracting terms from the specification standard as a dictionary;

step 2: taking the subway design specification text to be processed as an input text S1;

step 3: performing reverse maximum matching processing on the input text S1 to generate a first result set S2;

step 4: performing boundary correction processing on the first result set S2 obtained in step 3 in combination with the prefix and suffix dictionaries to obtain a second result set S3;

step 5: extracting frequent item rules by using the FP-growth algorithm;

step 6: screening the second result set S3 obtained in step 4 by using the frequent item rules extracted in step 5, and deleting wrong segmentations.

In step 1, the standards are Appendix A "Building information model classification and coding" of the Building Information Model Classification and Coding Standard GB/T51269-2017, and Appendix A "Chinese term index" of the Urban Rail Transit Engineering Basic Term Standard GB/T50833-2012.

In step 3, the specific process of the reverse maximum matching processing is as follows:

3.1) processing the input text S1 of step 2 sentence by sentence in order from front to back;

3.2) if the length of the sentence obtained in step 3.1) is less than the preset maximum word length n, taking the whole sentence as the matching field a and executing step 3.3);

if the sentence length is greater than or equal to the maximum word length n, taking the n characters at the rightmost end of the sentence as the matching field a and executing step 3.3);

3.3) searching the dictionary of step 1 to judge whether the matching field a obtained in step 3.2) is in the dictionary; if the match succeeds, appending a separator "/" to the matching field a, outputting it to the first result set S2, removing the matching field a from the input text S1, and repeating step 3.2) for the rest of the input text S1;

if no match exists, executing step 3.4);

3.4) removing the leftmost character of the matching field a, taking the field formed by the remaining n-1 characters as a new matching field b, and repeating step 3.3); if characters have been removed until only a single character remains and no match has been found, outputting that single character followed by the separator "/" to the first result set S2 and removing it from the input text S1; continuing until the sentence is empty;

3.5) after a sentence is processed, the sentence is removed from the input text S1, and a new sentence is obtained from the rest part of the input text S1 according to the sequence from front to back;

3.6) repeating steps 3.2) to 3.5) until the input text S1 is empty, and finally outputting the first result set S2.
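Steps 3.1) to 3.6) can be sketched for a single sentence as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `reverse_maximum_match` and `segment`, the toy dictionary and the value of `max_len` are hypothetical, and the separator is placed between words rather than appended after every word.

```python
def reverse_maximum_match(sentence, dictionary, max_len):
    """Segment one sentence from right to left; return words in reading order."""
    words = []
    end = len(sentence)
    while end > 0:
        # Step 3.2: take at most max_len characters from the right end.
        start = max(0, end - max_len)
        # Steps 3.3/3.4: drop the leftmost character until the field is in
        # the dictionary or only a single character is left.
        while end - start > 1 and sentence[start:end] not in dictionary:
            start += 1
        words.append(sentence[start:end])
        end = start  # remove the matched field from the remaining text
    words.reverse()
    return words


def segment(sentence, dictionary, max_len, sep="/"):
    """Join the segmented words with the separator used in the text."""
    return sep.join(reverse_maximum_match(sentence, dictionary, max_len))


# Hypothetical toy dictionary; the real one is extracted from the standards.
toy_dictionary = {"地铁", "设计", "规范"}
print(segment("地铁设计规范", toy_dictionary, max_len=4))
print(segment("地铁的设计", toy_dictionary, max_len=4))
```

A character that never matches, such as "的" in the second call, is emitted as a single-character segment, which is the behaviour described in step 3.4).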

In step 4, the specific process of performing the boundary correction processing on the first result set S2 obtained in step 3 in combination with the prefix and suffix dictionary is as follows:

4.1) processing the first result set S2 output in the step 3 sentence by sentence in the sequence from front to back;

4.2) acquiring a sentence and traversing it, identifying separators by their ASCII code value; if a separator is found, executing step 4.3), and if no separator is found, executing step 4.4);

4.3) judging whether the Chinese character before the current separator exists in the prefix dictionary; if so, removing the current separator and executing step 4.2); if not, judging whether the Chinese character after the current separator exists in the suffix dictionary; if so, removing the current separator and executing step 4.2), and if not, executing step 4.4);

4.4) removing the current sentence from the first result set S2, and obtaining a new sentence from the remaining first result set S2 in the order from front to back;

4.5) repeating steps 4.2) to 4.4) until the first result set S2 is empty, and finally outputting a second result set S3.
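The separator test of step 4.3) can be sketched as a single pass over one segmented sentence. This is only an illustrative sketch: the function name `correct_boundaries` and the sample prefix and suffix sets are assumptions, and the real prefix and suffix dictionaries are the manually compiled ones mentioned in the text.

```python
def correct_boundaries(segmented, prefixes, suffixes, sep="/"):
    """Drop a separator when the character before it is a known prefix
    or the character after it is a known suffix (step 4.3)."""
    out = []
    for i, ch in enumerate(segmented):
        if ch != sep:
            out.append(ch)
            continue
        before = segmented[i - 1] if i > 0 else ""
        after = segmented[i + 1] if i + 1 < len(segmented) else ""
        if before in prefixes or after in suffixes:
            continue  # merge the two neighbouring segments
        out.append(sep)
    return "".join(out)


# Hypothetical one-entry dictionaries, for illustration only.
print(correct_boundaries("主管/部/门", prefixes=set(), suffixes={"门"}))
print(correct_boundaries("非/承重/墙", prefixes={"非"}, suffixes=set()))
```

In the first call the suffix "门" merges "部" and "门" into the new word "部门", while the separator after "主管" is kept because neither neighbouring character is in a dictionary.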

In the step 5, a specific process of extracting the frequent item rule by using the FP-growth algorithm is as follows:

5.1) randomly extracting 500 subway design specification texts processed in step 4 as a training set, and calculating the relevant parameters of the texts, specifically as follows:

for each new word w of the subway design specification text after the boundary correction in step 4, the parameter values (|w|, f(w), f(s), P(T_w), dl, qt) are calculated, wherein the calculation formulas of |w|, dl and qt are as follows:

wherein w is a new word obtained by correction, |w| is the length of w, f(w) is the word frequency of w, s is a nested word of w, f(s) is the word frequency of the nested words of w, T_w is the word corresponding to w in the standard set, and P(T_w) is the word frequency of w in the standard set;

after the parameter values of all new words are calculated, the parameters whose values are greater than or equal to 1 are recorded in the data row of the corresponding word, and the data rows are used as the training set for executing step 5.2);

5.2) after the training set is obtained through the calculation in the step 5.1), extracting and training the frequent item rules by using an FP-growth algorithm, wherein the method specifically comprises the following steps:

firstly, constructing an FP-Tree according to the training set: in the construction process, the data of each new word is read once in the order |w|, f(w), f(s), P(T_w), dl, qt; when inserting into the FP-Tree, items earlier in the order become ancestor nodes and items later in the order become descendant nodes; if a shared node already exists during insertion, 1 is added to the count of that node; if a new node appears during insertion, the new node is linked from its ancestor node; the construction of the FP-Tree is complete when all nodes have been inserted;

then using the FP-Tree to extract frequent item rules: first, according to the occurrence frequency of each parameter, the minimum frequency is taken as the minimum support; then, starting from the lowest leaf node of the FP-Tree, that leaf node is taken as the node to be mined and its corresponding sub-tree is obtained; after a sub-tree is obtained, the count of each node in the sub-tree is set to the leaf node count, and nodes whose count is lower than the minimum support are deleted, the remaining sub-tree being the frequent items of that leaf node; the process recurses from bottom to top until the ancestor node is reached, the finally extracted frequent items are taken as evaluation indexes of the correction result, and the parameters of the frequent items are taken as the frequent item rules.
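The tree construction and bottom-up mining described above follow the general shape of the textbook FP-growth algorithm, which can be sketched as below. This is a generic sketch, not the patented procedure: the names `Node`, `build_tree`, `mine` and `fp_growth`, the five training rows and the support threshold of 2 are all made up for illustration, and items are inserted in the fixed parameter order given in the text.

```python
from collections import defaultdict

# Fixed insertion order of the parameters, as listed in the text.
ORDER = ["|w|", "f(w)", "f(s)", "P(T_w)", "dl", "qt"]


class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}


def build_tree(weighted_rows, min_support):
    """Build an FP-tree; return per-item node links and item supports."""
    support = defaultdict(int)
    for items, count in weighted_rows:
        for item in set(items):
            support[item] += count
    frequent = {i: c for i, c in support.items() if c >= min_support}
    root = Node(None, None)
    links = defaultdict(list)
    for items, count in weighted_rows:
        node = root
        for item in (i for i in ORDER if i in frequent and i in items):
            child = node.children.get(item)
            if child is None:  # new node: link it from its ancestor
                child = Node(item, node)
                node.children[item] = child
                links[item].append(child)
            child.count += count  # shared node: increase its count
            node = child
    return links, frequent


def mine(weighted_rows, min_support, suffix=frozenset()):
    """Recursively mine frequent itemsets from conditional pattern bases."""
    links, frequent = build_tree(weighted_rows, min_support)
    found = {}
    for item, nodes in links.items():
        itemset = suffix | {item}
        found[itemset] = frequent[item]
        base = []  # prefix paths leading to this item, weighted by node count
        for node in nodes:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        found.update(mine(base, min_support, itemset))
    return found


def fp_growth(rows, min_support):
    return mine([(row, 1) for row in rows], min_support)


# Made-up training rows; each row lists the parameters with value >= 1.
rows = [["f(s)", "P(T_w)", "qt"], ["dl", "qt"], ["qt"], ["|w|", "dl"], ["dl", "qt"]]
for itemset, count in fp_growth(rows, min_support=2).items():
    print(sorted(itemset), count)
```

The text takes the minimum occurrence frequency of the parameters as the minimum support; the threshold is left as an explicit argument here so that the effect of pruning is visible on the small made-up data set.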

In step 6, the specific steps of performing result screening on the second result set S3 obtained in step 4 by using the frequent item rule extracted in step 5 are as follows:

6.1) first calculating |w|, f(w), f(s), P(T_w), dl and qt of each new word in the second result set S3, and then executing step 6.2);

6.2) screening the parameters obtained by calculation in the step 6.1) according to the frequent item rule obtained in the step 5; if the rule is met, no processing is carried out; and if the rule is not met, adding the separator removed in the step 4 to the original position.

The invention has the following advantages:

The method first extracts terms from the standards as a dictionary, then applies the reverse maximum matching algorithm to the subway design specifications according to the dictionary to obtain a result set, performs boundary correction on the result set according to prefix and suffix dictionaries obtained by manual induction, then extracts frequent item rules using the FP-growth algorithm, and finally screens the boundary-corrected results using the frequent item rules; the whole process is fully automatic.

Because the dictionary is constructed from the terms in the specifications, the segmentation results are preliminarily corrected using the predefined prefix and suffix dictionaries, and the preliminary correction results are then screened against frequent item rules extracted by the FP-growth algorithm, the word segmentation results are more accurate.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

Detailed Description

The present invention will be described in detail with reference to specific embodiments.

The invention relates to a Chinese word segmentation boundary correction method based on frequent items, which can be used for data processing in subway design. The method specifically comprises the following steps:

step 1: extracting terms from the specification standard as a dictionary;

Step 5, extracting frequent item rules by using an FP-growth algorithm;

wherein w is a new word obtained by correction, |w| is the length of w, f(w) is the word frequency of w, s is a nested word of w, f(s) is the word frequency of the nested words of w, T_w is the word corresponding to w in the standard set, P(T_w) is the word frequency of w in the standard set, and dl and qt are result parameters calculated by the corresponding formulas and have no practical meaning of their own;

Taking the suffix "门" (literally "door") as an example, after boundary correction the new word "部门" ("department") is identified. The word has a length of 2, and thus |w| = 0; "部门" does not appear as a separate word in the specification, and thus f(w) is 0; nested words such as "competent government department" and "government chief editing department" occur 7 times in total, and therefore f(s) is 7; and because the nested words that appear are all hits in the standard set, P(T_w) is also 7. The calculation gives dl = 0 and qt = 1. The data row of the current new word "部门" is therefore: f(s), P(T_w), qt.
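The data row of this worked example can be reproduced with a few lines of code. This is a minimal sketch: the name `data_row` and the dictionary layout are hypothetical, and the parameter values are entered by hand from the example rather than computed, since the formulas for |w|, dl and qt are not reproduced in this text.

```python
PARAMETER_ORDER = ("|w|", "f(w)", "f(s)", "P(T_w)", "dl", "qt")


def data_row(values):
    """Keep the names of the parameters whose value is >= 1 (step 5.1)."""
    return [name for name in PARAMETER_ORDER if values[name] >= 1]


# Parameter values of the new word taken from the worked example above.
department = {"|w|": 0, "f(w)": 0, "f(s)": 7, "P(T_w)": 7, "dl": 0, "qt": 1}
print(data_row(department))  # ['f(s)', 'P(T_w)', 'qt']
```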

firstly, constructing an FP-Tree according to the training set: in the construction process, the data of each new word is read once in the order |w|, f(w), f(s), P(T_w), dl, qt; when inserting into the FP-Tree, items earlier in the order become ancestor nodes and items later in the order become descendant nodes; if a shared node already exists during insertion, 1 is added to the count of that node; if a new node appears during insertion, the new node is linked from its ancestor node; the construction of the FP-Tree is complete when all nodes have been inserted;

Step 6 is illustrated by the following example:

wherein part of the frequent item rules extracted in step 5 in the experiment are: dl, qt; qt; dl, |w| (different rules are separated by ";").
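The screening of step 6.2) with these rules can be sketched as follows. The text does not define what it means for a word to meet a rule, so this sketch assumes that a data row meets the rules when it contains every parameter of at least one rule; that criterion, the names `meets_rules` and `screen`, and the representation of a corrected word as a left and right part are all assumptions.

```python
# The three example rules listed above, each as a set of parameter names.
RULES = [frozenset({"dl", "qt"}), frozenset({"qt"}), frozenset({"dl", "|w|"})]


def meets_rules(row, rules=RULES):
    """Assumed criterion: the data row contains all parameters of some rule."""
    items = set(row)
    return any(rule <= items for rule in rules)


def screen(left, right, row, rules=RULES, sep="/"):
    """Keep the merged word if its row meets a rule; otherwise put the
    separator removed in step 4 back at its original position."""
    if meets_rules(row, rules):
        return left + right
    return left + sep + right


# The data row of "部门" from the worked example contains the rule {qt}.
print(screen("部", "门", ["f(s)", "P(T_w)", "qt"]))
# A hypothetical row that matches no rule gets its separator restored.
print(screen("部", "门", ["f(w)"]))
```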

The invention is not limited to the examples, and any equivalent changes to the technical solution of the invention by a person skilled in the art after reading the description of the invention are covered by the claims of the invention.

Claims

1. The Chinese word segmentation boundary correction method based on frequent items is characterized by comprising the following steps:

step 1: extracting terms from the specification standard as a dictionary;

step 5, extracting frequent item rules by using an FP-growth algorithm;

2. The method for Chinese word segmentation boundary modification based on frequent items according to claim 1, wherein:

3. The method for Chinese word segmentation boundary modification based on frequent items according to claim 2, wherein:

if no match exists, executing step 3.4);

4. The method for Chinese word segmentation boundary modification based on frequent items according to claim 3, wherein:

5. The method for Chinese word segmentation boundary modification based on frequent items according to claim 4, wherein:

6. The method for Chinese word segmentation boundary modification based on frequent items according to claim 5, wherein: