CN114004222A - Chinese word segmentation boundary correction method based on frequent items - Google Patents
Chinese word segmentation boundary correction method based on frequent items
- Publication number
- CN114004222A (application number CN202111297120.9A)
- Authority
- CN
- China
- Prior art keywords
- sentence
- result set
- word
- dictionary
- separator
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to a Chinese word segmentation boundary correction method based on frequent items. Traditional dictionary-based Chinese word segmentation depends strictly on the quality of the dictionary and cannot identify unknown (out-of-vocabulary) words. The invention extracts terms from specification standards as a dictionary; takes the subway design specification text to be processed as the input text S1; performs reverse maximum matching to generate a first result set S2; performs boundary correction on S2 using prefix and suffix word dictionaries to obtain a second result set S3; and extracts frequent item rules with the FP-growth algorithm, screens the second result set S3 with them, and deletes incorrect segmentations. The process is fully automatic. It addresses the failure of Chinese word segmentation to identify unknown words in subway design specifications: the boundaries in the segmentation result are corrected using prefix and suffix word rules, and frequent items extracted with the FP-growth algorithm serve as evaluation indexes, so that the segmentation result is more accurate.
Description
Technical Field
The invention relates to the technical field of subway design data information processing, and in particular to a Chinese word segmentation boundary correction method based on frequent items.
Background
Chinese word segmentation is an important subtask of natural language processing. In recent years it has been widely applied in related fields such as information extraction, question-answering systems and knowledge graphs, and its results directly affect the performance of Chinese information processing.
Design specifications serve as the benchmark of the building industry, and how to process them efficiently and intelligently is a current hot topic in the industry. Chinese word segmentation, as the basis of information processing, can be applied to the processing of design specifications. However, traditional dictionary-based Chinese word segmentation depends strictly on the quality of the dictionary and cannot identify unknown words, while traditional rule-based Chinese word segmentation cannot screen and verify its correction results.
Disclosure of Invention
The invention aims to provide a Chinese word segmentation boundary correction method based on frequent items, which solves the problem that dictionary-based Chinese word segmentation in the prior art cannot identify unknown words in subway design specifications.
The technical scheme adopted by the invention is as follows:
The Chinese word segmentation boundary correction method based on frequent items comprises the following steps:
step 1: extracting terms from the specification standards as a dictionary;
step 2: taking the subway design specification text to be processed as the input text S1;
step 3: performing reverse maximum matching on the input text S1 to generate a first result set S2;
step 4: performing boundary correction on the first result set S2 obtained in step 3, using the prefix and suffix word dictionaries, to obtain a second result set S3;
step 5: extracting frequent item rules with the FP-growth algorithm;
step 6: screening the second result set S3 obtained in step 4 with the frequent item rules extracted in step 5, and deleting incorrect segmentations.
In step 1, the specification standards are Appendix A, "Building information model classification and coding", of the building information model classification and coding standard GB/T 51269-2017, and Appendix A, "Chinese term index", of the urban rail transit engineering basic terminology standard GB/T 50833-2012.
In step 3, the specific process of the reverse maximum matching is as follows:
3.1) processing the input text S1 of step 2 sentence by sentence, from front to back;
3.2) if the length of the sentence obtained in step 3.1) is less than the preset maximum word length n, taking the whole sentence as the matching field a and executing step 3.3);
if the sentence length is greater than or equal to the maximum word length n, taking the n characters at the rightmost end of the sentence as the matching field a and executing step 3.3);
3.3) looking up the matching field a obtained in step 3.2) in the dictionary of step 1; if the match succeeds, appending the separator "/" to the matched field, outputting it to the first result set S2, and removing the matching field a from the input text S1; then repeating step 3.2) on the rest of the input text S1;
if the match fails, executing step 3.4);
3.4) removing the leftmost character of the matching field a, taking the field formed by the remaining n-1 characters as a new matching field b, and executing step 3.3) again; if characters have been removed until only a single character remains and it still does not match, appending the separator "/" to that character and removing it from the input text S1; continuing until the sentence is empty;
3.5) after a sentence has been processed, removing it from the input text S1 and taking the next sentence from the rest of the input text S1, from front to back;
3.6) repeating steps 3.2) to 3.5) until the input text S1 is empty, and finally outputting the first result set S2.
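Steps 3.2) to 3.4) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `reverse_maximum_match`, the use of a Python `set` as the dictionary, and joining the words with "/" (rather than appending "/" after every word) are choices made here for brevity.

```python
def reverse_maximum_match(sentence, dictionary, max_len):
    """Segment one sentence by scanning from its right end (steps 3.2-3.4).

    `dictionary` is any container supporting `in` (a set is assumed here);
    `max_len` plays the role of the preset maximum word length n.
    """
    words = []
    end = len(sentence)
    while end > 0:
        # Matching field: at most max_len characters ending at `end`.
        start = max(0, end - max_len)
        # Drop the leftmost character until the field is in the dictionary
        # or only a single character is left (step 3.4).
        while end - start > 1 and sentence[start:end] not in dictionary:
            start += 1
        words.append(sentence[start:end])
        end = start  # remove the emitted field from the remaining text
    words.reverse()  # fields were collected from right to left
    return "/".join(words)
```

With a hypothetical dictionary `{"地铁", "设计", "规范"}` and `max_len=4`, the sentence "地铁设计规范" is split into "地铁/设计/规范"; a character that matches nothing is emitted as a single-character segment.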
In step 4, the specific process of performing boundary correction on the first result set S2 obtained in step 3, using the prefix and suffix word dictionaries, is as follows:
4.1) processing the first result set S2 output by step 3 sentence by sentence, from front to back;
4.2) taking a sentence and traversing it, identifying separators by their ASCII code value; if a separator is found, executing step 4.3); if not, executing step 4.4);
4.3) checking whether the Chinese character immediately before the current separator is in the prefix word dictionary; if so, removing the current separator and returning to step 4.2); if not, checking whether the Chinese character immediately after the current separator is in the suffix word dictionary; if so, removing the current separator and returning to step 4.2); if not, executing step 4.4);
4.4) removing the current sentence from the first result set S2 and taking the next sentence from the rest of S2, from front to back;
4.5) repeating steps 4.2) to 4.4) until the first result set S2 is empty, and finally outputting the second result set S3.
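The separator check of step 4.3) might look like the sketch below. It is an illustration only: the function name `correct_boundaries` and the returned list of removed positions are assumptions added here (the positions make it possible to restore a separator later, as step 6 requires), and the sketch examines every separator in the sentence.

```python
def correct_boundaries(segmented, prefixes, suffixes, sep="/"):
    """Remove separators next to prefix/suffix characters (step 4.3).

    Returns the corrected sentence and, as a hypothetical convenience,
    the offsets (counted in the separator-free text) of removed separators.
    """
    out = []
    removed = []
    plain_len = 0  # number of non-separator characters seen so far
    for i, ch in enumerate(segmented):
        if ch != sep:  # "/" is ASCII code 47
            out.append(ch)
            plain_len += 1
            continue
        before = segmented[i - 1] if i > 0 else ""
        after = segmented[i + 1] if i + 1 < len(segmented) else ""
        if before in prefixes or after in suffixes:
            removed.append(plain_len)  # merge the two neighbouring segments
        else:
            out.append(sep)
    return "".join(out), removed
```

For example, with an empty prefix set and the suffix set `{"门"}`, the input "主管/部/门" becomes "主管/部门", and the removed separator is recorded at offset 3.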
In step 5, the specific process of extracting frequent item rules with the FP-growth algorithm is as follows:
5.1) randomly selecting 500 subway design specification texts processed in step 4 as a training set and calculating their parameters, specifically:
for each new word w of the subway design specification text after the boundary correction of step 4, calculating the parameter values (|w|, f(w), f(s), P(T_w), dl, qt), where |w|, dl and qt are calculated by the following formulas: [formulas not reproduced in the source text]
where w is the new word produced by the correction, |w| is the length of w, f(w) is the word frequency of w, s is a nested word of w, f(s) is the word frequency of the nested words of w, T_w is the word of w in the standard set, and P(T_w) is the word frequency of w in the standard set;
after the parameter values of all new words have been calculated, recording, for each word, the parameters whose values are greater than or equal to 1 in that word's data row, and using these data rows as the training set for step 5.2);
5.2) once the training set has been obtained by the calculation in step 5.1), extracting the frequent item rules with the FP-growth algorithm, specifically:
first, constructing an FP-Tree from the training set: the data of each new word is read once in the order |w|, f(w), f(s), P(T_w), dl, qt; when inserting into the FP-Tree, items earlier in the order become ancestor nodes and items later in the order become descendant nodes; if a node is shared with an existing path during insertion, its count is increased by 1; if a new node appears during insertion, it is linked from its ancestor node; the construction of the FP-Tree is complete once all nodes have been inserted;
then, extracting frequent item rules from the FP-Tree: first, based on the number of occurrences of each parameter, taking the minimum frequency as the minimum support; then, starting from the lowest leaf node of the FP-Tree, taking that leaf node as the node to be mined and obtaining its corresponding subtree; after a subtree is obtained, setting the count of each node in the subtree to the count of the leaf node and deleting the nodes whose count is below the minimum support, the resulting subtree being the frequent items of that leaf node; recursing from bottom to top until the ancestor node is reached; and taking the finally extracted frequent items as the evaluation index of the correction result and the parameters of the frequent items as the frequent item rules.
In step 6, the specific steps of screening the second result set S3 obtained in step 4 with the frequent item rules extracted in step 5 are as follows:
6.1) first calculating |w|, f(w), f(s), P(T_w), dl and qt for each new word in the second result set S3, then executing step 6.2);
6.2) screening the parameters calculated in step 6.1) against the frequent item rules obtained in step 5; if the rules are met, no processing is performed; if the rules are not met, the separator removed in step 4 is restored to its original position.
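The screening of step 6.2) could be sketched as below. The text does not define precisely what it means for a word to "meet" a rule, so this sketch assumes that a word passes when its data row (the set of parameters with value greater than or equal to 1) contains every parameter of at least one rule. The function name `screen_merges` and the `(left, right, row)` input format are likewise hypothetical.

```python
def screen_merges(merges, rules, sep="/"):
    """Keep or undo each merge made in step 4 (step 6.2).

    merges: list of (left, right, row), where left + right is the new word
            and row is the set of parameter names with value >= 1.
    rules:  list of sets of parameter names (the frequent item rules).
    """
    result = []
    for left, right, row in merges:
        # Assumption: a rule is met when all of its parameters are in the row.
        if any(rule <= row for rule in rules):
            result.append(left + right)        # rule met: keep the new word
        else:
            result.append(left + sep + right)  # not met: restore separator
    return result
```

A merged word whose row is `{"f(s)", "P(T_w)", "qt"}` passes the single-parameter rule `{"qt"}` and stays merged, while a word whose row contains only `f(w)` gets its separator back.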
The invention has the following advantages:
The method first extracts terms from specification standards as a dictionary, then applies the reverse maximum matching algorithm to the subway design specification using that dictionary to obtain a result set, performs boundary correction on the result set using manually compiled prefix and suffix word dictionaries, extracts frequent item rules with the FP-growth algorithm, and finally screens the boundary-corrected results with those rules; the whole process is fully automatic.
By building the dictionary from terms in the specifications, preliminarily correcting the segmentation result with predefined prefix and suffix word dictionaries, and screening the preliminary corrections with frequent item rules extracted by the FP-growth algorithm, the method makes the segmentation result more accurate.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The present invention will be described in detail with reference to specific embodiments.
The invention relates to a Chinese word segmentation boundary correction method based on frequent items, which can be used for data processing in subway design. The method specifically comprises the following steps:
step 1: extracting terms from the specification standards as a dictionary;
in step 1, the specification standards are Appendix A, "Building information model classification and coding", of the building information model classification and coding standard GB/T 51269-2017, and Appendix A, "Chinese term index", of the urban rail transit engineering basic terminology standard GB/T 50833-2012.
step 2: taking the subway design specification text to be processed as the input text S1;
step 3: performing reverse maximum matching on the input text S1 to generate a first result set S2;
in step 3, the specific process of the reverse maximum matching is as follows:
3.1) processing the input text S1 of step 2 sentence by sentence, from front to back;
3.2) if the length of the sentence obtained in step 3.1) is less than the preset maximum word length n, taking the whole sentence as the matching field a and executing step 3.3);
if the sentence length is greater than or equal to the maximum word length n, taking the n characters at the rightmost end of the sentence as the matching field a and executing step 3.3);
3.3) looking up the matching field a obtained in step 3.2) in the dictionary of step 1; if the match succeeds, appending the separator "/" to the matched field, outputting it to the first result set S2, and removing the matching field a from the input text S1; then repeating step 3.2) on the rest of the input text S1;
if the match fails, executing step 3.4);
3.4) removing the leftmost character of the matching field a, taking the field formed by the remaining n-1 characters as a new matching field b, and executing step 3.3) again; if characters have been removed until only a single character remains and it still does not match, appending the separator "/" to that character and removing it from the input text S1; continuing until the sentence is empty;
3.5) after a sentence has been processed, removing it from the input text S1 and taking the next sentence from the rest of the input text S1, from front to back;
3.6) repeating steps 3.2) to 3.5) until the input text S1 is empty, and finally outputting the first result set S2.
step 4: performing boundary correction on the first result set S2 obtained in step 3, using the prefix and suffix word dictionaries, to obtain a second result set S3;
in step 4, the specific process of the boundary correction is as follows:
4.1) processing the first result set S2 output by step 3 sentence by sentence, from front to back;
4.2) taking a sentence and traversing it, identifying separators by their ASCII code value; if a separator is found, executing step 4.3); if not, executing step 4.4);
4.3) checking whether the Chinese character immediately before the current separator is in the prefix word dictionary; if so, removing the current separator and returning to step 4.2); if not, checking whether the Chinese character immediately after the current separator is in the suffix word dictionary; if so, removing the current separator and returning to step 4.2); if not, executing step 4.4);
4.4) removing the current sentence from the first result set S2 and taking the next sentence from the rest of S2, from front to back;
4.5) repeating steps 4.2) to 4.4) until the first result set S2 is empty, and finally outputting the second result set S3.
step 5: extracting frequent item rules with the FP-growth algorithm;
in step 5, the specific process of extracting frequent item rules with the FP-growth algorithm is as follows:
5.1) randomly selecting 500 subway design specification texts processed in step 4 as a training set and calculating their parameters, specifically:
for each new word w of the subway design specification text after the boundary correction of step 4, calculating the parameter values (|w|, f(w), f(s), P(T_w), dl, qt), where |w|, dl and qt are calculated by the following formulas: [formulas not reproduced in the source text]
where w is the new word produced by the correction, |w| is the length of w, f(w) is the word frequency of w, s is a nested word of w, f(s) is the word frequency of the nested words of w, T_w is the word of w in the standard set, and P(T_w) is the word frequency of w in the standard set; dl and qt are result parameters calculated by the corresponding formulas and have no practical meaning of their own;
after the parameter values of all new words have been calculated, recording, for each word, the parameters whose values are greater than or equal to 1 in that word's data row, and using these data rows as the training set for step 5.2);
Taking the suffix "door" as an example: after boundary correction, the new word "department" is identified. The word has length 2, thus |w| = 0; "department" does not appear as a separate word in the specification, thus f(w) is 0; nested words such as "government supervisory department" and "government chief editing department" occur 7 times in total, thus f(s) is 7; and because all of the nested words that occur are hits in the standard set, P(T_w) is also 7. By calculation, dl is 0 and qt is 1. The data row of the new word "department" is therefore: f(s), P(T_w), qt.
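The rule "record the parameters that are greater than or equal to 1" can be reproduced for this example with a few lines of Python. The names `PARAM_ORDER` and `data_row` are introduced here for illustration; the parameter values are the ones stated in the example above.

```python
# Fixed reading order of the six parameters, as given in step 5.
PARAM_ORDER = ("|w|", "f(w)", "f(s)", "P(T_w)", "dl", "qt")


def data_row(params):
    """Return the names of the parameters whose value is >= 1, in order."""
    return [name for name in PARAM_ORDER if params.get(name, 0) >= 1]


# Parameter values of the new word "department" from the example.
department = {"|w|": 0, "f(w)": 0, "f(s)": 7, "P(T_w)": 7, "dl": 0, "qt": 1}
print(data_row(department))  # ['f(s)', 'P(T_w)', 'qt']
```

Only the parameter names enter the data row, not their values, which is what allows the rows to be treated as transactions by FP-growth.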
5.2) once the training set has been obtained by the calculation in step 5.1), extracting the frequent item rules with the FP-growth algorithm, specifically:
first, constructing an FP-Tree from the training set: the data of each new word is read once in the order |w|, f(w), f(s), P(T_w), dl, qt; when inserting into the FP-Tree, items earlier in the order become ancestor nodes and items later in the order become descendant nodes; if a node is shared with an existing path during insertion, its count is increased by 1; if a new node appears during insertion, it is linked from its ancestor node; the construction of the FP-Tree is complete once all nodes have been inserted;
then, extracting frequent item rules from the FP-Tree: first, based on the number of occurrences of each parameter, taking the minimum frequency as the minimum support; then, starting from the lowest leaf node of the FP-Tree, taking that leaf node as the node to be mined and obtaining its corresponding subtree; after a subtree is obtained, setting the count of each node in the subtree to the count of the leaf node and deleting the nodes whose count is below the minimum support, the resulting subtree being the frequent items of that leaf node; recursing from bottom to top until the ancestor node is reached; and taking the finally extracted frequent items as the evaluation index of the correction result and the parameters of the frequent items as the frequent item rules.
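The tree construction and bottom-up mining described in step 5.2) follow the general shape of FP-growth, which can be sketched compactly as below. This is a generic textbook-style sketch, not the patented procedure: the class and function names are hypothetical, `min_support` is left as a caller-supplied parameter because the text does not fix its value numerically, and items are inserted in the fixed parameter order described above (classical FP-growth usually orders items by descending support; any consistent order gives the same frequent itemsets).

```python
from collections import defaultdict


class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}


def build_tree(transactions, order):
    """Insert (items, count) pairs into an FP-tree; return header links."""
    root = Node(None, None)
    links = defaultdict(list)  # item -> all tree nodes carrying that item
    for items, count in transactions:
        node = root
        for item in sorted(items, key=order.__getitem__):
            child = node.children.get(item)
            if child is None:  # new node: link it from its ancestor
                child = Node(item, node)
                node.children[item] = child
                links[item].append(child)
            child.count += count  # shared node: increase its count
            node = child
    return links


def fp_growth(transactions, order, min_support, suffix=()):
    """Yield (itemset, support) for every frequent itemset."""
    support = defaultdict(int)
    for items, count in transactions:
        for item in items:
            support[item] += count
    frequent = {i for i, c in support.items() if c >= min_support}
    pruned = [([i for i in items if i in frequent], c)
              for items, c in transactions]
    links = build_tree(pruned, order)
    # Mine from the lowest-ranked item (deepest in the tree) upward.
    for item in sorted(links, key=order.__getitem__, reverse=True):
        itemset = (item,) + suffix
        yield frozenset(itemset), support[item]
        base = []  # conditional pattern base: prefix paths ending at `item`
        for node in links[item]:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        yield from fp_growth(base, order, min_support, itemset)


ORDER = {p: i for i, p in enumerate(
    ("|w|", "f(w)", "f(s)", "P(T_w)", "dl", "qt"))}
# Toy data rows, made up purely for illustration (not from the patent).
ROWS = [["f(s)", "P(T_w)", "qt"], ["dl", "qt"], ["qt"],
        ["|w|", "dl"], ["dl", "qt"]]
RULES = dict(fp_growth([(row, 1) for row in ROWS], ORDER, min_support=2))
```

On these made-up rows with a minimum support of 2, `RULES` contains three frequent itemsets: `{qt}` with support 4, `{dl}` with support 3, and `{dl, qt}` with support 2.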
step 6: screening the second result set S3 obtained in step 4 with the frequent item rules extracted in step 5, and deleting incorrect segmentations.
In step 6, the specific steps of screening the second result set S3 with the frequent item rules are as follows:
6.1) first calculating |w|, f(w), f(s), P(T_w), dl and qt for each new word in the second result set S3, then executing step 6.2);
6.2) screening the parameters calculated in step 6.1) against the frequent item rules obtained in step 5; if the rules are met, no processing is performed; if the rules are not met, the separator removed in step 4 is restored to its original position.
Step 6 is illustrated by the following example:
some of the frequent item rules extracted in step 5 in the experiment are: dl, qt; qt; dl, |w|; and so on (different rules are separated by ";").
The invention is not limited to the examples, and any equivalent changes to the technical solution of the invention by a person skilled in the art after reading the description of the invention are covered by the claims of the invention.
Claims (6)
1. A Chinese word segmentation boundary correction method based on frequent items, characterized in that the method comprises the following steps:
step 1: extracting terms from the specification standards as a dictionary;
step 2: taking the subway design specification text to be processed as the input text S1;
step 3: performing reverse maximum matching on the input text S1 to generate a first result set S2;
step 4: performing boundary correction on the first result set S2 obtained in step 3, using the prefix and suffix word dictionaries, to obtain a second result set S3;
step 5: extracting frequent item rules with the FP-growth algorithm;
step 6: screening the second result set S3 obtained in step 4 with the frequent item rules extracted in step 5, and deleting incorrect segmentations.
2. The Chinese word segmentation boundary correction method based on frequent items according to claim 1, wherein:
in step 1, the specification standards are Appendix A, "Building information model classification and coding", of the building information model classification and coding standard GB/T 51269-2017, and Appendix A, "Chinese term index", of the urban rail transit engineering basic terminology standard GB/T 50833-2012.
3. The Chinese word segmentation boundary correction method based on frequent items according to claim 2, wherein:
in step 3, the specific process of the reverse maximum matching is as follows:
3.1) processing the input text S1 of step 2 sentence by sentence, from front to back;
3.2) if the length of the sentence obtained in step 3.1) is less than the preset maximum word length n, taking the whole sentence as the matching field a and executing step 3.3);
if the sentence length is greater than or equal to the maximum word length n, taking the n characters at the rightmost end of the sentence as the matching field a and executing step 3.3);
3.3) looking up the matching field a obtained in step 3.2) in the dictionary of step 1; if the match succeeds, appending the separator "/" to the matched field, outputting it to the first result set S2, and removing the matching field a from the input text S1; then repeating step 3.2) on the rest of the input text S1;
if the match fails, executing step 3.4);
3.4) removing the leftmost character of the matching field a, taking the field formed by the remaining n-1 characters as a new matching field b, and executing step 3.3) again; if characters have been removed until only a single character remains and it still does not match, appending the separator "/" to that character and removing it from the input text S1; continuing until the sentence is empty;
3.5) after a sentence has been processed, removing it from the input text S1 and taking the next sentence from the rest of the input text S1, from front to back;
3.6) repeating steps 3.2) to 3.5) until the input text S1 is empty, and finally outputting the first result set S2.
4. The Chinese word segmentation boundary correction method based on frequent items according to claim 3, wherein:
in step 4, the specific process of performing boundary correction on the first result set S2 obtained in step 3, using the prefix and suffix word dictionaries, is as follows:
4.1) processing the first result set S2 output by step 3 sentence by sentence, from front to back;
4.2) taking a sentence and traversing it, identifying separators by their ASCII code value; if a separator is found, executing step 4.3); if not, executing step 4.4);
4.3) checking whether the Chinese character immediately before the current separator is in the prefix word dictionary; if so, removing the current separator and returning to step 4.2); if not, checking whether the Chinese character immediately after the current separator is in the suffix word dictionary; if so, removing the current separator and returning to step 4.2); if not, executing step 4.4);
4.4) removing the current sentence from the first result set S2 and taking the next sentence from the rest of S2, from front to back;
4.5) repeating steps 4.2) to 4.4) until the first result set S2 is empty, and finally outputting the second result set S3.
5. The Chinese word segmentation boundary correction method based on frequent items according to claim 4, wherein:
in step 5, the specific process of extracting frequent item rules with the FP-growth algorithm is as follows:
5.1) randomly selecting 500 subway design specification texts processed in step 4 as a training set and calculating their parameters, specifically:
for each new word w of the subway design specification text after the boundary correction of step 4, calculating the parameter values (|w|, f(w), f(s), P(T_w), dl, qt), where |w|, dl and qt are calculated by the following formulas: [formulas not reproduced in the source text]
where w is the new word produced by the correction, |w| is the length of w, f(w) is the word frequency of w, s is a nested word of w, f(s) is the word frequency of the nested words of w, T_w is the word of w in the standard set, and P(T_w) is the word frequency of w in the standard set;
after the parameter values of all new words have been calculated, recording, for each word, the parameters whose values are greater than or equal to 1 in that word's data row, and using these data rows as the training set for step 5.2);
5.2) once the training set has been obtained by the calculation in step 5.1), extracting the frequent item rules with the FP-growth algorithm, specifically:
first, constructing an FP-Tree from the training set: the data of each new word is read once in the order |w|, f(w), f(s), P(T_w), dl, qt; when inserting into the FP-Tree, items earlier in the order become ancestor nodes and items later in the order become descendant nodes; if a node is shared with an existing path during insertion, its count is increased by 1; if a new node appears during insertion, it is linked from its ancestor node; the construction of the FP-Tree is complete once all nodes have been inserted;
then, extracting frequent item rules from the FP-Tree: first, based on the number of occurrences of each parameter, taking the minimum frequency as the minimum support; then, starting from the lowest leaf node of the FP-Tree, taking that leaf node as the node to be mined and obtaining its corresponding subtree; after a subtree is obtained, setting the count of each node in the subtree to the count of the leaf node and deleting the nodes whose count is below the minimum support, the resulting subtree being the frequent items of that leaf node; recursing from bottom to top until the ancestor node is reached; and taking the finally extracted frequent items as the evaluation index of the correction result and the parameters of the frequent items as the frequent item rules.
6. The Chinese word segmentation boundary correction method based on frequent items according to claim 5, wherein:
in step 6, the specific steps of screening the second result set S3 obtained in step 4 with the frequent item rules extracted in step 5 are as follows:
6.1) first calculating |w|, f(w), f(s), P(T_w), dl and qt for each new word in the second result set S3, then executing step 6.2);
6.2) screening the parameters calculated in step 6.1) against the frequent item rules obtained in step 5; if the rules are met, no processing is performed; if the rules are not met, the separator removed in step 4 is restored to its original position.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111297120.9A CN114004222B (en) | 2021-11-04 | 2021-11-04 | Chinese word segmentation boundary correction method based on frequent items |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111297120.9A CN114004222B (en) | 2021-11-04 | 2021-11-04 | Chinese word segmentation boundary correction method based on frequent items |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114004222A (en) | 2022-02-01 |
CN114004222B (en) | 2024-04-30 |
Family
ID=79927025
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111297120.9A Active CN114004222B (en) | 2021-11-04 | 2021-11-04 | Chinese word segmentation boundary correction method based on frequent items |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114004222B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104077275A (en) * | 2014-06-27 | 2014-10-01 | 北京奇虎科技有限公司 | Method and device for performing word segmentation based on context |
CN108536724A (en) * | 2018-02-13 | 2018-09-14 | 西安理工大学 | Main body recognition methods in a kind of metro design code based on the double-deck hash index |
CN108829696A (en) * | 2018-04-18 | 2018-11-16 | 西安理工大学 | Towards knowledge mapping node method for auto constructing in metro design code |
CN110046348A (en) * | 2019-03-19 | 2019-07-23 | 西安理工大学 | Main body recognition methods in a kind of rule-based and dictionary metro design code |
Also Published As
Publication number | Publication date |
---|---|
CN114004222B (en) | 2024-04-30 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||