CN114004222A - Chinese word segmentation boundary correction method based on frequent items - Google Patents

Publication number
CN114004222A
Authority
CN
China
Prior art keywords
sentence
result set
word
dictionary
separator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111297120.9A
Other languages
Chinese (zh)
Other versions
CN114004222B (en)
Inventor
任晓春
王玮
谢斯
张雨
朱磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Railway First Survey and Design Institute Group Ltd
Original Assignee
China Railway First Survey and Design Institute Group Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Railway First Survey and Design Institute Group Ltd
Priority to CN202111297120.9A
Publication of CN114004222A
Application granted
Publication of CN114004222B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a Chinese word segmentation boundary correction method based on frequent items. Traditional dictionary-based Chinese word segmentation methods depend strictly on the quality of the dictionary and cannot identify unknown words. The invention extracts terms from specification standards as a dictionary; takes a subway design specification text to be processed as an input text S1, performs reverse maximum matching to generate a first result set S2, and performs boundary correction in combination with prefix and suffix dictionaries to obtain a second result set S3; and extracts frequent item rules by using the FP-growth algorithm, screens the second result set S3 with them, and deletes incorrect segmentations. The method is fully automatic, solves the problem that Chinese word segmentation cannot identify unknown words in subway design specifications, corrects the boundaries of the segmentation results by using prefix and suffix rules, and uses frequent items extracted with the FP-growth algorithm as evaluation indexes, so that the segmentation results are more accurate.

Description

Chinese word segmentation boundary correction method based on frequent items
Technical Field
The invention relates to the technical field of subway design data information processing, and in particular to a Chinese word segmentation boundary correction method based on frequent items.
Background
As an important subtask of natural language processing, Chinese word segmentation has in recent years been widely applied in related fields such as information extraction, question-answering systems and knowledge graphs, and its results directly affect the performance of Chinese information processing.
Design specifications serve as the benchmark of the building industry, and how to process them efficiently and intelligently is currently a hot topic in the industry. As the basis of information processing, Chinese word segmentation can be applied to the processing of design specifications. Traditional dictionary-based Chinese word segmentation methods depend strictly on the quality of the dictionary and cannot identify unknown words, while traditional rule-based methods cannot screen and verify the correction results.
Disclosure of Invention
The invention aims to provide a Chinese word segmentation boundary correction method based on frequent items, which solves the problem that dictionary-based Chinese word segmentation in the prior art cannot identify unknown words in subway design specifications.
The technical scheme adopted by the invention is as follows:
A Chinese word segmentation boundary correction method based on frequent items, characterized in that the method comprises the following steps:
step 1: extracting terms from the specification standards as a dictionary;
step 2: taking the subway design specification text to be processed as an input text S1;
step 3: performing reverse maximum matching on the input text S1 to generate a first result set S2;
step 4: performing boundary correction on the first result set S2 obtained in step 3, in combination with the prefix and suffix dictionaries, to obtain a second result set S3;
step 5: extracting frequent item rules by using the FP-growth algorithm;
step 6: screening the second result set S3 obtained in step 4 by using the frequent item rules extracted in step 5, and deleting incorrect segmentations.
In step 1, the specification standards are Appendix A "Building information model classification and coding" of the Building Information Model Classification and Coding Standard GB/T51269-2017 and Appendix A "Chinese term index" of the Urban Rail Transit Engineering Basic Term Standard GB/T50833-2012.
In step 3, the specific process of the reverse maximum matching is as follows:
3.1) processing the input text S1 of step 2 sentence by sentence, from front to back;
3.2) if the length of the sentence obtained in step 3.1) is less than the preset maximum word length n, taking the whole sentence as the matching field a and executing step 3.3);
if the sentence length is greater than or equal to the maximum word length n, taking the n characters at the right end of the sentence as the matching field a and executing step 3.3);
3.3) looking up the matching field a obtained in step 3.2) in the dictionary of step 1; if the match succeeds, appending the separator "/" to the matching field a, outputting it to the first result set S2, removing the matching field a from the input text S1, and repeating step 3.2) for the rest of the input text S1;
if there is no match, executing step 3.4);
3.4) removing the leftmost character of the matching field a, taking the remaining n-1 characters as a new matching field b, and repeating step 3.3); if characters have been removed until only a single character remains and it still does not match, outputting that character with a separator "/" and removing it from the input text S1; continuing in this way until the sentence is empty;
3.5) after a sentence has been processed, removing it from the input text S1 and obtaining a new sentence from the rest of the input text S1, from front to back;
3.6) repeating steps 3.2) to 3.5) until the input text S1 is empty, and finally outputting the first result set S2.
In step 4, the specific process of performing the boundary correction processing on the first result set S2 obtained in step 3 in combination with the prefix and suffix dictionary is as follows:
4.1) processing the first result set S2 output in the step 3 sentence by sentence in the sequence from front to back;
4.2) acquiring a sentence, traversing the sentence, judging a separator according to the value of the ASCII code, and executing the step 4.3) if the separator is found, and executing the step 4.4) if the separator is not found;
4.3) judging whether a Chinese character before the current separator exists in the prefix dictionary, if so, removing the current separator and executing the step 4.2); if not, judging whether a Chinese character after the current separator exists in the suffix word dictionary, if so, removing the current separator and executing the step 4.2), and if not, executing the step 4.4);
4.4) removing the current sentence from the first result set S2, and obtaining a new sentence from the remaining first result set S2 in the order from front to back;
4.5) repeating steps 4.2) to 4.4) until the first result set S2 is empty, and finally outputting a second result set S3.
In the step 5, a specific process of extracting the frequent item rule by using the FP-growth algorithm is as follows:
5.1) randomly extracting 500 subway design specification texts processed in step 4 as a training set and calculating their relevant parameters, specifically:
for each new word w of the subway design specification text after the boundary correction of step 4, calculating the parameter values (|w|, f(w), f(s), P(T_w), dl, qt), where |w|, dl and qt are calculated by the following formulas:
[Formula image not reproduced in this text: calculation formulas for |w|, dl and qt]
where w is a new word obtained by the correction, |w| is the length of w, f(w) is the word frequency of w, s is a nested word of w, f(s) is the word frequency of the nested words of w, T_w is the standard-set word corresponding to w, and P(T_w) is the word frequency of w in the standard set;
after the parameter values of all new words have been calculated, recording the parameters whose values are greater than or equal to 1 in the data row of the corresponding word, and taking these data rows as the training set for step 5.2);
5.2) after the training set has been obtained in step 5.1), extracting the frequent item rules by using the FP-growth algorithm, specifically:
first, constructing an FP-Tree from the training set: the data of each new word is read once, in the order |w|, f(w), f(s), P(T_w), dl, qt; when inserting into the FP-Tree, items earlier in the order become ancestor nodes and items later in the order become descendant nodes; if a shared node already exists during insertion, its count is increased by 1; if a new node appears during insertion, it is linked from its ancestor node; the FP-Tree is complete once all nodes have been inserted;
then extracting the frequent item rules from the FP-Tree: first, according to the number of occurrences of each parameter, taking the minimum frequency as the minimum support; then, starting from the lowest leaf node of the FP-Tree, taking that leaf node as the node to be mined and obtaining its corresponding subtree; after a subtree is obtained, setting the count of each node in the subtree to the leaf node count and deleting the nodes whose count is lower than the support, the resulting subtree being the frequent items of that leaf node; recursing from bottom to top until the ancestor node is reached, taking the finally extracted frequent items as the evaluation index of the correction result, and taking the parameters of the frequent items as the frequent item rules.
In step 6, the specific steps of performing result screening on the second result set S3 obtained in step 4 by using the frequent item rule extracted in step 5 are as follows:
6.1) first calculating |w|, f(w), f(s), P(T_w), dl and qt for each new word in the second result set S3, then executing step 6.2);
6.2) screening the parameters obtained by calculation in the step 6.1) according to the frequent item rule obtained in the step 5; if the rule is met, no processing is carried out; and if the rule is not met, adding the separator removed in the step 4 to the original position.
The invention has the following advantages:
The method first extracts terms from the specification standards as a dictionary, then applies the reverse maximum matching algorithm to the subway design specification according to the dictionary to obtain a result set, performs boundary correction on the result set according to manually compiled prefix and suffix dictionaries, then extracts frequent item rules by using the FP-growth algorithm, and finally screens the boundary-corrected results with the frequent item rules; the whole process is fully automatic.
The method constructs a dictionary from the terms in the specification standards, performs Chinese word segmentation on subway design specifications based on that dictionary, applies a preliminary correction to the segmentation results using predefined prefix and suffix dictionaries, and screens the preliminary correction results using frequent item rules extracted with the FP-growth algorithm, so that the segmentation results are more accurate and the whole process is fully automatic.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The present invention will be described in detail with reference to specific embodiments.
The invention relates to a Chinese word segmentation boundary correction method based on frequent items, which can be used for data processing in subway design. The method specifically comprises the following steps:
step 1: extracting terms from the specification standards as a dictionary;
in step 1, the specification standards are Appendix A "Building information model classification and coding" of the Building Information Model Classification and Coding Standard GB/T51269-2017 and Appendix A "Chinese term index" of the Urban Rail Transit Engineering Basic Term Standard GB/T50833-2012.
Step 2, taking the subway design specification text to be processed as an input text S1;
step 3, performing reverse maximum matching processing on the input text S1 to generate a first result set S2;
in step 3, the specific process of the reverse maximum matching is as follows:
3.1) processing the input text S1 of step 2 sentence by sentence, from front to back;
3.2) if the length of the sentence obtained in step 3.1) is less than the preset maximum word length n, taking the whole sentence as the matching field a and executing step 3.3);
if the sentence length is greater than or equal to the maximum word length n, taking the n characters at the right end of the sentence as the matching field a and executing step 3.3);
3.3) looking up the matching field a obtained in step 3.2) in the dictionary of step 1; if the match succeeds, appending the separator "/" to the matching field a, outputting it to the first result set S2, removing the matching field a from the input text S1, and repeating step 3.2) for the rest of the input text S1;
if there is no match, executing step 3.4);
3.4) removing the leftmost character of the matching field a, taking the remaining n-1 characters as a new matching field b, and repeating step 3.3); if characters have been removed until only a single character remains and it still does not match, outputting that character with a separator "/" and removing it from the input text S1; continuing in this way until the sentence is empty;
3.5) after a sentence has been processed, removing it from the input text S1 and obtaining a new sentence from the rest of the input text S1, from front to back;
3.6) repeating steps 3.2) to 3.5) until the input text S1 is empty, and finally outputting the first result set S2.
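The reverse maximum matching of steps 3.1) to 3.6) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `reverse_maximum_match` and `segment_text`, the toy dictionary, and the choice to place "/" between words rather than after every word are all made up for the example.

```python
def reverse_maximum_match(sentence, dictionary, max_len):
    """Segment one sentence from right to left; return the words in reading order."""
    words = []
    end = len(sentence)
    while end > 0:
        # Matching field a: up to max_len characters ending at position `end`.
        start = max(0, end - max_len)
        # Drop the leftmost character until the field is in the dictionary
        # or only a single character is left (which is emitted as it is).
        while end - start > 1 and sentence[start:end] not in dictionary:
            start += 1
        words.append(sentence[start:end])
        end = start  # continue with the text to the left of the match
    words.reverse()
    return words


def segment_text(sentences, dictionary, max_len, sep="/"):
    """Apply reverse maximum matching sentence by sentence (input S1 -> result S2)."""
    return [sep.join(reverse_maximum_match(s, dictionary, max_len)) for s in sentences]


# Toy dictionary, assumed for illustration only; the method builds the real
# dictionary from the standards named in step 1.
toy_dictionary = {"地铁", "设计", "规范"}
s2 = segment_text(["地铁设计规范", "新地铁"], toy_dictionary, max_len=4)
print(s2)  # ['地铁/设计/规范', '新/地铁']
```

Characters that never match, such as "新" in the second sentence, come out as single-character tokens; these are the fragments that the boundary correction of step 4 later tries to merge.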
Step 4, performing boundary correction processing on the first result set S2 obtained in the step 3 in combination with the prefix and suffix dictionary to obtain a second result set S3;
in step 4, the specific process of performing the boundary correction processing on the first result set S2 obtained in step 3 in combination with the prefix and suffix dictionary is as follows:
4.1) processing the first result set S2 output in the step 3 sentence by sentence in the sequence from front to back;
4.2) acquiring a sentence, traversing the sentence, judging a separator according to the value of the ASCII code, and executing the step 4.3) if the separator is found, and executing the step 4.4) if the separator is not found;
4.3) judging whether a Chinese character before the current separator exists in the prefix dictionary, if so, removing the current separator and executing the step 4.2); if not, judging whether a Chinese character after the current separator exists in the suffix word dictionary, if so, removing the current separator and executing the step 4.2), and if not, executing the step 4.4);
4.4) removing the current sentence from the first result set S2, and obtaining a new sentence from the remaining first result set S2 in the order from front to back;
4.5) repeating steps 4.2) to 4.4) until the first result set S2 is empty, and finally outputting a second result set S3.
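A minimal Python sketch of the separator check in steps 4.2) and 4.3), under stated assumptions. The function name `correct_boundaries`, the toy prefix and suffix sets, and the returned list of removed positions (kept so that a later screening step could put separators back) are illustrative and not taken from the patent. The sketch compares characters directly instead of testing the ASCII code of "/", and it examines every separator in a single pass over the sentence.

```python
def correct_boundaries(segmented, prefixes, suffixes, sep="/"):
    """Remove a separator when the character before it is a known prefix
    or the character after it is a known suffix.

    Returns the corrected sentence and the indices (in the input string)
    of the separators that were removed.
    """
    kept_chars = []
    removed = []
    for i, ch in enumerate(segmented):
        if ch == sep:
            before = segmented[i - 1] if i > 0 else ""
            after = segmented[i + 1] if i + 1 < len(segmented) else ""
            if before in prefixes or after in suffixes:
                removed.append(i)  # remember where the separator was
                continue           # drop it, merging the two neighbours
        kept_chars.append(ch)
    return "".join(kept_chars), removed


# Toy affix sets, assumed for illustration only.
corrected, removed = correct_boundaries("政府/主管/部/门", prefixes=set(), suffixes={"门"})
print(corrected)  # 政府/主管/部门
print(removed)    # [7]
```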
Step 5, extracting frequent item rules by using an FP-growth algorithm;
in the step 5, a specific process of extracting the frequent item rule by using the FP-growth algorithm is as follows:
5.1) randomly extracting 500 subway design specification texts processed in step 4 as a training set and calculating their relevant parameters, specifically:
for each new word w of the subway design specification text after the boundary correction of step 4, calculating the parameter values (|w|, f(w), f(s), P(T_w), dl, qt), where |w|, dl and qt are calculated by the following formulas:
[Formula image not reproduced in this text: calculation formulas for |w|, dl and qt]
where w is a new word obtained by the correction, |w| is the length of w, f(w) is the word frequency of w, s is a nested word of w, f(s) is the word frequency of the nested words of w, T_w is the standard-set word corresponding to w, and P(T_w) is the word frequency of w in the standard set; dl and qt are result parameters calculated by the corresponding formulas and have no practical meaning of their own;
after the parameter values of all new words have been calculated, recording the parameters whose values are greater than or equal to 1 in the data row of the corresponding word, and taking these data rows as the training set for step 5.2);
taking the suffix character "门" ("door") as an example, the new word "部门" ("department") is identified after boundary correction. The word has length 2, so |w| = 0; "department" does not appear as a separate word in the specification, so f(w) is 0; nested words such as "government supervisory department" and "government chief editorial department" occur 7 times in total, so f(s) is 7; and because all of the nested words that occur are hits in the standard set, P(T_w) is also 7. By calculation, dl is 0 and qt is 1. The data row of the new word "department" is therefore: f(s), P(T_w), qt.
5.2) after the training set has been obtained in step 5.1), extracting the frequent item rules by using the FP-growth algorithm, specifically:
first, constructing an FP-Tree from the training set: the data of each new word is read once, in the order |w|, f(w), f(s), P(T_w), dl, qt; when inserting into the FP-Tree, items earlier in the order become ancestor nodes and items later in the order become descendant nodes; if a shared node already exists during insertion, its count is increased by 1; if a new node appears during insertion, it is linked from its ancestor node; the FP-Tree is complete once all nodes have been inserted;
then extracting the frequent item rules from the FP-Tree: first, according to the number of occurrences of each parameter, taking the minimum frequency as the minimum support; then, starting from the lowest leaf node of the FP-Tree, taking that leaf node as the node to be mined and obtaining its corresponding subtree; after a subtree is obtained, setting the count of each node in the subtree to the leaf node count and deleting the nodes whose count is lower than the support, the resulting subtree being the frequent items of that leaf node; recursing from bottom to top until the ancestor node is reached, taking the finally extracted frequent items as the evaluation index of the correction result, and taking the parameters of the frequent items as the frequent item rules.
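The FP-Tree construction and bottom-up mining of step 5.2) can be sketched with a compact, textbook-style FP-growth in Python. Several things here are assumptions rather than details from the patent: the names `Node`, `build_tree` and `fp_growth`; passing the minimum support in as a parameter (the text derives it from the parameter frequencies without giving a formula); and the toy training rows, of which only the first (the "department" row from the worked example) comes from the description. Items are inserted in the fixed parameter order stated in the text rather than sorted by frequency.

```python
from collections import defaultdict

ORDER = ("|w|", "f(w)", "f(s)", "P(T_w)", "dl", "qt")
RANK = {name: i for i, name in enumerate(ORDER)}


class Node:
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}


def build_tree(weighted_rows, min_support):
    """Build an FP-tree from (row, count) pairs; return item -> list of its nodes."""
    support = defaultdict(int)
    for row, count in weighted_rows:
        for item in set(row):
            support[item] += count
    root = Node(None, None)
    links = defaultdict(list)
    for row, count in weighted_rows:
        # Keep only frequent items, in the fixed order (ancestors first).
        kept = sorted((i for i in set(row) if support[i] >= min_support), key=RANK.get)
        node = root
        for item in kept:
            child = node.children.get(item)
            if child is None:  # new node: link it from its ancestor
                child = Node(item, node)
                node.children[item] = child
                links[item].append(child)
            child.count += count  # shared node: increase its count
            node = child
    return links


def fp_growth(weighted_rows, min_support, suffix=()):
    """Yield (itemset, support) for every frequent itemset."""
    links = build_tree(weighted_rows, min_support)
    # Mine from the deepest items (last in ORDER) upwards.
    for item in sorted(links, key=RANK.get, reverse=True):
        nodes = links[item]
        itemset = (item,) + suffix
        yield tuple(sorted(itemset, key=RANK.get)), sum(n.count for n in nodes)
        # Conditional pattern base: prefix path of each node, weighted by its count.
        base = []
        for n in nodes:
            path, p = [], n.parent
            while p is not None and p.item is not None:
                path.append(p.item)
                p = p.parent
            if path:
                base.append((path, n.count))
        yield from fp_growth(base, min_support, itemset)


# Toy training rows: the first is the "department" row from the description,
# the others are invented purely for illustration.
rows = [["f(s)", "P(T_w)", "qt"], ["dl", "qt"], ["qt"], ["|w|", "dl"], ["dl", "qt"]]
rules = dict(fp_growth([(r, 1) for r in rows], min_support=2))
print(rules)  # frequent itemsets: ('qt',) -> 4, ('dl',) -> 3, ('dl', 'qt') -> 2
```

Any consistent item order yields the same frequent itemsets; sorting by descending frequency, as in the usual formulation of FP-growth, only makes the tree more compact.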
Step 6, screening the second result set S3 obtained in step 4 by using the frequent item rules extracted in step 5, and deleting incorrect segmentations.
In step 6, the specific steps of screening the second result set S3 obtained in step 4 by using the frequent item rules extracted in step 5 are as follows:
6.1) first calculating |w|, f(w), f(s), P(T_w), dl and qt for each new word in the second result set S3, then executing step 6.2);
6.2) screening the parameters obtained by calculation in the step 6.1) according to the frequent item rule obtained in the step 5; if the rule is met, no processing is carried out; and if the rule is not met, adding the separator removed in the step 4 to the original position.
An example analysis for step 6:
[Figure not reproduced in this text: worked example of the step 6 screening]
where some of the frequent item rules extracted in step 5 in the experiment are: dl, qt; qt; dl, |w| (different rules are separated by ";").
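The screening of steps 6.1) and 6.2) can be sketched in Python as follows. The description does not spell out what it means for a word to "meet" a rule, so this sketch assumes that a word passes when its data row (the parameters with values greater than or equal to 1) contains every parameter of at least one rule. That criterion, the function names `data_row`, `meets_rules` and `restore_separators`, and the idea of recording removed separators as string indices are all assumptions made for illustration.

```python
PARAM_ORDER = ("|w|", "f(w)", "f(s)", "P(T_w)", "dl", "qt")


def data_row(values):
    """Keep the names of the parameters whose value is >= 1 (as in step 5.1)."""
    return [name for name in PARAM_ORDER if values.get(name, 0) >= 1]


def meets_rules(row, rules):
    """Assumed criterion: the row contains all parameters of at least one rule."""
    present = set(row)
    return any(set(rule) <= present for rule in rules)


def restore_separators(corrected, removed_positions, sep="/"):
    """Put separators back at the indices they had before step 4 removed them."""
    chars = list(corrected)
    for pos in sorted(removed_positions):
        chars.insert(pos, sep)
    return "".join(chars)


# The three rules quoted in the text: "dl, qt; qt; dl, |w|".
rules = [("dl", "qt"), ("qt",), ("dl", "|w|")]

# Parameter values of the new word "department" from the worked example.
department = {"|w|": 0, "f(w)": 0, "f(s)": 7, "P(T_w)": 7, "dl": 0, "qt": 1}
row = data_row(department)
print(row)                      # ['f(s)', 'P(T_w)', 'qt']
print(meets_rules(row, rules))  # True, so the merged word is kept

# A word that met no rule would have its separator restored, for example:
print(restore_separators("政府/主管/部门", [7]))  # 政府/主管/部/门
```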
The invention is not limited to the examples, and any equivalent changes to the technical solution of the invention by a person skilled in the art after reading the description of the invention are covered by the claims of the invention.

Claims (6)

1. A Chinese word segmentation boundary correction method based on frequent items, characterized in that the method comprises the following steps:
step 1: extracting terms from the specification standards as a dictionary;
step 2: taking the subway design specification text to be processed as an input text S1;
step 3: performing reverse maximum matching on the input text S1 to generate a first result set S2;
step 4: performing boundary correction on the first result set S2 obtained in step 3, in combination with the prefix and suffix dictionaries, to obtain a second result set S3;
step 5: extracting frequent item rules by using the FP-growth algorithm;
step 6: screening the second result set S3 obtained in step 4 by using the frequent item rules extracted in step 5, and deleting incorrect segmentations.
2. The Chinese word segmentation boundary correction method based on frequent items according to claim 1, wherein:
in step 1, the specification standards are Appendix A "Building information model classification and coding" of the Building Information Model Classification and Coding Standard GB/T51269-2017 and Appendix A "Chinese term index" of the Urban Rail Transit Engineering Basic Term Standard GB/T50833-2012.
3. The Chinese word segmentation boundary correction method based on frequent items according to claim 2, wherein:
in step 3, the specific process of the reverse maximum matching is as follows:
3.1) processing the input text S1 of step 2 sentence by sentence, from front to back;
3.2) if the length of the sentence obtained in step 3.1) is less than the preset maximum word length n, taking the whole sentence as the matching field a and executing step 3.3);
if the sentence length is greater than or equal to the maximum word length n, taking the n characters at the right end of the sentence as the matching field a and executing step 3.3);
3.3) looking up the matching field a obtained in step 3.2) in the dictionary of step 1; if the match succeeds, appending the separator "/" to the matching field a, outputting it to the first result set S2, removing the matching field a from the input text S1, and repeating step 3.2) for the rest of the input text S1;
if there is no match, executing step 3.4);
3.4) removing the leftmost character of the matching field a, taking the remaining n-1 characters as a new matching field b, and repeating step 3.3); if characters have been removed until only a single character remains and it still does not match, outputting that character with a separator "/" and removing it from the input text S1; continuing in this way until the sentence is empty;
3.5) after a sentence has been processed, removing it from the input text S1 and obtaining a new sentence from the rest of the input text S1, from front to back;
3.6) repeating steps 3.2) to 3.5) until the input text S1 is empty, and finally outputting the first result set S2.
4. The Chinese word segmentation boundary correction method based on frequent items according to claim 3, wherein:
in step 4, the specific process of performing the boundary correction processing on the first result set S2 obtained in step 3 in combination with the prefix and suffix dictionary is as follows:
4.1) processing the first result set S2 output in the step 3 sentence by sentence in the sequence from front to back;
4.2) acquiring a sentence, traversing the sentence, judging a separator according to the value of the ASCII code, and executing the step 4.3) if the separator is found, and executing the step 4.4) if the separator is not found;
4.3) judging whether a Chinese character before the current separator exists in the prefix dictionary, if so, removing the current separator and executing the step 4.2); if not, judging whether a Chinese character after the current separator exists in the suffix word dictionary, if so, removing the current separator and executing the step 4.2), and if not, executing the step 4.4);
4.4) removing the current sentence from the first result set S2, and obtaining a new sentence from the remaining first result set S2 in the order from front to back;
4.5) repeating steps 4.2) to 4.4) until the first result set S2 is empty, and finally outputting a second result set S3.
5. The Chinese word segmentation boundary correction method based on frequent items according to claim 4, wherein:
in the step 5, a specific process of extracting the frequent item rule by using the FP-growth algorithm is as follows:
5.1) randomly extracting 500 subway design specification texts processed in step 4 as a training set and calculating their relevant parameters, specifically:
for each new word w of the subway design specification text after the boundary correction of step 4, calculating the parameter values (|w|, f(w), f(s), P(T_w), dl, qt), where |w|, dl and qt are calculated by the following formulas:
[Formula image not reproduced in this text: calculation formulas for |w|, dl and qt]
where w is a new word obtained by the correction, |w| is the length of w, f(w) is the word frequency of w, s is a nested word of w, f(s) is the word frequency of the nested words of w, T_w is the standard-set word corresponding to w, and P(T_w) is the word frequency of w in the standard set;
after the parameter values of all new words have been calculated, recording the parameters whose values are greater than or equal to 1 in the data row of the corresponding word, and taking these data rows as the training set for step 5.2);
5.2) after the training set has been obtained in step 5.1), extracting the frequent item rules by using the FP-growth algorithm, specifically:
first, constructing an FP-Tree from the training set: the data of each new word is read once, in the order |w|, f(w), f(s), P(T_w), dl, qt; when inserting into the FP-Tree, items earlier in the order become ancestor nodes and items later in the order become descendant nodes; if a shared node already exists during insertion, its count is increased by 1; if a new node appears during insertion, it is linked from its ancestor node; the FP-Tree is complete once all nodes have been inserted;
then extracting the frequent item rules from the FP-Tree: first, according to the number of occurrences of each parameter, taking the minimum frequency as the minimum support; then, starting from the lowest leaf node of the FP-Tree, taking that leaf node as the node to be mined and obtaining its corresponding subtree; after a subtree is obtained, setting the count of each node in the subtree to the leaf node count and deleting the nodes whose count is lower than the support, the resulting subtree being the frequent items of that leaf node; recursing from bottom to top until the ancestor node is reached, taking the finally extracted frequent items as the evaluation index of the correction result, and taking the parameters of the frequent items as the frequent item rules.
6. The Chinese word segmentation boundary correction method based on frequent items according to claim 5, wherein:
in step 6, the specific steps of screening the second result set S3 obtained in step 4 by using the frequent item rules extracted in step 5 are as follows:
6.1) first calculating |w|, f(w), f(s), P(T_w), dl and qt for each new word in the second result set S3, then executing step 6.2);
6.2) screening the parameters obtained by calculation in the step 6.1) according to the frequent item rule obtained in the step 5; if the rule is met, no processing is carried out; and if the rule is not met, adding the separator removed in the step 4 to the original position.
CN202111297120.9A 2021-11-04 2021-11-04 Chinese word segmentation boundary correction method based on frequent items Active CN114004222B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111297120.9A CN114004222B (en) 2021-11-04 2021-11-04 Chinese word segmentation boundary correction method based on frequent items

Publications (2)

Publication Number Publication Date
CN114004222A true CN114004222A (en) 2022-02-01
CN114004222B CN114004222B (en) 2024-04-30

Family

ID=79927025

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111297120.9A Active CN114004222B (en) 2021-11-04 2021-11-04 Chinese word segmentation boundary correction method based on frequent items

Country Status (1)

Country Link
CN (1) CN114004222B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104077275A (en) * 2014-06-27 2014-10-01 北京奇虎科技有限公司 Method and device for performing word segmentation based on context
CN108536724A (en) * 2018-02-13 2018-09-14 西安理工大学 Main body recognition methods in a kind of metro design code based on the double-deck hash index
CN108829696A (en) * 2018-04-18 2018-11-16 西安理工大学 Towards knowledge mapping node method for auto constructing in metro design code
CN110046348A (en) * 2019-03-19 2019-07-23 西安理工大学 Main body recognition methods in a kind of rule-based and dictionary metro design code

Also Published As

Publication number Publication date
CN114004222B (en) 2024-04-30

Similar Documents

Publication Publication Date Title
CN111241294B (en) Relationship extraction method of graph convolution network based on dependency analysis and keywords
CN103049501B (en) Based on mutual information and the Chinese domain term recognition method of conditional random field models
CN109960804B (en) Method and device for generating topic text sentence vector
CN108255813B (en) Text matching method based on word frequency-inverse document and CRF
CN109960724A (en) A kind of text snippet method based on TF-IDF
CN106776538A (en) The information extracting method of enterprise's noncanonical format document
CN111159990B (en) Method and system for identifying general special words based on pattern expansion
CN108415953A (en) A kind of non-performing asset based on natural language processing technique manages knowledge management method
CN109783809B (en) Method for extracting aligned sentences from Laos-Chinese chapter level aligned corpus
CN108614897B (en) Content diversification searching method for natural language
CN113705237B (en) Relationship extraction method and device integrating relationship phrase knowledge and electronic equipment
CN113609857B (en) Legal named entity recognition method and system based on cascade model and data enhancement
CN108153851B (en) General forum subject post page information extraction method based on rules and semantics
CN112287240A (en) Case microblog evaluation object extraction method and device based on double-embedded multilayer convolutional neural network
CN105956158A (en) Automatic extraction method of network neologism on the basis of mass microblog texts and use information
CN114298048A (en) Named entity identification method and device
JP7487532B2 (en) Method and device for correcting image block recognition results, and storage medium
CN116127079B (en) Text classification method
CN102637202B (en) Method for automatically acquiring iterative conception attribute name and system
CN112417296A (en) Internet key data information acquisition and extraction method
CN114004222A (en) Chinese word segmentation boundary correction method based on frequent items
CN115358227A (en) Open domain relation joint extraction method and system based on phrase enhancement
CN110909546B (en) Text data processing method, device, equipment and medium
CN112632985A (en) Corpus processing method and device, storage medium and processor
Kasthuri et al. An improved rule based iterative affix stripping stemmer for Tamil language using K-mean clustering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant