CN114330307A

CN114330307A - Word segmentation method and system based on limited field

Info

Publication number: CN114330307A
Application number: CN202110483554.1A
Authority: CN
Inventors: 胡燕林; 闵宗茹; 李致; 纪天啸; 李佳; 张良; 黄亮; 党向磊; 井雅琪; 段运强; 熊颖; 杨云龙; 戴光耀
Original assignee: National Computer Network and Information Security Management Center
Current assignee: National Computer Network and Information Security Management Center
Priority date: 2021-04-30
Filing date: 2021-04-30
Publication date: 2022-04-12

Abstract

The invention discloses a word segmentation method and a word segmentation system based on the limited field, wherein the method comprises the following steps: preprocessing data of a limited field, identifying a named entity in the preprocessed data, and extracting an entity vocabulary in the data to obtain a named entity identification result; calculating characteristic information in the corpus of the limited field based on the preprocessed data, constructing a new word discovery model according to the obtained characteristic information, and identifying new words in the corpus by using the new word discovery model to obtain a new word data group; filtering the acquired new word data set by using the named entity recognition result and the common word dictionary, removing common words and entity words to acquire and confirm sensitive words, and establishing a sensitive word bank based on the confirmed sensitive words; and performing word segmentation by combining the sensitive word information acquired from the sensitive word bank and the named entity recognition result. The technical problem of poor word segmentation accuracy in the limited field in the prior art is solved.

Description

Word segmentation method and system based on limited field

Technical Field

The invention relates to the technical field of word segmentation, and in particular to a word segmentation method and a word segmentation system based on a limited field.

Background

In natural language processing, Chinese differs from many other languages: processing methods designed for Western languages cannot be applied to Chinese directly, because Chinese text must first go through a word segmentation step. Chinese word segmentation is the basis of other Chinese information processing, and segmentation accuracy is critical to downstream text work. Existing word segmentation algorithms fall into three major categories: methods based on character string matching, methods based on understanding, and methods based on statistics. When any of these methods is applied to the vocabulary of a new limited field, the segmentation result is very poor because the method has no domain knowledge.

Meanwhile, limited-field vocabulary is highly domain-specific and produces a large number of irregular words, abbreviations, words padded with redundant information, and domain proper nouns. The content is also often characterized by inverted word order, incomplete syntactic structure and the like. Traditional word segmentation models therefore perform poorly on short texts in such limited fields.

Disclosure of Invention

Therefore, the embodiment of the invention provides a word segmentation method and system based on a limited field, so as to at least partially solve the technical problem of poor word segmentation accuracy in the limited field in the prior art.

In order to achieve the above object, the embodiments of the present invention provide the following technical solutions:

a method of word segmentation based on a defined field, the method comprising:

preprocessing data of a limited field;

identifying the named entities in the preprocessed data, and extracting entity vocabularies in the data to obtain a named entity identification result;

calculating characteristic information in the corpus of the limited field based on the preprocessed data, constructing a new word discovery model according to the obtained characteristic information, and identifying new words in the corpus by using the new word discovery model to obtain a new word data group;

filtering the acquired new word data set by using the named entity recognition result and the common word dictionary, removing common words and entity words to acquire and confirm sensitive words, and establishing a sensitive word bank based on the confirmed sensitive words;

and performing word segmentation by combining the sensitive word information acquired from the sensitive word bank and the named entity recognition result.

Further, the preprocessing of the data of the limited field specifically includes at least one of:

filtering out short sentences in the initial data that are shorter than a preset number of characters;

filtering out meaningless phrases in the initial data, or phrases in which a word is repeated more than a preset number of times;

and constructing a list of special symbols that carry no special meaning, and filtering out any symbols from that list that appear in the initial data.

Further, the identifying the named entity in the preprocessed data, and extracting the entity vocabulary in the data to obtain the named entity identification result specifically includes:

identifying at least person-name entities, number entities and mailbox entities in the data obtained after preprocessing.

Further, the calculating the feature information in the corpus of the limited field based on the preprocessed data specifically includes:

calculating the N-gram feature, the degree-of-freedom feature and the solidification degree feature in the limited-field corpus.

Further, the calculating N-gram features in the limited domain corpus specifically includes:

dividing the text string into text segments with the length of N, wherein N is a positive integer;

and calculating the frequency of the text strings under different sliding windows by using the structure of the Trie tree, and using the frequency to represent the language model characteristics of the text in the field.
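The N-gram counting step described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the class names `TrieNode` and `NGramTrie`, the helper `build_trie` and the default `max_n = 4` are assumptions made for the example.

```python
from typing import Dict, Iterable


class TrieNode:
    """One node per character; `count` is how often the path to this node was seen."""
    __slots__ = ("children", "count")

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.count = 0


class NGramTrie:
    """Hypothetical helper: stores the frequency of every segment of length 1..max_n."""

    def __init__(self, max_n: int = 4) -> None:
        self.max_n = max_n
        self.root = TrieNode()

    def add_text(self, text: str) -> None:
        # Slide a window over the text; each start position inserts the
        # segments of length 1..max_n that begin there.
        for start in range(len(text)):
            node = self.root
            for ch in text[start:start + self.max_n]:
                node = node.children.setdefault(ch, TrieNode())
                node.count += 1

    def frequency(self, segment: str) -> int:
        # Walk the trie along the segment; a missing branch means frequency 0.
        node = self.root
        for ch in segment:
            node = node.children.get(ch)
            if node is None:
                return 0
        return node.count


def build_trie(sentences: Iterable[str], max_n: int = 4) -> NGramTrie:
    trie = NGramTrie(max_n)
    for sentence in sentences:
        trie.add_text(sentence)
    return trie
```

Because segments that share a prefix share a path in the trie, counting all window lengths up to N costs one short walk per start position.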

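Further, the degree-of-freedom feature in the limited-field corpus is calculated according to the following formula:

$$H(x) = \sum_i P(x_i)\, I(x_i) = -\sum_i P(x_i) \log P(x_i)$$

wherein H(x) represents the degree of freedom of the character string x, x_i denotes each possible adjacent word of the string x in the lexicon, I(x_i) denotes the self-information of x_i, and P(x_i) denotes the probability that x_i is adjacent to the string x.

Further, the solidification degree feature in the limited-field corpus is calculated according to the following formula:

$$I(x; y) = \log \frac{P(x, y)}{P(x)\,P(y)} = \log \frac{P(x \mid y)}{P(x)}$$

wherein I(x; y) represents the solidification degree of the character strings x and y (their word-forming capability), P(x, y) represents the probability of x and y appearing together, P(x) and P(y) represent the probabilities of x and y appearing independently, and P(x | y) represents the conditional probability of x appearing given that y appears.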
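As a hedged illustration of these two features, the sketch below computes an entropy-based degree of freedom and a mutual-information-based solidification degree. The function names, the natural logarithm, and the choice of taking the smaller of the left and right entropies are assumptions of the example; the source only states that left and right entropy and mutual information are used.

```python
import math
from collections import Counter
from typing import List


def entropy(neighbor_counts: Counter) -> float:
    """Shannon entropy (natural log) of a neighbor distribution."""
    total = sum(neighbor_counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in neighbor_counts.values())


def degree_of_freedom(segment: str, corpus: List[str]) -> float:
    """Assumed combination rule: min(left entropy, right entropy)."""
    left, right = Counter(), Counter()
    for text in corpus:
        start = text.find(segment)
        while start != -1:
            end = start + len(segment)
            if start > 0:
                left[text[start - 1]] += 1      # character just before the segment
            if end < len(text):
                right[text[end]] += 1           # character just after the segment
            start = text.find(segment, start + 1)
    return min(entropy(left), entropy(right))


def solidification(p_xy: float, p_x: float, p_y: float) -> float:
    """Pointwise mutual information: log(P(x, y) / (P(x) * P(y)))."""
    return math.log(p_xy / (p_x * p_y))
```

A segment whose neighbors vary widely gets a high degree of freedom, and a pair of substrings that co-occur far more often than chance gets a high solidification degree.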

the invention also provides a word segmentation system based on the limited field, which comprises:

the preprocessing module is used for preprocessing the data in the limited field;

the named entity recognition module is used for recognizing the named entity in the data obtained after preprocessing and extracting entity words in the data to obtain a named entity recognition result;

the new word discovery module is used for calculating characteristic information in the corpus of the limited field based on the preprocessed data, constructing a new word discovery model according to the obtained characteristic information, and identifying new words in the corpus by using the new word discovery model to obtain a new word data group;

the sensitive word screening module is used for filtering the acquired new word data set by utilizing the named entity recognition result and the common word dictionary, eliminating common words and entity words to acquire and confirm sensitive words and establishing a sensitive word bank based on the confirmed sensitive words;

and the word segmentation module is used for performing word segmentation by combining the sensitive word information acquired from the sensitive word bank and the named entity recognition result.

The invention provides a word segmentation method and system based on a limited field, which are characterized in that data of the limited field is preprocessed, named entities in the preprocessed data are identified, and entity words in the data are extracted to obtain a named entity identification result; calculating characteristic information in the corpus of the limited field based on the preprocessed data, constructing a new word discovery model according to the obtained characteristic information, and identifying new words in the corpus by using the new word discovery model to obtain a new word data group; filtering the acquired new word data set by using the named entity recognition result and the common word dictionary, removing common words and entity words to acquire and confirm sensitive words, and establishing a sensitive word bank based on the confirmed sensitive words; and performing word segmentation by combining the sensitive word information acquired from the sensitive word bank and the named entity recognition result.

In order to solve the problem of strong limitation of the vocabulary in the limited field, the method optimizes the method for segmenting the vocabulary in the limited field by stripping out the named entities and new words with strong limitation of the field and constructing a sensitive word bank; on the basis of the traditional word segmentation method, the domain characteristics are fully considered, optimization is performed from two aspects of new word discovery and sensitive word bank expansion, a word segmentation system in a specific domain is better constructed, and the technical problem that in the prior art, the word segmentation accuracy in the limited domain is poor is solved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It should be apparent that the drawings in the following description are merely exemplary, and that other implementation drawings can be derived from the drawings provided by those of ordinary skill in the art without any creative effort.

The structures, proportions, sizes and the like shown in this specification are used only to accompany the content disclosed in the specification so that those skilled in the art can understand and read it. They are not intended to limit the conditions under which the present invention can be implemented and therefore have no substantive technical significance. Any structural modification, change in proportion, or adjustment of size that does not affect the effects and objectives achievable by the present invention should still fall within the scope of the present invention.

FIG. 1 is a flow chart of one embodiment of the word segmentation method based on a limited field provided by the present invention;

FIG. 2 is a block diagram of one embodiment of the word segmentation system based on a limited field provided by the present invention.

Detailed Description

The present invention is described below by way of particular embodiments, and other advantages and features of the invention will become apparent to those skilled in the art from the following disclosure. The described embodiments are merely exemplary and are not intended to limit the invention to the particular embodiments disclosed. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.

Because the vocabulary of a limited field is often highly domain-specific, named entities and new words that are strongly tied to the field need to be separated out by technical means. First, the data of the limited field is preprocessed. Then, based on the preprocessed data, named entities are identified and the entity vocabulary is extracted; at the same time, the N-gram features, degree-of-freedom features and solidification degree features of the limited-field corpus are calculated, and a new word discovery model built on these three features identifies new words in the corpus. Next, the discovered new words are filtered with the named entity recognition result and the common word dictionary to remove common words and entity words, sensitive words are confirmed by manual review, and the confirmed sensitive words are added to the sensitive word bank. Finally, the sensitive word bank information and the named entity information are fed into the existing word segmentation system, which improves word segmentation accuracy in the limited field and optimizes the word segmentation model.

In a specific embodiment, the word segmentation method based on the limited field provided by the invention comprises the following steps:

S1: preprocess the data of the limited field. Preprocessing performs a preliminary filtering of the text, removing interfering text strings by keyword matching and regular-expression matching.

To ensure the filtering effect and eliminate as many interfering text strings as possible, preprocessing specifically comprises: filtering out short sentences in the initial data that are shorter than a preset number of characters; filtering out meaningless phrases, or phrases in which a word is repeated more than a preset number of times; and constructing a list of special symbols that carry no special meaning and filtering out any of those symbols that appear in the initial data. Depending on actual requirements, for example because different fields require different word precision, preprocessing may filter out only short sentences, only meaningless phrases, or only special symbols without special meaning.

In one usage scenario, pre-processing the domain-defined data includes:

filtering the mass data to remove sentences that are too short, for example sentences that are shorter than 4 characters and have no practical meaning;

selecting and filtering out, from the massive text, meaningless phrases or phrases in which a word is repeated more than four times; specifically, the repeated words in each sentence may be recorded, and if a word is repeated more than four times and the repeated word accounts for more than 70% of the character count of the whole sentence, that phrase or sentence is filtered out;

and constructing a list of special symbols that carry no special meaning, and removing any symbols from that list that appear in the limited-field data.
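The three filters in this usage scenario could look roughly like the Python sketch below. The thresholds 4 and 70% come from the text; the symbol list, the function names and the decision to count repetitions per character (rather than per word, since no segmenter is available yet at this stage) are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List

MIN_CHARS = 4                    # sentences shorter than this are dropped
MAX_REPEATS = 4                  # a unit repeated more than this many times ...
REPEAT_RATIO = 0.70              # ... and covering more than this share of the sentence
SPECIAL_SYMBOLS = "★☆◆■~^*#"     # hypothetical symbol list; a real one is domain-specific


def strip_special_symbols(sentence: str) -> str:
    return "".join(ch for ch in sentence if ch not in SPECIAL_SYMBOLS)


def is_repetitive(sentence: str) -> bool:
    # Assumption: repetition is measured on single characters.
    counts = Counter(sentence)
    if not counts:
        return False
    _, top = counts.most_common(1)[0]
    return top > MAX_REPEATS and top / len(sentence) > REPEAT_RATIO


def preprocess(sentences: Iterable[str]) -> List[str]:
    kept = []
    for raw in sentences:
        sentence = strip_special_symbols(raw).strip()
        if len(sentence) < MIN_CHARS:
            continue                 # too short to carry meaning
        if is_repetitive(sentence):
            continue                 # dominated by one repeated unit
        kept.append(sentence)
    return kept
```

Symbols are stripped before the length check so that a sentence made mostly of decoration is judged on its remaining content.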

S2: identify the named entities in the preprocessed data and extract the entity vocabulary in the data to obtain a named entity recognition result. In step S2, at least the person-name entities, number entities and mailbox entities in the preprocessed data are identified. That is, several kinds of entities in the text, such as names, numbers and mailboxes, are identified: names are recognized with a role-labeling-based automatic Chinese name recognition method, while numbers, dates, mailboxes and the like are recognized with rule-based methods.

In an actual use scenario, identifying and extracting the named entities may specifically proceed as follows. A regular-expression matching rule is constructed for number recognition, and number information in the text, including house numbers and QQ numbers, is identified according to the attributes of each kind of number, such as the number of digits, the writing rule (whether or not to add), and the keyword information that matches the context of the number. A regular-expression matching rule is constructed for mailbox recognition, and mailboxes are identified according to mailbox naming characteristics. A regular-expression matching rule is constructed for date and time recognition: the rule is built from the linguistic characteristics of time expressions (whether keywords such as year, month, day and hour are present) and from their writing characteristics (formats such as 2019-01-01 and 01/01/19), and the time entities in the text are identified. Named entities such as person names are identified with the role-labeling-based automatic Chinese name recognition method. Specifically, a conventional corpus is segmented with the jieba word segmentation tool, semantic role labeling is performed on the segmentation result, and on that basis part of the name data is labeled manually with labels such as: surname, first character of a two-character given name, last character of a two-character given name, single-character given name, name prefix and name suffix. A hidden Markov model is then trained on the labeled data to produce a name recognition model, and this model is applied to the preprocessed data to find the person-name entities in it.
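The rule-based part of this step can be illustrated with Python's `re` module. The patterns below are assumptions written for the example (the patent does not disclose its actual rules): a QQ number is taken to be 5 to 11 digits preceded by the keyword "QQ", and only the two date formats quoted in the text plus a year/month/day form are covered.

```python
import re
from typing import List, Tuple

# Hypothetical patterns, one per entity type.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
DATE_RE = re.compile(
    r"\d{4}-\d{1,2}-\d{1,2}"        # 2019-01-01
    r"|\d{1,2}/\d{1,2}/\d{2,4}"     # 01/01/19
    r"|\d{4}年\d{1,2}月\d{1,2}日"    # keyword-based form
)
# Context keyword "QQ" followed by an optional separator, then the digits.
QQ_RE = re.compile(r"(?:QQ|qq)\s*(?:号)?\s*[::]?\s*(\d{5,11})")

PATTERNS = [("email", EMAIL_RE), ("date", DATE_RE), ("qq", QQ_RE)]


def find_entities(text: str) -> List[Tuple[str, str, int, int]]:
    """Return (label, value, start, end) tuples sorted by position in the text."""
    found = []
    for label, pattern in PATTERNS:
        # Use the capturing group when the pattern has one, else the whole match.
        group = 1 if pattern.groups else 0
        for match in pattern.finditer(text):
            found.append((label, match.group(group), match.start(group), match.end(group)))
    return sorted(found, key=lambda entity: entity[2])
```

Returning character offsets alongside the values lets a later step align the entities with segmenter output.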

S3: calculate feature information in the limited-field corpus based on the preprocessed data, construct a new word discovery model from the obtained feature information, and use the new word discovery model to identify new words in the corpus, obtaining a new word data set. The purpose of step S3 is to identify new words in the text: by calculating the degree of freedom and the solidification degree of each text segment, text strings are classified as words or non-words, so that new words that form valid combinations are identified.

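Wherein, calculating the feature information in the limited-field corpus based on the preprocessed data specifically includes calculating the N-gram feature, the degree-of-freedom feature and the solidification degree feature in the limited-field corpus.

Calculating the N-gram feature in the limited-field corpus specifically comprises dividing the text string into text segments of length N, where N is a positive integer, and using a Trie structure to calculate the frequency of the text strings under different sliding windows, which represents the language-model feature of the text in the field.

The degree-of-freedom feature in the limited-field corpus is calculated according to the following formula:

$$H(x) = \sum_i P(x_i)\, I(x_i) = -\sum_i P(x_i) \log P(x_i)$$

wherein H(x) represents the degree of freedom of the character string x, x_i denotes each possible adjacent word of the string x in the lexicon, I(x_i) denotes the self-information of x_i, and P(x_i) denotes the probability that x_i is adjacent to the string x.

The solidification degree feature in the limited-field corpus is calculated according to the following formula:

$$I(x; y) = \log \frac{P(x, y)}{P(x)\,P(y)} = \log \frac{P(x \mid y)}{P(x)}$$

wherein I(x; y) represents the solidification degree of the character strings x and y (their word-forming capability), P(x, y) represents the probability of x and y appearing together, P(x) and P(y) represent the probabilities of x and y appearing independently, and P(x | y) represents the conditional probability of x appearing given that y appears.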

Specifically, step S3 includes:

S31: extract the N-gram features of the domain data. Using N-gram features (here N is 4), the text string is divided into text segments of length N, and a Trie structure is used to calculate the frequency of the text strings under different sliding windows; these frequencies represent the language-model features of the text in the field;

S32: extract the degree-of-freedom features of the domain data. The degree-of-freedom feature represents how likely a text segment is to stand on its own. Using the concept of information entropy from information theory, the left entropy and the right entropy of a text segment are constructed to measure the instability of the relationship between the segment and its left and right adjacent character strings. The degree of freedom is calculated as follows:

$$H(x) = \sum_i P(x_i)\, I(x_i) = -\sum_i P(x_i) \log P(x_i)$$

S33: extract the solidification degree features of the domain data. The solidification degree measures how tightly the parts of a character string are bound together and is calculated using the concept of mutual information. The solidification degree is calculated as follows, where x and y each represent a character string segment:

$$I(x; y) = \log \frac{P(x, y)}{P(x)\,P(y)} = \log \frac{P(x \mid y)}{P(x)}$$

According to the results calculated in steps S31 to S33, statistics are computed for each text segment, the segments are classified as words or non-words, and words with high independence are found.
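One way steps S31 to S33 might be combined into a word/non-word decision is sketched below. It is an assumption-laden illustration: the thresholds `MIN_FREQ`, `MIN_FREEDOM` and `MIN_SOLIDITY` are invented, a plain `Counter` stands in for the Trie to keep the example short, and the classification is a simple threshold rule, which the source does not specify.

```python
import math
from collections import Counter, defaultdict
from typing import List

MAX_N = 4            # N from step S31
MIN_FREQ = 2         # hypothetical thresholds, to be tuned per corpus
MIN_FREEDOM = 0.5
MIN_SOLIDITY = 1.0


def _entropy(counter: Counter) -> float:
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in counter.values())


def discover_new_words(corpus: List[str]) -> List[str]:
    freq: Counter = Counter()
    left = defaultdict(Counter)      # characters seen immediately before a segment
    right = defaultdict(Counter)     # characters seen immediately after a segment
    total_chars = 0
    for text in corpus:
        total_chars += len(text)
        for i in range(len(text)):
            for n in range(1, MAX_N + 1):
                if i + n > len(text):
                    break
                seg = text[i:i + n]
                freq[seg] += 1
                if i > 0:
                    left[seg][text[i - 1]] += 1
                if i + n < len(text):
                    right[seg][text[i + n]] += 1

    def prob(seg: str) -> float:
        return freq[seg] / total_chars

    words = []
    for seg, count in freq.items():
        if len(seg) < 2 or count < MIN_FREQ:
            continue
        # Degree of freedom: the weaker of left and right entropy.
        freedom = min(_entropy(left[seg]), _entropy(right[seg]))
        # Solidification degree: the weakest binary split of the segment.
        solidity = min(
            math.log(prob(seg) / (prob(seg[:k]) * prob(seg[k:])))
            for k in range(1, len(seg))
        )
        if freedom >= MIN_FREEDOM and solidity >= MIN_SOLIDITY:
            words.append(seg)
    return sorted(words, key=lambda w: -freq[w])
```

In this sketch a segment that only ever occurs at a sentence boundary has zero entropy on that side and is rejected; treating the boundary as an extra neighbor would be an equally defensible choice.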

S4: filter the acquired new word data set by using the named entity recognition result and the common word dictionary, removing common words and entity words to obtain and confirm sensitive words, and establish a sensitive word bank based on the confirmed sensitive words. In step S4, a sensitive word bank is constructed and the screened sensitive words are stored in it.

In sensitive word screening, part of the new-word data is biased, so some conventional words are usually mixed in with the new words. A common word dictionary is therefore collected first, and the common words among the identified new words are filtered out. Then, for example, the frequency with which each remaining new word appears in the whole corpus is counted, and the new words are sorted in descending order of frequency; a person with domain knowledge screens the sorted words for sensitive words, from high frequency to low. After the sensitive word bank is constructed, the identified sensitive words are entered into it.
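The automatic part of this screening (everything before the manual review) could be sketched as below. The function name `rank_candidates` and the tie-breaking rule are assumptions of the example; the manual confirmation step is deliberately left outside the code.

```python
from typing import Iterable, List, Set, Tuple


def rank_candidates(
    new_words: Iterable[str],
    common_words: Set[str],
    entity_words: Set[str],
    corpus: List[str],
) -> List[Tuple[str, int]]:
    """Drop common and entity words, then sort the rest by corpus frequency (descending)."""
    # dict.fromkeys removes duplicates while keeping the original order.
    candidates = [
        word for word in dict.fromkeys(new_words)
        if word not in common_words and word not in entity_words
    ]
    counts = {word: sum(text.count(word) for text in corpus) for word in candidates}
    # Ties are broken alphabetically so the review list is deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

A reviewer with domain knowledge would then walk this list from the top and add the confirmed entries to the sensitive word bank.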

S5: perform word segmentation by combining the sensitive word information obtained from the sensitive word bank with the named entity recognition result. For example, the field vocabulary is segmented with the jieba word segmentation tool, and the sensitive word bank is introduced during segmentation to reduce the segmentation error rate. The segmentation result is then recombined and re-split according to the identified named entities.
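The final recombination and re-splitting step can be illustrated without any segmenter dependency. In the sketch below, `tokens` stands for the output of whatever segmenter is used (jieba in the source, with the sensitive word bank loaded as a custom dictionary) and `entities` for character-offset spans from the entity recognizer. The function name and the assumption that entity spans do not overlap are choices made for the example.

```python
from typing import List, Tuple


def realign_with_entities(tokens: List[str], entities: List[Tuple[int, int]]) -> List[str]:
    """Re-cut a token sequence so that every entity span [start, end) is exactly one token.

    Assumes the spans are non-overlapping offsets into "".join(tokens).
    """
    text = "".join(tokens)
    # Start from the cut positions chosen by the segmenter.
    cuts = {0}
    position = 0
    for token in tokens:
        position += len(token)
        cuts.add(position)
    for start, end in entities:
        # Recombine: drop segmenter cuts that fall strictly inside the entity.
        cuts = {cut for cut in cuts if not (start < cut < end)}
        # Re-split: force a cut at both edges of the entity.
        cuts.update((start, end))
    ordered = sorted(cuts)
    return [text[a:b] for a, b in zip(ordered, ordered[1:])]
```

For example, if the segmenter broke an e-mail address into several pieces, passing its span merges the pieces back into a single token while leaving the other tokens untouched.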

In the above embodiment, in order to deal with the problem of strong vocabulary limitation in the limited field, the method provided by the present invention optimizes the method for segmenting words in the limited field by stripping out the named entities and new words with strong field limitation and constructing the sensitive lexicon; on the basis of the traditional word segmentation method, the domain characteristics are fully considered, optimization is performed from two aspects of new word discovery and sensitive word bank expansion, a word segmentation system in a specific domain is better constructed, and the technical problem that in the prior art, the word segmentation accuracy in the limited domain is poor is solved.

In addition to the above method, the present invention further provides a word segmentation system based on a limited field. As shown in FIG. 2, the system comprises:

a preprocessing module 100, configured to preprocess data of a limited field;

the named entity recognition module 200 is configured to recognize a named entity in the preprocessed data, and extract an entity vocabulary in the data to obtain a named entity recognition result;

the new word discovery module 300 is configured to calculate feature information in the corpus of the limited field based on the preprocessed data, construct a new word discovery model according to the obtained feature information, and identify new words in the corpus by using the new word discovery model to obtain a new word data group;

the sensitive word screening module 400 is configured to filter the obtained new word data set by using the named entity recognition result and the common word dictionary, exclude common words and entity words, obtain and confirm sensitive words, and establish a sensitive word bank 500 based on the confirmed sensitive words;

and the word segmentation module 600 is configured to perform word segmentation by combining the sensitive word information obtained from the sensitive word bank and the recognition result of the named entity.

First, the preprocessing module 100 preprocesses the data of the limited field. The preprocessed data is then processed by two modules at the same time: the named entity recognition module 200 identifies the named entities in the data and extracts the entity words, while the new word discovery module 300 calculates the N-gram features, degree-of-freedom features and solidification degree features in the limited-field corpus, constructs a new word discovery model from these three features, and identifies new words in the corpus. Next, the sensitive word screening module 400 filters the discovered new words using the named entity recognition result generated by the named entity recognition module 200 and the common word dictionary, removing common words and entity words; sensitive words are confirmed by manual review, and the confirmed sensitive words are added to the sensitive word bank 500. Finally, the sensitive word bank information and the named entity information are fed into the word segmentation module 600, which improves word segmentation accuracy in the limited field and optimizes the word segmentation model.
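The data flow between the modules can be summarized in a short Python sketch. The class `SegmentationPipeline` and its callable fields are hypothetical; they only mirror the module numbering of FIG. 2 and say nothing about how each module is implemented. The manual review inside module 400 is represented by whatever the `screen` callable returns.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class SegmentationPipeline:
    preprocess: Callable[[List[str]], List[str]]               # module 100
    recognize_entities: Callable[[List[str]], Set[str]]        # module 200
    discover_new_words: Callable[[List[str]], List[str]]       # module 300
    screen: Callable[[List[str], Set[str]], Set[str]]          # module 400
    segment: Callable[[str, Set[str], Set[str]], List[str]]    # module 600
    lexicon: Set[str] = field(default_factory=set)             # sensitive word bank 500

    def run(self, raw: List[str]) -> List[List[str]]:
        clean = self.preprocess(raw)
        # Modules 200 and 300 both read the same preprocessed data.
        entities = self.recognize_entities(clean)
        new_words = self.discover_new_words(clean)
        # Module 400 removes common and entity words; confirmed words enter the bank.
        self.lexicon |= self.screen(new_words, entities)
        # Module 600 segments with both the bank and the entity results.
        return [self.segment(sentence, self.lexicon, entities) for sentence in clean]
```

Keeping the word bank as state on the pipeline reflects that it accumulates across runs, whereas the entity results are recomputed for each batch.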

In the above specific embodiment, in order to deal with the problem of strong vocabulary limitation in the limited field, the system strips out the named entities and new words with strong field limitation, and optimizes the word segmentation method in the limited field by constructing the sensitive word bank; on the basis of the traditional word segmentation method, the domain characteristics are fully considered, optimization is performed from two aspects of new word discovery and sensitive word bank expansion, a word segmentation system in a specific domain is better constructed, and the technical problem that in the prior art, the word segmentation accuracy in the limited domain is poor is solved.

The above embodiments are only for illustrating the embodiments of the present invention and are not to be construed as limiting the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made on the basis of the embodiments of the present invention shall be included in the scope of the present invention.

Claims

1. A word segmentation method based on a defined field is characterized by comprising the following steps:

preprocessing data of a limited field;

identifying the named entity in the preprocessed data, and extracting entity words from the data to obtain a named entity identification result;

2. The word segmentation method according to claim 1, wherein the preprocessing of the data of the defined field includes at least one of:

3. The word segmentation method according to claim 2, wherein the recognizing the named entity in the preprocessed data and extracting the entity vocabulary in the data to obtain the recognition result of the named entity specifically comprises:

4. The word segmentation method according to claim 3, wherein the calculating feature information in the corpus of the limited field based on the preprocessed data specifically includes:

5. The word segmentation method according to claim 4, wherein the calculating of the N-gram feature in the domain-restricted corpus specifically includes:

6. The word segmentation method according to claim 4, wherein the degree-of-freedom feature in the limited-field corpus is calculated according to the following formula:

$$H(x) = \sum_i P(x_i)\, I(x_i) = -\sum_i P(x_i) \log P(x_i)$$

wherein H(x) represents the degree of freedom of the character string x, x_i denotes each possible adjacent word of the string x in the lexicon, I(x_i) denotes the self-information of x_i, and P(x_i) denotes the probability that x_i is adjacent to the string x.

7. The word segmentation method according to claim 4, wherein the solidification degree feature in the limited-field corpus is calculated according to the following formula:

$$I(x; y) = \log \frac{P(x, y)}{P(x)\,P(y)} = \log \frac{P(x \mid y)}{P(x)}$$

wherein I(x; y) represents the solidification degree of the character strings x and y (their word-forming capability), P(x, y) represents the probability of x and y appearing together, P(x) and P(y) represent the probabilities of x and y appearing independently, and P(x | y) represents the conditional probability of x appearing given that y appears.

8. A domain-restricted based word segmentation system, the system comprising: