CN108009156B - Chinese generalized text segmentation method based on partial supervised learning - Google Patents


Publication number
CN108009156B
CN108009156B (application CN201711444997.XA; published as CN108009156A)
Authority
CN
China
Prior art keywords: data, word segmentation, seg, classifier, category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711444997.XA
Other languages
Chinese (zh)
Other versions
CN108009156A (en)
Inventor
王亚强
何思佑
唐聃
舒红平
Current Assignee
Chengdu University of Information Technology
Original Assignee
Chengdu University of Information Technology
Priority date
Filing date
Publication date
Application filed by Chengdu University of Information Technology filed Critical Chengdu University of Information Technology
Priority to CN201711444997.XA priority Critical patent/CN108009156B/en
Publication of CN108009156A publication Critical patent/CN108009156A/en
Application granted granted Critical
Publication of CN108009156B publication Critical patent/CN108009156B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 Bayesian classification

Abstract

The invention belongs to the technical field of language processing and discloses a Chinese generalized text segmentation method based on partially supervised learning. Comparison experiments on five groups of difficult data sets readily show that short-text segmentation results are strongly influenced by the length of the context information: bigram context information best fits the characteristics of short-text word segmentation and effectively improves segmentation performance, and mixed bigram-and-trigram features best express the information of each 'blank', while adding more or fewer features degrades performance. The application of partially supervised learning to short-text word segmentation also demonstrates its excellent parameter-completion capability, so manual annotation work can be greatly reduced while better performance is obtained.

Description

Chinese generalized text segmentation method based on partial supervised learning
Technical Field
The invention belongs to the technical field of language processing, and particularly relates to a Chinese generalized text segmentation method based on partial supervised learning.
Background
In natural language processing, the most basic task is to segment a piece of text into the blocks that carry its most basic semantics. Words satisfy this requirement. In languages such as English, where a separator (the space) already stands between words, words can be extracted trivially; in languages without separators, such as Chinese, a dedicated word segmentation step is needed. Two conventional approaches exist. The first is matching-based: using a manually constructed dictionary, the text is compared character by character to check whether the current candidate is a dictionary word; comparison stops once the longest candidate that can form a word is found, that candidate is split off, and the next round of matching begins. Depending on the matching direction this is called forward or backward maximum matching, but the underlying method is the same. A related technique is full-segmentation path selection, which also relies on a manually built dictionary: all possible segmentation paths are enumerated by dictionary matching, and an optimal path is finally chosen by weight. The greatest drawback of these methods is their heavy dependence on the dictionary, which must be updated with a large amount of manual work; segmentation quality on special genres (for example, generalized text) also suffers because dictionary segmentation granularity varies. The second approach is statistics-based and has developed rapidly as computing power has grown. A typical scheme labels each character with one of {B, I, E, S}, denoting the beginning of a word, the middle of a word, the end of a word, and a single-character word, respectively.
A hidden Markov model or conditional random field is then trained on these labels, and the trained model segments new, unlabelled sentences. The greatest drawback of statistical methods is their dependence on a large corpus, which is constructed manually and is therefore very time-consuming and labour-intensive.
In summary, the problems of the prior art are as follows: reliance on large-scale manually annotated data sets, consuming great amounts of manpower and time; a low word recognition rate; and an inability to cut text accurately into words of appropriate granularity.
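The {B, I, E, S} character-labelling scheme used by the statistical methods discussed above can be illustrated with a short sketch (Python; the `bies_tags` helper is illustrative and not part of the invention, which uses a different, blank-classification formulation):

```python
def bies_tags(words):
    """Convert a segmented sentence (a list of words) into per-character
    {B, I, E, S} labels: Begin, Inside, End of a word, or Single-character word."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append((w, "S"))        # single-character word
        else:
            tags.append((w[0], "B"))     # first character of a multi-char word
            for ch in w[1:-1]:
                tags.append((ch, "I"))   # interior characters
            tags.append((w[-1], "E"))    # last character
    return tags

# The segmented phrase "自然 / 语言 / 处理" (natural / language / processing):
print(bies_tags(["自然", "语言", "处理"]))
```

A sequence model such as an HMM or CRF is then trained to predict these labels for unlabelled text.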
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a Chinese generalized text segmentation method based on partially supervised learning which, for the same segmentation quality, saves 10%-50% of the manual annotation data required by traditional methods.
The Chinese generalized text segmentation method based on partially supervised learning extracts low-noise context feature information according to the main characteristics of short text and performs word segmentation in combination with a partially supervised learning method;
the features of the short text include: bigram context information, which fits short-text word segmentation;
mixed bigram-and-trigram context features, which express the information of each 'blank';
the partially supervised learning is used to fill in missing parameters in short-text word segmentation.
Further, the Chinese generalized text segmentation method based on partially supervised learning specifically comprises the following steps:
step one, feature selection: the window size is set to 1 to 3, and '*' and '&' are added as start and end symbols, giving, for example, '***natural language processing&&&'; the preceding context of size one of a blank is extracted and recorded as o_p1_<preceding character>, and its following context of size two as o_n2_<following two characters>;
step two, a small labelled 'word segmentation' category data set P and a large unlabelled mixed data set M are obtained, where M contains data of both the 'word segmentation' and 'no word segmentation' categories, and partially supervised learning is introduced.
Further, the naive Bayes classification method comprises: a blank set B = {b1, ..., bl} is given, where each 'blank' carries preceding- and following-context feature information fi drawn from the full feature set F = {f1, f2, ..., fn} extracted from the training set, and for the binary case a category set C = {c1, c2} is defined, where c1 denotes the 'word segmentation' category and c2 denotes the 'no word segmentation' category; to obtain the most probable classification of a certain 'blank', the posterior probability is computed according to Bayes' theorem:
P(cj | b) = P(b | cj) P(cj) / P(b)    (1)
Under the conditional-independence assumption, equation (1) transforms into:
P(cj | b) ∝ P(cj) ∏i P(fi | cj)    (2)
The Laplace smoothing formula is chosen, transforming this into:
P(f | c) = (count(f, c) + 1) / (Σf' count(f', c) + |V|)    (3)
where count(f, c) is the number of times feature f appears in the 'blanks' b of category c, the sum in the denominator is the total number of feature occurrences in category c, and |V| denotes the total number of distinct features.
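A minimal sketch of a naive Bayes classifier with Laplace smoothing as described above (Python; the class name and interface are illustrative, not from the patent, and each 'blank' is assumed to be represented as a list of context-feature strings):

```python
import math
from collections import defaultdict

class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing:
    P(f|c) = (count(f, c) + 1) / (total count in c + |V|)."""

    def fit(self, blanks, labels):
        self.classes = set(labels)
        self.prior = defaultdict(float)                   # log P(c)
        self.count = defaultdict(lambda: defaultdict(int))  # count(f, c)
        self.total = defaultdict(int)                     # total feature count in c
        vocab = set()
        for feats, c in zip(blanks, labels):
            self.prior[c] += 1
            for f in feats:
                self.count[c][f] += 1
                self.total[c] += 1
                vocab.add(f)
        self.V = len(vocab)
        n = len(labels)
        for c in self.prior:
            self.prior[c] = math.log(self.prior[c] / n)
        return self

    def predict(self, feats):
        # Most probable class by log posterior, per equations (1)-(3).
        def score(c):
            s = self.prior[c]
            for f in feats:
                s += math.log((self.count[c][f] + 1) / (self.total[c] + self.V))
            return s
        return max(self.classes, key=score)
```

The log-domain sum avoids underflow when many context features are multiplied.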
Further, the partially supervised learning method comprises:
the blank between every two characters is regarded as a single document, and all documents are defined in advance as two categories: 'word segmentation' and 'no word segmentation';
only a small part of the 'word segmentation' category data is labelled; the naive Bayes likelihood estimation is then combined with the EM algorithm and iterated continuously until an optimal classifier is finally trained.
Further, the EM algorithm specifically comprises:
all data in P are first assigned to category c1, and the labels in P never change during subsequent iterations; all 'blanks' in the M data set are then also assigned to category c1, and the labels of these data change continuously during iteration. Next, naive Bayes is used to train an initial classifier (initial-classifier), which classifies the data in M: data with result c1 are added to the 'word segmentation' class data set seg, and data with result c2 are added to the 'no word segmentation' class data set non-seg. The EM iteration then begins: a new classifier is rebuilt with the naive Bayes algorithm from the P, seg and non-seg data sets and used to reclassify seg and non-seg, until convergence yields the final classifier.
For the Chinese short-text word segmentation task, the invention mainly selects the following algorithms: naive Bayes and the expectation-maximization algorithm, which convert the Chinese word segmentation task into a text classification task. Long practice has shown that naive Bayes performs excellently in text classification, and partially supervised learning is a constrained optimization problem that the EM algorithm fits exactly. Chinese computer-science and medical paper titles are used as the experimental corpus for the short-text segmentation task because a) Chinese paper titles match the short-text characteristics and use words precisely and formally, reducing data noise, and b) the short texts can later be used for transfer learning.
Through comparison experiments on five 'conventional' data sets and one 'difficult' data set, the performance of the method and a feature rule for generalized text are demonstrated. With the same proportion of labelled data (10%-50%; for example, of 10,000 training items with a 10% labelling ratio, only 1,000 items need manual labels and the remaining 9,000 need no manual effort at all), the precision of the method is 17%-27% higher on average than the traditional method, and the performance balance measured by the F-score improves by 5%-8%. Most importantly, using only 50% labelled data achieves the performance obtained with 100% labelled data, showing that great amounts of resources and time can be saved in data set construction. The feature-extraction rule for generalized text is summarized as follows: bigram context information best fits the characteristics of short-text segmentation and effectively improves segmentation performance, and mixed bigram-and-trigram features best express each 'blank'. Experimentally, at the same labelling proportion the mixed bigram-and-trigram context features improve precision by 4%-8% over single unigram or trigram features, and single bigram context information alone outperforms single unigram or trigram context information by about 8%.
Drawings
FIG. 1 is a flow chart of a Chinese generalized text segmentation method based on partial supervised learning according to an embodiment of the present invention.
FIG. 2 is a precision graph for unigram context features provided by an embodiment of the present invention.
FIG. 3 is an F-score graph for unigram context features provided by an embodiment of the present invention.
FIG. 4 is a precision graph for bigram context features provided by an embodiment of the present invention.
FIG. 5 is an F-score graph for bigram context features provided by an embodiment of the present invention.
FIG. 6 is a precision graph for trigram context features provided by an embodiment of the present invention.
FIG. 7 is an F-score graph for trigram context features provided by an embodiment of the present invention.
FIG. 8 is a precision graph for mixed bigram-and-trigram context features provided by an embodiment of the present invention.
FIG. 9 is an F-score graph for mixed bigram-and-trigram context features provided by an embodiment of the present invention.
FIG. 10 is a precision graph for mixed unigram, bigram and trigram context features provided by an embodiment of the present invention.
FIG. 11 is an F-score graph for mixed unigram, bigram and trigram context features provided by an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The following detailed description of the principles of the invention is provided in connection with the accompanying drawings.
The Chinese generalized text segmentation method based on partially supervised learning provided by the embodiment of the invention treats the Chinese short-text word segmentation task as a binary or ternary classification problem, extracting low-noise context feature information according to the main features of the short text and performing word segmentation in combination with the partially supervised learning method.
The main features of the short text include: bigram context information, which fits short-text word segmentation and thereby improves segmentation performance;
mixed bigram-and-trigram context features, which express the information of each 'blank';
the partially supervised learning is used to fill in missing parameters in short-text word segmentation.
As shown in fig. 1, the method for segmenting a generalized Chinese text based on partial supervised learning according to the embodiment of the present invention includes the following steps:
s101: selecting characteristics, setting the window size to be 1 to 3, and adding a sum as a start character and an end character; extracting the front part of a window with the size of a space between 'nature' being one, wherein the space is represented by a front postamble of a unary window of the space and corresponds to a characteristic 'word' in a text classification, each front postamble of the unary is regarded as a 'word', and the likelihood estimation is carried out by directly applying naive Bayes on the assumption that the conditions between the front postamble and the unary are independent;
s102: training an initial classifier by using a small amount of manually labeled 'word segmentation' category data sets P and an unlabeled large amount of mixed data sets M, wherein the M comprises all data of 'word segmentation' and 'word segmentation', and performing an EM algorithm iterative process by using the initial classifier and the mixed data sets M;
s103: constructing an initial classifier: finding well-defined nonsseg data to further distinguish seg and nonsseg categories; the SEM process in partially supervised learning is applied directly on top of the segmentation.
In a preferred embodiment of the invention: in step S101, feature selection is performed, and first, the length of an empty context feature is: windows are not uniform in size; here the window size is set to 1 to 3, adding a sum of & as start and end symbols for the extracted context length to be the same: "+" natural language processing & & & & "; extracting the preamble of "between nature" empty window size one "denoted as o _ p1 self, the postamble of two similarly sized as o _ n2 self, this" empty "is similarly denoted by its unary window preamble as o _ p1 self n1 self p2 self n2 self, plus its ternary preamble; for the feature "word" in the corresponding text classification, each element postamble is regarded as a "word", and the likelihood estimation is directly carried out by applying naive Bayes on the assumption that the conditions between the postamble and the postamble are independent.
The application principle of the present invention is further described below with reference to specific embodiments:
the Chinese generalized text segmentation method based on partial supervised learning provided by the embodiment of the invention specifically comprises the following steps:
1) an IEM procedure.
Firstly, a small labelled 'word segmentation' category data set P and a large unlabelled mixed data set M are obtained, where M contains data of both the 'word segmentation' and 'no word segmentation' categories.
All data in P are first assigned to category c1, and the labels in P never change during subsequent iterations; all 'blanks' in the M data set are then also assigned to category c1, and the labels of these data change continuously during iteration.
Then naive Bayes is used to train an initial classifier (initial-classifier), which classifies the data in M: data with result c1 are added to seg (the 'word segmentation' class data set) and data with result c2 are added to non-seg (the 'no word segmentation' class data set).
And then entering an EM algorithm iteration process, reestablishing a new classifier by using a naive Bayes algorithm through a P, seg and non-seg data set, and classifying the seg and the non-seg until convergence to obtain a final classifier. Its pseudo code is as follows:
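The pseudo code itself is not reproduced in this text; the following is a hedged sketch of the IEM iteration as described (Python; the `train`/`classify` interface is a hypothetical stand-in for the naive Bayes training and classification steps):

```python
def iem(P, M, train, classify, max_iter=50):
    """I-EM: iterate a classifier over labelled set P ('seg') and unlabelled
    set M.  `train(pos, neg)` returns a classifier; `classify(clf, x)`
    returns 'seg' or 'non-seg'.  Labels in P are fixed; labels in M change."""
    # Initially assign every blank in M to c1 and build initial-classifier.
    clf = train(P + M, [])
    prev = None
    for _ in range(max_iter):
        labels = [classify(clf, x) for x in M]          # E-step: relabel M
        if labels == prev:                               # converged
            break
        prev = labels
        seg = [x for x, c in zip(M, labels) if c == "seg"]
        non_seg = [x for x, c in zip(M, labels) if c == "non-seg"]
        clf = train(P + seg, non_seg)                    # M-step: rebuild classifier
    return clf
```

In the patent's setting, `train` would rebuild the naive Bayes model from the P, seg and non-seg data sets at each iteration.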
the above method is summarized as an IEM method, however, in the practical application process, it is found that the above method can only be applied to some cases where the noise is not large, for example, the data set only includes two classes, although it is very suitable for the case of this experiment of the present invention, in order to solve the problem of three or more classes, a partial supervised learning method for the case where the noise is large must be introduced.
2) And (5) SEM process.
In the multi-class case, even though partially supervised learning extracts only the desired positive class, the IEM method does not perform particularly well, because class diversity introduces considerable noise into the unlabelled data set M: it is unknown which items in M are genuinely non-seg. An improved method is therefore needed that finds as many reliably identifiable non-seg items as possible to help further distinguish the seg and non-seg categories. The SEM process is set out clearly by Bing Liu; it is applied here directly to word segmentation, with pseudo code as follows:
N in the above code is the set of most likely non-seg data selected by a threshold at initialization, and SPY is the 'spy' data extracted from the P data set.
The effect of the present invention will be described in detail with reference to the experiments.
1) Experimental data evaluation method.
(1) Characteristics of experimental data
The experimental data are the title texts of papers published in several computer-science and medical journals in recent years. The data have two characteristics: a) the texts are short and concise, matching short-text characteristics; b) the texts contain little lexical ambiguity. The large span between computer and medical terminology can serve for later cross-validation and is not discussed in this experiment.
(2) Feature selection
The main idea of the experiment is to regard the blank between every two Chinese characters as an independent document, and its preceding and following information as relatively independent features.
First, the context of a blank varies in length, i.e. the windows are of different sizes. To fit short-text features, the window size is set here to 1 to 3. For example, for the short text 'natural language processing', '*' and '&' are added as start and end characters so that every extracted context has the same length: '***natural language processing&&&'. The preceding context of size one of a blank is recorded as o_p1_<preceding character> and its following context of size two similarly as o_n2_<following two characters>; a blank may likewise be represented jointly by its unigram, bigram and trigram window contexts. Corresponding to the feature 'word' in text classification, each context item can be regarded as a 'word', and naive Bayes likelihood estimation can be applied directly under the assumption that the context items are conditionally independent.
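The window-feature scheme just described can be sketched as follows (Python; the padding characters '*' and '&' and the o_p/o_n naming follow the text, while the function itself is an illustrative reconstruction):

```python
def blank_features(text, i, max_n=3, pad_l="*", pad_r="&"):
    """Context features for the blank between text[i-1] and text[i]:
    the n-gram immediately before (o_p{n}_...) and after (o_n{n}_...)
    the gap, for n = 1..max_n, padding with '*' / '&' so that every
    blank, including those near the edges, sees a full window."""
    s = pad_l * max_n + text + pad_r * max_n
    j = i + max_n                      # position of the blank in the padded string
    feats = []
    for n in range(1, max_n + 1):
        feats.append("o_p%d_%s" % (n, s[j - n:j]))   # preceding n-gram
        feats.append("o_n%d_%s" % (n, s[j:j + n]))   # following n-gram
    return feats

# Blank between 自然 and 语言 in 自然语言处理 ('natural language processing'):
print(blank_features("自然语言处理", 2))
```

Restricting the loop to n = 2 and 3 yields the mixed bigram-and-trigram feature set that the experiments find best.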
(3) Evaluation of performance
The experiment adopts the commonly used F-score measure, defined as F = 2pr / (p + r), where p denotes precision and r denotes recall. Precision describes the probability that a segmentation decision is correct, and recall describes how completely the test data are recovered; p and r can be computed from a two-dimensional confusion matrix as p = TP / (TP + FP) and r = TP / (TP + FN).
Table 1. p-r confusion matrix
In general, p and r trade off against each other, and the F-score is introduced precisely to find a point of balance between them.
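The p, r and F computation from the confusion counts can be sketched as follows (Python; F is taken to be the usual F1, the harmonic mean of p and r):

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from confusion counts:
    p = tp/(tp+fp), r = tp/(tp+fn), F = 2pr/(p+r)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# e.g. 80 correctly predicted segmentation points, 20 false positives,
# 10 missed points: precision 0.8, recall 8/9, F1 about 0.84
```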
2. Naive Bayes classification
(1) Naive Bayes classification
Naive Bayes is a classification method that applies Bayes' theorem under a feature-independence assumption and is the most widely applied Bayesian classifier. In the present experiment, a blank set B = {b1, ..., bl} is assumed, where each 'blank' carries preceding- and following-context feature information fi drawn from the full feature set F = {f1, f2, ..., fn} extracted from the training set; because the task is binary, a category set C = {c1, c2} is defined in advance, where c1 denotes the 'word segmentation' category and c2 denotes the 'no word segmentation' category. To obtain the most likely classification of a certain 'blank', the posterior probability must be computed; according to Bayes' theorem,
P(cj | b) = P(b | cj) P(cj) / P(b)    (1)
Under the conditional-independence assumption, equation (1) is modified to
P(cj | b) ∝ P(cj) ∏i P(fi | cj)    (2)
Because features never counted in training may be met during classification, smoothing is needed; the Laplace smoothing formula is chosen, transforming this into
P(f | c) = (count(f, c) + 1) / (Σf' count(f', c) + |V|)    (3)
where count(f, c) is the number of times feature f appears in the 'blanks' b of category c, the denominator sums the occurrences of all features in category c, and |V| denotes the total number of distinct features.
3. Expectation maximization algorithm
The expectation-maximization (EM) algorithm is mainly used for maximum likelihood or maximum a posteriori estimation of probabilistic models containing latent variables. Unlike other algorithms, it is less a concrete algorithm than an algorithmic idea: given incomplete data, missing data are filled in through iterative convergence. Much practice has shown that the EM algorithm performs very well in iteration, although the final result may be a local rather than a global optimum owing to parameter initialization.
The EM algorithm comprises two steps: a) the Expectation step, which estimates the incomplete parameters from the current probability distribution; b) the Maximization step, which re-estimates the distribution parameters to maximize the likelihood of the data given the expectations found in the first step, yielding the expected estimates of the unknown variables. EM iteration is ideal for this experiment because the partially supervised learning used here is precisely a learning method built on such incomplete data, and the EM algorithm effectively helps complete the missing unlabelled data.
4. Partial supervised learning method
The partially supervised learning method was proposed by Bing Liu and other scholars as an improvement over fully supervised learning; its goal is to achieve the same or even better learning performance while reducing manually labelled data. Put another way, it is learning from labelled and unlabelled data, or learning from positive and unlabelled data. In the present experiment, the blank between every two characters is regarded as a single document, and all documents are assigned in advance to two classes: 'word segmentation' and 'no word segmentation'. A traditional learning method must label samples of both categories, perform statistical learning, and finally train the required classifier, which is time-consuming work.
With a partially supervised learning method, however, only a small part of the 'word segmentation' category data need be labelled; the naive Bayes likelihood estimation described above is then combined with the EM algorithm and iterated continuously until the desired classifier is finally trained.
5. Results of the experiment
1) And the results are shown
Three methods were compared in this experiment: naive Bayes, IEM and SEM, in terms of the precision and F-score of the final results under the same proportion of labelled data. The training and test data are the computer-science and medical paper titles described above: 10,000 training items and 1,000 test items, with labelling proportions of 10%-50%. The data in the figures give the precision and F-score of the three methods; the different line graphs compare segmentation performance for context features of different window sizes and for mixed contexts, covering unigram, bigram, trigram, and mixed bigram-and-trigram context features.
2) Analysis of the results
From the perspective of labelled-data scale: with the same proportion of labelled data and two classes, the SEM and IEM algorithms perform equivalently, and both far exceed direct naive Bayes segmentation in precision and F-score. This conclusion is evident in all five experiments, and owing to the convergence property of the EM algorithm itself, the SPY extraction ratio, whether 10% or 20%, has no significant effect on the results.
From the perspective of extracted context length: with only unigram or only bigram context information, segmentation precision is high but the F-score is less satisfactory, indirectly revealing weak recall. Extracting trigram context alone is even less satisfactory than the former two; analysis attributes this to the particularity of the data set: paper titles are typical short texts, formal, concise and precise, so words of more than two characters are rare and purely trigram features therefore perform poorly.
Comparing the mixed bigram-and-trigram features with the mixed unigram, bigram and trigram features, the former perform best across the 5 groups of experiments in both precision and F-score; the reason combining all features performs worse is that surplus features introduce unnecessary data noise that interferes with the classifier's category judgment.
Finally, SEM and IEM perform almost identically across the data sets, but in one experiment on a 'difficult' data set, i.e. where M no longer contains only two classes but several, more noise arises. As Table 2 shows, on the 'difficult' data set the SEM process is on average 0.5-1 percentage point more precise than IEM.
TABLE 2 Multi-class Performance comparison
Through comparison experiments on five groups of difficult data sets, the invention readily finds that short-text segmentation results are influenced by the length of the context information: bigram context information best fits the characteristics of short-text word segmentation and can effectively improve segmentation performance, and mixed bigram-and-trigram features best express the information of each 'blank', while adding more or fewer features degrades performance. Secondly, the application of partially supervised learning to short-text word segmentation demonstrates its excellent parameter-completion capability, so manual annotation work can be greatly reduced while better performance is obtained.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (1)

1. A Chinese generalized text segmentation method based on partial supervised learning, characterized in that the method extracts low-noise context feature information according to the main features of short texts and performs word segmentation in combination with a partial supervised learning method; the features of the short text include: binary context information that fits short-text word segmentation, and binary and ternary mixed preceding and following contexts that describe each "blank"; partial supervised learning is used to fill in parameters for short-text word segmentation;
firstly, performing feature selection: the window size is set to 1 to 3, and "$" and "&" are added as start and end symbols; for the "blank" between "自" and "然", the preceding context of window size one is extracted, and the "blank" is represented by the preceding and following contexts of its unary window; each unary-window context is regarded as a "word", corresponding to a feature "word" in text classification, and, assuming the unary-window contexts are conditionally independent, likelihood estimation is performed with naive Bayes;
secondly, training an initial classifier with a small manually labeled "word segmentation" category data set P and a large unlabeled mixed data set M, where M contains data of both the "word segmentation" and "no word segmentation" categories, and performing the EM algorithm iteration with the initial classifier and the mixed data set M;
thirdly, constructing an initial classifier: the data are further distinguished into the seg and non-seg category data sets, wherein seg denotes the "word segmentation" category data set and non-seg denotes the "no word segmentation" category data set; the SEM process in partial supervised learning is then applied directly to perform word segmentation;
wherein performing feature selection in the first step comprises: the contexts of different "blank" features vary in length, i.e. the window sizes are not uniform; the window size is set to 1 to 3, and "$" and "&" are added as start and end symbols so that the extracted context lengths are the same: "$$$自然语言处理&&&"; for the "blank" between "自" and "然", the preceding context of window size one is extracted as o_p1_自 and the following context of size two as o_n2_然语, so that this "blank" is represented by its unary and binary window contexts o_p1_自, o_n1_然, o_p2_$自, o_n2_然语, and similarly for the ternary contexts;
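As an illustration of this feature-extraction step, a minimal Python sketch is given below. The padding symbols, the feature-name pattern (o_p1_, o_n2_, ...) and the function name follow the claim text but are assumptions, not the patented implementation itself.

```python
# Hypothetical sketch of the window-based context features for each "blank".
# The sentence is padded with '$' (start) and '&' (end) so that every blank
# sees preceding/following contexts of equal length for window sizes 1 to 3.

def extract_blank_features(sentence, max_window=3):
    padded = "$" * max_window + sentence + "&" * max_window
    features = []  # one feature list per blank between adjacent characters
    for i in range(len(sentence) - 1):
        pos = i + max_window + 1  # padded index of the character after the blank
        feats = []
        for w in (1, 2, 3):
            feats.append("o_p%d_%s" % (w, padded[pos - w:pos]))  # preceding w-gram
            feats.append("o_n%d_%s" % (w, padded[pos:pos + w]))  # following w-gram
        features.append(feats)
    return features

feats = extract_blank_features("自然语言处理")
```

For the blank between "自" and "然" this yields o_p1_自, o_n1_然, o_p2_$自, o_n2_然语 and the corresponding ternary contexts, matching the example in the claim.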
the naive Bayes classification method comprises: a blank set $B=\{b_1,\dots,b_l\}$, each blank having preceding and following context feature information represented by $f_n$, where the set of all features extracted from the training set is $F=\{f_1,f_2,\dots,f_n\}$; a class set $C=\{c_1,c_2\}$ is defined for the two classes, where $c_1$ denotes the "word segmentation" category and correspondingly $c_2$ denotes the "no word segmentation" category; to obtain the classification result of a certain blank, the posterior probability is calculated according to Bayes' theorem:

$$P(c_j\mid b_i)=\frac{P(c_j)\,P(b_i\mid c_j)}{P(b_i)} \tag{1}$$

under the conditional independence assumption, equation (1) becomes:

$$P(c_j\mid b_i)\propto P(c_j)\prod_{f\in b_i}P(f\mid c_j) \tag{2}$$

choosing Laplace smoothing, the estimate becomes:

$$P(f\mid c_j)=\frac{\mathrm{count}(f,c_j)+1}{\sum_{f'\in V}\mathrm{count}(f',c_j)+|V|} \tag{3}$$

where $\mathrm{count}(f,c_j)$ represents the number of occurrences of feature $f$ in the "blanks" of category $c_j$, and $|V|$ in the denominator represents the total number of features;
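The classifier described by equations (1) to (3) can be sketched as follows; this is a minimal reconstruction from the claim, with illustrative function names, not the patented code.

```python
import math
from collections import Counter, defaultdict

# Minimal multinomial naive Bayes with Laplace smoothing for the two "blank"
# classes: a blank is scored by log P(c) plus the sum of log P(f|c), with
# P(f|c) = (count(f,c) + 1) / (total feature count in c + |V|).

def train_nb(labeled_blanks):
    """labeled_blanks: list of (feature_list, class_label) pairs."""
    class_counts = Counter()
    feat_counts = defaultdict(Counter)  # feat_counts[c][f] = count(f, c)
    vocab = set()
    for feats, c in labeled_blanks:
        class_counts[c] += 1
        for f in feats:
            feat_counts[c][f] += 1
            vocab.add(f)
    return class_counts, feat_counts, vocab

def classify(feats, class_counts, feat_counts, vocab):
    total = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c in class_counts:
        n_c = sum(feat_counts[c].values())
        score = math.log(class_counts[c] / total)  # log prior
        for f in feats:
            score += math.log((feat_counts[c][f] + 1) / (n_c + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best
```

Unseen features still receive the smoothed probability 1/(n_c + |V|), so a blank with novel context never gets zero probability in either class.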
the second step further comprises an IEM process: a small labeled "word segmentation" category data set P and a large unlabeled mixed data set M are obtained, where M contains data of both the "word segmentation" and "no word segmentation" categories; all data in P are assigned the category c1, and these labels never change in P during the subsequent iterations; all "blanks" in the M data set are then assigned the category c2, and the categories of these data change continuously during the iteration process; next, an initial classifier initial-classifier is trained with naive Bayes, and when this classifier classifies the data in M, data whose result is c1 are added to seg, otherwise they are added to non-seg with the result c2; the EM algorithm then iteratively rebuilds a new classifier from the P, seg and non-seg data sets with the naive Bayes algorithm and reclassifies seg and non-seg until convergence, yielding the final classifier;
wherein the EM algorithm comprises: all data in P are first assigned the category c1, and these labels never change in P during the subsequent iterations; all "blanks" in the M data set are then assigned the category c2, and the categories of these data change continuously during the iteration process; next, an initial classifier initial-classifier is trained with naive Bayes, and when this classifier classifies the data in M, data whose result is c1 are added to the "word segmentation" category data set seg, otherwise they are added to the "no word segmentation" category data set non-seg with the result c2; the EM algorithm iteration then begins, rebuilding a new classifier from the P, seg and non-seg data sets with the naive Bayes algorithm and reclassifying seg and non-seg until convergence, yielding the final classifier;
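The I-EM loop described above can be sketched as follows: P stays fixed in the "seg" class, every blank in M starts in the other class, and a naive Bayes classifier is retrained and reapplied to M until the labels stop changing. The internal classifier and the convergence test (identical labels on two passes) are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict

def _train(data):
    priors, fc = Counter(), defaultdict(Counter)
    for feats, c in data:
        priors[c] += 1
        for f in feats:
            fc[c][f] += 1
    vocab = {f for feats, _ in data for f in feats}
    return priors, fc, vocab

def _classify(feats, priors, fc, vocab):
    total = sum(priors.values())
    def score(c):
        n_c = sum(fc[c].values())
        s = math.log(priors[c] / total)
        for f in feats:
            s += math.log((fc[c][f] + 1) / (n_c + len(vocab)))
        return s
    return max(priors, key=score)

def iem(P, M, max_iter=50):
    """P: feature lists known to be 'seg'; M: unlabeled feature lists."""
    labels = ["non-seg"] * len(M)          # all of M starts as 'no word segmentation'
    for _ in range(max_iter):
        data = [(f, "seg") for f in P] + list(zip(M, labels))
        model = _train(data)               # rebuild the naive Bayes classifier
        new_labels = [_classify(f, *model) for f in M]
        if new_labels == labels:           # convergence: labels unchanged
            return model, labels
        labels = new_labels
    return model, labels
```

Because P never changes label, the positive class cannot be emptied out by the iteration, while blanks in M migrate between seg and non-seg until the classifier stabilizes.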
wherein the partial supervised learning method comprises: the blank between every two characters is regarded as a single document, and two categories are defined in advance for all documents: "word segmentation" and "no word segmentation";
with the partial supervised learning method only a small portion of the "word segmentation" category data is labeled, and the naive Bayes likelihood estimation combined with the EM algorithm is then iterated until the desired classifier is finally trained.
CN201711444997.XA 2017-12-27 2017-12-27 Chinese generalized text segmentation method based on partial supervised learning Active CN108009156B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711444997.XA CN108009156B (en) 2017-12-27 2017-12-27 Chinese generalized text segmentation method based on partial supervised learning


Publications (2)

Publication Number Publication Date
CN108009156A CN108009156A (en) 2018-05-08
CN108009156B true CN108009156B (en) 2020-05-19

Family

ID=62061806

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711444997.XA Active CN108009156B (en) 2017-12-27 2017-12-27 Chinese generalized text segmentation method based on partial supervised learning

Country Status (1)

Country Link
CN (1) CN108009156B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108009156B (en) * 2017-12-27 2020-05-19 成都信息工程大学 Chinese generalized text segmentation method based on partial supervised learning
CN110110326B (en) * 2019-04-25 2020-10-27 西安交通大学 Text cutting method based on subject information
CN110457595B (en) * 2019-08-01 2023-07-04 腾讯科技(深圳)有限公司 Emergency alarm method, device, system, electronic equipment and storage medium
CN110532568B (en) * 2019-09-05 2022-07-01 哈尔滨理工大学 Chinese word sense disambiguation method based on tree feature selection and transfer learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016060687A1 (en) * 2014-10-17 2016-04-21 Machine Zone, Inc. System and method for language detection
CN106383853A (en) * 2016-08-30 2017-02-08 刘勇 Realization method and system for electronic medical record post-structuring and auxiliary diagnosis
CN107491439A (en) * 2017-09-07 2017-12-19 成都信息工程大学 A kind of medical science archaic Chinese sentence cutting method based on Bayesian statistics study
CN108009156A (en) * 2017-12-27 2018-05-08 成都信息工程大学 A kind of Chinese generality text dividing method based on partial supervised study


Also Published As

Publication number Publication date
CN108009156A (en) 2018-05-08

Similar Documents

Publication Publication Date Title
CN108009156B (en) Chinese generalized text segmentation method based on partial supervised learning
CN108460089B (en) Multi-feature fusion Chinese text classification method based on Attention neural network
CN107085581B (en) Short text classification method and device
CN108009148B (en) Text emotion classification representation method based on deep learning
CN108255813B (en) Text matching method based on word frequency-inverse document and CRF
CN108710611B (en) Short text topic model generation method based on word network and word vector
CN111444342B (en) Short text classification method based on multiple weak supervision integration
CN107480688B (en) Fine-grained image identification method based on zero sample learning
CN109522412B (en) Text emotion analysis method, device and medium
CN107862046A (en) A kind of tax commodity code sorting technique and system based on short text similarity
CN110222329B (en) Chinese word segmentation method and device based on deep learning
CN108090489B (en) Off-line hand-written Uygur word recognition method based on grapheme segmentation based on computer
CN104750875B (en) A kind of machine error data classification method and system
CN107577702B (en) Method for distinguishing traffic information in social media
Tang et al. Zero-shot learning by mutual information estimation and maximization
CN115953123A (en) Method, device and equipment for generating robot automation flow and storage medium
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
CN114925205A (en) GCN-GRU text classification method based on comparative learning
CN111523311B (en) Search intention recognition method and device
CN111078874B (en) Foreign Chinese difficulty assessment method based on decision tree classification of random subspace
CN112732863B (en) Standardized segmentation method for electronic medical records
CN111898375B (en) Automatic detection and division method for article discussion data based on word vector sentence chain
CN103744830A (en) Semantic analysis based identification method of identity information in EXCEL document
CN114996442A (en) Text abstract generation system combining abstract degree judgment and abstract optimization
CN112883158A (en) Method, device, medium and electronic equipment for classifying short texts

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant