CN108009156A - A Chinese summary-text segmentation method based on partially supervised learning - Google Patents

A Chinese summary-text segmentation method based on partially supervised learning

Info

Publication number
CN108009156A
Authority
CN
China
Prior art keywords
classification
data
participle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711444997.XA
Other languages
Chinese (zh)
Other versions
CN108009156B (en
Inventor
王亚强
何思佑
唐聃
舒红平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu University of Information Technology
Original Assignee
Chengdu University of Information Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu University of Information Technology
Priority to CN201711444997.XA
Publication of CN108009156A
Application granted
Publication of CN108009156B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 Bayesian classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the field of language processing technology and discloses a Chinese summary-text segmentation method based on partially supervised learning. The method treats Chinese short-text word segmentation as a two-class or three-class classification problem: according to the main characteristics of short text, it extracts context feature information with low noise and combines it with a partially supervised learning method to perform segmentation. Controlled experiments on five groups of data sets plus one "difficult" data set show that the result of short-text segmentation is strongly affected by the length of the context information. Binary context information fits the characteristics of short-text segmentation best and effectively improves segmentation performance; mixed binary-ternary features express the information of each "blank" more fully and perform best, and using more or fewer features than this reduces performance. Applying partially supervised learning to short-text segmentation also shows its strong ability to fill in missing parameters, which greatly reduces manual labeling work while achieving better performance.

Description

A Chinese summary-text segmentation method based on partially supervised learning
Technical field
The invention belongs to the field of language processing technology, and more particularly relates to a Chinese summary-text segmentation method based on partially supervised learning.
Background art
In natural language processing, the most basic task is to segment a piece of text into the blocks that carry its most basic semantics, and the word is the unit that best meets this requirement. In languages such as English, where words are separated by delimiters, words can easily be extracted by splitting on spaces; in Chinese, which has no delimiters, a separate word segmentation task is required. There are currently two conventional approaches. The first is matching-based: a manually constructed dictionary is used to check, character by character, whether the current candidate forms a word; when the candidate reaches the maximum length that can form a word, comparison stops, the candidate is split off, and the next round of matching continues. Depending on the matching direction, this is divided into forward and backward maximum matching, whose essential idea is the same. A similar method, full-segmentation path selection, also relies on a manually constructed dictionary: dictionary matching finds all possible segmentation paths, and weights are then used to select an optimal path. The biggest drawback of these methods is their heavy dependence on the dictionary: a large amount of manual work is needed to keep the dictionary updated, and because dictionaries differ in segmentation granularity, the segmentation of special styles of text (such as summary text) is also badly affected. The second approach is statistics-based and has developed well as computing power has improved. For example, each character is labeled with one of {B, I, E, S}, denoting word-initial, word-internal, word-final, and single-character word respectively; a hidden Markov model or conditional random field is then trained, and the trained model segments new unlabeled sentences. The biggest drawback of these statistical methods is likewise their dependence on a large-scale corpus, whose construction is done manually and is very time-consuming and laborious.
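The forward maximum matching idea described above can be sketched as follows. This is a minimal illustration rather than code from the patent: the function name `forward_max_match`, the `max_len` limit, and the toy dictionary are all assumptions made for the example.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching against a hand-built dictionary."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words

# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"自然", "语言", "处理", "自然语言"}
segmented = forward_max_match("自然语言处理", toy_dictionary)
print(segmented)  # ['自然语言', '处理']
```

The output depends entirely on what the dictionary contains, which is the dependence the background section criticizes: removing "自然语言" from the toy dictionary changes the granularity of the result.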
In summary, the problems in the prior art are: reliance on large-scale manually built data sets, which costs substantial labor and time; a low word recognition rate; and an inability to segment text accurately into words of appropriate granularity.
Summary of the invention
In view of the problems of the prior art, the present invention provides a Chinese summary-text segmentation method based on partially supervised learning which, for the same segmentation performance, saves 10% to 50% of manually labeled data compared with conventional methods.
The present invention is achieved as follows: a Chinese summary-text segmentation method based on partially supervised learning which, according to the main characteristics of short text, extracts context feature information with low noise and combines it with a partially supervised learning method to perform segmentation;
The characteristics of the short text include: binary context information, which fits short-text segmentation;
Mixed binary-ternary context, which expresses the information of each blank;
The partially supervised learning is used to fill in missing parameters in short-text segmentation.
Further, the Chinese summary-text segmentation method based on partially supervised learning specifically includes:
Step 1: perform feature selection. The window size is set to 1 to 3, and * and & are added as start and end markers: "***自然语言处理&&&". For the blank between the two characters of "自然", the preceding context with window size one is expressed as o_p1_自, and the following context with window size two is expressed as o_n2_然语;
Step 2: obtain a small labeled "segment"-class data set P and a large unlabeled mixed data set M, where M contains data of both the "segment" and "non-segment" classes; and introduce partially supervised learning.
Further, the naive Bayes classification method includes: given a set of blanks $B = \{b_1, \ldots, b_l\}$, each "blank" is represented by feature information such as its context $f_n$, where $f_n$ comes from the set of all features extracted from the training set, $F = \{f_1, f_2, \ldots, f_n\}$; for two-class classification a class set $C = \{c_1, c_2\}$ is defined, where $c_1$ denotes the "segment" class and $c_2$ the "non-segment" class. To obtain the most probable class of a given "blank", the posterior probability is calculated; by Bayes' theorem,
$$P(c \mid b) = \frac{P(c)\,P(b \mid c)}{P(b)} \qquad (1)$$
Under the conditional independence assumption, formula (1) becomes:
$$P(c \mid b) \propto P(c) \prod_{f \in b} P(f \mid c) \qquad (2)$$
With Laplace smoothing, this becomes:
$$P(c \mid b) \propto P(c) \prod_{f \in b} \frac{N(f, c) + 1}{\sum_{f' \in F} N(f', c) + |V|} \qquad (3)$$
where $N(f, c) / \sum_{f' \in F} N(f', c)$ is the number of times feature $f$ occurs in the "blanks" $b$ of class $c$ divided by the total number of feature occurrences in class $c$, and $|V|$ in the denominator denotes the total number of features.
Further, the partially supervised learning method includes:
The blank between every two characters is regarded as a separate document, and all documents are defined in advance as belonging to one of two classes: "segment" and "non-segment";
Only a small portion of "segment"-class data is labeled; naive Bayes likelihood estimation is then combined with the EM algorithm and iterated until an optimal classifier is finally trained.
Further, the EM algorithm specifically includes:
First, all data in P are assigned to class c1, and the labels of the data in P never change in later iterations; then all "blanks" in data set M are assigned to class c2, and the classes of these data change continually during iteration. Next, an initial classifier (initial-classifier) is trained with naive Bayes and used to classify the data in M: data classified as c1 are added to the "segment"-class data set seg, and data classified as c2 are added to the "non-segment"-class data set non-seg. The EM iteration then begins: using the naive Bayes algorithm, a new classifier is rebuilt from the P, seg, and non-seg data sets and used to reclassify seg and non-seg, until convergence yields the final classifier.
For the Chinese short-text word segmentation task, the present invention mainly uses the naive Bayes and expectation-maximization algorithms. The Chinese word segmentation task is converted into a text classification task, and long practice has shown that naive Bayes performs excellently in text classification; partially supervised learning is in essence a constrained optimization problem, and the EM algorithm fits this characteristic well. To analyze the segmentation of Chinese short text, the present invention uses the titles of Chinese computer-science and medical papers as the experimental corpus, because a) Chinese paper titles have the characteristics of short text and their wording is accurate and formal, which reduces data noise; and b) such short text is itself highly condensed, so transfer learning can be done on it later.
Through controlled experiments on five "regular" data sets plus one "difficult" data set, the present invention demonstrates its performance and identifies feature regularities of summary text. With the same proportion of labeled data (10% to 50%; for example, with 10,000 training items in total and a labeled proportion of 10%, only 1,000 items need manual labeling and the other 9,000 need no manual involvement), the present invention improves accuracy by 17% to 27% on average over conventional methods, and the balance of performance measured by the F value improves by about 5% to 8%. Most notably, using only 50% labeled data the present invention reaches the performance of a traditional supervised learning method using 100% labeled data. These experimental results show that the present invention can save substantial resources and time in the labor spent on building data sets. In addition, the feature extraction regularities of summary text are summarized as follows: binary context information fits the characteristics of short-text segmentation best and effectively improves segmentation performance; mixed binary-ternary features express the information of each "blank" more fully and perform best. The experimental results show that, with the same proportion of labeled data, mixed binary-ternary context features improve accuracy by 4% to 8% over single unary or ternary features, and single binary context information also improves performance by about 8% on average over single unary or ternary context information.
Brief description of the drawings
Fig. 1 is a flow chart of the Chinese summary-text segmentation method based on partially supervised learning provided by an embodiment of the present invention.
Fig. 2 is the unary context (precision) chart provided by an embodiment of the present invention.
Fig. 3 is the unary context (F-score) chart provided by an embodiment of the present invention.
Fig. 4 is the binary context (precision) chart provided by an embodiment of the present invention.
Fig. 5 is the binary context (F-score) chart provided by an embodiment of the present invention.
Fig. 6 is the ternary context (precision) chart provided by an embodiment of the present invention.
Fig. 7 is the ternary context (F-score) chart provided by an embodiment of the present invention.
Fig. 8 is the mixed binary-ternary context (precision) chart provided by an embodiment of the present invention.
Fig. 9 is the mixed binary-ternary context (F-score) chart provided by an embodiment of the present invention.
Fig. 10 is the mixed unary-binary-ternary context (precision) chart provided by an embodiment of the present invention.
Fig. 11 is the mixed unary-binary-ternary context (F-score) chart provided by an embodiment of the present invention.
Detailed description of the embodiments
In order to make the purpose, technical solution, and advantages of the present invention clearer, the present invention is further described below with reference to embodiments. It should be understood that the specific embodiments described herein merely illustrate the present invention and are not intended to limit it.
The application principle of the present invention is explained in detail below in conjunction with the accompanying drawings.
In the Chinese summary-text segmentation method based on partially supervised learning provided by an embodiment of the present invention, Chinese short-text word segmentation is treated as a two-class or three-class classification problem, and, according to the main characteristics of short text, context feature information with low noise is extracted and combined with a partially supervised learning method to perform segmentation.
The main characteristics of the short text include: binary context information, which fits short-text segmentation and improves segmentation performance;
Mixed binary-ternary context, which expresses the information of each blank;
The partially supervised learning is used to fill in missing parameters in short-text segmentation.
As shown in Fig. 1, the Chinese summary-text segmentation method based on partially supervised learning provided by an embodiment of the present invention includes the following steps:
S101: perform feature selection. The window size is set to 1 to 3, and * and & are added as start and end markers; for the blank between the two characters of "自然", the preceding context with window size one is extracted, and the "blank" is represented by its window contexts. By analogy with the feature "word" in text classification, each n-ary context is regarded as a "word", and conditional independence between contexts is assumed, so likelihood estimation is performed directly with naive Bayes;
S102: from a small manually labeled "segment"-class data set P and a large unlabeled mixed data set M, where M contains data of both the "segment" and "non-segment" classes, train an initial classifier, and run the EM iteration using the initial classifier and the mixed data set M;
S103: build the initial classifier: identify data that are very certainly nonseg in order to further distinguish the seg and nonseg classes; the SEM process of partially supervised learning is applied directly to word segmentation.
In a preferred embodiment of the present invention, feature selection in step S101 is performed as follows. First, the context features of a blank may be long or short, that is, the window size varies; here the window size is set to 1 to 3, and so that the extracted contexts have the same length, * and & are added as start and end markers: "***自然语言处理&&&". For the blank between the two characters of "自然", the preceding context with window size one is expressed as o_p1_自; similarly, the following context with window size two is expressed as o_n2_然语. With its unary and binary window contexts, this "blank" is expressed as o_p1_自_n1_然_p2_*自_n2_然语, and its ternary context can be added in the same way. By analogy with the feature "word" in text classification, each n-ary context is regarded as a "word", and conditional independence between contexts is assumed, so likelihood estimation is performed directly with naive Bayes.
The application principle of the present invention is further described below with reference to specific embodiments:
The Chinese summary-text segmentation method based on partially supervised learning provided by an embodiment of the present invention specifically includes:
1) IEM process.
First, a small labeled "segment"-class data set P and a large unlabeled mixed data set M are obtained, where M contains data of both the "segment" and "non-segment" classes.
All data in P are first assigned to class c1, and the labels of the data in P never change in later iterations; then all "blanks" in data set M are assigned to class c2, and the classes of these data change continually during iteration.
Next, an initial classifier (initial-classifier) is trained with naive Bayes and used to classify the data in M: data classified as c1 are added to seg (the "segment"-class data set), and data classified as c2 are added to non-seg (the "non-segment"-class data set).
The EM iteration then begins: using the naive Bayes algorithm, a new classifier is rebuilt from the P, seg, and non-seg data sets and used to reclassify seg and non-seg, until convergence yields the final classifier. Its pseudocode is as follows:
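The pseudocode itself is not reproduced in this text. The following is a minimal Python sketch of the loop as the paragraphs above describe it, under stated assumptions: the function names `fit_nb`, `predict`, and `i_em` are hypothetical, hard class labels are used as in the description, and add-one smoothing over the number of distinct features is an illustrative choice rather than the patent's stated setting.

```python
import math
from collections import Counter

def fit_nb(samples):
    """Train naive Bayes on (features, label) pairs; returns counts for add-one smoothing."""
    labels = Counter(lbl for _, lbl in samples)
    counts = {lbl: Counter() for lbl in labels}
    vocab = set()
    for feats, lbl in samples:
        counts[lbl].update(feats)
        vocab.update(feats)
    return labels, counts, len(vocab)

def predict(model, feats):
    """Return the class with the highest smoothed log-posterior."""
    labels, counts, v = model
    total = sum(labels.values())
    def score(lbl):
        denom = sum(counts[lbl].values()) + v
        return math.log(labels[lbl] / total) + sum(
            math.log((counts[lbl][f] + 1) / denom) for f in feats)
    return max(sorted(labels), key=score)

def i_em(positive, mixed, max_iter=20):
    """positive: feature lists labeled seg (P); mixed: unlabeled feature lists (M)."""
    fixed = [(feats, "seg") for feats in positive]   # labels in P never change
    guess = ["non-seg"] * len(mixed)                 # initial labels for M
    model = fit_nb(fixed + list(zip(mixed, guess)))  # initial classifier
    for _ in range(max_iter):
        new_guess = [predict(model, feats) for feats in mixed]
        if new_guess == guess:                       # converged: labels stopped changing
            break
        guess = new_guess
        model = fit_nb(fixed + list(zip(mixed, guess)))
    return model, guess

# Toy data with made-up feature names, for illustration only.
P = [["a", "b"], ["a", "c"]]
M = [["a", "b"], ["x", "y"], ["x", "z"]]
final_model, final_labels = i_em(P, M)
print(final_labels)  # ['seg', 'non-seg', 'non-seg']
```

In the toy run, the item of M that shares its features with P moves to the seg class after the first pass, and the second pass leaves the labels unchanged, so the loop stops.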
The above method is summarized as the IEM method. In practical application, however, it was found that this method can only be applied when noise is low, for example when the data set contains only two classes. Although this suits the present experiments well, a partially supervised learning method for the high-noise case must be introduced in order to handle three-class or even multi-class problems in the future.
2) SEM process.
In the multi-class case, although partially supervised learning extracts only the desired positive class, the diversity of classes brings considerable noise to the unlabeled data set M, and the IEM method described above does not work particularly well. The root cause is that it is not known which data in M are truly nonseg, so a modified method is needed to find some data that can be determined to be nonseg with as much certainty as possible, in order to further distinguish the seg and nonseg classes. The SEM process is explained very clearly by Bing Liu, and the present invention applies it directly to word segmentation. Its pseudocode is as follows:
In the above code, N is the data determined at initialization, according to a threshold, to be most likely of the nonseg class, and SPY is the "spy" data extracted from the P data set.
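The pseudocode is not reproduced in this text. The role of SPY and N can still be sketched, with the caveat that everything concrete below is an assumption: the function name `reliable_negatives`, the `posterior` callable (a stand-in for a classifier trained on the given sets), and the choice of the lowest spy score as the threshold are illustrative and are not taken from the patent.

```python
import random

def reliable_negatives(positive, mixed, posterior, spy_ratio=0.1, seed=0):
    """Pick spies from P, then keep the items of M that score below every spy.

    posterior(train_pos, train_neg, item) is a hypothetical callable returning
    the estimated probability that item belongs to the seg class.
    """
    rng = random.Random(seed)
    n_spies = max(1, int(len(positive) * spy_ratio))
    spy_idx = set(rng.sample(range(len(positive)), n_spies))
    spies = [p for i, p in enumerate(positive) if i in spy_idx]
    rest = [p for i, p in enumerate(positive) if i not in spy_idx]
    mixed_with_spies = mixed + spies       # spies hide among the unlabeled data
    # Assumed threshold rule: the lowest seg probability given to any spy.
    threshold = min(posterior(rest, mixed_with_spies, s) for s in spies)
    likely_neg = [m for m in mixed
                  if posterior(rest, mixed_with_spies, m) < threshold]
    unlabeled = [m for m in mixed if m not in likely_neg]
    return spies, likely_neg, unlabeled

# Fixed toy scores stand in for a trained classifier.
scores = {"p1": 0.9, "p2": 0.8, "p3": 0.7, "m1": 0.95, "m2": 0.2, "m3": 0.1}
toy_posterior = lambda pos, neg, item: scores[item]
spies, likely_neg, unlabeled = reliable_negatives(
    ["p1", "p2", "p3"], ["m1", "m2", "m3"], toy_posterior, spy_ratio=0.34)
print(likely_neg, unlabeled)  # ['m2', 'm3'] ['m1']
```

The items in `likely_neg` correspond to N in the description; the remaining items stay unlabeled and are resolved by the later EM iterations.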
The application effect of the present invention is explained in detail below with reference to experiments.
1) Experimental data and evaluation method.
(1) Characteristics of the experimental data
The experimental data of the present invention are the texts of paper titles published in recent years in several computer-science and medical journals. The data have two characteristics: a) the text is short and condensed, matching the characteristics of short text; b) the wording is accurate and has little ambiguity. In addition, the large gap between computer-science and medical terminology can be used for later cross-validation, which is not discussed in this experiment.
(2) Feature selection
As mentioned above, the main idea of this experiment is to regard the "blank" between every two Chinese characters as an independent document and its context information as relatively independent features. The detailed method for extracting context information is described below.
First, the context features of a "blank" may be long or short, that is, the window size varies. To match the characteristics of short text, the window size is set to 1 to 3. Take the short text "自然语言处理" (natural language processing) as an example: so that the extracted contexts have the same length, * and & are added as start and end markers: "***自然语言处理&&&". In this example, for the blank between the two characters of "自然", the preceding context with window size one is expressed as o_p1_自; similarly, the following context with window size two is expressed as o_n2_然语. With its unary and binary window contexts, this "blank" can then be expressed as o_p1_自_n1_然_p2_*自_n2_然语, and its ternary context can be added in the same way. By analogy with the feature "word" in text classification, each n-ary context here can be regarded as a "word", and conditional independence between contexts is assumed, so likelihood estimation can be performed directly with naive Bayes.
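The extraction rule above can be sketched in a few lines of Python. The function name `blank_features` and its parameters are hypothetical, and returning one token per window is an illustrative choice; only the padding markers and the o_p/o_n naming follow the text.

```python
def blank_features(text, position, windows=(1, 2, 3), pad=3):
    """Context features for the blank between text[position - 1] and text[position]."""
    padded = "*" * pad + text + "&" * pad   # start and end markers
    cut = position + pad                    # index of the blank in the padded string
    features = []
    for w in windows:
        features.append(f"o_p{w}_{padded[cut - w:cut]}")   # preceding context
        features.append(f"o_n{w}_{padded[cut:cut + w]}")   # following context
    return features

# Blank between "自" and "然", unary and binary windows only.
example = blank_features("自然语言处理", 1, windows=(1, 2))
print(example)  # ['o_p1_自', 'o_n1_然', 'o_p2_*自', 'o_n2_然语']
```

Each returned token plays the role of one "word" of the "document" that represents the blank, which is what allows a text classifier to be applied unchanged.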
(3) Performance evaluation
This experiment uses the commonly used F-score measure, defined as $F = \frac{2pr}{p + r}$, where $p$ denotes precision and $r$ denotes recall. Precision describes the probability that a segmentation result is correct, and recall describes how completely the test data are retrieved. The calculation of p and r can be represented with a two-dimensional matrix.
Table 1 p*r matrix
In general, p and r trade off against each other, and the F-score is introduced precisely to find a balance point between p and r.
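As a small worked illustration of these measures, the sketch below computes precision, recall, and F-score from predicted and reference labels. The function name `precision_recall_f` and the choice of "seg" as the positive class are assumptions made for the example.

```python
def precision_recall_f(predicted, gold, positive="seg"):
    """Precision, recall, and F-score for one class from parallel label lists."""
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p == positive and g == positive)
    fp = sum(1 for p, g in pairs if p == positive and g != positive)
    fn = sum(1 for p, g in pairs if p != positive and g == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

result = precision_recall_f(
    ["seg", "seg", "non-seg", "non-seg"],
    ["seg", "non-seg", "seg", "non-seg"])
print(result)  # (0.5, 0.5, 0.5)
```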
2) Naive Bayes classification
(1) Naive Bayes classification
Naive Bayes is a classification method based on Bayes' theorem with an assumption of feature independence, and is the most widely used Bayesian classifier. Suppose the experiment has a set of blanks $B = \{b_1, \ldots, b_l\}$, and each "blank" is represented by feature information such as its context $f_n$, where $f_n$ comes from the set of all features extracted from the training set, $F = \{f_1, f_2, \ldots, f_n\}$. Because the experiment performs two-class classification, a class set $C = \{c_1, c_2\}$ is defined in advance, where $c_1$ denotes the "segment" class and $c_2$ the "non-segment" class. To obtain the most probable class of a given "blank", the posterior probability must be calculated; by Bayes' theorem,
$$P(c \mid b) = \frac{P(c)\,P(b \mid c)}{P(b)} \qquad (1)$$
Under the conditional independence assumption, formula (1) becomes
$$P(c \mid b) \propto P(c) \prod_{f \in b} P(f \mid c) \qquad (2)$$
Because features that were never counted may be encountered during classification, smoothing is needed; the present invention uses Laplace smoothing, giving
$$P(c \mid b) \propto P(c) \prod_{f \in b} \frac{N(f, c) + 1}{\sum_{f' \in F} N(f', c) + |V|} \qquad (3)$$
where $N(f, c) / \sum_{f' \in F} N(f', c)$ is the number of times feature $f$ occurs in the "blanks" $b$ of class $c$ divided by the total number of feature occurrences in class $c$; $|V|$ in the denominator denotes the total number of classes, and since this is a two-class problem $|V|$ is set to 2.
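A minimal sketch of this classifier follows, assuming the multinomial form described above. The names `train_nb`, `log_posterior`, and `classify` are hypothetical, and the smoothing term is passed in as the parameter `v` because the text sets it to 2 for the two-class experiment, whereas the number of distinct features is the more common choice.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples, v):
    """samples: list of (feature_list, label); v: the |V| smoothing term."""
    class_counts = Counter(label for _, label in samples)
    feature_counts = defaultdict(Counter)
    for feats, label in samples:
        feature_counts[label].update(feats)
    total = sum(class_counts.values())
    priors = {c: n / total for c, n in class_counts.items()}
    return priors, feature_counts, v

def log_posterior(model, feats, label):
    """Unnormalized log-posterior: log prior plus smoothed log-likelihoods."""
    priors, feature_counts, v = model
    denom = sum(feature_counts[label].values()) + v
    score = math.log(priors[label])
    for f in feats:
        score += math.log((feature_counts[label][f] + 1) / denom)
    return score

def classify(model, feats):
    return max(model[0], key=lambda c: log_posterior(model, feats, c))

# Toy training blanks from "自然语言处理", labeled by hand for illustration.
samples = [
    (["o_p1_然", "o_n1_语"], "seg"),
    (["o_p1_言", "o_n1_处"], "seg"),
    (["o_p1_自", "o_n1_然"], "non-seg"),
    (["o_p1_语", "o_n1_言"], "non-seg"),
]
model = train_nb(samples, v=2)
print(classify(model, ["o_p1_自", "o_n1_然"]))  # non-seg
```

Working in log space avoids numerical underflow when a blank carries many features; the ranking of classes is the same as with the product form.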
3) Expectation-maximization algorithm
The expectation-maximization (EM) algorithm is mainly used for maximum likelihood estimation or maximum a posteriori estimation of probabilistic parametric models with hidden variables. Unlike other algorithms, it is not really a concrete algorithm but can be regarded as an algorithmic idea: when only incomplete data are available, the missing data are filled in by iterating this algorithm to convergence. Extensive practice has shown that the EM algorithm performs very well during iteration, but depending on parameter initialization the final result may be a local optimum rather than the global optimum.
The EM algorithm has two main steps: a) the expectation step, which estimates the incomplete parameters from the existing probability distribution; b) the maximization step, which uses the expected values obtained in the first step to re-estimate the distribution parameters so that the likelihood of the data is maximized, giving the expected estimate of the unknown variables. Choosing EM iteration fits this experiment well, because the partially supervised learning used is a learning method built on such incomplete data, and the EM algorithm can effectively help complete the missing labels of the unlabeled data.
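For a naive Bayes model, the two steps can be written out as below. This is a sketch of one common formulation and is not given in the text: $N(f, b)$ (the number of times feature $f$ occurs in blank $b$) and the add-one smoothing of the class prior are introduced here for illustration, and $|V|$ stands for the smoothing term of the Laplace formula.

```latex
% E-step: with the current parameters, compute the class posterior of every
% unlabeled blank b in M (labels of blanks in P stay fixed at c_1).
P(c \mid b) = \frac{P(c)\prod_{f \in b} P(f \mid c)}
                   {\sum_{c' \in C} P(c')\prod_{f \in b} P(f \mid c')}

% M-step: re-estimate the parameters from the expected counts.
P(f \mid c) = \frac{1 + \sum_{b \in B} N(f, b)\,P(c \mid b)}
                   {|V| + \sum_{f' \in F}\sum_{b \in B} N(f', b)\,P(c \mid b)}
\qquad
P(c) = \frac{1 + \sum_{b \in B} P(c \mid b)}{|C| + |B|}
```

With hard labels, $P(c \mid b)$ is simply 1 for the assigned class and 0 otherwise, and the M-step reduces to ordinary counting.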
4) Partially supervised learning method
The partially supervised learning method was proposed by Bing Liu and other scholars. It is an improved version of fully supervised learning that aims to achieve the same or even better learning performance with less manually labeled data. Put differently, it is learning from labeled and unlabeled examples (Learning from Labeled and Unlabeled Examples), or learning from positive and unlabeled examples (Learning from Positive and Unlabeled Examples). In the experiments of the present invention, the blank between every two characters can be regarded as a separate document, and all documents are defined in advance as two classes: "segment" and "non-segment". Traditional learning methods must label samples of both classes and then perform statistical learning to train the required classifier, which is very time-consuming work.
With the partially supervised learning method, however, only a small portion of "segment"-class data need be labeled; the naive Bayes method described above for likelihood estimation is then combined with the EM algorithm and iterated until the desired classifier is trained.
5) Experimental results
(1) Results
This experiment mainly compares three methods, naive Bayes, IEM, and SEM, in terms of final precision and F-score with the same proportion of labeled data. The training and test data are all computer-science and medical paper titles; the training data comprise 10,000 items and the test data 1,000 items, with labeled proportions of 10% to 50%. The data in the figures show the precision and F-score of the three methods respectively. The different line charts compare segmentation performance using contexts of different window sizes and mixed context features, mainly unary context, binary context, ternary context, and mixed contexts (binary-ternary and unary-binary-ternary).
(2) Analysis of results
In terms of the proportion of labeled data: with the same proportion of labeled data, the SEM and IEM algorithms perform comparably in the two-class case, and both are far higher in precision and F-score than segmentation using naive Bayes directly. This conclusion can be seen clearly in all five sets of experimental results. Whether the SPY extraction ratio is 10% or 20% makes no significant difference to the results, owing to the convergence properties of the EM algorithm itself.
In terms of the length of extracted information: when only unary or only binary context information is extracted, segmentation precision is very high but the F-score is less satisfactory, which indirectly reveals a weakness in recall. Compared with these two, extracting ternary context alone is even less satisfactory. Analysis shows that this is caused by the particular nature of the data set: paper titles are a typical form of short text whose characteristics, as already described, are formality, terseness, and accuracy. Word-formation patterns of more than two characters are therefore rarely seen, so extracting ternary features alone does not work well.
Comparing the performance of mixed binary-ternary features with that of mixed unary-binary-ternary features, it is easy to see that mixed binary-ternary features perform best among the five groups of experiments, in both precision and F-score. Mixing all features together instead lowers performance, because the redundant features introduce unnecessary noise that interferes with the classifier's judgment.
Finally, across the whole data set the SEM and IEM processes show almost no difference in performance. However, an experiment was also run on a "difficult" data set, in which data set M no longer contains only two classes but multiple classes, introducing more noise. Its performance is shown in Table 2: on the "difficult" data set, the SEM process is on average 0.5 to 1 percentage point higher in precision than IEM.
Table 2 Comparison of multi-class performance
The controlled experiments on five groups of data sets plus one "difficult" data set show that the result of short-text segmentation is strongly affected by the length of the context information. Binary context information fits the characteristics of short-text segmentation best and effectively improves segmentation performance; mixed binary-ternary features express the information of each "blank" more fully and perform best, and using more or fewer features than this reduces performance. In addition, applying partially supervised learning to short-text segmentation shows its strong ability to fill in missing parameters, which greatly reduces manual labeling work while achieving better performance.
Fig. 2 is a unary-context precision plot provided by an embodiment of the present invention.
Fig. 3 is a unary-context F-score plot provided by an embodiment of the present invention.
Fig. 4 is a binary-context precision plot provided by an embodiment of the present invention.
Fig. 5 is a binary-context F-score plot provided by an embodiment of the present invention.
Fig. 6 is a ternary-context precision plot provided by an embodiment of the present invention.
Fig. 7 is a ternary-context F-score plot provided by an embodiment of the present invention.
Fig. 8 is a binary-ternary mixed-context precision plot provided by an embodiment of the present invention.
Fig. 9 is a binary-ternary mixed-context F-score plot provided by an embodiment of the present invention.
Fig. 10 is a unary-binary-ternary mixed-context precision plot provided by an embodiment of the present invention.
Fig. 11 is a unary-binary-ternary mixed-context F-score plot provided by an embodiment of the present invention.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (5)

1. A Chinese generalized text segmentation method based on partial supervised learning, characterized in that the method extracts low-noise context feature information according to the main characteristics of short text and performs segmentation by combining this information with a partial supervised learning method;
The features of the short text include: binary context information, which fits short-text segmentation;
Binary-ternary mixed context, for expressing the information of each blank;
The partial supervised learning is used to fill in parameters in short-text segmentation.
2. The Chinese generalized text segmentation method based on partial supervised learning as claimed in claim 1, characterized in that the method specifically comprises:
Step 1: perform feature selection with the window size set to 1 to 3, adding * and & as start and end markers, for example "***自然语言处理&&&"; for the blank between the two characters of "自然", the preceding context of window size 1 is expressed as o_p1_自, and the following context of window size 2 is expressed as o_n2_然语;
Step 2: obtain a small labeled data set P of the "segment" class and a large unlabeled mixed data set M, where M contains data of both the "segment" and "non-segment" classes; and introduce partial supervised learning.
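Step 1 can be sketched as follows. This is a minimal illustration, assuming that features are extracted for every blank between adjacent characters and that each window size n yields one preceding-context feature `o_p{n}_…` and one following-context feature `o_n{n}_…`; the function name `extract_blank_features` and its parameters are hypothetical and do not come from the claims.

```python
def extract_blank_features(text, max_window=3, start="*", end="&"):
    """Return one feature list per blank between adjacent characters of `text`.

    Hypothetical helper: the naming scheme o_p{n}_ / o_n{n}_ follows the
    example in Step 1; everything else is an assumption for illustration.
    """
    # Pad so that every blank has max_window characters on both sides.
    padded = start * max_window + text + end * max_window
    all_features = []
    for gap in range(1, len(text)):
        # Index in `padded` of the character immediately after the blank.
        pos = gap + max_window
        feats = []
        for n in range(1, max_window + 1):
            feats.append(f"o_p{n}_{padded[pos - n:pos]}")   # preceding context
            feats.append(f"o_n{n}_{padded[pos:pos + n]}")   # following context
        all_features.append(feats)
    return all_features


if __name__ == "__main__":
    for feats in extract_blank_features("自然语言处理"):
        print(feats)
```

For the padded string "***自然语言处理&&&", the first blank (between 自 and 然) yields, among others, `o_p1_自` and `o_n2_然语`, matching the example in Step 1.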
3. The Chinese generalized text segmentation method based on partial supervised learning as claimed in claim 2, characterized in that the naive Bayes classification method comprises: a set of blanks $B = \{b_1, b_2, \ldots\}$, in which each "blank" is represented by feature information such as its context, $f_n$, where $f_n$ is taken from the set of all features extracted from the training set, $F = \{f_1, f_2, \ldots, f_n\}$; for binary classification, a class set $C = \{c_1, c_2\}$ is defined, where $c_1$ denotes the "segment" class and $c_2$ the "non-segment" class; to obtain the most probable class for a given "blank", the posterior probability must be computed, which by Bayes' theorem is:
$$P(c_j \mid f_1, f_2, \ldots, f_k) = \frac{P(c_j)\,P(f_1, f_2, \ldots, f_k \mid c_j)}{P(f_1, f_2, \ldots, f_k)} \tag{1}$$
Under the conditional independence assumption, formula (1) becomes:
$$P(c_j \mid f_1, f_2, \ldots, f_k) = \frac{P(c_j)\prod_{i=1}^{k} P(f_i \mid c_j)}{\sum_{j=1}^{2} P(c_j)\prod_{i=1}^{k} P(f_i \mid c_j)} \tag{2}$$
Applying Laplace smoothing, the formula becomes:
$$P(c_j \mid f_1, f_2, \ldots, f_k) = \frac{1 + P(c_j)\prod_{i=1}^{k}\frac{C(f_k, b_i)}{C(c_j)}}{|V| + \sum_{j=1}^{2} P(c_j)\prod_{i=1}^{k}\frac{C(f_k, b_i)}{C(c_j)}} \tag{3}$$
where $\frac{C(f_k, b_i)}{C(c_j)}$ denotes the number of times feature f occurs in "blank" b divided by the total number of feature occurrences in class c, and $|V|$ in the denominator denotes the total number of features.
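The classification step can be illustrated with a short sketch. It is not a literal transcription of formula (3): it assumes the conventional form of Laplace (add-one) smoothing, applied to each conditional estimate as (count of f in class c + 1) / (total feature count in class c + |V|), and it works in log space to avoid underflow. The function names `train_nb` and `posterior` and the class labels "seg" and "non-seg" are hypothetical.

```python
import math
from collections import Counter

CLASSES = ("seg", "non-seg")  # c1 = "segment", c2 = "non-segment"


def train_nb(samples, labels):
    """Collect the counts needed by naive Bayes from labeled blanks."""
    feat_counts = {c: Counter() for c in CLASSES}
    for feats, label in zip(samples, labels):
        feat_counts[label].update(feats)
    return {
        "vocab": {f for feats in samples for f in feats},  # V
        "class_counts": Counter(labels),
        "feat_counts": feat_counts,
        "n": len(labels),
    }


def posterior(model, feats):
    """Return P(c | f_1..f_k) for each class, with add-one smoothing."""
    v = len(model["vocab"])
    log_scores = {}
    for c in CLASSES:
        total = sum(model["feat_counts"][c].values())
        # Smoothed prior P(c).
        score = math.log((model["class_counts"][c] + 1) / (model["n"] + len(CLASSES)))
        # Smoothed likelihoods P(f_i | c) under conditional independence.
        for f in feats:
            score += math.log((model["feat_counts"][c][f] + 1) / (total + v))
        log_scores[c] = score
    # Normalize over the two classes (the denominator of formula (2)).
    top = max(log_scores.values())
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnorm.values())
    return {c: u / z for c, u in unnorm.items()}


if __name__ == "__main__":
    model = train_nb([["a", "b"], ["a", "c"], ["d", "e"]], ["seg", "seg", "non-seg"])
    print(posterior(model, ["a"]))
```

Smoothing each factor separately keeps unseen features from driving a class probability to zero, which matters here because short titles produce many rare context features.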
4. The Chinese generalized text segmentation method based on partial supervised learning as claimed in claim 2, characterized in that the partial supervised learning method comprises:
Treating the gap between every two characters as a separate document, with all documents divided in advance into two classes: "segment" and "non-segment";
Labeling only a small portion of the "segment" class data, then performing likelihood estimation with the naive Bayes method combined with repeated iterations of the EM algorithm, until an optimal classifier is finally trained.
5. The Chinese generalized text segmentation method based on partial supervised learning as claimed in claim 2, characterized in that the EM algorithm specifically comprises:
First, all data in P are assigned to class c1, and the labels of the data in P never change in later iterations; then all "blanks" in the M data set are assigned to class c2, and the classes of these data change continually during iteration; next, naive Bayes is used to train an initial classifier (initial-classifier), which classifies the data in the M data set: data classified as c1 are added to the "segment" data set seg, and data classified as c2 are added to the "non-segment" data set non-seg; the EM iteration then begins: the naive Bayes algorithm builds a new classifier from the P, seg, and non-seg data sets and reclassifies seg and non-seg, repeating until convergence yields the final classifier.
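The iteration in claim 5 can be sketched as below. This is a minimal illustration, assuming hard label assignments in every round, add-one smoothing in the naive Bayes estimates, and convergence defined as "no label in M changes"; none of these details is fixed by the claim. The names `fit_nb`, `predict_nb`, and `partially_supervised_em` are hypothetical.

```python
import math
from collections import Counter

CLASSES = ("c1", "c2")  # c1 = "segment", c2 = "non-segment"


def fit_nb(samples, labels):
    """Train naive Bayes counts from feature lists and hard labels."""
    counts = {c: Counter() for c in CLASSES}
    for feats, label in zip(samples, labels):
        counts[label].update(feats)
    vocab = {f for feats in samples for f in feats}
    return vocab, Counter(labels), counts, len(labels)


def predict_nb(model, feats):
    """Return the most probable class under add-one smoothing (assumed)."""
    vocab, prior, counts, n = model
    best, best_score = None, -math.inf
    for c in CLASSES:
        total = sum(counts[c].values())
        score = math.log((prior[c] + 1) / (n + len(CLASSES)))
        for f in feats:
            score += math.log((counts[c][f] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best


def partially_supervised_em(P, M, max_iter=50):
    """P: labeled "segment" blanks; M: unlabeled mixed blanks."""
    labels_p = ["c1"] * len(P)   # fixed for the whole run
    labels_m = ["c2"] * len(M)   # initial guess, revised each round
    model = fit_nb(P + M, labels_p + labels_m)  # initial classifier
    for _ in range(max_iter):
        # Split M into seg (c1) and non-seg (c2) with the current classifier.
        new_labels = [predict_nb(model, feats) for feats in M]
        if new_labels == labels_m:
            break  # converged: no blank changed class
        labels_m = new_labels
        # Rebuild the classifier from P, seg, and non-seg.
        model = fit_nb(P + M, labels_p + labels_m)
    return model, labels_m


if __name__ == "__main__":
    P = [["x", "y"], ["x", "z"]]
    M = [["x", "y"], ["q", "r"], ["q", "s"], ["x", "z"]]
    _, labels = partially_supervised_em(P, M)
    print(labels)
```

On this toy input the blanks in M that share features with P move to c1 after the first round, and the labels stop changing in the second.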
CN201711444997.XA 2017-12-27 2017-12-27 Chinese generalized text segmentation method based on partial supervised learning Active CN108009156B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711444997.XA CN108009156B (en) 2017-12-27 2017-12-27 Chinese generalized text segmentation method based on partial supervised learning

Publications (2)

Publication Number Publication Date
CN108009156A true CN108009156A (en) 2018-05-08
CN108009156B CN108009156B (en) 2020-05-19

Family

ID=62061806

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711444997.XA Active CN108009156B (en) 2017-12-27 2017-12-27 Chinese generalized text segmentation method based on partial supervised learning

Country Status (1)

Country Link
CN (1) CN108009156B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016060687A1 (en) * 2014-10-17 2016-04-21 Machine Zone, Inc. System and method for language detection
CN106383853A (en) * 2016-08-30 2017-02-08 刘勇 Realization method and system for electronic medical record post-structuring and auxiliary diagnosis
CN107491439A (en) * 2017-09-07 2017-12-19 成都信息工程大学 A kind of medical science archaic Chinese sentence cutting method based on Bayesian statistics study

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108009156B (en) * 2017-12-27 2020-05-19 成都信息工程大学 Chinese generalized text segmentation method based on partial supervised learning

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108009156B (en) * 2017-12-27 2020-05-19 成都信息工程大学 Chinese generalized text segmentation method based on partial supervised learning
CN110110326A (en) * 2019-04-25 2019-08-09 西安交通大学 A kind of text cutting method based on subject information
CN110110326B (en) * 2019-04-25 2020-10-27 西安交通大学 Text cutting method based on subject information
CN110457595A (en) * 2019-08-01 2019-11-15 腾讯科技(深圳)有限公司 Emergency event alarm method, device, system, electronic equipment and storage medium
CN110532568A (en) * 2019-09-05 2019-12-03 哈尔滨理工大学 Chinese Word Sense Disambiguation method based on tree feature selecting and transfer learning
CN110532568B (en) * 2019-09-05 2022-07-01 哈尔滨理工大学 Chinese word sense disambiguation method based on tree feature selection and transfer learning

Also Published As

Publication number Publication date
CN108009156B (en) 2020-05-19

Similar Documents

Publication Publication Date Title
CN107168955B (en) Utilize the Chinese word cutting method of the word insertion and neural network of word-based context
Mouchere et al. Icdar 2013 crohme: Third international competition on recognition of online handwritten mathematical expressions
CN107766324B (en) Text consistency analysis method based on deep neural network
Tweedie et al. Neural network applications in stylometry: The Federalist Papers
CN1977261B (en) Method and system for word sequence processing
CN108009156A (en) A kind of Chinese generality text dividing method based on partial supervised study
CN109740154A (en) A kind of online comment fine granularity sentiment analysis method based on multi-task learning
CN103207913B (en) The acquisition methods of commercial fine granularity semantic relation and system
CN104331506A (en) Multiclass emotion analyzing method and system facing bilingual microblog text
CN106980609A (en) A kind of name entity recognition method of the condition random field of word-based vector representation
CN108536870A (en) A kind of text sentiment classification method of fusion affective characteristics and semantic feature
Wang et al. Semeval-2021 task 9: Fact verification and evidence finding for tabular data in scientific documents (sem-tab-facts)
CN105183833A (en) User model based microblogging text recommendation method and recommendation apparatus thereof
CN103020167B (en) A kind of computer Chinese file classification method
CN102200969A (en) Text sentiment polarity classification system and method based on sentence sequence
Zamora-Reina et al. LSCDiscovery: A shared task on semantic change discovery and detection in Spanish
CN111708888A (en) Artificial intelligence based classification method, device, terminal and storage medium
CN110134934A (en) Text emotion analysis method and device
CN108228569A (en) A kind of Chinese microblog emotional analysis method based on Cooperative Study under the conditions of loose
CN104899188A (en) Problem similarity calculation method based on subjects and focuses of problems
Brocardo et al. Verifying online user identity using stylometric analysis for short messages
Pacheco et al. Random Forest with Increased Generalization: A Universal Background Approach for Authorship Verification.
CN103678318B (en) Multi-word unit extraction method and equipment and artificial neural network training method and equipment
CN111222318A (en) Trigger word recognition method based on two-channel bidirectional LSTM-CRF network
Hätty et al. Predicting degrees of technicality in automatic terminology extraction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant