CN101645083A

CN101645083A - Acquisition system and method of text field based on concept symbols

Info

Publication number: CN101645083A
Application number: CN200910077018A
Authority: CN
Inventors: 韦向峰; 黄曾阳; 张全; 缪建明
Original assignee: Institute of Acoustics CAS
Current assignee: Institute of Acoustics CAS
Priority date: 2009-01-16
Filing date: 2009-01-16
Publication date: 2010-02-10
Anticipated expiration: 2029-01-16
Also published as: CN101645083B

Abstract

The invention discloses a system and a method for acquiring the field of a text based on concept symbols. The system comprises a concept symbol set for expressing word concepts and field categories, a word knowledge base for storing words and their concept symbols, a word segmentation processor, a sentence semantic analyzer and a field arbiter. The method comprises the following steps: (1) segmenting an input text into paragraphs, sentences and words; (2) performing semantic analysis on the sentences to obtain their concept categories and semantic chunks; (3) obtaining the activating words in the sentences according to the field concept symbol set and the semantic concept symbols in the word knowledge base; (4) comprehensively scoring the field concept symbols of the activating words and taking the field concept symbol with the highest score as the field of the sentence; (5) merging the sentences in each paragraph according to their field concept symbols to obtain sentence groups and their fields; and (6) obtaining the field of the text according to the title of the text and the frequency of occurrence and position of the sentence groups in the text.

Description

A system and method for acquiring the field of a text based on concept symbols

Technical field

The present invention relates to the processing of natural-language text information by means of computer science and technology, and in particular to a system and method for acquiring the field of a text based on concept symbols.

Background technology

Text classification is the method and process of using a computer to assign a text to one or more field categories according to certain rules, knowledge and steps. The conventional approach represents each text as a feature vector; when the "angle" between the feature vectors of two texts is smaller than a given angle, the two texts are assigned to the same category. Words are generally chosen as the text features that make up the feature vector, and the vector is usually built with the TF*IDF method or the TF*IWF method derived from it. In TF*IDF, the value of a feature word in the vector is the product of the word's frequency of occurrence in the document and the inverse of its frequency of occurrence in the document collection. The k-nearest-neighbor method, the Bayes method, support vector machines, neural networks, decision trees and similar classification methods are all statistical methods built on the vector space model of text. They require a large set of texts classified in advance for parameter training before classification, after which a new text can be assigned to one of the categories defined in training. A Chinese patent document (publication number CN100353361) discloses a method and apparatus for a new feature vector weight for text classification, which introduces DBV and the n-th root of TF on the basis of the TF*IWF method. In experiments that select different numbers of feature words for each classification field by word frequency (50, 100, 200, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000), it finds that the experimental system performs well with 3500 words.

Because text classification requires the set of field categories and the classification criteria to be known in advance, it is difficult to apply when the categories are uncertain or a training text set is hard to obtain. Text clustering techniques emerged for this reason. A typical representative of the commonly used text clustering methods is the K-Means algorithm. First, K texts are chosen arbitrarily from the text set as cluster centers, and every other text is assigned to the nearest cluster according to the "distance" between its feature vector and the cluster centers. Then the mean of the feature vectors of all texts in each of the K clusters is taken as the new cluster center, and all texts are clustered again by their distance to the new centers. The computation iterates in this way until the evaluation function converges. However, the field categories obtained by automatic text clustering are very coarse, and because the clustering is not guided toward particular kinds of fields, its results are hard to match to actual needs. Moreover, the same clustering method may work well on one text set but poorly on another, so text clustering has shortcomings in both practicality and stability.

In summary, statistical text classification methods need a large corpus classified in advance, which is often hard to provide at classification time. Text clustering overcomes this shortcoming, but its results are hard to reconcile with the actual classification needs.

Summary of the invention

To overcome the above problems of the prior art, the invention provides a system and method for acquiring the field of a text based on concept symbols. The system and method feature configurable classification criteria and a rule-based classification procedure. They can obtain the basic field category of a text without a training corpus, allow the classification categories to be customized to actual needs, and can also be used for automatic text clustering.

To achieve the above object, the invention provides a system for acquiring the field of a text based on concept symbols which, as shown in Fig. 1, comprises:

A field concept symbol set, used to express word concepts and field categories and to provide the required field concept symbols to the field arbiter.

A word knowledge base, used to store words and their concept symbols and to provide them to the word segmentation processor and the sentence semantic analyzer.

A word segmentation processor, used to segment the input text into paragraphs, sentences and words and to pass them to the sentence semantic analyzer.

A sentence semantic analyzer, used to perform semantic analysis on each sentence and obtain the concept category of the sentence and the semantic chunks that make it up, including the role, boundaries and internal composition of each semantic chunk.

A field arbiter, used first to obtain the activating words in a sentence according to the field concept symbol set and the semantic concept symbols in the word knowledge base; then to comprehensively score the field concept symbols of the activating words according to the semantic chunk type of each activating word, the relations between field concept symbols, and their frequency and position of occurrence in the sentence, taking the field concept symbol with the highest score as the field of the sentence; then to merge the sentences in a paragraph according to their field concept symbols, obtaining sentence groups and their fields; and finally to obtain the field of the input text according to the title of the input text and the frequency and position of occurrence of the sentence groups in the input text.

The role types of the semantic chunks are: feature semantic chunk E, actor semantic chunk A, object semantic chunk B and content semantic chunk C. The feature semantic chunk type E is further divided into two types: a) the global feature semantic chunk Eg, which is the feature semantic chunk E at the first level of the sentence; b) the local feature semantic chunk El, which is the feature semantic chunk E of a sentence S' nested inside a semantic chunk.

The field concept symbol set comprises the following upper-level node symbols:

"71, 72" denote psychological activity and mental state; "8" denotes human thinking activity; "a, b" denote professional and pursuit activities (second-class labor); "d" denotes theory activity; "q6" denotes first-class labor; "q7" denotes extra-professional activity; "q8" denotes faith activity; "6m" denotes instinctive activity, where m = 0 to 5; "3228α" denotes calamity, where α = 8 to b; "503, 50α" denote state, where α = 8 to b.

Field concept upper-level node	Field expressed
71, 72	Psychological activity and mental state
8	Human thinking activity
a, b	Professional and pursuit activities (second-class labor)
d	Theory activity
q6	First-class labor
q7	Extra-professional activity
q8	Faith activity
6m (m = 0 to 5)	Instinctive activity
3228α (α = 8 to b)	Calamity
503, 50α (α = 8 to b)	State

The set also comprises the more specific field concept node symbols obtained by extending these upper-level nodes downward.

The field arbiter determines the field of a sentence S as follows. First, the type of the semantic chunk in which each activating word lies is obtained from the result of sentence category analysis (SCA). Then the field of S is determined by examining the semantic chunk types in the order global feature semantic chunk Eg > local feature semantic chunk El > content semantic chunk C > (object semantic chunk B or actor semantic chunk A). When several activating words (W_1, W_2, ..., W_n) occur in semantic chunks of the same type, and the field concept symbols corresponding to them are (D_1, D_2, ..., D_n), the score of each field concept symbol in the sentence is calculated by the following formula:

S(D_i) = Rel(i) + Fre(i) + Pos(i), 1 ≤ i ≤ n;

where Rel(i) is the relation score of the i-th field concept symbol D_i with respect to the other field concept symbols D_j (j ≠ i, 1 ≤ j ≤ n) in the sentence; Fre(i) reflects the frequency of occurrence of D_i in sentence S and is larger the higher the frequency; Pos(i) reflects the position of occurrence of D_i in sentence S and is larger the later the position. The field concept symbol D_i with the highest score S(D_i) is taken as the field of sentence S.

The principles by which the field arbiter determines the field of the text further include: if the text has a title, the field of the title is taken as the field of the text; if the text has no title, the sentence group field that has the highest frequency of occurrence and appears earliest in the text is taken as the field of the text.

The method for acquiring the field of a text based on concept symbols provided by the invention, as shown in Fig. 2, comprises the following steps:

(1) Paragraph, sentence and word segmentation: the word segmentation processor segments the input text into paragraphs, sentences and words.

In the computer, an input text is treated as a character string T. Taking the "carriage return, line feed" characters in T as cut points, the text T is cut into several paragraphs P. Taking characters such as the full stop, question mark, exclamation mark and semicolon in a paragraph P as cut points, the paragraph P is cut into several sentences S.
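The paragraph and sentence cutting described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the exact set of sentence-ending characters (full-width and ASCII question mark, exclamation mark and semicolon, plus the Chinese full stop) is an assumption, since the description only names the punctuation types.

```python
import re

# Assumed set of sentence-ending characters; the description only lists
# "full stop, question mark, exclamation mark, semicolon" without code points.
_TERMINATORS = "。？！；?!;"
_SENTENCE_RE = re.compile(f"[^{_TERMINATORS}]+[{_TERMINATORS}]?")


def split_paragraphs(text):
    """Cut a text T into paragraphs P at carriage returns / line feeds."""
    return [p.strip() for p in re.split(r"[\r\n]+", text) if p.strip()]


def split_sentences(paragraph):
    """Cut a paragraph P into sentences S, keeping the end punctuation."""
    return [s.strip() for s in _SENTENCE_RE.findall(paragraph) if s.strip()]


if __name__ == "__main__":
    sample = "第一句。第二句？\r\n第三句！尾巴"
    for p in split_paragraphs(sample):
        print(split_sentences(p))
```

The ASCII full stop is deliberately left out of the terminator set so that decimals and host names are not cut apart; a real system would need a more careful rule.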

A sentence S consists of Chinese characters and other characters. Let A, B and C be Chinese characters occurring in S. If "AB" is a word in the word knowledge base, "ABC" is segmented as "AB/C"; likewise, if "BC" is a word in the dictionary, "ABC" is segmented as "A/BC". If both "AB" and "BC" are words in the dictionary, "ABC" is segmented as "A/BC" according to the left-cut principle; if "ABC" is a word in the dictionary, it is segmented as "/ABC/" according to the longest-word principle. The sentence S is thus cut into several words W, and word segmentation is complete.
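One simple procedure that reproduces all four cases above ("AB/C", "A/BC", "A/BC" when both two-character words exist, and "ABC" when the long word exists) is backward maximum matching. The description does not name the algorithm it uses, so the choice of backward maximum matching, the function name `segment` and the `lexicon` parameter are assumptions made for illustration.

```python
def segment(sentence, lexicon):
    """Segment a sentence by backward maximum matching (an assumed algorithm).

    Scans from the end of the sentence and, at each step, takes the longest
    lexicon word that ends at the current position; unknown characters are
    emitted as single-character words.
    """
    max_len = max((len(w) for w in lexicon), default=1)
    words = []
    end = len(sentence)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            candidate = sentence[end - size:end]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                end -= size
                break
    words.reverse()
    return words


if __name__ == "__main__":
    print(segment("ABC", {"AB"}))               # AB / C
    print(segment("ABC", {"BC"}))               # A / BC
    print(segment("ABC", {"AB", "BC"}))         # A / BC
    print(segment("ABC", {"AB", "BC", "ABC"}))  # ABC
```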

(2) Sentence semantic analysis: the sentence semantic analyzer performs semantic analysis on each sentence and obtains the concept category of the sentence and the semantic chunks that make it up, including the role, boundaries and internal composition of each semantic chunk.

For each sentence S, the analysis yields its semantic category (sentence class) code SCode, its format code SFomat, its sentence class expression SExpression, and, for each semantic chunk that makes up the sentence, its kind, its extent, its specific name in the sentence class expression, and so on. In particular, the type of each semantic chunk is determined as E (feature semantic chunk), A (actor semantic chunk), B (object semantic chunk) or C (content semantic chunk). The feature semantic chunk type E is further divided into two types: Eg (global feature semantic chunk), which is the feature semantic chunk E at the first level of the sentence; and El (local feature semantic chunk), which is the feature semantic chunk E of a sentence S' nested inside a semantic chunk.

(3) Obtaining activating words: the field arbiter obtains the activating words in the sentence according to the field concept symbol set and the semantic concept symbols in the word knowledge base.

An activating word is a word in sentence S that contains a field concept symbol. The word knowledge base contains: word form, tone, number of senses, sense number, concept category, word frequency and context, semantic knowledge, sentence class code, format conversion, and so on. The semantic knowledge is expressed in concept primitive symbols, and the field symbols are a subset of the concept primitive symbol system, so the concept symbols of a word may contain field concept symbol information. Not every concept primitive node in the system is used to describe fields. The upper-level nodes of field-related concepts are: 71, 72 (psychological activity and mental state); 8 (human thinking activity); a, b (professional and pursuit activities (second-class labor)); d (theory activity); q6 (first-class labor); q7 (extra-professional activity); q8 (faith activity); 6m (m = 0 to 5) (instinctive activity); 3228α (α = 8 to b) (calamity); 503, 50α (α = 8 to b) (state). These upper-level nodes can be extended downward to obtain more specific field concept node symbols. For example, a (professional activity) extends downward to: a1 (politics), a2 (economy), a3 (culture), a4 (military affairs), a5 (law), a6 (science and technology), a7 (education), a8 (health and protection); and a1 (politics) can in turn be extended downward to: a11 (regime activity), a113 (change of top leader (of a country or local government)), a113b (election).

The concept symbols of fields and the concept symbols of the semantic knowledge in the word knowledge base use the same concept primitive symbol system. When an upper-level node of a field concept symbol, or a node derived from it, occurs in the concept symbols of the semantic knowledge of a word W, the word W is an activating word. A field concept symbol expresses a field of a certain level or type, and all field concept symbols contained in the activating words of a sentence S are taken as the candidate fields of S.
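The test for activating words can be illustrated with a small sketch. It assumes that a "derived node" is a symbol string that begins with one of the upper-level node symbols, that "6m" expands to 60 through 65, and that the range "8 to b" means the digits 8, 9, a, b; none of these encodings is spelled out in the description. The dictionary-based knowledge base and the non-field symbol "x0" are hypothetical.

```python
# Upper-level field nodes from the description. The expansion of "6m" and of
# the alpha range is an assumption about how the symbols are written out.
UPPER_NODES = (
    "71", "72", "8", "a", "b", "d", "q6", "q7", "q8",
    *(f"6{m}" for m in range(6)),
    *(f"3228{x}" for x in "89ab"),
    "503",
    *(f"50{x}" for x in "89ab"),
)


def field_symbols(concept_symbols):
    """Keep the concept symbols that start with an upper-level field node."""
    return [s for s in concept_symbols if s.startswith(UPPER_NODES)]


def activating_words(words, knowledge_base):
    """Return (word, field symbols) for every word carrying a field symbol.

    knowledge_base maps a word to its list of concept symbols (toy structure).
    """
    result = []
    for word in words:
        fields = field_symbols(knowledge_base.get(word, []))
        if fields:
            result.append((word, fields))
    return result


if __name__ == "__main__":
    kb = {"match": ["q73"], "semifinals": ["q734"], "Malaysia": ["x0"]}
    print(activating_words(["Malaysia", "match", "semifinals"], kb))
```

Prefix matching is the simplest reading of "derived node"; symbols with richer internal structure would need a proper parser.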

(4) Sentence field determination: the field arbiter comprehensively scores the field concept symbols of the activating words according to the semantic chunk type of each activating word, the relations between field concept symbols, and their frequency and position of occurrence in the sentence, and takes the field concept symbol with the highest score as the field of the sentence.

In step (4), the field of the sentence derives from the field concept symbols of the activating words. When the sentence S contains several activating words, its field is determined as follows. First, the type of the semantic chunk in which each activating word lies is obtained from the result of sentence category analysis (SCA). Then the field of S is determined by examining the semantic chunk types in the order global feature semantic chunk Eg > local feature semantic chunk El > content semantic chunk C > (object semantic chunk B or actor semantic chunk A): if Eg contains an activating word W, the field concept symbol of W is taken as the field of the sentence; if Eg contains no activating word, the field is taken from El; if El contains none, from C; and if C contains none, from B or A.
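The chunk-type priority can be expressed as a short lookup. The sketch below is illustrative only: the tuple layout `(word, field_symbol, chunk_type)` and the function name are hypothetical, and the chunk assignments in the usage example are made up for demonstration.

```python
# Priority order from the description: Eg > El > C > (B or A).
CHUNK_PRIORITY = (("Eg",), ("El",), ("C",), ("B", "A"))


def select_candidates(activations):
    """Pick the activating words from the highest-priority chunk type.

    activations: list of (word, field_symbol, chunk_type) tuples.
    Returns the (word, field_symbol) pairs found at the first priority
    level that contains any activating word, or [] if there are none.
    """
    for level in CHUNK_PRIORITY:
        hits = [(w, d) for (w, d, chunk) in activations if chunk in level]
        if hits:
            return hits
    return []


if __name__ == "__main__":
    acts = [("match", "q73", "C"), ("semifinals", "q734", "C"),
            ("player", "q730", "B")]
    print(select_candidates(acts))
```

When the returned list holds more than one pair, the scoring formula of step (4) decides among them.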

When several activating words (W_1, W_2, ..., W_n) occur in semantic chunks of the same type, and the field concept symbols corresponding to them are (D_1, D_2, ..., D_n), the score of each field concept symbol in the sentence is calculated by the formula S(D_i) = Rel(i) + Fre(i) + Pos(i), 1 ≤ i ≤ n. In this formula, Rel(i) is the relation score of the i-th field concept symbol D_i with respect to the other field concept symbols D_j (j ≠ i, 1 ≤ j ≤ n) in the sentence; Fre(i) reflects the frequency of occurrence of D_i in sentence S and is larger the higher the frequency; Pos(i) reflects the position of occurrence of D_i in sentence S and is larger the later the position. The field concept symbol D_i with the highest score S(D_i) is taken as the field of sentence S.

The value of Rel(i) derives from the relations between the field concept symbols D_i and D_j. When D_i is a concept extension of D_j, 1 is added to the score of D_i; when D_i is strongly correlated with D_j, 1 is added to the score of D_i. If, after S(D_i) has been calculated, D_i is the field of the sentence but is preceded by a negating concept that modifies it, then D_i' (its opposite field concept symbol) is taken as the field of the sentence. If, after S(D_i) has been calculated, D_i is the field of the sentence, the Rel + Fre score of D_j is the same as that of D_i, and D_i and D_j are child nodes of the same concept node, then the field concept symbol of their common parent node is taken as the field of the sentence.

If an activating word W_i (1 ≤ i ≤ n) has several field concept symbols (D_i1, D_i2, ..., D_im), the field score S must be calculated for each of these m symbols, except that the relations between D_ij and D_ik (j ≠ k, 1 ≤ j ≤ m, 1 ≤ k ≤ m) need not be considered when calculating Rel. If the final scores S(D_ij) and S(D_ik) are still equal, the field concept symbol listed first in the word knowledge base is taken as the field of sentence S.
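A minimal sketch of the score S(D_i) = Rel(i) + Fre(i) + Pos(i) follows. The description fixes only the direction of Fre and Pos (larger for higher frequency and later position), so the concrete choices here are assumptions: Fre is the raw count, Pos is the 1-based position divided by the number of symbols, "concept extension" is tested as a string-prefix relation, and strong correlation is supplied by the caller as a set of symbol pairs. The negation rule, the common-parent rule and the per-word tie-break are not implemented.

```python
def score_fields(symbols, strongly_related=frozenset()):
    """Score field concept symbols given in order of appearance in a sentence.

    strongly_related: set of frozenset({a, b}) pairs treated as strongly
    correlated (hypothetical input; the description does not say how such
    pairs are stored).
    """
    n = len(symbols)
    scores = {}
    for i, d_i in enumerate(symbols):
        others = {d for d in symbols if d != d_i}
        # +1 for every other symbol that d_i extends (prefix assumption)
        rel = sum(1 for d_j in others if d_i.startswith(d_j))
        # +1 for every other symbol that d_i is strongly correlated with
        rel += sum(1 for d_j in others
                   if frozenset((d_i, d_j)) in strongly_related)
        fre = symbols.count(d_i)   # assumed: raw frequency
        pos = (i + 1) / n          # assumed: later position, larger value
        scores[d_i] = max(scores.get(d_i, 0.0), rel + fre + pos)
    return scores


def sentence_field(symbols, strongly_related=frozenset()):
    """Return the highest-scoring field concept symbol, or None."""
    scores = score_fields(symbols, strongly_related)
    return max(scores, key=scores.get) if scores else None


if __name__ == "__main__":
    # Two unrelated symbols with equal frequency: the later one wins on Pos.
    print(sentence_field(["a02", "q734"]))
```

With two unrelated symbols of equal frequency the later one wins on position alone, which is the same tie-breaking pattern the description attributes to Pos.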

(5) Sentence group and field determination: the field arbiter merges the sentences in a paragraph according to their field concept symbols, obtaining sentence groups and their fields.

A sentence group consists of consecutive sentences that describe the same central topic. The central topic of a sentence group is the topic or field expressed by identical or similar field concept symbols. The smallest sentence group is a single sentence, and the largest is a paragraph. In step (5), for the sentences (S_1, S_2, ..., S_n) in a paragraph P_i of text T, the sentence group to which each sentence belongs is determined by the following steps, as shown in Fig. 3:

(5a) Take the first sentence S_1 as sentence group G_1, and take the field D_1 of S_1 as the field D_G1 of G_1;

(5b) Let S_1 be the current sentence S_i and G_1 the current sentence group G_j; go to (5g);

(5c) If the field D_i of S_i is a symbol extension of the field D_(i-1) of S_(i-1), S_i is assigned to G_j and the field of G_j is changed to D_i; go to (5g);

(5d) If the field D_(i-1) of S_(i-1) is a symbol extension of the field D_i of S_i, S_i is assigned to G_j; go to (5g);

(5e) If the field D_i of the current sentence S_i is the same as the field D_(i-1) of the previous sentence S_(i-1), S_i is assigned to G_j; go to (5g);

(5f) Take the sentence S_(i+1) following S_i as a new sentence group G_(j+1), whose field D_G(j+1) is the field D_(i+1) of S_(i+1);

(5g) If the current sentence S_i is the last sentence S_n, go to (5n);

(5k) If the field of S_i is empty and S_i is S_1, S_2 is assigned to G_1, the field of G_1 is changed to D_2, and S_2 becomes the current sentence S_i; go to (5c);

(5l) If the field of S_i is empty and S_i is not S_1, S_i is assigned to G_j; go to (5g);

(5m) If the field of S_i is not empty, S_(i+1) becomes the current sentence S_i; go to (5c);

(5n) For all sentence groups G_j obtained, where 1 ≤ j ≤ m and 1 ≤ m ≤ n, adjacent sentence groups with the same field are merged into one sentence group.

Through the above steps and the merging operation, a paragraph is divided into several sentence groups, and their fields are determined from the fields of the sentences. This accomplishes the division of a paragraph into sentence groups and the determination of their fields.
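The grouping procedure can be sketched as a single pass over the sentence fields of one paragraph. The step list as printed leaves some jumps ambiguous, so the sketch follows a simplified reading that is an assumption, not the patented flow: a sentence joins the current group when its field equals, extends, or is extended by the field of the previous sentence that had one; a sentence with an empty field joins the current group; anything else opens a new group. "Symbol extension" is again taken to be a string-prefix relation, and all names are hypothetical.

```python
def is_extension(child, parent):
    """True if `child` extends `parent` (prefix assumption)."""
    return len(child) > len(parent) and child.startswith(parent)


def group_sentences(fields):
    """Group the sentences of one paragraph by field.

    fields: one field concept symbol per sentence, '' or None if empty.
    Returns a list of [group_field, [sentence indices]].
    """
    groups = []
    prev = None  # field of the latest sentence that had a field
    for idx, d in enumerate(fields):
        d = d or None
        if not groups:
            groups.append([d, [idx]])
        elif d is None:
            groups[-1][1].append(idx)       # empty field: stay in group
        elif prev is None:
            groups[-1][0] = d               # group had no field yet
            groups[-1][1].append(idx)
        elif is_extension(d, prev):
            groups[-1][0] = d               # more specific field wins
            groups[-1][1].append(idx)
        elif d == prev or is_extension(prev, d):
            groups[-1][1].append(idx)
        else:
            groups.append([d, [idx]])       # unrelated field: new group
        if d is not None:
            prev = d
    # Final pass: merge adjacent groups that ended up with the same field.
    merged = []
    for field, members in groups:
        if merged and merged[-1][0] == field:
            merged[-1][1].extend(members)
        else:
            merged.append([field, list(members)])
    return merged


if __name__ == "__main__":
    print(group_sentences(["q73", "q734", "q734", "", "a1", "a1"]))
```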

(6) Text field determination: the field arbiter obtains the field of the input text according to the text title and the frequency and position of occurrence of the sentence groups in the input text.

Step (6) further includes: if the input text has a title, the field of the title is taken as the field of the input text. If the title paragraph P_1 contains only one sentence group, the field of that sentence group is the field of the text; if P_1 contains several sentence groups, the fields of the first and the last sentence group in P_1 are taken together as the field of the text.

If the text has no title, the fields of all sentence groups in the text are taken as candidates for the field of the text. The fields of the n sentence groups in text T, in order of appearance, are denoted D = (D_G1, D_G2, ..., D_Gn), and the following steps are carried out from D_G1 to D_Gn, as shown in Fig. 4:

(6a) Let D_G1 be D_Gi; count the number C_Gi of fields in D whose field concept symbol is the same as that of D_Gi, and store D_Gi and C_Gi in the table HTab;

(6b) If D_Gi is D_Gn, go to (6f);

(6c) Let D_G(i+1) be D_Gi;

(6d) If the field concept symbol of D_Gi is already stored in the table HTab, go to (6c);

(6e) Count the number C_Gi of fields in D whose field concept symbol is the same as that of D_Gi, store D_Gi and C_Gi in the table HTab, and go to (6b);

(6f) Obtain the table HTab = ((D_G1, C_G1), ..., (D_Gm, C_Gm)), where 1 ≤ m ≤ n;

(6g) Sort the elements (D_Gj, C_Gj), 1 ≤ j ≤ m, of the table HTab by C_Gj in descending order to obtain the new table HTab' = ((D_G1', C_G1'), ..., (D_Gm', C_Gm')).

The field concept symbol of the first element of the new table is taken as the field of text T. When text T has a title, the field of T need not be obtained by the above steps.
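Step (6) as a whole can be sketched as follows. The function and parameter names are hypothetical. The table HTab is modelled as a dictionary that keeps first-appearance order, and the descending sort is stable, so among equally frequent fields the one that appears first stays in front; that this is the intended tie-break is an assumption.

```python
def text_field(group_fields, title_group_fields=None):
    """Return the field(s) of a text as a list of field concept symbols.

    group_fields: fields of all sentence groups, in order of appearance.
    title_group_fields: fields of the sentence groups in the title
        paragraph, or None / empty if the text has no title.
    """
    if title_group_fields:
        if len(title_group_fields) == 1:
            return [title_group_fields[0]]
        # several groups in the title: first and last group jointly
        return [title_group_fields[0], title_group_fields[-1]]

    # No title: build HTab (field -> count) in order of first appearance.
    htab = {}
    for d in group_fields:
        htab[d] = htab.get(d, 0) + 1
    # Stable sort by count, largest first (HTab').
    ranked = sorted(htab.items(), key=lambda item: item[1], reverse=True)
    return [ranked[0][0]] if ranked else []


if __name__ == "__main__":
    print(text_field(["a1", "q734", "q734"]))
    print(text_field(["a1"], title_group_fields=["q734"]))
```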

The advantages of the invention are:

1. When the text field acquisition system and method provided by the invention are used for text classification, no large pre-classified corpus is needed; only the field concept symbols relevant to the classification categories need to be determined.

2. The field concept symbols of the text field acquisition system and method provided by the invention are hierarchical, so they can accommodate both broad categories at the same level and cross-level classification into specific, fine-grained categories.

3. The text field acquisition system and method provided by the invention mainly use semantic analysis and go down to the concept level to determine the field category of a text, while also introducing frequency as a statistical characteristic, which makes the acquisition of text fields more accurate and better suited to large-scale text processing.

4. The sentence group fields proposed by the text field acquisition system and method provided by the invention can be used for text classification, and also for cluster analysis and topic analysis of text.

Description of drawings

Fig. 1 is a structural diagram of the text field acquisition system of the invention;

Fig. 2 is a flowchart of the text field acquisition method of the invention;

Fig. 3 is a flowchart of the method of the invention for determining sentence groups and their fields;

Fig. 4 is a flowchart of the text field acquisition method of the invention when the text has no title.

Embodiment

The invention is described in detail below with reference to a specific embodiment and the accompanying drawings.

First, 11 news reports on competitions at the 2004 Athens Olympic Games were downloaded from the Internet, totaling 60 paragraphs and 6501 Chinese characters.

Second, following the design principles and design symbols in "Fundamental theorem in language concept space and mathematical physics expression" (Maritime Press, July 2004), the concept symbols of the q73 (competition) field were refined in detail to obtain a concept symbol set for the competition field. Words related to competition and their semantic knowledge were also added to the word knowledge base.

Third, the word segmentation processor is used to perform paragraph, sentence and word segmentation on a text, for example the following text:

Title: Malaysia's "little standard-bearer" misses the diving semifinals by one place

www.xinhuanet.com, Athens, August 27: In the men's 10 m platform diving competition of the Olympic Games held on the afternoon of the 27th local time, Brian-Nickerson of Malaysia ranked 19th in the preliminary contest and failed to advance to the semifinals. According to the rules, of the 33 players in the preliminary contest, the 18 with the best results advance to the semifinals.

After processing by the word segmentation processor, the result is as follows:

[Title :] [Malaysia] ["little standard-bearer"] [one poor] [not advancing] [diving] [semifinals]

[www.xinhuanet.com] [Athens] [August 27] electricity is in [Olympic Games] [man] [ten meters] [diving tower] [diving] of [locality] [time] [27 days] [afternoon] [holding] [match]

[from] Brian-Nickerson's [preliminary contest] [achievement] [rank] [19] of [Malaysia]

[failing] [promotion] [semifinals]

[according to] [rule]

In [33] [player] of [preliminary contest]

[player] [promotion] [semifinals] of [18] before [achievement] comes

Fourth, the sentence semantic analyzer is used to analyze the sentences, and the field arbiter is then used to obtain the activating words and to analyze the sentence groups and their fields. After the sentence group fields are merged, the following result is obtained:

//DOM：(q734)

Title:[Malaysia] difference of [" little standard-bearer "] do not advance [diving (a339 4)] [semifinals (q734)]

The www.xinhuanet.com [Athens (a219 10pw)] August 27 is in [match (q73)] of [Olympic Games (a339i)] [man] ten meters [diving tower (a339 4)] [divings (a339 4)] of 27 days [locality] [time] [afternoon] [holding (a02)], [from] Brian-Nickerson [preliminary contest (q734)] [achievement (a0099b)] of [Malaysia] [rank (q730e25d0[n])] the 19, [failing] [being promoted to (a01ad0ne25)] [semifinals (q734)].[according to] [rule (a009a9)], in 33 [player (q730)] of [preliminary contest (q734)], [achievement (a0099b)] comes preceding 18 [player (q730)] [being promoted to (a01ad0ne25)] [semifinals (q734)].

In this text, the first sentence is "Title: Malaysia's 'little standard-bearer' misses the diving semifinals by one place", and its semantic analysis result is "Title: Malaysia's 'little standard-bearer' (SB) || misses by one place and does not advance (S0) || diving semifinals (SC)". Because the global feature semantic chunk Eg (i.e. S0) carries no field concept symbol information, the field of the sentence is chosen from the content semantic chunk C (i.e. SC), which does contain field information. Both "diving" and "semifinals" in the SC semantic chunk carry field concept symbol information. When their field scores are calculated, their relation scores and frequency scores are the same, but the position score of "semifinals" is greater than that of "diving", so the field of the sentence is "q734". The first paragraph consists of just this one sentence, so the whole paragraph is one sentence group, whose field is "q734". Because the first paragraph is the text title, the field of the text is "q734".

In this way, by analyzing the type of semantic chunk in which each activating word lies, together with the position and frequency of the words, the field of each sentence and of each sentence group can be obtained from the field concept symbols of the activating words, and finally the field of the text.

Claims

1. A system for acquiring the field of a text based on concept symbols, characterized in that the system comprises:

a field concept symbol set, used to express word concepts and field categories and to provide the required field concept symbols to the field arbiter;

a word knowledge base, used to store words and their concept symbols and to provide them to the word segmentation processor and the sentence semantic analyzer;

a word segmentation processor, used to segment the input text into paragraphs, sentences and words and to pass them to the sentence semantic analyzer;

a sentence semantic analyzer, used to perform semantic analysis on each sentence and obtain the concept category of the sentence and the semantic chunks that make it up, including the role, boundaries and internal composition of each semantic chunk;

2. The text field acquisition system according to claim 1, characterized in that the role types of the semantic chunks are: feature semantic chunk E, actor semantic chunk A, object semantic chunk B and content semantic chunk C; the feature semantic chunk type E is divided into two types: a) the global feature semantic chunk Eg, which is the feature semantic chunk E at the first level of the sentence; b) the local feature semantic chunk El, which is the feature semantic chunk E of a sentence S' nested inside a semantic chunk.

3. The text field acquisition system according to claim 1, characterized in that the field concept symbol set comprises the following upper-level node symbols:

Field concept upper-level node	Field expressed
71, 72	Psychological activity and mental state
8	Human thinking activity
a, b	Professional and pursuit activities
d	Theory activity
q6	First-class labor
q7	Extra-professional activity
q8	Faith activity
6m (m = 0 to 5)	Instinctive activity
3228α (α = 8 to b)	Calamity
503, 50α (α = 8 to b)	State

4, the system that obtains of text field according to claim 1 is characterized in that, described field arbiter is determined the field of statement S as follows: at first, obtain to activate the type of word semantic chunk of living in from the result that statement semantics is analyzed; Then, determine the field of statement S successively by the semantic chunk type sequence of " global characteristics semantic chunk Eg＞local feature semantic chunk El＞contents semantic piece C＞object semantic chunk B or actor semantic chunk A "; A plurality of activation word W are arranged in same type semantic chunk ₁, W ₂..., W _nThe time, the field concept symbol of hypothesis activation word correspondence is respectively D ₁, D ₂..., D _n, calculate the score of each field concept symbol in statement according to following computing formula so:

$S(D_i) = \mathrm{Rel}(i) + \mathrm{Fre}(i) + \mathrm{Pos}(i), \quad 1 \le i \le n;$

wherein $\mathrm{Rel}(i)$ denotes the relation score between the i-th field concept symbol $D_i$ and the other field concept symbols $D_j$ ($j \ne i$, $1 \le j \le n$) in the statement; $\mathrm{Fre}(i)$ denotes the frequency of occurrence of the i-th field concept symbol $D_i$ in statement S, with a higher frequency giving a larger value; $\mathrm{Pos}(i)$ denotes the position at which the i-th field concept symbol $D_i$ occurs in statement S, with a later position giving a larger value; and the field concept symbol $D_i$ with the highest score $S(D_i)$ is taken as the field of statement S.
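The selection rule above can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the claim fixes only the chunk-type priority and the additive form of the score, so the concrete definitions of Rel, Fre and Pos used here (a caller-supplied pairwise relation function defaulting to zero, a relative frequency, and a normalised last position) are placeholders. The sketch also scores each distinct symbol once rather than once per activation word, and the name `statement_field` is hypothetical.

```python
from collections import Counter

# Priority from the claim: Eg > El > C > (B or A); B and A share the lowest rank.
CHUNK_RANK = {"Eg": 0, "El": 1, "C": 2, "B": 3, "A": 3}


def statement_field(activations, statement_length, rel=lambda di, dj: 0.0):
    """Pick the field concept symbol of one statement.

    activations: list of (field_symbol, chunk_type, word_position) tuples.
    statement_length: number of words in the statement (assumed > 0).
    rel: assumed pairwise relation score; the source does not define it.
    """
    if not activations:
        return None
    # Keep only activation words in the highest-priority chunk type present.
    best_rank = min(CHUNK_RANK[chunk] for _, chunk, _ in activations)
    candidates = [(d, pos) for d, chunk, pos in activations
                  if CHUNK_RANK[chunk] == best_rank]
    symbols = [d for d, _ in candidates]
    counts = Counter(symbols)
    last_pos = {}
    for d, pos in candidates:
        last_pos[d] = max(pos, last_pos.get(d, -1))
    scores = {}
    for d in counts:
        rel_score = sum(rel(d, other) for other in counts if other != d)
        fre_score = counts[d] / len(symbols)              # higher frequency, larger value
        pos_score = (last_pos[d] + 1) / statement_length  # later position, larger value
        scores[d] = rel_score + fre_score + pos_score
    return max(scores, key=scores.get)
```

With a single activation word in the highest-priority chunk type, the function simply returns that word's field concept symbol, which matches the case in which no scoring is needed.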

5. The text field acquisition system according to claim 1, characterized in that the principle by which the field arbiter determines the text field further comprises: if the text has a title, the field of the title is taken as the field of the text; if the text has no title, the sentence-group field that has the highest frequency of occurrence and occurs earliest in the text is taken as the field of the text.

6. A method for acquiring a text field based on concept symbols, comprising the following steps:

(1) paragraph, statement and word segmentation: the word segmentation processor segments the input text into paragraphs, statements and words;

(2) statement semantic analysis: the statement semantic analyzer performs semantic analysis on each statement and obtains the concept category of the statement and the semantic chunks that constitute it, including the role, boundary and internal composition of each semantic chunk;

(3) obtaining activation words: the field arbiter obtains the activation words in the statement according to the field concept symbol set and the semantic concept symbols in the word knowledge base;

(4) statement field determination: the field arbiter comprehensively scores the field concept symbols of the activation words according to the type of semantic chunk in which each activation word is located, the relations between the field concept symbols, and their frequency and position of occurrence in the statement, and takes the field concept symbol with the highest score as the field of the statement;

(5) sentence group and sentence-group field determination: the field arbiter merges the statements in a paragraph according to their field concept symbols to obtain sentence groups and their fields;

7. The method for acquiring a text field according to claim 6, characterized in that step (4) determines the field of a statement S as follows: first, the type of the semantic chunk in which each activation word is located is obtained from the result of the statement semantic analysis; then, the field of statement S is determined in the semantic chunk type order "global feature semantic chunk Eg > local feature semantic chunk El > content semantic chunk C > object semantic chunk B or actor semantic chunk A"; when semantic chunks of the same type contain a plurality of activation words $W_1, W_2, \ldots, W_n$, whose corresponding field concept symbols are $D_1, D_2, \ldots, D_n$ respectively, the score of each field concept symbol in the statement is calculated according to the following formula:

$S(D_i) = \mathrm{Rel}(i) + \mathrm{Fre}(i) + \mathrm{Pos}(i), \quad 1 \le i \le n;$

8. The method for acquiring a text field according to claim 6, characterized in that, in step (5), for the statements $S_1, S_2, \ldots, S_n$ in a paragraph $P_i$ of text T, the sentence group to which each statement belongs is determined according to the following steps:

(5g) if the current statement $S_i$ is the last statement $S_n$, go to (5n);
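The merging of statements into sentence groups can be pictured with the sketch below. It is a simplified stand-in, not the claimed step sequence: it assumes that consecutive statements with the same field concept symbol form one sentence group, and that a statement with no determined field (represented as `None`) joins the neighbouring group. The function name `merge_into_groups` is hypothetical.

```python
def merge_into_groups(statement_fields):
    """Merge the statements of one paragraph into sentence groups.

    statement_fields: field symbol (or None) of each statement, in order.
    Returns a list of (field, [statement indices]) pairs.
    """
    groups = []
    for idx, field in enumerate(statement_fields):
        if groups:
            g_field, members = groups[-1]
            # Assumed rule: same field, or an undetermined field on either side, merges.
            if field is None or g_field is None or field == g_field:
                members.append(idx)
                groups[-1] = (g_field if g_field is not None else field, members)
                continue
        groups.append((field, [idx]))
    return groups
```

For example, statement fields `["a", "a", None, "q7", "q7", "a"]` would yield three sentence groups under this assumed rule, with fields `a`, `q7` and `a` in that order.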

9. The method for acquiring a text field according to claim 6, characterized in that step (6) further comprises: if the input text has a title, the field of the title is taken as the field of the input text; if the input text has no title, the sentence-group field that has the highest frequency of occurrence and occurs earliest in the input text is taken as the candidate field of the input text.

10. The method for acquiring a text field according to claim 9, characterized in that, if the text has no title, the fields of the n sentence groups in text T are recorded, in order of sentence-group occurrence, as $D = (D_{G_1}, D_{G_2}, \ldots, D_{G_n})$, and the text field is obtained by processing $D_{G_1}$ through $D_{G_n}$ according to the following steps:

(6b) if $D_{G_i}$ is $D_{G_n}$, go to (6f);

(6c) take $D_{G_{i+1}}$ as $D_{G_i}$;

(6g) sort the elements $(D_{G_j}, C_{G_j})$, $1 \le j \le m$, of table HTab in descending order of $C_{G_j}$ to obtain a new table $HTab' = ((D'_{G_1}, C'_{G_1}), \ldots, (D'_{G_m}, C'_{G_m}))$, and take the field concept symbol of the first element of the new table as the field of text T.
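The title rule and the final sort can be sketched together as follows. This is a minimal sketch under assumptions: $C_{G_j}$ is taken to be the number of sentence groups whose field is $D_{G_j}$, the table HTab is modelled as a dictionary in first-occurrence order, and ties are resolved in favour of the field that occurs first. The function name `text_field` is hypothetical.

```python
def text_field(group_fields, title_field=None):
    """Choose the field of a text from its title or its sentence-group fields.

    group_fields: sentence-group fields in order of occurrence in the text.
    title_field: field of the title, or None when the text has no title.
    """
    if title_field is not None:
        return title_field
    # HTab stand-in: field -> count; a dict keeps first-occurrence order.
    htab = {}
    for d in group_fields:
        htab[d] = htab.get(d, 0) + 1
    if not htab:
        return None
    # sorted() is stable, so among equal counts the earliest field stays first.
    ranked = sorted(htab.items(), key=lambda item: item[1], reverse=True)
    return ranked[0][0]
```

Relying on the stability of the sort is one way to make the most frequent, earliest-occurring field win when several sentence-group fields have the same count.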