CN101645083A - Acquisition system and method of text field based on concept symbols - Google Patents

Info

Publication number
CN101645083A
Authority
CN
China
Prior art keywords
field
statement
text
concept
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN200910077018A
Other languages
Chinese (zh)
Other versions
CN101645083B (en)
Inventor
韦向峰
黄曾阳
张全
缪建明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Original Assignee
Institute of Acoustics CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS filed Critical Institute of Acoustics CAS
Priority to CN2009100770180A priority Critical patent/CN101645083B/en
Publication of CN101645083A publication Critical patent/CN101645083A/en
Application granted granted Critical
Publication of CN101645083B publication Critical patent/CN101645083B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a system and a method for acquiring the field of a text based on concept symbols. The system comprises a concept symbol set for expressing word concepts and field categories, a word knowledge base for storing words and their concept symbols, a word segmentation processor, a statement semantic analyzer and a field arbiter. The method comprises the following steps: (1) segmenting an input text into paragraphs, statements and words; (2) carrying out semantic analysis on the statements to obtain their concept categories and semantic chunks; (3) obtaining the activation words in the statements according to the field concept symbol set and the semantic concept symbols in the word knowledge base; (4) scoring the field concept symbols of the activation words comprehensively and taking the field concept symbol with the highest score as the field of the statement; (5) merging the statements in a paragraph according to their field concept symbols to obtain sentence groups and their fields; and (6) obtaining the field of the text according to the title of the text and the frequency of occurrence and position of the sentence groups in the text.

Description

A system and method for acquiring the field of a text based on concept symbols
Technical field
The present invention relates to the use of computer science and technology to process the written and spoken language information in text, and in particular to a system and method for acquiring the field of a text based on concept symbols.
Background art
Text classification is the method and process by which a computer assigns a text to one or more field categories according to certain rules, knowledge and steps. The conventional approach represents a text as a feature vector; when the "angle" between the feature vectors of two texts is smaller than a certain threshold, the two texts are assigned to the same category. Words are generally chosen as the text features that make up the feature vector, and the vector is mostly constructed with the TF*IDF method or the TF*IWF method derived from it. TF*IDF uses the product of a word's frequency of occurrence in a document and the inverse of its frequency of occurrence in the document collection as the value of that feature word in the feature vector. The k-nearest-neighbor method, Bayesian methods, support vector machines, neural networks, decision trees and similar techniques are all statistical methods based on the vector space model of text. They require a large set of pre-classified texts for parameter training before classification; after training, a new text can be assigned to one of the defined categories. A Chinese patent document (publication number CN100353361) discloses a method and apparatus for a new feature vector weight for text classification, which introduces the n-th root of DBV and TF on the basis of the TF*IWF method. In experiments with different numbers of feature words per classification field selected by word frequency (50, 100, 200, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000), its experimental system performed best with 3500 words.
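As a rough illustration of the vector-space approach described above, the following sketch computes TF*IDF weights and the cosine of the "angle" between two texts. The logarithmic form of the inverse document frequency and the function names `tf_idf_vectors` and `cosine` are assumptions made for illustration; the background text does not fix them.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: term frequency times log of inverse document frequency.
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (1.0 means same direction)."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Two texts would then be placed in the same category when `cosine` exceeds a chosen threshold, which corresponds to the angle falling below a certain value.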
Because text classification methods require the set of field categories and the classification criteria to be known in advance, they are difficult to apply when the categories are uncertain or a training text set is hard to obtain. Text clustering emerged for this reason. A typical representative of the commonly used text clustering methods is the K-Means algorithm: first, K texts are chosen arbitrarily from the text set as cluster centers, and every other text is assigned to the nearest cluster according to the "distance" between its feature vector and the cluster centers; then the mean of the feature vectors of all texts in each of the K clusters is taken as the new cluster center, and all texts are clustered again according to their distance to the cluster centers; the computation is iterated in this way until the evaluation function converges. However, the field categories obtained by automatic text clustering are very coarse, and because the clustering is not guided toward particular types of field, its results are hard to adapt to practical needs. Moreover, the same clustering method may work well on one text set and very poorly on another; that is, text clustering has shortcomings in both practicality and stability.
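The K-Means iteration described above can be sketched as follows. This is a minimal version under stated assumptions: the first K points serve as the initial centers (rather than an arbitrary choice), squared Euclidean distance is the "distance", and the loop stops when the assignment no longer changes. The name `k_means` is illustrative.

```python
def k_means(points, k, iterations=100):
    """points: list of equal-length numeric tuples. Returns (assignment, centers)."""
    centers = [list(p) for p in points[:k]]  # assumption: first K points as centers
    assignment = None
    for _ in range(iterations):
        # Assign every point to the nearest center (squared Euclidean distance).
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers
```

The result depends on the initial centers, which is one source of the instability the background section criticizes.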
In summary, statistical text classification methods need a large pre-classified training corpus, which is often hard to provide at classification time. Text clustering overcomes this shortcoming, but its results are hard to reconcile with the actual classification requirements.
Summary of the invention
To overcome the above problems in the prior art, the invention provides a system and method for acquiring the field of a text based on concept symbols. The system and method feature configurable classification criteria and a rule-based classification method; they can obtain the basic field category of a text without a training corpus, allow the classification categories to be customized according to actual needs, and can be used for automatic text clustering.
To achieve the above object, the system provided by the invention for acquiring the field of a text based on concept symbols, as shown in Figure 1, comprises:
A field concept symbol set, used to express word concepts and field categories and to provide the required field concept symbols to the field arbiter.
A word knowledge base, used to store words and their concept symbols and to provide the required words and concept symbols to the word segmentation processor and the statement semantic analyzer.
A word segmentation processor, used to split the input text into paragraphs, statements and words and to pass them to the statement semantic analyzer.
A statement semantic analyzer, used to carry out semantic analysis on a statement and obtain its concept category and the semantic chunks that make it up, including the role, boundary and internal structure of each semantic chunk.
A field arbiter, used to obtain the activation words in a statement according to the field concept symbol set and the semantic concept symbols in the word knowledge base; then to score the field concept symbols of the activation words comprehensively according to the semantic chunk type of each activation word in the statement, the relations between field concept symbols, and their frequency and position of occurrence, taking the field concept symbol with the highest score as the field of the statement; then to merge the statements in a paragraph according to their field concept symbols to obtain sentence groups and their fields; and finally to obtain the field of the input text according to the title of the input text and the frequency of occurrence and position of the sentence groups in the input text.
Wherein, the role types of the semantic chunks are: feature semantic chunk E, actor semantic chunk A, object semantic chunk B and content semantic chunk C. The feature semantic chunk type E is divided into two types: a) the global feature semantic chunk Eg, which is a feature semantic chunk E at the first level of the statement; b) the local feature semantic chunk El, which is the feature semantic chunk E of a nested statement S' when a statement S' is nested inside a semantic chunk.
Wherein, the field concept symbol set comprises the following upper-level node symbols:
"71, 72" denote psychological activity and mental state; "8" denotes human thinking activity; "a, b" denote professional and pursuit activities (second-class work); "d" denotes theoretical activity; "q6" denotes first-class work; "q7" denotes extra-professional activity; "q8" denotes faith activity; "6m" denotes instinctive activity, where m = 0~5; "3228α" denotes calamity, where α = 8~b; "503, 50α" denote state, where α = 8~b;
Field concept upper-level node    Field expressed
71, 72                            Psychological activity and mental state
8                                 Human thinking activity
a, b                              Professional and pursuit activities (second-class work)
d                                 Theoretical activity
q6                                First-class work
q7                                Extra-professional activity
q8                                Faith activity
6m (m = 0~5)                      Instinctive activity
3228α (α = 8~b)                   Calamity
503, 50α (α = 8~b)                State
as well as the more specific field concept node symbols obtained by extending these upper-level nodes downward.
Wherein, the field arbiter determines the field of a statement S as follows: first, the type of the semantic chunk in which each activation word is located is obtained from the result of sentence category analysis (sca); then the field of statement S is determined by examining the semantic chunk types in the order global feature semantic chunk Eg > local feature semantic chunk El > content semantic chunk C > (object semantic chunk B or actor semantic chunk A). When a semantic chunk of one type contains several activation words (W_1, W_2, ..., W_n), and the field concept symbols corresponding to these activation words are (D_1, D_2, ..., D_n), the score of each field concept symbol in the statement is calculated with the following formula:
S(D_i) = Rel(i) + Fre(i) + Pos(i), 1 ≤ i ≤ n;
Wherein, Rel(i) is the relation score of the i-th field concept symbol D_i with respect to the other field concept symbols D_j (j ≠ i, 1 ≤ j ≤ n) in the statement; Fre(i) reflects the frequency of occurrence of D_i in statement S, and the higher the frequency, the larger its value; Pos(i) reflects the position at which D_i occurs in statement S, and the later the position, the larger its value. The field concept symbol D_i with the highest score S(D_i) is taken as the field of statement S.
Wherein, the principles by which the field arbiter determines the field of a text further include: if the text has a title, the field of the title is taken as the field of the text; if the text has no title, the sentence group field that appears first in the text among those with the highest frequency of occurrence is taken as the field of the text.
The method provided by the invention for acquiring the field of a text based on concept symbols, as shown in Figure 2, comprises the following steps:
(1) Paragraph, statement and word segmentation: the word segmentation processor splits the input text into paragraphs, statements and words.
An input text is treated in the computer as a character string T. Taking the "carriage return, line feed" symbols in T as cut points, text T is split into several paragraphs P. Taking characters such as the full stop, question mark, exclamation mark and semicolon in paragraph P as cut points, paragraph P is split into several statements S.
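The paragraph and statement splitting just described can be sketched with regular expressions. This is a minimal illustration; the function names and the exact set of cut characters are assumptions (the ASCII full stop is deliberately left out so that strings such as "www.xinhuanet.com" are not cut).

```python
import re

# Assumed cut characters: full-width and ASCII question mark, exclamation mark
# and semicolon, plus the full-width full stop.
CUT_CHARS = "。？！；?!;"


def split_paragraphs(text):
    """Split text T into paragraphs P at carriage returns and line feeds."""
    return [p.strip() for p in re.split(r"[\r\n]+", text) if p.strip()]


def split_statements(paragraph):
    """Split paragraph P into statements S, keeping each cut character."""
    pattern = "[^" + CUT_CHARS + "]+[" + CUT_CHARS + "]?"
    return [s.strip() for s in re.findall(pattern, paragraph) if s.strip()]
```

Each statement keeps its closing punctuation, which later stages can use or discard as needed.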
A statement S consists of Chinese characters and other characters. Let A, B and C be Chinese characters occurring in statement S. If "AB" is a word in the word knowledge base, "ABC" is segmented as "AB/C"; likewise, if "BC" is a word in the word knowledge base, "ABC" is segmented as "A/BC". If both "AB" and "BC" are words in the dictionary, "ABC" is segmented as "A/BC" according to the left-cut principle; if "ABC" is a word in the dictionary, it is segmented as "/ABC/" according to the longest-word principle. In this way statement S is split into several words W, and word segmentation is complete.
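One way to reproduce the four cases above is backward maximum matching, which scans from the end of the statement and always takes the longest dictionary word. This technique is swapped in here as an assumption; the text only states the segmentation outcomes, not the algorithm. The name `segment` and the `max_len` limit are illustrative.

```python
def segment(statement, dictionary, max_len=4):
    """Backward maximum matching: returns the list of words in statement order."""
    words = []
    end = len(statement)
    while end > 0:
        # Try the longest candidate ending at `end` first (longest-word principle).
        for size in range(min(max_len, end), 0, -1):
            piece = statement[end - size:end]
            if size == 1 or piece in dictionary:
                words.append(piece)  # single characters are always accepted
                end -= size
                break
    words.reverse()
    return words
```

For instance, `segment("ABC", {"AB", "BC"})` yields `["A", "BC"]`, matching the left-cut case, while adding "ABC" to the dictionary yields the single word `["ABC"]`.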
(2) Statement semantic analysis: the statement semantic analyzer carries out semantic analysis on each statement and obtains its concept category and the semantic chunks that make it up, including the role, boundary and internal structure of each semantic chunk.
For each statement S, the analysis yields its semantic category (sentence category) code SCode, its format code SFomat, its sentence category expression SExpression, and the kind, extent and specific name in the sentence category expression of each semantic chunk that makes up the statement. In particular, it determines whether the type of each semantic chunk is E (feature semantic chunk), A (actor semantic chunk), B (object semantic chunk) or C (content semantic chunk). The feature semantic chunk type E is further divided into two types: Eg (global feature semantic chunk), which is a feature semantic chunk E at the first level of the statement; and El (local feature semantic chunk), which is the feature semantic chunk E of a nested statement S' when a statement S' is nested inside a semantic chunk.
(3) Obtaining the activation words: the field arbiter obtains the activation words in the statement according to the field concept symbol set and the semantic concept symbols in the word knowledge base.
An activation word is a word in statement S that contains a field concept symbol. An entry in the word knowledge base comprises: word form, tone, number of senses, sense number, concept category, word frequency and context, semantic knowledge, sentence category code, format conversion, and so on. The semantic knowledge is expressed with symbols of concept primitives, and the field symbols are a subset of the concept primitive symbol system, so the concept symbols of a word may contain field concept symbol information. Not every concept primitive node in the symbol system is used to describe fields. The upper-level nodes of the concepts related to fields are: 71, 72 (psychological activity and mental state); 8 (human thinking activity); a, b (professional and pursuit activities (second-class work)); d (theoretical activity); q6 (first-class work); q7 (extra-professional activity); q8 (faith activity); 6m (m = 0~5) (instinctive activity); 3228α (α = 8~b) (calamity); 503, 50α (α = 8~b) (state). These upper-level nodes can be extended downward to obtain more specific field concept node symbols. For example, a (professional activity) extends downward to: a1 (politics), a2 (economy), a3 (culture), a4 (military affairs), a5 (law), a6 (science and technology), a7 (education), a8 (defence and protection); and a1 (politics) extends downward in turn to: a11 (regime activity), a113 (change of top leader (of a country or local government)), a113b (election).
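The hierarchy of field concept symbols can be modelled, as a rough sketch, by treating a symbol as an extension of another when it starts with that symbol (as in a, a1, a11, a113, a113b). Reading "α = 8~b" as the characters 8, 9, a, b and using plain string prefixes are assumptions for illustration; `UPPER_NODES`, `is_field_symbol` and `is_extension` are hypothetical names.

```python
# Upper-level field nodes listed in the text; the expansion of "6m" and of
# "α = 8~b" (assumed to mean 8, 9, a, b) is an interpretation.
UPPER_NODES = tuple(
    ["71", "72", "8", "a", "b", "d", "q6", "q7", "q8", "503"]
    + ["6%d" % m for m in range(6)]
    + ["3228" + x for x in "89ab"]
    + ["50" + x for x in "89ab"]
)


def is_field_symbol(symbol):
    """True if the symbol is an upper-level field node or derived from one."""
    return symbol.startswith(UPPER_NODES)


def is_extension(child, parent):
    """True if `child` is a strictly more specific node below `parent`."""
    return child != parent and child.startswith(parent)
```

With this model, "a113b" (election) counts as an extension of "a1" (politics), and "q734" falls under the q7 field.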
The concept symbols of fields and the concept symbols of semantic knowledge in the word knowledge base use the same concept primitive symbol system. When an upper-level node of the field concept symbols, or a node derived from one, appears in the concept symbols of the semantic knowledge of a word W, word W is an activation word. A field concept symbol expresses a field of a certain level or type, and all field concept symbols contained in the activation words of statement S are taken as the candidate fields of statement S.
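A minimal sketch of activation word lookup follows, assuming the word knowledge base is a plain mapping from a word to its list of concept symbols. That structure, the shortened `UPPER_NODES` tuple and the symbol "wj2" used in the usage note are hypothetical; "q73" and "q734" are taken from the embodiment.

```python
# Subset of the upper-level field nodes, enough for the illustration.
UPPER_NODES = ("71", "72", "8", "a", "b", "d", "q6", "q7", "q8")


def field_symbols(concept_symbols):
    """Keep only the concept symbols that lie under an upper-level field node."""
    return [s for s in concept_symbols if s.startswith(UPPER_NODES)]


def activation_words(words, knowledge_base):
    """Map each activation word in a statement to its field concept symbols."""
    result = {}
    for word in words:
        fields = field_symbols(knowledge_base.get(word, []))
        if fields:  # a word with no field symbol is not an activation word
            result[word] = fields
    return result
```

Given a knowledge base such as `{"match": ["q73"], "semifinals": ["q734"], "local": ["wj2"]}`, only "match" and "semifinals" come back as activation words, and their symbols form the candidate fields of the statement.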
(4) Determining the statement field: the field arbiter scores the field concept symbols of the activation words comprehensively according to the semantic chunk type of each activation word in the statement, the relations between field concept symbols, and their frequency and position of occurrence, and takes the field concept symbol with the highest score as the field of the statement.
Wherein, in step (4) the statement field is derived from the field concept symbols of the activation words. When a statement S contains several activation words, the statement field is determined as follows: first, the type of the semantic chunk in which each activation word is located is obtained from the result of sentence category analysis (sca); then the field of statement S is determined by examining the semantic chunk types in the order global feature semantic chunk Eg > local feature semantic chunk El > content semantic chunk C > object semantic chunk B or actor semantic chunk A. That is, if Eg contains an activation word W, the field concept symbol of W is taken as the statement field; if Eg contains no activation word, the field is taken from El; if El contains none, from C; and if C contains none, from B or A.
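The priority order Eg > El > C > (B or A) can be sketched as a simple cascade over the activation words. Representing an activation as a `(word, chunk_type, field_symbol)` tuple is an assumption for illustration, as is the symbol "a219" used for "Malaysia" in the usage note.

```python
# Semantic chunk types in decreasing priority; B and A share the last level.
PRIORITY = [("Eg",), ("El",), ("C",), ("B", "A")]


def candidates_by_priority(activations):
    """activations: list of (word, chunk_type, field_symbol) tuples.

    Returns the activations found at the highest-priority chunk level that
    contains at least one activation word, or an empty list if there are none.
    """
    for level in PRIORITY:
        hits = [a for a in activations if a[1] in level]
        if hits:
            return hits
    return []
```

If a content chunk C holds "diving" and "semifinals" while "Malaysia" sits in an object chunk B, only the two words in C are passed on to the scoring step.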
When a semantic chunk of one type contains several activation words (W_1, W_2, ..., W_n), and the field concept symbols corresponding to these activation words are (D_1, D_2, ..., D_n), the score of each field concept symbol in the statement is calculated with the formula S(D_i) = Rel(i) + Fre(i) + Pos(i), 1 ≤ i ≤ n. In this formula, Rel(i) is the relation score of the i-th field concept symbol D_i with respect to the other field concept symbols D_j (j ≠ i, 1 ≤ j ≤ n) in the statement; Fre(i) reflects the frequency of occurrence of D_i in statement S, and the higher the frequency, the larger its value; Pos(i) reflects the position at which D_i occurs in statement S, and the later the position, the larger its value. The field concept symbol D_i with the highest score S(D_i) is taken as the field of statement S.
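The scoring formula S(D_i) = Rel(i) + Fre(i) + Pos(i) can be sketched as below. The text gives no numeric definitions for Fre and Pos, so this sketch assumes Fre is the raw count of the symbol and Pos is its last position divided by the number of activation words; Rel adds 1 for each other symbol that D_i extends (modelled as a string prefix) or is strongly related to. All names are illustrative.

```python
def score_fields(symbols, strongly_related=frozenset()):
    """symbols: field concept symbols of the activation words, in statement order.
    strongly_related: set of frozenset pairs of symbols assumed to be related.
    Returns {symbol: score} with score = Rel + Fre + Pos."""
    n = len(symbols)
    scores = {}
    for d in set(symbols):
        others = set(symbols) - {d}
        # Rel: +1 when d is a concept extension of another symbol, +1 when related.
        rel = sum(1 for o in others if d.startswith(o))
        rel += sum(1 for o in others if frozenset((d, o)) in strongly_related)
        fre = symbols.count(d)  # assumed: raw frequency
        last = max(i for i, s in enumerate(symbols) if s == d)
        pos = (last + 1) / n    # assumed: later position gives a larger value
        scores[d] = rel + fre + pos
    return scores


def statement_field(symbols, strongly_related=frozenset()):
    """Field concept symbol with the highest score."""
    scores = score_fields(symbols, strongly_related)
    return max(scores, key=scores.get)
```

With the symbols `["a339", "q734"]`, both have the same Rel and Fre, and the later position of "q734" gives it the higher score, so it is chosen as the statement field.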
The value of Rel(i) comes from the relation between field concept symbols D_i and D_j. When D_i is a concept extension of D_j, the score of D_i is increased by 1; when D_i is strongly related to D_j, the score of D_i is increased by 1. If, after S(D_i) has been calculated, D_i is the field of the statement and D_i is preceded by a negation concept that modifies it, then D_i' (the opposite field concept symbol of D_i) should be taken as the field of the statement. If, after S(D_i) has been calculated, D_i is the field of the statement, the score Rel(j) + Fre(j) of D_j is the same as Rel(i) + Fre(i) of D_i, and D_i and D_j are child nodes of the same concept node, then the field concept symbol of their common parent node is taken as the field of the statement.
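The last rule above, falling back to the common parent when two sibling symbols tie on Rel + Fre, might look like the following sketch. It assumes that the parent of a node is obtained by dropping the last character of its symbol, which matches chains such as q73, q734 but is not stated in the text; the function names are hypothetical.

```python
def parent(symbol):
    """Assumed parent node: the symbol without its last character."""
    return symbol[:-1]


def resolve_sibling_tie(d_i, d_j, rel_fre_i, rel_fre_j):
    """d_i is the winning field; d_j is a competing symbol.

    If both have the same Rel + Fre score and share a parent node, the
    parent's symbol becomes the statement field; otherwise d_i is kept.
    """
    if d_i != d_j and rel_fre_i == rel_fre_j and parent(d_i) == parent(d_j):
        return parent(d_i)
    return d_i
```

For two sibling symbols such as "q734" and "q731" with equal Rel + Fre, the field becomes "q73"; for "q734" against "a339" the parents differ, so "q734" is kept.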
If one activation word W_i (1 ≤ i ≤ n) has several field concept symbols (D_i1, D_i2, ..., D_im), a field score must be calculated for each of these m field concept symbols; however, when Rel is calculated, the relations between D_ij (1 ≤ j ≤ m) and D_ik (k ≠ j, 1 ≤ k ≤ m) need not be considered. If the final scores S(D_ij) and S(D_ik) are still identical, the field concept symbol that is listed first in the word knowledge base is taken as the field of statement S.
(5) Determining sentence groups and their fields: the field arbiter merges the statements in a paragraph according to their field concept symbols and obtains sentence groups and their fields.
A sentence group consists of consecutive statements that describe the same central topic. The central topic of a sentence group is the topic or field expressed by identical or similar field concept symbols. The smallest sentence group is a single statement, and the largest is a whole paragraph. In step (5), for the statements (S_1, S_2, ..., S_n) in a paragraph P_i of text T, the sentence group to which each statement belongs is determined by the following steps, as shown in Figure 3:
(5a) Take the first statement S_1 as sentence group G_1, and take the field D_1 of S_1 as the field D_G1 of G_1;
(5b) Let S_1 be the current statement S_i and G_1 the current sentence group G_j; go to (5g);
(5c) If the field D_i of S_i is a symbol extension of the field D_(i-1) of S_(i-1), statement S_i is placed in G_j and the field of G_j is changed to D_i; go to (5g);
(5d) If the field D_(i-1) of S_(i-1) is a symbol extension of the field D_i of S_i, statement S_i is placed in G_j; go to (5g);
(5e) If the field D_i of the current statement S_i is identical to the field D_(i-1) of the previous statement S_(i-1), statement S_i is placed in G_j; go to (5g);
(5f) Take the statement S_(i+1) following S_i as a new sentence group G_(j+1), whose field D_G(j+1) is the field D_(i+1) of S_(i+1);
(5g) If the current statement S_i is the last statement S_n, go to (5n);
(5k) If the field of S_i is empty and S_i is S_1, statement S_2 is placed in G_1, the field of G_1 is changed to D_2, and S_2 becomes the current statement S_i; go to (5c);
(5l) If the field of S_i is empty and S_i is not S_1, statement S_i is placed in G_j; go to (5g);
(5m) If the field of S_i is not empty, S_(i+1) becomes the current statement S_i; go to (5c);
(5n) For all sentence groups G_j obtained, adjacent sentence groups with identical fields are merged into one sentence group, where 1 ≤ j ≤ m and 1 ≤ m ≤ n.
Through the above steps and the merging operation, a paragraph is divided into several sentence groups, and their fields are determined from the fields of their statements. This achieves both the division of a paragraph into sentence groups and the determination of each sentence group's field.
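The grouping in steps (5a) to (5n) can be approximated by a single linear pass, sketched below. This is a simplification under stated assumptions rather than a literal transcription of the step-by-step flow: a statement is compared with the most recent statement that has a non-empty field, "symbol extension" is modelled as a string prefix, and a statement without a field joins the current group.

```python
def is_extension(child, parent):
    """Assumed model: a more specific symbol starts with its ancestor's symbol."""
    return child != parent and child.startswith(parent)


def group_statements(fields):
    """fields: field symbol of each statement in a paragraph ('' if it has none).
    Returns a list of [group_field, [statement indexes]]."""
    groups = []
    prev = None  # field of the most recent statement that had one
    for i, d in enumerate(fields):
        if not groups:
            groups.append([d, [i]])                # (5a) first statement opens G_1
        elif not d:
            groups[-1][1].append(i)                # (5l) no field: join current group
        elif not prev:
            groups[-1][0] = d                      # (5k) group had no field so far
            groups[-1][1].append(i)
        elif is_extension(d, prev):
            groups[-1][0] = d                      # (5c) group takes the finer field
            groups[-1][1].append(i)
        elif is_extension(prev, d) or d == prev:
            groups[-1][1].append(i)                # (5d), (5e)
        else:
            groups.append([d, [i]])                # (5f) unrelated field: new group
        if d:
            prev = d
    merged = []                                    # (5n) merge adjacent equal fields
    for field, members in groups:
        if merged and merged[-1][0] == field:
            merged[-1][1].extend(members)
        else:
            merged.append([field, list(members)])
    return merged
```

For the statement fields `["q73", "q734", "", "q734", "a1", "a1"]` this produces two sentence groups, one with field "q734" covering the first four statements and one with field "a1" covering the last two.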
(6) Determining the text field: the field arbiter obtains the field of the input text according to the title of the text and the frequency of occurrence and position of the sentence groups in the input text.
Wherein, step (6) further comprises: if the input text has a title, the field of the title is taken as the field of the input text. If the title paragraph P_1 contains only one sentence group, the field of that sentence group is the field of the text; if paragraph P_1 contains several sentence groups, the fields of the first and the last sentence group in P_1 are jointly taken as the field of the text.
If the text has no title, the fields of all sentence groups in the text are taken as the candidate fields of the text. The fields of the n sentence groups in text T are written, in order of appearance, as D = (D_G1, D_G2, ..., D_Gn), and are processed from D_G1 to D_Gn by the following steps, as shown in Figure 4:
(6a) Let D_G1 be D_Gi; count the number C_Gi of fields in D whose field concept symbol is identical to that of D_Gi, and store D_Gi and C_Gi in table HTab;
(6b) If D_Gi is D_Gn, go to (6f);
(6c) Let D_G(i+1) be D_Gi;
(6d) If the field concept symbol of D_Gi has already been stored in table HTab, go to (6c);
(6e) Count the number C_Gi of fields in D whose field concept symbol is identical to that of D_Gi, and store D_Gi and C_Gi in table HTab; go to (6b);
(6f) The table HTab = ((D_G1, C_G1), ..., (D_Gm, C_Gm)) is obtained, where 1 ≤ m ≤ n;
(6g) The elements (D_Gj, C_Gj), 1 ≤ j ≤ m, of table HTab are sorted by C_Gj in descending order to obtain the new table HTab' = ((D_G1', C_G1'), ..., (D_Gm', C_Gm')).
The field concept symbol of the first element of the new table is taken as the field of text T. When text T has a title, the field of text T need not be obtained with the above steps.
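Step (6) as a whole can be sketched as follows: the title rule is applied first, and otherwise a frequency table in order of first appearance is built and sorted in descending order. Using a dict as HTab and relying on the stability of Python's `sorted` (so that, among equally frequent fields, the one that appears first wins) are implementation assumptions; the function name is illustrative.

```python
def text_field(group_fields, title_group_fields=None):
    """group_fields: sentence group fields of the body, in order of appearance.
    title_group_fields: sentence group fields of the title paragraph, if any.
    Returns the list of field symbols taken as the field of the text."""
    if title_group_fields:
        first, last = title_group_fields[0], title_group_fields[-1]
        # One group: its field; several groups: first and last jointly.
        return [first] if first == last else [first, last]
    htab = {}
    for d in group_fields:                 # (6a)-(6f): count each distinct field
        htab[d] = htab.get(d, 0) + 1       # dict keeps first-appearance order
    ranked = sorted(htab.items(), key=lambda item: item[1], reverse=True)  # (6g)
    return [ranked[0][0]] if ranked else []
```

For a text without a title and group fields `["a1", "q734", "q734"]` the result is `["q734"]`; when a title with the single group field "q734" is present, the body is not consulted at all.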
The advantages of the invention are:
1. When the text field acquisition system and method provided by the invention are used for text classification, no large pre-classified training corpus is needed; only the field concept symbols relevant to the classification categories need to be determined.
2. The field concept symbols used by the system and method are hierarchical, so they can accommodate a wide variety of categories at the same level as well as cross-level classification into specific, fine-grained categories.
3. The system and method mainly use semantic analysis and go down to the concept level to determine the field category of a text, while also introducing frequency, a statistical property, which makes the acquisition of the text field more accurate and suitable for processing large-scale text.
4. The sentence group fields proposed by the system and method can be used for text classification, and also for text clustering analysis and text topic analysis.
Description of drawings
Fig. 1 is a structural diagram of the text field acquisition system of the invention;
Fig. 2 is a flowchart of the text field acquisition method of the invention;
Fig. 3 is a flowchart of the method of the invention for determining sentence groups and their fields;
Fig. 4 is a flowchart of the text field acquisition method of the invention when the text has no title.
Embodiment
The invention is described in detail below with reference to a specific embodiment and the accompanying drawings.
First, 11 news reports about competitions at the 2004 Athens Olympic Games were downloaded from the Internet, comprising 60 paragraphs and 6501 Chinese characters in total.
Second, following the design principles and design notation in "Fundamental theorems and mathematical-physical expressions of the language concept space" (Maritime Press, July 2004), the concept symbols of the q73 (match) field were refined to obtain a concept symbol set for the match field. At the same time, the words about matches in the word knowledge base and their semantic knowledge were enriched.
Third, the word segmentation processor is used to split a text into paragraphs, statements and words. Take the following text as an example:
Title: Malaysia's "little standard-bearer" misses the diving semifinals by one place
www.xinhuanet.com, Athens, August 27: In the men's 10 m platform diving match of the Olympic Games held on the afternoon of the 27th local time, Brian-Nickerson of Malaysia ranked 19th in the preliminary contest and failed to be promoted to the semifinals. According to the rules, of the 33 players in the preliminary contest, those whose results rank in the top 18 are promoted to the semifinals.
After processing by the word segmentation processor, the result is as follows:
[Title :] [Malaysia] [" little standard-bearer "] [one poor] [not advancing] [diving] [semifinals]
[www.xinhuanet.com] [Athens] [August 27] electricity is in [Olympic Games] [man] [ten meters] [diving tower] [diving] of [locality] [time] [27 days] [afternoon] [holding] [match]
[from] Brian-Nickerson's [preliminary contest] [achievement] [rank] [19] of [Malaysia]
[failing] [promotion] [semifinals]
[according to] [rule]
In [33] [player] of [preliminary contest]
[player] [promotion] [semifinals] of [18] before [achievement] comes
Fourth, the statement semantic analyzer is used to analyze the statements, and the field arbiter is then used to obtain the activation words and to analyze the sentence groups and their fields. After the sentence group fields are merged, the following result is obtained:
//DOM:(q734)
Title:[Malaysia] difference of [" little standard-bearer "] do not advance [diving (a339 4)] [semifinals (q734)]
The www.xinhuanet.com [Athens (a219 10pw)] August 27 is in [match (q73)] of [Olympic Games (a339i)] [man] ten meters [diving tower (a339 4)] [divings (a339 4)] of 27 days [locality] [time] [afternoon] [holding (a02)], [from] Brian-Nickerson [preliminary contest (q734)] [achievement (a0099b)] of [Malaysia] [rank (q730e25d0[n])] the 19, [failing] [being promoted to (a01ad0ne25)] [semifinals (q734)].[according to] [rule (a009a9)], in 33 [player (q730)] of [preliminary contest (q734)], [achievement (a0099b)] comes preceding 18 [player (q730)] [being promoted to (a01ad0ne25)] [semifinals (q734)].
In this text, the first statement is the title line, and its semantic analysis result is "Title: Malaysia 'little standard-bearer' (SB) || did not advance by a margin of one place (S0) || diving semifinals (SC)". Because the global feature semantic chunk Eg (that is, S0) carries no field concept symbol information, the field of the statement is chosen from the content semantic chunk C (that is, SC), which does contain field information. Both "diving" and "semifinals" in the SC semantic chunk contain field concept symbol information. In the score calculation their relation scores and frequency scores are the same, but the position score of "semifinals" is greater than that of "diving", so the field of the statement is "q734". The first paragraph consists of just this one statement, so the whole paragraph is one sentence group, whose field is "q734". Because the first paragraph is the title of the text, the field of the text is "q734".
In this way, based on the field concept symbols of the activation words, and by analyzing the type of semantic chunk in which each activation word is located in the statement together with the position and frequency of the words, the field of each statement and of each sentence group can be obtained, and finally the field of the text.

Claims (10)

1. A system for acquiring the field of a text based on concept symbols, characterized in that the system comprises:
a field concept symbol set, used to express word concepts and field categories and to provide the required field concept symbols to the field arbiter;
a word knowledge base, used to store words and their concept symbols and to provide the required words and concept symbols to the word segmentation processor and the statement semantic analyzer;
a word segmentation processor, used to split the input text into paragraphs, statements and words and to pass them to the statement semantic analyzer;
a statement semantic analyzer, used to carry out semantic analysis on a statement and obtain its concept category and the semantic chunks that make it up, including the role, boundary and internal structure of each semantic chunk;
a field arbiter, used to obtain the activation words in a statement according to the field concept symbol set and the semantic concept symbols in the word knowledge base; then to score the field concept symbols of the activation words comprehensively according to the semantic chunk type of each activation word in the statement, the relations between field concept symbols, and their frequency and position of occurrence, taking the field concept symbol with the highest score as the field of the statement; then to merge the statements in a paragraph according to their field concept symbols to obtain sentence groups and their fields; and finally to obtain the field of the input text according to the title of the input text and the frequency of occurrence and position of the sentence groups in the input text.
2. The text field acquisition system according to claim 1, characterized in that the role types of the semantic chunks are: feature semantic chunk E, actor semantic chunk A, object semantic chunk B and content semantic chunk C; the feature semantic chunk type E is divided into two types: a) the global feature semantic chunk Eg, which is a feature semantic chunk E at the first level of the statement; b) the local feature semantic chunk El, which is the feature semantic chunk E of a nested statement S' when a statement S' is nested inside a semantic chunk.
3. The text field acquisition system according to claim 1, characterized in that the field concept symbol set comprises the following upper-level node symbols:
Upper-level field concept node    Field expressed
71, 72    Psychological activity and state of mind
8    Human thinking activities
a, b    Professional and pursuit activities
d    Theory activities
q6    First-type work
q7    Extra-professional activities
q8    Faith activities
6m (m = 0~5)    Instinctive activities
3228α (α = 8~b)    Calamity
503, 50α (α = 8~b)    State
as well as the more specific field concept node symbols that extend downward from these upper-level nodes.
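The node table of claim 3 can be illustrated with a small lookup. This is only a sketch under stated assumptions: concept symbols are treated as plain strings, a symbol is taken to "extend downward" from an upper-level node when it starts with that node's symbol, and "α = 8~b" is read as the digits 8, 9, a, b. None of these encodings is confirmed by the claims, and the names `UPPER_NODES` and `upper_node` are hypothetical.

```python
from typing import Optional

ALPHA = "89ab"  # assumed reading of "α = 8~b"

# Upper-level node symbol -> field expressed (labels taken from the table).
UPPER_NODES = {
    "71": "psychological activity and state of mind",
    "72": "psychological activity and state of mind",
    "8": "human thinking activities",
    "a": "professional and pursuit activities",
    "b": "professional and pursuit activities",
    "d": "theory activities",
    "q6": "first-type work",
    "q7": "extra-professional activities",
    "q8": "faith activities",
    "503": "state",
}
UPPER_NODES.update({f"6{m}": "instinctive activities" for m in range(6)})
UPPER_NODES.update({f"3228{x}": "calamity" for x in ALPHA})
UPPER_NODES.update({f"50{x}": "state" for x in ALPHA})


def upper_node(symbol: str) -> Optional[str]:
    """Return the upper-level node that `symbol` extends, or None.

    Assumption: extension is modeled as a string-prefix relation.
    """
    # Longer prefixes are tried first so the most specific node wins.
    for prefix in sorted(UPPER_NODES, key=len, reverse=True):
        if symbol.startswith(prefix):
            return prefix
    return None
```

Under this reading, a word whose concept symbol maps to some upper-level node would count as a candidate activation word, while a symbol such as `66` or `504` would not.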
4. The text field acquisition system according to claim 1, characterized in that the field arbiter determines the field of a statement S as follows: first, the type of the semantic chunk in which each activation word is located is obtained from the result of the statement semantic analysis; then, the field of statement S is determined by considering the semantic chunk types in the order "global feature semantic chunk Eg > local feature semantic chunk El > content semantic chunk C > object semantic chunk B or actor semantic chunk A"; when semantic chunks of the same type contain a plurality of activation words $W_1, W_2, \ldots, W_n$, whose corresponding field concept symbols are assumed to be $D_1, D_2, \ldots, D_n$ respectively, the score of each field concept symbol in the statement is calculated according to the following formula:
$S(D_i) = \mathrm{Rel}(i) + \mathrm{Fre}(i) + \mathrm{Pos}(i), \quad 1 \le i \le n$;
where $\mathrm{Rel}(i)$ denotes the relation score of the i-th field concept symbol $D_i$ with the other field concept symbols $D_j$ ($j \ne i$, $1 \le j \le n$) in the statement; $\mathrm{Fre}(i)$ denotes the frequency of occurrence of $D_i$ in statement S, a higher frequency giving a larger value; $\mathrm{Pos}(i)$ denotes the position of occurrence of $D_i$ in statement S, a later position giving a larger value; and the field concept symbol $D_i$ with the highest score $S(D_i)$ is taken as the field of statement S.
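The selection rule of claim 4 might be sketched as follows. The claims do not define how Rel, Fre and Pos are computed or scaled, so this sketch makes loud assumptions: Fre is the raw count of a symbol within the winning chunk tier, Pos is the 1-based index of its last occurrence in that tier, and Rel is a caller-supplied placeholder that defaults to zero. The function name `statement_field` and the `(chunk_type, field_symbol)` input format are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

# Priority of semantic chunk types: Eg > El > C > (B or A).
CHUNK_PRIORITY = [("Eg",), ("El",), ("C",), ("B", "A")]


def statement_field(
    activations: List[Tuple[str, str]],
    rel: Callable[[str, List[str]], float] = lambda d, others: 0.0,
) -> Optional[str]:
    """Pick the field of one statement.

    `activations` lists (chunk_type, field_symbol) pairs for the activation
    words, in order of appearance. `rel` stands in for Rel(i); its real
    definition is not given in the claims.
    """
    for chunk_types in CHUNK_PRIORITY:
        symbols = [d for t, d in activations if t in chunk_types]
        if not symbols:
            continue  # no activation word in this tier, try the next one
        freq = Counter(symbols)  # assumed Fre(i): raw count
        # Assumed Pos(i): 1-based index of the last occurrence (later is larger).
        last_pos = {d: k + 1 for k, d in enumerate(symbols)}

        def score(d: str) -> float:
            others = [o for o in freq if o != d]
            return rel(d, others) + freq[d] + last_pos[d]

        return max(freq, key=score)
    return None
```

For example, `statement_field([("C", "a1"), ("Eg", "q6"), ("C", "a1")])` returns `"q6"`, because an activation word in a global feature chunk outranks any number of activation words in content chunks.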
5. The text field acquisition system according to claim 1, characterized in that the principle by which the field arbiter determines the text field further comprises: if the text has a title, the field of the title is taken as the field of the text; if the text has no title, the sentence-group field with the highest frequency of occurrence in the text, being the earliest-occurring such field, is taken as the field of the text.
6. A method for acquiring the field of a text based on concept symbols, comprising the following steps:
(1) paragraph, statement and word segmentation: the word segmentation processor segments the input text into paragraphs, statements and words;
(2) statement semantic analysis: the statement semantic analyzer performs semantic analysis on each statement and obtains the concept category of the statement and the semantic chunks that constitute it, including the role, boundary and internal composition of each semantic chunk;
(3) obtaining activation words: the field arbiter obtains the activation words in the statement according to the field concept symbol set and the semantic concept symbols in the word knowledge base;
(4) statement field determination: the field arbiter scores the field concept symbols of the activation words comprehensively according to the semantic chunk type, field concept symbol relations, frequency of occurrence and position of occurrence of the activation words in the statement, and takes the field concept symbol with the highest score as the field of the statement;
(5) sentence group and sentence-group field determination: the field arbiter merges the statements in each paragraph according to their field concept symbols, obtaining sentence groups and their fields;
(6) text field determination: the field arbiter obtains the field of the input text according to the text title and the frequency of occurrence and position of the sentence groups in the input text.
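Step (1) above can be illustrated with a minimal sketch. The claims do not say how the word segmentation processor works internally, so two techniques are swapped in here as assumptions: statements are split on Chinese and Western sentence-final punctuation, and words are found by forward maximum matching against a vocabulary standing in for the word knowledge base. The names `split_text` and `segment_words` and the `max_len` parameter are hypothetical.

```python
import re
from typing import List, Set

# A statement is a run of non-terminator characters plus its terminators.
# The ASCII full stop is deliberately left out so decimals are not split.
_STATEMENT = re.compile(r"[^。！？!?]+[。！？!?]*")


def split_text(text: str) -> List[List[str]]:
    """Split text into paragraphs (non-blank lines), each a list of statements."""
    paragraphs = [p.strip() for p in text.splitlines() if p.strip()]
    return [
        [s.strip() for s in _STATEMENT.findall(p) if s.strip()]
        for p in paragraphs
    ]


def segment_words(statement: str, vocabulary: Set[str], max_len: int = 4) -> List[str]:
    """Forward maximum matching: take the longest vocabulary word at each position."""
    words, i = [], 0
    while i < len(statement):
        for size in range(min(max_len, len(statement) - i), 0, -1):
            candidate = statement[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words
```

The output of `segment_words` would then be passed, statement by statement, to the semantic analysis of step (2), which is not sketched here.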
7. The text field acquisition method according to claim 6, characterized in that step (4) determines the field of a statement S as follows: first, the type of the semantic chunk in which each activation word is located is obtained from the result of the statement semantic analysis; then, the field of statement S is determined by considering the semantic chunk types in the order "global feature semantic chunk Eg > local feature semantic chunk El > content semantic chunk C > object semantic chunk B or actor semantic chunk A"; when semantic chunks of the same type contain a plurality of activation words $W_1, W_2, \ldots, W_n$, whose corresponding field concept symbols are assumed to be $D_1, D_2, \ldots, D_n$ respectively, the score of each field concept symbol in the statement is calculated according to the following formula:
$S(D_i) = \mathrm{Rel}(i) + \mathrm{Fre}(i) + \mathrm{Pos}(i), \quad 1 \le i \le n$;
where $\mathrm{Rel}(i)$ denotes the relation score of the i-th field concept symbol $D_i$ with the other field concept symbols $D_j$ ($j \ne i$, $1 \le j \le n$) in the statement; $\mathrm{Fre}(i)$ denotes the frequency of occurrence of $D_i$ in statement S, a higher frequency giving a larger value; $\mathrm{Pos}(i)$ denotes the position of occurrence of $D_i$ in statement S, a later position giving a larger value; and the field concept symbol $D_i$ with the highest score $S(D_i)$ is taken as the field of statement S.
8. The text field acquisition method according to claim 6, characterized in that, in step (5), for the statements $S_1, S_2, \ldots, S_n$ in a paragraph $P_i$ of text $T$, the sentence group to which each statement belongs is determined by the following steps:
(5a) take the first statement $S_1$ as sentence group $G_1$, and take the field $D_1$ of $S_1$ as the field $D_{G_1}$ of sentence group $G_1$;
(5b) let $S_1$ be the current statement $S_i$ and $G_1$ the current sentence group $G_j$; go to (5g);
(5c) if the field $D_i$ of $S_i$ is a symbol extension of the field $D_{i-1}$ of $S_{i-1}$, then statement $S_i$ is assigned to $G_j$ and the field of $G_j$ is changed to $D_i$; go to (5g);
(5d) if the field $D_{i-1}$ of $S_{i-1}$ is a symbol extension of the field $D_i$ of $S_i$, then statement $S_i$ is assigned to $G_j$; go to (5g);
(5e) if the field $D_i$ of the current statement $S_i$ is identical to the field $D_{i-1}$ of the previous statement $S_{i-1}$, then statement $S_i$ is assigned to $G_j$; go to (5g);
(5f) take the statement $S_{i+1}$ following $S_i$ as a new sentence group $G_{j+1}$, whose field $D_{G_{j+1}}$ is the field $D_{i+1}$ of statement $S_{i+1}$;
(5g) if the current statement $S_i$ is the last statement $S_n$, go to (5n);
(5k) if the field of $S_i$ is empty and $S_i$ is $S_1$, then statement $S_2$ is assigned to $G_1$, the field of $G_1$ is changed to $D_2$, and $S_2$ becomes the current statement $S_i$; go to (5c);
(5l) if the field of $S_i$ is empty and $S_i$ is not $S_1$, then statement $S_i$ is assigned to $G_j$; go to (5g);
(5m) if the field of $S_i$ is not empty, then $S_{i+1}$ becomes the current statement $S_i$; go to (5c);
(5n) among all the sentence groups $G_j$ obtained, adjacent sentence groups with the same field are merged into one sentence group, where $1 \le j \le m$ and $1 \le m \le n$.
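The grouping procedure of claim 8 can be approximated by a single pass over the statement fields. This is a simplified reading rather than a literal transcription of the goto-style steps: a "symbol extension" is assumed to mean that one symbol string is a proper prefix of the other, an empty field is represented by `None`, and each statement is compared with the most recent statement that had a non-empty field. The names `extends` and `group_statements` are hypothetical.

```python
from typing import List, Optional, Tuple


def extends(child: str, parent: str) -> bool:
    """Assumed test for 'symbol extension': `parent` is a proper prefix of `child`."""
    return len(child) > len(parent) and child.startswith(parent)


def group_statements(
    fields: List[Optional[str]],
) -> List[Tuple[Optional[str], List[int]]]:
    """Group the statements of one paragraph into sentence groups.

    `fields[k]` is the field symbol of statement k, or None when it is empty.
    Returns (group_field, statement_indices) pairs.
    """
    groups: List[Tuple[Optional[str], List[int]]] = []
    prev: Optional[str] = None  # field of the last statement that had one
    for k, d in enumerate(fields):
        if not groups:
            groups.append((d, [k]))  # (5a): the first statement opens G1
        elif d is None:
            groups[-1][1].append(k)  # (5l): empty field joins the current group
        elif prev is None or extends(d, prev):
            # (5k)/(5c): the group adopts the new or more specific field
            groups[-1] = (d, groups[-1][1] + [k])
        elif d == prev or extends(prev, d):
            groups[-1][1].append(k)  # (5d), (5e): same or more general field
        else:
            groups.append((d, [k]))  # (5f): unrelated field opens a new group
        if d is not None:
            prev = d
    # (5n): merge adjacent groups whose fields are identical
    merged: List[Tuple[Optional[str], List[int]]] = []
    for g_field, members in groups:
        if merged and merged[-1][0] == g_field:
            merged[-1][1].extend(members)
        else:
            merged.append((g_field, list(members)))
    return merged
```

For the field sequence `["a", "a1", "a1", "q6", None, "q6"]` this yields two sentence groups, one with field `a1` covering statements 0 to 2 and one with field `q6` covering statements 3 to 5.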
9. The text field acquisition method according to claim 6, characterized in that step (6) further comprises the step of: if the input text has a title, taking the field of the title as the field of the input text; if the input text has no title, taking the sentence-group field with the highest frequency of occurrence in the input text, being the earliest-occurring such field, as the candidate field for the field of the input text.
10. The text field acquisition method according to claim 9, characterized in that, if the text has no title, the fields of the n sentence groups in text T are recorded, in order of sentence-group appearance, as $D = (D_{G_1}, D_{G_2}, \ldots, D_{G_n})$, and the text field is obtained by proceeding from $D_{G_1}$ to $D_{G_n}$ according to the following steps:
(6a) take $D_{G_1}$ as $D_{G_i}$; count the number $C_{G_i}$ of fields in $D$ whose field concept symbol is identical to that of $D_{G_i}$, and store $D_{G_i}$ and $C_{G_i}$ in table HTab;
(6b) if $D_{G_i}$ is $D_{G_n}$, go to (6f);
(6c) take $D_{G_{i+1}}$ as $D_{G_i}$;
(6d) if the field concept symbol of $D_{G_i}$ has already been stored in table HTab, go to (6c);
(6e) count the number $C_{G_i}$ of fields in $D$ whose field concept symbol is identical to that of $D_{G_i}$, and store $D_{G_i}$ and $C_{G_i}$ in table HTab; go to (6b);
(6f) obtain the table $\mathrm{HTab} = ((D_{G_1}, C_{G_1}), \ldots, (D_{G_m}, C_{G_m}))$, where $1 \le m \le n$;
(6g) sort the elements $(D_{G_j}, C_{G_j})$, $1 \le j \le m$, of table HTab in descending order of $C_{G_j}$, obtaining a new table $\mathrm{HTab}' = ((D_{G_1}', C_{G_1}'), \ldots, (D_{G_m}', C_{G_m}'))$; the field concept symbol of the first element of this new table is taken as the field of text T.
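Steps (6a) to (6g), together with the title rule of claim 9, amount to a frequency count followed by a descending sort. A minimal sketch, assuming field concept symbols are plain strings and that ties in step (6g) are resolved in favor of the field that appeared first (the claims do not state how ties are broken; a stable sort is used here to get that behavior). The function name `text_field` and its parameters are hypothetical.

```python
from collections import Counter
from typing import List, Optional


def text_field(group_fields: List[str], title_field: Optional[str] = None) -> Optional[str]:
    """Return the field of a text from its title field or its sentence-group fields.

    `group_fields` is D = (D_G1, ..., D_Gn) in order of sentence-group appearance.
    """
    if title_field is not None:
        return title_field  # a title, when present, decides the text field
    if not group_fields:
        return None
    # (6a)-(6f): HTab maps each distinct field to its count, in first-seen order.
    htab = Counter(group_fields)
    # (6g): sort by count, descending; the sort is stable, so equal counts
    # keep their first-seen order (assumed tie-breaking rule).
    ranked = sorted(htab.items(), key=lambda item: item[1], reverse=True)
    return ranked[0][0]
```

With `["a", "q6", "a", "q6", "d"]` and no title, the fields `a` and `q6` both occur twice and the sketch returns `a`, the one whose sentence group appears first.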
CN2009100770180A 2009-01-16 2009-01-16 Acquisition system and method of text field based on concept symbols Expired - Fee Related CN101645083B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009100770180A CN101645083B (en) 2009-01-16 2009-01-16 Acquisition system and method of text field based on concept symbols

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009100770180A CN101645083B (en) 2009-01-16 2009-01-16 Acquisition system and method of text field based on concept symbols

Publications (2)

Publication Number Publication Date
CN101645083A true CN101645083A (en) 2010-02-10
CN101645083B CN101645083B (en) 2012-07-04

Family

ID=41656971

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009100770180A Expired - Fee Related CN101645083B (en) 2009-01-16 2009-01-16 Acquisition system and method of text field based on concept symbols

Country Status (1)

Country Link
CN (1) CN101645083B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101937462A (en) * 2010-09-03 2011-01-05 中国科学院声学研究所 Method and system for automatically evaluating literature
CN104281566A (en) * 2014-10-13 2015-01-14 安徽华贞信息科技有限公司 Semantic text description method and semantic text description system
CN106250398A (en) * 2016-07-19 2016-12-21 北京京东尚科信息技术有限公司 A kind of complaint classifying content decision method complaining event and device
CN106294186A (en) * 2016-08-30 2017-01-04 深圳市悲画软件自动化技术有限公司 Intelligence software automated testing method
CN108153734A (en) * 2017-12-26 2018-06-12 北京嘉和美康信息技术有限公司 A kind of text handling method and device
CN109564505A (en) * 2016-01-27 2019-04-02 伯尼塞艾公司 Teaching programming language is configured with to work with the artificial intelligence engine of the housebroken artificial intelligence model of training
CN110413989A (en) * 2019-06-19 2019-11-05 北京邮电大学 A kind of text field based on domain semantics relational graph determines method and system
CN112699237A (en) * 2020-12-24 2021-04-23 百度在线网络技术(北京)有限公司 Label determination method, device and storage medium
US11120299B2 (en) 2016-01-27 2021-09-14 Microsoft Technology Licensing, Llc Installation and operation of different processes of an AI engine adapted to different configurations of hardware located on-premises and in hybrid environments
US11775850B2 (en) 2016-01-27 2023-10-03 Microsoft Technology Licensing, Llc Artificial intelligence engine having various algorithms to build different concepts contained within a same AI model
US11841789B2 (en) 2016-01-27 2023-12-12 Microsoft Technology Licensing, Llc Visual aids for debugging
US11868896B2 (en) 2016-01-27 2024-01-09 Microsoft Technology Licensing, Llc Interface for working with simulations on premises

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001344256A (en) * 2000-06-01 2001-12-14 Matsushita Electric Ind Co Ltd Word class automatic determination device, example sentence retrieval device, medium, and information aggregate
JP2002259371A (en) * 2001-03-02 2002-09-13 Nippon Telegr & Teleph Corp <Ntt> Method and device for summarizing document, document summarizing program and recording medium recording program
CN101067808B (en) * 2007-05-24 2010-12-15 上海大学 Text key word extracting method
CN100520782C (en) * 2007-11-09 2009-07-29 清华大学 News keyword abstraction method based on word frequency and multi-component grammar
CN101281530A (en) * 2008-05-20 2008-10-08 上海大学 Key word hierarchy clustering method based on conception deriving tree

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101937462A (en) * 2010-09-03 2011-01-05 中国科学院声学研究所 Method and system for automatically evaluating literature
CN101937462B (en) * 2010-09-03 2016-08-24 中国科学院声学研究所 Literature review automatic searching method and system
CN104281566A (en) * 2014-10-13 2015-01-14 安徽华贞信息科技有限公司 Semantic text description method and semantic text description system
CN109564505A (en) * 2016-01-27 2019-04-02 伯尼塞艾公司 Teaching programming language is configured with to work with the artificial intelligence engine of the housebroken artificial intelligence model of training
US11164109B2 (en) 2016-01-27 2021-11-02 Microsoft Technology Licensing, Llc Artificial intelligence engine for mixing and enhancing features from one or more trained pre-existing machine-learning models
CN109564505B (en) * 2016-01-27 2022-03-25 微软技术许可有限责任公司 Artificial intelligence engine, system and machine readable storage device
US11775850B2 (en) 2016-01-27 2023-10-03 Microsoft Technology Licensing, Llc Artificial intelligence engine having various algorithms to build different concepts contained within a same AI model
US11868896B2 (en) 2016-01-27 2024-01-09 Microsoft Technology Licensing, Llc Interface for working with simulations on premises
US11762635B2 (en) 2016-01-27 2023-09-19 Microsoft Technology Licensing, Llc Artificial intelligence engine with enhanced computing hardware throughput
US11842172B2 (en) 2016-01-27 2023-12-12 Microsoft Technology Licensing, Llc Graphical user interface to an artificial intelligence engine utilized to generate one or more trained artificial intelligence models
US11841789B2 (en) 2016-01-27 2023-12-12 Microsoft Technology Licensing, Llc Visual aids for debugging
US11100423B2 (en) 2016-01-27 2021-08-24 Microsoft Technology Licensing, Llc Artificial intelligence engine hosted on an online platform
US11120299B2 (en) 2016-01-27 2021-09-14 Microsoft Technology Licensing, Llc Installation and operation of different processes of an AI engine adapted to different configurations of hardware located on-premises and in hybrid environments
US11120365B2 (en) 2016-01-27 2021-09-14 Microsoft Technology Licensing, Llc For hierarchical decomposition deep reinforcement learning for an artificial intelligence model
CN106250398B (en) * 2016-07-19 2020-03-27 北京京东尚科信息技术有限公司 Method and device for classifying and judging complaint content of complaint event
CN106250398A (en) * 2016-07-19 2016-12-21 北京京东尚科信息技术有限公司 A kind of complaint classifying content decision method complaining event and device
CN106294186A (en) * 2016-08-30 2017-01-04 深圳市悲画软件自动化技术有限公司 Intelligence software automated testing method
CN108153734A (en) * 2017-12-26 2018-06-12 北京嘉和美康信息技术有限公司 A kind of text handling method and device
CN110413989B (en) * 2019-06-19 2020-11-20 北京邮电大学 Text field determination method and system based on field semantic relation graph
CN110413989A (en) * 2019-06-19 2019-11-05 北京邮电大学 A kind of text field based on domain semantics relational graph determines method and system
CN112699237B (en) * 2020-12-24 2021-10-15 百度在线网络技术(北京)有限公司 Label determination method, device and storage medium
CN112699237A (en) * 2020-12-24 2021-04-23 百度在线网络技术(北京)有限公司 Label determination method, device and storage medium

Also Published As

Publication number Publication date
CN101645083B (en) 2012-07-04

Similar Documents

Publication Publication Date Title
CN101645083B (en) Acquisition system and method of text field based on concept symbols
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
CN104765769B (en) The short text query expansion and search method of a kind of word-based vector
CN106570179B (en) A kind of kernel entity recognition methods and device towards evaluation property text
CN104778209B (en) A kind of opining mining method for millions scale news analysis
CN106709040B (en) Application search method and server
CN101470732B (en) Auxiliary word stock generation method and apparatus
US20210056571A1 (en) Determining of summary of user-generated content and recommendation of user-generated content
CN109829166B (en) People and host customer opinion mining method based on character-level convolutional neural network
CN107180025B (en) Method and device for identifying new words
CN108073568A (en) keyword extracting method and device
CN105550269A (en) Product comment analyzing method and system with learning supervising function
CN107193801A (en) A kind of short text characteristic optimization and sentiment analysis method based on depth belief network
CN101751455B (en) Method for automatically generating title by adopting artificial intelligence technology
CN106021410A (en) Source code annotation quality evaluation method based on machine learning
CN106202372A (en) A kind of method of network text information emotional semantic classification
CN101127042A (en) Sensibility classification method based on language model
CN110879831A (en) Chinese medicine sentence word segmentation method based on entity recognition technology
CN112051986B (en) Code search recommendation device and method based on open source knowledge
CN106599032A (en) Text event extraction method in combination of sparse coding and structural perceptron
CN105183833A (en) User model based microblogging text recommendation method and recommendation apparatus thereof
CN110362678A (en) A kind of method and apparatus automatically extracting Chinese text keyword
CN110134799B (en) BM25 algorithm-based text corpus construction and optimization method
CN110674296B (en) Information abstract extraction method and system based on key words
CN109934251B (en) Method, system and storage medium for recognizing text in Chinese language

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120704

Termination date: 20160116

EXPY Termination of patent right or utility model