CN100405362C - New Chinese characters spoken language analytic method and device - Google Patents

Publication number
CN100405362C
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2005101093358A
Other languages
Chinese (zh)
Other versions
CN1949211A (en
Inventor
宗成庆
左云存
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CNB2005101093358A priority Critical patent/CN100405362C/en
Publication of CN1949211A publication Critical patent/CN1949211A/en
Application granted granted Critical
Publication of CN100405362C publication Critical patent/CN100405362C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to a new method and device for parsing spoken Chinese. Statistical methods are used to acquire semantic rules automatically from a training corpus and to generate semantic classification trees. The trees are used to parse the words of a sentence, yielding one or more semantics for each word together with their probabilities. A statistical parsing model then selects and combines the results of the semantic classification trees to obtain the domain action of the whole sentence. Experiments show that the method achieves high accuracy and robustness and is suited to shallow semantic parsing of spoken Chinese in restricted domains.

Description

A method and device for parsing spoken Chinese
Technical field
The invention belongs to the field of natural language processing, and in particular concerns methods and devices for spoken language translation, human-machine dialogue, and spoken language parsing.
Background art
With the development of the world economy, science, and technology, people from different countries interact ever more frequently in settings such as tourism and trade, and language barriers become correspondingly more prominent. Spoken language translation, whose goal is free communication between speakers of different languages, is attracting growing attention. In addition, in everyday applications such as telephone enquiry, automatic paging, hotel reservation, telephone ticket booking, and conference booking, the work of human attendants can be taken over entirely by dialogue systems. A dialogue system can store a large amount of information, retrieve it quickly, and update it dynamically. Developing spoken language translation technology and spoken dialogue systems therefore promises considerable social and economic benefit.
Spoken language parsing is one of the key technologies in spoken language translation systems and dialogue systems. Since the 1980s, developed countries such as the United States, Germany, France, and Japan have carried out a great deal of research on it. In the late 1980s and early 1990s, the Massachusetts Institute of Technology (MIT) successively developed the VOYAGER system for geographic information enquiry, the GALAXY system for querying geographic and weather information, and the flight booking systems PEGASUS and ATIS for online civil aviation information enquiry; MIT designed the spoken language parser TINA specifically for these dialogue systems. In 1993 the European Union set up the LRE (Language Research and Engineering) program, and the Japanese Ministry of Education set up the UGD (Understanding and Generation of Dialogue) program; both were dedicated to funding research on spoken language parsing.
Current spoken language parsers fall broadly into rule-based methods and statistical methods. Rule-based parsing is the traditional approach; its common weakness is insufficient robustness, so it copes poorly with the ill-formed input found in speech. In recent years, parsers based on statistical models have been applied more widely. They draw their knowledge from large corpora and acquire it automatically, which greatly reduces manual effort; because that knowledge comes from real text, they tend to be more robust and more portable across domains. Statistical methods, however, have difficulty handling the structural relations of a sentence and long-distance constraints. The parser designed here combines statistics and rules; experimental results show that the method achieves high accuracy and is an effective approach to shallow semantic parsing of spoken language.
Spoken language parsers play an important role in systems such as automatic spoken language translation and human-machine dialogue. Traditional rule-based parsers have difficulty with the non-standard phenomena of speech, while parsers based on statistical models handle long-distance constraints within a sentence poorly.
Summary of the invention
The object of the present invention is to provide a new method and device for shallow semantic parsing of spoken Chinese.
We design a spoken language parser that combines statistics and rules. Statistical methods are used to acquire semantic rules automatically from a training corpus and to generate semantic classification trees. The trees are then used to parse the words in the Chinese sentence that are closely related to its shallow semantics, yielding one or more semantics for each word together with their probabilities. Finally, a statistical parsing model selects and combines the results of the semantic classification trees to obtain the domain action of the whole sentence. Experimental results show that the method achieves high accuracy and robustness and is suited to shallow semantic parsing of spoken Chinese in restricted domains.
The invention uses the domain action of the Interchange Format (IF), the intermediate representation proposed by C-STAR (Consortium for Speech Translation Advanced Research international), as its shallow semantic representation of speech. A domain action describes the speaker's intention and the key concepts of the sentence, and the task of the shallow semantic parser is to obtain the domain action of a spoken Chinese sentence.
The invention is characterized by high robustness: it copes well both with the ill-formed input found in speech and with long-distance constraints between words in a sentence. In addition, because rules are acquired automatically from the corpus by statistical methods, the parsing system can be ported quickly between domains, so that the technology can be commercialized quickly in different domains. Fig. 1 is a structural block diagram of the invention. The shallow semantic parser comprises two main parts, training and parsing, and consists of a pre-processing device, a manual annotation device, a lookup device, a semantic classification tree device, and a statistical parsing model device. The pre-processing device is connected to the manual annotation device; the manual annotation device is connected to the semantic classification tree device and the statistical parsing model device; and the lookup device is connected to the semantic classification tree device.
Technical solution
Spoken Chinese parsing based on semantic classification trees is divided into a training part and a parsing part. The detailed process is as follows:
The training process comprises:
a) collecting a spoken language corpus for the relevant domain;
b) pre-processing the sentences;
c) annotating the domain action of each sentence, the semantics corresponding to each keyword, and the semantically related words of each keyword;
d) using the annotated corpus to construct the semantic classification tree device and to obtain the parameters of the statistical parsing model device;
The parsing process comprises:
e) pre-processing the sentence;
f) finding the key words in the sentence to be parsed that are closely related to the domain action;
g) using the semantic classification trees to obtain one or more semantics for each key word and the probability of each semantic;
h) using the statistical parsing model to select and combine the results of step g) to obtain the shallow semantic domain action of the sentence.
Step d) uses a measure of semantic uncertainty.
Step d) uses a probability representation for the semantics at each node of a semantic classification tree.
Step d) uses the following semantic classification tree construction algorithm:
(1) Create a stack T to hold pointers to all nodes that can currently be split;
(2) Add all related word classes of A annotated in the training corpus to question(A); take the set K of all corpus sentences containing the word A to be parsed as the corpus set of the root node; record in the root node all semantics of A in K and their probabilities; initialize the root node expression to "<+>"; and push the root node pointer onto T;
(3) If T is empty, no node can be split further, the semantic classification tree is complete, and the algorithm ends. If T is not empty, pop the top node pointer and replace each "+" in the node expression with each word class in question(A), generating 4M×n questions (n is the number of "+" symbols in the node expression);
(4) Using Formulas (1) and (2), select the question that maximizes Δi as the node question. If the node is a non-leaf node, go to step (5); otherwise return to step (3);
(5) Create the left and right children of the node and divide the sentences in the node's corpus set into two parts: sentences that satisfy the node question form the corpus set of the left child, and sentences that do not form the corpus set of the right child. Record in each child the semantics of the word to be parsed and their probabilities within that child's corpus set, and push the two child pointers onto T;
(6) Set the expression of the left child to the question of its parent and the expression of the right child to the expression of its parent, then return to step (3).
Step d) uses a statistical parsing model.
The details of each part of the technical solution are described below.
1. Pre-processing device
Pre-processing comprises word segmentation and lexical semantic classification; its purpose is to obtain the word class sequence corresponding to the sentence.
The invention targets spoken language parsing in a specific domain, where the vocabulary encountered is very limited. We use forward maximum matching for word segmentation, and its accuracy meets the needs of the system.
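As an illustration of forward maximum matching, the following Python sketch scans the sentence from left to right and, at each position, takes the longest string found in a lexicon. The function name, the single-character fallback, and the toy lexicon are assumptions made for this example; the patent does not specify them.

```python
def forward_max_match(text, lexicon, max_len=None):
    """Greedy left-to-right segmentation: take the longest lexicon entry at each position."""
    if max_len is None:
        max_len = max((len(w) for w in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a lexicon word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Assumption: a single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon for the hotel reservation domain.
LEXICON = {"我", "想", "预订", "一个", "单人间", "单人", "间"}
print(forward_max_match("我想预订一个单人间", LEXICON))
# ['我', '想', '预订', '一个', '单人间']
```

Because the longest candidate is tried first, "单人间" is preferred over the shorter entries "单人" and "间", which is the behavior that makes the method adequate for a small, closed vocabulary.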
Lexical semantic classification assigns each word to a semantic class. It resembles part-of-speech tagging, except that the label is the semantic class to which the word belongs rather than its part of speech. Classification is performed with a dictionary: we defined a semantic class dictionary that classifies the vocabulary of the specific domain. Words are classified according to their semantic function in a sentence, and words with the same semantic function are placed in one class. For example, two synonymous words meaning "single room" have the same meaning and therefore necessarily the same semantic function, so they belong to one class. Likewise, "big" and "small" differ in meaning, but their semantic function in a sentence is essentially the same, so they too are placed in one class. After semantic classification, the words of a sentence yield a sequence of semantic classes. Table 1 lists some of the semantic classes in the hotel reservation domain.
Table 1. Semantic classes and the vocabulary they contain
Semantic class | Vocabulary in the class
N_C_COST | expense, charge, funds, take, cost
N_C_BED | bed, berth, bunk bed, big bed
N_O_COUNTRY_PERSON | Englishman, Japanese, American, German
N_C_NAME | name, name, full name, your name
V_INCLUDE | comprise, have, add
V_RESERVE | reservation, order, confirmation slip, order
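A semantic class dictionary of this kind can be sketched as a mapping from class name to word set, inverted once for lookup. The class names below follow Table 1, but the Chinese entries, the function name, and the "OTHER" placeholder for unlisted words are illustrative assumptions.

```python
# Hypothetical excerpt of a semantic class dictionary (class names follow Table 1).
SEMANTIC_CLASSES = {
    "V_RESERVE": {"预订", "订"},
    "N_C_COST": {"费用", "价格"},
    "N_C_BED": {"床", "床位"},
}

# Invert the dictionary once so that each word maps directly to its class.
WORD_TO_CLASS = {
    word: cls for cls, words in SEMANTIC_CLASSES.items() for word in words
}


def to_class_sequence(tokens, unknown="OTHER"):
    """Replace each token by its semantic class; unlisted words get a placeholder."""
    return [WORD_TO_CLASS.get(token, unknown) for token in tokens]


print(to_class_sequence(["预订", "床位", "吗"]))
# ['V_RESERVE', 'N_C_BED', 'OTHER']
```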
2. Annotation device
Annotation is done manually and comprises two tasks. 1) Domain action annotation: the domain action corresponding to a sentence is marked. For example, the sentence "I want to reserve a single room" is annotated as "give-information+disposition+reservation+room". 2) Annotation of the semantics of the keywords related to the domain action and of their semantically related words: the semantic corresponding to each word class related to the domain action is marked, and the word classes in the sentence that influence the semantic of that word class are marked as well. For example, in the sentence "I reserve a single room", "single room" is represented after pre-processing by the word class "ROOM_INFO". In this sentence "single room" corresponds to "room" in the domain action, so "ROOM_INFO" is labelled with the semantic "room". The word "reserve" is the key factor that determines why "ROOM_INFO" is labelled "room", so the word class "V_RESERVE" corresponding to "reserve" is added to question(ROOM_INFO), the set of semantically related word classes of "ROOM_INFO".
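One possible in-memory form for such annotations is sketched below. The dataclass names, field names, and the word classes "PRON" and "NUM" are hypothetical; only "ROOM_INFO", "V_RESERVE", the semantic "room", and the domain action string come from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class KeywordAnnotation:
    word_class: str                      # e.g. "ROOM_INFO"
    semantic: str                        # domain-action component, e.g. "room"
    related_classes: set = field(default_factory=set)  # classes influencing this reading


@dataclass
class AnnotatedSentence:
    class_sequence: list                 # output of pre-processing
    domain_action: str                   # full domain action of the sentence
    keywords: list = field(default_factory=list)


def collect_question_sets(corpus):
    """Union the related classes of each keyword class over the corpus: question(A)."""
    question = {}
    for sentence in corpus:
        for kw in sentence.keywords:
            question.setdefault(kw.word_class, set()).update(kw.related_classes)
    return question


example = AnnotatedSentence(
    class_sequence=["PRON", "V_RESERVE", "NUM", "ROOM_INFO"],
    domain_action="give-information+disposition+reservation+room",
    keywords=[KeywordAnnotation("ROOM_INFO", "room", {"V_RESERVE"})],
)
print(collect_question_sets([example]))
# {'ROOM_INFO': {'V_RESERVE'}}
```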
3. Lookup device
The lookup device finds the word classes in the input sentence that are closely related to the domain action; these serve as input to the semantic classification trees.
4. Semantic classification tree device
A semantic classification tree is a binary tree containing a series of semantic rules. Fig. 2 is an abridged example of a semantic classification tree used to parse the semantics of the word class V_DAO. In the figure, "+" denotes a gap of at least one word, "<" and ">" denote the beginning and end of the sentence, triangles denote leaf nodes, and dotted lines denote omitted parts; the semantics and probabilities of the word class at each node are omitted from the figure. Each non-leaf node contains a question that tests whether the input sentence matches a certain expression. During parsing, if the input sentence is A, matching at the root node can be described as: IF (A matches <+V_DAO+>) THEN GOTO left-son; ELSE GOTO right-son. After the child node is reached, matching continues until a leaf node is reached, which gives the parse result.
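The node test can be sketched by compiling an expression into a regular expression over the space-joined word class sequence. In this sketch an expression is a list such as ["+", "V_DAO", "+"], the sentence boundaries "<" and ">" are implied by requiring a full match, and the class names PRON and N_PLACE are hypothetical. This is one way to realize the matching, not necessarily the one used by the inventors.

```python
import re

GAP = "+"  # stands for a run of at least one word


def expression_to_regex(expression):
    """Compile e.g. ["+", "V_DAO", "+"] into a regex over a space-joined class sequence."""
    parts = [r"\S+(?: \S+)*" if item == GAP else re.escape(item) for item in expression]
    return re.compile(" ".join(parts))


def matches(class_sequence, expression):
    """True if the whole sentence (from "<" to ">") fits the expression."""
    pattern = expression_to_regex(expression)
    return pattern.fullmatch(" ".join(class_sequence)) is not None


print(matches(["PRON", "V_DAO", "N_PLACE"], ["+", "V_DAO", "+"]))  # True
print(matches(["V_DAO", "N_PLACE"], ["+", "V_DAO", "+"]))          # False
```

The second call fails because the leading "+" requires at least one word before V_DAO, which mirrors the definition of "+" given for Fig. 2.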
The annotation device yields all word classes related to the domain action, together with the set of semantically related word classes of each. These are used to build the semantic classification tree device, which generates one semantic classification tree for each word class related to the domain action. Suppose the word to be parsed is A and its set of semantically related word classes is question(A). The root node expression is initialized to "<+>", and node questions are then generated by replacing a "+" in the node expression with some word class w from question(A). For example, replacing the "+" in the root node expression with "w+" yields the root question <w+>. Each word class can replace a "+" in four ways: "w", "w+", "+w", and "+w+". Thus, when the expression contains N (N ≥ 0) "+" symbols and question(A) has M (M > 0) members, there are 4M × N possible replacements and correspondingly 4M × N candidate questions. A method for choosing among them is therefore needed; the choice directly affects the size of the tree and hence the efficiency and speed of parsing. When generating each node, the semantics at that node should reach a definite state as quickly as possible. To this end we use i(T), given in Formula (1), to represent the semantic uncertainty of a node.
$$i(T) = \sum_{j \in S} \; \sum_{k \in S,\, k \neq j} P(j/T) \times P(k/T) \qquad (1)$$
Here S is the set of semantics of the word to be parsed at node T, and P(j/T) is the probability that the semantic of the word class at node T is j. If the corpus set of node T (see the semantic classification tree generation algorithm for details) contains n (n > 0) sentences, of which m (m > 0) have the word class to be parsed labelled with semantic j, then P(j/T) = m/n. The larger i(T) is, the more uncertain the semantics at the node. For each node, the most suitable question is the one that maximizes Δi, given in Formula (2).
$$\Delta i = i(T) - p_l \times i(L) - p_r \times i(R) \qquad (2)$$
Here $p_l$ and $p_r$ are the probabilities of moving from node T to its left and right child respectively. Each is the ratio of the number of sentences in the corresponding child's corpus set to the number of sentences in the corpus set of T, and both are determined by the node question. i(L) and i(R) are the semantic uncertainties of the left and right children. The larger Δi is, the more the semantic uncertainty decreases and the faster the semantics stabilize. In the following two cases, Δi = 0 for every candidate question, so the node cannot be split further and is a leaf node.
(1) The node has only one semantic, that is, i(T) is zero;
(2) The node has several semantics, but none of the questions generated from the word classes in the existing related word class set can separate them.
When every node that cannot be split further is a leaf node, a complete semantic classification tree has been generated.
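The two measures can be sketched in Python as follows. Because the probabilities at a node sum to 1, the double sum in Formula (1) equals 1 minus the sum of the squared probabilities (the Gini index), which is the form used below. The function names and the semantic labels "arrival" and "departure" are illustrative assumptions.

```python
from collections import Counter


def impurity(labels):
    """i(T): sum over j != k of P(j/T) * P(k/T), computed as 1 - sum_j P(j/T)^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gain(labels, goes_left):
    """Delta i = i(T) - p_l * i(L) - p_r * i(R) for one yes/no question.

    goes_left[k] is True when sentence k satisfies the question.
    """
    n = len(labels)
    if n == 0:
        return 0.0
    left = [lab for lab, yes in zip(labels, goes_left) if yes]
    right = [lab for lab, yes in zip(labels, goes_left) if not yes]
    return (impurity(labels)
            - len(left) / n * impurity(left)
            - len(right) / n * impurity(right))


labels = ["arrival", "arrival", "departure", "departure"]
print(impurity(labels))                            # 0.5
print(gain(labels, [True, True, False, False]))    # 0.5: the question separates the semantics
print(gain(labels, [True, False, True, False]))    # 0.0: the question does not help
```

A node with a single semantic has impurity 0, and a question that leaves both children with the parent's distribution has gain 0; these correspond to the two leaf conditions listed above.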
Suppose the word class to be parsed is A and its set of semantically related word classes question(A) has M members. Following the explanation above, the generation of the corresponding semantic classification tree can be described formally as the following algorithm:
(1) Create a stack T to hold pointers to all nodes that can currently be split.
(2) Add all related word classes of A annotated in the training corpus to question(A); take the set K of all corpus sentences containing the word A to be parsed as the corpus set of the root node; record in the root node all semantics of A in K and their probabilities; initialize the root node expression to "<+>"; and push the root node pointer onto T.
(3) If T is empty, no node can be split further, the semantic classification tree is complete, and the algorithm ends. If T is not empty, pop the top node pointer and replace each "+" in the node expression with each word class in question(A), generating 4M×n questions (n is the number of "+" symbols in the node expression).
(4) Using Formulas (1) and (2), select the question that maximizes Δi as the node question. If the node is a non-leaf node, go to step (5); otherwise return to step (3).
(5) Create the left and right children of the node and divide the sentences in the node's corpus set into two parts: sentences that satisfy the node question form the corpus set of the left child, and sentences that do not form the corpus set of the right child. Record in each child the semantics of the word to be parsed and their probabilities within that child's corpus set, and push the two child pointers onto T.
(6) Set the expression of the left child to the question of its parent and the expression of the right child to the expression of its parent, then return to step (3).
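Steps (1) to (6) can be sketched as the following Python routine. It is a minimal illustration under several assumptions: expressions are lists with "+" as the gap symbol, sentence boundaries are implied by full matching, matching is done with a regular expression, and the names Node, build_tree, and the sample word classes and semantics are hypothetical.

```python
import re
from collections import Counter

GAP = "+"


def matches(seq, expr):
    parts = [r"\S+(?: \S+)*" if x == GAP else re.escape(x) for x in expr]
    return re.fullmatch(" ".join(parts), " ".join(seq)) is not None


def impurity(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0


def candidate_questions(expr, question_set):
    """Replace each "+" by w, w+, +w or +w+ for every class w: 4*M*n questions."""
    for pos, item in enumerate(expr):
        if item != GAP:
            continue
        for w in sorted(question_set):
            for repl in ([w], [w, GAP], [GAP, w], [GAP, w, GAP]):
                yield expr[:pos] + repl + expr[pos + 1:]


class Node:
    def __init__(self, samples, expr):
        self.samples = samples      # list of (class_sequence, semantic_label)
        self.expr = expr            # node expression
        self.question = None        # chosen question; stays None for a leaf
        self.left = self.right = None
        counts = Counter(label for _, label in samples)
        self.probs = {lab: c / len(samples) for lab, c in counts.items()}


def build_tree(samples, question_set):
    root = Node(samples, [GAP])     # step (2): root expression "<+>"
    stack = [root]                  # step (1): nodes that may still be split
    while stack:                    # step (3)
        node = stack.pop()
        labels = [lab for _, lab in node.samples]
        best_gain, best_q, best_split = 0.0, None, None
        for q in candidate_questions(node.expr, question_set):
            yes = [s for s in node.samples if matches(s[0], q)]
            no = [s for s in node.samples if not matches(s[0], q)]
            if not yes or not no:
                continue            # a one-sided split has zero gain
            g = (impurity(labels)
                 - len(yes) / len(labels) * impurity([lab for _, lab in yes])
                 - len(no) / len(labels) * impurity([lab for _, lab in no]))
            if g > best_gain + 1e-12:   # step (4): keep the question with largest gain
                best_gain, best_q, best_split = g, q, (yes, no)
        if best_q is None:
            continue                # leaf: no question reduces the uncertainty
        node.question = best_q
        node.left = Node(best_split[0], best_q)      # step (6): expression = parent question
        node.right = Node(best_split[1], node.expr)  # step (6): expression = parent expression
        stack.extend([node.left, node.right])        # step (5)
    return root


samples = [
    (["PRON", "V_DAO", "N_PLACE"], "arrival"),
    (["PRON", "V_DAO", "N_PLACE"], "arrival"),
    (["N_TIME", "V_DAO", "V_LEAVE"], "time"),
]
root = build_tree(samples, {"N_PLACE", "V_DAO"})
print(root.question, root.left.probs, root.right.probs)
```

In this toy corpus the question "+ N_PLACE" separates the two semantics completely, so both children become leaves on the next iteration.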
Once a semantic classification tree has been generated for each word class related to the domain action, these trees can be used to parse the semantics of the key word classes in an input sentence. During parsing, the path is chosen according to whether the input sentence matches the question at each node. Although a parse does not necessarily reach a leaf node, the semantics and probabilities recorded at the last matched node can be taken as the result for the word being parsed and passed as input to the statistical parsing model.
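The traversal can be sketched as follows. The node layout, the handling of a missing child (returning the distribution of the last node reached), and the example probabilities are assumptions for illustration only.

```python
import re

GAP = "+"


def matches(seq, expr):
    parts = [r"\S+(?: \S+)*" if x == GAP else re.escape(x) for x in expr]
    return re.fullmatch(" ".join(parts), " ".join(seq)) is not None


class Node:
    def __init__(self, probs, question=None, left=None, right=None):
        self.probs = probs          # {semantic: probability} recorded at this node
        self.question = question    # None for a leaf
        self.left, self.right = left, right


def parse_word_class(root, class_sequence):
    """Walk down from the root and return the distribution at the last node reached."""
    node = root
    while node.question is not None:
        child = node.left if matches(class_sequence, node.question) else node.right
        if child is None:           # assumption: the branch is absent, so stop here
            break
        node = child
    return node.probs


# Hypothetical hand-built tree for the word class V_DAO.
tree = Node(
    {"arrival": 0.6, "time": 0.4},
    question=["+", "V_DAO", "+"],
    left=Node({"arrival": 0.9, "time": 0.1}),
    right=None,
)
print(parse_word_class(tree, ["PRON", "V_DAO", "N_PLACE"]))  # distribution of the left leaf
print(parse_word_class(tree, ["V_DAO"]))                     # falls back to the root
```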
5. Statistical parsing device
After the semantic classification trees have parsed each word class in the sentence that is closely related to the domain action, we have one or more semantics for each word class and the probability of each. The statistical parsing device generates the shallow semantic domain action of the whole sentence from these results.
The N-gram model is one of the most commonly used mathematical models in natural language processing. Suppose the sequence $x_1 x_2 \ldots x_m$ satisfies the N-gram assumption (N ≥ 1), that is, it is a Markov chain of order N − 1. Then the probability of an element $x_i$ depends only on the N − 1 elements before it:
$$p(x_i \mid x_1 \ldots x_{i-1}) = p(x_i \mid x_{i-N+1} \ldots x_{i-1}) \qquad (3)$$
If the domain action of a sentence is composed of the m semantics $s_1, s_2, \ldots, s_m$, then in theory the domain action has at most $m!$ possible orderings. Let $w_i \in \{s_1, s_2, \ldots, s_m\}$ with $w_i \neq w_j$; the sequence $W = w_1 w_2 \ldots w_m$ can then represent any one of them. Assuming that the components of a domain action also satisfy the Markov property, the probability of W is:
$$p(W) = p(w_1 w_2 \ldots w_m) = p(w_1) \times p(w_2 \mid w_1) \times \ldots \times p(w_i \mid w_{i-N+1} \ldots w_{i-1}) \times \ldots \times p(w_m \mid w_{m-N+1} \ldots w_{m-1}) \qquad (4)$$
With a bigram model this becomes:
$$p(W) = p(w_1 w_2 \ldots w_m) = p(w_1) \times p(w_2 \mid w_1) \times \ldots \times p(w_i \mid w_{i-1}) \times \ldots \times p(w_m \mid w_{m-1}) \qquad (5)$$
Let $\mathrm{count}(w_{i-1} w_i)$ be the number of times $w_{i-1} w_i$ occurs in the semantics of the training corpus. Maximum likelihood estimation then gives:
$$p(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1} w_i)}{\mathrm{count}(w_{i-1})} \qquad (6)$$
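Formula (6) can be estimated from annotated domain actions with a few lines of Python. The sketch assumes that a domain action is stored as a string whose components are joined by "+", as in the annotation example; the function name and the second training string are hypothetical.

```python
from collections import Counter


def train_bigram(domain_actions, sep="+"):
    """Estimate p(w_i | w_{i-1}) = count(w_{i-1} w_i) / count(w_{i-1})."""
    unigram, bigram = Counter(), Counter()
    for action in domain_actions:
        parts = action.split(sep)
        unigram.update(parts)                  # count(w)
        bigram.update(zip(parts, parts[1:]))   # count(w_{i-1} w_i)
    return {(prev, cur): c / unigram[prev] for (prev, cur), c in bigram.items()}


probs = train_bigram([
    "give-information+disposition+reservation+room",
    "give-information+reservation+room",
])
print(probs[("reservation", "room")])                 # 1.0
print(probs[("give-information", "disposition")])     # 0.5
```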
Using Formula (5) we can compute the probability of each of the $m!$ possible cases and take the case with the highest probability as the domain action $W'$ of the sentence:
$$\begin{aligned} W' &= \arg\max_{w_1 w_2 \ldots w_m} p(w_1 w_2 \ldots w_m) \\ &= \arg\max_{w_1 w_2 \ldots w_m} p(w_1) \times p(w_2 \mid w_1) \times \ldots \times p(w_i \mid w_{i-1}) \times \ldots \times p(w_m \mid w_{m-1}) \end{aligned} \qquad (7)$$
The above assumes that $s_1, s_2, \ldots, s_m$ are fixed. After parsing with the semantic classification trees, however, each word class in the sentence that is related to the domain action may yield several semantics. Suppose the sentence contains n word classes to be parsed and the semantic classification trees return $K_1, K_2, \ldots, K_n$ semantics for them respectively; the domain action then has $K_1 \times K_2 \times \ldots \times K_n \times n!$ possible results. Let $p(s_{i,k_i} \mid c_i)$ ($0 < k_i \le K_i$) be the probability, obtained from the semantic classification tree, that word class $c_i$ takes semantic $s_{i,k_i}$ in the current sentence context. Let $W = w_1 w_2 \ldots w_n$ denote one possible domain action, where $w_i \in \{s_{1,k_1}, s_{2,k_2}, \ldots, s_{n,k_n}\}$ and $w_i \neq w_j$. Taking into account the probability of mapping each word class to its semantic label, the domain action $W' = w_1 w_2 \ldots w_n$ of the sentence can be computed as:
$$\begin{aligned} W' &= \arg\max_{k_1, k_2, \ldots, k_n,\; w_1, w_2, \ldots, w_n} p(s_{1,k_1}, s_{2,k_2}, \ldots, s_{n,k_n} \mid c_1, c_2, \ldots, c_n)\; p(w_1 w_2 \ldots w_n) \\ &= \arg\max_{k_1, k_2, \ldots, k_n,\; w_1, w_2, \ldots, w_n} p(s_{1,k_1} \mid c_1) \times \ldots \times p(s_{n,k_n} \mid c_n) \times p(w_1) \times p(w_2 \mid w_1) \ldots p(w_n \mid w_{n-1}) \end{aligned} \qquad (8)$$
The sequence $w_1 w_2 \ldots w_n$ obtained from Formula (8) is the domain action of the sentence. Formula (8) can be read as follows. To obtain the domain action of a sentence, one semantic is first selected for each key word, which is expressed by the first part of the formula; the order of the semantics selected for the key words is then adjusted, which is expressed by the second part. Two factors are thus considered when the domain action is obtained. The first is the probability of the semantics of the words closely related to the domain action in the current sentence context, obtained from the semantic classification trees. The second is the likelihood of the combination of the various semantics, expressed by the bigram model, whose parameters are learned from the annotated domain actions. In this way the semantic of each individual word in the final result has a high probability, and the domain action obtained is also meaningful as a whole.
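A brute-force reading of Formula (8) is sketched below: it enumerates every choice of one semantic per key word and every ordering of the chosen semantics, which matches the count of possible results given above. The function name, the p_first table for $p(w_1)$, the floor value used for unseen events, and all probabilities in the example are assumptions made for illustration.

```python
from itertools import permutations, product


def decode(candidates, p_first, p_bigram, floor=1e-6):
    """Maximize Formula (8) by exhaustive search.

    candidates: one dict {semantic: p(semantic | word class)} per key word.
    p_first:    {semantic: p(w_1)}; p_bigram: {(prev, cur): p(cur | prev)}.
    floor:      assumed small probability for events unseen in training.
    """
    if not candidates:
        return None, 0.0
    best_seq, best_score = None, 0.0
    for choice in product(*(c.items() for c in candidates)):
        emit = 1.0
        for _, p in choice:
            emit *= p                          # first part: p(s_i | c_i)
        semantics = [s for s, _ in choice]
        if len(set(semantics)) < len(semantics):
            continue                           # the source requires w_i != w_j
        for order in permutations(semantics):  # second part: bigram over the ordering
            score = emit * p_first.get(order[0], floor)
            for prev, cur in zip(order, order[1:]):
                score *= p_bigram.get((prev, cur), floor)
            if score > best_score:
                best_seq, best_score = list(order), score
    return best_seq, best_score


candidates = [
    {"give-information": 1.0},
    {"reservation": 0.8, "cancellation": 0.2},
    {"room": 1.0},
]
p_first = {"give-information": 1.0}
p_bigram = {
    ("give-information", "reservation"): 0.7,
    ("reservation", "room"): 0.9,
    ("give-information", "cancellation"): 0.3,
    ("cancellation", "room"): 0.5,
}
print(decode(candidates, p_first, p_bigram))
```

Exhaustive search is only practical because a sentence in a restricted domain contains few key words; the source does not state which search strategy the inventors used.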
Description of the drawings
Fig. 1 is a structural diagram of the spoken language parsing device of the invention.
Fig. 2 is an example of a semantic classification tree of the invention.
Embodiment
To describe an embodiment of the invention in detail, we take a shallow semantic parsing system for spoken Chinese in the hotel reservation domain as an example.
Training process:
We collected a certain amount of corpus data in the hotel reservation domain and applied word segmentation and lexical semantic classification to it. The corpus was then annotated manually to obtain, for each sentence, its domain action, the key word classes closely related to the domain action together with their semantics (each semantic being a single component of the domain action), and the word classes semantically related to each keyword. Finally, the annotated corpus was used to build the semantic classification trees and to estimate the parameters of the bigram model, which are the parameters of the statistical parsing model.
Parsing process:
A sentence to be parsed is first segmented into words and then subjected to lexical semantic classification. The word classes in the sentence that are closely related to the domain action are then found, and the semantic classification tree device parses each of them to obtain its semantics and their probabilities. Finally, the statistical parsing model further selects and combines the results of the semantic classification tree device to obtain the shallow domain action of the sentence.

Claims (6)

1. A method for shallow semantic parsing of spoken Chinese based on semantic classification trees, comprising the following steps:
The training process comprises:
a) collecting a spoken language corpus for the relevant domain;
b) pre-processing the sentences;
c) annotating the domain action of each sentence, the semantics corresponding to each keyword, and the semantically related words of each keyword;
d) using the annotated corpus to construct the semantic classification tree device and to obtain the parameters of the statistical parsing model device;
The parsing process comprises:
e) pre-processing the sentence;
f) finding the key words in the sentence to be parsed that are closely related to the domain action;
g) using the semantic classification trees to obtain one or more semantics for each key word and the probability of each semantic;
h) using the statistical parsing model to select and combine the results of step g) to obtain the shallow semantic domain action of the sentence.
2. The method according to claim 1, characterized in that the semantic uncertainty of the semantics in step d) is expressed as:
$$i(T) = \sum_{j \in S} \; \sum_{k \in S,\, k \neq j} P(j/T) \times P(k/T) \qquad (1)$$
where S is the set of semantics of the word to be parsed at node T, P(j/T) is the probability that the semantic of the word class at node T is j, n is the total number of sentences in the corpus set, and m is the number of sentences in which the word class to be parsed is labelled with semantic j; the larger i(T) is, the more uncertain the semantics at node T.
3. The method according to claim 1, characterized in that the semantic probability of a node in the semantic classification tree in step d) is expressed as P(j/T): if the corpus set of node T contains n (n > 0) sentences, of which m (m > 0) have the word class to be parsed labelled with semantic j, then P(j/T) = m/n.
4. The method according to claim 1, characterized in that the semantic classification tree construction algorithm mentioned in step d) comprises:
(1) creating a stack T to hold pointers to all nodes that can currently be split;
(2) adding all parts of speech related to A that were marked in the training corpus to question(A); taking the corpus set K of all sentences containing the word A to be parsed as the root node; recording in the root node all semantics of A in K and their probabilities; initializing the root node expression to "<+>"; and pushing the root node pointer onto T;
(3) if T is empty, no node can be split further, the complete semantic classification tree has been generated, and the algorithm ends; if T is not empty, popping the topmost node pointer and substituting each part of speech in question(A) into the node expression in turn, generating 4M*n questions;
(4) computing, according to formulas (1) and (2), the question that maximizes Δi and taking it as the node question; if the node is a non-leaf node, executing step (5); otherwise, returning to step (3);
(5) creating the left and right child nodes of the node and dividing all sentences in the node's corpus set into two parts: the sentences that satisfy the node question form the corpus set of the left child node, and the sentences that do not satisfy it form the corpus set of the right child node; recording the semantics and probability information of the word to be parsed in the corpus sets of the left and right child nodes respectively; and pushing the left and right child node pointers onto T;
(6) setting the left child node expression to the parent node question of the left subtree and the right child node expression to the parent node expression of the right subtree, then returning to step (3).
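Steps (1) to (6) can be sketched as a stack-driven loop. The sketch below is a simplification under stated assumptions, not the patented algorithm: each question is reduced to "does the sentence context contain part of speech q?" instead of the 4M*n expression-based questions; Δi is assumed to be the usual impurity decrease i(T) − p_L·i(T_L) − p_R·i(T_R), since formula (2) is not reproduced in this section; and all names and example data are hypothetical.

```python
from collections import Counter


def probs(labels):
    """P(j/T) = m/n for every semantic label j at a node."""
    n = len(labels)
    return {s: m / n for s, m in Counter(labels).items()}


def impurity(labels):
    """Equals formula (1) when the probabilities sum to 1."""
    return 1.0 - sum(p * p for p in probs(labels).values())


def build_tree(corpus, questions, min_gain=1e-9):
    """corpus: list of (set of context parts of speech, semantic label) pairs."""
    root = {"corpus": corpus, "expr": "<+>"}       # step (2): root expression
    stack = [root]                                 # step (1): splittable nodes
    while stack:                                   # step (3): stop when empty
        node = stack.pop()
        items = node["corpus"]
        labels = [label for _, label in items]
        node["semantics"] = probs(labels)          # semantics and probabilities
        best = None
        best_gain = min_gain
        for q in questions:                        # step (4): maximize delta i
            yes = [it for it in items if q in it[0]]
            no = [it for it in items if q not in it[0]]
            if not yes or not no:
                continue                           # question does not split
            gain = (impurity(labels)
                    - len(yes) / len(items) * impurity([l for _, l in yes])
                    - len(no) / len(items) * impurity([l for _, l in no]))
            if gain > best_gain:
                best, best_gain = (q, yes, no), gain
        if best is None:
            continue                               # leaf node: nothing to split
        q, yes, no = best
        node["question"] = q
        # steps (5)-(6): left child takes the question, right keeps the expression
        node["left"] = {"corpus": yes, "expr": q}
        node["right"] = {"corpus": no, "expr": node["expr"]}
        stack.extend([node["left"], node["right"]])
    return root


corpus = [
    ({"NUM", "ROOM"}, "room_count"),
    ({"NUM", "ROOM"}, "room_count"),
    ({"NUM", "MONEY"}, "price"),
    ({"MONEY"}, "price"),
]
tree = build_tree(corpus, ["NUM", "ROOM", "MONEY"])
```

In this toy corpus the question "ROOM" separates the two semantics completely, so it is chosen at the root and both children become leaves.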
5. The method according to claim 1, characterized in that the statistical parsing model described in step d) is:
$$ \begin{aligned} W' &= \arg\max_{w_1 w_2 \cdots w_m} p(w_1 w_2 \cdots w_m) \\ &= \arg\max_{w_1 w_2 \cdots w_m} p(w_1) \times p(w_2 \mid w_1) \times \cdots \times p(w_i \mid w_{i-1}) \times \cdots \times p(w_m \mid w_{m-1}) \end{aligned} \qquad (7) $$
In the formula, the domain behavior of the sentence is W' = w_1 w_2 … w_n, and the part-of-speech sequence is W = w_1 w_2 … w_m. After the semantic classification tree has parsed each part of speech in the sentence that is closely related to the domain behavior, one or more semantics are obtained for each part of speech, together with the probability of each semantics; the statistical parsing model generates the shallow semantic domain behavior of the whole sentence from these semantic results.
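The arg max in formula (7) can be found without enumerating every sequence by using dynamic programming over the candidate semantics at each position. The sketch below is an illustration under assumptions: the classification tree is assumed to supply only the candidate lists, the unigram and bigram tables are assumed to be estimated elsewhere, unseen events receive a small floor probability, and all names and numbers are hypothetical.

```python
import math


def best_sequence(candidates, unigram, bigram, floor=1e-6):
    """Return the sequence maximizing p(w1) * prod_i p(w_i | w_{i-1}).

    candidates: one list of candidate semantic labels per position.
    unigram:    dict label -> p(w1).
    bigram:     dict (previous label, label) -> p(w_i | w_{i-1}).
    """
    if not candidates or any(not options for options in candidates):
        raise ValueError("every position needs at least one candidate")
    # scores[label] = (log-probability of the best path ending in label, path)
    scores = {w: (math.log(unigram.get(w, floor)), [w]) for w in candidates[0]}
    for options in candidates[1:]:
        new_scores = {}
        for w in options:
            # Pick the predecessor that gives the best extended path.
            prev = max(scores,
                       key=lambda p: scores[p][0] + math.log(bigram.get((p, w), floor)))
            log_p, path = scores[prev]
            new_scores[w] = (log_p + math.log(bigram.get((prev, w), floor)),
                             path + [w])
        scores = new_scores
    return max(scores.values(), key=lambda entry: entry[0])[1]


candidates = [["request", "give"], ["price", "room_count"]]
unigram = {"request": 0.6, "give": 0.4}
bigram = {
    ("request", "price"): 0.7,
    ("request", "room_count"): 0.3,
    ("give", "price"): 0.2,
    ("give", "room_count"): 0.8,
}
best = best_sequence(candidates, unigram, bigram)
```

Working in log space avoids numerical underflow on long sentences, and keeping only the best path per label bounds the work at each position by the product of the adjacent candidate counts.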
6. A Chinese spoken language parsing device, characterized in that the spoken language parsing device consists of a preprocessing device, a manual annotation device, a lookup device, a semantic classification tree device, and a statistical parsing model device, wherein:
The preprocessing device outputs the part-of-speech sequence corresponding to a sentence;
The manual annotation device is connected to the preprocessing device; for the part-of-speech sequence obtained by preprocessing it generates a domain behavior label, and for each part of speech related to the domain behavior it generates one corresponding semantic label;
The lookup device is connected to the semantic classification tree device; it converts the input sentence into chunks to be parsed and sends them to the semantic classification tree device;
The semantic classification tree device receives from the manual annotation device the semantic labels of the keywords related to the domain behavior and the labels of their semantically related words, and generates a binary tree containing a series of semantic rules;
The statistical parsing model device receives the domain behavior labels from the manual annotation device, obtains one or more semantics for each part of speech together with the probability of each semantics, and generates the shallow semantic domain behavior of the whole sentence from these semantic results.
CNB2005101093358A 2005-10-13 2005-10-13 New Chinese characters spoken language analytic method and device Expired - Fee Related CN100405362C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2005101093358A CN100405362C (en) 2005-10-13 2005-10-13 New Chinese characters spoken language analytic method and device

Publications (2)

Publication Number Publication Date
CN1949211A CN1949211A (en) 2007-04-18
CN100405362C true CN100405362C (en) 2008-07-23

Family

ID=38018731

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2005101093358A Expired - Fee Related CN100405362C (en) 2005-10-13 2005-10-13 New Chinese characters spoken language analytic method and device

Country Status (1)

Country Link
CN (1) CN100405362C (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101789008B (en) * 2010-01-26 2012-02-01 武汉理工大学 Man-machine interface system knowledge base and construction method thereof
WO2013088287A1 (en) * 2011-12-12 2013-06-20 International Business Machines Corporation Generation of natural language processing model for information domain
CN102708453B (en) * 2012-05-14 2016-08-10 北京奇虎科技有限公司 The method and device of solution of terminal fault is provided
CN104680177B (en) * 2015-03-03 2018-06-26 赵天奇 A kind of general-purpose type intelligence learning detection of agricultural products and the method and apparatus of classification
CN106326303B (en) * 2015-06-30 2019-09-13 芋头科技(杭州)有限公司 A kind of spoken semantic analysis system and method
CN105912521A (en) * 2015-12-25 2016-08-31 乐视致新电子科技(天津)有限公司 Method and device for parsing voice content
CN108417205B (en) * 2018-01-19 2020-12-18 苏州思必驰信息科技有限公司 Semantic understanding training method and system
CN108804424B (en) * 2018-06-08 2020-05-05 广州荔支网络技术有限公司 Corpus training method and device, electronic equipment and storage medium
CN111292751B (en) * 2018-11-21 2023-02-28 北京嘀嘀无限科技发展有限公司 Semantic analysis method and device, voice interaction method and device, and electronic equipment
CN111046146B (en) * 2019-12-27 2023-05-12 北京百度网讯科技有限公司 Method and device for generating information
CN112580365B (en) * 2020-11-05 2024-06-11 科大讯飞(北京)有限公司 Chapter analysis method, electronic equipment and storage device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5418717A (en) * 1990-08-27 1995-05-23 Su; Keh-Yih Multiple score language processing system
CN1570921A (en) * 2003-07-22 2005-01-26 中国科学院自动化研究所 Spoken language analyzing method based on statistic model

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20080723

Termination date: 20181013
