CN106844333A

CN106844333A - Sentence analysis method and system based on semantic and syntactic structure

Info

Publication number: CN106844333A
Application number: CN201611183668.XA
Authority: CN
Inventors: 简仁贤; 梅森傑
Original assignee: Intelligent Technology (shanghai) Co Ltd
Current assignee: Intelligent Technology (shanghai) Co Ltd
Priority date: 2016-12-20
Filing date: 2016-12-20
Publication date: 2017-06-13

Abstract

The invention provides a sentence analysis method and system based on semantic and syntactic structure. The method comprises the following steps: Step 1: input an original sentence; Step 2: generate an initial training corpus from the original sentence; Step 3: obtain the manually corrected training corpus, which is defined as the intermediate training corpus; Step 4: verify the correctness of the annotations in the intermediate training corpus; if the annotations are correct, define the intermediate training corpus as the final training corpus and proceed to Step 5; otherwise return to Step 3 and repeat; Step 5: feed the final training corpus into the model to be trained. The training corpus a user needs is produced semi-automatically, which improves the efficiency of producing training corpora; models can be trained with a customized training corpus, and the correctness of that corpus can be verified; a visualized result of semantic role labeling is provided; and each model can be trained continuously within the same system, thereby improving overall system performance.

Description

Sentence analysis method and system based on semantic and syntactic structure

Technical field

The invention belongs to the field of computer applications, and in particular relates to a sentence analysis method and system based on semantic and syntactic structure.

Background art

Natural language contains a large number of descriptions of events of all kinds in human life (as small as a single action, as large as a historical event), together with descriptions of the relations and features among states and events, such as the time and place at which an event occurs, the roles that take part in it, and its content. With the rise of Internet technologies, people rely more and more on the network to obtain information, and information on the Internet is massive, fast-growing, and redundant. To monitor and use this information better, machines must be able to analyze the events in text, so event-oriented sentence analysis research is becoming increasingly important. Sentence analysis means analyzing the function and meaning of each constituent of a sentence, converting the linear order of the words in the input sentence into a nonlinear data structure.

The main current theories of sentence analysis in natural language processing include dependency grammar and the formal grammar theory developed by Chomsky, namely phrase structure grammar and its extensions, such as lexical-functional grammar, functional unification grammar, generalized phrase structure grammar, and head-driven phrase structure grammar. These approaches are built on knowledge of English grammar; they do not, from the standpoint of understanding events, divide the constituents of a sentence into events and event roles and analyze the relations between them. Current research on events concentrates on identifying and extracting events and event roles from text, event-based automatic summarization, and automatic text generation, and all of this research urgently needs the support of the event-structure-based sentence analysis method of the invention.

Semantic role labeling is a core technology in natural language processing. Traditionally, semantic role labeling parses the semantic roles in a sentence by training a part-of-speech tagging model, a dependency parsing model, and so on. However, these models are scattered rather than housed in a single system. In addition, existing semantic role labeling solutions provide only a system whose training is already complete; they cannot supply different kinds of training corpora to meet different users' needs, nor can they let users keep improving performance on their own.

Summary of the invention

To address the defects of the prior art, the system of the invention combines multiple models so that users can produce training corpora independently and can refine each model independently to improve the performance of semantic role labeling.

A sentence analysis method based on semantic and syntactic structure, characterized in that it comprises the following steps:

Step 1: Input an original sentence;

Step 2: Generate an initial training corpus from the original sentence;

Step 3: Obtain the manually corrected training corpus, which is defined as the intermediate training corpus;

Step 4: Verify the correctness of the annotations in the intermediate training corpus; if the annotations are correct, define the intermediate training corpus as the final training corpus and proceed to Step 5; otherwise return to Step 3 and repeat;

Step 5: Feed the final training corpus into the model to be trained.

Principle of the method: the invention lets users produce training corpora independently and refine each model independently to improve the performance of semantic role labeling. When a user intends to use any sentence as training corpus, the following procedure can be carried out: the original sentence is first input into the current sentence analysis system, which produces a preliminary training corpus; an expert with a linguistics background then manually annotates and revises it; the correctness of the corpus annotations is verified, and if errors are found the process returns to the manual annotation step; the confirmed final training corpus can then be input into the system again, and the model to be trained can be selected, for example the part-of-speech tagging model, the dependency parsing model, or the semantic role labeling model, thereby improving overall system performance.
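The loop formed by Steps 1 to 5 can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the names `build_final_corpus` and `train_selected`, the dictionary used as an annotation container, and the `max_rounds` safety limit are all assumptions introduced here (the method itself simply repeats Step 3 until verification passes).

```python
from typing import Callable, Dict, List

Annotation = Dict[str, object]  # hypothetical container for one annotated sentence


def build_final_corpus(
    sentence: str,
    analyze: Callable[[str], Annotation],
    correct: Callable[[Annotation], Annotation],
    verify: Callable[[Annotation], bool],
    max_rounds: int = 10,  # assumed safety limit, not part of the described method
) -> Annotation:
    """Steps 1-4: analyze once, then repeat manual correction until verification passes."""
    current = analyze(sentence)              # Step 2: initial training corpus
    for _ in range(max_rounds):
        intermediate = correct(current)      # Step 3: intermediate training corpus
        if verify(intermediate):             # Step 4: annotation check
            return intermediate              # accepted as the final training corpus
        current = intermediate               # otherwise return to Step 3
    raise RuntimeError("annotation still invalid after max_rounds corrections")


def train_selected(models: Dict[str, List[Annotation]], name: str, corpus: Annotation) -> None:
    """Step 5: hand the final corpus to the model chosen by the user."""
    models[name].append(corpus)  # stand-in for a real training call
```

Passing the analysis, correction, and verification steps in as callables keeps the loop independent of any particular model or annotation tool.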

To better implement the invention, further: the specific steps for generating the initial training corpus from the original sentence are:

Step 2.1: Word segmentation;

Step 2.2: Part-of-speech tagging;

Step 2.3: Dependency parsing;

Step 2.4: Semantic role analysis.
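Steps 2.1 to 2.4 form a pipeline in which each stage builds on the output of the previous ones. The following Python sketch shows one possible way to carry the intermediate results; the `Analysis` record, the `run_pipeline` function, and the toy `segment` and `tag` stages are hypothetical stand-ins for trained models and are not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Analysis:
    """Hypothetical record of one sentence as it moves through Steps 2.1-2.4."""
    text: str
    words: List[str] = field(default_factory=list)                   # Step 2.1
    pos: List[str] = field(default_factory=list)                     # Step 2.2
    deps: List[Tuple[int, str]] = field(default_factory=list)        # Step 2.3: (head index, relation)
    roles: List[Tuple[str, str, str]] = field(default_factory=list)  # Step 2.4: (role, predicate, argument)


Stage = Callable[[Analysis], Analysis]


def run_pipeline(text: str, stages: List[Stage]) -> Analysis:
    """Apply the stages in order; each stage reads fields filled by earlier ones."""
    analysis = Analysis(text=text)
    for stage in stages:
        analysis = stage(analysis)
    return analysis


# Toy stages used only to show the data flow; real stages would call trained models.
def segment(a: Analysis) -> Analysis:
    a.words = a.text.split()  # whitespace split as a placeholder segmenter
    return a


def tag(a: Analysis) -> Analysis:
    a.pos = ["v" if w in {"like", "play"} else "n" for w in a.words]
    return a
```

A dependency parsing stage and a semantic role stage would follow the same pattern, filling `deps` and `roles` from the words and tags already present.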

Optionally: in Step 3, an expert with a linguistics background manually revises and corrects the annotations of the initial training corpus.

Optionally: the specific steps for verifying the correctness of the annotations in the intermediate training corpus in Step 4 are:

Step 11: Determine whether the number of information fields in the intermediate training corpus is correct; if yes, proceed to Step 12; if no, return to Step 3 and repeat;

Step 12: Determine whether the intermediate training corpus contains a verb; if yes, proceed to Step 13; if no, return to Step 3 and repeat;

Step 13: Determine whether each verb in the intermediate training corpus has a corresponding semantic role annotation; if yes, proceed to Step 14; if no, return to Step 3 and repeat;

Step 14: Determine whether the dependency relation of each token in the intermediate training corpus is correctly linked; if yes, proceed to Step 5; if no, return to Step 3 and repeat.
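The four checks in Steps 11 to 14 can be sketched as a single function that reports the first check that fails. The row layout (index, word, part-of-speech tag, head index, relation), the verb tag `"v"`, and the interpretation of a "correct link" as a head index that is either 0 or another existing token are assumptions made for this sketch; the patent does not specify them.

```python
from typing import List, Optional, Sequence, Tuple

EXPECTED_FIELDS = 5  # assumed row layout: index, word, POS tag, head index, relation
VERB_TAG = "v"       # assumed verb tag

Row = Sequence[str]
Role = Tuple[str, str, str]  # (role label, predicate word, argument text)


def first_failed_check(rows: List[Row], roles: List[Role]) -> Optional[int]:
    """Return 11, 12, 13 or 14 for the first failed check, or None if all pass."""
    # Step 11: every token row has the expected number of fields.
    if not rows or any(len(r) != EXPECTED_FIELDS for r in rows):
        return 11
    # Step 12: the sentence contains at least one verb.
    verbs = [r[1] for r in rows if r[2] == VERB_TAG]
    if not verbs:
        return 12
    # Step 13: every verb is the predicate of at least one semantic role.
    predicates = {pred for _, pred, _ in roles}
    if any(v not in predicates for v in verbs):
        return 13
    # Step 14: every head index is 0 (root) or the index of another token.
    indices = {r[0] for r in rows}
    for r in rows:
        if r[3] != "0" and (r[3] not in indices or r[3] == r[0]):
            return 14
    return None
```

A return value of `None` corresponds to proceeding to Step 5; any other value corresponds to returning to Step 3 for further manual correction.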

Optionally: the model to be trained is a part-of-speech tagging model, a dependency parsing model, or a semantic role labeling model.
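Because one verified sentence carries all three kinds of annotation, it can in principle serve any of the three models. The following sketch shows one hypothetical way to project a sentence onto the labels each model would learn; the function name `training_examples`, the dictionary keys, and the model names `"pos"`, `"dependency"`, and `"srl"` are assumptions for illustration only.

```python
from typing import Dict, List, Tuple

Token = Dict[str, object]        # hypothetical keys: "word", "pos", "head", "rel"
Role = Tuple[str, str, str]      # (role label, predicate word, argument text)


def training_examples(tokens: List[Token], roles: List[Role], model: str) -> list:
    """Project one verified sentence onto the labels a given model learns."""
    if model == "pos":
        # Part-of-speech tagging: word paired with its tag.
        return [(t["word"], t["pos"]) for t in tokens]
    if model == "dependency":
        # Dependency parsing: word with its head index and relation label.
        return [(t["word"], t["head"], t["rel"]) for t in tokens]
    if model == "srl":
        # Semantic role labeling: the role triples as annotated.
        return list(roles)
    raise ValueError(f"unknown model: {model}")
```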

A sentence analysis system based on the method comprises a sentence analysis module, for generating the initial training corpus from the original sentence;

and a corpus verification module, for verifying the correctness of the annotations in the intermediate training corpus.

Optionally: the sentence analysis module contains a word segmentation model, a part-of-speech tagging model, a dependency parsing model, and a semantic role labeling model.

Optionally: the corpus verification module contains an information field count query model, a verb query model, a semantic role annotation query model, and a dependency relation verification model.

Beneficial effects of the invention: the training corpus a user needs is produced semi-automatically, which improves the efficiency of producing training corpora; models can be trained with a customized training corpus, and the correctness of that corpus can be verified; a visualized result of semantic role labeling is provided; and each model, such as the part-of-speech tagging model and the dependency parsing model, can be trained continuously within the same system, thereby improving overall system performance.

Brief description of the drawings

Fig. 1 shows a flow chart of the method of the invention;

Fig. 2 shows a flow chart of the implementation process of the invention.

Detailed description of the embodiments

Embodiments of the technical solution of the invention are described in detail below with reference to the accompanying drawings. The following embodiments serve only to illustrate the technical solution of the invention more clearly; they are examples only and do not limit the scope of protection of the invention.

As shown in Fig. 1 and Fig. 2, a sentence analysis method based on semantic and syntactic structure comprises the following steps:

Step S101: Input an original sentence;

Step S102: Generate an initial training corpus from the original sentence;

Step S103: Obtain the manually corrected training corpus, which is defined as the intermediate training corpus;

Step S104: Verify the correctness of the annotations in the intermediate training corpus; if the annotations are correct, define the intermediate training corpus as the final training corpus and proceed to Step S105; otherwise return to Step S103 and repeat;

Step S105: Feed the final training corpus into the model to be trained.

The specific steps for generating the initial training corpus from the original sentence are:

Step 2.1: Word segmentation;

Step 2.2: Part-of-speech tagging;

Step 2.3: Dependency parsing;

Step 2.4: Semantic role analysis.

In addition, the specific steps for verifying the correctness of the annotations in the intermediate training corpus are:

Step 11: Determine whether the number of information fields in the intermediate training corpus is correct; if yes, proceed to Step 12; if no, return to Step S103 and repeat;

Step 12: Determine whether the intermediate training corpus contains a verb; if yes, proceed to Step 13; if no, return to Step S103 and repeat;

Step 13: Determine whether each verb in the intermediate training corpus has a corresponding semantic role annotation; if yes, proceed to Step 14; if no, return to Step S103 and repeat;

Step 14: Determine whether the dependency relation of each token in the intermediate training corpus is correctly linked; if yes, proceed to Step S105; if no, return to Step S103 and repeat.

The model to be trained is a part-of-speech tagging model, a dependency parsing model, or a semantic role labeling model.

In addition, the sentence analysis system based on the method comprises a sentence analysis module, for generating the initial training corpus from the original sentence;

and a corpus verification module, for verifying the correctness of the annotations in the intermediate training corpus. The sentence analysis module contains a word segmentation model, a part-of-speech tagging model, a dependency parsing model, and a semantic role labeling model;

the corpus verification module contains an information field count query model, a verb query model, a semantic role annotation query model, and a dependency relation verification model.

Implementation of the method, taking the sentence "I like playing basketball" as an example:

The original sentence is input into the sentence analysis system, and the current system's analysis of the sentence is obtained: 1. Word segmentation: I / like / play / basketball; 2. Part-of-speech tagging: I r / like v / play v / basketball n; 3. Dependency parsing: I 2 SBV / like 0 HED / play 2 VOB / basketball 3 VOB; 4. Semantic role analysis: agent (like, I), agent (play, I), patient (play, basketball), ATP (like, basketball), AFT (play, basketball). This analysis is the initial training corpus. Its content is then corrected manually in order to optimize the overall system; an expert with a linguistics background manually revises and corrects the annotations of the initial training corpus.

The initial training corpus is passed to manual correction: ATP (like, basketball) is changed to ATP (like, play basketball), and AFT (play, basketball) is changed to AFT (play, like).
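The manual correction in this example can be written down as data. The sketch below uses English glosses of the example's tokens (I, like, play, basketball) and represents each semantic role as a (label, predicate, argument) tuple; this tuple layout and the function name `apply_corrections` are assumptions for illustration, not a format defined by the patent.

```python
# Semantic roles produced by the system for the example sentence (initial training corpus).
initial_roles = [
    ("agent", "like", "I"),
    ("agent", "play", "I"),
    ("patient", "play", "basketball"),
    ("ATP", "like", "basketball"),
    ("AFT", "play", "basketball"),
]

# The two changes made by the expert, keyed by the annotation being replaced.
corrections = {
    ("ATP", "like", "basketball"): ("ATP", "like", "play basketball"),
    ("AFT", "play", "basketball"): ("AFT", "play", "like"),
}


def apply_corrections(roles, corrections):
    """Replace each role that has a correction; leave all other roles unchanged."""
    return [corrections.get(role, role) for role in roles]


# Intermediate training corpus (semantic role part) after manual correction.
corrected_roles = apply_corrections(initial_roles, corrections)
```

Recording corrections as explicit before/after pairs also leaves a trace of what the expert changed, which could be reviewed before the corpus is used for training.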

Corpus verification is performed on the corrected training corpus to check whether the annotation format contains errors; if there are none, the corpus is used to train the sentence analysis system, thereby optimizing the overall system.

Finally, it should be noted that the above embodiments merely illustrate the technical solutions of the invention and do not limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the invention, and they should all be covered by the claims and the description of the invention.

Claims

1. A sentence analysis method based on semantic and syntactic structure, characterized in that it comprises the following steps:

Step 1: Input an original sentence;

Step 2: Generate an initial training corpus from the original sentence;

Step 3: Obtain the manually corrected training corpus, which is defined as the intermediate training corpus;

Step 4: Verify the correctness of the annotations in the intermediate training corpus; if the annotations are correct, define the intermediate training corpus as the final training corpus and proceed to Step 5; otherwise return to Step 3 and repeat;

Step 5: Feed the final training corpus into the model to be trained.

2. The sentence analysis method based on semantic and syntactic structure according to claim 1, characterized in that the specific steps for generating the initial training corpus from the original sentence are:

Step 2.1: Word segmentation;

Step 2.2: Part-of-speech tagging;

Step 2.3: Dependency parsing;

Step 2.4: Semantic role analysis.

3. The sentence analysis method based on semantic and syntactic structure according to claim 1, characterized in that: in Step 3, an expert with a linguistics background manually revises and corrects the annotations of the initial training corpus.

4. The sentence analysis method based on semantic and syntactic structure according to claim 1, characterized in that the specific steps for verifying the correctness of the annotations in the intermediate training corpus in Step 4 are:

Step 11: Determine whether the number of information fields in the intermediate training corpus is correct; if yes, proceed to Step 12; if no, return to Step 3 and repeat;

Step 12: Determine whether the intermediate training corpus contains a verb; if yes, proceed to Step 13; if no, return to Step 3 and repeat;

Step 13: Determine whether each verb in the intermediate training corpus has a corresponding semantic role annotation; if yes, proceed to Step 14; if no, return to Step 3 and repeat;

Step 14: Determine whether the dependency relation of each token in the intermediate training corpus is correctly linked; if yes, proceed to Step 5; if no, return to Step 3 and repeat.

5. The sentence analysis method based on semantic and syntactic structure according to claim 1, characterized in that: the model to be trained is a part-of-speech tagging model, a dependency parsing model, or a semantic role labeling model.

6. A sentence analysis system based on the method of claim 1, characterized in that it comprises a sentence analysis module, for generating the initial training corpus from the original sentence;

and a corpus verification module, for verifying the correctness of the annotations in the intermediate training corpus.

7. The sentence analysis system according to claim 6, characterized in that: the sentence analysis module contains a word segmentation model, a part-of-speech tagging model, a dependency parsing model, and a semantic role labeling model.

8. The sentence analysis system according to claim 7, characterized in that: the corpus verification module contains an information field count query model, a verb query model, a semantic role annotation query model, and a dependency relation verification model.