CN115618879A

CN115618879A - Natural language-oriented chunk syntactic dependency graph representation and data labeling method

Info

Publication number: CN115618879A
Application number: CN202211118840.9A
Authority: CN
Inventors: 荀恩东; 邵田
Original assignee: BEIJING LANGUAGE AND CULTURE UNIVERSITY
Current assignee: BEIJING LANGUAGE AND CULTURE UNIVERSITY
Priority date: 2022-09-13
Filing date: 2022-09-13
Publication date: 2023-01-17

Abstract

The invention provides a natural language-oriented chunk syntactic dependency graph representation and data labeling method. Chunk dependency analysis is carried out on the basis of chunk identification; a chunk syntactic dependency graph and its labeling scheme are defined, and a corresponding chunk system is established. Taking the predicate chunk as the core improves the accuracy of syntactic analysis. The concepts of self-sufficient and non-self-sufficient sentences are proposed, and self-sufficient structures are built by completing the default (omitted) components of non-self-sufficient sentences. This provides complete and accurate knowledge for subsequent full syntactic analysis and semantic analysis, and helps a computer understand natural language.

Description

Natural language-oriented chunk syntax dependency graph representation and data labeling method

Technical Field

The invention relates to natural language processing methods, and in particular to a natural language-oriented chunk syntactic dependency graph representation and data labeling method.

Background

Natural language processing is an interdisciplinary field that integrates linguistics, computer science and mathematics, and it is an important direction within computer science and artificial intelligence.

Natural language processing studies the theories and methods that enable effective communication between humans and computers in natural language. Realizing human-machine natural language communication means that a computer can both understand the meaning of natural language text and express given intentions, thoughts and the like in natural language text. The former is called natural language understanding and the latter natural language generation; the present invention pertains to the former.

Natural language understanding research uses a computer to simulate the human process of language understanding, so that the computer can understand the natural languages of human society, such as Chinese and English, and take over part of human mental work. Giving computers the ability to understand natural language is one of the main goals in the development of fifth-generation computers. For a computer to understand natural language, it must achieve a comprehensive understanding of the syntax, semantics and pragmatics of the language. Natural language understanding can therefore be divided into syntactic analysis, semantic analysis and pragmatic analysis. The three generally proceed in order, each step providing information for the next: syntactic analysis provides information for semantic analysis, and semantic analysis provides information for pragmatic analysis. The final aim of all three is to enable the computer to understand natural language. This patent specifically concerns syntactic analysis.

Syntactic analysis is the process of analyzing an input sentence to obtain its syntactic structure. Depending on how the syntactic structure is represented, the most common syntactic analysis tasks fall into three types. The first is syntactic structure parsing, also called phrase structure parsing or constituent parsing, which identifies the phrase structures in a sentence and the hierarchical syntactic relations between phrases. The second is dependency syntactic parsing, dependency parsing for short, which identifies the dependency relations between the words in a sentence. The third is deep grammar parsing, which performs deep syntactic and semantic analysis of a sentence using a deep grammar such as Lexicalized Tree-Adjoining Grammar (LTAG), Lexical Functional Grammar (LFG) or Combinatory Categorial Grammar (CCG).

Dependency syntax generally describes the framework of language structure in terms of dependencies between words. It holds that the verb in the predicate is the center of a sentence and that the other components are directly or indirectly linked to the verb. A "dependency" is a relation between words of governing and being governed; it is not symmetric and it has a direction. Specifically, the governing component is called the governor and the governed component is called the subordinate. Syntactic structure essentially also contains word-to-word dependency (modification) relations. A dependency connects two words, a head and a dependent. Dependencies can be subdivided into types, each representing a specific syntactic relation between two words.

Current dependency parsing mainly means word dependency parsing. As word dependency parsing has developed, however, its segmentation granularity has proved too fine and its word segmentation and part-of-speech tagging insufficiently accurate, which in turn makes the parse inaccurate. For example, FIG. 1 shows the word dependency analysis result produced by the Harbin Institute of Technology (HIT) Language Cloud platform for the Chinese sentence glossed as "In 2014 China completed mineral gold production of 40 tons." In this result the double-object structure is not correctly recognized: the indirect object "mineral gold production" and the direct object "40 tons" are not identified, and "production", a word that can be either a verb or a noun, should be a noun here but is analyzed as a verb, which causes an error in the syntactic structure. For another double-object sentence, "3456 new scientific and technological enterprises in one year", there is no interference from part-of-speech tagging errors, yet its syntactic structure is still not correctly analyzed.

Chunk analysis, also called shallow parsing or partial parsing, adopts a "divide and conquer" strategy: it decomposes a complex problem into several small problems, concentrates on solving the small problems accurately, and serves as the analysis of an intermediate structure on the way to partial or full syntactic analysis. Specifically, chunk analysis can be divided into chunk identification, analysis of the relations between chunks, and analysis of the internal structure of chunks. Current research on chunk analysis focuses on chunk identification and the analysis of internal relations; no relevant literature yet exists on analyzing the relations between chunks.

Chunk dependency analysis builds syntactic dependencies between chunks after the chunks have been segmented. FIG. 2 shows the result of chunk dependency analysis for the sentence "In 2014 China completed mineral gold production of 40 tons." Compared with word dependency, chunk dependency uses a coarser segmentation granularity, which makes syntactic analysis more robust to some extent, provides accurate information for the subsequent semantic analysis, and helps the computer understand natural language.

Compared with word dependency, chunk dependency analyzes a sentence only down to chunks, without further analyzing individual words or performing word segmentation and part-of-speech tagging, and thus avoids syntactic analysis errors caused by word segmentation and part-of-speech tagging. Moreover, because word dependency analysis in current full syntactic analysis is error-prone, a "divide and conquer" strategy is adopted: sentences are first analyzed into chunks and the dependency relationships among the chunks are established, which resolves low-level syntactic ambiguity and provides accurate and sufficient information for full syntactic analysis.

Disclosure of Invention

The invention provides a natural language-oriented chunk syntactic dependency graph representation and data labeling method that concerns dependency analysis. It addresses the inaccuracy of current full syntactic analysis on large-scale real corpora and, drawing on the theory of chunk analysis, proposes chunk dependency analysis as an intermediate structure for semantic analysis, so as to provide more accurate knowledge for semantic analysis. The technical scheme is as follows:

A natural language-oriented chunk syntactic dependency graph representation and data labeling method comprises the following steps:

S1: perform chunk marking on the sentence, segmenting the different types of chunks in the sentence with different marking symbols;

S2: after the chunks are segmented, locate the predicate chunks in the sentence according to the chunk types, and construct the dependency relationships between the other chunks in the sentence and the predicate chunks;

S3: search for the default (omitted) components of non-self-sufficient sentences, looking within the clause or in the context of the clause, and establish the dependency relationship between each default component and the predicate chunk;

S4: perform small-block segmentation on the chunks, and establish the dependency relationship between each resulting small block and the predicate chunk.

Further, in step S1, a chunk is a sequence of adjacent words in the sentence; it is a non-recursive structure that bears a certain syntactic function and conforms to grammatical rules. After the sentence is segmented into chunks, a linear sequence of chunks is obtained.

In step S1, chunks are divided into syntactic chunks and non-syntactic chunks according to whether they serve as syntactic components. Syntactic chunks are divided, according to their position in the sentence, into predicate chunks, subject chunks, object chunks, adverbial chunks and complement chunks; non-syntactic chunks are divided, according to their function, into linking chunks and auxiliary chunks.

A predicate chunk, or a predicate chunk together with the adverbial chunks and complement chunks adjacent to it, is marked with "( )"; adverbial chunks and complement chunks that are separated from the predicate chunk by punctuation marks or other components are marked with "[ ]"; predicative subject chunks and predicative object chunks are marked with "{ }"; linking chunks are marked with "< >"; auxiliary chunks are marked with "<< >>"; nominal subject chunks and nominal object chunks are not marked with any symbol.

Further, in step S2, the predicate chunk is set as the core of the sentence; every type of non-predicate chunk is governed by the predicate chunk and depends on it. If a dependency relationship exists between a non-predicate chunk and a predicate chunk, the non-predicate chunk is called a dependent component of the predicate chunk, and the predicate chunk is the dependency object of the non-predicate chunk.

The predicate chunk serves as the dependency object of all chunks in the sentence. Four nodes are set on it, respectively representing the subject position, the object position, the modifier position and the predicate position of the predicate chunk. Every non-predicate chunk depends on one of the four nodes according to its category, the dependency arcs point from the four nodes of the predicate chunk to its dependent components, and each predicate chunk together with the dependent components around it forms a self-sufficient structure:

(1) The subject, including the large subject and the small subject in a subject-predicate predicate sentence, depends on position 1 of the predicate chunk; when the component at position 1 as a whole is nominal, the relation between the two is defined as NP-SBJ, and when it is predicative as a whole, the relation is defined as VP-SBJ and the arc points to position 4 of that predicative component;

(2) Adverbials and complements depend on position 2 of the predicate chunk; the relation between the predicate chunk and the component at position 2 is defined as NULL-MOD;

(3) Objects, including the near object and the far object of a double-object structure, depend on position 3 of the predicate chunk; when the component at position 3 as a whole is nominal, the relation between the two is defined as NP-OBJ, and when it is predicative as a whole, the relation is defined as VP-OBJ and the arc points to position 4 of that predicative component;

(4) The predicate-associated chunk is an empty predicate that is identical to the predicate; it is connected to position 4 of the predicate chunk, with the arc pointing from position 4 of the predicate chunk to position 4 of the predicate-associated chunk; the relation between the two is defined as VP-EMP.

Furthermore, in step S2, the labeling specification of the chunk dependency graph covers self-sufficient sentences, non-self-sufficient sentences and small-block segmentation. A self-sufficient sentence is one in which the obligatory governed components of the predicate chunk are all inside the clause; a non-self-sufficient sentence is one in which obligatory governed components of the predicate chunk are omitted; and small-block segmentation applies when a chunk does not depend on a predicate chunk as a whole but depends on it in the form of small blocks, in which case the chunk must be segmented according to its structure.

The labeling of self-sufficient sentences covers basic sentence patterns and special sentence patterns. The basic sentence patterns are single sentences, compound sentences, subject-predicate predicate sentences, double-object sentences, predicative-subject sentences and predicative-object sentences; the special sentence patterns are serial-predicate sentences, pivotal sentences, nominal-predicate sentences, nominal one-word sentences, predicative one-word sentences, inverted sentences, subjectless sentences and adages.

The labeling of non-self-sufficient sentences covers subject chunk default, object chunk default, adverbial chunk default, complement chunk default and predicate chunk default.

Further, in step S4, small-block segmentation covers two cases: first, when the default component is part of some chunk, it must be cut out of that chunk, and once it forms a chunk of its own, its dependency relationship with the predicate chunk is established; second, a complex adverbial chunk or complement chunk (a complex modifier chunk) is cut into several small blocks, and the dependency relationship between each small block and the predicate chunk is established.

With the natural language-oriented chunk syntactic dependency graph representation and data labeling method, the chunk dependency structure serves as the intermediate structure of syntactic analysis. This conforms to the rules of language cognition and is convenient for computation, makes the process of linguistic analysis more interpretable and controllable, avoids analysis errors caused by word segmentation and part-of-speech tagging, and provides more accurate knowledge for full syntactic analysis and semantic analysis.

Drawings

FIG. 1 is an example of the word dependency analysis result for a sentence obtained with the Harbin Institute of Technology (HIT) Language Cloud platform;

FIG. 2 is an example of a chunk dependency graph;

FIG. 3 is a flow diagram of the natural language oriented chunk syntactic dependency representation and data tagging method;

FIG. 4 is a schematic diagram of the chunk classification system;

FIG. 5 is an exemplary diagram of the chunk dependency graph;

FIG. 6 is a schematic diagram illustrating the chunk dependency;

FIG. 7 is a diagram of a single sentence;

FIG. 8 is a diagram of a compound sentence;

FIG. 9 is a diagram of a subject-predicate predicate sentence;

FIG. 10 is a diagram of a double-object sentence;

FIG. 11 is a schematic diagram of a predicative-subject sentence;

FIG. 12 is a schematic diagram of a predicative-object sentence;

FIG. 13 is a schematic diagram of a serial-predicate sentence;

FIG. 14 is a diagram of a pivotal sentence;

FIG. 15 is a diagram of a nominal-predicate sentence;

FIG. 16 is a diagram of a nominal one-word sentence;

FIG. 17 is a schematic diagram of a predicative one-word sentence;

FIG. 18 is a schematic diagram of an inverted sentence;

FIG. 19 is a schematic diagram of a subjectless sentence;

FIG. 20 is a schematic diagram of an adage;

FIG. 21 is a subject chunk default diagram;

FIG. 22 is an object chunk default diagram;

FIG. 23 is an adverbial chunk default diagram;

FIG. 24 is a complement chunk default diagram;

FIG. 25 is a diagram of a predicate chunk default;

FIG. 26 is a schematic diagram of subject and object chunk segmentation;

FIG. 27 is a schematic diagram of modifier chunk segmentation;

FIG. 28 is a first labeled diagram of chunk dependencies;

FIG. 29 is a second labeled diagram of chunk dependencies;

FIG. 30 is a third labeled diagram of chunk dependencies;

FIG. 31 is a fourth labeled diagram of chunk dependencies.

Detailed Description

As shown in FIG. 3, the natural language-oriented chunk syntactic dependency graph representation and data labeling method focuses on analyzing the dependency relationships among chunks and establishes a chunk classification system that uses chunk function as the classification criterion. Chunk function here means the syntactic function a chunk bears in a sentence, that is, the sentence component a language unit can serve as, commonly subject, predicate, object, attributive, adverbial and complement. The method highlights the idea of taking the predicate chunk as the core and builds self-sufficient structures by completing the default (omitted) components in clauses. The method comprises two stages:

the first stage is chunk identification and partitioning, which mainly marks chunk boundaries so that syntactic labels can be attached;

the second stage is construction of the chunk dependency graph, in which the dependency relationships among chunks are marked on the basis of the first-stage result.

The first stage: chunk identification and partitioning, which mainly marks chunk boundaries so that syntactic labels can be attached.

For chunk identification, determining the boundaries of a chunk means determining its left and right boundaries. For example: This sentence/only/is/an example/. Taking the subject chunk "this sentence" as an example, the left boundary is "this" and the right boundary is "sentence".
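The boundary notion above can be sketched in a few lines of Python. This is only an illustration: the function name `chunk_boundaries` is hypothetical, and the sketch assumes the sentence has already been split into chunks with slashes and that words inside a chunk are separated by spaces (as in the English gloss, not in the original Chinese).

```python
def chunk_boundaries(segmented: str) -> list[tuple[str, str, str]]:
    """Return (chunk, left_boundary, right_boundary) for a slash-separated chunk sequence.

    Assumption of this sketch: words inside a chunk are space-separated.
    """
    result = []
    for raw in segmented.split("/"):
        chunk = raw.strip()
        if not chunk:
            continue  # skip empty pieces, e.g. around a trailing slash
        words = chunk.split()
        # The left boundary is the first word of the chunk, the right boundary the last.
        result.append((chunk, words[0], words[-1]))
    return result


if __name__ == "__main__":
    for item in chunk_boundaries("This sentence/only/is/an example/."):
        print(item)
```

For a one-word chunk the left and right boundaries coincide.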

Step one: perform chunk marking on the sentence, that is, segment the different types of chunks in the sentence with different marks.

A predicate chunk, or a predicate chunk together with the adverbial chunks and complement chunks adjacent to it, is marked with "( )"; adverbial chunks and complement chunks that are separated from the predicate chunk by punctuation marks or other components are marked with "[ ]"; predicative subject chunks and predicative object chunks are marked with "{ }"; linking chunks are marked with "< >"; auxiliary chunks are marked with "<< >>". Note that no marker is defined for nominal subject chunks and nominal object chunks, because they account for the majority of subject and object chunks. They are instead located by their position relative to the "( )" predicate chunk: a nominal chunk in front of the predicate chunk is generally taken to be a subject chunk, and one behind it an object chunk. If they cannot be distinguished by position, they can be distinguished in the next step according to the dependency relationships between chunks.
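As a rough illustration of the marking scheme, the following Python sketch splits a marked sentence into its outermost marked spans. It is not the patent's tooling: the function `top_level_chunks` and the type names are hypothetical, the auxiliary marker is read as a doubled angle bracket written without inner spaces, and text outside any marker is simply reported as "unmarked" (which is where nominal subject and object chunks would be found).

```python
import re

# Opening marker -> (closing marker, chunk type). The marker pairs follow the text;
# the type names are this sketch's own labels.
MARKERS = {
    "(": (")", "predicate"),
    "[": ("]", "separated adverbial/complement"),
    "{": ("}", "predicative subject/object"),
    "<<": (">>", "auxiliary"),
    "<": (">", "linking"),
}
# "<<" and ">>" are listed first so they are matched before single angle brackets.
TOKEN = re.compile(r"<<|>>|[()\[\]{}<>]")


def top_level_chunks(marked: str) -> list[tuple[str, str]]:
    """Return (chunk_type, text) pairs for the outermost marked spans of a sentence."""
    chunks, stack = [], []      # stack holds the closing markers still expected
    kind, start, pos = None, 0, 0
    for m in TOKEN.finditer(marked):
        tok = m.group()
        if stack and tok == stack[-1]:
            stack.pop()
            if not stack:       # an outermost span has just been closed
                chunks.append((kind, marked[start:m.start()].strip()))
                pos = m.end()
        elif tok in MARKERS:
            if not stack:       # an outermost span opens here
                plain = marked[pos:m.start()].strip()
                if plain:
                    chunks.append(("unmarked", plain))
                kind, start = MARKERS[tok][1], m.end()
            stack.append(MARKERS[tok][0])
        else:
            raise ValueError(f"unbalanced marker {tok!r} at offset {m.start()}")
    if stack:
        raise ValueError("unclosed marker")
    tail = marked[pos:].strip()
    if tail:
        chunks.append(("unmarked", tail))
    return chunks


if __name__ == "__main__":
    print(top_level_chunks("<even> you (do better) <<ah>>"))
```

Nested markers, such as a predicate chunk inside a "{ }" span, are kept verbatim inside the text of the enclosing span, so the function can be applied again to that text to descend one level.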

Before specifying the construction of the chunk dependency graph in the second stage, a detailed description of the chunk definition and classification system is required.

1. Definition of the chunks:

A chunk is a unit of linguistic communication formed by combining consecutive words or morphemes; it is modular, integrates form and meaning, is prefabricated, and generally takes part in information encoding and decoding as a whole in language use. Specifically, a chunk is a sequence of adjacent words in a sentence; it is a non-recursive structure that bears a certain syntactic function in the sentence and conforms to grammatical rules. Chunks include both the linking and auxiliary components between clauses and the syntactic function components within clauses. After a sentence is chunked, a linear sequence of chunks is obtained rather than a hierarchical structure. In the following examples, chunks are separated by slashes.

This sentence/only/is/an example/.

I/feel/look like/he draws landscape/nice.

2. Classification system of chunks:

At present, most chunk systems are divided strictly according to the properties of chunks, on the basis of phrase structure. Here, chunks are instead classified from the perspective of grammatical function and are divided into syntactic chunks and non-syntactic chunks according to whether they serve as syntactic components.

As shown in FIG. 4, syntactic chunks are divided, according to their position in the sentence, into predicate chunks, subject chunks, object chunks, adverbial chunks and complement chunks. Non-syntactic chunks are divided, according to their function, into linking chunks and auxiliary chunks.
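The classification system of FIG. 4 can be written down as a small enumeration. The sketch below is illustrative only; the identifier names (`ChunkType`, `is_syntactic`) and the English renderings of the seven chunk types are this sketch's own choices, not names used by the patent.

```python
from enum import Enum


class ChunkType(Enum):
    # Syntactic chunks, classified by their position in the sentence
    PREDICATE = "predicate"
    SUBJECT = "subject"
    OBJECT = "object"
    ADVERBIAL = "adverbial"
    COMPLEMENT = "complement"
    # Non-syntactic chunks, classified by their function
    LINKING = "linking"
    AUXILIARY = "auxiliary"


SYNTACTIC_TYPES = frozenset({
    ChunkType.PREDICATE,
    ChunkType.SUBJECT,
    ChunkType.OBJECT,
    ChunkType.ADVERBIAL,
    ChunkType.COMPLEMENT,
})
NON_SYNTACTIC_TYPES = frozenset({ChunkType.LINKING, ChunkType.AUXILIARY})


def is_syntactic(chunk_type: ChunkType) -> bool:
    """True if the chunk serves as a syntactic component of the sentence."""
    return chunk_type in SYNTACTIC_TYPES


if __name__ == "__main__":
    for t in ChunkType:
        print(t.value, "syntactic" if is_syntactic(t) else "non-syntactic")
```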

3. Identification and partitioning of chunks:

Based on the chunk classification system, the first stage identifies the chunks in a sentence and assigns each to a category.

1. Predicate chunk

The predicate chunk is the chunk formed by the core predicate; it is the core at the sentence level and is marked with the innermost parentheses "( )". Predicate chunks are mainly filled by predicative words or predicative structures, and empty predicate chunks also exist in some special sentences.

1) The temperature of the entire glacier (all (at)) melting point, except for the active layer.

2) He () Beijing person.

2. Subject chunk

A subject chunk is the subject in a subject-predicate structure, including the large subject in a subject-predicate predicate sentence. Structurally, a subject chunk depends on the predicate chunk.

1) We have cleaned the classroom (go (eat)) and cooked.

2) This event { [ in his view ] a troublesome place (stile (much)) }.

When the entire subject is a predicative structure, the predicate chunk inside that structure stands for the whole structure; that is, it is regarded as the subject chunk governed by the predicate chunk.

3) { (Do not use (take)) Pen (write) } (also (possible)).

3. Object chunk

An object chunk is the object in a predicate-object structure, including the direct object and the indirect object in a double-object structure. Structurally, an object chunk depends on the predicate chunk.

1) [ at his maturity, ] he (climbed over) the pearl crest.

2) [2014 ] production of gold (finished) mineral gold of China | | |40 tons.

When the entire object is a predicative structure, it is handled in the same way as a predicative subject.

3) I (now (acknowledge)) { you ((do) better than i) }.

4. Adverbial chunk

An adverbial chunk is a chunk that serves as an adverbial in a sentence. It is generally located in front of the predicate chunk, may be immediately adjacent to it or separated from it by other components or punctuation, modifies the predicate chunk, and is governed by it.

1) The technical enterprise is grown in 3465 within one year (newly added).

2) [ do not identify the education of children ] hope (in education institutions) to take the best way.

5. Complement chunk

A complement chunk is a chunk that serves as a complement in a sentence. It is generally located behind the predicate chunk, may be immediately adjacent to it or separated from it by other components or punctuation, modifies the predicate chunk, and is governed by it.

1) < even > you (do me better) < also > (cannot (change)) violation.

2) (fly into) a bird [ coming ].

6. Linking chunk

A linking chunk consists of conjunctions, phonetic labels, parenthetical expressions and the like; it mainly performs a linking function in the sentence and is a discourse-level component. It is marked with angle brackets "< >".

1) He <not only> (ate), <but also> (took) an apple.

2) Car <, needless to say, > (yes) of course, head, etc.

7. Auxiliary chunk

Auxiliary chunks generally consist of function words such as modal particles and onomatopoeic words. Syntactically they have no structural relation with the other components of the sentence; they mainly express mood, and are marked with "<< >>".

1) He (got away) < < do > >?

2) < < good > >, I (know).

3) < < swla > > -, the sea (being his (singing)) the last song.

The second stage: construction of the chunk dependency graph, which is carried out on the basis of the chunk identification and classification of the first stage.

Step two: after the chunks are segmented, locate the predicate chunks in the sentence according to the chunk types, and construct the dependency relationships between the other chunks in the sentence and the predicate chunks. In previous studies, chunk analysis stopped at identifying chunks and analyzing their types, and the dependencies between chunks were ignored. Note, however, that no dependency relationships are constructed between linking or auxiliary chunks and the syntactic chunks, because linking and auxiliary chunks are generally sentence-level components, unlike the components within a clause. This step is one of the distinctive features of this patent.

1. Chunk dependency graph representation:

When chunk dependency analysis is performed on a sentence, the invention sets the predicate chunk as the core of the sentence; every type of non-predicate chunk is governed by the predicate chunk and depends on it. If a dependency relationship exists between a non-predicate chunk and a predicate chunk, the non-predicate chunk is called a dependent component of the predicate chunk, and the predicate chunk is the dependency object of the non-predicate chunk. Except for some special one-word sentences, a sentence is generally considered to contain one or more predicate chunks, and each non-predicate chunk depends on at least one predicate chunk. As shown in FIG. 5, the predicate chunk serves as the dependency object of every chunk; it has four points on its left, right, top and bottom, which respectively represent the subject position (position 1), the object position (position 3), the modifier position (position 2) and the non-predicate position (position 4). Each non-predicate chunk depends on one of the four nodes of the predicate chunk according to its type, and the dependency arcs point from the four nodes of the predicate chunk to its dependent components.

1. The subject, including the large subject in a subject-predicate predicate sentence, depends on position 1 of the predicate chunk. In the subsequent analysis, when the component at position 1 as a whole is nominal, the relation between the two is defined as NP-SBJ; when it is predicative as a whole, the relation is defined as VP-SBJ and the arc points to position 4 of that predicative component.

2. Adverbials and complements depend on position 2 of the predicate chunk. In the subsequent analysis, the relation between the predicate chunk and the component at position 2 is defined as NULL-MOD.

3. The object, including the near object and the far object of a double-object structure, depends on position 3 of the predicate chunk. In the subsequent analysis, when the component at position 3 as a whole is nominal, the relation between the two is defined as NP-OBJ; when it is predicative as a whole, the relation is defined as VP-OBJ and the arc points to position 4 of that predicative component.

4. The predicate-associated chunk is an empty predicate that is identical to the predicate; it is connected to position 4 of the predicate chunk, with the arc pointing from position 4 of the predicate chunk to position 4 of the predicate-associated chunk. In the subsequent analysis, the relation between the two is defined as VP-EMP.

Therefore, the present invention preliminarily distinguishes the relationships between a predicate chunk and its dependent chunks into the following 6 types (NP-SBJ, VP-SBJ, NULL-MOD, NP-OBJ, VP-OBJ and VP-EMP), as shown in FIG. 6.
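The mapping from a dependent's position on the predicate chunk to one of the six relation labels can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `relation_label` and `build_edges`, the tuple layout of an edge, and the example sentence are all hypothetical, and the sketch does not model the rule that VP-SBJ and VP-OBJ arcs end at position 4 of the dependent predicative component.

```python
def relation_label(position: int, predicative: bool = False) -> str:
    """Choose the relation label for a dependent attached at the given position.

    position: 1 (subject), 2 (modifier), 3 (object) or 4 (predicate-associated).
    predicative: whether the dependent as a whole is a predicative component.
    """
    if position == 1:
        return "VP-SBJ" if predicative else "NP-SBJ"
    if position == 2:
        return "NULL-MOD"  # adverbials and complements; property left undetermined
    if position == 3:
        return "VP-OBJ" if predicative else "NP-OBJ"
    if position == 4:
        return "VP-EMP"    # empty predicate associated with this predicate
    raise ValueError(f"unknown position: {position}")


def build_edges(predicate, dependents):
    """Build (head, position, dependent, label) edges for one predicate chunk.

    dependents: iterable of (text, position, predicative) triples.
    """
    return [
        (predicate, position, text, relation_label(position, predicative))
        for text, position, predicative in dependents
    ]


if __name__ == "__main__":
    # Hypothetical clause: "[last year] he (climbed) the mountain"
    edges = build_edges("climbed", [
        ("he", 1, False),
        ("last year", 2, False),
        ("the mountain", 3, False),
    ])
    for edge in edges:
        print(edge)
```

Each edge runs from a numbered node of the predicate chunk to a dependent, matching the arc direction described above.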

Furthermore, the chunk dependency relationship with the predicate chunk as its core constructs dependencies not only between the predicate chunk and the dependent components that appear around it, but also for dependent components that are omitted by default in the current clause and appear in the context, so as to ensure as far as possible the completeness of the dependent components around the predicate chunk. Each predicate chunk and its surrounding dependent components together form a self-sufficient structure.

FIG. 6 also shows the form of the syntactic labels. A syntactic label combines the syntactic property and the syntactic function of a language unit. The syntactic property is the nature of the language unit; depending on the size of the unit, it is either a part of speech or a phrase property. Parts of speech can be divided into nominals and predicatives; nominals include nouns, pronouns and the like, and predicatives include verbs, adjectives and the like. Phrase properties are classified in the same way as parts of speech.

The correspondence of the syntactic labels in fig. 6 is as follows:

(1) Syntactic properties

NP indicates that the syntactic property of the linguistic unit is a nominal component;

VP indicates that the syntactic property of the linguistic unit is a predicative component;

NULL indicates that the syntactic property of the linguistic unit is undetermined: it may be either a nominal component or a predicative component;

(2) Syntactic functions

SBJ indicates that the syntactic function of the linguistic unit is to serve as a subject component;

OBJ indicates that the syntactic function of the linguistic unit is to act as an object component;

MOD indicates that the syntactic function of the language unit is to act as a modifier component;

EMP denotes that the syntactic function of the linguistic unit is a component associated with a predicate.
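The label scheme above can be sketched in code. The following is only an illustrative sketch: the names `Prop`, `Func`, `make_label` and `parse_label` are hypothetical and do not come from the patent, and the code does not enumerate which property-function combinations make up the 6 types of fig. 6. It only shows the PROPERTY-FUNCTION form seen in labels such as NP-OBJ, VP-OBJ and VP-EMP.

```python
from enum import Enum


class Prop(Enum):
    """Syntactic property of a language unit."""
    NP = "NP"      # nominal component
    VP = "VP"      # predicative component
    NULL = "NULL"  # undetermined: nominal or predicative


class Func(Enum):
    """Syntactic function of a language unit."""
    SBJ = "SBJ"  # subject
    OBJ = "OBJ"  # object
    MOD = "MOD"  # modifier
    EMP = "EMP"  # component associated with the predicate


def make_label(prop: Prop, func: Func) -> str:
    """Join property and function in the PROPERTY-FUNCTION form."""
    return f"{prop.value}-{func.value}"


def parse_label(label: str) -> tuple[Prop, Func]:
    """Split a label such as VP-OBJ back into its two parts."""
    prop, func = label.split("-")
    return Prop(prop), Func(func)


print(make_label(Prop.NP, Func.OBJ))
print(parse_label("VP-EMP"))
```

Keeping property and function as two separate enumerations mirrors the two-part definition of the label, so an annotation tool could validate each half independently.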

2. Chunk dependency graph labeling specification:

A self-sufficient sentence is one in which all obligatory governed components of the predicate chunk appear inside the clause; a non-self-sufficient sentence is one in which some obligatory governed component of the predicate chunk is omitted. The obligatory governed components are the obligatory arguments of the predicate chunk, a notion that can draw on verb valency in linguistics. For example, "like" is a bivalent verb with two obligatory governed components: the agent of the action and the thing that is liked. All other components are optional governed components of the predicate chunk. Besides omitting obligatory governed components, a clause may also share optional governed components with other components; in that case the shared optional components must be recovered as well. In addition, some chunks do not depend on the predicate chunk as a whole but only through one of their small chunks; in that case the small chunks must be split out according to the chunk structure. The labeling specification of the chunk dependency graph is therefore introduced in three parts: self-sufficient sentences, non-self-sufficient sentences, and small-chunk splitting.

1. Annotation of self-sufficient sentences

Chunk dependency annotation follows the maximal-chunk criterion: if a chunk is governed, the whole chunk depends on the predicate chunk. A dependency need not be unique: a chunk that is governed by several predicate chunks may depend on all of them. The basic and special sentence patterns of Chinese are used below to illustrate how the chunk dependency graph handles different sentence patterns.

(1) Basic sentence patterns

a. Simple sentence

As shown in fig. 7, the temperature of the entire glacier is at the melting point except for the active layer.

b. Compound sentence

As shown in fig. 8, he not only ate, but also picked an apple.

c. Subject-predicate predicate sentence

As shown in fig. 9, this is a lot more cumbersome in what he sees.

d. Double-object sentence

As shown in fig. 10, 3465 scientific and technological enterprises are newly developed in one year.

e. Predicative-subject sentence

As shown in fig. 11, writing without taking a pen is also possible.

f. Predicative-object sentence

As shown in fig. 12, you need to try to see the problem from this perspective.

(2) Special sentence patterns

a. Serial-verb sentence

As shown in fig. 13, we have swept the classroom to eat.

b. Pivotal sentence

As shown in fig. 14, the boss makes a trip to the office.

c. Noun-predicate sentence

As shown in fig. 15, "he Beijing" (literal gloss: a noun serves as the predicate).

d. Nominal independent sentence

As shown in fig. 16, panda!

e. Predicative independent sentence

As shown in fig. 17, how?

f. Inverted sentence

As shown in fig. 18, walk quickly, you!

g. Subjectless sentence

As shown in fig. 19, the rain of one day blows the wind of one day.

h. Proverb

As shown in fig. 20, "day is regular" and "should be treated with good rules".

2. Labeling of non-self-sufficient sentences

a. Subject chunk default

As shown in fig. 21, he shakes the clothes and then puts them on.

b. Object chunk default

As shown in fig. 22, he has a ticket, I do not.

c. Adverbial chunk default

As shown in fig. 23, he opens the door, walks in, silently.

d. Complement chunk default

As shown in fig. 24, the user arranges and positions the contents of the study in order.

e. Predicate chunk default

As shown in fig. 25, some have written agreements and some have spoken agreements.

Step three: find the default components of non-self-sufficient sentences. The default (omitted) component must be located in the clause itself or in the context of the clause, and its dependency on the predicate chunk established.

When dependencies are built between a predicate chunk and other chunks, two cases generally arise. In the first, the clause is self-sufficient, and the dependent components of the predicate chunk can all be found within it. In the second, the clause is non-self-sufficient: some dependent components have been omitted because of the context or the situation, so the default components must be found in the clause itself or in its context and their dependencies on the predicate chunk established. Note that a default adverbial chunk, complement chunk or predicate chunk inside a sentence is also treated as a special kind of default and must be recovered in this step. It is of the same nature as a default subject chunk or object chunk: the component is omitted under the influence of the context and is shared with other components in the sentence. This step is one of the distinctive features of the invention.

Step four: split chunks into small chunks and establish the dependency between each small chunk and the predicate chunk.

When dependencies are built between a predicate chunk and other chunks, a chunk sometimes has to be split into small chunks to obtain a more complete set of chunk dependencies. Small-chunk splitting covers two cases. First, when the default component is only part of some chunk, that part is cut out, made into a chunk of its own, and then linked to the predicate chunk. Second, a complex adverbial chunk or complement chunk (collectively, modifier chunks) is cut into separate small chunks, and each small chunk is linked to the predicate chunk. This step is also one of the distinctive features of the invention.

Small block segmentation example:

1. Subject/object chunk splitting

As shown in FIG. 26, he had his schoolbag dropped, which is very distracting.

2. Modifier chunk splitting

As shown in fig. 27, we have a clearer view of this fact today by this method.

Several worked examples of the invention follow:

Example 1: The temperature of the whole glacier is at the melting point except the active layer.

Step one: annotate the chunks of the sentence, that is, delimit the different types of chunks in the sentence with different markers. The chunk annotation result for example 1 is: the temperature of the entire glacier (all (at)) melting point, except for the active layer. Five chunks are obtained: the adverbial chunk "except for the active layer", the subject chunk "the temperature of the entire glacier", the adverbial chunk "all", the predicate chunk "at", and the object chunk "melting point".

Step two: find the predicate chunks in the sentence and build the dependencies between the other chunks and the predicate chunks. The predicate chunk in example 1 is "at". Four points, in front of, behind, above and below it, indicate where other chunks may attach. The adverbial chunk "except for the active layer" is connected to modifier position 2 above the predicate chunk, the subject chunk "the temperature of the entire glacier" is connected to subject position 1 in front of it, and the object chunk "melting point" is connected to object position 3 behind it.

Step three: find the default components. The clause of the predicate chunk "at" is self-sufficient; no component is omitted.

Step four: split small chunks. The adverbial chunk is not nested, so no small-chunk splitting is needed.

This completes the labeling of chunk dependency on the sentence, and the result is shown in fig. 28.
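The result of example 1 can be written down as a small data structure. This is a minimal sketch, not the patent's own representation: the class `PredicateChunk` and its methods are hypothetical names, and only the three attachments stated in step two are encoded. The position numbers (1 subject, 2 modifier, 3 object, 4 predicate) follow the description.

```python
from dataclasses import dataclass, field

# Position numbers as used in the description.
SUBJECT, MODIFIER, OBJECT, PREDICATE = 1, 2, 3, 4


@dataclass
class PredicateChunk:
    text: str
    # One list of dependent chunks per attachment position.
    slots: dict = field(default_factory=lambda: {1: [], 2: [], 3: [], 4: []})

    def attach(self, position: int, chunk: str) -> None:
        self.slots[position].append(chunk)

    def edges(self) -> list:
        """Return (predicate, position, dependent) triples, ordered by position."""
        return [(self.text, pos, dep)
                for pos in sorted(self.slots)
                for dep in self.slots[pos]]


at = PredicateChunk("at")
at.attach(MODIFIER, "except for the active layer")
at.attach(SUBJECT, "the temperature of the entire glacier")
at.attach(OBJECT, "melting point")

for edge in at.edges():
    print(edge)
```

Each position holds a list rather than a single chunk, since the description allows several chunks (for example several modifier chunks) to attach to the same position.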

Example 2: He has a ticket, I do not.

Step one: annotate the chunks of the sentence. The chunk annotation result for example 2 is: he (has) a ticket, I (do not have). Five chunks are obtained: the subject chunk "he", the predicate chunk "has", the object chunk "ticket", the subject chunk "I", and the predicate chunk "do not have".

Step two: find the predicate chunks in the sentence and build the dependencies between the other chunks and the predicate chunks. Because the sentence has two clauses, there are two predicate chunks, "has" and "do not have". First, the predicate chunk "has" is annotated: the subject chunk "he" is connected to subject position 1 in front of it, and the object chunk "ticket" is connected to object position 3 behind it. Next, the predicate chunk "do not have" is annotated: the subject chunk "I" is connected to subject position 1 in front of it. Because "do not have" is a bivalent verb and the entity that the action relates to is omitted, the clause is non-self-sufficient, so step three is needed to find the default component.

Step three: find the default components. The second clause, "I do not", is non-self-sufficient: the entity that the verb relates to is omitted. The object chunk "ticket" of the preceding clause is therefore found in the context as the object chunk omitted by the predicate chunk "do not have"; this predicate chunk shares the object chunk "ticket" with the predicate chunk "has" of the preceding clause. The object chunk "ticket" is connected to object position 3 behind the predicate chunk "do not have", and the default object chunk is thereby recovered.

Step four: split small chunks. The chunks are not nested, so no small-chunk splitting is needed.

This completes the labeling of chunk dependency for the sentence, and the result is shown in fig. 29.
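Step three of example 2 can be illustrated with a short sketch. The patent describes a labeling specification, not a recovery algorithm, so the rule used here (take the object of the nearest preceding clause) is purely an assumption made for illustration; the function name `recover_objects`, the dictionary layout and the English glosses of the chunks are likewise hypothetical.

```python
def recover_objects(clauses, valency):
    """Fill in omitted objects from the nearest preceding clause.

    clauses: list of dicts with keys "subj", "pred", "obj" (obj is None when omitted).
    valency: dict mapping a predicate to its number of obligatory nominal arguments.
    Returns new dicts with an extra "shared" flag; the input is not modified.
    """
    result = []
    last_obj = None
    for clause in clauses:
        filled = dict(clause, shared=False)
        needs_obj = valency.get(clause["pred"], 1) >= 2
        if needs_obj and clause["obj"] is None and last_obj is not None:
            filled["obj"] = last_obj   # attach at object position 3
            filled["shared"] = True    # chunk is shared with an earlier predicate
        if filled["obj"] is not None:
            last_obj = filled["obj"]
        result.append(filled)
    return result


clauses = [
    {"subj": "he", "pred": "has", "obj": "ticket"},
    {"subj": "I", "pred": "do not have", "obj": None},
]
valency = {"has": 2, "do not have": 2}
for clause in recover_objects(clauses, valency):
    print(clause)
```

The `shared` flag records that the same object chunk now depends on two predicate chunks, which is the non-unique dependency the labeling specification allows.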

Example 3: Today we have a clearer understanding of this matter by this method.

Step one: annotate the chunks of the sentence. The chunk annotation result for example 3 is: today we (have) a clearer understanding of this matter by this method. Four chunks are obtained: the adverbial chunk "today, by this method, for this matter", the subject chunk "we", the predicate chunk "have", and the object chunk "a clearer understanding".

Step two: find the predicate chunks in the sentence and build the dependencies between the other chunks and the predicate chunks. The predicate chunk of the sentence is "have": the subject chunk "we" is connected to subject position 1 in front of it, and the object chunk "a clearer understanding" is connected to object position 3 behind it. Because the adverbial chunk "today, by this method, for this matter" is a nested structure, step four is needed to split it into small chunks.

Step three: find the default components. The clause of this predicate chunk is self-sufficient; no component is omitted.

Step four: split small chunks. The adverbial chunk "today, by this method, for this matter" is a nested structure. It is split into three small adverbial chunks, "today", "by this method" and "for this matter", and each of the three small chunks is connected to modifier position 2 above the predicate chunk.

This completes the labeling of chunk dependency for the sentence, and the result is shown in fig. 30.
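The small-chunk splitting of example 3 can be sketched as a flattening of a nested chunk. This is an illustrative sketch under stated assumptions: the `Chunk` class and `split_small_chunks` function are hypothetical names, the chunk texts are English glosses of the three small adverbial chunks, and the nesting itself is supplied by hand, as it would be by an annotator.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    parts: list = field(default_factory=list)  # nested small chunks, if any


def split_small_chunks(chunk: Chunk) -> list:
    """Return the innermost small chunks of a (possibly nested) chunk."""
    if not chunk.parts:
        return [chunk]
    small = []
    for part in chunk.parts:
        small.extend(split_small_chunks(part))
    return small


adverbial = Chunk(
    "today, by this method, for this matter",
    parts=[Chunk("today"), Chunk("by this method"), Chunk("for this matter")],
)

modifier_position = []  # modifier position 2 of the predicate chunk
for small in split_small_chunks(adverbial):
    modifier_position.append(small.text)
print(modifier_position)
```

A chunk without parts is returned unchanged, which corresponds to the "not nested, no splitting needed" outcome in the other examples.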

Example 4: you need to try to see the problem from this perspective.

Step one: annotate the chunks of the sentence. The chunk annotation result for example 4 is: you (need) { (try) { (from this perspective (see)) the problem } }. Six chunks are obtained: the subject chunk "you", the predicate chunks "need", "try" and "see", the adverbial chunk "from this perspective", and the object chunk "the problem". What is special about this sentence is that its objects are nested: predicative subject chunks and predicative object chunks are directly split further into small chunks when they are processed.

Step two: find the predicate chunks in the sentence and build the dependencies between the other chunks and the predicate chunks. The sentence has three predicate chunks: "need", "try" and "see". The outermost predicate chunk is "need"; the subject chunk "you" is connected to subject position 1 in front of it. The predicate chunk of the second layer is "try". The predicate chunk of the third layer is "see"; the adverbial chunk "from this perspective" is connected to modifier position 2 above it, and the object chunk "the problem" is connected to object position 3 behind it. Note that when a predicative object is annotated, the predicate chunk inside it stands for the whole predicative object and is linked to the predicate chunk of the layer above; predicative subjects are handled in the same way.

Therefore, after the above annotation, dependencies must be established between each predicative object and the predicate chunk of the layer above: position 4 of the predicate chunk "try", which stands for the whole predicative object, is connected to object position 3 behind the outermost predicate chunk "need", and position 4 of the predicate chunk "see", which stands for its whole predicative object, is connected to object position 3 of the predicate chunk "try" in the layer above. In addition, the predicate chunks "try" and "see" lack subject chunks, so step three is needed to find the default components.

Step three: find the default components. The predicate chunks "try" and "see" lack subject chunks, so the outermost subject chunk "you" is found in the context as their omitted subject chunk, shared with the outermost predicate chunk "need". The subject chunk "you" is connected to subject position 1 in front of "try" and of "see", and the default subject chunks are thereby recovered.

Step four: split small chunks. The chunks are not nested, so no small-chunk splitting is needed.

This completes the labeling of chunk dependency on the sentence, and the result is shown in fig. 31.
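The nested predicative objects of example 4, and the sharing of the outermost subject, can be sketched as follows. The function `share_subject` and the dictionary layout are hypothetical; the propagation rule (copy the outermost subject down an acyclic chain of predicative objects) is an assumption that happens to reproduce this example, not a rule stated by the patent. The third predicate chunk is glossed here as "see".

```python
def share_subject(slots, outer):
    """Copy the outer predicate's subject down a chain of predicative objects.

    slots: dict mapping each predicate chunk to {1: subjects, 2: modifiers, 3: objects}.
    An entry at position 3 that is itself a key of `slots` is a predicative
    object (a VP-OBJ edge pointing to position 4 of that predicate).
    Assumes the chain of predicative objects is acyclic.
    """
    subject = slots[outer][1]
    current = outer
    while True:
        nested = [obj for obj in slots[current][3] if obj in slots]
        if not nested:
            return slots
        current = nested[0]
        if not slots[current][1]:          # subject chunk is omitted
            slots[current][1] = list(subject)


slots = {
    "need": {1: ["you"], 2: [], 3: ["try"]},
    "try":  {1: [], 2: [], 3: ["see"]},
    "see":  {1: [], 2: ["from this perspective"], 3: ["the problem"]},
}
share_subject(slots, "need")
print(slots["try"][1], slots["see"][1])
```

After the call, "you" occupies subject position 1 of all three predicate chunks, matching the recovered subject chunks described in step three.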

The definitions referred to in the present invention are detailed below:

definition 1: chunk dependency refers to a relationship between chunks and allocation and dominance, which is not peer-to-peer, and which has a direction. Specifically, the dominant component is called the dominant component, and the dominant component is called the dependent component. While syntactic structures also essentially contain dependencies (embellishments) between chunks. A dependency connects two chunks, a core chunk (generally referred to as a predicate chunk) and a dependency chunk (generally referred to as a chunk other than the predicate chunk). Dependencies can be subdivided into different types, representing specific syntactic relationships between two words.

Definition 2: The valency of a predicate. Valency is one of the most important issues in current grammatical theory. The term "valency" is borrowed from chemistry and has been translated in more than one way; it concerns how many components a given element must obligatorily combine with. It applies mainly to verbs: the valency of a verb is determined by the number of nominal expressions of different kinds that the verb can govern. A verb that can govern a nominal expression of one kind is a univalent verb, such as "swim", which can govern only one noun phrase, the agent of the action. A verb that can govern nominal expressions of two kinds is a bivalent verb, such as "like", which can govern two noun phrases: the agent of the action and the entity the action relates to. A verb that can govern nominal expressions of three kinds is a trivalent verb, such as "send", which can govern three noun phrases at once: the agent of the action and the direct and indirect entities the action relates to. The notion of verb valency can also be extended to adjectives; the valency of verbs and the valency of adjectives are together called the valency of predicates.

Definition 3: A self-sufficient structure is a sentence in natural language that is syntactically and semantically self-sufficient. The valency of the core predicate of the sentence is generally used as the criterion: if the nominal phrases required by the valency of the predicate all appear in the same sentence, the sentence is called a self-sufficient sentence.

Definition 4: A non-self-sufficient structure is a sentence in natural language that is not syntactically and semantically self-sufficient. The valency of the core predicate is again the criterion: if some nominal phrase required by the valency of the predicate does not appear in the same sentence, that is, some nominal phrase related to the verb is omitted, the sentence is called a non-self-sufficient sentence.
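Definitions 2 to 4 amount to a simple counting test, which can be sketched as follows. The tiny valency table holds only the three verbs cited in Definition 2; the function names and the idea of passing the clause's nominal arguments as a list are assumptions made for illustration, not part of the patent.

```python
# Valencies of the three verbs cited in Definition 2.
VALENCY = {"swim": 1, "like": 2, "send": 3}


def is_self_sufficient(predicate, nominal_args, valency=VALENCY):
    """True if the clause supplies at least as many nominal arguments as the valency."""
    return len(nominal_args) >= valency[predicate]


def missing_count(predicate, nominal_args, valency=VALENCY):
    """Number of obligatory arguments that still have to be recovered."""
    return max(0, valency[predicate] - len(nominal_args))


print(is_self_sufficient("like", ["he", "apples"]))  # True
print(is_self_sufficient("like", ["I"]))             # False
print(missing_count("send", ["she"]))                # 2
```

A non-zero `missing_count` marks a non-self-sufficient clause, whose default components then have to be found in the clause or its context.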

Definition 5: Nesting means that a chunk is composed of other chunks. For example, in "today we have a clearer understanding of this matter by this method", the adverbial chunk "today, by this method, for this matter" is composed of three chunks: "today", "by this method" and "for this matter".

The method uses the chunk dependency structure as an intermediate structure for syntactic analysis. This conforms to the rules of language cognition and is convenient to compute, so the analysis process is more interpretable and controllable; it also avoids analysis errors caused by word segmentation and part-of-speech tagging, and provides more accurate knowledge for full syntactic analysis and semantic analysis. The invention takes the chunk as the unit of the syntactic system and requires neither word segmentation nor part-of-speech tagging; it builds self-sufficient structures of subject, object and modifier around the predicate; and it distinguishes discourse components, mood components and phrase components.

First, chunks conform to the rules of language cognition. Chunks are language segments that are stored and retrieved as wholes, and the academic community generally accepts that they exist objectively in language. From the perspective of language acquisition and processing, using and processing language with chunks as the basic unit is also consistent with cognitive rules.

Second, Chinese word order is flexible, but chunks are internally stable. Chinese is a typical meaning-driven (paratactic) language in which inverted word order and omission are frequent, so semantic analysis that relies entirely on structure is hard to carry through. A chunk, however, is formally complete and relatively stable in both form and meaning; it is a language unit that integrates form and meaning. The structural property of the chunk itself and its syntactic function in the sentence are both relatively stable, so subsequent full syntactic analysis and semantic analysis are feasible on the basis of chunk identification.

Furthermore, compared with other syntactic theories, chunk dependency grammar, as a starting point for semantic analysis, inherits the advantages of each system while also developing them further:

Compared with dependency grammar, chunk dependency grammar still treats the predicate as the core, with the other components of the sentence governed by the predicate. The difference is that chunk dependency grammar takes chunks, which integrate form and meaning, as the unit of analysis, and thereby avoids the ambiguity caused by word boundaries that are fuzzy and hard to define. The components governed by the predicate in chunk dependency grammar are chunk units with complete meaning and independent function, and the internal components of a chunk are not analyzed further.

Compared with generative grammar, chunk dependency grammar does not need to build grammar rules layer by layer from the bottom up; only the upper-level combination rules are kept. The grammatical structure is shallow, in most cases only one or two layers deep; the unit of analysis is coarse-grained, the combination relations are simple and the semantic relations are clear, which provides a structural basis for subsequent semantic computation.

Compared with categorial grammar, the chunk dependency grammar system is simple and clear, easy to understand and highly operable. Categorial grammar is a lexicon-based formal theory: building rigorous word categories takes a great deal of effort, and because its analysis process integrates syntax and semantics, it places high demands on the algorithm and the analysis system and is difficult to implement. Chunk dependency grammar emphasizes syntactic-semantic integration only at the level of the analysis unit, so the complex task of semantic analysis is modularized and the analysis process becomes more operable, interpretable and controllable.

Furthermore, because chunk dependency grammar uses chunks as the unit of analysis, it avoids the uncertainty or ambiguity of Chinese word boundaries. Chinese lacks formal markers, and Chinese words are formed in the same ways as the basic structural types of phrases, so the boundary between words and phrases is unclear. In addition, the separable use of Chinese words, in which the parts of a word are split apart, makes word units hard to identify. A notable characteristic of natural language is that special cases are found everywhere, and it is difficult to classify them and analyze their internal structure according to general linguistic theory. The chunk, by contrast, is a carrier in which syntactic and semantic analysis are isomorphic: it is relatively complete and independent, and the meaning it expresses as a whole is relatively stable. In computation, a chunk is packaged as a language unit that integrates form and meaning; analysis of its internal structure can be deferred and carried out later with the help of knowledge, which eliminates word-segmentation fragments and improves robustness.

Claims

1. A natural language-oriented chunk syntactic dependency graph representation and data labeling method, comprising the following steps:

S1: annotating the chunks of a sentence, delimiting the different types of chunks in the sentence with different markers;

S3: finding the default components of non-self-sufficient sentences, wherein the default component is to be found in the clause itself or in the context of the clause, and establishing the dependency between the default component and the predicate chunk;

S4: splitting the chunks into small chunks, and establishing the dependency between each small chunk and the predicate chunk.

2. A natural language-oriented chunk syntactic dependency representation and data tagging method according to claim 1, wherein: in step S1, the chunks are represented as adjacent word sequences in a sentence, and are non-recursive structures that bear a certain syntactic function and conform to grammatical rules, and after the sentence is segmented, a linear sequence of chunks is obtained.

3. The natural language oriented chunk syntactic dependency representation and data tagging method of claim 1, wherein: in step S1, the chunks are divided into syntactic chunks and non-syntactic chunks, depending on whether they serve as syntactic components; the syntactic chunks are divided, according to their positions in the sentence, into predicate chunks, subject chunks, object chunks, adverbial chunks and complement chunks, and the non-syntactic chunks are divided, according to their functions, into connective chunks and auxiliary chunks.

4. A natural language oriented chunk syntactic dependency representation and data tagging method according to claim 3, wherein: a predicate chunk, or a predicate chunk together with the adverbial chunk and complement chunk adjacent to it, is marked with "( )"; an adverbial chunk or complement chunk separated by punctuation or other components is marked with "[ ]"; a predicative subject chunk or predicative object chunk is marked with "{ }"; connective chunks are marked with "< >"; auxiliary chunks are marked with "< >"; and nominal subject chunks and nominal object chunks are not marked with symbols.

5. A natural language-oriented chunk syntactic dependency representation and data tagging method according to claim 1, wherein: in step S2, the predicate chunk is taken as the core of the sentence, and every type of non-predicate chunk is governed by and depends on the predicate chunk; if a dependency exists between a non-predicate chunk and the predicate chunk, the non-predicate chunk is called a dependent component of the predicate chunk, and the predicate chunk is the dependency target of the non-predicate chunk.

6. The natural language oriented chunk syntactic dependency representation and data tagging method of claim 5, wherein: the predicate chunk serves as the dependency target of all chunks in the sentence and is given four nodes, which respectively represent the subject position, the object position, the modifier position and the predicate position of the predicate chunk; each non-predicate chunk depends on one of the four nodes of the predicate chunk according to its category, the dependency lines point from the four nodes of the predicate chunk to its dependent components, and each predicate chunk together with its surrounding dependent components forms a self-sufficient structure:

(4) a predicate-associated chunk is an empty predicate identical to the predicate chunk; it is connected to position 4 of the predicate chunk, and the dependency line points from position 4 of the predicate chunk to position 4 of the predicate-associated chunk; the relation between the two is defined as VP-EMP.

7. A natural language-oriented chunk syntactic dependency representation and data tagging method according to claim 1, wherein: in step S2, the labeling specification of the chunk dependency graph is divided into self-sufficient sentences, non-self-sufficient sentences and small-chunk splitting; a self-sufficient sentence is one in which the obligatory governed components of the predicate chunk are all inside the clause; a non-self-sufficient sentence is one in which an obligatory governed component of the predicate chunk is omitted; and small-chunk splitting applies when a chunk does not depend on the predicate chunk as a whole but only through one of its small chunks, in which case the small chunks are split out according to the chunk structure.

8. A natural language-oriented chunk syntactic dependency representation and data labeling method according to claim 7, wherein: the labeling of self-sufficient sentences covers basic sentence patterns and special sentence patterns, wherein the basic sentence patterns are divided into simple sentences, compound sentences, subject-predicate predicate sentences, double-object sentences, predicative-subject sentences and predicative-object sentences; and the special sentence patterns are divided into serial-verb sentences, pivotal sentences, noun-predicate sentences, nominal independent sentences, predicative independent sentences, inverted sentences, subjectless sentences and proverbs.

9. The natural language oriented chunk syntactic dependency representation and data tagging method of claim 7, wherein: the labeling of non-self-sufficient sentences covers subject chunk default, object chunk default, adverbial chunk default, complement chunk default, and predicate chunk default.

10. A natural language-oriented chunk syntactic dependency representation and data tagging method according to claim 1, wherein: in step S4, small-chunk splitting covers two cases: first, when the default component is part of some chunk, that part is cut out, made into a chunk of its own, and its dependency on the predicate chunk is established; and second, a complex adverbial chunk or complement chunk, collectively a modifier chunk, is cut into separate small chunks, and the dependency between each small chunk and the predicate chunk is established.