CN112395871A

CN112395871A - Method and system for automatically acquiring collocation formulas, and visualization method

Info

Publication number: CN112395871A
Application number: CN202011413473.6A
Authority: CN
Inventors: 唐旭日
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2020-12-02
Filing date: 2020-12-02
Publication date: 2021-02-23

Abstract

The invention discloses a method and a system for automatically acquiring collocation formulas and a method for visualizing them, and belongs to the technical field of natural language processing. The automatic acquisition method comprises the following steps: extracting a set of collocation formula instances for a target word or a target syntactic pattern from a corpus, the target syntactic pattern being given as a dependency tree; clustering the instance set into a plurality of communities; and, for each cluster community, obtaining the corresponding collocation formula. Taking a specific word or a specific sentence pattern as the unit, the invention uses clustering to simulate the cognitive process of human language learning and thereby obtains collocation formulas. A collocation formula captures a typical semantic and communicative function in a specific language and comprises the syntactic pattern, the words that fill it, and the strength of association between each word and the formula. The method overcomes the low information content of plain collocations, is highly interpretable, and can meet the needs of online language education and grammar correction.

Description

Method and system for automatically acquiring collocation formulas, and visualization method

Technical Field

The invention belongs to the technical field of natural language processing, and particularly relates to a collocation formula automatic acquisition method and system and a visualization method.

Background

Information applications based on natural language processing, such as automatic grammar correction systems and online education systems, rely on explicit language knowledge bases. Because language itself is complex, building a language knowledge base by hand, for example a dictionary or a grammar knowledge base, costs a great deal of time, labor and money, and the result still suffers from gaps in coverage and consistency. Automatic acquisition of language knowledge is an effective way to build such a knowledge base. Knowledge acquired through deep learning, however, is not yet interpretable, and the interpretable linguistic knowledge that can currently be acquired automatically is mainly collocation knowledge. Cognitive linguistics has proposed a new basic unit of linguistic knowledge: the formula. A formula is a pairing of form and meaning. The collocation formula is a way of representing formulas proposed by corpus-based cognitive linguistic research; it covers several kinds of knowledge, including syntax, vocabulary, and the strength of association between words and formulas. Existing theoretical research shows that collocation formulas are highly interpretable, can explain a variety of language phenomena, and have broad application prospects in automatic grammar correction, online education and related areas.

However, the collocation formula has not been strictly formalized, and no mature method for acquiring or visualizing it automatically is available.

Disclosure of Invention

In view of the defects of the prior art and the need for improvement, the invention provides a method and a system for automatically acquiring collocation formulas and a method for visualizing them, with the aim of providing a strict formal definition of the collocation formula together with corresponding automatic acquisition and visualization methods.

Given a target word or a syntactic pattern and a corpus that has been parsed with dependency syntax, the method automatically extracts the collocation formula instances of the target word or syntactic pattern from the corpus, generates collocation formulas from them by clustering, and provides a way to visualize the result. The generated knowledge base is highly interpretable and can be used in fields such as online language education and automatic grammar correction.

To achieve the above object, according to a first aspect of the present invention, there is provided a collocation formula automatic acquisition method, including the steps of:

S1. extracting a set of collocation formula instances for a target word or a target syntactic pattern from a corpus, the target syntactic pattern being given as a dependency tree;

S2. clustering the collocation formula instance set into a plurality of communities;

S3. for each cluster community, obtaining the corresponding collocation formula.

Preferably, the set of collocation formula instances of the target word is extracted from the corpus as follows:

(A1) search the corpus for all sentence instances containing the target word, convert each sentence instance into a dependency tree by dependency parsing, and collect all the dependency trees into a dependency tree set;

(A2) for each dependency tree in the dependency tree set, construct a collocation formula instance as follows:

initialize a dependency subtree to be empty, traverse every triple in the dependency tree, add to the subtree each triple whose head word or dependent word is the target word, and, when the traversal is finished, take the subtree as the collocation formula instance of that dependency tree.

Beneficial effect: a collocation formula instance of a target word is centered on that word and collects, through dependency links, both the syntactic components that depend on the target word and the component it depends on, so that it fully reflects the specific syntactic usage of the target word.
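Step (A2) can be illustrated with a minimal sketch. The `Triple` type, the function name `extract_instance` and the toy English tree are illustrative assumptions, not part of the original disclosure; the filter itself follows the rule stated above (keep every triple whose head word or dependent word is the target word).

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    dep_type: str   # t: dependency type
    head: str       # h: head word
    dependent: str  # c: dependent word


def extract_instance(tree: List[Triple], target: str) -> List[Triple]:
    """Keep, in original order, every triple whose head or dependent is the target word."""
    return [e for e in tree if e.head == target or e.dependent == target]


# Toy dependency tree (hypothetical example sentence: "she reads old books").
tree = [
    Triple("nsubj", "reads", "she"),
    Triple("dobj", "reads", "books"),
    Triple("amod", "books", "old"),
]
instance = extract_instance(tree, "reads")
```

With the target word "reads", the triple `amod(books, old)` is dropped because "reads" is neither its head nor its dependent.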

Preferably, the set of collocation formula instances of the target syntactic pattern is extracted from the corpus as follows:

(B1) extract search terms from the target syntactic pattern and build a search term set;

(B2) search the corpus for sentence instances that contain all the search terms in the set, and convert each sentence instance into a dependency tree by dependency parsing to form a dependency tree set;

(B3) for each dependency tree in the dependency tree set, check whether it contains all the triples of the target dependency tree and whether their order in the target dependency tree is consistent with their order in the dependency tree; if so, go to step (B4), otherwise no instance is obtained from this tree. The set of collocation formula instances is built in this way:

(B4) compare the dependency tree with the target dependency tree to determine the word matched by the wildcard;

(B5) initialize a dependency subtree to be empty, traverse every triple in the dependency tree, add to the subtree each triple that meets any one of the following conditions, and, when the traversal is finished, take the subtree as the collocation formula instance of that dependency tree:

1) the triple exists in the target dependency tree;

2) the dependent word of the triple is the matched word;

3) the head word of the triple is the matched word.

Beneficial effect: a collocation formula instance of a syntactic pattern captures the key syntactic part fixed by the dependency links and uses the wildcard to identify the content word associated with the pattern; the components that depend on that content word and the component it depends on are then collected, so that the instance fully reflects the specific usage of the syntactic pattern.
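Steps (B3) to (B5) can be sketched as follows. This is only an illustration under stated assumptions: the matcher is a simple greedy in-order scan without backtracking, the pattern is assumed to contain the `CORE` wildcard, and the names `match_core` and `extract_instance` as well as the toy English sentence are hypothetical.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (dependency type, head word, dependent word)
CORE = "CORE"                  # core-word wildcard


def match_core(pattern: List[Triple], tree: List[Triple]) -> Optional[str]:
    """Greedy in-order match; returns the word bound to CORE, or None if no match.

    Assumes the pattern contains the CORE wildcard at least once.
    """
    core: Optional[str] = None
    pos = 0
    for t, h, c in pattern:
        found = False
        while pos < len(tree) and not found:
            tt, th, tc = tree[pos]
            pos += 1
            if tt != t:
                continue
            binding, ok = core, True
            for want, got in ((h, th), (c, tc)):
                if want == CORE:
                    if binding is None:
                        binding = got          # first binding of the wildcard
                    elif binding != got:
                        ok = False             # wildcard must bind consistently
                elif want != got:
                    ok = False
            if ok:
                core, found = binding, True
        if not found:
            return None
    return core


def extract_instance(pattern: List[Triple], tree: List[Triple]) -> Optional[List[Triple]]:
    """Apply conditions 1) to 3): keep pattern triples and triples touching the matched word."""
    core = match_core(pattern, tree)
    if core is None:
        return None
    concrete = {(t, core if h == CORE else h, core if c == CORE else c)
                for t, h, c in pattern}
    return [e for e in tree if e in concrete or e[1] == core or e[2] == core]


pattern = [("auxpass", CORE, "was")]
tree = [("nsubjpass", "praised", "student"),
        ("det", "student", "the"),
        ("auxpass", "praised", "was")]
instance = extract_instance(pattern, tree)
```

Here the wildcard binds to "praised", so the instance keeps the `nsubjpass` and `auxpass` triples and drops `det(student, the)`.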

Preferably, in step S2, given a collocation formula instance set Γ′ = {C_i} and the corresponding maximum clustering distances D = {ε_i}, Γ′ is clustered as follows:

(C1) examine each C_i in turn, initialize the community containing C_i as I_i = {C_i}, set the visited flag of C_i to True, and go to step (C2);

(C2) obtain the neighborhood N of C_i, namely the instances C_j whose distance relative to C_i is within r, where r is the maximum search distance, and go to step (C3);

(C3) examine each C_j in N in turn: if the visited flag of C_j is False and, for every instance C_k in community I_i, the distance of C_k relative to C_j is within ε_j, the maximum clustering distance of C_j, then put C_j into community I_i, set the visited flag of C_j to True, obtain the neighborhood N′ of C_j, and update N such that N = N′.

Beneficial effect: the invention clusters the collocation formula instances on the basis of their ordered similarity and determines the typical semantic range of each collocation formula by calculating the strength of association between words and the formula. This simulates the way a human reader accumulates, abstracts and generalizes language knowledge, and yields a form of language knowledge that resembles human language knowledge and is easy to understand and explain.
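The community-voting procedure (C1) to (C3) can be sketched as below. Several points are assumptions made only for illustration: instances are referred to by index, already-visited instances are skipped in (C1), "within" is read as `<=`, and the update of N is implemented by appending the new neighborhood to a work queue. The function name `cluster` and the toy one-dimensional data are hypothetical.

```python
from typing import Callable, List, Sequence


def cluster(n: int,
            dist: Callable[[int, int], float],
            eps: Sequence[float],
            r: float) -> List[List[int]]:
    """dist(j, i) is the distance of instance j relative to instance i."""
    visited = [False] * n
    communities: List[List[int]] = []
    for i in range(n):                                   # (C1)
        if visited[i]:
            continue                                     # assumption: skip clustered items
        community = [i]
        visited[i] = True
        queue = [j for j in range(n) if j != i and dist(j, i) <= r]   # (C2)
        while queue:                                     # (C3)
            j = queue.pop(0)
            if visited[j]:
                continue
            # Every current member must accept j: distance within j's own threshold.
            if all(dist(k, j) <= eps[j] for k in community):
                community.append(j)
                visited[j] = True
                # Assumption: the neighborhood of j extends the work queue.
                queue.extend(m for m in range(n)
                             if not visited[m] and dist(m, j) <= r)
        communities.append(community)
    return communities


# Toy data: points on a line, with a symmetric distance for simplicity.
xs = [0.0, 0.1, 0.15, 0.9, 0.95]
result = cluster(len(xs), lambda j, i: abs(xs[j] - xs[i]), [0.2] * len(xs), r=0.6)
```

In the method itself the distance is asymmetric, so `dist(j, i)` and `dist(i, j)` would generally differ; the toy example uses a symmetric distance only to keep the trace easy to follow.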

Preferably, the maximum clustering distance of a collocation formula instance C_i is calculated as follows:

(D1) obtain the set D of distances of C_i relative to each collocation formula instance in Γ′, where n is the number of collocation formula instances contained in Γ′;

(D2) plot a histogram of D with distance on the horizontal axis and the number of collocation formula instances falling in each distance interval on the vertical axis; define p₁ as the distance value at the 15th percentile and p₂ as the first distance value after p₁ whose count is 0;

(D3) obtain the standard deviation σ of D;

(D4) the maximum clustering distance of C_i is then calculated from these quantities and a multiplier parameter δ, where 1 ≤ δ ≤ 5.

Beneficial effect: the invention treats the distance between a given collocation formula instance and all collocation formula instances as a random variable and, by examining its probability mass function and taking 15% as an empirical value, obtains a suitable maximum clustering distance for each instance.
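The quantities used in steps (D2) and (D3) can be computed as in the following sketch. It stops short of step (D4), because the formula that combines p₁, p₂, σ and δ is not reproduced in this text. The bin count of 10, the nearest-rank percentile, the use of the left bin edge for p₂ and the function name `distance_profile` are all assumptions for illustration.

```python
import math
import statistics
from typing import List, Tuple


def distance_profile(distances: List[float], bins: int = 10) -> Tuple[float, float, float]:
    """Return (p1, p2, sigma) for distances in [0, 1]."""
    ordered = sorted(distances)
    # p1: 15th percentile, nearest-rank definition (assumption).
    rank = max(1, math.ceil(0.15 * len(ordered)))
    p1 = ordered[rank - 1]
    # Histogram over [0, 1] with equal-width bins.
    counts = [0] * bins
    for d in ordered:
        counts[min(int(d * bins), bins - 1)] += 1
    # p2: left edge of the first empty bin at or after the bin holding p1;
    # falls back to 1.0 when no bin is empty (assumption).
    p2 = 1.0
    for b in range(min(int(p1 * bins), bins - 1), bins):
        if counts[b] == 0:
            p2 = b / bins
            break
    sigma = statistics.pstdev(ordered)   # population standard deviation of D
    return p1, p2, sigma


distances = [0.03, 0.12, 0.17, 0.52, 0.58, 0.64, 0.71, 0.83, 0.92, 0.96]
p1, p2, sigma = distance_profile(distances)
```

For this toy distance set, three instances lie close to C_i, followed by an empty stretch of the histogram; p₂ marks where that gap begins.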

Preferably, the distance of a collocation formula instance C_j relative to a collocation formula instance C_k is calculated as follows:

(E1) based on the similarity of the triples in C_j and C_k, calculate the feature similarity of C_j and C_k;

(E2) based on the feature similarity of C_j and C_k, calculate the ordered similarity of C_j relative to C_k, where len(C) is the number of triples contained in C, and α and β are the weights given in the similarity calculation to the features specific to C_j and to C_k respectively, with α + β = 1;

(E3) based on the ordered similarity of C_j relative to C_k, calculate the distance of C_j relative to C_k.

Beneficial effect: the invention introduces an asymmetric similarity calculation based on Amos Tversky's similarity model and, by assigning different weights to the features specific to each side, obtains an ordered similarity between collocation formula instances, which makes it possible to handle containment relationships between instances.

Preferably, the feature similarity of C_j and C_k is calculated from the similarity of their triples, where C_j = (e₁ = <t₁, h₁, c₁>, e₂ = <t₂, h₂, c₂>, ... e_J = <t_J, h_J, c_J>) and C_k = (e₁ = <t₁, h₁, c₁>, e₂ = <t₂, h₂, c₂>, ... e_K = <t_K, h_K, c_K>), and in each triple t is the dependency type, h is the head word and c is the dependent word. The calculation proceeds as follows:

initialize a matrix M of size (J+1) × (K+1) and set the cells in the first row and the first column to 0; starting from the second cell of the second row, compute the cell values row by row,

where sim(e_p, e_q), the similarity of the triples e_p and e_q, is computed from:

sim(h_p, h_q) = cosine(vec(h_p), vec(h_q))

sim(c_p, c_q) = cosine(vec(c_p), vec(c_q))

where cosine(·) is the cosine function and vec(·) is the word vector;

after the calculation is completed, the feature similarity of C_j and C_k is obtained from M.

Beneficial effect: by using dynamic programming, which accounts for the cost of the different alignment paths, the invention obtains the maximum feature similarity of two collocation formula instances while respecting the order of the triples within each instance.
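The matrix computation can be illustrated with a weighted longest-common-subsequence recurrence. The recurrence itself is an assumption, since the cell-update formula is not reproduced in this text; it is chosen because it matches the description (zero-initialized first row and column, row-by-row filling, order-preserving maximum). The function name `feature_similarity` and the toy similarity are hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def feature_similarity(a: Sequence[T], b: Sequence[T],
                       sim: Callable[[T, T], float]) -> float:
    """Order-preserving maximum total similarity between two triple sequences."""
    rows, cols = len(a) + 1, len(b) + 1
    m: List[List[float]] = [[0.0] * cols for _ in range(rows)]  # first row/column stay 0
    for p in range(1, rows):
        for q in range(1, cols):
            m[p][q] = max(
                m[p - 1][q],                                   # skip a[p-1]
                m[p][q - 1],                                   # skip b[q-1]
                m[p - 1][q - 1] + sim(a[p - 1], b[q - 1]),     # align the two items
            )
    return m[-1][-1]   # assumption: the result is the bottom-right cell


# Toy example: items are dependency-type labels, similarity is exact match.
exact = lambda x, y: 1.0 if x == y else 0.0
score = feature_similarity(["nsubj", "auxpass", "dobj"], ["auxpass", "dobj"], exact)
```

With a graded triple similarity in place of `exact`, the same recurrence accumulates partial matches while still forbidding alignments that cross each other.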

Preferably, step S3 includes the following sub-steps:

(1) for every collocation formula instance C_i in cluster community I_i, form a sequence of 2-tuples of adjacent triples, following the order of the triples, as shown in the following formula:

g_i＝(<e₁，e₂>，<e₂，e₃>，...<e_k-1，e_k>)

(2) merge all the 2-tuple sequences obtained from I_i to construct a directed graph G, in which the nodes are triples, the arc directions are given by the 2-tuples, and each arc carries the connection weight calculated for its 2-tuple;

(3) select as the start node the node n that has in-degree 0 and the largest arc weight, traverse G depth-first to obtain all subgraphs, and select the subgraph G′ that has the highest average connection weight and contains the target structure as the syntactic pattern of the collocation formula;

(4) for any node b in G′, obtain from I_i the set W = {w_i} of words that occur at that node in the collocation formula instances; the strength of association between a word w_i and the formula G′ is expressed by the P value of Fisher's exact test;

(5) G′, the word sets W corresponding to its nodes, and their association strengths together constitute the collocation formula obtained from cluster community I_i.

Beneficial effect: the start node is selected by weight and a weight-first strategy is applied during the depth-first traversal, so that the best path is extracted from the directed cyclic graph and used as the representative syntactic pattern of the collocation formula.
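Sub-steps (1) to (3) can be sketched up to the choice of the start node. Two assumptions are made because the connection-weight formula is not reproduced in this text: the weight of an arc is taken to be the number of times its 2-tuple occurs in the community, and "largest arc weight" is read as the largest single outgoing arc. Node labels are shortened to dependency types, and the names `build_graph` and `start_node` are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List, Optional, Sequence, Tuple

Arc = Tuple[Hashable, Hashable]


def build_graph(instances: Sequence[Sequence[Hashable]]) -> Counter:
    """Merge the 2-tuple sequences of all instances into weighted directed arcs."""
    weights: Counter = Counter()
    for inst in instances:
        for a, b in zip(inst, inst[1:]):      # <e1,e2>, <e2,e3>, ...
            weights[(a, b)] += 1              # assumption: weight = raw count
    return weights


def start_node(weights: Dict[Arc, int]) -> Optional[Hashable]:
    """Node with in-degree 0 whose heaviest outgoing arc is largest."""
    targets = {b for _, b in weights}
    best: Dict[Hashable, int] = {}
    for (a, _), w in weights.items():
        if a not in targets:                  # in-degree 0
            best[a] = max(best.get(a, 0), w)
    return max(best, key=best.get) if best else None


community: List[List[str]] = [
    ["nsubj", "auxpass", "root"],
    ["nsubj", "auxpass", "root"],
    ["advmod", "auxpass", "root"],
]
graph = build_graph(community)
start = start_node(graph)
```

A depth-first traversal from `start` that scores each path by its average arc weight would complete sub-step (3); it is omitted here to keep the sketch short.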

To achieve the above object, according to a second aspect of the present invention, there is provided a collocation formula visualization method, including the steps of:

automatically obtaining collocation formulas using the method of the first aspect;

for each obtained collocation formula, taking the dependency types and the head word as nodes and arranging them linearly from left to right in the order in which they occur in G′; the arcs connecting the nodes are directed, starting at the head word node and pointing to the dependent word node, and each arc displays its connection weight.

Beneficial effect: because a collocation formula is built on a dependency tree, it can be converted into a directed graph; the resulting knowledge base of collocation formulas is highly interpretable and can be used in fields such as online language education and automatic grammar correction.

To achieve the above object, according to a third aspect of the present invention, there is provided a system for automatically acquiring collocation formulas, including: a computer-readable storage medium and a processor;

the computer-readable storage medium is used for storing executable instructions;

the processor is configured to read the executable instructions stored in the computer-readable storage medium and to execute the automatic collocation formula acquisition method of the first aspect.

In general, the above technical solution conceived by the present invention achieves the following beneficial effects:

Taking a specific word or a specific sentence pattern as the unit, the invention uses clustering to simulate the cognitive process of human language learning and thereby obtains collocation formulas. A collocation formula captures a typical semantic and communicative function in a specific language and comprises the syntactic pattern, the words that fill it, and the strength of association between each word and the formula. The method overcomes the low information content of plain collocations, is highly interpretable, and can meet the needs of online language education and grammar correction.

Drawings

FIG. 1 is a flow chart of the automatic collocation formula acquisition method provided by the present invention;

FIG. 2 is the visualization result of a collocation formula provided by the present invention;

FIG. 3 is the visualization result of a specific collocation formula provided by the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

As shown in FIG. 1, the present invention provides an automatic acquisition method for collocation formulas, which comprises the following steps: inputting a target word or target syntactic pattern together with a set of dependency trees (the corpus); extracting collocation formula instances; calculating the similarity between collocation formula instances; clustering the instances with a clustering algorithm based on community voting; generating a collocation formula for each cluster community; and finally rendering the collocation formulas as graphs.

1. Input format definition

The input includes both the search target structure and the dependency tree set.

The retrieval target structure is the first input and takes one of two forms: (1) a single word, which may be a verb, a noun, an adjective or an adverb and is called the core word w, such as the Chinese adjective meaning "happy"; (2) a dependency subtree. A dependency subtree comprises several triples of the form <t, h, c>, where t is the dependency type, h is the head word and c is the dependent word. A dependency subtree can therefore be expressed as an ordered sequence of triples:

K＝(e₁＝<t₁，h₁，c₁>，e₂＝<t₂，h₂，c₂>，...e_n＝<t_n，h_n，c_n>)

The subtree contains a core-word wildcard, written CORE, which may take the place of any h_i or c_i. For example, the dependency subtree "auxpass (CORE, by) aspect" contains two triples, in which CORE is the core-word wildcard, auxpass and aspect are the dependency types, CORE is the head word, and the dependent words are the Chinese passive marker 被 (rendered "by") and the word rendered "away".

The second input is the dependency tree set Γ = {T_j}. Given a corpus, if the input target structure is a single word, the corpus is searched directly for sentence instances, and Γ is obtained by dependency parsing; if the input target structure is a dependency subtree, search terms are first extracted from the subtree, the corpus is then searched for sentence instances, and after dependency parsing the dependency trees of those sentence instances that contain the subtree are selected as Γ. Like a dependency subtree, a single dependency tree is represented as:

T_j＝(e₁＝<t₁，h₁，c₁>，e₂＝<t₂，h₂，c₂>，...e_m＝<t_m，h_m，c_m>)

2. extraction of collocation formula examples

(1) If the input target structure is a single word, the collocation formula instances are obtained by traversing the dependency tree set Γ, analyzing the dependency trees T_j one by one, and selecting the triples whose head word or dependent word is the core word; together these triples form a dependency subtree.

The selection condition is as follows: for each triple e_i = <t_i, h_i, c_i> of the dependency tree T_j, if h_i = w or c_i = w, then e_i is added to the dependency subtree. The resulting dependency subtree is the collocation formula instance obtained from T_j, namely:

C＝(e₁＝<t₁，h₁，c₁>，e₂＝<t₂，h₂，c₂>，...e_k＝<t_k，h_k，c_k>)

where for any e_i = <t_i, h_i, c_i>, h_i = w or c_i = w.

(2) If the input target structure is a dependency subtree K, traverse the dependency tree set Γ and analyze the dependency trees T_j one by one to obtain the collocation formula instances.

For a dependency tree T_j, if every triple e_i = <t_i, h_i, c_i> in K satisfies e_i ∈ T_j and the order of the triples in K is consistent with their order in T_j, a collocation formula instance can be obtained from T_j; otherwise none is obtained. For a dependency tree T_j that yields an instance, comparing it with K determines the word matched by the wildcard CORE, denoted w_core. Let C denote the collocation formula instance; examine the triples e_i = <t_i, h_i, c_i> of T_j one by one and add e_i to C if any of the following conditions is met:

(1) e_i ∈ K;

(2) c_i = w_core;

(3) h_i = w_core.

3. Similarity calculation of collocation formula examples

The similarity of two triples e₁ = <t₁, h₁, c₁> and e₂ = <t₂, h₂, c₂> is calculated as follows:

wherein:

sim(h₁，h₂)＝cosine(vec(h₁)，vec(h₂))

sim(c₁，c₂)＝cosine(vec(c₁)，vec(c₂))

where cosine(·) is the cosine function and vec(·) is the word vector. Word vectors may be obtained with any commonly used algorithm, such as word2vec.
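The cosine terms can be computed as in the sketch below. How the head-word similarity, the dependent-word similarity and the dependency types are combined into a single triple similarity is not reproduced in this text, so the combination used in `triple_similarity` (zero when the dependency types differ, otherwise the mean of the two cosines) is purely an assumption, as are the toy two-dimensional vectors standing in for word2vec embeddings.

```python
import math
from typing import Dict, Sequence, Tuple

Triple = Tuple[str, str, str]  # (dependency type, head word, dependent word)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is zero."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def triple_similarity(e1: Triple, e2: Triple,
                      vec: Dict[str, Sequence[float]]) -> float:
    """Hypothetical combination: type must match, then average the two cosines."""
    (t1, h1, c1), (t2, h2, c2) = e1, e2
    if t1 != t2:
        return 0.0
    return 0.5 * (cosine(vec[h1], vec[h2]) + cosine(vec[c1], vec[c2]))


# Toy vectors (illustrative only, not real embeddings).
vec = {"eat": (1.0, 0.0), "apple": (3.0, 4.0), "pear": (4.0, 3.0)}
s = triple_similarity(("dobj", "eat", "apple"), ("dobj", "eat", "pear"), vec)
```

In practice `vec` would be a lookup into trained word vectors, and words missing from the vocabulary would need a fallback, which the sketch does not handle.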

Given two collocation formula instances C₁ and C₂:

C₁＝(e₁＝<t₁，h₁，c₁>，e₂＝<t₂，h₂，c₂>，...e_k＝<t_k，h_k，c_k>)

C₂＝(e₁＝<t₁，h₁，c₁>，e₂＝<t₂，h₂，c₂>，...e_l＝<t_l，h_l，c_l>)

the feature similarity of C₁ and C₂ is calculated as follows. Initialize a matrix M of size (k+1) × (l+1) and set the cells in the first row and the first column to 0; starting from the second cell of the second row, compute the cell values row by row, where sim(e_i, e_j) is the triple similarity defined above. After the calculation is completed, the feature similarity of C₁ and C₂ is obtained from M.

Based on the feature similarity of C₁ and C₂, the ordered similarity of C₁ relative to C₂ is calculated. In this calculation the minimum similarity of two collocation formula instances is set to 0.05, len(C) is the number of triples contained in C, and α and β are the weights given in the similarity calculation to the features specific to C₁ and to C₂ respectively, with α + β = 1. The ordered similarity of C₁ relative to C₂ differs from the ordered similarity of C₂ relative to C₁.
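The asymmetry of the ordered similarity can be illustrated with Tversky's ratio model. The exact formula used by the method is not reproduced in this text, so everything below is an assumption made for illustration: the ratio form, the default weights α = 0.7 and β = 0.3 (only α + β = 1 is stated), and the conversion to a distance as one minus the similarity. Only the floor of 0.05 comes from the description.

```python
def ordered_similarity(f: float, len_j: int, len_k: int,
                       alpha: float = 0.7, beta: float = 0.3,
                       floor: float = 0.05) -> float:
    """Tversky-style similarity of C_j relative to C_k.

    f is the feature similarity (shared part); len_j - f and len_k - f are
    treated as the parts specific to C_j and C_k (assumption).
    """
    denom = f + alpha * (len_j - f) + beta * (len_k - f)
    s = f / denom if denom > 0 else 0.0
    return max(s, floor)   # minimum similarity of 0.05, as stated in the text


def ordered_distance(f: float, len_j: int, len_k: int,
                     alpha: float = 0.7, beta: float = 0.3) -> float:
    """Assumption: distance is one minus the ordered similarity."""
    return 1.0 - ordered_similarity(f, len_j, len_k, alpha, beta)


# A 2-triple instance fully contained in a 4-triple instance (f = 2).
small_to_large = ordered_similarity(2.0, 2, 4)
large_to_small = ordered_similarity(2.0, 4, 2)
```

Under these assumed weights the contained instance is more similar to its container than the other way round, which is the containment effect the asymmetric measure is meant to capture.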

4. Clustering method based on collocation formula community

Based on the ordered similarity of C₁ relative to C₂, the distance of C₁ relative to C₂ is calculated.

Given the dependency tree set Γ, the collocation formula instances are obtained one by one, yielding the collocation formula instance set Γ′ = {C_i}.

The maximum clustering distance of C_i is calculated as follows:

(1) obtain the set D of distances of C_i relative to each collocation formula instance in Γ′, where n is the number of collocation formula instances contained in Γ′;

(2) plot a histogram of D, in which the horizontal axis is distance with range [0, 1] and the vertical axis is the number of collocation formula instances in each distance interval. Assuming that at most 15% of the instances in Γ′ can form a collocation formula together with C_i, define p₁ as the distance value at the 15th percentile and p₂ as the first distance value after p₁ whose count is 0;

(3) obtain the standard deviation σ of D;

(4) the maximum clustering distance of C_i is then defined from these quantities.

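In step S2, given the collocation formula instance set Γ′ = {C_i} and the corresponding maximum clustering distances D = {ε_i}, Γ′ is clustered as follows:

(C1) examine each C_i in turn, initialize the community containing C_i as I_i = {C_i}, set the visited flag of C_i to True, and go to (C2);

(C2) obtain the neighborhood N of C_i, namely the instances C_j whose distance relative to C_i is within r, where r is the maximum search distance, and go to (C3);

(C3) examine each C_j in N in turn: if the visited flag of C_j is False and, for every instance C_k in community I_i, the distance of C_k relative to C_j is within ε_j, the maximum clustering distance of C_j, then put C_j into community I_i, set the visited flag of C_j to True, obtain the neighborhood N′ of C_j, and update N such that N = N′.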

The present embodiment sets the maximum search distance r to 0.6.

5. Generation of collocation constructs

For each cluster community I_i, the collocation formula is obtained as follows:

(1) for every collocation formula instance C_i in I_i, form a sequence of 2-tuples of adjacent triples, following the order of the triples, as shown in the following formula:

g_i＝(<e₁，e₂>，<e₂，e₃>，...<e_k-1，e_k>)

In what follows, <e_i, e_i+1> is referred to as a 2-tuple.

(2) Merge all the 2-tuple sequences obtained from I_i to construct a directed graph G, in which the nodes are triples, the arc directions are given by the 2-tuples, and each arc carries a connection weight.

(3) Select the node n that has in-degree 0 and the largest arc weight, traverse G depth-first with n as the start node to obtain all subgraphs, and select the subgraph G′ that has the highest average connection weight and contains the target structure as the syntactic pattern of the collocation formula. The average connection weight is the sum of the weights of all connections on the path from the start node to the destination node, divided by the number of connections on the path.

(4) For any node b in G′, obtain from I_i the set W = {w_i} of words that occur at that node in the collocation formula instances. The strength of association between a word w_i and the formula G′ is expressed by the P value of Fisher's exact test, calculated as follows:

let f_i¹ be the number of times w_i occurs at the node, f_i² the frequency of w_i in the given corpus, n the number of collocation formula instances contained in I_i, and N the number of sentences contained in the given corpus; taking the average length of a collocation formula instance to be 20, let a = f_i¹, b = n × (20 − f_i¹), c = f_i² − f_i¹ and d = N − f_i¹, from which P is calculated.

(5) G′, the word sets corresponding to its nodes, and their association strengths together constitute the collocation formula obtained from I_i.
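Given the four cell counts a, b, c and d defined above, a Fisher exact P value can be computed from the hypergeometric distribution. The formula the method actually uses for P is not reproduced in this text, so the one-sided form below (probability of observing at least a co-occurrences with the margins fixed) is an assumption, as is the function name `fisher_one_sided`.

```python
from math import comb


def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """P(X >= a) for the 2x2 table [[a, b], [c, d]] with all margins fixed."""
    row1 = a + b              # size of the first row
    col1 = a + c              # size of the first column
    total = a + b + c + d
    denom = comb(total, col1)
    upper = min(row1, col1)   # largest value the top-left cell can take
    numer = sum(comb(row1, x) * comb(total - row1, col1 - x)
                for x in range(a, upper + 1))
    return numer / denom


# Classic small table: 3 of 4 hits in each row/column.
p = fisher_one_sided(3, 1, 1, 3)
```

A smaller P indicates a stronger association between the word and the formula. For very large corpora the binomial coefficients become big integers, which Python handles exactly; a library routine such as SciPy's `fisher_exact` could be used instead where it is available.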

6. Collocation structure type visualization

For the following collocation formula:

(<D1，X1，CORE-WORD>，

<D2，CORE-WORD，X2>

<D3，CORE-WORD，X3>

X1＝{(W_1，V_1，F_1)}

X2＝{(W_2，V_2，F_2)，(W_3，V_3，F_3)}

X3＝{(W_4，V_4，F_4)}

CORE-WORD＝{(W_5，V_5，F_5)，(W_6，V_6，F_6)，(W_7，V_7，F_7)})

where D1, D2 and D3 are dependency types and X1, X2 and X3 are syntactic slot placeholders, each pointing to a set of word information structures; each word information structure comprises a word (W), an association strength (V) and a frequency (F), the frequency being the number of times the word occurs in the community.

The visualization rules are as follows: the dependency types and CORE-WORD are taken as nodes and arranged linearly from left to right in the order in which they occur in G′; the arcs between the nodes are directed, starting at the head word node (the first placeholder in the triple) and pointing to the dependent word node (the second placeholder in the triple), and each arc displays its connection weight. The visualization result is shown in FIG. 2.
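One way to realize these rules is to emit a Graphviz DOT description with a left-to-right layout. This is only a sketch: the patent does not name a rendering tool, and the function name `to_dot`, the node labels and the weights below are hypothetical.

```python
from typing import List, Tuple


def to_dot(arcs: List[Tuple[str, str, float]]) -> str:
    """Build DOT text; each arc is (head node, dependent node, connection weight)."""
    lines = ["digraph collocation {", "  rankdir=LR;"]   # left-to-right layout
    for head, dep, weight in arcs:
        # Arrow starts at the head word node and points to the dependent word node.
        lines.append(f'  "{head}" -> "{dep}" [label="{weight}"];')
    lines.append("}")
    return "\n".join(lines)


dot = to_dot([("CORE-WORD", "NSUBJ", 0.8), ("CORE-WORD", "AUXPASS", 0.9)])
```

The resulting text can be passed to any DOT renderer; the word sets with their association strengths and frequencies could be added as node labels in the same way.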

Specifically, taking the dependency subtree "auxpass (CORE, by)" as the input and the People's Daily corpus as the corpus, several collocation formulas are obtained, one of which is shown in FIG. 3. This collocation formula captures a typical usage pattern of the Chinese passive marker 被 (bèi) and is readily interpretable: the ROOT node indicates that the formula cannot be embedded in other syntactic constituents; the words and association strengths given at the NSUBJ node indicate that the subject is semantically human; the other NSUBJ node is likewise dominated by human referents; and the CORE-WORD node is filled by verbs and, in addition, by some idioms.

The technology overcomes the shortcomings of the two existing kinds of automatic language knowledge acquisition. The first kind comes from corpus linguistics; the knowledge it yields is mainly limited to collocation information between two words, which carries too little information to meet the needs of online language education, automatic grammar correction and similar applications. The second kind comes from deep-learning-based natural language processing; its knowledge exists as a very large number of parameters, has low interpretability, and cannot supply the explicit language rules that language education and grammar correction feedback require. Taking a specific word or a specific sentence pattern as the unit, the invention uses clustering to simulate the cognitive process of human language learning and thereby obtains collocation formulas. As shown in FIG. 3, a collocation formula captures a typical semantic and communicative function in a specific language (Chinese in the figure), including the syntactic pattern, the words, and the strength of association between the words and the formula. The method thus overcomes the low information content of plain collocations, is highly interpretable, and can meet the needs of online language education and grammar correction.

It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A method for automatically acquiring collocation formulas, characterized in that the method comprises the following steps:

2. The method of claim 1, wherein the set of collocation formula instances of the target term is extracted from the corpus as follows:

3. The method of claim 1, wherein the set of collocation formula instances of the target syntactic patterns is extracted from the corpus as follows:

1) the triple exists in the target dependency tree;

2) the dependency word of the triple is a matching term;

3) the core word of the triplet is the matching term.

4. A method according to any one of claims 1 to 3, wherein in step S2, given a collocation formula instance set Γ′ = {C_i} and the corresponding maximum clustering distances D = {ε_i}, Γ′ is clustered as follows:

(C2) obtain the neighborhood N of C_i, namely the instances C_j whose distance relative to C_i is within r, where r is the maximum search distance, and go to step (C3);

and update N such that N = N′.

5. The method of claim 4, wherein the maximum clustering distance of a collocation formula instance C_i is calculated as follows:

n is the number of collocation formula instances contained in Γ′;

(D3) obtain the standard deviation σ of D;

(D4) the maximum clustering distance of C_i is calculated as follows:

6. A method according to claim 4 or 5, characterized in that the distance of a collocation formula instance C_j relative to a collocation formula instance C_k is calculated as follows:

(E2) based on the feature similarity of C_j and C_k, calculate the ordered similarity of C_j relative to C_k, where len(C) is the number of triples contained in C, and α and β are the weights given in the similarity calculation to the features specific to C_j and to C_k respectively, with α + β = 1;

(E3) based on the ordered similarity of C_j relative to C_k, calculate the distance of C_j relative to C_k.

7. The method of claim 6, wherein the feature similarity of C_j and C_k is calculated based on the similarity of their triples,

wherein:

sim(h_p, h_q) = cosine(vec(h_p), vec(h_q))

sim(c_p, c_q) = cosine(vec(c_p), vec(c_q))

where cosine(·) is the cosine function and vec(·) is the word vector;

after the calculation is completed, the feature similarity of C_j and C_k is obtained.

8. The method according to any of claims 1 to 7, characterized in that step S3 comprises the sub-steps of:

(F1) for every collocation formula instance C_i in cluster community I_i, form a sequence of 2-tuples of adjacent triples, following the order of the triples, as shown in the following formula:

g_i＝(<e₁，e₂>，<e₂，e₃>，...<e_k-1，e_k>)

(F2) merge all the 2-tuple sequences obtained from I_i to construct a directed graph G, in which the nodes are triples, the arc directions are given by the 2-tuples, and each arc carries the connection weight calculated for its 2-tuple;

(F3) select as the start node the node n that has in-degree 0 and the largest arc weight, traverse G depth-first to obtain all subgraphs, and select the subgraph G′ that has the highest average connection weight and contains the target structure as the syntactic pattern of the collocation formula;

(F4) for any node b in G′, obtain from I_i the set W = {w_i} of words that occur at that node in the collocation formula instances; the strength of association between a word w_i and the formula G′ is expressed by the P value of Fisher's exact test;

(F5) G′, the word sets W corresponding to its nodes, and their association strengths together constitute the collocation formula obtained from cluster community I_i.

9. A collocation formula visualization method, characterized by comprising the following steps:

automatically obtaining collocation formulas using a method according to any one of claims 1 to 8;

10. A system for automatically acquiring collocation formulas, comprising: a computer-readable storage medium and a processor;

the processor is used for reading executable instructions stored in the computer-readable storage medium and executing the automatic collocation formula acquisition method of any one of claims 1 to 8.