CN112395871A - Collocation configuration type automatic acquisition method and system and visualization method - Google Patents


Info

Publication number
CN112395871A
Authority
CN
China
Prior art keywords
collocation
dependency
formula
word
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011413473.6A
Other languages
Chinese (zh)
Inventor
唐旭日
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN202011413473.6A
Publication of CN112395871A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Abstract

The invention discloses a method and a system for automatically acquiring collocation formulas, together with a visualization method, and belongs to the technical field of natural language processing. The automatic acquisition method comprises the following steps: extracting a set of collocation formula instances for a target word or a target syntactic pattern from a corpus, where the target syntactic pattern takes the form of a dependency tree; clustering the instance set into a plurality of communities; and, for each cluster community, acquiring the collocation formula corresponding to that community. Taking a specific word or a specific sentence pattern as the unit, the invention uses clustering to simulate the cognitive process by which humans learn language and thereby obtains collocation formulas. A collocation formula captures a typical semantic and communicative function in a specific language, including the syntactic pattern, the words that fill it, and the strength of association between each word and the formula. The method overcomes the limited information content of plain collocations, is highly interpretable, and can meet the needs of online language education and grammar correction.

Description

Collocation configuration type automatic acquisition method and system and visualization method
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a collocation formula automatic acquisition method and system and a visualization method.
Background
Information applications built on natural language processing, including automatic grammar correction systems and online education systems, rely on explicit language knowledge bases. Because language itself is complex, building such a knowledge base by hand (for example a dictionary or a grammar knowledge base) costs a great deal of time, labor and money, and the result still has defects in coverage and consistency. Automatic acquisition of language knowledge is therefore an effective way to build a language knowledge base. Knowledge acquired through deep learning, however, is not yet interpretable, and the interpretable linguistic knowledge that can currently be acquired automatically is mainly collocation knowledge. Cognitive linguistics proposes a new basic unit of linguistic knowledge, the construction (here called a formula). A formula is a form-meaning pairing, and the collocation formula is a representation of formulas proposed by corpus-based cognitive linguistic research; it covers several types of knowledge, such as syntax, vocabulary, and the strength of association between words and formulas. Existing theoretical research shows that collocation formulas are highly interpretable, can explain a variety of language phenomena, and have broad application prospects in automatic grammar correction, online education and similar areas.
However, the collocation formula has not been strictly formalized, and no mature method exists for acquiring or visualizing it automatically.
Disclosure of Invention
In view of the defects of the prior art and the need for improvement, the invention provides an automatic collocation formula acquisition method, an automatic collocation formula acquisition system and a collocation formula visualization method, with the aim of supplying a strict formal definition of the collocation formula together with corresponding acquisition and visualization methods.
Given a target word or a syntactic pattern and a corpus that has been parsed with dependency syntax, the method automatically extracts the collocation formula instances of the target word or syntactic pattern from the corpus, generates collocation formulas from them by clustering, and provides a way to visualize the result. The generated knowledge base is highly interpretable and can be used in fields such as online language education and automatic grammar correction.
To achieve the above object, according to a first aspect of the present invention, there is provided a collocation formula automatic acquisition method, including the steps of:
S1, extracting a set of collocation formula instances of a target word or a target syntactic pattern from a corpus, wherein the target syntactic pattern is in dependency tree form;
S2, clustering the collocation formula instance set into a plurality of communities;
S3, acquiring the collocation formula corresponding to each cluster community.
Preferably, a collocation formula instance set of the target word is extracted from the corpus as follows:
(A1) searching a corpus, acquiring all sentence instances containing target words, converting each acquired sentence instance into a dependency tree through dependency syntax analysis, and forming a dependency tree set by all the dependency trees;
(A2) for each dependency tree in the set of dependency trees, constructing a set of collocation formula instances by:
initializing a dependency subtree to be empty, traversing each triple in the dependency tree, selecting every triple whose head word or dependent word is the target word and adding it to the dependency subtree, and, once the traversal is finished, taking the dependency subtree as the collocation formula instance corresponding to that dependency tree.
Beneficial effects: the collocation formula instance of a target word is centered on the target word and, through dependency relations, collects both the syntactic components that the target word depends on and those that depend on it, so it fully reflects the specific syntactic usage of the target word.
Preferably, a collocation formula instance set of the target syntactic patterns is extracted from the corpus as follows:
(B1) extracting search terms from the target syntax mode and constructing a search term set;
(B2) searching a corpus, acquiring sentence instances containing all search words in a search word set, and converting each acquired sentence instance into a dependency tree through dependency syntax analysis to form a dependency tree set;
(B3) for each dependency tree in the dependency tree set, judging whether all triples of the target dependency tree are contained in the dependency tree and whether their order in the target dependency tree is consistent with their order in the dependency tree; if so, proceeding to step (B4) to construct a collocation formula instance, otherwise extracting nothing from that dependency tree;
(B4) comparing the dependency tree with a target dependency tree to determine a matching item of the wildcard;
(B5) initializing a dependency subtree to be null, traversing each triple in the dependency tree, selecting a triple meeting any one of the following conditions to be added into the dependency subtree, and taking the dependency subtree as a corresponding collocation formula example of the dependency tree after the traversal is finished, wherein the conditions are as follows:
1) the triple exists in the target dependency tree;
2) the dependency word of the triple is a matching term;
3) the head word of the triple is the matching term.
Beneficial effects: the collocation formula instance of a syntactic pattern captures the key syntactic part determined by the dependency relations and uses the wildcard to identify the content words associated with the pattern; it then collects the syntactic components that depend on those content words and the components they depend on, yielding usage information that fully reflects the syntactic pattern.
Preferably, in step S2, given a collocation formula instance set $\Gamma' = \{C_i\}$ and the corresponding maximum clustering distances $D = \{\varepsilon_i\}$, $\Gamma'$ is clustered as follows:
(C1) examine each $C_i$ in turn: initialize the community containing $C_i$ as $I_i = \{C_i\}$, set the access value of $C_i$ to True, and go to step (C2);
(C2) obtain the neighborhood of $C_i$,
$$N = \{\, C_j \mid d(C_j, C_i) \le r \,\}$$
where $d(C_j, C_i)$ is the distance of $C_j$ relative to $C_i$ and $r$ is the maximum search distance, and go to step (C3);
(C3) examine each $C_j$ in $N$ one by one: if the access value of $C_j$ is False and, for every instance $C_k$ in community $I_i$,
$$d(C_k, C_j) \le \varepsilon_j$$
where $d(C_k, C_j)$ is the distance of $C_k$ relative to $C_j$ and $\varepsilon_j$ is the maximum clustering distance of $C_j$, then put $C_j$ into community $I_i$, set the access value of $C_j$ to True, obtain the neighborhood of $C_j$,
$$N' = \{\, C_m \mid d(C_m, C_j) \le r \,\}$$
and update $N$ so that $N = N'$.
Beneficial effects: the invention clusters collocation formula instances on the basis of their ordered similarity and determines the typical semantic range of a collocation formula from the association strength between words and the formula. This simulates how humans accumulate, abstract and generalize while reading and eventually learn language knowledge, and it yields a form of language knowledge that resembles human linguistic knowledge and is easy to understand and explain.
Preferably, the maximum clustering distance of a collocation formula instance $C_i$ is calculated as follows:
(D1) obtain the set of distances of $C_i$ relative to each collocation formula instance in the instance set $\Gamma'$,
$$D = \{\, d(C_i, C_1),\ d(C_i, C_2),\ \dots,\ d(C_i, C_n) \,\}$$
where $n$ is the number of collocation formula instances contained in $\Gamma'$;
(D2) plot a histogram of $D$ with distance on the horizontal axis and, on the vertical axis, the number of collocation formula instances whose distance falls in each interval; define the distance value at the 15th percentile as $p_1$, and the first distance value after $p_1$ whose count is 0 as $p_2$;
(D3) obtain the standard deviation $\sigma$ of $D$;
(D4) the maximum clustering distance of $C_i$ is then calculated as:
[equation image not reproduced in the source text]
where $\delta$ is a multiplicative parameter with $1 \le \delta \le 5$.
Beneficial effects: the invention treats the distances between a given collocation formula instance and all collocation formula instances as a random variable, examines the distribution of its probability mass function, and uses 15% as an empirical value, which gives a good estimate of the maximum clustering distance of the instance.
Preferably, the distance of a collocation formula instance $C_j$ relative to a collocation formula instance $C_k$ is calculated as follows:
(E1) based on the similarity of the triples in $C_j$ and $C_k$, calculate the feature similarity of $C_j$ and $C_k$;
(E2) based on this feature similarity, calculate the ordered similarity of $C_j$ relative to $C_k$:
[equation image not reproduced in the source text]
where $\mathrm{len}(C)$ is the number of triples contained in $C$, and $\alpha$ and $\beta$ are the weights given in the similarity calculation to the features that distinguish $C_j$ and $C_k$ respectively, with $\alpha + \beta = 1$;
(E3) based on the ordered similarity of $C_j$ relative to $C_k$, calculate the distance of $C_j$ relative to $C_k$:
[equation image not reproduced in the source text]
Beneficial effects: the invention adopts the asymmetric similarity measure of Amos Tversky and assigns different feature weights to the two sides of the comparison, which yields an ordered similarity between collocation formula instances and thereby captures containment relationships between them.
Preferably, the feature similarity of $C_j$ and $C_k$ is calculated from the similarity of their triples, where
$C_j = (e_1 = \langle t_1, h_1, c_1 \rangle, e_2 = \langle t_2, h_2, c_2 \rangle, \dots, e_J = \langle t_J, h_J, c_J \rangle)$,
$C_k = (e_1 = \langle t_1, h_1, c_1 \rangle, e_2 = \langle t_2, h_2, c_2 \rangle, \dots, e_K = \langle t_K, h_K, c_K \rangle)$,
and in each triple $t$ is the dependency type, $h$ is the head word (central word) and $c$ is the dependent word. The calculation proceeds as follows:
initialize a matrix $M$ of size $(J+1) \times (K+1)$ and set the cells in the first row and the first column to 0; starting from the cell in row 2, column 2, compute the cell values row by row such that:
[equation image not reproduced in the source text]
where $\mathrm{sim}(e_p, e_q)$ denotes the similarity of triples $e_p$ and $e_q$ and is calculated by:
[equation image not reproduced in the source text]
where:
[equation image not reproduced in the source text]
$\mathrm{sim}(h_p, h_q) = \mathrm{cosine}(\mathrm{vec}(h_p), \mathrm{vec}(h_q))$
$\mathrm{sim}(c_p, c_q) = \mathrm{cosine}(\mathrm{vec}(c_p), \mathrm{vec}(c_q))$
where $\mathrm{cosine}(\cdot)$ is the cosine similarity and $\mathrm{vec}(\cdot)$ is the word vector;
after the calculation is complete, the feature similarity of $C_j$ and $C_k$ is given by:
[equation image not reproduced in the source text]
Beneficial effects: a dynamic programming algorithm takes the cost of the different alignment paths into account, so the maximum feature similarity of two collocation formula instances is obtained while the order of the triples within each instance is respected.
Preferably, step S3 includes the following sub-steps:
(1) for every collocation formula instance $C_i$ in cluster community $I_i$, form a sequence of 2-tuples of adjacent triples, following the order of the triples:
$g_i = (\langle e_1, e_2 \rangle, \langle e_2, e_3 \rangle, \dots, \langle e_{k-1}, e_k \rangle)$
(2) merge all the 2-tuple sequences obtained from $I_i$ to construct a directed graph $G$, in which the nodes are triples, the direction of each arc is determined by its 2-tuple, and the connection weight of a 2-tuple is calculated as:
[equation image not reproduced in the source text]
(3) select a node $n$ with in-degree 0 and the maximum arc weight as the starting node, traverse $G$ depth-first to obtain all subgraphs, and select the subgraph $G'$ that has the highest average connection weight and contains the target structure as the syntactic pattern of the collocation formula;
(4) for any node $b$ in $G'$, obtain from $I_i$ the set $W = \{w_i\}$ of words that occur at that node in the collocation formula instances; the association strength between word $w_i$ and the formula $G'$ is represented by the P value of Fisher's exact test;
(5) $G'$, the word sets $W$ corresponding to its nodes, and their association strengths together form the collocation formula obtained from cluster community $I_i$.
Beneficial effects: the starting node is selected by weight and a weight-first strategy is used during the depth-first traversal, so the best path is extracted from the directed cyclic graph and serves as the typical syntactic pattern of the collocation formula.
To achieve the above object, according to a second aspect of the present invention, there is provided a collocation formula visualization method, including the steps of:
automatically obtaining a collocation formula using the method of the first aspect;
for each obtained collocation formula, the dependency types and the core word are taken as nodes and arranged linearly from left to right in the order in which they appear in G'; the connecting arcs between nodes are directed, each starting at the head-word node with its arrow pointing to the dependent-word node, and each arc displays its connection weight.
Beneficial effects: because a collocation formula is built on a dependency tree, it can be converted into a directed graph; the resulting collocation formula knowledge base is highly interpretable and can be used in fields such as online language education and automatic grammar correction.
To achieve the above object, according to a third aspect of the present invention, there is provided an automatic collocation formula acquisition system, including a computer-readable storage medium and a processor;
the computer-readable storage medium is used for storing executable instructions;
the processor is configured to read the executable instructions stored in the computer-readable storage medium and to execute the automatic collocation formula acquisition method of the first aspect.
Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:
Taking a specific word or a specific sentence pattern as the unit, the invention uses clustering to simulate the cognitive process by which humans learn language and thereby obtains collocation formulas. A collocation formula captures a typical semantic and communicative function in a specific language, including the syntactic pattern, the words that fill it, and the strength of association between each word and the formula. The method overcomes the limited information content of plain collocations, is highly interpretable, and can meet the needs of online language education and grammar correction.
Drawings
FIG. 1 is a flow chart of the automatic collocation formula acquisition method provided by the present invention;
FIG. 2 is the visualization result of a collocation formula provided by the present invention;
FIG. 3 is the visualization result of a specific collocation formula provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
As shown in FIG. 1, the present invention provides an automatic collocation formula acquisition method whose flow is as follows: input a target word or target syntactic pattern and a dependency tree set (the corpus); extract collocation formula instances; calculate the similarity between the instances; cluster the instances with a clustering algorithm based on community voting; generate a collocation formula for each cluster community; and finally visualize the collocation formula as a graph.
1. Input format definition
The input includes both the search target structure and the dependency tree set.
The retrieval target structure is the first input and takes one of two forms: (1) a single word, which may be a verb, a noun, an adjective or an adverb and is called the core word w, for example the Chinese adjective meaning "happy"; (2) a dependency subtree. A dependency subtree consists of several triples, each defined as <t, h, c>, where t is the dependency type, h is the head word (central word) and c is the dependent word. A dependency subtree can therefore be expressed as an ordered sequence of triples:
$K = (e_1 = \langle t_1, h_1, c_1 \rangle, e_2 = \langle t_2, h_2, c_2 \rangle, \dots, e_n = \langle t_n, h_n, c_n \rangle)$
The subtree contains the core-word wildcard, written CORE, which may stand in for $h_i$ or $c_i$. For example, the dependency subtree "auxpass (CORE, by) aspect" contains two triples, in which CORE is the core-word wildcard and acts as the head word, auxpass and aspect are the dependency types, and the dependent words are "by" (the Chinese passive marker 被, bèi) and "away".
The second input is the dependency tree set $\Gamma = \{T_j\}$. Given a corpus, if the input target structure is a single word, the corpus can be searched directly to obtain sentence instances, and $\Gamma$ is obtained by dependency parsing; if the input target structure is a dependency subtree, search terms are first extracted from the subtree, the corpus is then searched to obtain sentence instances, and, after dependency parsing, the dependency trees of the sentence instances that contain the dependency subtree are selected as $\Gamma$. Like a dependency subtree, a single dependency tree is represented as:
$T_j = (e_1 = \langle t_1, h_1, c_1 \rangle, e_2 = \langle t_2, h_2, c_2 \rangle, \dots, e_m = \langle t_m, h_m, c_m \rangle)$
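The input format described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the names `Triple`, `DependencyTree`, `pattern` and `toy_tree` are hypothetical, and the English toy sentence and its dependency labels are assumptions chosen only to show the shape of the data.

```python
from typing import NamedTuple, Tuple

CORE = "CORE"  # core-word wildcard, named as in the text


class Triple(NamedTuple):
    dep_type: str   # t: dependency type
    head: str       # h: head (central) word
    dependent: str  # c: dependent word


# A dependency tree or subtree is an ordered sequence of triples.
DependencyTree = Tuple[Triple, ...]

# Hypothetical target pattern: CORE is the head of an auxpass relation.
pattern: DependencyTree = (Triple("auxpass", CORE, "was"),)

# Hypothetical parsed sentence "the window was broken by John".
toy_tree: DependencyTree = (
    Triple("det", "window", "the"),
    Triple("nsubjpass", "broken", "window"),
    Triple("auxpass", "broken", "was"),
    Triple("prep", "broken", "by"),
    Triple("pobj", "by", "John"),
)
```

Keeping trees as tuples preserves the triple order, which the later matching and similarity steps rely on.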
2. Extraction of collocation formula instances
(1) If the input target structure is a single word, traverse the dependency tree set $\Gamma$ and analyze the dependency trees $T_j$ one by one to obtain collocation formula instances: the triples whose head word or dependent word is the core word together form a dependency subtree.
The selection condition is as follows: for each triple $e_i = \langle t_i, h_i, c_i \rangle$ of dependency tree $T_j$, if $h_i = w$ or $c_i = w$, add $e_i$ to the dependency subtree. The resulting dependency subtree is the collocation formula instance obtained from $T_j$, namely:
$C = (e_1 = \langle t_1, h_1, c_1 \rangle, e_2 = \langle t_2, h_2, c_2 \rangle, \dots, e_k = \langle t_k, h_k, c_k \rangle)$
where, for every $e_i = \langle t_i, h_i, c_i \rangle$, $h_i = w$ or $c_i = w$.
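The selection condition for a single core word can be sketched as below. The function names and the toy sentence are hypothetical; skipping trees that yield no triple is an assumption added for robustness, since every tree in the set is expected to contain the core word.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Triple(NamedTuple):
    dep_type: str   # t
    head: str       # h
    dependent: str  # c


def extract_instance_for_word(tree: Sequence[Triple], w: str) -> Tuple[Triple, ...]:
    """Keep, in order, every triple whose head or dependent is the core word w."""
    return tuple(e for e in tree if e.head == w or e.dependent == w)


def extract_instances(trees: Sequence[Sequence[Triple]], w: str) -> List[Tuple[Triple, ...]]:
    """Build the instance set from a dependency tree set."""
    instances = []
    for tree in trees:
        instance = extract_instance_for_word(tree, w)
        if instance:  # assumption: ignore trees that do not mention w
            instances.append(instance)
    return instances


# Hypothetical parse of "the girl is very happy".
tree = (
    Triple("det", "girl", "the"),
    Triple("nsubj", "happy", "girl"),
    Triple("cop", "happy", "is"),
    Triple("advmod", "happy", "very"),
)
instance = extract_instance_for_word(tree, "happy")
```

Here `instance` keeps the three triples headed by "happy" and drops the determiner triple, which does not involve the core word.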
(2) If the input target structure is a dependency tree K, traverse the dependency tree set $\Gamma$ and analyze the dependency trees $T_j$ one by one to obtain collocation formula instances.
For a dependency tree $T_j$, if every triple $e_i = \langle t_i, h_i, c_i \rangle$ in K satisfies $e_i \in T_j$ and the order of the triples in K is consistent with their order in $T_j$, a collocation formula instance can be obtained from $T_j$; otherwise none is obtained. For a dependency tree $T_j$ from which an instance can be obtained, comparing it with K determines the word matched by the wildcard CORE, denoted $w_{core}$. Let the collocation formula instance be C; examine the triples $e_i = \langle t_i, h_i, c_i \rangle$ of $T_j$ one by one and add a triple to C if it meets any of the following conditions:
(1) $e_i \in K$;
(2) $c_i = w_{core}$;
(3) $h_i = w_{core}$.
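The pattern-based extraction can be sketched as follows. This is one possible reading, under stated assumptions: the wildcard is resolved by trying each word of the tree in turn, "consistent order" is interpreted as the instantiated pattern being a subsequence of the tree, and all function names and the toy sentence are hypothetical.

```python
from typing import NamedTuple, Optional, Sequence, Tuple

CORE = "CORE"


class Triple(NamedTuple):
    dep_type: str
    head: str
    dependent: str


def instantiate(pattern: Sequence[Triple], w: str) -> Tuple[Triple, ...]:
    """Replace the CORE wildcard in the pattern with a concrete word w."""
    return tuple(
        Triple(t, w if h == CORE else h, w if c == CORE else c)
        for t, h, c in pattern
    )


def is_subsequence(needle: Sequence[Triple], haystack: Sequence[Triple]) -> bool:
    """True if needle occurs in haystack in the same relative order."""
    it = iter(haystack)
    return all(e in it for e in needle)  # 'in' consumes the iterator


def find_core(tree: Sequence[Triple], pattern: Sequence[Triple]) -> Optional[str]:
    """Return the first word that makes the pattern match the tree, if any."""
    seen = []
    for e in tree:
        for word in (e.head, e.dependent):
            if word not in seen:
                seen.append(word)
    for w in seen:
        if is_subsequence(instantiate(pattern, w), tree):
            return w
    return None


def extract_instance_for_pattern(tree, pattern):
    """Collect pattern triples plus every triple touching the matched core word."""
    w_core = find_core(tree, pattern)
    if w_core is None:
        return None  # no instance is obtained from this tree
    concrete = set(instantiate(pattern, w_core))
    return tuple(
        e for e in tree
        if e in concrete or e.dependent == w_core or e.head == w_core
    )


tree = (
    Triple("det", "window", "the"),
    Triple("nsubjpass", "broken", "window"),
    Triple("auxpass", "broken", "was"),
    Triple("prep", "broken", "by"),
    Triple("pobj", "by", "John"),
)
pattern = (Triple("auxpass", CORE, "was"),)
instance = extract_instance_for_pattern(tree, pattern)
```

In the toy sentence the wildcard resolves to "broken", so the instance keeps the three triples that involve "broken" and discards the rest.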
3. Similarity calculation of collocation formula instances
The similarity of two triples $e_1 = \langle t_1, h_1, c_1 \rangle$ and $e_2 = \langle t_2, h_2, c_2 \rangle$ is calculated as:
[equation image not reproduced in the source text]
where:
[equation image not reproduced in the source text]
$\mathrm{sim}(h_1, h_2) = \mathrm{cosine}(\mathrm{vec}(h_1), \mathrm{vec}(h_2))$
$\mathrm{sim}(c_1, c_2) = \mathrm{cosine}(\mathrm{vec}(c_1), \mathrm{vec}(c_2))$
where $\mathrm{cosine}(\cdot)$ is the cosine similarity and $\mathrm{vec}(\cdot)$ is the word vector. Word vectors can be obtained with a widely used algorithm such as word2vec.
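A sketch of the triple similarity follows. Only the cosine terms for head and dependent words are stated in the text; the way the three components are combined and the definition of the dependency-type term are not recoverable from the source, so the equal weights and the exact-match rule for dependency types below are assumptions, as are the function names and the toy two-dimensional word vectors.

```python
import math
from typing import Dict, NamedTuple, Sequence, Tuple


class Triple(NamedTuple):
    dep_type: str
    head: str
    dependent: str


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def triple_similarity(
    e1: Triple,
    e2: Triple,
    vec: Dict[str, Sequence[float]],
    weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3),  # assumed
) -> float:
    """Weighted mix of dependency-type, head-word and dependent-word similarity."""
    w_t, w_h, w_c = weights
    sim_t = 1.0 if e1.dep_type == e2.dep_type else 0.0  # assumed exact match
    sim_h = cosine(vec[e1.head], vec[e2.head])
    sim_c = cosine(vec[e1.dependent], vec[e2.dependent])
    return w_t * sim_t + w_h * sim_h + w_c * sim_c


# Toy vectors standing in for word2vec output.
toy_vec = {
    "happy": (1.0, 0.0),
    "glad": (0.8, 0.6),
    "she": (0.0, 1.0),
    "he": (0.1, 0.9),
}
```

In practice `vec` would be backed by trained embeddings; the dictionary lookup is only a stand-in for vec(·).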
Given two collocation formula instances $C_1$ and $C_2$:
$C_1 = (e_1 = \langle t_1, h_1, c_1 \rangle, e_2 = \langle t_2, h_2, c_2 \rangle, \dots, e_k = \langle t_k, h_k, c_k \rangle)$
$C_2 = (e_1 = \langle t_1, h_1, c_1 \rangle, e_2 = \langle t_2, h_2, c_2 \rangle, \dots, e_l = \langle t_l, h_l, c_l \rangle)$
the feature similarity of $C_1$ and $C_2$ is calculated as follows: initialize a matrix $M$ of size $(k+1) \times (l+1)$ and set the cells in the first row and the first column to 0; starting from the cell in row 2, column 2, compute the cell values row by row such that:
[equation image not reproduced in the source text]
where $\mathrm{sim}(e_i, e_j)$ is the triple similarity defined above. After the calculation is complete, the feature similarity of $C_1$ and $C_2$ is given by:
[equation image not reproduced in the source text]
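The matrix computation can be sketched with a dynamic program. The recurrence itself is not recoverable from the source, so this sketch assumes a longest-common-subsequence style rule in which each cell takes the best of skipping a triple on either side or aligning the two current triples, and it assumes the result is read from the bottom-right cell. The names `feature_similarity` and `exact_match` are hypothetical; `exact_match` stands in for the triple similarity defined above.

```python
from typing import Callable, Hashable, Sequence


def exact_match(a: Hashable, b: Hashable) -> float:
    """Placeholder similarity: 1.0 for identical items, else 0.0."""
    return 1.0 if a == b else 0.0


def feature_similarity(
    c1: Sequence[Hashable],
    c2: Sequence[Hashable],
    sim: Callable[[Hashable, Hashable], float] = exact_match,
) -> float:
    """Order-preserving alignment score of two instances (assumed recurrence)."""
    k, l = len(c1), len(c2)
    # (k+1) x (l+1) matrix; first row and first column stay 0.
    m = [[0.0] * (l + 1) for _ in range(k + 1)]
    for p in range(1, k + 1):
        for q in range(1, l + 1):
            m[p][q] = max(
                m[p - 1][q],                                   # skip c1[p-1]
                m[p][q - 1],                                   # skip c2[q-1]
                m[p - 1][q - 1] + sim(c1[p - 1], c2[q - 1]),   # align the pair
            )
    return m[k][l]


score_same_order = feature_similarity(("a", "b", "c"), ("a", "c"))
score_swapped = feature_similarity(("a", "b"), ("b", "a"))
```

Because alignments cannot cross, two instances with the same triples in a different order score lower than two with the same order, which is the order sensitivity the text asks for.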
Based on the feature similarity of $C_1$ and $C_2$, the ordered similarity of $C_1$ relative to $C_2$ is calculated as:
[equation image not reproduced in the source text]
In this formula the lowest similarity of two collocation formula instances is set to 0.05, $\mathrm{len}(C)$ is the number of triples contained in $C$, the feature similarity of $C_1$ and $C_2$ is as defined above, and $\alpha$ and $\beta$ are the weights given in the similarity calculation to the features that distinguish $C_1$ and $C_2$ respectively, with $\alpha + \beta = 1$. The ordered similarity of $C_1$ relative to $C_2$ differs from the ordered similarity of $C_2$ relative to $C_1$.
4. Clustering method based on collocation formula community
Based on the ordered similarity of $C_1$ relative to $C_2$, the distance of $C_1$ relative to $C_2$ is calculated as:
[equation image not reproduced in the source text]
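The ordered similarity and the distance derived from it can be illustrated with a Tversky-style ratio. The exact formulas are not recoverable from the source, so everything below is an assumption consistent with the surrounding definitions: the shared part is the feature similarity F, the distinctive parts are len(C) minus F, the result is floored at 0.05, and the distance is taken as one minus the similarity. The function names and the default value of alpha are hypothetical.

```python
def ordered_similarity(
    feature_sim: float,
    len_a: int,
    len_b: int,
    alpha: float = 0.8,   # assumed weight for features only in the first instance
    floor: float = 0.05,  # lowest similarity, as stated in the text
) -> float:
    """Tversky-style similarity of instance A relative to instance B (assumed form)."""
    beta = 1.0 - alpha  # the text requires alpha + beta = 1
    denom = (
        feature_sim
        + alpha * (len_a - feature_sim)
        + beta * (len_b - feature_sim)
    )
    if denom <= 0:
        return floor
    return max(floor, feature_sim / denom)


def ordered_distance(feature_sim: float, len_a: int, len_b: int, alpha: float = 0.8) -> float:
    """Assumed conversion: distance = 1 - ordered similarity."""
    return 1.0 - ordered_similarity(feature_sim, len_a, len_b, alpha)


# A two-triple instance fully contained in a four-triple instance.
s_small_to_large = ordered_similarity(2.0, 2, 4)
s_large_to_small = ordered_similarity(2.0, 4, 2)
```

With alpha greater than beta, the smaller instance is judged more similar to the larger one than the reverse, which is how an asymmetric measure can express containment.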
Given a dependency tree set $\Gamma$, collocation formula instances are obtained one by one, yielding the collocation formula instance set $\Gamma' = \{C_i\}$.
The maximum clustering distance of $C_i$ is calculated as follows:
(1) obtain the set of distances of $C_i$ relative to each collocation formula instance in $\Gamma'$,
$$D = \{\, d(C_i, C_1),\ d(C_i, C_2),\ \dots,\ d(C_i, C_n) \,\}$$
where $n$ is the number of collocation formula instances contained in $\Gamma'$;
(2) plot a histogram of $D$, with distance on the horizontal axis over the range [0, 1] and, on the vertical axis, the number of collocation formula instances in each distance interval. It is assumed that at most 15% of the collocation formula instances in $\Gamma'$ can form a collocation formula together with $C_i$, so the distance value at the 15th percentile is defined as $p_1$, and the first distance value after $p_1$ whose count is 0 as $p_2$;
(3) obtain the standard deviation $\sigma$ of $D$;
(4) the maximum clustering distance of $C_i$ is defined as:
[equation image not reproduced in the source text]
where $\delta$ is a multiplicative parameter with $1 \le \delta \le 5$.
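The quantities that feed the maximum clustering distance can be computed as sketched below. The final formula combining them is not recoverable from the source, so the sketch stops at $p_1$, $p_2$ and $\sigma$. The bin count of 20, the nearest-rank percentile, the use of the left bin edge for $p_2$, the population standard deviation and the function name are all assumptions.

```python
import math
import statistics
from typing import List, Optional, Sequence, Tuple


def distance_profile(
    distances: Sequence[float],
    bins: int = 20,      # assumed histogram resolution over [0, 1]
    percentile: int = 15,
) -> Tuple[float, Optional[float], float]:
    """Return (p1, p2, sigma) for one instance's distance set D."""
    xs: List[float] = sorted(distances)
    # Nearest-rank percentile (assumed convention).
    rank = max(1, math.ceil(percentile / 100 * len(xs)))
    p1 = xs[rank - 1]

    # Histogram counts over [0, 1].
    counts = [0] * bins
    for x in xs:
        counts[min(int(x * bins), bins - 1)] += 1

    # p2: left edge of the first empty bin after the bin holding p1.
    start = min(int(p1 * bins), bins - 1)
    p2 = None
    for b in range(start + 1, bins):
        if counts[b] == 0:
            p2 = b / bins
            break

    sigma = statistics.pstdev(xs)
    return p1, p2, sigma


toy_distances = [0.02, 0.03, 0.04, 0.52, 0.55, 0.58, 0.90, 0.92, 0.95, 0.97]
p1, p2, sigma = distance_profile(toy_distances)
```

For the toy data the three closest instances sit in the first bin, so $p_1$ falls inside that group and $p_2$ marks the gap that follows it.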
In step S2, given a collocation formula instance set $\Gamma' = \{C_i\}$ and the corresponding maximum clustering distances $D = \{\varepsilon_i\}$, $\Gamma'$ is clustered as follows:
(C1) examine each $C_i$ in turn: initialize the community containing $C_i$ as $I_i = \{C_i\}$, set the access value of $C_i$ to True, and go to (C2);
(C2) obtain the neighborhood of $C_i$,
$$N = \{\, C_j \mid d(C_j, C_i) \le r \,\}$$
where $d(C_j, C_i)$ is the distance of $C_j$ relative to $C_i$ and $r$ is the maximum search distance, and go to (C3);
(C3) examine each $C_j$ in $N$ one by one: if the access value of $C_j$ is False and, for every instance $C_k$ in community $I_i$,
$$d(C_k, C_j) \le \varepsilon_j$$
where $d(C_k, C_j)$ is the distance of $C_k$ relative to $C_j$ and $\varepsilon_j$ is the maximum clustering distance of $C_j$, then put $C_j$ into community $I_i$, set the access value of $C_j$ to True, obtain the neighborhood of $C_j$,
$$N' = \{\, C_m \mid d(C_m, C_j) \le r \,\}$$
and update $N$ so that $N = N'$.
The present embodiment sets the maximum search distance $r$ to 0.6.
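The clustering procedure can be sketched as below. Several points are interpretations rather than statements from the text: already visited instances are skipped when choosing a new seed, the neighbourhood update is implemented by appending the new neighbours to a work queue, and the non-strict comparison is assumed. The function name and the one-dimensional toy data with a symmetric distance are hypothetical.

```python
from collections import deque
from typing import Callable, List, Sequence


def cluster_instances(
    n: int,
    dist: Callable[[int, int], float],  # dist(a, b): distance of a relative to b
    eps: Sequence[float],               # maximum clustering distance per instance
    r: float = 0.6,                     # maximum search distance from the embodiment
) -> List[List[int]]:
    """Group instance indices 0..n-1 into communities (assumed reading of C1-C3)."""
    visited = [False] * n
    communities: List[List[int]] = []
    for i in range(n):
        if visited[i]:
            continue  # assumption: a visited instance does not seed a new community
        community = [i]
        visited[i] = True
        queue = deque(j for j in range(n) if j != i and dist(j, i) <= r)
        while queue:
            j = queue.popleft()
            if visited[j]:
                continue
            # Every current member must lie within C_j's own clustering distance.
            if all(dist(k, j) <= eps[j] for k in community):
                community.append(j)
                visited[j] = True
                # Assumed update: extend the search with C_j's neighbourhood.
                queue.extend(
                    m for m in range(n) if not visited[m] and dist(m, j) <= r
                )
        communities.append(community)
    return communities


points = [0.0, 0.1, 0.15, 0.9, 0.95]
communities = cluster_instances(
    len(points),
    dist=lambda a, b: abs(points[a] - points[b]),
    eps=[0.2] * len(points),
)
```

The membership test is the "community voting" element: a candidate joins only if all existing members agree, measured against the candidate's own threshold.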
5. Generation of collocation formulas
For each cluster community $I_i$, the collocation formula is obtained as follows:
(1) for every collocation formula instance $C_i$ in $I_i$, form a sequence of 2-tuples of adjacent triples, following the order of the triples:
$g_i = (\langle e_1, e_2 \rangle, \langle e_2, e_3 \rangle, \dots, \langle e_{k-1}, e_k \rangle)$
Each $\langle e_i, e_{i+1} \rangle$ is referred to below as a 2-tuple.
(2) Merge all the 2-tuple sequences obtained from $I_i$ to construct a directed graph $G$, in which the nodes are triples, the direction of each arc is determined by its 2-tuple, and the connection weight of a 2-tuple is calculated as:
[equation image not reproduced in the source text]
(3) Select a node $n$ with in-degree 0 and the maximum arc weight, traverse $G$ depth-first from $n$ as the starting node to obtain all subgraphs, and select the subgraph $G'$ that has the highest average connection weight and contains the target structure as the syntactic pattern of the collocation formula. The average connection weight is the sum of the weights of all connections on the path from the starting node to the destination node, divided by the number of connections on the path.
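Steps (1) to (3) can be sketched as follows. The connection weight formula is not recoverable from the source, so this sketch assumes the weight of an arc is simply how often the 2-tuple occurs in the community. Treating every prefix path that contains the target as a candidate subgraph is also an interpretation. Nodes are plain strings here for brevity (in the method they are triples), and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Optional, Sequence, Tuple

Node = Hashable


def build_graph(instances: Sequence[Sequence[Node]]) -> Dict[Tuple[Node, Node], int]:
    """Merge the 2-tuple sequences of all instances; weight = occurrence count (assumed)."""
    weights: Counter = Counter()
    for instance in instances:
        for a, b in zip(instance, instance[1:]):
            weights[(a, b)] += 1
    return dict(weights)


def best_path(
    weights: Dict[Tuple[Node, Node], int], target: Node
) -> Optional[Tuple[Tuple[Node, ...], float]]:
    """Depth-first search for the path with the highest average arc weight."""
    successors = defaultdict(list)
    in_degree: Counter = Counter()
    for (a, b), w in weights.items():
        successors[a].append((b, w))
        in_degree[b] += 1
    starts = [a for a in successors if in_degree[a] == 0]
    if not starts:
        return None
    # Starting node: in-degree 0 and the heaviest outgoing arc.
    start = max(starts, key=lambda a: max(w for _, w in successors[a]))

    best: Optional[Tuple[Tuple[Node, ...], float]] = None

    def dfs(node: Node, path: List[Node], total: float) -> None:
        nonlocal best
        if len(path) > 1 and target in path:
            average = total / (len(path) - 1)
            if best is None or average > best[1]:
                best = (tuple(path), average)
        for nxt, w in successors.get(node, ()):
            if nxt not in path:  # avoid revisiting nodes on cyclic graphs
                dfs(nxt, path + [nxt], total + w)

    dfs(start, [start], 0.0)
    return best


toy_instances = [
    ("nsubj", "aux", "CORE"),
    ("nsubj", "aux", "CORE", "dobj"),
    ("adv", "aux", "CORE"),
]
graph = build_graph(toy_instances)
result = best_path(graph, target="CORE")
```

In the toy community the rarely seen trailing "dobj" lowers the average, so the shorter path is preferred as the representative pattern.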
(4) For any node $b$ in $G'$, obtain from $I_i$ the set $W = \{w_i\}$ of words that occur at that node in the collocation formula instances. The association strength between word $w_i$ and the formula $G'$ is expressed as the P value of Fisher's exact test, calculated as follows:
let the number of occurrences of $w_i$ at the node be $f_i^1$, the frequency of $w_i$ in the given corpus be $f_i^2$, the number of collocation formula instances contained in $I_i$ be $n$, the number of sentences contained in the given corpus be $N$, and the average length of a collocation formula instance be 20; with $a = f_i^1$, $b = n \times (20 - f_i^1)$, $c = f_i^2 - f_i^1$ and $d = N - f_i^1$, P can be calculated as:
[equation image not reproduced in the source text]
(5) $G'$, the word sets corresponding to its nodes, and their association strengths together form the collocation formula obtained from $I_i$.
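The association strength can be illustrated with a standard Fisher exact test on the 2x2 table (a, b; c, d). The source formula is not recoverable, so the choice of a one-sided (right-tail) P value is an assumption, as is the function name; the four counts are passed in directly, as defined in the text.

```python
from math import comb


def fisher_right_tail(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher exact P value: probability of a table at least as
    extreme as (a, b; c, d) in the direction of a larger a, with margins fixed."""
    row1 = a + b
    col1 = a + c
    total = a + b + c + d
    denominator = comb(total, col1)
    upper = min(row1, col1)
    numerator = sum(
        comb(row1, x) * comb(total - row1, col1 - x)  # hypergeometric term
        for x in range(a, upper + 1)
    )
    return numerator / denominator


# Classic 2x2 example with all margins equal to 4.
p_value = fisher_right_tail(3, 1, 1, 3)
```

A smaller P value indicates that the word occurs at the node more often than its corpus frequency alone would predict, that is, a stronger association with the formula.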
6. Visualization of collocation formulas
For the following collocation formula:
(<D1,X1,CORE-WORD>,
<D2,CORE-WORD,X2>,
<D3,CORE-WORD,X3>,
X1={(W_1,V_1,F_1)},
X2={(W_2,V_2,F_2),(W_3,V_3,F_3)},
X3={(W_4,V_4,F_4)},
CORE-WORD={(W_5,V_5,F_5),(W_6,V_6,F_6),(W_7,V_7,F_7)})
where D1, D2 and D3 are dependency types, and X1, X2 and X3 are syntactic slot placeholders, each pointing to a set of word information structures; each word information structure contains a part of speech (W), an association strength (V) and a frequency (F), the frequency being the number of times the word occurs in the community.
The visualization rules are as follows: the dependency types and CORE-WORD are taken as nodes and arranged linearly from left to right in the order in which they appear in G'. The connecting arcs between nodes are directed: an arc starts at the head-word node (the first placeholder in the triple) and its arrow points to the dependent-word node (the second placeholder in the triple), and each arc displays its connection weight. The visualization result is shown in FIG. 2.
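One way to realize these rules is to emit a Graphviz DOT description, sketched below. Using DOT is an assumption (the text does not name a rendering tool), as are the function name and the example weights. A slot is labelled by its dependency type, and the core word keeps the label CORE-WORD.

```python
from typing import List, Sequence, Tuple

CORE_WORD = "CORE-WORD"


def to_dot(
    triples: Sequence[Tuple[str, str, str]],  # (dependency type, head, dependent)
    weights: Sequence[float],                 # one connection weight per triple
) -> str:
    """Render a collocation formula as a left-to-right directed graph in DOT."""

    def label(dep_type: str, word: str) -> str:
        # Slots are shown by dependency type; the core word by its own label.
        return CORE_WORD if word == CORE_WORD else dep_type

    lines: List[str] = ["digraph collocation {", "  rankdir=LR;"]
    for (dep_type, head, dependent), weight in zip(triples, weights):
        source = label(dep_type, head)       # arc starts at the head-word node
        target = label(dep_type, dependent)  # arrow points to the dependent node
        lines.append(f'  "{source}" -> "{target}" [label="{weight}"];')
    lines.append("}")
    return "\n".join(lines)


dot_text = to_dot(
    [("D1", "X1", CORE_WORD), ("D2", CORE_WORD, "X2"), ("D3", CORE_WORD, "X3")],
    [0.8, 0.5, 0.3],  # hypothetical connection weights
)
```

The resulting text can be passed to any DOT renderer; `rankdir=LR` gives the left-to-right layout described above.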
Specifically, using the dependency subtree "auxpass (CORE, 被)" as the input pattern and the People's Daily corpus as the corpus, a plurality of collocation formulas can be obtained, one of which is shown in Fig. 3. This collocation formula captures a typical usage pattern of the Chinese passive marker 被 ("bei") and is highly interpretable: the ROOT node indicates that the collocation formula cannot be embedded in other syntactic constituents; the words given in one NSUBJ node, together with their association strengths, indicate that the semantic type of the subject constituent is human; the words in the other NSUBJ node are likewise predominantly human; and the CORE-WORD node is a verb slot which, besides verbs, also contains some idioms.
The technology overcomes the shortcomings of the two existing types of techniques for automatically acquiring language knowledge. The first type comes from corpus linguistics; its knowledge mainly takes the form of collocation information between two words, and this small amount of information cannot meet the needs of online language education, automatic grammar correction, and similar applications. The second type derives from deep-learning-based natural language processing; its knowledge exists in the form of a very large number of parameters, has low interpretability, and cannot satisfy the need of language education and grammar-correction feedback for explicit language rules. The invention takes specific words or specific sentence patterns as units and adopts a clustering method to simulate the cognitive process of human language learning, thereby obtaining collocation formulas. As shown in Fig. 3, a collocation formula describes a typical pattern of semantic interaction in a specific language (Chinese in the figure), including the syntactic pattern, the words, and the association strength between the words and the collocation formula. The method thus overcomes the insufficient information content of two-word collocations while remaining highly interpretable, and can meet the needs of online language education and grammar correction.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method for automatically acquiring collocation formulas, characterized in that the method comprises the following steps:
S1, extracting a collocation formula instance set of a target word or a target syntactic pattern from a corpus, wherein the target syntactic pattern is in the form of a dependency tree;
S2, clustering the collocation formula instance set into a plurality of communities;
S3, acquiring the collocation formula corresponding to each clustered community.
2. The method of claim 1, wherein the collocation formula instance set of the target word is extracted from the corpus as follows:
(A1) searching the corpus to acquire all sentence instances containing the target word, converting each acquired sentence instance into a dependency tree through dependency parsing, and forming a dependency tree set from all the dependency trees;
(A2) for each dependency tree in the dependency tree set, constructing a collocation formula instance as follows, the instances together forming the collocation formula instance set:
initializing a dependency subtree to be empty, traversing each triple in the dependency tree, adding to the dependency subtree every triple whose head word or dependent word is the target word, and, after the traversal is finished, taking the dependency subtree as the collocation formula instance corresponding to the dependency tree.
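Step (A2) can be sketched in a few lines. The sketch assumes that a dependency tree has already been produced by some parser and is represented as a list of (dependency type, head word, dependent word) triples; the function name `extract_instance` and the English example sentence are hypothetical.

```python
def extract_instance(dependency_tree, target_word):
    """Keep every triple whose head word or dependent word is the target word.

    dependency_tree: list of (dep_type, head, dependent) triples in sentence order.
    """
    subtree = []  # the dependency subtree starts out empty
    for dep_type, head, dependent in dependency_tree:
        if head == target_word or dependent == target_word:
            subtree.append((dep_type, head, dependent))
    return subtree

tree = [("nsubj", "eat", "cat"), ("det", "cat", "the"), ("dobj", "eat", "fish")]
print(extract_instance(tree, "eat"))
```

Applying this function to every tree in the dependency tree set yields the collocation formula instance set for the target word.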
3. The method of claim 1, wherein the collocation formula instance set of the target syntactic pattern is extracted from the corpus as follows:
(B1) extracting search words from the target syntactic pattern and constructing a search word set;
(B2) searching the corpus to acquire the sentence instances that contain all search words in the search word set, and converting each acquired sentence instance into a dependency tree through dependency parsing to form a dependency tree set;
(B3) for each dependency tree in the dependency tree set, judging whether all triples of the target dependency tree are contained in the dependency tree in the same order as in the target dependency tree; if so, proceeding to step (B4) to construct a collocation formula instance for the set, and otherwise discarding the dependency tree;
(B4) comparing the dependency tree with the target dependency tree to determine the matching items of the wildcards;
(B5) initializing a dependency subtree to be empty, traversing each triple in the dependency tree, adding to the dependency subtree every triple that satisfies any one of the following conditions, and, after the traversal is finished, taking the dependency subtree as the collocation formula instance corresponding to the dependency tree:
1) the triple exists in the target dependency tree;
2) the dependent word of the triple is a matching item;
3) the head word of the triple is a matching item.
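Steps (B3) to (B5) can be sketched as follows. The representation of the target dependency tree as a list of triples, the wildcard name "CORE", the greedy in-order matching strategy, and all function names are assumptions made for illustration; the claim does not prescribe a particular matching algorithm.

```python
WILDCARDS = {"CORE"}  # hypothetical wildcard placeholder names

def _unify(pattern_word, tree_word, bindings):
    """Match one word slot, recording or checking a wildcard binding."""
    if pattern_word in WILDCARDS:
        if pattern_word in bindings:
            return bindings[pattern_word] == tree_word
        bindings[pattern_word] = tree_word
        return True
    return pattern_word == tree_word

def match_pattern(tree, pattern):
    """Greedy in-order match of the pattern triples against the tree triples.

    Returns the wildcard bindings (B4), or None if the pattern is not contained (B3).
    """
    bindings, pos = {}, 0
    for p_type, p_head, p_dep in pattern:
        while pos < len(tree):
            t_type, t_head, t_dep = tree[pos]
            pos += 1
            trial = dict(bindings)
            if (t_type == p_type and _unify(p_head, t_head, trial)
                    and _unify(p_dep, t_dep, trial)):
                bindings = trial
                break
        else:
            return None  # ran out of tree triples before matching this pattern triple
    return bindings

def extract_pattern_instance(tree, pattern):
    """Step (B5): keep triples in the target tree or touching a matching item."""
    bindings = match_pattern(tree, pattern)
    if bindings is None:
        return None
    matched = set(bindings.values())
    concrete = {(t, bindings.get(h, h), bindings.get(d, d)) for t, h, d in pattern}
    return [tr for tr in tree
            if tr in concrete or tr[1] in matched or tr[2] in matched]

tree = [("nsubjpass", "hit", "he"), ("auxpass", "hit", "bei"),
        ("root", "ROOT", "hit"), ("advmod", "run", "fast")]
print(extract_pattern_instance(tree, [("auxpass", "CORE", "bei")]))
```

In the example, the wildcard is bound to "hit", so the first three triples are kept and the unrelated fourth triple is dropped.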
4. The method according to any one of claims 1 to 3, wherein in step S2, given a collocation formula instance set Γ' = {C_i} and the corresponding set of maximum clustering distances D = {ε_i}, Γ' is clustered as follows:
(C1) examining each C_i in sequence: initializing the community containing C_i as I_i = {C_i}, setting the access value of C_i to True, and going to step (C2);
(C2) obtaining the neighbor set of C_i
Figure FDA0002812319820000021
where
Figure FDA0002812319820000022
is the distance of C_j relative to C_i and r is the maximum search distance, and going to step (C3);
(C3) examining each C_j in N one by one: if the access value of C_j is False, and if for every collocation formula instance C_k in community I_i
Figure FDA0002812319820000023
holds, where
Figure FDA0002812319820000024
is the distance of C_k relative to C_j and ε_j is the maximum clustering distance of C_j, then putting C_j into community I_i, setting the access value of C_j to True, obtaining the neighbor set of C_j
Figure FDA0002812319820000025
and updating N such that N = N'.
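The clustering procedure of steps (C1) to (C3) can be sketched as below. Because the formulas in this claim survive only as image placeholders, several details are assumptions and are marked as such: the neighbor set is taken to be all instances within distance r, the acceptance condition is taken to be that the distance of every community member relative to C_j is at most ε_j, newly found neighbors are appended to the candidate queue, and instances already assigned to a community are not used as new seeds. The function name `cluster` is hypothetical.

```python
from collections import deque

def cluster(n, eps, r, dist):
    """Cluster instances 0..n-1 into communities.

    dist(a, b): distance of instance a relative to instance b.
    eps[j]: maximum clustering distance of instance j; r: maximum search distance.
    """
    def neighbors(i):
        # Assumed definition: all other instances within the search distance r.
        return [j for j in range(n) if j != i and dist(j, i) <= r]

    visited = [False] * n  # the "access value" of each instance
    communities = []
    for i in range(n):
        if visited[i]:
            continue  # assumption: skip instances already placed in a community
        community = [i]
        visited[i] = True
        queue = deque(neighbors(i))
        while queue:
            j = queue.popleft()
            if visited[j]:
                continue
            # Assumed condition: every member is within eps[j] of the candidate.
            if all(dist(k, j) <= eps[j] for k in community):
                community.append(j)
                visited[j] = True
                queue.extend(neighbors(j))
        communities.append(community)
    return communities

points = [0.0, 0.1, 0.2, 5.0, 5.1]  # made-up one-dimensional stand-ins for instances
print(cluster(len(points), [0.3] * 5, 1.0, lambda a, b: abs(points[a] - points[b])))
```

Unlike a purely density-based expansion, a candidate joins a community only if it is close to all current members, which keeps each community compact.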
5. The method of claim 4, wherein the maximum clustering distance of a collocation formula instance C_i is calculated as follows:
(D1) obtaining the set of distances of C_i relative to each collocation formula instance in the collocation formula instance set Γ'
Figure FDA0002812319820000026
where n is the number of collocation formula instances contained in Γ';
(D2) making a histogram of D with distance as the horizontal axis and the number of collocation formula instances in each distance interval as the vertical axis, defining the distance value corresponding to the 15th percentile as p_1, and defining the first distance value after p_1 whose count is 0 as p_2;
(D3) obtaining the mean square error σ of D;
(D4) calculating the maximum clustering distance of C_i as follows:
Figure FDA0002812319820000031
where δ is a multiplier parameter with 1 ≤ δ ≤ 5.
6. The method according to claim 4 or 5, wherein the distance of a collocation formula instance C_j relative to a collocation formula instance C_k
Figure FDA0002812319820000032
is calculated as follows:
(E1) based on the similarity of the triples in C_j and C_k, calculating the feature similarity of C_j and C_k
Figure FDA0002812319820000033
(E2) based on the feature similarity of C_j and C_k
Figure FDA0002812319820000034
calculating the ordered similarity of C_j relative to C_k
Figure FDA0002812319820000035
Figure FDA0002812319820000036
where le(C) is the number of triples contained in C, α and β respectively represent the weights given in the similarity calculation to the features in which C_j and C_k differ, and α + β = 1;
(E3) based on the ordered similarity of C_j relative to C_k
Figure FDA0002812319820000037
calculating the distance of C_j relative to C_k
Figure FDA0002812319820000038
Figure FDA0002812319820000039
7. The method of claim 6, wherein the feature similarity of C_j and C_k
Figure FDA00028123198200000311
is calculated based on the similarity of triples, where C_j = (e_1 = <t_1, h_1, c_1>, e_2 = <t_2, h_2, c_2>, ..., e_J = <t_J, h_J, c_J>) and C_k = (e_1 = <t_1, h_1, c_1>, e_2 = <t_2, h_2, c_2>, ..., e_K = <t_K, h_K, c_K>); in each triple, t is the dependency type, h is the head word, and c is the dependent word; the calculation process is as follows:
initializing a matrix M of size (J+1) × (K+1) and setting the cell values in the first row and the first column to 0; starting from the cell in row 2, column 2, computing the cell values row by row such that:
Figure FDA00028123198200000310
where sim(e_p, e_q) represents the similarity of triples e_p and e_q and is calculated by the following formula:
Figure FDA0002812319820000041
where:
Figure FDA0002812319820000042
sim(h_p, h_q) = cosine(vec(h_p), vec(h_q))
sim(c_p, c_q) = cosine(vec(c_p), vec(c_q))
where cosine(·) is the cosine similarity and vec(·) is the word vector;
after the calculation is completed, the feature similarity of C_j and C_k is
Figure FDA0002812319820000043
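The matrix computation in this claim has the shape of a dynamic-programming sequence alignment. Since the recurrence and the final read-out survive only as image placeholders, the sketch below substitutes an assumed recurrence in the style of a weighted longest common subsequence and reads the result from the last cell; both choices are assumptions, not the claimed formulas. The triple similarity is passed in as a function, and `cosine` shows the cosine similarity used for word vectors. All names are hypothetical.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def feature_similarity(cj, ck, sim):
    """Fill a (J+1) x (K+1) matrix row by row and return its last cell.

    cj, ck: sequences of triples; sim(e_p, e_q): similarity of two triples.
    """
    J, K = len(cj), len(ck)
    M = [[0.0] * (K + 1) for _ in range(J + 1)]  # first row and column stay 0
    for p in range(1, J + 1):
        for q in range(1, K + 1):
            # Assumed recurrence: skip a triple on either side, or align the pair.
            M[p][q] = max(M[p - 1][q], M[p][q - 1],
                          M[p - 1][q - 1] + sim(cj[p - 1], ck[q - 1]))
    return M[J][K]

exact = lambda e_p, e_q: 1.0 if e_p == e_q else 0.0  # toy triple similarity
print(feature_similarity(["a", "b", "c"], ["a", "c"], exact))
```

With the toy exact-match similarity, the result equals the length of the longest common subsequence of the two triple sequences, which illustrates why the measure is sensitive to triple order.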
8. The method according to any one of claims 1 to 7, wherein step S3 comprises the following sub-steps:
(F1) converting every collocation formula instance C_i in the clustered community I_i, according to the order of its triples, into a sequence of pairs of triples, as shown in the following formula:
g_i = (<e_1, e_2>, <e_2, e_3>, ..., <e_{k-1}, e_k>)
(F2) merging all pair sequences obtained from I_i to construct a directed graph G, in which the nodes are triples and the direction of each arc is determined by the pair, the connection weight of a pair being calculated as
Figure FDA0002812319820000044
(F3) selecting a node n with in-degree 0 and the maximum arc weight as the initial node, traversing G depth-first to obtain all subgraphs, and selecting the subgraph G' that has the highest average connection weight and contains the target structure as the syntactic pattern of the collocation formula;
(F4) for any node b in G', obtaining from I_i the set W = {w_i} of words that occur at that node in the collocation formula instances, the association strength between a word w_i and the collocation formula G' being represented by the P value of Fisher's exact test
Figure FDA0002812319820000045
(F5) G', the word set W corresponding to the nodes in the graph, and the association strengths
Figure FDA0002812319820000046
together constitute the collocation formula obtained from the clustered community I_i.
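Sub-steps (F1) and (F2), together with the search for candidate start nodes in (F3), can be sketched as follows. The connection-weight formula survives only as an image placeholder, so the relative frequency of each pair is used here as a stand-in and is an assumption; the function names are hypothetical, and short strings stand in for triples.

```python
from collections import Counter

def pair_sequence(instance):
    """(F1): turn an ordered list of triples into consecutive pairs."""
    return list(zip(instance, instance[1:]))

def build_graph(community):
    """(F2): merge all pair sequences into weighted directed arcs.

    Returns {(source, target): weight}; the weight is the pair's relative
    frequency, an assumed stand-in for the connection-weight formula.
    """
    counts = Counter()
    for instance in community:
        counts.update(pair_sequence(instance))
    total = sum(counts.values())
    return {pair: count / total for pair, count in counts.items()}

def start_candidates(graph):
    """(F3), first part: nodes with in-degree 0."""
    sources = {src for src, _ in graph}
    targets = {dst for _, dst in graph}
    return sources - targets

community = [["A", "B", "C"], ["A", "B"], ["B", "C"]]  # made-up instances
graph = build_graph(community)
print(graph)
print(start_candidates(graph))
```

A depth-first traversal from the chosen start node would then enumerate the candidate subgraphs, from which the one with the highest average connection weight that contains the target structure is kept.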
9. A method for visualizing collocation formulas, characterized by comprising the following steps:
automatically acquiring collocation formulas using the method according to any one of claims 1 to 8;
for each acquired collocation formula, taking the dependency types and the core word as nodes and arranging the nodes linearly from left to right in the order in which they appear in G', wherein the connecting arcs between nodes are directed, each arc starts at the head word node with its arrow pointing to the dependent word node, and each arc displays its connection weight.
10. A system for automatically acquiring collocation formulas, comprising a computer-readable storage medium and a processor;
the computer-readable storage medium is configured to store executable instructions;
the processor is configured to read the executable instructions stored in the computer-readable storage medium and to execute the method for automatically acquiring collocation formulas according to any one of claims 1 to 8.
CN202011413473.6A 2020-12-02 2020-12-02 Collocation configuration type automatic acquisition method and system and visualization method Pending CN112395871A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011413473.6A CN112395871A (en) 2020-12-02 2020-12-02 Collocation configuration type automatic acquisition method and system and visualization method

Publications (1)

Publication Number Publication Date
CN112395871A true CN112395871A (en) 2021-02-23

Family

ID=74604428

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011413473.6A Pending CN112395871A (en) 2020-12-02 2020-12-02 Collocation configuration type automatic acquisition method and system and visualization method

Country Status (1)

Country Link
CN (1) CN112395871A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114201618A (en) * 2022-02-17 2022-03-18 药渡经纬信息科技(北京)有限公司 Drug development literature visualization interpretation method and system
CN116227497A (en) * 2022-11-29 2023-06-06 广东外语外贸大学 Sentence structure analysis method and device based on deep neural network
CN116227497B (en) * 2022-11-29 2023-09-26 广东外语外贸大学 Sentence structure analysis method and device based on deep neural network
CN116562278A (en) * 2023-03-02 2023-08-08 华中科技大学 Word similarity detection method and system
CN116562278B (en) * 2023-03-02 2024-05-14 华中科技大学 Word similarity detection method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination