CN112395871A - Collocation configuration type automatic acquisition method and system and visualization method - Google Patents
- Publication number
- CN112395871A (application CN202011413473.6A)
- Authority
- CN
- China
- Prior art keywords
- collocation
- dependency
- formula
- word
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/253—Grammatical analysis; Style critique
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
Abstract
The invention discloses a method and a system for automatically acquiring collocation formulas, and a visualization method, and belongs to the technical field of natural language processing. The automatic acquisition method comprises the following steps: extracting a collocation formula instance set of a target word or a target syntactic pattern from a corpus, wherein the target syntactic pattern takes the form of a dependency tree; clustering the collocation formula instance set into a plurality of communities; and, for each cluster community, acquiring the collocation formula corresponding to that community. Taking a specific word or a specific sentence pattern as the unit, the invention uses a clustering method to simulate the cognitive regularities of human language learning and thereby obtains collocation formulas. A collocation formula captures a typical semantic and communicative function of a specific language, including the syntactic pattern, the words, and the strength of association between the words and the collocation formula. The method overcomes the insufficient information content of plain collocations, is highly interpretable, and can meet the needs of online language education and grammar correction.
Description
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a collocation formula automatic acquisition method and system and a visualization method.
Background
Information applications based on natural language processing, including automatic grammar correction systems and online education systems, rely on explicit language knowledge bases. Because language itself is complex, manually constructing a language knowledge base such as a dictionary or a grammar knowledge base takes a great deal of time, labor and money, and the result still has shortcomings in coverage and consistency. Automatic acquisition of language knowledge is an effective way to build such a knowledge base. Knowledge acquired by deep learning, however, is not yet interpretable, and the interpretable linguistic knowledge that can currently be acquired automatically is mainly collocation knowledge. Cognitive linguistics has proposed a new basic unit of linguistic knowledge, the formula (construction). A formula is a form-meaning pairing, and the collocation formula is a representation of formulas proposed by corpus-based cognitive linguistic research; it comprises several types of knowledge, such as syntax, vocabulary, and the strength of association between words and formulas. Existing theoretical research shows that collocation formulas are highly interpretable, can explain a variety of language phenomena, and have broad application prospects in automatic grammar correction, online education and similar areas.
However, the collocation formula has not been strictly formalized, and no mature method for its automatic acquisition and visualization is available.
Disclosure of Invention
In view of the shortcomings of the prior art and the need for improvement, the invention provides an automatic acquisition method, an automatic acquisition system and a visualization method for collocation formulas, with the aim of providing a strict formal definition of the collocation formula together with a corresponding automatic acquisition method and visualization method.
Given a target word or a syntactic pattern and a corpus that has been parsed with dependency syntax, the method automatically acquires the collocation formula instances of the target word or syntactic pattern from the corpus, automatically generates collocation formulas by clustering, and provides a visualization method. The generated knowledge base is highly interpretable and can be used in fields such as online language education and automatic grammar correction.
To achieve the above object, according to a first aspect of the present invention, there is provided a collocation formula automatic acquisition method, including the steps of:
S1, extracting a collocation formula instance set of a target word or a target syntactic pattern from a corpus, wherein the target syntactic pattern takes the form of a dependency tree;
S2, clustering the collocation formula instance set into a plurality of communities;
S3, for each cluster community, acquiring the collocation formula corresponding to that community.
Preferably, a collocation formula instance set of the target word is extracted from the corpus as follows:
(A1) searching a corpus, acquiring all sentence instances containing target words, converting each acquired sentence instance into a dependency tree through dependency syntax analysis, and forming a dependency tree set by all the dependency trees;
(A2) for each dependency tree in the set of dependency trees, constructing a set of collocation formula instances by:
initializing a dependency subtree to be empty, traversing each triple in the dependency tree, selecting every triple whose head word or dependent word is the target word and adding it to the dependency subtree, and, after the traversal is finished, taking the dependency subtree as the collocation formula instance corresponding to the dependency tree.
Beneficial effects: a collocation formula instance of a target word is centered on the target word and, through dependency links, collects both the syntactic components that depend on the target word and the component on which it depends, so that it comprehensively reflects the specific syntactic usage of the target word.
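Steps (A1) and (A2) can be sketched in Python as follows. This is a minimal illustration rather than the implementation of the invention: representing a triple as a plain (t, h, c) tuple (dependency type, head word, dependent word), the function names `extract_instance` and `extract_instance_set`, and the English example sentence with its dependency labels are all assumptions introduced here.

```python
from typing import List, Tuple

# (dependency type t, head word h, dependent word c); a hypothetical encoding
Triple = Tuple[str, str, str]


def extract_instance(tree: List[Triple], w: str) -> List[Triple]:
    """Keep, in their original order, the triples whose head or dependent is w."""
    return [(t, h, c) for (t, h, c) in tree if h == w or c == w]


def extract_instance_set(trees: List[List[Triple]], w: str) -> List[List[Triple]]:
    """Build one collocation formula instance per dependency tree."""
    return [extract_instance(tree, w) for tree in trees]


# Illustrative parse of "She feels very happy." (labels are assumptions).
tree = [
    ("nsubj", "feels", "she"),
    ("acomp", "feels", "happy"),
    ("advmod", "happy", "very"),
    ("punct", "feels", "."),
]
print(extract_instance(tree, "happy"))
# [('acomp', 'feels', 'happy'), ('advmod', 'happy', 'very')]
```

Only the two triples that touch the target word survive, which is exactly the dependency subtree that serves as the instance for this sentence.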
Preferably, a collocation formula instance set of the target syntactic patterns is extracted from the corpus as follows:
(B1) extracting search terms from the target syntax mode and constructing a search term set;
(B2) searching a corpus, acquiring sentence instances containing all search words in a search word set, and converting each acquired sentence instance into a dependency tree through dependency syntax analysis to form a dependency tree set;
(B3) for each dependency tree in the dependency tree set, judging whether all triples of the target dependency tree are contained in the dependency tree in the same order as in the target dependency tree; if so, proceeding to step (B4), and otherwise acquiring no instance from this dependency tree, thereby constructing the collocation formula instance set;
(B4) comparing the dependency tree with the target dependency tree to determine the word matched by the wildcard;
(B5) initializing a dependency subtree to be empty, traversing each triple in the dependency tree, adding to the dependency subtree every triple that meets any one of the following conditions, and, after the traversal is finished, taking the dependency subtree as the collocation formula instance corresponding to the dependency tree, wherein the conditions are as follows:
1) the triple exists in the target dependency tree;
2) the dependent word of the triple is the matched word;
3) the head word of the triple is the matched word.
Beneficial effects: a collocation formula instance of a syntactic pattern captures the key syntactic part determined by the dependency links and uses the wildcard technique to determine the content word involved in the syntactic pattern; the components that depend on that content word and the component it depends on are then collected, so that the instance comprehensively reflects how the syntactic pattern is actually used.
Preferably, in step S2, given a collocation formula instance set Γ' = {C_i} and the corresponding maximum clustering distances D = {ε_i}, Γ' is clustered as follows:
(C1) examining each C_i in sequence, initializing the community containing C_i as I_i = {C_i}, setting the access value of C_i to True, and going to step (C2);
(C2) obtaining the neighbor set N of C_i, which consists of the instances C_j whose distance relative to C_i is within r, where r is the maximum search distance, and going to step (C3);
(C3) examining each C_j in N one by one: if the access value of C_j is False and, for every collocation formula instance C_k in community I_i, the distance of C_k relative to C_j is within ε_j, where ε_j is the maximum clustering distance of C_j, then putting C_j into community I_i, setting the access value of C_j to True, obtaining the neighbor set N' of C_j, and updating N so that N = N'.
Beneficial effects: the invention clusters the collocation formula instances on the basis of their ordered similarity and determines the typical semantic range of use of a collocation formula by calculating the strength of association between words and the collocation formula. This simulates the way humans accumulate, abstract and generalize during reading and finally learn language knowledge, and yields a form of language knowledge that resembles human language knowledge and is easy to understand and explain.
Preferably, the maximum clustering distance of a collocation formula instance C_i is calculated as follows:
(D1) obtaining the set D of distances of C_i relative to each collocation formula instance in the collocation formula instance set Γ', where n is the number of collocation formula instances contained in Γ';
(D2) making a histogram of D, with the distance on the horizontal axis and the number of collocation formula instances in each distance interval on the vertical axis, defining the distance value at the 15th percentile as p_1, and defining the first distance value after p_1 whose count is 0 as p_2;
(D3) obtaining the mean square error σ of D;
(D4) calculating the maximum clustering distance of C_i as follows:
where δ is a multiplication parameter with 1 ≤ δ ≤ 5.
Beneficial effects: the invention treats the distance between a given collocation formula instance and all collocation formula instances as a random variable, examines the distribution of its probability mass function, and takes 15% as an empirical value, and thereby obtains a suitable maximum clustering distance for each collocation formula instance.
Preferably, the distance of a collocation formula instance C_j relative to a collocation formula instance C_k is calculated as follows:
where len(C) is the number of triples contained in C, and α and β are the weights given in the similarity calculation to the features that only C_j or only C_k has, with α + β = 1;
Beneficial effects: the invention introduces an asymmetric similarity calculation based on the method of Amos Tversky and, by weighting the distinct parts of the two instances differently, obtains an ordered similarity between collocation formula instances, which makes it possible to handle containment relationships between them.
Preferably, the feature similarity of C_j and C_k is calculated on the basis of triple similarity, where C_j = (e_1 = <t_1, h_1, c_1>, e_2 = <t_2, h_2, c_2>, ..., e_J = <t_J, h_J, c_J>) and C_k = (e_1 = <t_1, h_1, c_1>, e_2 = <t_2, h_2, c_2>, ..., e_K = <t_K, h_K, c_K>), and in each triple t is the dependency type, h is the head word (central word) and c is the dependent word; the calculation process is as follows:
initializing a matrix M of size (J+1) × (K+1) and setting the cell values in the first row and the first column to 0; starting from row 2, column 2, computing the cell values row by row such that:
where sim(e_p, e_q) denotes the similarity of triples e_p and e_q and is calculated by the following formula:
where:
sim(h_p, h_q) = cosine(vec(h_p), vec(h_q))
sim(c_p, c_q) = cosine(vec(c_p), vec(c_q))
where cosine(·) is the cosine similarity and vec(·) is the word vector;
Beneficial effects: by using a dynamic programming algorithm, which takes the cost of the different alignment paths into account, the invention obtains the maximum feature similarity of two collocation formula instances while respecting the order of the triples within each instance.
Preferably, step S3 includes the following sub-steps:
(1) converting every collocation formula instance C_i in cluster community I_i into a sequence of pairs of adjacent triples, following the order of its triples, as shown in the following formula:
g_i = (<e_1, e_2>, <e_2, e_3>, ..., <e_{k-1}, e_k>)
(2) merging all pair sequences obtained from I_i to construct a directed graph G, in which the nodes are triples, the arc directions are determined by the pairs, and the connection weight of each pair is calculated as
(3) selecting a node n with in-degree 0 and the maximum arc weight as the starting node, traversing G depth-first to obtain all subgraphs, and selecting the subgraph G' that has the highest average connection weight and contains the target structure as the syntactic pattern of the collocation formula;
(4) for any node b in G', obtaining from I_i the set W = {w_i} of words that appear at that node in the collocation formula instances; the strength of association between a word w_i and the collocation formula G' is expressed by the P value of Fisher's exact test;
(5) G', the word sets W corresponding to its nodes, and their association strengths together constitute the collocation formula obtained from cluster community I_i.
Beneficial effects: the starting node is selected by weight and a weight-first strategy is used during the depth-first traversal, so that the best path is obtained from the directed graph, which may contain cycles, and serves as the typical, representative syntactic pattern of the collocation formula.
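The graph construction and path selection of sub-steps (1) to (3) can be illustrated with the following Python sketch. It rests on several assumptions that are not taken from the text: the connection weight of a pair is taken to be its count divided by the number of instances in the community, "all subgraphs" is approximated by all maximal simple paths from the starting node, and the names `build_graph`, `best_path` and `contains_target` are hypothetical. The single letters in the example stand for triples.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Optional, Sequence, Tuple

Node = Hashable
Arc = Tuple[Node, Node]


def build_graph(instances: Sequence[Sequence[Node]]) -> Dict[Arc, float]:
    """Merge the adjacent-pair sequences of all instances into weighted arcs.

    ASSUMPTION: weight = occurrences of the pair / number of instances.
    """
    counts: Dict[Arc, int] = defaultdict(int)
    for inst in instances:
        for a, b in zip(inst, inst[1:]):
            counts[(a, b)] += 1
    return {arc: cnt / len(instances) for arc, cnt in counts.items()}


def best_path(
    weights: Dict[Arc, float], contains_target: Callable[[List[Node]], bool]
) -> Tuple[Optional[List[Node]], float]:
    """Depth-first search for the maximal path with the highest average weight."""
    succ: Dict[Node, List[Node]] = defaultdict(list)
    indeg: Dict[Node, int] = defaultdict(int)
    nodes: Dict[Node, None] = {}  # dict keeps insertion order
    for a, b in weights:
        succ[a].append(b)
        indeg[b] += 1
        nodes[a] = nodes[b] = None
    starts = [v for v in nodes if indeg[v] == 0]
    if not starts:
        return None, 0.0
    # Starting node: in-degree 0 and the heaviest outgoing arc.
    start = max(starts, key=lambda v: max(weights[(v, s)] for s in succ[v]))
    best, best_avg = None, -1.0
    stack = [(start, [start], 0.0)]
    while stack:
        node, path, total = stack.pop()
        nxt = [s for s in succ[node] if s not in path]  # never revisit a node
        if not nxt and len(path) > 1 and contains_target(path):
            avg = total / (len(path) - 1)
            if avg > best_avg:
                best, best_avg = path, avg
        for s in nxt:
            stack.append((s, path + [s], total + weights[(node, s)]))
    return (best, best_avg) if best is not None else (None, 0.0)


community = [["A", "B", "D"], ["A", "B", "D"], ["A", "B", "C"], ["B", "D"]]
graph = build_graph(community)
print(best_path(graph, lambda path: "B" in path))
# (['A', 'B', 'D'], 0.75)
```

Refusing to revisit a node already on the current path is what keeps the traversal finite when the merged graph contains cycles.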
To achieve the above object, according to a second aspect of the present invention, there is provided a collocation formula visualization method, including the steps of:
automatically obtaining a collocation formula using the method of the first aspect;
for each obtained collocation formula, the dependency types and the core word are taken as nodes and arranged linearly from left to right in the order in which they occur in G'; the connection arcs between nodes are directed, starting at the head-word node and pointing to the dependent-word node, and each connection arc displays its connection weight.
Beneficial effects: because a collocation formula has a dependency tree as its framework, it can be converted into a directed graph; the directionality of the graph makes the resulting collocation formula knowledge base highly interpretable, so that it can be used in fields such as online language education and automatic grammar correction.
To achieve the above object, according to a third aspect of the present invention, there is provided a collocation formula automatic acquisition system, including: a computer-readable storage medium and a processor;
the computer-readable storage medium is used for storing executable instructions;
the processor is configured to read the executable instructions stored in the computer-readable storage medium and to execute the collocation formula automatic acquisition method of the first aspect.
Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:
Taking a specific word or a specific sentence pattern as the unit, the invention uses a clustering method to simulate the cognitive regularities of human language learning and thereby obtains collocation formulas. A collocation formula captures a typical semantic and communicative function of a specific language, including the syntactic pattern, the words, and the strength of association between the words and the collocation formula. The method overcomes the insufficient information content of plain collocations, is highly interpretable, and can meet the needs of online language education and grammar correction.
Drawings
FIG. 1 is a flow chart of the collocation formula automatic acquisition method provided by the present invention;
FIG. 2 is a visualization result of a collocation formula provided by the present invention;
FIG. 3 is a visualization result of a specific collocation formula provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
As shown in FIG. 1, the present invention provides a method for automatically acquiring collocation formulas, which comprises the following steps: inputting a target word or target syntactic pattern and a set of dependency syntax trees (the corpus), extracting collocation formula instances, calculating the similarity of the collocation formula instances, clustering the collocation formula instances with a clustering algorithm based on community voting, generating a collocation formula for each cluster community, and finally visualizing the collocation formulas as graphs.
1. Input format definition
The input consists of the retrieval target structure and the dependency tree set.
The retrieval target structure is the first input and has two forms: (1) a single word, which may be a verb, a noun, an adjective or an adverb and is called the core word w, such as the Chinese adjective for "happy"; (2) a dependency subtree. A dependency subtree comprises several triples, each defined as <t, h, c>, where t is the dependency type, h is the head word (central word) and c is the dependent word. A dependency subtree can thus be expressed as an ordered sequence of triples:
K=(e1=<t1,h1,c1>,e2=<t2,h2,c2>,...en=<tn,hn,cn>)
where the sequence includes a core-word wildcard, written CORE, which can stand for an h_i or a c_i. For example, the dependency subtree "auxpass (CORE, by) aspect" contains two triples, where CORE is the core-word wildcard, auxpass and aspect are the dependency types, CORE is the head word, and "by" (Chinese 被) and "away" are the dependent words.
The second input is the dependency tree set Γ = {T_j}. Given a corpus, if the input target structure is a single word, the corpus can be searched directly to obtain sentence instances, and Γ is obtained through dependency syntax analysis; if the input target structure is a dependency subtree, search terms are first extracted from the dependency subtree, the corpus is then searched to obtain sentence instances and, after dependency syntax analysis, the dependency trees of the sentence instances that contain the dependency subtree are selected as Γ. Like a dependency subtree, a single dependency tree is represented as:
Tj=(e1=<t1,h1,c1>,e2=<t2,h2,c2>,...em=<tm,hm,cm>)
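One possible in-memory representation of these two inputs is sketched below in Python. The class name `Triple`, the constant `CORE`, the helper `has_wildcard` and the example words and labels are assumptions chosen for illustration; the text itself only fixes the triple layout <t, h, c> and the ordered-sequence form.

```python
from typing import NamedTuple, Tuple

CORE = "CORE"  # marker for the core-word wildcard in a target pattern


class Triple(NamedTuple):
    """One dependency relation <t, h, c>."""
    t: str  # dependency type
    h: str  # head word
    c: str  # dependent word


# A dependency tree or subtree is an ordered sequence of triples.
DependencyTree = Tuple[Triple, ...]

# Hypothetical parsed sentence (words and labels are illustrative only).
tree: DependencyTree = (
    Triple("nsubj", "eats", "cat"),
    Triple("dobj", "eats", "fish"),
    Triple("amod", "fish", "fresh"),
)

# A target pattern with the wildcard in the head-word position.
pattern: DependencyTree = (Triple("auxpass", CORE, "by"),)


def has_wildcard(e: Triple) -> bool:
    """True if the triple uses CORE as its head word or dependent word."""
    return e.h == CORE or e.c == CORE


print([has_wildcard(e) for e in pattern], [has_wildcard(e) for e in tree])
# [True] [False, False, False]
```

Using an immutable tuple of named triples keeps the order of the triples explicit, which matters later because both pattern matching and similarity calculation are order-sensitive.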
2. Extraction of collocation formula instances
(1) If the input target structure is a single word, the dependency tree set Γ is traversed and the dependency trees T_j are analyzed one by one; in each, the triples whose head word or dependent word is the core word are selected and together form a dependency subtree.
The selection condition is as follows: for each triple e_i = <t_i, h_i, c_i> of dependency tree T_j, if h_i = w or c_i = w, e_i is added to the dependency subtree. The resulting dependency subtree is the collocation formula instance obtained from T_j, namely:
C=(e1=<t1,h1,c1>,e2=<t2,h2,c2>,...ek=<tk,hk,ck>)
where, for any e_i = <t_i, h_i, c_i>, h_i = w or c_i = w.
(2) If the input target structure is a dependency subtree K, the dependency tree set Γ is traversed and the dependency trees T_j are analyzed one by one to obtain the collocation formula instances.
For a dependency tree T_j, if every triple e_i = <t_i, h_i, c_i> in K satisfies e_i ∈ T_j and the order of the triples in K is consistent with their order in T_j, a collocation formula instance can be obtained from T_j; otherwise none is obtained. For a dependency tree T_j from which an instance can be obtained, comparing it with K determines the word matched by the wildcard CORE, denoted w_core. Let the collocation formula instance be C; the triples e_i = <t_i, h_i, c_i> of T_j are examined one by one and added to C if any of the following conditions is met:
(1) e_i ∈ K;
(2) or c_i = w_core;
(3) or h_i = w_core.
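The pattern-based extraction just described can be sketched in Python as follows. The sketch assumes that the pattern contains the CORE wildcard and that a single word fills it; trying candidate words in order of appearance and taking the first one that fits is a simplification introduced here, and the function names and the English example are hypothetical.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (dependency type t, head word h, dependent word c)
CORE = "CORE"


def bind(pattern: List[Triple], w: str) -> List[Triple]:
    """Replace the CORE wildcard in every pattern triple by the word w."""
    return [(t, w if h == CORE else h, w if c == CORE else c) for t, h, c in pattern]


def is_ordered_subsequence(needle: List[Triple], tree: List[Triple]) -> bool:
    """True if all needle triples occur in tree in the same relative order."""
    it = iter(tree)
    return all(any(e == x for x in it) for e in needle)


def match_core(pattern: List[Triple], tree: List[Triple]) -> Optional[str]:
    """Return the first word that can fill CORE so that the tree contains the pattern."""
    candidates: List[str] = []
    for _, h, c in tree:
        for word in (h, c):
            if word not in candidates:
                candidates.append(word)
    for w in candidates:
        if is_ordered_subsequence(bind(pattern, w), tree):
            return w
    return None


def extract_instance(pattern: List[Triple], tree: List[Triple]) -> Optional[List[Triple]]:
    """Apply the three selection conditions; None if the tree lacks the pattern."""
    w_core = match_core(pattern, tree)
    if w_core is None:
        return None
    bound = bind(pattern, w_core)
    return [e for e in tree if e in bound or e[2] == w_core or e[1] == w_core]


pattern = [("auxpass", CORE, "by")]
tree = [
    ("nsubj", "praised", "student"),
    ("auxpass", "praised", "by"),
    ("advmod", "praised", "often"),
    ("amod", "student", "young"),
]
print(extract_instance(pattern, tree))
# [('nsubj', 'praised', 'student'), ('auxpass', 'praised', 'by'), ('advmod', 'praised', 'often')]
```

The triple ("amod", "student", "young") is dropped because it neither belongs to the pattern nor touches the matched core word "praised".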
3. Similarity calculation of collocation formula examples
The similarity of two triples e_1 = <t_1, h_1, c_1> and e_2 = <t_2, h_2, c_2> is calculated as follows:
where:
sim(h_1, h_2) = cosine(vec(h_1), vec(h_2))
sim(c_1, c_2) = cosine(vec(c_1), vec(c_2))
where cosine(·) is the cosine similarity and vec(·) is the word vector. Word vectors can be obtained with a currently popular algorithm such as word2vec.
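A small Python sketch of a triple similarity built from the two cosine terms is given below. How the two terms and the dependency types are combined is an assumption made here for illustration (similarity 0 when the dependency types differ, otherwise the mean of the head-word and dependent-word similarities); the toy two-dimensional vectors stand in for real word vectors, and the names `cosine` and `triple_sim` are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (dependency type t, head word h, dependent word c)


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def triple_sim(e1: Triple, e2: Triple, vec: Dict[str, List[float]]) -> float:
    """ASSUMED combination: 0 for different types, else mean of the two cosines."""
    t1, h1, c1 = e1
    t2, h2, c2 = e2
    if t1 != t2:
        return 0.0
    return 0.5 * (cosine(vec[h1], vec[h2]) + cosine(vec[c1], vec[c2]))


# Toy vectors; in practice these would come from a trained embedding model.
vec = {
    "feels": [1.0, 0.0],
    "seems": [2.0, 0.0],
    "happy": [0.0, 1.0],
    "sad": [1.0, 0.0],
}
print(triple_sim(("acomp", "feels", "happy"), ("acomp", "seems", "sad"), vec))
# 0.5
```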
Given two collocation formula examples C1And C2Respectively is as follows:
C1=(e1=<t1,h1,c1>,e2=<t2,h2,c2>,...ek=<tk,hk,ck>)
C2=(e1=<t1,h1,c1>,e2=<t2,h2,c2>,...el=<tl,hl,cl>)
The feature similarity of C_1 and C_2 is calculated as follows: a matrix M of size (k+1) × (l+1) is initialized, and the cell values in the first row and the first column are set to 0; starting from row 2, column 2, the cell values are computed row by row such that:
where sim(e_i, e_j) is the triple similarity defined above. After the calculation is completed, the feature similarity of C_1 and C_2 is obtained from M.
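The matrix computation can be illustrated with the following Python sketch. The recurrence used here is an assumption, namely the usual weighted longest-common-subsequence rule M[i][j] = max(M[i-1][j], M[i][j-1], M[i-1][j-1] + sim(e_i, e_j)) with the result read from the last cell; the function name `feature_similarity` and the exact-match similarity used in the example are likewise hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Triple = Tuple[str, str, str]


def feature_similarity(
    c1: Sequence[Triple],
    c2: Sequence[Triple],
    sim: Callable[[Triple, Triple], float],
) -> float:
    """Order-preserving alignment score of two instances (assumed recurrence)."""
    k, l = len(c1), len(c2)
    # First row and first column stay 0, as in the description.
    M: List[List[float]] = [[0.0] * (l + 1) for _ in range(k + 1)]
    for i in range(1, k + 1):
        for j in range(1, l + 1):
            M[i][j] = max(
                M[i - 1][j],                                   # skip a triple of c1
                M[i][j - 1],                                   # skip a triple of c2
                M[i - 1][j - 1] + sim(c1[i - 1], c2[j - 1]),   # align the two triples
            )
    return M[k][l]


def exact(e1: Triple, e2: Triple) -> float:
    """Stand-in similarity for the example: 1 for identical triples, else 0."""
    return 1.0 if e1 == e2 else 0.0


A, B, C = ("nsubj", "x", "a"), ("dobj", "x", "b"), ("advmod", "x", "c")
print(feature_similarity([A, B, C], [A, C], exact))  # 2.0
print(feature_similarity([A, B], [B, A], exact))     # 1.0
```

The second call shows the effect of order: the two instances share both triples, but only one of them can be aligned without crossing.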
Based on the feature similarity of C_1 and C_2, the ordered similarity of C_1 relative to C_2 is calculated:
In the above formula, the lowest similarity of two collocation formula instances is set to 0.05, len(C) is the number of triples contained in C, the feature similarity is that of C_1 and C_2 computed above, and α and β are the weights given in the similarity calculation to the features that only C_1 or only C_2 has, with α + β = 1. The ordered similarity of C_1 relative to C_2 differs from the ordered similarity of C_2 relative to C_1.
4. Clustering method based on collocation formula community
Given the dependency tree set Γ, collocation formula instances are acquired one by one, giving the collocation formula instance set Γ' = {C_i}.
The maximum clustering distance of C_i is calculated as follows:
(1) obtain the set D of distances of C_i relative to each collocation formula instance in Γ', where n is the number of collocation formula instances contained in Γ';
(2) make a histogram of D, where the horizontal axis is the distance, with value range [0, 1], and the vertical axis is the number of collocation formula instances in each distance interval. It is assumed that at most 15% of the collocation formula instances in Γ' can be grouped with C_i, so the distance value at the 15th percentile is defined as p_1, and the first distance value after p_1 whose count is 0 is defined as p_2;
(3) obtain the mean square error σ of D;
(4) define the maximum clustering distance of C_i as:
where δ is a multiplication parameter with 1 ≤ δ ≤ 5.
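The quantities p_1, p_2 and σ used above could be computed, for instance, as in the following Python sketch. The number of bins, the nearest-rank percentile rule, the use of the population standard deviation for σ, and the choice of the left edge of the first empty bin as p_2 are all assumptions made for illustration; the final expression for the maximum clustering distance is not reproduced, and the function name `distance_profile` is hypothetical.

```python
import math
import statistics
from typing import List, Tuple


def distance_profile(
    distances: List[float], bins: int = 20, pct: int = 15
) -> Tuple[float, float, float]:
    """Return (p1, p2, sigma) for the distances of one instance to all others."""
    d = sorted(distances)
    # p1: nearest-rank percentile (assumed percentile definition).
    rank = max(1, math.ceil(pct * len(d) / 100))
    p1 = d[rank - 1]
    # Histogram over [0, 1] with equal-width bins.
    counts = [0] * bins
    for x in d:
        counts[min(int(x * bins), bins - 1)] += 1
    # p2: left edge of the first empty bin after the bin containing p1;
    # falls back to 1.0 if no bin after p1 is empty (assumption).
    p2 = 1.0
    for b in range(min(int(p1 * bins), bins - 1) + 1, bins):
        if counts[b] == 0:
            p2 = b / bins
            break
    sigma = statistics.pstdev(d)  # assumed reading of sigma
    return p1, p2, sigma


D = [0.02, 0.04, 0.12, 0.55, 0.61, 0.72, 0.78, 0.83, 0.91, 0.95]
p1, p2, sigma = distance_profile(D, bins=10)
print(p1, p2)
# 0.04 0.2
```

In this example two of the ten instances lie within p_1 of the reference instance, and the gap in the histogram starting at 0.2 marks where the group of close neighbors ends.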
In step S2, given the collocation formula instance set Γ' = {C_i} and the corresponding maximum clustering distances D = {ε_i}, Γ' is clustered as follows:
(C1) examine each C_i in sequence, initialize the community containing C_i as I_i = {C_i}, set the access value of C_i to True, and go to (C2);
(C2) obtain the neighbor set N of C_i, which consists of the instances C_j whose distance relative to C_i is within r, where r is the maximum search distance, and go to (C3);
(C3) examine each C_j in N one by one: if the access value of C_j is False and, for every collocation formula instance C_k in community I_i, the distance of C_k relative to C_j is within ε_j, where ε_j is the maximum clustering distance of C_j, then put C_j into community I_i, set the access value of C_j to True, obtain the neighbor set N' of C_j, and update N so that N = N'.
The present embodiment sets the maximum search distance r to 0.6.
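The community-voting clustering of steps (C1) to (C3) can be sketched in Python as follows. Three points are assumptions of this sketch rather than statements of the text: instances that already belong to a community are skipped when they come up in (C1), "within" is read as "less than or equal to", and the candidate list N is extended with the new neighbors N' instead of being replaced. Instances are referred to by index, and `dist(j, i)` stands for the distance of instance j relative to instance i; the function name `cluster` is hypothetical.

```python
from typing import Callable, List


def cluster(
    n: int,
    dist: Callable[[int, int], float],
    eps: List[float],
    r: float = 0.6,
) -> List[List[int]]:
    """Group n instances into communities; eps[j] is the max clustering distance of j."""
    visited = [False] * n
    communities: List[List[int]] = []

    def neighbours(i: int) -> List[int]:
        return [j for j in range(n) if j != i and dist(j, i) <= r]

    for i in range(n):
        if visited[i]:  # ASSUMPTION: an assigned instance does not seed a community
            continue
        visited[i] = True
        community = [i]
        candidates = neighbours(i)
        pos = 0
        while pos < len(candidates):
            j = candidates[pos]
            pos += 1
            # Voting: every current member must lie within eps[j] of candidate j.
            if not visited[j] and all(dist(k, j) <= eps[j] for k in community):
                community.append(j)
                visited[j] = True
                candidates.extend(neighbours(j))  # ASSUMPTION: N grows by N'
        communities.append(community)
    return communities


# Toy example: instances placed on a line, symmetric distance for simplicity.
pos_on_line = [0.0, 0.1, 0.2, 0.9, 1.0]
toy_dist = lambda j, i: abs(pos_on_line[j] - pos_on_line[i])
print(cluster(5, toy_dist, eps=[0.25] * 5))
# [[0, 1, 2], [3, 4]]
```

Unlike density-based expansion through single links, a candidate joins only if all existing members accept it, which keeps each community compact.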
5. Generation of collocation formulas
For each cluster community I_i, the collocation formula is obtained as follows:
(1) Every collocation formula instance C_i in I_i is converted into a sequence of pairs of adjacent triples, following the order of its triples, as shown in the following formula:
g_i = (<e_1, e_2>, <e_2, e_3>, ..., <e_{k-1}, e_k>)
Hereinafter, <e_i, e_{i+1}> is referred to as a pair.
(2) All pair sequences obtained from I_i are merged to construct a directed graph G, in which the nodes are triples, the arc directions are determined by the pairs, and the connection weight of each pair is calculated as
(3) A node n with in-degree 0 and the maximum arc weight is selected; with n as the starting node, G is traversed depth-first to obtain all subgraphs, and the subgraph G' that has the highest average connection weight and contains the target structure is selected as the syntactic pattern of the collocation formula. The average connection weight is the sum of the weights of all connections on the path from the starting node to the end node, divided by the number of connections on the path.
(4) For any node b in G', the set W = {w_i} of words that appear at that node in the collocation formula instances is obtained from I_i; the strength of association between a word w_i and the collocation formula G' is expressed by the P value of Fisher's exact test and is calculated as follows:
Let the number of occurrences of w_i at the node be f_i^1, the frequency of w_i in the given corpus be f_i^2, the number of collocation formula instances contained in I_i be n, the number of sentences contained in the given corpus be N, and the average length of a collocation formula instance be 20; with a = f_i^1, b = n × (20 − f_i^1), c = f_i^2 − f_i^1 and d = N − f_i^1, P can be calculated as:
(5) G', the word sets corresponding to its nodes, and their association strengths together constitute the collocation formula obtained from I_i.
6. Visualization of collocation formulas
For the following collocation formula:
(<D1,X1,CORE-WORD>,
<D2,CORE-WORD,X2>,
<D3,CORE-WORD,X3>,
X1={(W_1,V_1,F_1)}
X2={(W_2,V_2,F_2),(W_3,V_3,F_3)}
X3={(W_4,V_4,F_4)}
CORE-WORD={(W_5,V_5,F_5),(W_6,V_6,F_6),(W_7,V_7,F_7)})
where D1, D2 and D3 are dependency types, and X1, X2 and X3 are syntactic slot placeholders that point to three sets of word information structures; each word information structure includes a part of speech (W), an association strength (V) and a frequency (F), the frequency being the number of times the word occurs in the community.
The visualization rules are as follows: the dependency types and the CORE-WORD are taken as nodes and arranged linearly from left to right in the order in which they occur in G'. The connection arcs between nodes are directed: the starting node is the head-word node, i.e., the first placeholder in the triple, the arrow points to the dependent-word node, i.e., the second placeholder in the triple, and each connection arc displays its connection weight. The visualization result is shown in FIG. 2.
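One way to produce such a drawing is to emit Graphviz DOT text, as in the Python sketch below. This is only an illustration under stated assumptions: each triple is assumed to involve the CORE-WORD, the non-core slot of a triple is shown as a node labelled with its dependency type, the per-arc weights are passed in as a plain list, and the left-to-right layout is requested with `rankdir=LR` without enforcing the exact node order of G'. The function name `to_dot` and the weight values are hypothetical.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (dependency type, head slot, dependent slot)
CORE = "CORE-WORD"


def to_dot(pattern: List[Triple], weights: List[float]) -> str:
    """Render a syntactic pattern as Graphviz DOT text, laid out left to right."""
    lines = ["digraph collocation {", "  rankdir=LR;"]
    for (dep, head, child), w in zip(pattern, weights):
        # The core word has its own node; the other slot is labelled
        # with the dependency type of the triple.
        src = CORE if head == CORE else dep
        dst = CORE if child == CORE else dep
        # Arc from the head-word node to the dependent-word node.
        lines.append(f'  "{src}" -> "{dst}" [label="{w:.2f}"];')
    lines.append("}")
    return "\n".join(lines)


pattern = [("D1", "X1", CORE), ("D2", CORE, "X2"), ("D3", CORE, "X3")]
dot_text = to_dot(pattern, [0.8, 0.6, 0.4])
print(dot_text)
```

The resulting text can be passed to any Graphviz renderer; the word lists attached to each slot could be added as node labels in the same way.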
Specifically, taking the dependency subtree "auxpass (CORE, by)" as the input pattern and the People's Daily corpus as the corpus, a number of collocation formulas can be obtained, one of which is shown in FIG. 3. This collocation formula captures a typical usage pattern of the Chinese passive "bei" (被) sentence and is highly interpretable: the ROOT node indicates that the collocation formula cannot be embedded in other syntactic components; the words given in the NSUBJ node, together with their association strengths, indicate that the semantic type of the subject is human; the semantics of the other NSUBJ node are likewise mainly human; the CORE-WORD node is filled by verbs and, besides verbs, also includes some idioms.
This technique overcomes the shortcomings of the two existing types of automatic language knowledge acquisition. The first type comes from corpus linguistics; the knowledge it yields is mainly limited to collocation information between two words, and this small amount of information can hardly meet the needs of online language education, automatic grammar correction and the like. The second type comes from deep-learning-based natural language processing; its knowledge exists in the form of a very large number of parameters, has low interpretability, and cannot satisfy the need of language education and grammar correction feedback for explicit language rules. Taking a specific word or a specific sentence pattern as the unit, the invention uses a clustering method to simulate the cognitive regularities of human language learning and thereby obtains collocation formulas. As shown in FIG. 3, a collocation formula captures a typical semantic and communicative function of a specific language (Chinese in the figure), including the syntactic pattern, the words, and the strength of association between the words and the collocation formula. The method thus overcomes the insufficient information content of plain collocations, is highly interpretable, and can meet the needs of online language education and grammar correction.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (10)
1. A method for automatically acquiring collocation formulas, characterized in that the method comprises the following steps:
S1, extracting a collocation formula instance set of a target word or a target syntactic pattern from a corpus, wherein the target syntactic pattern takes the form of a dependency tree;
S2, clustering the collocation formula instance set into a plurality of communities;
S3, for each cluster community, acquiring the collocation formula corresponding to that community.
2. The method of claim 1, wherein the set of collocation formula instances of the target word is extracted from the corpus as follows:
(A1) searching the corpus to obtain all sentence instances containing the target word, and converting each obtained sentence instance into a dependency tree through dependency parsing, all the dependency trees forming a dependency tree set;
(A2) for each dependency tree in the dependency tree set, constructing a collocation formula instance as follows, thereby building the set of collocation formula instances:
initializing a dependency subtree as empty, traversing each triple in the dependency tree, selecting every triple whose head word or dependent word is the target word and adding it to the dependency subtree, and, after the traversal is finished, taking the dependency subtree as the collocation formula instance corresponding to the dependency tree.
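Step (A2) can be illustrated with a minimal Python sketch. It assumes that a dependency tree is already available as a list of (dependency type, head word, dependent word) triples; the `Triple` type, the function name `extract_instance`, and the toy English tree are hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    dep_type: str    # dependency type, e.g. "nsubj"
    head: str        # head (center) word
    dependent: str   # dependent word


def extract_instance(tree: List[Triple], target: str) -> List[Triple]:
    """Step (A2): keep the triples whose head or dependent is the target word."""
    subtree: List[Triple] = []          # the dependency subtree starts empty
    for triple in tree:                 # traverse every triple of the tree
        if target in (triple.head, triple.dependent):
            subtree.append(triple)
    return subtree                      # the collocation formula instance


# Toy dependency tree (hypothetical parser output, not taken from the source).
tree = [
    Triple("nsubj", "praised", "teacher"),
    Triple("dobj", "praised", "student"),
    Triple("det", "student", "the"),
    Triple("det", "teacher", "the"),
]
instance = extract_instance(tree, "praised")
print(instance)
```

Running the function over every tree in the dependency tree set would give the instance set used as input to step S2.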
3. The method of claim 1, wherein the set of collocation formula instances of the target syntactic pattern is extracted from the corpus as follows:
(B1) extracting search words from the target syntactic pattern to build a search word set;
(B2) searching the corpus to obtain the sentence instances that contain all search words in the search word set, and converting each obtained sentence instance into a dependency tree through dependency parsing to form a dependency tree set;
(B3) for each dependency tree in the dependency tree set, determining whether all triples of the target dependency tree are contained in the dependency tree in the same order as in the target dependency tree; if so, proceeding to step (B4); otherwise, discarding the dependency tree; the set of collocation formula instances is built in this way;
(B4) comparing the dependency tree with the target dependency tree to determine the matching items of the wildcards;
(B5) initializing a dependency subtree as empty, traversing each triple in the dependency tree, selecting every triple that meets any one of the following conditions and adding it to the dependency subtree, and, after the traversal is finished, taking the dependency subtree as the collocation formula instance corresponding to the dependency tree, wherein the conditions are as follows:
1) the triple exists in the target dependency tree;
2) the dependent word of the triple is a matching item;
3) the head word of the triple is a matching item.
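Steps (B3) to (B5) can be sketched in Python as follows. The sketch makes several assumptions that the source does not state: the wildcard is written `"*"`, each wildcard is matched independently (no consistency check between wildcards), and the in-order containment test is a greedy subsequence match. All names and the toy English tree are hypothetical.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]   # (dependency type, head word, dependent word)
WILDCARD = "*"                  # hypothetical wildcard symbol


def fits(pattern: Triple, triple: Triple) -> bool:
    """True if every non-wildcard field of the pattern equals the triple's field."""
    return all(p == WILDCARD or p == t for p, t in zip(pattern, triple))


def match_wildcards(tree: List[Triple], target: List[Triple]) -> Optional[Set[str]]:
    """Steps (B3)-(B4): in-order containment test; returns the words bound to wildcards."""
    bound: Set[str] = set()
    pos = 0
    for triple in tree:
        if pos < len(target) and fits(target[pos], triple):
            # Record the words that fill wildcard head/dependent positions.
            bound.update(t for p, t in zip(target[pos][1:], triple[1:]) if p == WILDCARD)
            pos += 1
    return bound if pos == len(target) else None


def extract_pattern_instance(tree: List[Triple], target: List[Triple]) -> Optional[List[Triple]]:
    """Step (B5): collect the triples that satisfy any of conditions 1) to 3)."""
    matches = match_wildcards(tree, target)
    if matches is None:                        # (B3) failed: the tree is discarded
        return None
    return [t for t in tree
            if any(fits(p, t) for p in target)  # 1) triple exists in the target tree
            or t[2] in matches                  # 2) dependent word is a matching item
            or t[1] in matches]                 # 3) head word is a matching item


tree = [
    ("nsubjpass", "criticized", "student"),
    ("auxpass", "criticized", "was"),
    ("det", "student", "the"),
    ("case", "teacher", "by"),
    ("nmod", "criticized", "teacher"),
]
target = [("auxpass", WILDCARD, "was")]
instance = extract_pattern_instance(tree, target)
print(instance)
```

In this toy case the wildcard binds to "criticized", so every triple attached to that word is pulled into the instance alongside the pattern triple itself.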
4. A method according to any one of claims 1 to 3, wherein in step S2, given a collocation formula instance set $\Gamma' = \{C_i\}$ and the corresponding set of maximum clustering distances $D = \{\varepsilon_i\}$, $\Gamma'$ is clustered as follows:
(C1) examining each $C_i$ in turn, initializing the community containing $C_i$ as $I_i = \{C_i\}$, setting the access value of $C_i$ to True, and proceeding to step (C2);
(C2) obtaining the neighbor set $N$ of $C_i$, namely the instances $C_j$ whose distance $d(C_j, C_i)$ relative to $C_i$ is within the maximum search distance $r$, and proceeding to step (C3);
(C3) examining each $C_j$ in $N$ one by one: if the access value of $C_j$ is False, and if every instance $C_k$ in community $I_i$ satisfies $d(C_k, C_j) \le \varepsilon_j$, where $d(C_k, C_j)$ is the distance of $C_k$ relative to $C_j$ and $\varepsilon_j$ is the maximum clustering distance of $C_j$, then putting $C_j$ into community $I_i$, setting the access value of $C_j$ to True, obtaining the neighbor set $N'$ of $C_j$, and updating $N$ such that $N = N'$.
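The community-growing procedure of steps (C1) to (C3) can be sketched in Python as below. This is an illustrative reading, not the claimed implementation: it assumes that instances already assigned to a community are skipped in (C1), that "within" means less than or equal, and that the neighbors of a newly added instance are appended to the list of pending candidates. The function name `cluster` and the one-dimensional toy data are hypothetical.

```python
from typing import Callable, List, Sequence


def cluster(n: int,
            dist: Callable[[int, int], float],
            eps: Sequence[float],
            r: float) -> List[List[int]]:
    """Group instance indices 0..n-1 into communities following (C1)-(C3)."""
    visited = [False] * n               # the "access value" of each instance
    communities: List[List[int]] = []
    for i in range(n):
        if visited[i]:                  # assumption: assigned instances are skipped
            continue
        community = [i]                 # (C1) start a new community I_i = {C_i}
        visited[i] = True
        # (C2) neighbors of C_i within the maximum search distance r
        pending = [j for j in range(n) if j != i and dist(j, i) <= r]
        while pending:                  # (C3) examine candidates one by one
            j = pending.pop(0)
            if visited[j]:
                continue
            # C_j joins only if every current member lies within its own eps_j
            if all(dist(k, j) <= eps[j] for k in community):
                community.append(j)
                visited[j] = True
                # assumption: neighbors of C_j become further candidates
                pending.extend(m for m in range(n)
                               if not visited[m] and dist(m, j) <= r)
        communities.append(community)
    return communities


# Toy data: points on a line stand in for collocation formula instances.
points = [0.0, 0.1, 0.2, 5.0, 5.1]
communities = cluster(
    n=len(points),
    dist=lambda a, b: abs(points[a] - points[b]),
    eps=[0.3] * len(points),
    r=1.0,
)
print(communities)
```

Unlike a single global radius, each instance carries its own threshold, so a candidate is admitted only when the whole community fits inside that candidate's maximum clustering distance.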
5. The method of claim 4, wherein the maximum clustering distance of a collocation formula instance $C_i$ is calculated as follows:
(D1) obtaining the set $D$ of distances of $C_i$ relative to each collocation formula instance in the collocation formula instance set $\Gamma'$, where $n$ is the number of collocation formula instances contained in $\Gamma'$;
(D2) drawing a histogram of $D$ with distance on the horizontal axis and the number of collocation formula instances in each distance interval on the vertical axis, defining the distance value at the 15th percentile as $p_1$, and defining the first distance value after $p_1$ whose count is 0 as $p_2$;
(D3) obtaining the standard deviation $\sigma$ of $D$;
(D4) the maximum clustering distance of $C_i$ is calculated as follows: [formula not reproduced in this text]
wherein $\delta$ is a multiplication parameter with $1 \le \delta \le 5$.
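The quantities of steps (D1) to (D3) can be computed as in the following Python sketch. Because the formula of step (D4) is not available in this text, the sketch stops at $p_1$, $p_2$ and $\sigma$ and does not attempt to compute the maximum clustering distance. The bin width, the nearest-rank percentile rule, the use of a bin's lower edge as $p_2$, and the name `distance_statistics` are all assumptions.

```python
import math
import statistics
from typing import Dict, List, Optional, Tuple


def distance_statistics(distances: List[float],
                        bin_width: float) -> Tuple[float, Optional[float], float]:
    """Return (p1, p2, sigma) for the distance set D of one instance."""
    ordered = sorted(distances)

    # p1: 15th percentile, taken here with the nearest-rank rule (assumption).
    rank = max(1, math.ceil(0.15 * len(ordered)))
    p1 = ordered[rank - 1]

    # Histogram of D: number of distances falling in each bin.
    counts: Dict[int, int] = {}
    for d in ordered:
        b = int(d // bin_width)
        counts[b] = counts.get(b, 0) + 1

    # p2: lower edge of the first empty bin after the bin containing p1;
    # None when the histogram has no gap after p1.
    p2: Optional[float] = None
    for b in range(int(p1 // bin_width) + 1, max(counts)):
        if counts.get(b, 0) == 0:
            p2 = b * bin_width
            break

    sigma = statistics.pstdev(ordered)   # population standard deviation of D
    return p1, p2, sigma


# Hypothetical distances from one instance to ten others.
D = [0.5, 1.2, 1.4, 1.7, 2.3, 6.1, 6.5, 7.2, 7.4, 8.0]
p1, p2, sigma = distance_statistics(D, bin_width=1.0)
print(p1, p2, sigma)
```

The gap between the near group and the far group of distances is what $p_2$ is meant to locate in this reading of step (D2).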
6. A method according to claim 4 or 5, characterized in that the distance $d(C_j, C_k)$ of collocation formula instance $C_j$ relative to collocation formula instance $C_k$ is calculated as follows: [formula not reproduced in this text]
wherein le(C) is the number of triples contained in $C$, and $\alpha$ and $\beta$ respectively represent the weights given in the similarity calculation to the features that only $C_j$ or only $C_k$ has, with $\alpha + \beta = 1$;
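7. The method of claim 6, wherein the feature similarity of $C_j$ and $C_k$ is calculated based on the similarity of triples, where $C_j = (e_1 = \langle t_1, h_1, c_1 \rangle, e_2 = \langle t_2, h_2, c_2 \rangle, \ldots, e_J = \langle t_J, h_J, c_J \rangle)$ and $C_k = (e_1 = \langle t_1, h_1, c_1 \rangle, e_2 = \langle t_2, h_2, c_2 \rangle, \ldots, e_K = \langle t_K, h_K, c_K \rangle)$; in each triple, $t$ is the dependency type, $h$ is the head word, and $c$ is the dependent word; the calculation process is as follows:
initializing a matrix $M$ of size $(J+1) \times (K+1)$ and setting the cell values in the first row and the first column to 0; starting from the cell in row 2, column 2, computing the cell values row by row such that: [formula not reproduced in this text]
wherein $\mathrm{sim}(e_p, e_q)$ denotes the similarity of triples $e_p$ and $e_q$ and is calculated by the following formula: [formula not reproduced in this text]
wherein:
$\mathrm{sim}(h_p, h_q) = \mathrm{cosine}(\mathrm{vec}(h_p), \mathrm{vec}(h_q))$
$\mathrm{sim}(c_p, c_q) = \mathrm{cosine}(\mathrm{vec}(c_p), \mathrm{vec}(c_q))$
wherein $\mathrm{cosine}(\cdot)$ is the cosine function and $\mathrm{vec}(\cdot)$ is the word vector;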
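The matrix computation can be illustrated with the Python sketch below. Since the recurrence and the triple-similarity formula are not available in this text, the sketch substitutes two clearly labeled assumptions: a weighted longest-common-subsequence recurrence for the cells of $M$, and a triple similarity that is 0 when the dependency types differ and otherwise the mean of the head-word and dependent-word cosines. The toy word vectors and all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Triple = Tuple[str, str, str]          # (t, h, c): type, head word, dependent word
Vectors = Dict[str, Sequence[float]]   # hypothetical word-vector lookup


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def triple_similarity(ep: Triple, eq: Triple, vec: Vectors) -> float:
    # ASSUMPTION: the source formula is missing; types must agree, and the
    # head and dependent cosines are averaged.
    if ep[0] != eq[0]:
        return 0.0
    return 0.5 * (cosine(vec[ep[1]], vec[eq[1]]) + cosine(vec[ep[2]], vec[eq[2]]))


def sequence_similarity(cj: List[Triple], ck: List[Triple], vec: Vectors) -> float:
    J, K = len(cj), len(ck)
    # (J+1) x (K+1) matrix with first row and first column set to 0.
    M = [[0.0] * (K + 1) for _ in range(J + 1)]
    for p in range(1, J + 1):          # row by row, starting at row 2, column 2
        for q in range(1, K + 1):
            # ASSUMPTION: weighted-LCS recurrence in place of the missing formula.
            M[p][q] = max(M[p - 1][q],
                          M[p][q - 1],
                          M[p - 1][q - 1] + triple_similarity(cj[p - 1], ck[q - 1], vec))
    return M[J][K]


vec = {"teacher": (1.0, 0.0), "student": (0.0, 1.0),
       "praised": (1.0, 1.0), "criticized": (1.0, 1.0)}
cj = [("nsubj", "praised", "teacher"), ("dobj", "praised", "student")]
ck = [("nsubj", "criticized", "teacher"), ("dobj", "criticized", "student")]
score = sequence_similarity(cj, ck, vec)
print(score)
```

A recurrence of this kind rewards triples that align in the same order in both instances, which is consistent with the order-sensitive matching used elsewhere in the method.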
8. The method according to any of claims 1 to 7, characterized in that step S3 comprises the sub-steps of:
(F1) for each collocation formula instance $C_i$ in cluster community $I_i$, forming a sequence of 2-tuples according to the order of its triples, as shown in the following formula:
$g_i = (\langle e_1, e_2 \rangle, \langle e_2, e_3 \rangle, \ldots, \langle e_{k-1}, e_k \rangle)$
(F2) merging all 2-tuple sequences obtained from $I_i$ to construct a directed graph $G$, in which the nodes are triples, the arc directions are determined by the 2-tuples, and the connection weight of each 2-tuple is calculated as [formula not reproduced in this text];
(F3) selecting a node $n$ with in-degree 0 and the largest arc weight as the start node, traversing $G$ depth-first to obtain all subgraphs, and selecting the subgraph $G'$ that has the highest average connection weight and contains the target structure as the syntactic pattern of the collocation formula;
(F4) for any node $b$ in $G'$, obtaining from $I_i$ the set $W = \{w_i\}$ of words of the collocation formula instances that appear at that node; the strength of association between a word $w_i$ and the collocation formula $G'$ is then represented by the P value of Fisher's exact test [formula not reproduced in this text];
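Sub-steps (F1) to (F4) can be sketched in Python as follows. Several parts are assumptions, because the corresponding formulas are missing from this text: the arc weight is taken as the pair count divided by the number of instances, "subgraphs" are read as simple paths from the start node, and the association strength uses a one-sided Fisher exact test on a hypothetical 2x2 table of counts. Short string labels stand in for triples.

```python
from collections import Counter, defaultdict
from math import comb
from typing import Dict, Hashable, List, Optional, Sequence, Tuple

Arc = Tuple[Hashable, Hashable]


def build_graph(instances: Sequence[Sequence[Hashable]]) -> Dict[Arc, float]:
    """(F1)-(F2): pair up consecutive triples and weight the resulting arcs."""
    counts: Counter = Counter()
    for seq in instances:
        counts.update(zip(seq, seq[1:]))         # 2-tuples <e1,e2>, <e2,e3>, ...
    # ASSUMPTION: weight = pair count / number of instances.
    return {arc: c / len(instances) for arc, c in counts.items()}


def best_pattern(weights: Dict[Arc, float],
                 target: Hashable) -> Tuple[Optional[List[Hashable]], float]:
    """(F3): depth-first search for the path with the highest average arc weight."""
    succ = defaultdict(list)
    has_incoming = set()
    for (u, v), w in weights.items():
        succ[u].append((v, w))
        has_incoming.add(v)
    starts = [u for u in succ if u not in has_incoming]
    if not starts:
        return None, 0.0
    start = max(starts, key=lambda u: max(w for _, w in succ[u]))
    best: Optional[List[Hashable]] = None
    best_score = -1.0

    def dfs(path: List[Hashable], total: float) -> None:
        nonlocal best, best_score
        if len(path) > 1 and target in path:     # must contain the target structure
            score = total / (len(path) - 1)
            if score > best_score:
                best, best_score = list(path), score
        for v, w in succ.get(path[-1], []):
            if v not in path:                    # keep paths simple (no cycles)
                path.append(v)
                dfs(path, total + w)
                path.pop()

    dfs([start], 0.0)
    return best, best_score


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """(F4): one-sided P value of Fisher's exact test for the table [[a, b], [c, d]]."""
    n, row1, col1 = a + b + c + d, a + b, a + c
    tail = sum(comb(row1, x) * comb(n - row1, col1 - x)
               for x in range(a, min(row1, col1) + 1))
    return tail / comb(n, col1)


instances = [["A", "B", "C"], ["A", "B", "C"], ["A", "B", "D"]]
weights = build_graph(instances)
pattern, score = best_pattern(weights, target="B")
# Hypothetical counts: word at node in community / other words in community /
# word elsewhere / other words elsewhere.
p_value = fisher_exact_greater(8, 2, 1, 5)
print(pattern, score, p_value)
```

A smaller P value here indicates a stronger association between the word and the slot; how the source converts the P value into a displayed strength is not recoverable from this text.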
9. A visualization method for collocation formulas, characterized by comprising the following steps:
automatically acquiring collocation formulas using the method according to any one of claims 1 to 8;
for each acquired collocation formula, taking the dependency types and head words as nodes and arranging the nodes linearly from left to right according to their order in $G'$; the connecting arcs between the nodes are directed, each arc starting at the head word node with its arrow pointing to the dependent word node, and each arc displays its connection weight.
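One way to produce such a drawing is to emit Graphviz DOT text, as in the Python sketch below. The choice of Graphviz, the function name `to_dot`, and the node labels and weights are assumptions made for illustration; the source does not name a rendering tool. The `rankdir=LR` attribute asks for a left-to-right layout but does not by itself guarantee a strictly linear arrangement.

```python
from typing import Dict, List, Tuple


def to_dot(nodes: List[str], arcs: Dict[Tuple[str, str], float]) -> str:
    """Return DOT text with nodes in pattern order and weighted, directed arcs."""
    lines = ["digraph collocation {", "  rankdir=LR;"]   # left-to-right layout
    for name in nodes:                                    # nodes in their order in G'
        lines.append(f'  "{name}";')
    for (head, dependent), weight in arcs.items():
        # Arc starts at the head word node and points to the dependent word node.
        lines.append(f'  "{head}" -> "{dependent}" [label="{weight:.2f}"];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical nodes and connection weights, for illustration only.
dot = to_dot(
    nodes=["NSUBJ", "AUXPASS", "CORE"],
    arcs={("CORE", "NSUBJ"): 0.8, ("CORE", "AUXPASS"): 1.0},
)
print(dot)
```

The resulting text can be passed to any DOT renderer to obtain the arc diagram with the connection weights shown as arc labels.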
10. A system for automatically acquiring collocation formulas, comprising a computer-readable storage medium and a processor;
the computer-readable storage medium is used for storing executable instructions;
the processor is used for reading the executable instructions stored in the computer-readable storage medium and executing the method for automatically acquiring collocation formulas according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011413473.6A CN112395871A (en) | 2020-12-02 | 2020-12-02 | Collocation configuration type automatic acquisition method and system and visualization method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011413473.6A CN112395871A (en) | 2020-12-02 | 2020-12-02 | Collocation configuration type automatic acquisition method and system and visualization method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112395871A (en) | 2021-02-23 |
Family
ID=74604428
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011413473.6A Pending CN112395871A (en) | 2020-12-02 | 2020-12-02 | Collocation configuration type automatic acquisition method and system and visualization method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112395871A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114201618A (en) * | 2022-02-17 | 2022-03-18 | 药渡经纬信息科技(北京)有限公司 | Drug development literature visualization interpretation method and system |
CN116227497A (en) * | 2022-11-29 | 2023-06-06 | 广东外语外贸大学 | Sentence structure analysis method and device based on deep neural network |
CN116562278A (en) * | 2023-03-02 | 2023-08-08 | 华中科技大学 | Word similarity detection method and system |
CN116562278B (en) * | 2023-03-02 | 2024-05-14 | 华中科技大学 | Word similarity detection method and system |
2020-12-02: application CN202011413473.6A (CN112395871A), status Pending
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114201618A (en) * | 2022-02-17 | 2022-03-18 | 药渡经纬信息科技(北京)有限公司 | Drug development literature visualization interpretation method and system |
CN116227497A (en) * | 2022-11-29 | 2023-06-06 | 广东外语外贸大学 | Sentence structure analysis method and device based on deep neural network |
CN116227497B (en) * | 2022-11-29 | 2023-09-26 | 广东外语外贸大学 | Sentence structure analysis method and device based on deep neural network |
CN116562278A (en) * | 2023-03-02 | 2023-08-08 | 华中科技大学 | Word similarity detection method and system |
CN116562278B (en) * | 2023-03-02 | 2024-05-14 | 华中科技大学 | Word similarity detection method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
RU2628431C1 (en) | Selection of text classifier parameter based on semantic characteristics | |
RU2628436C1 (en) | Classification of texts on natural language based on semantic signs | |
CN106776544B (en) | Character relation recognition method and device and word segmentation method | |
CN111259653A (en) | Knowledge graph question-answering method, system and terminal based on entity relationship disambiguation | |
US10445428B2 (en) | Information object extraction using combination of classifiers | |
US20170161255A1 (en) | Extracting entities from natural language texts | |
RU2640297C2 (en) | Definition of confidence degrees related to attribute values of information objects | |
CN111339269A (en) | Knowledge graph question-answer training and application service system with automatically generated template | |
AU2014315620A1 (en) | Methods and systems of four valued analogical transformation operators used in natural language processing and other applications | |
CN112395871A (en) | Collocation configuration type automatic acquisition method and system and visualization method | |
CN109840255A (en) | Reply document creation method, device, equipment and storage medium | |
Becker et al. | COCO-EX: A tool for linking concepts from texts to ConceptNet | |
CN111460145A (en) | Learning resource recommendation method, device and storage medium | |
US20090234852A1 (en) | Sub-linear approximate string match | |
Krishna et al. | A dataset for sanskrit word segmentation | |
CN113705237A (en) | Relation extraction method and device fusing relation phrase knowledge and electronic equipment | |
CN112613321A (en) | Method and system for extracting entity attribute information in text | |
CN111737541B (en) | Semantic recognition and evaluation method supporting multiple languages | |
CN111723182A (en) | Key information extraction method and device for vulnerability text | |
CN111143448A (en) | Knowledge base construction method | |
CN114138929A (en) | Question answering method and device | |
KR102330190B1 (en) | Apparatus and method for embedding multi-vector document using semantic decomposition of complex documents | |
CN113468875A (en) | MNet method for semantic analysis of natural language interaction interface of SCADA system | |
Eppa et al. | Machine Learning Techniques for Multisource Plagiarism Detection | |
CN106156259A (en) | A kind of user behavior information displaying method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||