AU2021105953A4 - Method for fine-grained domain terminology self-learning based on contextual semantics - Google Patents

Method for fine-grained domain terminology self-learning based on contextual semantics Download PDF

Info

Publication number
AU2021105953A4
AU2021105953A4
Authority
AU
Australia
Prior art keywords
terminology
context
candidate
corpus
domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
AU2021105953A
Inventor
Jianhui Chen
Shaofu Lin
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to AU2021105953A priority Critical patent/AU2021105953A4/en
Application granted granted Critical
Publication of AU2021105953A4 publication Critical patent/AU2021105953A4/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

In order to solve the problem that existing textual terminology learning technology based on large training samples cannot meet the requirement of fine-grained domain terminology learning from smaller samples, the invention provides a fine-grained domain terminology self-learning method based on contextual semantics. By fusing contextual semantic information, the statistical and linguistic features of candidate terminology in a corpus are comprehensively expressed in terms of the recurrence counts of candidate terminology contexts. With reference to domain relevance and domain consistency, the domain-dependent bias value of each candidate terminology is calculated using the log-likelihood ratio, and new domain terminology is finally discovered autonomously by synthesizing the membership activation value of each candidate term. The self-learning technology for fine-grained domain terminology based on contextual semantics of the present invention can realize self-learning of terminology sets and promote the construction of domain-specific ontologies. It can not only be applied to term discovery and extraction in domains such as cognitive functions, but also be used as a candidate concept generation tool in concept extraction methods.

Description

[Figure 1, flow chart: Knowledge source → Data preprocessing (part-of-speech tagging, syntax analysis, morphological restoration) → candidate terminology set and candidate terminology contexts → target data corpus and control data corpus → context similarity → domain discrimination → dependence difference → membership activation value → domain terminology]
A Self-learning Method for Fine-grained Domain Terminology Based on Contextual
Semantics
TECHNICAL FIELD
The invention relates to a self-learning method for big data-driven domain terminology,
in particular to the self-learning of domain terminology collections based on text data
resources such as blogs, documents, web pages, etc., to realize the self-expansion of the
domain terminology library.
BACKGROUND
Big data knowledge engineering is an important content of artificial intelligence research,
and text data such as blogs, documents, and web pages are the most important sources of
knowledge. Traditional text-based terminology learning technology mainly uses machine
learning methods based on large training samples, such as conditional random fields, and
targets core, large-scale terminology in various domains: for example, gene names and
protein names in bioinformatics, or addresses, occupations, and other terms in social
media. However, as knowledge-driven artificial intelligence applications continue to
deepen, the required knowledge is becoming more refined and specialized. Fine-grained
domain terminology recognition and extraction from small samples has become an
important development trend for text-based terminology learning, and terminology
learning technology based on large training samples can hardly meet this demand.
SUMMARY
In order to solve the problem that existing textual terminology learning technology
based on large training samples cannot meet the needs of fine-grained domain terminology learning from smaller samples, the invention provides a fine-grained domain terminology self-learning method based on contextual semantics. By fusing contextual semantic information, the statistical and linguistic features of candidate terminology in a corpus are comprehensively expressed in terms of the recurrence counts of candidate terminology contexts. With reference to domain relevance and domain consistency, the domain-dependent bias value of each candidate terminology is calculated using the log-likelihood ratio, and new domain terminology is finally discovered autonomously by synthesizing the membership activation value of each candidate term. The self-learning technology for fine-grained domain terminology based on contextual semantics of the present invention can realize self-learning of terminology sets and promote the construction of domain-specific ontologies. It can not only be applied to term discovery and extraction in domains such as cognitive functions, but also be used as a candidate concept generation tool in concept extraction methods.
In order to solve the technical problem, the specific steps of the technical solution
adopted by the present invention are as follows:
Step 1: Build the initial terminology set and target corpus
An initial terminology set of 20-30 words is obtained by simplifying an existing
terminology set in the field or by constructing one manually; occurrences of the initial
terminology are then located with the Maximum Matching Algorithm, and their contexts
under a 35-word window are extracted to form the target corpus;
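For illustration only, the following minimal Python sketch shows one way to realize this step; the seed terms, the symmetric placement of the 35-word window, and all names are assumptions, not taken from the patent:

```python
# Sketch of Step 1: locate seed terminology by greedy maximum matching,
# then keep a 35-word window around each hit as a target-corpus context.

def max_match(tokens, term_set, max_len=5):
    """At each position, prefer the longest match from term_set."""
    hits = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).lower()
            if candidate in term_set:
                hits.append((i, candidate))
                i += n
                break
        else:
            i += 1
    return hits

def window_context(tokens, pos, width=35):
    """Roughly `width` words centred on the match (an assumption; the
    patent only names a 35-word window)."""
    lo = max(0, pos - width // 2)
    return " ".join(tokens[lo:lo + width])

seed_terms = {"working memory", "attention", "response inhibition"}  # 20-30 seeds
tokens = "attention shifts were measured during the working memory task".split()
target_corpus = [window_context(tokens, pos) for pos, term in max_match(tokens, seed_terms)]
print(target_corpus)
```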
Step 2: Construction of a control corpus
The control data set is divided into two parts: a general control corpus subset and a
domain control corpus subset; the former is composed of multi-domain terminology
outside the target domain together with their contexts; the latter is composed of
target-domain terminology other than the terminology to be learned, together with their
contexts;
Step 3: Knowledge source preprocessing based on context balanced binary tree
For the knowledge source to be extracted, use natural language processing technology to
identify noun phrases as the candidate terminology set, and extract their context sets
under a 35-word window to construct the candidate terminology context balanced binary
tree, where the node numbers and storage values of the candidate terminology context
balanced binary tree store the candidate terminology and their corresponding context
sets, respectively, as the basis for further screening and processing;
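As a minimal sketch of this storage structure, `sortedcontainers.SortedDict` can stand in for the patent's balanced binary tree, with the candidate term as key (node number) and its context set as the stored value; the library choice and all names are illustrative assumptions:

```python
# Sketch of Step 3's structure: a sorted mapping emulating the
# "candidate terminology context balanced binary tree".
from sortedcontainers import SortedDict  # pip install sortedcontainers

context_tree = SortedDict()

def add_context(term, context):
    """Append one 35-word-window context under its candidate term."""
    context_tree.setdefault(term, []).append(context)

add_context("response inhibition", "the go/no-go task probes response inhibition in adults")
add_context("response inhibition", "response inhibition scores declined with age")

# Later screening walks the candidates in sorted (tree) order:
for term, contexts in context_tree.items():
    print(term, len(contexts))
```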
Step 4: Terminology domain discrimination calculation based on context-corpus
correlation hypothesis
First, construct the correlation hypotheses between the term context and the corpus; on
this basis, comprehensively apply the log-likelihood ratio and the context-vector-based
sentence similarity measure to calculate the term domain discrimination Dtn(t);
Step 5: Calculate the domain-dependent bias value of candidate terminology
Construct a "headword-modifier" morphological skeleton model, and calculate the
similarity of the candidate terminology "headword" context in the target corpus and the
control corpus; first define the candidate terminology domain-dependent bias independent
variable DRG = W2/ W, where Wi>,W23OWi and W2 are the frequency of candidate
terminology contexts in the target corpus and the control corpus, respectively, and then
use the domain-dependent bias function Dte(t) =e n*DRG*ln2 (1), where e is the natural logarithm, n is the adjustment factor, and the value range of n is 10000-12000. Then the domains-dependent bias value of the candidate terminology is calculated, and then a binary tree of candidate terminology-dependent adjustment factors is constructed. Among them, The node number and storage value of binary tree of candidate term-dependent adjustment factors respectively store candidate terminology and their domain-dependent bias values;
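A minimal Python sketch of formula (1) follows; note that the negative exponent is an assumption inferred from the very small Dtm values in Table 1 (the original text is garbled at this point), and n and the counts are illustrative:

```python
# Sketch of formula (1): Dte(t) = e^(-n * DRG * ln 2) = 2^(-n * DRG),
# with DRG = W2/W1; the sign of the exponent is an assumption.
import math

def domain_dependent_bias(w1, w2, n=10000):
    """w1, w2: the candidate term's context frequencies in the target
    and control corpora, with w1 > 0."""
    drg = w2 / w1
    return math.exp(-n * drg * math.log(2))

print(domain_dependent_bias(w1=500, w2=0))  # 1.0: context never seen in control corpus
print(domain_dependent_bias(w1=500, w2=6))  # 2**-120, a vanishingly small bias
```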
Step 6: Calculate the membership activation value of the candidate terminology
Combining the results of step 4 and step 5, integrate the candidate terminology context
balanced binary tree and the candidate terminology-dependent adjustment factor binary
tree to construct a "discrimination-bias-membership" three-layer mapping activation
model, and calculate the membership activation value of the candidate terminology as
Dtm(t) = Dtn(t)*Dte(t), where Dtn(t) is the term domain discrimination obtained in step 4
and Dte(t) is the candidate term domain-dependent bias value obtained in step 5;
construct the candidate term membership activation value binary tree, where the node
numbers and storage values store the candidate terminology and their membership
activation values, respectively;
Step 7: Self-learning of fine-grained domain terminology
Based on the candidate term membership activation value binary tree, set the critical
activation value: draw the accuracy curve corresponding to different critical activation
values and take the threshold corresponding to the highest accuracy as the critical
activation value. Terminology that meets the critical value is regarded as newly
discovered domain terminology and is added to the initial terminology set, and the
method returns to step 1.
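For illustration, a minimal Python sketch of steps 6-7: combine the two scores into Dtm(t) and sweep candidate critical values for the one with the highest accuracy; the labeled sample and the threshold grid are illustrative assumptions:

```python
# Sketch of Steps 6-7: Dtm(t) = Dtn(t) * Dte(t), then pick the critical
# activation value that maximizes accuracy on a labeled sample.

def membership_activation(dtn, dte):
    return dtn * dte

def pick_critical_value(labeled, thresholds):
    """labeled: (dtm, is_true_domain_term) pairs; returns the threshold
    whose accept/reject decision scores the highest accuracy."""
    def accuracy(th):
        return sum((dtm >= th) == is_term for dtm, is_term in labeled) / len(labeled)
    return max(thresholds, key=accuracy)

labeled = [(6.8, True), (2.55, True), (1.14, False), (1.0e-15, False)]
print(pick_critical_value(labeled, thresholds=[1e-12, 1.0, 2.0, 5.0]))  # -> 2.0
```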
Further, in the step 4, the specific method and process of calculating the term domain
discrimination based on the context-corpus correlation hypothesis is:
Step 1): Define context-corpus correlation hypothesis
Hypothesis 1: The context of candidate terminology appears at the same frequency in the
target corpus and the control corpus;
Hypothesis 2: The context of candidate terminology appears at different frequencies in
the target corpus and the control corpus;
Step 2): Construction of the target corpus vector set
First, based on the target corpus, train a context-based "incoming-hiding-feedback"
three-layer neural network model; secondly, look through all the contexts in the target
corpus and input each context into the neural network model word by word to obtain the
corresponding multidimensional word vector for each word, the average value of each
dimension over all word vectors being used to construct the context vector; finally, the
context vectors of all the contexts in the target corpus are gathered to construct the
target corpus vector set;
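For illustration, a minimal Python sketch of steps 2)-4): gensim's `Word2Vec` is assumed here as a stand-in for the "incoming-hiding-feedback" three-layer network, and each context vector is the dimension-wise mean of its word vectors; all names and parameters are illustrative:

```python
# Sketch: train a word2vec-style three-layer model on a corpus, then
# represent each context sentence by averaging its word vectors.
import numpy as np
from gensim.models import Word2Vec  # pip install gensim

contexts = ["working memory load modulated prefrontal activity",
            "response inhibition was probed with a stop-signal task"]
model = Word2Vec([c.split() for c in contexts],
                 vector_size=100, window=5, min_count=1)

def context_vector(context, model):
    """Average the vectors of the in-vocabulary words of one context."""
    vecs = [model.wv[w] for w in context.split() if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

target_vector_set = [context_vector(c, model) for c in contexts]
```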
Step 3): Construct the control corpus vector set
Firstly, based on the control corpus, train a context-based "incoming-hiding-feedback"
three-layer neural network model; secondly, look through all the contexts in the control
corpus and input each context into the neural network model word by word to obtain the
corresponding multidimensional word vector for each word, the average value of each
dimension over all word vectors being used to construct the context vector; finally, the context vectors of all the contexts in the control corpus are gathered to construct the control corpus vector set;
Step 4): Construct candidate terminology context vector
First, based on the candidate terminology, look through the candidate term context
balanced binary tree to extract the corresponding contexts; then input the contexts one by
one into the three-layer neural network model of the control corpus to obtain the
multidimensional word vector corresponding to each word; finally, use the average value
of each dimension over all word vectors to construct the candidate terminology context vector;
Step 5): Terminology domain discrimination calculation combining log-likelihood
estimation and sentence similarity calculation
On the basis of the two hypotheses defined in step 1), use the binomial distribution
hypothesis to calculate the likelihood estimates L(H1) and L(H2), where

L(H1) = B(W1; W1+W2; P) * B(W2; W1+W2; P) and L(H2) = B(W1; W1+W2; P1) * B(W2; W1+W2; P2),

wherein W1 and W2 represent the frequencies of the candidate term context in the target
corpus and the control corpus, respectively, and P1 and P2 represent the probabilities of
occurrence of the candidate term context in the target corpus and the control corpus.
Combined with the binomial distribution hypothesis, B(W1; W1+W2; P) is expanded as

B(W1; W1+W2; P) = ((W1+W2)! / (W1! * W2!)) * P^W1 * (1-P)^W2 (2),

where P is the probability that the candidate term context in hypothesis 1 appears in the
target corpus. The corresponding log-likelihood ratio with 2 as the base is then
calculated as

T = -log2(L(H1)/L(H2)) = -log2([P^W1 * (1-P)^W2 * P^W2 * (1-P)^W1] / [P1^W1 * (1-P1)^W2 * P2^W2 * (1-P2)^W1]) (3),

which is used to calculate the probability of the context-corpus correlation hypothesis.
Then use

CosDis(a, b) = (a · b) / (|a| * |b|) (4)

to calculate the sentence similarity between each context sentence vector of the
candidate term and each context sentence vector in the target corpus vector set, where a
represents a context sentence vector of the candidate term and b represents a context
sentence vector in the target corpus vector set; W1 is obtained by counting the number of
times this similarity exceeds the set threshold (more than 50 times). Likewise, the
sentence similarity between each context sentence vector of the candidate term and each
context sentence vector in the control corpus vector set is calculated, and W2 is obtained
by counting the number of times that similarity exceeds the set threshold (more than 50
times).
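The following minimal Python sketch shows step 5) under one plausible reading of formulas (2)-(4); estimating P, P1, and P2 as relative frequencies, the 0.8 similarity threshold, and all names are assumptions not fixed by the patent:

```python
# Sketch: cosine similarity (formula (4)) yields the counts W1/W2, which
# feed the log-likelihood ratio of formulas (2)-(3).
import math
import numpy as np

def cos_dis(a, b):
    """Formula (4): cosine similarity of two context sentence vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def count_similar(cand_vecs, corpus_vecs, threshold=0.8):
    """W1 (vs. target set) or W2 (vs. control set): how many context
    pairs exceed the similarity threshold."""
    return sum(1 for a in cand_vecs for b in corpus_vecs
               if cos_dis(a, b) > threshold)

def llr_T(w1, w2):
    """Formula (3): T = -log2(L(H1)/L(H2)); the binomial coefficients of
    formula (2) appear in both likelihoods and cancel."""
    n = w1 + w2
    lg = lambda x: math.log2(max(x, 1e-12))  # guard against log2(0)
    p = w1 / n                # shared rate under hypothesis 1
    p1, p2 = w1 / n, w2 / n   # per-corpus rates under hypothesis 2
    num = w1 * lg(p) + w2 * lg(1 - p) + w2 * lg(p) + w1 * lg(1 - p)
    den = w1 * lg(p1) + w2 * lg(1 - p1) + w2 * lg(p2) + w1 * lg(1 - p2)
    return -(num - den)

print(llr_T(w1=120, w2=8))  # larger T favours corpus-specific rates (hypothesis 2)
```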
BRIEF DESCRIPTION OF THE FIGURES
Fig. 1 is a flow chart of the self-learning method for fine-grained domain terminology
based on contextual semantics according to the present invention.
DESCRIPTION OF THE INVENTION
The present invention will be further described below in conjunction with the drawings
and implementation cases:
The source data used in the domain term discovery method of the present invention is
from the PLOS ONE website, and 5000 articles are randomly crawled by searching the
"fMRI" and "Cognitive Function" keywords;
The concept collection of cognitive function terminology is composed of 803 cognitive
function terms from the Cognitive Atlas website;
The method flow chart of this embodiment is shown in Fig. 1, and specifically includes
the following steps:
Step 1: Build the initial terminology set and original target corpus
The initial terminology set is constructed by filtering the top 10 cognitive function
terminology that appear most frequently in the source data;
The original target corpus is composed of 932 paragraphs in the source data, each of
which contains terminology from the initial terminology set;
Step 2: Construction of the original control corpus
The original control corpus is composed of paragraphs in the source data set that do not
contain any of the 803 terms, together with 25032 paragraphs that contain the 803 terms
but not in the same sentence as the terminology;
Step 3: Build the original knowledge source corpus
The knowledge source corpus comes from the 150 latest articles that we randomly
crawled from the PLOS ONE website by searching the keywords "fMRI" and "Cognitive
Function" to construct a test corpus. Based on the cognitive glossary, 20 cognitive
function terms in these articles are marked.
Step 4: Data preprocessing to obtain candidate terminology set, target corpus context,
control corpus context and knowledge source context
Step (1): Use the HanLP tool to perform part-of-speech tagging and syntactic analysis on
the knowledge source data, and extract all noun phrases in the corpus;
Step (2): Remove words such as articles and descriptive adjectives from the noun phrases
above;
Step (3): Split noun phrases connected by "and" or "or" into two parts; for example, split
"anchoring and apperception" into "anchoring" and "apperception";
Step (4): Further split noun phrases with grammatical structures such as "noun+noun" or
"adjective+noun" to extract more fine-grained candidate terminology in a second pass;
for example, "audiovisual" and "perception" are generated from "audiovisual
perception";
Step (5): Perform morphological restoration and de-duplication to obtain the set of
candidate terminology; extract the context information corresponding to each candidate
term in the knowledge source corpus, taking the 35 words around the candidate term as
its term context. The contexts of the target corpus and the control corpus are obtained in
the same way.
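As a minimal sketch of this preprocessing, spaCy is used below as a stand-in for the HanLP toolchain named above; the filtering rules are simplified and illustrative:

```python
# Sketch of steps (1)-(5): noun-phrase extraction, coordination splitting,
# finer-grained second-pass splitting, and lemmatization.
import re
import spacy

nlp = spacy.load("en_core_web_sm")  # POS tagging, parsing, lemmatization

def candidate_terms(text):
    cands = set()
    for chunk in nlp(text).noun_chunks:                      # step (1)
        # step (3): split phrases coordinated by "and" / "or"
        for part in re.split(r"\s+(?:and|or)\s+", chunk.text.lower()):
            # steps (2) and (5): lemmatize and drop articles; a fuller
            # version would also filter descriptive adjectives
            words = [t.lemma_ for t in nlp(part)
                     if t.pos_ in ("NOUN", "PROPN", "ADJ")]
            if words:
                cands.add(" ".join(words))  # e.g. "audiovisual perception"
                cands.update(words)         # step (4): finer-grained split
    return cands

print(candidate_terms("Audiovisual perception and response inhibition were tested."))
```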
Step 5: Calculate the discrimination degree of terminology domain
Use the log-likelihood ratio (3) to calculate the possibility of the context-corpus
correlation hypothesis; with the binomial distribution hypothesis, the likelihood is
expanded by formula (2). Then calculate the sentence similarity between each context
vector of the candidate term and each vector in the target corpus vector set using
formula (4), and count the number of times the similarity exceeds the set threshold. The
number of occurrences of the candidate term context in the control corpus is obtained in
the same way.
Step 6: Calculate the domain-dependent bias value of candidate terminology
For each candidate term, calculate the domain-dependent bias value Dte(t) according to
formula (1);
Step 7: Terminology self-learning
Then calculate the membership activation value Dtm(t) = Dtn(t)*Dte(t) of each candidate
term, set the critical activation value, and regard each term that meets the critical value
as a new term found in the domain; add it to the initial terminology set and repeat from
step 1 to realize the self-learning of the terminology set and the self-improvement of this
method.
In this experiment, a total of 29 domain terms were selected, of which 25 were found to
be cognitive function terms, for a term discovery accuracy rate of 86.20%. The following
table shows the detailed results of term discovery:
Table 1 Details of terminology findings
concept | Dtm(t) | Remark | concept | Dtm(t) | Remark
langua | 2.3501686958 | True | fixatio | 1.0677460999E-15 | new entity
researc | 1.9968347325 | False | pain | 4.5413844800E-14 | new entity
search | 2.0031186447 | new | detecti | 7.6070869893E-14 | True
emotio | 1.5136559401 | True | inhibit | 1.8986916997E-12 | new entity
executi | 2.7032331048 | False | associ | 3.9663843892E-11 | new entity
movem | 1.7365628165 | True | learnin | 2.1390137641E-10 | new entity
percept | 2.5561699747 | True | decisio | 3.0194903713E-9 | new entity
reactio | 1.1475290757 | False | contex | 4.6116571446E-9 | new entity
valence | 1.0022558949 | new | focus | 1.7419689090E-8 | new entity
knowle | 6.4427228347 | new | percep | 5.9751417093E-7 | new entity
stress | 8.2889849868 | new | action | 3.8605674748E-4 | new entity
strateg | 2.5904326047 | new | activat | 7.2569481516E-4 | True
judgme | 6.8298661996 | new | attenti | 0.00133302481525 | new entity
integrat | 1.2711981915 | new | memor | 0.00367965728662 | new entity
interact | 3.0983927291 | False | | |
In order to verify the effectiveness of the method of the present invention, the algorithm
proposed in this experiment is compared with the DR-DC, CTROL, CRF, and other
algorithms. The experimental results show that the accuracy rate of the DR-DC
algorithm is 16.52%, that of the CTROL algorithm is 31.09%, and that of the CRF
algorithm is 43.22%; term discovery therefore achieves a markedly higher accuracy rate
by adopting the fine-grained domain terminology self-learning algorithm based on
contextual semantics.
It can be seen that the self-learning technology of fine-grained domain terminology based
on contextual semantics is conducive to the self-learning of the domain terminology set
of text data resources and realizes the self-expansion of the domain terminology database.

Claims (1)

THE CLAIMS DEFINING THE INVENTION ARE AS FOLLOWS:
1. A self-learning method for fine-grained domain terminology based on contextual
semantics, characterized in that it includes the following steps:
Step 1: Build the initial terminology set and target corpus
An initial terminology set of 20-30 words is obtained by simplifying an existing
terminology set in the field or by constructing one manually; occurrences of the initial
terminology are then located with the Maximum Matching Algorithm, and their contexts
under a 35-word window are extracted to form the target corpus;
Step 2: Construction of a control corpus
The control data set is divided into two parts: a general control corpus subset and a
domain control corpus subset; the former is composed of multi-domain terminology
outside the target domain together with their contexts; the latter is composed of
target-domain terminology other than the terminology to be learned, together with their
contexts;
Step 3: Knowledge source preprocessing based on context balanced binary tree
For the knowledge source to be extracted, use natural language processing technology to
identify noun phrases as the candidate terminology set, and extract their context sets
under a 35-word window to construct the candidate terminology context balanced binary
tree, where the node numbers and storage values of the candidate terminology context
balanced binary tree store the candidate terminology and their corresponding context
sets, respectively, as the basis for further screening and processing;
Step 4: Terminology domain discrimination calculation based on context-corpus
correlation hypothesis
First, construct the correlation hypotheses between the term context and the corpus; on
this basis, comprehensively apply the log-likelihood ratio and the context-vector-based
sentence similarity measure to calculate the term domain discrimination Dtn(t);
Step 5: Calculate the domain-dependent bias value of candidate terminology
Construct a "headword-modifier" morphological skeleton model, and calculate the
similarity of the candidate terminology "headword" context in the target corpus and the
control corpus; first define the candidate terminology domain-dependent bias independent
variable DRG = W2/ W, where Wi>,W23OWi and W2 are the frequency of candidate
terminology contexts in the target corpus and the control corpus, respectively, and then
use the domain-dependent bias function Dte(t) =e n*DRG*ln2 (1), where e is the natural
logarithm, n is the adjustment factor, and the value range of n is 10000-12000. Then the
domains-dependent bias value of the candidate terminology is calculated, and then a
binary tree of candidate terminology-dependent adjustment factors is constructed. Among
them, The node number and storage value of binary tree of candidate term-dependent
adjustment factors respectively store candidate terminology and their domain-dependent
bias values;
Step 6: Calculate the membership activation value of the candidate terminology
Combining the results of step 4 and step 5, integrate the candidate terminology context
balanced binary tree and the candidate terminology-dependent adjustment factor binary
tree to construct a "discrimination-bias-membership" three-layer mapping activation
model, and calculate the membership activation value of the candidate terminology as
Dtm(t) = Dtn(t)*Dte(t), where Dtn(t) is the term domain discrimination obtained in step 4
and Dte(t) is the candidate term domain-dependent bias value obtained in step 5;
construct the candidate term membership activation value binary tree, where the node
numbers and storage values store the candidate terminology and their membership
activation values, respectively;
Step 7: Self-learning of fine-grained domain terminology
Based on the candidate term membership activation value binary tree, set the critical
activation value: draw the accuracy curve corresponding to different critical activation
values and take the threshold corresponding to the highest accuracy as the critical
activation value. Terminology that meets the critical value is regarded as newly
discovered domain terminology and is added to the initial terminology set, and the
method returns to step 1.
Further, in the step 4, the specific method and process of calculating the term domain
discrimination based on the context-corpus correlation hypothesis is:
Step 1): Define context-corpus correlation hypothesis
Hypothesis 1: The context of candidate terminology appears at the same frequency in the
target corpus and the control corpus;
Hypothesis 2: The context of candidate terminology appears at different frequencies in
the target corpus and the control corpus;
Step 2): Construction of the target corpus vector set
First, based on the target corpus, train a context-based "incoming-hiding-feedback"
three-layer neural network model; secondly, look through all the contexts in the target
corpus and input each context into the neural network model word by word to obtain the
corresponding multidimensional word vector for each word, the average value of each
dimension over all word vectors being used to construct the context vector; finally, the
context vectors of all the contexts in the target corpus are gathered to construct the
target corpus vector set;
Step 3): Construct the control corpus vector set
Firstly, based on the control corpus, train a context-based "incoming-hiding-feedback"
three-layer neural network model; secondly, look through all the contexts in the control
corpus and input each context into the neural network model word by word to obtain the
corresponding multidimensional word vector for each word, the average value of each
dimension over all word vectors being used to construct the context vector; finally, the
context vectors of all the contexts in the control corpus are gathered to construct the
control corpus vector set;
Step 4): Construct candidate terminology context vector
First, based on the candidate terminology, look through the candidate term context
balanced binary tree to extract the corresponding contexts; then input the contexts one by
one into the three-layer neural network model of the control corpus to obtain the
multidimensional word vector corresponding to each word; finally, use the average value
of each dimension over all word vectors to construct the candidate terminology context
vector;
Step 5): Terminology domain discrimination calculation combining log-likelihood
estimation and sentence similarity calculation
On the basis of the two hypotheses defined in step 1), use the binomial distribution
hypothesis to calculate the likelihood estimates L(H1) and L(H2), where

L(H1) = B(W1; W1+W2; P) * B(W2; W1+W2; P) and L(H2) = B(W1; W1+W2; P1) * B(W2; W1+W2; P2),

wherein W1 and W2 represent the frequencies of the candidate term context in the target
corpus and the control corpus, respectively, and P1 and P2 represent the probabilities of
occurrence of the candidate term context in the target corpus and the control corpus.
Combined with the binomial distribution hypothesis, B(W1; W1+W2; P) is expanded as

B(W1; W1+W2; P) = ((W1+W2)! / (W1! * W2!)) * P^W1 * (1-P)^W2 (2),

where P is the probability that the candidate term context in hypothesis 1 appears in the
target corpus. The corresponding log-likelihood ratio with 2 as the base is then
calculated as

T = -log2(L(H1)/L(H2)) = -log2([P^W1 * (1-P)^W2 * P^W2 * (1-P)^W1] / [P1^W1 * (1-P1)^W2 * P2^W2 * (1-P2)^W1]) (3),

which is used to calculate the probability of the context-corpus correlation hypothesis.
Then use

CosDis(a, b) = (a · b) / (|a| * |b|) (4)

to calculate the sentence similarity between each context sentence vector of the
candidate term and each context sentence vector in the target corpus vector set, where a
represents a context sentence vector of the candidate term and b represents a context
sentence vector in the target corpus vector set; W1 is obtained by counting the number of
times this similarity exceeds the set threshold (more than 50 times). Likewise, the
sentence similarity between each context sentence vector of the candidate term and each
context sentence vector in the control corpus vector set is calculated, and W2 is obtained
by counting the number of times that similarity exceeds the set threshold (more than 50
times).
AU2021105953A 2021-08-19 2021-08-19 Method for fine-grained domain terminology self-learning based on contextual semantics Active AU2021105953A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2021105953A AU2021105953A4 (en) 2021-08-19 2021-08-19 Method for fine-grained domain terminology self-learning based on contextual semantics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2021105953A AU2021105953A4 (en) 2021-08-19 2021-08-19 Method for fine-grained domain terminology self-learning based on contextual semantics

Publications (1)

Publication Number Publication Date
AU2021105953A4 true AU2021105953A4 (en) 2021-10-28

Family

ID=78179673

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2021105953A Active AU2021105953A4 (en) 2021-08-19 2021-08-19 Method for fine-grained domain terminology self-learning based on contextual semantics

Country Status (1)

Country Link
AU (1) AU2021105953A4 (en)

Similar Documents

Publication Publication Date Title
CN106844658B (en) Automatic construction method and system of Chinese text knowledge graph
Hadni et al. Word sense disambiguation for Arabic text categorization.
Rahimi et al. An overview on extractive text summarization
CN108038106B (en) Fine-grained domain term self-learning method based on context semantics
CN114254653A (en) Scientific and technological project text semantic extraction and representation analysis method
Saju et al. A survey on efficient extraction of named entities from new domains using big data analytics
Alqahtani et al. A survey of text matching techniques
Adhitama et al. Topic labeling towards news document collection based on Latent Dirichlet Allocation and ontology
Bella et al. Domain-based sense disambiguation in multilingual structured data
Karpagam et al. A framework for intelligent question answering system using semantic context-specific document clustering and Wordnet
Kalo et al. Knowlybert-hybrid query answering over language models and knowledge graphs
Prabowo et al. Systematic literature review on abstractive text summarization using kitchenham method
Tungthamthiti et al. Recognition of sarcasm in microblogging based on sentiment analysis and coherence identification
CN114239828A (en) Supply chain affair map construction method based on causal relationship
AU2021105953A4 (en) Method for fine-grained domain terminology self-learning based on contextual semantics
Liu et al. Modelling and Implementation of a Knowledge Question-answering System for Product Quality Problem Based on Knowledge Graph
CN114298020A (en) Keyword vectorization method based on subject semantic information and application thereof
Akhgari et al. Sem-TED: semantic twitter event detection and adapting with news stories
Lezama Sanchez et al. A Behavior Analysis of the Impact of Semantic Relationships on Topic Discovery
Tohalino et al. Using virtual edges to extract keywords from texts modeled as complex networks
Ibrahiem et al. FEATURE EXTRACTION ENCHANCEMENT IN USERS' ATTITUDE DETECTION
Tang et al. Resolve Out of Vocabulary with Long Short-Term Memory Networks for Morphology
Guruvayur et al. Automatic Relationship Construction in Domain Ontology Engineering using Semantic and Thematic Graph Generation Process and Convolution Neural Network
Mills et al. A comparative survey on NLP/U methodologies for processing multi-documents
Aamot Literature-based discovery for oceanographic climate science

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)