AU2021105953A4 - Method for fine-grained domain terminology self-learning based on contextual semantics - Google Patents

Method for fine-grained domain terminology self-learning based on contextual semantics Download PDF

Info

Publication number
AU2021105953A4
AU2021105953A4
Authority
AU
Australia
Prior art keywords
terminology
context
candidate
corpus
domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
AU2021105953A
Inventor
Jianhui Chen
Shaofu Lin
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to AU2021105953A priority Critical patent/AU2021105953A4/en
Application granted granted Critical
Publication of AU2021105953A4 publication Critical patent/AU2021105953A4/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

In order to solve the problem that existing textual terminology learning technology based on large training samples cannot meet the requirement of fine-grained domain terminology learning from smaller samples, the invention provides a fine-grained domain terminology self-learning method based on contextual semantics. By fusing contextual semantic information, the statistical and linguistic features of candidate terminology in a corpus are comprehensively expressed in terms of the recurrence counts of candidate terminology contexts. With reference to domain relevance and domain consistency, the domain-dependent bias value of each candidate terminology is calculated using the log-likelihood ratio, and new domain terminology is finally discovered autonomously by synthesizing the membership activation value of each candidate term. The self-learning technology for fine-grained domain terminology based on contextual semantics of the present invention can realize self-learning of terminology sets and promote the construction of domain-specific ontologies. It can not only be applied to term discovery and extraction in domains such as cognitive functions, but also be used as a candidate concept generation tool in concept extraction methods.

Description

[Figure 1, flow chart: Knowledge source → Data preprocessing (part-of-speech tagging, syntax analysis, morphological restoration) → candidate terminology set and candidate terminology contexts → target data corpus and control data corpus → context similarity → domain discrimination → dependence difference → membership activation value → domain terminology]
A Self-learning Method for Fine-grained Domain Terminology Based on Contextual
Semantics
TECHNICAL FIELD
The invention relates to a self-learning method for big data-driven domain terminology,
in particular to the self-learning of domain terminology collections based on text data
resources such as blogs, documents, web pages, etc., to realize the self-expansion of the
domain terminology library.
BACKGROUND
Big data knowledge engineering is an important content of artificial intelligence research,
and text data such as blogs, documents, and web pages are the most important sources of
knowledge. Traditional text-based terminology learning technology mainly uses machine
learning methods based on large training samples, such as conditional random fields, and
targets core, large-scale terminology in various domains: for example, gene names and
protein names in bioinformatics, or addresses, occupations, and other terms in social
media. However, as knowledge-driven artificial intelligence applications continue to
deepen, the required knowledge is becoming more refined and specialized. Fine-grained
domain terminology recognition and extraction from small samples has become an
important development trend for text-based terminology learning, and terminology
learning technology based on large training samples can hardly meet this demand.
SUMMARY
In order to solve the problem that existing textual terminology learning technology
based on large training samples cannot meet the needs of fine-grained domain terminology learning from smaller samples, the invention provides a fine-grained domain terminology self-learning method based on contextual semantics. By fusing contextual semantic information, the statistical and linguistic features of candidate terminology in a corpus are comprehensively expressed in terms of the recurrence counts of candidate terminology contexts. With reference to domain relevance and domain consistency, the domain-dependent bias value of each candidate terminology is calculated using the log-likelihood ratio, and new domain terminology is finally discovered autonomously by synthesizing the membership activation value of each candidate term. The self-learning technology for fine-grained domain terminology based on contextual semantics of the present invention can realize self-learning of terminology sets and promote the construction of domain-specific ontologies. It can not only be applied to term discovery and extraction in domains such as cognitive functions, but also be used as a candidate concept generation tool in concept extraction methods.
In order to solve the technical problem, the specific steps of the technical solution
adopted by the present invention are as follows:
Step 1: Build the initial terminology set and target corpus
An initial terminology set of 20-30 words is obtained by simplifying an existing
terminology set in the field or by constructing one manually; occurrences of the initial
terminology are then located with the Maximum Matching Algorithm, and their contexts
under a 35-word window are extracted to form the target corpus;
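For illustration only, the following minimal Python sketch shows one way to realize this step; the seed terms, the symmetric placement of the 35-word window, and all names are assumptions, not taken from the patent:

```python
# Sketch of Step 1: locate seed terminology by greedy maximum matching,
# then keep a 35-word window around each hit as a target-corpus context.

def max_match(tokens, term_set, max_len=5):
    """At each position, prefer the longest match from term_set."""
    hits = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).lower()
            if candidate in term_set:
                hits.append((i, candidate))
                i += n
                break
        else:
            i += 1
    return hits

def window_context(tokens, pos, width=35):
    """Roughly `width` words centred on the match (an assumption; the
    patent only names a 35-word window)."""
    lo = max(0, pos - width // 2)
    return " ".join(tokens[lo:lo + width])

seed_terms = {"working memory", "attention", "response inhibition"}  # 20-30 seeds
tokens = "attention shifts were measured during the working memory task".split()
target_corpus = [window_context(tokens, pos) for pos, term in max_match(tokens, seed_terms)]
print(target_corpus)
```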
Step 2: Construction of a control corpus
The control data set is divided into two parts: a general control corpus subset and a
domain control corpus subset; the former is composed of multi-domain terminology
outside the target domain together with their contexts; the latter is composed of
target-domain terminology other than the terminology to be learned, together with their
contexts;
Step 3: Knowledge source preprocessing based on context balanced binary tree
For the knowledge source to be extracted, use natural language processing technology to
identify noun phrases as the candidate terminology set, and extract their context sets
under a 35-word window to construct the candidate terminology context balanced binary
tree, where the node numbers and storage values of the candidate terminology context
balanced binary tree store the candidate terminology and their corresponding context
sets, respectively, as the basis for further screening and processing;
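As a minimal sketch of this storage structure, `sortedcontainers.SortedDict` can stand in for the patent's balanced binary tree, with the candidate term as key (node number) and its context set as the stored value; the library choice and all names are illustrative assumptions:

```python
# Sketch of Step 3's structure: a sorted mapping emulating the
# "candidate terminology context balanced binary tree".
from sortedcontainers import SortedDict  # pip install sortedcontainers

context_tree = SortedDict()

def add_context(term, context):
    """Append one 35-word-window context under its candidate term."""
    context_tree.setdefault(term, []).append(context)

add_context("response inhibition", "the go/no-go task probes response inhibition in adults")
add_context("response inhibition", "response inhibition scores declined with age")

# Later screening walks the candidates in sorted (tree) order:
for term, contexts in context_tree.items():
    print(term, len(contexts))
```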
Step 4: Terminology domain discrimination calculation based on context-corpus
correlation hypothesis
First, construct the correlation hypotheses between the term context and the corpus; on
this basis, comprehensively apply the log-likelihood ratio and the context-vector-based
sentence similarity measure to calculate the term domain discrimination Dtn(t);
Step 5: Calculate the domain-dependent bias value of candidate terminology
Construct a "headword-modifier" morphological skeleton model, and calculate the
similarity of the candidate terminology "headword" context in the target corpus and the
control corpus; first define the candidate terminology domain-dependent bias independent
variable DRG = W2/ W, where Wi>,W23OWi and W2 are the frequency of candidate
terminology contexts in the target corpus and the control corpus, respectively, and then
use the domain-dependent bias function Dte(t) =e n*DRG*ln2 (1), where e is the natural logarithm, n is the adjustment factor, and the value range of n is 10000-12000. Then the domains-dependent bias value of the candidate terminology is calculated, and then a binary tree of candidate terminology-dependent adjustment factors is constructed. Among them, The node number and storage value of binary tree of candidate term-dependent adjustment factors respectively store candidate terminology and their domain-dependent bias values;
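A minimal Python sketch of formula (1) follows; note that the negative exponent is an assumption inferred from the very small Dtm values in Table 1 (the original text is garbled at this point), and n and the counts are illustrative:

```python
# Sketch of formula (1): Dte(t) = e^(-n * DRG * ln 2) = 2^(-n * DRG),
# with DRG = W2/W1; the sign of the exponent is an assumption.
import math

def domain_dependent_bias(w1, w2, n=10000):
    """w1, w2: the candidate term's context frequencies in the target
    and control corpora, with w1 > 0."""
    drg = w2 / w1
    return math.exp(-n * drg * math.log(2))

print(domain_dependent_bias(w1=500, w2=0))  # 1.0: context never seen in control corpus
print(domain_dependent_bias(w1=500, w2=6))  # 2**-120, a vanishingly small bias
```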
Step 6: Calculate the membership activation value of the candidate terminology
Combining the results of step 4 and step 5, integrate the candidate terminology context
balanced binary tree and the candidate terminology-dependent adjustment factor binary
tree to construct a "discrimination-bias-membership" three-layer mapping activation
model, and calculate the membership activation value of the candidate terminology as
Dtm(t) = Dtn(t)*Dte(t), where Dtn(t) is the term domain discrimination obtained in step 4
and Dte(t) is the candidate term domain-dependent bias value obtained in step 5;
construct the candidate term membership activation value binary tree, where the node
numbers and storage values store the candidate terminology and their membership
activation values, respectively;
Step 7: Self-learning of fine-grained domain terminology
Based on the candidate term membership activation value binary tree, set the critical
activation value: draw the accuracy curve corresponding to different critical activation
values and take the threshold corresponding to the highest accuracy as the critical
activation value. Terminology that meets the critical value is regarded as newly
discovered domain terminology and is added to the initial terminology set, and the
method returns to step 1.
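For illustration, a minimal Python sketch of steps 6-7: combine the two scores into Dtm(t) and sweep candidate critical values for the one with the highest accuracy; the labeled sample and the threshold grid are illustrative assumptions:

```python
# Sketch of Steps 6-7: Dtm(t) = Dtn(t) * Dte(t), then pick the critical
# activation value that maximizes accuracy on a labeled sample.

def membership_activation(dtn, dte):
    return dtn * dte

def pick_critical_value(labeled, thresholds):
    """labeled: (dtm, is_true_domain_term) pairs; returns the threshold
    whose accept/reject decision scores the highest accuracy."""
    def accuracy(th):
        return sum((dtm >= th) == is_term for dtm, is_term in labeled) / len(labeled)
    return max(thresholds, key=accuracy)

labeled = [(6.8, True), (2.55, True), (1.14, False), (1.0e-15, False)]
print(pick_critical_value(labeled, thresholds=[1e-12, 1.0, 2.0, 5.0]))  # -> 2.0
```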
Further, in the step 4, the specific method and process of calculating the term domain
discrimination based on the context-corpus correlation hypothesis is:
Step 1): Define context-corpus correlation hypothesis
Hypothesis 1: The context of candidate terminology appears at the same frequency in the
target corpus and the control corpus;
Hypothesis 2: The context of candidate terminology appears at different frequencies in
the target corpus and the control corpus;
Step 2): Construction of the target corpus vector set
First, based on the target corpus, train a context-based "incoming-hiding-feedback"
three-layer neural network model; secondly, look through all the contexts in the target
corpus and input each context into the neural network model word by word to obtain the
corresponding multidimensional word vector for each word, the average value of each
dimension over all word vectors being used to construct the context vector; finally, the
context vectors of all the contexts in the target corpus are gathered to construct the
target corpus vector set;
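For illustration, a minimal Python sketch of steps 2)-4): gensim's `Word2Vec` is assumed here as a stand-in for the "incoming-hiding-feedback" three-layer network, and each context vector is the dimension-wise mean of its word vectors; all names and parameters are illustrative:

```python
# Sketch: train a word2vec-style three-layer model on a corpus, then
# represent each context sentence by averaging its word vectors.
import numpy as np
from gensim.models import Word2Vec  # pip install gensim

contexts = ["working memory load modulated prefrontal activity",
            "response inhibition was probed with a stop-signal task"]
model = Word2Vec([c.split() for c in contexts],
                 vector_size=100, window=5, min_count=1)

def context_vector(context, model):
    """Average the vectors of the in-vocabulary words of one context."""
    vecs = [model.wv[w] for w in context.split() if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

target_vector_set = [context_vector(c, model) for c in contexts]
```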
Step 3): Construct the control corpus vector set
Firstly, based on the control corpus, train a context-based "incoming-hiding-feedback"
three-layer neural network model; secondly, look through all the contexts in the control
corpus and input each context into the neural network model word by word to obtain the
corresponding multidimensional word vector for each word, the average value of each
dimension over all word vectors being used to construct the context vector; finally, the context vectors of all the contexts in the control corpus are gathered to construct the control corpus vector set;
Step 4): Construct candidate terminology context vector
First, based on the candidate terminology, look through the candidate term context
balanced binary tree to extract the corresponding contexts; then input the contexts one by
one into the three-layer neural network model of the control corpus to obtain the
multidimensional word vector corresponding to each word; finally, use the average value
of each dimension over all word vectors to construct the candidate terminology context vector;
Step 5): Terminology domain discrimination calculation combining log-likelihood
estimation and sentence similarity calculation
On the basis of the two hypotheses defined in step 1), use the binomial distribution
hypothesis to calculate the likelihood estimates L(H1) and L(H2), where

L(H1) = B(W1; W1+W2; P) * B(W2; W1+W2; P) and L(H2) = B(W1; W1+W2; P1) * B(W2; W1+W2; P2),

wherein W1 and W2 represent the frequencies of the candidate term context in the target
corpus and the control corpus, respectively, and P1 and P2 represent the probabilities of
occurrence of the candidate term context in the target corpus and the control corpus.
Combined with the binomial distribution hypothesis, B(W1; W1+W2; P) is expanded as

B(W1; W1+W2; P) = ((W1+W2)! / (W1! * W2!)) * P^W1 * (1-P)^W2 (2),

where P is the probability that the candidate term context in hypothesis 1 appears in the
target corpus. The corresponding log-likelihood ratio with 2 as the base is then
calculated as

T = -log2(L(H1)/L(H2)) = -log2([P^W1 * (1-P)^W2 * P^W2 * (1-P)^W1] / [P1^W1 * (1-P1)^W2 * P2^W2 * (1-P2)^W1]) (3),

which is used to calculate the probability of the context-corpus correlation hypothesis.
Then use

CosDis(a, b) = (a · b) / (|a| * |b|) (4)

to calculate the sentence similarity between each context sentence vector of the
candidate term and each context sentence vector in the target corpus vector set, where a
represents a context sentence vector of the candidate term and b represents a context
sentence vector in the target corpus vector set; W1 is obtained by counting the number of
times this similarity exceeds the set threshold (more than 50 times). Likewise, the
sentence similarity between each context sentence vector of the candidate term and each
context sentence vector in the control corpus vector set is calculated, and W2 is obtained
by counting the number of times that similarity exceeds the set threshold (more than 50
times).
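The following minimal Python sketch shows step 5) under one plausible reading of formulas (2)-(4); estimating P, P1, and P2 as relative frequencies, the 0.8 similarity threshold, and all names are assumptions not fixed by the patent:

```python
# Sketch: cosine similarity (formula (4)) yields the counts W1/W2, which
# feed the log-likelihood ratio of formulas (2)-(3).
import math
import numpy as np

def cos_dis(a, b):
    """Formula (4): cosine similarity of two context sentence vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def count_similar(cand_vecs, corpus_vecs, threshold=0.8):
    """W1 (vs. target set) or W2 (vs. control set): how many context
    pairs exceed the similarity threshold."""
    return sum(1 for a in cand_vecs for b in corpus_vecs
               if cos_dis(a, b) > threshold)

def llr_T(w1, w2):
    """Formula (3): T = -log2(L(H1)/L(H2)); the binomial coefficients of
    formula (2) appear in both likelihoods and cancel."""
    n = w1 + w2
    lg = lambda x: math.log2(max(x, 1e-12))  # guard against log2(0)
    p = w1 / n                # shared rate under hypothesis 1
    p1, p2 = w1 / n, w2 / n   # per-corpus rates under hypothesis 2
    num = w1 * lg(p) + w2 * lg(1 - p) + w2 * lg(p) + w1 * lg(1 - p)
    den = w1 * lg(p1) + w2 * lg(1 - p1) + w2 * lg(p2) + w1 * lg(1 - p2)
    return -(num - den)

print(llr_T(w1=120, w2=8))  # larger T favours corpus-specific rates (hypothesis 2)
```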
BRIEF DESCRIPTION OF THE FIGURES
Fig. 1 is a flow chart of the self-learning method for fine-grained domain terminology
based on contextual semantics according to the present invention.
DESCRIPTION OF THE INVENTION
The present invention will be further described below in conjunction with the drawings
and implementation cases:
The source data used in the domain term discovery method of the present invention is
from the PLOS ONE website, and 5000 articles are randomly crawled by searching the
"fMRI" and "Cognitive Function" keywords;
The concept collection of cognitive function terminology is composed of 803 cognitive
function terms from the Cognitive Atlas website;
The method flow chart of this embodiment is shown in Fig. 1, and specifically includes
the following steps:
Step 1: Build the initial terminology set and original target corpus
The initial terminology set is constructed by filtering the top 10 cognitive function
terminology that appear most frequently in the source data;
The original target corpus is composed of 932 paragraphs in the source data, each of
which contains terminology from the initial terminology set;
Step 2: Construction of the original control corpus
The original control corpus is composed of paragraphs in the source data set that do not
contain any of the 803 terms, together with 25032 paragraphs that contain the 803 terms
but not in the same sentence as the terminology;
Step 3: Build the original knowledge source corpus
The knowledge source corpus comes from the 150 latest articles that we randomly
crawled from the PLOS ONE website by searching the keywords "fMRI" and "Cognitive
Function" to construct a test corpus. Based on the cognitive glossary, 20 cognitive
function terms in these articles are marked.
Step 4: Data preprocessing to obtain candidate terminology set, target corpus context,
control corpus context and knowledge source context
Step (1): Use the HanLP tool to perform part-of-speech tagging and syntactic analysis on
the knowledge source data, and extract all noun phrases in the corpus;
Step (2): Remove words such as articles and descriptive adjectives from the noun phrases
above;
Step (3): Split noun phrases connected by "and" or "or" into two parts; for example, split
"anchoring and apperception" into "anchoring" and "apperception";
Step (4): Further split noun phrases with grammatical structures such as "noun+noun" or
"adjective+noun" to extract more fine-grained candidate terminology in a second pass;
for example, "audiovisual" and "perception" are generated from "audiovisual
perception";
Step (5): Perform morphological restoration and de-duplication to obtain the set of
candidate terminology; extract the context information corresponding to each candidate
term in the knowledge source corpus, taking the 35 words around the candidate term as
its term context. The contexts of the target corpus and the control corpus are obtained in
the same way.
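As a minimal sketch of this preprocessing, spaCy is used below as a stand-in for the HanLP toolchain named above; the filtering rules are simplified and illustrative:

```python
# Sketch of steps (1)-(5): noun-phrase extraction, coordination splitting,
# finer-grained second-pass splitting, and lemmatization.
import re
import spacy

nlp = spacy.load("en_core_web_sm")  # POS tagging, parsing, lemmatization

def candidate_terms(text):
    cands = set()
    for chunk in nlp(text).noun_chunks:                      # step (1)
        # step (3): split phrases coordinated by "and" / "or"
        for part in re.split(r"\s+(?:and|or)\s+", chunk.text.lower()):
            # steps (2) and (5): lemmatize and drop articles; a fuller
            # version would also filter descriptive adjectives
            words = [t.lemma_ for t in nlp(part)
                     if t.pos_ in ("NOUN", "PROPN", "ADJ")]
            if words:
                cands.add(" ".join(words))  # e.g. "audiovisual perception"
                cands.update(words)         # step (4): finer-grained split
    return cands

print(candidate_terms("Audiovisual perception and response inhibition were tested."))
```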
Step 5: Calculate the discrimination degree of terminology domain
Use the log-likelihood ratio (3) to calculate the possibility of the context-corpus
correlation hypothesis; with the binomial distribution hypothesis, the likelihood is
expanded by formula (2). Then calculate the sentence similarity between each context
vector of the candidate term and each vector in the target corpus vector set using
formula (4), and count the number of times the similarity exceeds the set threshold. The
number of occurrences of the candidate term context in the control corpus is obtained in
the same way.
Step 6: Calculate the domain-dependent bias value of candidate terminology
For each candidate term, calculate the domain-dependent bias value Dte(t) according to
formula (1);
Step 7: Terminology self-learning
Then calculate the membership activation value Dtm(t) = Dtn(t)*Dte(t) of each candidate
term, set the critical activation value, and regard each term that meets the critical value
as a new term found in the domain; add it to the initial terminology set and repeat from
step 1 to realize the self-learning of the terminology set and the self-improvement of this
method.
In this experiment, a total of 29 domain terms were selected, of which 25 were found to
be cognitive function terms, for a term discovery accuracy rate of 86.20%. The following
table shows the detailed results of term discovery:
Table 1 Details of terminology findings
concept | Dtm(t) | Remark | concept | Dtm(t) | Remark
langua | 2.3501686958 | True | fixatio | 1.0677460999E-15 | new entity
researc | 1.9968347325 | False | pain | 4.5413844800E-14 | new entity
search | 2.0031186447 | new | detecti | 7.6070869893E-14 | True
emotio | 1.5136559401 | True | inhibit | 1.8986916997E-12 | new entity
executi | 2.7032331048 | False | associ | 3.9663843892E-11 | new entity
movem | 1.7365628165 | True | learnin | 2.1390137641E-10 | new entity
percept | 2.5561699747 | True | decisio | 3.0194903713E-9 | new entity
reactio | 1.1475290757 | False | contex | 4.6116571446E-9 | new entity
valence | 1.0022558949 | new | focus | 1.7419689090E-8 | new entity
knowle | 6.4427228347 | new | percep | 5.9751417093E-7 | new entity
stress | 8.2889849868 | new | action | 3.8605674748E-4 | new entity
strateg | 2.5904326047 | new | activat | 7.2569481516E-4 | True
judgme | 6.8298661996 | new | attenti | 0.00133302481525 | new entity
integrat | 1.2711981915 | new | memor | 0.00367965728662 | new entity
interact | 3.0983927291 | False | | |
In order to verify the effectiveness of the method of the present invention, the algorithm
proposed in this experiment is compared with the DR-DC, CTROL, CRF, and other
algorithms. The experimental results show that the accuracy rate of the DR-DC
algorithm is 16.52%, that of the CTROL algorithm is 31.09%, and that of the CRF
algorithm is 43.22%; term discovery therefore achieves a markedly higher accuracy rate
by adopting the fine-grained domain terminology self-learning algorithm based on
contextual semantics.
It can be seen that the self-learning technology of fine-grained domain terminology based
on contextual semantics is conducive to the self-learning of the domain terminology set
of text data resources and realizes the self-expansion of the domain terminology database.

Claims (1)

THE CLAIMS DEFINING THE INVENTION ARE AS FOLLOWS:
1. A self-learning method for fine-grained domain terminology based on contextual
semantics, characterized in that it includes the following steps:
Step 1: Build the initial terminology set and target corpus
An initial terminology set of 20-30 words is obtained by simplifying an existing
terminology set in the field or by constructing one manually; occurrences of the initial
terminology are then located with the Maximum Matching Algorithm, and their contexts
under a 35-word window are extracted to form the target corpus;
Step 2: Construction of a control corpus
The control data set is divided into two parts: a general control corpus subset and a
domain control corpus subset; the former is composed of multi-domain terminology
outside the target domain together with their contexts; the latter is composed of
target-domain terminology other than the terminology to be learned, together with their
contexts;
Step 3: Knowledge source preprocessing based on context balanced binary tree
For the knowledge source to be extracted, use natural language processing technology to
identify noun phrases as the candidate terminology set, and extract their context sets
under a 35-word window to construct the candidate terminology context balanced binary
tree, where the node numbers and storage values of the candidate terminology context
balanced binary tree store the candidate terminology and their corresponding context
sets, respectively, as the basis for further screening and processing;
Step 4: Terminology domain discrimination calculation based on context-corpus
correlation hypothesis
First, construct the correlation hypotheses between the term context and the corpus; on
this basis, comprehensively apply the log-likelihood ratio and the context-vector-based
sentence similarity measure to calculate the term domain discrimination Dtn(t);
Step 5: Calculate the domain-dependent bias value of candidate terminology
Construct a "headword-modifier" morphological skeleton model, and calculate the
similarity of the candidate terminology "headword" context in the target corpus and the
control corpus; first define the candidate terminology domain-dependent bias independent
variable DRG = W2/ W, where Wi>,W23OWi and W2 are the frequency of candidate
terminology contexts in the target corpus and the control corpus, respectively, and then
use the domain-dependent bias function Dte(t) =e n*DRG*ln2 (1), where e is the natural
logarithm, n is the adjustment factor, and the value range of n is 10000-12000. Then the
domains-dependent bias value of the candidate terminology is calculated, and then a
binary tree of candidate terminology-dependent adjustment factors is constructed. Among
them, The node number and storage value of binary tree of candidate term-dependent
adjustment factors respectively store candidate terminology and their domain-dependent
bias values;
Step 6: Calculate the membership activation value of the candidate terminology
Combining the results of step 4 and step 5, integrate the candidate terminology context
balanced binary tree and the candidate terminology-dependent adjustment factor binary
tree to construct a "discrimination-bias-membership" three-layer mapping activation
model, and calculate the membership activation value of the candidate terminology as
Dtm(t) = Dtn(t)*Dte(t), where Dtn(t) is the term domain discrimination obtained in step 4
and Dte(t) is the candidate term domain-dependent bias value obtained in step 5;
construct the candidate term membership activation value binary tree, where the node
numbers and storage values store the candidate terminology and their membership
activation values, respectively;
Step 7: Self-learning of fine-grained domain terminology
Based on the candidate term membership activation value binary tree, set the critical
activation value: draw the accuracy curve corresponding to different critical activation
values and take the threshold corresponding to the highest accuracy as the critical
activation value. Terminology that meets the critical value is regarded as newly
discovered domain terminology and is added to the initial terminology set, and the
method returns to step 1.
Further, in the step 4, the specific method and process of calculating the term domain
discrimination based on the context-corpus correlation hypothesis is:
Step 1): Define context-corpus correlation hypothesis
Hypothesis 1: The context of candidate terminology appears at the same frequency in the
target corpus and the control corpus;
Hypothesis 2: The context of candidate terminology appears at different frequencies in
the target corpus and the control corpus;
Step 2): Construction of the target corpus vector set
First, based on the target corpus, train a context-based "incoming-hiding-feedback"
three-layer neural network model; secondly, look through all the contexts in the target
corpus and input each context into the neural network model word by word to obtain the
corresponding multidimensional word vector for each word, the average value of each
dimension over all word vectors being used to construct the context vector; finally, the
context vectors of all the contexts in the target corpus are gathered to construct the
target corpus vector set;
Step 3): Construct the control corpus vector set
Firstly, based on the control corpus, train a context-based "incoming-hiding-feedback"
three-layer neural network model; secondly, look through all the contexts in the control
corpus and input each context into the neural network model word by word to obtain the
corresponding multidimensional word vector for each word, the average value of each
dimension over all word vectors being used to construct the context vector; finally, the
context vectors of all the contexts in the control corpus are gathered to construct the
control corpus vector set;
Step 4): Construct candidate terminology context vector
First, based on the candidate terminology, look through the candidate term context
balanced binary tree to extract the corresponding contexts; then input the contexts one by
one into the three-layer neural network model of the control corpus to obtain the
multidimensional word vector corresponding to each word; finally, use the average value
of each dimension over all word vectors to construct the candidate terminology context
vector;
Step 5): Terminology domain discrimination calculation combining log-likelihood
estimation and sentence similarity calculation
On the basis of the two hypotheses defined in step 1), use the binomial distribution
hypothesis to calculate the likelihood estimates L(H1) and L(H2), where

L(H1) = B(W1; W1+W2; P) * B(W2; W1+W2; P) and L(H2) = B(W1; W1+W2; P1) * B(W2; W1+W2; P2),

wherein W1 and W2 represent the frequencies of the candidate term context in the target
corpus and the control corpus, respectively, and P1 and P2 represent the probabilities of
occurrence of the candidate term context in the target corpus and the control corpus.
Combined with the binomial distribution hypothesis, B(W1; W1+W2; P) is expanded as

B(W1; W1+W2; P) = ((W1+W2)! / (W1! * W2!)) * P^W1 * (1-P)^W2 (2),

where P is the probability that the candidate term context in hypothesis 1 appears in the
target corpus. The corresponding log-likelihood ratio with 2 as the base is then
calculated as

T = -log2(L(H1)/L(H2)) = -log2([P^W1 * (1-P)^W2 * P^W2 * (1-P)^W1] / [P1^W1 * (1-P1)^W2 * P2^W2 * (1-P2)^W1]) (3),

which is used to calculate the probability of the context-corpus correlation hypothesis.
Then use

CosDis(a, b) = (a · b) / (|a| * |b|) (4)

to calculate the sentence similarity between each context sentence vector of the
candidate term and each context sentence vector in the target corpus vector set, where a
represents a context sentence vector of the candidate term and b represents a context
sentence vector in the target corpus vector set; W1 is obtained by counting the number of
times this similarity exceeds the set threshold (more than 50 times). Likewise, the
sentence similarity between each context sentence vector of the candidate term and each
context sentence vector in the control corpus vector set is calculated, and W2 is obtained
by counting the number of times that similarity exceeds the set threshold (more than 50
times).
AU2021105953A 2021-08-19 2021-08-19 Method for fine-grained domain terminology self-learning based on contextual semantics Active AU2021105953A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2021105953A AU2021105953A4 (en) 2021-08-19 2021-08-19 Method for fine-grained domain terminology self-learning based on contextual semantics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2021105953A AU2021105953A4 (en) 2021-08-19 2021-08-19 Method for fine-grained domain terminology self-learning based on contextual semantics

Publications (1)

Publication Number Publication Date
AU2021105953A4 true AU2021105953A4 (en) 2021-10-28

Family

ID=78179673

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2021105953A Active AU2021105953A4 (en) 2021-08-19 2021-08-19 Method for fine-grained domain terminology self-learning based on contextual semantics

Country Status (1)

Country Link
AU (1) AU2021105953A4 (en)

Similar Documents

Publication Publication Date Title
CN106844658B (en) Automatic construction method and system of Chinese text knowledge graph
Hadni et al. Word sense disambiguation for Arabic text categorization.
Rahimi et al. An overview on extractive text summarization
CN108038106B (en) Fine-grained domain term self-learning method based on context semantics
CN114254653A (en) Scientific and technological project text semantic extraction and representation analysis method
Saju et al. A survey on efficient extraction of named entities from new domains using big data analytics
Alqahtani et al. A survey of text matching techniques
Adhitama et al. Topic labeling towards news document collection based on Latent Dirichlet Allocation and ontology
Bella et al. Domain-based sense disambiguation in multilingual structured data
Karpagam et al. A framework for intelligent question answering system using semantic context-specific document clustering and Wordnet
Kalo et al. Knowlybert-hybrid query answering over language models and knowledge graphs
Prabowo et al. Systematic literature review on abstractive text summarization using kitchenham method
Tungthamthiti et al. Recognition of sarcasm in microblogging based on sentiment analysis and coherence identification
CN114239828A (en) Supply chain affair map construction method based on causal relationship
AU2021105953A4 (en) Method for fine-grained domain terminology self-learning based on contextual semantics
Liu et al. Modelling and Implementation of a Knowledge Question-answering System for Product Quality Problem Based on Knowledge Graph
CN114298020A (en) Keyword vectorization method based on subject semantic information and application thereof
Akhgari et al. Sem-TED: semantic twitter event detection and adapting with news stories
Lezama Sanchez et al. A Behavior Analysis of the Impact of Semantic Relationships on Topic Discovery
Tohalino et al. Using virtual edges to extract keywords from texts modeled as complex networks
Ibrahiem et al. FEATURE EXTRACTION ENCHANCEMENT IN USERS' ATTITUDE DETECTION
Tang et al. Resolve Out of Vocabulary with Long Short-Term Memory Networks for Morphology
Guruvayur et al. Automatic Relationship Construction in Domain Ontology Engineering using Semantic and Thematic Graph Generation Process and Convolution Neural Network
Mills et al. A comparative survey on NLP/U methodologies for processing multi-documents
Aamot Literature-based discovery for oceanographic climate science

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)