CN111144119A - Entity identification method for improving knowledge migration - Google Patents


Info

Publication number
CN111144119A
CN111144119A
Authority
CN
China
Prior art keywords
word
sentence
auxiliary
model
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911374613.0A
Other languages
Chinese (zh)
Other versions
CN111144119B (en)
Inventor
赵平
孙连英
涂帅
王金峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Union University
Original Assignee
Beijing Union University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Union University filed Critical Beijing Union University
Priority to CN201911374613.0A priority Critical patent/CN111144119B/en
Publication of CN111144119A publication Critical patent/CN111144119A/en
Application granted granted Critical
Publication of CN111144119B publication Critical patent/CN111144119B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/14Travel agencies
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a scenic spot entity identification method. It uses the idea of knowledge migration to address the difficulty of obtaining labeled data in the tourism domain, reduces the heavy dependence of deep-learning scenic spot recognition on the quantity and quality of labeled data, and uses a BERT + BiLSTM + CRF method incorporating a language model to handle the word-ambiguity problem typical of Chinese named entity recognition feature representation. The method evaluates the labeled data of existing auxiliary-domain samples at three levels (keyword, sentence and expandability) and uses the evaluation results to expand the target-domain training set. Experiments show that the invention achieves a clearly better recognition effect with only 1/4 of the labeled data. In addition, without requiring large-scale manual annotation, the existing labeled auxiliary-domain samples are used for expansion, reducing the time and effort spent on manual annotation without affecting recognition efficiency.

Description

Entity identification method for improving knowledge migration
Technical Field
The invention relates to a scenic spot entity identification method, and in particular to a scenic spot entity identification model with improved knowledge migration.
Background
Efficient information management and data mining over massive unstructured texts such as travel notes are important for research on question-answering systems, public-opinion analysis and personalized recommendation in the tourism domain, and the accuracy of scenic spot entity recognition directly affects information extraction in this field.
Current approaches to tourist-attraction recognition fall into two categories: machine-learning methods (hidden Markov models and conditional random fields) and deep-learning methods (convolutional neural networks). A hidden Markov model treats scenic spot recognition as a doubly stochastic process; it cannot exploit contextual semantic information and cannot resolve the word-ambiguity problem in text feature representation. Scenic spot words in the tourism domain often carry different meanings in different contexts: for example, "Huangshan" may refer to Huangshan City in Anhui Province (a place name) or to the Huangshan scenic area (a tourist attraction), so HMM-based recognition performance is mediocre. Conditional-random-field methods rely mainly on hand-built feature templates; in the tourism domain the scenic spot entities are too numerous to enumerate, building templates by hand is time- and labor-consuming, and contextual semantics still cannot be captured. Convolutional-neural-network methods recognize scenic spots efficiently but require a large amount of manually annotated corpora; the result depends heavily on annotation quality, manual annotation takes great effort, and the quality of an automatically annotated training corpus directly affects recognition performance.
Therefore, the biggest problems in tourist-attraction recognition today are: 1) for duplicated attraction names, the feature representation cannot distinguish the different meanings a scenic spot word takes in different contexts; 2) for a specific tourism domain, the scenic spot entities are too numerous to enumerate, hand-building feature templates is time- and labor-consuming, machine-learning algorithms require manually labeled data, the model depends heavily on label quality, and labeled data are difficult to obtain.
disclosure of Invention
The invention aims to solve the above problems and provides a scenic spot entity recognition model with improved knowledge migration. The auxiliary-domain text is normalized, labeled data, so the difficulty of migration lies in how to evaluate the similarity between the auxiliary domain and the target domain, and how to expand as much target-domain-relevant semantic information from the auxiliary domain as possible without causing negative transfer during feature extraction and knowledge migration.
To this end, the method proposes two measures tailored to tourism-domain text, keyword importance and sample expandability, to evaluate the quality of a sample, and designs three levels of similarity to evaluate the similarity between the auxiliary and target domains. Its advantage is that the target-domain training set is expanded with auxiliary-domain data, and scenic spots can be identified accurately and effectively.
In order to achieve the purpose, the invention adopts the following technical scheme:
a scenic spot entity recognition model for improving knowledge migration comprises the following specific steps:
The method comprises the following steps. Step one: use the auxiliary-domain training set to train a Chinese named entity recognition model with the BERT + BiLSTM + CRF method. The model comprises a BERT model, a BiLSTM and a CRF layer: the training set is passed through the BERT model to obtain text word vectors, the BiLSTM then learns contextual feature information for named entity recognition, and the CRF layer finally processes the BiLSTM output sequence.
Step two: train a word2vec model with the auxiliary-domain training set (the trained model is called the auxiliary-domain word vectorization model), and train a word2vec model with the target-domain training set (the trained model is called the target-domain word vectorization model);
step three: for each sample in the auxiliary-domain training set, calculate the importance of each word and sort the words in descending order of importance; the top m words are the auxiliary-domain keywords. For each sample in the target-domain training set, do the same; the top m words are the target-domain keywords;
step four: calculate the similarity between the auxiliary-domain keywords and the target-domain keywords obtained in step three, and set a keyword-level similarity threshold;
step five: calculate the similarity between the auxiliary-domain sentences and the target-domain sentences, and set a sentence-level similarity threshold;
step six: calculate the expandability of the auxiliary-domain samples, and set an expandability threshold;
step seven: expand the target-domain samples with the auxiliary-domain samples, and train a scenic spot entity recognition classifier on the expanded target-domain samples.
In step one, the named entity recognition model is built as follows:
(1-1) input the auxiliary-domain training set into the BERT model; the auxiliary-domain training set is a text set collected from the People's Daily and annotated with person, place and organization names, and the BERT model outputs text word vectors;
(1-2) input the text word vectors from step (1-1) into the BiLSTM and extract context information by deep learning;
(1-3) process the BiLSTM output sequence with the CRF layer, combining the state-transition matrix in the CRF with the constraints between adjacent labels to obtain the globally optimal sequence;
(1-4) the Chinese named entity recognition model outputs the predicted entity labels.
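The CRF decoding in step (1-3) amounts to a Viterbi search over the BiLSTM's per-token emission scores and the CRF's tag-transition matrix. A minimal pure-Python sketch; the BIO tag set and the hand-picked scores in the usage below are illustrative assumptions, not the patent's trained parameters:

```python
def viterbi_decode(emissions, transitions, tags):
    """Highest-scoring tag path given per-token emission scores
    (a list of {tag: score} dicts, as a BiLSTM would produce) and a
    tag-to-tag transition table (the CRF layer's parameters)."""
    score = dict(emissions[0])          # best score of a path ending in each tag
    back = []                           # back-pointers for path recovery
    for emission in emissions[1:]:
        ptr, new_score = {}, {}
        for t in tags:
            # best previous tag to move into tag t
            prev = max(tags, key=lambda p: score[p] + transitions[(p, t)])
            ptr[t] = prev
            new_score[t] = score[prev] + transitions[(prev, t)] + emission[t]
        back.append(ptr)
        score = new_score
    best = max(tags, key=lambda t: score[t])   # best final tag
    path = [best]
    for ptr in reversed(back):                 # follow back-pointers
        path.append(ptr[path[-1]])
    path.reverse()
    return path
```

With a transition table that rewards B followed by I and penalizes O followed by I, the decoder prefers well-formed BIO spans even when the emission scores alone are ambiguous.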
the second step comprises the following specific steps:
(2-1) the target-domain sample set consists of travel notes manually crawled from travel websites such as Mafengwo;
(2-2) segment the auxiliary-domain sample set with the jieba segmenter to obtain the auxiliary-domain segmented text, and segment the target-domain sample set with jieba to obtain the target-domain segmented text;
(2-3) load stop words and a user-defined dictionary, the user-defined dictionary consisting of words that should not be split apart by the segmenter;
(2-4) train a word2vec model on the auxiliary-domain segmented text to obtain the auxiliary-domain word vectorization model, and train a word2vec model on the target-domain segmented text to obtain the target-domain word vectorization model.
the third step comprises the following specific steps:
(3-1) Calculate the keyword frequency KF_{i,j} in each sentence of the auxiliary-domain samples, and KF'_{i,j} in each sentence of the target-domain samples, where the frequency of keyword i in sentence j is:

KF_{i,j} = n_{i,j} / Σ_k n_{k,j}

where n_{i,j} is the number of times keyword i appears in sentence j and the denominator is the total number of word occurrences in sentence j.
(3-2) Calculate the inverse sentence frequency ISF for the auxiliary-domain samples and ISF' for the target-domain samples:

ISF_i = log( |S| / (1 + |{ j : t_i ∈ S_j }|) )

where SF denotes sentence frequency, ISF denotes inverse sentence frequency, ISF_i is the inverse sentence frequency of word i, |S| is the total number of sentences, |{ j : t_i ∈ S_j }| is the number of sentences S_j containing word t_i, and the added 1 prevents the denominator from becoming zero.
(3-3) For the auxiliary-domain samples, the importance of word i in sentence j is calculated as I(i, j) = KF_{i,j} * ISF_i;
(3-4) for the target-domain samples, the importance of word i in sentence j is calculated as I'(i, j) = KF'_{i,j} * ISF'_i.
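The importance score I(i, j) = KF_{i,j} * ISF_i of steps (3-1) to (3-4) can be sketched in pure Python. Normalizing KF by the sentence length (the standard term-frequency convention) is an assumption here, since the patent's formula image is not reproduced in the text:

```python
import math

def word_importance(sentences):
    """Rank (word, sentence-index) pairs by I(i, j) = KF_ij * ISF_i,
    the patent's sentence-level analogue of TF-IDF. `sentences` is a
    list of tokenized sentences (lists of words)."""
    total = len(sentences)
    vocab = {w for s in sentences for w in s}
    # inverse sentence frequency; the +1 keeps the denominator nonzero
    isf = {w: math.log(total / (1 + sum(1 for s in sentences if w in s)))
           for w in vocab}
    scores = []
    for j, s in enumerate(sentences):
        for w in set(s):
            kf = s.count(w) / len(s)   # keyword frequency within sentence j
            scores.append((w, j, kf * isf[w]))
    scores.sort(key=lambda x: x[2], reverse=True)
    return scores
```

Taking the top m entries of the returned list yields the domain keywords of step three; words that occur in every sentence get a non-positive ISF and sink to the bottom.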
The specific steps of step four are:
(4-1) For the auxiliary-domain keywords obtained in step three, compute their word vectors L_word = {l_1, l_2, ..., l_n} with the auxiliary-domain word2vec language model trained in step two;
(4-2) for the target-domain keywords obtained in step three, compute their word vectors M_word = {m_1, m_2, ..., m_n} with the target-domain word2vec language model trained in step two;
(4-3) calculate the keyword similarity of L_word and M_word by cosine similarity:

sim_word = (L_word · M_word) / (|L_word| * |M_word|)

(4-4) set the keyword-level similarity threshold in the range (0.4, 0.6);
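The cosine similarity of step (4-3), which is reused unchanged for the sentence-level comparison in step five, is a one-liner over plain Python lists:

```python
import math

def cosine_similarity(u, v):
    """sim = (u . v) / (|u| |v|); used for both the keyword-level
    comparison (sim_word) and the sentence-level one (sim_sen)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Collinear vectors score 1.0, orthogonal ones 0.0; the zero-norm guard avoids division by zero for empty or all-zero vectors.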
The specific steps of step five are:
(5-1) For each sentence x_s in the auxiliary-domain samples, compute its sentence vector with the auxiliary-domain word2vec language model trained in step two, obtaining L_sen = {l_1, l_2, ..., l_n};
(5-2) for each sentence x_t in the target-domain samples, compute its sentence vector with the target-domain word2vec language model trained in step two, obtaining M_sen = {m_1, m_2, ..., m_n};
(5-3) calculate the sentence-level similarity of L_sen and M_sen by cosine similarity:

sim_sen = (L_sen · M_sen) / (|L_sen| * |M_sen|)

(5-4) set the sentence-level similarity threshold in the range (0.4, 0.6);
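Steps (5-1) and (5-2) need a sentence vector built from the word2vec word vectors. The patent does not say how the sentence vector is formed; averaging the token vectors is a common choice and is purely an assumption in this sketch:

```python
def sentence_vector(tokens, word_vectors, dim=4):
    """Build a sentence vector by averaging the word2vec vectors of the
    sentence's tokens. Averaging is an assumption: the patent does not
    state how sentence vectors are derived from the word model.
    Out-of-vocabulary tokens are skipped."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return [0.0] * dim
    return [sum(v[k] for v in vecs) / len(vecs) for k in range(dim)]
```

The resulting vectors feed directly into the cosine similarity of step (5-3).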
Step six determines the sample expandability, as follows:
(6-1) Using the sim_sen and sim_word obtained above, calculate the sample expandability:

SEA = α * sim_sen + β * sim_word

where α is the weight of the sentence-level similarity in SEA and β is the weight of the keyword similarity;
(6-2) set the expandability threshold in the range (0.4, 0.6);
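The SEA score of step (6-1) is a weighted sum of the two similarities. Taking β = 1 − α so the weights sum to one is an assumption in this sketch; the patent only constrains α to (0, 0.5) in claim 9:

```python
def expandability(sim_sen, sim_word, alpha=0.4):
    """SEA = alpha * sim_sen + beta * sim_word. Taking beta = 1 - alpha
    (so the weights sum to one) is an assumption; the patent only
    bounds alpha to the range (0, 0.5)."""
    beta = 1.0 - alpha
    return alpha * sim_sen + beta * sim_word
```

Because α < 0.5, keyword similarity dominates the score, which matches the patent's emphasis on keyword-level evaluation.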
The specific steps of step seven are:
(7-1) According to the keyword-similarity threshold, expand the samples with high keyword similarity into the target-domain sample set;
(7-2) according to the sentence-level similarity threshold, expand the samples with high sentence similarity into the target-domain samples;
(7-3) according to the expandability threshold, expand the samples with high expandability into the target-domain samples.
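Steps (7-1) to (7-3) can be sketched as a single filter over pre-scored auxiliary samples. Reading the three criteria as independent admission rules (a sample passes if any one score clears its threshold) is an interpretation; the dict field names and the mid-range default thresholds are illustrative:

```python
def expand_training_set(target_texts, aux_samples,
                        kw_thr=0.5, sen_thr=0.5, sea_thr=0.5):
    """Append an auxiliary-domain sample to the target training set
    when any of its three scores clears the matching threshold.
    Thresholds default to the middle of the patent's (0.4, 0.6) range;
    the dict field names are illustrative assumptions."""
    expanded = list(target_texts)
    for s in aux_samples:
        if (s["sim_word"] > kw_thr
                or s["sim_sen"] > sen_thr
                or s["sea"] > sea_thr):
            expanded.append(s["text"])
    return expanded
```

The expanded list is then what step seven's BERT + BiLSTM + CRF classifier is retrained on.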
Advantageous effects:
The invention uses the idea of knowledge migration to address the difficulty of obtaining labeled data in the tourism domain, while reducing the heavy dependence of deep-learning scenic spot recognition on labeled data and its quality. Using transfer learning, the labeled data of existing auxiliary-domain samples are evaluated at three levels (keyword, sentence and expandability), and the evaluation results are used to expand the target-domain training set.
Extensive experiments show that the invention achieves a clearly better recognition effect with only 1/4 of the labeled data. In addition, without requiring large-scale manual annotation, the existing labeled auxiliary-domain samples can be used for expansion to realize entity extraction on target-domain samples, greatly reducing the time and effort of manual annotation without affecting recognition efficiency.
Drawings
FIG. 1 is a diagram of the algorithm structure of the present invention
FIG. 2 is a diagram of classifier models
FIG. 3 is a graph of classifier hierarchy verification
FIG. 4 is a graph of similarity thresholds for different keywords
FIG. 5 is a graph of similarity thresholds for different sentences
FIG. 6 is a graph of different SEAs
FIG. 7 is a graph of sample size impact results for different target domains
Detailed Description
The invention will be further explained with reference to the following drawings and examples
As shown in FIG. 1, the entity recognition model with improved knowledge migration is applied through the following specific steps:
(1) For the auxiliary-domain samples X_s and a small number of target-domain samples X_t, train a Chinese entity recognition classifier C_1(x) and a scenic spot entity recognition classifier C_2(x) using the classifier model of FIG. 2. C_1(x) addresses the word-polysemy problem in Chinese named entity recognition; C_2(x) addresses scenic spot entity recognition within tourism-domain named entity recognition. The classifier design of FIG. 2 uses a BERT-embedded entity recognition model to resolve the ambiguity of Chinese word representations during feature extraction, a bidirectional long short-term memory network to learn contextual feature information, and a conditional random field to process the previous layer's output sequence, combining the CRF state-transition matrix to extract the globally optimal sequence.
Test C_1(x) with the auxiliary-domain test set to obtain the result curves shown in FIG. 3, where the P value is precision, the R value is recall, and the F value is the combined evaluation metric.
(2) Initialize the similarity threshold m and the proportion μ of the small labeled target-domain sample set within the expanded training set;
(3) For each target-domain training set X_t = Tr_t, preprocess X_s and X_t and train the corresponding language models. For any x_s ∈ X_s there is a vector v(x_s) = (v_1, v_2, ..., v_n), where n is the vector dimension; likewise, for any x_t ∈ X_t there is v(x_t) = (v_1, v_2, ..., v_n).
(4) Compute keyword frequency, text similarity and expandability for each sample in the auxiliary domain and for the samples in the target domain:
① For each sample, calculate the importance of its words and take the top m most relevant keywords, using I(i, j) = KF_{i,j} * ISF_i, where KF_{i,j} is the frequency of keyword i in sentence j and ISF_i is the inverse sentence frequency of word i;
② for each pair of sentence vectors v_sen(x_s) ∈ v(x_s) and v_sen(x_t) ∈ v(x_t), calculate the sentence-level text similarity by cosine similarity;
③ for each auxiliary-domain sample, calculate its expandability as SEA = α * sim_sen + β * sim_word, where α and β are weight coefficients;
(5) According to the SEA values obtained in ③, expand the auxiliary-domain samples with stronger expandability into the target-domain sample set Tr_t to obtain the expanded training set;
(6) train a new scenic spot entity recognition model C(x) on the expanded training set with the BERT + BiLSTM + CRF method;
(7) obtain a set of recognition results with the target-domain test set Te_t;
(8) update the value of m and repeat the verification to obtain the experimental results shown in FIG. 5;
(9) update the SEA threshold and repeat the verification to obtain the experimental results shown in FIG. 6;
(10) update the value of μ and repeat the verification to obtain the experimental results shown in FIG. 7.
It can be seen that with labeled data from only 1/4 of the target domain, the proposed method reaches an accuracy of 95.06% on the test set.

Claims (10)

1. An entity identification method for improving knowledge migration, characterized by:
the method comprises the following steps. Step one: use the auxiliary-domain training set to train a Chinese named entity recognition model with the BERT + BiLSTM + CRF method, the model comprising a BERT model, a BiLSTM and a CRF layer: the training set is passed through the BERT model to obtain text word vectors, the BiLSTM then learns contextual feature information for named entity recognition, and the CRF layer finally processes the BiLSTM output sequence;
step two: training a word2Vec model by using an auxiliary field training set, wherein the trained word2Vec model is called an auxiliary field word vectorization model, training the word2Vec model by using a target field training set, and the trained word2Vec model is called a target field word vectorization model;
step three: for each auxiliary-domain sample, calculate the importance of each word and sort the words in descending order of importance; the top m words are the auxiliary-domain keywords. For each target-domain sample, calculate the importance of each word and sort the words in descending order of importance; the top m words are the target-domain keywords;
step four: calculate the similarity between the auxiliary-domain keywords and the target-domain keywords obtained in step three, and set a similarity threshold;
step five: calculating the similarity between the auxiliary field sentence and the target field sentence, and setting a sentence level similarity threshold;
step six: calculate the expandability of the auxiliary-domain samples, and set an expandability threshold;
step seven: train a scenic spot entity recognition classifier on the expanded target-domain samples with the BERT + BiLSTM + CRF method of step one, the scenic spot entity recognition classifier having the same structure as the Chinese named entity recognition model.
2. The entity identification method for improving knowledge migration of claim 1, wherein the Chinese named entity recognition model in step one operates as follows:
(1-1) input the auxiliary-domain training set into the BERT model, the auxiliary-domain training set being a text set collected from the People's Daily and annotated with person names, place names and organization names; the BERT model outputs text word vectors;
(1-2) input the text word vectors from step (1-1) into the BiLSTM and extract context information;
(1-3) process the BiLSTM output sequence with the CRF layer to obtain predicted scores for the different entity-label types;
the model is optimized with a maximum-likelihood loss function, and the named-entity labels follow the BIO tagging scheme.
3. The entity identification method for improving knowledge migration of claim 1, wherein in step two:
the target-domain sample set consists of travel notes manually crawled from travel websites;
the auxiliary-domain sample set is further segmented with the jieba segmenter to obtain the auxiliary-domain segmented text, and the target-domain sample set is segmented with the jieba segmenter to obtain the target-domain segmented text;
the step further comprises loading stop words and a user-defined dictionary, the user-defined dictionary consisting of words that should not be split apart by the segmenter.
4. The entity identification method for improving knowledge migration of claim 1, wherein the specific steps for calculating the importance of auxiliary-domain words in step three are:
(3-1) calculate the word frequency in each auxiliary-domain sentence, where the frequency KF_{i,j} of the i-th word in sentence j is obtained by:

KF_{i,j} = n_{i,j} / Σ_k n_{k,j}

where n_{i,j} is the number of times word i appears in sentence j;
(3-2) calculate the inverse sentence frequency for the auxiliary-domain samples, where the inverse sentence frequency ISF_i of word i is obtained by:

ISF_i = log( |S| / (1 + |{ j : t_i ∈ S_j }|) )

where |S| is the total number of sentences in the auxiliary-domain sample set, |{ j : t_i ∈ S_j }| is the number of sentences containing t_i, t_i is a word in a sentence, and S_j is the j-th sentence;
(3-3) calculate the importance of each word in each sentence of the auxiliary-domain samples, where the importance I(i, j) of word i in sentence j is:

I(i, j) = KF_{i,j} * ISF_i
5. The entity identification method for improving knowledge migration of claim 4, wherein the target-domain keywords in step three are calculated in the same way as the auxiliary-domain keywords, the only difference being that the data involved are the samples of the target-domain training set.
6. The entity identification method for improving knowledge migration of claim 1, wherein the specific steps of step four are:
(4-1) obtain the word vector L_word of each auxiliary-domain keyword with the auxiliary-domain word vectorization model trained in step two;
(4-2) obtain the word vector M_word of each target-domain keyword with the target-domain word vectorization model trained in step two;
(4-3) calculate the similarity between the auxiliary-domain and target-domain keywords by cosine similarity:

sim_word = (L_word · M_word) / (|L_word| * |M_word|)

where L_word = {l_1, l_2, ..., l_n} denotes the word vector of an auxiliary-domain keyword and M_word = {m_1, m_2, ..., m_n} denotes the word vector of a target-domain keyword.
7. The entity identification method for improving knowledge migration of claim 1, wherein the keyword-level similarity threshold in step four lies in the range (0.4, 0.6).
8. The entity identification method for improving knowledge migration of claim 1, wherein the similarity between the auxiliary-domain and target-domain sentences in step five is calculated by:
(5-1) obtain the sentence vector L_sen = {l_1, l_2, ..., l_n} of each sentence x_s in the auxiliary-domain samples with the auxiliary-domain word vectorization model trained in step two;
(5-2) obtain the sentence vector M_sen = {m_1, m_2, ..., m_n} of each sentence x_t in the target-domain samples with the target-domain word vectorization model trained in step two;
(5-3) calculate the sentence-level similarity of L_sen and M_sen by cosine similarity:

sim_sen = (L_sen · M_sen) / (|L_sen| * |M_sen|)

and the sentence-level similarity threshold in step five lies in the range (0.4, 0.6).
9. The entity identification method for improving knowledge migration of claim 1, wherein the expandability SEA of an auxiliary-domain sample in step six is calculated by:

SEA = α * sim_sen + β * sim_word

where α is a weight coefficient in the range (0, 0.5), and the expandability threshold lies in the range (0.4, 0.6).
10. The entity identification method for improving knowledge migration of claim 1, wherein the sample expansion conditions in step seven are:
(7-1) according to the keyword-similarity threshold, expand samples whose keyword similarity exceeds the threshold into the target-domain sample set;
(7-2) according to the sentence-level similarity threshold, expand samples whose sentence similarity exceeds the threshold into the target-domain samples;
(7-3) according to the expandability threshold, expand samples whose expandability exceeds the threshold into the target-domain samples.
CN201911374613.0A 2019-12-27 2019-12-27 Entity identification method for improving knowledge migration Active CN111144119B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911374613.0A CN111144119B (en) 2019-12-27 2019-12-27 Entity identification method for improving knowledge migration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911374613.0A CN111144119B (en) 2019-12-27 2019-12-27 Entity identification method for improving knowledge migration

Publications (2)

Publication Number Publication Date
CN111144119A true CN111144119A (en) 2020-05-12
CN111144119B CN111144119B (en) 2024-03-29

Family

ID=70520780

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911374613.0A Active CN111144119B (en) 2019-12-27 2019-12-27 Entity identification method for improving knowledge migration

Country Status (1)

Country Link
CN (1) CN111144119B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150286629A1 (en) * 2014-04-08 2015-10-08 Microsoft Corporation Named entity recognition
CN108763201A (en) * 2018-05-17 2018-11-06 南京大学 A kind of open field Chinese text name entity recognition method based on semi-supervised learning
CN109871538A (en) * 2019-02-18 2019-06-11 华南理工大学 A kind of Chinese electronic health record name entity recognition method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Wu Hui; Lyu Li; Yu Bihui: "Chinese Named Entity Recognition Based on Transfer Learning and BiLSTM-CRF" *
Wang Hongbin; Shen Qiang; Xian Yantuan: "Chinese Named Entity Recognition Fusing Transfer Learning" *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111666414A (en) * 2020-06-12 2020-09-15 上海观安信息技术股份有限公司 Method for detecting cloud service by sensitive data and cloud service platform
CN111666414B (en) * 2020-06-12 2023-10-17 上海观安信息技术股份有限公司 Method for detecting cloud service by sensitive data and cloud service platform
CN111695346A (en) * 2020-06-16 2020-09-22 广州商品清算中心股份有限公司 Method for improving public opinion entity recognition rate in financial risk prevention and control field
CN111695346B (en) * 2020-06-16 2024-05-07 广州商品清算中心股份有限公司 Method for improving public opinion entity recognition rate in financial risk prevention and control field
WO2022227164A1 (en) * 2021-04-29 2022-11-03 平安科技(深圳)有限公司 Artificial intelligence-based data processing method and apparatus, device, and medium
CN113191148A (en) * 2021-04-30 2021-07-30 西安理工大学 Rail transit entity identification method based on semi-supervised learning and clustering
CN113191148B (en) * 2021-04-30 2024-05-28 西安理工大学 Rail transit entity identification method based on semi-supervised learning and clustering
CN114610852A (en) * 2022-05-10 2022-06-10 天津大学 Course learning-based fine-grained Chinese syntax analysis method and device


Similar Documents

Publication Title
CN111444726B (en) Chinese semantic information extraction method and device based on long-short-term memory network of bidirectional lattice structure
CN110019839B (en) Medical knowledge graph construction method and system based on neural network and remote supervision
CN111144119B (en) Entity identification method for improving knowledge migration
CN109284400B (en) Named entity identification method based on Lattice LSTM and language model
CN111209401A (en) System and method for classifying and processing sentiment polarity of online public opinion text information
CN110633409A (en) Rule and deep learning fused automobile news event extraction method
CN104794169B (en) A kind of subject terminology extraction method and system based on sequence labelling model
CN111738007B (en) Chinese named entity identification data enhancement algorithm based on sequence generation countermeasure network
CN110688836A (en) Automatic domain dictionary construction method based on supervised learning
CN113505200B (en) Sentence-level Chinese event detection method combined with document key information
CN112101028A (en) Multi-feature bidirectional gating field expert entity extraction method and system
CN113761890B (en) Multi-level semantic information retrieval method based on BERT context awareness
CN113515632B (en) Text classification method based on graph path knowledge extraction
CN112364623A (en) Bi-LSTM-CRF-based three-in-one word notation Chinese lexical analysis method
CN110287298A (en) A kind of automatic question answering answer selection method based on question sentence theme
CN113869053A (en) Method and system for recognizing named entities oriented to judicial texts
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN112883722A (en) Distributed text summarization method based on cloud data center
CN111581364A (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
CN114707516A (en) Long text semantic similarity calculation method based on contrast learning
CN114064901B (en) Book comment text classification method based on knowledge graph word meaning disambiguation
CN112966518B (en) High-quality answer identification method for large-scale online learning platform
CN109325243A (en) Mongolian word cutting method and its word cutting system of the character level based on series model
CN117131932A (en) Semi-automatic construction method and system for domain knowledge graph ontology based on topic model
CN115600602B (en) Method, system and terminal device for extracting key elements of long text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant