CN112925886A - Few-sample entity identification method based on field adaptation - Google Patents

Few-sample entity identification method based on field adaptation

Info

Publication number
CN112925886A
CN112925886A (application CN202110266908.7A)
Authority
CN
China
Prior art keywords
pivot
feature
sentence
pivot feature
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110266908.7A
Other languages
Chinese (zh)
Other versions
CN112925886B (en)
Inventor
韩瑞峰
杨红飞
金霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huoshi Creation Technology Co ltd
Original Assignee
Hangzhou Firestone Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Firestone Technology Co ltd filed Critical Hangzhou Firestone Technology Co ltd
Priority to CN202110266908.7A priority Critical patent/CN112925886B/en
Publication of CN112925886A publication Critical patent/CN112925886A/en
Application granted granted Critical
Publication of CN112925886B publication Critical patent/CN112925886B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a few-sample entity recognition method based on domain adaptation. First, pivot features are selected: n-grams that appear frequently in both the source and the target domain serve as pivot feature words. A training data set is then built and a set of pivot feature classifiers is trained; the coefficient vectors of all pivot features form a matrix that captures the correlation between non-pivot feature words and pivot feature words. An entity recognition model is trained on the source domain and finally applied to the target domain, where each non-pivot feature word is replaced by the most relevant source-domain non-pivot feature word whose correlation exceeds a threshold. By exploiting the pivot feature words shared between domains, the invention derives a correspondence between the differing non-pivot features of the two domains and thereby maps feature words across domains.

Description

Few-sample entity identification method based on field adaptation
Technical Field
The invention relates to the field of entity recognition, and in particular to a few-sample entity recognition method based on domain adaptation.
Background
In text information extraction applications, the scenarios are numerous and fine-grained, labeled samples are scarce, and the cost of obtaining them is high. Industrial applications face this situation today, and current technology offers no mature solution for scenarios with only a few labeled samples. Skillfully reusing existing labeling resources to transfer what a model has learned to a few-sample scenario is therefore an active research direction.
Existing text information extraction methods require a large number of labeled samples for model training. Although some deep models tend to achieve ever higher accuracy with ever fewer labeled samples, a certain number of labeled samples is still needed to train a usable model, and no work can proceed before those samples are obtained. The development cost is thus shifted to sample labeling, and overall development efficiency remains low.
Applied to entity extraction, the present method can exploit abundant labeling resources in a similar domain to obtain an extraction model with high accuracy without any labeled samples in the target domain.
Disclosure of Invention
The invention aims to provide a few-sample entity recognition method based on domain adaptation. Features shared between the corpora of different domains are used as pivot features to establish a mapping of features across domains, so that a model trained on a source domain with abundant labels performs well on an unlabeled target domain. Applied to entity recognition, the method achieves high accuracy on an unlabeled target domain through transfer learning between similar domains.
The purpose of the invention is achieved by the following technical scheme: a few-sample entity recognition method based on domain adaptation, comprising the following steps:
(1) Selecting pivot features
Count the frequency of each n-gram in the corpora of the source domain and the target domain, and select phrases that appear in both domains with frequency above a threshold as pivot feature words. Each pivot feature word w instantiates the templates "pivot feature word w on the right", "pivot feature word w on the left" and "pivot feature word w in the middle"; each instance represents a pivot feature of one dimension, and together they form the pivot feature set. Here w denotes the pivot feature word in a text sentence.
(2) Establishing a training data set and training the pivot feature classifiers
Using the pivot feature set of step (1) as the training data set, train a group of pivot feature classifiers that predict whether a pivot feature word is present in a sample sentence. Specifically, for each pivot feature i, train a logistic regression model as its pivot feature classifier to judge whether that pivot feature is present among the words of a sample sentence. After training, the logistic regression coefficients form a column vector w_i whose dimension equals the number of non-pivot features; the vectors for all pivot features form the matrix W = [w_1, w_2, …, w_M], where w_M corresponds to the pivot feature of the M-th dimension and M = 3N for N pivot feature words. The j-th component w_ij of w_i represents the correlation between the j-th non-pivot feature word and the i-th pivot feature word; a positive w_ij indicates a positive correlation.
Sample input of the pivot feature classifier: compute the n-grams of all sentences in both domains to form an n-gram list. The feature of each sentence is a binary vector indicating, for every word in the n-gram list, whether it occurs in the sentence as a pivot or non-pivot feature word; the dimensions corresponding to all pivot feature words are then removed to obtain the sentence's feature vector.
Sample output of the pivot feature classifier: the output dimension is 1. Because the n-gram features of the sentence are computed, if a pivot feature word stands immediately to the left or right of some non-pivot feature word in the sentence, the sentence necessarily contains at least the 2-gram formed with that pivot feature word; the sentence sample is then labeled 1, and otherwise 0. At the same time, the corresponding entry in the coefficient vector w_i of that pivot feature word is positive, i.e. the non-pivot feature words in the sentence are positively correlated with the pivot feature word.
(3) Training the entity recognition model on the source domain
Train an entity recognition model on the source-domain corpus. The model takes a sentence as input and outputs a label sequence of the same length as the sentence, marking the category of each character.
(4) Entity recognition on the target domain
Predict on the target domain with the model trained in step (3). Using the correlation information in the matrix W, replace each non-pivot feature word in the target domain with the most relevant source-domain non-pivot feature word whose correlation also exceeds a threshold.
Further, the source domain has both labeled and unlabeled corpora, while the target domain contains only unlabeled corpora.
Further, n in the n-gram is 2.
Further, in step (2), w_i denotes the covariance between the non-pivot features and the pivot feature.
Further, in step (2), for Chinese corpora, the corpora of both domains are word-segmented, all possible segmentation results are combined into a lattice, n-grams are built over all lattice words, and samples are constructed by the procedure of step (1); each sentence then contains the n-grams corresponding to all of its segmentation results.
Further, in step (3), for Chinese corpora the lattice word segmentation is used as input.
Further, in step (4), for Chinese corpora each lattice-segmented word is replaced with the most relevant non-pivot word from the source domain.
Further, in step (4), if a replacement changes the length of the prediction result, the resulting offset is handled.
The invention has the following beneficial effects: the pivot feature words shared between domains are used to derive the correspondence between the differing non-pivot features of the domains, thereby mapping feature words across domains, so that a model trained on a source domain with abundant labels can be used directly on an unlabeled target domain.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.
As shown in FIG. 1, the invention provides a few-sample entity recognition method based on domain adaptation. The invention assumes that a similar domain has relatively abundant labeled and unlabeled corpora: the source domain has both labeled and unlabeled corpora, while the target domain contains only unlabeled corpora. The specific process is as follows:
(1) Selecting pivot features
Count the frequency of each n-gram (n is typically 2) in the corpora of the source domain and the target domain, and select phrases that appear in both domains with frequency above a threshold (e.g. 100, chosen according to the required recognition accuracy) as pivot feature words. Each pivot feature word w instantiates the templates "pivot feature word w on the right", "pivot feature word w on the left" and "pivot feature word w in the middle"; each instance represents a pivot feature of one dimension, and together they form the pivot feature set. Here w denotes the pivot feature word in a text sentence.
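The pivot-selection step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function and variable names are my own, and both corpora are assumed to be pre-tokenized into word lists.

```python
from collections import Counter

def select_pivot_features(source_sents, target_sents, n=2, min_freq=100):
    """Select n-grams frequent in BOTH domains as pivot feature words,
    then instantiate the three positional templates for each of them."""
    def ngram_counts(sents):
        counts = Counter()
        for tokens in sents:
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
        return counts

    src, tgt = ngram_counts(source_sents), ngram_counts(target_sents)
    pivots = sorted(g for g in src
                    if g in tgt and src[g] >= min_freq and tgt[g] >= min_freq)
    # Each pivot word w yields three one-dimensional pivot features:
    # "w on the left", "w on the right", "w in the middle" -> M = 3N.
    features = [(g, pos) for g in pivots
                for pos in ("left", "right", "middle")]
    return pivots, features
```

With n = 2 and a frequency threshold of 100 as in the embodiment, N shared bigrams yield M = 3N pivot features.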
(2) Establishing a training data set and training the pivot feature classifiers
Using the pivot feature set of step (1) as the training data set, train a group of pivot feature classifiers that predict whether a pivot feature word is present in a sentence, e.g. whether the sentence contains the pivot feature word w to the right of a given word. Specifically, for each pivot feature i, train a logistic regression model as its pivot feature classifier to judge whether that pivot feature is present among the words of a sample sentence. After training, the logistic regression coefficients form a column vector w_i whose dimension equals the number of non-pivot features; the vectors for all pivot features form the matrix W = [w_1, w_2, …, w_M], where w_M corresponds to the pivot feature of the M-th dimension, M = 3N for N pivot feature words, and w_i denotes the covariance between the non-pivot features and the pivot feature. The j-th component w_ij of w_i represents the correlation between the j-th non-pivot feature word and the i-th pivot feature word; a positive w_ij indicates a positive correlation. If the non-pivot features of two different domains are each positively correlated with multiple identical pivot features, the two non-pivot features are likely to correspond.
Sample input of the pivot feature classifier: compute the n-grams of all sentences in both domains to form an n-gram list. The feature of each sentence is a binary vector indicating, for every word in the n-gram list, whether it occurs in the sentence as a pivot or non-pivot feature word; the dimensions corresponding to all pivot feature words are then removed to obtain the sentence's feature vector.
Sample output of the pivot feature classifier: the output dimension is 1. Because the n-gram features of the sentence are computed, if a pivot feature word stands immediately to the left or right of some non-pivot feature word in the sentence, the sentence necessarily contains at least the 2-gram formed with that pivot feature word; the sentence sample is then labeled 1, and otherwise 0. The relevance of the two words is reflected in this 2-gram occurring in large numbers, and the corresponding entry in the coefficient vector w_i of that pivot feature word is positive, i.e. the non-pivot feature words in the sentence are positively correlated with the pivot feature word.
For Chinese corpora, the corpora of both domains are word-segmented, all possible segmentation results are combined into a lattice, n-grams are built over all lattice words, and samples are constructed by the procedure of step (1); each sentence then contains the n-grams corresponding to all of its segmentation results.
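A minimal sketch of step (2), assuming the binary sentence-feature matrix over the shared n-gram list has already been built. The hand-rolled gradient-descent logistic regression below merely stands in for whatever solver the authors used, and all names are illustrative:

```python
import numpy as np

def train_pivot_classifiers(sent_feats, pivot_idx, nonpivot_idx,
                            lr=0.5, epochs=500):
    """sent_feats: (num_sentences, num_ngrams) binary matrix.
    For each pivot feature i, fit a logistic regression that predicts
    its presence from the non-pivot features only, and stack the
    coefficient vectors w_i into W (one column per pivot feature),
    so that W[j, i] is the learned correlation of non-pivot feature j
    with pivot feature i."""
    X = sent_feats[:, nonpivot_idx].astype(float)   # pivot dims removed
    W = np.zeros((len(nonpivot_idx), len(pivot_idx)))
    for i, p in enumerate(pivot_idx):
        y = sent_feats[:, p].astype(float)          # 1 if pivot present
        if y.min() == y.max():                      # degenerate label: skip
            continue
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(epochs):                     # plain gradient descent
            prob = 1.0 / (1.0 + np.exp(-(X @ w + b)))
            grad = X.T @ (prob - y) / len(y)
            w -= lr * grad
            b -= lr * (prob - y).mean()
        W[:, i] = w
    return W
```

A non-pivot feature that co-occurs with a pivot feature receives a positive coefficient, matching the sign interpretation of w_ij in the text.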
(3) Training the entity recognition model on the source domain
Train an entity recognition model, such as an LSTM, on the source-domain corpus. The model takes a sentence as input and outputs a label sequence of the same length as the sentence, marking the category of each character. For Chinese corpora, the lattice word segmentation is used as input.
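The embodiment leaves the tagger architecture open ("such as an LSTM"). The sketch below is only a toy dictionary tagger illustrating the stated interface, one label per character in a BIO scheme, not the LSTM model itself; the lexicon and tag names are invented for the example:

```python
def tag_sentence(sentence, entity_lexicon):
    """Toy stand-in for the source-domain entity recognition model:
    input is a sentence, output is a label sequence of the same length,
    marking the category of each character (BIO scheme)."""
    labels = ["O"] * len(sentence)
    for entity, etype in entity_lexicon.items():
        start = sentence.find(entity)
        while start != -1:
            labels[start] = "B-" + etype
            for k in range(start + 1, start + len(entity)):
                labels[k] = "I-" + etype
            start = sentence.find(entity, start + len(entity))
    return labels
```

The contract matters for step (4): because the output has exactly one label per character, replacing a target-domain word with a source-domain word of different length shifts the labels, which is the offset handling the text mentions.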
(4) Entity recognition on the target domain
Predict on the target domain with the model trained in step (3). Using the correlation information in the matrix W, look for a corresponding source-domain non-pivot feature word word_src for each target-domain non-pivot feature word word_dst. Specifically, screen the set word_pivots of pivot feature words that are most correlated with word_dst and whose correlation exceeds a threshold, e.g. 0.7; then find the source-domain word with the highest sum of correlation scores over word_pivots. If the average of those correlation scores also exceeds the threshold (e.g. 0.7), replace word_dst with that source-domain non-pivot feature word word_src; otherwise do not replace it. For example, if the source domain is diabetes medical records and the target domain is cancer medical records, the non-pivot feature word "lung cancer" in the target domain may be replaced with "diabetes" from the source domain. For Chinese corpora, each lattice-segmented word is replaced with the most relevant non-pivot word from the source domain. If a replacement changes the length of the prediction result, the resulting offset is handled.
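The replacement rule of step (4) can be sketched as follows, reading rows of a matrix W laid out as in step (2) (non-pivot features as rows, pivot features as columns); the threshold and all names are illustrative:

```python
import numpy as np

def find_replacement(dst_row, W, source_rows, threshold=0.7):
    """dst_row: row index in W of a target-domain non-pivot feature word.
    Select the pivot features correlated with it above the threshold
    (word_pivots), then pick the source-domain word with the highest
    summed correlation over those pivots; replace only if the average
    correlation also clears the threshold, else keep the original word
    (signalled by returning None)."""
    word_pivots = np.where(W[dst_row] > threshold)[0]
    if len(word_pivots) == 0:
        return None                      # no confident pivots: no replacement
    scores = {s: W[s, word_pivots].sum() for s in source_rows}
    best = max(scores, key=scores.get)
    if scores[best] / len(word_pivots) > threshold:
        return best
    return None
```

A target word and a source word end up matched exactly when they are both strongly correlated with the same shared pivot features, which is the correspondence principle stated in step (2).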
The above-described embodiments are intended to illustrate rather than to limit the invention, and any modifications and variations of the present invention are within the spirit of the invention and the scope of the appended claims.

Claims (8)

1. A few-sample entity recognition method based on domain adaptation, characterized by comprising the following steps:
(1) Selecting pivot features
Count the frequency of each n-gram in the corpora of the source domain and the target domain, and select phrases that appear in both domains with frequency above a threshold as pivot feature words. Each pivot feature word w instantiates the templates "pivot feature word w on the right", "pivot feature word w on the left" and "pivot feature word w in the middle"; each instance represents a pivot feature of one dimension, and together they form the pivot feature set. Here w denotes the pivot feature word in a text sentence.
(2) Establishing a training data set and training the pivot feature classifiers
Using the pivot feature set of step (1) as the training data set, train a group of pivot feature classifiers that predict whether a pivot feature word is present in a sample sentence. Specifically, for each pivot feature i, train a logistic regression model as its pivot feature classifier to judge whether that pivot feature is present among the words of a sample sentence. After training, the logistic regression coefficients form a column vector w_i whose dimension equals the number of non-pivot features; the vectors for all pivot features form the matrix W = [w_1, w_2, …, w_M], where w_M corresponds to the pivot feature of the M-th dimension and M = 3N for N pivot feature words. The j-th component w_ij of w_i represents the correlation between the j-th non-pivot feature word and the i-th pivot feature word; a positive w_ij indicates a positive correlation.
Sample input of the pivot feature classifier: compute the n-grams of all sentences in both domains to form an n-gram list. The feature of each sentence is a binary vector indicating, for every word in the n-gram list, whether it occurs in the sentence as a pivot or non-pivot feature word; the dimensions corresponding to all pivot feature words are then removed to obtain the sentence's feature vector.
Sample output of the pivot feature classifier: the output dimension is 1. Because the n-gram features of the sentence are computed, if a pivot feature word stands immediately to the left or right of some non-pivot feature word in the sentence, the sentence necessarily contains at least the 2-gram formed with that pivot feature word; the sentence sample is then labeled 1, and otherwise 0. At the same time, the corresponding entry in the coefficient vector w_i of that pivot feature word is positive, i.e. the non-pivot feature words in the sentence are positively correlated with the pivot feature word.
(3) Training the entity recognition model on the source domain
Train an entity recognition model on the source-domain corpus. The model takes a sentence as input and outputs a label sequence of the same length as the sentence, marking the category of each character.
(4) Entity recognition on the target domain
Predict on the target domain with the model trained in step (3). Using the correlation information in the matrix W, replace each non-pivot feature word in the target domain with the most relevant source-domain non-pivot feature word whose correlation also exceeds a threshold.
2. The few-sample entity recognition method based on domain adaptation according to claim 1, wherein the source domain has both labeled and unlabeled corpora, and the target domain contains only unlabeled corpora.
3. The few-sample entity recognition method based on domain adaptation according to claim 1, wherein n in the n-gram is 2.
4. The few-sample entity recognition method based on domain adaptation according to claim 1, wherein in step (2), w_i denotes the covariance between the non-pivot features and the pivot feature.
5. The few-sample entity recognition method based on domain adaptation according to claim 1, wherein in step (2), for Chinese corpora, the corpora of both domains are word-segmented, all possible segmentation results are combined into a lattice, n-grams are built over all lattice words, and samples are constructed by the procedure of step (1); each sentence then contains the n-grams corresponding to all of its segmentation results.
6. The few-sample entity recognition method based on domain adaptation according to claim 5, wherein in step (3), for Chinese corpora the lattice word segmentation is used as input.
7. The few-sample entity recognition method based on domain adaptation according to claim 5, wherein in step (4), for Chinese corpora each lattice-segmented word is replaced with the most relevant non-pivot word from the source domain.
8. The few-sample entity recognition method based on domain adaptation according to claim 1, wherein in step (4), if a replacement changes the length of the prediction result, the resulting offset is handled.
CN202110266908.7A 2021-03-11 2021-03-11 Few-sample entity identification method based on field adaptation Active CN112925886B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110266908.7A CN112925886B (en) 2021-03-11 2021-03-11 Few-sample entity identification method based on field adaptation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110266908.7A CN112925886B (en) 2021-03-11 2021-03-11 Few-sample entity identification method based on field adaptation

Publications (2)

Publication Number Publication Date
CN112925886A true CN112925886A (en) 2021-06-08
CN112925886B CN112925886B (en) 2022-01-04

Family

ID=76172758

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110266908.7A Active CN112925886B (en) 2021-03-11 2021-03-11 Few-sample entity identification method based on field adaptation

Country Status (1)

Country Link
CN (1) CN112925886B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080310718A1 (en) * 2007-06-18 2008-12-18 International Business Machines Corporation Information Extraction in a Natural Language Understanding System
CN106096004A (en) * 2016-06-23 2016-11-09 北京工业大学 A kind of method setting up extensive cross-domain texts emotional orientation analysis framework
CN110489753A (en) * 2019-08-15 2019-11-22 昆明理工大学 Improve the corresponding cross-cutting sensibility classification method of study of neuromechanism of feature selecting
CN110765775A (en) * 2019-11-01 2020-02-07 北京邮电大学 Self-adaptive method for named entity recognition field fusing semantics and label differences

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080310718A1 (en) * 2007-06-18 2008-12-18 International Business Machines Corporation Information Extraction in a Natural Language Understanding System
CN106096004A (en) * 2016-06-23 2016-11-09 北京工业大学 A kind of method setting up extensive cross-domain texts emotional orientation analysis framework
CN110489753A (en) * 2019-08-15 2019-11-22 昆明理工大学 Improve the corresponding cross-cutting sensibility classification method of study of neuromechanism of feature selecting
CN110765775A (en) * 2019-11-01 2020-02-07 北京邮电大学 Self-adaptive method for named entity recognition field fusing semantics and label differences

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张雪松, 庄严, 闫飞, 王伟: "Research and progress on category-level object recognition and detection based on transfer learning", Acta Automatica Sinica *

Also Published As

Publication number Publication date
CN112925886B (en) 2022-01-04

Similar Documents

Publication Publication Date Title
CN109902145B (en) Attention mechanism-based entity relationship joint extraction method and system
CN108897989B (en) Biological event extraction method based on candidate event element attention mechanism
CN109800437B (en) Named entity recognition method based on feature fusion
CN108733778B (en) Industry type identification method and device of object
CN112231447B (en) Method and system for extracting Chinese document events
CN111966917A (en) Event detection and summarization method based on pre-training language model
CN112101027A (en) Chinese named entity recognition method based on reading understanding
CN114416942A (en) Automatic question-answering method based on deep learning
CN113128203A (en) Attention mechanism-based relationship extraction method, system, equipment and storage medium
CN111222318A (en) Trigger word recognition method based on two-channel bidirectional LSTM-CRF network
CN113051914A (en) Enterprise hidden label extraction method and device based on multi-feature dynamic portrait
CN112699685B (en) Named entity recognition method based on label-guided word fusion
CN110046356A (en) Label is embedded in the application study in the classification of microblogging text mood multi-tag
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN112417132A (en) New intention recognition method for screening negative samples by utilizing predicate guest information
CN114781375A (en) Military equipment relation extraction method based on BERT and attention mechanism
CN115374786A (en) Entity and relationship combined extraction method and device, storage medium and terminal
CN110569355A (en) Viewpoint target extraction and target emotion classification combined method and system based on word blocks
CN112925886B (en) Few-sample entity identification method based on field adaptation
CN113705222B (en) Training method and device for slot identification model and slot filling method and device
CN115186670A (en) Method and system for identifying domain named entities based on active learning
CN115600595A (en) Entity relationship extraction method, system, equipment and readable storage medium
CN115130475A (en) Extensible universal end-to-end named entity identification method
CN114661900A (en) Text annotation recommendation method, device, equipment and storage medium
CN114510943A (en) Incremental named entity identification method based on pseudo sample playback

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 310051 7th floor, building B, 482 Qianmo Road, Xixing street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: Huoshi Creation Technology Co.,Ltd.

Address before: 310051 7th floor, building B, 482 Qianmo Road, Xixing street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee before: HANGZHOU FIRESTONE TECHNOLOGY Co.,Ltd.
