WO2023178802A1 - Named entity recognition method, apparatus, device and computer-readable storage medium - Google Patents

Named entity recognition method, apparatus, device and computer-readable storage medium

Info

Publication number
WO2023178802A1
WO2023178802A1 · PCT/CN2022/090756 · CN2022090756W
Authority
WO
WIPO (PCT)
Prior art keywords
named entity
entity recognition
sentence
bottleneck
feature
Prior art date
Application number
PCT/CN2022/090756
Other languages
English (en)
French (fr)
Inventor
舒畅
陈又新
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2023178802A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Definitions

  • the present application relates to the field of artificial intelligence technology, and in particular to a named entity recognition method, device, equipment and computer-readable storage medium.
  • NER: Named Entity Recognition
  • CRF: Conditional Random Field
  • embodiments of the present application provide a named entity recognition method, including:
  • a classification function is used to classify and identify multiple information bottleneck features, and a named entity category corresponding to the first statement is determined.
  • embodiments of the present application also provide a named entity recognition device, including:
  • the first acquisition module is used to acquire the pre-trained named entity recognition model
  • the second acquisition module is used to acquire the first statement to be recognized, and input the first statement into the named entity recognition model, so that the named entity recognition model performs named entity recognition processing;
  • named entity recognition models include:
  • a word segmentation module used to perform word segmentation processing on the first sentence to obtain a second sentence including multiple split words
  • a feature extraction module used to extract features from multiple split words to obtain multiple word embedding feature vectors
  • a cross-domain processing module configured to perform cross-domain information processing on the second sentence according to multiple word embedding feature vectors to obtain multiple cross-domain information features
  • An information bottleneck module is used to process multiple cross-domain information features to obtain multiple information bottleneck features
  • a classification module is configured to use a classification function to classify and identify multiple information bottleneck features, and determine a named entity category corresponding to the first statement.
  • embodiments of the present application also provide a computer device, including: a memory, a processor, and a computer program stored on the memory and executable on the processor.
  • the processor executes the computer program.
  • the program implements a named entity recognition method, wherein the named entity recognition method includes: obtaining a pre-trained named entity recognition model, wherein the named entity recognition model includes an information bottleneck layer; obtaining a first sentence to be recognized, and inputting the first sentence into the named entity recognition model so that the named entity recognition model performs the following named entity recognition processing: performing word segmentation processing on the first sentence to obtain a second sentence including multiple split words; performing feature extraction on the multiple split words to obtain multiple word embedding feature vectors; performing cross-domain information processing on the second sentence according to the multiple word embedding feature vectors to obtain multiple cross-domain information features; processing the multiple cross-domain information features through the information bottleneck layer to obtain multiple information bottleneck features; and using a classification function to classify and identify the multiple information bottleneck features, determining the named entity category corresponding to the first sentence.
  • embodiments of the present application further provide a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to execute a named entity recognition method, wherein the named entity recognition method includes: obtaining a pre-trained named entity recognition model, wherein the named entity recognition model includes an information bottleneck layer; obtaining a first sentence to be recognized, and inputting the first sentence into the named entity recognition model so that the named entity recognition model performs the following named entity recognition processing: performing word segmentation processing on the first sentence to obtain a second sentence including multiple split words; performing feature extraction on the multiple split words to obtain multiple word embedding feature vectors; performing cross-domain information processing on the second sentence according to the multiple word embedding feature vectors to obtain multiple cross-domain information features; processing the multiple cross-domain information features through the information bottleneck layer to obtain multiple information bottleneck features; and using a classification function to classify and identify the multiple information bottleneck features, determining the named entity category corresponding to the first sentence.
  • the named entity recognition method, device, equipment and computer-readable storage medium proposed by the embodiments of this application obtain a pre-trained named entity recognition model and input the acquired first sentence to be recognized into the model to perform named entity recognition processing. The model performs word segmentation on the first sentence, and the resulting second sentence includes multiple split words.
  • multiple word embedding feature vectors are obtained by feature extraction from the split words; these can effectively reflect semantic information and facilitate the accurate recognition of unregistered words.
  • by performing cross-domain information processing on the second sentence, multiple cross-domain information features are obtained, which provide the information of the split words to the named entity recognition model and help improve its recognition efficiency.
  • the multiple cross-domain information features are processed through the information bottleneck layer to obtain multiple information bottleneck features.
  • a classification function is then used to classify and identify the multiple information bottleneck features and determine the corresponding named entity category.
  • Figure 1 is a flow chart of a named entity recognition method provided by an embodiment of the present application.
  • Figure 2 is a flow chart of the named entity recognition processing process provided by an embodiment of the present application.
  • Figure 3 is a schematic structural diagram of the information bottleneck layer provided by an embodiment of the present application.
  • Figure 4 is a flow chart of a named entity recognition method provided by another embodiment of the present application.
  • Figure 5 is a flow chart of a named entity recognition method provided by another embodiment of the present application.
  • Figure 6 is a flow chart of a named entity recognition method provided by another embodiment of the present application.
  • Figure 7 is a flow chart of a named entity recognition method provided by another embodiment of the present application.
  • Figure 8 is a flow chart of a named entity recognition method provided by another embodiment of the present application.
  • Figure 9 is a flow chart of a named entity recognition method provided by another embodiment of the present application.
  • Figure 10 is a schematic structural diagram of a named entity recognition device provided by an embodiment of the present application.
  • Figure 11 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • related technologies generally use CRF sequence models to recognize named entities in text. This method can learn from manually labeled data, but its recognition of unlabeled data or unregistered words is poor. With the progress and development of society, more and more unregistered words are generated on the Internet, and the accuracy of named entity recognition for texts containing unregistered words is therefore low.
  • the first embodiment of the present application provides a named entity recognition method.
  • the named entity recognition method includes but is not limited to step S110 and step S120:
  • Step S110 Obtain a pre-trained named entity recognition model, where the named entity recognition model includes an information bottleneck layer;
  • the named entity recognition model has been pre-trained; by obtaining it, named entity recognition can be performed on the text to be recognized.
  • the named entity recognition model includes an information bottleneck layer, whose main purpose is to reduce the number of parameters and thus the amount of computation; after dimensionality reduction, data training and feature extraction can be performed more effectively and intuitively.
  • Step S120 Obtain the first sentence to be recognized, and input the first sentence into the named entity recognition model, so that the named entity recognition model performs named entity recognition processing.
  • the first sentence can be obtained from the Internet and mainly refers to the data to be recognized whose named entity types need to be identified.
  • named entities mainly include names of persons, place names, organization names, proper nouns and other entities identified by a name, and may also include entities such as numbers, dates, currencies and addresses.
  • the first sentence to be recognized may include an organization name (ORG) to be recognized; for example, the first sentence is "Apple is a company", where Apple is the named entity.
  • the named entity recognition processing includes but is not limited to steps S131 to S135:
  • Step S131 Perform word segmentation processing on the first sentence to obtain a second sentence including multiple split words
  • the word segmentation tool used is jieba; other word segmentation tools, such as the Stanford word segmenter, can also be used.
  • when the first sentence is segmented, the corresponding text order in the first sentence is identified and the first sentence is split according to that order, yielding multiple split words; the second sentence is composed of these split words.
  • for example, when the first sentence is "Apple is a company", the resulting word segmentation result is [Apple, is, company], in which "Apple", "is" and "company" are the split words.
  • word segmentation processing also includes removing some high-frequency and low-frequency words, as well as removing some meaningless symbols.
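  • as an illustration of this step (the patent itself provides no code), the following minimal sketch segments the example sentence with jieba's default mode; the stop-token set used to drop meaningless symbols is a hypothetical placeholder.

```python
# Minimal sketch of the word segmentation step, assuming jieba's default mode.
import jieba

STOP_TOKENS = {"，", "。", "、", "！", "？"}  # illustrative "meaningless symbols" to drop

def segment(sentence: str) -> list[str]:
    """Split the first sentence into the split words that form the second sentence."""
    return [tok for tok in jieba.cut(sentence) if tok.strip() and tok not in STOP_TOKENS]

print(segment("苹果是公司"))  # expected: ['苹果', '是', '公司'] ("Apple", "is", "company")
```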
  • Step S132 Perform feature extraction on multiple split words to obtain multiple word embedding feature vectors
  • each split word has a corresponding word embedding feature vector.
  • the word embedding feature vector can reflect the grammatical and semantic information of the split word, making it easy to effectively identify unregistered words.
  • the named entity recognition model also includes a language model.
  • the second sentence obtained after word segmentation is passed through a Bidirectional Encoder Representations from Transformers (BERT) model to obtain the word embedding feature vectors.
  • the BERT model is a deep, bidirectional, unsupervised language representation model with a bidirectional Transformer encoder; through its processing, the relationships between split words can be fully considered, making named entity recognition more accurate.
  • word embedding feature vectors can also be obtained through other language models, such as the Global Vectors for Word Representation (GloVe) model based on global information.
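  • a minimal sketch of obtaining contextual word embedding feature vectors with BERT via the Hugging Face transformers library is shown below; the "bert-base-chinese" checkpoint and the handling of sub-tokens are assumptions for illustration, not details given by the patent.

```python
# Minimal sketch: contextual embeddings for the split words, assuming bert-base-chinese.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

split_words = ["苹果", "是", "公司"]  # the second sentence [Apple, is, company]
enc = tokenizer(split_words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state  # (1, num_sub_tokens, 768)
# Each sub-token gets one vector; mapping sub-tokens back to the three split words
# (e.g., via enc.word_ids()) is omitted here for brevity.
```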
  • Step S133 Perform cross-domain information processing on the second sentence according to multiple word embedding feature vectors to obtain multiple cross-domain information features
  • since the text to be recognized is often composed of multiple split words, cross-domain information processing is performed on the second sentence based on the multiple word embedding feature vectors to obtain multiple cross-domain information features.
  • the cross-domain information features can provide the count information and association information of the second sentence composed of multiple split words to the named entity recognition model, thereby improving its recognition efficiency.
  • Step S134 Process multiple cross-domain information features through the information bottleneck layer to obtain multiple information bottleneck features
  • the information bottleneck layer can retain the necessary information in the cross-domain information features. By inputting the multiple cross-domain information features into the information bottleneck layer for processing, the corresponding information bottleneck features are obtained and feature extraction is performed more effectively; by utilizing the information bottleneck features, unregistered words among named entities can be better recognized.
  • the information bottleneck layer is composed of a multilayer perceptron (MLP).
  • the MLP is composed of two linear (Linear) layers and a ReLU activation function, connected in the order linear layer, ReLU activation function, linear layer.
  • the information bottleneck layer retains the necessary information of the input data: after the dimension is raised, the information is enriched; with the ReLU activation function, after the dimension is then reduced, all necessary information is kept without loss, which facilitates subsequent data training and feature extraction.
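  • the structure just described can be sketched directly in PyTorch; the concrete dimensions below (input, hidden, output) are assumptions for illustration, since the patent does not specify them.

```python
# Minimal sketch of the information bottleneck layer: Linear -> ReLU -> Linear.
import torch
import torch.nn as nn

class InformationBottleneck(nn.Module):
    def __init__(self, in_dim: int = 1546, hidden_dim: int = 2048, out_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),   # raise the dimension to enrich the information
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),  # reduce the dimension, keeping necessary information
        )

    def forward(self, cross_domain_features: torch.Tensor) -> torch.Tensor:
        # cross_domain_features: (num_spans, in_dim) -> (num_spans, out_dim) bottleneck features
        return self.mlp(cross_domain_features)

features = InformationBottleneck()(torch.randn(6, 1546))
print(features.shape)  # torch.Size([6, 256])
```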
  • Step S135 Use a classification function to classify and identify multiple information bottleneck features, and determine the named entity category corresponding to the first statement.
  • the named entity category refers to the category to which the named entity belongs.
  • the named entity category corresponding to the first statement can be determined, thereby facilitating the category labeling of the corresponding named entity.
  • for example, for "Apple is a company", the named entity category corresponding to Apple is the organization name (ORG). It should be noted that classifying and identifying multiple information bottleneck features may output multiple named entity categories or only one.
  • it should be noted that the classification function is the softmax function, and the classification loss is calculated as follows:
  • Loss = -log( score(z_i, y_i) / Σ_{y∈Y} score(z_i, y) ), with score(z_i, y_i) = exp(z_i, y_i);
  • where z_i is the class-i information bottleneck feature, y_i is the class-i named entity category, Y is the set of named entity categories, and score(z_i, y_i) is the score value of the class-i named entity category; y_i can be learned through the named entity recognition model and reflects the predicted value, and Loss is the loss value, reflecting the loss between the real value and the predicted value.
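  • a minimal sketch of this classification loss follows; the linear scoring head that maps a bottleneck feature to per-category scores is an assumption about how score(z_i, y_i) is produced.

```python
# Minimal sketch of the softmax classification loss over bottleneck features.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_categories, bottleneck_dim = 5, 256
score_head = nn.Linear(bottleneck_dim, num_categories)  # yields score(z_i, y) for every y in Y

z = torch.randn(8, bottleneck_dim)          # 8 information bottleneck features
y = torch.randint(0, num_categories, (8,))  # gold named entity categories

scores = score_head(z)
loss = F.cross_entropy(scores, y)  # equals -log(exp(score_i) / sum over Y of exp(score_y))
print(loss.item())
```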
  • according to the technical solution of the embodiments of the present application, by obtaining a pre-trained named entity recognition model, the acquired first sentence to be recognized is input into the model to perform named entity recognition processing, and the model performs word segmentation based on the first sentence.
  • the second sentence obtained includes multiple split words.
  • by performing feature extraction on the split words, multiple word embedding feature vectors are obtained, which can effectively reflect semantic information and facilitate accurate recognition of unregistered words.
  • performing cross-domain information processing on the second sentence yields multiple cross-domain information features, which provide the split-word information to the named entity recognition model and help improve its recognition efficiency.
  • the multiple cross-domain information features are processed through the information bottleneck layer to obtain multiple information bottleneck features; finally, a classification function is used to classify and identify them and determine the corresponding named entity category.
  • by utilizing the information bottleneck features, feature extraction is performed more effectively, unregistered words among named entities can be better recognized, and the accuracy of named entity recognition is improved.
  • in the above named entity recognition method, performing cross-domain information processing on the second sentence according to the multiple word embedding feature vectors in step S133 to obtain multiple cross-domain information features includes but is not limited to steps S210 to S230:
  • Step S210 Determine multiple boundary vectors based on multiple word embedding feature vectors, where the boundary vectors include starting word embedding features and end word embedding features;
  • Step S220 Determine the corresponding length vector according to each boundary vector
  • Step S230 Obtain multiple cross-domain information features based on multiple boundary vectors and multiple length vectors.
  • by performing cross-domain information processing on the second sentence, cross-domain information features are obtained; each cross-domain information feature includes two parts: the first part is the boundary vector and the second part is the length vector.
  • each split word has a corresponding word embedding feature vector.
  • the boundary vector consists of the starting-point word embedding feature h_bi and the corresponding end-point word embedding feature h_ei, i.e. their concatenation [h_bi ; h_ei]. It should be noted that the starting-point word embedding feature is the feature vector of the start word of the boundary vector, and the end-point word embedding feature is the feature vector of its end word.
  • each boundary vector has a corresponding length vector, which reflects the distance between the start word and the end word; together, the boundary vector and the length vector form the cross-domain information feature.
  • the second sentence has multiple word embedding feature vectors; by combining them, multiple boundary vectors can be obtained and multiple length vectors determined, so the second sentence can have multiple cross-domain information features.
  • the cross-domain information features of the second sentence are obtained from the word embedding feature vectors, so that the split words are represented as vectors in the neural network.
  • introducing the word embedding feature vectors and cross-domain information features into the named entity recognition model makes it competent for more complex situations, such as processing texts with specialized vocabulary and the interrelationships between specialized terms, which helps improve the accuracy of the final named entity recognition.
  • in the above method, multiple boundary vectors are determined based on the multiple word embedding feature vectors in step S210, including but not limited to steps S310 and S320:
  • Step S310 Determine multiple starting word embedding features and multiple end word embedding features based on multiple word embedding feature vectors
  • Step S320 Splice each starting point word embedding feature and the corresponding end point word embedding feature to obtain multiple boundary vectors.
  • the starting-point word embedding features and end-point word embedding features are determined from the multiple word embedding feature vectors.
  • the starting-point word embedding feature represents the feature vector of the start word of the boundary vector, and the end-point word embedding feature represents the feature vector of the end word of the boundary vector.
  • the boundary vector is formed by splicing the starting-point word embedding feature h_bi and the end-point word embedding feature h_ei. Splicing the two amounts to cross-fusing the features, so that the boundary vector has feature-fusion characteristics, which can effectively improve the recognition accuracy of the named entity recognition model.
  • specifically, taking the first sentence "Apple is a company" as an example, the second sentence after word segmentation is [Apple, is, company].
  • the multiple boundary vectors obtained are (1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3), where the numbers represent the positions of the split words in the second sentence.
  • (1, 1) represents splicing the word embedding feature vector of the word "Apple" with itself (two copies), and (1, 3) represents splicing the word embedding feature vectors of the two words "Apple" and "company", where "Apple" is the start word and "company" is the end word.
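  • the enumeration just described can be sketched as follows, assuming 768-dimensional word embedding feature vectors.

```python
# Minimal sketch: boundary vectors as concatenations of start- and end-word embeddings.
import torch

h = torch.randn(3, 768)  # word embedding feature vectors for [Apple, is, company]
boundary_vectors = {}
for b in range(3):         # start word index (0-based)
    for e in range(b, 3):  # end word index
        # 1-indexed keys match the spans (1, 1), (1, 2), ..., (3, 3) in the text
        boundary_vectors[(b + 1, e + 1)] = torch.cat([h[b], h[e]])

print(sorted(boundary_vectors))        # [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)]
print(boundary_vectors[(1, 3)].shape)  # torch.Size([1536])
```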
  • in the above method, the corresponding length vector is determined according to each boundary vector in step S220, including but not limited to steps S410 and S420:
  • Step S410 Determine the corresponding cross-domain length according to each boundary vector
  • Step S420 Obtain the corresponding length vector according to each cross-domain length and the preset dimension, where the current dimension of the length vector corresponds to the cross-domain length.
  • after the multiple boundary vectors are obtained, the length vector corresponding to each boundary vector can be determined.
  • the length vector is determined by the cross-domain length between the words: after a boundary vector is obtained, the usual operation is to derive its corresponding cross-domain length and then obtain the corresponding length vector from the cross-domain length and the preset dimension, where the current dimension of the length vector corresponds to the cross-domain length.
  • for example, the cross-domain length determined from the boundary vector (1, 1) is 0; the dimension of the length vector is a hyperparameter, and the preset dimension is set to 10.
  • the current dimension of the length vector is determined from the cross-domain length; the value of that dimension is set to 1 and the values of all other dimensions are 0, so the resulting length vector is [1, 0, 0, 0, 0, 0, 0, 0, 0, 0].
  • if the cross-domain length of (1, 3) is 2, the corresponding length vector is [0, 0, 1, 0, 0, 0, 0, 0, 0, 0].
  • by converting two split words into a fixed-length vector representation, data processing is made easier; the cross-domain information feature composed of the boundary vector and the length vector can effectively reflect the association between the split words that make up the text to be recognized, greatly improving the accuracy of named entity recognition.
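  • a minimal sketch of the length vector (and of splicing it with the boundary vector into the cross-domain information feature) is shown below, using the preset dimension of 10 from the example.

```python
# Minimal sketch: one-hot length vector whose active dimension is the cross-domain length.
import torch

def length_vector(start: int, end: int, preset_dim: int = 10) -> torch.Tensor:
    vec = torch.zeros(preset_dim)
    vec[end - start] = 1.0  # cross-domain length 0 -> dimension 0, length 2 -> dimension 2
    return vec

print(length_vector(1, 1))  # tensor([1., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
print(length_vector(1, 3))  # tensor([0., 0., 1., 0., 0., 0., 0., 0., 0., 0.])

# The cross-domain information feature then splices boundary and length vectors, e.g.:
boundary = torch.randn(1536)                                       # a boundary vector [h_b ; h_e]
cross_domain_feature = torch.cat([boundary, length_vector(1, 3)])  # dimension 1546
```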
  • the named entity recognition model is obtained according to the following training steps:
  • Step S510 Obtain a pre-annotated training data set, where each training data in the training data set is annotated sentences carrying named entities and annotated categories;
  • Step S520 Obtain replacement category sentences for each labeled sentence, where the replacement category sentences include sentences of the same category and sentences of different categories;
  • Step S530 Calculate the first loss value based on the labeled sentences, sentences of the same category and sentences of different categories;
  • Step S540 Train the initial model based on the first loss value to obtain the trained named entity recognition model.
  • the model is trained using a pre-labeled training data set.
  • each piece of training data in the training data set is an annotated sentence in which the named entities in the sentence and their categories have been manually labeled; for example, for "Apple is a company", Apple is labeled as ORG (organization name), and the resulting annotated sentence carries the named entity and the annotation category.
  • each annotated sentence is also paired with replacement-category sentences containing an entity of the same category and an entity of a different category; the first loss value is calculated from the annotated sentence and these sentences, and since it combines same-category and different-category entity data, continuously adjusting the model parameters according to the first loss value helps improve the recognition effect of the model and thus the accuracy of named entity recognition.
  • when obtaining the training data, the original data and the annotation categories corresponding to the named entities in the original data are first obtained; the annotation categories are written into the original data to obtain annotated sentences carrying the named entities and annotation categories.
  • for example, when the annotation category is an organization name, the annotated sentence can take the form: [ORG] + original data.
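  • a toy illustration of composing an annotated sentence in the "[annotation category] + original data" form described above:

```python
# Toy sketch of the "[category] + original data" annotation format; the helper
# name and tag vocabulary are illustrative assumptions.
def annotate(original: str, category: str = "ORG") -> str:
    return f"[{category}]{original}"

print(annotate("苹果是公司"))  # -> "[ORG]苹果是公司" ("Apple is a company" labeled as ORG)
```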
  • in step S530, the first loss value is calculated based on the labeled sentences, sentences of the same category and sentences of different categories, including but not limited to steps S610 and S620:
  • Step S610 Calculate the corresponding first bottleneck feature, second bottleneck feature and third bottleneck feature based on the labeled sentences, sentences of the same category and sentences of different categories;
  • Step S620 Calculate the first loss value based on the first bottleneck feature, the second bottleneck feature, and the third bottleneck feature.
  • specifically, the labeled sentence is "Apple is a company", the same-category sentence is "Google is a company", and the different-category sentence is "Zhang San is a company".
  • the information bottleneck features corresponding to "Apple is a company", "Google is a company" and "Zhang San is a company" are computed, i.e. the first bottleneck feature, the second bottleneck feature and the third bottleneck feature are obtained respectively.
  • the first loss value is calculated from the first, second and third bottleneck features, and the model is trained with the first loss value as the objective, yielding the trained named entity recognition model; by utilizing the information bottleneck features, the necessary information of the input data is effectively retained, and unregistered words among named entities can be better recognized.
  • it should be noted that the first, second and third bottleneck features are obtained at the information bottleneck layer of the named entity recognition model.
  • in step S620, the first loss value is calculated based on the first bottleneck feature, the second bottleneck feature and the third bottleneck feature, including but not limited to steps S710 to S730:
  • Step S710 Calculate the second loss value based on the first bottleneck feature
  • Step S720 Calculate the third loss value based on the first bottleneck feature, the second bottleneck feature, and the third bottleneck feature;
  • Step S730 Calculate the first loss value based on the second loss value and the third loss value.
  • the first bottleneck feature corresponds to the labeled sentence, which is used to train the named entity recognition model.
  • a second loss value is first calculated based on the first bottleneck feature.
  • in addition, a third loss value is calculated from the first, second and third bottleneck features, and the second loss value is corrected by the third loss value to obtain the first loss value.
  • with the goal of minimizing the first loss value, the named entity recognition model is trained so that it learns the ability to extract the categories of named entities.
  • the second loss value is obtained according to the following formula:
  • L_base = -log( score(z_i, y_i) / Σ_{y∈Y} score(z_i, y) ), with score(z_i, y_i) = exp(z_i, y_i);
  • where L_base is the second loss value, z_i is the class-i information bottleneck feature, y_i is the class-i named entity category, Y is the set of named entity categories, and score(z_i, y_i) is the score value of the class-i named entity category.
  • the third loss value is obtained from the first, second and third bottleneck features, where L_gi is the third loss value, z_1 is the first bottleneck feature, z_2 is the second bottleneck feature, z_3 is the third bottleneck feature, the gw function is the cosine similarity calculation, and Ep is the expectation calculation.
  • the first loss value is obtained according to the following formula:
  • L = L_base + γ * L_gi;
  • where L is the first loss value, L_base is the second loss value, L_gi is the third loss value, and γ is a hyperparameter used to adjust the weight influence of L_gi.
  • in the process of training the model, the annotated sentence is first obtained; for example, the annotated sentence is "Apple is a company".
  • the replacement-category sentences of "Apple is a company" are then obtained, namely "Google is a company" and "Zhang San is a company".
  • "Apple is a company", "Google is a company" and "Zhang San is a company" are input into the named entity recognition model at the same time, and the corresponding first bottleneck feature z_1, second bottleneck feature z_2 and third bottleneck feature z_3 are obtained at the information bottleneck layer; the second loss value L_base is calculated from the first bottleneck feature z_1, and the third loss value L_gi is calculated from the first bottleneck feature z_1, the second bottleneck feature z_2 and the third bottleneck feature z_3.
  • the third loss value L_gi enables the named entity recognition model to learn the similarity between the same named entity category and different named entity categories.
  • after the second loss value L_base and the third loss value L_gi are calculated, the weight influence of L_gi is first adjusted according to γ (in this embodiment γ is set to 0.3), and the adjusted L_gi is added to L_base; the first loss value L is the sum of the second loss value L_base and the adjusted third loss value L_gi. With the goal of minimizing the first loss value L, the parameters of the named entity recognition model are continuously updated.
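  • the combination of the losses during a training step can be sketched as follows. The exact formula for the third loss L_gi is given in the original only as an image; the triplet-style version below (pull the labeled and same-category bottleneck features together, push the different-category feature away, via the gw cosine similarity under an expectation) is one plausible reading and is an assumption, not the patent's verbatim formula.

```python
# Hedged sketch of L = L_base + gamma * L_gi with gamma = 0.3, as in this embodiment.
import torch
import torch.nn.functional as F

gamma = 0.3  # hyperparameter weighting the influence of L_gi

def gw(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return F.cosine_similarity(a, b, dim=-1)  # the gw cosine-similarity function

def first_loss(l_base: torch.Tensor, z1, z2, z3) -> torch.Tensor:
    # Assumed form of L_gi: expectation of cos(z1, z3) - cos(z1, z2), so that
    # minimising it favours same-category similarity over different-category similarity.
    l_gi = (gw(z1, z3) - gw(z1, z2)).mean()
    return l_base + gamma * l_gi  # L = L_base + gamma * L_gi

z1, z2, z3 = (torch.randn(8, 256) for _ in range(3))  # first/second/third bottleneck features
print(first_loss(torch.tensor(1.0), z1, z2, z3))      # minimised to update the model parameters
```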
  • the embodiments of this application can obtain and process relevant data based on artificial intelligence technology.
  • artificial intelligence is the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometric technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • the named entity recognition method in the embodiment of the present application can be applied to natural language processing application fields such as information retrieval, question and answer systems, machine translation, and sentiment analysis.
  • a second embodiment of the present application provides a named entity recognition device 1000.
  • Figure 10 is a schematic structural diagram of the named entity recognition device 1000 provided by an embodiment of the present application.
  • the named entity recognition device 1000 in the embodiment of the present application includes but is not limited to a first acquisition module 1010 and a second acquisition module 1020.
  • the first acquisition module 1010 is used to acquire a pre-trained named entity recognition model 1030;
  • the second acquisition module 1020 is used to obtain the first sentence to be recognized and input it into the named entity recognition model 1030, so that the named entity recognition model 1030 performs named entity recognition processing;
  • the named entity recognition model 1030 includes: a word segmentation module 1031, a feature extraction module 1032, a cross-domain processing module 1033, an information bottleneck module 1034 and a classification module 1035.
  • the word segmentation module 1031 is used to perform word segmentation processing on the first sentence to obtain a second sentence including multiple split words; the feature extraction module 1032 is used to perform feature extraction on the multiple split words to obtain multiple word embedding feature vectors; the cross-domain processing module 1033 is used to perform cross-domain information processing on the second sentence according to the multiple word embedding feature vectors to obtain multiple cross-domain information features; the information bottleneck module 1034 is used to process the multiple cross-domain information features to obtain multiple information bottleneck features; the classification module 1035 is configured to use a classification function to classify and identify the multiple information bottleneck features and determine the named entity category corresponding to the first sentence.
  • according to the named entity recognition device of the embodiments of the present application, by acquiring a pre-trained named entity recognition model, the acquired first sentence to be recognized is input into the model to perform named entity recognition processing, and the model performs word segmentation based on the first sentence, the resulting second sentence including multiple split words.
  • by performing feature extraction on the split words, multiple word embedding feature vectors are obtained, which can effectively reflect semantic information and facilitate accurate recognition of unregistered words.
  • by performing cross-domain information processing on the second sentence, multiple cross-domain information features are obtained, which provide the split-word information to the named entity recognition model and help improve its recognition efficiency.
  • the multiple cross-domain information features are processed by the information bottleneck module to obtain multiple information bottleneck features; finally, a classification function is used to classify and identify them and determine the corresponding named entity category, so that unregistered words among named entities can be better recognized and the accuracy of named entity recognition improved.
  • in the above device, cross-domain information processing is performed on the second sentence based on the multiple word embedding feature vectors to obtain multiple cross-domain information features, which specifically includes:
  • determining multiple boundary vectors based on the multiple word embedding feature vectors, where the boundary vectors include starting-point word embedding features and end-point word embedding features; determining the corresponding length vector according to each boundary vector; and obtaining multiple cross-domain information features from the multiple boundary vectors and multiple length vectors.
  • in the above device, multiple boundary vectors are determined based on the multiple word embedding feature vectors, which specifically includes:
  • determining multiple starting-point word embedding features and multiple end-point word embedding features from the multiple word embedding feature vectors; and splicing each starting-point word embedding feature with the corresponding end-point word embedding feature to obtain multiple boundary vectors.
  • in the above device, the corresponding length vector is determined according to each boundary vector, which specifically includes:
  • determining the corresponding cross-domain length according to each boundary vector; and obtaining the corresponding length vector from each cross-domain length and the preset dimension, where the current dimension of the length vector corresponds to the cross-domain length.
  • the named entity recognition model is obtained according to the following training steps:
  • each piece of training data in the training data set is an annotated sentence carrying named entities and annotation categories;
  • replacement category sentences for each labeled sentence, where replacement category sentences include sentences of the same category and sentences of different categories;
  • the first loss value is calculated based on the labeled sentences, sentences of the same category and sentences of different categories;
  • the initial model is trained according to the first loss value to obtain a trained named entity recognition model.
  • the first loss value is calculated based on the labeled sentences, sentences of the same category and sentences of different categories, specifically including:
  • calculating the corresponding first bottleneck feature, second bottleneck feature and third bottleneck feature from the labeled sentences, sentences of the same category and sentences of different categories; and calculating the first loss value from the first, second and third bottleneck features.
  • the first loss value is calculated based on the first bottleneck feature, the second bottleneck feature and the third bottleneck feature, which specifically includes:
  • the second loss value is calculated based on the first bottleneck characteristics
  • the third loss value is calculated based on the first bottleneck feature, the second bottleneck feature and the third bottleneck feature;
  • the first loss value is calculated based on the second loss value and the third loss value.
  • the third embodiment of the present application also provides a computer device 1100.
  • the computer device 1100 includes: a memory 1110, a processor 1120, and a computer program stored in the memory 1110 and executable on the processor 1120.
  • the processor 1120 and the memory 1110 may be connected through a bus or other means.
  • the memory 1110 can be used to store non-transitory software programs and non-transitory computer executable programs.
  • memory 1110 may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device.
  • memory 1110 optionally includes memory located remotely relative to the processor 1120, and these remote memories may be connected to the computer device via a network. Examples of such networks include but are not limited to the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the computer device 1100 shown in FIG. 11 does not limit the embodiments of the present application; it may include more or fewer components than shown, combine certain components, or use a different arrangement of components.
  • the non-transitory software programs and instructions required to implement the named entity recognition method of the above embodiments are stored in the memory 1110; when executed by the processor 1120, they perform the named entity recognition method of the above embodiments, for example, the method steps in Figure 1, Figure 2 and Figures 4 to 9 described above.
  • according to the computer device of the embodiments of the present application, by obtaining a pre-trained named entity recognition model, the acquired first sentence to be recognized is input into the model to perform named entity recognition processing, and the model performs word segmentation based on the first sentence.
  • the second sentence obtained includes multiple split words.
  • by performing feature extraction on the split words, multiple word embedding feature vectors are obtained, which can effectively reflect semantic information and facilitate accurate recognition of unregistered words.
  • performing cross-domain information processing on the second sentence yields multiple cross-domain information features, which provide the split-word information to the named entity recognition model and help improve its recognition efficiency.
  • the multiple cross-domain information features are processed through the information bottleneck layer to obtain multiple information bottleneck features; finally, a classification function is used to classify and identify them and determine the corresponding named entity category.
  • the fourth-aspect embodiment of the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the computer-readable storage medium stores computer-executable instructions, which are used to execute the above named entity recognition method; for example, when executed by a processor of the above named entity recognition device, they can cause the processor to execute the named entity recognition method in the above embodiments, for example, the method steps in Figure 1, Figure 2 and Figures 4 to 9 described above.
  • according to the computer-readable storage medium of the embodiments of the present application, by obtaining a pre-trained named entity recognition model, the acquired first sentence to be recognized is input into the model to perform named entity recognition processing, and the model performs word segmentation based on the first sentence, the resulting second sentence including multiple split words.
  • multiple word embedding feature vectors are obtained through feature extraction on the split words, which can effectively reflect semantic information and facilitate accurate recognition of unregistered words.
  • by performing cross-domain information processing on the second sentence, multiple cross-domain information features are obtained, which provide the split-word information to the named entity recognition model and help improve its recognition efficiency.
  • the multiple cross-domain information features are processed to obtain multiple information bottleneck features, and a classification function is used to classify and identify them and determine the corresponding named entity category.
  • by utilizing the information bottleneck features, feature extraction is performed more effectively and unregistered words among named entities can be better recognized, which helps improve the accuracy of named entity recognition.
  • computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disk (DVD) or other optical disk storage, magnetic cassettes, tape, disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • communication media typically embody computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.


Abstract

This application relates to the field of artificial intelligence technology and provides a named entity recognition method, apparatus, device and computer-readable storage medium. The named entity recognition method includes: obtaining a pre-trained named entity recognition model; obtaining a first sentence to be recognized and inputting it into the named entity recognition model, so that the model performs the following named entity recognition processing: performing word segmentation on the first sentence to obtain a second sentence including multiple split words; performing feature extraction on the multiple split words to obtain multiple word embedding feature vectors; processing the second sentence according to the multiple word embedding feature vectors to obtain multiple cross-domain information features; processing the multiple cross-domain information features through an information bottleneck layer to obtain multiple information bottleneck features; and using a classification function to classify and identify the multiple information bottleneck features and determine the corresponding named entity category. This enables better recognition of unregistered words among named entities and improves the accuracy of named entity recognition.

Description

Named entity recognition method, apparatus, device and computer-readable storage medium
This application claims priority to the Chinese patent application filed with the China Patent Office on March 22, 2022, with application number 2022102825874 and the invention title "Named entity recognition method, apparatus, device and computer-readable storage medium", the entire contents of which are incorporated into this application by reference.
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to a named entity recognition method, apparatus, device and computer-readable storage medium.
Background
With the continuous development of artificial intelligence, natural language processing technologies based on deep learning have made great progress. Named Entity Recognition (NER) is a basic task of natural language processing whose purpose is to recognize entities with specific meaning in text, mainly including names of persons, place names, organization names, proper nouns, etc. Named entity recognition plays an important role in application fields such as information retrieval, question answering systems and machine translation. Related technologies generally use a Conditional Random Field (CRF) sequence model for named entity recognition in text. The inventors realized that this method can learn from manually labeled data, but its recognition of unlabeled data or unregistered words is poor, which affects the accuracy of named entity recognition.
Technical Problem
The technical problem of the prior art recognized by the inventors is as follows: using a CRF sequence model for named entity recognition in text can learn from manually labeled data, but its recognition of unlabeled data or unregistered words is poor, which affects the accuracy of named entity recognition.
Technical Solution
In a first aspect, an embodiment of this application provides a named entity recognition method, including:
obtaining a pre-trained named entity recognition model, where the named entity recognition model includes an information bottleneck layer;
obtaining a first sentence to be recognized, and inputting the first sentence into the named entity recognition model, so that the named entity recognition model performs the following named entity recognition processing:
performing word segmentation processing on the first sentence to obtain a second sentence including multiple split words;
performing feature extraction on the multiple split words to obtain multiple word embedding feature vectors;
performing cross-domain information processing on the second sentence according to the multiple word embedding feature vectors to obtain multiple cross-domain information features;
processing the multiple cross-domain information features through the information bottleneck layer to obtain multiple information bottleneck features;
using a classification function to classify and identify the multiple information bottleneck features, and determining the named entity category corresponding to the first sentence.
In a second aspect, an embodiment of this application further provides a named entity recognition apparatus, including:
a first acquisition module, configured to acquire a pre-trained named entity recognition model;
a second acquisition module, configured to obtain a first sentence to be recognized and input the first sentence into the named entity recognition model, so that the named entity recognition model performs named entity recognition processing;
where the named entity recognition model includes:
a word segmentation module, configured to perform word segmentation processing on the first sentence to obtain a second sentence including multiple split words;
a feature extraction module, configured to perform feature extraction on the multiple split words to obtain multiple word embedding feature vectors;
a cross-domain processing module, configured to perform cross-domain information processing on the second sentence according to the multiple word embedding feature vectors to obtain multiple cross-domain information features;
an information bottleneck module, configured to process the multiple cross-domain information features to obtain multiple information bottleneck features;
a classification module, configured to use a classification function to classify and identify the multiple information bottleneck features and determine the named entity category corresponding to the first sentence.
In a third aspect, an embodiment of this application further provides a computer device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements a named entity recognition method, the named entity recognition method including: obtaining a pre-trained named entity recognition model, where the named entity recognition model includes an information bottleneck layer; obtaining a first sentence to be recognized, and inputting the first sentence into the named entity recognition model, so that the named entity recognition model performs the following named entity recognition processing: performing word segmentation processing on the first sentence to obtain a second sentence including multiple split words; performing feature extraction on the multiple split words to obtain multiple word embedding feature vectors; performing cross-domain information processing on the second sentence according to the multiple word embedding feature vectors to obtain multiple cross-domain information features; processing the multiple cross-domain information features through the information bottleneck layer to obtain multiple information bottleneck features; using a classification function to classify and identify the multiple information bottleneck features, and determining the named entity category corresponding to the first sentence.
In a fourth aspect, an embodiment of this application further provides a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to execute a named entity recognition method, the named entity recognition method including: obtaining a pre-trained named entity recognition model, where the named entity recognition model includes an information bottleneck layer; obtaining a first sentence to be recognized, and inputting the first sentence into the named entity recognition model, so that the named entity recognition model performs the following named entity recognition processing: performing word segmentation processing on the first sentence to obtain a second sentence including multiple split words; performing feature extraction on the multiple split words to obtain multiple word embedding feature vectors; performing cross-domain information processing on the second sentence according to the multiple word embedding feature vectors to obtain multiple cross-domain information features; processing the multiple cross-domain information features through the information bottleneck layer to obtain multiple information bottleneck features; using a classification function to classify and identify the multiple information bottleneck features, and determining the named entity category corresponding to the first sentence.
Beneficial Effects
The named entity recognition method, apparatus, device and computer-readable storage medium proposed by the embodiments of this application obtain a pre-trained named entity recognition model and input the acquired first sentence to be recognized into the model to perform named entity recognition processing. The model performs word segmentation based on the first sentence, and the resulting second sentence includes multiple split words. Feature extraction on the multiple split words yields multiple word embedding feature vectors, which can effectively reflect semantic information and facilitate accurate recognition of unregistered words. Cross-domain information processing on the second sentence yields multiple cross-domain information features, which provide the split-word information to the named entity recognition model and help improve its recognition efficiency. The multiple cross-domain information features are processed through the information bottleneck layer to obtain multiple information bottleneck features, and finally a classification function is used to classify and identify the multiple information bottleneck features and determine the corresponding named entity category. By utilizing the information bottleneck features, feature extraction is performed more effectively and unregistered words among named entities can be better recognized, which helps improve the accuracy of named entity recognition.
Other features and advantages of this application will be set forth in the following description and will in part become apparent from the description or be understood by practicing this application. The objectives and other advantages of this application can be realized and obtained by the structures particularly pointed out in the description, claims and drawings.
Brief Description of the Drawings
The drawings are provided for a further understanding of the technical solution of this application and constitute a part of the description; together with the embodiments of this application, they serve to explain the technical solution of this application and do not constitute a limitation of it.
Figure 1 is a flow chart of a named entity recognition method provided by an embodiment of this application;
Figure 2 is a flow chart of the named entity recognition processing procedure provided by an embodiment of this application;
Figure 3 is a schematic structural diagram of the information bottleneck layer provided by an embodiment of this application;
Figure 4 is a flow chart of a named entity recognition method provided by another embodiment of this application;
Figure 5 is a flow chart of a named entity recognition method provided by another embodiment of this application;
Figure 6 is a flow chart of a named entity recognition method provided by another embodiment of this application;
Figure 7 is a flow chart of a named entity recognition method provided by another embodiment of this application;
Figure 8 is a flow chart of a named entity recognition method provided by another embodiment of this application;
Figure 9 is a flow chart of a named entity recognition method provided by another embodiment of this application;
Figure 10 is a schematic structural diagram of a named entity recognition apparatus provided by an embodiment of this application;
Figure 11 is a schematic structural diagram of a computer device provided by an embodiment of this application.
Embodiments of the Present Invention
To make the objectives, technical solutions and advantages of this application clearer, this application is further described in detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain this application and are not used to limit it.
It should be noted that although functional modules are divided in the apparatus schematic diagrams and a logical order is shown in the flow charts, in some cases the steps shown or described may be performed with a module division different from that in the apparatus, or in an order different from that in the flow charts. The terms "first", "second", etc. in the description, claims and drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. In addition, the terms "include" or "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device including a series of steps or units is not necessarily limited to the steps or units clearly listed, but may include other steps or units not clearly listed or inherent to the process, method, product or device. The term "and/or" used herein merely describes an association of associated objects, indicating that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone. The character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
Related technologies generally use a CRF sequence model for named entity recognition in text. This method can learn from manually labeled data, but its recognition of unlabeled data or unregistered words is poor; with the progress and development of society, more and more unregistered words are generated on the Internet, and the accuracy of named entity recognition for texts containing unregistered words is low.
The embodiments of this application are further described below with reference to the drawings.
As shown in Figure 1, a first-aspect embodiment of this application provides a named entity recognition method, which includes but is not limited to steps S110 and S120:
Step S110: obtain a pre-trained named entity recognition model, where the named entity recognition model includes an information bottleneck layer;
It should be noted that the named entity recognition model has been pre-trained; by obtaining it, named entity recognition can be performed on the text to be recognized. The model includes an information bottleneck layer, whose main purpose is to reduce the number of parameters and thus the amount of computation; after dimensionality reduction, data training and feature extraction can be performed more effectively and intuitively.
Step S120: obtain a first sentence to be recognized, and input the first sentence into the named entity recognition model so that the model performs named entity recognition processing.
It should be noted that the first sentence can be obtained from the Internet and mainly refers to the data whose named entity types need to be recognized. Named entities mainly include names of persons, place names, organization names, proper nouns and other entities identified by a name, and may also include entities such as numbers, dates, currencies and addresses. For example, the first sentence to be recognized may include an organization name (ORG) to be recognized; the first sentence is "Apple is a company", where Apple is the named entity. By inputting the first sentence to be recognized into the named entity recognition model, feature extraction can be performed effectively, which helps improve the accuracy of named entity recognition.
As shown in Figure 2, the named entity recognition processing includes but is not limited to steps S131 to S135:
Step S131: perform word segmentation processing on the first sentence to obtain a second sentence including multiple split words;
Word segmentation of the first sentence yields the segmented second sentence, which facilitates better named entity recognition. The word segmentation tool used is jieba; other tools, such as the Stanford word segmenter, can also be used.
When the first sentence is segmented, the corresponding text order in the first sentence is identified and the sentence is split according to that order, yielding multiple split words that compose the second sentence. For example, when the first sentence is "Apple is a company", the segmentation result is [Apple, is, company], where "Apple", "is" and "company" are the split words. Word segmentation also includes removing some high-frequency and low-frequency words, as well as some meaningless symbols.
Step S132: perform feature extraction on the multiple split words to obtain multiple word embedding feature vectors;
After the first sentence is segmented into multiple split words, the named entity recognition model performs feature extraction on them to obtain the multiple word embedding feature vectors of the second sentence. It can be understood that each split word has a corresponding word embedding feature vector, which can reflect the grammatical and semantic information of the split word and facilitates effective recognition of unregistered words.
Specifically, the named entity recognition model also includes a language model. The segmented second sentence is passed through a Bidirectional Encoder Representations from Transformers (BERT) model to obtain the word embedding feature vectors. The BERT model is a deep, bidirectional, unsupervised language representation model with a bidirectional Transformer encoder; through its processing, the relationships between split words can be fully considered, making named entity recognition more accurate. It should be noted that the word embedding feature vectors can also be obtained through other language models, such as the Global Vectors for Word Representation (GloVe) model based on global information.
Step S133: perform cross-domain information processing on the second sentence according to the multiple word embedding feature vectors to obtain multiple cross-domain information features;
It should be noted that since the text to be recognized is often composed of multiple split words, cross-domain information processing is performed on the second sentence according to the multiple word embedding feature vectors to obtain multiple cross-domain information features. The cross-domain information features can provide the count information and association information of the second sentence composed of multiple split words to the named entity recognition model, thereby improving its recognition efficiency.
Step S134: process the multiple cross-domain information features through the information bottleneck layer to obtain multiple information bottleneck features;
The information bottleneck layer can retain the necessary information in the cross-domain information features. By inputting the multiple cross-domain information features into the information bottleneck layer for processing, the corresponding information bottleneck features are obtained and feature extraction is performed more effectively; by utilizing the information bottleneck features, unregistered words among named entities can be better recognized.
As shown in Figure 3, it should be noted that the information bottleneck layer is composed of a multilayer perceptron (MLP). The MLP is composed of two linear (Linear) layers and a ReLU activation function, connected in the order linear layer, ReLU activation function, linear layer. The information bottleneck layer retains the necessary information of the input data: after the dimension is raised, the information is enriched; with the ReLU activation function, after the dimension is then reduced, all necessary information is kept without loss, which facilitates subsequent data training and feature extraction.
Step S135: use a classification function to classify and identify the multiple information bottleneck features, and determine the named entity category corresponding to the first sentence.
The named entity category is the category to which a named entity belongs. By classifying and identifying the multiple information bottleneck features with a classification function, the named entity category corresponding to the first sentence can be determined, which facilitates category labeling of the corresponding named entity. For example, for "Apple is a company", the named entity category corresponding to Apple is the organization name (ORG). It should be noted that classifying multiple information bottleneck features may output multiple named entity categories or only one.
It should be noted that the classification function is the softmax function. The classification loss is calculated as follows:
Loss = -log( score(z_i, y_i) / Σ_{y∈Y} score(z_i, y) );
score(z_i, y_i) = exp(z_i, y_i);
where z_i is the class-i information bottleneck feature, y_i is the class-i named entity category, Y is the set of named entity categories, and score(z_i, y_i) is the score value of the class-i named entity category; y_i can be learned through the named entity recognition model and reflects the predicted value, and Loss is the loss value, reflecting the loss between the real value and the predicted value.
According to the technical solution of the embodiments of this application, a pre-trained named entity recognition model is obtained and the acquired first sentence to be recognized is input into it to perform named entity recognition processing. The model segments the first sentence into a second sentence including multiple split words; feature extraction on the split words yields multiple word embedding feature vectors, which effectively reflect semantic information and facilitate accurate recognition of unregistered words; cross-domain information processing on the second sentence yields multiple cross-domain information features, which provide the split-word information to the model and help improve its recognition efficiency; the information bottleneck layer processes the multiple cross-domain information features to obtain multiple information bottleneck features; finally, a classification function is used to classify and identify the multiple information bottleneck features and determine the corresponding named entity category. By utilizing the information bottleneck features, feature extraction is performed more effectively, unregistered words among named entities can be better recognized, and the accuracy of named entity recognition is improved.
As shown in Figure 4, in the above named entity recognition method, performing cross-domain information processing on the second sentence according to the multiple word embedding feature vectors in step S133 to obtain multiple cross-domain information features includes but is not limited to steps S210 to S230:
Step S210: determine multiple boundary vectors based on the multiple word embedding feature vectors, where a boundary vector includes a starting-point word embedding feature and an end-point word embedding feature;
Step S220: determine the corresponding length vector according to each boundary vector;
Step S230: obtain multiple cross-domain information features from the multiple boundary vectors and the multiple length vectors.
Cross-domain information processing of the second sentence produces cross-domain information features consisting of two parts: the first part is the boundary vector, and the second part is the length vector.
As for the boundary vector: since the second sentence includes multiple split words, each with a corresponding word embedding feature vector, feature extraction on the split words yields multiple word embedding feature vectors; the boundary vector consists of the starting-point word embedding feature h_bi and the corresponding end-point word embedding feature h_ei, i.e. their concatenation [h_bi ; h_ei]. It should be noted that the starting-point word embedding feature is the feature vector of the start word of the boundary vector, and the end-point word embedding feature is the feature vector of its end word. As for the length vector: each boundary vector has a corresponding length vector, which reflects the distance between the start word and the end word; the boundary vector and the length vector together form the cross-domain information feature.
It can be understood that the second sentence has multiple word embedding feature vectors; combining them yields multiple boundary vectors, and multiple length vectors can be determined accordingly, so the second sentence can have multiple cross-domain information features. The cross-domain information features of the second sentence are obtained from the word embedding feature vectors, so that split words are represented as vectors in the neural network. Introducing the word embedding feature vectors and cross-domain information features into the named entity recognition model makes it competent for more complex situations, such as processing texts with specialized vocabulary and the interrelationships between specialized terms, which helps improve the accuracy of the final named entity recognition.
As shown in Figure 5, in the above method, determining multiple boundary vectors based on the multiple word embedding feature vectors in step S210 includes but is not limited to steps S310 and S320:
Step S310: determine multiple starting-point word embedding features and multiple end-point word embedding features from the multiple word embedding feature vectors;
Step S320: splice each starting-point word embedding feature with the corresponding end-point word embedding feature to obtain multiple boundary vectors.
It can be understood that the starting-point and end-point word embedding features are determined from the multiple word embedding feature vectors; the former is the feature vector of the start word of a boundary vector and the latter is the feature vector of its end word. The boundary vector is formed by splicing the starting-point word embedding feature h_bi and the end-point word embedding feature h_ei; splicing them amounts to cross-fusing the features, so that the boundary vector has feature-fusion characteristics, which can effectively improve the recognition accuracy of the named entity recognition model.
Specifically, taking the first sentence "Apple is a company" as an example, the segmented second sentence is [Apple, is, company], and the multiple boundary vectors obtained are (1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3), where the numbers indicate the positions of the split words in the second sentence; (1, 1) represents splicing the word embedding feature vector of the word "Apple" with itself (two copies), and (1, 3) represents splicing the word embedding feature vectors of the two words "Apple" and "company", where "Apple" is the start word and "company" is the end word.
As shown in Figure 6, in the above method, determining the corresponding length vector according to each boundary vector in step S220 includes but is not limited to steps S410 and S420:
Step S410: determine the corresponding cross-domain length according to each boundary vector;
Step S420: obtain the corresponding length vector from each cross-domain length and the preset dimension, where the current dimension of the length vector corresponds to the cross-domain length.
After the multiple boundary vectors are obtained, the length vector corresponding to each boundary vector can be determined. The length vector is determined by the cross-domain length between the words: once a boundary vector is obtained, the usual operation is to derive its corresponding cross-domain length, and then obtain the corresponding length vector from the cross-domain length and the preset dimension, with the current dimension of the length vector corresponding to the cross-domain length. For example, the cross-domain length determined from the boundary vector (1, 1) is 0; the dimension of the length vector is a hyperparameter, and the preset dimension is set to 10; the current dimension of the length vector is determined from the cross-domain length, the value of that dimension is set to 1 and the values of all other dimensions are 0, so the resulting length vector is [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]; if the cross-domain length of (1, 3) is 2, the corresponding length vector is [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]. Converting two split words into a fixed-length vector representation facilitates data processing; the cross-domain information feature composed of the boundary vector and the length vector can effectively reflect the association between the split words that make up the text to be recognized, greatly improving the accuracy of named entity recognition.
As shown in Figure 7, in the above named entity recognition method, the named entity recognition model is obtained according to the following training steps:
Step S510: obtain a pre-annotated training data set, where each piece of training data in the set is an annotated sentence carrying named entities and annotation categories;
Step S520: obtain replacement-category sentences for each annotated sentence, where the replacement-category sentences include a same-category sentence and a different-category sentence;
Step S530: calculate a first loss value based on the annotated sentence, the same-category sentence and the different-category sentence;
Step S540: train the initial model according to the first loss value to obtain the trained named entity recognition model.
The model is trained using a pre-annotated training data set. Each piece of training data is an annotated sentence in which the named entities and their categories have been manually labeled; for example, for "Apple is a company", Apple is labeled as ORG (organization name), and the resulting annotated sentence carries the named entity and the annotation category. Each piece of training data is paired with other sentences containing an entity of the same category and an entity of a different category, i.e. the replacement-category sentences of each annotated sentence are obtained, e.g. "Google is a company" (same-category sentence) and "Zhang San is a company" (different-category sentence). The first loss value is calculated from the annotated sentence, the same-category sentence and the different-category sentence; since it combines same-category and different-category entity data, continuously adjusting the model parameters according to the first loss value helps improve the recognition effect of the model and thus the accuracy of named entity recognition.
It should be noted that when obtaining the training data, the original data and the annotation categories corresponding to the named entities in the original data are obtained first, and the annotation categories are written into the original data to obtain annotated sentences carrying the named entities and annotation categories; for example, when the annotation category is an organization name, an annotated sentence can take the form: [ORG] + original data.
It can be understood that, based on the trained named entity recognition model, inputting the first sentence to be recognized into the model enables efficient recognition of the named entity category.
As shown in Figure 8, in the above method, calculating the first loss value based on the annotated sentence, the same-category sentence and the different-category sentence in step S530 includes but is not limited to steps S610 and S620:
Step S610: calculate the corresponding first bottleneck feature, second bottleneck feature and third bottleneck feature from the annotated sentence, the same-category sentence and the different-category sentence;
Step S620: calculate the first loss value from the first, second and third bottleneck features.
Specifically, the annotated sentence is "Apple is a company", the same-category sentence is "Google is a company", and the different-category sentence is "Zhang San is a company". The information bottleneck features corresponding to "Apple is a company", "Google is a company" and "Zhang San is a company" are computed, i.e. the first, second and third bottleneck features are obtained respectively; the first loss value is calculated from them, and the model is trained with the first loss value as the objective to obtain the trained named entity recognition model. By utilizing the information bottleneck features, the necessary information of the input data is effectively retained, and unregistered words among named entities can be better recognized.
It should be noted that the first, second and third bottleneck features are obtained at the information bottleneck layer of the named entity recognition model.
As shown in Figure 9, in the above method, calculating the first loss value from the first, second and third bottleneck features in step S620 includes but is not limited to steps S710 to S730:
Step S710: calculate a second loss value from the first bottleneck feature;
Step S720: calculate a third loss value from the first, second and third bottleneck features;
Step S730: calculate the first loss value from the second loss value and the third loss value.
It should be noted that the first bottleneck feature corresponds to the annotated sentence, which is used to train the named entity recognition model. The second loss value is first calculated from the first bottleneck feature; in addition, the third loss value is calculated from the first, second and third bottleneck features, and the second loss value is corrected by the third loss value to obtain the first loss value. With the goal of minimizing the first loss value, the named entity recognition model is trained so that it learns the ability to extract the categories of named entities.
In the above named entity recognition method, the second loss value is obtained according to the following formula:
L_base = -log( score(z_i, y_i) / Σ_{y∈Y} score(z_i, y) );
score(z_i, y_i) = exp(z_i, y_i);
where L_base is the second loss value, z_i is the class-i information bottleneck feature, y_i is the class-i named entity category, Y is the set of named entity categories, and score(z_i, y_i) is the score value of the class-i named entity category.
The third loss value L_gi is calculated from the first bottleneck feature z_1, the second bottleneck feature z_2 and the third bottleneck feature z_3, where the gw function is the cosine similarity calculation and Ep is the expectation calculation.
The first loss value is obtained according to the following formula:
L = L_base + γ * L_gi;
where L is the first loss value, L_base is the second loss value, L_gi is the third loss value, and γ is a hyperparameter used to adjust the weight influence of L_gi.
It should be noted that in the process of training the model, the annotated sentence is first obtained, e.g. "Apple is a company"; its replacement-category sentences, "Google is a company" and "Zhang San is a company", are further obtained. "Apple is a company", "Google is a company" and "Zhang San is a company" are input into the named entity recognition model at the same time, and the corresponding first bottleneck feature z_1, second bottleneck feature z_2 and third bottleneck feature z_3 are obtained at the information bottleneck layer. The second loss value L_base is calculated from z_1, and the third loss value L_gi is calculated from z_1, z_2 and z_3; L_gi enables the model to learn the similarity between the same named entity category and different named entity categories. After L_base and L_gi are calculated, the weight influence of L_gi is first adjusted according to γ (in this embodiment γ is set to 0.3), the adjusted L_gi is added to L_base, and the first loss value L is obtained from the sum of L_base and the adjusted L_gi. With the goal of minimizing the first loss value L, the parameters of the named entity recognition model are continuously updated.
The embodiments of this application can acquire and process relevant data based on artificial intelligence technology. Artificial intelligence is the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems and mechatronics. Artificial intelligence software technology mainly includes computer vision technology, robotics, biometrics, speech processing technology, natural language processing technology and machine learning/deep learning. The named entity recognition method of the embodiments of this application can be applied in natural language processing application fields such as information retrieval, question answering systems, machine translation and sentiment analysis.
基于上述命名实体识别方法,下面分别提出本申请的命名实体识别装置、计算机设备和 计算机可读存储介质的各个实施例。
如图10所示,本申请第二方面实施例提供一种命名实体识别装置1000,图10是本申请一个实施例提供的命名实体识别装置1000的结构示意图。本申请实施例的命名实体识别装置1000包括但不限于第一获取模块1010和第二获取模块1020,具体地,第一获取模块1010用于获取预先训练好的命名实体识别模型1030;第二获取模块1020用于获取待识别的第一语句,将第一语句输入至命名实体识别模型1030,以使命名实体识别模型1030执行命名实体识别处理;其中,命名实体识别模型1030包括:分词模块1031、特征提取模块1032、跨域处理模块1033、信息瓶颈模块1034、分类模块1035。分词模块1031用于对第一语句进行分词处理,得到包括多个拆分词的第二语句;特征提取模块1032用于对多个拆分词进行特征提取得到多个词嵌入特征向量;跨域处理模块1033用于根据多个词嵌入特征向量对第二语句进行跨域信息处理,得到多个跨域信息特征;信息瓶颈模块1034用于对多个所述跨域信息特征进行处理,得到多个信息瓶颈特征;分类模块1035用于采用分类函数对多个信息瓶颈特征进行分类识别,确定与第一语句对应的命名实体类别。
根据本申请实施例的命名实体识别装置,通过获取预先训练好的命名实体识别模型,将获取的待识别的第一语句输入至命名实体识别模型执行命名实体识别处理,命名实体识别模型基于第一语句进行分词处理,得到的第二语句包括有多个拆分词,通过对多个拆分词进行特征提取得到多个词嵌入特征向量,能够有效地反映语义信息,便于准确识别未登陆词,通过对第二语句进行跨域信息处理,得到多个跨域信息特征,能够将拆分词的信息提供给命名实体识别模型,有利于提高命名实体识别模型的识别效率,通过信息瓶颈模块对多个跨域信息特征进行处理,得到多个信息瓶颈特征,最后采用分类函数对多个信息瓶颈特征进行分类识别,确定对应的命名实体类别,通过利用信息瓶颈特征,更加有效地进行特征提取,能够对命名实体中的未登录词进行更好的识别,有利于提高命名实体识别的准确度。
在上述的命名实体识别装置中,根据多个词嵌入特征向量对第二语句进行跨域信息处理,得到多个跨域信息特征,具体包括:
根据多个词嵌入特征向量确定多个边界向量,其中,边界向量包括起点词嵌入特征和终点词嵌入特征;
根据每个边界向量确定对应的长度向量;
根据多个边界向量和多个长度向量得到多个跨域信息特征。
在上述的命名实体识别装置中,根据多个词嵌入特征向量确定多个边界向量,具体包括:
根据多个词嵌入特征向量确定多个起点词嵌入特征和多个终点词嵌入特征;
将每个起点词嵌入特征和对应的终点词嵌入特征进行拼接处理,得到多个边界向量。
在上述的命名实体识别装置中,根据多个词嵌入特征向量确定多个边界向量,具体包括:
根据多个词嵌入特征向量确定多个起点词嵌入特征和多个终点词嵌入特征;
将每个起点词嵌入特征和对应的终点词嵌入特征进行拼接处理,得到多个边界向量。
在上述的命名实体识别装置中,根据每个边界向量确定对应的长度向量,具体包括:
根据每个边界向量确定对应的跨域长度;
根据每个跨域长度和预设维度得到对应的长度向量,其中,长度向量的当前维度与跨域长度对应。
在上述的命名实体识别装置中,命名实体识别模型根据以下训练步骤得到:
获取预标注的训练数据集,其中,训练数据集中各训练数据为携带命名实体及标注类别的标注句子;
获取每个标注句子的替换类别句子,其中,替换类别句子包括相同类别句子和不同类别句子;
根据标注句子、相同类别句子和不同类别句子,计算得到第一损失值;
根据第一损失值训练初始模型,得到训练好的命名实体识别模型。
在上述的命名实体识别装置中,根据标注句子、相同类别句子和不同类别句子,计算得 到第一损失值,具体包括:
根据标注句子、相同类别句子和不同类别句子,计算得到对应的第一瓶颈特征、第二瓶颈特征和第三瓶颈特征;
根据第一瓶颈特征、第二瓶颈特征和第三瓶颈特征计算得到第一损失值。
在上述的命名实体识别装置中,根据第一瓶颈特征、第二瓶颈特征和第三瓶颈特征计算得到第一损失值,具体包括:
根据第一瓶颈特征计算得到第二损失值;
根据第一瓶颈特征、第二瓶颈特征和第三瓶颈特征计算得到第三损失值;
根据第二损失值和第三损失值计算得到第一损失值。
需要说明的是,本申请实施例的命名实体识别装置的具体实施方式及对应的技术效果,可对应参照上述命名实体识别方法的具体实施方式及对应的技术效果。
如图11所示,本申请的第三方面实施例还提供了一种计算机设备1100,该计算机设备1100包括:存储器1110、处理器1120及存储在存储器1110上并可在处理器1120上运行的计算机程序。
处理器1120和存储器1110可以通过总线或者其他方式连接。存储器1110作为一种非暂态计算机可读存储介质,可用于存储非暂态软件程序以及非暂态性计算机可执行程序。此外,存储器1110可以包括高速随机存取存储器,还可以包括非暂态存储器,例如至少一个磁盘存储器件、闪存器件、或其他非暂态固态存储器件。在一些实施方式中,存储器1110可选包括相对于处理器1120远程设置的存储器,这些远程存储器可以通过网络连接至该发号器组件。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。本领域技术人员可以理解的是,图11中示出的计算机设备1100并不构成对本申请实施例的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。实现上述实施例的命名实体识别方法所需的非暂态软件程序以及指令存储在存储器1110中,当被处理器1120执行时,执行上述实施例的命名实体识别方法,例如,执行以上描述的图1、图2以及图4至图9中的方法步骤。
根据本申请实施例的计算机设备,通过获取预先训练好的命名实体识别模型,将获取的待识别的第一语句输入至命名实体识别模型执行命名实体识别处理,命名实体识别模型基于第一语句进行分词处理,得到的第二语句包括有多个拆分词,通过对多个拆分词进行特征提取得到多个词嵌入特征向量,能够有效地反映语义信息,便于准确识别未登陆词,通过对第二语句进行跨域信息处理,得到多个跨域信息特征,能够将拆分词的信息提供给命名实体识别模型,有利于提高命名实体识别模型的识别效率,通过信息瓶颈层对多个跨域信息特征进行处理,得到多个信息瓶颈特征,最后采用分类函数对多个信息瓶颈特征进行分类识别,确定对应的命名实体类别,通过利用信息瓶颈特征,更加有效地进行特征提取,能够对命名实体中的未登录词进行更好的识别,有利于提高命名实体识别的准确度。
In addition, an embodiment of the fourth aspect of the present application further provides a computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium stores computer-executable instructions for performing the above named entity recognition method. For example, when executed by a processor of the above named entity recognition apparatus, the instructions cause the processor to perform the named entity recognition method of the above embodiments, for example, the method steps in FIG. 1, FIG. 2, and FIG. 4 to FIG. 9 described above.
According to the computer-readable storage medium of this embodiment of the present application, a pre-trained named entity recognition model is acquired, and the acquired first sentence to be recognized is input into the model for named entity recognition processing. The model performs word segmentation on the first sentence to obtain a second sentence containing a plurality of split words; feature extraction on these split words yields a plurality of word embedding feature vectors that effectively reflect semantic information and facilitate accurate recognition of out-of-vocabulary words. Cross-domain information processing of the second sentence yields a plurality of cross-domain information features, which supply the split-word information to the model and help improve its recognition efficiency. The information bottleneck layer then processes the cross-domain information features into a plurality of information bottleneck features, and finally a classification function classifies these features to determine the corresponding named entity category. By exploiting the information bottleneck features, feature extraction is performed more effectively, out-of-vocabulary words in named entities are recognized better, and the accuracy of named entity recognition is improved.
Those of ordinary skill in the art will understand that all or some of the steps and systems of the methods disclosed above can be implemented as software, firmware, hardware, or appropriate combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically carry computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
The preferred embodiments of the present application have been described in detail above, but the present application is not limited to the above implementations. Those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present application, and such equivalent modifications or substitutions are all included within the scope defined by the claims of the present application.

Claims (20)

  1. A named entity recognition method, comprising:
    obtaining a pre-trained named entity recognition model, wherein the named entity recognition model comprises an information bottleneck layer;
    obtaining a first sentence to be recognized, and inputting the first sentence into the named entity recognition model, so that the named entity recognition model performs the following named entity recognition processing:
    performing word segmentation processing on the first sentence to obtain a second sentence comprising a plurality of split words;
    performing feature extraction on the plurality of split words to obtain a plurality of word embedding feature vectors;
    performing cross-domain information processing on the second sentence according to the plurality of word embedding feature vectors to obtain a plurality of cross-domain information features;
    processing the plurality of cross-domain information features through the information bottleneck layer to obtain a plurality of information bottleneck features;
    using a classification function to classify and identify the plurality of information bottleneck features, and determining a named entity category corresponding to the first sentence.
  2. The named entity recognition method according to claim 1, wherein performing cross-domain information processing on the second sentence according to the plurality of word embedding feature vectors to obtain a plurality of cross-domain information features comprises:
    determining a plurality of boundary vectors according to the plurality of word embedding feature vectors, wherein each boundary vector comprises a start word embedding feature and an end word embedding feature;
    determining a corresponding length vector according to each boundary vector;
    obtaining a plurality of cross-domain information features according to the plurality of boundary vectors and the plurality of length vectors.
  3. The named entity recognition method according to claim 2, wherein determining a plurality of boundary vectors according to the plurality of word embedding feature vectors comprises:
    determining a plurality of start word embedding features and a plurality of end word embedding features according to the plurality of word embedding feature vectors;
    concatenating each start word embedding feature with the corresponding end word embedding feature to obtain a plurality of boundary vectors.
  4. The named entity recognition method according to claim 2, wherein determining a corresponding length vector according to each boundary vector comprises:
    determining a corresponding cross-domain length according to each boundary vector;
    obtaining a corresponding length vector according to each cross-domain length and a preset dimension, wherein a current dimension of the length vector corresponds to the cross-domain length.
  5. The named entity recognition method according to claim 1, wherein the named entity recognition model is obtained through the following training steps:
    obtaining a pre-annotated training data set, wherein each piece of training data in the training data set is an annotated sentence carrying a named entity and an annotated category;
    obtaining replacement-category sentences for each annotated sentence, wherein the replacement-category sentences include same-category sentences and different-category sentences;
    calculating a first loss value according to the annotated sentence, the same-category sentence, and the different-category sentence;
    training an initial model according to the first loss value to obtain the trained named entity recognition model.
  6. The named entity recognition method according to claim 5, wherein calculating the first loss value according to the annotated sentence, the same-category sentence, and the different-category sentence comprises:
    calculating a corresponding first bottleneck feature, second bottleneck feature, and third bottleneck feature according to the annotated sentence, the same-category sentence, and the different-category sentence;
    calculating the first loss value according to the first bottleneck feature, the second bottleneck feature, and the third bottleneck feature.
  7. The named entity recognition method according to claim 6, wherein calculating the first loss value according to the first bottleneck feature, the second bottleneck feature, and the third bottleneck feature comprises:
    calculating a second loss value according to the first bottleneck feature;
    calculating a third loss value according to the first bottleneck feature, the second bottleneck feature, and the third bottleneck feature;
    calculating the first loss value according to the second loss value and the third loss value.
  8. A named entity recognition apparatus, comprising:
    a first acquisition module, configured to acquire a pre-trained named entity recognition model;
    a second acquisition module, configured to acquire a first sentence to be recognized and input the first sentence into the named entity recognition model, so that the named entity recognition model performs named entity recognition processing;
    wherein the named entity recognition model comprises:
    a word segmentation module, configured to perform word segmentation processing on the first sentence to obtain a second sentence comprising a plurality of split words;
    a feature extraction module, configured to perform feature extraction on the plurality of split words to obtain a plurality of word embedding feature vectors;
    a cross-domain processing module, configured to perform cross-domain information processing on the second sentence according to the plurality of word embedding feature vectors to obtain a plurality of cross-domain information features;
    an information bottleneck module, configured to process the plurality of cross-domain information features to obtain a plurality of information bottleneck features;
    a classification module, configured to use a classification function to classify and identify the plurality of information bottleneck features and determine a named entity category corresponding to the first sentence.
  9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements a named entity recognition method, the named entity recognition method comprising:
    obtaining a pre-trained named entity recognition model, wherein the named entity recognition model comprises an information bottleneck layer;
    obtaining a first sentence to be recognized, and inputting the first sentence into the named entity recognition model, so that the named entity recognition model performs the following named entity recognition processing:
    performing word segmentation processing on the first sentence to obtain a second sentence comprising a plurality of split words;
    performing feature extraction on the plurality of split words to obtain a plurality of word embedding feature vectors;
    performing cross-domain information processing on the second sentence according to the plurality of word embedding feature vectors to obtain a plurality of cross-domain information features;
    processing the plurality of cross-domain information features through the information bottleneck layer to obtain a plurality of information bottleneck features;
    using a classification function to classify and identify the plurality of information bottleneck features, and determining a named entity category corresponding to the first sentence.
  10. The computer device according to claim 9, wherein performing cross-domain information processing on the second sentence according to the plurality of word embedding feature vectors to obtain a plurality of cross-domain information features comprises:
    determining a plurality of boundary vectors according to the plurality of word embedding feature vectors, wherein each boundary vector comprises a start word embedding feature and an end word embedding feature;
    determining a corresponding length vector according to each boundary vector;
    obtaining a plurality of cross-domain information features according to the plurality of boundary vectors and the plurality of length vectors.
  11. The computer device according to claim 10, wherein determining a plurality of boundary vectors according to the plurality of word embedding feature vectors comprises:
    determining a plurality of start word embedding features and a plurality of end word embedding features according to the plurality of word embedding feature vectors;
    concatenating each start word embedding feature with the corresponding end word embedding feature to obtain a plurality of boundary vectors.
  12. The computer device according to claim 10, wherein determining a corresponding length vector according to each boundary vector comprises:
    determining a corresponding cross-domain length according to each boundary vector;
    obtaining a corresponding length vector according to each cross-domain length and a preset dimension, wherein a current dimension of the length vector corresponds to the cross-domain length.
  13. The computer device according to claim 9, wherein the named entity recognition model is obtained through the following training steps:
    obtaining a pre-annotated training data set, wherein each piece of training data in the training data set is an annotated sentence carrying a named entity and an annotated category;
    obtaining replacement-category sentences for each annotated sentence, wherein the replacement-category sentences include same-category sentences and different-category sentences;
    calculating a first loss value according to the annotated sentence, the same-category sentence, and the different-category sentence;
    training an initial model according to the first loss value to obtain the trained named entity recognition model.
  14. The computer device according to claim 13, wherein calculating the first loss value according to the annotated sentence, the same-category sentence, and the different-category sentence comprises:
    calculating a corresponding first bottleneck feature, second bottleneck feature, and third bottleneck feature according to the annotated sentence, the same-category sentence, and the different-category sentence;
    calculating the first loss value according to the first bottleneck feature, the second bottleneck feature, and the third bottleneck feature.
  15. A computer-readable storage medium storing computer-executable instructions, wherein the computer-executable instructions are used to perform a named entity recognition method, the named entity recognition method comprising:
    obtaining a pre-trained named entity recognition model, wherein the named entity recognition model comprises an information bottleneck layer;
    obtaining a first sentence to be recognized, and inputting the first sentence into the named entity recognition model, so that the named entity recognition model performs the following named entity recognition processing:
    performing word segmentation processing on the first sentence to obtain a second sentence comprising a plurality of split words;
    performing feature extraction on the plurality of split words to obtain a plurality of word embedding feature vectors;
    performing cross-domain information processing on the second sentence according to the plurality of word embedding feature vectors to obtain a plurality of cross-domain information features;
    processing the plurality of cross-domain information features through the information bottleneck layer to obtain a plurality of information bottleneck features;
    using a classification function to classify and identify the plurality of information bottleneck features, and determining a named entity category corresponding to the first sentence.
  16. The computer-readable storage medium according to claim 15, wherein performing cross-domain information processing on the second sentence according to the plurality of word embedding feature vectors to obtain a plurality of cross-domain information features comprises:
    determining a plurality of boundary vectors according to the plurality of word embedding feature vectors, wherein each boundary vector comprises a start word embedding feature and an end word embedding feature;
    determining a corresponding length vector according to each boundary vector;
    obtaining a plurality of cross-domain information features according to the plurality of boundary vectors and the plurality of length vectors.
  17. The computer-readable storage medium according to claim 16, wherein determining a plurality of boundary vectors according to the plurality of word embedding feature vectors comprises:
    determining a plurality of start word embedding features and a plurality of end word embedding features according to the plurality of word embedding feature vectors;
    concatenating each start word embedding feature with the corresponding end word embedding feature to obtain a plurality of boundary vectors.
  18. The computer-readable storage medium according to claim 16, wherein determining a corresponding length vector according to each boundary vector comprises:
    determining a corresponding cross-domain length according to each boundary vector;
    obtaining a corresponding length vector according to each cross-domain length and a preset dimension, wherein a current dimension of the length vector corresponds to the cross-domain length.
  19. The computer-readable storage medium according to claim 15, wherein the named entity recognition model is obtained through the following training steps:
    obtaining a pre-annotated training data set, wherein each piece of training data in the training data set is an annotated sentence carrying a named entity and an annotated category;
    obtaining replacement-category sentences for each annotated sentence, wherein the replacement-category sentences include same-category sentences and different-category sentences;
    calculating a first loss value according to the annotated sentence, the same-category sentence, and the different-category sentence;
    training an initial model according to the first loss value to obtain the trained named entity recognition model.
  20. The computer-readable storage medium according to claim 19, wherein calculating the first loss value according to the annotated sentence, the same-category sentence, and the different-category sentence comprises:
    calculating a corresponding first bottleneck feature, second bottleneck feature, and third bottleneck feature according to the annotated sentence, the same-category sentence, and the different-category sentence;
    calculating the first loss value according to the first bottleneck feature, the second bottleneck feature, and the third bottleneck feature.
PCT/CN2022/090756 2022-03-22 2022-04-29 Named entity recognition method, apparatus, device, and computer-readable storage medium WO2023178802A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210282587.4A CN114722822B (zh) 2022-03-22 2022-03-22 Named entity recognition method, apparatus, device, and computer-readable storage medium
CN202210282587.4 2022-03-22

Publications (1)

Publication Number Publication Date
WO2023178802A1 2023-09-28

Family

ID=82240155

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090756 WO2023178802A1 (zh) 2022-03-22 2022-04-29 Named entity recognition method, apparatus, device, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN114722822B (zh)
WO (1) WO2023178802A1 (zh)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2665239C2 (ru) * 2014-01-15 2018-08-28 Общество с ограниченной ответственностью "Аби Продакшн" Automatic extraction of named entities from text
CN112347785A (zh) * 2020-11-18 2021-02-09 湖南国发控股有限公司 Nested entity recognition system based on multi-task learning
CN113158671B (zh) * 2021-03-25 2023-08-11 胡明昊 Open-domain information extraction method combined with named entity recognition
CN113434683B (zh) * 2021-06-30 2023-08-29 平安科技(深圳)有限公司 Text classification method and apparatus, medium, and electronic device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287479A (zh) * 2019-05-20 2019-09-27 平安科技(深圳)有限公司 Named entity recognition method, electronic apparatus, and storage medium
CN113536791A (zh) * 2020-04-20 2021-10-22 阿里巴巴集团控股有限公司 Named entity recognition method and apparatus
US20210349975A1 (en) * 2020-04-30 2021-11-11 Arizona Board Of Regents On Behalf Of Arizona State University Systems and methods for improved cybersecurity named-entity-recognition considering semantic similarity
CN113807094A (zh) * 2020-06-11 2021-12-17 株式会社理光 Entity recognition method, apparatus, and computer-readable storage medium
CN112541355A (zh) * 2020-12-11 2021-03-23 华南理工大学 Few-shot named entity recognition method and system with decoupled entity boundaries and categories
CN113688631A (zh) * 2021-07-05 2021-11-23 广州大学 Nested named entity recognition method, system, computer, and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117114004A (zh) * 2023-10-25 2023-11-24 江西师范大学 Few-shot two-stage named entity recognition method based on gated deviation correction
CN117114004B (zh) * 2023-10-25 2024-01-16 江西师范大学 Few-shot two-stage named entity recognition method based on gated deviation correction
CN117807999A (zh) * 2024-02-29 2024-04-02 武汉科技大学 Domain-adaptive named entity recognition method based on adversarial learning
CN117807999B (zh) * 2024-02-29 2024-05-10 武汉科技大学 Domain-adaptive named entity recognition method based on adversarial learning

Also Published As

Publication number Publication date
CN114722822B (zh) 2024-01-19
CN114722822A (zh) 2022-07-08

Similar Documents

Publication Publication Date Title
CN108536870B (zh) Text sentiment classification method fusing sentiment features and semantic features
CN110427463B (zh) Search sentence response method and apparatus, server, and storage medium
WO2018218706A1 (zh) Neural network-based method and system for news event extraction
CN111738003B (zh) Named entity recognition model training method, named entity recognition method, and medium
WO2018028077A1 (zh) Deep learning-based method and apparatus for Chinese semantic analysis
CN104050160B (zh) Spoken language translation method and apparatus integrating machine and human translation
CN109753660B (zh) LSTM-based method for extracting named entities from bid-winning web pages
WO2023178802A1 (zh) Named entity recognition method, apparatus, device, and computer-readable storage medium
WO2021212801A1 (zh) Method, apparatus, and storage medium for identifying evaluation objects of e-commerce products
CN110197279B (zh) Transformation model training method, apparatus, device, and storage medium
CN113591483A (zh) Document-level event argument extraction method based on sequence labeling
CN113673254B (zh) Stance detection method based on similarity-preserving knowledge distillation
WO2023159758A1 (zh) Data augmentation method and apparatus, electronic device, and storage medium
CN112069312B (zh) Entity recognition-based text classification method and electronic apparatus
WO2023137911A1 (zh) Intent classification method and apparatus based on small-sample corpora, and computer device
CN113255320A (zh) Entity relation extraction method and apparatus based on syntax trees and graph attention mechanisms
CN111222330B (zh) Chinese event detection method and system
CN108509521A (zh) Image retrieval method that automatically generates text indexes
CN115563327A (zh) Zero-shot cross-modal retrieval method based on selective distillation of Transformer networks
CN114417851A (zh) Sentiment analysis method based on keyword-weighted information
CN110297986A (zh) Sentiment tendency analysis method for trending Weibo topics
CN115757792A (zh) Deep learning-based sentiment classification method for Weibo texts
CN114048314A (zh) Natural language steganalysis method
CN117370736A (zh) Fine-grained emotion recognition method, electronic device, and storage medium
CN112347247A (zh) Binary classification method for category-specific text titles based on LDA and BERT

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22932859

Country of ref document: EP

Kind code of ref document: A1