CN113807099A - Entity information identification method, entity information identification device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113807099A
CN113807099A (application CN202111111471.6A)
Authority
CN
China
Prior art keywords
candidate information
word segment
information
target
text
Prior art date
Legal status: Granted
Application number
CN202111111471.6A
Other languages
Chinese (zh)
Other versions
CN113807099B (en)
Inventor
张惠蒙
黄昉
史亚冰
蒋烨
朱勇
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111111471.6A
Publication of CN113807099A
Application granted
Publication of CN113807099B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 - Named entity recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure provides an entity information identification method, an entity information identification device, an electronic device and a storage medium, relating to the technical field of data processing, and in particular to artificial intelligence technologies such as natural language processing and deep learning. The specific implementation scheme is as follows: performing named entity recognition on a text to be recognized to obtain at least one piece of candidate information; performing feature extraction on each piece of candidate information to obtain at least one piece of feature information; performing deep semantic recognition on each piece of feature information to obtain at least one semantic recognition result; and determining entity information corresponding to the text to be recognized from the at least one piece of candidate information according to the at least one semantic recognition result.

Description

Entity information identification method, entity information identification device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of data processing technologies, in particular to artificial intelligence technologies such as NLP (natural language processing) and deep learning, and more particularly to an entity information identification method, apparatus, electronic device, and storage medium.
Background
Named Entity Recognition (NER) is one of the basic and important tasks in natural language processing and has a very wide range of applications. An NER system can extract entity information from unstructured input text and can identify additional categories of entity information according to business needs. In industrial application scenarios, the identification of entity information is the basis of tasks such as knowledge graph construction, text understanding and dialogue intent understanding.
Disclosure of Invention
The disclosure provides an entity information identification method, an entity information identification device, an electronic device and a storage medium.
According to an aspect of the present disclosure, there is provided an entity information identification method, including: performing named entity recognition on a text to be recognized to obtain at least one piece of candidate information; performing feature extraction on each piece of candidate information to obtain at least one piece of feature information; performing deep semantic recognition on each piece of feature information to obtain at least one semantic recognition result; and determining entity information corresponding to the text to be recognized from the at least one piece of candidate information according to the at least one semantic recognition result.
According to another aspect of the present disclosure, there is provided an entity information identifying apparatus including: an entity identification module for performing named entity recognition on the text to be recognized to obtain at least one piece of candidate information; a feature extraction module for performing feature extraction on each piece of candidate information to obtain at least one piece of feature information; a semantic recognition module for performing deep semantic recognition on each piece of feature information to obtain at least one semantic recognition result; and a determining module for determining entity information corresponding to the text to be recognized from the at least one piece of candidate information according to the at least one semantic recognition result.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the entity information identification method as described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the entity information identifying method as described above.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the entity information identification method as described above.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 schematically illustrates an exemplary system architecture to which the entity information identification method and apparatus may be applied, according to an embodiment of the present disclosure;
fig. 2 schematically shows a flow chart of an entity information identification method according to an embodiment of the present disclosure;
FIG. 3 schematically shows a schematic diagram of an entity candidate extraction model according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a schematic diagram of an entity discrimination model according to an embodiment of the present disclosure;
FIG. 5 schematically shows a representation of text feature information in a text to be recognized according to an embodiment of the present disclosure;
fig. 6 schematically shows a flowchart of an entity information identification method according to another embodiment of the present disclosure;
fig. 7 schematically shows a block diagram of an entity information identification apparatus according to an embodiment of the present disclosure; and
FIG. 8 shows a schematic block diagram of an example electronic device that may be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the relevant users all comply with relevant laws and regulations, take necessary security measures, and do not violate public order and good customs.
Schemes for implementing named entity recognition include dictionary matching-based schemes, rule matching-based schemes, schemes based on a BERT (Bidirectional Encoder Representations from Transformers) model combined with a CRF (Conditional Random Fields) model, schemes based on BERT combined with an MRC (Machine Reading Comprehension) model, and the like.
According to the dictionary matching scheme, some entity information can be predefined in a dictionary; forward maximum matching and reverse maximum matching are then performed on the text according to this entity information to obtain a candidate information set for entity information identification, and the results of entity information identification are screened and output based on word frequency.
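As an illustrative sketch (not the patent's own implementation), the forward pass of dictionary-based maximum matching can be written as follows; the function name `forward_max_match`, the sample dictionary and the window size are all hypothetical:

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching against a predefined entity dictionary.

    Scans left to right, at each position taking the longest dictionary
    entry that matches; unmatched characters are skipped. A reverse
    maximum-matching pass would scan right to left analogously.
    """
    spans = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest window first, shrinking until a dictionary hit.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                match = (i, j, text[i:j])
                break
        if match:
            spans.append(match)
            i = match[1]
        else:
            i += 1
    return spans

# Hypothetical dictionary and text for illustration.
entity_dict = {"NJ", "CJ", "JDQ", "DQ"}
print(forward_max_match("NJSCJDQ", entity_dict))  # [(0, 2, 'NJ'), (3, 5, 'CJ'), (5, 7, 'DQ')]
```

Note that the greedy pass consumes "CJ" before "JDQ" can start, which is exactly the kind of boundary error that word-frequency screening or a reverse pass is meant to mitigate.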
Rule matching-based schemes mostly adopt rule templates manually constructed by linguistic experts. Such a template can select features such as statistical information, punctuation marks, keywords, indicator words, direction words, position words and head words, and uses pattern and string matching as the main means to obtain a candidate information set for entity information recognition and output the results of entity information recognition.
The scheme based on the BERT + CRF model is implemented with a deep learning model. By running a self-supervised learning method over massive corpora, BERT can learn a good feature representation for each word. After the semantic representation of each word is extracted, the word sequence can be labeled by the CRF so as to extract the entities in the sentence. Compared with the schemes based on rules and dictionary matching, the scheme based on the BERT + CRF model has better expressive power, better generalization and higher universality.
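A minimal sketch of how entity spans are read off the BIO-style label sequence that such a sequence-labeling (CRF) decoder produces; the helper `bio_to_spans` and the example tags are illustrative, not taken from the patent:

```python
def bio_to_spans(tags):
    """Collect (start, end, type) entity spans from a BIO tag sequence,
    as produced by a CRF decoding layer over per-token labels.
    `end` is exclusive; "I-" tags simply extend the open span.
    """
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last span
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                spans.append((start, i, etype))
                start, etype = None, None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
    return spans

print(bio_to_spans(["B-LOC", "I-LOC", "O", "B-PER"]))  # [(0, 2, 'LOC'), (3, 4, 'PER')]
```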
The scheme based on the BERT + MRC model is also implemented with a deep learning model, and the BERT + MRC model performs slightly better than the BERT + CRF model. Its strategy is to convert the sequence labeling task into prediction of the start position and end position of entity information. After the semantics of the input text are extracted and understood using ERNIE (Enhanced Representation through Knowledge Integration, a knowledge-enhanced pre-training model for Chinese), sentence-level encoding is obtained, i.e., each word is represented by a 768-dimensional vector. The word vector is then reduced through a fully connected layer to the dimension of the number of sequence labels, and the start position and end position of entity information are predicted. Finally, the candidate information set for entity information identification is screened with thresholds plus rules to obtain the result of entity information identification.
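The start/end decoding step described above can be sketched as pairing thresholded start positions with nearby end positions; the function name, the threshold of 0.5 and the maximum span length are assumed values for illustration only:

```python
def decode_spans(start_probs, end_probs, threshold=0.5, max_span_len=5):
    """MRC-style decoding: positions whose start (resp. end) probability
    reaches the threshold are paired into candidate (start, end) spans,
    subject to a maximum span length rule."""
    starts = [i for i, p in enumerate(start_probs) if p >= threshold]
    ends = [i for i, p in enumerate(end_probs) if p >= threshold]
    spans = []
    for s in starts:
        for e in ends:
            if s <= e < s + max_span_len:
                spans.append((s, e))
    return spans

# Hypothetical per-token probabilities for a 5-token input.
start_probs = [0.9, 0.1, 0.2, 0.8, 0.1]
end_probs   = [0.1, 0.7, 0.1, 0.1, 0.6]
print(decode_spans(start_probs, end_probs))  # [(0, 1), (0, 4), (3, 4)]
```

The example shows why such threshold-plus-rule screening is brittle: the admissible span length and thresholds must be retuned per domain, which is the limitation the disclosure goes on to discuss.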
The inventors found that this scheme mainly uses ERNIE's pre-trained word-based vector encoding information during model encoding and ignores the important role of vocabulary information in entity information recognition. During decoding, the positions of the start node and end node of each type of entity information are generated first, and when screening start-end node pairs, the candidate set is filtered using multiple thresholds plus rules. Such a screening strategy is too rigid, and the rules must be modified when facing different tasks. For example, the average length of entity information differs greatly across domains, and the nesting of entity information within the same text also differs, so the same rules cannot be applied to entity information identification in all domains.
Fig. 1 schematically shows an exemplary system architecture to which the entity information identification method and apparatus may be applied according to an embodiment of the present disclosure.
It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios. For example, in another embodiment, an exemplary system architecture to which the entity information identification method and apparatus may be applied may include a terminal device, but the terminal device may implement the entity information identification method and apparatus provided in the embodiments of the present disclosure without interacting with a server.
As shown in fig. 1, the system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired and/or wireless communication links, and so forth.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have installed thereon various communication client applications, such as a knowledge reading application, a web browser application, a search application, an instant messaging tool, a mailbox client, and/or social platform software, etc. (by way of example only).
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (for example only) providing support for content browsed by the user using the terminal devices 101, 102, 103. The background management server may analyze and otherwise process received data such as user requests, and feed back processing results (e.g., a webpage, information, or data obtained or generated according to the user request) to the terminal device. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that addresses the defects of high management difficulty and weak service extensibility in traditional physical hosts and VPS (Virtual Private Server) services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be noted that the entity information identification method provided by the embodiment of the present disclosure may be generally executed by the terminal device 101, 102, or 103. Accordingly, the entity information identification apparatus provided by the embodiment of the present disclosure may also be disposed in the terminal device 101, 102, or 103.
Alternatively, the entity information identification method provided by the embodiment of the present disclosure may also be generally performed by the server 105. Accordingly, the entity information identification apparatus provided by the embodiments of the present disclosure may be generally disposed in the server 105. The entity information identification method provided by the embodiment of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the entity information identification apparatus provided by the embodiment of the present disclosure may also be disposed in a server or a server cluster different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
For example, when the entity information in a text to be recognized needs to be recognized, the terminal devices 101, 102, and 103 may perform named entity recognition on the text to be recognized to obtain at least one piece of candidate information. Then, feature extraction is performed on each piece of candidate information to obtain at least one piece of feature information. Next, deep semantic recognition is performed on each piece of feature information to obtain at least one semantic recognition result, and entity information corresponding to the text to be recognized is determined from the at least one piece of candidate information according to the at least one semantic recognition result. Alternatively, these operations may be performed by a server or server cluster capable of communicating with the terminal devices 101, 102, 103 and/or the server 105, which then determines the entity information corresponding to the text to be recognized.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Fig. 2 schematically shows a flowchart of an entity information identification method according to an embodiment of the present disclosure.
As shown in fig. 2, the method includes operations S210 to S240.
In operation S210, named entity recognition is performed on the text to be recognized to obtain at least one piece of candidate information.
In operation S220, feature extraction is performed on each piece of candidate information to obtain at least one piece of feature information.
In operation S230, deep semantic recognition is performed on each piece of feature information to obtain at least one semantic recognition result.
In operation S240, entity information corresponding to the text to be recognized is determined from the at least one piece of candidate information according to the at least one semantic recognition result.
According to the embodiment of the disclosure, the text to be recognized may include various texts containing or not containing entity information, the entity information may refer to information having a specific meaning or strong reference in the text, and may include at least one of name information, place name information, organization name information, proper noun information, product name information, model information, date and time information, price information, and the like.
According to the embodiment of the disclosure, the process of performing named entity recognition on the text to be recognized to obtain at least one piece of candidate information can be implemented by various entity candidate extraction models. Various entity discrimination models can be used to perform feature extraction on each piece of candidate information to obtain at least one piece of feature information, perform deep semantic recognition on each piece of feature information to obtain at least one semantic recognition result, and determine entity information corresponding to the text to be recognized from the at least one piece of candidate information according to the at least one semantic recognition result. Both the entity candidate extraction model and the entity discrimination model may be generated based on ERNIE. ERNIE is a pre-trained semantic representation model based on knowledge enhancement. The model learns the semantic representation of complete concepts by encoding semantic units such as words and entities. By using an ERNIE model pre-trained on large-scale corpora as the encoding layer, semantic representations of sentences can be better extracted for downstream tasks.
According to an embodiment of the present disclosure, the entity candidate extraction model may include at least one of an ERNIE + MRC-based model, a FLAT (Flat-Lattice Transformer, which incorporates word-level information) + ERNIE + MRC-based model, and the like, and may also include a BERT + MRC-based model, and the like. The entity discrimination model may include an ERNIE + softmax-based model, or the like.
Fig. 3 schematically shows a schematic diagram of an entity candidate extraction model according to an embodiment of the present disclosure.
As shown in fig. 3, the entity candidate extraction model includes an input layer 310, an encoding layer 320, and a decoding layer 330. The input layer 310 is configured to receive text feature information of a text to be recognized input into the entity candidate extraction model, where the text feature information includes at least one of word segment information and other feature information constituting the text to be recognized. The encoding layer 320 is configured to extract semantic features of each word segment in the text to be recognized according to the text feature information, mapping each word segment into vector information so that the decoding layer 330 can calculate the probability that each word segment belongs to each type of label. The various types of labels may include, for example, at least one of the start point and end point of a person name, the start point and end point of a place name, the start point and end point of an organization name, and so on. When it is determined according to its probability that a word segment belongs to a class of label, the corresponding class label may be marked. Candidate information identified from the text to be recognized can then be determined according to the marking result of the decoding layer.
According to an embodiment of the present disclosure, the category label to which a word segment belongs may be marked with "1". For example, as can be determined from the marking result in fig. 3, the category label corresponding to word segment 1 may be the starting point of a place name, the category labels corresponding to word segment 2 and word segment 3 may each be the ending point of a place name, the category label corresponding to word segment 5 may be the starting point of a person name, the category label corresponding to word segment 7 may be the ending point of a person name, and so on. Three pieces of candidate information may then be determined: word segments 1 to 2, word segments 1 to 3, and word segments 5 to 7.
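The marking result described above can be turned into candidate spans by pairing each start mark with every end mark of the same category at or after it, which preserves nested candidates such as word segments 1 to 2 and 1 to 3; the data layout and the function name `candidates_from_marks` are hypothetical:

```python
def candidates_from_marks(marks):
    """marks: {category: {"start": [...], "end": [...]}} with 1-based
    word-segment positions, as a decoding layer might emit them.
    Every start is paired with every end at or after it, so nested
    candidates of the same category are all kept."""
    out = []
    for cat, m in marks.items():
        for s in m["start"]:
            for e in m["end"]:
                if e >= s:
                    out.append((cat, s, e))
    return sorted(out)

# The fig. 3 example: place-name start at 1, ends at 2 and 3;
# person-name start at 5, end at 7.
marks = {
    "place_name": {"start": [1], "end": [2, 3]},
    "person_name": {"start": [5], "end": [7]},
}
print(candidates_from_marks(marks))
```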
FIG. 4 schematically shows a schematic diagram of an entity discrimination model according to an embodiment of the present disclosure.
As shown in FIG. 4, the entity discrimination model includes an input layer 410, an encoding layer 420, and a Softmax layer 430. The input layer 410 may receive candidate information output by the entity candidate extraction model or input into the entity discrimination model in the form of word segments; [CLS] and [SEP] may be used to delimit the text to be recognized, and NER_O_S, NER_O_E may be used to mark the candidate information in the text to be recognized. After the candidate information is input into the entity discrimination model, the encoding layer 420 may first perform feature extraction on the candidate information to obtain feature information, and then extract deep semantics through the ERNIE pre-trained language model to obtain a semantic recognition result. The Softmax layer 430, acting as a binary classifier, then judges the semantic recognition result to determine whether the input candidate information belongs to entity information, thereby determining the entity information of the text to be recognized.
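A minimal sketch of the Softmax-based binary judgment, assuming (purely for illustration) a two-logit classification head in which index 1 stands for the "is an entity" class:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def is_entity(logits):
    """Binary decision of the discrimination head. The convention that
    index 1 is the 'is an entity' class is an assumption for this sketch."""
    probs = softmax(logits)
    return probs[1] > probs[0]

print(is_entity([0.2, 1.5]))   # higher logit for the entity class -> True
print(is_entity([2.0, -1.0]))  # higher logit for the non-entity class -> False
```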
According to the embodiment of the disclosure, the overall semantics of a text to be recognized such as "NJSCJDQ" may be information characterizing a certain class of scenic spots in a certain area. "NJ" and "NJS" may be place name information, "CJ" may be scenic spot information, and "DQ" and "Q" may be proper noun information. In addition, "SC" may be job title information, and "JDQ" may be person name information.
For example, inputting "NJSCJDQ" into the entity candidate extraction model, the candidate information "NJ", "NJS", "SC", "CJ", "JD", "JDQ", "DQ" and "Q" can be obtained. Inputting the candidate information "NJ", "NJS", "SC", "CJ", "JD", "JDQ", "DQ" and "Q" into the entity discrimination model for feature extraction and deep semantic recognition, the judgment results that "NJ", "NJS", "SC", "CJ", "JDQ", "DQ" and "Q" belong to entity information and that "JD" does not belong to entity information can be obtained. Thus, the entity information in the text to be recognized "NJSCJDQ" may include "NJ", "NJS", "SC", "CJ", "JDQ", "DQ" and "Q".
For example, inputting "urban suburban park" into the entity candidate extraction model, the candidate information "city", "suburban" and "park" can be obtained. Inputting the candidate information "city", "suburban" and "park" into the entity discrimination model for feature extraction and deep semantic recognition, the judgment results that "city", "suburban" and "park" belong to entity information and that "suburb" does not belong to entity information can be obtained, for example. Thus, the entity information in the text to be recognized "urban suburban park" may include "city", "suburban" and "park".
According to the embodiment of the disclosure, the preliminary results obtained by performing named entity recognition on the text to be recognized are used as candidate information, and feature extraction and semantic recognition are added to determine entity information from the candidate information as the entity information recognized from the text to be recognized, which can effectively improve the accuracy of the entity information recognition result. In particular, compared with determining entity information by screening candidate information according to predetermined rules, this approach can effectively reduce cases in which information that does not belong to an entity is screened in as entity information of the text to be recognized.
The method illustrated in fig. 2 is further described below with reference to specific embodiments.
According to the embodiment of the disclosure, performing named entity recognition on the text to be recognized to obtain at least one piece of candidate information includes: generating a head position identifier and a tail position identifier for each character segment and each word segment in the text to be recognized, and performing named entity recognition on the text to be recognized according to the head position identifiers, the tail position identifiers and the text to be recognized to obtain the at least one piece of candidate information.
According to an embodiment of the present disclosure, a character segment may represent each character in the text to be recognized, and a word segment may represent each word (which may include a plurality of characters) in the text to be recognized. The head position identifier and the tail position identifier generated for a character segment are both equal to the position of that character in the text. The head position identifier generated for a word segment is equal to the position in the text of the first character of the word, and the tail position identifier is equal to the position of the last character of the word. Each character segment and each word segment in the text to be recognized can be determined according to the head and tail position identifiers, thereby determining the text feature information of the text to be recognized. Which word segments the text may be composed of can be determined by word segmentation, or according to a preset word list that includes all preset possible words.
Fig. 5 schematically shows a representation of text characteristic information in a text to be recognized according to an embodiment of the present disclosure.
As shown in fig. 5, the text to be recognized is, for example, "NJSCJDQ", and the word segments determined for the text from the word list include, for example, "NJ", "SC", "CJ", "JDQ" and "DQ". The text feature information determined from the text may include character segment information such as "N", "J", "S", "C", "J", "D" and "Q", and word segment information such as "NJ", "SC", "CJ", "JDQ" and "DQ". Since "N" occupies the 1st position in "NJSCJDQ", the head position identifier and the tail position identifier of "N" are both 1. Since the "J" of "JDQ" occupies the 5th position in "NJSCJDQ" and the "Q" of "JDQ" occupies the 7th position, the head position identifier of "JDQ" is 5 and its tail position identifier is 7. The head and tail position identifiers of the other character segment information and word segment information can be determined in the same manner.
According to an embodiment of the present disclosure, the text to be recognized is, for example, "urban suburban park", and the word segments determined for the text from the word list include, for example, "city", "suburb" and "park". The text feature information determined from the text may include character segment information such as "city", "suburb", "wild", "public" and "garden", and word segment information such as "city", "suburb" and "park". Since "city" occupies the 1st position in "urban suburban park", the head and tail position identifiers of "city" can both be determined to be 1. Since the first character of "suburb" occupies the 3rd position and its last character occupies the 4th position, the head position identifier of "suburb" is 3 and its tail position identifier is 4. The head and tail position identifiers of the other character segment information and word segment information can be determined in the same manner.
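The construction of head/tail position identifiers for character segments and word segments can be sketched as follows, using the "NJSCJDQ" example with 1-based positions; the helper name `head_tail_ids` and the sample vocabulary are illustrative assumptions:

```python
def head_tail_ids(text, vocab):
    """Build (token, head, tail) triples with 1-based positions: each
    character spans a single position, and each vocabulary word found in
    the text spans from its first to its last character, producing the
    flat lattice of character and word segments described in the text."""
    feats = [(ch, i + 1, i + 1) for i, ch in enumerate(text)]
    for word in vocab:
        start = text.find(word)
        while start != -1:  # record every occurrence of the word
            feats.append((word, start + 1, start + len(word)))
            start = text.find(word, start + 1)
    return feats

vocab = ["NJ", "SC", "CJ", "JDQ", "DQ"]
print(head_tail_ids("NJSCJDQ", vocab))
```

For instance, "JDQ" starts at the 5th character and ends at the 7th, so its triple is ("JDQ", 5, 7), matching the example above.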
Through the embodiments of the present disclosure, head and tail position identifiers are generated for the character segments and word segments in the text to be recognized, so that character vector information and word vector information can be fused in the encoding stage of named entity recognition. The richer text feature information strengthens the boundaries of entity information, and in particular improves the recognition accuracy for entity information with longer segments.
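As a minimal sketch (not taken from the patent text), the head and tail position identifiers in the examples above can be computed as follows; the function name and the 1-based indexing convention are assumptions chosen to match the description:

```python
def build_segments(text, vocabulary):
    """Return (segment, head_id, tail_id) triples for every character
    segment and every vocabulary-matched word segment in the text.
    Positions are 1-based, matching the examples in the description."""
    segments = []
    # Every single character is a character segment: head == tail.
    for i, ch in enumerate(text, start=1):
        segments.append((ch, i, i))
    # Word segments: every multi-character substring found in the vocabulary.
    n = len(text)
    for start in range(n):
        for end in range(start + 1, n):
            word = text[start:end + 1]
            if word in vocabulary:
                segments.append((word, start + 1, end + 1))
    return segments

segs = build_segments("NJSCJDQ", {"NJ", "SC", "CJ", "JDQ", "DQ"})
# "N" gets head = tail = 1; "JDQ" gets head = 5, tail = 7.
```

This reproduces the identifiers stated for "N" and "JDQ" in the "NJSCJDQ" example.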
According to an embodiment of the present disclosure, performing named entity recognition on the text to be recognized to obtain at least one piece of candidate information includes: determining, for each word segment in the text to be recognized, a first category probability that the word segment serves as a starting word segment of a predefined category and a second category probability that it serves as a terminating word segment of the predefined category; determining the word segment corresponding to the first category probability as a starting word segment when the first category probability is greater than or equal to a first preset threshold; determining the word segment corresponding to the second category probability as a terminating word segment when the second category probability is greater than or equal to a second preset threshold; and determining candidate information according to the starting word segment and the terminating word segment.
According to an embodiment of the present disclosure, the predefined categories may include at least one of a person name category, a place name category, an organization category, a proper name category, a product name category, a model category, a date category, a time category, a price category, and the like.
According to an embodiment of the present disclosure, the first preset threshold and the second preset threshold may be set to different values for different predefined categories. For example, when the predefined category is the person name category, the first preset threshold may be set to 0.9 and the second preset threshold to 0.5; when the predefined category is the place name category, the first preset threshold may be set to 0.8 and the second preset threshold to 0.7, and so on. The first and second preset thresholds may be used to determine the category to which a word segment belongs. In particular, when one word segment may correspond to a plurality of predefined categories, the category probabilities calculated for it may include a plurality of different first or second category probabilities, and in that case the category of the word segment may be determined according to the first and second preset thresholds.
According to an embodiment of the present disclosure, the text to be recognized is, for example, "NJSCJDQ". During named entity recognition of "NJSCJDQ" from its character and word segments, the first category probability of "N" as a starting word segment of the person name category is, for example, 0.3, which is smaller than the first preset threshold of 0.9 for the person name category. The first category probability of "N" as a starting word segment of the place name category is, for example, 0.8, which equals the first preset threshold of 0.8 for the place name category. The second category probability of "J" as a terminating word segment of the place name category is, for example, 0.8, which is greater than the second preset threshold of 0.7 for the place name category. It may therefore be determined that "N" in "NJSCJDQ" may be a starting word segment of the place name category and "J" a terminating word segment of the place name category, and the candidate information identified for "NJSCJDQ" may include "NJ".
According to an embodiment of the present disclosure, the text to be recognized is, for example, "urban suburban park". During named entity recognition of "urban suburban park" from its character and word segments, the first category probability of "suburb" as a starting word segment of the person name category is, for example, 0.1, which is smaller than the first preset threshold of 0.9 for the person name category. The first category probability of "suburb" as a starting word segment of the place name category is, for example, 0.8, which equals the first preset threshold of 0.8 for the place name category. The second category probability of "wild" as a terminating word segment of the place name category is, for example, 0.8, which is greater than the second preset threshold of 0.7 for the place name category. It may therefore be determined that "suburb" in "urban suburban park" may be a starting word segment of the place name category and "wild" a terminating word segment of the place name category, and the candidate information identified for "urban suburban park" may include "suburb".
By introducing the preset thresholds and comparing them with the category probabilities of a word segment belonging to the various predefined categories, the recognition of starting and terminating word segments can be enhanced, which increases the probability that the candidate information qualifies as entity information.
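The start/end decoding described above can be sketched as follows. The probabilities and per-category thresholds are the illustrative values from the example, not outputs of a real model, and the data layout (1-based index/category keyed dictionaries) is an assumption:

```python
def decode_candidates(text, start_probs, end_probs, thresholds):
    """start_probs / end_probs: {(index, category): probability}, 1-based
    indices; thresholds: {category: (t_start, t_end)}. A candidate pairs
    each qualifying start with each qualifying end of the same category."""
    candidates = []
    for (i, cat), p in start_probs.items():
        if p < thresholds[cat][0]:
            continue  # below the first preset threshold
        for (j, cat2), q in end_probs.items():
            if cat2 == cat and j >= i and q >= thresholds[cat][1]:
                candidates.append((text[i - 1:j], cat))
    return candidates

thresholds = {"person": (0.9, 0.5), "place": (0.8, 0.7)}
start_probs = {(1, "person"): 0.3, (1, "place"): 0.8}
end_probs = {(2, "place"): 0.8}
cands = decode_candidates("NJSCJDQ", start_probs, end_probs, thresholds)
# "N" fails the person start threshold (0.3 < 0.9) but meets the place
# start threshold (0.8 >= 0.8); "J" meets the place end threshold
# (0.8 >= 0.7), yielding the candidate ("NJ", "place").
```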
According to an embodiment of the present disclosure, after named entity recognition is performed on the text to be recognized to obtain at least one piece of candidate information, the candidate information may be selected according to a preset rule to obtain target candidate information. Feature extraction and semantic recognition are then performed on the target candidate information, and the entity information corresponding to the text to be recognized is determined from the target candidate information.
Fig. 6 schematically shows a flowchart of an entity information identification method according to another embodiment of the present disclosure.
As shown in fig. 6, the method includes operations S610 to S650.
In operation S610, named entity recognition is performed on the text to be recognized to obtain at least one candidate information.
In operation S620, at least one candidate information is selected according to a preset rule, so as to obtain at least one target candidate information.
In operation S630, feature extraction is performed on each target candidate information to obtain at least one target feature information.
In operation S640, deep semantic recognition is performed on each target feature information to obtain at least one target semantic recognition result.
In operation S650, entity information corresponding to the text to be recognized is determined from the at least one target candidate information according to the at least one target semantic recognition result.
According to the embodiment of the disclosure, the preset rules may include various custom rules for preliminarily selecting the candidate information to obtain the target candidate information. For example, the preset rule may include filtering out candidate information that does not match the semantics expressed by the semantic information according to the semantic information of the text to be recognized. The preset rule may further include filtering out candidate information existing in the list of the preset blacklist according to the preset blacklist, and the like.
Through the embodiments of the present disclosure, the candidate information is selected a second time according to the preset rule after named entity recognition, and the entity information is then determined, which can effectively improve the accuracy of the entity information recognized from the text to be recognized.
According to an embodiment of the present disclosure, selecting the at least one piece of candidate information according to the preset rule to obtain target candidate information includes, for each piece of candidate information: first, normalizing the first target number of character segments in the candidate information to obtain a normalized value, and determining a first category probability that the starting word segment of the candidate information serves as a starting word segment of the predefined category and a second category probability that its terminating word segment serves as a terminating word segment of the predefined category; then, calculating the target sum of the first category probability, the second category probability, and the normalized value, and determining the candidate information as target candidate information when the target sum is greater than or equal to a third preset threshold.
According to an embodiment of the present disclosure, the first target number characterizes the number of character segments included in the candidate information. The normalization may include normalization by a preset program, normalization of the first target number of each piece of candidate information against the first target numbers of all candidate information, and the like; the preset program may include parameters related to semantic features, category features, and so on of the candidate information. The third preset threshold is used, for example, to select reliable target candidate information from the candidate information: candidate information whose target sum is smaller than the third preset threshold may be filtered out, and the target candidate information is determined from the remaining candidate information.
According to an embodiment of the present disclosure, the text to be recognized is, for example, "NJSCJDQ", the candidate information includes, for example, "CJ" and "JD", and the third preset threshold is, for example, 2.0. The first target numbers of character segments of "CJ" and "JD" are normalized by the preset program; for example, the normalized value of "CJ" is 0.8 and the normalized value of "JD" is 0.3. It may further be determined, for example, that the first category probability of "C" of "CJ" as a place name starting word segment is 0.8 and the second category probability of "J" of "CJ" as a place name terminating word segment is 0.8, while the first category probability of "J" of "JD" as a person name starting word segment is 0.5 and the second category probability of "D" of "JD" as a person name terminating word segment is 0.3. The target sum calculated for "CJ" is then 2.4 and that for "JD" is 1.1; combined with the third preset threshold of 2.0, "JD" may be filtered out of the candidate information, and target candidate information determined from "CJ".
According to an embodiment of the present disclosure, the text to be recognized is, for example, "urban suburban park", the candidate information includes, for example, "city suburb" and "suburb wild", and the third preset threshold is, for example, 2.1. The first target numbers of character segments of the two candidates are normalized by the preset program; for example, the normalized value of "city suburb" is 0.4 and the normalized value of "suburb wild" is 0.9. It may further be determined, for example, that the first category probability of "city" of "city suburb" as a place name starting word segment is 0.5 and the second category probability of "suburb" of "city suburb" as a place name terminating word segment is 0.4, while the first category probability of "suburb" of "suburb wild" as a place name starting word segment is 0.7 and the second category probability of "wild" of "suburb wild" as a place name terminating word segment is 0.8. The target sum calculated for "city suburb" is then 1.3 and that for "suburb wild" is 2.4; combined with the third preset threshold of 2.1, "city suburb" may be filtered out of the candidate information, and target candidate information determined from "suburb wild".
Through the embodiments of the present disclosure, an implementation of the preset rule is provided that combines the first target number of character segments included in the candidate information with the category probabilities of its starting and terminating word segments serving as starting and terminating word segments of the predefined category; the candidate information can thus be selected to obtain more reliable target candidate information.
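A minimal sketch of this target-sum rule, using the illustrative probabilities and normalized values from the "CJ"/"JD" example (the tuple layout is an assumption, not the patent's data format):

```python
def filter_by_target_sum(candidates, threshold):
    """candidates: list of (text, start_prob, end_prob, norm_value).
    Keep a candidate when start_prob + end_prob + norm_value >= threshold
    (the third preset threshold)."""
    kept = []
    for text, p_start, p_end, norm in candidates:
        if p_start + p_end + norm >= threshold:
            kept.append(text)
    return kept

candidates = [
    ("CJ", 0.8, 0.8, 0.8),  # target sum 2.4
    ("JD", 0.5, 0.3, 0.3),  # target sum 1.1
]
targets = filter_by_target_sum(candidates, threshold=2.0)
# "JD" falls below the third preset threshold of 2.0 and is filtered out.
```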
According to an embodiment of the present disclosure, selecting the at least one piece of candidate information according to the preset rule to obtain target candidate information includes: determining, for each piece of candidate information, a second target number of character segments included in the candidate information, and determining the candidate information as target candidate information if the second target number is less than or equal to a fourth preset threshold.
According to an embodiment of the present disclosure, for the same piece of candidate information the second target number equals the first target number. The fourth preset threshold may be used to limit the length of an entity; for example, candidate information including too many character segments may be filtered out. The value of the fourth preset threshold may be determined according to the semantic information of the text to be recognized and the number of character segments included in its word segments, or may be custom-defined.
For example, according to the semantic information of "NJSCJDQ", the value of the fourth preset threshold may be preset to 3. The candidate information identified for "NJSCJDQ" further includes, for example, "NJSC" and "SCJDQ", which include 4 and 5 character segments respectively, so "NJSC" and "SCJDQ" may be filtered out of the candidate information and the target candidate information determined from the remaining candidate information.
According to an embodiment of the present disclosure, according to the semantic information of "urban suburban park", the value of the fourth preset threshold may be preset to 4. The candidate information identified for "urban suburban park" further includes, for example, "suburban park" and "city suburban park", which include 4 and 5 character segments respectively, so "city suburban park" may be filtered out of the candidate information and the target candidate information determined from the remaining candidate information.
Through the embodiments of the present disclosure, another implementation of the preset rule is provided that uses the second target number of character segments included in the candidate information; the candidate information can thus be selected to obtain target candidate information that meets the entity length requirement.
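The length rule above reduces to a one-line filter; this sketch assumes the second target number of a candidate is simply its character count:

```python
def filter_by_length(candidates, max_segments):
    """Keep candidates whose character-segment count (the second target
    number) is at most the fourth preset threshold."""
    return [c for c in candidates if len(c) <= max_segments]

# With the fourth preset threshold set to 3, "NJSC" (4 segments) and
# "SCJDQ" (5 segments) are filtered out, as in the example above.
remaining = filter_by_length(["NJ", "CJ", "NJSC", "SCJDQ"], max_segments=3)
```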
According to an embodiment of the present disclosure, selecting the at least one piece of candidate information according to the preset rule to obtain target candidate information includes: when candidate information including predefined word segment information exists in the at least one piece of candidate information, filtering out the candidate information including the predefined word segment information to obtain new candidate information, and determining the target candidate information according to the new candidate information.
According to an embodiment of the present disclosure, a filtered segment vocabulary may be preset for screening the content of the candidate information. The vocabulary may include predefined word segment information that cannot appear inside an entity, such as stop words like "of" and "at" and symbols like ",", "-", and quotation marks. When candidate information includes predefined word segment information from the filtered segment vocabulary, that candidate information may be filtered out, and the remaining candidate information used as the new candidate information for determining the target candidate information.
For example, the candidate information identified for the text "NJS,CJDQ" may include "S,C"; since this candidate information includes ",", which is in the filtered segment vocabulary, it may be filtered out.
According to an embodiment of the present disclosure, the candidate information identified for the text "city, suburb park" may include "city, suburb"; since this candidate information includes ",", which is in the filtered segment vocabulary, it may be filtered out.
Through the embodiments of the present disclosure, a further implementation of the preset rule is provided that uses the predefined word segment information; the candidate information can thus be selected to obtain target candidate information that meets the entity content requirement.
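A sketch of this content rule, assuming a small filtered segment vocabulary of stop words and punctuation that may not appear inside an entity (the exact vocabulary contents are illustrative):

```python
FILTERED_SEGMENTS = {",", "-", '"', "of", "at"}

def filter_by_content(candidates, filtered=FILTERED_SEGMENTS):
    """Drop any candidate that contains a predefined (forbidden) word
    segment; the survivors form the new candidate information."""
    return [c for c in candidates
            if not any(seg in c for seg in filtered)]

# "S,C" contains "," from the filtered segment vocabulary and is dropped.
new_candidates = filter_by_content(["S,C", "CJ", "JDQ"])
```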
According to an embodiment of the present disclosure, selecting the at least one piece of candidate information according to the preset rule to obtain target candidate information includes: calculating, for each piece of candidate information, the target sum of the first category probability of its starting word segment as a starting word segment of the predefined category, the second category probability of its terminating word segment as a terminating word segment of the predefined category, and the normalized value corresponding to the number of character segments in the candidate information; sorting the candidate information according to the target sums to obtain a sorting result; and determining a predetermined number of pieces of candidate information as the target candidate information according to the sorting result.
According to an embodiment of the present disclosure, the predetermined number may be used to determine the maximum number of pieces of entity information recognized for one text to be recognized, the number of pieces of entity information of the same category recognized for one text to be recognized, and the like.
For example, the predetermined number specifies that the entity information to be recognized for the text "NJSCJDQ" is 0 pieces of person name entity information and 3 pieces of place name entity information. The candidate information includes, for example, the place name "NJ", the place name "NJS", the person name "SC", the place name "CJ", the person name "JDQ", the place name "DQ", and the place name "Q", with target sums of, for example, 2.3, 2.5, 1.0, 2.4, 0.7, 2.0, and 1.7 respectively. The candidate information may thus be ranked as "NJS", "CJ", "NJ", "DQ", "Q", "SC", "JDQ", and according to the predetermined number the target candidate information may be determined to include "NJS", "CJ", and "NJ".
According to an embodiment of the present disclosure, the predetermined number may be, for example, 2 pieces of place name entity information for the text "urban suburban park". The candidate information includes, for example, the place names "city", "countryside", and "park", with target sums of, for example, 2.0, 2.5, and 2.3 respectively. Ranking the candidate information yields "countryside", "park", "city", and according to the predetermined number the target candidate information may be determined to include "countryside" and "park".
Through the embodiments of the present disclosure, another implementation of the preset rule is provided that uses the predetermined number; the candidate information can thus be selected down to a predetermined number of pieces of target candidate information, which reduces the data volume and simplifies subsequent computation.
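The ranking rule above can be sketched as sorting by target sum and keeping a per-category quota; the texts and sums are the illustrative values from the "NJSCJDQ" example, and the tuple layout is an assumption:

```python
def top_k_by_category(candidates, limits):
    """candidates: list of (text, category, target_sum);
    limits: {category: predetermined number of entities to keep}."""
    ranked = sorted(candidates, key=lambda c: c[2], reverse=True)
    kept, counts = [], {}
    for text, cat, _ in ranked:
        if counts.get(cat, 0) < limits.get(cat, 0):
            kept.append(text)
            counts[cat] = counts.get(cat, 0) + 1
    return kept

candidates = [
    ("NJ", "place", 2.3), ("NJS", "place", 2.5), ("SC", "person", 1.0),
    ("CJ", "place", 2.4), ("JDQ", "person", 0.7), ("DQ", "place", 2.0),
    ("Q", "place", 1.7),
]
# 0 person name entities and 3 place name entities, as in the example.
targets = top_k_by_category(candidates, {"person": 0, "place": 3})
```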
According to an embodiment of the present disclosure, the entity information identification method may be implemented with a deep learning model based on Paddle 2.0, which may include an entity candidate extraction model based on FLAT + ERNIE + MRC, the preset rules, and an entity discrimination model based on ERNIE + Softmax; the two models may be jointly trained.
Through the embodiments of the present disclosure, the original approach of stacking multiple thresholds and rules is replaced in the entity candidate screening stage by a combination of thresholds and models, which yields a better recognition effect on entity information identification tasks of different fields and different characteristics, and can improve the accuracy and the recall rate of candidate information selection.
Fig. 7 schematically shows a block diagram of an entity information identification apparatus according to an embodiment of the present disclosure.
As shown in fig. 7, the entity information identifying apparatus 700 includes an entity identifying module 710, a feature extracting module 720, a semantic identifying module 730, and a determining module 740.
And the entity identification module 710 is configured to perform named entity recognition on the text to be recognized to obtain at least one piece of candidate information.
And the feature extraction module 720 is configured to perform feature extraction on each candidate information to obtain at least one piece of feature information.
The semantic recognition module 730 is configured to perform deep semantic recognition on each feature information to obtain at least one semantic recognition result.
The determining module 740 is configured to determine entity information corresponding to the text to be recognized from the at least one candidate information according to the at least one semantic recognition result.
According to an embodiment of the present disclosure, an entity identification module includes a generation unit and an entity identification unit.
And the generating unit is used for generating a head position identifier and a tail position identifier for each character segment and each word segment in the text to be recognized.
And the entity recognition unit is used for performing named entity recognition on the text to be recognized according to the head position identifiers, the tail position identifiers, and the text to be recognized, to obtain at least one piece of candidate information.
According to an embodiment of the present disclosure, an entity identification module includes a first determination unit, a second determination unit, a third determination unit, and a fourth determination unit.
The first determining unit is used for determining a first category probability of each word segment in the text to be recognized as a starting word segment of a predefined category and a second category probability of each word segment in the text to be recognized as a terminating word segment of the predefined category.
And the second determining unit is used for determining the word segment corresponding to the first class probability as the initial word segment under the condition that the numerical value of the first class probability is greater than or equal to a first preset threshold value.
And a third determining unit, configured to determine, as the termination word segment, the word segment corresponding to the second class probability if the value of the second class probability is greater than or equal to a second preset threshold.
A fourth determining unit for determining the candidate information according to the start word segment and the end word segment.
According to an embodiment of the present disclosure, the entity information identifying apparatus further includes a selection module.
And the selection module is used for selecting the at least one piece of candidate information according to a preset rule to obtain target candidate information.
The feature extraction module is further used for extracting features of the target candidate information.
According to an embodiment of the present disclosure, the selection module includes a fifth determination unit.
A fifth determining unit configured to, for each candidate information: and carrying out normalization processing on the first target number of the character segments in the candidate information to obtain a normalized value. A first class probability is determined for a start word segment of the candidate information as a start word segment of the predefined class and a second class probability is determined for a stop word segment of the candidate information as a stop word segment of the predefined class. And calculating the target sum of the first class probability, the second class probability and the normalized value. And determining the candidate information as target candidate information under the condition that the target sum is greater than or equal to a third preset threshold value.
According to an embodiment of the present disclosure, the selection module includes a sixth determination unit.
A sixth determining unit configured to, for each candidate information: a second target number of word fragments that the candidate information includes is determined. And determining the candidate information as the target candidate information if the second target number is less than or equal to a fourth preset threshold.
According to an embodiment of the present disclosure, the selection module includes a filtering unit and a seventh determination unit.
And the filtering unit is used for filtering the candidate information comprising the predefined word fragment information to obtain new candidate information under the condition that the candidate information comprising the predefined word fragment information exists in at least one candidate information.
A seventh determining unit configured to determine target candidate information based on the new candidate information.
According to an embodiment of the present disclosure, the selection module includes a calculation unit, a sorting unit, and an eighth determination unit.
And the calculating unit is used for calculating the target sum of the first category probability of the starting word segment of the candidate information as a starting word segment of the predefined category, the second category probability of the terminating word segment of the candidate information as a terminating word segment of the predefined category, and the normalized value corresponding to the number of character segments in the candidate information.
And the sorting unit is used for sorting the candidate information according to the target sum to obtain a sorting result.
An eighth determining unit configured to determine a predetermined number of candidate information as the target candidate information according to the sorting result.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
According to an embodiment of the present disclosure, an electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the entity information identification method as described above.
According to an embodiment of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the entity information identification method as described above.
According to an embodiment of the present disclosure, a computer program product comprising a computer program which, when executed by a processor, implements the entity information identification method as described above.
FIG. 8 illustrates a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown here, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present disclosure described and/or claimed here.
As shown in fig. 8, the device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
Computing unit 801 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 801 executes the respective methods and processes described above, such as the entity information identification method. For example, in some embodiments, the entity information identification method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When loaded into the RAM 803 and executed by the computing unit 801, the computer program may perform one or more of the steps of the entity information identification method described above. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the entity information identification method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described here may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, so long as the desired results of the technical solutions disclosed in the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (19)

1. An entity information identification method includes:
performing named entity recognition on a text to be recognized to obtain at least one piece of candidate information;
extracting features of each piece of candidate information to obtain at least one piece of feature information;
performing deep semantic recognition on each piece of feature information to obtain at least one semantic recognition result; and
determining, according to the at least one semantic recognition result, entity information corresponding to the text to be recognized from the at least one piece of candidate information.
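The four operations of claim 1 form a pipeline that can be sketched as follows. This is an illustrative sketch only: the three model stages are stand-in heuristics invented for the example (a capitalization-based tagger, a length feature, and a length-based score), not the claimed trained models, and the threshold is an assumed parameter.

```python
# Illustrative pipeline for the method of claim 1. Each stage is a
# placeholder heuristic; in practice each would be a trained model.

def named_entity_recognition(text):
    # Stand-in NER: treat capitalized tokens as candidate entities.
    return [tok for tok in text.split() if tok[0].isupper()]

def extract_features(candidate):
    # Stand-in feature extraction: record the candidate text and length.
    return {"text": candidate, "length": len(candidate)}

def deep_semantic_recognition(features):
    # Stand-in semantic score: longer candidates simply score higher.
    return features["length"] / 10.0

def identify_entities(text, threshold=0.3):
    candidates = named_entity_recognition(text)                      # step 1
    feature_list = [extract_features(c) for c in candidates]         # step 2
    scores = [deep_semantic_recognition(f) for f in feature_list]    # step 3
    # Step 4: keep candidates whose semantic score clears the threshold.
    return [c for c, s in zip(candidates, scores) if s >= threshold]

print(identify_entities("Beijing Baidu Netcom filed the patent in Beijing"))
```

Raising the assumed threshold narrows the output to the highest-scoring candidates, mirroring how the determining step selects entity information from the candidates.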
2. The method of claim 1, wherein the performing named entity recognition on the text to be recognized to obtain at least one piece of candidate information comprises:
generating a head position identifier and a tail position identifier for each character segment and each word segment in the text to be recognized; and
performing named entity recognition on the text to be recognized according to the head position identifiers, the tail position identifiers and the text to be recognized, to obtain the at least one piece of candidate information.
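The head and tail position identifiers of claim 2 can be sketched as follows. The representation is an assumption for illustration: 0-based character offsets over whitespace-split word segments, with an inclusive tail position.

```python
# Illustrative sketch of claim 2: attach a head (start) and tail (end)
# position identifier to each word segment of the text. Offsets are
# 0-based character indices, an assumed encoding for this example.

def position_identifiers(text):
    spans = []
    offset = 0
    for token in text.split():
        head = text.index(token, offset)   # head position identifier
        tail = head + len(token) - 1       # inclusive tail position identifier
        spans.append((token, head, tail))
        offset = tail + 1
    return spans

print(position_identifiers("entity information identification"))
```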
3. The method according to claim 1 or 2, wherein the performing named entity recognition on the text to be recognized to obtain at least one piece of candidate information comprises:
determining, for each word segment in the text to be recognized, a first category probability of the word segment serving as a starting word segment of a predefined category and a second category probability of the word segment serving as a terminating word segment of the predefined category;
determining the word segment corresponding to the first category probability as a starting word segment in a case that a value of the first category probability is greater than or equal to a first preset threshold;
determining the word segment corresponding to the second category probability as a terminating word segment in a case that a value of the second category probability is greater than or equal to a second preset threshold; and
determining the candidate information according to the starting word segment and the terminating word segment.
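The span decoding of claim 3 can be sketched as follows. The probabilities and thresholds below are invented for illustration, and the pairing rule (each start pairs with the nearest terminating segment at or after it) is one assumed strategy; the claim itself does not fix a pairing scheme.

```python
# Illustrative decoding for claim 3: a token becomes a starting word
# segment when its first-category probability meets the first threshold,
# and a terminating word segment when its second-category probability
# meets the second threshold; candidates are built from start/end pairs.

def decode_spans(tokens, start_probs, end_probs, t1=0.5, t2=0.5):
    starts = [i for i, p in enumerate(start_probs) if p >= t1]
    ends = [i for i, p in enumerate(end_probs) if p >= t2]
    candidates = []
    for s in starts:
        # Assumed pairing rule: nearest terminating segment not before the start.
        for e in ends:
            if e >= s:
                candidates.append(" ".join(tokens[s:e + 1]))
                break
    return candidates

tokens = ["Beijing", "Baidu", "Netcom", "filed", "a", "patent"]
start_probs = [0.9, 0.2, 0.1, 0.1, 0.1, 0.6]   # invented values
end_probs   = [0.1, 0.2, 0.8, 0.1, 0.1, 0.7]   # invented values
print(decode_spans(tokens, start_probs, end_probs))
```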
4. The method of any of claims 1 to 3, further comprising:
selecting, according to a preset rule, target candidate information from the at least one piece of candidate information;
wherein the extracting features of each piece of candidate information comprises:
extracting features of the target candidate information.
5. The method of claim 4, wherein the selecting, according to the preset rule, the target candidate information from the at least one piece of candidate information comprises:
for each piece of candidate information:
performing normalization processing on a first target number of character segments in the candidate information to obtain a normalized value;
determining a first category probability of a starting word segment of the candidate information serving as a starting word segment of a predefined category, and a second category probability of a terminating word segment of the candidate information serving as a terminating word segment of the predefined category;
calculating a target sum of the first category probability, the second category probability, and the normalized value; and
determining the candidate information as the target candidate information in a case that the target sum is greater than or equal to a third preset threshold.
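The target-sum screening of claim 5 can be sketched numerically as follows. The normalization scheme (dividing each candidate's character count by the largest count among the candidates), the probabilities, and the third threshold are all assumptions for illustration; the claim only requires that the character count be normalized in some way.

```python
# Illustrative screening for claim 5: target_sum = first category
# probability + second category probability + normalized character count,
# keeping candidates whose target sum meets the third preset threshold.

def filter_by_target_sum(candidates, t3=1.8):
    max_len = max(c["num_chars"] for c in candidates)
    kept = []
    for c in candidates:
        norm = c["num_chars"] / max_len          # assumed normalization
        target_sum = c["start_prob"] + c["end_prob"] + norm
        if target_sum >= t3:
            kept.append(c["text"])
    return kept

candidates = [
    {"text": "Baidu Netcom", "num_chars": 12, "start_prob": 0.9, "end_prob": 0.8},
    {"text": "a patent",     "num_chars": 8,  "start_prob": 0.4, "end_prob": 0.5},
]
print(filter_by_target_sum(candidates))
```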
6. The method of claim 4, wherein the selecting, according to the preset rule, the target candidate information from the at least one piece of candidate information comprises:
for each piece of candidate information:
determining a second target number of word segments included in the candidate information; and
determining the candidate information as the target candidate information in a case that the second target number is less than or equal to a fourth preset threshold.
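Claim 6's length-based rule reduces to a simple filter. The word-segment count is approximated here by whitespace splitting, and the fourth threshold value is an assumption for illustration.

```python
# Illustrative filter for claim 6: keep a candidate only when its
# word-segment count is at or below the fourth preset threshold.

def filter_by_length(candidates, t4=3):
    return [c for c in candidates if len(c.split()) <= t4]

print(filter_by_length(["Beijing Baidu Netcom", "a very long candidate span here"]))
```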
7. The method of claim 4, wherein the selecting, according to the preset rule, the target candidate information from the at least one piece of candidate information comprises:
filtering out, in a case that candidate information including predefined word segment information exists in the at least one piece of candidate information, the candidate information including the predefined word segment information to obtain new candidate information; and
determining the target candidate information according to the new candidate information.
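Claim 7's rule can be sketched as a stop-list filter. The predefined word segments here are an assumed small stop list; the claim does not specify which segments are predefined.

```python
# Illustrative filter for claim 7: discard any candidate containing a
# predefined word segment, yielding the new candidate information.

PREDEFINED_SEGMENTS = {"the", "a", "of"}  # assumed stop list

def filter_predefined(candidates):
    return [c for c in candidates
            if not any(tok in PREDEFINED_SEGMENTS for tok in c.split())]

print(filter_predefined(["Baidu Netcom", "a patent", "entity recognition"]))
```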
8. The method of claim 4, wherein the selecting, according to the preset rule, the target candidate information from the at least one piece of candidate information comprises:
calculating, for each piece of candidate information, a target sum of a first category probability of a starting word segment of the candidate information serving as a starting word segment of a predefined category, a second category probability of a terminating word segment of the candidate information serving as a terminating word segment of the predefined category, and a normalized value corresponding to the number of word segments in the candidate information;
sorting the candidate information according to the target sums to obtain a sorting result; and
determining a predetermined number of pieces of candidate information as the target candidate information according to the sorting result.
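Unlike claim 5's absolute threshold, claim 8 ranks candidates by the target sum and keeps a predetermined number. The scores and the value of k below are invented for illustration; only the sort-then-truncate structure comes from the claim.

```python
# Illustrative top-k selection for claim 8: sort candidates by their
# target sums in descending order and keep a predetermined number.

def top_k_candidates(scored, k=2):
    # scored: list of (candidate_text, target_sum) pairs
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:k]]

scored = [("Baidu Netcom", 2.7), ("a patent", 1.6), ("Beijing", 2.1)]
print(top_k_candidates(scored, k=2))
```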
9. An entity information identification apparatus comprising:
an entity identification module configured to perform named entity recognition on a text to be recognized to obtain at least one piece of candidate information;
a feature extraction module configured to extract features of each piece of candidate information to obtain at least one piece of feature information;
a semantic recognition module configured to perform deep semantic recognition on each piece of feature information to obtain at least one semantic recognition result; and
a determining module configured to determine, according to the at least one semantic recognition result, entity information corresponding to the text to be recognized from the at least one piece of candidate information.
10. The apparatus of claim 9, wherein the entity identification module comprises:
a generating unit configured to generate a head position identifier and a tail position identifier for each character segment and each word segment in the text to be recognized; and
an entity identification unit configured to perform named entity recognition on the text to be recognized according to the head position identifiers, the tail position identifiers and the text to be recognized, to obtain the at least one piece of candidate information.
11. The apparatus of claim 9 or 10, wherein the entity identification module comprises:
a first determining unit configured to determine, for each word segment in the text to be recognized, a first category probability of the word segment serving as a starting word segment of a predefined category and a second category probability of the word segment serving as a terminating word segment of the predefined category;
a second determining unit configured to determine the word segment corresponding to the first category probability as a starting word segment in a case that a value of the first category probability is greater than or equal to a first preset threshold;
a third determining unit configured to determine the word segment corresponding to the second category probability as a terminating word segment in a case that a value of the second category probability is greater than or equal to a second preset threshold; and
a fourth determining unit configured to determine the candidate information according to the starting word segment and the terminating word segment.
12. The apparatus of any of claims 9 to 11, further comprising:
a selection module configured to select, according to a preset rule, target candidate information from the at least one piece of candidate information;
wherein the feature extraction module is configured to extract features of the target candidate information.
13. The apparatus of claim 12, wherein the selection module comprises:
a fifth determining unit configured to, for each piece of candidate information:
perform normalization processing on a first target number of character segments in the candidate information to obtain a normalized value;
determine a first category probability of a starting word segment of the candidate information serving as a starting word segment of a predefined category, and a second category probability of a terminating word segment of the candidate information serving as a terminating word segment of the predefined category;
calculate a target sum of the first category probability, the second category probability, and the normalized value; and
determine the candidate information as the target candidate information in a case that the target sum is greater than or equal to a third preset threshold.
14. The apparatus of claim 12, wherein the selection module comprises:
a sixth determining unit configured to, for each piece of candidate information:
determine a second target number of word segments included in the candidate information; and
determine the candidate information as the target candidate information in a case that the second target number is less than or equal to a fourth preset threshold.
15. The apparatus of claim 12, wherein the selection module comprises:
a filtering unit configured to filter out, in a case that candidate information including predefined word segment information exists in the at least one piece of candidate information, the candidate information including the predefined word segment information to obtain new candidate information; and
a seventh determining unit configured to determine the target candidate information according to the new candidate information.
16. The apparatus of claim 12, wherein the selection module comprises:
a calculating unit configured to calculate, for each piece of candidate information, a target sum of a first category probability of a starting word segment of the candidate information serving as a starting word segment of a predefined category, a second category probability of a terminating word segment of the candidate information serving as a terminating word segment of the predefined category, and a normalized value corresponding to the number of word segments in the candidate information;
a sorting unit configured to sort the candidate information according to the target sums to obtain a sorting result; and
an eighth determining unit configured to determine a predetermined number of pieces of candidate information as the target candidate information according to the sorting result.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-8.
19. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-8.
CN202111111471.6A 2021-09-22 2021-09-22 Entity information identification method, device, electronic equipment and storage medium Active CN113807099B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111111471.6A CN113807099B (en) 2021-09-22 2021-09-22 Entity information identification method, device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN113807099A true CN113807099A (en) 2021-12-17
CN113807099B CN113807099B (en) 2024-02-13

Family

ID=78896263

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111111471.6A Active CN113807099B (en) 2021-09-22 2021-09-22 Entity information identification method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113807099B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080052262A1 (en) * 2006-08-22 2008-02-28 Serhiy Kosinov Method for personalized named entity recognition
CN109165384A (en) * 2018-08-23 2019-01-08 成都四方伟业软件股份有限公司 A kind of name entity recognition method and device
CN110287479A (en) * 2019-05-20 2019-09-27 平安科技(深圳)有限公司 Name entity recognition method, electronic device and storage medium
US10629186B1 (en) * 2013-03-11 2020-04-21 Amazon Technologies, Inc. Domain and intent name feature identification and processing
CN112215008A (en) * 2020-10-23 2021-01-12 中国平安人寿保险股份有限公司 Entity recognition method and device based on semantic understanding, computer equipment and medium


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JING LI et al.: "A Survey on Deep Learning for Named Entity Recognition", arXiv *
吴小雪; 张庆辉: "Application of Pre-trained Language Models to Named Entity Recognition in Chinese Electronic Medical Records", Electronic Quality, no. 09 *
谢腾; 杨俊安; 刘辉: "Chinese Entity Recognition Based on the BERT-BiLSTM-CRF Model", Computer Systems & Applications, no. 07 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant