CN111178076A - Named entity identification and linking method, device, equipment and readable storage medium - Google Patents

Named entity identification and linking method, device, equipment and readable storage medium Download PDF

Info

Publication number
CN111178076A
CN111178076A (application number CN201911318901.4A)
Authority
CN
China
Prior art keywords
words
entity
word
probability
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911318901.4A
Other languages
Chinese (zh)
Other versions
CN111178076B (en)
Inventor
雷士驰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Oppo Communication Technology Co ltd
Original Assignee
Chengdu Oppo Communication Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Oppo Communication Technology Co ltd filed Critical Chengdu Oppo Communication Technology Co ltd
Priority to CN201911318901.4A priority Critical patent/CN111178076B/en
Publication of CN111178076A publication Critical patent/CN111178076A/en
Application granted granted Critical
Publication of CN111178076B publication Critical patent/CN111178076B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/955Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9558Details of hyperlinks; Management of linked annotations
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure provides a named entity recognition and linking method, apparatus, device, and readable storage medium. The method includes: acquiring a text to be recognized that contains a proper noun; segmenting the text to be recognized based on a pre-constructed dictionary, and splitting out the target entity word corresponding to the proper noun; and determining the type of the target entity word based on the labeling information of the target entity word. The dictionary includes entity words and background words; the probability of an entity word is nonlinearly related to its length, the probability of a background word is nonlinearly related to its length, and both probabilities are used to segment the text to be recognized. The method can improve the accuracy of named entity recognition.

Description

Named entity identification and linking method, device, equipment and readable storage medium
Technical Field
The present disclosure relates to the field of computer application technologies, and in particular, to a method, an apparatus, a device, and a readable storage medium for identifying and linking named entities.
Background
With the development of semantic recognition technology, named entity recognition and linking technology has been widely applied. For example, a voice assistant in a smart device (a smart phone, a tablet computer, etc.) acquires the user's speech, converts it into text through speech recognition, and then needs to recognize the named entities in that text so as to link them to the corresponding knowledge base. If the user says "play Qilixiang" by voice, the smart device recognizes the song "Qilixiang", invokes its music skill, and links the song to playback software to play it.
However, current named entity recognition and linking techniques still have difficulty resolving ambiguity.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
An aim of the present disclosure is to provide a named entity recognition and linking method, apparatus, device, and readable storage medium that can improve the accuracy of named entity recognition.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to one aspect of the present disclosure, there is provided a named entity recognition and linking method, including: acquiring a text to be recognized that contains a proper noun; segmenting the text to be recognized based on a pre-constructed dictionary, and splitting out the target entity word corresponding to the proper noun; and determining the type of the target entity word based on the labeling information of the target entity word; wherein the dictionary includes entity words and background words, the probability of an entity word is nonlinearly related to its length, the probability of a background word is nonlinearly related to its length, and both probabilities are used to segment the text to be recognized.
According to an embodiment of the present disclosure, segmenting words of the text to be recognized based on a pre-constructed dictionary, and splitting out target entity words corresponding to the proper nouns, includes: generating a directed acyclic graph of the text to be recognized based on the prefix tree constructed by the dictionary; searching a maximum probability path based on the probability of each entity word and the probability of each background word in the dictionary; determining the word segmentation result of the maximum probability path as the word segmentation result of the text to be recognized; and determining an entity word in the word segmentation result as the target entity word corresponding to the proper name.
According to an embodiment of the present disclosure, the probability of the entity word is the square of the length of the entity word multiplied by a preset threshold, and the probability of the background word is the square of the length of the background word.
According to an embodiment of the present disclosure, the background words are composed of sentence fragments from high-frequency query sentences and/or high-frequency query words.
According to an embodiment of the present disclosure, the sentence fragments are determined from the high frequency query sentences based on an N-Gram model.
According to an embodiment of the present disclosure, the dictionary further includes: and the substrings of the entity words and the background words are used for constructing a prefix tree of the dictionary.
According to an embodiment of the present disclosure, the method further comprises: and linking the target entity word to a named entity of a preset knowledge base according to the type of the target entity word.
According to another aspect of the present disclosure, there is provided a named entity identifying and linking apparatus, including: the text acquisition module is used for acquiring a text to be identified containing a proprietary name; the text word segmentation module is used for segmenting the text to be recognized based on a pre-constructed dictionary and splitting a target entity word corresponding to the proper noun; the type determining module is used for determining the type of the target entity word based on the labeling information of the target entity word; the dictionary comprises entity words and background words, the probability of the entity words is nonlinearly related to the length of the entity words, the probability of the background words is nonlinearly related to the length of the background words, and both the probability of the entity words and the probability of the background words are used for segmenting the text to be recognized.
According to still another aspect of the present disclosure, there is provided an electronic device including: a memory, a processor and executable instructions stored in the memory and executable in the processor, the processor implementing any of the methods described above when executing the executable instructions.
According to yet another aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer-executable instructions that, when executed by a processor, implement any of the methods described above.
According to the named entity recognition and linking method provided by the embodiment of the disclosure, when the text to be recognized is segmented, the probability values nonlinearly related to the lengths of the target words are respectively set for the entity words and the background words in the segmenter dictionary, so that the accuracy of segmenting the entity words can be improved, and ambiguity caused by segmentation is eliminated. In addition, because the type of the entity word can be directly determined according to the label information of the identified entity word, the synchronous execution of the entity identification and the entity link can be realized. The problem of parsing errors caused by ambiguity between an identification process and a linking process when entity identification and entity linking are respectively executed in the related technology is solved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.
FIG. 1 is a flow chart illustrating a method for named entity identification and linking in an embodiment of the present disclosure.
FIG. 2 is a flow diagram illustrating another method for named entity identification and linking in an embodiment of the present disclosure.
Fig. 3 is a diagram of a prefix tree shown according to an example.
FIG. 4 is a block diagram illustrating a named entity identifying and linking apparatus in an embodiment of the present disclosure.
Fig. 5 exemplarily illustrates a block diagram of an electronic device in an embodiment of the present disclosure.
Fig. 6 schematically illustrates a schematic diagram of a computer-readable storage medium in an embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
First, a related-art entity recognition method based on a word segmenter is introduced: a query sentence input by the user (for example, by voice) is segmented by a preset word segmenter; the entity words are then determined from the segmentation result, and the labeling of their types is completed at the same time.
However, because the dictionary itself contains ambiguity, the expected entity words may be difficult to extract from the segmentation result.
Query statement: Listen to the sound of the sea crying
Segmenter result: listen to the sea/(music) | crying | sound
Expected result: listen to | the sound of the sea crying/(music)
Query statement: I want to hear Xinqing
Segmenter result: I want to hear | heart/(music) | fine
Expected result: I want to hear | Xinqing/(music)
As shown in the table above, when the user inputs "listen to the sound of the sea crying", the segmenter returns "listen to the sea/(music) | crying | sound" and decides that "listen to the sea" is an entity word of the music type, whereas the user actually intends "listen to | the sound of the sea crying/(music)", in which the whole song title is the entity word.
Similarly, when the user inputs "I want to hear Xinqing", the segmenter returns "I want to hear | heart/(music) | fine", whereas the user actually intends "I want to hear | Xinqing/(music)".
Therefore, the entity recognition scheme of the related art suffers from semantic ambiguity caused by the way words are segmented.
It follows from the above analysis that, in the entity recognition process, the purpose of segmentation with the word segmenter is to determine the target entity words; how the segmenter splits background words and unknown words does not matter. Entity words are typically nouns corresponding to proper names, such as names of people, organizations, places, songs, books, and so on. A background word is, for example, "I want to hear" in "I want to hear Xinqing" mentioned above.
According to the named entity recognition and linking method provided by the embodiment of the disclosure, when the text to be recognized is segmented, the probability values nonlinearly related to the lengths of the target words are respectively set for the entity words and the background words in the segmenter dictionary, so that the accuracy of segmenting the entity words can be improved, and ambiguity caused by segmentation is eliminated.
To facilitate understanding, several terms referred to in the present disclosure are explained below.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics; research in this field involves natural language, that is, the language people use every day, so it is closely related to linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robot question answering, named entity recognition and linking, and knowledge graphs.
Named Entity Recognition (NER), also called "proper name recognition", refers to the recognition of entities with specific meanings in text, mainly including names of people, places, organizations, proper nouns, and so on. Named entity recognition is usually the first step of knowledge mining and information extraction, and is widely applied in the field of natural language processing.
Named Entity Linking (NEL), also referred to as "entity linking", refers to the process of linking an identified named entity to an unambiguous entity in a knowledge base. This technology can improve the information filtering capability of practical applications such as online recommendation systems and internet search engines.
The scheme provided by the embodiment of the disclosure relates to an entity identification and entity linking technology, and is specifically explained by the following embodiment.
First, the steps of the named entity identifying and linking method provided by the exemplary embodiment of the present disclosure will be described in more detail with reference to the accompanying drawings and examples.
FIG. 1 is a flow chart illustrating a method for named entity identification and linking in an embodiment of the present disclosure. The method provided by the embodiment of the disclosure can be executed by any electronic equipment with computing processing capacity.
As shown in FIG. 1, a named entity identification and linking method 10 includes:
in step S102, a text to be recognized including a proper name is acquired.
The text to be recognized may be obtained from speech input by the user through software such as a voice assistant and converted into text by speech recognition technology.
Alternatively, the text to be recognized may be input by the user through a user interface provided by the client software.
The text to be recognized contains a proper noun; for example, the user inputs "I want to hear Xinqing", where "Xinqing" is the title of a song.
In step S104, based on the pre-constructed dictionary, the text to be recognized is segmented, and the target entity words corresponding to the proper nouns are split.
The pre-constructed dictionary includes entity words and background words. The entity words are nouns corresponding to proper names, such as names of people, organizations, places, songs, books, and so on. The background words are the other auxiliary words, such as "I want to hear" in the above-mentioned "I want to hear Xinqing".
In some embodiments, the background words may be composed of sentence fragments and/or high-frequency query words taken from high-frequency query sentences. For example, high-frequency query sentences and/or high-frequency query words of online users may be collected. After the users' high-frequency query sentences are collected, sentence fragments can be split out of them with an N-Gram model; such fragments can be recombined into valid sentences. For example, for the collected query sentence "are you on leave today" with N = 2, the split fragments are the five overlapping two-character pieces of the sentence (rendered in the original as "you present", "today", "daily rest", "leave", and "do").
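For illustration only, the fragment splitting can be sketched as follows; the function name and the five-character sample text are assumptions, not taken from the patent.

```python
# Minimal sketch of N-gram fragment extraction for background-word candidates.
# The function name and the sample text are illustrative assumptions.
def ngram_fragments(sentence, n=2):
    """Return every contiguous fragment of n characters from the sentence."""
    return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]

# A five-character query yields four overlapping two-character fragments.
# "我要听心晴" is a hypothetical reconstruction of "I want to hear Xinqing".
print(ngram_fragments("我要听心晴"))
```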
The entity words may come from, for example, entity dictionaries collected for the actual intents and skills (e.g., music-playing skills, reading skills).
As described above, in the entity recognition process, the purpose of segmentation with the word segmenter is to determine the target entity words; how the segmenter splits background words and unknown words does not matter. Therefore, the probability of an entity word is expected to be larger than that of a background word, which guarantees that entity words are segmented out in preference to background words.
Also, for compound entities, the longest match is desired. For example, "birthday" and "happy" are both entity words, and "happy birthday" is also an entity word, a compound entity word consisting of "birthday" and "happy". When recognizing entity words, it is desirable to segment out "happy birthday" rather than "birthday" and "happy" separately; that is, the probability of "happy birthday" should be greater than the sum of the probabilities of "birthday" and "happy".
In addition, when a single character can combine with another background word to form a new background word, it is desirable to recognize the combined background word. For example, "hear" is a single character and "I want" is a background word. When recognizing entity words, it is desirable to segment out "I want to hear" rather than "I want" and "hear" separately.
After studying a large number of segmentation scenarios such as those above, the inventor found that when the probability of each entity word is set to be nonlinearly related to its length, and the probability of each background word is likewise set to be nonlinearly related to its length, the precision of the segmented entity words is greatly improved when segmentation is based on these probabilities.
In some embodiments, for example, the probability of each entity word may be set to the square of its length, and the probability of each background word may also be set to the square of its length. For example, regarding the segmentation of compound entity words: the length of "birthday" is 2, so its probability P(birthday) = 2² = 4; the length of "happy" is 2, so P(happy) = 2² = 4; the length of "happy birthday" is 4, so P(happy birthday) = 4² = 16. Thus P(happy birthday) > P(birthday) + P(happy), and when words are segmented based on the probability of each entity word, "happy birthday" is segmented out preferentially. As another example, when segmenting "I want to hear", P(I want to hear) > P(I want) + P(hear), so "I want to hear" is segmented out preferentially.
Furthermore, the probability of each entity word may be set to the square of its length multiplied by a preset threshold, and different entity words may have different thresholds. For example, where "play" exists both as a background word and as an entity word, the probability of the entity word "play" may be multiplied by a threshold greater than 1 to raise its segmentation priority, so that P(play as entity word) > P(play as background word).
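This scheme can be sketched in a few lines. The Chinese strings below are assumed reconstructions of the translated examples ("birthday", "happy", "happy birthday", "I want to hear"), and the boost value is illustrative.

```python
# Sketch of the length-based probability scheme described above:
# prob(word) = boost * len(word)**2, where boost defaults to 1 and an
# entity word may carry a preset threshold (boost) greater than 1.
def word_prob(word, boost=1.0):
    return boost * len(word) ** 2

# Compound entity beats its parts: 16 > 4 + 4 ("happy birthday" vs. "birthday" + "happy").
assert word_prob("生日快乐") > word_prob("生日") + word_prob("快乐")

# Longer background word beats its fragments: 9 > 4 + 1 ("I want to hear" vs. "I want" + "hear").
assert word_prob("我要听") > word_prob("我要") + word_prob("听")

# A per-word threshold promotes an entity reading over an identical background word (assumed example).
assert word_prob("播放", boost=1.5) > word_prob("播放")
```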
In step S106, the type of the target entity word is determined based on the labeling information of the target entity word.
The entity words in the dictionary also have labeling information for labeling the types corresponding to the entity words. An entity word may correspond to one or more (e.g., two, three, etc.) types.
Based on the labeled information of the segmented target entity words, the type of the target entity words can be determined.
Further, in some embodiments, the named entity recognition and linking method 10 may further include: in step S108, linking the target entity word to a named entity in a preset knowledge base according to the type of the target entity word, thereby embedding the recognized named entity into the existing knowledge base.
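A minimal sketch of steps S106 and S108 under assumed data structures follows; the dictionary entries, type labels, and knowledge-base identifiers are illustrative, not from the patent.

```python
# Illustrative sketch (assumed data structures): determining an entity word's
# type from its label information and linking it into the matching knowledge base.
ENTITY_LABELS = {"Xinqing": ["music"], "happy birthday": ["music", "book"]}   # word -> labeled types
KNOWLEDGE_BASE = {"music": {"Xinqing": "kb://music/xinqing"}, "book": {}}     # type -> entities

def link_entity(target_word):
    """Return (type, knowledge-base entry) pairs the target entity word links to."""
    links = []
    for entity_type in ENTITY_LABELS.get(target_word, []):
        entry = KNOWLEDGE_BASE.get(entity_type, {}).get(target_word)
        if entry is not None:
            links.append((entity_type, entry))
    return links

print(link_entity("Xinqing"))  # [('music', 'kb://music/xinqing')]
```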
According to the named entity recognition and linking method provided by the embodiments of the present disclosure, when the text to be recognized is segmented, probability values nonlinearly related to word length are set for the entity words and the background words in the segmenter dictionary, which improves the accuracy with which entity words are segmented out and eliminates ambiguity caused by segmentation. In addition, because the type of an entity word can be determined directly from the label information of the recognized entity word, entity recognition and entity linking can be performed synchronously. This solves the parsing errors caused by ambiguity between the recognition stage and the linking stage when entity recognition and entity linking are performed separately in the related art, for example when a query such as "play | like summer and like fall" is parsed with an incorrect boundary between the background words and the song title, which leads to a linking error.
FIG. 2 is a flow diagram illustrating another method for named entity identification and linking in an embodiment of the present disclosure. Unlike the named entity recognition and linking method 10 shown in fig. 1, the named entity recognition and linking method shown in fig. 2 further provides an exemplary embodiment of how to split the text to be recognized into target entity words corresponding to proper nouns based on a pre-constructed dictionary, that is, provides a specific implementation manner of step S104.
Referring to fig. 2, step S104 includes:
in step S1042, a directed acyclic graph of the text to be recognized is generated based on the dictionary-built prefix tree (Trie tree).
The prefix tree organizes character strings into a tree structure. Fig. 3 is a diagram of a prefix tree shown according to an example. The prefix tree shown in Fig. 3 is composed of five Chinese words, rendered here as "Qinghua University", "Qinghua", "refreshing", "China", and "Hua". Each box in the tree represents a node, where "Root" represents the root node and does not represent any character; "1" to "5" denote leaf nodes. Every node except the root node contains exactly one character. Connecting the characters along the path from the root node to a leaf node yields a word. The numbers in the leaf nodes indicate which chain of the dictionary tree the word lies on (the tree has as many chains as the dictionary has words), and chains with a common prefix share the same prefix path in the tree.
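For illustration, a prefix tree of this kind can be sketched as follows; the five-word list is a rough reconstruction of the Fig. 3 examples, and the exact characters are an assumption.

```python
# Minimal prefix tree (Trie) sketch. Each node is a dict of child characters;
# the END key marks that the path from the root spells a complete dictionary word.
END = "#"

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root

def trie_contains(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node

# Rough reconstruction of the five Fig. 3 words ("Qinghua University", "Qinghua",
# "refreshing", "China", "Hua"); the exact characters are an assumption.
trie = build_trie(["清华大学", "清华", "清新", "中华", "华人"])
assert trie_contains(trie, "清华") and not trie_contains(trie, "清大")
```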
In some embodiments, the dictionary may further include: and the substrings of the entity words and the background words are used for constructing a prefix tree of the dictionary.
The prefix tree constructed based on the dictionary can realize efficient word graph scanning, and further generate a Directed Acyclic Graph (DAG) of the text to be recognized.
Take the above-mentioned "I want to hear Xinqing" as an example: it contains 5 characters, numbered 0 to 4, and its DAG = {0: [0,1,2], 1: [1], 2: [2,3], 3: [3,4], 4: [4]}. That is, with character 0 ("I") as the starting point, the candidate words are "I", "I want", and "I want to hear"; with character 1 ("want") as the starting point, only "want"; with character 2 ("hear") as the starting point, "hear" and the two-character combination of "hear" and "heart"; with character 3 ("heart") as the starting point, "heart" and "Xinqing"; with character 4 as the starting point, only the final character.
Similarly, the DAG of "listen to the sound of the sea crying" is {0: [0,1], 1: [1,2,3,4,5], 2: [2], 3: [3], 4: [4,5], 5: [5]}.
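The word-graph scan that yields such a DAG can be sketched as below. For brevity it checks a flat word set rather than walking the prefix tree, and the sample text and dictionary are assumed reconstructions of the translated example.

```python
# Sketch of DAG construction: dag[i] lists every end index j such that
# text[i:j+1] is a dictionary word; single characters are always kept so the
# graph stays connected. A flat set stands in for the prefix-tree scan.
def build_dag(text, dictionary):
    dag = {}
    for i in range(len(text)):
        ends = [j for j in range(i, len(text)) if text[i:j + 1] in dictionary]
        dag[i] = ends or [i]
    return dag

# Assumed reconstruction of the translated example "I want to hear Xinqing".
words = {"我", "我要", "我要听", "要", "听", "听心", "心", "心晴", "晴"}
print(build_dag("我要听心晴", words))
# -> {0: [0, 1, 2], 1: [1], 2: [2, 3], 3: [3, 4], 4: [4]}
```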
In step S1044, a maximum probability path is searched for based on the probability of each entity word and the probability of each background word in the dictionary.
Taking the maximum-probability path calculation as an example, the probability of each segmentation path of "I want to hear Xinqing" is shown in the table below:
(Table of per-character segmentation-path probabilities, reproduced as an image in the original publication.)
The maximum-probability segmentation path found is {4: (1, 4), 3: (12, 4), 2: (13, 3), 1: (14, 1), 0: (39, 2)}, where the key before each pair is the character index, the first number in the pair is the segmentation-path probability, and the second number is the character index at which the word starting at that position ends.
Similarly, the maximum-probability segmentation path of "listen to the sound of the sea crying" is found as {5: (1, 5), 4: (4, 5), 3: (5, 3), 2: (6, 2), 1: (75, 5), 0: (76, 0)}.
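The route search itself is a right-to-left dynamic program. The sketch below follows the description above using the length-squared probabilities; the exact values in the routes shown also depend on the preset per-word thresholds, which the text does not give, so the sketch reproduces the segmentation rather than those specific numbers.

```python
# Sketch of the maximum-probability path search over the DAG, computed from
# right to left by dynamic programming. route[i] = (best accumulated
# probability for text[i:], end index of the word chosen at position i).
def max_prob_route(text, dag, prob):
    route = {len(text): (0.0, 0)}
    for i in range(len(text) - 1, -1, -1):
        route[i] = max((prob(text[i:j + 1]) + route[j + 1][0], j) for j in dag[i])
    return route

def segment(text, route):
    words, i = [], 0
    while i < len(text):
        j = route[i][1]
        words.append(text[i:j + 1])
        i = j + 1
    return words

# Usage (with word_prob and build_dag from the sketches above):
#   route = max_prob_route(text, build_dag(text, words), word_prob)
#   segment(text, route)   # e.g. ['我要听', '心晴'] for the reconstructed example text
```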
In step S1046, the word segmentation result of the maximum probability path is determined as the word segmentation result of the text to be recognized.
According to the maximum-probability path, the word segmentation result is determined to be "I want to hear | Xinqing".
Similarly, the word segmentation result of "listen to the sound of the sea crying" is determined to be "listen to | the sound of the sea crying".
In step S1048, the entity word in the word segmentation result is determined as the target entity word corresponding to the proper name.
The entity word in "I want to hear | Xinqing" is "Xinqing", so "Xinqing" is determined to be the target entity word corresponding to the proper noun in the text to be recognized.
Likewise, the entity word in "listen to | the sound of the sea crying" is "the sound of the sea crying", so it is determined to be the target entity word corresponding to the proper noun in the text to be recognized.
Based on the scheme, the accuracy of entity word recognition can be improved, and the consumed time can be reduced to a microsecond level.
The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods. For details not disclosed in the embodiments of the apparatus of the present disclosure, refer to the embodiments of the method of the present disclosure.
FIG. 4 is a block diagram illustrating a named entity identifying and linking apparatus in an embodiment of the present disclosure. Referring to fig. 4, the named entity recognition and linking means 20 comprises: a text acquisition module 202, a text segmentation module 204, and a type determination module 206.
The text acquiring module 202 is configured to acquire a text to be recognized that includes a proprietary name.
The text word segmentation module 204 is configured to segment the text to be recognized based on a pre-constructed dictionary and split out the target entity word corresponding to the proper noun.
The type determining module 206 is configured to determine the type of the target entity word based on the labeling information of the target entity word;
the dictionary comprises entity words and background words, the probability of the entity words is nonlinearly related to the length of the entity words, the probability of the background words is nonlinearly related to the length of the background words, and both the probability of the entity words and the probability of the background words are used for segmenting the text to be recognized.
In some embodiments, text segmentation module 204 includes: the device comprises a directed acyclic graph generating unit, a path searching unit, a word segmentation result determining unit and a target entity word determining unit. The directed acyclic graph generating unit is used for generating a directed acyclic graph of the text to be recognized based on the prefix tree constructed by the dictionary. The path searching unit is used for searching a maximum probability path based on the probability of each entity word and the probability of each background word in the dictionary. The word segmentation result determining unit is used for determining the word segmentation result of the maximum probability path as the word segmentation result of the text to be recognized. The target entity word determining unit is used for determining the entity words in the word segmentation result as target entity words corresponding to the proper names.
In some embodiments, the probability of the entity word is the square of the length of the entity word multiplied by a preset threshold, and the probability of the background word is the square of the length of the background word.
In some embodiments, the background words are composed of sentence fragments and/or high frequency query words in the high frequency query sentence.
In some embodiments, the statement fragments are determined from the high frequency query statement based on an N-Gram model.
In some embodiments, the dictionary further comprises: and substrings of the entity words and the background words are used for constructing a prefix tree of the dictionary.
In some embodiments, the named entity identifying and linking means 20 further comprises: and the entity linking module 208 is configured to link the target entity word to a named entity in the preset knowledge base according to the type of the target entity word.
When the named entity recognition and linking device provided by the embodiment of the disclosure is used for segmenting words of a text to be recognized, the accuracy of segmenting the entity words can be improved and ambiguity caused by word segmentation can be eliminated by respectively setting probability values nonlinearly related to the lengths of the target words for the entity words and the background words in the word segmenter dictionary. In addition, because the type of the entity word can be directly determined according to the label information of the identified entity word, the synchronous execution of the entity identification and the entity link can be realized. The problem of parsing errors caused by ambiguity between an identification process and a linking process when entity identification and entity linking are respectively executed in the related technology is solved.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module," or "system."
An electronic device 800 according to this embodiment of the disclosure is described below with reference to fig. 5. The electronic device 800 shown in fig. 5 is only an example and should not bring any limitations to the functionality and scope of use of the embodiments of the present disclosure.
As shown in fig. 5, the electronic device 800 is in the form of a general purpose computing device. The components of the electronic device 800 may include, but are not limited to: the at least one processing unit 810, the at least one memory unit 820, and a bus 830 that couples the various system components including the memory unit 820 and the processing unit 810.
Wherein the storage unit stores program code that is executable by the processing unit 810 to cause the processing unit 810 to perform steps according to various exemplary embodiments of the present disclosure as described in the "exemplary methods" section above in this specification. For example, the processing unit 810 may execute step S102 shown in fig. 1, and obtain a text to be recognized containing a proprietary name; step S104, based on a pre-constructed dictionary, performing word segmentation on the text to be recognized, and splitting a target entity word corresponding to a proper noun; and step S106, determining the type of the target entity word based on the labeling information of the target entity word.
The storage unit 820 may include readable media in the form of volatile memory units such as a random access memory unit (RAM)8201 and/or a cache memory unit 8202, and may further include a read only memory unit (ROM) 8203.
The storage unit 820 may also include a program/utility 8204 having a set (at least one) of program modules 8205, such program modules 8205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 830 may be any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 800 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 800, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 800 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 650. Also, the electronic device 800 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 860. As shown, the network adapter 860 communicates with the other modules of the electronic device 800 via the bus 830. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 800, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps according to various exemplary embodiments of the disclosure described in the "exemplary methods" section above of this specification, when the program product is run on the terminal device.
Referring to fig. 6, a program product 900 for implementing the above method according to an embodiment of the present disclosure is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
Moreover, although the steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (10)

1. A named entity identification and linking method, comprising:
acquiring a text to be identified containing a proprietary name;
dividing words of the text to be recognized based on a pre-constructed dictionary, and splitting target entity words corresponding to the proper nouns; and
determining the type of the target entity word based on the labeling information of the target entity word;
the dictionary comprises entity words and background words, the probability of the entity words is nonlinearly related to the length of the entity words, the probability of the background words is nonlinearly related to the length of the background words, and both the probability of the entity words and the probability of the background words are used for segmenting the text to be recognized.
2. The method of claim 1, wherein segmenting the text to be recognized into words based on a pre-constructed dictionary, and segmenting target entity words corresponding to the proper nouns, comprises:
generating a directed acyclic graph of the text to be recognized based on the prefix tree constructed by the dictionary;
searching a maximum probability path based on the probability of each entity word and the probability of each background word in the dictionary;
determining the word segmentation result of the maximum probability path as the word segmentation result of the text to be recognized; and
determining an entity word in the word segmentation result as the target entity word corresponding to the proper name.
3. The method according to claim 1 or 2, wherein the probability of the entity word is a square of a length of the entity word multiplied by a preset threshold, and the probability of the background word is a square of a length of the background word.
4. The method according to claim 1 or 2, wherein the background words are composed of sentence fragments and/or high frequency query words in a high frequency query sentence.
5. The method of claim 4, wherein the statement fragment is determined from the high frequency query statement based on an N-Gram model.
6. The method of claim 2, wherein the dictionary further comprises: and the substrings of the entity words and the background words are used for constructing a prefix tree of the dictionary.
7. The method of claim 1 or 2, further comprising:
and linking the target entity word to a named entity of a preset knowledge base according to the type of the target entity word.
8. A named entity recognition and linking apparatus, comprising:
the text acquisition module is used for acquiring a text to be identified containing a proprietary name;
the text word segmentation module is used for segmenting the text to be recognized based on a pre-constructed dictionary and splitting a target entity word corresponding to the proper noun; and
the type determining module is used for determining the type of the target entity word based on the labeling information of the target entity word;
the dictionary comprises entity words and background words, the probability of the entity words is nonlinearly related to the length of the entity words, the probability of the background words is nonlinearly related to the length of the background words, and both the probability of the entity words and the probability of the background words are used for segmenting the text to be recognized.
9. An electronic device, comprising: memory, processor and executable instructions stored in the memory and executable in the processor, characterized in that the processor implements the method according to any of claims 1-7 when executing the executable instructions.
10. A computer-readable storage medium having stored thereon computer-executable instructions, which when executed by a processor, implement the method of any one of claims 1-7.
CN201911318901.4A 2019-12-19 2019-12-19 Named entity recognition and linking method, device, equipment and readable storage medium Active CN111178076B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911318901.4A CN111178076B (en) 2019-12-19 2019-12-19 Named entity recognition and linking method, device, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911318901.4A CN111178076B (en) 2019-12-19 2019-12-19 Named entity recognition and linking method, device, equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN111178076A true CN111178076A (en) 2020-05-19
CN111178076B CN111178076B (en) 2023-08-08

Family

ID=70657555

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911318901.4A Active CN111178076B (en) 2019-12-19 2019-12-19 Named entity recognition and linking method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111178076B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111597324A (en) * 2020-05-20 2020-08-28 北京搜狗科技发展有限公司 Text query method and device
CN111881669A (en) * 2020-06-24 2020-11-03 百度在线网络技术(北京)有限公司 Synonymy text acquisition method and device, electronic equipment and storage medium
CN111950288A (en) * 2020-08-25 2020-11-17 海信视像科技股份有限公司 Entity labeling method in named entity recognition and intelligent equipment
CN113220835A (en) * 2021-05-08 2021-08-06 北京百度网讯科技有限公司 Text information processing method and device, electronic equipment and storage medium
CN113282689A (en) * 2021-07-22 2021-08-20 药渡经纬信息科技(北京)有限公司 Retrieval method and device based on domain knowledge graph and search engine

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104199972A (en) * 2013-09-22 2014-12-10 中科嘉速(北京)并行软件有限公司 Named entity relation extraction and construction method based on deep learning
US20150058005A1 (en) * 2013-08-20 2015-02-26 Cisco Technology, Inc. Automatic Collection of Speaker Name Pronunciations
CN105938495A (en) * 2016-04-29 2016-09-14 乐视控股(北京)有限公司 Entity relationship recognition method and apparatus
CN106503192A (en) * 2016-10-31 2017-03-15 北京百度网讯科技有限公司 Name entity recognition method and device based on artificial intelligence
CN106547733A (en) * 2016-10-19 2017-03-29 中国国防科技信息中心 A kind of name entity recognition method towards particular text
CN107577667A (en) * 2017-09-14 2018-01-12 北京奇艺世纪科技有限公司 A kind of entity word treating method and apparatus
US20190007711A1 (en) * 2017-07-02 2019-01-03 Comigo Ltd. Named Entity Disambiguation for providing TV content enrichment
CN109271631A (en) * 2018-09-12 2019-01-25 广州多益网络股份有限公司 Segmenting method, device, equipment and storage medium
CN109388803A (en) * 2018-10-12 2019-02-26 北京搜狐新动力信息技术有限公司 Chinese word cutting method and system
CN110532390A (en) * 2019-08-26 2019-12-03 南京邮电大学 A kind of news keyword extracting method based on NER and Complex Networks Feature
CN110555206A (en) * 2018-06-01 2019-12-10 中兴通讯股份有限公司 named entity identification method, device, equipment and storage medium

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150058005A1 (en) * 2013-08-20 2015-02-26 Cisco Technology, Inc. Automatic Collection of Speaker Name Pronunciations
CN104199972A (en) * 2013-09-22 2014-12-10 中科嘉速(北京)并行软件有限公司 Named entity relation extraction and construction method based on deep learning
CN105938495A (en) * 2016-04-29 2016-09-14 乐视控股(北京)有限公司 Entity relationship recognition method and apparatus
CN106547733A (en) * 2016-10-19 2017-03-29 中国国防科技信息中心 A kind of name entity recognition method towards particular text
CN106503192A (en) * 2016-10-31 2017-03-15 北京百度网讯科技有限公司 Name entity recognition method and device based on artificial intelligence
US20190007711A1 (en) * 2017-07-02 2019-01-03 Comigo Ltd. Named Entity Disambiguation for providing TV content enrichment
CN107577667A (en) * 2017-09-14 2018-01-12 北京奇艺世纪科技有限公司 A kind of entity word treating method and apparatus
CN110555206A (en) * 2018-06-01 2019-12-10 中兴通讯股份有限公司 named entity identification method, device, equipment and storage medium
CN109271631A (en) * 2018-09-12 2019-01-25 广州多益网络股份有限公司 Segmenting method, device, equipment and storage medium
CN109388803A (en) * 2018-10-12 2019-02-26 北京搜狐新动力信息技术有限公司 Chinese word cutting method and system
CN110532390A (en) * 2019-08-26 2019-12-03 南京邮电大学 A kind of news keyword extracting method based on NER and Complex Networks Feature

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
FRANCK DERNONCOURT, JI YOUNG LEE, PETER SZOLOVITS: "De-identification of Patient Notes with Recurrent Neural Networks", NEURON *
ZHIHENG HUANG, WEI XU, KAI YU: "Bidirectional LSTM-CRF Models for Sequence Tagging", COMPUTER SCIENCE *
SHI YU: "Research on Chinese Word Segmentation Methods Based on Deep Learning", Master's thesis, Nanjing University of Posts and Telecommunications *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111597324A (en) * 2020-05-20 2020-08-28 北京搜狗科技发展有限公司 Text query method and device
CN111597324B (en) * 2020-05-20 2023-10-03 北京搜狗科技发展有限公司 Text query method and device
CN111881669A (en) * 2020-06-24 2020-11-03 百度在线网络技术(北京)有限公司 Synonymy text acquisition method and device, electronic equipment and storage medium
CN111950288A (en) * 2020-08-25 2020-11-17 海信视像科技股份有限公司 Entity labeling method in named entity recognition and intelligent equipment
CN111950288B (en) * 2020-08-25 2024-02-23 海信视像科技股份有限公司 Entity labeling method in named entity recognition and intelligent device
CN113220835A (en) * 2021-05-08 2021-08-06 北京百度网讯科技有限公司 Text information processing method and device, electronic equipment and storage medium
CN113220835B (en) * 2021-05-08 2023-09-29 北京百度网讯科技有限公司 Text information processing method, device, electronic equipment and storage medium
CN113282689A (en) * 2021-07-22 2021-08-20 药渡经纬信息科技(北京)有限公司 Retrieval method and device based on domain knowledge graph and search engine

Also Published As

Publication number Publication date
CN111178076B (en) 2023-08-08

Similar Documents

Publication Publication Date Title
CN111178076B (en) Named entity recognition and linking method, device, equipment and readable storage medium
KR101130444B1 (en) System for identifying paraphrases using machine translation techniques
US9448995B2 (en) Method and device for performing natural language searches
US11017301B2 (en) Obtaining and using a distributed representation of concepts as vectors
WO2015135455A1 (en) Natural language question answering method and apparatus
CN109635297B (en) Entity disambiguation method and device, computer device and computer storage medium
EP3405912A1 (en) Analyzing textual data
CN108538286A (en) A kind of method and computer of speech recognition
CN105045852A (en) Full-text search engine system for teaching resources
JP2016522524A (en) Method and apparatus for detecting synonymous expressions and searching related contents
CN110347790B (en) Text duplicate checking method, device and equipment based on attention mechanism and storage medium
Saloot et al. An architecture for Malay Tweet normalization
US20160188569A1 (en) Generating a Table of Contents for Unformatted Text
CN114556328A (en) Data processing method and device, electronic equipment and storage medium
US10606903B2 (en) Multi-dimensional query based extraction of polarity-aware content
US10740570B2 (en) Contextual analogy representation
CN113220835A (en) Text information processing method and device, electronic equipment and storage medium
JP2015088064A (en) Text summarization device, text summarization method, and program
CN111611793B (en) Data processing method, device, equipment and storage medium
CN110705285A (en) Government affair text subject word bank construction method, device, server and readable storage medium
CN114430832A (en) Data processing method and device, electronic equipment and storage medium
CN114595696A (en) Entity disambiguation method, entity disambiguation apparatus, storage medium, and electronic device
CN112541062B (en) Parallel corpus alignment method and device, storage medium and electronic equipment
JP7122773B2 (en) DICTIONARY CONSTRUCTION DEVICE, DICTIONARY PRODUCTION METHOD, AND PROGRAM
JP2022055334A (en) Text processing method, apparatus, device and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant