CN111178076B - Named entity recognition and linking method, device, equipment and readable storage medium - Google Patents

Named entity recognition and linking method, device, equipment and readable storage medium

Info

Publication number
CN111178076B
CN111178076B CN201911318901.4A
Authority
CN
China
Prior art keywords
entity
words
word
probability
background
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911318901.4A
Other languages
Chinese (zh)
Other versions
CN111178076A (en)
Inventor
雷士驰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Oppo Communication Technology Co ltd
Original Assignee
Chengdu Oppo Communication Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Oppo Communication Technology Co ltd filed Critical Chengdu Oppo Communication Technology Co ltd
Priority to CN201911318901.4A priority Critical patent/CN111178076B/en
Publication of CN111178076A publication Critical patent/CN111178076A/en
Application granted granted Critical
Publication of CN111178076B publication Critical patent/CN111178076B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/955Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9558Details of hyperlinks; Management of linked annotations
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The present disclosure provides a named entity recognition and linking method, apparatus, device, and readable storage medium. The method includes the following steps: acquiring a text to be recognized that contains a proper name; segmenting the text to be recognized into words based on a pre-constructed dictionary, and splitting out the target entity word corresponding to the proper name; and determining the type of the target entity word based on the labeling information of the target entity word. The dictionary contains entity words and background words, where the probability of an entity word is nonlinearly correlated with its length, the probability of a background word is nonlinearly correlated with its length, and these probabilities are used to segment the text to be recognized. The method can improve the accuracy of named entity recognition.

Description

Named entity recognition and linking method, device, equipment and readable storage medium
Technical Field
The present disclosure relates to the field of computer application technologies, and in particular, to a named entity identification and linking method, apparatus, device, and readable storage medium.
Background
With the development of semantic recognition technology, named entity recognition and linking techniques are applied ever more widely. For example, a voice assistant in an intelligent device (such as a smartphone or tablet computer) captures the user's voice and converts it into text through speech recognition; the named entity in that text then needs to be recognized, and the recognized named entity linked into the corresponding knowledge base. If the user says 'play Qilixiang', the intelligent device recognizes the song 'Qilixiang' and links it through the music skill so that the song can be played.
However, current named entity recognition and linking techniques still have difficulty resolving ambiguity.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure aims to provide a named entity recognition and linking method, device, equipment, and readable storage medium that can improve the accuracy of named entity recognition.
Other features and advantages of the present disclosure will be apparent from the following detailed description, or may be learned in part by the practice of the disclosure.
According to one aspect of the present disclosure, there is provided a named entity recognition and linking method, including: acquiring a text to be identified containing a proper name; splitting the text to be recognized into words based on a pre-constructed dictionary, and splitting out target entity words corresponding to the proper nouns; determining the type of the target entity word based on the labeling information of the target entity word; the dictionary comprises entity words and background words, wherein the probability of the entity words is in nonlinear correlation with the length of the entity words, the probability of the background words is in nonlinear correlation with the length of the background words, and the probability of the entity words and the probability of the background words are used for word segmentation of the text to be recognized.
According to an embodiment of the present disclosure, based on a pre-constructed dictionary, segmenting the text to be recognized into words, and splitting out target entity words corresponding to the proper nouns, including: generating a directed acyclic graph of the text to be identified based on the prefix tree constructed by the dictionary; searching a maximum probability path based on the probability of each entity word and the probability of each background word in the dictionary; determining the word segmentation result of the maximum probability path as the word segmentation result of the text to be identified; and determining entity words in the word segmentation result as the target entity words corresponding to the proper names.
According to an embodiment of the disclosure, the probability of the entity word is the square of the length of the entity word multiplied by a preset threshold, and the probability of the background word is the square of the length of the background word.
According to an embodiment of the present disclosure, the background word is composed of a sentence fragment in the high frequency query sentence and/or the high frequency query word.
According to an embodiment of the disclosure, the statement fragments are determined from the high frequency query statement based on an N-Gram model.
According to an embodiment of the disclosure, the dictionary further comprises: and the substring of the entity word and the background word is used for constructing a prefix tree of the dictionary.
According to an embodiment of the present disclosure, the method further comprises: and according to the type of the target entity word, linking the target entity word into a named entity of a preset knowledge base.
According to another aspect of the present disclosure, there is provided a named entity recognition and linking apparatus, including: the text acquisition module is used for acquiring a text to be identified containing a proper name; the text word segmentation module is used for segmenting the text to be recognized based on a pre-constructed dictionary and splitting out target entity words corresponding to the proper nouns; the type determining module is used for determining the type of the target entity word based on the labeling information of the target entity word; the dictionary comprises entity words and background words, wherein the probability of the entity words is in nonlinear correlation with the length of the entity words, the probability of the background words is in nonlinear correlation with the length of the background words, and the probability of the entity words and the probability of the background words are used for word segmentation of the text to be recognized.
According to still another aspect of the present disclosure, there is provided an electronic apparatus including: the system comprises a memory, a processor and executable instructions stored in the memory and executable in the processor, wherein the processor implements any one of the methods when executing the executable instructions.
According to yet another aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer-executable instructions that, when executed by a processor, implement a method as any one of the above.
According to the named entity recognition and linking method provided by the embodiments of the present disclosure, when the text to be recognized is segmented, setting probability values that are nonlinearly related to word length for the entity words and the background words in the segmenter dictionary improves the accuracy with which entity words are cut out and eliminates ambiguity introduced by segmentation. In addition, the method determines the type of an entity word directly from the labeling information of the recognized entity word, so entity recognition and entity linking can be performed synchronously. This avoids the resolution errors caused by ambiguity between the recognition step and the linking step when entity recognition and entity linking are performed separately, as in the related art.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort.
FIG. 1 illustrates a flow chart of a named entity recognition and linking method in an embodiment of the present disclosure.
FIG. 2 illustrates a flow chart of another named entity recognition and linking method in an embodiment of the present disclosure.
Fig. 3 is a schematic diagram of a prefix tree shown according to an example.
FIG. 4 illustrates a block diagram of a named entity recognition and linking device in an embodiment of the present disclosure.
Fig. 5 schematically illustrates a block diagram of an electronic device in an embodiment of the disclosure.
Fig. 6 schematically illustrates a schematic diagram of a computer-readable storage medium in an embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
First, a related-art method for recognizing entities with a word segmenter is introduced: a query sentence input by the user (for example, by voice) is segmented by a preset word segmenter; based on the segmentation result, the entity words in it are determined and the labeling of their entity types is completed at the same time.
However, because of ambiguity contained in the dictionary itself, the segmentation result may make it difficult to extract the intended entity words.
Query sentence: Listen to the sound of the sea crying
Segmenter output: Listen to the sea/(music) | crying | sound
Expected result: Listen to | the sound of the sea crying/(music)

Query sentence: I want to listen to heart sunny
Segmenter output: I want to listen to heart/(music) | sunny
Expected result: I want to listen to | heart sunny/(music)
As shown above, when the user's input to be recognized is 'listen to the sound of the sea crying', the segmenter outputs 'listen to the sea/(music) | crying | sound' and determines that 'listen to the sea' is an entity word of the music category, whereas the user's actual intention corresponds to 'listen to | the sound of the sea crying/(music)'.
Similarly, when the input to be recognized is 'I want to listen to heart sunny', the segmenter outputs 'I want to listen to heart/(music) | sunny', whereas the user's actual intention corresponds to 'I want to listen to | heart sunny/(music)'.
Therefore, the entity recognition schemes of the related art suffer from semantic ambiguity caused by the way the text is segmented.
As the above analysis shows, in the entity recognition process the word segmenter is used only to determine the target entity words; how background words and out-of-vocabulary words are segmented is not of concern. Entity words are typically nouns corresponding to proper names, such as person names, organization names, place names, song titles, and book titles. Background words are the remaining auxiliary words, for example 'I want to listen to' in 'I want to listen to heart sunny' above.
According to the named entity recognition and linking method provided by the embodiments of the present disclosure, when the text to be recognized is segmented, setting probability values that are nonlinearly related to word length for the entity words and the background words in the segmenter dictionary improves the accuracy with which entity words are cut out and eliminates ambiguity introduced by segmentation.
For ease of understanding, several terms referred to in this disclosure are first explained below.
Natural language processing (Natural Language Processing, NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between people and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, i.e., the language people use daily, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, named entity recognition and linking, knowledge graphs, and the like.
Named entity recognition (Named Entity Recognition, NER for short), abbreviated as "entity recognition" and also known as "proper name recognition," refers to recognizing entities with specific meaning in text, mainly including person names, place names, organization names, proper nouns, and the like. Named entity recognition is usually the first step of knowledge mining and information extraction, and is widely applied in the field of natural language processing.
Named entity linking (Named Entity Linking, NEL), abbreviated as "entity linking," refers to the process of linking an identified named entity to a disambiguated entity in a knowledge base. The technology can improve the information filtering capability of actual applications such as an online recommendation system, an Internet search engine and the like.
The scheme provided by the embodiment of the disclosure relates to entity identification and entity linking technology, and is specifically described by the following embodiment.
First, each step of the named entity recognition and linking method provided in the exemplary embodiments of the present disclosure will be described in more detail with reference to the accompanying drawings and examples.
FIG. 1 illustrates a flow chart of a named entity recognition and linking method in an embodiment of the present disclosure. The method provided by the embodiments of the present disclosure may be performed by any electronic device having computing processing capabilities.
As shown in fig. 1, the named entity recognition and linking method 10 includes:
in step S102, a text to be recognized including a proper name is acquired.
The text to be recognized may come from the user's voice input, for example captured by software such as a voice assistant and converted into the text to be recognized through speech recognition.
Alternatively, the text to be recognized may be entered by a user through a user interface provided by the client software.
The text to be recognized contains a proper name; for example, the user inputs 'I want to listen to heart sunny', where 'heart sunny' is the proper name of a song.
In step S104, the text to be recognized is segmented based on the pre-constructed dictionary, and the target entity word corresponding to the proper noun is split out.
The pre-constructed dictionary contains entity words and background words. Entity words are, for example, nouns corresponding to proper names, such as person names, organization names, place names, song titles, and book titles. Background words are the other auxiliary words, such as 'I want to listen to' in 'I want to listen to heart sunny' above.
In some embodiments, the background words may be composed of sentence fragments of high-frequency query sentences and/or of high-frequency query words. For example, high-frequency query sentences and/or high-frequency query words of online users may be collected. After the high-frequency query sentences are collected, sentence fragments can be split out of them using an N-Gram model; fragments that can form valid phrases become background words. Taking a collected query sentence such as 'are you on vacation today' with N=2 as an example, the sentence is split into its overlapping two-character fragments.
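As a rough illustration of this step, the following Python sketch splits a collected query into its overlapping N-token fragments; the sample query, the token-level granularity (the patent operates on Chinese characters), and the choice of N=2 are assumptions made only for illustration.

```python
def ngram_fragments(tokens, n=2):
    """Return the overlapping n-token fragments of a query.

    In the patent's setting the tokens are Chinese characters; English
    words are used here only to keep the toy example readable.
    """
    if len(tokens) < n:
        return [tuple(tokens)]
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical high-frequency query collected from online users.
query = ["are", "you", "on", "vacation", "today"]
print(ngram_fragments(query, n=2))
# [('are', 'you'), ('you', 'on'), ('on', 'vacation'), ('vacation', 'today')]
```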
The entity words may be constituted, for example, by entity dictionaries collected for the actual intentions and skills (e.g., a playback skill or a reading skill).
As described above, in the entity recognition process the purpose of segmenting with the word segmenter is to determine the target entity words; how background words and out-of-vocabulary words are segmented is not of concern. Therefore, during segmentation the probability of an entity word is expected to be larger than that of a background word, which guarantees that cutting out entity words takes precedence over cutting out background words.
Also, for compound entities, the longest match is desired. For example, 'birthday' and 'happy' are both entity words, and 'happy birthday' is also an entity word, namely a compound entity word consisting of 'birthday' and 'happy'. When recognizing the entity word, it is desirable to cut out 'happy birthday' rather than cutting out 'birthday' and 'happy' separately; that is, the probability of 'happy birthday' should be greater than the sum of the probabilities of 'birthday' and 'happy'.
In addition, when a single word can form a new background word together with other background words, it is desirable that the combined background word be recognized. For example, 'listen' is a single word and 'I want' is a background word. When recognizing entity words, it is desirable to split out 'I want to listen' rather than splitting out 'I want' and 'listen' separately.
After studying a large number of segmentation scenarios such as those described above, the inventors found that when the probability of each entity word is set to be nonlinearly related to its length and the probability of each background word is set to be nonlinearly related to its length, the accuracy of the entity words cut out by segmentation based on these probabilities is greatly improved.
In some embodiments, for example, the probability of each entity word may be set to the square of its length, and the probability of each background word may likewise be set to the square of its length. For example, in the compound-entity segmentation problem described above, the length of 'birthday' is 2 and its probability is P(birthday) = 2² = 4; the length of 'happy' is 2 and its probability is P(happy) = 2² = 4; while the length of 'happy birthday' is 4 and its probability is P(happy birthday) = 4² = 16. Since P(happy birthday) > P(birthday) + P(happy), 'happy birthday' is cut out preferentially when segmenting on the probabilities of the entity words. For another example, when splitting 'I want to listen', P(I want to listen) > P(I want) + P(listen), so 'I want to listen' is split out preferentially.
Furthermore, the probability of each entity word may be set to the square of its length multiplied by a preset threshold, and different entity words may have different thresholds. For example, when the same word 'play' appears both as a background word and as an entity word, the segmentation priority of the entity reading can be raised by multiplying its probability by a threshold greater than 1, so that P(play as entity word) > P(play as background word).
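A minimal sketch of the probability assignment described above, assuming that lengths are counted in characters of the original words and using an illustrative threshold value of 1.5 (the description leaves the concrete threshold open):

```python
def entity_probability(length, threshold=1.0):
    """P(entity word) = length^2 * per-word threshold."""
    return length ** 2 * threshold


def background_probability(length):
    """P(background word) = length^2."""
    return length ** 2


# Compound entity: "happy birthday" (4 characters in the original Chinese)
# beats "birthday" (2) plus "happy" (2), because 16 > 4 + 4.
assert entity_probability(4) > entity_probability(2) + entity_probability(2)

# A threshold greater than 1 lifts an entity reading above a background
# reading of the same length (1.5 is an assumed value).
assert entity_probability(2, threshold=1.5) > background_probability(2)
```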
In step S106, the type of the target entity word is determined based on the labeling information of the target entity word.
The entity words in the dictionary also have labeling information for labeling the types corresponding to the entity words. An entity word may correspond to one or more (e.g., two, three, etc.) types.
Based on the label information of the cut target entity words, the type of the target entity words can be determined.
Further, in some embodiments, the named entity recognition and linking method 10 may further include: in step S108, linking the target entity word to a named entity of a preset knowledge base according to the type of the target entity word, thereby embedding the recognized named entity into the existing knowledge base.
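A minimal sketch of step S108, assuming the preset knowledge base is keyed first by entity type; the knowledge-base layout, entry identifiers, and helper name are hypothetical:

```python
# Hypothetical preset knowledge base: entity type -> {entity word -> entry id}.
KNOWLEDGE_BASE = {
    "music": {
        "heart sunny": "kb:music/heart-sunny",
        "the sound of the sea crying": "kb:music/sea-crying",
    },
}


def link_entity(entity_word, entity_type):
    """Link a recognized target entity word to the named entity of the
    preset knowledge base according to its type; returns None if absent."""
    return KNOWLEDGE_BASE.get(entity_type, {}).get(entity_word)


print(link_entity("heart sunny", "music"))  # kb:music/heart-sunny
```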
According to the named entity recognition and linking method provided by the embodiments of the present disclosure, when the text to be recognized is segmented, setting probability values that are nonlinearly related to word length for the entity words and the background words in the segmenter dictionary improves the accuracy with which entity words are cut out and eliminates ambiguity introduced by segmentation. In addition, the method determines the type of an entity word directly from the labeling information of the recognized entity word, so entity recognition and entity linking can be performed synchronously. This avoids the resolution errors caused by ambiguity between the recognition step and the linking step when entity recognition and entity linking are performed separately, as in the related art, such as 'play | one like summer and one like autumn' being erroneously interpreted as 'play one | like summer and one like autumn' and producing a link error.
FIG. 2 illustrates a flow chart of another named entity recognition and linking method in an embodiment of the present disclosure. Unlike the named entity recognition and linking method 10 shown in FIG. 1, the method shown in FIG. 2 further provides an exemplary embodiment of how the text to be recognized is segmented based on the pre-constructed dictionary and how the target entity word corresponding to the proper noun is split out, that is, a specific implementation of step S104.
Referring to fig. 2, step S104 includes:
in step S1042, a directed acyclic graph of text to be recognized is generated based on the dictionary-constructed prefix tree (Trie tree).
The purpose of the prefix tree is to organize character strings into a tree. Fig. 3 is a schematic diagram of a prefix tree shown according to an example. The prefix tree in Fig. 3 is composed of five Chinese words, including 'Tsinghua University', 'refreshing', 'China', and 'Hualian'. Each square in the tree represents a node, where 'Root' represents the root node and does not represent any character. '1' to '5' respectively denote leaf nodes. Every node except the root node contains exactly one character. From the root node to a leaf node, the characters along the path are connected to form a word. The number in a leaf node indicates which chain of the dictionary tree the word belongs to (there are as many chains as there are words in the dictionary); chains that share a common prefix are called a string.
In some embodiments, the dictionary may further include substrings of the entity words and the background words, and these substrings are used for constructing the prefix tree of the dictionary.
Based on the prefix tree constructed by the dictionary, efficient word graph scanning can be realized, and further, a Directed Acyclic Graph (DAG) of the text to be recognized is generated.
Taking the above-mentioned 'I want to listen to heart sunny' as an example, it contains 5 characters, numbered 0 to 4, and its DAG = {0: [0,1,2], 1: [1], 2: [2,3], 3: [3,4], 4: [4]}. That is, starting from 'I', the text can be cut as 'I', 'I want', or 'I want to listen'; starting from 'want', it can only be cut as 'want'; starting from 'listen', it can be cut as 'listen' or 'listen to heart'; starting from 'heart', it can be cut as 'heart' or 'heart sunny'; and starting from 'sunny', it can be cut as 'sunny'.
Similarly, the DAG of 'listen to the sound of the sea crying' is {0: [0,1], 1: [1,2,3,4,5], 2: [2], 3: [3], 4: [4,5], 5: [5]}.
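The following Python sketch shows one way the prefix tree and the DAG of step S1042 could be built; the toy dictionary uses English tokens in place of Chinese characters and is an assumption made only to mirror the shape of the examples above, not the patent's actual dictionary.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # token -> TrieNode
        self.is_word = False  # True if a dictionary word ends at this node


def build_trie(words):
    """Build a prefix tree from the dictionary words (and their substrings, if stored)."""
    root = TrieNode()
    for word in words:
        node = root
        for token in word:
            node = node.children.setdefault(token, TrieNode())
        node.is_word = True
    return root


def build_dag(text, root):
    """DAG[i] lists every index j such that text[i..j] is a dictionary word,
    or just [i] when no dictionary word starts at position i."""
    dag = {}
    for i in range(len(text)):
        ends, node = [], root
        for j in range(i, len(text)):
            node = node.children.get(text[j])
            if node is None:
                break
            if node.is_word:
                ends.append(j)
        dag[i] = ends or [i]
    return dag


# Toy dictionary standing in for background words and entity words (assumed).
trie = build_trie([
    ("i", "want", "to", "listen", "to"),   # background word
    ("heart",),                            # entity word
    ("heart", "sunny"),                    # compound entity word
])
text = ["i", "want", "to", "listen", "to", "heart", "sunny"]
print(build_dag(text, trie))
# {0: [4], 1: [1], 2: [2], 3: [3], 4: [4], 5: [5, 6], 6: [6]}
```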
In step S1044, a maximum probability path is found based on the probability of each entity word and the probability of each background word in the dictionary.
Taking the maximum-probability-path calculation for 'I want to listen to heart sunny' as an example, the probabilities of its segmentation paths are evaluated as follows.
The segmentation path with the highest probability that is found is {4: (1, 4), 3: (12, 4), 2: (13, 3), 1: (14, 1), 0: (39, 2)}, where the number before ':' is the character index, the first position in the parentheses is the probability of the segmentation path, and the second position is the index of the character at which the cut ends.
Similarly, the maximum-probability segmentation path found for 'listen to the sound of the sea crying' is {5: (1, 5), 4: (4, 5), 3: (5, 3), 2: (6, 2), 1: (75, 5), 0: (76, 0)}.
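Continuing the toy example from the sketch above, a minimal dynamic-programming search for the maximum-probability path could look as follows; the probability function reuses the length-squared rule with an assumed threshold, the route layout {start index: (score, end index of the best word)} mirrors the example in the description, and the concrete scores naturally differ from those produced by the patent's own dictionary.

```python
def word_probability(text, start, end, entity_set, threshold=1.5):
    """Length-squared probability, with an assumed extra factor for entity words."""
    length = end - start + 1
    word = tuple(text[start:end + 1])
    return length ** 2 * (threshold if word in entity_set else 1.0)


def max_probability_path(text, dag, entity_set):
    """route[i] = (best total probability of segmenting text[i:],
    end index of the best word starting at i); filled from right to left."""
    n = len(text)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (word_probability(text, i, j, entity_set) + route[j + 1][0], j)
            for j in dag[i]
        )
    return route


def segment(text, route):
    """Read the maximum-probability path out of route to get the segmentation."""
    words, i = [], 0
    while i < len(text):
        j = route[i][1]
        words.append(tuple(text[i:j + 1]))
        i = j + 1
    return words


# Reusing text, trie, and build_dag from the previous sketch.
entity_set = {("heart",), ("heart", "sunny")}
route = max_probability_path(text, build_dag(text, trie), entity_set)
print(segment(text, route))
# [('i', 'want', 'to', 'listen', 'to'), ('heart', 'sunny')]
```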
In step S1046, it is determined that the word segmentation result of the maximum probability path is the word segmentation result of the text to be recognized.
From the maximum probability path, the word segmentation result can be determined to be 'I want to listen to | heart sunny'.
Similarly, the word segmentation result of 'listen to the sound of the sea crying' is determined to be 'listen to | the sound of the sea crying'.
In step S1048, the entity word in the word segmentation result is determined as the target entity word corresponding to the proper name.
The entity word in 'I want to listen to | heart sunny' is 'heart sunny', and 'heart sunny' is determined to be the target entity word corresponding to the proper name in the text to be recognized.
The entity word in 'listen to | the sound of the sea crying' is 'the sound of the sea crying', which is determined to be the target entity word corresponding to the proper name in the text to be recognized.
Based on the above scheme, the accuracy of entity word recognition can be improved, and the time consumed can be reduced to the microsecond level.
The following are device embodiments of the present disclosure that may be used to perform method embodiments of the present disclosure. For details not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the method of the present disclosure.
FIG. 4 illustrates a block diagram of a named entity recognition and linking device in an embodiment of the present disclosure. Referring to fig. 4, the named entity recognition and linking means 20 includes: a text acquisition module 202, a text word segmentation module 204, and a type determination module 206.
The text obtaining module 202 is configured to obtain a text to be identified including a proper name.
The text word segmentation module 204 is configured to segment the text to be recognized based on a pre-constructed dictionary and split out the target entity word corresponding to the proper noun.
The type determining module 206 is configured to determine a type of the target entity word based on the labeling information of the target entity word;
the dictionary comprises entity words and background words, the probability of the entity words is in nonlinear correlation with the length of the entity words, the probability of the background words is in nonlinear correlation with the length of the background words, and the probability of the entity words and the probability of the background words are used for word segmentation of texts to be recognized.
In some embodiments, text segmentation module 204 includes: the system comprises a directed acyclic graph generating unit, a path searching unit, a word segmentation result determining unit and a target entity word determining unit. The directed acyclic graph generating unit is used for generating a directed acyclic graph of the text to be identified based on the prefix tree constructed by the dictionary. The path searching unit is used for searching the maximum probability path based on the probability of each entity word and the probability of each background word in the dictionary. The word segmentation result determining unit is used for determining that the word segmentation result of the maximum probability path is the word segmentation result of the text to be recognized. The target entity word determining unit is used for determining entity words in the word segmentation result as target entity words corresponding to the proprietary names.
In some embodiments, the probability of an entity word is the square of the length of the entity word multiplied by a preset threshold, and the probability of a background word is the square of the length of the background word.
In some embodiments, the background words are composed of sentence fragments in the high frequency query sentence and/or the high frequency query words.
In some embodiments, the statement fragments are determined from the high frequency query statement based on an N-Gram model.
In some embodiments, the dictionary further comprises: the substring of the entity word and the background word is used for constructing a prefix tree of the dictionary.
In some embodiments, named entity recognition and linking device 20 further comprises: and the entity linking module 208 is configured to link the target entity word to a named entity in a preset knowledge base according to the type of the target entity word.
According to the named entity recognition and linking device provided by the embodiments of the present disclosure, when the text to be recognized is segmented, setting probability values that are nonlinearly related to word length for the entity words and the background words in the segmenter dictionary improves the accuracy with which entity words are cut out and eliminates ambiguity introduced by segmentation. In addition, the device determines the type of an entity word directly from the labeling information of the recognized entity word, so entity recognition and entity linking can be performed synchronously. This avoids the resolution errors caused by ambiguity between the recognition step and the linking step when entity recognition and entity linking are performed separately, as in the related art.
Those skilled in the art will appreciate that the various aspects of the present disclosure may be implemented as a system, method, or program product. Accordingly, various aspects of the disclosure may be embodied in the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein as a "circuit," "module," or "system."
An electronic device 800 according to such an embodiment of the present disclosure is described below with reference to fig. 5. The electronic device 800 shown in fig. 5 is merely an example and should not be construed to limit the functionality and scope of use of embodiments of the present disclosure in any way.
As shown in fig. 5, the electronic device 800 is embodied in the form of a general purpose computing device. Components of electronic device 800 may include, but are not limited to: the at least one processing unit 810, the at least one memory unit 820, and a bus 830 connecting the various system components, including the memory unit 820 and the processing unit 810.
Wherein the storage unit stores program code that is executable by the processing unit 810 such that the processing unit 810 performs steps according to various exemplary embodiments of the present disclosure described in the above section of the present specification. For example, the processing unit 810 may perform step S102 shown in fig. 1, to obtain a text to be recognized containing a proper name; step S104, based on a pre-constructed dictionary, word segmentation is carried out on the text to be recognized, and target entity words corresponding to proper nouns are split; step S106, determining the type of the target entity word based on the labeling information of the target entity word.
The storage unit 820 may include readable media in the form of volatile storage units, such as Random Access Memory (RAM) 8201 and/or cache memory 8202, and may further include Read Only Memory (ROM) 8203.
Storage unit 820 may also include a program/utility 8204 having a set (at least one) of program modules 8205, such program modules 8205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 830 may be one or more of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 800 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, Bluetooth device, etc.), one or more devices that enable a user to interact with the electronic device 800, and/or any device (e.g., router, modem, etc.) that enables the electronic device 800 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 650. Also, the electronic device 800 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through the network adapter 860. As shown, the network adapter 860 communicates with other modules of the electronic device 800 over the bus 830. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with the electronic device 800, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a terminal device, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, a computer-readable storage medium having stored thereon a program product capable of implementing the method described above in the present specification is also provided. In some possible implementations, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the disclosure as described in the "exemplary methods" section of this specification, when the program product is run on the terminal device.
Referring to fig. 6, a program product 900 for implementing the above-described method according to an embodiment of the present disclosure is described, which may employ a portable compact disc read-only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Furthermore, although the steps of the methods in the present disclosure are depicted in a particular order in the drawings, this does not require or imply that the steps must be performed in that particular order or that all illustrated steps be performed in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a mobile terminal, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any adaptations, uses, or adaptations of the disclosure following the general principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (10)

1. A named entity recognition and linking method, comprising:
acquiring a text to be identified containing a proper name;
segmenting the text to be recognized into words based on a pre-constructed dictionary, and splitting out the target entity word corresponding to the proper name; and
determining the type of the target entity word based on the labeling information of the target entity word;
the dictionary comprises entity words and background words, the probability of the entity words is in nonlinear correlation with the length of the entity words, the probability of the background words is in nonlinear correlation with the length of the background words, the probability of the entity words and the probability of the background words are used for word segmentation of the text to be recognized, and the probability of the entity words is larger than that of the background words.
2. The method of claim 1, wherein splitting the target entity word corresponding to the proper noun out of the text word to be recognized based on a pre-constructed dictionary comprises:
generating a directed acyclic graph of the text to be identified based on the prefix tree constructed by the dictionary;
searching a maximum probability path based on the probability of each entity word and the probability of each background word in the dictionary;
determining the word segmentation result of the maximum probability path as the word segmentation result of the text to be identified; and
and determining entity words in the word segmentation result as the target entity words corresponding to the proper names.
3. The method according to claim 1 or 2, wherein the probability of the entity word is the square of the length of the entity word multiplied by a preset threshold, and the probability of the background word is the square of the length of the background word.
4. The method according to claim 1 or 2, characterized in that the background word consists of sentence fragments in high frequency query sentences and/or high frequency query words.
5. The method of claim 4, wherein the statement fragments are determined from the high frequency query statement based on an N-Gram model.
6. The method of claim 2, wherein the dictionary further comprises: and the substring of the entity word and the background word is used for constructing a prefix tree of the dictionary.
7. The method according to claim 1 or 2, further comprising:
and according to the type of the target entity word, linking the target entity word into a named entity of a preset knowledge base.
8. A named entity recognition and linking device, comprising:
the text acquisition module is used for acquiring a text to be identified containing a proper name;
the text word segmentation module is used for segmenting the text to be recognized based on a pre-constructed dictionary, and splitting out target entity words corresponding to the proper names; and
the type determining module is used for determining the type of the target entity word based on the labeling information of the target entity word;
the dictionary comprises entity words and background words, the probability of the entity words is in nonlinear correlation with the length of the entity words, the probability of the background words is in nonlinear correlation with the length of the background words, the probability of the entity words and the probability of the background words are used for word segmentation of the text to be recognized, and the probability of the entity words is larger than that of the background words.
9. An electronic device, comprising: memory, a processor and executable instructions stored in the memory and executable in the processor, wherein the processor implements the method of any of claims 1-7 when executing the executable instructions.
10. A computer readable storage medium having stored thereon computer executable instructions which when executed by a processor implement the method of any of claims 1-7.
CN201911318901.4A 2019-12-19 2019-12-19 Named entity recognition and linking method, device, equipment and readable storage medium Active CN111178076B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911318901.4A CN111178076B (en) 2019-12-19 2019-12-19 Named entity recognition and linking method, device, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911318901.4A CN111178076B (en) 2019-12-19 2019-12-19 Named entity recognition and linking method, device, equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN111178076A CN111178076A (en) 2020-05-19
CN111178076B true CN111178076B (en) 2023-08-08

Family

ID=70657555

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911318901.4A Active CN111178076B (en) 2019-12-19 2019-12-19 Named entity recognition and linking method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111178076B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111597324B (en) * 2020-05-20 2023-10-03 北京搜狗科技发展有限公司 Text query method and device
CN111881669B (en) * 2020-06-24 2023-06-09 百度在线网络技术(北京)有限公司 Synonymous text acquisition method and device, electronic equipment and storage medium
CN111950288B (en) * 2020-08-25 2024-02-23 海信视像科技股份有限公司 Entity labeling method in named entity recognition and intelligent device
CN113220835B (en) * 2021-05-08 2023-09-29 北京百度网讯科技有限公司 Text information processing method, device, electronic equipment and storage medium
CN113282689B (en) * 2021-07-22 2023-02-03 药渡经纬信息科技(北京)有限公司 Retrieval method and device based on domain knowledge graph

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104199972A (en) * 2013-09-22 2014-12-10 中科嘉速(北京)并行软件有限公司 Named entity relation extraction and construction method based on deep learning
CN105938495A (en) * 2016-04-29 2016-09-14 乐视控股(北京)有限公司 Entity relationship recognition method and apparatus
CN106503192A (en) * 2016-10-31 2017-03-15 北京百度网讯科技有限公司 Name entity recognition method and device based on artificial intelligence
CN106547733A (en) * 2016-10-19 2017-03-29 中国国防科技信息中心 A kind of name entity recognition method towards particular text
CN107577667A (en) * 2017-09-14 2018-01-12 北京奇艺世纪科技有限公司 A kind of entity word treating method and apparatus
CN109271631A (en) * 2018-09-12 2019-01-25 广州多益网络股份有限公司 Segmenting method, device, equipment and storage medium
CN109388803A (en) * 2018-10-12 2019-02-26 北京搜狐新动力信息技术有限公司 Chinese word cutting method and system
CN110532390A (en) * 2019-08-26 2019-12-03 南京邮电大学 A kind of news keyword extracting method based on NER and Complex Networks Feature
CN110555206A (en) * 2018-06-01 2019-12-10 中兴通讯股份有限公司 named entity identification method, device, equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9240181B2 (en) * 2013-08-20 2016-01-19 Cisco Technology, Inc. Automatic collection of speaker name pronunciations
US10652592B2 (en) * 2017-07-02 2020-05-12 Comigo Ltd. Named entity disambiguation for providing TV content enrichment

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104199972A (en) * 2013-09-22 2014-12-10 中科嘉速(北京)并行软件有限公司 Named entity relation extraction and construction method based on deep learning
CN105938495A (en) * 2016-04-29 2016-09-14 乐视控股(北京)有限公司 Entity relationship recognition method and apparatus
CN106547733A (en) * 2016-10-19 2017-03-29 中国国防科技信息中心 A kind of name entity recognition method towards particular text
CN106503192A (en) * 2016-10-31 2017-03-15 北京百度网讯科技有限公司 Name entity recognition method and device based on artificial intelligence
CN107577667A (en) * 2017-09-14 2018-01-12 北京奇艺世纪科技有限公司 A kind of entity word treating method and apparatus
CN110555206A (en) * 2018-06-01 2019-12-10 中兴通讯股份有限公司 named entity identification method, device, equipment and storage medium
CN109271631A (en) * 2018-09-12 2019-01-25 广州多益网络股份有限公司 Segmenting method, device, equipment and storage medium
CN109388803A (en) * 2018-10-12 2019-02-26 北京搜狐新动力信息技术有限公司 Chinese word cutting method and system
CN110532390A (en) * 2019-08-26 2019-12-03 南京邮电大学 A kind of news keyword extracting method based on NER and Complex Networks Feature

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"De-identification of Patient Notes with Recurrent Neural Networks";Franck Dernoncourt, Ji Young Lee, Peter Szolovits;《Neuron》;全文 *

Also Published As

Publication number Publication date
CN111178076A (en) 2020-05-19

Similar Documents

Publication Publication Date Title
CN111178076B (en) Named entity recognition and linking method, device, equipment and readable storage medium
US10176804B2 (en) Analyzing textual data
KR101130444B1 (en) System for identifying paraphrases using machine translation techniques
Liu et al. Insertion, deletion, or substitution? Normalizing text messages without pre-categorization nor supervision
US11017301B2 (en) Obtaining and using a distributed representation of concepts as vectors
US10423649B2 (en) Natural question generation from query data using natural language processing system
CN106570180B (en) Voice search method and device based on artificial intelligence
CN110659366A (en) Semantic analysis method and device, electronic equipment and storage medium
CN1742273A (en) Multimodal speech-to-speech language translation and display
CN101334774A (en) Character input method and input method system
KR102041621B1 (en) System for providing artificial intelligence based dialogue type corpus analyze service, and building method therefor
CN114556328A (en) Data processing method and device, electronic equipment and storage medium
CN107656921B (en) Short text dependency analysis method based on deep learning
US10565982B2 (en) Training data optimization in a service computing system for voice enablement of applications
US20190243895A1 (en) Contextual Analogy Representation
TWI752406B (en) Speech recognition method, speech recognition device, electronic equipment, computer-readable storage medium and computer program product
WO2023045186A1 (en) Intention recognition method and apparatus, and electronic device and storage medium
US10133736B2 (en) Contextual analogy resolution
CN111611793B (en) Data processing method, device, equipment and storage medium
KR100372850B1 (en) Apparatus for interpreting and method thereof
CN110705285A (en) Government affair text subject word bank construction method, device, server and readable storage medium
CN112541062B (en) Parallel corpus alignment method and device, storage medium and electronic equipment
CN114595696A (en) Entity disambiguation method, entity disambiguation apparatus, storage medium, and electronic device
CN110276001B (en) Checking page identification method and device, computing equipment and medium
CN108831473B (en) Audio processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant