CN112989235B - Knowledge base-based inner link construction method, device, equipment and storage medium - Google Patents


Info

Publication number
CN112989235B
CN112989235B
Authority
CN
China
Prior art keywords
entity
words
word
chain
inner chain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110258493.9A
Other languages
Chinese (zh)
Other versions
CN112989235A (en)
Inventor
熊壮
雷谦
姚后清
张翔翔
施鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110258493.9A priority Critical patent/CN112989235B/en
Publication of CN112989235A publication Critical patent/CN112989235A/en
Application granted granted Critical
Publication of CN112989235B publication Critical patent/CN112989235B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/955Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9558Details of hyperlinks; Management of linked annotations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/382Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using citations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition

Abstract

The disclosure provides a knowledge base-based inner link construction method, apparatus, device and storage medium, relating to the field of computer technology and in particular to natural language processing. The implementation scheme is as follows: extract entity in-link words from a target text; select candidate knowledge items for the entity in-link words from the knowledge items of a knowledge base; and select, from the candidate knowledge items, a target knowledge item to be in-linked for the entity in-link word according to at least one of the classification features, historical in-link features and context features of the entity in-link word and the candidate knowledge items. Embodiments of the application can improve the efficiency and accuracy of inner link construction in a knowledge base.

Description

Knowledge base-based inner link construction method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for constructing an inner link based on a knowledge base.
Background
With the explosion of internet knowledge, acquiring knowledge information has become ever more convenient, and internet search can largely satisfy users' active information needs. However, as internet products mature, how to mine users' information needs, so as to deliver the value of knowledge more conveniently, is a problem these products must consider.
In the field of text knowledge, an inner link is a link added to an entity word in a web page text that points to another web page, so that while reading the text a user can click the entity word and jump directly to the web page it points to.
How to build inner links for a knowledge product is therefore important.
Disclosure of Invention
The present disclosure provides a knowledge base-based inner link construction method, apparatus, device and storage medium.
According to an aspect of the present disclosure, there is provided a knowledge base-based inner link construction method, including:
extracting entity in-link words from a target text;
selecting candidate knowledge items for the entity in-link words from knowledge items of a knowledge base;
and selecting a target knowledge item to be in-linked for the entity in-link word from the candidate knowledge items according to at least one of the classification features, historical in-link features and context features of the entity in-link word and the candidate knowledge items.
According to another aspect of the present disclosure, there is provided a knowledge base-based inner link construction apparatus, including:
an in-link word extraction module, configured to extract entity in-link words from a target text;
a candidate item selection module, configured to select candidate knowledge items for the entity in-link words from knowledge items of a knowledge base;
and a target item selection module, configured to select a target knowledge item to be in-linked for the entity in-link word from the candidate knowledge items according to at least one of the classification features, historical in-link features and context features of the entity in-link word and the candidate knowledge items.
According to still another aspect of the present disclosure, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the knowledge base-based inner link construction method provided by any embodiment of the present application.
According to yet another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the knowledge base-based inner link construction method provided by any embodiment of the present application.
According to yet another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the knowledge base-based inner link construction method provided by any embodiment of the present application.
According to the technology of the application, the efficiency and accuracy of the inner link construction in the knowledge base can be improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of a knowledge base-based inner link construction method according to an embodiment of the present disclosure;
FIG. 2a is a schematic diagram of another knowledge base-based inner link construction method according to an embodiment of the present disclosure;
FIG. 2b is a schematic diagram of a method for screening candidate knowledge items according to an embodiment of the application;
FIG. 3a is a schematic diagram of yet another knowledge base-based inner link construction method according to an embodiment of the present disclosure;
FIG. 3b is a schematic diagram illustrating the extraction of entity in-link words according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a knowledge base-based inner link construction apparatus according to an embodiment of the disclosure;
FIG. 5 is a block diagram of an electronic device for implementing a knowledge base-based inner link construction method according to an embodiment of the disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram of a knowledge base-based inner link construction method according to an embodiment of the present application. The embodiment is applicable to building inner links for entity words in a target text, i.e., pointing entity words in the target text at knowledge items. The method may be performed by a knowledge base-based inner link construction apparatus, which may be implemented in hardware and/or software and configured in an electronic device. Referring to fig. 1, the method specifically includes the following steps:
s110, extracting the entity inner chain words in the target text.
S120, selecting candidate knowledge items of the chain words in the entity from the knowledge items of the knowledge base.
S130, selecting a target knowledge item to be inlined for the entity inlined word from the candidate knowledge items according to at least one of classification characteristics, history inlined characteristics and context characteristics of the entity inlined word and the candidate knowledge items.
In the embodiment of the application, the target text is a web page text in the knowledge base to which inner links to knowledge items are to be added; that is, link relations between entity words in the web page text and other knowledge items are to be constructed, so that while browsing the target text a user can jump to another knowledge item simply by clicking an entity word. The knowledge base may be a knowledge sharing platform, and the knowledge items may be terms (entries).
An entity in-link word is an entity word for which an inner link is to be built. Extraction rules for entity in-link words may be preset; for example, an entity in-link word may be required to be a proper noun that needs explaining to the user, rather than a non-proper noun or a word that carries little information and has no need of explanation. At least one entity in-link word is extracted from the target text based on these extraction rules. Knowledge items in the knowledge base that include an entity in-link word may be taken as that word's candidate knowledge items.
In the embodiment of the application, a classification strategy is formulated to place the entity in-link words and the knowledge items under the same classification system, which reduces the number of candidate items per entity in-link word and improves the efficiency and accuracy of inner link construction. Specifically, a unified classification strategy is used to determine the classification features of the entity in-link words and of the candidate knowledge items respectively, and the candidate knowledge items of each entity in-link word are screened based on these classification features.
The historical in-link features may be determined by counting how frequently entity in-link words in the historical texts of the knowledge base have been linked to each candidate knowledge item. A historical text is a knowledge-item web page text whose inner links have already been built. Suppose a given entity in-link word links to a first candidate knowledge item in some historical texts, to a second candidate knowledge item in others, and to a third in still others; that is, the word has first, second and third candidate knowledge items. Then the frequencies with which the word has been linked to each of the three candidates can be counted separately and used as the word's historical in-link features.
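As a minimal sketch of this counting step, assuming the historical in-links are available as (word, knowledge item) pairs (the function name and sample data are illustrative, not from the patent):

```python
from collections import Counter

def historical_inlink_features(history_links):
    """Count, over the already-built historical texts, how often each
    (entity in-link word, knowledge item) pair was linked, and the share
    that pair has among all historical links of that word."""
    pair_counts = Counter(history_links)
    word_totals = Counter(word for word, _ in history_links)
    return {
        (word, item): (count, count / word_totals[word])
        for (word, item), count in pair_counts.items()
    }

# Illustrative history: "apple" was linked to two different items.
features = historical_inlink_features([
    ("apple", "Apple Inc."),
    ("apple", "Apple Inc."),
    ("apple", "apple (fruit)"),
])
```

Both the absolute frequency and the share are kept, since the later in-link condition may use either.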
In the embodiment of the application, the context features may refer to the relationship between the context words of an entity in-link word in the target text and the context words of the same word in a candidate knowledge item.
Specifically, the candidate knowledge items of an entity in-link word are screened according to at least one of the classification features, historical in-link features and context features of the word and the candidates, yielding the word's target knowledge item, and the entity in-link word in the target text is linked to that target knowledge item. Compared with screening the candidates manually, automatically screening them by these features improves both the efficiency and the accuracy of inner link construction.
According to this technical scheme, the entity in-link words in the target text are extracted, candidate knowledge items are selected for them from the knowledge items of the knowledge base, and the candidates are automatically screened according to at least one of the classification features, historical in-link features and context features of the entity in-link words and the candidate knowledge items, so that both the efficiency and the accuracy of inner link construction can be improved.
Fig. 2a is a schematic flow chart of another knowledge base-based inner link construction method according to an embodiment of the present application. This embodiment is an alternative to the embodiments described above. Referring to fig. 2a, the method includes:
s210, extracting the entity inner chain words in the target text.
S220, selecting candidate knowledge items of the chain words in the entity from the knowledge items of the knowledge base.
And S230, if the classification characteristic of any candidate knowledge item is different from the classification characteristic of the chain word in the entity, filtering the candidate knowledge item to obtain the rest candidate knowledge item.
S240, selecting a target knowledge item to be inlined for the entity inlined word from the rest candidate knowledge items according to the entity inlined word and the historical inlined characteristics and/or the contextual characteristics of the candidate knowledge items.
Candidate knowledge items whose classification features differ from those of the entity in-link word are filtered out, and those with the same classification features are kept; that is, only knowledge items of the same class as the entity in-link word are used to explain it. This further improves in-link accuracy and also improves the efficiency of selecting the target knowledge item.
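A minimal sketch of this classification filter, assuming each candidate carries a single classification label (the representation is an assumption for illustration):

```python
def filter_by_classification(inlink_word_class, candidates):
    """Keep only candidate knowledge items whose classification feature
    matches that of the entity in-link word.

    candidates: list of (item_name, item_class) pairs.
    """
    return [item for item, cls in candidates if cls == inlink_word_class]

# "Zhang San" as a person in-link word keeps only the person-class item.
remaining = filter_by_classification("person", [
    ("Zhang San (actor)", "person"),
    ("Zhang San (song)", "work"),
])
```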
In the embodiment of the application, it can then be determined whether the remaining candidate knowledge item is unique: if it is, the entity in-link word can be treated as unambiguous; if not, the word can be treated as ambiguous.
In an alternative embodiment, selecting a target knowledge item to be in-linked for the entity in-link word from the remaining candidate knowledge items according to the historical in-link features of the entity in-link word and the candidate knowledge items includes: in the case that the candidate knowledge item of the entity in-link word is unique, counting the historical in-link frequency feature of the entity in-link word between the historical texts and that candidate knowledge item; and, in the case that the historical in-link frequency feature meets an in-link condition, taking the remaining candidate knowledge item as the target knowledge item for the entity in-link word.
The in-link condition may be preset and adjusted according to service requirements. For example, the condition may be deemed met when the historical in-link frequency feature is greater than a frequency threshold, and not met otherwise; or it may be deemed met when the historical in-link frequency ratio is greater than a ratio threshold. The historical in-link frequency feature is the number of times the entity in-link word has historically been in-linked to the candidate knowledge item, and the historical in-link frequency ratio is the proportion of the word's total historical in-links that point to that candidate. The unique candidate knowledge item is taken as the target knowledge item when its historical in-link frequency feature meets the in-link condition. It should be noted that when the condition is not met, features such as the historical in-link frequency feature, the context features and the knowledge item content may be combined to judge whether the candidate knowledge item has high confidence, and in the high-confidence case the candidate is taken as the target knowledge item. The advantage of this arrangement is that in-link accuracy can be further improved by only taking high-confidence candidate knowledge items as target knowledge items.
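The count-or-ratio in-link condition above can be sketched as follows; the threshold values are purely illustrative and, as the text notes, would be tuned per service:

```python
def meets_inlink_condition(link_count, word_total, min_count=5, min_ratio=0.8):
    """In-link condition for the single remaining candidate: accept when
    the historical link count exceeds a count threshold, or when the share
    of this word's historical links pointing at the candidate exceeds a
    ratio threshold. Thresholds are assumptions, not values from the patent."""
    ratio = link_count / word_total if word_total else 0.0
    return link_count > min_count or ratio > min_ratio
```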
In an alternative embodiment, selecting a target knowledge item to be in-linked for the entity in-link word from the remaining candidate knowledge items based on the historical in-link features and context features of the entity in-link word and the candidate knowledge items includes: in the case that the entity in-link word has at least two candidate knowledge items, counting the historical in-link frequency features of the entity in-link word between the historical texts and the candidate knowledge items; extracting context keywords of the entity in-link word from the target text, and determining the context features according to the number of co-occurrences of the context keywords between the target text and the candidate knowledge items; and selecting a target knowledge item to be in-linked for the entity in-link word from the candidate knowledge items according to the historical in-link frequency features and/or the context features.
The context keywords are keywords in the target text other than the entity in-link word itself. When the entity in-link word is ambiguous, the confidence of each candidate knowledge item can be determined by combining features such as the historical in-link frequency features, the context features and the knowledge item content; candidates whose confidence exceeds a threshold are taken as target knowledge items and those below it are filtered out, which further improves in-link accuracy.
In an alternative embodiment, the method further includes determining the context keywords of the entity in-link word as follows: taking the vertical category to which the entity in-link word belongs as the target vertical category, and acquiring at least one key attribute of the target vertical category; and taking words in the target text that belong to those key attributes as the context keywords of the entity in-link word.
At least two vertical categories with ambiguity risk may be preset, and at least one key attribute of each vertical category determined, for identifying the context keywords of entity in-link words. For example, vertical categories with ambiguity risk may include persons, regions, and so on; key attributes of the person category may be occupation, name, works, and the like, and key attributes of the region category may be scenic spots, food, culture, and the like.
Specifically, for an entity in-link word in the target text, the vertical category to which it belongs is determined as the target vertical category, and the key attributes of that category are acquired; words belonging to those key attributes are extracted from the target text as the word's context keywords; the number of co-occurrences of the context keywords between the target text and each candidate knowledge item is counted, and the confidence of the candidate is determined from the co-occurrence count. By introducing the context keywords of the entity in-link word and determining candidate confidence from the context features of those keywords in the target text and the candidate knowledge item, the accuracy of the target knowledge item can be further improved.
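A minimal sketch of turning keyword co-occurrence into a confidence score; normalising by the keyword count is an assumption, since the patent only says confidence is determined from co-occurrences:

```python
def context_confidence(context_keywords, item_text):
    """Score a candidate knowledge item by how many of the target text's
    context keywords also occur in the item's text, normalised to [0, 1].
    A substring check stands in for real keyword matching here."""
    if not context_keywords:
        return 0.0
    hits = sum(1 for kw in context_keywords if kw in item_text)
    return hits / len(context_keywords)
```

A candidate whose text shares the person-vertical keywords (occupation, works, and so on) with the target text would score higher than one that shares none.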
Fig. 2b is a schematic diagram of a method for screening candidate knowledge items according to an embodiment of the application. Referring to fig. 2b, the classification feature, context feature and historical in-link feature of the entity in-link word are determined, and likewise for each candidate knowledge item. The candidate knowledge items are screened against the classification feature of the entity in-link word under a classification verification strategy, yielding the remaining candidates. Whether the entity in-link word is unambiguous is determined from the number of remaining candidates. In the unambiguous case, whether the candidate meets the high-confidence requirement is determined under the unambiguous mounting strategy, and if it does, the target knowledge item is output. In the ambiguous case, or where the unambiguous mounting strategy finds the candidate does not meet the high-confidence requirement, whether a candidate meets the requirement may be determined under the polysemous mounting strategy, and the target knowledge item output according to its result. The entity in-link word in the target text can then be linked to the target knowledge item, i.e., mounted on the target knowledge item.
The unambiguous mounting strategy may judge whether a candidate knowledge item is high-confidence using the historical in-link features; the polysemous mounting strategy may judge this by combining the context features, the historical in-link features, the knowledge item content, and other features.
According to the technical scheme of this embodiment, taking into account the influence of the entity in-link word's classification on the in-link strategy, high-accuracy inner link construction is achieved by using the statistical prior information of existing inner links in the knowledge base and considering context features per vertical category.
Fig. 3a is a schematic flow chart of yet another knowledge base-based inner link construction method according to an embodiment of the application. This embodiment is an alternative to the embodiments described above. Referring to fig. 3a, the method includes:
S310, based on natural language processing, performing word segmentation, part-of-speech tagging and entity recognition on the target text to obtain the base words in the target text and the part-of-speech and entity feature information of each base word.
S320, combining the base words according to their part-of-speech and entity feature information to obtain potential in-link words.
S330, screening the potential in-link words to obtain the entity in-link words.
S340, selecting candidate knowledge items for the entity in-link words from the knowledge items of the knowledge base.
S350, selecting a target knowledge item to be in-linked for the entity in-link word from the candidate knowledge items according to at least one of the classification features, historical in-link features and context features of the entity in-link word and the candidate knowledge items.
In the potential in-link word extraction stage, natural language processing (NLP) techniques can be used to segment the target text, tag parts of speech and recognize entities (i.e., identify proper nouns), obtaining the base words of the target text and each base word's part-of-speech and entity feature information. If a base word is not an entity word, its entity feature information may be empty; if it is an entity word, the entity feature information may be the word's entity classification, for example music figure, film and television figure, work, and so on.
In the embodiment of the application, the base words are combined according to their part-of-speech and entity feature information to obtain the potential in-link words; that is, the potential in-link words are obtained by concatenating the base words. Unlike directly using the segmentation result as the set of potential in-link words, concatenating base words, i.e., merging adjacent base words, as minimal word units, into longer words in sequence, effectively mitigates segmentation errors and the overly fine granularity produced by segmentation. During concatenation, nouns may be merged with nouns, and proper nouns with proper nouns. Taking the three base words "X", "Limited" and "Company" as an example, the potential in-link words "X Limited" and "X Limited Company" can be obtained. Before concatenation, base words whose entity feature information is empty may be removed, further improving the accuracy of the potential in-link words.
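The concatenation step can be sketched as follows; the tag set ("n" for nouns, "nz" for proper nouns) and the window length are assumptions for illustration:

```python
def concatenate_base_words(tagged_words, max_len=3):
    """Merge runs of adjacent noun-like base words into longer potential
    in-link words. tagged_words: list of (word, pos) pairs produced by
    segmentation and POS tagging."""
    noun_tags = {"n", "nz"}  # assumed tag set: noun, proper noun
    candidates = []
    for i in range(len(tagged_words)):
        for j in range(i + 1, min(i + max_len, len(tagged_words)) + 1):
            span = tagged_words[i:j]
            # Only concatenate spans made entirely of noun-like words.
            if all(pos in noun_tags for _, pos in span):
                candidates.append("".join(w for w, _ in span))
    return candidates
```

With the segmented base words 某某 / 有限 / 公司, the spans 某某有限 and 某某有限公司 are produced as potential in-link words alongside the single words.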
In an optional embodiment, screening the potential in-link words to obtain the entity in-link words includes at least one of the following:
screening the potential in-link words according to their word frequency;
performing entity recognition on the potential in-link words based on natural language processing, and screening them according to the entity recognition result;
and matching the potential in-link words against a preset entity blacklist, and screening them according to the matching result.
In the potential in-link word screening stage, high-frequency words can be filtered out by word frequency; for example, a high-frequency word list, built from internet word-frequency statistics, can be used to filter out high-frequency potential in-link words. The meaning of high-frequency words is generally well known and does not need explaining through an inner link, so filtering them reduces the risk of meaningless entities.
Also in the screening stage, entity detection can be performed: potential in-link words that are not entities are filtered out using their part-of-speech and entity feature information; for example, the potential in-link words are processed with NLP techniques, and those recognized as non-entities are filtered out. Proper-noun recognition applied to the potential in-link words themselves is more accurate than proper-noun recognition applied to the whole target text, for two reasons: 1) when recognizing full text, the word granularity is relatively fine; 2) the surrounding context interferes.
In the screening stage, entity quality detection can also be performed against product requirement standards; for example, an entity blacklist may be provided based on the service's online standards, storing meaningless entity words with low-quality proper-noun characteristics, and filtering out potential in-link words that match the blacklist further improves the confidence of the entity in-link words. For example, position words that carry little information, such as "associate professor", may be added to the entity blacklist. In addition, potential in-link words that match no existing knowledge item, i.e., those without candidate knowledge items, can be eliminated.
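The three filters described above (word frequency, blacklist, and existence of a matching knowledge item) can be sketched together; the data structures are assumptions, and the NLP-based proper-noun check is assumed to happen upstream:

```python
def screen_potential_inlink_words(words, word_freq, freq_threshold,
                                  blacklist, knowledge_index):
    """Screen potential in-link words: drop high-frequency words,
    blacklisted low-quality entity words, and words with no matching
    knowledge item in the knowledge base."""
    kept = []
    for w in words:
        if word_freq.get(w, 0) > freq_threshold:
            continue  # widely known meaning; no inner link needed
        if w in blacklist:
            continue  # low-quality proper-noun characteristics
        if w not in knowledge_index:
            continue  # no candidate knowledge item exists
        kept.append(w)
    return kept
```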
Fig. 3b is a schematic diagram of extracting entity in-link words according to an embodiment of the present application. Referring to fig. 3b, the target text is processed with NLP technology to obtain the part-of-speech and entity feature information of the basic words; the basic words are combined to obtain potential in-link words; and the final entity in-link words are obtained through high-frequency-word filtering, NLP-based non-proper-noun filtering, and filtering of low-quality potential in-link words based on low-quality proper-noun characteristics.
According to the above technical scheme, accurate and meaningful entity in-link words are extracted from the target text based on entity extraction technology, and target knowledge items to be inner-linked are selected for the entity in-link words, which improves both the efficiency and the accuracy of inner link construction.
Fig. 4 is a schematic diagram of a knowledge-base-based inner link construction device according to an embodiment of the present application. This embodiment is applicable to constructing inner links for entity words in a target text, that is, to pointing entity words in the target text to knowledge items. The device is configured in an electronic device and can implement the knowledge-base-based inner link construction method according to any embodiment of the present application. The knowledge-base-based inner link construction apparatus 400 specifically includes the following:
the in-link word extraction module 401, configured to extract entity in-link words in the target text;
a candidate item selection module 402, configured to select candidate knowledge items of the entity in-link word from the knowledge items in a knowledge base;
a target item selection module 403, configured to select a target knowledge item to be inner-linked for the entity in-link word from the candidate knowledge items according to at least one of the classification features, historical inner-link features and context features of the entity in-link word and the candidate knowledge items.
In an alternative embodiment, the target entry selection module 403 includes:
the classification screening unit, configured to filter out any candidate knowledge item whose classification feature differs from the classification feature of the entity in-link word;
and the target selection unit, configured to select a target knowledge item to be inner-linked for the entity in-link word from the remaining candidate knowledge items according to the historical inner-link features and/or context features of the entity in-link word and the candidate knowledge items.
In an alternative embodiment, the object selection unit is specifically configured to:
in the case that the candidate knowledge item of the entity in-link word is unique, counting the historical inner-link frequency feature of the entity in-link word between historical texts and the candidate knowledge item;
and in the case that the historical inner-link frequency feature satisfies an inner-link condition, taking the remaining candidate knowledge item as the target knowledge item to be inner-linked for the entity in-link word.
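The unique-candidate case above can be illustrated as follows; the threshold constant and function names are hypothetical stand-ins, since the patent does not specify the concrete inner-link condition:

```python
MIN_HISTORY_COUNT = 2  # assumed threshold standing in for the inner-link condition

def link_unique_candidate(entity_word, item_id, history_counts):
    """Link the single candidate item only if the historical inner-link
    frequency of the (entity word, item) pair satisfies the condition.

    history_counts: {(entity_word, item_id): historical inner-link count}
    Returns the item id to link, or None if the condition is not met.
    """
    count = history_counts.get((entity_word, item_id), 0)
    return item_id if count >= MIN_HISTORY_COUNT else None
```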
In an alternative embodiment, the object selection unit is specifically configured to:
in the case that the entity in-link word has at least two candidate knowledge items, counting the historical inner-link frequency features of the entity in-link word between historical texts and the candidate knowledge items;
extracting context keywords of the entity in-link word from the target text, and determining the context features according to the number of co-occurrences of the context keywords in the target text and the candidate knowledge items;
and selecting a target knowledge item to be inner-linked for the entity in-link word from the candidate knowledge items according to the historical inner-link frequency features and/or the context features.
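A minimal sketch of combining the historical inner-link frequency feature with the context co-occurrence feature to pick a target item. The linear score with weight `alpha`, and all names below, are illustrative assumptions rather than the patent's actual formula:

```python
def select_target_item(entity_word, candidate_items, history_counts,
                       context_keywords, alpha=0.5):
    """Pick the candidate knowledge item with the best combined score.

    candidate_items: {item_id: item text}
    history_counts:  {(entity_word, item_id): historical inner-link count}
    The linear combination below is an illustrative scoring choice.
    """
    best_id, best_score = None, float("-inf")
    for item_id, item_text in candidate_items.items():
        hist = history_counts.get((entity_word, item_id), 0)
        # Context feature: co-occurrence of context keywords with the item.
        cooc = sum(1 for kw in context_keywords if kw in item_text)
        score = alpha * hist + (1 - alpha) * cooc
        if score > best_score:
            best_id, best_score = item_id, score
    return best_id
```

For the word "apple" with items describing a fruit and a company, context keywords such as "iphone" co-occur only with the company item, so the company item would be chosen even before historical counts are considered.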
In an alternative embodiment, the target item selection module 403 is further configured to determine the context keywords of the entity in-link word by:
taking the vertical category to which the entity in-link word belongs as the target vertical, and acquiring at least one key attribute of the target vertical;
and taking the words in the target text that belong to the key attributes as the context keywords of the entity in-link word.
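The vertical-category keyword extraction described above might look like the following sketch; the attribute schema and all helper names are invented for illustration:

```python
# Illustrative vertical-category schema; the attribute names are invented.
VERTICAL_KEY_ATTRIBUTES = {
    "movie": {"director", "actor", "release_date"},
    "person": {"occupation", "birthplace"},
}

def extract_context_keywords(entity_vertical, target_text_words, attribute_of):
    """Return the words of the target text whose attribute is one of the
    key attributes of the vertical the entity in-link word belongs to.

    attribute_of: callable mapping a word to its attribute name (or None).
    """
    key_attrs = VERTICAL_KEY_ATTRIBUTES.get(entity_vertical, set())
    return [w for w in target_text_words if attribute_of(w) in key_attrs]
```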
In an alternative embodiment, the in-link word extraction module 401 includes:
the basic word determination unit, configured to perform word segmentation, part-of-speech tagging and entity recognition on the target text based on natural language processing technology to obtain the basic words in the target text and the part-of-speech and entity feature information of the basic words;
the potential in-link word determination unit, configured to combine the basic words according to their part-of-speech and entity feature information to obtain potential in-link words;
and the potential in-link word screening unit, configured to screen the potential in-link words to obtain the entity in-link words.
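The combination step of the extraction module can be illustrated by a simplified rule that concatenates adjacent entity-flagged basic words into one potential in-link word. This is a stand-in for the patent's feature-driven combination rules; all names and the tuple layout are hypothetical:

```python
def combine_base_words(base_words):
    """base_words: list of (word, pos_tag, is_entity_part) triples produced
    by segmentation, POS tagging and entity recognition.  Adjacent words
    flagged as entity parts are concatenated into one potential in-link
    word (concatenation without spaces suits Chinese text)."""
    potentials, buffer = [], []
    for word, _pos, is_entity_part in base_words:
        if is_entity_part:
            buffer.append(word)
        elif buffer:
            potentials.append("".join(buffer))
            buffer = []
    if buffer:  # flush a trailing entity span
        potentials.append("".join(buffer))
    return potentials
```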
In an alternative embodiment, the potential in-link word screening unit is specifically configured to screen the potential in-link words in at least one of the following ways:
screening the potential in-link words according to their word frequency;
performing entity recognition on the potential in-link words based on natural language processing technology, and screening the potential in-link words according to the entity recognition result;
and matching the potential in-link words against a preset entity blacklist, and screening the potential in-link words according to the matching result.
According to the above technical scheme, accurate and meaningful entity in-link words are extracted from the target text based on entity extraction technology; and, according to the influence of the classification of the entity in-link word on the inner-link strategy, the statistical prior information of inner links in the knowledge base is utilized and the context features are considered per vertical category, so that a target knowledge item to be inner-linked is selected for the entity in-link word, which improves both the efficiency and the accuracy of inner link construction.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 5 illustrates a schematic block diagram of an example electronic device 500 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 5, the apparatus 500 includes a computing unit 501 that can perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 502 or a computer program loaded from a storage unit 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the device 500 can also be stored. The computing unit 501, ROM 502, and RAM 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
Various components in the device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, etc.; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508 such as a magnetic disk, an optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 501 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various specialized artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 501 performs the various methods and processes described above, such as the knowledge-base-based inner link construction method. For example, in some embodiments, the knowledge-base-based inner link construction method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the knowledge-base-based inner link construction method described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the knowledge-base-based inner link construction method in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs, where the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in the cloud computing service system and overcomes the defects of difficult management and weak service scalability in traditional physical-host and VPS (Virtual Private Server) services.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (14)

1. A knowledge-base-based inner link construction method, comprising the following steps:
extracting entity in-link words in a target text;
selecting candidate knowledge items of the entity in-link word from the knowledge items of a knowledge base;
in the case that the entity in-link word has at least two candidate knowledge items, counting the historical inner-link frequency features of the entity in-link word between historical texts and the candidate knowledge items;
extracting context keywords of the entity in-link word from the target text, and determining context features according to the number of co-occurrences of the context keywords in the target text and the candidate knowledge items;
and selecting a target knowledge item to be inner-linked for the entity in-link word from the candidate knowledge items according to the historical inner-link frequency features and the context features.
2. The method of claim 1, further comprising:
if the classification feature of any candidate knowledge item differs from the classification feature of the entity in-link word, filtering out the candidate knowledge item.
3. The method of claim 1, further comprising, after selecting the candidate knowledge items of the entity in-link word from the knowledge items in the knowledge base:
in the case that the candidate knowledge item of the entity in-link word is unique, counting the historical inner-link frequency feature of the entity in-link word between historical texts and the candidate knowledge item;
and in the case that the historical inner-link frequency feature satisfies an inner-link condition, taking the candidate knowledge item as the target knowledge item to be inner-linked for the entity in-link word.
4. The method of claim 1, further comprising determining the context keywords of the entity in-link word by:
taking the vertical category to which the entity in-link word belongs as the target vertical, and acquiring at least one key attribute of the target vertical;
and taking the words in the target text that belong to the key attributes as the context keywords of the entity in-link word.
5. The method of claim 1, wherein extracting the entity in-link words in the target text comprises:
performing word segmentation, part-of-speech tagging and entity recognition on the target text based on natural language processing technology to obtain basic words in the target text and the part-of-speech and entity feature information of the basic words;
combining the basic words according to their part-of-speech and entity feature information to obtain potential in-link words;
and screening the potential in-link words to obtain the entity in-link words.
6. The method of claim 5, wherein screening the potential in-link words comprises at least one of:
screening the potential in-link words according to their word frequency;
performing entity recognition on the potential in-link words based on natural language processing technology, and screening the potential in-link words according to the entity recognition result;
and matching the potential in-link words against a preset entity blacklist, and screening the potential in-link words according to the matching result.
7. A knowledge-base-based inner link construction device, comprising:
the in-link word extraction module, configured to extract entity in-link words in a target text;
a candidate item selection module, configured to select candidate knowledge items of the entity in-link word from the knowledge items of a knowledge base;
and a target item selection module, specifically configured to:
in the case that the entity in-link word has at least two candidate knowledge items, count the historical inner-link frequency features of the entity in-link word between historical texts and the candidate knowledge items;
extract context keywords of the entity in-link word from the target text, and determine context features according to the number of co-occurrences of the context keywords in the target text and the candidate knowledge items;
and select a target knowledge item to be inner-linked for the entity in-link word from the candidate knowledge items according to the historical inner-link frequency features and the context features.
8. The apparatus of claim 7, the target item selection module further comprising a classification screening unit;
the classification screening unit being configured to filter out any candidate knowledge item whose classification feature differs from the classification feature of the entity in-link word.
9. The apparatus of claim 7, wherein the target item selection module is further specifically configured to:
in the case that the candidate knowledge item of the entity in-link word is unique, count the historical inner-link frequency feature of the entity in-link word between historical texts and the candidate knowledge item;
and in the case that the historical inner-link frequency feature satisfies an inner-link condition, take the candidate knowledge item as the target knowledge item to be inner-linked for the entity in-link word.
10. The apparatus of claim 7, the target item selection module further configured to determine the context keywords of the entity in-link word by:
taking the vertical category to which the entity in-link word belongs as the target vertical, and acquiring at least one key attribute of the target vertical;
and taking the words in the target text that belong to the key attributes as the context keywords of the entity in-link word.
11. The apparatus of claim 7, wherein the in-link word extraction module comprises:
the basic word determination unit, configured to perform word segmentation, part-of-speech tagging and entity recognition on the target text based on natural language processing technology to obtain basic words in the target text and the part-of-speech and entity feature information of the basic words;
the potential in-link word determination unit, configured to combine the basic words according to their part-of-speech and entity feature information to obtain potential in-link words;
and the potential in-link word screening unit, configured to screen the potential in-link words to obtain the entity in-link words.
12. The apparatus of claim 11, wherein the potential in-link word screening unit is specifically configured to screen the potential in-link words in at least one of the following ways:
screening the potential in-link words according to their word frequency;
performing entity recognition on the potential in-link words based on natural language processing technology, and screening the potential in-link words according to the entity recognition result;
and matching the potential in-link words against a preset entity blacklist, and screening the potential in-link words according to the matching result.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-6.
CN202110258493.9A 2021-03-09 2021-03-09 Knowledge base-based inner link construction method, device, equipment and storage medium Active CN112989235B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110258493.9A CN112989235B (en) 2021-03-09 2021-03-09 Knowledge base-based inner link construction method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110258493.9A CN112989235B (en) 2021-03-09 2021-03-09 Knowledge base-based inner link construction method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112989235A CN112989235A (en) 2021-06-18
CN112989235B true CN112989235B (en) 2023-08-01

Family

ID=76334694

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110258493.9A Active CN112989235B (en) 2021-03-09 2021-03-09 Knowledge base-based inner link construction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112989235B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113919347B (en) * 2021-12-14 2022-04-05 山东捷瑞数字科技股份有限公司 Method and device for extracting and matching internal link words of text data
CN114647739B (en) * 2022-02-25 2023-02-28 北京百度网讯科技有限公司 Entity chain finger method, device, electronic equipment and storage medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106940702A (en) * 2016-01-05 2017-07-11 富士通株式会社 Entity refers to the method and apparatus with entity in semantic knowledge-base in connection short text
CN107092605A (en) * 2016-02-18 2017-08-25 北大方正集团有限公司 A kind of entity link method and device
CN108415902A (en) * 2018-02-10 2018-08-17 合肥工业大学 A kind of name entity link method based on search engine
CN108959270A (en) * 2018-08-10 2018-12-07 新华智云科技有限公司 A kind of entity link method based on deep learning
CN109446406A (en) * 2018-09-14 2019-03-08 北京搜狗科技发展有限公司 A kind of data processing method, device and the device for data processing
CN110569496A (en) * 2018-06-06 2019-12-13 腾讯科技(深圳)有限公司 Entity linking method, device and storage medium
CN110674317A (en) * 2019-09-30 2020-01-10 北京邮电大学 Entity linking method and device based on graph neural network
CN110991187A (en) * 2019-12-05 2020-04-10 北京奇艺世纪科技有限公司 Entity linking method, device, electronic equipment and medium
CN111428507A (en) * 2020-06-09 2020-07-17 北京百度网讯科技有限公司 Entity chain finger method, device, equipment and storage medium
CN111523326A (en) * 2020-04-23 2020-08-11 北京百度网讯科技有限公司 Entity chain finger method, device, equipment and storage medium
CN111737430A (en) * 2020-06-16 2020-10-02 北京百度网讯科技有限公司 Entity linking method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Analysis of words related to natural language query expressions; 熊文新 (Xiong Wenxin); Library and Information Service (Issue 17); full text *

Also Published As

Publication number Publication date
CN112989235A (en) 2021-06-18

Similar Documents

Publication Publication Date Title
TWI729472B (en) Method, device and server for determining feature words
CN112989235B (en) Knowledge base-based inner link construction method, device, equipment and storage medium
CN113360700B (en) Training of image-text retrieval model, image-text retrieval method, device, equipment and medium
CN113660541B (en) Method and device for generating abstract of news video
CN113128209A (en) Method and device for generating word stock
CN113191145B (en) Keyword processing method and device, electronic equipment and medium
CN114428902B (en) Information searching method, device, electronic equipment and storage medium
CN113033194B (en) Training method, device, equipment and storage medium for semantic representation graph model
CN113963197A (en) Image recognition method and device, electronic equipment and readable storage medium
CN116361591A (en) Content auditing method, device, electronic equipment and computer readable storage medium
CN113590774B (en) Event query method, device and storage medium
CN112860626B (en) Document ordering method and device and electronic equipment
CN114417862A (en) Text matching method, and training method and device of text matching model
CN112528644A (en) Entity mounting method, device, equipment and storage medium
CN113377922B (en) Method, device, electronic equipment and medium for matching information
CN113377921B (en) Method, device, electronic equipment and medium for matching information
CN114422584B (en) Method, device and storage medium for pushing resources
CN113268987B (en) Entity name recognition method and device, electronic equipment and storage medium
CN115392389B (en) Cross-modal information matching and processing method and device, electronic equipment and storage medium
CN112527126B (en) Information acquisition method and device and electronic equipment
CN116244413B (en) New intention determining method, apparatus and storage medium
CN115828915B (en) Entity disambiguation method, device, electronic equipment and storage medium
CN113656393B (en) Data processing method, device, electronic equipment and storage medium
CN113971216B (en) Data processing method and device, electronic equipment and memory
CN117609504A (en) Case classification method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant