CN112989235A - Knowledge base-based internal link construction method, device, equipment and storage medium - Google Patents


Info

Publication number
CN112989235A
CN112989235A (application number CN202110258493.9A; granted as CN112989235B)
Authority
CN
China
Prior art keywords
entity
inlined
words
word
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110258493.9A
Other languages
Chinese (zh)
Other versions
CN112989235B (en)
Inventor
熊壮
雷谦
姚后清
张翔翔
施鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110258493.9A priority Critical patent/CN112989235B/en
Publication of CN112989235A publication Critical patent/CN112989235A/en
Application granted granted Critical
Publication of CN112989235B publication Critical patent/CN112989235B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9558: Details of hyperlinks; Management of linked annotations
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367: Ontology
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/382: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using citations
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition

Abstract

The disclosure provides a knowledge-base-based inlink construction method, apparatus, device, and storage medium, relating to the field of computer technology and, in particular, to natural language processing. The specific implementation scheme is as follows: extracting entity inlink words from a target text; selecting candidate knowledge items for the entity inlink words from the knowledge items of a knowledge base; and selecting, for each entity inlink word, a target knowledge item to be linked from its candidate knowledge items according to at least one of the classification features, historical inlink features, and context features of the entity inlink word and the candidate knowledge items. The method and device can improve the efficiency and accuracy of inlink construction in the knowledge base.

Description

Knowledge base-based internal link construction method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for constructing an internal link based on a knowledge base.
Background
With the explosive growth of internet knowledge, people can acquire knowledge information ever more conveniently, and internet search largely satisfies users' active information needs. However, as internet products mature, how to mine users' information needs and deliver the value of knowledge more conveniently is a problem these products must address.
In the field of text knowledge, an inlink is a link added to an entity word in a webpage text that points to another webpage, so that while reading the text a user can jump directly to the webpage the entity word points to by clicking on it.
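For illustration only (the URL scheme and helper function are hypothetical, not part of the disclosure), constructing an inlink amounts to wrapping the entity word in the page text with an anchor pointing at the target page:

```python
def add_inlink(page_text: str, entity_word: str, target_url: str) -> str:
    """Wrap the first occurrence of an entity word in an HTML anchor tag."""
    anchor = f'<a href="{target_url}">{entity_word}</a>'
    return page_text.replace(entity_word, anchor, 1)

text = "An encyclopedia entry may mention Beijing in passing."
print(add_inlink(text, "Beijing", "/item/Beijing"))
```

Clicking the resulting anchor jumps the reader to the knowledge item the entity word points to.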
How to construct inlinks for knowledge products is therefore very important.
Disclosure of Invention
The present disclosure provides a method, apparatus, device, and storage medium for knowledge-base-based inlink construction.
According to an aspect of the present disclosure, there is provided a knowledge-base-based inlink construction method, including:
extracting entity inlink words from a target text;
selecting candidate knowledge items for the entity inlink words from the knowledge items of a knowledge base; and
selecting, for each entity inlink word, a target knowledge item to be linked from its candidate knowledge items according to at least one of the classification features, historical inlink features, and context features of the entity inlink word and the candidate knowledge items.
According to another aspect of the present disclosure, there is provided a knowledge-base-based inlink construction apparatus, including:
an inlink word extraction module, configured to extract entity inlink words from a target text;
a candidate item selection module, configured to select candidate knowledge items for the entity inlink words from the knowledge items of a knowledge base; and
a target item selection module, configured to select, for each entity inlink word, a target knowledge item to be linked from its candidate knowledge items according to at least one of the classification features, historical inlink features, and context features of the entity inlink word and the candidate knowledge items.
According to still another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the knowledge-base-based inlink construction method provided in any of the embodiments of the present application.
According to still another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the knowledge-base-based inlink construction method provided in any of the embodiments of the present application.
According to yet another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the knowledge-base-based inlink construction method provided in any of the embodiments of the present application.
According to the technology of the present application, the efficiency and accuracy of inlink construction in a knowledge base can be improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of a knowledge-base-based inlink construction method according to an embodiment of the present disclosure;
FIG. 2a is a schematic diagram of another knowledge-base-based inlink construction method according to an embodiment of the present disclosure;
FIG. 2b is a schematic diagram of a method for screening candidate knowledge items according to an embodiment of the present application;
FIG. 3a is a schematic diagram of yet another knowledge-base-based inlink construction method according to an embodiment of the present disclosure;
FIG. 3b is a schematic diagram of entity inlink word extraction according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a knowledge-base-based inlink construction apparatus according to an embodiment of the present disclosure;
FIG. 5 is a block diagram of an electronic device for implementing the knowledge-base-based inlink construction method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram of a knowledge-base-based inlink construction method according to an embodiment of the present application, which is applicable to constructing inlinks for entity words in a target text, that is, to pointing entity words in the target text at knowledge items. The method can be executed by a knowledge-base-based inlink construction apparatus, which can be implemented in hardware and/or software and configured in an electronic device. Referring to fig. 1, the method specifically includes the following steps:
and S110, extracting entity internal chain words in the target text.
And S120, selecting candidate knowledge items of the entity internal chain words from the knowledge items of the knowledge base.
S130, selecting a target knowledge item to be inlined from the candidate knowledge items for the entity inlined word according to at least one of the classification characteristics, the history inlined characteristics and the context characteristics of the entity inlined word and the candidate knowledge items.
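As an illustrative sketch only (the toy knowledge base, function names, and the single-feature selection logic below are simplified assumptions, not the patent's implementation), steps S110 to S130 can be outlined as follows:

```python
# Toy knowledge base: each item has a title and a classification feature.
KNOWLEDGE_BASE = [
    {"title": "Apple", "class": "fruit"},
    {"title": "Apple", "class": "company"},
    {"title": "Beijing", "class": "region"},
]

def extract_entity_inlink_words(text):
    # S110 (placeholder): any knowledge-item title found in the text.
    titles = {item["title"] for item in KNOWLEDGE_BASE}
    return [t for t in titles if t in text]

def select_candidates(word):
    # S120: knowledge items whose title matches the entity inlink word.
    return [item for item in KNOWLEDGE_BASE if item["title"] == word]

def select_target_item(word, candidates, context_class):
    # S130 (simplified): keep only candidates whose classification
    # feature matches the classification inferred from the context.
    remaining = [c for c in candidates if c["class"] == context_class]
    return remaining[0] if len(remaining) == 1 else None

def build_inlinks(text, context_class):
    links = {}
    for word in extract_entity_inlink_words(text):
        target = select_target_item(word, select_candidates(word), context_class)
        if target is not None:
            links[word] = target
    return links

print(build_inlinks("Apple released a new phone.", "company"))
```

A real system would derive `context_class` per word and combine all three feature types, as the later embodiments describe.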
In the embodiment of the present application, the target text is the webpage text of a knowledge item in the knowledge base to which inlinks are to be added; that is, link relations are to be constructed between entity words in that webpage text and other knowledge items, so that a user browsing the target text can jump to another knowledge item simply by clicking an entity word. The knowledge base can be a knowledge-sharing platform, and the knowledge items can be encyclopedia entries.
An entity inlink word is an entity word for which an inlink is to be constructed. Extraction rules for entity inlink words can be preset; for example, an entity inlink word may be required to be a proper noun that needs explanation for the user, rather than a non-proper noun or a word that carries little information and has no analysis demand. At least one entity inlink word is extracted from the target text based on these rules. Knowledge items in the knowledge base that include an entity inlink word can serve as that word's candidate knowledge items.
In the embodiment of the present application, a classification strategy is formulated that brings the entity inlink words and the knowledge items into the same classification system, reducing the number of candidate items per entity inlink word and increasing both the efficiency and the accuracy of inlink construction. Specifically, a unified classification strategy is used to determine the classification features of the entity inlink words and of the candidate knowledge items, and the candidate knowledge items of each entity inlink word are screened based on these classification features.
The historical inlink features can be determined by counting, over the historical texts of the knowledge base, the frequency with which an entity inlink word has been linked to each candidate knowledge item. A historical text is the webpage text of a knowledge item for which inlinks have already been built. For example, suppose a certain entity inlink word links to a first candidate knowledge item in some historical texts, to a second candidate in others, and to a third in still others, i.e., the word has these three candidate knowledge items; the frequencies with which the word has been linked to each of the three candidates in the knowledge base can then be counted separately and used as the word's historical inlink features.
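A minimal sketch of counting the historical inlink frequency features described above; the history data and item names here are fabricated for illustration:

```python
from collections import Counter

# Historical inlinks as (entity word, linked item) pairs (hypothetical data).
HISTORY = [
    ("Apple", "Apple (company)"),
    ("Apple", "Apple (company)"),
    ("Apple", "Apple (fruit)"),
]

def historical_inlink_features(entity_word):
    """Frequency and ratio with which the entity word was historically
    linked to each candidate knowledge item."""
    counts = Counter(item for word, item in HISTORY if word == entity_word)
    total = sum(counts.values())
    return {item: (n, n / total) for item, n in counts.items()}

print(historical_inlink_features("Apple"))
```

Both the raw count and the ratio are kept, since the later embodiments use either a frequency threshold or a ratio threshold as the inlink condition.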
In the embodiment of the present application, the context feature may refer to the relation between the context words of the entity inlink word in the target text and the context words of the entity inlink word in a candidate knowledge item.
Specifically, the candidate knowledge items of an entity inlink word are screened according to at least one of the classification features, historical inlink features, and context features of the word and its candidates to obtain the word's target knowledge item, and the entity inlink word in the target text is then linked to that target knowledge item. Screening the candidates automatically through at least one of these features improves both the efficiency and the accuracy of inlink construction compared with screening them manually.
According to the technical solution of this embodiment, entity inlink words are extracted from the target text, their candidate knowledge items are selected from the knowledge items of the knowledge base, and the candidates are automatically screened according to at least one of the classification features, historical inlink features, and context features of the entity inlink words and the candidate knowledge items, which improves both the efficiency and the accuracy of inlink construction.
Fig. 2a is a schematic flowchart of another knowledge-base-based inlink construction method provided in an embodiment of the present application. The present embodiment is an alternative scheme proposed on the basis of the above embodiments. Referring to fig. 2a, the knowledge-base-based inlink construction method provided by this embodiment includes:
and S210, extracting entity internal chain words in the target text.
S220, selecting candidate knowledge items of the entity internal chain words from the knowledge items of the knowledge base.
And S230, if the classification characteristic of any candidate knowledge item is different from the classification characteristic of the entity internal chain word, filtering the candidate knowledge item to obtain the remaining candidate knowledge items.
S240, selecting a target knowledge item to be inlined from the remaining candidate knowledge items for the entity inlined word according to the entity inlined word and the historical inlined characteristics and/or the context characteristics of the candidate knowledge items.
Candidate knowledge items whose classification feature differs from that of the entity inlink word are filtered out, and those sharing its classification feature are retained; that is, an entity inlink word is explained by knowledge items of the same classification, which further improves inlink accuracy and also improves the efficiency of selecting the target knowledge item.
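The S230 filtering step can be sketched in one comprehension, assuming each candidate carries a `class` field produced by the unified classification strategy (the field name and example data are assumptions):

```python
def filter_by_classification(entity_class, candidates):
    """Keep only candidates sharing the entity inlink word's classification."""
    return [c for c in candidates if c["class"] == entity_class]

candidates = [
    {"title": "Li Bai (poet)", "class": "person"},
    {"title": "Li Bai (song)", "class": "work"},
]
remaining = filter_by_classification("person", candidates)
print(remaining)
```

If the entity inlink word is classified as a person, only the person candidate survives and the ambiguity between the poet and the song is resolved before the later feature checks run.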
In the embodiment of the present application, it can be determined whether the remaining candidate knowledge item is unique: if it is, the entity inlink word can be considered unambiguous; if not, the entity inlink word can be considered ambiguous.
In an alternative embodiment, selecting a target knowledge item to be linked for the entity inlink word from the remaining candidates according to the historical inlink features includes: when the candidate knowledge item of the entity inlink word is unique, counting the historical inlink frequency feature of the word with respect to that candidate over the historical texts; and, when the historical inlink frequency feature meets the inlink condition, taking the remaining candidate knowledge item as the target knowledge item to be linked.
The inlink condition can be preset and can also be adjusted to business requirements. For example, the condition may be deemed met when the historical inlink frequency exceeds a frequency threshold, and not met otherwise; or it may be deemed met when the historical inlink ratio exceeds a ratio threshold. The historical inlink frequency is the number of times the entity inlink word has historically been linked to the candidate knowledge item, and the historical inlink ratio is the ratio of that number to the total number of occurrences of the entity inlink word. When the historical inlink frequency feature meets the inlink condition, the unique candidate is taken as the target knowledge item. It should be noted that when the condition is not met, whether the candidate is highly trusted can still be judged by combining features such as the historical inlink frequency, the context features, and the knowledge item content, and a highly trusted candidate is taken as the target knowledge item. Taking only highly trusted candidates as target knowledge items further improves inlink accuracy.
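The inlink condition described above might be sketched as follows; the threshold values are arbitrary placeholders, not values from the patent:

```python
def meets_inlink_condition(freq, ratio, freq_threshold=5, ratio_threshold=0.8):
    """Inlink condition sketch: met when the historical inlink frequency
    exceeds a count threshold, or when the historical inlink ratio
    exceeds a proportion threshold (both thresholds are made up)."""
    return freq > freq_threshold or ratio > ratio_threshold

# A unique candidate linked 12 times out of 13 historical occurrences:
print(meets_inlink_condition(12, 12 / 13))
# A candidate linked once out of 10 occurrences:
print(meets_inlink_condition(1, 1 / 10))
```

In practice the thresholds would be tuned per business requirement, as the text notes.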
In an alternative embodiment, selecting a target knowledge item to be linked for the entity inlink word from the remaining candidates according to the historical inlink features and the context features includes: when the entity inlink word has at least two candidate knowledge items, counting the historical inlink frequency features of the word with respect to each candidate over the historical texts; extracting context keywords of the entity inlink word from the target text and determining the context features from the number of co-occurrences of those keywords in the target text and in each candidate knowledge item; and selecting a target knowledge item to be linked from the candidates according to the historical inlink frequency features and/or the context features.
Context keywords are keywords in the target text other than the entity inlink word itself. When the entity inlink word is ambiguous, the confidence of each candidate knowledge item can be determined by combining the historical inlink frequency features, the context features, the knowledge item content, and similar features; candidates whose confidence exceeds a threshold are taken as target knowledge items, and those below the threshold are filtered out, further improving inlink accuracy.
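A hedged sketch of combining the features into a per-candidate confidence and filtering by a threshold; the weights, threshold, and example scores are invented for illustration and are not disclosed by the patent:

```python
def candidate_confidence(hist_ratio, ctx_score, content_score,
                         w_hist=0.5, w_ctx=0.3, w_content=0.2):
    """Weighted combination of the three feature scores (weights assumed)."""
    return w_hist * hist_ratio + w_ctx * ctx_score + w_content * content_score

def pick_target(scored_candidates, threshold=0.6):
    """scored_candidates: list of (item_name, confidence) pairs.
    Return the best candidate above the threshold, else None."""
    item, conf = max(scored_candidates, key=lambda c: c[1])
    return item if conf > threshold else None

scores = [
    ("Apple (company)", candidate_confidence(0.8, 0.9, 0.5)),
    ("Apple (fruit)", candidate_confidence(0.2, 0.1, 0.5)),
]
print(pick_target(scores))
```

Returning None when no candidate is confident enough corresponds to building no inlink rather than a wrong one.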
In an alternative embodiment, the method further includes determining the context keywords of the entity inlink word as follows: taking the vertical class to which the entity inlink word belongs as the target vertical class and acquiring at least one key attribute of that vertical class; and taking the words in the target text that belong to those key attributes as the context keywords of the entity inlink word.
At least two vertical classes with ambiguity risk can be preset, and at least one key attribute determined for each, used to identify the context keywords of entity inlink words. For example, the vertical classes with ambiguity risk may be persons, regions, and the like; the key attributes of the person vertical may be position, name, works, and so on, while the key attributes of the region vertical may be scenic spots, food, culture, and so on.
Specifically, for an entity inlink word in the target text, the vertical class to which it belongs is determined as the target vertical class and the key attributes of that class are acquired; words belonging to those key attributes are extracted from the target text as the word's context keywords; the co-occurrences of the context keywords in the target text and in each candidate knowledge item are counted, and the confidence of each candidate is determined from the co-occurrence counts. Introducing the context keywords of the entity inlink word and determining candidate confidence from the context features of those keywords in the target text and the candidate knowledge items further improves the accuracy of the target knowledge item.
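The key-attribute-based context keyword extraction and co-occurrence counting could look roughly like this; the vertical classes, key attributes, and the `attribute_of` lookup are hypothetical stand-ins for whatever attribute tagger a real system uses:

```python
# Hypothetical key attributes per vertical class with ambiguity risk.
KEY_ATTRIBUTES = {
    "person": {"position", "work"},
    "region": {"scenic_spot", "food", "culture"},
}

def context_keywords(text_words, vertical, attribute_of):
    """Words of the target text whose attribute is a key attribute of the
    entity inlink word's vertical class."""
    attrs = KEY_ATTRIBUTES.get(vertical, set())
    return [w for w in text_words if attribute_of.get(w) in attrs]

def cooccurrence_count(keywords, candidate_words):
    """Number of context keywords that also occur in the candidate item."""
    candidate_set = set(candidate_words)
    return sum(1 for k in keywords if k in candidate_set)

# The target text mentions "Li Bai" together with a work title.
text_words = ["Li Bai", "wrote", "Quiet Night Thought"]
attribute_of = {"Quiet Night Thought": "work"}
keys = context_keywords(text_words, "person", attribute_of)
print(keys, cooccurrence_count(keys, ["Li Bai", "poet", "Quiet Night Thought"]))
```

A candidate item that shares the work title with the target text would then score higher than one that does not.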
Fig. 2b is a schematic diagram of a method for screening candidate knowledge items according to an embodiment of the present application. Referring to fig. 2b, the classification feature, context feature, and historical inlink feature of the entity inlink word are determined, as are those of each candidate knowledge item. Based on a classification checking strategy, the candidates are screened against the classification feature of the entity inlink word to obtain the remaining candidates, and whether the entity inlink word is unambiguous is determined from their number. In the unambiguous case, whether the candidate meets the high-confidence requirement is determined by the unambiguous mounting strategy, and if so the target knowledge item is output. In the ambiguous case, or when the unambiguous mounting strategy finds that the candidate does not meet the high-confidence requirement, whether a candidate meets the requirement can be determined by the ambiguous mounting strategy, and the target knowledge item is output according to its result. The entity inlink word in the target text can then be linked, i.e. mounted, to the target knowledge item.
The unambiguous mounting strategy can judge whether the candidate knowledge item is highly trusted using the historical inlink features; the ambiguous mounting strategy can judge this by combining the context features, the historical inlink features, the knowledge item content, and similar features.
According to the technical solution of this embodiment, taking into account the influence of the classification of the entity inlink words on the inlink strategy, highly accurate inlink construction is achieved by using the statistical prior information of existing inlinks in the knowledge base and considering context features per vertical class.
Fig. 3a is a schematic flowchart of yet another knowledge-base-based inlink construction method provided in an embodiment of the present application. The present embodiment is an alternative scheme proposed on the basis of the above embodiments. Referring to fig. 3a, the knowledge-base-based inlink construction method provided by this embodiment includes:
S310, performing word segmentation, part-of-speech tagging, and entity recognition on the target text based on natural language processing, to obtain the basic words in the target text together with their part-of-speech and entity feature information.
S320, combining the basic words according to their part-of-speech and entity feature information to obtain potential inlink words.
S330, screening the potential inlink words to obtain the entity inlink words.
S340, selecting candidate knowledge items for the entity inlink words from the knowledge items of the knowledge base.
S350, selecting, for each entity inlink word, a target knowledge item to be linked from its candidate knowledge items according to at least one of the classification features, historical inlink features, and context features of the entity inlink word and the candidate knowledge items.
In the extraction stage of the potential inlink words, Natural Language Processing (NLP) can be used to perform word segmentation, part-of-speech tagging, and entity recognition (i.e., proper noun recognition) on the target text, obtaining the basic words in the text together with their part-of-speech and entity feature information. When a basic word is not an entity word, its entity feature information may be empty; when it is an entity word, the entity feature information may be the entity classification of the word, such as a music figure, a film and television figure, or a work.
In the embodiment of the present application, the basic words are combined according to their part-of-speech and entity feature information to obtain the potential inlink words, i.e., the potential inlink words are obtained by merging the basic words. Unlike approaches that take the word segmentation result directly as the set of potential inlink words, merging takes each basic word as the minimum unit and combines adjacent basic words into longer phrases, which effectively alleviates segmentation errors and overly fine segmentation granularity. During merging, nouns can be joined with nouns, and proper nouns with proper nouns. Taking the three basic words "a certain", "limited", and "company" as an example, the potential inlink words "a certain limited" and "a certain limited company" can be obtained. Before merging, basic words whose entity feature information is empty can be removed, further improving the accuracy of the potential inlink words.
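The merging step can be sketched as enumerating adjacent combinations of basic words, with each basic word as the minimum unit (`max_len`, the cap on merged units, is an assumption; Chinese words concatenate without spaces, hence the empty join):

```python
def merge_base_words(base_words, max_len=3):
    """Combine adjacent basic words into longer candidate phrases,
    keeping each basic word as the minimum unit."""
    phrases = []
    n = len(base_words)
    for i in range(n):
        for j in range(i + 1, min(i + max_len, n) + 1):
            phrases.append("".join(base_words[i:j]))
    return phrases

# The three basic words "某某" (a certain), "有限" (limited), "公司" (company):
print(merge_base_words(["某某", "有限", "公司"]))
```

The output contains the original segments as well as the merged phrases, so a segmentation that split "某某有限公司" too finely can still be recovered as one potential inlink word.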
In an optional embodiment, screening the potential inlink words to obtain the entity inlink words includes at least one of the following:
screening the potential inlink words according to their word frequency;
performing entity recognition on the potential inlink words based on natural language processing and screening them according to the recognition result; and
matching the potential inlink words against a preset entity blacklist and screening them according to the matching result.
In the screening stage of the potential inlink words, high-frequency word filtering may be performed according to word frequency; for example, a high-frequency word list, which may be built from internet word-frequency statistics, can be used to filter out high-frequency potential inlink words. The meaning of a high-frequency word is generally clear and needs no inlink to explain it, so this filtering reduces the risk of meaningless entities.
In the screening stage, entity detection may also be performed: the part-of-speech and entity feature information of the potential inlink words is used to filter out those that are not entities, e.g., the potential inlink words are processed with NLP, those not belonging to an entity are identified, and they are filtered out. Proper noun recognition on the potential inlink words themselves is more accurate than recognition on the whole target text because: 1) when recognizing running text, the division granularity of words is relatively small; and 2) the surrounding context exerts an influence.
In the screening stage, entity-quality detection can also be performed against product requirement standards. For example, an entity blacklist may be provided based on the online standards of a service, where the blacklist stores meaningless entity words with low-quality proper-noun features; filtering out potential inlined words that match the entity blacklist further improves the confidence of the entity inlined words. For example, entity words that carry little information may be added to the entity blacklist. In addition, potential inlined words that match no existing knowledge item, i.e., potential inlined words without candidate knowledge items, can be eliminated.
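The three screening checks above (high-frequency-word filtering, NLP-based entity detection, blacklist matching) can be combined into one pass. This is a minimal sketch: the threshold value, the `is_entity` predicate, and all sample data are assumptions for illustration.

```python
# Minimal sketch of the screening pipeline for potential inlined words.
# HIGH_FREQ_THRESHOLD is an assumed cutoff derived from internet
# word-frequency statistics; is_entity stands in for an NLP entity check.

HIGH_FREQ_THRESHOLD = 1_000_000

def screen(potential_words, word_freq, is_entity, blacklist):
    kept = []
    for w in potential_words:
        if word_freq.get(w, 0) >= HIGH_FREQ_THRESHOLD:
            continue  # high-frequency words need no inner-link explanation
        if not is_entity(w):
            continue  # NLP-based non-entity filtering
        if w in blacklist:
            continue  # low-quality / meaningless entity words
        kept.append(w)
    return kept

freq = {"today": 5_000_000, "Baidu Netcom": 1_200}
result = screen(["today", "Baidu Netcom", "asdf"],
                freq,
                is_entity=lambda w: w != "asdf",
                blacklist=set())
print(result)  # ['Baidu Netcom']
```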
Fig. 3b is a schematic diagram of extracting entity inlined words according to an embodiment of the present application. Referring to fig. 3b, the target text is processed based on NLP technology to obtain the part of speech and entity feature information of the basic words; word-sticking is performed on the basic words to obtain potential inlined words; and the entity inlined words are obtained through high-frequency-word filtering, NLP-based non-proper-noun filtering, and filtering of low-quality potential inlined words based on low-quality proper-noun features.
According to the technical scheme of the embodiment of the application, accurate and meaningful entity inlined words are extracted from the target text by entity extraction technology; moreover, a target knowledge item to be inlined is selected for each entity inlined word, which improves both the efficiency and the accuracy of inner-link construction.
Fig. 4 is a schematic diagram of an apparatus for constructing inner links based on a knowledge base according to an embodiment of the present application, which is applicable to constructing inner links for entity words in a target text, that is, pointing the entity words in the target text to knowledge items. The knowledge-base-based inner-link construction apparatus 400 specifically includes the following modules:
the internal link word extraction module 401 is configured to extract an entity internal link word in the target text;
a candidate item selection module 402, configured to select candidate knowledge items of the entity inlined words from knowledge items in a knowledge base;
a target item selecting module 403, configured to select a target knowledge item to be inlined from the candidate knowledge items for the entity inlined word according to at least one of the classification features, the history inlined features, and the context features of the entity inlined word and the candidate knowledge items.
In an alternative embodiment, the target item selection module 403 includes:
the classification screening unit is used for filtering any candidate knowledge item if the classification characteristic of the candidate knowledge item is different from the classification characteristic of the entity internal chain word;
and the target selection unit is used for selecting a target knowledge item to be inlined from the remaining candidate knowledge items for the entity inlined word according to the historical inlined features and/or the context features of the entity inlined word and the candidate knowledge items.
In an optional implementation manner, the target selecting unit is specifically configured to:
under the condition that the candidate knowledge item of the entity in-link word is unique, counting the historical in-link frequency characteristics of the entity in-link word between the historical text and the candidate knowledge item;
and under the condition that the historical inlink frequency characteristics accord with inlink conditions, taking the remaining candidate knowledge items as target knowledge items to be inlined of the entity inlink words.
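The unique-candidate case above can be sketched in a few lines. The threshold value and the function name are assumptions; the patent only states that the historical inlink-frequency feature must meet an inlink condition.

```python
# Hedged sketch of the unique-candidate case: when an entity inlined word has
# exactly one candidate knowledge item, link it only if the word's historical
# inlink frequency toward that item meets the inlink condition.

MIN_HISTORY_LINKS = 3  # assumed example value for the inlink condition

def link_unique_candidate(candidate, history_link_count):
    """Return the candidate as the target knowledge item if the historical
    inlink-frequency feature meets the inlink condition, else None."""
    return candidate if history_link_count >= MIN_HISTORY_LINKS else None

print(link_unique_candidate("item_42", 5))  # item_42
print(link_unique_candidate("item_42", 1))  # None
```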
In an optional implementation manner, the target selecting unit is specifically configured to:
under the condition that the entity internal link words have at least two candidate knowledge entries, counting the historical internal link frequency characteristics of the entity internal link words between the historical texts and the candidate knowledge entries;
extracting context keywords of the entity internal chain words from the target text, and determining the context characteristics according to the co-occurrence times of the context keywords in the target text and the candidate knowledge items;
and selecting a target knowledge item to be inlined from the candidate knowledge items for the entity inlined word according to the historical inlined frequency characteristic and/or the context characteristic.
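When there are several candidate knowledge items, the selection above combines a historical inlink-frequency feature with a context co-occurrence feature. The following sketch shows one way this combination could work; the weighting scheme, normalization, and sample data are assumptions and not taken from the patent.

```python
# Hedged sketch of multi-candidate disambiguation: score each candidate by a
# weighted mix of normalized historical inlink frequency and normalized
# context-keyword co-occurrence, then pick the best-scoring candidate.

def pick_target(candidates, history_freq, cooccurrence,
                w_hist=0.5, w_ctx=0.5, min_score=0.0):
    """candidates: list of knowledge-item ids.
    history_freq[item]: how often the word historically linked to the item.
    cooccurrence[item]: co-occurrence count of the target text's context
    keywords with the item."""
    def norm(counts):
        total = sum(counts.get(c, 0) for c in candidates) or 1
        return {c: counts.get(c, 0) / total for c in candidates}
    h, x = norm(history_freq), norm(cooccurrence)
    best = max(candidates, key=lambda c: w_hist * h[c] + w_ctx * x[c])
    score = w_hist * h[best] + w_ctx * x[best]
    return best if score > min_score else None  # None: leave the word unlinked

target = pick_target(["apple_fruit", "apple_company"],
                     history_freq={"apple_fruit": 10, "apple_company": 90},
                     cooccurrence={"apple_fruit": 1, "apple_company": 7})
print(target)  # apple_company
```

Returning `None` when no candidate clears the score threshold matches the document's intent that an uncertain entity inlined word is better left unlinked than linked wrongly.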
In an alternative embodiment, the target item selection module 403 is further configured to determine the context keywords of the entity inlined word by:
taking the vertical class to which the entity internal link word belongs as a target vertical class, and acquiring at least one key attribute of the target vertical class;
and taking the words belonging to the key attributes in the target text as context keywords of the entity internal chain words.
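The two steps above (look up the key attributes of the word's vertical class, then keep target-text words belonging to those attributes) can be sketched as follows. The vertical-class table and attribute vocabularies are invented for the example.

```python
# Illustrative sketch of context-keyword extraction driven by vertical-class
# key attributes. KEY_ATTRS is a made-up stand-in for the knowledge base's
# vertical-class schema.

KEY_ATTRS = {  # vertical class -> attribute name -> known attribute values
    "movie": {"director": {"Nolan", "Cameron"}, "genre": {"sci-fi", "drama"}},
    "person": {"occupation": {"actor", "singer"}},
}

def context_keywords(entity_word_vertical, target_text_words):
    """Return words of the target text that belong to a key attribute of the
    vertical class the entity inlined word belongs to."""
    attrs = KEY_ATTRS.get(entity_word_vertical, {})
    vocab = set().union(*attrs.values()) if attrs else set()
    return [w for w in target_text_words if w in vocab]

print(context_keywords("movie", ["Inception", "Nolan", "is", "sci-fi"]))
# ['Nolan', 'sci-fi']
```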
In an alternative embodiment, the inlined word extraction module 401 includes:
the basic word determining unit is used for performing word segmentation, part of speech tagging and entity identification on the target text based on a natural language processing technology to obtain basic words in the target text and part of speech and entity characteristic information of the basic words;
the potential internal link word determining unit is used for combining the basic words according to the part of speech and the entity characteristic information of the basic words to obtain potential internal link words;
and the potential internal link word screening unit is used for screening the potential internal link words to obtain the entity internal link words.
In an optional embodiment, the potential inlined word screening unit is specifically configured to screen the potential inlined words by at least one of:
screening the potential internal link words according to the word frequency of the potential internal link words;
based on a natural language processing technology, carrying out entity recognition on the potential internal link words, and screening the potential internal link words according to an entity recognition result;
and matching the potential internal link words with a preset entity blacklist, and screening the potential internal link words according to the matching result.
According to the technical scheme of this embodiment, accurate and meaningful entity inlined words are extracted from the target text by entity extraction technology; and, taking into account the influence of the classification of the entity inlined words on the inner-link strategy, target knowledge items to be inlined are selected for the entity inlined words by using the statistical prior information of inner links in the knowledge base and by considering context features according to the vertical class, which improves both the efficiency and the accuracy of inner-link construction.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 5 illustrates a schematic block diagram of an example electronic device 500 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 5, the device 500 comprises a computing unit 501, which may perform various appropriate actions and processes according to a computer program stored in a Read-Only Memory (ROM) 502 or a computer program loaded from a storage unit 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the device 500 can also be stored. The computing unit 501, the ROM 502, and the RAM 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
A number of components in the device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, or the like; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508, such as a magnetic disk, optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 501 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units that run machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 501 performs the various methods and processes described above, such as the knowledge-base-based inlink construction method. For example, in some embodiments, the knowledge-base-based inlink construction method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the knowledge-base-based inlink construction method described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the knowledge-base-based inlink construction method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs executing on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak service scalability found in traditional physical hosts and VPS services.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (17)

1. The method for constructing the internal chain based on the knowledge base comprises the following steps:
extracting entity internal link words from a target text;
selecting candidate knowledge items of the entity internal link words from knowledge items of a knowledge base;
and selecting a target knowledge item to be inlined from the candidate knowledge items for the entity inlined word according to at least one of the classification characteristics, historical inlined characteristics and context characteristics of the entity inlined word and the candidate knowledge items.
2. The method of claim 1, wherein the selecting a target knowledge item to be inlined for the entity inlined word from the candidate knowledge items according to at least one of classification features, historical inlined features and context features of the entity inlined word and the candidate knowledge items comprises:
if the classification characteristic of any candidate knowledge item is different from the classification characteristic of the entity internal chain word, filtering the candidate knowledge item;
and selecting a target knowledge item to be inlined from the remaining candidate knowledge items for the entity inlined word according to the entity inlined word and the historical inlined characteristics and/or the context characteristics of the candidate knowledge items.
3. The method of claim 2, wherein selecting a target knowledge item to be inlined for the entity inlined word from the remaining candidate knowledge items according to the historical inlined features of the entity inlined word and the candidate knowledge items comprises:
under the condition that the candidate knowledge item of the entity in-link word is unique, counting the historical in-link frequency characteristics of the entity in-link word between the historical text and the candidate knowledge item;
and under the condition that the historical inlink frequency characteristics accord with inlink conditions, taking the remaining candidate knowledge items as target knowledge items to be inlined of the entity inlink words.
4. The method of claim 2, wherein selecting a target knowledge item to be inlined for the entity inlined word from the remaining candidate knowledge items according to the historical inlined features and the context features of the entity inlined word and the candidate knowledge items comprises:
under the condition that the entity internal link words have at least two candidate knowledge entries, counting the historical internal link frequency characteristics of the entity internal link words between the historical texts and the candidate knowledge entries;
extracting context keywords of the entity internal chain words from the target text, and determining the context characteristics according to the co-occurrence times of the context keywords in the target text and the candidate knowledge items;
and selecting a target knowledge item to be inlined from the candidate knowledge items for the entity inlined word according to the historical inlined frequency characteristic and/or the context characteristic.
5. The method of claim 4, further comprising determining the context keywords of the entity inlined word by:
taking the vertical class to which the entity internal link word belongs as a target vertical class, and acquiring at least one key attribute of the target vertical class;
and taking the words belonging to the key attributes in the target text as context keywords of the entity internal chain words.
6. The method of claim 1, wherein the extracting entity inlined words in the target text comprises:
based on a natural language processing technology, performing word segmentation, part-of-speech tagging and entity identification on the target text to obtain basic words in the target text and part-of-speech and entity characteristic information of the basic words;
combining the basic words according to the part of speech and the entity characteristic information of the basic words to obtain potential internal link words;
and screening the potential internal link words to obtain the entity internal link words.
7. The method of claim 6, wherein the screening the potential inlined words comprises at least one of:
screening the potential internal link words according to the word frequency of the potential internal link words;
based on a natural language processing technology, carrying out entity recognition on the potential internal link words, and screening the potential internal link words according to an entity recognition result;
and matching the potential internal link words with a preset entity blacklist, and screening the potential internal link words according to the matching result.
8. The inner chain building device based on the knowledge base comprises:
the internal link word extraction module is used for extracting entity internal link words in the target text;
the candidate item selection module is used for selecting candidate knowledge items of the entity internal link words from the knowledge items of the knowledge base;
and the target item selection module is used for selecting a target knowledge item to be inlined from the candidate knowledge items for the entity inlined word according to at least one of the classification characteristics, the history inlined characteristics and the context characteristics of the entity inlined word and the candidate knowledge items.
9. The apparatus of claim 8, wherein the target entry selection module comprises:
the classification screening unit is used for filtering any candidate knowledge item if the classification characteristic of the candidate knowledge item is different from the classification characteristic of the entity internal chain word;
and the target selection unit is used for selecting a target knowledge item to be inlined from the remaining candidate knowledge items for the entity inlined word according to the historical inlined features and/or the context features of the entity inlined word and the candidate knowledge items.
10. The apparatus according to claim 9, wherein the target selection unit is specifically configured to:
under the condition that the candidate knowledge item of the entity in-link word is unique, counting the historical in-link frequency characteristics of the entity in-link word between the historical text and the candidate knowledge item;
and under the condition that the historical inlink frequency characteristics accord with inlink conditions, taking the remaining candidate knowledge items as target knowledge items to be inlined of the entity inlink words.
11. The apparatus according to claim 9, wherein the target selection unit is specifically configured to:
under the condition that the entity internal link words have at least two candidate knowledge entries, counting the historical internal link frequency characteristics of the entity internal link words between the historical texts and the candidate knowledge entries;
extracting context keywords of the entity internal chain words from the target text, and determining the context characteristics according to the co-occurrence times of the context keywords in the target text and the candidate knowledge items;
and selecting a target knowledge item to be inlined from the candidate knowledge items for the entity inlined word according to the historical inlined frequency characteristic and/or the context characteristic.
12. The apparatus of claim 11, wherein the target item selection module is further configured to determine the context keywords of the entity inlined word by:
taking the vertical class to which the entity internal link word belongs as a target vertical class, and acquiring at least one key attribute of the target vertical class;
and taking the words belonging to the key attributes in the target text as context keywords of the entity internal chain words.
13. The apparatus of claim 8, wherein the inlined word extraction module comprises:
the basic word determining unit is used for performing word segmentation, part of speech tagging and entity identification on the target text based on a natural language processing technology to obtain basic words in the target text and part of speech and entity characteristic information of the basic words;
the potential internal link word determining unit is used for combining the basic words according to the part of speech and the entity characteristic information of the basic words to obtain potential internal link words;
and the potential internal link word screening unit is used for screening the potential internal link words to obtain the entity internal link words.
14. The apparatus according to claim 13, wherein the potential inlined word screening unit is specifically configured to screen the potential inlined words by at least one of:
screening the potential internal link words according to the word frequency of the potential internal link words;
based on a natural language processing technology, carrying out entity recognition on the potential internal link words, and screening the potential internal link words according to an entity recognition result;
and matching the potential internal link words with a preset entity blacklist, and screening the potential internal link words according to the matching result.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.
CN202110258493.9A 2021-03-09 2021-03-09 Knowledge base-based inner link construction method, device, equipment and storage medium Active CN112989235B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110258493.9A CN112989235B (en) 2021-03-09 2021-03-09 Knowledge base-based inner link construction method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110258493.9A CN112989235B (en) 2021-03-09 2021-03-09 Knowledge base-based inner link construction method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112989235A 2021-06-18
CN112989235B 2023-08-01

Family

ID=76334694

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110258493.9A Active CN112989235B (en) 2021-03-09 2021-03-09 Knowledge base-based inner link construction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112989235B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113919347A (en) * 2021-12-14 2022-01-11 山东捷瑞数字科技股份有限公司 Method and device for extracting and matching internal link words of text data
CN114647739A (en) * 2022-02-25 2022-06-21 北京百度网讯科技有限公司 Entity chain finger method, device, electronic equipment and storage medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106940702A (en) * 2016-01-05 2017-07-11 富士通株式会社 Entity refers to the method and apparatus with entity in semantic knowledge-base in connection short text
CN107092605A (en) * 2016-02-18 2017-08-25 北大方正集团有限公司 A kind of entity link method and device
CN108415902A (en) * 2018-02-10 2018-08-17 合肥工业大学 A kind of name entity link method based on search engine
CN108959270A (en) * 2018-08-10 2018-12-07 新华智云科技有限公司 A kind of entity link method based on deep learning
CN109446406A (en) * 2018-09-14 2019-03-08 北京搜狗科技发展有限公司 A kind of data processing method, device and the device for data processing
CN110569496A (en) * 2018-06-06 2019-12-13 腾讯科技(深圳)有限公司 Entity linking method, device and storage medium
CN110674317A (en) * 2019-09-30 2020-01-10 北京邮电大学 Entity linking method and device based on graph neural network
CN110991187A (en) * 2019-12-05 2020-04-10 北京奇艺世纪科技有限公司 Entity linking method, device, electronic equipment and medium
CN111428507A (en) * 2020-06-09 2020-07-17 北京百度网讯科技有限公司 Entity chain finger method, device, equipment and storage medium
CN111523326A (en) * 2020-04-23 2020-08-11 北京百度网讯科技有限公司 Entity chain finger method, device, equipment and storage medium
CN111737430A (en) * 2020-06-16 2020-10-02 北京百度网讯科技有限公司 Entity linking method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
熊文新;: "与自然语言查询表述相关的词语分析", 图书情报工作, no. 17 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113919347A (en) * 2021-12-14 2022-01-11 山东捷瑞数字科技股份有限公司 Method and device for extracting and matching internal link words of text data
CN113919347B (en) * 2021-12-14 2022-04-05 山东捷瑞数字科技股份有限公司 Method and device for extracting and matching internal link words of text data
CN114647739A (en) * 2022-02-25 2022-06-21 北京百度网讯科技有限公司 Entity chain finger method, device, electronic equipment and storage medium
CN114647739B (en) * 2022-02-25 2023-02-28 北京百度网讯科技有限公司 Entity chain finger method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112989235B (en) 2023-08-01

Similar Documents

Publication Publication Date Title
CN110020422B (en) Feature word determining method and device and server
CN113660541B (en) Method and device for generating abstract of news video
CN112989235B (en) Knowledge base-based inner link construction method, device, equipment and storage medium
CN114579104A (en) Data analysis scene generation method, device, equipment and storage medium
CN112560461A (en) News clue generation method and device, electronic equipment and storage medium
CN112380847A (en) Interest point processing method and device, electronic equipment and storage medium
CN113033194B (en) Training method, device, equipment and storage medium for semantic representation graph model
CN112699237B (en) Label determination method, device and storage medium
CN113191145A (en) Keyword processing method and device, electronic equipment and medium
CN112784050A (en) Method, device, equipment and medium for generating theme classification data set
CN112948573A (en) Text label extraction method, device, equipment and computer storage medium
CN114491232B (en) Information query method and device, electronic equipment and storage medium
CN113239273B (en) Method, apparatus, device and storage medium for generating text
CN114048315A (en) Method and device for determining document tag, electronic equipment and storage medium
CN114417862A (en) Text matching method, and training method and device of text matching model
CN113590774A (en) Event query method, device and storage medium
CN112926297A (en) Method, apparatus, device and storage medium for processing information
CN115248890A (en) User interest portrait generation method and device, electronic equipment and storage medium
CN113971216B (en) Data processing method and device, electronic equipment and memory
CN112528644B (en) Entity mounting method, device, equipment and storage medium
CN114492409B (en) Method and device for evaluating file content, electronic equipment and program product
CN113377922B (en) Method, device, electronic equipment and medium for matching information
CN113377921B (en) Method, device, electronic equipment and medium for matching information
CN114036263A (en) Website identification method and device and electronic equipment
CN112528644A (en) Entity mounting method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant