CN110781310A - Target concept graph construction method and device, computer equipment and storage medium - Google Patents


Info

Publication number
CN110781310A
Authority
CN
China
Prior art keywords
entry
initial
entry information
map
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910848493.7A
Other languages
Chinese (zh)
Inventor
朱昱锦
徐国强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
OneConnect Smart Technology Co Ltd
Original Assignee
OneConnect Smart Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by OneConnect Smart Technology Co Ltd
Priority to CN201910848493.7A
Publication of CN110781310A
Priority to PCT/CN2020/106256 (WO2021047327A1)
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the technical field of artificial intelligence, is applicable to the financial industry, and in particular relates to a method and an apparatus for constructing a target concept graph, a computer device, and a storage medium. In one embodiment, the method comprises: reading an entry map, and acquiring initial entries in the entry map and the corresponding initial entry information; crawling supplementary entry information of each initial entry, and performing similarity-based fusion processing on the initial entry information and the supplementary entry information of the same initial entry to obtain entry information; extracting candidate words from the initial entry information, and taking a candidate word as an undetermined word when it does not belong to the entries in the entry map; when an undetermined word is successfully crawled, supplementing the undetermined word and its corresponding entry information to the entry map so as to update the entry map; and when the ratio of the number of undetermined words to the number of entries in the updated entry map is smaller than a preset threshold, taking the updated entry map as the target concept graph.

Description

Target concept graph construction method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for constructing a target concept graph, a computer device, and a storage medium.
Background
A concept graph takes abstract concepts, knowledge, and terms as entities to form the main hierarchical structure of the graph. It can not only reflect the relationships between pieces of knowledge, but can also be expanded into an encyclopedia graph by connecting practical examples to the corresponding concepts. The concept graph serves as an underlying capability that supports upper-layer capabilities such as retrieval, question answering, and information extraction.
Traditional concept graphs are mainly constructed manually. Construction usually requires large-scale information collection from knowledge sources, followed by screening and merging, so graph construction is inefficient.
Disclosure of Invention
In view of the above, it is necessary to provide a method, an apparatus, a computer device, and a storage medium for constructing a target concept graph, which can improve efficiency.
A method of constructing a target concept graph, the method comprising:
reading an entry map, and acquiring initial entries in the entry map and initial entry information corresponding to the initial entries;
crawling supplementary entry information of the initial entry, and performing fusion processing based on similarity on the initial entry information and the supplementary entry information corresponding to the same initial entry to obtain entry information;
extracting a candidate word in the initial entry information, and taking the candidate word as a pending word when the candidate word does not belong to an entry in the entry map;
when the undetermined words are crawled, supplementing the undetermined words and entry information corresponding to the undetermined words to the entry map, and updating the entry map;
and when the ratio of the number of the undetermined words to the number of the entries in the updated entry map is smaller than a preset threshold value, taking the updated entry map as a target concept map.
In one embodiment, performing similarity-based fusion processing on the initial entry information and the supplementary entry information corresponding to the same initial entry to obtain entry information includes:
obtaining a similarity score value of the initial entry information and the supplementary entry information corresponding to the same initial entry based on a preset similarity evaluation index;
when the similarity score value is larger than a preset score value, acquiring the respective character string lengths of the initial entry information and the supplementary entry information corresponding to the same initial entry;
when the character string length of the initial entry information is smaller than a preset length threshold value and the character string length of the supplementary entry information is larger than the preset length threshold value, taking the supplementary entry information as entry information;
when the length of the character string of the initial entry information is greater than a preset length threshold value and the length of the character string of the supplementary entry information is less than the preset length threshold value, taking the initial entry information as entry information; the preset length threshold is between the length of the character string of the initial entry information and the length of the character string of the supplementary entry information.
In an embodiment, after obtaining the similarity score values of the initial entry information and the supplementary entry information corresponding to the same initial entry based on the preset similarity evaluation index, the method further includes:
and when the similarity score value is smaller than or equal to the preset score value, combining the initial entry information corresponding to the same initial entry with the supplementary entry information to obtain entry information.
In one embodiment, the extracting candidate words from the initial entry information includes:
performing word segmentation processing on first preset field information in the initial entry information to obtain entity words in the initial entry information;
and combining the entity words with words in second preset field information of the initial entry information to obtain candidate words corresponding to the initial entry information.
In one embodiment, the method further comprises:
and when the ratio of the number of the undetermined words to the number of the entries in the updated entry map is greater than or equal to the preset threshold, taking the updated entry map as the entry map, and returning to the step of reading the entry map.
In one embodiment, the obtaining of the initial entry in the entry map and the initial entry information corresponding to the initial entry includes:
crawling entries on the classification index page based on a preset seed knowledge source, wherein the crawled entries serve as initial entries;
and extracting preset field information of the initial entry from the classification index page based on a preset field to obtain initial entry information corresponding to the initial entry.
An apparatus for constructing a target concept graph, the apparatus comprising:
the information acquisition module is used for reading an entry map and acquiring initial entries in the entry map and initial entry information corresponding to the initial entries;
the information processing module is used for crawling the supplementary entry information of the initial entry, and performing fusion processing based on similarity on the initial entry information and the supplementary entry information corresponding to the same initial entry to obtain entry information;
the undetermined word acquisition module is used for extracting a candidate word in the initial entry information, and when the candidate word does not belong to an entry in the entry map, the candidate word is used as the undetermined word;
the map updating module is used for supplementing the undetermined word and entry information corresponding to the undetermined word to the entry map and updating the entry map when the undetermined word is crawled;
and the map determining module is used for taking the updated entry map as a target concept graph when the ratio of the number of the undetermined words to the number of the entries in the updated entry map is less than a preset threshold value.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
reading an entry map, and acquiring initial entries in the entry map and initial entry information corresponding to the initial entries;
crawling supplementary entry information of the initial entry, and performing fusion processing based on similarity on the initial entry information and the supplementary entry information corresponding to the same initial entry to obtain entry information;
extracting a candidate word in the initial entry information, and taking the candidate word as a pending word when the candidate word does not belong to an entry in the entry map;
when the undetermined words are crawled, supplementing the undetermined words and entry information corresponding to the undetermined words to the entry map, and updating the entry map;
and when the ratio of the number of the undetermined words to the number of the entries in the updated entry map is smaller than a preset threshold value, taking the updated entry map as a target concept map.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
reading an entry map, and acquiring initial entries in the entry map and initial entry information corresponding to the initial entries;
crawling supplementary entry information of the initial entry, and performing fusion processing based on similarity on the initial entry information and the supplementary entry information corresponding to the same initial entry to obtain entry information;
extracting a candidate word in the initial entry information, and taking the candidate word as a pending word when the candidate word does not belong to an entry in the entry map;
when the undetermined words are crawled, supplementing the undetermined words and entry information corresponding to the undetermined words to the entry map, and updating the entry map;
and when the ratio of the number of the undetermined words to the number of the entries in the updated entry map is smaller than a preset threshold value, taking the updated entry map as a target concept map.
With the above target concept graph construction method, apparatus, computer device, and storage medium, similarity-based fusion processing is performed on the initial entry information and the supplementary entry information corresponding to the same initial entry to obtain entry information, which yields an entry map comprising the initial entries and their corresponding entry information. Candidate words are extracted from the initial entry information; a candidate word that does not belong to the entries in the entry map is taken as an undetermined word, and when the undetermined word is crawled, it is supplemented to the entry map together with its corresponding entry information, so that the entry map is updated and expanded. When the ratio of the number of undetermined words to the number of entries in the updated entry map is smaller than a preset threshold, the entry map is large enough that unknown new entries still needing to be crawled are hard to find; that is, the entry map is saturated, its construction is complete, and the updated entry map is taken as the target concept graph. An encyclopedic concept graph can therefore be built automatically and quickly, which can greatly improve graph construction efficiency.
Drawings
FIG. 1 is a diagram of an application environment of a method for constructing a target concept graph according to an embodiment;
FIG. 2 is a schematic flowchart of a method for constructing a target concept graph according to an embodiment;
FIG. 3 is a schematic flowchart of a candidate word obtaining step according to an embodiment;
FIG. 4 is a schematic flowchart of a method for constructing a target concept graph according to another embodiment;
FIG. 5 is a block diagram of an apparatus for constructing a target concept graph according to an embodiment;
FIG. 6 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The method for constructing the target concept graph can be applied to the application environment shown in fig. 1, in which the terminal 102 communicates with the server 104 via a network. The server 104 reads the entry map from the terminal 102, obtains initial entries in the entry map and initial entry information corresponding to the initial entries, crawls supplementary entry information of the initial entries, and performs similarity-based fusion processing on the initial entry information and the supplementary entry information corresponding to the same initial entry to obtain entry information; extracts candidate words from the initial entry information, and takes a candidate word as an undetermined word when it does not belong to the entries in the entry map; when an undetermined word is crawled, supplements the undetermined word and its corresponding entry information to the entry map so as to update the entry map; and when the ratio of the number of undetermined words to the number of entries in the updated entry map is smaller than a preset threshold value, takes the updated entry map as the target concept graph. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smart phone, or tablet computer, and the server 104 may be implemented as an independent server or as a server cluster formed by a plurality of servers.
In one embodiment, as shown in fig. 2, a method for constructing a target concept graph is provided. The method is described by taking its application to the server in fig. 1 as an example, and includes the following steps:
Step 202, reading the entry map, and obtaining initial entries in the entry map and initial entry information corresponding to the initial entries.
An entry can be a single character or a word, or can be composed of characters and words; an entry is a fixed expression but not a sentence. Entry information refers to content with a fixed format that corresponds to an entry, such as a plurality of pieces of field information. Specifically, obtaining the initial entries in the entry map and the initial entry information corresponding to the initial entries includes: crawling entries on the classification index page of a preset seed knowledge source, the crawled entries serving as initial entries; and extracting preset field information of the entries from the classification index page based on preset fields to obtain the initial entry information corresponding to the initial entries.
For example, one to three websites with a complete concept hierarchy are selected as the preset seed knowledge sources; these websites have comprehensive classification index pages. Specifically, the MBA Lib encyclopedia (MBA智库百科) can be selected as a seed knowledge source, and all of its entries can be crawled recursively through its entry classification index page. Because of the index relationship, the crawled entries carry parent-child (upper-level and lower-level) relationships.
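The traversal of a classification index can be sketched as follows. This is a minimal illustration rather than the patent's implementation: it assumes the index pages have already been fetched into an in-memory mapping (the hypothetical `INDEX`), whereas a real crawler would issue HTTP requests. An explicit stack is used in place of recursion; the visiting order differs, but the recorded parent relationships are the same for a tree-shaped index.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for a site's classification index pages:
# each category maps to the sub-categories or entries listed under it.
INDEX: Dict[str, List[str]] = {
    "Finance": ["Banking", "Insurance"],
    "Banking": ["Deposit", "Loan"],
    "Insurance": ["Premium"],
}


def crawl_index(root: str, index: Dict[str, List[str]]) -> Dict[str, Optional[str]]:
    """Walk the index from `root` and return {entry: parent entry}."""
    parents: Dict[str, Optional[str]] = {root: None}
    stack = [root]
    while stack:
        current = stack.pop()
        for child in index.get(current, []):
            if child not in parents:  # skip entries already visited
                parents[child] = current
                stack.append(child)
    return parents


parents = crawl_index("Finance", INDEX)
```

Each crawled entry keeps a reference to the entry it was listed under, which is how the parent-child relationships mentioned above fall out of the index structure.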
The entries of the seed knowledge source are crawled with a framework built on Scrapy. The entry information may include a plurality of pieces of preset field information. Specific fields may include id (the code number of the current entry), name (the concept name of the current entry), description (the definition or description of the current entry on its page, generally the first section of the entry text), link (the link of the current entry page), parent (the parent class of the current entry), related word (words on the current entry page that carry internal links), tag (the labels at the bottom of the current entry page, for example indicating the field to which the concept belongs), info box (the information column, which includes the most basic attributes of the entry), category (the content of the entry divided into subsections), and the like. The initial entries and their corresponding initial entry information can be stored in a MongoDB database, which is a database based on distributed file storage.
Step 204, crawling supplementary entry information of the initial entry, and performing similarity-based fusion processing on the initial entry information and the supplementary entry information corresponding to the same initial entry to obtain entry information.
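As an illustration of the field layout listed above, the entry information could be modeled as a simple record. The field names follow the text; the Python types, default values, and example values are assumptions made for this sketch, and the patent does not prescribe a schema.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class EntryInfo:
    id: str                      # code number of the entry
    name: str                    # concept name of the entry
    description: str = ""        # definition, usually the first section of the page
    link: str = ""               # link of the entry page
    parent: Optional[str] = None  # parent class of the entry
    related_words: List[str] = field(default_factory=list)  # internally linked words
    tags: List[str] = field(default_factory=list)           # labels at the page bottom
    info_box: Dict[str, str] = field(default_factory=dict)  # basic attributes
    category: Dict[str, str] = field(default_factory=dict)  # subsection title -> text


# Illustrative values only; they are not taken from any real site.
example = EntryInfo(
    id="0001",
    name="Loan",
    description="A loan is money lent by a bank.",
    parent="Credit",
    related_words=["Bank", "Interest"],
    tags=["finance"],
)

# A plain dict of this shape is the kind of document a store such as MongoDB accepts.
document = asdict(example)
```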
The supplementary entry information of an initial entry can be crawled from preset encyclopedia sites, such as Baidu Baike, Hudong Baike (Interactive Encyclopedia), Sogou Baike, Wikipedia, and other encyclopedia sites. The supplementary entry information refers to the entry information of the initial entry crawled from a preset encyclopedia site.
Performing similarity-based fusion processing on the initial entry information and the supplementary entry information corresponding to the same initial entry to obtain entry information includes: obtaining a similarity score value of the initial entry information and the supplementary entry information corresponding to the same initial entry based on a preset similarity evaluation index; when the similarity score value is larger than a preset score value, acquiring the respective character string lengths of the initial entry information and the supplementary entry information; when the character string length of the initial entry information is smaller than a preset length threshold and that of the supplementary entry information is larger than the preset length threshold, taking the supplementary entry information as the entry information; and when the character string length of the initial entry information is larger than the preset length threshold and that of the supplementary entry information is smaller than the preset length threshold, taking the initial entry information as the entry information. The preset length threshold lies between the character string length of the initial entry information and that of the supplementary entry information. After the similarity score value is obtained, the method further includes: when the similarity score value is smaller than or equal to the preset score value, combining the initial entry information corresponding to the same initial entry with the supplementary entry information to obtain entry information.
When the similarity score value is larger than the preset score value, the initial entry information and the supplementary entry information of the entry are judged to be similar, so the longer information can be kept and the shorter information removed. The similarity evaluation index may specifically be the characters/words shared by the initial entry information and the supplementary entry information, in which case the similarity score value is the number of times the shared characters/words appear. For example, when the number of occurrences of shared characters/words in the initial entry information and the supplementary entry information is larger than a preset number, the two are judged to be similar; their respective character string lengths are obtained, the information with the longer character string is kept, and the information with the shorter character string is removed, so that the resulting entry information is the longer of the two.
When the number of occurrences of shared characters/words in the initial entry information and the supplementary entry information is smaller than or equal to the preset number, the initial entry information of the entry is judged not to be similar to the supplementary entry information, and the two are combined to obtain the entry information. For example, if the initial entry information is {A1, A2} and the corresponding supplementary entry information is {B1, B2, B3}, the entry information obtained after combination is {A1, A2, B1, B2, B3}.
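The fusion rule can be sketched as below. This is a hedged illustration under several assumptions: entry information is modeled as a list of text snippets, words are obtained by whitespace splitting, the score is the multiset overlap of words, and ties in length keep the initial information (the text does not say what happens when lengths are equal). Because the preset length threshold is defined to lie between the two lengths, comparing the lengths directly gives the same outcome as comparing each against the threshold.

```python
from collections import Counter
from typing import List


def similarity_score(initial: List[str], supplementary: List[str]) -> int:
    """Count occurrences of words shared by both pieces of entry information."""
    a = Counter(" ".join(initial).split())
    b = Counter(" ".join(supplementary).split())
    return sum((a & b).values())  # multiset intersection


def fuse(initial: List[str], supplementary: List[str], preset_score: int) -> List[str]:
    """Keep the longer information if the two are similar, otherwise combine them."""
    if similarity_score(initial, supplementary) > preset_score:
        len_initial = sum(len(s) for s in initial)
        len_supplementary = sum(len(s) for s in supplementary)
        # Assumption: on equal length, the initial information is kept.
        return list(initial) if len_initial >= len_supplementary else list(supplementary)
    return list(initial) + list(supplementary)


similar = fuse(["a loan is money lent"], ["a loan is money lent by a bank"], preset_score=3)
combined = fuse(["A1", "A2"], ["B1", "B2", "B3"], preset_score=0)
```

In the first call the five shared words exceed the preset score, so only the longer supplementary text is kept; in the second call nothing is shared, so the two lists are concatenated as in the {A1, A2} and {B1, B2, B3} example.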
Step 206, extracting candidate words from the initial entry information, and taking a candidate word as an undetermined word when it does not belong to the entries in the entry map.
Step 208, when an undetermined word is crawled, supplementing the undetermined word and its corresponding entry information to the entry map so as to update the entry map.
The initial entries are stored in a dictionary to generate a concept dictionary. A dictionary is a collection of elements, each of which has a field called a key, and the keys of different elements are distinct. Entries are stored in the concept dictionary in the following format: the entry name is the key, and a Boolean value (True/False) is the value. The candidate words are also stored in the concept dictionary. For an undetermined word that can be crawled from a preset encyclopedia site, the corresponding Boolean value is set to True, indicating that the word has been crawled. When an undetermined word cannot be crawled from the preset encyclopedia site, the word is not a concept word; it is still stored in the concept dictionary, with its Boolean value set to False. When an undetermined word is crawled, the entry information corresponding to it is crawled from the preset encyclopedia site, the correspondence between the undetermined word and the entry information is established, and the entry map is updated.
Step 210, when the ratio of the number of undetermined words to the number of entries in the updated entry map is smaller than a preset threshold value, taking the updated entry map as the target concept graph.
The number of entries in the updated entry map can be obtained by counting the entries whose Boolean value in the concept dictionary is True; this number indicates the scale of the map. When the ratio of the number of newly added unfamiliar entries (the undetermined words) to the scale of the entry map is lower than the preset threshold, the entry map is large enough that unknown new entry concepts still needing to be crawled are hard to find. For example, when the number of undetermined words is zero, there are no new words, the entry map is saturated, and construction of the target concept graph is complete.
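The bookkeeping described above can be sketched with an ordinary Python dict. The `SITE` mapping and the `crawl` function are hypothetical stand-ins for a preset encyclopedia site; a real crawler would fetch and parse pages instead.

```python
from typing import Dict, Iterable, Optional

# Hypothetical encyclopedia site: word -> crawled entry information.
SITE: Dict[str, str] = {
    "Bank": "An institution that accepts deposits.",
    "Interest": "The price paid for borrowing money.",
}


def crawl(word: str) -> Optional[str]:
    """Return entry information for `word`, or None if the site has no such page."""
    return SITE.get(word)


def update_with_pending(
    concept_dict: Dict[str, bool],
    entry_map: Dict[str, str],
    pending_words: Iterable[str],
) -> None:
    """Record every undetermined word; add the crawlable ones to the entry map."""
    for word in pending_words:
        info = crawl(word)
        concept_dict[word] = info is not None  # False marks a non-concept word
        if info is not None:
            entry_map[word] = info


entry_map = {"Loan": "Money lent by a bank."}
concept_dict = {name: True for name in entry_map}
update_with_pending(concept_dict, entry_map, ["Bank", "Interest", "lent"])
```

Keeping the False entries in the dictionary means a word that was already tried and found not to be a concept can be recognized later without crawling it again.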
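The stopping condition amounts to a single ratio test. The sketch below assumes the concept dictionary format described earlier (entry name mapped to a Boolean); the function name, the threshold values, and the handling of an empty map are illustrative choices, not values from the patent.

```python
from typing import Dict


def is_saturated(concept_dict: Dict[str, bool], pending_count: int, threshold: float) -> bool:
    """True when undetermined words are few relative to the scale of the entry map."""
    entry_count = sum(1 for crawled in concept_dict.values() if crawled)
    if entry_count == 0:
        return False  # assumption: an empty map is never considered saturated
    return pending_count / entry_count < threshold


concept_dict = {"Loan": True, "Deposit": True, "Bank": True, "Premium": True, "xyz": False}
few_pending = is_saturated(concept_dict, pending_count=1, threshold=0.5)
none_pending = is_saturated(concept_dict, pending_count=0, threshold=0.5)
many_pending = is_saturated(concept_dict, pending_count=3, threshold=0.5)
```

With four crawled entries, one undetermined word gives a ratio of 0.25 and the map counts as saturated, while three undetermined words give 0.75 and another crawling round is needed.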
In the above target concept graph construction method, similarity-based fusion processing is performed on the initial entry information and the supplementary entry information corresponding to the same initial entry to obtain entry information, which yields an entry map comprising the initial entries and their corresponding entry information. Candidate words are extracted from the initial entry information; a candidate word that does not belong to the entries in the entry map is taken as an undetermined word, and when the undetermined word is crawled, it is supplemented to the entry map together with its corresponding entry information, so that the entry map is updated and expanded. When the ratio of the number of undetermined words to the number of entries in the updated entry map is smaller than a preset threshold, the entry map is large enough that unknown new entries still needing to be crawled are hard to find; that is, the entry map is saturated, its construction is complete, and the updated entry map is taken as the target concept graph. An encyclopedic concept graph can therefore be built automatically and quickly, which can greatly improve graph construction efficiency.
In one embodiment, as shown in fig. 3, extracting candidate words from the initial entry information includes: step 302, performing word segmentation processing on first preset field information in the initial entry information to obtain entity words in the initial entry information; and step 304, merging the entity words with the words in second preset field information of the initial entry information to obtain the candidate words corresponding to the initial entry information. The first preset field information refers to the field information in the initial entry information that defines or describes the entry, and the second preset field information refers to the field information for the related words, parent words, tag words, and the like of the entry. For example, the description field information indicates the definition or description of the entry, generally the first section of the entry text; the related word field information represents words on the entry page that carry internal links; the tag field information represents the tag words of the entry page, such as words for the field to which the concept belongs; and the parent field information indicates the words corresponding to the parent class of the entry. Word segmentation processing is performed on the description field information of each piece of initial entry information, the entity words are extracted from the segmented result, and the entity words extracted from the description field information are combined with the words contained in the related word, tag, and parent field information to obtain a candidate word set.
In an embodiment, as shown in fig. 4, the method for constructing a target concept graph further includes step 412: when the ratio of the number of undetermined words to the number of entries in the updated entry map is greater than or equal to the preset threshold, taking the updated entry map as the entry map, and returning to step 402 of reading the entry map. The full flow of this embodiment is: acquiring initial entries in the entry map and initial entry information corresponding to the initial entries; crawling supplementary entry information of the initial entry, and performing similarity-based fusion processing on the initial entry information and the supplementary entry information corresponding to the same initial entry to obtain entry information; extracting candidate words from the initial entry information, and taking a candidate word as an undetermined word when it does not belong to the entries in the entry map; when an undetermined word is crawled, supplementing the undetermined word and its corresponding entry information to the entry map, and updating the entry map; when the ratio of the number of undetermined words to the number of entries in the updated entry map is smaller than the preset threshold, taking the updated entry map as the target concept graph; and when the ratio is greater than or equal to the preset threshold, taking the updated entry map as the entry map, and returning to the step of reading the entry map.
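Candidate extraction can be sketched as follows. The `segment` function is a toy splitter and `is_entity` is a hypothetical predicate; both are stand-ins chosen for this example, since the text does not name a segmentation tool or an entity recognizer (Chinese text would need a dedicated word segmentation library).

```python
import re
from typing import Callable, Dict, List, Set


def segment(text: str) -> List[str]:
    """Toy word segmentation: split on non-word characters (assumption)."""
    return [token for token in re.split(r"\W+", text) if token]


def extract_candidates(info: Dict[str, object], is_entity: Callable[[str], bool]) -> Set[str]:
    """Merge entity words from `description` with related, tag, and parent words."""
    entity_words = {w for w in segment(str(info.get("description", ""))) if is_entity(w)}
    other_words: Set[str] = set()
    for field_name in ("related_words", "tags", "parent"):
        other_words.update(info.get(field_name, []))
    return entity_words | other_words


def undetermined_words(candidates: Set[str], entry_map: Dict[str, object]) -> Set[str]:
    """Candidates that are not yet entries in the entry map."""
    return {w for w in candidates if w not in entry_map}


info = {
    "description": "A Loan is money lent by a Bank at Interest",
    "related_words": ["Collateral"],
    "tags": ["Finance"],
    "parent": ["Credit"],
}
# Hypothetical entity test: capitalized words longer than one character.
candidates = extract_candidates(info, lambda w: w[0].isupper() and len(w) > 1)
pending = undetermined_words(candidates, {"Loan": {}, "Finance": {}, "Credit": {}})
```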
In one embodiment, a method for constructing a target concept graph comprises the following steps. A first round of crawling is performed on a classification index page based on a preset seed knowledge source, all entries of the classification index page are crawled, and the crawled entries are taken as initial entries; preset field information of each initial entry is extracted to obtain the initial entry information corresponding to that initial entry, thereby obtaining an initial entry graph, which represents the relationship between each initial entry and its initial entry information. A second round of crawling is performed based on a preset encyclopedia site to crawl supplementary entry information for each initial entry in the initial entry graph, and similarity-based fusion processing is performed on the initial entry information of each initial entry and the supplementary entry information of that entry to obtain the fused entry information. Candidate words are extracted from the initial entry information, each candidate word is looked up among the entries in the entry graph, and a candidate word is taken as a pending word when it does not belong to the entries in the entry graph. A third round of crawling is performed based on the preset encyclopedia site, in which the pending words are crawled one by one; when a pending word is crawled, it is supplemented to the entry graph, thereby updating the entry graph. The number of pending words is counted; when the ratio of the number of pending words to the number of entries in the updated entry graph is smaller than a preset threshold, the iteration stops and the updated entry graph is taken as the target concept graph.
When the ratio of the number of pending words to the number of entries in the updated entry graph is greater than or equal to the preset threshold, the updated entry graph is taken as the entry graph, and the method returns to the step of reading the entry graph for iterative crawling. With this method, only a small number of knowledge sources and several mainstream knowledge sources are needed to construct a complete encyclopedic concept graph quickly and automatically, which can greatly improve efficiency. The method can rebuild the graph from scratch every week or every half month, so that the entries in the graph grow incrementally. The method can also capture the associations between concepts (namely entries), and can cover all concepts of a vertical domain together with their extended cross-domain concepts, which makes the graph more robust.
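The iterative crawling and its stopping criterion can be sketched in Python as follows. This is only an illustration under stated assumptions: the graph is modeled as a plain dictionary from entry to entry information, `crawl` and `extract_candidates` are hypothetical callbacks supplied by the caller, the fusion step is omitted, and `max_rounds` is a safety limit added here that the description does not mention.

```python
def build_target_graph(graph, crawl, extract_candidates, threshold=0.05, max_rounds=10):
    """Expand `graph` (entry -> entry info) until few pending words remain.

    crawl(word) returns entry info, or None when nothing was found.
    extract_candidates(info) returns the candidate words of one entry.
    """
    if not graph:
        return graph
    for _ in range(max_rounds):  # max_rounds is an added safety limit (assumption)
        # Candidate words that are not yet entries become pending words.
        pending = set()
        for info in list(graph.values()):
            pending.update(w for w in extract_candidates(info) if w not in graph)
        # Crawl each pending word and add it only when it was actually found.
        for word in sorted(pending):
            info = crawl(word)
            if info is not None:
                graph[word] = info
        # Stop once pending words are few relative to the updated graph.
        if len(pending) / len(graph) < threshold:
            break
    return graph


# Usage with a tiny in-memory "site" standing in for the encyclopedia.
site = {
    "bond": {"related": ["coupon"]},
    "coupon": {"related": ["bond", "yield"]},
    "yield": {"related": ["bond"]},
}
graph = build_target_graph(
    {"bond": site["bond"]}, site.get, lambda info: info["related"], threshold=0.3
)
```

In this toy run each round discovers one new word until no candidate falls outside the graph, at which point the ratio drops to zero and the loop ends.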
It should be understood that although the steps in the flow charts of fig. 2-4 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in fig. 2-4 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 5, there is provided an apparatus for constructing a target concept graph, including: an information acquisition module 502, an information processing module 504, a pending word acquisition module 506, a graph update module 508, and a graph determination module 510. The information acquisition module is used for reading the entry graph and acquiring the initial entries in the entry graph and the initial entry information corresponding to the initial entries; the information processing module is used for crawling supplementary entry information of the initial entries, and performing similarity-based fusion processing on the initial entry information and the supplementary entry information corresponding to the same initial entry to obtain entry information; the pending word acquisition module is used for extracting candidate words from the initial entry information, and taking a candidate word as a pending word when the candidate word does not belong to the entries in the entry graph; the graph update module is used for supplementing a pending word and the entry information corresponding to the pending word to the entry graph and updating the entry graph when the pending word is crawled; and the graph determination module is used for taking the updated entry graph as the target concept graph when the ratio of the number of pending words to the number of entries in the updated entry graph is smaller than a preset threshold.
In one embodiment, the information processing module includes: a similarity obtaining unit, used for obtaining a similarity score value of the initial entry information and the supplementary entry information corresponding to the same initial entry based on a preset similarity evaluation index; an entry processing unit, used for acquiring the respective character string lengths of the initial entry information and the supplementary entry information corresponding to the same initial entry when the similarity score value is larger than a preset score; a first comparison unit, used for taking the supplementary entry information as the entry information when the character string length of the initial entry information is smaller than a preset length threshold and the character string length of the supplementary entry information is larger than the preset length threshold; and a second comparison unit, used for taking the initial entry information as the entry information when the character string length of the initial entry information is larger than the preset length threshold and the character string length of the supplementary entry information is smaller than the preset length threshold. The preset length threshold lies between the character string length of the initial entry information and the character string length of the supplementary entry information.
In an embodiment, the similarity obtaining unit further includes an entry merging unit configured to merge initial entry information corresponding to the same initial entry with the supplementary entry information when the similarity score is smaller than or equal to a preset score, so as to obtain entry information.
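The similarity-based fusion performed by these units can be sketched in Python as below. The sketch rests on several assumptions that the description leaves open: `difflib.SequenceMatcher` stands in for the preset similarity evaluation index, merging is modeled as simple concatenation, the default thresholds are arbitrary, and the fallback for the case where both strings lie on the same side of the length threshold is an illustrative choice.

```python
from difflib import SequenceMatcher


def fuse(initial, supplementary, score_threshold=0.5, length_threshold=40):
    """Fuse two texts describing the same entry (thresholds are hypothetical)."""
    # Stand-in similarity index (assumption): difflib ratio in [0, 1].
    score = SequenceMatcher(None, initial, supplementary).ratio()
    if score <= score_threshold:
        # Dissimilar texts carry different content, so keep both.
        return initial + "\n" + supplementary
    # Similar texts: keep the one longer than the length threshold.
    if len(initial) < length_threshold < len(supplementary):
        return supplementary
    if len(supplementary) < length_threshold < len(initial):
        return initial
    # Not covered by the description: default to the initial text (assumption).
    return initial


# Usage with made-up entry texts.
short_text = "A bond is a debt instrument."
long_text = "A bond is a debt instrument issued by a government."
kept = fuse(short_text, long_text)
merged = fuse("abc", "xyz")
```

When two texts are similar, the longer one is assumed to subsume the shorter, so only it is kept; when they differ, both are retained to avoid losing information.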
In one embodiment, the pending word acquisition module comprises: an entity word extraction unit, used for performing word segmentation on first preset field information in the initial entry information to obtain the entity words in the initial entry information; and a candidate word acquisition unit, used for merging the entity words with the words in second preset field information of the initial entry information to obtain the candidate words corresponding to the initial entry information.
In one embodiment, the encyclopedic concept graph constructing device further includes an iteration module, configured to, when the ratio of the number of pending words to the number of entries in the updated entry graph is greater than or equal to a preset threshold, take the updated entry graph as the entry graph and return to the step of reading the entry graph.
In one embodiment, the information acquisition module comprises an initial entry acquisition unit, configured to perform entry crawling on a classification index page based on a preset seed knowledge source, and take a crawled entry as an initial entry; and the initial entry information acquisition unit is used for extracting the preset field information of the initial entry from the classification index page based on the preset field to obtain the initial entry information corresponding to the initial entry.
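As a rough illustration of collecting initial entries from a classification index page, the following Python sketch parses an already downloaded HTML string with the standard library `html.parser` module. The page structure, the idea that every link on the page is an entry, and the single `url` field recorded per entry are assumptions made for this example; extracting the actual preset fields would depend on the layout of the chosen seed knowledge source.

```python
from html.parser import HTMLParser


class IndexPageParser(HTMLParser):
    """Collect link texts from an index page as initial entries (assumption)."""

    def __init__(self):
        super().__init__()
        self.entries = {}   # entry name -> initial entry information
        self._href = None   # href of the link currently being read
        self._text = []     # text fragments inside the current link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            if name:
                # Only the link target is recorded here (hypothetical field).
                self.entries[name] = {"url": self._href}
            self._href = None


# Usage on a made-up index page fragment.
page = '<ul><li><a href="/wiki/bond">Bond</a></li><li><a href="/wiki/stock">Stock</a></li></ul>'
parser = IndexPageParser()
parser.feed(page)
```

The resulting dictionary plays the role of an initial entry graph, mapping each initial entry to its initial entry information.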
For specific limitations of the target concept graph constructing apparatus, reference may be made to the above limitations of the target concept graph constructing method, and details are not repeated here. The modules in the above apparatus may be implemented entirely or partially by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in fig. 6. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used for storing data such as the entry graph, candidate words, pending words, and the target concept graph. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a method of constructing a target concept graph.
Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
In one embodiment, there is provided a computer apparatus comprising a memory storing a computer program and a processor implementing the steps of the method of constructing a target concept graph in any of the embodiments when the processor executes the computer program.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which when executed by a processor, implements the steps of the method of constructing a target concept graph in any of the embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered within the scope of the present specification.
The above embodiments express only several implementations of the present application, and although their description is specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method of constructing a target concept graph, the method comprising:
reading an entry graph, and acquiring initial entries in the entry graph and initial entry information corresponding to the initial entries;
crawling supplementary entry information of the initial entries, and performing similarity-based fusion processing on the initial entry information and the supplementary entry information corresponding to the same initial entry to obtain entry information;
extracting candidate words from the initial entry information, and taking a candidate word as a pending word when the candidate word does not belong to the entries in the entry graph;
when a pending word is crawled, supplementing the pending word and entry information corresponding to the pending word to the entry graph, and updating the entry graph;
and when the ratio of the number of pending words to the number of entries in the updated entry graph is smaller than a preset threshold, taking the updated entry graph as the target concept graph.
2. The method according to claim 1, wherein performing similarity-based fusion processing on the initial entry information and the supplementary entry information corresponding to the same initial entry to obtain entry information comprises:
obtaining a similarity score value of the initial entry information and the supplementary entry information corresponding to the same initial entry based on a preset similarity evaluation index;
when the similarity score value is larger than a preset score, acquiring the respective character string lengths of the initial entry information and the supplementary entry information corresponding to the same initial entry;
when the character string length of the initial entry information is smaller than a preset length threshold and the character string length of the supplementary entry information is larger than the preset length threshold, taking the supplementary entry information as the entry information;
when the character string length of the initial entry information is larger than the preset length threshold and the character string length of the supplementary entry information is smaller than the preset length threshold, taking the initial entry information as the entry information; wherein the preset length threshold lies between the character string length of the initial entry information and the character string length of the supplementary entry information.
3. The method according to claim 2, wherein after obtaining the similarity score value of the initial entry information and the supplementary entry information corresponding to the same initial entry based on the preset similarity evaluation index, the method further comprises:
when the similarity score value is smaller than or equal to the preset score, merging the initial entry information and the supplementary entry information corresponding to the same initial entry to obtain the entry information.
4. The method of claim 1, wherein the extracting candidate words from the initial entry information comprises:
performing word segmentation processing on first preset field information in the initial entry information to obtain entity words in the initial entry information;
and combining the entity words with words in second preset field information of the initial entry information to obtain candidate words corresponding to the initial entry information.
5. The method of claim 1, further comprising:
when the ratio of the number of pending words to the number of entries in the updated entry graph is greater than or equal to the preset threshold, taking the updated entry graph as the entry graph, and returning to the step of reading the entry graph.
6. The method of claim 1, wherein the obtaining of the initial entry in the entry graph and the initial entry information corresponding to the initial entry comprises:
crawling entries on a classification index page based on a preset seed knowledge source, and taking the crawled entries as the initial entries;
and extracting preset field information of the initial entry from the classification index page based on a preset field to obtain initial entry information corresponding to the initial entry.
7. An apparatus for constructing a target concept graph, the apparatus comprising:
an information acquisition module, used for reading an entry graph and acquiring initial entries in the entry graph and initial entry information corresponding to the initial entries;
an information processing module, used for crawling supplementary entry information of the initial entries, and performing similarity-based fusion processing on the initial entry information and the supplementary entry information corresponding to the same initial entry to obtain entry information;
a pending word acquisition module, used for extracting candidate words from the initial entry information, and taking a candidate word as a pending word when the candidate word does not belong to the entries in the entry graph;
a graph update module, used for supplementing a pending word and entry information corresponding to the pending word to the entry graph and updating the entry graph when the pending word is crawled;
and a graph determination module, used for taking the updated entry graph as the target concept graph when the ratio of the number of pending words to the number of entries in the updated entry graph is smaller than a preset threshold.
8. The apparatus of claim 7, wherein the information processing module comprises:
a similarity obtaining unit, used for obtaining a similarity score value of the initial entry information and the supplementary entry information corresponding to the same initial entry based on a preset similarity evaluation index;
an entry processing unit, used for acquiring the respective character string lengths of the initial entry information and the supplementary entry information corresponding to the same initial entry when the similarity score value is larger than a preset score;
a first comparison unit, used for taking the supplementary entry information as the entry information when the character string length of the initial entry information is smaller than a preset length threshold and the character string length of the supplementary entry information is larger than the preset length threshold;
a second comparison unit, used for taking the initial entry information as the entry information when the character string length of the initial entry information is larger than the preset length threshold and the character string length of the supplementary entry information is smaller than the preset length threshold; wherein the preset length threshold lies between the character string length of the initial entry information and the character string length of the supplementary entry information.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN201910848493.7A 2019-09-09 2019-09-09 Target concept graph construction method and device, computer equipment and storage medium Pending CN110781310A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910848493.7A CN110781310A (en) 2019-09-09 2019-09-09 Target concept graph construction method and device, computer equipment and storage medium
PCT/CN2020/106256 WO2021047327A1 (en) 2019-09-09 2020-07-31 Method and apparatus for constructing target concept map, computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910848493.7A CN110781310A (en) 2019-09-09 2019-09-09 Target concept graph construction method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110781310A true CN110781310A (en) 2020-02-11

Family

ID=69384089

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910848493.7A Pending CN110781310A (en) 2019-09-09 2019-09-09 Target concept graph construction method and device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN110781310A (en)
WO (1) WO2021047327A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112395391A (en) * 2020-11-17 2021-02-23 中国平安人寿保险股份有限公司 Concept graph construction method and device, computer equipment and storage medium
WO2021047327A1 (en) * 2019-09-09 2021-03-18 深圳壹账通智能科技有限公司 Method and apparatus for constructing target concept map, computer device, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130283138A1 (en) * 2012-04-24 2013-10-24 Wo Hai Tao Method for creating knowledge map
CN106777274A (en) * 2016-06-16 2017-05-31 北京理工大学 A kind of Chinese tour field knowledge mapping construction method and system
CN106844658A (en) * 2017-01-23 2017-06-13 中山大学 A kind of Chinese text knowledge mapping method for auto constructing and system
CN109359178A (en) * 2018-09-14 2019-02-19 华南师范大学 A kind of search method, device, storage medium and equipment
CN109885691A (en) * 2019-01-08 2019-06-14 平安科技(深圳)有限公司 Knowledge mapping complementing method, device, computer equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110781310A (en) * 2019-09-09 2020-02-11 深圳壹账通智能科技有限公司 Target concept graph construction method and device, computer equipment and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021047327A1 (en) * 2019-09-09 2021-03-18 深圳壹账通智能科技有限公司 Method and apparatus for constructing target concept map, computer device, and storage medium
CN112395391A (en) * 2020-11-17 2021-02-23 中国平安人寿保险股份有限公司 Concept graph construction method and device, computer equipment and storage medium
CN112395391B (en) * 2020-11-17 2023-11-03 中国平安人寿保险股份有限公司 Concept graph construction method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
WO2021047327A1 (en) 2021-03-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200211