WO2020114022A1 - Knowledge base alignment method and apparatus, computer device and storage medium - Google Patents
Knowledge base alignment method and apparatus, computer device and storage medium
- Publication number
- WO2020114022A1 (PCT/CN2019/103487)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- knowledge
- entities
- similarity
- clustering
- entity
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- This application relates to the technical field of knowledge base processing, and in particular, to a knowledge base alignment method, device, computer equipment, and storage medium.
- The knowledge base has positive value for the sharing and dissemination of information.
- The information in a single knowledge base is limited and in some cases cannot meet users' needs; a knowledge base is also usually expanded continuously, so the storage resources it occupies keep growing, yet the data added to it may be redundant, and this redundancy wastes storage resources.
- It also increases the amount of search computation and duplicates search results, causing inconvenience to users.
- Knowledge base alignment refers to finding, among entities from different sources, those that correspond to the same real-world thing.
- An entity here is something that exists objectively and can be distinguished from other things, including concrete people, events and objects as well as abstract concepts and relationships. Knowledge base alignment, that is, extracting entity information and removing redundancy, is therefore a key issue in building a high-quality knowledge base.
- A common approach to knowledge base alignment is to use an entity's attribute information to decide whether entities from different sources can be aligned. Because entity data is user-generated content (UGC) and the quality of data edited by different users is uneven, it is difficult to accurately determine whether two entities are the same based only on user-edited attribute information.
- UGC User Generated Content
- Embodiments of the present application provide a knowledge base alignment method, apparatus, computer equipment, and storage medium.
- a knowledge base alignment method includes:
- the knowledge base alignment device includes:
- An acquisition module for acquiring a vector set of knowledge entities, wherein the vector set of knowledge entities is a vectorized representation of the knowledge entities in the knowledge base to be aligned;
- a processing module configured to input the knowledge entity vector set into a preset knowledge entity clustering model to obtain a clustering result of knowledge entities in the knowledge base to be aligned;
- a calculation module used to select any two knowledge entities belonging to the same class according to the clustering result, and calculate the similarity between the two knowledge entities
- the execution module is configured to merge the two knowledge entities when the similarity is greater than the set first threshold.
- A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor; when executing the computer-readable instructions, the processor implements the steps of any of the above knowledge base alignment methods.
- A readable storage medium stores computer-readable instructions that, when executed by a processor, implement the steps of any of the above knowledge base alignment methods.
- FIG. 1 is a schematic flowchart of a knowledge base alignment method according to an embodiment of this application
- FIG. 2 is a schematic diagram of vectorization of knowledge entities based on the TF-IDF algorithm according to an embodiment of the present application
- FIG. 3 is a schematic diagram of a training process of a clustering model based on a convolutional neural network according to an embodiment of the present application
- FIG. 4 is a schematic diagram of a calculation process of similarity of knowledge entities according to an embodiment of the present application.
- FIG. 5 is a schematic diagram of a process of merging knowledge entities according to an embodiment of the present application.
- FIG. 6 is a block diagram of a basic structure of a knowledge base alignment device according to an embodiment of the present application.
- FIG. 7 is a block diagram of the basic structure of a computer device for implementing this application.
- The terms "terminal" and "terminal device" used herein include wireless-signal-receiver devices that have only a receiver without transmit capability, as well as devices with both receiving and transmitting hardware capable of two-way communication over a bidirectional communication link.
- Such devices may include: cellular or other communication devices with a single-line or multi-line display, or without a multi-line display; PCS (Personal Communications Service) devices, which may combine voice, data processing, fax and/or data communication capabilities; PDAs (Personal Digital Assistants), which may include a radio-frequency receiver, pager, Internet/intranet access, web browser, notepad, calendar and/or GPS (Global Positioning System) receiver; and conventional laptop and/or palmtop computers or other devices that have and/or include a radio-frequency receiver.
- PCS Personal Communications Services
- PDA Personal Digital Assistant
- "Terminal" and "terminal device" may be portable, transportable, installed in a vehicle (air, sea and/or land), or suitable for and/or configured to operate locally and/or in distributed form at any other location on earth and/or in space.
- The "terminal" and "terminal device" used herein may also be a communication terminal, an Internet terminal, or a music/video playback terminal, for example a PDA, a MID (Mobile Internet Device) and/or a mobile phone with music/video playback functions, or a device such as a smart TV or set-top box.
- the terminal in this embodiment is the above-mentioned terminal.
- FIG. 1 is a schematic diagram of a basic process of a knowledge base alignment method according to this embodiment.
- a knowledge base alignment method includes the following steps:
- the knowledge entities stored in the knowledge base are usually text or pictures. When aligning the knowledge entities, it is usually necessary to calculate the similarity between the knowledge entities.
- the knowledge entities need to be converted into vectors.
- the vectorized representation of text is realized by a vector space model, also known as a bag of words model.
- The simplest scheme is word-level one-hot encoding: each word in the dictionary is one dimension, the positions corresponding to words that appear in the text are set to 1 and all others to 0, and the vector length equals the dictionary size.
- S102 Input the knowledge entity vector set into a preset knowledge entity clustering model to obtain a clustering result of knowledge entities in the knowledge base to be aligned;
- the vector set representing the knowledge entity is input into a preset knowledge entity clustering model.
- the clustering model of knowledge entities adopts a density-based clustering algorithm.
- A density-based clustering algorithm does not need the number of clusters to be specified in advance; it can find clusters of arbitrary shape, identify noise points, and is robust to outliers, which it can also detect.
- DBSCAN is one of the most typical representative algorithms of this type. Its core idea is to first find points with higher density and then gradually connect nearby high-density points into one piece, thereby generating the clusters.
- The algorithm works as follows: for each data point taken as a circle centre, draw a circle of radius eps (its eps-neighbourhood) and count how many points fall inside; this count is the density value of the point. A density threshold MinPts is then selected: a centre point whose circle contains fewer than MinPts points is a low-density point, while one whose circle contains at least MinPts points is a high-density point (a core point). If a high-density point lies inside the circle of another high-density point, the two points are connected, and in this way many points can be chained together.
- In some embodiments, a trained convolutional neural network model is used to implement the clustering: by training the convolutional neural network to learn the features of manually clustered training samples, the model can cluster knowledge entities as expected.
- In step S102 the knowledge entities in the knowledge base are clustered; then, within the same class, the similarity of any two knowledge entities is calculated to determine whether redundant entities exist, which narrows the scope of the comparison of knowledge entities, reduces the amount of calculation, and improves the efficiency of determining whether redundant entities exist.
- the similarity of two knowledge entities is obtained by calculating the similarity between vectors representing two knowledge entities.
- the similarity between two vectors may be a cosine similarity.
- Cosine similarity measures the similarity between two vectors by measuring the cosine of the angle between the two vectors.
- the cosine value of an angle of 0 degrees is 1, and the cosine value of any other angle is not greater than 1; and its minimum value is -1.
- the cosine of the angle between the two vectors determines whether the two vectors are pointing in roughly the same direction.
- When the two vectors have the same direction, the cosine similarity is 1; when the angle between the two vectors is 90°, the cosine similarity is 0; when the two vectors point in completely opposite directions, the cosine similarity is -1.
- This result has nothing to do with the length of the vector, only the direction of the vector.
- Cosine similarity is applicable to any dimension vector space, and is often used in high-dimensional positive space, so it is suitable for comparison of text files.
- x_1i and x_2i are the values of each dimension of the normalized vectors X_1 and X_2 used in the Euclidean-distance calculation.
- a threshold is set in advance, which is called the first threshold here.
- When the similarity of the two knowledge entities is greater than the set first threshold, the content of the two knowledge entities is considered partially duplicated, and the two knowledge entities are merged into one entity.
- The knowledge entities can be obtained by accessing the server where the knowledge base is located.
- The knowledge entities may belong to the same knowledge base or come from multiple knowledge bases.
- TF-IDF is a statistical method used to evaluate the importance of a word to a document in a document set or a corpus. The importance of a word increases proportionally with the number of times it appears in the document, but at the same time it decreases inversely with the frequency of its appearance in the corpus.
- TF-IDF is simply TF × IDF, where TF is the term frequency and IDF is the inverse document frequency.
- TF indicates the frequency of entries in document d.
- the training of the clustering model based on the convolutional neural network includes the following steps:
- The training goal of the convolutional neural network is to identify the category to which a knowledge entity belongs; by learning the manually labelled category features in the training samples, the convolutional neural network model implements the clustering of knowledge entities.
- The convolutional neural network model is composed of convolutional layers, pooling layers, a fully connected layer, and a classification layer.
- The convolutional layers are used to perceive the knowledge entity vector locally and are usually connected in a cascade; the later a convolutional layer is in the cascade, the more global the information it can perceive.
- The fully connected layer acts as a "classifier" in the convolutional neural network: if the convolutional, pooling, and activation-function layers map the original data into a hidden-layer feature space, the fully connected layer maps the learned "distributed feature representation" to the sample label space.
- The fully connected layer is connected to the output of the convolutional layers and can perceive the global features of the knowledge entity vector.
- The training samples are input into the convolutional neural network model, and the clustering reference information output by the model is obtained.
- The softmax cross-entropy loss function used in the embodiment of the present application is defined as follows: suppose there are N training samples; for the i-th sample, the input feature at the last layer of the network is X_i and its corresponding label is Y_i; h = (h_1, h_2, ..., h_C) is the final output of the network, that is, the prediction for sample i, where C is the number of classes.
- the weight of each node in the convolutional neural network model is adjusted by gradient descent method, which is an optimization algorithm used in machine learning and artificial intelligence to recursively approximate the minimum deviation model.
- Clustering knowledge entities through the trained convolutional neural network model can make the clustering results closer to the user's expectations.
- step S103 further includes the following steps:
- Even when two knowledge entities are not very similar in content, they may both correspond to one entity in reality; that is, the two knowledge entities describe two parts of the information about a single real-world entity.
- Attribute similarity is therefore introduced. The attributes of a knowledge entity are obtained first; attributes are the data used to describe the knowledge entity and may also be called tags.
- The edit distance is used to measure the similarity between the attributes of two knowledge entities.
- The edit distance is the minimum number of character operations required to convert character string A into character string B.
- Character operations include: deleting a character, modifying a character, and inserting a character.
- the cost of each operation is set to 1, and the attribute similarity can be calculated by the following formula:
- Attribute similarity = (1 - edit distance) / maximum length of the two attribute strings
- The vector similarity is the aforementioned cosine similarity or Euclidean distance, which measures the similarity between the vectors of the two knowledge entities.
- S is the similarity between the two knowledge entities;
- X is the attribute similarity;
- Y is the vector similarity;
- a and b are the weights of the attribute similarity and the vector similarity, respectively.
- Step S104 also includes the following steps:
- The second threshold is greater than the aforementioned first threshold, for example 0.95; the two knowledge entities are then considered essentially the same, and deleting either one of them from the knowledge base is an effective way to remove redundancy.
- step S104 further includes the following steps:
- When the similarity of the two knowledge entities is greater than the preset first threshold, part of their content is considered duplicated.
- To remove the duplicated content, the two knowledge entities can first be divided into several sub-entities according to certain rules, for example by content paragraph.
- When the similarity between two sub-entities is greater than a preset threshold, referred to here as the third threshold, the content of the two sub-entities is considered essentially duplicated and either one of them is deleted; to avoid deleting too much content, the third threshold is required to be greater than the aforementioned first threshold.
- The retained sub-entities are then merged as the alignment result of the two knowledge entities to be aligned.
- FIG. 6 is a block diagram of the basic structure of the knowledge base alignment device of this embodiment.
- a knowledge base alignment device includes: an acquisition module 210, a processing module 220, a calculation module 230, and an execution module 240.
- The obtaining module 210 is used to obtain a knowledge entity vector set, where the knowledge entity vector set is a vectorized representation of the knowledge entities in the knowledge base to be aligned.
- The processing module 220 is used to input the knowledge entity vector set into the preset knowledge entity clustering model to obtain the clustering result of the knowledge entities in the knowledge base to be aligned.
- The calculation module 230 is used to select, according to the clustering result, any two knowledge entities belonging to the same class and calculate the similarity between the two knowledge entities.
- The execution module 240 is configured to merge the two knowledge entities when the similarity is greater than the set first threshold.
- A clustering result of the knowledge entities in the knowledge base to be aligned is obtained; according to the clustering result, any two knowledge entities belonging to the same class are selected, the similarity between them is calculated, and when the similarity is greater than the set first threshold the two knowledge entities are merged.
- The comparison of the similarity of two knowledge entities is limited to entities of the same class, which greatly reduces the amount of calculation. The similarity calculation combines the attribute similarity and the vector similarity of the entities, making it more reasonable and more effective at discovering and removing redundant information.
- the knowledge base alignment device further includes: a first acquisition submodule and a first processing submodule.
- The first acquisition submodule is used to acquire the knowledge entities in the knowledge base to be aligned; the first processing submodule is used to vectorize the knowledge entities based on the TF-IDF algorithm to obtain the knowledge entity vector set.
- the predetermined knowledge entity clustering model in the knowledge base alignment device uses a DBSCAN density clustering algorithm.
- the knowledge entity clustering model preset in the knowledge base alignment device uses a convolution neural network-based clustering model.
- the calculation module 230 includes: a second acquisition submodule, a first calculation submodule, and a second calculation submodule.
- a second acquisition submodule is used to acquire the attributes of the two knowledge entities, wherein the attributes of the knowledge entity are data describing the corresponding knowledge entities;
- The first calculation submodule is used to calculate the attribute similarity and vector similarity of the two knowledge entities; the second calculation submodule is used to calculate, according to the formula S = aX + bY, the weighted sum of the attribute similarity and vector similarity of the two knowledge entities to obtain the similarity between them, where:
- S is the similarity between the two knowledge entities;
- X is the attribute similarity;
- Y is the vector similarity;
- a and b are the weights of the attribute similarity and the vector similarity, respectively.
- The execution module 240 includes a first execution submodule, which is used to delete either of the two knowledge entities from the knowledge base to be aligned when the similarity is greater than a set second threshold, where the second threshold is greater than the first threshold.
- the execution module 240 includes: a first division submodule, a third calculation submodule, a second execution submodule, a first loop submodule, and a third execution submodule.
- the first division sub-module is used to divide the two knowledge entities into several sub-entities;
- The third calculation submodule is used to select any two of the several sub-entities and calculate the similarity between the two sub-entities;
- the second execution submodule is used to delete either of the two sub-entities when the similarity between them is greater than a preset third threshold, where the third threshold is greater than the first threshold;
- the first loop submodule is used to make the third calculation submodule and the second execution submodule run repeatedly until the similarity between any two of the retained sub-entities is less than or equal to the preset third threshold;
- the third execution submodule is configured to merge the retained sub-entities as the alignment result of the two knowledge entities.
- FIG. 7 is a block diagram of the basic structure of the computer device of this embodiment.
- the computer device includes a processor, a non-volatile storage medium, a memory, and a network interface connected through a system bus.
- the non-volatile storage medium of the computer device stores an operating system, a database, and computer-readable instructions.
- the database may store a sequence of control information.
- When the computer-readable instructions are executed by the processor, the processor may implement a knowledge base alignment method.
- the processor of the computer device is used to provide calculation and control capabilities, and support the operation of the entire computer device.
- The memory of the computer device may store computer-readable instructions that, when executed by the processor, cause the processor to perform a knowledge base alignment method.
- the network interface of the computer device is used to connect and communicate with the terminal.
- FIG. 7 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer equipment to which the solution of the present application is applied.
- A specific computer device may include more or fewer components than shown in the figure, combine some components, or have a different component arrangement.
- the processor is used to execute the specific content of the acquisition module 210, the processing module 220, the calculation module 230, and the execution module 240 in FIG. 6.
- the memory stores computer-readable instructions and various types of data required to execute the above modules.
- the network interface is used for data transmission between user terminals or servers.
- the memory in this embodiment stores the computer-readable instructions and data required to execute all submodules in the knowledge base alignment method, and the server can call the computer-readable instructions and data of the server to execute the functions of all submodules.
- the computer device obtains the knowledge entity vector set and inputs the knowledge entity vector set into a preset knowledge entity clustering model to obtain a clustering result of the knowledge entities in the knowledge base to be aligned, and according to the clustering result, Select any two knowledge entities that belong to the same class, calculate the similarity between the two knowledge entities, and when the similarity is greater than the set first threshold, merge the two knowledge entities.
- the comparison of the similarity of two knowledge entities is limited to the same type of entity, which greatly reduces the amount of calculation. Among them, the calculation of similarity combines the attribute similarity of the entity and the vector similarity, making the calculation of similarity more reasonable and more effective.
- The present application also provides a storage medium storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the knowledge base alignment method described in any of the foregoing embodiments.
- the computer-readable instructions may include the processes of the foregoing method embodiments.
- the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Disclosed in the embodiments of the present application are a knowledge base alignment method and apparatus, a computer device and a storage medium. The method comprises the following steps: obtaining a knowledge entity vector set, wherein the knowledge entity vector set is a vectorized representation of knowledge entities in a knowledge base to be aligned; inputting the knowledge entity vector set into a preset knowledge entity clustering model to obtain a clustering result of the knowledge entities in said knowledge base; selecting any two knowledge entities of the same type according to the clustering result, and calculating the similarity between the two knowledge entities; and if the similarity is greater than a set first threshold, combining the two knowledge entities. The comparison for the similarity between two knowledge entities is limited to the entities of the same type, so the calculation amount is greatly reduced; during clustering, the clustering is achieved by means of an artificial intelligence technology, so that the clustering result is more in line with the expectation; the similarity calculation integrates the attribute similarity and vector similarity of the entities, thus the similarity calculation is more reasonable, and redundant information can be more effectively found and removed.
Description
This application is based on, and claims priority from, the Chinese invention patent application No. 201811474699X, filed on December 4, 2018 and titled "A Knowledge Base Alignment Method, Device, Computer Equipment, and Storage Medium".
This application relates to the technical field of knowledge base processing, and in particular to a knowledge base alignment method, device, computer equipment, and storage medium.
With the development of the Internet, more and more knowledge bases have been built in various fields, and they are widely used in Internet applications such as search services and automatic question answering. Knowledge bases have positive value for the sharing and dissemination of information. However, the information in a single knowledge base is limited and in some cases cannot meet users' needs. A knowledge base is also usually expanded continuously, so the storage resources it occupies keep growing, yet the data added to it may be redundant. This redundancy wastes storage resources; at the same time it increases the amount of search computation and duplicates search results, which is inconvenient for users.
Knowledge base alignment refers to finding, among entities from different sources, those that correspond to the same real-world thing. An entity here is something that exists objectively and can be distinguished from other things, including concrete people, events and objects as well as abstract concepts and relationships. Knowledge base alignment, that is, extracting entity information and removing redundancy, is therefore a key issue in building a high-quality knowledge base.
A common approach to knowledge base alignment is to use an entity's attribute information to decide whether entities from different sources can be aligned. Because entity data is user-generated content (UGC) and the quality of data edited by different users is uneven, it is difficult to accurately determine whether two entities are the same based only on user-edited attribute information.
[Summary of the Invention]
Embodiments of the present application provide a knowledge base alignment method, apparatus, computer equipment, and storage medium.
A knowledge base alignment method includes:
acquiring a knowledge entity vector set, where the knowledge entity vector set is a vectorized representation of the knowledge entities in the knowledge base to be aligned;
inputting the knowledge entity vector set into a preset knowledge entity clustering model to obtain a clustering result of the knowledge entities in the knowledge base to be aligned;
according to the clustering result, selecting any two knowledge entities belonging to the same class and calculating the similarity between the two knowledge entities;
when the similarity is greater than a set first threshold, merging the two knowledge entities.
A knowledge base alignment apparatus includes:
an acquisition module for acquiring a knowledge entity vector set, where the knowledge entity vector set is a vectorized representation of the knowledge entities in the knowledge base to be aligned;
a processing module for inputting the knowledge entity vector set into a preset knowledge entity clustering model to obtain a clustering result of the knowledge entities in the knowledge base to be aligned;
a calculation module for selecting, according to the clustering result, any two knowledge entities belonging to the same class and calculating the similarity between the two knowledge entities;
an execution module for merging the two knowledge entities when the similarity is greater than the set first threshold.
A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor; when executing the computer-readable instructions, the processor implements the steps of any of the above knowledge base alignment methods.
A readable storage medium stores computer-readable instructions that, when executed by a processor, implement the steps of any of the above knowledge base alignment methods.
The details of one or more embodiments of the present application are set forth in the following drawings and description; other features and advantages of the present application will become apparent from the description, the drawings, and the claims.
In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without creative work.
FIG. 1 is a schematic flowchart of a knowledge base alignment method according to an embodiment of this application;
FIG. 2 is a schematic diagram of vectorization of knowledge entities based on the TF-IDF algorithm according to an embodiment of the present application;
FIG. 3 is a schematic diagram of the training process of a clustering model based on a convolutional neural network according to an embodiment of the present application;
FIG. 4 is a schematic diagram of the calculation process of the similarity of knowledge entities according to an embodiment of the present application;
FIG. 5 is a schematic diagram of the process of merging knowledge entities according to an embodiment of the present application;
FIG. 6 is a block diagram of the basic structure of a knowledge base alignment apparatus according to an embodiment of the present application;
FIG. 7 is a block diagram of the basic structure of a computer device for implementing this application.
In order to enable those skilled in the art to better understand the solution of the present application, the technical solutions in the embodiments of the present application will be described clearly and completely below in conjunction with the drawings in the embodiments of the present application.
Some of the processes described in the specification, the claims, and the above drawings contain multiple operations that appear in a specific order, but it should be clearly understood that these operations may be executed out of the order in which they appear herein or in parallel. Sequence numbers of operations such as 101 and 102 are only used to distinguish different operations; the sequence numbers themselves do not represent any execution order. In addition, these processes may include more or fewer operations, and these operations may be performed sequentially or in parallel. It should be noted that descriptions such as "first" and "second" herein are used to distinguish different messages, devices, modules, and the like; they do not represent an order, nor do they require that the "first" and "second" be of different types.
The technical solutions in the embodiments of the present application will be described clearly and completely below in conjunction with the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments in the present application, all other embodiments obtained by those skilled in the art without creative work fall within the protection scope of the present application.
Embodiments
Those skilled in the art will understand that the terms "terminal" and "terminal device" used herein include wireless-signal-receiver devices that have only a receiver without transmit capability, as well as devices with receiving and transmitting hardware capable of two-way communication over a bidirectional communication link. Such devices may include: cellular or other communication devices with a single-line or multi-line display, or without a multi-line display; PCS (Personal Communications Service) devices, which may combine voice, data processing, fax and/or data communication capabilities; PDAs (Personal Digital Assistants), which may include a radio-frequency receiver, pager, Internet/intranet access, web browser, notepad, calendar and/or GPS (Global Positioning System) receiver; and conventional laptop and/or palmtop computers or other devices that have and/or include a radio-frequency receiver. The "terminal" and "terminal device" used herein may be portable, transportable, installed in a vehicle (air, sea and/or land), or suitable for and/or configured to operate locally and/or in distributed form at any other location on earth and/or in space. The "terminal" and "terminal device" used herein may also be a communication terminal, an Internet terminal, or a music/video playback terminal, for example a PDA, a MID (Mobile Internet Device) and/or a mobile phone with music/video playback functions, or a device such as a smart TV or set-top box.
The terminal in this embodiment is the above-mentioned terminal.
Specifically, please refer to FIG. 1, which is a schematic diagram of the basic flow of the knowledge base alignment method of this embodiment.
As shown in FIG. 1, a knowledge base alignment method includes the following steps:
S101. Acquire a knowledge entity vector set, where the knowledge entity vector set is a vectorized representation of the knowledge entities in the knowledge base to be aligned.
The knowledge entities stored in a knowledge base are usually text or pictures. When aligning knowledge entities, it is usually necessary to calculate the similarity between them; to facilitate processing and understanding by a computer, the knowledge entities need to be converted into vectors. For example, the vectorized representation of text can be realized with a vector space model, also known as the bag-of-words model. The simplest scheme is word-level one-hot encoding: each word in the dictionary is one dimension, the positions corresponding to words that appear in the text are set to 1 and all others to 0, and the vector length equals the dictionary size.
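To make the bag-of-words idea concrete, the following is a minimal Python sketch of word-level one-hot/bag-of-words encoding; the whitespace tokenization and the tiny example texts are illustrative assumptions, not details taken from the patent.

```python
def bag_of_words_vectors(texts):
    """Build a dictionary from whitespace-tokenized texts and return binary
    bag-of-words vectors: positions of words present in a text are 1, others 0."""
    vocab = sorted({word for text in texts for word in text.split()})
    index = {word: i for i, word in enumerate(vocab)}
    vectors = []
    for text in texts:
        vec = [0] * len(vocab)      # vector length equals dictionary size
        for word in text.split():
            vec[index[word]] = 1    # position of a word that occurs is set to 1
        vectors.append(vec)
    return vocab, vectors

vocab, vectors = bag_of_words_vectors(["knowledge base alignment", "knowledge entity vector"])
```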
S102. Input the knowledge entity vector set into a preset knowledge entity clustering model to obtain a clustering result of the knowledge entities in the knowledge base to be aligned.
The vector set representing the knowledge entities is input into a preset knowledge entity clustering model. The clustering model adopts a density-based clustering algorithm. A density-based clustering algorithm does not need the number of clusters to be specified in advance; it can find clusters of arbitrary shape, identify noise points, and is robust to outliers, which it can also detect. DBSCAN is one of the most typical representative algorithms of this type. Its core idea is to first find points with higher density and then gradually connect nearby high-density points into one piece, thereby generating the clusters. The algorithm works as follows: for each data point taken as a circle centre, draw a circle of radius eps (its eps-neighbourhood) and count how many points fall inside; this count is the density value of the point. A density threshold MinPts is then selected: a centre point whose circle contains fewer than MinPts points is a low-density point, while one whose circle contains at least MinPts points is a high-density point (a core point). If a high-density point lies inside the circle of another high-density point, the two points are connected, and in this way many points can be chained together. After that, if a low-density point also lies inside the circle of a high-density point, it is connected to the nearest high-density point and is called a boundary point. In this way all points that can be connected together form one cluster, and low-density points that are not inside the circle of any high-density point are outliers.
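As one possible implementation of the density-based clustering described above, the following sketch uses scikit-learn's DBSCAN; the library choice and the eps/min_samples values are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Rows are vectorized knowledge entities (random data stands in for real vectors).
entity_vectors = np.random.rand(100, 300)

# eps is the neighbourhood radius; min_samples plays the role of MinPts.
clustering = DBSCAN(eps=0.5, min_samples=5).fit(entity_vectors)

labels = clustering.labels_  # cluster index per entity; -1 marks noise / outliers
```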
In some embodiments, a trained convolutional neural network model is used to implement the clustering: by training the convolutional neural network to learn the features of manually clustered training samples, the model can cluster knowledge entities as expected.
S103. According to the clustering result, select any two knowledge entities belonging to the same class and calculate the similarity between the two knowledge entities.
In step S102 the knowledge entities in the knowledge base are clustered; then, within the same class, the similarity of any two knowledge entities is calculated to determine whether redundant entities exist. This narrows the scope of the comparison of knowledge entities, reduces the amount of calculation, and improves the efficiency of detecting redundant entities.
The similarity of two knowledge entities is obtained by calculating the similarity between the vectors representing them. The similarity between two vectors may be the cosine similarity. Cosine similarity measures the similarity of two vectors by the cosine of the angle between them. The cosine of a 0-degree angle is 1, the cosine of any other angle is not greater than 1, and its minimum value is -1, so the cosine of the angle between two vectors indicates whether they point in roughly the same direction. When two vectors have the same direction the cosine similarity is 1; when the angle between them is 90° it is 0; when they point in completely opposite directions it is -1. The result is independent of the vectors' lengths and depends only on their directions. Cosine similarity is applicable to vector spaces of any dimension and is often used in high-dimensional positive spaces, so it is suitable for comparing text documents.
The similarity between two vectors can also be measured by the Euclidean distance between them. To avoid the influence of scale, the vectors are first normalized, and the distance between two points X_1 and X_2 in the vector space is then computed as d(X_1, X_2) = sqrt( Σ_i (x_1i - x_2i)² ), where x_1i and x_2i are the values of each dimension of X_1 and X_2 after normalization.
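A minimal sketch of the two vector similarity measures discussed above, written with NumPy as an assumed dependency; the patent itself does not prescribe a particular library.

```python
import numpy as np

def cosine_similarity(x1, x2):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    return float(np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2)))

def normalized_euclidean_distance(x1, x2):
    """Euclidean distance between the two vectors after L2 normalization."""
    x1 = np.asarray(x1) / np.linalg.norm(x1)
    x2 = np.asarray(x2) / np.linalg.norm(x2)
    return float(np.sqrt(np.sum((x1 - x2) ** 2)))
```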
S104. When the similarity is greater than the set first threshold, merge the two knowledge entities.
A threshold, referred to here as the first threshold, is set in advance. When the similarity of the two knowledge entities is greater than the set first threshold, part of the content of the two knowledge entities is considered duplicated, and the two knowledge entities are merged into one entity.
As shown in FIG. 2, before S101 the method further includes the following steps:
S111. Acquire the knowledge entities in the knowledge base to be aligned.
The knowledge entities are obtained by accessing the server where the knowledge base is located; they may belong to the same knowledge base or come from multiple knowledge bases.
S112. Vectorize the knowledge entities based on the TF-IDF algorithm to obtain the knowledge entity vector set.
Besides the aforementioned bag-of-words vectorization, knowledge entities can also be vectorized based on the TF-IDF algorithm. TF-IDF is a statistical method used to evaluate the importance of a word to a document in a document set or corpus. The importance of a word increases proportionally with the number of times it appears in the document, but decreases inversely with the frequency of its appearance in the corpus. TF-IDF is simply TF × IDF, where TF is the term frequency and IDF is the inverse document frequency; TF indicates how frequently a term appears in document d. When TF-IDF is used to vectorize text, a dictionary is likewise constructed and the TF-IDF value of each word is used as that word's weight.
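As an illustration of TF-IDF vectorization, the following sketch assumes scikit-learn's TfidfVectorizer and two toy entity texts; the patent specifies only the TF-IDF weighting, not this library or these inputs.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

entity_texts = [
    "entity description text from knowledge base A",
    "entity description text from knowledge base B",
]

vectorizer = TfidfVectorizer()                            # builds the dictionary internally
entity_vectors = vectorizer.fit_transform(entity_texts)   # sparse matrix of TF-IDF weights

# Each row is the TF-IDF weighted vector of one knowledge entity.
print(entity_vectors.shape)
```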
As shown in FIG. 3, the training of the clustering model based on the convolutional neural network includes the following steps:
S121. Obtain training samples marked with clustering judgment information, where the clustering judgment information of a training sample is the category of the sample knowledge entity.
In the embodiment of the present application, the training goal of the convolutional neural network is to identify the category to which a knowledge entity belongs; by learning the manually labelled category features in the training samples, the convolutional neural network model implements the clustering of knowledge entities.
S122. Input the training samples into the convolutional neural network model to obtain the model clustering reference information of the training samples.
The convolutional neural network model is composed of convolutional layers, pooling layers, a fully connected layer, and a classification layer. The convolutional layers are used to perceive the knowledge entity vector locally and are usually connected in a cascade; the later a convolutional layer is in the cascade, the more global the information it can perceive.
The fully connected layer acts as a "classifier" in the convolutional neural network. If the convolutional, pooling, and activation-function layers map the original data into a hidden-layer feature space, the fully connected layer maps the learned "distributed feature representation" to the sample label space. The fully connected layer is connected to the output of the convolutional layers and can perceive the global features of the knowledge entity vector.
The training samples are input into the convolutional neural network model, and the clustering reference information output by the model is obtained.
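The following PyTorch sketch shows a network of the general shape described above (cascaded convolutional layers, pooling, and a fully connected classification layer) applied to entity vectors; the 1-D convolutions, the layer sizes, and PyTorch itself are assumptions for illustration and are not specified by the patent.

```python
import torch
import torch.nn as nn

class EntityClusterCNN(nn.Module):
    def __init__(self, vector_length=300, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(     # cascaded convolutional layers with pooling
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
        )
        # fully connected layer maps the learned features to the label space
        self.classifier = nn.Linear(32 * (vector_length // 4), num_classes)

    def forward(self, x):                  # x: (batch, vector_length)
        x = x.unsqueeze(1)                 # add a channel dimension for Conv1d
        x = self.features(x)
        x = x.flatten(1)
        return self.classifier(x)          # raw class scores (logits)

model = EntityClusterCNN()
logits = model(torch.randn(4, 300))        # 4 entity vectors -> 4 x 10 class scores
```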
S123. Use a loss function to compare whether the model clustering reference information of the different training samples is consistent with their clustering judgment information.
The loss function compares whether the clustering reference information is consistent with the clustering judgment information labelled on the samples. The embodiment of the present application uses the softmax cross-entropy loss function, defined as follows: suppose there are N training samples; for the i-th sample, the input feature at the last layer of the network is X_i and its corresponding label is Y_i; h = (h_1, h_2, ..., h_C) is the final output of the network, that is, the prediction for sample i, where C is the number of classes.
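A minimal NumPy sketch of a softmax cross-entropy loss of the kind described, where h is the final network output for each sample and Y_i its integer label; the exact written form of the formula in the original filing is not reproduced here, so this standard formulation should be read as an assumption.

```python
import numpy as np

def softmax_cross_entropy(h, labels):
    """h: (N, C) array of final network outputs; labels: (N,) integer class labels Y_i."""
    h = h - h.max(axis=1, keepdims=True)                          # numerical stability
    probs = np.exp(h) / np.exp(h).sum(axis=1, keepdims=True)      # softmax over the C classes
    n = h.shape[0]
    return float(-np.log(probs[np.arange(n), labels]).mean())     # average loss over N samples

loss = softmax_cross_entropy(np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]), np.array([0, 2]))
```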
S124. When the model clustering reference information is inconsistent with the clustering judgment information, iteratively update the weights in the convolutional neural network model until the model clustering reference information is consistent with the clustering judgment information.
During training, the weights of the nodes in the convolutional neural network model are adjusted so that the softmax cross-entropy loss function converges as far as possible; that is, when continuing to adjust the weights no longer reduces the value of the loss function but instead increases it, training of the convolutional neural network can end. The weights of the nodes are adjusted by the gradient descent method, an optimization algorithm used in machine learning and artificial intelligence to iteratively approach the minimum-deviation model.
Clustering knowledge entities with the trained convolutional neural network model makes the clustering result closer to the user's expectation.
如图4所示,步骤S103还包括下述步骤:As shown in FIG. 4, step S103 further includes the following steps:
S131、获取所述两个知识实体的属性,其中,所述知识实体的属性为描述对应知识实体的数据;S131. Acquire attributes of the two knowledge entities, where the attributes of the knowledge entity are data describing the corresponding knowledge entity;
在一些情况下,虽然两个知识实体从内容来看相似度不高,但是两个知识实体都对应现实中的一个实体,也就是说,两个知识实体分别描述了现实中某个实体的两部分信息,为了使用的方便,也有必要将这两部分信息合在一起。所以,这里引入属性相似度。先获取知识实体的属性,属性是用来描述知识实体的数据,也可以称为标签。In some cases, although the two knowledge entities are not similar in terms of content, the two knowledge entities correspond to an entity in reality, that is, the two knowledge entities describe the two entities of an entity in reality. Part of the information, for the convenience of use, it is also necessary to combine the two parts of information. Therefore, attribute similarity is introduced here. Get the attributes of the knowledge entity first. Attributes are the data used to describe the knowledge entity. They can also be called tags.
S132、计算所述两个知识实体的属性相似度和向量相似度;S132: Calculate the attribute similarity and vector similarity of the two knowledge entities;
属性相似度,本申请实施例中采用编辑距离来衡量两个知识实体之间的相似度。编辑距离,是指利用字符操作,把字符串A转换成字符串B所需要的最少操作数。字符操作包括:删除一个字符、修改一个字符、插入一个字符。在 这里设置每次操作的代价为1,属性相似度可以通过以下公式计算:Attribute similarity. In this embodiment of the present application, the editing distance is used to measure the similarity between two knowledge entities. Editing distance refers to the minimum number of operands required to convert character string A to character string B using character manipulation. Character operations include: deleting a character, modifying a character, and inserting a character. Here, the cost of each operation is set to 1, and the attribute similarity can be calculated by the following formula:
Attribute similarity = 1 - (edit distance / maximum length of the two attribute strings)
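A small sketch of the attribute-similarity calculation, assuming the classic dynamic-programming edit distance with unit costs (the function names are illustrative):

```python
def edit_distance(a: str, b: str) -> int:
    # Levenshtein distance; insert, delete and substitute each cost 1.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete
                           dp[i][j - 1] + 1,         # insert
                           dp[i - 1][j - 1] + cost)  # modify
    return dp[m][n]

def attribute_similarity(attr_a: str, attr_b: str) -> float:
    max_len = max(len(attr_a), len(attr_b))
    if max_len == 0:
        return 1.0
    return 1.0 - edit_distance(attr_a, attr_b) / max_len
```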
The vector similarity is the cosine similarity or Euclidean distance described above, which measures the similarity between the vectors of the two knowledge entities.
S133. Calculate the weighted sum of the attribute similarity and the vector similarity of the two knowledge entities according to the following formula to obtain the similarity between the two knowledge entities, namely:
S = aX + bY
where S is the similarity between the two knowledge entities, X is the attribute similarity, Y is the vector similarity, and a and b are the weights of the attribute similarity and the vector similarity, respectively.
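A sketch of the weighted combination, assuming cosine similarity is used as the vector similarity; the weight values a and b below are illustrative, since the patent does not fix them:

```python
import numpy as np

def cosine_similarity(u, v) -> float:
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def combined_similarity(attr_sim: float, vec_sim: float,
                        a: float = 0.4, b: float = 0.6) -> float:
    # S = a*X + b*Y
    return a * attr_sim + b * vec_sim
```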
By combining attribute similarity and vector similarity, two knowledge entities that describe the same real-world entity can be identified even when their content similarity is low, and knowledge entities describing the same real-world entity can be merged, which facilitates both use by the user and maintenance of the knowledge base.
Step S104 further includes the following steps:
S141. When the similarity is greater than a set second threshold, where the second threshold is greater than the first threshold, delete either one of the two knowledge entities from the knowledge base to be aligned.
When the similarity between two knowledge entities is very high, a second threshold is set that is greater than the aforementioned first threshold, for example 0.95. In that case the two knowledge entities are considered essentially identical, and deleting either one of them from the knowledge base is an effective way to remove redundancy.
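A hedged sketch of how the two thresholds might interact in practice; the 0.8 and 0.95 values are illustrative, the patent only requires the second threshold to exceed the first:

```python
def align_pair(entity_a, entity_b, similarity: float,
               first_threshold: float = 0.8, second_threshold: float = 0.95):
    # Decide how to handle a pair of same-cluster knowledge entities.
    if similarity > second_threshold:
        return ("delete_one", entity_a)           # essentially identical: keep one copy
    if similarity > first_threshold:
        return ("merge", (entity_a, entity_b))    # partially overlapping: merge them
    return ("keep_both", (entity_a, entity_b))
```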
As shown in FIG. 5, step S104 further includes the following steps:
S151. Split the two knowledge entities into several sub-entities;
When the similarity between the two knowledge entities is greater than the preset first threshold, part of their content is considered duplicated. To remove the duplicated content, the two knowledge entities can first be split into several sub-entities according to certain rules, for example by content paragraph.
S152. Select any two of the several sub-entities and calculate the similarity between the two sub-entities;
Select any two of the sub-entities obtained from the split and calculate the similarity between them; that is, as described above, first vectorize the sub-entities and then compute the similarity between the vectors representing them, which can be the cosine similarity or the Euclidean distance.
S153. When the similarity between the two sub-entities is greater than a preset third threshold, delete either one of the two sub-entities, where the third threshold is greater than the first threshold;
When the similarity between two sub-entities is greater than a preset threshold, referred to here as the third threshold, the content of the two sub-entities is considered essentially duplicated and either one of them is deleted. To avoid deleting too much content, the third threshold is required to be greater than the aforementioned first threshold.
S154. Repeat steps S152 and S153 until the similarity between any two of the retained sub-entities is less than or equal to the preset third threshold;
The comparison of similarity between sub-entities is repeated, deleting sub-entities with a high degree of overlap, until the similarity between any two of the retained sub-entities is less than or equal to the preset third threshold.
S155. Merge the retained sub-entities as the aligned entity of the two knowledge entities.
The retained sub-entities are merged as the alignment result of the two knowledge entities to be aligned.
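A sketch of this splitting-and-deduplication loop, assuming sub-entities are paragraphs and that `vectorize` and `similarity` are the vectorization and similarity functions described earlier (both names are illustrative):

```python
from itertools import combinations

def merge_by_subentities(entity_a: str, entity_b: str, vectorize, similarity,
                         third_threshold: float = 0.9) -> str:
    # Split both entities into sub-entities (here: by paragraph), then repeatedly
    # drop one sub-entity of any pair whose similarity exceeds the third threshold.
    sub_entities = entity_a.split("\n") + entity_b.split("\n")
    changed = True
    while changed:
        changed = False
        for s1, s2 in combinations(sub_entities, 2):
            if similarity(vectorize(s1), vectorize(s2)) > third_threshold:
                sub_entities.remove(s2)   # delete one of the duplicated sub-entities
                changed = True
                break                     # restart the pairwise scan
    return "\n".join(sub_entities)        # the merged, de-duplicated aligned entity
```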
To solve the above technical problem, an embodiment of the present application further provides a knowledge base alignment apparatus. For details, please refer to FIG. 6, which is a block diagram of the basic structure of the knowledge base alignment apparatus of this embodiment.
As shown in FIG. 6, a knowledge base alignment apparatus includes an acquisition module 210, a processing module 220, a calculation module 230 and an execution module 240. The acquisition module 210 is configured to acquire a knowledge entity vector set, where the knowledge entity vector set is a vectorized representation of the knowledge entities in the knowledge base to be aligned; the processing module 220 is configured to input the knowledge entity vector set into a preset knowledge entity clustering model to obtain a clustering result of the knowledge entities in the knowledge base to be aligned; the calculation module 230 is configured to select, according to the clustering result, any two knowledge entities belonging to the same class and calculate the similarity between the two knowledge entities; and the execution module 240 is configured to merge the two knowledge entities when the similarity is greater than a set first threshold.
In this embodiment of the present application, a knowledge entity vector set is acquired and input into a preset knowledge entity clustering model to obtain a clustering result of the knowledge entities in the knowledge base to be aligned; according to the clustering result, any two knowledge entities belonging to the same class are selected and the similarity between them is calculated; and when the similarity is greater than a set first threshold, the two knowledge entities are merged. Because similarity is only compared between knowledge entities of the same class, the amount of calculation is greatly reduced; and because the similarity calculation combines the attribute similarity and the vector similarity of the entities, it is more reasonable and can discover and remove redundant information more effectively.
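A compact sketch of the overall pipeline implemented by these modules, with `vectorize`, `cluster`, `similarity` and `merge` standing in for the components described above (all names and the threshold value are illustrative):

```python
from itertools import combinations

def align_knowledge_base(entities, vectorize, cluster, similarity, merge,
                         first_threshold: float = 0.8):
    # entities: list of raw knowledge-entity texts in the knowledge base to be aligned
    vectors = [vectorize(e) for e in entities]
    labels = cluster(vectors)                        # e.g. DBSCAN or a trained CNN model
    aligned = list(entities)
    for label in set(labels):
        members = [i for i, l in enumerate(labels) if l == label]
        for i, j in combinations(members, 2):        # compare only within the same class
            if aligned[i] is None or aligned[j] is None:
                continue                             # one of them was already merged away
            if similarity(vectors[i], vectors[j]) > first_threshold:
                aligned[i] = merge(aligned[i], aligned[j])
                aligned[j] = None                    # drop the merged-away entity
    return [e for e in aligned if e is not None]
```

Restricting the pairwise comparison to members of the same cluster is what keeps the number of similarity computations small.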
In some embodiments, the knowledge base alignment apparatus further includes a first acquisition submodule and a first processing submodule. The first acquisition submodule is configured to acquire the knowledge entities in the knowledge base to be aligned; the first processing submodule is configured to vectorize the knowledge entities based on the IF-IDF algorithm to obtain the knowledge entity vector set.
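The "IF-IDF" algorithm named here is commonly known as TF-IDF; a minimal sketch of the vectorization step using scikit-learn, assuming the entity texts are plain strings that have already been word-segmented (for Chinese text, e.g. with a segmenter such as jieba):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative entity texts standing in for the knowledge entities to be aligned.
entities = ["knowledge entity text one ...", "knowledge entity text two ..."]

vectorizer = TfidfVectorizer()
entity_vectors = vectorizer.fit_transform(entities)   # the knowledge entity vector set
```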
In some embodiments, the knowledge entity clustering model preset in the knowledge base alignment apparatus uses the DBSCAN density clustering algorithm.
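A minimal sketch of the DBSCAN clustering step using scikit-learn; the eps and min_samples values are illustrative, not values specified by the patent:

```python
from sklearn.cluster import DBSCAN

# entity_vectors: the TF-IDF matrix from the previous step (densified here for simplicity)
clustering = DBSCAN(eps=0.5, min_samples=2, metric="cosine")
labels = clustering.fit_predict(entity_vectors.toarray())   # label -1 marks noise points
```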
In some embodiments, the knowledge entity clustering model preset in the knowledge base alignment apparatus uses a clustering model based on a convolutional neural network.
In some embodiments, the calculation module 230 includes a second acquisition submodule, a first calculation submodule and a second calculation submodule. The second acquisition submodule is configured to acquire the attributes of the two knowledge entities, where an attribute of a knowledge entity is data describing the corresponding knowledge entity; the first calculation submodule is configured to calculate the attribute similarity and the vector similarity of the two knowledge entities; and the second calculation submodule is configured to calculate the weighted sum of the attribute similarity and the vector similarity of the two knowledge entities according to the following formula to obtain the similarity between the two knowledge entities, namely:
S = aX + bY
where S is the similarity between the two knowledge entities, X is the attribute similarity, Y is the vector similarity, and a and b are the weights of the attribute similarity and the vector similarity, respectively.
In some embodiments, the execution module 240 includes a first execution submodule configured to delete either one of the two knowledge entities from the knowledge base to be aligned when the similarity is greater than a set second threshold, where the second threshold is greater than the first threshold.
In some embodiments, the execution module 240 includes a first splitting submodule, a third calculation submodule, a second execution submodule, a first loop submodule and a third execution submodule. The first splitting submodule is configured to split the two knowledge entities into several sub-entities; the third calculation submodule is configured to select any two of the several sub-entities and calculate the similarity between the two sub-entities; the second execution submodule is configured to delete either one of the two sub-entities when the similarity between them is greater than a preset third threshold, where the third threshold is greater than the first threshold; the first loop submodule is configured to run the third calculation submodule and the second execution submodule repeatedly until the similarity between any two of the retained sub-entities is less than or equal to the preset third threshold; and the third execution submodule is configured to merge the retained sub-entities as the aligned entity of the two knowledge entities.
To solve the above technical problem, an embodiment of the present application further provides a computer device. For details, please refer to FIG. 7, which is a block diagram of the basic structure of the computer device of this embodiment.
FIG. 7 is a schematic diagram of the internal structure of the computer device. As shown in FIG. 7, the computer device includes a processor, a non-volatile storage medium, a memory and a network interface connected through a system bus. The non-volatile storage medium of the computer device stores an operating system, a database and computer-readable instructions; the database may store a sequence of control information, and the computer-readable instructions, when executed by the processor, cause the processor to implement a knowledge base alignment method. The processor of the computer device provides computing and control capabilities and supports the operation of the entire computer device. The memory of the computer device may store computer-readable instructions that, when executed by the processor, cause the processor to perform a knowledge base alignment method. The network interface of the computer device is used to connect to and communicate with a terminal. Those skilled in the art will understand that the structure shown in FIG. 7 is only a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution of the present application is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In this embodiment, the processor is configured to execute the specific content of the acquisition module 210, the processing module 220, the calculation module 230 and the execution module 240 in FIG. 6, and the memory stores the computer-readable instructions and the various types of data required to execute these modules. The network interface is used for data transmission to and from a user terminal or server. The memory in this embodiment stores the computer-readable instructions and data required to execute all submodules of the knowledge base alignment method, and the server can call its computer-readable instructions and data to execute the functions of all submodules.
The computer device acquires a knowledge entity vector set and inputs it into a preset knowledge entity clustering model to obtain a clustering result of the knowledge entities in the knowledge base to be aligned; according to the clustering result, any two knowledge entities belonging to the same class are selected and the similarity between them is calculated; and when the similarity is greater than a set first threshold, the two knowledge entities are merged. Because similarity is only compared between knowledge entities of the same class, the amount of calculation is greatly reduced; and because the similarity calculation combines the attribute similarity and the vector similarity of the entities, it is more reasonable and can discover and remove redundant information more effectively.
The present application further provides a storage medium storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the knowledge base alignment method described in any of the above embodiments.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by computer-readable instructions instructing the relevant hardware. The computer-readable instructions may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc or a read-only memory (ROM), or a random access memory (RAM).
It should be understood that although the steps in the flowcharts of the drawings are shown in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict ordering restriction on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in the flowcharts of the drawings may include multiple sub-steps or stages, which are not necessarily completed at the same time but may be executed at different times, and their execution order is not necessarily sequential; they may be executed in turn or alternately with at least part of other steps or of the sub-steps or stages of other steps.
The above is only part of the implementations of the present application. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present application, and these improvements and refinements shall also be regarded as falling within the protection scope of the present application.
Claims (20)
- A knowledge base alignment method, comprising the following steps:
  acquiring a knowledge entity vector set, where the knowledge entity vector set is a vectorized representation of knowledge entities in a knowledge base to be aligned;
  inputting the knowledge entity vector set into a preset knowledge entity clustering model to obtain a clustering result of the knowledge entities in the knowledge base to be aligned;
  selecting, according to the clustering result, any two knowledge entities belonging to the same class, and calculating the similarity between the two knowledge entities; and
  merging the two knowledge entities when the similarity is greater than a set first threshold.
- The knowledge base alignment method according to claim 1, wherein before the step of acquiring the knowledge entity vector set, the method further comprises the following steps:
  acquiring the knowledge entities in the knowledge base to be aligned; and
  vectorizing the knowledge entities based on the IF-IDF algorithm to obtain the knowledge entity vector set.
- The knowledge base alignment method according to claim 1, wherein the preset knowledge entity clustering model uses the DBSCAN density clustering algorithm.
- The knowledge base alignment method according to claim 1, wherein the preset knowledge entity clustering model uses a clustering model based on a convolutional neural network, and training of the clustering model based on the convolutional neural network comprises the following steps:
  acquiring training samples labeled with clustering judgment information, the clustering judgment information of the training samples being the categories of the sample knowledge entities;
  inputting the training samples into a convolutional neural network model to obtain model clustering reference information of the training samples;
  comparing, through a loss function, whether the model clustering reference information of different samples in the training samples is consistent with the clustering judgment information; and
  when the model clustering reference information is inconsistent with the clustering judgment information, iteratively updating the weights in the convolutional neural network model, and stopping when the model clustering reference information is consistent with the clustering judgment information.
- The knowledge base alignment method according to claim 1, wherein the step of selecting, according to the clustering result, any two knowledge entities belonging to the same class and calculating the similarity between the two knowledge entities specifically comprises the following steps:
  acquiring attributes of the two knowledge entities, where an attribute of a knowledge entity is data describing the corresponding knowledge entity;
  calculating the attribute similarity and the vector similarity of the two knowledge entities; and
  calculating the weighted sum of the attribute similarity and the vector similarity of the two knowledge entities according to the following formula to obtain the similarity between the two knowledge entities, namely:
  S = aX + bY
  where S is the similarity between the two knowledge entities, X is the attribute similarity, Y is the vector similarity, and a and b are the weights of the attribute similarity and the vector similarity, respectively.
- The knowledge base alignment method according to claim 1, wherein the step of merging the two knowledge entities when the similarity is greater than the set first threshold further comprises the following step:
  deleting either one of the two knowledge entities from the knowledge base to be aligned when the similarity is greater than a set second threshold, where the second threshold is greater than the first threshold.
- The knowledge base alignment method according to claim 1, wherein the step of merging the two knowledge entities when the similarity is greater than the set first threshold further comprises the following steps:
  a. splitting the two knowledge entities into several sub-entities;
  b. selecting any two of the several sub-entities and calculating the similarity between the two sub-entities;
  c. when the similarity between the two sub-entities is greater than a preset third threshold, deleting either one of the two sub-entities, where the third threshold is greater than the first threshold;
  d. repeating steps b and c until the similarity between any two of the retained sub-entities is less than or equal to the preset third threshold; and
  e. merging the retained sub-entities as the aligned entity of the two knowledge entities.
- A knowledge base alignment apparatus, comprising:
  an acquisition module, configured to acquire a knowledge entity vector set, where the knowledge entity vector set is a vectorized representation of knowledge entities in a knowledge base to be aligned;
  a processing module, configured to input the knowledge entity vector set into a preset knowledge entity clustering model to obtain a clustering result of the knowledge entities in the knowledge base to be aligned;
  a calculation module, configured to select, according to the clustering result, any two knowledge entities belonging to the same class and calculate the similarity between the two knowledge entities; and
  an execution module, configured to merge the two knowledge entities when the similarity is greater than a set first threshold.
- The knowledge base alignment apparatus according to claim 8, further comprising:
  a first acquisition submodule, configured to acquire the knowledge entities in the knowledge base to be aligned; and
  a first processing submodule, configured to vectorize the knowledge entities based on the IF-IDF algorithm to obtain the knowledge entity vector set.
- The knowledge base alignment apparatus according to claim 8, wherein the calculation module comprises:
  a second acquisition submodule, configured to acquire attributes of the two knowledge entities, where an attribute of a knowledge entity is data describing the corresponding knowledge entity;
  a first calculation submodule, configured to calculate the attribute similarity and the vector similarity of the two knowledge entities; and
  a second calculation submodule, configured to calculate the weighted sum of the attribute similarity and the vector similarity of the two knowledge entities according to the following formula to obtain the similarity between the two knowledge entities, namely:
  S = aX + bY
  where S is the similarity between the two knowledge entities, X is the attribute similarity, Y is the vector similarity, and a and b are the weights of the attribute similarity and the vector similarity, respectively.
- A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the steps of the following knowledge base alignment method:
  acquiring a knowledge entity vector set, where the knowledge entity vector set is a vectorized representation of knowledge entities in a knowledge base to be aligned;
  inputting the knowledge entity vector set into a preset knowledge entity clustering model to obtain a clustering result of the knowledge entities in the knowledge base to be aligned;
  selecting, according to the clustering result, any two knowledge entities belonging to the same class, and calculating the similarity between the two knowledge entities; and
  merging the two knowledge entities when the similarity is greater than a set first threshold.
- The computer device according to claim 11, wherein before the step of acquiring the knowledge entity vector set, the method further comprises the following steps:
  acquiring the knowledge entities in the knowledge base to be aligned; and
  vectorizing the knowledge entities based on the IF-IDF algorithm to obtain the knowledge entity vector set.
- The computer device according to claim 11, wherein the preset knowledge entity clustering model uses the DBSCAN density clustering algorithm.
- The computer device according to claim 11, wherein the preset knowledge entity clustering model uses a clustering model based on a convolutional neural network, and training of the clustering model based on the convolutional neural network comprises the following steps:
  acquiring training samples labeled with clustering judgment information, the clustering judgment information of the training samples being the categories of the sample knowledge entities;
  inputting the training samples into a convolutional neural network model to obtain model clustering reference information of the training samples;
  comparing, through a loss function, whether the model clustering reference information of different samples in the training samples is consistent with the clustering judgment information; and
  when the model clustering reference information is inconsistent with the clustering judgment information, iteratively updating the weights in the convolutional neural network model, and stopping when the model clustering reference information is consistent with the clustering judgment information.
- The computer device according to claim 11, wherein the step of selecting, according to the clustering result, any two knowledge entities belonging to the same class and calculating the similarity between the two knowledge entities specifically comprises the following steps:
  acquiring attributes of the two knowledge entities, where an attribute of a knowledge entity is data describing the corresponding knowledge entity;
  calculating the attribute similarity and the vector similarity of the two knowledge entities; and
  calculating the weighted sum of the attribute similarity and the vector similarity of the two knowledge entities according to the following formula to obtain the similarity between the two knowledge entities, namely:
  S = aX + bY
  where S is the similarity between the two knowledge entities, X is the attribute similarity, Y is the vector similarity, and a and b are the weights of the attribute similarity and the vector similarity, respectively.
- A readable storage medium storing computer-readable instructions, wherein the computer-readable instructions, when executed by a processor, cause the processor to perform the following steps:
  acquiring a knowledge entity vector set, where the knowledge entity vector set is a vectorized representation of knowledge entities in a knowledge base to be aligned;
  inputting the knowledge entity vector set into a preset knowledge entity clustering model to obtain a clustering result of the knowledge entities in the knowledge base to be aligned;
  selecting, according to the clustering result, any two knowledge entities belonging to the same class, and calculating the similarity between the two knowledge entities; and
  merging the two knowledge entities when the similarity is greater than a set first threshold.
- The readable storage medium according to claim 16, wherein before the step of acquiring the knowledge entity vector set, the following steps are further performed:
  acquiring the knowledge entities in the knowledge base to be aligned; and
  vectorizing the knowledge entities based on the IF-IDF algorithm to obtain the knowledge entity vector set.
- The readable storage medium according to claim 16, wherein the preset knowledge entity clustering model uses the DBSCAN density clustering algorithm.
- The readable storage medium according to claim 16, wherein the preset knowledge entity clustering model uses a clustering model based on a convolutional neural network, and training of the clustering model based on the convolutional neural network comprises the following steps:
  acquiring training samples labeled with clustering judgment information, the clustering judgment information of the training samples being the categories of the sample knowledge entities;
  inputting the training samples into a convolutional neural network model to obtain model clustering reference information of the training samples;
  comparing, through a loss function, whether the model clustering reference information of different samples in the training samples is consistent with the clustering judgment information; and
  when the model clustering reference information is inconsistent with the clustering judgment information, iteratively updating the weights in the convolutional neural network model, and stopping when the model clustering reference information is consistent with the clustering judgment information.
- The readable storage medium according to claim 16, wherein the step of selecting, according to the clustering result, any two knowledge entities belonging to the same class and calculating the similarity between the two knowledge entities specifically comprises the following steps:
  acquiring attributes of the two knowledge entities, where an attribute of a knowledge entity is data describing the corresponding knowledge entity;
  calculating the attribute similarity and the vector similarity of the two knowledge entities; and
  calculating the weighted sum of the attribute similarity and the vector similarity of the two knowledge entities according to the following formula to obtain the similarity between the two knowledge entities, namely:
  S = aX + bY
  where S is the similarity between the two knowledge entities, X is the attribute similarity, Y is the vector similarity, and a and b are the weights of the attribute similarity and the vector similarity, respectively.