CN109783582A - Knowledge base alignment method, device, computer equipment and storage medium - Google Patents

Knowledge base alignment method, device, computer equipment and storage medium

Info

Publication number
CN109783582A
Authority
CN
China
Prior art keywords
knowledge
entity
similarity
cluster
entities
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811474699.XA
Other languages
Chinese (zh)
Other versions
CN109783582B (en)
Inventor
吴壮伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201811474699.XA priority Critical patent/CN109783582B/en
Publication of CN109783582A publication Critical patent/CN109783582A/en
Priority to PCT/CN2019/103487 priority patent/WO2020114022A1/en
Application granted granted Critical
Publication of CN109783582B publication Critical patent/CN109783582B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 — Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 — Databases characterised by their database models, e.g. relational or object models
    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the present invention discloses a knowledge base alignment method, apparatus, computer device and storage medium. The method includes the following steps: obtaining a knowledge entity vector set, wherein the knowledge entity vector set is the vectorized representation of the knowledge entities in a knowledge base to be aligned; inputting the knowledge entity vector set into a preset knowledge entity clustering model to obtain a clustering result for the knowledge entities in the knowledge base to be aligned; according to the clustering result, selecting any two knowledge entities that belong to the same class and calculating the similarity between the two knowledge entities; and merging the two knowledge entities when the similarity is greater than a set first threshold. Restricting the comparison of entity similarities to entities within the same class greatly reduces the amount of computation; the clustering is implemented with artificial intelligence techniques so that the clustering result better matches expectations; and the similarity calculation combines entity attribute similarity and vector similarity, which makes the similarity calculation more reasonable and allows redundancy to be found and removed more effectively.

Description

Knowledge base alignment method, device, computer equipment and storage medium
Technical field
The present invention relates to the field of knowledge base processing, and in particular to a knowledge base alignment method, device, computer equipment and storage medium.
Background technique
With the development of the Internet, more and more knowledge bases are being built in every field, and these knowledge bases are widely used in Internet applications such as search services and automatic question answering. Knowledge bases have a positive effect on the sharing and dissemination of information. However, the information in a single knowledge base is limited and in some cases cannot meet users' needs. Moreover, a knowledge base is usually extended continuously, so the storage resources it occupies keep growing, and continuous extension may introduce redundancy into its data. Such redundancy wastes storage resources, increases the amount of computation required for search, and causes repeated information in search results, which is inconvenient for users.
Knowledge base alignment (Knowledge Base Alignment) refers to finding, among entities from different sources, those that belong to the same real-world thing. An entity here is anything that exists objectively and can be distinguished from other things, including specific people, events, objects, abstract concepts and relationships. Knowledge base alignment, i.e. extracting entity information and removing redundancy, is therefore a key problem in building a high-quality knowledge base.
A common approach to knowledge base alignment is to use entity attribute information to determine whether entities from different sources can be aligned. Since entity data from different sources belongs to user-generated content (User Generated Content, UGC) and the quality of data edited by different users is uneven, it is difficult to accurately determine whether two entities are the same entity using only the attribute information edited by users.
Summary of the invention
The present invention provides a knowledge base alignment method, device, computer equipment and storage medium.
To solve the above technical problem, the present invention proposes a knowledge base alignment method, comprising the following steps:
obtaining a knowledge entity vector set, wherein the knowledge entity vector set is the vectorized representation of the knowledge entities in a knowledge base to be aligned;
inputting the knowledge entity vector set into a preset knowledge entity clustering model to obtain a clustering result for the knowledge entities in the knowledge base to be aligned;
according to the clustering result, selecting any two knowledge entities that belong to the same class and calculating the similarity between the two knowledge entities;
merging the two knowledge entities when the similarity is greater than a set first threshold.
Optionally, before the step of obtaining the knowledge entity vector set, the method further comprises the following steps:
obtaining the knowledge entities in the knowledge base to be aligned;
vectorizing the knowledge entities based on the TF-IDF algorithm to obtain the knowledge entity vector set.
Optionally, the preset knowledge entity clustering model uses the DBSCAN density-based clustering algorithm.
Optionally, the preset knowledge entity clustering model is a clustering model based on a convolutional neural network, and training the clustering model based on the convolutional neural network comprises the following steps:
obtaining training samples labeled with cluster judgment information, the cluster judgment information of a training sample being the class of the sample knowledge entity;
inputting the training samples into a convolutional neural network model to obtain model clustering reference information for the training samples;
comparing, via a loss function, whether the model clustering reference information of different samples in the training samples is consistent with the cluster judgment information;
when the model clustering reference information is inconsistent with the cluster judgment information, iteratively and repeatedly updating the weights in the convolutional neural network model, and ending when the model clustering reference information is consistent with the cluster judgment information.
Optionally, the step of selecting, according to the clustering result, any two knowledge entities that belong to the same class and calculating the similarity between the two knowledge entities specifically comprises the following steps:
obtaining the attributes of the two knowledge entities, wherein a knowledge entity attribute is data that describes the corresponding knowledge entity;
calculating the attribute similarity and the vector similarity of the two knowledge entities;
calculating a weighted sum of the attribute similarity and the vector similarity of the two knowledge entities according to the following formula to obtain the similarity between the two knowledge entities, namely:
S = aX + bY
where S is the similarity between the two knowledge entities, X is the attribute similarity, Y is the vector similarity, and a and b are respectively the weights of the attribute similarity and the vector similarity.
Optionally, the step of merging the two knowledge entities when the similarity is greater than the set first threshold further comprises the following step:
when the similarity is greater than a set second threshold, wherein the second threshold is greater than the first threshold, deleting either one of the two knowledge entities from the knowledge base to be aligned.
Optionally, the step of merging the two knowledge entities when the similarity is greater than the set first threshold further comprises the following steps:
a. splitting the two knowledge entities into several sub-entities;
b. selecting any two sub-entities among the several sub-entities and calculating the similarity between the two sub-entities;
c. when the similarity between the two sub-entities is greater than a preset third threshold, deleting either one of the two sub-entities, wherein the third threshold is greater than the first threshold;
d. repeating steps b and c until the similarity between any two of the retained sub-entities is less than or equal to the preset third threshold;
e. merging the retained sub-entities as the aligned entity of the two knowledge entities.
To solve the above problem, the present invention also provides a knowledge base alignment apparatus, comprising:
an obtaining module for obtaining a knowledge entity vector set, wherein the knowledge entity vector set is the vectorized representation of the knowledge entities in a knowledge base to be aligned;
a processing module for inputting the knowledge entity vector set into a preset knowledge entity clustering model to obtain a clustering result for the knowledge entities in the knowledge base to be aligned;
a computing module for selecting, according to the clustering result, any two knowledge entities that belong to the same class and calculating the similarity between the two knowledge entities;
an execution module for merging the two knowledge entities when the similarity is greater than a set first threshold.
Optionally, the knowledge base alignment apparatus further comprises:
a first obtaining submodule for obtaining the knowledge entities in the knowledge base to be aligned;
a first processing submodule for vectorizing the knowledge entities based on the TF-IDF algorithm to obtain the knowledge entity vector set.
Optionally, the preset knowledge entity clustering model in the knowledge base alignment apparatus uses the DBSCAN density-based clustering algorithm.
Optionally, the preset knowledge entity clustering model in the knowledge base alignment apparatus uses a clustering model based on a convolutional neural network.
Optionally, the computing module comprises:
a second obtaining submodule for obtaining the attributes of the two knowledge entities, wherein a knowledge entity attribute is data that describes the corresponding knowledge entity;
a first computing submodule for calculating the attribute similarity and the vector similarity of the two knowledge entities;
a second computing submodule for calculating a weighted sum of the attribute similarity and the vector similarity of the two knowledge entities according to the following formula to obtain the similarity between the two knowledge entities, namely:
S = aX + bY
where S is the similarity between the two knowledge entities, X is the attribute similarity, Y is the vector similarity, and a and b are respectively the weights of the attribute similarity and the vector similarity.
Optionally, the execution module comprises:
a first execution submodule for, when the similarity is greater than a set second threshold, wherein the second threshold is greater than the first threshold, deleting either one of the two knowledge entities from the knowledge base to be aligned.
Optionally, the execution module comprises:
a first splitting submodule for splitting the two knowledge entities into several sub-entities;
a third computing submodule for selecting any two sub-entities among the several sub-entities and calculating the similarity between the two sub-entities;
a second execution submodule for deleting either one of the two sub-entities when the similarity between the two sub-entities is greater than a preset third threshold, wherein the third threshold is greater than the first threshold;
a first loop submodule for rerunning the third computing submodule and the second execution submodule until the similarity between any two of the retained sub-entities is less than or equal to the preset third threshold;
a third execution submodule for merging the retained sub-entities as the aligned entity of the two knowledge entities.
To solve the above technical problem, an embodiment of the present invention also provides a computer device comprising a memory and a processor, wherein computer-readable instructions are stored in the memory, and when the computer-readable instructions are executed by the processor, the processor performs the steps of the knowledge base alignment method described above.
To solve the above technical problem, an embodiment of the present invention also provides a computer-readable storage medium on which computer-readable instructions are stored, and when the computer-readable instructions are executed by a processor, the processor performs the steps of the knowledge base alignment method described above.
The embodiments of the present invention have the following beneficial effects: a knowledge entity vector set is obtained; the knowledge entity vector set is input into a preset knowledge entity clustering model to obtain a clustering result for the knowledge entities in the knowledge base to be aligned; according to the clustering result, any two knowledge entities that belong to the same class are selected and the similarity between the two knowledge entities is calculated; and the two knowledge entities are merged when the similarity is greater than a set first threshold. Restricting the comparison of entity similarities to entities within the same class greatly reduces the amount of computation, and the similarity calculation combines entity attribute similarity and vector similarity, which makes the similarity calculation more reasonable and allows redundancy to be found and removed more effectively.
Detailed description of the invention
In order to describe the technical solutions in the embodiments of the present invention more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of the basic procedure of a knowledge base alignment method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of vectorizing knowledge entities based on the TF-IDF algorithm according to an embodiment of the present invention;
Fig. 3 is a schematic flowchart of training a clustering model based on a convolutional neural network according to an embodiment of the present invention;
Fig. 4 is a schematic flowchart of calculating the similarity of knowledge entities according to an embodiment of the present invention;
Fig. 5 is a schematic flowchart of merging knowledge entities according to an embodiment of the present invention;
Fig. 6 is a block diagram of the basic structure of a knowledge base alignment apparatus according to an embodiment of the present invention;
Fig. 7 is a block diagram of the basic structure of a computer device according to an embodiment of the present invention.
Specific embodiment
To enable those skilled in the art to better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention.
Some of the processes described in the specification, the claims and the above drawings contain multiple operations that occur in a particular order, but it should be clearly understood that these operations may be executed out of the order in which they appear herein or in parallel. Operation serial numbers such as 101 and 102 are only used to distinguish different operations and do not themselves represent any execution order. In addition, these processes may include more or fewer operations, and these operations may be executed in order or in parallel. It should be noted that descriptions such as "first" and "second" herein are used to distinguish different messages, devices, modules, etc.; they do not represent a sequence, nor do they require "first" and "second" to be of different types.
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some of the embodiments of the present invention rather than all of them. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort fall within the protection scope of the present invention.
Embodiment
Those skilled in the art will understand that the terms "terminal" and "terminal device" used herein include both devices with only a wireless signal receiver and no transmitting capability and devices with receiving and transmitting hardware that can perform two-way communication over a bidirectional communication link. Such devices may include: cellular or other communication devices with a single-line display, a multi-line display or no multi-line display; PCS (Personal Communications Service) devices, which may combine voice, data processing, fax and/or data communication capabilities; PDAs (Personal Digital Assistants), which may include a radio-frequency receiver, a pager, Internet/intranet access, a web browser, a notepad, a calendar and/or a GPS (Global Positioning System) receiver; and conventional laptop and/or palmtop computers or other devices that have and/or include a radio-frequency receiver. The "terminal" or "terminal device" used herein may be portable, transportable, installed in a vehicle (air, sea and/or land), or adapted and/or configured to operate locally and/or to operate in a distributed form at any other location on the earth and/or in space. The "terminal" or "terminal device" used herein may also be a communication terminal, an Internet terminal or a music/video playback terminal, for example a PDA, a MID (Mobile Internet Device) and/or a mobile phone with a music/video playback function, or a device such as a smart TV or a set-top box.
The terminal in this embodiment is the terminal described above.
Specifically, referring to Fig. 1, Fig. 1 is a schematic flowchart of the basic procedure of the knowledge base alignment method of this embodiment.
As shown in Fig. 1, the knowledge base alignment method includes the following steps:
S101. Obtain a knowledge entity vector set, wherein the knowledge entity vector set is the vectorized representation of the knowledge entities in the knowledge base to be aligned.
The knowledge entities stored in a knowledge base are usually text or pictures. When aligning knowledge entities, the similarity between knowledge entities usually needs to be calculated, and to facilitate processing and understanding by a computer, the knowledge entities need to be converted into vectors. For example, the vectorized representation of text can be realized by the vector space model, also called the bag-of-words model. Its simplest form is word-based one-hot encoding: each word is used as a dimension key, the position corresponding to a word that appears is set to 1, other positions are set to 0, and the vector length is the same as the dictionary size.
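As an illustration only (a minimal sketch rather than code from the patent; the whitespace tokenization and the sample texts are assumptions for exposition), such a bag-of-words one-hot representation could be built as follows:

```python
# Minimal sketch of bag-of-words one-hot vectorization: each dictionary word is
# a dimension, set to 1 if the word occurs in the entity text and 0 otherwise.
def build_vocabulary(entity_texts):
    vocab = sorted({word for text in entity_texts for word in text.split()})
    return {word: i for i, word in enumerate(vocab)}

def one_hot_vectorize(text, vocab):
    vector = [0] * len(vocab)
    for word in text.split():
        if word in vocab:
            vector[vocab[word]] = 1
    return vector

texts = ["apple fruit red", "apple company phone"]  # hypothetical entity texts
vocab = build_vocabulary(texts)
vectors = [one_hot_vectorize(t, vocab) for t in texts]
```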
S102. Input the knowledge entity vector set into a preset knowledge entity clustering model to obtain a clustering result for the knowledge entities in the knowledge base to be aligned.
The vector set representing the knowledge entities is input into the preset knowledge entity clustering model. The knowledge entity clustering model uses a density-based clustering algorithm. Density-based clustering algorithms do not need the number of clusters to be specified in advance, can find clusters of arbitrary shape, can recognize noise points, are robust to outliers, and can detect outliers. DBSCAN is one of the most typical representative algorithms of this class. Its core idea is to first find points of higher density and then gradually connect nearby high-density points together, thereby generating the clusters. The specific algorithm works as follows: for each data point, draw a circle centered on that point with radius eps (called its eps-neighbourhood) and count how many points fall inside the circle; this count is the density value of the point. Then choose a density threshold MinPts: a point whose circle contains fewer than MinPts points is a low-density point, and a point whose circle contains at least MinPts points is a high-density point (called a core point). If one high-density point lies inside the circle of another high-density point, the two points are connected, and in this way many points can be chained together. Afterwards, if a low-density point lies inside the circle of a high-density point, it is also connected to the nearest high-density point and is called a border point. All points that can be connected together in this way form a cluster, and a low-density point that lies inside the circle of no high-density point is an outlier.
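A hedged sketch of this kind of density-based clustering, using scikit-learn's DBSCAN on a toy vector set (the eps and min_samples values and the cosine metric are illustrative assumptions, not parameters prescribed by the patent):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy entity vectors standing in for the vectorized knowledge entities.
entity_vectors = np.random.rand(100, 20)
clustering = DBSCAN(eps=0.5, min_samples=5, metric="cosine").fit(entity_vectors)
labels = clustering.labels_  # -1 marks noise points; equal labels share a cluster
```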
In some embodiments, clustering is implemented with a trained convolutional neural network model. The convolutional neural network is trained on the features of manually clustered training samples, so that the convolutional neural network model can cluster the knowledge entities as expected.
S103. According to the clustering result, select any two knowledge entities that belong to the same class and calculate the similarity between the two knowledge entities.
Through step S102, the knowledge entities in the knowledge base are clustered. Then, within the same class, the similarity of any two knowledge entities is calculated to determine whether redundant entities exist. This narrows the range of knowledge entity comparisons, reduces the amount of computation, and improves the efficiency of judging whether redundant entities exist.
The similarity of two knowledge entities is obtained by calculating the similarity between the vectors that represent the two knowledge entities. The similarity between two vectors may be the cosine similarity. Cosine similarity measures the similarity between two vectors by the cosine of the angle between them. The cosine of a 0° angle is 1, and the cosine of any other angle is not greater than 1, with a minimum value of -1. The cosine of the angle between two vectors therefore indicates whether the two vectors point in roughly the same direction. When the two vectors point in the same direction, the cosine similarity is 1; when the angle between them is 90°, the cosine similarity is 0; and when they point in exactly opposite directions, the cosine similarity is -1. The result is independent of the lengths of the vectors and depends only on their directions. Cosine similarity applies to vector spaces of any dimension and is commonly used in high-dimensional positive spaces, so it is well suited to comparing text documents.
The similarity between two vectors can also be measured by calculating the Euclidean distance between the vectors. To avoid the influence of scale, the vectors are first normalized, and the distance between two points X1 and X2 in the vector space is then computed according to the following formula:
d(X1, X2) = sqrt( Σi (x1i − x2i)² )
where x1i and x2i are the values of each dimension of X1 and X2 after normalization.
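A hedged sketch of the two vector similarity measures mentioned above, cosine similarity and Euclidean distance on normalized vectors (an illustration, not code from the patent):

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between u and v; independent of vector length.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def normalized_euclidean_distance(u, v):
    # Normalize first to remove the influence of scale, then take the distance.
    u_n = u / np.linalg.norm(u)
    v_n = v / np.linalg.norm(v)
    return float(np.sqrt(np.sum((u_n - v_n) ** 2)))
```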
S104. When the similarity is greater than a set first threshold, merge the two knowledge entities.
A threshold, referred to here as the first threshold, is preset. When the similarity of two knowledge entities is greater than the set first threshold, the contents of the two knowledge entities are considered to partially overlap, and the two knowledge entities are merged into one entity.
As shown in Fig. 2, the following steps are further included before S101:
S111. Obtain the knowledge entities in the knowledge base to be aligned.
The knowledge entities are obtained by accessing the server where the knowledge base is located. The knowledge entities may belong to the same knowledge base or may come from multiple knowledge bases.
S112. Vectorize the knowledge entities based on the TF-IDF algorithm to obtain the knowledge entity vector set.
In addition to the bag-of-words vectorization described above, the knowledge entities can also be vectorized based on the TF-IDF algorithm. TF-IDF is a statistical method for evaluating how important a word is to a document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document but decreases in inverse proportion to the frequency with which it appears in the corpus. TF-IDF is in fact TF * IDF, where TF (Term Frequency) is the frequency with which a term appears in document d and IDF (Inverse Document Frequency) is the inverse document frequency. When using TF-IDF to vectorize text, a dictionary is likewise built, and the TF-IDF value of each word is used as the weight of that word.
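A minimal sketch of TF-IDF vectorization using scikit-learn's TfidfVectorizer (the entity texts are hypothetical and the vectorizer settings are library defaults, not values specified by the patent):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

entity_texts = [
    "apple is a fruit grown in temperate regions",
    "apple designs consumer electronics and software",
]
vectorizer = TfidfVectorizer()
entity_vectors = vectorizer.fit_transform(entity_texts)  # sparse TF-IDF matrix
```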
As shown in Fig. 3, training the clustering model based on the convolutional neural network includes the following steps:
S121. Obtain training samples labeled with cluster judgment information, the cluster judgment information of a training sample being the class of the sample knowledge entity.
In the embodiment of the present invention, the training objective of the convolutional neural network is to identify the class to which a knowledge entity belongs. During training, the convolutional neural network model learns the features of the manually labeled classes from the samples, thereby realizing the function of clustering knowledge entities.
S122. Input the training samples into the convolutional neural network model to obtain model clustering reference information for the training samples.
The convolutional neural network model consists of convolutional layers, pooling layers, fully connected layers and a classification layer. The convolutional layers perceive the knowledge entity vector locally and are usually connected in a cascade; convolutional layers located later in the cascade can perceive more global information.
The fully connected layer acts as the "classifier" of the whole convolutional neural network. If the convolutional layers, pooling layers and activation function layers map the raw data into a hidden-layer feature space, the fully connected layer maps the learned "distributed feature representation" into the sample label space. The fully connected layer is connected at the output of the convolutional layers and can perceive the global features of the knowledge entity vector.
The training samples are input into the convolutional neural network model, and the clustering reference information output by the convolutional neural network model is obtained.
S123. Compare, via the loss function, whether the model clustering reference information of different samples in the training samples is consistent with the cluster judgment information.
The loss function is used to compare whether the clustering reference information is consistent with the cluster judgment information labeled on the samples. The embodiment of the present invention uses the softmax cross-entropy loss function, specifically:
Assume there are N training samples in total, the input feature of the i-th sample at the final layer of the network is Xi, its label Yi is the final classification result, and h = (h1, h2, ..., hC) is the final output of the network, i.e. the prediction result for sample i, where C is the number of classes. The softmax cross-entropy loss is then L = −(1/N) Σ_{i=1..N} log( exp(h_{Yi}) / Σ_{j=1..C} exp(h_j) ).
S124. When the model clustering reference information is inconsistent with the cluster judgment information, iteratively and repeatedly update the weights in the convolutional neural network model, and end when the model clustering reference information is consistent with the cluster judgment information.
During training, the weights of the nodes in the convolutional neural network model are adjusted so that the softmax cross-entropy loss function converges as far as possible; that is, the weights are adjusted continuously, and when the value of the loss function no longer decreases but instead increases, the training of the convolutional neural network can be considered finished. The weights of the nodes are adjusted by gradient descent, an optimization algorithm used in machine learning and artificial intelligence to recursively approach the minimum of the error.
Clustering the knowledge entities with the trained convolutional neural network model can make the clustering result closer to the user's expectation.
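A hedged sketch of training a small 1-D convolutional classifier with softmax cross-entropy and gradient descent, in the spirit of S121–S124 (the network shape, optimizer settings and synthetic data are illustrative assumptions, not values taken from the patent):

```python
import torch
import torch.nn as nn

num_classes, vec_dim = 5, 128
model = nn.Sequential(
    nn.Conv1d(1, 8, kernel_size=3, padding=1),   # local perception of the entity vector
    nn.ReLU(),
    nn.AdaptiveMaxPool1d(16),                    # pooling layer
    nn.Flatten(),
    nn.Linear(8 * 16, num_classes),              # fully connected "classifier"
)
criterion = nn.CrossEntropyLoss()                # softmax cross-entropy loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Synthetic training samples: entity vectors with manually labeled classes.
x = torch.randn(64, 1, vec_dim)
y = torch.randint(0, num_classes, (64,))

for epoch in range(10):
    optimizer.zero_grad()
    logits = model(x)             # model clustering reference information
    loss = criterion(logits, y)   # compare against the labeled cluster judgment
    loss.backward()
    optimizer.step()              # update weights by gradient descent
```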
As shown in Fig. 4, step S103 further includes the following steps:
S131. Obtain the attributes of the two knowledge entities, wherein a knowledge entity attribute is data that describes the corresponding knowledge entity.
In some cases, although two knowledge entities are not very similar in terms of content, they both correspond to the same entity in reality; that is, the two knowledge entities each describe part of the information about some real-world entity. For convenience of use, it is necessary to combine these two parts of information, so attribute similarity is introduced here. The knowledge entity attributes are obtained first; an attribute is data that describes a knowledge entity and may also be called a label.
S132. Calculate the attribute similarity and the vector similarity of the two knowledge entities.
In the embodiment of the present invention, edit distance is used to measure the attribute similarity between two knowledge entities. Edit distance is the minimum number of character operations required to transform string A into string B, where a character operation is deleting a character, modifying a character or inserting a character. With the cost of each operation set to 1, the attribute similarity can be calculated by the following formula:
attribute similarity = 1 − edit distance / maximum length of the two attribute strings
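A hedged sketch of this attribute similarity based on edit distance (Levenshtein distance with unit costs); an illustration only, not the patent's implementation:

```python
def edit_distance(a: str, b: str) -> int:
    # Single-row dynamic programming over the classic edit-distance recurrence.
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # delete a character
                        dp[j - 1] + 1,                      # insert a character
                        prev + (a[i - 1] != b[j - 1]))      # modify a character
            prev = cur
    return dp[len(b)]

def attribute_similarity(attr_a: str, attr_b: str) -> float:
    max_len = max(len(attr_a), len(attr_b))
    return 1.0 - edit_distance(attr_a, attr_b) / max_len if max_len else 1.0
```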
Vector similarity is the cosine similarity or Euclidean distance described above, which measures the similarity between the vectors of the two knowledge entities.
S133. Calculate the weighted sum of the attribute similarity and the vector similarity of the two knowledge entities according to the following formula to obtain the similarity between the two knowledge entities, namely:
S = aX + bY
where S is the similarity between the two knowledge entities, X is the attribute similarity, Y is the vector similarity, and a and b are respectively the weights of the attribute similarity and the vector similarity.
Combining the attribute similarity and the vector similarity makes it possible to find two knowledge entities that describe the same real-world entity even when their contents are not very similar, and to merge the knowledge entities that describe the same real-world entity, which is convenient for users and for the maintenance of the knowledge base.
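A minimal sketch of the weighted combination S = aX + bY (the weight values below are illustrative assumptions; the patent does not fix a and b):

```python
def combined_similarity(attr_sim: float, vec_sim: float,
                        a: float = 0.4, b: float = 0.6) -> float:
    # S = aX + bY, weighting attribute similarity X and vector similarity Y.
    return a * attr_sim + b * vec_sim

# e.g. attribute similarity 0.8 and vector similarity 0.6 give S = 0.68
s = combined_similarity(0.8, 0.6)
```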
Step S104 further includes the following step:
S141. When the similarity is greater than a set second threshold, wherein the second threshold is greater than the first threshold, delete either one of the two knowledge entities from the knowledge base to be aligned.
When the similarity of two knowledge entities is very high, a second threshold greater than the above first threshold is set here, for example 0.95. In this case the two knowledge entities are considered essentially identical, and deleting either one of them from the knowledge base is an effective way of removing the redundancy.
As shown in Fig. 5, step S104 further includes the following steps:
S151. Split the two knowledge entities into several sub-entities.
When the similarity of two knowledge entities is greater than the preset first threshold, the contents of the two knowledge entities are considered to partially overlap. In order to pick out the duplicated content, the two knowledge entities may first be split into several sub-entities according to certain rules, for example by paragraph.
S152. Select any two sub-entities among the several sub-entities and calculate the similarity between the two sub-entities.
Any two of the sub-entities obtained by splitting are selected, and the similarity between the two sub-entities is calculated; that is, as described above, the sub-entities are first vectorized, and the similarity between the vectors representing the sub-entities is then calculated, which may be the cosine similarity or the Euclidean distance.
S153. When the similarity between the two sub-entities is greater than a preset third threshold, delete either one of the two sub-entities, wherein the third threshold is greater than the first threshold.
When the similarity between two sub-entities is greater than a preset threshold, referred to here as the third threshold, the contents of the two sub-entities are considered essentially duplicated, and either one of them is deleted. To avoid deleting too much content, the third threshold is required to be greater than the above first threshold.
S154. Repeat steps S152 and S153 until the similarity between any two of the retained sub-entities is less than or equal to the preset third threshold.
The comparison of similarities between sub-entities is repeated and sub-entities with high overlap are deleted, so that the similarity between any two of the retained sub-entities is less than or equal to the preset third threshold.
S155. Merge the retained sub-entities as the aligned entity of the two knowledge entities.
The retained sub-entities are merged as the alignment result of the two knowledge entities to be aligned.
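A hedged sketch of the split / deduplicate / merge procedure in S151–S155, written as a single greedy pass that leaves no pair of retained sub-entities above the third threshold (paragraph splitting, the pluggable similarity function and the threshold value are illustrative assumptions, not details fixed by the patent):

```python
def align_by_sub_entities(entity_a: str, entity_b: str,
                          similarity, third_threshold: float = 0.9) -> str:
    # S151: split both entities into sub-entities, here by paragraph.
    subs = [p for p in (entity_a + "\n" + entity_b).split("\n") if p.strip()]
    retained = []
    for sub in subs:
        # S152/S153: keep a sub-entity only if it is not too similar to one already kept.
        if all(similarity(sub, kept) <= third_threshold for kept in retained):
            retained.append(sub)
    # S155: merge the retained sub-entities as the aligned entity.
    return "\n".join(retained)
```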
To solve the above technical problem, an embodiment of the present invention also provides a knowledge base alignment apparatus. Referring specifically to Fig. 6, Fig. 6 is a block diagram of the basic structure of the knowledge base alignment apparatus of this embodiment.
As shown in Fig. 6, a knowledge base alignment apparatus comprises: an obtaining module 210, a processing module 220, a computing module 230 and an execution module 240. The obtaining module 210 is used to obtain a knowledge entity vector set, wherein the knowledge entity vector set is the vectorized representation of the knowledge entities in a knowledge base to be aligned; the processing module 220 is used to input the knowledge entity vector set into a preset knowledge entity clustering model to obtain a clustering result for the knowledge entities in the knowledge base to be aligned; the computing module 230 is used to select, according to the clustering result, any two knowledge entities that belong to the same class and calculate the similarity between the two knowledge entities; and the execution module 240 is used to merge the two knowledge entities when the similarity is greater than a set first threshold.
In the embodiment of the present invention, a knowledge entity vector set is obtained; the knowledge entity vector set is input into a preset knowledge entity clustering model to obtain a clustering result for the knowledge entities in the knowledge base to be aligned; according to the clustering result, any two knowledge entities that belong to the same class are selected and the similarity between the two knowledge entities is calculated; and the two knowledge entities are merged when the similarity is greater than a set first threshold. Restricting the comparison of entity similarities to entities within the same class greatly reduces the amount of computation, and the similarity calculation combines entity attribute similarity and vector similarity, which makes the similarity calculation more reasonable and allows redundancy to be found and removed more effectively.
In some embodiments, the knowledge base alignment apparatus further comprises a first obtaining submodule and a first processing submodule. The first obtaining submodule is used to obtain the knowledge entities in the knowledge base to be aligned, and the first processing submodule is used to vectorize the knowledge entities based on the TF-IDF algorithm to obtain the knowledge entity vector set.
In some embodiments, the preset knowledge entity clustering model in the knowledge base alignment apparatus uses the DBSCAN density-based clustering algorithm.
In some embodiments, the preset knowledge entity clustering model in the knowledge base alignment apparatus uses a clustering model based on a convolutional neural network.
In some embodiments, the computing module 230 comprises a second obtaining submodule, a first computing submodule and a second computing submodule. The second obtaining submodule is used to obtain the attributes of the two knowledge entities, wherein a knowledge entity attribute is data that describes the corresponding knowledge entity; the first computing submodule is used to calculate the attribute similarity and the vector similarity of the two knowledge entities; and the second computing submodule is used to calculate a weighted sum of the attribute similarity and the vector similarity of the two knowledge entities according to the following formula to obtain the similarity between the two knowledge entities, namely:
S = aX + bY
where S is the similarity between the two knowledge entities, X is the attribute similarity, Y is the vector similarity, and a and b are respectively the weights of the attribute similarity and the vector similarity.
In some embodiments, the execution module 240 comprises a first execution submodule used to, when the similarity is greater than a set second threshold, wherein the second threshold is greater than the first threshold, delete either one of the two knowledge entities from the knowledge base to be aligned.
In some embodiments, the execution module 240 comprises a first splitting submodule, a third computing submodule, a second execution submodule, a first loop submodule and a third execution submodule. The first splitting submodule is used to split the two knowledge entities into several sub-entities; the third computing submodule is used to select any two sub-entities among the several sub-entities and calculate the similarity between the two sub-entities; the second execution submodule is used to delete either one of the two sub-entities when the similarity between the two sub-entities is greater than a preset third threshold, wherein the third threshold is greater than the first threshold; the first loop submodule is used to rerun the third computing submodule and the second execution submodule until the similarity between any two of the retained sub-entities is less than or equal to the preset third threshold; and the third execution submodule is used to merge the retained sub-entities as the aligned entity of the two knowledge entities.
To solve the above technical problem, an embodiment of the present invention also provides a computer device. Referring specifically to Fig. 7, Fig. 7 is a block diagram of the basic structure of the computer device of this embodiment.
Fig. 7 is a schematic diagram of the internal structure of the computer device. As shown in Fig. 7, the computer device includes a processor, a non-volatile storage medium, a memory and a network interface connected via a system bus. The non-volatile storage medium of the computer device stores an operating system, a database and computer-readable instructions; the database may store a control information sequence; and when the computer-readable instructions are executed by the processor, the processor implements a knowledge base alignment method. The processor of the computer device provides computing and control capability and supports the operation of the entire computer device. The memory of the computer device may store computer-readable instructions, and when the computer-readable instructions are executed by the processor, the processor performs a knowledge base alignment method. The network interface of the computer device is used to connect to and communicate with a terminal. Those skilled in the art can understand that the structure shown in Fig. 7 is only a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution of the present application is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In this embodiment, the processor is used to execute the specific contents of the obtaining module 210, the processing module 220, the computing module 230 and the execution module 240 in Fig. 6, and the memory stores the program code and the various data required to execute the above modules. The network interface is used to transmit data to and from a user terminal or a server. The memory in this embodiment stores the program code and data required to execute all the submodules of the knowledge base alignment method, and the server can call its program code and data to execute the functions of all the submodules.
The computer device obtains a knowledge entity vector set, inputs the knowledge entity vector set into a preset knowledge entity clustering model to obtain a clustering result for the knowledge entities in the knowledge base to be aligned, selects, according to the clustering result, any two knowledge entities that belong to the same class, calculates the similarity between the two knowledge entities, and merges the two knowledge entities when the similarity is greater than a set first threshold. Restricting the comparison of entity similarities to entities within the same class greatly reduces the amount of computation, and the similarity calculation combines entity attribute similarity and vector similarity, which makes the similarity calculation more reasonable and allows redundancy to be found and removed more effectively.
The present invention also provides a storage medium storing computer-readable instructions, and when the computer-readable instructions are executed by one or more processors, the one or more processors perform the steps of the knowledge base alignment method described in any of the above embodiments.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing the relevant hardware through a computer program; the computer program can be stored in a computer-readable storage medium, and when the program is executed, the processes of the embodiments of the above methods may be included. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc or a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.
It should be understood that although the steps in the flowcharts of the drawings are shown in sequence as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be executed in other orders. Moreover, at least some of the steps in the flowcharts of the drawings may include multiple sub-steps or stages, which are not necessarily completed at the same time but may be executed at different times, and their execution order is not necessarily sequential; they may be executed in turn or alternately with at least part of the sub-steps or stages of other steps.
The above are only some embodiments of the present invention. It should be noted that those of ordinary skill in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A knowledge base alignment method, characterized in that it comprises the following steps:
obtaining a knowledge entity vector set, wherein the knowledge entity vector set is the vectorized representation of the knowledge entities in a knowledge base to be aligned;
inputting the knowledge entity vector set into a preset knowledge entity clustering model to obtain a clustering result for the knowledge entities in the knowledge base to be aligned;
according to the clustering result, selecting any two knowledge entities that belong to the same class and calculating the similarity between the two knowledge entities;
merging the two knowledge entities when the similarity is greater than a set first threshold.
2. The knowledge base alignment method according to claim 1, characterized in that before the step of obtaining the knowledge entity vector set, the method further comprises the following steps:
obtaining the knowledge entities in the knowledge base to be aligned;
vectorizing the knowledge entities based on the TF-IDF algorithm to obtain the knowledge entity vector set.
3. The knowledge base alignment method according to claim 1, characterized in that the preset knowledge entity clustering model uses the DBSCAN density-based clustering algorithm.
4. The knowledge base alignment method according to claim 1, characterized in that the preset knowledge entity clustering model uses a clustering model based on a convolutional neural network, and training the clustering model based on the convolutional neural network comprises the following steps:
obtaining training samples labeled with cluster judgment information, the cluster judgment information of a training sample being the class of the sample knowledge entity;
inputting the training samples into a convolutional neural network model to obtain model clustering reference information for the training samples;
comparing, via a loss function, whether the model clustering reference information of different samples in the training samples is consistent with the cluster judgment information;
when the model clustering reference information is inconsistent with the cluster judgment information, iteratively and repeatedly updating the weights in the convolutional neural network model, and ending when the model clustering reference information is consistent with the cluster judgment information.
5. The knowledge base alignment method according to claim 1, characterized in that the step of selecting, according to the clustering result, any two knowledge entities that belong to the same class and calculating the similarity between the two knowledge entities specifically comprises the following steps:
obtaining the attributes of the two knowledge entities, wherein a knowledge entity attribute is data that describes the corresponding knowledge entity;
calculating the attribute similarity and the vector similarity of the two knowledge entities;
calculating a weighted sum of the attribute similarity and the vector similarity of the two knowledge entities according to the following formula to obtain the similarity between the two knowledge entities, namely:
S = aX + bY
where S is the similarity between the two knowledge entities, X is the attribute similarity, Y is the vector similarity, and a and b are respectively the weights of the attribute similarity and the vector similarity.
6. The knowledge base alignment method according to claim 1, characterized in that the step of merging the two knowledge entities when the similarity is greater than the set first threshold further comprises the following step:
when the similarity is greater than a set second threshold, wherein the second threshold is greater than the first threshold, deleting either one of the two knowledge entities from the knowledge base to be aligned.
7. The knowledge base alignment method according to claim 1, characterized in that the step of merging the two knowledge entities when the similarity is greater than the set first threshold further comprises the following steps:
a. splitting the two knowledge entities into several sub-entities;
b. selecting any two sub-entities among the several sub-entities and calculating the similarity between the two sub-entities;
c. when the similarity between the two sub-entities is greater than a preset third threshold, deleting either one of the two sub-entities, wherein the third threshold is greater than the first threshold;
d. repeating steps b and c until the similarity between any two of the retained sub-entities is less than or equal to the preset third threshold;
e. merging the retained sub-entities as the aligned entity of the two knowledge entities.
8. A knowledge base alignment apparatus, characterized in that it comprises:
an obtaining module for obtaining a knowledge entity vector set, wherein the knowledge entity vector set is the vectorized representation of the knowledge entities in a knowledge base to be aligned;
a processing module for inputting the knowledge entity vector set into a preset knowledge entity clustering model to obtain a clustering result for the knowledge entities in the knowledge base to be aligned;
a computing module for selecting, according to the clustering result, any two knowledge entities that belong to the same class and calculating the similarity between the two knowledge entities;
an execution module for merging the two knowledge entities when the similarity is greater than a set first threshold.
9. A computer device comprising a memory and a processor, wherein computer-readable instructions are stored in the memory, and when the computer-readable instructions are executed by the processor, the processor performs the steps of the knowledge base alignment method according to any one of claims 1 to 7.
10. A computer-readable storage medium on which computer-readable instructions are stored, characterized in that when the computer-readable instructions are executed by a processor, the processor performs the steps of the knowledge base alignment method according to any one of claims 1 to 7.
CN201811474699.XA 2018-12-04 2018-12-04 Knowledge base alignment method, device, computer equipment and storage medium Active CN109783582B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201811474699.XA CN109783582B (en) 2018-12-04 2018-12-04 Knowledge base alignment method, device, computer equipment and storage medium
PCT/CN2019/103487 WO2020114022A1 (en) 2018-12-04 2019-08-30 Knowledge base alignment method and apparatus, computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811474699.XA CN109783582B (en) 2018-12-04 2018-12-04 Knowledge base alignment method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109783582A true CN109783582A (en) 2019-05-21
CN109783582B CN109783582B (en) 2023-08-15

Family

ID=66496644

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811474699.XA Active CN109783582B (en) 2018-12-04 2018-12-04 Knowledge base alignment method, device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN109783582B (en)
WO (1) WO2020114022A1 (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110377906A (en) * 2019-07-15 2019-10-25 出门问问信息科技有限公司 Entity alignment schemes, storage medium and electronic equipment
CN110427436A (en) * 2019-07-31 2019-11-08 北京百度网讯科技有限公司 The method and device of entity similarity calculation
CN111026865A (en) * 2019-10-18 2020-04-17 平安科技(深圳)有限公司 Relation alignment method, device and equipment of knowledge graph and storage medium
CN111159420A (en) * 2019-12-12 2020-05-15 西安交通大学 Entity optimization method based on attribute calculation and knowledge template
WO2020114022A1 (en) * 2018-12-04 2020-06-11 平安科技(深圳)有限公司 Knowledge base alignment method and apparatus, computer device and storage medium
CN111488461A (en) * 2020-03-24 2020-08-04 腾讯科技(深圳)有限公司 Data processing method and device, electronic equipment and storage medium
CN111563192A (en) * 2020-04-28 2020-08-21 腾讯科技(深圳)有限公司 Entity alignment method and device, electronic equipment and storage medium
CN112541054A (en) * 2020-12-15 2021-03-23 平安科技(深圳)有限公司 Method, device, equipment and storage medium for governing questions and answers of knowledge base
CN112579770A (en) * 2019-09-30 2021-03-30 北京国双科技有限公司 Knowledge graph generation method, device, storage medium and equipment
CN112699909A (en) * 2019-10-23 2021-04-23 中移物联网有限公司 Information identification method and device, electronic equipment and computer readable storage medium
CN113536796A (en) * 2021-07-15 2021-10-22 北京明略昭辉科技有限公司 Entity alignment auxiliary method, device, equipment and storage medium

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112445876B (en) * 2020-11-25 2023-12-26 中国科学院自动化研究所 Entity alignment method and system for fusing structure, attribute and relationship information
CN112541360A (en) * 2020-12-07 2021-03-23 国泰君安证券股份有限公司 Cross-platform anomaly identification and translation method, device, processor and storage medium for clustering by using hyper-parametric self-adaptive DBSCAN (direct media Access controller area network)
CN113095948B (en) * 2021-03-24 2023-06-06 西安交通大学 Multi-source heterogeneous network user alignment method based on graph neural network
CN113361263B (en) * 2021-06-04 2023-10-20 中国人民解放军战略支援部队信息工程大学 Character entity attribute alignment method and system based on attribute value distribution
CN114329003A (en) * 2021-12-27 2022-04-12 北京达佳互联信息技术有限公司 Media resource data processing method and device, electronic equipment and storage medium
CN114676267A (en) * 2022-04-01 2022-06-28 北京明略软件系统有限公司 Method and device for entity alignment and electronic equipment
CN115563350A (en) * 2022-10-22 2023-01-03 山东浪潮新基建科技有限公司 Alignment and completion method and system for multi-source heterogeneous power grid equipment data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104239553A (en) * 2014-09-24 2014-12-24 江苏名通信息科技有限公司 Entity recognition method based on Map-Reduce framework
CN105279277A (en) * 2015-11-12 2016-01-27 百度在线网络技术(北京)有限公司 Knowledge data processing method and device
CN108154198A (en) * 2018-01-25 2018-06-12 北京百度网讯科技有限公司 Knowledge base entity normalizing method, system, terminal and computer readable storage medium
CN108363810A (en) * 2018-03-09 2018-08-03 南京工业大学 A kind of file classification method and device
CN108804567A (en) * 2018-05-22 2018-11-13 平安科技(深圳)有限公司 Method, equipment, storage medium and device for improving intelligent customer service response rate

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9430738B1 (en) * 2012-02-08 2016-08-30 Mashwork, Inc. Automated emotional clustering of social media conversations
CN103699663B (en) * 2013-12-27 2017-02-08 中国科学院自动化研究所 Hot event mining method based on large-scale knowledge base
CN109783582B (en) * 2018-12-04 2023-08-15 平安科技(深圳)有限公司 Knowledge base alignment method, device, computer equipment and storage medium
CN109739939A (en) * 2018-12-29 2019-05-10 颖投信息科技(上海)有限公司 The data fusion method and device of knowledge mapping

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104239553A (en) * 2014-09-24 2014-12-24 江苏名通信息科技有限公司 Entity recognition method based on Map-Reduce framework
CN105279277A (en) * 2015-11-12 2016-01-27 百度在线网络技术(北京)有限公司 Knowledge data processing method and device
CN108154198A (en) * 2018-01-25 2018-06-12 北京百度网讯科技有限公司 Knowledge base entity normalizing method, system, terminal and computer readable storage medium
CN108363810A (en) * 2018-03-09 2018-08-03 南京工业大学 A kind of file classification method and device
CN108804567A (en) * 2018-05-22 2018-11-13 平安科技(深圳)有限公司 Method, equipment, storage medium and device for improving intelligent customer service response rate

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020114022A1 (en) * 2018-12-04 2020-06-11 平安科技(深圳)有限公司 Knowledge base alignment method and apparatus, computer device and storage medium
CN110377906A (en) * 2019-07-15 2019-10-25 出门问问信息科技有限公司 Entity alignment schemes, storage medium and electronic equipment
CN110427436B (en) * 2019-07-31 2022-03-22 北京百度网讯科技有限公司 Method and device for calculating entity similarity
CN110427436A (en) * 2019-07-31 2019-11-08 北京百度网讯科技有限公司 The method and device of entity similarity calculation
CN112579770A (en) * 2019-09-30 2021-03-30 北京国双科技有限公司 Knowledge graph generation method, device, storage medium and equipment
WO2021072891A1 (en) * 2019-10-18 2021-04-22 平安科技(深圳)有限公司 Knowledge graph relationship alignment method, apparatus and device, and storage medium
CN111026865A (en) * 2019-10-18 2020-04-17 平安科技(深圳)有限公司 Relation alignment method, device and equipment of knowledge graph and storage medium
CN111026865B (en) * 2019-10-18 2023-07-21 平安科技(深圳)有限公司 Knowledge graph relationship alignment method, device, equipment and storage medium
CN112699909A (en) * 2019-10-23 2021-04-23 中移物联网有限公司 Information identification method and device, electronic equipment and computer readable storage medium
CN112699909B (en) * 2019-10-23 2024-03-19 中移物联网有限公司 Information identification method, information identification device, electronic equipment and computer readable storage medium
CN111159420A (en) * 2019-12-12 2020-05-15 西安交通大学 Entity optimization method based on attribute calculation and knowledge template
CN111159420B (en) * 2019-12-12 2023-04-28 西安交通大学 Entity optimization method based on attribute calculation and knowledge template
CN111488461A (en) * 2020-03-24 2020-08-04 腾讯科技(深圳)有限公司 Data processing method and device, electronic equipment and storage medium
CN111563192A (en) * 2020-04-28 2020-08-21 腾讯科技(深圳)有限公司 Entity alignment method and device, electronic equipment and storage medium
CN112541054A (en) * 2020-12-15 2021-03-23 平安科技(深圳)有限公司 Method, device, equipment and storage medium for governing questions and answers of knowledge base
CN112541054B (en) * 2020-12-15 2023-08-29 平安科技(深圳)有限公司 Knowledge base question and answer management method, device, equipment and storage medium
CN113536796A (en) * 2021-07-15 2021-10-22 北京明略昭辉科技有限公司 Entity alignment auxiliary method, device, equipment and storage medium

Also Published As

Publication number Publication date
WO2020114022A1 (en) 2020-06-11
CN109783582B (en) 2023-08-15

Similar Documents

Publication Publication Date Title
CN109783582A (en) A kind of knowledge base alignment schemes, device, computer equipment and storage medium
US9542454B2 (en) Object-based information storage, search and mining system
US20100088342A1 (en) Incremental feature indexing for scalable location recognition
CN113127632B (en) Text summarization method and device based on heterogeneous graph, storage medium and terminal
CN110222709A (en) A kind of multi-tag intelligence marking method and system
CN111353303A (en) Word vector construction method and device, electronic equipment and storage medium
CN112199600A (en) Target object identification method and device
CN114329029B (en) Object retrieval method, device, equipment and computer storage medium
CN112131261B (en) Community query method and device based on community network and computer equipment
CN114706987B (en) Text category prediction method, device, equipment, storage medium and program product
CN114065048A (en) Article recommendation method based on multi-different-pattern neural network
CN115600017A (en) Feature coding model training method and device and media object recommendation method and device
CN116703531B (en) Article data processing method, apparatus, computer device and storage medium
CN113095901A (en) Recommendation method, training method of related model, electronic equipment and storage device
CN112765481A (en) Data processing method and device, computer and readable storage medium
US20240005170A1 (en) Recommendation method, apparatus, electronic device, and storage medium
Vrigkas et al. Active privileged learning of human activities from weakly labeled samples
Fushimi et al. Accelerating Greedy K-Medoids Clustering Algorithm with Distance by Pivot Generation
CN110688508B (en) Image-text data expansion method and device and electronic equipment
JP4963341B2 (en) Document relationship visualization method, visualization device, visualization program, and recording medium recording the program
CN115455306B (en) Push model training method, information push device and storage medium
CN113392257B (en) Image retrieval method and device
CN114492669B (en) Keyword recommendation model training method, recommendation device, equipment and medium
CN117312533B (en) Text generation method, device, equipment and medium based on artificial intelligent model
US20230306291A1 (en) Methods, apparatuses and computer program products for generating synthetic data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant