CN105608075A - Related knowledge point acquisition method and system - Google Patents


Info

Publication number
CN105608075A
CN105608075A (application CN201410497470.3A)
Authority
CN
China
Prior art keywords
knowledge point
candidate
domain
correlated
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410497470.3A
Other languages
Chinese (zh)
Inventor
叶茂
徐剑波
汤帜
杨亮
卢菁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Peking University Founder Group Co Ltd
Beijing Founder Apabi Technology Co Ltd
Original Assignee
Peking University
Peking University Founder Group Co Ltd
Beijing Founder Apabi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University, Peking University Founder Group Co Ltd, Beijing Founder Apabi Technology Co Ltd
Priority to CN201410497470.3A
Publication of CN105608075A
Legal status: Pending


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a related knowledge point acquisition method. The method comprises: firstly, acquiring domain knowledge points; then carrying out word segmentation on a text in a domain according to the domain knowledge points; obtaining candidate knowledge points after removing common words; obtaining semantic vectors of the candidate knowledge points; and obtaining candidate knowledge points, related to each domain knowledge point, as target knowledge points by calculating similarity between the domain knowledge points and the candidate knowledge points. Thus, a plurality of target knowledge points related to each domain knowledge point can be obtained. When constructing an encyclopedia directory entry, it may be determined, through searching, whether each domain knowledge point has a related knowledge point, and if not, a related knowledge point needs to be added. In this way, checking and construction of encyclopedia entries are completed, so that a manual workload is significantly reduced; time costs and labor costs are reduced; inaccuracy caused by subjectivity and non-uniform standards of manual checking is avoided; and efficiency and accuracy are greatly improved.

Description

Method and system for acquiring related knowledge points
Technical field
The present invention relates to the field of electric digital data processing, and in particular to a method and system for acquiring related knowledge points.
Background technology
Digital publishing resources have become one of the main ways in which information is provided, as readers move in large numbers from paper to electronic reading. Digital publishing resources include e-books, digital encyclopedias, digital periodicals, digital newspapers, and so on. The information they supply is generally more authoritative and correct than that on the open internet. How to improve people's study or reading experience according to the characteristics of digital publishing resources has therefore become particularly important.
An encyclopedia is a reference work that introduces the whole of human knowledge or the knowledge of a particular field. It is usually laid out in dictionary form, with the entry as the basic unit, and collects the nouns, idioms, place names, events, figures, works, and so on of each domain of knowledge. An encyclopedia can be comprehensive, covering content from all fields (the Encyclopaedia Britannica, for example, is a famous comprehensive encyclopedia), or it can be specialized, such as an encyclopedia of history or a military encyclopedia; an encyclopedia restricted to certain fields is called a domain encyclopedia. Encyclopedias are considered a mark of the scientific and cultural development of a country and an era.
A domain encyclopedia classifies massive amounts of information and provides users with more targeted resources; it is also an important kind of digital publishing resource. A domain encyclopedia usually organizes domain information in the form of entries and needs to include the important entries of the field. However, building a domain encyclopedia requires a great deal of human effort. Because the number of domain entries is large, finding suitable entries manually is not only time-consuming but also prone to omitting some highly relevant ones. Determining whether all related entries have been included is very important work, but carrying it out costs a great deal of manpower and time.
Distributed word representations were first proposed in Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning Representations by Back-propagating Errors. Nature 323(6088), pp. 533-536 (1986). The idea is to represent words as continuous vectors such that similar words lie close together in the vector space. Feed-forward neural networks have been used to learn word vectors and language models (see Bengio, Y., Ducharme, R., Vincent, P.: A Neural Probabilistic Language Model. Journal of Machine Learning Research 3, pp. 1137-1155 (2003)). More recently, Mikolov proposed using the skip-gram or CBOW model to train a simple neural network on large amounts of text and obtain word vectors in a short time. Although there has been some theoretical research on semantic vectors, applications and popularization of this technology are still lacking.
Summary of the invention
The technical problem to be solved by the present invention is that, in the prior art, obtaining related entries requires manual screening, with a heavy workload and poor objectivity. The invention therefore proposes a method for determining related knowledge points according to semantic vectors.
To solve the above technical problem, the present invention provides a method and system for acquiring related knowledge points.
The invention provides a method for acquiring related knowledge points, comprising:
acquiring domain knowledge points;
segmenting a text into words according to the domain knowledge points to obtain a word segmentation result;
determining candidate knowledge points according to the word segmentation result and common words;
determining the semantic vector of each candidate knowledge point;
for each domain knowledge point, calculating the semantic similarity between that domain knowledge point and the candidate knowledge points; and
determining, according to the calculated semantic similarity, the target knowledge points related to that domain knowledge point.
The invention also provides a system for acquiring the related knowledge points of knowledge points, comprising:
an extraction unit, which acquires domain knowledge points;
a word segmentation unit, which segments a text into words according to the domain knowledge points to obtain a word segmentation result;
a candidate unit, which determines candidate knowledge points according to the word segmentation result and common words;
a semantic vector computing unit, which determines the semantic vector of each candidate knowledge point;
a similarity computing unit, which, for each domain knowledge point, calculates the semantic similarity between that domain knowledge point and the candidate knowledge points; and
a related knowledge point computing unit, which determines, according to the calculated semantic similarity, the target knowledge points related to that domain knowledge point.
Compared with the prior art, the above technical solutions of the present invention have the following advantages:
(1) The invention provides a method for acquiring related knowledge points. Domain knowledge points are first acquired; texts in the field are then segmented into words according to these domain knowledge points; candidate knowledge points are obtained after common words are removed; the semantic vectors of the candidate knowledge points are then obtained; and by calculating the similarity between domain knowledge points and candidate knowledge points, the candidate knowledge points related to each domain knowledge point are obtained as target knowledge points. In this way, several target knowledge points related to each domain knowledge point can be obtained. When building the entries of an encyclopedia catalogue, one can search whether the related knowledge points of each domain knowledge point already exist, and add them if they do not. The checking and construction of the entries of a domain encyclopedia are thereby completed. This greatly reduces manual workload, saves time and labor costs, avoids the inaccuracy brought about by the subjectivity and inconsistent standards of manual checking, and greatly improves efficiency and accuracy.
(2) In the method for acquiring related knowledge points of the present invention, the semantic vector of each candidate knowledge point is calculated during the acquisition process, so that the semantic information of a knowledge point is quantified and its semantic features are embodied in digital form. This makes subsequent analysis of knowledge points convenient and provides a basis for applications such as knowledge point search, recommendation, and information filtering.
(3) The invention also provides a system for acquiring related knowledge points, comprising an extraction unit, a word segmentation unit, a candidate unit, a semantic vector computing unit, a similarity computing unit, and a related knowledge point computing unit. By computing semantic vectors and calculating the similarity between domain knowledge points and candidate knowledge points, the candidate knowledge points related to each domain knowledge point are obtained, yielding several target knowledge points related to each domain knowledge point. When building the entries of an encyclopedia catalogue, one can search whether the related knowledge points of each domain knowledge point exist, and add them if they do not. The checking and construction of the entries of a domain encyclopedia are completed in this way, greatly reducing manual workload.
Brief description of the drawings
In order that the content of the present invention may be more clearly understood, the present invention is described in further detail below according to specific embodiments and with reference to the accompanying drawings, in which:
Fig. 1 is a flowchart of the method for acquiring related knowledge points in Embodiment 1;
Fig. 2 is a flowchart of calculating the semantic vectors of candidate knowledge points in Embodiment 2;
Fig. 3 is a schematic diagram of the skip-gram model in Embodiment 2;
Fig. 4 is a schematic diagram of the CBOW model in Embodiment 2;
Fig. 5 is a structural block diagram of the system for acquiring related knowledge points in Embodiment 4.
Detailed description of the invention
Embodiment 1:
This embodiment provides a method for acquiring related knowledge points. The method obtains the knowledge points related to all the knowledge points in a field, and the related knowledge points obtained are then used to detect and fill gaps in the entries of an established domain encyclopedia, further perfecting it; this has very good guiding value. A knowledge point is the basic unit in which information is transmitted, and research on the representation of knowledge points plays an important role in improving learning and associated navigation, information recommendation, retrieval, dictionary building, and so on.
The flowchart of this method for acquiring related knowledge points is shown in Fig. 1, and the detailed process is as follows:
First, the domain knowledge points are acquired, i.e., all the knowledge points in the field are obtained. For example, when building an encyclopedia, all the entries already established in the field can be obtained and used as domain knowledge points.
Then, a text is segmented into words according to the domain knowledge points to obtain a word segmentation result. The text here is taken from digital resources in the field; to make the knowledge points it contains sufficiently broad, many electronic digital resources of the field are generally selected. After the field's digital resources are selected, text is extracted from them and then segmented. During segmentation, the domain knowledge points are first added to the word segmenter, and segmentation is then carried out with that segmenter. The effect of adding the domain knowledge points to the segmenter is that each knowledge point of the field is treated as a single word during segmentation. For example, in a sentence such as "Emperor Qin Shihuang went up to the hall to receive homage," both "Emperor Qin Shihuang" and "emperor" could be taken as words, so there are two segmentation possibilities; since "Emperor Qin Shihuang" exists among the domain knowledge points, after they are added to the segmenter it will be segmented as a single word. In this way, by adding the domain knowledge points to the segmenter, the texts in the field can be segmented better, and the segmentation result obtained for the field is more accurate.
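The effect of adding domain terms to the segmenter can be sketched with a minimal forward-maximum-matching segmenter. This is an illustrative stand-in, not the embodiment's actual tokenizer; the mini-dictionary and the "秦始皇" (Emperor Qin Shihuang) example sentence are toy assumptions:

```python
def fmm_segment(text, dictionary, max_len=4):
    """Forward maximum matching: prefer the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        match = text[i]                      # fall back to a single character
        for L in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + L] in dictionary:
                match = text[i:i + L]
                break
        words.append(match)
        i += len(match)
    return words

# Base dictionary without the domain term: the name splits into characters.
base = {"上", "大殿", "朝拜"}
domain_points = {"秦始皇"}                   # domain knowledge points added to the segmenter

sentence = "秦始皇上大殿朝拜"
print(fmm_segment(sentence, base))                  # → ['秦', '始', '皇', '上', '大殿', '朝拜']
print(fmm_segment(sentence, base | domain_points))  # → ['秦始皇', '上', '大殿', '朝拜']
```

With the domain term in the dictionary, the multi-character knowledge point survives segmentation as one word, which is exactly the behavior the paragraph describes.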
A large number of words are obtained after segmentation. These words include some knowledge points of the field as well as some common words, such as "you," "they," "have a meal," and so on. The file after segmentation serves as the candidate file.
Next, candidate knowledge points are determined according to the word segmentation result and the common words. Because common words are a series of frequently used words, removing them from the above segmentation result leaves the words specific to the domain, which are taken as the candidate knowledge points. The common words here are those already determined in the prior art. In other embodiments, common words can also be determined by the following method: select digital resources of ordinary text, such as life newspapers and life magazines; segment them into words (using a stop-word list to remove stop words, for example the Harbin Institute of Technology stop-word list); and define the words that occur in many of the texts as common words. After the common words are removed from the candidate file, the words of the field remain as candidate knowledge points.
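The alternative just described, determining common words by document frequency over everyday texts and removing them from the segmented domain file, can be sketched as follows; the toy documents and the 0.5 document-frequency cutoff are assumptions for illustration:

```python
from collections import Counter

def common_words(everyday_docs, min_doc_ratio=0.5):
    """Words appearing in at least min_doc_ratio of the everyday documents."""
    df = Counter()
    for doc in everyday_docs:
        df.update(set(doc))                  # document frequency, not term frequency
    cutoff = min_doc_ratio * len(everyday_docs)
    return {w for w, c in df.items() if c >= cutoff}

def candidate_points(segmented_domain_text, everyday):
    """Remove the common words; what remains are the candidate knowledge points."""
    return [w for w in segmented_domain_text if w not in everyday]

# Hypothetical segmented everyday texts and a segmented domain file.
everyday_docs = [["we", "eat", "today"], ["they", "eat", "well"], ["we", "read", "today"]]
domain_tokens = ["emperor_qin", "eat", "great_wall", "we", "han_dynasty"]

stop = common_words(everyday_docs)
print(candidate_points(domain_tokens, stop))  # → ['emperor_qin', 'great_wall', 'han_dynasty']
```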
Then, the semantic vector of each candidate knowledge point is calculated. The calculation can adopt a method from the prior art; by computing semantic vectors, each knowledge point is quantified and represented digitally in a semantic way.
Then, for each domain knowledge point, the semantic similarity between that domain knowledge point and the candidate knowledge points is calculated. Because the candidate knowledge points are obtained from a large number of digital resources in the field, we consider that they contain all of the domain knowledge points. The semantic vector of each domain knowledge point is therefore found by searching among the candidate knowledge points, and its semantic similarity with each candidate knowledge point is then calculated.
The semantic similarity here is computed as:

f(X, Y) = \frac{X \cdot Y}{\|X\| \, \|Y\|} = \frac{\sum_{i=1}^{m} X_i Y_i}{\sqrt{\sum_{i=1}^{m} X_i^2} \, \sqrt{\sum_{i=1}^{m} Y_i^2}}

where X and Y are the two m-dimensional vectors whose similarity is to be compared, one being the semantic vector of a domain knowledge point and the other the semantic vector of a candidate knowledge point, and f(X, Y) is the semantic similarity of X and Y.
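This formula is the standard cosine similarity, which can be sketched directly; the example vectors are hypothetical:

```python
import math

def cosine_similarity(x, y):
    """f(X, Y) = X.Y / (||X|| ||Y||) over two m-dimensional semantic vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

domain_vec = [1.0, 2.0, 0.0]      # hypothetical semantic vector of a domain knowledge point
candidate_vec = [2.0, 4.0, 0.0]   # a perfectly aligned candidate

print(round(cosine_similarity(domain_vec, candidate_vec), 6))  # → 1.0
```

Parallel vectors score 1.0 and orthogonal vectors score 0.0, matching the formula above term by term.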
In this way, the semantic similarity between every domain knowledge point and every candidate knowledge point is obtained.
Finally, the target knowledge points related to each domain knowledge point are determined according to the calculated semantic similarity. The candidate knowledge points can be sorted in descending order of similarity to the domain knowledge point, and a certain number of the top-ranked candidates selected as its related knowledge points. As an alternative embodiment, a similarity threshold can be set in advance, and the candidate knowledge points whose similarity exceeds the threshold chosen as the related knowledge points of the domain knowledge point.
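The two selection alternatives above (top-n ranking and a similarity threshold) can be sketched together; the candidate names, toy vectors, and dot-product similarity are illustrative assumptions:

```python
def related_points(domain_vec, candidates, sim, top_n=None, threshold=None):
    """Rank candidate knowledge points by similarity to one domain knowledge point.

    Either keep the top_n highest-scoring candidates, or all those whose
    similarity exceeds threshold (the two alternatives in the embodiment).
    """
    scored = sorted(((sim(domain_vec, v), name) for name, v in candidates.items()),
                    reverse=True)
    if threshold is not None:
        scored = [(s, n) for s, n in scored if s > threshold]
    if top_n is not None:
        scored = scored[:top_n]
    return [n for _, n in scored]

dot = lambda a, b: sum(x * y for x, y in zip(a, b))   # stand-in similarity measure
cands = {"great_wall": [1.0, 0.0], "banquet": [0.0, 1.0], "terracotta": [0.9, 0.1]}

print(related_points([1.0, 0.0], cands, dot, top_n=2))        # → ['great_wall', 'terracotta']
print(related_points([1.0, 0.0], cands, dot, threshold=0.5))  # → ['great_wall', 'terracotta']
```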
In the method for acquiring related knowledge points provided in this embodiment, the domain knowledge points are first acquired; the texts in the field are then segmented according to these domain knowledge points; the candidate knowledge points are obtained after the common words are removed; the semantic vectors of the candidate knowledge points are then obtained; and by calculating the similarity between domain knowledge points and candidate knowledge points, the candidate knowledge points related to each domain knowledge point are obtained as target knowledge points. In this way, several target knowledge points related to each domain knowledge point can be obtained. When building the entries of an encyclopedia catalogue, one can search whether the related knowledge points of each domain knowledge point exist, and add them if they do not. The checking and construction of the entries of a domain encyclopedia are thereby completed, greatly reducing manual workload, saving time and labor costs, avoiding the inaccuracy brought about by the subjectivity and inconsistent standards of manual checking, and greatly improving efficiency and accuracy.
Embodiment 2:
This embodiment provides a method for acquiring related knowledge points whose steps are the same as those of Embodiment 1. This embodiment gives a concrete method for the step of calculating the semantic vector of each candidate knowledge point in the above process; the detailed procedure is as follows:
In the first step, the number of times each candidate knowledge point occurs in the candidate file is determined, yielding the text and the occurrence count of each candidate knowledge point. The candidate text is the text obtained after segmenting the selected digital resources, and the candidate knowledge points are the words obtained by removing the common words from the segmented candidate text; this part is the same as in Embodiment 1 and is not repeated here.
In the second step, a binary tree with minimum weighted path length is computed according to each candidate knowledge point and the number of times it occurs in the candidate text.
Taking the candidate knowledge points as leaf nodes, each leaf node is given a weight equal to the occurrence count of its knowledge point in the text, and a binary tree is constructed in which the weight of a parent node is the sum of the weights of its two child nodes. The weighted path length of the tree is defined as the sum of the weighted path lengths of all leaf nodes; if the weighted path length is minimal, the binary tree is called an optimal binary tree, also known as a Huffman tree. The construction here adopts an existing approach from the prior art, and the binary tree of minimum weighted path length is obtained by an existing algorithm.
In the third step, the semantic vector of each knowledge point is determined according to the position of each knowledge point in the candidate text and the binary tree of minimum weighted path length.
First, a skip-gram model is created. The skip-gram model is a prior-art neural network model, shown schematically in Fig. 3, used for training word vectors. Its basic principle is to predict the words within a certain range before and after the current word, thereby obtaining a suitable word vector representation. The training method used is stochastic gradient descent, the input is text data, and word vectors can be obtained from the training result.
As an alternative embodiment, a CBOW model can also be chosen here; its schematic diagram is shown in Fig. 4, and it is likewise a prior-art neural network model for training word vectors. The CBOW model predicts a word from the context in which it appears, thereby obtaining a suitable word vector representation. The training method used is stochastic gradient descent, the input is text data, and word vectors can be obtained from the training result.
In addition, some documents give concrete introductions and applications of the skip-gram and CBOW models, as follows:
Mikolov, T., Chen, K., Corrado, G., et al.: Efficient Estimation of Word Representations in Vector Space. In Proc. ICLR Workshop (2013)
Mikolov, T., Sutskever, I., Chen, K., et al.: Distributed Representations of Words and Phrases and Their Compositionality. In Proc. NIPS (2013)
After the above model is built, it is trained with the candidate file as the training sample and the binary tree of minimum weighted path length as the output layer. After training is complete, the semantic vector of each candidate knowledge point is obtained from the node vectors in the binary tree of minimum weighted path length. Concretely, the trained leaf-node vector corresponding to each leaf position in the optimal binary tree is extracted; this vector is the semantic vector of that knowledge point.
In this embodiment, after the semantic vectors of the candidate knowledge points are obtained, the following formula is adopted when calculating the similarity between a domain knowledge point and a candidate knowledge point:

f(X, Y) = \frac{2 \sum_{i=1}^{m} X_i Y_i}{\sum_{i=1}^{m} X_i^2 + \sum_{i=1}^{m} Y_i^2}

where X and Y are the two m-dimensional vectors whose similarity is to be compared, one being the semantic vector of a domain knowledge point and the other the semantic vector of a candidate knowledge point, and f(X, Y) is the semantic similarity of X and Y.
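This alternative measure (a Dice-style coefficient over the vector components) can be sketched in a few lines; the example vectors are toy assumptions:

```python
def dice_similarity(x, y):
    """f(X, Y) = 2 sum(X_i * Y_i) / (sum(X_i^2) + sum(Y_i^2)), the embodiment's alternative measure."""
    num = 2 * sum(a * b for a, b in zip(x, y))
    den = sum(a * a for a in x) + sum(b * b for b in y)
    return num / den

print(dice_similarity([1.0, 2.0], [1.0, 2.0]))  # identical vectors score 1.0 → 1.0
print(dice_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors score 0.0 → 0.0
```

Unlike cosine similarity, this measure also penalizes differences in vector magnitude, since the denominator adds rather than multiplies the squared norms.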
Finally, the candidate knowledge points related to each domain knowledge point are obtained from this semantic similarity and taken as target knowledge points. These target knowledge points are searched for in the domain encyclopedia, completing the checking and construction of the domain encyclopedia's entries.
In this embodiment, calculating the semantic vectors of candidate knowledge points during the acquisition process quantifies the semantic information of the knowledge points and embodies their semantic features digitally. This makes subsequent analysis of knowledge points convenient and provides a basis for applications such as knowledge point search, recommendation, and information filtering.
Embodiment 3:
A domain encyclopedia is an important kind of digital publishing resource. It usually organizes domain information in the form of entries and needs to include the important entries of the field, but building one requires a great deal of human effort. This embodiment provides a method for acquiring related knowledge points in which the domain knowledge points are the entries of a domain encyclopedia. In this embodiment, domain e-book texts and newspaper and periodical texts are used, and the semantic vectors of the candidate entries are calculated with the skip-gram model. The semantic similarity between the established domain entries and the candidate entries obtained is computed from the semantic vectors. Using the semantic similarity of entries, other domain entries that are semantically related to the entries of the domain encyclopedia but have been omitted are found, so as to reduce the possibility that some domain entries are missed. The concrete steps are as follows.
In the first step, the domain knowledge points are acquired. Taking the field of history as an example, all the entries already established in a historical encyclopedia are obtained and used as domain knowledge points, and these domain entries are then added to the dictionary of the word segmenter. The segmenter here may be the IK segmenter; in other embodiments, other segmenters such as the Ansj segmenter can also be chosen.
In the second step, domain e-books are selected. For the field of history these can be e-books specific to the field, such as digital resources like Up and Down Five Thousand Years and the histories of the successive dynasties; as many digital resources of the field as possible are selected so that they cover all the entries of the field as far as possible. Text is extracted from these domain e-books and then segmented with the above segmenter to which the domain entries have been added, yielding the segmented text F.
In the third step, everyday newspaper and periodical texts are selected and segmented with the segmenter, and the common words are determined from the segmentation result.
In the fourth step, since the segmentation result of the second step contains both domain words and some common words, and the third step has yielded the common words, the common words are removed from the segmented text F of the second step, and the remaining words serve as domain candidate entries.
In the fifth step, according to the domain candidate entries, the number of occurrences of each candidate entry in file F is counted to form a statistics file, in which the entries are arranged in descending order of occurrence count. The format of the statistics file is as follows, where o_i, o_j, o_k are entry names and t_i, t_j, t_k are the entries' occurrence counts in file F:

o_i, t_i
o_j, t_j
o_k, t_k
According to this statistics file, a Huffman tree is formed with the entries as leaf nodes. The process of building the Huffman tree here is as follows:
1. According to the given n entries, generate a set R = {r_1, r_2, ..., r_n} of n binary trees, where each binary tree r_i has only a root node with weight w_i, the weight w_i equals the occurrence count t_i of the entry, and the left and right subtrees are empty.
2. Select the two trees in R whose root weights are smallest and use them as the left and right subtrees of a new binary tree, setting the weight of the new tree's root to the sum of the root weights of its left and right subtrees.
3. Delete these two trees from R and add the new binary tree to R.
4. Repeat steps 2 and 3 until R contains only one tree.
The tree obtained is the Huffman tree.
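The four steps above can be sketched with a priority queue, which always yields the two smallest-weight roots in step 2; the entry names and counts are hypothetical:

```python
import heapq
import itertools

def huffman_tree(counts):
    """Build the minimum weighted-path-length binary tree over entry counts.

    counts: {entry_name: occurrence_count}. Returns the root as nested tuples
    (weight, left, right); leaves are (weight, name).
    """
    tick = itertools.count()                 # tie-breaker so heap tuples always compare
    heap = [(w, next(tick), (w, name)) for name, w in counts.items()]
    heapq.heapify(heap)                      # step 1: the set R of single-node trees
    while len(heap) > 1:                     # step 4: repeat until one tree remains
        w1, _, left = heapq.heappop(heap)    # step 2: two smallest-weight roots
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tick), (w1 + w2, left, right)))  # step 3
    return heap[0][2]

root = huffman_tree({"qin": 5, "han": 2, "tang": 1, "song": 2})
print(root[0])  # root weight = total occurrence count → 10
```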
In the sixth step, training is performed with the skip-gram model, and the leaf-node vector corresponding to each entry in the Huffman tree is obtained, thereby obtaining the semantic vector of each entry.
The skip-gram model is a prior-art neural network model for training word vectors. Its basic principle is to predict the words within a certain range before and after the current word, thereby obtaining a suitable word vector representation; the training method used is stochastic gradient descent, the input is text data, and word vectors can be obtained from the training result.
In this embodiment, the skip-gram model is created first. As shown in Fig. 3, the model comprises an input layer, a projection (middle) layer, and an output layer, where the output layer adopts the Huffman tree of the preceding step. The path of each entry w from the root node to its leaf node is denoted L(w); n(w, j) denotes the j-th node on this path; ch(n) denotes a child node of a non-leaf node n; and s(x) is the sign function, taking the value 1 when x is true and -1 otherwise. For a training set w_1, w_2, ..., w_T (the words in the training set), the skip-gram model maximizes the probability value

\frac{1}{T} \sum_{t=1}^{T} \sum_{-k \le j \le k,\, j \ne 0} \log p(w_{t+j} \mid w_t)

where k is the window size centered on w_t and T is the number of words in the training set. Generally, the larger the value of k, the more accurate the training result, but the longer the training time required. The probability p(w | w_I) is defined as

p(w \mid w_I) = \prod_{j=1}^{L(w)-1} \sigma\!\left( s\big(n(w, j+1) = ch(n(w, j))\big) \cdot {v'_{n(w,j)}}^{\top} v_{w_I} \right)

where \sigma(x) = 1 / (1 + \exp(-x)), v_w is the vector representation of leaf node w, and v'_n is the vector representation of non-leaf node n. During training, a word w_i in the training set is discarded with probability

P(w_i) = 1 - \sqrt{t / g(w_i)}

where t is a specified threshold and g(w_i) is the frequency with which word w_i occurs; the purpose of discarding words with this probability is to speed up training and improve accuracy.
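The per-word probability p(w | w_I) along the Huffman path can be sketched directly from the formula; the two-node path, its signs, and the node vectors below are toy assumptions, not trained values:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_word_given_context(path_signs, path_vectors, v_input):
    """p(w | w_I): product of sigmoid(s_j * <v'_n(w,j), v_wI>) along the Huffman path.

    path_signs: +1 or -1 per inner node, encoding which child the path takes.
    path_vectors: the inner-node vectors v'_n(w,j).
    v_input: the input word vector v_wI.
    """
    prob = 1.0
    for s, v_node in zip(path_signs, path_vectors):
        dot = sum(a * b for a, b in zip(v_node, v_input))
        prob *= sigmoid(s * dot)
    return prob

# Toy two-node path with hypothetical vectors.
p = p_word_given_context([+1, -1], [[0.5, -0.2], [0.1, 0.3]], [1.0, 2.0])
print(0.0 < p < 1.0)  # a valid probability along the path → True
```

Because each factor is a sigmoid of a binary branch decision, the probabilities of all leaves under any inner node sum to 1, which is what makes the Huffman-tree output layer a normalized (hierarchical) softmax.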
In the seventh step, the model is trained with the segmented file F as the training sample, using the stochastic gradient descent back-propagation algorithm. After model training is complete, the semantic vector v_i of each candidate entry o_i is obtained.
In the eighth step, for each entry o_i in the domain encyclopedia, the semantic similarity between this entry and all the other candidate entries is calculated. The similarity calculation methods have been introduced in the other embodiments above and are not repeated here; the calculation method of the above embodiments can be selected as required. The entries are sorted in descending order of semantic similarity, and the m entries with the highest similarity are obtained. Whether these entries are in the domain encyclopedia is then checked; if they are not, they are recorded in a file for the builders of the domain encyclopedia to examine.
Because the number of entries in a domain encyclopedia is large, finding suitable domain entries manually is not only time-consuming but also prone to omitting some highly relevant ones. The method for acquiring related knowledge points in this embodiment can be used to check the construction of domain encyclopedia entries, finding other domain entries that are semantically related to the encyclopedia's entries so as to reduce the possibility that some domain entries are missed.
Embodiment 4:
This embodiment provides a system for acquiring the related knowledge points of knowledge points, as shown in Fig. 5, comprising:
an extraction unit, which acquires domain knowledge points;
a word segmentation unit, which segments a text into words according to the domain knowledge points to obtain a word segmentation result;
a candidate unit, which determines candidate knowledge points according to the word segmentation result and common words;
a semantic vector computing unit, which determines the semantic vector of each candidate knowledge point;
a similarity computing unit, which, for each domain knowledge point, calculates the semantic similarity between that domain knowledge point and the candidate knowledge points; and
a related knowledge point computing unit, which determines, according to the calculated semantic similarity, the target knowledge points related to that domain knowledge point.
The word segmentation unit comprises:
a segmenter building unit, which adds the domain knowledge points to the word segmenter;
an extracting unit, which selects domain digital resources and extracts text from them; and
a candidate file acquiring unit, which segments the text with the segmenter and takes the segmented file as the candidate file.
The candidate unit comprises:
a common word determining unit, which selects digital resources of ordinary text and segments them to determine the common words; and
a candidate knowledge point determining unit, which removes the common words from the words in the candidate file to obtain the candidate knowledge points.
Wherein, the semantic vector computing unit comprises:
Statistics unit: determines the number of times each candidate knowledge point occurs in the candidate file;
Optimal binary tree computing unit: computes the binary tree of minimum weighted path length according to each candidate knowledge point and the number of times it occurs in the candidate file;
Semantic vector determining unit: determines the semantic vector of each candidate knowledge point according to its position in the candidate file and the binary tree of minimum weighted path length.
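The binary tree of minimum weighted path length built from the occurrence counts is a Huffman tree. A minimal construction sketch, with hypothetical word counts:

```python
import heapq
import itertools

def huffman_codes(freqs):
    """Build the binary tree of minimum weighted path length (a Huffman
    tree) over {word: count} and return each word's root-to-leaf code."""
    tie = itertools.count()  # tie-breaker so equal weights never compare dicts
    heap = [(count, next(tie), {word: ""}) for word, count in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # merge the two lightest subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {w: "0" + code for w, code in left.items()}
        merged.update({w: "1" + code for w, code in right.items()})
        heapq.heappush(heap, (w1 + w2, next(tie), merged))
    return heap[0][2]

freqs = {"知识点": 5, "向量": 2, "相似度": 1}   # hypothetical counts
codes = huffman_codes(freqs)
print(codes)
```

Frequent words end up close to the root, which is what makes this tree efficient as a hierarchical-softmax output layer: in word2vec each internal node carries a parameter vector, and a word's probability is a product of sigmoid decisions along its code path.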
The above semantic vector determining unit further comprises:
Modeling unit: creates a skip-gram model;
Training unit: trains the model with the candidate file as training samples and the binary tree of minimum weighted path length as the output layer;
Computing unit: after training is complete, obtains the semantic vector of each candidate knowledge point from the node vectors in the binary tree of minimum weighted path length.
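The combination described here, a skip-gram model whose output layer is the minimum-weighted-path binary tree, is the hierarchical-softmax variant of word2vec; a library such as gensim implements it as `Word2Vec(sentences, sg=1, hs=1, negative=0)`. As a dependency-free illustration, the sketch below only shows how the (center, context) training pairs that such a model consumes are drawn from a segmented candidate file:

```python
# Skip-gram training pairs: each token predicts its neighbors within a
# symmetric window. Training these pairs against a Huffman-tree output
# layer (hierarchical softmax) yields the semantic vectors.

def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["哈夫曼树", "语义向量", "相似度"]   # hypothetical candidate file
print(skipgram_pairs(tokens, window=1))
```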
In this embodiment, the similarity computing unit uses the following formula:

$$f(X, Y) = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert} = \frac{\sum_{i=1}^{m} X_i \times Y_i}{\sqrt{\sum_{i=1}^{m} (X_i)^2} \times \sqrt{\sum_{i=1}^{m} (Y_i)^2}}$$

where X and Y are the two m-dimensional vectors whose similarity is to be compared, and f(X, Y) is the semantic similarity of X and Y.
In other alternative embodiments, the similarity computing unit uses the following semantic similarity formula:

$$f(X, Y) = \frac{2 \sum_{i=1}^{m} X_i \times Y_i}{\sum_{i=1}^{m} (X_i)^2 + \sum_{i=1}^{m} (Y_i)^2}$$

where X and Y are the two m-dimensional vectors whose similarity is to be compared, and f(X, Y) is the semantic similarity of X and Y.
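Both similarity measures are straightforward to compute; a sketch:

```python
import math

def cosine_similarity(X, Y):
    """f(X, Y) = X.Y / (||X|| ||Y||), the formula of the main embodiment."""
    dot = sum(x * y for x, y in zip(X, Y))
    return dot / (math.sqrt(sum(x * x for x in X)) *
                  math.sqrt(sum(y * y for y in Y)))

def dice_like_similarity(X, Y):
    """f(X, Y) = 2*sum(Xi*Yi) / (sum(Xi^2) + sum(Yi^2)), the alternative."""
    dot = sum(x * y for x, y in zip(X, Y))
    return 2 * dot / (sum(x * x for x in X) + sum(y * y for y in Y))

# The two measures agree on identical vectors but diverge when magnitudes
# differ: cosine ignores vector length, the alternative penalises mismatch.
print(cosine_similarity([1.0, 1.0], [2.0, 2.0]))     # 1.0
print(dice_like_similarity([1.0, 1.0], [2.0, 2.0]))  # 0.8
```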
In this embodiment, the correlated knowledge point computing unit comprises:
First computing unit: sorts the candidate knowledge points in descending order of their similarity to the domain knowledge point, and selects the top preset number of candidate knowledge points as the correlated knowledge points of that domain knowledge point.
In other alternative embodiments, the correlated knowledge point computing unit comprises a second computing unit: a similarity threshold is set in advance, and the candidate knowledge points whose similarity exceeds the threshold are chosen as the correlated knowledge points of that domain knowledge point.
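The two selection strategies can be sketched as follows, with hypothetical similarity scores:

```python
# Two ways to turn similarity scores into correlated knowledge points:
# keep the k most similar candidates (first computing unit), or keep
# every candidate above a preset threshold (second computing unit).

def top_k_related(similarities, k):
    """similarities: {candidate: score}; sort descending, keep first k."""
    return sorted(similarities, key=similarities.get, reverse=True)[:k]

def threshold_related(similarities, threshold):
    return [c for c, s in similarities.items() if s > threshold]

sims = {"向量": 0.9, "目录": 0.4, "相似度": 0.7}   # hypothetical scores
print(top_k_related(sims, 2))        # ['向量', '相似度']
print(threshold_related(sims, 0.6))  # ['向量', '相似度']
```

Top-k guarantees a fixed number of correlated points per domain knowledge point, while the threshold variant lets the count vary with how related the candidates actually are.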
This embodiment provides a system for acquiring correlated knowledge points, comprising an extraction unit, a word segmentation unit, a candidate unit, a semantic vector computing unit, a similarity computing unit and a correlated knowledge point computing unit. By computing semantic vectors and calculating the similarity between domain knowledge points and candidate knowledge points, the candidate knowledge points correlated with each domain knowledge point are obtained, so that several correlated target knowledge points are found for each domain knowledge point. When building the entries of an encyclopedia catalogue, one can search whether the correlated knowledge points of each domain knowledge point already exist, and add those that do not. Checking and construction of the domain encyclopedia entries are completed in this way, greatly reducing manual workload.
Obviously, the above embodiments are merely examples given for clarity of description and are not a limitation on the embodiments. For those of ordinary skill in the art, other variations or modifications in different forms can be made on the basis of the above description. It is neither necessary nor possible to list all embodiments exhaustively here, and obvious variations or modifications derived therefrom still fall within the protection scope of the invention.
Those skilled in the art should understand that embodiments of the invention may be provided as a method, a system or a computer program product. Therefore, the invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk memory, CD-ROM, optical memory and the like) containing computer-usable program code.
The invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner, such that the instructions stored in the computer-readable memory produce a manufacture comprising an instruction device, which implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the invention have been described, those skilled in the art, once apprised of the basic inventive concept, can make further changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications that fall within the scope of the invention.

Claims (16)

1. A method for acquiring correlated knowledge points, characterized by comprising:
acquiring domain knowledge points;
performing word segmentation on text according to the domain knowledge points to obtain a word segmentation result;
determining candidate knowledge points according to the word segmentation result and common words;
determining the semantic vector of each candidate knowledge point;
for each domain knowledge point, calculating the semantic similarity between the domain knowledge point and the candidate knowledge points;
determining the target knowledge points correlated with the domain knowledge point according to the calculated semantic similarities.
2. The method for acquiring correlated knowledge points of a knowledge point according to claim 1, characterized in that the processing of performing word segmentation on text according to the domain knowledge points to obtain the word segmentation result comprises:
adding the domain knowledge points to a segmenter;
selecting domain digital resources and extracting text from them;
performing word segmentation on the text using the segmenter, and taking the segmented file as a candidate file.
3. The method for acquiring correlated knowledge points of a knowledge point according to claim 1 or 2, characterized in that the process of determining candidate knowledge points according to the word segmentation result and common words comprises:
selecting digital resources of ordinary text and performing word segmentation on them to determine common words;
removing the common words from the words in the candidate file to obtain the candidate knowledge points.
4. The method for acquiring correlated knowledge points of a knowledge point according to any one of claims 1-3, characterized in that the process of determining the semantic vector of each candidate knowledge point comprises:
determining the number of times each candidate knowledge point occurs in the candidate file;
computing the binary tree of minimum weighted path length according to each candidate knowledge point and the number of times it occurs in the candidate file;
determining the semantic vector of each candidate knowledge point according to its position in the candidate file and the binary tree of minimum weighted path length.
5. The method for acquiring correlated knowledge points of a knowledge point according to claim 4, characterized in that the process of determining the semantic vector of each knowledge point according to its position in the candidate file and the binary tree of minimum weighted path length comprises:
creating a neural network model;
training the model with the candidate file as training samples and the binary tree of minimum weighted path length as the output layer;
after training is complete, obtaining the semantic vector of each candidate knowledge point from the node vectors in the binary tree of minimum weighted path length.
6. The method for acquiring correlated knowledge points of a knowledge point according to any one of claims 1-5, characterized in that the processing of calculating, for each domain knowledge point, the semantic similarity between the domain knowledge point and the candidate knowledge points comprises:
calculating the semantic similarity as:

$$f(X, Y) = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert} = \frac{\sum_{i=1}^{m} X_i \times Y_i}{\sqrt{\sum_{i=1}^{m} (X_i)^2} \times \sqrt{\sum_{i=1}^{m} (Y_i)^2}}$$

where X and Y are the two m-dimensional vectors whose similarity is to be compared, and f(X, Y) is the semantic similarity of X and Y.
7. The method for acquiring correlated knowledge points of a knowledge point according to any one of claims 1-5, characterized in that the processing of calculating, for each domain knowledge point, the semantic similarity between the domain knowledge point and the candidate knowledge points comprises:
calculating the semantic similarity as:

$$f(X, Y) = \frac{2 \sum_{i=1}^{m} X_i \times Y_i}{\sum_{i=1}^{m} (X_i)^2 + \sum_{i=1}^{m} (Y_i)^2}$$

where X and Y are the two m-dimensional vectors whose similarity is to be compared, and f(X, Y) is the semantic similarity of X and Y.
8. The method for acquiring correlated knowledge points of a knowledge point according to any one of claims 1-7, characterized in that the processing of determining, according to the calculated semantic similarities, the target knowledge points correlated with the domain knowledge point comprises:
sorting the candidate knowledge points in descending order of their similarity to the domain knowledge point, and selecting the top preset number of candidate knowledge points as the correlated knowledge points of the domain knowledge point;
or setting a similarity threshold in advance, and choosing the candidate knowledge points whose similarity exceeds the threshold as the correlated knowledge points of the domain knowledge point.
9. A system for acquiring the correlated knowledge points of a knowledge point, characterized by comprising:
an extraction unit, which acquires domain knowledge points;
a word segmentation unit, which performs word segmentation on text according to the domain knowledge points to obtain a word segmentation result;
a candidate unit, which determines candidate knowledge points according to the word segmentation result and common words;
a semantic vector computing unit, which determines the semantic vector of each candidate knowledge point;
a similarity computing unit, which, for each domain knowledge point, calculates the semantic similarity between the domain knowledge point and the candidate knowledge points;
a correlated knowledge point computing unit, which determines the target knowledge points correlated with the domain knowledge point according to the calculated semantic similarities.
10. The system for acquiring correlated knowledge points of a knowledge point according to claim 9, characterized in that the word segmentation unit comprises:
a segmenter building unit, which adds the domain knowledge points to a segmenter;
an extracting unit, which selects domain digital resources and extracts text from them;
a candidate file acquiring unit, which performs word segmentation on the text using the segmenter and takes the segmented file as a candidate file.
11. The system for acquiring correlated knowledge points of a knowledge point according to claim 9 or 10, characterized in that the candidate unit comprises:
a common word determining unit, which selects digital resources of ordinary text and performs word segmentation on them to determine common words;
a candidate knowledge point determining unit, which removes the common words from the words in the candidate file to obtain the candidate knowledge points.
12. The system for acquiring correlated knowledge points of a knowledge point according to any one of claims 9-11, characterized in that the semantic vector computing unit comprises:
a statistics unit, which determines the number of times each candidate knowledge point occurs in the candidate file;
an optimal binary tree computing unit, which computes the binary tree of minimum weighted path length according to each candidate knowledge point and the number of times it occurs in the candidate file;
a semantic vector determining unit, which determines the semantic vector of each candidate knowledge point according to its position in the candidate file and the binary tree of minimum weighted path length.
13. The system for acquiring correlated knowledge points of a knowledge point according to claim 12, characterized in that the semantic vector determining unit comprises:
a modeling unit, which creates a neural network model;
a training unit, which trains the model with the candidate file as training samples and the binary tree of minimum weighted path length as the output layer;
a computing unit, which, after training is complete, obtains the semantic vector of each candidate knowledge point from the node vectors in the binary tree of minimum weighted path length.
14. The system for acquiring correlated knowledge points of a knowledge point according to any one of claims 9-13, characterized in that the similarity computing unit uses the following formula:

$$f(X, Y) = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert} = \frac{\sum_{i=1}^{m} X_i \times Y_i}{\sqrt{\sum_{i=1}^{m} (X_i)^2} \times \sqrt{\sum_{i=1}^{m} (Y_i)^2}}$$

where X and Y are the two m-dimensional vectors whose similarity is to be compared, and f(X, Y) is the semantic similarity of X and Y.
15. The system for acquiring correlated knowledge points of a knowledge point according to any one of claims 9-13, characterized in that the similarity computing unit uses the following semantic similarity formula:

$$f(X, Y) = \frac{2 \sum_{i=1}^{m} X_i \times Y_i}{\sum_{i=1}^{m} (X_i)^2 + \sum_{i=1}^{m} (Y_i)^2}$$

where X and Y are the two m-dimensional vectors whose similarity is to be compared, and f(X, Y) is the semantic similarity of X and Y.
16. The system for acquiring correlated knowledge points of a knowledge point according to any one of claims 9-14, characterized in that the correlated knowledge point computing unit comprises:
a first computing unit, which sorts the candidate knowledge points in descending order of their similarity to the domain knowledge point and selects the top preset number of candidate knowledge points as the correlated knowledge points of the domain knowledge point;
or a second computing unit, which sets a similarity threshold in advance and chooses the candidate knowledge points whose similarity exceeds the threshold as the correlated knowledge points of the domain knowledge point.
CN201410497470.3A 2014-09-26 2014-09-26 Related knowledge point acquisition method and system Pending CN105608075A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410497470.3A CN105608075A (en) 2014-09-26 2014-09-26 Related knowledge point acquisition method and system


Publications (1)

Publication Number Publication Date
CN105608075A true CN105608075A (en) 2016-05-25

Family

ID=55988019

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410497470.3A Pending CN105608075A (en) 2014-09-26 2014-09-26 Related knowledge point acquisition method and system

Country Status (1)

Country Link
CN (1) CN105608075A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102207946A (en) * 2010-06-29 2011-10-05 天津海量信息技术有限公司 Knowledge network semi-automatic generation method
CN102708100A (en) * 2011-03-28 2012-10-03 北京百度网讯科技有限公司 Method and device for digging relation keyword of relevant entity word and application thereof
US20120317125A1 (en) * 2011-05-18 2012-12-13 International Business Machines Corporation Method and apparatus for identifier retrieval

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LIU Yunfang et al.: "A Calculation Method of Sentence Similarity in Information Retrieval", Applied Science and Technology *
ZHU Mingfang, WU Ji: "Data Structures and Algorithms", 31 March 2010 *
HAN Yongfeng et al.: "A Hierarchical Classification Method for Burst Events Based on Domain Feature Words", Journal of Information Engineering University *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106384319A (en) * 2016-09-20 2017-02-08 四川教云网络科技有限公司 Teaching resource personalized recommending method based on forgetting curve
CN108287848A (en) * 2017-01-10 2018-07-17 中国移动通信集团贵州有限公司 Method and system for semanteme parsing
CN107315734A (en) * 2017-05-04 2017-11-03 中国科学院信息工程研究所 A kind of method and system for becoming pronouns, general term for nouns, numerals and measure words standardization based on time window and semanteme
CN107315734B (en) * 2017-05-04 2019-11-26 中国科学院信息工程研究所 A kind of method and system to be standardized based on time window and semantic variant word
CN107451117A (en) * 2017-07-17 2017-12-08 广州特道信息科技有限公司 The segmenting method and device of English text
CN107967254A (en) * 2017-10-31 2018-04-27 科大讯飞股份有限公司 Knowledge point prediction method and device, storage medium and electronic equipment
CN109002499A (en) * 2018-06-29 2018-12-14 浙江蓝鸽科技有限公司 Subject pertinence knowledge point base construction method and its system
CN109002499B (en) * 2018-06-29 2022-04-12 浙江蓝鸽科技有限公司 Discipline correlation knowledge point base construction method and system
CN113157871A (en) * 2021-05-27 2021-07-23 东莞心启航联贸网络科技有限公司 News public opinion text processing method, server and medium applying artificial intelligence
CN113157871B (en) * 2021-05-27 2021-12-21 宿迁硅基智能科技有限公司 News public opinion text processing method, server and medium applying artificial intelligence
CN117474014A (en) * 2023-12-27 2024-01-30 广东信聚丰科技股份有限公司 Knowledge point dismantling method and system based on big data analysis
CN117474014B (en) * 2023-12-27 2024-03-08 广东信聚丰科技股份有限公司 Knowledge point dismantling method and system based on big data analysis

Similar Documents

Publication Publication Date Title
CN105608075A (en) Related knowledge point acquisition method and system
CN107168945B (en) Bidirectional cyclic neural network fine-grained opinion mining method integrating multiple features
CN104615767B (en) Training method, search processing method and the device of searching order model
CN104636466B (en) Entity attribute extraction method and system for open webpage
CN106570148A (en) Convolutional neutral network-based attribute extraction method
CN104573046A (en) Comment analyzing method and system based on term vector
CN107506389B (en) Method and device for extracting job skill requirements
CN104899298A (en) Microblog sentiment analysis method based on large-scale corpus characteristic learning
CN109739973A (en) Text snippet generation method, device, electronic equipment and storage medium
CN103631859A (en) Intelligent review expert recommending method for science and technology projects
CN104035975B (en) It is a kind of to realize the method that remote supervisory character relation is extracted using Chinese online resource
CN105955962A (en) Method and device for calculating similarity of topics
CN105893362A (en) A method for acquiring knowledge point semantic vectors and a method and a system for determining correlative knowledge points
CN106055623A (en) Cross-language recommendation method and system
CN103914445A (en) Data semantic processing method
CN103324700A (en) Noumenon concept attribute learning method based on Web information
CN111274814B (en) Novel semi-supervised text entity information extraction method
CN104484380A (en) Personalized search method and personalized search device
CN106033462A (en) Neologism discovering method and system
CN105528437A (en) Question-answering system construction method based on structured text knowledge extraction
CN106372064A (en) Characteristic word weight calculating method for text mining
CN108287911A (en) A kind of Relation extraction method based on about fasciculation remote supervisory
CN105631018A (en) Article feature extraction method based on topic model
CN104699797A (en) Webpage data structured analytic method and device
CN109635275A (en) Literature content retrieval and recognition methods and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160525

RJ01 Rejection of invention patent application after publication