CN105608075A - Related knowledge point acquisition method and system - Google Patents


Info

Publication number
CN105608075A
CN105608075A (application CN201410497470.3A)
Authority
CN
China
Prior art keywords
knowledge point
candidate
domain
correlated
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410497470.3A
Other languages
Chinese (zh)
Inventor
叶茂
徐剑波
汤帜
杨亮
卢菁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Peking University Founder Group Co Ltd
Beijing Founder Apabi Technology Co Ltd
Original Assignee
Peking University
Peking University Founder Group Co Ltd
Beijing Founder Apabi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University, Peking University Founder Group Co Ltd, Beijing Founder Apabi Technology Co Ltd
Priority to CN201410497470.3A
Publication of CN105608075A
Legal status: Pending


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a related knowledge point acquisition method. The method comprises: firstly, acquiring domain knowledge points; then carrying out word segmentation on a text in a domain according to the domain knowledge points; obtaining candidate knowledge points after removing common words; obtaining semantic vectors of the candidate knowledge points; and obtaining candidate knowledge points, related to each domain knowledge point, as target knowledge points by calculating similarity between the domain knowledge points and the candidate knowledge points. Thus, a plurality of target knowledge points related to each domain knowledge point can be obtained. When constructing an encyclopedia directory entry, it may be determined, through searching, whether each domain knowledge point has a related knowledge point, and if not, a related knowledge point needs to be added. In this way, checking and construction of encyclopedia entries are completed, so that a manual workload is significantly reduced; time costs and labor costs are reduced; inaccuracy caused by subjectivity and non-uniform standards of manual checking is avoided; and efficiency and accuracy are greatly improved.

Description

Method and system for acquiring related knowledge points
Technical field
The present invention relates to the field of electric digital data processing, and in particular to a method and system for acquiring related knowledge points.
Background technology
Digital publishing resources have become one of the main ways in which information is provided, as readers move in large numbers from paper to electronic reading. Digital publishing resources include e-books, digital encyclopedias, digital periodicals, digital newspapers, and so on. The information they supply is generally more authoritative and correct than that on the open internet. How to improve people's study or reading experience according to the characteristics of digital publishing resources has therefore become particularly important.
An encyclopedia is a reference work that introduces the whole of human knowledge or the knowledge of a particular field. It is usually laid out in dictionary form, with the entry as the basic unit, and collects the nouns, idioms, place names, events, figures, works, and so on of each domain of knowledge. An encyclopedia can be comprehensive, covering content from all fields (the Encyclopaedia Britannica, for example, is a famous comprehensive encyclopedia), or it can be specialized, such as an encyclopedia of history or a military encyclopedia; an encyclopedia restricted to certain fields is called a domain encyclopedia. Encyclopedias are considered a mark of the scientific and cultural development of a country and an era.
A domain encyclopedia classifies massive amounts of information and provides users with more targeted resources; it is also an important kind of digital publishing resource. A domain encyclopedia usually organizes domain information in the form of entries and needs to include the important entries of the field. However, building a domain encyclopedia requires a great deal of human effort. Because the number of domain entries is large, finding suitable entries manually is not only time-consuming but also prone to omitting some highly relevant ones. Determining whether all related entries have been included is very important work, but carrying it out costs a great deal of manpower and time.
Distributed word representations were first proposed in Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning Representations by Back-propagating Errors. Nature 323(6088), pp. 533-536 (1986). The idea is to represent words as continuous vectors such that similar words lie close together in the vector space. Feed-forward neural networks have been used to learn word vectors and language models (see Bengio, Y., Ducharme, R., Vincent, P.: A Neural Probabilistic Language Model. Journal of Machine Learning Research 3, pp. 1137-1155 (2003)). More recently, Mikolov proposed using the skip-gram or CBOW model to train a simple neural network on large amounts of text and obtain word vectors in a short time. Although there has been some theoretical research on semantic vectors, applications and popularization of this technology are still lacking.
Summary of the invention
The technical problem to be solved by the present invention is that, in the prior art, obtaining related entries requires manual screening, with a heavy workload and poor objectivity. The invention therefore proposes a method for determining related knowledge points according to semantic vectors.
To solve the above technical problem, the present invention provides a method and system for acquiring related knowledge points.
The invention provides a method for acquiring related knowledge points, comprising:
acquiring domain knowledge points;
segmenting a text into words according to the domain knowledge points to obtain a word segmentation result;
determining candidate knowledge points according to the word segmentation result and common words;
determining the semantic vector of each candidate knowledge point;
for each domain knowledge point, calculating the semantic similarity between that domain knowledge point and the candidate knowledge points; and
determining, according to the calculated semantic similarity, the target knowledge points related to that domain knowledge point.
The invention also provides a system for acquiring the related knowledge points of knowledge points, comprising:
an extraction unit, which acquires domain knowledge points;
a word segmentation unit, which segments a text into words according to the domain knowledge points to obtain a word segmentation result;
a candidate unit, which determines candidate knowledge points according to the word segmentation result and common words;
a semantic vector computing unit, which determines the semantic vector of each candidate knowledge point;
a similarity computing unit, which, for each domain knowledge point, calculates the semantic similarity between that domain knowledge point and the candidate knowledge points; and
a related knowledge point computing unit, which determines, according to the calculated semantic similarity, the target knowledge points related to that domain knowledge point.
Compared with the prior art, the above technical solutions of the present invention have the following advantages:
(1) The invention provides a method for acquiring related knowledge points. Domain knowledge points are first acquired; texts in the field are then segmented into words according to these domain knowledge points; candidate knowledge points are obtained after common words are removed; the semantic vectors of the candidate knowledge points are then obtained; and by calculating the similarity between domain knowledge points and candidate knowledge points, the candidate knowledge points related to each domain knowledge point are obtained as target knowledge points. In this way, several target knowledge points related to each domain knowledge point can be obtained. When building the entries of an encyclopedia catalogue, one can search whether the related knowledge points of each domain knowledge point already exist, and add them if they do not. The checking and construction of the entries of a domain encyclopedia are thereby completed. This greatly reduces manual workload, saves time and labor costs, avoids the inaccuracy brought about by the subjectivity and inconsistent standards of manual checking, and greatly improves efficiency and accuracy.
(2) In the method for acquiring related knowledge points of the present invention, the semantic vector of each candidate knowledge point is calculated during the acquisition process, so that the semantic information of a knowledge point is quantified and its semantic features are embodied in digital form. This makes subsequent analysis of knowledge points convenient and provides a basis for applications such as knowledge point search, recommendation, and information filtering.
(3) The invention also provides a system for acquiring related knowledge points, comprising an extraction unit, a word segmentation unit, a candidate unit, a semantic vector computing unit, a similarity computing unit, and a related knowledge point computing unit. By computing semantic vectors and calculating the similarity between domain knowledge points and candidate knowledge points, the candidate knowledge points related to each domain knowledge point are obtained, yielding several target knowledge points related to each domain knowledge point. When building the entries of an encyclopedia catalogue, one can search whether the related knowledge points of each domain knowledge point exist, and add them if they do not. The checking and construction of the entries of a domain encyclopedia are completed in this way, greatly reducing manual workload.
Brief description of the drawings
In order that the content of the present invention may be more clearly understood, the present invention is described in further detail below according to specific embodiments and with reference to the accompanying drawings, in which:
Fig. 1 is a flowchart of the method for acquiring related knowledge points in Embodiment 1;
Fig. 2 is a flowchart of calculating the semantic vectors of candidate knowledge points in Embodiment 2;
Fig. 3 is a schematic diagram of the skip-gram model in Embodiment 2;
Fig. 4 is a schematic diagram of the CBOW model in Embodiment 2;
Fig. 5 is a structural block diagram of the system for acquiring related knowledge points in Embodiment 4.
Detailed description of the invention
Embodiment 1:
This embodiment provides a method for acquiring related knowledge points. The method obtains the knowledge points related to all the knowledge points in a field, and the related knowledge points obtained are then used to detect and fill gaps in the entries of an established domain encyclopedia, further perfecting it; this has very good guiding value. A knowledge point is the basic unit in which information is transmitted, and research on the representation of knowledge points plays an important role in improving learning and associated navigation, information recommendation, retrieval, dictionary building, and so on.
The flowchart of this method for acquiring related knowledge points is shown in Fig. 1, and the detailed process is as follows:
First, the domain knowledge points are acquired, i.e., all the knowledge points in the field are obtained. For example, when building an encyclopedia, all the entries already established in the field can be obtained and used as domain knowledge points.
Then, a text is segmented into words according to the domain knowledge points to obtain a word segmentation result. The text here is taken from digital resources in the field; to make the knowledge points it contains sufficiently broad, many electronic digital resources of the field are generally selected. After the field's digital resources are selected, text is extracted from them and then segmented. During segmentation, the domain knowledge points are first added to the word segmenter, and segmentation is then carried out with that segmenter. The effect of adding the domain knowledge points to the segmenter is that each knowledge point of the field is treated as a single word during segmentation. For example, in a sentence such as "Emperor Qin Shihuang went up to the hall to receive homage," both "Emperor Qin Shihuang" and "emperor" could be taken as words, so there are two segmentation possibilities; since "Emperor Qin Shihuang" exists among the domain knowledge points, after they are added to the segmenter it will be segmented as a single word. In this way, by adding the domain knowledge points to the segmenter, the texts in the field can be segmented better, and the segmentation result obtained for the field is more accurate.
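The effect of adding domain terms to the segmenter can be sketched with a minimal forward-maximum-matching segmenter. This is an illustrative stand-in, not the embodiment's actual tokenizer; the mini-dictionary and the "秦始皇" (Emperor Qin Shihuang) example sentence are toy assumptions:

```python
def fmm_segment(text, dictionary, max_len=4):
    """Forward maximum matching: prefer the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        match = text[i]                      # fall back to a single character
        for L in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + L] in dictionary:
                match = text[i:i + L]
                break
        words.append(match)
        i += len(match)
    return words

# Base dictionary without the domain term: the name splits into characters.
base = {"上", "大殿", "朝拜"}
domain_points = {"秦始皇"}                   # domain knowledge points added to the segmenter

sentence = "秦始皇上大殿朝拜"
print(fmm_segment(sentence, base))                  # → ['秦', '始', '皇', '上', '大殿', '朝拜']
print(fmm_segment(sentence, base | domain_points))  # → ['秦始皇', '上', '大殿', '朝拜']
```

With the domain term in the dictionary, the multi-character knowledge point survives segmentation as one word, which is exactly the behavior the paragraph describes.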
A large number of words are obtained after segmentation. These words include some knowledge points of the field as well as some common words, such as "you," "they," "have a meal," and so on. The file after segmentation serves as the candidate file.
Next, candidate knowledge points are determined according to the word segmentation result and the common words. Because common words are a series of frequently used words, removing them from the above segmentation result leaves the words specific to the domain, which are taken as the candidate knowledge points. The common words here are those already determined in the prior art. In other embodiments, common words can also be determined by the following method: select digital resources of ordinary text, such as life newspapers and life magazines; segment them into words (using a stop-word list to remove stop words, for example the Harbin Institute of Technology stop-word list); and define the words that occur in many of the texts as common words. After the common words are removed from the candidate file, the words of the field remain as candidate knowledge points.
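The alternative just described, determining common words by document frequency over everyday texts and removing them from the segmented domain file, can be sketched as follows; the toy documents and the 0.5 document-frequency cutoff are assumptions for illustration:

```python
from collections import Counter

def common_words(everyday_docs, min_doc_ratio=0.5):
    """Words appearing in at least min_doc_ratio of the everyday documents."""
    df = Counter()
    for doc in everyday_docs:
        df.update(set(doc))                  # document frequency, not term frequency
    cutoff = min_doc_ratio * len(everyday_docs)
    return {w for w, c in df.items() if c >= cutoff}

def candidate_points(segmented_domain_text, everyday):
    """Remove the common words; what remains are the candidate knowledge points."""
    return [w for w in segmented_domain_text if w not in everyday]

# Hypothetical segmented everyday texts and a segmented domain file.
everyday_docs = [["we", "eat", "today"], ["they", "eat", "well"], ["we", "read", "today"]]
domain_tokens = ["emperor_qin", "eat", "great_wall", "we", "han_dynasty"]

stop = common_words(everyday_docs)
print(candidate_points(domain_tokens, stop))  # → ['emperor_qin', 'great_wall', 'han_dynasty']
```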
Then, the semantic vector of each candidate knowledge point is calculated. The calculation can adopt a method from the prior art; by computing semantic vectors, each knowledge point is quantified and represented digitally in a semantic way.
Then, for each domain knowledge point, the semantic similarity between that domain knowledge point and the candidate knowledge points is calculated. Because the candidate knowledge points are obtained from a large number of digital resources in the field, we consider that they contain all of the domain knowledge points. The semantic vector of each domain knowledge point is therefore found by searching among the candidate knowledge points, and its semantic similarity with each candidate knowledge point is then calculated.
The semantic similarity here is computed as:

f(X, Y) = \frac{X \cdot Y}{\|X\| \, \|Y\|} = \frac{\sum_{i=1}^{m} X_i Y_i}{\sqrt{\sum_{i=1}^{m} X_i^2} \, \sqrt{\sum_{i=1}^{m} Y_i^2}}

where X and Y are the two m-dimensional vectors whose similarity is to be compared, one being the semantic vector of a domain knowledge point and the other the semantic vector of a candidate knowledge point, and f(X, Y) is the semantic similarity of X and Y.
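This formula is the standard cosine similarity, which can be sketched directly; the example vectors are hypothetical:

```python
import math

def cosine_similarity(x, y):
    """f(X, Y) = X.Y / (||X|| ||Y||) over two m-dimensional semantic vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

domain_vec = [1.0, 2.0, 0.0]      # hypothetical semantic vector of a domain knowledge point
candidate_vec = [2.0, 4.0, 0.0]   # a perfectly aligned candidate

print(round(cosine_similarity(domain_vec, candidate_vec), 6))  # → 1.0
```

Parallel vectors score 1.0 and orthogonal vectors score 0.0, matching the formula above term by term.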
In this way, the semantic similarity between every domain knowledge point and every candidate knowledge point is obtained.
Finally, the target knowledge points related to each domain knowledge point are determined according to the calculated semantic similarity. The candidate knowledge points can be sorted in descending order of similarity to the domain knowledge point, and a certain number of the top-ranked candidates selected as its related knowledge points. As an alternative embodiment, a similarity threshold can be set in advance, and the candidate knowledge points whose similarity exceeds the threshold chosen as the related knowledge points of the domain knowledge point.
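The two selection alternatives above (top-n ranking and a similarity threshold) can be sketched together; the candidate names, toy vectors, and dot-product similarity are illustrative assumptions:

```python
def related_points(domain_vec, candidates, sim, top_n=None, threshold=None):
    """Rank candidate knowledge points by similarity to one domain knowledge point.

    Either keep the top_n highest-scoring candidates, or all those whose
    similarity exceeds threshold (the two alternatives in the embodiment).
    """
    scored = sorted(((sim(domain_vec, v), name) for name, v in candidates.items()),
                    reverse=True)
    if threshold is not None:
        scored = [(s, n) for s, n in scored if s > threshold]
    if top_n is not None:
        scored = scored[:top_n]
    return [n for _, n in scored]

dot = lambda a, b: sum(x * y for x, y in zip(a, b))   # stand-in similarity measure
cands = {"great_wall": [1.0, 0.0], "banquet": [0.0, 1.0], "terracotta": [0.9, 0.1]}

print(related_points([1.0, 0.0], cands, dot, top_n=2))        # → ['great_wall', 'terracotta']
print(related_points([1.0, 0.0], cands, dot, threshold=0.5))  # → ['great_wall', 'terracotta']
```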
In the method for acquiring related knowledge points provided in this embodiment, the domain knowledge points are first acquired; the texts in the field are then segmented according to these domain knowledge points; the candidate knowledge points are obtained after the common words are removed; the semantic vectors of the candidate knowledge points are then obtained; and by calculating the similarity between domain knowledge points and candidate knowledge points, the candidate knowledge points related to each domain knowledge point are obtained as target knowledge points. In this way, several target knowledge points related to each domain knowledge point can be obtained. When building the entries of an encyclopedia catalogue, one can search whether the related knowledge points of each domain knowledge point exist, and add them if they do not. The checking and construction of the entries of a domain encyclopedia are thereby completed, greatly reducing manual workload, saving time and labor costs, avoiding the inaccuracy brought about by the subjectivity and inconsistent standards of manual checking, and greatly improving efficiency and accuracy.
Embodiment 2:
This embodiment provides a method for acquiring related knowledge points whose steps are the same as those of Embodiment 1. This embodiment gives a concrete method for the step of calculating the semantic vector of each candidate knowledge point in the above process; the detailed procedure is as follows:
In the first step, the number of times each candidate knowledge point occurs in the candidate file is determined, yielding the text and the occurrence count of each candidate knowledge point. The candidate text is the text obtained after segmenting the selected digital resources, and the candidate knowledge points are the words obtained by removing the common words from the segmented candidate text; this part is the same as in Embodiment 1 and is not repeated here.
In the second step, a binary tree with minimum weighted path length is computed according to each candidate knowledge point and the number of times it occurs in the candidate text.
Taking the candidate knowledge points as leaf nodes, each leaf node is given a weight equal to the occurrence count of its knowledge point in the text, and a binary tree is constructed in which the weight of a parent node is the sum of the weights of its two child nodes. The weighted path length of the tree is defined as the sum of the weighted path lengths of all leaf nodes; if the weighted path length is minimal, the binary tree is called an optimal binary tree, also known as a Huffman tree. The construction here adopts an existing approach from the prior art, and the binary tree of minimum weighted path length is obtained by an existing algorithm.
In the third step, the semantic vector of each knowledge point is determined according to the position of each knowledge point in the candidate text and the binary tree of minimum weighted path length.
First, a skip-gram model is created. The skip-gram model is a prior-art neural network model, shown schematically in Fig. 3, used for training word vectors. Its basic principle is to predict the words within a certain range before and after the current word, thereby obtaining a suitable word vector representation. The training method used is stochastic gradient descent, the input is text data, and word vectors can be obtained from the training result.
As an alternative embodiment, a CBOW model can also be chosen here; its schematic diagram is shown in Fig. 4, and it is likewise a prior-art neural network model for training word vectors. The CBOW model predicts a word from the context in which it appears, thereby obtaining a suitable word vector representation. The training method used is stochastic gradient descent, the input is text data, and word vectors can be obtained from the training result.
In addition, some documents give concrete introductions and applications of the skip-gram and CBOW models, as follows:
Mikolov, T., Chen, K., Corrado, G., et al.: Efficient Estimation of Word Representations in Vector Space. In Proc. ICLR Workshop (2013)
Mikolov, T., Sutskever, I., Chen, K., et al.: Distributed Representations of Words and Phrases and Their Compositionality. In Proc. NIPS (2013)
After the above model is built, it is trained with the candidate file as the training sample and the binary tree of minimum weighted path length as the output layer. After training is complete, the semantic vector of each candidate knowledge point is obtained from the node vectors in the binary tree of minimum weighted path length. Concretely, the trained leaf-node vector corresponding to each leaf position in the optimal binary tree is extracted; this vector is the semantic vector of that knowledge point.
In this embodiment, after the semantic vectors of the candidate knowledge points are obtained, the following formula is adopted when calculating the similarity between a domain knowledge point and a candidate knowledge point:

f(X, Y) = \frac{2 \sum_{i=1}^{m} X_i Y_i}{\sum_{i=1}^{m} X_i^2 + \sum_{i=1}^{m} Y_i^2}

where X and Y are the two m-dimensional vectors whose similarity is to be compared, one being the semantic vector of a domain knowledge point and the other the semantic vector of a candidate knowledge point, and f(X, Y) is the semantic similarity of X and Y.
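This alternative measure (a Dice-style coefficient over the vector components) can be sketched in a few lines; the example vectors are toy assumptions:

```python
def dice_similarity(x, y):
    """f(X, Y) = 2 sum(X_i * Y_i) / (sum(X_i^2) + sum(Y_i^2)), the embodiment's alternative measure."""
    num = 2 * sum(a * b for a, b in zip(x, y))
    den = sum(a * a for a in x) + sum(b * b for b in y)
    return num / den

print(dice_similarity([1.0, 2.0], [1.0, 2.0]))  # identical vectors score 1.0 → 1.0
print(dice_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors score 0.0 → 0.0
```

Unlike cosine similarity, this measure also penalizes differences in vector magnitude, since the denominator adds rather than multiplies the squared norms.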
Finally, the candidate knowledge points related to each domain knowledge point are obtained from this semantic similarity and taken as target knowledge points. These target knowledge points are searched for in the domain encyclopedia, completing the checking and construction of the domain encyclopedia's entries.
In this embodiment, calculating the semantic vectors of candidate knowledge points during the acquisition process quantifies the semantic information of the knowledge points and embodies their semantic features digitally. This makes subsequent analysis of knowledge points convenient and provides a basis for applications such as knowledge point search, recommendation, and information filtering.
Embodiment 3:
A domain encyclopedia is an important kind of digital publishing resource. It usually organizes domain information in the form of entries and needs to include the important entries of the field, but building one requires a great deal of human effort. This embodiment provides a method for acquiring related knowledge points in which the domain knowledge points are the entries of a domain encyclopedia. In this embodiment, domain e-book texts and newspaper and periodical texts are used, and the semantic vectors of the candidate entries are calculated with the skip-gram model. The semantic similarity between the established domain entries and the candidate entries obtained is computed from the semantic vectors. Using the semantic similarity of entries, other domain entries that are semantically related to the entries of the domain encyclopedia but have been omitted are found, so as to reduce the possibility that some domain entries are missed. The concrete steps are as follows.
In the first step, the domain knowledge points are acquired. Taking the field of history as an example, all the entries already established in a historical encyclopedia are obtained and used as domain knowledge points, and these domain entries are then added to the dictionary of the word segmenter. The segmenter here may be the IK segmenter; in other embodiments, other segmenters such as the Ansj segmenter can also be chosen.
In the second step, domain e-books are selected. For the field of history these can be e-books specific to the field, such as digital resources like Up and Down Five Thousand Years and the histories of the successive dynasties; as many digital resources of the field as possible are selected so that they cover all the entries of the field as far as possible. Text is extracted from these domain e-books and then segmented with the above segmenter to which the domain entries have been added, yielding the segmented text F.
In the third step, everyday newspaper and periodical texts are selected and segmented with the segmenter, and the common words are determined from the segmentation result.
In the fourth step, since the segmentation result of the second step contains both domain words and some common words, and the third step has yielded the common words, the common words are removed from the segmented text F of the second step, and the remaining words serve as domain candidate entries.
In the fifth step, according to the domain candidate entries, the number of occurrences of each candidate entry in file F is counted to form a statistics file, in which the entries are arranged in descending order of occurrence count. The format of the statistics file is as follows, where o_i, o_j, o_k are entry names and t_i, t_j, t_k are the entries' occurrence counts in file F:

o_i, t_i
o_j, t_j
o_k, t_k
According to this statistics file, a Huffman tree is formed with the entries as leaf nodes. The process of building the Huffman tree here is as follows:
1. According to the given n entries, generate a set R = {r_1, r_2, ..., r_n} of n binary trees, where each binary tree r_i has only a root node with weight w_i, the weight w_i equals the occurrence count t_i of the entry, and the left and right subtrees are empty.
2. Select the two trees in R whose root weights are smallest and use them as the left and right subtrees of a new binary tree, setting the weight of the new tree's root to the sum of the root weights of its left and right subtrees.
3. Delete these two trees from R and add the new binary tree to R.
4. Repeat steps 2 and 3 until R contains only one tree.
The tree obtained is the Huffman tree.
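The four steps above can be sketched with a priority queue, which always yields the two smallest-weight roots in step 2; the entry names and counts are hypothetical:

```python
import heapq
import itertools

def huffman_tree(counts):
    """Build the minimum weighted-path-length binary tree over entry counts.

    counts: {entry_name: occurrence_count}. Returns the root as nested tuples
    (weight, left, right); leaves are (weight, name).
    """
    tick = itertools.count()                 # tie-breaker so heap tuples always compare
    heap = [(w, next(tick), (w, name)) for name, w in counts.items()]
    heapq.heapify(heap)                      # step 1: the set R of single-node trees
    while len(heap) > 1:                     # step 4: repeat until one tree remains
        w1, _, left = heapq.heappop(heap)    # step 2: two smallest-weight roots
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tick), (w1 + w2, left, right)))  # step 3
    return heap[0][2]

root = huffman_tree({"qin": 5, "han": 2, "tang": 1, "song": 2})
print(root[0])  # root weight = total occurrence count → 10
```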
In the sixth step, training is performed with the skip-gram model, and the leaf-node vector corresponding to each entry in the Huffman tree is obtained, thereby obtaining the semantic vector of each entry.
The skip-gram model is a prior-art neural network model for training word vectors. Its basic principle is to predict the words within a certain range before and after the current word, thereby obtaining a suitable word vector representation; the training method used is stochastic gradient descent, the input is text data, and word vectors can be obtained from the training result.
In this embodiment, the skip-gram model is created first. As shown in Fig. 3, the model comprises an input layer, a projection (middle) layer, and an output layer, where the output layer adopts the Huffman tree of the preceding step. The path of each entry w from the root node to its leaf node is denoted L(w); n(w, j) denotes the j-th node on this path; ch(n) denotes a child node of a non-leaf node n; and s(x) is the sign function, taking the value 1 when x is true and -1 otherwise. For a training set w_1, w_2, ..., w_T (the words in the training set), the skip-gram model maximizes the probability value

\frac{1}{T} \sum_{t=1}^{T} \sum_{-k \le j \le k,\, j \ne 0} \log p(w_{t+j} \mid w_t)

where k is the window size centered on w_t and T is the number of words in the training set. Generally, the larger the value of k, the more accurate the training result, but the longer the training time required. The probability p(w | w_I) is defined as

p(w \mid w_I) = \prod_{j=1}^{L(w)-1} \sigma\!\left( s\big(n(w, j+1) = ch(n(w, j))\big) \cdot {v'_{n(w,j)}}^{\top} v_{w_I} \right)

where \sigma(x) = 1 / (1 + \exp(-x)), v_w is the vector representation of leaf node w, and v'_n is the vector representation of non-leaf node n. During training, a word w_i in the training set is discarded with probability

P(w_i) = 1 - \sqrt{t / g(w_i)}

where t is a specified threshold and g(w_i) is the frequency with which word w_i occurs; the purpose of discarding words with this probability is to speed up training and improve accuracy.
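The per-word probability p(w | w_I) along the Huffman path can be sketched directly from the formula; the two-node path, its signs, and the node vectors below are toy assumptions, not trained values:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_word_given_context(path_signs, path_vectors, v_input):
    """p(w | w_I): product of sigmoid(s_j * <v'_n(w,j), v_wI>) along the Huffman path.

    path_signs: +1 or -1 per inner node, encoding which child the path takes.
    path_vectors: the inner-node vectors v'_n(w,j).
    v_input: the input word vector v_wI.
    """
    prob = 1.0
    for s, v_node in zip(path_signs, path_vectors):
        dot = sum(a * b for a, b in zip(v_node, v_input))
        prob *= sigmoid(s * dot)
    return prob

# Toy two-node path with hypothetical vectors.
p = p_word_given_context([+1, -1], [[0.5, -0.2], [0.1, 0.3]], [1.0, 2.0])
print(0.0 < p < 1.0)  # a valid probability along the path → True
```

Because each factor is a sigmoid of a binary branch decision, the probabilities of all leaves under any inner node sum to 1, which is what makes the Huffman-tree output layer a normalized (hierarchical) softmax.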
In the seventh step, the model is trained with the segmented file F as the training sample, using the stochastic gradient descent back-propagation algorithm. After model training is complete, the semantic vector v_i of each candidate entry o_i is obtained.
In the eighth step, for each entry o_i in the domain encyclopedia, the semantic similarity between this entry and all the other candidate entries is calculated. The similarity calculation methods have been introduced in the other embodiments above and are not repeated here; the calculation method of the above embodiments can be selected as required. The entries are sorted in descending order of semantic similarity, and the m entries with the highest similarity are obtained. Whether these entries are in the domain encyclopedia is then checked; if they are not, they are recorded in a file for the builders of the domain encyclopedia to examine.
Because the number of entries in a domain encyclopedia is large, finding suitable domain entries manually is not only time-consuming but also prone to omitting some highly relevant ones. The method for acquiring related knowledge points in this embodiment can be used to check the construction of domain encyclopedia entries, finding other domain entries that are semantically related to the encyclopedia's entries so as to reduce the possibility that some domain entries are missed.
Embodiment 4:
This embodiment provides a system for acquiring the related knowledge points of knowledge points, as shown in Fig. 5, comprising:
an extraction unit, which acquires domain knowledge points;
a word segmentation unit, which segments a text into words according to the domain knowledge points to obtain a word segmentation result;
a candidate unit, which determines candidate knowledge points according to the word segmentation result and common words;
a semantic vector computing unit, which determines the semantic vector of each candidate knowledge point;
a similarity computing unit, which, for each domain knowledge point, calculates the semantic similarity between that domain knowledge point and the candidate knowledge points; and
a related knowledge point computing unit, which determines, according to the calculated semantic similarity, the target knowledge points related to that domain knowledge point.
The word segmentation unit comprises:
a segmenter building unit, which adds the domain knowledge points to the word segmenter;
an extracting unit, which selects domain digital resources and extracts text from them; and
a candidate file acquiring unit, which segments the text with the segmenter and takes the segmented file as the candidate file.
The candidate unit comprises:
a common word determining unit, which selects digital resources of ordinary text and segments them to determine the common words; and
a candidate knowledge point determining unit, which removes the common words from the words in the candidate file to obtain the candidate knowledge points.
Wherein, the semantic vector computing unit comprises:
Statistics unit: determines the number of times each candidate knowledge point occurs in the candidate file;
Optimal binary tree computing unit: computes the binary tree of minimum weighted path length according to each candidate knowledge point and the number of times it occurs in the candidate file;
Semantic vector determining unit: determines the semantic vector of each candidate knowledge point according to its position in the candidate file and the binary tree of minimum weighted path length.
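The binary tree of minimum weighted path length built from the occurrence counts is a Huffman tree. A minimal construction sketch, with hypothetical word counts:

```python
import heapq
import itertools

def huffman_codes(freqs):
    """Build the binary tree of minimum weighted path length (a Huffman
    tree) over {word: count} and return each word's root-to-leaf code."""
    tie = itertools.count()  # tie-breaker so equal weights never compare dicts
    heap = [(count, next(tie), {word: ""}) for word, count in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # merge the two lightest subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {w: "0" + code for w, code in left.items()}
        merged.update({w: "1" + code for w, code in right.items()})
        heapq.heappush(heap, (w1 + w2, next(tie), merged))
    return heap[0][2]

freqs = {"知识点": 5, "向量": 2, "相似度": 1}   # hypothetical counts
codes = huffman_codes(freqs)
print(codes)
```

Frequent words end up close to the root, which is what makes this tree efficient as a hierarchical-softmax output layer: in word2vec each internal node carries a parameter vector, and a word's probability is a product of sigmoid decisions along its code path.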
The above semantic vector determining unit further comprises:
Modeling unit: creates a skip-gram model;
Training unit: trains the model with the candidate file as training samples and the binary tree of minimum weighted path length as the output layer;
Computing unit: after training is complete, obtains the semantic vector of each candidate knowledge point from the node vectors in the binary tree of minimum weighted path length.
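The combination described here, a skip-gram model whose output layer is the minimum-weighted-path binary tree, is the hierarchical-softmax variant of word2vec; a library such as gensim implements it as `Word2Vec(sentences, sg=1, hs=1, negative=0)`. As a dependency-free illustration, the sketch below only shows how the (center, context) training pairs that such a model consumes are drawn from a segmented candidate file:

```python
# Skip-gram training pairs: each token predicts its neighbors within a
# symmetric window. Training these pairs against a Huffman-tree output
# layer (hierarchical softmax) yields the semantic vectors.

def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["哈夫曼树", "语义向量", "相似度"]   # hypothetical candidate file
print(skipgram_pairs(tokens, window=1))
```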
In this embodiment, the similarity computing unit uses the following formula:

$$f(X, Y) = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert} = \frac{\sum_{i=1}^{m} X_i \times Y_i}{\sqrt{\sum_{i=1}^{m} (X_i)^2} \times \sqrt{\sum_{i=1}^{m} (Y_i)^2}}$$

where X and Y are the two m-dimensional vectors whose similarity is to be compared, and f(X, Y) is the semantic similarity of X and Y.
In other alternative embodiments, the similarity computing unit uses the following semantic similarity formula:

$$f(X, Y) = \frac{2 \sum_{i=1}^{m} X_i \times Y_i}{\sum_{i=1}^{m} (X_i)^2 + \sum_{i=1}^{m} (Y_i)^2}$$

where X and Y are the two m-dimensional vectors whose similarity is to be compared, and f(X, Y) is the semantic similarity of X and Y.
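Both similarity measures are straightforward to compute; a sketch:

```python
import math

def cosine_similarity(X, Y):
    """f(X, Y) = X.Y / (||X|| ||Y||), the formula of the main embodiment."""
    dot = sum(x * y for x, y in zip(X, Y))
    return dot / (math.sqrt(sum(x * x for x in X)) *
                  math.sqrt(sum(y * y for y in Y)))

def dice_like_similarity(X, Y):
    """f(X, Y) = 2*sum(Xi*Yi) / (sum(Xi^2) + sum(Yi^2)), the alternative."""
    dot = sum(x * y for x, y in zip(X, Y))
    return 2 * dot / (sum(x * x for x in X) + sum(y * y for y in Y))

# The two measures agree on identical vectors but diverge when magnitudes
# differ: cosine ignores vector length, the alternative penalises mismatch.
print(cosine_similarity([1.0, 1.0], [2.0, 2.0]))     # 1.0
print(dice_like_similarity([1.0, 1.0], [2.0, 2.0]))  # 0.8
```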
In this embodiment, the correlated knowledge point computing unit comprises:
First computing unit: sorts the candidate knowledge points in descending order of their similarity to the domain knowledge point, and selects the top preset number of candidate knowledge points as the correlated knowledge points of that domain knowledge point.
In other alternative embodiments, the correlated knowledge point computing unit comprises a second computing unit: a similarity threshold is set in advance, and the candidate knowledge points whose similarity exceeds the threshold are chosen as the correlated knowledge points of that domain knowledge point.
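The two selection strategies can be sketched as follows, with hypothetical similarity scores:

```python
# Two ways to turn similarity scores into correlated knowledge points:
# keep the k most similar candidates (first computing unit), or keep
# every candidate above a preset threshold (second computing unit).

def top_k_related(similarities, k):
    """similarities: {candidate: score}; sort descending, keep first k."""
    return sorted(similarities, key=similarities.get, reverse=True)[:k]

def threshold_related(similarities, threshold):
    return [c for c, s in similarities.items() if s > threshold]

sims = {"向量": 0.9, "目录": 0.4, "相似度": 0.7}   # hypothetical scores
print(top_k_related(sims, 2))        # ['向量', '相似度']
print(threshold_related(sims, 0.6))  # ['向量', '相似度']
```

Top-k guarantees a fixed number of correlated points per domain knowledge point, while the threshold variant lets the count vary with how related the candidates actually are.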
This embodiment provides a system for acquiring correlated knowledge points, comprising an extraction unit, a word segmentation unit, a candidate unit, a semantic vector computing unit, a similarity computing unit and a correlated knowledge point computing unit. By computing semantic vectors and calculating the similarity between domain knowledge points and candidate knowledge points, the candidate knowledge points correlated with each domain knowledge point are obtained, so that several correlated target knowledge points are found for each domain knowledge point. When building the entries of an encyclopedia catalogue, one can search whether the correlated knowledge points of each domain knowledge point already exist, and add those that do not. Checking and construction of the domain encyclopedia entries are completed in this way, greatly reducing manual workload.
Obviously, the above embodiments are merely examples given for clarity of description and are not a limitation on the embodiments. For those of ordinary skill in the art, other variations or modifications in different forms can be made on the basis of the above description. It is neither necessary nor possible to list all embodiments exhaustively here, and obvious variations or modifications derived therefrom still fall within the protection scope of the invention.
Those skilled in the art should understand that embodiments of the invention may be provided as a method, a system or a computer program product. Therefore, the invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk memory, CD-ROM, optical memory and the like) containing computer-usable program code.
The invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner, such that the instructions stored in the computer-readable memory produce a manufacture comprising an instruction device, which implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the invention have been described, those skilled in the art, once apprised of the basic inventive concept, can make further changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications that fall within the scope of the invention.

Claims (16)

1. A method for acquiring correlated knowledge points, characterized by comprising:
acquiring domain knowledge points;
performing word segmentation on text according to the domain knowledge points to obtain a word segmentation result;
determining candidate knowledge points according to the word segmentation result and common words;
determining the semantic vector of each candidate knowledge point;
for each domain knowledge point, calculating the semantic similarity between the domain knowledge point and the candidate knowledge points;
determining the target knowledge points correlated with the domain knowledge point according to the calculated semantic similarities.
2. The method for acquiring correlated knowledge points of a knowledge point according to claim 1, characterized in that the processing of performing word segmentation on text according to the domain knowledge points to obtain the word segmentation result comprises:
adding the domain knowledge points to a segmenter;
selecting domain digital resources and extracting text from them;
performing word segmentation on the text using the segmenter, and taking the segmented file as a candidate file.
3. The method for acquiring correlated knowledge points of a knowledge point according to claim 1 or 2, characterized in that the process of determining candidate knowledge points according to the word segmentation result and common words comprises:
selecting digital resources of ordinary text and performing word segmentation on them to determine common words;
removing the common words from the words in the candidate file to obtain the candidate knowledge points.
4. The method for acquiring correlated knowledge points of a knowledge point according to any one of claims 1-3, characterized in that the process of determining the semantic vector of each candidate knowledge point comprises:
determining the number of times each candidate knowledge point occurs in the candidate file;
computing the binary tree of minimum weighted path length according to each candidate knowledge point and the number of times it occurs in the candidate file;
determining the semantic vector of each candidate knowledge point according to its position in the candidate file and the binary tree of minimum weighted path length.
5. The method for acquiring correlated knowledge points of a knowledge point according to claim 4, characterized in that the process of determining the semantic vector of each knowledge point according to its position in the candidate file and the binary tree of minimum weighted path length comprises:
creating a neural network model;
training the model with the candidate file as training samples and the binary tree of minimum weighted path length as the output layer;
after training is complete, obtaining the semantic vector of each candidate knowledge point from the node vectors in the binary tree of minimum weighted path length.
6. The method for acquiring correlated knowledge points of a knowledge point according to any one of claims 1-5, characterized in that the processing of calculating, for each domain knowledge point, the semantic similarity between the domain knowledge point and the candidate knowledge points comprises:
calculating the semantic similarity as:

$$f(X, Y) = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert} = \frac{\sum_{i=1}^{m} X_i \times Y_i}{\sqrt{\sum_{i=1}^{m} (X_i)^2} \times \sqrt{\sum_{i=1}^{m} (Y_i)^2}}$$

where X and Y are the two m-dimensional vectors whose similarity is to be compared, and f(X, Y) is the semantic similarity of X and Y.
7. The method for acquiring correlated knowledge points of a knowledge point according to any one of claims 1-5, characterized in that the processing of calculating, for each domain knowledge point, the semantic similarity between the domain knowledge point and the candidate knowledge points comprises:
calculating the semantic similarity as:

$$f(X, Y) = \frac{2 \sum_{i=1}^{m} X_i \times Y_i}{\sum_{i=1}^{m} (X_i)^2 + \sum_{i=1}^{m} (Y_i)^2}$$

where X and Y are the two m-dimensional vectors whose similarity is to be compared, and f(X, Y) is the semantic similarity of X and Y.
8. The method for acquiring correlated knowledge points of a knowledge point according to any one of claims 1-7, characterized in that the processing of determining, according to the calculated semantic similarities, the target knowledge points correlated with the domain knowledge point comprises:
sorting the candidate knowledge points in descending order of their similarity to the domain knowledge point, and selecting the top preset number of candidate knowledge points as the correlated knowledge points of the domain knowledge point;
or setting a similarity threshold in advance, and choosing the candidate knowledge points whose similarity exceeds the threshold as the correlated knowledge points of the domain knowledge point.
9. A system for acquiring the correlated knowledge points of a knowledge point, characterized by comprising:
an extraction unit, which acquires domain knowledge points;
a word segmentation unit, which performs word segmentation on text according to the domain knowledge points to obtain a word segmentation result;
a candidate unit, which determines candidate knowledge points according to the word segmentation result and common words;
a semantic vector computing unit, which determines the semantic vector of each candidate knowledge point;
a similarity computing unit, which, for each domain knowledge point, calculates the semantic similarity between the domain knowledge point and the candidate knowledge points;
a correlated knowledge point computing unit, which determines the target knowledge points correlated with the domain knowledge point according to the calculated semantic similarities.
10. The system for acquiring correlated knowledge points of a knowledge point according to claim 9, characterized in that the word segmentation unit comprises:
a segmenter building unit, which adds the domain knowledge points to a segmenter;
an extracting unit, which selects domain digital resources and extracts text from them;
a candidate file acquiring unit, which performs word segmentation on the text using the segmenter and takes the segmented file as a candidate file.
11. The system for acquiring correlated knowledge points of a knowledge point according to claim 9 or 10, characterized in that the candidate unit comprises:
a common word determining unit, which selects digital resources of ordinary text and performs word segmentation on them to determine common words;
a candidate knowledge point determining unit, which removes the common words from the words in the candidate file to obtain the candidate knowledge points.
12. The system for acquiring correlated knowledge points of a knowledge point according to any one of claims 9-11, characterized in that the semantic vector computing unit comprises:
a statistics unit, which determines the number of times each candidate knowledge point occurs in the candidate file;
an optimal binary tree computing unit, which computes the binary tree of minimum weighted path length according to each candidate knowledge point and the number of times it occurs in the candidate file;
a semantic vector determining unit, which determines the semantic vector of each candidate knowledge point according to its position in the candidate file and the binary tree of minimum weighted path length.
13. The system for acquiring correlated knowledge points of a knowledge point according to claim 12, characterized in that the semantic vector determining unit comprises:
a modeling unit, which creates a neural network model;
a training unit, which trains the model with the candidate file as training samples and the binary tree of minimum weighted path length as the output layer;
a computing unit, which, after training is complete, obtains the semantic vector of each candidate knowledge point from the node vectors in the binary tree of minimum weighted path length.
14. The system for acquiring correlated knowledge points of a knowledge point according to any one of claims 9-13, characterized in that the similarity computing unit uses the following formula:

$$f(X, Y) = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert} = \frac{\sum_{i=1}^{m} X_i \times Y_i}{\sqrt{\sum_{i=1}^{m} (X_i)^2} \times \sqrt{\sum_{i=1}^{m} (Y_i)^2}}$$

where X and Y are the two m-dimensional vectors whose similarity is to be compared, and f(X, Y) is the semantic similarity of X and Y.
15. The system for acquiring correlated knowledge points of a knowledge point according to any one of claims 9-13, characterized in that the similarity computing unit uses the following semantic similarity formula:

$$f(X, Y) = \frac{2 \sum_{i=1}^{m} X_i \times Y_i}{\sum_{i=1}^{m} (X_i)^2 + \sum_{i=1}^{m} (Y_i)^2}$$

where X and Y are the two m-dimensional vectors whose similarity is to be compared, and f(X, Y) is the semantic similarity of X and Y.
16. The system for acquiring correlated knowledge points of a knowledge point according to any one of claims 9-14, characterized in that the correlated knowledge point computing unit comprises:
a first computing unit, which sorts the candidate knowledge points in descending order of their similarity to the domain knowledge point and selects the top preset number of candidate knowledge points as the correlated knowledge points of the domain knowledge point;
or a second computing unit, which sets a similarity threshold in advance and chooses the candidate knowledge points whose similarity exceeds the threshold as the correlated knowledge points of the domain knowledge point.
CN201410497470.3A 2014-09-26 2014-09-26 Related knowledge point acquisition method and system Pending CN105608075A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410497470.3A CN105608075A (en) 2014-09-26 2014-09-26 Related knowledge point acquisition method and system


Publications (1)

Publication Number Publication Date
CN105608075A true CN105608075A (en) 2016-05-25

Family

ID=55988019

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410497470.3A Pending CN105608075A (en) 2014-09-26 2014-09-26 Related knowledge point acquisition method and system

Country Status (1)

Country Link
CN (1) CN105608075A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102207946A (en) * 2010-06-29 2011-10-05 天津海量信息技术有限公司 Knowledge network semi-automatic generation method
CN102708100A (en) * 2011-03-28 2012-10-03 北京百度网讯科技有限公司 Method and device for digging relation keyword of relevant entity word and application thereof
US20120317125A1 (en) * 2011-05-18 2012-12-13 International Business Machines Corporation Method and apparatus for identifier retrieval

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LIU Yunfang et al.: "A Calculation Method of Sentence Similarity in Information Retrieval", Applied Science and Technology *
ZHU Mingfang, WU Ji: "Data Structures and Algorithms", 31 March 2010 *
HAN Yongfeng et al.: "A Hierarchical Classification Method for Burst Events Based on Domain Feature Words", Journal of Information Engineering University *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106384319A (en) * 2016-09-20 2017-02-08 四川教云网络科技有限公司 Teaching resource personalized recommending method based on forgetting curve
CN108287848A (en) * 2017-01-10 2018-07-17 中国移动通信集团贵州有限公司 Method and system for semanteme parsing
CN107315734A (en) * 2017-05-04 2017-11-03 中国科学院信息工程研究所 A kind of method and system for becoming pronouns, general term for nouns, numerals and measure words standardization based on time window and semanteme
CN107315734B (en) * 2017-05-04 2019-11-26 中国科学院信息工程研究所 A kind of method and system to be standardized based on time window and semantic variant word
CN107451117A (en) * 2017-07-17 2017-12-08 广州特道信息科技有限公司 The segmenting method and device of English text
CN107967254A (en) * 2017-10-31 2018-04-27 科大讯飞股份有限公司 Knowledge point prediction method and device, storage medium and electronic equipment
CN109002499A (en) * 2018-06-29 2018-12-14 浙江蓝鸽科技有限公司 Subject pertinence knowledge point base construction method and its system
CN109002499B (en) * 2018-06-29 2022-04-12 浙江蓝鸽科技有限公司 Discipline correlation knowledge point base construction method and system
CN113157871A (en) * 2021-05-27 2021-07-23 东莞心启航联贸网络科技有限公司 News public opinion text processing method, server and medium applying artificial intelligence
CN113157871B (en) * 2021-05-27 2021-12-21 宿迁硅基智能科技有限公司 News public opinion text processing method, server and medium applying artificial intelligence
CN117474014A (en) * 2023-12-27 2024-01-30 广东信聚丰科技股份有限公司 Knowledge point dismantling method and system based on big data analysis
CN117474014B (en) * 2023-12-27 2024-03-08 广东信聚丰科技股份有限公司 Knowledge point dismantling method and system based on big data analysis

Similar Documents

Publication Publication Date Title
CN105608075A (en) Related knowledge point acquisition method and system
CN107168945B (en) Bidirectional cyclic neural network fine-grained opinion mining method integrating multiple features
CN104615767B (en) Training method, search processing method and the device of searching order model
CN104636466B (en) Entity attribute extraction method and system for open webpage
CN106570148A (en) Convolutional neutral network-based attribute extraction method
CN104573046A (en) Comment analyzing method and system based on term vector
CN107506389B (en) Method and device for extracting job skill requirements
CN104899298A (en) Microblog sentiment analysis method based on large-scale corpus characteristic learning
CN109739973A (en) Text snippet generation method, device, electronic equipment and storage medium
CN103631859A (en) Intelligent review expert recommending method for science and technology projects
CN104035975B (en) It is a kind of to realize the method that remote supervisory character relation is extracted using Chinese online resource
CN105955962A (en) Method and device for calculating similarity of topics
CN105893362A (en) A method for acquiring knowledge point semantic vectors and a method and a system for determining correlative knowledge points
CN106055623A (en) Cross-language recommendation method and system
CN103914445A (en) Data semantic processing method
CN103324700A (en) Noumenon concept attribute learning method based on Web information
CN111274814B (en) Novel semi-supervised text entity information extraction method
CN104484380A (en) Personalized search method and personalized search device
CN106033462A (en) Neologism discovering method and system
CN105528437A (en) Question-answering system construction method based on structured text knowledge extraction
CN106372064A (en) Characteristic word weight calculating method for text mining
CN108287911A (en) A kind of Relation extraction method based on about fasciculation remote supervisory
CN105631018A (en) Article feature extraction method based on topic model
CN104699797A (en) Webpage data structured analytic method and device
CN109635275A (en) Literature content retrieval and recognition methods and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160525

RJ01 Rejection of invention patent application after publication