CN107622129A - Method and device for organizing a knowledge base, and computer-readable storage medium - Google Patents

Method and device for organizing a knowledge base, and computer-readable storage medium

Info

Publication number
CN107622129A
Authority
CN
China
Prior art keywords
dictionary
tensor
word
target text
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710911400.1A
Other languages
Chinese (zh)
Other versions
CN107622129B (en)
Inventor
冯林
刘昕
况铁梅
姚琪
曾启文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
MIGU Culture Technology Co Ltd
Original Assignee
MIGU Culture Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by MIGU Culture Technology Co Ltd
Priority to CN201710911400.1A
Publication of CN107622129A
Application granted
Publication of CN107622129B
Legal status: Active
Anticipated expiration

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and device for organizing a knowledge base, and a computer storage medium. The method includes: building a tensor space corresponding to a general dictionary; determining, based on the tensor space corresponding to the general dictionary, the tensor of each target text in the tensor space; determining, based on the tensor space corresponding to the general dictionary, the basis vector in the tensor space that corresponds to a target professional-domain dictionary; calculating, based on that basis vector, the projection components of each target text's tensor on the basis vector; and determining, based on the projection components of each target text, the relations among the target texts so as to organize the knowledge base.

Description

Method and device for organizing a knowledge base, and computer storage medium
Technical field
The present invention relates to the field of knowledge management technology, and in particular to a method and device for organizing a knowledge base, and a computer storage medium.
Background
Knowledge bases are built either manually or automatically. Manual construction is further divided into expert construction and crowd collaboration; automatic construction uses a computer to extract knowledge from natural-language text according to certain rules.
The principles of knowledge base construction generally comprise the following parts: 1) determining the knowledge to be managed, that is, determining the scope and boundary of the knowledge base; 2) determining the source and driving force of the knowledge, that is, determining where the knowledge in the knowledge base comes from and what drives its continuous updating; 3) organizing the knowledge base, that is, organizing scattered, unrelated knowledge fragments into a systematically classified, interrelated knowledge base; 4) using the knowledge base, that is, realizing the value of the knowledge through the established knowledge base.
Expert construction is inefficient and costly; it can hardly cope with the knowledge explosion and rapid updating brought by the internet, and it is not suitable for building large-scale knowledge bases. Crowd collaboration covers little professional-domain knowledge, and because it is open, a large amount of invalid and irrelevant information is introduced if there is no manual review, whereas with manual review it can no longer be efficient and low-cost. In automatic construction, a computer processes a collection of original natural-language texts according to certain program logic, extracts knowledge-point entries and articles, and builds the knowledge base automatically.
From the construction principles of knowledge bases and the method of automatic construction, it can be seen that building knowledge bases automatically by computer is the development trend of knowledge base construction. The processing logic used by the computer processes the text with a series of probabilistic and statistical methods and finally makes a selection according to a set evaluation method, thereby outputting the knowledge base. Combining the construction principles with the method of automatic construction, it can be seen that the most crucial part of building a knowledge base by computer is the organization of the knowledge base.
In the prior art, when a knowledge base is built automatically by computer, texts are classified only on the basis of factors such as the term frequency and inverse document frequency of the words in the text, and the classification accuracy is relatively low.
Summary of the invention
To solve the above technical problem, embodiments of the present invention provide a method and device for organizing a knowledge base, and a computer storage medium.
The method for organizing a knowledge base provided by an embodiment of the present invention includes:
Building a tensor space corresponding to a general dictionary;
Based on the tensor space corresponding to the general dictionary, determining the tensor of each target text in the tensor space;
Based on the tensor space corresponding to the general dictionary, determining the basis vector in the tensor space that corresponds to a target professional-domain dictionary;
Based on the basis vector in the tensor space that corresponds to the target professional-domain dictionary, calculating the projection components of each target text's tensor on the basis vector;
Based on the projection components of each target text, determining the relations among the target texts so as to organize the knowledge base.
In the embodiment of the present invention, building the tensor space corresponding to the general dictionary includes:
Establishing a basic dictionary, and obtaining the corpus corresponding to the general dictionary;
Based on the basic dictionary, performing word segmentation on the corpus to obtain a first word set, where the first word set forms the original general dictionary;
Calculating the first information entropy value of each word in the original general dictionary;
Selecting from the original general dictionary the words whose first information entropy value is greater than or equal to a preset information entropy threshold to obtain a second word set, where the second word set forms the general dictionary;
Taking each word in the general dictionary as one dimension to form an N-dimensional high-order tensor space, where N is the number of words in the general dictionary and N is a positive integer.
In the embodiment of the present invention, determining, based on the tensor space corresponding to the general dictionary, the tensor of each target text in the tensor space includes:
For each target text, performing word segmentation on the target text to obtain a third word set;
Determining the first information entropy value of each word in the third word set and the number of times each word occurs in the target text;
Calculating the product of each word's first information entropy value and its number of occurrences to obtain the result value of that word;
Assigning values in the tensor space based on the result value of each word to obtain the tensor of the target text.
In the embodiment of the present invention, determining, based on the tensor space corresponding to the general dictionary, the basis vector in the tensor space that corresponds to the target professional-domain dictionary includes:
Establishing a basic dictionary, and obtaining the target corpus corresponding to the target professional-domain dictionary;
Based on the basic dictionary, performing word segmentation on the target corpus to obtain a fourth word set, where the fourth word set forms the original target professional-domain dictionary;
Calculating the second information entropy value of each word in the original target professional-domain dictionary;
Assigning values in the tensor space based on the second information entropy value of each word to obtain the basis vector corresponding to the target professional-domain dictionary.
In the embodiment of the present invention, calculating, based on the basis vector in the tensor space that corresponds to the target professional-domain dictionary, the projection components of each target text's tensor on the basis vector includes:
For each target text, projecting the tensor of the target text onto the basis vector to obtain N projection components, where the N projection components form an N-dimensional vector.
In the embodiment of the present invention, determining, based on the projection components of each target text, the relations among the target texts so as to organize the knowledge base includes:
Regularizing the projection components of each target text;
Based on the regularization result of the projection components, calculating the angle value of the projection components relative to the basis vector;
Clustering the angle values of the projection components relative to the basis vector to determine the relations among the target texts and thereby organize the knowledge base.
The device for organizing a knowledge base provided by an embodiment of the present invention includes:
A tensor space construction unit, configured to build a tensor space corresponding to a general dictionary;
A tensor determining unit, configured to determine, based on the tensor space corresponding to the general dictionary, the tensor of each target text in the tensor space;
A basis vector determining unit, configured to determine, based on the tensor space corresponding to the general dictionary, the basis vector in the tensor space that corresponds to a target professional-domain dictionary;
A projection component computing unit, configured to calculate, based on the basis vector in the tensor space that corresponds to the target professional-domain dictionary, the projection components of each target text's tensor on the basis vector;
An organizing unit, configured to determine, based on the projection components of each target text, the relations among the target texts so as to organize the knowledge base.
In the embodiment of the present invention, the tensor space construction unit includes:
An establishing subunit, configured to establish a basic dictionary;
A first obtaining subunit, configured to obtain the corpus corresponding to the general dictionary;
A first word segmentation processing unit, configured to perform word segmentation on the corpus based on the basic dictionary to obtain a first word set, where the first word set forms the original general dictionary;
A first information entropy computing unit, configured to calculate the first information entropy value of each word in the original general dictionary;
A selecting subunit, configured to select from the original general dictionary the words whose first information entropy value is greater than or equal to a preset information entropy threshold to obtain a second word set, where the second word set forms the general dictionary;
A building subunit, configured to take each word in the general dictionary as one dimension to form an N-dimensional high-order tensor space, where N is the number of words in the general dictionary and N is a positive integer.
In the embodiment of the present invention, the tensor determining unit includes:
A second word segmentation processing unit, configured to perform, for each target text, word segmentation on the target text to obtain a third word set;
A result value computing subunit, configured to determine the first information entropy value of each word in the third word set and the number of times each word occurs in the target text, and to calculate the product of each word's first information entropy value and its number of occurrences to obtain the result value of that word;
A first assignment processing subunit, configured to assign values in the tensor space based on the result value of each word to obtain the tensor of the target text.
In the embodiment of the present invention, the basis vector determining unit includes:
An establishing subunit, configured to establish a basic dictionary;
A second obtaining subunit, configured to obtain the target corpus corresponding to the target professional-domain dictionary;
A third word segmentation processing unit, configured to perform word segmentation on the target corpus based on the basic dictionary to obtain a fourth word set, where the fourth word set forms the original target professional-domain dictionary;
A second information entropy computing unit, configured to calculate the second information entropy value of each word in the original target professional-domain dictionary;
A second assignment processing subunit, configured to assign values in the tensor space based on the second information entropy value of each word to obtain the basis vector corresponding to the target professional-domain dictionary.
In the embodiment of the present invention, the projection component computing unit is specifically configured to: for each target text, project the tensor of the target text onto the basis vector to obtain N projection components, where the N projection components form an N-dimensional vector.
In the embodiment of the present invention, the organizing unit includes:
A regularization subunit, configured to regularize the projection components of each target text;
An angle computing subunit, configured to calculate, based on the regularization result of the projection components, the angle value of the projection components relative to the basis vector;
A clustering subunit, configured to cluster the angle values of the projection components relative to the basis vector to determine the relations among the target texts and thereby organize the knowledge base.
The computer storage medium provided by an embodiment of the present invention stores computer-executable instructions that, when executed by a processor, implement the above method for organizing a knowledge base.
In the technical solution of the embodiment of the present invention, a tensor space corresponding to a general dictionary is built; based on that tensor space, the tensor of each target text in the tensor space is determined; based on that tensor space, the basis vector in the tensor space that corresponds to a target professional-domain dictionary is determined; based on that basis vector, the projection components of each target text's tensor on the basis vector are calculated; and based on the projection components of each target text, the relations among the target texts are determined so as to organize the knowledge base. With this technical solution, each target text is represented as a tensor in the tensor space and each dimension has its own differentiated scale, so the original meaning of the language is taken into account when the knowledge base is organized. In addition, the projection components of each target text's tensor on the basis vector of the target professional-domain dictionary capture how the professional language context formed by a particular professional domain affects the original meaning expressed by the target text content. The classification is therefore more accurate, which benefits the construction of large-scale and ultra-large-scale knowledge bases.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the method for organizing a knowledge base according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of building the tensor space according to an embodiment of the present invention;
Fig. 3 is a schematic flowchart of building the tensor of a target text according to an embodiment of the present invention;
Fig. 4 is a schematic flowchart of building the basis vector of the professional-domain dictionary according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of the device for organizing a knowledge base according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of the tensor space construction unit according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of the tensor determining unit according to an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of the basis vector determining unit according to an embodiment of the present invention;
Fig. 9 is a schematic structural diagram of the organizing unit according to an embodiment of the present invention.
Detailed description
To provide a fuller understanding of the features and technical content of the embodiments of the present invention, the implementation of the embodiments is described in detail below with reference to the accompanying drawings. The drawings are for reference and illustration only and are not intended to limit the embodiments of the present invention.
The technical solution of the embodiment of the present invention makes use of the information entropy theorem, the principles of language communication, the coordinate-system invariance of tensors, the relation between the components of a tensor and the basis vectors, computer engineering methods, and so on, to propose a new scheme for organizing a knowledge base. Under the scenarios of different professional domains, it can automatically organize the target texts (such as articles, paragraphs, or sentences) of a knowledge base to be built into a systematically organized knowledge base suited to the professional domain in question.
The following technical solution of the embodiment of the present invention is applied to a computer device, which may be any device with computing capability, such as a notebook computer, a desktop computer, or a server.
Building the knowledge base is an important part of knowledge management. The construction principles of a knowledge base comprise the following parts: 1) determining the knowledge to be managed; 2) determining the source and driving force of the knowledge; 3) organizing the knowledge base; 4) using the knowledge base. Among these, the organization of the knowledge base is the key factor, and the embodiment of the present invention aims to organize the knowledge base efficiently.
Fig. 1 is a schematic flowchart of the method for organizing a knowledge base according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step 101: Build a tensor space corresponding to a general dictionary.
In the embodiment of the present invention, the general dictionary is a dictionary formed from a wide variety of Chinese electronic documents. The formats of the Chinese electronic documents include, but are not limited to, Word, PPT, PDF, and TXT. The Chinese electronic documents may also be original entries in a database, such as articles, paragraphs, or sentences. It is worth noting that the technical solution of the embodiment of the present invention is not limited to Chinese electronic documents; electronic documents in other languages can also use it.
In the embodiment of the present invention, the tensor space corresponding to the general dictionary can be built through the following process:
Establish a basic dictionary, and obtain the corpus corresponding to the general dictionary;
Based on the basic dictionary, perform word segmentation on the corpus to obtain a first word set, where the first word set forms the original general dictionary;
Calculate the first information entropy value of each word in the original general dictionary;
Select from the original general dictionary the words whose first information entropy value is greater than or equal to a preset information entropy threshold to obtain a second word set, where the second word set forms the general dictionary;
Take each word in the general dictionary as one dimension to form an N-dimensional high-order tensor space, where N is the number of words in the general dictionary and N is a positive integer.
In the above scheme, the general dictionary is the dictionary formed by the set of words whose first information entropy value is greater than or equal to the preset information entropy threshold.
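As an illustrative sketch only, and not part of the disclosed embodiment, the threshold selection and the mapping from words to dimensions described above might look as follows in Python. The function name `build_tensor_space`, the sample words, and the entropy values are assumptions made for the example.

```python
def build_tensor_space(entropy_by_word, threshold):
    """Keep words whose entropy value is >= threshold and give each one a dimension index."""
    kept = sorted(word for word, h in entropy_by_word.items() if h >= threshold)
    return {word: dim for dim, word in enumerate(kept)}

# Hypothetical first information entropy values for an original general dictionary.
original_dictionary = {"bank": 2.1, "loan": 3.4, "the": 0.1, "treaty": 2.9}

space = build_tensor_space(original_dictionary, threshold=2.0)
n_dims = len(space)  # N, the number of words in the general dictionary
```

Sorting the kept words only makes the dimension numbering reproducible; any fixed ordering would serve equally well.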
Step 102: Based on the tensor space corresponding to the general dictionary, determine the tensor of each target text in the tensor space.
In the embodiment of the present invention, a target text is the smallest unit in the knowledge base for which relations are to be built, such as an article, a paragraph, or a sentence.
In the embodiment of the present invention, the tensor of each target text in the tensor space can be determined through the following process:
For each target text, perform word segmentation on the target text to obtain a third word set;
Determine the first information entropy value of each word in the third word set and the number of times each word occurs in the target text;
Calculate the product of each word's first information entropy value and its number of occurrences to obtain the result value of that word;
Assign values in the tensor space based on the result value of each word to obtain the tensor of the target text.
In the embodiment of the present invention, representing each target text as a tensor in the tensor space gives each dimension its own differentiated scale, so the original meaning of the language is taken into account when the knowledge base is organized. Compared with the prior art, the organization of the knowledge base is more accurate.
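The process above can be sketched as follows, under the assumption that the tensor is stored as a plain list with one entry per dimension. The names `text_tensor`, `space`, and `entropy`, and all sample values, are hypothetical.

```python
from collections import Counter

def text_tensor(words, space, entropy_by_word):
    """Result value per dimension = first information entropy value * occurrence count."""
    counts = Counter(words)
    tensor = [0.0] * len(space)
    for word, count in counts.items():
        if word in space:  # words outside the general dictionary are ignored (assumption)
            tensor[space[word]] = entropy_by_word[word] * count
    return tensor

# Hypothetical tensor space and entropy values.
space = {"bank": 0, "loan": 1, "treaty": 2}
entropy = {"bank": 2.0, "loan": 3.0, "treaty": 2.5}

# A segmented target text: "bank" occurs twice, "loan" once, "of" is not in the dictionary.
tensor = text_tensor(["bank", "loan", "bank", "of"], space, entropy)
```

Here `tensor` holds 2.0 × 2 for "bank", 3.0 × 1 for "loan", and 0 for "treaty", which does not occur in the text.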
Step 103: Based on the tensor space corresponding to the general dictionary, determine the basis vector in the tensor space that corresponds to a target professional-domain dictionary.
In the embodiment of the present invention, the target professional-domain dictionary may be built from the Chinese electronic documents of a specific domain. It is worth noting that the technical solution of the embodiment of the present invention is not limited to Chinese electronic documents; electronic documents in other languages can also use it.
In the embodiment of the present invention, all Chinese electronic documents are classified by domain, for example: Chinese electronic documents of the financial domain, of the political domain, of the military domain, of the legal domain, and so on.
In the embodiment of the present invention, the basis vector in the tensor space that corresponds to the target professional-domain dictionary can be determined through the following process:
Establish a basic dictionary, and obtain the target corpus corresponding to the target professional-domain dictionary;
Based on the basic dictionary, perform word segmentation on the target corpus to obtain a fourth word set, where the fourth word set forms the original target professional-domain dictionary;
Calculate the second information entropy value of each word in the original target professional-domain dictionary;
Assign values in the tensor space based on the second information entropy value of each word to obtain the basis vector corresponding to the target professional-domain dictionary.
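A minimal sketch of the final assignment step, assuming that domain words absent from the general dictionary have no dimension and are therefore skipped (the text does not say how such words are handled). The name `domain_basis_vector` and the sample values are hypothetical.

```python
def domain_basis_vector(space, domain_entropy):
    """Place each domain word's second information entropy value on its dimension."""
    basis = [0.0] * len(space)
    for word, h in domain_entropy.items():
        if word in space:  # assumption: words without a dimension are skipped
            basis[space[word]] = h
    return basis

# Hypothetical tensor space and second information entropy values for a finance corpus.
space = {"bank": 0, "loan": 1, "treaty": 2}
finance_entropy = {"loan": 4.0, "bank": 1.5, "ledger": 3.0}

basis = domain_basis_vector(space, finance_entropy)
```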
Step 104: Based on the basis vector in the tensor space that corresponds to the target professional-domain dictionary, calculate the projection components of each target text's tensor on the basis vector.
In the embodiment of the present invention, for each target text, the tensor of the target text is projected onto the basis vector to obtain N projection components, where the N projection components form an N-dimensional vector.
In the embodiment of the present invention, the projection components of each target text's tensor on the basis vector of the target professional-domain dictionary capture how the professional language context formed by a particular professional domain affects the original meaning expressed by the target text content; that is, the special meaning of the text in a specific language environment is taken into account. Compared with the prior art, the organization of the knowledge base is more accurate.
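The text does not give a formula for the projection. One possible reading, shown here purely as an assumption, is a component-wise scaling in which each dimension of the text tensor is multiplied by the corresponding component of the basis vector, which yields N components as described. The function name `project` and the sample values are hypothetical.

```python
def project(tensor, basis):
    """Assumed reading: component i of the projection = tensor[i] * basis[i]."""
    if len(tensor) != len(basis):
        raise ValueError("tensor and basis must have the same number of dimensions")
    return [t * b for t, b in zip(tensor, basis)]

# Hypothetical text tensor and domain basis vector in a 3-dimensional space.
projection = project([4.0, 3.0, 0.0], [1.5, 4.0, 0.0])
```

Under this reading, a word that is important in the domain (large basis component) amplifies the corresponding dimension of the text, while a word the domain never uses contributes nothing.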
Step 105: Based on the projection components of each target text, determine the relations among the target texts so as to organize the knowledge base.
In the embodiment of the present invention, determining the relations among the target texts based on their projection components, so as to organize the knowledge base, can be realized through the following process:
Regularize the projection components of each target text;
Based on the regularization result of the projection components, calculate the angle value of the projection components relative to the basis vector;
Cluster the angle values of the projection components relative to the basis vector to determine the relations among the target texts and thereby organize the knowledge base.
Here, the angle value of each set of projection components relative to the basis vector is calculated using the law of cosines.
Here, the angle values are clustered using a clustering algorithm such as K-means.
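The three sub-steps can be sketched as follows. This is a sketch under stated assumptions: regularization is taken to mean scaling to unit length, the angle is obtained from the cosine of the angle between two vectors, and the clustering is a small one-dimensional K-means written with the standard library. All function names and sample values are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length (assumed meaning of 'regularization')."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def angle_to(v, basis):
    """Angle in radians between v and basis, from cos(theta) = a.b / (|a||b|)."""
    dot = sum(a * b for a, b in zip(normalize(v), normalize(basis)))
    return math.acos(max(-1.0, min(1.0, dot)))  # clamp against rounding error

def kmeans_1d(values, centers, iterations=20):
    """Plain K-means on scalar values, starting from the given centers."""
    centers = list(centers)
    labels = [0] * len(values)
    for _ in range(iterations):
        labels = [min(range(len(centers)), key=lambda c: abs(v - centers[c]))
                  for v in values]
        for c in range(len(centers)):
            members = [v for v, label in zip(values, labels) if label == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels, centers

# Hypothetical projection components of four target texts in a 2-dimensional space.
basis = [1.0, 0.0]
projections = [[1.0, 0.1], [1.0, 0.2], [0.1, 1.0], [0.2, 1.0]]
angles = [angle_to(p, basis) for p in projections]
labels, centers = kmeans_1d(angles, centers=[min(angles), max(angles)])
```

Texts whose angles fall into the same cluster are treated as related; in the sample data the first two texts lie close to the basis vector and the last two lie far from it.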
In addition, the scheme of the embodiment of the present invention also includes the following:
A text that is subsequently added to the knowledge base is added automatically to the knowledge base that has already been organized. Specifically, for a text newly added to the knowledge base, the above steps 101 to 104 are performed, the result is regularized, its angle value is calculated using the law of cosines, and a clustering algorithm is then used to classify the angle value so that the text is placed into the existing knowledge base.
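The text does not specify how the clustering algorithm classifies the new angle value. One simple possibility, shown here as an assumption, is to assign the new text to the existing cluster whose center is nearest. The function name `assign_to_cluster` and the center values are hypothetical.

```python
def assign_to_cluster(angle, centers):
    """Return the index of the existing cluster center closest to the new angle value."""
    return min(range(len(centers)), key=lambda c: abs(angle - centers[c]))

# Hypothetical cluster centers (in radians) from a knowledge base that is already organized.
centers = [0.15, 1.42]
label = assign_to_cluster(1.30, centers)
```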
Each of the above steps of the embodiment of the present invention is described in detail below.
Fig. 2 is a schematic flowchart of building the tensor space according to an embodiment of the present invention. As shown in Fig. 2, the flow includes the following steps:
Step 201: Establish a basic dictionary, and obtain the corpus corresponding to the general dictionary.
In the embodiment of the present invention, the basic dictionary is established on the basis of the Xinhua Dictionary.
In the embodiment of the present invention, various Chinese electronic documents at big-data scale are collected to form the corpus corresponding to the general dictionary.
Here, big-data scale refers to a data volume at or above the PB order of magnitude. The data volume is measured in units of computer storage space, where 1 PB = 1024 TB.
Step 202: Based on the basic dictionary, perform word segmentation on the corpus to obtain a first word set, where the first word set forms the original general dictionary.
In the embodiment of the present invention, the corpus is segmented using a short-word-first string-matching segmentation method to form the original general dictionary. Here, the basic dictionary is the reference that the word segmentation matches against.
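The text names the segmentation method but does not describe it. A minimal sketch of one plausible interpretation, forward matching that tries the shortest dictionary word first at each position, is given below. The function name, the `max_len` parameter, and the sample dictionary are assumptions.

```python
def segment_shortest_first(text, dictionary, max_len=4):
    """Forward string matching that prefers the shortest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        for length in range(1, max_len + 1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                break
        else:
            candidate = text[i]  # no match: emit the single character as its own token
        words.append(candidate)
        i += len(candidate)
    return words

# Hypothetical basic dictionary; "知识库" is present but the shorter "知识" wins.
basic_dictionary = {"知识", "库", "组织", "知识库"}
tokens = segment_shortest_first("知识库组织", basic_dictionary)
```

The sample shows the effect of the short-word preference: the text is split as "知识" + "库" rather than as the longer dictionary entry "知识库".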
Step 203: For the original general dictionary, count the frequency of occurrence of each word.
Step 204: Based on the frequency of occurrence of each word, calculate the first information entropy value of each word.
In the embodiment of the present invention, the information entropy value of a word is calculated using the term frequency-inverse document frequency (TF-IDF) principle. Here, the information entropy value represents the importance of the word.
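The exact formula is not given in the text. As a hedged illustration, the sketch below uses one common TF-IDF form at corpus level: a word's share of all tokens multiplied by the logarithm of the number of documents divided by the number of documents containing the word. The function name `tfidf_scores` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_scores(documents):
    """Corpus-level score per word: (share of all tokens) * log(N / documents containing it)."""
    n_docs = len(documents)
    term_freq = Counter(word for doc in documents for word in doc)
    doc_freq = Counter(word for doc in documents for word in set(doc))
    total = sum(term_freq.values())
    return {word: (term_freq[word] / total) * math.log(n_docs / doc_freq[word])
            for word in term_freq}

# Hypothetical segmented corpus of three documents.
documents = [["loan", "bank", "the"], ["the", "treaty"], ["the", "loan", "loan"]]
scores = tfidf_scores(documents)
```

A word that appears in every document, such as "the" in the sample, receives a score of zero, which is consistent with the score representing the importance of a word.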
Step 205: Select from the original general dictionary the words whose first information entropy value is greater than or equal to a preset information entropy threshold to obtain a second word set, where the second word set forms the general dictionary.
In the embodiment of the present invention, the user presets the information entropy threshold, and the words in the original general dictionary are sorted by information entropy value in descending order. The set of words whose information entropy value is at or above the threshold is then taken as the general dictionary.
For example: all words in the original general dictionary are sorted by information entropy value in descending order. If the original general dictionary contains 100,000 words, the 20,000 words with the largest information entropy values are selected. Each word corresponds to one dimension, and the modulus of the dimension is the word's information entropy value, which forms an N-dimensional tensor space with N = 20000.
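The rank-based selection in this example can be sketched as follows: sort by entropy value in descending order, keep the first k words, and read off the entropy value of the last kept word as the implied threshold. The function name `top_k_dictionary` and the sample values are hypothetical, and k = 2 stands in for the 20,000 of the example.

```python
def top_k_dictionary(entropy_by_word, k):
    """Keep the k words with the largest entropy values; also report the implied threshold."""
    ranked = sorted(entropy_by_word.items(), key=lambda item: item[1], reverse=True)
    kept = ranked[:k]
    threshold = kept[-1][1] if kept else None
    return [word for word, _ in kept], threshold

# Hypothetical entropy values for a four-word original general dictionary.
words, threshold = top_k_dictionary({"a": 0.5, "b": 3.0, "c": 2.0, "d": 1.0}, k=2)
```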
Step 206: Take each word in the general dictionary as one dimension to form an N-dimensional high-order tensor space.
Here, N is the number of words in the general dictionary, and N is a positive integer.
In the embodiment of the present invention, a mathematical space that measures specialized vocabulary invariantly is constructed in the form of a tensor space. The general dictionary underlying the N-dimensional tensor space, together with the property that a tensor is invariant under different coordinate systems, makes the general dictionary applicable across different particular professional domains.
Fig. 3 is a schematic flowchart of building the tensor of a target text according to an embodiment of the present invention. As shown in Fig. 3, the flow includes the following steps:
Step 301: For each target text, perform word segmentation on the target text to obtain a third word set.
In the embodiment of the present invention, a segmentation method based on machine learning and probability statistics is used to split the target text into the word segmentation result with the highest probability, which gives the third word set.
Step 302: Determine the first information entropy value of each word in the third word set and the number of times each word occurs in the target text.
Step 303: Calculate the product of each word's first information entropy value and its number of occurrences to obtain the result value of that word.
Step 304: Assign values in the tensor space based on the result value of each word to obtain the tensor of the target text.
In the embodiment of the present invention, the tensor space is assigned the result value of each word, which forms the tensor of the target text.
For example: suppose 10,000 texts need to be organized into a knowledge base. One of these 10,000 documents is segmented and yields, say, 700 words, of which 500 belong to the general dictionary. The values derived from the information entropy values of these 500 words (the result values of steps 303 and 304) are assigned to the corresponding dimensions of the N-dimensional space, and the other 19,500 dimensions are all assigned 0. This forms a tensor, which can be represented by the matrix S_1(N_1, N_2, ..., N_20000). The remaining 9,999 texts are processed in the same way, which gives 10,000 tensors in total.
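Because only 500 of the 20,000 dimensions are non-zero in this example, a sparse mapping from dimension index to value is a natural way to store such a tensor. The sketch below is an illustration under that assumption; the placeholder value 1.5 and the choice of dimensions 0 to 499 are hypothetical.

```python
N = 20000  # number of dimensions in the example above

def to_dense(sparse, n):
    """Expand a {dimension: value} mapping into a length-n list, filling gaps with 0."""
    dense = [0.0] * n
    for dim, value in sparse.items():
        dense[dim] = value
    return dense

# Hypothetical result values for the 500 in-dictionary words (dimensions 0..499).
sparse_tensor = {dim: 1.5 for dim in range(500)}
zero_dims = N - len(sparse_tensor)  # dimensions that are assigned 0
dense_tensor = to_dense(sparse_tensor, N)
```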
In the embodiment of the present invention, the general dictionary is determined from a big-data corpus, and the information entropy value of each word (that is, each dimension) in the general dictionary serves as the base corresponding to that word (that is, that dimension); here, the base refers to the unit of measure for '1'. The modulus of each word of the target text is determined from the word's information entropy value and its number of occurrences in the target text, and the moduli of all words of the target text form the tensor. Because each dimension thus has its own unit of measure, the original meaning of the text is taken into account when the knowledge base is organized, and the organization of the knowledge base is more accurate than in the prior art.
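Steps 301 to 304 can be sketched as follows. This is only an illustrative sketch: the function name `text_tensor`, the toy vocabulary, and the entropy values are hypothetical, segmentation is assumed to have been done already, and the tensor is stored as a plain list with one entry per general-dictionary word.

```python
from collections import Counter

def text_tensor(tokens, vocabulary, entropy):
    """Return the N-dimensional value list for one segmented target text.

    tokens     -- words obtained by segmenting the target text (Step 301)
    vocabulary -- ordered list of the N general-dictionary words, one per dimension
    entropy    -- dict mapping each general-dictionary word to its first
                  information entropy value (hypothetical values below)
    """
    counts = Counter(tokens)  # Step 302: occurrence count of each word
    # Steps 303-304: result value = entropy * count. Dictionary words absent
    # from the text get 0; text words outside the dictionary are ignored.
    return [entropy[w] * counts[w] if w in counts else 0.0 for w in vocabulary]

vocabulary = ["tank", "missile", "bank", "loan"]
entropy = {"tank": 2.0, "missile": 3.0, "bank": 1.5, "loan": 2.5}
tokens = ["tank", "tank", "missile", "the"]  # "the" is not in the dictionary
tensor = text_tensor(tokens, vocabulary, entropy)
print(tensor)  # [4.0, 3.0, 0.0, 0.0]
```

With N = 20000 most entries are 0, so a sparse mapping from dimension index to value would likely be a better fit in practice than a dense list.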
Fig. 4 is a schematic flowchart of constructing the basis vector of a professional domain dictionary according to the embodiment of the present invention. As shown in Fig. 4, the flow comprises the following steps:
Step 401: Establish a basic dictionary, and obtain the target corpus corresponding to the target professional domain dictionary.
In the embodiment of the present invention, the basic dictionary is established based on the Xinhua Dictionary.
In the embodiment of the present invention, Chinese electronic documents of a given category (such as finance, politics, military, or law) are collected at big-data scale (above the P order of magnitude) to form the specific corpus corresponding to a particular professional domain dictionary.
Step 402: Based on the basic dictionary, perform word segmentation on the target corpus to obtain a fourth word set, wherein the fourth word set forms the original target professional domain dictionary.
In the embodiment of the present invention, the target corpus is segmented using a short-word-first string-matching segmentation method, forming the original target professional domain dictionary; here, the basic dictionary serves as the comparison object for the segmentation.
Step 403: For the original target professional domain dictionary, count the frequency of occurrence of each word.
Step 404: Based on the frequency of occurrence of each word, calculate the second information entropy value of each word.
Here, the second information entropy value of a word is calculated using the TF-IDF principle.
Step 405: Assign values in the tensor space based on the second information entropy value of each word to obtain the basis vector corresponding to the target professional domain dictionary.
In the embodiment of the present invention, if a word does not appear in the text, the modulus (that is, the dimension value) corresponding to that word is 0; otherwise, the modulus (that is, the dimension value) corresponding to that word is determined from the information entropy value of the word in the target professional domain dictionary.
For example, processing the corpus formed from "military" texts yields a "military" basis vector. This basis vector is represented by an N × N matrix, where N = 20000, as shown in the following formula:
Here, S2 is a 20000 × 20000 diagonal matrix; each row has only one non-zero element.
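Steps 403 to 405 can be sketched as follows, under assumptions the text does not state. The text says only that the second information entropy value follows the TF-IDF principle, so the exact weighting used here (relative term frequency multiplied by `log(n_docs / df) + 1`) is an assumption, as are the function name `domain_basis` and the toy data. The result is an N × N diagonal matrix stored as nested lists.

```python
import math
from collections import Counter

def domain_basis(domain_docs, vocabulary):
    """Build an N x N diagonal matrix for one professional domain.

    domain_docs -- list of token lists from the segmented domain corpus
    vocabulary  -- ordered list of the N general-dictionary words
    """
    n_docs = len(domain_docs)
    tf = Counter(w for doc in domain_docs for w in doc)       # Step 403
    df = Counter(w for doc in domain_docs for w in set(doc))  # document frequency
    total = sum(tf.values())
    n = len(vocabulary)
    basis = [[0.0] * n for _ in range(n)]
    for i, word in enumerate(vocabulary):
        if word in tf:
            # Step 404 (assumed TF-IDF variant): the "+ 1" keeps the weight
            # non-zero even for words that occur in every document.
            idf = math.log(n_docs / df[word]) + 1.0
            basis[i][i] = (tf[word] / total) * idf            # Step 405
        # Words absent from the domain corpus keep the value 0.
    return basis

vocabulary = ["tank", "missile", "bank"]
military_docs = [["tank", "missile", "tank"], ["tank", "radar"]]
s2 = domain_basis(military_docs, vocabulary)
for row in s2:
    print(row)
```

Only the diagonal carries information, so storing the N diagonal values alone would be equivalent and far smaller than a 20000 × 20000 matrix.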
The technical solution of the embodiment of the present invention considers the components of the knowledge base (the target texts that make up the knowledge base). The original semantic expression of each target text is abstracted as the tensor that the target text forms in the tensor space corresponding to the general dictionary, and this tensor is further expressed as projection components under the basis vector of the target professional domain. Compared with the prior art, the organization of the knowledge base is more accurate.
The technical solution of the embodiment of the present invention considers the influence of the professional language context of the target professional domain on the original semantic expression of the target text, and constructs an N-dimensional vector corresponding to the particular professional domain as the basis vector. The tensor formed in the tensor space from the information entropy values of the words contained in the target text is combined with this basis vector in the calculation, which yields the projection components of the target text in the target professional domain. Compared with the prior art, the organization of the knowledge base is more accurate.
The embodiment of the present invention introduces information entropy into the tensor space, the tensor, the vector space, and the projection components, so that the subsequent calculations are multilinear transformations. Specifically, to organize the knowledge base according to the precise meaning of each target text, it is necessary to consider, in addition to the word information entropy, word frequency, and inverse document frequency of the target text, multiple macroscopic influencing factors such as the application field and the overall linguistic context of the text, rather than relying only on word information entropy, word frequency, and inverse document frequency. Compared with the prior art, the organization of the knowledge base is more accurate.
Fig. 5 is a schematic structural diagram of the knowledge base organizing device according to the embodiment of the present invention. As shown in Fig. 5, the device comprises:
a tensor space construction unit 501, configured to construct the tensor space corresponding to the general dictionary;
a tensor determining unit 502, configured to determine, based on the tensor space corresponding to the general dictionary, the tensor of each target text in the tensor space;
a basis vector determining unit 503, configured to determine, based on the tensor space corresponding to the general dictionary, the basis vector corresponding to the target professional domain dictionary in the tensor space;
a projection component computing unit 504, configured to calculate, based on the basis vector corresponding to the target professional domain dictionary in the tensor space, the projection components of the tensor of each target text on the basis vector;
an organizing unit 505, configured to determine, based on the projection components of each target text, the relationships between the target texts, so as to organize the knowledge base.
In one embodiment, as shown in Fig. 6, the tensor space construction unit comprises:
an establishing subunit 601, configured to establish a basic dictionary;
a first obtaining subunit 602, configured to obtain the corpus corresponding to the general dictionary;
a first word segmentation processing unit 603, configured to perform word segmentation on the corpus based on the basic dictionary to obtain a first word set, wherein the first word set forms the original general dictionary;
a first information entropy computing unit 604, configured to calculate the first information entropy value of each word in the original general dictionary;
a selecting subunit 605, configured to select, from the original general dictionary, the words whose first information entropy value is greater than or equal to a preset information entropy threshold to obtain a second word set, wherein the second word set forms the general dictionary;
a constructing subunit 606, configured to take each word in the general dictionary as one dimension to form an N-dimensional high-order tensor space, wherein N is the number of words in the general dictionary and N is a positive integer.
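The work of units 604 to 606 can be sketched as follows. The text does not define how the first information entropy value of a word is computed, so this sketch assumes the self-information `-log2(p)` of the word's relative frequency in the corpus; that choice, the function name `build_general_dictionary`, the threshold, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter

def build_general_dictionary(segmented_corpus, threshold):
    """Select general-dictionary words and give each one a dimension index.

    segmented_corpus -- list of token lists (the first word set, per document)
    threshold        -- preset information entropy threshold
    """
    counts = Counter(w for doc in segmented_corpus for w in doc)
    total = sum(counts.values())
    # Assumed entropy measure: self-information of the word's relative frequency.
    entropy = {w: -math.log2(c / total) for w, c in counts.items()}
    # Keep words whose value is greater than or equal to the threshold.
    selected = sorted(w for w, h in entropy.items() if h >= threshold)
    # Each selected word becomes one dimension of the N-dimensional space.
    dimension = {w: i for i, w in enumerate(selected)}
    return selected, dimension, entropy

corpus = [["the", "the", "tensor"], ["the", "entropy"]]
words, dimension, entropy = build_general_dictionary(corpus, threshold=1.0)
print(words)       # ['entropy', 'tensor']
print(len(words))  # N = 2
```

Under this assumed measure, very frequent words such as "the" fall below the threshold and do not receive a dimension.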
In an embodiment of the present invention, as shown in Fig. 7, the tensor determining unit comprises:
a second word segmentation processing unit 701, configured to perform, for each target text, word segmentation on the target text to obtain a third word set;
a result value computing subunit 702, configured to determine the first information entropy value of each word in the third word set and the number of occurrences of each word in the target text, and to calculate the product of the first information entropy value and the number of occurrences of each word to obtain the result value of each word;
a first assignment processing subunit 703, configured to assign values in the tensor space based on the result value of each word to obtain the tensor of the target text.
In an embodiment of the present invention, as shown in Fig. 8, the basis vector determining unit comprises:
an establishing subunit 801, configured to establish a basic dictionary;
a second obtaining subunit 802, configured to obtain the target corpus corresponding to the target professional domain dictionary;
a third word segmentation processing unit 803, configured to perform word segmentation on the target corpus based on the basic dictionary to obtain a fourth word set, wherein the fourth word set forms the original target professional domain dictionary;
a second information entropy computing unit 804, configured to calculate the second information entropy value of each word in the original target professional domain dictionary;
a second assignment processing subunit 805, configured to assign values in the tensor space based on the second information entropy value of each word to obtain the basis vector corresponding to the target professional domain dictionary.
In an embodiment of the present invention, the projection component computing unit is specifically configured to: for each target text, project the tensor of the target text onto the basis vector to obtain N projection components, wherein the N projection components form an N-dimensional vector.
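The projection can be sketched as follows. The text does not give a projection formula, so this sketch assumes that the i-th projection component is the dot product of the text tensor with the i-th row of the N × N basis matrix; for a diagonal matrix this reduces to scaling each dimension by the corresponding diagonal entry. The function name `project` and the numbers are hypothetical.

```python
def project(tensor, basis):
    """Return the N projection components of a text tensor.

    tensor -- list of N values for one target text
    basis  -- N x N matrix (nested lists) for the target professional domain
    """
    # Assumed definition: component i = <row i of basis, tensor>.
    return [sum(b * t for b, t in zip(row, tensor)) for row in basis]

basis = [
    [0.5, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 1.0],
]
tensor = [4.0, 3.0, 0.0]
components = project(tensor, basis)
print(components)  # [2.0, 6.0, 0.0]
```

The N components together form the N-dimensional vector that the organizing unit works on.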
In an embodiment of the present invention, as shown in Fig. 9, the organizing unit comprises:
a regularization processing subunit 901, configured to regularize the projection components of each target text;
an angle calculation subunit 902, configured to calculate, based on the regularization result of the projection components, the angle value of the projection components relative to the basis vector;
a clustering subunit 903, configured to cluster the angle values of the projection components relative to the basis vector, so as to determine the relationships between the target texts and thereby organize the knowledge base.
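The three subunits can be sketched as follows. The text names neither the regularization, the reference direction for the angle, nor the clustering algorithm, so everything specific here is an assumption: regularization is taken to be scaling to unit length, the reference direction is taken to be the diagonal of the basis matrix, and clustering is a simple greedy grouping of sorted angles by a gap tolerance. All names and numbers are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length (assumed form of regularization)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def angle_to(v, direction):
    """Angle in radians between v and a reference direction."""
    a, b = normalize(v), normalize(direction)
    cos = sum(x * y for x, y in zip(a, b))
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp against rounding error

def cluster_by_angle(angles, tolerance):
    """Group text indices whose sorted angles differ by at most `tolerance`."""
    order = sorted(range(len(angles)), key=angles.__getitem__)
    clusters = []
    for i in order:
        if clusters and angles[i] - angles[clusters[-1][-1]] <= tolerance:
            clusters[-1].append(i)
        else:
            clusters.append([i])
    return clusters

direction = [0.5, 2.0, 1.0]  # assumed: diagonal of the domain basis matrix
projections = [[2.0, 6.0, 0.0], [1.0, 3.1, 0.0], [0.0, 0.0, 5.0]]
angles = [angle_to(p, direction) for p in projections]
clusters = cluster_by_angle(angles, tolerance=0.1)
print(clusters)
```

In this toy data the first two texts point in nearly the same direction and end up in one group, while the third forms its own group. Any standard clustering method could replace the greedy grouping.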
Those skilled in the art will understand that the functions implemented by each unit and subunit of the above knowledge base organizing device can be understood with reference to the foregoing description of the knowledge base organizing method. These functions can be implemented by a program running on a processor, or by specific logic circuits.
If the above device of the embodiment of the present invention is implemented as a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the embodiment of the present invention, in essence or in the part that contributes to the prior art, may be embodied as a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the methods described in the embodiments of the present invention. The storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc. Thus, the embodiments of the present invention are not limited to any specific combination of hardware and software.
Correspondingly, the embodiment of the present invention also provides a computer storage medium storing computer-executable instructions which, when executed by a processor, implement the above knowledge base organizing method of the embodiment of the present invention.
The technical solutions described in the embodiments of the present invention may be combined arbitrarily, provided they do not conflict.
In the several embodiments provided by the present invention, it should be understood that the disclosed method and smart device may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or of another form.
The units described above as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may all be integrated into a single processing unit, each unit may serve as a separate unit, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited to it. Any change or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention.

Claims (13)

1. A method for organizing a knowledge base, characterized in that the method comprises:
constructing a tensor space corresponding to a general dictionary;
determining, based on the tensor space corresponding to the general dictionary, the tensor of each target text in the tensor space;
determining, based on the tensor space corresponding to the general dictionary, the basis vector corresponding to a target professional domain dictionary in the tensor space;
calculating, based on the basis vector corresponding to the target professional domain dictionary in the tensor space, the projection components of the tensor of each target text on the basis vector;
determining, based on the projection components of each target text, the relationships between the target texts, so as to organize the knowledge base.
2. The method for organizing a knowledge base according to claim 1, characterized in that constructing the tensor space corresponding to the general dictionary comprises:
establishing a basic dictionary, and obtaining the corpus corresponding to the general dictionary;
performing word segmentation on the corpus based on the basic dictionary to obtain a first word set, wherein the first word set forms an original general dictionary;
calculating the first information entropy value of each word in the original general dictionary;
selecting, from the original general dictionary, the words whose first information entropy value is greater than or equal to a preset information entropy threshold to obtain a second word set, wherein the second word set forms the general dictionary;
taking each word in the general dictionary as one dimension to form an N-dimensional high-order tensor space, wherein N is the number of words in the general dictionary and N is a positive integer.
3. The method for organizing a knowledge base according to claim 2, characterized in that determining, based on the tensor space corresponding to the general dictionary, the tensor of each target text in the tensor space comprises:
for each target text, performing word segmentation on the target text to obtain a third word set;
determining the first information entropy value of each word in the third word set and the number of occurrences of each word in the target text;
calculating the product of the first information entropy value and the number of occurrences of each word to obtain the result value of each word;
assigning values in the tensor space based on the result value of each word to obtain the tensor of the target text.
4. The method for organizing a knowledge base according to claim 1, characterized in that determining, based on the tensor space corresponding to the general dictionary, the basis vector corresponding to the target professional domain dictionary in the tensor space comprises:
establishing a basic dictionary, and obtaining the target corpus corresponding to the target professional domain dictionary;
performing word segmentation on the target corpus based on the basic dictionary to obtain a fourth word set, wherein the fourth word set forms an original target professional domain dictionary;
calculating the second information entropy value of each word in the original target professional domain dictionary;
assigning values in the tensor space based on the second information entropy value of each word to obtain the basis vector corresponding to the target professional domain dictionary.
5. The method for organizing a knowledge base according to claim 1, characterized in that calculating, based on the basis vector corresponding to the target professional domain dictionary in the tensor space, the projection components of the tensor of each target text on the basis vector comprises:
for each target text, projecting the tensor of the target text onto the basis vector to obtain N projection components, wherein the N projection components form an N-dimensional vector.
6. The method for organizing a knowledge base according to any one of claims 1 to 5, characterized in that determining, based on the projection components of each target text, the relationships between the target texts, so as to organize the knowledge base, comprises:
regularizing the projection components of each target text;
calculating, based on the regularization result of the projection components, the angle value of the projection components relative to the basis vector;
clustering the angle values of the projection components relative to the basis vector, so as to determine the relationships between the target texts and thereby organize the knowledge base.
7. A device for organizing a knowledge base, characterized in that the device comprises:
a tensor space construction unit, configured to construct a tensor space corresponding to a general dictionary;
a tensor determining unit, configured to determine, based on the tensor space corresponding to the general dictionary, the tensor of each target text in the tensor space;
a basis vector determining unit, configured to determine, based on the tensor space corresponding to the general dictionary, the basis vector corresponding to a target professional domain dictionary in the tensor space;
a projection component computing unit, configured to calculate, based on the basis vector corresponding to the target professional domain dictionary in the tensor space, the projection components of the tensor of each target text on the basis vector;
an organizing unit, configured to determine, based on the projection components of each target text, the relationships between the target texts, so as to organize the knowledge base.
8. The device for organizing a knowledge base according to claim 7, characterized in that the tensor space construction unit comprises:
an establishing subunit, configured to establish a basic dictionary;
a first obtaining subunit, configured to obtain the corpus corresponding to the general dictionary;
a first word segmentation processing unit, configured to perform word segmentation on the corpus based on the basic dictionary to obtain a first word set, wherein the first word set forms an original general dictionary;
a first information entropy computing unit, configured to calculate the first information entropy value of each word in the original general dictionary;
a selecting subunit, configured to select, from the original general dictionary, the words whose first information entropy value is greater than or equal to a preset information entropy threshold to obtain a second word set, wherein the second word set forms the general dictionary;
a constructing subunit, configured to take each word in the general dictionary as one dimension to form an N-dimensional high-order tensor space, wherein N is the number of words in the general dictionary and N is a positive integer.
9. The device for organizing a knowledge base according to claim 8, characterized in that the tensor determining unit comprises:
a second word segmentation processing unit, configured to perform, for each target text, word segmentation on the target text to obtain a third word set;
a result value computing subunit, configured to determine the first information entropy value of each word in the third word set and the number of occurrences of each word in the target text, and to calculate the product of the first information entropy value and the number of occurrences of each word to obtain the result value of each word;
a first assignment processing subunit, configured to assign values in the tensor space based on the result value of each word to obtain the tensor of the target text.
10. The device for organizing a knowledge base according to claim 7, characterized in that the basis vector determining unit comprises:
an establishing subunit, configured to establish a basic dictionary;
a second obtaining subunit, configured to obtain the target corpus corresponding to the target professional domain dictionary;
a third word segmentation processing unit, configured to perform word segmentation on the target corpus based on the basic dictionary to obtain a fourth word set, wherein the fourth word set forms an original target professional domain dictionary;
a second information entropy computing unit, configured to calculate the second information entropy value of each word in the original target professional domain dictionary;
a second assignment processing subunit, configured to assign values in the tensor space based on the second information entropy value of each word to obtain the basis vector corresponding to the target professional domain dictionary.
11. The device for organizing a knowledge base according to claim 7, characterized in that the projection component computing unit is specifically configured to: for each target text, project the tensor of the target text onto the basis vector to obtain N projection components, wherein the N projection components form an N-dimensional vector.
12. The device for organizing a knowledge base according to claim 7, characterized in that the organizing unit comprises:
a regularization processing subunit, configured to regularize the projection components of each target text;
an angle calculation subunit, configured to calculate, based on the regularization result of the projection components, the angle value of the projection components relative to the basis vector;
a clustering subunit, configured to cluster the angle values of the projection components relative to the basis vector, so as to determine the relationships between the target texts and thereby organize the knowledge base.
13. A computer storage medium, characterized in that computer-executable instructions are stored thereon, and the computer-executable instructions, when executed by a processor, implement the method steps of any one of claims 1 to 6.
CN201710911400.1A 2017-09-29 2017-09-29 Method and device for organizing knowledge base and computer storage medium Active CN107622129B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710911400.1A CN107622129B (en) 2017-09-29 2017-09-29 Method and device for organizing knowledge base and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710911400.1A CN107622129B (en) 2017-09-29 2017-09-29 Method and device for organizing knowledge base and computer storage medium

Publications (2)

Publication Number Publication Date
CN107622129A true CN107622129A (en) 2018-01-23
CN107622129B CN107622129B (en) 2020-03-24

Family

ID=61091108

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710911400.1A Active CN107622129B (en) 2017-09-29 2017-09-29 Method and device for organizing knowledge base and computer storage medium

Country Status (1)

Country Link
CN (1) CN107622129B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130339919A1 (en) * 2012-06-18 2013-12-19 International Business Machines Corporation Method and Apparatus for Hierarchical Wafer Quality Predictive Modeling
CN106228245A (en) * 2016-07-21 2016-12-14 电子科技大学 Infer based on variation and the knowledge base complementing method of tensor neutral net
CN106528610A (en) * 2016-09-28 2017-03-22 厦门理工学院 Knowledge graph representation learning method based on path tensor decomposition
CN106649661A (en) * 2016-12-13 2017-05-10 税云网络科技服务有限公司 Method and device for establishing knowledge base

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112765975A (en) * 2020-12-25 2021-05-07 北京百度网讯科技有限公司 Word segmentation ambiguity processing method, device, equipment and medium
CN112765975B (en) * 2020-12-25 2023-08-04 北京百度网讯科技有限公司 Word segmentation disambiguation processing method, device, equipment and medium

Also Published As

Publication number Publication date
CN107622129B (en) 2020-03-24

Similar Documents

Publication Publication Date Title
Pan et al. Cross-domain sentiment classification via spectral feature alignment
CN112632385A (en) Course recommendation method and device, computer equipment and medium
Sahu et al. Feature engineering and ensemble-based approach for improving automatic short-answer grading performance
Yao et al. Text classification model based on fasttext
CN106599054B (en) Method and system for classifying and pushing questions
CN108388608B (en) Emotion feedback method and device based on text perception, computer equipment and storage medium
RU2583716C2 (en) Method of constructing and detection of theme hull structure
CN110209808A (en) A kind of event generation method and relevant apparatus based on text information
CN110019790A (en) Text identification, text monitoring, data object identification, data processing method
KR20190135129A (en) Apparatus and Method for Documents Classification Using Documents Organization and Deep Learning
CN110705304B (en) Attribute word extraction method
CN112836509A (en) Expert system knowledge base construction method and system
CN111753048A (en) Document retrieval method, device, equipment and storage medium
CN107807958A (en) A kind of article list personalized recommendation method, electronic equipment and storage medium
CN109615001A (en) A kind of method and apparatus identifying similar article
CN110276382A (en) Listener clustering method, apparatus and medium based on spectral clustering
CN103268346A (en) Semi-supervised classification method and semi-supervised classification system
Zahedi et al. Improving text classification performance using PCA and recall-precision criteria
Baena-García et al. TF-SIDF: Term frequency, sketched inverse document frequency
CN106844596A (en) One kind is based on improved SVM Chinese Text Categorizations
CN110019821A (en) Text category training method and recognition methods, relevant apparatus and storage medium
Chen et al. Using latent Dirichlet allocation to improve text classification performance of support vector machine
Kaur et al. Domain ontology graph approach using markov clustering algorithm for text classification
CN107368610A (en) Big text CRF and rule classification method and system based on full text
CN107622129A (en) Method for organizing and device, the computer-readable storage medium of a kind of knowledge base

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant