CN110532544A - Method and system for constructing a low-resource-language tourism-domain knowledge base - Google Patents

Publication number
CN110532544A
Authority
CN
China
Prior art keywords
low
chinese
vector
resource text
term vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910650742.1A
Other languages
Chinese (zh)
Other versions
CN110532544B (en
Inventor
赵小兵
冯小兰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Minzu University of China
Original Assignee
Minzu University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Minzu University of China
Priority to CN201910650742.1A
Publication of CN110532544A
Application granted
Publication of CN110532544B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a method and system for constructing a tourism-domain knowledge base for a low-resource language, and relates to the computer field. The invention constructs a Chinese tourism-domain knowledge base containing multiple knowledge triples and constructs a tourism-domain Chinese-to-low-resource-language dictionary; the knowledge triples in the Chinese tourism knowledge base are translated into the low-resource language by means of this dictionary, thereby constructing the low-resource-language tourism-domain knowledge base. The invention builds knowledge triples from the resource-rich Chinese tourism corpus to obtain a comprehensive Chinese tourism-domain knowledge base, and then migrates that knowledge base to a low-resource-language tourism-domain knowledge base. This solves the technical problem that, because tourism-domain corpora in the low-resource language are scarce online, comprehensive scenic-spot knowledge in the low-resource language is difficult to obtain directly. It achieves the goal of transferring knowledge from a resource-rich language to the low-resource-language domain, and helps realize intelligent services for low-resource-language tourism information and other information.

Description

Method and system for constructing a low-resource-language tourism-domain knowledge base
Technical field
The present invention relates to the field of computer technology, and in particular to a method and system for constructing a low-resource-language tourism-domain knowledge base.
Background
Tourism has become one of people's most important leisure and recreational activities. With the rapid development of the Internet, more and more Chinese tourism websites have appeared online, providing tourists with abundant travel information. Chinese tourism websites carry a large amount of information; their scenic-spot introductions are long and vary in the information they contain. By contrast, intelligent tourism-information services have not yet been realized for low-resource languages. How to use a resource-rich language to help build a knowledge base for a low-resource language has become one of the important research topics in natural language processing.
However, because tourism-domain corpora in low-resource languages are currently scarce online, comprehensive scenic-spot knowledge in a low-resource language is difficult to obtain directly, which makes constructing a knowledge base difficult.
Summary of the invention
(1) Technical problem to be solved
In view of the deficiencies of the prior art, the present invention provides a method for constructing a low-resource-language tourism-domain knowledge base. It solves the technical problem that, because tourism-domain corpora in the low-resource language are scarce online, comprehensive scenic-spot knowledge in the low-resource language is difficult to obtain directly and a knowledge base is therefore difficult to construct.
(2) Technical solution
To achieve the above object, the present invention is realized by the following technical solution:
A method for constructing a low-resource-language tourism-domain knowledge base, the method being executed by a computer and comprising the following steps:
S1, constructing a Chinese tourism-domain knowledge base containing multiple knowledge triples;
S2, obtaining a Chinese corpus and a low-resource-language corpus, and preprocessing both corpora;
S3, obtaining Chinese word vectors X from the preprocessed Chinese corpus, and obtaining low-resource-language word vectors Y from the preprocessed low-resource-language corpus;
S4, obtaining, based on the MUSE model, a linear mapping matrix T from the Chinese word vectors X to the low-resource-language word vectors Y; then multiplying the linear mapping matrix T by the matrix of Chinese word vectors X to obtain the vectors U that correspond to the Chinese word vectors X after mapping into the space of the low-resource-language word vectors Y;
S5, for each Chinese word, finding among the low-resource-language word vectors Y the k low-resource-language words whose cosine distance to that word's vector in U is smallest, taking them as the k translation candidates for the Chinese word, and thereby constructing a tourism-domain Chinese-to-low-resource-language dictionary;
S6, translating the knowledge triples in the Chinese tourism knowledge base into the low-resource language based on the tourism-domain Chinese-to-low-resource-language dictionary, thereby constructing the low-resource-language tourism-domain knowledge base.
Preferably, step S1 specifically comprises:
S101, obtaining a text corpus of Chinese tourism texts;
S102, training on the text corpus with the Word2Vec model to obtain a word vector model and a part-of-speech vector model;
S103, obtaining the position vector of each word in a sentence based on the sentences in the text corpus;
S104, obtaining word vectors from the word vector model and part-of-speech vectors from the part-of-speech vector model, adding the word vector matrix and the part-of-speech vector matrix to obtain word vectors fused with part-of-speech vectors, and then incorporating the position vectors to obtain multi-feature fused word vectors;
S105, inputting the multi-feature fused word vectors into a relation extraction model to obtain a probability distribution over entity relations;
S106, determining the entity relation between two entities according to the probability distribution over entity relations, constructing knowledge triples based on the entity relations, storing the entity knowledge triples in structured form in a database, and thereby constructing the Chinese tourism-domain knowledge base.
Preferably, step S101 is specifically:
Tourism texts are obtained by web crawling and preprocessed, and the preprocessed tourism texts constitute the text corpus. The preprocessing includes sentence splitting, word segmentation, and part-of-speech tagging.
Preferably, step S103 is specifically:
For each word si in a sentence word sequence S = [s1, s2, ..., sl] of length l, the relative distances to the first entity e1 and the second entity e2 are i - i1 and i - i2. The first entity e1 and the second entity e2 are both target entities, i is the index of the current word in the sentence, and i1 and i2 are the indices of the first entity e1 and the second entity e2 respectively; a negative value indicates that the current word precedes the entity word. A position vector matrix of size 2l × d is then obtained by initialization with Word2vec, where d is the dimension of the position vectors. The position vector of each word in the sentence is expressed as pvi = [pvi1, pvi2], where pvi1 and pvi2 are the vector representations of the relative distances from the i-th word to entity e1 and entity e2 respectively.
Preferably, step S104 is specifically:
The word vectors N = {n1, n2, ..., nl} and the part-of-speech vectors V = {v1, v2, ..., vl} are obtained by training the word vector model and the part-of-speech vector model. N and V are added as matrices, giving the word vectors fused with part-of-speech vectors as W = αN + (1 - α)V. Setting α = 0.5 gives W2 = 0.5(N + V) after part-of-speech fusion. The position vectors are then incorporated to obtain the multi-feature fused word vectors W3 = [W2, PV], where PV = {pv1, pv2, ..., pvl}.
Preferably, step S105 is specifically:
The multi-feature fused word vectors W3 are taken as the input of the relation extraction model and processed by a bidirectional LSTM (BLSTM), which captures the textual semantic information of the word sequence in the sentence in both the front-to-back and back-to-front directions. The BLSTM output for the i-th word is calculated using the formula:
The output of the BLSTM layer is connected to a softmax classifier to obtain the probability distribution over entity relations;
The entity relations are of 9 types: location relation, establishment relation, creation relation, affiliation relation, adjacency relation, correlation relation, inclusion relation, equivalence relation, and attribute relation.
Preferably, step S106 is specifically:
The entity relation between two entities is determined according to the probability distribution, and a knowledge triple <first entity e1, second entity e2, entity relation> is constructed from that relation. Processing the text corpus in this way yields multiple knowledge triples, from which the Chinese tourism-domain knowledge base is constructed.
Preferably, step S2 is specifically:
The Chinese corpus and the low-resource-language corpus are obtained by web crawling; useless information is removed so that only the article text is retained, and the text is then segmented into words and stop words are removed.
Preferably, step S3 is specifically:
The preprocessed Chinese corpus is used to train a fastText word vector model, yielding the corresponding Chinese word vectors X; the preprocessed low-resource-language corpus is used to train a fastText word vector model, yielding the corresponding low-resource-language word vectors Y.
The present invention also provides a system for constructing a low-resource-language tourism-domain knowledge base. The system comprises a computer, and the computer comprises:
At least one storage unit;
At least one processing unit;
Wherein at least one instruction is stored in the at least one storage unit, and the at least one instruction is loaded and executed by the at least one processing unit to perform the following steps:
S1, constructing a Chinese tourism-domain knowledge base containing multiple knowledge triples;
S2, obtaining a Chinese corpus and a low-resource-language corpus, and preprocessing both corpora;
S3, obtaining Chinese word vectors X from the preprocessed Chinese corpus, and obtaining low-resource-language word vectors Y from the preprocessed low-resource-language corpus;
S4, obtaining, based on the MUSE model, a linear mapping matrix T from the Chinese word vectors X to the low-resource-language word vectors Y; then multiplying the linear mapping matrix T by the matrix of Chinese word vectors X to obtain the vectors U that correspond to the Chinese word vectors X after mapping into the space of the low-resource-language word vectors Y;
S5, for each Chinese word, finding among the low-resource-language word vectors Y the k low-resource-language words whose cosine distance to that word's vector in U is smallest, taking them as the k translation candidates for the Chinese word, and thereby constructing a tourism-domain Chinese-to-low-resource-language dictionary;
S6, translating the knowledge triples in the Chinese tourism knowledge base into the low-resource language based on the tourism-domain Chinese-to-low-resource-language dictionary, thereby constructing the low-resource-language tourism-domain knowledge base.
(3) Beneficial effects
The present invention provides a method and system for constructing a low-resource-language tourism-domain knowledge base. Compared with the prior art, it has the following beneficial effects:
The present invention constructs a Chinese tourism-domain knowledge base containing multiple knowledge triples and constructs a tourism-domain Chinese-to-low-resource-language dictionary; the knowledge triples in the Chinese tourism knowledge base are translated into the low-resource language by means of this dictionary, thereby constructing the low-resource-language tourism-domain knowledge base. The invention builds knowledge triples from the resource-rich Chinese tourism corpus to obtain a comprehensive Chinese tourism-domain knowledge base, and then migrates that knowledge base to a low-resource-language tourism-domain knowledge base. This solves the technical problem that, because tourism-domain corpora in the low-resource language are scarce online, comprehensive scenic-spot knowledge in the low-resource language is difficult to obtain directly. It achieves the goal of transferring knowledge from a resource-rich language to the low-resource-language domain, and helps realize intelligent services for low-resource-language tourism information and other information.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a block diagram of a method for constructing a low-resource-language tourism-domain knowledge base according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the method for calculating the distance from each word in a sentence to the target entities in an embodiment of the present invention;
Fig. 3 is a block diagram of the relation extraction model in an embodiment of the present invention;
Fig. 4 is a structural schematic diagram of the word vector mapping in an embodiment of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below. Obviously, the described embodiments are some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
By providing a method and system for constructing a low-resource-language tourism-domain knowledge base, the embodiments of the present application solve the technical problem that, because tourism-domain corpora in the low-resource language are scarce online, comprehensive scenic-spot knowledge in the low-resource language is difficult to obtain directly and a knowledge base is therefore difficult to construct, and they achieve the goal of transferring knowledge from a resource-rich language to the low-resource-language domain.
To solve the above technical problem, the general idea of the technical solution in the embodiments of the present application is as follows:
The embodiments of the present invention construct a Chinese tourism-domain knowledge base containing multiple knowledge triples and construct a tourism-domain Chinese-to-low-resource-language dictionary; the knowledge triples in the Chinese tourism knowledge base are translated into the low-resource language by means of this dictionary, thereby constructing the low-resource-language tourism-domain knowledge base. The invention builds knowledge triples from the resource-rich Chinese tourism corpus to obtain a comprehensive Chinese tourism-domain knowledge base, and then migrates that knowledge base to a low-resource-language tourism-domain knowledge base. This solves the technical problem that, because tourism-domain corpora in the low-resource language are scarce online, comprehensive scenic-spot knowledge in the low-resource language is difficult to obtain directly. It achieves the goal of transferring knowledge from a resource-rich language to the low-resource-language domain, and helps realize intelligent services for low-resource-language tourism information and other information.
For a better understanding of the above technical solution, it is described in detail below with reference to the accompanying drawings and specific embodiments.
An embodiment of the present invention provides a method for constructing a low-resource-language tourism-domain knowledge base. As shown in Fig. 1, the method is executed by a computer and comprises the following steps:
S1, constructing a Chinese tourism-domain knowledge base containing multiple knowledge triples;
S2, obtaining a Chinese corpus and a low-resource-language corpus, and preprocessing both corpora;
S3, obtaining Chinese word vectors X from the preprocessed Chinese corpus, and obtaining low-resource-language word vectors Y from the preprocessed low-resource-language corpus;
S4, obtaining, based on the MUSE model, a linear mapping matrix T from the Chinese word vectors X to the low-resource-language word vectors Y; then multiplying the linear mapping matrix T by the matrix of Chinese word vectors X to obtain the vectors U that correspond to the Chinese word vectors X after mapping into the space of the low-resource-language word vectors Y;
S5, for each Chinese word, finding among the low-resource-language word vectors Y the k low-resource-language words whose cosine distance to that word's vector in U is smallest, taking them as the k translation candidates for the Chinese word, and thereby constructing a tourism-domain Chinese-to-low-resource-language dictionary;
S6, translating the knowledge triples in the Chinese tourism knowledge base into the low-resource language based on the tourism-domain Chinese-to-low-resource-language dictionary, thereby constructing the low-resource-language tourism-domain knowledge base.
Each step is described in detail below.
It should be noted that Tibetan is used here as the example low-resource language.
S1, constructing a Chinese tourism-domain knowledge base containing multiple knowledge triples. This specifically includes the following steps:
S101, obtaining a text corpus of Chinese tourism texts. In a specific implementation, tourism texts are obtained by web crawling and preprocessed, and the preprocessed tourism texts constitute the text corpus; the preprocessing includes sentence splitting, word segmentation, and part-of-speech tagging. Here part of speech refers to categories such as noun, verb, and adverb. In the embodiments of the present invention these categories can be subdivided further; for example, names are divided into person names, place names, group names, organization names, transliterated person names, and so on.
For example: Lazi Temple is located within Gyangzi County, Shigatse District.
Its word segmentation result is: Lazi Temple / is located in / Shigatse District / Gyangzi County / within / .
The word segmentation and part-of-speech tagging result is: Lazi Temple/na is located in/v Shigatse District/ns Gyangzi County/ns within/s ./wj
Where: ns denotes a place name; na denotes a scenic-spot name; v denotes a verb; s denotes a locative word; wj denotes a punctuation mark;
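The tagged output above can be read back into (word, tag) pairs before the later steps use it. The following is a minimal sketch, assuming each item is written as `word/tag` and items are separated by whitespace; the underscore-joined English tokens and the names `parse_tagged` and `TAG_NAMES` are illustrative and do not come from the patent.

```python
# Hypothetical tag glossary based on the legend above.
TAG_NAMES = {
    "ns": "place name",
    "na": "scenic-spot name",
    "v": "verb",
    "s": "locative word",
    "wj": "punctuation",
}

def parse_tagged(line):
    """Split whitespace-separated 'word/tag' items into (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # rpartition splits on the LAST slash, so the token "./wj" works too.
        word, sep, tag = item.rpartition("/")
        if not sep:
            raise ValueError(f"missing tag in {item!r}")
        pairs.append((word, tag))
    return pairs

# Illustrative English stand-in for the segmented example sentence.
tagged = "Lazi_Temple/na located_in/v Shigatse_District/ns Gyangzi_County/ns within/s ./wj"
pairs = parse_tagged(tagged)
place_names = [word for word, tag in pairs if tag == "ns"]
```

Keeping the tag next to each word makes it straightforward to look up a part-of-speech vector for every token later on.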
S102, training on the text corpus with the Word2Vec model to obtain a word vector model and a part-of-speech vector model;
S103, obtaining the position vector of each word in a sentence based on the distance from each word of the sentence in the text corpus to the target entities, where the target entities are the first entity e1 and the second entity e2.
In a specific implementation, calculating the distance from each word in a sentence to the target entities allows different relation instances within the same sentence to be distinguished well. This is implemented as follows: for each word si in a sentence word sequence S = [s1, s2, ..., sl] of length l, the relative distances to the first entity e1 and the second entity e2 are i - i1 and i - i2, where i is the index of the current word in the sentence, and i1 and i2 are the indices of the first entity e1 and the second entity e2 respectively; a negative value indicates that the current word precedes the entity word. As shown in Fig. 2, the sentence "Qiaga Qude Temple was built at the end of the 16th century and belongs to the Gelug Sect." yields a word sequence of length 8 after word segmentation, in which "was built in" has a relative distance of 1 to the first entity e1 "Qiaga Qude Temple" and a relative distance of -5 to the second entity e2 "Gelug Sect". A position vector matrix of size 2l × d is then obtained by initialization with Word2vec, where d is the dimension of the position vectors. Finally, the position vector of each word in the sentence is expressed as pvi = [pvi1, pvi2], where pvi1 and pvi2 are the vector representations of the relative distances from the i-th word to entity e1 and entity e2 respectively.
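The relative-distance calculation can be sketched in a few lines. This is an illustration rather than the patent's implementation: the eight English tokens are an assumed segmentation of the example sentence, and `relative_positions` and `distance_to_row` are hypothetical helper names. The row shift assumes the position table has 2l rows, as the text suggests.

```python
def relative_positions(tokens, e1, e2):
    """Return (i - i1, i - i2) for every token, using 1-based indices."""
    i1 = tokens.index(e1) + 1
    i2 = tokens.index(e2) + 1
    return [(i - i1, i - i2) for i in range(1, len(tokens) + 1)]

def distance_to_row(distance, length):
    """Shift a distance in [-(l-1), l-1] to a non-negative row index
    so it can address a lookup table with 2l rows (assumed layout)."""
    return distance + length - 1

# Assumed segmentation of the example sentence into 8 tokens.
tokens = ["e1_temple", "built_in", "16th_century", "end", ",",
          "belongs_to", "e2_Gelug_sect", "."]
positions = relative_positions(tokens, "e1_temple", "e2_Gelug_sect")
built_in_distances = positions[1]  # distances for the second token
```

With this segmentation the second token has distances (1, -5), matching the values stated for the example in the text.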
S104, obtaining word vectors from the word vector model and part-of-speech vectors from the part-of-speech vector model, adding the word vector matrix and the part-of-speech vector matrix to obtain word vectors fused with part-of-speech vectors, and then incorporating the position vectors to obtain multi-feature fused word vectors.
In a specific implementation, the word vectors N = {n1, n2, ..., nl} and the part-of-speech vectors V = {v1, v2, ..., vl} are obtained by training the word vector model and the part-of-speech vector model. N and V are added as matrices, giving the word vectors fused with part-of-speech vectors as W = αN + (1 - α)V, where 0 ≤ α ≤ 1. When α = 1, W1 = N, i.e. the word vectors carry no part-of-speech information. Setting α = 0.5 gives W2 = 0.5(N + V) after part-of-speech fusion. The position vectors are then incorporated to obtain the multi-feature fused word vectors W3 = [W2, PV], where PV = {pv1, pv2, ..., pvl}.
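The fusion W = αN + (1 - α)V followed by the concatenation W3 = [W2, PV] can be illustrated with plain Python lists. This is a minimal sketch with made-up two-dimensional vectors; the function name `fuse_features` is hypothetical, and a real implementation would more likely use a tensor library.

```python
def fuse_features(N, V, PV, alpha=0.5):
    """Per word: weighted sum of word and part-of-speech vectors,
    then concatenation with the word's position vector."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    fused = []
    for n, v, pv in zip(N, V, PV):
        if len(n) != len(v):
            raise ValueError("word and part-of-speech vectors must share a dimension")
        w = [alpha * a + (1.0 - alpha) * b for a, b in zip(n, v)]
        fused.append(w + list(pv))  # [W2, PV] for this word
    return fused

# Toy data for a two-word sentence (values are illustrative only).
N = [[1.0, 3.0], [0.0, 2.0]]    # word vectors
V = [[3.0, 1.0], [2.0, 0.0]]    # part-of-speech vectors
PV = [[0.1, 0.2], [0.3, 0.4]]   # position vectors
W3 = fuse_features(N, V, PV, alpha=0.5)
W1_only = fuse_features(N, V, PV, alpha=1.0)
```

Note that the addition requires the word and part-of-speech vectors to have the same dimension, whereas the position vectors are concatenated and may have any dimension.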
S105, inputting the multi-feature fused word vectors into a relation extraction model to obtain a probability distribution over entity relations;
In a specific implementation, the relation extraction model comprises a BLSTM layer, a fully connected layer, and a softmax classifier, as shown in Fig. 3. The multi-feature fused word vectors are taken as the input of the relation extraction model and processed by a bidirectional LSTM, which captures the textual semantic information of the word sequence in the sentence in both the front-to-back and back-to-front directions. The BLSTM output for the i-th word is calculated using the formula:
The output of the BLSTM layer is connected to a softmax classifier to obtain the probability distribution over entity relations;
The entity relations are of 9 types: location relation, establishment relation, creation relation, affiliation relation, adjacency relation, correlation relation, inclusion relation, equivalence relation, and attribute relation.
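The final classification step can be sketched independently of the BLSTM itself. The sketch below assumes that the network has already produced one raw score per relation type and only shows how a softmax turns those scores into a probability distribution; the English relation names are renderings of the nine types listed above, and `relation_distribution` is a hypothetical name.

```python
import math

# English renderings of the nine relation types (assumed label order).
RELATIONS = [
    "location", "establishment", "creation", "affiliation", "adjacency",
    "correlation", "inclusion", "equivalence", "attribute",
]

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relation_distribution(scores):
    """Map one raw score per relation type to a probability per relation."""
    if len(scores) != len(RELATIONS):
        raise ValueError("expected one score per relation type")
    return dict(zip(RELATIONS, softmax(scores)))

# Made-up scores standing in for the output of the fully connected layer.
distribution = relation_distribution([2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
best_relation = max(distribution, key=distribution.get)
```

Subtracting the maximum score before exponentiating leaves the result unchanged but avoids overflow for large scores.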
S106, determining the entity relation between two entities according to the probability distribution over entity relations, constructing knowledge triples based on the entity relations, storing the entity knowledge triples in structured form in a database, and thereby constructing the Chinese tourism-domain knowledge base.
Specifically, the entity relation between two entities is determined according to the probability distribution, and a knowledge triple <first entity e1, second entity e2, entity relation> is constructed from that relation. For example, from the sentence "North Cave Temple is a cave temple in Gansu Province", the knowledge triple <North Cave Temple, Gansu Province, location relation> can be constructed. Processing the text corpus in this way yields multiple knowledge triples, from which the Chinese tourism-domain knowledge base is constructed.
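A small sketch of this step is given below. It assumes that the most probable relation is the one selected and that a single relational table is enough for storage; the patent does not name a database, so SQLite, the table layout, and the helper names `build_triple` and `store_triples` are all illustrative choices.

```python
import sqlite3

def build_triple(e1, e2, distribution):
    """Pick the most probable relation and form <e1, e2, relation>."""
    relation = max(distribution, key=distribution.get)
    return (e1, e2, relation)

def store_triples(conn, triples):
    """Persist triples in a simple three-column table (assumed schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS triples (head TEXT, tail TEXT, relation TEXT)"
    )
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", triples)
    conn.commit()

# In-memory database so the example leaves no files behind.
conn = sqlite3.connect(":memory:")
triple = build_triple(
    "North Cave Temple", "Gansu Province",
    {"location": 0.9, "inclusion": 0.1},  # made-up probabilities
)
store_triples(conn, [triple])
rows = conn.execute("SELECT head, tail, relation FROM triples").fetchall()
```

Storing the relation as a plain label keeps the table easy to translate later, since each column can be mapped through a dictionary on its own.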
S2, obtaining a Chinese corpus and a Tibetan corpus, and preprocessing both corpora.
In a specific implementation, the Chinese corpus and the Tibetan corpus are obtained by web crawling; useless information is removed so that only the article text is retained, and preprocessing such as word segmentation and stop-word removal is then performed.
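The cleaning and filtering steps might look like the sketch below. It assumes that the crawled pages are HTML and that a language-specific word segmenter is supplied by the caller; neither assumption is stated in the patent. The English sample sentence, the stop-word set, and the use of `str.split` as a stand-in segmenter are purely illustrative, since Chinese and Tibetan both need a dedicated segmenter.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")

def clean_page(raw):
    """Strip markup and collapse whitespace, keeping only the text."""
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()

def preprocess(raw, tokenize, stopwords):
    """Clean a page, segment it with the supplied tokenizer,
    and drop stop words."""
    return [tok for tok in tokenize(clean_page(raw)) if tok not in stopwords]

# str.split is only a placeholder for a real word segmenter.
tokens = preprocess(
    "<p>the temple is in Gyangzi</p>",
    tokenize=str.split,
    stopwords={"the", "is", "in"},
)
```

Passing the tokenizer as a parameter lets the same pipeline serve both corpora with different segmenters.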
S3, obtaining Chinese word vectors X from the preprocessed Chinese corpus, and obtaining Tibetan word vectors Y from the preprocessed Tibetan corpus;
In a specific implementation, the preprocessed Chinese corpus is used to train a fastText word vector model, yielding the corresponding Chinese word vectors X, and the preprocessed Tibetan corpus is used to train a fastText word vector model, yielding the corresponding Tibetan word vectors Y;
S4, obtaining, based on the MUSE model, a linear mapping matrix T from the Chinese word vectors X to the Tibetan word vectors Y; then multiplying the linear mapping matrix T by the matrix of Chinese word vectors X to obtain the vectors U that correspond to the Chinese word vectors X after mapping into the space of the Tibetan word vectors Y;
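Once T is available, applying it is a plain matrix-vector product per word. The sketch below covers only that application step (U = TX); learning T with MUSE is not shown. The 2 × 2 rotation matrix and the single toy embedding are made-up values, and `apply_mapping` and `map_embeddings` are hypothetical names.

```python
def apply_mapping(T, vector):
    """Multiply the d x d matrix T by one d-dimensional word vector."""
    return [sum(t * x for t, x in zip(row, vector)) for row in T]

def map_embeddings(T, X):
    """Map every source-language word vector into the target space."""
    return {word: apply_mapping(T, vec) for word, vec in X.items()}

# Toy orthogonal matrix (a 90-degree rotation), standing in for a learned T.
T = [[0.0, 1.0],
     [-1.0, 0.0]]
X = {"temple": [1.0, 0.0]}   # placeholder Chinese word vector
U = map_embeddings(T, X)
```

An orthogonal T, as used in this toy case, preserves vector lengths and angles, so distances between source words are unchanged by the mapping.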
S5, for each Chinese word, finding among the low-resource-language word vectors Y the k low-resource-language words whose cosine distance to that word's vector in U is smallest, taking them as the k translation candidates for the Chinese word, and thereby constructing a tourism-domain Chinese-to-low-resource-language dictionary.
For example, as shown in Fig. 4, U (U = TX) denotes the Chinese word vectors after mapping into the Tibetan word vector space, Y denotes the Tibetan word vector space, x is a Chinese word, and Vx is the word vector corresponding to x. The k Tibetan words in Y with the smallest cosine distance to Vx are taken as the candidate set for translating the Chinese word x into Tibetan, i.e. y1, y2, .... For example, when k = 5, after the Chinese-Tibetan cross-lingual word vector mapping, the Tibetan words nearest in cosine distance to the Chinese word "Dingri County" have, in order, the glosses "Dingri County", "day ancestor", "settled date", "Rikaze", and "middle side"; the Tibetan translation of "Dingri County" is then selected from this candidate set.
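The candidate search can be sketched as a ranking by cosine similarity, which is equivalent to ranking by smallest cosine distance. This is a brute-force illustration with toy two-dimensional vectors; the placeholder keys `t1` to `t3` stand in for Tibetan words, and `translation_candidates` is a hypothetical name.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def translation_candidates(mapped_vector, Y, k=5):
    """Return the k target words closest to the mapped source vector."""
    ranked = sorted(
        Y, key=lambda word: cosine_similarity(mapped_vector, Y[word]),
        reverse=True,
    )
    return ranked[:k]

# Toy target-language vectors keyed by placeholder words.
Y = {"t1": [0.9, 0.1], "t2": [0.0, 1.0], "t3": [-1.0, 0.0]}
candidates = translation_candidates([1.0, 0.0], Y, k=2)
```

For realistic vocabulary sizes the exhaustive sort would normally be replaced by a vectorized or approximate nearest-neighbour search.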
S6, translating the knowledge triples in the Chinese tourism knowledge base into Tibetan based on the tourism-domain Chinese-Tibetan dictionary, thereby constructing the Tibetan tourism-domain knowledge base.
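The dictionary-based translation of triples can be sketched as a lookup over each element. The patent does not say how a single translation is chosen from the k candidates or how relation labels are rendered, so this sketch assumes the first candidate is used and that relation labels come from a separate hand-made table; the `bo:` strings are placeholders, not real Tibetan, and `translate_triple` is a hypothetical name.

```python
def translate_triple(triple, dictionary, relation_names):
    """Translate <head, tail, relation> using the first dictionary candidate.
    Returns None when either entity has no dictionary entry."""
    head, tail, relation = triple
    if not dictionary.get(head) or not dictionary.get(tail):
        return None
    return (
        dictionary[head][0],
        dictionary[tail][0],
        relation_names.get(relation, relation),
    )

# Placeholder dictionary: Chinese-side key -> ranked candidate list.
dictionary = {
    "North Cave Temple": ["bo:north_cave_temple", "bo:temple"],
    "Gansu Province": ["bo:gansu"],
}
relation_names = {"location": "bo:location"}

source_triples = [
    ("North Cave Temple", "Gansu Province", "location"),
    ("Unknown Place", "Gansu Province", "location"),
]
translated = [
    t for t in (translate_triple(s, dictionary, relation_names)
                for s in source_triples)
    if t is not None
]
```

Triples with an untranslatable entity are skipped here; a real system might instead queue them for manual review.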
An embodiment of the present invention also provides a system for constructing a low-resource language tourism domain knowledge base. The system comprises a computer, and the computer comprises:
at least one storage unit;
at least one processing unit;
wherein at least one instruction is stored in the at least one storage unit, and the at least one instruction is loaded and executed by the at least one processing unit to perform the following steps:
S1, constructing a Chinese tourism domain knowledge base comprising a plurality of knowledge triples;
S2, obtaining a Chinese corpus and a low-resource language corpus, and preprocessing the Chinese corpus and the low-resource language corpus;
S3, obtaining the corresponding Chinese word vectors X from the preprocessed Chinese corpus, and obtaining the corresponding low-resource language word vectors Y from the preprocessed low-resource language corpus;
S4, obtaining, based on the MUSE model, a linear mapping matrix T from the Chinese word vectors X to the low-resource language word vectors Y; then multiplying the linear mapping matrix T by the Chinese word vector matrix X to obtain the vectors U that correspond to the Chinese word vectors X after mapping to the low-resource language word vectors Y;
S5, computing, among the low-resource language word vectors Y, the k low-resource language words nearest in cosine distance to the word vector representation of a Chinese word in the vectors U, taking them as the k translation candidates for that Chinese word, and thereby constructing the tourism domain Chinese-low-resource language dictionary;
S6, translating the plurality of knowledge triples in the Chinese tourism knowledge base into the low-resource language based on the tourism domain Chinese-low-resource language dictionary, thereby constructing the low-resource language tourism domain knowledge base.
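The mapping in step S4 can be illustrated with the orthogonal Procrustes solution, which is the closed-form step MUSE applies once some word pairs are available. The sketch below is only that step, under loud assumptions: it presumes a small seed set of paired vectors already exists (how the source obtains such pairs is not shown here), it stores one vector per row, so the patent's U = TX becomes `X @ T.T`, and the name `procrustes_mapping` and the synthetic data are hypothetical.

```python
import numpy as np

def procrustes_mapping(X_seed, Y_seed):
    """Orthogonal T minimising ||X_seed @ T.T - Y_seed||_F for paired rows."""
    # SVD of the cross-covariance between target and source seed vectors.
    A, _, Bt = np.linalg.svd(Y_seed.T @ X_seed)
    return A @ Bt

# Synthetic check: hide a known orthogonal map Q and try to recover it.
rng = np.random.default_rng(0)
d = 4
X = rng.normal(size=(50, d))                   # stand-in for Chinese word vectors
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # hidden orthogonal map
Y_seed = X @ Q.T                               # stand-in for paired target vectors

T = procrustes_mapping(X, Y_seed)
U = X @ T.T                                    # mapped vectors, one per row
print(np.allclose(U, Y_seed))
```

Constraining T to be orthogonal preserves dot products and vector norms, so cosine comparisons made after the mapping remain meaningful.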
In conclusion compared with prior art, have it is following the utility model has the advantages that
Chinese tour field knowledge base and building building of the embodiment of the present invention by building comprising multiple triple knowledge At the tour field Chinese-low-resource text dictionary;Chinese is traveled knowledge base by the tour field Chinese-low-resource text dictionary In multiple triple knowledge be translated as low-resource text, low-resource text tour field knowledge base is constructed with this.The present invention borrows It helps resourceful Chinese to travel corpus building triple knowledge, obtains comprehensive Chinese tour field knowledge base, then by the Chinese Literary tour field knowledge base moves to low-resource text tour field knowledge base, solves because low-resource text is traveled in network The technical issues of field corpus is deficient, is not easy to directly acquire low-resource text comprehensive sight spot knowledge content, realizes abundant language Say that knowledge migration to the target in low-resource language field, is advantageously implemented the intelligence of the other informations such as low-resource text travel information Energyization service.
It should be noted that, from the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the essence of the above technical solution, or the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform the methods described in the embodiments or in certain parts of the embodiments.
Herein, relational terms such as first and second are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise" or any other variant are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes the element.
The above embodiments merely illustrate the technical solutions of the present invention and do not limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for constructing a low-resource language tourism domain knowledge base, characterized in that the method is executed by a computer and comprises the following steps:
S1, constructing a Chinese tourism domain knowledge base comprising a plurality of knowledge triples;
S2, obtaining a Chinese corpus and a low-resource language corpus, and preprocessing the Chinese corpus and the low-resource language corpus;
S3, obtaining the corresponding Chinese word vectors X from the preprocessed Chinese corpus, and obtaining the corresponding low-resource language word vectors Y from the preprocessed low-resource language corpus;
S4, obtaining, based on the MUSE model, a linear mapping matrix T from the Chinese word vectors X to the low-resource language word vectors Y; then multiplying the linear mapping matrix T by the Chinese word vector matrix X to obtain the vectors U that correspond to the Chinese word vectors X after mapping to the low-resource language word vectors Y;
S5, computing, among the low-resource language word vectors Y, the k low-resource language words nearest in cosine distance to the word vector representation of a Chinese word in the vectors U, taking them as the k translation candidates for that Chinese word, and thereby constructing a tourism domain Chinese-low-resource language dictionary;
S6, translating the plurality of knowledge triples in the Chinese tourism knowledge base into the low-resource language based on the tourism domain Chinese-low-resource language dictionary, thereby constructing the low-resource language tourism domain knowledge base.
2. The method for constructing a low-resource language tourism domain knowledge base according to claim 1, characterized in that S1 specifically comprises:
S101, obtaining a Chinese operative function text corpus;
S102, training on the text corpus with a Word2Vec model to obtain a word vector model and a part-of-speech vector model;
S103, obtaining, from the sentences in the text corpus, the position vector of each word in a sentence;
S104, obtaining word vectors from the word vector model and part-of-speech vectors from the part-of-speech vector model, adding the word vector and part-of-speech vector matrices to obtain word vectors fused with part-of-speech vectors, and then incorporating the position vectors to obtain multi-feature fused word vectors;
S105, inputting the multi-feature fused word vectors into a relation extraction model to obtain a probability distribution over entity relations;
S106, determining the entity relation between two entities according to the probability distribution over entity relations, constructing knowledge triples based on the entity relations, and storing the entity knowledge triples in structured form in a database, thereby constructing the Chinese tourism domain knowledge base.
3. The method for constructing a low-resource language tourism domain knowledge base according to claim 2, characterized in that step S101 is specifically:
obtaining the operative function by web crawling and preprocessing it, the preprocessed operative function forming the text corpus, wherein the preprocessing comprises sentence splitting, word segmentation and part-of-speech tagging.
4. The method for constructing a low-resource language tourism domain knowledge base according to claim 3, characterized in that step S103 is specifically:
for a sentence word sequence S = [s1, s2, ..., sl] of length l, the relative distances of each word si to the first entity e1 and the second entity e2 are i-i1 and i-i2, where the first entity e1 and the second entity e2 are the target entities, i is the index of the current word in the sentence, i1 and i2 are the indices of the first entity e1 and the second entity e2 respectively, and a negative value indicates that the current word precedes the entity word; a position vector matrix of size 2l × d is then initialized using Word2vec, where d is the dimension of a position vector; the position vector of each word in the sentence is expressed as pvi = [pvi1, pvi2], where pvi1 and pvi2 are the vector representations of the relative distances from the i-th word in the sentence to entity e1 and entity e2 respectively.
5. The method for constructing a low-resource language tourism domain knowledge base according to claim 4, characterized in that step S104 is specifically:
from the word vectors N = {n1, n2, ..., nl} and the part-of-speech vectors V = {v1, v2, ..., vl} obtained by training the word vector model and the part-of-speech vector model, N and V are added as matrices, and the word vectors fused with part-of-speech vectors are expressed as W = αN + (1-α)V; with α = 0.5, the result after fusing the part-of-speech vectors is W2 = 0.5(N+V); the position vectors are then incorporated to obtain the multi-feature fused word vectors W3 = [W2, PV], where PV = {pv1, pv2, ..., pvl}.
6. The method for constructing a low-resource language tourism domain knowledge base according to claim 5, characterized in that step S105 is specifically:
the multi-feature fused word vectors W3 are used as the input of the relation extraction model and processed with a bidirectional LSTM (BLSTM) to obtain the text semantic information of the word sequence in the sentence in both directions, front to back and back to front; wherein the BLSTM output for the i-th word is calculated using the formula:
the output of the BLSTM layer is connected to a softmax classifier to obtain the probability distribution over entity relations;
wherein the entity relations comprise nine types: location relation, opening relation, creation relation, belonging relation, adjacency relation, correlation relation, inclusion relation, equivalence relation and attribute relation.
7. The method for constructing a low-resource language tourism domain knowledge base according to claim 6, characterized in that step S106 is specifically:
determining the entity relation between two entities according to the probability distribution, and constructing a knowledge triple <first entity e1, second entity e2, entity relation> according to the entity relation between the two entities; a plurality of knowledge triples are obtained by processing the text corpus, and the Chinese tourism domain knowledge base is constructed from them.
8. The method for constructing a low-resource language tourism domain knowledge base according to claim 1, characterized in that step S2 is specifically:
obtaining the Chinese corpus and the low-resource language corpus by web crawling, removing irrelevant information and retaining only the article body text, and then performing word segmentation and stop-word removal.
9. The method for constructing a low-resource language tourism domain knowledge base according to claim 1, characterized in that step S3 is specifically:
training a fastText word vector model on the preprocessed Chinese corpus to obtain the corresponding Chinese word vectors X, and training a fastText word vector model on the preprocessed low-resource language corpus to obtain the corresponding low-resource language word vectors Y.
10. A system for constructing a low-resource language tourism domain knowledge base, characterized in that the system comprises a computer, and the computer comprises:
at least one storage unit;
at least one processing unit;
wherein at least one instruction is stored in the at least one storage unit, and the at least one instruction is loaded and executed by the at least one processing unit to perform the following steps:
S1, constructing a Chinese tourism domain knowledge base comprising a plurality of knowledge triples;
S2, obtaining a Chinese corpus and a low-resource language corpus, and preprocessing the Chinese corpus and the low-resource language corpus;
S3, obtaining the corresponding Chinese word vectors X from the preprocessed Chinese corpus, and obtaining the corresponding low-resource language word vectors Y from the preprocessed low-resource language corpus;
S4, obtaining, based on the MUSE model, a linear mapping matrix T from the Chinese word vectors X to the low-resource language word vectors Y; then multiplying the linear mapping matrix T by the Chinese word vector matrix X to obtain the vectors U that correspond to the Chinese word vectors X after mapping to the low-resource language word vectors Y;
S5, computing, among the low-resource language word vectors Y, the k low-resource language words nearest in cosine distance to the word vector representation of a Chinese word in the vectors U, taking them as the k translation candidates for that Chinese word, and thereby constructing a tourism domain Chinese-low-resource language dictionary;
S6, translating the plurality of knowledge triples in the Chinese tourism knowledge base into the low-resource language based on the tourism domain Chinese-low-resource language dictionary, thereby constructing the low-resource language tourism domain knowledge base.
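The relative-distance and fusion computations recited in claims 4 and 5 can be illustrated as follows. This is a rough sketch under assumptions, not the claimed implementation: the position table is randomly initialised as a stand-in for the Word2vec initialisation named in claim 4, indices are 0-based (the relative distances are unaffected), distances are shifted by l-1 so they can index the 2l-row table, and the names `relative_distances`, `fuse_features` and `pos_table` are hypothetical.

```python
import numpy as np

def relative_distances(length, i1, i2):
    """Relative distance (i - i1, i - i2) of every word index i to both entities."""
    return [(i - i1, i - i2) for i in range(length)]

def fuse_features(N, V, i1, i2, pos_table, alpha=0.5):
    """Build W3 = [W2, PV] with W2 = alpha * N + (1 - alpha) * V."""
    length = N.shape[0]
    W2 = alpha * N + (1.0 - alpha) * V
    shift = length - 1  # maps distances in [-(l-1), l-1] to rows 0 .. 2l-2
    PV = np.stack([
        np.concatenate([pos_table[d1 + shift], pos_table[d2 + shift]])
        for d1, d2 in relative_distances(length, i1, i2)
    ])
    # Concatenate word/part-of-speech features with position features per word.
    return np.concatenate([W2, PV], axis=1)

rng = np.random.default_rng(0)
length, word_dim, pos_dim = 4, 3, 2
N = np.ones((length, word_dim))          # stand-in word vectors
V = 3.0 * np.ones((length, word_dim))    # stand-in part-of-speech vectors
pos_table = rng.normal(size=(2 * length, pos_dim))

distances = relative_distances(length, 1, 3)
W3 = fuse_features(N, V, i1=1, i2=3, pos_table=pos_table)
print(distances)
print(W3.shape)
```

Each row of `W3` then holds the averaged word and part-of-speech features followed by the two position vectors of that word, which is the form fed to the relation extraction model in claim 6.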
CN201910650742.1A 2019-07-18 2019-07-18 Method and system for constructing low-resource word tourism field knowledge base Active CN110532544B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910650742.1A CN110532544B (en) 2019-07-18 2019-07-18 Method and system for constructing low-resource word tourism field knowledge base

Publications (2)

Publication Number Publication Date
CN110532544A (en) 2019-12-03
CN110532544B (en) 2023-03-24

Family

ID=68660345

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910650742.1A Active CN110532544B (en) 2019-07-18 2019-07-18 Method and system for constructing low-resource word tourism field knowledge base

Country Status (1)

Country Link
CN (1) CN110532544B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015079591A1 (en) * 2013-11-27 2015-06-04 Nec Corporation Crosslingual text classification method using expected frequencies
CN104133848A (en) * 2014-07-01 2014-11-05 中央民族大学 Tibetan language entity knowledge information extraction method
CN106777274A (en) * 2016-06-16 2017-05-31 北京理工大学 A kind of Chinese tour field knowledge mapping construction method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Liu Lufang et al.: "Research on bilingual dictionary extraction based on word vectors and comparable corpora", Computer Engineering & Science *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112132660A (en) * 2020-09-25 2020-12-25 尚娱软件(深圳)有限公司 Commodity recommendation method, system, device and storage medium
CN112132660B (en) * 2020-09-25 2023-12-26 尚娱软件(深圳)有限公司 Commodity recommendation method, system, equipment and storage medium

Also Published As

Publication number Publication date
CN110532544B (en) 2023-03-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant