CN110532544A - Method and system for constructing a tourism-domain knowledge base for a low-resource language
- Publication number: CN110532544A
- Application number: CN201910650742.1A
- Authority: CN (China)
- Prior art keywords: low, chinese, vector, resource text, term vector
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/374—Thesaurus
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Machine Translation (AREA)
Abstract
The present invention provides a method and system for constructing a tourism-domain knowledge base for a low-resource language, and relates to the computer field. The invention builds a Chinese tourism-domain knowledge base containing multiple knowledge triples and constructs a tourism-domain Chinese-to-low-resource-language dictionary. Using this dictionary, the knowledge triples in the Chinese tourism knowledge base are translated into the low-resource language, thereby constructing a low-resource-language tourism-domain knowledge base. Because the triples are built from a resource-rich Chinese tourism corpus, a comprehensive Chinese tourism-domain knowledge base is obtained first and then migrated to the low-resource language. This solves the technical problem that tourism-domain corpora in the low-resource language are scarce on the web, which makes comprehensive scenic-spot knowledge hard to obtain directly in that language. The invention thus achieves the goal of transferring knowledge from a resource-rich language to a low-resource language, and helps enable intelligent services for tourism and other information in the low-resource language.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a method and system for constructing a tourism-domain knowledge base for a low-resource language.
Background
Tourism has become one of people's most important leisure activities. With the rapid development of the internet, more and more Chinese tourism websites have appeared, providing tourists with abundant travel information. Chinese tourism websites carry a large amount of information: scenic-spot introductions tend to be long, and the information they contain varies from site to site. By contrast, intelligent travel-information services have not yet been realized for low-resource languages. How to use a resource-rich language to help build a knowledge base for a low-resource language has become one of the important research topics in natural language processing.
However, because tourism-domain corpora in low-resource languages are currently scarce on the web, comprehensive scenic-spot knowledge is hard to obtain directly in such languages, which makes building a knowledge base difficult.
Summary of the invention
(1) Technical problem to be solved
In view of the deficiencies of the prior art, the present invention provides a method for constructing a tourism-domain knowledge base for a low-resource language. It solves the technical problem that, because tourism-domain corpora in low-resource languages are scarce on the web, comprehensive scenic-spot knowledge is hard to obtain directly in such languages and a knowledge base is therefore difficult to build.
(2) Technical solution
To achieve the above object, the present invention is realized through the following technical solution:
A method for constructing a tourism-domain knowledge base for a low-resource language, the method being executed by a computer and comprising the following steps:
S1: Construct a Chinese tourism-domain knowledge base containing multiple knowledge triples;
S2: Obtain a Chinese corpus and a low-resource-language corpus, and preprocess both corpora;
S3: Obtain the Chinese word vectors X from the preprocessed Chinese corpus, and the low-resource-language word vectors Y from the preprocessed low-resource-language corpus;
S4: Use the MUSE model to obtain a linear mapping matrix T from the Chinese word vectors X to the low-resource-language word vectors Y; then multiply T by the Chinese word-vector matrix X to obtain the vectors U that result from mapping X into the space of Y;
S5: For each Chinese word, find the k low-resource-language words in Y whose vectors are nearest in cosine distance to that word's vector in U, take them as the k translation candidates for the Chinese word, and thereby construct a tourism-domain Chinese-to-low-resource-language dictionary;
S6: Use the tourism-domain Chinese-to-low-resource-language dictionary to translate the knowledge triples in the Chinese tourism knowledge base into the low-resource language, thereby constructing the low-resource-language tourism-domain knowledge base.
Preferably, step S1 specifically comprises:
S101: Obtain a text corpus of Chinese tourism texts;
S102: Train Word2Vec models on the text corpus to obtain a word-vector model and a part-of-speech-vector model;
S103: Obtain a position vector for each word in a sentence, based on the sentences in the text corpus;
S104: Obtain word vectors from the word-vector model and part-of-speech vectors from the part-of-speech-vector model; add the word-vector matrix and the part-of-speech-vector matrix to obtain word vectors fused with part-of-speech vectors; then incorporate the position vectors to obtain multi-feature word vectors;
S105: Input the multi-feature word vectors into a relation extraction model to obtain a probability distribution over entity relations;
S106: Determine the entity relation between two entities according to the probability distribution, build knowledge triples from the entity relations, and store the triples in structured form in a database, thereby constructing the Chinese tourism-domain knowledge base.
Preferably, step S101 specifically comprises:
Tourism texts are obtained with web-crawler technology and preprocessed, and the preprocessed tourism texts form the text corpus. The preprocessing includes sentence splitting, word segmentation and part-of-speech tagging.
Preferably, step S103 specifically comprises:
For each word si in a sentence word sequence S = [s1, s2, ..., sl] of length l, the relative distances to the first entity e1 and the second entity e2 are i-i1 and i-i2 respectively. The first entity e1 and the second entity e2 are the target entities; i is the index of the current word in the sentence, i1 and i2 are the indices of e1 and e2 respectively, and a negative value indicates that the current word precedes the entity word. Word2vec is then used to initialize a position-vector matrix of size 2l × d, where d is the dimension of a position vector. The position vector of each word in the sentence is expressed as pvi = [pvi1, pvi2], where pvi1 and pvi2 are the vector representations of the relative distances from the i-th word to entity e1 and to entity e2 respectively.
Preferably, step S104 specifically comprises:
Training the word-vector model and the part-of-speech-vector model yields the word vectors N = {n1, n2, ..., nl} and the part-of-speech vectors V = {v1, v2, ..., vl}. N and V are added as matrices, giving the fused representation W = αN + (1-α)V. Setting α = 0.5 gives the word vectors fused with part-of-speech vectors, W2 = 0.5(N+V). The position vectors are then incorporated to obtain the multi-feature word vectors W3 = [W2, PV], where PV = {pv1, pv2, ..., pvl}.
Preferably, step S105 specifically comprises:
The multi-feature word vectors W3 are used as the input of the relation extraction model and processed by a bidirectional LSTM (BLSTM), which captures the semantic information of the word sequence in the sentence in both directions, front to back and back to front. The BLSTM output for the i-th word is computed by a formula (not reproduced in this text).
The output of the BLSTM layer is connected to a softmax classifier to obtain the probability distribution over entity relations.
There are 9 kinds of entity relation: location, establishment, creation, affiliation, adjacency, correlation, inclusion, equivalence and attribute.
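As a minimal sketch of the final classification step, the snippet below applies a softmax to a vector of nine scores and picks the most probable relation. The scores are made-up numbers standing in for the output of the layer before the classifier, and the order of the labels is an assumption; the text only names the nine relation types.

```python
import math

# The nine relation types named in the text; their order here is an assumption.
RELATIONS = ["location", "establishment", "creation", "affiliation", "adjacency",
             "correlation", "inclusion", "equivalence", "attribute"]

def softmax(scores):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores, standing in for the output of the layer before softmax.
scores = [2.0, 0.1, -1.0, 0.3, 0.0, -0.5, 0.2, -2.0, 0.4]
probs = softmax(scores)
predicted = RELATIONS[probs.index(max(probs))]
```

Subtracting the maximum score before exponentiating does not change the result but avoids overflow for large scores.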
Preferably, step S106 specifically comprises:
The entity relation between two entities is determined from the probability distribution, and a knowledge triple <first entity e1, second entity e2, entity relation> is built from that relation. Processing the text corpus in this way yields multiple knowledge triples, from which the Chinese tourism-domain knowledge base is constructed.
Preferably, step S2 specifically comprises:
The Chinese corpus and the low-resource-language corpus are obtained with web-crawler technology. Irrelevant content is removed so that only the article text remains, and the text is then segmented into words and stop words are removed.
Preferably, step S3 specifically comprises:
A fastText word-vector model is trained on the preprocessed Chinese corpus to obtain the Chinese word vectors X, and a fastText word-vector model is trained on the preprocessed low-resource-language corpus to obtain the low-resource-language word vectors Y.
The present invention also provides a system for constructing a tourism-domain knowledge base for a low-resource language. The system comprises a computer, and the computer comprises:
At least one storage unit;
At least one processing unit;
wherein at least one instruction is stored in the at least one storage unit, and the at least one instruction is loaded and executed by the at least one processing unit to perform the following steps:
S1: Construct a Chinese tourism-domain knowledge base containing multiple knowledge triples;
S2: Obtain a Chinese corpus and a low-resource-language corpus, and preprocess both corpora;
S3: Obtain the Chinese word vectors X from the preprocessed Chinese corpus, and the low-resource-language word vectors Y from the preprocessed low-resource-language corpus;
S4: Use the MUSE model to obtain a linear mapping matrix T from the Chinese word vectors X to the low-resource-language word vectors Y; then multiply T by the Chinese word-vector matrix X to obtain the vectors U that result from mapping X into the space of Y;
S5: For each Chinese word, find the k low-resource-language words in Y whose vectors are nearest in cosine distance to that word's vector in U, take them as the k translation candidates for the Chinese word, and thereby construct a tourism-domain Chinese-to-low-resource-language dictionary;
S6: Use the tourism-domain Chinese-to-low-resource-language dictionary to translate the knowledge triples in the Chinese tourism knowledge base into the low-resource language, thereby constructing the low-resource-language tourism-domain knowledge base.
(3) Beneficial effects
The present invention provides a method and system for constructing a tourism-domain knowledge base for a low-resource language. Compared with the prior art, it has the following beneficial effects:
The invention builds a Chinese tourism-domain knowledge base containing multiple knowledge triples and constructs a tourism-domain Chinese-to-low-resource-language dictionary. Using this dictionary, the knowledge triples in the Chinese tourism knowledge base are translated into the low-resource language, thereby constructing a low-resource-language tourism-domain knowledge base. Because the triples are built from a resource-rich Chinese tourism corpus, a comprehensive Chinese tourism-domain knowledge base is obtained first and then migrated to the low-resource language. This solves the technical problem that tourism-domain corpora in the low-resource language are scarce on the web, which makes comprehensive scenic-spot knowledge hard to obtain directly in that language. The invention thus achieves the goal of transferring knowledge from a resource-rich language to a low-resource language, and helps enable intelligent services for tourism and other information in the low-resource language.
Brief description of the drawings
To explain the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a block diagram of a method for constructing a tourism-domain knowledge base for a low-resource language according to an embodiment of the invention;
Fig. 2 is a schematic diagram of how the distance from each word in a sentence to the target entities is computed in an embodiment of the invention;
Fig. 3 is a block diagram of the relation extraction model in an embodiment of the invention;
Fig. 4 is a schematic diagram of the word-vector mapping in an embodiment of the invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the embodiments of the invention clearer, the technical solutions in the embodiments are described clearly and completely below. The described embodiments are evidently some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the invention.
By providing a method and system for constructing a tourism-domain knowledge base for a low-resource language, the embodiments of the present application solve the technical problem that, because tourism-domain corpora in low-resource languages are scarce on the web, comprehensive scenic-spot knowledge is hard to obtain directly in such languages and a knowledge base is difficult to build. They achieve the goal of transferring knowledge from a resource-rich language to a low-resource language.
To solve the above technical problem, the general idea of the technical solution in the embodiments of the present application is as follows:
An embodiment of the invention builds a Chinese tourism-domain knowledge base containing multiple knowledge triples and constructs a tourism-domain Chinese-to-low-resource-language dictionary. Using this dictionary, the knowledge triples in the Chinese tourism knowledge base are translated into the low-resource language, thereby constructing a low-resource-language tourism-domain knowledge base. Because the triples are built from a resource-rich Chinese tourism corpus, a comprehensive Chinese tourism-domain knowledge base is obtained first and then migrated to the low-resource language. This solves the technical problem that tourism-domain corpora in the low-resource language are scarce on the web, which makes comprehensive scenic-spot knowledge hard to obtain directly in that language. The invention thus achieves the goal of transferring knowledge from a resource-rich language to a low-resource language, and helps enable intelligent services for tourism and other information in the low-resource language.
For a better understanding of the above technical solution, it is described in detail below with reference to the drawings and specific embodiments.
An embodiment of the invention provides a method for constructing a tourism-domain knowledge base for a low-resource language. As shown in Fig. 1, the method is executed by a computer and comprises the following steps:
S1: Construct a Chinese tourism-domain knowledge base containing multiple knowledge triples;
S2: Obtain a Chinese corpus and a low-resource-language corpus, and preprocess both corpora;
S3: Obtain the Chinese word vectors X from the preprocessed Chinese corpus, and the low-resource-language word vectors Y from the preprocessed low-resource-language corpus;
S4: Use the MUSE model to obtain a linear mapping matrix T from the Chinese word vectors X to the low-resource-language word vectors Y; then multiply T by the Chinese word-vector matrix X to obtain the vectors U that result from mapping X into the space of Y;
S5: For each Chinese word, find the k low-resource-language words in Y whose vectors are nearest in cosine distance to that word's vector in U, take them as the k translation candidates for the Chinese word, and thereby construct a tourism-domain Chinese-to-low-resource-language dictionary;
S6: Use the tourism-domain Chinese-to-low-resource-language dictionary to translate the knowledge triples in the Chinese tourism knowledge base into the low-resource language, thereby constructing the low-resource-language tourism-domain knowledge base.
Each step is described in detail below.
Note that Tibetan is used here as the example low-resource language.
S1: Construct a Chinese tourism-domain knowledge base containing multiple knowledge triples. This specifically comprises the following steps:
S101: Obtain a text corpus of Chinese tourism texts. In a specific implementation, tourism texts are obtained with web-crawler technology and preprocessed, and the preprocessed tourism texts form the text corpus. The preprocessing includes sentence splitting, word segmentation and part-of-speech tagging. Part of speech here means categories such as noun, verb and adverb. In this embodiment these categories can be subdivided further; for example, nouns are divided into personal names, place names, group names, organization names, transliterated personal names, and so on.
For example: Lazi Temple is located within Gyangzi County, Shigatse Prefecture.
Its word segmentation result is: Lazi Temple / is located in / Shigatse Prefecture / Gyangzi County / within / .
Its segmentation and part-of-speech tagging result is: Lazi Temple/na is located in/v Shigatse Prefecture/ns Gyangzi County/ns within/s ./wj
where ns denotes a place name, na a scenic-spot name, v a verb, s a locative word and wj a punctuation mark;
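The tagged output above can be read back into (word, tag) pairs with a few lines of code. This is only an illustrative sketch: the input string uses English glosses joined with underscores in place of the Chinese words, and `parse_tagged` is a hypothetical helper name, not something named in the text.

```python
def parse_tagged(line):
    """Split 'word/tag word/tag ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # rpartition splits on the last '/', so a word containing '/' stays intact
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs

# English glosses stand in for the Chinese words of the example sentence.
tagged = ("Lazi_Temple/na is_located_in/v Shigatse_Prefecture/ns "
          "Gyangzi_County/ns within/s ./wj")
pairs = parse_tagged(tagged)

# Collect the words tagged as place names (ns).
place_names = [word for word, tag in pairs if tag == "ns"]
```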
S102: Train Word2Vec models on the text corpus to obtain a word-vector model and a part-of-speech-vector model;
S103: Obtain the position vector of each word in a sentence from the distance between that word and the target entities. The target entities are the first entity e1 and the second entity e2.
In a specific implementation, computing the distance from each word in the sentence to the target entities makes it possible to distinguish different relation instances within the same sentence. Concretely, for each word si in a sentence word sequence S = [s1, s2, ..., sl] of length l, the relative distances to the first entity e1 and the second entity e2 are i-i1 and i-i2, where i is the index of the current word in the sentence, i1 and i2 are the indices of e1 and e2 respectively, and a negative value indicates that the current word precedes the entity word. As shown in Fig. 2, the sentence "Qiaga Qude Temple was built at the end of the 16th century and belongs to the Gelug school." yields a word sequence of length 8 after word segmentation. In it, the word "built in" has a relative distance of 1 to the first entity e1, "Qiaga Qude Temple", and a relative distance of -5 to the second entity e2, "Gelug school". Word2vec is then used to initialize a position-vector matrix of size 2l × d, where d is the dimension of a position vector. Finally, the position vector of each word in the sentence is expressed as pvi = [pvi1, pvi2], where pvi1 and pvi2 are the vector representations of the relative distances from the i-th word to entity e1 and to entity e2 respectively.
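The relative-distance computation can be sketched as follows. The 8-token segmentation is an assumption chosen to match the numbers in the example (length 8, distances 1 and -5 for "built in"), with English glosses in place of the Chinese tokens. The `position_index` helper is likewise hypothetical: it shows one way a signed distance could be shifted to a non-negative row index of a position-vector matrix.

```python
def relative_positions(tokens, i1, i2):
    """Return (i - i1, i - i2) for every token index i."""
    return [(i - i1, i - i2) for i in range(len(tokens))]

# Assumed segmentation of the example sentence into 8 tokens (English glosses).
tokens = ["Qiaga Qude Temple", "built in", "16", "end of century", ",",
          "belongs to", "Gelug school", "."]
i1 = tokens.index("Qiaga Qude Temple")  # index of the first entity e1
i2 = tokens.index("Gelug school")       # index of the second entity e2
positions = relative_positions(tokens, i1, i2)

def position_index(rel, length):
    """Shift a relative distance in [-(length-1), length-1] to a row index >= 0."""
    return rel + length - 1
```

With these tokens, `positions[1]` is the pair of distances for "built in", and every shifted index stays below 2l, so it fits a matrix with 2l rows.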
S104: Obtain word vectors from the word-vector model and part-of-speech vectors from the part-of-speech-vector model; add the word-vector matrix and the part-of-speech-vector matrix to obtain word vectors fused with part-of-speech vectors; then incorporate the position vectors to obtain multi-feature word vectors.
In a specific implementation, training the word-vector model and the part-of-speech-vector model yields the word vectors N = {n1, n2, ..., nl} and the part-of-speech vectors V = {v1, v2, ..., vl}. N and V are added as matrices, giving the fused representation W = αN + (1-α)V, where 0 ≤ α ≤ 1. When α = 1, W1 = N, which is the word vector without part-of-speech information. Setting α = 0.5 gives the word vectors fused with part-of-speech vectors, W2 = 0.5(N+V). The position vectors are then incorporated to obtain the multi-feature word vectors W3 = [W2, PV], where PV = {pv1, pv2, ..., pvl}.
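A small sketch of the fusion, using plain Python lists and made-up 2-dimensional vectors. It assumes that the word vectors and part-of-speech vectors share the same dimension (needed for the weighted sum) and that [W2, PV] means per-word concatenation; the function name `fuse` is hypothetical.

```python
def fuse(N, V, PV, alpha=0.5):
    """Compute W = alpha*N + (1-alpha)*V per word, then append the position vector."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    fused = []
    for n, v, pv in zip(N, V, PV):
        w = [alpha * a + (1.0 - alpha) * b for a, b in zip(n, v)]
        fused.append(w + list(pv))  # concatenation [W2, PV] for this word
    return fused

# Toy vectors for a 2-word sentence (made-up numbers).
N = [[1.0, 3.0], [0.0, 2.0]]      # word vectors
V = [[3.0, 1.0], [4.0, 0.0]]      # part-of-speech vectors
PV = [[0.5, -0.5], [0.25, 0.75]]  # position vectors
W3 = fuse(N, V, PV)               # alpha = 0.5, as in the text
W1 = fuse(N, V, PV, alpha=1.0)    # alpha = 1 keeps only the word vectors
```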
S105: Input the multi-feature word vectors into the relation extraction model to obtain a probability distribution over entity relations.
In a specific implementation, the relation extraction model comprises a BLSTM layer, a fully connected layer and a softmax classifier, as shown in Fig. 3. The multi-feature word vectors are used as the model input and processed by a bidirectional LSTM, which captures the semantic information of the word sequence in the sentence in both directions, front to back and back to front. The BLSTM output for the i-th word is computed by a formula (not reproduced in this text).
The output of the BLSTM layer is connected to the softmax classifier to obtain the probability distribution over entity relations.
There are 9 kinds of entity relation: location, establishment, creation, affiliation, adjacency, correlation, inclusion, equivalence and attribute.
S106: Determine the entity relation between two entities according to the probability distribution, build knowledge triples from the entity relations, and store the triples in structured form in a database, thereby constructing the Chinese tourism-domain knowledge base.
Specifically, the entity relation between two entities is determined from the probability distribution, and a knowledge triple <first entity e1, second entity e2, entity relation> is built from that relation. For example, from the sentence "The North Grotto Temple is a grotto temple in Gansu Province", the knowledge triple <North Grotto Temple, Gansu Province, location relation> can be built. Processing the text corpus in this way yields multiple knowledge triples, from which the Chinese tourism-domain knowledge base is constructed.
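The step from probability distribution to stored triple might look like the sketch below. The probabilities are invented for illustration, the table layout is an assumption, and SQLite is used only because it ships with Python; the text says no more than that triples are stored in structured form in a database.

```python
import sqlite3

def build_triple(e1, e2, probs):
    """Pick the most probable relation and form the triple (e1, e2, relation)."""
    relation = max(probs, key=probs.get)
    return (e1, e2, relation)

# Hypothetical classifier output for the example sentence.
probs = {"location": 0.81, "inclusion": 0.11, "attribute": 0.08}
triple = build_triple("North Grotto Temple", "Gansu Province", probs)

conn = sqlite3.connect(":memory:")  # in-memory database, enough for a sketch
conn.execute("CREATE TABLE triples (entity1 TEXT, entity2 TEXT, relation TEXT)")
conn.execute("INSERT INTO triples VALUES (?, ?, ?)", triple)
conn.commit()
rows = conn.execute("SELECT entity1, entity2, relation FROM triples").fetchall()
```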
S2: Obtain a Chinese corpus and a Tibetan corpus, and preprocess both corpora.
In a specific implementation, the Chinese corpus and the Tibetan corpus are obtained with web-crawler technology. Irrelevant content is removed so that only the article text remains, followed by preprocessing such as word segmentation and stop-word removal.
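The clean-up described in S2 can be sketched with the standard library alone. Everything here is an assumption made for illustration: the page markup, the stop-word list, and especially the whitespace split, which merely stands in for a real Chinese or Tibetan word segmenter.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def preprocess(html, stop_words):
    """Keep only visible text, split it into tokens, and drop stop words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Whitespace splitting is a placeholder for a real word segmenter.
    tokens = " ".join(parser.parts).split()
    return [t for t in tokens if t not in stop_words]

page = "<html><script>var x = 1;</script><p>the temple is in the county</p></html>"
tokens = preprocess(page, stop_words={"the", "is", "in"})
```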
S3: Obtain the Chinese word vectors X from the preprocessed Chinese corpus, and the Tibetan word vectors Y from the preprocessed Tibetan corpus.
In a specific implementation, a fastText word-vector model is trained on the preprocessed Chinese corpus to obtain the Chinese word vectors X, and a fastText word-vector model is trained on the preprocessed Tibetan corpus to obtain the Tibetan word vectors Y.
S4: Use the MUSE model to obtain a linear mapping matrix T from the Chinese word vectors X to the Tibetan word vectors Y; then multiply T by the Chinese word-vector matrix X to obtain the vectors U that result from mapping X into the space of the Tibetan word vectors Y;
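For orientation, the mapping in S4 can be written out as below. The first line restates what the text says. The second line is background from the published MUSE method rather than from this document: there the mapping is constrained to be orthogonal and refined by solving a Procrustes problem over a set of aligned word pairs. The symbols X_D and Y_D (vectors of the aligned pairs) and P, Σ, Q (SVD factors) are notation introduced here, and whether this embodiment uses that refinement is an assumption.

```latex
% Mapping stated in the text: Chinese vectors X are carried into the Tibetan space.
U = T X, \qquad T \in \mathbb{R}^{d \times d}

% Procrustes refinement as in the published MUSE method (assumed here):
T^{\star} = \operatorname*{arg\,min}_{T^{\top} T = I} \lVert T X_D - Y_D \rVert_F
          = P Q^{\top},
\qquad P \Sigma Q^{\top} = \operatorname{SVD}\!\left( Y_D X_D^{\top} \right)
```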
S5: For each Chinese word, find the k low-resource-language words in the low-resource-language word vectors Y whose vectors are nearest in cosine distance to that word's vector in U, take them as the k translation candidates for the Chinese word, and thereby construct a tourism-domain Chinese-to-low-resource-language dictionary.
For example, as shown in Fig. 4, U (U = TX) denotes the Chinese word vectors after mapping into the Tibetan word-vector space, Y denotes the Tibetan word-vector space, x is a Chinese word, and Vx is the word vector corresponding to x. The k Tibetan words in Y nearest to Vx in cosine distance, y1, y2, ..., form the candidate set for translating the Chinese word x into Tibetan. For example, with k = 5, after the Chinese-Tibetan cross-lingual word-vector mapping, the Tibetan words nearest in cosine distance to the Chinese word "Dingri County" are, in order, the words glossed as (Dingri County), (day ancestor), (settled date), (Rikaze) and (middle side); a Tibetan translation of "Dingri County" can then be selected from these candidates.
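The candidate search in S5 can be sketched in pure Python as follows. The 2-dimensional vectors, the matrix T and the word labels `word_a`, `word_b`, `word_c` are toy values invented for the example; in practice T would come from MUSE and the vectors from fastText. Ranking by highest cosine similarity is equivalent to ranking by smallest cosine distance.

```python
import math

def matvec(T, x):
    """Multiply matrix T (list of rows) by vector x."""
    return [sum(t * xi for t, xi in zip(row, x)) for row in T]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k_candidates(x, T, target_vectors, k):
    """Map x with T, then return the k target words most similar to the result."""
    u = matvec(T, x)
    ranked = sorted(target_vectors,
                    key=lambda w: cosine(u, target_vectors[w]),
                    reverse=True)
    return ranked[:k]

# Toy data: T swaps the two coordinates; three hypothetical Tibetan words.
T = [[0.0, 1.0], [1.0, 0.0]]
tibetan = {"word_a": [0.0, 1.0], "word_b": [1.0, 0.0], "word_c": [1.0, 1.0]}
x_dingri = [1.0, 0.1]  # made-up vector for the Chinese word
candidates = top_k_candidates(x_dingri, T, tibetan, k=2)
```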
S6: Use the tourism-domain Chinese-Tibetan dictionary to translate the knowledge triples in the Chinese tourism knowledge base into Tibetan, thereby constructing the Tibetan tourism-domain knowledge base.
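A sketch of the triple translation in S6. Two choices here are assumptions, since the text does not specify them: the first candidate in each dictionary entry is taken as the translation, and triples containing a term missing from the dictionary are set aside rather than partly translated. The `bo_...` strings are placeholders for Tibetan words.

```python
def translate_triples(triples, dictionary):
    """Translate each element of each triple using the first dictionary candidate."""
    translated, skipped = [], []
    for e1, e2, rel in triples:
        if all(term in dictionary for term in (e1, e2, rel)):
            translated.append(tuple(dictionary[t][0] for t in (e1, e2, rel)))
        else:
            skipped.append((e1, e2, rel))  # at least one term has no entry
    return translated, skipped

# Hypothetical dictionary: Chinese term (glossed in English) -> candidate list.
zh_bo = {
    "North Grotto Temple": ["bo_temple"],
    "Gansu Province": ["bo_gansu", "bo_gansu_alt"],
    "location": ["bo_location"],
}
triples = [
    ("North Grotto Temple", "Gansu Province", "location"),
    ("Lazi Temple", "Gyangzi County", "location"),
]
translated, skipped = translate_triples(triples, zh_bo)
```

Keeping the skipped triples separate makes it easy to see how much of the Chinese knowledge base the dictionary actually covers.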
An embodiment of the invention also provides a system for constructing a tourism-domain knowledge base for a low-resource language. The system comprises a computer, and the computer comprises:
At least one storage unit;
At least one processing unit;
wherein at least one instruction is stored in the at least one storage unit, and the at least one instruction is loaded and executed by the at least one processing unit to perform the following steps:
S1: Construct a Chinese tourism-domain knowledge base containing multiple knowledge triples;
S2: Obtain a Chinese corpus and a low-resource-language corpus, and preprocess both corpora;
S3: Obtain the Chinese word vectors X from the preprocessed Chinese corpus, and the low-resource-language word vectors Y from the preprocessed low-resource-language corpus;
S4: Use the MUSE model to obtain a linear mapping matrix T from the Chinese word vectors X to the low-resource-language word vectors Y; then multiply T by the Chinese word-vector matrix X to obtain the vectors U that result from mapping X into the space of Y;
S5: For each Chinese word, find the k low-resource-language words in Y whose vectors are nearest in cosine distance to that word's vector in U, take them as the k translation candidates for the Chinese word, and thereby construct a tourism-domain Chinese-to-low-resource-language dictionary;
S6: Use the tourism-domain Chinese-to-low-resource-language dictionary to translate the knowledge triples in the Chinese tourism knowledge base into the low-resource language, thereby constructing the low-resource-language tourism-domain knowledge base.
In conclusion compared with prior art, have it is following the utility model has the advantages that
Chinese tour field knowledge base and building building of the embodiment of the present invention by building comprising multiple triple knowledge
At the tour field Chinese-low-resource text dictionary;Chinese is traveled knowledge base by the tour field Chinese-low-resource text dictionary
In multiple triple knowledge be translated as low-resource text, low-resource text tour field knowledge base is constructed with this.The present invention borrows
It helps resourceful Chinese to travel corpus building triple knowledge, obtains comprehensive Chinese tour field knowledge base, then by the Chinese
Literary tour field knowledge base moves to low-resource text tour field knowledge base, solves because low-resource text is traveled in network
The technical issues of field corpus is deficient, is not easy to directly acquire low-resource text comprehensive sight spot knowledge content, realizes abundant language
Say that knowledge migration to the target in low-resource language field, is advantageously implemented the intelligence of the other informations such as low-resource text travel information
Energyization service.
It should be noted that, from the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the above technical solution, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.
Herein, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.
The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A method for constructing a low-resource text tourism-domain knowledge base, characterized in that the method is executed by a computer and comprises the following steps:
S1. Construct a Chinese tourism-domain knowledge base containing multiple knowledge triples;
S2. Obtain a Chinese corpus and a low-resource text corpus, and preprocess both corpora;
S3. Obtain the Chinese word vectors X from the preprocessed Chinese corpus, and obtain the low-resource text word vectors Y from the preprocessed low-resource text corpus;
S4. Use the MUSE model to obtain a linear mapping matrix T from the Chinese word vectors X to the low-resource text word vectors Y; then multiply the linear mapping matrix T by the matrix of Chinese word vectors X to obtain the vectors U, which are the Chinese word vectors X mapped into the space of the low-resource text word vectors Y;
S5. For each Chinese word, find in the low-resource text word vectors Y the k low-resource text words whose cosine distance to that word's vector representation in U is smallest, take them as the set of k candidate low-resource text translations of the Chinese word, and build the tourism-domain Chinese-to-low-resource-text dictionary from these candidate sets;
S6. Use the tourism-domain Chinese-to-low-resource-text dictionary to translate the knowledge triples in the Chinese tourism knowledge base into the low-resource text, thereby constructing the low-resource text tourism-domain knowledge base.
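Step S6 can be sketched as a dictionary lookup applied to each element of a triple. The sketch below makes two assumptions that the claim does not state: the first (top-ranked) candidate is taken as the translation, and a term with no dictionary entry is left untranslated. The function name and the sample entries are hypothetical.

```python
def translate_triple(triple, dictionary):
    """Translate every element of an (entity1, entity2, relation) triple.

    dictionary maps a Chinese term to its ranked list of translation candidates.
    Assumption: use the top candidate; keep the original term if none exists.
    """
    def translate(term):
        candidates = dictionary.get(term)
        return candidates[0] if candidates else term

    return tuple(translate(term) for term in triple)


# Hypothetical dictionary entries; the target-language strings are placeholders.
dictionary = {
    "西湖": ["tgt_west_lake", "tgt_lake"],
    "位于": ["tgt_located_in"],
}
chinese_triples = [("西湖", "杭州", "位于")]
translated = [translate_triple(t, dictionary) for t in chinese_triples]
```

Keeping untranslated terms visible, rather than dropping the triple, makes gaps in the dictionary easy to spot afterwards.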
2. The method for constructing a low-resource text tourism-domain knowledge base according to claim 1, characterized in that S1 specifically comprises:
S101. Obtain a Chinese operative function text corpus;
S102. Train a Word2Vec model on the text corpus to obtain a word vector model and a part-of-speech vector model;
S103. Obtain the position vector of each word in a sentence from the sentences in the text corpus;
S104. Obtain word vectors from the word vector model and part-of-speech vectors from the part-of-speech vector model, add the word vector matrix and the part-of-speech vector matrix to obtain word vectors fused with part-of-speech vectors, and then incorporate the position vectors to obtain multi-feature word vectors;
S105. Input the multi-feature word vectors into a relation extraction model to obtain a probability distribution over entity relations;
S106. Determine the entity relation between two entities from the probability distribution, construct knowledge triples based on the entity relations, and store the entity knowledge triples in structured form in a database to build the Chinese tourism-domain knowledge base.
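The structured storage mentioned in step S106 could look like the following sketch, which uses an in-memory SQLite database from the Python standard library. The claim does not name a database or a schema, so the table name, the column names, and the uniqueness constraint are assumptions made for illustration.

```python
import sqlite3


def store_triples(conn, triples):
    """Store (entity1, entity2, relation) triples, skipping exact duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS triples ("
        "entity1 TEXT NOT NULL, entity2 TEXT NOT NULL, relation TEXT NOT NULL, "
        "UNIQUE (entity1, entity2, relation))"
    )
    conn.executemany("INSERT OR IGNORE INTO triples VALUES (?, ?, ?)", triples)
    conn.commit()


conn = sqlite3.connect(":memory:")
store_triples(
    conn,
    [
        ("西湖", "杭州", "positional"),
        ("西湖", "杭州", "positional"),  # duplicate, ignored on insert
        ("灵隐寺", "杭州", "positional"),
    ],
)
stored_count = conn.execute("SELECT COUNT(*) FROM triples").fetchone()[0]
```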
3. The method for constructing a low-resource text tourism-domain knowledge base according to claim 2, characterized in that step S101 specifically is:
Obtain operative function by crawler technology and preprocess it; the preprocessed operative function forms the text corpus, and the preprocessing includes sentence splitting, word segmentation, and part-of-speech tagging.
4. The method for constructing a low-resource text tourism-domain knowledge base according to claim 3, characterized in that step S103 specifically is:
For a sentence word sequence S = [s1, s2, ..., sl] of length l, the relative distances of each word si from the first entity e1 and the second entity e2 are i-i1 and i-i2, where the first entity e1 and the second entity e2 are the target entities, i is the index of the current word in the sentence, and i1 and i2 are the indices of the first entity e1 and the second entity e2 respectively; a negative value indicates that the current word precedes the entity word. A position vector matrix of size 2l × d is then initialized using Word2vec, where d is the dimension of the position vectors. The position vector of each word in the sentence is expressed as pvi = [pvi1, pvi2], where pvi1 and pvi2 are the vector representations of the relative distances from the i-th word in the sentence to entity e1 and entity e2 respectively.
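The relative distances i-i1 and i-i2 described in claim 4 can be computed as in the sketch below. It assumes 0-based word indices, and the second helper assumes that an offset is shifted by the sentence length l to obtain a non-negative row index into the 2l-row position vector matrix. Both choices, and the function names, are illustrative assumptions rather than details taken from the claim.

```python
def relative_positions(length, i1, i2):
    """Return (i - i1, i - i2) for every word index i of a sentence."""
    return [(i - i1, i - i2) for i in range(length)]


def position_rows(offsets, length):
    """Shift offsets in [-(l-1), l-1] to row indices of a table with 2l rows."""
    return [(a + length, b + length) for a, b in offsets]


sentence = ["西湖", "位于", "杭州"]  # entity e1 at index 0, entity e2 at index 2
offsets = relative_positions(len(sentence), i1=0, i2=2)
rows = position_rows(offsets, len(sentence))
```

A negative first or second component means the word precedes the corresponding entity, which matches the sign convention stated in the claim.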
5. The method for constructing a low-resource text tourism-domain knowledge base according to claim 4, characterized in that step S104 specifically is:
Given the word vectors N = {n1, n2, ..., nl} and the part-of-speech vectors V = {v1, v2, ..., vl} obtained by training the word vector model and the part-of-speech vector model, N and V are added as matrices, and the word vectors fused with part-of-speech vectors are expressed as W = αN + (1-α)V; setting α = 0.5 gives W2 = 0.5(N + V) after fusing the part-of-speech vectors. The position vectors are then incorporated to obtain the multi-feature word vectors W3 = [W2, PV], where PV = {pv1, pv2, ..., pvl}.
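The fusion in claim 5 amounts to a weighted element-wise sum followed by a concatenation. A minimal sketch, assuming each of N, V, and PV is a list with one vector per word and that word vectors and part-of-speech vectors share the same dimension (the function name and toy values are hypothetical):

```python
def fuse_features(N, V, PV, alpha=0.5):
    """Compute W3 = [alpha * N + (1 - alpha) * V, PV] word by word."""
    fused = []
    for n, v, pv in zip(N, V, PV):
        # W2: weighted sum of the word vector and the part-of-speech vector.
        w2 = [alpha * a + (1 - alpha) * b for a, b in zip(n, v)]
        # W3: concatenate the position vector onto W2.
        fused.append(w2 + list(pv))
    return fused


N = [[1.0, 3.0]]    # word vector of a one-word sentence
V = [[3.0, 5.0]]    # part-of-speech vector of the same word
PV = [[0.5, -0.5]]  # position vector of the same word
W3 = fuse_features(N, V, PV)
```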
6. The method for constructing a low-resource text tourism-domain knowledge base according to claim 5, characterized in that step S105 specifically is:
The multi-feature word vectors W3 are used as the input of the relation extraction model and processed with a bidirectional LSTM (BLSTM) to obtain the textual semantic information of the word sequence in the sentence in both directions, front to back and back to front; the BLSTM output for the i-th word is calculated using the formula:
The output of the BLSTM layer is connected to a softmax classifier to obtain the probability distribution over entity relations;
Wherein the entity relations are of 9 kinds: positional relation, opening relation, creation relation, belonging relation, proximity relation, correlation relation, inclusion relation, equivalence relation, and attribute relation.
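For orientation only, a commonly used way to write the two quantities named in claim 6 is given below. This is an assumed, generic formulation and is not taken from the claim: the output for the i-th word is the combination of the forward and backward hidden states, and the softmax turns the nine classifier scores into probabilities. The symbols h, z, and p are introduced here for illustration.

```latex
% Assumed generic form: forward and backward hidden states combined per word.
h_i = \left[\overrightarrow{h_i} \oplus \overleftarrow{h_i}\right]

% Softmax over the scores z_1, ..., z_9 of the nine entity relations.
p_j = \frac{\exp(z_j)}{\sum_{m=1}^{9} \exp(z_m)}, \qquad j = 1, \dots, 9
```

Here the operator between the two hidden states is usually concatenation, although element-wise summation is also used in some BLSTM relation extraction models.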
7. The method for constructing a low-resource text tourism-domain knowledge base according to claim 6, characterized in that step S106 specifically is:
Determine the entity relation between two entities from the probability distribution, and construct a knowledge triple <first entity e1, second entity e2, entity relation> from the entity relation between the two entities; by processing the text corpus, multiple knowledge triples are obtained, and the Chinese tourism-domain knowledge base is then constructed.
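One straightforward reading of claim 7 is to pick the relation with the highest probability and emit the triple. The sketch below assumes that reading; the order of the labels in RELATIONS, the label spellings, and the function name are assumptions, and the probabilities are made-up example values.

```python
# The nine relation kinds listed in claim 6; the index order is an assumption.
RELATIONS = [
    "positional", "opening", "creation", "belonging", "proximity",
    "correlation", "inclusion", "equivalence", "attribute",
]


def build_triple(e1, e2, probabilities):
    """Return (e1, e2, relation) for the most probable relation."""
    if len(probabilities) != len(RELATIONS):
        raise ValueError("expected one probability per relation kind")
    best = max(range(len(RELATIONS)), key=lambda j: probabilities[j])
    return (e1, e2, RELATIONS[best])


# Example distribution over the nine relations (illustrative values).
probabilities = [0.60, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]
triple = build_triple("西湖", "杭州", probabilities)
```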
8. The method for constructing a low-resource text tourism-domain knowledge base according to claim 1, characterized in that step S2 specifically is:
Obtain the Chinese corpus and the low-resource text corpus by crawler technology, remove useless information and keep only the article body text, then perform word segmentation and remove stop words.
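The preprocessing in claim 8 can be sketched as three small stages: strip markup, segment, and filter stop words. The sketch assumes that the crawled page is an HTML string and that the text is already separated by whitespace, so `str.split` stands in for a real word segmenter; Chinese text would need a dedicated segmentation tool passed in as `segment`. The regular expression is a simplification that does not handle script or style blocks. All names are hypothetical.

```python
import re

# Simplified markup stripper: removes anything that looks like an HTML tag.
TAG_PATTERN = re.compile(r"<[^>]+>")


def preprocess(raw_html, stop_words, segment=str.split):
    """Strip tags, segment the text into words, and drop stop words."""
    text = TAG_PATTERN.sub(" ", raw_html)
    tokens = segment(text)
    return [token for token in tokens if token not in stop_words]


raw = "<p>西湖 位于 杭州 的 西部</p>"
tokens = preprocess(raw, stop_words={"的"})
```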
9. The method for constructing a low-resource text tourism-domain knowledge base according to claim 1, characterized in that step S3 specifically is:
Train a fastText word vector model on the preprocessed Chinese corpus to obtain the corresponding Chinese word vectors X, and train a fastText word vector model on the preprocessed low-resource text corpus to obtain the corresponding low-resource text word vectors Y.
10. A system for constructing a low-resource text tourism-domain knowledge base, characterized in that the system comprises a computer, and the computer includes:
At least one storage unit;
At least one processing unit;
Wherein at least one instruction is stored in the at least one storage unit, and the at least one instruction is loaded and executed by the at least one processing unit to perform the following steps:
S1. Construct a Chinese tourism-domain knowledge base containing multiple knowledge triples;
S2. Obtain a Chinese corpus and a low-resource text corpus, and preprocess both corpora;
S3. Obtain the Chinese word vectors X from the preprocessed Chinese corpus, and obtain the low-resource text word vectors Y from the preprocessed low-resource text corpus;
S4. Use the MUSE model to obtain a linear mapping matrix T from the Chinese word vectors X to the low-resource text word vectors Y; then multiply the linear mapping matrix T by the matrix of Chinese word vectors X to obtain the vectors U, which are the Chinese word vectors X mapped into the space of the low-resource text word vectors Y;
S5. For each Chinese word, find in the low-resource text word vectors Y the k low-resource text words whose cosine distance to that word's vector representation in U is smallest, take them as the set of k candidate low-resource text translations of the Chinese word, and build the tourism-domain Chinese-to-low-resource-text dictionary from these candidate sets;
S6. Use the tourism-domain Chinese-to-low-resource-text dictionary to translate the knowledge triples in the Chinese tourism knowledge base into the low-resource text, thereby constructing the low-resource text tourism-domain knowledge base.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910650742.1A CN110532544B (en) | 2019-07-18 | 2019-07-18 | Method and system for constructing low-resource word tourism field knowledge base |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110532544A (en) | 2019-12-03 |
CN110532544B (en) | 2023-03-24 |
Family
ID=68660345
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910650742.1A Active CN110532544B (en) | 2019-07-18 | 2019-07-18 | Method and system for constructing low-resource word tourism field knowledge base |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110532544B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112132660A (en) * | 2020-09-25 | 2020-12-25 | 尚娱软件(深圳)有限公司 | Commodity recommendation method, system, device and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104133848A (en) * | 2014-07-01 | 2014-11-05 | 中央民族大学 | Tibetan language entity knowledge information extraction method |
WO2015079591A1 (en) * | 2013-11-27 | 2015-06-04 | Nec Corporation | Crosslingual text classification method using expected frequencies |
CN106777274A (en) * | 2016-06-16 | 2017-05-31 | 北京理工大学 | A kind of Chinese tour field knowledge mapping construction method and system |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2015079591A1 (en) * | 2013-11-27 | 2015-06-04 | Nec Corporation | Crosslingual text classification method using expected frequencies |
CN104133848A (en) * | 2014-07-01 | 2014-11-05 | 中央民族大学 | Tibetan language entity knowledge information extraction method |
CN106777274A (en) * | 2016-06-16 | 2017-05-31 | 北京理工大学 | A kind of Chinese tour field knowledge mapping construction method and system |
Non-Patent Citations (1)
Title |
---|
柳路芳 等: ""基于词向量与可比语料库的双语词典提取研究"", 《计算机工程与科学》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112132660A (en) * | 2020-09-25 | 2020-12-25 | 尚娱软件(深圳)有限公司 | Commodity recommendation method, system, device and storage medium |
CN112132660B (en) * | 2020-09-25 | 2023-12-26 | 尚娱软件(深圳)有限公司 | Commodity recommendation method, system, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110532544B (en) | 2023-03-24 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||