CN110532544B - Method and system for constructing low-resource word tourism field knowledge base - Google Patents


Publication number
CN110532544B
Authority
CN
China
Prior art keywords
word
low
resource
chinese
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910650742.1A
Other languages
Chinese (zh)
Other versions
CN110532544A (en)
Inventor
赵小兵
冯小兰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Minzu University of China
Original Assignee
Minzu University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Minzu University of China filed Critical Minzu University of China
Priority to CN201910650742.1A priority Critical patent/CN110532544B/en
Publication of CN110532544A publication Critical patent/CN110532544A/en
Application granted granted Critical
Publication of CN110532544B publication Critical patent/CN110532544B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a method and a system for constructing a tourism domain knowledge base for a low-resource language, and relates to the field of computers. The method comprises constructing a Chinese tourism domain knowledge base containing a plurality of knowledge triples, constructing a tourism-domain Chinese-to-low-resource-language dictionary, and translating the knowledge triples in the Chinese tourism domain knowledge base into the low-resource language by means of that dictionary, thereby constructing the low-resource language tourism domain knowledge base. The invention builds knowledge triples from resource-rich Chinese tourism corpora to obtain a comprehensive Chinese tourism domain knowledge base, and then migrates that knowledge base to the low-resource language. This addresses the technical problem that comprehensive scenic-spot knowledge in the low-resource language is difficult to obtain directly because tourism-domain corpora in that language are scarce on the web, achieves the goal of migrating knowledge from a resource-rich language to a low-resource language, and facilitates intelligent services for tourism information and other information in the low-resource language.

Description

Method and system for constructing a low-resource language tourism domain knowledge base
Technical Field
The invention relates to the field of computer technology, and in particular to a method and a system for constructing a tourism domain knowledge base for a low-resource language.
Background
Tourism has become one of the most important forms of leisure and recreation. With the rapid development of the internet, more and more Chinese travel websites have emerged, providing tourists with rich travel information. Chinese travel websites carry a large amount of information, and their scenic-spot introductions are lengthy and cover many different kinds of information; by contrast, intelligent tourism information services have not yet been realized for low-resource languages. How to use a resource-rich language to help construct a knowledge base for a low-resource language has become an important research topic in natural language processing.
However, because tourism-domain corpora in low-resource languages are currently scarce on the web, comprehensive scenic-spot knowledge in a low-resource language is hard to obtain directly, which makes constructing such a knowledge base difficult.
Disclosure of Invention
(I) Technical problem to be solved
In view of the deficiencies of the prior art, the invention provides a method for constructing a low-resource language tourism domain knowledge base, which addresses the technical problems that comprehensive scenic-spot knowledge in a low-resource language is difficult to obtain directly, and the knowledge base is therefore difficult to construct, because tourism-domain corpora in the low-resource language are scarce on the web.
(II) Technical solution
To achieve the above purpose, the invention is realized by the following technical solution:
The invention discloses a method for constructing a low-resource language tourism domain knowledge base, which is executed by a computer and comprises the following steps:
S1, constructing a Chinese tourism domain knowledge base containing a plurality of knowledge triples;
S2, obtaining a Chinese corpus and a low-resource language corpus, and preprocessing both corpora;
S3, obtaining Chinese word vectors X from the preprocessed Chinese corpus, and obtaining low-resource language word vectors Y from the preprocessed low-resource language corpus;
S4, obtaining, based on the MUSE model, a linear mapping matrix T from the Chinese word vectors X to the low-resource language word vectors Y, and multiplying the linear mapping matrix T by the matrix of Chinese word vectors X to obtain the vectors U, i.e. the Chinese word vectors X mapped into the space of the low-resource language word vectors Y;
S5, for the vector of each Chinese word in U, finding the k low-resource language words in Y whose word vectors are closest to it in cosine distance, and taking these k words as the translation candidate set of that Chinese word, so as to construct a tourism-domain Chinese-to-low-resource-language dictionary;
S6, translating the knowledge triples in the Chinese tourism domain knowledge base into the low-resource language based on the tourism-domain Chinese-to-low-resource-language dictionary, thereby constructing the low-resource language tourism domain knowledge base.
Preferably, S1 specifically includes:
S101, obtaining a text corpus of Chinese tourism texts;
S102, training on the text corpus based on the Word2Vec model to obtain a word vector model and a part-of-speech vector model;
S103, obtaining a position vector for each word in a sentence based on the sentences in the text corpus;
S104, obtaining word vectors based on the word vector model and part-of-speech vectors based on the part-of-speech vector model, adding the word vector matrix and the part-of-speech vector matrix to obtain word vectors fused with part-of-speech vectors, and then fusing in the position vectors to obtain multi-feature fused word vectors;
S105, inputting the multi-feature fused word vectors into a relation extraction model to obtain a probability distribution over entity relations;
S106, determining the entity relation between two entities according to the probability distribution over entity relations, constructing knowledge triples based on the entity relation, storing the knowledge triples in structured form in a database, and thereby constructing the Chinese tourism domain knowledge base.
Preferably, the step S101 specifically includes:
obtaining tourism texts by web crawling and preprocessing them, the preprocessed tourism texts forming the text corpus, wherein the preprocessing comprises sentence segmentation, word segmentation and part-of-speech tagging.
Preferably, the step S103 specifically includes:
for each word s_i in a sentence word sequence S = [s_1, s_2, ..., s_l] of length l, the relative distances to a first entity e1 and a second entity e2 are i - i1 and i - i2, the first entity e1 and the second entity e2 being the target entities, wherein i is the index of the current word in the sentence, i1 and i2 are the indices of the first entity e1 and the second entity e2 respectively, and a negative value indicates that the current word precedes the entity word; a 2l × d position vector matrix is then obtained by Word2vec initialization, wherein d is the dimensionality of the position vectors; the position vector of each word in the sentence is represented as pv_i = [pv_i^1, pv_i^2], wherein pv_i^1 and pv_i^2 are the vector representations of the relative distances from the i-th word in the sentence to entity e1 and entity e2 respectively.
Preferably, the step S104 specifically includes:
performing matrix addition on the word vectors N = {n_1, n_2, ..., n_l} and the part-of-speech vectors V = {v_1, v_2, ..., v_l}, obtained from the word vector model and the part-of-speech vector model respectively, to obtain word vectors fused with part-of-speech vectors, represented as W = αN + (1 - α)V; with α = 0.5, the word vectors after fusing the part-of-speech vectors are W2 = 0.5(N + V); the position vectors are then fused in to obtain the multi-feature fused word vectors W3 = [W2, PV], wherein PV = {pv_1, pv_2, ..., pv_l}.
Preferably, the step S105 specifically includes:
taking the multi-feature fused word vectors W3 as the input of the relation extraction model, and processing them with a bidirectional LSTM (BLSTM) to obtain the semantic information of the word sequence in the sentence in both the forward and backward directions; the BLSTM output for the i-th word is calculated by the formula:
Figure BDA0002135133550000041
connecting the output of the BLSTM layer with a softmax classifier to obtain probability distribution of entity relations;
the entity relations comprise 9 types, including position relations, establishing relations, belonging relations, proximity relations, correlation relations, inclusion relations, equivalence relations and attribution relations.
Preferably, the step S106 specifically includes:
determining the entity relation between two entities according to the probability distribution, constructing a knowledge triple < first entity e1, second entity e2, entity relation > according to that relation, and obtaining a plurality of knowledge triples by processing the text corpus, so as to construct the Chinese tourism domain knowledge base.
Preferably, the step S2 specifically includes:
obtaining the Chinese corpus and the low-resource language corpus by web crawling, removing useless information so that only the text of each article is retained, and then performing word segmentation and stop-word removal.
Preferably, the step S3 specifically includes:
training a fastText word vector model on the preprocessed Chinese corpus to obtain the corresponding Chinese word vectors X, and training a fastText word vector model on the preprocessed low-resource language corpus to obtain the corresponding low-resource language word vectors Y.
The invention also provides a system for constructing a low-resource language tourism domain knowledge base, the system comprising a computer, wherein the computer comprises:
at least one memory unit;
at least one processing unit;
wherein the at least one memory unit stores at least one instruction that is loaded and executed by the at least one processing unit to perform the following steps:
S1, constructing a Chinese tourism domain knowledge base containing a plurality of knowledge triples;
S2, obtaining a Chinese corpus and a low-resource language corpus, and preprocessing both corpora;
S3, obtaining Chinese word vectors X from the preprocessed Chinese corpus, and obtaining low-resource language word vectors Y from the preprocessed low-resource language corpus;
S4, obtaining, based on the MUSE model, a linear mapping matrix T from the Chinese word vectors X to the low-resource language word vectors Y, and multiplying the linear mapping matrix T by the matrix of Chinese word vectors X to obtain the vectors U, i.e. the Chinese word vectors X mapped into the space of the low-resource language word vectors Y;
S5, for the vector of each Chinese word in U, finding the k low-resource language words in Y whose word vectors are closest to it in cosine distance, and taking these k words as the translation candidate set of that Chinese word, so as to construct a tourism-domain Chinese-to-low-resource-language dictionary;
S6, translating the knowledge triples in the Chinese tourism domain knowledge base into the low-resource language based on the tourism-domain Chinese-to-low-resource-language dictionary, thereby constructing the low-resource language tourism domain knowledge base.
(III) Advantageous effects
The invention provides a method and a system for constructing a low-resource language tourism domain knowledge base. Compared with the prior art, it has the following beneficial effects:
The method constructs a Chinese tourism domain knowledge base containing a plurality of knowledge triples, constructs a tourism-domain Chinese-to-low-resource-language dictionary, and translates the knowledge triples in the Chinese tourism domain knowledge base into the low-resource language by means of that dictionary, thereby constructing the low-resource language tourism domain knowledge base. The invention builds knowledge triples from resource-rich Chinese tourism corpora to obtain a comprehensive Chinese tourism domain knowledge base, and then migrates that knowledge base to the low-resource language. This addresses the technical problem that comprehensive scenic-spot knowledge in the low-resource language is difficult to obtain directly because tourism-domain corpora in that language are scarce on the web, achieves the goal of migrating knowledge from a resource-rich language to a low-resource language, and facilitates intelligent services for tourism information and other information in the low-resource language.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a block diagram of a method for constructing a low-resource language tourism domain knowledge base according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating the calculation of the distance from each word in a sentence to a target entity according to an embodiment of the present invention;
FIG. 3 is a block diagram of the relation extraction model in an embodiment of the present invention;
FIG. 4 is a schematic diagram of the word vector mapping in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are clearly and completely described, and it is obvious that the described embodiments are a part of the embodiments of the present invention, but not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the application provides a method and a system for constructing a low-resource language tourism domain knowledge base. It addresses the technical problems that comprehensive scenic-spot knowledge in a low-resource language is difficult to obtain directly, and the knowledge base is therefore difficult to construct, because tourism-domain corpora in the low-resource language are scarce on the web, and it achieves the goal of migrating knowledge from a resource-rich language to a low-resource language.
To solve the above technical problems, the general idea of the embodiment of the application is as follows:
The embodiment of the invention constructs a Chinese tourism domain knowledge base containing a plurality of knowledge triples and constructs a tourism-domain Chinese-to-low-resource-language dictionary; the knowledge triples in the Chinese tourism domain knowledge base are then translated into the low-resource language by means of that dictionary, so as to construct the low-resource language tourism domain knowledge base. The invention builds knowledge triples from resource-rich Chinese tourism corpora to obtain a comprehensive Chinese tourism domain knowledge base, and then migrates that knowledge base to the low-resource language. This addresses the technical problem that comprehensive scenic-spot knowledge in the low-resource language is difficult to obtain directly because tourism-domain corpora in that language are scarce on the web, achieves the goal of migrating knowledge from a resource-rich language to a low-resource language, and facilitates intelligent services for tourism information and other information in the low-resource language.
In order to better understand the technical scheme, the technical scheme is described in detail in the following with reference to the attached drawings of the specification and specific embodiments.
The embodiment of the invention provides a method for constructing a low-resource language tourism domain knowledge base, which is executed by a computer and comprises the following steps:
S1, constructing a Chinese tourism domain knowledge base containing a plurality of knowledge triples;
S2, obtaining a Chinese corpus and a low-resource language corpus, and preprocessing both corpora;
S3, obtaining Chinese word vectors X from the preprocessed Chinese corpus, and obtaining low-resource language word vectors Y from the preprocessed low-resource language corpus;
S4, obtaining, based on the MUSE model, a linear mapping matrix T from the Chinese word vectors X to the low-resource language word vectors Y, and multiplying the linear mapping matrix T by the matrix of Chinese word vectors X to obtain the vectors U, i.e. the Chinese word vectors X mapped into the space of the low-resource language word vectors Y;
S5, for the vector of each Chinese word in U, finding the k low-resource language words in Y whose word vectors are closest to it in cosine distance, and taking these k words as the translation candidate set of that Chinese word, so as to construct a tourism-domain Chinese-to-low-resource-language dictionary;
S6, translating the knowledge triples in the Chinese tourism domain knowledge base into the low-resource language based on the tourism-domain Chinese-to-low-resource-language dictionary, thereby constructing the low-resource language tourism domain knowledge base.
The steps are described in detail below.
Note that Tibetan is used as the example low-resource language.
S1, constructing a Chinese tourism domain knowledge base containing a plurality of knowledge triples. This specifically comprises the following steps:
S101, obtaining a text corpus of Chinese tourism texts. In the specific implementation, tourism texts are obtained by web crawling and preprocessed, and the preprocessed tourism texts form the text corpus; the preprocessing comprises sentence segmentation, word segmentation and part-of-speech tagging. Here, part of speech refers to categories such as noun, verb and adverb; in the embodiment of the present invention these categories are further subdivided, for example names are divided into person names, place names, organization names, transliterated names and so on.
For example: the Lacumin temple is located in Jiangzin county in the Japanese Kai region.
The word segmentation result is: the Lacumin temple is located in Jiangzin county in the Japanese Kai region.
The word segmentation and part-of-speech tagging result is: the Lacumin temple/na is located in/v Japanese Kai region/ns Jiangzin county/ns county/s ./wj
Wherein: ns denotes a place name; na denotes a scenic-spot name; v denotes a verb; s denotes a locative word; wj denotes punctuation.
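The tagged output pairs each token with a tag after a slash. The following is a minimal sketch of splitting such a line into (word, tag) pairs and picking out candidate entities. The helper name `parse_tagged`, the regular expression, and the choice of `na` and `ns` as entity tags are illustrative assumptions, not part of the described method.

```python
import re

# One token is "<text>/<tag>", where the tag is a run of lowercase letters.
TAGGED_TOKEN = re.compile(r"\s*(.+?)/([a-z]+)")

def parse_tagged(line):
    """Split a 'word/tag word/tag ...' line into (word, tag) pairs."""
    return [(word.strip(), tag) for word, tag in TAGGED_TOKEN.findall(line)]

line = ("Lacumin temple/na is located in/v Japanese Kai region/ns "
        "Jiangzin county/ns county/s ./wj")
pairs = parse_tagged(line)

# Assumption: scenic-spot names (na) and place names (ns) are entity candidates.
entities = [word for word, tag in pairs if tag in {"na", "ns"}]
```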
S102, training on the text corpus based on the Word2Vec model to obtain a word vector model and a part-of-speech vector model;
S103, obtaining a position vector for each word in a sentence based on the distance from each word in the sentences of the text corpus to the target entities, wherein the target entities are the first entity e1 and the second entity e2.
In the specific implementation, calculating the distance from each word in a sentence to the target entities makes it possible to distinguish different relation instances within the same sentence. Specifically: for each word s_i in a sentence word sequence S = [s_1, s_2, ..., s_l] of length l, the relative distances to the first entity e1 and the second entity e2 are i - i1 and i - i2, where i is the index of the current word in the sentence, i1 and i2 are the indices of the first entity e1 and the second entity e2 respectively, and a negative value indicates that the current word precedes the entity word. As shown in FIG. 2, the sentence "Qiaga Qu Desi was built at the end of the 16th century and belongs to Gruppe." yields a word sequence of length 8 after word segmentation, in which, for example, the word immediately following the first entity e1 "Qiaga Qu Desi" has a relative distance of 1 to e1 and of -5 to the second entity e2 "Gruppe". A 2l × d position vector matrix is then obtained by Word2vec initialization, where d is the dimension of the position vectors. Finally, the position vector of each word in the sentence is represented as pv_i = [pv_i^1, pv_i^2], where pv_i^1 and pv_i^2 are the vector representations of the relative distances from the i-th word in the sentence to entity e1 and entity e2 respectively.
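The relative-distance computation can be sketched as follows. This is an illustration under stated assumptions: indices are 1-based, the two entities are assumed to sit at positions 1 and 7 of the 8-word sequence (which reproduces the distances 1 and -5 quoted above), and the position table is filled with random values as a stand-in for the Word2vec initialization. The function names are hypothetical.

```python
import numpy as np

def relative_positions(length, i1, i2):
    """Return (i - i1, i - i2) for every 1-based word index i in 1..length."""
    return [(i - i1, i - i2) for i in range(1, length + 1)]

def position_index(distance, length):
    """Shift a distance in [-(length-1), length-1] to a non-negative row index."""
    return distance + length - 1

l, d = 8, 4                       # sentence length, position vector dimension
pairs = relative_positions(l, i1=1, i2=7)

# Stand-in for the 2l x d position vector matrix (random, not Word2vec).
rng = np.random.default_rng(0)
table = rng.normal(size=(2 * l, d))

# pv_i = [pv_i^1, pv_i^2]: concatenate the two looked-up rows for each word.
pv = np.stack([
    np.concatenate([table[position_index(a, l)], table[position_index(b, l)]])
    for a, b in pairs
])
```

Each row of `pv` has 2d entries, one d-dimensional block per target entity.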
S104, obtaining word vectors based on the word vector model and part-of-speech vectors based on the part-of-speech vector model, adding the word vector matrix and the part-of-speech vector matrix to obtain word vectors fused with part-of-speech vectors, and then fusing in the position vectors to obtain multi-feature fused word vectors.
In the specific implementation, given the word vectors N = {n_1, n_2, ..., n_l} and the part-of-speech vectors V = {v_1, v_2, ..., v_l} obtained from the word vector model and the part-of-speech vector model, matrix addition is performed on N and V to obtain the word vector representation fused with part-of-speech vectors, W = αN + (1 - α)V, where 0 ≤ α ≤ 1. When α = 1, W1 = N, i.e. the word vectors without part-of-speech information; with α = 0.5, W2 = 0.5(N + V) after fusing the part-of-speech vectors. The position vectors are then fused in to obtain the multi-feature fused word vectors W3 = [W2, PV], where PV = {pv_1, pv_2, ..., pv_l}.
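A minimal NumPy sketch of this fusion, assuming N and V are l × dim matrices of equal shape and PV is an l × 2d matrix; the function name `fuse_features` and the toy sizes are illustrative assumptions.

```python
import numpy as np

def fuse_features(N, V, PV, alpha=0.5):
    """Compute W = alpha*N + (1-alpha)*V, then concatenate PV column-wise."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if N.shape != V.shape:
        raise ValueError("N and V must have the same shape for matrix addition")
    W = alpha * N + (1.0 - alpha) * V
    return np.concatenate([W, PV], axis=1)   # W3 = [W, PV]

l, dim, d = 8, 6, 4
N = np.ones((l, dim))            # toy word vectors
V = 3.0 * np.ones((l, dim))      # toy part-of-speech vectors
PV = np.zeros((l, 2 * d))        # toy position vectors

W3 = fuse_features(N, V, PV)              # alpha = 0.5, i.e. W2 = 0.5 (N + V)
W1_only = fuse_features(N, V, PV, 1.0)    # alpha = 1 ignores part of speech
```

Weighted addition keeps the word-vector dimension unchanged, while concatenating PV adds 2d columns per word.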
S105, inputting the multi-feature fused word vectors into a relation extraction model to obtain a probability distribution over entity relations;
In the specific implementation, the relation extraction model comprises a BLSTM layer, a fully connected layer and a softmax classifier, as shown in FIG. 3. The multi-feature fused word vectors are taken as the input of the relation extraction model and processed with a bidirectional LSTM to obtain the semantic information of the word sequence in the sentence in both the forward and backward directions. The BLSTM output for the i-th word is calculated by the formula:
Figure BDA0002135133550000111
The output of the BLSTM layer is connected to a softmax classifier to obtain the probability distribution over entity relations;
the entity relations comprise 9 types, including position relations, establishing relations, belonging relations, proximity relations, correlation relations, inclusion relations, equivalence relations and attribution relations.
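The output formula is only available as an image, so the sketch below does not claim to reproduce it. It assumes the common BLSTM formulation in which the forward and backward hidden states of each word are concatenated, and it assumes mean pooling over the sentence before the fully connected layer. All weights are random placeholders, and the names `lstm_pass` and `relation_distribution` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_pass(X, Wx, Wh, b):
    """Run one LSTM direction over X (l x d_in); return hidden states (l x h)."""
    h_dim = Wh.shape[0]
    h, c = np.zeros(h_dim), np.zeros(h_dim)
    states = []
    for x in X:
        z = x @ Wx + h @ Wh + b                  # 4*h_dim gate pre-activations
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        states.append(h)
    return np.stack(states)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def relation_distribution(W3, fwd, bwd, W_fc, b_fc):
    h_fwd = lstm_pass(W3, *fwd)                  # front-to-back pass
    h_bwd = lstm_pass(W3[::-1], *bwd)[::-1]      # back-to-front pass, realigned
    H = np.concatenate([h_fwd, h_bwd], axis=1)   # assumed: h_i = [fwd_i, bwd_i]
    sentence = H.mean(axis=0)                    # assumed pooling step
    return softmax(sentence @ W_fc + b_fc)       # fully connected + softmax

rng = np.random.default_rng(0)
l, d_in, h_dim, n_rel = 8, 14, 5, 9

def init_direction():
    return (rng.normal(scale=0.1, size=(d_in, 4 * h_dim)),
            rng.normal(scale=0.1, size=(h_dim, 4 * h_dim)),
            np.zeros(4 * h_dim))

probs = relation_distribution(rng.normal(size=(l, d_in)),
                              init_direction(), init_direction(),
                              rng.normal(scale=0.1, size=(2 * h_dim, n_rel)),
                              np.zeros(n_rel))
```

In practice the weights would be learned from labelled sentences; the sketch only shows how the data flows from W3 to a distribution over the relation types.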
S106, determining the entity relation between two entities according to the probability distribution over entity relations, constructing knowledge triples based on the entity relation, storing the knowledge triples in structured form in a database, and thereby constructing the Chinese tourism domain knowledge base.
Specifically: the entity relation between the two entities is determined according to the probability distribution, and the knowledge triple < first entity e1, second entity e2, entity relation > is constructed according to that relation. For example, from the sentence "the temple of the north grotto is a cave temple of the Gansu province", the knowledge triple < the temple of the north grotto, the Gansu province, position relation > can be constructed. A plurality of knowledge triples are obtained by processing the text corpus, so as to construct the Chinese tourism domain knowledge base.
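A minimal sketch of this step, assuming the relation with the highest probability is chosen and the triples are stored in a single SQLite table. The table layout, the function names and the example probabilities are illustrative assumptions, and the label list contains only the relation names that appear in the text.

```python
import sqlite3

# Relation labels as named in the text (illustrative ordering).
RELATIONS = ["position relation", "establishing relation", "belonging relation",
             "proximity relation", "correlation relation", "inclusion relation",
             "equivalence relation", "attribution relation"]

def build_triple(e1, e2, probs, labels):
    """Pick the most probable relation and return (e1, e2, relation)."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return (e1, e2, labels[best])

def store_triples(conn, triples):
    """Store triples in a simple three-column table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS triples (e1 TEXT, e2 TEXT, relation TEXT)")
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", triples)
    conn.commit()

# Made-up probability distribution for the example sentence.
probs = [0.70, 0.05, 0.05, 0.05, 0.05, 0.04, 0.03, 0.03]
triple = build_triple("the temple of the north grotto", "the Gansu province",
                      probs, RELATIONS)

conn = sqlite3.connect(":memory:")
store_triples(conn, [triple])
rows = conn.execute("SELECT e1, e2, relation FROM triples").fetchall()
```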
S2, obtaining a Chinese corpus and a Tibetan corpus, and preprocessing both corpora.
In the specific implementation, the Chinese corpus and the Tibetan corpus are obtained by web crawling, useless information is removed so that only the text of each article is retained, and preprocessing such as word segmentation and stop-word removal is then performed.
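The cleaning steps can be sketched as below. This is an illustration only: it strips markup with a simple regular expression, uses whitespace splitting as a stand-in for a real Chinese or Tibetan word segmenter, and uses a made-up English stop-word list. All names are hypothetical.

```python
import re

TAG = re.compile(r"<[^>]+>")

def clean_text(html):
    """Drop markup tags and collapse whitespace, keeping only the text."""
    return re.sub(r"\s+", " ", TAG.sub(" ", html)).strip()

def segment(text):
    """Stand-in segmenter: a real pipeline would use a language-specific tool."""
    return text.split()

def remove_stop_words(tokens, stop_words):
    return [token for token in tokens if token not in stop_words]

page = "<p>the temple is located in the county</p>"
stop_words = {"the", "is", "in"}          # made-up stop-word list

text = clean_text(page)
tokens = remove_stop_words(segment(text), stop_words)
```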
S3, obtaining Chinese word vectors X from the preprocessed Chinese corpus, and obtaining Tibetan word vectors Y from the preprocessed Tibetan corpus;
In the specific implementation, a fastText word vector model is trained on the preprocessed Chinese corpus to obtain the corresponding Chinese word vectors X, and a fastText word vector model is trained on the preprocessed Tibetan corpus to obtain the corresponding Tibetan word vectors Y;
S4, obtaining, based on the MUSE model, a linear mapping matrix T from the Chinese word vectors X to the Tibetan word vectors Y, and multiplying the linear mapping matrix T by the matrix of Chinese word vectors X to obtain the vectors U, i.e. the Chinese word vectors X mapped into the space of the Tibetan word vectors Y;
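The text does not describe how MUSE arrives at T. One building block MUSE uses is the orthogonal Procrustes solution, which gives the best orthogonal mapping for a set of aligned word pairs. The sketch below shows only that step, on synthetic data, assuming the columns of X and Y are the vectors of known translation pairs; it is not the full MUSE procedure, and the function name is hypothetical.

```python
import numpy as np

def procrustes_mapping(X, Y):
    """Orthogonal T minimizing ||T X - Y||_F for d x n matrices X and Y."""
    U_svd, _, Vt = np.linalg.svd(Y @ X.T)
    return U_svd @ Vt

rng = np.random.default_rng(0)
d, n = 5, 50
X = rng.normal(size=(d, n))                    # synthetic source-language vectors
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # hidden orthogonal transform
Y = Q @ X                                      # synthetic target-language vectors

T = procrustes_mapping(X, Y)
U = T @ X                                      # source vectors mapped into Y's space
```

On this synthetic data the recovered T matches the hidden transform Q, so U coincides with Y; with real embeddings the fit is only approximate.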
S5, for the vector of each Chinese word in U, finding the k low-resource language words in Y whose word vectors are closest to it in cosine distance, and taking these k words as the translation candidate set of that Chinese word, so as to construct a tourism-domain Chinese-to-low-resource-language dictionary.
For example, as shown in Fig. 4, U (U = TX) denotes the Chinese word vectors after being mapped into the Tibetan word vector space, Y denotes the Tibetan word vector space, x is a Chinese word, and V_x is the word vector corresponding to x. The k Tibetan words in the Tibetan word vector space Y whose cosine distance to V_x is smallest are found and used as the candidate set for translating the Chinese word x into Tibetan, namely y1, y2, ..., yk. For example, when k = 5, after the Chinese word "sun county" undergoes the Chinese-Tibetan cross-language word vector mapping, the Tibetan words closest to it in cosine distance are, in order, [Tibetan word, image BDA0002135133550000131] (heliostat county), [Tibetan word, image BDA0002135133550000132] (day zong), [Tibetan word, image BDA0002135133550000133] (sun-day), [Tibetan word, image BDA0002135133550000134] (Japanese Kate) and [Tibetan word, image BDA0002135133550000135] (middle side); the Tibetan translation of "heliostat county" can then be selected as [Tibetan word, image BDA0002135133550000136].
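The candidate search in step S5 can be sketched as a cosine nearest-neighbour lookup. The function name `top_k_candidates`, the column-per-word layout of Y, and the placeholder word labels are assumptions for illustration only.

```python
import numpy as np


def top_k_candidates(v_x, Y, target_words, k=5):
    """Return the k target words whose vectors are closest to v_x by cosine.

    v_x: mapped source vector of shape (d,), i.e. one column of U = T X.
    Y: target word vectors of shape (d, m), one column per target word.
    target_words: list of m labels aligned with the columns of Y.
    """
    v = v_x / np.linalg.norm(v_x)
    Yn = Y / np.linalg.norm(Y, axis=0, keepdims=True)
    sims = v @ Yn                       # cosine similarity to every target word
    order = np.argsort(-sims)[:k]       # highest similarity = smallest distance
    return [(target_words[i], float(sims[i])) for i in order]
```

The returned list is ordered from closest to farthest, so its first entry corresponds to the translation that would be selected in the example above.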
S6, translating the plurality of triple knowledge in the Chinese tourism knowledge base into Tibetan based on the Chinese-Tibetan dictionary in the tourism field, and constructing a Tibetan tourism field knowledge base.
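Step S6 can be sketched as a dictionary lookup over each triple. The choice of the top-ranked candidate, the handling of terms that have no dictionary entry, and the function names are assumptions; the text above does not say how untranslatable terms are treated.

```python
def translate_term(term, dictionary):
    """Return the top-ranked translation candidate, or None if there is none."""
    candidates = dictionary.get(term)
    return candidates[0] if candidates else None


def translate_triples(triples, dictionary):
    """Translate <e1, e2, relation> triples element by element.

    dictionary: maps a Chinese word to its candidate list, ordered from
    closest to farthest (assumed layout).
    Returns (translated, untranslated); a triple with any element missing
    from the dictionary is set aside rather than partially translated.
    """
    translated, untranslated = [], []
    for triple in triples:
        out = tuple(translate_term(t, dictionary) for t in triple)
        if None in out:
            untranslated.append(triple)
        else:
            translated.append(out)
    return translated, untranslated
```

Setting incomplete triples aside keeps the target knowledge base free of mixed-language entries; they could later be reviewed manually or retried with a larger k.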
The embodiment of the invention also provides a low-resource character tourism field knowledge base construction system, which comprises a computer, wherein the computer comprises:
at least one memory unit;
at least one processing unit;
wherein the at least one memory unit has stored therein at least one instruction that is loaded and executed by the at least one processing unit to perform the steps of:
S1, constructing a Chinese tourism field knowledge base containing a plurality of triple knowledge;
S2, obtaining a Chinese corpus and a low-resource character corpus, and preprocessing the Chinese corpus and the low-resource character corpus;
S3, acquiring a corresponding Chinese word vector X based on the preprocessed Chinese corpus, and acquiring a corresponding low-resource character word vector Y based on the preprocessed low-resource character corpus;
S4, acquiring a linear mapping matrix T from the Chinese word vector X to the low-resource character word vector Y based on the MUSE model; multiplying the linear mapping matrix T by the Chinese word vector matrix X to obtain the vector U, that is, the Chinese word vector X after being mapped into the space of the low-resource character word vector Y;
S5, for the word vector of a Chinese word in the vector U, calculating the k low-resource character words in the low-resource character word vector Y whose cosine distance to it is smallest, and taking these k words as the translation candidate set corresponding to the Chinese word, so as to construct a Chinese-low-resource character dictionary in the tourism field;
S6, translating the plurality of triple knowledge in the Chinese tourism knowledge base into the low-resource characters based on the Chinese-low-resource character dictionary in the tourism field, and constructing the low-resource character tourism field knowledge base.
In summary, compared with the prior art, the method has the following beneficial effects:
The embodiment of the invention constructs a Chinese tourism field knowledge base containing a plurality of triple knowledge and constructs a Chinese-low-resource character dictionary in the tourism field; the plurality of triple knowledge in the Chinese tourism knowledge base is then translated into the low-resource characters through this dictionary, so as to construct the low-resource character tourism field knowledge base. The invention builds triple knowledge from the resource-rich Chinese tourism corpus to obtain a comprehensive Chinese tourism field knowledge base, and then migrates that knowledge base to the low-resource character tourism field knowledge base. This solves the technical problem that comprehensive scenic-spot knowledge in the low-resource characters is difficult to obtain directly because tourism-field corpora in those characters are scarce on the network, achieves the aim of migrating knowledge from a resource-rich language to a low-resource language field, and facilitates intelligent services such as low-resource character tourism information services.
It should be noted that, through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. A low-resource character tourism field knowledge base construction method is characterized in that the method is executed by a computer and comprises the following steps:
S1, constructing a Chinese tourism field knowledge base containing a plurality of triple knowledge, comprising the following steps:
S101, acquiring a text corpus of Chinese tourism texts;
S102, training on the text corpus based on a Word2Vec model to obtain a word vector model and a part-of-speech vector model;
S103, acquiring a position vector of each word in a sentence based on the sentences in the text corpus;
S104, obtaining word vectors based on the word vector model, obtaining part-of-speech vectors based on the part-of-speech vector model, performing matrix addition on the word vectors and the part-of-speech vectors to obtain word vectors fused with the part-of-speech vectors, and then fusing the position vectors to obtain word vectors fused with multiple features;
S105, inputting the word vectors fused with multiple features into a relation extraction model to obtain a probability distribution of entity relations;
S106, judging the entity relation between two entities according to the probability distribution of entity relations, constructing triple knowledge based on the entity relation, storing the entity triple knowledge in structured form in a database, and constructing the Chinese tourism field knowledge base;
S2, obtaining a Chinese corpus and a low-resource character corpus, and preprocessing the Chinese corpus and the low-resource character corpus;
S3, acquiring a corresponding Chinese word vector X based on the preprocessed Chinese corpus, and acquiring a corresponding low-resource character word vector Y based on the preprocessed low-resource character corpus;
S4, acquiring a linear mapping matrix T from the Chinese word vector X to the low-resource character word vector Y based on the MUSE model; multiplying the linear mapping matrix T by the Chinese word vector X to obtain the vector U, that is, the Chinese word vector X after being mapped into the space of the low-resource character word vector Y;
S5, for the word vector of a Chinese word in the vector U, calculating the k low-resource character words in the low-resource character word vector Y whose cosine distance to it is smallest, and taking these k words as the translation candidate set corresponding to the Chinese word, so as to construct a Chinese-low-resource character dictionary in the tourism field;
S6, translating the plurality of triple knowledge in the Chinese tourism knowledge base into the low-resource characters based on the Chinese-low-resource character dictionary in the tourism field, and constructing the low-resource character tourism field knowledge base.
2. The method for constructing a low-resource character tourism field knowledge base according to claim 1, wherein step S101 specifically comprises:
obtaining tourism texts with a web crawler and preprocessing them, the preprocessed tourism texts forming the text corpus, wherein the preprocessing comprises sentence segmentation, word segmentation and part-of-speech tagging.
3. The method for constructing a low-resource character tourism field knowledge base according to claim 2, wherein step S103 specifically comprises:
for each word si in a sentence word sequence S = [s1, s2, ..., sl] of length l, the relative distances to a first entity e1 and a second entity e2 are i - i1 and i - i2, the first entity e1 and the second entity e2 both being target entities, wherein i is the index of the current word in the sentence, i1 and i2 are the indices of the first entity e1 and the second entity e2 respectively, and a negative value indicates that the current word is located before the entity word; Word2vec is then used for initialization to obtain a 2l × d position vector matrix, wherein d is the dimension of the position vector; the position vector of each word in the sentence is represented as pv_i = [pv_i^1, pv_i^2], wherein pv_i^1 and pv_i^2 are the vector representations of the relative distances from the i-th word in the sentence to entity e1 and entity e2 respectively.
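The relative distances of claim 3 can be illustrated with a short sketch. It assumes 0-based word indices, and the shift by l used to turn a signed distance into a row index of the 2l × d position matrix is an assumption; the claim does not state how rows are indexed.

```python
def relative_positions(length, i1, i2):
    """Return (i - i1, i - i2) for every word index i in a sentence.

    length: sentence length l; i1, i2: indices of entities e1 and e2.
    A negative value means the word precedes the entity.
    """
    return [(i - i1, i - i2) for i in range(length)]


def position_indices(length, i1, i2):
    """Shift signed distances by l so they can index a 2l-row lookup table.

    Distances lie in [-(l - 1), l - 1], so the shifted values lie in
    [1, 2l - 1] and are valid rows of a matrix with 2l rows (assumed scheme).
    """
    return [(d1 + length, d2 + length)
            for d1, d2 in relative_positions(length, i1, i2)]
```

Looking up both indices of a word in the position matrix and concatenating the two rows gives that word's position vector with 2d components.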
4. The method for constructing a low-resource character tourism field knowledge base according to claim 3, wherein step S104 specifically comprises:
performing matrix addition on the word vectors N = {N1, N2, ..., Nl} and the part-of-speech vectors V = {V1, V2, ..., Vl}, obtained from the trained word vector model and part-of-speech vector model respectively, to obtain word vectors fused with the part-of-speech vectors, represented as W2 = αN + (1 - α)V with α = 0.5, that is, W2 = 0.5(N + V) after the part-of-speech vectors are fused; and then fusing the position vectors to obtain the multi-feature fused word vectors W3 = [W2, PV], wherein PV = {pv1, pv2, ..., pvl}.
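The fusion of claim 4 can be sketched with NumPy. The row-per-word matrix layout and the function name are assumptions; the weighted sum and the concatenation follow the formulas in the claim.

```python
import numpy as np


def fuse_features(N, V, PV, alpha=0.5):
    """Fuse word, part-of-speech and position vectors for one sentence.

    N, V: arrays of shape (l, d_w) with one row per word (assumed layout).
    PV: array of shape (l, d_p) holding each word's position vector.
    Returns W3 = [W2, PV] with W2 = alpha * N + (1 - alpha) * V.
    """
    W2 = alpha * N + (1 - alpha) * V          # element-wise weighted sum
    return np.concatenate([W2, PV], axis=1)   # append position features
```

The weighted sum requires the word and part-of-speech vectors to share a dimension, whereas the position vectors are concatenated and so may have any width.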
5. The method for constructing a low-resource character tourism field knowledge base according to claim 4, wherein step S105 specifically comprises:
inputting the multi-feature fused word vectors W3 into the relation extraction model, and processing them with a bidirectional LSTM (BLSTM) to obtain the text semantic information of the word sequence in the sentence in both the forward and backward directions; the BLSTM output for the i-th word is calculated by the formula:
[Formula image FDA0003972435660000031]
connecting the output of the BLSTM layer to a softmax classifier to obtain the probability distribution of the entity relations;
the entity relations comprise 9 types: position, establishment, creation, belonging, proximity, related, inclusion, equivalence and attribute relations.
6. The method for constructing a low-resource character tourism field knowledge base according to claim 5, wherein step S106 specifically comprises:
judging the entity relation between two entities according to the probability distribution, constructing the triple knowledge <first entity e1, second entity e2, entity relation> according to the entity relation between the two entities, and obtaining a plurality of triple knowledge by processing the text corpus, so as to construct the Chinese tourism field knowledge base.
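The decision and storage in step S106 can be sketched as follows. The relation label strings are illustrative English renderings of the nine relation types, and the SQLite table layout is an assumption, since the claims only say the triples are stored in structured form in a database. The input `scores` stands for the classifier's raw outputs before softmax.

```python
import math
import sqlite3

# Illustrative labels for the nine relation types (assumed English names).
RELATIONS = ["position", "establishment", "creation", "belonging", "proximity",
             "related", "inclusion", "equivalence", "attribute"]


def softmax(scores):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def build_triple(e1, e2, scores):
    """Pick the most probable relation and return <e1, e2, relation>."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return (e1, e2, RELATIONS[best])


def store_triples(conn, triples):
    """Store triples in a simple three-column table (assumed schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS triples (e1 TEXT, e2 TEXT, relation TEXT)")
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", triples)
    conn.commit()
```

A connection created with `sqlite3.connect(":memory:")` is enough to try the sketch without writing a file.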
7. The method for constructing a low-resource character tourism field knowledge base according to claim 1, wherein step S2 specifically comprises:
obtaining the Chinese corpus and the low-resource character corpus with a web crawler, removing useless information so that only the text of each article is kept, and then performing word segmentation and stop-word removal.
8. The method for constructing a low-resource character tourism field knowledge base according to claim 1, wherein step S3 specifically comprises:
training a fastText word vector model on the preprocessed Chinese corpus to obtain the corresponding Chinese word vector X, and training a fastText word vector model on the preprocessed low-resource character corpus to obtain the corresponding low-resource character word vector Y.
9. A low-resource character tourism field knowledge base construction system, characterized in that the system comprises a computer, the computer comprising:
at least one memory unit;
at least one processing unit;
wherein the at least one memory unit has stored therein at least one instruction that is loaded and executed by the at least one processing unit to perform the steps of:
S1, constructing a Chinese tourism field knowledge base containing a plurality of triple knowledge, comprising the following steps:
S101, acquiring a text corpus of Chinese tourism texts;
S102, training on the text corpus based on a Word2Vec model to obtain a word vector model and a part-of-speech vector model;
S103, acquiring a position vector of each word in a sentence based on the sentences in the text corpus;
S104, obtaining word vectors based on the word vector model, obtaining part-of-speech vectors based on the part-of-speech vector model, performing matrix addition on the word vectors and the part-of-speech vectors to obtain word vectors fused with the part-of-speech vectors, and then fusing the position vectors to obtain word vectors fused with multiple features;
S105, inputting the word vectors fused with multiple features into a relation extraction model to obtain a probability distribution of entity relations;
S106, judging the entity relation between two entities according to the probability distribution of entity relations, constructing triple knowledge based on the entity relation, storing the entity triple knowledge in structured form in a database, and constructing the Chinese tourism field knowledge base;
S2, obtaining a Chinese corpus and a low-resource character corpus, and preprocessing the Chinese corpus and the low-resource character corpus;
S3, acquiring a corresponding Chinese word vector X based on the preprocessed Chinese corpus, and acquiring a corresponding low-resource character word vector Y based on the preprocessed low-resource character corpus;
S4, acquiring a linear mapping matrix T from the Chinese word vector X to the low-resource character word vector Y based on the MUSE model; multiplying the linear mapping matrix T by the Chinese word vector X to obtain the vector U, that is, the Chinese word vector X after being mapped into the space of the low-resource character word vector Y;
S5, for the word vector of a Chinese word in the vector U, calculating the k low-resource character words in the low-resource character word vector Y whose cosine distance to it is smallest, and taking these k words as the translation candidate set corresponding to the Chinese word, so as to construct a Chinese-low-resource character dictionary in the tourism field;
S6, translating the plurality of triple knowledge in the Chinese tourism knowledge base into the low-resource characters based on the Chinese-low-resource character dictionary in the tourism field, and constructing the low-resource character tourism field knowledge base.
CN201910650742.1A 2019-07-18 2019-07-18 Method and system for constructing low-resource word tourism field knowledge base Active CN110532544B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910650742.1A CN110532544B (en) 2019-07-18 2019-07-18 Method and system for constructing low-resource word tourism field knowledge base

Publications (2)

Publication Number Publication Date
CN110532544A CN110532544A (en) 2019-12-03
CN110532544B true CN110532544B (en) 2023-03-24

Family

ID=68660345

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910650742.1A Active CN110532544B (en) 2019-07-18 2019-07-18 Method and system for constructing low-resource word tourism field knowledge base

Country Status (1)

Country Link
CN (1) CN110532544B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant