CN110489624B - Method for extracting Chinese-Vietnamese pseudo-parallel sentence pairs based on sentence feature vectors - Google Patents

Info

Publication number
CN110489624B
Authority
CN
China
Prior art keywords
sentence
chinese
parallel
pseudo
corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910628354.3A
Other languages
Chinese (zh)
Other versions
CN110489624A (en)
Inventor
余正涛
黄继豪
线岩团
郭军军
翟家欣
文永华
高盛祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN201910628354.3A
Publication of CN110489624A
Application granted
Publication of CN110489624B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method for extracting Chinese-Vietnamese pseudo-parallel sentence pairs based on sentence feature vectors, and belongs to the technical field of natural language processing. First, parallel and non-parallel Chinese-Vietnamese sentence pairs are collected and preprocessed as training and test corpora, together with comparable corpora from which pseudo-parallel sentence pairs are to be extracted. Parts of speech that differ markedly between Chinese and Vietnamese syntax are then labeled. Next, the external features of the sentences and the Chinese-Vietnamese syntactic difference features are fused into the embedding layer. The output of the embedding layer is passed through a neural network to obtain a sentence feature vector, and a pseudo-parallel corpus extraction model is trained through the computation of a classification layer. Finally, Chinese-Vietnamese pseudo-parallel sentence pairs are extracted. The invention can effectively extract Chinese-Vietnamese pseudo-parallel sentence pairs from Chinese-Vietnamese comparable corpora with high accuracy.

Description

Method for extracting Chinese-Vietnamese pseudo-parallel sentence pairs based on sentence feature vectors
Technical Field
The invention relates to a method for extracting Chinese-Vietnamese pseudo-parallel sentence pairs based on sentence feature vectors, and belongs to the technical field of natural language processing.
Background
Data-driven machine translation (statistical machine translation and neural machine translation) places high demands on the amount of data used to train the model. Neural machine translation in particular has achieved good results for language pairs with large-scale corpora, such as English-French and Chinese-English. For low-resource language pairs with small corpora, such as Chinese-Vietnamese, translation performance is still far from ideal. Extracting Chinese-Vietnamese pseudo-parallel sentence pairs therefore has important application prospects.
Existing work extracts parallel sentence pairs with neural network structures and, building on this, extracts parallel sentence pairs from monolingual corpora based on word embeddings to improve neural machine translation performance; other work uses sentence vectors to select out-of-domain sentence pairs that are related to the in-domain data, improving in-domain translation performance. These methods extract pseudo-parallel sentence pairs effectively and improve machine translation, but most of them judge whether two sentences are parallel by comparing them at the word level, which makes it hard to capture features of the sentences themselves.
A large amount of Chinese-Vietnamese comparable data, such as the Chinese and Vietnamese Wikipedia data, can be crawled from the web, and these comparable corpora contain Chinese-Vietnamese pseudo-parallel sentence pairs. Given a comparable corpus, obtaining the pseudo-parallel sentence pairs it contains is one of the difficulties and key techniques of this task. The invention therefore addresses the problem of extracting Chinese-Vietnamese pseudo-parallel sentence pairs from Chinese-Vietnamese comparable corpora, and proposes a method for doing so based on sentence feature vectors.
Disclosure of Invention
The invention provides a method for extracting Chinese-Vietnamese pseudo-parallel sentence pairs based on sentence feature vectors, which addresses the low accuracy of extracting such pairs from Chinese-Vietnamese comparable corpora.
The technical scheme of the invention is as follows: the method for extracting Chinese-Vietnamese pseudo-parallel sentence pairs based on sentence feature vectors comprises the following steps:
Step 1, corpus collection and preprocessing: collect and preprocess parallel and non-parallel Chinese-Vietnamese sentence pairs as training and test corpora, together with the comparable corpora from which pseudo-parallel sentence pairs will be extracted;
As a preferred embodiment of the invention, Step 1 specifically comprises:
Step 1.1, crawl a certain number of parallel and non-parallel Chinese-Vietnamese sentence pairs from the Internet as training data for the Chinese-Vietnamese pseudo-parallel sentence pair extraction model, each sentence pair being followed by a classification label indicating whether it is parallel; take a small portion of the training data as a test set; then crawl the comparable corpora;
Step 1.2, manually screen the crawled corpora, then annotate them with position labels and sentence labels; then screen the comparable corpora to reduce the number of model computations and the time complexity.
In Step 1.2, the comparable corpora are screened as follows:
The Chinese-Vietnamese pseudo-parallel corpus extraction model casts the extraction of pseudo-parallel sentence pairs as a binary classification problem. Because the Chinese-Vietnamese comparable corpus is large, pre-trained Chinese word embeddings are projected into the Vietnamese word embedding space, so that Chinese and Vietnamese are represented in the same space;
Equation 1 gives the sentence embedding, where $|S|$ is the length of the sentence and $w_i$ is the embedding of the $i$-th word of sentence $S$ in the shared Chinese-Vietnamese space:
$S_{emb} = \frac{1}{|S|} \sum_{i=1}^{|S|} w_i$ (1)
$S(x, y) = \Phi(x_{emb}, y_{emb})$ (2)
In Equation 2, $\Phi(x_{emb}, y_{emb})$ is the cosine similarity of sentence $x$ and sentence $y$ in the shared space. Each sentence is represented by the average of the embeddings of its words, that is, the sentence embedding $S_{emb}$; the similarity of each candidate Chinese-Vietnamese sentence pair is computed from the sentence embeddings to obtain the score $S(x, y)$, and the 10 closest Vietnamese sentences are kept for each Chinese sentence;
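The retrieval described by Equations 1 and 2 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the word vectors have already been projected into a shared space, and the function names and toy 2-dimensional vectors are invented for the example.

```python
import math

def sentence_embedding(word_vectors):
    """Average the word vectors of one sentence (Equation 1)."""
    n = len(word_vectors)
    dim = len(word_vectors[0])
    return [sum(vec[d] for vec in word_vectors) / n for d in range(dim)]

def cosine(u, v):
    """Cosine similarity of two vectors (Equation 2)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k_candidates(zh_emb, vi_embs, k=10):
    """Indices of the k Vietnamese sentences most similar to zh_emb."""
    order = sorted(range(len(vi_embs)),
                   key=lambda j: cosine(zh_emb, vi_embs[j]),
                   reverse=True)
    return order[:k]

# Toy vectors, invented for illustration only.
zh = sentence_embedding([[1.0, 0.0], [1.0, 2.0]])    # averages to [1.0, 1.0]
vi = [
    sentence_embedding([[2.0, 2.0]]),                 # same direction as zh
    sentence_embedding([[1.0, -1.0]]),                # orthogonal to zh
    sentence_embedding([[3.0, 1.0], [1.0, 1.0]]),     # averages to [2.0, 1.0]
]
best = top_k_candidates(zh, vi, k=2)
```

With the patent's setting, `k` would be 10; a small `k` is used here only because the toy corpus has three sentences.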
The length ratio of parallel Chinese-Vietnamese sentence pairs falls within a certain range, and a pair whose length ratio lies outside that range is unlikely to be parallel. The range of length ratios is therefore computed from the Chinese-Vietnamese parallel corpus, and sentence pairs outside the range are removed, which completes the screening of the comparable corpus.
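The length-ratio filter can be sketched as below. The text does not say how the range is derived from the parallel corpus, so this sketch assumes the simplest choice, the minimum and maximum observed ratio of token counts; a percentile-based range would be an equally plausible reading. All names and the placeholder tokens are hypothetical.

```python
def length_ratio(zh_tokens, vi_tokens):
    """Ratio of Chinese token count to Vietnamese token count."""
    return len(zh_tokens) / len(vi_tokens)

def ratio_range(parallel_pairs):
    """(min, max) length ratio observed in known parallel pairs (assumed rule)."""
    ratios = [length_ratio(zh, vi) for zh, vi in parallel_pairs if vi]
    return min(ratios), max(ratios)

def filter_candidates(candidates, low, high):
    """Keep only candidate pairs whose length ratio lies inside [low, high]."""
    return [(zh, vi) for zh, vi in candidates
            if vi and low <= length_ratio(zh, vi) <= high]

# Placeholder tokens; only the sentence lengths matter here.
parallel_pairs = [
    (["z"] * 4, ["v"] * 5),   # ratio 0.8
    (["z"] * 3, ["v"] * 3),   # ratio 1.0
    (["z"] * 6, ["v"] * 5),   # ratio 1.2
]
low, high = ratio_range(parallel_pairs)

candidates = [
    (["z"] * 2, ["v"] * 2),   # ratio 1.0, inside the range
    (["z"] * 1, ["v"] * 5),   # ratio 0.2, outside the range
    (["z"] * 6, ["v"] * 3),   # ratio 2.0, outside the range
]
kept = filter_candidates(candidates, low, high)
```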
Screening the comparable corpora reduces the number of model computations and the time complexity. The model casts pseudo-parallel sentence pair extraction as a binary classification problem: if the Chinese-Vietnamese comparable corpus consists of $10^6$ Chinese sentences and $10^6$ Vietnamese sentences, then $10^6 \times 10^6$ classification computations, each with a costly sentence feature vector computation, would be needed, so pseudo-parallel sentence pairs could not be extracted quickly. Screening selects from the comparable corpus the Chinese-Vietnamese sentence pairs that are more likely to be translations of each other, which greatly reduces the number of model computations and the time complexity.
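As a rough check of the saving, the arithmetic below compares exhaustive pairing with the case where only the 10 nearest Vietnamese sentences are kept for each Chinese sentence. It is an upper bound under that assumption; the length-ratio filter would remove further pairs, by an amount the text does not quantify.

```python
num_zh = 10**6          # Chinese sentences in the comparable corpus
num_vi = 10**6          # Vietnamese sentences in the comparable corpus
top_k = 10              # nearest Vietnamese sentences kept per Chinese sentence

all_pairs = num_zh * num_vi          # pairs to classify without screening
screened_pairs = num_zh * top_k      # upper bound after top-k screening
reduction = all_pairs // screened_pairs

print(all_pairs, screened_pairs, reduction)
```

Under these numbers the classifier is run on at most $10^7$ pairs instead of $10^{12}$, a factor of $10^5$ fewer.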
Step 2, selecting Chinese-Vietnamese syntactic difference features: based on the fact that modifiers are post-positioned in Vietnamese relative to Chinese, label the parts of speech that differ markedly between Chinese and Vietnamese syntax;
Chinese and Vietnamese sentences with simple syntactic structures have essentially the same word order, and the most basic order of syntactic constituents is subject-verb-object (SVO) or subject-verb-complement (SVP). The biggest difference between the two languages is the order of modifiers and head words, where the modifiers include attributives and adverbials; relative to Chinese, Vietnamese places modifiers after the head word;
Because attributives and adverbials in Chinese and Vietnamese are realized by verbs, adverbs, adjectives and nouns, these are the parts of speech labeled as differing markedly between Chinese and Vietnamese syntax.
Words labeled with these parts of speech are more recognizable within a sentence. The labels serve as the Chinese-Vietnamese syntactic difference features and are fused into the embedding layer, which can improve the accuracy of pseudo-parallel sentence pair extraction.
The Chinese and Vietnamese examples in Table 1 show the Chinese-Vietnamese syntactic difference features, which are mainly differences in word order. Step 2 therefore labels verbs, adverbs, adjectives and nouns in Chinese and Vietnamese.
Table 1. Chinese-Vietnamese syntactic difference features
[Table 1 is an image in the original publication (Figure BDA0002127901760000031) and is not reproduced here.]
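One way to turn the Step 2 labels into a feature is to map each token's part-of-speech tag to a small integer, with one shared value for every other part of speech. The sketch below assumes the tokens have already been tagged by some external tagger; the tag names (`NOUN`, `VERB`, `ADJ`, `ADV`) and the integer assignment are illustrative assumptions, not taken from the patent.

```python
# Hypothetical tag names and ids; the patent does not specify a tag set.
POS_FEATURE_IDS = {"NOUN": 1, "VERB": 2, "ADJ": 3, "ADV": 4}
OTHER_ID = 0  # every part of speech that is not labeled in Step 2

def pos_feature_ids(tagged_sentence):
    """Map (token, tag) pairs to part-of-speech feature ids."""
    return [POS_FEATURE_IDS.get(tag, OTHER_ID) for _, tag in tagged_sentence]

# "new book": the adjective precedes the noun in Chinese
# and follows it in Vietnamese.
zh_tagged = [("新", "ADJ"), ("书", "NOUN")]
vi_tagged = [("sách", "NOUN"), ("mới", "ADJ")]

zh_ids = pos_feature_ids(zh_tagged)
vi_ids = pos_feature_ids(vi_tagged)
```

The two id sequences contain the same values in opposite order, which is the kind of modifier-order signal the part-of-speech feature is meant to expose to the model.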
Step 3, construct the embedding layer of the Chinese-Vietnamese pseudo-parallel corpus extraction model, and fuse the external features of the sentences and the Chinese-Vietnamese syntactic difference features in the embedding layer;
In the embedding layer, position must be encoded explicitly, because the neural network does not capture the position of elements when it processes a sequence: two sequences made up of the same elements in different orders produce the same result;
As a preferred embodiment of the invention, Step 3 specifically comprises:
Step 3.1, to address the insensitivity to word position and to let the model distinguish the two sentences, a position embedding (Position Embedding) and a segment embedding (Segment Embedding) are added to the embedding layer;
Step 3.2, the sentence features are vectorized in the embedding layer, and the external sentence features and the Chinese-Vietnamese syntactic difference features are fused by vector addition to form the output of the embedding layer. The external sentence features comprise the conventional word embedding, the segment feature and the position feature; the syntactic difference features comprise the part-of-speech features of verbs, adverbs, adjectives and nouns in Chinese and Vietnamese. Fusing the four embeddings, part-of-speech information included, handles categorical features more effectively.
In Equation 3, $E$ is the embedding of each word output by the embedding layer. It is the sum of four vectors: the conventional word embedding $E_{token}$, the segment feature $E_S$, the position feature $E_P$, and the part-of-speech feature $E_{POS}$ from the Chinese-Vietnamese syntactic difference features:
$E = E_{token} + E_S + E_P + E_{POS}$ (3)
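Equation 3 amounts to four table lookups followed by an element-wise sum. The sketch below uses tiny hand-written lookup tables in place of learned embedding matrices; the table contents, the 2-dimensional size and the function names are all invented for illustration.

```python
def add_vectors(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(components) for components in zip(*vectors)]

def embed(token_ids, segment_ids, pos_ids, tables):
    """Compute E = E_token + E_S + E_P + E_POS for every position."""
    outputs = []
    for position, (tok, seg, pos) in enumerate(zip(token_ids, segment_ids, pos_ids)):
        outputs.append(add_vectors(
            tables["token"][tok],          # E_token
            tables["segment"][seg],        # E_S
            tables["position"][position],  # E_P
            tables["pos"][pos],            # E_POS
        ))
    return outputs

# Toy lookup tables; real tables would be learned during training.
tables = {
    "token":    {0: [1.0, 0.0], 1: [0.0, 1.0]},
    "segment":  {0: [0.5, 0.5], 1: [-0.5, -0.5]},
    "position": {0: [0.0, 0.0], 1: [1.0, 1.0]},
    "pos":      {0: [0.0, 0.0], 1: [0.0, 2.0]},
}
E = embed(token_ids=[0, 1], segment_ids=[0, 1], pos_ids=[1, 0], tables=tables)
```

Because the features are added rather than concatenated, all four tables must share the same vector dimension.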
Step 4, training the Chinese-Vietnamese pseudo-parallel corpus extraction model: the output of the embedding layer from Step 3 is passed through a neural network to obtain a sentence feature vector, and the pseudo-parallel corpus extraction model is then trained through the computation of a classification layer. The trained model can judge whether a sentence pair is a pseudo-parallel pair, so that Chinese-Vietnamese pseudo-parallel corpora can finally be extracted from Chinese-Vietnamese comparable corpora;
The method uses a neural network structure based on the self-attention mechanism, which can quickly extract the important features of sparse data. It extracts the features of a sentence, and the resulting sentence representation can be applied to more downstream tasks with better results. It can determine which information in the source-language and target-language sentences is more relevant to each element of the target-language sentence, and gives higher weight to the more useful information.
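The text does not specify the form of the classification layer. A minimal sketch, assuming a single logistic unit applied to the sentence feature vector with a 0.5 decision threshold, is shown below; the weights, bias and feature vectors are made-up values, not trained parameters.

```python
import math

def parallel_probability(feature_vector, weights, bias):
    """Logistic score that a sentence pair is pseudo-parallel (assumed form)."""
    z = sum(w * h for w, h in zip(weights, feature_vector)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def is_pseudo_parallel(feature_vector, weights, bias, threshold=0.5):
    """Binary decision from the logistic score."""
    return parallel_probability(feature_vector, weights, bias) >= threshold

# Illustrative values only.
weights = [2.0, -1.0]
bias = 0.0
pair_a = [1.0, -1.0]    # pre-activation z = 3.0
pair_b = [-1.0, 1.0]    # pre-activation z = -3.0

decision_a = is_pseudo_parallel(pair_a, weights, bias)
decision_b = is_pseudo_parallel(pair_b, weights, bias)
```

A two-way softmax output would be an equivalent choice for a binary label; the single-unit form is used here only because it is shorter.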
Step 5, extracting Chinese-Vietnamese pseudo-parallel sentence pairs from comparable corpora: use the trained Chinese-Vietnamese pseudo-parallel corpus extraction model to extract Chinese-Vietnamese pseudo-parallel sentence pairs from the Chinese-Vietnamese comparable corpora.
The beneficial effects of the invention are:
1. Whether two Chinese-Vietnamese sentences are parallel is judged mainly by comparing sentence features in a shared Chinese-Vietnamese bilingual space; the sentence feature vector captures features of the sentence itself, which improves the accuracy of judging whether a Chinese-Vietnamese sentence pair is parallel;
2. The method concatenates each Chinese-Vietnamese sentence pair into a single input, fuses the external features of the pair and the Chinese-Vietnamese syntactic difference features in the model, obtains a sentence feature vector through the neural network, and finally feeds the sentence feature vector into the classification layer to obtain the final result.
Drawings
FIG. 1 is a detailed flow chart of the invention;
FIG. 2 is a diagram of the embedding layer of the model proposed by the invention;
FIG. 3 is a structural diagram of the proposed model for extracting Chinese-Vietnamese pseudo-parallel sentence pairs based on sentence feature vectors.
Detailed Description
Example 1: as shown in FIGS. 1-3, the method for extracting Chinese-Vietnamese pseudo-parallel sentence pairs based on sentence feature vectors proceeds as follows.
Step 1, crawl 120,000 parallel Chinese-Vietnamese sentence pairs and 120,000 non-parallel Chinese-Vietnamese sentence pairs from the Internet as training data for the Chinese-Vietnamese pseudo-parallel sentence pair extraction model, each sentence pair being followed by a classification label indicating whether it is parallel. Take 5,000 Chinese-Vietnamese sentence pairs from the training data as a test set, of which 2,500 are parallel and 2,500 are non-parallel, as shown in Table 2; then crawl the comparable corpora;
Table 2. Experimental data
[Table 2 is an image in the original publication (Figure BDA0002127901760000051) and is not reproduced here.]
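A balanced test set like the one described (equal numbers of parallel and non-parallel pairs) can be drawn as sketched below. The sketch assumes the test pairs are held out of the training set, which the text implies but does not state, and it uses a toy corpus of 24 labels instead of the 240,000 pairs in the example; the function name and seed are hypothetical.

```python
import random

def split_balanced_test_set(labels, per_class, seed=0):
    """Return (train_indices, test_indices) with per_class test items per label."""
    rng = random.Random(seed)
    parallel = [i for i, y in enumerate(labels) if y == 1]
    non_parallel = [i for i, y in enumerate(labels) if y == 0]
    test = set(rng.sample(parallel, per_class)) | set(rng.sample(non_parallel, per_class))
    train = [i for i in range(len(labels)) if i not in test]
    return train, sorted(test)

# Toy labels: 1 = parallel, 0 = non-parallel.
labels = [1] * 12 + [0] * 12
train_idx, test_idx = split_balanced_test_set(labels, per_class=3)
```

For the experiment in the text, `per_class` would be 2500 and `labels` would hold 120,000 entries of each class.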
The crawled corpora are manually screened and then annotated with position labels and sentence labels; the comparable corpora are then screened to reduce the number of model computations and the time complexity.
The comparable corpora are screened as follows:
The Chinese-Vietnamese pseudo-parallel corpus extraction model casts the extraction of pseudo-parallel sentence pairs as a binary classification problem. Because the Chinese-Vietnamese comparable corpus is large, pre-trained Chinese word embeddings are projected into the Vietnamese word embedding space, so that Chinese and Vietnamese are represented in the same space;
Equation 1 gives the sentence embedding, where $|S|$ is the length of the sentence and $w_i$ is the embedding of the $i$-th word of sentence $S$ in the shared Chinese-Vietnamese space:
$S_{emb} = \frac{1}{|S|} \sum_{i=1}^{|S|} w_i$ (1)
$S(x, y) = \Phi(x_{emb}, y_{emb})$ (2)
In Equation 2, $\Phi(x_{emb}, y_{emb})$ is the cosine similarity of sentence $x$ and sentence $y$ in the shared space. Each sentence is represented by the average of the embeddings of its words, that is, the sentence embedding $S_{emb}$; the similarity of each candidate Chinese-Vietnamese sentence pair is computed from the sentence embeddings to obtain the score $S(x, y)$, and the 10 closest Vietnamese sentences are kept for each Chinese sentence;
The length ratio of parallel Chinese-Vietnamese sentence pairs falls within a certain range, and a pair whose length ratio lies outside that range is unlikely to be parallel. The range of length ratios is therefore computed from the Chinese-Vietnamese parallel corpus, and sentence pairs outside the range are removed, which completes the screening of the comparable corpus.
Step 2, selecting Chinese-Vietnamese syntactic difference features: based on the fact that modifiers are post-positioned in Vietnamese relative to Chinese, label the parts of speech that differ markedly between Chinese and Vietnamese syntax;
Chinese and Vietnamese sentences with simple syntactic structures have essentially the same word order, and the most basic order of syntactic constituents is subject-verb-object (SVO) or subject-verb-complement (SVP). The biggest difference between the two languages is the order of modifiers and head words, where the modifiers include attributives and adverbials; relative to Chinese, Vietnamese places modifiers after the head word;
Because attributives and adverbials in Chinese and Vietnamese are realized by verbs, adverbs, adjectives and nouns, labeling the parts of speech that differ markedly between Chinese and Vietnamese syntax yields the part-of-speech labels of the verbs, adverbs, adjectives and nouns in both languages.
Step 3, construct the embedding layer of the Chinese-Vietnamese pseudo-parallel corpus extraction model, and fuse the external features of the sentences and the Chinese-Vietnamese syntactic difference features in the embedding layer;
In the embedding layer, position must be encoded explicitly, because the neural network does not capture the position of elements when it processes a sequence: two sequences made up of the same elements in different orders produce the same result;
As a preferred embodiment of the invention, Step 3 specifically comprises:
Step 3.1, to address the insensitivity to word position and to let the model distinguish the two sentences, a position embedding (Position Embedding) and a segment embedding (Segment Embedding) are added to the embedding layer;
Step 3.2, the sentence features are vectorized in the embedding layer, and the external sentence features and the Chinese-Vietnamese syntactic difference features are fused by vector addition to form the output of the embedding layer. The external sentence features comprise the conventional word embedding, the segment feature and the position feature; the syntactic difference features comprise the part-of-speech features of verbs, adverbs, adjectives and nouns in Chinese and Vietnamese.
In Equation 3, $E$ is the embedding of each word output by the embedding layer. It is the sum of four vectors: the conventional word embedding $E_{token}$, the segment feature $E_S$, the position feature $E_P$, and the part-of-speech feature $E_{POS}$ from the Chinese-Vietnamese syntactic difference features:
$E = E_{token} + E_S + E_P + E_{POS}$ (3)
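Before the embeddings can be looked up, the concatenated sentence pair needs a segment index and a position index for every token. The sketch below shows one way to build them; the `[SEP]` separator token and the convention of assigning it to the first segment are assumptions borrowed from common practice, not details given in the patent.

```python
def build_pair_input(zh_tokens, vi_tokens, sep="[SEP]"):
    """Concatenate a sentence pair and derive segment and position ids."""
    tokens = zh_tokens + [sep] + vi_tokens
    # Segment 0 covers the Chinese sentence and the separator, segment 1 the Vietnamese one.
    segment_ids = [0] * (len(zh_tokens) + 1) + [1] * len(vi_tokens)
    # Positions simply count tokens from the start of the concatenated input.
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

tokens, segment_ids, position_ids = build_pair_input(["新", "书"], ["sách", "mới"])
```

The segment ids let the model tell the two sentences apart, and the position ids restore the word-order information that the network would otherwise ignore.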
Step 4, training the Chinese-Vietnamese pseudo-parallel corpus extraction model: the output of the embedding layer from Step 3 is passed through a neural network to obtain a sentence feature vector, and the pseudo-parallel corpus extraction model is then trained through the computation of a classification layer;
The method uses a neural network structure based on the self-attention mechanism, which can quickly extract the important features of sparse data.
The neural network used to obtain the sentence feature vector in Step 4 is a self-attention network, which can effectively extract the features of the sentence itself; the feature vector produced by self-attention is based on the sentence itself. The process is as follows:
A. The elements of the sentence formed by concatenating the Chinese and Vietnamese sentences are regarded as a series of <key, value> pairs together with a series of queries; the similarity between a query and each key is computed to obtain the weight coefficients, where common similarity functions include the dot product, concatenation and a perceptron;
B. The weight coefficients are normalized with a softmax function;
C. Finally, the weights are used to form a weighted sum of the corresponding values, which gives the final attention result;
The calculation formula is:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$
where $Q$, $K$ and $V$ are the queries, keys and values of the sentence, and $d_k$ is the dimension of the $Q$, $K$ and $V$ vectors.
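Steps A to C can be sketched in plain Python as below. The sketch assumes the common scaled dot-product form, in which the dot-product similarity is divided by the square root of $d_k$ before the softmax; the toy matrices are invented, and a real model would also apply learned projections to obtain the queries, keys and values.

```python
import math

def softmax(scores):
    """Normalize scores into weights that sum to 1 (step B)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Step A: similarity of the query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Step C: weighted sum of the values.
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs

# A zero query is equally similar to every key, so the output is the mean of V.
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attention([[0.0, 0.0]], K, V)
```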
Step 5, extracting Chinese-Vietnamese pseudo-parallel sentence pairs from comparable corpora: use the trained Chinese-Vietnamese pseudo-parallel corpus extraction model to extract Chinese-Vietnamese pseudo-parallel sentence pairs from the Chinese-Vietnamese comparable corpora. Table 3 shows pseudo-parallel sentence pairs extracted from the Chinese-Vietnamese comparable corpus, and Table 4 shows the accuracy of the different methods in extracting Chinese-Vietnamese pseudo-parallel sentence pairs from that corpus.
Table 3. Extracted Chinese-Vietnamese pseudo-parallel sentence pairs
[Table 3 is an image in the original publication (Figures BDA0002127901760000072 and BDA0002127901760000081) and is not reproduced here.]
Table 4. Experimental results for different extraction methods
Method                 Accuracy (%)
LSTM                   59.16
LSTM+POS               60.45
Self-attention         62.94
Self-attention+POS     63.32
The data above show that, among the different pseudo-parallel sentence pair extraction methods, using only an LSTM (Long Short-Term Memory network, a recurrent neural network suited to processing and predicting events separated by relatively long intervals and delays in a time series) in the sentence-feature-vector model gives the lowest accuracy, 59.16%. LSTM+POS (LSTM with part-of-speech features) reaches 60.45%, Self-attention (the self-attention mechanism alone) reaches 62.94%, and the Self-attention+POS configuration of the invention (self-attention with part-of-speech features) achieves the highest accuracy, 63.32%.
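The size of each effect in Table 4 can be read off with simple arithmetic. The snippet below only restates the four reported accuracies and computes their differences in percentage points; it does not reproduce the experiment.

```python
# Accuracies (%) as reported in Table 4.
accuracy = {
    "LSTM": 59.16,
    "LSTM+POS": 60.45,
    "Self-attention": 62.94,
    "Self-attention+POS": 63.32,
}

# Gain from adding part-of-speech features to each encoder.
pos_gain_lstm = round(accuracy["LSTM+POS"] - accuracy["LSTM"], 2)
pos_gain_attn = round(accuracy["Self-attention+POS"] - accuracy["Self-attention"], 2)

# Gain from replacing the LSTM with self-attention, without POS features.
encoder_gain = round(accuracy["Self-attention"] - accuracy["LSTM"], 2)

# Gain of the full configuration over the plain LSTM baseline.
total_gain = round(accuracy["Self-attention+POS"] - accuracy["LSTM"], 2)
```

On these figures the switch from LSTM to self-attention accounts for most of the improvement (3.78 points), while the part-of-speech features add 1.29 points on the LSTM and 0.38 points on self-attention.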
While the invention has been described in detail with reference to the embodiments shown in the drawings, it is not limited to those embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the invention.

Claims (1)

1. A method for extracting Chinese-Vietnamese pseudo-parallel sentence pairs based on sentence feature vectors, characterized in that the method comprises the following steps:
Step 1, corpus collection and preprocessing: collecting and preprocessing parallel and non-parallel Chinese-Vietnamese sentence pairs as training and test corpora, together with comparable corpora from which pseudo-parallel sentence pairs are to be extracted;
Step 2, selecting Chinese-Vietnamese syntactic difference features: based on the fact that modifiers are post-positioned in Vietnamese relative to Chinese, labeling the parts of speech that differ markedly between Chinese and Vietnamese syntax;
Step 3, constructing the embedding layer of the Chinese-Vietnamese pseudo-parallel corpus extraction model, and fusing the external features of the sentences and the Chinese-Vietnamese syntactic difference features in the embedding layer;
Step 4, training the Chinese-Vietnamese pseudo-parallel corpus extraction model: the output of the embedding layer from Step 3 is passed through a neural network to obtain a sentence feature vector, and the pseudo-parallel corpus extraction model is then trained through the computation of a classification layer;
Step 5, extracting Chinese-Vietnamese pseudo-parallel sentence pairs from comparable corpora: using the trained Chinese-Vietnamese pseudo-parallel corpus extraction model to extract Chinese-Vietnamese pseudo-parallel sentence pairs from the Chinese-Vietnamese comparable corpora;
Step 1 specifically comprises:
Step 1.1, using a crawler to collect parallel and non-parallel Chinese-Vietnamese sentence pairs as training data for the Chinese-Vietnamese pseudo-parallel sentence pair extraction model, each sentence pair carrying a classification label indicating whether it is parallel, selecting a test set from the training data, and crawling the comparable corpora;
Step 1.2, manually screening the crawled corpora, then annotating them with position labels and sentence labels; then screening the comparable corpora to reduce the number of model computations and the time complexity;
in Step 1.2, the comparable corpora are screened as follows:
the Chinese-Vietnamese pseudo-parallel corpus extraction model casts the extraction of pseudo-parallel sentence pairs as a binary classification problem; because the Chinese-Vietnamese comparable corpus is large, pre-trained Chinese word embeddings are projected into the Vietnamese word embedding space, so that Chinese and Vietnamese are represented in the same space;
Equation 1 gives the sentence embedding, where $|S|$ is the length of the sentence and $w_i$ is the embedding of the $i$-th word of sentence $S$ in the shared Chinese-Vietnamese space:
$S_{emb} = \frac{1}{|S|} \sum_{i=1}^{|S|} w_i$ (1)
$S(x, y) = \Phi(x_{emb}, y_{emb})$ (2)
in Equation 2, $\Phi(x_{emb}, y_{emb})$ is the cosine similarity of sentence $x$ and sentence $y$ in the shared space; each sentence is represented by the average of the embeddings of its words, that is, the sentence embedding $S_{emb}$; the similarity of each candidate Chinese-Vietnamese sentence pair is computed from the sentence embeddings to obtain the score $S(x, y)$, and the 10 closest Vietnamese sentences are kept for each Chinese sentence;
because the length ratio of the parallel Chinese-cross sentence pairs is within a certain range, if the length of the Chinese-cross sentence pairs is not within the range, the parallel probability of the Chinese-cross sentence pairs is low, the length ratio range of the Chinese-cross parallel sentence pairs is counted according to the Chinese-cross parallel corpus, and the sentence pairs beyond the range are removed, so that the comparable corpus is screened out;
in Step 2:
Chinese and Vietnamese sentences with simple syntactic structures have essentially the same word order, and the most basic order of syntactic components is subject-verb-object (SVO) or subject-verb-complement (SVP); the biggest difference between Chinese and Vietnamese lies in the order of modifiers relative to the head word, where the modifiers comprise attributives and adverbials; compared with Chinese, Vietnamese places modifiers after the head word;
based on this difference in modifier order, and because the words that serve as attributives and adverbials in Chinese and Vietnamese include verbs, adverbs, adjectives and nouns, the parts of speech that differ most in Chinese-Vietnamese syntax are tagged, yielding the part-of-speech labels of the verbs, adverbs, adjectives and nouns in Chinese and Vietnamese;
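A small sketch of how the four tagged parts of speech might be turned into feature ids follows. It assumes the sentences have already been tagged by some external tagger; the tag names (`N`, `V`, `A`, `R`) and the id assignment are hypothetical and are not specified in the patent.

```python
# Hypothetical tagset: nouns, verbs, adjectives and adverbs get their own id,
# every other part of speech shares id 0.
POS_FEATURE_IDS = {"OTHER": 0, "N": 1, "V": 2, "A": 3, "R": 4}


def pos_feature_ids(tagged_sentence):
    """Map a list of (word, tag) pairs to part-of-speech feature ids."""
    return [POS_FEATURE_IDS.get(tag, POS_FEATURE_IDS["OTHER"])
            for _, tag in tagged_sentence]


# "good book": the adjective precedes the noun in Chinese and follows it in Vietnamese.
zh_ids = pos_feature_ids([("好", "A"), ("书", "N")])
vi_ids = pos_feature_ids([("sách", "N"), ("hay", "A")])
```

The two id sequences are mirror images of each other, which is the modifier-order difference the part-of-speech feature is meant to expose.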
the specific steps of Step3 are as follows:
Step3.1, so that the model is sensitive to word position and can distinguish the two sentences, a position embedding layer (Position Embedding) and a segment embedding layer (Segment Embedding) are added to the embedding layer;
Step3.2, the sentence features are vectorized in the embedding layer, and the external features of the sentences and the Chinese-Vietnamese syntactic difference features are then merged in the embedding layer by vector addition; the external features of a sentence comprise conventional word embeddings, sentence segmentation features and position information features, and the Chinese-Vietnamese syntactic difference features comprise the part-of-speech features of verbs, adverbs, adjectives and nouns in Chinese and Vietnamese;
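The vector-addition merge of Step3.2 can be sketched as below: each token's input vector is the sum of its word, position, segment and part-of-speech embeddings. The table sizes, the random initialisation and the names `make_table` and `embed` are assumptions for illustration; in a trained model the tables would be learned parameters.

```python
import random


def make_table(rows, dim, rng):
    """Create a small randomly initialised embedding table (rows x dim)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]


def embed(token_ids, segment_ids, pos_tag_ids, tables):
    """Sum word, position, segment and part-of-speech embeddings for each token."""
    out = []
    for position, (tok, seg, tag) in enumerate(
            zip(token_ids, segment_ids, pos_tag_ids)):
        parts = (tables["token"][tok],
                 tables["position"][position],
                 tables["segment"][seg],
                 tables["pos_tag"][tag])
        out.append([sum(vals) for vals in zip(*parts)])
    return out


rng = random.Random(0)
dim = 4
tables = {
    "token": make_table(10, dim, rng),     # vocabulary of 10 ids
    "position": make_table(8, dim, rng),   # up to 8 positions
    "segment": make_table(2, dim, rng),    # sentence A / sentence B
    "pos_tag": make_table(5, dim, rng),    # other, noun, verb, adjective, adverb
}

# Three tokens: the first two belong to sentence A, the last to sentence B.
x = embed([1, 2, 3], [0, 0, 1], [3, 1, 0], tables)
```

Because the features are added rather than concatenated, all four tables must share the same dimension.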
when the sentence feature vector is obtained through the neural network in Step4, the network used is based on the self-attention mechanism, which effectively extracts the features of the sentence itself, so that the feature vector it generates is derived from the sentence itself.
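To make the self-attention step concrete, here is a simplified sketch of scaled dot-product self-attention over the token vectors of one sentence, followed by mean pooling into a single sentence feature vector. It omits the learned query, key and value projections of a real network, and the mean pooling is an assumption; the patent does not specify either detail.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X):
    """Scaled dot-product self-attention with queries = keys = values = X."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out


def sentence_vector(X):
    """Mean-pool the attended token vectors into one sentence feature vector."""
    H = self_attention(X)
    return [sum(h[j] for h in H) / len(H) for j in range(len(H[0]))]


# Two identical token vectors: attention leaves them unchanged.
vec = sentence_vector([[1.0, 0.0], [1.0, 0.0]])
```

Each output row is a weighted mix of all token vectors in the same sentence, which is why the resulting feature vector depends only on the sentence itself.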
CN201910628354.3A 2019-07-12 2019-07-12 Method for extracting Hanyue pseudo parallel sentence pair based on sentence characteristic vector Active CN110489624B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910628354.3A CN110489624B (en) 2019-07-12 2019-07-12 Method for extracting Hanyue pseudo parallel sentence pair based on sentence characteristic vector

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910628354.3A CN110489624B (en) 2019-07-12 2019-07-12 Method for extracting Hanyue pseudo parallel sentence pair based on sentence characteristic vector

Publications (2)

Publication Number Publication Date
CN110489624A CN110489624A (en) 2019-11-22
CN110489624B true CN110489624B (en) 2022-07-19

Family

ID=68547034

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910628354.3A Active CN110489624B (en) 2019-07-12 2019-07-12 Method for extracting Hanyue pseudo parallel sentence pair based on sentence characteristic vector

Country Status (1)

Country Link
CN (1) CN110489624B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112287688B (en) * 2020-09-17 2022-02-11 昆明理工大学 English-Burmese bilingual parallel sentence pair extraction method and device integrating pre-training language model and structural features
CN113505571A (en) * 2021-07-30 2021-10-15 沈阳雅译网络技术有限公司 Data selection and training method for neural machine translation
CN114970571B (en) * 2022-06-23 2024-08-27 昆明理工大学 Hantai pseudo parallel sentence pair generation method based on double discriminators

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002082946A (en) * 2000-09-08 2002-03-22 Atr Onsei Gengo Tsushin Kenkyusho:Kk Translation word selecting method and computer-readable recording medium with translation word selection processing program recorded thereon
CN102591857A (en) * 2011-01-10 2012-07-18 富士通株式会社 Bilingual corpus resource acquisition method and bilingual corpus resource acquisition system
CN102945231A (en) * 2012-10-19 2013-02-27 中国科学院计算技术研究所 Construction method and system of incremental-translation-oriented structured language model
CN104133848A (en) * 2014-07-01 2014-11-05 中央民族大学 Tibetan language entity knowledge information extraction method
CN104281716A (en) * 2014-10-30 2015-01-14 百度在线网络技术(北京)有限公司 Parallel corpus alignment method and device
CN104391885A (en) * 2014-11-07 2015-03-04 哈尔滨工业大学 Method for extracting chapter-level parallel phrase pair of comparable corpus based on parallel corpus training
CN104991890A (en) * 2015-07-15 2015-10-21 昆明理工大学 Method for constructing Vietnamese dependency tree bank on basis of Chinese-Vietnamese vocabulary alignment corpora
CN105512114A (en) * 2015-12-14 2016-04-20 清华大学 Parallel sentence pair screening method and system
CN106202035A (en) * 2016-06-30 2016-12-07 昆明理工大学 Vietnamese conversion of parts of speech disambiguation method based on combined method
CN106919619A (en) * 2015-12-28 2017-07-04 阿里巴巴集团控股有限公司 Commodity clustering method, device and electronic equipment
CN108460028A (en) * 2018-04-12 2018-08-28 苏州大学 Domain adaptation method incorporating sentence weights into neural machine translation

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106484682B (en) * 2015-08-25 2019-06-25 阿里巴巴集团控股有限公司 Machine translation method, device and electronic equipment based on statistics
US10193833B2 (en) * 2016-03-03 2019-01-29 Oath Inc. Electronic message composition support method and apparatus

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
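"Chinese-Vietnamese pseudo-parallel sentence pair extraction based on sentence feature vectors"; 翟家欣 et al.; Journal of Shanxi University (Natural Science Edition); 20190816; Vol. 42, No. 4; 770-776 *
"Neural machine translation based on data augmentation techniques"; 蔡子龙 et al.; Journal of Chinese Information Processing; 20180715; Vol. 32, No. 7; 30-36 *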

Also Published As

Publication number Publication date
CN110489624A (en) 2019-11-22

Similar Documents

Publication Publication Date Title
Ray et al. A mixed approach of deep learning method and rule-based method to improve aspect level sentiment analysis
CN110442760B (en) Synonym mining method and device for question-answer retrieval system
Gupta et al. MMQA: A multi-domain multi-lingual question-answering framework for English and Hindi
CN112668319B (en) Vietnamese news event detection method based on Chinese information and Vietnamese statement method guidance
CN110378409A (en) Chinese-Vietnamese news document summary generation method based on an element-association attention mechanism
CN104951548A (en) Method and system for calculating negative public opinion index
CN110489624B (en) Method for extracting Hanyue pseudo parallel sentence pair based on sentence characteristic vector
Zotova et al. Multilingual stance detection in tweets: The Catalonia independence corpus
Kapusta et al. Comparison of fake and real news based on morphological analysis
CN114417851B (en) Emotion analysis method based on keyword weighted information
Song et al. Toward any-language zero-shot topic classification of textual documents
CN111259156A (en) Hot spot clustering method facing time sequence
Avetisyan et al. Word embeddings for the armenian language: intrinsic and extrinsic evaluation
Lee et al. Detecting suicidality with a contextual graph neural network
Stankevičius et al. Testing pre-trained Transformer models for Lithuanian news clustering
pal Singh et al. Naive Bayes classifier for word sense disambiguation of Punjabi language
Gao et al. Attention-based BiLSTM network with lexical feature for emotion classification
Rao et al. ASRtrans at semeval-2022 task 5: Transformer-based models for meme classification
Bhargava et al. Deep paraphrase detection in indian languages
Santos et al. Morphological Skip-Gram: Replacing FastText characters n-gram with morphological knowledge
Putra et al. Sentence boundary disambiguation for Indonesian language
Zheng et al. A novel hierarchical convolutional neural network for question answering over paragraphs
Zotova et al. Multilingual stance detection: The catalonia independence corpus
CN115129818A (en) Knowledge-driven multi-classification-based emotion reason pair extraction method and system
Wai Myanmar language part-of-speech tagging using deep learning models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant