CN111695361A - Method for constructing Chinese-English bilingual corpus and related equipment thereof

Info

Publication number: CN111695361A
Application number: CN202010356769.2A
Authority: CN
Inventors: 邓悦; 金戈; 徐亮
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2020-04-29
Filing date: 2020-04-29
Publication date: 2020-09-22
Also published as: WO2021218012A1

Abstract

The invention relates to the technical field of computers and provides a method for constructing a Chinese-English bilingual corpus and related equipment thereof. The method comprises the following steps: acquiring Chinese entities, English entities, and the mapping relations and inter-translation relations between the Chinese entities and the English entities, and constructing a bilingual entity word network according to a preset requirement; calculating a monolingual representation estimated value and a cross-language entity estimated value of the bilingual entity word network according to the Chinese entities, the English entities, the context words, a preset hyperlink set and a preset sentence set; calculating, by using training sentences, a cross-language sentence estimated value corresponding to an acquired comparable sentence network; calculating a target estimated value from the three estimated values; and, according to the target estimated value, combining the bilingual entity word network and the comparable sentence network into a Chinese-English bilingual corpus and storing the corpus in a blockchain. By exploiting the relevance between the two networks, the invention improves the accuracy of the corpus data in the Chinese-English bilingual corpus.

Description

Method for constructing Chinese-English bilingual corpus and related equipment thereof

Technical Field

The invention relates to the technical field of computers, in particular to a method for constructing a Chinese-English bilingual corpus and related equipment thereof.

Background

At present, the traditional methods of training a Chinese-English bilingual corpus for a bilingual dialogue system fall into two main categories. One method uses a corpus from the corresponding field to match the Chinese-English bilingual corpus under test and does not require a parallel corpus, but its training process is unstable and highly complex, so it is limited to small-scale data and its accuracy is low. The other method uses existing multilingual resources to automatically generate 'pseudo bilingual documents'; it is stable, but because of the large data volume and its uncertainty, training is time-consuming and insufficiently accurate. Therefore, when a dialogue system uses such a Chinese-English bilingual corpus for recognition, semantic recognition errors occur, which further reduces the accuracy of the dialogue system.

Disclosure of Invention

The embodiment of the invention provides a method for constructing a Chinese-English bilingual corpus and related equipment thereof, which are used for solving the problems that the traditional Chinese-English bilingual corpus has low training accuracy and that the accuracy of a dialogue system applying the Chinese-English bilingual corpus is consequently low.

A method for constructing a Chinese-English bilingual corpus comprises the following steps:

acquiring a Chinese entity, an English entity and a mapping relation and a translation relation between the Chinese entity and the English entity from a preset entity library;

constructing a bilingual entity word network according to the Chinese entity, the English entity, the mapping relation and the inter-translation relation and according to preset requirements;

acquiring context words corresponding to each Chinese entity and each English entity from a preset database;

calculating a monolingual representation estimated value and a cross-language entity estimated value of the bilingual entity word network based on the Chinese entity, the English entity, the context words, a preset hyperlink set and a preset sentence set;

acquiring a comparable sentence network and training sentences, and calculating a cross-language sentence estimated value corresponding to the comparable sentence network by using the training sentences;

carrying out weighted summation on the monolingual representation estimated value, the cross-language entity estimated value and the cross-language sentence estimated value to obtain a target estimated value;

and comparing the target estimated value with a preset threshold value, and combining the bilingual entity word network and the comparable sentence network into a Chinese-English bilingual corpus if a preset condition is met.

A device for constructing a Chinese-English bilingual corpus comprises:

the first acquisition module is used for acquiring Chinese entities, English entities and mapping relations and inter-translation relations between the Chinese entities and the English entities from a preset entity library;

the construction module is used for constructing a bilingual entity word network according to the Chinese entity, the English entity, the mapping relation and the inter-translation relation and according to a preset requirement;

the second acquisition module is used for acquiring context words corresponding to each Chinese entity and each English entity from a preset database;

the first calculation module is used for calculating the monolingual representation estimated value and the cross-language entity estimated value of the bilingual entity word network based on the Chinese entity, the English entity, the context words, the preset hyperlink set and the preset sentence set;

the second calculation module is used for acquiring a comparable sentence network and training sentences and calculating a cross-language sentence estimated value corresponding to the comparable sentence network by using the training sentences;

the summation module is used for carrying out weighted summation on the monolingual representation estimated value, the cross-language entity estimated value and the cross-language sentence estimated value to obtain a target estimated value;

and the combination module is used for comparing the target estimated value with a preset threshold value, and combining the bilingual entity word network and the comparable sentence network into a Chinese-English bilingual corpus if a preset condition is met.

A computer device, comprising a memory, a processor and a computer program stored in the memory and operable on the processor, wherein the processor implements the steps of the above-mentioned method for constructing a Chinese-English bilingual corpus when executing the computer program.

A computer-readable storage medium, which stores a computer program that, when executed by a processor, implements the steps of the above-mentioned method for constructing a Chinese-English bilingual corpus.

According to the method for constructing a Chinese-English bilingual corpus and the related equipment thereof, the bilingual entity word network is constructed based on the mapping relationship and the inter-translation relationship, which strengthens the association between the Chinese entities and the English entities. By calculating the monolingual representation estimated value, the cross-language entity estimated value and the cross-language sentence estimated value, whether the bilingual entity word network and the comparable sentence network meet the set requirements can be accurately judged. Finally, when the target estimated value meets the preset condition, the bilingual entity word network and the comparable sentence network are combined into the Chinese-English bilingual corpus. Because the corpus is composed of two networks, the relevance between different corpus data in it is improved, which in turn improves the accuracy of the corpus data and the accuracy of a dialogue system that uses the corpus.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive labor.

FIG. 1 is a flowchart of a method for constructing a Chinese-English bilingual corpus according to an embodiment of the present invention;

FIG. 2 is a flowchart of step S2 in the method for constructing a Chinese-English bilingual corpus according to the embodiment of the present invention;

FIG. 3 is a flowchart of step S4 in the method for constructing a Chinese-English bilingual corpus according to the embodiment of the present invention;

FIG. 4 is a flowchart of step S5 in the method for constructing a Chinese-English bilingual corpus according to the embodiment of the present invention;

FIG. 5 is a flowchart of step S53 in the method for constructing a Chinese-English bilingual corpus according to the embodiment of the present invention;

FIG. 6 is a flowchart of step S7 in the method for constructing a Chinese-English bilingual corpus according to the embodiment of the present invention;

FIG. 7 is a schematic diagram of an apparatus for constructing a Chinese-English bilingual corpus according to an embodiment of the present invention;

FIG. 8 is a block diagram of the basic structure of a computer device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The method for constructing the Chinese-English bilingual corpus is applied to the server side, and the server side can be specifically realized by an independent server or a server cluster consisting of a plurality of servers. In an embodiment, as shown in fig. 1, a method for constructing a chinese-english bilingual corpus is provided, which includes the following steps:

S1: Obtaining the Chinese entity, the English entity, and the mapping relation and the inter-translation relation between the Chinese entity and the English entity from a preset entity library.

In the embodiment of the invention, the mapping relation between the Chinese entity and the English entity refers to setting a connection relation between the Chinese entity and the English entity according to the actual requirement of a user. For example, there is a mapping relationship between the Chinese entity "apple" and the English entity "front".

It should be noted that the inter-translation relationship refers to a translation relationship between a Chinese entity and an English entity. For example, if the Chinese entity is "apple" and the English entity is "apple", then since the English word corresponding to the Chinese entity "apple" is "apple", the Chinese entity and the English entity are in an inter-translation relationship.

Specifically, the Chinese entity, the English entity, and the mapping relationship and the inter-translation relationship between the Chinese entity and the English entity are obtained from a preset entity library. The preset entity library is a database specially used for storing Chinese entities, English entities and mapping relations and inter-translation relations between the Chinese entities and the English entities.

S2: Constructing a bilingual entity word network according to the Chinese entity, the English entity, the mapping relation and the inter-translation relation, in accordance with the preset requirement.

In the embodiment of the present invention, a bilingual entity word network is constructed, in accordance with the preset requirement, from the Chinese entity, the English entity, and the mapping relationship and inter-translation relationship between them obtained in step S1.

The preset requirement refers to a requirement for constructing a bilingual entity word network set according to actual requirements of a user.

S3: Obtaining context words corresponding to each Chinese entity and each English entity from a preset database.

In the embodiment of the invention, a context word refers to a word that has an association relation with a Chinese entity or an English entity in a conversation scene. The Chinese entity and the English entity are each matched against every legal word in the preset database. When a legal word is the same as the Chinese entity, the context words corresponding to that legal word are taken as the context words of the Chinese entity; the context words of the English entity are obtained in the same way.

The preset database is a database specially used for storing legal words and context words corresponding to the legal words.

For example, the preset database contains the legal words "apple" and "pear", where the context words corresponding to "apple" are "apple" and "fruit", and the context words corresponding to "pear" are "pear" and "fruit". If the Chinese entity is "apple", it is matched against the legal words "apple" and "pear"; the legal word "apple" is the same as the Chinese entity, so its context words "apple" and "fruit" are taken as the context words of the Chinese entity.
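The lookup in step S3 can be sketched as follows. This is a minimal illustration only: the dictionary `PRESET_DATABASE` and the function `context_words_for` are hypothetical names, and an in-memory dictionary stands in for the preset database, whose actual storage form the description does not specify.

```python
# Hypothetical stand-in for the preset database: legal word -> context words.
PRESET_DATABASE = {
    "apple": ["apple", "fruit"],
    "pear": ["pear", "fruit"],
}


def context_words_for(entity, database=PRESET_DATABASE):
    """Return the context words of the legal word that equals the entity.

    An empty list is returned when no legal word matches (an assumption;
    the description does not say what happens in that case).
    """
    return list(database.get(entity, []))


print(context_words_for("apple"))  # ['apple', 'fruit']
```

Exact string equality is used here because the description speaks of the entity and the legal word being "the same"; any fuzzier matching would be a different design.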

S4: Calculating the monolingual representation estimated value and the cross-language entity estimated value of the bilingual entity word network based on the Chinese entity, the English entity, the context words, the preset hyperlink set and the preset sentence set.

In the embodiment of the invention, based on the Chinese entity, the English entity, the context word, the preset hyperlink set and the preset statement set, the single-language representation estimated value and the cross-language entity estimated value corresponding to the bilingual entity word network are calculated according to a preset calculation formula. The preset calculation formula refers to a formula set by a user and used for calculating the corresponding single-language representation valuation and the cross-language entity valuation.

The preset set of hyperlinks refers to a set of user-selected hyperlinks.

The preset sentence set is a set of sentences selected by the user from the Baidu encyclopedia in advance.

S5: Acquiring a comparable sentence network and training sentences, and calculating a cross-language sentence estimated value corresponding to the comparable sentence network by using the training sentences.

Specifically, a comparable sentence network and a training sentence are obtained from a preset initial library, and a cross-language sentence evaluation value corresponding to the comparable sentence network is calculated by using a preset calculation mode and the training sentence.

The preset initial library is a database which is specially used for storing comparable statement networks and training statements.

The preset calculation mode refers to a calculation method for calculating cross-language sentence estimation values corresponding to a comparable sentence network according to training sentences.

It should be noted that the comparable sentence network refers to a network formed by chinese sentences, english sentences, and the association relationship between the chinese sentences and the english sentences.

S6: Carrying out weighted summation on the monolingual representation estimated value, the cross-language entity estimated value and the cross-language sentence estimated value to obtain a target estimated value.

Specifically, the monolingual representation estimated value, the cross-language entity estimated value and the cross-language sentence estimated value are each multiplied by the corresponding preset weight value, and the products are summed to obtain the target estimated value.

The preset weight value is a ratio preset by a user, and a specific value of the preset weight value may be 0.3, and may also be set according to the actual needs of the user, which is not limited herein.
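The weighted summation of step S6 can be sketched as below. The function name `target_estimate` and its parameter names are hypothetical, and using 0.3 for all three weights is an assumption made for illustration; the description only states that a weight may be 0.3 or any other user-chosen value.

```python
def target_estimate(monolingual, cross_entity, cross_sentence,
                    weights=(0.3, 0.3, 0.3)):
    """Weighted sum of the three estimated values (step S6).

    weights holds one preset weight per estimated value, in the same order
    as the arguments. The default of 0.3 each is illustrative only.
    """
    w_mono, w_entity, w_sentence = weights
    return (w_mono * monolingual
            + w_entity * cross_entity
            + w_sentence * cross_sentence)


# Example with made-up estimated values and explicit weights.
value = target_estimate(1.0, 2.0, 3.0, weights=(0.5, 0.25, 0.25))
print(value)
```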

S7: Comparing the target estimated value with a preset threshold value, and combining the bilingual entity word network and the comparable sentence network into a Chinese-English bilingual corpus if a preset condition is met.

In the embodiment of the present invention, the preset condition is a condition set by a user according to an actual demand, and may be specifically set according to a comparison result obtained by comparing a target estimation value with a preset threshold.

Specifically, the target estimation value is compared with a preset threshold value, and if the target estimation value is smaller than or equal to the preset threshold value, the current bilingual entity word network and the comparable sentence network are combined into a Chinese-English bilingual corpus; and if the comparison result is that the target estimation value is larger than the preset threshold, carrying out iterative updating on the current bilingual entity word network and the comparable sentence network until the target estimation value is smaller than or equal to the threshold.

The preset threshold is a value set according to the actual requirement of the user, and is not limited herein.
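The compare-and-iterate logic of step S7 can be sketched as follows. The callables `estimate` and `update` are hypothetical placeholders for the target-estimate calculation and for the iterative update, neither of which is specified at this point in the description. The `max_iterations` guard is also an assumption added so the sketch always terminates.

```python
def build_corpus(entity_network, sentence_network, estimate, update,
                 threshold, max_iterations=100):
    """Combine the two networks once the target estimate is <= threshold.

    estimate(entity_network, sentence_network) -> float   (hypothetical)
    update(entity_network, sentence_network) -> (new_entity, new_sentence)
    """
    for _ in range(max_iterations):
        if estimate(entity_network, sentence_network) <= threshold:
            # Preset condition met: the two networks form the corpus.
            return {"entity_network": entity_network,
                    "sentence_network": sentence_network}
        # Otherwise update both networks and compare again.
        entity_network, sentence_network = update(entity_network,
                                                  sentence_network)
    raise RuntimeError("target estimate did not reach the threshold")


# Toy usage: numbers stand in for networks, the estimate is their sum.
corpus = build_corpus(5, 5,
                      estimate=lambda e, s: e + s,
                      update=lambda e, s: (e - 1, s - 1),
                      threshold=4)
print(corpus)  # {'entity_network': 2, 'sentence_network': 2}
```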

In the embodiment, a bilingual entity word network is constructed based on a mapping relation and a mutual translation relation, the association relation between a Chinese entity and an English entity can be strengthened, whether the bilingual entity word network and the comparable sentence network meet set requirements can be accurately judged by calculating a single-language representation estimated value, a cross-language entity estimated value and a cross-language sentence estimated value, and finally, a Chinese-English bilingual corpus is formed by the bilingual entity word network and the comparable sentence network under the condition that a target estimated value meets preset conditions.

In one embodiment, as shown in FIG. 2, step S2 of constructing a bilingual entity word network according to the Chinese entity, the English entity, the mapping relationship and the inter-translation relationship, in accordance with the preset requirement, includes the following steps:

S21: Obtaining all Chinese entities as a first set and all English entities as a second set.

Specifically, all the chinese entities acquired in step S1 are taken as the first set, and all the english entities acquired are taken as the second set.

S22: Acquiring, from a mapping database, first mapping entities having a mapping relation with the Chinese entities as a third set, and second mapping entities having a mapping relation with the English entities as a fourth set, wherein the mapping database comprises the first mapping entities and the second mapping entities.

In an embodiment of the present invention, the mapping database includes different first entities and second entities, each first entity has its corresponding first mapping entity, each second entity has its corresponding second mapping entity, and there is a mapping relationship between the first mapping entity and the first entity and a mapping relationship between the second mapping entity and the second entity.

Matching the Chinese entity with a first entity in a mapping database, if the Chinese entity is the same as the first entity, acquiring a first mapping entity corresponding to the first entity, and taking all the first mapping entities as a third set; similarly, matching the english entity with the second entity, if the english entity is the same as the second entity, obtaining a second mapping entity corresponding to the second entity, and taking all the second mapping entities as a fourth set.

S23: Acquiring the Chinese entities and the English entities that have an inter-translation relationship as inter-translation entities, and combining all the inter-translation entities into a fifth set.

In the embodiment of the invention, Chinese entities and English entities with mutual translation relations are obtained from a preset entity library to be used as mutual translation entities, and all the mutual translation entities are combined into a fifth set.

S24: constructing a bilingual entity word network according to a formula (1) based on the first set, the second set, the third set, the fourth set and the fifth set:

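$E = (\mathcal{E}^{zh} \cup \mathcal{E}^{en},\ R^{zh} \cup R^{en} \cup R)$    formula (1)

Wherein E is the bilingual entity word network, $\mathcal{E}^{zh}$ is the first set, $\mathcal{E}^{en}$ is the second set, $R^{zh}$ is the third set, $R^{en}$ is the fourth set, and $R$ is the fifth set.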

Specifically, according to the first set, the second set, the third set, the fourth set and the fifth set, the five sets are combined into a new set according to formula (1), and the new set is used as a bilingual entity word network.

In this embodiment, the first set, the second set, the third set, the fourth set and the fifth set are obtained respectively, so that the bilingual entity word network can be quickly and accurately constructed according to the formula (1), the construction accuracy of the bilingual entity word network is ensured, and the accuracy of constructing a Chinese-English bilingual corpus by using the bilingual entity word network is further improved.
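Steps S21 to S24 can be sketched as set operations, assuming formula (1) pairs the union of the two entity sets with the union of the three relation sets. The function name `build_entity_network` is hypothetical, and representing each mapping or inter-translation relation as a pair of entities is an assumption made for illustration; the description does not fix a data layout.

```python
def build_entity_network(zh_entities, en_entities,
                         zh_mappings, en_mappings, translations):
    """Return (nodes, relations) for the bilingual entity word network.

    zh_entities / en_entities: first and second sets (S21).
    zh_mappings / en_mappings: third and fourth sets (S22), here as pairs
        (entity, mapped entity).
    translations: fifth set (S23), here as pairs (zh entity, en entity).
    """
    nodes = set(zh_entities) | set(en_entities)
    relations = set(zh_mappings) | set(en_mappings) | set(translations)
    return nodes, relations


# Toy data, invented for illustration only.
nodes, relations = build_entity_network(
    zh_entities={"苹果"},
    en_entities={"apple"},
    zh_mappings={("苹果", "水果")},
    en_mappings={("apple", "fruit")},
    translations={("苹果", "apple")},
)
print(len(nodes), len(relations))  # 2 3
```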

In one embodiment, as shown in FIG. 3, the context words include Chinese context words and English context words, and step S4 of calculating the monolingual representation estimated value and the cross-language entity estimated value of the bilingual entity word network based on the Chinese entity, the English entity, the context words, the preset hyperlink set and the preset sentence set includes the following steps:

S41: Importing the Chinese entity and the English entity into a preset processing port for vector feature conversion to obtain a training entity.

In the embodiment of the present invention, the preset processing port is a processing port dedicated to converting a chinese entity or an english entity into word vector features, and specifically, a word2vec model is used to perform vector feature conversion processing.

Specifically, the Chinese entity and the English entity are directly led into a preset processing port to be subjected to vector feature conversion, word vector features after conversion processing are obtained, and the word vector features are used as training entities.

It should be noted that, when the Chinese entity and the English entity are converted into word vector features, some of the word vector features may contain hyperlinks.

S42: based on the training entity, the context words, the preset hyperlink set and the preset statement set, calculating a monolingual characterization evaluation value according to a formula (2):

wherein L is the monolingual representation estimated value, zh is a Chinese entity, en is an English entity, […] is the training entity, D is the preset sentence set, A is the preset hyperlink set, G is the context words, and […] represents: (i) if it is not […], whether it is a context word; (ii) if it is not […], the entity linked to […]; (iii) if q is present in A, whether […] is a context word of q, where q is an element of D.

S43: based on the context words, cross-language entity valuations are calculated according to equation (3):

wherein I is the cross-language entity estimated value, […] is the current entity, and […] represents the context words: if the current entity is a Chinese entity, it represents the English context words corresponding to the Chinese entity; if the current entity is an English entity, it represents the Chinese context words corresponding to the English entity; that is, it represents the entities of the other language connected to […].

In this embodiment, according to the formula (2) and the formula (3), the monolingual characteristic estimated value and the cross-language entity estimated value corresponding to the bilingual entity word network can be respectively, quickly and accurately calculated, and the accuracy of determining the target estimated value according to the monolingual characteristic estimated value and the cross-language entity estimated value in the following process is ensured.

In one embodiment, as shown in fig. 4, the step S5 of obtaining a comparable sentence network and a training sentence, and calculating a cross-language sentence estimation value corresponding to the comparable sentence network by using the training sentence includes the following steps:

S51: Acquiring a comparable sentence network from a preset initial library, wherein the comparable sentence network comprises Chinese sentences and English sentences.

In the embodiment of the invention, the comparable statement network is obtained directly from the preset initial library.

It should be noted that the comparable sentence network is composed of chinese sentences and english sentences, and there is a preset association relationship between the chinese sentences and the english sentences, but because the association relationship is inaccurate as the data amount increases, it is necessary to optimize the comparable sentence network.

S52: Obtaining a Chinese sentence and an English sentence that contain two identical entities as a training sentence, wherein the training sentence includes a Chinese sentence vector corresponding to the Chinese sentence.

In the embodiment of the present invention, the same entity between the chinese sentence and the english sentence refers to a chinese entity and an english entity that have a mutual translation relationship, for example: the Chinese entity "apple" and the English entity "apple" belong to the same entity.

Specifically, the training sentences are directly obtained from a preset training library, wherein the preset training library refers to a database specially used for storing the training sentences.

S53: Converting the training sentence into a comprehensive vector according to a preset vector conversion mode.

The preset vector conversion mode may specifically be converting the training sentence into a comprehensive vector through a word2vec model.

S54: Calculating the cross-language sentence estimated value according to formula (4) based on the comprehensive vector and the Chinese sentence vector:

wherein J is the cross-language sentence estimated value, […] is the comprehensive vector, […] is the Chinese sentence vector, and K is the comparable sentence network.

Specifically, the comprehensive vector and the Chinese sentence vector are substituted into formula (4), and the cross-language sentence estimated value is calculated using formula (4).

In this example, by obtaining the comparable sentence network and the training sentences, the cross-language sentence estimation value corresponding to the comparable sentence network can be quickly and accurately calculated by using the formula (4), and the accuracy of determining the target estimation value by using the cross-language sentence estimation value subsequently is ensured.

In one embodiment, as shown in FIG. 5, step S53 of converting the training sentence into a comprehensive vector according to the preset vector conversion mode includes the following steps:

S531: Performing semantic accuracy judgment on the training sentence, determining a first weight value of the training sentence according to the judgment result, and taking the training sentence with the determined first weight value as a first target sentence.

In the embodiment of the invention, a training sentence is imported into a preset semantic port to judge semantic accuracy, accuracy is output, a weight value corresponding to the accuracy is obtained from a preset weight table to serve as a first weight value, and finally the training sentence with the first weight value serves as a first target sentence.

The preset semantic port is a processing port which is trained in advance and used for performing semantic accuracy judgment on training sentences and outputting accuracy according to judgment results.

The preset weight table is a data table for storing different accuracies and weight values corresponding to the accuracies.

For example, the two sentences "Vancouver is the capital of Canada." and "Vancouver is an important city of Canada." are imported into the preset semantic port. Through semantic accuracy judgment, the preset semantic port determines that the relation expressed by the former is wrong and that the information expressed by the latter is correct. It therefore outputs an accuracy of 0% for the former and an accuracy of 100% for the latter.
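The weight lookup in step S531 can be sketched as follows. Everything named here is hypothetical: `judge_accuracy` stands in for the pre-trained preset semantic port (stubbed below with fixed outputs taken from the Vancouver example), and the values in `PRESET_WEIGHT_TABLE` are invented, since the description does not list the table's contents.

```python
# Hypothetical preset weight table: accuracy -> first weight value.
PRESET_WEIGHT_TABLE = {0.0: 0.0, 1.0: 1.0}


def first_target_sentences(sentences, judge_accuracy,
                           weight_table=PRESET_WEIGHT_TABLE):
    """Attach a first weight value to each training sentence (step S531).

    judge_accuracy(sentence) -> accuracy in [0, 1]; a stand-in for the
    preset semantic port, which in the description is a trained component.
    """
    return [(sentence, weight_table[judge_accuracy(sentence)])
            for sentence in sentences]


# Stub for the semantic port, using the accuracies from the example above.
STUB_ACCURACY = {
    "Vancouver is the capital of Canada.": 0.0,
    "Vancouver is an important city of Canada.": 1.0,
}

weighted = first_target_sentences(list(STUB_ACCURACY), STUB_ACCURACY.get)
for sentence, weight in weighted:
    print(weight, sentence)
```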

S532: Carrying out sentence vector conversion on the first target sentence to obtain a first vector.

Specifically, the first target statement is led into a preset vector conversion port to be subjected to sentence vector conversion processing, and a first vector after conversion processing is obtained. The preset vector conversion port is a processing port specially used for sentence vector conversion processing.

S533: determining a second weight value of each vocabulary contained in the Chinese sentence and the English sentence in the training sentence in a sentence meaning matching mode, and taking the training sentence with the determined second weight value as a second target sentence.

In the embodiment of the invention, sentence meaning matching refers to judging, on the basis of the Chinese sentence in a training sentence, whether the translation of the English sentence in that training sentence matches the Chinese sentence. Specifically, the training sentence is input into a preset matching port for sentence meaning matching; after this processing, each group of words in the English sentence is assigned a second weight value, and the training sentence with the assigned second weight values is used as the second target sentence.

The preset matching port is a pre-trained processing port that performs sentence meaning matching on the training sentences and, according to the matching result, assigns a second weight value to each group of words in the English sentence contained in the training sentence.

The specific processing procedure of the preset matching port is as follows: the Chinese sentence and the English sentence in the training sentence are converted into a Chinese sentence vector and an English sentence vector through a word2vec model; the component in each dimension of the Chinese sentence vector is compared with the component in the same dimension of the English sentence vector; and if the two components differ in a given dimension, the corresponding component of the English sentence vector is assigned the second weight value that was trained in advance for that component ratio.

It should be noted that each vector component has a corresponding vocabulary item, and the components of a Chinese vocabulary item and an English vocabulary item that are translations of each other are the same.

For example, suppose a training sentence contains the Chinese sentence "Xiaoming was a student at Beijing university" and the English sentence "Xiaoming _ space 7years in Peaking university". The word2vec model converts the Chinese sentence into the Chinese sentence vector (1,2,3,4,5) and the English sentence into the English sentence vector (1,0,3,4,9), where the vocabulary item in the English sentence corresponding to the second-dimension component 0 is "intent", and the vocabulary item corresponding to the fifth-dimension component 9 is "7 years".

Comparing the two vectors dimension by dimension shows that the second dimension differs (2 versus 0) and the fifth dimension differs (5 versus 9). Suppose the pre-trained second weight value is 50% for the ratio 2:0, 1% for the ratio 5:9, and 100% for identical components. Then the second weight value of the second-dimension component of the English sentence vector is 50%, that is, the second weight value of "intent" is 50%; the second weight value of the fifth-dimension component is 1%, that is, the second weight value of "7years" is 1%; and the second weight values of all other, identical components are 100%.
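The dimension-by-dimension comparison in this example can be sketched in Python as follows. This is only an illustrative sketch: the function name `second_weights`, the dictionary `ratio_weights` standing in for the pre-trained ratio-to-weight table, and the choice to raise an error for an unknown ratio are assumptions, not details given in this embodiment.

```python
from typing import Dict, List, Sequence, Tuple


def second_weights(
    zh_vec: Sequence[int],
    en_vec: Sequence[int],
    ratio_weights: Dict[Tuple[int, int], float],
    same_weight: float = 1.0,
) -> List[float]:
    """Assign a second weight value to each component of the English vector.

    ratio_weights is a hypothetical stand-in for the pre-trained table that
    maps a (Chinese component, English component) ratio to a weight.
    """
    if len(zh_vec) != len(en_vec):
        raise ValueError("sentence vectors must have the same dimension")
    weights = []
    for zh, en in zip(zh_vec, en_vec):
        if zh == en:
            # Identical components keep the full weight (100% in the example).
            weights.append(same_weight)
        else:
            # A ratio missing from the table raises KeyError in this sketch.
            weights.append(ratio_weights[(zh, en)])
    return weights


# Values taken from the worked example in the text.
zh_vec = (1, 2, 3, 4, 5)
en_vec = (1, 0, 3, 4, 9)
ratio_weights = {(2, 0): 0.50, (5, 9): 0.01}
weights = second_weights(zh_vec, en_vec, ratio_weights)
print(weights)  # [1.0, 0.5, 1.0, 1.0, 0.01]
```

The second and fifth entries of the result correspond to the 50% and 1% second weight values in the example, and the remaining entries to the 100% weight of identical components.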

S534: and performing sentence vector conversion on the English sentence in the second target sentence to obtain a second vector.

Specifically, the second target sentence is input into the preset vector conversion port for sentence vector conversion, and the converted second vector is obtained.

S535: and calculating a comprehensive vector corresponding to the training sentence based on the first vector and the second vector.

In the embodiment of the present invention, a comprehensive vector corresponding to the training sentence is calculated according to formula (5) based on the first vector and the second vector:

[Formula (5) is not reproduced in this text.]

wherein the symbols of formula (5) denote, in order, the comprehensive vector, the first vector, the second vector, the training sentence, and a preset sentence vector.

In this embodiment, semantic accuracy judgment identifies whether a training sentence contains a semantic error, and the first weight value determined from that judgment yields the first target sentence, which improves the semantic accuracy of the first target sentence. Sentence meaning matching then identifies how similar the meanings of the Chinese sentence and the English sentence in the training sentence are, and the second weight value of each vocabulary item in the English sentence is determined from that similarity to yield the second target sentence, which improves the accuracy of the second target sentence. Finally, the comprehensive vector corresponding to the training sentence is calculated from the first target sentence and the second target sentence, which ensures the accuracy of the comprehensive vector.

In one embodiment, as shown in fig. 6, the step S7 of comparing the target estimation value with a preset threshold, and if the preset condition is reached, combining the bilingual entity word network and the comparable sentence network into a Chinese-English bilingual corpus includes the following steps:

S71: comparing the target estimation value with a preset threshold.

Specifically, the target estimation value is compared with a preset threshold value.

S72: and if the target estimation value is less than or equal to a preset threshold value, combining the bilingual entity word network and the comparable sentence network into a Chinese-English bilingual corpus.

Specifically, according to the comparison in step S71, if the target estimation value is less than or equal to the preset threshold, the bilingual entity word network and the comparable sentence network corresponding to the target estimation value are combined into a Chinese-English bilingual corpus.

S73: and if the target estimated value is larger than the preset threshold, iteratively updating the bilingual entity word network and the comparable sentence network according to a preset parameter updating mode until the target estimated value is smaller than or equal to the preset threshold, and combining the iteratively updated bilingual entity word network and the comparable sentence network into a Chinese-English bilingual corpus.

Specifically, according to the comparison in step S71, if the target estimation value is greater than the preset threshold, the bilingual entity word network and the comparable sentence network are iteratively updated according to the preset parameter updating mode until the target estimation value is less than or equal to the preset threshold, and the iteratively updated bilingual entity word network and comparable sentence network are combined into a Chinese-English bilingual corpus.

The preset parameter updating mode is a mode for updating parameters in the bilingual entity word network and the comparable sentence network according to actual requirements of a user.

In this embodiment, the target estimation value is compared with the preset threshold. When the target estimation value is less than or equal to the preset threshold, the Chinese-English bilingual corpus is determined directly; when it is greater, the bilingual entity word network and the comparable sentence network are iteratively updated until the target estimation value is less than or equal to the preset threshold, and the corpus is then determined. Determining the corpus by this comparison ensures that it is produced only once the set condition is reached, which further improves the accuracy of the Chinese-English bilingual corpus.
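Steps S71 to S73 amount to a compare-then-update loop, which can be sketched in Python as below. The names `build_corpus`, `estimate`, `update` and `max_iters` are hypothetical; the text does not specify how the networks are represented, how the preset parameter updating mode works, or whether the iteration is bounded, so the iteration cap is an assumption added to keep the sketch from looping forever.

```python
from typing import Callable, Tuple, TypeVar

N = TypeVar("N")  # stands for the pair of networks, in whatever representation is used


def build_corpus(
    networks: N,
    estimate: Callable[[N], float],
    update: Callable[[N], N],
    threshold: float,
    max_iters: int = 1000,
) -> Tuple[N, int]:
    """Return the networks once their target estimate is <= threshold.

    estimate and update are hypothetical callables standing in for the
    target-estimate computation and the preset parameter updating mode.
    """
    iterations = 0
    value = estimate(networks)          # S71: compare with the threshold
    while value > threshold:            # S73: estimate still too large
        if iterations >= max_iters:
            raise RuntimeError("target estimate did not reach the threshold")
        networks = update(networks)
        value = estimate(networks)
        iterations += 1
    return networks, iterations         # S72: combine into the corpus


# Toy usage: the "networks" are a single number that is halved on each update.
result = build_corpus(8.0, estimate=lambda n: n, update=lambda n: n * 0.5, threshold=1.0)
print(result)  # (1.0, 3)
```

In the toy run the estimate goes 8.0, 4.0, 2.0, 1.0, so three updates are needed before the threshold condition of S72 holds.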

In an embodiment, after step S7, the method for constructing a Chinese-English bilingual corpus further includes: storing the Chinese-English bilingual corpus in a blockchain.

It should be emphasized that, in order to further ensure the privacy and security of the Chinese-English bilingual corpus, the Chinese-English bilingual corpus can also be stored in a node of a blockchain.

It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.

In an embodiment, an apparatus for constructing a Chinese-English bilingual corpus is provided, which corresponds one-to-one to the method for constructing a Chinese-English bilingual corpus in the above embodiments. As shown in fig. 7, the apparatus for constructing a Chinese-English bilingual corpus includes a first obtaining module 71, a constructing module 72, a second obtaining module 73, a first calculating module 74, a second calculating module 75, a summing module 76, and a combining module 77. The functional modules are explained in detail as follows:

a first obtaining module 71, configured to obtain a Chinese entity, an English entity, and a mapping relationship and a translation relationship between the Chinese entity and the English entity from a preset entity library;

the construction module 72 is used for constructing a bilingual entity word network according to the Chinese entity, the English entity, the mapping relation and the inter-translation relation and according to the preset requirement;

a second obtaining module 73, configured to obtain context words corresponding to each Chinese entity and each English entity from a preset database;

a first calculating module 74, configured to calculate a single-language representation estimated value and a cross-language entity estimated value of a bilingual entity word network based on the Chinese entity, the English entity, the context word, the preset hyperlink set, and the preset sentence set;

a second calculating module 75, configured to obtain a comparable sentence network and a training sentence, and calculate a cross-language sentence estimation value corresponding to the comparable sentence network by using the training sentence;

a summation module 76, configured to perform weighted summation on the monolingual representation estimated value, the cross-language entity estimated value, and the cross-language statement estimated value to obtain a target estimated value;

and the combination module 77 is used for comparing the target estimated value with a preset threshold value, and combining the bilingual entity word network and the comparable sentence network into a Chinese-English bilingual corpus if a preset condition is reached.

Further, the construction module 72 includes:
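The weighted summation performed by the summation module 76 can be illustrated with a short Python sketch. The function name `target_estimate` and the concrete weights and input values are assumptions chosen only for illustration; the text does not state which weights are used.

```python
from typing import Tuple


def target_estimate(
    monolingual: float,
    cross_entity: float,
    cross_sentence: float,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> float:
    """Weighted sum of the three estimated values (weights are hypothetical)."""
    w_mono, w_entity, w_sentence = weights
    return w_mono * monolingual + w_entity * cross_entity + w_sentence * cross_sentence


# Illustrative numbers only: 1.0*0.5 + 2.0*0.25 + 4.0*0.125
value = target_estimate(0.5, 0.25, 0.125, weights=(1.0, 2.0, 4.0))
print(value)  # 1.5
```

The resulting target estimated value is what the combination module 77 compares with the preset threshold.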

the third acquisition submodule is used for acquiring all Chinese entities as a first set and all English entities as a second set;

the fourth obtaining sub-module is used for obtaining a first mapping entity which has a mapping relation with a Chinese entity from a mapping database as a third set and a second mapping entity which has a mapping relation with an English entity from the mapping database as a fourth set, wherein the mapping database comprises the first mapping entity and the second mapping entity;

the fifth acquisition submodule is used for acquiring the Chinese entity and the English entity which have the inter-translation relationship as the inter-translation entities and combining all the inter-translation entities into a fifth set;

the network construction submodule is used for constructing a bilingual entity word network according to a formula (1) based on the first set, the second set, the third set, the fourth set and the fifth set:

$E = (\varepsilon^{zh} \cup \varepsilon^{en},\ R^{zh} \cup R^{en} \cup R)$ formula (1)
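Formula (1) combines the five sets by set union: the first and second sets form one component of the network, and the third, fourth and fifth sets form the other. A minimal Python sketch of that construction follows. Representing entities as strings and the third to fifth sets as sets of entity pairs is an assumption made for illustration, as are the function name and the sample identifiers.

```python
from typing import Set, Tuple

Relation = Tuple[str, str]  # hypothetical representation: a pair of entity identifiers


def build_entity_network(
    zh_entities: Set[str],          # first set
    en_entities: Set[str],          # second set
    zh_mappings: Set[Relation],     # third set
    en_mappings: Set[Relation],     # fourth set
    translations: Set[Relation],    # fifth set
) -> Tuple[Set[str], Set[Relation]]:
    """Build the network as (union of entity sets, union of relation sets)."""
    nodes = zh_entities | en_entities
    edges = zh_mappings | en_mappings | translations
    return nodes, edges


# Sample identifiers are invented for this sketch.
nodes, edges = build_entity_network(
    zh_entities={"zh:Xiaoming", "zh:Peking_University"},
    en_entities={"en:Peking_University"},
    zh_mappings={("zh:Xiaoming", "zh:Peking_University")},
    en_mappings=set(),
    translations={("zh:Peking_University", "en:Peking_University")},
)
print(len(nodes), len(edges))  # 3 2
```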

Further, the first calculation module 74 includes:

the conversion sub-module is used for leading the Chinese entity and the English entity into a preset processing port for vector feature conversion to obtain a training entity;

and the third computation submodule is used for computing a single language representation estimated value according to a formula (2) based on the training entity, the context words, the preset hyperlink set and the preset sentence set:

[Formula (2) is not reproduced in this text; items marked [symbol] or [condition] below were lost with it.]

wherein the term [symbol] represents: (i) if [condition], whether [symbol] is a context word; (ii) if [condition], whether [symbol] is an entity linked to [symbol]; (iii) if q is present in A, whether [symbol] is a context word of q, q being an element in D;

a fourth computation submodule for computing cross-language entity valuations according to equation (3) based on the context words:

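[Formula (3) is not reproduced in this text.]

wherein I is the cross-language entity estimated value; of the two symbols lost with the formula, the first denotes the current entity and the second denotes the entities of the other language connected to the current entity.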

Further, the second calculation module 75 includes:

a sixth obtaining submodule, configured to obtain a comparable sentence network from the preset initial library, where the comparable sentence network includes a Chinese sentence and an English sentence;

a seventh obtaining submodule, configured to obtain a Chinese sentence and an English sentence that include 2 identical entities as training sentences, where the training sentences include Chinese sentence vectors corresponding to the Chinese sentences;

the comprehensive vector conversion sub-module is used for converting the training sentences into comprehensive vectors according to a preset vector conversion mode;

and the fifth calculation submodule is used for calculating the cross-language statement estimated value according to the formula (4) according to the comprehensive vector and the Chinese statement vector:

[Formula (4) is not reproduced in this text.]

where J is the cross-language sentence estimated value, K is the comparable sentence network, and the two symbols lost with the formula denote, in order, the comprehensive vector and the Chinese sentence vector.
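The selection of training sentences by the seventh obtaining submodule can be sketched in Python as follows. The sketch assumes that entities in both languages have already been mapped to shared identifiers through the inter-translation relationship, and it reads "2 identical entities" as "at least two shared entities"; both points, along with the function name and sample data, are assumptions rather than details stated in the text.

```python
from typing import FrozenSet, List, Tuple

Sentence = Tuple[str, FrozenSet[str]]  # (sentence text, shared entity identifiers)


def select_training_pairs(
    zh_sentences: List[Sentence],
    en_sentences: List[Sentence],
    min_shared: int = 2,
) -> List[Tuple[str, str]]:
    """Pair each Chinese sentence with English sentences sharing enough entities."""
    pairs = []
    for zh_text, zh_entities in zh_sentences:
        for en_text, en_entities in en_sentences:
            if len(zh_entities & en_entities) >= min_shared:
                pairs.append((zh_text, en_text))
    return pairs


# Invented sample data: E1..E5 are shared entity identifiers.
zh_sentences = [("zh-1", frozenset({"E1", "E2"})), ("zh-2", frozenset({"E3"}))]
en_sentences = [("en-1", frozenset({"E1", "E2", "E4"})), ("en-2", frozenset({"E3", "E5"}))]
pairs = select_training_pairs(zh_sentences, en_sentences)
print(pairs)  # [('zh-1', 'en-1')]
```

Only the first pair shares two entities, so it is the only one kept as a training sentence in this toy input.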

Further, the integrated vector conversion sub-module includes:

the judging unit is used for performing semantic accuracy judgment on the training sentences, determining a first weight value of the training sentences according to the judgment result, and taking the training sentences with the determined first weight values as first target sentences;

the first vector acquisition unit is used for carrying out sentence vector conversion on the first target sentence to obtain a first vector;

the matching unit is used for determining a second weight value of each vocabulary contained in the Chinese sentences and the English sentences in the training sentences in a sentence meaning matching mode, and taking the training sentences with the determined second weight values as second target sentences;

the second vector acquisition unit is used for performing sentence vector conversion on the English sentences in the second target sentences to obtain second vectors;

and the comprehensive vector calculation unit is used for calculating a comprehensive vector corresponding to the training sentence based on the first vector and the second vector.

Further, the combination module 77 includes:

the comparison submodule is used for comparing the target estimation value with a preset threshold value;

the first comparison submodule is used for combining the bilingual entity word network and the comparable sentence network into a Chinese-English bilingual corpus if the target estimation value is less than or equal to a preset threshold value;

and the second comparison submodule is used for carrying out iterative update on the bilingual entity word network and the comparable sentence network according to a preset parameter update mode if the target estimation value is greater than the preset threshold value until the target estimation value is less than or equal to the preset threshold value, and combining the bilingual entity word network and the comparable sentence network after iterative update into a Chinese-English bilingual corpus.

Some embodiments of the present application disclose a computer device. Referring specifically to fig. 8, a block diagram of a basic structure of a computer device 80 according to an embodiment of the present application is shown.

As illustrated in fig. 8, the computer device 80 includes a memory 81, a processor 82, and a network interface 83, which are communicatively connected to each other through a system bus. It is noted that only a computer device 80 having components 81-83 is shown in FIG. 8, but it is understood that not all of the shown components are required to be implemented, and that more or fewer components may alternatively be implemented. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to a preset or stored instruction, and the hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.

The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.

The memory 81 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 81 may be an internal storage unit of the computer device 80, such as a hard disk or a memory of the computer device 80. In other embodiments, the memory 81 may also be an external storage device of the computer device 80, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like provided on the computer device 80. Of course, the memory 81 may also include both internal and external storage devices of the computer device 80. In this embodiment, the memory 81 is generally used for storing an operating system and various types of application software installed on the computer device 80, such as program codes of the method for constructing the Chinese-English bilingual corpus. Further, the memory 81 may also be used to temporarily store various types of data that have been output or are to be output.

The processor 82 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 82 is generally operative to control the overall operation of the computer device 80. In this embodiment, the processor 82 is configured to execute the program code stored in the memory 81 or process data, for example, execute the program code of the method for constructing the Chinese-English bilingual corpus.

The network interface 83 may include a wireless network interface or a wired network interface, and the network interface 83 is generally used to establish a communication connection between the computer device 80 and other electronic devices.

The present application further provides another embodiment, namely a computer-readable storage medium storing a Chinese and English entity information entry program, where the Chinese and English entity information entry program is executable by at least one processor, so that the at least one processor executes the steps of any one of the methods for constructing a Chinese-English bilingual corpus described above.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a computer device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.

The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
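The idea of data blocks associated by a cryptographic method can be illustrated with a minimal hash-linked list in Python. This is a toy sketch only: it omits consensus, point-to-point transmission and every other blockchain component mentioned above, and the function names, the all-zero genesis hash and the sample payloads are assumptions rather than part of the described storage scheme.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS_HASH = "0" * 64  # hypothetical placeholder for "no previous block"


def _digest(payload: Any, prev_hash: str) -> str:
    """SHA-256 over the payload together with the previous block's hash."""
    body = json.dumps({"payload": payload, "prev_hash": prev_hash}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_block(chain: List[Dict[str, Any]], payload: Any) -> None:
    """Append a block whose hash covers its payload and its predecessor."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    chain.append({"payload": payload, "prev_hash": prev_hash,
                  "hash": _digest(payload, prev_hash)})


def verify_chain(chain: List[Dict[str, Any]]) -> bool:
    """Check every link and every stored hash; any edit breaks the chain."""
    prev_hash = GENESIS_HASH
    for block in chain:
        if block["prev_hash"] != prev_hash:
            return False
        if block["hash"] != _digest(block["payload"], block["prev_hash"]):
            return False
        prev_hash = block["hash"]
    return True


chain: List[Dict[str, Any]] = []
append_block(chain, {"zh": "zh-sentence-1", "en": "en-sentence-1"})
append_block(chain, {"zh": "zh-sentence-2", "en": "en-sentence-2"})
chain_is_valid = verify_chain(chain)

# Altering a stored corpus entry without recomputing hashes is detected.
tampered = [dict(block) for block in chain]
tampered[0]["payload"] = {"zh": "zh-sentence-1", "en": "altered"}
tampered_is_valid = verify_chain(tampered)
print(chain_is_valid, tampered_is_valid)  # True False
```

Because each block's hash covers the previous block's hash, changing one stored corpus entry invalidates the check for that block, which is the anti-counterfeiting property the paragraph above refers to.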

Finally, it should be noted that the above-mentioned embodiments are only some of the embodiments of the present application and do not limit its scope. This application can be embodied in many different forms; these embodiments are provided so that the disclosure of the application can be understood thoroughly. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in those embodiments or substitute equivalents for some of their features. All equivalent structures made by using the contents of the specification and the drawings of the present application, whether applied directly or indirectly in other related technical fields, are within the protection scope of the present application.

Claims

1. A method for constructing a Chinese-English bilingual corpus is characterized by comprising the following steps:

2. The method for constructing a Chinese-English bilingual corpus according to claim 1, wherein said step of constructing a bilingual entity word network according to predetermined requirements based on said Chinese entity, said English entity, said mapping relationship, and said inter-translation relationship comprises:

acquiring all the Chinese entities as a first set and all the English entities as a second set;

acquiring a first mapping entity having the mapping relation with the Chinese entity from a mapping database as a third set, and acquiring a second mapping entity having the mapping relation with the English entity as a fourth set, wherein the mapping database comprises the first mapping entity and the second mapping entity;

acquiring the Chinese entity and the English entity with the inter-translation relationship as inter-translation entities, and combining all the inter-translation entities into a fifth set;

constructing the bilingual entity word network based on the first set, the second set, the third set, the fourth set, and the fifth set according to the following formula:

$E = (\varepsilon^{zh} \cup \varepsilon^{en},\ R^{zh} \cup R^{en} \cup R)$

wherein E is the bilingual entity word network, $\varepsilon^{zh}$ is the first set, $\varepsilon^{en}$ is the second set, $R^{zh}$ is the third set, $R^{en}$ is the fourth set, and R is the fifth set.

3. The method for constructing a Chinese-English bilingual corpus of claim 1, wherein the context words comprise Chinese context words and English context words, and the step of calculating the single language representation valuation and the cross-language entity valuation of the bilingual entity word network based on the Chinese entities, the English entities, the context words, the preset hyperlink set and the preset sentence set comprises:

importing the Chinese entity and the English entity into a preset processing port for vector feature conversion to obtain a training entity;

based on the training entity, the context words, a preset hyperlink set and a preset statement set, calculating the single language representation valuation according to the following formula:

[The formula is not reproduced in this text; items marked [symbol] or [condition] below were lost with it.]

wherein L is the single language representation valuation, zh is the Chinese entity, en is the English entity, [symbol] is the training entity, D is a preset sentence set, A is a preset hyperlink set, G is the context word, and the term [symbol] represents: (i) if [condition], whether [symbol] is the context word; (ii) if [condition], whether [symbol] is an entity linked to [symbol]; (iii) if q is present in A, whether [symbol] is a context word of q, q being an element in D;

based on the context words, the cross-language entity valuation is calculated according to the following formula:

[The formula is not reproduced in this text.]

wherein I is the cross-language entity valuation; of the two symbols lost with the formula, the first denotes the current entity and the second denotes the context words: if the current entity is the Chinese entity, it represents the English context words corresponding to the Chinese entity; if the current entity is the English entity, it represents the Chinese context words corresponding to the English entity; that is, it represents the entities of the other language connected to the current entity.

4. The method for constructing a Chinese-English bilingual corpus according to claim 1, wherein the step of obtaining a comparable sentence network and training sentences and calculating the cross-language sentence valuation corresponding to the comparable sentence network using the training sentences comprises:

acquiring the comparable sentence network from a preset initial library, wherein the comparable sentence network comprises Chinese sentences and English sentences;

acquiring the Chinese sentence and the English sentence containing 2 same entities as the training sentence, wherein the training sentence contains a Chinese sentence vector corresponding to the Chinese sentence;

converting the training sentences into comprehensive vectors according to a preset vector conversion mode;

calculating the cross-language sentence valuation according to the comprehensive vector and the Chinese sentence vector and the following formula:

[The formula is not reproduced in this text.]

wherein J is the cross-language sentence valuation, K is the comparable sentence network, and the two symbols lost with the formula denote, in order, the comprehensive vector and the Chinese sentence vector.

5. The method for constructing a Chinese-English bilingual corpus according to claim 4, wherein said step of converting said training sentences into comprehensive vectors according to a preset vector conversion mode comprises:

performing semantic accuracy judgment on the training sentence, determining a first weight value of the training sentence according to a judgment result, and taking the training sentence with the determined first weight value as a first target sentence;

performing sentence vector conversion on the first target sentence to obtain a first vector;

determining a second weight value of each vocabulary contained in the Chinese sentence and the English sentence in the training sentence in a sentence meaning matching mode, and taking the training sentence with the determined second weight value as a second target sentence;

performing sentence vector conversion on the English sentence in the second target sentence to obtain a second vector;

and calculating a comprehensive vector corresponding to the training sentence based on the first vector and the second vector.

6. The method for constructing a Chinese-English bilingual corpus according to claim 1, wherein the step of comparing the target estimate with a preset threshold and, if a preset condition is reached, combining the bilingual entity word network and the comparable sentence network into the Chinese-English bilingual corpus comprises:

comparing the target estimate with a preset threshold;

if the target estimated value is less than or equal to a preset threshold value, combining the bilingual entity word network and the comparable sentence network into the Chinese-English bilingual corpus;

and if the target estimation value is larger than a preset threshold value, performing iterative updating on the bilingual entity word network and the comparable sentence network according to a preset parameter updating mode until the target estimation value is smaller than or equal to the preset threshold value, and combining the bilingual entity word network and the comparable sentence network after iterative updating into the Chinese-English bilingual corpus.

7. The method for constructing a Chinese-English bilingual corpus according to claim 1, further comprising, after said combining into a Chinese-English bilingual corpus: storing the Chinese-English bilingual corpus in a blockchain.

8. An apparatus for constructing a Chinese-English bilingual corpus, characterized in that the apparatus for constructing a Chinese-English bilingual corpus comprises:

9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method for constructing a Chinese-English bilingual corpus according to any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium storing a computer program, wherein the computer program is executed by a processor to implement the steps of the method for constructing a Chinese-English bilingual corpus according to any one of claims 1 to 7.