CN111563381A - Text processing method and device - Google Patents

Text processing method and device

Info

Publication number: CN111563381A
Application number: CN201910111565.XA
Authority: CN (China)
Prior art keywords: language, model, text processing, corpus, text
Legal status: Granted; currently Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN111563381B
Inventors: 黄睿, 李辰, 包祖贻, 刘恒友, 李林琳, 司罗
Current assignee: Alibaba Group Holding Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Alibaba Group Holding Ltd
Application filed by Alibaba Group Holding Ltd; priority and filing date 2019-02-12 (the priority date is an assumption and is not a legal conclusion)
Publication of CN111563381A; application granted; publication of CN111563381B

Classifications

    • G06F18/24 — Pattern recognition; analysing; classification techniques
    • G06N3/044 — Neural networks; architecture; recurrent networks, e.g. Hopfield networks
    • G06N3/045 — Neural networks; architecture; combinations of networks
    • G06N3/08 — Neural networks; learning methods

Abstract

The application discloses a text processing method and apparatus. The method comprises: learning a language model at least from a corpus collection comprising a first source-language corpus and a target-language corpus, neither labeled with text processing results; learning a text processing model from a second source-language corpus set labeled with text processing results; acquiring a text to be processed; determining, through the language model, a cross-language aligned context-dependent word vector for at least one word included in the text to be processed; and obtaining, through the text processing model, the text processing result for the text from the cross-language aligned context-dependent word vectors. This processing aligns the deep semantic vectors of corresponding content, including its context information, between the source language and the target language, and performs cross-language model migration based on those aligned context-aware deep semantic vectors; it therefore effectively improves the accuracy of the migrated model and, in turn, the accuracy of text processing.

Description

Text processing method and device
Technical Field
The application relates to the technical field of natural language processing, in particular to a text processing method and device.
Background
In service scenarios involving text processing, the same text processing task must, as services develop, be executed in multiple languages. Take named entity recognition in Chinese as an example: the trade name "apple mobile phone" needs to be recognized in the text "I bought an apple mobile phone today". Likewise, a Chinese sentiment classification task — for example, identifying that a user comment such as "good quality, exceeded expectations" expresses positive sentiment — may need to be extended to languages other than Chinese.
For languages with abundant manually labeled data (such as Chinese or English), a text processing model (such as a named entity recognition or sentiment classification model) can be trained directly on the training data. For languages where manually labeled data is scarce (such as Vietnamese or Thai), there is not enough training data to train a model, so a cross-language model migration scheme is usually adopted: a model trained on a language rich in manually labeled data (the source language) is migrated for use on a language poor in such data (the target language).
At present, the typical cross-language model migration scheme relies on aligned word vectors and is referred to as cross-language word vector alignment. Starting from the independent word vector spaces of the source language and the target language, the method tries to map corresponding words of the two languages into the same region of the space; taking Chinese and English as an example, it tries to map the Chinese word for "apple" and the English word "apple", and likewise the Chinese word for "man" and the English "man", to the same position in the word vector space. A text processing model is then trained in the source language on top of the aligned word vector space and migrated directly to the target language for use.
However, in the process of implementing the invention, the inventors found that the existing scheme has at least the following problem: different languages may not correspond exactly at the word level. For example, the English word "brown" corresponds to different Chinese words when used as a personal name and when used as a color. Word-granularity alignment of the kind performed by cross-language word vector methods cannot resolve such mismatches in word-sense information, so considering only word-level alignment is insufficient for cross-language model migration and results in low accuracy of the migrated model.
Disclosure of Invention
The application provides a text processing method to address the low accuracy of migrated models in the prior art. The application additionally provides a text processing apparatus.
The application provides a text processing method, which comprises the following steps:
learning a language model at least from a corpus collection comprising a first source-language corpus and a target-language corpus, neither labeled with text processing results; and learning a text processing model from a second source-language corpus set labeled with text processing results;
acquiring a text to be processed in the source language or the target language;
determining, through the language model, a cross-language aligned context-dependent word vector of at least one word included in the text to be processed;
and taking the cross-language aligned context-dependent word vectors as input data of the text processing model, and obtaining the text processing result of the text to be processed through the text processing model.
Optionally, the language model is learned through the following steps:
acquiring the corpus collection;
constructing a neural network for the language model, the neural network comprising at least one semantic vector extraction layer, each followed by a language category discriminator that discriminates the language category of the word vectors output by the adjacent preceding semantic vector extraction layer, the language categories comprising the source language and the target language;
and training the neural network on the corpus collection with the discriminator's accuracy as a training target, until that accuracy is greater than a first accuracy threshold and less than a second accuracy threshold and the perplexity of the language model is less than a perplexity threshold.
Optionally, the corpus collection comprises corpora of a plurality of source languages;
the language model is learned at least from a corpus collection comprising unlabeled first source-language corpora of the plurality of source languages and a target-language corpus, and the language categories discriminated by the discriminator comprise the plurality of source languages and the target language;
and the text processing model is learned from second source-language corpora of the plurality of source languages labeled with text processing results.
Optionally, the corpus collection does not include parallel corpora between the source language and the target language.
Optionally, determining, through the language model, a cross-language aligned context-dependent word vector of at least one word included in the text to be processed comprises:
obtaining the semantic vectors output by each semantic vector extraction layer in the language model;
and, for each word, concatenating the semantic vectors of that word output by the semantic vector extraction layers to serve as the word's cross-language aligned context-dependent word vector.
Optionally, determining, through the language model, a cross-language aligned context-dependent word vector of at least one word included in the text to be processed comprises:
obtaining the semantic vectors output by each semantic vector extraction layer in the language model;
and, for each word, taking a weighted average of the semantic vectors of that word output by the semantic vector extraction layers as the word's cross-language aligned context-dependent word vector.
Optionally, the text processing model is learned through the following steps:
acquiring the second source-language corpus;
determining, through the language model, a cross-language aligned context-dependent word vector for at least one word included in the second source-language corpus;
constructing a neural network for the text processing model;
and training the neural network on the set of correspondences between the cross-language aligned context-dependent word vectors and the labeled text processing results.
Optionally, the text processing model includes: a named entity recognition model, a sentiment classification model, or a part-of-speech tagging model.
The present application also provides a text processing apparatus, comprising:
a language model construction unit, configured to learn a language model at least from a corpus collection comprising a first source-language corpus and a target-language corpus without labeled text processing results;
a text processing model construction unit, configured to learn a text processing model from the second source-language corpus set labeled with text processing results;
a text acquisition unit, configured to acquire a text to be processed in the source language or the target language;
a semantic vector determination unit, configured to determine, through the language model, a cross-language aligned context-dependent word vector of at least one word included in the text to be processed;
and a model prediction unit, configured to take the cross-language aligned context-dependent word vectors as input data of the text processing model and obtain the text processing result of the text to be processed through the text processing model.
Optionally, the apparatus further includes a language model construction unit, which comprises:
a corpus acquisition subunit, configured to acquire the corpus collection;
a neural network construction subunit, configured to construct a neural network for the language model, the neural network comprising at least one semantic vector extraction layer, each followed by a language category discriminator that discriminates the language category of the word vectors output by the adjacent preceding semantic vector extraction layer, the language categories comprising the source language and the target language;
and a training subunit, configured to train the neural network on the corpus collection with the discriminator's accuracy as a training target, until that accuracy is greater than a first accuracy threshold and less than a second accuracy threshold and the perplexity of the language model is less than a perplexity threshold.
Optionally, the corpus collection comprises corpora of a plurality of source languages;
the language model construction unit is specifically configured to learn the language model at least from a corpus collection comprising unlabeled first source-language corpora of the plurality of source languages and a target-language corpus, the language categories discriminated by the discriminator comprising the plurality of source languages and the target language;
and the text processing model construction unit is specifically configured to learn the text processing model from second source-language corpora of the plurality of source languages labeled with text processing results.
Optionally, the semantic vector determination unit includes:
an output vector acquisition subunit, configured to obtain the semantic vectors output by each semantic vector extraction layer in the language model;
and a vector concatenation subunit, configured to concatenate, for each word, the semantic vectors of that word output by the semantic vector extraction layers to serve as the word's cross-language aligned context-dependent word vector.
Optionally, the semantic vector determination unit includes:
an output vector acquisition subunit, configured to obtain the semantic vectors output by each semantic vector extraction layer in the language model;
and a vector calculation subunit, configured to take, for each word, a weighted average of the semantic vectors of that word output by the semantic vector extraction layers as the word's cross-language aligned context-dependent word vector.
Optionally, the text processing model construction unit includes:
a training data acquisition subunit, configured to acquire the second source-language corpus;
a semantic vector determination subunit, configured to determine, through the language model, a cross-language aligned context-dependent word vector for at least one word included in the second source-language corpus;
a neural network construction subunit, configured to construct a neural network for the text processing model;
and a training subunit, configured to train the neural network on the set of correspondences between the cross-language aligned context-dependent word vectors and the labeled text processing results.
The present application also provides a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to perform the various methods described above.
The present application also provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the various methods described above.
Compared with the prior art, the method has the following advantages:
The text processing method provided by the embodiments of the application learns a language model at least from a corpus collection comprising a first source-language corpus and a target-language corpus without labeled text processing results, and learns a text processing model from a second source-language corpus set labeled with text processing results; it acquires a text to be processed in the source language or the target language, determines through the language model a cross-language aligned context-dependent word vector for at least one word in that text, and, taking those vectors as input data of the text processing model, predicts the text processing result. This processing aligns the deep semantic vectors of corresponding content, with its context information, between the source language and the target language, and performs cross-language model migration based on the aligned context-aware deep semantic vectors; it therefore effectively improves the accuracy of the migrated model and, in turn, the accuracy of text processing. In addition, the language model can be trained solely on widely available monolingual unlabeled corpora, so no additional bilingual resources such as parallel corpora are needed; the dependence on bilingual resources is eliminated entirely, and the method can be applied broadly to genuinely resource-poor languages.
Drawings
FIG. 1 is a flow diagram of an embodiment of a text processing method provided herein;
FIG. 2 is a detailed flowchart of an embodiment of a text processing method provided in the present application;
FIG. 3 is a schematic diagram of a language model of an embodiment of a text processing method provided in the present application;
FIG. 4 is a detailed flowchart of an embodiment of a text processing method provided in the present application;
FIG. 5 is a schematic diagram of another language model of an embodiment of a text processing method provided by the present application;
FIG. 6 is a schematic diagram of an embodiment of a text processing apparatus provided in the present application.
Detailed Description
In the following description, numerous specific details are set forth to provide a thorough understanding of the present application. The application can, however, be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from its spirit; the application is therefore not limited to the specific implementations disclosed below.
The application provides a text processing method and a text processing apparatus. Each is described in detail in the following embodiments.
The core idea of the technical scheme provided by the embodiments of the application is: align the deep semantic vectors of corresponding content, including its context information, between the source language and the target language, and perform cross-language model migration based on those aligned context-aware deep semantic vectors. Aligning the deep semantic vector spaces of the source and target languages and then using the aligned vectors to assist migration effectively improves the accuracy of the migrated model and hence of text processing. Moreover, the multilingual language model can be trained solely on widely available monolingual unlabeled corpora, without any additional bilingual resources such as parallel corpora, so the method can be applied broadly to genuinely resource-poor languages.
First embodiment
Please refer to FIG. 1, a flowchart of an embodiment of the text processing method provided in the present application. The method is executed by a text processing apparatus, which may be deployed on a server or on any other device capable of implementing the method. The text processing method provided by the application comprises the following steps:
step S101: learning to obtain a language model at least from a corpus collection of a first source language corpus and a target language corpus including an unlabeled text processing result; and learning from the second source language corpus set labeled with the text processing result to obtain a text processing model.
The text processing includes, but is not limited to: named entity recognition, sentiment classification, part of speech tagging, and the like. Accordingly, the text processing model includes, but is not limited to: named entity recognition models, emotion classification models, part-of-speech tagging models, and the like.
For ease of description, the method provided by the embodiments of the application is described below taking named entity recognition as an example.
The source language is usually a language with abundant manually labeled data, such as Chinese or English, for which a named entity recognition model can be trained directly on a second source-language corpus with labeled named entities. The target language is usually a language where manually labeled data is scarce, such as Vietnamese, Thai, or other low-resource languages; lacking sufficient training data, a named entity recognition model cannot be trained directly, so a cross-language model migration scheme is adopted instead: the named entity recognition model trained on the source language is migrated to the target language for use.
Step S101 comprises two parts: 1) learning a language model at least from a corpus collection comprising a first source-language corpus and a target-language corpus without labeled named entities; 2) learning a named entity recognition model from the second source-language corpus set with labeled named entities. These two parts are described separately below.
1) The language model is learned at least from a corpus collection comprising a first source-language corpus and a target-language corpus without labeled named entities.
The language model in the embodiments of the application is a multilingual abstract mathematical model of the source and target languages. It can be used to judge the fluency and plausibility of a sentence in either language, and to predict the next word from the information in the preceding words. It can be trained in an unsupervised manner directly on large-scale source-language and target-language corpora; the model provided in this embodiment is learned at least from a target-language corpus and a first source-language corpus without labeled text processing results. Through this language model, a cross-language aligned context-dependent word vector representation can be determined for each word of an input text; these representations may be called cross-language aligned deep semantic vectors, and the language model accordingly a multilingual language model.
A cross-language aligned context-dependent word vector is a semantic vector of a word that varies dynamically with the word's context information and is alignable across languages; it may therefore be called a cross-language aligned deep semantic vector of the word. Compared with conventional word vectors, such vectors express word semantics more strongly and are aligned across languages.
The network structure of the language model includes, but is not limited to, at least one of: a bidirectional long short-term memory network (BLSTM), a temporal convolutional neural network (TCNN), a Transformer, and the like. In a specific implementation, a structure with suitable accuracy and execution efficiency can be chosen according to the data and runtime environment: with a BLSTM-based language model, the output context-dependent word vectors are more accurate but execution is slower; with a TCNN-based language model, the output context-dependent word vectors are less accurate but execution is faster; and so on.
Please refer to FIG. 2, a detailed flowchart of an embodiment of the text processing method provided in the present application. In this embodiment, the language model may be learned through the following steps:
Step S201: acquiring the corpus collection.
The corpora in the collection require no named entity annotations, and no parallel corpora between the source language and the target language are needed.
Step S203: constructing a neural network for the language model.
Please refer to FIG. 3, a schematic diagram of the language model in an embodiment of the text processing method provided in the present application. In this embodiment, the neural network of the language model may include a plurality of semantic vector extraction layers (layers 1 to n in the figure), each followed by a language category discriminator that discriminates the language category of the word vectors output by the adjacent preceding semantic vector extraction layer; the language categories comprise the source language and the target language.
In this embodiment, the structure of the language model is as follows. Given an input sequence W = [w1, w2, …, wm], the word vector matrix E is first looked up to convert the sequence into a sequence of word vectors H_0, i.e. H_0 = WE. The sequence then passes through n semantic vector extraction layers (whose network structure may be a BLSTM, a Transformer block, etc.), yielding H_n, i.e. H_l = Layer_function(H_{l-1}) for 1 ≤ l ≤ n. Finally, the output is generated through a softmax layer: O = softmax(H_n E^T).
As can be seen from FIG. 3, all languages share all parameters of the language model except the vocabulary: each language has its own vocabulary and hence its own word vector matrix E. The cross-language aligned context-dependent word vectors can then be determined from the output vectors of the semantic vector extraction layers, for example by concatenation or by weighted averaging.
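To make the shared-parameter structure concrete, the following is a minimal PyTorch sketch of such a multilingual language model. It is an illustration under assumptions, not the patent's implementation: the class and parameter names are invented, BLSTM is chosen as the semantic vector extraction layer, and the dimension 300 matches the example used later in this description.

```python
import torch
import torch.nn as nn

class MultilingualLM(nn.Module):
    """Sketch: per-language embedding E, shared extraction layers, tied softmax."""
    def __init__(self, vocab_sizes, dim=300, n_layers=2):
        super().__init__()
        # Each language has its own vocabulary, i.e. its own word vector matrix E.
        self.embeddings = nn.ModuleDict(
            {lang: nn.Embedding(size, dim) for lang, size in vocab_sizes.items()})
        # n shared semantic vector extraction layers (BLSTM variant); hidden size
        # dim//2 per direction keeps every layer's output at dim dimensions.
        self.layers = nn.ModuleList(
            [nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)
             for _ in range(n_layers)])

    def forward(self, word_ids, lang):
        E = self.embeddings[lang].weight        # |V_lang| x dim
        h = self.embeddings[lang](word_ids)     # H_0 = WE
        layer_outputs = []
        for layer in self.layers:               # H_l = Layer_function(H_{l-1})
            h, _ = layer(h)
            layer_outputs.append(h)
        logits = h @ E.t()                      # O = softmax(H_n E^T); the softmax
        return logits, layer_outputs            # is folded into the training loss

lm = MultilingualLM({"source": 50000, "target": 30000})
```

Tying the output projection to the same matrix E used for lookup reproduces O = softmax(H_n E^T) and keeps all per-language parameters confined to the vocabulary, as FIG. 3 indicates.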
Step S205: training the neural network on the corpus collection, taking the discriminator's accuracy as a training target: training continues until that accuracy is greater than a first accuracy threshold and less than a second accuracy threshold, and the perplexity of the language model is less than a perplexity threshold.
The method provided by the embodiments of the application aligns the deep semantic vectors output by each semantic vector extraction layer of the language model through adversarial training: the outputs of each extraction layer for the source and target languages are aligned into the same vector space. Concretely, a discriminator D is added after the output of each semantic vector extraction layer and trained to distinguish whether an output vector comes from the source language or the target language; simultaneously, the language model is trained so that the vectors it generates (the outputs of each extraction layer) cannot be distinguished by the discriminator. Once the discriminator's classification accuracy is low enough, the deep semantic vectors output for the source and target languages can be considered aligned into the same space.
The first accuracy threshold is a lower bound and the second accuracy threshold an upper bound on the discrimination accuracy; that is, the discriminator's training target is reached when its accuracy falls within the range between the two thresholds, for example a first threshold of 48% and a second threshold of 52%.
Suppose the output vectors of the source language and the target language (outputs of a semantic vector extraction layer) are h_s and h_t respectively. A discriminator D is trained to classify h_s as 0 and h_t as 1; its training loss is:
L_D = cross_entropy(D(h_s), 0) + cross_entropy(D(h_t), 1)
At the same time, the multilingual language model is given the opposite loss, −L_D:
−L_D = cross_entropy(D(h_s), 1) + cross_entropy(D(h_t), 0)
In the two-language case, training converges when the discrimination accuracy stabilizes at about 50%, at which point the discriminator can no longer tell whether a deep semantic vector comes from the source language or the target language.
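The following PyTorch sketch shows one such adversarial update for the outputs of a single extraction layer, continuing the assumptions of the earlier sketch (300-dimensional outputs; the two-layer MLP discriminator is an illustrative choice the patent does not specify). In practice this term is trained jointly with the language model's ordinary next-word (perplexity) loss from step S205.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

disc = nn.Sequential(nn.Linear(300, 128), nn.ReLU(), nn.Linear(128, 2))
disc_opt = torch.optim.Adam(disc.parameters(), lr=1e-4)

def adversarial_step(h_s, h_t, lm_opt):
    """h_s, h_t: one layer's output vectors for a source/target batch, (N, 300)."""
    src = torch.zeros(len(h_s), dtype=torch.long)   # source class 0
    tgt = torch.ones(len(h_t), dtype=torch.long)    # target class 1
    # Train D with L_D: detach so only the discriminator is updated here.
    d_loss = (F.cross_entropy(disc(h_s.detach()), src)
              + F.cross_entropy(disc(h_t.detach()), tgt))
    disc_opt.zero_grad(); d_loss.backward(); disc_opt.step()
    # Train the language model with the opposite loss -L_D (labels swapped).
    lm_loss = F.cross_entropy(disc(h_s), tgt) + F.cross_entropy(disc(h_t), src)
    lm_opt.zero_grad(); lm_loss.backward(); lm_opt.step()
    # Report D's accuracy; alignment has converged once it hovers near 50%,
    # i.e. between the first and second accuracy thresholds (e.g. 48%-52%).
    with torch.no_grad():
        correct = torch.cat([disc(h_s).argmax(-1) == src,
                             disc(h_t).argmax(-1) == tgt])
        return correct.float().mean().item()
```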
In this embodiment, because the language model stacks multiple semantic vector extraction layers, it can extract deeper semantic information, making cross-language alignment easier. In experiments, performing nearest-neighbor recall on the language model's output vectors for the English word "brown" returned, depending on the context, a personal name or a color in Chinese, which verifies the effectiveness of the method provided by the embodiments of the application.
This concludes the explanation of how the language model is constructed.
2) A named entity recognition model is learned from the second source-language corpus set with labeled named entities.
To train the named entity recognition model, it is learned from the labeled source-language corpus by sequence-labeling machine learning. Its input data are the semantic vectors corresponding to the words of the source-language corpus, and its output data are the named entities.
Note that the source-language corpora used to train the language model and the downstream named entity recognition model may differ. The labeled second source-language corpus for the downstream named entity recognition task is far too small to train the language model, so the language model is trained on large-scale unlabeled corpora of both the source and target languages; its training data consist mainly of the unlabeled first source-language corpus and the target-language corpus, and may also include the labeled second source-language corpus.
Please refer to FIG. 4, a detailed flowchart of an embodiment of the text processing method provided in the present application. In this embodiment, the text processing model is learned through the following steps:
Step S401: acquiring the second source-language corpus.
Step S403: determining, through the language model, a cross-language aligned context-dependent word vector for at least one word included in the second source-language corpus.
In the method provided by the embodiments of the application, the cross-language aligned context-dependent word vectors of the words in the second source-language corpus serve as the input vectors of the named entity recognition model; therefore, before that model is trained, the words of the second source-language corpus are converted into cross-language aligned context-dependent word vectors through the language model.
Step S405: constructing a neural network for the text processing model.
Step S407: training the neural network on the set of correspondences between the cross-language aligned context-dependent word vectors and the labeled text processing results.
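As an illustration of steps S401 through S407, the sketch below trains a deliberately simple per-token classifier for named entity recognition on top of the frozen multilingual language model from the earlier sketch. The tagger architecture, the tag count, and the concatenation choice are assumptions, since the patent leaves the downstream network open.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_TAGS = 9          # e.g. BIO tags for four entity types plus "O" (assumed)
N_LAYERS, DIM = 2, 300

tagger = nn.Linear(N_LAYERS * DIM, N_TAGS)   # neural network of the text model
tagger_opt = torch.optim.Adam(tagger.parameters(), lr=1e-3)

def aligned_vectors(lm, word_ids, lang):
    """Cross-language aligned context-dependent vectors (concatenation variant)."""
    with torch.no_grad():                     # the language model stays frozen
        _, layer_outputs = lm(word_ids, lang)
    return torch.cat(layer_outputs, dim=-1)   # (batch, seq, 300 * n)

def train_step(lm, word_ids, tag_ids):
    """One update on a labeled second source-language batch (word_ids, tag_ids)."""
    feats = aligned_vectors(lm, word_ids, "source")
    logits = tagger(feats)
    loss = F.cross_entropy(logits.reshape(-1, N_TAGS), tag_ids.reshape(-1))
    tagger_opt.zero_grad(); loss.backward(); tagger_opt.step()
    return loss.item()
```

Because the tagger only ever sees aligned vectors, the same weights can later be applied unchanged to target-language inputs.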
After the language model and the text processing model have been trained, texts in the source language or the target language can be processed through the two models.
Step S103: acquiring the text to be processed in the source language or the target language.
In the named entity recognition scenario, the text to be processed may be "I bought an apple mobile phone today", from which the method provided by the embodiments of the application should identify the trade name "apple mobile phone".
After the text to be processed is acquired, the next step determines, through the language model, a cross-language aligned context-dependent word vector for at least one word included in the text.
Step S105: determining, through the language model, a cross-language aligned context-dependent word vector of at least one word included in the text to be processed.
In the method provided by the embodiments of the application, the conventional word vectors corresponding to the words of the text to be processed serve as the input vectors of the language model, and the language model determines the cross-language aligned context-dependent word vectors of those words.
In a specific implementation, the text to be processed may first be segmented by a tokenizer to obtain its words; the words are then looked up in the language's word vector matrix E to obtain the word vector sequence corresponding to the text; this sequence serves as the input data of the language model, which determines the cross-language aligned context-dependent word vector of at least one word in the text.
In one example, the cross-language aligned context-dependent word vector may be determined as follows: 1) obtain the semantic vectors output by each semantic vector extraction layer in the language model; 2) for each word, concatenate the semantic vectors of that word output by the extraction layers to form its cross-language aligned context-dependent word vector. For example, if each extraction layer outputs a 300-dimensional vector and the language model contains n extraction layers, the dimension of each word's cross-language aligned context-dependent word vector is 300 × n.
In another example, the vector may be determined as follows: 1) obtain the semantic vectors output by each semantic vector extraction layer in the language model; 2) for each word, take a weighted average of the semantic vectors of that word output by the extraction layers, with layer weights w1, …, wn, as its cross-language aligned context-dependent word vector. The dimension of each word's vector then remains 300. This variant effectively reduces the input dimensionality, and hence the parameter count, of the text processing model, improving text processing speed. (Both variants are sketched in code below.)
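Under the same assumptions as the earlier sketches, both combination strategies reduce to a few lines; the layer weights w1, …, wn in the second variant are illustrative hyperparameters (they could equally be learned).

```python
import torch

def concat_vectors(layer_outputs):
    """Splicing: each word's vector grows to 300 * n dimensions."""
    return torch.cat(layer_outputs, dim=-1)

def weighted_average_vectors(layer_outputs, weights):
    """Weighted average with weights w1..wn: each word's vector stays 300-dim,
    shrinking the downstream model's input size and parameter count."""
    stacked = torch.stack(layer_outputs)              # (n, batch, seq, 300)
    w = torch.tensor(weights).view(-1, 1, 1, 1)
    return (w * stacked).sum(dim=0) / w.sum()
```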
After the cross-language aligned context-dependent word vectors of the words in the text to be processed are obtained, the next step obtains, through the text processing model trained on the source language, the text processing result for the text in the target language.
Step S107: taking the semantic vectors of the at least one word as input data of the text processing model, and obtaining the text processing result of the text to be processed through the text processing model.
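Putting the pieces together, the end-to-end flow of steps S103 to S107 for a target-language input might look as follows; `tokenize_and_lookup` is a hypothetical helper standing in for the tokenizer and vocabulary lookup described above.

```python
def process_text(lm, text, lang="target"):
    # Step S103: the text to be processed, in the source or target language.
    word_ids = tokenize_and_lookup(text, lang)   # hypothetical segmenter + E lookup
    # Step S105: cross-language aligned context-dependent word vectors.
    feats = aligned_vectors(lm, word_ids, lang)
    # Step S107: the tagger trained only on source-language data predicts tags.
    return tagger(feats).argmax(dim=-1)
```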
Please refer to FIG. 5, a schematic diagram of another language model in an embodiment of the text processing method provided in the present application. In another example, the corpus collection includes corpora of a plurality of source languages, and the language categories include the plurality of source languages and the target language. Correspondingly, the language model is learned at least from a corpus collection comprising unlabeled first source-language corpora of the plurality of source languages and a target-language corpus, with the discriminator distinguishing among the plurality of source languages and the target language; and the text processing model is learned from second source-language corpora of the plurality of source languages labeled with text processing results. With this approach, the language model aligns the deep semantic vectors of multiple languages, the text processing model is migrated from several source languages to the target language, and texts in several languages are processed by a single text processing model. This effectively reduces the number of text processing models and improves the efficiency of constructing them; at the same time, building the text processing model on several source languages effectively improves model accuracy and thus the text processing effect.
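For this multi-source variant, the only structural change to the earlier adversarial sketch is that the discriminator becomes a multi-way classifier over all languages; with k source languages plus the target, chance accuracy — and hence the convergence band for the accuracy thresholds — moves from 50% to 1/(k+1). A hedged one-line adjustment, continuing the earlier sketch's imports:

```python
# Assumption: k = 2 source languages plus one target language, i.e. 3 classes;
# each language is assigned its own class index in the cross-entropy losses.
disc_multi = nn.Sequential(nn.Linear(300, 128), nn.ReLU(), nn.Linear(128, 3))
```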
As can be seen from the foregoing embodiments, the text processing method provided by the embodiments of the application learns a language model at least from a corpus collection comprising a first source-language corpus and a target-language corpus without labeled text processing results, and learns a text processing model from a second source-language corpus set labeled with text processing results; it acquires a text to be processed in the source language or the target language, determines through the language model a cross-language aligned context-dependent word vector for at least one word of that text, and, taking those vectors as input data of the text processing model, predicts the text processing result. This aligns the deep semantic vectors of corresponding content, with its context information, between the source and target languages and performs cross-language model migration on the aligned vectors, which effectively improves the accuracy of the migrated model and thus of text processing. In addition, the language model can be trained solely on widely available monolingual unlabeled corpora, so no additional bilingual resources such as parallel corpora are needed; the dependence on bilingual resources is eliminated entirely, and the method applies broadly to genuinely resource-poor languages.
The foregoing embodiments provide a text processing method; correspondingly, the application also provides a text processing apparatus. The apparatus corresponds to the embodiments of the method described above.
Second embodiment
Please refer to FIG. 6, a schematic diagram of an embodiment of the text processing apparatus of the present application. Since the apparatus embodiments are substantially similar to the method embodiments, they are described relatively briefly; for the relevant points, refer to the corresponding descriptions of the method embodiments. The apparatus embodiments described below are merely illustrative.
The present application further provides a text processing apparatus, comprising:
a language model construction unit 601, configured to learn a language model at least from a corpus collection comprising a first source-language corpus and a target-language corpus without labeled text processing results;
a text processing model construction unit 603, configured to learn a text processing model from the second source-language corpus set labeled with text processing results;
a text acquisition unit 605, configured to acquire a text to be processed in the source language or the target language;
a semantic vector determination unit 607, configured to determine, through the language model, a cross-language aligned context-dependent word vector of at least one word included in the text to be processed;
and a model prediction unit 609, configured to take the cross-language aligned context-dependent word vectors as input data of the text processing model and obtain the text processing result of the text to be processed through the text processing model.
Optionally, the apparatus further includes a language model construction unit, which comprises:
a corpus acquisition subunit, configured to acquire the corpus collection;
a neural network construction subunit, configured to construct a neural network for the language model, the neural network comprising at least one semantic vector extraction layer, each followed by a language category discriminator that discriminates the language category of the word vectors output by the adjacent preceding semantic vector extraction layer, the language categories comprising the source language and the target language;
and a training subunit, configured to train the neural network on the corpus collection with the discriminator's accuracy as a training target, until that accuracy is greater than a first accuracy threshold and less than a second accuracy threshold and the perplexity of the language model is less than a perplexity threshold.
Optionally, the corpus collection comprises corpora of a plurality of source languages;
the language model construction unit is specifically configured to learn the language model at least from a corpus collection comprising unlabeled first source-language corpora of the plurality of source languages and a target-language corpus, the language categories discriminated by the discriminator comprising the plurality of source languages and the target language;
and the text processing model construction unit is specifically configured to learn the text processing model from second source-language corpora of the plurality of source languages labeled with text processing results.
Optionally, the semantic vector determination unit includes:
an output vector acquisition subunit, configured to obtain the semantic vectors output by each semantic vector extraction layer in the language model;
and a vector concatenation subunit, configured to concatenate, for each word, the semantic vectors of that word output by the semantic vector extraction layers to serve as the word's cross-language aligned context-dependent word vector.
Optionally, the semantic vector determination unit includes:
an output vector acquisition subunit, configured to obtain the semantic vectors output by each semantic vector extraction layer in the language model;
and a vector calculation subunit, configured to take, for each word, a weighted average of the semantic vectors of that word output by the semantic vector extraction layers as the word's cross-language aligned context-dependent word vector.
Optionally, the text processing model construction unit includes:
a training data acquisition subunit, configured to acquire the second source-language corpus;
a semantic vector determination subunit, configured to determine, through the language model, a cross-language aligned context-dependent word vector for at least one word included in the second source-language corpus;
a neural network construction subunit, configured to construct a neural network for the text processing model;
and a training subunit, configured to train the neural network on the set of correspondences between the cross-language aligned context-dependent word vectors and the labeled text processing results.
As can be seen from the foregoing embodiments, the text processing apparatus provided by the embodiments of the application learns a language model at least from a corpus collection comprising a first source-language corpus and a target-language corpus without labeled text processing results, and learns a text processing model from a second source-language corpus set labeled with text processing results; it acquires a text to be processed in the source language or the target language, determines through the language model a cross-language aligned context-dependent word vector for at least one word of that text, and, taking those vectors as input data of the text processing model, predicts the text processing result. This aligns the deep semantic vectors of corresponding content, with its context information, between the source and target languages and performs cross-language model migration on the aligned vectors, which effectively improves the accuracy of the migrated model and thus of text processing. In addition, the language model can be trained solely on widely available monolingual unlabeled corpora, so no additional bilingual resources such as parallel corpora are needed; the dependence on bilingual resources is eliminated entirely, and the apparatus applies broadly to genuinely resource-poor languages.
Although the present application has been disclosed above by way of preferred embodiments, they are not intended to limit it. Anyone skilled in the art can make possible variations and modifications without departing from the spirit and scope of the application; the scope of protection of the application shall therefore be that defined by the claims.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory, random access memory (RAM), and/or non-volatile memory in the form of computer-readable media, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
1. Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.
2. As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Claims (14)

1. A text processing method, comprising:
learning a language model at least from a corpus collection comprising a first source-language corpus and a target-language corpus without labeled text processing results; and learning a text processing model from a second source-language corpus set labeled with text processing results;
acquiring a text to be processed in the source language or the target language;
determining, through the language model, a cross-language aligned context-dependent word vector of at least one word included in the text to be processed;
and taking the cross-language aligned context-dependent word vectors as input data of the text processing model, and obtaining the text processing result of the text to be processed through the text processing model.
2. The method of claim 1, wherein the language model is learned through the following steps:
acquiring the corpus collection;
constructing a neural network for the language model, the neural network comprising at least one semantic vector extraction layer, each followed by a language category discriminator that discriminates the language category of the word vectors output by the adjacent preceding semantic vector extraction layer, the language categories comprising the source language and the target language;
and training the neural network on the corpus collection with the discriminator's accuracy as a training target, until that accuracy is greater than a first accuracy threshold and less than a second accuracy threshold and the perplexity of the language model is less than a perplexity threshold.
3. The method of claim 2, wherein:
the corpus collection comprises corpora of a plurality of source languages;
the language model is learned at least from a corpus collection comprising unlabeled first source-language corpora of the plurality of source languages and a target-language corpus, and the language categories discriminated by the discriminator comprise the plurality of source languages and the target language;
and the text processing model is learned from second source-language corpora of the plurality of source languages labeled with text processing results.
4. The method of claim 1, wherein the corpus collection does not include parallel corpora between the source language and the target language.
5. The method of claim 1, wherein determining, through the language model, a cross-language aligned context-dependent word vector of at least one word included in the text to be processed comprises:
obtaining the semantic vectors output by each semantic vector extraction layer in the language model;
and, for each word, concatenating the semantic vectors of that word output by the semantic vector extraction layers to serve as the word's cross-language aligned context-dependent word vector.
6. The method of claim 1, wherein determining, through the language model, a cross-language aligned context-dependent word vector of at least one word included in the text to be processed comprises:
obtaining the semantic vectors output by each semantic vector extraction layer in the language model;
and, for each word, taking a weighted average of the semantic vectors of that word output by the semantic vector extraction layers as the word's cross-language aligned context-dependent word vector.
7. The method of claim 1, wherein the text processing model is learned through the following steps:
acquiring the second source-language corpus;
determining, through the language model, a cross-language aligned context-dependent word vector for at least one word included in the second source-language corpus;
constructing a neural network for the text processing model;
and training the neural network on the set of correspondences between the cross-language aligned context-dependent word vectors and the labeled text processing results.
8. The method of claim 1, wherein the text processing model comprises: a named entity recognition model, a sentiment classification model, or a part-of-speech tagging model.
9. A text processing apparatus, comprising:
a language model construction unit, configured to learn a language model at least from a corpus collection comprising a first source-language corpus and a target-language corpus without labeled text processing results;
a text processing model construction unit, configured to learn a text processing model from the second source-language corpus set labeled with text processing results;
a text acquisition unit, configured to acquire a text to be processed in the source language or the target language;
a semantic vector determination unit, configured to determine, through the language model, a cross-language aligned context-dependent word vector of at least one word included in the text to be processed;
and a model prediction unit, configured to take the cross-language aligned context-dependent word vectors as input data of the text processing model and obtain the text processing result of the text to be processed through the text processing model.
10. The apparatus of claim 9, further comprising a language model construction unit, which comprises:
a corpus acquisition subunit, configured to acquire the corpus collection;
a neural network construction subunit, configured to construct a neural network for the language model, the neural network comprising at least one semantic vector extraction layer, each followed by a language category discriminator that discriminates the language category of the word vectors output by the adjacent preceding semantic vector extraction layer, the language categories comprising the source language and the target language;
and a training subunit, configured to train the neural network on the corpus collection with the discriminator's accuracy as a training target, until that accuracy is greater than a first accuracy threshold and less than a second accuracy threshold and the perplexity of the language model is less than a perplexity threshold.
11. The apparatus of claim 9, wherein:
the corpus collection comprises corpora of a plurality of source languages;
the language model construction unit is specifically configured to learn the language model at least from a corpus collection comprising unlabeled first source-language corpora of the plurality of source languages and a target-language corpus, the language categories discriminated by the discriminator comprising the plurality of source languages and the target language;
and the text processing model construction unit is specifically configured to learn the text processing model from second source-language corpora of the plurality of source languages labeled with text processing results.
12. The apparatus of claim 9, wherein the semantic vector determination unit comprises:
an output vector acquisition subunit, configured to acquire the semantic vectors output by each semantic vector extraction layer in the language model;
and a vector concatenation subunit, configured to, for each word, concatenate the semantic vectors of the word output by the semantic vector extraction layers into the cross-language aligned context-dependent word vector of the word.
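Claim 12 builds the aligned word vector by joining the per-layer outputs end-to-end. A minimal sketch, assuming each extraction layer yields a `(seq_len, dim)` tensor:

```python
import torch

def concat_layer_vectors(layer_outputs):
    """layer_outputs: list of (seq_len, dim) tensors, one per semantic
    vector extraction layer. Returns (seq_len, num_layers * dim): each
    word's per-layer semantic vectors concatenated into one
    context-dependent word vector."""
    return torch.cat(layer_outputs, dim=-1)
```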
13. The apparatus of claim 9, wherein the semantic vector determination unit comprises:
an output vector acquisition subunit, configured to acquire the semantic vectors output by each semantic vector extraction layer in the language model;
and a vector calculation subunit, configured to, for each word, take a weighted average of the semantic vectors of the word output by the semantic vector extraction layers as the cross-language aligned context-dependent word vector of the word.
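Claim 13 replaces concatenation with a weighted average, in the spirit of ELMo-style scalar mixing; whether the per-layer weights are fixed or learned is left open by the claim. A minimal sketch under the same tensor-shape assumption as above:

```python
import torch

def weighted_average_vectors(layer_outputs, weights):
    """layer_outputs: list of (seq_len, dim) tensors, one per extraction
    layer; weights: one scalar per layer, assumed to sum to 1. Returns a
    (seq_len, dim) tensor: one cross-language aligned context-dependent
    word vector per word position."""
    stacked = torch.stack(layer_outputs)                     # (num_layers, seq_len, dim)
    w = torch.tensor(weights, dtype=stacked.dtype).view(-1, 1, 1)
    return (w * stacked).sum(dim=0)                          # weighted sum over layers
```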
14. The apparatus of claim 9, wherein the text processing model construction unit comprises:
a training data acquisition subunit, configured to acquire the second source language corpus set;
a semantic vector determination subunit, configured to determine, through the language model, cross-language aligned context-dependent word vectors for at least one word included in the second source language corpus set;
a neural network construction subunit, configured to construct a neural network of the text processing model;
and a training subunit, configured to train the neural network according to the set of correspondences between the cross-language aligned context-dependent word vectors and the labeled text processing results.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910111565.XA CN111563381B (en) 2019-02-12 2019-02-12 Text processing method and device

Publications (2)

Publication Number Publication Date
CN111563381A true CN111563381A (en) 2020-08-21
CN111563381B CN111563381B (en) 2023-04-21

Family

ID=72069482

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910111565.XA Active CN111563381B (en) 2019-02-12 2019-02-12 Text processing method and device

Country Status (1)

Country Link
CN (1) CN111563381B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090083023A1 (en) * 2005-06-17 2009-03-26 George Foster Means and Method for Adapted Language Translation
US20080082320A1 (en) * 2006-09-29 2008-04-03 Nokia Corporation Apparatus, method and computer program product for advanced voice conversion
CN106383816A (en) * 2016-09-26 2017-02-08 大连民族大学 Chinese minority region name identification method based on deep learning
CN106547735A (en) * 2016-10-25 2017-03-29 复旦大学 The structure and using method of the dynamic word or word vector based on the context-aware of deep learning
CN109101476A (en) * 2017-06-21 2018-12-28 阿里巴巴集团控股有限公司 A kind of term vector generates, data processing method and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
B. MURALI KARTHICK et al.: "Improving deep neural networks using state projection vectors of subspace Gaussian mixture model as features" *
唐亮; 席耀一; 彭波; 刘香伟; 易绵竹: "Research on Vietnamese-Chinese cross-language event retrieval based on word vectors" *
张剑 et al.: "Recurrent neural network language model based on word vector features" *
胡亚楠; 惠浩添; 钱龙华; 朱巧明: "Bilingual collaborative relation extraction based on machine translation" *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112926324A (en) * 2021-02-05 2021-06-08 昆明理工大学 Vietnamese event entity recognition method integrating a dictionary and adversarial transfer
CN112926324B (en) * 2021-02-05 2022-07-29 昆明理工大学 Vietnamese event entity recognition method integrating a dictionary and adversarial transfer
CN116108859A (en) * 2023-03-17 2023-05-12 美云智数科技有限公司 Methods, apparatuses and devices for sentiment tendency determination, sample construction and model training

Also Published As

Publication number Publication date
CN111563381B (en) 2023-04-21

Similar Documents

Publication Publication Date Title
Zhai et al. Neural models for sequence chunking
US20230016365A1 (en) Method and apparatus for training text classification model
US11328129B2 (en) Artificial intelligence system using phrase tables to evaluate and improve neural network based machine translation
WO2020220539A1 (en) Data increment method and device, computer device and storage medium
EP3926531B1 (en) Method and system for visio-linguistic understanding using contextual language model reasoners
CN110619044B (en) Emotion analysis method, system, storage medium and equipment
CN110489559A (en) A kind of file classification method, device and storage medium
CN106610990A (en) Emotional tendency analysis method and apparatus
CN115017916A (en) Aspect level emotion analysis method and device, electronic equipment and storage medium
CN112784009A (en) Subject term mining method and device, electronic equipment and storage medium
CN111563381B (en) Text processing method and device
Tüselmann et al. Recognition-free question answering on handwritten document collections
CN111460224B (en) Comment data quality labeling method, comment data quality labeling device, comment data quality labeling equipment and storage medium
CN112434533A (en) Entity disambiguation method, apparatus, electronic device, and computer-readable storage medium
Zhang et al. Modeling the relationship between user comments and edits in document revision
Gallay et al. Utilizing vector models for automatic text lemmatization
CN110276001B (en) Checking page identification method and device, computing equipment and medium
CN114528851A (en) Reply statement determination method and device, electronic equipment and storage medium
CN114610878A (en) Model training method, computer device and computer-readable storage medium
CN114239555A (en) Training method of keyword extraction model and related device
CN111126066A (en) Method and device for determining Chinese retrieval method based on neural network
US20240095456A1 (en) Text summarization with emotion conditioning
CN113254587B (en) Search text recognition method and device, computer equipment and storage medium
Morris et al. Welsh automatic text summarisation
US20240086434A1 (en) Enhancing dialogue management systems using fact fetchers

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant