CN115718904A - Text processing method and device - Google Patents

Text processing method and device

Info

Publication number
CN115718904A
CN115718904A
Authority
CN
China
Prior art keywords
text
language
target
vector
target language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211412308.8A
Other languages
Chinese (zh)
Inventor
姬子明 (Ji Ziming)
李长亮 (Li Changliang)
李小龙 (Li Xiaolong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Digital Entertainment Co Ltd
Original Assignee
Beijing Kingsoft Digital Entertainment Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Digital Entertainment Co Ltd filed Critical Beijing Kingsoft Digital Entertainment Co Ltd
Priority to CN202211412308.8A priority Critical patent/CN115718904A/en
Publication of CN115718904A publication Critical patent/CN115718904A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The application provides a text processing method and a text processing device, wherein the text processing method comprises the following steps: acquiring a text to be processed corresponding to a source language; constructing a source language code vector corresponding to the text to be processed, and converting the source language code vector into a target language code vector; fusing the source language code vector and the target language code vector to obtain a fused vector; and decoding the fused vector to generate a target text corresponding to the target language. By this text processing method, cross-language text processing can be realized; because the cross-language conversion is performed at the code-vector dimension, text processing precision is further improved.

Description

Text processing method and device
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a text processing method. The application also relates to a text processing device, a computing device and a computer readable storage medium.
Background
Artificial Intelligence (AI) refers to the ability of an engineered (i.e., designed and manufactured) system to perceive its environment, and to acquire, process, apply, and represent knowledge. Deep learning frameworks encapsulate the underlying algorithms of artificial intelligence. With the development of artificial intelligence, various deep learning frameworks have continued to emerge; general-purpose frameworks such as TensorFlow and PyTorch are widely applied today in fields such as natural language processing, computer vision, and speech processing, and in industries such as machine translation, intelligent finance, intelligent medical treatment, and automatic driving. Research on cross-language automatic summarization has also become an important current direction: automatic summarization is a key technology for coping with information explosion, and cross-language automatic summarization enables users to quickly browse documents from multiple countries and quickly grasp information from different countries and regions. In the prior art, cross-language automatic summary generation is mostly implemented in a pipeline form, namely text-translation-summary or text-summary-translation, or through a reinforcement learning model. However, such schemes not only introduce large errors but also rarely consider the influence of information interaction among multiple languages on cross-language summarization. Therefore, an effective solution to the above problems is desired.
Disclosure of Invention
In view of this, embodiments of the present application provide a text processing method to solve technical defects in the prior art. The embodiment of the application also provides a text processing device, a computing device and a computer readable storage medium.
According to a first aspect of embodiments of the present application, there is provided a text processing method, including:
acquiring a text to be processed corresponding to a source language;
constructing a source language code vector corresponding to the text to be processed, and converting the source language code vector into a target language code vector;
fusing the source language code vector and the target language code vector to obtain a fused vector;
and decoding the fusion vector to generate a target text corresponding to the target language.
According to a second aspect of embodiments of the present application, there is provided a text processing apparatus including:
the acquisition module is configured to acquire the text to be processed corresponding to the source language;
the construction module is configured to construct a source language code vector corresponding to the text to be processed, and convert the source language code vector into a target language code vector;
the fusion module is configured to fuse the source language code vector and the target language code vector to obtain a fusion vector;
and the decoding module is configured to generate a target text corresponding to the target language by decoding the fusion vector.
According to a third aspect of embodiments herein, there is provided a computing device comprising:
a memory and a processor;
the memory is for storing computer-executable instructions that, when executed by the processor, implement the steps of the text processing method.
According to a fourth aspect of embodiments herein, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the text processing method.
According to a fifth aspect of embodiments of the present application, there is provided a chip storing a computer program which, when executed by the chip, implements the steps of the text processing method.
According to the text processing method, in order to improve the cross-language text processing accuracy, after the text to be processed corresponding to the source language is obtained, the source language code vector corresponding to the text to be processed is constructed, and then the source language code vector is converted into the target language code vector from the dimension of the code vector, so that the problem of cross-language vector mapping can be effectively solved; then, the source language code vector and the target language code vector are fused into a fusion vector, and finally, decoding processing is carried out through the fusion vector, so that a target text corresponding to the target language can be obtained; the method realizes cross-language mapping by converting in the coding stage, and can effectively ensure the text processing accuracy.
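The encode-map-fuse-decode flow described above can be sketched in a few lines. This is a minimal numpy illustration, not the patent's implementation: the mean-pooling "encoder", the linear mapper, the element-wise-average fusion, and all dimensions are assumptions chosen only to make the data flow concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # hypothetical hidden dimension

def encode(tokens, embedding):
    """Toy encoder: mean of token embeddings stands in for a real encoder stack."""
    return np.mean([embedding[t] for t in tokens], axis=0)

def map_to_target(src_vec, W_map):
    """Mapper: a linear projection from source- to target-language space (assumed form)."""
    return src_vec @ W_map

def fuse(src_vec, tgt_vec):
    """Fusion: element-wise average of the two encoding vectors (one simple choice)."""
    return 0.5 * (src_vec + tgt_vec)

# Illustrative data: a 3-token "sentence" over a 10-word vocabulary.
embedding = rng.normal(size=(10, D))
W_map = rng.normal(size=(D, D))

src = encode([1, 4, 7], embedding)   # source language code vector
tgt = map_to_target(src, W_map)      # target language code vector
fused = fuse(src, tgt)               # fusion vector handed to the decoder
assert fused.shape == (D,)
```

A real system would replace each stub with a trained network layer; the point here is only that the cross-language conversion happens between encoding and decoding, on the vectors themselves.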
Drawings
Fig. 1 is a schematic structural diagram of a text processing method according to an embodiment of the present application;
fig. 2 is a flowchart of a text processing method according to an embodiment of the present application;
fig. 3 is a processing flow chart of a text processing method applied in a summary generation scenario according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present application;
fig. 5 is a block diagram of a computing device according to an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the present application can be implemented in many ways other than those described herein, and those skilled in the art can make similar modifications without departing from the spirit and scope of the application; therefore, the application is not limited to the specific implementations disclosed below.
The terminology used in the one or more embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the present application. As used in one or more embodiments of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present application refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that, although the terms first, second, etc. may be used herein in one or more embodiments to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first aspect may be termed a second aspect, and, similarly, a second aspect may be termed a first aspect, without departing from the scope of one or more embodiments of the present application.
First, the terms used in one or more embodiments of the present application are explained.
Cross-language automatic summarization: given a text document in one language (e.g., chinese), a summary text in another language (e.g., english) is obtained by an algorithm model.
Cross-language automatic summary pipeline: firstly, translating the A language text into the B language text, and then abstracting the B language text; or, the text in the language A is firstly abstracted, and then the abstracted text is translated into the text in the language B.
Source language: the original language of the text to be processed.
Target language: a language different from the source language; content in the target language and the source language can be translated into each other.
Text to be processed: text from which a summary needs to be generated, including but not limited to news, articles, novels, and the like.
Target text: the summary extracted from the text to be processed, expressed in the corresponding target language.
In the present application, a text processing method is provided. The present application relates to a text processing apparatus, a computing device, and a computer-readable storage medium, which are described in detail in the following embodiments one by one.
In practical applications, with the development of internet technology, explosive growth of information has become the norm, so the mass of data on the internet contains a large amount of redundant and invalid information, and it becomes increasingly important for users to select the information they need quickly and effectively. In particular, as information becomes more interconnected, the information a user needs is no longer limited to their native language; information in other languages may also be required. Therefore, the study of cross-language technology in the field of artificial intelligence is an important current direction. Automatic summarization is a key technology for coping with information explosion, and cross-language automatic summarization enables people to quickly browse documents from multiple countries and helps users quickly grasp information in different languages and regions. Meanwhile, research on cross-language automatic summarization is of great significance for application scenarios such as cross-border e-commerce (assisting users in making decisions), public-opinion analysis (helping analysts filter redundant information), and content recommendation (recommending foreign-language news to users). Therefore, cross-language automatic summarization has important research value and application value.
In the prior art, cross-language automatic summarization is mostly implemented as a pipeline, based on either text-translation-summary or text-summary-translation. Alternatively, it is implemented through a reinforcement learning model, i.e., a text is input into the model and a summary is generated across languages; or by constructing a dictionary, i.e., generating the word with the maximum probability each time and producing the summary by splicing the words together. However, although cross-language summary generation can be realized by a pipeline, a reinforcement learning model, or dictionary construction, these approaches suffer from large error propagation, lack the ability to generate a target-language summary directly from text in one language, and do not consider the influence of information interaction among multiple languages on cross-language summarization. There is therefore a need for an effective solution to the above problems.
Referring to the schematic structural diagram shown in fig. 1, in order to improve the accuracy of processing a cross-language text, the text processing method provided by the present application may construct a source language code vector corresponding to a to-be-processed text after obtaining the to-be-processed text corresponding to a source language, and then convert the source language code vector into a target language code vector from the dimension of the code vector, so as to effectively solve the problem of cross-language vector mapping; then, the source language code vector and the target language code vector are fused into a fusion vector, and finally, decoding processing is carried out through the fusion vector, so that a target text corresponding to the target language can be obtained; the method realizes cross-language mapping by converting in the coding stage, and can effectively ensure the text processing accuracy.
Fig. 2 is a flowchart of a text processing method according to an embodiment of the present application, which specifically includes steps S202 to S208:
step S202: and acquiring the text to be processed corresponding to the source language.
Specifically, the source language refers to the original language of the text to be processed, and correspondingly, the text to be processed refers to a text for which a summary needs to be generated, including but not limited to news, articles, papers, novels, and the like. It should be noted that the text processing method provided by the present application can be applied to: generating a cross-language summary for a to-be-processed text uploaded by a user, or generating cross-language summaries for all to-be-processed texts contained in a text library corresponding to a search engine.
The core of the embodiments of the present application lies in cross-language summary generation, and the generation process is basically the same for different language pairs. Taking English as the source language and Chinese as the target language, the process of generating a Chinese summary for an English text to be processed is described in detail below.
In practical applications, as technologies and languages become interconnected, when users search for information such as related knowledge points and news, articles in languages other than their native language may be involved. Text in a non-native language is difficult for the user to understand, and translation is time-consuming; even after translation, the content may turn out not to be what the user needs, which greatly affects the user experience. To help the user quickly determine whether a text's content is what they need, a summary can be generated automatically for each text in the user's native language, assisting the user in understanding the content; the user can then translate or download the text after confirming that it is the one they need. In this process, cross-language summary generation is performed on text in a non-native language (a native-language summary is generated for the to-be-processed text in the non-native language).
Based on this, after the to-be-processed text corresponding to the source language is obtained, mapping from the source language space to the target language space can subsequently be realized through cross-language coding-vector conversion, after which decoding yields the target text corresponding to the target language, i.e., the summary of the to-be-processed text in the target language, thereby effectively ensuring the accuracy of summary generation.
Further, before processing the text to be processed, in order to reduce interference generated by redundant data, the text to be processed may be preprocessed, and after obtaining the standard text, the following abstract extraction may be performed, which is specifically implemented in this embodiment as step S2022 to step S2024:
step S2022, obtain the service text corresponding to the source language.
Step S2024, obtaining the to-be-processed text corresponding to the source language by preprocessing the service text.
Specifically, the service text corresponding to the source language specifically refers to a text which has not been preprocessed, and the text includes partial redundant information, such as punctuation marks, wrongly written or mispronounced characters, special characters and other elements; correspondingly, the preprocessing is performed on the service text, specifically, the operation of removing the redundant information contained in the service text is performed, so as to reduce the influence of the redundant information on the accuracy of the subsequent summary generation. The preprocessing process includes, but is not limited to, data cleaning, word segmentation, error correction, and/or redundant character elimination.
Based on this, after the service text corresponding to the source language is acquired, in order to further improve the accuracy of generating the cross-language abstract, the service text may be preprocessed first, so that the service text may be converted into a to-be-processed text that does not include redundant information.
In conclusion, before the abstract is generated, the service text is preprocessed, so that the service text can be converted into the text which is not influenced by the redundant information, and the accuracy of the abstract generated subsequently is effectively promoted.
On this basis, preprocessing the service text can be realized by combining data cleaning and word segmentation, i.e., by cleaning the redundant information in the text and segmenting it into a plurality of word units, so that the cross-language generation operation can be completed word by word at the summary generation stage. In this embodiment, the specific implementation is as follows:
processing the service text according to a preset text cleaning strategy and a word segmentation processing strategy to obtain a standard text corresponding to the source language; and taking the standard text as the text to be processed corresponding to the source language.
Specifically, the text cleaning strategy specifically refers to a strategy for cleaning redundant information in a service text, and includes but is not limited to special character removal, wrongly written character correction, letter case correction and the like; correspondingly, the word segmentation processing strategy specifically refers to a strategy for performing word segmentation processing on the cleaned service text, and the word segmentation processing may be performed by an nltk tool or a jieba tool, and the like, which is not limited herein. Correspondingly, the standard text specifically refers to a text which is obtained by cleaning and word segmentation and consists of a plurality of word units, and the semantic expression of the text is the same as that of the text to be processed.
Based on this, after the service text corresponding to the source language is obtained, the service text can be cleaned according to the text cleaning strategy to remove special characters, correct wrongly written characters, and/or correct letter casing, so as to obtain an initial standard text; then, word segmentation is performed on the initial standard text using the word segmentation strategy, i.e., the initial standard text is segmented with a word segmentation tool, yielding a standard text composed of a plurality of word units. This standard text corresponds to the source language and is taken as the text to be processed, facilitating the subsequent generation of a cross-language summary based on it.
For example, to allow a user to retrieve texts related to a Chinese keyword from Chinese- and English-language texts after entering the keyword, and to help the user understand the content of each text, a summary corresponding to each text needs to be constructed first, and a text in any language should have a Chinese summary for the user's convenience. Based on this, after an English service text is obtained, it can be cleaned, i.e., redundant characters are removed, letter casing is corrected, and misspelled words are corrected, yielding an English initial standard text; then, word segmentation is performed on the English initial standard text with a jieba tool to obtain a standard text composed of a plurality of English words, which is taken as the English text to be processed, so that a Chinese summary corresponding to the English service text can subsequently be generated on the basis of this standard text.
In conclusion, in the preprocessing stage, the service text is processed according to the text cleaning strategy and the word segmentation treatment strategy, so that the interference of redundant information in the service text can be reduced, the subsequent coding treatment is facilitated, and the generation accuracy of the cross-language abstract is effectively improved.
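The cleaning-plus-segmentation preprocessing described above might look like the following sketch. The regular expressions and the naive whitespace split are illustrative assumptions; in practice a tool such as nltk or jieba would perform the segmentation step, as the text notes.

```python
import re

def preprocess(service_text):
    """Hypothetical cleaning + word-segmentation pipeline for a service text."""
    text = re.sub(r"[^\w\s']", " ", service_text)   # strip special characters
    text = re.sub(r"\s+", " ", text).strip()        # collapse redundant whitespace
    return text.lower().split()                     # naive word segmentation

tokens = preprocess("Hello,   WORLD!! This is  a demo...")
# tokens == ['hello', 'world', 'this', 'is', 'a', 'demo']
```

The output, a list of word units, corresponds to the "standard text" that the encoding layer then consumes.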
Step S204, a source language code vector corresponding to the text to be processed is constructed, and the source language code vector is converted into a target language code vector.
Specifically, after the to-be-processed text corresponding to the source language is obtained, further, in order to reduce propagation errors and improve the digest generation accuracy of the cross-language, a source language code vector corresponding to the to-be-processed text may be first constructed, and then mapped to the target language from the source language from the code vector dimension, that is: the source language coding vector is converted into the target language coding vector, and then subsequent decoding is carried out, so that the operation of cross languages is completed in the coding stage, and the error caused by translation in the process of generating the abstract first and then translating or translating first and then generating the abstract can be reduced.
The source language code vector specifically refers to a code vector corresponding to a source language obtained after a text to be processed is coded; correspondingly, the target language code vector is specifically a code vector corresponding to the target language obtained after the source language code vector is converted; in the conversion process, the semantic vector corresponding to the text is mapped to the target language space from the source language space, so that language switching is realized in the semantic expression form, the semantics corresponding to the text are not modified, the subsequent decoding is facilitated on the basis, and the more accurate target text, namely the abstract corresponding to the target language, can be obtained.
When the method is specifically implemented, the semantic vector corresponding to the text to be processed is mapped to the target language space from the source language space, namely: in the process of converting the source language code vector into the target language code vector, the conversion is actually carried out from the vector representation dimension; for example, the source language is chinese, the target language is english, the semantic vector corresponding to the "school" in the source language space is x, the semantic vector corresponding to the "school" in the target language space is y, after the semantic vector corresponding to the "school" is obtained, the semantic vector y having a mapping relationship with the semantic vector x can be directly determined in the target language space, and the semantic vector y is used as a target language encoding vector mapped by the "school" in the target language space. The vector-based conversion is realized, and the error caused by generation of the abstract after translation can be reduced. The semantic vector in the source language space and the semantic vector in the target language space are mapped in advance, so that conversion can be directly carried out based on the mapping relation when in use.
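The pre-established mapping between the two semantic spaces (x for "school" in the source space, y in the target space) can be illustrated as follows. This sketch assumes the mapper is a single linear transform fitted once from pre-aligned vector pairs by least squares; the patent does not specify the mapper's form, so this is one plausible realization only.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4
# Hypothetical pre-aligned pairs: row i of X is a source-space vector and
# row i of Y its counterpart in the target space (e.g. "school" -> x, y).
X = rng.normal(size=(50, D))
W_true = rng.normal(size=(D, D))
Y = X @ W_true

# Fit the mapping W once in advance, then reuse it at inference time.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

x_school = X[0]          # source-language semantic vector for "school"
y_school = x_school @ W  # its image in the target-language space
assert np.allclose(y_school, Y[0], atol=1e-6)
```

Because the mapping is fixed ahead of time, conversion at run time is a single matrix multiply, which is what lets the method skip the translate-then-summarize detour.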
Further, when constructing the source language encoding vector and performing the inter-language conversion of the encoding vector, in order to ensure the encoding accuracy and the encoding and conversion efficiency, the encoding may be implemented through the target language model, in this embodiment, the specific implementation manner is as shown in step S2042 to step S2044:
step S2042, inputting the text to be processed into a target language model, and coding the text to be processed through a coding layer in the target language model to obtain the source language code vector.
Step S2044, mapping the source language code vector by the mapper in the target language model to obtain the target language code vector.
Specifically, the target language model can generate a target text of a target language corresponding to the text to be processed of the source language, and the generated target text is an abstract of the corresponding target language; and the target language model comprises an encoding layer, a decoding layer and a mapper for mapping the encoding vector. Correspondingly, the mapper is specifically a processor for converting the source language code vector into the target language code vector, and is configured to maintain a vector mapping relationship from the source language to the target language, where the mapping relationship may be optimized.
Based on this, after the to-be-processed text corresponding to the source language is obtained, the to-be-processed text corresponding to the source language can be input into the target language model, the to-be-processed text is encoded through the encoding layer in the target language model, so that the semantic vector corresponding to the source language output by the encoding layer, namely the source language encoding vector, is obtained, and then the source language encoding vector is mapped through the mapper in the target language model, namely the source language encoding vector is mapped from the source language space to the target language space, so that the semantic vector corresponding to the target language, namely the target language encoding vector, is obtained for performing abstract generation subsequently.
According to the above example, after the text to be processed corresponding to the english language is obtained, the text to be processed corresponding to the english language can be input into a trained target language model (the text to be processed english-chinese abstract), and the text to be processed corresponding to the english language is encoded through an encoding layer (Encoder) in the target language model, so as to obtain a semantic vector S1 corresponding to the english language; then, the semantic vector S1 corresponding to english is transmitted to a Mapper (Mapper), and the Mapper maps the semantic vector S1 from an english language space to a chinese language space, that is: the semantic vector S1 corresponding to English is converted into the semantic vector S2 corresponding to Chinese for subsequent use.
In practical application, when the mapper performs the cross-language code vector conversion, the mapper actually converts the expression form of the source language code vector into the expression form of the target language code vector by using the maintained cross-language mapping relationship, so as to achieve the purpose of the cross-language conversion of the code vector.
In conclusion, in the encoding stage, the source language encoding vector is converted into the target language encoding vector, so that the propagation error of translation before and after the abstract is generated can be effectively reduced, the accuracy of cross-language abstract generation is further improved, cross-language spatial mapping can be customized according to actual requirements, and the flexibility of cross-language abstract generation is further improved.
In addition, in order to ensure that the target language model has high prediction accuracy, and therefore, the target language model needs to be trained sufficiently before use, in this embodiment, the training process of the target language model is as follows, for example, in steps S2142 to S2148:
step S2142, obtaining an initial sample corresponding to the source language, and processing the initial sample through a coding layer in an initial language model to obtain a sample coding vector corresponding to the source language;
step S2144, the sample coding vector is processed through the mapper in the initial language model, and a sample mapping vector corresponding to the target language is obtained;
step S2146, the sample mapping vector and the sample coding vector are fused, and the fusion result is processed through a decoding layer in the initial language model, so that a predicted text corresponding to the target language is obtained;
step S2148, according to the reference text and the predicted text corresponding to the initial sample, performing parameter adjustment on the initial language model until the target language model meeting the training stop condition is obtained.
Specifically, the initial sample refers to a text used for training the initial language model, whose language is the source language. Correspondingly, the sample coding vector is the semantic vector corresponding to the source language obtained after the initial sample is encoded; the sample mapping vector is the semantic vector of the target language corresponding to the initial sample, obtained after the sample coding vector is mapped from the source language space to the target language space; the predicted text is the predicted summary of the initial sample in the target language, obtained by decoding the fusion result of the sample mapping vector and the sample coding vector; and the reference text is the ground-truth summary of the initial sample in the target language. The training stop condition refers to the condition for stopping model training, including but not limited to a loss-value comparison condition, i.e., the initial language model is determined to satisfy the training stop condition when its loss value is smaller than a preset loss threshold, or an iteration-count condition, i.e., the initial language model satisfies the training stop condition when the number of training iterations exceeds a preset iteration threshold.
It should be noted that, when the model training process is controlled through the loss value comparison condition, the loss value at the current training stage needs to be calculated through a loss function, and the loss function may be selected according to requirements, for example a cross-entropy loss function, a logarithmic loss function, or a quadratic loss function.
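For instance, a token-level cross-entropy loss between the predicted distributions and the reference text could be computed as below (a minimal sketch; the application does not fix the loss function, and the toy distributions are hypothetical):

```python
import math

def cross_entropy_loss(predicted_probs, target_ids):
    """Average negative log-likelihood of the reference tokens.
    predicted_probs: one {token_id: probability} dict per output position.
    target_ids: the reference token id at each position."""
    total = 0.0
    for probs, target in zip(predicted_probs, target_ids):
        total -= math.log(probs[target])
    return total / len(target_ids)

# Hypothetical three-token output distributions and their reference tokens.
probs = [{0: 0.7, 1: 0.3}, {0: 0.2, 1: 0.8}, {0: 0.9, 1: 0.1}]
loss = cross_entropy_loss(probs, [0, 1, 0])
```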
Based on this, in order to give the target language model a more accurate prediction capability, after the initial sample corresponding to the source language is obtained, it is encoded through the encoding layer in the initial language model to obtain the sample coding vector corresponding to the source language; then the mapper in the initial language model maps the sample coding vector from the source language space to the target language space, obtaining the sample mapping vector corresponding to the target language, which means the semantic vector of the initial sample in the target language space has been obtained. To further improve the accuracy of cross-language abstract generation, before decoding, the sample coding vector and the sample mapping vector can be fused, and the decoding layer is then used to decode the fusion result, yielding the predicted text corresponding to the target language. Finally, the parameters of the initial language model are adjusted based on the reference text and the predicted text corresponding to the initial sample, until a target language model meeting the training stop condition is obtained.
It should be noted that, in the process of training the model, in order to ensure that the model has a high predictive ability, after each training is finished, the language model in the current stage may be verified by using a verification set to determine the predictive ability of the language model in the current stage, and if the predictive ability does not meet the requirement, it indicates that the training stop condition is not met, and the training is continued until the target language model meeting the training stop condition is obtained.
In summary, the initial language model is trained by using the initial sample in the training stage, so that the initial language model has the capability of generating the abstract in the cross-language manner, and the operation of processing the abstract in the cross-language manner can be completed quickly and accurately in the application stage.
Before the initial language model is trained, in order to shorten the training period, improve efficiency, and reduce the number of samples required, the parameters of an existing model can first be obtained to initialize the parameters of the initial language model, after which the subsequent training is performed. In this embodiment, the specific implementation is as in steps S2242 to S2244:
step S2242, obtaining service model parameters corresponding to the service language model;
step S2244, updating the initial model parameters of the initial language model according to the service model parameters.
Specifically, the business language model refers to a model that has already been trained and has a certain prediction capability; part of its model parameters can be reused in the initial language model. Correspondingly, the business model parameters refer to the model parameters of the business language model that can be utilized by the initial language model, such as the model parameters corresponding to the encoding layer and the decoding layer.
Based on this, before training the initial language model, the existing business language model may be obtained and its business model parameters determined; the initial model parameters of the initial language model are then updated with the business model parameters, that is, the parameters of the initial language model are initialized, so that in the subsequent training stage the training of the initial language model can build on the business model parameters.
In practical application, considering that different language models have different prediction capabilities and that not all model parameters of one model can be reused, in order to select the model parameters meeting the requirements of the initial language model, the types of model parameters that need to be adjusted in the initial language model can be identified first; then the model parameters related to those types are selected from the business language model as the business model parameters, and the parameters in the initial language model are initialized according to the type correspondence. This can effectively shorten the model training period.
In summary, before model training, the parameters of the initial language model are initialized by using the parameters of the existing business language model, so that the initial language model has a certain prediction capability from the initial training stage, and then training is performed on the basis of the prediction capability, so that the training period can be effectively shortened, the sample demand can be reduced, and the model training efficiency can be improved.
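A minimal sketch of this initialization, assuming parameters are stored in name-keyed dictionaries and that the encoding and decoding layer parameters are the reusable ones (both assumptions are illustrative, not specified by this application):

```python
def init_from_business_model(initial_params, business_params,
                             reusable_prefixes=("encoder.", "decoder.")):
    """Overwrite initial-model parameters with the matching business-model
    parameters (here: encoding/decoding layers), leaving the remaining
    parameters (such as the mapper) at their original initialization."""
    updated = dict(initial_params)
    for name, value in business_params.items():
        if name.startswith(reusable_prefixes) and name in updated:
            updated[name] = value
    return updated

# Hypothetical parameter dictionaries.
initial = {"encoder.weight": 0.0, "mapper.weight": 0.1, "decoder.weight": 0.2}
business = {"encoder.weight": 1.5, "decoder.weight": 2.5, "output_head.weight": 3.5}
initialized = init_from_business_model(initial, business)
```

Only the parameter types sorted out as reusable are copied; the mapper keeps its own initialization and is trained from scratch.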
And S206, fusing the source language code vector and the target language code vector to obtain a fused vector.
Specifically, after the target language code vector corresponding to the target language is obtained, a source language code vector corresponding to the source language and a target language code vector corresponding to the target language are both available at the current stage. In order to improve the accuracy of cross-language abstract generation, before generating the target text corresponding to the target language, the semantic vector corresponding to the source language and the semantic vector corresponding to the target language may be fused, so that in the decoding stage the target language and the source language can be aligned with each other, improving the accuracy of abstract generation. The fusion vector is the coding vector obtained by fusing the source language coding vector and the target language coding vector.
In specific implementation, when the source language code vector and the target language code vector are fused, the two vectors may be spliced, and the fusion vector is obtained from the splicing result. Alternatively, vector fusion can be realized by bitwise addition: the elements contained in the source language coding vector and the elements contained in the target language coding vector are determined, and the two are fused by element-wise addition, yielding the fusion vector.
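The two fusion modes can be sketched on plain Python lists (real implementations would operate on model tensors; this is only an illustration of the operations named above):

```python
def fuse_concat(src_vec, tgt_vec):
    # Splicing: the fusion vector keeps both encodings side by side.
    return list(src_vec) + list(tgt_vec)

def fuse_add(src_vec, tgt_vec):
    # Bitwise (element-wise) addition: both vectors must share a dimension.
    assert len(src_vec) == len(tgt_vec)
    return [s + t for s, t in zip(src_vec, tgt_vec)]
```

Splicing preserves both encodings exactly but doubles the width; addition keeps the width but mixes the two encodings into one vector.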
Furthermore, because the fusion vector contains both the semantic vector corresponding to the source language and the semantic vector corresponding to the target language, in the decoding stage the first text corresponding to the source language may be generated first from the fusion vector, word by word. The second text corresponding to the target language is then generated from the fusion vector, and during its generation the second text can be aligned to the already generated first text; that is, each word unit in the second text, when generated, can be aligned to the corresponding word unit in the first text, so that the final second text is more accurate and the quality of the text generated for the target language is improved.
And step S208, decoding the fusion vector to generate a target text corresponding to the target language.
Specifically, after the fusion vector is obtained, since the source language code vector corresponding to the source language and the target language code vector corresponding to the target language are both recorded in it, decoding yields a decoding vector corresponding to the target language and a decoding vector corresponding to the source language respectively. To support cross-language abstract generation, the decoding vector corresponding to the target language can be selected for conversion and used to generate the target text corresponding to the target language, which serves as the abstract of the text to be processed in the target language.
Further, since the encoding process and the mapping process are completed by the cooperation of the encoding layer and the mapper in the language model, the fusion process before decoding and the decoding process itself are also completed through the language model. In this embodiment, the specific implementation is as in steps S2082 to S2084:
step S2082, fusing the source language code vector and the target language code vector in the target language model to obtain the fusion vector;
step S2084, decoding the fusion vector through a decoding layer in the target language model, and generating a target text corresponding to the target language according to a decoding processing result.
That is, after obtaining a source language code vector corresponding to a source language and a target language code vector corresponding to a target language, the source language code vector and the target language code vector may be fused in a target language model to obtain a fusion vector fusing semantic vectors of the two languages; and then, inputting the fusion vector into a decoding layer in the target language model for decoding processing so as to obtain a target text corresponding to the target language according to a decoding processing result, and realizing cross-language generation of an abstract corresponding to the text to be processed so as to be conveniently used in an actual application scene.
On this basis, when the target text corresponding to the target language of the text to be processed is generated through the target language model, in order to further improve the generation accuracy of the target text, the original text of the source language may be generated first, and then the target text corresponding to the target language is generated in a manner of aligning the original text, in this embodiment, the specific implementation manner is as in steps S2182 to S2184:
step S2182, decoding the source language encoding vector in the fusion vector through a decoding layer in the target language model to obtain an initial text corresponding to the source language;
step S2184, decoding the target language coding vector in the fusion vector according to the processing strategy of aligning the initial text through a decoding layer in the target language model, so as to obtain a target text corresponding to the target language.
Specifically, the initial text corresponding to the source language refers to the abstract text of the source language obtained by decoding the fusion vector through the decoding layer; correspondingly, the processing strategy of aligning the initial text refers to aligning to the initial text during the generation of the target text corresponding to the target language, so that when the target text is generated, corrections can be made in combination with the initial text.
Based on this, after the fusion vector fusing the source language code vector and the target language code vector is input to the decoding layer in the target language model, in order to improve the accuracy of cross-language abstract generation, the decoding layer may first decode the source language code vector in the fusion vector to generate the initial text corresponding to the source language, i.e. the abstract in the source language. The decoding layer then decodes the target language encoding vector in the fusion vector, aligning to the initial text during decoding, so that the target text output by the model has higher accuracy. In practical application, in a scenario where the fusion vector is obtained by vector splicing, the fusion vector can be input into the decoding layer of the target language model to generate the abstract of the source language; then, when the decoding layer generates the abstract of the target language, the decoded target language abstract is aligned with the source language abstract, so that the two abstracts fit each other more closely.
In a scenario where the fusion vector is obtained by vector addition, in order to obtain the abstract of the corresponding target language at the decoding stage, before decoding, each element contained in the fusion vector may be split. The splitting operation can be completed according to the fusion record made during vector fusion; that is, the element value before addition equals the element value obtained after adding and then splitting. The splitting operation yields the source language code vector and the target language code vector. The decoding layer is then used to decode the source language code vector to generate the initial text corresponding to the source language, i.e. the abstract in the source language; next, the split target language code vector is decoded by the decoding layer in the target language model, aligning to the initial text during decoding, so that the target text output by the model has higher accuracy.
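The splitting step can be sketched as follows, assuming the fusion record stores the source-language elements saved at fusion time (the exact content of the fusion record is not specified by this application):

```python
def split_added_fusion(fused_vector, fusion_record):
    """Recover both encodings from an element-wise-added fusion vector.
    `fusion_record` holds the source-language elements saved at fusion
    time, so each pre-addition element value is restored exactly."""
    source_vec = list(fusion_record)
    target_vec = [f - s for f, s in zip(fused_vector, source_vec)]
    return source_vec, target_vec
```

With the record available, every element value before addition is recovered exactly, as the text requires.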
In practical application, when the target language code vector in the fusion vector is decoded according to the strategy of aligning to the initial text, the alignment is completed by generating the target text word by word. That is, in the decoding stage, the decoding vector corresponding to the source language is obtained first, and then this decoding vector and the target language coding vector in the fusion vector are input to the decoding layer for decoding together. The decoding result can thus be aligned, at word unit granularity, with the vector representation of the corresponding word unit in the initial text, correcting the representation of each word unit in the target text, so that the target text composed of all word units fully represents the abstract of the text to be processed in the target language, effectively ensuring the accuracy of cross-language abstract generation.
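The word-by-word generation loop can be outlined as below; `decode_step` is a hypothetical callable standing in for the decoding layer, which at each position sees the target language encoding, the already-decoded source-language text, and the tokens generated so far:

```python
def generate_aligned(decode_step, tgt_encoding, src_decoded, max_len=50, eos="</s>"):
    """Generate the target text word by word; every step also receives the
    already-decoded source-language text so the output can align to it."""
    output = []
    for _ in range(max_len):
        token = decode_step(tgt_encoding, src_decoded, output)
        if token == eos:
            break
        output.append(token)
    return output

# Stub decode step emitting a fixed token sequence, for illustration only.
stream = iter(["中文", "摘要", "</s>"])
summary = generate_aligned(lambda tgt, src, prefix: next(stream), None, None)
```

Because the source-language decoding result is an argument at every step, each generated word unit can be corrected against its counterpart in the initial text.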
Following the above example, after the semantic vector S1 corresponding to English and the semantic vector S2 corresponding to Chinese are obtained, they can be spliced into a fusion vector. In the decoding stage, the fusion vector is input into the decoding layer (Decoder) of the language model, which decodes the semantic vector S1 corresponding to English into a decoding vector S11; the decoding vector S11 can generate the English abstract of the text to be processed (generated character by character). The decoding vector S11 and the semantic vector S2 in the fusion vector are then passed to the decoding layer for decoding; during this decoding, alignment to the English abstract is performed, so that the decoding vector S22 corresponding to Chinese is obtained from the decoding and alignment result. The Chinese abstract can then be generated from the decoding vector S22 and output by the model, determining the Chinese abstract corresponding to the English business text. Optionally, before decoding, the spliced fusion vector may be split to recover the semantic vector S1 and the semantic vector S2, which are then decoded by the decoding layer to obtain the corresponding decoding vectors.
In specific implementation, when the decoding layer performs decoding processing, the encoding vector corresponding to the whole text to be processed is calculated to obtain a decoding vector, and the decoding vector is converted through the output layer of the model to obtain the target text. The decoding process is a process of converting high-dimensional vector expression into low-dimensional vector expression, and the core features of the text to be processed are expressed through the low-dimensional vectors, so that the abstract with higher generality can be obtained.
In summary, by decoding the initial text first and then decoding the target text, the target text corresponding to the target language can be aligned to the initial text of the source language at both the semantic level and the word level during generation. This reduces semantic discrepancies between the target text and the source-language initial text, further ensures the accuracy of cross-language abstract generation, and allows a more accurate abstract to be fed back to the user, meeting the user's needs.
In addition, considering that a user may need target texts of different languages in different service scenarios, when selecting a target text, the target text may be selected in response to a request of the user, and in this embodiment, the specific implementation is as follows, in step S2282 to step S2284:
step S2282, decoding the fusion vector to obtain a first text corresponding to the source language and a second text corresponding to the target language;
step S2284, in response to the cross-language selection request, selects the second text as the target text from the first text corresponding to the source language and the second text corresponding to the target language.
Specifically, the first text specifically refers to a summary obtained by decoding a source language code vector in the fusion vector, and the summary corresponds to the source language; correspondingly, the second text specifically refers to an abstract obtained by decoding the target language encoding vector in the fusion vector, and the abstract corresponds to the target language. Correspondingly, the cross-language selection request specifically refers to a request submitted for the second text, and is used for feeding back a target text different from the source language to the user.
Based on this, after the fusion vector is decoded, the first text corresponding to the source language and the second text corresponding to the target language are obtained. A cross-language selection request submitted by the user indicates that the user needs the abstract corresponding to the target language, so in response to the request the second text corresponding to the target language can be selected from the first text and the second text and fed back to the user. Likewise, if the user selects the first text, the first text corresponding to the source language can be fed back to the user instead.
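The selection itself reduces to a simple branch on whether the cross-language selection request is present (a trivial sketch, with hypothetical argument names):

```python
def select_feedback_text(first_text, second_text, cross_language_requested):
    """Pick the target-language abstract when a cross-language selection
    request was submitted, otherwise the source-language abstract."""
    return second_text if cross_language_requested else first_text
```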
In summary, by responding to the cross-language selection request of the user, the second text corresponding to the target language different from the source language is selected as the target text, so that feedback of the target text according to the user-defined selection can be realized, and the flexibility of user selection is improved.
On this basis, considering that the prediction capability of the target language model may degrade after a period of time, in order to ensure that the target language model retains a high prediction capability at any point in time, the target language model may be optimized while in use. In this embodiment, the specific implementation is as in steps S2382 to S2388:
step S2382, sending an adjustment request to the user submitting the cross-language selection request, wherein the adjustment request carries a second text corresponding to the target language;
step S2384, receiving a text adjustment instruction submitted by the user in response to the adjustment request and for a second text corresponding to the target language;
step S2386, updating the second text corresponding to the target language according to the text adjustment instruction, and obtaining a third text corresponding to the target language;
step S2388, the target language model is optimized according to the third text corresponding to the target language and the text to be processed.
Specifically, the adjustment request is a request inviting the user to manually adjust the second text, and is used for proofreading the second text; correspondingly, the text adjustment instruction refers to the instruction submitted by the user for adjusting the second text, through which the character units or word units adjusted by the user can be identified. Correspondingly, the third text refers to the adjusted text corresponding to the target language; correspondingly, optimizing the model means further adjusting the model's parameters so that it has a better prediction capability.
Based on this, after the second text responding to the cross-language event is acquired, in order to improve the prediction capability of the target language model, the optimization of the target language model can be completed by inviting the user. Namely: and sending an adjustment request to a user submitting the cross-language selection request, wherein the adjustment request is used for inviting the user and informing the user of the purpose of the invitation, and meanwhile, a second text corresponding to the target language is carried in the adjustment request.
When a text adjustment instruction, submitted by the user in response to the adjustment request for the second text corresponding to the target language, is received, this indicates that the user has agreed to the invitation and has submitted an adjustment for the second text. The second text corresponding to the target language can then be updated according to the text adjustment instruction, correcting the inaccurate parts of the second text, and the third text corresponding to the target language is obtained from the update result; the content of the third text is then more accurate. On this basis, the target language model can be optimized with the third text corresponding to the target language and the text to be processed, giving the target language model a better prediction capability.
According to the above example, after the abstract corresponding to English and the abstract corresponding to Chinese are obtained, the abstract corresponding to Chinese can be fed back to the user in response to the selection request of the user; in order to improve the predictive ability of the language model, an adjustment request can be fed back to the user, and the adjustment request carries the Chinese abstract. When a text adjusting instruction submitted by a user aiming at the Chinese abstract is received, which indicates that the Chinese abstract is not generated accurately enough, the Chinese abstract can be updated to the Chinese target abstract in response to the text adjusting instruction, the accuracy of the obtained target abstract is higher, and finally the Chinese target abstract and the English text to be processed are utilized to optimize the language model.
In conclusion, the target language model is optimized by inviting the user, so that the prediction capability of the target language model can be improved, the user can participate in the model optimization stage, and the model can be updated by utilizing effective resources, so that the operation and maintenance cost of the model is reduced.
In addition, in a corresponding service search scenario, in order to support search, the abstract content meeting the query requirement may be fed back to the user; the abstracts corresponding to the texts involved in the search scenario therefore need to be written into a text library, so that a response can be completed according to the read event in the query stage. In this embodiment, the specific implementation is as in steps S2482 to S2486:
step S2482, establishing a cross-language relationship between the target text and the text to be processed, and writing the target text into a text library corresponding to the target language according to the cross-language relationship;
step S2484, determining a target language keyword corresponding to the text reading event under the condition that the text reading event related to the target language is monitored;
step S2486, reading a set number of target service texts matched with the target language keywords from the text library as a response to the text reading event.
Specifically, the cross-language relationship is the relationship between the target text and the text to be processed, used to indicate that the target text of the target language is the abstract of the text to be processed of the source language. Correspondingly, the text reading event refers to the event corresponding to a user search request, used for querying texts associated with the target language keyword. Correspondingly, the target language keyword refers to the keyword associated with the text reading event; it belongs to the target language and is used for querying the texts associated with it. Correspondingly, the text library is a database used for storing the cross-language relationship and the target text corresponding to the text to be processed; correspondingly, the target service text refers to the target texts in the text library associated with the target language keyword.
It should be noted that, the target service texts matched with the target language keywords may be one or more, and the determination of the matching relationship may be implemented by calculating the association degree, that is, calculating the association degree between the target language keywords and the target texts contained in the text library, and selecting the target texts with the association degree greater than a set threshold value as the target service texts; or selecting the target text with the maximum relevance as the target service text.
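An illustrative association-degree computation and threshold selection (this application leaves the association metric open; simple token overlap is used here purely as a stand-in, and the threshold and count are hypothetical):

```python
def match_target_texts(keywords, text_library, threshold=0.2, top_k=3):
    """Score each stored target text by token overlap with the target
    language keywords, keep those whose association degree exceeds the
    threshold, and return at most top_k of them, highest degree first."""
    keywords = set(keywords)
    scored = []
    for text in text_library:
        overlap = len(set(text.split()) & keywords) / max(len(keywords), 1)
        if overlap > threshold:
            scored.append((overlap, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]

# Hypothetical text library and keyword set.
library = ["machine translation summary", "weather report", "translation quality summary"]
matches = match_target_texts({"translation", "summary"}, library)
```

Replacing `threshold` filtering with "take the maximum only" gives the alternative selection rule mentioned above.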
Based on this, after the target text is obtained, the cross-language relationship between the target text and the text to be processed can be established, and the target text is written into the text library corresponding to the target language according to the cross-language relationship. In the application stage, if a text reading event associated with the target language is monitored, indicating that at the current moment there is an event querying keywords associated with the target language, the text reading event can be analyzed to obtain the target language keyword associated with it; then a set number of target service texts matching the target language keyword are read from the text library as the response to the text reading event.
In practical application, considering that a plurality of target service texts corresponding to the keywords in the target language may be obtained when the target service texts are read, if all the target service texts are used as responses of the text reading events, the browsing experience of a user will be affected, so that a set number of target service texts can be selected as responses of the text reading events for the convenience of the user to look up. Wherein, the set quantity can be set according to actual requirements.
In summary, by persisting the target text in the text library, after the text reading event is monitored, the corresponding target service text can be rapidly matched from the text library, so that the text reading event can be responded in a short time, and the user's viewing experience can be improved.
On this basis, the association between the target service text and the service text to be processed can be made clear; since this association is strong, the target language model can be optimized on this basis. In this embodiment, the specific implementation is as in steps S2582 to S2586:
step S2582, determining a to-be-processed service text associated with the target service text;
step S2584, constructing a target sample pair based on the target service text and the service text to be processed;
and S2586, optimizing the target language model by using the target sample pair.
Specifically, the service text to be processed refers to the original text associated with the target service text, corresponding to the source language; correspondingly, the target sample pair refers to the sample pair composed of the associated service text to be processed and target service text, used for optimizing the model. On this basis, once the target service text is determined, it is shown to fit the current service scenario, and the target service text generated by the target language model has a certain accuracy. To further improve the model's prediction capability, the service text to be processed associated with the target service text can be determined first, a target sample pair is constructed from the target service text and the service text to be processed, and the target language model is then optimized with the target sample pair.
According to the above example, after the Chinese abstract associated with the English business text is obtained, the cross-language relationship between the English business text and the Chinese abstract can be established, and the Chinese abstract is written into the text library corresponding to the Chinese according to the cross-language relationship. After a text reading event is monitored, determining the corresponding Chinese keywords, calculating the association degree between the Chinese keywords and the Chinese abstracts in the text library, selecting a set number of Chinese abstracts as the response of the text reading event, and feeding back the response to the user.
Furthermore, among the set number of Chinese abstracts, the Chinese abstract finally selected by the user is marked as the target Chinese abstract, and the English business text associated with the target Chinese abstract is selected and fed back to the user. On this basis, the target Chinese abstract and the English business text can be combined into a sample pair, and the language model is optimized with this sample pair to improve the model's prediction capability.
Furthermore, when the language model is optimized, the English business text in the sample pair is input into the language model to be processed, a predicted Chinese abstract output by the language model can be obtained, and the language model can be optimized by combining the target Chinese abstract and the predicted Chinese abstract, so that the purpose of improving the model prediction accuracy is achieved. During specific optimization, a loss value can be calculated according to the target Chinese abstract and the predicted Chinese abstract, and then the language model is subjected to parameter adjustment according to the loss value, so that the language model with high prediction precision can be obtained according to a parameter adjustment result for use.
With the text processing method described above, in order to improve cross-language text processing accuracy, after the text to be processed in the source language is obtained, a source language code vector corresponding to that text is constructed and then converted, at the level of the code vector, into a target language code vector, which effectively solves the problem of cross-language vector mapping. The source language code vector and the target language code vector are then fused into a fusion vector, and decoding is finally performed on the fusion vector to obtain the target text in the target language. Because the method performs the cross-language mapping by converting at the encoding stage, it can effectively guarantee text processing accuracy.
The following further describes the text processing method by taking its application in a summary generation scenario as an example, with reference to fig. 3. Fig. 3 shows a processing flowchart of a text processing method applied in a summary generation scenario according to an embodiment of the present application, which specifically includes the following steps:
step S302, acquiring the text to be processed corresponding to the source language.
In practice, when performing cross-language summarization, a common technique in the art is the pipeline form, i.e., text-translation-summarization or text-summarization-translation. Alternatively, a reinforcement learning model is used together with a constructed probability dictionary, generating the word with the maximum probability at each step. However, in current scenarios with high information-accuracy requirements, the pipeline form produces large error propagation, and the influence of information interaction among multiple languages on cross-language summarization is rarely considered. There is therefore a need for an effective solution to these problems.
In view of this, with the text processing method provided by the present application, during abstract generation the encoding vector corresponding to the source language can be directly mapped, through a pre-trained mapper, into the language space corresponding to the target language, yielding the encoding vector corresponding to the target language. In the decoding stage, the encoding vectors of the source and target languages are fused for decoding, so that abstracts corresponding to both languages are generated; when the abstract text for the target language is generated, it can be aligned with the abstract text for the source language. This effectively ensures the accuracy of abstract generation, meets the requirement of cross-language abstract generation, and facilitates use by downstream services.
In this embodiment, the text processing method is described by taking as an example a football-related article whose source language is Chinese and whose target language is English, the Chinese article being the text to be processed. It should be noted that the football article contains a large amount of text; for ease of description, this embodiment describes the text processing method based on only part of its content.
Step S304, preprocessing the text to be processed to obtain the standard text corresponding to the source language.
The acquired text to be processed corresponding to Chinese is {Hello everyone, let me introduce football player B to you; B, born to football players of country A and country B, is a forward and currently plays for football club D in the first-division football league of country C; ...}. To facilitate the subsequent accurate extraction of the English text abstract from the text to be processed, the text to be processed can be preprocessed, that is, data cleaning and word segmentation are performed on it, so as to reduce the interference caused by redundant data.
On this basis, the text to be processed corresponding to Chinese is preprocessed to obtain a first standard text {B, born to football players of country A and country B, is a forward and currently plays for football club D in the first-division football league of country C}, and the first standard text is then segmented into words to obtain a second standard text {B; born to; country A; country B; football players; serves as; forward; currently; plays for; country C; football; first-division league; 's; football club D}. Compared with the text to be processed, the standard text has had its redundant data removed, the removed redundant data being irrelevant to the main content of the text.
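The cleaning-plus-segmentation preprocessing above can be illustrated with a toy sketch; the regular expressions and sample data here are assumptions, and a production system would use a real Chinese segmenter such as jieba rather than punctuation splitting:

```python
import re

def clean_text(raw: str) -> str:
    """Data cleaning: strip markup-like residue and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)        # remove HTML-style tags
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return text

def segment(text: str) -> list[str]:
    """Toy word segmentation: split on whitespace and punctuation."""
    return [tok for tok in re.split(r"[\s,;.!?，；。！？]+", text) if tok]

raw = "<p>B, born to football players of country A,  plays for club D.</p>"
standard = clean_text(raw)   # first standard text
tokens = segment(standard)   # second standard text
```

The first standard text corresponds to the cleaned string, and the second to the token list fed to the encoding layer.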
Step S306, inputting the standard text corresponding to the source language into the abstract extraction model, and coding the standard text through the coding layer in the abstract extraction model to obtain the coding vector corresponding to the source language.
After the standard text corresponding to Chinese is obtained, the preprocessed text can be input into the abstract extraction model, and the standard text can be encoded by the encoding layer in the abstract extraction model, so that an encoding vector EV1 corresponding to Chinese is obtained from the encoding result. The encoding vector EV1 is obtained by mapping the standard text into the language space corresponding to Chinese.
Step S308, the coding vector corresponding to the source language is input to the mapper in the abstract extraction model for mapping, and the coding vector corresponding to the target language is obtained.
After the encoding vector EV1 corresponding to Chinese is obtained, in order to subsequently generate the text abstract corresponding to English, the encoding vector EV1 can be mapped from the language space of Chinese into the language space of English by the mapper in the abstract extraction model. That is, at the vector-conversion level, this step converts the encoding vector corresponding to Chinese into the encoding vector EV2 corresponding to English, on the basis of which subsequent decoding is performed to obtain the English text abstract.
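The form of the mapper is not disclosed in this application; the following sketch assumes it is a linear projection pre-trained on parallel encoding-vector pairs, fit here in closed form by least squares (all dimensions and data are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parallel encoding vectors: rows are sentences, columns are
# dimensions of the (toy, 4-dimensional) language spaces.
EV_zh = rng.normal(size=(50, 4))   # source-language encodings
W_true = rng.normal(size=(4, 4))   # unknown "ideal" mapping
EV_en = EV_zh @ W_true             # corresponding target-language encodings

# Pre-train the mapper: solve min_W ||EV_zh @ W - EV_en||^2 in closed form.
W, *_ = np.linalg.lstsq(EV_zh, EV_en, rcond=None)

# Apply the trained mapper to a new source-language encoding vector.
ev1 = rng.normal(size=(1, 4))
ev2 = ev1 @ W                      # EV2 in the target language space
```

In practice the mapper would be trained jointly with the model rather than solved in closed form, but the mapping step itself is the same single projection.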
Step S310, fusing the coding vector corresponding to the source language and the coding vector corresponding to the target language to obtain a coding fusion vector.
Step S312, inputting the encoded fusion vector into a decoding layer in the abstract extraction model for decoding, so as to generate a text abstract corresponding to the source language and a text abstract corresponding to the target language.
After the encoding vector EV1 corresponding to Chinese and the encoding vector EV2 corresponding to English are obtained, the two vectors can be fused and the result input into the decoding layer of the abstract extraction model for decoding. In this process, the text abstract corresponding to Chinese is generated first, and the text abstract corresponding to English is then generated aligned with it, so that the English text abstract obtained after decoding has higher accuracy. On this basis, the text abstract corresponding to Chinese is TS1, and the text abstract corresponding to English is TS2.
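The fusion operation is likewise unspecified; assuming simple concatenation along the feature axis, the fusion step can be sketched as:

```python
import numpy as np

ev1 = np.array([[0.2, 0.5, 0.1]])   # toy source-language encoding (EV1)
ev2 = np.array([[0.3, 0.4, 0.6]])   # toy target-language encoding (EV2)

# Concatenate along the feature axis so the decoding layer can attend to
# both language spaces at once when generating the two abstracts.
fused = np.concatenate([ev1, ev2], axis=-1)
```

Other fusion choices (element-wise addition, gating) would also fit the claim language; concatenation is only the simplest concrete instance.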
Step S314, based on the abstract extraction request of the user, selecting the text abstract corresponding to the target language to feed back to the user.
After the text abstract TS1 corresponding to Chinese and the text abstract TS2 corresponding to English are obtained, in order to facilitate use by the user, it is determined from the user's abstract extraction request that the user needs the English version; the text abstract TS2 corresponding to English can therefore be fed back to the user, making it convenient for the user to grasp the main content of the text to be processed.
In summary, in the process of generating the abstract, the encoding vector corresponding to the source language can be directly mapped, at the encoding stage and through a pre-trained mapper, into the language space corresponding to the target language, yielding the encoding vector corresponding to the target language. In the decoding stage, the encoding vectors of the source and target languages are fused for decoding, so that abstracts corresponding to both languages are generated, and the abstract text for the target language can be aligned with that for the source language. This effectively ensures the accuracy of abstract generation, meets the requirement of cross-language abstract generation, and brings convenience to downstream services.
Corresponding to the above method embodiment, the present application further provides a text processing apparatus embodiment, and fig. 4 shows a schematic structural diagram of a text processing apparatus provided in an embodiment of the present application. As shown in fig. 4, the apparatus includes:
an obtaining module 402, configured to obtain a to-be-processed text corresponding to a source language;
a constructing module 404, configured to construct a source language code vector corresponding to the text to be processed, and convert the source language code vector into a target language code vector;
a fusion module 406, configured to fuse the source language code vector and the target language code vector to obtain a fusion vector;
and a decoding module 408 configured to generate a target text corresponding to the target language by performing decoding processing on the fusion vector.
In an optional embodiment, the obtaining module 402 is further configured to:
acquiring a service text corresponding to a source language; and preprocessing the service text to obtain the text to be processed corresponding to the source language.
In an optional embodiment, the building module 404 is further configured to:
inputting the text to be processed into a target language model, and encoding the text to be processed through an encoding layer in the target language model to obtain the source language code vector; and mapping the source language code vector through a mapper in the target language model to obtain the target language code vector.
In an optional embodiment, the fusion module 406 is further configured to:
fusing the source language code vector and the target language code vector in the target language model to obtain a fused vector;
wherein the decoding module 408 is further configured to:
and decoding the fusion vector through a decoding layer in the target language model, and generating a target text corresponding to the target language according to a decoding processing result.
In an optional embodiment, the decoding module 408 is further configured to:
decoding the fusion vector to obtain a first text corresponding to the source language and a second text corresponding to the target language; and responding to a cross-language selection request, and selecting a second text as the target text from a first text corresponding to the source language and a second text corresponding to the target language.
In an optional embodiment, the decoding module 408 is further configured to:
decoding the source language code vector in the fusion vector through a decoding layer in the target language model to obtain an initial text corresponding to the source language; and decoding the target language code vector in the fusion vector, according to a processing strategy that aligns with the initial text, through the decoding layer in the target language model, to obtain a target text corresponding to the target language.
In an optional embodiment, the apparatus further comprises a model training module configured to: acquire an initial sample corresponding to the source language, and process the initial sample through an encoding layer in an initial language model to obtain a sample coding vector corresponding to the source language; process the sample coding vector through a mapper in the initial language model to obtain a sample mapping vector corresponding to the target language; fuse the sample mapping vector and the sample coding vector, and process the fusion result through a decoding layer in the initial language model to obtain a predicted text corresponding to the target language; and adjust the parameters of the initial language model according to the reference text corresponding to the initial sample and the predicted text, until a target language model meeting the training stop condition is obtained.
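The training procedure performed by the model training module can be sketched end to end with toy linear stand-ins for the encoding layer, mapper, and decoding layer; the squared-error loss and the decoder-only parameter update below are simplifying assumptions for illustration, not the application's actual method:

```python
import numpy as np

rng = np.random.default_rng(1)
d, v = 4, 6                          # toy encoding dimension, vocabulary size

E = rng.normal(size=(d, d))          # encoding layer (toy: linear map)
M = rng.normal(size=(d, d))          # mapper
D = rng.normal(size=(2 * d, v))      # decoding layer over the fused vector

def forward(x):
    ev_src = x @ E                   # sample coding vector (source language)
    ev_tgt = ev_src @ M              # sample mapping vector (target language)
    fused = np.concatenate([ev_src, ev_tgt], axis=-1)
    return fused, fused @ D          # predicted target-text scores

x = rng.normal(size=(8, d))          # batch of initial samples
y = rng.normal(size=(8, v))          # reference-text targets (toy)

# One parameter-adjustment step on the decoder under a squared-error loss.
fused, pred = forward(x)
loss = np.mean((pred - y) ** 2)
grad_D = 2 * fused.T @ (pred - y) / pred.size
D -= 0.001 * grad_D                  # gradient descent on the decoding layer

_, pred2 = forward(x)
loss2 = np.mean((pred2 - y) ** 2)    # loss after the update
```

A real implementation would use a sequence model with cross-entropy and update all parameters; the point of the sketch is only the data flow encode → map → fuse → decode → loss → adjust.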
In an optional embodiment, the model training module is further configured to:
acquiring service model parameters corresponding to a service language model; and updating the initial model parameters of the initial language model according to the business model parameters.
In an optional embodiment, the apparatus further comprises:
the matching module is configured to establish a cross-language relationship between the target text and the text to be processed, and write the target text into a text library corresponding to the target language according to the cross-language relationship; determining a target language keyword corresponding to the text reading event under the condition that the text reading event related to the target language is monitored; and reading a set number of target service texts matched with the target language keywords in the text library as the response of the text reading event.
In an optional embodiment, the apparatus further comprises:
the first optimization module is configured to determine a to-be-processed service text associated with the target service text; constructing a target sample pair based on the target service text and the service text to be processed; and optimizing the target language model by using the target sample pair.
In an optional embodiment, the obtaining module 402 is further configured to:
processing the service text according to a preset text cleaning strategy and a word segmentation processing strategy to obtain a standard text corresponding to the source language; and taking the standard text as the text to be processed corresponding to the source language.
In an optional embodiment, the apparatus further comprises:
a second optimization module, configured to send an adjustment request to a user who submits the cross-language selection request, where the adjustment request carries a second text corresponding to the target language; receiving a text adjusting instruction submitted by the user aiming at a second text corresponding to the target language in response to the adjusting request; updating a second text corresponding to the target language according to the text adjusting instruction to obtain a third text corresponding to the target language; and optimizing the target language model according to a third text corresponding to the target language and the text to be processed.
According to the text processing device, in order to improve the cross-language text processing accuracy, after the text to be processed corresponding to the source language is obtained, the source language code vector corresponding to the text to be processed is constructed, and then the source language code vector is converted into the target language code vector from the dimension of the code vector, so that the problem of cross-language vector mapping can be effectively solved; then, the source language code vector and the target language code vector are fused into a fusion vector, and finally, decoding processing is carried out through the fusion vector, so that a target text corresponding to the target language can be obtained; the method realizes cross-language mapping by converting at the encoding stage, and can effectively ensure the text processing accuracy.
The above is a schematic description of the text processing apparatus of this embodiment. It should be noted that the technical solution of the text processing apparatus and that of the text processing method belong to the same concept; for details not described in the apparatus solution, reference may be made to the description of the method solution. Further, the components in the apparatus embodiment should be understood as the functional modules necessary to implement the steps of the program flow or the method; each functional module is not necessarily actually divided or separately defined in an implementation. Apparatus claims defined by such a set of functional modules should be understood as a functional-module framework for implementing the solution mainly by means of the computer program described in the specification, and not as a physical device implementing the solution mainly by means of hardware.
Fig. 5 illustrates a block diagram of a computing device 500 provided according to an embodiment of the present application. The components of the computing device 500 include, but are not limited to, a memory 510 and a processor 520. Processor 520 is coupled to memory 510 via bus 530, and database 550 is used to store data.
Computing device 500 also includes an access device 540 that enables computing device 500 to communicate via one or more networks 560. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. The access device 540 may include one or more of any type of network interface, wired or wireless, e.g., a Network Interface Card (NIC), an IEEE 802.11 Wireless Local Area Network (WLAN) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the application, the above-described components of computing device 500 and other components not shown in FIG. 5 may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device illustrated in FIG. 5 is for purposes of example only and is not intended to limit the scope of the present application. Those skilled in the art may add or replace other components as desired.
Computing device 500 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smartphone), wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 500 may also be a mobile or stationary server.
The processor 520 is configured to execute computer-executable instructions that implement the steps of the text processing method.
The foregoing is a schematic diagram of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the text processing method belong to the same concept, and details that are not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the text processing method.
An embodiment of the present application further provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the steps of the text processing method.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the text processing method, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the text processing method.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunications signals.
An embodiment of the present application further provides a chip, in which a computer program is stored, and the computer program implements the steps of the text processing method when executed by the chip.
It should be noted that, for simplicity of description, the above method embodiments are described as a series of combinations of actions, but those skilled in the art should understand that the present application is not limited by the described order of actions, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments, and that the actions and modules involved are not necessarily required by the present application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present application disclosed above are intended only to aid in the explanation of the application. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and its practical applications, to thereby enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and their full scope and equivalents.

Claims (15)

1. A method of text processing, comprising:
acquiring a text to be processed corresponding to a source language;
constructing a source language code vector corresponding to the text to be processed, and converting the source language code vector into a target language code vector;
fusing the source language code vector and the target language code vector to obtain a fused vector;
and decoding the fusion vector to generate a target text corresponding to the target language.
2. The method according to claim 1, wherein the obtaining of the text to be processed corresponding to the source language comprises:
acquiring a service text corresponding to a source language;
and preprocessing the service text to obtain the text to be processed corresponding to the source language.
3. The method according to claim 1, wherein the constructing a source language code vector corresponding to the text to be processed and converting the source language code vector into a target language code vector comprises:
inputting the text to be processed into a target language model, and coding the text to be processed through a coding layer in the target language model to obtain the source language code vector;
and mapping the source language code vector through a mapper in the target language model to obtain the target language code vector.
4. The method according to claim 3, wherein said fusing said source language code vector and said target language code vector to obtain a fused vector comprises:
fusing the source language code vector and the target language code vector in the target language model to obtain a fused vector;
the generating of the target text corresponding to the target language by decoding the fusion vector includes:
and decoding the fusion vector through a decoding layer in the target language model, and generating a target text corresponding to the target language according to a decoding processing result.
5. The method according to claim 4, wherein the generating of the target text corresponding to the target language by decoding the fused vector comprises:
decoding the fusion vector to obtain a first text corresponding to the source language and a second text corresponding to the target language;
and responding to a cross-language selection request, and selecting the second text as the target text from the first text corresponding to the source language and the second text corresponding to the target language.
6. The method according to claim 4, wherein said decoding the fused vector by a decoding layer in the target language model, and generating a target text corresponding to the target language according to a decoding result, comprises:
decoding the source language code vector in the fusion vector through a decoding layer in the target language model to obtain an initial text corresponding to the source language;
and decoding the target language code vector in the fusion vector, according to a processing strategy that aligns with the initial text, through the decoding layer in the target language model, to obtain a target text corresponding to the target language.
7. The method according to any of claims 3-6, wherein the training of the target language model comprises:
acquiring an initial sample corresponding to the source language, and processing the initial sample through a coding layer in an initial language model to acquire a sample coding vector corresponding to the source language;
processing the sample coding vector through a mapper in the initial language model to obtain a sample mapping vector corresponding to the target language;
fusing the sample mapping vector and the sample coding vector, and processing a fusion result through a decoding layer in the initial language model to obtain a predicted text corresponding to the target language;
and adjusting parameters of the initial language model according to the reference text and the predicted text corresponding to the initial sample until the target language model meeting the training stop condition is obtained.
8. The method according to claim 7, wherein before the step of obtaining the initial sample corresponding to the source language is executed, the method further comprises:
acquiring service model parameters corresponding to a service language model;
and updating the initial model parameters of the initial language model according to the service model parameters.
9. The method according to any one of claims 3 to 6, wherein after the step of generating the target text corresponding to the target language by decoding the fused vector is executed, the method further comprises:
establishing a cross-language relationship between the target text and the text to be processed, and writing the target text into a text library corresponding to the target language according to the cross-language relationship;
determining a target language keyword corresponding to the text reading event under the condition that the text reading event related to the target language is monitored;
and reading a set number of target service texts matched with the target language keywords in the text library as the response of the text reading event.
10. The method of claim 9, further comprising:
determining a to-be-processed service text associated with the target service text;
constructing a target sample pair based on the target service text and the service text to be processed;
and optimizing the target language model by using the target sample pair.
11. The method according to claim 2, wherein the obtaining the text to be processed corresponding to the source language by preprocessing the service text comprises:
processing the service text according to a preset text cleaning strategy and a word segmentation processing strategy to obtain a standard text corresponding to the source language;
and taking the standard text as the text to be processed corresponding to the source language.
12. The method of claim 5, further comprising:
sending an adjustment request to a user submitting the cross-language selection request, wherein the adjustment request carries a second text corresponding to the target language;
receiving a text adjusting instruction submitted by the user aiming at a second text corresponding to the target language in response to the adjusting request;
updating a second text corresponding to the target language according to the text adjusting instruction to obtain a third text corresponding to the target language;
and optimizing the target language model according to the third text corresponding to the target language and the text to be processed.
13. A text processing apparatus, comprising:
the acquisition module is configured to acquire the text to be processed corresponding to the source language;
the construction module is configured to construct a source language code vector corresponding to the text to be processed, and convert the source language code vector into a target language code vector;
the fusion module is configured to fuse the source language code vector and the target language code vector to obtain a fusion vector;
and the decoding module is configured to generate a target text corresponding to the target language by decoding the fusion vector.
14. A computing device, comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to implement the steps of the method of any one of claims 1-12.
15. A computer-readable storage medium storing computer instructions, which when executed by a processor, perform the steps of the method of any one of claims 1 to 12.
CN202211412308.8A 2022-11-11 2022-11-11 Text processing method and device Pending CN115718904A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211412308.8A CN115718904A (en) 2022-11-11 2022-11-11 Text processing method and device


Publications (1)

Publication Number Publication Date
CN115718904A true CN115718904A (en) 2023-02-28

Family

ID=85254980

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211412308.8A Pending CN115718904A (en) 2022-11-11 2022-11-11 Text processing method and device

Country Status (1)

Country Link
CN (1) CN115718904A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117076614A (en) * 2023-10-13 2023-11-17 中山大学深圳研究院 Cross-language text retrieval method and terminal equipment based on transfer learning
CN117076614B (en) * 2023-10-13 2024-02-02 中山大学深圳研究院 Cross-language text retrieval method and terminal equipment based on transfer learning

Similar Documents

Publication Publication Date Title
CN112749326B (en) Information processing method, information processing device, computer equipment and storage medium
CN110347790B (en) Text duplicate checking method, device and equipment based on attention mechanism and storage medium
CN113961685A (en) Information extraction method and device
CN117668181A (en) Information processing method, device, terminal equipment and storage medium
CN111813923A (en) Text summarization method, electronic device and storage medium
CN110659392B (en) Retrieval method and device, and storage medium
CN117473034A (en) Interactive text processing method and device, electronic equipment and storage medium
WO2024169529A1 (en) Knowledge base construction method, data retrieval method and apparatus, and cloud device
CN115718904A (en) Text processing method and device
CN117036833B (en) Video classification method, apparatus, device and computer readable storage medium
CN116958997B (en) Graphic summary method and system based on heterogeneous graphic neural network
CN113887244A (en) Text processing method and device
CN116913278A (en) Voice processing method, device, equipment and storage medium
CN115409025A (en) Marketing text creation method, device and equipment
CN114492467A (en) Fault-tolerant translation method and device for training fault-tolerant translation model
KR20220130864A (en) A system for providing a service that produces voice data into multimedia converted contents
CN113591493A (en) Translation model training method and translation model device
CN114722817A (en) Event processing method and device
CN112559750A (en) Text data classification method and device, nonvolatile storage medium and processor
CN118429658B (en) Information extraction method and information extraction model training method
KR102435242B1 (en) An apparatus for providing a producing service of transformed multimedia contents using matching of video resources
CN113569099B (en) Model training method and device, electronic equipment and storage medium
CN117453895B (en) Intelligent customer service response method, device, equipment and readable storage medium
CN117891927B (en) Question and answer method and device based on large language model, electronic equipment and storage medium
KR20220130862A (en) A an apparatus for providing a producing service of transformed multimedia contents

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination