CN115858752A - Semantic representation method, device, storage medium and equipment for multi-language fusion - Google Patents


Info

Publication number: CN115858752A
Application number: CN202211539197.7A
Authority: CN (China)
Prior art keywords: text information, language, target, semantic representation, text
Legal status: Pending (the legal status is an assumption and not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Inventors: 孔聪聪, 姜磊, 王静, 胡加学, 贺志阳, 赵景鹤, 鹿晓亮, 魏思, 赵志伟
Current assignee: Anhui Xunfei Medical Co., Ltd. (the listed assignee may be inaccurate; Google has not performed a legal analysis)
Original assignee: Anhui Xunfei Medical Co., Ltd.
Application filed by Anhui Xunfei Medical Co., Ltd.; priority to CN202211539197.7A; publication of CN115858752A.


Landscapes

  • Machine Translation (AREA)

Abstract

The embodiment of the application discloses a semantic representation method, a semantic representation device, a storage medium and a device for multi-language fusion. The method comprises the following steps: text information in a plurality of data samples of a high-resource source language is used to expand at least one low-resource target language, obtaining target text information of the at least one target language; a plurality of groups of parallel data are determined according to the text information of each data sample and the target text information of the at least one target language; and an initial semantic representation model is trained with the plurality of groups of parallel data, where during training the initial semantic representation model is updated according to the results of its processing of the text information in each group of parallel data and of the target text information of the at least one target language, together with the same label information. In this way, the semantic representation of the at least one low-resource target language is optimized with knowledge learned from the high-resource source language, and the accuracy of semantic representation for the low-resource target language is improved.

Description

Semantic representation method, device, storage medium and equipment for multi-language fusion
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a semantic representation method and device for multi-language fusion, a computer-readable storage medium and computer equipment.
Background
In recent years, with the development of the internet, minority-ethnic languages appear more and more on the internet, and semantic understanding of these languages is becoming an important focus of attention. Deep learning has achieved impressive results on many natural language processing tasks; in particular, with the advent of pre-trained language models, breakthrough progress has been made in machine understanding of human language. However, minority languages such as Tibetan and Uyghur are inherently resource-poor and lack large-scale corpora, and even efforts such as multilingual pre-trained language models, which attempt to address low-resource languages, still fail to cover them well. Therefore, semantic representation and understanding of minority languages such as Tibetan and Uyghur remains a difficult problem.
Disclosure of Invention
The embodiment of the application provides a semantic representation method and device for multi-language fusion, a computer-readable storage medium and a computer device, which can optimize the semantic representation of at least one low-resource target language by using knowledge from the text information of a high-resource source language and improve the accuracy of the semantic representation of the low-resource target language.
The embodiment of the application provides a semantic representation method for multi-language fusion, which comprises the following steps:
acquiring a data set corresponding to a source language of high resources, wherein the data set comprises a plurality of data samples, and each data sample comprises text information of the source language and label information corresponding to the text information;
expanding at least one low-resource target language by utilizing the text information in each data sample to obtain target text information of at least one target language;
determining a plurality of groups of parallel data according to the text information in each data sample and the target text information obtained by text information expansion, wherein each group of parallel data comprises the text information of a training sample and the target text information of at least one target language;
training an initial semantic representation model of multi-language fusion according to a plurality of groups of parallel data to obtain a semantic representation model, wherein in the training process, the initial semantic representation model is updated according to processing results of the initial semantic representation model for respectively processing the text information in each group of parallel data and the target text information of at least one target language and label information corresponding to the text information in each group of parallel data;
and performing semantic processing on the text information to be processed by using the semantic representation model to obtain semantic representation of the text information to be processed, wherein the language of the text information to be processed is any one of a source language or at least one target language.
The embodiment of the present application further provides a semantic representation apparatus for multi-language fusion, including:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a data set corresponding to a source language of high resources, the data set comprises a plurality of data samples, and each data sample comprises text information of the source language and label information corresponding to the text information;
the expansion module is used for expanding at least one low-resource target language by utilizing the text information in each data sample to obtain target text information of at least one target language;
the data determination module is used for determining a plurality of groups of parallel data according to the text information in each data sample and the target text information obtained by expanding the text information, wherein each group of parallel data comprises the text information of a training sample and the target text information of at least one target language;
the model training module is used for training an initial semantic representation model of multi-language fusion according to a plurality of groups of parallel data to obtain a semantic representation model, wherein in the training process, the initial semantic representation model is updated according to processing results of the initial semantic representation model for respectively processing the text information in each group of parallel data and the target text information of at least one target language and label information corresponding to the text information in each group of parallel data;
and the processing module is used for performing semantic processing on the text information to be processed by using the semantic representation model to obtain semantic representation of the text information to be processed, wherein the language of the text information to be processed is any one of a source language or at least one target language.
The present application further provides a computer-readable storage medium, where a computer program is stored, where the computer program is suitable for being loaded by a processor to perform the steps in the semantic representation method for multi-language fusion according to any one of the above embodiments.
The embodiment of the present application further provides a computer device, where the computer device includes a memory and a processor, where the memory stores a computer program, and the processor executes the steps in the semantic representation method for multi-language fusion according to any of the above embodiments by calling the computer program stored in the memory.
The multi-language fusion semantic representation method and device, computer-readable storage medium and computer device provided by the embodiment of the application expand at least one low-resource target language by using the text information in a plurality of data samples of a high-resource source language to obtain target text information of the at least one target language, and determine a plurality of groups of parallel data according to the text information of each data sample and the target text information of the at least one target language, so that the text information of the source language and the target text information of the at least one target language correspond one to one. This enriches the target text information of the at least one low-resource target language while preventing the multi-language-fusion initial semantic representation model from being biased towards the text information of the high-resource source language. The initial semantic representation model is then trained with the plurality of groups of parallel data; during training, the initial semantic representation model is updated according to the results of processing the text information in each group of parallel data and the target text information of the at least one target language, together with the same label information, so that the model learns semantic alignment among different languages. This optimizes the semantic representation of the at least one low-resource target language and improves its accuracy.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application.
Fig. 2 is a flowchart illustrating a semantic representation method for multi-language fusion according to an embodiment of the present application.
Fig. 3 is a schematic diagram of a translation module provided in an embodiment of the present application.
Fig. 4 is a schematic diagram of a format of a preset data dictionary according to an embodiment of the present application.
FIG. 5 is a diagram of a semantic representation model for multi-language fusion according to an embodiment of the present application.
Fig. 6 is a sub-flow diagram of a semantic representation method for multi-language fusion according to an embodiment of the present application.
Fig. 7 is a schematic diagram of word segmentation and semantic tag addition provided in the embodiment of the present application.
Fig. 8 is a schematic structural diagram of a semantic representation apparatus for multi-language fusion according to an embodiment of the present application.
Fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides a semantic representation method and device for multi-language fusion, a computer-readable storage medium and a computer device. Specifically, the multi-language fusion semantic representation method of the embodiment of the present application may be executed by a computer device, and the multi-language fusion semantic representation apparatus may be integrated in one or more computer devices. For example, the process of training the multi-language fusion semantic representation model may be executed on one computer device, while the trained semantic representation model is used on another computer device; correspondingly, the training part of the apparatus is integrated in one computer device and the usage part in another.
The computer device may be a terminal device or a server. The terminal device may be a smart phone, a tablet computer, a notebook computer, a touch screen, a game machine, a personal computer (PC), an intelligent vehicle-mounted terminal, a robot, or similar equipment. The server may be an independent physical server, a service node in a blockchain system, a server cluster formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms.
Before describing the solution in the embodiments of the present application in detail, the prior art is analyzed further. For semantic representation of low-resource target languages such as minority languages, the current mainstream method is to extract text features of sentences by deep learning, represent sentence semantics as sentence vectors, and learn sentence semantic representation on the basis of a classification task or a sentence matching task.
Such a method has the following problems:
(1) The data volume is small: the corpora of low-resource target languages such as minority languages have too few resources and lack labeled language texts, so extracting the text features of sentences is difficult, representation learning of sentence semantics is difficult under this condition, and the obtained sentence vectors depend excessively on the features contained in the data resources.
(2) Single-language models have limitations and are less effective: on the one hand, single-language models are more complex to use, since multiple models need to be trained for different languages in scenarios with multi-language requirements, which is less convenient to maintain or apply in practice than a single model fusing multiple languages; on the other hand, for low-resource minority languages, the effect of a monolingual model is even worse.
(3) Impact of high-resource corpora on low-resource corpora: on the one hand, current multilingual models rarely cover low-resource target languages such as minority-ethnic languages; on the other hand, for a low-resource target language, because its training corpus is much smaller than that of a high-resource language, the trained multilingual model is biased towards the high-resource source language, which affects the semantic representation of the low-resource language and makes it inaccurate.
In this embodiment, "high resource" and "low resource" are relative terms. High resource means that the availability of the relevant data sets/corpora is high, that is, it is relatively easy to obtain more data sets/corpora; low resource means that the availability of data sets/corpora is low, that is, they are relatively difficult to obtain, or it is difficult to obtain many of them, for example only a few data sets/corpora can be obtained.
For example, availability is high for data sets in Chinese and low for minority languages such as Tibetan and Uyghur, especially in certain application scenarios, such as an outbound scenario. Likewise, availability may be higher for data sets in Chinese and lower for languages such as French or Hindi.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application. The application scenario is an outbound scenario, in which an intelligent robot or a human asks relevant questions and the user answers them, forming multiple rounds of questions and answers. The server can acquire the voice information of the multiple rounds of questions and answers, extract the voice information of the user's answers, convert it into corresponding text, and classify that text to obtain the user's intention.
In the outbound scenario, most of the texts are Chinese texts, but minority languages such as Tibetan and Uyghur are also involved; for a Chinese text, its semantic information is easy to understand.
The method expands minority languages such as Tibetan and Uyghur using a high-resource Chinese data set obtained in the outbound scenario to obtain multiple groups of parallel data, trains a multi-language-fusion initial semantic representation model with these groups of parallel data by performing semantic processing and classification processing on them, and updates the initial semantic representation model using the classification results and the corresponding label information in the Chinese data set to obtain a semantic representation model. The semantic representation model is then used to perform semantic processing on text information of the low-resource minority languages to obtain accurate semantic information.
It should be noted that the application scenario in fig. 1 is not limited, but is only to help understanding the scheme in the embodiment of the present application, and the scheme in the embodiment of the present application may also be applied to more application scenarios.
Fig. 2 is a flowchart of a semantic representation method for multi-language fusion provided in an embodiment of the present application, where the method includes the following steps.
101, acquiring a data set corresponding to a source language of a high resource, wherein the data set comprises a plurality of data samples, and each data sample comprises text information of the source language and label information corresponding to the text information.
The high-resource source language may be a universal language specified by a country; usually, the availability of data sets/corpora of such a language is high, such as Chinese in country A or English in country B.
The data set corresponding to the high-resource source language may be a data set of any application scenario; the outbound scenario is taken as an example here and hereinafter.
The data set comprises a plurality of data samples, and each data sample comprises text information of the source language and label information corresponding to the text information. The text information is in sentence units, one piece of text information corresponds to one piece of label information, and the label information is represented with 0s and 1s: if the number of classification results corresponding to all the text information of the source language is N, the label information consists of N values of 0 or 1, where the classification result hit by the current text is represented by 1 and the others by 0.
The text information in the source language in one data sample includes the answer text, in Chinese, with which the user answered a question of the intelligent robot. The answer text is in sentence units, and one answer text corresponds to one real intention of the user; the real intention is represented by an intention label, i.e., the label information includes intention labels.
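To make the label convention concrete, the following is a minimal Python sketch; only the 0/1 one-hot convention comes from the description above, while the intent names and their count are hypothetical examples.

```python
# Minimal sketch of the one-hot label convention: for N possible
# classification results, each text's label is a vector of N zeros and
# ones with a single 1 on the hit classification result.
INTENTS = ["confirm", "deny", "unclear"]  # hypothetical intents, N = 3

def make_label(hit_intent: str) -> list[int]:
    """Return a one-hot label vector: 1 for the hit intent, 0 elsewhere."""
    return [1 if intent == hit_intent else 0 for intent in INTENTS]

print(make_label("deny"))  # [0, 1, 0]
```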
And 102, expanding at least one low-resource target language by using the text information in each data sample to obtain target text information of at least one target language.
The at least one target language may include one target language, two target languages, three target languages, or more target languages, etc.
For the text information in each data sample, the text information is translated into text information of at least one low-resource target language, and the translated text is taken as the target text information of the at least one target language. After translation, the text information of one data sample corresponds to target text information of at least one target language, such as target text information for target language A, target text information for target language B, target text information for target language C, and so on.
The low-resource target language may include minority languages such as Tibetan and Uyghur, may include non-universal dialects, and may also include languages other than the native language. In the embodiment of the present application, minority languages such as Tibetan and Uyghur are taken as examples of low-resource target languages.
In one embodiment, a translation tool may be used directly to translate the text information in the source language: a target translation tool for translating between the source language and at least one low-resource target language is obtained, and the text information in each data sample is translated with the target translation tool corresponding to each target language to obtain the target text information of at least one target language. For example, a Chinese-to-Tibetan target translation tool is obtained and used to translate Chinese text information into Tibetan target text information, and a Chinese-to-Uyghur target translation tool is obtained and used to translate Chinese text information into Uyghur target text information.
In one embodiment, the translation may also be performed in other ways. Fig. 3 is a schematic diagram of a translation module provided in this application, which can be obtained by training in advance. The text information in each data sample is acquired and encoded by an encoder to obtain the encoding features of the text information, and the encoding features are decoded by a decoder of at least one target language to obtain the target text information of the at least one target language. Both the encoder and the decoder may be Transformer modules. As shown in fig. 3, Chinese text information is translated into Uyghur target text information and Tibetan target text information.
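A minimal PyTorch sketch of this arrangement follows: one shared Transformer encoder for the source text plus one Transformer decoder per low-resource target language, as in fig. 3. All sizes, the vocabulary, and the two target-language keys are illustrative assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

class MultiTargetTranslator(nn.Module):
    """Shared encoder, one decoder per target language (sketch)."""
    def __init__(self, vocab_size=32000, d_model=256, nhead=8, layers=4,
                 target_langs=("ug", "bo")):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        # One decoder and output head per target language.
        self.decoders = nn.ModuleDict()
        self.heads = nn.ModuleDict()
        for lang in target_langs:
            dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
            self.decoders[lang] = nn.TransformerDecoder(dec_layer, num_layers=layers)
            self.heads[lang] = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, tgt_ids, lang: str):
        memory = self.encoder(self.embed(src_ids))       # encoding features
        out = self.decoders[lang](self.embed(tgt_ids), memory)
        return self.heads[lang](out)                     # target-token logits
```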
The method for expanding at least one low-resource target language in this step may be other than translation, as long as the target text information of at least one low-resource target language having the same meaning as the text information of the source language is obtained.
In this step, at least one low-resource target language is expanded using the text information of the high-resource source language to obtain target text information, which enriches the target text information of the low-resource target language and prevents the model from being biased towards the high-resource text information during training.
After the target text information of at least one target language is obtained through expansion, format conversion may further be performed on the text information in each data sample and on the expanded target text information of the at least one target language, according to the format of a preset data dictionary. The format conversion puts all text data into a uniform data format, which facilitates the further processing described below.
The format of the preset data dictionary may be as shown in fig. 4, where "mainSuit" is the user's answer text to the robot's question, that is, the text information in the source language; "intent" is the intent corresponding to the text information, i.e. the category of the text information; and "node_type" is the node name, where one node represents one question: under a node, the intelligent robot asks the corresponding question and mainSuit is the user's answer text for that question, and a complete outbound call consists of multiple nodes (questions). Other fields are reserved fields or fields irrelevant to the content of this application and are not considered for the moment.
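A sketch of one record in this format, written as a Python dict: the field names follow the description above, the concrete values are hypothetical, and reserved/irrelevant fields are omitted.

```python
# One hypothetical record in the preset data-dictionary format of fig. 4.
record = {
    "mainSuit": "I have occasionally eaten",  # user's answer text (source-language text information)
    "intent": "diet_occasional",              # hypothetical intent label, i.e. the text's category
    "node_type": "diet_question_node",        # node name: one node corresponds to one question
}
```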
103, determining multiple groups of parallel data according to the text information in each data sample and the target text information obtained by text information expansion, wherein each group of parallel data comprises the text information of a training sample and the target text information of at least one target language.
The text information of one training sample and the target text information of at least one target language obtained by expanding that text information are used as one group of parallel data; for example, Chinese text information and the Tibetan target text information and Uyghur target text information obtained by expanding it form one group of parallel data.
In this way, the text information of the source language and the target text information of the target languages are in one-to-one correspondence, which prevents the model from being biased against the semantic representation of the low-resource text information during training. In addition, texts in different languages containing the same semantics form one group of parallel data and are given the same category in the text classification processing, so that the semantic knowledge the model learns is shared across languages. This strengthens the ability to migrate knowledge in one language, such as the source language, to another language, i.e., the ability to migrate knowledge across languages.
Multiple groups of parallel data are thus obtained from the text information of the training samples and the corresponding target text information of the at least one target language, as sketched below.
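The following minimal sketch shows how the groups might be assembled; `translate` is a hypothetical stand-in for the expansion step 102 (translation tool or encoder-decoder module), and the language keys are illustrative.

```python
# Build groups of parallel data: each group pairs the source-language text
# of one data sample with its expanded target-language texts and carries
# the sample's label information.
def build_parallel_data(samples, target_langs=("ug", "bo")):
    groups = []
    for text_zh, label in samples:           # samples: (source text, label) pairs
        group = {"zh": text_zh, "label": label}
        for lang in target_langs:
            group[lang] = translate(text_zh, lang)  # hypothetical helper
        groups.append(group)
    return groups
```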
And 104, training an initial semantic representation model of multi-language fusion according to multiple groups of parallel data to obtain a semantic representation model, wherein in the training process, the initial semantic representation model is updated according to processing results of the initial semantic representation model for respectively processing the text information in each group of parallel data and the target text information of at least one target language and label information corresponding to the text information in each group of parallel data.
The multi-language-fusion initial semantic representation model is the model whose network parameters need updating. This step defines the specific training process: the model is updated using the same label information together with the processing result of the text information and the processing result of the target text information, so that the model learns semantic alignment among different languages, and the low-resource semantic representation is optimized with the knowledge obtained from the high-resource corpus.
In one case, step 104 includes: inputting the text information corresponding to each group of parallel data in the multiple groups and the target text information of at least one target language into the multi-language-fusion initial semantic representation model for processing, such as semantic processing and classification processing, to obtain the source text classification result of the text information in each group of parallel data and the target text classification result of the target text information of the at least one target language; and updating the initial semantic representation model according to the source text classification result, the target text classification result of the at least one target language and the label information corresponding to the text information in each group of parallel data, to obtain the semantic representation model. In this embodiment, the initial semantic representation model performs the same processing, such as semantic processing and classification processing, on the text information and on the target text information in each group of parallel data, and is then updated according to the results of that processing and the same label information, so that the model can learn the semantics of different languages and achieve semantic alignment among them.
The initial semantic representation model comprises a semantic processing module and a classification module. The semantic processing module is used to extract the semantic features of text information, i.e. the semantic representation, which can be expressed as a semantic vector; it should be noted that the semantic processing module is shared across languages. The classification module is used to classify a semantic vector to obtain its classification result.
The text information corresponding to each group of parallel data in the multiple groups of parallel data and the target text information of at least one target language can be respectively input into a semantic processing module of an initial semantic representation model of multi-language fusion for semantic processing so as to respectively obtain a source text semantic representation of the text information and a target text semantic representation of the target text information of at least one target language. The semantic representation of the text information of the source language is called source text semantic representation, the semantic representation of the target text information of at least one target language is called target text semantic representation, one target language corresponds to one target text information, and one target text information corresponds to one target text semantic representation.
For example, if the source language is Chinese and the target languages are Tibetan and Uyghur, the Chinese text information in each group of parallel data is input into the semantic processing module, which performs semantic processing to extract the semantic features of the Chinese text information, that is, the Chinese semantic representation; the target text information corresponding to Tibetan in each group of parallel data is input into the semantic processing module, which performs semantic processing to extract the semantic features of the Tibetan target text information, that is, the Tibetan semantic representation; and the Uyghur semantic representation is extracted in the same way.
After the source text semantic representation and the target text semantic representation of at least one target language are obtained, the source text semantic representation and the target text semantic representation of at least one target language are respectively input into a classification module for classification processing, so that a source text classification result of text information in each group of parallel data and a target text classification result of target text information of at least one target language are respectively obtained. One target text semantic representation obtains one target text classification result, and the plurality of target text semantic representations respectively obtain a plurality of target text classification results.
For example, the Chinese semantic representation is input into the classification module, which classifies it to obtain the Chinese text classification result; the Tibetan semantic representation is input into the classification module, which classifies it to obtain the Tibetan text classification result; and the Uyghur text classification result is obtained in the same way.
The semantic processing module and the classification module in this embodiment may be any modules capable of implementing semantic extraction and classification. For example, the semantic processing module may be a Bert module built from a BERT (Bidirectional Encoder Representations from Transformers) model, and the classification module may be a softmax network module, as shown in fig. 5, so that the semantic processing module performs the semantic processing and the classification module performs the classification processing. In this way, when the trained semantic representation model is used later, the semantic representation of a target language can be obtained using only the semantic processing module. In addition, classification processing can be performed to obtain classification results, and an association is created between the source language and a target language through the multiple classification results for the same semantics, such as the source text classification result, the target text classification result of at least one target language, and the label information for the same semantics, so that the target language can learn the semantic features/semantic representation of the source language.
In one case, in order to classify better and make the classification result more accurate, the initial semantic representation model further includes a self-attention mechanism network module (self-attention module) before the classification module, as shown in fig. 5. The self-attention mechanism network module learns the feature representation of the sentence text of the text information, so that the sentence's own feature representation is also fused into the semantic representation result.
In one case, in order to distinguish different languages, explicit language tags are designed, and different languages correspond to different language tags: a source language tag corresponds to the source language, and a target language tag corresponds to a target language. For example, the Chinese tag may be denoted "[ZH]", the Uyghur tag "[UG]", and the Tibetan tag "[BO]". Correspondingly, the multi-language-fusion initial semantic representation model includes a word segmentation and language tag adding module, as shown in fig. 5. This module performs word segmentation on the text information corresponding to each group of parallel data and on the target text information of at least one target language before the semantic processing, and adds the corresponding language tags to the word segmentation results.
Fig. 6 is a sub-flow diagram provided in the present application, which further illustrates the above step of "training the initial semantic representation model for multi-language fusion according to multiple sets of parallel data to obtain a semantic representation model", and can be understood with reference to fig. 5. The sub-process includes the following steps.
And 201, respectively inputting the text information corresponding to each group of parallel data in the multiple groups of parallel data and the target text information of at least one target language into a word segmentation and language tag adding module of an initial semantic representation model for multi-language fusion to perform word segmentation processing so as to obtain a source text word segmentation result of the text information and a target text word segmentation result of the target text information of at least one target language.
The word segmentation processing can adopt any one of the existing word segmentation processing modes. The target text information of one target language corresponds to one target text word segmentation result, and the target text information of a plurality of target languages corresponds to a plurality of target text word segmentation results.
Fig. 7 shows the same sentence, "I have occasionally eaten", in three different languages: the bottom row shows the text information for each language, and the second row shows the word segmentation result. For example, for the Chinese text information in the bottom row, after word segmentation the Chinese word segmentation results in the second row are obtained, such as '_I', '_occasionally', '_have eaten'. The word segmentation results for Tibetan and Uyghur can be seen in the second row at the corresponding languages in fig. 7.
Although in Chinese each character can be encoded to obtain its corresponding semantic representation, this is not feasible for some minority languages, which need to be encoded by words to obtain their semantic representations. Therefore, for uniform processing, word segmentation is performed on the text information of all languages.
202, adding a source language label corresponding to the source language to the source text segmentation result to obtain a source text segmentation result after the source language label is added, and adding a target language label corresponding to the target language to the target text segmentation result of at least one target language to obtain a target text segmentation result after the target language label is added.
For example, a source language tag corresponding to the source language may be added before (e.g., at the beginning of a sentence)/at the end (e.g., at the end of a sentence) of the source text segmentation result to obtain a source text segmentation result after the source language tag is added, and a target language tag corresponding to the corresponding target language may be added before/at the end of a target text segmentation result in at least one target language to obtain a target text segmentation result after the corresponding target language tag is added.
Alternatively, the source text word segmentation result and the source language tag are concatenated to obtain the source text word segmentation result with the source language tag added, and the target text word segmentation result of at least one target language and the corresponding target language tag are concatenated to obtain the target text word segmentation result with the corresponding target language tag added.
As shown in the first line of each language in fig. 7, i.e., the result after adding the language tags.
And 203, respectively inputting the source text word segmentation result added with the language label and the target text word segmentation result of at least one target language into a semantic processing module of the multi-language fusion initial semantic representation model for semantic processing so as to respectively obtain the source text semantic representation of the text information and the target text semantic representation of the target text information of at least one target language.
The semantic processing module may be a Bert module comprising a Bert model. The Bert module in this application includes a vocabulary, which contains the words of the minority languages added after pre-training and the subsequent data expansion; the vocabulary can be understood as a dictionary in which each word or character corresponds to one number/ID. The vocabulary also includes the numbers/IDs corresponding to the different languages.
The source text word segmentation result with its language tag and the target text word segmentation results of at least one target language are input into the semantic processing module, which maps each tag and each word/character in them according to the vocabulary to obtain the number/ID of each tag and each word/character. In this way the text information is turned into a string of numbers the computer device can understand. For example, [ "[ZH]", "I", "love", "country", "home" ] becomes the form [ "01", "0004", "0064", "5120", "3200" ], also referred to as the mapping result. Since every language goes through this mapping and the data processed in the model are these numbers, it is sufficient to add language tags to the word segmentation results of each language to distinguish the different languages.
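The mapping step can be sketched as follows; the vocabulary contents and integer IDs are illustrative assumptions (the description above only gives the example IDs "01", "0004", "0064", "5120", "3200").

```python
# Sketch of the word-segmentation / language-tag / vocabulary-mapping step.
# Every language tag and every word or character maps to one number/ID.
vocab = {"[ZH]": 1, "[UG]": 2, "[BO]": 3, "I": 4, "love": 64,
         "country": 5120, "home": 3200, "[UNK]": 0}

def encode(tokens, lang_tag):
    tagged = [lang_tag] + tokens                       # prepend the language tag
    return [vocab.get(t, vocab["[UNK]"]) for t in tagged]

print(encode(["I", "love", "country", "home"], "[ZH]"))
# [1, 4, 64, 5120, 3200]
```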
In the embodiment of the application, a maximum matrix corresponding to a fixed sentence length is set for the Bert module. When the length of a language's mapping result is smaller than the fixed sentence length, it is padded with padding tokens to obtain the padded mapping result; semantic extraction is then performed on each language's padded mapping result to obtain semantic features of the same dimensionality for each language; finally, pooling is performed on each language's semantic features to obtain the pooled semantic representation.
Assuming the dimension of the semantic feature corresponding to each number is 1 × 256 and the fixed sentence length is N, the dimension of the semantic features obtained after semantic extraction is N × 256; after pooling, for example max pooling or average pooling, the dimension of the resulting semantic representation is 1 × 256.
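A minimal PyTorch sketch of the padding and pooling just described; `bert` is a stand-in for the shared Bert module (assumed here to return a feature matrix of shape (1, N, 256)), and N = 32 is an assumed fixed sentence length.

```python
import torch

N, PAD_ID = 32, 0  # assumed fixed sentence length and padding ID

def sentence_representation(ids: list[int], bert) -> torch.Tensor:
    ids = ids[:N] + [PAD_ID] * max(0, N - len(ids))  # pad to the fixed length N
    features = bert(torch.tensor([ids]))             # semantic features, shape (1, N, 256)
    return features.mean(dim=1)                      # average pooling -> (1, 256)
    # max pooling alternative: features.max(dim=1).values
```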
For the different languages Chinese (source language) and Uyghur and Tibetan (target languages), the Chinese word segmentation result with the Chinese tag added is denoted by X_zh, the Uyghur word segmentation result with the Uyghur tag added is denoted by X_ug, and the Tibetan word segmentation result with the Tibetan tag added is denoted by X_bo. The processing procedure of the Bert module can then be expressed as shown in the following formula (1), formula (2) and formula (3).

H_zh = Pool(Bert(X_zh))   (1)
H_ug = Pool(Bert(X_ug))   (2)
H_bo = Pool(Bert(X_bo))   (3)

where H_zh, H_ug and H_bo are the source text semantic representation, the first target text semantic representation and the second target text semantic representation, i.e. the Chinese semantic representation, the Uyghur semantic representation and the Tibetan semantic representation, respectively.
And 204, respectively inputting the source text semantic representation and the target text semantic representation of at least one target language into a self-attention mechanism network module of the initial semantic representation model for multi-language fusion to extract the self characteristics of the sentences so as to respectively obtain the source text semantic representation of the fused sentences and the target text semantic representation of at least one target language.
In order to fuse the sentence's own feature information and obtain a better sentence semantic representation, the source text semantic representation is input into the self-attention mechanism network module, which extracts the sentence's own features to obtain the source text semantic representation (also called source sentence semantic representation) fused with the sentence's own features; likewise, the target text semantic representation of at least one target language is input into the self-attention mechanism network module, which extracts the sentence's own features to obtain the target text semantic representation (also called target sentence semantic representation) of at least one target language fused with the sentence's own features.
The processing procedure of the sentence self-feature extraction using the self-attention mechanism network module can be expressed as the following formula (4), formula (5) and formula (6).

H'_zh = SelfAttention(H_zh)   (4)
H'_ug = SelfAttention(H_ug)   (5)
H'_bo = SelfAttention(H_bo)   (6)

where SelfAttention denotes the sentence self-feature extraction performed by the self-attention mechanism network module, and H'_zh, H'_ug and H'_bo are the source text semantic representation and the target text semantic representations fused with the sentences' own features, i.e. the Chinese, Uyghur and Tibetan semantic representations fused with the sentences' own features, respectively. The dimensionality of each of these fused semantic representations is 1 × 256.
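The patent does not spell out the internal arrangement of this module, so the following sketch simply applies standard multi-head self-attention to the pooled sentence representation, treated as a length-1 sequence; all sizes are assumptions.

```python
import torch
import torch.nn as nn

# Shared self-attention module for all three languages (assumed sizes).
attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)

def fuse_sentence_features(h: torch.Tensor) -> torch.Tensor:
    # h: (batch, 1, 256) pooled sentence representation H
    fused, _ = attn(h, h, h)  # query = key = value = H (self-attention)
    return fused              # H': (batch, 1, 256), formulas (4)-(6)
```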
And 205, inputting the semantic representation of the source text and the semantic representation of the target text in at least one target language into a classification module for classification processing, so as to obtain a source text classification result of the text information in each group of parallel data and a target text classification result of the target text information in at least one target language.
The classification module may be a softmax network module comprising a softmax function. The source text semantic representation fused with the sentence's own features and the target text semantic representation of at least one target language are respectively input into the softmax network module, and classification is performed with its softmax function to obtain the source text classification probability of the text information in each group of parallel data and the target text classification probability of the target text information of at least one target language; the source text classification result and the target text classification result of at least one target language are then determined from the source text classification probability and the target text classification probability of the at least one target language.
The number of classes of the softmax function is determined by the intents involved in the selected data set. For example, when 530 intents are involved, the source text classification probability and the target text classification probability of the target text information of at least one target language each contain 530 probability values; the value with the highest probability among the 530 is then converted to 1 and the others to 0 to obtain the source text classification result and the target text classification result of at least one target language.
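The following sketch makes this step concrete: a linear layer maps the 256-d fused sentence representation to one logit per intent (530 in the example above), softmax yields the class probabilities, and the argmax is converted to a one-hot classification result. The linear layer itself is an assumption of this sketch.

```python
import torch
import torch.nn as nn

NUM_INTENTS = 530
classifier = nn.Linear(256, NUM_INTENTS)

def classify(h: torch.Tensor) -> torch.Tensor:
    # h: (batch, 256) fused sentence semantic representation H'
    probs = torch.softmax(classifier(h), dim=-1)  # 530 probability values
    one_hot = torch.zeros_like(probs)
    one_hot.scatter_(-1, probs.argmax(dim=-1, keepdim=True), 1.0)
    return one_hot                                # highest probability -> 1, others -> 0
```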
The procedure of performing the classification process can be expressed using formula (7), formula (8), and formula (9).
ŷ_zh = softmax(H'_zh)   (7)
ŷ_ug = softmax(H'_ug)   (8)
ŷ_bo = softmax(H'_bo)   (9)

where ŷ_zh, ŷ_ug and ŷ_bo denote the Chinese text classification result, the Uyghur text classification result and the Tibetan text classification result, respectively.
And 206, updating the initial semantic representation model according to the source text classification result, the target text classification result of at least one target language and the label information corresponding to the text information in each group of parallel data to obtain a semantic representation model.
A loss value of the initial semantic representation model is determined according to the source text classification result, the target text classification result of at least one target language and the label information corresponding to the text information in each group of parallel data, and the network parameters of the initial semantic representation model are updated according to the loss value to obtain the semantic representation model.
Further, the step of determining the loss value of the initial semantic representation model according to the source text classification result, the target text classification result of at least one target language, and the label information corresponding to the text information in each group of parallel data includes: determining a source text classification loss value according to the source text classification result and the label information, and determining a target text classification loss value of at least one target language according to the target text classification result of the at least one target language and the label information; and determining the loss value of the initial semantic representation model according to the source text classification loss value and the target text classification loss value of the at least one target language.
In this embodiment, the loss value of the initial semantic representation model is defined to include a source language classification loss value and at least one target language classification loss value, where the source language classification loss value is obtained from the source text classification result and the label information, and a target language classification loss value is obtained from the corresponding target text classification result and the same label information, so that the target language learns the semantic representation of the source language.
The source text classification loss value and the target text classification loss value of at least one target language are weighted and summed to obtain the loss value of the initial semantic representation model.
Considering the varying difficulty of samples in different languages and the imbalance among the various intent-type samples in the outbound scenario, the application adopts the combination of Focal Loss and GHM Loss as the loss function for each language, obtains the final loss function by weighted balancing of the loss values of the three different languages, and optimizes the loss function by stochastic gradient descent. The idea of Focal Loss is to reduce the weight of simple samples and increase the weight of difficult samples, so that the model focuses more on samples that are hard to classify. In the outbound scenario, users' answers are very random and colloquial, and mislabeled and confusable samples exist in large numbers, so there are many difficult samples. With Focal Loss alone, the optimization direction of the model is hard to stabilize; therefore GHM Loss is introduced, which can dynamically balance the weight of difficult samples and thereby adjust how much attention the model pays to them, since most extremely difficult samples are mislabeled or confusable samples and paying too much attention to them would misguide the optimization of the model. In summary, the loss for a single language can be expressed as shown in formula (10), formula (11) and formula (12).
loss_zh = FL(ŷ_zh, y) + γ_zh · GHM(ŷ_zh, y)   (10)
loss_ug = FL(ŷ_ug, y) + γ_ug · GHM(ŷ_ug, y)   (11)
loss_bo = FL(ŷ_bo, y) + γ_bo · GHM(ŷ_bo, y)   (12)

where FL and GHM denote the Focal Loss and GHM Loss terms, y denotes the label information, γ_zh, γ_ug and γ_bo are hyper-parameter configurations used to balance the proportion of the GHM Loss, specified according to the distribution of the data set, and loss_zh, loss_ug and loss_bo are the Chinese, Uyghur and Tibetan text classification loss values, respectively. The final loss value of the initial semantic representation model is a weighted sum of the text classification loss values of the three languages and can be expressed as the following formula (13).

Loss = α · loss_zh + β · loss_ug + γ · loss_bo   (13)

where α, β and γ are weights used to balance the text classification loss value of each language.
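A minimal sketch of formulas (10)-(13) follows, assuming `focal_loss` and `ghm_loss` are available implementations of Focal Loss and GHM Loss (not defined here); the γ shares and the α, β, γ weights are illustrative values.

```python
GAMMAS  = {"zh": 0.5, "ug": 0.5, "bo": 0.5}  # per-language GHM Loss share (assumed values)
WEIGHTS = {"zh": 1.0, "ug": 1.0, "bo": 1.0}  # alpha, beta, gamma of formula (13) (assumed values)

def total_loss(preds, target):
    """Weighted sum of the per-language Focal + GHM losses, formulas (10)-(13)."""
    loss = 0.0
    for lang in ("zh", "ug", "bo"):
        lang_loss = (focal_loss(preds[lang], target)                 # Focal Loss term
                     + GAMMAS[lang] * ghm_loss(preds[lang], target)) # GHM Loss term
        loss = loss + WEIGHTS[lang] * lang_loss
    return loss
```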
After the final loss value is obtained, the initial semantic representation model is updated with the loss value, and training stops when the training stop condition is reached, yielding the semantic representation model.
The training stop condition may be convergence of the loss value, the number of training rounds reaching a preset number, or the like, and is not particularly limited; the model obtained when training stops is the semantic representation model.
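As a sketch of this stop logic, assuming a hypothetical `one_round` helper that performs one pass of forward/backward/update and returns the loss; the round count and tolerance are assumptions.

```python
MAX_ROUNDS, TOL = 100, 1e-4  # assumed preset round count and convergence tolerance

def train(model, data, one_round):
    prev = float("inf")
    for _ in range(MAX_ROUNDS):            # stop at the preset number of rounds
        loss = one_round(model, data)      # one training pass (hypothetical helper)
        if abs(prev - loss) < TOL:         # stop on loss-value convergence
            break
        prev = loss
    return model                           # the semantic representation model
```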
In the outbound scenario there is currently a large amount of Chinese outbound data, while minority regions have only a small amount of manual outbound data. With the method of this application, a large amount of minority-language outbound data is obtained by translating the Chinese outbound data, multiple groups of parallel data are constructed, and training is performed on the basis of a Chinese outbound pre-trained model, so that the multi-language-fusion semantic representation model can learn semantic alignment among different languages, knowledge of the Chinese outbound domain is migrated to the minority languages, and better semantic representations of the minority languages are obtained.
And 105, performing semantic processing on the text information to be processed by using a semantic representation model to obtain semantic representation of the text information to be processed, wherein the language of the text information to be processed is any one of a source language or at least one target language.
Because the semantic representation model is ultimately used for classification while what is needed here is the semantic representation of the text information, the semantic processing module and the parts before it in the semantic representation model are sufficient to perform semantic processing on the to-be-processed text information and obtain its semantic representation.
The language of the text information to be processed is any one of the source language or at least one target language, for example any one of Chinese, Uyghur, Tibetan, or the like.
In one case, after the semantic representation model is obtained, the part after the semantic processing module is cut off, leaving the semantic processing module and the parts before it; the cut semantic representation model is then used to perform semantic processing on the to-be-processed text information to obtain its semantic representation.
This embodiment uses the abundant corpus resources of a source language such as Chinese to expand data for low-resource target languages such as minority languages, and explicitly labels the different languages so that the model can understand multiple languages and produce semantic representations of the target languages. Constructing equivalent low-resource data by translation yields multiple groups of parallel data, which prevents the model from being biased toward the high-resource data during training while letting it learn semantic alignment across languages. Knowledge obtained from the high-resource language thus optimizes the semantic representation of the low-resource target languages, semantic knowledge migrates across languages, and the semantic representation capability and accuracy for low-resource target languages such as minority languages are effectively improved.
All the above technical solutions can be combined arbitrarily to form the optional embodiments of the present application, and are not described herein again.
In order to better implement the semantic representation method for multi-language fusion in the embodiment of the application, the embodiment of the application also provides a semantic representation device for multi-language fusion. Referring to fig. 8, fig. 8 is a schematic structural diagram of a semantic representation apparatus for multi-language fusion according to an embodiment of the present application. The multi-language-fused semantic representation apparatus 300 may include an acquisition module 301, an expansion module 302, a data determination module 303, a model training module 304, and a processing module 305.
The obtaining module 301 is configured to obtain a data set corresponding to a source language of a high resource, where the data set includes a plurality of data samples, and each data sample includes text information of the source language and tag information corresponding to the text information.
An expansion module 302, configured to expand at least one low-resource target language by using the text information in each data sample to obtain target text information of at least one target language.
In an embodiment, the expansion module 302 is specifically configured to, for the text information in each data sample, translate the text information into text information in at least one low-resource target language, and use the text information as the target text information in at least one target language.
The data determining module 303 is configured to determine multiple sets of parallel data according to the text information in each data sample and the target text information obtained by expanding the text information, where each set of parallel data includes the text information of a training sample and the target text information of at least one target language.
A model training module 304, configured to train an initial semantic representation model for multi-language fusion according to multiple sets of parallel data to obtain a semantic representation model, where in a training process, the initial semantic representation model is updated according to a processing result obtained by processing the text information in each set of parallel data and the target text information in at least one target language respectively by the initial semantic representation model, and tag information corresponding to the text information in each set of parallel data.
In an embodiment, the model training module 304 is specifically configured to input the text information corresponding to each group of parallel data in the multiple groups of parallel data and the target text information in at least one target language into an initial semantic representation model for multi-language fusion, and perform processing to obtain a source text classification result of the text information in each group of parallel data and a target text classification result of the target text information in at least one target language; and updating the initial semantic representation model according to the source text classification result, the target text classification result of at least one target language and the label information corresponding to the text information in each group of parallel data to obtain a semantic representation model.
In an embodiment, the updating the initial semantic representation model according to the source text classification result, the target text classification result of at least one target language, and the tag information corresponding to the text information in each set of parallel data to obtain a semantic representation model includes: determining a source text classification loss value according to the source text classification result and the label information, and determining a target text classification loss value of at least one target language according to the target text classification result of at least one target language and the label information; determining a loss value of the initial semantic representation model according to a source text classification loss value and a target text classification loss value of at least one target language; and updating the initial semantic representation model according to the loss value to obtain a semantic representation model.
In an embodiment, the initial semantic representation model includes a semantic processing module and a classification module, and the step of inputting the text information corresponding to each group of parallel data in the multiple groups of parallel data and the target text information in at least one target language into the initial semantic representation model of multi-language fusion respectively for processing to obtain a source text classification result of the text information in each group of parallel data and a target text classification result of the target text information in at least one target language includes: respectively inputting the text information corresponding to each group of parallel data in the multiple groups of parallel data and the target text information of at least one target language into a semantic processing module of an initial semantic representation model fused with multiple languages for semantic processing so as to respectively obtain a source text semantic representation of the text information and a target text semantic representation of the target text information of at least one target language; and respectively inputting the source text semantic representation and the target text semantic representation of at least one target language into the classification module for classification processing so as to respectively obtain a source text classification result of the text information in each group of parallel data and a target text classification result of the target text information of at least one target language.
In an embodiment, the initial semantic representation model further includes a self-attention mechanism network module, and the model training module 304 is further configured to input the source text semantic representation and the target text semantic representation of the at least one target language into the self-attention mechanism network module for sentence self-feature extraction, so as to obtain, respectively, a source text semantic representation and a target text semantic representation of the at least one target language that fuse the self-features of the sentences (one possible arrangement of all these modules is sketched after this apparatus description).
In an embodiment, the initial semantic representation model further includes a word segmentation and language label adding module, and the model training module 304 is further configured to input the text information corresponding to each set of parallel data in the multiple sets of parallel data and the target text information in at least one target language into the word segmentation and language label adding module for word segmentation processing, so as to obtain a source text word segmentation result of the text information and a target text word segmentation result of the target text information in at least one target language; adding a source language label corresponding to a source language to the source text word segmentation result to obtain a source text word segmentation result after the source language label is added, and adding a target language label corresponding to a corresponding target language to the target text word segmentation result of at least one target language to obtain a target text word segmentation result after the corresponding target language label is added; correspondingly, the step of inputting the text information corresponding to each group of parallel data in the multiple groups of parallel data and the target text information of at least one target language into a semantic processing module of an initial semantic representation model fused to multiple languages respectively for semantic processing includes: and respectively inputting the source text word segmentation result added with the language label and the target text word segmentation result of at least one target language into a semantic processing module of an initial semantic representation model fused with multiple languages for semantic processing.
The processing module 305 is configured to perform semantic processing on the text information to be processed by using the semantic representation model to obtain a semantic representation of the text information to be processed, where a language of the text information to be processed is any one of a source language or at least one target language.
In one embodiment, as shown in fig. 8, the apparatus further comprises a format conversion module 306. The format conversion module 306 is configured to perform format conversion processing on the text information in each data sample and the expanded target text information in the at least one target language according to a format of a preset data dictionary, so as to obtain the text information after the format conversion processing and the target text information in the at least one target language.
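Pulling the modules described above together (word segmentation with an explicit language label, a shared semantic processing module, a self-attention mechanism network module, and a classification module), one possible shape of the model is sketched below. Every layer size, the mean pooling, and the language-tag scheme are illustrative assumptions rather than the patent's specification.

```python
import torch
import torch.nn as nn

class MultiLangSemanticModel(nn.Module):
    def __init__(self, vocab_size, hidden=768, num_classes=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        enc_layer = nn.TransformerEncoderLayer(hidden, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)  # semantic processing module
        self.self_attn = nn.MultiheadAttention(hidden, num_heads=12, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)               # classification module

    def forward(self, token_ids):
        # token_ids are assumed to start with a language-tag id, e.g. <zh>/<ug>/<bo>,
        # produced by the word segmentation and language tag adding module.
        h = self.encoder(self.embed(token_ids))
        a, _ = self.self_attn(h, h, h)        # fuse sentence self-features
        rep = a.mean(dim=1)                   # pooled semantic representation
        return self.classifier(rep)           # source/target text classification result
```

Under this organization, "cutting" the model for step 105 simply means returning `rep` instead of the classifier output.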
All the above technical solutions can be combined arbitrarily to form the optional embodiments of the present application, and are not described herein again.
Correspondingly, the embodiment of the application also provides a computer device, which may be a terminal or a server. As shown in fig. 9, fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application. The computer device 400 includes a processor 401 with one or more processing cores, a memory 402 with one or more computer-readable storage media, and a computer program stored on the memory 402 and executable on the processor. The processor 401 is electrically connected to the memory 402.

The processor 401 is the control center of the computer device 400; it connects the various parts of the entire computer device 400 using various interfaces and lines, and performs the various functions of the computer device 400 and processes data by running or loading software programs (computer programs) and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby monitoring the computer device 400 as a whole.
In this embodiment of the present application, the processor 401 in the computer device 400 loads instructions corresponding to the processes of one or more application programs into the memory 402, and runs the application programs stored in the memory 402, thereby implementing the functions in any of the above method embodiments, for example:
acquiring a data set corresponding to a source language of high resources, wherein the data set comprises a plurality of data samples, and each data sample comprises text information of the source language and label information corresponding to the text information;
expanding at least one low-resource target language by utilizing the text information in each data sample to obtain target text information of at least one target language;
determining a plurality of groups of parallel data according to the text information in each data sample and the target text information obtained by expanding the text information, wherein each group of parallel data comprises the text information of a training sample and the target text information of at least one target language;
training an initial semantic representation model of multi-language fusion according to a plurality of groups of parallel data to obtain a semantic representation model, wherein in the training process, the initial semantic representation model is updated according to processing results of the initial semantic representation model for respectively processing the text information in each group of parallel data and the target text information of at least one target language and label information corresponding to the text information in each group of parallel data;
and performing semantic processing on the text information to be processed by using the semantic representation model to obtain semantic representation of the text information to be processed, wherein the language of the text information to be processed is any one of a source language or at least one target language.
For specific implementation and beneficial effects of various operations executed by the processor, reference may be made to the foregoing method embodiments, and details are not described herein.
Optionally, as shown in fig. 9, the computer device 400 further includes: touch-sensitive display screen 403, radio frequency circuit 404, audio circuit 405, input unit 406 and power 407. The processor 401 is electrically connected to the touch display screen 403, the radio frequency circuit 404, the audio circuit 405, the input unit 406, and the power source 407. Those skilled in the art will appreciate that the computer device configuration illustrated in FIG. 9 does not constitute a limitation of computer devices, and may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components.
The touch display screen 403 may be used for displaying a graphical user interface and receiving operation instructions generated by a user acting on the graphical user interface. The touch display screen 403 may include a display panel and a touch panel. The display panel may be used to display information entered by or provided to the user and the various graphical user interfaces of the computer device, which may be made up of graphics, text, icons, video, and any combination thereof. Alternatively, the display panel may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED) display, or the like. The touch panel may be used to collect touch operations of the user on or near it (for example, operations performed by the user on or near the touch panel using a finger, a stylus, or any other suitable object or accessory) and to generate corresponding operation instructions that trigger the corresponding programs. The touch panel may overlay the display panel; when the touch panel detects a touch operation on or near it, it transmits the operation to the processor 401 to determine the type of the touch event, and the processor 401 then provides a corresponding visual output on the display panel according to that type. In the embodiment of the present application, the touch panel and the display panel may be integrated into the touch display screen 403 to realize input and output functions. However, in some embodiments, the touch panel and the display panel may be implemented as two separate components to perform the input and output functions respectively. That is, the touch display screen 403 may also serve as a part of the input unit 406 to implement an input function.
In the embodiment of the present application, the touch display screen 403 is used for presenting a graphical user interface and receiving an operation instruction generated by a user acting on the graphical user interface.
The radio frequency circuit 404 may be used to transmit and receive radio frequency signals so as to establish wireless communication with a network device or another computer device, and to exchange signals with the network device or the other computer device.
The audio circuit 405 may be used to provide an audio interface between the user and the computer device through a speaker and a microphone. The audio circuit 405 may transmit the electrical signal converted from received audio data to the speaker, which converts it into a sound signal for output; conversely, the microphone converts a collected sound signal into an electrical signal, which the audio circuit 405 receives and converts into audio data. The audio data is then output to the processor 401 for processing and, for example, sent to another computer device via the radio frequency circuit 404, or output to the memory 402 for further processing. The audio circuit 405 may also include an earbud jack to provide communication between a peripheral headset and the computer device.
The input unit 406 may be used to receive input numbers, character information, or user characteristic information (e.g., fingerprint, iris, facial information, etc.), and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
The power supply 407 is used to power the various components of the computer device 400. Optionally, the power source 407 may be logically connected to the processor 401 through a power management system, so as to implement functions of managing charging, discharging, power consumption management, and the like through the power management system. The power supply 407 may also include one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, or any other component.
Although not shown in fig. 9, the computer device 400 may further include a camera, a sensor, a wireless fidelity module, a bluetooth module, etc., which are not described in detail herein.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, the present application provides a computer-readable storage medium, in which a plurality of computer programs are stored, and the computer programs can be loaded by a processor to execute the steps in any one of the multi-language fusion semantic representation methods provided in the present application. For example, the computer program may perform the steps of:
acquiring a data set corresponding to a source language of high resources, wherein the data set comprises a plurality of data samples, and each data sample comprises text information of the source language and label information corresponding to the text information;
expanding at least one low-resource target language by utilizing the text information in each data sample to obtain target text information of at least one target language;
determining a plurality of groups of parallel data according to the text information in each data sample and the target text information obtained by expanding the text information, wherein each group of parallel data comprises the text information of a training sample and the target text information of at least one target language;
training an initial semantic representation model of multi-language fusion according to a plurality of groups of parallel data to obtain a semantic representation model, wherein in the training process, the initial semantic representation model is updated according to processing results of the initial semantic representation model for respectively processing the text information in each group of parallel data and the target text information of at least one target language and label information corresponding to the text information in each group of parallel data;
and performing semantic processing on the text information to be processed by using the semantic representation model to obtain semantic representation of the text information to be processed, wherein the language of the text information to be processed is any one of a source language or at least one target language.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the storage medium may include: a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
Since the computer program stored in the storage medium can execute the steps in any of the multi-language fusion semantic representation methods provided in the embodiments of the present application, beneficial effects that can be achieved by any of the multi-language fusion semantic representation methods provided in the embodiments of the present application can be achieved, for details, see the foregoing embodiments, and are not described herein again.
The method, the device, the storage medium and the computer device for semantic representation of multi-language fusion provided by the embodiment of the application are introduced in detail, a specific example is applied in the description to explain the principle and the implementation of the application, and the description of the embodiment is only used for helping to understand the method and the core idea of the application; meanwhile, for those skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (12)

1. A semantic representation method for multi-language fusion is characterized by comprising the following steps:
acquiring a data set corresponding to a source language of high resources, wherein the data set comprises a plurality of data samples, and each data sample comprises text information of the source language and label information corresponding to the text information;
expanding at least one low-resource target language by utilizing the text information in each data sample to obtain target text information of at least one target language;
determining a plurality of groups of parallel data according to the text information in each data sample and the target text information obtained by expanding the text information, wherein each group of parallel data comprises the text information of a training sample and the target text information of at least one target language;
training an initial semantic representation model of multi-language fusion according to a plurality of groups of parallel data to obtain a semantic representation model, wherein in the training process, the initial semantic representation model is updated according to processing results of the initial semantic representation model for respectively processing the text information in each group of parallel data and the target text information of at least one target language and label information corresponding to the text information in each group of parallel data;
and performing semantic processing on the text information to be processed by using the semantic representation model to obtain semantic representation of the text information to be processed, wherein the language of the text information to be processed is any one of a source language or at least one target language.
2. The method according to claim 1, wherein the step of training the initial semantic representation model for multi-language fusion according to the plurality of sets of parallel data to obtain the semantic representation model comprises:
respectively inputting the text information corresponding to each group of parallel data in the multiple groups of parallel data and the target text information of at least one target language into an initial semantic representation model fused with multiple languages for processing to obtain a source text classification result of the text information in each group of parallel data and a target text classification result of the target text information of at least one target language;
and updating the initial semantic representation model according to the source text classification result, the target text classification result of at least one target language and the label information corresponding to the text information in each group of parallel data to obtain a semantic representation model.
3. The method according to claim 2, wherein the step of updating the initial semantic representation model according to the source text classification result, the target text classification result in at least one target language, and the label information corresponding to the text information in each set of parallel data to obtain a semantic representation model comprises:
determining a source text classification loss value according to the source text classification result and the label information, and determining a target text classification loss value of at least one target language according to the target text classification result of at least one target language and the label information;
determining a loss value of the initial semantic representation model according to a source text classification loss value and a target text classification loss value of at least one target language;
and updating the initial semantic representation model according to the loss value to obtain a semantic representation model.
4. The method according to claim 2, wherein the initial semantic representation model includes a semantic processing module and a classification module, and the step of inputting the text information corresponding to each group of parallel data in the plurality of groups of parallel data and the target text information in at least one target language into an initial semantic representation model of multi-language fusion to be processed respectively so as to obtain a source text classification result of the text information in each group of parallel data and a target text classification result of the target text information in at least one target language comprises:
respectively inputting the text information corresponding to each group of parallel data in the multiple groups of parallel data and the target text information of at least one target language into a semantic processing module of an initial semantic representation model fused with multiple languages for semantic processing so as to respectively obtain a source text semantic representation of the text information and a target text semantic representation of the target text information of at least one target language;
and respectively inputting the source text semantic representation and the target text semantic representation of at least one target language into the classification module for classification processing so as to respectively obtain a source text classification result of the text information in each group of parallel data and a target text classification result of the target text information of at least one target language.
5. The method according to claim 4, wherein the initial semantic representation model further comprises a self-attention mechanism network module, and the method further comprises:
and respectively inputting the source text semantic representation and the target text semantic representation of at least one target language into a self-attention mechanism network module to extract the self characteristics of the sentences so as to respectively obtain the source text semantic representation fusing the self characteristics of the sentences and the target text semantic representation of at least one target language.
6. The method according to claim 4, wherein the initial semantic representation model further comprises a module for segmenting words and adding language tags, and before the step of inputting the text information corresponding to each set of parallel data in the plurality of sets of parallel data and the target text information of at least one target language into the semantic processing module of the multi-language fused initial semantic representation model for semantic processing, the method further comprises:
inputting the text information corresponding to each group of parallel data in the multiple groups of parallel data and the target text information of at least one target language into a word segmentation and language tag adding module for word segmentation processing so as to obtain a source text word segmentation result of the text information and a target text word segmentation result of the target text information of at least one target language;
adding source language labels corresponding to the source language to the source text segmentation result to obtain a source text segmentation result after the source language labels are added, and adding target language labels corresponding to the corresponding target languages to the target text segmentation result of at least one target language to obtain a target text segmentation result after the corresponding target language labels are added;
the step of inputting the text information corresponding to each group of parallel data in the multiple groups of parallel data and the target text information of at least one target language into a semantic processing module of an initial semantic representation model fused with multiple languages respectively for semantic processing includes: and respectively inputting the source text word segmentation result added with the language label and the target text word segmentation result of at least one target language into a semantic processing module of the initial semantic representation model fused to the multi-language for semantic processing.
7. The method according to claim 1, wherein the step of utilizing the text information in each data sample to expand at least one low-resource target language to obtain target text information of at least one target language comprises:
for the text information in each data sample, translating the text information into text information of at least one low-resource target language, and taking the text information as the target text information of at least one target language.
8. The method according to claim 1, wherein the step of performing semantic processing on the text information to be processed by using the semantic representation model to obtain the semantic representation of the text information to be processed comprises:
performing semantic processing on the text information to be processed by utilizing a semantic processing module of the semantic representation model to obtain semantic representation of the text information to be processed; or,
and cutting the part behind the semantic processing module of the semantic representation model, and performing semantic processing on the text information to be processed by using the cut semantic representation model to obtain the semantic representation of the text information to be processed.
9. The method according to claim 1, wherein the source language comprises Chinese, and the at least one target language comprises at least one ethnic minority language; and/or the text information of the source language included in each data sample comprises text information of users' Chinese answers to each outbound question in an outbound scene, and the label information corresponding to the text information comprises an intention label corresponding to the text information.
10. A multi-language converged semantic representation apparatus, comprising:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a data set corresponding to a source language of high resources, the data set comprises a plurality of data samples, and each data sample comprises text information of the source language and label information corresponding to the text information;
the expansion module is used for expanding at least one low-resource target language by utilizing the text information in each data sample to obtain target text information of at least one target language;
the data determination module is used for determining a plurality of groups of parallel data according to the text information in each data sample and the target text information obtained by expanding the text information, wherein each group of parallel data comprises the text information of a training sample and the target text information of at least one target language;
the model training module is used for training an initial semantic representation model of multi-language fusion according to a plurality of groups of parallel data to obtain a semantic representation model, wherein in the training process, the initial semantic representation model is updated according to processing results of the initial semantic representation model for respectively processing the text information in each group of parallel data and the target text information of at least one target language and label information corresponding to the text information in each group of parallel data;
and the processing module is used for performing semantic processing on the text information to be processed by using the semantic representation model to obtain semantic representation of the text information to be processed, wherein the language of the text information to be processed is any one of a source language or at least one target language.
11. A computer-readable storage medium, in which a computer program is stored which is adapted to be loaded by a processor for performing the steps in the multilingual fusion semantic representation method according to any one of claims 1 to 9.
12. A computer device, characterized in that the computer device comprises a memory and a processor, wherein the memory stores a computer program, and the processor performs the steps in the multi-language fusion semantic representation method according to any one of claims 1 to 9 by calling the computer program stored in the memory.
CN202211539197.7A 2022-12-01 2022-12-01 Semantic representation method, device, storage medium and equipment for multi-language fusion Pending CN115858752A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211539197.7A CN115858752A (en) 2022-12-01 2022-12-01 Semantic representation method, device, storage medium and equipment for multi-language fusion


Publications (1)

Publication Number Publication Date
CN115858752A true CN115858752A (en) 2023-03-28

Family

ID=85669372



Legal Events

Date Code Title Description
PB01 Publication
CB02 Change of applicant information

Address after: 230088 floor 23-24, building A5, No. 666, Wangjiang West Road, high tech Zone, Hefei, Anhui Province

Applicant after: IFLYTEK Medical Technology Co.,Ltd.

Address before: 230088 floor 23-24, building A5, No. 666, Wangjiang West Road, high tech Zone, Hefei, Anhui Province

Applicant before: Anhui Xunfei Medical Co.,Ltd.

SE01 Entry into force of request for substantive examination