CN115587589B - Statement confusion degree acquisition method and system for multiple languages and related equipment - Google Patents

Statement confusion degree acquisition method and system for multiple languages and related equipment

Info

Publication number
CN115587589B
CN115587589B (application CN202211131283.4A)
Authority
CN
China
Prior art keywords
training
multilingual
confusion
sentence
sequence
Prior art date
Legal status
Active
Application number
CN202211131283.4A
Other languages
Chinese (zh)
Other versions
CN115587589A (en)
Inventor
黄嘉鑫
谢育涛
尹曦
谢凯
Current Assignee
International Digital Economy Academy IDEA
Original Assignee
International Digital Economy Academy IDEA
Priority date
Filing date
Publication date
Application filed by International Digital Economy Academy IDEA
Priority to CN202211131283.4A
Publication of CN115587589A
Application granted
Publication of CN115587589B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a sentence confusion degree obtaining method, system and related equipment for multiple languages. The method includes: obtaining a sentence to be calculated, wherein the language corresponding to the sentence to be calculated is at least one of preset multiple languages; obtaining a primitive sequence corresponding to the sentence to be calculated according to a trained word segmentation model and a multilingual dictionary; adding a target language token at the first position of the primitive sequence to obtain a target sequence, wherein the target language token is the position index, in the multilingual dictionary, of the language identifier corresponding to the language of the sentence to be calculated; and obtaining, according to the target sequence, the sentence confusion degree corresponding to the sentence to be calculated through a trained multilingual confusion degree calculation model, wherein the trained multilingual confusion degree calculation model is obtained through training according to a preset multilingual corpus corresponding to the multiple languages. The method and the device help improve the accuracy of the obtained sentence confusion degree in multilingual scenarios.

Description

Statement confusion degree acquisition method and system for multiple languages and related equipment
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a method, a system and related equipment for obtaining sentence confusion degree for multiple languages.
Background
With the development of science and technology, especially deep learning, the applications of natural language processing are becoming wider and wider. In natural language processing, cleaning sentences by various means to identify valuable corpora is an extremely important step, and the value of a corpus can be determined in combination with sentence confusion degree.
Sentence confusion degree is an indicator that measures whether a sentence is fluent and semantically clear. In the prior art, sentence confusion degree is usually calculated with an n-gram confusion degree calculation model. The problem with the prior art is that an n-gram confusion degree calculation model can only obtain sentence confusion degree for sentences in one specific language and is not suitable for other languages (its accuracy on other languages is low). Therefore, in a multilingual scenario, using the same n-gram confusion degree calculation model on sentences of different languages requires retraining with the corpora of those languages, which does not help improve the accuracy of the obtained multilingual sentence confusion degrees.
Accordingly, there is a need for improvement and development in the art.
Disclosure of Invention
The invention mainly aims to provide a method, a system and related equipment for obtaining sentence confusion degree for multiple languages, so as to solve the problem in the prior art that, in a multilingual scenario, calculating sentence confusion degree for sentences of different languages with the same n-gram confusion degree calculation model requires retraining with the corpora of those languages and does not help improve the accuracy of the obtained sentence confusion degree.
In order to achieve the above object, a first aspect of the present invention provides a method for obtaining a sentence confusion degree for multiple languages, where the method for obtaining a sentence confusion degree for multiple languages includes:
acquiring a sentence to be calculated, wherein the language corresponding to the sentence to be calculated is at least one of preset multiple languages;
obtaining a primitive sequence corresponding to the sentence to be calculated according to the trained word segmentation model and the multilingual dictionary, wherein each element in the primitive sequence is the same as a value indicated by a position index in the multilingual dictionary;
adding a target language token at the first position of the primitive sequence to obtain a target sequence, wherein the target language token is a position index of a language identifier corresponding to the language of the sentence to be calculated in the multilingual dictionary;
according to the target sequence, obtaining the sentence confusion degree corresponding to the sentence to be calculated through a trained multilingual confusion degree calculation model, wherein the trained multilingual confusion degree calculation model is obtained through training according to a preset multilingual corpus corresponding to the multiple languages.
Optionally, the acquiring the statement to be calculated includes:
acquiring a text to be processed, and preprocessing the text to be processed according to preset preprocessing operations to obtain a preprocessed text, wherein the preprocessing operations comprise full-width/half-width conversion, case unification and merging of consecutive whitespace characters, and the text to be processed is composed of sentences corresponding to any one of the preset multiple languages;
and performing single-sentence segmentation on the preprocessed text according to the sentence delimiters in it, and taking each sentence obtained after the single-sentence segmentation in turn as the sentence to be calculated.
Optionally, the trained word segmentation model and the multilingual dictionary are obtained by training in advance according to the following steps:
acquiring the multilingual corpus, wherein the multilingual corpus comprises normal semantic data sets corresponding to each of the preset multiple languages;
preprocessing each normal semantic data set according to the preprocessing operations to obtain the preprocessed training text corresponding to each normal semantic data set;
training on the preprocessed training texts with the preset SentencePiece tool to obtain the trained word segmentation model and a dictionary to be processed;
and adding a language identifier corresponding to each of the preset multiple languages at the tail of the dictionary to be processed to obtain the multilingual dictionary, wherein the multilingual dictionary comprises a plurality of affixes and a plurality of language identifiers, whose positions are indicated by corresponding position indexes.
Optionally, the above-mentioned position index is a subscript value for representing the position.
Optionally, the obtaining, according to the target sequence, the statement confusion degree corresponding to the statement to be calculated through a trained multilingual confusion degree calculation model includes:
inputting the target sequence into the trained multilingual confusion degree calculation model, and obtaining a target scalar value output by the trained multilingual confusion degree calculation model;
and taking the numerical value obtained by subtracting the target scalar value from 1 as the statement confusion degree corresponding to the statement to be calculated.
Optionally, the trained multilingual confusion computation model includes a multi-layered stacked encoder, a fully connected layer, and a sigmoid function.
Optionally, the multilingual confusion calculating model is trained in advance according to the following steps:
performing single-sentence segmentation on each preprocessed training text according to the sentence delimiters in it, to obtain the training sentences corresponding to each preprocessed training text;
acquiring the training original primitive sequence corresponding to each training sentence according to the trained word segmentation model and the multilingual dictionary;
constructing the training negative-sample primitive sequences corresponding to each training original primitive sequence according to preset negative-sample construction operations, wherein the negative-sample construction operations comprise at least one of random shuffling, random replacement, random deletion, random insertion, segment position exchange and sequence reversal of the elements in the training original primitive sequence;
adding the corresponding training target language token at the first position of each training original primitive sequence to obtain the training target original sequences, and adding the corresponding training target language token at the first position of each training negative-sample primitive sequence to obtain the training target negative-sample sequences;
performing masked-language pre-training on the encoders according to the training target original sequences to obtain a pre-trained multilingual confusion degree calculation model;
and training the pre-trained multilingual confusion computing model according to the training target original sequence and the training target negative sample sequence to fine tune model parameters of the pre-trained multilingual confusion computing model and obtain a trained multilingual confusion computing model.
Optionally, training the pre-trained multilingual confusion degree calculation model according to the training target original sequences and the training target negative-sample sequences, to fine-tune the model parameters of the pre-trained model and obtain the trained multilingual confusion degree calculation model, includes:
inputting each element of a training sequence in the training data into the pre-trained multilingual confusion degree calculation model in turn; for each element of the training sequence, acquiring the training first-bit vectors output by the target encoders in the pre-trained model and splicing them into a training splicing vector; inputting the training splicing vector into the fully connected layer of the pre-trained model to obtain the training fully connected vector output by that layer; mapping the training fully connected vector with the sigmoid function to obtain a training target scalar value; amplifying the training target scalar value by a preset amplification factor to obtain a training amplified value; forming the training amplified label group of the training sequence from the training amplified values of all its elements; and converting the training amplified label group with a preset softmax function to obtain a training prediction label group. Here, the training data comprises a plurality of training text groups; each training text group comprises a training sequence and a corresponding training real label group; a training sequence consists of one training target original sequence and all the training target negative-sample sequences corresponding to it; the training real label group consists of a plurality of real labels, where the real label of the training target original sequence is 1 and the real label of each training target negative-sample sequence is 0; the target encoders are the last preset number of encoder layers in the pre-trained multilingual confusion degree calculation model; and the training first-bit vector is the first position of the vector output by each target encoder;
and fine-tuning the model parameters of the pre-trained multilingual confusion degree calculation model according to the training real label groups and the training prediction label groups, and continuing to execute the step of inputting each element of a training sequence in the training data into the pre-trained model, until a preset training condition is met, to obtain the trained multilingual confusion degree calculation model.
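For illustration only, the following is a minimal Python/PyTorch sketch of this fine-tuning objective, assuming a hypothetical score_model that maps one token sequence to the sigmoid scalar described above; the amplification factor value and the cross-entropy form of the loss are assumptions, since the embodiment does not fix them:

```python
import torch
import torch.nn.functional as F

def finetune_step(score_model, group, amplification=10.0):
    """One fine-tuning step on one training text group (hedged sketch).

    group: list of token-id tensors; group[0] is the training target
    original sequence, group[1:] are its training target negative-sample
    sequences. score_model maps one sequence to a scalar in (0, 1).
    """
    # One scalar per sequence in the group, amplified by a preset factor.
    scores = torch.stack([score_model(seq) for seq in group]) * amplification
    # The softmax over the group converts the amplified label group into
    # the training prediction label group.
    pred = F.softmax(scores, dim=0)
    # Training real label group: 1 for the original sequence, 0 for every
    # negative-sample sequence.
    target = torch.zeros_like(pred)
    target[0] = 1.0
    # Cross-entropy between real and predicted label groups (assumed loss).
    loss = -(target * pred.clamp_min(1e-9).log()).sum()
    loss.backward()
    return loss
```

The softmax over each group carries the contrastive idea: the original sequence only has to be ranked above its own corruptions, rather than scored on an absolute scale.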
The second aspect of the present invention provides a system for obtaining a confusion of a sentence for multiple languages, wherein the system for obtaining a confusion of a sentence for multiple languages includes:
the sentence acquisition module is used for acquiring a sentence to be calculated, wherein the language corresponding to the sentence to be calculated is at least one of preset multiple languages;
the primitive sequence acquisition module is used for acquiring primitive sequences corresponding to the sentences to be calculated according to the trained word segmentation model and the multilingual dictionary, wherein each element in the primitive sequences is the same as a value indicated by a position index in the multilingual dictionary;
the primitive sequence processing module is used for adding a target language token at the first position of the primitive sequence to obtain a target sequence, wherein the target language token is a position index of a language identifier corresponding to the language of the sentence to be calculated in the multilingual dictionary;
The sentence confusion degree obtaining module is used for obtaining the sentence confusion degree corresponding to the sentence to be calculated through a trained multilingual confusion degree calculation model according to the target sequence, wherein the trained multilingual confusion degree calculation model is obtained through training according to the preset multilingual corpus corresponding to multiple languages.
The third aspect of the present invention provides an intelligent terminal, where the intelligent terminal includes a memory, a processor, and a multilingual statement confusion acquiring program stored in the memory and executable on the processor, and the multilingual statement confusion acquiring program implements any one of the steps of the multilingual statement confusion acquiring method when executed by the processor.
A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon a multi-language sentence confusion obtaining program, wherein the multi-language sentence confusion obtaining program, when executed by a processor, implements any one of the steps of the multi-language sentence confusion obtaining method.
From the above, in the scheme of the present invention, the sentence to be calculated is obtained, wherein the language corresponding to the sentence to be calculated is at least one of preset multiple languages; a primitive sequence corresponding to the sentence to be calculated is obtained according to the trained word segmentation model and the multilingual dictionary, wherein each element in the primitive sequence equals a value indicated by a position index in the multilingual dictionary; a target language token is added at the first position of the primitive sequence to obtain a target sequence, wherein the target language token is the position index, in the multilingual dictionary, of the language identifier corresponding to the language of the sentence to be calculated; and according to the target sequence, the sentence confusion degree corresponding to the sentence to be calculated is obtained through a trained multilingual confusion degree calculation model, wherein the trained multilingual confusion degree calculation model is obtained through training according to the preset multilingual corpus corresponding to the multiple languages.
Compared with the prior art, the solution of the invention does not use an n-gram confusion degree calculation model to calculate sentence confusion degree, but provides a trained multilingual confusion degree calculation model. For a sentence to be calculated in any one of the preset multiple languages, the sentence confusion degree can be obtained with the same trained multilingual confusion degree calculation model, all languages sharing the same model parameters. Specifically, after the corresponding primitive sequence is obtained according to the trained word segmentation model and the multilingual dictionary, the corresponding target language token is added at the first position of the primitive sequence to obtain the corresponding target sequence, so that the actual language of each target sequence can be distinguished by its target language token. The sentence confusion degree corresponding to the sentence to be calculated is then obtained from the target sequence through the trained multilingual confusion degree calculation model. Because the trained multilingual confusion degree calculation model is trained on a preset multilingual corpus covering the multiple languages, it is suitable for obtaining the confusion degree of sentences in different languages, thereby realizing confusion degree calculation for sentences in any language in a multilingual scenario and helping to improve the accuracy of the obtained sentence confusion degree.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a method for obtaining statement confusion degree for multiple languages according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of mask language pre-training for an encoder according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a training process of a multilingual confusion degree calculation model according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a composition module of a multi-language sentence confusion obtaining system according to an embodiment of the present invention;
fig. 5 is a schematic block diagram of an internal structure of an intelligent terminal according to an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the appended claims, the term "if" may be interpreted, depending on the context, as "when" or "once" or "in response to a determination" or "in response to a classification". Similarly, the phrase "if determined" or "if classified into [the described condition or event]" may be interpreted, depending on the context, as "upon determination" or "in response to determination" or "upon classification into [the described condition or event]" or "in response to classification into [the described condition or event]".
The following description of the embodiments of the present invention will be made more fully hereinafter with reference to the accompanying drawings, in which embodiments of the invention are shown, it being evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.
With the development of science and technology, especially deep learning, the applications of Natural Language Processing (NLP) are becoming more and more widespread. Currently, BERT and its derivative technologies have become the mainstream solution for various natural language processing tasks, such as classification, dialogue, entity recognition, translation and generation. BERT and its related solutions require a large amount of training corpus. To make the models more accurate, valuable corpora need to be identified from the massive data.
Therefore, in natural language processing, cleaning sentences by various means to identify valuable corpora is an extremely important step, and the value of a corpus can be determined in combination with sentence confusion degree. In general, the available mass data comes from Internet channels, but the quality of network data is uneven and needs to be cleaned by a combination of means, among which sentence confusion degree is an important evaluation index. Sentence confusion degree is an indicator that measures whether a sentence is fluent and semantically clear. The smaller the confusion degree, the more authentic the sentence; conversely, the greater the confusion degree, the more likely the sentence is unrealistic (i.e., not actually produced by a human, or expressed in a way that does not conform to human habits) and should be filtered out of the corpus.
In the prior art, sentence confusion degree is usually calculated with an n-gram confusion degree calculation model. The problem with the prior art is that an n-gram confusion degree calculation model can only obtain sentence confusion degree for sentences in one specific language and is not suitable for other languages (its accuracy on other languages is low). Therefore, in a multilingual scenario, using the same n-gram confusion degree calculation model on sentences of different languages does not help improve the accuracy of the obtained sentence confusion degrees. Moreover, an n-gram confusion degree calculation model needs sufficiently large training data from which statistical rules are extracted; if those rules do not reach the required standard, the model easily becomes unusable. On the other hand, the confusion scores calculated by such a model have no obvious boundary between semantically fluent and non-fluent sentences, so no specific judgment threshold can be determined, which makes subsequent cleaning steps difficult to carry out.
If a separate n-gram confusion degree calculation model were trained for each language, then for each sentence the language would first have to be determined and the corresponding model selected, and multiple n-gram models would have to be trained in advance. This consumes more processing time, does not improve the efficiency of sentence confusion acquisition, requires more hardware in a multilingual scenario, and complicates model management and updating. Moreover, most languages lack available confusion calculation models at all, making sentence confusion acquisition difficult to achieve.
In order to solve at least one of the above problems, in the solution of the present invention, a sentence to be calculated is obtained, where the language corresponding to the sentence to be calculated is at least one of preset multiple languages; a primitive sequence corresponding to the sentence to be calculated is obtained according to the trained word segmentation model and the multilingual dictionary, wherein each element in the primitive sequence equals a value indicated by a position index in the multilingual dictionary; a target language token is added at the first position of the primitive sequence to obtain a target sequence, wherein the target language token is the position index, in the multilingual dictionary, of the language identifier corresponding to the language of the sentence to be calculated; and according to the target sequence, the sentence confusion degree corresponding to the sentence to be calculated is obtained through a trained multilingual confusion degree calculation model, wherein the trained multilingual confusion degree calculation model is obtained through training according to a preset multilingual corpus corresponding to the multiple languages.
Compared with the prior art, the solution of the invention does not use an n-gram confusion degree calculation model to calculate sentence confusion degree, but provides a trained multilingual confusion degree calculation model, and the same trained model can be used to obtain the sentence confusion degree of a sentence to be calculated in any one of the preset multiple languages. Specifically, after the corresponding primitive sequence is obtained according to the trained word segmentation model and the multilingual dictionary, the corresponding target language token is added at the first position of the primitive sequence to obtain the corresponding target sequence, so that the actual language of each target sequence can be distinguished by its target language token. The sentence confusion degree corresponding to the sentence to be calculated is then obtained from the target sequence through the trained multilingual confusion degree calculation model. Because the trained multilingual confusion degree calculation model is trained on a preset multilingual corpus covering the multiple languages, it is suitable for obtaining the confusion degree of sentences in different languages, thereby realizing confusion degree calculation for sentences in any language in a multilingual scenario and helping to improve the accuracy of the obtained sentence confusion degree.
Thus, the solution of the invention fuses multilingual corpora to train a single neural-network-based multilingual confusion degree calculation model, with all languages sharing one set of model parameters; that is, one multilingual confusion degree calculation model serves multiple languages and achieves a better calculation effect. Furthermore, by constructing negative samples and combining the ideas of pre-training and contrastive learning, the calculation accuracy of the trained multilingual confusion degree calculation model can be further improved.
Exemplary method
As shown in fig. 1, an embodiment of the present invention provides a method for obtaining a statement confusion degree for multiple languages, and specifically, the method includes the following steps:
step S100, obtaining a sentence to be calculated, wherein the language corresponding to the sentence to be calculated is at least one of preset languages.
Specifically, the sentence to be calculated is a sentence whose confusion degree needs to be obtained. In this embodiment, the method for obtaining sentence confusion degree for multiple languages has high applicability and can be applied to multilingual scenarios: a sentence in such a scenario may be in any one of the multiple languages, and a sentence to be calculated in any of those languages is processed in the same way to finally obtain its sentence confusion degree. Note that the sentence to be calculated may also correspond to several languages, with similar processing; this embodiment is described with a sentence corresponding to one language as an example, which is not a specific limitation.
In this embodiment, a preset plurality of languages are predetermined, so that training is performed according to the corpus of each corresponding language when training the multilingual confusion calculation model. For example, the preset multiple languages may include chinese, english, japanese, and the like, and may further include other languages, and the preset multiple languages specifically include which of the multiple languages may be set and adjusted according to actual requirements, which is not specifically limited herein.
In this embodiment, the sentence to be calculated may be a sentence obtained by performing processing such as segmentation on a text to be processed. That is, the user may need to calculate the statement confusion degree for all the statements in the entire text to be processed during the use process, and in this embodiment, each statement in the text to be processed is acquired to calculate the statement confusion degree in turn.
Specifically, obtaining the sentence to be calculated includes: acquiring a text to be processed, and preprocessing it according to preset preprocessing operations to obtain a preprocessed text, wherein the preprocessing operations comprise full-width/half-width conversion, case unification and merging of consecutive whitespace characters, and the text to be processed is composed of sentences corresponding to any one of the preset multiple languages; and performing single-sentence segmentation on the preprocessed text according to the sentence delimiters in it, and taking each sentence obtained after the single-sentence segmentation in turn as the sentence to be calculated.
The text to be processed is a text that needs to be processed, the sentence confusion degree of whose sentences is to be calculated. It should be noted that a text to be processed includes a plurality of sentences, whose languages may be the same or different; it is only required that each single sentence to be calculated obtained after segmentation corresponds to one language. In this embodiment, a text to be processed whose sentences all correspond to the same language is taken as an example for explanation, not as a specific limitation.
Specifically, the preprocessing operations are used to preprocess and clean the text so that the processed text conforms to the specification, for example by unifying full-width and half-width characters, unifying case and merging redundant whitespace characters, which improves the recognition efficiency and calculation accuracy of the subsequent models. The sentence delimiters include line breaks, periods, question marks and exclamation marks, and may further include other symbols indicating the end of a sentence, which are not specifically limited here.
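As a concrete illustration, the following Python sketch implements the preprocessing and single-sentence segmentation described above; the exact delimiter set and the use of NFKC normalization for full-width/half-width conversion are assumptions based on this description:

```python
import re
import unicodedata

# Sentence delimiters named above: line break, period, question mark,
# exclamation mark (Chinese and Western forms assumed).
DELIMITERS = r"[\n。．.!！?？]"

def preprocess(text: str) -> str:
    # Full-width/half-width unification via Unicode NFKC normalization.
    text = unicodedata.normalize("NFKC", text)
    # Case unification (lowercasing assumed).
    text = text.lower()
    # Merge runs of whitespace into one space; line breaks are kept as
    # sentence delimiters.
    return re.sub(r"[ \t\r\f\v]+", " ", text).strip()

def split_sentences(text: str):
    # Single-sentence segmentation on the delimiters; empty pieces dropped.
    return [s.strip() for s in re.split(DELIMITERS, text) if s.strip()]
```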
Step S200, obtaining a primitive sequence corresponding to the sentence to be calculated according to the trained word segmentation model and the multilingual dictionary, wherein each element in the primitive sequence is the same as a value indicated by a position index in the multilingual dictionary.
The trained word segmentation model and the multilingual dictionary are obtained by training in advance. The word segmentation model is used to divide an input sentence (such as the sentence to be calculated) into word or affix units and to numericalize them into the corresponding primitive sequence. Each affix is stored in the multilingual dictionary in advance, in numerical form (i.e., as a token value) rather than as a vector, which improves the efficiency of subsequent calculation and processing. In this embodiment, each position in the multilingual dictionary is indicated by a position index, and the value at each position represents an affix or a language identifier. The word segmentation model divides an input sentence into single affixes corresponding to the affixes in the multilingual dictionary, obtaining the primitive sequence corresponding to the sentence.
It should be noted that the trained word segmentation model and the multilingual dictionary are obtained by training according to the preset multilingual corpus corresponding to multiple languages. In this embodiment, the trained word segmentation model and the multilingual dictionary are obtained by training in advance according to the following steps:
Acquiring the multilingual corpus, wherein the multilingual corpus comprises normal semantic data sets corresponding to each of the preset multiple languages;
preprocessing each normal semantic data set according to the preprocessing operation to obtain a preprocessing training text corresponding to each normal semantic data set;
training on the preprocessed training texts with the preset SentencePiece open-source tool to obtain the trained word segmentation model and a dictionary to be processed;
and adding language marks corresponding to each language in the preset multiple languages at the tail part of the dictionary to be processed to obtain the multilingual dictionary, wherein the multilingual dictionary comprises a plurality of affix and a plurality of language marks, and the positions of the affix and the language marks are respectively indicated by corresponding position indexes.
The multilingual corpus consists of texts for training corresponding to all preset languages. The normal semantic data set is used for training, and is composed of sentences with normal semantics, namely, the sentences in the normal semantic data set are all smooth sentences with clear semantics meeting the requirements, and each sentence in the normal semantic data set can be input in advance by a user or obtained by screening by the user, and the normal semantic data set is not particularly limited.
It should be noted that, one language may correspond to one or more normal semantic data sets, one normal semantic data set is formed by a plurality of texts of normal semantics corresponding to the corresponding language, and one text includes a plurality of sentences.
In one application scenario, after the multilingual corpus is acquired, a unified text tokenization algorithm (i.e., a tokenizer) is trained based on the multilingual corpus; it is used to segment sentences into affixes and numericalize them (i.e., tokenization) so as to obtain the corresponding primitive sequence (i.e., token sequence).
In this embodiment, the word segmentation model is the model implementing this text tokenization algorithm. Specifically, the normal semantic data sets corresponding to each language in the multilingual corpus are preprocessed and cleaned, including full-width/half-width conversion, case unification, merging of consecutive whitespace characters and the like, to obtain the preprocessed training text corresponding to each normal semantic data set. The dictionary total size (vocab_size) is then set, and all the preprocessed training texts are trained with the SentencePiece open-source tool, resulting in the trained word segmentation model (SPM, SentencePiece Model) and a dictionary to be processed. The dictionary total size defines the total number of affixes in the dictionary to be processed; it may be preset, or set and adjusted in real time by the user, and is not specifically limited here.
Furthermore, in this embodiment, language identifiers corresponding to various languages are further added at the tail of the dictionary to be processed, so as to obtain a multi-language dictionary that is finally shared by multiple languages, where the language identifiers are used for indicating and distinguishing different languages, for example, language identifiers corresponding to chinese languages and english languages are different. In this embodiment, the language identifier is also stored in the form of a numerical value (i.e., token value) rather than a vector. In the multilingual dictionary described above, different positions are indicated by position indexes. In this embodiment, the position index is a subscript value for indicating a position. For example, a subscript of 1 in the multilingual dictionary may represent the 1 st position, a subscript of 2 may represent the 2 nd position, and so on, and will not be described again.
In one application scenario, the multilingual dictionary may be stored in a table format, where each affix or each language identifier corresponds to a subscript of the dictionary and a position in the table. In another application scenario, the multilingual dictionary is stored in txt format, each line stores a affix (or a language identifier), and the line number corresponding to one affix (or a language identifier) is used as the subscript (i.e. the position index) of the affix (or the language identifier) by reading sequentially.
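For illustration, a hedged sketch of training the word segmentation model with the SentencePiece open-source tool and appending language identifiers at the tail of the resulting dictionary; the file names, the vocab_size value and the spellings of the language identifiers are assumptions:

```python
import sentencepiece as spm

# Train the shared word segmentation model (SPM) on all preprocessed
# training texts; vocab_size bounds the number of affixes in the dictionary.
spm.SentencePieceTrainer.train(
    input="preprocessed_corpus.txt",   # assumed file of training sentences
    model_prefix="multilingual_spm",
    vocab_size=32000,                  # illustrative dictionary total size
)

sp = spm.SentencePieceProcessor(model_file="multilingual_spm.model")

# Dictionary stored one affix per line; reading in order, the line number
# serves as the position index (subscript) of that affix.
vocab = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]

# Append one language identifier per preset language at the tail; the
# identifier's position index later serves as the [LID] token.
LANGUAGE_TAGS = ["<zh>", "<en>", "<ja>"]   # assumed identifier spellings
vocab.extend(LANGUAGE_TAGS)
LID_INDEX = {tag: vocab.index(tag) for tag in LANGUAGE_TAGS}
```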
Step S300, adding a target language token at the first position of the primitive sequence to obtain a target sequence, wherein the target language token is the position index, in the multilingual dictionary, of the language identifier corresponding to the language of the sentence to be calculated.
Specifically, after the primitive sequence corresponding to the sentence to be calculated is obtained, a target language token is added to identify the language of the primitive sequence. It should be noted that the language is not added to the primitive sequence in the form of a name or of the language identifier's token value; rather, the position index of the language identifier in the multilingual dictionary is added, which better distinguishes the language identification from the real content of the sentence to be calculated and improves the accuracy of the confusion degree calculation.
In one application scenario, the target language token is the position index corresponding to the target language identifier in the multilingual dictionary, where the target language identifier is the language identifier, stored in the multilingual dictionary, of the same language as the sentence to be calculated. In this embodiment, [LID] denotes the target language token.
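Continuing the sketch above, the target sequence of a sentence to be calculated can be built by tokenizing it into a primitive sequence and adding the [LID] position index at the first position (sp and LID_INDEX are the hypothetical names from the previous sketch):

```python
def build_target_sequence(sp, lid_index, sentence: str, language_tag: str):
    # Primitive sequence: each element equals a position index into the
    # multilingual dictionary (token values from the trained SPM).
    primitive_sequence = sp.encode(sentence, out_type=int)
    # Target sequence: the [LID] token, i.e. the position index of the
    # language identifier, is added at the first position.
    return [lid_index[language_tag]] + primitive_sequence

# e.g. build_target_sequence(sp, LID_INDEX, "the weather is nice today.", "<en>")
```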
Step S400, according to the target sequence, obtaining the sentence confusion degree corresponding to the sentence to be calculated through a trained multilingual confusion degree calculation model, wherein the trained multilingual confusion degree calculation model is obtained through training according to the preset multilingual corpus corresponding to the multiple languages.
In this embodiment, the multilingual confusion computing model is a neural network model based on a Transformer model structure, and can perform confusion computing for sentences to be computed of various languages in a multilingual scene, so as to solve the problems that the corpus and confusion computing model of certain languages are lacking and the confusion computing models of various languages are mutually independent.
Specifically, the calculating, according to the target sequence, the statement confusion degree corresponding to the statement to be calculated through a trained multilingual confusion degree calculation model includes: inputting the target sequence into the trained multilingual confusion degree calculation model, and obtaining a target scalar value output by the trained multilingual confusion degree calculation model; and taking the numerical value obtained by subtracting the target scalar value from 1 as the statement confusion degree corresponding to the statement to be calculated.
Thus, when a text to be processed comprising multiple sentences needs to be processed with the sentence confusion obtaining method of this embodiment, the text is preprocessed and split into single sentences according to the above steps, each sentence is serialized with the trained word segmentation model, and the target language token corresponding to the language of the text to be processed (or of the single sentence obtained after splitting) is added at the beginning of the sentence, obtaining the target sequence (i.e., the token sequence) of each sentence to be processed. The target sequence is input into the trained multilingual confusion degree calculation model, which outputs a target scalar value between 0 and 1; the sentence confusion degree equals 1 minus the target scalar value.
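A minimal sketch of this inference step, treating the trained model as an opaque callable that returns the target scalar value; the threshold-based filter anticipates the next paragraph, and its 0.5 value is purely an assumption:

```python
def sentence_confusion(model, target_sequence) -> float:
    # The trained multilingual confusion degree calculation model outputs
    # a target scalar value between 0 and 1.
    target_scalar = model(target_sequence)
    # Sentence confusion degree = 1 - target scalar value.
    return 1.0 - target_scalar

def keep_for_corpus(model, target_sequence, threshold=0.5) -> bool:
    # Sentences whose confusion degree exceeds a preset threshold are
    # candidates for deletion or correction (0.5 is an assumed value).
    return sentence_confusion(model, target_sequence) <= threshold
```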
It should be noted that the greater the sentence confusion degree, the greater the possibility that the sentence to be processed is unclear or does not conform to the specification. If the user later needs to use the sentence for other operations (for example, training a BERT model) and the validity of the corpus must be ensured, the user can screen, delete, re-edit or correct sentences according to their sentence confusion degrees so as to improve the accuracy of subsequent operations. For example, the user may preset a confusion threshold; when the sentence confusion degree of a sentence to be processed is greater than the threshold, the sentence needs to be deleted or corrected.
In this embodiment, the trained multilingual confusion computation model includes multiple layers of stacked encoders, a full connection layer, and a sigmoid function.
The multilingual confusion degree calculation model is trained in advance according to the following steps:
performing single-sentence segmentation on each preprocessed training text according to the sentence delimiters in it, to obtain the training sentences corresponding to each preprocessed training text;
acquiring the training original primitive sequence corresponding to each training sentence according to the trained word segmentation model and the multilingual dictionary;
constructing the training negative-sample primitive sequences corresponding to each training original primitive sequence according to preset negative-sample construction operations, wherein the negative-sample construction operations comprise at least one of random shuffling, random replacement, random deletion, random insertion, segment position exchange and sequence reversal of the elements in the training original primitive sequence;
adding the corresponding training target language token at the first position of each training original primitive sequence to obtain the training target original sequences, and adding the corresponding training target language token at the first position of each training negative-sample primitive sequence to obtain the training target negative-sample sequences;
performing masked-language pre-training on the encoders according to the training target original sequences to obtain a pre-trained multilingual confusion degree calculation model;
and training the pre-trained multilingual confusion computing model according to the training target original sequence and the training target negative sample sequence to fine tune model parameters of the pre-trained multilingual confusion computing model and obtain a trained multilingual confusion computing model.
In this embodiment, during model training, corresponding negative samples (i.e., training negative-sample primitive sequences) are constructed from the training original primitive sequences corresponding to real training sentences; constructing negative samples increases the amount of training data and thus improves the training effect. The training sentences are obtained by single-sentence segmentation of the preprocessed training texts; the training original primitive sequences (i.e., training original token sequences) are the token sequences obtained by tokenizing the training sentences; and the training negative-sample primitive sequences are the token sequences obtained from the training original primitive sequences through the negative-sample construction operations.
It should be noted that the training original primitive sequences and training negative-sample primitive sequences are similar, in form and processing, to the primitive sequence of a sentence to be calculated obtained when the model is used; the main difference is that the former are data of the training process, while the latter is data of the process of calculating sentence confusion degree with the trained model. Similarly, other data used in the training process (e.g., training target language tokens) have counterparts in the model use process (e.g., target language tokens), whose descriptions and processing can refer to each other and are not repeated here.
Specifically, one training primitive sequence may obtain one or more training negative sample primitive sequences through a negative sample construction operation, and in this embodiment, the construction to obtain a plurality of corresponding training negative sample primitive sequences is described as an example, but not as a specific limitation. Further, the language of any training negative-sample primitive sequence is the same as the language of the corresponding training primitive sequence, so that one training primitive sequence has the same training target language token as all the training negative-sample primitive sequences corresponding to the training primitive sequence. A training target language token corresponding to the training original primitive sequence is a position index (i.e. a subscript value representing a position) of a language identification of the language of the training original primitive sequence in the multilingual dictionary.
In this embodiment, the negative-sample construction operations include random shuffling, random replacement, random deletion, random insertion, segment position exchange and sequence reversal of the elements in the training original primitive sequence.
Random shuffling keeps the sequence length unchanged and randomly reorders the elements of the training original primitive sequence. Random replacement randomly replaces some elements of the training original primitive sequence with other elements of the multilingual dictionary (specifically, with affixes from the multilingual dictionary). Random deletion randomly prunes a segment of a first preset length (for example, not less than a quarter of the sequence length) from the training original primitive sequence, without padding the deleted part with other characters. Random insertion randomly inserts elements of the multilingual dictionary up to a second preset length (for example, not more than 1 times the sequence length). Segment position exchange swaps the positions of two element segments in the training original primitive sequence: a position x that is neither first nor last is selected, and the segments before and after x are exchanged, for example [A, B, C, x, D, E] becomes [D, E, x, A, B, C]; the lengths of the two exchanged segments may be unequal. Sequence reversal directly reverses the sequence, for example [A, B, C, D, E] becomes [E, D, C, B, A]. Note that x, A, B, C, D and E are merely illustrative elements, not specific limitations.
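The six operations can be sketched as follows in Python; the sampling details (how many elements to replace, where to cut) are illustrative assumptions within the ranges given above:

```python
import random

def random_shuffle(seq):
    # Keep the sequence length unchanged and reorder elements randomly.
    out = seq[:]
    random.shuffle(out)
    return out

def random_replace(seq, vocab_size, k=2):
    # Replace k randomly chosen elements with random dictionary affixes.
    out = seq[:]
    for i in random.sample(range(len(out)), min(k, len(out))):
        out[i] = random.randrange(vocab_size)
    return out

def random_delete(seq):
    # Prune a contiguous segment of roughly a quarter of the sequence
    # length, without padding the deleted part.
    n = max(1, len(seq) // 4)
    start = random.randrange(len(seq) - n + 1)
    return seq[:start] + seq[start + n:]

def random_insert(seq, vocab_size):
    # Insert up to 1x the sequence length of random dictionary elements.
    out = seq[:]
    for _ in range(random.randint(1, len(seq))):
        out.insert(random.randrange(len(out) + 1), random.randrange(vocab_size))
    return out

def segment_exchange(seq):
    # Pick a non-terminal position x and swap the segments around it,
    # e.g. [A, B, C, x, D, E] -> [D, E, x, A, B, C].
    x = random.randrange(1, len(seq) - 1)
    return seq[x + 1:] + [seq[x]] + seq[:x]

def reverse_order(seq):
    # Directly reverse the sequence, e.g. [A, B, C, D, E] -> [E, D, C, B, A].
    return seq[::-1]
```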
After the training negative-sample primitive sequences are constructed, the corresponding target language token ([LID]) is added at the first position of each training original primitive sequence and each training negative-sample primitive sequence, based on the language identifiers in the multilingual dictionary and their position indexes, obtaining the training target original sequences and the training target negative-sample sequences respectively. The multilingual confusion degree calculation model is then pre-trained on the training target original sequences, which improves the training effect and efficiency.
In this embodiment, the basic unit of the multilingual confusion degree calculation model is a stack of Transformer Encoder layers; the number of encoder layers can be set according to actual requirements, and is set to 12 in this embodiment. For each training target original sequence, vector embedding is applied to the sequence together with the corresponding position vectors, and the result is input into the encoder layers for masked language model (MLM, Masked Language Model) pre-training.
Fig. 2 is a schematic flow chart of masked-language pre-training of the encoders according to an embodiment of the present invention. As shown in fig. 2, the training target original sequence is used as input during masked-language pre-training, and the corresponding text tokenization vector and position vector are obtained from it; the text tokenization vector is obtained by vectorizing the training target original sequence, and its first position is the target language token serving as the language tag. An embedding can encode an object with a low-dimensional vector while preserving its meaning, so in this embodiment the text tokenization vector is formed by applying an embedding to the training target original sequence, which helps improve the efficiency and effect of obtaining sentence confusion degree. Note that the position vector is a position embedding initialized randomly and then learned during training. In this embodiment, the encoders are pre-trained through masked-language-model task training (e.g., training of classification tasks), and the pre-trained model parameters corresponding to the encoders are obtained after pre-training. During masked-language pre-training, some necessary network layers, such as the vector embedding layer in fig. 2, can be added to the multilingual confusion degree calculation model as required; after the MLM task is completed, the temporarily added layers are discarded and only the pre-trained encoder layers, the fully connected layer and a logistic-regression activation layer holding the sigmoid function are retained. In this embodiment the sigmoid function is held in a logistic-regression activation layer, but in actual use this layer may be omitted and the sigmoid function placed in another layer or stored separately, which is not limited here. During MLM pre-training, the input training target original sequence undergoes the corresponding masking operation, the output is a complete token sequence, and the MLM pre-training task is considered complete when the loss between the output token sequence and the input training target original sequence converges.
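A brief sketch of the masking step of this MLM pre-training; the 15% masking ratio and the -100 ignore-label convention (common in PyTorch) are standard-MLM assumptions, since the embodiment does not fix them:

```python
import random

def mask_for_mlm(target_sequence, mask_id, mask_ratio=0.15):
    # Mask some positions of the training target original sequence; the
    # first position, the [LID] language tag, is left untouched.
    masked, labels = target_sequence[:], [-100] * len(target_sequence)
    for i in range(1, len(target_sequence)):
        if random.random() < mask_ratio:
            labels[i] = target_sequence[i]   # model must restore this token
            masked[i] = mask_id
    # Pre-training is considered complete when the loss between the model's
    # output token sequence and the original sequence converges.
    return masked, labels
```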
Further, as shown in fig. 2, the first positions (i.e., the training first bit vectors, the results corresponding to the target language token) of the outputs of the target encoders, namely the last preset number of encoder layers (for example the last 4 layers, i.e., a preset layer number of 4), are spliced to obtain a training splicing vector. The training splicing vector is input into the fully-connected layer, which outputs a 128-dimensional vector (the training fully-connected vector), and a training target scalar value is then output through a sigmoid function, an activation function that maps a variable into the interval between 0 and 1. Splicing and subsequently computing over only the first-position vectors of a preset number of target encoders improves processing efficiency, both during model training and when the model is used to obtain sentence confusion.
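A sketch of this head, under the assumptions of the encoder sketch above: the first-position vectors of the last 4 encoder layers are concatenated, projected to a 128-dimensional fully-connected vector, and mapped to a scalar in (0, 1). The final 128-to-1 projection before the sigmoid is an assumption of the example, since a sigmoid alone cannot reduce a 128-dimensional vector to one scalar.

```python
import torch
import torch.nn as nn

class ConfusionHead(nn.Module):
    """Splices first-position vectors of the last target encoder layers into one scalar."""
    def __init__(self, d_model=768, n_target_layers=4):
        super().__init__()
        self.fc = nn.Linear(d_model * n_target_layers, 128)  # fully-connected layer
        self.out = nn.Linear(128, 1)                         # assumed 128 -> 1 projection

    def forward(self, layer_outputs):
        # layer_outputs: list of (batch, seq_len, d_model), one per encoder layer
        firsts = [h[:, 0, :] for h in layer_outputs[-4:]]    # training first bit vectors
        spliced = torch.cat(firsts, dim=-1)                  # training splicing vector
        fc_vec = self.fc(spliced)                            # 128-dimensional vector
        return torch.sigmoid(self.out(fc_vec)).squeeze(-1)   # target scalar in (0, 1)
```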
It should be noted that the above describes the data processing during training of the multilingual confusion degree calculation model; the processing is similar when the trained model is used. Specifically, the input is the target sequence obtained by processing the sentence to be calculated; the first-position vectors output by the last four encoder layers (i.e., the target encoders) of the trained model are obtained and spliced into a splicing vector; the splicing vector is input into the fully-connected layer to obtain the fully-connected vector; a target scalar value is then obtained through the sigmoid mapping; and the value obtained by subtracting the target scalar value from 1 is taken as the sentence confusion corresponding to the sentence to be calculated.
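Put together, inference might look like the sketch below, which reuses the assumed ConfusionEncoder and ConfusionHead from the previous snippets; iterating the stacked layers to collect per-layer outputs is likewise an assumption about the internals.

```python
import torch

def sentence_confusion(encoder, head, target_sequence):
    """Return 1 minus the model's scalar output for a (1, seq_len) target sequence."""
    with torch.no_grad():
        positions = torch.arange(target_sequence.size(1))
        x = encoder.tok_emb(target_sequence) + encoder.pos_emb(positions)
        layer_outputs = []
        for layer in encoder.encoder.layers:   # collect each encoder layer's output
            x = layer(x)
            layer_outputs.append(x)
        scalar = head(layer_outputs)           # target scalar value in (0, 1)
    return (1.0 - scalar).item()               # sentence confusion degree
```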
After the encoder has been MLM pre-trained as described above, the multilingual confusion degree calculation model is further trained with positive and negative samples to fine-tune its model parameters.
Further, training the pre-trained multilingual confusion degree calculation model according to the training target original sequence and the training target negative sample sequence, so as to fine-tune the model parameters of the pre-trained model and obtain the trained multilingual confusion degree calculation model, includes:
Each element of a training sequence in the training data is input in turn into the pre-trained multilingual confusion degree calculation model. For each element, the training first bit vectors output by the target encoders in the pre-trained model are obtained and spliced into a training splicing vector; the training splicing vector is input into the fully-connected layer of the pre-trained model to obtain the training fully-connected vector output by that layer; the training fully-connected vector is mapped through the sigmoid function to obtain a training target scalar value; and the training target scalar value is amplified by a preset amplification factor to obtain a training amplification value. The training amplification values corresponding to all elements in the training sequence form the training amplified label group corresponding to that training sequence, which is converted by a preset softmax function into the training prediction label group. Here the training data comprises a plurality of training text groups, each consisting of a training sequence and its corresponding training real label group; a training sequence consists of one training target original sequence and all of the training target negative sample sequences corresponding to it; the training real label group consists of a plurality of real labels, where the real label corresponding to the training target original sequence is 1 and the real label corresponding to each training target negative sample sequence is 0; the target encoders are the last preset number of encoder layers of the pre-trained multilingual confusion degree calculation model; and the training first bit vector is the first position of the vector output by each target encoder;
And according to the training real label group and the training prediction label group, the model parameters of the pre-trained multilingual confusion degree calculation model are fine-tuned, and the step of sequentially inputting each element of a training sequence in the training data into the pre-trained model is repeated until a preset training condition is met, so as to obtain the trained multilingual confusion degree calculation model.
The training prediction label group consists of a plurality of prediction labels; each prediction label is a confusion-related value produced when the multilingual confusion degree calculation model predicts the corresponding element in the training sequence. Elements of the training sequence correspond one-to-one to the prediction labels in the training prediction label group, and likewise to the real labels in the training real label group. The preset training condition is that the number of iterations reaches a preset threshold or the computed loss converges.
In this embodiment, the loss value is calculated from the corresponding real labels and prediction labels in the training real label group and the training prediction label group, and the model parameters are fine-tuned accordingly.
Specifically, in this embodiment a training sequence is composed of a training target original sequence and all of the training target negative sample sequences corresponding to it, forming a token sequence array. For example, a training sequence may be (X, X'1, X'2, ..., X'n), where X is the training target original sequence and X'1 ... X'n are its training target negative sample sequences; the corresponding training real label group is then (1, 0, 0, ..., 0). Note that the arrangement order of the elements in the training sequence is not particularly limited.
The training target scalar value is typically a floating-point number between 0 and 1. For each element in the training sequence, the corresponding training target scalar value is obtained and amplified by the preset amplification factor to obtain a training amplification value. The preset amplification factor can be set in advance or adjusted to actual conditions; in this embodiment all training target scalar values are amplified by a factor k (with k between 10 and 100), where k can be a preset temperature hyperparameter (temperature parameter), set so that the model learns hard negative samples better. The training amplified label group corresponding to the training sequence is constructed from these amplified values, and the training objective is built on the difference between the training prediction label group (obtained from the amplified label group via the softmax function) and the training real label group; in this embodiment the corresponding loss is computed with a cross-entropy loss function. Iterative training is performed until the loss converges, and once fine-tuning is finished the trained multilingual confusion degree calculation model is obtained.
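As a sketch of this fine-tuning objective: the scalars of one training sequence are amplified by the temperature-like factor k, converted into the training prediction label group with softmax, and compared to the one-hot training real label group with cross-entropy. Placing the original sequence at index 0 and choosing k = 20 (inside the stated 10-100 range) are assumptions of the example.

```python
import torch
import torch.nn.functional as F

def fine_tune_loss(target_scalars, k=20.0):
    """Cross-entropy between softmaxed amplified scalars and one-hot real labels."""
    # target_scalars: (batch, group_size); index 0 is assumed to be the
    # training target original sequence, the rest its negative samples.
    amplified = target_scalars * k                    # training amplified label group
    log_probs = F.log_softmax(amplified, dim=-1)      # training prediction label group
    real = torch.zeros_like(target_scalars)
    real[:, 0] = 1.0                                  # real label 1 for the original, 0 for negatives
    return -(real * log_probs).sum(dim=-1).mean()     # cross-entropy loss

loss = fine_tune_loss(torch.tensor([[0.91, 0.40, 0.33, 0.08]]))
```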
Fig. 3 is a schematic training flow chart of the multilingual confusion degree calculation model provided by the embodiment of the present invention. As shown in fig. 3, corpora of different languages first undergo unified preprocessing (including full-width/half-width conversion, merging of blank characters and unifying of case) to obtain the corresponding texts (i.e., the preprocessed training texts), which can be used to train the word segmentation model and to obtain the corresponding multilingual dictionary. Each training sentence in the preprocessed training texts is then converted into tokens based on the trained word segmentation model and the multilingual dictionary, yielding the training original primitive sequences. The encoder is MLM pre-trained with these training original primitive sequences to obtain the pre-trained multilingual confusion degree calculation model. Meanwhile, negative sampling is performed on each training original primitive sequence (i.e., corresponding negative samples are generated) to obtain the training negative sample primitive sequences, and the model parameters of the pre-trained model are fine-tuned according to the training original primitive sequences and the training negative sample primitive sequences (after the corresponding target language tokens have been added), finally yielding the trained multilingual confusion degree calculation model.
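The unified preprocessing named above can be illustrated with the short sketch below; the exact rules are not spelled out in the patent, so NFKC normalization and lowercasing are assumptions that approximate full-width/half-width conversion and case unification.

```python
import re
import unicodedata

def preprocess(text: str) -> str:
    """Full-width/half-width conversion, blank merging, unified case (assumed rules)."""
    text = unicodedata.normalize("NFKC", text)   # folds full-width forms to half-width
    text = re.sub(r"\s+", " ", text).strip()     # merge consecutive blank characters
    return text.lower()                          # unify case

print(preprocess("Ｈｅｌｌｏ　　ＷＯＲＬＤ"))  # -> "hello world"
```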
From the above, the multilingual sentence confusion acquisition method provided by the embodiment of the present invention supplies a trained multilingual confusion degree calculation model, and the same trained model can be used to obtain the sentence confusion of a sentence to be calculated in any one of the preset multiple languages. Specifically, after the corresponding primitive sequence is obtained through the trained word segmentation model and the multilingual dictionary, the corresponding target language token is added at the first position of the primitive sequence to obtain the corresponding target sequence, so the actual language of each target sequence can be distinguished by its target language token. The sentence confusion corresponding to the sentence to be calculated is then obtained from the target sequence through the trained multilingual confusion degree calculation model. Because this model is trained on the preset multilingual corpus corresponding to the multiple languages, it is applicable to obtaining the confusion of sentences in different languages, thereby realizing confusion calculation for a sentence to be calculated in any language in a multilingual scenario and helping improve the accuracy of the obtained sentence confusion.
In this way, for the problem of evaluating sentence confusion in natural language processing, and in particular for the fact that existing n_gram confusion calculation models cannot give high-confidence results when language resources are scarce, this embodiment provides a neural network model (the trained multilingual confusion degree calculation model) obtained by combining the ideas of model pre-training and contrastive learning, which helps improve the accuracy of the final sentence confusion judgment.
Meanwhile, multilingual corpora are fused for training, finally yielding a single model suitable for multilingual confusion calculation; knowledge can be effectively transferred from high-resource languages to low-resource ones, improving the confusion calculation effect for all languages. Moreover, in a multilingual scenario only one model is needed, replacing the traditional multi-model scheme.
Furthermore, the negative example construction method provided in this embodiment can flexibly and effectively construct the negative examples used in contrastive learning; MLM pre-training and steps such as merging the features of the last few encoder layers improve the capacity for information representation and extraction; and the contrastive idea explicitly pulls apart the distance between positive and negative samples during fine-tuning. Together, these steps increase the robustness of the obtained multilingual confusion degree calculation model and improve the accuracy of the obtained sentence confusion.
Exemplary apparatus
As shown in fig. 4, corresponding to the above method for obtaining the confusion of sentences for multiple languages, the embodiment of the present invention further provides a system for obtaining the confusion of sentences for multiple languages, where the system for obtaining the confusion of sentences for multiple languages includes:
the sentence obtaining module 510 is configured to obtain a sentence to be calculated, where the language corresponding to the sentence to be calculated is at least one of preset multiple languages;
the primitive sequence obtaining module 520 is configured to obtain a primitive sequence corresponding to the sentence to be calculated according to the trained word segmentation model and the multilingual dictionary, where each element in the primitive sequence is the same as a value indicated by a position index in the multilingual dictionary;
the primitive sequence processing module 530 is configured to add a target language token to the first position of the primitive sequence to obtain a target sequence, where the target language token is a position index of a language identifier corresponding to the language of the sentence to be calculated in the multilingual dictionary;
the sentence confusion obtaining module 540 is configured to obtain, according to the target sequence, the sentence confusion corresponding to the sentence to be calculated through a trained multilingual confusion computation model, where the trained multilingual confusion computation model is obtained through training according to the preset multilingual corpus corresponding to multiple languages.
It should be noted that, the specific structure and implementation manner of the multilingual sentence confusion obtaining system and each module or unit thereof may refer to the corresponding description in the method embodiment, and are not repeated herein.
The manner of dividing the modules of the multilingual sentence confusion acquisition system is not limited to the above.
Based on the above embodiment, the present invention further provides an intelligent terminal, and a functional block diagram thereof may be shown in fig. 5. The intelligent terminal comprises a processor, a memory, a network interface and a display screen which are connected through a system bus. The processor of the intelligent terminal is used for providing computing and control capabilities. The memory of the intelligent terminal comprises a nonvolatile storage medium and an internal memory. The nonvolatile storage medium stores an operating system and a statement confusion acquiring program for multilingual languages. The internal memory provides an environment for the operation of the operating system and the statement confusion retrieval program for multiple languages in the nonvolatile storage medium. The network interface of the intelligent terminal is used for communicating with an external terminal through network connection. The multi-language sentence confusion obtaining program realizes any one of the multi-language sentence confusion obtaining methods when executed by a processor. The display screen of the intelligent terminal can be a liquid crystal display screen or an electronic ink display screen.
It will be appreciated by those skilled in the art that the schematic block diagram shown in fig. 5 is merely a block diagram of a portion of the structure associated with the present inventive arrangements and is not limiting of the smart terminal to which the present inventive arrangements are applied, and that a particular smart terminal may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, an intelligent terminal is provided, where the intelligent terminal includes a memory, a processor, and a multilingual statement confusion obtaining program stored in the memory and capable of running on the processor, where the multilingual statement confusion obtaining program implements any one of the steps of the multilingual statement confusion obtaining method provided by the embodiment of the present invention when executed by the processor.
The embodiment of the invention also provides a computer readable storage medium, wherein the computer readable storage medium stores a statement confusion degree acquisition program for multiple languages, and the statement confusion degree acquisition program for multiple languages realizes any one of the steps of the statement confusion degree acquisition method for multiple languages provided by the embodiment of the invention when being executed by a processor.
It should be understood that the sequence number of each step in the above embodiment does not mean the sequence of execution, and the execution sequence of each process should be determined by its function and internal logic, and should not be construed as limiting the implementation process of the embodiment of the present invention.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present invention. The specific working process of the units and modules in the above device may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
Each of the foregoing embodiments emphasizes a different aspect; for parts not described or illustrated in detail in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed system/terminal device and method may be implemented in other manners. For example, the system/terminal device embodiments described above are merely illustrative, e.g., the division of the modules or elements described above is merely a logical functional division, and may be implemented in other manners, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed.
The integrated modules/units described above, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the present invention may implement all or part of the flow of the method of the above embodiment, or may be implemented by a computer program to instruct related hardware, where the computer program may be stored in a computer readable storage medium, and when the computer program is executed by a processor, the steps of each method embodiment may be implemented. The computer program comprises computer program code, and the computer program code can be in a source code form, an object code form, an executable file or some intermediate form and the like. The computer readable medium may include: any entity or device capable of carrying the computer program code described above, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. The content of the computer readable storage medium can be appropriately increased or decreased according to the requirements of the legislation and the patent practice in the jurisdiction.
The above embodiments are only for illustrating the technical solutions of the present invention, not for limiting them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the embodiments of the invention and likewise fall within the scope of the invention.

Claims (11)

1. The statement confusion degree acquisition method for multiple languages is characterized by comprising the following steps of:
acquiring a sentence to be calculated, wherein the language corresponding to the sentence to be calculated is at least one of preset multiple languages;
obtaining a primitive sequence corresponding to the sentence to be calculated according to the trained word segmentation model and the multilingual dictionary, wherein each element in the primitive sequence is the same as a value indicated by a position index in the multilingual dictionary;
adding a target language token at the first position of the primitive sequence to obtain a target sequence, wherein the target language token is a position index of a language identifier corresponding to the language of the sentence to be calculated in the multilingual dictionary;
According to the target sequence, the statement confusion corresponding to the statement to be calculated is obtained through a trained multilingual confusion calculation model, wherein the trained multilingual confusion calculation model is obtained through training according to the preset multilingual corpus corresponding to the multilingual languages.
2. The method for obtaining the confusion of multi-language sentences according to claim 1, wherein the obtaining the sentences to be calculated comprises:
the method comprises the steps of obtaining a text to be processed, and preprocessing the text to be processed according to a preset preprocessing operation to obtain the preprocessed text, wherein the preprocessing operation comprises full-half-angle conversion, unified case and multi-blank character combination, and the text to be processed is composed of sentences corresponding to any one of the preset languages;
and carrying out single sentence segmentation on the preprocessing text according to sentence segmenters in the preprocessing text, and sequentially taking each sentence obtained after the single sentence segmentation as the sentence to be calculated.
3. The sentence confusion obtaining method for multiple languages according to claim 2, wherein the trained word segmentation model and the multilingual dictionary are obtained by training in advance according to the steps of:
Acquiring the multilingual corpus, wherein the multilingual corpus comprises normal semantic data sets corresponding to each of the preset multiple languages;
preprocessing each normal semantic data set according to the preprocessing operation to obtain a preprocessing training text corresponding to each normal semantic data set;
training the preprocessed training texts with a preset SentencePiece tool to obtain the trained word segmentation model and a dictionary to be processed;
and adding language identifiers corresponding to various languages in the preset multiple languages at the tail part of the dictionary to be processed to obtain the multilingual dictionary, wherein the multilingual dictionary comprises a plurality of affix and a plurality of language identifiers, and the positions of the affix and the language identifiers are respectively indicated by corresponding position indexes.
4. The sentence confusion obtaining method for multiple languages according to claim 3, wherein the position index is a subscript value used to represent a position.
5. The method for obtaining the confusion of multi-language sentences according to claim 3, wherein the obtaining the confusion of the sentences corresponding to the sentences to be calculated according to the target sequence through a trained multi-language confusion calculation model comprises the following steps:
Inputting the target sequence into the trained multilingual confusion degree calculation model, and obtaining a target scalar value output by the trained multilingual confusion degree calculation model;
and taking the numerical value obtained by subtracting the target scalar value from 1 as the statement confusion degree corresponding to the statement to be calculated.
6. The method of claim 5, wherein the trained multilingual statement confusion computation model comprises a multi-layered stacked encoder, a fully connected layer, and a sigmoid function.
7. The method for obtaining the confusion of multi-language sentences according to claim 6, wherein the multi-language confusion calculation model is trained in advance according to the following steps:
according to sentence segmenters in each preprocessing training text, performing single sentence segmentation on each preprocessing training text to obtain each training sentence corresponding to each preprocessing training text;
acquiring a training original primitive sequence corresponding to each training sentence according to the trained word segmentation model and the multilingual dictionary;
constructing, according to each training original primitive sequence and a preset negative sample construction operation, the training negative sample primitive sequences corresponding to each training original primitive sequence, wherein the negative sample construction operation comprises at least one of random shuffled recombination, random replacement, random deletion, random insertion, fragment position exchange and position reversal applied to elements in the training original primitive sequence;
Respectively adding corresponding training target language tokens at the first positions of the training original primitive sequences to obtain training target original sequences, and respectively adding corresponding training target language tokens at the first positions of the training negative sample primitive sequences to obtain training target negative sample sequences;
performing mask language pre-training on the encoder according to the training target original sequence to obtain a pre-trained multilingual confusion degree calculation model;
training the pre-trained multilingual confusion degree calculation model according to the training target original sequence and the training target negative sample sequence to fine tune model parameters of the pre-trained multilingual confusion degree calculation model and obtain a trained multilingual confusion degree calculation model.
8. The method for multilingual sentence confusion obtaining according to claim 7, wherein the training the pre-trained multilingual confusion calculation model according to the training target original sequence and the training target negative sample sequence to fine tune model parameters of the pre-trained multilingual confusion calculation model and obtain a trained multilingual confusion calculation model includes:
inputting each element of a training sequence in training data into the pre-trained multilingual confusion degree calculation model in turn, and, for each element in the training sequence, acquiring the training first bit vectors output by the target encoders in the pre-trained multilingual confusion degree calculation model and splicing them to obtain a training splicing vector, inputting the training splicing vector into a fully-connected layer in the pre-trained multilingual confusion degree calculation model to obtain a training fully-connected vector output by the fully-connected layer, mapping the training fully-connected vector according to the sigmoid function to obtain a training target scalar value, amplifying the training target scalar value based on a preset amplification factor to obtain a training amplification value, forming the training amplified label group corresponding to the training sequence from the training amplification values corresponding to all elements in the training sequence, and converting the training amplified label group according to a preset softmax function to obtain a training prediction label group, wherein the training data comprises a plurality of training text groups, each training text group comprises a training sequence and a corresponding training real label group, the training sequence consists of one training target original sequence and all training target negative sample sequences corresponding to the training target original sequence, the training real label group corresponding to the training sequence consists of a plurality of real labels, the real label corresponding to the training target original sequence is 1, the real label corresponding to each training target negative sample sequence is 0, the target encoders are the last preset number of encoder layers in the pre-trained multilingual confusion degree calculation model, and the training first bit vector is the first position of the vector output by each target encoder;
And fine-tuning the model parameters of the pre-trained multilingual confusion degree calculation model according to the training real label group and the training prediction label group, and continuing to execute the step of sequentially inputting each element of a training sequence in the training data into the pre-trained multilingual confusion degree calculation model until a preset training condition is met, so as to obtain the trained multilingual confusion degree calculation model.
9. The statement confusion degree acquisition system for multiple languages is characterized by comprising:
the sentence acquisition module is used for acquiring a sentence to be calculated, wherein the language corresponding to the sentence to be calculated is at least one of preset multiple languages;
the primitive sequence acquisition module is used for acquiring a primitive sequence corresponding to the sentence to be calculated according to the trained word segmentation model and the multilingual dictionary, wherein each element in the primitive sequence is the same as a value indicated by a position index in the multilingual dictionary;
the primitive sequence processing module is used for adding a target language token at the first position of the primitive sequence to obtain a target sequence, wherein the target language token is a position index of a language identifier corresponding to the language of the sentence to be calculated in the multilingual dictionary;
The sentence confusion degree acquisition module is used for acquiring the sentence confusion degree corresponding to the sentence to be calculated through a trained multilingual confusion degree calculation model according to the target sequence, wherein the trained multilingual confusion degree calculation model is obtained through training according to the preset multilingual corpus corresponding to multiple languages.
10. An intelligent terminal, characterized in that the intelligent terminal comprises a memory, a processor and a multi-language sentence confusion obtaining program stored on the memory and capable of running on the processor, wherein the multi-language sentence confusion obtaining program realizes the steps of the multi-language sentence confusion obtaining method according to any one of claims 1-8 when being executed by the processor.
11. A computer-readable storage medium, wherein a multilingual sentence confusion obtaining program is stored on the computer-readable storage medium, and the multilingual sentence confusion obtaining program, when executed by a processor, implements the steps of the multilingual sentence confusion obtaining method according to any one of claims 1 to 8.
CN202211131283.4A 2022-09-16 2022-09-16 Statement confusion degree acquisition method and system for multiple languages and related equipment Active CN115587589B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211131283.4A CN115587589B (en) 2022-09-16 2022-09-16 Statement confusion degree acquisition method and system for multiple languages and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211131283.4A CN115587589B (en) 2022-09-16 2022-09-16 Statement confusion degree acquisition method and system for multiple languages and related equipment

Publications (2)

Publication Number Publication Date
CN115587589A CN115587589A (en) 2023-01-10
CN115587589B true CN115587589B (en) 2023-07-18

Family

ID=84778860

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211131283.4A Active CN115587589B (en) 2022-09-16 2022-09-16 Statement confusion degree acquisition method and system for multiple languages and related equipment

Country Status (1)

Country Link
CN (1) CN115587589B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104951469A (en) * 2014-03-28 2015-09-30 株式会社东芝 Method and device for optimizing corpus
WO2020113918A1 (en) * 2018-12-06 2020-06-11 平安科技(深圳)有限公司 Statement rationality determination method and apparatus based on semantic parsing, and computer device
WO2021169288A1 (en) * 2020-02-26 2021-09-02 平安科技(深圳)有限公司 Semantic understanding model training method and apparatus, computer device, and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11288452B2 (en) * 2019-07-26 2022-03-29 Beijing Didi Infinity Technology And Development Co., Ltd. Dual monolingual cross-entropy-delta filtering of noisy parallel data and use thereof

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104951469A (en) * 2014-03-28 2015-09-30 株式会社东芝 Method and device for optimizing corpus
WO2020113918A1 (en) * 2018-12-06 2020-06-11 平安科技(深圳)有限公司 Statement rationality determination method and apparatus based on semantic parsing, and computer device
WO2021169288A1 (en) * 2020-02-26 2021-09-02 平安科技(深圳)有限公司 Semantic understanding model training method and apparatus, computer device, and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Language Model Evaluation Beyond Perplexity; Clara Meister et al.; arXiv:2106.00085v1; full text *
Research on automatic detection of multilingual text based on latent Dirichlet allocation; Zhang Wei; Li Wen; Chen Dan; Li Zengjie; Periodical of Ocean University of China (Natural Science Edition) (Issue 12); full text *
Research on Chinese continuous speech recognition systems and knowledge-guided search strategies; Song Zhanjiang, Zheng Fang, Xu Mingxing, Wu Jian, Wu Wenhu; Acta Automatica Sinica (Issue 04); full text *

Also Published As

Publication number Publication date
CN115587589A (en) 2023-01-10

Similar Documents

Publication Publication Date Title
CN110489555B (en) Language model pre-training method combined with similar word information
CN108416058B (en) Bi-LSTM input information enhancement-based relation extraction method
CN111708882B (en) Transformer-based Chinese text information missing completion method
CN110866401A (en) Chinese electronic medical record named entity identification method and system based on attention mechanism
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN110598203A (en) Military imagination document entity information extraction method and device combined with dictionary
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN111091004B (en) Training method and training device for sentence entity annotation model and electronic equipment
CN112016300B (en) Pre-training model processing method, pre-training model processing device, downstream task processing device and storage medium
CN115545041B (en) Model construction method and system for enhancing semantic vector representation of medical statement
CN110991185A (en) Method and device for extracting attributes of entities in article
CN113705196A (en) Chinese open information extraction method and device based on graph neural network
CN111222329B (en) Sentence vector training method, sentence vector model, sentence vector prediction method and sentence vector prediction system
CN114692568A (en) Sequence labeling method based on deep learning and application
CN116663578A (en) Neural machine translation method based on strategy gradient method improvement
CN116522165B (en) Public opinion text matching system and method based on twin structure
CN115587589B (en) Statement confusion degree acquisition method and system for multiple languages and related equipment
CN116484851A (en) Pre-training model training method and device based on variant character detection
CN112131879A (en) Relationship extraction system, method and device
CN115510230A (en) Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism
CN112434133B (en) Intention classification method and device, intelligent terminal and storage medium
Xu Research on neural network machine translation model based on entity tagging improvement
CN113836892A (en) Sample size data extraction method and device, electronic equipment and storage medium
CN114462418A (en) Event detection method, system, intelligent terminal and computer readable storage medium
CN115617959A (en) Question answering method and device

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant