CN114756659A - Language model training method, device, equipment and storage medium - Google Patents

Language model training method, device, equipment and storage medium

Info

Publication number
CN114756659A
Authority
CN
China
Prior art keywords
training
text data
language model
vocabulary
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210505156.XA
Other languages
Chinese (zh)
Inventor
司世景
王健宗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202210505156.XA
Publication of CN114756659A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to the technical field of artificial intelligence and discloses a language model training method, apparatus, device, and storage medium. Data enhancement processing is performed on acquired training text data to determine the near-meaning text data and antisense text data corresponding to that training text data, which improves the accuracy and stability of the trained language model. Features are then extracted from the training text data, the near-meaning text data, and the antisense text data to determine original text features, near-meaning text features, and antisense text features, and language model training is performed according to these features, so that a target language model with higher accuracy is obtained and the accuracy and stability of subsequent downstream language processing are ensured.

Description

Language model training method, device, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for language model training, a computer device, and a storage medium.
Background
Contrastive learning has proven highly effective in unsupervised learning. Most current improvements to contrastive learning address aspects such as replacing the loss function or the data enhancement method, while the construction of negative examples has received relatively little attention. Generally, when positive and negative sample pairs are constructed from word embedding vectors, most models simply take a word embedding vector and its augmented copy as a positive pair and treat all remaining samples as negative pairs.
Unlike computer vision, text data are discrete, and even a few word substitutions can cause significant semantic changes. Some recent studies have shown that adversarial training is of little use, or even harmful, to a model's ability to detect such semantic changes. Nevertheless, most current contrastive learning models concentrate on constructing positive example pairs during data enhancement; in the SimCLR model, for example, all data other than the augmented copy of a sample are taken as negative samples. Constructing negative pairs in this way may separate samples that are already far apart, while negative pairs that lie close together remain hard to distinguish. As a result, the text features produced by existing text data enhancement are not distinctive enough, which reduces the accuracy of a target language model trained on such enhanced text features.
Disclosure of Invention
The application provides a language model training method, a language model training apparatus, a computer device, and a storage medium, which address the problem that the text features produced by existing text data enhancement are not distinctive enough.
The embodiment of the application provides a language model training method, which comprises the following steps:
acquiring training text data;
performing data enhancement processing on the training text data to acquire near-meaning text data and antisense text data corresponding to the training text data;
performing feature extraction on the training text data, the near-meaning text data and the antisense text data to obtain an original text feature, a near-meaning text feature and an antisense text feature;
and carrying out language model training according to the original text features, the near-meaning text features and the antisense text features to obtain a target language model.
The embodiment of the present application further provides a language model training device, including:
the training text data acquisition module is used for acquiring training text data;
the enhancement processing module is used for carrying out data enhancement processing on the training text data to obtain the near-meaning text data and the antisense text data corresponding to the training text data;
the feature extraction module is used for extracting features of the training text data, the near-meaning text data and the antisense text data to obtain original text features, near-meaning text features and antisense text features;
and the target language model acquisition module is used for carrying out language model training according to the original text features, the near-meaning text features and the antisense text features to acquire a target language model.
The embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of implementing the language model training method when executing the computer program.
An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps of implementing the language model training method.
According to the language model training method, the language model training device, the computer equipment and the storage medium, data enhancement processing is carried out on the training text data, and the near meaning text data and the anti-meaning text data corresponding to the corresponding training text data are determined, so that the accuracy and the stability of the trained language model are improved; and performing feature extraction on the training text data, the near-meaning text data and the antisense text data, determining original text features, near-meaning text features and antisense text features, and performing language model training according to the original text features, the near-meaning text features and the antisense text features, so as to obtain a target language model with higher accuracy and ensure the accuracy and stability of subsequent downstream language processing.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments of the present application will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a schematic diagram of an application environment of a language model training method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a language model training method according to an embodiment of the present invention;
FIG. 3 is another flow chart of a method for training a language model according to an embodiment of the present invention;
FIG. 4 is another flow diagram of a method for training a language model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the language model training device in accordance with an embodiment of the present invention;
FIG. 6 is a schematic diagram of a computing device according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art based on the embodiments given herein without inventive effort fall within the scope of the present application.
The language model training method provided by the embodiment of the application can be applied to the application environment shown in fig. 1. As shown in fig. 1, a client communicates with a server over a network. The client, also called a user terminal, refers to a program that corresponds to the server and provides local services to the user; clients include, but are not limited to, various personal computers, notebook computers, smart phones, tablet computers, cameras, and portable wearable devices. The server may be an independent server, or may be a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, web services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms.
The language model training method provided by the embodiment of the invention can be applied to the application environment shown in fig. 1. Specifically, the language model training method is applied to a language model training system. The language model training system comprises the client and the server shown in fig. 1; the client and the server communicate over a network and are used to implement the training of the language model, so that the language model is trained after data enhancement processing is performed on the training text data, which helps to improve the applicability of language model training.
The artificial intelligence base technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
In an embodiment, as shown in fig. 2, a language model training method is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:
S201: acquiring training text data;
S202: performing data enhancement processing on the training text data to acquire near-meaning text data and antisense text data corresponding to the training text data;
S203: extracting features of the training text data, the near-meaning text data and the antisense text data to obtain original text features, near-meaning text features and antisense text features;
S204: and performing language model training according to the original text features, the near-meaning text features and the antisense text features to obtain a target language model.
As an example, in step S201, the server obtains the training text data corresponding to the target language model to be trained. In this example, since a language model can only ensure the accuracy of natural language processing in an application scenario after being trained for it, the server needs to obtain the training text data of the corresponding application scenario according to the actual requirements of the business.
As an example, in step S202, the server performs data enhancement processing on the acquired training text data, so as to increase the amount of training text data for training the target language model, ensure the accuracy of the subsequent training of the target language model, and avoid overfitting and other problems that arise in model training when the amount of data is small. In this example, the server obtains near-meaning text data whose semantics are similar to the training text data and antisense text data whose semantics are opposite to the training text data, and combines the training text data with these two kinds of enhanced text data to train the target language model.
In this example, the SimCSE model may be used to process the training text data during the data enhancement performed by the server; that is, the SimCSE model processes the training text data through its Dropout layer to obtain near-meaning text data with semantics similar to those of the training text data, and further obtains antisense text data with semantics opposite to those of the training text data. SimCSE (Simple Contrastive Learning of Sentence Embeddings) is a model for contrastive learning without supervised data. SimCSE constructs similar samples by randomly sampling dropout masks. Specifically, dropout mask operations are applied to the fully connected layers and the attention probabilities; during model training, one sample is duplicated into two copies, and because BERT randomly generates a different dropout mask each time dropout is applied, feeding the sample to the model twice yields the results of two different dropout masks without changing the original BERT model. A pair of similar samples is thus obtained, and putting this pair through the same encoder yields two different representation vectors, namely the near-meaning text data with semantics similar to the training text data and the antisense text data with semantics opposite to the training text data.
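By way of illustration only, a minimal sketch of the dropout-based duplication described above is given below; the Hugging Face transformers library and the "bert-base-chinese" checkpoint are illustrative assumptions not named in this application. Feeding the same sentence through the encoder twice in training mode yields two representations produced under different dropout masks:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # illustrative checkpoint
encoder = AutoModel.from_pretrained("bert-base-chinese")
encoder.train()  # keep dropout active so each forward pass samples a new dropout mask

def encode_twice(sentence):
    # feeding the same sentence through the encoder twice yields two [CLS] vectors
    # produced under two different randomly sampled dropout masks
    inputs = tokenizer(sentence, return_tensors="pt")
    h1 = encoder(**inputs).last_hidden_state[:, 0]
    h2 = encoder(**inputs).last_hidden_state[:, 0]
    return h1, h2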
As an example, in step S203, the server performs feature extraction on the training text data, the near-meaning text data, and the anti-meaning text data, respectively, to obtain corresponding original text features, near-meaning text features, and anti-sense text features. In this example, the server may perform feature extraction through the BERT model, and use the extracted text features for subsequent target language model training.
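Similarly, a sketch of the feature extraction of step S203 is given below, again assuming a Hugging Face BERT encoder; using the [CLS] hidden state as the text feature is an assumption of this sketch, since no pooling strategy is specified here, and the example texts are illustrative.

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # illustrative checkpoint
encoder = AutoModel.from_pretrained("bert-base-chinese")

def text_feature(text):
    # encode one text and take the [CLS] hidden state as its feature vector
    inputs = tokenizer(text, return_tensors="pt")
    return encoder(**inputs).last_hidden_state[:, 0]

# illustrative texts standing in for the training, near-meaning and antisense text data
x_ori, x_syn, x_ant = "服务非常好", "服务特别好", "服务非常差"
h_ori, h_syn, h_ant = text_feature(x_ori), text_feature(x_syn), text_feature(x_ant)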
As an example, in step S204, the server performs language model training according to the original text features, the near-meaning text features and the antisense text features: it captures the distance between the original text feature and the near-meaning text feature and the distance between the original text feature and the antisense text feature, and performs similarity calculation based on these two distances, thereby training the target language model. In this example, the original text features, the near-meaning text features with similar semantics and the antisense text features with opposite semantics are all used for model training, so that the target language model is more sensitive to semantic changes and can effectively perceive semantic changes caused by small perturbations, ensuring the accuracy and stability of language processing.
In this example, performing data enhancement processing on the training text data to obtain the near-meaning text data and the antisense text data increases the amount of data available for training the model; features are then extracted from the training text data, the near-meaning text data and the antisense text data, and language model training is performed according to the original text features, near-meaning text features and antisense text features obtained by feature extraction, so that a target language model with higher accuracy is obtained.
In one embodiment, as shown in fig. 3, step S202, performing data enhancement processing on the training text data to acquire the near-meaning text data and the antisense text data corresponding to the training text data, includes:
S301: performing word segmentation processing on the training text data to obtain at least two original words;
S302: performing part-of-speech matching processing on each original vocabulary and a preset dictionary to obtain replaceable vocabularies corresponding to the original vocabularies;
S303: and performing vocabulary replacement processing on the replaceable vocabulary to obtain near-meaning text data and antisense text data corresponding to the training text data.
As an example, in step S301, the server may use a word segmentation tool to segment the acquired training text data and obtain at least two original vocabularies corresponding to the training text data, so that the original vocabularies can be processed in the subsequent steps to achieve data enhancement of the training text data. The original vocabulary is the vocabulary obtained after word segmentation of the training text data.
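For illustration, a minimal segmentation sketch is given below, assuming the jieba tokenizer; the application itself does not name a particular word segmentation tool, and the example sentence is illustrative.

import jieba

training_text = "这家餐厅的服务非常好"  # illustrative training text
original_words = jieba.lcut(training_text)  # e.g. ['这家', '餐厅', '的', '服务', '非常', '好']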
As an example, in step S302, the server performs part-of-speech matching between each original vocabulary and a preset dictionary to obtain the replaceable vocabulary corresponding to the original vocabulary; when no corresponding replaceable vocabulary can be obtained for a certain original vocabulary, this indicates that the original vocabulary cannot be replaced by a vocabulary with similar or opposite semantics. The replaceable vocabulary includes antisense vocabulary and near-meaning (synonym) vocabulary. In this example, the preset dictionary is a dictionary preset by the system and may be, but is not limited to, the WordNet dictionary, a broad-coverage lexical semantic network in which nouns, verbs, adjectives and adverbs are each organized into networks of synonym sets; each synonym set represents a basic semantic concept, and the sets are connected by various relations.
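A minimal sketch of looking up candidate replaceable vocabulary in WordNet through the NLTK interface follows; the English example word and the NLTK interface are illustrative assumptions, since the preset dictionary may be, but is not limited to, WordNet.

from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def candidate_replacements(word, pos=wn.ADJ):
    # collect near-meaning (synonym) and antisense (antonym) candidates for one word
    synonyms, antonyms = set(), set()
    for synset in wn.synsets(word, pos=pos):
        for lemma in synset.lemmas():
            synonyms.add(lemma.name())
            for antonym in lemma.antonyms():
                antonyms.add(antonym.name())
    synonyms.discard(word)
    return synonyms, antonyms

# candidate_replacements("good") -> synonyms such as 'well', antonyms such as 'bad'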
As an example, in step S303, the server performs vocabulary replacement on the replaceable vocabulary corresponding to the original vocabulary, and obtains the near-meaning text data and antisense text data corresponding to the training text data. For example, the server may replace the replaceable vocabulary with the corresponding near-meaning vocabulary while retaining the original vocabulary that cannot be replaced, which yields the near-meaning text data; likewise, it may replace the replaceable vocabulary with the corresponding antisense vocabulary while retaining the original vocabulary that cannot be replaced, which yields the antisense text data.
In this example, the original vocabulary after the training text data is segmented is matched with the replaceable vocabulary through the preset dictionary, so that the replaceable vocabulary in the original vocabulary is determined, and the corresponding near-meaning text data and antisense text data after data enhancement are obtained according to the matched replaceable vocabulary, so that the near-meaning text data and the antisense text data are used for subsequent training of the target language model, and the accuracy of the target language model is improved.
In an embodiment, in step S302, performing part-of-speech matching processing on each original vocabulary and a preset dictionary to obtain a replaceable vocabulary corresponding to the original vocabulary, including:
S3021: performing part-of-speech matching processing on each original vocabulary and a preset dictionary, and determining the vocabulary attribute of each original vocabulary;
S3022: and if the vocabulary attribute is the replaceable attribute, acquiring a replaceable vocabulary corresponding to the original vocabulary.
As an example, in step S3021, the server performs part-of-speech matching processing on each original vocabulary according to the preset dictionary, and determines the vocabulary attribute of each original vocabulary, so as to decide whether the original vocabulary needs to be replaced. Common vocabulary attributes include verbs, nouns, adjectives, adverbs and pronouns, and the parts of speech that can have a replaceable vocabulary include, but are not limited to, verbs, nouns, adjectives, adverbs and pronouns.
As an example, in step S3022, the server performs replacement processing on the original vocabulary having the replaceable attribute, according to the acquired vocabulary attribute, so as to ensure that the training text data obtained after the original vocabulary is replaced retains a certain similarity to the original.
In this example, part-of-speech matching is performed on the original vocabulary using the comprehensive vocabulary of an existing preset dictionary, which reduces the development cost and improves the accuracy of the matching.
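For illustration, a small sketch of the replaceability check is given below, using NLTK's part-of-speech tagger as a stand-in for the part-of-speech matching against the preset dictionary; the tag set and the tagger are assumptions of this sketch rather than part of this application.

import nltk  # requires nltk.download("averaged_perceptron_tagger")

REPLACEABLE_TAGS = ("NN", "VB", "JJ", "RB", "PRP")  # nouns, verbs, adjectives, adverbs, pronouns

def is_replaceable(word):
    # an original word is treated as replaceable when its part of speech is in the list above
    tag = nltk.pos_tag([word])[0][1]
    return tag.startswith(REPLACEABLE_TAGS)

# is_replaceable("good") -> True (adjective); is_replaceable("the") -> False (determiner)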
In an embodiment, in step S303, performing vocabulary replacement processing on the replaceable vocabulary to obtain near-meaning text data and anti-sense text data corresponding to the training text data, including:
S3031: similarity calculation is carried out on the replaceable vocabulary and the original vocabulary, and vocabulary similarity corresponding to the replaceable vocabulary is obtained;
S3032: screening according to the vocabulary similarity to obtain a near-meaning vocabulary and an antisense vocabulary corresponding to the original vocabulary;
S3033: replacing the original vocabulary by adopting the near-meaning vocabulary to obtain near-meaning text data corresponding to the training text data;
S3034: and replacing the original vocabulary by adopting the antisense vocabulary to obtain the antisense text data corresponding to the training text data.
As an example, in step S3031, the server may use a similarity calculation method to compute the similarity between the replaceable vocabulary and the original vocabulary, and obtains the vocabulary similarity between the two. The similarity measures used include, but are not limited to, cosine similarity, Euclidean distance and Mahalanobis distance.
As an example, in step S3032, the server screens by the vocabulary similarity, taking the replaceable vocabulary whose similarity is greater than the similarity threshold as the antisense vocabulary corresponding to the original vocabulary, and the replaceable vocabulary whose similarity is not greater than the similarity threshold as the near-meaning vocabulary corresponding to the original vocabulary. The similarity threshold in this example is obtained by sampling a number of vocabularies, calculating their similarities, and deriving the threshold statistically for comparison.
As an example, in step S3033, the server replaces the original vocabulary with the near-meaning vocabulary while retaining the original vocabulary that cannot be replaced, and obtains the near-meaning text data corresponding to the training text data. The original vocabulary may include certain specific nouns and similar terms that cannot be matched to a corresponding near-meaning vocabulary; generating replacements for such vocabulary at random would increase the uncertainty of the final result.
As an example, in step S3034, the server replaces the original vocabulary with the antisense vocabulary while retaining the original vocabulary that cannot be replaced, and obtains the antisense text data corresponding to the training text data. The original vocabulary may include certain specific nouns and similar terms that cannot be matched to a corresponding antisense vocabulary; generating replacements for such vocabulary at random would increase the uncertainty of the final result.
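A minimal sketch of steps S3031 to S3034 follows; the word-embedding lookup, the use of cosine similarity, and the threshold value are illustrative assumptions, and the assignment of candidates above the threshold to the antisense vocabulary simply follows the description above.

import numpy as np

def cosine(a, b):
    # cosine similarity between two word vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def screen_and_replace(original_words, candidates, vec, threshold=0.8):
    # candidates: maps an original word to its replaceable words (e.g. from the preset dictionary)
    # vec: maps a word to its embedding vector (an illustrative assumption)
    near_map, anti_map = {}, {}
    for word, cands in candidates.items():
        for cand in cands:
            similarity = cosine(vec[word], vec[cand])
            # per the description, candidates above the threshold go to the antisense vocabulary
            (anti_map if similarity > threshold else near_map).setdefault(word, cand)
    near_text = [near_map.get(w, w) for w in original_words]  # irreplaceable words are kept
    anti_text = [anti_map.get(w, w) for w in original_words]
    return near_text, anti_text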
In this example, the replaceable vocabulary in the training text data is replaced by means of the preset dictionary, so that the training text data still retains a certain correlation with the data-enhanced near-meaning text data and antisense text data, which facilitates training the target language model with the enhanced data.
In one embodiment, as shown in fig. 4, step S204: performing language model training according to the original text features, the near-meaning text features and the antisense text features to obtain a target language model, wherein the method comprises the following steps:
S401: acquiring a first mapping characteristic according to the original text characteristic and the near-meaning text characteristic;
S402: acquiring a second mapping characteristic according to the original text characteristic and the antisense text characteristic;
S403: and training the universal language model by adopting the first mapping characteristic and the second mapping characteristic to obtain a target language model.
As an example, in step S401, the server maps the original text feature and the near-meaning text feature by using a first mapping function, and obtains a corresponding first mapping feature, so that the first mapping feature reflects a mapping relationship between the original text feature and the near-meaning text feature. The first mapping function is a preset function for mapping the original text features and the near-meaning text features.
In this example, the first mapping function may be given by a formula that appears as an image in the original publication, where f(x_ori, x_syn) is the first mapping function, x_ori is the training text data, x_syn is the near-meaning text data, exp is an exponential operation, h_ori is the original text feature, and h_syn is the near-meaning text feature.
As an example, in step S402, the server uses a second mapping function to map the original text feature and the antisense text feature, and obtains the corresponding second mapping feature, so that the second mapping feature reflects the mapping relationship between the original text feature and the antisense text feature. The second mapping function is a preset function for mapping the original text feature and the antisense text feature.
In this example, the second mapping function may likewise be given by a formula that appears as an image in the original publication, where f(x_ori, x_ant) is the second mapping function, x_ori is the training text data, x_ant is the antisense text data, exp is an exponential operation, h_ori is the original text feature, and h_ant is the antisense text feature.
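Because the two formulas appear only as images in the original publication, the following SimCSE-style forms are offered purely as an assumption; the similarity function sim(·,·) and the temperature τ are not named in this application.

f(x_{ori}, x_{syn}) = \exp\left( \mathrm{sim}(h_{ori}, h_{syn}) / \tau \right)

f(x_{ori}, x_{ant}) = \exp\left( \mathrm{sim}(h_{ori}, h_{ant}) / \tau \right)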
As an example, in step S403, the server trains the general language model by using the first mapping feature and the second mapping feature, updating the model parameters of the general language model with these two mapping features to obtain the target language model. The target language model is a model obtained by training the general language model based on the first mapping feature and the second mapping feature; it can reflect the distance measure between the original text feature and the near-meaning text feature as well as the distance measure between the original text feature and the antisense text feature, so the target language model is more sensitive to semantic changes and can effectively perceive semantic changes caused by small perturbations, which ensures the stability of the target language model and the accuracy of language processing and improves the compatibility of the target language model in different application scenarios.
In an embodiment, in step S403, training a generic language model by using the first mapping feature and the second mapping feature to obtain a target language model, including:
S4031A: acquiring a mapping feature ratio according to the first mapping feature and the second mapping feature;
S4032A: if the mapping characteristic ratio is larger than a preset ratio threshold, updating model parameters of the universal language model;
S4033A: and if the mapping characteristic ratio is not greater than the preset ratio threshold, not updating the model parameters of the universal language model.
As an example, in step S4031A, the server calculates the ratio between the first mapping feature and the second mapping feature to obtain the corresponding mapping feature ratio. Computing this ratio ensures that a certain distance can be maintained between the near-meaning text feature and the antisense text feature, so that the target language model is more sensitive to semantic changes and the semantic perception capability of the model is enhanced.
As an example, in step S4032A, if the mapping feature ratio is greater than the preset ratio threshold, it is stated that the distance between the near-meaning text feature and the anti-sense text feature reaches the standard required by the model, and the model parameters of the generic language model may be updated by retaining the first mapping feature and the second mapping feature.
In this example, before the similarity between the first mapping feature and the second mapping feature is calculated by the loss function, the mapping feature ratio between the first mapping feature and the second mapping feature is calculated, and the preset ratio threshold is obtained by statistics. For example, when the preset ratio threshold is 0.8, if the ratio of the mapping features between the first mapping feature and the second mapping feature is greater than the preset ratio threshold, the distance between the original text feature and the near-meaning text feature is not close enough, and the distance between the original text feature and the antisense text feature is not far enough, so that the model parameters of the universal language model need to be updated to ensure the accuracy of the updated target language model.
As an example, in step S4033A, if the mapping feature ratio is not greater than the preset ratio threshold, the distance between the near-meaning text feature and the antisense text feature does not meet the standard required by the model; the first mapping feature and the second mapping feature are discarded and the model parameters of the generic language model are not updated.
In this example, whether the enhanced data meets the standard is determined from the mapping feature ratio between the first mapping feature and the second mapping feature, which correspond to the near-meaning text feature and the antisense text feature respectively; the more accurate enhanced data are thus retained for updating the model parameters of the general language model, which improves the accuracy of the model.
In another embodiment, in step S403, training the generic language model by using the first mapping feature and the second mapping feature to obtain the target language model, including:
S4031B: acquiring an original loss function according to the first mapping characteristic and the second mapping characteristic;
S4032B: acquiring a target truncation function according to the first mapping characteristic and the second mapping characteristic;
S4033B: determining a target loss function according to the original loss function and the target truncation function;
S4034B: and (5) performing general language model training by adopting a target loss function to obtain a target language model.
As an example, in step S4031B, the server computes the original loss function from the first mapping feature and the second mapping feature, where the original loss function is a commonly used loss function. The loss function is used in model training to determine whether the training has reached a certain convergence state, thereby ensuring the quality of the trained model.
In this example, the original loss function may be given by a formula that appears as an image in the original publication, where f(x_ori, x_syn) is the first mapping function and f(x_ori, x_ant) is the second mapping function.
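Since this formula also appears only as an image, a common contrastive (InfoNCE-style) form consistent with the surrounding description is offered here purely as an assumption:

\mathcal{L}_{ori} = -\log \frac{f(x_{ori}, x_{syn})}{f(x_{ori}, x_{syn}) + f(x_{ori}, x_{ant})}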
As an example, in step S4032B, the server computes the target truncation function from the first mapping feature and the second mapping feature. The target truncation function evaluates the mapping feature ratio between the first mapping feature and the second mapping feature and determines whether the near-meaning text feature and the antisense text feature produced by the current data enhancement meet the standard required for model training.
In this example, the target truncation function may be given by a formula that appears as an image in the original publication, where m is the preset ratio threshold, x denotes the first mapping function, and y denotes the second mapping function.
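Consistent with steps S4031A to S4033A, the truncation function plausibly acts as a gate on the ratio of the two mapping features; the indicator form below, including the orientation of the ratio, is an assumption rather than the filed formula:

g\left( f(x_{ori}, x_{ant}),\; f(x_{ori}, x_{syn}) \right) = \mathbf{1}\left[ \frac{f(x_{ori}, x_{syn})}{f(x_{ori}, x_{ant})} > m \right]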
As an example, in step S4033B, the server determines, from the original loss function and the target truncation function, a target loss function that can be used throughout the present application scenario; this single function performs both the threshold judgment and the similarity calculation, which simplifies the calculation flow.
As an example, in step S4034B, the server trains the general language model with the target loss function; by ensuring that the distance between the near-meaning text feature and the antisense text feature produced by the data enhancement lies within a reasonable range, the stability of the model being trained is guaranteed, so that a more accurate target language model is obtained.
In this example, the target loss function may be given by a formula that appears as an image in the original publication, expressing the target loss in terms of the original loss function and the target truncation function g[f(x_ori, x_ant), f(x_ori, x_syn)].
In this example, only the target loss function needs to be set to ensure that the distance between the near-meaning text feature and the antisense text feature produced by the data enhancement lies within a reasonable range, which guarantees the stability of the model being trained and yields a more accurate target language model. Moreover, since the target language model is obtained by training the general language model based on the first mapping feature and the second mapping feature, it can reflect both the distance measure between the original text feature and the near-meaning text feature and the distance measure between the original text feature and the antisense text feature; the target language model is therefore more sensitive to semantic changes and can effectively perceive semantic changes caused by small perturbations, which ensures its stability and the accuracy of language processing and improves its compatibility across different application scenarios.
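Purely as an illustrative sketch of how the pieces of this section could fit together, the code below combines exponentiated cosine similarities as the two mapping features, a ratio gate with threshold m, and a contrastive loss; all of these concrete choices are assumptions of the sketch rather than the filed formulas.

import torch
import torch.nn.functional as F

def mapping_feature(h_a, h_b, tau=0.05):
    # assumed SimCSE-style mapping feature: exponentiated cosine similarity
    return torch.exp(F.cosine_similarity(h_a, h_b, dim=-1) / tau)

def target_loss(h_ori, h_syn, h_ant, tau=0.05, m=0.8):
    f_syn = mapping_feature(h_ori, h_syn, tau)  # first mapping feature
    f_ant = mapping_feature(h_ori, h_ant, tau)  # second mapping feature
    loss = -torch.log(f_syn / (f_syn + f_ant))  # assumed contrastive (InfoNCE-style) loss
    gate = (f_syn / f_ant > m).float()          # truncation gate: keep only pairs whose ratio clears m
    return (gate * loss).mean()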
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by functions and internal logic of the process, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In an embodiment, a language model training apparatus is provided, and the language model training apparatus corresponds one to one to the language model training method in the foregoing embodiments. As shown in fig. 5, the language model training apparatus includes a training text data acquisition module 801, a data enhancement processing module 802, a data feature extraction module 803, and a target language model acquisition module 804. The functional modules are described in detail as follows:
a training text data acquisition module 801 for acquiring training text data;
the data enhancement processing module 802 performs data enhancement processing on the training text data to obtain near-meaning text data and antisense text data corresponding to the training text data;
the data feature extraction module 803 is used for extracting features of the training text data, the near-meaning text data and the antisense text data to obtain original text features, near-meaning text features and antisense text features;
and the target language model acquisition module 804 performs language model training according to the original text features, the near-meaning text features and the anti-sense text features to acquire a target language model.
In one embodiment, the data enhancement processing module 802 includes:
the original vocabulary acquisition unit is used for performing word segmentation on the training text data to acquire at least two original vocabularies;
the replaceable vocabulary acquisition unit is used for performing part-of-speech matching processing on each original vocabulary and a preset dictionary to acquire replaceable vocabularies corresponding to the original vocabularies;
and the vocabulary replacement processing unit is used for performing vocabulary replacement processing on the replaceable vocabulary and acquiring the near-meaning text data and the antisense text data corresponding to the training text data.
In one embodiment, the replaceable vocabulary acquisition unit includes:
the vocabulary attribute acquiring subunit is used for performing part-of-speech matching processing on each original vocabulary and a preset dictionary and determining the vocabulary attribute of each original vocabulary;
and the replaceable vocabulary acquiring subunit acquires a replaceable vocabulary corresponding to the original vocabulary if the vocabulary attribute is the replaceable attribute.
In one embodiment, a vocabulary replacement processing unit includes:
the vocabulary similarity obtaining subunit is used for calculating the similarity between the replaceable vocabulary and the original vocabulary to obtain the vocabulary similarity corresponding to the replaceable vocabulary;
the screening processing subunit performs screening processing according to the vocabulary similarity to obtain near-meaning vocabularies and antisense vocabularies corresponding to the original vocabularies;
The near-meaning text data acquisition subunit adopts near-meaning vocabularies to replace the original vocabularies to acquire near-meaning text data corresponding to the training text data;
and the antisense text data acquisition subunit adopts antisense words to replace the original words to acquire the antisense text data corresponding to the training text data.
In one embodiment, the target language model obtaining module 804 includes:
the first mapping characteristic acquisition unit acquires a first mapping characteristic according to the original text characteristic and the near-meaning text characteristic;
the second mapping characteristic acquisition unit acquires a second mapping characteristic according to the original text characteristic and the antisense text characteristic;
and the target language model acquisition unit is used for training the universal language model by adopting the first mapping characteristic and the second mapping characteristic to acquire the target language model.
In one embodiment, the target language model obtaining unit includes:
the mapping feature ratio obtaining subunit obtains a mapping feature ratio according to the first mapping feature and the second mapping feature;
the model parameter updating detection subunit is used for updating the model parameters of the universal language model if the mapping characteristic ratio is greater than a preset ratio threshold, and for not updating the model parameters of the universal language model if the mapping characteristic ratio is not greater than the preset ratio threshold.
In another embodiment, the target language model obtaining unit includes:
the original loss function obtaining subunit obtains an original loss function according to the first mapping characteristic and the second mapping characteristic;
the target truncation function acquisition subunit acquires a target truncation function according to the first mapping characteristic and the second mapping characteristic;
the target loss function acquisition subunit determines a target loss function according to the original loss function and the target truncation function;
and the target language model acquisition subunit performs general language model training by adopting a target loss function to acquire a target language model.
For the specific definition of the language model training device, reference may be made to the above definition of the language model training method, which is not described herein again. The modules in the language model training device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, or can be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 6. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing data used or generated during execution of the language model training method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by the processor to implement a language model training method.
In an embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor. When the processor executes the computer program, the language model training method in the foregoing embodiments is implemented, for example, S201 to S204 shown in fig. 2, or the steps shown in fig. 3 to fig. 4, which are not described again to avoid repetition. Alternatively, when the processor executes the computer program, the functions of the modules/units in the embodiment of the language model training apparatus are implemented, such as the functions of the training text data obtaining module 801, the data enhancement processing module 802, the data feature extraction module 803, and the target language model obtaining module 804 shown in fig. 5, which are not described herein again to avoid repetition.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the language model training method in the foregoing embodiments, for example, S201 to S204 shown in fig. 2, or the steps shown in fig. 3 to fig. 4, which are not described again to avoid repetition. Alternatively, when executed by a processor, the computer program implements the functions of the modules/units in the embodiment of the language model training apparatus, for example, the functions of the training text data obtaining module 801, the data enhancement processing module 802, the data feature extraction module 803, and the target language model obtaining module 804 shown in fig. 5, which are not repeated here to avoid repetition.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by the relevant hardware instructed by a computer program stored in a non-volatile computer readable storage medium, and the computer program can include the processes of the embodiments of the methods described above when executed. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules, so as to perform all or part of the functions described above.
The above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein.

Claims (10)

1. A method for training a language model, comprising:
acquiring training text data;
performing data enhancement processing on the training text data to acquire near-meaning text data and antisense text data corresponding to the training text data;
Extracting features of the training text data, the near-meaning text data and the antisense text data to obtain original text features, near-meaning text features and antisense text features;
and performing language model training according to the original text features, the near-meaning text features and the antisense text features to obtain a target language model.
2. The method for training a language model according to claim 1, wherein the performing data enhancement processing on the training text data to obtain the near-meaning text data and the anti-sense text data corresponding to the training text data comprises:
performing word segmentation processing on the training text data to obtain at least two original words;
performing part-of-speech matching processing on each original vocabulary and a preset dictionary to obtain replaceable vocabularies corresponding to the original vocabularies;
and performing vocabulary replacement processing on the replaceable vocabulary to obtain near meaning text data and antisense text data corresponding to the training text data.
3. The method for training a language model according to claim 2, wherein said performing part-of-speech matching on each of the original words with a predetermined dictionary to obtain a replaceable word corresponding to the original word comprises:
Performing part-of-speech matching processing on each original vocabulary and the preset dictionary to determine the vocabulary attribute of each original vocabulary;
and if the vocabulary attribute is a replaceable attribute, acquiring a replaceable vocabulary corresponding to the original vocabulary.
4. The method for training a language model according to claim 2, wherein said performing vocabulary replacement processing on said replaceable vocabulary to obtain near-meaning text data and anti-sense text data corresponding to said training text data comprises:
similarity calculation is carried out on the replaceable vocabulary and the original vocabulary, and vocabulary similarity corresponding to the replaceable vocabulary is obtained;
screening according to the vocabulary similarity to obtain a near vocabulary and an antisense vocabulary corresponding to the original vocabulary;
replacing the original vocabulary by adopting the near-meaning vocabulary to obtain the near-meaning text data corresponding to the training text data;
and replacing the original vocabulary by adopting the antisense vocabulary to obtain the antisense text data corresponding to the training text data.
5. The method for training a language model according to claim 1, wherein the training a language model according to the original text features, the near-meaning text features and the antisense text features to obtain a target language model comprises:
Acquiring a first mapping characteristic according to the original text characteristic and the near-meaning text characteristic;
acquiring a second mapping characteristic according to the original text characteristic and the antisense text characteristic;
and training a general language model by adopting the first mapping characteristic and the second mapping characteristic to obtain a target language model.
6. The method for training a language model according to claim 5, wherein the training a generic language model using the first mapping feature and the second mapping feature to obtain a target language model comprises:
acquiring a mapping feature ratio according to the first mapping feature and the second mapping feature;
if the mapping characteristic ratio is larger than a preset ratio threshold, updating the model parameters of the universal language model;
and if the mapping characteristic ratio is not greater than a preset ratio threshold, not updating the model parameters of the universal language model.
7. The method for training a language model according to claim 6, wherein the training a generic language model using the first mapping feature and the second mapping feature to obtain a target language model comprises:
acquiring an original loss function according to the first mapping characteristic and the second mapping characteristic;
Acquiring a target truncation function according to the first mapping characteristic and the second mapping characteristic;
determining a target loss function according to the original loss function and the target truncation function;
and performing general language model training by adopting the target loss function to obtain a target language model.
8. A language model training device, comprising:
the training text data acquisition module acquires training text data;
the enhancement processing module is used for carrying out data enhancement processing on the training text data to obtain near-meaning text data and antisense text data corresponding to the training text data;
the feature extraction module is used for extracting features of the training text data, the near-meaning text data and the antisense text data to obtain original text features, near-meaning text features and antisense text features;
and the target language model acquisition module is used for carrying out language model training according to the original text features, the near-sense text features and the antisense text features to acquire a target language model.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the language model training method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out a method for language model training as claimed in any one of claims 1 to 7.
CN202210505156.XA 2022-05-10 2022-05-10 Language model training method, device, equipment and storage medium Pending CN114756659A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210505156.XA CN114756659A (en) 2022-05-10 2022-05-10 Language model training method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210505156.XA CN114756659A (en) 2022-05-10 2022-05-10 Language model training method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114756659A true CN114756659A (en) 2022-07-15

Family

ID=82334373

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210505156.XA Pending CN114756659A (en) 2022-05-10 2022-05-10 Language model training method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114756659A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115909354A (en) * 2022-11-11 2023-04-04 北京百度网讯科技有限公司 Training method of text generation model, and text acquisition method and device
CN115909354B (en) * 2022-11-11 2023-11-10 北京百度网讯科技有限公司 Training method of text generation model, text acquisition method and device
CN116150380A (en) * 2023-04-18 2023-05-23 之江实验室 Text matching method, device, storage medium and equipment

Similar Documents

Publication Publication Date Title
US11017178B2 (en) Methods, devices, and systems for constructing intelligent knowledge base
WO2021047286A1 (en) Text processing model training method, and text processing method and apparatus
CN109815333B (en) Information acquisition method and device, computer equipment and storage medium
CN114756659A (en) Language model training method, device, equipment and storage medium
US10242670B2 (en) Syntactic re-ranking of potential transcriptions during automatic speech recognition
CN110377900A (en) Checking method, device, computer equipment and the storage medium of Web content publication
CN112084789B (en) Text processing method, device, equipment and storage medium
CN113297366B (en) Emotion recognition model training method, device, equipment and medium for multi-round dialogue
CN111666775A (en) Text processing method, device, equipment and storage medium
CN111611383A (en) User intention recognition method and device, computer equipment and storage medium
CN112652295A (en) Language model training method, device, equipment and medium, and video subtitle checking method, device and medium
CN110598210A (en) Entity recognition model training method, entity recognition device, entity recognition equipment and medium
CN114547257B (en) Class matching method and device, computer equipment and storage medium
CN113254620B (en) Response method, device and equipment based on graph neural network and storage medium
CN113449081A (en) Text feature extraction method and device, computer equipment and storage medium
CN113420203A (en) Object recommendation method and device, electronic equipment and storage medium
CN112307754A (en) Statement acquisition method and device
CN111626039A (en) Training method and device for text similarity recognition model and related equipment
CN112087473A (en) Document downloading method and device, computer readable storage medium and computer equipment
CN110162615A (en) A kind of intelligent answer method, apparatus, electronic equipment and storage medium
CN115525749A (en) Voice question-answering method, device, electronic equipment and storage medium
CN114254634A (en) Multimedia data mining method, device, storage medium and equipment
CN111552785A (en) Method and device for updating database of human-computer interaction system, computer equipment and medium
CN113095073A (en) Corpus tag generation method and device, computer equipment and storage medium
CN111930884A (en) Method and equipment for determining reply sentence and man-machine conversation system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination