CN117313892A - Training device and method for text processing model - Google Patents

Training device and method for text processing model

Info

Publication number
CN117313892A
CN117313892A (application CN202311246155.9A)
Authority
CN
China
Prior art keywords
text
data
processing model
training
text processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311246155.9A
Other languages
Chinese (zh)
Inventor
杨其凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Yuepu Network Technology Co ltd
Original Assignee
Shanghai Yuepu Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Yuepu Network Technology Co ltd filed Critical Shanghai Yuepu Network Technology Co ltd
Priority to CN202311246155.9A
Publication of CN117313892A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/31: Indexing; Data structures therefor; Storage structures
    • G06F 16/33: Querying
    • G06F 16/3331: Query processing
    • G06F 16/334: Query execution
    • G06F 16/3344: Query execution using natural language analysis
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a training device and method for a text processing model, belonging to the technical field of model training. The device comprises a text acquisition module, a text processing module, a model training module, an evaluation optimization module, and a data storage module. It solves the problems that existing text processing models cannot preprocess text data before training, resulting in a heavy text data training load, and cannot be optimized and adjusted in real time according to the text data training situation, resulting in a poor training effect.

Description

Training device and method for text processing model
Technical Field
The invention relates to the technical field of model training, in particular to a training device and a training method for a text processing model.
Background
With the rapid development of machine learning, machine learning models are widely used in a variety of business scenarios, including many fields related to text processing such as intelligent customer-service question answering, machine translation, and text analysis and classification. For a text processing model, prediction performance depends on the richness of the training texts: the better the training texts fit the actual application scene and the larger the data volume, the better the trained model performs.
Chinese patent publication No. CN114861887A discloses a method, apparatus, device, medium and program product for generating a text processing model. The method comprises: obtaining a plurality of candidate text processing models, each comprising at least two types of attention layers; obtaining performance information of the candidate text processing models; and determining a target text processing model from the plurality of candidate text processing models based on their performance information. However, this prior art has the following drawback in practical use:
when an existing text processing model trains on text data, the text data is not preprocessed, which leads to a heavy text data training load, and the model cannot be optimized and adjusted in real time according to the text data training situation, which leads to a poor training effect.
Disclosure of Invention
The invention aims to provide a training device and method for a text processing model that can preprocess text data, reduce the text data training load, and perform real-time optimization and adjustment according to the text data training situation, thereby improving the training effect of the text processing model and solving the problems raised in the background art.
In order to achieve the above purpose, the present invention provides the following technical solutions:
a training device for a text processing model, comprising:
the text acquisition module is used for acquiring a text to be trained and determining a text data set to be trained based on the acquired text to be trained;
the text processing module is used for comprehensively processing the text data set to be trained, obtaining the text data set to be trained, searching, grouping and extracting the text data set, and determining text characteristic data based on the text data set;
the model training module is used for carrying out model training on the text characteristic data based on the text data set, obtaining the text characteristic data based on the text data set, inputting the text characteristic data into the text processing model, carrying out model training on the text characteristic data based on the text processing model, and determining the text training data based on the text processing model;
The evaluation optimization module is used for evaluating the text training data based on the text processing model, optimizing the text processing model, acquiring the text training data based on the text processing model, retrieving stored text expected data based on the text training data index, evaluating the text training data based on the text processing model based on the text expected data, determining a data evaluation result, determining a text processing model optimization method based on the data evaluation result, and correspondingly optimizing the text processing model based on the text processing model optimization method;
and the data storage module is used for storing the text expected data and the text training data and providing a reference guide basis for the evaluation of the text training data and the optimization of the text processing model.
Preferably, the text acquisition module includes:
the direct input unit is used for directly inputting the text to be trained, and directly inputting the text to be trained in the modes of keyboard input, voice input and pen type writing input;
the text acquisition unit is used for acquiring the text to be trained, and acquiring the text to be trained in a database acquisition mode, a search engine acquisition mode, a file acquisition mode, a social media acquisition mode, an API acquisition mode and a website acquisition mode;
The text downloading unit is used for downloading the text to be trained, and downloading the text to be trained in a mode of online downloading and library downloading.
Preferably, the text acquisition unit comprises:
the information extraction module is used for extracting the whole data volume of the database;
the regional division number acquisition module is used for setting the regional division number of the database according to the whole data volume, wherein the regional division number of the database is acquired through the following formula:
wherein M represents the regional division number of the database; N represents the reference number; C01 represents the amount of data stored in the database; C02 represents the total data storage space of the database; n represents the number of unit-time intervals, with each unit time ranging from 1 min to 3 min; Ci represents the change in the database's stored data amount in the i-th unit time compared with the previous unit time; and Cmin and Cmax represent, respectively, the minimum and maximum change in the database's stored data amount per unit time compared with the previous unit time;
the proportional relation acquisition module is used for extracting the regional division number of the database and comparing the regional division number with the preset reference number, and judging the proportional relation between the regional division number of the database and the preset reference number;
The regional data quantity determining module is used for determining regional data quantities corresponding to the first region and the second region according to the proportional relation between the regional division quantity of the database and the preset reference quantity;
the area number setting module is used for setting constraint conditions according to the data quantity of the first area and the second area and the quantity between the first area and the second area, and setting the number of the first area and the number of the second area, wherein the quantity setting constraint conditions of the first area and the second area are as follows: m is M 2 ≤0.65M 1 The method comprises the steps of carrying out a first treatment on the surface of the Wherein M is 1 And M 2 The number of the first region and the second region is represented by the score;
the dividing module is used for dividing the database according to the number of the first areas and the second areas and the area data quantity of the first areas and the second areas to obtain a plurality of database area blocks; the residual data volume of the database after the region division is independently used as a database region block;
the encoding module is used for encoding the database region blocks corresponding to the first region and the second region in a unique sequence according to the first region and the second region interleaving mode;
and the scanning module is used for sequentially scanning the database area blocks according to the unique sequence codes and extracting the texts to be trained, which are in accordance with the training data requirements, of the database area blocks until the number requirements of the texts to be trained are met.
Preferably, the area data amount determining module includes:
the proportional relation extraction module is used for extracting the proportional relation between the regional division number of the database and the preset reference number;
the judging module is used for judging whether the proportional relation satisfies N/M ≤ 0.46, where M represents the regional division number of the database and N represents the reference number;
the first setting module is used for setting the area data amounts corresponding to the first and second areas as follows when the proportional relation satisfies N/M ≤ 0.46:
where M represents the regional division number of the database; N represents the reference number; Cm0 represents a preset reference area data amount; and Cm1 and Cm2 represent the area data amounts corresponding to the first and second areas, respectively;
the second setting module is used for setting the area data amounts corresponding to the first and second areas as follows when the proportional relation does not satisfy N/M ≤ 0.46:
where M represents the regional division number of the database; N represents the reference number; Cm0 represents a preset reference area data amount; and Cm1 and Cm2 represent the area data amounts corresponding to the first and second areas, respectively.
Preferably, the text processing module includes:
The text retrieval unit is used for retrieving the text data set to be trained;
acquiring a text data set to be trained, searching the text data set to be trained based on a sequential searching method, filtering text data which is useless for training a text processing model in the text data set, and determining text data which is useful for training the text processing model in the text data set;
a text grouping unit for grouping the retrieved text data;
acquiring text data which are useful for training a text processing model in a text data set, grouping the text data which are useful for training the text processing model in the text data set based on a mutual exclusion principle, and determining text data groups with different attributes, wherein each text data group stores text data with the same attribute;
the feature extraction unit is used for extracting features of the grouped text data;
and acquiring text data groups with different attributes, and extracting characteristics of text data stored in each text data group to determine text characteristic data based on the text data groups.
Preferably, the model training module includes:
a data extraction unit for extracting text feature data based on the text data set;
Extracting text feature data based on the text data set, and inputting the extracted text feature data into a text processing model;
the model training unit is used for carrying out model training on the text characteristic data based on the text data set;
and carrying out model training on the text characteristic data based on the text processing model, and determining text training data based on the text processing model.
Preferably, the evaluation optimization module includes:
the index calling unit is used for calling out the stored text expected data by indexes;
acquiring text training data based on a text processing model, retrieving stored text expected data based on the text training data, and calling out the retrieved text expected data;
the comparison evaluation unit is used for evaluating the text training data based on the text processing model;
acquiring text training data and text expected data, evaluating the text training data based on a text processing model based on the text expected data, and determining a data evaluation result;
the analysis optimizing unit is used for optimizing the text processing model;
and acquiring a data evaluation result, determining a text processing model optimization method based on the data evaluation result, and correspondingly optimizing the text processing model based on the text processing model optimization method.
Preferably, the data storage module includes:
the expected storage unit is used for storing text expected data and providing a reference basis for evaluating text training data based on a text processing model;
and the training storage unit is used for storing text training data and providing a guiding basis for optimizing the text processing model.
According to another aspect of the present invention, there is provided a training method for a text processing model, implemented based on the training device for the text processing model described above, comprising the following steps:
s1: acquiring a text to be trained through a text acquisition module, determining a text data set to be trained based on the acquired text to be trained, comprehensively processing the text data set to be trained through a text processing module, and determining text characteristic data based on the text data set;
s2: model training is carried out on text feature data based on a text data set through a model training module, text feature data based on the text data set is obtained, the text feature data is input into a text processing model, model training is carried out on the text feature data based on the text processing model, and text training data based on the text processing model is determined;
S3: and evaluating the text training data based on the text processing model through an evaluation optimization module, optimizing the text processing model, retrieving stored text expected data based on the text training data index, evaluating the text training data based on the text expected data, determining a data evaluation result, determining a text processing model optimization method based on the data evaluation result, and correspondingly optimizing the text processing model based on the text processing model optimization method.
Preferably, in step S3, when the evaluation optimization module evaluates the text training data based on the text processing model and optimizes the text processing model, the following operations are performed:
acquiring text training data based on a text processing model;
retrieving stored text expected data based on the text training data and retrieving the retrieved text expected data;
acquiring text training data and text expected data, evaluating the text training data based on a text processing model based on the text expected data, and determining a data evaluation result;
aiming at the condition that the text training data is in the text expected data range, the data evaluation result is that the data training based on the text processing model is qualified;
Aiming at the condition that the text training data is not in the text expected data range, the data evaluation result is that the data training based on the text processing model is unqualified;
acquiring a data evaluation result, determining a text processing model optimization method based on the data evaluation result, and correspondingly optimizing a text processing model based on the text processing model optimization method;
aiming at the condition that the data training based on the text processing model is qualified, the text processing model does not need to be optimized;
aiming at the condition that the data training based on the text processing model is unqualified, the text processing model is required to be optimized, the text processing model is obtained, and the text processing model and the text expected data are iteratively upgraded.
Compared with the prior art, the invention has the beneficial effects that:
According to the training device and method for the text processing model, the text to be trained is acquired and a text data set to be trained is determined from it; the text data set to be trained is comprehensively processed and text feature data based on the text data set is determined, so the text data can be preprocessed and the text data training load is reduced. The text feature data is input into the text processing model, model training is performed on it based on the text processing model, and text training data based on the text processing model is determined; stored expected text data is retrieved based on the text training data index, the text training data based on the text processing model is evaluated against the expected text data, a data evaluation result is determined, a text processing model optimization method is determined based on the data evaluation result, and the text processing model is correspondingly optimized based on that method, so real-time optimization and adjustment can be performed according to the text data training situation and the training effect of the text processing model is improved.
Drawings
FIG. 1 is a block diagram of a training device for a text processing model of the present invention;
FIG. 2 is a flow chart of a training method of the text processing model of the present invention;
FIG. 3 is an algorithm flow chart of a training method of the text processing model of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
In order to solve the problem that existing text processing models cannot preprocess text data before training, which leads to a heavy text data training load, and cannot be optimized and adjusted in real time according to the text data training situation, which leads to a poor training effect, this embodiment, referring to FIGS. 1-3, provides the following technical solution:
the training device of the text processing model comprises a text acquisition module, a text processing module, a model training module, an evaluation optimization module and a data storage module;
The text processing module is used for comprehensively processing the text data set to be trained to determine text characteristic data based on the text data set, so that the text data can be preprocessed, and the text data training load is reduced.
Model training is performed on the text feature data based on the text data set through the model training module: the text feature data based on the text data set is acquired and input into the text processing model, model training is performed on it based on the text processing model, and text training data based on the text processing model is determined. The text training data based on the text processing model is evaluated and the text processing model is optimized through the evaluation optimization module: stored expected text data is retrieved based on the text training data index, the text training data is evaluated against the expected text data, a data evaluation result is determined, a text processing model optimization method is determined based on the data evaluation result, and the text processing model is correspondingly optimized based on that method, so that real-time optimization and adjustment can be performed according to the text data training situation and the training effect of the text processing model is improved;
The text acquisition module is used for acquiring a text to be trained and determining a text data set to be trained based on the acquired text to be trained;
the text acquisition module comprises a direct input unit, a text acquisition unit and a text downloading unit;
the direct input unit can directly input the text to be trained, and the text to be trained is directly input in the modes of keyboard input, voice input and pen writing input;
the text acquisition unit can acquire the text to be trained, and acquires the text to be trained in a database acquisition mode, a search engine acquisition mode, a file acquisition mode, a social media acquisition mode, an API acquisition mode and a website acquisition mode;
the text downloading unit can download the text to be trained in a mode of online downloading and library downloading.
Specifically, the text acquisition unit includes:
the information extraction module is used for extracting the whole data volume of the database;
the regional division number acquisition module is used for setting the regional division number of the database according to the whole data volume, wherein the regional division number of the database is acquired through the following formula:
wherein M represents the regional division number of the database; N represents the reference number; C01 represents the amount of data stored in the database; C02 represents the total data storage space of the database; n represents the number of unit-time intervals, with each unit time ranging from 1 min to 3 min; Ci represents the change in the database's stored data amount in the i-th unit time compared with the previous unit time; and Cmin and Cmax represent, respectively, the minimum and maximum change in the database's stored data amount per unit time compared with the previous unit time;
the proportional relation acquisition module is used for extracting the regional division number of the database and comparing the regional division number with the preset reference number, and judging the proportional relation between the regional division number of the database and the preset reference number;
the regional data quantity determining module is used for determining regional data quantities corresponding to the first region and the second region according to the proportional relation between the regional division quantity of the database and the preset reference quantity;
the area number setting module is used for setting the number of first areas and the number of second areas according to the area data amounts of the first and second areas and a quantity constraint between them, wherein the quantity constraint on the first and second areas is M2 ≤ 0.65M1, with M1 and M2 representing the numbers of first areas and second areas, respectively;
the dividing module is used for dividing the database according to the number of the first areas and the second areas and the area data quantity of the first areas and the second areas to obtain a plurality of database area blocks; the residual data volume of the database after the region division is independently used as a database region block;
the encoding module is used for encoding the database region blocks corresponding to the first region and the second region in a unique sequence according to the first region and the second region interleaving mode;
and the scanning module is used for sequentially scanning the database area blocks according to the unique sequence codes and extracting the texts to be trained, which are in accordance with the training data requirements, of the database area blocks until the number requirements of the texts to be trained are met.
The technical effects of the technical scheme are as follows: data extraction and partition division: the scheme obtains the whole data volume through an information extraction module, and calculates the regional division number of the database according to a group of parameters and formulas. This helps effectively divide the database into different regions for better management and processing of the data.
Dynamic adjustment: formulas and parameters in the scheme allow for dynamic adjustment of the number of region divisions according to the actual situation and changes of the database. This may help to automatically adapt to the growth or shrinkage of the database to ensure efficient data collection.
Proportional relation analysis: through the proportional relation acquisition module, the scheme compares the regional division number of the database with the preset reference number and determines the proportional relation between them. This helps determine which areas require more data acquisition and which require less.
Data volume determination and region setting: according to the proportional relation, the scheme determines the area data quantity corresponding to the first area and the second area, and sets the number of the first area and the second area according to the quantity setting constraint condition. This helps to evenly divide the database to meet data acquisition requirements.
Data encoding and scanning: the data encoding module encodes the database region blocks in a unique order for subsequent data scanning. The scanning module scans the database area blocks according to the coding sequence and extracts text data meeting the requirement of training data.
Automated data acquisition: the whole process is automatic, and text data meeting the requirements can be automatically acquired from the database according to preset parameters and constraint conditions without manual intervention.
In general, the technical effect of the technical scheme is that an automatic data acquisition system is provided, the data area can be dynamically divided and managed according to the actual situation and the requirement of a database, the text data meeting the training data requirement is ensured to be acquired, flexible adjustment can be carried out under the conditions of different time periods and data volume, and the data acquisition efficiency and adaptability are improved. This is of great significance for application scenarios requiring large-scale collection of text data, such as machine learning and natural language processing research.
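As an illustration of the partition-and-scan flow described above, the following Python sketch interleaves the database region blocks of the first and second regions into a single scanning order and extracts texts until the required count is met. The partition-count formula itself is not reproduced in this text, and every name and the interleaving detail below are assumptions made for readability rather than the patented implementation.

```python
# Illustrative sketch only: block contents, the usefulness test, and all names
# (interleave_blocks, collect_training_texts, ...) are hypothetical stand-ins.
from typing import Callable, List


def interleave_blocks(first_blocks: List[List[str]],
                      second_blocks: List[List[str]],
                      residual_block: List[str]) -> List[List[str]]:
    """Encode the region blocks in a unique order by interleaving first and second regions."""
    ordered: List[List[str]] = []
    for a, b in zip(first_blocks, second_blocks):
        ordered.extend([a, b])
    # Leftover blocks of the longer list, then the residual block, are appended last.
    tail = first_blocks if len(first_blocks) > len(second_blocks) else second_blocks
    ordered.extend(tail[min(len(first_blocks), len(second_blocks)):])
    ordered.append(residual_block)
    return ordered


def collect_training_texts(ordered_blocks: List[List[str]],
                           meets_requirements: Callable[[str], bool],
                           required_count: int) -> List[str]:
    """Scan the blocks in their encoded order and stop once enough texts are gathered."""
    collected: List[str] = []
    for block in ordered_blocks:
        for text in block:
            if meets_requirements(text):
                collected.append(text)
                if len(collected) >= required_count:
                    return collected
    return collected
```

The early return once the required count is reached mirrors the idea that scanning stops as soon as the number requirement of texts to be trained is met.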
Specifically, the area data amount determining module includes:
the proportional relation extraction module is used for extracting the proportional relation between the regional division number of the database and the preset reference number;
the judging module is used for judging whether the proportional relation satisfies N/M ≤ 0.46, where M represents the regional division number of the database and N represents the reference number;
the first setting module is used for setting the area data amounts corresponding to the first and second areas as follows when the proportional relation satisfies N/M ≤ 0.46:
where M represents the regional division number of the database; N represents the reference number; Cm0 represents a preset reference area data amount; and Cm1 and Cm2 represent the area data amounts corresponding to the first and second areas, respectively;
the second setting module is used for setting the area data amounts corresponding to the first and second areas as follows when the proportional relation does not satisfy N/M ≤ 0.46:
where M represents the regional division number of the database; N represents the reference number; Cm0 represents a preset reference area data amount; and Cm1 and Cm2 represent the area data amounts corresponding to the first and second areas, respectively.
The technical effects of the technical scheme are as follows: and (3) extracting the proportion relation: the scheme can acquire the proportional relation between the regional division number of the database and the preset reference number through the proportional relation extraction module. This helps to understand whether the data partitioning of the database is expected and provides the underlying information for use in subsequent decisions.
And (3) judging the proportion relation: the judging module judges whether a specific condition (N/M is less than or equal to 0.46) is met according to the numerical value of the proportion relation. This condition may be set according to traffic demand or performance optimization considerations. The technical scheme can set different data volumes according to the condition.
Dynamic data volume setting: the first setting module and the second setting module respectively set the area data quantity corresponding to the first area and the second area according to whether the proportional relation meets the condition. This dynamic setting allows the system to automatically adjust the data partitioning to meet the best balance of performance and resource utilization, case by case.
Setting a reference area data amount: the technical scheme also provides the preset reference area data amount (Cm0) and the corresponding area data amounts (Cm1 and Cm2) of the first and second areas. This helps to ensure that the data volume of the different regions is within a reasonable range and meets the expected requirements.
In general, the technical effect of the technical scheme is to provide a mechanism for setting dynamic data volume, and the data volume distribution of the database area can be automatically adjusted according to the proportion relation and the conditions so as to meet the optimal performance and resource utilization under different requirements. The method is beneficial to improving the flexibility and efficiency of data management, and is particularly suitable for application scenes of large-scale data acquisition and processing.
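A minimal sketch of the ratio-based branching follows. The concrete area-data-amount formulas are not reproduced in this text, so the two split expressions below are purely illustrative placeholders; only the N/M ≤ 0.46 threshold and the two-branch structure come from the description above.

```python
# Illustrative only: the real split formulas are not shown in this text, so both
# expressions are placeholders; only the N/M <= 0.46 branch structure is retained.
from typing import Tuple


def set_region_data_amounts(m: int, n: int, c_m0: float) -> Tuple[float, float]:
    ratio = n / m
    if ratio <= 0.46:
        c_m1 = c_m0 * (1.0 + ratio)   # placeholder split for this branch (assumption)
        c_m2 = c_m0 * (1.0 - ratio)   # placeholder split for this branch (assumption)
    else:
        c_m1 = c_m0 * ratio           # placeholder split for the other branch (assumption)
        c_m2 = c_m0 * (2.0 - ratio)   # placeholder split for the other branch (assumption)
    return c_m1, c_m2                 # area data amounts for the first and second regions
```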
The text processing module is used for comprehensively processing the text data set to be trained;
it should be noted that the text processing module includes a text retrieving unit, a text grouping unit and a feature extracting unit;
the text retrieval unit can retrieve a text data set to be trained;
specifically, a text data set to be trained is obtained, the text data set to be trained is searched based on a sequential search method, text data which is useless for training a text processing model in the text data set is filtered, and text data which is useful for training the text processing model in the text data set is determined;
the text grouping unit can group the retrieved text data;
specifically, obtaining text data in a text data set, which is useful for training a text processing model, grouping the text data in the text data set, which is useful for training the text processing model, based on a mutual exclusion principle, and determining text data groups with different attributes, wherein each text data group stores text data with the same attribute;
the feature extraction unit can perform feature extraction on the grouped text data;
Specifically, text data groups with different attributes are obtained, feature extraction is performed on the text data stored in each text data group, and text feature data based on the text data set is determined.
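To make this retrieve, group, and extract flow concrete, the following Python sketch filters out useless texts sequentially, places each remaining text into exactly one attribute group, and extracts features per group. The usefulness test, attribute function, and feature extractor are illustrative assumptions, not definitions from this disclosure.

```python
# Minimal sketch of the retrieve -> group -> extract flow; all callables are assumed.
from collections import defaultdict
from typing import Callable, Dict, List


def process_texts(texts: List[str],
                  is_useful: Callable[[str], bool],
                  attribute_of: Callable[[str], str],
                  extract_features: Callable[[str], List[float]]) -> Dict[str, List[List[float]]]:
    # Sequential retrieval: filter out texts that are useless for training.
    useful = [t for t in texts if is_useful(t)]
    # Mutually exclusive grouping: each text goes into exactly one attribute group.
    groups: Dict[str, List[str]] = defaultdict(list)
    for t in useful:
        groups[attribute_of(t)].append(t)
    # Feature extraction performed separately on each text data group.
    return {attr: [extract_features(t) for t in grouped] for attr, grouped in groups.items()}


# Example usage with toy stand-ins for the assumed callables:
features = process_texts(
    ["good product", "spam spam", "bad service"],
    is_useful=lambda t: "spam" not in t,
    attribute_of=lambda t: "positive" if "good" in t else "negative",
    extract_features=lambda t: [float(len(t)), float(len(t.split()))],
)
```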
The model training module is used for carrying out model training on text characteristic data based on the text data set;
it should be noted that the model training module includes a data extraction unit and a model training unit;
wherein the data extraction unit may extract text feature data based on the text data set;
specifically, text feature data based on a text data set is extracted, and the extracted text feature data is input into a text processing model;
the model training unit can perform model training on text characteristic data based on the text data set;
specifically, model training is performed on the text feature data based on the text processing model, and text training data based on the text processing model is determined.
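As a concrete illustration of this extract-and-train step, the short sketch below uses scikit-learn as an assumed stand-in for the unspecified text processing model; the feature extractor and classifier choices are examples only.

```python
# Sketch of feeding extracted text feature data into a model; scikit-learn is an
# assumed stand-in, not the model prescribed by this disclosure.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["good product", "bad service", "great quality", "poor support"]
labels = [1, 0, 1, 0]

vectorizer = TfidfVectorizer()                 # extract text feature data from the text data set
X = vectorizer.fit_transform(texts)            # feature data fed into the text processing model
model = LogisticRegression().fit(X, labels)    # model training on the text feature data
train_predictions = model.predict(X)           # training outputs produced by the trained model
```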
The evaluation optimization module is used for evaluating the text training data based on the text processing model and optimizing the text processing model;
it should be noted that the evaluation optimization module includes an index retrieving unit, a comparison evaluation unit and an analysis optimization unit;
The index calling unit can call out the stored text expected data in an index way;
specifically, text training data based on a text processing model is obtained, stored text expected data is searched and extracted based on the text training data, and the indexed text expected data is called out;
wherein, the comparison evaluation unit can evaluate the text training data based on the text processing model;
specifically, text training data and text expected data are obtained, the text training data based on a text processing model is evaluated based on the text expected data, and a data evaluation result is determined;
the analysis optimizing unit can optimize the text processing model;
specifically, a data evaluation result is obtained, a text processing model optimization method is determined based on the data evaluation result, and a text processing model is correspondingly optimized based on the text processing model optimization method.
It should be noted that, evaluating text training data based on the text processing model and optimizing the text processing model includes:
acquiring text training data based on a text processing model;
retrieving stored text expected data based on the text training data and retrieving the retrieved text expected data;
Acquiring text training data and text expected data, evaluating the text training data based on a text processing model based on the text expected data, and determining a data evaluation result;
aiming at the condition that the text training data is in the text expected data range, the data evaluation result is that the data training based on the text processing model is qualified;
aiming at the condition that the text training data is not in the text expected data range, the data evaluation result is that the data training based on the text processing model is unqualified;
acquiring a data evaluation result, determining a text processing model optimization method based on the data evaluation result, and correspondingly optimizing a text processing model based on the text processing model optimization method;
aiming at the condition that the data training based on the text processing model is qualified, the text processing model does not need to be optimized;
aiming at the condition that the data training based on the text processing model is unqualified, the text processing model is required to be optimized, the text processing model is obtained, and the text processing model and the text expected data are iteratively upgraded.
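A minimal sketch of the evaluation-and-optimization loop just described follows: training outputs are compared against expected data retrieved from storage, and the model is re-trained only when the result is unqualified. Representing the expected data as a numeric range, and all function names, are assumptions made for illustration.

```python
# Illustrative sketch: "expected data" is modelled as a numeric range and the
# re-training step as a callback; both are assumptions, not the patented method.
from typing import Callable, List, Tuple


def is_qualified(training_values: List[float], expected_range: Tuple[float, float]) -> bool:
    low, high = expected_range
    return all(low <= v <= high for v in training_values)


def evaluate_and_optimize(training_values: List[float],
                          expected_range: Tuple[float, float],
                          retrain: Callable[[], List[float]],
                          max_iterations: int = 5) -> bool:
    values = training_values
    for _ in range(max_iterations):
        if is_qualified(values, expected_range):
            return True          # qualified: no optimization of the model is needed
        values = retrain()       # unqualified: optimize the model and obtain new outputs
    return is_qualified(values, expected_range)
```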
The data storage module is used for storing text expected data and text training data and providing a reference guide basis for evaluating the text training data and optimizing a text processing model.
It should be noted that the data storage module includes an expected storage unit and a training storage unit;
the expected storage unit is used for storing text expected data and providing a reference basis for evaluating text training data based on a text processing model;
the training storage unit can be used for storing text training data and providing a guiding basis for optimizing a text processing model.
It should be noted that the text processing model is trained using the above training device; the training situation of the text processing model is shown in Table 1:
table 1: training situation of text processing model
In order to better demonstrate the training flow of the text processing model, the embodiment now provides a training method of the text processing model, which is realized based on the training device of the text processing model, and comprises the following steps:
s1: acquiring a text to be trained through a text acquisition module, determining a text data set to be trained based on the acquired text to be trained, comprehensively processing the text data set to be trained through a text processing module, and determining text characteristic data based on the text data set;
s2: model training is carried out on text feature data based on a text data set through a model training module, text feature data based on the text data set is obtained, the text feature data is input into a text processing model, model training is carried out on the text feature data based on the text processing model, and text training data based on the text processing model is determined;
S3: and evaluating the text training data based on the text processing model through an evaluation optimization module, optimizing the text processing model, retrieving stored text expected data based on the text training data index, evaluating the text training data based on the text expected data, determining a data evaluation result, determining a text processing model optimization method based on the data evaluation result, and correspondingly optimizing the text processing model based on the text processing model optimization method.
In summary, with the training device and method for the text processing model, the text to be trained is acquired and a text data set to be trained is determined from it; the text data set to be trained is comprehensively processed and text feature data based on the text data set is determined, so the text data is preprocessed and the text data training load is reduced. The text feature data is input into the text processing model, model training is performed on it based on the text processing model, and text training data based on the text processing model is determined; stored expected text data is retrieved based on the text training data index, the text training data based on the text processing model is evaluated to determine a data evaluation result, a text processing model optimization method is determined based on the data evaluation result, and the text processing model is correspondingly optimized based on that method, so real-time optimization and adjustment can be performed according to the text data training situation and the training effect of the text processing model is improved.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. A training device for a text processing model, comprising:
the text acquisition module is used for acquiring a text to be trained and determining a text data set to be trained based on the acquired text to be trained;
The text processing module is used for comprehensively processing the text data set to be trained, obtaining the text data set to be trained, searching, grouping and extracting the text data set, and determining text characteristic data based on the text data set;
the model training module is used for carrying out model training on the text characteristic data based on the text data set, obtaining the text characteristic data based on the text data set, inputting the text characteristic data into the text processing model, carrying out model training on the text characteristic data based on the text processing model, and determining the text training data based on the text processing model;
the evaluation optimization module is used for evaluating the text training data based on the text processing model, optimizing the text processing model, acquiring the text training data based on the text processing model, retrieving stored text expected data based on the text training data index, evaluating the text training data based on the text processing model based on the text expected data, determining a data evaluation result, determining a text processing model optimization method based on the data evaluation result, and correspondingly optimizing the text processing model based on the text processing model optimization method;
And the data storage module is used for storing the text expected data and the text training data and providing a reference guide basis for the evaluation of the text training data and the optimization of the text processing model.
2. The training device of a text processing model of claim 1, wherein the text acquisition module comprises:
the direct input unit is used for directly inputting the text to be trained, and directly inputting the text to be trained in the modes of keyboard input, voice input and pen type writing input;
the text acquisition unit is used for acquiring the text to be trained, and acquiring the text to be trained in a database acquisition mode, a search engine acquisition mode, a file acquisition mode, a social media acquisition mode, an API acquisition mode and a website acquisition mode;
the text downloading unit is used for downloading the text to be trained, and downloading the text to be trained in a mode of online downloading and library downloading.
3. The training device of a text processing model according to claim 1, wherein the text acquisition unit comprises:
the information extraction module is used for extracting the whole data volume of the database;
the regional division number acquisition module is used for setting the regional division number of the database according to the whole data volume, wherein the regional division number of the database is acquired through the following formula:
wherein M represents the regional division number of the database; N represents the reference number; C01 represents the amount of data stored in the database; C02 represents the total data storage space of the database; n represents the number of unit-time intervals, with each unit time ranging from 1 min to 3 min; Ci represents the change in the database's stored data amount in the i-th unit time compared with the previous unit time; and Cmin and Cmax represent, respectively, the minimum and maximum change in the database's stored data amount per unit time compared with the previous unit time;
the proportional relation acquisition module is used for extracting the regional division number of the database and comparing the regional division number with the preset reference number, and judging the proportional relation between the regional division number of the database and the preset reference number;
the regional data quantity determining module is used for determining regional data quantities corresponding to the first region and the second region according to the proportional relation between the regional division quantity of the database and the preset reference quantity;
the area number setting module is used for setting the number of first areas and the number of second areas according to the area data amounts of the first and second areas and a quantity constraint between them, wherein the quantity constraint on the first and second areas is M2 ≤ 0.65M1, with M1 and M2 representing the numbers of first areas and second areas, respectively;
the dividing module is used for dividing the database according to the number of the first areas and the second areas and the area data quantity of the first areas and the second areas to obtain a plurality of database area blocks; the residual data volume of the database after the region division is independently used as a database region block;
the encoding module is used for encoding the database region blocks corresponding to the first region and the second region in a unique sequence according to the first region and the second region interleaving mode;
and the scanning module is used for sequentially scanning the database area blocks according to the unique sequence codes and extracting the texts to be trained, which are in accordance with the training data requirements, of the database area blocks until the number requirements of the texts to be trained are met.
4. A training device for a text processing model as claimed in claim 3, characterized in that the region data amount determining module comprises:
the proportional relation extraction module is used for extracting the proportional relation between the regional division number of the database and the preset reference number;
the judging module is used for judging whether the proportional relation satisfies N/M ≤ 0.46, where M represents the regional division number of the database and N represents the reference number;
the first setting module is used for setting the area data amounts corresponding to the first and second areas as follows when the proportional relation satisfies N/M ≤ 0.46:
where M represents the regional division number of the database; N represents the reference number; Cm0 represents a preset reference area data amount; and Cm1 and Cm2 represent the area data amounts corresponding to the first and second areas, respectively;
the second setting module is used for setting the area data amounts corresponding to the first and second areas as follows when the proportional relation does not satisfy N/M ≤ 0.46:
where M represents the regional division number of the database; N represents the reference number; Cm0 represents a preset reference area data amount; and Cm1 and Cm2 represent the area data amounts corresponding to the first and second areas, respectively.
5. The training device of a text processing model of claim 2, wherein the text processing module comprises:
the text retrieval unit is used for retrieving the text data set to be trained;
acquiring a text data set to be trained, searching the text data set to be trained based on a sequential searching method, filtering text data which is useless for training a text processing model in the text data set, and determining text data which is useful for training the text processing model in the text data set;
A text grouping unit for grouping the retrieved text data;
acquiring text data which are useful for training a text processing model in a text data set, grouping the text data which are useful for training the text processing model in the text data set based on a mutual exclusion principle, and determining text data groups with different attributes, wherein each text data group stores text data with the same attribute;
the feature extraction unit is used for extracting features of the grouped text data;
and acquiring text data groups with different attributes, and extracting characteristics of text data stored in each text data group to determine text characteristic data based on the text data groups.
6. The training device of a text processing model of claim 5, wherein the model training module comprises:
a data extraction unit for extracting text feature data based on the text data set;
extracting text feature data based on the text data set, and inputting the extracted text feature data into a text processing model;
the model training unit is used for carrying out model training on the text characteristic data based on the text data set;
And carrying out model training on the text characteristic data based on the text processing model, and determining text training data based on the text processing model.
7. The training device of a text processing model of claim 6, wherein the evaluation optimization module comprises:
the index calling unit is used for calling out the stored text expected data by indexes;
acquiring text training data based on a text processing model, retrieving stored text expected data based on the text training data, and calling out the retrieved text expected data;
the comparison evaluation unit is used for evaluating the text training data based on the text processing model;
acquiring text training data and text expected data, evaluating the text training data based on a text processing model based on the text expected data, and determining a data evaluation result;
the analysis optimizing unit is used for optimizing the text processing model;
and acquiring a data evaluation result, determining a text processing model optimization method based on the data evaluation result, and correspondingly optimizing the text processing model based on the text processing model optimization method.
8. The training device of a text processing model of claim 7, wherein the data storage module comprises:
The expected storage unit is used for storing text expected data and providing a reference basis for evaluating text training data based on a text processing model;
and the training storage unit is used for storing text training data and providing a guiding basis for optimizing the text processing model.
9. A training method for a text processing model, implemented based on the training device of a text processing model according to claim 6, characterized by comprising the following steps:
s1: acquiring a text to be trained through a text acquisition module, determining a text data set to be trained based on the acquired text to be trained, comprehensively processing the text data set to be trained through a text processing module, and determining text characteristic data based on the text data set;
s2: model training is carried out on text feature data based on a text data set through a model training module, text feature data based on the text data set is obtained, the text feature data is input into a text processing model, model training is carried out on the text feature data based on the text processing model, and text training data based on the text processing model is determined;
s3: and evaluating the text training data based on the text processing model through an evaluation optimization module, optimizing the text processing model, retrieving stored text expected data based on the text training data index, evaluating the text training data based on the text expected data, determining a data evaluation result, determining a text processing model optimization method based on the data evaluation result, and correspondingly optimizing the text processing model based on the text processing model optimization method.
10. The training method of a text processing model according to claim 9, wherein in S3, when the text training data based on the text processing model is evaluated and the text processing model is optimized through the evaluation optimization module, the following operations are performed:
acquiring text training data based on a text processing model;
retrieving stored text expected data based on the text training data and retrieving the retrieved text expected data;
acquiring text training data and text expected data, evaluating the text training data based on a text processing model based on the text expected data, and determining a data evaluation result;
aiming at the condition that the text training data is in the text expected data range, the data evaluation result is that the data training based on the text processing model is qualified;
aiming at the condition that the text training data is not in the text expected data range, the data evaluation result is that the data training based on the text processing model is unqualified;
acquiring a data evaluation result, determining a text processing model optimization method based on the data evaluation result, and correspondingly optimizing a text processing model based on the text processing model optimization method;
Aiming at the condition that the data training based on the text processing model is qualified, the text processing model does not need to be optimized;
aiming at the condition that the data training based on the text processing model is unqualified, the text processing model is required to be optimized, the text processing model is obtained, and the text processing model and the text expected data are iteratively upgraded.
CN202311246155.9A 2023-09-26 2023-09-26 Training device and method for text processing model Pending CN117313892A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311246155.9A CN117313892A (en) 2023-09-26 2023-09-26 Training device and method for text processing model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311246155.9A CN117313892A (en) 2023-09-26 2023-09-26 Training device and method for text processing model

Publications (1)

Publication Number Publication Date
CN117313892A true CN117313892A (en) 2023-12-29

Family

ID=89249343

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311246155.9A Pending CN117313892A (en) 2023-09-26 2023-09-26 Training device and method for text processing model

Country Status (1)

Country Link
CN (1) CN117313892A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination