CN117709355A - Method, device and medium for improving training effect of large language model - Google Patents

Method, device and medium for improving training effect of large language model

Info

Publication number
CN117709355A
Authority
CN
China
Prior art keywords
corpus
corpus text
target
text
texts
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410164274.8A
Other languages
Chinese (zh)
Inventor
王帅
周舒婷
雷成铭
陈玉梅
张光谱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Shutian Information Technology Co ltd
Original Assignee
Sichuan Shutian Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Shutian Information Technology Co ltd filed Critical Sichuan Shutian Information Technology Co ltd
Priority to CN202410164274.8A priority Critical patent/CN117709355A/en
Publication of CN117709355A publication Critical patent/CN117709355A/en
Pending legal-status Critical Current

Classifications

    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a method, a device and a medium for improving the training effect of a large language model, wherein the method comprises the following steps: acquiring an initial corpus text set; the initial corpus text set comprises a plurality of corpus texts, and each corpus text corresponds to one label information set; preprocessing all the corpus texts in the initial corpus text set to obtain a target corpus text set; carrying out semantic segmentation on the target corpus texts in the target corpus text set to obtain a plurality of segmented words; updating a preset vocabulary library based on the segmented vocabulary to obtain an updated vocabulary library; according to the target label information set of the target corpus text, determining a professional corpus text and a general corpus text from the plurality of target corpus texts; training the pre-constructed large language model based on the updated vocabulary library, the professional corpus text and the general corpus text to obtain the trained large language model. The method and the device can improve the effect of the large language model obtained through training.

Description

Method, device and medium for improving training effect of large language model
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a method, a device and a medium for improving training effect of a large language model.
Background
A large language model is obtained by pre-training on massive text data and naturally adapts to natural language processing tasks. However, when a pre-trained large language model is directly used for a question-answering task in a specific field, the effect is often unsatisfactory. In order to better adapt to a question-answering task in a specific field, high-quality data need to be extracted and labeled by processing corpus texts in the professional field, and the labeled data are used to fine-tune the model so that it adapts to the specific task or field, thereby further improving its question-answering capability.
A large language model which meets human expectations and performs excellently in all aspects not only needs to maintain strong semantic understanding and common-sense reasoning in the general domain, but also needs to adapt well to question-answering tasks in specific fields. However, if the data are incorrectly processed in the process of fine-tuning the large language model, the trained large language model is prone to under-fitting or over-fitting, so that the large language model either fails to learn knowledge in a specific professional domain, or forgets the semantic understanding and common-sense reasoning capacity of the general domain and over-fits the target domain and specific task. As a result, the capacity of the large language model to complete question-answering tasks raised by users can hardly achieve the expected effect.
Disclosure of Invention
The main purpose of the application is to provide a method, a device and a medium for improving the training effect of a large language model, and aims to solve the technical problem that the large language model obtained through training is difficult to achieve the expected effect.
To achieve the above object, the present application provides a method for improving training effect of a large language model, the method comprising:
acquiring an initial corpus text set; the initial corpus text set comprises a plurality of corpus texts, and each corpus text corresponds to one label information set; the label information set at least comprises a downloading address identifier of the corpus text and a domain identifier;
preprocessing all the corpus texts in the initial corpus text set to obtain a target corpus text set;
performing semantic segmentation on the target corpus text in the target corpus text set to obtain a plurality of segmented words;
updating a preset vocabulary library based on the divided vocabulary to obtain an updated vocabulary library;
according to the target label information set of the target corpus text, determining a professional corpus text and a general corpus text from the plurality of target corpus texts; the ratio of the number of the professional corpus texts to the number of the general corpus texts is a preset ratio;
Training a pre-constructed large language model based on the updated vocabulary library, the professional corpus text and the general corpus text to obtain a trained large language model.
Optionally, the preprocessing all the corpus texts in the initial corpus text set to obtain a target corpus text set includes:
based on each corpus text, respectively performing separator duty ratio calculation to determine the separator duty ratio of each corpus text;
deleting the corpus texts to be deleted, the separator duty ratio of which is greater than or equal to a preset duty ratio threshold value, from the initial corpus text set to obtain a corpus text set to be processed;
and performing de-duplication operation on the corpus texts in the corpus text set to be processed to obtain a target corpus text set.
Optionally, the performing separator duty ratio calculation on a current corpus text in the initial corpus text set and determining the separator duty ratio of the current corpus text includes:
performing character recognition on one current corpus text in the initial corpus text set, and determining all target characters contained in the current corpus text; wherein the target character comprises a literal character and a separator;
Determining a total number of characters of the target character and a number of separators of the separators;
the ratio between the number of separators and the total number of characters is calculated, and the separator duty ratio of the current corpus text is determined.
Optionally, the performing a deduplication operation on the corpus text in the corpus text set to be processed to obtain a target corpus text set includes:
processing each corpus text in the corpus text set to be processed and converting the corpus text into binary strings;
determining repeated corpus text subsets in the corpus text set to be processed according to the binary string of each corpus text; the number of the repeated corpus texts in each repeated corpus text subset is larger than 1, and the difference degree of binary strings of any two repeated corpus texts in one repeated corpus text subset is smaller than a difference threshold;
determining repeated corpus texts to be deleted from the repeated corpus text subset; the absolute value of the difference between the number of the repeated corpus texts to be deleted and the number of the repeated corpus texts in the repeated corpus text subset where the repeated corpus texts to be deleted are located is 1;
deleting the repeated corpus text to be deleted from the corpus text set to be processed to obtain a target corpus text set.
Optionally, the method for determining the repeated corpus text comprises the following steps:
determining a first corpus text to be compared and a second corpus text to be compared from the corpus text set to be processed;
acquiring a first binary string of the first corpus text to be compared and a second binary string of the second corpus text to be compared;
comparing the first binary string with the second binary string bit by bit, judging whether numbers corresponding to the same position are consistent or not, and performing count value adding 1 operation when the numbers are inconsistent, wherein the initial value of the count value is zero;
taking the count value as the target difference degree, and judging whether the target difference degree is smaller than the difference threshold value or not;
if yes, determining the first corpus text to be compared and the second corpus text to be compared as repeated corpus texts.
Optionally, the performing semantic segmentation on the target corpus text in the target corpus text set to obtain a plurality of segmented vocabularies includes:
semantic segmentation is carried out on each target corpus text in the target corpus text set, so as to obtain a word segmentation set of each target corpus text; wherein one target corpus text corresponds to one word segmentation set; the word segmentation set comprises a plurality of words, and the sequence of the words contained in the word segmentation set is the same as the sequence of the words in the target corpus text corresponding to the word segmentation set;
combining any two adjacent word segments in the word segmentation set of each target corpus text to obtain a plurality of to-be-processed segmented words corresponding to each target corpus text;
determining a segmented word to be deleted from the plurality of segmented words to be processed;
and deleting the segmented vocabulary to be deleted from the plurality of segmented words to be processed, so as to obtain the remaining segmented words.
Optionally, updating the preset vocabulary library based on the divided vocabulary to obtain an updated vocabulary library, including:
determining the occurrence frequency of each divided word in the target corpus text corresponding to the divided word, and a target label information set corresponding to the target corpus text in which each divided word is positioned;
determining the association value of each divided word according to the occurrence frequency corresponding to each divided word and the target tag information set;
sorting the divided words based on the association values to obtain an association value sequence; the segmented vocabulary in the association value sequence is ordered from large to small according to the association value corresponding to the segmented vocabulary;
sequentially selecting a preset number of target related words from the related value sequence;
And adding the target related words to a preset vocabulary library to obtain an updated vocabulary library.
Optionally, the determining, according to the target label information set of the target corpus text, the professional corpus text and the general corpus text from the plurality of target corpus texts includes:
inputting each target corpus text in the target corpus text set into a pre-constructed quality scoring model to obtain the confusion degree of each target corpus text;
deleting the target corpus text with the confusion degree larger than or equal to a preset confusion degree threshold value from the target corpus text set to obtain a high-quality corpus text set; the high-quality corpus text set comprises a plurality of high-quality corpus texts and a high-quality label information set corresponding to each high-quality corpus text; the high-quality tag information set comprises a download address identifier and a domain identifier of a high-quality corpus text corresponding to the high-quality tag information set;
and determining professional corpus texts and general corpus texts from the high-quality corpus text set according to the belonging field identifiers in the high-quality label information set.
In addition, in order to achieve the above object, the present application further provides an apparatus for improving training effect of a large language model, the apparatus comprising:
The acquisition unit is used for acquiring an initial corpus text set; the initial corpus text set comprises a plurality of corpus texts, and each corpus text corresponds to one label information set; the label information set at least comprises a downloading address identifier of the corpus text and a domain identifier;
the preprocessing unit is used for preprocessing all the corpus texts in the initial corpus text set to obtain a target corpus text set;
the segmentation unit is used for carrying out semantic segmentation on the target corpus text in the target corpus text set to obtain a plurality of segmented words;
the updating unit is used for updating the preset vocabulary library based on the divided vocabulary to obtain an updated vocabulary library;
the determining unit is used for determining a professional corpus text and a general corpus text from the plurality of target corpus texts according to the target label information set of the target corpus text; the ratio of the number of the professional corpus texts to the number of the general corpus texts is a preset ratio;
and the training unit is used for training the pre-constructed large language model based on the updated vocabulary library, the professional corpus text and the general corpus text to obtain a trained large language model.
In addition, the present application also provides a computing device, including: at least one processor, memory, and input output unit; wherein the memory is for storing a computer program and the processor is for invoking the computer program stored in the memory to perform the method of any of the first aspects.
Furthermore, the application provides a computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of any of the first aspects.
According to the method, the device and the medium for improving the training effect of the large language model, the obtained initial corpus text set is preprocessed to obtain the target corpus text set; the target corpus text in the target corpus text set can be subjected to semantic segmentation to obtain a plurality of segmented words, and a preset vocabulary library can be updated according to the obtained plurality of segmented words; in addition, the professional corpus text and the general corpus text can be determined from a plurality of target corpus texts in the target corpus text set, and the pre-constructed large language model is trained based on the updated vocabulary library, the professional corpus text and the general corpus text. Therefore, through high-quality training data, the semantic understanding and reasoning capacity of the trained large language model for knowledge in the professional field can be improved, and the training effect of the large language model is improved.
Drawings
FIG. 1 is a flowchart of a method for improving training effects of a large language model according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a refinement flow chart of step S102 in FIG. 1;
FIG. 3 is a schematic diagram of a functional module of a device for improving training effect of a large language model according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a medium according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a computing device according to an embodiment of the present application.
Reference numerals illustrate: 50. a computing device; 501. a processing unit; 502. a system memory; 5021. RAM (random access memory); 5022. a cache memory; 5023. ROM (read only memory); 5024. a program module; 5025. a program/utility having a set of program modules; 503. a bus connecting the different system components; 504. an external device; 505. an I/O (input/output) interface; 506. a network adapter.
The realization, functional characteristics and advantages of the present application will be further described with reference to the embodiments, referring to the attached drawings.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Those skilled in the art will appreciate that embodiments of the present application may be implemented as a system, apparatus, device, method, or computer program product. Thus, the present application may be embodied in the form of: complete hardware, complete software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
In the prior art, through pre-training on a large-scale corpus, a large language model can acquire basic language understanding and generation skills. In this process, the size and quality of the pre-training corpus are critical for a large language model to acquire strong capabilities. However, when the data quality of the pre-training corpus is poor, it is difficult for a large language model trained on such a low-quality pre-training corpus to achieve the desired effect.
The main solutions of the embodiments of the present application are:
acquiring an initial corpus text set; the initial corpus text set comprises a plurality of corpus texts, and each corpus text corresponds to one label information set; the label information set at least comprises a downloading address identifier of the corpus text and a domain identifier;
Preprocessing all the corpus texts in the initial corpus text set to obtain a target corpus text set;
performing semantic segmentation on the target corpus text in the target corpus text set to obtain a plurality of segmented words;
updating a preset vocabulary library based on the divided vocabulary to obtain an updated vocabulary library;
according to the target label information set of the target corpus text, determining a professional corpus text and a general corpus text from the plurality of target corpus texts; the ratio of the number of the professional corpus texts to the number of the general corpus texts is a preset ratio;
training a pre-constructed large language model based on the updated vocabulary library, the professional corpus text and the general corpus text to obtain a trained large language model.
The application provides a solution, wherein the obtained initial corpus text set is preprocessed to obtain a target corpus text set; the target corpus text in the target corpus text set can be subjected to semantic segmentation to obtain a plurality of segmented words, and a preset vocabulary library can be updated according to the obtained plurality of segmented words; in addition, the professional corpus text and the general corpus text can be determined from a plurality of target corpus texts in the target corpus text set, and the pre-constructed large language model is trained based on the updated vocabulary library, the professional corpus text and the general corpus text. Therefore, through high-quality training data, the semantic understanding and reasoning capacity of the large language model for knowledge in the professional field can be improved, and the training effect of the large language model is improved.
It should be noted that any number of elements in the figures are for illustration and not limitation, and that any naming is used for distinction only and not for limitation.
The principles and spirit of the present application are explained in detail below with reference to several representative embodiments thereof.
Example 1
Referring now to fig. 1, fig. 1 is a flowchart illustrating a method for improving training effects of a large language model according to an embodiment of the present application. It should be noted that embodiments of the present application may be applied to any scenario where applicable.
The process of the method for improving training effect of large language model according to the embodiment of the present application shown in fig. 1 includes:
step S101, an initial corpus text set is obtained.
In this embodiment of the present application, the initial corpus text set includes a plurality of corpus texts, and each corpus text corresponds to one tag information set; the label information set at least comprises a downloading address identifier of the corpus text and a domain identifier.
In this embodiment of the present application, the download address identifier may be the web address, or an identifier of the web address, from which a corpus text was downloaded (e.g., website, forum, or encyclopedia addresses); the domain identifier may be the professional domain information (such as physics, chemistry, mathematics, new energy, etc.) corresponding to the corpus text.
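For illustration only, a corpus text and its label information set can be represented by a simple data structure. The following is a minimal Python sketch; the field names and example values are assumptions of this sketch, as the application does not prescribe a concrete schema:

```python
from dataclasses import dataclass

@dataclass
class CorpusText:
    """One corpus text with its label information set (field names are illustrative)."""
    text: str              # the raw corpus text
    download_address: str  # download address identifier, e.g. a website/forum/encyclopedia URL
    domain: str            # domain identifier, e.g. "physics", "chemistry", "new energy"

# An initial corpus text set is then simply a collection of such records.
initial_corpus_text_set = [
    CorpusText("…", "https://example-encyclopedia.org/entry/atherosclerosis", "medicine"),
    CorpusText("…", "https://example-forum.net/thread/456", "new energy"),
]
```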
Step S102, preprocessing all the corpus texts in the initial corpus text set to obtain a target corpus text set.
In the embodiment of the application, the target corpus text set includes a plurality of target corpus texts and a label information set corresponding to each target corpus text; the label information set at least comprises a download address identifier and a domain identifier of the corresponding target corpus text.
In this embodiment of the present application, the method for preprocessing the initial corpus text set may be: any one or more combinations of data cleaning, de-duplication, language screening, separator duty ratio screening and the like are performed on the corpus text in the initial corpus text set, and the embodiment of the application is not limited to this. By preprocessing the initial corpus text set, noisy, redundant, irrelevant and potentially harmful corpus text can be removed to improve the quality of corpus text contained in the target corpus text set.
In another embodiment of the present application, in order to improve the quality of the target corpus texts in the target corpus text set, the corpus texts with a higher separator duty ratio may be deleted from the initial corpus text set, so as to obtain a corpus text set to be processed; and a de-duplication operation is performed on the corpus text set to be processed to obtain the target corpus text set. As shown in fig. 2, step S102 may be refined into the following steps S201 to S203:
In step S201, a separator duty ratio calculation is performed on each corpus text, so as to determine the separator duty ratio of each corpus text.
As an optional implementation manner, step S201 performs a separator duty ratio calculation on a current corpus text in the initial corpus text set, and the manner of determining the separator duty ratio of the current corpus text may specifically include the following steps:
performing character recognition on one current corpus text in the initial corpus text set, and determining all target characters contained in the current corpus text; wherein the target character comprises a literal character and a separator;
determining a total number of characters of the target character and a number of separators of the separators;
the ratio between the number of separators and the total number of characters is calculated, and the separator duty ratio of the current corpus text is determined.
According to the embodiment, the total number of characters and the number of separators in each corpus text can be identified, the proportion of the separators in the current corpus text is determined based on the total number of characters and the number of separators, and the accuracy of the determined proportion of the separators is improved.
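As a minimal sketch of this calculation (the separator character set and the duty ratio threshold below are assumptions; the application does not enumerate either):

```python
SEPARATORS = set(",.;:!?|/\\-_~，。；：！？、")  # assumed separator set

def separator_duty_ratio(corpus_text: str) -> float:
    """Ratio between the number of separators and the total number of characters."""
    total_chars = len(corpus_text)  # total number of target characters
    if total_chars == 0:
        return 0.0
    separator_count = sum(1 for ch in corpus_text if ch in SEPARATORS)
    return separator_count / total_chars

PRESET_DUTY_RATIO_THRESHOLD = 0.4  # assumed value for illustration

# Texts whose duty ratio is >= the threshold would be deleted in step S202.
to_be_processed = [t for t in ["abc, def.", ",,,;;;!!!"]
                   if separator_duty_ratio(t) < PRESET_DUTY_RATIO_THRESHOLD]
```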
Step S202, deleting the corpus text to be deleted, the separator duty ratio of which is greater than or equal to a preset duty ratio threshold value, from the initial corpus text set to obtain a corpus text set to be processed.
Step S203, performing de-duplication operation on the corpus text in the corpus text set to be processed to obtain a target corpus text set.
As an optional implementation manner, the step S203 of performing a deduplication operation on the corpus text in the corpus text set to be processed to obtain the target corpus text set may specifically include the following steps:
respectively calculating each corpus text in the corpus text set to be processed to obtain a corresponding binary string;
determining repeated corpus text subsets in the corpus text set to be processed according to the binary string of each corpus text; the number of the repeated corpus texts in each repeated corpus text subset is larger than 1, and the difference degree of binary strings of any two repeated corpus texts in one repeated corpus text subset is smaller than a difference threshold;
determining repeated corpus texts to be deleted from the repeated corpus text subset; the absolute value of the difference between the number of the repeated corpus texts to be deleted and the number of the repeated corpus texts in the repeated corpus text subset where the repeated corpus texts to be deleted are located is 1;
Deleting the repeated corpus text to be deleted from the corpus text set to be processed to obtain a target corpus text set.
According to the embodiment, the corpus texts can be converted into the binary strings, the similarity of the two different corpus texts is determined based on the binary strings, and then the repeated corpus texts with higher similarity are deleted, so that only one repeated corpus text is finally reserved for training the large language model, the diversity of the corpus texts is ensured, the stability of the training process of the large language model is improved, and meanwhile, the performance of the model is improved.
As an alternative embodiment, the method for determining the repeated corpus text may include the steps of:
determining a first corpus text to be compared and a second corpus text to be compared from the corpus text set to be processed;
acquiring a first binary string of the first corpus text to be compared and a second binary string of the second corpus text to be compared;
comparing the first binary string with the second binary string bit by bit to obtain a target difference degree; wherein the target degree of difference is the number of positions with different numbers on the corresponding positions of the first binary string and the second binary string;
And if the target difference degree is smaller than the difference threshold value, determining the first corpus text to be compared and the second corpus text to be compared as repeated corpus texts.
In the embodiment of the application, the corpus text can be converted into the hash value through the SimHash algorithm, and then the hash value can be converted into the binary string, namely the binary string corresponding to the corpus text.
In another embodiment of the present application, any two corpus texts may be selected as a combination (including a first corpus text to be compared and a second corpus text to be compared), and the calculation mode of the difference degree of the binary strings corresponding to the two corpus texts may be:
determining a first binary string of the first corpus text to be compared and a second binary string of the second corpus text to be compared;
comparing the first binary string with the second binary string bit by bit, determining whether numbers at corresponding positions in the first binary string and the second binary string are the same, if not, adding 1 in a cumulative way, wherein the initial value is zero, and the final count value is the number of positions with different numbers at the corresponding positions;
and determining the number of these positions as the difference degree between the first corpus text to be compared and the second corpus text to be compared. When the difference degree is smaller than the difference threshold value, the two corpus texts in the combination are highly similar in content, so one of the two corpus texts is deleted, and the remaining one is combined with the other remaining corpus texts for further binary string comparison, until the difference degree of the binary strings corresponding to all the remaining corpus texts is larger than or equal to the difference threshold value, that is, no highly similar repeated corpus texts remain.
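A minimal sketch of the SimHash fingerprinting and the bit-by-bit comparison described above follows; the hash function, bit width, and difference threshold here are assumptions of the sketch, not values fixed by the application:

```python
import hashlib

def simhash_binary_string(words, bits=64):
    """Convert a segmented corpus text into a 64-bit SimHash binary string."""
    weights = [0] * bits
    for word in words:
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return "".join("1" if w > 0 else "0" for w in weights)

def difference_degree(first: str, second: str) -> int:
    """Compare two binary strings bit by bit and count the differing positions."""
    return sum(1 for a, b in zip(first, second) if a != b)

DIFFERENCE_THRESHOLD = 3  # assumed value; the application leaves it unspecified

bin_a = simhash_binary_string("atherosclerosis can cause unsmooth blood circulation".split())
bin_b = simhash_binary_string("atherosclerosis may cause unsmooth blood circulation".split())
if difference_degree(bin_a, bin_b) < DIFFERENCE_THRESHOLD:
    print("repeated corpus texts: keep one, delete the other")
```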
Optionally, in another embodiment, the method for deleting the repeated corpus text to be deleted from the corpus text set to be processed to obtain the target corpus text set further includes:
comparing values at the same positions based on the binary strings corresponding to all the corpus texts in the corpus text set to be processed, dividing the corpus texts with higher binary string similarity into one combination, comparing the difference degree inside each combination and deleting repeated texts, and then comparing the binary strings between groups bit by bit based on the remaining corpus texts in all the combinations, until the difference degree of the binary strings corresponding to all the remaining corpus texts is larger than or equal to the difference threshold value. In other embodiments, other comparison methods are possible, and the specific manner is not limited here.
By implementing the steps S201 to S203, corpus texts with a higher separator duty ratio can be deleted from the initial corpus text set to obtain a corpus text set to be processed; and a de-duplication operation is performed on the corpus text set to be processed to obtain the target corpus text set, so that the quality of the target corpus texts in the target corpus text set is improved.
Step S103, carrying out semantic segmentation on the target corpus text in the target corpus text set to obtain a plurality of segmented words.
In this embodiment of the present application, a preset word segmentation tool, such as THULAC, jieba, or LTP, may be used for segmentation; this is not limited in this embodiment of the present application.
As an optional implementation manner, the step S103 performs semantic segmentation on the target corpus text in the target corpus text set to obtain a plurality of segmented vocabulary, which may specifically include the following steps:
semantic segmentation is carried out on each target corpus text in the target corpus text set, so as to obtain a word segmentation set of each target corpus text; wherein one target corpus text corresponds to one word segmentation set; the word segmentation set comprises a plurality of words, and the sequence of the words contained in the word segmentation set is the same as the sequence of the words in the target corpus text corresponding to the word segmentation set;
combining any two adjacent word segments in the word segmentation set of each target corpus text to obtain a plurality of to-be-processed segmented words corresponding to each target corpus text;
determining a segmented word to be deleted from the plurality of segmented words to be processed; the segmented vocabulary to be deleted at least comprises vocabulary in the stop word list;
and deleting the segmented vocabulary to be deleted from the plurality of segmented words to be processed, so as to obtain the remaining segmented words.
According to this implementation, semantic segmentation can be carried out on each target corpus text to obtain a plurality of words, and adjacent words can be combined to obtain a plurality of segmented words, so that the segmented words obtained in this manner are richer in meaning, and the quality of the segmented words is improved.
In this embodiment of the present application, the segmented vocabulary to be deleted at least comprises vocabulary in the stop word list, such as auxiliary words, conjunctions, interjections, and other words carrying little information. The plurality of to-be-processed segmented words may further be identified through a pre-trained nonsense-word recognition model to determine meaningless segmented words to be deleted.
For example, the target corpus text may be: atherosclerosis can cause unsmooth blood circulation, and cause ischemia and necrosis of tissues and organs needing blood supply, thereby causing complications of multiple tissues and organs.
Artificial intelligence is an important component of the intellectual discipline that attempts to understand the essence of intelligence.
Semantic segmentation is carried out on the target corpus, and the obtained segmented word set can be { artery, atherosclerosis, caused, blood, circulation, unsmooth, letting, need, blood supply, tissues, organs, appearance, ischemia, necrosis, induction, multiple tissues, organs, appearance and complications };
Any two adjacent segmented words in the segmented word set are combined, and the obtained multiple segmented words can be: { atherosclerosis, which causes blood, causes blood circulation, circulation disorder, needs blood supply, tissue and organ appearance, ischemic necrosis, necrosis causes multiple tissues, tissue and organ appearance, complications and the like }.
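For reference, the segmentation-and-merge step can be sketched with the jieba segmenter mentioned above; the stop-word filtering and the nonsense-word recognition model are omitted here, and the printed outputs are only indicative:

```python
import jieba  # one of the word segmentation tools mentioned above

def segment_and_merge(target_corpus_text: str):
    """Segment a target corpus text, then merge every two adjacent words,
    preserving the original word order."""
    words = [w for w in jieba.lcut(target_corpus_text) if w.strip()]
    to_be_processed = [words[i] + words[i + 1] for i in range(len(words) - 1)]
    return words, to_be_processed

words, merged = segment_and_merge("动脉粥样硬化会引起血液循环不畅")
# words  -> e.g. ["动脉", "粥样", "硬化", "会", "引起", "血液循环", "不畅"]
# merged -> e.g. ["动脉粥样", "粥样硬化", "硬化会", "会引起", "引起血液循环", "血液循环不畅"]
```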
Step S104, updating a preset vocabulary library based on the divided vocabulary to obtain an updated vocabulary library.
As an optional implementation manner, the step S104 of updating the preset vocabulary library based on the divided vocabulary, and the manner of obtaining the updated vocabulary library may specifically include the following steps:
determining the occurrence frequency of each divided word in the target corpus text corresponding to the divided word, and a target label information set corresponding to the target corpus text in which each divided word is positioned;
determining the association value of each divided word according to the occurrence frequency corresponding to each divided word and the target tag information set; the calculation formula of the association value can be:

y_j = Σ_{i=1}^{N} f_{ij} · q_i · k_i

wherein: y_j is the association value of the j-th segmented word; i is the index of a target corpus text; N is the total number of target corpus texts; f_{ij} is the frequency of occurrence of the j-th segmented word in the i-th target corpus text; q_i is a text quality evaluation value of the i-th target corpus text, the quality evaluation value being a score given based on the professionalism of the download address identifier; k_i is a domain evaluation value, an evaluation value given based on the proximity of the belonging domain identifier to the user's desired domain. In other embodiments, a weighted average of these parameters over each target corpus text may be used; the specific manner is not limited.
Sorting the divided words based on the association values to obtain an association value sequence; the segmented vocabulary in the association value sequence is ordered from large to small according to the association value corresponding to the segmented vocabulary;
sequentially selecting a preset number of target related words from the related value sequence;
and adding the target related words to a preset vocabulary library to obtain an updated vocabulary library.
According to this implementation, the association value of each segmented word can be calculated according to the occurrence frequency of the segmented word in the target corpus text and the target label information set corresponding to the target corpus text, the segmented words can then be sorted based on the association values, and the segmented words with larger association values can be added into the preset vocabulary library to expand the original vocabulary of the pre-trained large language model. In this way, specific vocabulary in certain professional fields can be recognized more accurately during later semantic understanding or reasoning of the large language model, the intention of the user can be better understood, and more professional content can be returned, avoiding the situation that rich semantics cannot be carried because the vocabulary is too small, thereby improving the quality of the vocabulary stored in the vocabulary library.
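Under the association-value formula given above (whose exact product form is itself a reconstruction), the vocabulary update can be sketched as follows; the concrete frequencies, quality values, and domain values are placeholders:

```python
def association_value(freqs, quality_values, domain_values):
    """y_j = sum over i of f_ij * q_i * k_i, per the formula above."""
    return sum(f * q * k for f, q, k in zip(freqs, quality_values, domain_values))

def update_vocabulary(preset_vocabulary: set, candidates: dict, preset_number: int) -> set:
    """Sort segmented words by association value, descending, and add the
    top preset_number of them to the preset vocabulary library."""
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    return preset_vocabulary | set(ranked[:preset_number])

# Placeholder values for two target corpus texts (N = 2):
candidates = {
    "atherosclerosis": association_value([3, 1], [0.9, 0.6], [1.0, 0.8]),
    "blood circulation": association_value([2, 0], [0.9, 0.6], [1.0, 0.8]),
}
updated_vocabulary = update_vocabulary({"blood", "artery"}, candidates, preset_number=1)
```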
Step S105, according to the target label information set of the target corpus text, determining the professional corpus text and the general corpus text from the plurality of target corpus texts.
In this embodiment of the present application, the ratio of the number of professional corpus texts to the number of general corpus texts is a preset ratio. The preset ratio is optimally 1:1, but it can be fine-tuned for corpora of different scales: if the total number of fine-tuning training corpus texts is 500, the number ratio of professional corpus texts to general corpus texts may be 6:4; if the total number of fine-tuning training corpus texts is 5000, the number ratio of professional corpus texts to general corpus texts may be 5.5:4.5, and so on. A mixed corpus is formed by mixing the general domain corpus and the target corpus based on the preset ratio, and the mixed corpus is used for training together with the expanded vocabulary, so as to ensure the balance and diversity of the data, ensure that the final large language model retains both general answering capability and in-domain knowledge answering capability, avoid the large language model becoming excessively sensitive to texts of certain specific fields or types, keep the balance of positive and negative samples, and improve the generalization capability of the large language model.
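A minimal sketch of assembling such a mixed fine-tuning corpus at a preset ratio follows; the 6:4 ratio and the 500-text total mirror the example above, and both are tunable:

```python
import random

def mix_corpus(professional_texts, general_texts, ratio=(6, 4), total=500, seed=0):
    """Sample professional and general corpus texts at a preset number ratio
    and shuffle them so that training batches see both domains.
    Each input list must contain at least the number of texts to be sampled."""
    rng = random.Random(seed)
    n_professional = total * ratio[0] // (ratio[0] + ratio[1])
    n_general = total - n_professional
    mixed = (rng.sample(professional_texts, n_professional)
             + rng.sample(general_texts, n_general))
    rng.shuffle(mixed)
    return mixed
```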
As an optional implementation manner, the determining, in step S105, the professional corpus text and the general corpus text from the plurality of target corpus texts according to the target label information set of the target corpus text may specifically include the following steps:
inputting each target corpus text in the target corpus text set into a pre-constructed quality scoring model to obtain the confusion degree of each target corpus text;
deleting the target corpus text with the confusion degree larger than or equal to a preset confusion degree threshold value from the target corpus text set to obtain a high-quality corpus text set; the high-quality corpus text set comprises a plurality of high-quality corpus texts and a high-quality label information set corresponding to each high-quality corpus text; the high-quality tag information set comprises a download address identifier and a domain identifier of a high-quality corpus text corresponding to the high-quality tag information set;
and determining professional corpus texts and general corpus texts from the high-quality corpus text set according to the belonging field identifiers in the high-quality label information set.
According to this embodiment, the confusion degree of each target corpus text can be evaluated through the pre-constructed quality scoring model, and target corpus texts with a confusion degree larger than or equal to the preset confusion degree threshold value can be deleted, so that the corpus texts in the high-quality corpus text set are all high-quality corpus texts. Further, professional corpus texts and general corpus texts can be determined from the high-quality corpus text set according to the domain identifiers corresponding to the corpus texts, improving the quality of the professional corpus texts and the general corpus texts.
In the embodiment of the present application, the calculation manner of the confusion degree (i.e., perplexity) of the target corpus text may be:
dividing the target corpus text into independent sentences, performing word segmentation on each sentence, and then counting the probability of occurrence of two adjacent words in the target corpus text according to the following formula, where a smoothing value V needs to be added in order to avoid zero probabilities:

P(w_n | w_{n-1}) = (c(w_{n-1} w_n) + 1) / (c(w_{n-1}) + V)

wherein w_n is the current word, w_{n-1} is the previous adjacent word of the current word, and c(·) represents the number of occurrences of the word combination in brackets: the numerator counts the occurrences of the adjacent word pair, the denominator counts the total occurrences of the previous word in the target corpus text, and V is the total number of segmented words contained in the sentence where the current word is located, minus 1. The probabilities of all adjacent word pairs of the current sentence are then multiplied to obtain the probability of the current sentence.
The confusion degree of the current sentence is then obtained according to the following formula, where N is the number of segmented words of the current sentence and P(w_1 w_2 … w_N) is the sentence probability calculated by the quality scoring model:

PP(W) = P(w_1 w_2 … w_N)^(−1/N)
And then executing the same operation on other sentences in the whole target corpus text to obtain the confusion degree of all sentences, and finally obtaining the average confusion degree of the whole target corpus text, wherein the average confusion degree is the confusion degree of the whole target corpus text.
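The confusion degree computation just described can be sketched as follows; the bigram and unigram counts are assumed to have been gathered from the target corpus text beforehand, and the smoothing follows the formula reconstruction above:

```python
import math

def sentence_confusion_degree(sentence_words, bigram_counts, unigram_counts):
    """PP(W) = P(w1 w2 ... wN) ** (-1/N) for one segmented sentence,
    with smoothing value V = (number of words in the sentence) - 1."""
    n = len(sentence_words)
    v = n - 1
    log_p = 0.0
    for prev, cur in zip(sentence_words, sentence_words[1:]):
        numerator = bigram_counts.get((prev, cur), 0) + 1
        denominator = unigram_counts.get(prev, 0) + v
        log_p += math.log(numerator / denominator)
    return math.exp(-log_p / n)

def text_confusion_degree(sentences, bigram_counts, unigram_counts):
    """Average confusion degree over all sentences of the target corpus text."""
    values = [sentence_confusion_degree(s, bigram_counts, unigram_counts)
              for s in sentences]
    return sum(values) / len(values)
```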
And step S106, training a pre-constructed large language model based on the updated vocabulary library, the professional corpus text and the general corpus text to obtain a trained large language model.
By implementing the steps S101 to S106, the semantic understanding and reasoning capacity of the large language model aiming at the professional knowledge can be improved through high-quality training data, and the training effect of the large language model can be improved. In addition, the quality of the target corpus text in the target corpus text set can be improved. In addition, the accuracy of the determined separator duty ratio can be improved. In addition, the accuracy of repeated corpus text determination can be improved. In addition, the quality of the segmented vocabulary can be improved. In addition, the quality of the vocabulary stored in the vocabulary library can be improved. In addition, the method and the device can improve the specificity of the vocabulary stored in the vocabulary library. In addition, the method and the device can improve the quality of the professional corpus text and the general corpus text.
As an optional implementation manner, step S106 trains a pre-constructed large language model based on the updated vocabulary library, the specialized corpus text and the generic corpus text, and the obtaining the trained large language model includes the following steps:
And (3) verification: the method comprises the steps that a large model evaluation kit is used for respectively evaluating the response accuracy of general knowledge and specific domain knowledge of a first model and a second model, a first evaluation value set and a second evaluation value set are correspondingly obtained, the first evaluation value set at least comprises a first general domain evaluation value and a first specific domain evaluation value, and the second evaluation value set at least comprises a second general domain evaluation value and a second specific domain evaluation value; the first model is a large language model constructed in advance, and the second model is a large language model after preliminary training;
judging whether the first difference value is smaller than a first threshold value and whether the second difference value is larger than or equal to a second threshold value; the first difference value is the absolute value of the difference between the second general field evaluation value and the first general field evaluation value, the second difference value is the difference between the second specific field evaluation value and the first specific field evaluation value, the first threshold value is the maximum allowed reduction of the general knowledge response accuracy of the second model, and the second threshold value is a target value of the specific-domain knowledge response accuracy of the second model;
If yes, obtaining the trained large language model; if not, judging whether the first difference value is larger than or equal to the first threshold value, and judging whether the second difference value is larger than the second threshold value;
if yes, reducing the preset ratio according to a first preset ratio threshold, training a pre-constructed large language model based on the updated vocabulary library, the professional corpus text and the general corpus text, and returning to the verification step; if not, judging whether the first difference value is smaller than the first threshold value and whether the second difference value is larger than or equal to a third threshold value, wherein the third threshold value is a product value of a second preset proportion and the second threshold value;
if yes, increasing the preset ratio according to the first preset ratio threshold, increasing the original training time according to a preset time threshold, training a pre-constructed large language model based on the updated vocabulary library, the professional corpus text and the general corpus text, and returning to the verification step; if not, the original training time is prolonged according to the preset time threshold, training is carried out on a large language model built in advance based on the updated vocabulary library, the professional corpus text and the general corpus text, and the verification step is returned.
It can be understood that the general knowledge of the large language model after each training and the inference question-answering capability in the specific target field are evaluated, and the training mode of the large language model is selectively adjusted from two angles of training duration, the proportion of the professional corpus text and the general corpus text according to the evaluation result, so that the aims of remarkably shortening the training time and the calculation cost of the model while improving the performance of the large language model are fulfilled.
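The branching logic of the verification step can be condensed into the following sketch; all threshold values, the ratio step, and the time step are placeholders, and the evaluation values stand in for scores produced by the large-model evaluation kit:

```python
def adjust_training(e1_general, e1_specific, e2_general, e2_specific,
                    preset_ratio, train_time,
                    first_threshold=0.05, second_threshold=0.10,
                    second_preset_proportion=0.5,
                    ratio_step=0.1, time_step=1.0):
    """Decide whether training is finished or how to adjust it, per the
    verification step above. e1_* score the pre-constructed (first) model,
    e2_* the preliminarily trained (second) model; numeric defaults are assumptions."""
    first_diff = abs(e2_general - e1_general)   # loss of general knowledge accuracy
    second_diff = e2_specific - e1_specific     # gain in specific-domain accuracy
    third_threshold = second_preset_proportion * second_threshold

    if first_diff < first_threshold and second_diff >= second_threshold:
        return "done", preset_ratio, train_time
    if first_diff >= first_threshold and second_diff > second_threshold:
        # Too much general-domain forgetting: reduce the professional share.
        return "retrain", preset_ratio - ratio_step, train_time
    if first_diff < first_threshold and second_diff >= third_threshold:
        # Close to target: raise the professional share and train longer.
        return "retrain", preset_ratio + ratio_step, train_time + time_step
    # Otherwise: simply extend the training time.
    return "retrain", preset_ratio, train_time + time_step
```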
Example 2
Having described the method of the exemplary embodiment of the present application, an apparatus for improving training effects of a large language model according to the exemplary embodiment of the present application will be described with reference to fig. 3, where the apparatus includes an obtaining unit 301, a preprocessing unit 302, a segmentation unit 303, an updating unit 304, a determining unit 305, and a training unit 306, specifically:
the acquiring unit 301 may be configured to acquire an initial corpus text set; the initial corpus text set comprises a plurality of corpus texts, and each corpus text corresponds to one label information set; the label information set at least comprises a downloading address identifier of the corpus text and a domain identifier;
The preprocessing unit 302 may be configured to perform preprocessing on all the corpus texts in the initial corpus text set to obtain a target corpus text set;
the segmentation unit 303 may be configured to perform semantic segmentation on the target corpus text in the target corpus text set to obtain a plurality of segmented vocabularies;
the updating unit 304 may be configured to update a preset vocabulary library based on the divided vocabulary, to obtain an updated vocabulary library;
the determining unit 305 may be configured to determine, according to the target label information set of the target corpus text, a professional corpus text and a general corpus text from the plurality of target corpus texts; the ratio of the number of the professional corpus texts to the number of the general corpus texts is a preset ratio;
the training unit 306 may be configured to train the pre-built large language model based on the updated vocabulary library, the specialized corpus text, and the generic corpus text, to obtain a trained large language model.
As an optional implementation manner, the preprocessing unit 302 may perform preprocessing on all the corpus texts in the initial corpus text set to obtain the target corpus text set, which may specifically be:
Based on each corpus text, respectively performing separator duty ratio calculation to determine the separator duty ratio of each corpus text;
deleting the corpus texts to be deleted, the separator duty ratio of which is greater than or equal to a preset duty ratio threshold value, from the initial corpus text set to obtain a corpus text set to be processed;
and performing de-duplication operation on the corpus texts in the corpus text set to be processed to obtain a target corpus text set.
By implementing this implementation manner, the corpus texts with a higher separator duty ratio can be deleted from the initial corpus text set, so that a corpus text set to be processed is obtained; and a de-duplication operation is performed on the corpus text set to be processed to obtain the target corpus text set, so that the quality of the target corpus texts in the target corpus text set is improved.
As an optional implementation manner, the preprocessing unit 302 performs the separator duty ratio calculation on a current corpus text in the initial corpus text set, and the manner of determining the separator duty ratio of the current corpus text may specifically be:
performing character recognition on one current corpus text in the initial corpus text set, and determining all target characters contained in the current corpus text; wherein the target character comprises a literal character and a separator;
Determining a total number of characters of the target character and a number of separators of the separators;
the ratio between the number of separators and the total number of characters is calculated, and the separator duty ratio of the current corpus text is determined.
According to the embodiment, the total number of characters and the number of separators in each corpus text can be identified, the proportion of the separators in the current corpus text is determined based on the total number of characters and the number of separators, and the accuracy of the determined proportion of the separators is improved.
As an optional implementation manner, the preprocessing unit 302 performs a deduplication operation on the corpus text in the corpus text set to be processed, and a manner of obtaining the target corpus text set may specifically be:
processing each corpus text in the corpus text set to be processed and converting the corpus text into binary strings;
determining repeated corpus text subsets in the corpus text set to be processed according to the binary string of each corpus text; the number of the repeated corpus texts in each repeated corpus text subset is larger than 1, and the difference degree of binary strings of any two repeated corpus texts in one repeated corpus text subset is smaller than a difference threshold;
Determining repeated corpus texts to be deleted from the repeated corpus text subset; the absolute value of the difference between the number of the repeated corpus texts to be deleted and the number of the repeated corpus texts in the repeated corpus text subset where the repeated corpus texts to be deleted are located is 1;
deleting the repeated corpus text to be deleted from the corpus text set to be processed to obtain a target corpus text set.
According to the embodiment, the corpus texts can be converted into the binary strings, and the difference values of the two different corpus texts are determined based on the binary strings, so that repeated corpus texts are deleted, and the accuracy of determining the repeated corpus texts can be improved.
As an alternative embodiment, the manner in which the preprocessing unit 302 determines the repeated corpus text may specifically be:
determining a first corpus text to be compared and a second corpus text to be compared from the corpus text set to be processed;
acquiring a first binary string of the first corpus text to be compared and a second binary string of the second corpus text to be compared;
comparing the first binary string with the second binary string bit by bit to obtain a target difference degree; wherein the target degree of difference is the number of positions with different numbers on the corresponding positions of the first binary string and the second binary string;
And if the target difference degree is smaller than the difference threshold value, determining the first corpus text to be compared and the second corpus text to be compared as repeated corpus texts.
As an optional implementation manner, the segmentation unit 303 performs semantic segmentation on the target corpus text in the target corpus text set, and the manner of obtaining the plurality of segmented words may specifically be:
semantic segmentation is carried out on each target corpus text in the target corpus text set, so as to obtain a word segmentation set of each target corpus text; wherein one target corpus text corresponds to one word segmentation set; the word segmentation set comprises a plurality of words, and the sequence of the words contained in the word segmentation set is the same as the sequence of the words in the target corpus text corresponding to the word segmentation set;
combining any two adjacent word segments in the word segmentation set of each target corpus text to obtain a plurality of to-be-processed segmented words corresponding to each target corpus text;
determining a segmented word to be deleted from the plurality of segmented words to be processed;
and deleting the segmented vocabulary to be deleted from the plurality of segmented words to be processed, so as to obtain the remaining segmented words.
According to this implementation, semantic segmentation can be carried out on each target corpus text to obtain a plurality of words, and adjacent words can be combined to obtain a plurality of segmented words.
As an optional implementation manner, the updating unit 304 updates the preset vocabulary library based on the divided vocabulary, and the manner of obtaining the updated vocabulary library may specifically be:
determining the occurrence frequency of each segmented word in its corresponding target corpus text, and the target tag information set of the target corpus text in which each segmented word is located;
determining the association value of each segmented word according to its occurrence frequency and the target tag information set;
sorting the segmented words based on the association values to obtain an association value sequence; the segmented words in the association value sequence are ordered from the largest association value to the smallest;
sequentially selecting a preset number of target associated words from the association value sequence;
and adding the target associated words to the preset vocabulary library to obtain the updated vocabulary library.
According to this implementation, the association value of each segmented word can be calculated from its occurrence frequency in the target corpus text and the target tag information of that text; the segmented words can then be sorted by association value, and the segmented words with larger association values added to the preset vocabulary library, improving the quality of the vocabulary stored in the vocabulary library.
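One plausible reading, sketched below, scores each segmented word by its occurrence frequency weighted by a per-domain tag weight; the weight table and the multiplicative formula are assumptions, since the embodiment only states that the association value is derived from the frequency and the target tag information set.

```python
# Sketch only: DOMAIN_WEIGHT and the frequency-times-weight formula are
# assumed; the embodiment does not fix the exact association formula.
from collections import Counter

DOMAIN_WEIGHT = {"medical": 2.0, "general": 1.0}  # hypothetical tag weights

def update_vocabulary_library(vocab: set[str],
                              texts_with_tags: list[tuple[list[str], str]],
                              preset_number: int) -> set[str]:
    scores: Counter[str] = Counter()
    for segmented_words, domain_tag in texts_with_tags:
        weight = DOMAIN_WEIGHT.get(domain_tag, 1.0)
        for word, count in Counter(segmented_words).items():
            scores[word] += count * weight
    # association value sequence, ordered from large to small; take the
    # preset number of target associated words and add them to the library
    target_words = [w for w, _ in scores.most_common(preset_number)]
    return vocab | set(target_words)
```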
As an optional implementation manner, the manner in which the determining unit 305 determines professional corpus texts and general corpus texts from the plurality of target corpus texts according to the target tag information sets of the target corpus texts may specifically be:
inputting each target corpus text of the target corpus text set into a pre-constructed quality scoring model to obtain the confusion degree (perplexity) of each target corpus text;
deleting the target corpus texts whose confusion degree is larger than or equal to a preset confusion degree threshold from the target corpus text set to obtain a high-quality corpus text set; the high-quality corpus text set comprises a plurality of high-quality corpus texts and a high-quality tag information set corresponding to each high-quality corpus text; each high-quality tag information set comprises the download address identifier and domain identifier of its corresponding high-quality corpus text;
and determining professional corpus texts and general corpus texts from the high-quality corpus text set according to the domain identifiers in the high-quality tag information sets.
According to this embodiment, the confusion degree of each target corpus text can be evaluated by the pre-constructed quality scoring model, and target corpus texts whose confusion degree is larger than or equal to the preset confusion degree threshold can be deleted, so that the remaining corpus texts in the high-quality corpus text set are of high quality; professional corpus texts and general corpus texts can then be determined from the high-quality corpus text set according to the domain identifiers of the corpus texts, improving the quality of both.
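A sketch of the filter and domain split follows; `score` stands in for the pre-constructed quality scoring model and is assumed to return a perplexity, e.g. the exponential of the mean per-token negative log-likelihood under a small language model.

```python
# Sketch: `score` is a hypothetical stand-in for the quality scoring model.
import math

def perplexity(neg_log_likelihoods: list[float]) -> float:
    """Confusion degree as perplexity from per-token negative log-likelihoods."""
    return math.exp(sum(neg_log_likelihoods) / len(neg_log_likelihoods))

def filter_high_quality(texts: list[dict], score, threshold: float) -> list[dict]:
    """Keep texts whose confusion degree is below the preset threshold; each
    dict carries the text plus its download address and domain identifiers."""
    return [t for t in texts if score(t["text"]) < threshold]

def split_by_domain(high_quality: list[dict], professional_domain: str):
    """Split high-quality texts into professional and general corpus texts."""
    professional = [t for t in high_quality if t["domain"] == professional_domain]
    general = [t for t in high_quality if t["domain"] != professional_domain]
    return professional, general
```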
By implementing these embodiments, high-quality training data can improve the semantic understanding and reasoning capability of the large language model for professional knowledge, improving its training effect. In addition, the quality of the target corpus texts in the target corpus text set can be improved; the accuracy of the determined separator ratio can be improved; the accuracy of repeated corpus text determination can be improved; the quality of the segmented words can be improved; the quality of the vocabulary stored in the vocabulary library can be improved; and the quality of the professional corpus texts and general corpus texts can be improved.
Example III
Having described the method and apparatus of the exemplary embodiments of the present application, a computer-readable storage medium of the exemplary embodiments is described next with reference to fig. 4, which shows a computer-readable storage medium in the form of an optical disc 40 having a computer program (i.e., a program product) stored thereon that, when executed by a processor, implements the steps described in the method embodiments, e.g.: acquiring an initial corpus text set, wherein the initial corpus text set comprises a plurality of corpus texts, each corpus text corresponds to one tag information set, and the tag information set at least comprises a download address identifier and a domain identifier of the corpus text; preprocessing all the corpus texts in the initial corpus text set to obtain a target corpus text set; performing semantic segmentation on the target corpus texts in the target corpus text set to obtain a plurality of segmented words; updating a preset vocabulary library based on the segmented words to obtain an updated vocabulary library; determining professional corpus texts and general corpus texts from the plurality of target corpus texts according to the target tag information sets of the target corpus texts, the ratio of the number of the professional corpus texts to the number of the general corpus texts being a preset ratio; and training a pre-constructed large language model based on the updated vocabulary library, the professional corpus texts, and the general corpus texts to obtain a trained large language model. The specific implementation of each step is not repeated here.
It should be noted that examples of the computer readable storage medium may also include, but are not limited to, a phase change memory (PRAM), a Static Random Access Memory (SRAM), a Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a flash memory, or other optical or magnetic storage medium, which will not be described in detail herein.
Example IV
Having described the methods, apparatus, and media of exemplary embodiments of the present application, next, a computing device for model processing of exemplary embodiments of the present application is described with reference to fig. 5.
Fig. 5 illustrates a block diagram of an exemplary computing device 50 suitable for implementing embodiments of the present application; the computing device 50 may be a computer system or a server. The computing device 50 shown in fig. 5 is merely an example and should not impose any limitation on the functionality or scope of use of embodiments of the present application.
As shown in fig. 5, components of computing device 50 may include, but are not limited to: one or more processors or processing units 501, a system memory 502, and a bus 503 that connects the various system components (including the system memory 502 and processing units 501).
Computing device 50 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computing device 50 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 502 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 5021 and/or cache memory 5022. Computing device 50 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 5023 may be provided for reading from and writing to a non-removable, nonvolatile magnetic medium (not shown in fig. 5 and commonly referred to as a "hard drive"). Although not shown in fig. 5, a magnetic disk drive for reading from and writing to a removable nonvolatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media), may be provided. In such cases, each drive may be connected to bus 503 by one or more data media interfaces. The system memory 502 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of the embodiments of the present application.
A program/utility 5025 having a set (at least one) of program modules 5024 may be stored in, for example, system memory 502, and such program modules 5024 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 5024 generally perform the functions and/or methods in the embodiments described herein.
Computing device 50 may also communicate with one or more external devices 504 (e.g., keyboard, pointing device, display, etc.). Such communication may occur through an input/output (I/O) interface 505. Moreover, computing device 50 may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 506. As shown in fig. 5, network adapter 506 communicates with other modules of computing device 50 (e.g., processing unit 501, etc.) over bus 503 that connects the various system components. It should be appreciated that although not shown in fig. 5, other hardware and/or software modules may be used in connection with computing device 50.
The processing unit 501 executes various functional applications and data processing by running programs stored in the system memory 502, for example: acquiring an initial corpus text set, wherein the initial corpus text set comprises a plurality of corpus texts, each corpus text corresponds to one tag information set, and the tag information set at least comprises a download address identifier and a domain identifier of the corpus text; preprocessing all the corpus texts in the initial corpus text set to obtain a target corpus text set; performing semantic segmentation on the target corpus texts in the target corpus text set to obtain a plurality of segmented words; updating a preset vocabulary library based on the segmented words to obtain an updated vocabulary library; determining professional corpus texts and general corpus texts from the plurality of target corpus texts according to the target tag information sets of the target corpus texts, the ratio of the number of the professional corpus texts to the number of the general corpus texts being a preset ratio; and training a pre-constructed large language model based on the updated vocabulary library, the professional corpus texts, and the general corpus texts to obtain a trained large language model. The specific implementation of each step is not repeated here. It should be noted that although several units/modules or sub-units/sub-modules of the apparatus for improving the training effect of a large language model are mentioned in the detailed description above, such a division is merely exemplary and not mandatory. Indeed, according to embodiments of the present application, the features and functions of two or more units/modules described above may be embodied in one unit/module; conversely, the features and functions of one unit/module described above may be further divided into a plurality of units/modules.
In the description of the present application, it should be noted that the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. The apparatus embodiments described above are merely illustrative; for example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed between components may be indirect coupling or communication connection through some communication interfaces, devices, or units, and may be in electrical, mechanical, or other form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a processor-executable non-volatile computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the foregoing examples are merely specific embodiments of the present application, intended to illustrate rather than limit its technical solutions, and the scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art will appreciate that the technical solutions described in the foregoing embodiments may still be modified or readily varied, and some of the technical features may be equivalently substituted, within the technical scope of the present disclosure; such modifications, changes, or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application and are intended to be included in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Furthermore, although the operations of the methods of the present application are depicted in the drawings in a particular order, this should not be understood as requiring or suggesting that these operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.

Claims (10)

1. A method for improving the training effect of a large language model, the method comprising:
acquiring an initial corpus text set; the initial corpus text set comprises a plurality of corpus texts, and each corpus text corresponds to one tag information set; the tag information set at least comprises a download address identifier and a domain identifier of the corpus text;
preprocessing all the corpus texts in the initial corpus text set to obtain a target corpus text set;
performing semantic segmentation on the target corpus text in the target corpus text set to obtain a plurality of segmented words;
updating a preset vocabulary library based on the segmented words to obtain an updated vocabulary library;
determining professional corpus texts and general corpus texts from the plurality of target corpus texts according to the target tag information sets of the target corpus texts; the ratio of the number of the professional corpus texts to the number of the general corpus texts is a preset ratio;
training a pre-constructed large language model based on the updated vocabulary library, the professional corpus text and the general corpus text to obtain a trained large language model.
2. The method for improving the training effect of a large language model according to claim 1, wherein the preprocessing all the corpus texts in the initial corpus text set to obtain a target corpus text set comprises:
performing a separator ratio calculation on each corpus text, respectively, to determine the separator ratio of each corpus text;
deleting, from the initial corpus text set, the corpus texts to be deleted whose separator ratio is greater than or equal to a preset ratio threshold, to obtain a corpus text set to be processed;
and performing de-duplication operation on the corpus texts in the corpus text set to be processed to obtain a target corpus text set.
3. The method for improving the training effect of a large language model according to claim 2, wherein the method for determining the separator ratio of each corpus text comprises:
performing character recognition on one current corpus text in the initial corpus text set, and determining all target characters contained in the current corpus text; wherein the target characters comprise literal characters and separators;
determining the total number of characters of the target characters and the number of separators;
and calculating the ratio between the number of separators and the total number of characters to determine the separator ratio of the current corpus text.
4. The method for improving the training effect of a large language model according to claim 2, wherein the performing a de-duplication operation on the corpus texts in the corpus text set to be processed to obtain a target corpus text set comprises:
converting each corpus text in the corpus text set to be processed into a corresponding binary string;
determining repeated corpus text subsets in the corpus text set to be processed according to the binary string of each corpus text; the number of repeated corpus texts in each repeated corpus text subset is larger than 1, and the difference degree of the binary strings of any two repeated corpus texts in one repeated corpus text subset is smaller than a difference threshold;
determining repeated corpus texts to be deleted from the repeated corpus text subsets; the absolute value of the difference between the number of the repeated corpus texts to be deleted and the number of the repeated corpus texts in the repeated corpus text subset where they are located is 1;
and deleting the repeated corpus texts to be deleted from the corpus text set to be processed to obtain a target corpus text set.
5. The method for improving the training effect of a large language model according to claim 1, wherein the performing semantic segmentation on the target corpus text in the target corpus text set to obtain a plurality of segmented words comprises:
performing semantic segmentation on each target corpus text in the target corpus text set to obtain a word segmentation set of each target corpus text; wherein one target corpus text corresponds to one word segmentation set; the word segmentation set comprises a plurality of word segments, and the order of the word segments contained in the word segmentation set is the same as the order of the words in the target corpus text corresponding to the word segmentation set;
combining any two adjacent word segments in the word segmentation set of each target corpus text to obtain a plurality of to-be-processed segmented words corresponding to each target corpus text;
determining segmented words to be deleted from the plurality of to-be-processed segmented words, wherein the segmented words to be deleted at least comprise words contained in a stop word list;
and deleting the segmented words to be deleted from the plurality of to-be-processed segmented words, so as to obtain the segmented words after deletion.
6. The method for improving the training effect of a large language model according to claim 5, wherein the updating a preset vocabulary library based on the segmented words to obtain an updated vocabulary library comprises:
determining the occurrence frequency of each segmented word in its corresponding target corpus text, and the target tag information set of the target corpus text in which each segmented word is located;
determining the association value of each segmented word according to its occurrence frequency and the target tag information set;
sorting the segmented words based on the association values to obtain an association value sequence; the segmented words in the association value sequence are ordered from the largest association value to the smallest;
sequentially selecting a preset number of target associated words from the association value sequence;
and adding the target associated words to the preset vocabulary library to obtain the updated vocabulary library.
7. The method for improving the training effect of a large language model according to claim 1, wherein the training a pre-constructed large language model based on the updated vocabulary library, the professional corpus text and the general corpus text to obtain a trained large language model comprises:
a verification step: using a large model evaluation kit to respectively evaluate the response accuracy of a first model and a second model on general knowledge and specific-domain knowledge, correspondingly obtaining a first evaluation value set and a second evaluation value set, wherein the first evaluation value set at least comprises a first general-domain evaluation value and a first specific-domain evaluation value, and the second evaluation value set at least comprises a second general-domain evaluation value and a second specific-domain evaluation value; the first model is the pre-constructed large language model, and the second model is the large language model obtained after each training;
judging whether a first difference value is smaller than a first threshold and whether a second difference value is larger than or equal to a second threshold; the first difference value is the absolute value of the difference between the second general-domain evaluation value and the first general-domain evaluation value, the second difference value is the difference between the second specific-domain evaluation value and the first specific-domain evaluation value, the first threshold is the maximum allowed decrease in the general-knowledge response accuracy of the second model, and the second threshold is a target value of the specific-domain knowledge response accuracy of the second model;
if yes, outputting the trained large language model; if not, judging whether the first difference value is larger than or equal to the first threshold and whether the second difference value is larger than the second threshold;
if yes, reducing the preset ratio according to a first preset ratio threshold, training the pre-constructed large language model based on the updated vocabulary library, the professional corpus text and the general corpus text, and returning to the verification step; if not, judging whether the first difference value is smaller than the first threshold and whether the second difference value is larger than or equal to a third threshold, wherein the third threshold is the product of a second preset proportion and the second threshold;
if yes, increasing the preset ratio according to the first preset ratio threshold, extending the original training time according to a preset time threshold, training the pre-constructed large language model based on the updated vocabulary library, the professional corpus text and the general corpus text, and returning to the verification step; if not, extending the original training time according to the preset time threshold, training the pre-constructed large language model based on the updated vocabulary library, the professional corpus text and the general corpus text, and returning to the verification step.
8. The method for improving the training effect of a large language model according to any one of claims 1 to 7, wherein the determining professional corpus texts and general corpus texts from the plurality of target corpus texts according to the target tag information sets of the target corpus texts comprises:
inputting each target corpus text of the target corpus text set into a pre-constructed quality scoring model to obtain the confusion degree of each target corpus text;
deleting the target corpus texts whose confusion degree is larger than or equal to a preset confusion degree threshold from the target corpus text set to obtain a high-quality corpus text set; the high-quality corpus text set comprises a plurality of high-quality corpus texts and a high-quality tag information set corresponding to each high-quality corpus text; each high-quality tag information set comprises the download address identifier and domain identifier of its corresponding high-quality corpus text;
and determining professional corpus texts and general corpus texts from the high-quality corpus text set according to the domain identifiers in the high-quality tag information sets.
9. An apparatus for improving the training effect of a large language model, the apparatus comprising:
The acquisition unit is used for acquiring an initial corpus text set; the initial corpus text set comprises a plurality of corpus texts, and each corpus text corresponds to one tag information set; the tag information set at least comprises a download address identifier and a domain identifier of the corpus text;
the preprocessing unit is used for preprocessing all the corpus texts in the initial corpus text set to obtain a target corpus text set;
the segmentation unit is used for carrying out semantic segmentation on the target corpus text in the target corpus text set to obtain a plurality of segmented words;
the updating unit is used for updating the preset vocabulary library based on the segmented words to obtain an updated vocabulary library;
the determining unit is used for determining professional corpus texts and general corpus texts from the plurality of target corpus texts according to the target tag information sets of the target corpus texts; the ratio of the number of the professional corpus texts to the number of the general corpus texts is a preset ratio;
and the training unit is used for training the pre-constructed large language model based on the updated vocabulary library, the professional corpus text and the general corpus text to obtain a trained large language model.
10. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method for improving the training effect of a large language model according to any one of claims 1 to 8.
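For illustration only (not part of the claims): the separator-ratio test of claim 3 and the verification loop of claim 7 can be sketched as follows, with `evaluate` and `train` as hypothetical stand-ins for the large model evaluation kit and the training routine; all names and parameters below are assumptions made for the example.

```python
# Sketch of claims 3 and 7; all function names and parameters are
# illustrative assumptions, not the claimed implementation.
SEPARATORS = set(",.;:!?，。；：！？ \t\n")  # assumed separator inventory

def separator_ratio(corpus_text: str) -> float:
    """Claim 3: number of separators divided by total number of characters."""
    total = len(corpus_text)
    return sum(ch in SEPARATORS for ch in corpus_text) / total if total else 0.0

def train_with_verification(base_model, corpus, ratio, train_time,
                            t1, t2, second_preset, ratio_step, time_step,
                            evaluate, train):
    """Claim 7: adjust the professional/general ratio and the training time
    until the general-domain drop stays under t1 while the specific-domain
    gain reaches t2."""
    g1, s1 = evaluate(base_model)          # first evaluation value set
    while True:                            # "returning to the verification step"
        model = train(base_model, corpus, ratio, train_time)
        g2, s2 = evaluate(model)           # second evaluation value set
        d1 = abs(g2 - g1)                  # first difference value
        d2 = s2 - s1                       # second difference value
        if d1 < t1 and d2 >= t2:
            return model                   # output the trained model
        if d1 >= t1 and d2 > t2:
            ratio -= ratio_step            # too much general-knowledge loss
        elif d1 < t1 and d2 >= second_preset * t2:
            ratio += ratio_step            # close to target: more professional
            train_time += time_step        # corpus and longer training
        else:
            train_time += time_step        # otherwise only extend training
```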
CN202410164274.8A 2024-02-05 2024-02-05 Method, device and medium for improving training effect of large language model Pending CN117709355A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410164274.8A CN117709355A (en) 2024-02-05 2024-02-05 Method, device and medium for improving training effect of large language model

Publications (1)

Publication Number Publication Date
CN117709355A 2024-03-15

Family ID: 90144707

Country Status (1)

Country Link
CN (1) CN117709355A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220147715A1 (en) * 2019-05-16 2022-05-12 Huawei Technologies Co., Ltd. Text processing method, model training method, and apparatus
CN115545010A (en) * 2022-10-12 2022-12-30 阿里巴巴(中国)有限公司 Training method, device and equipment for generating network by navigation broadcast statement
US20230334263A1 (en) * 2022-04-13 2023-10-19 Abridge AI, Inc. Automating follow-up actions from conversations
CN117290500A (en) * 2022-06-16 2023-12-26 马上消费金融股份有限公司 Professional word stock construction method, device, medium and program product

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
DANNI YU et al.: "Assessing the potential of LLM-assisted annotation for corpus-based pragmatics and discourse analysis: The case of apologies", ResearchGate, 31 May 2023 (2023-05-31), pages 1-30 *
KHUSHI BHARDWAJ et al.: "Pre-training LLMs using human-like development data corpus", arXiv, 10 January 2024 (2024-01-10), pages 1-7 *
TANG YIZHI: "Research on domain concept extraction and relation analysis based on HowNet", Natural Science Journal of Xiangtan University, no. 01, 15 March 2009 (2009-03-15), pages 135-140 *
WANG RU; WANG JIAMEI; WANG WEIQUAN; FU FEI: "Fine-grained sentiment analysis of microblog text under a deep learning framework", Computer Systems & Applications, no. 05, 15 May 2020 (2020-05-15), pages 19-28 *
HE JIAOJIAO: "Research on automatic text classification of educational technology academic papers based on deep learning", China Master's Theses Full-text Database (Social Sciences II), no. 1, 15 January 2019 (2019-01-15), pages 127-89 *
ZHAO LIJUN et al.: "Research on medical language models", Changjiang Information & Communication, vol. 36, no. 11, 15 November 2023 (2023-11-15), pages 1-7 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination