CN117709355B - Method, device and medium for improving training effect of large language model - Google Patents

Method, device and medium for improving training effect of large language model

Info

Publication number
CN117709355B
Authority
CN
China
Prior art keywords
corpus
corpus text
text
target
value
Prior art date
Legal status
Active
Application number
CN202410164274.8A
Other languages
Chinese (zh)
Other versions
CN117709355A (en)
Inventor
王帅
周舒婷
雷成铭
陈玉梅
张光谱
Current Assignee
Sichuan Shutian Information Technology Co ltd
Original Assignee
Sichuan Shutian Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Sichuan Shutian Information Technology Co ltd
Priority to CN202410164274.8A
Publication of CN117709355A
Application granted
Publication of CN117709355B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G06F 40/237 Lexical tools
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F 40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a method, a device and a medium for improving the training effect of a large language model. The method comprises the following steps: acquiring an initial corpus text set, wherein the initial corpus text set comprises a plurality of corpus texts and each corpus text corresponds to one label information set; preprocessing all the corpus texts in the initial corpus text set to obtain a target corpus text set; carrying out semantic segmentation on the target corpus texts in the target corpus text set to obtain a plurality of segmented words; updating a preset vocabulary library based on the segmented words to obtain an updated vocabulary library; according to the target label information set of each target corpus text, determining professional corpus texts and general corpus texts from the plurality of target corpus texts; and training a pre-constructed large language model based on the updated vocabulary library, the professional corpus texts and the general corpus texts to obtain a trained large language model. The application can improve the effect of the large language model obtained by training.

Description

Method, device and medium for improving training effect of large language model
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a method, a device and a medium for improving training effect of a large language model.
Background
A large language model acquires the ability to handle natural language processing tasks by pre-training on massive text data. However, a pre-trained large language model used directly for question-answering tasks in a specific field often gives unsatisfactory results. To better adapt to question-answering tasks in a specific field, high-quality data must be extracted and annotated from corpus texts of the professional field, and the annotated data used to fine-tune the model so that it adapts to the specific task or field, further improving its question-answering capability.
A large language model that meets human expectations and performs well in all respects must not only maintain excellent semantic understanding and common-sense reasoning in the general domain, but also adapt well to question-answering tasks in specific fields. However, if data is processed incorrectly while fine-tuning a large language model, the trained model easily under-fits or over-fits: it either fails to learn knowledge of the specific professional domain, or forgets its general-domain semantic understanding and common-sense reasoning by over-adapting to the target domain and specific task. As a result, the large language model struggles to reach the expected level on question-answering tasks posed by users.
Disclosure of Invention
The application mainly aims to provide a method, a device and a medium for improving the training effect of a large language model, and aims to solve the technical problem that the large language model obtained through training is difficult to achieve the expected effect.
In order to achieve the above object, the present application provides a method for improving training effect of a large language model, the method comprising:
Acquiring an initial corpus text set; the initial corpus text set comprises a plurality of corpus texts, and each corpus text corresponds to one label information set; the label information set at least comprises a downloading address identifier of the corpus text and a domain identifier;
Preprocessing all the corpus texts in the initial corpus text set to obtain a target corpus text set;
Performing semantic segmentation on the target corpus text in the target corpus text set to obtain a plurality of segmented words;
updating a preset vocabulary library based on the divided vocabulary to obtain an updated vocabulary library;
According to the target label information set of the target corpus text, determining a professional corpus text and a general corpus text from a plurality of target corpus texts; the ratio of the number of the professional corpus texts to the number of the general corpus texts is a preset ratio;
training a pre-constructed large language model based on the updated vocabulary library, the professional corpus text and the general corpus text to obtain a trained large language model.
Optionally, the preprocessing all the corpus texts in the initial corpus text set to obtain a target corpus text set includes:
Based on each corpus text, respectively performing a separator duty ratio calculation to determine the separator duty ratio of each corpus text;
Deleting the corpus texts to be deleted, the separator duty ratio of which is greater than or equal to a preset duty ratio threshold value, from the initial corpus text set to obtain a corpus text set to be processed;
and performing de-duplication operation on the corpus texts in the corpus text set to be processed to obtain a target corpus text set.
Optionally, the calculating the separator duty ratio of the current corpus text in the initial corpus text set, and determining the separator duty ratio of the current corpus text includes:
performing character recognition on one current corpus text in the initial corpus text set, and determining all target characters contained in the current corpus text; wherein the target character comprises a literal character and a separator;
determining a total number of characters of the target character and a number of separators of the separators;
and calculating the ratio between the number of separators and the total number of characters to determine the separator duty ratio of the current corpus text.
Optionally, the performing a deduplication operation on the corpus text in the corpus text set to be processed to obtain a target corpus text set includes:
Processing each corpus text in the corpus text set to be processed and converting the corpus text into binary strings;
determining repeated corpus text subsets in the corpus text set to be processed according to the binary string of each corpus text; the number of the repeated corpus texts in each repeated corpus text subset is greater than 1, and the degree of difference between the binary strings of any two repeated corpus texts in one repeated corpus text subset is smaller than a difference threshold;
determining repeated corpus texts to be deleted from the repeated corpus text subset; the absolute value of the difference between the number of the repeated corpus texts to be deleted and the number of the repeated corpus texts in the repeated corpus text subset where the repeated corpus texts to be deleted are located is 1;
Deleting the repeated corpus text to be deleted from the corpus text set to be processed to obtain a target corpus text set.
Optionally, the method for determining the repeated corpus text comprises the following steps:
Determining a first corpus text to be compared and a second corpus text to be compared from the corpus text set to be processed;
acquiring a first binary string of the first corpus text to be compared and a second binary string of the second corpus text to be compared;
comparing the first binary string with the second binary string bit by bit, judging whether the digits at the same position are consistent, and incrementing a count value by 1 whenever they are inconsistent, wherein the initial value of the count value is zero;
taking the count value as the target difference degree, and judging whether the target difference degree is smaller than the difference threshold value or not;
if yes, determining the first corpus text to be compared and the second corpus text to be compared as repeated corpus texts.
Optionally, the performing semantic segmentation on the target corpus text in the target corpus text set to obtain a plurality of segmented vocabularies includes:
Semantic segmentation is carried out on each target corpus text in the target corpus text set to obtain a word segmentation set of each target corpus text; wherein one target corpus text corresponds to one word segmentation set; the word segmentation set comprises a plurality of words, and the order of the words contained in the word segmentation set is the same as the order of the words in the target corpus text corresponding to the word segmentation set;
Combining any two adjacent word segments in the word segmentation set of each target corpus text to obtain a plurality of to-be-processed segmented words corresponding to each target corpus text;
determining a segmented word to be deleted from the plurality of segmented words to be processed;
And deleting the segmented vocabulary to be deleted from the plurality of segmented vocabularies to be processed, so as to obtain the deleted segmented vocabulary.
Optionally, updating the preset vocabulary library based on the divided vocabulary to obtain an updated vocabulary library, including:
Determining the occurrence frequency of each divided word in the target corpus text corresponding to the divided word, and a target label information set corresponding to the target corpus text in which each divided word is positioned;
Determining the association value of each divided word according to the occurrence frequency corresponding to each divided word and the target tag information set;
sorting the divided words based on the association values to obtain an association value sequence; the segmented vocabulary in the association value sequence is ordered from large to small according to the association value corresponding to the segmented vocabulary;
sequentially selecting a preset number of target related words from the related value sequence;
and adding the target related words to a preset vocabulary library to obtain an updated vocabulary library.
Optionally, the determining, according to the target tag information set of the target corpus text, a professional corpus text and a general corpus text from the multiple target corpus texts includes:
Inputting each target corpus text of the target corpus text set into a pre-constructed quality scoring model to obtain the confusion degree (perplexity) of each target corpus text;
Deleting the target corpus text with the confusion degree larger than or equal to a preset confusion degree threshold value from the target corpus text set to obtain a high-quality corpus text set; the high-quality corpus text set comprises a plurality of high-quality corpus texts and a high-quality label information set corresponding to each high-quality corpus text; the high-quality tag information set comprises a download address identifier and a domain identifier of a high-quality corpus text corresponding to the high-quality tag information set;
And determining professional corpus texts and general corpus texts from the high-quality corpus text set according to the belonging field identifiers in the high-quality label information set.
In addition, in order to achieve the above object, the present application further provides a device for improving training effect of a large language model, the device comprising:
the acquisition unit is used for acquiring an initial corpus text set; the initial corpus text set comprises a plurality of corpus texts, and each corpus text corresponds to one label information set; the label information set at least comprises a downloading address identifier of the corpus text and a domain identifier;
the preprocessing unit is used for preprocessing all the corpus texts in the initial corpus text set to obtain a target corpus text set;
The segmentation unit is used for carrying out semantic segmentation on the target corpus text in the target corpus text set to obtain a plurality of segmented words;
The updating unit is used for updating the preset vocabulary library based on the divided vocabulary to obtain an updated vocabulary library;
the determining unit is used for determining a professional corpus text and a general corpus text from a plurality of target corpus texts according to the target label information set of the target corpus text; the ratio of the number of the professional corpus texts to the number of the general corpus texts is a preset ratio;
And the training unit is used for training the pre-constructed large language model based on the updated vocabulary library, the professional corpus text and the general corpus text to obtain a trained large language model.
Furthermore, the present application provides a computing device, the computing device comprising: at least one processor, memory, and input output unit; wherein the memory is for storing a computer program and the processor is for invoking the computer program stored in the memory to perform the method of any of the first aspects.
Furthermore, the application provides a computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of any of the first aspects.
The embodiment of the application provides a method, a device and a medium for improving training effect of a large language model, which are used for preprocessing an obtained initial corpus text set to obtain a target corpus text set; the target corpus text in the target corpus text set can be subjected to semantic segmentation to obtain a plurality of segmented words, and a preset vocabulary library can be updated according to the obtained plurality of segmented words; in addition, the professional corpus text and the general corpus text can be determined from a plurality of target corpus texts in the target corpus text set, and the pre-constructed large language model is trained based on the updated vocabulary library, the professional corpus text and the general corpus text. Therefore, through high-quality training data, the semantic understanding and reasoning capacity of the trained large language model for knowledge in the professional field can be improved, and the training effect of the large language model is improved.
Drawings
FIG. 1 is a flow chart of a method for improving training effect of a large language model according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a refinement flow chart of step S102 in FIG. 1;
FIG. 3 is a schematic diagram of a functional module of an apparatus for improving training effects of a large language model according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a medium according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a computing device according to an embodiment of the present application.
Reference numerals illustrate: 50. a computing device; 501. a processing unit; 502. a system memory; 5021. RAM (random access memory); 5022. a cache memory; 5023. ROM (read only memory); 5024. a program module; 5025. program/utility of program modules; 503. a bus connecting the different system components; 504. an external device; 505. an I/O (input/output) interface; 506. a network adapter.
The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the application to those skilled in the art.
Those skilled in the art will appreciate that embodiments of the application may be implemented as a system, apparatus, device, method, or computer program product. Thus, the application may be embodied in the form of: complete hardware, complete software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
In the prior art, through pre-training on a large-scale corpus, a large language model can acquire basic language understanding and generation skills. In this process, the size and quality of the pre-training corpus are critical to the capability the large language model acquires. However, when the data quality of the pre-training corpus is poor, a large language model trained on such a low-quality corpus struggles to achieve the desired effect.
The main solutions of the embodiments of the present application are:
Acquiring an initial corpus text set; the initial corpus text set comprises a plurality of corpus texts, and each corpus text corresponds to one label information set; the label information set at least comprises a downloading address identifier of the corpus text and a domain identifier;
Preprocessing all the corpus texts in the initial corpus text set to obtain a target corpus text set;
Performing semantic segmentation on the target corpus text in the target corpus text set to obtain a plurality of segmented words;
updating a preset vocabulary library based on the divided vocabulary to obtain an updated vocabulary library;
According to the target label information set of the target corpus text, determining a professional corpus text and a general corpus text from a plurality of target corpus texts; the ratio of the number of the professional corpus texts to the number of the general corpus texts is a preset ratio;
training a pre-constructed large language model based on the updated vocabulary library, the professional corpus text and the general corpus text to obtain a trained large language model.
The application provides a solution: a target corpus text set is obtained by preprocessing an obtained initial corpus text set; the target corpus texts in the target corpus text set can be subjected to semantic segmentation to obtain a plurality of segmented words, and a preset vocabulary library can be updated according to the obtained segmented words; in addition, the professional corpus texts and the general corpus texts can be determined from the plurality of target corpus texts in the target corpus text set, and the pre-constructed large language model is trained based on the updated vocabulary library, the professional corpus texts and the general corpus texts. Therefore, through high-quality training data, the semantic understanding and reasoning capability of the large language model for knowledge in the professional field can be improved, and the training effect of the large language model is improved.
It should be noted that any number of elements in the figures are for illustration and not limitation, and that any naming is used for distinction only and not for limitation.
The principles and spirit of the present application are explained in detail below with reference to several representative embodiments thereof.
Example 1
Referring now to fig. 1, fig. 1 is a flowchart illustrating a method for improving training effects of a large language model according to an embodiment of the present application. It should be noted that embodiments of the present application may be applied to any scenario where applicable.
The process of the method for improving training effect of large language model according to one embodiment of the present application shown in fig. 1 includes:
step S101, an initial corpus text set is obtained.
In the embodiment of the application, the initial corpus text set comprises a plurality of corpus texts, and each corpus text corresponds to one label information set; the label information set at least comprises a downloading address identifier of the corpus text and a domain identifier.
In the embodiment of the application, the download address identifier may be the web address, or an identifier of the web address, from which the corpus text was downloaded (e.g., a website, forum, or encyclopedia address); the domain identifier may be the professional-field information corresponding to the corpus text (such as physics, chemistry, mathematics, new energy, etc.).
Step S102, preprocessing all the corpus texts in the initial corpus text set to obtain a target corpus text set.
In the embodiment of the application, the target corpus text set comprises a plurality of target corpus texts and a label information set corresponding to each target corpus text; the label information set at least comprises the download address identifier and the domain identifier of the corresponding target corpus text.
In the embodiment of the present application, the method for preprocessing the initial corpus text set may be: any one or more combinations of data cleaning, de-duplication, language screening, separator duty ratio screening and the like are performed on the corpus text in the initial corpus text set, and the embodiment of the application is not limited in this regard. By preprocessing the initial corpus text set, noisy, redundant, irrelevant and potentially harmful corpus text can be removed to improve the quality of corpus text contained in the target corpus text set.
In another embodiment of the present application, in order to improve the quality of the target corpus text in the target corpus text set, the corpus text with a higher separator occupation may be deleted from the initial corpus text set, so as to obtain a corpus text set to be processed; and performing a de-duplication operation on the corpus text set to be processed to obtain a target corpus text set, as shown in fig. 2, the step S102 is replaced by the following steps S201 to S203:
Step S201, performing a separator duty ratio calculation based on each corpus text, and determining the separator duty ratio of each corpus text.
As an optional implementation manner, step S201 performs a separator duty ratio calculation on a current corpus text in the initial corpus text set, and the manner of determining the separator duty ratio of the current corpus text may specifically include the following steps:
performing character recognition on one current corpus text in the initial corpus text set, and determining all target characters contained in the current corpus text; wherein the target character comprises a literal character and a separator;
determining a total number of characters of the target character and a number of separators of the separators;
and calculating the ratio between the number of separators and the total number of characters to determine the separator duty ratio of the current corpus text.
According to the embodiment, the total number of characters and the number of separators in each corpus text can be identified, the proportion of the separators in the current corpus text is determined based on the total number of characters and the number of separators, and the accuracy of the determined proportion of the separators is improved.
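As a non-authoritative illustration of steps S201 and S202, a minimal Python sketch of the separator-ratio screening is given below; the separator character set and the 0.3 threshold are illustrative assumptions, since the embodiment only fixes that a preset duty ratio threshold exists:

```python
# Sketch of separator-ratio screening (steps S201/S202).
# The separator set and the 0.3 threshold are illustrative assumptions.
SEPARATORS = set(",.;:!?|-_/\\ \t\n，。；：！？、")

def separator_ratio(text: str) -> float:
    """Ratio of separator characters to all characters in a corpus text."""
    if not text:
        return 0.0
    separator_count = sum(1 for ch in text if ch in SEPARATORS)
    return separator_count / len(text)

def filter_by_separator_ratio(corpus_texts, threshold=0.3):
    """Keep corpus texts whose separator duty ratio is below the preset threshold."""
    return [t for t in corpus_texts if separator_ratio(t) < threshold]
```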
Step S202, deleting the corpus text to be deleted, the separator duty ratio of which is greater than or equal to a preset duty ratio threshold value, from the initial corpus text set to obtain a corpus text set to be processed.
Step S203, performing de-duplication operation on the corpus text in the corpus text set to be processed to obtain a target corpus text set.
As an optional implementation manner, the step S203 of performing a deduplication operation on the corpus text in the corpus text set to be processed to obtain the target corpus text set may specifically include the following steps:
respectively calculating each corpus text in the corpus text set to be processed to obtain a corresponding binary string;
determining repeated corpus text subsets in the corpus text set to be processed according to the binary string of each corpus text; the number of the repeated corpus texts in each repeated corpus text subset is greater than 1, and the degree of difference between the binary strings of any two repeated corpus texts in one repeated corpus text subset is smaller than a difference threshold;
determining repeated corpus texts to be deleted from the repeated corpus text subset; the absolute value of the difference between the number of the repeated corpus texts to be deleted and the number of the repeated corpus texts in the repeated corpus text subset where the repeated corpus texts to be deleted are located is 1;
Deleting the repeated corpus text to be deleted from the corpus text set to be processed to obtain a target corpus text set.
According to the embodiment, the corpus texts can be converted into binary strings, the similarity of two different corpus texts can be determined based on the binary strings, and repeated corpus texts with high similarity can then be deleted so that only one of each group is finally retained for training the large language model. This ensures the diversity of the corpus texts, improves the stability of the training process of the large language model, and at the same time improves the performance of the model.
As an alternative embodiment, the method for determining the repeated corpus text may include the steps of:
Determining a first corpus text to be compared and a second corpus text to be compared from the corpus text set to be processed;
acquiring a first binary string of the first corpus text to be compared and a second binary string of the second corpus text to be compared;
Comparing the first binary string with the second binary string bit by bit to obtain a target difference degree; wherein the target degree of difference is the number of positions with different numbers on the corresponding positions of the first binary string and the second binary string;
and if the target difference degree is smaller than the difference threshold value, determining the first corpus text to be compared and the second corpus text to be compared as repeated corpus texts.
In the embodiment of the application, the corpus text can be converted into a hash value through the SimHash algorithm, and the hash value can be further converted into a binary string, namely the binary string corresponding to the corpus text.
In another embodiment of the present application, any two corpus texts may be selected as a combination (including a first corpus text to be compared and a second corpus text to be compared), and the calculation mode of the difference degree of the binary strings corresponding to the two corpus texts may be:
Determining a first binary string of the first corpus text to be compared and a second binary string of the second corpus text to be compared;
Comparing the first binary string with the second binary string bit by bit, determining whether the digits at corresponding positions in the first binary string and the second binary string are the same, and if not, incrementing a counter whose initial value is zero; the final count value is the number of positions at which the digits differ;
The number of differing positions is determined as the degree of difference between the first corpus text to be compared and the second corpus text to be compared. When the degree of difference is smaller than the difference threshold, the two corpus texts in the combination are highly similar in content; one of the two corpus texts is deleted, and the remaining one is combined with the other remaining corpus texts for further binary string comparison, until the degree of difference between the binary strings of all remaining corpus texts is greater than or equal to the difference threshold, i.e., no highly similar repeated corpus texts remain among them.
Optionally, in another embodiment, the method for deleting the repeated corpus text to be deleted from the corpus text set to be processed to obtain the target corpus text set further includes:
Comparing the values at the same positions of the binary strings corresponding to all the corpus texts in the corpus text set to be processed, dividing corpus texts whose binary strings are highly similar into one combination, comparing the degrees of difference within each combination and deleting repeated texts, and then performing bit-by-bit inter-group comparison of the binary strings of the corpus texts remaining in all combinations, until the degree of difference between the binary strings of all remaining corpus texts is greater than or equal to the difference threshold. Other comparison methods exist in other embodiments, and the specific method is not limited.
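As one possible reading of this deduplication step, the sketch below pairs a simple character-bigram SimHash with a bit-by-bit (Hamming-distance) comparison; the fingerprint construction and the difference threshold of 3 are assumptions, not details fixed by the embodiment:

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Simple 64-bit SimHash fingerprint over character bigrams (an assumption)."""
    if not text:
        return 0
    weights = [0] * bits
    for i in range(max(len(text) - 1, 1)):
        h = int(hashlib.md5(text[i:i + 2].encode("utf-8")).hexdigest(), 16)
        for bit in range(bits):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(bits) if weights[bit] > 0)

def degree_of_difference(a: int, b: int) -> int:
    """Number of bit positions at which two binary strings differ."""
    return bin(a ^ b).count("1")

def deduplicate(corpus_texts, difference_threshold=3):
    """Keep one representative of each group of near-duplicate corpus texts."""
    kept, fingerprints = [], []
    for text in corpus_texts:
        fp = simhash(text)
        if all(degree_of_difference(fp, f) >= difference_threshold for f in fingerprints):
            kept.append(text)
            fingerprints.append(fp)
    return kept
```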
By implementing the steps S201 to S203, corpus texts with higher separator occupation can be deleted from the initial corpus text set to obtain a corpus text set to be processed; and performing de-duplication operation on the corpus text set to be processed to obtain a target corpus text set, so that the quality of the target corpus text in the target corpus text set is improved.
Step S103, carrying out semantic segmentation on the target corpus text in the target corpus text set to obtain a plurality of segmented words.
In the embodiment of the present application, a preset word segmentation tool such as THULAC, jieba, or LTP may be used; the embodiment of the present application is not limited in this regard.
As an optional implementation manner, the step S103 performs semantic segmentation on the target corpus text in the target corpus text set to obtain a plurality of segmented vocabulary, which may specifically include the following steps:
Semantic segmentation is carried out on each target corpus text in the target corpus text set to obtain a word segmentation set of each target corpus text; wherein one target corpus text corresponds to one word segmentation set; the word segmentation set comprises a plurality of words, and the order of the words contained in the word segmentation set is the same as the order of the words in the target corpus text corresponding to the word segmentation set;
Combining any two adjacent word segments in the word segmentation set of each target corpus text to obtain a plurality of to-be-processed segmented words corresponding to each target corpus text;
Determining a segmented word to be deleted from the plurality of segmented words to be processed; the segmented vocabulary to be deleted at least comprises vocabulary related to the stop-word list;
And deleting the segmented vocabulary to be deleted from the plurality of segmented vocabularies to be processed, so as to obtain the deleted segmented vocabulary.
According to the embodiment, each target corpus text can be subjected to semantic segmentation to obtain a plurality of segmented words, and adjacent segmented words can be combined to obtain a plurality of merged segmented words, so that the segmented words obtained in this way carry richer meaning, improving the quality of the segmented vocabulary.
In the embodiment of the application, the segmented vocabulary to be deleted at least comprises vocabulary related to the stop-word list, such as modal particles, conjunctions, exclamations, and other words carrying little information. The plurality of to-be-processed segmented words may also be identified through a pre-trained meaningless-word recognition model to determine meaningless segmented words to be deleted.
For example, the target corpus text may be: atherosclerosis can cause unsmooth blood circulation, and cause ischemia and necrosis of tissues and organs needing blood supply, thereby causing complications of multiple tissues and organs.
Artificial intelligence is an important component of the intellectual discipline that attempts to understand the essence of intelligence.
Semantic segmentation is carried out on the target corpus, and the obtained segmented word set can be { artery, atherosclerosis, caused, blood, circulation, unsmooth, letting, need, blood supply, tissues, organs, appearance, ischemia, necrosis, induction, multiple tissues, organs, appearance and complications };
Any two adjacent segmented words in the segmented word set are combined, and the obtained multiple segmented words can be: { atherosclerosis, which causes blood, causes blood circulation, circulation disorder, needs blood supply, tissue and organ appearance, ischemic necrosis, necrosis causes multiple tissues, tissue and organ appearance, complications and the like }.
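The following Python sketch reproduces this segment-then-merge procedure using jieba, one of the segmentation tools named above; the stop-word list here is a hypothetical placeholder, since the embodiment only requires that low-information words be removable:

```python
import jieba  # one of the word segmentation tools mentioned in this embodiment

# Hypothetical stop-word list; in practice this would be a full stop-word table.
STOP_WORDS = {"的", "了", "和", "让", "啊"}

def segment_and_merge(target_corpus_text: str):
    """Tokenize a target corpus text, then merge each adjacent token pair."""
    tokens = [t for t in jieba.cut(target_corpus_text) if t.strip()]
    candidates = [tokens[i] + tokens[i + 1] for i in range(len(tokens) - 1)]
    # Drop merged words that contain a stop word (low-information tokens).
    return [w for w in candidates if not any(s in w for s in STOP_WORDS)]
```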
Step S104, updating a preset vocabulary library based on the divided vocabulary to obtain an updated vocabulary library.
As an optional implementation manner, the step S104 of updating the preset vocabulary library based on the divided vocabulary, and the manner of obtaining the updated vocabulary library may specifically include the following steps:
Determining the occurrence frequency of each divided word in the target corpus text corresponding to the divided word, and a target label information set corresponding to the target corpus text in which each divided word is positioned;
Determining the association value of each divided word according to the occurrence frequency corresponding to each divided word and the target tag information set; the calculation formula of the correlation value can be as follows
$Y_j = \sum_{i=1}^{n} f_{ij} \cdot Q_i \cdot K_i$
wherein $Y_j$ is the association value of the j-th segmented word; $i$ is the index of the target corpus text; $n$ is the total number of target corpus texts; $f_{ij}$ is the frequency of occurrence of the j-th segmented word in the i-th target corpus text; $Q_i$ is the quality evaluation value of the i-th target corpus text, a score given based on the professionalism of the download address identifier; and $K_i$ is the domain evaluation value, given based on the proximity of the domain identifier to the user's desired domain. In other embodiments, a weighted average of the above parameters over each target corpus text may be used instead; the specific manner is not limited.
Sorting the divided words based on the association values to obtain an association value sequence; the segmented vocabulary in the association value sequence is ordered from large to small according to the association value corresponding to the segmented vocabulary;
sequentially selecting a preset number of target related words from the related value sequence;
and adding the target related words to a preset vocabulary library to obtain an updated vocabulary library.
According to the method and the device, the association value of the segmented vocabulary can be calculated according to the occurrence frequency of the segmented vocabulary in the target corpus text and the target label information corresponding to the target corpus text, the segmented vocabulary can be further ordered based on the association value, and the segmented vocabulary with larger association value can be added into a preset vocabulary library to expand an original vocabulary of the pre-trained large language model, so that the specific vocabulary in certain professional fields can be recognized more accurately when the semantic understanding or reasoning of the large language model is performed in the later stage, the intention of a user can be better understood or more professional content can be returned, and the situation that abundant semantics cannot be carried due to the fact that the vocabulary is small is avoided, so that the quality of the vocabulary stored in the vocabulary library is improved.
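Under the reconstruction of the association-value formula given above, a short Python sketch of the ranking and vocabulary update might look as follows; how the quality evaluation value Q_i and domain evaluation value K_i are scored is left open by the embodiment, so the `quality` and `domain` inputs here are assumptions:

```python
from collections import Counter

def association_values(segmented_words, tokenized_texts, quality, domain):
    """Y_j = sum_i f_ij * Q_i * K_i over the n target corpus texts."""
    counts = [Counter(tokens) for tokens in tokenized_texts]
    return {
        word: sum(counts[i][word] * quality[i] * domain[i]
                  for i in range(len(tokenized_texts)))
        for word in segmented_words
    }

def update_vocabulary(vocab: set, values: dict, top_k: int = 1000) -> set:
    """Add the top-k segmented words by association value to the vocabulary library."""
    ranked = sorted(values, key=values.get, reverse=True)
    vocab.update(ranked[:top_k])
    return vocab
```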
Step S105, according to the target label information set of the target corpus text, determining a professional corpus text and a general corpus text from the multiple target corpus texts.
In the embodiment of the application, the ratio of the number of professional corpus texts to the number of general corpus texts is a preset ratio. A preset ratio of 1:1 is optimal, but it can be fine-tuned for corpora of different scales: if the total number of fine-tuning training corpora is 500, the ratio of professional corpus texts to general corpus texts may be 6:4; if the total number is 5000, the ratio may be 5.5:4.5, and so on. A mixed corpus is formed by mixing the general-domain corpus and the target-domain corpus at the preset ratio, and is used for training together with the expanded vocabulary, so as to ensure the balance and diversity of the data. This ensures that the final large language model retains both its general question-answering capability and its in-domain knowledge question-answering capability, prevents the large language model from becoming overly sensitive to texts of certain specific fields or types, keeps positive and negative samples balanced, and improves the generalization capability of the large language model.
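A minimal sketch of assembling such a mixed fine-tuning corpus, assuming simple random sampling (the embodiment does not fix how the two pools are drawn from):

```python
import random

def mix_corpus(professional, general, total, prof_share=0.5, seed=0):
    """Sample a mixed corpus at a preset professional/general ratio.

    prof_share=0.5 is the 1:1 ratio described as optimal; prof_share=0.6
    gives the 6:4 split suggested for a 500-text fine-tuning corpus.
    Both pools are assumed large enough for the requested sample sizes.
    """
    rng = random.Random(seed)
    n_prof = round(total * prof_share)
    mixed = rng.sample(professional, n_prof) + rng.sample(general, total - n_prof)
    rng.shuffle(mixed)
    return mixed
```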
As an optional implementation manner, the determining, in step S105, professional corpus texts and general corpus texts from the multiple target corpus texts according to the target tag information set of the target corpus text may specifically include the following steps:
Inputting each target corpus text of the target corpus text set into a pre-constructed quality scoring model to obtain the confusion degree of each target corpus text;
Deleting the target corpus text with the confusion degree larger than or equal to a preset confusion degree threshold value from the target corpus text set to obtain a high-quality corpus text set; the high-quality corpus text set comprises a plurality of high-quality corpus texts and a high-quality label information set corresponding to each high-quality corpus text; the high-quality tag information set comprises a download address identifier and a domain identifier of a high-quality corpus text corresponding to the high-quality tag information set;
And determining professional corpus texts and general corpus texts from the high-quality corpus text set according to the belonging field identifiers in the high-quality label information set.
According to this implementation, the confusion degree of each target corpus text can be evaluated through a pre-constructed quality scoring model, and target corpus texts whose confusion degree is greater than or equal to a preset confusion degree threshold value can be deleted, so that the corpus texts of the high-quality corpus text set are all high-quality corpus texts. Professional corpus texts and general corpus texts can then be determined from the high-quality corpus text set according to the domain identifiers of the corpus texts, improving the quality of the professional corpus texts and the general corpus texts.
In the embodiment of the application, the calculation mode of the confusion degree of the target corpus text can be as follows:
Dividing the target corpus text into individual sentences, performing word segmentation on each sentence, and then counting the probability of two adjacent words occurring in the target corpus text according to the following formula, where a smoothing value V is added to avoid zero probabilities:
$P(w_i \mid w_{i-1}) = \dfrac{c(w_{i-1} w_i) + 1}{c(w_{i-1}) + V}$
wherein $w_i$ is the current word, $w_{i-1}$ is the word immediately preceding it, and $c(\cdot)$ denotes the number of occurrences of the bracketed word combination: the numerator counts occurrences of the adjacent word pair, the denominator counts the total occurrences of the preceding word in the target corpus text, and V is the total number of segmented words contained in the sentence where the current word is located, minus 1. The probabilities of all adjacent word pairs of the current sentence are then multiplied together to obtain the probability of the current sentence.
The confusion degree of the current sentence is obtained according to the following formula, where N is the number of segmented words in the current sentence and $P(w_1 w_2 \cdots w_N)$ is the sentence probability calculated by the quality scoring model:
$PP(W) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}}$
The same operation is then performed on the other sentences of the whole target corpus text to obtain the confusion degree of every sentence, and finally the average confusion degree of the whole target corpus text, which is taken as the confusion degree of that target corpus text.
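A compact Python sketch of this bigram confusion-degree (perplexity) computation, under the formula reconstruction above; add-one smoothing with the V defined in the text is assumed:

```python
import math
from collections import Counter

def sentence_confusion(sentence_tokens, corpus_tokens):
    """PP(W) = P(w1...wN)^(-1/N) with smoothed bigram probabilities."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    V = max(len(sentence_tokens) - 1, 1)  # smoothing value as defined in the text
    log_prob = 0.0
    for prev, cur in zip(sentence_tokens, sentence_tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + V)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(sentence_tokens))

def text_confusion(sentences, corpus_tokens):
    """Average sentence confusion degree over the whole target corpus text."""
    return sum(sentence_confusion(s, corpus_tokens) for s in sentences) / len(sentences)
```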
And step S106, training a pre-constructed large language model based on the updated vocabulary library, the professional corpus text and the general corpus text to obtain a trained large language model.
By implementing steps S101 to S106, high-quality training data can improve the semantic understanding and reasoning capability of the large language model for professional knowledge, and thereby the training effect of the large language model. In addition, the application can also improve the quality of the target corpus texts in the target corpus text set, the accuracy of the determined separator duty ratio, the accuracy of repeated-corpus-text determination, the quality of the segmented vocabulary, the quality and specificity of the vocabulary stored in the vocabulary library, and the quality of the professional corpus texts and general corpus texts.
As an optional implementation manner, step S106 trains a pre-constructed large language model based on the updated vocabulary library, the specialized corpus text and the generic corpus text, and the obtaining the trained large language model includes the following steps:
And (3) verification: the method comprises the steps that a large model evaluation kit is used for respectively evaluating the response accuracy of general knowledge and specific domain knowledge of a first model and a second model, a first evaluation value set and a second evaluation value set are correspondingly obtained, the first evaluation value set at least comprises a first general domain evaluation value and a first specific domain evaluation value, and the second evaluation value set at least comprises a second general domain evaluation value and a second specific domain evaluation value; the first model is a large language model constructed in advance, and the second model is a large language model after preliminary training;
Judging whether the first difference value is smaller than a first threshold value and whether the second difference value is greater than or equal to a second threshold value; the first difference value is the absolute value of the difference between the second general-field evaluation value and the first general-field evaluation value, the second difference value is the difference between the second specific-field evaluation value and the first specific-field evaluation value, the first threshold value is the maximum allowed reduction in the second model's general-knowledge response accuracy, and the second threshold value is the target value for the second model's specific-field knowledge response accuracy;
If yes, obtaining the trained large language model; if not, judging whether the first difference value is larger than or equal to the first threshold value, and judging whether the second difference value is larger than the second threshold value;
If yes, reducing the preset ratio according to a first preset ratio threshold, training a pre-constructed large language model based on the updated vocabulary library, the professional corpus text and the general corpus text, and returning to the verification step; if not, judging whether the first difference value is smaller than the first threshold value and whether the second difference value is larger than or equal to a third threshold value, wherein the third threshold value is a product value of a second preset proportion and the second threshold value;
If yes, increasing the preset ratio according to the first preset ratio threshold, increasing the original training time according to a preset time threshold, training a pre-constructed large language model based on the updated vocabulary library, the professional corpus text and the general corpus text, and returning to the verification step; if not, the original training time is prolonged according to the preset time threshold, training is carried out on a large language model built in advance based on the updated vocabulary library, the professional corpus text and the general corpus text, and the verification step is returned.
It can be understood that the application evaluates the general knowledge of the large language model after each training and the inference question-answering capability of the specific target field, and selectively adjusts the training mode of the large language model from two angles of training time length, professional corpus text and general corpus text proportion according to the evaluation result, so as to achieve the aim of remarkably shortening the training time and the calculation cost of the model while improving the performance of the large language model.
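The evaluate-and-adjust loop described in this verification step can be summarized by the following schematic Python sketch; the evaluation harness, threshold values, and step sizes are placeholders, and the branch structure mirrors the conditions above:

```python
def adjust_training(eval_first, eval_second, cfg, t1, t2, second_proportion=0.8):
    """Decide the next training action from the two evaluation value sets.

    eval_first/eval_second: {"general": float, "specific": float} scores of the
    pre-constructed and preliminarily trained models; t1, t2 and
    second_proportion stand in for the first/second thresholds and the second
    preset proportion (all placeholder values).
    """
    d1 = abs(eval_second["general"] - eval_first["general"])  # general drop
    d2 = eval_second["specific"] - eval_first["specific"]     # domain gain
    t3 = second_proportion * t2                               # third threshold

    if d1 < t1 and d2 >= t2:
        return "done"                             # trained large language model obtained
    if d1 >= t1 and d2 > t2:
        cfg["preset_ratio"] -= cfg["ratio_step"]  # general knowledge being forgotten
    elif d1 < t1 and d2 >= t3:
        cfg["preset_ratio"] += cfg["ratio_step"]  # close to target: more domain data
        cfg["train_time"] += cfg["time_step"]     # and longer training
    else:
        cfg["train_time"] += cfg["time_step"]     # under-trained overall
    return "retrain"
```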
Example two
Having described the method of the exemplary embodiment of the present application, an apparatus for improving training effects of a large language model according to the exemplary embodiment of the present application will be described with reference to fig. 3, where the apparatus includes an obtaining unit 301, a preprocessing unit 302, a segmentation unit 303, an updating unit 304, a determining unit 305, and a training unit 306, specifically:
The acquiring unit 301 may be configured to acquire an initial corpus text set; the initial corpus text set comprises a plurality of corpus texts, and each corpus text corresponds to one label information set; the label information set at least comprises a downloading address identifier of the corpus text and a domain identifier;
The preprocessing unit 302 may be configured to perform preprocessing on all the corpus texts in the initial corpus text set to obtain a target corpus text set;
the segmentation unit 303 may be configured to perform semantic segmentation on the target corpus text in the target corpus text set to obtain a plurality of segmented vocabularies;
The updating unit 304 may be configured to update a preset vocabulary library based on the divided vocabulary, to obtain an updated vocabulary library;
The determining unit 305 may be configured to determine, according to the target tag information set of the target corpus text, a professional corpus text and a general corpus text from the multiple target corpus texts; the ratio of the number of the professional corpus texts to the number of the general corpus texts is a preset ratio;
The training unit 306 may be configured to train the pre-built large language model based on the updated vocabulary library, the specialized corpus text, and the generic corpus text, to obtain a trained large language model.
As an optional implementation manner, the preprocessing unit 302 may perform preprocessing on all the corpus texts in the initial corpus text set to obtain the target corpus text set, which may specifically be:
Based on each corpus text, respectively performing a separator duty ratio calculation to determine the separator duty ratio of each corpus text;
Deleting the corpus texts to be deleted, the separator duty ratio of which is greater than or equal to a preset duty ratio threshold value, from the initial corpus text set to obtain a corpus text set to be processed;
and performing de-duplication operation on the corpus texts in the corpus text set to be processed to obtain a target corpus text set.
By implementing the implementation mode, the corpus text with the higher separator occupation ratio can be deleted from the initial corpus text set, so that a corpus text set to be processed is obtained; and performing de-duplication operation on the corpus text set to be processed to obtain a target corpus text set, so that the quality of the target corpus text in the target corpus text set is improved.
As an optional implementation manner, the preprocessing unit 302 performs the separator duty ratio calculation on a current corpus text in the initial corpus text set, and the manner of determining the separator duty ratio of the current corpus text may specifically be:
performing character recognition on one current corpus text in the initial corpus text set, and determining all target characters contained in the current corpus text; wherein the target character comprises a literal character and a separator;
determining a total number of characters of the target character and a number of separators of the separators;
and calculating the ratio between the number of separators and the total number of characters to determine the separator duty ratio of the current corpus text.
According to the embodiment, the total number of characters and the number of separators in each corpus text can be identified, the proportion of the separators in the current corpus text is determined based on the total number of characters and the number of separators, and the accuracy of the determined proportion of the separators is improved.
As an optional implementation manner, the preprocessing unit 302 performs a deduplication operation on the corpus text in the corpus text set to be processed, and a manner of obtaining the target corpus text set may specifically be:
Processing each corpus text in the corpus text set to be processed and converting the corpus text into binary strings;
determining repeated corpus text subsets in the corpus text set to be processed according to the binary string of each corpus text; the number of the repeated corpus texts in each repeated corpus text subset is greater than 1, and the degree of difference between the binary strings of any two repeated corpus texts in one repeated corpus text subset is smaller than a difference threshold;
determining repeated corpus texts to be deleted from the repeated corpus text subset; the absolute value of the difference between the number of the repeated corpus texts to be deleted and the number of the repeated corpus texts in the repeated corpus text subset where the repeated corpus texts to be deleted are located is 1;
Deleting the repeated corpus text to be deleted from the corpus text set to be processed to obtain a target corpus text set.
According to the embodiment, the corpus texts can be converted into the binary strings, and the difference values of the two different corpus texts are determined based on the binary strings, so that repeated corpus texts are deleted, and the accuracy of determining the repeated corpus texts can be improved.
As an alternative embodiment, the manner in which the preprocessing unit 302 determines the repeated corpus text may specifically be:
Determining a first corpus text to be compared and a second corpus text to be compared from the corpus text set to be processed;
acquiring a first binary string of the first corpus text to be compared and a second binary string of the second corpus text to be compared;
Comparing the first binary string with the second binary string bit by bit to obtain a target difference degree; wherein the target degree of difference is the number of positions with different numbers on the corresponding positions of the first binary string and the second binary string;
and if the target difference degree is smaller than the difference threshold value, determining the first corpus text to be compared and the second corpus text to be compared as repeated corpus texts.
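For illustration, a minimal sketch of the bit-by-bit comparison follows; the 8-bit strings and the difference threshold of 3 are assumed values.

```python
# The target difference degree is the number of positions at which the
# two binary strings differ, i.e. their Hamming distance.
def target_difference_degree(first: str, second: str) -> int:
    assert len(first) == len(second), "binary strings must have equal length"
    return sum(1 for a, b in zip(first, second) if a != b)

DIFFERENCE_THRESHOLD = 3  # hypothetical threshold
if target_difference_degree("10110100", "10110001") < DIFFERENCE_THRESHOLD:
    print("the two corpus texts to be compared are treated as repeats")
```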
As an optional implementation, the manner in which the segmentation unit 303 performs semantic segmentation on the target corpus texts in the target corpus text set to obtain a plurality of segmented words may specifically be:
performing semantic segmentation on each target corpus text in the target corpus text set to obtain a word segment set for each target corpus text; wherein each target corpus text corresponds to one word segment set; the word segment set comprises a plurality of word segments, and the order of the word segments in the set is the same as their order in the corresponding target corpus text;
combining any two adjacent word segments in the word segment set of each target corpus text to obtain a plurality of to-be-processed segmented words corresponding to each target corpus text;
determining segmented words to be deleted from the plurality of to-be-processed segmented words;
and deleting the segmented words to be deleted from the plurality of to-be-processed segmented words to obtain the final segmented words.
In this implementation, semantic segmentation is performed on each target corpus text, adjacent word segments are combined into candidate segmented words, and low-value candidates (such as those related to a stop-word list) are deleted, improving the quality of the segmented words.
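For illustration only, a minimal Python sketch of the adjacency-combination step follows; the stop-word list is a hypothetical placeholder, and the semantic segmenter itself is left abstract since the embodiment does not name one.

```python
# Combine any two adjacent word segments (order preserved) into
# candidate segmented words, then delete stop-word-related candidates.
STOP_WORDS = {"的", "了", "和", "the", "of", "and"}  # hypothetical list

def candidate_segmented_words(word_segments: list[str]) -> list[str]:
    kept = []
    for left, right in zip(word_segments, word_segments[1:]):
        if left in STOP_WORDS or right in STOP_WORDS:
            continue  # a segmented word to be deleted
        kept.append(left + right)
    return kept

# e.g. candidate_segmented_words(["大", "语言", "模型"])
#      -> ["大语言", "语言模型"]
```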
As an optional implementation, the manner in which the updating unit 304 updates the preset vocabulary library based on the segmented words to obtain an updated vocabulary library may specifically be:
determining, for each segmented word, its frequency of occurrence in the corresponding target corpus text and the target tag information set of the target corpus text in which it appears;
determining an association value for each segmented word according to its frequency of occurrence and the target tag information set;
sorting the segmented words by association value to obtain an association value sequence; wherein the segmented words in the association value sequence are ordered from the largest association value to the smallest;
sequentially selecting a preset number of target words from the association value sequence;
and adding the target words to the preset vocabulary library to obtain the updated vocabulary library.
In this implementation, the association value of each segmented word is calculated from its frequency of occurrence in the target corpus text and the target tag information of that text; the segmented words are then sorted by association value, and those with the largest association values are added to the preset vocabulary library, improving the quality of the words stored in the vocabulary library.
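For illustration, the sketch below computes an association value as frequency weighted by a per-domain factor derived from the tag information; the weighting formula is an assumption, since the embodiment does not define how frequency and tag information combine.

```python
# Rank segmented words by association value and add the top ones to
# the vocabulary library. DOMAIN_WEIGHT is a hypothetical weighting.
DOMAIN_WEIGHT = {"medical": 2.0, "general": 1.0}

def association_value(frequency: int, domain_id: str) -> float:
    return frequency * DOMAIN_WEIGHT.get(domain_id, 1.0)

def update_vocabulary(vocab: set[str],
                      word_stats: dict[str, tuple[int, str]],
                      preset_number: int) -> set[str]:
    """word_stats maps a segmented word to (frequency, domain identifier)."""
    ranked = sorted(word_stats,
                    key=lambda w: association_value(*word_stats[w]),
                    reverse=True)  # association values from large to small
    return vocab | set(ranked[:preset_number])
```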
As an optional implementation, the manner in which the determining unit 305 determines professional corpus texts and general corpus texts from the plurality of target corpus texts according to the target tag information sets may specifically be:
inputting each target corpus text of the target corpus text set into a pre-constructed quality scoring model to obtain the confusion degree (perplexity) of each target corpus text;
deleting target corpus texts whose confusion degree is greater than or equal to a preset confusion threshold from the target corpus text set to obtain a high-quality corpus text set; wherein the high-quality corpus text set comprises a plurality of high-quality corpus texts and a high-quality tag information set corresponding to each high-quality corpus text; each high-quality tag information set comprises the download address identifier and the domain identifier of its corresponding high-quality corpus text;
and determining professional corpus texts and general corpus texts from the high-quality corpus text set according to the domain identifiers in the high-quality tag information sets.
In this implementation, the confusion degree of each target corpus text is evaluated by a pre-constructed quality scoring model, and target corpus texts whose confusion degree is greater than or equal to the preset threshold are deleted, so that all texts in the high-quality corpus text set are high quality; professional corpus texts and general corpus texts can then be determined from this set according to each text's domain identifier, improving the quality of both corpora.
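For illustration, the sketch below filters by confusion degree (perplexity) and splits the survivors by domain identifier; the perplexity_of callable stands in for the pre-constructed quality scoring model, whose architecture the embodiment does not specify.

```python
# Drop high-confusion texts, then split the rest into professional and
# general corpora by their domain identifier.
from typing import Callable

def filter_and_split(texts: list[dict],
                     perplexity_of: Callable[[str], float],
                     threshold: float) -> tuple[list[dict], list[dict]]:
    professional, general = [], []
    for item in texts:  # item: {"text": ..., "domain_id": ...}
        if perplexity_of(item["text"]) >= threshold:
            continue  # delete low-quality (high-confusion) corpus text
        if item["domain_id"] == "general":
            general.append(item)
        else:
            professional.append(item)
    return professional, general
```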
By implementing this embodiment, high-quality training data improves the semantic understanding and reasoning capability of the large language model for professional knowledge, and thereby the training effect of the large language model. In addition, the application improves the quality of the target corpus texts in the target corpus text set, the accuracy of the determined separator ratio, the accuracy of repeated-corpus-text determination, the quality of the segmented words, the quality of the words stored in the vocabulary library, and the quality of the professional and general corpus texts.
Example III
Having described the method and apparatus of the exemplary embodiments of the present application, a computer-readable storage medium of the exemplary embodiments is described next with reference to FIG. 4, which shows the storage medium as an optical disc 40 having a computer program (i.e., a program product) stored thereon. When executed by a processor, the program implements the steps described in the above method embodiments, for example: acquiring an initial corpus text set, wherein the initial corpus text set comprises a plurality of corpus texts, each corpus text corresponds to one tag information set, and each tag information set at least comprises a download address identifier and a domain identifier of the corpus text; preprocessing all corpus texts in the initial corpus text set to obtain a target corpus text set; performing semantic segmentation on the target corpus texts in the target corpus text set to obtain a plurality of segmented words; updating a preset vocabulary library based on the segmented words to obtain an updated vocabulary library; determining professional corpus texts and general corpus texts from the plurality of target corpus texts according to the target tag information sets, wherein the ratio of the number of professional corpus texts to the number of general corpus texts is a preset ratio; and training a pre-constructed large language model based on the updated vocabulary library, the professional corpus texts, and the general corpus texts to obtain a trained large language model. The specific implementation of each step is not repeated here.
It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory, and other optical or magnetic storage media, which are not described in detail herein.
Example IV
Having described the methods, apparatus and media of exemplary embodiments of the present application, next, a computing device for model processing of exemplary embodiments of the present application is described with reference to FIG. 5.
FIG. 5 illustrates a block diagram of an exemplary computing device 50 suitable for implementing embodiments of the application; the computing device 50 may be a computer system or a server. The computing device 50 shown in FIG. 5 is merely an example and should not impose any limitation on the functionality or scope of use of embodiments of the present application.
As shown in fig. 5, components of computing device 50 may include, but are not limited to: one or more processors or processing units 501, a system memory 502, and a bus 503 that connects the various system components (including the system memory 502 and processing units 501).
Computing device 50 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computing device 50 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 502 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 5021 and/or cache memory 5022. Computing device 50 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, ROM 5023 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 5 and commonly referred to as a "hard drive"). Although not shown in FIG. 5, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media), may be provided. In such cases, each drive may be coupled to the bus 503 through one or more data medium interfaces. The system memory 502 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the application.
A program/utility 5025 having a set (at least one) of program modules 5024 may be stored in, for example, system memory 502, and such program modules 5024 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 5024 generally perform the functions and/or methods of the described embodiments of the present application.
Computing device 50 may also communicate with one or more external devices 504 (e.g., keyboard, pointing device, display, etc.). Such communication may occur through an input/output (I/O) interface 505. Moreover, computing device 50 may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 506. As shown in fig. 5, network adapter 506 communicates with other modules of computing device 50 (e.g., processing unit 501, etc.) over bus 503 that connects the various system components. It should be appreciated that although not shown in fig. 5, other hardware and/or software modules may be used in connection with computing device 50.
The processing unit 501 executes various functional applications and data processing by running programs stored in the system memory 502, for example: acquiring an initial corpus text set, wherein the initial corpus text set comprises a plurality of corpus texts, each corpus text corresponds to one tag information set, and each tag information set at least comprises a download address identifier and a domain identifier of the corpus text; preprocessing all corpus texts in the initial corpus text set to obtain a target corpus text set; performing semantic segmentation on the target corpus texts in the target corpus text set to obtain a plurality of segmented words; updating a preset vocabulary library based on the segmented words to obtain an updated vocabulary library; determining professional corpus texts and general corpus texts from the plurality of target corpus texts according to the target tag information sets, wherein the ratio of the number of professional corpus texts to the number of general corpus texts is a preset ratio; and training a pre-constructed large language model based on the updated vocabulary library, the professional corpus texts, and the general corpus texts to obtain a trained large language model. The specific implementation of each step is not repeated here. It should be noted that although several units/modules or sub-units/sub-modules of the apparatus for improving the training effect of a large language model are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, according to embodiments of the present application, the features and functionality of two or more units/modules described above may be embodied in one unit/module; conversely, the features and functions of one unit/module described above may be further divided into multiple units/modules.
In the description of the present application, it should be noted that the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. The apparatus embodiments described above are merely illustrative; for example, the division of the units is merely a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some communication interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on this understanding, the technical solution of the present application, or the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above examples are only specific embodiments of the present application, intended to illustrate rather than limit its technical solution, and the protection scope of the present application is not limited thereto. Although the present application is described in detail with reference to the foregoing examples, any person skilled in the art may still modify the technical solutions described in the foregoing embodiments, or substitute equivalents for some of their technical features; such modifications, changes, or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application and are intended to be covered by its protection scope. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Furthermore, although the operations of the methods of the present application are depicted in the drawings in a particular order, this neither requires nor suggests that these operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.

Claims (10)

1. A method for improving training results of a large language model, the method comprising:
Acquiring an initial corpus text set; the initial corpus text set comprises a plurality of corpus texts, and each corpus text corresponds to one label information set; the label information set at least comprises a downloading address identifier of the corpus text and a domain identifier;
Preprocessing all the corpus texts in the initial corpus text set to obtain a target corpus text set;
Performing semantic segmentation on the target corpus text in the target corpus text set to obtain a plurality of segmented words;
updating a preset vocabulary library based on the divided vocabulary to obtain an updated vocabulary library;
according to the target tag information set of the target corpus text, determining a professional corpus text and a general corpus text from the plurality of target corpus texts; wherein the ratio of the number of the professional corpus texts to the number of the general corpus texts is a preset ratio;
training a pre-constructed large language model based on the updated vocabulary library, the professional corpus text and the general corpus text, wherein the training process comprises the following steps:
a verification step: using a large-model evaluation kit to evaluate the response accuracy of the first model and the second model on general knowledge and on specific-domain knowledge respectively, correspondingly obtaining a first evaluation value set and a second evaluation value set, wherein the first evaluation value set at least comprises a first general-domain evaluation value and a first specific-domain evaluation value, and the second evaluation value set at least comprises a second general-domain evaluation value and a second specific-domain evaluation value; the first model is the pre-constructed large language model, and the second model is the large language model obtained after each round of training;
and adjusting the preset ratio or the training time based on a comparison of a first difference value with a first threshold and a comparison of a second difference value with a second threshold, so as to obtain the trained large language model; wherein the first difference value is the absolute value of the difference between the second general-domain evaluation value and the first general-domain evaluation value, the second difference value is the difference between the second specific-domain evaluation value and the first specific-domain evaluation value, the first threshold is the maximum allowed decrease in the second model's general-knowledge response accuracy, and the second threshold is a target value for the second model's specific-domain knowledge response accuracy.
2. The method for improving training effects of large language models according to claim 1, wherein the preprocessing all the corpus texts in the initial corpus text set to obtain a target corpus text set includes:
performing separator ratio calculation on each corpus text respectively, to determine the separator ratio of each corpus text;
deleting, from the initial corpus text set, the corpus texts whose separator ratio is greater than or equal to a preset ratio threshold, to obtain a corpus text set to be processed;
and performing de-duplication operation on the corpus texts in the corpus text set to be processed to obtain a target corpus text set.
3. The method for improving training effects of large language models of claim 2, wherein the method for determining the separator ratio of each of the corpus texts comprises:
performing character recognition on a current corpus text in the initial corpus text set, and determining all target characters contained in the current corpus text, wherein the target characters comprise literal characters and separators;
determining the total number of target characters and the number of separators;
and calculating the ratio of the number of separators to the total number of characters, which is the separator ratio of the current corpus text.
4. The method for improving training effects of large language models according to claim 2, wherein the performing a de-duplication operation on the corpus text in the corpus text set to be processed to obtain a target corpus text set includes:
converting each corpus text in the corpus text set to be processed into a corresponding binary string;
determining repeated corpus text subsets in the corpus text set to be processed according to the binary string of each corpus text; wherein the number of repeated corpus texts in each repeated corpus text subset is greater than 1, and the degree of difference between the binary strings of any two repeated corpus texts in one repeated corpus text subset is smaller than a difference threshold;
determining repeated corpus texts to be deleted from each repeated corpus text subset; wherein the absolute value of the difference between the number of repeated corpus texts to be deleted and the number of repeated corpus texts in the subset where they are located is 1;
Deleting the repeated corpus text to be deleted from the corpus text set to be processed to obtain a target corpus text set.
5. The method for improving training effects of a large language model according to claim 1, wherein the performing semantic segmentation on the target corpus text in the target corpus text set to obtain a plurality of segmented words comprises:
performing semantic segmentation on each target corpus text in the target corpus text set to obtain a word segment set for each target corpus text; wherein each target corpus text corresponds to one word segment set; the word segment set comprises a plurality of word segments, and the order of the word segments in the set is the same as their order in the corresponding target corpus text;
combining any two adjacent word segments in the word segment set of each target corpus text to obtain a plurality of to-be-processed segmented words corresponding to each target corpus text;
determining segmented words to be deleted from the plurality of to-be-processed segmented words, wherein the segmented words to be deleted at least comprise words related to a stop-word list;
and deleting the segmented words to be deleted from the plurality of to-be-processed segmented words to obtain the final segmented words.
6. The method for improving training effects of large language models according to claim 5, wherein updating the preset vocabulary library based on the segmented vocabulary to obtain an updated vocabulary library comprises:
determining, for each segmented word, its frequency of occurrence in the corresponding target corpus text and the target tag information set of the target corpus text in which it appears;
determining an association value for each segmented word according to its frequency of occurrence and the target tag information set;
sorting the segmented words by association value to obtain an association value sequence; wherein the segmented words in the association value sequence are ordered from the largest association value to the smallest;
sequentially selecting a preset number of target words from the association value sequence;
and adding the target words to the preset vocabulary library to obtain the updated vocabulary library.
7. The method for improving the training effect of a large language model according to claim 1, wherein adjusting the preset ratio or the training time based on a comparison of the first difference value with the first threshold and a comparison of the second difference value with the second threshold, so as to obtain the trained large language model, comprises:
judging whether the first difference value is smaller than the first threshold and whether the second difference value is greater than or equal to the second threshold; wherein the first difference value is the absolute value of the difference between the second general-domain evaluation value and the first general-domain evaluation value, the second difference value is the difference between the second specific-domain evaluation value and the first specific-domain evaluation value, the first threshold is the maximum allowed decrease in the second model's general-knowledge response accuracy, and the second threshold is a target value for the second model's specific-domain knowledge response accuracy;
if yes, outputting the trained large language model; if not, judging whether the first difference value is greater than or equal to the first threshold and whether the second difference value is greater than the second threshold;
if yes, reducing the preset ratio by a first preset ratio threshold, training the pre-constructed large language model based on the updated vocabulary library, the professional corpus texts, and the general corpus texts, and returning to the verification step; if not, judging whether the first difference value is smaller than the first threshold and whether the second difference value is greater than or equal to a third threshold, wherein the third threshold is the product of a second preset proportion and the second threshold;
and if yes, increasing the preset ratio by the first preset ratio threshold, extending the original training time by a preset time threshold, training the pre-constructed large language model based on the updated vocabulary library, the professional corpus texts, and the general corpus texts, and returning to the verification step; if not, extending the original training time by the preset time threshold, training the pre-constructed large language model based on the updated vocabulary library, the professional corpus texts, and the general corpus texts, and returning to the verification step.
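For illustration only, the sketch below expresses the adjustment loop of this claim in Python; the train() and evaluate() callables, the thresholds, and the step sizes are assumptions, since the claim leaves the evaluation kit and the retraining procedure abstract.

```python
# Adjust the professional/general corpus ratio or the training time
# until general accuracy is preserved (d1 < t1) and the specific-domain
# target is met (d2 >= t2).
def adjust_and_train(train, evaluate, first_eval,
                     ratio, train_time,
                     t1, t2, ratio_step, time_step, second_proportion):
    t3 = second_proportion * t2  # third threshold
    while True:
        model = train(ratio, train_time)
        second_eval = evaluate(model)
        d1 = abs(second_eval["general"] - first_eval["general"])
        d2 = second_eval["specific"] - first_eval["specific"]
        if d1 < t1 and d2 >= t2:
            return model                 # trained large language model
        if d1 >= t1 and d2 > t2:
            ratio -= ratio_step          # general knowledge degraded: less professional corpus
        elif d1 < t1 and d2 >= t3:
            ratio += ratio_step          # close to target: more professional corpus
            train_time += time_step      # and longer training
        else:
            train_time += time_step      # otherwise only extend training time
```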
8. The method for improving the training effect of a large language model according to any one of claims 1 to 7, wherein determining a professional corpus text and a general corpus text from the plurality of target corpus texts according to the target tag information set of the target corpus text comprises:
inputting each target corpus text of the target corpus text set into a pre-constructed quality scoring model to obtain the confusion degree of each target corpus text;
deleting target corpus texts whose confusion degree is greater than or equal to a preset confusion threshold from the target corpus text set to obtain a high-quality corpus text set; wherein the high-quality corpus text set comprises a plurality of high-quality corpus texts and a high-quality tag information set corresponding to each high-quality corpus text; each high-quality tag information set comprises the download address identifier and the domain identifier of its corresponding high-quality corpus text;
and determining professional corpus texts and general corpus texts from the high-quality corpus text set according to the domain identifiers in the high-quality tag information sets.
9. An apparatus for enhancing training effects of a large language model, the apparatus comprising:
the acquisition unit is used for acquiring an initial corpus text set; the initial corpus text set comprises a plurality of corpus texts, and each corpus text corresponds to one label information set; the label information set at least comprises a downloading address identifier of the corpus text and a domain identifier;
the preprocessing unit is used for preprocessing all the corpus texts in the initial corpus text set to obtain a target corpus text set;
The segmentation unit is used for carrying out semantic segmentation on the target corpus text in the target corpus text set to obtain a plurality of segmented words;
The updating unit is used for updating the preset vocabulary library based on the divided vocabulary to obtain an updated vocabulary library;
the determining unit is used for determining a professional corpus text and a general corpus text from the plurality of target corpus texts according to the target tag information set of the target corpus text; wherein the ratio of the number of the professional corpus texts to the number of the general corpus texts is a preset ratio;
The training unit is used for training a pre-constructed large language model based on the updated vocabulary library, the professional corpus text and the general corpus text, and the training process comprises the following steps:
a verification step: using a large-model evaluation kit to evaluate the response accuracy of the first model and the second model on general knowledge and on specific-domain knowledge respectively, correspondingly obtaining a first evaluation value set and a second evaluation value set, wherein the first evaluation value set at least comprises a first general-domain evaluation value and a first specific-domain evaluation value, and the second evaluation value set at least comprises a second general-domain evaluation value and a second specific-domain evaluation value; the first model is the pre-constructed large language model, and the second model is the large language model obtained after each round of training;
and adjusting the preset ratio or the training time based on a comparison of a first difference value with a first threshold and a comparison of a second difference value with a second threshold, so as to obtain the trained large language model; wherein the first difference value is the absolute value of the difference between the second general-domain evaluation value and the first general-domain evaluation value, the second difference value is the difference between the second specific-domain evaluation value and the first specific-domain evaluation value, the first threshold is the maximum allowed decrease in the second model's general-knowledge response accuracy, and the second threshold is a target value for the second model's specific-domain knowledge response accuracy.
10. A computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of enhancing the training effect of a large language model as claimed in any one of claims 1 to 8.
CN202410164274.8A 2024-02-05 2024-02-05 Method, device and medium for improving training effect of large language model Active CN117709355B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410164274.8A CN117709355B (en) 2024-02-05 2024-02-05 Method, device and medium for improving training effect of large language model

Publications (2)

Publication Number Publication Date
CN117709355A CN117709355A (en) 2024-03-15
CN117709355B true CN117709355B (en) 2024-05-17

Family

ID=90144707

Country Status (1)

Country Link
CN (1) CN117709355B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118279704B (en) * 2024-06-04 2024-08-13 四川蜀天信息技术有限公司 Digital human interaction evaluation method, device, storage medium and equipment


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110263324B (en) * 2019-05-16 2021-02-12 华为技术有限公司 Text processing method, model training method and device
US20230334263A1 (en) * 2022-04-13 2023-10-19 Abridge AI, Inc. Automating follow-up actions from conversations

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112148877A (en) * 2020-09-23 2020-12-29 网易(杭州)网络有限公司 Corpus text processing method and device and electronic equipment
CN113961669A (en) * 2021-10-26 2022-01-21 杭州中软安人网络通信股份有限公司 Training method of pre-training language model, storage medium and server
CN117290500A (en) * 2022-06-16 2023-12-26 马上消费金融股份有限公司 Professional word stock construction method, device, medium and program product
CN115545010A (en) * 2022-10-12 2022-12-30 阿里巴巴(中国)有限公司 Training method, device and equipment for generating network by navigation broadcast statement

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Assessing the potential of LLM-assisted annotation for corpus-based pragmatics and discourse analysis: The case of apologies; Danni Yu et al.; ResearchGate; 2023-05-31; 1-30 *
Pre-training LLMs using human-like development data corpus; Khushi Bhardwaj et al.; arXiv; 2024-01-10; 1-7 *
Research on medical language models; Zhao Lijun et al.; Changjiang Information & Communications; 2023-11-15; Vol. 36, No. 11; 1-7+11 *
Research on automatic text classification of educational technology academic papers based on deep learning; He Jiaojiao; China Master's Theses Full-text Database (Social Sciences II); 2019-01-15; No. 1; H127-89 *
Research on domain concept extraction and relation analysis based on HowNet; Tang Yizhi; Natural Science Journal of Xiangtan University; 2009-03-15; No. 01; 135-140 *
Fine-grained sentiment analysis of microblog text under a deep learning framework; Wang Ru, Wang Jiamei, Wang Weiquan, Fu Fei; Computer Systems & Applications; 2020-05-15; No. 05; 19-28 *

Also Published As

Publication number Publication date
CN117709355A (en) 2024-03-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant