CN117609450A - Model training method, vector database generating method, question answering method and question answering device - Google Patents

Model training method, vector database generating method, question answering method and question answering device Download PDF

Info

Publication number
CN117609450A
Authority
CN
China
Prior art keywords
text
loss function
vector
level
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311576180.3A
Other languages
Chinese (zh)
Inventor
郭瑾瑾
李丽勤
赵婉
张钧波
谢泽华
王文超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Big Data Center
Jingdong City Beijing Digital Technology Co Ltd
Original Assignee
Beijing Big Data Center
Jingdong City Beijing Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Big Data Center, Jingdong City Beijing Digital Technology Co Ltd filed Critical Beijing Big Data Center
Priority to CN202311576180.3A priority Critical patent/CN117609450A/en
Publication of CN117609450A publication Critical patent/CN117609450A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3347Query execution using vector based model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a model training method, a vector database generating method, a question-answering method, and corresponding apparatuses, and relates to the field of artificial intelligence. The model training method includes: masking content to be masked in a government affair text sample to obtain a masked text; inputting the masked text into a machine learning model to obtain a text feature vector; obtaining a prediction result of the content to be masked from the text feature vector; determining a first loss function from the prediction result; and training the machine learning model using the first loss function.

Description

Model training method, vector database generating method, question answering method and question answering device
Technical Field
The disclosure relates to the field of artificial intelligence, and in particular to a model training method, a vector database generating method, a question-answering method and a question-answering device.
Background
The large language model (Large Language Model) is a major achievement in the field of artificial intelligence. Owing to their strong context awareness, large language models perform well on tasks such as multi-round question answering, text authoring and code programming. Applying general-purpose large language models to vertical fields such as government affairs, finance, e-commerce and education has therefore received considerable attention. For example, in the government affairs field, a large language model can answer government affair questions posed by users.
Disclosure of Invention
The inventors note that, in the related art, a large language model has only general knowledge and lacks professional knowledge of the government affairs field, so it cannot give targeted answers for specific government affairs and scenarios.
Accordingly, the present disclosure provides a model training scheme: by training a machine learning model, government affair texts can be vectorized with awareness of government affair scenarios, a government affair vector database can be constructed, and government affair questions can be answered accurately on the basis of that database.
According to a first aspect of embodiments of the present disclosure, there is provided a model training method, performed by a model training apparatus, comprising: masking the content to be masked in the government affair text sample to obtain a masking text; inputting the mask text into a machine learning model to obtain text feature vectors; obtaining a prediction result of the content to be masked according to the text feature vector; determining a first loss function according to the prediction result; training the machine learning model using the first loss function.
In some embodiments, the first loss function is determined from the probability of predicting each result of the content to be masked based on the unmasked content in the masked text.
In some embodiments, the obtained feature vectors having the same level attribute are placed into the same set to obtain N sets, and the N sets are ordered from the highest level attribute to the lowest; a second loss function is obtained according to the intra-level association degree of each feature vector in each set; a third loss function is obtained according to the inter-level association degrees of the first N-2 sets among the N sets; a relevance loss function is obtained according to the second loss function and the third loss function; and the machine learning model is trained using the relevance loss function.
In some embodiments, the relevancy loss function is a weighted sum of the second loss function and the third loss function.
In some embodiments, deriving the second loss function includes: obtaining the intra-level association degree of the jth feature vector using the jth feature vector in the ith set and the level attribute of the ith set, where 1 ≤ i ≤ N, 1 ≤ j ≤ M, and M is the total number of feature vectors in the ith set; calculating the sum of the intra-level association degrees of the feature vectors in the ith set as the intra-level association degree of the ith set; and calculating the sum of the intra-level association degrees of the N sets to obtain the second loss function.
In some embodiments, determining the intra-level association degree of the jth feature vector includes: calculating the product between the jth feature vector in the ith set and the level attribute of the ith set; calculating the difference between the product and a first hyper-parameter; and selecting the minimum value between a first preset value and the difference as the intra-level association degree of the jth feature vector.
In some embodiments, the first preset value is 0.
In some embodiments, deriving the third loss function includes: obtaining the inter-level association degree of the kth set according to the level attributes of the kth set through the Nth set, where 1 ≤ k ≤ N-2; and calculating the sum of the inter-level association degrees of the first N-2 sets to obtain the third loss function.
In some embodiments, obtaining the inter-level association degree of the kth set includes: calculating the sum of the level attributes of the (k+1)th set through the (N-1)th set as a first parameter; calculating the difference between the level attribute of the Nth set and the first parameter as a second parameter; calculating the product of the level attribute of the kth set and the second parameter as a third parameter; calculating the difference between the third parameter and a second hyper-parameter as a fourth parameter; and selecting the minimum value between a second preset value and the fourth parameter as the inter-level association degree of the kth set.
in some embodiments, the second preset value is 0.
According to a second aspect of embodiments of the present disclosure, there is provided a model training apparatus, comprising: the first training module is configured to mask the content to be masked in the government affair text sample to obtain mask text; a second training module configured to input the masked text into a machine learning model to obtain text feature vectors; the third training module is configured to obtain a prediction result of the content to be masked according to the text feature vector; and a fourth training module configured to determine a first loss function based on the prediction result and train the machine learning model using the first loss function.
According to a third aspect of embodiments of the present disclosure, there is provided a model training apparatus, comprising: a memory configured to store instructions; a processor coupled to the memory, the processor configured to implement a model training method as described in any of the embodiments above based on execution of instructions stored in the memory.
According to a fourth aspect of embodiments of the present disclosure, there is provided a vector database generating method, including: establishing a local government knowledge base, and updating the local government knowledge base with preset frequency; extracting text content from government affair files of the local government affair knowledge base; dividing the text content to obtain a plurality of texts; inputting each text in the plurality of texts into a machine learning model to obtain a feature vector corresponding to each text, wherein the machine learning model is trained according to the training method described in any embodiment; and writing the obtained characteristic vector into a vector database.
According to a fifth aspect of the embodiments of the present disclosure, there is provided a vector database generating apparatus, including: the first generation module is configured to establish a local government knowledge base and update the local government knowledge base at a preset frequency; the second generation module is configured to extract text content from government files of the local government knowledge base and divide the text content to obtain a plurality of texts; a third generating module configured to input each text of the plurality of texts into a machine learning model to obtain feature vectors corresponding to each text, wherein the machine learning model is trained according to the training method described in any of the above embodiments; and a fourth generation module configured to write the obtained feature vector into the vector database.
According to a sixth aspect of the embodiments of the present disclosure, there is provided a vector database generating apparatus, including: a memory configured to store instructions; a processor coupled to the memory, the processor configured to implement a vector database generation method as described in any of the embodiments above based on execution of instructions stored in the memory.
According to a seventh aspect of the embodiments of the present disclosure, there is provided a question-answering method, including: converting question text input by a user into a question vector; matching the question vector with each feature vector in a vector database to obtain the K feature vectors with the highest similarity, wherein the vector database is obtained according to the generating method of any embodiment above; converting each of the K feature vectors into text to obtain K answer texts; and inputting the question text and the K answer texts into a large language model so that the large language model outputs an answer.
In some embodiments, inputting the question text and the K answer texts into the large language model includes: templating the question text and the K answer texts to obtain templated information; and inputting the templated information into the large language model.
According to an eighth aspect of the embodiments of the present disclosure, there is provided a question answering apparatus, including: a first processing module configured to convert question text input by a user into a question vector; a second processing module configured to match the question vector with each feature vector in a vector database to obtain the K feature vectors with the highest similarity, wherein the vector database is obtained according to the generating method of any embodiment above; a third processing module configured to convert each of the K feature vectors into text to obtain K answer texts; and a fourth processing module configured to input the question text and the K answer texts into a large language model so that the large language model outputs an answer.
According to a ninth aspect of the embodiments of the present disclosure, there is provided a question answering apparatus, including: a memory configured to store instructions; a processor coupled to the memory, the processor configured to perform a method according to any of the embodiments described above based on instructions stored in the memory.
According to a tenth aspect of the embodiments of the present disclosure, there is provided a computer readable storage medium, wherein the computer readable storage medium stores computer instructions which, when executed by a processor, implement a method as referred to in any of the embodiments above.
Other features of the present disclosure and its advantages will become apparent from the following detailed description of exemplary embodiments of the disclosure, which proceeds with reference to the accompanying drawings.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the solutions in the prior art, the drawings required for the embodiments or the description of the prior art are briefly described below. It is apparent that the drawings in the following description show only some embodiments of the present disclosure, and that other drawings may be obtained from them by a person of ordinary skill in the art without inventive effort.
FIG. 1 is a flow diagram of a model training method according to one embodiment of the present disclosure;
FIG. 2 is a flow chart of a model training method according to another embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a model training apparatus according to one embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a model training apparatus according to another embodiment of the present disclosure;
FIG. 5 is a flow chart of a method of generating a vector database according to one embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a vector database generating apparatus according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a vector database generating apparatus according to another embodiment of the present disclosure;
FIG. 8 is a flow chart of a question-answering method according to one embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of a question answering device according to one embodiment of the present disclosure;
fig. 10 is a schematic structural diagram of a question answering device according to another embodiment of the present disclosure;
fig. 11 is a schematic diagram of a question-answering flow based on a local government knowledge base according to an embodiment of the disclosure.
Detailed Description
The following description of the technical solutions in the embodiments of the present disclosure will be made clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are only some embodiments of the present disclosure, not all embodiments. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses. All other embodiments, which can be made by one of ordinary skill in the art without inventive effort, based on the embodiments in this disclosure are intended to be within the scope of this disclosure.
The relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless it is specifically stated otherwise.
Meanwhile, it should be understood that, for convenience of description, the sizes of the parts shown in the drawings are not drawn to actual scale.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but should be considered part of the specification where appropriate.
In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
FIG. 1 is a flow chart of a model training method according to one embodiment of the present disclosure. In some embodiments, the following model training method is performed by the model training apparatus.
In step 101, masking processing is performed on the content to be masked in the government affair text sample to obtain a masking text.
In some embodiments, randomly selected content to be masked in the government affair text sample is masked using a masked language model (Masked Language Model).
For example, the masked language model may be a model such as BertForMaskedLM or DistilBERT.
In some embodiments, the masking ratio is 10%-20%. For example, the masking ratio is 15%, i.e., 15% of the words in the government affair text sample are masked.
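As a minimal illustrative sketch of this masking step (not the claimed implementation), the Hugging Face tokenizer of a Chinese BERT model can replace roughly 15% of the tokens with the [MASK] token; the model name "bert-base-chinese", the 15% ratio and the helper name mask_text are assumptions made here for illustration only:

# Illustrative sketch of the masking step; "bert-base-chinese" and the
# 15% ratio are assumptions, not requirements of this disclosure.
import random
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")

def mask_text(sample: str, mask_ratio: float = 0.15):
    ids = tokenizer(sample, add_special_tokens=True)["input_ids"]
    candidates = list(range(1, len(ids) - 1))   # skip [CLS] and [SEP]
    if not candidates:
        return ids, [-100] * len(ids)
    n_mask = max(1, int(len(candidates) * mask_ratio))
    labels = [-100] * len(ids)                  # -100 marks unmasked positions
    for pos in random.sample(candidates, n_mask):
        labels[pos] = ids[pos]                  # remember the original word
        ids[pos] = tokenizer.mask_token_id      # replace it with [MASK]
    return ids, labels

The returned labels keep the original word identifiers only at the masked positions, which is the form expected by the loss sketch given after formula (1) below.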
At step 102, the masked text is input into a machine learning model to obtain a text feature vector.
It should be noted that the machine learning model predicts the masked content in the masked text and outputs a corresponding text feature vector according to the prediction result.
In step 103, a prediction result of the content to be masked is obtained according to the text feature vector.
In step 104, a first loss function is determined based on the prediction.
In some embodiments, the first loss function is determined from the probability of predicting each result of the content to be masked based on the unmasked content in the masked text.
For example, the first loss function loss_mask is shown in formula (1):

loss_mask = − Σ_{x̂ ∈ m(x)} log p(x̂ | x_\m(x))    (1)

where m(x) denotes the masked words in text x, x_\m(x) denotes the unmasked words in text x, and p(x̂ | x_\m(x)) is the probability of predicting a masked word as x̂.
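As a minimal PyTorch sketch of formula (1) (the framework choice is an assumption; the disclosure does not prescribe one), the negative log-likelihood can be computed only over the masked positions by ignoring the label value -100 used above:

# Sketch of the first loss function in formula (1): negative log-likelihood
# of the masked words given the unmasked context. `logits` has shape
# (seq_len, vocab_size); `labels` holds the original word ids at masked
# positions and -100 elsewhere.
import torch
import torch.nn.functional as F

def masked_lm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    return F.cross_entropy(logits, labels, ignore_index=-100)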
In step 105, a machine learning model is trained using a first loss function.
In the model training method provided by the embodiments of the disclosure, the machine learning model is trained to process incomplete government affair text, so that the trained machine learning model can accurately convert government affair text into feature vectors carrying government affair characteristics, and government affair questions can then be answered accurately on the basis of those feature vectors.
In addition, in the process of training the machine learning model, the grading (level) characteristic of government affair files can be utilized so that the trained machine learning model outputs feature vectors with government affair features more accurately.
In some embodiments, as shown in FIG. 2, after step 102 described above, the model training method may further include the following steps.
In step 201, the obtained feature vectors are put into the same set having the same level attribute to obtain N sets, and the N sets are ordered in order of the level attribute from high to low.
In step 202, a second loss function is obtained based on the intra-level relevance of each feature vector in each set.
In some embodiments, the intra-level association degree of the jth feature vector is obtained using the jth feature vector in the ith set and the level attribute of the ith set, where 1 ≤ i ≤ N, 1 ≤ j ≤ M, and M is the total number of feature vectors in the ith set. Next, the sum of the intra-level association degrees of the feature vectors in the ith set is calculated as the intra-level association degree of the ith set. Then, the sum of the intra-level association degrees of the N sets is calculated to obtain the second loss function.
In some embodiments, determining the intra-level relevance of the jth feature vector comprises:
1) The product between the j-th feature vector in the i-th set and the level attribute of the i-th set is calculated.
2) A difference between the product and the first hyper-parameter is calculated.
3) The minimum value between the first preset value and the difference is taken as the intra-level association degree of the jth feature vector.
In some embodiments, the first preset value is 0.
For example, the second loss function loss_intra is shown in formula (2):

loss_intra = Σ_{c_i ∈ C} Σ_{t ∈ T_i} min(0, t^T c_i − m_intra)    (2)

where C is the set of level attributes, c_i ∈ C is the level attribute of the ith set, T_i is the set of feature vectors assigned to the ith set, t is a feature vector in T_i, t^T is the transpose of t, and m_intra is a predefined hyper-parameter.
That is, for the jth feature vector in the ith set, the association of the jth feature vector with the level attribute of the ith set should be greater than the association of the jth feature vector with the level attributes of the other sets.
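A minimal NumPy sketch of formula (2) follows the steps above literally; the set layout, the dimensions and the value of m_intra are assumptions for illustration:

# Sketch of loss_intra in formula (2): for every feature vector, take the
# product with its set's level attribute, subtract the hyper-parameter
# m_intra, and clip with min(0, .); then sum within each set and across sets.
import numpy as np

def intra_level_loss(sets, m_intra: float = 0.1) -> float:
    # sets: list of (level attribute c_i, list of feature vectors t) pairs,
    # ordered from the highest level attribute to the lowest.
    total = 0.0
    for c_i, vectors in sets:
        total += sum(min(0.0, float(np.dot(t, c_i)) - m_intra) for t in vectors)
    return total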
In step 203, a third loss function is obtained according to the inter-level association of the first N-2 sets of the N sets.
In some embodiments, the inter-level association degree of the kth set is obtained according to the level attributes of the kth set through the Nth set, where 1 ≤ k ≤ N-2. The sum of the inter-level association degrees of the first N-2 sets is then calculated to obtain the third loss function.
In some embodiments, the step of obtaining the inter-level association of the kth set comprises:
1) The sum of the level attributes of the (k+1) -th set to the (N-1) -th set is calculated as a first parameter.
2) The difference between the level attribute of the Nth set and the first parameter is calculated as a second parameter.
3) The product of the level attribute of the kth set and the second parameter is calculated as a third parameter.
4) The difference between the third parameter and the second hyper-parameter is calculated as a fourth parameter.
5) The minimum value between a second preset value and the fourth parameter is selected as the inter-level association degree of the kth set. For example, the second preset value is 0.
For example, if there are currently 3 sets, with the level attribute of the 1st set being c_0, that of the 2nd set being c_1, and that of the 3rd set being c_2, then the third loss function loss_inter is shown in formula (3):

loss_inter = min(0, c_0^T (c_2 − c_1) − m_inter)    (3)

where m_inter is a predefined hyper-parameter.

For another example, if there are currently 4 sets, with the level attributes of the 1st to 4th sets being c_0, c_1, c_2 and c_3 respectively, then the third loss function loss_inter is shown in formula (4):

loss_inter = min(0, c_0^T (c_3 − c_1 − c_2) − m_inter) + min(0, c_1^T (c_3 − c_2) − m_inter)    (4)
That is, for two government documents, the smaller the level difference, the greater the association of the two government documents.
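A companion NumPy sketch of formulas (3) and (4) follows steps 1) to 5) above (the value of m_inter is an assumption for illustration); the relevance loss of formula (5) is then simply the sum of this term and the intra-level term sketched above:

# Sketch of loss_inter: for the kth set (k = 1..N-2), the first parameter is
# the sum of the level attributes of sets k+1..N-1, the second parameter is
# the Nth level attribute minus the first parameter, and min(0, .) clips the
# product with the kth level attribute minus m_inter.
import numpy as np

def inter_level_loss(level_attrs, m_inter: float = 0.1) -> float:
    # level_attrs: level attribute vectors of the N sets, ordered from the
    # highest level to the lowest (c_0, c_1, ..., c_{N-1}).
    n = len(level_attrs)
    total = 0.0
    for k in range(n - 2):
        first = np.sum(level_attrs[k + 1:n - 1], axis=0)
        second = level_attrs[n - 1] - first
        third = float(np.dot(level_attrs[k], second))
        total += min(0.0, third - m_inter)
    return total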
In step 204, a relevance loss function is derived from the second loss function and the third loss function.
In some embodiments, the relevancy loss function is a weighted sum of the second loss function and the third loss function.
For example, the relevance loss function loss is shown in formula (5):

loss = loss_intra + loss_inter    (5)
In step 205, a machine learning model is trained using a relevance loss function.
Through the training, the feature vector output by the machine learning model can better reflect the level characteristics of the government affair text.
Fig. 3 is a schematic structural diagram of a model training apparatus according to an embodiment of the present disclosure. As shown in fig. 3, the model training apparatus includes a first training module 31, a second training module 32, a third training module 33, and a fourth training module 34.
The first training module 31 is configured to mask the content to be masked in the government affair text sample to obtain a mask text.
In some embodiments, randomly selected content to be masked in the government affair text sample is masked using a masked language model (Masked Language Model).
For example, the masked language model may be a model such as BertForMaskedLM or DistilBERT.
In some embodiments, the masking ratio is 10%-20%. For example, the masking ratio is 15%, i.e., 15% of the words in the government affair text sample are masked.
The second training module 32 is configured to input the masked text into a machine learning model to obtain feature vectors of the masked text.
The third training module 33 is configured to derive a prediction result of the content to be masked from the feature vector.
The fourth training module 34 is configured to determine a first loss function based on the prediction and train the machine learning model using the first loss function.
In some embodiments, the first loss function is determined from the probability of predicting each result of the content to be masked based on the unmasked content in the masked text.
For example, the first loss function loss_mask is shown in formula (1) above.
In the model training apparatus provided by the embodiments of the disclosure, the machine learning model is trained to process incomplete government affair text, so that the trained machine learning model can accurately convert government affair text into feature vectors carrying government affair characteristics, and government affair questions can then be answered accurately on the basis of those feature vectors.
In some embodiments, the fourth training module 34 places the obtained feature vectors having the same level attribute into the same set to obtain N sets, and orders the N sets from the highest level attribute to the lowest.
Next, the fourth training module 34 obtains a second loss function according to the intra-level association of each feature vector in each set.
In some embodiments, the fourth training module 34 uses the jth feature vector in the ith set and the level attribute of the ith set to obtain the intra-level association degree of the jth feature vector, where 1 ≤ i ≤ N, 1 ≤ j ≤ M, and M is the total number of feature vectors in the ith set. Next, the fourth training module 34 calculates the sum of the intra-level association degrees of the feature vectors in the ith set as the intra-level association degree of the ith set. The fourth training module 34 then calculates the sum of the intra-level association degrees of the N sets to obtain the second loss function.
In some embodiments, the fourth training module 34 determines the intra-level association degree of the jth feature vector as follows.
1) The product between the j-th feature vector in the i-th set and the level attribute of the i-th set is calculated.
2) A difference between the product and the first hyper-parameter is calculated.
3) The minimum value between the first preset value and the difference is taken as the intra-level association degree of the jth feature vector.
In some embodiments, the first preset value is 0.
For example, the second loss function loss_intra is shown in formula (2) above.
Next, the fourth training module 34 obtains a third loss function according to the inter-level association degrees of the first N-2 sets among the N sets.
In some embodiments, the fourth training module 34 obtains the inter-level association degree of the kth set according to the level attributes of the kth set through the Nth set, where 1 ≤ k ≤ N-2, and then calculates the sum of the inter-level association degrees of the first N-2 sets to obtain the third loss function.
In some embodiments, the fourth training module 34 obtains the inter-level association degree of the kth set as follows.
1) The sum of the level attributes of the (k+1) -th set to the (N-1) -th set is calculated as a first parameter.
2) The difference between the level attribute of the Nth set and the first parameter is calculated as a second parameter.
3) The product of the level attribute of the kth set and the second parameter is calculated as a third parameter.
4) The difference between the third parameter and the second hyper-parameter is calculated as a fourth parameter.
5) The minimum value between a second preset value and the fourth parameter is selected as the inter-level association degree of the kth set. For example, the second preset value is 0.
For example, if there are currently 3 sets, with level attributes c_0, c_1 and c_2 for the 1st, 2nd and 3rd sets respectively, the third loss function loss_inter is shown in formula (3) above.
For another example, if there are currently 4 sets, with level attributes c_0, c_1, c_2 and c_3 for the 1st to 4th sets respectively, the third loss function loss_inter is shown in formula (4) above.
Next, the fourth training module 34 obtains a relevance loss function based on the second loss function and the third loss function.
In some embodiments, the relevancy loss function is a weighted sum of the second loss function and the third loss function.
For example, the relevance loss function loss is shown in formula (5) above.
Next, the fourth training module 34 trains the machine learning model with a relevance loss function.
Through the training, the feature vector output by the machine learning model can better reflect the level characteristics of the government affair text.
Fig. 4 is a schematic structural diagram of a model training apparatus according to another embodiment of the present disclosure. As shown in fig. 4, the model training apparatus includes a memory 41 and a processor 42.
The memory 41 is for storing instructions and the processor 42 is coupled to the memory 41, the processor 42 being configured to perform a method as referred to in any of the embodiments of fig. 1 or 2 based on the instructions stored by the memory.
As shown in fig. 4, the model training apparatus further comprises a communication interface 43 for information interaction with other devices. Meanwhile, the model training device further comprises a bus 44, and the processor 42, the communication interface 43 and the memory 41 are in communication with each other through the bus 44.
The memory 41 may include high-speed RAM and may further include non-volatile memory, such as at least one magnetic disk memory. The memory 41 may also be a memory array, or be partitioned into blocks that can be combined into virtual volumes according to certain rules.
Further, the processor 42 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present disclosure.
The present disclosure also relates to a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement a method as referred to in any of the embodiments of fig. 1 or 2.
Fig. 5 is a flowchart illustrating a vector database generating method according to an embodiment of the present disclosure. In some embodiments, the following vector database generation method is performed by the vector database generation apparatus.
In step 501, a local government knowledge base is established, and the local government knowledge base is updated at a preset frequency.
It should be noted that, because the training cost of a large language model is high, its knowledge updates lag behind, so some current questions cannot be answered accurately. The present disclosure therefore establishes a local government knowledge base and updates it at a preset frequency, so that the local knowledge base can be kept up to date periodically at a small cost. When a question involves knowledge newer than what the large language model was trained on, the answer can be drawn from the local knowledge base, effectively overcoming the shortcoming caused by the knowledge lag of the large language model.
In step 502, text content is extracted from government files in a local government knowledge base.
For example, government files of different formats in the government knowledge base, such as PDF, docx, txt and csv files, are loaded through Python libraries such as PDFReader and docx, and the file contents are extracted.
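A hedged loading sketch is given below; pypdf and python-docx are assumptions standing in for the "PDFReader, docx" libraries mentioned above, and only three formats are handled for brevity:

# Illustrative loader for PDF/docx/txt government files; pypdf and
# python-docx are assumed to be installed. Other formats (e.g. csv)
# would need their own loaders.
from pathlib import Path
from pypdf import PdfReader
from docx import Document

def extract_text(path: str) -> str:
    p = Path(path)
    if p.suffix.lower() == ".pdf":
        return "\n".join(page.extract_text() or "" for page in PdfReader(p).pages)
    if p.suffix.lower() == ".docx":
        return "\n".join(par.text for par in Document(str(p)).paragraphs)
    return p.read_text(encoding="utf-8")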
At step 503, the text content is segmented to obtain a plurality of texts.
For example, the text content is segmented to obtain a plurality of text paragraphs. In this process, the text data can also be cleaned, filtering out special characters, unusual symbols and similar content from the text.
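A small sketch of the segmentation and cleaning step follows; the blank-line splitting rule and the character classes kept by the regular expression are assumptions about what counts as a "special character":

# Split extracted text into paragraphs and strip special characters,
# keeping word characters, CJK characters, CJK punctuation and whitespace.
import re

def split_and_clean(text: str):
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [re.sub(r"[^\w\u4e00-\u9fff\u3000-\u303f\uff00-\uffef\s]", "", p)
            for p in paragraphs]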
At step 504, each text of the plurality of texts is input into a machine learning model to obtain feature vectors corresponding to each text.
It should be noted that the machine learning model is trained according to the training method of any one of the embodiments in fig. 1 or fig. 2.
In step 505, the resulting feature vector is written into a vector database.
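As an illustrative sketch of steps 504 and 505, FAISS is used below as the vector database; FAISS itself, the cosine-similarity choice and the encode_fn callable (standing in for the trained machine learning model) are assumptions, and any vector store with a similar interface could be substituted:

# Encode each text with the trained model and write the vectors into a
# FAISS index acting as the vector database.
import faiss
import numpy as np

def build_vector_db(texts, encode_fn):
    vectors = np.stack([encode_fn(t) for t in texts]).astype("float32")
    faiss.normalize_L2(vectors)                  # cosine similarity via inner product
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)
    return index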
Fig. 6 is a schematic structural diagram of a vector database generating apparatus according to an embodiment of the present disclosure. As shown in fig. 6, the vector database generating apparatus includes a first generating module 61, a second generating module 62, a third generating module 63, and a fourth generating module 64.
The first generation module 61 is configured to establish a local government knowledge base and update the local government knowledge base at a preset frequency.
The second generation module 62 is configured to extract text content from government documents in the local government knowledge base and segment the text content to obtain a plurality of texts.
For example, government files of different formats in the government knowledge base, such as PDF, docx, txt and csv files, are loaded through Python libraries such as PDFReader and docx, and the file contents are extracted.
For example, the text content is segmented to obtain a plurality of text paragraphs. In this process, the text data can also be cleaned, filtering out special characters, unusual symbols and similar content from the text.
The third generation module 63 is configured to input each text of the plurality of texts into the machine learning model to obtain a feature vector corresponding to each text.
It should be noted that the machine learning model is trained according to the training method of any one of the embodiments in fig. 1 or fig. 2.
The fourth generation module 64 is configured to write the resulting feature vectors into a vector database.
Fig. 7 is a schematic structural diagram of a vector database generating apparatus according to another embodiment of the present disclosure. As shown in fig. 7, the vector database generating apparatus includes a memory 71, a processor 72, a communication interface 73, and a bus 74. Fig. 7 differs from fig. 4 in that, in the embodiment shown in fig. 7, the processor 72 is configured to perform the method referred to in any of the embodiments of fig. 5 based on instructions stored in the memory 71.
Fig. 8 is a flow chart of a question-answering method according to an embodiment of the present disclosure. In some embodiments, the following question answering method is performed by a question answering apparatus.
In step 801, question text entered by a user is converted into a question vector.
In step 802, the question vector is matched with each feature vector in the vector database to obtain the K feature vectors with the highest similarity.
Here, the vector database is obtained according to the generating method of any one of the embodiments in fig. 5.
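Continuing the FAISS assumption from the vector database sketch above, steps 801 and 802 can be illustrated as follows (K, encode_fn and the variable names are assumptions for illustration):

# Embed the question with the same trained model and retrieve the K most
# similar feature vectors (and their source texts) from the index.
import faiss
import numpy as np

def retrieve_top_k(question: str, encode_fn, index, texts, k: int = 3):
    q = np.asarray(encode_fn(question), dtype="float32").reshape(1, -1)
    faiss.normalize_L2(q)
    scores, ids = index.search(q, k)
    return [texts[i] for i in ids[0]]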
In step 803, each of the K feature vectors is converted into text to obtain K answer texts.
At step 804, the question text and K answer texts are input into the large language model so that the large language model outputs an answer.
The large language model may be chatGLM, BLOOM, MOSS, or another suitable model.
In some embodiments, the question text and the K answer texts are templated to obtain templated information, and the templated information is input into the large language model.
For example, a prompt template is: "Give a concise and professional answer to the user question based on the following known information. Question: {query}. Known information: {context}".
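A short sketch of the templating step is given below; the template wording mirrors the example above, and the llm callable is an assumption standing in for chatGLM, BLOOM, MOSS or another large language model interface:

# Fill the question and the K retrieved answer texts into the prompt
# template and pass the templated information to the large language model.
PROMPT_TEMPLATE = (
    "Give a concise and professional answer to the user question based on "
    "the following known information.\nQuestion: {query}\nKnown information: {context}"
)

def answer(question: str, answer_texts, llm):
    prompt = PROMPT_TEMPLATE.format(query=question, context="\n".join(answer_texts))
    return llm(prompt)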
Fig. 9 is a schematic structural diagram of a question answering device according to one embodiment of the present disclosure. As shown in fig. 9, the question answering apparatus includes a first processing module 91, a second processing module 92, a third processing module 93, and a fourth processing module 94.
The first processing module 91 is configured to convert the question text entered by the user into a question vector.
The second processing module 92 is configured to match the question vector with each feature vector in the vector database to obtain the K feature vectors with the highest similarity.
Here, the vector database is obtained according to the generating method of any one of the embodiments in fig. 5.
The third processing module 93 is configured to convert each of the K feature vectors into text to obtain K answer texts.
The fourth processing module 94 is configured to input the question text and the K answer texts into the large language model so that the large language model outputs an answer.
The large language model may be chatGLM, BLOOM, MOSS, or another suitable model.
In some embodiments, the fourth processing module 94 templates the question text and the K answer texts to obtain templated information, which is then input into the large language model.
Fig. 10 is a schematic structural diagram of a question answering device according to another embodiment of the present disclosure. As shown in fig. 10, the question answering apparatus includes a memory 1001, a processor 1002, a communication interface 1003, and a bus 1004. Fig. 10 differs from fig. 7 in that, in the embodiment shown in fig. 10, the processor 1002 is configured to perform the method referred to in any of the embodiments of fig. 8 based on instructions stored by the memory 1001.
Fig. 11 is a schematic diagram of a question-answering flow based on a local government knowledge base according to an embodiment of the disclosure.
As shown in fig. 11, first, a local government knowledge base is established, and the local government knowledge base is updated at a preset frequency. Next, text content is extracted from government documents in a local government knowledge base. Next, the text content is divided to obtain a plurality of texts. Further, each text of the plurality of texts is input into a trained machine learning model to obtain a feature vector corresponding to each text, and the obtained feature vector is written into a vector database.
Then, the question text input by the user is converted into a question vector, and feature similarity between the question vector and each feature vector in the vector database is calculated to obtain the K feature vectors with the highest similarity. Each of the K feature vectors is then converted into text to obtain K answer texts, the question text and the K answer texts are templated, and the templated result is input into the large language model so that the large language model outputs an answer.
In some embodiments, the functional unit blocks described above may be implemented as general-purpose processors, programmable logic controllers (Programmable Logic Controller, abbreviated as PLCs), digital signal processors (Digital Signal Processor, abbreviated as DSPs), application specific integrated circuits (Application Specific Integrated Circuit, abbreviated as ASICs), field programmable gate arrays (Field-Programmable Gate Array, abbreviated as FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or any suitable combination thereof for performing the functions described in the present disclosure.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The description of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure for various embodiments with various modifications as are suited to the particular use contemplated.

Claims (20)

1. A model training method performed by a model training apparatus, comprising:
masking the content to be masked in the government affair text sample to obtain a masking text;
inputting the mask text into a machine learning model to obtain text feature vectors;
obtaining a prediction result of the content to be masked according to the text feature vector;
determining a first loss function according to the prediction result;
training the machine learning model using the first loss function.
2. The method of claim 1, wherein,
the first loss function is determined by a probability of predicting each result of the content to be masked based on the content in the mask text that is not masked.
3. The method of any of claims 1-2, further comprising:
putting the obtained feature vectors into the same set with the same level attribute to obtain N sets, and sequencing the N sets according to the order of the level attribute from high to low;
obtaining a second loss function according to the intra-level association degree of each feature vector in each set;
obtaining a third loss function according to the inter-level association degree of the first N-2 sets in the N sets;
obtaining a relevance loss function according to the second loss function and the third loss function;
and training the machine learning model by utilizing the relevancy loss function.
4. The method of claim 3, wherein,
the relevancy loss function is a weighted sum of the second loss function and the third loss function.
5. A method according to claim 3, wherein deriving a second loss function comprises:
obtaining the intra-level association degree of the jth feature vector by utilizing the jth feature vector in the ith set and the level attribute of the ith set, wherein i is more than or equal to 1 and less than or equal to N, j is more than or equal to 1 and less than or equal to M, and M is the total number of feature vectors in the ith set;
calculating the sum of the intra-level association degrees of each feature vector in the ith set to serve as the intra-level association degree of the ith set;
and calculating the sum of intra-level association degrees of each of the N sets to obtain the second loss function.
6. The method of claim 5, wherein determining the intra-level association degree of the jth feature vector comprises:
calculating the product between the jth feature vector in the ith set and the level attribute of the ith set;
calculating the difference between the product and the first hyper-parameter;
and selecting the minimum value between the first preset value and the difference value as the intra-level association degree of the j-th feature vector.
7. The method of claim 6, wherein,
the first preset value is 0.
8. A method according to claim 3, wherein deriving a third loss function comprises:
obtaining the inter-level association degree of the kth set according to the level attributes from the kth set to the Nth set, wherein k is more than or equal to 1 and less than or equal to N-2;
and calculating the sum of the inter-level association degree of each of the first N-2 sets to obtain the third loss function.
9. The method of claim 8, wherein deriving the inter-level association of the kth set comprises:
calculating the sum of the level attributes from the (k+1) th set to the (N-1) th set as a first parameter;
calculating the difference between the level attribute of the Nth set and the first parameter to be used as a second parameter;
calculating the product of the level attribute of the kth set and the second parameter as a third parameter;
calculating a difference between the third parameter and the second hyper-parameter as a fourth parameter;
and selecting a minimum value between a second preset value and the fourth parameter as the inter-level association degree of the kth set.
10. The method of claim 9, wherein,
the second preset value is 0.
11. A model training apparatus comprising:
the first training module is configured to mask the content to be masked in the government affair text sample to obtain mask text;
a second training module configured to input the masked text into a machine learning model to obtain text feature vectors;
the third training module is configured to obtain a prediction result of the content to be masked according to the text feature vector;
and a fourth training module configured to determine a first loss function based on the prediction result and train the machine learning model using the first loss function.
12. A model training apparatus comprising:
a memory configured to store instructions;
a processor coupled to the memory, the processor configured to perform the method of any of claims 1-10 based on instructions stored by the memory.
13. A vector database generation method, comprising:
establishing a local government knowledge base, and updating the local government knowledge base with preset frequency;
extracting text content from government affair files of the local government affair knowledge base;
dividing the text content to obtain a plurality of texts;
inputting each text of the plurality of texts into a machine learning model to obtain a feature vector corresponding to each text, wherein the machine learning model is trained according to the training method of any one of claims 1-10;
and writing the obtained characteristic vector into a vector database.
14. A vector database generating apparatus comprising:
the first generation module is configured to establish a local government knowledge base and update the local government knowledge base at a preset frequency;
the second generation module is configured to extract text content from government files of the local government knowledge base and divide the text content to obtain a plurality of texts;
a third generation module configured to input each text of the plurality of texts into a machine learning model to obtain feature vectors corresponding to each text, wherein the machine learning model is trained according to the training method of any one of claims 1-10;
and a fourth generation module configured to write the obtained feature vector into the vector database.
15. A vector database generating apparatus comprising:
a memory configured to store instructions;
a processor coupled to the memory, the processor configured to implement the method of claim 13 based on instructions stored by the memory.
16. A question-answering method, comprising:
converting question text input by a user into a question vector;
matching the question vector with each feature vector in a vector database to obtain K feature vectors with highest similarity, wherein the vector database is obtained according to the generating method of claim 13;
converting each of the K feature vectors into a text to obtain K answer texts;
and inputting the question text and the K answer texts into a large language model so that the large language model outputs answers.
17. The method of claim 16, wherein inputting the question text and the K answer texts into a large language model comprises:
carrying out templating processing on the question text and the K answer texts to obtain templated information;
and inputting the templated information into the large language model.
18. A question answering apparatus comprising:
a first processing module configured to convert a question text input by a user into a question vector;
a second processing module configured to match the question vector with each feature vector in a vector database to obtain K feature vectors with highest similarity, wherein the vector database is obtained according to the generating method of claim 13;
a third processing module configured to convert each of the K feature vectors into text to obtain K answer texts;
a fourth processing module configured to input the question text and the K answer texts into a large language model so that the large language model outputs an answer.
19. A question answering apparatus comprising:
a memory configured to store instructions;
a processor coupled to the memory, the processor configured to perform the method of any of claims 16-17 based on instructions stored by the memory.
20. A computer readable storage medium storing computer instructions which, when executed by a processor, implement the method of any one of claims 1-10, 13, 16-17.
CN202311576180.3A 2023-11-23 2023-11-23 Model training method, vector database generating method, question answering method and question answering device Pending CN117609450A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311576180.3A CN117609450A (en) 2023-11-23 2023-11-23 Model training method, vector database generating method, question answering method and question answering device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311576180.3A CN117609450A (en) 2023-11-23 2023-11-23 Model training method, vector database generating method, question answering method and question answering device

Publications (1)

Publication Number Publication Date
CN117609450A true CN117609450A (en) 2024-02-27

Family

ID=89950850

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311576180.3A Pending CN117609450A (en) 2023-11-23 2023-11-23 Model training method, vector database generating method, question answering method and question answering device

Country Status (1)

Country Link
CN (1) CN117609450A (en)

Similar Documents

Publication Publication Date Title
CN111159416B (en) Language task model training method and device, electronic equipment and storage medium
CN111475623B (en) Case Information Semantic Retrieval Method and Device Based on Knowledge Graph
CN108647205B (en) Fine-grained emotion analysis model construction method and device and readable storage medium
CN111444320B (en) Text retrieval method and device, computer equipment and storage medium
CN110362723B (en) Topic feature representation method, device and storage medium
CN111898374B (en) Text recognition method, device, storage medium and electronic equipment
CN110929524A (en) Data screening method, device, equipment and computer readable storage medium
CN111046147A (en) Question answering method and device and terminal equipment
CN112417119A (en) Open domain question-answer prediction method based on deep learning
CN111125295A (en) Method and system for obtaining food safety question answers based on LSTM
Niyozmatova et al. Classification Based On Decision Trees And Neural Networks
CN115630632A (en) Method, system, medium and terminal for correcting personal name in specific field based on context semantics
CN114996464A (en) Text grading method and device using ordered information
CN116910185B (en) Model training method, device, electronic equipment and readable storage medium
CN112632956A (en) Text matching method, device, terminal and storage medium
CN116049376B (en) Method, device and system for retrieving and replying information and creating knowledge
Krutilla et al. The origin and primary areas of application of natural language processing
CN114842982B (en) Knowledge expression method, device and system for medical information system
CN117609450A (en) Model training method, vector database generating method, question answering method and question answering device
CN112597208A (en) Enterprise name retrieval method, enterprise name retrieval device and terminal equipment
CN113342924A (en) Answer retrieval method and device, storage medium and electronic equipment
CN105808522A (en) Method and apparatus for semantic association
CN111813941A (en) Text classification method, device, equipment and medium combining RPA and AI
CN111159366A (en) Question-answer optimization method based on orthogonal theme representation
CN112800778B (en) Intent recognition method, system and storage medium based on word string length

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination