CN117235210A - Content complement method, device, equipment and medium based on automatic complement model - Google Patents
- Publication number
- CN117235210A (application number CN202311329042.5A)
- Authority
- CN
- China
- Prior art keywords
- text
- search
- model
- preset
- log
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application discloses a content completion method, apparatus, device and medium based on an auto-completion model, relating to the field of natural language processing and comprising the following steps: acquiring a search text and processing it to generate a search log; counting the number of logs and judging whether the count meets a preset condition, and if so, adding start and stop symbols to the log texts and to a preset search library text, then slicing the resulting texts to construct a content completion data set from the obtained text fragments; vectorizing the text fragments and inputting the resulting text tensors into a preset language model to calculate an overall model loss; and updating the preset language model based on the overall model loss, providing auto-completed text with the resulting updated model, and beginning the next round of model updating. In this way, the model can be updated online from both the user search data and the search library text data, improving the accuracy of query suggestions and the user experience.
Description
Technical Field
The invention relates to the field of natural language processing, in particular to a content complement method, device, equipment and medium based on an automatic complement model.
Background
Natural language technology has developed alongside deep learning and neural networks. In the field of search engines, natural language processing can help users focus on content of interest by enabling semantic analysis of retrieved content. Automatic completion of search content means that, while using a search engine, a user inputs only a few keywords and an algorithm intelligently suggests the complete sentence the user intends to query, helping the user quickly locate the desired content in a massive search library.
Three approaches are commonly used. The first is character matching, which requires manually maintaining a dictionary library, periodically expanding and pruning its contents, matching the user's query terms against the dictionary, and locking in a recommendation list. The second is a recommendation-system approach: by collecting features of users' retrieval behavior and modeling that behavior with mature recommendation algorithms, personalized suggestions can be produced; however, the recommendation algorithm suffers a cold-start period when it goes online, during which user behavior data may be lost, degrading recommendation quality. The third is intelligent suggestion based on log mining, which mines users' search information from the search logs and makes corresponding recommendations; its data, however, come only from the search logs, so the data source is quite limited.
Disclosure of Invention
In view of the above, the present application aims to provide a content completion method, apparatus, device and medium based on an auto-completion model, in which the model can be updated online from user search data and search library text data, improving the accuracy of query suggestions and the user experience. The specific scheme is as follows:
in a first aspect, the application discloses a content complement method based on an automatic complement model, comprising the following steps:
acquiring a search text input by a user terminal, and processing the search text based on a preset log generation rule to generate a search log;
counting the number of the logs of the search logs, judging whether the number of the logs meets a preset condition or not based on a preset number threshold, if so, adding start and stop symbols for the log texts in the search logs and the search library texts in a preset search library, and intercepting the obtained added texts to construct a content completion data set based on a plurality of obtained text fragments;
vectorizing the text segments in the content completion data set, and inputting the obtained text tensors into a preset language model to calculate the total model loss of the preset language model based on the text tensors;
updating the preset language model based on the overall model loss, providing auto-completed text for the user terminal with the resulting updated model, and returning to the step of acquiring the search text input by the user terminal and processing the search text based on the preset log generation rule to generate a search log, so as to perform the next round of model updating.
Optionally, the obtaining the search text input by the user terminal, and processing the search text based on a preset log generation rule to generate a search log includes:
acquiring a search text input by a user terminal, and determining the query time and the user identification corresponding to the search text;
determining whether the search text is the same as an actual query text, if not, determining target ranks of the actual query text in a plurality of recommended texts, and determining the actual query text as a log text to generate a first search log based on the query time, the user identification, the search text, the log text and the target ranks; the plurality of recommended texts are recommended texts which are generated through the preset language model and are related to the search text; the actual query text is a search text determined by the user side from the plurality of recommended texts;
If yes, determining the search text as the log text, and generating a second search log based on the query time, the user identification and the log text.
Optionally, the counting the number of logs in the search log, and judging whether the number of logs meets a preset condition based on a preset number threshold, if yes, adding start and stop symbols to log texts in the search log and search library texts in a preset search library, and intercepting the obtained added texts to construct a content complement data set based on a plurality of obtained text fragments, including:
counting the number of all the search logs stored locally, and judging whether the number of logs is not less than an integer multiple of a preset number threshold;
if yes, determining all the log texts in the search logs and search library texts in a preset search library, and adding a start symbol and a stop symbol to the log texts and the search library texts to obtain added texts;
and intercepting the added text based on each text interception length in a preset interception range respectively to construct a content complement data set based on the obtained text fragments.
Optionally, after counting the number of all the search logs stored locally and judging whether the number of logs is not less than an integer multiple of the preset number threshold, the method further includes:
if not, returning to the step of acquiring the search text input by the user terminal and processing the search text based on the preset log generation rule to generate a search log, until the number of all the search logs stored locally is not less than an integer multiple of the preset number threshold, so as to construct the content completion data set based on the search logs.
Optionally, the vectorizing the plurality of text segments in the content completion dataset, and inputting the plurality of obtained text tensors into a preset language model to calculate an overall model loss of the preset language model based on the plurality of text tensors, including:
vectorizing the text segments in the content completion data set, and obtaining a first text tensor corresponding to the text segments, a second text tensor corresponding to the text segment corresponding labels and a third text tensor corresponding to the text segment prediction directions based on a preset dictionary index corresponding table;
Performing dimension reduction and feature extraction on the first text tensor to obtain a processed first text tensor;
and carrying out maximum pooling operation on the processed first text tensor to calculate the total model loss of the preset language model based on the obtained output tensor, the second text tensor and the third text tensor.
Optionally, the performing a maximum pooling operation on the processed first text tensor to calculate an overall model loss of the preset language model based on the obtained output tensor, the second text tensor, and the third text tensor includes:
inputting the processed first text tensor into a Maxpooling layer to perform a maximum pooling operation on it, thereby obtaining a pooled text tensor;
inputting the pooled text tensor into a first linear classification layer and a second linear classification layer respectively, to obtain a corresponding first output tensor and second output tensor;
calculating cross entropy loss of the preset language model based on the first output tensor and the second text tensor to obtain target cross entropy loss; calculating the mean square error of the preset language model based on the second output tensor and the third text tensor, so as to take the obtained mean square error as the prediction loss of the preset language model;
And taking the sum value of the target cross entropy loss and the prediction loss as the total model loss of the preset language model.
Optionally, the step of updating the preset language model based on the overall model loss, providing auto-completed text for the user terminal with the resulting updated model, and returning to the step of acquiring the search text input by the user terminal and processing the search text based on the preset log generation rule to generate a search log, so as to perform the next round of model updating, includes:
updating the preset language model based on the overall model loss to obtain an updated language model;
judging whether a current search text input by the user terminal is received, and if so, generating a plurality of auto-completed texts corresponding to the current search text based on the updated language model, then returning to the step of acquiring the search text input by the user terminal and processing the search text based on the preset log generation rule to generate a search log, so as to perform the next round of model updating.
In a second aspect, the present application discloses a content complement device based on an automatic complement model, comprising:
The log generation module is used for acquiring a search text input by a user terminal and processing the search text based on a preset log generation rule so as to generate a search log;
the data set construction module is used for counting the number of the logs of the search logs, judging whether the number of the logs meets preset conditions or not based on a preset number threshold, if yes, adding start and stop symbols for the log texts in the search logs and the search library texts in a preset search library, and intercepting the obtained added texts to construct a content completion data set based on a plurality of obtained text fragments;
the loss calculation module is used for carrying out vectorization processing on the text fragments in the content completion data set, inputting the obtained text tensors into a preset language model, and calculating the overall model loss of the preset language model based on the text tensors;
and the model updating module is used for updating the preset language model based on the overall model loss, providing auto-completed text for the user terminal with the resulting updated model, and returning to the step of acquiring the search text input by the user terminal and processing the search text based on the preset log generation rule to generate a search log, so as to perform the next round of model updating.
In a third aspect, the present application discloses an electronic device, comprising:
a memory for storing a computer program;
and a processor for executing the computer program to implement the automatic complement model-based content complement method as described above.
In a fourth aspect, the present application discloses a computer readable storage medium storing a computer program which, when executed by a processor, implements a content complementing method based on an auto-complementing model as described above.
In the method of the present application, a search text input by a user terminal is first acquired and processed according to a preset log generation rule to generate a search log. The number of search logs is counted and compared against a preset number threshold to judge whether a preset condition is met; if so, start and stop symbols are added to the log texts in the search logs and to the search library texts in a preset search library, and the resulting texts are sliced to construct a content completion data set from the obtained text fragments. The text fragments in the content completion data set are then vectorized, and the resulting text tensors are input into a preset language model to calculate its overall model loss. Finally, the preset language model is updated based on the overall model loss, auto-completed text is provided to the user terminal with the updated model, and the process returns to acquiring and processing search text, so that the next round of updating can be carried out.
Therefore, the method can generate the search log based on the search text of the user, process the search text in the preset search library and the search text in the search log after the search log reaches the preset quantity threshold, construct a content complement data set by using the processed data, and then perform vectorization processing on text fragments in the content complement data set so as to train a preset language model through a plurality of obtained text tensors, calculate model loss of the preset language model, update the language model according to the obtained model loss, and collect the search text at the same time when the updated model is used for providing the complement text so as to perform next round of updating. In this way, the model can be updated online through the user search data and the search library text data, so that the accuracy of query prompt and user experience are improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a content completion method based on an automatic completion model provided by the application;
FIG. 2 is a flowchart of a specific automatic-complement-model-based content complement method according to the present application;
FIG. 3 is a schematic diagram of an auto-complete model framework provided by the present application;
FIG. 4 is a schematic diagram of a content completion device based on an automatic completion model according to the present application;
fig. 5 is a block diagram of an electronic device according to the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Three methods for automatic completion are commonly used in the prior art. The first is character matching, whose retrieval is too slow: as the dictionary library grows, the time to return suggestions often exceeds users' expectations, degrading the user experience. The second is the recommendation-system method, which suffers a cold-start period when the recommendation algorithm goes online, during which user behavior data may be lost, resulting in poor recommendations. The third method draws its data only from search logs, so its data source is quite limited.
In order to overcome the technical problems, the invention discloses a content complement method, device, equipment and medium based on an automatic complement model. The model can be updated online through the user search data and the search library text data, so that the accuracy of query prompt and user experience are improved.
Referring to fig. 1, the embodiment of the invention discloses a content complement method based on an automatic complement model, which comprises the following steps:
step S11, obtaining a search text input by a user terminal, and processing the search text based on a preset log generation rule to generate a search log.
In this embodiment, in order to update the preset language model, i.e. the auto-completion model, the texts in a preset search library and the search texts input by the user terminal, collected in real time, need to be processed into training data. First, the search text typed in by the user terminal is acquired in order to generate a search log, through the following steps: acquiring the search text input by the user terminal, and determining the query time and the user identifier corresponding to the search text; determining whether the search text is identical to the actual query text; if not, determining the target rank of the actual query text among a plurality of recommended texts, and taking the actual query text as the log text, so as to generate a first search log based on the query time, the user identifier, the search text, the log text and the target rank, where the recommended texts are texts related to the search text generated by the preset language model, and the actual query text is the one the user terminal selected from the recommended texts; if yes, taking the search text itself as the log text, and generating a second search log based on the query time, the user identifier and the log text.
That is, the search text input by the user terminal is acquired, together with the query time and the user identifier (the user ID) at the time of input. It is then determined whether the actual query text finally searched by the user is the same as the typed search text. If not, the query the user searched with came from the text recommended by the preset language model, i.e. the auto-completion model; in that case the actual query text and its rank among the returned results are determined, the actual query text serves as the log text, and the search log is generated from the query time, the user ID, the search text, the log text and the target rank. In the other case, where the search text coincides with the actual query text, the search text itself is taken as the log text, and the search log is generated from the log text, the query time and the user ID.
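As a concrete illustration, the log-generation rule just described can be sketched in Python; the record fields and function names below are illustrative assumptions, not the patent's exact format:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SearchLog:
    query_time: str
    user_id: str
    search_text: str            # what the user typed
    log_text: str               # the actual query text that was searched
    target_rank: Optional[int]  # rank of the chosen recommendation, if any

def make_search_log(query_time: str, user_id: str, search_text: str,
                    actual_query_text: str,
                    recommended_texts: List[str]) -> SearchLog:
    """Generate a search log according to the rule above (a sketch)."""
    if search_text != actual_query_text:
        # The user picked one of the model's recommendations:
        # record the actual query text and its rank ("first search log").
        rank = recommended_texts.index(actual_query_text)
        return SearchLog(query_time, user_id, search_text,
                         actual_query_text, rank)
    # The user searched exactly what they typed ("second search log").
    return SearchLog(query_time, user_id, search_text, search_text, None)
```

A record like this carries exactly the five fields named for the first search log, and omits the rank for the second.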
And step S12, counting the number of the logs of the search logs, judging whether the number of the logs meets a preset condition based on a preset number threshold, if so, adding start and stop symbols for the log texts in the search logs and the search library texts in a preset search library, and intercepting the obtained added texts to construct a content complement data set based on a plurality of obtained text fragments.
In this embodiment, in order to automatically update the model, the conditions for an update need to be set. The specific procedure is as follows: counting the number of all the search logs stored locally, and judging whether the count is not less than an integer multiple of a preset number threshold; if yes, taking all the log texts in the search logs together with the search library texts in the preset search library, and adding a start symbol and a stop symbol to each to obtain the augmented texts; then slicing the augmented texts at each interception length within a preset range, and constructing the content completion data set from the obtained text fragments. That is, the automatic update rule counts all local search logs, and the texts are processed once the number of search logs exceeds an integer multiple of a preset threshold. It should be noted that the present application does not limit the preset threshold, which may be set according to user requirements. For example, with the threshold set to 5000, the model is updated once the number of local search logs reaches 5000, and the next round of updating is triggered when it reaches 10000. Further, when a model update is triggered, the search library texts of the preset search library are extracted, the start and stop symbols "<start>" and "<end>" are added to the search library texts and the log texts, and text fragments of length L are sliced out to construct the content completion data set, where L takes the values {2,3,4,5}.
The search library texts include titles, abstracts, detailed descriptions, and so on. Taking a title as an example, the title "购买住房提取住房公积金" ("withdraw the housing provident fund to purchase housing") becomes "<start>购买住房提取住房公积金<end>" after the start and stop symbols are added; with L equal to 3, the resulting text fragments are the overlapping character windows "<start>购买", "购买住", "买住房", …, "积金<end>". After all text fragments are obtained, the content completion data set is created from them; each sample has the format {text fragment, label, prediction direction}, where label is the label corresponding to the text fragment and the prediction direction indicates whether the fragment lies in the first or second half of the sentence. In this way, the data in the search library can be combined with the data users search in real time to generate the completion data set for model training, making the training of the auto-completion model more comprehensive and avoiding the inaccurate recommendations that result from training the model on a single data source.
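The slicing just illustrated can be sketched as a sliding character window. In the sketch below, taking the next character as the fragment's label is an assumption (the text only says "the label corresponding to the text fragment"), and the first-half/second-half rule for the prediction direction follows the sample format above:

```python
from typing import Dict, List

def build_fragments(texts: List[str],
                    lengths=(2, 3, 4, 5)) -> List[Dict]:
    """Wrap each text in <start>/<end> markers and slice overlapping
    windows of each length L, producing {fragment, label, direction}
    samples as in the text (the label choice is an assumption)."""
    samples = []
    for text in texts:
        units = ["<start>"] + list(text) + ["<end>"]
        for L in lengths:
            for i in range(len(units) - L + 1):
                fragment = units[i:i + L]
                # Assumed label: the unit that follows the fragment.
                label = units[i + L] if i + L < len(units) else None
                # Direction: 1 if the fragment starts in the first
                # half of the sentence, else 0.
                direction = 1 if i < len(units) / 2 else 0
                samples.append({"fragment": fragment,
                                "label": label,
                                "direction": direction})
    return samples
```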
It should be noted that, after counting the number of all the local search logs and judging whether it is not less than an integer multiple of the preset number threshold, the method further includes: if not, returning to the step of acquiring the search text input by the user terminal and processing it based on the preset log generation rule to generate a search log, until the number of local search logs is not less than an integer multiple of the preset number threshold, after which the content completion data set is constructed from the search logs. That is, if the current number of search logs has not reached an integer multiple of the preset number threshold, input search texts continue to be collected and new search logs generated from them until it has.
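The update trigger described here — fire the k-th update round once the local log count reaches k times the threshold — can be expressed compactly (the threshold of 5000 is the example value from the text):

```python
def should_update(num_logs: int, rounds_done: int,
                  threshold: int = 5000) -> bool:
    """Return True when enough new logs have accumulated for the next
    update round: the (k+1)-th round fires once the local log count
    reaches (k+1) * threshold."""
    return num_logs >= (rounds_done + 1) * threshold
```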
And S13, carrying out vectorization processing on the text fragments in the content completion data set, and inputting the obtained text tensors into a preset language model so as to calculate the total model loss of the preset language model based on the text tensors.
In this embodiment, the text fragments in the completion data set are vectorized to reduce the amount of data the model must process, and thereby the training time. After vectorization yields the text tensors, they are input into the model and the overall model loss is calculated, as follows: the text fragments in the content completion data set are vectorized, and a preset dictionary index correspondence table is used to obtain a first text tensor corresponding to each text fragment, a second text tensor corresponding to the fragment's label, and a third text tensor corresponding to the fragment's prediction direction; dimension reduction and feature extraction are performed on the first text tensor to obtain a processed first text tensor; a maximum pooling operation is then applied, and the overall model loss of the preset language model is calculated from the resulting output tensor, the second text tensor and the third text tensor. Concretely, for sample vectorization, the maximum character length of a text fragment is set to L, and the dictionary index mapping table maps each fragment to a tensor T of dimension R^L. With v the number of characters in the dictionary, the tensor corresponding to the label is Y, of dimension R^1. The tensor M encodes the prediction direction of the fragment in the sample: M = 1 means the prediction direction is forward, and M = 0 means it is backward. Here R denotes the real space, and v and L are both positive integers. The tensor T is then input into the embedding layer of the model to obtain a dimension-reduced tensor X_e, and feature extraction on X_e yields a tensor X_T. A maximum pooling operation produces X_o, and X_o, M and Y are used to calculate the overall model loss of the model.
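Putting these pieces together, the loss computation can be sketched with NumPy. The parameter shapes and initialisation below are illustrative assumptions, and a real implementation would use a deep-learning framework with learned weights:

```python
import numpy as np

def overall_model_loss(T, Y, M, emb, W1, b1, W2, b2):
    """Embed the fragment indices T, max-pool over the sequence, feed
    the pooled vector to two linear heads, and sum the cross-entropy
    loss (against label Y) with the mean-squared error (against the
    prediction direction M), as described above (a sketch)."""
    X_e = emb[T]                        # embedding lookup: (L, d)
    X_o = X_e.max(axis=0)               # max-pooling over the fragment: (d,)
    logits = X_o @ W1 + b1              # first head: label scores, (v,)
    direction = X_o @ W2 + b2           # second head: direction score
    # Target cross-entropy loss (numerically stable log-softmax).
    m = logits.max()
    log_probs = logits - m - np.log(np.exp(logits - m).sum())
    ce = -log_probs[Y]
    # Prediction loss: mean-squared error against M in {0, 1}.
    mse = (direction - M) ** 2
    return ce + mse                     # overall model loss
```

With zero weights the label head is uniform over v classes, so the cross-entropy term reduces to log v, which makes the sketch easy to sanity-check.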
And step S14, updating the preset language model based on the overall model loss, providing auto-completed text for the user terminal with the resulting updated model, and returning to the step of acquiring the search text input by the user terminal and processing the search text based on the preset log generation rule to generate a search log, so as to perform the next round of model updating.
In this embodiment, updating the preset language model based on the overall model loss, providing auto-completed text to the user terminal with the resulting updated model, and returning to the step of acquiring and processing the search text to generate a search log for the next round of model updating, includes: updating the preset language model based on the overall model loss to obtain an updated language model; judging whether a current search text input by the user terminal has been received, and if so, generating a plurality of auto-completed texts corresponding to the current search text with the updated language model, then returning to acquiring and processing search text to generate search logs for the next round of updating. That is, the model is updated based on the resulting overall model loss; once a user inputs a search text, it is fed to the updated model so that search terms can be recommended with the updated model, and a search log is again generated from the user's search text for the next round of model updating. In this way, the model keeps being updated continuously after the initial model goes online, ensuring the accuracy of content completion.
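The full online cycle of steps S11–S14 can be sketched as a loop around a stand-in model; the class and its methods are illustrative assumptions, and a real deployment would run the loss computation and parameter update inside `update`:

```python
class AutoCompleteModel:
    """Minimal stand-in for the preset language model (illustrative)."""
    def __init__(self):
        self.rounds = 0
    def suggest(self, text):
        return [text + "..."]            # placeholder completions
    def update(self, logs):
        self.rounds += 1                 # stands in for a training round

def online_loop(queries, model, threshold=3):
    """S11: log each query; S12: check the preset condition;
    S13+S14: run an update round and keep serving with the
    refreshed model."""
    logs = []
    for q in queries:
        model.suggest(q)                  # serve completions
        logs.append(q)                    # generate a search log
        if len(logs) % threshold == 0:    # log count hits a multiple
            model.update(logs)            # next round of updating
    return model.rounds
```

With seven queries and a threshold of three, updates fire at the third and sixth logs, so two rounds complete while suggestions keep being served throughout.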
It can be seen that in this embodiment, a search text input by a user terminal is first obtained and processed based on a preset log generation rule to generate a search log. The number of search logs is counted and compared against a preset number threshold; if the threshold condition is met, start and stop symbols are added to the log texts in the search logs and to the search library texts in a preset search library, and the resulting texts are truncated so that a content completion dataset can be constructed from the obtained text fragments. The text fragments in the content completion dataset are then vectorized, and the obtained text tensors are input into a preset language model to calculate its overall model loss. Finally, the preset language model is updated based on the overall model loss, the updated model is used to provide automatically completed text to the user terminal, and the process returns to the step of acquiring the search text input by the user terminal, which is processed based on the preset log generation rule to generate a search log for the next round of updating. Therefore, the method can generate search logs from users' search texts; once the number of logs reaches the preset threshold, it processes the search library texts together with the texts in the search logs, constructs a content completion dataset from the processed data, vectorizes the text fragments in that dataset, trains the preset language model on the obtained text tensors, calculates the model loss, updates the language model accordingly, and keeps collecting search texts while the updated model serves completion text, so that the next round of updating can proceed.
In this way, on the one hand, the model can be updated online from the user's search data and the search library text data; on the other hand, the model can be trained on both the search library texts and the search logs generated in real time, improving the accuracy of query prompts.
Based on the foregoing embodiments, it can be seen that the method of the present application requires updating the model; this embodiment therefore describes in detail how the model is updated. As shown in fig. 2, the embodiment of the present application discloses a content completion method based on an automatic completion model, including:
step S21, a search text input by a user terminal is obtained, and the search text is processed based on a preset log generation rule to generate a search log.
And S22, counting the number of the logs of the search logs, judging whether the number of the logs meets a preset condition based on a preset number threshold, if so, adding start and stop symbols for the log texts in the search logs and the search library texts in a preset search library, and intercepting the obtained added texts to construct a content complement data set based on a plurality of obtained text fragments.
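Step S22's fragment construction can be sketched as a sliding window of every truncation length in a preset range over the symbol-wrapped text. The `<start>`/`<end>` symbols and the length range are illustrative assumptions; the patent does not fix concrete values.

```python
def build_fragments(texts, min_len=2, max_len=4):
    """Wrap each text with start/stop symbols, then cut it into
    fragments of every length in [min_len, max_len]."""
    fragments = []
    for text in texts:
        wrapped = ["<start>"] + list(text) + ["<end>"]
        for length in range(min_len, max_len + 1):
            # slide a window of this truncation length over the text
            for i in range(len(wrapped) - length + 1):
                fragments.append(wrapped[i:i + length])
    return fragments

frags = build_fragments(["ab"], min_len=2, max_len=2)
# wrapped = ["<start>", "a", "b", "<end>"] yields three length-2 fragments
```

Running every truncation length in the range over both the log texts and the search library texts yields the text fragments that make up the content completion dataset.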
And S23, carrying out vectorization processing on the text fragments in the content completion data set, and obtaining a first text tensor corresponding to the text fragments, a second text tensor corresponding to the labels corresponding to the text fragments and a third text tensor corresponding to the predicting directions of the text fragments based on a preset dictionary index corresponding table.
In this embodiment, as shown in fig. 3, the plurality of text segments in the content completion dataset need to be vectorized; a tensor T corresponding to the text segments, a tensor Y corresponding to the text labels, and a tensor M corresponding to the prediction direction of the text segments are then determined using the dictionary index mapping table, where the dimension of T is R^L, the dimension of Y is R^1, and M takes the value 1 or 0.
And step S24, performing dimension reduction and feature extraction on the first text tensor to obtain the processed first text tensor.
In this embodiment, the tensor T is input to the Embedding layer for dimension reduction to obtain the tensor X_e, i.e., X_e = Embedding(T). The resulting tensor X_e is then input to a Transformer layer for feature extraction to obtain the tensor X_T, i.e., X_T = Transformer(X_e), where the dimension of X_T is R^(L×E).
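The Embedding step X_e = Embedding(T) amounts to a table lookup: each index in T selects one E-dimensional row of a v × E embedding matrix, turning the length-L index tensor into an L × E dense tensor. The pure-Python stand-in below uses made-up matrix values for illustration; a real model would learn them.

```python
def embedding(T, embed_matrix):
    """Look up one E-dimensional embedding row per character index."""
    return [embed_matrix[i] for i in T]

embed_matrix = [
    [0.0, 0.0],   # index 0: padding
    [0.1, 0.2],   # index 1
    [0.3, 0.4],   # index 2
]
X_e = embedding([1, 2, 0], embed_matrix)   # shape L x E = 3 x 2
```

The subsequent Transformer layer then mixes these rows along the length dimension to produce X_T of the same L × E shape.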
Step S25, performing a maximum pooling operation on the processed first text tensor, so as to calculate an overall model loss of the preset language model based on the obtained output tensor, the second text tensor and the third text tensor.
In this embodiment, the processed first text tensor X_T is input to the Maxpooling layer, which performs a maximum pooling operation on X_T to capture the most salient semantic information along the sentence-length dimension and yields the tensor X_O, i.e., X_O = Maxpooling(X_T), where the dimension of X_O is R^E. X_O is then input to two linear classification layers (MLPs) respectively. The output of the first linear classification layer is X_M, of dimension R^v, i.e., X_M = MLP(X_O). In the second linear classification layer, a sigmoid is selected as the activation function to determine the forward and backward probabilities of the text, and the output is X_S, i.e., X_S = sigmoid(MLP_2(X_O)); the occurrence probability of each text fragment can thus be obtained via the sigmoid function. When X_S > 0.5 the predicted direction is forward; when X_S < 0.5 the predicted direction is backward. For example, given a text {<start>, w_1, w_2, …, w_n, <end>}, its overall probability is P{<start>, w_1, w_2, …, w_n, <end>} = P(<start>) * P(w_1 | <start>) * … * P(<end> | w_n, …, w_1, <start>). Taking L as 3 as an example, the probabilities are assumed to obey the Markov assumption in order to simplify the computation. The forward probability is then P{<start>, w_1, w_2, …, w_n, <end>} ≈ P(<start>) * P(w_1 | <start>) * P(w_2 | w_1, <start>) * P(w_3 | w_2, w_1, <start>) * … * P(<end> | w_n, w_{n-1}, w_{n-2}); the backward probability is P{<start>, w_1, w_2, …, w_n, <end>} ≈ P(<end>) * P(w_n | <end>) * P(w_{n-1} | w_n, <end>) * P(w_{n-2} | w_{n-1}, w_n, <end>) * … * P(<start> | w_1, w_2, w_3).
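The forward probability under the Markov assumption above can be computed as a sum of conditional log-probabilities, each token conditioned on at most its three preceding tokens. The toy conditional probabilities below are invented for illustration; in practice they would come from the trained language model.

```python
import math

def forward_log_prob(tokens, cond_prob):
    """Sum log P(w_i | up to 3 preceding tokens) over the sequence.
    Unseen (context, token) pairs get a tiny floor probability."""
    total = 0.0
    for i, w in enumerate(tokens):
        context = tuple(tokens[max(0, i - 3):i])
        total += math.log(cond_prob.get((context, w), 1e-9))
    return total

# toy conditional probability table, invented for the example
cond_prob = {
    ((), "<start>"): 1.0,
    (("<start>",), "a"): 0.5,
    (("<start>", "a"), "b"): 0.5,
    (("<start>", "a", "b"), "<end>"): 1.0,
}
lp = forward_log_prob(["<start>", "a", "b", "<end>"], cond_prob)
```

Working in log space avoids the numerical underflow that multiplying many small probabilities would cause; the backward probability is the same computation run over the reversed sequence.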
It should further be noted that the overall model loss is calculated from the obtained X_M and X_S. First, a cross-entropy loss is computed between the linear-layer output X_M and the label tensor Y: loss_1 = CrossEntropyLoss(X_M, Y). Then a mean-squared-error loss is computed between X_S and the tensor M corresponding to the prediction direction of the text segments, serving as a prediction loss that corrects the predicted direction: loss_2 = MSE(X_S, M). The overall model loss is loss = loss_1 + loss_2.
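A pure-Python sketch of this combined loss for a single sample follows: cross-entropy between the class output X_M and the label Y, plus squared error between the direction output X_S and the direction flag M. The input numbers are illustrative only.

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def overall_loss(X_M, Y, X_S, M):
    """loss_1: cross-entropy for the true class Y over logits X_M;
    loss_2: squared error between direction score X_S and flag M."""
    probs = softmax(X_M)
    loss1 = -math.log(probs[Y])
    loss2 = (X_S - M) ** 2
    return loss1 + loss2

loss = overall_loss(X_M=[2.0, 0.5, 0.1], Y=0, X_S=0.8, M=1)
```

In a batched setting both terms would be averaged over the batch, but the single-sample form shows how the two losses simply add.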
And S26, updating the preset language model based on the overall model loss, providing automatically completed text to the user terminal by using the obtained updated model, returning to the step of acquiring the search text input by the user terminal, and processing the search text based on the preset log generation rule to generate a search log for the next round of model updating.
It can be seen that, in this embodiment, in order to calculate the model loss, it is first required to perform vectorization processing on the plurality of text segments in the content completion dataset, obtain, based on a preset dictionary index correspondence table, a first text tensor corresponding to the plurality of text segments, a second text tensor corresponding to the plurality of text segment correspondence labels, and a third text tensor corresponding to the plurality of text segment prediction directions, then perform dimension reduction and feature extraction on the first text tensor to obtain a processed first text tensor, and finally perform a maximum pooling operation on the processed first text tensor to calculate an overall model loss of the preset language model based on the obtained output tensor, the second text tensor and the third text tensor. In this way, the model can be updated through model loss, so that the automatic completion model is more accurate, and the problem of cold start in a recommendation algorithm is solved by utilizing the Transformer to process semantic relation between contexts.
Referring to fig. 4, the embodiment of the invention discloses a content complement device based on an automatic complement model, which comprises:
the log generating module 11 is configured to obtain a search text input by a user terminal, and process the search text based on a preset log generating rule to generate a search log;
the data set construction module 12 is configured to count the number of logs in the search log, determine whether the number of logs meets a preset condition based on a preset number threshold, if yes, add start-stop symbols to log texts in the search log and search library texts in a preset search library, and intercept the added text to construct a content completion data set based on the obtained text fragments;
a loss calculation module 13, configured to perform vectorization processing on the plurality of text segments in the content completion dataset, and input the plurality of obtained text tensors to a preset language model, so as to calculate an overall model loss of the preset language model based on the plurality of text tensors;
the model updating module 14 is configured to update the preset language model based on the overall model loss, provide automatically completed text to the user terminal by using the obtained updated model, return to the step of acquiring the search text input by the user terminal, and process the search text based on the preset log generation rule to generate a search log for the next round of model updating.
Therefore, in the present application, a search text input by a user terminal is first acquired and processed based on a preset log generation rule to generate a search log. The number of search logs is counted and compared against a preset number threshold; if the condition is met, start and stop symbols are added to the log texts in the search logs and to the search library texts in a preset search library, the resulting texts are truncated, and a content completion dataset is constructed from the obtained text fragments. The text fragments in the dataset are vectorized, and the obtained text tensors are input into a preset language model to calculate its overall model loss. Finally, the preset language model is updated based on the overall model loss, the updated model is used to provide automatically completed text to the user terminal, and the process returns to the step of acquiring the search text input by the user terminal, which is processed based on the preset log generation rule to generate a search log for the next round of updating.
Therefore, the method can generate search logs from users' search texts; once the number of logs reaches the preset threshold, it processes the search library texts together with the texts in the search logs, constructs a content completion dataset from the processed data, vectorizes the text fragments in that dataset, trains the preset language model on the obtained text tensors, calculates the model loss, updates the language model accordingly, and keeps collecting search texts while the updated model serves completion text, so that the next round of updating can proceed. In this way, the model can be updated online from the user search data and the search library text data, improving the accuracy of query prompts and the user experience.
In some embodiments, the log generating module 11 may specifically include:
the data acquisition unit is used for acquiring a search text input by a user terminal and determining the query time and the user identification corresponding to the search text;
the first log generation unit is used for determining whether the search text is the same as an actual query text, if not, determining target ranks of the actual query text in a plurality of recommended texts, and determining the actual query text as a log text so as to generate a first search log based on the query time, the user identification, the search text, the log text and the target ranks; the plurality of recommended texts are recommended texts which are generated through the preset language model and are related to the search text; the actual query text is a search text determined by the user side from the plurality of recommended texts;
and the second log generating unit is used for determining the search text as the log text if the search text is the log text so as to generate a second search log based on the query time, the user identification and the log text.
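The two log-generation rules above can be sketched as a single branching function: if the user's typed search text differs from the actually queried text, the first log format records the clicked text and its rank among the recommendations; otherwise the second format records the typed text itself. The field names and the 0-based rank are assumptions for illustration.

```python
def make_search_log(query_time, user_id, typed, actual, recommended):
    """Build a first or second search-log record from one query."""
    if typed != actual:
        return {                      # first search log
            "time": query_time,
            "user": user_id,
            "search_text": typed,
            "log_text": actual,
            "rank": recommended.index(actual),  # 0-based target rank
        }
    return {                          # second search log
        "time": query_time,
        "user": user_id,
        "log_text": typed,
    }

log = make_search_log("2023-10-13 09:00", "u1", "py", "python",
                      ["pytorch", "python", "pypi"])
```

Here the user typed "py" but selected the recommendation "python", so the first log format applies and the target rank points at the selected recommendation.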
In some embodiments, the data set construction module 12 may specifically include:
The quantity judging unit is used for counting the number of all locally stored search logs and judging whether the number of logs is not less than an integer multiple of a preset number threshold;
the data processing unit is used for determining all the log texts in the search logs and search library texts in a preset search library if yes, and adding a start symbol and a stop symbol to the log texts and the search library texts to obtain added texts;
the first data set construction unit is used for carrying out interception processing on the added text based on each text interception length in a preset interception range respectively so as to construct a content complement data set based on a plurality of obtained text fragments.
In some embodiments, the automatic complement model-based content complement apparatus may further include:
and if not, returning to the step of acquiring the search text input by the user terminal, and processing the search text based on the preset log generation rule to generate a search log, until the number of all locally stored search logs is not less than an integer multiple of the preset number threshold, so as to construct the content complement data set based on the search logs.
In some embodiments, the loss calculation module 13 may specifically include:
the tensor determining unit is used for carrying out vectorization processing on the text fragments in the content completion data set, and obtaining a first text tensor corresponding to the text fragments, a second text tensor corresponding to the text fragment corresponding labels and a third text tensor corresponding to the text fragment prediction directions based on a preset dictionary index corresponding table;
the tensor dimension reduction unit is used for dimension reduction and feature extraction of the first text tensor to obtain a processed first text tensor;
and the loss calculation sub-module is used for carrying out maximum pooling operation on the processed first text tensor so as to calculate the total model loss of the preset language model based on the obtained output tensor, the second text tensor and the third text tensor.
In some embodiments, the loss calculation submodule may specifically include:
the first tensor processing unit is used for inputting the processed first text tensor to the Maxpooling layer to perform a maximum pooling operation on it, so as to obtain an output tensor;
the second tensor processing unit is used for respectively inputting the output tensor to the first linear classification layer and the second linear classification layer to obtain a corresponding first output tensor and a corresponding second output tensor;
a loss calculation unit, configured to calculate a cross entropy loss of the preset language model based on the first output tensor and the second text tensor, so as to obtain a target cross entropy loss; calculating the mean square error of the preset language model based on the second output tensor and the third text tensor, so as to take the obtained mean square error as the prediction loss of the preset language model;
and the loss determining unit is used for taking the sum value of the target cross entropy loss and the prediction loss as the total model loss of the preset language model.
In some embodiments, the model updating module 14 may specifically include:
a first model updating unit, configured to update the preset language model based on the total model loss, so as to obtain an updated language model;
the text insufficiency unit is used for judging whether a current search text input by the user terminal is received or not, and if yes, generating a plurality of automatic complement texts corresponding to the current search text based on the updated language model;
and the second model updating unit is used for returning to the step of acquiring the search text input by the user terminal, and processing the search text based on the preset log generation rule to generate a search log for the next round of model updating.
Further, the embodiment of the present application also discloses an electronic device. Fig. 5 is a block diagram of an electronic device 20 according to an exemplary embodiment, and the content of the figure should not be considered as any limitation on the scope of use of the present application.
Fig. 5 is a schematic structural diagram of an electronic device 20 according to an embodiment of the present application. The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input output interface 25, and a communication bus 26. Wherein the memory 22 is configured to store a computer program that is loaded and executed by the processor 21 to implement the relevant steps in the automatic complement model-based content complement method disclosed in any of the foregoing embodiments. In addition, the electronic device 20 in the present embodiment may be specifically an electronic computer.
In this embodiment, the power supply 23 is configured to provide an operating voltage for each hardware device on the electronic device 20; the communication interface 24 can create a data transmission channel between the electronic device 20 and an external device, and the communication protocol to be followed is any communication protocol applicable to the technical solution of the present application, which is not specifically limited herein; the input/output interface 25 is used for acquiring external input data or outputting external output data, and the specific interface type thereof may be selected according to the specific application requirement, which is not limited herein.
The memory 22 may be a carrier for storing resources, such as a read-only memory, a random access memory, a magnetic disk, or an optical disk, and the resources stored thereon may include an operating system 221, a computer program 222, and the like, and the storage may be temporary storage or permanent storage.
The operating system 221 is used for managing and controlling the hardware devices on the electronic device 20 and the computer programs 222, and may be Windows Server, Netware, Unix, Linux, etc. In addition to the computer program used to perform the content completion method based on the automatic completion model executed by the electronic device 20 as disclosed in any of the foregoing embodiments, the computer programs 222 may further include computer programs used to perform other specific tasks.
Further, the application also discloses a computer readable storage medium for storing a computer program; wherein the computer program, when executed by the processor, implements the disclosed automatic complement model-based content complement method. For specific steps of the method, reference may be made to the corresponding contents disclosed in the foregoing embodiments, and no further description is given here.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, so that the same or similar parts between the embodiments are referred to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing detailed description explains the principles and embodiments of the present application through specific examples, and is intended only to help in understanding the method of the present application and its core idea. Meanwhile, those skilled in the art may make changes to the specific embodiments and the application scope in accordance with the idea of the present application. In summary, the content of this description should not be construed as limiting the present application.
Claims (10)
1. A content completion method based on an automatic completion model, comprising:
acquiring a search text input by a user terminal, and processing the search text based on a preset log generation rule to generate a search log;
counting the number of the logs of the search logs, judging whether the number of the logs meets a preset condition or not based on a preset number threshold, if so, adding start and stop symbols for the log texts in the search logs and the search library texts in a preset search library, and intercepting the obtained added texts to construct a content completion data set based on a plurality of obtained text fragments;
vectorizing the text segments in the content completion data set, and inputting the obtained text tensors into a preset language model to calculate the total model loss of the preset language model based on the text tensors;
updating the preset language model based on the total model loss, providing automatically completed text to the user terminal by using the obtained updated model, returning to the step of acquiring the search text input by the user terminal, and processing the search text based on the preset log generation rule to generate a search log for the next round of model updating.
2. The automatic completion model-based content completion method according to claim 1, wherein the obtaining the search text input by the user terminal and processing the search text based on a preset log generation rule to generate a search log comprises:
acquiring a search text input by a user terminal, and determining the query time and the user identification corresponding to the search text;
determining whether the search text is the same as an actual query text, if not, determining target ranks of the actual query text in a plurality of recommended texts, and determining the actual query text as a log text to generate a first search log based on the query time, the user identification, the search text, the log text and the target ranks; the plurality of recommended texts are recommended texts which are generated through the preset language model and are related to the search text; the actual query text is a search text determined by the user side from the plurality of recommended texts;
If yes, determining the search text as the log text, and generating a second search log based on the query time, the user identification and the log text.
3. The automatic completion model-based content completion method according to claim 1, wherein the counting the number of logs of the search logs and judging whether the number of logs meets a preset condition based on a preset number threshold, if yes, adding start and stop symbols to log texts in the search logs and search library texts in a preset search library, and intercepting the obtained added texts to construct a content completion dataset based on a plurality of obtained text segments, and the method comprises:
counting the number of all locally stored search logs, and judging whether the number of logs is not less than an integer multiple of a preset number threshold;
if yes, determining all the log texts in the search logs and search library texts in a preset search library, and adding a start symbol and a stop symbol to the log texts and the search library texts to obtain added texts;
and intercepting the added text based on each text interception length in a preset interception range respectively to construct a content complement data set based on the obtained text fragments.
4. The automatic complement model-based content complement method as recited in claim 3, further comprising, after counting the number of all locally stored search logs and judging whether the number of logs is not less than an integer multiple of a preset number threshold:
if not, returning to the step of acquiring the search text input by the user terminal, and processing the search text based on the preset log generation rule to generate a search log, until the number of all locally stored search logs is not less than an integer multiple of the preset number threshold, so as to construct the content complement data set based on the search logs.
5. The automatic complement model-based content complement method as recited in claim 1, wherein the vectorizing the plurality of text segments in the content complement data set and inputting the plurality of resulting text tensors into a preset language model to calculate an overall model loss of the preset language model based on the plurality of text tensors comprises:
vectorizing the text segments in the content completion data set, and obtaining a first text tensor corresponding to the text segments, a second text tensor corresponding to the text segment corresponding labels and a third text tensor corresponding to the text segment prediction directions based on a preset dictionary index corresponding table;
Performing dimension reduction and feature extraction on the first text tensor to obtain a processed first text tensor;
and carrying out maximum pooling operation on the processed first text tensor to calculate the total model loss of the preset language model based on the obtained output tensor, the second text tensor and the third text tensor.
6. The automatic complement model-based content complement method as recited in claim 5, wherein the performing a max pooling operation on the processed first text tensor to calculate an overall model loss of the preset language model based on the obtained output tensor, the second text tensor, and the third text tensor comprises:
inputting the processed first text tensor to a Maxpooling layer to perform a maximum pooling operation on the processed first text tensor, so as to obtain an output tensor;
inputting the output tensor to a first linear classification layer and a second linear classification layer respectively, to obtain a corresponding first output tensor and a corresponding second output tensor;
calculating cross entropy loss of the preset language model based on the first output tensor and the second text tensor to obtain target cross entropy loss; calculating the mean square error of the preset language model based on the second output tensor and the third text tensor, so as to take the obtained mean square error as the prediction loss of the preset language model;
And taking the sum value of the target cross entropy loss and the prediction loss as the total model loss of the preset language model.
7. The automatic complement model-based content complement method according to any one of claims 1 to 6, wherein the updating the preset language model based on the overall model loss, providing automatically completed text to the user terminal by using the obtained updated model, returning to the step of acquiring the search text input by the user terminal, and processing the search text based on the preset log generation rule to generate a search log for the next round of model updating comprises:
updating the preset language model based on the overall model loss to obtain an updated language model;
judging whether a current search text input by the user terminal is received, and if so, generating a plurality of automatically completed texts corresponding to the current search text based on the updated language model, returning to the step of acquiring the search text input by the user terminal, and processing the search text based on the preset log generation rule to generate a search log for the next round of model updating.
8. A content completion device based on an automatic completion model, comprising:
the log generation module is used for acquiring a search text input by a user terminal and processing the search text based on a preset log generation rule so as to generate a search log;
the data set construction module is used for counting the number of the logs of the search logs, judging whether the number of the logs meets preset conditions or not based on a preset number threshold, if yes, adding start and stop symbols for the log texts in the search logs and the search library texts in a preset search library, and intercepting the obtained added texts to construct a content completion data set based on a plurality of obtained text fragments;
the loss calculation module is used for carrying out vectorization processing on the text fragments in the content completion data set, inputting the obtained text tensors into a preset language model, and calculating the overall model loss of the preset language model based on the text tensors;
and the model updating module is used for updating the preset language model based on the overall model loss, providing automatic complement text to the user terminal using the obtained updated model, jumping back to the step of acquiring the search text input by the user terminal, and processing the search text based on the preset log generation rule to generate a search log, so as to perform the next round of model updating.
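The data set construction module's steps — wrapping each text with start and stop symbols, then intercepting the result into text fragments — can be sketched as below. The function name, the symbol strings, and the fixed-length interception are assumptions for illustration; the patent does not specify the symbols or the fragment length:

```python
def build_completion_dataset(log_texts, library_texts, max_len=16,
                             start_sym="<s>", end_sym="</s>"):
    """Hypothetical sketch: add start/stop symbols to each log text and
    search library text, then intercept the added texts into fragments."""
    fragments = []
    for text in log_texts + library_texts:
        wrapped = f"{start_sym}{text}{end_sym}"       # added text with start/stop symbols
        # Intercept the added text into fixed-length fragments.
        for i in range(0, len(wrapped), max_len):
            fragments.append(wrapped[i:i + max_len])
    return fragments
```

The fragments would then be vectorized by the loss calculation module; overlapping or token-level interception would work equally well here, since the claim only requires that fragments be obtained from the symbol-delimited texts.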
9. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the automatic complement model-based content complement method as claimed in any one of claims 1 to 7.
10. A computer readable storage medium for storing a computer program which, when executed by a processor, implements the automatic complement model-based content complement method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202311329042.5A CN117235210A (en) | 2023-10-13 | 2023-10-13 | Content complement method, device, equipment and medium based on automatic complement model
Publications (1)
Publication Number | Publication Date
---|---
CN117235210A | 2023-12-15
Family
ID=89094771
Family Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202311329042.5A (Pending) | 2023-10-13 | 2023-10-13 | Content complement method, device, equipment and medium based on automatic complement model
Country Status (1)
Country | Link
---|---
CN (1) | CN117235210A (en)
Similar Documents
Publication | Title
---|---
CN109635273B (en) | Text keyword extraction method, device, equipment and storage medium
US11948058B2 | Utilizing recurrent neural networks to recognize and extract open intent from text inputs
CN110162749B (en) | Information extraction method, information extraction device, computer equipment and computer readable storage medium
CN107679039B (en) | Method and device for determining statement intention
CN116775847B (en) | Question answering method and system based on knowledge graph and large language model
CN109376222B (en) | Question-answer matching degree calculation method, question-answer automatic matching method and device
US20150095017A1 | System and method for learning word embeddings using neural language models
CN112800170A (en) | Question matching method and device and question reply method and device
CN112860919B (en) | Data labeling method, device, equipment and storage medium based on generation model
CN114840671A (en) | Dialogue generation method, model training method, device, equipment and medium
CN116719520B (en) | Code generation method and device
CN117435716B (en) | Data processing method and system of power grid man-machine interaction terminal
KR20240067971A | Voice recognition method, voice recognition device, electronic equipment, storage media and computer program
CN116467417A (en) | Method, device, equipment and storage medium for generating answers to questions
CN113343692A (en) | Search intention recognition method, model training method, device, medium and equipment
CN116049370A (en) | Information query method and training method and device of information generation model
CN115310449A (en) | Named entity identification method and device based on small sample and related medium
CN110941713A (en) | Self-optimization financial information plate classification method based on topic model
CN112416754B (en) | Model evaluation method, terminal, system and storage medium
AU2019290658B2 | Systems and methods for identifying and linking events in structured proceedings
CN117235210A (en) | Content complement method, device, equipment and medium based on automatic complement model
CN113468306A (en) | Voice conversation method, device, electronic equipment and storage medium
CN111199170B (en) | Formula file identification method and device, electronic equipment and storage medium
JP2021163477A (en) | Method, apparatus, electronic device, computer-readable storage medium, and computer program for image processing
CN111783465A (en) | Named entity normalization method, system and related device
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination