CN117235210A - Content completion method, device, equipment and medium based on an automatic completion model - Google Patents

Content completion method, device, equipment and medium based on an automatic completion model

Info

Publication number
CN117235210A
Authority
CN
China
Prior art keywords
text
search
model
preset
log
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311329042.5A
Other languages
Chinese (zh)
Inventor
何彬彬
魏金雷
周庆勇
朱利霞
伊文超
李旭东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Cloud Information Technology Co Ltd
Original Assignee
Inspur Cloud Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Cloud Information Technology Co Ltd filed Critical Inspur Cloud Information Technology Co Ltd
Priority to CN202311329042.5A priority Critical patent/CN117235210A/en
Publication of CN117235210A publication Critical patent/CN117235210A/en
Pending legal-status Critical Current


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a content completion method, device, equipment and medium based on an automatic completion model, relating to the field of natural language processing and comprising the following steps: acquiring a search text and processing it to generate a search log; counting the number of logs and judging whether the log count meets a preset condition; if so, adding start and stop symbols to the log texts and to preset search library texts, and truncating the resulting texts to construct a content completion data set from the obtained text fragments; vectorizing the text fragments, inputting the resulting text tensors into a preset language model, and calculating the overall model loss; updating the preset language model based on the overall model loss, providing auto-completed text with the updated model, and proceeding to the next round of model updating. In this way, the model can be updated online from user search data and search library text data, improving both the accuracy of query prompts and the user experience.

Description

Content completion method, device, equipment and medium based on an automatic completion model
Technical Field
The invention relates to the field of natural language processing, and in particular to a content completion method, device, equipment and medium based on an automatic completion model.
Background
Natural language technology has developed alongside deep learning and neural networks. In the field of search engine applications, natural language processing helps users focus on content of interest by enabling semantic analysis of retrieved content. Automatic completion of search content means that, while using a search engine, a user enters only a few keywords and an algorithm intelligently suggests the complete sentence the user intends to query, helping the user quickly locate the target content in a massive search library.
Three automatic completion methods are commonly used at present. The first is character matching: a dictionary library must be maintained manually and its contents periodically expanded or pruned, and the user's query terms are matched against the dictionary to lock in a recommendation list. The second is a recommender-system approach: by collecting features of users' search behavior and modeling that behavior with mature recommendation algorithms, personalized prompts can be provided; however, the recommendation algorithm suffers a cold-start period when it goes online, during which user behavior data may be missing, so the recommendation quality is poor. The third is intelligent prompting based on log mining: search log information is mined for the user's search intent and corresponding recommendations are made; but its data come only from the search logs, so the data sources are highly limited.
Disclosure of Invention
In view of the above, the present application aims to provide a content completion method, device, equipment and medium based on an automatic completion model, in which the model can be updated online from user search data and search library text data, improving the accuracy of query prompts and the user experience. The specific scheme is as follows:
in a first aspect, the application discloses a content completion method based on an automatic completion model, comprising the following steps:
acquiring a search text input by a user terminal, and processing the search text based on a preset log generation rule to generate a search log;
counting the number of search logs, and judging, based on a preset number threshold, whether the log count meets a preset condition; if so, adding start and stop symbols to the log texts in the search logs and to the search library texts in a preset search library, and truncating the resulting texts to construct a content completion data set from the obtained text fragments;
vectorizing the text fragments in the content completion data set, and inputting the resulting text tensors into a preset language model so as to calculate the overall model loss of the preset language model from those tensors;
updating the preset language model based on the overall model loss, providing auto-completed text to the user terminal with the updated model, and jumping back to the step of acquiring the search text input by the user terminal and processing it based on the preset log generation rule to generate a search log, so as to carry out the next round of model updating.
Optionally, acquiring the search text input by the user terminal and processing it based on the preset log generation rule to generate a search log includes:
acquiring the search text input by the user terminal, and determining the query time and the user identifier corresponding to the search text;
determining whether the search text is identical to the actual query text; if not, determining the target rank of the actual query text among a plurality of recommended texts, and taking the actual query text as the log text so as to generate a first search log from the query time, the user identifier, the search text, the log text and the target rank, where the recommended texts are texts related to the search text generated by the preset language model, and the actual query text is the text the user terminal finally selects from the recommended texts;
if so, taking the search text as the log text, and generating a second search log from the query time, the user identifier and the log text.
Optionally, counting the number of search logs, judging whether the log count meets the preset condition based on the preset number threshold, and, if so, adding start and stop symbols to the log texts and the search library texts and truncating the resulting texts to construct the content completion data set, includes:
counting the number of all locally stored search logs, and judging whether that number is not less than the next integer multiple of the preset number threshold;
if so, collecting all log texts in the search logs and the search library texts in the preset search library, and adding a start symbol and a stop symbol to each to obtain the appended texts;
and truncating the appended texts at each truncation length within a preset truncation range, so as to construct the content completion data set from the resulting text fragments.
Optionally, after counting the number of locally stored search logs and judging whether that number is not less than an integer multiple of the preset number threshold, the method further includes:
if not, jumping back to the step of acquiring the search text input by the user terminal and processing it based on the preset log generation rule to generate a search log, until the number of locally stored search logs is not less than an integer multiple of the preset number threshold, so as to construct the content completion data set from the search logs.
Optionally, vectorizing the text fragments in the content completion data set and inputting the resulting text tensors into the preset language model to calculate its overall model loss includes:
vectorizing the text fragments in the content completion data set, and obtaining, from a preset dictionary index table, a first text tensor corresponding to each text fragment, a second text tensor corresponding to the fragment's label, and a third text tensor corresponding to the fragment's prediction direction;
performing dimension reduction and feature extraction on the first text tensor to obtain a processed first text tensor;
and performing a max-pooling operation on the processed first text tensor, so as to calculate the overall model loss of the preset language model from the resulting output tensor, the second text tensor and the third text tensor.
Optionally, performing the max-pooling operation on the processed first text tensor to calculate the overall model loss of the preset language model from the resulting output tensor, the second text tensor and the third text tensor includes:
inputting the processed first text tensor into a max-pooling (Maxpooling) layer to perform the max-pooling operation and obtain a pooled tensor;
inputting the pooled tensor into a first linear classification layer and a second linear classification layer, respectively, to obtain a corresponding first output tensor and second output tensor;
calculating the cross-entropy loss of the preset language model from the first output tensor and the second text tensor to obtain a target cross-entropy loss, and calculating the mean squared error from the second output tensor and the third text tensor, the resulting mean squared error serving as the prediction loss of the preset language model;
and taking the sum of the target cross-entropy loss and the prediction loss as the overall model loss of the preset language model.
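A minimal sketch of this combined loss, assuming softmax cross-entropy for the first linear head and a scalar direction score for the second; pure Python, illustrative only, not the patented implementation:

```python
import math

def overall_model_loss(first_output, labels, second_output, directions):
    # first_output: per-sample logits from the first linear classification layer;
    # labels: integer class index per sample (the second text tensor);
    # second_output: direction score per sample (the second linear layer);
    # directions: 1.0 forward / 0.0 backward (the third text tensor).
    ce = 0.0
    for logits, y in zip(first_output, labels):
        m = max(logits)
        log_z = m + math.log(sum(math.exp(v - m) for v in logits))
        ce += log_z - logits[y]          # -log softmax(logits)[y]
    ce /= len(labels)                    # target cross-entropy loss
    mse = sum((p - d) ** 2 for p, d in zip(second_output, directions)) / len(directions)
    return ce + mse                      # overall model loss = CE + prediction loss
```

With near-perfect logits and a correct direction score the loss is close to zero, and any direction error adds its mean squared contribution.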
Optionally, updating the preset language model based on the overall model loss, providing auto-completed text to the user terminal with the updated model, and jumping back to the log-generation step so as to carry out the next round of model updating includes:
updating the preset language model based on the overall model loss to obtain an updated language model;
judging whether a current search text input by the user terminal has been received; if so, generating a plurality of auto-completed texts for the current search text with the updated language model, then jumping back to the step of acquiring the search text input by the user terminal and processing it based on the preset log generation rule to generate a search log, so as to carry out the next round of model updating.
In a second aspect, the present application discloses a content completion device based on an automatic completion model, comprising:
a log generation module for acquiring the search text input by the user terminal and processing it based on the preset log generation rule to generate a search log;
a data set construction module for counting the number of search logs, judging whether the log count meets the preset condition based on the preset number threshold, and, if so, adding start and stop symbols to the log texts in the search logs and to the search library texts in the preset search library and truncating the resulting texts to construct a content completion data set from the obtained text fragments;
a loss calculation module for vectorizing the text fragments in the content completion data set and inputting the resulting text tensors into the preset language model to calculate its overall model loss;
and a model update module for updating the preset language model based on the overall model loss, providing auto-completed text to the user terminal with the updated model, and jumping back to the log-generation step so as to carry out the next round of model updating.
In a third aspect, the present application discloses an electronic device, comprising:
a memory for storing a computer program;
and a processor for executing the computer program to implement the content completion method based on an automatic completion model described above.
In a fourth aspect, the present application discloses a computer-readable storage medium storing a computer program which, when executed by a processor, implements the content completion method based on an automatic completion model described above.
In the present application, a search text input by a user terminal is first acquired and processed according to a preset log generation rule to generate a search log. The number of search logs is counted, and whether the log count meets a preset condition is judged against a preset number threshold; if so, start and stop symbols are added to the log texts in the search logs and to the search library texts in a preset search library, and the resulting texts are truncated to construct a content completion data set from the obtained text fragments. The text fragments in the data set are then vectorized, and the resulting text tensors are input into a preset language model to calculate its overall model loss. Finally, the preset language model is updated based on the overall model loss, auto-completed text is provided to the user terminal with the updated model, and the procedure jumps back to the log-generation step to carry out the next round of model updating.
The method can thus generate search logs from users' search texts; once the logs reach the preset number threshold, it processes the search library texts and the log texts, builds a content completion data set from the processed data, vectorizes the text fragments in that data set, trains the preset language model on the resulting text tensors, computes the model loss, and updates the language model accordingly. While the updated model serves completion texts, new search texts are collected for the next round of updating. In this way, the model can be updated online from user search data and search library text data, improving both the accuracy of query prompts and the user experience.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required by the embodiments or the prior-art description are briefly introduced below. It is apparent that the drawings described below are only embodiments of the present application, and that other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flowchart of a content completion method based on an automatic completion model provided by the application;
FIG. 2 is a flowchart of a specific content completion method based on an automatic completion model according to the present application;
FIG. 3 is a schematic diagram of the automatic completion model framework provided by the present application;
FIG. 4 is a schematic diagram of a content completion device based on an automatic completion model according to the present application;
FIG. 5 is a block diagram of an electronic device according to the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Three automatic completion methods are commonly used in the prior art. The first is character matching, whose retrieval is too slow: as the dictionary contents grow, the prompt's response time often exceeds user expectations, degrading the experience. The second is a recommender-system approach, which suffers a cold-start period when the recommendation algorithm goes online, during which user behavior data may be missing, leading to poor recommendations. The third method's data come only from the search logs, so its data sources are highly limited.
To overcome these technical problems, the invention discloses a content completion method, device, equipment and medium based on an automatic completion model, in which the model can be updated online from user search data and search library text data, improving the accuracy of query prompts and the user experience.
Referring to FIG. 1, an embodiment of the invention discloses a content completion method based on an automatic completion model, comprising the following steps:
step S11, obtaining a search text input by a user terminal, and processing the search text based on a preset log generation rule to generate a search log.
In this embodiment, in order to update the preset language model, that is, the automatic completion model, the texts in a preset search library and the search texts input by the user terminal and collected in real time must be processed into training data. First, the search text typed at the user terminal is acquired to generate a search log, as follows: acquire the search text and determine the query time and the user identifier corresponding to it; determine whether the search text is identical to the actual query text; if not, determine the target rank of the actual query text among a plurality of recommended texts and take the actual query text as the log text, so as to generate a first search log from the query time, the user identifier, the search text, the log text and the target rank, where the recommended texts are texts related to the search text generated by the preset language model and the actual query text is the text the user terminal finally selects from the recommended texts; if so, take the search text as the log text and generate a second search log from the query time, the user identifier and the log text.
That is, the search text input by the user terminal is acquired, together with the query time and the user identifier, i.e., the user ID (identity document), at the moment of input. It is then determined whether the actual query text the user finally searched for is the same as the typed search text. If not, the user searched with a query term recommended by the preset language model, i.e., by the automatic completion model; in that case the actual query text and its rank among the returned results are determined, the actual query text serves as the log text, and a search log is generated from the query time, the user ID, the search text, the log text and the target rank. Otherwise, if the search text coincides with the actual query text, the search text itself serves as the log text, and a search log is generated from the log text, the query time and the user ID.
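The two log formats described above can be sketched as follows; the dict field names are illustrative assumptions, since the patent does not fix a concrete log schema:

```python
def build_search_log(query_time: str, user_id: str, search_text: str,
                     actual_query: str, recommended: list) -> dict:
    # Sketch of the preset log generation rule described above.
    if search_text == actual_query:
        # Second search log: the user searched exactly what was typed.
        return {"time": query_time, "user": user_id, "log_text": search_text}
    # First search log: the user picked one of the model's recommendations,
    # so that recommendation's rank in the returned list is also recorded.
    rank = recommended.index(actual_query)
    return {"time": query_time, "user": user_id, "search_text": search_text,
            "log_text": actual_query, "rank": rank}
```

For example, a user who types a prefix and then selects the second recommendation produces a first-form log with `rank` 1, while a user whose typed text matches the final query produces the shorter second-form log.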
Step S12: count the number of search logs and judge, based on a preset number threshold, whether the log count meets a preset condition; if so, add start and stop symbols to the log texts in the search logs and to the search library texts in a preset search library, and truncate the resulting texts to construct a content completion data set from the obtained text fragments.
In this embodiment, in order to update the model automatically, conditions for triggering an update must be set, as follows: count the number of locally stored search logs and judge whether that number is not less than the next integer multiple of a preset number threshold; if so, collect all log texts in the search logs and the search library texts in the preset search library, add a start symbol and a stop symbol to each to obtain the appended texts, and truncate the appended texts at each truncation length within a preset truncation range, so as to construct the content completion data set from the resulting text fragments. That is, the automatic update rule may be to count all local search logs and process the texts once the log volume exceeds an integer multiple of the preset threshold. It should be noted that the application does not limit the preset threshold, which may be set according to user requirements. For example, with a threshold of 5000, the model is updated when the number of local search logs reaches 5000, and the next round of updating occurs when it reaches 10000. When an update is due, the search library texts of the preset search library are extracted, the start and stop symbols "<start>" and "<end>" are added to the search library texts and the log texts, and text fragments of length L are cut out to build the content completion data set, where L may take values in {2, 3, 4, 5}.
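The update trigger just described (an update at each integer multiple of the threshold) can be sketched as a small predicate; the `updates_done` counter is an assumed bookkeeping detail, not named in the text:

```python
def update_due(num_logs: int, threshold: int = 5000, updates_done: int = 0) -> bool:
    # Trigger once the local log count reaches the next integer multiple of
    # the preset threshold (5000 is the example value given in the text).
    return num_logs >= threshold * (updates_done + 1)
```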
The search library texts include titles, abstracts, detailed descriptions and the like. Taking a title as an example, a title such as "purchase housing and withdraw housing provident fund" is converted, after adding the start and stop symbols, into "<start>purchase housing and withdraw housing provident fund<end>"; with L = 3, the text fragments are the successive character windows of length three over this appended text, from the window beginning with "<start>" to the window ending with "<end>" (the segmentation is character-level, as for Chinese text). After all fragments are obtained, the content completion data set is created from them; each sample has the format {text fragment, label, prediction direction}, where label is the label corresponding to the text fragment and the prediction direction indicates whether the fragment lies in the first or the second half of the sentence. In this way, the data in the search library and the data searched by users in real time are combined into a completion data set for model training, making the training of the automatic completion model more comprehensive and avoiding the inaccurate recommendations caused by training the model on a single data source.
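The fragment construction above can be sketched as a sliding window over the symbol-wrapped text; character-level tokens and the default length range {2, 3, 4, 5} follow the text, while the function itself is illustrative:

```python
def build_fragments(text, lengths=(2, 3, 4, 5), start="<start>", end="<end>"):
    # Wrap the text with the start and stop symbols, then cut out every
    # window of length L for each L in the preset truncation range.
    tokens = [start] + list(text) + [end]
    fragments = []
    for L in lengths:
        for i in range(len(tokens) - L + 1):
            fragments.append(tokens[i:i + L])
    return fragments
```

For a four-character text and L = 3 this yields four fragments, the first starting at "<start>" and the last ending at "<end>".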
It should be noted that, after counting the number of locally stored search logs and judging whether that number is not less than an integer multiple of the preset number threshold, the method further includes: if not, jumping back to the step of acquiring the search text input by the user terminal and processing it based on the preset log generation rule to generate a search log, until the number of locally stored search logs is not less than an integer multiple of the preset number threshold, so as to construct the content completion data set from the search logs. That is, if the current number of search logs has not reached an integer multiple of the preset number threshold, input search texts must continue to be collected and new search logs generated from them until it has.
Step S13: vectorize the text fragments in the content completion data set, and input the resulting text tensors into a preset language model so as to calculate the overall model loss of the preset language model from those tensors.
In this embodiment, the text fragments in the completion data set may be vectorized to speed up model training and reduce training time. After the texts are vectorized into text tensors, these are input into the model and the overall model loss is calculated as follows: vectorize the text fragments in the content completion data set and obtain, from a preset dictionary index table, a first text tensor corresponding to each fragment, a second text tensor corresponding to the fragment's label, and a third text tensor corresponding to the fragment's prediction direction; perform dimension reduction and feature extraction on the first text tensor to obtain the processed first text tensor; and perform a max-pooling operation on the processed first text tensor, so as to calculate the overall model loss of the preset language model from the resulting output tensor, the second text tensor and the third text tensor. Concretely, in the sample vectorization, the maximum character length of a text is set to L, and the dictionary index table is used to generate the text-fragment mapping tensor T of dimension R^L. With v the number of characters in the dictionary, the tensor corresponding to label is Y, of dimension R^1. The tensor M encodes the prediction direction of the text fragment in the sample: M = 1 means the prediction direction is forward, and M = 0 means it is backward. Here R denotes the real space, and v and L are both positive integers. The tensor T is then input into the embedding layer of the model to obtain the dimension-reduced tensor X_e; feature extraction on X_e yields the tensor X_T; a max-pooling operation yields X_o; and X_o, M and Y are used to calculate the overall model loss of the model.
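The forward pass just described (embedding lookup, feature extraction, max-pooling over sequence positions, then the two linear heads) can be sketched as follows; the toy dimensions, random weights and tanh activation are assumptions, since the text does not specify them:

```python
import numpy as np

rng = np.random.default_rng(0)
v, L, E, C = 30, 5, 8, 10           # dictionary size, max length, embedding
                                    # width, label count (toy values)
emb = rng.normal(size=(v, E))       # embedding layer: T in R^L -> X_e
W_f = rng.normal(size=(E, E))       # stand-in for the feature-extraction layer
W_cls = rng.normal(size=(E, C))     # first linear classification layer
W_dir = rng.normal(size=(E, 1))     # second linear classification layer

def forward(T):
    # T: length-L vector of dictionary indices (the mapping tensor).
    X_e = emb[T]                    # dimension reduction via embedding lookup
    X_T = np.tanh(X_e @ W_f)        # feature extraction (activation assumed)
    X_o = X_T.max(axis=0)           # max-pooling over the sequence positions
    return X_o @ W_cls, (X_o @ W_dir)[0]   # label logits, direction score
```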
Step S14: update the preset language model based on the overall model loss, provide auto-completed text to the user terminal with the updated model, and jump back to the step of acquiring the search text input by the user terminal and processing it to generate a search log, so as to carry out the next round of model updating.
In this embodiment, this step includes: updating the preset language model based on the overall model loss to obtain an updated language model; judging whether a current search text input by the user terminal has been received; and, if so, generating a plurality of auto-completed texts for the current search text with the updated language model, then jumping back to the log-generation step so as to carry out the next round of model updating. That is, the model is updated with the obtained overall model loss; when the user next inputs a search text, it is fed to the updated model so that search terms are recommended by the updated model, and a search log is again generated from the user's search text for the next round of model updating. Thus, once the initial model is online, it can be updated persistently, ensuring the accuracy of content completion.
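The persistent update cycle described above can be sketched as a loop; `train_step`, the tiny threshold and the version counter are illustrative assumptions (the text's example threshold is 5000):

```python
def serving_loop(queries, train_step, threshold=3):
    # Every incoming query is logged; once the local log count reaches the
    # next integer multiple of the threshold, a new training round runs and
    # the model version advances, so the updated model serves completions.
    logs, version = [], 0
    for q in queries:
        logs.append(q)                   # generate a search log
        if len(logs) >= threshold * (version + 1):
            train_step(list(logs))       # build the data set and retrain
            version += 1                 # swap in the updated model
    return version
```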
It can be seen that, in this embodiment, a search text input by a user terminal is first acquired and processed according to a preset log generation rule to generate a search log; the number of search logs is counted and judged against a preset number threshold; if the condition is met, start and stop symbols are added to the log texts in the search logs and to the search library texts in a preset search library, and the resulting texts are truncated to construct a content completion data set from the obtained text fragments; the fragments are vectorized and the resulting text tensors are input into a preset language model to calculate its overall model loss; finally, the model is updated based on that loss, auto-completed text is provided to the user terminal with the updated model, and the procedure jumps back to the log-generation step for the next round of model updating. The method thus generates search logs from users' search texts, processes the search library texts and log texts once the logs reach the preset threshold, builds a content completion data set from the processed data, vectorizes its text fragments to train the preset language model, computes the model loss, updates the language model accordingly, and keeps collecting search texts while the updated model serves completion texts, ready for the next round of updating.
In this way, on one hand, the model can be updated online using the user's search data and the search library text data; on the other hand, the model can be trained on both the search library texts and the search logs generated in real time, improving the accuracy of the query prompts.
As can be seen from the foregoing embodiments, the method of the present application requires the model to be updated; this embodiment therefore describes the model update process in detail. As shown in fig. 2, the embodiment of the present application discloses a content completion method based on an automatic completion model, including:
step S21, a search text input by a user terminal is obtained, and the search text is processed based on a preset log generation rule to generate a search log.
And S22, counting the number of the logs of the search logs, judging whether the number of the logs meets a preset condition based on a preset number threshold, if so, adding start and stop symbols for the log texts in the search logs and the search library texts in a preset search library, and intercepting the obtained added texts to construct a content complement data set based on a plurality of obtained text fragments.
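The dataset construction in step S22 can be sketched as follows. This is a hedged illustration: the marker strings, the character-level tokenization, and the fragment-window lengths are assumptions, since the patent only specifies adding start/stop symbols and truncating with each length in a preset range.

```python
def build_completion_dataset(texts, min_len=2, max_len=4,
                             start="<start>", end="<end>"):
    """Add start/stop markers to each text, then cut the marked text into
    fragments of every length in [min_len, max_len] (sliding windows)."""
    fragments = []
    for text in texts:
        tokens = [start] + list(text) + [end]   # start/stop symbols added
        for length in range(min_len, max_len + 1):
            for i in range(len(tokens) - length + 1):
                fragments.append(tokens[i:i + length])
    return fragments

# Example: two hypothetical log/search-library texts
fragments = build_completion_dataset(["cloud", "log"], min_len=2, max_len=3)
```

Each fragment then becomes one training sample of the content completion dataset.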
And S23, carrying out vectorization processing on the text fragments in the content completion data set, and obtaining, based on a preset dictionary index correspondence table, a first text tensor corresponding to the text fragments, a second text tensor corresponding to the labels of the text fragments, and a third text tensor corresponding to the prediction directions of the text fragments.
In this embodiment, as shown in fig. 3, the plurality of text segments in the content completion dataset are first vectorized; a dictionary index mapping table is then used to determine the tensor T corresponding to each text segment, the tensor Y corresponding to its label, and the tensor M corresponding to its prediction direction, where T has dimension R^L, Y has dimension R^1, and M takes the value 1 or 0.
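The vectorization step can be sketched as below. The vocabulary, padding scheme, and fixed length L are illustrative assumptions; the patent only requires that a dictionary index table map a fragment to T (dimension R^L), its label to Y (dimension R^1), and its direction to M (1 or 0).

```python
import numpy as np

def vectorize(fragment, label, direction, vocab, L=8, pad=0):
    """Map a token fragment to index tensors via a dictionary index table:
    T holds fragment indices padded to length L, Y holds the label index,
    and M is 1 for a forward prediction direction and 0 for backward."""
    T = np.full(L, pad, dtype=np.int64)
    ids = [vocab[t] for t in fragment[:L]]
    T[:len(ids)] = ids
    Y = np.array([vocab[label]], dtype=np.int64)
    M = np.array(1 if direction == "forward" else 0, dtype=np.int64)
    return T, Y, M

# Hypothetical dictionary index table
vocab = {"<pad>": 0, "<start>": 1, "<end>": 2, "c": 3, "l": 4, "o": 5}
T, Y, M = vectorize(["<start>", "c", "l"], "o", "forward", vocab)
```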
And step S24, performing dimension reduction and feature extraction on the first text tensor to obtain the processed first text tensor.
In this embodiment, the tensor T is input to an Embedding layer for dimension reduction, giving the tensor X_e = Embedding(T). The resulting tensor X_e is then input to a Transformer layer for feature extraction, giving the tensor X_T = Transformer(X_e), where X_T has dimension R^{L×E}.
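Step S24 can be sketched in PyTorch as follows; the sizes L, V, E, the number of heads, and the single-layer encoder are assumptions chosen for illustration, not values from the disclosure.

```python
import torch
import torch.nn as nn

L, V, E = 8, 100, 32   # fragment length, vocabulary size, embedding size (assumed)

embedding = nn.Embedding(V, E)   # dimension reduction: X_e = Embedding(T)
encoder_layer = nn.TransformerEncoderLayer(d_model=E, nhead=4, batch_first=True)
transformer = nn.TransformerEncoder(encoder_layer, num_layers=1)  # X_T = Transformer(X_e)

T = torch.randint(0, V, (1, L))  # a batch holding one index tensor of dimension R^L
X_e = embedding(T)               # shape (1, L, E)
X_T = transformer(X_e)           # shape (1, L, E): features in R^{L×E} per sample
```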
Step S25, performing a maximum pooling operation on the processed first text tensor, so as to calculate an overall model loss of the preset language model based on the obtained output tensor, the second text tensor and the third text tensor.
In this embodiment, the processed first text tensor X_T is input to a Maxpooling layer, which performs a maximum pooling operation on X_T over the sentence-length dimension to capture the semantic information with the most salient features, giving the tensor X_O = Maxpooling(X_T) with dimension R^E. X_O is then input to two linear classification layers (MLPs). The output of the first linear classification layer is X_M = MLP(X_O), with dimension R^V. The second linear classification layer uses the sigmoid function as its activation function to determine the forward and backward probabilities of the text, outputting X_S = sigmoid(MLP_2(X_O)); the sigmoid function yields the occurrence probability of each text fragment, and when X_S > 0.5 the predicted direction is forward, while when X_S < 0.5 the predicted direction is backward. For example, for a text {<start>, w_1, w_2, …, w_n, <end>}, the overall probability of the text is P{<start>, w_1, w_2, …, w_n, <end>} = P(<start>) · P(w_1 | <start>) · … · P(<end> | w_n, …, w_1, <start>). Taking L = 3 as an example, to simplify the computation the probabilities are assumed to obey the Markov assumption, and the forward probability is calculated as: P{<start>, w_1, w_2, …, w_n, <end>} ≈ P(<start>) · P(w_1 | <start>) · P(w_2 | w_1, <start>) · P(w_3 | w_2, w_1, <start>) · … · P(<end> | w_n, w_{n-1}, w_{n-2}); the backward probability is calculated as: P{<start>, w_1, w_2, …, w_n, <end>} ≈ P(<end>) · P(w_n | <end>) · P(w_{n-1} | w_n, <end>) · P(w_{n-2} | w_{n-1}, w_n, <end>) · … · P(<start> | w_1, w_2, w_3).
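The pooling and the two classification heads can be sketched as follows; the sizes and untrained random weights are illustrative assumptions, so the direction decision here only demonstrates the 0.5 threshold rather than a trained prediction.

```python
import torch
import torch.nn as nn

E, V = 32, 100                 # feature size, vocabulary size (assumed)
X_T = torch.randn(1, 8, E)     # stand-in Transformer output of shape (batch, L, E)

# Max pooling over the sentence-length dimension keeps the most salient
# feature per channel: X_O = Maxpooling(X_T), dimension R^E.
X_O = X_T.max(dim=1).values    # shape (1, E)

mlp1 = nn.Linear(E, V)         # first linear classification layer: X_M in R^V
mlp2 = nn.Linear(E, 1)         # second linear classification layer: direction score

X_M = mlp1(X_O)                          # shape (1, V)
X_S = torch.sigmoid(mlp2(X_O))           # shape (1, 1), value in (0, 1)
direction = "forward" if X_S.item() > 0.5 else "backward"
```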
It should further be noted that the overall model loss is calculated from the obtained X_M and X_S. First, the cross entropy loss is calculated from the linear layer output X_M and the tensor Y corresponding to the label: loss_1 = CrossEntropyLoss(X_M, Y). Then, the mean square error between X_S and the tensor M corresponding to the prediction direction of the text segment is calculated as the prediction loss, which corrects the predicted direction of the text segment: loss_2 = MSE(X_S, M). The overall model loss is loss = loss_1 + loss_2.
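The two-term loss can be written directly with standard PyTorch criteria; the batch size, vocabulary size, and random inputs below are placeholders.

```python
import torch
import torch.nn as nn

X_M = torch.randn(4, 100)               # first head output, shape (batch, V)
Y = torch.randint(0, 100, (4,))         # label indices (tensor Y)
X_S = torch.sigmoid(torch.randn(4))     # second head output, values in (0, 1)
M = torch.randint(0, 2, (4,)).float()   # prediction direction, 1 or 0 (tensor M)

loss1 = nn.CrossEntropyLoss()(X_M, Y)   # loss_1 = CrossEntropyLoss(X_M, Y)
loss2 = nn.MSELoss()(X_S, M)            # loss_2 = MSE(X_S, M)
loss = loss1 + loss2                    # overall model loss
```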
And S26, updating the preset language model based on the overall model loss, providing automatic completion text for the user terminal by using the obtained updated model, and returning to the step of obtaining the search text input by the user terminal and processing the search text based on the preset log generation rule to generate a search log for the next round of model updating.
It can be seen that, in this embodiment, to calculate the model loss, the plurality of text segments in the content completion dataset are first vectorized, and a first text tensor corresponding to the text segments, a second text tensor corresponding to their labels, and a third text tensor corresponding to their prediction directions are obtained based on a preset dictionary index correspondence table. Dimension reduction and feature extraction are then performed on the first text tensor to obtain the processed first text tensor. Finally, a maximum pooling operation is performed on the processed first text tensor, so that the overall model loss of the preset language model can be calculated based on the obtained output tensor, the second text tensor, and the third text tensor. In this way, the model can be updated through the model loss, making the automatic completion model more accurate, and using the Transformer to model the semantic relations between contexts alleviates the cold-start problem of recommendation algorithms.
Referring to fig. 4, the embodiment of the invention discloses a content complement device based on an automatic complement model, which comprises:
the log generating module 11 is configured to obtain a search text input by a user terminal, and process the search text based on a preset log generating rule to generate a search log;
the data set construction module 12 is configured to count the number of logs in the search log, determine whether the number of logs meets a preset condition based on a preset number threshold, if yes, add start-stop symbols to log texts in the search log and search library texts in a preset search library, and intercept the added text to construct a content completion data set based on the obtained text fragments;
a loss calculation module 13, configured to perform vectorization processing on the plurality of text segments in the content completion dataset, and input the plurality of obtained text tensors to a preset language model, so as to calculate an overall model loss of the preset language model based on the plurality of text tensors;
the model updating module 14 is configured to update the preset language model based on the overall model loss, provide automatic completion text for the user terminal by using the obtained updated model, and return to the step of obtaining the search text input by the user terminal and processing the search text based on the preset log generation rule to generate a search log for the next round of model updating.
Therefore, in the present application, a search text input by a user terminal is first obtained and processed based on a preset log generation rule to generate a search log; the number of search logs is counted, and whether that number meets a preset condition is judged against a preset number threshold; if so, start and stop symbols are added to the log texts in the search logs and the search library texts in a preset search library, and the resulting added texts are truncated to construct a content completion dataset from the obtained text fragments; the text fragments in the dataset are vectorized, and the resulting text tensors are input into a preset language model to calculate its overall model loss; finally, the preset language model is updated based on the overall model loss, the updated model is used to provide automatic completion text for the user terminal, and the process returns to obtaining the search text input by the user terminal and processing it based on the preset log generation rule to generate a search log for the next round of model updating.
Therefore, the method can generate the search log based on the search text of the user, process the search text in the preset search library and the search text in the search log after the search log reaches the preset quantity threshold, construct a content complement data set by using the processed data, and then perform vectorization processing on text fragments in the content complement data set so as to train a preset language model through a plurality of obtained text tensors, calculate model loss of the preset language model, update the language model according to the obtained model loss, and collect the search text at the same time when the updated model is used for providing the complement text so as to perform next round of updating. In this way, the model can be updated online through the user search data and the search library text data, so that the accuracy of query prompt and user experience are improved.
In some embodiments, the log generating module 11 may specifically include:
the data acquisition unit is used for acquiring a search text input by a user terminal and determining the query time and the user identification corresponding to the search text;
the first log generation unit is used for determining whether the search text is the same as an actual query text, if not, determining target ranks of the actual query text in a plurality of recommended texts, and determining the actual query text as a log text so as to generate a first search log based on the query time, the user identification, the search text, the log text and the target ranks; the plurality of recommended texts are recommended texts which are generated through the preset language model and are related to the search text; the actual query text is a search text determined by the user side from the plurality of recommended texts;
and the second log generating unit is used for determining the search text as the log text if the search text is the log text so as to generate a second search log based on the query time, the user identification and the log text.
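The two log-generation rules above can be sketched as a single function; the record field names and the use of the recommendation list index as the rank are illustrative assumptions, since the patent specifies only which items (query time, user identification, search text, log text, target rank) each log type contains.

```python
from datetime import datetime

def make_search_log(user_id, search_text, actual_query, recommendations):
    """Build a search-log record: if the actually queried text differs from
    what the user typed, record the typed text, the chosen text, and its
    rank among the recommendations (first search log); otherwise record the
    typed text itself as the log text (second search log)."""
    entry = {"time": datetime.now().isoformat(), "user": user_id}
    if actual_query != search_text:
        entry.update(search=search_text, log_text=actual_query,
                     rank=recommendations.index(actual_query))
    else:
        entry.update(log_text=search_text)
    return entry

# User typed "clo" and clicked the top recommendation
log = make_search_log("u1", "clo", "cloud server", ["cloud server", "cloud disk"])
```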
In some embodiments, the data set construction module 12 may specifically include:
The quantity judging unit is used for counting the log quantity of all the local search logs and judging whether the log quantity is not smaller than a plurality of integer multiples of a preset quantity threshold value;
the data processing unit is used for determining all the log texts in the search logs and search library texts in a preset search library if yes, and adding a start symbol and a stop symbol to the log texts and the search library texts to obtain added texts;
the first data set construction unit is used for carrying out interception processing on the added text based on each text interception length in a preset interception range respectively so as to construct a content complement data set based on a plurality of obtained text fragments.
In some embodiments, the automatic complement model-based content complement apparatus may further include:
and if not, returning to the step of obtaining the search text input by the user terminal and processing the search text based on the preset log generation rule to generate a search log, until the local number of all the search logs is not less than an integer multiple of the preset number threshold, so as to construct the content completion dataset based on the search logs.
In some embodiments, the loss calculation module 13 may specifically include:
the tensor determining unit is used for carrying out vectorization processing on the text fragments in the content completion data set, and obtaining a first text tensor corresponding to the text fragments, a second text tensor corresponding to the text fragment corresponding labels and a third text tensor corresponding to the text fragment prediction directions based on a preset dictionary index corresponding table;
the tensor dimension reduction unit is used for dimension reduction and feature extraction of the first text tensor to obtain a processed first text tensor;
and the loss calculation sub-module is used for carrying out maximum pooling operation on the processed first text tensor so as to calculate the total model loss of the preset language model based on the obtained output tensor, the second text tensor and the third text tensor.
In some embodiments, the loss calculation submodule may specifically include:
the first tensor processing unit is used for inputting the processed first text tensor to a Maxpooling layer to perform a maximum pooling operation on the processed first text tensor, so as to obtain a pooled first text tensor;

the second tensor processing unit is used for respectively inputting the pooled first text tensor to the first linear classification layer and the second linear classification layer to obtain a corresponding first output tensor and a corresponding second output tensor;
a loss calculation unit, configured to calculate a cross entropy loss of the preset language model based on the first output tensor and the second text tensor, so as to obtain a target cross entropy loss; calculating the mean square error of the preset language model based on the second output tensor and the third text tensor, so as to take the obtained mean square error as the prediction loss of the preset language model;
and the loss determining unit is used for taking the sum value of the target cross entropy loss and the prediction loss as the total model loss of the preset language model.
In some embodiments, the model updating module 14 may specifically include:
a first model updating unit, configured to update the preset language model based on the total model loss, so as to obtain an updated language model;
the text completion unit is used for judging whether a current search text input by the user terminal is received, and if so, generating a plurality of automatic completion texts corresponding to the current search text based on the updated language model;
and the second model updating unit is used for returning to the step of obtaining the search text input by the user terminal and processing the search text based on the preset log generation rule to generate a search log, so as to perform the next round of model updating.
Further, the embodiment of the present application further discloses an electronic device, and fig. 5 is a block diagram of an electronic device 20 according to an exemplary embodiment, where the content of the figure is not to be considered as any limitation on the scope of use of the present application.
Fig. 5 is a schematic structural diagram of an electronic device 20 according to an embodiment of the present application. The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input output interface 25, and a communication bus 26. Wherein the memory 22 is configured to store a computer program that is loaded and executed by the processor 21 to implement the relevant steps in the automatic complement model-based content complement method disclosed in any of the foregoing embodiments. In addition, the electronic device 20 in the present embodiment may be specifically an electronic computer.
In this embodiment, the power supply 23 is configured to provide an operating voltage for each hardware device on the electronic device 20; the communication interface 24 can create a data transmission channel between the electronic device 20 and an external device, and the communication protocol to be followed is any communication protocol applicable to the technical solution of the present application, which is not specifically limited herein; the input/output interface 25 is used for acquiring external input data or outputting external output data, and the specific interface type thereof may be selected according to the specific application requirement, which is not limited herein.
The memory 22 may be a carrier for storing resources, such as a read-only memory, a random access memory, a magnetic disk, or an optical disk, and the resources stored thereon may include an operating system 221, a computer program 222, and the like, and the storage may be temporary storage or permanent storage.
The operating system 221 is used for managing and controlling the hardware devices on the electronic device 20 and the computer program 222, and may be Windows Server, NetWare, Unix, Linux, etc. The computer program 222 may further include, in addition to the computer program for performing the automatic-completion-model-based content completion method executed by the electronic device 20 as disclosed in any of the foregoing embodiments, computer programs for performing other specific tasks.
Further, the application also discloses a computer readable storage medium for storing a computer program; wherein the computer program, when executed by the processor, implements the disclosed automatic complement model-based content complement method. For specific steps of the method, reference may be made to the corresponding contents disclosed in the foregoing embodiments, and no further description is given here.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, so that the same or similar parts between the embodiments are referred to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing has described the principles and embodiments of the present application in detail through specific examples; the above description of the embodiments is intended only to help understand the method of the present application and its core ideas. Meanwhile, those skilled in the art may make changes to the specific embodiments and the scope of application in accordance with the ideas of the present application. In summary, the content of this description should not be construed as limiting the present application.

Claims (10)

1. A content completion method based on an automatic completion model, comprising:
acquiring a search text input by a user terminal, and processing the search text based on a preset log generation rule to generate a search log;
counting the number of the logs of the search logs, judging whether the number of the logs meets a preset condition or not based on a preset number threshold, if so, adding start and stop symbols for the log texts in the search logs and the search library texts in a preset search library, and intercepting the obtained added texts to construct a content completion data set based on a plurality of obtained text fragments;
vectorizing the text segments in the content completion data set, and inputting the obtained text tensors into a preset language model to calculate the total model loss of the preset language model based on the text tensors;
updating the preset language model based on the overall model loss, providing automatic completion text for the user terminal by using the obtained updated model, and returning to the step of obtaining the search text input by the user terminal and processing the search text based on the preset log generation rule to generate a search log, so as to perform a next round of model updating.
2. The automatic completion model-based content completion method according to claim 1, wherein the obtaining the search text input by the user terminal and processing the search text based on a preset log generation rule to generate a search log comprises:
acquiring a search text input by a user terminal, and determining the query time and the user identification corresponding to the search text;
determining whether the search text is the same as an actual query text, if not, determining target ranks of the actual query text in a plurality of recommended texts, and determining the actual query text as a log text to generate a first search log based on the query time, the user identification, the search text, the log text and the target ranks; the plurality of recommended texts are recommended texts which are generated through the preset language model and are related to the search text; the actual query text is a search text determined by the user side from the plurality of recommended texts;
If yes, determining the search text as the log text, and generating a second search log based on the query time, the user identification and the log text.
3. The automatic completion model-based content completion method according to claim 1, wherein the counting the number of logs of the search logs and judging whether the number of logs meets a preset condition based on a preset number threshold, and if so, adding start and stop symbols to log texts in the search logs and search library texts in a preset search library, and intercepting the obtained added texts to construct a content completion dataset based on a plurality of obtained text segments, comprises:
counting the number of the local logs of all the search logs, and judging whether the number of the logs is not smaller than a plurality of integer multiples of a preset number threshold;
if yes, determining all the log texts in the search logs and search library texts in a preset search library, and adding a start symbol and a stop symbol to the log texts and the search library texts to obtain added texts;
and intercepting the added text based on each text interception length in a preset interception range respectively to construct a content complement data set based on the obtained text fragments.
4. The automatic complement model-based content complement method as recited in claim 3, further comprising, after counting the number of logs of all the search logs locally and determining whether the number of logs is not less than a number of integer multiples of a preset number threshold:
if not, returning to the step of obtaining the search text input by the user terminal and processing the search text based on the preset log generation rule to generate a search log, until the local number of all the search logs is not less than an integer multiple of the preset number threshold, so as to construct the content completion dataset based on the search logs.
5. The automatic complement model-based content complement method as recited in claim 1, wherein the vectorizing the plurality of text segments in the content complement data set and inputting the plurality of resulting text tensors into a preset language model to calculate an overall model loss of the preset language model based on the plurality of text tensors comprises:
vectorizing the text segments in the content completion data set, and obtaining a first text tensor corresponding to the text segments, a second text tensor corresponding to the text segment corresponding labels and a third text tensor corresponding to the text segment prediction directions based on a preset dictionary index corresponding table;
Performing dimension reduction and feature extraction on the first text tensor to obtain a processed first text tensor;
and carrying out maximum pooling operation on the processed first text tensor to calculate the total model loss of the preset language model based on the obtained output tensor, the second text tensor and the third text tensor.
6. The automatic complement model-based content complement method as recited in claim 5, wherein the performing a max pooling operation on the processed first text tensor to calculate an overall model loss of the preset language model based on the obtained output tensor, the second text tensor, and the third text tensor comprises:
inputting the processed first text tensor to a Maxpooling layer to perform a maximum pooling operation on the processed first text tensor, so as to obtain a pooled first text tensor;

inputting the pooled first text tensor to a first linear classification layer and a second linear classification layer respectively, to obtain a corresponding first output tensor and a corresponding second output tensor;
calculating cross entropy loss of the preset language model based on the first output tensor and the second text tensor to obtain target cross entropy loss; calculating the mean square error of the preset language model based on the second output tensor and the third text tensor, so as to take the obtained mean square error as the prediction loss of the preset language model;
And taking the sum value of the target cross entropy loss and the prediction loss as the total model loss of the preset language model.
7. The automatic complement model-based content complement method according to any one of claims 1 to 6, wherein the updating the preset language model based on the overall model loss, providing automatic complement text for the user terminal by using the obtained updated model, and returning to the step of obtaining the search text input by the user terminal and processing the search text based on the preset log generation rule to generate a search log, so as to perform a next round of model updating, comprises:
updating the preset language model based on the overall model loss to obtain an updated language model;
judging whether a current search text input by the user terminal is received, and if so, generating a plurality of automatic complement texts corresponding to the current search text based on the updated language model, then returning to the step of obtaining the search text input by the user terminal and processing the search text based on the preset log generation rule to generate a search log, so as to perform the next round of model updating.
8. A content completion device based on an automatic completion model, comprising:
the log generation module is used for acquiring a search text input by a user terminal and processing the search text based on a preset log generation rule so as to generate a search log;
the data set construction module is used for counting the number of the logs of the search logs, judging whether the number of the logs meets preset conditions or not based on a preset number threshold, if yes, adding start and stop symbols for the log texts in the search logs and the search library texts in a preset search library, and intercepting the obtained added texts to construct a content completion data set based on a plurality of obtained text fragments;
the loss calculation module is used for carrying out vectorization processing on the text fragments in the content completion data set, inputting the obtained text tensors into a preset language model, and calculating the overall model loss of the preset language model based on the text tensors;
and the model updating module is used for updating the preset language model based on the overall model loss, providing automatic complement text for the user terminal by using the obtained updated model, and returning to the step of obtaining the search text input by the user terminal and processing the search text based on the preset log generation rule to generate a search log, so as to perform the next round of model updating.
9. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the automatic complement model-based content complement method as claimed in any one of claims 1 to 7.
10. A computer readable storage medium for storing a computer program which when executed by a processor implements the automatic complement model-based content complement method of any one of claims 1 to 7.
CN202311329042.5A 2023-10-13 2023-10-13 Content complement method, device, equipment and medium based on automatic complement model Pending CN117235210A (en)

Publications (1)

Publication Number Publication Date
CN117235210A true CN117235210A (en) 2023-12-15

Family

ID=89094771

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311329042.5A Pending CN117235210A (en) 2023-10-13 2023-10-13 Content complement method, device, equipment and medium based on automatic complement model

Country Status (1)

Country Link
CN (1) CN117235210A (en)

Similar Documents

Publication Publication Date Title
CN109635273B (en) Text keyword extraction method, device, equipment and storage medium
US11948058B2 (en) Utilizing recurrent neural networks to recognize and extract open intent from text inputs
CN110162749B (en) Information extraction method, information extraction device, computer equipment and computer readable storage medium
CN107679039B (en) Method and device for determining statement intention
CN116775847B (en) Question answering method and system based on knowledge graph and large language model
CN109376222B (en) Question-answer matching degree calculation method, question-answer automatic matching method and device
US20150095017A1 (en) System and method for learning word embeddings using neural language models
CN112800170A (en) Question matching method and device and question reply method and device
CN112860919B (en) Data labeling method, device, equipment and storage medium based on generation model
CN114840671A (en) Dialogue generation method, model training method, device, equipment and medium
CN116719520B (en) Code generation method and device
CN117435716B (en) Data processing method and system of power grid man-machine interaction terminal
KR20240067971A (en) Voice recognition method, voice recognition device, electronic equipment, storage media and computer program
CN116467417A (en) Method, device, equipment and storage medium for generating answers to questions
CN113343692A (en) Search intention recognition method, model training method, device, medium and equipment
CN116049370A (en) Information query method and training method and device of information generation model
CN115310449A (en) Named entity identification method and device based on small sample and related medium
CN110941713A (en) Self-optimization financial information plate classification method based on topic model
CN112416754B (en) Model evaluation method, terminal, system and storage medium
AU2019290658B2 (en) Systems and methods for identifying and linking events in structured proceedings
CN117235210A (en) Content complement method, device, equipment and medium based on automatic complement model
CN113468306A (en) Voice conversation method, device, electronic equipment and storage medium
CN111199170B (en) Formula file identification method and device, electronic equipment and storage medium
JP2021163477A (en) Method, apparatus, electronic device, computer-readable storage medium, and computer program for image processing
CN111783465A (en) Named entity normalization method, system and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination