CN113434632A - Text completion method, device, equipment and storage medium based on language model - Google Patents

Text completion method, device, equipment and storage medium based on language model

Info

Publication number
CN113434632A
CN113434632A
Authority
CN
China
Prior art keywords
text
character
sequence
predicted
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110712451.8A
Other languages
Chinese (zh)
Inventor
陈桢博
庄伯金
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110712451.8A priority Critical patent/CN113434632A/en
Publication of CN113434632A publication Critical patent/CN113434632A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a text completion method, device, equipment and storage medium based on a language model. The method comprises the steps of obtaining a plurality of texts to be completed, obtaining text sequences corresponding to the plurality of texts to be completed, padding each text sequence to equal length to obtain basic text sequences, forming matrix vectors based on the basic text sequences, inputting the matrix vectors into the language model for probability calculation to obtain character probability sets, obtaining the character sequence with the highest probability in each character probability set by beam search to obtain target character sequences, and completing the texts to be completed through the target character sequences. The application also relates to blockchain technology, wherein the texts to be completed are stored in a blockchain. By performing probability calculation on the plurality of text sequences and obtaining the target character sequences by beam search, the method and the device improve the efficiency of text completion.

Description

Text completion method, device, equipment and storage medium based on language model
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a text completion method, apparatus, device, and storage medium based on a language model.
Background
In many scenarios, the content that should follow a document can be inferred from what the user has already written, so that short texts can be continued automatically, which brings convenience to the user.
Existing text completion methods predict the subsequent content from the context of the text provided by the user through a deep learning language model. Although such a method can recommend completion text according to the context and perform text completion, it must combine a large amount of information from the text, which requires a large amount of computation and makes text completion inefficient. A method that can improve the efficiency of text completion is therefore needed.
Disclosure of Invention
The embodiments of the present application aim to provide a text completion method, apparatus, device, and storage medium based on a language model, so as to improve the efficiency of text completion.
In order to solve the above technical problem, an embodiment of the present application provides a text completion method based on a language model, including:
acquiring a plurality of texts to be completed, and acquiring text sequences corresponding to the plurality of texts to be completed based on a preset word list to obtain a plurality of initial text sequences;
identifying the character at the last position of each initial text sequence, and taking the character at the last position as a predicted character;
based on the predicted characters of the initial text sequence, performing equal-length filling on the initial text sequence according to a preset text length to obtain a plurality of basic text sequences;
respectively carrying out vector calculation on a plurality of basic text sequences and a preset vector to form a plurality of matrix vectors, and inputting the plurality of matrix vectors into a language model to respectively carry out probability calculation on the predicted characters to obtain a character probability set corresponding to each predicted character;
acquiring a character sequence with the highest probability in each character probability set by adopting a beam search mode to obtain a target character sequence corresponding to each text to be completed;
and completing the text to be completed based on the target character sequence.
In order to solve the foregoing technical problem, an embodiment of the present application provides a text completion apparatus based on a language model, including:
the initial text sequence acquisition module is used for acquiring a plurality of texts to be completed, acquiring text sequences corresponding to the plurality of texts to be completed based on a preset word list, and acquiring a plurality of initial text sequences;
the predicted character determining module is used for identifying the character at the last position of each initial text sequence and taking the character at the last position as a predicted character;
the initial text sequence filling module is used for carrying out equal-length filling on the initial text sequence according to a preset text length based on the predicted characters of the initial text sequence to obtain a plurality of basic text sequences;
the character probability set acquisition module is used for respectively carrying out vector calculation on the plurality of basic text sequences and preset vectors to form a plurality of matrix vectors, inputting the plurality of matrix vectors into a language model to respectively carry out probability calculation on the predicted characters, and obtaining a character probability set corresponding to each predicted character;
the target character sequence acquisition module is used for acquiring a character sequence with the highest probability in each character probability set in a beam search mode to obtain a target character sequence corresponding to each text to be completed;
and the to-be-completed text completion module is used for completing the to-be-completed text based on the target character sequence.
In order to solve the above technical problems, an embodiment of the present application further provides a computer device, comprising: one or more processors; and a memory for storing one or more programs that, when executed, cause the one or more processors to implement any of the language model based text completion methods described above.
In order to solve the above technical problems, an embodiment of the present application further provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a language model based text completion method as recited in any one of the above.
The embodiments of the present application provide a text completion method, apparatus, device, and storage medium based on a language model. A plurality of texts to be completed are obtained together with their corresponding text sequences; each text sequence is padded to equal length to obtain a basic text sequence; matrix vectors are formed based on the basic text sequences and input into a language model for probability calculation to obtain character probability sets; the character sequence with the highest probability in each character probability set is obtained by beam search, yielding a target character sequence; and each text to be completed is completed through its target character sequence. Because the multiple text sequences are padded to equal length, probability calculation can be performed on the multiple texts to be completed simultaneously, and obtaining the target character sequences by beam search further helps to improve the efficiency of text completion.
Drawings
In order to more clearly illustrate the solution of the present application, the drawings needed for describing the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
FIG. 1 is a schematic diagram of an application environment of a text completion method based on a language model according to an embodiment of the present application;
FIG. 2 is a flowchart of an implementation of a method for language model-based text completion according to an embodiment of the present application;
FIG. 3 is a flowchart of an implementation of a sub-process in a method for completing a text based on a language model according to an embodiment of the present application;
FIG. 4 is a flowchart of another implementation of a sub-process in a method for completing a text based on a language model according to an embodiment of the present application;
FIG. 5 is a flowchart of another implementation of a sub-process in a method for completing a text based on a language model according to an embodiment of the present application;
FIG. 6 is a flowchart of another implementation of a sub-process in a method for completing a text based on a language model according to an embodiment of the present application;
FIG. 7 is a flowchart of another implementation of a sub-process in a method for completing a text based on a language model according to an embodiment of the present application;
FIG. 8 is a flowchart of another implementation of a sub-process in a method for completing a text based on a language model according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a text completion apparatus based on a language model according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a computer device provided in an embodiment of the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
The present invention will be described in detail below with reference to the accompanying drawings and embodiments.
Referring to fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have installed thereon various communication client applications, such as a web browser application, a search-type application, an instant messaging tool, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that the language model-based text completion method provided in the embodiments of the present application is generally executed by a server, and accordingly, a language model-based text completion apparatus is generally configured in the server.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring to fig. 2, fig. 2 shows an embodiment of a text completion method based on a language model.
It should be noted that, provided substantially the same result is obtained, the method of the present application is not limited to the flow sequence shown in fig. 2. The method includes the following steps:
S1: A plurality of texts to be completed are obtained, and text sequences corresponding to the plurality of texts to be completed are obtained based on a preset word list, so as to obtain a plurality of initial text sequences.
In the embodiments of the present application, in order to make the technical solution clearer, the terminals involved are described in detail below.
First, the server: the server may receive a plurality of texts to be completed sent from the user side, or obtain a plurality of texts to be completed from a database. The server converts the received texts to be completed into text sequences, performs probability calculation on the predicted characters to obtain target character sequences, completes the texts to be completed according to the target character sequences, and returns the completed texts to the user side.
Second, the user side: the user side may select a plurality of texts to be completed and send them to the server for text completion, and may also receive the completed texts returned by the server.
Specifically, the server obtains a plurality of texts to be completed, which may be texts of different lengths. For example, the texts to be completed may be "I want to eat today", "We can choose to go tomorrow", "How beautiful it is there", "I love", and the like. The obtained texts to be completed are segmented and mapped into a preset word list to obtain the corresponding text sequences. The preset word list is a table in which all vocabulary entries are arranged in advance according to a certain rule, and each entry in the word list carries a corresponding mark; the mark represents its entry, that is, the corresponding entry in the word list can be retrieved through the mark.
Referring to fig. 3, fig. 3 shows an embodiment of step S1, which is described in detail as follows:
S11: A plurality of texts to be completed are acquired.
S12: The plurality of texts to be completed are segmented by regular matching to obtain segmented texts.
The regular matching mode applies a logical formula to the texts to be completed: a "rule string" is composed of predefined specific characters and combinations thereof, and this rule string expresses the filtering logic applied to the texts to be completed, that is, it segments the plurality of texts to be completed, thereby yielding the segmented text corresponding to each text to be completed.
S13: Based on the preset word list, the characters of the segmented text are converted into their corresponding marks in the word list to obtain a plurality of initial text sequences.
Specifically, the segmented text is mapped into a preset word list, a mark of the segmented text in the preset word list is obtained, and the mark represents a corresponding character, so that an initial text sequence is obtained.
In this embodiment, a plurality of texts to be completed are obtained and segmented by regular matching to obtain segmented texts, and the characters of the segmented texts are converted into their corresponding marks in a preset word list to obtain a plurality of initial text sequences. The texts to be completed are thus converted into text-sequence form, which facilitates the subsequent probability calculation and helps to improve the efficiency of text completion.
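As an illustrative sketch of steps S11 to S13 (not part of the application itself), the following Python code segments input texts with a regular expression and maps each token to its mark in a preset word list; the vocabulary contents, the [UNK] entry, and the function names are hypothetical.

```python
import re

# Hypothetical preset word list: every vocabulary entry carries a mark (integer id).
# A real system would load this table from the language model's vocabulary file.
PRESET_VOCAB = {"[PAD]": 0, "[UNK]": 1, "I": 2, "want": 3, "to": 4, "eat": 5, "today": 6}

def segment(text):
    # Regular matching: keep runs of word characters, or single non-space
    # symbols, as the segmented text.
    return re.findall(r"\w+|[^\w\s]", text)

def to_initial_sequence(text):
    # Convert each segmented token into its mark in the preset word list.
    return [PRESET_VOCAB.get(tok, PRESET_VOCAB["[UNK]"]) for tok in segment(text)]

initial_sequences = [to_initial_sequence(t) for t in ["I want to eat today", "I want"]]
print(initial_sequences)  # [[2, 3, 4, 5, 6], [2, 3]]
```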
S2: the last position character of each initial text sequence is identified and the last position character is taken as a predicted character.
Specifically, each initial text sequence will subsequently be padded to equal length, and the character at the last position of each initial text sequence is its last real character; it must be marked so that subsequent characters can be predicted from it. The character at the last position of each initial text sequence is therefore identified and taken as the predicted character.
S3: Based on the predicted characters of the initial text sequences, the initial text sequences are padded to equal length according to a preset text length to obtain a plurality of basic text sequences.
Specifically, different texts to be completed have different lengths, so the corresponding initial text sequences also differ in length. However, matrix calculation is subsequently performed over the multiple text sequences, and the vectors entering that calculation must have a consistent length; the multiple initial text sequences therefore need to be padded to equal length, yielding a plurality of basic text sequences.
It should be noted that the preset text length should be no shorter than the longest of the initial text sequences; it is set according to the practical situation and is not limited herein. In a specific embodiment, the preset text length is 10 characters.
Referring to fig. 4, fig. 4 shows an embodiment of step S3, which is described in detail as follows:
S31: The longest text sequence among the initial text sequences is identified as the fixed-length text sequence.
S32: based on the fixed-length text sequence, a preset text length not shorter than the fixed-length text sequence is acquired.
Specifically, since a plurality of initial text sequences are filled with equal length, the longest text sequence in the initial text sequences needs to be obtained, and the preset text length is obtained accordingly, so as to ensure that the preset text length is not shorter than the longest text sequence.
S33: and according to the preset text length, carrying out equal-length filling on the plurality of initial text sequences after the characters are predicted to obtain a plurality of basic text sequences.
Specifically, the equal-length padding of each initial text sequence is appended after the sequence, that is, after its predicted character, and the padding may use [PAD] symbols. For example, if the text to be completed is "I want to eat today" and its initial text sequence is [x1, x2, x3, x4, x5], with a preset text length of 10 characters, the padded sequence is [x1, x2, x3, x4, x5, [PAD], [PAD], [PAD], [PAD], [PAD]], which is then converted into the basic text sequence [x1, x2, x3, x4, x5, x6, x7, x8, x9, x10].
In this embodiment, the longest of the initial text sequences is identified as the fixed-length text sequence, a preset text length no shorter than the fixed-length text sequence is obtained on that basis, and the multiple initial text sequences are then padded to equal length after their predicted characters according to the preset text length to obtain multiple basic text sequences. Initial text sequences of different lengths are thus padded to equal length, which facilitates the subsequent probability calculation and helps to improve the efficiency of text completion.
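A minimal sketch of the equal-length padding of steps S31 to S33, assuming the initial sequences from the previous sketch and a [PAD] mark of 0; the helper name is illustrative only.

```python
def pad_sequences(sequences, pad_id=0, preset_len=None):
    # The preset text length must be no shorter than the longest initial sequence.
    if preset_len is None:
        preset_len = max(len(s) for s in sequences)
    # Padding is appended after each sequence, i.e. after its predicted character
    # (the last real character), up to the preset text length.
    return [s + [pad_id] * (preset_len - len(s)) for s in sequences]

basic_sequences = pad_sequences([[2, 3, 4, 5, 6], [2, 3]], preset_len=10)
# [[2, 3, 4, 5, 6, 0, 0, 0, 0, 0], [2, 3, 0, 0, 0, 0, 0, 0, 0, 0]]
```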
S4: and respectively carrying out vector calculation on the plurality of basic text sequences and a preset vector to form a plurality of matrix vectors, and inputting the plurality of matrix vectors into a language model to respectively carry out probability calculation on predicted characters to obtain a character probability set corresponding to each predicted character.
Specifically, the language model used in the embodiments of the present application is GPT-2, a large-scale unsupervised NLP model. It is a very large Transformer-based model trained on a massive data set, and it can perform a number of different language modeling tasks, such as reading comprehension, question answering, and machine translation, without task-specific training.
In this embodiment, a plurality of matrix vectors are formed by performing vector calculation between each basic text sequence and a preset vector. The preset vector may be a 768-dimensional vector; assuming each basic text sequence is a 10-element vector, each basic text sequence then forms a 10-by-768 matrix vector with the preset vector. The multiple 10-by-768 matrix vectors are input into the GPT-2 model, and the character probability set corresponding to each predicted character is obtained through the stacked computation of the model's Transformer layers.
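The following PyTorch sketch shows how a batch of 10-element basic text sequences could be turned into 10-by-768 matrix vectors through an embedding table; the shapes follow the example above, but the randomly initialized embedding stands in for GPT-2's actual weights.

```python
import torch

basic_sequences = [[2, 3, 4, 5, 6, 0, 0, 0, 0, 0],
                   [2, 3, 0, 0, 0, 0, 0, 0, 0, 0]]
batch = torch.tensor(basic_sequences)          # shape: (num_texts, 10)
embedding = torch.nn.Embedding(50257, 768)     # 50257 is GPT-2's vocabulary size
matrix_vectors = embedding(batch)              # shape: (num_texts, 10, 768)
```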
Referring to fig. 5, fig. 5 shows an embodiment of step S4, which is described in detail as follows:
S41: Vector calculation is performed on each of the plurality of basic text sequences with a preset vector to form a plurality of matrix vectors.
Specifically, each basic text sequence and a preset vector are subjected to vector calculation to form a plurality of matrix vectors.
S42: The padding characters of each matrix vector are identified, and their weights are set to zero to obtain a plurality of target matrix vectors.
Specifically, because the basic text sequences are obtained by padding the initial text sequences to equal length, the padded characters have no real meaning and must not influence the prediction of the probability of the next character. The padded characters of each matrix vector are therefore identified and their scores are set to negative infinity, which achieves the purpose of setting the weight of the padded characters to zero after normalization, yielding a plurality of target matrix vectors.
S43: The target matrix vectors are input into the language model for probability calculation of the predicted characters, respectively, to obtain a character probability set corresponding to each predicted character.
Specifically, the multiple target matrix vectors are input into the GPT-2 model, and through the stacked computation of the model's embedding and Transformer layers, the next token following the last real character of each initial text sequence, that is, the token of the predicted character, is predicted, so that the character probability set corresponding to each predicted character is obtained. Here, a token refers to the prediction target at each position following the initial text sequence.
In this embodiment, a plurality of matrix vectors are formed from the basic text sequences and the preset vector, the padding characters of each matrix vector are identified and their weights set to zero to obtain a plurality of target matrix vectors, and the target matrix vectors are input into the language model for probability calculation of the predicted characters, yielding a character probability set for each predicted character. The multiple text sequences are thus computed by the language model in batch, which effectively improves the efficiency of text completion.
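A sketch of the weight-zeroing of step S42, assuming `batch` holds the padded token ids and `scores` stands in for one layer's raw attention scores; in a real GPT-2 the mask would be applied inside every Transformer layer.

```python
import torch

pad_id = 0
batch = torch.tensor([[2, 3, 4, 5, 6, 0, 0, 0, 0, 0]])  # (num_texts, seq_len)
scores = torch.randn(1, 10, 10)                          # stand-in attention scores

# Scores toward padded positions are set to negative infinity ...
pad_mask = (batch == pad_id)                             # True at padded positions
scores = scores.masked_fill(pad_mask.unsqueeze(1), float("-inf"))
# ... so that after softmax those positions receive exactly zero weight.
weights = torch.softmax(scores, dim=-1)
```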
Referring to fig. 6, fig. 6 shows an embodiment of step S43, which is described in detail as follows:
S431: The target matrix vectors are input into the GPT-2 model for matrix calculation of the predicted characters, respectively, to obtain basic vectors.
S432: The basic vector is mapped into the preset word list to obtain a target vector.
Specifically, the multiple target matrix vectors are input into the language model for matrix calculation of the predicted characters, yielding the basic vector corresponding to each target matrix vector; each basic vector is then mapped into the preset word list to obtain the vocabulary entry corresponding to each of its elements, thereby obtaining the target vector. The target vector is a vocabulary vector, obtained through the matrix calculation of the GPT-2 model and the mapping into the preset word list, over the entries that may appear as the next token of the predicted character.
S433: The target vector is normalized with the softmax function to obtain a character probability set corresponding to each predicted character.
Specifically, the target vectors are normalized by means of the softmax function: each character probability in a target vector is normalized into [0, 1], and the probabilities in the character probability set corresponding to each predicted character sum to 1.
Here, the softmax function can "compress" a K-dimensional vector z of arbitrary real numbers into another K-dimensional real vector σ(z) such that each element lies in the range (0, 1) and all elements sum to 1. In this embodiment, the target vector is normalized by means of the softmax function to obtain the character probabilities.
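A small numerical sketch of the normalization in step S433, using NumPy:

```python
import numpy as np

def softmax(z):
    # Subtracting the maximum is a standard numerical-stability trick;
    # it does not change the result.
    e = np.exp(z - z.max())
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())  # every element lies in (0, 1); the elements sum to 1
```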
S5: The character sequence with the highest probability in each character probability set is obtained by beam search, so as to obtain a target character sequence corresponding to each text to be completed.
Specifically, beam search is a heuristic method for solving optimization problems, developed on the basis of the branch-and-bound method. It heuristically estimates the k best paths and searches downward only from those k paths; that is, only the satisfactory nodes are retained at each level and all other nodes are permanently discarded, which saves a great deal of running time compared with branch and bound.
In this embodiment, based on the character probability sets output by the GPT-2 model, the scheme obtains the final target character sequences by beam search. Although the model has already predicted the probabilities of the individual tokens, greedy search, in which the highest-probability token is taken at every step, may produce output text that is not optimal overall. Beam search instead considers the sequence probability (i.e., the combined probability of all tokens in the sequence): the N highest-probability sequences are retained when predicting each token, and the prediction then recurses.
Referring to fig. 7, fig. 7 shows an embodiment of step S5, which is described in detail as follows:
S51: Based on a preset beam width, the preset-beam-width number of characters with the highest probabilities in each character probability set are selected to obtain a candidate character sequence set, wherein the candidate character sequence set comprises a plurality of candidate character sequences.
Specifically, the beam width is the beam-size hyperparameter of beam search, which specifies the number of candidate characters retained. For example, suppose a character probability set contains the predicted characters I, L, and U with probability values of 0.6, 0.3, and 0.1 respectively; since the preset beam width is 2, I and L are retained as candidate character sequences.
S52: Each candidate character sequence is taken as a predicted character sequence, and the character probability sets are processed recursively through the predicted character sequences to obtain a recursion result corresponding to each character probability set.
Specifically, continuing the example in step S51, I and L are taken as predicted character sequences. Suppose the sequences beginning with I are II, IL, and IU, with probability values of 0.2, 0.7, and 0.1 respectively, and the sequences beginning with L are LI, LL, and LU, with probability values of 0.5, 0.3, and 0.2 respectively. Since the preset beam width is 2, IL and LI are retained as candidate character sequences, and the character probability sets are processed recursively in this way to obtain a recursion result corresponding to each character probability set.
S53: The candidate sequence with the highest probability in each recursion result is selected as the target character sequence corresponding to each text to be completed.
Specifically, following the examples in steps S51 and S52, the recursion results extend IL into ILI, ILL, and ILU, and LI into LII, LIL, and LIU. Assuming that ILU and ILL have the highest probability values among these recursion results, 0.9 and 0.7 respectively, ILU is selected as the target character sequence corresponding to the text to be completed.
In this embodiment, the characters with the highest probabilities in each character probability set are selected based on a preset beam width to obtain a candidate character sequence set; each candidate character sequence is taken as a predicted character sequence, and the character probability sets are processed recursively through the predicted character sequences to obtain a recursion result for each character probability set; the candidate sequence with the highest probability in each recursion result is then selected as the target character sequence corresponding to each text to be completed. The character probability sets are thus computed in parallel and predicted recursively, which speeds up the selection of the target character sequences from the character probability sets and thereby improves the efficiency of text completion.
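A compact, generic beam search sketch in the spirit of steps S51 to S53; `step_probs` is a hypothetical stand-in for the language model's character probability set at each recursion step, not an API of GPT-2.

```python
import math

def beam_search(step_probs, beam_width=2, max_steps=3):
    # Each beam is a (sequence, cumulative log-probability) pair.
    beams = [([], 0.0)]
    for _ in range(max_steps):
        candidates = []
        for seq, logp in beams:
            # step_probs(seq) returns {next_char: probability} for this sequence.
            for ch, p in step_probs(seq).items():
                candidates.append((seq + [ch], logp + math.log(p)))
        # Only the beam_width sequences with the highest overall probability
        # are kept; all other candidates are permanently discarded.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    # The highest-probability surviving sequence is the target character sequence.
    return beams[0][0]
```

With the probabilities of the example above (I/L/U at the first step, then II/IL/IU and LI/LL/LU), the first recursion keeps IL (0.6 × 0.7 = 0.42) and LI (0.3 × 0.5 = 0.15), matching step S52.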
S6: The text to be completed is completed based on the target character sequence.
Specifically, since the target character sequence corresponding to each text to be completed has been obtained in the preceding steps, it only remains to obtain the vocabulary entries corresponding to each target character sequence and complete the corresponding text to be completed with them.
In this embodiment, a plurality of texts to be completed are obtained together with their corresponding text sequences; each text sequence is padded to equal length to obtain a basic text sequence; matrix vectors are formed based on the basic text sequences and input into the language model for probability calculation to obtain character probability sets; the character sequence with the highest probability in each character probability set is obtained by beam search, yielding a target character sequence; and the texts to be completed are completed through the target character sequences. Padding the multiple text sequences to equal length allows probability calculation to be performed on the multiple texts to be completed simultaneously, and obtaining the target character sequences by beam search further improves the efficiency of text completion.
Referring to fig. 8, fig. 8 shows an embodiment of step S6, which is described in detail as follows:
S61: The target character sequence is mapped into the preset word list to obtain the corresponding target text.
S62: The text to be completed is completed according to the target text.
Specifically, the target character sequence is mapped against the preset word list to obtain the position of each of its characters in the word list and thus the corresponding vocabulary entries; the target text of each text to be completed is thereby obtained and appended to the corresponding text to be completed, which achieves the completion of the multiple texts to be completed.
In this embodiment, the target character sequence is mapped into the preset word list to obtain the corresponding target text, and the text to be completed is then completed according to the target text, so that multiple texts to be completed are completed and the efficiency of text completion is improved.
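A closing sketch of steps S61 and S62, reusing the hypothetical PRESET_VOCAB from the first sketch to map a target character sequence back to text:

```python
PRESET_VOCAB = {"[PAD]": 0, "[UNK]": 1, "I": 2, "want": 3, "to": 4, "eat": 5, "today": 6}
ID_TO_TOKEN = {mark: tok for tok, mark in PRESET_VOCAB.items()}  # inverse word list

def complete(text, target_sequence):
    # Look up each mark of the target character sequence in the preset word list
    # and append the resulting target text to the text to be completed.
    target_text = " ".join(ID_TO_TOKEN[i] for i in target_sequence)
    return text + " " + target_text

print(complete("I want", [4, 5]))  # -> "I want to eat"
```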
It should be emphasized that, to further ensure the privacy and security of the text to be completed, the text to be completed may also be stored in a node of a blockchain.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the computer program is executed. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
Referring to fig. 9, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a text completion apparatus based on a language model, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be applied to various electronic devices.
As shown in fig. 9, the text completion apparatus based on the language model of this embodiment includes: an initial text sequence obtaining module 71, a predicted character determining module 72, an initial text sequence filling module 73, a character probability set obtaining module 74, a target character sequence obtaining module 75, and a text completion module 76, wherein:
an initial text sequence obtaining module 71, configured to obtain multiple texts to be completed, and obtain, based on a preset word list, text sequences corresponding to the multiple texts to be completed, so as to obtain multiple initial text sequences;
a predicted character determining module 72 for identifying the character at the last position of each initial text sequence and using the character at the last position as a predicted character;
an initial text sequence filling module 73, configured to perform equal-length filling on the initial text sequence according to a preset text length based on predicted characters of the initial text sequence, so as to obtain multiple basic text sequences;
a character probability set obtaining module 74, configured to perform vector calculation on the multiple basic text sequences and preset vectors respectively to form multiple matrix vectors, and input the multiple matrix vectors into the language model to perform probability calculation on predicted characters respectively, so as to obtain a character probability set corresponding to each predicted character;
a target character sequence obtaining module 75, configured to obtain, in a beam search manner, the character sequence with the highest probability in each character probability set, so as to obtain a target character sequence corresponding to each text to be completed;
and a text completion module 76 for completing the text to be completed based on the target character sequence.
Further, the initial text sequence obtaining module 71 includes:
a to-be-completed text acquisition unit for acquiring a plurality of texts to be completed;
a segmented text acquisition unit for segmenting the plurality of texts to be completed by regular matching to obtain segmented texts;
and the segmented text conversion unit is used for converting the characters of the segmented text into corresponding marks in the preset word list based on the preset word list to obtain a plurality of initial text sequences.
Further, the initial text sequence filling module 73 includes:
a fixed-length text sequence determining unit, configured to identify a longest text sequence in the initial text sequence as a fixed-length text sequence;
a preset text length acquisition unit configured to acquire a preset text length not shorter than the fixed-length text sequence based on the fixed-length text sequence;
and the basic text sequence acquisition unit is used for performing equal-length filling on the plurality of initial text sequences after the characters are predicted according to the preset text length to obtain a plurality of basic text sequences.
Further, the character probability set obtaining module 74 includes:
the matrix vector determining unit is used for forming a plurality of matrix vectors by the plurality of basic text sequences and the preset vector respectively;
the target matrix vector generating unit is used for identifying the filling character of each matrix vector and setting the weight of the filling character to be zero to obtain a plurality of target matrix vectors;
and the probability calculation unit is used for inputting the target matrix vectors into the language model to respectively calculate the probability of the predicted characters so as to obtain a character probability set corresponding to each predicted character.
Further, the probability calculation unit includes:
the basic vector generating subunit is used for inputting the target matrix vectors into the language model to perform matrix calculation of predicted characters respectively to obtain basic vectors;
the target vector generating subunit is used for mapping the basic vector to a preset word list to obtain a target vector;
and the target vector processing subunit is used for performing normalization processing on the target vector in a mode of a softmax function to obtain a character probability set corresponding to each predicted character.
Further, the target character sequence acquiring module 75 includes:
a candidate character sequence selecting unit for selecting, based on a preset beam width, the preset-beam-width number of characters with the highest probabilities in each character probability set to obtain a candidate character sequence set, wherein the candidate character sequence set comprises a plurality of candidate character sequences;
the recursive result acquisition unit is used for taking each candidate character sequence as a predicted character sequence and carrying out recursive processing on the character probability sets through the predicted character sequences to obtain a recursive result corresponding to each character probability set;
and a candidate sequence selecting unit for selecting the candidate sequence with the highest probability in each recursion result as the target character sequence corresponding to each text to be completed.
Further, the text completion module 76 includes:
a target text acquisition unit for mapping the target character sequence into the preset word list to obtain a corresponding target text;
and a target text completion unit for completing the text to be completed according to the target text.
It should be emphasized that, to further ensure the privacy and security of the text to be completed, the text to be completed may also be stored in a node of a blockchain.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 10, fig. 10 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 8 includes a memory 81, a processor 82, and a network interface 83 communicatively connected to each other via a system bus. It is noted that only a computer device 8 having the three components memory 81, processor 82, and network interface 83 is shown, but it should be understood that not all of the illustrated components need be implemented; more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory 81 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 81 may be an internal storage unit of the computer device 8, such as a hard disk or a memory of the computer device 8. In other embodiments, the memory 81 may be an external storage device of the computer device 8, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, or a Flash memory Card (Flash Card) provided on the computer device 8. Of course, the memory 81 may also include both internal and external storage devices of the computer device 8. In this embodiment, the memory 81 is generally used for storing the operating system installed on the computer device 8 and various types of application software, such as the program code of the text completion method based on a language model. Further, the memory 81 may also be used to temporarily store various types of data that have been output or are to be output.
Processor 82 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 82 is typically used to control the overall operation of the computer device 8. In this embodiment, the processor 82 is configured to execute the program code stored in the memory 81 or process data, such as the program code of the above-mentioned text completion method based on the language model, so as to implement various embodiments of the text completion method based on the language model.
The network interface 83 may include a wireless network interface or a wired network interface, and the network interface 83 is generally used to establish communication connections between the computer device 8 and other electronic devices.
The present application further provides another embodiment, which is to provide a computer-readable storage medium storing a computer program, which is executable by at least one processor to cause the at least one processor to perform the steps of a language model based text completion method as described above.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method of the embodiments of the present application.
The blockchain referred to in the present application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks associated by cryptographic methods, each data block containing information on a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
It is to be understood that the above-described embodiments are merely some, not all, embodiments of the present application; the appended drawings illustrate preferred embodiments and do not limit the scope of the application. This application may be embodied in many different forms, and these embodiments are provided so that the disclosure of the application will be thorough. Although the present application has been described in detail with reference to the foregoing embodiments, it will be apparent to one skilled in the art that the present application may be practiced with modifications to, or equivalent replacements of, some of the features described in the foregoing embodiments. All equivalent structures made using the contents of the specification and drawings of the present application, applied directly or indirectly in other related technical fields, fall within the protection scope of the present application.

Claims (10)

1. A text completion method based on a language model is characterized by comprising the following steps:
acquiring a plurality of texts to be completed, and acquiring text sequences corresponding to the plurality of texts to be completed based on a preset word list to obtain a plurality of initial text sequences;
identifying the character at the last position of each initial text sequence, and taking the character at the last position as a predicted character;
based on the predicted characters of the initial text sequence, performing equal-length filling on the initial text sequence according to a preset text length to obtain a plurality of basic text sequences;
respectively carrying out vector calculation on a plurality of basic text sequences and a preset vector to form a plurality of matrix vectors, and inputting the plurality of matrix vectors into a language model to respectively carry out probability calculation on the predicted characters to obtain a character probability set corresponding to each predicted character;
acquiring a character sequence with the highest probability in each character probability set by adopting a beam search mode to obtain a target character sequence corresponding to each text to be completed;
and completing the text to be completed based on the target character sequence.
2. The method according to claim 1, wherein the obtaining a plurality of texts to be completed and obtaining a plurality of text sequences corresponding to the texts to be completed based on a preset vocabulary to obtain a plurality of initial text sequences comprises:
acquiring a plurality of texts to be completed;
segmenting the plurality of texts to be completed by adopting a regular matching mode to obtain segmented texts;
and converting the characters of the segmented text into corresponding marks in the preset word list based on the preset word list to obtain a plurality of initial text sequences.
3. The method of claim 1, wherein the filling the initial text sequence with equal length according to a preset text length based on the predicted characters of the initial text sequence to obtain a plurality of basic text sequences comprises:
identifying the longest text sequence in the initial text sequence as a fixed-length text sequence;
acquiring a preset text length which is not shorter than the fixed-length text sequence based on the fixed-length text sequence;
and according to the preset text length, performing equal-length filling on the plurality of initial text sequences after the characters are predicted to obtain a plurality of basic text sequences.
4. The method for completing text based on language model according to claim 1, wherein said vector-computing a plurality of said basic text sequences with a predetermined vector to form a plurality of matrix vectors, and inputting a plurality of said matrix vectors into a language model to perform probability-computing of said predicted characters, respectively, to obtain a character probability set corresponding to each said predicted character, comprises:
respectively carrying out vector calculation on the plurality of basic text sequences and a preset vector to form a plurality of matrix vectors;
identifying a filling character of each matrix vector, and setting the weight of the filling character to be zero to obtain a plurality of target matrix vectors;
and inputting a plurality of target matrix vectors into the language model to respectively carry out probability calculation of the predicted characters to obtain a character probability set corresponding to each predicted character.
5. The method for completing text based on language model according to claim 4, wherein said inputting a plurality of said target matrix vectors into said language model for probability calculation of said predicted characters to obtain a character probability set corresponding to each of said predicted characters comprises:
inputting a plurality of target matrix vectors into the language model to perform matrix calculation of the predicted characters respectively to obtain basic vectors;
mapping the basic vector to the preset word list to obtain a target vector;
and performing normalization processing on the target vector by adopting a softmax function mode to obtain a character probability set corresponding to each predicted character.
6. The method for completing text based on language model according to claim 1, wherein said obtaining the character sequence with the highest probability in each character probability set by using a beam search to obtain the target character sequence corresponding to each text to be completed comprises:
selecting, based on a preset beam width, the preset-beam-width number of characters with the highest probability in each character probability set to obtain a candidate character sequence set, wherein the candidate character sequence set comprises a plurality of candidate character sequences;
taking each candidate character sequence as a predicted character sequence, and performing recursion processing on the character probability sets through the predicted character sequences to obtain a recursion result corresponding to each character probability set;
and selecting the candidate sequence with the maximum probability in each recursion result as a target character sequence corresponding to each text to be completed.
7. The method according to any one of claims 1 to 6, wherein the completing the text to be completed based on the target character sequence comprises:
mapping the target character sequence to the preset word list based on the preset word list to obtain a corresponding target text;
and completing the text to be supplemented according to the target text.
8. A text completion apparatus based on a language model, comprising:
the initial text sequence acquisition module is used for acquiring a plurality of texts to be completed, acquiring text sequences corresponding to the plurality of texts to be completed based on a preset word list, and acquiring a plurality of initial text sequences;
the predicted character determining module is used for identifying the character at the last position of each initial text sequence and taking the character at the last position as a predicted character;
the initial text sequence filling module is used for carrying out equal-length filling on the initial text sequence according to a preset text length based on the predicted characters of the initial text sequence to obtain a plurality of basic text sequences;
the character probability set acquisition module is used for respectively carrying out vector calculation on the plurality of basic text sequences and preset vectors to form a plurality of matrix vectors, inputting the plurality of matrix vectors into a language model to respectively carry out probability calculation on the predicted characters, and obtaining a character probability set corresponding to each predicted character;
the target character sequence acquisition module is used for acquiring the character sequence with the highest probability in each character probability set in a beam search mode to obtain the target character sequence corresponding to each text to be completed;
and the to-be-completed text completion module is used for completing the to-be-completed text based on the target character sequence.
9. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the language model-based text completion method according to any one of claims 1 to 7.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the language model-based text completion method according to any one of claims 1 to 7.
CN202110712451.8A 2021-06-25 2021-06-25 Text completion method, device, equipment and storage medium based on language model Pending CN113434632A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110712451.8A CN113434632A (en) 2021-06-25 2021-06-25 Text completion method, device, equipment and storage medium based on language model

Publications (1)

Publication Number Publication Date
CN113434632A 2021-09-24

Family

ID=77754571

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110712451.8A Pending CN113434632A (en) 2021-06-25 2021-06-25 Text completion method, device, equipment and storage medium based on language model

Country Status (1)

Country Link
CN (1) CN113434632A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150347381A1 (en) * 2014-05-30 2015-12-03 Apple Inc. Entropy-guided text prediction using combined word and character n-gram language models
US20170255278A1 (en) * 2014-10-16 2017-09-07 Touchtype Ltd. Text prediction integration
CN111832292A (en) * 2020-06-03 2020-10-27 北京百度网讯科技有限公司 Text recognition processing method and device, electronic equipment and storage medium
CN111966917A (en) * 2020-07-10 2020-11-20 电子科技大学 Event detection and summarization method based on pre-training language model
CN112348073A (en) * 2020-10-30 2021-02-09 北京达佳互联信息技术有限公司 Polyphone recognition method and device, electronic equipment and storage medium
CN112560476A (en) * 2020-12-09 2021-03-26 中科讯飞互联(北京)信息科技有限公司 Text completion method, electronic device and storage device
CN112580310A (en) * 2020-12-28 2021-03-30 河北省讯飞人工智能研究院 Missing character/word completion method and electronic equipment
CN112749253A (en) * 2020-12-28 2021-05-04 湖南大学 Multi-text abstract generation method based on text relation graph
CN112818663A (en) * 2021-01-15 2021-05-18 北京有竹居网络技术有限公司 Processing method for language model, text generation method, text generation device and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Damián Pascual et al.: "Directed Beam Search: Plug-and-Play Lexically Constrained Language Generation", arXiv:2012.15416v1 [cs.CL], pages 1-9 *

Similar Documents

Publication Publication Date Title
CN111814466A (en) Information extraction method based on machine reading understanding and related equipment thereof
CN112084752A (en) Statement marking method, device, equipment and storage medium based on natural language
CN112631924A (en) Automatic testing method and device, computer equipment and storage medium
CN112836521A (en) Question-answer matching method and device, computer equipment and storage medium
CN113947095A (en) Multilingual text translation method and device, computer equipment and storage medium
CN112181835A (en) Automatic testing method and device, computer equipment and storage medium
CN112528029A (en) Text classification model processing method and device, computer equipment and storage medium
CN112699213A (en) Speech intention recognition method and device, computer equipment and storage medium
CN114780701A (en) Automatic question-answer matching method, device, computer equipment and storage medium
CN113283222B (en) Automatic report generation method and device, computer equipment and storage medium
CN116186295B Attention-based knowledge graph link prediction method, device, equipment and medium
CN112232052A (en) Text splicing method and device, computer equipment and storage medium
CN117195886A (en) Text data processing method, device, equipment and medium based on artificial intelligence
CN116774973A (en) Data rendering method, device, computer equipment and storage medium
CN114358023B (en) Intelligent question-answer recall method, intelligent question-answer recall device, computer equipment and storage medium
CN116168403A (en) Medical data classification model training method, classification method, device and related medium
CN113434632A (en) Text completion method, device, equipment and storage medium based on language model
CN112346737B (en) Method, device and equipment for training programming language translation model and storage medium
CN115373634A (en) Random code generation method and device, computer equipment and storage medium
CN113505595A (en) Text phrase extraction method and device, computer equipment and storage medium
CN112949320A (en) Sequence labeling method, device, equipment and medium based on conditional random field
CN114490969A (en) Question and answer method and device based on table and electronic equipment
CN112396111A (en) Text intention classification method and device, computer equipment and storage medium
CN113761375A (en) Message recommendation method, device, equipment and storage medium based on neural network
CN113065354A (en) Method for identifying geographic position in corpus and related equipment thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination