Detailed Description
The following describes the embodiments of the present application in detail with reference to the drawings.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, interfaces, techniques, etc., in order to provide a thorough understanding of the present application.
The terms "system" and "network" are often used interchangeably herein. The term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may represent three cases: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" herein generally indicates an "or" relationship between the preceding and following associated objects. Further, "a plurality" herein means two or more.
Referring to fig. 1, fig. 1 is a flow chart illustrating an embodiment of a text completion method of the present application. Specifically, the method may include the steps of:
Step S11: obtain the text to be completed.
In the embodiment of the present disclosure, the text to be completed includes at least one missing position; that is, the text to be completed may include 1 missing position, or may include multiple (e.g., 2, 3, etc.) missing positions, which is not limited herein. In the embodiments of the present disclosure and the embodiments described below, unless otherwise specified, "()" indicates a missing position. For example, for the complete text "The British medical journal 'The Lancet' published online the phase II clinical trial results of the XX vaccine of the Chinese military medical institute", the corresponding text to be completed may be "The British () 'The Lancet' published online the phase II clinical trial results of the XX vaccine of the Chinese military medical institute", i.e., the text to be completed may include 1 missing position; or the corresponding text to be completed may be "The () () 'The Lancet' published online the phase II clinical trial results of the XX vaccine of the Chinese military medical institute", i.e., the text to be completed may include 2 missing positions; or the corresponding text to be completed may be "The () () () published online the phase II clinical trial results of the XX vaccine of the Chinese military medical institute", i.e., the text to be completed may include 3 missing positions. Other situations can be deduced by analogy and are not exemplified here.
It should be noted that each missing position may correspond to one missing character, or may correspond to a plurality (e.g., 2, 3, etc.) of missing characters. Still taking the aforementioned complete text "The British medical journal 'The Lancet' published online the phase II clinical trial results of the XX vaccine of the Chinese military medical institute" as an example, the corresponding text to be completed may be "The British () 'The Lancet' published online the phase II clinical trial results of the XX vaccine of the Chinese military medical institute", where the missing position corresponds to 4 missing characters (the characters of "medical journal" in the original Chinese); or the corresponding text to be completed may be "The British medical () 'The Lancet' published online the phase II clinical trial results of the XX vaccine of the Chinese military medical institute", where the missing position corresponds to 2 missing characters (the characters of "journal"); or the corresponding text to be completed may be "The British medical journal 'The Lancet' published online the phase II clinical trial results of the XX vaccine of the Chinese () medical institute", where the missing position corresponds to 1 missing character (the character of "military"). Other situations can be deduced by analogy and are not exemplified here.
In addition, in the embodiments of the present disclosure and the other disclosed embodiments described below, the cause of the missing content in the text to be completed is not limited. For example, the missing text may be caused by problems such as network congestion or coding errors in links such as transmission, storage, and display; or, for example, during confidential text transmission, special encoding or conversion is often applied to sensitive content such as place names, person names, and organization names, which may also result in missing text.
Step S12: determine the source condition of the missing content of the text to be completed.
In embodiments of the present disclosure, the source condition includes any one of the following: the source is unknown; the missing content is from a first text library; the missing content is from a second text library related to a preset knowledge domain. Specifically, in the case that the source condition includes being from the first text library, it can be known that the missing content of the text to be completed comes from the first text library, but the exact knowledge domain involved cannot be determined; taking the text to be completed "The World Intellectual Property Organization headquarters is located in ()" as an example, it can be known that the missing content comes from an encyclopedia (e.g., an online encyclopedia such as Wikipedia or Baidu Baike). In the case that the source condition includes being from the second text library related to the preset knowledge domain, it can be known that the missing content of the text to be completed comes from the second text library and that the second text library relates to the preset knowledge domain; taking the text to be completed "As one of the representative figures of the Viennese Classical school, () was appointed court musician of Salzburg in 1772" as an example, it can be known that the missing content comes from the second text library related to the classical music knowledge domain (e.g., an online encyclopedia such as Wikipedia or Baidu Baike, or professional books related to classical music). In the case that the source condition includes the source being unknown, neither the knowledge domain from which the missing content is derived nor the text library from which it is derived can be known. The above examples are only possible situations in practical application and do not limit the text to be completed, the first text library, the second text library, or the preset knowledge domain, which may be set according to the practical application situation and are not exemplified one by one herein.
In one implementation scenario, the complete text of the text to be completed is sent from a sender to a receiver, and text goes missing at the receiver during reception due to the aforementioned reasons; the source condition of the missing content of the text to be completed may then be determined based on a prior agreement between the sender and the receiver. For example, if the sender and the receiver have no prior agreement, it may be determined that the source condition of the missing content is "source unknown"; or, for example, if the sender and the receiver agree in advance that the text does not exceed the scope of an encyclopedia, it may be determined that the source condition is "from the encyclopedia"; or, for example, if the sender and the receiver agree in advance that the text does not exceed the scope of encyclopedia entries involving classical music, it may be determined that the source condition is "from the encyclopedia involving the classical music knowledge domain". Other situations can be deduced by analogy and are not exemplified here.
In another implementation scenario, as described above, the complete text of the text to be completed is sent from the sender to the receiver, and text goes missing at the receiver during reception due to the aforementioned reasons; the source condition of the missing content may also be determined based on the context of the historical dialogue between the sender and the receiver. For example, if the historical dialogue between the sender and the receiver does not involve a particular topic, it may be determined that the source condition is "source unknown"; or, for example, if the historical dialogue mainly involves historical and contemporary figures but is not limited to a specific field, it may be determined that the source condition is "from the encyclopedia"; or, for example, if the historical dialogue mainly involves representative figures of various schools of classical music, it may be determined that the source condition is "from the encyclopedia involving the classical music knowledge domain". Other situations can be deduced by analogy and are not exemplified here.
In yet another implementation scenario, after the text to be completed is obtained, the user may also be prompted to assist in determining the source condition of the missing content. Specifically, the user may first be prompted to select the text library from which the missing content originates, with options that may include: "uncertain", "encyclopedia", and the like. If the user selects the "uncertain" option, it is determined that the source condition is "source unknown". If the user selects a text library (e.g., the "encyclopedia" option), the user may be further prompted to select the knowledge domain related to the missing content, with options that may include: "uncertain", "classical music", "popular music", and the like. If the user then selects "uncertain", it is determined that the source condition is "from the encyclopedia"; if the user selects "classical music", it is determined that the source condition is "from the encyclopedia involving the classical music knowledge domain". Other situations can be deduced by analogy and are not exemplified here.
Step S13: perform completion prediction on the text to be completed by using a text prediction mode matched with the source condition, to obtain at least one candidate word at each missing position.
In one implementation scenario, in the case that the source condition includes the source being unknown, a preset number of default symbols may be added at each missing position of the text to be completed to obtain a text to be processed. For each missing position, the text to be processed is predicted several times to obtain, at each prediction, the predicted characters of the default symbol at the sequence position corresponding to the number of predictions, and candidate words for the missing position are obtained based on the predicted characters of the several predictions. In this way, text completion can be performed without relying on manual work, so that the efficiency of text completion can be improved and the cost of text completion can be reduced. In addition, performing character-by-character prediction at each missing position when the source is unknown can improve the prediction accuracy, thereby improving the accuracy of text completion.
In a specific implementation scenario, the default symbol may be set according to the actual application requirements; for example, it may be set to [mask], which is not limited herein.
In another specific implementation scenario, the preset number may be set according to actual application needs, for example, may be set to 2, 3, 4, 5, etc., which is not limited herein.
In still another specific implementation scenario, in order to improve prediction efficiency, the several predictions performed on the text to be processed for each missing position may be performed by a first prediction network; that is, the text to be processed may be fed into the first prediction network to finally obtain the predicted characters of the default symbols at the sequence positions corresponding to the numbers of predictions. For details, reference may be made to the related descriptions in the disclosed embodiments below, which are not repeated here.
In yet another specific implementation scenario, taking the text to be completed "() medical journal 'The Lancet' published online the phase II clinical trial results of the XX vaccine of the Chinese military medical institute" as an example, 4 default symbols may be added at the missing position to obtain the text to be processed "[mask][mask][mask][mask] medical journal 'The Lancet' published online the phase II clinical trial results of the XX vaccine of the Chinese military medical institute". For this missing position, the 1st prediction can obtain the predicted characters of the default symbol at the 1st sequence position (e.g., the first characters of "British", "American", and "French"), the 2nd prediction can obtain the predicted characters of the default symbol at the 2nd sequence position, and so on, so that the candidate words "British", "American", and "French" for the missing position can finally be obtained through such character-by-character prediction. Other situations can be deduced by analogy and are not exemplified here. The specific prediction process may refer to the related descriptions in the disclosed embodiments below and is not repeated here.
In one implementation scenario, in the case that the source condition includes being from the first text library, the first text library may be utilized to perform completion prediction on the text to be completed to obtain at least one candidate word for each missing position. In this way, text completion can be performed without relying on manual work, so that the efficiency of text completion can be improved and the cost of text completion can be reduced. In addition, when the missing content is derived from the first text library, predicting at least one candidate word for the missing position directly from the first text library can further improve the efficiency of text completion. Moreover, because candidate words are predicted directly for the missing positions, the missing positions are not limited to missing characters, words, or entities, which is beneficial to prediction at mixed granularities of characters, words, entities, and so on.
In a specific implementation scenario, to expand the application scope, the first text library may include text corpora that may be involved in practical applications, such as daily chat, or characters, words, terms, and entities that may occur in professional scenarios such as finance and music. For example, the first text library may include the corpus of online encyclopedias such as Baidu Baike and Wikipedia, so that it can be applied to various business scenarios, greatly expanding the application scope.
In another specific implementation scenario, in order to improve prediction efficiency, the completion prediction on the text to be completed may be performed by a second prediction network; that is, the text to be completed may be fed into the second prediction network to finally obtain at least one candidate word for each missing position. For details, reference may be made to the related descriptions in the disclosed embodiments below, which are not repeated here.
In yet another specific implementation scenario, taking the text to be completed "() medical journal 'The Lancet' published online the phase II clinical trial results of the XX vaccine of the Chinese military medical institute" as an example, completion prediction is performed on it using the first text library, and at least one candidate word for the missing position, such as "British", "American", and "French", can be obtained. Other situations can be deduced by analogy and are not exemplified here. The specific prediction process may refer to the related descriptions in the disclosed embodiments below and is not repeated here.
In one implementation scenario, in the case that the source condition includes being from the second text library related to the preset knowledge domain, the knowledge graph corresponding to the preset knowledge domain and the second text library may be utilized to perform completion prediction on the text to be completed to obtain at least one candidate word for each missing position. In this way, text completion can be performed without relying on manual work, so that the efficiency of text completion can be improved and the cost of text completion can be reduced. In addition, when the missing content is derived from the second text library related to the preset knowledge domain, on the one hand, predicting at least one candidate word for the missing position directly from the second text library can further improve the efficiency of text completion; on the other hand, predicting at least one candidate word with the aid of the knowledge graph corresponding to the preset knowledge domain can improve the accuracy of the candidate words.
In a specific implementation scenario, to expand the application scope, the second text library may include text corpora that may be involved in practical applications, such as daily chat, or characters, words, terms, and entities that may occur in professional scenarios such as finance and music. For example, the second text library may include the corpus of online encyclopedias such as Baidu Baike and Wikipedia, so that it can be applied to various business scenarios, greatly expanding the application scope.
In another specific implementation scenario, in order to improve prediction efficiency, the completion prediction on the text to be completed may be performed by a third prediction network; that is, the text to be completed may be fed into the third prediction network to finally obtain at least one candidate word for each missing position. For details, reference may be made to the related descriptions in the disclosed embodiments below, which are not repeated here.
In another specific implementation scenario, taking the text to be completed "The British () 'The Lancet' published online the phase II clinical trial results of the XX vaccine of the Chinese military medical institute" as an example, the knowledge graph corresponding to the medical knowledge domain and the second text library may be utilized to perform completion prediction on it, and at least one candidate word for the missing position, such as "medical journal", "journal", and "newspaper", can be obtained. Other situations can be deduced by analogy and are not exemplified here. The specific prediction process may refer to the related descriptions in the disclosed embodiments below and is not repeated here.
In addition, referring to fig. 2, fig. 2 is a schematic diagram of an embodiment of a text completion method of the present application. As shown in fig. 2, in order to improve the efficiency of completion prediction, the completion prediction may be performed using the first prediction network in the case that the source condition includes the source being unknown, using the second prediction network in the case that the source condition includes being from the first text library, or using the third prediction network in the case that the source condition includes being from the second text library related to the preset knowledge domain. Therefore, different prediction networks can be utilized for completion prediction under different source conditions, so that the application scope of text completion can be expanded.
In one implementation scenario, in order to facilitate text completion within the framework shown in fig. 2, the first text library and the second text library may be the same text library. As mentioned above, to expand the application scope, the text library may include text corpora that may be involved in practical applications, such as daily chat, or characters, words, terms, and entities that may occur in professional scenarios such as finance and music. For example, the text library may include the corpus of online encyclopedias such as Baidu Baike and Wikipedia, so that it can be applied to various business scenarios, greatly expanding the application scope.
In one implementation scenario, the first prediction network, the second prediction network, and the third prediction network may be obtained by training different preset neural networks with different sample texts in different training manners. For example, the first prediction network may be obtained by training a first preset neural network with a first sample text in a first training manner, the second prediction network may be obtained by training a second preset neural network with a second sample text in a second training manner, and the third prediction network may be obtained by training a third preset neural network with a third sample text in a third training manner.
In another implementation scenario, in order to reduce training complexity, the first prediction network, the second prediction network, and the third prediction network may be obtained by training the same preset neural network with the same sample text in different training manners; that is, the three prediction networks may share the sample text and the preset neural network during training, which is beneficial to reducing training complexity. The specific training manners of the first prediction network, the second prediction network, and the third prediction network may refer to the related descriptions in other disclosed embodiments of the present application and are not repeated here.
It should be noted that the above-mentioned preset neural network may be set according to the actual application situation, and may include, but is not limited to: BERT (Bidirectional Encoder Representations from Transformers), ELMo (Embeddings from Language Models), GPT (Generative Pre-Training), and the like, which is not limited herein.
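For orientation, the following is a minimal sketch of loading such a preset neural network for masked prediction; it assumes the Hugging Face transformers library and the bert-base-chinese checkpoint, neither of which is mandated by the present application:

```python
# A minimal sketch, assuming the Hugging Face "transformers" library and the
# "bert-base-chinese" checkpoint; the present application mandates neither.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
model.eval()

# "[MASK]" plays the role of the default symbol; the tokenizer adds the
# start flag [CLS] and the end flag [SEP] automatically.
text = "世界知识产权组织总部设在[MASK][MASK][MASK][MASK]"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, sequence_length, vocabulary_size)
```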
Step S14: obtain the complete text of the text to be completed by utilizing the candidate words at each missing position.
As shown in fig. 2, after the candidate words at each missing position of the text to be completed are obtained, the candidate words at the missing positions are further combined to perform combined completion prediction on the text to be completed, so as to obtain the complete text of the text to be completed.
In one implementation scenario, a corresponding candidate word may be added at each missing position, so that a plurality of candidate texts of the text to be completed may be obtained, a final score of each candidate text may be obtained, and one candidate text may be selected as a complete text of the text to be completed based on the final scores of the plurality of candidate texts.
In a specific implementation scenario, the corresponding candidate word added at a missing position is specifically a candidate word predicted at that missing position; thus, in the case that the text to be completed includes n missing positions and each missing position is predicted to obtain k candidate words, there are k^n candidate texts of the text to be completed in total. Taking the text to be completed "() medical journal 'The Lancet' published online the phase II clinical trial results of the XX vaccine of the () military medical institute" as an example, suppose the candidate words at the first missing position include "British" and "American", and the candidate words at the second missing position include "Chinese" and "Japanese". A corresponding candidate word may be added at each missing position, so that 2^2 = 4 candidate texts in total can be obtained: "The British medical journal 'The Lancet' published online the phase II clinical trial results of the XX vaccine of the Chinese military medical institute", "The British medical journal 'The Lancet' published online the phase II clinical trial results of the XX vaccine of the Japanese military medical institute", "The American medical journal 'The Lancet' published online the phase II clinical trial results of the XX vaccine of the Chinese military medical institute", and "The American medical journal 'The Lancet' published online the phase II clinical trial results of the XX vaccine of the Japanese military medical institute". Other situations can be deduced by analogy and are not exemplified here.
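The enumeration of the k^n combinations can be sketched as follows; the template string and candidate lists are hypothetical illustrations, not part of the present application:

```python
# A minimal sketch of enumerating the k^n candidate texts; the template and
# candidate lists below are hypothetical illustrations.
from itertools import product

template = ("The {} medical journal 'The Lancet' published online the phase II "
            "clinical trial results of the XX vaccine of the {} military medical institute")
candidates_per_position = [["British", "American"], ["Chinese", "Japanese"]]

# One candidate text per element of the Cartesian product: k^n = 2^2 = 4 texts.
candidate_texts = [template.format(*combo) for combo in product(*candidates_per_position)]
for text in candidate_texts:
    print(text)
```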
In another specific implementation scenario, in order to improve the efficiency and accuracy of scoring the candidate texts, the several candidate texts may be respectively fed into a preset scoring network to obtain the final score of each candidate text. The preset scoring network may be an N-gram based statistical language network, and may specifically include, but is not limited to: KenLM, SRILM, IRSTLM, BerkeleyLM, and the like, which is not limited herein. Taking N = 3 as an example, the final score can be expressed as:
P(w_1, …, w_n) = P(w_1) * … * P(w_n | w_{n-1}, w_{n-2}) …… (1)
In the above formula (1), w_1, …, w_n represent the n words in the candidate text, where w_i represents the i-th word; the factors on the right side, P(w_1), …, P(w_n | w_{n-1}, w_{n-2}), are predicted by the preset scoring network. Taking the aforementioned candidate text "The British medical journal 'The Lancet' published online the phase II clinical trial results of the XX vaccine of the Chinese military medical institute" as an example, the words in the candidate text may be: British, medical journal, The Lancet, online, published, Chinese, military, medical, institute, XX vaccine, phase II, clinical, trial, results; before the candidate text is fed into the preset scoring network, the words in it may be distinguished with a separator (e.g., a space). Alternatively, to adapt to mixed granularities of characters, words, entities, and so on, words may be further segmented character by character based on the part-of-speech category of each word (e.g., the place names "British" and "Chinese" may each be split character by character), and the resulting tokens are likewise distinguished with the separator before the candidate text is fed into the preset scoring network. Other situations can be deduced by analogy and are not exemplified here. For the specific process of character-by-character segmentation based on part-of-speech category, reference may be made to the related descriptions in the disclosed embodiments below, which are not repeated here.
In yet another specific implementation scenario, after the final scores of the several candidate texts are obtained, the candidate text corresponding to the maximum final score may be selected as the complete text of the text to be completed. Taking the aforementioned 4 candidate texts as an example, the 4 candidate texts may be respectively fed into the preset scoring network to obtain their final scores, and the candidate text corresponding to the maximum final score is taken as the complete text of the text to be completed; for example, when the final score of the candidate text "The British medical journal 'The Lancet' published online the phase II clinical trial results of the XX vaccine of the Chinese military medical institute" is the maximum, that candidate text may be taken as the complete text of the text to be completed. Other situations can be deduced by analogy and are not exemplified here.
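As an illustrative sketch of this scoring step, assuming KenLM (one of the N-gram statistical language networks named above) and a hypothetical pre-trained 3-gram model file:

```python
# A minimal sketch of scoring candidate texts with KenLM; "trigram.arpa" is a
# hypothetical 3-gram (N = 3) model trained elsewhere on the text library.
import kenlm

model = kenlm.Model("trigram.arpa")

# Candidate texts are pre-tokenized and space-separated, as described above;
# the texts here are hypothetical placeholders.
candidate_texts = [
    "英国 医学期刊 柳叶刀 在线 发表 中国 军事 医学 研究院 XX疫苗 二期 临床 试验 结果",
    "美国 医学期刊 柳叶刀 在线 发表 中国 军事 医学 研究院 XX疫苗 二期 临床 试验 结果",
]
# model.score returns log10 P(w_1, ..., w_n), i.e., the logarithm of formula (1).
final_scores = {t: model.score(t, bos=True, eos=True) for t in candidate_texts}
complete_text = max(final_scores, key=final_scores.get)
```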
In another implementation scenario, in order to improve the accuracy of the final score, a corresponding candidate word may likewise be added at each missing position to obtain several candidate texts of the text to be completed; for each candidate text, the words in the candidate text are reversely ordered to obtain the reverse text of the candidate text, the final score of the candidate text is obtained based on a first score of the candidate text and a second score of its reverse text, and one candidate text is then selected as the complete text of the text to be completed based on the final scores of the several candidate texts. In this way, both the forward order and the reverse order of a candidate text are considered when scoring it, so the accuracy of the final score can be improved, and the accuracy of the complete text subsequently obtained based on the final scores can be improved accordingly.
In a specific implementation scenario, the corresponding candidate word added at a missing position is specifically a candidate word predicted at that missing position; reference may be made to the foregoing related description, which is not repeated here.
In another specific implementation scenario, as described above, in order to improve the efficiency and accuracy of scoring the candidate texts, a first scoring network and a second scoring network may be trained in advance, so that the first score may be obtained by processing a candidate text with the first scoring network, and the second score may be obtained by processing its reverse text with the second scoring network. That is, for each candidate text, the candidate text may be fed into the first scoring network to obtain the first score, and the reverse text of the candidate text may be fed into the second scoring network to obtain the second score. Furthermore, as described above, the first scoring network and the second scoring network may each be an N-gram based statistical language network, which may specifically include, but is not limited to: KenLM, SRILM, IRSTLM, BerkeleyLM, and the like, which is not limited herein. Taking N = 3 as an example, the first score may be obtained as described above, and the second score can be expressed as:
P(w_1, …, w_n) = P(w_n) * … * P(w_1 | w_2, w_3) …… (2)
In the above formula (2), w_1, …, w_n represent the n words in the candidate text, where w_i represents the i-th word; the factors on the right side, P(w_n), …, P(w_1 | w_2, w_3), are predicted by the second scoring network. Still taking the aforementioned candidate text "The British medical journal 'The Lancet' published online the phase II clinical trial results of the XX vaccine of the Chinese military medical institute" as an example, with the words of the candidate text as listed above, the reverse text of the candidate text arranges those words in reverse order, i.e., "results trial clinical phase II XX vaccine institute medical military Chinese published online The Lancet medical journal British"; before the reverse text is fed into the second scoring network, the words in it may likewise be distinguished with a separator (e.g., a space). Alternatively, to adapt to mixed granularities of characters, words, entities, and so on, words may be further segmented character by character based on the part-of-speech category of each word before the reverse ordering. Other situations can be deduced by analogy and are not exemplified here. For the specific process of character-by-character segmentation based on part-of-speech category, reference may be made to the related descriptions in the disclosed embodiments below, which are not repeated here.
In still another specific implementation scenario, the final score may specifically be obtained by weighting the first score and the second score with a first weight and a second weight, respectively, where the first weight is not less than the second weight; for example, the first weight may be 0.6 and the second weight 0.4, or the first weight 0.7 and the second weight 0.3, which is not limited herein. For ease of description, the first weight may be denoted as λ and the second weight as 1 − λ; the final score can then be expressed as:
score = λ * g_f(x) + (1 − λ) * g_b(x) …… (3)
In the above formula (3), score represents the final score of the candidate text x, g_f(x) represents the first score of the candidate text x, and g_b(x) represents the second score of the candidate text x.
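A sketch of formula (3), assuming two KenLM models as the first and second scoring networks; the model file names, the assumption that the second model was trained on reversed-order text, and λ = 0.6 are illustrative:

```python
# A minimal sketch of formula (3), assuming two KenLM models as the first and
# second scoring networks; the model file names are hypothetical, and the
# second model is assumed to have been trained on reversed-order text.
import kenlm

forward_lm = kenlm.Model("forward_trigram.arpa")    # first scoring network
backward_lm = kenlm.Model("backward_trigram.arpa")  # second scoring network

def final_score(candidate_text: str, lam: float = 0.6) -> float:
    """score = lam * g_f(x) + (1 - lam) * g_b(x), with the first weight lam
    not less than the second weight 1 - lam."""
    reverse_text = " ".join(reversed(candidate_text.split()))
    g_f = forward_lm.score(candidate_text, bos=True, eos=True)  # first score
    g_b = backward_lm.score(reverse_text, bos=True, eos=True)   # second score
    return lam * g_f + (1.0 - lam) * g_b
```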
In yet another specific implementation scenario, after obtaining the final scores of the several candidate texts, the candidate text corresponding to the maximum final score may be selected as the complete text of the text to be completed. Reference may be made specifically to the foregoing related description, and details are not repeated here.
According to the above scheme, the text to be completed is obtained, where the text to be completed includes at least one missing position; the source condition of the missing content of the text to be completed is determined, where the source condition includes any one of the following: the source is unknown, the missing content is from the first text library, or the missing content is from the second text library related to the preset knowledge domain; completion prediction is performed on the text to be completed by using a text prediction mode matched with the source condition to obtain at least one candidate word at each missing position; and the complete text of the text to be completed is obtained by utilizing the candidate words at each missing position. Therefore, the missing content of the text to be completed can be completed without relying on manual work, so that the efficiency of text completion can be improved and the cost of text completion can be reduced. In addition, performing completion prediction in a text prediction mode matched with the source condition can expand the application scope of text completion.
As described in the foregoing disclosed embodiments, the first prediction network, the second prediction network, and the third prediction network may share the sample text and the preset neural network during training; therefore, the sample text used to subsequently train the first prediction network, the second prediction network, and the third prediction network may be constructed in advance, and the sample text is then used to train and obtain the three prediction networks respectively. Referring to fig. 3, fig. 3 is a flow chart illustrating an embodiment of a sample text obtaining process. As shown in fig. 3, the process may specifically include the following steps:
Step S31: perform word segmentation and part-of-speech tagging on the original text to obtain a plurality of words labeled with part-of-speech categories.
In one implementation scenario, the original text may be text related to a business scenario. For example, in a finance-related business scenario, the original text may include, but is not limited to: financial news data, financial book data, and the like; or, in a sports-related business scenario, the original text may include, but is not limited to: sports news data, sports book data, and the like. Other scenarios can be deduced by analogy and are not exemplified here.
In another implementation scenario, in order to improve the efficiency of word segmentation and part-of-speech tagging, the original text may be processed by a word segmentation and part-of-speech tagging tool to obtain a plurality of words labeled with part-of-speech categories. Specifically, the word segmentation and part-of-speech tagging tools may include, but are not limited to: ICTCLAS, NLTK, Stanford NLP, and the like, which is not limited herein.
Taking the original text "China held the XX meeting for the first time in 2008" as an example, after word segmentation and part-of-speech tagging are performed on the original text, a plurality of words labeled with part-of-speech categories can be obtained, for example:

中国/ns 2008/m 年/q 首次/d 举办/v XX会/nz

The letters above are the part-of-speech labels of the words: for example, ns represents place names, nz represents named entities other than common entities such as place names and person names, v represents verbs, m represents numerals, q represents measure words, and d represents adverbs. In addition, related words may be combined; for example, the numeral "2008" and the measure word "年" ("year") may be combined into the single word "2008年". Other original texts can be treated similarly and are not exemplified here.
Step S32: segment the words whose part-of-speech category is a preset category character by character, and select a preset proportion of words from the segmented and unsegmented words to be defaulted (i.e., omitted).
In one implementation scenario, the preset category may specifically be place names. In this case, for the aforementioned original text "China held the XX meeting for the first time in 2008", the word "中国" ("China"), labeled as a place name, may be split character by character into the two characters "中" and "国". Other situations can be deduced by analogy and are not exemplified here.
It should be noted that, in the process of obtaining the reverse text of a candidate text in the foregoing disclosed embodiment, in order to adapt to mixed granularities of characters, words, entities, and so on, word segmentation and part-of-speech tagging may likewise be performed on the candidate text to obtain a plurality of words labeled with part-of-speech categories, and the words whose part-of-speech category is the preset category may be segmented character by character, as described above; details are not repeated here. On this basis, the segmented words can be reversely ordered to obtain the reverse text of the candidate text. Taking the aforementioned candidate text "The British medical journal 'The Lancet' published online the phase II clinical trial results of the XX vaccine of the Chinese military medical institute" as an example, word segmentation and part-of-speech tagging yield a plurality of words labeled with part-of-speech categories, for example:

英国/ns 医学期刊/n 柳叶刀/nz 在线/d 发表/vd 中国/ns 军事/n 医学/n 研究院/nt XX疫苗/nz 二期/m 临床/n 试验/n 结果/n

The letters above are the part-of-speech labels of the words: for example, vd represents adverbial verbs, n represents nouns, and nt represents institutions and groups. On this basis, with place names as the preset category, the words "英国" ("Britain") and "中国" ("China"), both labeled as place names, may each be split character by character. Other situations can be deduced by analogy and are not exemplified here.
In another implementation scenario, the preset proportion may be set according to the actual application situation. For example, in a business scenario with more missing content, the preset proportion may be set slightly larger, e.g., 30%, 35%, or 40%; or, in a business scenario with less missing content, the preset proportion may be set smaller, e.g., 10%, 15%, or 20%. In addition, the preset proportion may also be set to a fixed value, such as 25%, which is not limited herein. Taking the original text "China held the XX meeting for the first time in 2008" as an example, the finally segmented and unsegmented words can be expressed as:
中 / 国 / 2008年 / 首次 / 举办 / XX会
That is, there are 6 words in total after segmentation, counting both segmented and unsegmented words. In the case that the preset proportion is 1/3, 2 of these words may be selected to be defaulted; for example, "中" and "国" may be selected, or "2008年" and "举办" ("held") may also be selected, which is not limited herein. Other original texts and preset proportions can be treated similarly and are not exemplified here.
Step S33: take the defaulted original text as the sample text, and take the positions of the defaulted words as the sample missing positions of the sample text.
Taking the original text "China held the XX meeting for the first time in 2008" as an example, when "中" and "国" are selected to be defaulted, "()() 2008年首次举办XX会" (i.e., "()() held the XX meeting for the first time in 2008") may be taken as the sample text, with the position of the defaulted character "中" as one sample missing position of the sample text and the position of the defaulted character "国" as another. Other situations can be deduced by analogy and are not exemplified here.
In one implementation scenario, to facilitate subsequent training with the sample text, the defaulted original text and the defaulted words may also be taken together as the sample text. Taking the original text "China held the XX meeting for the first time in 2008" as an example, when "中" and "国" are selected to be defaulted, the defaulted original text "()() 2008年首次举办XX会" and the defaulted words "中" and "国" may be taken as the sample text. Other situations can be deduced by analogy and are not exemplified here.
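A sketch of steps S32 and S33 under the assumptions above; the tagged input, the preset category ns, and the proportion 1/3 are illustrative:

```python
# A minimal sketch of steps S32-S33: character-by-character segmentation of
# preset-category words, then defaulting a preset proportion of the resulting
# tokens; the tagged input and the proportion 1/3 are illustrative assumptions.
import random

def build_sample(tagged, preset_category="ns", proportion=1/3):
    # Split words of the preset category (e.g., place names) character by character.
    tokens = []
    for word, flag in tagged:
        tokens.extend(list(word) if flag == preset_category else [word])
    # Select a preset proportion of tokens to default (i.e., omit).
    num_default = max(1, round(len(tokens) * proportion))
    default_positions = sorted(random.sample(range(len(tokens)), num_default))
    sample_text = "".join("()" if i in default_positions else tok
                          for i, tok in enumerate(tokens))
    missing_words = [tokens[i] for i in default_positions]
    return sample_text, default_positions, missing_words

tagged = [("中国", "ns"), ("2008年", "m"), ("首次", "d"), ("举办", "v"), ("XX会", "nz")]
print(build_sample(tagged))  # e.g., ('()()2008年首次举办XX会', [0, 1], ['中', '国'])
```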
Different from the foregoing embodiments, word segmentation and part-of-speech tagging are performed on the original text to obtain a plurality of words labeled with part-of-speech categories, the words whose part-of-speech category is the preset category are segmented character by character, a preset proportion of words are selected from the segmented and unsegmented words to be defaulted, the defaulted original text is taken as the sample text, and the positions of the defaulted words are taken as the sample missing positions of the sample text. In this way, sample texts whose missing content contains mixed granularities of characters, words, entities, and so on can be constructed, so that the adaptability of the subsequently trained prediction networks to texts to be completed with missing characters, words, and entities of mixed granularity can be improved, which is beneficial to improving the accuracy of subsequent completion prediction.
Referring to fig. 4, fig. 4 is a flow chart illustrating an embodiment of step S13 in fig. 1; specifically, it is a flow chart of an embodiment of performing completion prediction on the text to be completed in the case that the source condition includes the source being unknown. Specifically, the method may include the following steps:
step S41: and respectively filling default symbols with preset values at each missing position of the text to be filled in to obtain the text to be processed.
In one implementation scenario, as described in the foregoing disclosed embodiments, in the case that the source condition includes the source being unknown, the completion prediction may be performed using the first prediction network, which may specifically be obtained by training a preset neural network (e.g., BERT) with the sample texts. On this basis, the number of missing characters at the sample missing positions in each sample text may be counted, and for several candidate values, the proportion of sample missing positions whose number of missing characters is not greater than each candidate value may be counted respectively, so that the smallest candidate value is selected as the preset number from the at least one candidate value whose proportion is greater than a preset percentage. Specifically, the candidate values may include, but are not limited to: 1, 2, 3, 4, 5, 6, and so on, which is not limited herein. In addition, the preset percentage may be 90%, 92%, 95%, 97%, 99%, and so on, which is not limited herein. In this way, the preset number can cover most scenarios while keeping the number of default symbols as small as possible, which can further improve the efficiency of subsequent character prediction at each missing position.
In one specific implementation scenario, suppose that for N sample texts, 20 sample missing positions are counted in total, among which: 1 sample missing position has 1 missing character, 3 have 2 missing characters, 3 have 3 missing characters, 12 have 4 missing characters, and 1 has 5 missing characters. Then, for the candidate values 1, 2, 3, 4, and 5: the proportion of sample missing positions whose number of missing characters is not greater than the candidate value 1 is 1/20 = 5%; not greater than the candidate value 2, 4/20 = 20%; not greater than the candidate value 3, 7/20 = 35%; not greater than the candidate value 4, 19/20 = 95%; and not greater than the candidate value 5, 20/20 = 100%. In the case that the preset percentage is 90%, the smallest candidate value, i.e., 4, may be selected as the preset number from the candidate value 4 (proportion 95%) and the candidate value 5 (proportion 100%). Other situations can be deduced by analogy and are not exemplified here.
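This selection rule can be sketched directly; the missing-character counts below mirror the worked example above:

```python
# A minimal sketch of selecting the preset number from candidate values; the
# missing-character counts mirror the worked example above.
missing_counts = [1] + [2] * 3 + [3] * 3 + [4] * 12 + [5] * 1  # 20 sample missing positions
candidate_values = [1, 2, 3, 4, 5]
preset_percentage = 0.90

def pick_preset_number(counts, candidates, percentage):
    eligible = []
    for c in candidates:
        proportion = sum(1 for n in counts if n <= c) / len(counts)
        if proportion > percentage:
            eligible.append(c)
    # Smallest candidate value whose coverage exceeds the preset percentage.
    return min(eligible)

print(pick_preset_number(missing_counts, candidate_values, preset_percentage))  # -> 4
```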
In another specific implementation scenario, as in the foregoing disclosed embodiments, the default symbol may be set to [mask]. In addition, to facilitate processing by the first prediction network, the beginning and end of the text to be completed may be supplemented with [CLS] and [SEP] as the start flag and the end flag, respectively. Taking the aforementioned text to be completed "The World Intellectual Property Organization headquarters is located in ()" as an example, it may be processed into "[CLS] The World Intellectual Property Organization headquarters is located in [mask][mask][mask][mask] [SEP]". Other cases can be deduced by analogy and are not exemplified here.
In another specific implementation scenario, to facilitate processing by the first prediction network, word segmentation and part-of-speech tagging may also be performed on the text to be completed to obtain a plurality of words labeled with part-of-speech categories, and the words whose part-of-speech category is the preset category may be segmented character by character to obtain the text to be processed. Taking the aforementioned text to be completed "The World Intellectual Property Organization headquarters is located in ()" as an example, the related steps in the foregoing disclosed embodiments may be used to perform word segmentation, part-of-speech tagging, and character-by-character segmentation, finally obtaining the following text to be processed:
[CLS] The World Intellectual Property Organization headquarters is located in [mask][mask][mask][mask] [SEP]
In the case that the text to be completed is another text, the corresponding text to be processed can be obtained by analogy and is not exemplified here.
In still another implementation scenario, to facilitate processing by the first prediction network, each word in the text to be processed may also be position-coded before the text to be processed is fed into the first prediction network. Still taking the aforementioned text to be completed "The World Intellectual Property Organization headquarters is located in ()" as an example, the position-coded text to be processed may be expressed, for example, by pairing each token with its position index, from [CLS] at position 0 through [SEP] at the last position. In the case that the text to be completed is another text, the position-coded text to be processed can be obtained by analogy and is not exemplified here.
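A sketch of preparing such a text to be processed, assuming the tokenizer from the earlier sketch; the preset number 4 and the example sentence are illustrative:

```python
# A minimal sketch of step S41, assuming the tokenizer from the earlier sketch;
# the preset number 4 and the example sentence are illustrative assumptions.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

text_to_complete = "世界知识产权组织总部设在()"
preset_number = 4

# Add the preset number of default symbols at each missing position.
text_to_process = text_to_complete.replace("()", "[MASK]" * preset_number)

# The tokenizer adds the start flag [CLS] and the end flag [SEP]; position
# coding (position ids 0..L-1) is derived internally by BERT-style models.
inputs = tokenizer(text_to_process, return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))
# Of the form: ['[CLS]', '世', '界', ..., '[MASK]', '[MASK]', '[MASK]', '[MASK]', '[SEP]']
```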
Step S42: predict the text to be processed several times for each missing position to obtain the predicted characters of the default symbols at the sequence positions corresponding to the numbers of predictions, and obtain candidate words for the missing position based on the predicted characters of the several predictions.
Specifically, when the i-th prediction is performed on the text to be processed, at least one predicted character of the default symbol at the i-th sequence position and the prediction probability value of each predicted character can be obtained. On this basis, the default symbol at the i-th sequence position may be replaced with each of the at least one predicted character respectively to obtain at least one new text to be processed, and it is then determined whether a preset end condition is met. If the preset end condition is not met, i is incremented by 1 and the step of performing the i-th prediction on the text to be processed and the subsequent steps are repeated; if the preset end condition is met, candidate words for the missing position are obtained based on the prediction probability values of the predicted characters in each newly obtained text to be processed. In this way, each prediction can rely on the previous predictions, so the relevance among the predicted characters obtained by character-by-character prediction can be improved, thereby improving the accuracy of the predicted characters.
In one implementation scenario, the preset end condition may specifically be set to either of the following: the predicted character is a preset end character; i is not less than the preset number. Specifically, the preset end character may be set according to the actual application situation, which is not limited herein; in addition, the specific meaning of the preset number may refer to the related description in the foregoing disclosed embodiments and is not repeated here.
In another implementation scenario, when the i-th prediction is performed on the text to be processed, the text to be processed may specifically be fed into the first prediction network to predict at least one predicted character of the default symbol at the i-th sequence position and the prediction probability value of each predicted character. Specifically, after the text to be processed is fed into the first prediction network for the i-th prediction, the semantic representation v of the default symbol at the i-th sequence position can be obtained, and the probability value of each character in a preset vocabulary can be obtained based on the semantic representation W of the preset vocabulary, which can be expressed as:
p = softmax(v · W) …… (4)
In the above formula (4), p represents the probability values of the characters in the preset vocabulary, v represents the semantic representation of the default symbol at the i-th sequence position, W represents the semantic representation of the preset vocabulary, and softmax represents normalization. The preset vocabulary contains the semantic representations of a plurality of (e.g., 30000) common characters, which may specifically be obtained during the training of the first prediction network; for example, BERT has semantic representations of about 30000 different tokens. On this basis, at least one character (e.g., 2 characters) may be selected as the predicted characters at the i-th sequence position in descending order of probability value, and the corresponding probability values may be taken as the prediction probability values of the predicted characters.
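A sketch of formula (4) and the top-k selection; the dimensions here are hypothetical illustrations:

```python
# A minimal sketch of formula (4): the probability of each character in the
# preset vocabulary from the dot product of the semantic representation v of
# the i-th default symbol with the vocabulary representation W; the dimensions
# are hypothetical illustrations.
import torch

vocabulary_size, hidden_size = 30000, 768
v = torch.randn(hidden_size)                   # semantic representation of the i-th [mask]
W = torch.randn(vocabulary_size, hidden_size)  # semantic representation of the preset vocabulary

p = torch.softmax(W @ v, dim=0)                # formula (4): p = softmax(v . W)
pred_probs, pred_ids = torch.topk(p, k=2)      # top-2 predicted characters and
                                               # their prediction probability values
```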
In yet another implementation scenario, referring to fig. 5, fig. 5 is a state diagram of an embodiment of the character-by-character prediction process. As shown in fig. 5, after the text to be completed "The World Intellectual Property Organization headquarters is located in ()" (in the original Chinese, "世界知识产权组织总部设在()") is processed into the text to be processed, the text to be processed is fed into the first prediction network. In the 1st prediction on the text to be processed, the semantic representation v of the default symbol [mask] at the 1st sequence position is obtained, and a dot product operation is performed between v and the semantic representation of each character in the preset vocabulary W (w_1, w_2, w_3, …, w_m) to obtain the probability value of each character in the preset vocabulary; the characters in the preset vocabulary are then sorted in descending order of probability value: "日" (i.e., w_1), "纽" (i.e., w_2), "东" (i.e., w_3), …, "北" (i.e., w_m). The characters at the top preset positions (e.g., the top 2), namely "日" and "纽", are selected as the predicted characters of the default symbol at the 1st sequence position in the 1st prediction, with the probability value of "日" taken as its prediction probability value and the probability value of "纽" taken as its prediction probability value. On this basis, the predicted character "日" replaces the default symbol at the 1st sequence position to obtain a new text to be processed, which for convenience of description may be denoted as text to be processed 1:
And the predicted word 'button' is replaced by the 1 st order default symbol [ mask ] to obtain a new text to be processed, which can be recorded as the text to be processed 2:
If the preset ending condition is not currently met, i is increased by 1 (i.e., i is 2 at this time), and so on. In the 2nd prediction, the two new texts to be processed may each be sent to the first prediction network: text to be processed 1 undergoes a process similar to the 1st prediction, yielding the predicted words "in" and "book" for the 2nd-sequence-position default symbol [mask], and text to be processed 2 likewise yields the predicted words "about" and "gloss". Further, the 2nd-sequence-position default symbol [mask] of text to be processed 1 is replaced with the predicted words "in" and "book" respectively, giving 2 new texts to be processed on the basis of text to be processed 1; similarly, the 2nd-sequence-position default symbol [mask] of text to be processed 2 is replaced with the predicted words "about" and "gloss" respectively, giving 2 new texts to be processed on the basis of text to be processed 2. If the preset end condition is still not met, i is increased by 1 (i.e., i is 3) and the above procedure is re-executed. The predicted words and prediction probability values obtained by each prediction are finally as shown in table 1, which is a summary table of the predicted words and prediction probability values of each prediction. As shown in table 1, on the basis of the 1st predicted word "day", the 2nd prediction yields the predicted words "in" and "book"; on the basis of the 2nd predicted word "in", the 3rd prediction yields the predicted word "tile" (on this basis the 4th prediction ends, not shown in table 1); on the basis of the 2nd predicted word "book", the 3rd prediction ends (i.e., the predicted word is empty). On the basis of the 1st predicted word "new", the 2nd prediction yields the predicted words "about" and "gloss"; on the basis of the 2nd predicted word "about", the 3rd prediction ends (i.e., the predicted word is empty); on the basis of the 2nd predicted word "gloss", the 3rd prediction yields the predicted words "blue" and "west" (on this basis the 4th prediction ends, not shown in table 1).
TABLE 1 Summary of the predicted words and prediction probability values of each prediction
It should be noted that the predicted words and prediction probability values shown in table 1 are only one possible situation in practical application and do not limit other possible situations; they may be determined according to the actual application, which is not limited herein.
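For illustration, the word-by-word expansion of fig. 5 and table 1 may be sketched as follows (a simplified Python loop; `predict` stands in for the first prediction network and its interface is an assumption):

```python
def expand_word_by_word(predict, preset_value, beam=2):
    """Fill the default symbols left to right; each prediction keeps the first
    `beam` predicted words, and a [PAD] prediction ends that branch."""
    active, finished = [([], [])], []      # (predicted words, probability values)
    for i in range(preset_value):          # i-th prediction
        next_active = []
        for words, probs in active:
            for word, p in predict(words, i)[:beam]:
                if word == "[PAD]":        # preset ending character reached
                    finished.append((words, probs))
                else:
                    next_active.append((words + [word], probs + [p]))
        active = next_active
    return finished + active               # the newest texts to be processed
```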
In still another implementation scenario, when the preset end condition is met, for each newest obtained text to be processed, the average of the prediction probability values of its predicted words at the missing position is counted; the texts to be processed in the first preset number of sequence positions are selected in order of average probability value from high to low, and the combination of the predicted words at the missing position in each selected text to be processed is taken as a candidate word of the missing position. Since the average probability value can represent the overall accuracy of the predicted words in a text to be processed, selecting by average probability value yields candidate words with higher overall accuracy, which helps improve the accuracy of the candidate words and thus the accuracy of the finally obtained complete text.
In a specific implementation scenario, the preset sequence position may be set according to actual application needs. For example, to increase the speed of the subsequent joint completion prediction using the candidate words of each missing position, the preset sequence position may be set slightly smaller, e.g., to 2 or 3; alternatively, to improve the robustness of the joint completion using the candidate words of each missing position, it may be set slightly larger, e.g., to 4 or 5, which is not limited herein.
In another specific implementation scenario, still taking the foregoing text to be completed "world intellectual property organization headquarters is set at ()" as an example, please refer to table 1 in combination: the predicted words at the missing position in one text to be processed are "day", "in" and "tile", with an average prediction probability value counted as 0.9; in another they are "day" and "book", with an average of 0.8; in another they are "new" and "about", with an average of 0.875; and in the remaining texts to be processed, whose predicted words include "blue" and "west", the averages are lower (e.g., 0.78 for the latter). Therefore, with the preset sequence position set to 2, the texts to be processed whose average probability values rank in the first 2 positions can be selected, and the combinations of the predicted words at the missing position in the selected texts to be processed, namely "solar tile" and "New York", are taken as the candidate words of the missing position. Other situations can be deduced by analogy and are not exemplified here.
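The selection by average probability value may be sketched as follows (Python; the branch data are hypothetical values chosen to echo the example above):

```python
def select_candidates(branches, preset_rank=2):
    # average the prediction probability values of each branch's predicted words,
    # then keep the combinations whose averages rank in the first preset_rank positions
    scored = [("".join(w), sum(p) / len(p)) for w, p in branches if p]
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:preset_rank]

branches = [(["day", "in", "tile"], [0.95, 0.88, 0.87]),   # hypothetical values
            (["day", "book"], [0.85, 0.75]),
            (["new", "about"], [0.90, 0.85])]
print(select_candidates(branches))   # [('dayintile', 0.90), ('newabout', 0.875)]
```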
It should be noted that, in the case that the text to be completed includes a plurality of missing positions, the completion prediction may be performed on each missing position by using the above manner, so as to finally obtain a candidate word of each missing position, and the specific process may refer to the foregoing description and will not be repeated herein.
Different from the foregoing embodiment, in the case that the source situation includes an unknown source, a preset number of default symbols are respectively appended at each missing position of the text to be completed to obtain the text to be processed; for each missing position, the text to be processed is predicted several times to obtain the predicted words of the default symbols at the sequence positions corresponding to the number of predictions, and candidate words of the missing position are then obtained based on these predicted words. Text completion can thus be performed without relying on manpower, which improves the efficiency and reduces the cost of text completion; in addition, when the source is unknown, word-by-word prediction at each missing position helps improve the prediction precision and thus the accuracy of text completion.
Referring to fig. 6, fig. 6 is a flow chart illustrating an embodiment of a first predictive network training process. Specifically, the method may include the steps of:
Step S61: and respectively supplementing default symbols with preset values at the missing positions of each sample of the sample text to obtain a sample to-be-processed text.
The specific description of step S41 in the foregoing disclosed embodiment may be referred to, and will not be repeated here. In addition, the process of obtaining the sample text may refer specifically to the foregoing disclosure embodiments and the related description in fig. 3 of the specification, which are not repeated herein.
Step S62: and predicting the text to be processed of the sample for a plurality of times by utilizing a first prediction network aiming at each sample missing position to obtain sample prediction characters and sample prediction probability values of default symbols at the corresponding sequence positions of the prediction times.
Specifically, for each sample missing position, the first prediction network may be used to perform the i-th prediction on the sample text to be processed, obtaining a sample predicted word of the default symbol at the i-th sequence position and its sample prediction probability value; the default symbol at the i-th sequence position is then replaced with that sample predicted word to obtain a new sample text to be processed. If the preset ending condition is not met, i is increased by 1 and the i-th prediction step and subsequent steps are re-executed; if the preset ending condition is met, the prediction for the current sample missing position ends. Reference may be made to the specific description of step S42 in the foregoing disclosed embodiment, which is not repeated here.
Step S63: and acquiring a first loss value of the first prediction network based on the sample prediction probability value of each sample prediction word in the sample candidate words at each sample missing position.
In order to facilitate training with the sample text, the default original text and the default words may be used as the sample text. In addition, since a preset number of default symbols are respectively appended at each sample missing position of the sample text to obtain the sample text to be processed, several placeholders (e.g., [PAD]) may be appended to each default word so that the total number of characters of each default word plus its appended placeholders equals the preset value. For example, taking the foregoing original text "China first held XX in 2008" as an example, in the case of selecting "middle" and "country" for default, the default original text "()() first held XX in 2008" and the default words "middle" and "country" may be taken as the sample text. In the training process, with the preset value set to 4, 4 default symbols may be respectively appended at each default position, giving the sample text to be processed "[CLS][mask][mask][mask][mask][mask][mask][mask][mask] first held XX in 2008 [SEP]"; 3 placeholders are appended to the default word "middle", converting it into "middle[PAD][PAD][PAD]", and the default word "country" is similarly converted into "country[PAD][PAD][PAD]". Other situations can be deduced by analogy and are not exemplified here.
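Such a training sample might be built as sketched below (Python; `build_sample` and the treatment of each default word as one atomic token are illustrative assumptions):

```python
def build_sample(default_original, default_words, preset_value=4):
    # each missing position "()" receives preset_value default symbols [mask];
    # each default word is padded with [PAD] up to preset_value tokens in total
    body = default_original.replace("()", "[mask]" * preset_value)
    text = "[CLS]" + body + "[SEP]"
    labels = [[w] + ["[PAD]"] * (preset_value - 1) for w in default_words]
    return text, labels

text, labels = build_sample("()() first held XX in 2008", ["middle", "country"])
# text   -> "[CLS]" + 8 x "[mask]" + " first held XX in 2008" + "[SEP]"
# labels -> [['middle', '[PAD]', '[PAD]', '[PAD]'],
#            ['country', '[PAD]', '[PAD]', '[PAD]']]
```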
In one implementation scenario, the first loss value may be calculated using a cross entropy loss function, which may specifically be expressed as:
loss1=-∑_{i=1}^{M}∑_{j=1}^{N} y_ij·log(p_ij)……(5)
In the above formula (5), M represents the number of sample missing positions in the sample text, N represents the preset value, y_ij represents the j-th character in the default word corresponding to the i-th missing position, and p_ij represents the sample prediction probability value of the sample predicted word obtained by the j-th prediction at the i-th missing position.
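Formula (5) may be computed as sketched below (Python; the layout of `p` and `y` is an assumption for illustration):

```python
import numpy as np

def first_loss(p, y):
    # p[i][j]: predicted distribution over the preset vocabulary for the j-th
    # token of the i-th missing position; y[i][j]: index of the ground-truth
    # token (a character or [PAD]); M missing positions, N = preset value
    M, N = len(y), len(y[0])
    return -sum(np.log(p[i][j][y[i][j]]) for i in range(M) for j in range(N))
```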
In addition, when the default words are padded with the placeholder [PAD] as described above, the preset ending character in the preset end condition may specifically be set to the placeholder [PAD]. When the default words are padded with another character, the preset ending character may be set similarly, which is not exemplified here.
Step S64: and adjusting network parameters of the first prediction network by using the first loss value.
Specifically, the network parameters of the first prediction network may be adjusted using the first loss value by stochastic gradient descent (Stochastic Gradient Descent, SGD), batch gradient descent (Batch Gradient Descent, BGD), mini-batch gradient descent (Mini-Batch Gradient Descent, MBGD), or the like. Batch gradient descent uses all samples for each parameter update; stochastic gradient descent uses one sample per update; mini-batch gradient descent uses a batch of samples per update, which is not described in detail herein.
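The three update schemes differ only in how many samples feed each update, as the sketch below shows (plain Python; `grad_fn` is an assumed callback returning gradients of the first loss value with respect to each parameter):

```python
def train_epoch(params, samples, grad_fn, lr=0.01, batch_size=None):
    # batch_size=None -> batch gradient descent (all samples per update);
    # batch_size=1    -> stochastic gradient descent (one sample per update);
    # batch_size=32   -> mini-batch gradient descent (a batch per update)
    step = len(samples) if batch_size is None else batch_size
    for start in range(0, len(samples), step):
        grads = grad_fn(params, samples[start:start + step])
        for name in params:
            params[name] -= lr * grads[name]   # gradient descent update
    return params
```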
Different from the foregoing embodiment, a preset number of default symbols are respectively appended at each sample missing position of the sample text to obtain the sample text to be processed; for each sample missing position, the sample text to be processed is predicted several times to obtain the sample predicted words and sample prediction probability values of the default symbols at the sequence positions corresponding to the number of predictions, a first loss value of the first prediction network is obtained based on the sample prediction probability value of each sample predicted word in the sample candidate words at each sample missing position, and the network parameters of the first prediction network are adjusted based on the first loss value. Word-by-word prediction is thus performed at each sample missing position and the network parameters are adjusted with the statistically obtained first loss value, which helps improve the prediction accuracy of the first prediction network.
Referring to fig. 7, fig. 7 is a flowchart illustrating another embodiment of step S13 in fig. 1. Specifically, the embodiment of the disclosure is a schematic flow diagram of an embodiment of performing completion prediction on the text to be completed in the case that the source situation includes a source from the first text library.
Specifically, the method may include the steps of:
Step S71: and carrying out semantic extraction on the text to be complemented to obtain individual semantic representations of each missing position.
In one implementation scenario, in order to facilitate semantic extraction, a default symbol may be respectively appended at each missing position of the text to be completed to obtain the text to be processed, so that semantic extraction is performed on the text to be processed to obtain the individual semantic representation of each default symbol, which is taken as the individual semantic representation of the missing position where the default symbol is located. In this way, subsequent prediction of candidate words based on the individual semantic representation of a single default symbol is facilitated, so the missing position is not limited to a missing character, word or entity, which can facilitate prediction at mixed granularity of characters, words, entities and the like.
In a specific implementation scenario, the default symbol may be set according to the actual application, for example, as "[mask]" per the foregoing description of the disclosed embodiments, which is not limited herein.
In another specific implementation scenario, [CLS] and [SEP] may also be respectively appended as a start flag and an end flag at the beginning and end of the text to be completed. The text to be completed "world intellectual property organization headquarters is set at ()" may be processed as "[CLS] world intellectual property organization headquarters is set at [mask] [SEP]"; other cases can be deduced by analogy and are not exemplified here.
In another specific implementation scenario, in order to facilitate the prediction of the mixed granularity of characters, words, entities and the like, word segmentation and part-of-speech tagging can be further performed on the text to be complemented, so as to obtain a plurality of words marked with part-of-speech categories, and the words with the part-of-speech categories being preset categories are segmented word by word, so as to obtain the text to be processed. The specific manner of word segmentation and part-of-speech tagging can refer to the related descriptions in the foregoing disclosed embodiments, and will not be repeated herein. In addition, the preset category may be set as a place name, and specific reference may be made to the related description in the foregoing disclosed embodiments, which is not repeated herein.
Taking the above-mentioned text to be completed "the world intellectual property organization headquarters are set at ()", the related steps in the above-mentioned disclosed embodiments may be adopted to perform the filling of default symbol, start flag and end flag, word segmentation, part-of-speech tagging and word-by-word segmentation, so as to finally obtain the following text to be processed:
[CLS] the world intellectual property organization headquarters is set at [mask] [SEP]
In the case that the text to be completed is other text, the corresponding text to be processed can be obtained by analogy, and the examples are not given here.
In one implementation scenario, where the source situation includes a source from the first text library, the completion prediction may be performed using a second prediction network, which may specifically be obtained by training a preset neural network (e.g., BERT) with sample text, as described in the previously disclosed embodiments. On this basis, the text to be processed can be sent to the second prediction network to obtain the individual semantic representation of each missing position. In addition, the training process of the second prediction network may specifically refer to the following disclosed embodiments and is not described herein.
In a specific implementation scenario, in order to facilitate processing by the second prediction network, each word in the text to be processed may further be position-encoded before the text to be processed is sent to the second prediction network; still taking the above text to be completed "world intellectual property organization headquarters is set at ()" as an example, each word of the corresponding text to be processed is assigned a digital sequence position in order. In the case that the text to be completed is other text, the position-encoded text to be processed can be obtained by analogy, which is not exemplified here.
Step S72: for each missing position, obtaining at least one candidate word of the missing position by utilizing the individual semantic representation of the missing position and the word semantic representation of each reference word.
In an embodiment of the present disclosure, the first text library includes at least one reference text, and the reference text includes at least one reference word. The setting manner of the first text library may specifically refer to the related description in the foregoing disclosed embodiments, which is not repeated herein.
In one implementation scenario, word segmentation and part-of-speech tagging may be performed on the at least one reference text respectively to obtain a plurality of words marked with part-of-speech categories; the words whose part-of-speech category is a preset category are segmented word by word, a plurality of reference words are obtained from the segmented and unsegmented words, and semantic extraction is performed on the plurality of reference words respectively to obtain the word semantic representation of each reference word. In this way, obtaining reference words of mixed granularity including characters, words, entities and the like is facilitated, which further facilitates subsequent prediction at mixed granularity.
In a specific implementation scenario, the specific processes of word segmentation, part-of-speech tagging, and word-by-word segmentation may refer to the related descriptions in the foregoing disclosed embodiments, which are not described herein again. In addition, the preset category may be set according to actual application requirements, for example, may be set as a place name, and specifically, reference may be made to the related description in the foregoing disclosed embodiment, which is not repeated herein.
In another specific implementation scenario, in order to further construct reference words containing mixed granularity of characters, words, entities and the like, the words whose part-of-speech category is a preset category may specifically be segmented word by word, and for a word so segmented, both the pre-segmentation word and the characters obtained after word-by-word segmentation may be taken together as reference words. Taking the word "Beijing" as an example, its part-of-speech category is place name; in the case that the preset category is place name, "Beijing" may be segmented word by word into its constituent characters, so that the pre-segmentation word "Beijing" and the characters obtained after segmentation are all taken as reference words. Other situations can be deduced by analogy and are not exemplified here.
In yet another specific implementation scenario, a preset word vector training tool (e.g., word2vec, GloVe, etc.) may specifically be used to perform word vector training on the reference words, so as to extract the word semantic representation of each reference word.
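For example, word semantic representations might be trained with gensim's word2vec roughly as follows (a sketch with a toy corpus; in practice the corpus would be the segmented reference texts of the first text library):

```python
from gensim.models import Word2Vec

# toy corpus: each entry is one segmented reference text
corpus = [["world", "intellectual", "property", "organization", "headquarters"],
          ["Beijing", "holds", "the", "conference"]]

model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, sg=1)
v_beijing = model.wv["Beijing"]   # word semantic representation of a reference word
```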
In another specific implementation scenario, the words whose frequency of occurrence is higher than a preset frequency may be screened from the segmented and unsegmented words to obtain the plurality of reference words. Specifically, the frequency of occurrence refers to the frequency of occurrence in the first text library. For example, if the first text library contains 100,000 words in total, among which the word "Beijing" occurs 100 times, the frequency of occurrence of "Beijing" is 0.1%; if the word "crevice" occurs only once, its frequency of occurrence is 0.001%; other cases can be deduced by analogy and are not exemplified here. In addition, the preset frequency may be set according to practical application requirements, e.g., to 0.01% or 0.05%, which is not limited herein. Screening the words whose frequency of occurrence is higher than the preset frequency in this way can further reduce the scale of the reference words.
In another specific implementation scenario, the words whose part-of-speech category meets a preset rejection condition may be removed from the segmented and unsegmented words to obtain the plurality of reference words. Specifically, the preset rejection condition may be set to include: the part-of-speech category is a stop word or a special symbol. Stop words are function words in human language that carry no actual meaning, such as "the" and "is" in English, or the corresponding particles in Chinese; special symbols may include, but are not limited to: punctuation marks, unit symbols (e.g., "kg"), number symbols (e.g., "(1)"), tab symbols, currency symbols, etc., which are not limited herein. Removing the words whose part-of-speech category meets the preset rejection condition in this way can further reduce the scale of the reference words.
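Both screening rules may be sketched together as follows (Python; the stop-word and special-symbol sets are illustrative stand-ins, not an exhaustive list):

```python
from collections import Counter

STOP_WORDS = {"the", "is"}            # function words without actual meaning
SPECIAL_CHARS = set(",.;:()[]#$\t")   # punctuation, number, tab, currency symbols

def filter_reference_words(words, preset_frequency=0.0001):
    # keep words above the preset frequency of occurrence and drop those whose
    # category meets the preset rejection condition (stop word / special symbol)
    total = len(words)
    counts = Counter(words)
    return [w for w, c in counts.items()
            if c / total > preset_frequency
            and w not in STOP_WORDS
            and not all(ch in SPECIAL_CHARS for ch in w)]
```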
In one implementation scenario, for each missing position, the similarity between the individual semantic representation of the missing position and the word semantic representation of each reference word may be computed, and the reference words in the first preset number of sequence positions are selected as candidate words of the missing position in order of similarity from high to low. Specifically, the similarity between the individual semantic representation and the word semantic representation may be a cosine similarity. In addition, the preset sequence position may be set according to actual application requirements: for example, to increase the speed of the subsequent joint completion prediction using the candidate words of each missing position, it may be set slightly smaller, e.g., to 2 or 3; alternatively, to improve the robustness of the joint completion, it may be set slightly larger, e.g., to 4 or 5, which is not limited herein.
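Cosine-similarity selection may be sketched as follows (Python; function and variable names are illustrative):

```python
import numpy as np

def rank_by_cosine(h, reference_words, word_vectors, preset_rank=2):
    # similarity between the individual semantic representation h of a missing
    # position and the word semantic representation of each reference word
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    sims = sorted(((w, cos(h, v)) for w, v in zip(reference_words, word_vectors)),
                  key=lambda x: x[1], reverse=True)   # high similarity first
    return sims[:preset_rank]
```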
In another implementation scenario, the predicted probability value for each reference term may be derived using the individual semantic representation and the term semantic representation of each reference term. Specifically, the predicted probability value of a reference word may represent the likelihood that the word missing at the missing position is the reference word, and the greater the predicted probability value, the greater the likelihood that the word missing at the missing position is the reference word. On the basis, the reference words positioned in the pre-set sequence position can be selected as candidate words of the missing position according to the sequence from the big to the small of the predicted probability value. The setting manner of the preset sequence bit can be referred to the above related description, and will not be repeated here. According to the method, the predicted probability value of each reference word is obtained by utilizing the individual semantic representation and the word semantic representation of each reference word, so that the reference word positioned at the front preset sequence is selected to serve as the candidate word of the missing position according to the sequence from the big to the small of the predicted probability value, the reference word can be selected to serve as the candidate word of the missing position based on the individual semantic representation and the word semantic representation, and the accuracy of the candidate word can be improved.
In a specific implementation scenario, as described above, the text to be processed is sent to the second prediction network to obtain the individual semantic representation of the missing position; for convenience of description, denote it h, and denote the set of word semantic representations of the reference words W. The predicted probability value p may then be calculated by the following formula:
p=softmax(h·W)……(6)
In the above formula (6), p represents a predicted probability value of each reference word, h represents an individual semantic representation of the missing position, W represents a word semantic representation of each reference word, and · represents a dot product operation, softmax represents normalization processing.
In another embodiment, referring to fig. 8 in combination, fig. 8 is a schematic diagram illustrating an embodiment of a prediction process using reference words. As shown in fig. 8, after the text to be completed "world intellectual property organization headquarters is set at ()" is processed into the above text to be processed and sent to the second prediction network, the individual semantic representation h of the missing position can be obtained; the m reference words (W1, W2, W3, …, Wm) are vectorized into m word semantic representations (v1, v2, v3, …, vm), a dot product (dot) operation is performed between h and the m word semantic representations to obtain the predicted probability values (p1, p2, p3, …, pm) of the reference words, and the reference words are sorted (sort) in order of predicted probability value from high to low; finally the reference words in the first preset number of sequence positions (e.g., the first 2), such as "solar tile" and "New York", are selected as candidate words of the missing position. It should be noted that fig. 8 shows only one possible situation in practical application and does not limit other possible situations, which may be determined according to the actual application and are not limited herein.
It should be noted that, in the case that the text to be completed includes a plurality of missing positions, the completion prediction may be performed on each missing position by using the above manner, so as to finally obtain a candidate word of each missing position, and the specific process may refer to the foregoing description and will not be repeated herein.
Different from the foregoing embodiment, semantic extraction is performed on the text to be completed to obtain the individual semantic representation of each missing position, so that for each missing position at least one candidate word is obtained directly from the individual semantic representation of the missing position and the word semantic representations of the reference words, which can improve the accuracy and efficiency of the completion prediction. In addition, because candidate words are directly predicted for the missing positions, the missing positions are not limited to missing characters, words or entities, which can facilitate prediction at mixed granularity of characters, words, entities and the like.
Referring to fig. 9, fig. 9 is a flow chart illustrating an embodiment of a second predictive network training process. Specifically, the method may include the steps of:
step S91: and carrying out semantic extraction on the sample text by using a second prediction network to obtain sample individual semantic representations of the missing positions of the samples.
In an embodiment of the present disclosure, the sample text includes at least one sample deletion location. The process of obtaining the sample text may refer specifically to the foregoing disclosed embodiments and the related description in fig. 3 of the specification, which are not repeated herein.
The method for performing semantic extraction on the sample text may specifically refer to the description related to step S71 in the foregoing disclosed embodiment, which is not described herein again.
Step S92: and aiming at each sample missing position, obtaining a sample prediction probability value of each reference word by using the sample individual semantic representation of the sample missing position and the word semantic representation of each reference word.
The specific reference may be made to the description related to step S72 in the foregoing disclosed embodiment, and the description is omitted here.
Step S93: and obtaining a second loss value of the second prediction network based on the sample prediction probability value of each reference word at each sample missing position.
In particular, the second loss value may be calculated using a cross entropy loss function, which may specifically be expressed as:
loss2=-∑_{i=1}^{M} y_i·log(p_i)……(7)
In the above formula (7), M represents the number of sample missing positions in the sample text, y_i represents the default word corresponding to the i-th missing position in the sample text, and p_i represents the sample prediction probability value of each reference word obtained by prediction at the i-th missing position in the sample text.
Step S94: and adjusting network parameters of the second prediction network by using the second loss value.
Specifically, the network parameters of the second prediction network may be adjusted using the second loss value by stochastic gradient descent (Stochastic Gradient Descent, SGD), batch gradient descent (Batch Gradient Descent, BGD), mini-batch gradient descent (Mini-Batch Gradient Descent, MBGD), or the like; the meaning of each scheme may refer to the foregoing description and is not repeated here.
Different from the foregoing embodiment, semantic extraction is performed on the sample text by using the second prediction network to obtain the sample individual semantic representation of each sample missing position; for each sample missing position, the sample prediction probability value of each reference word is obtained using the sample individual semantic representation of the sample missing position and the word semantic representations of the reference words, and a second loss value of the second prediction network is obtained based on the sample prediction probability values of the reference words at each sample missing position; on this basis, the network parameters of the second prediction network are adjusted using the second loss value. Word prediction at the sample missing positions is thus assisted by the word semantic representations of the reference words, and the network parameters are adjusted with the statistically obtained second loss value, which can help improve the accuracy of the second prediction network. In addition, because the whole sample missing position is predicted directly, the sample missing position is not limited to a missing character, word or entity, which can facilitate prediction at mixed granularity of characters, words, entities and the like.
Referring to fig. 10, fig. 10 is a flowchart illustrating step S13 in fig. 1 according to yet another embodiment. Specifically, the embodiment of the disclosure is a schematic flow diagram of an embodiment of performing completion prediction on the text to be completed in the case that the source situation includes a source from a second text library related to a preset knowledge domain. As described in the foregoing disclosed embodiments, the completion prediction may be performed on the text to be completed by using a knowledge graph and a text library corresponding to the preset knowledge domain. In the embodiment of the disclosure, the knowledge graph may include a plurality of triples, where a triplet includes two entities and the entity relationship between them, and may specifically be represented as <entity 1, entity relationship, entity 2>. Taking classical music as an example of a preset knowledge domain, the triples may include, but are not limited to: <Mozart, birth place, Salzburg>, <Austria, longest history, Salzburg>, <Salzburg, commemorative day, Mozart week>; other preset knowledge domains can be treated by analogy and are not limited herein. Embodiments of the present disclosure may specifically include the following steps:
step S1010: and searching the entity in the triples to obtain a target triplet containing the target entity.
In the embodiment of the disclosure, the target entity is an entity extracted from the text to be completed. Specifically, a natural language processing (Natural Language Processing, NLP) tool (e.g., LTP, etc.) may be employed to perform named entity recognition on the text to be completed, so that the target entity may be extracted from it. Taking the text to be completed "as one of the representative characters of the Vienna classical genre, () was designated as the Salzburg palace musician in 1772" as an example, the target entity "Salzburg" can be extracted from it; other cases can be deduced by analogy and are not exemplified here.
In one implementation scenario, entity search may be performed in the knowledge graph corresponding to the preset knowledge domain, and the triples containing the target entity are directly taken as target triples. Still taking the text to be completed "as one of the representative characters of the Vienna classical genre, () was designated as the Salzburg palace musician in 1772" as an example, the above triples containing the target entity "Salzburg", namely <Mozart, birth place, Salzburg>, <Austria, longest history, Salzburg> and <Salzburg, commemorative day, Mozart week>, can directly be taken as target triples. Other texts to be completed can be treated by analogy and are not exemplified here. Directly taking the triples containing the target entity as target triples in this way can improve the speed of searching for target triples.
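The entity search of step S1010 may be sketched as follows (Python; the tiny knowledge graph mirrors the triples above and is illustrative only):

```python
from typing import List, NamedTuple

class Triple(NamedTuple):
    head: str
    relation: str
    tail: str

KG = [Triple("Mozart", "birth place", "Salzburg"),
      Triple("Austria", "longest history", "Salzburg"),
      Triple("Salzburg", "commemorative day", "Mozart week")]

def search_triples(target_entity: str, kg: List[Triple]) -> List[Triple]:
    # every triple whose head or tail is the target entity becomes a target triple
    return [t for t in kg if target_entity in (t.head, t.tail)]

print(search_triples("Salzburg", KG))   # all three triples above
```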
In another implementation scenario, a triplet including the target entity may be used as a candidate triplet, and another entity other than the target entity in the candidate triplet may be used as a reference entity, on this basis, the word semantic representation of each reference word in the second text library is used to obtain the entity semantic representation of the reference entity, and the whole semantic representation of the text to be completed is obtained, so that at least one candidate triplet is selected as the target triplet based on the similarity between the entity semantic representation of each reference entity and the whole semantic representation. According to the method, the triples containing the target entity are used as candidate triples, the other entity except the target entity in the candidate triples is used as a reference entity, so that the word semantic representation of each reference word in the text library is utilized to obtain the whole semantic representation of the text to be complemented and the entity semantic representation of the reference entity, and at least one candidate triplet is selected as the target triples based on the similarity between the entity semantic representations of each reference entity and the whole semantic representation, so that the candidate triples can be further screened based on the similarity, the interference of triples with lower similarity on subsequent complementation prediction can be reduced, and the complexity of the target triples fused into the text to be complemented can be reduced.
In a specific implementation scenario, the overall semantic representation may be fused from the word semantic representations of the words in the text to be completed, which may specifically be expressed as:
V_seq=(1/n)·∑_{i=1}^{n} v_{t_i}……(8)
In the above formula (8), V_seq represents the overall semantic representation of the text to be completed, n represents the total number of words in the text to be completed, and v_{t_i} represents the word semantic representation of the t_i-th word in the text to be completed. Taking the text to be completed "as one of the representative characters of the Vienna classical genre, () was designated as the Salzburg palace musician in 1772" as an example, the word semantic representations of its words (e.g., "Vienna", "classical", "genre", …, "palace", "musician") may be substituted into the above formula (8) to obtain the overall semantic representation of the text to be completed. Other situations can be deduced by analogy and are not exemplified here.
In another specific implementation scenario, the similarity between each entity semantic representation and the overall semantic representation may specifically be a cosine similarity, and the similarity S may specifically be obtained by the following formula:
S=(V_seq·V_ent)/(‖V_seq‖·‖V_ent‖)……(9)
In the above formula (9), V_seq represents the overall semantic representation of the text to be completed, and V_ent represents the entity semantic representation of the other entity (i.e., the reference entity) in the candidate triplet besides the target entity. Taking the text to be completed "as one of the representative characters of the Vienna classical genre, () was designated as the Salzburg palace musician in 1772" as an example, while the overall semantic representation is obtained with formula (8), the candidate triples <Mozart, birth place, Salzburg>, <Austria, longest history, Salzburg> and <Salzburg, commemorative day, Mozart week> can be obtained respectively, and the similarity between the overall semantic representation and each of the three entity semantic representations is then obtained based on formula (9). Other situations can be deduced by analogy and are not exemplified here.
In another specific implementation scenario, the candidate triples may be sorted in order of similarity from high to low, and the candidate triples in the first preset number of sequence positions are selected as target triples. Specifically, the preset sequence position may be set according to actual application requirements, e.g., to 2, 3, 4, etc. In addition, in order to reduce the interference of candidate triples with lower similarity on the subsequent completion prediction and reduce the complexity of fusing the target triples into the text to be completed, while avoiding low completion prediction accuracy caused by too few target triples, the preset sequence position may specifically be set to 2, i.e., the two candidate triples with the highest similarity after sorting are taken as target triples. Taking the text to be completed "as one of the representative characters of the Vienna classical genre, () was designated as the Salzburg palace musician in 1772" as an example, <Mozart, birth place, Salzburg> and <Salzburg, commemorative day, Mozart week> may be selected as target triples. Other situations can be deduced by analogy and are not exemplified here.
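Formulas (8) and (9) and the top-2 selection may be sketched as follows (Python; reading the fusion of formula (8) as a mean is an assumption):

```python
import numpy as np

def overall_representation(word_vectors):
    # formula (8): fuse the word semantic representations of the text to be
    # completed into one overall semantic representation (here, their mean)
    return np.mean(word_vectors, axis=0)

def select_target_triples(v_seq, candidates, entity_vectors, preset_rank=2):
    # formula (9): cosine similarity between the overall semantic representation
    # and each reference entity's entity semantic representation
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(zip(candidates, (cos(v_seq, v) for v in entity_vectors)),
                    key=lambda x: x[1], reverse=True)
    return [triple for triple, _ in ranked[:preset_rank]]
```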
In another specific implementation scenario, the second text library may include at least one reference text, and the reference text includes at least one reference word, so that at least one reference text may be segmented and labeled with part of speech respectively to obtain a plurality of words marked with part of speech class, the words with the part of speech class being the preset class are segmented word by word, and the segmented words and the non-segmented words are utilized to obtain a plurality of reference words, so that semantic extraction is performed on the plurality of reference words respectively to obtain word semantic representations of the reference words, and detailed description in the foregoing disclosed embodiments may be referred to.
Step S1020: and merging the target triples into the target entity of the text to be completed to obtain a merged text.
In one implementation scenario, after the target triples are obtained, the reference entity, the target entity and the entity relationship between them may be extracted from each target triplet, and the reference entity and the entity relationship may be inserted to the left side and/or the right side of the target entity in the text to be completed, thereby obtaining the fused text. Still taking the text to be completed "as one of the representative characters of the Vienna classical genre, () was designated as the Salzburg palace musician in 1772" as an example, from the target triplet <Mozart, birth place, Salzburg> the reference entity "Mozart" and the entity relationship "birth place" are inserted to the left side of the target entity "Salzburg", and from the target triplet <Salzburg, commemorative day, Mozart week> the entity relationship "commemorative day" and the reference entity "Mozart week" are inserted to its right side, obtaining the fused text "as one of the representative characters of the Vienna classical genre, () was designated as the Mozart birth place Salzburg commemorative day Mozart week palace musician in 1772". Other situations can be deduced by analogy and are not exemplified here.
In another implementation scenario, after the target triplet is obtained, a knowledge tree may be constructed by using the target triplet, the knowledge tree is converted into a text sequence, a root node of the knowledge tree is a target entity, leaf nodes of the knowledge tree are reference entities, the reference entities are another entity except the target entity in the target triplet, and an intermediate node between the root node and the leaf nodes is an entity relationship between the target entity and the reference entity, on the basis, the text sequence may be fused into the target entity of the text to be completed, and a fused text is obtained. According to the method, the knowledge tree is constructed by using the target triples, the knowledge tree is converted into the text sequence, and the text sequence is fused to the target entity of the text to be completed, so that the fusion text is obtained, the conversion of the target triples into the text sequence with the structural characteristics by constructing the knowledge tree can be facilitated, the readability of the fusion text can be further improved, and the accuracy of subsequent completion prediction can be improved.
In one embodiment, referring to fig. 11 in combination, fig. 11 is a schematic diagram of an embodiment of a knowledge tree. As shown in fig. 11, taking the text to be completed "as one of the representative characters of the Vienna classical genre, () was designated as the Salzburg palace musician in 1772" as an example, the corresponding target triples include <Mozart, birth place, Salzburg> and <Salzburg, commemorative day, Mozart week>, so the target entity "Salzburg" can be used as the root node of the knowledge tree, the reference entities "Mozart" and "Mozart week" as leaf nodes, and the entity relationships "birth place" and "commemorative day" as the intermediate nodes between the root node and the corresponding leaf nodes. It should be noted that, in the embodiments of the present disclosure and the embodiments described below, unless otherwise specified, the root node is a node in the knowledge tree without a parent node, and a leaf node is a node without a child node.
In another specific implementation scenario, the knowledge tree is a binary tree; on this basis, a middle-order (in-order) traversal may be adopted to traverse the knowledge tree, and the combination of the sequentially traversed words is taken as the text sequence. In the embodiments of the present disclosure and the embodiments described below, the middle-order traversal is a binary tree traversal, which may also be called middle-root traversal. Referring to fig. 11 in combination, traversing the knowledge tree of fig. 11 in middle order first traverses the left subtree: "Mozart", "birth place"; then visits the root node: "Salzburg"; and finally traverses the right subtree: "commemorative day", "Mozart week". The combination of the sequentially traversed words, "Mozart birth place Salzburg commemorative day Mozart week", is taken as the text sequence. Other situations can be deduced by analogy and are not exemplified here.
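The middle-order traversal may be sketched as follows (Python; the tree literal reproduces the knowledge tree of fig. 11):

```python
class Node:
    def __init__(self, word, left=None, right=None):
        self.word, self.left, self.right = word, left, right

def in_order(node):
    # middle-order traversal: left subtree, then root, then right subtree
    if node is None:
        return []
    return in_order(node.left) + [node.word] + in_order(node.right)

# the knowledge tree of fig. 11: root = target entity, intermediate nodes =
# entity relationships, leaf nodes = reference entities
tree = Node("Salzburg",
            left=Node("birth place", left=Node("Mozart")),
            right=Node("commemorative day", right=Node("Mozart week")))
print(" ".join(in_order(tree)))
# -> Mozart birth place Salzburg commemorative day Mozart week
```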
In yet another specific implementation scenario, after the text sequence is obtained, the target entity in the text to be completed may be replaced with the text sequence to obtain the fused text. Referring to fig. 12 in combination, fig. 12 is a schematic diagram illustrating a state of an embodiment of a fused text acquisition process. As shown in fig. 12, taking the text to be completed "as one of the representative characters of the Vienna classical genre, () was designated as the Salzburg palace musician in 1772" as an example, after the entity search, knowledge tree construction and conversion, the text sequence "Mozart birth place Salzburg commemorative day Mozart week" is obtained; on this basis, the target entity in the text to be completed can be directly replaced with the text sequence, giving the fused text "as one of the representative characters of the Vienna classical genre, () was designated as the Mozart birth place Salzburg commemorative day Mozart week palace musician in 1772". Other situations can be deduced by analogy and are not exemplified here.
Step S1030: and carrying out complement prediction on the fusion text by using a second text library to obtain at least one candidate word at the missing position.
Specifically, the second text library includes at least one reference text, the reference text includes at least one reference word, and semantic extraction is performed on the reference words to obtain their word semantic representations; the specific process may refer to the related description in the foregoing disclosed embodiments and is not repeated here. On this basis, the words belonging to the text to be completed in the fused text may be sequentially encoded with first digital sequence positions in position order, and the words belonging to the target triples in the fused text sequentially encoded with second digital sequence positions, with the largest first digital sequence position smaller than the smallest second digital sequence position; semantic extraction is then performed on the encoded fused text to obtain the individual semantic representation of each missing position, and for each missing position, at least one candidate word is obtained using the individual semantic representation of the missing position and the word semantic representations of the reference words. In this way, domain knowledge can be fused into the completion prediction while the original word order of the text to be completed remains unchanged; performing semantic extraction on the encoded fused text on this basis can improve the accuracy of the individual semantic representations and hence the accuracy of the completion prediction.
In one implementation scenario, as described in the foregoing disclosed embodiments, a default symbol may be respectively appended at each missing position of the fused text before the above encoding in position order. The default symbol may specifically be set as "[mask]" per the foregoing description of the disclosed embodiments, which is not limited herein.
In another implementation scenario, [CLS] and [SEP] may also be respectively appended as a start flag and an end flag at the beginning and end of the fused text. Taking the above fused text "as one of the representative characters of the Vienna classical genre, () was designated as the Mozart birth place Salzburg commemorative day Mozart week palace musician in 1772" as an example, it may be processed as "[CLS] as one of the representative characters of the Vienna classical genre, [mask] was designated as the Mozart birth place Salzburg commemorative day Mozart week palace musician in 1772 [SEP]"; other cases can be deduced by analogy and are not exemplified here.
In still another implementation scenario, in order to facilitate prediction at mixed granularity of characters, words, entities and the like, the fused text may further be subjected to word segmentation and part-of-speech tagging to obtain a plurality of words marked with part-of-speech categories, and the words whose part-of-speech category is a preset category are segmented word by word. The specific manner of word segmentation and part-of-speech tagging may refer to the related descriptions in the foregoing disclosed embodiments and is not repeated here; the preset category may be set as place name, as described above. Still taking the fused text "as one of the representative characters of the Vienna classical genre, () was designated as the Mozart birth place Salzburg commemorative day Mozart week palace musician in 1772" as an example, the relevant steps in the above disclosed embodiments may be adopted to append the default symbol, the start flag and the end flag and to perform word segmentation, part-of-speech tagging and word-by-word segmentation, finally obtaining the processed fused text. In the case that the fused text is other text, the processed fused text can be obtained by analogy, which is not exemplified here.
In yet another implementation scenario, to distinguish words belonging to the text to be completed from words belonging to the target triples, a sequence start flag may also be appended before the text sequence in the fused text, and a sequence end flag appended after it. The sequence start flag and the sequence end flag may be set according to actual application requirements; for example, <S> may be used as the sequence start flag and <T> as the sequence end flag. On this basis, in the fused text "as one of the representative characters of the Vienna classical genre, () was designated as the Mozart birth place Salzburg commemorative day Mozart week palace musician in 1772", the text sequence is wrapped as "<S> Mozart birth place Salzburg commemorative day Mozart week <T>".
In yet another implementation scenario, since the target entity exists in both the target triples and the text to be completed, in order to further maintain the original word order of the text to be completed, the target entity may be regarded as a word belonging to the text to be completed, i.e., the target entity is encoded with a first digital sequence position. In addition, the second digital sequence positions may be encoded immediately after the first digital sequence positions; e.g., if the largest first digital sequence position is i, the smallest second digital sequence position may be i+1. Still taking the above text to be completed "as one of the representative characters of the Vienna classical genre, () was designated as the Salzburg palace musician in 1772" as an example, the fused text may be position-encoded in this manner.
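The two-stage position encoding may be sketched as follows (Python; treating the target entity as original text per the paragraph above; names are illustrative):

```python
def position_encode(original_words, triple_words, target_entity):
    # words of the text to be completed get first digital sequence positions
    # 0..i in position order (the target entity included); the remaining words
    # of the target triples then get second digital sequence positions i+1..
    encoded = [(w, pos) for pos, w in enumerate(original_words)]
    next_pos = len(original_words)
    for w in triple_words:
        if w == target_entity:      # already encoded as part of the original text
            continue
        encoded.append((w, next_pos))
        next_pos += 1
    return encoded
```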
Furthermore, as described in the foregoing disclosed embodiments, in the case where the missing content is derived from a second text library related to the preset knowledge field, the completion prediction may be performed using a third prediction network, which may specifically be obtained by training a preset neural network (e.g., BERT) with sample text. On this basis, the encoded fused text can be fed to the third prediction network to obtain the individual semantic representation of each missing position. For the process of obtaining the sample text, reference may be made to the related description in the foregoing disclosed embodiments, which is not repeated herein. In addition, the training process of the third prediction network will be described in the disclosed embodiments below and is not detailed here.
In one implementation scenario, for each missing position, the predicted probability value of each reference word may be obtained using the individual semantic representation of the missing position and the word semantic representations of the respective reference words. Specifically, the predicted probability value of a reference word may represent the likelihood that the word missing at the missing position is that reference word; the greater the predicted probability value, the greater the likelihood. On this basis, the reference words ranked within the preset order bits in descending order of predicted probability value may be selected as candidate words for the missing position. For the setting of the preset order bits, reference may be made to the foregoing related description, which is not repeated here.
In a specific implementation scenario, as described above, the encoded fused text is fed to the third prediction network to obtain the individual semantic representation of a missing position. For convenience of description, the individual semantic representation may be denoted as h, and the word semantic representations of the reference words may be denoted as W, where it should be noted that W is the set of word semantic representations of all reference words. The predicted probability values p may then be calculated by the following formula:
p=softmax(h·W)……(10)
In the above formula (10), p represents the predicted probability values of the respective reference words, h represents the individual semantic representation of the missing position, W represents the word semantic representations of the respective reference words, · represents a dot product operation, and softmax represents normalization. Reference may be made specifically to the relevant descriptions in the foregoing disclosed embodiments, and details are not repeated here.
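As a minimal sketch of formula (10) together with the candidate selection of the preceding paragraph, assuming h is the individual semantic representation of one missing position and W stacks the word semantic representations of the V reference words (the shapes and the PyTorch usage are illustrative assumptions, not the application's prescribed implementation):

```python
import torch

def predict_candidates(h, W, reference_words, k=5):
    """h: tensor of shape (d,); W: tensor of shape (V, d); V reference words.
    Returns the k reference words with the largest predicted probability."""
    p = torch.softmax(W @ h, dim=0)   # formula (10): p = softmax(h · W)
    top_p, top_i = torch.topk(p, k)   # reference words in the preset order bits
    return [(reference_words[i], top_p[j].item())
            for j, i in enumerate(top_i.tolist())]
```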
It should be noted that, in the case where no target triplet containing the target entity is found, the fused text is simply the text to be completed; in this case the completion prediction cannot refer to domain knowledge, that is, semantic extraction may be performed directly on the text to be completed to obtain the individual semantic representation of each missing position, and for each missing position, at least one candidate word is obtained using the individual semantic representation of the missing position and the word semantic representations of the respective reference words. Reference may be made specifically to the related description in the foregoing disclosed embodiments, which is not repeated herein. Therefore, completion prediction can be performed on the text to be completed regardless of whether a target triplet can be found, so that domain knowledge is used in a pluggable manner, which can greatly improve the flexibility of the completion prediction.
In addition, in the case where the knowledge graph is updated, the target triples found by the search may also change; in this case, the steps in the embodiments of the present disclosure can still be used to predict at least one candidate word for each missing position. Therefore, subsequent completion prediction is unaffected by whether the knowledge graph is updated, which can greatly improve the extensibility of the completion prediction.
In addition, in the case where the text to be completed includes a plurality of missing positions, completion prediction may be performed for each missing position in the above manner to finally obtain the candidate words of each missing position; for the specific process, reference may be made to the foregoing description, which is not repeated herein.
Different from the foregoing embodiment, an entity search is performed among the plurality of triples to obtain a target triplet containing the target entity, and the target triplet is fused into the target entity of the text to be completed to obtain a fused text, so that the second text library is utilized to perform completion prediction on the fused text to obtain at least one candidate word for each missing position. Therefore, by searching for the target triplet containing the target entity and fusing it into the target entity of the text to be completed, domain knowledge closely related to the text to be completed can be incorporated into the text to be completed, which can further improve the accuracy of subsequent completion prediction.
Referring to fig. 13, fig. 13 is a flow chart illustrating an embodiment of a training process of the third prediction network. In the embodiment of the present disclosure, the sample knowledge graph includes a plurality of sample triples, and each sample triplet includes two sample entities and the sample entity relationship between them; reference may be made specifically to the related description in the foregoing disclosed embodiments, which is not repeated herein. Embodiments of the present disclosure may specifically include the following steps:
step S1310: and performing entity searching in the plurality of sample triples to obtain a sample target triplet containing the sample target entity.
In the embodiment of the disclosure, the sample target entity is an entity extracted from the sample text. Reference may be made specifically to the description of step S1010 in the foregoing disclosed embodiment, which is not repeated herein.
Step S1320: and merging the sample target triples into sample target entities of the sample text to obtain a sample merged text.
Reference may be made specifically to the description related to step S1020 in the foregoing disclosed embodiment, which is not repeated herein.
Step S1330: And sequentially encoding, in position order, first sample digit orders for the words belonging to the sample text in the sample fused text, and second sample digit orders for the words belonging to the sample target triples in the sample fused text.
In the embodiment of the present disclosure, the largest first sample digit order is smaller than the smallest second sample digit order. Reference may be made to the description related to step S1030 in the foregoing disclosed embodiment, which is not repeated here.
Step S1340: And performing semantic extraction on the encoded sample fused text by using the third prediction network to obtain the sample individual semantic representation of each sample missing position.
Reference may be made specifically to the description related to step S1030 in the foregoing disclosed embodiment, which is not repeated herein.
Step S1350: and aiming at each sample missing position, obtaining a sample prediction probability value of each reference word by using the sample individual semantic representation of the sample missing position and the word semantic representation of each reference word.
Reference may be made specifically to the description related to step S1030 in the foregoing disclosed embodiment, which is not repeated herein.
Step S1360: and obtaining a third loss value of a third prediction network based on the sample prediction probability value of each reference word at each sample missing position.
In particular, the third loss value may be calculated using a cross entropy loss function, which may be expressed as:

L = -(1/M) ∑_{i=1}^{M} log p_i(y_i) ……(7)

In the above formula (7), M represents the number of sample missing positions in the sample text, y_i represents the default word corresponding to the i-th missing position in the sample text, and p_i represents the sample prediction probability values of the respective reference words obtained by predicting the i-th missing position in the sample text, so that p_i(y_i) denotes the probability predicted for the true default word y_i.
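Under the reconstruction of formula (7) above, the third loss value can be sketched with PyTorch as follows; mask_logits stands for the scores h_i · W gathered at the M sample missing positions, and the shapes are illustrative assumptions.

```python
import torch.nn.functional as F

def third_loss_value(mask_logits, target_ids):
    """mask_logits: (M, V) scores h_i · W at the M sample missing positions.
    target_ids: (M,) index of the default word y_i among the V reference words."""
    # cross_entropy applies log-softmax internally (matching formula (10)) and
    # averages -log p_i(y_i) over the M sample missing positions, as in (7).
    return F.cross_entropy(mask_logits, target_ids)
```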
Step S1370: and adjusting network parameters of the third prediction network by using the third loss value.
Specifically, the network parameters of the third prediction network may be adjusted using the third loss value by means of stochastic gradient descent (Stochastic Gradient Descent, SGD), batch gradient descent (Batch Gradient Descent, BGD), mini-batch gradient descent (Mini-Batch Gradient Descent, MBGD), or the like. Batch gradient descent refers to updating the parameters using all samples at each iteration; stochastic gradient descent refers to updating the parameters using one sample at each iteration; and mini-batch gradient descent refers to updating the parameters using a batch of samples at each iteration, which is not described in detail herein.
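A minimal training-loop sketch: with an optimizer such as PyTorch's SGD, the three strategies above differ only in how many samples each iteration consumes. compute_third_loss is a hypothetical helper standing in for steps S1340 to S1360.

```python
import torch
from torch.utils.data import DataLoader

def train_third_prediction_network(network, dataset, batch_size, lr=1e-5):
    # batch_size = 1                -> stochastic gradient descent (SGD)
    # batch_size = len(dataset)     -> batch gradient descent (BGD)
    # 1 < batch_size < len(dataset) -> mini-batch gradient descent (MBGD)
    optimizer = torch.optim.SGD(network.parameters(), lr=lr)
    for batch in DataLoader(dataset, batch_size=batch_size, shuffle=True):
        optimizer.zero_grad()
        loss = compute_third_loss(network, batch)  # hypothetical: steps S1340-S1360
        loss.backward()                            # back-propagate the third loss value
        optimizer.step()                           # step S1370: adjust the parameters
```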
Different from the foregoing embodiment, an entity search is performed among the plurality of sample triples to obtain a sample target triplet containing the sample target entity, and the sample target triplet is fused into the sample target entity of the sample text to obtain a sample fused text. The words belonging to the sample text in the sample fused text are then sequentially encoded with first sample digit orders in position order, and the words belonging to the sample target triplet are sequentially encoded with second sample digit orders, so that semantic extraction is performed on the encoded sample fused text by using the third prediction network to obtain the sample individual semantic representation of each sample missing position. For each sample missing position, the sample prediction probability value of each reference word is obtained by using the sample individual semantic representation of the sample missing position and the word semantic representations of the respective reference words; the third loss value of the third prediction network is then obtained based on the sample prediction probability values of the reference words at each sample missing position, and the network parameters of the third prediction network are adjusted using the third loss value. Therefore, by searching for the sample target triplet containing the sample target entity and fusing it into the sample target entity of the sample text, domain knowledge closely related to the sample text can be incorporated during training, which is beneficial to improving the accuracy of the completion prediction performed by the trained third prediction network.
It should be noted that, in the present application, the completion prediction methods matched to different source situations may be integrated in the same system framework as shown in fig. 2 and the foregoing embodiments, or may each be implemented separately and independently.
Referring to fig. 14, fig. 14 is a flowchart illustrating another embodiment of the text completion method of the present application. Specifically, the method may include the steps of:
step S1410: and obtaining the text to be complemented, and determining that the source of the missing content of the text to be complemented is unknown.
In an embodiment of the present disclosure, the text to be completed includes at least one missing location. Reference may be made specifically to the relevant steps in the foregoing disclosed embodiments, which are not described herein.
Step S1420: and carrying out word-by-word prediction on the text to be completed to obtain at least one candidate word of the missing position.
In one implementation scenario, a preset number of default symbols may be padded at each missing position of the text to be completed to obtain a text to be processed; then, for each missing position, prediction is performed on the text to be processed a plurality of times, where the k-th prediction yields the predicted word of the default symbol at the k-th order bit, and the candidate words of the missing position are obtained based on the predicted words of the plurality of predictions. Reference may be made specifically to the relevant descriptions in the foregoing disclosed embodiments, and details are not repeated here.
In another implementation scenario, unlike the foregoing, for the i-th missing position, the 1st to (i-1)-th missing positions may each be padded with the candidate words already predicted at those positions, and the i-th missing position is padded with a preset number of default symbols, so that a plurality of texts to be processed are obtained. For each text to be processed, a plurality of predictions are performed, where each prediction yields the predicted word of the default symbol at the order bit corresponding to the number of that prediction, and the combination of the predicted words of the plurality of predictions is used as the candidate word corresponding to that text to be processed; the candidate words corresponding to the respective texts to be processed are then used as the candidate words of the i-th missing position, and so on until the candidate words of all N missing positions are obtained. In this manner, the prediction of candidate words at a later missing position depends on the candidate words at the earlier missing positions, so that the relevance among the candidate words of the missing positions during completion prediction can be improved, which is beneficial to improving the accuracy of the candidate words at each missing position.
In a specific implementation scenario, take the text to be completed "() medical journal "lancet" online published () military medical science institute XX vaccine phase ii clinical test result", which includes 2 (i.e., N is 2) missing positions, as an example. For the 1st missing position, a preset number of default symbols may be padded at each of the 2 missing positions to obtain a text to be processed, a plurality of predictions are performed on it, and the combination of the predicted words of the plurality of predictions is used as the candidate words for the 1st missing position (e.g., "UK", "United States"); for the specific prediction process, reference may be made to the related descriptions in the foregoing disclosed embodiments, which are not repeated herein. On this basis, for the 2nd missing position, the candidate words predicted at the 1st missing position may be padded there respectively, that is, "UK" and "United States" are padded at the 1st missing position respectively, while a preset number of default symbols are padded at the 2nd missing position, so that 2 texts to be processed are obtained and prediction is performed on each of them. Finally, for the text to be processed padded with "UK" at the 1st missing position, the candidate word "China" may be predicted at the 2nd missing position; for the text to be processed padded with "United States" at the 1st missing position, the candidate word "Japan" may be predicted at the 2nd missing position. At this point all 2 missing positions have been predicted, and the candidate words at the 1st missing position include "UK" and "United States", while the candidate words at the 2nd missing position include "China" and "Japan". The above example is merely one possible case in practical application and does not limit other possible cases; the case of other numbers of missing positions can be deduced in the same way, and no further examples are given here. A sketch of this sequential procedure is given below.
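The sequential strategy of this scenario can be sketched as follows. predict_at is a hypothetical stand-in for the multi-pass default-symbol prediction of the earlier embodiments: given the text to be completed, the words already chosen for the earlier positions, and a position index, it returns the candidate words predicted for that position.

```python
def complete_sequentially(text_to_complete, n_missing, predict_at):
    """Returns one candidate-word list per missing position; later positions
    are predicted conditioned on the candidate words of earlier positions."""
    fills = [[]]                                  # partial fills of positions 1..i-1
    candidates = [[] for _ in range(n_missing)]
    for i in range(n_missing):
        next_fills = []
        for fill in fills:
            # positions 1..i padded with `fill`, position i+1 with default symbols
            for word in predict_at(text_to_complete, fill, i):
                candidates[i].append(word)
                next_fills.append(fill + [word])
        fills = next_fills
    return candidates
```

In the example above, fills grows from [[]] to [["UK"], ["United States"]] after the 1st missing position, so the 2nd missing position is predicted once per partial fill, yielding "China" and "Japan" respectively.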
Step S1430: and obtaining the complete text of the text to be complemented by utilizing the candidate words of each missing position.
In one implementation scenario, a corresponding candidate word may be added at each missing position, so that a plurality of candidate texts of the text to be completed may be obtained, a final score of each candidate text may be obtained, and one candidate text may be selected as a complete text of the text to be completed based on the final scores of the plurality of candidate texts. Reference may be made specifically to the relevant descriptions in the foregoing disclosed embodiments, and details are not repeated here.
In another implementation scenario, in order to improve accuracy of the final score, a corresponding candidate word may be further added to each missing position to obtain a plurality of candidate texts of the text to be completed, and for each candidate text, the words in the candidate texts are reversely ordered to obtain reverse texts of the candidate texts, so that the final score of the candidate texts is obtained based on the first score of the candidate texts and the second score of the reverse texts, and then one candidate text is selected as a complete text of the text to be completed based on the final scores of the plurality of candidate texts. Reference may be made specifically to the relevant descriptions in the foregoing disclosed embodiments, and details are not repeated here.
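As a small illustration of this bidirectional scoring, the following sketch assumes score is any scorer assigning a fluency score to a text and segment splits a text into words (both hypothetical); since the exact way of combining the first and second scores is left to the foregoing embodiments, a plain average is used here purely as an assumption.

```python
def choose_complete_text(candidate_texts, score, segment):
    """Pick the candidate text with the best final score."""
    def final_score(text):
        reverse_text = "".join(reversed(segment(text)))  # words in reverse order
        first, second = score(text), score(reverse_text)
        return (first + second) / 2   # illustrative combination of the two scores
    return max(candidate_texts, key=final_score)
```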
Different from the foregoing embodiments, the text to be completed is obtained and the source of its missing content is determined to be unknown, so that word-by-word prediction is performed on the text to be completed to obtain at least one candidate word for each missing position, and the complete text of the text to be completed is then obtained using the candidate words of each missing position. Therefore, the missing content of the text to be completed can be completed without relying on manpower, which can improve text completion efficiency and reduce text completion cost. In addition, in the case where the source of the missing content is unknown, word-by-word prediction helps to improve the accuracy of text completion.
Referring to fig. 15, fig. 15 is a schematic flow chart of a text completion method according to another embodiment of the present application. Specifically, the method may include the steps of:
step S1510: and acquiring the text to be complemented, and determining a text library from which the missing content of the text to be complemented is derived.
In an embodiment of the present disclosure, the text to be completed includes at least one missing location. Reference may be made specifically to the relevant steps in the foregoing disclosed embodiments, which are not described herein.
Step S1520: and carrying out completion prediction on the text to be completed by using the text library to obtain at least one candidate word at the missing position.
In one implementation scenario, the text library includes at least one reference text, and the reference text includes at least one reference word, on the basis of which, semantic extraction can be performed on the text to be completed to obtain individual semantic representations of each missing location, and for each missing location, at least one candidate word of the missing location is obtained by using the individual semantic representations of the missing location and the word semantic representations of each reference word. Reference may be made specifically to the relevant descriptions in the foregoing disclosed embodiments, and details are not repeated here.
In another implementation scenario, as described above, the text library includes at least one reference text, and the reference text includes at least one reference word, so that semantic extraction can be performed on the reference words to obtain their word semantic representations; for the specific process, reference may be made to the related description in the foregoing disclosed embodiments, which is not repeated herein. On this basis, the text to be completed may be segmented into a plurality of words, and the word semantic representation of the reference word that coincides with each word of the text to be completed is used as the word semantic representation of that word. The word semantic representations of the words of the text to be completed are then fused to obtain the overall semantic representation of the text to be completed; for example, where each word semantic representation is a vector with a preset number of dimensions (e.g., 128), the elements at the same position in the word semantic representations of the respective words may be averaged to obtain the overall semantic representation. Further, the similarity (e.g., cosine similarity) between the word semantic representation of each reference word in the text library and the overall semantic representation may be obtained, the reference words may be sorted in descending order of similarity, and the reference words ranked within the preset order bits (e.g., the first 5) may be selected as candidate words for each missing position in the text to be completed. A sketch of this procedure is given below.
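A minimal NumPy sketch of this retrieval-style prediction; the 128-dimensional representations and the top-5 cut-off follow the examples above, and the lookup of word semantic representations is assumed to have been performed already.

```python
import numpy as np

def rank_reference_words(text_word_vecs, ref_vecs, ref_words, top=5):
    """text_word_vecs: (n, 128) word semantic representations of the words of
    the text to be completed; ref_vecs: (V, 128) those of the reference words."""
    overall = text_word_vecs.mean(axis=0)   # average same-position elements
    sims = ref_vecs @ overall / (
        np.linalg.norm(ref_vecs, axis=1) * np.linalg.norm(overall) + 1e-12)
    order = np.argsort(-sims)[:top]         # descending cosine similarity
    return [ref_words[i] for i in order]    # candidate words for the missing positions
```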
Step S1530: and obtaining the complete text of the text to be complemented by utilizing the candidate words of each missing position.
In one implementation scenario, a corresponding candidate word may be added at each missing position, so that a plurality of candidate texts of the text to be completed may be obtained, a final score of each candidate text may be obtained, and one candidate text may be selected as a complete text of the text to be completed based on the final scores of the plurality of candidate texts. Reference may be made specifically to the relevant descriptions in the foregoing disclosed embodiments, and details are not repeated here.
In another implementation scenario, in order to improve accuracy of the final score, a corresponding candidate word may be further added to each missing position to obtain a plurality of candidate texts of the text to be completed, and for each candidate text, the words in the candidate texts are reversely ordered to obtain reverse texts of the candidate texts, so that the final score of the candidate texts is obtained based on the first score of the candidate texts and the second score of the reverse texts, and then one candidate text is selected as a complete text of the text to be completed based on the final scores of the plurality of candidate texts. Reference may be made specifically to the relevant descriptions in the foregoing disclosed embodiments, and details are not repeated here.
Different from the foregoing embodiment, the text to be completed is obtained, and the text library from which the missing content of the text to be completed is derived is determined, so that the text library is utilized to perform completion prediction on the text to be completed to obtain at least one candidate word for each missing position, and the complete text of the text to be completed is then obtained using the candidate words of each missing position. Therefore, the missing content of the text to be completed can be completed without relying on manpower, which can improve text completion efficiency and reduce text completion cost. In addition, since the missing content is determined to be derived from the text library, the text library is directly utilized for completion prediction to obtain the candidate words of each missing position; that is, the missing content is not restricted in advance to a character, a word or an entity, which can facilitate prediction at the mixed granularity of characters, words, entities and the like.
Referring to fig. 16, fig. 16 is a flowchart illustrating a text completion method according to another embodiment of the present application. Specifically, the method may include the steps of:
step S1610: and acquiring the text to be complemented, and determining a text library from which the missing content of the text to be complemented is derived.
In the embodiment of the disclosure, the text to be completed includes at least one missing position, and the text library relates to the preset knowledge field. Reference may be made specifically to the relevant steps in the foregoing disclosed embodiments, which are not described herein.
Step S1620: and carrying out completion prediction on the text to be completed by utilizing a knowledge graph and a text library corresponding to the preset knowledge field to obtain at least one candidate word at the missing position.
In one implementation scenario, the knowledge graph includes a plurality of triples, the triples include two entities and an entity relationship between the two entities, on the basis, entity searching can be performed in the triples to obtain a target triplet including a target entity, the target entity is an entity extracted from a text to be complemented, and the target triplet is fused into the target entity of the text to be complemented to obtain a fused text, so that the fused text is complemented and predicted by using a text library to obtain at least one candidate word of a missing position. Reference may be made specifically to the relevant descriptions in the foregoing disclosed embodiments, and details are not repeated here.
In another implementation scenario, similar to the foregoing description, the knowledge graph includes a plurality of triples, and each triplet includes two entities and the entity relationship between them; on this basis, an entity search may be performed among the plurality of triples to obtain a target triplet containing the target entity, the target entity being an entity extracted from the text to be completed. Different from the foregoing description, while the target triplet is being searched for, the text library may be directly utilized to perform completion prediction on the text to be completed to obtain at least one candidate word for each missing position; for the specific process, reference may be made to the related description in the foregoing disclosed embodiments, which is not repeated herein. On this basis, the other entity in the target triplet besides the target entity, referred to as the reference entity in the foregoing disclosed embodiments, may be extracted, and the at least one candidate word obtained by the completion prediction may be further screened using the reference entity: for example, the correlation between the entity semantic representation of the reference entity and the word semantic representation of each candidate word may be obtained respectively, and the candidate words ranked within the preset order bits (e.g., the first 5) in descending order of correlation are selected as the final candidate words of the missing position; a sketch of this screening is given below. In this manner, the entity search and the completion prediction can be executed in parallel, which can further improve text completion efficiency.
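The screening step can be sketched in the same spirit; cosine similarity is used here as one possible measure of the correlation between the reference entity and each candidate word, which is an assumption for illustration.

```python
import numpy as np

def screen_candidates(cand_vecs, cand_words, entity_vec, top=5):
    """cand_vecs: (C, d) word semantic representations of the candidate words;
    entity_vec: (d,) entity semantic representation of the reference entity."""
    corr = cand_vecs @ entity_vec / (
        np.linalg.norm(cand_vecs, axis=1) * np.linalg.norm(entity_vec) + 1e-12)
    keep = np.argsort(-corr)[:top]        # e.g. the first 5 bits by correlation
    return [cand_words[i] for i in keep]  # final candidate words of the missing position
```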
Step S1630: and obtaining the complete text of the text to be complemented by utilizing the candidate words of each missing position.
In one implementation scenario, a corresponding candidate word may be added at each missing position, so that a plurality of candidate texts of the text to be completed may be obtained, a final score of each candidate text may be obtained, and one candidate text may be selected as a complete text of the text to be completed based on the final scores of the plurality of candidate texts. Reference may be made specifically to the relevant descriptions in the foregoing disclosed embodiments, and details are not repeated here.
In another implementation scenario, in order to improve accuracy of the final score, a corresponding candidate word may be further added to each missing position to obtain a plurality of candidate texts of the text to be completed, and for each candidate text, the words in the candidate texts are reversely ordered to obtain reverse texts of the candidate texts, so that the final score of the candidate texts is obtained based on the first score of the candidate texts and the second score of the reverse texts, and then one candidate text is selected as a complete text of the text to be completed based on the final scores of the plurality of candidate texts. Reference may be made specifically to the relevant descriptions in the foregoing disclosed embodiments, and details are not repeated here.
Different from the foregoing embodiment, the text to be completed is obtained, and the text library from which the missing content of the text to be completed is derived is determined, so that completion prediction is performed on the text to be completed by using the knowledge graph corresponding to the preset knowledge field together with the text library to obtain at least one candidate word for each missing position, and the complete text of the text to be completed is then obtained using the candidate words of each missing position. Therefore, the missing content of the text to be completed can be completed without relying on manpower, which can improve text completion efficiency and reduce text completion cost. In addition, since the missing content is determined to be derived from the text library related to the preset knowledge field, the completion prediction is performed using both the knowledge graph corresponding to the preset knowledge field and the text library, which further improves the accuracy of text completion.
Referring to fig. 17, fig. 17 is a schematic diagram illustrating a frame of an embodiment of an electronic device 1700 of the present application. The electronic device 1700 comprises a memory 1701 and a processor 1702 coupled to each other, the memory 1701 having stored therein program instructions, the processor 1702 being adapted to execute the program instructions to implement the steps of any of the text completion method embodiments described above. In particular, electronic device 1700 may include, but is not limited to: desktop computers, notebook computers, tablet computers, servers, etc., are not limited herein.
In particular, the processor 1702 is configured to control itself and the memory 1701 to implement the steps of any of the text completion method embodiments described above. The processor 1702 may also be referred to as a CPU (Central Processing Unit). The processor 1702 may be an integrated circuit chip having signal processing capabilities. The processor 1702 may also be a general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 1702 may be implemented jointly by a plurality of integrated circuit chips.
In some disclosed embodiments, the processor 1702 is configured to obtain text to be completed and determine a text library from which missing content of the text to be completed originates; the text to be complemented comprises at least one missing position, and the text library relates to the preset knowledge field; the processor 1702 is configured to perform complement prediction on a text to be complemented by using a knowledge graph and a text base corresponding to a preset knowledge domain, so as to obtain at least one candidate word at a missing position; the processor 1702 is configured to obtain a complete text of the text to be completed using the candidate words at each missing location.
Different from the foregoing embodiment, the text to be completed is obtained, and the text library from which the missing content of the text to be completed is derived is determined, so that completion prediction is performed on the text to be completed by using the knowledge graph corresponding to the preset knowledge field together with the text library to obtain at least one candidate word for each missing position, and the complete text of the text to be completed is then obtained using the candidate words of each missing position. Therefore, the missing content of the text to be completed can be completed without relying on manpower, which can improve text completion efficiency and reduce text completion cost. In addition, since the missing content is determined to be derived from the text library related to the preset knowledge field, the completion prediction is performed using both the knowledge graph corresponding to the preset knowledge field and the text library, which further improves the accuracy of text completion.
In some disclosed embodiments, the knowledge graph comprises a plurality of triples, the triples comprising two entities and an entity relationship between the two entities, the processor 1702 configured to perform an entity search in the plurality of triples to obtain a target triplet comprising a target entity; the target entity is an entity extracted from the text to be completed; the processor 1702 is configured to fuse the target triplet into a target entity of the text to be completed, to obtain a fused text; the processor 1702 is configured to make up prediction for the fused text using the text library to obtain at least one candidate word for the missing location.
Different from the foregoing embodiment, an entity search is performed among the plurality of triples to obtain a target triplet containing the target entity, and the target triplet is fused into the target entity of the text to be completed to obtain a fused text, so that the text library is utilized to perform completion prediction on the fused text to obtain at least one candidate word for each missing position. Therefore, by searching for the target triplet containing the target entity and fusing it into the target entity of the text to be completed, domain knowledge closely related to the text to be completed can be incorporated into the text to be completed, which can further improve the accuracy of subsequent completion prediction.
In some disclosed embodiments, the text library contains at least one reference text and the reference text contains at least one reference word, and the processor 1702 is configured to treat a triplet containing the target entity as a candidate triplet and another entity in the candidate triplet other than the target entity as a reference entity; the processor 1702 is configured to obtain an overall semantic representation of a text to be completed and obtain an entity semantic representation of a reference entity by using word semantic representations of each reference word in the text library; the processor 1702 is configured to select at least one candidate triplet as a target triplet based on a similarity between the entity semantic representations of the respective reference entities and the overall semantic representation, respectively.
Different from the foregoing embodiment, by taking the triplet including the target entity as the candidate triplet and taking another entity other than the target entity in the candidate triplet as the reference entity, the whole semantic representation of the text to be complemented and the entity semantic representation of the reference entity are obtained by using the word semantic representation of each reference word in the text library, and then at least one candidate triplet is selected as the target triplet based on the similarity between the entity semantic representation of each reference entity and the whole semantic representation, so that the candidate triplet can be further screened based on the similarity, the interference of the triplet with lower similarity on the subsequent complement prediction can be reduced, and the complexity of the subsequent fusion of the target triplet into the text to be complemented can be reduced.
In some disclosed embodiments, the processor 1702 is configured to perform word segmentation and part-of-speech tagging on at least one reference text, respectively, to obtain a plurality of words tagged with part-of-speech categories; the processor 1702 is configured to word-by-word segment words with a part-of-speech class being a preset class, and obtain a plurality of reference words by using the segmented words and the non-segmented words; the processor 1702 is configured to perform first semantic extraction on a plurality of reference terms, respectively, to obtain term semantic representations of the reference terms.
Different from the foregoing embodiment, by performing word segmentation and part-of-speech tagging on at least one reference text, a plurality of words marked with part-of-speech categories are obtained, and words with part-of-speech categories being preset categories are segmented word by word, so that the segmented words and the non-segmented words are utilized to obtain a plurality of reference words, and further, the method can be beneficial to obtaining the reference words with mixed granularity including words, entities and the like, and can be further beneficial to realizing the prediction of the mixed granularity of words, entities and the like.
In some disclosed embodiments, the overall semantic representation is obtained by fusing the word semantic representations of the respective words in the text to be completed; and/or, the target triplet includes the candidate triples ranked in the first two bits after sorting in descending order of similarity.
Different from the foregoing embodiment, obtaining the overall semantic representation by fusing the word semantic representations of the respective words in the text to be completed helps to improve the accuracy of the overall semantic representation; and setting the target triplet to include the candidate triples ranked in the first two bits after sorting in descending order of similarity can reduce the interference of candidate triples with lower similarity on subsequent completion prediction, reduce the complexity of fusing the target triples into the text to be completed, and avoid a decrease in completion prediction accuracy caused by too few target triples.
In some disclosed embodiments, the processor 1702 is configured to construct a knowledge tree using the target triples and convert the knowledge tree into a text sequence; the root node of the knowledge tree is a target entity, the leaf nodes of the knowledge tree are reference entities, the reference entities are other entities except the target entity in the target triplet, and the intermediate node between the root node and the leaf nodes is an entity relationship between the target entity and the reference entity; the processor 1702 is configured to fuse the text sequence into a target entity of the text to be completed, to obtain a fused text.
Different from the previous embodiment, the knowledge tree is constructed by using the target triplet, and the knowledge tree is converted into the text sequence, so that the text sequence is fused to the target entity of the text to be completed to obtain the fused text, thereby being beneficial to converting the target triplet into the text sequence with the structural characteristic by constructing the knowledge tree, further improving the readability of the fused text and being beneficial to improving the accuracy of the subsequent completion prediction.
In some disclosed embodiments, the knowledge tree is a binary tree, and the processor 1702 is configured to traverse the knowledge tree sequentially in an in-order traversal manner and use the combination of the sequentially traversed words as the text sequence; and/or, the processor 1702 is configured to replace the target entity in the text to be completed with the text sequence to obtain the fused text.
Different from the foregoing embodiment, the knowledge tree is a binary tree, the knowledge tree is traversed sequentially in an in-order traversal manner, and the combination of the sequentially traversed words is used as the text sequence, which can improve the readability of the text sequence and thus the accuracy of subsequent completion prediction; and the target entity in the text to be completed is replaced with the text sequence to obtain the fused text, which can reduce the complexity of fusing the text sequence into the text to be completed.
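A minimal sketch of the knowledge-tree conversion, using the reconstructed running example; the entity and relation names are illustrative.

```python
class Node:
    def __init__(self, word, left=None, right=None):
        self.word, self.left, self.right = word, left, right

def in_order(node):
    """In-order traversal: left subtree, then the node, then the right subtree."""
    if node is None:
        return []
    return in_order(node.left) + [node.word] + in_order(node.right)

# Root = target entity, leaf = reference entity, intermediate node = entity
# relationship, e.g. the triple (Mozart, birthplace, Salzburg) with target
# entity "Salzburg":
tree = Node("Salzburg", left=Node("birthplace", left=Node("Mozart")))
print(" ".join(in_order(tree)))  # -> "Mozart birthplace Salzburg"
```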
In some disclosed embodiments, the text library contains at least one reference text, and the reference text contains at least one reference word, and the processor 1702 is configured to sequentially encode, in position order, first digit orders for the words in the fused text that belong to the text to be completed, and second digit orders for the words in the fused text that belong to the target triplet, wherein the largest first digit order is smaller than the smallest second digit order; the processor 1702 is configured to perform second semantic extraction on the encoded fused text to obtain the individual semantic representation of each missing position; the processor 1702 is configured to obtain, for each missing position, at least one candidate word for the missing position using the individual semantic representation of the missing position and the word semantic representations of the respective reference words.
Different from the foregoing embodiment, first digit orders are sequentially encoded, in position order, for the words belonging to the text to be completed in the fused text, and second digit orders for the words belonging to the target triplet, with the largest first digit order smaller than the smallest second digit order, so that domain knowledge can be fused in without changing the original word order of the text to be completed during completion prediction. On this basis, second semantic extraction is performed on the encoded fused text to obtain the individual semantic representation of each missing position, and at least one candidate word of each missing position is obtained using the individual semantic representation of the missing position and the word semantic representations of the respective reference words, which is beneficial to improving the accuracy of the individual semantic representations and thus the accuracy of the completion prediction.
In some disclosed embodiments, the second semantic extraction is performed using a prediction network, the prediction network is obtained by training a preset neural network with sample text, and the sample text includes at least one sample missing position. The processor 1702 is configured to perform word segmentation and part-of-speech tagging on the original text to obtain a plurality of words tagged with part-of-speech categories; the processor 1702 is configured to split, word by word, the words whose part-of-speech category is a preset category, and to select a preset proportion of words from the split words and the unsplit words to be defaulted; the processor 1702 is configured to take the defaulted original text as the sample text and take the positions of the defaulted words as the sample missing positions.
Different from the foregoing embodiment, word segmentation and part-of-speech tagging are performed on the original text to obtain a plurality of words tagged with part-of-speech categories, the words whose part-of-speech category is a preset category are split word by word, a preset proportion of words is selected from the split words and the unsplit words to be defaulted, the defaulted original text is taken as the sample text, and the positions of the defaulted words are taken as the sample missing positions. In this way, sample text whose missing content covers the mixed granularity of characters, words, entities and the like can be constructed, which is beneficial to improving the adaptability of the trained prediction network to texts to be completed with missing content of mixed granularity, and thus to improving the accuracy of subsequent completion prediction.
In some disclosed embodiments, the processor 1702 is configured to pad a corresponding candidate word at each missing position to obtain a plurality of candidate texts of the text to be completed; the processor 1702 is configured to, for each candidate text, reversely order the words in the candidate text to obtain the reverse text of the candidate text, and obtain the final score of the candidate text based on the first score of the candidate text and the second score of the reverse text; the processor 1702 is configured to select one candidate text as the complete text of the text to be completed based on the final scores of the plurality of candidate texts.
Different from the foregoing embodiment, by supplementing a corresponding candidate word at each missing position, a plurality of candidate texts of the text to be complemented are obtained, and for each candidate text, the words in the candidate text are reversely ordered to obtain a reverse text of the candidate text, so that a final score of the candidate text is obtained based on a first score of the candidate text and a second score of the reverse text, and therefore, in the process of scoring the candidate text, the forward sequence and the reverse sequence of the candidate text are comprehensively considered to score, thereby being beneficial to improving the accuracy of the final score, and further being beneficial to improving the accuracy of the complete text in the process of obtaining the complete text based on the final score.
Referring to FIG. 18, FIG. 18 is a schematic diagram illustrating an embodiment of a storage device 1800. The storage device 1800 stores program instructions 1801 that can be executed by the processor, the program instructions 1801 being for implementing the steps in any of the text completion method embodiments described above.
By the aid of the scheme, text completion efficiency can be improved, and text completion cost can be reduced.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The foregoing description of various embodiments is intended to highlight differences between the various embodiments, which may be the same or similar to each other by reference, and is not repeated herein for the sake of brevity.
In the several embodiments provided in the present application, it should be understood that the disclosed methods and apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical, or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially, or in the part contributing to the prior art, or in whole or in part, in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.