CN116089597A - Statement recommendation method, device, equipment and storage medium - Google Patents

Statement recommendation method, device, equipment and storage medium

Info

Publication number
CN116089597A
Authority
CN
China
Prior art keywords
sentence
vector
trained
input content
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310085815.3A
Other languages
Chinese (zh)
Inventor
魏俊杰
李海
许志海
陈开杰
杨帆
王彬
张琳
王凯琳
梁建瑜
徐长飞
贺晓柏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Southern Power Grid Co Ltd
Original Assignee
China Southern Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Southern Power Grid Co Ltd
Priority to CN202310085815.3A
Publication of CN116089597A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • G06F40/169Annotation, e.g. comment data or footnotes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Abstract

The embodiment of the invention discloses a sentence recommendation method, device, equipment and storage medium, wherein the sentence recommendation method comprises the following steps: acquiring input content of a user, and performing topic identification on the input content through a trained document topic generation model to obtain a topic vector; converting the input content into a sentence ontology vector through a trained first BERT model; splicing the topic vector and the sentence ontology vector to form a target sentence vector; and inputting the target sentence vector into a trained second BERT model to obtain a feature vector of the input content, and determining the classical Chinese sentence matched with the input content based on the similarity between the feature vector of the input content and the feature vectors of the classical Chinese sentences in a classical Chinese sentence library. The technical scheme provided by the embodiment of the invention can accurately recommend classical Chinese sentences and lower the threshold for using classical Chinese.

Description

Statement recommendation method, device, equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of data processing, in particular to a statement recommendation method, device, equipment and storage medium.
Background
At present, when writing policy documents, regulations, investigation reports, meeting minutes, work plans, opinions and the like for government agencies, public institutions and large enterprises, it is often necessary to enrich the expression of an article by quoting classical Chinese sentences.
However, because users' reserves of classical Chinese knowledge are often insufficient, they frequently cannot draw on classical works to make their writing more vivid.
Disclosure of Invention
The embodiment of the invention provides a sentence recommendation method, device, equipment and storage medium, which can accurately recommend classical Chinese sentences and lower the threshold for using classical Chinese.
In a first aspect, an embodiment of the present invention provides a sentence recommendation method, including:
acquiring input content of a user, and performing topic identification on the input content through a trained document topic generation model to obtain a topic vector;
converting the input content into sentence ontology vectors by means of a trained first BERT model;
splicing the topic vector and the sentence ontology vector to form a target sentence vector;
and inputting the target sentence vector into a trained second BERT model to obtain a feature vector of the input content, and determining the classical Chinese sentence matched with the input content based on the similarity between the feature vector of the input content and the feature vectors of the classical Chinese sentences in a classical Chinese sentence library.
In a second aspect, an embodiment of the present invention provides a sentence recommendation apparatus, including
The topic vector recognition module is used for acquiring the input content of a user, and performing topic recognition on the input content through a trained document topic generation model to obtain a topic vector;
the sentence ontology vector recognition module is used for converting the input content into sentence ontology vectors through a trained first BERT model;
the splicing module is used for splicing the topic vector and the sentence ontology vector to form a target sentence vector;
and the matching module is used for inputting the target sentence vector into the trained second BERT model to obtain the feature vector of the input content, and determining the classical Chinese sentence matched with the input content based on the similarity between the feature vector of the input content and the feature vectors of the classical Chinese sentences in the classical Chinese sentence library.
In a third aspect, an embodiment of the present invention provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the methods provided by the embodiments of the present invention.
In a fourth aspect, embodiments of the present invention provide a computer readable storage medium storing computer instructions for causing a processor to execute a method provided by embodiments of the present invention.
According to the technical scheme provided by the embodiment of the invention, the input content of the user is acquired, topic identification is performed on the input content through the trained document topic generation model to obtain a topic vector, and the input content is converted into a sentence ontology vector through the trained first BERT model; the topic vector and the sentence ontology vector are spliced to form a target sentence vector; and the target sentence vector is input into the trained second BERT model to obtain the feature vector of the input content, and the classical Chinese sentence matched with the input content is determined based on the similarity between the feature vector of the input content and the feature vectors of the classical Chinese sentences in the classical Chinese sentence library, so that classical Chinese sentences can be recommended accurately and the threshold for using them is lowered.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1a is a flowchart of a sentence recommendation method according to an embodiment of the present invention;
FIG. 1b is a search result presentation schematic;
FIG. 2a is a flowchart of a sentence recommendation method according to an embodiment of the present invention;
FIG. 2b is a structural relationship diagram of an LDA model;
FIG. 2c is a schematic diagram of the BERT model structure;
FIG. 2d is a flowchart of the BERT model training process;
FIG. 3a is a block diagram of a sentence recommendation device according to an embodiment of the present invention;
fig. 3b is a structural architecture diagram of a sentence recommendation device according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Fig. 1a is a flowchart of a sentence recommendation method provided by an embodiment of the present invention. The method may be performed by a sentence recommendation device; the device may be implemented in software and/or hardware and configured in an electronic device such as a computer; and the method may be applied to scenarios of recommending classical Chinese sentences. As shown in fig. 1a, the technical solution provided by the embodiment of the present invention includes:
S110: and acquiring input content of a user, and performing topic identification on the input content through a trained document topic generation model to obtain a topic vector.
In the embodiment of the present invention, before S110, the method may further include training the document topic generation model (Latent Dirichlet Allocation, LDA), specifically: performing topic labeling on the interpretation content in the classical Chinese sentence library; and inputting the labeled interpretation content into the document topic generation model, and training the document topic generation model to obtain a trained document topic generation model. That is, the interpretation content of the classical Chinese sentences in the sentence library can be topic-labeled, and the LDA model can be trained on the labeled interpretation content to obtain a trained LDA model. In the embodiment of the invention, the user's input content may be a sentence or a word, and it need not be classical Chinese: it may also be a vernacular (modern Chinese) sentence or word. In the embodiment of the invention, the jieba tokenizer (jieba-analysis) can be used to segment the input content to obtain segmented feature vectors, and the segmented feature vectors are input into the trained document topic generation model (LDA) to obtain the topic vector of the input content.
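By way of illustration, the following Python sketch shows this segmentation-plus-topic-identification step, assuming a gensim LDA model and dictionary already trained on the labeled interpretation content; the names lda and dictionary and the topic count NUM_TOPICS are assumptions, not part of the patent.

```python
import jieba
import numpy as np
from gensim import corpora, models

NUM_TOPICS = 20  # assumed topic count; the patent does not fix a value

def topic_vector(text: str, lda: models.LdaModel,
                 dictionary: corpora.Dictionary) -> np.ndarray:
    """Segment the input with jieba and return a dense LDA topic vector."""
    tokens = list(jieba.cut(text))
    bow = dictionary.doc2bow(tokens)
    dense = np.zeros(NUM_TOPICS)
    # minimum_probability=0.0 forces a value for every topic dimension
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        dense[topic_id] = prob
    return dense
```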
S120: the input content is converted into sentence ontology vectors by means of a trained first BERT model.
In the embodiment of the invention, the input content can be mapped by the trained first BERT model to two vectors, namely a word vector and a position vector, and the word vector and the position vector together form the sentence ontology vector.
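A minimal sketch of this conversion using the Hugging Face transformers library is shown below; the bert-base-chinese checkpoint and the mean-pooling choice are assumptions, since the embodiment only specifies "a trained first BERT model".

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
first_bert = BertModel.from_pretrained("bert-base-chinese")

def sentence_ontology_vector(text: str) -> torch.Tensor:
    """Encode the input; BERT internally sums word and position embeddings."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = first_bert(**inputs)
    # mean-pool the last hidden state into one sentence ontology vector S'
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)
```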
In an embodiment of the present invention, before S120, training the first BERT model (Bidirectional Encoder Representations from Transformers) may further be included, specifically: training the first BERT model on the classical Chinese sentences in the classical Chinese sentence library to obtain a trained first BERT model.
S130: and splicing the topic vector and the sentence ontology vector to form a target sentence vector.
In the embodiment of the invention, the sentence ontology vector S′ is combined with the topic vector μ′ extracted by the LDA model to form a target sentence vector, defined as F′, where F′ = {S′, μ′}.
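In code, this splicing step amounts to a plain vector concatenation; the sketch below uses random placeholders for S′ and μ′ so that it stands alone, and the dimensions are assumptions.

```python
import numpy as np

s_prime = np.random.rand(768)   # stands in for the sentence ontology vector S'
mu_prime = np.random.rand(20)   # stands in for the LDA topic vector mu'
f_prime = np.concatenate([s_prime, mu_prime])  # target sentence vector F' = {S', mu'}
print(f_prime.shape)            # (788,)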
S140: and inputting the target sentence vector into a trained second BERT model to obtain a feature vector of the input content, and determining the classical Chinese sentence matched with the input content based on the similarity between the feature vector of the input content and the feature vectors of the classical Chinese sentences in the classical Chinese sentence library.
In an embodiment of the present invention, training the second BERT model may further be included before S140, specifically: inputting the classical Chinese sentences in the sentence library into the trained document topic model to obtain topic vectors of the training set; inputting the classical Chinese sentences into the trained first BERT model to obtain sentence ontology vectors of the training set; splicing the topic vectors and the sentence ontology vectors of the training set to obtain new sentence vectors of the training set; and inputting the new sentence vectors into the second BERT model and training it to obtain a trained second BERT model. That is, the new sentence vectors of the training set, formed from the topic vectors and the sentence ontology vectors, are input into the second BERT model, and the second BERT model is trained to obtain the trained second BERT model.
In the embodiment of the invention, the spliced target sentence vector is input into the second BERT model to obtain the feature vector of the input content; similarity is computed between this feature vector and the feature vectors of the classical Chinese sentences in the sentence library, the sentence feature vectors are screened based on the similarity, and the corresponding classical Chinese sentences are determined. The feature vectors of the sentences in the library can likewise be obtained from the second BERT model: specifically, the classical Chinese sentences in the library are vectorized by the LDA model and the first BERT model, spliced into new sentence vectors, and input into the second BERT model to obtain the sentence feature vectors.
In the embodiment of the invention, the similarity can be calculated from the distance between the two vectors; in cosine form, the calculation is:
f_i = ( F′(p) · F(t_i) ) / ( ||F′(p)|| · ||F(t_i)|| )
wherein F′(p) is the feature vector of the target sentence, F(t_i) is the feature vector of the i-th classical Chinese sentence in the sentence library, and f_i denotes the similarity between the two feature vectors. The classical Chinese sentences in the library can be ranked in descending order of similarity, the top 50 sentences can be selected, and the selected sentences can be displayed. For example, as shown in fig. 1b, content can be entered in an input box, and 50 recommended classical Chinese sentences appear in the search results.
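The retrieval step can be sketched as follows: normalized dot products give f_i against every library sentence and the top 50 are kept; the cosine form mirrors the reconstruction above and, like all names in the snippet, is an assumption.

```python
import numpy as np

def top_k_sentences(query_vec: np.ndarray, library_vecs: np.ndarray,
                    sentences: list[str], k: int = 50) -> list[tuple[str, float]]:
    """Rank library sentences by similarity to the query and keep the top k."""
    q = query_vec / np.linalg.norm(query_vec)
    lib = library_vecs / np.linalg.norm(library_vecs, axis=1, keepdims=True)
    sims = lib @ q                       # f_i for every library sentence
    order = np.argsort(sims)[::-1][:k]   # descending similarity
    return [(sentences[i], float(sims[i])) for i in order]
```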
In the related art, most work focuses on interpreting classical Chinese: artificial intelligence techniques are used to explain classical Chinese texts in modern vernacular Chinese, and machine reading and translation are performed on difficult words and characters. A related-art method and system for automatically generating topics for classical Chinese uses topic generation technology to automatically extract key content for topic training. A related-art method for classical Chinese machine reading comprehension based on multi-task joint training can accurately perform sentence segmentation on classical Chinese, take into account the coexistence of ancient and modern text, and establish a classical Chinese machine reading comprehension model based on multi-task joint training, realizing machine reading comprehension of classical Chinese. Such reading comprehension helps users understand classical Chinese content more intuitively and supports learning and deep understanding; however, two problems remain. The first is that memorization and accumulation of classical Chinese are not addressed: a user can only internalize classical Chinese material by reading a large number of classical works and accumulating knowledge over a long time. The second is that strong memory and comprehension are required: efficient recall, flexible application and correct quotation of classical works can only be achieved through long-term professional training. This places high demands on the user's reading comprehension and language application abilities, so a technology capable of lowering the threshold for using classical Chinese is urgently needed.
According to the technical scheme provided by the embodiment of the invention, the input content of the user is acquired, topic identification is performed on the input content through the trained document topic generation model to obtain a topic vector, and the input content is converted into a sentence ontology vector through the trained first BERT model; the topic vector and the sentence ontology vector are spliced to form a target sentence vector; and the target sentence vector is input into the trained second BERT model to obtain the feature vector of the input content, and the classical Chinese sentence matched with the input content is determined based on the similarity between the feature vector of the input content and the feature vectors of the classical Chinese sentences in the classical Chinese sentence library, so that classical Chinese sentences can be recommended accurately and the threshold for using them is lowered.
Fig. 2a is a flowchart of a sentence recommendation method according to an embodiment of the present invention, where in this embodiment, optionally, the method may further include:
collecting classical Chinese text data and processing the classical Chinese text data;
performing sentence segmentation on the processed classical Chinese text data to form independent classical Chinese sentences;
storing each independent classical Chinese sentence together with its corresponding article title, dynasty, author, imagery label, translation, appreciation and annotation content to form a classical Chinese sentence library.
Optionally, the method may further include:
performing topic labeling on the interpretation content in the classical Chinese sentence library;
and inputting the labeled interpretation content into the document topic generation model, and training the document topic generation model to obtain a trained document topic generation model.
Optionally, the method may further include:
training the first BERT model on the classical Chinese sentences in the classical Chinese sentence library to obtain a trained first BERT model.
Optionally, the method may further include:
inputting the classical Chinese sentences in the classical Chinese sentence library into the trained document topic model to obtain topic vectors in the training set;
inputting the classical Chinese sentences in the classical Chinese sentence library into the trained first BERT model to obtain sentence ontology vectors in the training set;
splicing the topic vectors in the training set and the sentence ontology vectors in the training set to obtain new sentence vectors in the training set;
and inputting the new sentence vector into the second BERT model, and training the second BERT model to obtain a trained second BERT model.
As shown in fig. 2a, the technical solution provided by the embodiment of the present invention includes:
S210: collecting classical Chinese text data and processing the classical Chinese text data.
In the embodiment of the invention, the classical Chinese text data may include classical literary works, and the collected content may include article content, authors, dynasties, translations, interpretation content and the like. In particular, classical works on the network may be collected by crawlers. Seven attributes are defined for each classical sentence, and 300,000 records are crawled using a Python web crawler; the crawled data can essentially represent the complete detailed information of classical Chinese sentences. Because classical sentences have many attributes, crawled data is easily affected by web advertisements and differing data types and is prone to blank labels; before the data is written into the sentence database (a MongoDB database), it needs to be cleaned, screened, filtered and supplemented to ensure that each sentence in the database has well-defined attributes, such as article title, dynasty, author, imagery labels, translation, appreciation and annotation content.
In one implementation manner of the embodiment of the present invention, optionally, the processing of the classical Chinese text data includes: performing data cleaning, traditional-to-simplified conversion, punctuation completion and interpretation-content completion on the classical Chinese data.
Data cleaning is mainly performed on the data acquired by the crawler. Crawled data is disordered and is often affected by web advertisements, blanks, inconsistent tag formats, extra web links and similar information. These conditions not only slow the matching speed of semantic search but also directly affect its accuracy and increase operational difficulty, so all such factors are removed as noise.
Attributes of classical sentences that are blank in the data are completed as far as possible; in particular, interpretation content that is blank needs to be completed by manual annotation.
Traditional characters are also converted into simplified characters. Specifically, when processing the classical Chinese data, traditional characters often appear and interfere with the normal sentence-search flow; optionally, a Chinese character traditional-simplified conversion system is adopted in this embodiment, so that traditional characters in the data can be directly converted.
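One possible tooling choice for this step is OpenCC, sketched below; the embodiment names only "a Chinese character traditional-simplified conversion system", so the specific library is an assumption.

```python
from opencc import OpenCC  # pip install opencc-python-reimplemented (assumed)

t2s = OpenCC("t2s")  # traditional -> simplified conversion profile
print(t2s.convert("落霞與孤鶩齊飛"))  # -> 落霞与孤鹜齐飞
```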
The classical Chinese data can then be written into a database. Specifically, a classical Chinese sentence database and a user database can be constructed; considering the characteristics of the database and of text data, the non-relational database MongoDB can optionally be selected.
The classical Chinese data is rechecked and verified: missing data is supplemented, duplicate information is deleted, erroneous information is corrected, and missing attribute labels are filled in promptly, and this series of operations ensures the consistency of the data written into the classical Chinese sentence library. When the sentence data is written into the database, each sentence is treated as an object, where a class is a unified abstract description of a group of objects having the same attributes and the same operations; in this embodiment each sentence has 7 attributes in total.
S220: performing sentence segmentation on the processed classical Chinese text data to form independent classical Chinese sentences, and storing each independent sentence together with its corresponding article title, dynasty, author, imagery label, translation, appreciation and annotation content to form a classical Chinese sentence library.
In the embodiment of the invention, the classical Chinese text data is segmented into independent sentences, each sentence being a single data record corresponding to an article title, dynasty, author, imagery label, translation, appreciation and interpretation content, which are processed accordingly after segmentation. The sentences may be stored in a MongoDB database to form the classical Chinese sentence library. Through four stages (data set definition, data sourcing, data preprocessing and data analysis), a data set of 70,206 valid classical Chinese sentences is established, forming a sentence library in which each sentence carries the 7 attributes of article title, dynasty, author, imagery label, translation, appreciation and interpretation.
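A sketch of the segmentation-and-storage flow with pymongo follows; the connection string, collection name and the field names chosen for the seven attributes are illustrative assumptions.

```python
import re
from pymongo import MongoClient

collection = MongoClient("mongodb://localhost:27017")["classical_chinese"]["sentences"]

def store_sentences(article: dict) -> None:
    """Split an article on end-of-sentence punctuation; one record per sentence."""
    for sent in re.split(r"(?<=[。！？])", article["text"]):
        if not sent.strip():
            continue
        collection.insert_one({
            "sentence": sent,
            "article_title": article["title"],
            "dynasty": article["dynasty"],
            "author": article["author"],
            "imagery_label": article["imagery_label"],
            "translation": article["translation"],
            "appreciation": article["appreciation"],
            "annotation": article["annotation"],
        })
```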
S230: performing topic labeling on the interpretation content in the classical Chinese sentence library, inputting the labeled interpretation content into the document topic generation model, and training the document topic generation model to obtain a trained document topic generation model.
In the embodiment of the invention, specifically, the interpretation content is segmented with jieba and its topic (TOPIC) is identified using the LDA model; the identified content is represented as a topic vector. Specifically, the jieba lexicon is used to segment the interpretation content corresponding to each classical Chinese sentence, and the word vector of each segmented word is denoted d_i. Setting the interpretation feature vector of a classical Chinese sentence to D, there is:
D = {d_1, d_2, d_3, d_4, d_5, d_6, …, d_m}
where m is the number of word vectors in the interpretation sentence.
In this embodiment, a three-layer Bayesian probabilistic LDA model over sentence interpretations, topics and word segments is applied to construct the model structure. The structural relationship of the LDA model may refer to fig. 2b, and the topic vector distribution of the interpretation sentences in the LDA model may be determined by related-art methods, where parameter estimation in the model requires repeated training and adjustment on the sample data of the interpretation sentences in the classical Chinese sentence library. The LDA model is pre-trained on the topic-labeled interpretation content in the sentence library. Setting the characteristic topic vector of the interpretation content to μ, the topic of the interpretation content is obtained through high-frequency word calculation and topic training, where the topic vector corresponding to interpretation sentence i is μ_i.
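The pre-training of the LDA model on the segmented interpretation texts could look as follows; the corpus contents, topic count and pass count are assumptions of the sketch.

```python
import jieba
from gensim import corpora, models

interpretations = ["...interpretation text 1...", "...interpretation text 2..."]
docs = [list(jieba.cut(text)) for text in interpretations]
dictionary = corpora.Dictionary(docs)              # word-to-id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = models.LdaModel(corpus=corpus, id2word=dictionary,
                      num_topics=20, passes=10)    # trained LDA model
```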
S240: training the first BERT model on the classical Chinese sentences in the classical Chinese sentence library to obtain a trained first BERT model.
In this embodiment, the first BERT model may be used to perform word segmentation and feature-vectorization training on the classical Chinese sentences, the training output being labeled as the sentence ontology vectors of the training set.
Specifically, the first BERT model is trained on the classical Chinese sentences; the sentence ontology vector is denoted S, with word vectors Q and word position vectors L_p. For a first-BERT-model input sentence ontology vector S composed of n words, there is:
S = {Q_1, Q_2, Q_3, Q_4, Q_5, Q_6, …, Q_n}
in the present embodiment, the representation is performed using root embedding and position embedding. Semantic information representing the words is embedded through the root of words, and the position information of the words in the original sentences is reserved through position embedding.
In this embodiment, for root embedding, the input dialect Wen Yugou is specifically segmented and mapped, and the sentence is segmented by using the JEBA embedding with 3 ten thousand word units embedded therein. For example, "eucrypti and solitary canthus Ji Fei" is partitioned into 4 word units of "eucrypti/and/solitary canthus/Ji Fei", where "/" denotes separators. Each word unit is mapped to a vector of a certain dimension as an initial word embedding for that layer.
In this embodiment, for position embedding, in particular, the position of the word unit in the sentence, i.e. the position vector, is marked. The initial position vector calculation method is as follows:
PE_{t,2k} = sin( t / 10000^{2k/N} )
PE_{t,2k-1} = cos( t / 10000^{2k/N} )
wherein PE_{t,2k} and PE_{t,2k-1} are the position feature values of the N-dimensional word vector at position t in dimensions 2k and 2k-1 respectively; the even dimensions 2k are calculated with the sine function and the odd dimensions 2k-1 with the cosine function.
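The encoding above can be computed directly; the numpy sketch below follows the reconstructed formulas, with sine on even dimensions and cosine on odd dimensions.

```python
import numpy as np

def position_embedding(t: int, n_dim: int) -> np.ndarray:
    """Sinusoidal position vector for a token at position t in an n_dim space."""
    pe = np.zeros(n_dim)
    for k in range(0, n_dim, 2):
        angle = t / (10000 ** (k / n_dim))
        pe[k] = np.sin(angle)          # even dimension: sine
        if k + 1 < n_dim:
            pe[k + 1] = np.cos(angle)  # odd dimension: cosine
    return pe
```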
In this embodiment, the BERT model training process is divided into two stages: pre-training and fine-tuning. The model adopts a Transformer feature extractor to extract text information bidirectionally. The BERT model structure may refer to fig. 2c.
In this embodiment, the pre-training stage uses unsupervised learning on a large-scale dataset to obtain a model with strong performance, and context-dependent dynamic word vectors are used to characterize the semantics of polysemous words in different contexts. This stage includes 2 tasks: masked language modeling (Masked Language Model, MLM) and next sentence prediction (Next Sentence Prediction, NSP). The MLM task randomly masks 15% of the words in the input corpus, and the model predicts the masked words from context. The NSP task determines whether a sentence pair consists of consecutive sentences: the training data randomly extracts consecutive sentence pairs A, B from the input corpus, where 50% of the sentences B are retained and have the IsNext relation, and the other 50% of sentences B are randomly replaced and have the NotNext relation. The likelihood functions of the two tasks are added to form the model pre-training loss function. The process is shown in fig. 2d. In the fine-tuning stage, the network parameters obtained by pre-training are loaded to initialize the network, and networks of different structures are attached to the BERT output layer according to task requirements for supervised learning, which improves the network training speed and, to a certain extent, avoids the risk of overfitting when training on a small-scale dataset.
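The 15% masking rule of the MLM task can be sketched as follows (NSP pair construction is analogous); the mask-token string and the uniform per-token sampling are assumptions of the sketch.

```python
import random

def mask_tokens(tokens: list[str], mask_token: str = "[MASK]",
                mask_prob: float = 0.15) -> tuple[list[str], list[int]]:
    """Mask roughly 15% of tokens; return the masked sequence and positions."""
    masked, positions = list(tokens), []
    for i in range(len(tokens)):
        if random.random() < mask_prob:
            masked[i] = mask_token
            positions.append(i)
    return masked, positions
```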
In this embodiment, for the first BERT model, a normalization operation is performed through a Softmax function to obtain the prediction result of the sentence intent. Let P denote the probability that the sentence ontology vector S has the Q-th intent type; the calculation method is:
P(S|Q) = exp(λ_Q · S + k_Q) / Σ_{j=1..M} exp(λ_j · S + k_j)
wherein λ and k correspond to the weight matrix and the bias term respectively, and M represents the number of text labels, i.e. the number of intent label categories.
In this embodiment, the model is optimized using a gradient descent algorithm; the model loss is calculated using a cross-entropy function and the model parameters are updated. The loss function L is calculated as follows:
L = -Σ_i x_i · log(P_i) + φ · ||λ||²
wherein x_i represents the value of the sentence ontology vector S in dimension i, P_i the corresponding predicted probability, and φ the L2 regularization parameter. Through training, the model gradually converges and the parameter values of λ and k are obtained, achieving parameter adjustment and yielding a trained first BERT model.
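A minimal PyTorch sketch of this optimization follows: a linear head plays the role of the weight matrix λ and bias k, cross-entropy gives the loss, and weight decay stands in for the φ regularization term; the label count and learning rate are assumptions.

```python
import torch
import torch.nn as nn

M = 7                          # assumed number of intent labels
head = nn.Linear(768, M)       # weight matrix lambda and bias term k
optimizer = torch.optim.SGD(head.parameters(), lr=0.01,
                            weight_decay=1e-4)  # L2 term, i.e. the phi penalty
criterion = nn.CrossEntropyLoss()  # softmax + cross entropy in one call

def train_step(features: torch.Tensor, labels: torch.Tensor) -> float:
    """One gradient-descent update on a batch of sentence feature vectors."""
    optimizer.zero_grad()
    logits = head(features)            # lambda * S + k
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```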
S250: inputting the classical Chinese sentences in the classical Chinese sentence library into the trained document topic model to obtain topic vectors of the training set, and inputting the classical Chinese sentences into the trained first BERT model to obtain sentence ontology vectors of the training set.
In the embodiment of the invention, a classical Chinese sentence is input into the trained LDA model to obtain a topic vector μ, which serves as a topic vector of the training set; the sentence is also input into the trained first BERT model and mapped into 2 vector representations, namely the word vector and the position vector L_p. After processing by the first BERT model, the sentence ontology vector S = {Q_1, Q_2, Q_3, Q_4, Q_5, Q_6, …, Q_n} is obtained and serves as a sentence ontology vector of the training set.
S260: and splicing the topic vectors in the training set and the sentence ontology vectors in the training set to obtain new sentence vectors of the training set, inputting the new sentence vectors into the second BERT model, and training the second BERT model to obtain a trained second BERT model.
In the embodiment of the invention, the sentence ontology vector S of the training set is combined with the topic vector μ extracted by the LDA model (serving as the training-set topic vector) to form a new vector; the combined new sentence vector is defined as F, where F = {S, μ}.
In an embodiment of the present invention, optionally, inputting the new sentence vector into the second BERT model, and training the second BERT model includes:
inputting the new sentence vector sample into a second BERT model, and performing normalization operation;
adjusting parameters of the second BERT model based on the following loss function L':
L′ = -Σ_i y_i · log P(F|Q) + η · ||α||²
wherein,
P(F|Q) = exp(α_Q · F + d_Q) / Σ_{j=1..M} exp(α_j · F + d_j)
and P(F|Q) represents the probability that the new sentence vector F has the Q-th intent type; M represents the number of text labels; α and d represent the weight matrix and the bias term respectively; y_i represents the value of the new sentence vector F in dimension i; and η represents the regularization parameter of L′.
Specifically, the new sentence vector input into the second BERT model is F. The second BERT model is applied to the new sentence vector F, and a normalization operation is performed through a Softmax function to obtain the prediction result of the sentence intent. The probability of F having the Q-th intent type is calculated as:
P(F|Q) = exp(α_Q · F + d_Q) / Σ_{j=1..M} exp(α_j · F + d_j)
the loss function L' is calculated as follows:
L′ = -Σ_i y_i · log P(F|Q) + η · ||α||²
After the new sentence vector F fused with the LDA topic vector is processed, the Transformer encoder can learn and store not only the semantic relationships and text-structure information of the classical Chinese sentences but also their topic information. After training, the second BERT model gradually converges and the parameter values of α and d are obtained, realizing parameter adjustment of the second BERT model and yielding a trained second BERT model.
S270: and acquiring input content of a user, and performing topic identification on the input content through a trained document topic generation model to obtain a topic vector.
S280: the input content is converted into sentence ontology vectors by means of a trained first BERT model.
S290: and splicing the topic vector and the sentence ontology vector to form a target sentence vector.
S291: and inputting the target sentence vector into a trained second BERT model to obtain a feature vector of the input content, and determining the classical Chinese sentence matched with the input content based on the similarity between the feature vector of the input content and the feature vectors of the classical Chinese sentences in the classical Chinese sentence library.
The description of S270 to S291 may refer to the above-described embodiments.
According to the technical scheme provided by the embodiment of the invention, through machine reading comprehension and intelligent recommendation technology, the problem of memorizing and accumulating classical Chinese can be solved: a user can effectively retrieve classical Chinese sentence material without reading a large number of classical works or accumulating knowledge over a long time; efficient understanding of classical Chinese can be achieved, and efficient recall, flexible application and correct quotation of classical Chinese can be realized without long-term professional training; and the user's intent can be efficiently identified, so that recommended sentences can be accurately located in the massive classical Chinese sentence library.
Fig. 3a is a block diagram of a sentence recommendation device according to an embodiment of the present invention; as shown in fig. 3a, the device includes a topic vector recognition module 310, a sentence ontology vector recognition module 320, a splicing module 330 and a matching module 340.
The topic vector recognition module 310 is configured to obtain input content of a user, and perform topic recognition on the input content through a trained document topic generation model to obtain a topic vector;
a sentence ontology vector recognition module 320, configured to convert the input content into sentence ontology vectors through a trained first BERT model;
a splicing module 330, configured to splice the topic vector and the sentence ontology vector to form a target sentence vector;
and a matching module 340, configured to input the target sentence vector into a trained second BERT model, obtain a feature vector of the input content, and determine the classical Chinese sentence matched with the input content based on the similarity between the feature vector of the input content and the feature vectors of the classical Chinese sentences in the classical Chinese sentence library.
Optionally, the first training module of the device is configured to:
performing topic labeling on the interpretation content in the classical Chinese sentence library;
and inputting the labeled interpretation content into the document topic generation model, and training the document topic generation model to obtain a trained document topic generation model.
Optionally, the second training module is configured to:
train the first BERT model on the classical Chinese sentences in the classical Chinese sentence library to obtain a trained first BERT model.
Optionally, the third training module is configured to:
inputting the classical Chinese sentences in the classical Chinese sentence library into the trained document topic model to obtain topic vectors in the training set;
inputting the classical Chinese sentences in the classical Chinese sentence library into the trained first BERT model to obtain sentence ontology vectors in the training set;
splicing the topic vectors in the training set and the sentence ontology vectors in the training set to obtain new sentence vectors in the training set;
and inputting the new sentence vector into the second BERT model, and training the second BERT model to obtain a trained second BERT model.
Optionally, inputting the new sentence vector into the second BERT model and training the second BERT model includes:
inputting the new sentence vector sample into a second BERT model, and performing normalization operation;
adjusting parameters of the second BERT model based on the following loss function L':
L′ = -Σ_i y_i · log P(F|Q) + η · ||α||²
wherein,
P(F|Q) = exp(α_Q · F + d_Q) / Σ_{j=1..M} exp(α_j · F + d_j)
and P(F|Q) represents the probability that the new sentence vector F has the Q-th intent type; M represents the number of text labels; α and d represent the weight matrix and the bias term respectively; y_i represents the value of the new sentence vector F in dimension i; and η represents the regularization parameter of L′.
Optionally, the apparatus further comprises a sentence library construction module, configured to:
collect classical Chinese text data and process the classical Chinese text data;
perform sentence segmentation on the processed classical Chinese text data to form independent classical Chinese sentences;
store each independent classical Chinese sentence together with its corresponding article title, dynasty, author, imagery label, translation, appreciation and annotation content to form a classical Chinese sentence library.
Optionally, the processing of the classical Chinese text data includes:
performing data cleaning, traditional-to-simplified conversion, punctuation completion and interpretation-content completion on the classical Chinese data.
The architecture of the device provided by the embodiment of the present invention may also refer to fig. 3b.
The device provided by the embodiment of the invention can execute the method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of executing the method.
Fig. 4 shows a schematic diagram of the structure of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic equipment may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 4, the electronic device 10 includes at least one processor 11, and a memory, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, etc., communicatively connected to the at least one processor 11, in which the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data required for the operation of the electronic device 10 may also be stored. The processor 11, the ROM 12 and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the respective methods and processes described above, such as the sentence recommendation method.
In some embodiments, the sentence recommendation method may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the sentence recommendation method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the sentence recommendation method in any other suitable way (e.g. by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special purpose or general-purpose programmable processor that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. A client and a server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in the cloud computing service system and overcomes the defects of high management difficulty and weak service scalability found in traditional physical hosts and VPS services.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (10)

1. A sentence recommendation method, comprising:
acquiring input content of a user, and performing topic identification on the input content through a trained document topic generation model to obtain a topic vector;
converting the input content into sentence ontology vectors by means of a trained first BERT model;
splicing the topic vector and the sentence ontology vector to form a target sentence vector;
and inputting the target sentence vector into a trained second BERT model to obtain a feature vector of the input content, and determining the classical Chinese sentence matched with the input content based on the similarity between the feature vector of the input content and the feature vectors of the classical Chinese sentences in a classical Chinese sentence library.
2. The method as recited in claim 1, further comprising:
performing topic labeling on the interpretation content in the classical Chinese sentence library;
and inputting the labeled interpretation content into the document topic generation model, and training the document topic generation model to obtain a trained document topic generation model.
3. The method as recited in claim 2, further comprising:
training the first BERT model on the classical Chinese sentences in the classical Chinese sentence library to obtain a trained first BERT model.
4. A method according to claim 3, further comprising:
inputting the classical Chinese sentences in the classical Chinese sentence library into the trained document topic model to obtain topic vectors in the training set;
inputting the classical Chinese sentences in the classical Chinese sentence library into the trained first BERT model to obtain sentence ontology vectors in the training set;
Splicing the topic vectors in the training set and the sentence ontology vectors in the training set to obtain new sentence vectors in the training set;
and inputting the new sentence vector into the second BERT model, and training the second BERT model to obtain a trained second BERT model.
5. The method of claim 4, wherein the inputting the new sentence vector into the second BERT model, training the second BERT model, comprises:
inputting the new sentence vector sample into a second BERT model, and performing normalization operation;
adjusting parameters of the second BERT model based on the following loss function L':
L′ = -Σ_i y_i · log P(F|Q) + η · ||α||²
wherein,
P(F|Q) = exp(α_Q · F + d_Q) / Σ_{j=1..M} exp(α_j · F + d_j)
and P(F|Q) represents the probability that the new sentence vector F has the Q-th intent type; M represents the number of text labels; α and d represent the weight matrix and the bias term respectively; y_i represents the value of the new sentence vector F in dimension i; and η represents the regularization parameter of L′.
6. The method as recited in claim 2, further comprising:
collecting classical Chinese text data and processing the classical Chinese text data;
performing sentence segmentation on the processed classical Chinese text data to form independent classical Chinese sentences;
storing each independent classical Chinese sentence together with its corresponding article title, dynasty, author, imagery label, translation, appreciation and annotation content to form a classical Chinese sentence library.
7. The method of claim 6, wherein the processing of the classical Chinese text data comprises:
performing data cleaning, traditional-to-simplified conversion, punctuation completion and interpretation-content completion on the classical Chinese data.
8. A sentence recommendation device is characterized by comprising
The topic vector recognition module is used for acquiring the input content of a user, and performing topic recognition on the input content through a trained document topic generation model to obtain a topic vector;
the sentence ontology vector recognition module is used for converting the input content into sentence ontology vectors through a trained first BERT model;
the splicing module is used for splicing the topic vector and the sentence ontology vector to form a target sentence vector;
and the matching module is used for inputting the target sentence vector into the trained second BERT model to obtain the feature vector of the input content, and determining the classical Chinese sentence matched with the input content based on the similarity between the feature vector of the input content and the feature vectors of the classical Chinese sentences in the classical Chinese sentence library.
9. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
10. A computer readable storage medium storing computer instructions for causing a processor to perform the method of any one of claims 1-7.
CN202310085815.3A 2023-01-17 2023-01-17 Statement recommendation method, device, equipment and storage medium Pending CN116089597A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310085815.3A CN116089597A (en) 2023-01-17 2023-01-17 Statement recommendation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310085815.3A CN116089597A (en) 2023-01-17 2023-01-17 Statement recommendation method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116089597A true CN116089597A (en) 2023-05-09

Family

ID=86186746

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310085815.3A Pending CN116089597A (en) 2023-01-17 2023-01-17 Statement recommendation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116089597A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination