CN116089597A - Statement recommendation method, device, equipment and storage medium - Google Patents
- Publication number
- CN116089597A CN116089597A CN202310085815.3A CN202310085815A CN116089597A CN 116089597 A CN116089597 A CN 116089597A CN 202310085815 A CN202310085815 A CN 202310085815A CN 116089597 A CN116089597 A CN 116089597A
- Authority
- CN
- China
- Prior art keywords
- sentence
- vector
- trained
- input content
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
- G06F40/169—Annotation, e.g. comment data or footnotes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Abstract
The embodiment of the invention discloses a sentence recommendation method, apparatus, device and storage medium, wherein the sentence recommendation method comprises the following steps: acquiring input content of a user, and performing topic identification on the input content through a trained document topic generation model to obtain a topic vector; converting the input content into a sentence ontology vector by means of a trained first BERT model; splicing the topic vector and the sentence ontology vector to form a target sentence vector; and inputting the target sentence vector into a trained second BERT model to obtain a feature vector of the input content, and determining the classical Chinese sentence matched with the input content based on the similarity between the feature vector of the input content and the feature vectors of the classical Chinese sentences in a classical Chinese sentence library. The technical scheme provided by the embodiment of the invention can accurately provide classical Chinese sentences and lower the threshold for using classical Chinese.
Description
Technical Field
The embodiment of the invention relates to the technical field of data processing, in particular to a statement recommendation method, device, equipment and storage medium.
Background
At present, when writing documents for government agencies, public institutions and large enterprises, such as policy documents, regulations, investigation reports, meeting minutes, work plans and opinions, it is often necessary to enrich the expression of an article by quoting classical Chinese sentences.
However, because a user's reserve of classical Chinese knowledge is often insufficient, the user frequently cannot draw on classical works during the writing process to make the manuscript more vivid.
Disclosure of Invention
The embodiment of the invention provides a sentence recommendation method, apparatus, device and storage medium, which can accurately provide classical Chinese sentences and lower the threshold for using classical Chinese.
In a first aspect, an embodiment of the present invention provides a sentence recommendation method, including:
acquiring input content of a user, and performing topic identification on the input content through a trained document topic generation model to obtain a topic vector;
converting the input content into sentence ontology vectors by means of a trained first BERT model;
splicing the topic vector and the sentence ontology vector to form a target sentence vector;
and inputting the target sentence vector into a trained second BERT model to obtain a feature vector of the input content, and determining the classical Chinese sentence matched with the input content based on the similarity between the feature vector of the input content and the feature vectors of the classical Chinese sentences in a classical Chinese sentence library.
In a second aspect, an embodiment of the present invention provides a sentence recommendation apparatus, including:
The topic vector recognition module is used for acquiring the input content of a user, and performing topic recognition on the input content through a trained document topic generation model to obtain a topic vector;
the sentence ontology vector recognition module is used for converting the input content into sentence ontology vectors through a trained first BERT model;
the splicing module is used for splicing the topic vector and the sentence ontology vector to form a target sentence vector;
and the matching module is used for inputting the target sentence vector into the trained second BERT model to obtain the feature vector of the input content, and determining the classical Chinese sentence matched with the input content based on the similarity between the feature vector of the input content and the feature vectors of the classical Chinese sentences in the classical Chinese sentence library.
In a third aspect, an embodiment of the present invention provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the methods provided by the embodiments of the present invention.
In a fourth aspect, embodiments of the present invention provide a computer readable storage medium storing computer instructions for causing a processor to execute a method provided by embodiments of the present invention.
According to the technical scheme provided by the embodiment of the invention, the input content of the user is acquired, topic identification is performed on the input content through the trained document topic generation model to obtain a topic vector, and the input content is converted into a sentence ontology vector through the trained first BERT model; the topic vector and the sentence ontology vector are spliced to form a target sentence vector; and the target sentence vector is input into the trained second BERT model to obtain the feature vector of the input content, and the classical Chinese sentence matched with the input content is determined based on the similarity between the feature vector of the input content and the feature vectors of the classical Chinese sentences in the classical Chinese sentence library. In this way, classical Chinese sentences can be accurately provided, and the threshold for using them is lowered.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present invention, and that other drawings may be obtained from them by a person skilled in the art without inventive effort.
FIG. 1a is a flowchart of a sentence recommendation method according to an embodiment of the present invention;
FIG. 1b is a search result presentation schematic;
FIG. 2a is a flowchart of a sentence recommendation method according to an embodiment of the present invention;
FIG. 2b is a structural relationship diagram of an LDA model;
FIG. 2c is a schematic diagram of the BERT model structure;
FIG. 2d is a flowchart of the BERT model training process;
FIG. 3a is a block diagram of a sentence recommendation device according to an embodiment of the present invention;
fig. 3b is a structural architecture diagram of a sentence recommendation device according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Fig. 1a is a flowchart of a sentence recommendation method provided by an embodiment of the present invention. The method may be performed by a sentence recommendation apparatus, which may be implemented in software and/or hardware and configured in an electronic device such as a computer. The method may be applied in scenarios where classical Chinese sentences are recommended. As shown in fig. 1a, the technical solution provided by the embodiment of the present invention includes:
S110: and acquiring input content of a user, and performing topic identification on the input content through a trained document topic generation model to obtain a topic vector.
In the embodiment of the present invention, before S110, the method may further include training the document topic generation model (Latent Dirichlet Allocation, LDA), specifically: performing topic labeling on the interpretation content in the classical Chinese sentence library; and inputting the labeled interpretation content into the document topic generation model and training it to obtain a trained document topic generation model. That is, the interpretation content of the classical Chinese sentences in the library is topic-labeled, and the LDA model is trained on the labeled interpretation content to obtain the trained LDA model. In the embodiment of the invention, the input content of the user can be a sentence or a word, and it need not be classical Chinese: it may also be a vernacular (modern Chinese) sentence or word. The jieba word segmenter (jieba-analysis) can be used to segment the input content to obtain segmented feature vectors, which are input into the trained LDA model to obtain the topic vector of the input content.
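The shape of this step can be sketched as follows. In the described system, jieba would perform the segmentation and a trained LDA model would produce the topic distribution; here a whitespace tokenizer and a toy word-to-topic weight table are illustrative stand-ins for both, so the function names and table are assumptions, not the patent's implementation.

```python
# Sketch of S110: segment the input, then map it to a topic distribution.
# The tokenizer and topic table below stand in for jieba and a trained LDA model.

def get_topic_vector(text, topic_word_weights, num_topics):
    """Return a normalized topic distribution for `text`."""
    tokens = text.split()  # stand-in for jieba segmentation
    scores = [0.0] * num_topics
    for tok in tokens:
        for topic_id, weight in topic_word_weights.get(tok, {}).items():
            scores[topic_id] += weight
    total = sum(scores)
    if total == 0:  # all words unknown: fall back to a uniform distribution
        return [1.0 / num_topics] * num_topics
    return [s / total for s in scores]

# Hypothetical learned table: word -> {topic_id: weight}
table = {"moon": {0: 2.0}, "river": {0: 1.0, 1: 1.0}, "duty": {1: 3.0}}
vec = get_topic_vector("moon river", table, num_topics=2)
```

The output is a fixed-length topic vector, which is what the later splicing step consumes.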
S120: the input content is converted into sentence ontology vectors by means of a trained first BERT model.
In the embodiment of the invention, the input content can be mapped to two vectors, namely a word vector and a position vector, through the trained first BERT model, and the word vector and the position vector form a sentence ontology vector.
In an embodiment of the present invention, before S120, the method may further include training the first BERT model (Bidirectional Encoder Representations from Transformers), specifically: training the first BERT model on the classical Chinese sentences in the classical Chinese sentence library to obtain a trained first BERT model.
S130: and splicing the theme vector and the sentence ontology vector to form a target sentence vector.
In the embodiment of the invention, the sentence ontology vector S′ is concatenated with the topic vector μ′ extracted by the LDA model to form the target sentence vector, defined as F′, so that F′ = {S′, μ′}.
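The splice in S130 is a plain vector concatenation; a minimal sketch (the dimension values are illustrative, not taken from the patent):

```python
# Sketch of S130: concatenate the sentence ontology vector S' and the
# topic vector mu' into the target sentence vector F' = {S', mu'}.

def splice(sentence_vec, topic_vec):
    return list(sentence_vec) + list(topic_vec)

s_prime = [0.1, 0.2, 0.3, 0.4]   # illustrative sentence ontology vector
mu_prime = [0.75, 0.25]          # illustrative topic vector
f_prime = splice(s_prime, mu_prime)
```

The resulting vector's dimensionality is simply the sum of the two input dimensionalities, which the second BERT model must be configured to accept.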
S140: and inputting the target sentence vector into a trained second BERT model to obtain a feature vector of the input content, and determining the classical Chinese sentence matched with the input content based on the similarity between the feature vector of the input content and the feature vectors of the classical Chinese sentences in the classical Chinese sentence library.
In an embodiment of the present invention, training the second BERT model may be further included before S140, specifically: inputting the classical Chinese sentences in the classical Chinese sentence library into the trained document topic model to obtain the topic vectors of a training set; inputting the classical Chinese sentences into the trained first BERT model to obtain the sentence ontology vectors of the training set; splicing the topic vectors and sentence ontology vectors of the training set to obtain new sentence vectors; and inputting the new sentence vectors into the second BERT model and training it to obtain a trained second BERT model.
In the embodiment of the invention, the spliced target sentence vector is input into the second BERT model to obtain the feature vector of the input content; similarity is then computed between this feature vector and the feature vectors of the classical Chinese sentences in the library, the sentence feature vectors are screened based on the similarity, and the corresponding classical Chinese sentences are determined. The feature vectors of the classical Chinese sentences in the library can likewise be obtained through the second BERT model. Specifically, each classical Chinese sentence in the library is vectorized through the LDA model and the first BERT model, the results are spliced into a new sentence vector, and the new sentence vector is input into the second BERT model to obtain the feature vector of that sentence.
In the embodiment of the invention, the similarity can be computed from the distance between two vectors, with the formula:

f_i = ||F′(p) − F(t_i)||

where F′(p) is the feature vector of the target sentence and F(t_i) is the feature vector of the i-th classical Chinese sentence in the library. The distance between the vectors serves as the similarity between the two feature vectors, denoted f_i. The classical Chinese sentences in the library can be ranked by similarity, the top 50 sentences in the ranking can be selected, and the selected sentences can be displayed. For example, as shown in fig. 1b, content can be entered in an input box, and 50 recommended classical Chinese sentences appear in the search results.
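The retrieval step can be sketched as a Euclidean-distance nearest-neighbor ranking over the precomputed library vectors (the patent keeps the top 50; a top-2 cutoff and toy vectors are used here for brevity):

```python
import math

# Sketch of S140's retrieval: rank library sentences by vector distance to
# the query feature vector and keep the k closest (the patent keeps 50).

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query_vec, library, k):
    """library: list of (sentence, feature_vector); smaller distance = closer match."""
    ranked = sorted(library, key=lambda item: euclidean(query_vec, item[1]))
    return [sentence for sentence, _ in ranked[:k]]

lib = [("sentence A", [1.0, 0.0]),
       ("sentence B", [0.0, 1.0]),
       ("sentence C", [0.9, 0.1])]
result = top_k([1.0, 0.0], lib, k=2)
```

At library scale, a brute-force scan like this would typically be replaced by an approximate nearest-neighbor index, but the ranking logic is the same.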
In the related art, most work focuses on interpreting classical Chinese texts: artificial intelligence technology is used to explain them in modern vernacular Chinese and to perform machine reading and translation of difficult words and characters. One related method automatically generates topics for classical Chinese texts, automatically extracting key content for topic training with a topic generation technique. Another related method, machine reading comprehension of classical Chinese based on multi-task joint training, can accurately perform sentence breaking on classical Chinese texts, handles the case where ancient and modern text coexist, and establishes a multi-task jointly trained reading comprehension model to realize machine reading and understanding of classical Chinese. Such reading comprehension helps users understand the content of classical Chinese more intuitively and supports learning and deep understanding. However, two problems remain. First, the memorization and retention of classical Chinese material is not addressed: a user can only internalize it through extensive reading of classical works and long-term accumulation. Second, strong memory and comprehension are required; efficient recall, flexible application and correct quotation of classical works can only be achieved through long professional training.
This places high demands on the user's reading comprehension and application abilities for classical Chinese, and there is an urgent need for a technology capable of lowering the threshold for its use.
According to the technical scheme provided by the embodiment of the invention, the input content of the user is acquired, topic identification is performed on the input content through the trained document topic generation model to obtain a topic vector, and the input content is converted into a sentence ontology vector through the trained first BERT model; the topic vector and the sentence ontology vector are spliced to form a target sentence vector; and the target sentence vector is input into the trained second BERT model to obtain the feature vector of the input content, and the classical Chinese sentence matched with the input content is determined based on the similarity between the feature vector of the input content and the feature vectors of the classical Chinese sentences in the classical Chinese sentence library. In this way, classical Chinese sentences can be accurately provided, and the threshold for using them is lowered.
Fig. 2a is a flowchart of a sentence recommendation method according to an embodiment of the present invention, where in this embodiment, optionally, the method may further include:
collecting classical Chinese text data and processing the classical Chinese text data;
performing sentence breaking on the processed classical Chinese text data to form independent classical Chinese sentences;
storing each independent classical Chinese sentence together with its corresponding article title, dynasty, author, imagery label, translation, appreciation and annotation content to form a classical Chinese sentence library.
Optionally, the method may further include:
performing topic labeling on the interpretation content in the classical Chinese sentence library;
and inputting the labeled interpretation content into the document topic generation model and training it to obtain a trained document topic generation model.
Optionally, the method may further include:
and training the first BERT model on the classical Chinese sentences in the classical Chinese sentence library to obtain a trained first BERT model.
Optionally, the method may further include:
inputting the classical Chinese sentences in the classical Chinese sentence library into the trained document topic model to obtain the topic vectors of a training set;
inputting the classical Chinese sentences in the classical Chinese sentence library into the trained first BERT model to obtain the sentence ontology vectors of the training set;
splicing the topic vectors of the training set and the sentence ontology vectors of the training set to obtain new sentence vectors of the training set;
and inputting the new sentence vectors into the second BERT model and training it to obtain a trained second BERT model.
As shown in fig. 2a, the technical solution provided by the embodiment of the present invention includes:
S210: collecting classical Chinese text data and processing the classical Chinese text data.
In the embodiment of the invention, the classical Chinese text data can comprise classical literary works, and the collected content can include article content, author, dynasty, translation, interpretation content and the like. In particular, classical works on the web may be collected by crawlers. Seven attributes are defined for each classical text, and 300,000 records are crawled with a Python web crawler; the crawled data can essentially represent the detailed information of the classical texts completely. Because a classical text has many attributes, crawled data is easily affected by web advertisements and inconsistent data types and is prone to blank labels. Before the data is written into the database (a MongoDB database), it therefore needs to be cleaned, screened, filtered and supplemented to ensure that the attributes of each entry in the database, such as the article title, dynasty, author, imagery label, translation, appreciation and annotation content, are complete and certain.
In one implementation of the embodiment of the present invention, optionally, processing the classical Chinese text data includes: performing data cleaning, traditional-to-simplified character conversion, punctuation supplementation and interpretation-content supplementation on the data.
Data cleaning is mainly performed on the data acquired by the crawler, which is disordered and often polluted by web advertisements, blanks, differing tag data formats, extra web links and other information. Such conditions not only reduce the matching speed of semantic search but also directly affect its accuracy and increase operational difficulty, so all of these factors are removed as noise.
Attributes of classical texts that are blank in the data are supplemented as completely as possible; in particular, blank interpretation content needs to be completed by manual annotation.
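A minimal sketch of the cleaning and screening step described above (the field names and required-attribute subset are illustrative; the real pipeline would also strip advertisement text and normalize tag formats):

```python
# Sketch of the cleaning step: strip whitespace, drop records whose required
# attributes are blank, and flag records whose interpretation is missing so
# they can be completed by manual annotation.

REQUIRED = ("title", "dynasty", "author")  # illustrative attribute subset

def clean_records(records):
    cleaned, needs_annotation = [], []
    for rec in records:
        rec = {k: v.strip() if isinstance(v, str) else v for k, v in rec.items()}
        if any(not rec.get(field) for field in REQUIRED):
            continue  # discard records with blank required attributes
        if not rec.get("interpretation"):
            needs_annotation.append(rec)  # queue for manual annotation
        cleaned.append(rec)
    return cleaned, needs_annotation

records = [
    {"title": " A ", "dynasty": "Tang", "author": "X", "interpretation": "..."},
    {"title": "", "dynasty": "Song", "author": "Y", "interpretation": "..."},
    {"title": "B", "dynasty": "Song", "author": "Z", "interpretation": ""},
]
cleaned, todo = clean_records(records)
```

Records with an empty interpretation are kept but flagged, matching the patent's choice to supplement rather than discard them.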
Traditional characters are converted into simplified characters. Specifically, when processing the data, traditional characters often appear and interfere with the normal sentence-search flow. Optionally, a Chinese character traditional-simplified conversion system is adopted in this embodiment, so that traditional characters in the data can be directly converted.
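A toy sketch of such a conversion system. A production system (for example the OpenCC library) uses a complete conversion table; the table here covers only a few illustrative character pairs:

```python
# Sketch of traditional -> simplified conversion via a character mapping.
# A real system would use a full conversion table (e.g. OpenCC); this toy
# table covers only a handful of characters for illustration.

T2S = str.maketrans({"愛": "爱", "書": "书", "語": "语", "馬": "马", "車": "车"})

def to_simplified(text):
    return text.translate(T2S)

converted = to_simplified("愛書之語")
```

Characters absent from the table pass through unchanged, which is also the safe default for text that is already simplified.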
The classical Chinese data can then be stored. Specifically, a classical text database and a user database can be constructed. Considering the characteristics of the database and of the text data, optionally, the non-relational database MongoDB can be selected.
The data is rechecked and verified: missing entries are supplemented, duplicate information is deleted, erroneous information is corrected, and missing attribute labels are filled in promptly, this series of operations ensuring the consistency of the data written into the classical text library. When writing the data into the database, each classical text is treated as an object, where a class is a unified abstract description representing a group of objects with the same attributes and the same operations; in this embodiment each entry has seven attributes in total.
S220: performing sentence breaking on the processed classical Chinese text data to form independent classical Chinese sentences, and storing each independent sentence together with its corresponding article title, dynasty, author, imagery label, translation, appreciation and annotation content to form a classical Chinese sentence library.
In the embodiment of the invention, the text data is broken into independent classical Chinese sentences, each sentence being a single record corresponding to its article title, dynasty, author, imagery label, translation, appreciation and interpretation content, processed according to the sentence breaks. The sentences may be stored in a MongoDB database, forming the classical Chinese sentence library. Through dataset definition, data sourcing, data preprocessing and data analysis, a dataset of 70,206 valid classical Chinese sentences is determined, forming a sentence library with seven attributes: article title, dynasty, author, imagery label, translation, appreciation and interpretation.
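The sentence-breaking and record-building step can be sketched as follows. The split pattern and field names are assumptions consistent with the seven attributes described above, and MongoDB insertion is replaced by building plain dictionaries:

```python
import re

# Sketch of S220: break an article into independent sentences on Chinese
# end-of-sentence punctuation and attach the article-level attributes to each
# sentence record. In the described system each record would go into MongoDB.

def break_sentences(text):
    parts = re.split(r"(?<=[。！？；])", text)  # split after each terminator
    return [p for p in parts if p.strip()]

def build_records(article):
    shared = {k: article[k] for k in
              ("title", "dynasty", "author", "imagery_label",
               "translation", "appreciation", "interpretation")}
    return [dict(shared, sentence=s) for s in break_sentences(article["body"])]

article = {"title": "T", "dynasty": "Tang", "author": "A",
           "imagery_label": "L", "translation": "tr", "appreciation": "ap",
           "interpretation": "in", "body": "句一。句二！句三？"}
records = build_records(article)
```

Each record carries the full set of article attributes, so a matched sentence can be displayed with its source, author and interpretation without a second lookup.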
S230: performing topic labeling on the interpretation content in the classical Chinese sentence library, inputting the labeled interpretation content into the document topic generation model, and training it to obtain a trained document topic generation model.
In the embodiment of the invention, specifically, the interpretation content is segmented with jieba and topic (TOPIC) identification is performed on it with the LDA model, the identified content being represented as a topic vector. Specifically, the jieba lexicon is used to segment the interpretation content corresponding to each classical Chinese sentence, and the word vector of each segment is set as d_i. Setting the interpretation feature vector of the classical Chinese sentence as D, there is:

D = {d_1, d_2, d_3, d_4, d_5, d_6, …, d_m}

where m is the number of word vectors in the interpretation sentence.
In this embodiment, a three-layer (sentence interpretation, topic, word segmentation) Bayesian probabilistic LDA model is constructed. The structural relationship of the LDA model may refer to fig. 2b, and the topic vector distribution of an interpretation sentence in the LDA model may be determined by methods in the related art, where the parameter estimation in the model requires repeated training and adjustment on the sample data of the interpretation sentences in the classical Chinese sentence library. The LDA model is pre-trained by labeling the interpretation content in the library and training on it. Setting the characteristic topic vector of the interpretation content as μ, the topic of the interpretation content is obtained by computing high-frequency words and topic training, yielding the topic vector μ_i corresponding to interpretation sentence i.
S240: training the first BERT model on the classical Chinese sentences in the classical Chinese sentence library to obtain a trained first BERT model.
In this embodiment, the first BERT model may be used to perform word segmentation and feature vectorization training on the classical Chinese sentences; the training output is labeled as the sentence ontology vectors of the training set.
Specifically, the first BERT model is used for training the text and sentence of the cultural relics, and sentence ontology vectors are set as S, word vectors Q and word position vectors L p For a first BERT model input sentence ontology vector S composed of n words, there are:
S = {Q_1, Q_2, Q_3, …, Q_n}
In the present embodiment, each word is represented using token embedding and position embedding: token embedding captures the semantic information of the word, and position embedding preserves the position information of the word in the original sentence.
In this embodiment, for token embedding, the input classical Chinese sentence is segmented and mapped; the sentence is segmented using Jieba with an embedded vocabulary of 30,000 word units. For example, "落霞与孤鹜齐飞" ("the sunset clouds fly together with the lone duck") is segmented into the 4 word units "落霞/与/孤鹜/齐飞", where "/" denotes the separator. Each word unit is mapped to a vector of a fixed dimension as the initial word embedding of this layer.
In this embodiment, for position embedding, the position of each word unit in the sentence is marked as a position vector. The initial position vector is calculated as follows:

PE(t,2k) = sin(t / 10000^(2k/N))
PE(t,2k-1) = cos(t / 10000^(2k/N))

wherein PE(t,2k) and PE(t,2k-1) are the position feature values of the N-dimensional word vector at position t in dimension 2k and dimension 2k-1 respectively; the even dimensions 2k are calculated with the sine function and the odd dimensions 2k-1 with the cosine function.
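The sinusoidal position embedding described above can be sketched in a few lines (0-based dimension indexing here, versus the 1-based 2k / 2k-1 indexing in the text; `position_encoding` is an illustrative helper, not code from the embodiment):

```python
import math

def position_encoding(t, N):
    """Initial position vector for word position t in an N-dimensional
    embedding: even dimensions use sine, odd dimensions use cosine, as in
    the Transformer's sinusoidal scheme."""
    pe = [0.0] * N
    for k in range(N):
        # paired dims share the same frequency: (k // 2) groups (sin, cos)
        angle = t / (10000 ** (2 * (k // 2) / N))
        pe[k] = math.sin(angle) if k % 2 == 0 else math.cos(angle)
    return pe
```

Because the encoding depends only on t and N, it needs no training and can be precomputed for all positions up to the maximum sentence length.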
In this embodiment, the BERT model training process is divided into two stages, pre-training and fine-tuning. The model adopts a Transformer feature extractor to extract text information bidirectionally. The BERT model structure may refer to fig. 2c.
In this embodiment, the pre-training stage uses unsupervised learning on a large-scale dataset to obtain a strong base model, and context-dependent dynamic word vectors characterize the semantics of polysemous words in different contexts. This stage comprises 2 tasks: masked language modeling (Masked Language Model, MLM) and next sentence prediction (Next Sentence Prediction, NSP). The MLM task randomly masks 15% of the words in the input corpus, and the model predicts the masked words from context. The NSP task judges whether a sentence pair is consecutive: the training data are sentence pairs A, B drawn from the input corpus, where for 50% of the pairs sentence B is the true next sentence (these pairs have the IsNext relation) and for the other 50% sentence B is randomly replaced (these pairs have the NotNext relation). The likelihood functions of the two tasks are summed to form the pre-training loss function. The process is shown in fig. 2d. In the fine-tuning stage, the pre-trained network parameters are loaded to initialize the network, and output layers of different structures are attached to the BERT output layer according to task requirements for supervised learning; this speeds up network training and, to some extent, reduces the risk of overfitting when training on small-scale datasets.
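For illustration, the 15% MLM masking step might be sketched as below (a simplified stand-in: real BERT additionally replaces some selected tokens with random words or leaves them unchanged in an 80/10/10 split, which is omitted here; `mask_tokens` is a hypothetical helper):

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Randomly mask ~15% of tokens for the MLM pre-training task; the
    model must predict the tokens hidden behind '[MASK]' from context."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            labels.append(tok)   # prediction target at this position
        else:
            masked.append(tok)
            labels.append(None)  # position is not predicted
    return masked, labels
```

The `labels` list keeps the original token only at masked positions, which is exactly the supervision signal the MLM loss is computed against.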
In this embodiment, for the first BERT model, normalization is performed through a Softmax function to obtain the prediction result of the sentence intent. Let P denote the probability that the sentence ontology vector S belongs to the Q-th intent class; it is calculated as follows:

P(Q|S) = exp(λ_Q·S + k_Q) / Σ_{j=1..M} exp(λ_j·S + k_j)

wherein λ and k respectively correspond to the weight matrix and the bias term, and M represents the number of text labels, i.e. the number of intent labels.
In this embodiment, the model is trained with a gradient descent algorithm, the model loss is calculated with a cross-entropy function, and the model parameters are updated. The loss function L is calculated as follows:

L = - Σ_i x_i·log(P_i) + φ·||λ||^2

wherein x_i represents the value of the sentence ontology vector S in dimension i and φ represents the L2 regularization parameter. With training the model gradually converges, yielding the parameter values of λ and k; parameter tuning is thus achieved and the trained first BERT model is obtained.
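The Softmax normalization and the regularized cross-entropy loss described above can be sketched as follows (a minimal stand-in for the classification head; in practice the gradient updates over λ and k would be handled by a deep learning framework, and `loss` is an illustrative name):

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def loss(logits, onehot, weights, phi=0.0):
    """Cross-entropy over the M intent labels plus an L2 penalty on the
    weight matrix, mirroring L = -sum(x_i * log P_i) + phi * ||lambda||^2."""
    p = softmax(logits)
    ce = -sum(x * math.log(q) for x, q in zip(onehot, p))
    l2 = phi * sum(w * w for row in weights for w in row)
    return ce + l2
```

With two equal logits the predicted distribution is uniform and the cross-entropy against either class is log 2, which is a quick sanity check on the head.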
S250: and inputting the classical Chinese sentences in the classical Chinese sentence library into the trained document topic model to obtain the topic vectors of the training set, and inputting the classical Chinese sentences into the trained first BERT model to obtain the sentence ontology vectors of the training set.
In this embodiment of the invention, a classical Chinese sentence is input into the trained LDA model to obtain a topic vector μ, which serves as a topic vector of the training set; the classical Chinese sentence is also input into the trained first BERT model, where it is mapped into 2 vector representations, the word vector and the position vector L_p. After processing, the first BERT model yields the sentence ontology vector S = {Q_1, Q_2, …, Q_n}, which serves as a sentence ontology vector of the training set.
S260: and splicing the topic vectors in the training set and the sentence ontology vectors in the training set to obtain new sentence vectors of the training set, inputting the new sentence vectors into the second BERT model, and training the second BERT model to obtain a trained second BERT model.
In this embodiment of the invention, the sentence ontology vector S in the training set is combined with the topic vector μ extracted by the LDA model (serving as the training-set topic vector) to form a new vector; the combined new sentence vector is denoted F, with F = {S, μ}.
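Forming F = {S, μ} is a plain vector concatenation; a minimal sketch (`fuse` is an illustrative name, not from the embodiment):

```python
def fuse(sentence_vec, topic_vec):
    """Concatenate the BERT sentence ontology vector S with the LDA topic
    vector mu to form the new sentence vector F = {S, mu}."""
    return list(sentence_vec) + list(topic_vec)

# e.g. a 768-dim S and a K-dim mu would yield a (768 + K)-dim F;
# small toy vectors are used here
F = fuse([0.1, 0.2, 0.3], [0.25, 0.75])
```

The fused vector is what the second BERT model consumes, so its input dimension must equal len(S) + len(μ).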
In an embodiment of the present invention, optionally, inputting the new sentence vector into the second BERT model, and training the second BERT model includes:
inputting the new sentence vector sample into a second BERT model, and performing normalization operation;
adjusting parameters of the second BERT model based on the following loss function L':

L' = - Σ_i y_i·log(P(F|Q)) + η·||α||^2

wherein P(F|Q) represents the probability that the new sentence vector F belongs to the Q-th intent; M represents the number of text labels; α and d respectively represent the weight matrix and the bias term; y_i represents the value of the new sentence vector F in dimension i; and η represents the regularization parameter of L'.
Specifically, the new sentence vector input into the second BERT model is F. The second BERT model applies Softmax normalization to F to obtain the prediction result of the sentence intent. The probability that F belongs to the Q-th intent is calculated as:

P(F|Q) = exp(α_Q·F + d_Q) / Σ_{j=1..M} exp(α_j·F + d_j)

and the loss function L' is calculated as:

L' = - Σ_i y_i·log(P(F|Q)) + η·||α||^2
after processing the new sentence vector F fused with the LDA theme vector, the transducer encoder not only can learn and store the semantic relationship and text structure information of the dialect Wen Yugou, but also can learn the theme information of the dialect, and the second BRET model can gradually converge after training to obtain the parameter values of α and d, so that parameter adjustment of the second BERT model is realized, and a trained second BERT model is obtained.
S270: and acquiring input content of a user, and performing topic identification on the input content through a trained document topic generation model to obtain a topic vector.
S280: the input content is converted into sentence ontology vectors by means of a trained first BERT model.
S290: and splicing the theme vector and the sentence ontology vector to form a target sentence vector.
S291: and inputting the target sentence vector into the trained second BERT model to obtain the feature vector of the input content, and determining the classical Chinese sentence matched with the input content based on the similarity between the feature vector of the input content and the feature vectors of the classical Chinese sentences in the classical Chinese sentence library.
The description of S270 to S291 may refer to the above-described embodiments.
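Steps S270 to S291 end in a similarity match; assuming cosine similarity over the feature vectors (the text says only "similarity", so this metric is an assumption, and `recommend` is an illustrative helper), the matching step might look like:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def recommend(query_vec, library, top_k=1):
    """Rank library sentences by similarity between the feature vector of
    the user input and each stored sentence feature vector."""
    ranked = sorted(library.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [sent for sent, _ in ranked[:top_k]]
```

In a deployed system the library vectors would be precomputed by the trained second BERT model and indexed, so only the query vector is computed at request time.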
According to the technical scheme provided by this embodiment of the invention, through machine reading comprehension and intelligent recommendation, the burden of memorizing classical Chinese is relieved: a user can effectively retrieve classical Chinese sentence material without reading a large number of classical works or accumulating long reading experience; classical Chinese can be understood efficiently, and recalled, applied flexibly, and quoted correctly without long professional training; and the user's intent can be identified efficiently, so that recommended sentences can be located accurately in a massive classical Chinese sentence library.
Fig. 3a is a block diagram of a sentence recommendation device according to an embodiment of the present invention, and as shown in fig. 3a, the device includes a topic vector identification module 310, a sentence ontology vector identification module 320, a concatenation module 330, and a matching module 340.
The topic vector recognition module 310 is configured to obtain input content of a user, and perform topic recognition on the input content through a trained document topic generation model to obtain a topic vector;
a sentence ontology vector recognition module 320, configured to convert the input content into sentence ontology vectors through a trained first BERT model;
a stitching module 330, configured to stitch the topic vector and the sentence ontology vector to form a target sentence vector;
and the matching module 340 is configured to input the target sentence vector into the trained second BERT model to obtain the feature vector of the input content, and to determine the classical Chinese sentence matched with the input content based on the similarity between the feature vector of the input content and the feature vectors of the classical Chinese sentences in the classical Chinese sentence library.
Optionally, the first training module of the device is configured to:
performing topic labeling on the interpretation content in the classical Chinese sentence library;
and inputting the annotated interpretation content into the document theme generation model, and training the document theme generation model to obtain a trained document theme generation model.
Optionally, the second training module is configured to:
and training the first BERT model on the classical Chinese sentences in the classical Chinese sentence library to obtain a trained first BERT model.
Optionally, the third training module is configured to:
inputting the classical Chinese sentences in the classical Chinese sentence library into the trained document topic model to obtain topic vectors of the training set;
inputting the classical Chinese sentences in the classical Chinese sentence library into the trained first BERT model to obtain sentence ontology vectors of the training set;
splicing the topic vectors in the training set and the sentence ontology vectors in the training set to obtain new sentence vectors in the training set;
and inputting the new sentence vector into the second BERT model, and training the second BERT model to obtain a trained second BERT model.
Optionally, the inputting the new sentence vector into the second BERT model, training the second BERT model includes:
inputting the new sentence vector sample into a second BERT model, and performing normalization operation;
adjusting parameters of the second BERT model based on the following loss function L':
L' = - Σ_i y_i·log(P(F|Q)) + η·||α||^2

wherein P(F|Q) represents the probability that the new sentence vector F belongs to the Q-th intent; M represents the number of text labels; α and d respectively represent the weight matrix and the bias term; y_i represents the value of the new sentence vector F in dimension i; and η represents the regularization parameter of L'.
Optionally, the apparatus further comprises a classical Chinese sentence library construction module, configured to:
collecting classical Chinese text data and processing the data;
performing sentence breaking on the processed text data to form independent classical Chinese sentences;
storing each independent classical Chinese sentence with its corresponding source work, dynasty, author, intent label, translation, appreciation, and annotation content to form the classical Chinese sentence library.
Optionally, the processing of the classical Chinese text data comprises:
performing data cleaning, traditional-to-simplified conversion, punctuation supplementation, and interpretation content supplementation on the classical Chinese text data.
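The sentence-breaking step of the corpus pipeline can be sketched with a simple punctuation-based splitter (an illustrative stand-in; production cleaning would also need traditional-to-simplified conversion, e.g. with a tool such as OpenCC, and punctuation supplementation for unpunctuated source texts):

```python
import re

def split_sentences(text):
    """Break cleaned classical Chinese prose into independent sentences
    on terminal punctuation, keeping the punctuation with its sentence."""
    # split *after* each terminal mark (Chinese or Western)
    parts = re.split(r"(?<=[。！？?!.])", text)
    return [p.strip() for p in parts if p.strip()]
```

Each element of the returned list becomes one independent sentence record to be stored alongside its source work, dynasty, author, and annotations.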
The architecture of the device provided by the embodiment of the present invention may also refer to fig. 3b.
The device provided by the embodiment of the invention can execute the method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of executing the method.
Fig. 4 shows a schematic diagram of the structure of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic equipment may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 4, the electronic device 10 includes at least one processor 11, and a memory, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, etc., communicatively connected to the at least one processor 11, in which the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data required for the operation of the electronic device 10 may also be stored. The processor 11, the ROM 12 and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the respective methods and processes described above, such as the sentence recommendation method.
In some embodiments, the sentence recommendation method may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the sentence recommendation method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the sentence recommendation method in any other suitable way (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service are overcome.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.
Claims (10)
1. A sentence recommendation method, comprising:
acquiring input content of a user, and performing topic identification on the input content through a trained document topic generation model to obtain a topic vector;
converting the input content into sentence ontology vectors by means of a trained first BERT model;
splicing the topic vector and the sentence ontology vector to form a target sentence vector;
and inputting the target sentence vector into a trained second BERT model to obtain a feature vector of the input content, and determining the classical Chinese sentence matched with the input content based on the similarity between the feature vector of the input content and the feature vectors of the classical Chinese sentences in a classical Chinese sentence library.
2. The method as recited in claim 1, further comprising:
performing topic labeling on the interpretation content in the classical Chinese sentence library;
and inputting the annotated interpretation content into the document theme generation model, and training the document theme generation model to obtain a trained document theme generation model.
3. The method as recited in claim 2, further comprising:
and training the first BERT model on the classical Chinese sentences in the classical Chinese sentence library to obtain a trained first BERT model.
4. A method according to claim 3, further comprising:
inputting the classical Chinese sentences in the classical Chinese sentence library into the trained document topic model to obtain topic vectors of a training set;
inputting the classical Chinese sentences in the classical Chinese sentence library into the trained first BERT model to obtain sentence ontology vectors of the training set;
splicing the topic vectors of the training set and the sentence ontology vectors of the training set to obtain new sentence vectors of the training set;
and inputting the new sentence vector into the second BERT model, and training the second BERT model to obtain a trained second BERT model.
5. The method of claim 4, wherein the inputting the new sentence vector into the second BERT model, training the second BERT model, comprises:
inputting the new sentence vector sample into a second BERT model, and performing normalization operation;
adjusting parameters of the second BERT model based on the following loss function L':
L' = - Σ_i y_i·log(P(F|Q)) + η·||α||^2

wherein P(F|Q) represents the probability that the new sentence vector F belongs to the Q-th intent; M represents the number of text labels; α and d respectively represent the weight matrix and the bias term; y_i represents the value of the new sentence vector F in dimension i; and η represents the regularization parameter of L'.
6. The method as recited in claim 2, further comprising:
collecting classical Chinese text data and processing the data;
performing sentence breaking on the processed text data to form independent classical Chinese sentences;
and storing each independent classical Chinese sentence with its corresponding source work, dynasty, author, intent label, translation, appreciation, and annotation content to form the classical Chinese sentence library.
7. The method of claim 6, wherein said processing the classical Chinese text data comprises:
performing data cleaning, traditional-to-simplified conversion, punctuation supplementation, and interpretation content supplementation on the classical Chinese text data.
8. A sentence recommendation device, characterized by comprising:
The topic vector recognition module is used for acquiring the input content of a user, and performing topic recognition on the input content through a trained document topic generation model to obtain a topic vector;
the sentence ontology vector recognition module is used for converting the input content into sentence ontology vectors through a trained first BERT model;
the splicing module is used for splicing the theme vector and the sentence body vector to form a target sentence vector;
and the matching module is configured to input the target sentence vector into the trained second BERT model to obtain the feature vector of the input content, and to determine the classical Chinese sentence matched with the input content based on the similarity between the feature vector of the input content and the feature vectors of the classical Chinese sentences in the classical Chinese sentence library.
9. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
10. A computer readable storage medium storing computer instructions for causing a processor to perform the method of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310085815.3A CN116089597A (en) | 2023-01-17 | 2023-01-17 | Statement recommendation method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116089597A true CN116089597A (en) | 2023-05-09 |
Family
ID=86186746
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310085815.3A Pending CN116089597A (en) | 2023-01-17 | 2023-01-17 | Statement recommendation method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116089597A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||