CN111104799B - Text information characterization method, system, computer equipment and storage medium - Google Patents


Info

Publication number
CN111104799B
CN111104799B (application CN201910981528.4A)
Authority
CN
China
Prior art keywords
sentence
training
word
vector
corpus
Prior art date
Legal status
Active
Application number
CN201910981528.4A
Other languages
Chinese (zh)
Other versions
CN111104799A (en)
Inventor
侯晓龙
Current Assignee
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd
Priority to CN201910981528.4A
Publication of CN111104799A
Application granted
Publication of CN111104799B


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of artificial intelligence and relates to a text information characterization method, system, computer device and storage medium. The text information characterization method comprises the following steps: acquiring a corpus to be analyzed, performing word-segmentation preprocessing on the corpus, and generating a corresponding word vector for each obtained segmented word, wherein the corpus to be analyzed is text information comprising at least one sentence; acquiring the word vectors of the segmented words contained in each sentence of the corpus to obtain a word-vector group for each sentence, inputting the word vectors of each group in order into an initial sentence-vector algorithm model, and generating the initial sentence vector of the corresponding sentence; and inputting the initial sentence vectors into a pre-trained sentence-vector model to obtain the final sentence vector of each sentence, the final sentence vector being used to characterize the text information, wherein the pre-trained sentence-vector model is generated based on the contextual relations between sentences. The scheme provided by the invention avoids the influence of a word carrying different semantics in different sentences, so the text information is characterized more accurately.

Description

Text information characterization method, system, computer equipment and storage medium
Technical Field
The embodiment of the invention belongs to the technical field of artificial intelligence, and particularly relates to a text information characterization method, a text information characterization system, computer equipment and a storage medium.
Background
In the field of natural language processing, text information characterization is the basis for solving text-processing problems. In the prior art, adding or averaging Word2Vec word vectors is generally adopted to characterize text information, but the semantics of the same word differ across sentences and contexts, so word-vector-based characterization is inaccurate and is poorly suited to characterizing text such as article information in the field of information-stream recommendation.
Disclosure of Invention
In view of the above, embodiments of the present invention provide a method, a system, a computer device, and a storage medium for text information characterization, so as to solve the problem in the prior art that text information characterization based on word vectors is not accurate enough, and is not applicable to text information characterization such as article information in the field of information stream recommendation.
In a first aspect, an embodiment of the present invention provides a text information characterizing method, including:
acquiring a corpus to be analyzed, performing word-segmentation preprocessing on the corpus, and generating corresponding word vectors from the obtained segmented words, wherein the corpus to be analyzed is text information comprising at least one sentence;
acquiring the word vectors of the segmented words contained in each sentence of the corpus to obtain a word-vector group for each sentence, inputting the word vectors of each group in order into an initial sentence-vector algorithm model, and generating the initial sentence vector of the corresponding sentence;
and inputting the initial sentence vector into a pre-trained sentence vector model to obtain a final sentence vector of each sentence, wherein the final sentence vector is used for representing text information, and the pre-trained sentence vector model is generated based on the context relation of the sentences.
As an embodiment of the present invention, before the corpus to be analyzed is obtained, the method further includes a step of performing model training on the pre-trained sentence vector model, where a training process of the pre-trained sentence vector model includes:
acquiring a training corpus, performing word-segmentation preprocessing on the corpora in the training corpus, and generating corresponding word vectors from the obtained segmented words, wherein the training corpus is a set of training text information comprising at least one training sentence;
acquiring the word vectors of the segmented words contained in each training sentence to obtain a word-vector group for each training sentence, inputting the word vectors of each group in order into the initial sentence-vector algorithm model, and generating the initial sentence vector of the corresponding training sentence;
based on the contextual relation of each training sentence in the training corpus, inputting the initial sentence vector corresponding to each training sentence into an initial sentence-vector model for training, to obtain the pre-trained sentence-vector model.
As an embodiment of the present invention, the inputting, based on the context corresponding to each training sentence in the training corpus, an initial sentence vector corresponding to each training sentence into an initial sentence vector model for training, and obtaining the pre-trained sentence vector model includes:
configuring a parameter matrix of the initial sentence vector model, wherein the parameter matrix is connected with an input layer and an output layer of the initial sentence vector model;
generating a training sample and a test sample according to the context relation corresponding to each training sentence, wherein the training sample and the test sample respectively comprise K1 and K2 sentence groups, each sentence group comprises at least one training sentence used for generating an input sentence vector and at least one training sentence used for generating an output sentence vector, and K1 and K2 are positive integers;
inputting the input sentence vector in each sentence group in the training sample to the initial sentence vector model in sequence for training, and gradually adjusting parameters in the parameter matrix until the sentence groups in the training sample are trained, so that the output of the initial sentence vector model is gradually matched with the corresponding output sentence vectors in the sentence groups;
and testing the trained initial sentence-vector model with the test sample; training of the initial sentence-vector model is complete after the test passes, thereby obtaining the pre-trained sentence-vector model.
As an embodiment of the present invention, the inputting the initial sentence vector into the pre-trained sentence vector model, obtaining the final sentence vector of each sentence includes: and inputting the initial sentence vector into the pre-trained sentence vector model, and multiplying the initial sentence vector of the corpus to be analyzed by the parameter matrix to obtain a final sentence vector used for representing the text information of the corpus to be analyzed.
As an embodiment of the present invention, the initial sentence vector model may be a skip-gram model or a cbow model.
As an embodiment of the present invention, the acquiring a corpus, performing word-segmentation preprocessing on the corpora in the corpus to obtain a group of segmented words, and generating a corresponding word vector for each segmented word includes:
performing word segmentation on the corpora in the corpus with a preset word-segmentation algorithm, and performing stop-word removal on the segmentation result to obtain a lexicon of N segmented words, wherein N is a positive integer;
and inputting the N segmented words in the lexicon into a preset word-vector model to obtain the word vectors of the N segmented words.
As an implementation mode of the invention, the initial sentence vector algorithm model is a GRU algorithm model.
In a second aspect, an embodiment of the present invention provides a text information characterization system, including:
the word vector generation module is used for acquiring the corpus to be analyzed, performing word-segmentation preprocessing on it, and generating corresponding word vectors from the obtained segmented words, wherein the corpus to be analyzed is text information comprising at least one sentence;
the initial sentence vector generation module is used for obtaining word vectors of word segmentation contained in each sentence in the corpus to be analyzed to obtain a word vector group of each sentence, and sequentially inputting the word vectors in the word vector group into an initial sentence vector algorithm model to generate initial sentence vectors of corresponding sentences;
the text information characterization module is used for inputting the initial sentence vector into a pre-trained sentence vector model to obtain a final sentence vector of each sentence, and the final sentence vector is used for characterizing text information; wherein the pre-trained sentence vector model is generated based on the context of the sentence.
In a third aspect, an embodiment of the present invention provides a computer apparatus, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores computer readable instructions executable by the at least one processor, which when executed by the at least one processor, cause the at least one processor to perform the steps of the text information characterization method as described above.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium having stored thereon computer-readable instructions that, when executed by at least one processor, implement the steps of a text information characterization method as described above.
According to the text information characterization method, system, computer device and storage medium provided by the embodiments of the invention, sentence-level text characterization is performed by building a pre-trained sentence-vector model based on the contextual relations between sentences. Because the sentence context is taken into account, the influence of a word carrying different semantics in different sentences is avoided during characterization, so the text information is characterized more accurately.
Drawings
In order to more clearly illustrate the solution of the present invention, a brief description will be given below of the drawings required for the description of the embodiments, it being apparent that the drawings in the following description are some embodiments of the present invention and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a text message characterization method according to an embodiment of the present invention;
FIG. 2 is a flow chart of generating word vectors according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a network node of a GRU algorithm model according to an embodiment of the present invention;
FIG. 4 is a flowchart of a training process of a pre-trained sentence vector model according to an embodiment of the present invention;
FIG. 5 is a flowchart of training an initial sentence vector model based on context of training sentences according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a text message characterization system according to an embodiment of the present invention;
FIG. 7 is another schematic diagram of a text message characterization system according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a model training module according to an embodiment of the present invention;
fig. 9 is a block diagram of a computer device according to an embodiment of the present invention.
Description of the embodiments
In order to enable those skilled in the art to better understand the present invention, the following clearly and completely describes the technical solutions in the embodiments of the present invention with reference to the accompanying drawings. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
The appearances of the phrase "an embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
An embodiment of the present invention provides a text information characterization method, as shown in fig. 1, where the text information characterization method includes:
S1, acquiring a corpus to be analyzed, performing word-segmentation preprocessing on the corpus, and generating corresponding word vectors from the obtained segmented words, wherein the corpus to be analyzed is text information comprising at least one sentence;
S2, acquiring the word vectors of the segmented words contained in each sentence of the corpus to obtain a word-vector group for each sentence, inputting the word vectors of each group in order into an initial sentence-vector algorithm model, and generating the initial sentence vector of the corresponding sentence;
S3, inputting the initial sentence vector into a pre-trained sentence vector model to obtain a final sentence vector of each sentence, wherein the final sentence vector is used for representing text information, and the pre-trained sentence vector model is generated based on the context relation of the sentences.
Specifically, in the embodiment of the present invention, the corpus to be analyzed in step S1 may be any text information from the Internet or stored locally on the terminal device. As for obtaining the word vectors, in some embodiments, as shown in fig. 2, acquiring the corpus to be analyzed, performing word-segmentation preprocessing on it, and generating corresponding word vectors from the obtained segmented words may specifically include:
S11, performing word segmentation on the corpus to be analyzed with a preset word-segmentation algorithm, and performing stop-word removal on the segmentation result to obtain a lexicon of N segmented words, wherein N is a positive integer;
S12, inputting the N segmented words in the lexicon into a preset word-vector model to obtain the word vectors of the N segmented words.
Specifically, different word-segmentation algorithms can be selected for different languages in S11. For a Chinese corpus, a word-segmentation method based on string matching (mechanical segmentation), on understanding, or on statistics can be adopted, for example the shortest-path segmentation algorithm or the jieba segmentation algorithm; the scheme is not limited in this respect.
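As a concrete illustration of the string-matching (mechanical) approach mentioned above, the sketch below implements forward maximum matching followed by stop-word removal, reproducing the first clause of the example sentence, 某地发生交通事故 ("a traffic accident occurred somewhere"). This is a minimal toy, not the jieba algorithm; the tiny vocabulary and the stop-word list {发生} are invented for the example.

```python
def fmm_segment(text, vocab):
    """Forward maximum matching: at each position take the longest
    vocabulary word; fall back to a single character."""
    max_len = max(len(w) for w in vocab)
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + length]
            if length == 1 or cand in vocab:
                tokens.append(cand)
                i += length
                break
    return tokens

def remove_stop_words(tokens, stop_words):
    return [t for t in tokens if t not in stop_words]

vocab = {"某地", "发生", "交通", "事故"}
tokens = fmm_segment("某地发生交通事故", vocab)
lexicon = remove_stop_words(tokens, {"发生"})  # drop the verb as a stop word
```

After stop-word removal the lexicon contains the same three segmented words ("somewhere", "traffic", "accident") that step S11 produces for this sentence in the example below.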
In this embodiment, step S12 may be implemented with a word2vec model. Specifically, the N segmented words are ordered and represented by one-hot vectors. For example, "A traffic accident occurred somewhere; Ping An Life rapidly started the special-case pre-claim service" yields, through the word-segmentation preprocessing of step S11, the segmented words "somewhere", "traffic", "accident", "Ping An Life", "rapid", "start", "special case", "pre-claim" and "service", forming a lexicon of 9 segmented words. After ordering these 9 segmented words, the one-hot representation is as follows:
somewhere → [1,0,0,0,0,0,0,0,0]
traffic → [0,1,0,0,0,0,0,0,0]
accident → [0,0,1,0,0,0,0,0,0]
Ping An Life → [0,0,0,1,0,0,0,0,0]
rapid → [0,0,0,0,1,0,0,0,0]
start → [0,0,0,0,0,1,0,0,0]
special case → [0,0,0,0,0,0,1,0,0]
pre-claim → [0,0,0,0,0,0,0,1,0]
service → [0,0,0,0,0,0,0,0,1]
The dimension of a one-hot vector equals the number of segmented words N in the lexicon, and the one-hot vectors serve as input to the word2vec model. Specifically, combining the context of each segmented word in the corpus, the one-hot vectors of one or more segmented words are input to the word2vec model, and the initially set weight matrix in the model is optimized by training. After training, the word vector of each segmented word is obtained from the weight matrix: the one-hot vector of each segmented word is multiplied by the trained weight matrix to yield the corresponding word vector.
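The final multiplication can be sketched in a few lines: a one-hot vector times the weight matrix simply selects that word's row. A random matrix stands in for the trained word2vec weights here, and the 4-dimensional embedding size is an arbitrary choice for illustration.

```python
import numpy as np

vocab = ["somewhere", "traffic", "accident", "Ping An Life", "rapid",
         "start", "special case", "pre-claim", "service"]
N, d = len(vocab), 4                       # lexicon size, word-vector dimension
rng = np.random.default_rng(0)
weight_matrix = rng.normal(size=(N, d))    # stands in for the trained weights

def word_vector(word):
    one_hot = np.zeros(N)
    one_hot[vocab.index(word)] = 1.0
    return one_hot @ weight_matrix         # picks out the word's row

wv = word_vector("traffic")
```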
In the embodiment of the invention, the segmented words of each sentence in step S2 are determined with the same word-segmentation preprocessing method as in step S1, to ensure consistent segmentation results; the number of sentences in the corpus to be analyzed equals the number of initial sentence vectors obtained in step S2.
Regarding the initial sentence vector of each sentence, in one embodiment of the present invention, generating the initial sentence vector of the corresponding sentence from the word-vector group may include: averaging, or weighted-averaging, the word vectors in the group to obtain the initial sentence vector of the corresponding sentence. For plain averaging, the example "A traffic accident occurred somewhere; Ping An Life rapidly started the special-case pre-claim service" yields, through the word-segmentation preprocessing of step S11, 9 word vectors corresponding to "somewhere", "traffic", "accident", "Ping An Life", "rapid", "start", "special case", "pre-claim" and "service"; the values in these 9 word vectors are averaged elementwise to generate a new vector of the same dimension, namely the initial sentence vector. For weighted averaging, each segmented word carries a weight in the lexicon according to its frequency of occurrence or degree of importance. For example, among the 9 word vectors, words such as "accident", "Ping An Life" and "pre-claim" should be given more prominence in the text characterization, so their weights are higher than those of other words. The weight of each segmented word can be computed from its frequency in the historical corpora, and the values of the word vectors in each sentence are weighted-averaged elementwise with these weights to generate a new vector of the same dimension, namely the corresponding initial sentence vector.
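Both averaging schemes reduce to one line each with numpy. The 3-word sentence, the 3-dimensional vectors, and the weight values below are invented for illustration; in practice the weights would come from word frequency or importance as described above.

```python
import numpy as np

# Toy word vectors for a 3-word sentence (illustrative values).
word_vectors = np.array([
    [1.0, 0.0, 2.0],   # "somewhere"
    [3.0, 2.0, 1.0],   # "traffic"
    [2.0, 4.0, 0.0],   # "accident"
])

plain = word_vectors.mean(axis=0)     # unweighted average -> initial sentence vector

weights = np.array([0.2, 0.3, 0.5])   # per-word weights, summing to 1
weighted = weights @ word_vectors     # weighted average -> initial sentence vector
```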
As an implementation manner of the present invention, the initial sentence-vector algorithm model may be a GRU model, and the GRU model is taken as the example in the following description. The GRU is a variant of the recurrent neural network (RNN); the network layer of the GRU model comprises a plurality of cascaded network nodes of identical structure, as shown in fig. 3. All sentences are stored in a fixed order. Suppose the corpus to be analyzed currently contains M sentences, and let S_i denote the i-th sentence, where i ranges from 1 to M. Let t denote the number of segmented words contained in a sentence (t is a positive integer), let w_{i,1}, …, w_{i,t} denote in order the segmented words contained in sentence S_i, and let x_{i,1}, …, x_{i,t} denote the corresponding word vectors. For example, "A traffic accident occurred somewhere; Ping An Life rapidly started the special-case pre-claim service" contains two sentences. The first sentence yields, through the word-segmentation preprocessing of step S11, the segmented words "somewhere", "traffic" and "accident", denoted w_{1,1}, w_{1,2} and w_{1,3}, with word vectors x_{1,1}, x_{1,2} and x_{1,3}; the second sentence yields "Ping An Life", "rapid", "start", "special case", "pre-claim" and "service", denoted w_{2,1}, …, w_{2,6}, with word vectors x_{2,1}, …, x_{2,6}; and so on for any further sentences in the corpus to be analyzed. When the word vectors are input in order into the network nodes of the GRU model for processing, the following formulas are satisfied:

z_t = σ(W_z · x_t + U_z · h_{t−1})  (1)

r_t = σ(W_r · x_t + U_r · h_{t−1})  (2)

h̃_t = tanh(W · x_t + U · (r_t ⊙ h_{t−1}))  (3)

h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h̃_t  (4)

A network node of the GRU comprises an update gate and a reset gate. The output of the update gate is z_t and the output of the reset gate is r_t; the reset gate r_t and update gate z_t of the t-th segmented word are obtained from the word vector x_t of the t-th word and the output h_{t−1} of step t−1. h̃_t denotes the information currently needed (the candidate state), and h_t denotes all information currently stored. In the formulas, σ and tanh are activation functions: σ compresses the processing result to between 0 and 1, and tanh compresses the result to between −1 and 1 to facilitate processing by the subsequent network node; ⊙ denotes the Hadamard product, i.e. the elementwise product. W_z and U_z denote the connection matrices from the input x_t and from the previous network node to the update gate, respectively; W_r and U_r denote the connection matrices from the input x_t and from the previous network node to the reset gate; and W and U denote the connection matrices from the input x_t and from the previous network node to the candidate state h̃_t. The update gate controls the extent to which the state information of the previous network node is brought into the current node: the greater the value of z_t, the more of the previous node's state is brought in. The reset gate controls the extent to which the state information of the previous node is ignored: the smaller the value of r_t, the more is ignored. Through the reset gate and the update gate, the word-segmentation information contained in all the word vectors is effectively accumulated into the final network node for processing, and the result containing all the word-segmentation information is obtained, namely the initial sentence vector.
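A single GRU node and the accumulation over a sentence can be sketched directly from equations (1) to (4). This is a minimal numpy sketch: the bias-free form, the random matrices standing in for trained parameters, and the dimensions (word vectors of size 4, hidden state of size 5) are simplifying assumptions for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, P):
    """One GRU network node, following equations (1)-(4)."""
    z = sigmoid(P["Wz"] @ x_t + P["Uz"] @ h_prev)             # update gate, eq. (1)
    r = sigmoid(P["Wr"] @ x_t + P["Ur"] @ h_prev)             # reset gate, eq. (2)
    h_cand = np.tanh(P["W"] @ x_t + P["U"] @ (r * h_prev))    # candidate state, eq. (3)
    return z * h_prev + (1.0 - z) * h_cand                    # accumulated state, eq. (4)

def initial_sentence_vector(word_vecs, P, hidden_dim):
    h = np.zeros(hidden_dim)
    for x in word_vecs:            # word vectors fed in sentence order
        h = gru_step(x, h, P)
    return h                       # final node output = initial sentence vector

d, hidden = 4, 5
rng = np.random.default_rng(0)
P = {"Wz": rng.normal(size=(hidden, d)), "Uz": rng.normal(size=(hidden, hidden)),
     "Wr": rng.normal(size=(hidden, d)), "Ur": rng.normal(size=(hidden, hidden)),
     "W":  rng.normal(size=(hidden, d)), "U":  rng.normal(size=(hidden, hidden))}
sv = initial_sentence_vector(rng.normal(size=(3, d)), P, hidden)
```

Because each state is a gated blend of the previous state and a tanh-bounded candidate, the accumulated sentence vector stays within [−1, 1] per component.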
In the embodiment of the present invention, for step S3, before the corpus to be analyzed is obtained, the method further includes a step of performing model training on the pre-trained sentence vector model, where, as shown in fig. 4, the training process of the pre-trained sentence vector model includes:
S31, acquiring a training corpus, performing word-segmentation preprocessing on the corpora in the training corpus, and generating corresponding word vectors from the obtained segmented words, wherein the training corpus is a set of training text information comprising at least one training sentence;
S32, acquiring the word vectors of the segmented words contained in each training sentence to obtain a word-vector group for each training sentence, and inputting the word vectors of each group in order into the initial sentence-vector algorithm model to generate the initial sentence vector of the corresponding training sentence;
S33, inputting initial sentence vectors corresponding to all training sentences into an initial sentence vector model for training based on the context relation corresponding to all training sentences in the training corpus, and obtaining the pre-trained sentence vector model.
The training corpus can be composed of Internet corpora such as Baidu Baike and Wikipedia, or of other network corpora such as various information websites. Using large-scale Internet corpora converts the unsupervised training of the algorithm model into supervised training, effectively improving the performance of the model adopted in this scheme. The training corpus can be a Chinese corpus, a foreign-language corpus, or a combined corpus of designated languages.
In this embodiment, the segmented words and word vectors of the corpora in the training corpus are obtained in step S31 by the same process as step S1, to ensure consistent segmentation results. Similarly, the initial sentence vectors of the training corpus are obtained in step S32 by the same process as step S2, and the number of sentences in the training corpus equals the number of initial sentence vectors obtained in step S32.
For step S33, as shown in fig. 5, the inputting the initial sentence vector corresponding to each training sentence into the initial sentence vector model for training based on the context corresponding to each training sentence in the training corpus, to obtain the pre-trained sentence vector model may specifically include:
S331, configuring a parameter matrix of the initial sentence-vector model, wherein the parameter matrix connects the input layer and the output layer of the initial sentence-vector model;
S332, generating training samples and test samples according to the contextual relation of each training sentence, wherein the training samples and test samples comprise K1 and K2 sentence groups respectively; each sentence group comprises at least one training sentence used for generating an input sentence vector and at least one training sentence used for generating an output sentence vector; K1 and K2 are positive integers and may be equal or unequal, and K1 is preferably not smaller than K2, i.e. the number of training samples is not smaller than the number of test samples. The training sentences used for generating input sentence vectors and those used for generating output sentence vectors have a contextual relation. For example, in the text "I am xx. I come from xxx.", the sentences "I am xx" and "I come from xxx" have a linguistic precedence relation (a contextual relation), so "I am xx" can serve as the sentence generating the input sentence vector and "I come from xxx" as the sentence generating the output sentence vector.
S333, sequentially inputting the input sentence vectors in each sentence group in the training sample into the initial sentence vector model for training, and gradually adjusting parameters in the parameter matrix until the sentence groups in the training sample are trained, so that the output of the initial sentence vector model is gradually matched with the corresponding output sentence vectors in the sentence groups;
S334, testing the trained initial sentence-vector model with the test samples; training of the initial sentence-vector model is complete after the test passes, thereby obtaining the pre-trained sentence-vector model.
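The sample-generation step S332 above can be sketched by pairing each sentence with its successor as an (input, output) group and splitting the groups into training and test samples with K1 ≥ K2. The adjacent-pair scheme and the 80/20 split ratio are illustrative assumptions, not fixed by the scheme.

```python
def make_sentence_groups(sentences, train_ratio=0.8):
    """Pair adjacent sentences as (input, output) groups, then split
    the groups into K1 training groups and K2 test groups (K1 >= K2)."""
    groups = [(sentences[i], sentences[i + 1]) for i in range(len(sentences) - 1)]
    k1 = max(1, int(len(groups) * train_ratio))
    return groups[:k1], groups[k1:]   # training sample, test sample

train_groups, test_groups = make_sentence_groups(["s1", "s2", "s3", "s4"])
```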
Further, in an embodiment of the present invention, the inputting the initial sentence vector into the pre-trained sentence vector model, obtaining the final sentence vector of each sentence includes: and inputting the initial sentence vector into the pre-trained sentence vector model, and multiplying the initial sentence vector of the corpus to be analyzed by the parameter matrix to obtain a final sentence vector used for representing the text information of the corpus to be analyzed.
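The correction described above is a single matrix product. In the sketch below a random matrix stands in for the trained parameter matrix, and the dimensions (initial vector of size 5, final vector of size 3) are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_initial, d_final = 5, 3
param_matrix = rng.normal(size=(d_initial, d_final))  # stands in for the trained parameter matrix

initial_sv = rng.normal(size=d_initial)   # initial sentence vector of the corpus to be analyzed
final_sv = initial_sv @ param_matrix      # final sentence vector characterizing the text
```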
In this embodiment, the initial sentence-vector model in the above description may be a skip-gram model or a cbow model. Specifically, the skip-gram model takes one sentence as input and predicts sentences that have a contextual relation with it; in this case each sentence group in the training and test samples contains only one sentence serving as input. The cbow model takes a plurality of sentences as input and predicts the sentence located in the middle of them, the predicted sentence having a contextual relation with the input sentences; in this case each sentence group contains only one sentence serving as output. In this embodiment, the initial sentence vector is corrected by the trained sentence-vector model, and considering the contextual relation of the sentences makes the characterization of the text more accurate; when applied to information-stream pushing, text information such as news-information titles is characterized more accurately, which helps improve the reading conversion rate of the information.
According to the text information characterization method provided by the embodiment of the invention, a sentence vector model is established based on the context of sentences and text information is characterized at the sentence level. Because sentence context is taken into account, the characterization is not distorted by words carrying different meanings in different sentences, so the text information is characterized more accurately. In addition, large-scale internet corpora can be used when training the pre-trained sentence vector model, which effectively converts unsupervised training into supervised training, improves the training effect of the model, and thereby further improves the accuracy of text information characterization.
The embodiment of the invention provides a text information characterization system capable of executing the text information characterization method described above. As shown in fig. 6, the text information characterization system comprises a word vector generation module 10, an initial sentence vector generation module 20 and a text information characterization module 30. The word vector generation module 10 is used to obtain a corpus to be analyzed, perform word segmentation preprocessing on the corpus, and generate a corresponding word vector for each obtained word segment; the corpus to be analyzed is text information, and the text information comprises at least one sentence. The initial sentence vector generation module 20 is configured to obtain the word vectors of the word segments contained in each sentence of the corpus to be analyzed, forming a word vector group for each sentence, and to input the word vectors of each group in order into an initial sentence vector algorithm model to generate the initial sentence vector of the corresponding sentence. The text information characterization module 30 is configured to input the initial sentence vectors into a pre-trained sentence vector model and obtain the final sentence vector of each sentence, where the final sentence vector is used to characterize the text information; the pre-trained sentence vector model is generated based on the context of sentences.
Specifically, in the embodiment of the present invention, the corpus to be analyzed processed by the word vector generation module 10 may be any text information obtained from the internet or stored locally on the terminal device. Regarding how the word vectors are obtained, in some embodiments the word vector generation module 10 is specifically configured to: segment the corpus to be analyzed with a preset word segmentation algorithm, and perform stop-word removal on the segmentation result to obtain a lexicon containing N word segments, where N is a positive integer; then input the N word segments of the lexicon into a preset word vector model to obtain the word vectors of the N word segments. Specifically, the word vector generation module 10 may select different word segmentation algorithms for different languages; for a Chinese corpus, string-matching (mechanical) segmentation, understanding-based segmentation, or statistics-based segmentation may be used, such as a shortest-path segmentation algorithm or the jieba segmentation algorithm, which is not limited in this scheme.
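The segmentation-and-lexicon step can be sketched as follows. A real Chinese pipeline would call a segmenter such as jieba; whitespace splitting stands in for it here, and the stop-word list is a hypothetical placeholder.

```python
# Sketch: tokenize the corpus, drop stop words, and build a lexicon of
# N unique word segments to feed into the word vector model.

STOP_WORDS = {"the", "a", "of"}  # hypothetical stop-word list

def build_lexicon(corpus_sentences):
    lexicon = []
    for sentence in corpus_sentences:
        for token in sentence.split():  # stand-in for jieba.cut(sentence)
            token = token.lower()
            if token not in STOP_WORDS and token not in lexicon:
                lexicon.append(token)
    return lexicon  # N = len(lexicon)

lexicon = build_lexicon(["The price of gold rose", "Gold hit a record"])
print(lexicon)  # the N retained word segments, in first-seen order
```

The N segments of this lexicon would then be passed to a preset word vector model (e.g. a trained word2vec model) to obtain their word vectors.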
In this embodiment, the initial sentence vector generation module 20 may use a word2vec model when generating sentence vectors; the specific implementation may refer to the related content in the method embodiment above and is not repeated here. In addition, when identifying the word segments of each sentence, the initial sentence vector generation module 20 uses the same word segmentation preprocessing as the word vector generation module 10 to keep the segmentation results consistent, and the number of sentences in the corpus to be analyzed equals the number of initial sentence vectors produced by the initial sentence vector generation module 20.
Regarding how the initial sentence vector of each sentence is obtained, in one embodiment of the present invention, when the initial sentence vector generation module 20 generates the initial sentence vector of a sentence from its word vector group, it is specifically configured to: average, or take a weighted average of, the word vectors in the word vector group to obtain the initial sentence vector of the corresponding sentence. In the weighted-average mode, each word segment is assigned a weight according to its occurrence frequency or importance in the whole lexicon, and that weight is used to form the weighted average of the word vectors in each sentence, yielding the corresponding initial sentence vector. The details of word vector averaging and weighted averaging can be found in the method embodiment above and are not repeated here.
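The two averaging modes can be sketched as follows; the two-dimensional word vectors and the weights are made-up illustrations, not values from the embodiment.

```python
import numpy as np

# Sketch: form an initial sentence vector from a word-vector group by
# plain average or by frequency/importance-weighted average.

def sentence_vector(word_vectors, weights=None):
    vecs = np.asarray(word_vectors, dtype=float)
    if weights is None:
        return vecs.mean(axis=0)           # plain average
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                        # normalize the weights
    return (vecs * w[:, None]).sum(axis=0)  # weighted average

group = [[1.0, 0.0], [0.0, 1.0]]           # word vectors of one sentence
print(sentence_vector(group))              # plain average
print(sentence_vector(group, [3, 1]))      # weighted: first word 3x as important
```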
As an embodiment of the present invention, the initial sentence vector algorithm model adopted by the initial sentence vector generation module 20 may be a GRU algorithm model; the description of the GRU algorithm model may refer to the relevant content in the method embodiment above and is not repeated here.
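A minimal GRU-cell sketch of this alternative: the cell reads the word-vector sequence of one sentence and its last hidden state is kept as the initial sentence vector. The weight matrices are random placeholders, not trained parameters, and the dimensions are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_sentence_vector(word_vecs, Wz, Uz, Wr, Ur, Wh, Uh):
    """Run a GRU cell over the word vectors; return the final hidden state."""
    h = np.zeros(Uz.shape[0])
    for x in word_vecs:
        z = sigmoid(Wz @ x + Uz @ h)              # update gate
        r = sigmoid(Wr @ x + Ur @ h)              # reset gate
        h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state
        h = (1 - z) * h + z * h_tilde             # blend old and candidate
    return h

rng = np.random.default_rng(0)
d_in, d_h = 4, 3                                  # word-vector and hidden sizes
W = [rng.standard_normal((d_h, d_in)) for _ in range(3)]
U = [rng.standard_normal((d_h, d_h)) for _ in range(3)]
sentence = rng.standard_normal((5, d_in))         # 5 word vectors of one sentence
vec = gru_sentence_vector(sentence, W[0], U[0], W[1], U[1], W[2], U[2])
print(vec.shape)                                  # one fixed-size sentence vector
```

Because the word vectors are fed in order, the resulting vector, unlike a plain average, is sensitive to word order within the sentence.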
In an embodiment of the present invention, as shown in fig. 7, the text information characterization system further includes a model training module 40 configured to train the sentence vector model before the corpus to be analyzed is obtained. As shown in fig. 4, the process by which the model training module 40 trains the pre-trained sentence vector model includes:
Acquire a training corpus through the word vector generation module 10, perform word segmentation preprocessing on the corpus in the training corpus, and generate a corresponding word vector for each obtained word segment; the training corpus is a set of training text information containing at least one training sentence. Then obtain, through the initial sentence vector generation module 20, the word vectors of the word segments contained in each training sentence, forming a word vector group for each training sentence, and input the word vectors of each group in order into the initial sentence vector algorithm model to generate the initial sentence vector of the corresponding training sentence. Finally, based on the context of the training sentences in the training corpus, input the initial sentence vectors of all training sentences into the initial sentence vector model for training, obtaining the pre-trained sentence vector model.
The training corpus obtained by the word vector generation module 10 may be an internet corpus such as the Baidu corpus or the Wikipedia corpus, or corpora from other network sources such as various information websites. Using large-scale internet corpora facilitates converting the unsupervised training of the algorithm model into supervised training, effectively improving the performance of the algorithm model adopted in this scheme. The training corpus may be a Chinese corpus, a foreign-language corpus, or a combined corpus of specified languages.
In an embodiment of the present invention, as shown in fig. 8, the model training module 40 may include a parameter matrix configuration unit 41, a sample generation unit 42, a model training unit 43, and a model verification unit 44. The parameter matrix configuration unit 41 is configured to configure the parameter matrix of the initial sentence vector model, where the parameter matrix connects the input layer and the output layer of the initial sentence vector model. The sample generation unit 42 is connected to the word vector generation module 10 and the initial sentence vector generation module 20 and is configured to generate training samples and test samples according to the context of each training sentence; the training samples and the test samples contain K1 and K2 sentence groups respectively, each sentence group includes at least one training sentence used to generate an input sentence vector and at least one training sentence used to generate an output sentence vector, K1 and K2 are positive integers, K1 and K2 may be equal or unequal, and K1 may be no less than K2, i.e., the number of training samples is not less than the number of test samples. The model training unit 43 is configured to sequentially input the input sentence vector of each sentence group in the training sample into the initial sentence vector model for training, and gradually adjust the parameters in the parameter matrix until all sentence groups in the training sample have been used, so that the output of the initial sentence vector model gradually matches the corresponding output sentence vector of each sentence group. The model verification unit 44 is configured to verify the trained initial sentence vector model through the test sample; once verification passes, training is complete and the trained sentence vector model is obtained.
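The sample-generation step performed by unit 42 can be sketched as follows; the split ratio and seed are illustrative assumptions, the only constraint taken from the text being K1 ≥ K2.

```python
import random

# Sketch: split the sentence groups built from sentence context into
# K1 training groups and K2 test groups, with K1 >= K2.

def split_groups(groups, test_fraction=0.2, seed=7):
    shuffled = groups[:]
    random.Random(seed).shuffle(shuffled)   # deterministic shuffle for the sketch
    k2 = max(1, int(len(shuffled) * test_fraction))
    return shuffled[k2:], shuffled[:k2]     # (K1 training groups, K2 test groups)

# Each group: (sentence generating the input vector, sentence generating the output vector)
groups = [(f"in{i}", f"out{i}") for i in range(10)]
train, test = split_groups(groups)
print(len(train), len(test))                # K1 and K2
```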
Further, the initial sentence vector is input to the text information characterization module 30, where the initial sentence vector of the corpus to be analyzed is multiplied by the parameter matrix to obtain the final sentence vector that characterizes the text information of the corpus to be analyzed.
As an implementation of the present invention, the initial sentence vector model may be a skip-gram model or a CBOW model. Specifically, for the skip-gram model, a single sentence is input to predict the sentences that form its context; in this case each sentence group in the training sample and the test sample contains only one sentence serving as the input. For the CBOW model, multiple context sentences are input to predict the sentence located among them; in this case each sentence group contains only one sentence serving as the output. In this embodiment, the initial sentence vector is corrected by the trained sentence vector model, and because the context of each sentence is taken into account, the text is characterized more accurately. When applied to information-stream pushing, text information such as news titles is therefore represented more accurately, which helps improve the reading conversion rate of the pushed information.
According to the text information characterization system provided by the embodiment of the invention, a sentence vector model is established based on the context of sentences and text information is characterized at the sentence level. Because sentence context is taken into account, the characterization is not distorted by words carrying different meanings in different sentences, so the text information is characterized more accurately. In addition, large-scale internet corpora can be used when training the pre-trained sentence vector model, which effectively converts unsupervised training into supervised training, improves the training effect of the model, and thereby further improves the accuracy of text information characterization.
The embodiment of the present invention further provides a computer device. As shown in fig. 9, the computer device includes at least one processor 71 and a memory 72 communicatively connected to the at least one processor 71; one processor 71 is shown in fig. 9 as an example. The memory 72 stores computer readable instructions executable by the at least one processor 71, and the computer readable instructions, when executed by the at least one processor 71, enable the at least one processor 71 to perform the steps of the text information characterization method described above.
Specifically, the memory 72 in the embodiment of the present invention is a non-volatile computer readable storage medium, which may be used to store computer readable instructions, non-volatile software programs, non-volatile computer executable programs, and modules, such as program instructions/modules corresponding to the text information characterization method in the above embodiments of the present application; the processor 71 executes various functional applications and performs data processing by running non-volatile software programs, computer readable instructions and modules stored in the memory 72, i.e. implements the text information characterization method described in the method embodiments above.
In some embodiments, the memory 72 may include a program storage area and a data storage area, wherein the program storage area may store an operating system, at least one application program required for functionality; the data storage area may store data created during the processing of the text information characterization method, etc. In addition, memory 72 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device;
in some embodiments, memory 72 may optionally include remote memory located remotely from processor 71, which may be connected to a computer device performing domain name filtering processing through a network, examples of which include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
In an embodiment of the present invention, the computer device performing the text information characterization method may further include an input system 73 and an output system 74; the input system 73 may obtain information about a user's operations on the computer device, and the output system 74 may include a display device such as a display screen. In the embodiment of the present invention, the processor 71, the memory 72, the input system 73 and the output system 74 may be connected by a bus or in other ways; a bus connection is illustrated in fig. 9.
According to the computer device provided in the embodiment of the present invention, the steps of the text information characterization method in the above embodiments can be performed when the processor 71 executes the code in the memory 72; for technical details not described in detail in this embodiment, refer to the technical content provided in the method embodiments of the present application.
Embodiments of the present invention also provide a computer readable storage medium storing computer readable instructions which, when executed by at least one processor, implement the steps of the text information characterization method described above and achieve the technical effects of the foregoing method embodiments; for technical details not described in detail in this embodiment, refer to the technical details provided in the method embodiments of the present application.
The embodiment of the invention also provides a computer program product which can execute the text information characterization method provided by the embodiment of the method, and has the corresponding functional modules and beneficial effects of the execution method. Technical details not described in detail in this embodiment may be found in the technical content provided in the method embodiments of the present application.
It should be noted that, in the above embodiment of the present invention, each functional module may be integrated in one processing unit, or each module may exist alone physically, or two or more modules may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.
The integrated units implemented in the form of software functional units described above may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium, and includes several computer readable instructions for causing a computer system (which may be a personal computer, a server, or a network system, etc.) or an intelligent terminal device or Processor (Processor) to execute some of the steps of the methods according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In the foregoing embodiments of the present invention, it should be understood that the disclosed systems and methods may be implemented in other ways. For example, the system embodiments described above are merely illustrative, e.g., the division of the modules is merely a logical function division, and there may be additional divisions when actually implemented, e.g., at least two modules or components may be combined or integrated into another system, or some features may be omitted or not performed.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e. may be located in one place, or may be distributed over at least two network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
It is apparent that the above-described embodiments are only some, not all, of the embodiments of the present invention, and the preferred embodiments shown in the drawings do not limit the scope of the claims. The invention may be embodied in many different forms; these embodiments are provided so that this disclosure will be thorough and complete. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their features. All equivalent structures made using the content of the specification and drawings of the invention, whether applied directly or indirectly in other related technical fields, likewise fall within the scope of the invention.

Claims (7)

1. A text information characterization method, comprising:
acquiring a training corpus, performing word segmentation preprocessing on the corpus in the training corpus, and respectively generating corresponding word vectors based on the obtained word segments, wherein the training corpus is a training text information set comprising at least one training sentence;
acquiring word vectors of the word segments contained in each training sentence to obtain a word vector group for each training sentence, and sequentially inputting the word vectors in the word vector group of the training sentence into an initial sentence vector algorithm model to generate the initial sentence vector of the corresponding training sentence;
configuring a parameter matrix of the initial sentence vector model, wherein the parameter matrix is connected with an input layer and an output layer of the initial sentence vector model;
generating a training sample and a test sample according to the context relation corresponding to each training sentence, wherein the training sample and the test sample respectively comprise K1 and K2 sentence groups, each sentence group comprises at least one training sentence used for generating an input sentence vector and at least one training sentence used for generating an output sentence vector, and K1 and K2 are positive integers;
inputting the input sentence vector of each sentence group in the training sample into the initial sentence vector model in sequence for training, and gradually adjusting the parameters in the parameter matrix until all sentence groups in the training sample have been used for training, so that the output of the initial sentence vector model gradually matches the corresponding output sentence vector of each sentence group;
verifying the trained initial sentence vector model through the test sample, and completing training of the initial sentence vector model after verification passes, to obtain a trained sentence vector model;
acquiring a corpus to be analyzed, performing word segmentation preprocessing on the corpus to be analyzed, and respectively generating corresponding word vectors based on the obtained word segments, wherein the corpus to be analyzed is text information, and the text information comprises at least one sentence;
acquiring word vectors of the word segments contained in each sentence in the corpus to be analyzed to obtain a word vector group for each sentence, and sequentially inputting the word vectors in the word vector group into the initial sentence vector algorithm model to generate the initial sentence vector of the corresponding sentence;
and inputting the initial sentence vector into a pre-trained sentence vector model, and multiplying the initial sentence vector of the corpus to be analyzed by the parameter matrix to obtain a final sentence vector for characterizing the text information of the corpus to be analyzed, wherein the final sentence vector is used to characterize the text information, and the pre-trained sentence vector model is generated based on the context of sentences.
2. The text information characterization method of claim 1 wherein the initial sentence vector model is a skip-gram model or a cbow model.
3. The text information characterization method of claim 1 wherein the obtaining the corpus to be analyzed, performing word segmentation preprocessing on the corpus to be analyzed, and generating the corresponding word vectors based on the obtained word segments respectively includes:
performing word segmentation on the corpus to be analyzed by adopting a preset word segmentation algorithm, and performing stop-word removal on the word segmentation result to obtain a lexicon containing N word segments, wherein N is a positive integer;
and inputting the N word segments in the lexicon into a preset word vector model to obtain word vectors of the N word segments.
4. The text information characterization method of claim 1 wherein the initial sentence vector algorithm model is a GRU algorithm model.
5. A text information characterization system, comprising:
the model training module is used for acquiring a training corpus, performing word segmentation preprocessing on the corpus in the training corpus, and respectively generating corresponding word vectors based on the obtained word segments, wherein the training corpus is a training text information set comprising at least one training sentence; acquiring word vectors of the word segments contained in each training sentence to obtain a word vector group for each training sentence, and sequentially inputting the word vectors in the word vector group of the training sentence into an initial sentence vector algorithm model to generate the initial sentence vector of the corresponding training sentence; configuring a parameter matrix of the initial sentence vector model, wherein the parameter matrix connects an input layer and an output layer of the initial sentence vector model; generating a training sample and a test sample according to the context corresponding to each training sentence, wherein the training sample and the test sample respectively comprise K1 and K2 sentence groups, each sentence group comprises at least one training sentence used for generating an input sentence vector and at least one training sentence used for generating an output sentence vector, and K1 and K2 are positive integers; inputting the input sentence vector of each sentence group in the training sample into the initial sentence vector model in sequence for training, and gradually adjusting the parameters in the parameter matrix until all sentence groups in the training sample have been used for training, so that the output of the initial sentence vector model gradually matches the corresponding output sentence vector of each sentence group; verifying the trained initial sentence vector model through the test sample, and completing training of the initial sentence vector model after verification passes, to obtain a trained sentence vector model;
the word vector generation module is used for acquiring the corpus to be analyzed, performing word segmentation preprocessing on the corpus to be analyzed, and respectively generating corresponding word vectors based on the obtained word segments, wherein the corpus to be analyzed is text information; the text information comprises at least one sentence;
the initial sentence vector generation module is used for obtaining word vectors of word segmentation contained in each sentence in the corpus to be analyzed to obtain a word vector group of each sentence, and sequentially inputting the word vectors in the word vector group into the initial sentence vector algorithm model to generate initial sentence vectors of corresponding sentences;
the text information characterization module is used for inputting the initial sentence vector into a pre-trained sentence vector model, multiplying the initial sentence vector of the corpus to be analyzed by the parameter matrix to obtain a final sentence vector for characterizing the text information of the corpus to be analyzed, wherein the final sentence vector is used for characterizing the text information; wherein the pre-trained sentence vector model is generated based on the context of the sentence.
6. A computer device, comprising:
at least one processor; and,
a memory communicatively coupled to the at least one processor; wherein,
The memory stores computer readable instructions executable by the at least one processor, which when executed by the at least one processor, cause the at least one processor to perform the steps of the text information characterization method of any of claims 1 to 4.
7. A computer-readable storage medium having stored thereon computer-readable instructions which, when executed by at least one processor, implement the steps of the text information characterization method of any of claims 1 to 4.
CN201910981528.4A 2019-10-16 2019-10-16 Text information characterization method, system, computer equipment and storage medium Active CN111104799B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910981528.4A CN111104799B (en) 2019-10-16 2019-10-16 Text information characterization method, system, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910981528.4A CN111104799B (en) 2019-10-16 2019-10-16 Text information characterization method, system, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111104799A CN111104799A (en) 2020-05-05
CN111104799B true CN111104799B (en) 2023-07-21

Family

ID=70421422

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910981528.4A Active CN111104799B (en) 2019-10-16 2019-10-16 Text information characterization method, system, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111104799B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111694941B (en) * 2020-05-22 2024-01-05 腾讯科技(深圳)有限公司 Reply information determining method and device, storage medium and electronic equipment
CN111639194B (en) * 2020-05-29 2023-08-08 天健厚德网络科技(大连)有限公司 Knowledge graph query method and system based on sentence vector
CN112016295B (en) * 2020-09-04 2024-02-23 平安科技(深圳)有限公司 Symptom data processing method, symptom data processing device, computer equipment and storage medium
CN112926329B (en) * 2021-03-10 2024-02-20 招商银行股份有限公司 Text generation method, device, equipment and computer readable storage medium
CN113157853B (en) * 2021-05-27 2024-02-06 中国平安人寿保险股份有限公司 Problem mining method, device, electronic equipment and storage medium
CN113435582B (en) * 2021-06-30 2023-05-30 平安科技(深圳)有限公司 Text processing method and related equipment based on sentence vector pre-training model
CN113707299A (en) * 2021-08-27 2021-11-26 平安科技(深圳)有限公司 Auxiliary diagnosis method and device based on inquiry session and computer equipment
CN114036272A (en) * 2021-10-29 2022-02-11 厦门快商通科技股份有限公司 Semantic analysis method and system for dialog system, electronic device and storage medium
CN114118085B (en) * 2022-01-26 2022-04-19 云智慧(北京)科技有限公司 Text information processing method, device and equipment
CN114943220B (en) * 2022-04-12 2023-01-10 中国科学院计算机网络信息中心 Sentence vector generation method and duplicate checking method for scientific research establishment duplicate checking

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108280058A (en) * 2018-01-02 2018-07-13 中国科学院自动化研究所 Relation extraction method and apparatus based on intensified learning
WO2019056692A1 (en) * 2017-09-25 2019-03-28 平安科技(深圳)有限公司 News sentence clustering method based on semantic similarity, device, and storage medium
WO2019072166A1 (en) * 2017-10-10 2019-04-18 腾讯科技(深圳)有限公司 Semantic analysis method, device, and storage medium
CN110287312A (en) * 2019-05-10 2019-09-27 平安科技(深圳)有限公司 Calculation method, device, computer equipment and the computer storage medium of text similarity

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019056692A1 (en) * 2017-09-25 2019-03-28 平安科技(深圳)有限公司 News sentence clustering method based on semantic similarity, device, and storage medium
WO2019072166A1 (en) * 2017-10-10 2019-04-18 腾讯科技(深圳)有限公司 Semantic analysis method, device, and storage medium
CN108280058A (en) * 2018-01-02 2018-07-13 中国科学院自动化研究所 Relation extraction method and apparatus based on intensified learning
CN110287312A (en) * 2019-05-10 2019-09-27 平安科技(深圳)有限公司 Calculation method, device, computer equipment and the computer storage medium of text similarity

Also Published As

Publication number Publication date
CN111104799A (en) 2020-05-05

Similar Documents

Publication Publication Date Title
CN111104799B (en) Text information characterization method, system, computer equipment and storage medium
WO2021204272A1 (en) Privacy protection-based target service model determination
CN110795911B (en) Real-time adding method and device for online text labels and related equipment
US10755048B2 (en) Artificial intelligence based method and apparatus for segmenting sentence
US20230244704A1 (en) Sequenced data processing method and device, and text processing method and device
CN109543165B (en) Text generation method and device based on circular convolution attention model
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN113408299A (en) Training method, device, equipment and storage medium of semantic representation model
CN110598207B (en) Word vector obtaining method and device and storage medium
CN110866119B (en) Article quality determination method and device, electronic equipment and storage medium
CN109948160B (en) Short text classification method and device
CN110598869B (en) Classification method and device based on sequence model and electronic equipment
CN110175469B (en) Social media user privacy leakage detection method, system, device and medium
CN109918503B (en) Slot filling method for extracting semantic features based on a dynamic-window self-attention mechanism
CN111241843A (en) Semantic relation inference system and method based on composite neural network
CN110895655A (en) Method and device for extracting text core phrase
CN112434143B (en) Dialog method, storage medium and system based on hidden-state constraints of a GRU (gated recurrent unit)
JP2022088540A (en) Method for generating user interest image, device, electronic apparatus and storage medium
CN114141236A (en) Language model updating method and device, electronic equipment and storage medium
CN114091555A (en) Training method and device of image recognition model, electronic equipment and storage medium
CN111797621A (en) Method and system for replacing terms
CN113408297B (en) Method, apparatus, electronic device and readable storage medium for generating node representation
CN115169549B (en) Artificial intelligent model updating method and device, electronic equipment and storage medium
CN117149957B (en) Text processing method, device, equipment and medium
CN117668198A (en) Question-answering processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant