CN111104799A - Text information representation method and system, computer equipment and storage medium - Google Patents
Abstract
The invention belongs to the field of artificial intelligence and relates to a text information representation method and system, computer equipment, and a storage medium. The method comprises the following steps: obtaining a corpus to be analyzed, performing word segmentation preprocessing on the corpus, and generating a corresponding word vector for each obtained participle, wherein the corpus to be analyzed is text information comprising at least one sentence; obtaining the word vectors of the participles contained in each sentence of the corpus to obtain a word vector group for each sentence, and inputting the word vectors of each group in order into an initial sentence vector algorithm model to generate an initial sentence vector for the corresponding sentence; and inputting the initial sentence vectors into a pre-trained sentence vector model to obtain a final sentence vector for each sentence, wherein the final sentence vectors are used to represent the text information and the pre-trained sentence vector model is generated based on the context relationships of sentences. The scheme provided by the invention avoids the influence of words having different semantics in different sentences and represents text information more accurately.
Description
Technical Field
Embodiments of the invention belong to the technical field of artificial intelligence, and in particular relate to a text information representation method and system, computer equipment, and a storage medium.
Background
In the field of natural language processing, text information representation is the basis for solving text processing problems. In the prior art, text is generally represented by adding and averaging Word2Vec word vectors. However, the same word has different semantics in different sentences and different contexts, so text representation based on word vectors alone is inaccurate and is unsuited to representing text such as article information in the field of information stream recommendation.
Disclosure of Invention
In view of this, embodiments of the present invention provide a text information representation method and system, computer equipment, and a storage medium, so as to solve the prior-art problem that text representation based on word vectors is not accurate enough and is unsuited to representing text such as article information in the field of information stream recommendation.
In a first aspect, an embodiment of the present invention provides a text information characterization method, including:
obtaining a corpus to be analyzed, performing word segmentation preprocessing on the corpus to be analyzed, and generating corresponding word vectors based on the obtained participles, wherein the corpus to be analyzed is text information comprising at least one sentence;
obtaining the word vectors of the participles contained in each sentence of the corpus to be analyzed to obtain a word vector group for each sentence, inputting the word vectors of the word vector group in order into an initial sentence vector algorithm model, and generating an initial sentence vector for the corresponding sentence;
and inputting the initial sentence vector into a pre-trained sentence vector model to obtain a final sentence vector of each sentence, wherein the final sentence vector is used for representing text information, and the pre-trained sentence vector model is generated based on the context of the sentence.
As an implementable manner of the present invention, before the obtaining of the corpus to be analyzed, the method further includes a step of performing model training on the pre-trained sentence vector model, wherein a training process of the pre-trained sentence vector model includes:
acquiring a training corpus set, performing word segmentation preprocessing on the corpora in the training corpus set, and generating corresponding word vectors based on the obtained participles, wherein the training corpus set is a set of training text information, and each piece of training text information comprises at least one training sentence;
acquiring the word vectors of the participles contained in each training sentence to obtain a word vector group for each training sentence, and inputting the word vectors of each training sentence's word vector group in order into the initial sentence vector algorithm model to generate the initial sentence vector of the corresponding training sentence;
and inputting the initial sentence vector corresponding to each training sentence into an initial sentence vector model for training based on the context corresponding to each training sentence in the training corpus set to obtain the pre-trained sentence vector model.
As an implementable aspect of the present invention, the obtaining of the pre-trained sentence vector model by inputting the initial sentence vector corresponding to each training sentence into the initial sentence vector model based on the context corresponding to each training sentence in the training corpus set includes:
configuring a parameter matrix of the initial sentence vector model, wherein the parameter matrix is connected with an input layer and an output layer of the initial sentence vector model;
generating training samples and test samples according to the context corresponding to each training sentence, wherein the training samples and the test samples respectively comprise K1 sentence groups and K2 sentence groups, each sentence group comprises at least one training sentence used for generating an input sentence vector and at least one training sentence used for generating an output sentence vector, and K1 and K2 are positive integers;
sequentially inputting the input sentence vector of each sentence group in the training sample into the initial sentence vector model for training, and gradually adjusting the parameters in the parameter matrix until all sentence groups in the training sample have been trained on, so that the output of the initial sentence vector model gradually matches the corresponding output sentence vector of each sentence group;
and testing the initial sentence vector model after training through the test sample, and finishing the training of the initial sentence vector model if the test is passed, so as to obtain the trained sentence vector model.
As an implementation manner of the present invention, the inputting the initial sentence vector to a pre-trained sentence vector model to obtain a final sentence vector of each sentence includes: and inputting the initial sentence vector into the pre-trained sentence vector model, and multiplying the initial sentence vector of the corpus to be analyzed by the parameter matrix to obtain a final sentence vector for representing the text information of the corpus to be analyzed.
As a way in which the present invention may be implemented, the initial sentence vector model may be a skip-gram model or a cbow model.
As an implementable manner of the present invention, the performing word segmentation preprocessing on the corpora in the corpus to obtain a group of participles, and generating corresponding word vectors for the obtained participles, includes:
performing word segmentation on the corpora in the corpus by adopting a preset word segmentation algorithm, and performing stop-word removal on the word segmentation result to obtain a word bank containing N participles, wherein N is a positive integer;
and inputting the N participles in the word stock into a preset word vector model to obtain word vectors of the N participles.
As a practical mode of the present invention, the initial sentence vector algorithm model is a GRU algorithm model.
In a second aspect, an embodiment of the present invention provides a text information characterization system, including:
the word vector generation module is used for acquiring the corpus to be analyzed, performing word segmentation preprocessing on the corpus to be analyzed, and generating corresponding word vectors based on the obtained participles, wherein the corpus to be analyzed is text information comprising at least one sentence;
the initial sentence vector generation module is used for acquiring word vectors of participles contained in each sentence in the corpus to be analyzed to obtain a word vector group of each sentence, and sequentially inputting the word vectors in the word vector group into the initial sentence vector algorithm model to generate an initial sentence vector of the corresponding sentence;
the text information representation module is used for inputting the initial sentence vector into a pre-trained sentence vector model to obtain a final sentence vector of each sentence, and the final sentence vector is used for representing text information; wherein the pre-trained sentence vector model is generated based on context relationships of sentences.
In a third aspect, an embodiment of the present invention provides a computer device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor, wherein
the memory stores computer-readable instructions executable by the at least one processor, which, when executed by the at least one processor, cause the at least one processor to perform the steps of the textual information characterization method as described above.
In a fourth aspect, the present invention provides a computer-readable storage medium, on which computer-readable instructions are stored, and the computer-readable instructions, when executed by at least one processor, implement the steps of the text information characterization method as described above.
According to the text information representation method, the text information representation system, the computer equipment and the storage medium provided by the embodiment of the invention, the pre-trained sentence vector model is established based on the context relationship of the sentences to carry out text information representation at sentence level.
Drawings
In order to illustrate the solution of the invention more clearly, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are some embodiments of the invention, and that other drawings may be derived from these drawings by a person skilled in the art without inventive effort.
Fig. 1 is a flowchart of a text information characterization method according to an embodiment of the present invention;
FIG. 2 is a flow chart of generating word vectors according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a network node of a GRU algorithm model according to an embodiment of the present invention;
FIG. 4 is a flowchart of a training process for a pre-trained sentence vector model according to an embodiment of the present invention;
FIG. 5 is a flowchart of training an initial sentence vector model based on context of a training sentence according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a text information representation system according to an embodiment of the present invention;
FIG. 7 is another schematic diagram of a textual information representation system according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a model training module according to an embodiment of the present invention;
fig. 9 is a block diagram of a computer device according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
The appearances of the phrase "an embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The embodiment of the invention provides a text information representation method, as shown in fig. 1, the text information representation method comprises the following steps:
s1, obtaining a corpus to be analyzed, performing word segmentation preprocessing on the corpus to be analyzed, and generating corresponding word vectors based on the obtained participles, wherein the corpus to be analyzed is text information comprising at least one sentence;
s2, obtaining the word vectors of the participles contained in each sentence of the corpus to be analyzed to obtain a word vector group for each sentence, and inputting the word vectors of the word vector group in order into an initial sentence vector algorithm model to generate an initial sentence vector for the corresponding sentence;
s3, inputting the initial sentence vector to a pre-trained sentence vector model to obtain a final sentence vector of each sentence, wherein the final sentence vector is used for representing text information, and the pre-trained sentence vector model is generated based on the context of the sentence.
Specifically, in this embodiment of the present invention, the corpus to be analyzed in step S1 may be various text information from the internet or stored locally on the terminal device. Regarding the acquisition of word vectors, in some embodiments of the present invention, as shown in fig. 2, the obtaining the corpus to be analyzed, performing word segmentation preprocessing on it, and generating corresponding word vectors based on the obtained participles may specifically include:
s11, performing word segmentation on the corpus to be analyzed by adopting a preset word segmentation algorithm, and performing stop-word removal on the word segmentation result to obtain a word bank containing N participles, wherein N is a positive integer;
s12, inputting the N participles in the word stock into a preset word vector model to obtain word vectors of the N participles.
Specifically, in S11, different types of word segmentation algorithms may be selected for different languages. For a Chinese corpus, word segmentation methods based on string matching (mechanical word segmentation), on understanding, or on statistics may be used, such as a shortest-path word segmentation algorithm or the jieba word segmentation algorithm; this scheme is not limited in this respect.
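As a minimal sketch of the string-matching (mechanical) approach named above, the following implements forward maximum matching over a toy dictionary, followed by stop-word removal. The dictionary, stop-word list, and sample string are illustrative assumptions; a production system would use a segmenter such as jieba instead.

```python
def forward_max_match(text, dictionary, max_len):
    """Greedily take the longest dictionary entry at each position;
    fall back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

# Toy dictionary and input (assumed for illustration only)
dictionary = {"traffic", "accident", "report"}
max_len = max(len(w) for w in dictionary)
participles = forward_max_match("trafficaccidentreport", dictionary, max_len)
# participles == ["traffic", "accident", "report"]

# Stop-word removal, as in step S11, yields the word bank
stop_words = {"the", "a"}
word_bank = [w for w in participles if w not in stop_words]
```

The same greedy scan works character-by-character for Chinese text once the dictionary holds multi-character words.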
In this embodiment, step S12 may be implemented with a word2vec model. Specifically, the N participles are sorted and each is represented by a one-hot vector. For example, the text "traffic accident at some place, safe life and quick start special claim pre-claiming service" yields the following participles after the word segmentation preprocessing of step S11: "place", "traffic", "accident", "safe life", "fast", "start", "special case", "pre-claim", and "service", forming a word bank of 9 participles. After sorting, the 9 participles are represented by one-hot vectors as follows:
somewhere → [1, 0, 0, 0, 0, 0, 0, 0, 0 ];
traffic → [0, 1, 0, 0, 0, 0, 0, 0, 0 ];
accident → [0, 0, 1, 0, 0, 0, 0, 0, 0 ];
safe life → [0, 0, 0, 1, 0, 0, 0, 0, 0 ];
rapidly → [0, 0, 0, 0, 1, 0, 0, 0, 0 ];
start → [0, 0, 0, 0, 0, 1, 0, 0, 0 ];
special case → [0, 0, 0, 0, 0, 0, 1, 0, 0 ];
preclaims → [0, 0, 0, 0, 0, 0, 0, 1, 0 ];
service → [0, 0, 0, 0, 0, 0, 0, 0, 1 ];
The dimension of each one-hot vector equals the number N of participles in the word bank, and the one-hot vectors serve as the input of the word2vec model. Specifically, the one-hot vectors of one or more participles are input to the word2vec model according to the context relationships of the participles in the word bank, and the initially set weight matrix of the word2vec model is trained and optimized. The word vector of each participle is then obtained from the trained weight matrix: multiplying the one-hot vector of a participle by the trained weight matrix yields the corresponding word vector.
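A minimal numpy sketch of the lookup just described: multiplying a participle's one-hot vector by the trained N × d weight matrix selects the row that is its word vector. The random matrix stands in for the weights word2vec would actually learn; N = 9 matches the example word bank above, and d = 4 is an illustrative dimension.

```python
import numpy as np

N, d = 9, 4                          # word bank size, word-vector dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(N, d))          # stand-in for the trained weight matrix

one_hot = np.eye(N)                  # row k is the one-hot vector of participle k
word_vec = one_hot[2] @ W            # word vector of the 3rd participle ("accident")
assert np.allclose(word_vec, W[2])   # one-hot multiplication just selects row 2
```

In practice no multiplication is performed at all; the row is indexed directly, which is why the weight matrix itself is often called the embedding table.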
In the embodiment of the present invention, the participles of each sentence in step S2 are determined with the same word segmentation preprocessing method as in step S1, ensuring consistency of the segmentation results, and the number of sentences in the corpus to be analyzed equals the number of initial sentence vectors obtained in step S2.
Regarding the obtaining of the initial sentence vector of each sentence, in an embodiment of the present invention, generating the initial sentence vector of the corresponding sentence from the word vector group may include: averaging, or weighted averaging, the word vectors in the word vector group to obtain the initial sentence vector of the corresponding sentence. For plain averaging, the example above ("traffic accident at some place, safe life and quick start special claim pre-claiming service") yields 9 participles after the word segmentation preprocessing of step S11, corresponding to 9 word vectors; the values in the 9 word vectors are averaged directly to generate a new vector of the same dimension, namely the initial sentence vector. For weighted averaging, each participle occupies a certain weight in the whole word bank according to its occurrence frequency or importance. For example, among the 9 word vectors corresponding to "place", "traffic", "accident", "safe life", "fast", "start", "special case", "pre-claim", and "service", words such as "accident", "safe life", and "pre-claim" need to be more prominent in the text representation, so their weights are higher than those of the other participles. The weight of each participle can be calculated from its occurrence frequency in the historical corpus, and the values of the word vectors of each sentence are weighted-averaged with these weights to generate a new vector of the same dimension: the corresponding initial sentence vector.
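The two pooling variants just described can be sketched in a few lines of numpy: plain averaging of a sentence's word vectors, and a weighted average whose weights come from participle frequencies. The 2-dimensional vectors and the frequency counts are illustrative values only.

```python
import numpy as np

# Word vector group of one sentence: t = 2 participles, dimension d = 2
word_vectors = np.array([[1.0, 3.0],
                         [3.0, 1.0]])

# Plain averaging: mean over the participle axis
plain = word_vectors.mean(axis=0)            # -> [2.0, 2.0]

# Weighted averaging: weights from (assumed) frequencies in a history corpus
freqs = np.array([1.0, 3.0])                 # occurrence counts per participle
weights = freqs / freqs.sum()                # normalize so weights sum to 1
weighted = weights @ word_vectors            # -> [2.5, 1.5]
```

Both results have the same dimension as the word vectors, as required for the initial sentence vector.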
As a way in which the present invention can be implemented, the initial sentence vector algorithm model may be a GRU algorithm model, which the following description takes as an example. The GRU is a variant of the recurrent neural network (RNN); the network layer of the GRU algorithm model comprises a plurality of cascaded network nodes of identical structure (see fig. 3). All sentences are stored in a certain order: if the corpus to be analyzed currently comprises M sentences, S_i denotes the i-th sentence, with i ranging from 1 to M, and each sentence contains t participles, t being a positive integer. x_t denotes the t-th participle of a sentence S_i as well as its word vector. For example, "traffic accident somewhere, safe life quickly starts special claim pre-claiming service" comprises two sentences. The first sentence, "traffic accident somewhere", yields the participles "somewhere", "traffic", and "accident" after the word segmentation preprocessing of step S11, with word vectors x_1, x_2, and x_3. The second sentence, "safe life quickly starts special claim pre-claiming service", yields the participles "safe life", "fast", "start", "special case", "pre-claim", and "service", with word vectors x_1 to x_6; the further sentences contained in the corpus to be analyzed are handled analogously. When the word vectors are input in order into the network nodes of the GRU algorithm model for processing, the following formulas hold:
the network node of the GRU algorithm comprises an update gate and a reset gate, wherein the output of the update gate is rtThe output of the reset gate is ztReset gate r for the tth wordtAnd update gate ztFrom the word vector of the t-th word and the t-1 stepOutput h oftIs obtained byIndicating the currently required information (candidate state), htRepresenting all the information currently stored, wherein σ and tanh are activation functions, the activation function σ is used to compress the processing result between 0 and 1, the activation function tanh is used to compress the result between-1 and 1 for subsequent processing by the network node, ⊙ represents a Hadamard product, i.e. the product of corresponding elements, in the formula WrAnd UrRespectively represent inputsAnd a connection matrix of a last network node to the refresh gate; wzAnd UzRespectively represent inputsAnd a connection matrix of a last network node to the reset gate; w and U represent inputs, respectivelyAnd the last network node to the candidate stateThe connection matrix of (2); wherein the update gate can control the extent to which the status information of the previous network node is brought into the status information of the current network node, ztThe larger the value of (a), the more the state information of the previous processing node is brought in, the degree to which the reset gate control ignores the state information of the previous network node, rtThe smaller the value is, the more the word segmentation information is ignored, the reset gate and the update gate can effectively accumulate the word segmentation information contained in all the word vectors to the last network node for processing, and the result containing all the word segmentation information is obtained, namely the initial sentence vector.
In this embodiment of the present invention, as for step S3, before the obtaining the corpus to be analyzed, the method further includes a step of performing model training on the pre-trained sentence vector model, where as shown in fig. 4, a training process of the pre-trained sentence vector model includes:
s31, acquiring a training corpus set, performing word segmentation preprocessing on corpora in the training corpus set, and respectively generating corresponding word vectors based on the obtained word segments, wherein the training corpus set is a training text information set, and the training text information comprises at least one training sentence;
s32, obtaining the word vectors of the participles contained in each training sentence to obtain a word vector group for each training sentence, and inputting the word vectors of each word vector group in order into the initial sentence vector algorithm model to generate the initial sentence vector of the corresponding training sentence;
and S33, inputting the initial sentence vector corresponding to each training sentence into the initial sentence vector model for training based on the context corresponding to each training sentence in the training corpus set to obtain the pre-trained sentence vector model.
The training corpus set can be drawn from internet corpora such as Baidu Encyclopedia and Wikipedia, or from other network corpora such as various information websites. By using large-scale internet corpora, the unsupervised training of the algorithm model can be converted into supervised training, which effectively improves the effect of the algorithm model adopted in this scheme.
In this embodiment, the process of obtaining the participles and their word vectors from the training corpora in step S31 is the same as in step S1 described above, ensuring consistency of the segmentation results. Similarly, the process of obtaining the initial sentence vectors of the training sentences in step S32 is the same as in step S2, and the number of sentences in the training corpus set equals the number of initial sentence vectors obtained in step S32.
For step S33, as shown in fig. 5, the inputting an initial sentence vector corresponding to each training sentence into an initial sentence vector model for training based on the context corresponding to each training sentence in the training corpus set, and obtaining the pre-trained sentence vector model may specifically include:
s331, configuring a parameter matrix of the initial sentence vector model, wherein the parameter matrix is connected with an input layer and an output layer of the initial sentence vector model;
s332, generating training samples and test samples according to the context corresponding to each training sentence, wherein the training samples and the test samples comprise K1 and K2 sentence groups respectively, and each sentence group comprises at least one training sentence used for generating an input sentence vector and at least one training sentence used for generating an output sentence vector. K1 and K2 are positive integers; they may be equal or unequal, and K1 may be no less than K2, i.e., the number of training samples is no less than the number of test samples. The training sentences used for generating the input sentence vectors and those used for generating the output sentence vectors have a context relationship. For example, in the text "I am called xx, I come from xxx", the sentences "I am called xx" and "I come from xxx" have a precedence relationship in language order (a context relationship); "I am called xx" can then serve as the sentence for generating the input sentence vector, and "I come from xxx" as the sentence for generating the output sentence vector.
S333, sequentially inputting the input sentence vector of each sentence group in the training sample into the initial sentence vector model for training, and gradually adjusting the parameters in the parameter matrix until all sentence groups in the training sample have been trained on, so that the output of the initial sentence vector model gradually matches the corresponding output sentence vector of each sentence group;
s334, the trained initial sentence vector model is tested through the test sample, and the training of the initial sentence vector model is completed after the test is passed, so that the trained sentence vector model is obtained.
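As a toy sketch of steps S331–S334, the following fits a single parameter matrix by gradient descent so that an input sentence vector, multiplied by the matrix, approximates the sentence vector of its context sentence, then checks the fit on a held-out test sample. The synthetic vectors, learning rate, iteration count, and pass threshold are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8                                        # sentence-vector dimension
target = rng.normal(size=(d, d))             # hidden "true" context mapping
X = rng.normal(size=(40, d))                 # K1 = 40 training input sentence vectors
Y = X @ target                               # their context (output) sentence vectors
X_test = rng.normal(size=(10, d))            # K2 = 10 test sentence groups
Y_test = X_test @ target

M = np.zeros((d, d))                         # S331: configure the parameter matrix
for _ in range(2000):                        # S333: gradual parameter adjustment
    grad = X.T @ (X @ M - Y) / len(X)        # mean-squared-error gradient
    M -= 0.1 * grad                          # plain gradient-descent step

test_err = np.mean((X_test @ M - Y_test) ** 2)   # S334: evaluate on the test sample
# training "passes the test" when test_err falls below a chosen threshold
```

A real implementation would minimize a softmax loss over candidate context sentences rather than this least-squares stand-in, but the parameter-matrix-between-input-and-output structure is the same.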
Further, in this embodiment of the present invention, the inputting the initial sentence vector into a pre-trained sentence vector model to obtain a final sentence vector of each sentence includes: and inputting the initial sentence vector into the pre-trained sentence vector model, and multiplying the initial sentence vector of the corpus to be analyzed by the parameter matrix to obtain a final sentence vector for representing the text information of the corpus to be analyzed.
In this embodiment, the initial sentence vector model described above may be a skip-gram model or a cbow model. Specifically, in the skip-gram model, one sentence is input and the sentences having a context relationship with it are predicted, so each sentence group in the training and test samples contains only one sentence as input; in the cbow model, several sentences are input and the sentence located in their middle, which has a context relationship with them, is predicted, so each sentence group in the training and test samples contains only one sentence as output. In this embodiment, the initial sentence vector is corrected by the trained sentence vector model; because the context relationships of sentences are taken into account, the representation of the text is more accurate. When the method is applied to information stream pushing, text information such as the titles of news items is represented more accurately, which helps improve the reading conversion rate of the information.
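One way the sentence groups for the two variants might be built from an ordered list of sentences is sketched below: skip-gram style pairs (one input sentence, its neighbouring sentences as outputs) and CBOW style pairs (the neighbours as inputs, the middle sentence as output). The window size and sample sentences are assumptions, not specified by the scheme.

```python
def skipgram_groups(sentences, window=1):
    """(input sentence, output context sentences) pairs, skip-gram style."""
    groups = []
    for i, s in enumerate(sentences):
        context = sentences[max(0, i - window):i] + sentences[i + 1:i + 1 + window]
        if context:
            groups.append((s, context))
    return groups

def cbow_groups(sentences, window=1):
    """(input context sentences, output sentence) pairs, CBOW style."""
    groups = []
    for i, s in enumerate(sentences):
        context = sentences[max(0, i - window):i] + sentences[i + 1:i + 1 + window]
        if context:
            groups.append((context, s))
    return groups

docs = ["I am called xx", "I come from xxx", "I like NLP"]
pairs = skipgram_groups(docs)     # e.g. ("I am called xx", ["I come from xxx"])
triples = cbow_groups(docs)       # e.g. (["I am called xx", "I like NLP"], "I come from xxx")
```

Each tuple corresponds to one "sentence group" in steps S332–S333, with the left element(s) generating input sentence vectors and the right element(s) generating output sentence vectors.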
According to the text information representation method provided by the embodiment of the invention, a sentence vector model is established based on the context relationships of sentences, and text information is represented at sentence level. Because sentence context is taken into account, the influence of words having different semantics in different sentences is avoided, and text information is represented more accurately. In addition, large-scale internet corpora can be used in training the pre-trained sentence vector model, effectively converting unsupervised training into supervised training and improving the model training effect, thereby further improving the accuracy of text information representation.
The embodiment of the present invention provides a text information representation system that can execute the text information representation method provided in the above embodiment. As shown in fig. 6, the text information representation system includes a word vector generation module 10, an initial sentence vector generation module 20, and a text information representation module 30. The word vector generation module 10 is configured to obtain a corpus to be analyzed, perform word segmentation preprocessing on it, and generate corresponding word vectors based on the obtained participles, wherein the corpus to be analyzed is text information comprising at least one sentence. The initial sentence vector generation module 20 is configured to obtain the word vectors of the participles contained in each sentence of the corpus to obtain a word vector group for each sentence, and to input the word vectors of each group in order into the initial sentence vector algorithm model to generate the initial sentence vector of the corresponding sentence. The text information representation module 30 is configured to input the initial sentence vectors into a pre-trained sentence vector model to obtain a final sentence vector for each sentence, the final sentence vectors being used to represent the text information, wherein the pre-trained sentence vector model is generated based on the context relationships of sentences.
Specifically, in the embodiment of the present invention, the corpus to be analyzed that is processed in the word vector generation module 10 may be various text information from the internet or stored locally on the terminal device. For the acquisition of word vectors, in some embodiments of the present invention, the word vector generation module 10 acquires the corpus to be analyzed, performs word segmentation preprocessing on it, and, when generating corresponding word vectors based on the obtained word segments, is specifically configured to: perform word segmentation on the corpus to be analyzed by adopting a preset word segmentation algorithm, perform a stop-word removal operation on the word segmentation result to obtain a word bank containing N word segments, where N is a positive integer, and input the N word segments in the word bank into a preset word vector model to obtain word vectors of the N word segments. Specifically, the word vector generation module 10 may select different types of word segmentation algorithms for different languages; for a Chinese corpus, a word segmentation method based on string matching (mechanical word segmentation), a word segmentation method based on understanding, or a word segmentation method based on statistics, such as a shortest-path word segmentation algorithm or the jieba word segmentation algorithm, may be used, which is not limited in this scheme.
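The mechanical (string-matching) segmentation and stop-word removal named above can be sketched as a forward maximum-matching pass over the text. The tiny dictionary, the stop-word list, and the function names below are illustrative assumptions, not the patent's actual algorithm:

```python
# Hypothetical sketch of the segmentation preprocessing step: a forward
# maximum-matching (mechanical) segmenter followed by stop-word removal.
# The dictionary and stop-word list here are illustrative only.

def forward_max_match(text, dictionary, max_len=4):
    """Greedily match the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words

def remove_stop_words(words, stop_words):
    return [w for w in words if w not in stop_words]

dictionary = {"文本", "信息", "表征", "方法"}
stop_words = {"的"}
tokens = forward_max_match("文本信息的表征方法", dictionary)
word_bank = remove_stop_words(tokens, stop_words)
print(word_bank)  # ['文本', '信息', '表征', '方法']
```

Each surviving word segment in the word bank would then be looked up in the preset word vector model (e.g. a word2vec lookup table) to obtain its word vector.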
In this embodiment, the initial sentence vector generation module 20 may use a word2vec model to generate sentence vectors; the specific implementation process may refer to the relevant contents in the foregoing method embodiment and is not expanded here. In addition, for determining the word segments of each sentence, the initial sentence vector generation module 20 adopts the same word segmentation preprocessing method as the word vector generation module 10, so as to ensure the consistency of the segmentation results, and the number of sentences in the corpus to be analyzed is consistent with the number of initial sentence vectors obtained by the initial sentence vector generation module 20.
Regarding the obtaining of the initial sentence vector of each sentence, in an embodiment of the present invention, when generating the initial sentence vector of the corresponding sentence according to the word vector group, the initial sentence vector generation module 20 is specifically configured to: average or weighted-average the word vectors in the word vector group to obtain the initial sentence vector of the corresponding sentence. In the weighted-average mode, each word segment is assigned a weight in the whole word bank according to its occurrence frequency or importance degree, and this weight is used to take a weighted average of the word vectors in each sentence to obtain the corresponding initial sentence vector. The processing procedures of word vector averaging and word vector weighted averaging may refer to the related technical contents in the method embodiment and are not expanded here.
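The averaging and weighted-averaging of word vectors described above can be sketched as follows; the weights passed in are an assumption (e.g. frequency- or importance-derived), since the patent leaves the exact weighting scheme open:

```python
# Sketch of forming an initial sentence vector from a word vector group by
# plain averaging or by weighted averaging. The weights are assumed to come
# from word frequency or importance, as the embodiment suggests.
import numpy as np

def average_sentence_vector(word_vectors):
    return np.mean(word_vectors, axis=0)

def weighted_sentence_vector(word_vectors, weights):
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalise so the weights sum to 1
    return np.average(word_vectors, axis=0, weights=w)

vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # one row per word
print(average_sentence_vector(vectors))          # elements 2/3 and 2/3
print(weighted_sentence_vector(vectors, [2, 1, 1]))  # elements 0.75 and 0.5
```

Both variants return a vector with the same dimensionality as the word vectors, which is what the initial sentence vector algorithm model then consumes.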
In one possible implementation of the present invention, the initial sentence vector algorithm model adopted by the initial sentence vector generation module 20 may be a GRU algorithm model; for the description of the GRU algorithm model, reference may be made to the relevant contents in the above method embodiment, which are not expanded here.
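As a rough illustration of the GRU option, a single GRU step over a sentence's word vectors can be sketched in numpy; the random weights, dimensions, and update rule shown are a minimal textbook formulation, not the patent's trained model:

```python
# Minimal numpy sketch of a GRU reading a sentence's word vectors in order.
# Weight matrices are random stand-ins; a real model would learn them.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    z = sigmoid(Wz @ x + Uz @ h_prev)               # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)               # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))   # candidate state
    return (1 - z) * h_prev + z * h_tilde           # new hidden state

dim = 3
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(6)]
h = np.zeros(dim)
for x in rng.standard_normal((4, dim)):  # one word vector per time step
    h = gru_step(x, h, *Ws)
# h now serves as the initial sentence vector for this sentence
```

Feeding the word vectors in sentence order, as the embodiment requires, lets the final hidden state h act as the initial sentence vector.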
In this embodiment of the present invention, as shown in fig. 7, the text information representation system further includes a model training module 40 configured to perform model training on the pre-trained sentence vector model before the corpus to be analyzed is obtained. As shown in fig. 4, the process of training the pre-trained sentence vector model by the model training module 40 includes:
acquiring a training corpus set through the word vector generation module 10, performing word segmentation preprocessing on corpora in the training corpus set, and respectively generating corresponding word vectors based on the obtained segmented words, wherein the training corpus set is a training text information set, and the training text information comprises at least one training sentence; obtaining word vectors of participles contained in each training sentence through the initial sentence vector generation module 20 to obtain a word vector group of each training sentence, and sequentially inputting the word vectors in the word vector group of the training sentence into the initial sentence vector algorithm model in order to generate an initial sentence vector of the corresponding training sentence; and finally, inputting the initial sentence vector corresponding to each training sentence into the initial sentence vector model for training based on the context corresponding to each training sentence in the training corpus set to obtain the pre-trained sentence vector model.
The training corpus acquired by the word vector generation module 10 may be an internet corpus such as Baidu encyclopedia or Wikipedia, or another network corpus such as various information websites. By using large-scale internet corpora, the unsupervised model training of the algorithm model is converted into supervised model training, which effectively improves the effect of the algorithm model adopted in this scheme.
In the embodiment of the present invention, as shown in fig. 8, the model training module 40 may include a parameter matrix configuration unit 41, a sample generation unit 42, a model training unit 43, and a model verification unit 44. The parameter matrix configuration unit 41 is configured to configure a parameter matrix of the initial sentence vector model, where the parameter matrix connects an input layer and an output layer of the initial sentence vector model. The sample generation unit 42 is connected to the word vector generation module 10 and the initial sentence vector generation module 20, and is configured to generate training samples and check samples according to the context relationships corresponding to the training sentences, where the training samples and the check samples respectively include K1 and K2 sentence groups, and each sentence group includes at least one training sentence used for generating an input sentence vector and at least one training sentence used for generating an output sentence vector; K1 and K2 are positive integers, may be equal or unequal, and K1 may be not less than K2, that is, the number of training samples is not less than the number of check samples. The model training unit 43 is configured to sequentially input the input sentence vector of each sentence group in the training samples into the initial sentence vector model for training, gradually adjusting the parameters in the parameter matrix until all sentence groups in the training samples have been trained, so that the output of the initial sentence vector model gradually matches the corresponding output sentence vector of each sentence group. The model verification unit 44 is configured to verify the trained initial sentence vector model through the check samples; if the verification passes, the training of the initial sentence vector model is completed, and the trained sentence vector model is obtained.
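The training and verification flow above can be sketched as fitting a single parameter matrix W that maps an input sentence vector to its context's sentence vector, then checking the held-out sentence groups. The squared-error objective, learning rate, and synthetic data below are illustrative assumptions:

```python
# Hedged sketch of the module-40 training loop: a parameter matrix W connects
# the input and output layers; its entries are gradually adjusted until the
# model's output matches the target output sentence vectors, then held-out
# check samples verify the result. Data and hyperparameters are synthetic.
import numpy as np

rng = np.random.default_rng(1)
dim = 4
W_true = rng.standard_normal((dim, dim))            # stand-in context mapping
pairs = [(v, W_true @ v) for v in rng.standard_normal((20, dim))]
train, check = pairs[:16], pairs[16:]               # K1=16 training, K2=4 check groups

W = np.zeros((dim, dim))                            # parameter matrix to learn
lr = 0.05
for _ in range(300):
    for x, y in train:
        grad = np.outer(W @ x - y, x)               # d/dW of 0.5 * ||W x - y||^2
        W -= lr * grad

check_err = np.mean([np.linalg.norm(W @ x - y) for x, y in check])
print(check_err < 1e-2)  # True: the trained matrix passes the check
```

The split with K1 training groups and K2 check groups (K1 >= K2) mirrors the sample generation unit 42 and model verification unit 44 described above.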
Further, the initial sentence vector is input into the text information representation module 30, so that the initial sentence vector of the corpus to be analyzed is multiplied by the parameter matrix to obtain a final sentence vector for representing the text information of the corpus to be analyzed.
In one possible implementation of the present invention, the initial sentence vector model may be a skip-gram model or a cbow model. Specifically, for the skip-gram model, one input sentence is used to predict the sentences having a context relationship with it, and each sentence group contained in the training samples and the test samples has only one sentence serving as the input sentence; for the cbow model, a plurality of input sentences are used to predict the sentence located in the middle of them, which has a context relationship with the input sentences, and each sentence group contained in the training samples and the test samples has only one sentence serving as the output sentence. In this embodiment, the initial sentence vector is corrected through the trained sentence vector model, and because the context relationship of sentences is taken into account, the representation of the text is more accurate; therefore, when the method is applied to information stream pushing, the representation of text information such as the titles of news items is more accurate, which is conducive to improving the reading conversion rate of the information.
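How sentence groups could be assembled for the two model choices can be sketched as follows; the window size of one neighbouring sentence on each side is an assumption, as the patent does not fix it:

```python
# Illustrative construction of sentence groups for the two named options:
# skip-gram (one input sentence predicts its neighbours) and cbow (the
# surrounding sentences predict the middle one). window=1 is an assumption.

def skip_gram_groups(sentences, window=1):
    groups = []
    for i, s in enumerate(sentences):
        context = sentences[max(0, i - window):i] + sentences[i + 1:i + 1 + window]
        groups.append({"input": [s], "output": context})  # single input sentence
    return groups

def cbow_groups(sentences, window=1):
    groups = []
    for i in range(window, len(sentences) - window):
        context = sentences[i - window:i] + sentences[i + 1:i + 1 + window]
        groups.append({"input": context, "output": [sentences[i]]})  # single output sentence
    return groups

docs = ["s1", "s2", "s3"]
print(skip_gram_groups(docs)[1])  # {'input': ['s2'], 'output': ['s1', 's3']}
print(cbow_groups(docs)[0])       # {'input': ['s1', 's3'], 'output': ['s2']}
```

Each group then yields one input sentence vector (or several, for cbow) and the output sentence vector(s) the model is trained to match.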
According to the text information representation system provided by the embodiment of the invention, the sentence vector model is established based on the context of the sentence, and the text information representation at sentence level is carried out, so that the influence caused by different semantics of words in different sentences can be avoided in the representation process of the text information due to the consideration of the context of the sentence, and the representation of the text information is more accurate; in addition, the large-scale corpus of the internet can be utilized in the training process of the pre-trained sentence vector model, unsupervised training can be effectively converted into supervised training, the model training effect is effectively improved, and therefore the accuracy of representation of text information is improved.
Embodiments of the present invention further provide a computer device. As shown in fig. 9, the computer device includes at least one processor 71 and a memory 72 communicatively connected to the at least one processor 71 (one processor 71 is shown in fig. 9 as an example). The memory 72 stores computer-readable instructions executable by the at least one processor 71 which, when executed by the at least one processor 71, enable the at least one processor 71 to perform the steps of the text information characterization method described above.
Specifically, the memory 72 in the embodiment of the present invention is a nonvolatile computer-readable storage medium, and can be used to store computer-readable instructions, a nonvolatile software program, a nonvolatile computer-executable program, and modules, such as program instructions/modules corresponding to the text information characterization method in the foregoing embodiment of the present application; the processor 71 executes various functional applications and performs data processing, namely, implementing the text information characterization method described in the above method embodiment, by executing the nonvolatile software program, the computer readable instructions and the modules stored in the memory 72.
In some embodiments, the memory 72 may include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the data storage area may store data created during processing of the text information representation method, and the like. Further, the memory 72 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device;
in some embodiments, the memory 72 may optionally include a remote memory located remotely from the processor 71 and connectable via a network to the computer device performing the text information representation method; examples of such networks include, but are not limited to, the internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
In an embodiment of the present invention, the computer device executing the text information representation method may further include an input system 73 and an output system 74; the input system 73 may obtain operation information of a user on the computer device, and the output system 74 may include a display device such as a display screen. In the embodiment of the present invention, the processor 71, the memory 72, the input system 73, and the output system 74 may be connected by a bus or by other means; fig. 9 illustrates the connection by a bus as an example.
According to the computer device provided by the embodiment of the present invention, when the processor 71 executes the codes in the memory 72, the steps of the text information characterization method in the above embodiment can be executed, and the technical effects of the above embodiment of the method are achieved, and the technical details not described in detail in the embodiment of the present invention can be referred to the technical contents provided in the embodiment of the method of the present application.
Embodiments of the present invention further provide a computer-readable storage medium, where computer-readable instructions are stored on the computer-readable storage medium, and when the computer-readable instructions are executed by at least one processor, the steps of the text information characterization method can be implemented, and when the steps of the method are executed, the technical effects of the above-mentioned method embodiments are achieved, and the technical details that are not described in detail in this embodiment may be referred to in the technical contents provided in the method embodiments of the present application.
The embodiment of the invention also provides a computer program product which can execute the text information representation method provided in the embodiment of the method of the application and has corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the technical contents provided in the method embodiments of the present application.
It should be noted that, in the above embodiments of the present invention, each functional module may be integrated into one processing unit, or each functional module may exist alone physically, or two or more functional modules may be integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several computer readable instructions for enabling a computer system (which may be a personal computer, a server, or a network system, etc.) or an intelligent terminal device or a Processor (Processor) to execute some steps of the methods according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In the above embodiments provided by the present invention, it should be understood that the disclosed system and method may be implemented in other ways. For example, the system embodiments described above are merely illustrative, and for example, the division of the modules is merely a logical division, and other divisions may be realized in practice, for example, at least two modules or components may be combined or integrated into another system, or some features may be omitted, or not executed.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on at least two network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
It is to be understood that the above-described embodiments are merely illustrative of some, but not all, embodiments of the invention, and that the appended drawings illustrate preferred embodiments without limiting the scope of the invention. The invention may be embodied in many different forms; rather, these embodiments are provided so that this disclosure will be thorough and complete. Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications can be made to the embodiments and that equivalents may be substituted for elements thereof. All equivalent structures made by using the contents of the specification and the accompanying drawings of the invention, applied directly or indirectly in other related technical fields, are likewise within the scope of protection of the patent of the invention.
Claims (10)
1. A method for characterizing textual information, comprising:
obtaining a corpus to be analyzed, performing word segmentation preprocessing on the corpus to be analyzed, and respectively generating corresponding word vectors based on the obtained word segments, wherein the corpus to be analyzed is text information which comprises at least one sentence;
obtaining word vectors of participles contained in each statement in the corpus to be analyzed to obtain a word vector group of each statement, sequentially inputting the word vectors in the word vector group into an initial sentence vector algorithm model according to a sequence, and generating an initial sentence vector of a corresponding statement;
and inputting the initial sentence vector into a pre-trained sentence vector model to obtain a final sentence vector of each sentence, wherein the final sentence vector is used for representing text information, and the pre-trained sentence vector model is generated based on the context of the sentence.
2. The method according to claim 1, wherein before the obtaining the corpus to be analyzed, the method further comprises a step of performing model training on the pre-trained sentence vector model, wherein the training process of the pre-trained sentence vector model comprises:
acquiring a training corpus set, performing word segmentation preprocessing on corpora in the training corpus set, and respectively generating corresponding word vectors based on the obtained word segments, wherein the training corpus set is a training text information set, and the training text information comprises at least one training sentence;
acquiring word vectors of participles contained in each training sentence to obtain a word vector group of each training sentence, and sequentially inputting the word vectors in the word vector group of the training sentences into the initial sentence vector algorithm model in sequence to generate corresponding initial sentence vectors of the training sentences;
and inputting the initial sentence vector corresponding to each training sentence into an initial sentence vector model for training based on the context corresponding to each training sentence in the training corpus set to obtain the pre-trained sentence vector model.
3. The method of claim 2, wherein the inputting an initial sentence vector corresponding to each training sentence into an initial sentence vector model for training based on the context corresponding to each training sentence in the corpus to obtain the pre-trained sentence vector model comprises:
configuring a parameter matrix of the initial sentence vector model, wherein the parameter matrix is connected with an input layer and an output layer of the initial sentence vector model;
generating training samples and test samples according to the context corresponding to each training sentence, wherein the training samples and the test samples respectively comprise K1 sentence groups and K2 sentence groups, each sentence group comprises at least one training sentence used for generating an input sentence vector and at least one training sentence used for generating an output sentence vector, and K1 and K2 are positive integers;
sequentially inputting the input sentence vector in each sentence group in the training sample into the initial sentence vector model for training, gradually adjusting the parameters in the parameter matrix until the sentence group in the training sample is trained, and gradually matching the output of the initial sentence vector model with the corresponding output sentence vector in the sentence group;
and testing the initial sentence vector model after training through the test sample, and finishing the training of the initial sentence vector model if the test is passed, so as to obtain the trained sentence vector model.
4. The method of claim 3, wherein the inputting the initial sentence vector into a pre-trained sentence vector model to obtain a final sentence vector for each sentence comprises: and inputting the initial sentence vector into the pre-trained sentence vector model, and multiplying the initial sentence vector of the corpus to be analyzed by the parameter matrix to obtain a final sentence vector for representing the text information of the corpus to be analyzed.
5. The method of claim 2, wherein the initial sentence vector model is a skip-gram model or a cbow model.
6. The method for characterizing text information according to claim 1, wherein the obtaining the corpus to be analyzed, performing word segmentation preprocessing on the corpus to be analyzed, and generating corresponding word vectors based on the obtained word segments respectively comprises:
performing word segmentation on the corpus to be analyzed by adopting a preset word segmentation algorithm, and performing a stop-word removal operation on the word segmentation result to obtain a word bank with the number of word segments being N, wherein N is a positive integer;
and inputting the N word segments in the word bank into a preset word vector model to obtain word vectors of the N word segments.
7. The method of characterizing textual information according to claim 1, wherein the initial sentence vector algorithm model is a GRU algorithm model.
8. A textual information characterization system, comprising:
the word vector generation module is used for acquiring linguistic data to be analyzed, performing word segmentation pretreatment on the linguistic data to be analyzed, and respectively generating corresponding word vectors based on the obtained word segments, wherein the linguistic data to be analyzed is text information; the text information comprises at least one sentence;
the initial sentence vector generation module is used for acquiring word vectors of participles contained in each sentence in the corpus to be analyzed to obtain a word vector group of each sentence, and sequentially inputting the word vectors in the word vector group into the initial sentence vector algorithm model to generate an initial sentence vector of the corresponding sentence;
the text information representation module is used for inputting the initial sentence vector into a pre-trained sentence vector model to obtain a final sentence vector of each sentence, and the final sentence vector is used for representing text information; wherein the pre-trained sentence vector model is generated based on context relationships of sentences.
9. A computer device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores computer-readable instructions executable by the at least one processor, which, when executed by the at least one processor, cause the at least one processor to perform the steps of the textual information characterization method of any of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has computer-readable instructions stored thereon, which, when executed by at least one processor, implement the steps of the textual information characterization method according to any of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910981528.4A CN111104799B (en) | 2019-10-16 | 2019-10-16 | Text information characterization method, system, computer equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111104799A true CN111104799A (en) | 2020-05-05 |
CN111104799B CN111104799B (en) | 2023-07-21 |
Family
ID=70421422
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910981528.4A Active CN111104799B (en) | 2019-10-16 | 2019-10-16 | Text information characterization method, system, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111104799B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108280058A (en) * | 2018-01-02 | 2018-07-13 | 中国科学院自动化研究所 | Relation extraction method and apparatus based on intensified learning |
WO2019056692A1 (en) * | 2017-09-25 | 2019-03-28 | 平安科技(深圳)有限公司 | News sentence clustering method based on semantic similarity, device, and storage medium |
WO2019072166A1 (en) * | 2017-10-10 | 2019-04-18 | 腾讯科技(深圳)有限公司 | Semantic analysis method, device, and storage medium |
CN110287312A (en) * | 2019-05-10 | 2019-09-27 | 平安科技(深圳)有限公司 | Calculation method, device, computer equipment and the computer storage medium of text similarity |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111694941A (en) * | 2020-05-22 | 2020-09-22 | 腾讯科技(深圳)有限公司 | Reply information determining method and device, storage medium and electronic equipment |
CN111694941B (en) * | 2020-05-22 | 2024-01-05 | 腾讯科技(深圳)有限公司 | Reply information determining method and device, storage medium and electronic equipment |
CN111639194A (en) * | 2020-05-29 | 2020-09-08 | 天健厚德网络科技(大连)有限公司 | Knowledge graph query method and system based on sentence vectors |
CN111639194B (en) * | 2020-05-29 | 2023-08-08 | 天健厚德网络科技(大连)有限公司 | Knowledge graph query method and system based on sentence vector |
WO2021151328A1 (en) * | 2020-09-04 | 2021-08-05 | 平安科技(深圳)有限公司 | Symptom data processing method and apparatus, and computer device and storage medium |
CN112926329A (en) * | 2021-03-10 | 2021-06-08 | 招商银行股份有限公司 | Text generation method, device, equipment and computer readable storage medium |
CN112926329B (en) * | 2021-03-10 | 2024-02-20 | 招商银行股份有限公司 | Text generation method, device, equipment and computer readable storage medium |
CN113157853A (en) * | 2021-05-27 | 2021-07-23 | 中国平安人寿保险股份有限公司 | Problem mining method and device, electronic equipment and storage medium |
CN113157853B (en) * | 2021-05-27 | 2024-02-06 | 中国平安人寿保险股份有限公司 | Problem mining method, device, electronic equipment and storage medium |
CN113435582A (en) * | 2021-06-30 | 2021-09-24 | 平安科技(深圳)有限公司 | Text processing method based on sentence vector pre-training model and related equipment |
CN113435582B (en) * | 2021-06-30 | 2023-05-30 | 平安科技(深圳)有限公司 | Text processing method and related equipment based on sentence vector pre-training model |
WO2023024422A1 (en) * | 2021-08-27 | 2023-03-02 | 平安科技(深圳)有限公司 | Consultation session-based auxiliary diagnosis method and apparatus, and computer device |
CN114036272A (en) * | 2021-10-29 | 2022-02-11 | 厦门快商通科技股份有限公司 | Semantic analysis method and system for dialog system, electronic device and storage medium |
CN114358004A (en) * | 2021-12-27 | 2022-04-15 | 有米科技股份有限公司 | Marketing text generation method and device |
CN114118085B (en) * | 2022-01-26 | 2022-04-19 | 云智慧(北京)科技有限公司 | Text information processing method, device and equipment |
CN114118085A (en) * | 2022-01-26 | 2022-03-01 | 云智慧(北京)科技有限公司 | Text information processing method, device and equipment |
CN114943220B (en) * | 2022-04-12 | 2023-01-10 | 中国科学院计算机网络信息中心 | Sentence vector generation method and duplicate checking method for scientific research establishment duplicate checking |
CN114943220A (en) * | 2022-04-12 | 2022-08-26 | 中国科学院计算机网络信息中心 | Sentence vector generation method and duplicate checking method for scientific research establishment duplicate checking |
Also Published As
Publication number | Publication date |
---|---|
CN111104799B (en) | 2023-07-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111104799A (en) | Text information representation method and system, computer equipment and storage medium | |
Pane et al. | A multi-lable classification on topics of quranic verses in english translation using multinomial naive bayes | |
US20230244704A1 (en) | Sequenced data processing method and device, and text processing method and device | |
US11232358B1 (en) | Task specific processing of regulatory content | |
US20210390370A1 (en) | Data processing method and apparatus, storage medium and electronic device | |
CN110188195B (en) | Text intention recognition method, device and equipment based on deep learning | |
CN111506732A (en) | Text multi-level label classification method | |
CN107180084A (en) | Word library updating method and device | |
CN109614611B (en) | Emotion analysis method for fusion generation of non-antagonistic network and convolutional neural network | |
CN110968725B (en) | Image content description information generation method, electronic device and storage medium | |
CN113255331B (en) | Text error correction method, device and storage medium | |
CN112613555A (en) | Object classification method, device, equipment and storage medium based on meta learning | |
CN110991515B (en) | Image description method fusing visual context | |
CN110598869B (en) | Classification method and device based on sequence model and electronic equipment | |
JP2021174503A (en) | Limit attack method, device, and storage medium against naive bayes sorter | |
CN111859988A (en) | Semantic similarity evaluation method and device and computer-readable storage medium | |
CN115064154A (en) | Method and device for generating mixed language voice recognition model | |
CN111241843A (en) | Semantic relation inference system and method based on composite neural network | |
CN112446217A (en) | Emotion analysis method and device and electronic equipment | |
CN113761874A (en) | Event reality prediction method and device, electronic equipment and storage medium | |
CN113609287A (en) | Text abstract generation method and device, computer equipment and storage medium | |
CN112434143A (en) | Dialog method, storage medium and system based on hidden state constraint of GRU (gated recurrent unit) | |
CN114692610A (en) | Keyword determination method and device | |
CN110543569A (en) | Network layer structure for short text intention recognition and short text intention recognition method | |
CN110569331A (en) | Context-based relevance prediction method and device and storage equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||