WO2024087298A1 - Text processing method and apparatus, electronic device and storage medium

Info

Publication number: WO2024087298A1
Application number: PCT/CN2022/134592
Authority: WO
Inventors: 宋彦; 田元贺; 毛震东; 李世鹏
Original assignee: 苏州思萃人工智能研究所有限公司
Priority date: 2022-10-27
Filing date: 2022-11-28
Publication date: 2024-05-02
Also published as: CN115618848A

Abstract

The present application provides a text processing method and apparatus, an electronic device and a storage medium. The text processing method comprises: obtaining a text to be analyzed, and determining an original vector corresponding to the text to be analyzed; extracting from the text to be analyzed at least one segmented word to be used, and determining vectors to be used that are corresponding to the at least one segmented word to be used; according to each vector to be used and a weight to be used that is corresponding to each vector to be used, obtaining a vector to be spliced of the text to be analyzed; and splicing the vector to be spliced and the original vector to obtain a target vector, so as to perform, on the basis of the target vector, text analysis on the text to be analyzed.

Description

Text processing method, device, electronic device and storage medium

This application claims priority to the Chinese patent application filed with the China Patent Office on October 27, 2022, with application number 202211327875.3, the entire contents of which are incorporated by reference into this application.

Technical Field

The present application relates to the technical field of natural language processing, for example, to a text processing method, device, electronic device and storage medium.

Background Art

By performing syntactic analysis on the text, we can gain a more comprehensive understanding of the text.

Syntactic analysis of text mostly relies on a more powerful encoder and pays little attention to how the text itself is represented. Results obtained in this way often miss important information in the text; that is, the syntactic structure of the text is not analyzed in enough detail, which may make the syntactic analysis results inaccurate.

Summary of the invention

The present application provides a text processing method, device, electronic device and storage medium to solve the problem that coarse-grained text analysis leads to inaccurate syntactic component analysis results.

The present application embodiment provides a text processing method, including:

Obtaining a text to be analyzed, and determining an original vector corresponding to the text to be analyzed;

Extracting at least one to-be-used segmented word from the to-be-analyzed text, and determining a to-be-used vector corresponding to the at least one to-be-used segmented word;

According to each vector to be used and the weight to be used corresponding to each vector to be used, a vector to be concatenated of the text to be analyzed is obtained;

The vector to be concatenated is concatenated with the original vector to obtain a target vector, so as to perform text analysis on the text to be analyzed based on the target vector.

The present application also provides a text processing device, including:

An original vector determination module, configured to obtain a text to be analyzed and determine an original vector corresponding to the text to be analyzed;

A to-be-used vector determination module, configured to extract at least one to-be-used word from the to-be-analyzed text, and determine a to-be-used vector corresponding to the at least one to-be-used word;

A module for determining vectors to be spliced, configured to obtain vectors to be spliced of the text to be analyzed according to each vector to be used and a weight to be used corresponding to each vector to be used;

The target vector determination module is configured to perform a splicing process on the vector to be spliced and the original vector to obtain a target vector, so as to perform text analysis on the text to be analyzed based on the target vector.

The present application also provides an electronic device, including:

at least one processor; and

a memory communicatively connected to the at least one processor; wherein,

The memory stores a computer program that can be executed by the at least one processor, and the computer program is executed by the at least one processor so that the at least one processor can execute the text processing method described in any embodiment of the present application.

An embodiment of the present application further provides a computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions, and the computer instructions are used to enable a processor to implement the text processing method described in any embodiment of the present application when executed.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow chart of a text processing method provided according to Embodiment 1 of the present application;

FIG. 2 is a schematic diagram of a model structure of text processing provided according to Embodiment 2 of the present application;

FIG. 3 is a schematic diagram of the structure of a text processing device provided according to Embodiment 3 of the present application;

FIG. 4 is a schematic diagram of the structure of an electronic device that implements the text processing method according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described below in conjunction with the drawings in the embodiments of the present application. The described embodiments are only embodiments of a part of the present application.

The terms "first", "second", etc. in the specification and claims of the present application and the above-mentioned drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. The terms used in this way are interchangeable where appropriate, so that the embodiments of the present application described herein can be implemented in an order other than those illustrated or described herein.

Embodiment 1

Figure 1 is a flowchart of a text processing method provided in the first embodiment of the present application. This embodiment can be applied to situations where a more detailed and accurate analysis of the syntactic components of a text is performed. The method can be executed by a text processing device, which can be implemented in the form of hardware and/or software. The text processing device can be configured in a computing device that can execute the text processing method.

As shown in FIG. 1, the method includes the following steps.

S110: Acquire a text to be analyzed, and determine an original vector corresponding to the text to be analyzed.

The text to be analyzed can be understood as the text that needs to be analyzed for syntactic components. The original vector can be understood as the vector obtained after the text to be analyzed is vectorized. For example, the text to be analyzed can be vectorized through a language representation model to obtain the original vector.

In practical applications, syntactic component analysis of text is a basic task in natural language processing. Operations such as opinion extraction or sentiment analysis can be built on it. For text with a simple component structure, the syntactic component information can usually be obtained fairly accurately. For text with a more complex structure, however, syntactic analysis is more difficult, and important information in the text may be missed. For example, the text can be vectorized, and the syntactic component information of a text span can be obtained by subtracting the vector at the start of the span from the vector at its end. Such an analysis is relatively coarse, and it is difficult to obtain more accurate syntactic component information from it.

First, a text to be analyzed that requires syntactic component analysis is obtained, and the original vector corresponding to it is determined. Optionally, determining the original vector corresponding to the text to be analyzed includes: performing vector processing on at least one to-be-used word segment in the text to be analyzed based on a language representation model, to obtain a to-be-used latent vector corresponding to each to-be-used word segment; and, for each to-be-used latent vector, obtaining the original vector corresponding to the text to be analyzed from the difference between a subsequent latent vector and the current latent vector.

The language representation model is based on Bidirectional Encoder Representations from Transformers (BERT), which has powerful language representation and feature extraction capabilities. In this technical solution, features can be extracted from the text to be analyzed based on the BERT model to generate the original vector corresponding to the text to be analyzed. The text to be analyzed includes at least one segmented word; in this technical solution, each segmented word is called a segmented word to be used. Vectorizing the at least one segmented word to be used yields the corresponding latent vectors to be used, from which the original vector corresponding to the text to be analyzed is obtained.

In practical applications, the text to be analyzed is segmented to obtain at least one segmented word to be used, and the at least one segmented word to be used is encoded based on the BERT model to obtain a corresponding latent vector to be used. By concatenating at least one latent vector to be used, a text vector corresponding to the text to be analyzed can be obtained. It can be determined by the following formula:

$$h_1 \dots h_i \dots h_j \dots h_n = \mathrm{BERT}(x_1 \dots x_i \dots x_j \dots x_n)$$

where $h_i$ represents the latent vector to be used and $x_i$ represents the word segment to be used; $i$, $j$ and $n$ are natural numbers that indicate the position of a latent vector to be used in the text vector and the position of a word segment to be used in the text to be analyzed.

As described above, the text vector includes at least one latent vector to be used. For each latent vector to be used, the current latent vector is subtracted from a subsequent latent vector to obtain a difference vector, and the difference vector is used as the original vector corresponding to the current latent vector. To make the syntactic component analysis of the text to be analyzed more accurate, the text to be analyzed can be divided into multiple text intervals, each of which includes at least one word to be used. From the latent vector to be used corresponding to each word to be used, the original vector corresponding to each word to be used can be obtained, so that the text to be analyzed can be analyzed in more detail based on the original vector of at least one word to be used.

Taking one of the latent vectors to be used as the current latent vector as an example, the original vector corresponding to the current latent vector can be obtained based on the following formula:

$$r_{i,j} = h_j - h_i$$

where $r_{i,j}$ represents the original vector corresponding to the current latent vector, $h_j$ represents the subsequent latent vector relative to the current latent vector, and $h_i$ represents the current latent vector.
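The difference formula can be illustrated with a minimal Python sketch. The latent vectors below are made-up toy values standing in for BERT outputs, and the helper name `span_vector` is hypothetical; only the subtraction itself comes from the formula above.

```python
# Toy latent vectors h_1..h_4 (made-up values, not real BERT outputs).
latent = [
    [0.1, 0.2],
    [0.4, 0.0],
    [0.3, 0.5],
    [0.9, 0.1],
]


def span_vector(latent, i, j):
    """Return r_{i,j} = h_j - h_i for 1-based positions i < j."""
    h_i, h_j = latent[i - 1], latent[j - 1]
    return [b - a for a, b in zip(h_i, h_j)]


# Original vector for the pair of positions (1, 3).
r_1_3 = span_vector(latent, 1, 3)
```

The result has the same dimension as a single latent vector, so every pair of positions is represented by a vector of fixed size regardless of how many words lie between them.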

S120: Extract at least one to-be-used word from the text to be analyzed, and determine a to-be-used vector corresponding to the at least one to-be-used word.

In this technical solution, the syntactic component analysis of the text to be analyzed builds on an existing syntactic analysis. That is, the original vector reflects the result of that existing analysis, and this technical solution starts from the original vector and analyzes the syntactic components of the text to be analyzed in more detail. Because a vector corresponding to each word to be used is also used when determining the original vector, the two are distinguished by name: the vector used when determining the original vector is called the latent vector to be used, and the vector used in the analysis of this technical solution is called the vector to be used.

The vector to be used is the vector obtained after the text to be analyzed is vectorized by the vector processing method based on the technical solution.

When analyzing the text to be analyzed, it is necessary to determine the vector to be used corresponding to at least one to-be-used word segment in the text to be analyzed. In the present technical solution, determining the vector to be used corresponding to at least one to-be-used word segment includes: respectively determining the word segment category corresponding to at least one to-be-used word segment; for each word segment category, performing vector processing on at least one to-be-used word segment in the current word segment category to obtain the vector to be used corresponding to each word segment category.

In the present technical solution, the word segmentation category can be understood as an N-tuple (n-gram) category, where an N-tuple is a word block composed of consecutive words. For example, the text to be analyzed is "在校园上", which is segmented into three to-be-used word segments: "在" (in), "校园" (campus) and "上" (on). The text to be analyzed then corresponds to three N-tuple categories: unigrams "在", "校园" and "上"; bigrams "在校园" and "校园上"; and the trigram "在校园上". The to-be-used word segments in each N-tuple category are vectorized to obtain the corresponding to-be-used vectors.
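As a rough illustration of the grouping just described, the following Python sketch enumerates every block of consecutive word segments and groups the blocks by length. The function name `ngram_categories` is hypothetical, and the sketch assumes the text has already been segmented.

```python
def ngram_categories(words, max_n=None):
    """Group every block of consecutive word segments by its length n."""
    max_n = max_n or len(words)
    categories = {}
    for n in range(1, max_n + 1):
        # Slide a window of n word segments over the segmented text.
        categories[n] = [
            "".join(words[k:k + n]) for k in range(len(words) - n + 1)
        ]
    return categories


cats = ngram_categories(["在", "校园", "上"])
# cats[1] holds the unigrams, cats[2] the bigrams, cats[3] the trigram.
```

Joining with an empty string suits Chinese text; for a space-delimited language the join separator would need to change.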

In the present technical solution, taking the to-be-used word segments in the current word segment category as an example, performing vector processing on at least one to-be-used word segment in the current word segment category to obtain the to-be-used vector corresponding to each word segment category includes: performing, based on an embedding function, vector processing on the at least one to-be-used word segment in the current word segment category to obtain the to-be-used vector corresponding to each of them.

In the technical solution, the embedding function can determine the vector to be used corresponding to each to-be-used word based on the pre-built embedding matrix. Based on the embedding function, vector processing is performed on at least one to-be-used word in the current word segmentation category to obtain the vector to be used corresponding to at least one to-be-used word in the current word segmentation category, including: calling the pre-built embedding matrix and determining the matrix mapping element corresponding to at least one to-be-used word in the current word segmentation category; based on each matrix mapping element, determining the vector to be used corresponding to the corresponding to-be-used word in the current word segmentation category.

The matrix mapping element can be understood as an element in the embedding matrix corresponding to the word segment to be used, and can be a row number element of the embedding matrix corresponding to the word segment to be used.

Exemplarily, the pre-constructed embedding matrix may cover a large number of to-be-used word segments. The to-be-used word segments are placed in order in the embedding matrix, and a corresponding matrix mapping element is generated for each of them. Each to-be-used word segment corresponds to a unique vector in the embedding matrix. Accordingly, the to-be-used vector of a to-be-used word segment can be determined from the pre-constructed embedding matrix and the matrix mapping element of that word segment. For example, if the matrix mapping element corresponding to "playground" is "11", then "playground" is in the 11th position of the embedding matrix, and the unique vector at that position is the to-be-used vector corresponding to "playground".

That is to say, in the present technical solution, in order to determine the vector to be used corresponding to each word segment to be used, the matrix mapping element of each word segment to be used in the pre-constructed embedding matrix can be determined, so as to determine the vector to be used corresponding to the corresponding word segment to be used according to the unique vector corresponding to each matrix mapping element.
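The lookup described above can be sketched in Python as follows. The lexicon, the matrix values and the names `row_of` and `embed` are all illustrative assumptions; a real embedding matrix would be much larger and learned during training.

```python
# Hypothetical lexicon: each entry is stored once, in a fixed order.
lexicon = ["in", "playground", "on", "in playground", "playground on"]

# Matrix mapping element: the row number of each entry in the embedding matrix.
row_of = {entry: row for row, entry in enumerate(lexicon)}

# Toy embedding matrix with one row (a unique vector) per lexicon entry.
embedding_matrix = [
    [0.1, 0.0, 0.3],
    [0.7, 0.2, 0.5],
    [0.0, 0.9, 0.4],
    [0.6, 0.6, 0.1],
    [0.2, 0.8, 0.8],
]


def embed(entry):
    """Return the to-be-used vector stored in the row mapped to this entry."""
    return embedding_matrix[row_of[entry]]


playground_vector = embed("playground")
```

Because the mapping from entry to row is fixed, the same entry always yields the same vector, which matches the "unique vector" property stated in the text.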

S130: Obtain a vector to be concatenated of the text to be analyzed according to each vector to be used and the weight to be used corresponding to each vector to be used.

The vector to be concatenated can be used to concatenate with the original vector to obtain a target vector, so as to perform a more detailed syntactic component analysis on the text to be analyzed based on the target vector.

In the technical solution, when the text to be analyzed is analyzed, it is divided into text intervals to obtain at least one text interval, that is, at least one segmentation category. Each segmentation category includes at least one segmented word to be used, and each segmented word to be used corresponds to a unique vector to be used. The weight to be used corresponding to a vector to be used is consistent with the weight of the segmentation category to which that vector belongs. For example, if the current segmentation category includes three segmented words to be used, which correspond to three different vectors to be used, and the weight value of the current segmentation category is 0.2, then the weights to be used of all three vectors to be used are 0.2.

In practical applications, the number of to-be-used segmentations in each segmentation category may be one or more. Taking the current segmentation category as an example, when determining the weight corresponding to the current segmentation category, that is, the to-be-used weight, it can be determined based on the following formula:

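$$p_{u,v} = \frac{\exp\left(r_{i,j} \cdot e_{u,v}\right)}{\sum_{v'=1}^{m_u} \exp\left(r_{i,j} \cdot e_{u,v'}\right)}$$

where $p_{u,v}$ represents the weight to be used, $\exp$ represents the exponential function with the natural constant $\mathrm{e}$ as the base, $r_{i,j}$ represents the original vector, $e_{u,v}$ represents the N-tuple vector to be used, $m_u$ represents the number of N-tuples, $u$ represents the $u$-th word category, and $v$ represents the $v$-th word to be used in the word category.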
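The quantities named above (an exponential, the original vector, and the N-tuple vectors of one category) are consistent with a softmax over inner products. The following Python sketch assumes that reading; the function names and the toy values are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_weights(r, ngram_vectors):
    """Softmax over the inner products of r with each N-tuple vector."""
    scores = [dot(r, e) for e in ngram_vectors]
    # Subtracting the maximum avoids overflow and leaves the result unchanged.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


r = [1.0, 0.0]                                  # toy original vector
vectors = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy N-tuple vectors of one category
weights = attention_weights(r, vectors)
```

Under this assumption the weights within one category are positive and sum to one, and an N-tuple whose vector aligns more closely with the original vector receives a larger weight.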

According to each vector to be used and the weight to be used corresponding to each vector to be used, the vector to be spliced of the text to be analyzed is obtained, including: according to each vector to be used and the original vector, respectively determining the weight to be used corresponding to each vector to be used; according to each vector to be used and the weight to be used corresponding to each vector to be used, performing weighted averaging processing to obtain the vector to be spliced corresponding to the text to be analyzed.

The vector to be spliced can be obtained by the following formula:

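The weighted average vector corresponding to each N-tuple category is determined as

$$a^{(u)}_{i,j} = \sum_{v=1}^{m_u} p_{u,v} \cdot e_{u,v}$$

where $a^{(u)}_{i,j}$ represents the weighted average vector of the N-tuples of category $u$, $p_{u,v}$ represents the weight to be used, $e_{u,v}$ represents the vector to be used, $m_u$ represents the number of N-tuples in category $u$, and $\cdot$ denotes the product of a weight and a vector.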

The weighted average vectors of N-tuples of all categories are concatenated to obtain a vector containing N-tuple information (i.e., the vector to be concatenated):

$$a_{i,j} = \bigoplus_{u} a^{(u)}_{i,j}$$

where $a_{i,j}$ represents the vector to be spliced, $\oplus$ is the vector splicing symbol, and $a^{(u)}_{i,j}$ represents the weighted average vector of the N-tuples of category $u$.
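The two steps just described, a weighted average within each N-tuple category followed by concatenation across categories, might look like the following Python sketch. The function names and the toy weights and vectors are assumptions made for illustration.

```python
def weighted_average(weights, vectors):
    """Sum the vectors of one category, each scaled by its weight."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def concat_categories(per_category):
    """Concatenate the weighted average vectors of all categories in order.

    per_category is a list of (weights, vectors) pairs, one per category.
    """
    spliced = []
    for weights, vectors in per_category:
        spliced.extend(weighted_average(weights, vectors))
    return spliced


unigrams = ([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])  # two unigram vectors
bigrams = ([1.0], [[2.0, 4.0]])                    # one bigram vector
a_ij = concat_categories([unigrams, bigrams])
```

With categories kept separate until the final concatenation, the length of the result grows with the number of categories rather than with the number of N-tuples found.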

S140: Concatenate the vector to be concatenated with the original vector to obtain a target vector, and perform text analysis on the text to be analyzed based on the target vector.

The target vector can be understood as the vector corresponding to the text to be analyzed that is obtained by concatenating the vector to be concatenated, which aggregates the vectors to be used, with the original vector.

The target vector can be determined based on the following formula:

$$r'_{i,j} = a_{i,j} \oplus r_{i,j}$$

where $r'_{i,j}$ represents the target vector, $a_{i,j}$ represents the vector to be spliced, $r_{i,j}$ represents the original vector, and $\oplus$ is the vector splicing symbol.
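A minimal Python sketch of this splicing step is given below. The order of the two operands (vector to be spliced first, original vector second) is an assumption, as are the function name `target_vector` and the toy values.

```python
def target_vector(a_ij, r_ij):
    """Concatenate the vector to be spliced with the original vector."""
    return list(a_ij) + list(r_ij)


a_ij = [0.5, 0.5, 2.0, 4.0]  # toy vector to be spliced
r_ij = [0.2, 0.3]            # toy original vector
r_prime = target_vector(a_ij, r_ij)
```

Because concatenation changes the vector length, any model that consumes the target vector would need an input dimension equal to the sum of the two lengths.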

Optionally, the vector to be concatenated is concatenated with the original vector to obtain a target vector, and text analysis is performed on the text to be analyzed based on the target vector, including: based on a pre-built encoder, the vector to be concatenated and the original vector are concatenated to obtain a target vector; and the target vector is input into a pre-built syntactic analysis model to analyze the text to be analyzed based on the syntactic analysis model.

Splicing the vector obtained by this technical solution onto the original vector compensates for the relatively coarse analysis of the text to be analyzed in the related technology, which leads to inaccurate analysis results. That is, in addition to the vector representation of the text to be analyzed, this technical solution adds the vector representation information of at least one segmented word to be used; combining the two provides more syntactic structure information about the text to be analyzed. Therefore, analyzing the target vector with the pre-built syntactic analysis model yields more accurate analysis results.

In the technical solution of this embodiment of the present application, the text to be analyzed is obtained and its original vector is determined; the original vector can be obtained through the BERT model so that it can later be spliced with the vector to be spliced. At least one segmented word to be used is extracted from the text to be analyzed, the word segmentation category of each segmented word to be used is determined, and the corresponding vector to be used is determined based on the embedding function. The weight to be used of each vector to be used is determined according to the weight of its word segmentation category, and the vector to be spliced of the text to be analyzed is obtained from the vectors to be used and their weights to be used. The vector to be spliced is then spliced with the original vector to obtain the target vector, on which text analysis of the text to be analyzed is based. This solves the problem that coarse-grained text analysis leads to inaccurate syntactic component analysis results, and achieves accurate analysis of the syntactic component structure of the text.

Embodiment 2

In an example, the model with which this technical solution analyzes the text to be analyzed is shown in FIG. 2. Take the text to be analyzed "and playing football on the playground" as an example. Syntactic component analysis of the text to be analyzed usually adopts a method based on a graph structure. An encoder, such as a BERT model, can encode the text to be analyzed $x = x_1 \dots x_i \dots x_j \dots x_n$, which contains $n$ segmented words to be used, to obtain the corresponding latent vectors (the latent vector to be used of the $i$-th segmented word is $h_i$). The formula is as follows:

$$h_1 \dots h_i \dots h_j \dots h_n = \mathrm{BERT}(x_1 \dots x_i \dots x_j \dots x_n)$$

where $h_i$ represents the latent vector to be used and $x_i$ represents the segmented word to be used; $i$, $j$ and $n$ are natural numbers that indicate the position of a latent vector to be used in the text vector and the position of a segmented word to be used in the text to be analyzed.

The vector representation $r_{i,j}$ of each text interval $(x_i, x_j) = x_i \dots x_{j-1}$ can be obtained by the following formula:

$$r_{i,j} = h_j - h_i$$

Two fully connected layers can be used to map $r_{i,j}$ to a vector $o_{i,j}$, where matrix $W_1$ and offset vector $b_1$ are the parameters of the first fully connected layer, matrix $W_2$ and offset vector $b_2$ are the parameters of the second fully connected layer, and ReLU is the activation function:

$$o_{i,j} = W_2 \cdot \mathrm{ReLU}(W_1 \cdot r_{i,j} + b_1) + b_2$$

The dimension of the vector $o_{i,j}$ is equal to the number of syntactic component categories (such as noun phrase (NP), verb phrase (VP), prepositional phrase (PP), etc.). The value in each dimension of the vector represents the score that the text interval $(x_i, x_j)$ belongs to a syntactic component category $l$, and the score is denoted as $s(i,j,l)$.
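The two-layer mapping can be sketched in plain Python as follows. All parameter values are toy numbers chosen so the arithmetic is easy to follow, and the helper names are hypothetical; in practice the parameters would be learned.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def add(a, b):
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]


def relu(x):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in x]


def span_scores(r, W1, b1, W2, b2):
    """Compute o = W2 * ReLU(W1 * r + b1) + b2."""
    hidden = relu(add(matvec(W1, r), b1))
    return add(matvec(W2, hidden), b2)


# Toy setup: 2-dimensional span vector, 2 hidden units, 3 categories (NP, VP, PP).
W1 = [[1.0, 0.0], [0.0, -1.0]]
b1 = [0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.0, 0.5, -1.0]
o = span_scores([2.0, 3.0], W1, b1, W2, b2)
```

Each entry of `o` plays the role of one score $s(i,j,l)$ for the same text interval and a different category $l$.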

All text interval scores s(i,j,l) of the text to be analyzed are input into the Cocke–Younger–Kasami (CYK) algorithm to calculate the highest-scoring and optimal legal syntax tree.
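A simplified, CYK-style dynamic program over span scores is sketched below in Python. It assumes fencepost indices from 0 to n, strictly binary trees, and that every span in the tree takes its best-scoring category; these simplifications and the name `cyk_best_tree` are assumptions made for illustration and are not taken from the source.

```python
def cyk_best_tree(n, score):
    """Find the highest-scoring binary tree over n words.

    score[(i, j)] maps each category label to its score for the span
    covering words i..j-1, with 0 <= i < j <= n.
    """
    best = {}  # best total score of any tree over span (i, j)
    back = {}  # chosen label and split point for span (i, j)
    for length in range(1, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            label, label_score = max(score[(i, j)].items(), key=lambda kv: kv[1])
            if length == 1:
                best[(i, j)] = label_score
                back[(i, j)] = (label, None)
            else:
                # Pick the split point whose two sub-trees score highest.
                k = max(range(i + 1, j), key=lambda m: best[(i, m)] + best[(m, j)])
                best[(i, j)] = label_score + best[(i, k)] + best[(k, j)]
                back[(i, j)] = (label, k)

    def build(i, j):
        label, k = back[(i, j)]
        if k is None:
            return (label, i, j)
        return (label, build(i, k), build(k, j))

    return best[(0, n)], build(0, n)


# Toy scores for a three-word text (labels and values are made up).
scores = {
    (0, 1): {"P": 1.0},
    (1, 2): {"N": 1.0},
    (2, 3): {"L": 1.0},
    (0, 2): {"PP": 0.5, "NP": 0.1},
    (1, 3): {"NP": 2.0, "PP": 0.0},
    (0, 3): {"PP": 3.0, "NP": 1.0},
}
total, tree = cyk_best_tree(3, scores)
```

Filling shorter spans before longer ones guarantees that the sub-tree scores needed at each split point are already available.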

This technical solution analyzes the text to be analyzed on the basis of the above syntactic component analysis. The text to be analyzed is divided into at least one text interval, and the segmentation category corresponding to each text interval is determined, that is, the segmentation category is determined according to the number of segmented words to be used. In practical applications, all matching N-tuples in the text interval $(x_i, x_j)$ can be extracted based on an existing N-tuple vocabulary $N$ (that is, if an N-tuple in the vocabulary $N$ is a substring of the text interval $(x_i, x_j)$, the N-tuple is extracted). The extracted N-tuples are grouped by length, and each length is mapped to a different segmentation category. The $v$-th N-tuple belonging to the $u$-th category is denoted $w_{u,v}$, and category $u$ contains $m_u$ N-tuples in total.
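Vocabulary-based extraction within one text interval could be sketched in Python as below. The vocabulary contents, the tuple representation of N-tuples and the name `matching_ngrams` are assumptions made for illustration.

```python
def matching_ngrams(words, i, j, vocabulary):
    """Collect vocabulary N-tuples found inside words[i:j], grouped by length."""
    span = words[i:j]
    categories = {}
    for n in range(1, len(span) + 1):
        for k in range(len(span) - n + 1):
            ngram = tuple(span[k:k + n])
            # Keep only N-tuples that appear in the existing vocabulary.
            if ngram in vocabulary:
                categories.setdefault(n, []).append(ngram)
    return categories


words = ["and", "playing", "football", "on", "the", "playground"]
vocabulary = {
    ("on",),
    ("playground",),
    ("the", "playground"),
    ("playing", "football"),
}
found = matching_ngrams(words, 3, 6, vocabulary)
```

Here the interval covers "on the playground", so the vocabulary entry "playing football" is not extracted even though it occurs elsewhere in the text.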

For example, the text to be analyzed is "in the playground". After word segmentation, three segmented words to be used are obtained: "in", "playground" and "on". The text to be analyzed can therefore correspond to three N-tuple categories: unigrams "in", "playground" and "on"; bigrams "in playground" and "playground on"; and the trigram "in playground on".

Based on the embedding function, each N-tuple $w_{u,v}$ is mapped to an N-tuple embedding vector $e_{u,v}$. The row number (i.e., the matrix mapping element) corresponding to $w_{u,v}$ in the pre-built embedding matrix is extracted, and the vector in that row is used as the vector to be used corresponding to the word segment to be used.

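For the N-tuples in category $u$, the weight of each N-tuple of the current category, that is, the weight to be used, can be determined by the following formula:

$$p_{u,v} = \frac{\exp\left(r_{i,j} \cdot e_{u,v}\right)}{\sum_{v'=1}^{m_u} \exp\left(r_{i,j} \cdot e_{u,v'}\right)}$$

where $p_{u,v}$ represents the weight to be used, $r_{i,j}$ represents the original vector, $e_{u,v}$ represents the N-tuple vector to be used, and $m_u$ represents the number of N-tuples.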

The weighted average vector of the N-tuple of category u is calculated by the following formula:

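$$a^{(u)}_{i,j} = \sum_{v=1}^{m_u} p_{u,v} \cdot e_{u,v}$$

where $a^{(u)}_{i,j}$ represents the weighted average vector of the N-tuples of category $u$, $p_{u,v}$ represents the weight to be used, $e_{u,v}$ represents the vector to be used, $m_u$ represents the number of N-tuples in category $u$, and $\cdot$ denotes the product of a weight and a vector.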

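The weighted average vectors of the N-tuples of all categories are concatenated to obtain the vector to be spliced:

$$a_{i,j} = \bigoplus_{u} a^{(u)}_{i,j}$$

where $a_{i,j}$ represents the vector to be spliced, $\oplus$ is the vector splicing symbol, and $a^{(u)}_{i,j}$ represents the weighted average vector of the N-tuples of category $u$.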

Based on the following formula, the vector to be spliced is concatenated with the original vector to obtain the target vector:

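$$r'_{i,j} = a_{i,j} \oplus r_{i,j}$$

where $r'_{i,j}$ represents the target vector, $a_{i,j}$ represents the vector to be spliced, $r_{i,j}$ represents the original vector, and $\oplus$ is the vector splicing symbol.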

The syntactic component analysis result can be obtained by performing syntactic component analysis on the text to be analyzed based on the target vector.

This technical solution divides the text to be analyzed into multiple sub-text intervals, and determines N-tuples of the texts in the multiple text intervals respectively, and sets corresponding weights according to the influence of each N-tuple on the syntactic component analysis, so that when the text to be analyzed is analyzed based on each N-tuple, the granularity of the text analysis is finer and the analysis result of the text to be analyzed is more accurate.

In the technical solution of this embodiment of the present application, the text to be analyzed is obtained and its original vector is determined; the original vector can be obtained through the BERT model so that it can later be spliced with the vector to be spliced. At least one segmented word to be used is extracted from the text to be analyzed, the word segmentation category of each segmented word to be used is determined, and the corresponding vector to be used is determined based on the embedding function. The weight to be used of each vector to be used is determined according to the weight of its word segmentation category, and the vector to be spliced of the text to be analyzed is obtained from the vectors to be used and their weights to be used. The vector to be spliced is then spliced with the original vector to obtain the target vector, on which text analysis of the text to be analyzed is based. This solves the problem that coarse-grained text analysis leads to inaccurate syntactic component analysis results, and achieves accurate analysis of the syntactic component structure of the text.

Embodiment 3

FIG. 3 is a schematic diagram of the structure of a text processing device provided in Embodiment 3 of the present application. As shown in FIG. 3, the device comprises: an original vector determination module 210, a to-be-used vector determination module 220, a to-be-joined vector determination module 230, and a target vector determination module 240.

The original vector determination module 210 is configured to obtain the text to be analyzed and determine the original vector corresponding to the text to be analyzed;

A to-be-used vector determination module 220 is configured to extract at least one to-be-used word from the to-be-analyzed text and determine a to-be-used vector corresponding to the at least one to-be-used word;

A to-be-joined vector determination module 230 is configured to obtain a to-be-joined vector of the to-be-analyzed text according to each to-be-used vector and a to-be-used weight corresponding to each to-be-used vector;

The target vector determination module 240 is configured to perform a concatenation process on the vector to be concatenated and the original vector to obtain a target vector, so as to perform text analysis on the text to be analyzed based on the target vector.
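The way the four modules hand data to one another can be sketched in Python as a simple composition. The class name, the callable interfaces and the stub implementations are all hypothetical; they only mirror the data flow between modules 210 to 240.

```python
class TextProcessingDevice:
    """Chains four modules in the order 210 -> 220 -> 230 -> 240."""

    def __init__(self, original_vector_module, to_be_used_vector_module,
                 to_be_joined_vector_module, target_vector_module):
        self.original_vector_module = original_vector_module          # module 210
        self.to_be_used_vector_module = to_be_used_vector_module      # module 220
        self.to_be_joined_vector_module = to_be_joined_vector_module  # module 230
        self.target_vector_module = target_vector_module              # module 240

    def process(self, text):
        original = self.original_vector_module(text)
        used = self.to_be_used_vector_module(text)
        joined = self.to_be_joined_vector_module(used, original)
        return self.target_vector_module(joined, original)


# Stub modules with made-up behaviour, only to exercise the data flow.
device = TextProcessingDevice(
    original_vector_module=lambda text: [float(len(text))],
    to_be_used_vector_module=lambda text: [[1.0], [3.0]],
    to_be_joined_vector_module=lambda used, original: [
        sum(v[0] for v in used) / len(used)
    ],
    target_vector_module=lambda joined, original: joined + original,
)
result = device.process("abc")
```

Passing the modules in as callables keeps each one replaceable, which fits the statement that the device can be implemented in hardware and/or software.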

In the technical solution of this embodiment of the present application, the text to be analyzed is obtained and its original vector is determined; the original vector can be obtained through the BERT model so that it can later be spliced with the vector to be spliced. At least one segmented word to be used is extracted from the text to be analyzed, the word segmentation category of each segmented word to be used is determined, and the corresponding vector to be used is determined based on the embedding function. The weight to be used of each vector to be used is determined according to the weight of its word segmentation category, and the vector to be spliced of the text to be analyzed is obtained from the vectors to be used and their weights to be used. Finally, the vector to be spliced is spliced with the original vector to obtain the target vector, on which text analysis of the text to be analyzed is based. This solves the problem that coarse-grained text analysis leads to inaccurate syntactic component analysis results, and achieves accurate analysis of the syntactic component structure of the text.

Optionally, the original vector determination module 210 includes: a latent vector determination submodule, configured to perform vector processing on at least one segmented word to be used in the text to be analyzed based on a language representation model, to obtain a latent vector to be used corresponding to the at least one segmented word to be used;

The original vector determination submodule is configured to obtain, for each latent vector to be used, the original vector corresponding to the text to be analyzed based on the difference between the latent vector that follows the current latent vector and the current latent vector.
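The difference-based step above can be sketched as follows. This is only one possible reading of the text: it assumes that each latent vector is compared with the latent vector immediately after it and that the element-wise difference is what feeds the original vector. The function name `difference_vectors` and the sample values are hypothetical; a real system would take the latent vectors from a language representation model such as BERT.

```python
from typing import List

Vector = List[float]


def difference_vectors(latents: List[Vector]) -> List[Vector]:
    """For each latent vector that has a successor, return (successor - current).

    Assumption: the "subsequent latent vector" is the one directly following
    the current latent vector in the text.
    """
    diffs = []
    for current, following in zip(latents, latents[1:]):
        diffs.append([f - c for c, f in zip(current, following)])
    return diffs


# Hypothetical latent vectors for three segmented words.
latents = [[1.0, 2.0], [1.5, 1.0], [4.0, 4.0]]
original_parts = difference_vectors(latents)
print(original_parts)  # [[0.5, -1.0], [2.5, 3.0]]
```

How the per-position differences are combined into a single original vector is not specified in this passage, so the sketch stops at the differences themselves.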

Optionally, the to-be-used vector determination module 220 includes: a segmented-word category determination submodule, configured to determine the segmented-word category corresponding to each of the at least one segmented word to be used, wherein each segmented-word category includes at least one segmented word to be used;

The to-be-used vector determination submodule is configured to perform, for each segmented-word category, vector processing on the at least one segmented word to be used in the current segmented-word category, so as to obtain the vector to be used corresponding to each segmented-word category.
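As a minimal sketch of the category step, the snippet below groups segmented words into categories before any vector processing takes place. The grouping criterion is an assumption made purely for illustration (here, the length of the segmented word); this passage does not state how categories are defined. The function name `group_by_category` is likewise hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_category(words: List[str]) -> Dict[int, List[str]]:
    """Group segmented words by an illustrative category key.

    Assumption: the category is the word length. Any other rule could be
    substituted without changing the overall flow.
    """
    groups: Dict[int, List[str]] = defaultdict(list)
    for word in words:
        groups[len(word)].append(word)
    return dict(groups)


words = ["text", "vector", "parse", "model", "word"]
categories = group_by_category(words)
# Each category now holds at least one segmented word to be used,
# for example categories[4] == ["text", "word"].
```

Each resulting category can then be passed on to the vector processing step independently of the others.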

Optionally, the to-be-used vector determination submodule includes: a to-be-used vector determination unit, configured to perform vector processing on the at least one segmented word to be used in the current segmented-word category based on an embedding function, to obtain the vector to be used corresponding to the at least one segmented word to be used in the current segmented-word category.

Optionally, the to-be-used vector determination unit includes: a mapping element determination subunit, configured to retrieve a pre-built embedding matrix and determine the matrix mapping element corresponding to the at least one segmented word to be used in the current segmented-word category;

The to-be-used vector determination subunit is configured to determine, based on each matrix mapping element, the vector to be used corresponding to the respective segmented word to be used in the current segmented-word category.
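The embedding lookup described above can be illustrated with a small sketch. It assumes that the "matrix mapping element" of a segmented word is the row of the pre-built embedding matrix that the word maps to; the vocabulary `VOCAB`, the matrix values, and the function name `embed` are all hypothetical.

```python
from typing import Dict, List

# Hypothetical vocabulary: segmented word -> row index in the embedding matrix.
VOCAB: Dict[str, int] = {"text": 0, "word": 1, "vector": 2}

# Hypothetical pre-built embedding matrix, one row per segmented word.
EMBEDDING_MATRIX: List[List[float]] = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
]


def embed(words: List[str]) -> List[List[float]]:
    """Return the vector to be used for each segmented word in one category."""
    # Copy each row so callers cannot mutate the shared matrix by accident.
    return [list(EMBEDDING_MATRIX[VOCAB[word]]) for word in words]


vectors_to_use = embed(["word", "text"])
print(vectors_to_use)  # [[0.4, 0.5, 0.6], [0.1, 0.2, 0.3]]
```

In a trained system the matrix rows would be learned parameters rather than fixed constants.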

Optionally, the to-be-spliced vector determination module 230 includes: a weight determination submodule, configured to determine the weight to be used corresponding to each vector to be used according to each vector to be used and the original vector;

The submodule for determining the vector to be spliced is configured to perform weighted average processing according to each vector to be used and the weight to be used corresponding to each vector to be used, so as to obtain the vector to be spliced corresponding to the text to be analyzed.
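The weighting and averaging steps can be sketched as below. The text only states that each weight is determined from the vector to be used and the original vector; the dot-product score followed by a softmax is an assumption chosen for illustration, and the names `softmax`, `dot`, and `vector_to_be_spliced` are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax; the outputs are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def vector_to_be_spliced(
    to_be_used: List[List[float]], original: List[float]
) -> Tuple[List[float], List[float]]:
    """Return (weighted average of the vectors to be used, weights).

    Assumption: weight_i = softmax_i(vector_i . original). Because the
    weights sum to 1, the weighted sum is already a weighted average.
    """
    weights = softmax([dot(v, original) for v in to_be_used])
    dim = len(to_be_used[0])
    averaged = [
        sum(w * v[i] for w, v in zip(weights, to_be_used)) for i in range(dim)
    ]
    return averaged, weights


# Two hypothetical vectors to be used that score equally against the original.
averaged, weights = vector_to_be_spliced([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0])
```

With equal scores the two weights are equal, so the result is simply the mean of the two vectors.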

Optionally, the target vector determination module 240 includes: a target vector determination submodule, configured to perform a splicing process on the vector to be spliced and the original vector based on a pre-built encoder to obtain a target vector;

The text analysis submodule is configured to input the target vector into a pre-built syntactic analysis model to analyze the text to be analyzed based on the syntactic analysis model.
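The splicing step itself amounts to vector concatenation, which the sketch below illustrates. The order of concatenation (original vector first) is an assumption, the function name `splice` is hypothetical, and the pre-built encoder and the syntactic analysis model are not modeled here.

```python
from typing import List


def splice(to_be_spliced: List[float], original: List[float]) -> List[float]:
    """Concatenate the original vector and the vector to be spliced.

    Assumption: the original vector comes first in the target vector.
    """
    return list(original) + list(to_be_spliced)


original = [0.2, 0.4, 0.6]
to_be_spliced = [0.5, 0.5]
target = splice(to_be_spliced, original)
print(len(target))  # 5
```

The target vector has as many dimensions as the two inputs combined, so a downstream syntactic analysis model would need an input size that matches this sum.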

The text processing device provided in the embodiments of the present application can execute the text processing method provided in any embodiment of the present application, and has functional modules and effects corresponding to the executed method.

Embodiment 4

Fig. 4 shows a schematic diagram of the structure of an electronic device 10 of an embodiment of the present application. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device can also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (such as helmets, glasses, watches, etc.) and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely examples and are not intended to limit the implementation of the present application described and/or claimed herein.

As shown in Fig. 4, the electronic device 10 includes at least one processor 11, and a memory connected to the at least one processor 11, such as a read-only memory (ROM) 12, a random access memory (RAM) 13, etc., wherein the memory stores a computer program that can be executed by the at least one processor, and the processor 11 can perform a variety of appropriate actions and processes according to the computer program stored in the ROM 12 or the computer program loaded from the storage unit 18 to the RAM 13. In the RAM 13, a variety of programs and data required for the operation of the electronic device 10 can also be stored. The processor 11, the ROM 12, and the RAM 13 are connected to each other through a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.

A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16, such as a keyboard, a mouse, etc.; an output unit 17, such as various types of displays, speakers, etc.; a storage unit 18, such as a disk, an optical disk, etc.; and a communication unit 19, such as a network card, a modem, a wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.

The processor 11 may be a variety of general and/or special processing components with processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), a variety of dedicated artificial intelligence (AI) computing chips, a variety of processors running machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, etc. The processor 11 performs the multiple methods and processes described above, such as a text processing method.

In some embodiments, the text processing method may be implemented as a computer program, which is tangibly contained in a computer-readable storage medium, such as a storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed on the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the text processing method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the text processing method in any other suitable manner (e.g., by means of firmware).

Various embodiments of the systems and techniques described above may be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special purpose or general purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.

Computer programs for implementing the text processing method of the present application may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device, so that when the computer program is executed by the processor, the functions/operations specified in the flow chart and/or block diagram are implemented. The computer program may be executed entirely on a machine, partially on a machine, as a stand-alone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.

In the context of the present application, a computer-readable storage medium may be a tangible medium that may contain or store a computer program for use by or in conjunction with an instruction execution system, apparatus, or device. A computer-readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, a computer-readable storage medium may be a machine-readable signal medium. Examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a RAM, a ROM, an erasable programmable read-only memory (EPROM), a flash memory, an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. The storage medium may be a non-transitory storage medium.

To provide interaction with a user, the systems and techniques described herein may be implemented on an electronic device having: a display device (e.g., a cathode ray tube (CRT) or a liquid crystal display (LCD) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) through which the user can provide input to the electronic device. Other types of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form (including acoustic input, voice input, or tactile input).

The systems and techniques described herein may be implemented in a computing system that includes backend components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes frontend components (e.g., a user computer with a graphical user interface or a web browser through which a user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such backend components, middleware components, or frontend components. The components of the system may be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Network (LAN), Wide Area Network (WAN), blockchain network, and the Internet.

A computing system may include a client and a server. The client and the server are generally remote from each other and usually interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other. The server may be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the difficult management and weak business scalability of traditional physical hosts and virtual private server (VPS) services.

Steps may be reordered, added, or deleted using the various forms of processes shown above. For example, the multiple steps recorded in this application can be executed in parallel, sequentially, or in different orders, as long as the expected results of the technical solution of this application can be achieved; no limitation is imposed herein.

The above implementations do not constitute a limitation on the protection scope of the present application. Various modifications, combinations, sub-combinations and substitutions may be made according to design requirements and other factors.

Claims

1. A text processing method, comprising:

obtaining a text to be analyzed, and determining an original vector corresponding to the text to be analyzed;

extracting at least one segmented word to be used from the text to be analyzed, and determining a vector to be used corresponding to the at least one segmented word to be used;

obtaining a vector to be spliced of the text to be analyzed according to each vector to be used and a weight to be used corresponding to each vector to be used; and

splicing the vector to be spliced with the original vector to obtain a target vector, so as to perform text analysis on the text to be analyzed based on the target vector.
2. The method according to claim 1, wherein determining the original vector corresponding to the text to be analyzed comprises:

performing, based on a language representation model, vector processing on at least one segmented word to be used in the text to be analyzed to obtain a latent vector to be used corresponding to the at least one segmented word to be used; and

for each latent vector to be used, obtaining the original vector corresponding to the text to be analyzed based on the difference between the latent vector that follows the current latent vector and the current latent vector.
3. The method according to claim 1, wherein determining the vector to be used corresponding to the at least one segmented word to be used comprises:

determining the segmented-word category corresponding to each of the at least one segmented word to be used, wherein each segmented-word category includes at least one segmented word to be used; and

for each segmented-word category, performing vector processing on the at least one segmented word to be used in the current segmented-word category to obtain the vector to be used corresponding to each segmented-word category.
4. The method according to claim 3, wherein performing vector processing on the at least one segmented word to be used in the current segmented-word category to obtain the vector to be used corresponding to each segmented-word category comprises:

performing, based on an embedding function, vector processing on the at least one segmented word to be used in the current segmented-word category to obtain the vector to be used corresponding to the at least one segmented word to be used in the current segmented-word category.
5. The method according to claim 4, wherein performing, based on the embedding function, vector processing on the at least one segmented word to be used in the current segmented-word category to obtain the vector to be used corresponding to the at least one segmented word to be used in the current segmented-word category comprises:

retrieving a pre-built embedding matrix and determining the matrix mapping element corresponding to the at least one segmented word to be used in the current segmented-word category; and

determining, based on each matrix mapping element, the vector to be used corresponding to the respective segmented word to be used in the current segmented-word category.
6. The method according to claim 1, wherein obtaining the vector to be spliced of the text to be analyzed according to each vector to be used and the weight to be used corresponding to each vector to be used comprises:

determining, according to each vector to be used and the original vector, the weight to be used corresponding to each vector to be used; and

performing weighted average processing based on each vector to be used and the weight to be used corresponding to each vector to be used, so as to obtain the vector to be spliced corresponding to the text to be analyzed.
7. The method according to claim 1, wherein splicing the vector to be spliced with the original vector to obtain the target vector, so as to perform text analysis on the text to be analyzed based on the target vector, comprises:

splicing, based on a pre-built encoder, the vector to be spliced with the original vector to obtain the target vector; and

inputting the target vector into a pre-built syntactic analysis model to analyze the text to be analyzed based on the syntactic analysis model.
8. A text processing device, comprising:

an original vector determination module, configured to obtain a text to be analyzed and determine an original vector corresponding to the text to be analyzed;

a to-be-used vector determination module, configured to extract at least one segmented word to be used from the text to be analyzed, and determine a vector to be used corresponding to the at least one segmented word to be used;

a to-be-spliced vector determination module, configured to obtain a vector to be spliced of the text to be analyzed according to each vector to be used and a weight to be used corresponding to each vector to be used; and

a target vector determination module, configured to splice the vector to be spliced with the original vector to obtain a target vector, so as to perform text analysis on the text to be analyzed based on the target vector.
9. An electronic device, comprising:

at least one processor; and

a memory communicatively connected to the at least one processor; wherein

the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor so that the at least one processor can execute the text processing method according to any one of claims 1 to 7.
10. A computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions, and the computer instructions, when executed, cause a processor to implement the text processing method according to any one of claims 1 to 7.