CN115618848A - Text processing method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN115618848A
CN115618848A (application CN202211327875.3A)
Authority
CN
China
Prior art keywords
vector, text, analyzed, vectors, determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211327875.3A
Other languages
Chinese (zh)
Inventor
宋彦 (Song Yan)
田元贺 (Tian Yuanhe)
毛震东 (Mao Zhendong)
李世鹏 (Li Shipeng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Sicui Artificial Intelligence Research Institute Co ltd
Original Assignee
Suzhou Sicui Artificial Intelligence Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Sicui Artificial Intelligence Research Institute Co ltd
Priority to CN202211327875.3A
Priority to PCT/CN2022/134592 (publication WO2024087298A1)
Publication of CN115618848A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • G06F 40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a text processing method and apparatus, an electronic device and a storage medium, wherein the method comprises the following steps: acquiring a text to be analyzed, and determining an original vector corresponding to the text to be analyzed; extracting at least one participle to be used from the text to be analyzed, and determining the vector to be used corresponding to each participle to be used; obtaining the vector to be spliced of the text to be analyzed according to each vector to be used and the corresponding weight to be used; and splicing the vector to be spliced and the original vector to obtain a target vector, and performing text analysis on the text to be analyzed based on the target vector. This solves the problem that the syntactic component analysis result of the text is not accurate enough due to coarse text analysis granularity, and achieves accurate analysis of the syntactic component structure of the text.

Description

Text processing method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a text processing method and apparatus, an electronic device, and a storage medium.
Background
By parsing the text, the text can be more fully understood.
At present, syntactic analysis of text mostly relies on a more powerful encoder, while analysis of the text representation itself is lacking. Analysis results obtained in this way often miss important information in the text; that is, the syntactic structure analysis of the text is not fine-grained enough, which may make the syntactic analysis result of the text insufficiently accurate.
In order to solve the above problems, improvements in text analysis methods are required.
Disclosure of Invention
The invention provides a text processing method, a text processing device, electronic equipment and a storage medium, and aims to solve the problem that the syntactic component analysis result of a text is not accurate enough due to large text analysis granularity.
In a first aspect, an embodiment of the present invention provides a text processing method, including:
acquiring a text to be analyzed, and determining an original vector corresponding to the text to be analyzed;
extracting at least one participle to be used from the text to be analyzed, and determining a vector to be used corresponding to each participle to be used;
obtaining a vector to be spliced of the text to be analyzed according to each vector to be used and the corresponding weight to be used;
and splicing the vector to be spliced and the original vector to obtain a target vector, and performing text analysis on the text to be analyzed based on the target vector.
In a second aspect, an embodiment of the present invention further provides a text processing apparatus, including:
the system comprises an original vector determining module, a text analysis module and a text analysis module, wherein the original vector determining module is used for acquiring a text to be analyzed and determining an original vector corresponding to the text to be analyzed;
the to-be-used vector determining module is used for extracting at least one to-be-used word from the to-be-analyzed text and determining a to-be-used vector corresponding to each to-be-used word;
the to-be-spliced vector determining module is used for obtaining the to-be-spliced vector of the to-be-analyzed text according to each to-be-used vector and the corresponding to-be-used weight;
and the target vector determining module is used for splicing the vector to be spliced and the original vector to obtain a target vector, and performing text analysis on the text to be analyzed based on the target vector.
In a third aspect, an embodiment of the present invention further provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform a text processing method according to any of the embodiments of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, where computer instructions are stored, and the computer instructions are configured to, when executed, enable a processor to implement the text processing method according to any embodiment of the present invention.
According to the technical scheme of the embodiment of the invention, a text to be analyzed is acquired and its corresponding original vector is determined; the original vector can be obtained through the BERT model, so that the vector to be spliced obtained by the technical scheme can later be spliced with it to obtain the target vector. At least one participle to be used is extracted from the text to be analyzed, the word segmentation category corresponding to each participle to be used is determined, and the vector to be used corresponding to each participle to be used is determined based on an embedding function. Further, the vector to be spliced of the text to be analyzed is obtained according to each vector to be used and the corresponding weight to be used, where the weight to be used corresponding to each vector to be used can be determined from the weight of the corresponding word segmentation category. Finally, the vector to be spliced and the original vector are spliced to obtain the target vector, and text analysis is performed on the text to be analyzed based on the target vector. This solves the problem that the syntactic component analysis result of the text is not accurate enough due to coarse text analysis granularity, and achieves accurate analysis of the syntactic component structure of the text.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present invention, nor do they necessarily limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on them without creative effort.
Fig. 1 is a flowchart of a text processing method according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a model structure of text processing according to a second embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a text processing apparatus according to a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device implementing a text processing method according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein.
Example one
Fig. 1 is a flowchart of a text processing method according to an embodiment of the present invention. The embodiment is applicable to cases where the syntactic components of a text need to be analyzed more finely and accurately. The method may be executed by a text processing apparatus, which may be implemented in the form of hardware and/or software and configured in a computing device capable of executing the text processing method.
As shown in fig. 1, the method includes:
s110, obtaining a text to be analyzed, and determining an original vector corresponding to the text to be analyzed.
The text to be analyzed can be understood as the text which needs syntactic component analysis. The original vector can be understood as a vector obtained by vectorizing the text to be analyzed, for example, the original vector can be obtained by vectorizing the text to be analyzed through the existing language representation model.
In practical applications, syntactic component analysis of a text is a fundamental work of natural language processing, and operations such as viewpoint extraction and emotion analysis can be further performed on the text on the basis of the syntactic component analysis. When a text with a simple component structure is analyzed, syntactic component information in the text can be obtained more accurately, but for a text with a complex structure, the difficulty of syntactic analysis is high, and important information in the text can be missed. For example, in the prior art, a text may be vectorized, and the terminal vector and the head vector corresponding to the text are subtracted to obtain the syntactic component information corresponding to the text, but such an analysis method is rough and it is difficult to obtain more accurate syntactic component information from the text.
Specifically, a text to be analyzed, which needs to be subjected to syntactic component analysis, is obtained, and an original vector corresponding to the text to be analyzed is determined. Optionally, determining an original vector corresponding to the text to be analyzed includes: based on the language representation model, performing vector processing on at least one word to be used in the text to be analyzed to obtain a corresponding hidden vector to be used; and aiming at each hidden vector to be used, obtaining an original vector corresponding to the text to be analyzed based on a difference value of a next hidden vector relative to the current hidden vector and the current hidden vector.
In the technical scheme, feature extraction may be performed on a text to be analyzed based on a BERT model, and an original vector corresponding to the text to be analyzed is generated. It can be understood that the text to be analyzed includes at least one word segmentation, and in the present technical solution, each word segmentation is referred to as a word segmentation to be used. The corresponding hidden vectors to be used can be obtained by respectively carrying out vectorization processing on the participles to be used, so that the original vectors corresponding to the text to be analyzed can be obtained based on the hidden vectors to be used.
In practical application, word segmentation processing is carried out on the text to be analyzed to obtain at least one participle to be used, each participle to be used is encoded based on the BERT model to obtain the corresponding hidden vector to be used, and the hidden vectors to be used are spliced to obtain the text vector corresponding to the text to be analyzed. Specifically, this can be determined by the following formula:

h_1 … h_i … h_j … h_n = BERT(x_1 … x_i … x_j … x_n)

where h_i represents a hidden vector to be used and x_i represents a participle to be used; i, j and n are natural numbers representing the positions of the hidden vectors to be used in the text vector and the positions of the participles to be used in the text to be analyzed.
Based on the above explanation, the text vector includes at least one hidden vector to be used, and for each hidden vector to be used, a corresponding difference vector can be obtained by using a difference value between a next hidden vector corresponding to the current hidden vector and the current hidden vector, and the difference vector is used as an original vector corresponding to the current hidden vector. It should be noted that, in the technical solution, in order to make the result of parsing the syntactic component of the text to be analyzed more accurate, the text to be analyzed may be divided into a plurality of text intervals, each text interval includes at least one to-be-used participle, and an original vector corresponding to each to-be-used participle may be obtained by using a to-be-used implicit vector corresponding to each to-be-used participle, so as to perform more detailed analysis on the text to be analyzed based on the original vector of each to-be-used participle.
Specifically, taking one of the hidden vectors to be used as the current hidden vector as an example, the original vector corresponding to the current hidden vector may be obtained based on the following formula:
r_{i,j} = h_j - h_i

where r_{i,j} represents the original vector corresponding to the current hidden vector, h_j represents the subsequent hidden vector relative to the current hidden vector, and h_i represents the current hidden vector.
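As an illustrative sketch (not the patent's implementation), the span representation r_{i,j} = h_j - h_i can be computed for every text interval once the hidden vectors are available; here a small NumPy array stands in for the BERT hidden states h_1 … h_n, which are assumed to be given:

```python
import numpy as np

def span_vectors(hidden):
    """Compute r[(i, j)] = h_j - h_i for every text interval (i, j), i < j.

    `hidden` stands in for the BERT hidden vectors h_1..h_n; here it is
    any (n, d) array, since the encoder itself is out of scope.
    """
    n = len(hidden)
    return {(i, j): hidden[j] - hidden[i]
            for i in range(n) for j in range(i + 1, n)}

# Toy stand-in for BERT output: 4 tokens, 3-dimensional hidden vectors.
hidden = np.arange(12, dtype=float).reshape(4, 3)
r = span_vectors(hidden)
```

With n tokens this yields n(n-1)/2 interval vectors, one per candidate syntactic span.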
S120, extracting at least one to-be-used participle from the to-be-analyzed text, and determining to-be-used vectors corresponding to the to-be-used participles.
It should be noted that, in the technical solution, the parsing of the syntactic component of the text to be analyzed is further optimized on the basis of the existing parsing, that is, the original vector in the technical solution is a result of parsing the syntactic component of the text to be analyzed on the basis of the existing technology, and in the technical solution, the parsing of the syntactic component of the text to be analyzed is performed more finely on the basis of the original vector corresponding to the text to be analyzed. Because the vector corresponding to the word to be used is also used when the original vector is determined, for the convenience of distinction, the vector corresponding to the word to be used is called a hidden vector to be used when the original vector is determined, and the vector corresponding to the word to be used is called a vector to be used when the optimization is performed based on the technical scheme.
The vector to be used is a vector obtained by vectorizing the text to be analyzed based on the vector processing method of the technical scheme.
Specifically, when a text to be analyzed is analyzed, vectors to be used corresponding to each participle to be used in the text to be analyzed need to be determined. In the technical scheme, determining the vectors to be used corresponding to the participles to be used comprises the following steps: respectively determining the segmentation class corresponding to each segmentation to be used; and aiming at each word segmentation class, performing vector processing on at least one word to be used in the current word segmentation class to obtain a corresponding vector to be used.
In the technical scheme, the word segmentation category can be understood as an N-tuple category, where an N-tuple is a word block formed from N consecutive participles. For example, the text to be analyzed is "on playground", and segmenting it yields 3 participles to be used: "on", "playground" and "up". The text to be analyzed then corresponds to three different N-tuple categories: the unigrams "on", "playground" and "up"; the bigrams "on playground" and "playground up"; and the trigram "on playground up". Further, vector processing is performed on the participles to be used in each N-tuple category to obtain the corresponding vectors to be used.
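The grouping of contiguous participles into N-tuple categories described above can be sketched as follows; the function name and token list are illustrative, not taken from the patent's implementation:

```python
def ngram_categories(tokens, max_n=None):
    """Group all contiguous N-tuples of `tokens` by their length N
    (the word-segmentation category of the text interval)."""
    max_n = max_n or len(tokens)
    return {n: [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
            for n in range(1, max_n + 1)}

cats = ngram_categories(["on", "playground", "up"])
# cats[1] holds the unigrams, cats[2] the bigrams, cats[3] the trigram.
```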
In the technical solution, taking vector processing on a to-be-used participle in a current participle category as an example, vector processing is performed on at least one to-be-used participle in the current participle category to obtain a corresponding to-be-used vector, including: and respectively carrying out vector processing on at least one to-be-used participle in the current participle category based on the embedding function to obtain a corresponding to-be-used vector.
In the technical scheme, the embedding function may determine the to-be-used vectors corresponding to the to-be-used participles based on a pre-constructed embedding matrix. Specifically, based on the embedding function, vector processing is performed on at least one to-be-used participle in the current participle category to obtain a corresponding to-be-used vector, including: calling a pre-constructed embedded matrix, and determining a matrix mapping element corresponding to at least one to-be-used participle in the current participle category; and determining the vectors to be used corresponding to the corresponding participles to be used in the current participle category based on each matrix mapping element.
The matrix mapping element may be understood as an element in the embedded matrix corresponding to the to-be-used participle, and specifically may be a row number element of the embedded matrix corresponding to the to-be-used participle.
Illustratively, the pre-constructed embedding matrix can contain a large number of participles to be used, which are placed in the embedding matrix in order, generating the corresponding matrix mapping elements. It should be noted that each participle to be used corresponds to a unique vector in the embedding matrix; based on this, the vector to be used corresponding to a participle to be used may be determined from the pre-constructed embedding matrix and that participle's matrix mapping element. For example, if the matrix mapping element of "playground" in the embedding matrix is "11", then "playground" occupies the 11th row of the embedding matrix, i.e., the unique vector in that row is the vector to be used corresponding to "playground".
That is to say, in the present technical solution, in order to determine the to-be-used vector corresponding to each to-be-used participle, a matrix mapping element of each to-be-used participle in a pre-constructed embedded matrix may be determined first, so as to determine the to-be-used vector corresponding to the corresponding to-be-used participle according to a unique vector corresponding to each matrix mapping element.
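A minimal sketch of this lookup, assuming a hypothetical vocabulary that maps each participle to its row number (matrix mapping element) in a pre-built embedding matrix; the names `vocab` and `embed`, the matrix sizes, and the row index 11 for "playground" are all illustrative:

```python
import numpy as np

# Hypothetical vocabulary: each participle maps to a row number
# (the "matrix mapping element") in a pre-built embedding matrix.
vocab = {"on": 0, "playground": 11, "up": 2}
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(16, 4))  # 16 rows, 4-dim embeddings

def embed(word):
    """Return the unique to-be-used vector stored at the word's row."""
    return embedding_matrix[vocab[word]]
```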
S130, obtaining the vector to be spliced of the text to be analyzed according to each vector to be used and the corresponding weight to be used.
The vector to be spliced can be used for being spliced with the original vector to obtain a target vector, and the text to be analyzed is subjected to more detailed syntactic component analysis based on the target vector.
In the technical scheme, when a text to be analyzed is analyzed, text intervals are firstly divided for the text to be analyzed to obtain at least one text interval, namely at least one word segmentation class, different word segmentation classes comprise at least one word to be used, and each word to be used corresponds to a unique vector to be used. It should be noted that the to-be-used weight corresponding to each to-be-used vector is consistent with the weight corresponding to the corresponding participle category. That is to say, if the current word segmentation class includes 3 to-be-used word segmentations, and the 3 to-be-used word segmentations respectively correspond to different to-be-used vectors, and if the weight value corresponding to the current word segmentation class is 0.2, the to-be-used weights corresponding to the 3 to-be-used vectors are all 0.2.
In practical applications, the number of the to-be-used participles in each participle category may be one or multiple. Taking the current participle category as an example, when determining the weight corresponding to the current participle category, that is, the weight to be used, the weight can be determined based on the following formula:
Denote the vector to be used of the v-th N-tuple in the u-th category by e_{u,v} and the number of N-tuples in the u-th category by m_u. Then:

p_{u,v} = exp(r_{i,j} · e_{u,v}) / Σ_{v'=1}^{m_u} exp(r_{i,j} · e_{u,v'})

where p_{u,v} represents the weight to be used, exp represents the exponential function with base the natural constant e, and r_{i,j} represents the original vector.
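The weight computation above amounts to a softmax over the inner products of the span's original vector with each N-tuple vector in a category. A sketch under that reading (the toy vectors are illustrative):

```python
import numpy as np

def tuple_weights(r_ij, tuple_vecs):
    """Softmax of inner products between the span's original vector r_ij
    and each to-be-used N-tuple vector in one category."""
    scores = np.array([np.dot(r_ij, e) for e in tuple_vecs])
    scores -= scores.max()          # subtract max for numerical stability
    w = np.exp(scores)
    return w / w.sum()

r_ij = np.array([1.0, 0.0])
tuple_vecs = [np.array([2.0, 0.0]), np.array([0.0, 2.0])]
p = tuple_weights(r_ij, tuple_vecs)
```

The N-tuple whose vector points most in the direction of r_{i,j} receives the largest weight.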
Further, obtaining the vector to be spliced of the text to be analyzed according to each vector to be used and the corresponding weight to be used includes: respectively determining the corresponding weight to be used according to each vector to be used and the original vector; and carrying out weighted average processing on each vector to be used with its corresponding weight to be used to obtain the vector to be spliced corresponding to the text to be analyzed.
Specifically, the vector to be spliced can be obtained by the following formulas.

First, the weighted average vector a_u corresponding to each N-tuple category is determined:

a_u = Σ_{v=1}^{m_u} p_{u,v} · e_{u,v}

where a_u represents the weighted average vector of the N-tuples of the u-th category, p_{u,v} represents the weight to be used, e_{u,v} represents the vector to be used, and · is the product sign.

Next, the N-tuple weighted average vectors of all categories are spliced to obtain a vector (i.e., the vector to be spliced) containing N-tuple information:

a_{i,j} = a_1 ⊕ a_2 ⊕ … ⊕ a_N

where a_{i,j} represents the vector to be spliced, ⊕ is the vector splicing symbol, and a_u represents the weighted average vector of the N-tuples of the u-th category.
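A sketch of the two steps above: per-category weighted averaging followed by splicing (concatenation) of the category averages into a_{i,j}; the toy weights and vectors are illustrative:

```python
import numpy as np

def span_ngram_vector(weights_by_cat, vecs_by_cat):
    """a_u = sum_v p_{u,v} * e_{u,v} per category, then concatenate the
    per-category averages into the vector to be spliced a_{i,j}."""
    averages = [np.sum(w[:, None] * np.stack(vs), axis=0)
                for w, vs in zip(weights_by_cat, vecs_by_cat)]
    return np.concatenate(averages)

# Two toy categories: two weighted bigram vectors, one trigram vector.
w1 = np.array([0.5, 0.5])
v1 = [np.array([1.0, 3.0]), np.array([3.0, 1.0])]
w2 = np.array([1.0])
v2 = [np.array([4.0, 6.0])]
a_ij = span_ngram_vector([w1, w2], [v1, v2])
```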
S140, splicing the vector to be spliced and the original vector to obtain a target vector, and performing text analysis on the text to be analyzed based on the target vector.
The target vector can be understood as a vector corresponding to the text to be analyzed, which is obtained by splicing the vectors to be used.
Specifically, the target vector may be determined based on the following formula:
r'_{i,j} = a_{i,j} ⊕ r_{i,j}

where r'_{i,j} represents the target vector, a_{i,j} represents the vector to be spliced, r_{i,j} represents the original vector, and ⊕ is the vector splicing symbol.
Optionally, the vector to be spliced and the original vector are spliced to obtain a target vector, and text analysis is performed on the text to be analyzed based on the target vector, including: splicing the vector to be spliced and the original vector based on a pre-constructed encoder to obtain a target vector; and inputting the target vector into a pre-constructed syntactic analysis model, and analyzing the text to be analyzed based on the syntactic analysis model.
Specifically, on the basis of the original vector, the target vector obtained by processing the text to be analyzed by the technical scheme is spliced, so that the problem that the analysis result is not accurate enough due to the fact that the text to be analyzed is relatively rough in the prior art can be solved. That is to say, in the technical scheme, on the basis of the existing vector representation of the text to be analyzed, vector representation information corresponding to each word to be used is added, and the vector representation information are combined, so that more syntactic structure information corresponding to the text to be analyzed can be obtained. Therefore, the target vector is analyzed based on the pre-constructed syntactic analysis model, and a more accurate analysis result can be obtained.
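The final splicing of the vector to be spliced with the original span vector is a plain concatenation; a one-line sketch with illustrative inputs:

```python
import numpy as np

def target_vector(a_ij, r_ij):
    """Splice the N-tuple vector a_{i,j} with the original span vector
    r_{i,j} to obtain the target vector r'_{i,j}."""
    return np.concatenate([a_ij, r_ij])

r_prime = target_vector(np.array([2.0, 2.0]), np.array([1.0, -1.0]))
```

The target vector thus carries both the coarse span representation and the finer N-tuple information side by side.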
According to the technical scheme of the embodiment of the invention, a text to be analyzed is acquired and its corresponding original vector is determined; the original vector can be obtained through the BERT model, and the vector to be spliced obtained by the technical scheme is spliced with it to obtain the target vector. At least one participle to be used is extracted from the text to be analyzed, the word segmentation category corresponding to each participle to be used is determined, and the vector to be used corresponding to each participle to be used is determined based on an embedding function. Further, the vector to be spliced of the text to be analyzed is obtained according to each vector to be used and the corresponding weight to be used, where the weight to be used corresponding to each vector to be used is determined from the weight of the corresponding word segmentation category. Finally, the vector to be spliced and the original vector are spliced to obtain the target vector, and text analysis is performed on the text to be analyzed based on the target vector. This solves the problem that the syntactic component analysis result of the text is not accurate enough due to coarse text analysis granularity, and achieves accurate analysis of the syntactic component structure of the text.
Example two
In a specific example, the model used by the present technical solution to analyze the text to be analyzed is shown in fig. 2, taking the text to be analyzed "and kick a ball on a playground" as an example. When the prior art performs syntactic component analysis on a text to be analyzed, a method based on a graph structure is usually adopted. Specifically, an encoder such as a BERT model can be used to encode the text to be analyzed, consisting of the participles to be used x_1 … x_n, to obtain the corresponding hidden vectors (where the hidden vector to be used of the i-th participle is h_i), with the formula:

h_1 … h_i … h_j … h_n = BERT(x_1 … x_i … x_j … x_n)

where h_i represents a hidden vector to be used and x_i represents a participle to be used; i, j and n are natural numbers representing the positions of the hidden vectors to be used in the text vector and the positions of the participles to be used in the text to be analyzed.
Further, the representation r_{i,j} of each text interval (x_i, x_j) = x_i … x_{j-1} can be obtained by the following formula:

r_{i,j} = h_j - h_i

where r_{i,j} represents the original vector corresponding to the current hidden vector, h_j represents the subsequent hidden vector relative to the current hidden vector, and h_i represents the current hidden vector.
Next, two fully connected layers (where the matrix W_1 and offset vector b_1 are the parameters of the first fully connected layer, the matrix W_2 and offset vector b_2 are the parameters of the second fully connected layer, and ReLU is an activation function) map r_{i,j} to a vector o_{i,j}:

o_{i,j} = W_2 · (ReLU(W_1 · r_{i,j} + b_1)) + b_2

where the dimension of the vector o_{i,j} equals the number of syntactic component categories (e.g., noun phrase (NP), verb phrase (VP), prepositional phrase (PP), etc.), and the value of a given dimension of the vector represents the score of the text interval (x_i, x_j) belonging to a certain syntactic category l, denoted s(i, j, l).
And finally, inputting all text interval scores s (i, j, l) of the text to be analyzed into a Cocke-Younger-Kasami (CYK) algorithm, and calculating to obtain the optimal legal syntax tree with the highest score.
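The span-scoring step can be sketched as two fully connected layers with a ReLU in between, producing one score per syntactic category; the toy dimensions and random parameters are illustrative, and the CYK decoding over the scores s(i, j, l) is omitted:

```python
import numpy as np

def score_span(r_ij, W1, b1, W2, b2):
    """o_{i,j} = W2 @ ReLU(W1 @ r_ij + b1) + b2: one score per
    syntactic category (NP, VP, PP, ...)."""
    hidden = np.maximum(0.0, W1 @ r_ij + b1)   # ReLU activation
    return W2 @ hidden + b2

rng = np.random.default_rng(42)
d, h, n_labels = 4, 8, 3            # toy sizes: input dim, hidden dim, #categories
W1, b1 = rng.normal(size=(h, d)), np.zeros(h)
W2, b2 = rng.normal(size=(n_labels, h)), np.zeros(n_labels)
o = score_span(rng.normal(size=d), W1, b1, W2, b2)
```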
According to the technical scheme, the text to be analyzed is further optimized and analyzed on the basis of the above syntactic component analysis. Specifically, the text to be analyzed is divided into at least one text interval, and the word segmentation category corresponding to each text interval is determined, that is, the corresponding word segmentation category is determined according to the number of participles to be used. In practical application, all N-tuples matching the text interval (x_i, x_j) can be extracted according to an existing N-tuple vocabulary N (i.e., if an N-tuple in the vocabulary N is a substring of the text interval (x_i, x_j), that N-tuple is extracted). The extracted N-tuples are then grouped in order of length, each length corresponding to a different word segmentation category; the v-th N-tuple belonging to the u-th category is recorded as g_{u,v}, and the u-th category contains m_u N-tuples in total.
For example, the text to be analyzed is "on the playground". Segmenting it yields 3 to-be-used participles, namely "on", "playground", and "up", and the text to be analyzed corresponds to three different categories of N-tuples, namely unigrams: "on", "playground", and "up"; bigrams: "on playground" and "playground up"; and the trigram: "on the playground".
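The matching step can be sketched as follows; the lexicon is modeled as a plain set of token tuples, which is an assumption (the patent only requires an existing N-tuple vocabulary), and the English tokens are illustrative:

```python
def extract_ngrams(tokens, lexicon, max_n=3):
    """Enumerate every N-tuple inside a text interval and keep those
    found in the pre-built N-tuple vocabulary, grouped by length
    (i.e., by word-segmentation category)."""
    by_len = {n: [] for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            gram = tuple(tokens[start:start + n])
            if gram in lexicon:
                by_len[n].append(gram)
    return by_len
```

With tokens `["on", "the", "playground"]` and a lexicon containing the matching unigrams, bigrams, and trigram, each category collects its own list of N-tuples.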
Further, based on the embedding function, the N-tuple n^{u,v} is mapped to an N-tuple embedding e^{u,v}. Specifically, the row number of the corresponding index in a pre-constructed embedding matrix (i.e., the matrix mapping element) can be looked up, and the vector at that row is extracted as the to-be-used vector corresponding to the to-be-used participle.
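The lookup can be sketched as follows; the vocabulary indices, embedding dimension, and random matrix values are placeholders standing in for the pre-constructed embedding matrix:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed N-tuple -> row-number mapping (the "matrix mapping element").
vocab = {("on",): 0, ("playground",): 1, ("on", "the"): 2}
E = rng.normal(size=(len(vocab), 4))  # pre-constructed embedding matrix

def embed(ngram):
    # The row at the N-tuple's index is its to-be-used vector.
    return E[vocab[ngram]]
```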
Further, for the N-tuples in the u-th category, the weight p^{u,v} of each N-tuple of the current category (i.e., the weight to be used) can be determined by the following formula:
p^{u,v} = exp(r_{i,j} · e^{u,v}) / Σ_{v'=1}^{m_u} exp(r_{i,j} · e^{u,v'})
where p^{u,v} represents the weight to be used, exp represents the exponential function with the natural constant e as its base, r_{i,j} represents the original vector, e^{u,v} represents the to-be-used vector of the N-tuple, and m_u represents the number of N-tuples in the u-th category.
The weighted average vector a^u of the N-tuples of the u-th category is calculated by the following formula:
a^u = Σ_{v=1}^{m_u} p^{u,v} · e^{u,v}
where a^u represents the weighted average vector of the N-tuples, p^{u,v} represents the weight to be used, e^{u,v} represents the vector to be used, and · is the vector inner product sign.
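The weighting and averaging steps can be sketched together; the shapes are illustrative, and the dot-product-then-softmax form follows the formulas above:

```python
import numpy as np

def category_vector(r_ij, embeddings):
    """Weight each N-tuple embedding of one category by a softmax over
    its inner product with the original vector r_ij, then return the
    weighted average vector of the category."""
    E = np.stack(embeddings)                   # (num_ngrams, dim)
    logits = E @ r_ij                          # r_ij . e^{u,v}
    logits = logits - logits.max()             # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()  # weights to be used
    return p @ E, p                            # weighted average, weights
```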
Next, the N-tuple weighted average vectors of all categories are spliced to obtain a vector containing N-tuple information (i.e., the vector to be spliced):
a_{i,j} = a^1 ⊕ a^2 ⊕ … ⊕ a^m
where a_{i,j} represents the vector to be spliced, ⊕ is the vector concatenation symbol, and a^u represents the weighted average vector of the N-tuples of the u-th category.
Finally, the vector to be spliced and the original vector are spliced based on the following formula to obtain the target vector:
r'_{i,j} = a_{i,j} ⊕ r_{i,j}
where r'_{i,j} represents the target vector, a_{i,j} represents the vector to be spliced, r_{i,j} represents the original vector, and ⊕ is the vector concatenation symbol.
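The two splicing steps reduce to plain concatenation; the dimensions below are arbitrary placeholders:

```python
import numpy as np

# One weighted average vector per N-tuple category (assumed dim 4).
a_uni, a_bi, a_tri = np.ones(4), 2 * np.ones(4), 3 * np.ones(4)
r_ij = np.zeros(8)                           # original vector (assumed dim 8)

a_ij = np.concatenate([a_uni, a_bi, a_tri])  # vector to be spliced
r_prime_ij = np.concatenate([a_ij, r_ij])    # target vector r'_ij
```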
Further, syntactic component analysis is performed on the text to be analyzed based on the target vector, so that a syntactic component analysis result can be obtained.
Compared with the prior art, in this technical scheme the text to be analyzed is divided into a plurality of sub-text intervals, the N-tuples of the text in each interval are determined, and corresponding weights are set according to the influence of each N-tuple on syntactic component analysis, so that when the text to be analyzed is analyzed based on each N-tuple, the granularity of the text analysis is finer and the analysis result of the text to be analyzed is more accurate.
According to the technical scheme of this embodiment of the invention, the text to be analyzed is obtained and its corresponding original vector is determined; the original vector can be obtained through a BERT model, and is later concatenated with the vector to be spliced obtained by this scheme to produce the target vector. At least one to-be-used participle is extracted from the text to be analyzed, the word-segmentation category corresponding to each participle is determined, and the to-be-used vector corresponding to each participle is determined based on an embedding function. Further, the vector to be spliced of the text to be analyzed is obtained from the to-be-used vectors and their corresponding to-be-used weights; the weight corresponding to each to-be-used vector can be determined from the weight corresponding to its word-segmentation category. Finally, the vector to be spliced and the original vector are spliced to obtain the target vector, and text analysis is performed on the text to be analyzed based on the target vector. This solves the problem that coarse-grained text analysis yields insufficiently accurate syntactic component analysis results, and achieves accurate analysis of the syntactic component structure of the text.
Example Three
Fig. 3 is a schematic structural diagram of a text processing apparatus according to a third embodiment of the present invention. As shown in fig. 3, the apparatus includes: an original vector determination module 210, a to-be-used vector determination module 220, a to-be-spliced vector determination module 230, and a target vector determination module 240.
The original vector determining module 210 is configured to obtain a text to be analyzed, and determine an original vector corresponding to the text to be analyzed;
the to-be-used vector determining module 220 is configured to extract at least one to-be-used participle from the to-be-analyzed text, and determine a to-be-used vector corresponding to each to-be-used participle;
the to-be-spliced vector determining module 230 is configured to obtain the to-be-spliced vector of the text to be analyzed according to the to-be-used vectors and the corresponding to-be-used weights;
and the target vector determining module 240 is configured to perform splicing processing on the vector to be spliced and the original vector to obtain a target vector, and perform text analysis on the text to be analyzed based on the target vector.
According to the technical scheme of this embodiment of the invention, the text to be analyzed is obtained and its corresponding original vector is determined; the original vector can be obtained through a BERT model, and is later concatenated with the vector to be spliced obtained by this scheme to produce the target vector. At least one to-be-used participle is extracted from the text to be analyzed, the word-segmentation category corresponding to each participle is determined, and the to-be-used vector corresponding to each participle is determined based on an embedding function. Further, the vector to be spliced of the text to be analyzed is obtained from the to-be-used vectors and their corresponding to-be-used weights; the weight corresponding to each to-be-used vector can be determined from the weight corresponding to its word-segmentation category. Finally, the vector to be spliced and the original vector are spliced to obtain the target vector, and text analysis is performed on the text to be analyzed based on the target vector. This solves the problem that coarse-grained text analysis yields insufficiently accurate syntactic component analysis results, and achieves accurate analysis of the syntactic component structure of the text.
Optionally, the original vector determining module includes: the hidden vector determining submodule is used for carrying out vector processing on at least one word to be used in the text to be analyzed based on the language representation model to obtain a corresponding hidden vector to be used;
and the original vector determining submodule is used for obtaining an original vector corresponding to the text to be analyzed based on a difference value of a next hidden vector relative to the current hidden vector and the current hidden vector aiming at each hidden vector to be used.
Optionally, the vector to be used determining module includes: the word segmentation class determination submodule is used for respectively determining the word segmentation class corresponding to each word to be used; the word segmentation category comprises at least one word to be used;
and the to-be-used vector determination submodule is used for carrying out vector processing on at least one to-be-used word in the current word segmentation class aiming at each word segmentation class to obtain a corresponding to-be-used vector.
Optionally, the vector to be used determining sub-module includes: and the to-be-used vector determining unit is used for respectively carrying out vector processing on at least one to-be-used participle in the current participle category based on the embedding function to obtain a corresponding to-be-used vector.
Optionally, the to-be-used vector determining unit includes: the mapping element determining subunit is used for calling a pre-constructed embedded matrix and determining a matrix mapping element corresponding to at least one to-be-used word in the current word segmentation class;
and the to-be-used vector determining subunit is used for determining the to-be-used vector corresponding to the corresponding to-be-used participle in the current participle category based on each matrix mapping element.
Optionally, the module for determining a vector to be spliced includes: the weight determining submodule is used for respectively determining corresponding weights to be used according to the vectors to be used and the original vectors;
and the vector to be spliced determining submodule is used for carrying out weighted average processing according to each vector to be used and the corresponding weight to be used so as to obtain the vector to be spliced corresponding to the text to be analyzed.
Optionally, the target vector determining module includes: the target vector determining submodule is used for splicing the vector to be spliced and the original vector based on a pre-constructed encoder to obtain a target vector;
and the text analysis sub-module is used for inputting the target vector into a pre-constructed syntactic analysis model so as to analyze the text to be analyzed based on the syntactic analysis model.
The text processing device provided by the embodiment of the invention can execute the text processing method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
Example Four
Fig. 4 shows a schematic structural diagram of the electronic device 10 of the embodiment of the present invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 4, the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 11 can perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from a storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data necessary for the operation of the electronic apparatus 10 can also be stored. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The processor 11 performs the various methods and processes described above, such as a text processing method.
In some embodiments, the text processing method may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the text processing method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the text processing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for implementing the text processing method of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine or entirely on a remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the defects of high management difficulty and weak service scalability found in traditional physical host and VPS (Virtual Private Server) services.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired result of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method of text processing, comprising:
acquiring a text to be analyzed, and determining an original vector corresponding to the text to be analyzed;
extracting at least one participle to be used from the text to be analyzed, and determining a vector to be used corresponding to each participle to be used;
obtaining a vector to be spliced of the text to be analyzed according to the vectors to be used and the corresponding weights to be used;
and splicing the vector to be spliced and the original vector to obtain a target vector, and performing text analysis on the text to be analyzed based on the target vector.
2. The method of claim 1, wherein the determining an original vector corresponding to the text to be analyzed comprises:
based on a language representation model, carrying out vector processing on at least one word to be used in the text to be analyzed to obtain a corresponding hidden vector to be used;
and aiming at each hidden vector to be used, obtaining an original vector corresponding to the text to be analyzed based on a difference value of a next hidden vector relative to the current hidden vector and the current hidden vector.
3. The method according to claim 1, wherein the determining the to-be-used vector corresponding to each to-be-used participle comprises:
respectively determining the segmentation class corresponding to each segmentation to be used; the word segmentation category comprises at least one word to be used;
and aiming at each word segmentation class, performing vector processing on at least one word to be used in the current word segmentation class to obtain a corresponding vector to be used.
4. The method according to claim 3, wherein the performing vector processing on at least one to-be-used participle in the current participle category to obtain a corresponding to-be-used vector comprises:
and respectively carrying out vector processing on at least one to-be-used participle in the current participle category based on an embedding function to obtain a corresponding to-be-used vector.
5. The method according to claim 4, wherein the performing vector processing on at least one to-be-used participle in the current participle category based on the embedding function to obtain a corresponding to-be-used vector comprises:
calling a pre-constructed embedded matrix, and determining a matrix mapping element corresponding to at least one word to be used in the current word segmentation class;
and determining the vectors to be used corresponding to the corresponding participles to be used in the current participle category based on each matrix mapping element.
6. The method according to claim 1, wherein obtaining the vector to be spliced of the text to be analyzed according to the vectors to be used and the corresponding weights to be used comprises:
determining corresponding weights to be used respectively according to the vectors to be used and the original vectors;
and carrying out weighted average processing according to each vector to be used and the corresponding weight to be used to obtain the vector to be spliced corresponding to the text to be analyzed.
7. The method according to claim 1, wherein the stitching the vector to be stitched with the original vector to obtain a target vector, so as to perform text analysis on the text to be analyzed based on the target vector, including:
splicing the vectors to be spliced and the original vectors based on a pre-constructed encoder to obtain target vectors;
and inputting the target vector into a pre-constructed syntactic analysis model so as to analyze the text to be analyzed based on the syntactic analysis model.
8. A text processing apparatus, comprising:
the system comprises an original vector determining module, a text analysis module and a text analysis module, wherein the original vector determining module is used for acquiring a text to be analyzed and determining an original vector corresponding to the text to be analyzed;
the to-be-used vector determining module is used for extracting at least one to-be-used word from the to-be-analyzed text and determining a to-be-used vector corresponding to each to-be-used word;
the to-be-spliced vector determining module is used for obtaining the to-be-spliced vector of the text to be analyzed according to the to-be-used vectors and the corresponding to-be-used weights;
and the target vector determining module is used for splicing the vector to be spliced and the original vector to obtain a target vector, and performing text analysis on the text to be analyzed based on the target vector.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the text processing method of any one of claims 1-7.
10. A computer-readable storage medium storing computer instructions for causing a processor to perform the text processing method of any one of claims 1-7 when executed.
CN202211327875.3A 2022-10-27 2022-10-27 Text processing method and device, electronic equipment and storage medium Pending CN115618848A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202211327875.3A CN115618848A (en) 2022-10-27 2022-10-27 Text processing method and device, electronic equipment and storage medium
PCT/CN2022/134592 WO2024087298A1 (en) 2022-10-27 2022-11-28 Text processing method and apparatus, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211327875.3A CN115618848A (en) 2022-10-27 2022-10-27 Text processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115618848A true CN115618848A (en) 2023-01-17

Family

ID=84875704

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211327875.3A Pending CN115618848A (en) 2022-10-27 2022-10-27 Text processing method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN115618848A (en)
WO (1) WO2024087298A1 (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408268B (en) * 2021-06-22 2023-01-13 平安科技(深圳)有限公司 Slot filling method, device, equipment and storage medium
CN113536772A (en) * 2021-07-15 2021-10-22 浙江诺诺网络科技有限公司 Text processing method, device, equipment and storage medium
CN113919344B (en) * 2021-09-26 2022-09-23 腾讯科技(深圳)有限公司 Text processing method and device

Also Published As

Publication number Publication date
WO2024087298A1 (en) 2024-05-02

Similar Documents

Publication Publication Date Title
CN111931517B (en) Text translation method, device, electronic equipment and storage medium
US20230114673A1 (en) Method for recognizing token, electronic device and storage medium
CN115438650B (en) Contract text error correction method, system, equipment and medium fusing multi-source characteristics
CN115640520B (en) Pre-training method, device and storage medium of cross-language cross-modal model
CN114912450B (en) Information generation method and device, training method, electronic device and storage medium
CN113053367A (en) Speech recognition method, model training method and device for speech recognition
CN114490998A (en) Text information extraction method and device, electronic equipment and storage medium
CN113407698B (en) Method and device for training and recognizing intention of intention recognition model
CN113743101A (en) Text error correction method and device, electronic equipment and computer storage medium
CN113553428A (en) Document classification method and device and electronic equipment
CN113407610A (en) Information extraction method and device, electronic equipment and readable storage medium
CN112906368A (en) Industry text increment method, related device and computer program product
CN113204616B (en) Training of text extraction model and text extraction method and device
CN114841172A (en) Knowledge distillation method, apparatus and program product for text matching double tower model
CN112541557B (en) Training method and device for generating countermeasure network and electronic equipment
CN115510860A (en) Text sentiment analysis method and device, electronic equipment and storage medium
CN114647727A (en) Model training method, device and equipment applied to entity information recognition
CN115618848A (en) Text processing method and device, electronic equipment and storage medium
CN113408269A (en) Text emotion analysis method and device
CN113553833A (en) Text error correction method and device and electronic equipment
CN112560481A (en) Statement processing method, device and storage medium
CN112560466A (en) Link entity association method and device, electronic equipment and storage medium
CN113743409A (en) Text recognition method and device
CN112989797B (en) Model training and text expansion methods, devices, equipment and storage medium
CN112507712B (en) Method and device for establishing slot identification model and slot identification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination