WO2024087298A1 - Text processing method and apparatus, electronic device and storage medium

Info

Publication number
WO2024087298A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
text
analyzed
word
category
Prior art date
Application number
PCT/CN2022/134592
Other languages
French (fr)
Chinese (zh)
Inventor
宋彦
田元贺
毛震东
李世鹏
Original Assignee
苏州思萃人工智能研究所有限公司
Application filed by 苏州思萃人工智能研究所有限公司

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/30: Semantic analysis

Definitions

  • The present application relates to the technical field of natural language processing, for example, to a text processing method and apparatus, an electronic device, and a storage medium.
  • The present application provides a text processing method and apparatus, an electronic device, and a storage medium, to solve the problem that the results of syntactic component analysis are inaccurate because the text is analyzed at too coarse a granularity.
  • the present application embodiment provides a text processing method, including:
  • the vector to be concatenated is concatenated with the original vector to obtain a target vector, so as to perform text analysis on the text to be analyzed based on the target vector.
  • the present application also provides a text processing device, including:
  • An original vector determination module configured to obtain a text to be analyzed and determine an original vector corresponding to the text to be analyzed
  • a to-be-used vector determination module configured to extract at least one to-be-used word from the to-be-analyzed text, and determine a to-be-used vector corresponding to the at least one to-be-used word;
  • a module for determining vectors to be spliced configured to obtain vectors to be spliced of the text to be analyzed according to each vector to be used and a weight to be used corresponding to each vector to be used;
  • the target vector determination module is configured to perform a splicing process on the vector to be spliced and the original vector to obtain a target vector, so as to perform text analysis on the text to be analyzed based on the target vector.
  • the present application also provides an electronic device, including:
  • the memory stores a computer program that can be executed by the at least one processor, and the computer program is executed by the at least one processor so that the at least one processor can execute the text processing method described in any embodiment of the present application.
  • An embodiment of the present application further provides a computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions, and the computer instructions are used to enable a processor to implement the text processing method described in any embodiment of the present application when executed.
  • FIG. 1 is a flow chart of a text processing method provided according to Embodiment 1 of the present application.
  • FIG. 2 is a schematic diagram of a model structure of text processing provided according to Embodiment 2 of the present application.
  • FIG. 3 is a schematic diagram of the structure of a text processing device provided according to Embodiment 3 of the present application.
  • FIG. 4 is a schematic diagram of the structure of an electronic device that implements the text processing method according to an embodiment of the present application.
  • FIG. 1 is a flow chart of a text processing method provided in Embodiment 1 of the present application. This embodiment is applicable to situations where a more detailed and accurate analysis of the syntactic components of a text is required.
  • the method can be executed by a text processing device, which can be implemented in the form of hardware and/or software.
  • the text processing device can be configured in a computing device that can execute the text processing method.
  • the method includes the following steps.
  • S110 Acquire a text to be analyzed, and determine an original vector corresponding to the text to be analyzed.
  • the text to be analyzed can be understood as the text that needs to be analyzed for syntactic components.
  • the original vector can be understood as the vector obtained after the text to be analyzed is vectorized.
  • the text to be analyzed can be vectorized through a language representation model to obtain the original vector.
  • Syntactic component analysis of text is a basic task in natural language processing. Based on syntactic component analysis, operations such as opinion extraction or sentiment analysis can be performed on the text.
  • the syntactic component information in the text can usually be obtained more accurately.
  • syntactic analysis is more difficult, which may result in missing important information in the text.
  • In the related art, the text can be vectorized, and the syntactic component information corresponding to the text can be obtained from the difference between the end vector and the beginning vector of the text.
  • Such an analysis method is relatively coarse, and it is difficult to obtain more accurate syntactic component information from the text with it.
  • Determining the original vector corresponding to the text to be analyzed includes: based on the language representation model, performing vector processing on at least one to-be-used word segment in the text to be analyzed to obtain a to-be-used latent vector corresponding to each to-be-used word segment; and, for each to-be-used latent vector, obtaining the original vector corresponding to the text to be analyzed based on the difference between the next latent vector and the current latent vector.
  • The language representation model may be based on Bidirectional Encoder Representations from Transformers (BERT), which has powerful language representation and feature extraction capabilities.
  • Features can be extracted from the text to be analyzed based on the BERT model to generate the original vector corresponding to the text to be analyzed.
  • The text to be analyzed includes at least one word segment, and each word segment is called a to-be-used word segment.
  • By encoding each to-be-used word segment, the corresponding to-be-used latent vector can be obtained, so that the original vector corresponding to the text to be analyzed can be obtained based on the at least one to-be-used latent vector.
  • Specifically, the text to be analyzed is segmented to obtain at least one to-be-used word segment, and the at least one to-be-used word segment is encoded based on the BERT model to obtain the corresponding to-be-used latent vector.
  • A text vector corresponding to the text to be analyzed can be obtained. It can be determined by the following formula: $h_1, \dots, h_n = \mathrm{BERT}(x_1, \dots, x_n)$
  • $h_i$ represents the latent vector to be used
  • $x_i$ represents the word segment to be used.
  • i, j and n are natural numbers, which are used to represent the position of the latent vector to be used in the text vector and the position of the word segment to be used in the text to be analyzed.
  • the above text vector includes at least one latent vector to be used.
  • For each latent vector to be used, the current latent vector is subtracted from the next latent vector to obtain a difference vector, and the difference vector is used as the original vector corresponding to the current latent vector.
  • To make the result of the syntactic component analysis more accurate, the text to be analyzed can be divided into multiple text intervals, each of which includes at least one word segment to be used. From the latent vector to be used corresponding to each word segment to be used, the original vector corresponding to that word segment can be obtained, so that a more detailed analysis of the text to be analyzed can be performed based on the original vector of at least one word segment to be used.
  • The original vector corresponding to the current latent vector can be obtained based on the following formula: $r_{i,j} = h_j - h_i$
  • $r_{i,j}$ represents the original vector corresponding to the current latent vector
  • $h_j$ represents the next latent vector relative to the current latent vector
  • $h_i$ represents the current latent vector
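
The difference step described above can be illustrated with a minimal Python sketch. It assumes the latent vectors are already available as plain lists of floats; the function name `span_vector` and the toy values are hypothetical and stand in for real encoder output.

```python
from typing import List

Vector = List[float]


def span_vector(hidden: List[Vector], i: int, j: int) -> Vector:
    """Return the difference h_j - h_i for 0-based positions i < j."""
    if not 0 <= i < j < len(hidden):
        raise ValueError("expected 0 <= i < j < len(hidden)")
    return [later - earlier for earlier, later in zip(hidden[i], hidden[j])]


# Toy latent vectors (made-up values, one per word segment to be used).
hidden_states = [[1.0, 2.0], [0.5, 4.0], [3.0, 1.0]]
r_0_2 = span_vector(hidden_states, 0, 2)
```

Here `r_0_2` plays the role of the original vector for the text interval that starts at the first word segment and ends at the third.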
  • S120 Extract at least one to-be-used word from the text to be analyzed, and determine a to-be-used vector corresponding to the at least one to-be-used word.
  • The syntactic component analysis in this technical solution refines an existing syntactic analysis. That is, the original vector is the basis of the existing syntactic component analysis of the text to be analyzed, and this technical solution builds on the original vector corresponding to the text to be analyzed.
  • As a result, the syntactic component analysis of the text to be analyzed is more detailed. Because a vector corresponding to the word segment to be used is also involved in determining the original vector, the two are distinguished by name: the vector used when determining the original vector is called the latent vector to be used, and the vector used in the analysis introduced by this technical solution is called the vector to be used.
  • The vector to be used is the vector obtained by vectorizing the text to be analyzed with the vector processing method of this technical solution.
  • determining the vector to be used corresponding to at least one to-be-used word segment includes: respectively determining the word segment category corresponding to at least one to-be-used word segment; for each word segment category, performing vector processing on at least one to-be-used word segment in the current word segment category to obtain the vector to be used corresponding to each word segment category.
  • the word segmentation category can be understood as an N-tuple category, and the so-called N-tuple is a word block composed of continuous words.
  • For example, the text to be analyzed is "in the playground".
  • The text to be analyzed is segmented to obtain three to-be-used word segments, namely "in", "playground" and "on".
  • The text to be analyzed can then correspond to three different N-tuple categories, namely, unigram: "in", "playground" and "on"; bigram: "in playground" and "playground on"; trigram: "in playground on".
  • the to-be-used word segments in each N-tuple category are vectorized to obtain the corresponding to-be-used vectors.
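
As a rough sketch of how word segments can be grouped into N-tuple categories by length, the following Python enumerates every contiguous word block up to a chosen length. The function name `ngrams_by_length` and the English glosses used as word segments are illustrative assumptions, not part of the application.

```python
from typing import Dict, List, Tuple


def ngrams_by_length(words: List[str], max_n: int) -> Dict[int, List[Tuple[str, ...]]]:
    """Group all contiguous word blocks of length 1..max_n by their length."""
    groups: Dict[int, List[Tuple[str, ...]]] = {}
    for n in range(1, min(max_n, len(words)) + 1):
        # Slide a window of width n over the segmented text.
        groups[n] = [tuple(words[k:k + n]) for k in range(len(words) - n + 1)]
    return groups


# Illustrative word segments (glosses of a three-segment text).
words = ["in", "playground", "on"]
groups = ngrams_by_length(words, 3)
```

With three word segments, `groups` holds three unigrams, two bigrams and one trigram, matching the three categories described above.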
  • Performing vector processing on at least one to-be-used word segment in the current word segmentation category to obtain the to-be-used vector corresponding to each word segmentation category includes: based on an embedding function, performing vector processing on at least one to-be-used word segment in the current word segmentation category to obtain the to-be-used vector corresponding to each such word segment.
  • The embedding function determines the to-be-used vector corresponding to each to-be-used word segment based on a pre-built embedding matrix. This includes: retrieving the pre-built embedding matrix and determining the matrix mapping element corresponding to at least one to-be-used word segment in the current word segmentation category; and, based on each matrix mapping element, determining the to-be-used vector corresponding to the respective to-be-used word segment.
  • The matrix mapping element can be understood as the element of the embedding matrix that corresponds to the to-be-used word segment, and can be the row number of that word segment in the embedding matrix.
  • The pre-built embedding matrix may cover a large number of to-be-used word segments. The word segments are placed in order in the embedding matrix, and a corresponding matrix mapping element is generated for each.
  • Each to-be-used word segment corresponds to a unique vector in the embedding matrix. Therefore, given the pre-built embedding matrix and the matrix mapping element of a to-be-used word segment, the to-be-used vector corresponding to that word segment can be determined.
  • For example, the matrix mapping element corresponding to "playground" in the embedding matrix is "11", indicating that "playground" is in the 11th position in the embedding matrix; the unique vector at that position is the to-be-used vector corresponding to "playground".
  • In this way, the matrix mapping element of each to-be-used word segment in the pre-built embedding matrix can be determined, and the to-be-used vector of each word segment can be determined from the unique vector corresponding to its matrix mapping element.
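
The lookup through a matrix mapping element can be sketched as follows. This is a minimal illustration under stated assumptions: the class name `EmbeddingTable`, the toy vocabulary and the toy matrix values are hypothetical, and rows are numbered from 0 here rather than from 1.

```python
from typing import Dict, List

Vector = List[float]


class EmbeddingTable:
    """Maps each word segment to a row number, and each row to a vector."""

    def __init__(self, vocab: List[str], matrix: List[Vector]) -> None:
        if len(vocab) != len(matrix):
            raise ValueError("vocab and matrix must have the same number of rows")
        # Row number of each word segment: the 'matrix mapping element'.
        self.row_of: Dict[str, int] = {word: row for row, word in enumerate(vocab)}
        self.matrix = matrix

    def row_index(self, word: str) -> int:
        return self.row_of[word]

    def lookup(self, word: str) -> Vector:
        # The vector stored in that row is the vector to be used.
        return self.matrix[self.row_of[word]]


table = EmbeddingTable(
    vocab=["in", "playground", "on"],
    matrix=[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
)
playground_vector = table.lookup("playground")
```

Because every word segment owns exactly one row, the row number alone is enough to recover its vector.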
  • the vector to be concatenated can be used to concatenate with the original vector to obtain a target vector, so as to perform a more detailed syntactic component analysis on the text to be analyzed based on the target vector.
  • The text to be analyzed is divided into text intervals to obtain at least one text interval, that is, at least one word segmentation category. Each word segmentation category includes at least one to-be-used word segment, and each to-be-used word segment corresponds to a unique to-be-used vector.
  • The to-be-used weight corresponding to a to-be-used vector is consistent with the weight of the word segmentation category to which that vector belongs. For example, if the current word segmentation category includes 3 to-be-used word segments with different to-be-used vectors, and the weight value of the current word segmentation category is 0.2, then the to-be-used weights of all 3 to-be-used vectors are 0.2.
  • The number of to-be-used word segments in each word segmentation category may be one or more.
  • The weight corresponding to the current word segmentation category, that is, the to-be-used weight, can be determined based on the following formula:
  • Obtaining the vector to be spliced of the text to be analyzed includes: determining, according to each vector to be used and the original vector, the weight to be used corresponding to each vector to be used; and performing weighted averaging according to each vector to be used and its weight to be used, to obtain the vector to be spliced corresponding to the text to be analyzed.
  • The vector to be spliced can be obtained by the following formula:
  • The weighted average vectors of the N-tuples of all categories are concatenated to obtain a vector containing N-tuple information (i.e., the vector to be spliced):
  • $a_{i,j}$ represents the vector to be spliced
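
The weighting, averaging and concatenation steps can be sketched as below. The exact weight formula is not reproduced in this text, so the sketch assumes one plausible choice: a softmax over the dot products between the original vector and each vector to be used within a category. That choice, and the names `vector_to_splice`, `softmax` and `weighted_average`, are assumptions for illustration only.

```python
import math
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_average(vectors: List[Vector], weights: List[float]) -> Vector:
    # Weights sum to 1, so the weighted sum is a weighted average.
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def vector_to_splice(original: Vector, by_category: Dict[int, List[Vector]]) -> Vector:
    """Average the vectors of each category, then concatenate the averages."""
    spliced: Vector = []
    for category in sorted(by_category):
        vectors = by_category[category]
        # ASSUMPTION: weights come from a softmax over dot products.
        weights = softmax([dot(original, e) for e in vectors])
        spliced.extend(weighted_average(vectors, weights))
    return spliced


# Made-up vectors: two unigram vectors and one bigram vector.
toy = {1: [[1.0, 0.0], [3.0, 2.0]], 2: [[4.0, 4.0]]}
a_ij = vector_to_splice([0.0, 0.0], toy)
```

With a zero original vector every dot product is equal, so each category reduces to a plain mean; a non-zero original vector would shift weight toward the vectors most aligned with it.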
  • The target vector can be understood as the vector corresponding to the text to be analyzed that is obtained by concatenating the vector to be spliced with the original vector.
  • The target vector can be determined based on the following formula: $r'_{i,j} = a_{i,j} \oplus r_{i,j}$
  • $r'_{i,j}$ represents the target vector
  • $a_{i,j}$ represents the vector to be spliced
  • $r_{i,j}$ represents the original vector.
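
A short sketch of the splicing step follows. It assumes the vector to be spliced comes first and the original vector second; the text does not state the order explicitly, so the order, the function name `target_vector` and the toy values are assumptions.

```python
from typing import List

Vector = List[float]


def target_vector(to_be_spliced: Vector, original: Vector) -> Vector:
    """Concatenate the vector to be spliced with the original vector."""
    # ASSUMPTION: the vector to be spliced is placed before the original vector.
    return list(to_be_spliced) + list(original)


a_ij = [2.0, 1.0, 4.0, 4.0]  # made-up N-tuple information vector
r_ij = [2.0, -1.0]           # made-up original (difference) vector
r_prime_ij = target_vector(a_ij, r_ij)
```

The target vector's dimension is the sum of the two input dimensions, which is what the downstream syntactic analysis model has to accept as input.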
  • the vector to be concatenated is concatenated with the original vector to obtain a target vector, and text analysis is performed on the text to be analyzed based on the target vector, including: based on a pre-built encoder, the vector to be concatenated and the original vector are concatenated to obtain a target vector; and the target vector is input into a pre-built syntactic analysis model to analyze the text to be analyzed based on the syntactic analysis model.
  • Splicing the vector to be spliced with the original vector compensates for the relatively coarse analysis of the text to be analyzed in the related art, which leads to inaccurate analysis results. That is, on top of the vector representation of the text to be analyzed, this technical solution adds vector representation information corresponding to at least one to-be-used word segment, and combining the two yields more syntactic structure information about the text to be analyzed. Therefore, analyzing the target vector with the pre-built syntactic analysis model can produce more accurate analysis results.
  • In the technical solution of this embodiment, the text to be analyzed is obtained and the original vector corresponding to the text to be analyzed is determined.
  • The original vector can be obtained through the BERT model, so that the vector to be spliced obtained by this technical solution can be spliced with the original vector to obtain the target vector.
  • At least one to-be-used word segment is extracted from the text to be analyzed, the word segmentation category corresponding to each to-be-used word segment is determined, and the to-be-used vector corresponding to each to-be-used word segment is determined based on the embedding function.
  • The to-be-used weight of each to-be-used vector is determined according to the weight of its word segmentation category, and the vector to be spliced of the text to be analyzed is obtained according to each to-be-used vector and its to-be-used weight.
  • The vector to be spliced is spliced with the original vector to obtain the target vector, so that text analysis can be performed on the text to be analyzed based on the target vector.
  • The model used by this technical solution to analyze the text to be analyzed is shown in FIG. 2.
  • The text to be analyzed is first encoded: $h_1, \dots, h_n = \mathrm{BERT}(x_1, \dots, x_n)$
  • $h_i$ represents the latent vector to be used
  • $x_i$ represents the word segment to be used.
  • i, j and n are natural numbers, which are used to indicate the position of the latent vector to be used in the text vector and the position of the word segment to be used in the text to be analyzed.
  • The original vector is then obtained as a difference: $r_{i,j} = h_j - h_i$
  • $r_{i,j}$ represents the original vector corresponding to the current latent vector
  • $h_j$ represents the next latent vector relative to the current latent vector
  • $h_i$ represents the current latent vector
  • The dimension of the vector $o_{i,j}$ is equal to the number of syntactic component categories (such as noun phrase (NP), verb phrase (VP), prepositional phrase (PP), etc.). The value in each dimension of the vector is the score that the text interval $(x_i, x_j)$ belongs to a syntactic component category $l$, and the score is denoted as $s(i,j,l)$.
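
Reading a score vector can be sketched as follows. The label list, the toy scores and the function names `score` and `best_label` are hypothetical; the sketch only shows that each dimension of the vector holds the score for one syntactic component category.

```python
from typing import List, Tuple

# Illustrative category inventory; a real one would be much larger.
LABELS = ["NP", "VP", "PP"]


def score(scores: List[float], labels: List[str], label: str) -> float:
    """Return s(i, j, l): the score of one category for a text interval."""
    return scores[labels.index(label)]


def best_label(scores: List[float], labels: List[str]) -> Tuple[str, float]:
    """Return the highest-scoring category and its score."""
    if len(scores) != len(labels):
        raise ValueError("one score is required per category")
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best], scores[best]


o_ij = [0.1, 2.5, -1.0]  # made-up scores for one text interval
label, value = best_label(o_ij, LABELS)
```

Choosing the single best category per interval is a simplification; a full parser would combine the scores of many intervals into one consistent tree.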
  • This technical solution analyzes the text to be analyzed on the basis of the above-mentioned syntactic component analysis.
  • The text to be analyzed is divided into text intervals to obtain at least one text interval, and the word segmentation category corresponding to each text interval is determined, that is, the category is determined according to the number of word segments to be used.
  • All matching N-tuples in the text interval $(x_i, x_j)$ can be extracted based on an existing N-tuple vocabulary $N$ (that is, if an N-tuple in the vocabulary $N$ is a substring of the text interval $(x_i, x_j)$, then the N-tuple is extracted).
  • The N-tuples are extracted in order of length, and each N-tuple is mapped to the word segmentation category for its length.
  • Each extracted N-tuple is identified by its category $u$ and its position $v$ within that category (the v-th N-tuple of the u-th category), and the total number of N-tuples is recorded.
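
Extraction against an N-tuple vocabulary can be sketched as below. The function name `matching_ngrams`, the toy vocabulary and the use of N-tuple length as the category key are illustrative assumptions.

```python
from typing import Dict, List, Set, Tuple

NGram = Tuple[str, ...]


def matching_ngrams(words: List[str], i: int, j: int,
                    lexicon: Set[NGram]) -> Dict[int, List[NGram]]:
    """Collect lexicon N-tuples inside words[i..j], grouped by length."""
    span = words[i:j + 1]  # inclusive text interval (x_i, x_j), 0-based
    found: Dict[int, List[NGram]] = {}
    for n in range(1, len(span) + 1):
        for k in range(len(span) - n + 1):
            gram = tuple(span[k:k + n])
            if gram in lexicon:
                # Category u is the length n; position v is the list index.
                found.setdefault(n, []).append(gram)
    return found


words = ["in", "playground", "on"]
lexicon = {("playground",), ("in", "playground")}  # toy vocabulary
whole = matching_ngrams(words, 0, 2, lexicon)
tail = matching_ngrams(words, 1, 2, lexicon)
```

Only word blocks that appear in the vocabulary are kept, so shorter intervals naturally yield fewer N-tuples and fewer categories.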
  • For example, the text to be analyzed is "in the playground".
  • Three word segments to be used can be obtained, namely "in", "playground" and "on".
  • The text to be analyzed can correspond to three different N-tuple categories, namely, unigram: "in", "playground" and "on"; bigram: "in playground" and "playground on"; trigram: "in playground on".
  • Each N-tuple is mapped to an N-tuple embedding vector. In the pre-built embedding matrix, the row number (i.e., the matrix mapping element) corresponding to the N-tuple is looked up, and the vector in that row is used as the vector to be used corresponding to the word segment to be used.
  • The weight of the N-tuples of the current category, that is, the weight to be used, can be determined by the following formula:
  • The weighted average vector of the N-tuples of category u is calculated by the following formula:
  • The weighted average vectors of the N-tuples of all categories are concatenated to obtain a vector containing N-tuple information (i.e., the vector to be spliced):
  • $a_{i,j}$ represents the vector to be spliced
  • The vector to be spliced is concatenated with the original vector to obtain the target vector: $r'_{i,j} = a_{i,j} \oplus r_{i,j}$
  • $r'_{i,j}$ represents the target vector
  • $a_{i,j}$ represents the vector to be spliced
  • $r_{i,j}$ represents the original vector.
  • the syntactic component analysis result can be obtained by performing syntactic component analysis on the text to be analyzed based on the target vector.
  • This technical solution divides the text to be analyzed into multiple sub-text intervals, and determines N-tuples of the texts in the multiple text intervals respectively, and sets corresponding weights according to the influence of each N-tuple on the syntactic component analysis, so that when the text to be analyzed is analyzed based on each N-tuple, the granularity of the text analysis is finer and the analysis result of the text to be analyzed is more accurate.
  • The original vector corresponding to the text to be analyzed can be obtained through the BERT model, so that the vector to be spliced obtained by this technical solution can be spliced with the original vector to obtain the target vector.
  • At least one to-be-used word segment is extracted from the text to be analyzed, the word segmentation category corresponding to each to-be-used word segment is determined, and the to-be-used vector corresponding to each to-be-used word segment is determined based on the embedding function.
  • The to-be-used weight of each to-be-used vector is determined according to the weight of its word segmentation category, and the vector to be spliced of the text to be analyzed is obtained according to each to-be-used vector and its to-be-used weight.
  • The vector to be spliced is spliced with the original vector to obtain the target vector, so that text analysis can be performed on the text to be analyzed based on the target vector.
  • FIG. 3 is a schematic diagram of the structure of a text processing device provided in Embodiment 3 of the present application. As shown in FIG. 3, the device comprises: an original vector determination module 210, a to-be-used vector determination module 220, a to-be-spliced vector determination module 230, and a target vector determination module 240.
  • The original vector determination module 210 is configured to obtain the text to be analyzed and determine the original vector corresponding to the text to be analyzed;
  • The to-be-used vector determination module 220 is configured to extract at least one to-be-used word segment from the text to be analyzed and determine a to-be-used vector corresponding to the at least one to-be-used word segment;
  • The to-be-spliced vector determination module 230 is configured to obtain a to-be-spliced vector of the text to be analyzed according to each to-be-used vector and a to-be-used weight corresponding to each to-be-used vector;
  • The target vector determination module 240 is configured to splice the to-be-spliced vector with the original vector to obtain a target vector, so as to perform text analysis on the text to be analyzed based on the target vector.
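
The way the four modules hand data to one another can be sketched as a small pipeline. Everything here is a hypothetical illustration: the class `TextProcessingDevice`, its field names and the toy stand-in functions are not from the application, and the stand-ins replace the real encoder, embedding lookup and weighting.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class TextProcessingDevice:
    original_vector_module: Callable[[List[str]], Vector]                  # 210
    to_be_used_vector_module: Callable[[List[str]], List[Vector]]          # 220
    to_be_spliced_vector_module: Callable[[List[Vector], Vector], Vector]  # 230
    target_vector_module: Callable[[Vector, Vector], Vector]               # 240

    def process(self, words: List[str]) -> Vector:
        original = self.original_vector_module(words)
        to_be_used = self.to_be_used_vector_module(words)
        to_be_spliced = self.to_be_spliced_vector_module(to_be_used, original)
        return self.target_vector_module(to_be_spliced, original)


def mean_vector(vectors: List[Vector], original: Vector) -> Vector:
    # Toy stand-in: uniform weights, ignoring the original vector.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


device = TextProcessingDevice(
    original_vector_module=lambda words: [float(len(words))],
    to_be_used_vector_module=lambda words: [[float(len(w))] for w in words],
    to_be_spliced_vector_module=mean_vector,
    target_vector_module=lambda spliced, original: spliced + original,
)
target = device.process(["ab", "abcd"])
```

Passing the modules in as callables keeps the data flow visible while leaving each module free to be replaced by a real implementation.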
  • the original vector determination module 210 includes: a latent vector determination submodule, configured to perform vector processing on at least one to-be-used word segment in the text to be analyzed based on a language representation model, to obtain a to-be-used latent vector corresponding to at least one to-be-used word segment;
  • The original vector determination submodule is configured to obtain, for each latent vector to be used, the original vector corresponding to the text to be analyzed based on the difference between the next latent vector and the current latent vector.
  • the to-be-used vector determination module 220 includes: a segmentation category determination submodule, configured to respectively determine a segmentation category corresponding to at least one to-be-used segmentation, wherein the segmentation category includes at least one to-be-used segmentation;
  • the to-be-used vector determination submodule is configured to perform vector processing on at least one to-be-used word in the current word segmentation category for each word segmentation category, so as to obtain the to-be-used vector corresponding to each word segmentation category.
  • the submodule for determining the vector to be used includes: a unit for determining the vector to be used, which is configured to perform vector processing on at least one to-be-used word in the current word segmentation category based on an embedding function to obtain a vector to be used corresponding to at least one to-be-used word in the current word segmentation category.
  • the to-be-used vector determination unit includes: a mapping element determination subunit, configured to retrieve a pre-built embedding matrix and determine a matrix mapping element corresponding to at least one to-be-used segmentation word in the current segmentation category;
  • the to-be-used vector determination subunit is configured to determine the to-be-used vector corresponding to the to-be-used word in the current word segmentation category based on each matrix mapping element.
  • the vector to be spliced determining module 230 includes: a weight determining submodule, configured to determine a weight to be used corresponding to each vector to be used according to each vector to be used and the original vector;
  • the submodule for determining the vector to be spliced is configured to perform weighted average processing according to each vector to be used and the weight to be used corresponding to each vector to be used, so as to obtain the vector to be spliced corresponding to the text to be analyzed.
  • the target vector determination module 240 includes: a target vector determination submodule, configured to perform a splicing process on the vector to be spliced and the original vector based on a pre-built encoder to obtain a target vector;
  • the text analysis submodule is configured to input the target vector into a pre-built syntactic analysis model to analyze the text to be analyzed based on the syntactic analysis model.
  • the text processing device provided in the embodiments of the present application can execute the text processing method provided in any embodiment of the present application, and has the corresponding functional modules and effects of the execution method.
  • Fig. 4 shows a schematic diagram of the structure of an electronic device 10 of an embodiment of the present application.
  • The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
  • The electronic device can also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (such as helmets, glasses, watches, etc.) and other similar computing devices.
  • The components shown herein, their connections and relationships, and their functions are merely examples and are not intended to limit the implementation of the present application described and/or claimed herein.
  • the electronic device 10 includes at least one processor 11, and a memory connected to the at least one processor 11, such as a read-only memory (ROM) 12, a random access memory (RAM) 13, etc., wherein the memory stores a computer program that can be executed by at least one processor, and the processor 11 can perform a variety of appropriate actions and processes according to the computer program stored in the ROM 12 or the computer program loaded from the storage unit 18 to the RAM 13.
  • The RAM 13 can also store a variety of programs and data required for the operation of the electronic device 10.
  • the processor 11, the ROM 12, and the RAM 13 are connected to each other through a bus 14.
  • An input/output (I/O) interface 15 is also connected to the bus 14.
  • a number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16, such as a keyboard, a mouse, etc.; an output unit 17, such as various types of displays, speakers, etc.; a storage unit 18, such as a disk, an optical disk, etc.; and a communication unit 19, such as a network card, a modem, a wireless communication transceiver, etc.
  • the communication unit 19 allows the electronic device 10 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
  • the processor 11 may be a variety of general and/or special processing components with processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), a variety of dedicated artificial intelligence (AI) computing chips, a variety of processors running machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, etc.
  • the processor 11 performs the multiple methods and processes described above, such as a text processing method.
  • the text processing method may be implemented as a computer program, which is tangibly contained in a computer-readable storage medium, such as a storage unit 18.
  • part or all of the computer program may be loaded and/or installed on the electronic device 10 via the ROM 12 and/or the communication unit 19.
  • the processor 11 may be configured to perform the text processing method in any other suitable manner (e.g., by means of firmware).
  • Various embodiments of the systems and techniques described above herein may be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard parts (ASSPs), system on chip systems (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof.
  • These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • Computer programs for implementing the text processing methods of the present application may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device, so that when the computer program is executed by the processor, the functions/operations specified in the flow chart and/or block diagram are implemented.
  • The computer program may be executed entirely on the machine, partially on the machine, as a stand-alone software package partially on the machine and partially on a remote machine, or entirely on a remote machine or server.
  • a computer readable storage medium may be a tangible medium that may contain or store a computer program for use by or in conjunction with an instruction execution system, device, or apparatus.
  • A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
  • a computer readable storage medium may be a machine readable signal medium.
  • a machine readable storage medium includes an electrical connection based on one or more lines, a portable computer disk, a hard disk, a RAM, a ROM, an Erasable Programmable Read-Only Memory (EPROM), a flash memory, an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • the storage medium may be a non-transitory storage medium.
  • the systems and techniques described herein may be implemented on an electronic device having: a display device (e.g., a cathode ray tube (CRT) or a liquid crystal display (LCD) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) through which the user can provide input to the electronic device.
  • Other types of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form (including acoustic input, voice input, or tactile input).
  • the systems and techniques described herein may be implemented in a computing system that includes backend components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes frontend components (e.g., a user computer with a graphical user interface or a web browser through which a user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such backend components, middleware components, or frontend components.
  • the components of the system may be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Network (LAN), Wide Area Network (WAN), blockchain network, and the Internet.
  • a computing system may include a client and a server.
  • the client and the server are generally remote from each other and usually interact through a communication network.
  • the client and server relationship is generated by computer programs running on the respective computers and having a client-server relationship with each other.
  • The server may be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the difficult management and weak business scalability of traditional physical hosts and virtual private server (VPS) services.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present application provides a text processing method and apparatus, an electronic device and a storage medium. The text processing method comprises: obtaining a text to be analyzed, and determining an original vector corresponding to the text to be analyzed; extracting at least one segmented word to be used from the text to be analyzed, and determining a vector to be used corresponding to the at least one segmented word to be used; obtaining a vector to be spliced of the text to be analyzed according to each vector to be used and a weight to be used corresponding to each vector to be used; and splicing the vector to be spliced with the original vector to obtain a target vector, so as to perform text analysis on the text to be analyzed on the basis of the target vector.

Description

Text processing method, apparatus, electronic device and storage medium
This application claims priority to Chinese patent application No. 202211327875.3, filed with the China Patent Office on October 27, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of natural language processing, for example, to a text processing method, apparatus, electronic device and storage medium.
Background
Syntactic analysis of a text allows the text to be understood more comprehensively.
Syntactic analysis of text mostly relies on ever more powerful encoders and lacks analysis of the text representation itself. Results obtained this way tend to miss important information in the text; that is, the syntactic structure is not analyzed in sufficient detail, which may make the syntactic analysis results inaccurate.
Summary
The present application provides a text processing method, apparatus, electronic device and storage medium, to solve the problem that the syntactic component analysis results of a text are insufficiently accurate because the text is analyzed at too coarse a granularity.
An embodiment of the present application provides a text processing method, including:
obtaining a text to be analyzed, and determining an original vector corresponding to the text to be analyzed;
extracting at least one segmented word to be used from the text to be analyzed, and determining a vector to be used corresponding to the at least one segmented word to be used;
obtaining a vector to be spliced of the text to be analyzed according to each vector to be used and a weight to be used corresponding to each vector to be used; and
splicing the vector to be spliced with the original vector to obtain a target vector, so as to perform text analysis on the text to be analyzed based on the target vector.
An embodiment of the present application further provides a text processing apparatus, including:
an original vector determination module, configured to obtain a text to be analyzed and determine an original vector corresponding to the text to be analyzed;
a to-be-used vector determination module, configured to extract at least one segmented word to be used from the text to be analyzed, and determine a vector to be used corresponding to the at least one segmented word to be used;
a to-be-spliced vector determination module, configured to obtain a vector to be spliced of the text to be analyzed according to each vector to be used and a weight to be used corresponding to each vector to be used; and
a target vector determination module, configured to splice the vector to be spliced with the original vector to obtain a target vector, so as to perform text analysis on the text to be analyzed based on the target vector.
An embodiment of the present application further provides an electronic device, including:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor to enable the at least one processor to perform the text processing method described in any embodiment of the present application.
An embodiment of the present application further provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the text processing method described in any embodiment of the present application.
Brief Description of the Drawings
FIG. 1 is a flowchart of a text processing method provided according to Embodiment 1 of the present application;
FIG. 2 is a schematic diagram of a model structure for text processing provided according to Embodiment 2 of the present application;
FIG. 3 is a schematic structural diagram of a text processing apparatus provided according to Embodiment 3 of the present application;
FIG. 4 is a schematic structural diagram of an electronic device implementing the text processing method of an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below with reference to the drawings in the embodiments. The described embodiments are only some of the embodiments of the present application.
The terms "first", "second" and the like in the specification, the claims and the above drawings of the present application are used to distinguish similar objects and do not necessarily describe a particular order or sequence. Data used in this way are interchangeable where appropriate, so that the embodiments described herein can be implemented in orders other than those illustrated or described herein.
Embodiment 1
FIG. 1 is a flowchart of a text processing method provided in Embodiment 1 of the present application. This embodiment is applicable to cases where the syntactic components of a text need to be analyzed in a more detailed and accurate manner. The method may be performed by a text processing apparatus, which may be implemented in hardware and/or software and may be configured in a computing device capable of executing the text processing method.
As shown in FIG. 1, the method includes the following steps.
S110: Obtain a text to be analyzed, and determine an original vector corresponding to the text to be analyzed.
The text to be analyzed can be understood as a text on which syntactic component analysis is to be performed. The original vector can be understood as the vector obtained by vectorizing the text to be analyzed; for example, the text may be vectorized by a language representation model to obtain the original vector.
In practical applications, syntactic component analysis is a foundational task in natural language processing; operations such as opinion extraction or sentiment analysis can be built on it. For a text with a simple component structure, the syntactic component information can usually be obtained fairly accurately. For a text with a more complex structure, syntactic analysis is harder, and important information in the text may be missed. For example, the text can be vectorized and the start vector subtracted from the end vector of the text to obtain its syntactic component information, but this approach is rather coarse, and it is difficult to obtain accurate syntactic component information from the text in this way.
A text to be analyzed that requires syntactic component analysis is obtained, and the original vector corresponding to it is determined. Optionally, determining the original vector corresponding to the text to be analyzed includes: performing vector processing on at least one segmented word to be used in the text to be analyzed based on a language representation model, to obtain a hidden vector to be used corresponding to the at least one segmented word to be used; and, for each hidden vector to be used, obtaining the original vector corresponding to the text to be analyzed based on the difference between the subsequent hidden vector relative to the current hidden vector and the current hidden vector.
The language representation model BERT (Bidirectional Encoder Representations from Transformers) has strong language representation and feature extraction capabilities. In this technical solution, features may be extracted from the text to be analyzed based on the BERT model, and the original vector corresponding to the text to be analyzed may be generated. The text to be analyzed includes at least one segmented word; in this technical solution, each of them is called a segmented word to be used. Vectorizing each segmented word to be used yields a corresponding hidden vector to be used, and the original vector corresponding to the text to be analyzed is obtained based on the at least one hidden vector to be used.
In practical applications, the text to be analyzed is segmented to obtain at least one segmented word to be used, and each segmented word to be used is encoded by the BERT model to obtain the corresponding hidden vector to be used. Splicing the hidden vectors to be used yields the text vector corresponding to the text to be analyzed, which can be determined by the following formula:
$h_1 \ldots h_i \ldots h_j \ldots h_n = \mathrm{BERT}(x_1 \ldots x_i \ldots x_j \ldots x_n)$
where $h_i$ denotes a hidden vector to be used and $x_i$ denotes a segmented word to be used; i, j and n are natural numbers indicating the position of the hidden vector to be used in the text vector and the position of the segmented word to be used in the text to be analyzed.
As described above, the text vector includes at least one hidden vector to be used. For each hidden vector to be used, the difference between the subsequent hidden vector and the current hidden vector gives a difference vector, which is taken as the original vector corresponding to the current hidden vector. In this technical solution, to make the result of the syntactic component analysis more accurate, the text to be analyzed may be divided into multiple text intervals (spans), each containing at least one segmented word to be used. From the hidden vector to be used of each segmented word to be used, the original vector corresponding to that segmented word can be obtained, so that the text to be analyzed can be analyzed in finer detail based on the original vectors of the segmented words to be used.
Taking one of the hidden vectors to be used as the current hidden vector, the corresponding original vector can be obtained by the following formula:
$r_{i,j} = h_j - h_i$
where $r_{i,j}$ denotes the original vector corresponding to the current hidden vector, $h_j$ denotes the subsequent hidden vector relative to the current hidden vector, and $h_i$ denotes the current hidden vector.
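The difference formula above can be illustrated with a minimal NumPy sketch. It assumes the hidden vectors are already available as rows of an array (the toy values stand in for BERT outputs and are hypothetical), and it assumes that every pair of positions i < j forms a candidate span; the function name `span_vectors` is illustrative and not taken from the application.

```python
import numpy as np

def span_vectors(hidden):
    """Map each position pair (i, j) with i < j to the difference h_j - h_i."""
    n = hidden.shape[0]
    return {
        (i, j): hidden[j] - hidden[i]
        for i in range(n)
        for j in range(i + 1, n)
    }

# Toy hidden vectors, one row per segmented word (hypothetical values).
hidden = np.array([[1.0, 0.0],
                   [2.0, 1.0],
                   [4.0, 3.0]])
spans = span_vectors(hidden)
```

Each value in `spans` plays the role of an original vector $r_{i,j}$ for one text interval.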
S120: Extract at least one segmented word to be used from the text to be analyzed, and determine the vector to be used corresponding to the at least one segmented word to be used.
In this technical solution, the syntactic component analysis of the text to be analyzed is a refinement built on an existing syntactic analysis. That is, the original vector reflects the result of a syntactic component analysis of the text to be analyzed, and this technical solution analyzes the syntactic components of the text in finer detail on the basis of that original vector. Because vectors corresponding to the segmented words to be used are also used when determining the original vector, the two are named differently for ease of distinction: the vector corresponding to a segmented word when determining the original vector is called the hidden vector to be used, and the vector corresponding to a segmented word when analyzing according to this technical solution is called the vector to be used.
The vector to be used is the vector obtained by vectorizing the text to be analyzed with the vector processing method of this technical solution.
When analyzing the text to be analyzed, the vector to be used corresponding to at least one segmented word to be used in the text needs to be determined. In this technical solution, determining the vector to be used corresponding to the at least one segmented word to be used includes: determining the word segment category corresponding to each of the at least one segmented word to be used; and, for each word segment category, performing vector processing on the at least one segmented word to be used in the current word segment category, to obtain the vector to be used corresponding to each word segment category.
In this technical solution, a word segment category can be understood as an N-gram (N-tuple) category, where an N-gram is a chunk made up of consecutive words. For example, if the text to be analyzed is "在操场上" ("on the playground"), segmenting it yields three segmented words to be used: "在", "操场" and "上". The text then corresponds to three different N-gram categories: unigrams "在", "操场" and "上"; bigrams "在操场" and "操场上"; and the trigram "在操场上". Performing vector processing on the segmented words to be used in each N-gram category yields the corresponding vectors to be used.
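The grouping into N-gram categories can be sketched in a few lines of Python. This is only an illustration under the assumption that the text has already been segmented into a token list; the function name `ngrams_by_category` and the `max_n` parameter are hypothetical.

```python
def ngrams_by_category(tokens, max_n=None):
    """Group all contiguous N-grams of `tokens` by their length N."""
    max_n = max_n or len(tokens)
    return {
        n: ["".join(tokens[k:k + n]) for k in range(len(tokens) - n + 1)]
        for n in range(1, max_n + 1)
    }

# The segmented example from the text: "在操场上" -> "在", "操场", "上".
categories = ngrams_by_category(["在", "操场", "上"])
```

Here `categories[1]` holds the unigrams, `categories[2]` the bigrams, and `categories[3]` the single trigram.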
In this technical solution, taking the current word segment category as an example, performing vector processing on at least one segmented word to be used in the current word segment category to obtain the vector to be used corresponding to each word segment category includes: performing vector processing on each of the at least one segmented word to be used in the current word segment category based on an embedding function, to obtain the vector to be used corresponding to each segmented word to be used in the current word segment category.
In this technical solution, the embedding function can determine the vector to be used corresponding to each segmented word to be used based on a pre-built embedding matrix. Performing vector processing based on the embedding function includes: retrieving the pre-built embedding matrix and determining the matrix mapping element corresponding to each segmented word to be used in the current word segment category; and, based on each matrix mapping element, determining the vector to be used corresponding to the respective segmented word to be used in the current word segment category.
A matrix mapping element can be understood as the element of the embedding matrix that corresponds to a segmented word to be used; it may be the row index in the embedding matrix corresponding to that segmented word.
For example, the pre-built embedding matrix may cover a large number of segmented words to be used, which are placed in the embedding matrix in order, with a corresponding matrix mapping element generated for each. Each segmented word to be used corresponds to a unique vector in the embedding matrix. Therefore, given the pre-built embedding matrix and the matrix mapping element of a segmented word to be used, the vector to be used corresponding to that segmented word can be determined. For instance, if the matrix mapping element of "操场" ("playground") in the embedding matrix is "11", then "操场" is at the 11th position in the embedding matrix, and the unique vector corresponding to that matrix mapping element is the vector to be used corresponding to "操场".
In other words, in this technical solution, to determine the vector to be used corresponding to each segmented word to be used, the matrix mapping element of each segmented word in the pre-built embedding matrix can be determined, and the vector to be used is then determined from the unique vector corresponding to that matrix mapping element.
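The lookup described above amounts to indexing a row of a matrix. The following sketch assumes a small hypothetical vocabulary and a randomly initialized embedding matrix; in practice the matrix would be pre-built or learned, and the names `vocab`, `embedding_matrix` and `embed` are illustrative only.

```python
import numpy as np

# Hypothetical vocabulary: each N-gram maps to its row index
# (the "matrix mapping element") in the embedding matrix.
vocab = {"在": 0, "操场": 1, "上": 2, "在操场": 3, "操场上": 4, "在操场上": 5}

# Stand-in for a pre-built embedding matrix: one 4-dimensional row per N-gram.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), 4))

def embed(ngram):
    """Return the vector to be used for `ngram` via its row index."""
    return embedding_matrix[vocab[ngram]]

vector = embed("操场")
```

Because each N-gram has exactly one row index, the lookup returns a unique vector per N-gram, as the description requires.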
S130: Obtain the vector to be spliced of the text to be analyzed according to each vector to be used and the weight to be used corresponding to each vector to be used.
The vector to be spliced is spliced with the original vector to obtain the target vector, so that a more detailed syntactic component analysis can be performed on the text to be analyzed based on the target vector.
In this technical solution, when the text to be analyzed is analyzed, it is divided into text intervals to obtain at least one text interval, that is, at least one word segment category. Each word segment category includes at least one segmented word to be used, and each segmented word to be used corresponds to a unique vector to be used. The weight to be used corresponding to a vector to be used is consistent with the weight of the word segment category to which that vector belongs. For example, if the current word segment category includes 3 segmented words to be used, each corresponding to a different vector to be used, and the weight of the current word segment category is 0.2, then the weights to be used of these 3 vectors to be used are all 0.2.
In practical applications, each word segment category may contain one or more segmented words to be used. Taking the current word segment category as an example, the weight corresponding to the current word segment category, that is, the weight to be used, can be determined by the following formula:
Figure PCTCN2022134592-appb-000001
where the symbol shown in Figure PCTCN2022134592-appb-000002 denotes the weight to be used, exp denotes the exponential function with the natural constant e as its base, $r_{i,j}$ denotes the original vector, the symbol shown in Figure PCTCN2022134592-appb-000003 denotes the vector to be used of an N-gram, the symbol shown in Figure PCTCN2022134592-appb-000004 denotes the number of N-grams, u denotes the u-th word segment category, and v denotes the v-th segmented word to be used in that word segment category.
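The formula image itself is not available here, but the definitions above (an exponential of a term involving the original vector and the N-gram vector, normalized over the N-grams of a category) are consistent with a softmax over inner products. The sketch below implements that reading as an assumption, not as the application's definitive formula; the function name `category_weights` and the toy values are hypothetical.

```python
import numpy as np

def category_weights(r, E):
    """Softmax over the inner products of r with each row of E.

    r: original vector for one text interval, shape (d,).
    E: vectors to be used of one N-gram category, shape (m, d).
    """
    scores = E @ r
    scores = scores - scores.max()   # subtract the max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

r = np.array([1.0, 0.0])
E = np.array([[1.0, 0.0],
              [1.0, 5.0],
              [0.0, 2.0]])
w = category_weights(r, E)
```

Under this reading, N-grams whose vectors align more closely with the original vector receive larger weights, and the weights within one category sum to 1.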
Obtaining the vector to be spliced of the text to be analyzed according to each vector to be used and the weight to be used corresponding to each vector to be used includes: determining the weight to be used corresponding to each vector to be used according to that vector to be used and the original vector; and performing weighted averaging over the vectors to be used with their corresponding weights to be used, to obtain the vector to be spliced corresponding to the text to be analyzed.
The vector to be spliced can be obtained by the following formulas:
Determine the weighted average vector corresponding to each N-gram (the symbol shown in Figure PCTCN2022134592-appb-000005):
Figure PCTCN2022134592-appb-000006
where the symbol shown in Figure PCTCN2022134592-appb-000007 denotes the weighted average vector of the N-grams, the symbol shown in Figure PCTCN2022134592-appb-000008 denotes the weight to be used, the symbol shown in Figure PCTCN2022134592-appb-000009 denotes the vector to be used, and · is the vector inner product symbol.
The weighted average vectors of the N-tuples of all categories are concatenated to obtain a vector containing the N-tuple information (i.e., the vector to be concatenated):
$$a_{i,j} = \bigoplus_{u} a^{(u)}_{i,j}$$
where $a_{i,j}$ denotes the vector to be concatenated, $\oplus$ is the vector concatenation symbol, and $a^{(u)}_{i,j}$ denotes the weighted average vector of the N-tuples of category u.
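The two steps above (a weighted average within each category, then concatenation across categories) can be sketched as follows. This is an illustrative sketch, not the applicant's implementation: the function names, the list-of-floats representation, and the assumption that every category contains at least one N-tuple are hypothetical, and the weights are taken as given inputs.

```python
def weighted_average(weights, embeddings):
    """Sum of weight * vector over the N-tuples of one category."""
    dim = len(embeddings[0])
    return [sum(w * e[k] for w, e in zip(weights, embeddings)) for k in range(dim)]


def concat_categories(per_category):
    """per_category: list of (weights, embeddings) pairs, one pair per category."""
    a = []
    for weights, embeddings in per_category:
        a.extend(weighted_average(weights, embeddings))  # concatenate in category order
    return a


a = concat_categories([
    ([0.25, 0.75], [[4.0, 0.0], [0.0, 4.0]]),  # category 1: two N-tuples
    ([1.0], [[2.0, 2.0]]),                     # category 2: one N-tuple
])
```

With two categories of two-dimensional vectors, the resulting vector to be concatenated has four components.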
S140: Concatenate the vector to be concatenated with the original vector to obtain a target vector, so that text analysis is performed on the text to be analyzed based on the target vector.
The target vector can be understood as the vector corresponding to the text to be analyzed that is obtained by concatenation based on each vector to be used.
The target vector can be determined based on the following formula:
$$r'_{i,j} = a_{i,j} \oplus r_{i,j}$$
where $r'_{i,j}$ denotes the target vector, $a_{i,j}$ denotes the vector to be concatenated, $r_{i,j}$ denotes the original vector, and $\oplus$ is the vector concatenation symbol.
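As a consistency check on the concatenation step, the vector dimensions can be tallied. This is a sketch under assumptions that the application does not state: $d$ is the dimension of the original vector $r_{i,j}$, $n$ is the number of N-tuple categories, and $a^{(u)}_{i,j}$ is written for the weighted average vector of category u (all three symbols are introduced here for illustration). Because each weight is computed from the inner product of $r_{i,j}$ with a vector to be used, every vector to be used must also have dimension $d$, which gives:
$$a^{(u)}_{i,j} \in \mathbb{R}^{d}, \qquad a_{i,j} \in \mathbb{R}^{nd}, \qquad r'_{i,j} = a_{i,j} \oplus r_{i,j} \in \mathbb{R}^{(n+1)d}$$
Under these assumptions, any component that consumes the target vector takes an input of width $(n+1)d$ rather than $d$.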
Optionally, concatenating the vector to be concatenated with the original vector to obtain the target vector, so as to perform text analysis on the text to be analyzed based on the target vector, includes: concatenating the vector to be concatenated and the original vector based on a pre-built encoder to obtain the target vector; and inputting the target vector into a pre-built syntactic analysis model, so as to analyze the text to be analyzed based on the syntactic analysis model.
Concatenating the vector obtained by processing the text to be analyzed with this technical solution onto the original vector remedies the problem in the related art that the analysis of the text to be analyzed is relatively coarse and the analysis results are therefore insufficiently accurate. That is, on top of the vector representation of the text to be analyzed, this technical solution adds the vector representation information corresponding to at least one word segment to be used; combining the two yields more syntactic structure information about the text to be analyzed. Therefore, analyzing the target vector with the pre-built syntactic analysis model can give a more accurate analysis result.
In the technical solution of this embodiment of the present application, the text to be analyzed is obtained and the original vector corresponding to it is determined; the original vector can be obtained through a BERT model, so that the vector to be concatenated obtained by this technical solution can be concatenated with it to obtain the target vector. At least one word segment to be used is extracted from the text to be analyzed, and the vector to be used corresponding to the at least one word segment to be used is determined: the word segment category corresponding to each word segment to be used is determined, and the corresponding vector to be used is determined based on an embedding function. The vector to be concatenated of the text to be analyzed is obtained according to each vector to be used and its corresponding weight to be used: the weight to be used of each vector to be used can be determined from the weight corresponding to each word segment category, and the vector to be concatenated is then obtained from the vectors to be used and their weights. The vector to be concatenated is concatenated with the original vector to obtain the target vector, and text analysis is performed on the text to be analyzed based on the target vector. This solves the problem that coarse-grained text analysis makes the syntactic component analysis results of the text insufficiently accurate, and achieves accurate analysis of the syntactic component structure of the text.
Embodiment 2
In one example, the model with which this technical solution analyzes the text to be analyzed is shown in FIG. 2, taking the text to be analyzed "并且在操场上踢球" ("and play football on the playground") as an example. Syntactic component analysis of a text to be analyzed usually adopts a method based on a graph structure: an encoder, such as a BERT model, encodes the text to be analyzed $x = x_1 \ldots x_i \ldots x_j \ldots x_n$, which contains n word segments to be used, to obtain the corresponding hidden vectors (where the hidden vector to be used of the i-th word segment is $h_i$), according to the following formula:
$$h_1 \ldots h_i \ldots h_j \ldots h_n = \mathrm{BERT}(x_1 \ldots x_i \ldots x_j \ldots x_n)$$
where $h_i$ denotes a hidden vector to be used and $x_i$ denotes a word segment to be used; i, j and n are natural numbers that indicate the position of a hidden vector to be used within the text vector and the position of a word segment to be used within the text to be analyzed.
The vector representation $r_{i,j}$ of each text interval $(x_i, x_j) = x_i \ldots x_{j-1}$ can be obtained by the following formula:
$$r_{i,j} = h_j - h_i$$
where $r_{i,j}$ denotes the original vector corresponding to the current hidden vector, $h_j$ denotes the later hidden vector relative to the current hidden vector, and $h_i$ denotes the current hidden vector.
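The interval representation above is a plain vector difference, which can be enumerated for every pair of positions. A minimal Python sketch, assuming 0-based positions and a hypothetical list `hidden` holding one hidden vector per position; the function name `span_vectors` is illustrative only.

```python
def span_vectors(hidden):
    """Return {(i, j): h_j - h_i} for every pair of positions with i < j."""
    spans = {}
    for i in range(len(hidden)):
        for j in range(i + 1, len(hidden)):
            spans[(i, j)] = [hj - hi for hi, hj in zip(hidden[i], hidden[j])]
    return spans


# Toy hidden vectors for three positions.
hidden = [[0.0, 0.0], [1.0, 2.0], [3.0, 5.0]]
spans = span_vectors(hidden)
```

For n positions this produces n(n-1)/2 interval vectors, one per candidate text interval.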
Two fully connected layers can be used to map $r_{i,j}$ to a vector $o_{i,j}$ (the matrix $W_1$ and bias vector $b_1$ are the parameters of the first fully connected layer, the matrix $W_2$ and bias vector $b_2$ are the parameters of the second fully connected layer, and ReLU is the activation function):
$$o_{i,j} = W_2 \cdot \mathrm{ReLU}(W_1 \cdot r_{i,j} + b_1) + b_2$$
where the dimension of the vector $o_{i,j}$ equals the number of syntactic component categories (for example, noun phrase (NP), verb phrase (VP), prepositional phrase (PP), etc.). The value in each dimension of the vector is the score that the text interval $(x_i, x_j)$ belongs to the corresponding syntactic component category l, denoted s(i, j, l).
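The two-layer scoring step can be sketched in pure Python as follows. The parameter values, the label order `LABELS`, and the helper names are hypothetical and chosen only so the arithmetic is easy to follow; a real system would learn $W_1$, $b_1$, $W_2$ and $b_2$ during training.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(x):
    return [max(0.0, v) for v in x]


def span_scores(r, W1, b1, W2, b2):
    """o = W2 * ReLU(W1 * r + b1) + b2, with one score per category."""
    hidden = relu([v + b for v, b in zip(matvec(W1, r), b1)])
    return [v + b for v, b in zip(matvec(W2, hidden), b2)]


LABELS = ["NP", "VP", "PP"]  # assumed category order, for illustration
W1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
W2, b2 = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]], [0.5, 0.0, -1.0]
o = span_scores([1.0, -2.0], W1, b1, W2, b2)
s = dict(zip(LABELS, o))  # s[l] plays the role of s(i, j, l) for one interval
```

Here the negative component of the input is zeroed by ReLU, so only the first hidden unit contributes to the three category scores.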
The scores s(i, j, l) of all text intervals of the text to be analyzed are input into the Cocke–Younger–Kasami (CYK) algorithm, which computes the highest-scoring, optimal valid syntax tree.
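The decoding step can be illustrated with a small CYK-style dynamic program. This sketch rests on assumptions the application does not spell out: the tree is binary, the score of a tree is the sum of the scores of its intervals, and each interval keeps only its best-scoring category. The function name `cky_decode`, the toy scores and the labels are hypothetical.

```python
def cky_decode(n, best_label):
    """Return (total_score, tree) for the best binary tree over tokens 0..n-1.

    best_label(i, j) returns (label, score) for the interval covering
    tokens i..j-1, i.e. the arg-max over l of s(i, j, l).
    """
    best, back = {}, {}
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            label, score = best_label(i, j)
            if length == 1:
                best[(i, j)], back[(i, j)] = score, (label, None)
                continue
            # Choose the split point whose two sub-intervals score highest.
            split = max(range(i + 1, j),
                        key=lambda k: best[(i, k)] + best[(k, j)])
            best[(i, j)] = score + best[(i, split)] + best[(split, j)]
            back[(i, j)] = (label, split)

    def build(i, j):
        label, split = back[(i, j)]
        if split is None:
            return (label, i, j)
        return (label, build(i, split), build(split, j))

    return best[(0, n)], build(0, n)


# Toy best-category scores for a three-token text.
toy = {
    (0, 1): ("P", 1.0), (1, 2): ("NP", 1.0), (2, 3): ("LC", 1.0),
    (0, 2): ("PP", 0.5), (1, 3): ("LCP", 2.0), (0, 3): ("PP", 3.0),
}
total, tree = cky_decode(3, lambda i, j: toy[(i, j)])
```

In the toy input the interval (1, 3) outscores (0, 2), so the decoder splits the full interval after the first token.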
This technical solution analyzes the text to be analyzed on the basis of the above syntactic component analysis. The text to be analyzed is divided into at least one text interval, and the word segment category corresponding to the at least one text interval is determined; that is, the word segment category is determined according to the number of word segments to be used. In practical applications, all matching N-tuples in a text interval $(x_i, x_j)$ can be extracted according to an existing N-tuple lexicon $\mathcal{N}$ (that is, if an N-tuple in the lexicon $\mathcal{N}$ is a substring of the text interval $(x_i, x_j)$, that N-tuple is extracted). According to the length of each extracted N-tuple, each N-tuple is assigned to a different word segment category. The v-th N-tuple belonging to the u-th category is denoted $c_{u,v}$, and the u-th category contains $m_u$ N-tuples in total.
For example, suppose the text to be analyzed is "在操场上" ("on the playground"). Segmenting it yields three word segments to be used: "在", "操场" ("playground") and "上". The text can then correspond to three different categories of N-tuples: the 1-tuples "在", "操场" and "上"; the 2-tuples "在操场" and "操场上"; and the 3-tuple "在操场上".
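The extraction and grouping in this example can be sketched as a lexicon lookup over contiguous runs of word segments. This is a minimal illustration under assumptions: the category of an N-tuple is its length, the lexicon is a set of tuples of word segments, and the function name `extract_ntuples` and the parameter `max_n` are hypothetical.

```python
def extract_ntuples(words, lexicon, max_n=3):
    """Group the lexicon N-tuples found in `words` by their length n."""
    categories = {n: [] for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for start in range(len(words) - n + 1):
            gram = tuple(words[start:start + n])
            # Keep an N-tuple once, and only if the lexicon contains it.
            if gram in lexicon and gram not in categories[n]:
                categories[n].append(gram)
    return categories


words = ["在", "操场", "上"]
lexicon = {
    ("在",), ("操场",), ("上",),
    ("在", "操场"), ("操场", "上"),
    ("在", "操场", "上"),
}
categories = extract_ntuples(words, lexicon)
```

With this toy lexicon the three categories hold three, two and one N-tuples respectively.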
Based on the embedding function, each N-tuple $c_{u,v}$ is mapped to an N-tuple embedding vector $e_{u,v}$: the row index corresponding to $c_{u,v}$ in a pre-built embedding matrix (i.e., the matrix mapping element) is looked up, and the vector in that row is extracted as the vector to be used corresponding to the word segment to be used.
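The embedding lookup described above is a row selection from a matrix. A minimal Python sketch, assuming a hypothetical lexicon order that fixes the row index of each N-tuple; the names `build_index` and `embed` and the numeric values are illustrative only.

```python
def build_index(lexicon):
    """Assign each N-tuple the row number it occupies in the embedding matrix."""
    return {ntuple: row for row, ntuple in enumerate(lexicon)}


def embed(ntuple, index, embedding_matrix):
    """Return the embedding-matrix row that corresponds to the N-tuple."""
    return embedding_matrix[index[ntuple]]


lexicon = [("在",), ("操场",), ("在", "操场")]
embedding_matrix = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # one row per N-tuple
index = build_index(lexicon)
vector = embed(("操场",), index, embedding_matrix)
```

In a trained model the rows of the embedding matrix would be learned parameters rather than fixed values.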
For the N-tuples in category u, the weight $a^{(u)}_{i,j,v}$ of each N-tuple of the current category, that is, the weight to be used, can be determined by the following formula:
$$a^{(u)}_{i,j,v} = \frac{\exp\left(r_{i,j} \cdot e_{u,v}\right)}{\sum_{v'=1}^{m_u} \exp\left(r_{i,j} \cdot e_{u,v'}\right)}$$
where $a^{(u)}_{i,j,v}$ denotes the weight to be used, exp denotes the exponential function with the natural constant e as its base, $r_{i,j}$ denotes the original vector, $e_{u,v}$ denotes the vector to be used of the N-tuple, and $m_u$ denotes the number of N-tuples in the category.
The weighted average vector $a^{(u)}_{i,j}$ of the N-tuples of category u is calculated by the following formula:
$$a^{(u)}_{i,j} = \sum_{v=1}^{m_u} a^{(u)}_{i,j,v} \cdot e_{u,v}$$
where $a^{(u)}_{i,j}$ denotes the weighted average vector of the N-tuples, $a^{(u)}_{i,j,v}$ denotes the weight to be used, $e_{u,v}$ denotes the vector to be used, $m_u$ denotes the number of N-tuples in category u, and · is the vector inner product symbol.
The weighted average vectors of the N-tuples of all categories are concatenated to obtain a vector containing the N-tuple information (i.e., the vector to be concatenated):
$$a_{i,j} = \bigoplus_{u} a^{(u)}_{i,j}$$
where $a_{i,j}$ denotes the vector to be concatenated, $\oplus$ is the vector concatenation symbol, and $a^{(u)}_{i,j}$ denotes the weighted average vector of the N-tuples of category u.
Based on the following formula, the vector to be concatenated is concatenated with the original vector to obtain the target vector:
$$r'_{i,j} = a_{i,j} \oplus r_{i,j}$$
where $r'_{i,j}$ denotes the target vector, $a_{i,j}$ denotes the vector to be concatenated, $r_{i,j}$ denotes the original vector, and $\oplus$ is the vector concatenation symbol.
Performing syntactic component analysis on the text to be analyzed based on the target vector yields the syntactic component analysis result.
This technical solution divides the text to be analyzed into multiple sub-text intervals, determines the N-tuples of the text in each interval, and sets a corresponding weight according to the influence of each N-tuple on the syntactic component analysis, so that when the text to be analyzed is analyzed based on each N-tuple, the analysis is finer-grained and its results are more accurate.
In the technical solution of this embodiment of the present application, the text to be analyzed is obtained and the original vector corresponding to it is determined; the original vector can be obtained through a BERT model, so that the vector to be concatenated obtained by this technical solution can be concatenated with it to obtain the target vector. At least one word segment to be used is extracted from the text to be analyzed, and the vector to be used corresponding to the at least one word segment to be used is determined: the word segment category corresponding to each word segment to be used is determined, and the corresponding vector to be used is determined based on an embedding function. The vector to be concatenated of the text to be analyzed is obtained according to each vector to be used and its corresponding weight to be used: the weight to be used of each vector to be used can be determined from the weight corresponding to each word segment category, and the vector to be concatenated is then obtained from the vectors to be used and their weights. The vector to be concatenated is concatenated with the original vector to obtain the target vector, and text analysis is performed on the text to be analyzed based on the target vector. This solves the problem that coarse-grained text analysis makes the syntactic component analysis results of the text insufficiently accurate, and achieves accurate analysis of the syntactic component structure of the text.
Embodiment 3
FIG. 3 is a schematic structural diagram of a text processing device provided in Embodiment 3 of the present application. As shown in FIG. 3, the device includes: an original vector determination module 210, a to-be-used vector determination module 220, a to-be-concatenated vector determination module 230, and a target vector determination module 240.
The original vector determination module 210 is configured to obtain the text to be analyzed and determine the original vector corresponding to the text to be analyzed.
The to-be-used vector determination module 220 is configured to extract at least one word segment to be used from the text to be analyzed and determine the vector to be used corresponding to the at least one word segment to be used.
The to-be-concatenated vector determination module 230 is configured to obtain the vector to be concatenated of the text to be analyzed according to each vector to be used and the weight to be used corresponding to each vector to be used.
The target vector determination module 240 is configured to concatenate the vector to be concatenated with the original vector to obtain the target vector, so as to perform text analysis on the text to be analyzed based on the target vector.
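The division into four modules can be pictured as a small pipeline object. The sketch below is purely illustrative and makes several assumptions: the class name `TextProcessingDevice`, its method names, and the toy callables are hypothetical; each module is injected as a plain function; and the stand-in for module 230 uses a uniform average instead of the weighted average described in the method embodiments.

```python
class TextProcessingDevice:
    """Chains the four modules; each stage is injected as a plain callable."""

    def __init__(self, original_vector, vectors_to_use, vector_to_concatenate):
        self.original_vector = original_vector              # module 210
        self.vectors_to_use = vectors_to_use                # module 220
        self.vector_to_concatenate = vector_to_concatenate  # module 230

    def target_vector(self, text):                          # module 240
        r = self.original_vector(text)
        vectors = self.vectors_to_use(text)
        a = self.vector_to_concatenate(r, vectors)
        return a + r  # list concatenation: vector to be concatenated, then original


# Toy stand-ins, for illustration only.
device = TextProcessingDevice(
    original_vector=lambda text: [float(len(text)), 1.0],
    vectors_to_use=lambda text: [[1.0, 0.0], [3.0, 2.0]],
    vector_to_concatenate=lambda r, vs: [sum(col) / len(vs) for col in zip(*vs)],
)
target = device.target_vector("在操场上")
```

Injecting the stages as callables keeps the sketch independent of any particular encoder or embedding implementation.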
In the technical solution of this embodiment of the present application, the text to be analyzed is obtained and the original vector corresponding to it is determined; the original vector can be obtained through a BERT model, so that the vector to be concatenated obtained by this technical solution can be concatenated with it to obtain the target vector. At least one word segment to be used is extracted from the text to be analyzed, and the vector to be used corresponding to the at least one word segment to be used is determined: the word segment category corresponding to each word segment to be used is determined, and the corresponding vector to be used is determined based on an embedding function. The vector to be concatenated of the text to be analyzed is obtained according to each vector to be used and its corresponding weight to be used: the weight to be used of each vector to be used can be determined from the weight corresponding to each word segment category, and the vector to be concatenated is then obtained from the vectors to be used and their weights. Finally, the vector to be concatenated is concatenated with the original vector to obtain the target vector, and text analysis is performed on the text to be analyzed based on the target vector. This solves the problem that coarse-grained text analysis makes the syntactic component analysis results of the text insufficiently accurate, and achieves accurate analysis of the syntactic component structure of the text.
Optionally, the original vector determination module 210 includes: a hidden vector determination submodule, configured to perform vector processing on at least one word segment to be used in the text to be analyzed based on a language representation model, to obtain the hidden vector to be used corresponding to the at least one word segment to be used; and
an original vector determination submodule, configured to, for each hidden vector to be used, obtain the original vector corresponding to the text to be analyzed based on the difference between the hidden vector that follows the current hidden vector and the current hidden vector.
Optionally, the to-be-used vector determination module 220 includes: a word segment category determination submodule, configured to determine the word segment category corresponding to each of the at least one word segment to be used, where each word segment category includes at least one word segment to be used; and
a to-be-used vector determination submodule, configured to, for each word segment category, perform vector processing on at least one word segment to be used in the current word segment category, to obtain the vector to be used corresponding to each word segment category.
Optionally, the to-be-used vector determination submodule includes: a to-be-used vector determination unit, configured to perform vector processing, based on an embedding function, on each of the at least one word segment to be used in the current word segment category, to obtain the vector to be used corresponding to the at least one word segment to be used in the current word segment category.
Optionally, the to-be-used vector determination unit includes: a mapping element determination subunit, configured to retrieve a pre-built embedding matrix and determine the matrix mapping element corresponding to at least one word segment to be used in the current word segment category; and
a to-be-used vector determination subunit, configured to determine, based on each matrix mapping element, the vector to be used corresponding to the respective word segment to be used in the current word segment category.
Optionally, the to-be-concatenated vector determination module 230 includes: a weight determination submodule, configured to determine, according to each vector to be used and the original vector, the weight to be used corresponding to each vector to be used; and
a to-be-concatenated vector determination submodule, configured to perform weighted averaging according to each vector to be used and the weight to be used corresponding to each vector to be used, to obtain the vector to be concatenated corresponding to the text to be analyzed.
Optionally, the target vector determination module 240 includes: a target vector determination submodule, configured to concatenate the vector to be concatenated and the original vector based on a pre-built encoder to obtain the target vector; and
a text analysis submodule, configured to input the target vector into a pre-built syntactic analysis model, so as to analyze the text to be analyzed based on the syntactic analysis model.
The text processing device provided in the embodiments of the present application can execute the text processing method provided in any embodiment of the present application, and has the functional modules and effects corresponding to the executed method.
Embodiment 4
FIG. 4 shows a schematic structural diagram of an electronic device 10 of an embodiment of the present application. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (such as helmets, glasses, watches, etc.) and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely examples and are not intended to limit the implementations of the present application described and/or claimed herein.
As shown in FIG. 4, the electronic device 10 includes at least one processor 11 and a memory communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12 and a random access memory (RAM) 13. The memory stores a computer program executable by the at least one processor, and the processor 11 can perform a variety of appropriate actions and processes according to the computer program stored in the ROM 12 or the computer program loaded from the storage unit 18 into the RAM 13. The RAM 13 can also store the various programs and data required for the operation of the electronic device 10. The processor 11, the ROM 12 and the RAM 13 are connected to one another through a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.
Multiple components of the electronic device 10 are connected to the I/O interface 15, including: an input unit 16, such as a keyboard or a mouse; an output unit 17, such as various types of displays and speakers; a storage unit 18, such as a magnetic disk or an optical disc; and a communication unit 19, such as a network card, a modem or a wireless communication transceiver. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
The processor 11 may be any of a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various processors running machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, etc. The processor 11 performs the methods and processes described above, such as the text processing method.
In some embodiments, the text processing method may be implemented as a computer program that is tangibly embodied in a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the text processing method described above can be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the text processing method in any other suitable manner (for example, by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These implementations may include implementation in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor. The programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device and at least one output device, and transmit data and instructions to the storage system, the at least one input device and the at least one output device.
The computer program for implementing the text processing method of the present application may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general-purpose computer, a special-purpose computer or another programmable data processing device, so that when executed by the processor, the computer program causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The computer program may execute entirely on the machine, partly on the machine, partly on the machine and partly on a remote machine as a stand-alone software package, or entirely on a remote machine or server.
在本申请的上下文中,计算机可读存储介质可以是有形的介质,其可以包含或存储以供指令执行系统、装置或设备使用或与指令执行系统、装置或设备结合地使用的计算机程序。计算机可读存储介质可以包括但不限于电子的、磁性的、光学的、电磁的、红外的、或半导体系统、装置或设备,或者上述内容的任何合适组合。备选地,计算机可读存储介质可以是机器可读信号介质。机器可读存储介质包括基于一个或多个线的电气连接、便携式计算机盘、硬盘、RAM、ROM、可擦除可编程只读存储器(Erasable Programmable Read-Only Memory,EPROM)、快闪存储器、光纤、便捷式紧凑盘只读存储 器(Compact Disc Read Only Memory,CD-ROM)、光学储存设备、磁储存设备、或上述内容的任何合适组合。存储介质可以是非暂态(non-transitory)存储介质。In the context of the present application, a computer readable storage medium may be a tangible medium that may contain or store a computer program for use by or in conjunction with an instruction execution system, device, or apparatus. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or device, or any suitable combination of the foregoing. Alternatively, a computer readable storage medium may be a machine readable signal medium. A machine readable storage medium includes an electrical connection based on one or more lines, a portable computer disk, a hard disk, a RAM, a ROM, an Erasable Programmable Read-Only Memory (EPROM), a flash memory, an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. The storage medium may be a non-transitory storage medium.
To provide interaction with a user, the systems and techniques described herein may be implemented on an electronic device having: a display device (e.g., a cathode ray tube (CRT) or liquid crystal display (LCD) monitor) for displaying information to the user; and a keyboard and a pointing device (e.g., a mouse or trackball) through which the user can provide input to the electronic device. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual, auditory, or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input, or tactile input).
The systems and techniques described herein may be implemented in a computing system that includes a back-end component (e.g., a data server), or a computing system that includes a middleware component (e.g., an application server), or a computing system that includes a front-end component (e.g., a user computer with a graphical user interface or a web browser through which a user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), a blockchain network, and the Internet.
A computing system may include a client and a server. The client and the server are generally remote from each other and usually interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other. The server may be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the difficult management and weak business scalability of traditional physical host and virtual private server (VPS) services.
Steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps recorded in the present application may be executed in parallel, sequentially, or in a different order, as long as the result expected of the technical solution of the present application can be achieved; no limitation is imposed herein.
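As a rough illustration of the point about execution order, the sketch below runs two mutually independent steps first sequentially and then in parallel, and checks that both orders give the same result. The functions `determine_original_vector` and `determine_to_be_used_vectors` are hypothetical stand-ins invented for this example; they do not reproduce the vector computations of the present application, and only the ordering behaviour is being shown.

```python
from concurrent.futures import ThreadPoolExecutor


def determine_original_vector(text):
    """Hypothetical stand-in: one number per character of the text."""
    return [float(ord(ch) % 7) for ch in text]


def determine_to_be_used_vectors(text):
    """Hypothetical stand-in: one single-element vector per whitespace-separated word."""
    return [[float(len(word))] for word in text.split()]


def run_sequentially(text):
    # One step after the other.
    original = determine_original_vector(text)
    used = determine_to_be_used_vectors(text)
    return original, used


def run_in_parallel(text):
    # The two steps do not depend on each other, so they may run concurrently.
    with ThreadPoolExecutor(max_workers=2) as pool:
        original = pool.submit(determine_original_vector, text)
        used = pool.submit(determine_to_be_used_vectors, text)
        return original.result(), used.result()


text = "parse this sentence"
sequential_result = run_sequentially(text)
parallel_result = run_in_parallel(text)
print(sequential_result == parallel_result)  # prints True
```

A step that consumes both outputs, such as a weighting step, would still have to wait for both results, which is why only mutually independent steps are interchangeable in this way.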
The above implementations do not limit the protection scope of the present application. Various modifications, combinations, sub-combinations, and substitutions may be made according to design requirements and other factors.

Claims (10)

  1. A text processing method, comprising:
    obtaining a text to be analyzed, and determining an original vector corresponding to the text to be analyzed;
    extracting at least one to-be-used segmented word from the text to be analyzed, and determining a to-be-used vector corresponding to the at least one to-be-used segmented word;
    obtaining a to-be-concatenated vector of the text to be analyzed according to each to-be-used vector and a to-be-used weight corresponding to each to-be-used vector; and
    concatenating the to-be-concatenated vector with the original vector to obtain a target vector, so as to perform text analysis on the text to be analyzed based on the target vector.
  2. The method according to claim 1, wherein determining the original vector corresponding to the text to be analyzed comprises:
    performing vector processing on at least one to-be-used segmented word in the text to be analyzed based on a language representation model, to obtain a to-be-used hidden vector corresponding to the at least one to-be-used segmented word; and
    for each to-be-used hidden vector, obtaining the original vector corresponding to the text to be analyzed based on a difference between the hidden vector following a current hidden vector and the current hidden vector.
  3. The method according to claim 1, wherein determining the to-be-used vector corresponding to the at least one to-be-used segmented word comprises:
    determining, respectively, a segmented-word category corresponding to the at least one to-be-used segmented word, wherein the segmented-word category includes at least one to-be-used segmented word; and
    for each segmented-word category, performing vector processing on at least one to-be-used segmented word in a current segmented-word category, to obtain a to-be-used vector corresponding to each segmented-word category.
  4. The method according to claim 3, wherein performing vector processing on at least one to-be-used segmented word in the current segmented-word category, to obtain the to-be-used vector corresponding to each segmented-word category, comprises:
    performing, based on an embedding function, vector processing on each of the at least one to-be-used segmented word in the current segmented-word category, to obtain a to-be-used vector corresponding to the at least one to-be-used segmented word in the current segmented-word category.
  5. The method according to claim 4, wherein performing, based on the embedding function, vector processing on each of the at least one to-be-used segmented word in the current segmented-word category, to obtain the to-be-used vector corresponding to the at least one to-be-used segmented word in the current segmented-word category, comprises:
    retrieving a pre-built embedding matrix, and determining a matrix mapping element corresponding to at least one to-be-used segmented word in the current segmented-word category; and
    determining, based on each matrix mapping element, the to-be-used vector corresponding to the respective to-be-used segmented word in the current segmented-word category.
  6. The method according to claim 1, wherein obtaining the to-be-concatenated vector of the text to be analyzed according to each to-be-used vector and the to-be-used weight corresponding to each to-be-used vector comprises:
    determining the to-be-used weight corresponding to each to-be-used vector according to each to-be-used vector and the original vector; and
    performing weighted averaging according to each to-be-used vector and the to-be-used weight corresponding to each to-be-used vector, to obtain the to-be-concatenated vector corresponding to the text to be analyzed.
  7. The method according to claim 1, wherein concatenating the to-be-concatenated vector with the original vector to obtain the target vector, so as to perform text analysis on the text to be analyzed based on the target vector, comprises:
    concatenating, based on a pre-built encoder, the to-be-concatenated vector and the original vector to obtain the target vector; and
    inputting the target vector into a pre-built syntactic analysis model, so as to analyze the text to be analyzed based on the syntactic analysis model.
  8. A text processing apparatus, comprising:
    an original vector determination module, configured to obtain a text to be analyzed and determine an original vector corresponding to the text to be analyzed;
    a to-be-used vector determination module, configured to extract at least one to-be-used segmented word from the text to be analyzed, and determine a to-be-used vector corresponding to the at least one to-be-used segmented word;
    a to-be-concatenated vector determination module, configured to obtain a to-be-concatenated vector of the text to be analyzed according to each to-be-used vector and a to-be-used weight corresponding to each to-be-used vector; and
    a target vector determination module, configured to concatenate the to-be-concatenated vector with the original vector to obtain a target vector, so as to perform text analysis on the text to be analyzed based on the target vector.
  9. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor to enable the at least one processor to perform the text processing method according to any one of claims 1 to 7.
  10. A computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions, and the computer instructions, when executed by a processor, cause the processor to implement the text processing method according to any one of claims 1 to 7.
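
The flow recited in claims 1, 2, 6 and 7 can be sketched roughly as follows, using only the Python standard library. This is a minimal illustration under stated assumptions, not the implementation of the present application: the hidden vectors and to-be-used vectors are made-up numbers standing in for the outputs of a language representation model and an embedding matrix; the weights are assumed to be a softmax over dot products with the original vector, since the claims only state that each weight is determined from the to-be-used vector and the original vector; and plain list concatenation stands in for the pre-built encoder of claim 7. All function names are hypothetical.

```python
import math


def subtract(a, b):
    """Element-wise a - b for two equal-length vectors."""
    return [x - y for x, y in zip(a, b)]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax; the outputs are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def original_vector(hidden, i):
    """Claim 2, one reading: the hidden vector after position i minus the one at i."""
    return subtract(hidden[i + 1], hidden[i])


def to_be_concatenated_vector(used_vectors, original):
    """Claim 6: one weight per to-be-used vector, then a weighted average.

    ASSUMPTION: the weights are a softmax over dot products with the original
    vector; the claims do not fix the scoring function.
    """
    weights = softmax([dot(v, original) for v in used_vectors])
    dim = len(original)
    # The weights sum to 1, so this weighted sum is a weighted average.
    mixed = [sum(w * v[k] for w, v in zip(weights, used_vectors)) for k in range(dim)]
    return weights, mixed


def target_vector(mixed, original):
    """Claim 1, last step: concatenate the two vectors (no learned encoder here)."""
    return mixed + original


# Made-up hidden vectors for three positions of the text to be analyzed.
hidden = [[0.1, 0.4, -0.2], [0.5, 0.1, 0.3], [0.0, 0.9, 0.6]]
# Made-up to-be-used vectors, e.g. one per segmented-word category.
used_vectors = [[1.0, 0.0, 0.5], [0.2, 0.8, -0.1]]

orig = original_vector(hidden, 0)
weights, mixed = to_be_concatenated_vector(used_vectors, orig)
target = target_vector(mixed, orig)
print(len(target))  # prints 6: three mixed components followed by three original ones
```

Because the original vector is used both to score the to-be-used vectors and as the second half of the target vector, the hidden size and the embedding size must match in this sketch; a learned projection would be needed otherwise.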
PCT/CN2022/134592 2022-10-27 2022-11-28 Text processing method and apparatus, electronic device and storage medium WO2024087298A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211327875.3A CN115618848A (en) 2022-10-27 2022-10-27 Text processing method and device, electronic equipment and storage medium
CN202211327875.3 2022-10-27

Publications (1)

Publication Number Publication Date
WO2024087298A1 (en) 2024-05-02

Family

ID=84875704

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/134592 WO2024087298A1 (en) 2022-10-27 2022-11-28 Text processing method and apparatus, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN115618848A (en)
WO (1) WO2024087298A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408268A (en) * 2021-06-22 2021-09-17 平安科技(深圳)有限公司 Slot filling method, device, equipment and storage medium
CN113536772A (en) * 2021-07-15 2021-10-22 浙江诺诺网络科技有限公司 Text processing method, device, equipment and storage medium
CN113919344A (en) * 2021-09-26 2022-01-11 腾讯科技(深圳)有限公司 Text processing method and device

Also Published As

Publication number Publication date
CN115618848A (en) 2023-01-17
