CN112069813B - Text processing method, device, equipment and computer readable storage medium - Google Patents


Info

Publication number
CN112069813B
Authority
CN
China
Prior art keywords
vector
word
text
processed
gating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010944900.7A
Other languages
Chinese (zh)
Other versions
CN112069813A (en)
Inventor
王兴光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010944900.7A priority Critical patent/CN112069813B/en
Publication of CN112069813A publication Critical patent/CN112069813A/en
Application granted granted Critical
Publication of CN112069813B publication Critical patent/CN112069813B/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiment of the application provides a text processing method, a device, equipment and a computer readable storage medium, relating to the technical field of artificial intelligence, wherein the method comprises the following steps: dividing word vectors of each word in a text to be processed to at least form a global information sub-vector and a local information sub-vector of the word vectors; performing attention calculation on the corresponding words through the global information sub-vector of each word to obtain attention values of the corresponding words; accumulating the local information sub-vectors of the corresponding words and the attention value to obtain weighted word vectors of the corresponding words, and further forming a merging vector; and determining the merging vector as a feature vector of the text to be processed, and adopting the feature vector to process the text to be processed. According to the embodiment of the application, the feature vector of the text to be processed can be accurately obtained, and the accuracy of the processing result in the subsequent text processing process is further improved.

Description

Text processing method, device, equipment and computer readable storage medium
Technical Field
The embodiment of the application relates to the technical field of Internet, and relates to a text processing method, a text processing device, text processing equipment and a computer readable storage medium.
Background
In the field of artificial intelligence, when text processing is performed on a text, for example, any text processing such as translation of the text, question-answer matching of the text, and searching of the text, it is generally necessary to process a vector corresponding to the text in advance to obtain a processed feature vector, and then to process the text based on the processed feature vector.
In the related art, processing of a vector corresponding to a text is generally implemented in advance using Ordered Neurons or a Self-Attention structure.
However, the related art vector processing method cannot describe the semantic hierarchical relationship between symbols in the text, and Self-Attention assumes by default that the embedding representation vector (Embedding) corresponding to the current symbol fully interacts with other symbols, so that the accuracy of the processing result obtained in the subsequent text processing process is lower.
Disclosure of Invention
The embodiment of the application provides a text processing method, a text processing device, text processing equipment and a computer readable storage medium, and relates to the technical field of artificial intelligence. Because the word vector of each word in the text to be processed is divided, at least a global information sub-vector and a local information sub-vector of the word vector are formed, and attention calculation is carried out based on the global information sub-vector and the local information sub-vector, the weighted word vector of each word can be accurately obtained, and the accuracy of a processing result in the subsequent text processing process is further improved.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a text processing method, which comprises the following steps:
dividing word vectors of each word in a text to be processed to at least form a global information sub-vector and a local information sub-vector of the word vectors;
performing attention calculation on the corresponding words through the global information sub-vector of each word to obtain attention values of the corresponding words;
accumulating the local information sub-vectors of the corresponding words and the attention value to obtain weighted word vectors of the corresponding words;
combining the weighted word vectors of at least one word in the text to be processed to form a combined vector;
and determining the merging vector as a feature vector of the text to be processed, and adopting the feature vector to process the text to be processed.
An embodiment of the present application provides a text processing apparatus, including:
the division module is used for dividing word vectors of each word in the text to be processed to at least form a global information sub-vector and a local information sub-vector of the word vectors;
the attention calculating module is used for carrying out attention calculation on the corresponding words through the global information sub-vector of each word to obtain the attention value of the corresponding word;
The accumulation processing module is used for carrying out accumulation processing on the local information sub-vector of the corresponding word and the attention value to obtain a weighted word vector of the corresponding word;
the merging module is used for merging the weighted word vectors of at least one word in the text to be processed to form a merged vector;
and the processing module is used for determining the merging vector as the characteristic vector of the text to be processed and adopting the characteristic vector to process the text to be processed.
The embodiment of the application provides text processing equipment, which comprises the following components:
a memory for storing executable instructions; and the processor is used for realizing the text processing method when executing the executable instructions stored in the memory.
The embodiment of the application provides a computer readable storage medium, which stores executable instructions for causing a processor to execute the executable instructions to implement the text processing method.
The embodiment of the application has the following beneficial effects: the method comprises the steps of dividing word vectors of each word in a text to be processed, forming at least a global information sub-vector and a local information sub-vector of the word vector, carrying out attention calculation on the corresponding word based on the global information sub-vector to obtain an attention value of the corresponding word, and carrying out accumulation processing on the local information sub-vector and the attention value of the corresponding word to obtain a weighted word vector of the corresponding word, so that a feature vector of the text to be processed is finally determined according to the weighted word vector. Therefore, the weighted word vector of each word can be accurately obtained through the global information sub-vector and the local information sub-vector, so that the feature vector of the text to be processed can be accurately obtained, and the accuracy of the processing result in the subsequent text processing process is further improved.
Drawings
FIG. 1 is a schematic diagram of an alternative architecture of a text processing system provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of a Self-Attention model provided by an embodiment of the present application;
fig. 3 is a schematic structural diagram of a server according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of an alternative text processing method according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of an alternative text processing method according to an embodiment of the present application;
FIG. 6 is a schematic flow chart of an alternative text processing method according to an embodiment of the present application;
FIG. 7 is a schematic flow chart of an alternative text processing method according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a modified Self-Attention model according to an embodiment of the present application.
Detailed Description
The present application will be further described in detail with reference to the accompanying drawings, for the purpose of making the objects, technical solutions and advantages of the present application more apparent, and the described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without making any inventive effort are within the scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which embodiments of this application belong. The terminology used in the embodiments of the application is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
Before explaining the text processing method according to the embodiment of the present application, a text processing method in the related art will be described first:
in the related art, processing of the text-corresponding vector is usually implemented in advance using ordered neurons or self-attention structures. In the Ordered Neurons method, based on a serialization model of a Long Short-Term Memory network (LSTM), the hidden state vector is weighted and the hierarchical relationship of states at different positions of the vector is modeled, so as to provide the processed feature vector of a text; in the Self-Attention method, a model such as the pre-trained language characterization model (BERT, Bidirectional Encoder Representation from Transformers) models the positional relationship between symbols (Token) by introducing positional embedding (Position Embedding), which has shown excellent effects in many tasks.
However, the related art vector processing method cannot describe semantic hierarchical relationships between symbols in a text, such as superior-subordinate relationships in syntactic analysis, and Self-Attention assumes by default that the embedding representation vector corresponding to the current symbol fully interacts with other symbols, so that the local information of the symbol itself is not fully considered, and therefore the accuracy of the processing result obtained in the subsequent text processing process is low.
In order to solve at least one problem of the text processing method in the related art, the embodiment of the application provides a text processing method, which is a Self-Attention computing method that considers the local and global information of each word in the text to be processed, and introduces a new activation function to split the vector corresponding to each symbol (including words and punctuation) in Self-Attention into a local part and a global part. The global part performs the normal Self-Attention calculation, and the output of the Self-Attention calculation is accumulated with the local part in a residual-like manner.
The embodiment of the application provides a text processing method, which comprises the steps of firstly, dividing word vectors of each word in a text to be processed to at least form global information sub-vectors and local information sub-vectors of the word vectors; then, through the global information sub-vector of each word, performing attention calculation on the corresponding word to obtain an attention value of the corresponding word; accumulating the local information sub-vectors and the attention values of the corresponding words to obtain weighted word vectors of the corresponding words; combining weighted word vectors of at least one word in the text to be processed to form a combined vector; and finally, determining the merging vector as a feature vector of the text to be processed, and adopting the feature vector to process the text to be processed. Therefore, the weighted word vector of each word can be accurately obtained through the global information sub-vector and the local information sub-vector, so that the feature vector of the text to be processed can be accurately obtained, and the accuracy of the processing result in the subsequent text processing process is further improved.
In the following, an exemplary application of the text processing device according to the embodiment of the present application will be described, and in one implementation manner, the text processing device provided in the embodiment of the present application may be implemented as any terminal such as a notebook computer, a tablet computer, a desktop computer, a mobile device (for example, a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, a portable game device), an intelligent robot, and in another implementation manner, the text processing device provided in the embodiment of the present application may be implemented as a server. In the following, an exemplary application when the text processing device is implemented as a server will be described.
Referring to FIG. 1, FIG. 1 is a schematic diagram of an alternative architecture of a text processing system 10 provided in an embodiment of the present application. In order to implement text processing on a document to be processed, the text processing system 10 provided in the embodiment of the present application includes a terminal 100, a network 200, and a server 300, where a text processing application (for example, a translation application or a text search application) is running on the terminal 100; in the following, translation software is taken as an example of the text processing application, and a text to be translated as an example of the text to be processed. The user may input a text to be translated on a client of the translation software of the terminal 100; after the terminal 100 obtains the text to be translated, the text to be translated is sent to the server 300 through the network 200, and the server 300 divides a word vector of each word in the text to be translated to form at least a global information sub-vector and a local information sub-vector of the word vector; performs attention calculation on the corresponding words through the global information sub-vector of each word to obtain attention values of the corresponding words; accumulates the local information sub-vectors and the attention values of the corresponding words to obtain weighted word vectors of the corresponding words; combines the weighted word vectors of at least one word in the text to be translated to form a combined vector; and determines the merging vector as a feature vector of the text to be translated, and translates the text to be translated by adopting the feature vector to obtain a translated text. After obtaining the translated text, the server 300 transmits the translated text to the terminal 100 through the network 200, and the terminal 100 displays the translated text on the current interface 100-1.
The text processing method provided by the embodiment of the application also relates to the technical field of artificial intelligence, and can be realized by a natural language processing technology and a machine learning technology in the artificial intelligence technology. Wherein:
artificial intelligence (AI, artificial Intelligence) is a theory, method, technique, and application system that simulates, extends, and extends human intelligence using a digital computer or a machine controlled by a digital computer, perceives the environment, obtains knowledge, and uses the knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision. The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
Natural language processing (NLP, Natural Language Processing) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between a person and a computer in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Thus, research in this field involves natural language, i.e. the language that people use daily, so it has a close relationship with the research in linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robotic question answering, knowledge graph techniques, and the like.
Machine Learning (ML) is a multi-domain interdisciplinary discipline involving multiple disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, etc. It is specially studied how a computer simulates or implements learning behavior of a human to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve own performance. Machine learning is the core of artificial intelligence, a fundamental approach to letting computers have intelligence, which is applied throughout various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, confidence networks, reinforcement learning, transfer learning, induction learning, teaching learning, and the like.
In the embodiment of the application, the text processing method of the embodiment of the application is realized by a natural language processing technology and a machine learning technology in an artificial intelligence technology. It should be noted that, in the text processing method of the embodiment of the present application, the step of obtaining the feature vector of the text to be processed may be implemented by Self-Attention. Self-Attention uses the attention mechanism (Attention) to calculate the association between each word and the other words in the text to be processed, that is, to calculate the attention value (Attention Score) between each word and the other words, obtains the weighted vector representation of each word by using the attention value of each word, and then places the weighted vector representation in a feedforward neural network to obtain a new vector representation, where the new vector representation can well consider the context information in the text to be processed.
Fig. 2 is a schematic diagram of a Self-Attention model provided in the embodiment of the present application. As shown in fig. 2, for Self-Attention, the input vectors include Q (Query), K (Key) and V (Value), where the three vectors Q, K and V all come from the same text. First, the dot product between Q and K is calculated by matrix multiplication (MatMul) 201; then, in order to prevent the dot product from being excessively large, scaling is performed by the scaling module (Scale) 202, and mask processing 203 is performed; finally, logistic regression processing 204 is performed by the Softmax function to normalize the result into a probability distribution, and a weighted sum representation is then obtained by multiplying by the vector V through matrix multiplication 205, so as to obtain the weighted vector representation of each word.
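For reference, the following is a minimal sketch of the scaled dot-product attention flow of fig. 2, written in plain Python/NumPy; the function and variable names are illustrative assumptions and are not taken from the patent itself:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    scores = Q @ K.T                         # MatMul 201: dot product between Q and K
    scores = scores / np.sqrt(K.shape[-1])   # Scale 202: prevent the dot product from being too large
    if mask is not None:                     # Mask processing 203 (optional)
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)       # Softmax 204: normalize into a probability distribution
    return weights @ V                       # matrix multiplication 205: weighted sum over V

# Q, K and V all come from the same text; toy example with 4 words and 8-dimensional vectors
X = np.random.randn(4, 8)
weighted_representation = scaled_dot_product_attention(X, X, X)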
Fig. 3 is a schematic structural diagram of a server 300 according to an embodiment of the present application, where the server 300 shown in fig. 3 includes: at least one processor 310, a memory 350, at least one network interface 320, and a user interface 330. The various components in server 300 are coupled together by bus system 340. It is understood that the bus system 340 is used to enable connected communications between these components. The bus system 340 includes a power bus, a control bus, and a status signal bus in addition to the data bus. But for clarity of illustration the various buses are labeled in fig. 3 as bus system 340.
The processor 310 may be an integrated circuit chip with signal processing capabilities, such as a general purpose processor (which may be a microprocessor or any conventional processor), a digital signal processor (DSP, Digital Signal Processor), another programmable logic device, discrete gate or transistor logic, discrete hardware components, or the like.
The user interface 330 includes one or more output devices 331 that enable presentation of media content, including one or more speakers and/or one or more visual displays. The user interface 330 also includes one or more input devices 332, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
Memory 350 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 350 optionally includes one or more storage devices physically located remote from processor 310. Memory 350 includes volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a random access Memory (RAM, random Access Memory). The memory 350 described in embodiments of the present application is intended to comprise any suitable type of memory. In some embodiments, memory 350 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
The operating system 351 including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and handling hardware-based tasks;
network communication module 352 for reaching other computing devices via one or more (wired or wireless) network interfaces 320, exemplary network interfaces 320 include: bluetooth, wireless compatibility authentication (WiFi), and universal serial bus (USB, universal Serial Bus), etc.;
An input processing module 353 for detecting one or more user inputs or interactions from one of the one or more input devices 332 and translating the detected inputs or interactions.
In some embodiments, the apparatus provided by the embodiments of the present application may be implemented in software, and fig. 3 shows a text processing apparatus 354 stored in a memory 350, where the text processing apparatus 354 may be a text processing apparatus in a server 300, and may be software in the form of a program and a plug-in, and includes the following software modules: the dividing module 3541, the attention calculating module 3542, the accumulating processing module 3543, the combining module 3544, and the processing module 3545 are logical, and thus may be arbitrarily combined or further split according to the implemented functions. The functions of the respective modules will be described hereinafter.
In other embodiments, the apparatus provided by the embodiments of the present application may be implemented in hardware, and by way of example, the apparatus provided by the embodiments of the present application may be a processor in the form of a hardware decoding processor that is programmed to perform the text processing method provided by the embodiments of the present application, for example, the processor in the form of a hardware decoding processor may employ one or more application specific integrated circuits (ASIC, Application Specific Integrated Circuit), DSP, programmable logic device (PLD, Programmable Logic Device), complex programmable logic device (CPLD, Complex Programmable Logic Device), field programmable gate array (FPGA, Field-Programmable Gate Array), or other electronic components.
The text processing method provided by the embodiment of the present application will be described below in connection with exemplary applications and implementations of the server 300 provided by the embodiment of the present application. Referring to fig. 4, fig. 4 is a schematic flowchart of an alternative text processing method according to an embodiment of the present application, and will be described with reference to the steps shown in fig. 4.
In step S401, the word vector of each word in the text to be processed is divided to form at least a global information sub-vector and a local information sub-vector of the word vector.
Here, the text to be processed may be any type of text, and may be, for example, text to be translated, text to be retrieved, text of questions to be matched with an answer, and the like. The text to be processed comprises at least one word, wherein the word comprises words capable of representing text semantic information, and words which are not used for representing the text semantic information, such as a mood word, a mood auxiliary word and the like. The text to be processed can be any language type text, for example, chinese text, english text and the like.
In the embodiment of the application, after the text to be processed is acquired, the text to be processed is divided to form at least one word, and each word can be a word or a single word. After at least one word is obtained, word vectors of each word are matched in a preset word vector library. Here, a Word vector (Word emmbeddi ng), also known as a collective term for a set of language modeling and feature learning techniques in Word embedded natural language processing (NLP, natural Language Processing), wherein words or phrases from a vocabulary are mapped to vectors of real numbers, conceptually, the Word vector involves mathematical embedding from a space of one dimension per Word to a space of successive vectors with lower dimensions. That is, the word vector of each word is stored in the preset word vector library, and after at least one word is obtained by division, the word vector of each word can be sequentially matched from the preset word vector library.
In the embodiment of the application, the word vector of each word is divided into at least two parts, namely a global information sub-vector and a local information sub-vector of the word vector, wherein the global information sub-vector and the local information sub-vector are combined to form the word vector of the word, that is, elements with certain dimension in the word vector of each word are divided to form a global information sub-vector, and part or all elements in the remaining dimension in the word vector of the word are divided to form a local information sub-vector.
Step S402, attention calculation is carried out on the corresponding words through the global information sub-vectors of each word, and the attention value of the corresponding word is obtained.
Here, the attention calculation may be implemented by using an attention model, and the global information sub-vector of each word is used as an input of the attention model and is input into the attention model to obtain an attention value of the word, where the attention value is a value used for characterizing the weight of the word in the whole text to be processed. If the attention value is higher, the weight of the word in the whole text to be processed is higher, which indicates that the word is more important and needs to be paid more attention in the subsequent processing process of the whole text processing model; if the attention value is lower, the weight of the word in the whole text to be processed is lower, which indicates that the word is less important and does not need to be given as much attention in the subsequent processing of the whole text processing model.
Step S403, the local information sub-vector and the attention value of the corresponding word are accumulated to obtain a weighted word vector of the corresponding word.
Here, the accumulation processing of the local information sub-vector and the attention value of the corresponding word corresponds to the residual processing of the corresponding word, and the calculated attention value is accumulated in the local information sub-vector to obtain a weighted word vector of the corresponding word, and the weighted word vector is a vector to which the weight of the word is added. It should be noted that, by performing the accumulation processing on the local information sub-vector and the attention value of the corresponding word, the vector of the important word in the text to be processed can be weighted higher than the vector of the unimportant word.
Step S404, combining the weighted word vectors of at least one word in the text to be processed to form a combined vector.
Here, merging the weighted word vectors of at least one word in the text to be processed means that the weighted word vector of each word is sequentially connected with the weighted word vector of the next word to form a merged vector with higher dimension. For example, the text to be processed includes two words a and B, where the weighted word vector of a is an n-dimensional vector and the weighted word vector of B is an m-dimensional vector, so that after the weighted word vectors of a and B are combined, an n+m-dimensional combined vector is formed, and the elements in the combined vector are the elements in the weighted word vectors of a and B. In short, the weighted word vectors of a and B are combined, that is, the elements in the weighted word vectors of a and B are spliced to form a combined vector with higher dimension.
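The accumulation in step S403 and the merging in step S404 can be illustrated with the following NumPy sketch; the vector sizes and names are assumptions for the A/B example above, with both words given equal length here for simplicity:

import numpy as np

# assumed example: words A and B, each with a local information sub-vector and an attention value of matching size
local_a, attention_a = np.random.randn(16), np.random.randn(16)
local_b, attention_b = np.random.randn(16), np.random.randn(16)

# step S403: accumulate the local information sub-vector and the attention value (residual-like)
weighted_a = local_a + attention_a
weighted_b = local_b + attention_b

# step S404: splice the weighted word vectors of A and B into a higher-dimensional merged vector
merged_vector = np.concatenate([weighted_a, weighted_b])  # n + m dimensions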
In step S405, the combined vector is determined as a feature vector of the text to be processed, and the text to be processed is processed by using the feature vector.
Here, the merged vector is a vector capable of representing information of a text to be processed, and attention is calculated in accordance with importance of each word in the merged vector, that is, the merged vector is a vector obtained by giving weight to a word vector of each word. Thus, if the subsequent calculation of the text processing is performed by merging the vectors, the calculation is performed according to the importance of each word in the entire text to be processed, and thus the subsequent text processing will also take into account the different importance of each word in the text to be processed.
In the embodiment of the application, the merging vector is determined as the feature vector of the text to be processed, and the feature vector can be input into any text processing model to be used as an input value of the text processing model for relevant calculation of text processing.
According to the text processing method provided by the embodiment of the application, as the word vector of each word in the text to be processed is divided, at least the global information sub-vector and the local information sub-vector of the word vector are formed, the attention calculation is carried out on the corresponding word based on the global information sub-vector to obtain the attention value of the corresponding word, and the accumulation processing is carried out on the local information sub-vector and the attention value of the corresponding word to obtain the weighted word vector of the corresponding word, so that the feature vector of the text to be processed is finally determined according to the weighted word vector. Therefore, the weighted word vector of each word can be accurately obtained through the global information sub-vector and the local information sub-vector, so that the feature vector of the text to be processed can be accurately obtained, and the accuracy of the processing result in the subsequent text processing process is further improved.
In some embodiments, the text processing system includes a terminal and a server, where the terminal runs a text processing application, and the text processing application may be, for example, translation software, a text search application, a question-answer matching application, and the text processing application is described below as an example of a question-answer matching application. The method comprises the steps that a question-answer matching application is operated on a terminal, a server is a server of the question-answer matching application, a user inputs questions on a client of the question-answer matching application on the terminal, the server matches answer texts corresponding to the questions in a text library according to the questions input by the user, and the matched answer texts are output to the user.
Fig. 5 is a schematic flow chart of an alternative text processing method according to an embodiment of the present application, as shown in fig. 5, the method includes the following steps:
in step S501, the terminal obtains a text to be processed input by the user, where the text to be processed includes a question of an answer to be matched.
The terminal can acquire the text to be processed in any mode, for example, a user can input the text to be processed through a text input module on the terminal, wherein the text input module can be a touch screen input module or a physical input module such as a keyboard or a mouse; or the user can also input voice information in a voice input mode, and the terminal analyzes the voice information of the user to obtain a text to be processed; or, the user can also input gesture information in a gesture input mode, and the terminal analyzes the gesture information of the user to obtain a text to be processed.
In step S502, the terminal encapsulates the text to be processed in the text processing request.
Here, the text processing request is used to request processing of the text to be processed, i.e. the text processing request is used to request matching of an answer to the question.
In step S503, the terminal sends a text processing request to the server.
Step S504, the server analyzes the text processing request to obtain the questions to be matched with the answers.
In the embodiment of the application, after the questions of the answers to be matched are analyzed, the text of the questions is divided, and at least one word is obtained.
In step S505, the server acquires a word vector of each word in the question from a preset word vector library.
Here, the preset word vector library includes word vectors of at least one word, and after the at least one word is obtained by division, the word vectors of each word may be sequentially matched in the preset word vector library.
In step S506, the server divides the word vector of each word to form at least a global information sub-vector and a local information sub-vector of the word vector.
In step S507, the server calculates the attention of the corresponding word through the global information sub-vector of each word, and obtains the attention value of the corresponding word.
In step S508, the server performs accumulation processing on the local information sub-vector and the attention value of the corresponding word to obtain a weighted word vector of the corresponding word.
In step S509, the server merges the weighted word vectors of all the words in the question to form a merged vector.
It should be noted that, step S506 to step S509 correspond to step S401 to step S404, please refer to the detailed explanation in step S401 to step S404, and the process of step S506 to step S509 is not repeated in the embodiment of the present application.
In step S510, the server determines the merging vector as a feature vector of the text to be processed, and performs question-answer matching on the question using the feature vector to match the answer text corresponding to the question in the text library.
Here, since the text processing model is a question-answer matching model, after the merged vector of the text to be processed is obtained, the merged vector is input into the question-answer matching model as a feature vector of the text to be processed, and the feature vector of the text to be processed is processed through the question-answer matching model, so that the answer text corresponding to the question is matched in the text library.
In the embodiment of the application, the text library comprises at least one text, each text corresponds to a specific field, and each text is a solution corresponding to at least one problem. Each text corresponds to a text vector, which is a vector for representing text information. Therefore, by calculating the similarity between the feature vector of the text to be processed and the text vector of each text, or calculating the matching between the feature vector of the text to be processed and the text vector of each text, the target text which is the answer text of the question corresponding to the text to be processed and is most relevant or matched with the text to be processed can be determined.
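As an illustration of the similarity or matching computation mentioned above, a minimal sketch is given below; cosine similarity is only one assumed choice of measure, and the structure of the text library is likewise assumed:

import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_answer_text(question_feature_vector, text_library):
    # text_library: list of (answer_text, text_vector) pairs, one per text in the library
    best_text, best_score = None, -1.0
    for answer_text, text_vector in text_library:
        score = cosine_similarity(question_feature_vector, text_vector)
        if score > best_score:
            best_text, best_score = answer_text, score
    return best_text  # the target text most relevant to the text to be processed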
In step S511, the server transmits the reply text as an answer to the question to the terminal.
In step S512, the terminal displays the answer to the question on the current interface.
The text processing method provided by the embodiment of the application realizes question-answer matching of the questions corresponding to the text to be processed. The method comprises the steps of carrying out interaction between a terminal and a server, sending a text to be processed, which is acquired by the terminal, to the server, so as to request the server to carry out text processing on the text to be processed, dividing word vectors of each word in the text before the text processing, forming at least global information sub-vectors and local information sub-vectors of the word vectors, carrying out attention calculation on the corresponding words based on the global information sub-vectors to obtain attention values of the corresponding words, and carrying out accumulation processing on the local information sub-vectors and the attention values of the corresponding words to obtain weighted word vectors of the corresponding words, so that feature vectors of the text to be processed can be finally determined according to the weighted word vectors. Therefore, the weighted word vector of each word can be accurately obtained through the global information sub-vector and the local information sub-vector, so that the feature vector of the text to be processed can be accurately obtained, and the accuracy of the matched reply text in the subsequent text matching process is further improved.
Based on fig. 4, fig. 6 is a schematic flowchart of an alternative text processing method according to an embodiment of the present application, as shown in fig. 6, step S401 may be implemented by:
in step S601, a gating vector is determined, where the gating vector at least includes non-zero regions.
Here, the gating vector is a vector used for determining the division position when dividing a word vector, and it is a preset vector or a vector obtained by transforming a preset vector.
In some embodiments, determining the gating vector in step S601 may be achieved by:
in step S6011, a first gating vector and a second gating vector are acquired.
Here, the sum of all elements of the first gating vector is 1, and the elements in the first gating vector are arranged in a sequentially increasing order; the sum of all elements of the second gating vector is 1, and the elements in the second gating vector are arranged in a descending order; the dimensions of the first gating vector are the same as the dimensions of the second gating vector. That is, the order of the elements in the first gating vector and the second gating vector is exactly opposite, one in increasing order and one in decreasing order.
Step S6012, multiplying the element of each position in the first gating vector with the element of the corresponding position in the second gating vector in turn, to obtain the product of the corresponding positions. In step S6013, the product of each position is added to a new vector in order of each position in the first gating vector, and the gating vector is generated.
For example, the first gating vector may be an increasing sequence [0, ..., s], the dimension of the first gating vector being N, wherein the value of each element is less than or equal to 1 and the sum of the N elements is 1. When the first gating vector is set, the sum of the value of the first element and the value of the second element may be assigned to the second element, and the sum of the value of the second element and the value of the third element may then be assigned to the third element, and so on, so that the nth number in the first gating vector represents the sum of the first n numbers and the value of the last element is 1; that is, the first gating vector may be a vector increasing to 1. Conversely, the second gating vector may be a decreasing sequence [t, ..., 0]; the dimension of the second gating vector is also N, wherein the value of each element is less than or equal to 1 and the sum of the N elements is 1, and the second gating vector may be a vector that decreases to 0.
Here, it is assumed that the first 10 elements of the first gating vector are 0 (each may be a very small number close to 0, e.g. 10^-5) and the last element at the end is 1 (which may be a number close to 1 and less than 1); the second gating vector, in turn, is likely to have its last 10 elements equal to 0, with the sequence of the entire vector tapering down to 0. The first gating vector and the second gating vector may then be multiplied at the corresponding positions, i.e. the number at the 1st position of the first gating vector is multiplied with the number at the 1st position of the second gating vector, the number at the 2nd position of the first gating vector with the number at the 2nd position of the second gating vector, and so on, until the number at the Nth position of the first gating vector is multiplied with the number at the Nth position of the second gating vector. At this time, since the elements at the first 10 positions of the first gating vector and at the last 10 positions of the second gating vector are all 0, after the multiplication, the first 10 positions and the last 10 positions of the resulting gating vector are all 0, while the intermediate positions are not 0.
In step S602, the non-zero interval is determined as the global position interval.
Here, the non-zero interval of the gate vector intermediate position is determined as the global position interval. For example, if the first 10 positions of the resulting gating vector are 0 and the last 10 positions are also 0 after multiplying the elements of each position in the first gating vector with the elements of the corresponding position in the second gating vector, the global position interval is determined from the intervals corresponding to the positions other than the first 10 positions and the last 10 positions.
In step S603, the subinterval located after the non-zero interval in the gating vector is determined as the local position interval. Here, assuming that the start position of the non-zero interval is i and the end position of the non-zero interval is j, the local position interval is the interval from position j to the last position in the gating vector.
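The interaction of the two gating vectors and the resulting position intervals can be illustrated numerically with the following sketch; the 8-dimensional values are assumed purely for illustration:

import numpy as np

# assumed 8-dimensional example: the first gating vector increases toward 1,
# the second gating vector decreases toward 0
first_gating_vector = np.array([0.0, 0.0, 0.1, 0.4, 0.8, 1.0, 1.0, 1.0])
second_gating_vector = np.array([1.0, 1.0, 1.0, 0.9, 0.5, 0.2, 0.0, 0.0])

# multiply the element at each position of the first gating vector with the element
# at the corresponding position of the second gating vector
gating_vector = first_gating_vector * second_gating_vector
# -> [0., 0., 0.1, 0.36, 0.4, 0.2, 0., 0.]
# the non-zero interval (positions 2..5, 0-indexed) is the global position interval,
# and the subinterval after it (positions 6..7) is the local position interval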
In step S604, the word vectors of each word are divided according to the global position interval and the local position interval, and at least the global information sub-vector and the local information sub-vector of the word vector are formed.
In some embodiments, step S604 may be implemented by:
step S6041, determining the position of the first element in the global position section in the gating vector as the initial position.
Step S6042, determining the position of the last element in the global position section in the gating vector as the end position.
In step S6043, the word vector of each word is divided according to the initial position and the final position to form at least a global information sub-vector and a local information sub-vector of the word vector.
In some embodiments, the method may further comprise the steps of:
step S61, a first number corresponding to the vector dimension of the gating vector is obtained.
Step S62, equally dividing the word vector into a first number of subintervals according to the sequence of elements in the word vector of each word; wherein each of the first number of subintervals corresponds in turn to a position in the gating vector.
Correspondingly, step S6043 may be implemented by:
step S6043a, combining the first subinterval corresponding to the initial position in the first number of subintervals, the second subinterval corresponding to the end position in the first number of subintervals, and other subintervals between the first subinterval and the second subinterval to form a global information subvector. Step S6043b, merging the remaining subintervals after the second subinterval to form the local information subvector.
For example, assume that the word vector X of the input word is a 128-dimensional vector, but both the first and second gating vectors are 8-dimensional vectors, that is, the first and second gating vectors are not 128-dimensional; then the dimension of the gating vector obtained after the first and second gating vectors interact is also 8. At this time, the word vector X may be reshaped into an 8-by-16 representation, that is, every 16 consecutive elements form a subinterval, each subinterval corresponds to one position, and the whole word vector X corresponds to 8 positions; in other words, every 16 elements of the word vector X correspond to one element in the gating vector. Finally, the first position of the non-zero interval of the gating vector is determined as the initial position and the last position as the final position, and the subinterval corresponding to the initial position in the word vector X, the subinterval corresponding to the final position in the word vector X, and the elements between these two subintervals are merged to form the global information subvector. For example, if the non-zero interval runs from position 3 to position 5 in the gating vector, then the 128-dimensional word vector X is equally divided into 8 subintervals, the elements from the 3rd subinterval to the 5th subinterval form the global information subvector, and the elements after the 5th subinterval form the local information subvector.
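The chunked division in the 128-dimensional example above can be written, under the same assumptions, as the following sketch:

import numpy as np

word_vector_X = np.random.randn(128)                 # 128-dimensional word vector X
n_chunks = 8                                         # first number: vector dimension of the gating vector
subintervals = word_vector_X.reshape(n_chunks, 16)   # 8 subintervals of 16 consecutive elements each

# assume the non-zero interval of the gating vector runs from position 3 to position 5 (1-indexed, as above)
initial_position, final_position = 3, 5
global_information_subvector = subintervals[initial_position - 1:final_position].reshape(-1)  # 3rd..5th subintervals
local_information_subvector = subintervals[final_position:].reshape(-1)                       # subintervals after the 5th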
Based on fig. 4, fig. 7 is a schematic flowchart of an alternative text processing method according to an embodiment of the present application, as shown in fig. 7, step S402 may be implemented by:
in step S701, for each word in the text to be processed, the global information sub-vector of the corresponding word and the word vector of each word are input as input values into the self-attention model.
Step S702, calculating the attention value of the corresponding word by the self-attention model.
The Self-Attention model is a Self-Attention model, and may be the model shown in fig. 2 described above or the Self-Attention model obtained after deformation based on the model shown in fig. 2. In the embodiment of the application, the attention value of each word in the whole text to be processed can be calculated through the self-attention model, so that the weighted calculation is carried out on each word, and the weight of the more important word in the whole text to be processed is higher.
With continued reference to fig. 7, in some embodiments, after forming the feature vector in step S405, the method may further include the steps of:
in step S703, the combined vector is determined as a feature vector of the text to be processed.
In step S704, the feature vectors are divided to form at least global feature sub-vectors and local feature sub-vectors of the feature vectors.
Step S705, performing attention calculation on the text to be processed through the global feature sub-vector to obtain a text attention value of the text to be processed.
In step S706, the local feature sub-vectors and the text attention value are accumulated to obtain a weighted text vector of the text to be processed.
It should be noted that, steps S703 to S706 are processes of performing self-attention computation on the merging vectors determined by the method according to the embodiment of the present application, where self-attention computation is a process of computing attention values and obtaining weighted word vectors according to the embodiment of the present application. That is, the model for self-attention computation provided in the embodiment of the present application may be used at any position in the entire text processing model, may perform self-attention computation once when the entire text processing model is initially input, and may perform self-attention computation on the obtained intermediate vector one or more times at an intermediate position of the text processing model.
Correspondingly, the text processing of the text to be processed using the feature vector in step S405 may be implemented by the following steps: in step S707, text processing is performed on the text to be processed using the weighted text vector.
The text processing method provided by the embodiment of the application can be applied to any position in the text processing model, namely, the self-attention computing model with a multi-layer structure can be added in the text processing model, and the self-attention computing model can be used in the interlayer in the text processing model. Therefore, according to the text processing requirements, the self-attention calculation based on the global and local information can be carried out on the output vectors under different processing conditions, the weighted word vectors obtained in each layer are ensured to be more consistent with the semantic representation of the text to be processed, and the accuracy of the processing result of the whole text processing model is further improved.
In the following, an exemplary application of the embodiment of the present application in a practical application scenario will be described.
The embodiment of the application provides a text processing method which is applicable to all algorithms possibly used for Self-Attention.
For a word, the word often contains both local information and global information. Local information refers to semantics, such as the polarity information described by "cold" and "hot", that can be clearly expressed without any context: without any context, a person can obtain information corresponding to low temperature, an aloof character, etc. from "cold", and information corresponding to high temperature, an open character, etc. from "hot". Global information refers to semantics that need context to be further clarified, for example, the word "cold" in "someone is cold toward others", which means that the person's character is aloof. Based on this assumption about local information and global information, the embodiment of the application provides a Self-Attention computing method that is sensitive to local and global information.
In the traditional method, the semantics of a word are often directly encoded into a vector, and the local information and the global information of the word are not distinguished. The Attention algorithms proposed in recent years, including Self-Attention, ignore the local information of words and directly model the global information of words, or model the global information of the whole sentence of text. This solution does not distinguish well between the inherent properties of the word itself and the contextual properties of the word, namely the local information and the global information mentioned in the embodiments of the present application.
The core idea of the embodiment of the application is that the local information and the global information are distinguished by an algorithm, and the algorithm flow can be realized by the following codes:
def cumsum_active(X, n_chunks):
    shape = X.get_shape()  # original shape N, T, C
    # two activation gates computed from X (the l gate and the g gate)
    l = tf.nn.softmax(tf.layers.dense(X, n_chunks))  # N, T, NChunks
    g = tf.nn.softmax(tf.layers.dense(X, n_chunks))  # N, T, NChunks
    # cumulative sums: l increases toward 1, g decreases toward 0
    l = tf.expand_dims(tf.math.cumsum(l, axis=-1), axis=-1)       # N, T, NChunks, 1
    g = tf.expand_dims(1. - tf.math.cumsum(g, axis=-1), axis=-1)  # N, T, NChunks, 1
    # the product w is non-zero only on the middle (global) interval
    w = l * g
    # split the last dimension into NChunks sub-intervals
    X = tf.stack(tf.split(X, n_chunks, axis=2), axis=2)  # N, T, NChunks, C/NChunks
    X_context = tf.reshape(X * w, shape)
    X_local = tf.reshape(X * l, shape)
    return X_context, X_local
As can be seen from the above code, for the input X, two activation gate structures, an l gate and a g gate, are calculated respectively, where the l gate corresponds to the first gating vector described above and the g gate corresponds to the second gating vector described above.
The l gate, by way of a cumulative summation, eventually represents an increasing sequence [0, ..., 1], and the g gate, opposite to the l gate, represents a decreasing sequence [1, ..., 0]. During the calculation, meaningless representation areas are introduced. The l gate interacts with the g gate to obtain a non-zero interval; assume the starting position and the ending position of this interval are position i and position j respectively. For the original vector X, the global information is the vector X[i:j], that is, the elements from position i to position j in the X vector are determined as the vector representation of the global information; the local information is defined as X[j:], that is, the elements from position j to the last position of the X vector are determined as the vector representation of the local information; the remainder of the vector, X[0:i], that is, the elements in the X vector from the initial position to position i, is defined as the nonsensical representation area. It should be noted that, since the positions i and j of different words are different, the nonsensical region is completely calculated from the vectors of the surrounding words.
In the embodiment of the application, two vectors can be obtained through the cumsum_active operation: X_context and X_local, where X_context[i:j] is non-zero and X_local[j:] is non-zero.
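As an aid to understanding, the following is a minimal numerical sketch of the gating logic above, written in NumPy for readability rather than TensorFlow; the gate logits, the chunk count, and the vector values are made-up illustrative numbers, not values produced by the model:

import numpy as np

# one word vector of dimension 8, split into n_chunks = 4 sub-blocks of 2 elements
x = np.arange(1.0, 9.0)
n_chunks = 4
chunk_size = x.size // n_chunks

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# hypothetical gate logits; in the model they come from learned dense layers over x
l_logits = np.array([0.1, 2.0, 0.5, 0.3])
g_logits = np.array([0.2, 0.4, 2.5, 0.1])

l = np.cumsum(softmax(l_logits))        # increasing sequence, last element is 1
g = 1.0 - np.cumsum(softmax(g_logits))  # decreasing sequence, last element is 0
w = l * g                               # non-zero mainly on the middle (global) window

expand = lambda v: np.repeat(v, chunk_size)  # spread each chunk weight over its elements
x_context = x * expand(w)               # global part, roughly X[i:j]
x_local = x * expand(l)                 # local part, roughly X[j:]

print(l, g, w)                          # shows the window where both gates are active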
A detailed explanation of the meaningless part: for some words, a full D-dimensional vector representation may not be needed, and the complete information can be described with fewer dimensions. The vector of such a word is therefore divided into three parts: the meaningless part is obtained entirely through context calculation and does not influence the Attention weight calculation; the global part is also obtained through context calculation, but directly influences the Attention weight calculation; the local part does not participate in the context calculation and is the local information inherent to each symbol (word) itself.
In the embodiment of the application, the modified Self-Attention method can be applied to a Transformer structure, which amounts to modifying the model at two positions: the first position is that, before the Attention weights are calculated, X_context is extracted from the vector by the cumsum_active method, and the subsequent Attention operations are executed based on X_context; the second position is that, after the new representation of the word vector is obtained, the residual no longer accumulates the original representation of Q (i.e., Query) but instead the local representation of Q.
The modified Self-Attention calculation method refers to the modified Self-Attention model structure schematic diagram shown in fig. 8. Compared with the Self-Attention model structure schematic diagram of fig. 2 described above, the thick solid line parts 801 in fig. 8 are newly added operations, corresponding respectively to the cumsum_active calculation 81 and the new residual path; the dashed line portion 802 represents the original residual module that is removed. Note that in fig. 8, Q_l represents the local representation of Q, Q_g the global representation of Q, K_g the global representation of K (i.e., Key), and V_g the global representation of V (i.e., Value); the operations on K and V are the same as those on Q, and the output result Q_a is finally obtained.
According to the method provided by the embodiment of the application, the input vector is segmented through cumsum_active into meaningless, global, and local areas, so that the global information and local information contained in the input vector can be better preserved. The global information participates in calculating the Attention weights; these weights then re-weight the meaningless and global parts of the vector, and the local area replaces the original residual structure. The algorithm provided by the embodiment of the application can be applied as a direct modification of existing Self-Attention computation, and can achieve a better lifting effect on some tasks.
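For illustration only, the following is a minimal single-head sketch of the modified structure of fig. 8, assuming plain NumPy matrix projections in place of learned dense layers; the function and parameter names (modified_self_attention, Wq, Wk, Wv) are hypothetical, and the wiring of the local residual (labelled Q_l in fig. 8) is simplified here to projecting the local part, which is one possible reading of the figure rather than the definitive implementation:

import numpy as np

def softmax_rows(m):
    e = np.exp(m - m.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def modified_self_attention(x_context, x_local, Wq, Wk, Wv):
    # attention weights are computed only from the global (context) part
    Qg = x_context @ Wq          # T x d, global representation of Q
    Kg = x_context @ Wk          # T x d, global representation of K
    Vg = x_context @ Wv          # T x d, global representation of V
    d = Qg.shape[-1]
    weights = softmax_rows(Qg @ Kg.T / np.sqrt(d))
    attended = weights @ Vg
    # the residual path adds the LOCAL part instead of the original input;
    # x_local @ Wq stands in for Q_l of fig. 8
    return attended + x_local @ Wq

# usage with random toy data: 5 words, 8-dimensional vectors
rng = np.random.default_rng(0)
T, D = 5, 8
x_context, x_local = rng.normal(size=(T, D)), rng.normal(size=(T, D))
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))
out = modified_self_attention(x_context, x_local, Wq, Wk, Wv)  # shape (5, 8)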
It should be noted that, in the Self-Attention calculation, other positions may also be tried: for example, cumsum_active may be recalculated after the weight calculation, that is, the cumsum_active structure provided by the embodiment of the present application may be stacked multiple times; cumsum_active may also be changed into a multi-layer structure, or used in alternating layers of the whole text processing model, and the like.
The three-part information representation based on Self-Attention and its subsequent use provided by the embodiment of the application, especially the replacement of the original representation by the local representation in the residual calculation, belong to the core protection scope of the embodiment of the application. The embodiment of the application does not restrict the internal structure of Self-Attention.
Continuing with the description below of an exemplary architecture in which the text processing device 354 provided by embodiments of the present application is implemented as a software module, in some embodiments, as shown in fig. 3, the software module stored in the text processing device 354 of the memory 350 may be a text processing device in the server 300, including:
the division module 3541 is configured to divide a word vector of each word in the text to be processed, and at least form a global information sub-vector and a local information sub-vector of the word vector;
An attention calculating module 3542, configured to calculate, by using the global information sub-vector of each word, an attention of a corresponding word, so as to obtain an attention value of the corresponding word;
the accumulation processing module 3543 is configured to perform accumulation processing on the local information sub-vector of the corresponding word and the attention value, so as to obtain a weighted word vector of the corresponding word;
a merging module 3544, configured to merge the weighted word vectors of at least one word in the text to be processed to form a merged vector;
and the processing module 3545 is configured to determine the combined vector as a feature vector of the text to be processed, and perform text processing on the text to be processed by using the feature vector.
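Purely as an illustration of how the above modules might cooperate, the following sketch composes them end to end; the helper names (divide_word_vector, attention_value, process_text) are hypothetical placeholders for the module behaviour described above, a single fixed (i, j) window is used for all words for simplicity, and a simple sum stands in for the merging step:

import numpy as np

def divide_word_vector(v, i, j):
    # division module 3541: split one word vector into global and local sub-vectors
    # (i and j stand for the start/end of the non-zero interval of the gating vector)
    global_sub = np.zeros_like(v); global_sub[i:j] = v[i:j]
    local_sub = np.zeros_like(v); local_sub[j:] = v[j:]
    return global_sub, local_sub

def attention_value(global_sub, all_vectors):
    # attention calculating module 3542: a stand-in scoring of the word against the text
    scores = all_vectors @ global_sub
    weights = np.exp(scores - scores.max()); weights /= weights.sum()
    return weights @ all_vectors

def process_text(word_vectors, i, j):
    weighted = []
    for v in word_vectors:
        g, l = divide_word_vector(v, i, j)
        a = attention_value(g, word_vectors)   # attention value of the word
        weighted.append(l + a)                 # accumulation processing module 3543
    feature = np.sum(weighted, axis=0)         # merging module 3544 (sum as a stand-in)
    return feature                             # handed to the processing module 3545

words = np.random.default_rng(1).normal(size=(4, 8))  # 4 words, 8-dim word vectors
feature_vector = process_text(words, i=2, j=6)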
In some embodiments, the partitioning module is further to:
determining a gating vector, wherein the gating vector at least comprises a non-zero interval;
determining the non-zero interval as a global position interval;
determining a subinterval positioned behind the non-zero interval in the gating vector as a local position interval;
dividing the word vector of each word according to the global position interval and the local position interval to at least form a global information sub-vector and a local information sub-vector of the word vector.
In some embodiments, the partitioning module is further to:
acquiring a first gating vector and a second gating vector; the sum of all elements of the first gating vector is 1, and the elements in the first gating vector are arranged in a sequential increasing order; the sum of all elements of the second gating vector is 1, and the elements in the second gating vector are arranged in a descending order; the dimension of the first gating vector is the same as the dimension of the second gating vector;
sequentially multiplying the element of each position in the first gating vector with the element of the corresponding position in the second gating vector to obtain the product of the corresponding positions;
and adding the product of each position to a new vector in sequence according to the sequence of each position in the first gating vector, and generating the gating vector.
In some embodiments, the partitioning module is further to:
determining the position of the first element in the global position interval in the gating vector as an initial position;
determining the position of the last element in the global position interval in the gating vector as a termination position;
Dividing the word vector of each word according to the initial position and the end position to at least form a global information sub-vector and a local information sub-vector of the word vector.
In some embodiments, the apparatus further comprises:
the first quantity acquisition module is used for acquiring a first quantity corresponding to the vector dimension of the gating vector;
an equally dividing module, configured to equally divide the word vector into the first number of subintervals according to the order of elements in the word vector of each word; wherein each of the first number of subintervals corresponds in turn to a position in the gating vector;
the dividing module is further configured to:
combining a first subinterval corresponding to the initial position in the first number of subintervals, a second subinterval corresponding to the termination position in the first number of subintervals, and other subintervals between the first subinterval and the second subinterval to form the global information subvector;
and merging the rest subintervals after the second subinterval to form the local information subvector.
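As a small sketch of the chunk-wise division described above (the vector values, the chunk count, and the initial and termination positions are hypothetical; in practice they would come from the gating vector):

import numpy as np

word_vector = np.arange(12.0)        # a 12-dimensional word vector
first_number = 4                     # vector dimension of the gating vector (number of chunks)
chunks = np.split(word_vector, first_number)  # equal subintervals, in element order

initial_pos, termination_pos = 1, 2  # positions of the first/last elements of the non-zero interval
global_sub = np.concatenate(chunks[initial_pos:termination_pos + 1])  # chunks i..j merged
local_sub = np.concatenate(chunks[termination_pos + 1:])              # remaining chunks after j

print(global_sub)   # elements at indices 3..8 of the word vector
print(local_sub)    # elements at indices 9..11 of the word vector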
In some embodiments, the attention calculation module is further to:
For each word in the text to be processed, the global information sub-vector of the corresponding word and the word vector of each word are used as input values and input into a self-attention model;
and calculating the attention value of the corresponding word through the self-attention model.
In some embodiments, the apparatus further comprises:
the feature vector dividing module is used for dividing the feature vector to at least form a global feature sub-vector and a local feature sub-vector of the feature vector;
the first attention calculating module is used for carrying out attention calculation on the text to be processed through the global feature sub-vector to obtain a text attention value of the text to be processed;
the first accumulation processing module is used for carrying out accumulation processing on the local feature sub-vector and the text attention value to obtain a weighted text vector of the text to be processed;
correspondingly, the processing module is further configured to:
and carrying out text processing on the text to be processed by adopting the weighted text vector.
It should be noted that, the description of the apparatus according to the embodiment of the present application is similar to the description of the embodiment of the method described above, and has similar beneficial effects as the embodiment of the method, so that a detailed description is omitted. For technical details not disclosed in the present apparatus embodiment, please refer to the description of the method embodiment of the present application for understanding.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the method according to the embodiment of the present application.
Embodiments of the present application provide a storage medium having stored therein executable instructions which, when executed by a processor, cause the processor to perform a method provided by embodiments of the present application, for example, as shown in fig. 4.
In some embodiments, the storage medium may be a computer-readable storage medium, such as a ferroelectric memory (FRAM, Ferroelectric Random Access Memory), a read-only memory (ROM, Read Only Memory), a programmable read-only memory (PROM, Programmable Read Only Memory), an erasable programmable read-only memory (EPROM, Erasable Programmable Read Only Memory), an electrically erasable programmable read-only memory (EEPROM, Electrically Erasable Programmable Read Only Memory), a flash memory, a magnetic surface memory, an optical disk, or a compact disk read-only memory (CD-ROM, Compact Disk-Read Only Memory), among others; it may also be various devices including one of or any combination of the above memories.
In some embodiments, the executable instructions may be in the form of programs, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, the executable instructions may, but need not, correspond to files in a file system, may be stored as part of a file that holds other programs or data, such as in one or more scripts in a hypertext markup language (html, hyper Text Markup Language) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code). As an example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices located at one site or, alternatively, distributed across multiple sites and interconnected by a communication network.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (10)

1. A text processing method, comprising:
dividing word vectors of each word in a text to be processed to at least form a global information sub-vector and a local information sub-vector of the word vectors;
performing attention calculation on the corresponding words through the global information sub-vector of each word to obtain attention values of the corresponding words;
accumulating the local information sub-vectors of the corresponding words and the attention value to obtain weighted word vectors of the corresponding words;
combining the weighted word vectors of at least one word in the text to be processed to form a combined vector;
and determining the merging vector as a feature vector of the text to be processed, and adopting the feature vector to process the text to be processed.
2. The method of claim 1, wherein the dividing the word vector for each word in the text to be processed to form at least a global information sub-vector and a local information sub-vector for the word vector comprises:
Determining a gating vector, wherein the gating vector at least comprises a non-zero interval;
determining the non-zero interval as a global position interval;
determining a subinterval positioned behind the non-zero interval in the gating vector as a local position interval;
dividing the word vector of each word according to the global position interval and the local position interval to at least form a global information sub-vector and a local information sub-vector of the word vector.
3. The method of claim 2, wherein the determining a gating vector comprises:
acquiring a first gating vector and a second gating vector; the sum of all elements of the first gating vector is 1, and the elements in the first gating vector are arranged in a sequential increasing order; the sum of all elements of the second gating vector is 1, and the elements in the second gating vector are arranged in a descending order; the dimension of the first gating vector is the same as the dimension of the second gating vector;
sequentially multiplying the element of each position in the first gating vector with the element of the corresponding position in the second gating vector to obtain the product of the corresponding positions;
And adding the product of each position to a new vector in sequence according to the sequence of each position in the first gating vector, and generating the gating vector.
4. The method of claim 2, wherein said dividing the word vector for each word by the global position interval and the local position interval forms at least a global information sub-vector and a local information sub-vector of the word vector, comprising:
determining the position of the first element in the global position interval in the gating vector as an initial position;
determining the position of the last element in the global position interval in the gating vector as a termination position;
dividing the word vector of each word according to the initial position and the end position to at least form a global information sub-vector and a local information sub-vector of the word vector.
5. The method according to claim 4, wherein the method further comprises:
acquiring a first number corresponding to a vector dimension of the gating vector;
equally dividing the word vector into the first number of subintervals according to the sequence of elements in the word vector of each word; wherein each of the first number of subintervals corresponds in turn to a position in the gating vector;
The dividing the word vector of each word according to the initial position and the end position, at least forming a global information sub-vector and a local information sub-vector of the word vector, including:
combining a first subinterval corresponding to the initial position in the first number of subintervals, a second subinterval corresponding to the termination position in the first number of subintervals, and other subintervals between the first subinterval and the second subinterval to form the global information subvector;
and merging the rest subintervals after the second subinterval to form the local information subvector.
6. The method according to claim 1, wherein said performing attention calculation on the corresponding word by the global information sub-vector of each word to obtain an attention value of the corresponding word includes:
for each word in the text to be processed, the global information sub-vector of the corresponding word and the word vector of each word are used as input values and input into a self-attention model;
and calculating the attention value of the corresponding word through the self-attention model.
7. The method according to any one of claims 1 to 6, further comprising:
Dividing the feature vector to at least form a global feature sub-vector and a local feature sub-vector of the feature vector;
performing attention calculation on the text to be processed through the global feature sub-vector to obtain a text attention value of the text to be processed;
accumulating the local feature sub-vectors and the text attention value to obtain a weighted text vector of the text to be processed;
correspondingly, the text processing for the text to be processed by adopting the feature vector comprises the following steps:
and carrying out text processing on the text to be processed by adopting the weighted text vector.
8. A text processing apparatus, comprising:
the division module is used for dividing word vectors of each word in the text to be processed to at least form a global information sub-vector and a local information sub-vector of the word vectors;
the attention calculating module is used for carrying out attention calculation on the corresponding words through the global information sub-vector of each word to obtain the attention value of the corresponding word;
the accumulation processing module is used for carrying out accumulation processing on the local information sub-vector of the corresponding word and the attention value to obtain a weighted word vector of the corresponding word;
The merging module is used for merging the weighted word vectors of at least one word in the text to be processed to form a merged vector;
and the processing module is used for determining the merging vector as the characteristic vector of the text to be processed and adopting the characteristic vector to process the text to be processed.
9. A text processing apparatus, comprising:
a memory for storing executable instructions; a processor for implementing the text processing method of any of claims 1 to 7 when executing executable instructions stored in said memory.
10. A computer readable storage medium, characterized in that executable instructions are stored for causing a processor to execute the executable instructions for implementing the text processing method of any one of claims 1 to 7.
CN202010944900.7A 2020-09-10 2020-09-10 Text processing method, device, equipment and computer readable storage medium Active CN112069813B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010944900.7A CN112069813B (en) 2020-09-10 2020-09-10 Text processing method, device, equipment and computer readable storage medium


Publications (2)

Publication Number Publication Date
CN112069813A CN112069813A (en) 2020-12-11
CN112069813B true CN112069813B (en) 2023-10-13

Family

ID=73663341

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010944900.7A Active CN112069813B (en) 2020-09-10 2020-09-10 Text processing method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN112069813B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112287698B (en) * 2020-12-25 2021-06-01 北京百度网讯科技有限公司 Chapter translation method and device, electronic equipment and storage medium
CN113297835B (en) * 2021-06-24 2024-03-29 中国平安人寿保险股份有限公司 Text similarity calculation method, device, equipment and storage medium


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018232699A1 (en) * 2017-06-22 2018-12-27 腾讯科技(深圳)有限公司 Information processing method and related device

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109086423A (en) * 2018-08-08 2018-12-25 北京神州泰岳软件股份有限公司 A kind of text matching technique and device
CN111368536A (en) * 2018-12-07 2020-07-03 北京三星通信技术研究有限公司 Natural language processing method, apparatus and storage medium therefor
CN111460248A (en) * 2019-01-19 2020-07-28 北京嘀嘀无限科技发展有限公司 System and method for online-to-offline services
CN109978060A (en) * 2019-03-28 2019-07-05 科大讯飞华南人工智能研究院(广州)有限公司 A kind of training method and device of natural language element extraction model
CN110263122A (en) * 2019-05-08 2019-09-20 北京奇艺世纪科技有限公司 A kind of keyword acquisition methods, device and computer readable storage medium
CN110096711A (en) * 2019-05-09 2019-08-06 中国科学技术大学 The natural language semantic matching method of the concern of the sequence overall situation and local dynamic station concern
CN110597991A (en) * 2019-09-10 2019-12-20 腾讯科技(深圳)有限公司 Text classification method and device, computer equipment and storage medium
CN111144097A (en) * 2019-12-25 2020-05-12 华中科技大学鄂州工业技术研究院 Modeling method and device for emotion tendency classification model of dialog text
CN111639186A (en) * 2020-06-05 2020-09-08 同济大学 Multi-class multi-label text classification model and device dynamically embedded with projection gate

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
Jin Zheng et al.A Hybrid Bidirectional Recurrent Convolutional Neural Network Attention-Based Model for Text Classification.IEEE Access.2019,(第7期),106673 - 106685. *
Matthew E. Peters et al.Dissecting Contextual Word Embeddings: Architecture and Representation.https://arxiv.org/abs/1808.08949.2018,1-17. *
Victor Zhong et al.Global-Locally Self-Attentive Dialogue State Tracker.Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.2018,1458–1467. *
Yale Song et al.Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval.Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019,1979-1988. *
Hou Zhenzhen. Research on semantic understanding methods for short texts based on deep learning. China Masters' Theses Full-text Database, Information Science and Technology, 2020, (1), I138-2684. *
Wu Qiong. Research on image-text sentiment analysis technology based on convolutional neural networks. China Masters' Theses Full-text Database, Information Science and Technology, 2018, (2), I138-2844. *
Wang Yonggui et al. word2vec-ACV: a word vector generation model for the contextual meaning of OOV words. Application Research of Computers, 2018, 36(6), 1623-1628. *
Tao Yongcai et al. A text classification model combining squeeze-and-excitation blocks and CNN. Journal of Chinese Computer Systems, 2020, 41(9), 1925-1929. *


Similar Documents

Publication Publication Date Title
Uc-Cetina et al. Survey on reinforcement learning for language processing
CN108780445B (en) Parallel hierarchical model for machine understanding of small data
CN112131366B (en) Method, device and storage medium for training text classification model and text classification
WO2022007823A1 (en) Text data processing method and device
Nguyen et al. Recurrent neural network-based models for recognizing requisite and effectuation parts in legal texts
CN110704576B (en) Text-based entity relationship extraction method and device
CN111930942B (en) Text classification method, language model training method, device and equipment
CN114565104A (en) Language model pre-training method, result recommendation method and related device
CN112288075A (en) Data processing method and related equipment
CN111898374B (en) Text recognition method, device, storage medium and electronic equipment
US20210248450A1 (en) Sorting attention neural networks
CN112364660A (en) Corpus text processing method and device, computer equipment and storage medium
CN112069813B (en) Text processing method, device, equipment and computer readable storage medium
CN113761868B (en) Text processing method, text processing device, electronic equipment and readable storage medium
CN110781302A (en) Method, device and equipment for processing event role in text and storage medium
Nguyen et al. Ontology-based integration of knowledge base for building an intelligent searching chatbot.
CN111898636A (en) Data processing method and device
CN113657105A (en) Medical entity extraction method, device, equipment and medium based on vocabulary enhancement
Thomas et al. Chatbot using gated end-to-end memory networks
CN113705191A (en) Method, device and equipment for generating sample statement and storage medium
CN115438149A (en) End-to-end model training method and device, computer equipment and storage medium
CN116541493A (en) Interactive response method, device, equipment and storage medium based on intention recognition
Madureira et al. An overview of natural language state representation for reinforcement learning
Ganhotra et al. Knowledge-incorporating esim models for response selection in retrieval-based dialog systems
CN117453925A (en) Knowledge migration method, apparatus, device, readable storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant