WO2021051574A1 - Method and system for labeling English text sequences, and computer device

Method and system for labeling English text sequences, and computer device

Info

Publication number
WO2021051574A1
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
layer
word
input
target sentence
Prior art date
Application number
PCT/CN2019/117771
Other languages
English (en)
Chinese (zh)
Inventor
孙超
于凤英
王健宗
韩茂琨
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021051574A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the embodiments of the present application relate to the field of computer data processing, and in particular to a method, system, computer equipment, and non-volatile computer-readable storage medium for labeling English text sequences based on neural networks.
  • NLP: Natural Language Processing
  • In NLP, the sequence labeling model is one of the most common and most widely used models; its output is a label sequence.
  • Tags are interrelated, forming structural information between them. Using this structural information, the sequence labeling model can quickly and effectively predict the label corresponding to each word in the text sequence (for example, a person's name or a place name).
  • Common sequence labeling models include the Multilayer Perceptron (MLP), the Auto Encoder (AE), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN).
  • MLP: Multilayer Perceptron
  • AE: Auto Encoder
  • CNN: Convolutional Neural Networks
  • RNN: Recurrent Neural Networks
  • the purpose of the embodiments of the present application is to provide a neural network-based English text sequence labeling method, system, computer equipment, and non-volatile computer-readable storage medium, which can effectively improve the labeling accuracy.
  • an embodiment of the present application provides a method for labeling English text sequences based on a neural network, and the method includes:
  • extract the word information, character information, and morphological features of the target sentence, and input the word information, character information, and morphological features into the first BI-LSTM layer and the first dropout layer to obtain the first output matrix O^1_{m×d};
  • obtain the character matrix I_{(k×m)×1} of the target sentence through the character embedding layer, transform the character matrix I_{(k×m)×1} into a k×m×d-dimensional matrix through the second word embedding layer, input the k×m×d-dimensional matrix into the second BI-LSTM layer to obtain the fourth matrix, and input the fourth matrix into the second dropout layer to obtain the second output matrix O^2_{m×d};
  • linearly add the first output matrix O^1_{m×d}, the second output matrix O^2_{m×d}, the third output matrix O^3_{m×d}, and the fourth output matrix O^4_{m×d} to obtain the linear addition result O = Σ_{i=1}^{4} ω_i O^i_{m×d}, where ω_i is the weight coefficient corresponding to O^i_{m×d};
  • an embodiment of the present application also provides an English text sequence labeling system based on a neural network, including:
  • the first output module is used to extract the word information, character information, and morphological features of the target sentence, and to input them into the first BI-LSTM layer and the first dropout layer to obtain the first output matrix O^1_{m×d};
  • the second output module is used to obtain the character matrix I_{(k×m)×1} of the target sentence through the character embedding layer, convert the character matrix I_{(k×m)×1} into a k×m×d-dimensional matrix through the second word embedding layer, input the k×m×d-dimensional matrix into the second BI-LSTM layer to obtain the fourth matrix, and input the fourth matrix into the second dropout layer to obtain the second output matrix O^2_{m×d};
  • the third output module is used to extract the semantic information of the target sentence and input the semantic information into the third BI-LSTM layer and the third dropout layer to obtain the third output matrix O^3_{m×d};
  • the fourth output module is used to input the binary information extracted by the convolutional layer into the fourth BI-LSTM layer and the fourth dropout layer to obtain the fourth output matrix O^4_{m×d};
  • the linear calculation module is used to linearly add the first output matrix O^1_{m×d}, the second output matrix O^2_{m×d}, the third output matrix O^3_{m×d}, and the fourth output matrix O^4_{m×d} to obtain the linear addition result O = Σ_{i=1}^{4} ω_i O^i_{m×d}, where ω_i is the weight coefficient corresponding to O^i_{m×d};
  • the fifth output module is used to input the linear addition result O into the second LSTM layer and record the output at each time step to obtain the fifth output matrix, where i is the sequence number of each word in the target sentence and z is the input dimension of the second LSTM layer;
  • an embodiment of the present application further provides a computer device, the computer device including a memory, a processor, and computer-readable instructions that are stored in the memory and executable on the processor; when the computer-readable instructions are executed by the processor, the following steps are implemented:
  • extract the word information, character information, and morphological features of the target sentence, and input the word information, character information, and morphological features into the first BI-LSTM layer and the first dropout layer to obtain the first output matrix O^1_{m×d};
  • obtain the character matrix I_{(k×m)×1} of the target sentence through the character embedding layer, transform the character matrix I_{(k×m)×1} into a k×m×d-dimensional matrix through the second word embedding layer, input the k×m×d-dimensional matrix into the second BI-LSTM layer to obtain the fourth matrix, and input the fourth matrix into the second dropout layer to obtain the second output matrix O^2_{m×d};
  • linearly add the first output matrix O^1_{m×d}, the second output matrix O^2_{m×d}, the third output matrix O^3_{m×d}, and the fourth output matrix O^4_{m×d} to obtain the linear addition result O = Σ_{i=1}^{4} ω_i O^i_{m×d}, where ω_i is the weight coefficient corresponding to O^i_{m×d};
  • the embodiments of the present application also provide a non-volatile computer-readable storage medium that stores computer-readable instructions; the computer-readable instructions can be executed by at least one processor, so that the at least one processor executes the following steps:
  • extract the word information, character information, and morphological features of the target sentence, and input the word information, character information, and morphological features into the first BI-LSTM layer and the first dropout layer to obtain the first output matrix O^1_{m×d};
  • obtain the character matrix I_{(k×m)×1} of the target sentence through the character embedding layer, transform the character matrix I_{(k×m)×1} into a k×m×d-dimensional matrix through the second word embedding layer, input the k×m×d-dimensional matrix into the second BI-LSTM layer to obtain the fourth matrix, and input the fourth matrix into the second dropout layer to obtain the second output matrix O^2_{m×d};
  • linearly add the first output matrix O^1_{m×d}, the second output matrix O^2_{m×d}, the third output matrix O^3_{m×d}, and the fourth output matrix O^4_{m×d} to obtain the linear addition result O = Σ_{i=1}^{4} ω_i O^i_{m×d}, where ω_i is the weight coefficient corresponding to O^i_{m×d};
  • The neural network-based English text sequence labeling method, system, computer device, and non-volatile computer-readable storage medium extract features of the target sentence along different dimensions, for example semantic, binary, character-level, and morphological feature information. These features are linearly weighted to obtain comprehensive features, and the label sequence of the target sentence is output from these comprehensive features. Because feature information from multiple dimensions is taken into account simultaneously, a higher labeling accuracy can be ensured.
  • FIG. 1 is a flowchart of Embodiment 1 of a method for labeling English text sequences based on a neural network in this application.
  • FIG. 2 is a schematic diagram of the program modules of Embodiment 2 of the neural network-based English text sequence labeling system of the present application.
  • FIG. 3 is a schematic diagram of the hardware structure of the third embodiment of the computer equipment of this application.
  • the embedding layer is used to convert each word in the target sentence into a fixed-size word vector, or convert each character into a fixed-size character vector.
  • the LSTM layer is a long short-term memory network layer, which outputs the corresponding information labels (such as semantic labels or part-of-speech labels) for each character or each word according to the sequence of characters or words in the target sentence.
  • information labels: for example, semantic labels, part-of-speech labels, etc.
  • the dropout layer is a network layer set up to prevent the neural network from overfitting.
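  • As an illustration only (not the patent's implementation), the three layer types above map onto standard deep-learning primitives. The following PyTorch sketch shows an embedding layer feeding a BI-LSTM followed by dropout; the vocabulary size, dimensions, and dropout rate are arbitrary assumptions:

```python
import torch
import torch.nn as nn

# Minimal sketch of the three layer types defined above, with illustrative sizes.
vocab_size, word_dim, hidden_dim = 10000, 128, 64

embedding = nn.Embedding(vocab_size, word_dim)          # word id -> fixed-size vector
bi_lstm = nn.LSTM(word_dim, hidden_dim, batch_first=True,
                  bidirectional=True)                   # reads the sequence both ways
dropout = nn.Dropout(p=0.5)                             # regularization against overfitting

sentence = torch.randint(0, vocab_size, (1, 7))         # one sentence of m=7 word ids
vectors = embedding(sentence)                           # shape (1, 7, 128)
outputs, _ = bi_lstm(vectors)                           # shape (1, 7, 2*hidden_dim)
outputs = dropout(outputs)
```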
  • FIG. 1 shows a flowchart of the neural network-based English text sequence labeling method in the first embodiment of the present application. It can be understood that the flowchart in this method embodiment is not used to limit the order in which the steps are executed. The details are as follows.
  • Step S100: extract the word information, character information, and morphological features of the target sentence, and input the word information, character information, and morphological features into the first BI-LSTM layer and the first dropout layer to obtain the first output matrix O^1_{m×d}.
  • the purpose of extracting word information is to provide the word vector of each word in the target sentence as the basic information of the target sentence in this embodiment, and the subsequently extracted information is incremental information based on different dimensions.
  • the purpose of extracting character information is to predict the next character based on the context of a given character; it is used to obtain structural information between words. For example, "man" and "policeman" are related in meaning and exhibit structural similarity.
  • the purpose of extracting morphological features is to exploit the rich morphology of words, for example by obtaining different morphological information from the different suffixes and spellings of each word, and to use the obtained morphological information in word labeling to improve labeling accuracy.
  • each word may have different suffixes, and these differentiated suffixes can be considered as the morphological characteristics of these words. It may also be a prefix, etc., which is not limited in this embodiment.
  • step S100 includes the following steps S100A to S100D:
  • Step S100A, word information extraction step: obtain the first matrix W_{m×d} of the target sentence through the first word embedding layer.
  • the first word embedding layer is used to convert each word in the target sentence into a fixed-size word vector.
  • m is the number of words in the target sentence;
  • d is the word vector dimension of each word in the target sentence.
  • the target sentence is input into the first word embedding layer, and the m words in the target sentence are respectively mapped to word vectors through the first word embedding layer to obtain the first matrix W_{m×d} (i.e., the word vector matrix), where each word is mapped to a d-dimensional word vector.
  • each sentence is represented as a column vector I_{m×1}, where each element represents a word; the d-dimensional word vector corresponding to each element can be obtained through models such as word2vec, for example a 128-dimensional word vector.
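  • A small sketch of Step S100A under the word2vec suggestion above (the corpus, sentence, and d=128 are illustrative assumptions, not the patent's data):

```python
import numpy as np
from gensim.models import Word2Vec

# Map the m words of a target sentence to a word vector matrix W_{m x d}.
corpus = [["the", "policeman", "stopped", "the", "car"],
          ["a", "man", "walked", "home"]]
model = Word2Vec(corpus, vector_size=128, min_count=1, seed=0)

target_sentence = ["the", "man", "stopped"]             # column vector I_{m x 1} of words
W = np.stack([model.wv[w] for w in target_sentence])    # first matrix W_{m x d}
print(W.shape)                                          # (3, 128)
```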
  • Step S100B, character-level information extraction step: obtain the second matrix C_{m×n} of the target sentence through the character embedding layer and the first LSTM layer, where n is the character vector dimension of the characters in each word.
  • the character embedding layer is used to convert each letter in each word into a fixed-size character vector.
  • the first LSTM layer outputs the information label corresponding to each character according to the character sequence of the target sentence.
  • the step S100B may include steps S100B1 to S100B2. The details are as follows:
  • Step S100B1: divide each word in the target sentence into a k-dimensional column vector C_{k×1}, input C_{k×1} into the randomly initialized character embedding layer, and let the character embedding layer output a k×n matrix, where k is the length of the word and n is the vector dimension. It is not difficult to understand that each word is represented as a k-dimensional column vector C_{k×1}, where each element represents a character; the n-dimensional character vector corresponding to each character is obtained, thereby yielding the k×n matrix.
  • Step S100B2: input the k×n matrix into the first LSTM layer, and use the last hidden state C_{1×n} of the first LSTM layer as the vector representation of the corresponding word, so that the target sentence containing m words is converted into the second matrix C_{m×n}.
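  • A hedged sketch of Steps S100B1 to S100B2 (alphabet size, n, and the example word ids are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Each word becomes a k x n matrix of character vectors; the last hidden state
# of an LSTM over the characters is the word's n-dimensional representation.
n_chars, n = 26, 30                                     # alphabet size, char vector dim
char_embedding = nn.Embedding(n_chars, n)               # randomly initialized
char_lstm = nn.LSTM(n, n, batch_first=True)

def word_vector(char_ids):                              # char_ids plays the role of C_{k x 1}
    chars = torch.tensor(char_ids).unsqueeze(0)         # (1, k)
    vecs = char_embedding(chars)                        # (1, k, n)
    _, (h_last, _) = char_lstm(vecs)                    # last hidden state C_{1 x n}
    return h_last.squeeze(0)                            # (1, n)

sentence = [[12, 0, 13], [2, 0, 17]]                    # m=2 words, e.g. "man", "car"
C = torch.cat([word_vector(w) for w in sentence])       # second matrix C_{m x n}
print(C.shape)                                          # torch.Size([2, 30])
```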
  • Step S100C, morphological information extraction step: obtain the morphological features of each word in the target sentence, and establish a one-hot vector SUV_{1×10} for each word to obtain the third matrix SUV_{m×10} of the target sentence.
  • the step S100C may include steps S100C1 to S100C4. The details are as follows:
  • Step S100C1: pre-calculate and select the 10 suffixes with the highest frequency in the training data set, and collect multiple preselected words ending with these suffixes.
  • Step S100C2: determine whether the suffix of each preselected word is a real suffix according to the part of speech and frequency of each preselected word.
  • Step S100C3: record the part of speech and frequency of each of the preselected words.
  • Step S100C4: establish a one-hot vector SUV_{1×10} for each preselected word:
  • if the suffix of the corresponding preselected word is determined to be a real suffix, record the <preselected word, suffix> pair, and establish a one-hot vector SUV_{1×10} for the preselected word based on the position of the suffix among the 10 suffixes.
  • the target sentence includes m words and therefore has m one-hot vectors, which form the third matrix SUV_{m×10}.
  • the morphological features are the suffix and spelling features of the words of interest.
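  • A hedged sketch of Steps S100C1 to S100C4; the 3-character suffix length, the tiny training set, and the "known ending counts as a real suffix" shortcut are assumptions, since the patent instead uses part of speech and frequency to validate suffixes:

```python
from collections import Counter

# Pick the 10 most frequent word endings in a training set and encode each
# word as a one-hot vector SUV_{1 x 10} over those suffixes.
training_words = ["teacher", "worker", "quickly", "happily", "working",
                  "playing", "painter", "slowly", "reading", "singer"]
suffixes = [s for s, _ in Counter(w[-3:] for w in training_words).most_common(10)]

def suv_vector(word):
    vec = [0] * 10                                      # one-hot vector SUV_{1 x 10}
    if word[-3:] in suffixes:                           # simplification of the real-suffix test
        vec[suffixes.index(word[-3:])] = 1
    return vec

# Words without a known suffix (e.g. "table") keep an all-zero vector.
SUV = [suv_vector(w) for w in ["singer", "madly", "table"]]  # third matrix SUV_{m x 10}
```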
  • Step S100D: concatenate the first matrix W_{m×d}, the second matrix C_{m×n}, and the third matrix SUV_{m×10}, and input the concatenated matrix [W_{m×d}, C_{m×n}, SUV_{m×10}] into the first BI-LSTM layer and the first dropout layer to obtain the first output matrix O^1_{m×d}.
  • In the first output matrix O^1_{m×d}, m is the number of words and d is the vector dimension of each word;
  • BI-LSTM: Bi-directional Long Short-Term Memory
  • the first layer is the input layer;
  • the second and third layers are the BI-LSTM layers;
  • the last layer is the output layer.
  • the BI-LSTM layer is composed of two LSTMs: one processes the sequence in its input order, and the other processes it in the reverse order.
  • o_t ∈ [0,1] represents the selection weight of the node cell memory information at time t;
  • b_o is the bias of the output gate;
  • W_o is the weight matrix of the output gate;
  • x_t represents the input to the LSTM layer at time t, i.e., the vector corresponding to one of the words in the concatenated matrix [W_{m×d}, C_{m×n}, SUV_{m×10}] in this embodiment; h_t is the output vector of the LSTM layer at time t.
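  • The symbols above are those of the standard LSTM output gate; in the textbook formulation (stated here for reference, not quoted from the patent) they combine as:

```latex
o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right), \qquad
h_t = o_t \odot \tanh(c_t)
```

  • Here σ is the sigmoid function, c_t is the cell state at time t, and ⊙ denotes element-wise multiplication.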
  • Step S200, selective information extraction step: obtain the character matrix I_{(k×m)×1} of the target sentence through the character embedding layer, convert the character matrix I_{(k×m)×1} into a k×m×d-dimensional matrix through the second word embedding layer, input the k×m×d-dimensional matrix into the second BI-LSTM layer to obtain the fourth matrix, and input the fourth matrix into the second dropout layer to obtain the second output matrix O^2_{m×d}.
  • the second BI-LSTM layer outputs the information label corresponding to each character according to the word sequence of the target sentence.
  • k is the length of each word
  • m is the number of words in the target sentence
  • d is the word vector dimension of the word.
  • I_{(k×m)×1} represents the character matrix of each sentence.
  • the matrix is formed by passing the sentence through the character embedding layer, and it contains both context information and character information. It is converted into a k×m×d-dimensional matrix through the second word embedding layer and input into the second BI-LSTM layer to obtain the fourth matrix, which the second dropout layer then reduces to the second output matrix O^2_{m×d}.
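  • A loose sketch of Step S200; the exact tensor bookkeeping is not fully specified in the text, so the word-boundary selection below is an assumption made so that this branch also ends in an m×d output, matching the other branches:

```python
import torch
import torch.nn as nn

k, m, d, n_chars = 5, 7, 128, 100                       # illustrative sizes

char_ids = torch.randint(0, n_chars, (1, k * m))        # flattened I_{(k x m) x 1}
embed = nn.Embedding(n_chars, d)                        # second embedding layer
bi_lstm = nn.LSTM(d, d // 2, batch_first=True, bidirectional=True)
dropout = nn.Dropout(0.5)

x = embed(char_ids)                                     # (1, k*m, d)
h, _ = bi_lstm(x)                                       # (1, k*m, d) since 2*(d//2) = d
h_words = h[:, k - 1::k, :]                             # one state at each word boundary
O2 = dropout(h_words).squeeze(0)                        # second output matrix, (m, d)
print(O2.shape)                                         # torch.Size([7, 128])
```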
  • Step S300: extract the semantic information of the target sentence and the binary information extracted based on the convolutional layer; input the semantic information into the third BI-LSTM layer and the third dropout layer to obtain the third output matrix O^3_{m×d}, and input the binary information into the fourth BI-LSTM layer and the fourth dropout layer to obtain the fourth output matrix O^4_{m×d}.
  • By extracting semantic information, each word of the target sentence is labeled from the semantic dimension.
  • the purpose of extracting binary information is to extract the depth features of the target sentence, which can then be used for information labeling.
  • the step S300 may include steps S300A to S300B. The details are as follows:
  • Step S300A, semantic information extraction step: label each word of the target sentence through the semantic embedding layer, input each labeled word into the third BI-LSTM layer to obtain the fifth matrix S_{m×d}, and input the fifth matrix S_{m×d} into the third dropout layer to obtain the third output matrix O^3_{m×d}.
  • the pre-trained AdaGram model can be used to initialize the semantic embedding layer
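  • A sketch of initializing the semantic embedding layer from pre-trained vectors. AdaGram ships its own loader, which is not reproduced here; `pretrained` stands in for whatever matrix such a model produces (a hypothetical 4-word, 8-dimensional example):

```python
import torch
import torch.nn as nn

pretrained = torch.randn(4, 8)                          # placeholder: one row per word
semantic_embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)

word_ids = torch.tensor([0, 2, 3])                      # target sentence as word ids
S = semantic_embedding(word_ids)                        # fifth matrix S_{m x d}
```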
  • Step S300B, binary information extraction step: obtain the sixth matrix B_{m×d} of the target sentence through the third word embedding layer and the convolutional layer, and input the sixth matrix B_{m×d} into the fourth BI-LSTM layer and the fourth dropout layer to obtain the fourth output matrix O^4_{m×d}.
  • the binary information is obtained by performing convolution operations through the convolutional layer to produce convolution feature maps; the features in the convolution feature maps are then input into the recurrent neural network, which outputs the corresponding information labels.
  • the step S300B may include steps S300B1 to S300B4. The details are as follows:
  • In step S300B1, the m×d word vector matrix of the target sentence is obtained through the third word embedding layer.
  • In step S300B2, a convolution operation is performed on the m×d-dimensional word vector matrix through the convolutional layer to obtain d convolution feature maps of size m×1.
  • The width of each convolution feature map is 1, and its length is m.
  • The convolution kernel size is 2×d, the number of words is m, and the number of convolution kernels is also d.
  • These symbols belong to a convolution of the form c_{ij} = f(w_{ij} ∗ m_i + b_i), where:
  • c_{ij} is the feature value of the j-th element in the i-th convolution feature map;
  • w_{ij} is the word vector submatrix covered by the convolution kernel at the j-th position of the i-th convolution feature map;
  • m_i is the convolution kernel used to calculate the i-th convolution feature map;
  • b_i is the bias term used to calculate the i-th convolution feature map;
  • f is a nonlinear activation function, such as the ReLU function.
  • Step S300B3: assign the j-th element of each convolution feature map to the j-th input vector to obtain m d-dimensional input vectors (i.e., B_{m×d}), 1 ≤ j ≤ m, 1 ≤ i ≤ d, where the arrangement order of the elements in the j-th input vector is determined by the i value of the convolution feature map in which each element is located;
  • Step S300B4: input B_{m×d} into the fourth BI-LSTM layer in order, and output the fourth output matrix O^4_{m×d} through the fourth dropout layer.
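  • A sketch of Steps S300B1 to S300B4; the right-side padding is an assumption, since the text does not say how a 2×d kernel sliding over m words yields feature maps of length m:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# d kernels of width 2 slide over the m x d word vector matrix, producing d
# feature maps of length m; stacking the j-th value of every map gives the
# j-th row of the sixth matrix B_{m x d}.
m, d = 7, 128
word_vectors = torch.randn(1, d, m)                     # (batch, channels=d, length=m)
conv = nn.Conv1d(in_channels=d, out_channels=d, kernel_size=2)

padded = F.pad(word_vectors, (0, 1))                    # pad one step so length stays m
feature_maps = F.relu(conv(padded))                     # (1, d, m): d maps, length m
B = feature_maps.squeeze(0).transpose(0, 1)             # sixth matrix B_{m x d}
print(B.shape)                                          # torch.Size([7, 128])
```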
  • Step S400: linearly add the first output matrix O^1_{m×d}, the second output matrix O^2_{m×d}, the third output matrix O^3_{m×d}, and the fourth output matrix O^4_{m×d} to obtain the linear addition result O = Σ_{i=1}^{4} ω_i O^i_{m×d}, where ω_i is the weight coefficient corresponding to O^i_{m×d}.
  • the feature information extracted from each dimension in steps S100 to S300 is linearly weighted to obtain the comprehensive features, and these comprehensive features serve as the basis of the output label sequence, specifically as follows.
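  • In code, the weighting of Step S400 is a one-liner; a minimal numpy sketch with illustrative shapes and arbitrary weights (in practice the ω_i would be chosen or learned):

```python
import numpy as np

O1, O2, O3, O4 = (np.random.randn(7, 128) for _ in range(4))  # the four m x d outputs
weights = [0.4, 0.2, 0.2, 0.2]                                # illustrative ω_1..ω_4
O = sum(w * Oi for w, Oi in zip(weights, (O1, O2, O3, O4)))   # O = Σ ω_i O^i
```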
  • Step S500: input the linear addition result O into the second LSTM layer and record the output at each time step to obtain the fifth output matrix, where i is the sequence number of each word in the target sentence and z is the input dimension of the second LSTM layer.
  • Step S600: use the fifth output matrix as the input sequence of the conditional random field (CRF).
  • The tag sequence Y = (y_1, y_2, ..., y_m) is output through the CRF.
  • A is the state transition matrix;
  • a_{i,j} represents the probability of transitioning from the i-th label to the j-th label;
  • by scoring candidate label sequences with the state transition matrix, the best output tag sequence can be obtained.
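  • For reference, the state transition matrix A enters the standard BI-LSTM-CRF sequence score (a textbook formulation, not quoted from the patent); writing P_{i,y_i} for the network's score of assigning label y_i to the i-th word:

```latex
s(X, y) = \sum_{i=1}^{m} \left( A_{y_{i-1},\, y_i} + P_{i,\, y_i} \right),
\qquad
p(y \mid X) = \frac{\exp s(X, y)}{\sum_{y'} \exp s(X, y')}
```

  • The best tag sequence is the y that maximizes s(X, y), typically found with Viterbi decoding.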
  • FIG. 2 shows a schematic diagram of the program modules of the second embodiment of the English text sequence labeling system based on neural network of the present application.
  • the neural network-based English text sequence labeling system 20 may include or be divided into one or more program modules, which are stored in a storage medium and executed by one or more processors to complete this application and realize the above neural network-based English text sequence labeling method.
  • the program module referred to in the embodiments of the present application refers to a series of computer-readable instruction segments that can complete specific functions. The following description will specifically introduce the functions of each program module in this embodiment:
  • the first output module 200 is used to extract the word information, character information, and morphological features of the target sentence, and to input them into the first BI-LSTM layer and the first dropout layer to obtain the first output matrix O^1_{m×d};
  • the second output module 202 is configured to obtain the character matrix I_{(k×m)×1} of the target sentence through the character embedding layer, convert the character matrix I_{(k×m)×1} into a k×m×d-dimensional matrix through the second word embedding layer, input the k×m×d-dimensional matrix into the second BI-LSTM layer to obtain the fourth matrix, and input the fourth matrix into the second dropout layer to obtain the second output matrix O^2_{m×d};
  • the third output module 204 is used to extract the semantic information of the target sentence and input the semantic information into the third BI-LSTM layer and the third dropout layer to obtain the third output matrix O^3_{m×d};
  • the fourth output module 206 is used to input the binary information extracted by the convolutional layer into the fourth BI-LSTM layer and the fourth dropout layer to obtain the fourth output matrix O^4_{m×d};
  • the linear calculation module 208 is used to linearly add the first output matrix O^1_{m×d}, the second output matrix O^2_{m×d}, the third output matrix O^3_{m×d}, and the fourth output matrix O^4_{m×d} to obtain the linear addition result O = Σ_{i=1}^{4} ω_i O^i_{m×d}, where ω_i is the weight coefficient corresponding to O^i_{m×d};
  • the fifth output module 210 is used to input the linear addition result O into the second LSTM layer and record the output at each time step to obtain the fifth output matrix, where i is the sequence number of each word in the target sentence and z is the input dimension of the second LSTM layer;
  • the first output module 200 is also used for:
  • the k×n matrix is input into the first LSTM layer, and the last hidden state C_{1×n} of the first LSTM layer is used as the vector representation of the corresponding word;
  • the target sentence containing m words is thereby converted into the second matrix C_{m×n}.
  • the first output module 200 is also used for:
  • determine, according to the part of speech and frequency of each preselected word, whether its suffix is a real suffix;
  • the third output module 204 is further configured to:
  • the fourth output module 206 is further configured to obtain the sixth matrix B_{m×d} of the target sentence through the third word embedding layer and the convolutional layer, and to input the sixth matrix B_{m×d} into the fourth BI-LSTM layer and the fourth dropout layer to obtain the fourth output matrix O^4_{m×d}.
  • the fourth output module 206 is also used for:
  • the j-th element in each convolution feature map is configured into the j-th input vector to obtain the input vector B_{m×d}, 1 ≤ j ≤ m, 1 ≤ i ≤ d, where the arrangement order of the elements in the j-th input vector is determined by the i value of the convolution feature map in which each element is located;
  • the computer device 2 is a device that can automatically perform numerical calculation and/or information processing in accordance with pre-set or stored instructions.
  • the computer device 2 may be a rack server, a blade server, a tower server, or a cabinet server (including an independent server or a server cluster composed of multiple servers).
  • the computer device 2 at least includes, but is not limited to, a memory 21, a processor 22, a network interface 23, and a neural network-based English text sequence labeling system 20, which can communicate with each other through a system bus, wherein:
  • the memory 21 includes at least one type of non-volatile computer-readable storage medium, which includes flash memory, hard disks, multimedia cards, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, etc.
  • the memory 21 may be an internal storage unit of the computer device 2, for example, a hard disk or a memory of the computer device 2.
  • the memory 21 may also be an external storage device of the computer device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card equipped on the computer device 2.
  • the memory 21 may also include both the internal storage unit of the computer device 2 and its external storage device.
  • the memory 21 is generally used to store the operating system and various application software installed in the computer device 2, such as the program code of the English text sequence labeling system 20 based on neural network in the fifth embodiment.
  • the memory 21 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 22 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips.
  • the processor 22 is generally used to control the overall operation of the computer device 2.
  • the processor 22 is used to run the program code or process data stored in the memory 21, for example, to run the English text sequence labeling system 20 based on a neural network to implement the neural network-based English text sequence labeling method of the first embodiment .
  • the network interface 23 may include a wireless network interface or a wired network interface, and the network interface 23 is generally used to establish a communication connection between the computer device 2 and other electronic devices.
  • the network interface 23 is used to connect the computer device 2 with an external terminal through a network, and establish a data transmission channel and a communication connection between the computer device 2 and the external terminal.
  • the network may be the Intranet, the Internet, the Global System for Mobile communication (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, Wi-Fi, or another wireless or wired network.
  • FIG. 3 only shows the computer device 2 with components 20-23, but it should be understood that it is not required to implement all the components shown, and more or fewer components may be implemented instead.
  • the neural network-based English text sequence labeling system 20 stored in the memory 21 may also be divided into one or more program modules, and the one or more program modules are stored in the memory 21 , And executed by one or more processors (the processor 22 in this embodiment) to complete the application.
  • FIG. 2 shows a schematic diagram of the program modules of the second embodiment of the neural network-based English text sequence labeling system 20.
  • the neural network-based English text sequence labeling system 20 can be divided into the first output module 200, the second output module 202, the third output module 204, the fourth output module 206, the linear calculation module 208, the fifth output module 210, and the sixth output module 212.
  • the program module referred to in this application refers to a series of computer-readable instruction segments that can complete specific functions. The specific functions of the program modules 200-212 have been described in detail in the second embodiment, and will not be repeated here.
  • This embodiment also provides a non-volatile computer-readable storage medium, such as flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disk, a server, an app store, etc., on which computer-readable instructions are stored; the corresponding functions are realized when the instructions are executed by a processor.
  • the non-volatile computer-readable storage medium of this embodiment is used to store the neural network-based English text sequence labeling system 20; when executed by the processor, the following steps are implemented:
  • extract the word information, character information, and morphological features of the target sentence, and input the word information, character information, and morphological features into the first BI-LSTM layer and the first dropout layer to obtain the first output matrix O^1_{m×d};
  • obtain the character matrix I_{(k×m)×1} of the target sentence through the character embedding layer, transform the character matrix I_{(k×m)×1} into a k×m×d-dimensional matrix through the second word embedding layer, input the k×m×d-dimensional matrix into the second BI-LSTM layer to obtain the fourth matrix, and input the fourth matrix into the second dropout layer to obtain the second output matrix O^2_{m×d};
  • linearly add the first output matrix O^1_{m×d}, the second output matrix O^2_{m×d}, the third output matrix O^3_{m×d}, and the fourth output matrix O^4_{m×d} to obtain the linear addition result O = Σ_{i=1}^{4} ω_i O^i_{m×d}, where ω_i is the weight coefficient corresponding to O^i_{m×d};

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a neural network-based method for labeling English text sequences. The method comprises: extracting word information, character information, and morphological features of a target sentence and inputting them into a first BI-LSTM (Bi-directional Long Short-Term Memory) layer and a first dropout layer to obtain a first output matrix O^1_{m×d}; obtaining a second output matrix O^2_{m×d} by means of a fourth matrix, expression (I), corresponding to selective information; obtaining a third output matrix O^3_{m×d} by means of a fifth matrix S_{m×d} corresponding to semantic information; obtaining a fourth output matrix O^4_{m×d} by means of a sixth matrix B_{m×d} corresponding to binary information; performing linear addition on O^1_{m×d}, O^2_{m×d}, O^3_{m×d}, and O^4_{m×d} to obtain the linear addition result O = Σ_{i=1}^{4} ω_i O^i_{m×d}; inputting the linear addition result O into a second LSTM layer to obtain a fifth output matrix, expression (II); and taking expression (III) as the input sequence of a conditional random field (CRF) so as to output, through the CRF, a tag sequence Y = (y_1, y_2, ..., y_m), whereby labeling accuracy can be effectively improved.
PCT/CN2019/117771 2019-09-16 2019-11-13 Method and system for labeling English text sequences, and computer device WO2021051574A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910871720.8 2019-09-16
CN201910871720.8A CN110750965B (zh) 2019-09-16 English text sequence labeling method, system and computer device

Publications (1)

Publication Number Publication Date
WO2021051574A1 2021-03-25

Family

ID=69276480

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117771 WO2021051574A1 (fr) 2019-09-16 2019-11-13 Method and system for labeling English text sequences, and computer device

Country Status (2)

Country Link
CN (1) CN110750965B (fr)
WO (1) WO2021051574A1 (fr)


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112115688A (zh) * BIO-based web-side text labeling method and system
CN112183086B (zh) * English pronunciation liaison marking model based on sense-group labeling
CN112528610B (zh) * Data labeling method and apparatus, electronic device, and storage medium
CN114154493B (zh) * Short-message category recognition method and apparatus


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299262B (zh) * Textual entailment relation recognition method fusing multi-granularity information

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106569998A (zh) * Text named entity recognition method based on Bi-LSTM, CNN and CRF
WO2018105194A1 (fr) * Method and system for generating multi-relevance-level labels
CN108038103A (zh) * Method, apparatus and electronic device for word segmentation of text sequences
CN108268444A (zh) * Chinese word segmentation method based on bidirectional LSTM, CNN and CRF
CN108717409A (zh) * Sequence labeling method and apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
PENG ZONGHUI: "English Sequence Labeling Research Based on Neural Networks", ELECTRONIC TECHNOLOGY & INFORMATION SCIENCE CHINA MASTER’S THESES FULL-TEXT DATABASE, 11 March 2018 (2018-03-11), XP055796435 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112949320A (zh) * Sequence labeling method, apparatus, device and medium based on conditional random fields
CN112949320B (zh) * Sequence labeling method, apparatus, device and medium based on conditional random fields
CN113378547A (zh) * GCN-based method and apparatus for analyzing implicit relations in Chinese compound sentences
CN113378547B (zh) * GCN-based method and apparatus for analyzing implicit relations in Chinese compound sentences
CN113326698A (zh) * Entity relation detection method, model training method, and electronic device
CN113326698B (zh) * Entity relation detection method, model training method, and electronic device
CN114048368A (zh) * Method, apparatus and medium for extracting data from unstructured intelligence
CN113836929A (zh) * Named entity recognition method, apparatus, device and storage medium
CN114492451A (zh) * Text matching method, apparatus, electronic device and computer-readable storage medium
CN114492451B (zh) * Text matching method, apparatus, electronic device and computer-readable storage medium

Also Published As

Publication number Publication date
CN110750965B (zh) 2023-06-30
CN110750965A (zh) 2020-02-04

Similar Documents

Publication Publication Date Title
WO2021051574A1 (fr) Method and system for labeling English text sequences, and computer device
WO2021027533A1 (fr) Text semantic recognition method and apparatus, computer device, and storage medium
CN107797985B (zh) Method and apparatus for establishing a synonym identification model and identifying synonymous text
CN110457682B (zh) Part-of-speech tagging method for electronic medical records, model training method, and related apparatus
WO2020224106A1 (fr) Neural network-based text classification method and system, and computer device
WO2021121198A1 (fr) Semantic-similarity-based entity relation extraction method and apparatus, device, and medium
CN110162771B (zh) Event trigger word recognition method, apparatus, and electronic device
CN111985229A (zh) Sequence labeling method, apparatus, and computer device
WO2023134082A1 (fr) Training method and apparatus for an image caption generation module, and electronic device
CN111274829B (zh) Sequence labeling method using cross-lingual information
WO2020147409A1 (fr) Text classification method and apparatus, computer device, and storage medium
WO2022174496A1 (fr) Generative-model-based data annotation method and apparatus, device, and storage medium
CN112188311B (zh) Method and apparatus for determining video material for news
CN111177392A (zh) Data processing method and apparatus
CN111639500A (zh) Semantic role labeling method, apparatus, computer device, and storage medium
CN113723077B (zh) Sentence vector generation method and apparatus based on a bidirectional representation model, and computer device
CN113158687A (zh) Semantic disambiguation method and apparatus, storage medium, and electronic device
CN111767714A (zh) Text fluency determination method, apparatus, device, and medium
CN112199954B (zh) Disease entity matching method and apparatus based on speech semantics, and computer device
CN111191011B (zh) Text label search and matching method, apparatus, device, and storage medium
CN113377910A (zh) Sentiment evaluation method, apparatus, electronic device, and storage medium
CN114817523A (zh) Abstract generation method, apparatus, computer device, and storage medium
CN110276001B (zh) Inventory page recognition method, apparatus, computing device, and medium
WO2020215581A1 (fr) Chinese character encoding method and apparatus based on a bidirectional long short-term memory network model
WO2016161631A1 (fr) Hidden dynamic systems

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19945607

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19945607

Country of ref document: EP

Kind code of ref document: A1