CN113657103B - Non-standard Chinese express mail information identification method and system based on NER - Google Patents

Non-standard Chinese express mail information identification method and system based on NER

Info

Publication number
CN113657103B
Authority
CN
China
Prior art keywords
word
sequence
mail information
express mail
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110951137.5A
Other languages
Chinese (zh)
Other versions
CN113657103A (en)
Inventor
孟凡超
叶子
初佃辉
周学权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CN202110951137.5A
Publication of CN113657103A
Application granted
Publication of CN113657103B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Character Discrimination (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention provides an NER-based method and system for identifying non-standard Chinese express mail information. Express mail information is uniformly acquired from the order data of an express company and preprocessed to obtain a labeled data set. The data are read and a text vectorization model is established to represent word features, yielding word embeddings and position embeddings. A time sequence probability prediction model is established for semantic decoding to obtain the corresponding label score probabilities. A maximized probability prediction model is established to learn the label transition probabilities in the data set and to correct the output of the time sequence probability prediction model, yielding an accurate and reasonable label prediction sequence. The entity recognition results for the non-standard Chinese express mail information are then displayed visually. The invention mines the context information in the text in both the forward and backward directions and takes the correlation between characters into account so as to output a more accurate prediction sequence, thereby improving the low recognition accuracy of mail information elements when user input is non-standard.

Description

Non-standard Chinese express mail information identification method and system based on NER
Technical Field
The invention relates to the technical field of intelligent express delivery, and in particular to an NER-based method and system for identifying non-standard Chinese express mail information.
Background
With the rise of the Internet and e-commerce, the express industry is developing rapidly, which places tremendous strain on pickup and delivery by last-mile couriers. How to improve the user experience and the mailing efficiency of the express industry has become a current research focus. Reducing the tedium of the express delivery process by standardizing users' express mail information can improve both order-placement efficiency and the delivery efficiency of last-mile couriers, and is a feasible and effective way to address the current inefficiency of express pickup and delivery.
The prior art considers only the case in which the user enters standard mail information, that is, each customer enters text in the format name - telephone number - province/autonomous region/municipality - city/autonomous prefecture/county/autonomous county - district - detailed address. In practical application scenarios, however, the diversity and complexity of the ways in which Chinese express address information is expressed make parsing particularly complex. Traditional solutions to this problem are rule-based, statistical-model-based, and deep-learning-based Chinese address resolution methods. Rule-based methods achieve a certain recognition accuracy on strictly regular address information, but they rely heavily on a relatively complete dictionary and require manual correction; when a user enters non-standard express address information, their recognition accuracy drops sharply. To address the low adaptability and poor extensibility of rule-based methods, statistical-model-based methods have been applied to Chinese express address resolution; they overcome the shortcomings of dictionary-based and rule-based methods to some extent and avoid the low segmentation efficiency of rule-based segmentation.
Statistics-based Chinese address segmentation performs better than traditional rule-based address segmentation, and probabilistic models offer good segmentation quality and interpretability. However, the segmentation quality of these methods is limited by the choice of features, so overfitting during model training caused by too many features must be guarded against. Deep-learning-based Chinese address resolution methods greatly improve the efficiency and computational performance of Chinese word segmentation. However, because deep-learning-based address resolution has mostly been applied to English, only part of the processing of non-normalized Chinese address information and the identification of address elements has been accomplished; when the parameters are complex, the model is not flexible enough to meet users' actual needs. Moreover, user input is varied and has no fixed form, which greatly increases the difficulty of parsing Chinese mail information. Existing algorithms therefore generally have difficulty directly solving the problem of identifying non-standardized Chinese express mail information.
In summary, existing research has the following shortcomings:
1) Existing methods achieve a certain recognition accuracy only on strictly regular address information; they rely heavily on a relatively complete dictionary and require manual correction, so their adaptability and extensibility are poor;
2) The segmentation quality of existing methods is limited by the choice of features, so overfitting during model training caused by too many features must be guarded against;
3) Existing methods accomplish only part of the processing of non-normalized Chinese address information and the identification of address elements; when the parameters are complex, the model is not flexible enough to meet users' actual needs.
Disclosure of Invention
The invention provides an NER-based method for identifying non-standard Chinese express mail information, which obtains an accurate and reasonable label prediction sequence and derives the required entities from the predicted labels.
The method comprises the following steps:
step 1: uniformly acquiring express mail information from the order data of an express company, and preprocessing the data to obtain a labeled data set;
step 2: reading the data, and establishing a text vectorization model to represent word features, obtaining word embeddings and position embeddings;
step 3: establishing a time sequence probability prediction model for semantic decoding to obtain the corresponding label score probabilities;
step 4: establishing a maximized probability prediction model to learn the label transition probabilities in the data set, and correcting the output of the time sequence probability prediction model to obtain an accurate and reasonable label prediction sequence;
step 5: visually displaying the entity recognition results for the non-standard Chinese express mail information.
In the invention, step 1 specifically comprises:
step 1.1: uniformly acquiring express mail information from the order data of an express company to form a Chinese express mail information data set;
step 1.2: preprocessing the obtained Chinese express mail information data set, and segmenting the text with a single character as the unit;
step 1.3: labeling the individual characters with the BIEO scheme.
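Steps 1.2 and 1.3 can be illustrated with a minimal sketch. It assumes the annotations are available as character-offset spans `(start, end, entity_type)` with an exclusive `end`; this span format, the function name `bieo_tags`, and the sample text are hypothetical and not taken from the patent. The BIEO scheme described here has no single-character tag, so the sketch labels a one-character entity with `B-` only, which is also an assumption.

```python
from typing import List, Tuple


def bieo_tags(text: str, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Tag every character of `text` with a BIEO label.

    `spans` holds (start, end, entity_type) with `end` exclusive.
    This input format is an assumption made for the sketch.
    """
    tags = ["O"] * len(text)  # characters outside any entity stay "O"
    for start, end, etype in spans:
        if not 0 <= start < end <= len(text):
            raise ValueError(f"bad span: {(start, end, etype)}")
        tags[start] = f"B-{etype}"            # first character of the entity
        for i in range(start + 1, end - 1):
            tags[i] = f"I-{etype}"            # characters in the middle
        if end - start > 1:
            tags[end - 1] = f"E-{etype}"      # last character of the entity
    return tags


# Hypothetical sample: name, phone number, province, city written without separators.
text = "张三13800000000黑龙江省哈尔滨市"
spans = [(0, 2, "NAME"), (2, 13, "TEL"), (13, 17, "PROVINCE"), (17, 21, "CITY")]
tags = bieo_tags(text, spans)
for ch, tag in zip(text, tags):
    print(ch, tag)
```

Segmenting by single character means that Python's per-character indexing of `str` already provides the segmentation of step 1.2; no separate tokenizer is needed in this sketch.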
In the invention, step 2 comprises: establishing a text vectorization model to perform word embedding and represent word features, and constructing the distribution of word sequences in the express mail information text so as to evaluate the probability of any word sequence.
In the invention, step 3 further comprises: using the time sequence probability prediction model to memorize needed information and forget useless information in both directions;
the unit of the time sequence probability prediction model consists of the input word $x_t$ at the current time, the cell state $C_t$, the temporary cell state $\tilde{C}_t$, the hidden state $h_t$, the forget gate $f_t$, the memory gate $i_t$, and the output gate $o_t$;
step 3.1: the forget gate is computed from the hidden state $h_{t-1}$ of the previous time and the input word $x_t$ at the current time, selecting the information to be forgotten and yielding $f_t$. The formula is as follows:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \tag{1}$$
step 3.2: the memory gate is computed from the hidden state $h_{t-1}$ of the previous time and the input word $x_t$ at the current time, selecting the information to be memorized and yielding $i_t$ and the temporary cell state $\tilde{C}_t$. The formula is as follows:
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \tag{2}$$
step 3.3: the cell state $C_t$ at the current time is computed from the memory gate value $i_t$, the forget gate value $f_t$, the temporary cell state $\tilde{C}_t$, and the cell state $C_{t-1}$ of the previous time. The formula is as follows:
$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t \tag{3}$$
step 3.4: the output gate value $o_t$ and the hidden state $h_t$ are computed from the hidden state $h_{t-1}$ of the previous time, the input word $x_t$ at the current time, and the cell state $C_t$ at the current time. The formulas are as follows:
$$h_t = o_t * \tanh(C_t) \tag{4}$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \tag{5}$$
At each time step, the information in the cell state is updated and discarded through the memory gate and the forget gate, the useful information is computed and passed to the next unit, and finally a state sequence $\{h_0, h_1, h_2, \ldots, h_{n-1}\}$ with the same length as the sentence is obtained.
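Steps 3.1 to 3.4 describe one update of an LSTM-style unit, and a minimal pure-Python sketch of a single step may make the data flow easier to follow. The parameter dictionary, the helper names, and the all-zero toy weights are hypothetical. The text does not give a formula for the temporary cell state, so the sketch assumes the standard LSTM form tanh(W_c · [h_{t-1}, x_t] + b_c); this is an assumption, not the patent's stated method.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def affine(W: Matrix, v: Vector, b: Vector) -> Vector:
    """Return W·v + b for a row-major matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t: Vector, h_prev: Vector, c_prev: Vector,
              p: Dict[str, list]) -> Tuple[Vector, Vector]:
    """One unit update following formulas (1) to (5)."""
    v = h_prev + x_t                                             # [h_{t-1}, x_t]
    f_t = [sigmoid(z) for z in affine(p["W_f"], v, p["b_f"])]    # (1) forget gate
    i_t = [sigmoid(z) for z in affine(p["W_i"], v, p["b_i"])]    # (2) memory gate
    # Temporary cell state: standard LSTM form, assumed (not given in the text).
    c_tilde = [math.tanh(z) for z in affine(p["W_c"], v, p["b_c"])]
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]  # (3)
    o_t = [sigmoid(z) for z in affine(p["W_o"], v, p["b_o"])]    # (5) output gate
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]           # (4) hidden state
    return h_t, c_t


def run_sequence(xs: List[Vector], hidden_size: int, p: Dict[str, list]) -> List[Vector]:
    """Return the state sequence {h_0, ..., h_{n-1}}, one state per input."""
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    states = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, p)
        states.append(h)
    return states


# Toy parameters (all zeros) so that every gate evaluates to sigmoid(0) = 0.5.
hidden, inputs = 2, 3
zeros_W = [[0.0] * (hidden + inputs) for _ in range(hidden)]
zeros_b = [0.0] * hidden
params: Dict[str, list] = {name: zeros_W for name in ("W_f", "W_i", "W_c", "W_o")}
params.update({name: zeros_b for name in ("b_f", "b_i", "b_c", "b_o")})
h_t, c_t = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -2.0], params)
```

With the toy parameters the forget gate halves the previous cell state and the temporary cell state is zero, so the update can be checked by hand. A trained model would use learned weights, and a bidirectional model would run a second pass over the reversed sequence.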
In the invention, step 4 further comprises:
obtaining constraint rules from the training data through the maximized probability prediction model, and guaranteeing the validity of the predicted labels through the constraint rules;
step 4.1: for an input sequence and a given tag sequence, the score of the tag sequence is defined as:
$$\mathrm{Score} = \mathrm{EmissionScore} + \mathrm{TransitionScore} \tag{6}$$
step 4.2: let $\mathrm{EmissionScore} = X_{0,\mathrm{START}} + X_{1,\text{B-NAME}} + \ldots + X_{n-1,\mathrm{O}} + X_{n,\mathrm{END}}$, where $X_{0,\mathrm{START}}$ and $X_{n,\mathrm{END}}$ may be set to 0;
step 4.3: let $\mathrm{TransitionScore} = t_{\mathrm{START} \to \text{B-NAME}} + t_{\text{B-NAME} \to \text{I-NAME}} + \ldots + t_{\mathrm{O} \to \mathrm{END}}$, that is, the sum of the corresponding entries of the sequence's state transition matrix;
step 4.4: calculating the loss of the maximized probability prediction model; for a given input sequence there may be many candidate tag sequences, and the objective of the maximized probability prediction model is to maximize the ratio of the Score of the real sequence to the sum of the Scores of all possible sequences. The calculation formula (7) is as follows:
[Formula (7) is an image in the original publication (BDA0003218496670000051) and is not reproduced here.]
In the invention, the EmissionScore in step 4.2 is generated as follows:
step 4.2.1: each word vector in the sentence is passed through the time sequence probability prediction model to obtain the sum $X_{1,\text{B-NAME}} + X_{2,\text{I-NAME}} + \ldots + X_{n-1,\mathrm{O}}$ of the label scores at each position in the sequence, which is output to the CRF layer for calculation of the EmissionScore.
In the invention, the TransitionScore in step 4.3 is generated as follows:
step 4.3.1: each pair of adjacent positions i and i+1 in the sequence has a corresponding transition probability value, and the TransitionScore is the sum of these values, taken from the state transition matrix, over all adjacent positions in the sequence;
in the invention, the labels of the hierarchical mail information labeling system comprise: "B-NAME", "E-NAME", "I-NAME", "B-TEL", "E-TEL", "I-TEL", "B-PROVINCE", "E-PROVINCE", "I-PROVINCE", "B-CITY", "E-CITY", "I-CITY", "B-AREA", "E-AREA", "I-AREA", "B-DETAILS", "E-DETAILS", "I-DETAILS", "O";
wherein B marks the first character of an entity, E marks the last character of an entity, I marks a character in the middle of an entity, and O marks a non-entity character. These individual characters are used as the input for text vectorization.
The invention also provides an NER-based system for identifying non-standardized Chinese express mail information, comprising: a client, used for entering a user's Chinese express mail information and for visually outputting the parsing result of the express address information in the text;
a server, which executes a computer program to implement the NER-based non-standardized Chinese express mail information identification method;
and a database, used for storing the Chinese express mail information of the customer base.
As can be seen from the above technical scheme, the invention has the following advantages:
In the NER-based non-standard Chinese express mail information identification method, the Chinese express mail address text entered by the user at the client is segmented with a single character as the smallest segmentation unit, each segmented character is labeled, and the resulting labeled text is input into the text vectorization model.
Next, word embedding is performed with the text vectorization model. Because the text vectorization model is based on the BERT model, it inherits BERT's strength in capturing dependency relationships in sentences more thoroughly, and it constructs the text sequence of the express mail information. Then a time sequence probability prediction model is provided: the text-vectorized data are converted into context-dependent word vectors, which are input into the time sequence probability prediction model for further semantic decoding, and a predicted value is output for each label. Next, a maximized probability prediction model is provided to decode the output sequence of the BiLSTM layer; it can learn constraint rules from the training data to ensure that the predicted labels are reasonable, and the finally obtained predicted labels are converted into the corresponding entities. Finally, the result is returned to the user interface so that the user can visually check the accuracy of the express mail text recognition.
To address the shortcomings of existing research on identifying non-standardized Chinese express mail information, the identification model is built as follows. First, the Chinese express mail address text entered by the user at the client is segmented with a single character as the smallest segmentation unit, and each segmented character is labeled. The labeled text is input into the text vectorization model, which performs word embedding, constructs the distribution of word sequences in the express mail information text, and yields the prediction probability of any sequence. The time sequence probability prediction model converts the text-vectorized data into context-dependent word vectors and performs further semantic decoding to obtain a predicted value for each label, which serves as the input of the downstream inference task. The maximized probability prediction model decodes the output sequence of the time sequence probability prediction model, reduces the probability of illegal sequences, and obtains the most probable legal predicted labels. Finally, the predicted labels are converted into the corresponding entities and the recognition result is returned to the user interface, where the accuracy of the Chinese express mail text recognition can be verified. The new model and algorithm proposed by the invention improve the accuracy of identifying non-standardized Chinese express mail information. Comparative experiments show that the NER-based non-standardized Chinese express mail information identification model outperforms other traditional models and the more mature address resolution methods currently on the market, and that it has practical value.
Drawings
In order to more clearly illustrate the technical solutions of the present invention, the drawings that are needed in the description will be briefly introduced below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a non-standard Chinese express mail information identification method based on NER;
FIG. 2 is a diagram of a text vectorization model;
FIG. 3 is a block diagram of a time series probability prediction model unit;
FIG. 4 is an overall structure diagram of a time sequence probability prediction model;
FIG. 5 is an architecture diagram of a NER-based non-standardized Chinese express mail information identification model;
FIG. 6 is a schematic diagram of the input and output of the NER-based non-standardized Chinese express mail information identification of the present invention.
Detailed Description
In the non-standard Chinese express mail information identification method based on NER provided by the invention, the units and algorithm steps of each example described in the disclosed embodiment can be realized by electronic hardware, computer software or a combination of the two, and in order to clearly illustrate the interchangeability of hardware and software, the components and steps of each example have been generally described according to functions in the above description. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The block diagram shown in the drawing of the non-standard Chinese express mail information identification method based on NER is only a functional entity and does not necessarily correspond to a physically independent entity. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
In the non-standard Chinese express mail information identification method based on NER provided by the invention, it should be understood that the disclosed system, device and method can be realized in other modes. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices, or elements, or may be an electrical, mechanical, or other form of connection.
As shown in FIGS. 1 to 6, the NER-based non-standard Chinese express mail information identification method provided by the invention comprises:
S1: uniformly acquiring express mail information from the order data of an express company, and then preprocessing the data to obtain a labeled data set;
wherein, step 1.1: express mail information is uniformly acquired from the order data of the express company to form a Chinese express mail information data set.
Step 1.2: the obtained Chinese express mail information data set is preprocessed, and the text is segmented with a single character as the unit.
Step 1.3: the individual characters are labeled with the BIEO scheme. The labels of the hierarchical mail information labeling system comprise: "B-NAME", "E-NAME", "I-NAME", "B-TEL", "E-TEL", "I-TEL", "B-PROVINCE", "E-PROVINCE", "I-PROVINCE", "B-CITY", "E-CITY", "I-CITY", "B-AREA", "E-AREA", "I-AREA", "B-DETAILS", "E-DETAILS", "I-DETAILS", "O". Here B marks the first character of an entity, E marks the last character of an entity, I marks a character in the middle of an entity, and O marks a non-entity character. These individual characters are used as the input for text vectorization.
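Going in the other direction, a predicted BIEO tag sequence has to be turned back into entities. The following is a minimal sketch of that conversion; the function name `extract_entities`, the sample input, and the strict policy of dropping any entity that is not closed by a matching `E-` tag are assumptions, since the patent does not spell out this step.

```python
from typing import List, Optional, Tuple


def extract_entities(chars: str, tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (entity_type, text) pairs from a BIEO-tagged character sequence."""
    entities: List[Tuple[str, str]] = []
    buffer: List[str] = []
    current: Optional[str] = None            # entity type currently being built
    for ch, tag in zip(chars, tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "B":
            buffer, current = [ch], etype     # start a new entity
        elif prefix in ("I", "E") and buffer and current == etype:
            buffer.append(ch)
            if prefix == "E":                 # entity is complete
                entities.append((etype, "".join(buffer)))
                buffer, current = [], None
        else:
            # "O" or an inconsistent tag: discard any partial entity (assumed policy).
            buffer, current = [], None
    return entities


# Hypothetical tagged input.
chars = "张三哈尔滨市"
tags = ["B-NAME", "E-NAME", "B-CITY", "I-CITY", "I-CITY", "E-CITY"]
entities = extract_entities(chars, tags)
print(entities)
```

A more lenient policy, for example keeping an entity that ends without an `E-` tag, would also be possible; the strict version is used here because it makes invalid tag sequences easy to spot.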
S2: reading the data, and establishing a text vectorization model to represent word features, obtaining word embeddings and position embeddings;
word embedding is performed with the text vectorization model, and the distribution of word sequences in the express mail information text is constructed so as to evaluate the probability of any word sequence.
S3: establishing a time sequence probability prediction model for semantic decoding to obtain the label score probability corresponding to each word;
In step S3, the time sequence probability prediction model is used to memorize needed information and forget useless information in both the forward and backward directions. The unit structure of the time sequence probability prediction model consists of the input word $x_t$ at the current time, the cell state $C_t$, the temporary cell state $\tilde{C}_t$, the hidden state $h_t$, the forget gate $f_t$, the memory gate $i_t$, and the output gate $o_t$.
Step 3.1: the forget gate is computed from the hidden state $h_{t-1}$ of the previous time and the input word $x_t$ at the current time, selecting the information to be forgotten and yielding $f_t$. The formula is as follows:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \tag{1}$$
Step 3.2: the memory gate is computed from the hidden state $h_{t-1}$ of the previous time and the input word $x_t$ at the current time, selecting the information to be memorized and yielding $i_t$ and the temporary cell state $\tilde{C}_t$. The formula is as follows:
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \tag{2}$$
Step 3.3: the cell state $C_t$ at the current time is computed from the memory gate value $i_t$, the forget gate value $f_t$, the temporary cell state $\tilde{C}_t$, and the cell state $C_{t-1}$ of the previous time. The formula is as follows:
$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t \tag{3}$$
Step 3.4: the output gate value $o_t$ and the hidden state $h_t$ are computed from the hidden state $h_{t-1}$ of the previous time, the input word $x_t$ at the current time, and the cell state $C_t$ at the current time. The formulas are as follows:
$$h_t = o_t * \tanh(C_t) \tag{4}$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \tag{5}$$
At each time step, the information in the cell state is updated and discarded through the memory gate and the forget gate, the useful information is computed and passed to the next unit, and finally a state sequence $\{h_0, h_1, h_2, \ldots, h_{n-1}\}$ with the same length as the sentence is obtained.
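Two pieces of this unit are described in words but not written out as formulas: how the temporary cell state is computed, and how the forward and backward passes are combined. In a standard bidirectional LSTM they are commonly written as below. The weight $W_C$, the bias $b_C$, the arrow notation, and the concatenated state $h_t^{\mathrm{bi}}$ are assumptions introduced for illustration, not symbols taken from the patent.

```latex
\tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)
\qquad
h_t^{\mathrm{bi}} = \left[\overrightarrow{h}_t \,;\, \overleftarrow{h}_t\right]
```

Here $\overrightarrow{h}_t$ is the hidden state produced by reading the sentence from left to right and $\overleftarrow{h}_t$ the one produced by reading it from right to left, so each position sees context from both sides.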
S4: establishing a maximized probability prediction model to learn the label transition probabilities in the data set, and correcting the output of the time sequence probability prediction model to obtain an accurate and reasonable label prediction sequence;
that is, constraint rules are obtained from the training data by the maximized probability prediction model, and the validity of the predicted labels is guaranteed by these rules.
Step 4.1: for an input sequence and a given tag sequence, the score of the tag sequence is defined as:
$$\mathrm{Score} = \mathrm{EmissionScore} + \mathrm{TransitionScore} \tag{6}$$
Step 4.2: let $\mathrm{EmissionScore} = X_{0,\mathrm{START}} + X_{1,\text{B-NAME}} + \ldots + X_{n-1,\mathrm{O}} + X_{n,\mathrm{END}}$, where $X_{0,\mathrm{START}}$ and $X_{n,\mathrm{END}}$ may be set to 0.
In the invention, the EmissionScore in step 4.2 is generated as follows:
Step 4.2.1: each word vector in the sentence is passed through the time sequence probability prediction layer to obtain the sum $X_{1,\text{B-NAME}} + X_{2,\text{I-NAME}} + \ldots + X_{n-1,\mathrm{O}}$ of the label scores at each position in the sequence, which is output to the maximized probability prediction model for calculation of the EmissionScore.
Step 4.3: let $\mathrm{TransitionScore} = t_{\mathrm{START} \to \text{B-NAME}} + t_{\text{B-NAME} \to \text{I-NAME}} + \ldots + t_{\mathrm{O} \to \mathrm{END}}$, that is, the sum of the corresponding entries of the sequence's state transition matrix.
In the invention, the TransitionScore in step 4.3 is generated as follows:
Step 4.3.1: each pair of adjacent positions i and i+1 in the sequence has a corresponding transition probability value, and the TransitionScore is the sum of these values, taken from the state transition matrix, over all adjacent positions in the sequence;
Step 4.4: calculating the loss; for a given input sequence there may be many candidate tag sequences, and the objective of the maximized probability prediction model is to maximize the ratio of the Score of the real sequence to the sum of the Scores of all possible sequences. The calculation formula (7) is as follows:
[Formula (7) is an image in the original publication (BDA0003218496670000121) and is not reproduced here.]
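Once emission and transition scores are available, the highest-scoring tag sequence has to be found at prediction time. The patent does not name a decoding algorithm; the sketch below uses Viterbi decoding, which is the usual choice for this kind of score and is swapped in here as an assumption. The function name, the dictionary layout, the default of 0 for missing transitions, and the toy numbers are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Emissions = List[Dict[str, float]]            # one {label: score} dict per position
Transitions = Dict[Tuple[str, str], float]    # (from_label, to_label) -> score


def viterbi(emissions: Emissions, transitions: Transitions,
            tags: Sequence[str]) -> Tuple[List[str], float]:
    """Return the tag path maximizing EmissionScore + TransitionScore.

    `emissions` must contain at least one position.
    """
    # Best score of any path ending in each tag at position 0.
    scores = {t: transitions.get(("START", t), 0.0) + emissions[0][t] for t in tags}
    backpointers: List[Dict[str, str]] = []
    for em in emissions[1:]:
        new_scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in tags:
            # Best previous tag for a path that continues with `cur`.
            prev = max(tags, key=lambda p: scores[p] + transitions.get((p, cur), 0.0))
            new_scores[cur] = scores[prev] + transitions.get((prev, cur), 0.0) + em[cur]
            pointers[cur] = prev
        scores = new_scores
        backpointers.append(pointers)
    last = max(tags, key=lambda t: scores[t] + transitions.get((t, "END"), 0.0))
    best_score = scores[last] + transitions.get((last, "END"), 0.0)
    path = [last]
    for pointers in reversed(backpointers):   # follow the backpointers to the start
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best_score


# Toy case: picking the top label per position would give the illegal
# sequence B-NAME, B-NAME; the transition scores correct it.
tags = ["B-NAME", "E-NAME", "O"]
emissions = [
    {"B-NAME": 2.0, "E-NAME": 0.0, "O": 0.5},
    {"B-NAME": 1.2, "E-NAME": 1.0, "O": 0.0},
]
transitions = {("B-NAME", "B-NAME"): -10.0, ("B-NAME", "E-NAME"): 1.0}
best_path, best_score = viterbi(emissions, transitions, tags)
print(best_path, best_score)
```

The toy case shows the sense in which the transition scores "correct" the per-position output: a strongly negative score for an illegal transition steers the decoder toward a legal sequence.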
S5: visually displaying the entity recognition results for the non-standard Chinese express mail information.
In the invention, step 5 further comprises: returning the result of parsing the Chinese express mail text with the NER-based non-standardized Chinese express mail information identification model to the user interface in visual form, where the user can view the entity recognition results and their accuracy.
In order to make the objects, features and advantages of the invention more comprehensible, the technical scheme is described in detail below with reference to specific embodiments and the drawings. The embodiments described are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in this patent without inventive effort fall within the scope of protection of this patent. The specific steps are as follows:
Step one: uniformly acquire express mail information from the order data of an express company, and then preprocess the data to obtain a labeled data set.
Step 1.1: express mail information is acquired from the order data of the express company to form a Chinese express mail information data set.
Step 1.2: the obtained Chinese express mail information data set is preprocessed, and the text is segmented with a single character as the unit.
Step 1.3: the individual characters are labeled with the BIEO scheme. The labels of the hierarchical mail information labeling system comprise: "B-NAME", "E-NAME", "I-NAME", "B-TEL", "E-TEL", "I-TEL", "B-PROVINCE", "E-PROVINCE", "I-PROVINCE", "B-CITY", "E-CITY", "I-CITY", "B-AREA", "E-AREA", "I-AREA", "B-DETAILS", "E-DETAILS", "I-DETAILS", "O". Here B marks the first character of an entity, E marks the last character of an entity, I marks a character in the middle of an entity, and O marks a non-entity character. These individual characters are used as the input for text vectorization. The hierarchical labeling system is shown in Table 1.
Table 1 Hierarchical mail information labeling system
[Table 1 is an image in the original publication (BDA0003218496670000141, BDA0003218496670000151) and is not reproduced here.]
Step two: perform word embedding with the text vectorization model, and construct the distribution of word sequences in the express mail information text so as to evaluate the probability of any word sequence.
FIG. 2 gives a simple example. Because the data set obtained in step one is large, a simple example is used here to illustrate the express mail text information that the user's input should contain. FIG. 2 illustrates the overall flow of word embedding with the text vectorization model to construct the distribution of word sequences in the express mail information text; the probability of any word sequence is evaluated through the text vectorization model. To train bidirectional features, the pre-training tasks of the text vectorization model consist of masked language modeling and next sentence prediction. Because the text vectorization model is based on BERT, its overall framework follows BERT's multi-layer Transformer encoder, with encoder layers stacked to achieve a better effect. Each encoder layer consists of a multi-head attention layer and a feed-forward layer. The fully connected computation in the encoder using self-attention allows word-to-word correlations to be expressed more accurately.
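The self-attention computation mentioned above can be sketched in a few lines. This is a deliberately simplified, single-head version in which the queries, keys and values are all the raw input vectors (no learned projection matrices); that simplification, the function names, and the toy input are assumptions made for illustration and do not reproduce BERT's actual multi-head layer.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X: Matrix) -> Tuple[Matrix, Matrix]:
    """Single-head scaled dot-product self-attention with Q = K = V = X."""
    d = len(X[0])
    # Every position is compared with every other position ("full connection").
    scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X] for q in X]
    weights = [softmax(row) for row in scores]
    # Each output vector is a weighted average of all input vectors.
    output = [[sum(w * v[j] for w, v in zip(w_row, X)) for j in range(d)]
              for w_row in weights]
    return weights, output


# Three toy character vectors of dimension 2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, output = self_attention(X)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and states how strongly one character attends to every character in the input, which is the sense in which correlations between characters are expressed.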
And thirdly, performing semantic decoding by using a time sequence probability prediction model to obtain label score probability corresponding to each word.
Fig. 3 and Fig. 4 give a simple example. The time sequence probability prediction model processes the sequence in both the forward and backward directions, and can memorize needed information and forget useless information. Its structural unit consists of the input word $x_t$ at the current time, the cell state $C_t$, the temporary cell state $\tilde{C}_t$, the hidden state $h_t$, the forget gate $f_t$, the memory gate $i_t$, and the output gate $o_t$.
Step 3.1: from the hidden state $h_{t-1}$ of the previous time and the input word $x_t$ at the current time, compute the forget gate, which selects the information to be forgotten, to obtain $f_t$. The formula is as follows:
$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$ (1)
Step 3.2: from the hidden state $h_{t-1}$ of the previous time and the input word $x_t$ at the current time, compute the memory gate, which selects the information to be memorized, to obtain $i_t$ and the temporary cell state $\tilde{C}_t$. The formula is as follows:
$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$ (2)
Step 3.3: from the memory gate value $i_t$, the forget gate value $f_t$, the temporary cell state $\tilde{C}_t$, and the cell state $C_{t-1}$ of the previous time, compute the cell state $C_t$ at the current time. The formula is as follows:
$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$ (3)
Step 3.4: from the hidden state $h_{t-1}$ of the previous time, the input word $x_t$ at the current time, and the cell state $C_t$ at the current time, compute the output gate value $o_t$ and the hidden state $h_t$. The formulas are as follows:
$h_t = o_t * \tanh(C_t)$ (4)
$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$ (5)
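Steps 3.1 to 3.4 can be sketched as a single time step in plain Python. This is a minimal sketch, not the patented implementation. The cell-state update uses the standard LSTM form, and the temporary cell state is computed with a tanh layer (parameters `W_c`, `b_c`), which is an assumption because the text does not spell that formula out. The function names, parameter dictionary, and toy values are likewise hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, b, z):
    """Return W·z + b for a matrix W (list of rows), bias b and vector z."""
    return [sum(w * v for w, v in zip(row, z)) + bi for row, bi in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, p):
    """One time step following equations (1)-(5).

    `p` maps names to parameters: W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o.
    W_c, b_c and the tanh form of the temporary cell state are assumed.
    """
    z = h_prev + x_t                                            # [h_{t-1}, x_t]
    f_t = [sigmoid(v) for v in affine(p["W_f"], p["b_f"], z)]   # eq. (1)
    i_t = [sigmoid(v) for v in affine(p["W_i"], p["b_i"], z)]   # eq. (2)
    c_tilde = [math.tanh(v) for v in affine(p["W_c"], p["b_c"], z)]
    c_t = [f * c + i * ct
           for f, c, i, ct in zip(f_t, c_prev, i_t, c_tilde)]   # eq. (3)
    o_t = [sigmoid(v) for v in affine(p["W_o"], p["b_o"], z)]   # eq. (5)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]          # eq. (4)
    return h_t, c_t


# Toy parameters: hidden size 2, input size 1, all weights and biases zero.
zeros = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
params = {k: zeros for k in ("W_f", "W_i", "W_c", "W_o")}
params.update({k: [0.0, 0.0] for k in ("b_f", "b_i", "b_c", "b_o")})
h, c = lstm_step([1.0], [0.0, 0.0], [1.0, -1.0], params)
print(h, c)
```

With all-zero parameters every gate equals 0.5 and the temporary cell state is 0, so the new cell state is half the previous one, which makes the toy run easy to check by hand.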
As can be seen from Fig. 3, $x_t$ is the information added at time $t$. At each time step the information in the cell state is updated and discarded through the memory gate and the forget gate, the useful information is computed and passed to the next unit, and finally a state sequence $\{h_0, h_1, h_2, \dots, h_{n-1}\}$ with the same length as the sentence is obtained. These gate structures let information pass selectively, removing information from or adding information to the cell state. The output $o_t$ of the current neuron is finally obtained through these operations. Fig. 4 is a diagram of the time sequence probability prediction model. Its input is the embedded word vectors obtained from the text vectorization model, and its output is the predicted label scores for each word, such as 0.9 (B-NAME), 0.7 (I-NAME), 0.05 (E-AREA), 0.01 (O).
Fourth, the label transition probabilities in the data set are learned by a maximized probability prediction model, which corrects the output of the time sequence probability prediction model to obtain an accurate and reasonable label prediction sequence. This comprises the following steps:
Step 4.1: for an input sequence, the score of a given tag sequence is defined as:
Score = EmissionScore + TransitionScore (6)
Step 4.2: let $\text{EmissionScore} = X_{0,\text{START}} + X_{1,\text{B-NAME}} + \dots + X_{n-1,\text{O}} + X_{n,\text{END}}$, where $X_{0,\text{START}}$ and $X_{n,\text{END}}$ may be set to 0;
Step 4.3: let $\text{TransitionScore} = t_{\text{START}\to\text{B-NAME}} + t_{\text{B-NAME}\to\text{I-NAME}} + \dots + t_{\text{O}\to\text{END}}$; TransitionScore is the sum of the corresponding entries in the sequence state transition matrix;
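The score defined in Steps 4.1 to 4.3 can be computed as in the following sketch. The dictionary-based data layout, the function name `sequence_score`, and all numeric values are assumptions chosen for illustration; they are not taken from the patent.

```python
def sequence_score(emissions, transitions, tags):
    """Score = EmissionScore + TransitionScore for one tag sequence.

    emissions[i][tag]   : score of `tag` at position i
    transitions[(a, b)] : score of moving from tag a to tag b, including the
                          artificial START and END tags
    The START and END emission scores are taken as 0.
    """
    emission_score = sum(emissions[i][t] for i, t in enumerate(tags))
    path = ["START"] + list(tags) + ["END"]
    transition_score = sum(transitions[(a, b)] for a, b in zip(path, path[1:]))
    return emission_score + transition_score


# Toy scores for a two-character name (values are invented).
emissions = [
    {"B-NAME": 2.0, "E-NAME": 0.5},
    {"B-NAME": 0.1, "E-NAME": 1.5},
]
transitions = {
    ("START", "B-NAME"): 0.8, ("START", "E-NAME"): -1.0,
    ("B-NAME", "B-NAME"): -1.0, ("B-NAME", "E-NAME"): 0.9,
    ("E-NAME", "B-NAME"): 0.3, ("E-NAME", "E-NAME"): -1.0,
    ("B-NAME", "END"): -0.5, ("E-NAME", "END"): 0.5,
}
print(sequence_score(emissions, transitions, ["B-NAME", "E-NAME"]))
print(sequence_score(emissions, transitions, ["E-NAME", "B-NAME"]))
```

In this toy setup the well-formed sequence B-NAME, E-NAME scores higher than the reversed one, because both its emission scores and its transition scores are larger.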
Step 4.4: calculate the loss. For a given input sequence there may be many candidate tag sequences, and the objective of the maximized probability prediction model is to maximize the ratio of the score of the real sequence to the sum of the scores of all possible sequences. The calculation formula is as follows:
Figure BDA0003218496670000181
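One common way to realize the objective of Step 4.4, assumed here rather than taken from the patent, is the negative log-likelihood of a linear-chain CRF: the loss is $-\log\big(e^{S_{\text{real}}} / \sum_{p} e^{S_p}\big)$, where $S_p$ is the Score of path $p$. The sketch below computes the denominator with the forward algorithm and cross-checks it against brute-force enumeration. All function names and numeric values are hypothetical.

```python
import math
from itertools import product


def path_score(emissions, transitions, tags):
    """EmissionScore + TransitionScore of one tag sequence."""
    path = ["START"] + list(tags) + ["END"]
    trans = sum(transitions[(a, b)] for a, b in zip(path, path[1:]))
    return sum(emissions[i][t] for i, t in enumerate(tags)) + trans


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(emissions, transitions, labels):
    """log of the summed exp(score) over all paths (forward algorithm)."""
    alpha = {t: transitions[("START", t)] + emissions[0][t] for t in labels}
    for em in emissions[1:]:
        alpha = {
            t: em[t] + logsumexp([alpha[s] + transitions[(s, t)] for s in labels])
            for t in labels
        }
    return logsumexp([alpha[t] + transitions[(t, "END")] for t in labels])


def crf_loss(emissions, transitions, labels, gold):
    """Negative log of exp(score of gold path) / sum over all paths."""
    return log_partition(emissions, transitions, labels) - path_score(
        emissions, transitions, gold
    )


labels = ["B-NAME", "E-NAME"]
emissions = [
    {"B-NAME": 2.0, "E-NAME": 0.5},
    {"B-NAME": 0.1, "E-NAME": 1.5},
]
transitions = {
    ("START", "B-NAME"): 0.8, ("START", "E-NAME"): -1.0,
    ("B-NAME", "B-NAME"): -1.0, ("B-NAME", "E-NAME"): 0.9,
    ("E-NAME", "B-NAME"): 0.3, ("E-NAME", "E-NAME"): -1.0,
    ("B-NAME", "END"): -0.5, ("E-NAME", "END"): 0.5,
}
loss = crf_loss(emissions, transitions, labels, ["B-NAME", "E-NAME"])

# Cross-check the forward algorithm against brute-force enumeration.
brute = logsumexp([
    path_score(emissions, transitions, p)
    for p in product(labels, repeat=len(emissions))
])
assert abs(brute - log_partition(emissions, transitions, labels)) < 1e-9
print(loss)
```

Minimizing this loss raises the share of the real sequence among all candidate sequences, and the forward recursion avoids enumerating every path when sentences are long.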
Fig. 5 shows the architecture of the NER-based non-standardized Chinese express mail information identification model.
Fig. 6 shows the input and output of non-standardized Chinese express mail information identification, presented visually for the user's convenience.
A method for visualizing the analysis results of non-standardized Chinese express mail information is provided, which returns the recognition result for the user's irregular Chinese express mail information in visual form.
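Before a recognition result can be displayed, the per-character labels have to be grouped back into fields. The sketch below shows one possible way to do this; the helper name `extract_entities` and the output format are assumptions, and an entity is emitted only when an `E-` label closes it.

```python
def extract_entities(chars, tags):
    """Group BIEO-tagged characters into {entity_type: [strings]}.

    Hypothetical post-processing helper; characters tagged "O" are dropped
    and entities that are never closed by an E- tag are discarded.
    """
    result, current, current_type = {}, [], None
    for ch, tag in zip(chars, tags):
        if tag == "O":
            current, current_type = [], None
            continue
        prefix, etype = tag.split("-", 1)
        if prefix == "B" or etype != current_type:
            current, current_type = [], etype   # start a new entity
        current.append(ch)
        if prefix == "E":
            result.setdefault(etype, []).append("".join(current))
            current, current_type = [], None
    return result


chars = "张三丰,黑龙江省"
tags = ["B-NAME", "I-NAME", "E-NAME", "O",
        "B-PROVINCE", "I-PROVINCE", "I-PROVINCE", "E-PROVINCE"]
print(extract_entities(chars, tags))
```

The resulting dictionary can then be passed to whatever display layer is used to show the name, telephone, and address fields.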
The invention also provides a non-standardized Chinese express mail information identification system based on NER, which comprises: the client is used for inputting Chinese express mail information of a user and visually outputting an analysis result of express address information in a text; the server side executes the computer program to realize the non-standard Chinese express mail information identification method based on NER; and the database end is used for storing the Chinese express mail information of the client group.
The client provides an operation and display interface using JSP technology. The interface covers the input of basic Chinese express mail information, including name, telephone number, province/autonomous region/municipality, city/autonomous prefecture/county/autonomous county, district, detailed address, invalid information and the like, and the visual display of the recognition results for non-standardized Chinese express mail information entities. The server side is implemented in Java; it processes the intercepted requests and returns the results to the client. The database end uses a MySQL database to store the basic express mail information of the customer group.
The units and algorithm steps of the examples described in connection with the embodiments disclosed herein for the NER-based non-standardized Chinese express mail information identification system can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (2)

1. A non-standard Chinese express mail information identification method based on NER is characterized by comprising the following steps:
step 1: uniformly acquiring express mail information from the order data of an express company, and preprocessing the data to obtain a labeled data set;
step 1.1: uniformly acquiring express mail information from the order data of an express company to form a Chinese express mail information data set;
step 1.2: preprocessing the obtained Chinese express mail information data set, and word segmentation is carried out on the text by taking a single character as a unit;
step 1.3: marking single characters by using a BIEO system;
step 2: reading data, establishing a text vectorization model to perform word characteristic representation, and obtaining word embedding and position embedding;
the step 2 comprises the following steps: establishing a text vectorization model to perform word embedding, expressing the characteristics of words, and constructing the distribution of word sequences in the express mail information text to evaluate the probability of any word sequence;
the hierarchical mail information labeling system label comprises: "B-NAME", "E-NAME", "I-NAME", "B-TEL", "E-TEL", "I-TEL", "B-PROVINCE", "E-PROVINCE", "I-PROVINCE", "B-CITY", "E-CITY", "I-CITY", "B-AREA", "E-AREA", "I-AREA", "B-DETAILS", "E-DETAILS", "I-DETAILS", "O";
wherein B represents the first word of the entity word, E represents the last word of the entity word, I represents the word in the middle of the entity word, and O represents the non-entity; the entity words are used as input for text vectorization;
step 3: establishing a time sequence probability prediction model for semantic decoding to obtain corresponding label score probability;
memorizing the needed information and forgetting useless information from two directions by using a time sequence probability prediction model;
the unit of the time sequence probability prediction model consists of the input word $x_t$ at the current time, the cell state $C_t$, the temporary cell state $\tilde{C}_t$, the hidden layer state $h_t$, the forget gate $f_t$, the memory gate $i_t$, and the output gate $o_t$;
step 3.1: according to the hidden layer state $h_{t-1}$ at the previous time and the input word $x_t$ at the current time, selecting the information to be forgotten, and calculating the forget gate $f_t$ by the following formula:
$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$ (1)
step 3.2: according to the hidden layer state $h_{t-1}$ at the previous time and the input word $x_t$ at the current time, calculating by the following formula the value $i_t$ of the memory gate, which selects the information to be memorized, and the temporary cell state $\tilde{C}_t$:
$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$ (2)
step 3.3: according to the value $i_t$ of the memory gate, the value $f_t$ of the forget gate, the temporary cell state $\tilde{C}_t$, and the cell state $C_{t-1}$ at the previous time, calculating the cell state $C_t$ at the current time by the following formula:
$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$ (3)
step 3.4: according to the hidden layer state $h_{t-1}$ at the previous time, the input word $x_t$ at the current time, and the cell state $C_t$ at the current time, calculating the output gate $o_t$ and the hidden layer state $h_t$ by the following formulas:
$h_t = o_t * \tanh(C_t)$ (4)
$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$ (5)
the information in the cell state is updated and discarded through the memory gate and the forget gate at each time step, the useful information is calculated and passed to the next unit, and finally a state sequence $\{h_0, h_1, h_2, \dots, h_{n-1}\}$ with the same length as the sentence is obtained;
step 4: establishing a maximized probability prediction model to learn the label transition probabilities in the data set, and correcting the output of the time sequence probability prediction model to obtain an accurate and reasonable label prediction sequence;
obtaining constraint rules from the training data through the maximized probability prediction model, and guaranteeing the validity of the predicted labels through the constraint rules;
step 4.1: for an input sequence, the score of a given tag sequence is defined as:
Score = EmissionScore + TransitionScore (6)
step 4.2: let $\text{EmissionScore} = X_{0,\text{START}} + X_{1,\text{B-NAME}} + \dots + X_{n-1,\text{O}} + X_{n,\text{END}}$, wherein $X_{0,\text{START}}$ and $X_{n,\text{END}}$ are set to 0;
step 4.2.1: EmissionScore is generated as follows: any word vector in the sentence is passed through the time sequence probability prediction model to obtain the sum of the corresponding label scores at each position in the sequence, Figure QLYQS_34, which is output to the CRF layer for the calculation of EmissionScore;
step 4.3: let $\text{TransitionScore} = t_{\text{START}\to\text{B-NAME}} + t_{\text{B-NAME}\to\text{I-NAME}} + \dots + t_{\text{O}\to\text{END}}$, wherein TransitionScore is the sum of the corresponding entries in the sequence state transition matrix;
TransitionScore is generated as follows:
step 4.3.1: each pair of adjacent positions i and i+1 in the sequence corresponds to a probability value, and TransitionScore is the sum of the corresponding probability values in the sequence state transition matrix over all adjacent positions;
step 4.4: calculating the loss of the maximized probability prediction model; for a given input sequence there may be many candidate tag sequences, and the objective of the maximized probability prediction model is to maximize the ratio of the score of the real sequence to the sum of the scores of all possible sequences; the calculation formula is as follows:
Figure QLYQS_36
(7);
step 5: and visually displaying the recognition result of the nonstandard Chinese express mail information entity.
2. A non-standardized Chinese express mail information identification system based on NER, characterized by comprising: a client, used for inputting the user's Chinese express mail information and visually outputting the analysis result of the express address information in the text;
a server side executing a computer program to implement the non-standard Chinese express mail information identification method based on NER as set forth in claim 1;
and the database end is used for storing the Chinese express mail information of the client group.
CN202110951137.5A 2021-08-18 2021-08-18 Non-standard Chinese express mail information identification method and system based on NER Active CN113657103B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110951137.5A CN113657103B (en) 2021-08-18 2021-08-18 Non-standard Chinese express mail information identification method and system based on NER

Publications (2)

Publication Number Publication Date
CN113657103A CN113657103A (en) 2021-11-16
CN113657103B true CN113657103B (en) 2023-05-12

Family

ID=78481075

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110951137.5A Active CN113657103B (en) 2021-08-18 2021-08-18 Non-standard Chinese express mail information identification method and system based on NER

Country Status (1)

Country Link
CN (1) CN113657103B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant