CN113657103B - Non-standard Chinese express mail information identification method and system based on NER - Google Patents
- Publication number
- CN113657103B CN202110951137.5A
- Authority
- CN
- China
- Prior art keywords
- word
- sequence
- mail information
- express mail
- probability
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Medical Informatics (AREA)
- Data Mining & Analysis (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Character Discrimination (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention provides a NER-based method and system for identifying non-standard Chinese express mail information. Express mail information is collected uniformly from the order data of an express company and preprocessed to obtain a labeled data set; the data are read and a text vectorization model is established to represent word features, yielding word embeddings and position embeddings; a time sequence probability prediction model is established for semantic decoding to obtain the corresponding label score probabilities; a maximized probability prediction model is established to learn the label transition probabilities in the data set and to correct the output of the time sequence probability prediction model, giving an accurate and reasonable label prediction sequence; and the recognition result for the non-standard Chinese express mail information entities is displayed visually. The invention mines the contextual information in the text in both the forward and backward directions and takes the correlation between characters into account so as to output a more accurate prediction sequence, thereby improving the low recognition accuracy of mail information elements when user input is non-standard.
Description
Technical Field
The invention relates to the technical field of intelligent express delivery, in particular to a non-standard Chinese express delivery mail information identification method and system based on NER.
Background
With the rise of the Internet and e-commerce, the express industry is developing rapidly, which places tremendous pressure on the pickup and delivery work of terminal couriers. How to improve user experience and shipping efficiency in the express industry has become a current research focus. Standardizing users' express mail information during shipping, and thereby reducing its tediousness, can improve both ordering efficiency and the delivery efficiency of terminal couriers; it is a feasible and effective way to address the current inefficiency of express pickup and delivery.
The prior art considers only the case in which the user enters standard mail information, that is, each customer enters text in the format name, telephone number, province/autonomous region/municipality, city/autonomous prefecture/county/autonomous county, district, detailed address. In real application scenarios, however, the diversity and complexity of the ways Chinese express address information is expressed make parsing particularly complicated. The traditional solutions to this problem are rule-based, statistical-model-based, and deep-learning-based Chinese address resolution methods. The rule-based method achieves a certain recognition accuracy on address information with strict regularity, but it relies heavily on a relatively complete dictionary and requires manual correction; when a user enters non-standard express address information, its recognition accuracy drops sharply. To address the low adaptability and poor extensibility of rule-based methods, statistical-model-based methods have been applied to Chinese express address resolution. They overcome the shortcomings of dictionary-based and rule-based methods to a certain extent and avoid the low segmentation efficiency of rule-based segmentation.
Statistics-based Chinese address segmentation performs better than traditional rule-based address segmentation, and the probability model segments well and is readily interpretable. Its segmentation quality is nevertheless limited by the feature settings, so problems such as overfitting during model training caused by too many features must be prevented. Deep-learning-based Chinese address resolution greatly improves the efficiency and computational performance of Chinese word segmentation. However, because deep-learning-based address resolution has mostly been applied to English, it handles only part of the processing of non-normalized Chinese address information and the identification of address elements; when the parameters are complex, the model is inflexible and cannot meet users' actual needs well. In addition, user input takes varied and unfixed forms, which greatly increases the difficulty of parsing Chinese mail information. Existing algorithms therefore generally struggle to solve the identification of non-standardized Chinese express mail information directly.
In summary, existing research has the following defects:
1) Existing methods achieve a certain recognition accuracy only on address information with strict regularity; they rely heavily on a relatively complete dictionary and require manual correction, so their adaptability and extensibility are poor;
2) The word segmentation quality of existing methods is limited by the feature settings, so problems such as overfitting during model training caused by too many features must be prevented;
3) Existing methods handle only part of the processing of non-normalized Chinese address information and the identification of address elements; when the parameters are complex, the model is inflexible and cannot meet users' actual needs well.
Disclosure of Invention
The invention provides a non-standard Chinese express mail information identification method based on NER, which is used for obtaining an accurate and reasonable label prediction sequence and obtaining a required entity according to a prediction label.
The method comprises the following steps:
step 1: uniformly acquiring express mail information from the order data of an express company, and preprocessing the data to obtain a labeled data set;
step 2: reading the data, establishing a text vectorization model to represent word features, and obtaining word embeddings and position embeddings;
step 3: establishing a time sequence probability prediction model for semantic decoding to obtain the corresponding label score probabilities;
step 4: establishing a maximized probability prediction model to learn the label transition probabilities in the data set, and correcting the output of the time sequence probability prediction model to obtain an accurate and reasonable label prediction sequence;
step 5: visually displaying the recognition result for the non-standard Chinese express mail information entities.
In the invention, the specific steps of step 1 are as follows:
step 1.1: uniformly acquiring express mail information from the order data of an express company to form a Chinese express mail information data set;
step 1.2: preprocessing the obtained Chinese express mail information data set, and segmenting the text with a single character as the unit;
step 1.3: labeling the individual characters with the BIEO system.
In the present invention, step 2 includes: establishing a text vectorization model to perform word embedding and represent the features of the words, and constructing the distribution of word sequences in the express mail information text so as to evaluate the probability of any word sequence.
In the present invention, step 3 further includes: using the time sequence probability prediction model to memorize needed information and forget useless information in both directions;
each unit of the time sequence probability prediction model consists of the input word $x_t$ at the current time $t$, the cell state $C_t$, the temporary cell state $\tilde{C}_t$, the hidden state $h_t$, the forget gate $f_t$, the memory gate $i_t$, and the output gate $o_t$;
step 3.1: from the hidden state $h_{t-1}$ at the previous time and the input word $x_t$ at the current time, compute the forget gate, which selects the information to be forgotten, to obtain $f_t$. The formula is as follows:
$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$ (1)
step 3.2: from the hidden state $h_{t-1}$ at the previous time and the input word $x_t$ at the current time, compute the memory gate, which selects the information to be memorized, to obtain $i_t$ and the temporary cell state $\tilde{C}_t$. The formulas are as follows:
$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$ (2)
$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$
step 3.3: from the memory gate value $i_t$, the forget gate value $f_t$, the temporary cell state $\tilde{C}_t$, and the cell state $C_{t-1}$ at the previous time, compute the cell state $C_t$ at the current time. The formula is as follows:
$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$ (3)
step 3.4: from the hidden state $h_{t-1}$ at the previous time, the input word $x_t$ at the current time, and the cell state $C_t$ at the current time, compute the output gate value $o_t$ and the hidden state $h_t$. The formulas are as follows:
$h_t = o_t * \tanh(C_t)$ (4)
$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$ (5)
At each time step, the information in the cell state is updated and discarded through the memory gate and the forget gate, the useful information is computed and passed to the next unit, and finally a state sequence $\{h_0, h_1, h_2, \ldots, h_{n-1}\}$ of the same length as the sentence is obtained.
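The gate computations of steps 3.1 to 3.4 can be sketched in plain Python. This is a minimal illustration rather than the patented implementation: the function names, the parameter dictionary layout (`W_f`, `b_f`, and so on), and the use of nested lists instead of a tensor library are all assumptions, and the temporary-cell-state and cell-update lines follow the standard LSTM formulation.

```python
import math


def sigmoid(v):
    """Element-wise logistic function."""
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def affine(W, z, b):
    """Return W·z + b, where W is a list of rows."""
    return [sum(w * a for w, a in zip(row, z)) + bi for row, bi in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, p):
    """One unit update; `p` holds weights under assumed names such as 'W_f'."""
    z = h_prev + x_t                                    # concatenation [h_{t-1}, x_t]
    f_t = sigmoid(affine(p["W_f"], z, p["b_f"]))        # forget gate, eq. (1)
    i_t = sigmoid(affine(p["W_i"], z, p["b_i"]))        # memory gate, eq. (2)
    c_tilde = [math.tanh(a) for a in affine(p["W_c"], z, p["b_c"])]  # temporary cell state (step 3.2)
    c_t = [f * c + i * g                                # cell state update (step 3.3)
           for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    o_t = sigmoid(affine(p["W_o"], z, p["b_o"]))        # output gate, eq. (5)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]  # hidden state, eq. (4)
    return h_t, c_t


def run_lstm(xs, p, hidden_size):
    """Apply lstm_step along a sequence; returns one hidden state per position."""
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    states = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, p)
        states.append(h)
    return states
```

Calling `run_lstm` on a sentence of n input vectors returns n hidden states, which corresponds to the state sequence described above. A practical system would rely on a deep learning framework rather than hand-written loops.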
In the present invention, step 4 further includes:
obtaining constraint rules from the training data through the maximized probability prediction model, and guaranteeing the validity of the predicted labels through these constraint rules;
step 4.1: for an input sequence and a given label sequence, the Score of the label sequence is defined as:
$Score = EmissionScore + TransitionScore$ (6)
step 4.2: let $EmissionScore = X_{0,START} + X_{1,B\text{-}NAME} + \ldots + X_{n-1,O} + X_{n,END}$, where $X_{0,START}$ and $X_{n,END}$ may be set to 0;
step 4.3: let $TransitionScore = t_{START \to B\text{-}NAME} + t_{B\text{-}NAME \to I\text{-}NAME} + \ldots + t_{O \to END}$; TransitionScore is the sum of the corresponding entries in the sequence state transition matrix;
step 4.4: calculating the loss of the maximized probability prediction model; for a given input sequence there may be many candidate label sequences, and the objective of the maximized probability prediction model is to maximize the ratio of the Score of the true sequence to the sum of the Scores of all possible sequences; the relevant calculation formula is as follows:
In the present invention, the EmissionScore in step 4.2 is generated as follows:
step 4.2.1: each word vector in the sentence is passed through the time sequence probability prediction model to obtain the label score at each position of the sequence; the sum $X_{1,B\text{-}NAME} + X_{2,I\text{-}NAME} + \ldots + X_{n-1,O}$ is output to the CRF layer for the calculation of EmissionScore.
In the invention, the TransitionScore in step 4.3 is generated as follows:
step 4.3.1: a probability value corresponds to each transition between positions i and i+1 of the sequence, and TransitionScore is the sum of these values, taken from the state transition matrix of the sequence;
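As an illustration of equation (6), the sketch below adds the emission scores along one label sequence to the transition scores between consecutive labels. The function name, the dictionary-based data layout, and the numbers are assumptions chosen for readability; positions are indexed from 0 over the real characters only, with START and END contributing an emission score of 0 as the text permits.

```python
def path_score(emissions, transitions, tags):
    """Score = EmissionScore + TransitionScore for one label sequence.

    emissions[i][tag]         score of `tag` at character position i
    transitions[(prev, nxt)]  score of moving from label prev to label nxt
    """
    emission_score = sum(emissions[i][tag] for i, tag in enumerate(tags))
    path = ["START"] + list(tags) + ["END"]
    transition_score = sum(transitions[(a, b)] for a, b in zip(path, path[1:]))
    return emission_score + transition_score


# Hypothetical scores for a two-character name.
emissions = [{"B-NAME": 1.5, "O": 0.25}, {"E-NAME": 1.0, "O": 0.5}]
transitions = {
    ("START", "B-NAME"): 0.5,
    ("B-NAME", "E-NAME"): 0.75,
    ("E-NAME", "END"): 0.25,
}
print(path_score(emissions, transitions, ["B-NAME", "E-NAME"]))  # 4.0
```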
In the invention, the labels of the hierarchical mail information labeling system comprise: "B-NAME", "E-NAME", "I-NAME", "B-TEL", "E-TEL", "I-TEL", "B-PROVINCE", "E-PROVINCE", "I-PROVINCE", "B-CITY", "E-CITY", "I-CITY", "B-AREA", "E-AREA", "I-AREA", "B-DETAILS", "E-DETAILS", "I-DETAILS", "O";
where B marks the first character of an entity, E marks the last character of an entity, I marks a character in the middle of an entity, and O marks a non-entity character. These individual characters are used as the input for text vectorization.
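A minimal sketch of this character-level labeling is given below. The helper name `bieo_tags`, the `(start, end, type)` span format, and the sample string are illustrative assumptions. A single-character entity is tagged `B-` here, which is a choice made for the sketch because the scheme as described has no dedicated single-character label.

```python
def bieo_tags(text, spans):
    """Label every character of `text` with a BIEO tag.

    `spans` is a list of (start, end, entity_type) with `end` exclusive.
    Hypothetical helper for illustration only.
    """
    tags = ["O"] * len(text)
    for start, end, etype in spans:
        for i in range(start, end):
            if i == start:
                tags[i] = "B-" + etype      # first character of the entity
            elif i == end - 1:
                tags[i] = "E-" + etype      # last character of the entity
            else:
                tags[i] = "I-" + etype      # interior character
    return tags


sample = "张三13800000000山东省"
spans = [(0, 2, "NAME"), (2, 13, "TEL"), (13, 16, "PROVINCE")]
for ch, tag in zip(sample, bieo_tags(sample, spans)):
    print(ch, tag)
```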
The invention also provides a NER-based non-standardized Chinese express mail information identification system, which comprises: a client, used for entering a user's Chinese express mail information and visually outputting the parsing result of the express address information in the text;
a server, which executes a computer program to implement the NER-based non-standardized Chinese express mail information identification method;
and a database, used for storing the Chinese express mail information of the customer base.
From the above technical scheme, the invention has the following advantages:
According to the NER-based non-standard Chinese express mail information identification method, the Chinese express address text entered by the user at the client is first segmented with a single character as the smallest segmentation unit, each segmented character is labeled, and the resulting labeled text is input into the text vectorization model.
Second, word embedding is performed with the text vectorization model. Because the text vectorization model is based on the BERT model, it draws on BERT's strengths to capture the dependencies among sentences more thoroughly and to construct the text sequence in the express mail information. Next, a time sequence probability prediction model is provided: the vectorized text is converted into context-dependent word vectors, which are input into the time sequence probability prediction model for further semantic decoding, and the predicted value of each label is output. A maximized probability prediction model is then provided to decode the output sequence of the BiLSTM layer; it can obtain constraint rules from the training data to ensure that the predicted labels are reasonable, and the finally obtained predicted labels are converted into the corresponding entities. Finally, the result is returned to the user interface so that the user can visually check the accuracy of the express mail text recognition.
To address the shortcomings of existing research on non-standardized Chinese express mail information identification, the identification model is built as follows. First, the Chinese express address text entered by the user at the client is segmented with a single character as the smallest segmentation unit, and each segmented character is labeled. The labeled text is input into the text vectorization model, word embedding is performed, the distribution of word sequences in the express mail information text is constructed, and the prediction probability of any sequence is obtained. The time sequence probability prediction model converts the vectorized text into context-dependent word vectors and performs further semantic decoding to obtain the predicted value of each label as the input of the downstream inference task. The maximized probability prediction model decodes the output sequence of the time sequence probability prediction model, reduces the probability of illegal sequences, and obtains the most probable legal predicted labels. Finally, the predicted labels are converted into the corresponding entities and the recognition result is returned to the user interface, where the accuracy of the Chinese express mail text recognition can be verified. For the problem of identifying non-standardized Chinese express mail information, the new model and algorithm provided by the invention improve recognition accuracy. Comparative experiments show that the NER-based non-standardized Chinese express mail information identification model outperforms other traditional models and the more mature address resolution methods on the market, and has practical value.
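The text does not name the decoding algorithm used to obtain the most probable legal label sequence. For a linear-chain model of this kind the usual choice is Viterbi decoding, sketched below under that assumption. The function name, the dictionary layout, and the convention that a missing transition is forbidden are all illustrative.

```python
def viterbi(emissions, transitions, labels):
    """Return the label sequence with the highest emission + transition score.

    emissions[i][tag] and transitions[(prev, nxt)] hold scores; a transition
    absent from `transitions` is treated as forbidden (score -inf).
    """
    neg_inf = float("-inf")

    def t(a, b):
        return transitions.get((a, b), neg_inf)

    # Best score of any path ending in each label at position 0.
    best = {y: t("START", y) + emissions[0][y] for y in labels}
    back = []
    for e in emissions[1:]:
        prev, best, ptr = best, {}, {}
        for y in labels:
            p = max(labels, key=lambda q: prev[q] + t(q, y))
            best[y] = prev[p] + t(p, y) + e[y]
            ptr[y] = p                      # remember the best predecessor
        back.append(ptr)
    last = max(labels, key=lambda y: best[y] + t(y, "END"))
    path = [last]
    for ptr in reversed(back):              # follow back-pointers to the start
        path.append(ptr[path[-1]])
    return path[::-1]
```

With transitions such as B-NAME to O left out of the table, a sequence that looks best position by position but is illegal under the labeling scheme cannot be selected, which is the corrective effect described above.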
Drawings
In order to more clearly illustrate the technical solutions of the present invention, the drawings that are needed in the description will be briefly introduced below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a non-standard Chinese express mail information identification method based on NER;
FIG. 2 is a diagram of a text vectorization model;
FIG. 3 is a block diagram of a time series probability prediction model unit;
FIG. 4 is an overall structure diagram of a time sequence probability prediction model;
FIG. 5 is an architecture diagram of a NER-based non-standardized Chinese express mail information identification model;
FIG. 6 is a schematic diagram of the input and output of the NER-based non-standardized Chinese express mail information identification of the present invention.
Detailed Description
In the non-standard Chinese express mail information identification method based on NER provided by the invention, the units and algorithm steps of each example described in the disclosed embodiment can be realized by electronic hardware, computer software or a combination of the two, and in order to clearly illustrate the interchangeability of hardware and software, the components and steps of each example have been generally described according to functions in the above description. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The block diagram shown in the drawing of the non-standard Chinese express mail information identification method based on NER is only a functional entity and does not necessarily correspond to a physically independent entity. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
In the non-standard Chinese express mail information identification method based on NER provided by the invention, it should be understood that the disclosed system, device and method can be realized in other modes. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices, or elements, or may be an electrical, mechanical, or other form of connection.
As shown in FIG. 1 to FIG. 6, the NER-based non-standard Chinese express mail information identification method provided by the present invention includes:
S1: express mail information is collected uniformly from the order data of an express company and then preprocessed to obtain a labeled data set;
Step 1.1: express mail information is collected uniformly from the order data of the express company to form a Chinese express mail information data set.
Step 1.2: the obtained Chinese express mail information data set is preprocessed, and the text is segmented with a single character as the unit.
Step 1.3: the individual characters are labeled with the BIEO system. The labels of the hierarchical mail information labeling system comprise: "B-NAME", "E-NAME", "I-NAME", "B-TEL", "E-TEL", "I-TEL", "B-PROVINCE", "E-PROVINCE", "I-PROVINCE", "B-CITY", "E-CITY", "I-CITY", "B-AREA", "E-AREA", "I-AREA", "B-DETAILS", "E-DETAILS", "I-DETAILS", "O". Here B marks the first character of an entity, E marks the last character of an entity, I marks a character in the middle of an entity, and O marks a non-entity character. These individual characters are used as the input for text vectorization.
S2: reading data, establishing a text vectorization model to perform word characteristic representation, and obtaining word embedding and position embedding;
Word embedding is performed with the text vectorization model, and the distribution of word sequences in the express mail information text is constructed so as to evaluate the probability of any word sequence.
S3: establishing a time sequence probability prediction model to perform semantic decoding to obtain label score probability corresponding to each word;
In step S3, the time sequence probability prediction model is used to memorize needed information and forget useless information in both the forward and backward directions. Each unit of the time sequence probability prediction model consists of the input word $x_t$ at the current time $t$, the cell state $C_t$, the temporary cell state $\tilde{C}_t$, the hidden state $h_t$, the forget gate $f_t$, the memory gate $i_t$, and the output gate $o_t$.
Step 3.1: from the hidden state $h_{t-1}$ at the previous time and the input word $x_t$ at the current time, compute the forget gate, which selects the information to be forgotten, to obtain $f_t$. The formula is as follows:
$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$ (1)
Step 3.2: from the hidden state $h_{t-1}$ at the previous time and the input word $x_t$ at the current time, compute the memory gate, which selects the information to be memorized, to obtain $i_t$ and the temporary cell state $\tilde{C}_t$. The formulas are as follows:
$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$ (2)
$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$
Step 3.3: from the memory gate value $i_t$, the forget gate value $f_t$, the temporary cell state $\tilde{C}_t$, and the cell state $C_{t-1}$ at the previous time, compute the cell state $C_t$ at the current time. The formula is as follows:
$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$ (3)
Step 3.4: from the hidden state $h_{t-1}$ at the previous time, the input word $x_t$ at the current time, and the cell state $C_t$ at the current time, compute the output gate value $o_t$ and the hidden state $h_t$. The formulas are as follows:
$h_t = o_t * \tanh(C_t)$ (4)
$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$ (5)
At each time step, the information in the cell state is updated and discarded through the memory gate and the forget gate, the useful information is computed and passed to the next unit, and finally a state sequence $\{h_0, h_1, h_2, \ldots, h_{n-1}\}$ of the same length as the sentence is obtained.
S4: establishing a maximum probability prediction model to learn the label transition probability in the data set, and correcting the output of the time sequence probability prediction model to obtain an accurate and reasonable label prediction sequence;
That is, constraint rules are obtained from the training data through the maximized probability prediction model, and the validity of the predicted labels is guaranteed by these rules.
Step 4.1: for an input sequence and a given label sequence, the Score of the label sequence is defined as:
$Score = EmissionScore + TransitionScore$ (6)
Step 4.2: let $EmissionScore = X_{0,START} + X_{1,B\text{-}NAME} + \ldots + X_{n-1,O} + X_{n,END}$, where $X_{0,START}$ and $X_{n,END}$ may be set to 0.
In the invention, the EmissionScore in step 4.2 is generated as follows:
Step 4.2.1: each word vector in the sentence is passed through the time sequence probability prediction layer to obtain the label score at each position of the sequence; the sum $X_{1,B\text{-}NAME} + X_{2,I\text{-}NAME} + \ldots + X_{n-1,O}$ is output to the maximized probability prediction model for the calculation of EmissionScore.
Step 4.3: let $TransitionScore = t_{START \to B\text{-}NAME} + t_{B\text{-}NAME \to I\text{-}NAME} + \ldots + t_{O \to END}$; TransitionScore is the sum of the corresponding entries in the sequence state transition matrix.
In the invention, the TransitionScore in step 4.3 is generated as follows:
Step 4.3.1: a probability value corresponds to each transition between positions i and i+1 of the sequence, and TransitionScore is the sum of these values, taken from the state transition matrix of the sequence.
Step 4.4: calculating the loss; for a given input sequence there may be many candidate label sequences, and the objective of the maximized probability prediction model is to maximize the ratio of the Score of the true sequence to the sum of the Scores of all possible sequences; the relevant calculation formula is as follows:
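The loss formula itself is not reproduced in this text. As a hedged sketch, the code below assumes the common softmax form, in which each Score is exponentiated before the ratio is taken and the loss is the negative logarithm of that ratio. All names are illustrative, missing transition scores default to 0 in this sketch, and the exhaustive enumeration over label sequences is only workable for tiny examples; a practical implementation would compute the normalizer with the forward algorithm.

```python
import math
from itertools import product


def score(emissions, transitions, tags):
    """EmissionScore + TransitionScore of one label sequence."""
    path = ["START"] + list(tags) + ["END"]
    return (sum(e[t] for e, t in zip(emissions, tags))
            + sum(transitions.get((a, b), 0.0) for a, b in zip(path, path[1:])))


def neg_log_likelihood(emissions, transitions, labels, true_tags):
    """-log( exp(S_true) / sum over all label sequences of exp(S) )."""
    all_scores = [score(emissions, transitions, p)
                  for p in product(labels, repeat=len(emissions))]
    m = max(all_scores)
    # Numerically stable log-sum-exp over every candidate sequence.
    log_z = m + math.log(sum(math.exp(s - m) for s in all_scores))
    return log_z - score(emissions, transitions, true_tags)
```

Minimizing this quantity raises the share of the true sequence among all candidates, which matches the stated objective.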
S5: visually displaying the recognition result for the non-standard Chinese express mail information entities.
In the present invention, step 5 further includes: returning the result of the Chinese express mail information text analyzed by the NER-based non-standardized Chinese express mail information identification model to a user interface in a visual mode; the user can view the results of the entity identification as well as the accuracy.
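Before the result can be displayed, the predicted labels have to be converted back into entities. The sketch below shows one way to group BIEO-labeled characters into typed strings. The function name and the lenient handling of malformed sequences (an I- or E- label without a preceding B- simply opens a new span) are assumptions, not details from the patent.

```python
def tags_to_entities(text, tags):
    """Group characters into (entity_type, string) pairs from BIEO labels."""
    entities, buf, current = [], [], None

    def flush():
        # Emit the span collected so far, if any, and reset the buffer.
        nonlocal buf, current
        if current is not None:
            entities.append((current, "".join(buf)))
        buf, current = [], None

    for ch, tag in zip(text, tags):
        if tag == "O":
            flush()
            continue
        prefix, etype = tag.split("-", 1)
        if prefix == "B" or etype != current:
            flush()                         # close any open span first
            buf, current = [ch], etype
        else:
            buf.append(ch)
        if prefix == "E":
            flush()                         # an E- label ends the entity
    flush()
    return entities
```

The resulting list of pairs can then be rendered field by field in the user interface.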
To make the objects, features, and advantages of the present invention easier to understand, a specific embodiment is described in detail below with reference to the figures. The embodiment described is only some, not all, of the possible embodiments. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in this patent without inventive effort fall within the scope of protection of this patent. The specific steps are as follows:
Step one: express mail information is collected uniformly from the order data of an express company and then preprocessed to obtain a labeled data set.
Step 1.1: mail information is acquired from the order data of the express company to form a Chinese express mail information data set.
Step 1.2: the obtained Chinese express mail information data set is preprocessed, and the text is segmented with a single character as the unit.
Step 1.3: the individual characters are labeled with the BIEO system. The labels of the hierarchical mail information labeling system comprise: "B-NAME", "E-NAME", "I-NAME", "B-TEL", "E-TEL", "I-TEL", "B-PROVINCE", "E-PROVINCE", "I-PROVINCE", "B-CITY", "E-CITY", "I-CITY", "B-AREA", "E-AREA", "I-AREA", "B-DETAILS", "E-DETAILS", "I-DETAILS", "O". Here B marks the first character of an entity, E marks the last character of an entity, I marks a character in the middle of an entity, and O marks a non-entity character. These individual characters are used as the input for text vectorization. The hierarchical labeling system is shown in Table 1.
Table 1. Hierarchical mail information labeling system
Step two: word embedding is performed with the text vectorization model, and the distribution of word sequences in the express mail information text is constructed so as to evaluate the probability of any word sequence.
FIG. 2 gives a simple example. Because the data set obtained in step one is large, a simple example is used here to show the parts of the express mail information that the user's input text should contain. FIG. 2 illustrates the overall flow of word embedding with the text vectorization model to construct the distribution of word sequences in the express mail information text. The probability of any word sequence is evaluated through the text vectorization model. To train bidirectional features, the pre-training tasks of the text vectorization model consist of masked language modeling and next-sentence prediction. Because the text vectorization model is based on BERT, its overall architecture follows BERT's multi-layer Transformer encoder, in which encoder layers are stacked to achieve a better effect. Each encoder layer consists of a multi-head attention layer and a feed-forward layer. The fully connected computation of self-attention in the encoder allows word-to-word correlations to be expressed more accurately.
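To make the self-attention computation concrete, the following single-head sketch scores every position against every other position and mixes the vectors accordingly. It is a simplification under stated assumptions: queries, keys, and values are all taken to be the input vectors themselves, whereas a real encoder applies learned projections and several heads; the function names are illustrative.

```python
import math


def softmax(row):
    """Normalize a list of scores into weights that sum to 1."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X):
    """Scaled dot-product attention with Q = K = V = X (no learned weights).

    X is a list of equal-length vectors, one per character position.
    Returns the mixed vectors and the attention weights.
    """
    d = len(X[0])
    out, weights = [], []
    for q in X:
        # Similarity of this position to every position, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        weights.append(w)
        out.append([sum(w_j * v[c] for w_j, v in zip(w, X)) for c in range(d)])
    return out, weights
```

Each row of the returned weights shows how strongly one character attends to every other character in the input.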
And thirdly, performing semantic decoding by using a time sequence probability prediction model to obtain label score probability corresponding to each word.
Fig. 3 and 4 give a simple example. When in useThe sequence probability prediction model is a model which consists of time sequences in the front direction and the rear direction and can memorize needed information and forget useless information. The structural unit of the time sequence probability prediction model is formed by inputting word x at the current moment t State of cell C t Temporary cell statusHidden state h t Forgetting door f t Memory gate i t And an output gate o t Composition is prepared.
Step 3.1: according to the hidden layer state h of the previous moment t-1 And the input word x at the current time t Calculating a forgetting door, and selecting information to be forgotten to obtain f t . The formula is as follows:
f t =σ(W f ·[h t-1 ,x t ]+b f ) (1)
step 3.2: according to the hidden layer state h of the previous moment t-1 And the input word x at the current time t Calculating the information to be memorized selected by the memory gate to obtain i t And temporary cell statusThe formula is as follows:
i t =σ(W i ·[h t-1 ,x t ]+b i ) (2)
step 3.3: according to the value i of the memory gate t Forget the value of the door, f t Temporary cell statusCell state C at the last moment t-1 To calculate the cell state C at the current time t . The formula is as follows:
Step 3.4: According to the hidden layer state $h_{t-1}$ of the previous moment, the input word $x_t$ at the current moment, and the cell state $C_t$ at the current moment, the value $o_t$ of the output gate and the hidden layer state $h_t$ are calculated. The formulas are as follows:
$h_t = o_t * \tanh(C_t)$ (4)
$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$ (5)
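Steps 3.1 to 3.4 can be sketched as a single unit update applied in both directions. The following Python/NumPy sketch is illustrative only: the temporary cell state and the cell state update use the standard LSTM forms, which is an assumption here, and the sizes, random weights, and helper names (`init_params`, `lstm_step`, `run_lstm`) are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(rng, embed, hidden):
    # Hypothetical small random weights; keys mirror W_f, b_f, ... in the formulas.
    params = {}
    for gate in ("f", "i", "C", "o"):
        params["W_" + gate] = rng.normal(scale=0.1, size=(hidden, hidden + embed))
        params["b_" + gate] = np.zeros(hidden)
    return params

def lstm_step(x_t, h_prev, C_prev, p):
    """One unit update following steps 3.1 to 3.4."""
    z = np.concatenate([h_prev, x_t])                 # [h_{t-1}, x_t]
    f_t = sigmoid(p["W_f"] @ z + p["b_f"])            # forgetting gate, formula (1)
    i_t = sigmoid(p["W_i"] @ z + p["b_i"])            # memory gate, formula (2)
    C_tilde = np.tanh(p["W_C"] @ z + p["b_C"])        # temporary cell state (assumed standard form)
    C_t = f_t * C_prev + i_t * C_tilde                # cell state update (assumed standard form)
    o_t = sigmoid(p["W_o"] @ z + p["b_o"])            # output gate, formula (5)
    h_t = o_t * np.tanh(C_t)                          # hidden state, formula (4)
    return h_t, C_t

def run_lstm(xs, p, hidden):
    # Process the sequence in order and collect one hidden state per position.
    h, C = np.zeros(hidden), np.zeros(hidden)
    states = []
    for x_t in xs:
        h, C = lstm_step(x_t, h, C, p)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
embed, hidden, seq_len = 6, 4, 5                      # illustrative sizes only
xs = rng.normal(size=(seq_len, embed))                # stand-in for embedded word vectors
forward = run_lstm(xs, init_params(rng, embed, hidden), hidden)
# Backward direction: read the sequence in reverse, then flip back to align positions.
backward = run_lstm(xs[::-1], init_params(rng, embed, hidden), hidden)[::-1]
states = np.concatenate([forward, backward], axis=1)
print(states.shape)  # (5, 8)
```

The concatenated `states` array has one row per input word, which matches the statement that the resulting state sequence has the same length as the sentence.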
As can be seen from FIG. 3, $x_t$ is the information added at time t. At each unit time, the information in the cell state is updated and discarded through the memory gate and the forgetting gate, and the useful information is calculated and passed to the next unit, finally yielding a state sequence $\{h_0, h_1, h_2, \ldots, h_{n-1}\}$ of the same length as the sentence. These gate structures allow information to pass selectively, removing information from or adding information to the cell state. The output $o_t$ of the current neuron is finally obtained through this operation. FIG. 4 is a diagram of the time sequence probability prediction model, whose input is the embedded word vectors obtained by the text vectorization model and whose output is the predicted label scores corresponding to each word, such as 0.9 (B-NAME), 0.7 (I-NAME), 0.05 (E-AREA), 0.01 (O).
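As an illustration of how per-word label scores of this kind could be turned into entities, the sketch below picks the highest-scoring label for each character and groups B/I/E labels into spans. The sample text, the score values, and the helper names `best_labels` and `decode_bieo` are hypothetical; the sketch also ignores the later correction step and simply discards inconsistent label runs.

```python
def best_labels(score_rows):
    # For each character, keep the label with the highest score.
    return [max(row, key=row.get) for row in score_rows]

def decode_bieo(chars, labels):
    """Group per-character BIEO labels into (entity_type, text) pairs.

    B opens an entity, I continues it, E closes it, O is a non-entity.
    A span is emitted only when it runs B ... E with one entity type.
    """
    entities, buf, cur = [], [], None
    for ch, lab in zip(chars, labels):
        if lab == "O":
            buf, cur = [], None
            continue
        prefix, etype = lab.split("-", 1)
        if prefix == "B":
            buf, cur = [ch], etype
        elif prefix in ("I", "E") and cur == etype:
            buf.append(ch)
            if prefix == "E":
                entities.append((etype, "".join(buf)))
                buf, cur = [], None
        else:
            buf, cur = [], None  # inconsistent label: drop the partial span
    return entities

chars = list("张小明上海市")  # hypothetical input: a name followed by a city
score_rows = [              # hypothetical label scores, one dict per character
    {"B-NAME": 0.9, "I-NAME": 0.05, "O": 0.01},
    {"B-NAME": 0.1, "I-NAME": 0.7, "O": 0.05},
    {"I-NAME": 0.2, "E-NAME": 0.6, "O": 0.1},
    {"B-CITY": 0.8, "I-CITY": 0.1, "O": 0.05},
    {"B-CITY": 0.1, "I-CITY": 0.75, "O": 0.05},
    {"I-CITY": 0.2, "E-CITY": 0.7, "O": 0.05},
]
labels = best_labels(score_rows)
entities = decode_bieo(chars, labels)
print(entities)  # [('NAME', '张小明'), ('CITY', '上海市')]
```

Taking the per-character maximum can produce invalid label orders, which is the motivation for correcting the output with learned label transition probabilities.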
Fourth, the maximized probability prediction model learns the label transition probabilities in the data set and corrects the output of the time sequence probability prediction model, so that an accurate and reasonable label prediction sequence is obtained. This comprises the following steps:
Step 4.1: For an input sequence and a given tag sequence, the Score of the tag sequence is defined as:
Score=EmissionScore+TransitionScore (6)
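Step 4.2: Let $\text{EmissionScore} = X_{0,\text{START}} + X_{1,\text{B-NAME}} + \ldots + X_{n-1,\text{O}} + X_{n,\text{END}}$, where $X_{0,\text{START}}$ and $X_{n,\text{END}}$ may be set to 0;
Step 4.3: Let $\text{TransitionScore} = t_{\text{START} \to \text{B-NAME}} + t_{\text{B-NAME} \to \text{I-NAME}} + \ldots + t_{\text{O} \to \text{END}}$, where TransitionScore is the corresponding sum in the sequence state transition matrix;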
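The score in formula (6) can be worked through on a toy example. In the Python sketch below, the emission and transition values are made-up numbers, `path_score` is a hypothetical helper name, and transitions missing from the table are treated as 0 for brevity.

```python
def path_score(emissions, transitions, tags):
    """Score = EmissionScore + TransitionScore for one tag sequence.

    emissions:   list of dicts, emissions[i][tag] is the score of tag at position i
    transitions: dict mapping (from_tag, to_tag) to a transition score
    tags:        candidate tag sequence, without START and END
    """
    # Emission scores of START and END are taken as 0, so only real positions count.
    emission_score = sum(emissions[i][t] for i, t in enumerate(tags))
    path = ["START", *tags, "END"]
    transition_score = sum(
        transitions.get((a, b), 0.0) for a, b in zip(path, path[1:])
    )
    return emission_score + transition_score

emissions = [  # hypothetical per-position label scores
    {"B-NAME": 2.0, "I-NAME": 0.5, "O": 0.1},
    {"B-NAME": 0.3, "I-NAME": 1.5, "O": 0.2},
    {"B-NAME": 0.1, "I-NAME": 0.4, "O": 1.0},
]
transitions = {  # hypothetical entries of the state transition matrix
    ("START", "B-NAME"): 0.5,
    ("B-NAME", "I-NAME"): 1.0,
    ("I-NAME", "O"): 0.5,
    ("O", "END"): 0.5,
}
score = path_score(emissions, transitions, ["B-NAME", "I-NAME", "O"])
print(score)  # 7.0
```

Here EmissionScore is 2.0 + 1.5 + 1.0 = 4.5 and TransitionScore is 0.5 + 1.0 + 0.5 + 0.5 = 2.5, giving a Score of 7.0.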
Step 4.4: Calculating the loss. Given an input sequence, there may be many possible tag sequences, and the objective of the maximized probability prediction model is to maximize the ratio of the Score of the real sequence to the sum of the Scores of all possible sequences. The relevant calculation formula is as follows:
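One common way to write this objective for a CRF layer, offered here as an assumed standard form rather than the exact formula of this disclosure, is $\text{Loss} = -\log \frac{e^{S_{\text{real}}}}{\sum_{i} e^{S_i}}$, where $S_{\text{real}}$ is the Score of the real tag sequence and the sum runs over all possible tag sequences. The Python sketch below computes it with a forward recursion and checks the result against brute-force enumeration; all numeric values and function names are hypothetical.

```python
import itertools
import math

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def path_score(emissions, transitions, tags):
    # Score = EmissionScore + TransitionScore; missing transitions count as 0.
    path = ["START", *tags, "END"]
    trans = sum(transitions.get((a, b), 0.0) for a, b in zip(path, path[1:]))
    emit = sum(emissions[i][t] for i, t in enumerate(tags))
    return emit + trans

def log_partition(emissions, transitions, tagset):
    """log of the sum over all tag sequences of exp(Score), via forward recursion."""
    alpha = {t: transitions.get(("START", t), 0.0) + emissions[0][t] for t in tagset}
    for row in emissions[1:]:
        alpha = {
            t: row[t] + logsumexp(
                [alpha[p] + transitions.get((p, t), 0.0) for p in tagset]
            )
            for t in tagset
        }
    return logsumexp([alpha[t] + transitions.get((t, "END"), 0.0) for t in tagset])

def crf_loss(emissions, transitions, tagset, real_tags):
    # Negative log of exp(S_real) divided by the sum of exp(S_i) over all sequences.
    log_z = log_partition(emissions, transitions, tagset)
    return log_z - path_score(emissions, transitions, real_tags)

tagset = ["B-NAME", "I-NAME", "O"]
emissions = [  # hypothetical per-position label scores
    {"B-NAME": 2.0, "I-NAME": 0.5, "O": 0.1},
    {"B-NAME": 0.3, "I-NAME": 1.5, "O": 0.2},
    {"B-NAME": 0.1, "I-NAME": 0.4, "O": 1.0},
]
transitions = {  # hypothetical transition scores
    ("START", "B-NAME"): 0.5,
    ("B-NAME", "I-NAME"): 1.0,
    ("I-NAME", "O"): 0.5,
    ("O", "END"): 0.5,
}
real_tags = ["B-NAME", "I-NAME", "O"]
log_z = log_partition(emissions, transitions, tagset)
loss = crf_loss(emissions, transitions, tagset, real_tags)

# Brute-force check: enumerate every possible tag sequence explicitly.
brute = logsumexp([
    path_score(emissions, transitions, list(p))
    for p in itertools.product(tagset, repeat=len(emissions))
])
```

Minimizing this loss raises the share of the real sequence among all candidate sequences. The forward recursion avoids enumerating every sequence, which grows exponentially with sentence length.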
FIG. 5 shows the model architecture diagram of the NER-based non-standardized Chinese express mail information identification model.
FIG. 6 shows an input-output schematic diagram of non-standardized Chinese express mail information identification, which makes it convenient to display the results visually to the user.
A method for visualizing the analysis result of non-standardized Chinese express mail information is provided, in which the recognition result for the user's non-standardized Chinese express mail information is returned in a visual form.
The invention also provides a non-standardized Chinese express mail information identification system based on NER, which comprises: the client is used for inputting Chinese express mail information of a user and visually outputting an analysis result of express address information in a text; the server side executes the computer program to realize the non-standard Chinese express mail information identification method based on NER; and the database end is used for storing the Chinese express mail information of the client group.
The client provides an operation and display interface using JSP technology. The interface includes the input of basic Chinese express mail information (name, telephone number, province/autonomous region/municipality directly under the central government, city/autonomous prefecture/county/autonomous county, district, detailed address, invalid information, and the like) and the visual display of the recognition results for non-standardized Chinese express mail information entities. The server side is implemented in Java; it processes the intercepted request and then returns the result to the client. The database end uses MySQL to establish a database that stores the basic express mail information of the client group.
The units and algorithm steps of the examples of the NER-based non-standardized Chinese express mail information identification system, described in connection with the embodiments disclosed herein, can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (2)
1. A non-standard Chinese express mail information identification method based on NER is characterized by comprising the following steps:
step 1: uniformly acquiring express mail information from next data of an express company, and preprocessing data to obtain a labeling data set;
step 1.1: uniformly acquiring express mail information from the next data of an express company to form a Chinese express mail information data set;
step 1.2: preprocessing the obtained Chinese express mail information data set, and word segmentation is carried out on the text by taking a single character as a unit;
step 1.3: marking single characters by using a BIEO system;
step 2: reading data, establishing a text vectorization model to perform word characteristic representation, and obtaining word embedding and position embedding;
the step 2 comprises the following steps: establishing a text vectorization model to perform word embedding, expressing the characteristics of words, and constructing the distribution of word sequences in the express mail information text to evaluate the probability of any word sequence;
the hierarchical mail information labeling system label comprises: "B-NAME", "E-NAME", "I-NAME", "B-TEL", "E-TEL", "I-TEL", "B-PROVINCE", "E-PROVINCE", "I-PROVINCE", "B-CITY", "E-CITY", "I-CITY", "B-AREA", "E-AREA", "I-AREA", "B-DETAILS", "E-DETAILS", "I-DETAILS", "O";
wherein B represents the first word of the entity word, E represents the last word of the entity word, I represents the word in the middle of the entity word, and O represents the non-entity; the entity words are used as input for text vectorization;
step 3: establishing a time sequence probability prediction model for semantic decoding to obtain corresponding label score probability;
memorizing the needed information and forgetting useless information from two directions by using a time sequence probability prediction model;
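the unit of the time sequence probability prediction model consists of the input word $x_t$ at the current moment, the cell state $C_t$, the temporary cell state $\tilde{C}_t$, the hidden layer state $h_t$, the forgetting gate $f_t$, the memory gate $i_t$, and the output gate $o_t$;
step 3.1: according to the hidden layer state $h_{t-1}$ at the previous moment and the input word $x_t$ at the current moment, selecting the information to be forgotten, and calculating the forgetting gate $f_t$ by the following formula:
$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$ (1)
step 3.2: according to the hidden layer state $h_{t-1}$ at the previous moment and the input word $x_t$ at the current moment, calculating the value $i_t$ selected by the memory gate and the temporary cell state $\tilde{C}_t$ by the following formulas:
$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$ (2)
$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$
step 3.3: according to the value $i_t$ of the memory gate, the value $f_t$ of the forgetting gate, the temporary cell state $\tilde{C}_t$, and the cell state $C_{t-1}$ at the previous moment, calculating the cell state $C_t$ at the current moment by the following formula:
$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$
step 3.4: according to the hidden layer state $h_{t-1}$ at the previous moment, the input word $x_t$ at the current moment, and the cell state $C_t$ at the current moment, calculating the output gate $o_t$ and the hidden layer state $h_t$ by the following formulas:
$h_t = o_t * \tanh(C_t)$ (4)
$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$ (5)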
The information in the state of the unit is updated and discarded through the memory gate and the forgetting gate at each unit moment, useful information is calculated and transmitted to the next unit, and finally, a state sequence with the same sentence length is obtained;
Step 4: establishing a maximized probability prediction model to learn the label transition probabilities in the data set, and correcting the output of the time sequence probability prediction model to obtain an accurate and reasonable label prediction sequence;
obtaining constraint rules from training data through a maximized probability prediction model, and guaranteeing validity of a prediction label through the constraint rules;
step 4.1: for an input sequence and a given tag sequence, the Score of the tag sequence is defined as:
Score = EmissionScore + TransitionScore (6)
step 4.2.1: EmissionScore is generated as follows: each word vector in the sentence is passed through the time sequence probability prediction model to obtain the sum of the corresponding label scores of each position in the sequence, which is output to the CRF layer for calculation of EmissionScore;
step 4.3: let $\text{TransitionScore} = t_{\text{START} \to \text{B-NAME}} + t_{\text{B-NAME} \to \text{I-NAME}} + \ldots + t_{\text{O} \to \text{END}}$, where
TransitionScore is the corresponding sum in the sequence state transition matrix;
TransitionScore is generated as follows:
step 4.3.1: a probability value corresponds to each pair of adjacent positions i and i+1 of the sequence, and TransitionScore is the sum of the corresponding probability values between positions in the sequence state transition matrix;
step 4.4: calculating the loss of the maximized probability prediction model; given an input sequence, there may be many possible tag sequences, and the objective of the maximized probability prediction model is to maximize the ratio of the Score of the real sequence to the sum of the Scores of all possible sequences, the relevant calculation formula being as follows:
step 5: visually displaying the recognition result of the non-standard Chinese express mail information entities.
2. A non-standardized Chinese express mail information identification system based on NER, characterized by comprising: a client, used for inputting a user's Chinese express mail information and visually outputting the analysis result of the express address information in the text;
a server side executing a computer program to implement the non-standard Chinese express mail information identification method based on NER as set forth in claim 1;
and the database end is used for storing the Chinese express mail information of the client group.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110951137.5A CN113657103B (en) | 2021-08-18 | 2021-08-18 | Non-standard Chinese express mail information identification method and system based on NER |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113657103A (en) | 2021-11-16 |
CN113657103B (en) | 2023-05-12 |
Family
ID=78481075
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110951137.5A Active CN113657103B (en) | 2021-08-18 | 2021-08-18 | Non-standard Chinese express mail information identification method and system based on NER |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113657103B (en) |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110826331B (en) * | 2019-10-28 | 2023-04-18 | 南京师范大学 | Intelligent construction method of place name labeling corpus based on interactive and iterative learning |
CN111310471B (en) * | 2020-01-19 | 2023-03-10 | 陕西师范大学 | Travel named entity identification method based on BBLC model |
CN111382575A (en) * | 2020-03-19 | 2020-07-07 | 电子科技大学 | Event extraction method based on joint labeling and entity semantic information |
CN111783462B (en) * | 2020-06-30 | 2023-07-04 | 大连民族大学 | Chinese named entity recognition model and method based on double neural network fusion |
CN112765314B (en) * | 2020-12-31 | 2023-08-18 | 广东电网有限责任公司 | Power information retrieval method based on power ontology knowledge base |
CN112784051A (en) * | 2021-02-05 | 2021-05-11 | 北京信息科技大学 | Patent term extraction method |
Also Published As
Publication number | Publication date |
---|---|
CN113657103A (en) | 2021-11-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021135910A1 (en) | Machine reading comprehension-based information extraction method and related device | |
CN110705301B (en) | Entity relationship extraction method and device, storage medium and electronic equipment | |
CN113641820B (en) | Visual angle level text emotion classification method and system based on graph convolution neural network | |
CN109582956B (en) | Text representation method and device applied to sentence embedding | |
CN108062388A (en) | Interactive reply generation method and device | |
CN112465017A (en) | Classification model training method and device, terminal and storage medium | |
CN111444367B (en) | Image title generation method based on global and local attention mechanism | |
CN111382271B (en) | Training method and device of text classification model, text classification method and device | |
CN112256866B (en) | Text fine-grained emotion analysis algorithm based on deep learning | |
WO2021208727A1 (en) | Text error detection method and apparatus based on artificial intelligence, and computer device | |
CN111428557A (en) | Method and device for automatically checking handwritten signature based on neural network model | |
CN110619119B (en) | Intelligent text editing method and device and computer readable storage medium | |
CN111695338A (en) | Interview content refining method, device, equipment and medium based on artificial intelligence | |
CN113158656B (en) | Ironic content recognition method, ironic content recognition device, electronic device, and storage medium | |
CN111429204A (en) | Hotel recommendation method, system, electronic equipment and storage medium | |
CN113707299A (en) | Auxiliary diagnosis method and device based on inquiry session and computer equipment | |
CN112632993A (en) | Electric power measurement entity recognition model classification method based on convolution attention network | |
CN109086463A (en) | A kind of Ask-Answer Community label recommendation method based on region convolutional neural networks | |
CN114491289A (en) | Social content depression detection method of bidirectional gated convolutional network | |
CN112949637A (en) | Bidding text entity identification method based on IDCNN and attention mechanism | |
CN113705207A (en) | Grammar error recognition method and device | |
CN113761845A (en) | Text generation method and device, storage medium and electronic equipment | |
CN113657103B (en) | Non-standard Chinese express mail information identification method and system based on NER | |
WO2023116572A1 (en) | Word or sentence generation method and related device | |
CN115147849A (en) | Training method of character coding model, character matching method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||