CN115146635B - Address segmentation method based on domain knowledge enhancement - Google Patents

Address segmentation method based on domain knowledge enhancement

Info

Publication number
CN115146635B
CN115146635B (application CN202211075587.3A)
Authority
CN
China
Prior art keywords
address
text
sample
label
segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211075587.3A
Other languages
Chinese (zh)
Other versions
CN115146635A (en)
Inventor
杨伊态
付卓
王彦华
陈胜鹏
李颖
徐继伟
王敬佩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Geospace Information Technology Co ltd
Original Assignee
Geospace Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Geospace Information Technology Co ltd filed Critical Geospace Information Technology Co ltd
Priority to CN202211075587.3A priority Critical patent/CN115146635B/en
Publication of CN115146635A publication Critical patent/CN115146635A/en
Application granted granted Critical
Publication of CN115146635B publication Critical patent/CN115146635B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention is applicable to the technical field of urban management systems and provides an address segmentation method based on domain knowledge enhancement, comprising: S1, training the parameters of an address segmentation model with a labeled training sample set to obtain the address segmentation model; S2, inputting a textual unified address to be segmented; and S3, segmenting the textual unified address into a plurality of address elements through the address segmentation model. The invention uses a BERT model to obtain the word vectors of the address and uses a BiLSTM network and a CRF network to label the text address, which improves the address segmentation capability and yields better segmentation accuracy and stronger generalization capability. In addition, the invention forcibly modifies the transition probabilities in the CRF layer according to the characteristics of address text, which is equivalent to fusing the domain knowledge of the specific service into the model, thereby improving segmentation accuracy and reducing the probability of segment misalignment.

Description

Address segmentation method based on domain knowledge enhancement
Technical Field
The invention belongs to the technical field of urban management systems, and particularly relates to an address segmentation method based on domain knowledge enhancement.
Background
The construction of a unified address library is a very important link in the field of urban management.
In the process of entering unified addresses into the database, the different text segments of a unified address text need to be marked with their corresponding address elements. For example, in the virtual unified address "Handong Province, Jingzhou City, Economic Development Zone, Dafeng Street, Shanshu District, Building 10, Room 201", "Handong Province" is marked as province, "Jingzhou City" as city, "Economic Development Zone" as district, "Dafeng Street" as street, "Shanshu District" as residential district (cell), "Building 10" as building, and "Room 201" as room number. However, because of administrative-division factors, the unified addresses of different regions contain different address elements. Assume that a city's unified addresses contain 10 address elements in total: province, city, district, street, residential district (cell), building, room number, village, group, and house number. A unified address in an urban area typically does not contain a village or group, while an address in a suburban or rural area typically does not contain a room number.
Existing address segmentation methods fall mainly into three categories:
The first category is rule-based address segmentation methods. Such methods segment text addresses by constructing rules. They match efficiently and accurately, but the rules are complex, the labor cost is high, and there is almost no generalization capability.
For example, for the virtual address "Handong Province, Jingzhou City, Dakang County, Ruilong Town, Changmin Village, Group 6, Zhang Sansi", rules are constructed such as: "XX Province" represents a province, "XX City" represents a city, and so on. However, address elements such as personal names in villages and names of residential districts in towns have very complicated characteristics, and constructing rules for them is costly, so rule-based address segmentation can only handle addresses with relatively simple structures and very obvious features.
The second category is address segmentation methods based on machine learning. Such methods use a machine learning algorithm or model, automatically learn segmentation features through sample training, and then segment addresses according to the learned features. Classical examples include methods based on the Hidden Markov Model (HMM) and on the Conditional Random Field (CRF). Compared with rule-based methods, machine-learning-based methods greatly reduce labor cost, but their generalization capability is still poor.
For example, consider the virtual address "Handong Province, Jingzhou City, Guangming District, Guangming Street, Dafeng Community, Guangfu Road No. 1, Dafeng Factory, Building 5, Room 17B" and the address element "Dafeng Factory" within it. If this address element has appeared in training, a machine-learning-based method can segment it correctly; for example, the training data may contain a virtual address such as "Handong Province, Jingzhou City, Guangming District, Guangming Street, Dafeng Community, Guangfu Road No. 1, Dafeng Factory, Building 4, Room 2B". However, if this address element never appears in training, a machine-learning-based method has a high probability of segmenting it incorrectly.
The third category is methods based on deep learning. Such methods construct a multi-layer neural network, learn the segmentation feature parameters from training samples, and then use the learned network to segment addresses. Compared with the first two categories, deep-learning-based algorithms achieve higher accuracy. However, existing deep-learning-based address segmentation methods make little use of the characteristic knowledge of unified addresses (from front to back, the geographic region represented by the address elements narrows from broad to precise), their extraction of text semantics is still insufficient, and their accuracy still needs to be improved.
For example, some existing deep learning methods generate static word vectors for the text, which lowers accuracy. In a virtual address such as "Handong Province, Jingzhou City, Dakang County, Ruilong Town, Changmin Village, Group 6, Zhang Sansi", the same character or string can occur at two different positions; a static word-vector model generates the identical text vector for both occurrences, whereas a more appropriate representation would assign them different vectors because their positions and contexts differ.
Disclosure of Invention
In view of the foregoing problems, an object of the present invention is to provide an address segmentation method based on domain knowledge enhancement, which aims to solve the technical problems that the accuracy and generalization capability of the existing address segmentation method need to be improved.
The invention adopts the following technical scheme:
the address segmentation method based on the field knowledge enhancement comprises the following steps:
s1, training parameters of an address segmentation model by using a marked training sample set to obtain the address segmentation model;
s2, inputting a text unified address to be segmented;
and S3, segmenting the text unified address into a plurality of address elements through the address segmentation model.
Further, the specific process of step S1 is as follows:
s11, dividing a marked training sample set into training samples and verification samples in proportion, wherein the format of each sample is [text address, segmented sample], the segmented sample is the set obtained by segmenting the corresponding text address, the segmented sample is composed of a plurality of segmentation elements, and each segmentation element comprises a "text fragment" and an "address element name";
s12, converting the training sample into a standard sample, wherein the format of the standard sample is [ text address, text label sequence ], the text address is the text address in the sample, each text label in the text label sequence consists of 'address element name' and 'word code', and the text label sequence in the standard sample is a correct text label sequence corresponding to the text address;
s13, converting the text address in the standard sample into a word vector by using a Bert model;
s14, inputting the word vectors into a BiLSTM network to obtain a text address emission probability matrix;
s15, obtaining all possible label sequences by using a CRF network according to the text address emission probability matrix, and calculating the total scores of the label sequences and the scores of the text label sequences;
s16, obtaining a loss score according to the score of the correct text label sequence and the total score of the label sequence;
and S17, updating the model parameters by a gradient descent method according to the loss score, verifying with the verification samples, and selecting the parameter version with the highest verification accuracy as the finally trained address segmentation model.
Further, the specific process of step S12 is as follows:
counting the address element names appearing in all the training samples, and generating an address element name set;
converting the training sample into a standard sample according to a text address, a segmentation sample and a generated address element name set of the training sample, wherein the format of the standard sample is [ text address, text label sequence ], the text address is the text address in the sample, each text label in the text label sequence consists of an address element name and a word code, the number of the word codes is three, the word codes are B, I and S respectively, and the text label sequence in the standard sample is a correct text label sequence corresponding to the text address;
and counting all the appeared text labels and generating a text label set.
Further, the specific process of step S13 is as follows:
dividing the text address in the standard sample into words;
converting the text address into token codes through the BERT model and obtaining the corresponding position codes;
and inputting the token codes and the position codes of the text address into the BERT model to obtain the corresponding word vectors.
Further, the specific process of step S14 is as follows:
and inputting the word vectors into a BiLSTM network to obtain the hidden-layer state vectors of the text address, and inputting the hidden-layer state vectors into a fully connected layer to obtain the text address emission probability matrix.
Further, the specific process of step S15 is as follows:
inputting the text address emission probability matrix into a CRF network, wherein the CRF layer of the CRF network obtains the score of each text label sequence based on the emission probability matrix and the transition probability matrix, and computes the total score of the label sequences as

$$S_{\mathrm{total}}(x) = \log \sum_{\tilde{y} \in Y_x} e^{s(x, \tilde{y})}$$

where $s(x, y)$ represents the score of the input sample x marked as the text label sequence y.
Further, in step S16, the score of a label sequence and the loss score are calculated as:

$$s(x, y) = \sum_{i=1}^{n} E_{i, y_i} + \sum_{i=2}^{n} T_{y_{i-1}, y_i}$$

$$\mathrm{Loss}(x, \bar{y}) = \log \sum_{\tilde{y} \in Y_x} e^{s(x, \tilde{y})} - s(x, \bar{y})$$

wherein $E_{i, y_i}$ represents the emission probability value of the i-th label in the text label sequence y, n is the length of the whole text label sequence y, and $T_{y_{i-1}, y_i}$ represents the transition probability value for the transition from the (i-1)-th label to the i-th label in the text label sequence y; $s(x, y)$ represents the score of the input sample x marked as the text label sequence y, $s(x, \bar{y})$ represents the score of the correct text label sequence $\bar{y}$ of the input sample x, $\log \sum_{\tilde{y} \in Y_x} e^{s(x, \tilde{y})}$ represents the total score over all possible label sequences of the input sample x, and $\mathrm{Loss}(x, \bar{y})$ represents the loss score of the correct text label sequence $\bar{y}$ of the input sample x.
Further, the specific process of step S3 is as follows:
s31, converting the text address in the input text unified address into a word vector;
s32, calculating a text address emission probability matrix by using a BiLSTM network;
S33, obtaining a predicted label sequence of the text address by using a CRF network;
and S34, obtaining a final segmentation result according to the predicted tag sequence.
Further, the step S33 specifically includes the following steps:
modifying a transition probability matrix in the trained address segmentation model;
and inputting the text address emission probability matrix into the CRF network after the transition probability matrix is corrected, wherein the CRF layer of the CRF network obtains a marking result with the highest score by using a Viterbi algorithm on the basis of the emission probability matrix and the transition probability matrix and records the marking result as a predicted label sequence.
The invention has the following beneficial effects: the invention provides an address segmentation method based on domain knowledge enhancement, which segments an input textual unified address into a plurality of address elements; specifically, it uses a BERT model to obtain the word vectors of the address and uses a BiLSTM network and a CRF network to label the text address. In addition, the invention forcibly modifies the transition probabilities in the CRF layer according to the characteristics of address text (from front to back, the region represented by the address elements narrows from broad to precise), which is equivalent to fusing the domain knowledge of the specific service into the model, thereby improving segmentation accuracy and reducing the probability of segment misalignment.
Drawings
FIG. 1 is a flow chart of a method for address segmentation based on domain knowledge enhancement according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an address segmentation model training phase according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a text address translation to a word vector according to an embodiment of the present invention;
FIG. 4 is a state transition diagram provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of an inference phase of an address segmentation model according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides an address segmentation method, which is used for segmenting an address: a textual unified address is segmented into a plurality of address elements. For example, the unified address system of the virtual Jingzhou City has 15 levels of address elements, including: province, city, district/county, functional area, street, community/village, group, road, house number, residential district (cell), building number, and room.
The virtual unified address 1 is: Handong Province, Jingzhou City, Guangming District, Guangming Street, Dafeng Community, Guangfu Road No. 1, Dafeng Factory, Building 5, Room 17B.
The segmentation result is:
Handong Province (province), Jingzhou City (city), Guangming District (district/county), Guangming Street (street), Dafeng Community (community/village), Guangfu Road (road), No. 1 (house number), Dafeng Factory (residential district), Building 5 (building number), 17B (room).
The virtual unified address 2 is: Handong Province, Jingzhou City, Dakang County, Ruilong Town, Changmin Village, Group 6, Zhang Sansi.
The segmentation result is:
Handong Province (province), Jingzhou City (city), Dakang County (district/county), Ruilong Town (street), Changmin Village (community/village), Group 6 (group), Zhang Sansi (building number).
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
Fig. 1 shows a flow of an address segmentation method based on domain knowledge enhancement according to an embodiment of the present invention, and only the relevant parts according to an embodiment of the present invention are shown for convenience of description.
As shown in fig. 1, the address segmentation method based on the domain knowledge enhancement provided by this embodiment includes the following steps:
s1, training parameters of the address segmentation model by using the marked training sample set to obtain the address segmentation model.
The main purpose of this step is to train the address segmentation model. Before training, the labeled training sample set is divided into two parts in a set proportion: training samples and verification samples. The training samples are input into the address segmentation model, the model learns all of its parameters from the training samples, the trained model is then tested with the verification samples, and the parameter version with the highest test accuracy is retained.
The specific implementation process is as follows:
s11, dividing the marked training sample set into training samples and verification samples according to a proportion, wherein the format of each sample is [ text address, subsection sample ], the subsection sample is a set segmented by corresponding text addresses, the subsection sample is composed of a plurality of subsection elements, and each subsection element comprises a text fragment and an address element name.
The text address refers to a uniform address of the text to be segmented, and the segmentation sample is a set obtained after segmentation of the corresponding text address. The section sample is composed of a plurality of section elements, each of which is composed of < text fragment > + < address element name >.
A typical sample is:
[text address: Handong Province, Jingzhou City, Guangming District, Dafeng Factory, Building 5, Room 17B; segmented sample: (Handong Province, province), (Jingzhou City, city), (Guangming District, district/county), (Dafeng Factory, residential district), (Building 5, building), (17B, room number)].
Here, "Handong Province" and "Jingzhou City" are text fragments, while "province" and "city" are address element names. The address element names can also be expressed in English for convenience of recording, i.e. the segmented sample becomes: [(Handong Province, Province), (Jingzhou City, City), (Guangming District, District), (Dafeng Factory, ResRegion), (Building 5, Building), (17B, Room)].
And S12, converting the training sample into a standard sample, wherein the format of the standard sample is [ text address, text label sequence ], the text address is the text address in the sample, and each text label in the text label sequence consists of 'address element name' and 'word code'.
The main implementation of this step is to convert the training samples into standard samples for subsequent processing. The specific process is as follows:
121. and counting the address element names appearing in all the training samples and generating an address element name set.
For example, there are 3 training samples, where the address elements appearing in sample 1 are: { province, city, district/county, district, building, room number }, address elements appearing in sample 2 are: { province, city, district/county, town, group, building }, address elements appearing in sample 3 are: { province, city, functional area, street, district, building }. The set of address element names is: { province, city, district/county, district, building, room number, town, group, functional area, street }.
122. And converting the training samples into standard samples according to the text addresses, the segmentation samples and the generated address element name sets of the training samples.
The standard sample consists of two parts: [text address, text label sequence]. The text address is the text address of the input training sample. Each text label in the text label sequence consists of two parts, <address element name> + <word code>, and the text label sequence in the standard sample is the correct text label sequence corresponding to the text address. The address element name is the corresponding address element name in the training sample, and there are three word codes: B, I and S. If the text of one address element contains more than one character, the word code of the first character is B and the word codes of the remaining characters are I; if an address element contains only one character, the word code of that character is S.
For example, for the address element "Handong Province" (a three-character fragment in the original Chinese), the text labels are Province-B, Province-I, Province-I.
For example, for the two address elements "Building 5" (a two-character fragment) followed by room "B" (a single character), the text labels are Building-B, Building-I, Room-S.
123. And counting all the appeared text labels and generating a text label set.
A typical conversion from an input training sample to a standard sample is:
Training sample:
[text address: Handong Province, Jingzhou City, Guangming District, Guangming Street, Dafeng Community, Guangfu Road No. 1, Dafeng Factory, Building 5, Room 17B; segmented sample: (Handong Province, Province), (Jingzhou City, City), (Guangming District, District), (Guangming Street, Street), (Dafeng Community, Community), (Guangfu Road, Road), (No. 1, Door), (Dafeng Factory, ResRegion), (Building 5, Building), (17B, Room)].
The standard sample after conversion is:
[text address: Handong Province, Jingzhou City, Guangming District, Guangming Street, Dafeng Community, Guangfu Road No. 1, Dafeng Factory, Building 5, Room 17B; text label sequence: [Province-B, Province-I, Province-I, City-B, City-I, City-I, District-B, District-I, District-I, Street-B, Street-I, Street-I, Street-I, Community-B, Community-I, Community-I, Community-I, Road-B, Road-I, Road-I, Door-B, Door-I, ResRegion-B, ResRegion-I, ResRegion-I, Building-B, Building-I, Room-B, Room-I, Room-I]], with one text label per character of the original Chinese address.
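For illustration only, the conversion from a segmented sample to a BIS text label sequence can be sketched in Python as follows; the function name and the exact sample format are assumptions based on the example above, not part of the claimed method:

```python
def to_standard_sample(text_address, segments):
    """Convert [text address, segmented sample] into [text address, text label sequence].

    segments is a list of (text_fragment, element_name) pairs whose fragments,
    concatenated, spell out text_address character by character.
    """
    labels = []
    for fragment, element in segments:
        if len(fragment) == 1:                       # single-character element -> S
            labels.append(f"{element}-S")
        else:                                        # first character -> B, the rest -> I
            labels.append(f"{element}-B")
            labels.extend(f"{element}-I" for _ in fragment[1:])
    assert len(labels) == len(text_address), "one label per character"
    return text_address, labels

# e.g. to_standard_sample("汉东省京州市", [("汉东省", "Province"), ("京州市", "City")])
# -> ("汉东省京州市", ["Province-B", "Province-I", "Province-I",
#                     "City-B", "City-I", "City-I"])
```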
Fig. 2 is a schematic diagram of the address segmentation model training phase; the conversion of a text address into a standard sample can be followed with reference to Fig. 2.
And S13, converting the text address in the standard sample into a word vector by using a Bert model.
With reference to fig. 2 and 3, the specific process is as follows:
131. the text addresses in the standard sample are segmented into words.
For example, the virtual address "Handong Province, Jingzhou City, Guangming District, Guangming Street, Dafeng Community, Guangfu Road No. 1, Dafeng Factory, Building 5, Room 17B" is split character by character into 30 tokens: each Chinese character, digit and letter of the original address (for example the "1", "7" and "B" of room 17B) becomes one token.
132. Convert the text address into token codes through the BERT model and obtain the corresponding position codes.
Each character in the text address is a token, and each token has a corresponding token code in the BERT vocabulary. The unified text address is therefore split into tokens character by character using the tokenizer of the BERT model, each character is converted into its corresponding token code, and a position code is assigned according to the character's position in the address.
The converted token codes are [3727, 691, 4689, 776, 2336, 2356, 1045, 3209, 1277, 1045, 3209, 6125, 6887, 1920, 7599, 4852, 1277, 1045, 836, 6662, 122, 1384, 1920, 7599, 1322, 126, 3406, 122, 128, 144]; the position codes are: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29].
133. Input the token codes and position codes of the text address into the BERT model to obtain the corresponding word vectors.
The token codes and position codes of the text address are converted by the BERT model into the corresponding word vectors: Addr_e.
Addr_e is a set of vectors of dimension Addr_len × 768, where Addr_len is the number of tokens in the sample, i.e. each token in a sample is converted into a 768-dimensional vector.
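A minimal sketch of steps 131 to 133 with the Hugging Face transformers library is given below; the checkpoint name bert-base-chinese and the omission of the [CLS]/[SEP] special tokens are assumptions made to mirror the per-character example above, and are not details fixed by the patent:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-chinese")

address = "汉东省京州市"                              # a short virtual text address
tokens = list(address)                               # one token per character
token_codes = tokenizer.convert_tokens_to_ids(tokens)
position_codes = list(range(len(tokens)))

input_ids = torch.tensor([token_codes])              # shape (1, Addr_len)
position_ids = torch.tensor([position_codes])
with torch.no_grad():
    outputs = bert(input_ids=input_ids, position_ids=position_ids)

addr_e = outputs.last_hidden_state[0]                # Addr_e: Addr_len x 768 word vectors
```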
And S14, inputting the word vectors into the BiLSTM network to obtain the text address emission probability matrix.
The word vectors Addr_e are input into a BiLSTM (bidirectional long short-term memory) network to obtain the hidden-layer state vectors Addr_v of the text address, and Addr_v is then input into a fully connected layer to obtain the text address emission probability matrix Emit_m.
Emit_m is a matrix of dimension Tag_num × Addr_len, where Tag_num is the total number of labels in the text label set and Addr_len is the number of tokens in the sample.
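A PyTorch sketch of the BiLSTM plus fully connected layer is shown below; the hidden size, the single LSTM layer and the tag count are illustrative assumptions, since the patent does not fix these hyperparameters:

```python
import torch
import torch.nn as nn

class EmissionNet(nn.Module):
    """Word vectors -> BiLSTM hidden states (Addr_v) -> emission scores (Emit_m)."""

    def __init__(self, emb_dim=768, hidden_dim=256, tag_num=21):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, tag_num)   # one score per text label

    def forward(self, addr_e):                 # addr_e: (batch, Addr_len, 768)
        addr_v, _ = self.bilstm(addr_e)        # hidden-layer state vectors
        emit_m = self.fc(addr_v)               # emission matrix, (batch, Addr_len, Tag_num)
        return emit_m

emit_m = EmissionNet()(torch.randn(1, 30, 768))   # e.g. the 30-token address above
```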
And S15, obtaining all possible label sequences by using the CRF network according to the text address emission probability matrix, and calculating the total score of the label sequences and the score of the text label sequence.
The text address emission probability matrix Emit_m is input into a CRF (conditional random field) network, and the CRF layer calculates the score of each label sequence based on the emission probability matrix Emit_m and the transition probability matrix Trans_m. The emission probability matrix Emit_m is the output of the BiLSTM layer, while the transition probability matrix Trans_m is a parameter of the CRF network and is a Tag_num × Tag_num matrix. The value of Trans_m[i][j] represents the transition probability from the i-th label to the j-th label in the text label set. The values of Trans_m are assigned randomly before the first training pass and are adjusted at each subsequent training iteration.
For example, if the text label set is {City-B, City-I, Town-B, Town-I, Town-S}, the transition probability matrix has size 5 × 5; a value of 0.03 for Trans_m[2][3] indicates that the probability of a transition from City-I to Town-B is 0.03, and a value of 0.5 for Trans_m[1][5] indicates that the probability of a transition from City-B to Town-S is 0.5.
Referring to the state transition diagram shown in Fig. 4, if the input text address is "Jingzhou City" (a three-character address in the original Chinese), the score of each label sequence is obtained according to the score calculation formula:

$$s(x, y) = \sum_{i=1}^{n} E_{i, y_i} + \sum_{i=2}^{n} T_{y_{i-1}, y_i}$$

wherein $E_{i, y_i}$ represents the emission probability value of the i-th label in the text label sequence y, n is the length of the whole text label sequence y, and $T_{y_{i-1}, y_i}$ represents the transition probability value for the transition from the (i-1)-th label to the i-th label in the text label sequence y. Assuming the text label sequence is (Province-I, City-B, City-I), the score is (0.7 + 0.4 + 0.5) + (0.8 + 0.8) = 3.2. The total score of the label sequences is computed from the scores of all possible label sequences (3 × 3 × 3 = 27 sequences in Fig. 4, each with its own score $s(x, \tilde{y})$), using the formula:

$$S_{\mathrm{total}}(x) = \log \sum_{\tilde{y} \in Y_x} e^{s(x, \tilde{y})}$$
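The worked example above can be reproduced with a few lines of Python; the numeric emission and transition values below are illustrative assumptions chosen to match the example, not learned parameters:

```python
import torch

def sequence_score(emit_m, trans_m, tags):
    """s(x, y) = sum of emission scores + sum of transition scores along the sequence."""
    emission = sum(emit_m[i, t] for i, t in enumerate(tags))
    transition = sum(trans_m[tags[i - 1], tags[i]] for i in range(1, len(tags)))
    return emission + transition

# Label indices: 0 = City-B, 1 = Province-I, 2 = City-I (an arbitrary assumed order).
emit_m = torch.tensor([[0.0, 0.7, 0.0],    # position 0: Province-I scores 0.7
                       [0.4, 0.0, 0.0],    # position 1: City-B scores 0.4
                       [0.0, 0.0, 0.5]])   # position 2: City-I scores 0.5
trans_m = torch.zeros(3, 3)
trans_m[1, 0] = 0.8                        # Province-I -> City-B
trans_m[0, 2] = 0.8                        # City-B -> City-I

print(sequence_score(emit_m, trans_m, [1, 0, 2]))   # tensor(3.2000)
```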
and S16, calculating loss scores according to the total scores of the text label sequences and the scores of the correct text label sequences.
The loss score is calculated as:

$$\mathrm{Loss}(x, \bar{y}) = \log \sum_{\tilde{y} \in Y_x} e^{s(x, \tilde{y})} - s(x, \bar{y})$$

wherein $s(x, y)$ denotes the score of the input sample x marked with the text label sequence y (such as the text label sequence (Province-I, City-B, City-I) above), and $s(x, \bar{y})$ denotes the score of the correct text label sequence $\bar{y}$ of the input sample x. In the loss calculation, $\log \sum_{\tilde{y} \in Y_x} e^{s(x, \tilde{y})}$ is the total score over all possible paths, i.e. over all possible text label sequences, and $s(x, \bar{y})$ is the score of the correct text label sequence. Therefore, in the training stage, the total loss value can be continuously reduced by gradient descent once the total score of the label sequences and the score of the correct text label sequence have been obtained. As the total loss value decreases, the proportion of the correct path within the total path score grows larger and larger; when the loss becomes small enough, the correct path has the maximum score among all paths, and the label sequence with the highest score output by the model is then the correct text label sequence.
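A compact sketch of this loss follows, computing the total score over all label sequences in log space with the standard CRF forward algorithm instead of enumerating the paths explicitly; this is an illustration under the score convention above, not the patent's reference code:

```python
import torch

def crf_loss(emit_m, trans_m, gold_tags):
    """Loss = log-sum-exp of all sequence scores minus the score of the correct sequence."""
    n, tag_num = emit_m.shape
    # Forward algorithm: alpha[j] = log of the summed exp-scores of all prefixes ending in j.
    alpha = emit_m[0].clone()
    for i in range(1, n):
        # previous alpha (prev) + transition (prev -> cur), then add the emission of cur
        alpha = torch.logsumexp(alpha.unsqueeze(1) + trans_m, dim=0) + emit_m[i]
    total_score = torch.logsumexp(alpha, dim=0)          # over all possible label sequences
    gold_score = sum(emit_m[i, t] for i, t in enumerate(gold_tags)) + \
                 sum(trans_m[gold_tags[i - 1], gold_tags[i]] for i in range(1, n))
    return total_score - gold_score
```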
And S17, updating the model parameters by a gradient descent method according to the loss score, verifying with the verification samples, and selecting the parameter version with the highest verification accuracy as the finally trained address segmentation model.
The address segmentation model traverses the training samples for multiple times, and the accuracy of the model is tested by using the verification samples after the training samples are traversed once. And in the address segmentation model training stage, one parameter version with the highest verification accuracy is selected as a finally trained model.
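The training loop of step S17 can be sketched as follows; model.loss and evaluate are hypothetical helpers standing in for the BERT-BiLSTM-CRF loss above and for the accuracy computation on the verification samples, and the optimizer choice and epoch count are assumptions:

```python
import copy
import torch

def train(model, train_samples, val_samples, evaluate, epochs=30, lr=1e-3):
    """Gradient descent on the loss; keep the parameter version with the highest
    verification accuracy as the finally trained address segmentation model."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_acc, best_state = -1.0, None
    for _ in range(epochs):                        # traverse the training samples repeatedly
        model.train()
        for text_address, gold_labels in train_samples:
            optimizer.zero_grad()
            loss = model.loss(text_address, gold_labels)   # hypothetical loss method
            loss.backward()                        # back-propagate the loss value
            optimizer.step()                       # update the model parameters
        acc = evaluate(model, val_samples)         # accuracy on the verification samples
        if acc > best_acc:
            best_acc = acc
            best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)              # restore the best parameter version
    return model
```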
And S2, inputting a text unified address to be segmented.
And after the address segmentation model is established, segmenting the input text unified address to be segmented.
And S3, segmenting the text unified address into a plurality of address elements through the address segmentation model.
The input of the address segmentation model is [text address, text label sequence], and the output of the model is the label sequence with the highest score. This output is not of concern in the training stage; in the inference stage the output is the predicted label sequence, which is referred to in this embodiment as the predicted tag sequence.
With reference to fig. 5, the specific process is as follows:
and S31, converting the text address into a word vector for the received unified address of the text to be segmented, wherein the conversion process is consistent with the process in the step S13, and the conversion process uses a Bert model, which is not described herein any more.
And S32, calculating by using a BiLSTM network to obtain a text address emission probability matrix, wherein the calculation process is consistent with the process in the step S14, and the description is omitted here.
S33, a CRF network is used to obtain a predicted label sequence of the text address.
First, the transition probability matrix in the trained address segmentation model is modified. A characteristic of a textual unified address is that, from front to back, the geographic extent represented by the address elements becomes smaller and smaller. Therefore, according to the address element constraint relations involved in the specific service and the set word-code constraint relations, the probability of any transition from a state i to a state j that cannot occur is reduced. In this embodiment, the value of Trans_m[i][j] in the transition probability matrix of the trained address segmentation model is corrected to -1000, where Trans_m[i][j] represents the transition probability from the i-th label to the j-th label in the text label set. If the constraint relations rule out a transition from state i to state j, Trans_m[i][j] is corrected to -1000.
When $T_{y_{i-1}, y_i}$ is -1000, the overall score $s(x, y)$ becomes very low (generally negative), so that every sequence containing the transition from state i to state j has a very low score and a probability close to 0 after normalization.
For example, the address element constraint relationship is as follows: if "community" must be in front of "cell", the transition probability from "cell" to "community" should be close to 0.
The word encoding constraint relationship is as follows: the "City-I" will not appear in front of the "City-B", and the transition probability from "City-I" to "City-B" should be close to 0.
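A sketch of this correction is given below; the coarse-to-fine element order, the forbidden pairs and the helper name are illustrative assumptions, since the full constraint list depends on the specific service:

```python
FORBIDDEN = -1000.0     # value the embodiment assigns to impossible transitions

def apply_domain_constraints(trans_m, tag_list):
    """Force transitions ruled out by domain knowledge down to a large negative value.

    tag_list maps matrix indices to text labels such as "City-B" or "Community-I".
    """
    # Illustrative coarse-to-fine order of address elements (broad -> precise).
    rank = {"Province": 0, "City": 1, "District": 2, "Street": 3,
            "Community": 4, "Cell": 5, "Building": 6, "Room": 7}
    for i, src in enumerate(tag_list):
        src_elem, src_code = src.rsplit("-", 1)
        for j, dst in enumerate(tag_list):
            dst_elem, dst_code = dst.rsplit("-", 1)
            # Word-code constraint: an I label may only continue a B/I label
            # of the same element (e.g. City-I cannot follow Province-B).
            if dst_code == "I" and not (dst_elem == src_elem and src_code in ("B", "I")):
                trans_m[i][j] = FORBIDDEN
            # Word-code constraint from the example: City-I cannot be followed by City-B,
            # i.e. the same element is assumed not to restart immediately.
            if dst_code == "B" and dst_elem == src_elem:
                trans_m[i][j] = FORBIDDEN
            # Address-element constraint: a coarser element must not follow a finer one,
            # e.g. the transition from "Cell" back to "Community" is ruled out.
            if src_elem in rank and dst_elem in rank and rank[dst_elem] < rank[src_elem]:
                trans_m[i][j] = FORBIDDEN
    return trans_m
```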
And then inputting the text address emission probability matrix into the CRF network after correcting the transition probability matrix, wherein the CRF layer of the CRF network obtains a marking result with the highest score by using a Viterbi algorithm based on the emission probability matrix and the transition probability matrix and records the marking result as a predicted label sequence.
And S34, obtaining a final segmentation result according to the predicted tag sequence.
First, in the predicted tag sequence, if several consecutive labels all contain the same address element elem_i and their word codes are B for the first label and I for the rest, these labels are merged: the corresponding characters in the text address are merged into one text fragment whose address element is labelled elem_i. A label that is not part of such a merged group directly yields a single-character text fragment labelled with the address element of its label.
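A minimal sketch of this merging step, turning a predicted label sequence back into (text fragment, address element) pairs, is shown below; the function name is assumed for illustration:

```python
def labels_to_segments(text_address, predicted_labels):
    """Merge consecutive B/I labels of the same address element into one text fragment."""
    segments = []                                  # list of [fragment, element]
    for char, label in zip(text_address, predicted_labels):
        element, code = label.rsplit("-", 1)
        starts_new = code in ("B", "S") or not segments or segments[-1][1] != element
        if starts_new:
            segments.append([char, element])       # open a new fragment
        else:
            segments[-1][0] += char                # continue the current fragment (an I label)
    return [(fragment, element) for fragment, element in segments]

# e.g. labels_to_segments("汉东省京州市",
#                         ["Province-B", "Province-I", "Province-I",
#                          "City-B", "City-I", "City-I"])
# -> [("汉东省", "Province"), ("京州市", "City")]
```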
Finally, the segmentation result can be directly output. In addition, the model can be further verified and optimized.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (8)

1. An address segmentation method based on domain knowledge enhancement, characterized in that the method comprises the following steps:
s1, training parameters of an address segmentation model by using a marked training sample set to obtain the address segmentation model;
s2, inputting a text unified address to be segmented;
s3, segmenting the text unified address into a plurality of address elements through the address segmentation model;
the specific process of the step S1 is as follows:
s11, dividing a marked training sample set into training samples and verification samples in proportion, wherein the format of each sample is [text address, segmented sample], the segmented sample is the set obtained by segmenting the corresponding text address, the segmented sample is composed of a plurality of segmentation elements, and each segmentation element comprises a "text fragment" and an "address element name";
s12, converting the training sample into a standard sample, wherein the format of the standard sample is [ text address, text label sequence ], the text address is the text address in the sample, each text label in the text label sequence consists of 'address element name' and 'word code', and the text label sequence in the standard sample is a correct text label sequence corresponding to the text address;
s13, converting the text address in the standard sample into a word vector by using a Bert model;
s14, inputting the word vectors into a BiLSTM network to obtain a text address emission probability matrix;
s15, obtaining all possible label sequences by using a CRF network according to the text address emission probability matrix, and calculating the total scores of the label sequences and the scores of the text label sequences;
s16, obtaining a loss score according to the score of the correct text label sequence and the total score of the label sequence;
and S17, updating the model parameters by a gradient descent method according to the loss score, verifying with the verification samples, and selecting the parameter version with the highest verification accuracy as the finally trained address segmentation model.
2. The address segmentation method based on domain knowledge enhancement according to claim 1, wherein the specific process of step S12 is as follows:
counting the names of the address elements appearing in all the training samples, and generating an address element name set;
converting the training sample into a standard sample according to a text address, a segmentation sample and a generated address element name set of the training sample, wherein the format of the standard sample is [ text address, text label sequence ], the text address is the text address in the sample, each text label in the text label sequence consists of 'address element name' and 'word code', the number of the word codes is three, namely B, I and S, and the text label sequence in the standard sample is a correct text label sequence corresponding to the text address;
and counting all the appeared text labels and generating a text label set.
3. The address segmentation method based on domain knowledge enhancement according to claim 2, wherein the specific process of step S13 is as follows:
dividing the text address in the standard sample into words;
converting the text address into a morpheme code through a bert model, and obtaining a corresponding position code;
and respectively inputting the lemma code and the position code of the text address into the bert model to obtain corresponding word vectors.
4. The address segmentation method based on the domain knowledge enhancement as claimed in claim 3, wherein the specific process of step S14 is as follows:
and inputting the word vectors into a BiLSTM network to obtain the hidden-layer state vectors of the text address, and inputting the hidden-layer state vectors into a fully connected layer to obtain the text address emission probability matrix.
5. The address segmentation method based on domain knowledge enhancement as claimed in claim 4, wherein the step S15 comprises the following steps:
inputting the text address emission probability matrix into a CRF network, wherein the CRF layer of the CRF network obtains the score of each text label sequence based on the emission probability matrix and the transition probability matrix, and computes the total score of the label sequences as

$$S_{\mathrm{total}}(x) = \log \sum_{\tilde{y} \in Y_x} e^{s(x, \tilde{y})}$$

where $s(x, y)$ represents the score of the input sample x marked as the text label sequence y.
6. The address segmentation method based on domain knowledge enhancement as claimed in claim 5, wherein in step S16 the score of a label sequence and the loss score are calculated as:

$$s(x, y) = \sum_{i=1}^{n} E_{i, y_i} + \sum_{i=2}^{n} T_{y_{i-1}, y_i}$$

$$\mathrm{Loss}(x, \bar{y}) = \log \sum_{\tilde{y} \in Y_x} e^{s(x, \tilde{y})} - s(x, \bar{y})$$

wherein $E_{i, y_i}$ represents the emission probability value of the i-th label in the text label sequence y, n is the length of the whole text label sequence y, and $T_{y_{i-1}, y_i}$ represents the transition probability value for the transition from the (i-1)-th label to the i-th label in the text label sequence y; $s(x, y)$ represents the score of the input sample x marked as the text label sequence y, $s(x, \bar{y})$ represents the score of the correct text label sequence $\bar{y}$ of the input sample x, $\log \sum_{\tilde{y} \in Y_x} e^{s(x, \tilde{y})}$ represents the total score over all possible label sequences of the input sample x, and $\mathrm{Loss}(x, \bar{y})$ represents the loss score of the correct text label sequence $\bar{y}$ of the input sample x.
7. The address segmentation method based on domain knowledge enhancement according to claim 6, wherein the step S3 comprises the following steps:
s31, converting the text address in the input text unified address into a word vector;
s32, calculating a text address emission probability matrix by using a BiLSTM network;
s33, obtaining a predicted label sequence of the text address by using a CRF network;
and S34, obtaining a final segmentation result according to the predicted tag sequence.
8. The address segmentation method based on domain knowledge enhancement as claimed in claim 7, wherein the step S33 comprises the following steps:
modifying a transition probability matrix in the trained address segmentation model;
and inputting the text address emission probability matrix into a CRF network after correcting the transition probability matrix, wherein a CRF layer of the CRF network obtains a marking result with the highest score by using a Viterbi algorithm based on the emission probability matrix and the transition probability matrix and records the marking result as a predicted label sequence.
CN202211075587.3A 2022-09-05 2022-09-05 Address segmentation method based on domain knowledge enhancement Active CN115146635B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211075587.3A CN115146635B (en) 2022-09-05 2022-09-05 Address segmentation method based on domain knowledge enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211075587.3A CN115146635B (en) 2022-09-05 2022-09-05 Address segmentation method based on domain knowledge enhancement

Publications (2)

Publication Number Publication Date
CN115146635A CN115146635A (en) 2022-10-04
CN115146635B true CN115146635B (en) 2022-11-15

Family

ID=83416010

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211075587.3A Active CN115146635B (en) 2022-09-05 2022-09-05 Address segmentation method based on domain knowledge enhancement

Country Status (1)

Country Link
CN (1) CN115146635B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116579344B (en) * 2023-07-12 2023-10-20 吉奥时空信息技术股份有限公司 Case main body extraction method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514234A (en) * 2012-06-30 2014-01-15 北京百度网讯科技有限公司 Method and device for extracting page information
CN114676353A (en) * 2022-05-25 2022-06-28 武大吉奥信息技术有限公司 Address matching method based on segmentation inference

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514234A (en) * 2012-06-30 2014-01-15 北京百度网讯科技有限公司 Method and device for extracting page information
CN114676353A (en) * 2022-05-25 2022-06-28 武大吉奥信息技术有限公司 Address matching method based on segmentation inference

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Named Entity Recognition in the Bridge Inspection Domain Based on Transformer-BiLSTM-CRF; Li Ren et al.; Journal of Chinese Information Processing; 2021-04-15; Vol. 35, No. 4; pp. 83-91 *
Chinese Named Entity Recognition with BERT-BiLSTM-CRF Incorporating an Attention Mechanism; Liao Tao et al.; Journal of Fuyang Normal University (Natural Science Edition); 2021-09-15; Vol. 38, No. 3; pp. 86-91 *

Also Published As

Publication number Publication date
CN115146635A (en) 2022-10-04

Similar Documents

Publication Publication Date Title
CN109885824B (en) Hierarchical Chinese named entity recognition method, hierarchical Chinese named entity recognition device and readable storage medium
CN108776762B (en) Data desensitization processing method and device
CN106909611B (en) Hotel automatic matching method based on text information extraction
CN113191148B (en) Rail transit entity identification method based on semi-supervised learning and clustering
CN111723569A (en) Event extraction method and device and computer readable storage medium
CN112560478A (en) Chinese address RoBERTA-BilSTM-CRF coupling analysis method using semantic annotation
CN113505225B (en) Small sample medical relation classification method based on multi-layer attention mechanism
CN114676353B (en) Address matching method based on segmentation inference
CN115146635B (en) Address segmentation method based on domain knowledge enhancement
CN116011456B (en) Chinese building specification text entity identification method and system based on prompt learning
CN114091454A (en) Method for extracting place name information and positioning space in internet text
CN116414823A (en) Address positioning method and device based on word segmentation model
CN113312918B (en) Word segmentation and capsule network law named entity identification method fusing radical vectors
CN115688779A (en) Address recognition method based on self-supervision deep learning
CN114936627A (en) Improved segmentation inference address matching method
CN116483990B (en) Internet news content automatic generation method based on big data
CN116522165B (en) Public opinion text matching system and method based on twin structure
CN112926323A (en) Chinese named entity identification method based on multi-stage residual convolution and attention mechanism
CN116386895B (en) Epidemic public opinion entity identification method and device based on heterogeneous graph neural network
CN115270774B (en) Big data keyword dictionary construction method for semi-supervised learning
CN111507103A (en) Self-training neural network word segmentation model using partial label set
CN114330350B (en) Named entity recognition method and device, electronic equipment and storage medium
CN115795060A (en) Entity alignment method based on knowledge enhancement
CN114398886A (en) Address extraction and standardization method based on pre-training
CN111523302A (en) Syntax analysis method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant