CN114936627A - Improved segmentation inference address matching method - Google Patents

Improved segmentation inference address matching method

Info

Publication number
CN114936627A
Authority
CN
China
Prior art keywords
address
matching
elements
vector
standard
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210573269.3A
Other languages
Chinese (zh)
Inventor
杨伊态
陈胜鹏
王敬佩
李颖
许继伟
付卓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Geospace Information Technology Co Ltd
Original Assignee
Geospace Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Geospace Information Technology Co Ltd filed Critical Geospace Information Technology Co Ltd
Priority to CN202210573269.3A priority Critical patent/CN114936627A/en
Publication of CN114936627A publication Critical patent/CN114936627A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • G06N5/041Abduction
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A30/00Adapting or protecting infrastructure or their operation
    • Y02A30/60Planning or developing urban green infrastructure

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention is applicable to the field of urban management systems and provides an improved segmentation inference address matching method. The method first trains three simplified ESIM inference models and three extraction matrices. The extraction matrices are then used to extract three sub-elements (a region element, a building element and a road code element) from each of the key address and the standard address; the simplified ESIM inference models match the corresponding sub-elements; and finally whether the addresses match is judged comprehensively from the matching results of the three sub-elements. Compared with existing address matching methods based on deep learning, the disclosed method is less sensitive to address length, explicitly distinguishes the differences between different geographic elements, improves semantic understanding of the address, and achieves better matching accuracy.

Description

Improved segmentation inference address matching method
Technical Field
The invention belongs to the technical field of urban management, and particularly relates to an improved segmentation inference address matching method.
Background
The address is one of the basic elements of an urban management system and is the hub connecting many other elements such as people, events and objects. In engineering practice, however, the addresses collected by front-line staff are often described differently from the addresses in the system address library. How to align a collected address to be matched (hereinafter the key address) with a unified address stored in the system (hereinafter the standard address) is therefore a very important link in an urban management system. The goal of address matching is to determine whether the key address and the standard address point to the same location.
The existing address matching methods are mainly classified into the following three types:
The first type is rule-based address matching, such as keyword search and edit distance calculation. Such methods define rules according to the string characteristics of text addresses and judge whether an address pair matches according to those rules. The judgment basis is simple and works well for well-formed standard addresses, but works very poorly for irregular key addresses.
For example, in an address matching method based on keyword search with the input keyword "jingzhou luzhou state way", the method only checks whether the target address contains these five characters, so it may judge "jingzhou luzhou state way" and "luzhou jing state way" to be the same address.
Another example is address matching based on edit distance. Take the virtual address pair 1[ "5 buildings 17B of the maw wind factory", "5 buildings 17B of the maw wind factory 5 of the maw wind community in the light bright area of the kyo city, handong province ] and the virtual address pair 2 [" 5 buildings 17B of the maw wind factory 1 of the maw wind community in the light bright area of the kyo city, handong province wind community 5 of the maw wind community, 1 of the maw wind community, 18B of the maw wind factory 5 of the maw wind community in the light bright area of the kyo city, handong province city ]. The two addresses in pair 1 share fewer characters than the two addresses in pair 2, and the identical contiguous text segments between the two addresses in pair 1 are shorter. An edit-distance method therefore rates pair 2 as more similar than pair 1, although pair 1 should in fact be rated as more similar.
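The weakness described above can be reproduced with a few lines of code. The sketch below is a plain dynamic-programming Levenshtein distance; the romanized address strings are hypothetical stand-ins chosen for illustration, not the strings used in the text.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))          # distances from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


# Hypothetical addresses (assumption: illustrative only).
PREFIX = "Handong Jingzhou Guangming Dafeng community "
SHORT = "Dafeng factory building 5 room 17B"

# Pair 1: the same place, written in short form and in full form.
pair_1 = (SHORT, PREFIX + SHORT)
# Pair 2: two different places whose full forms differ in two digits.
pair_2 = (PREFIX + "Dafeng factory building 1 room 17B",
          PREFIX + "Dafeng factory building 5 room 18B")

d1 = levenshtein(*pair_1)   # equals len(PREFIX): every prefix character is inserted
d2 = levenshtein(*pair_2)   # equals 2: two digit substitutions
print(d1, d2)
```

Because a smaller distance means higher similarity, the pair that points to two different places (distance 2) looks far more similar than the pair that points to the same place.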
The second type is address matching based on machine learning, such as support vector machines and bag-of-words statistics. Such methods use an algorithm or model to compute the similarity of address embedding vectors and the similarity of statistical features, and from these determine the degree to which an address pair matches. Compared with simple rule matching, they work better on irregular key addresses. However, they can only extract shallow text semantics, so their matching performance is inferior to that of deep learning methods, and they require domain experts to construct a sufficient number of learning samples, so the labor cost is high.
For example, address matching based on a bag-of-words model gives a high matching value to two addresses that share many words, e.g.
Virtual address 1: 'Guangming street mountain and water mansion in Guangming district of Jingzhou city, Handong province, Building 5, 17B'
Virtual address 2: 'Guangming street Shannan community 15A 17B in Guangming district of Jingzhou city, Handong province'
Because the two addresses share many words and the method can only extract shallow semantics, virtual address 1 and virtual address 2 are misjudged as the same address.
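A minimal sketch of why shared vocabulary inflates a bag-of-words score is given below. It assumes whitespace-separated words and cosine similarity over raw counts; the hyphenated address strings are hypothetical simplifications of the two virtual addresses above, and the patent does not state which similarity measure such methods use.

```python
import math
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between word-count vectors of two strings."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


# Hypothetical word-segmented addresses (assumption: illustrative only).
addr_1 = ("Handong Jingzhou Guangming-district Guangming-street "
          "Shanshui-mansion building-5 17B")
addr_2 = ("Handong Jingzhou Guangming-district Guangming-street "
          "Shannan-community 15A 17B")

similarity = bow_cosine(addr_1, addr_2)   # 5 of 7 words are shared
print(round(similarity, 3))
```

Five of the seven words coincide, so the score is high even though the two addresses name different buildings.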
The third type is address matching based on deep learning. Such methods usually construct a multi-layer neural network, embed the address pair into a multi-dimensional vector space, and then compare the vector similarity of the pair to judge whether the addresses match. Compared with machine learning methods, a neural network can learn the features of text addresses automatically and capture deeper text semantics, so the cost of sample labeling is reduced and the matching performance is better.
However, existing deep learning methods usually judge directly whether the whole address pair matches. This ignores the difference between the matching characteristics of numerical elements in a text address (mostly building numbers, house numbers and room numbers) and those of non-numerical elements (mostly region elements). As a result, the model is overly sensitive to address length and insufficiently sensitive to the numbers in the address, which reduces matching accuracy.
For example, consider the following two virtual addresses:
Virtual address 1: 'Jingzhou city, Guangming district, Dafeng community, Building 5, Unit 2, Room 17B'
Virtual address 2: 'Handong province, Jingzhou city, Guangming district, Guangming street, Dafeng community, Building 5, Unit 2, Room 71B'
The region element of virtual address 1 is 'Jingzhou city, Guangming district, Dafeng community', and its building element is 'Building 5, Unit 2, Room 17B';
the region element of virtual address 2 is 'Handong province, Jingzhou city, Guangming district, Guangming street, Dafeng community', and its building element is 'Building 5, Unit 2, Room 71B'.
The region element 'Jingzhou city, Guangming district, Dafeng community' in virtual address 1 and the region element 'Handong province, Jingzhou city, Guangming district, Guangming street, Dafeng community' in virtual address 2 differ in length but actually point to the same geographic location. The sensitivity of the model to length should therefore be reduced.
The building element 'Building 5, Unit 2, Room 17B' in virtual address 1 and the building element 'Building 5, Unit 2, Room 71B' in virtual address 2 differ only by one pair of swapped digits, yet they point to completely different locations. The sensitivity of the model to numbers should therefore be increased.
Disclosure of Invention
In view of the above problems, the present invention provides an improved segmentation inference address matching method, which aims to solve the technical problem of low matching accuracy of the existing method.
The invention adopts the following technical scheme:
The improved segmentation inference address matching method comprises the following steps:
Step S1: training an address matching model, wherein the address matching model comprises an element inference model and an element extraction layer matrix, the element inference model comprises a region ESIM inference model, a building ESIM inference model and a road code ESIM inference model, and the element extraction layer matrix comprises a region extraction matrix, a building extraction matrix and a road code extraction matrix;
Step S2: inputting the address pair to be matched, and generating a prediction sample pair from the address pair to be matched by means of a prediction sample construction module;
Step S3: performing inference matching on the prediction sample pair using the address matching model to obtain the corresponding matching result.
Further, the specific process of step S3 is as follows:
S31: converting the address pair to be matched into text embedding vectors using a BERT model, and inputting the text embedding vectors into the three extraction matrices to obtain the key region element, key building element and key road code element of the key address, and the standard region element, standard building element and standard road code element of the standard address;
S32: inputting the six elements into the BERT model again to obtain the corresponding element word vectors, and then applying the corresponding one of the three ESIM inference models to each element pair to obtain the matching results of the three elements, namely a region element matching result, a building element matching result and a road code element matching result;
S33: finally, comprehensively calculating the final matching result from the matching results of the three elements.
Further, the training process of the address matching model is as follows:
S11: inputting samples, and dividing the training sample set into training samples and validation samples in proportion;
the format of each sample is [region mark sample, building mark sample, road code mark sample, segmented address mark sample], wherein the format of the region mark sample, the building mark sample and the road code mark sample is [key element, standard element, mark], the format of the segmented address mark sample is [text address, mark index], and the text address is a key address or a standard address;
S12: converting the key elements and standard elements in the region, building and road code mark samples, and the text address in the segmented address mark sample, into corresponding word vectors using a BERT model;
S13: training the region ESIM inference model, the building ESIM inference model and the road code ESIM inference model using the key element word vectors and the standard element word vectors;
S14: training the region extraction matrix, the building extraction matrix and the road code extraction matrix using the text address word vectors.
Further, step S13 and step S14 are trained in parallel.
Further, in step S12, the key elements and standard elements in the region, building and road code mark samples are referred to as address elements, and the specific process of step S12 is as follows:
segmenting the text address and the address elements into individual characters;
converting the character-segmented text address and address elements into token codes using the BERT model, and obtaining the corresponding position codes;
inputting the token codes and the position codes into the BERT model to obtain the corresponding word vectors.
Further, in step S13, the three ESIM inference models (the region ESIM inference model, the building ESIM inference model and the road code ESIM inference model) are simplified models and are trained in the same way, as follows:
inputting the key element word vectors and the standard element word vectors simultaneously into a fully connected neural network to obtain the hidden state vectors of the key element and the standard element;
obtaining the similarity weight matrix of the key element and the standard element through an alignment operation;
computing a weighted sum of the hidden state vectors of the standard element using the similarity weight matrix to obtain the similarity vector of the key element, and computing a weighted sum of the hidden state vectors of the key element using the similarity weight matrix to obtain the similarity vector of the standard element;
subtracting and multiplying the hidden state vector and the similarity vector of the key element, and likewise those of the standard element, and performing soft alignment to obtain the key element information enhancement vector and the standard address information enhancement vector;
inputting the key element information enhancement vector and the standard address information enhancement vector into a bidirectional long short-term memory (Bi-LSTM) neural network to obtain the key element matching vector and the standard element matching vector;
pooling the key element matching vector to obtain the maximum pooling vector and the average pooling vector of the key element; pooling the standard element matching vector to obtain the maximum pooling vector and the average pooling vector of the standard element; and concatenating the four pooling vectors to obtain the element matching information vector;
inputting the element matching information vector into a fully connected layer and obtaining the matching value of each category through a normalized exponential function (softmax), wherein there are three categories: mismatch, match and possible match;
calculating the loss value using a cross-entropy loss function;
updating the model parameters by gradient descent according to the loss value, and selecting the parameter version with the highest validation accuracy as the finally trained ESIM inference model.
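The last two training steps can be illustrated with a small sketch. It assumes the softmax output is a list of three probabilities indexed by the category label (0: mismatch, 1: match, 2: possible match) and that validation accuracy is recorded once per saved parameter version; the function names and the `history` structure are hypothetical.

```python
import math


def cross_entropy(probs, label):
    """Cross-entropy loss of one sample: negative log-probability of the true label."""
    return -math.log(probs[label])


def select_best_version(history):
    """Return the parameter version with the highest validation accuracy.

    history: list of (version_name, validation_accuracy) pairs.
    """
    return max(history, key=lambda item: item[1])[0]


# Softmax output for one element pair whose true label is 1 (match).
loss = cross_entropy([0.1, 0.7, 0.2], label=1)

# Hypothetical validation accuracies recorded after each epoch.
best = select_best_version([("epoch_1", 0.81), ("epoch_2", 0.90), ("epoch_3", 0.88)])
print(round(loss, 4), best)
```

The gradient descent update itself is omitted here; in practice it would be delegated to a deep learning framework.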
Further, in step S14, the three extraction matrices (the region extraction matrix, the building extraction matrix and the road code extraction matrix) are trained in the same way, as follows:
computing the dot product of the extraction matrix with each token encoding vector in the text address word vector to obtain a text address vector;
for each token extraction value in the text address vector, obtaining a token extraction score using the Sigmoid function;
generating element mark indexes from the mark indexes, and calculating the loss value of each predicted token from the token extraction scores and the element mark indexes using a cross-entropy loss function;
updating the parameters of the matrix by gradient descent according to the loss value, and selecting the parameter version with the highest validation accuracy as the finally trained extraction matrix.
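The scoring and loss steps for one extraction matrix can be sketched as follows. This is an illustration under several assumptions not stated in the text: the extraction matrix is treated as a single weight vector that yields one scalar per token, the element mark index is a 0/1 target derived from the mark index string, and the cross-entropy is the binary form averaged over tokens. All function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def extraction_scores(token_vectors, weights):
    """Dot each token vector with the extraction weights, then squash to (0, 1)."""
    return [sigmoid(sum(x * w for x, w in zip(vec, weights)))
            for vec in token_vectors]


def element_targets(mark_index, element_id):
    """Element mark index: 1 where the token belongs to the element, else 0."""
    return [1 if m == element_id else 0 for m in mark_index]


def binary_cross_entropy(scores, targets, eps=1e-12):
    """Mean per-token binary cross-entropy between scores and 0/1 targets."""
    total = 0.0
    for s, t in zip(scores, targets):
        total -= t * math.log(s + eps) + (1 - t) * math.log(1 - s + eps)
    return total / len(scores)


# Mark index from the sample case in the description: 1 = region,
# 2 = road code, 3 = building.
region_targets = element_targets("11111122222233333333", "1")

# Toy 2-dimensional token vectors (real ones would come from BERT).
scores = extraction_scores([[0.0, 0.0], [2.0, 1.0]], weights=[1.0, 2.0])
loss = binary_cross_entropy(scores, [0, 1])
print(region_targets[:8], [round(s, 3) for s in scores], round(loss, 4))
```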
The invention has the beneficial effects that: the invention provides an improved segmentation inference address matching method which is used for judging whether an address to be matched (a key address) input by a user and a uniform address (a standard address) in an address library point to the same destination or not; in the concrete implementation, firstly, three simplified ESIM inference models and three extraction matrixes are trained, a text address is extracted through the extraction matrixes, three sub-elements including region elements, building elements and road code elements are extracted from a key address and a standard address, the sub-elements are respectively matched through the simplified ESIM inference models, and finally whether the addresses are matched or not is comprehensively judged according to matching results of the three sub-elements. Compared with the existing address matching method based on deep learning, the method disclosed by the invention has the advantages that the sensitivity to the address length is reduced, the differences of different geographic elements are explicitly distinguished, the semantic understanding of the address is improved, and the matching accuracy is better.
Drawings
FIG. 1 is a flow diagram of an improved section-inferred address matching method provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of address matching model training provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of simplified ESIM inference model training provided by an embodiment of the invention;
FIG. 4 is a flow chart of address matching provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of address matching model inference provided by an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Fig. 1 illustrates a flow of an address matching method based on section inference according to an embodiment of the present invention, and only the portions related to the embodiment of the present invention are illustrated for convenience of description.
As shown in fig. 1, the improved segmentation-inferred address matching method provided by the present embodiment includes the following steps:
Step S1: training an address matching model.
The address matching model comprises an element inference model and an element extraction layer matrix, wherein the element inference model comprises a region ESIM inference model, a building ESIM inference model and a road code ESIM inference model, and the element extraction layer matrix comprises a region extraction matrix, a building extraction matrix and a road code extraction matrix.
The training process of the address matching model is shown in fig. 2, and includes the following steps:
S11: inputting samples, and dividing the training sample set into training samples and validation samples in proportion.
Before training, the labeled training sample set is divided in proportion (9:1 or another ratio) into two parts: training samples and validation samples. The training samples are input into the address matching model, which learns all of its parameters from them; the model with the trained parameters is then tested on the validation samples, and the parameter version with the highest test accuracy is saved.
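A minimal sketch of the 9:1 split is shown below. Shuffling before the split and the fixed seed are assumptions made for reproducibility; the text only specifies the proportion.

```python
import random


def split_samples(samples, train_ratio=0.9, seed=0):
    """Shuffle a copy of the samples and split it into training and validation parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# 100 placeholder samples give 90 training and 10 validation samples.
train_samples, validation_samples = split_samples(range(100))
print(len(train_samples), len(validation_samples))
```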
The format of each sample is [region mark sample, building mark sample, road code mark sample, segmented address mark sample], wherein the format of the region mark sample, the building mark sample and the road code mark sample is [key element, standard element, mark]. For example, in the region mark sample, the key element is the region element of the address to be matched, the standard element is the region element of the unified address in the address library, and the mark takes one of three values {0, 1, 2}, where 0 means mismatch, 1 means match, and 2 means possible match. The building mark sample and the road code mark sample are analogous to the region mark sample, except that their elements are the building element and the road code element of the address, respectively.
The format of the segmented address mark sample is [text address, mark index], where the text address is a key address or a standard address.
The mark index labels, for each character of the text address, the element it belongs to: for example, 1 indicates a region element, 2 indicates a road code element, and 3 indicates a building element.
The region element refers to the text segments for the province, city, district, street, community and residential compound in the text address.
The building element refers to the text segments for the building and room number in the text address.
The road code element refers to the text segments for the road and house number in the text address.
As shown in fig. 2, in the virtual address "Jingzhou city Dafeng factory Beifeng road No. 10 mountain and river group mansion 4 building", "Jingzhou city Dafeng factory" is the region element, "Beifeng road No. 10" is the road code element, and "mountain and river group mansion 4 building" is the building element.
For another example, the virtual address "south wind road wind power plant mountain and water group No. 4 building No. 12 of Guangming area", wherein "south wind road wind power plant" is the regional element, "south wind road" is the road code element, and "No. 4 building No. 12" is the building element.
Inputting a sample case: [ ("Jingzhou city Guangming Dafeng factory", "Guangming district Dafeng factory", 1), ("5-31", "5-13", 0), ("south wind road No. 3", 1), ("Jingzhou city Dafeng factory Beifeng road No. 10 mountain and river group mansion 4 building", "11111122222233333333") ].
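The relationship between a text address and its mark index can be made concrete with a short sketch. The Chinese string below is an assumed back-translation of the address in the sample case (20 characters, matching the 20-digit mark index); the function and dictionary names are hypothetical.

```python
# Mapping from mark index digits to element names, as described above.
ELEMENT_NAMES = {"1": "region", "2": "road_code", "3": "building"}


def split_by_mark_index(address, mark_index):
    """Group the characters of an address by the element each one is marked with."""
    if len(address) != len(mark_index):
        raise ValueError("address and mark index must have the same length")
    elements = {name: "" for name in ELEMENT_NAMES.values()}
    for char, mark in zip(address, mark_index):
        elements[ELEMENT_NAMES[mark]] += char
    return elements


# Assumed original of "Jingzhou city Dafeng factory Beifeng road No. 10
# mountain and river group mansion 4 building".
ADDRESS = "京州市大风厂北风路10号山水集团大厦4栋"
MARK_INDEX = "11111122222233333333"

elements = split_by_mark_index(ADDRESS, MARK_INDEX)
print(elements)
```

Each digit of the mark index lines up with one character, which is why the address is segmented character by character before being encoded.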
S12: converting the key elements and standard elements in the region, building and road code mark samples, and the text addresses, into corresponding word vectors using a BERT model.
The specific process of this step is as follows:
S121: segmenting the text address and the address elements into individual characters.
The key elements and standard elements in the region, building and road code mark samples are referred to as address elements. An example of segmenting a text address into characters is as follows:
the virtual address "Handong province, Jingzhou city, Guangming district, Guangming street, Dafeng community, Building 5, Unit 2, Room 17B" (汉东省京州市光明区光明街道大风社区5栋2单元17B室) is divided into: [汉, 东, 省, 京, 州, 市, 光, 明, 区, 光, 明, 街, 道, 大, 风, 社, 区, 5, 栋, 2, 单, 元, 1, 7, B, 室].
S122: converting the character-segmented text address and address elements into token codes using the BERT model, and obtaining the corresponding position codes.
For example, for the tokens [汉, 东, 省, 京, 州, 市, 光, 明, 区, 光, 明, 街, 道, 大, 风, 社, 区, 5, 栋, 2, 单, 元, 1, 7, B, 室], the token codes are:
[3727,691,4689,776,2336,2356,1045,3209,1277,1045,3209,6125,6887,1920,7599,4852,1277,126,3406,123,1296,1039,122,128,144,2147]
and the position codes are:
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25].
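The character segmentation and the two code sequences can be reproduced with a toy encoder. The sketch below does not call a real BERT tokenizer: it builds a tiny vocabulary from the token codes listed in the text, paired with an assumed Chinese original of the example address. A real BERT tokenizer would normally also add special tokens such as [CLS] and [SEP], which the example in the text does not show.

```python
# Assumed original of the example address (26 characters).
ADDRESS = "汉东省京州市光明区光明街道大风社区5栋2单元17B室"

# Token codes exactly as listed in the text.
TOKEN_CODES = [3727, 691, 4689, 776, 2336, 2356, 1045, 3209, 1277, 1045,
               3209, 6125, 6887, 1920, 7599, 4852, 1277, 126, 3406, 123,
               1296, 1039, 122, 128, 144, 2147]

# Toy vocabulary: pair each character with the code at the same position.
VOCAB = {}
for char, code in zip(ADDRESS, TOKEN_CODES):
    VOCAB.setdefault(char, code)


def encode(text):
    """Split text into characters and return (token codes, position codes)."""
    tokens = list(text)
    return [VOCAB[t] for t in tokens], list(range(len(tokens)))


token_codes, position_codes = encode(ADDRESS)
print(token_codes[:6], position_codes[:6])
```

Repeated characters receive the same code (for example 光 and 明 both appear twice), while their position codes differ, which is how the model distinguishes the two occurrences.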
S123: inputting the token codes and the position codes into the BERT model to obtain the corresponding word vectors.
In this step, the labeled sample is converted by the BERT model into the corresponding group of word vectors: [(DKe, DSe, mark), (BKe, BSe, mark), (CKe, CSe, mark), (ADDRe, region element, road code element, building element)]. DKe and DSe denote the region key element word vector and the region standard element word vector, respectively; BKe and BSe denote the building key element word vector and the building standard element word vector, respectively; CKe and CSe denote the road code key element word vector and the road code standard element word vector, respectively; and ADDRe denotes the text address word vector.
S13: training the region ESIM inference model, the building ESIM inference model and the road code ESIM inference model using the key element word vectors and the standard element word vectors.
The region ESIM inference model, the building ESIM inference model and the road code ESIM inference model are used to infer whether the key element word vectors and the standard element word vectors of the region, the building and the road code, respectively, match. The three models are trained in the same way but on different samples, so their trained parameters differ.
The ESIM inference model in this embodiment is a simplified model in which one Bi-LSTM network of the ESIM model is removed; because the number of parameters is reduced, the resource overhead of the model is lower and training and inference are faster. In a specific implementation, as shown in fig. 3, the process of this step is as follows:
S131: inputting the key element word vectors and the standard element word vectors simultaneously into the fully connected neural network to obtain the hidden state vectors of the key element and the standard element.
The key element word vector DKe and the standard element word vector DSe are input simultaneously into a one-layer fully connected neural network to obtain the hidden state vector of the key element and the hidden state vector of the standard element [formula images]. In these formulas, W_1 is a d1 × d2 matrix and is a learnable parameter of the model, d1 is the number of dimensions of DKe and DSe, d2 is the number of dimensions of the two hidden state vectors, and W_1^T is the transpose of W_1. In the present invention, d1 is 768 and d2 is 100 (other dimensions are also possible).
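The dimensions involved in step S131 can be checked with a small sketch. The exact formula is given only as an image in the source, so the code assumes the simplest reading: each token vector is multiplied by the shared matrix W_1, with no bias and no activation. The token counts (6 and 9) and the random values are placeholders.

```python
import random


def matmul(a, b):
    """Multiply an (n x m) list-of-rows matrix by an (m x p) one."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


D1, D2 = 768, 100                     # dimensions stated in the text
rng = random.Random(0)

# W1: shared d1 x d2 learnable matrix (random here, learned in practice).
W1 = [[rng.uniform(-0.05, 0.05) for _ in range(D2)] for _ in range(D1)]

# Stand-ins for BERT outputs: 6 key tokens and 9 standard tokens.
DKe = [[rng.gauss(0, 1) for _ in range(D1)] for _ in range(6)]
DSe = [[rng.gauss(0, 1) for _ in range(D1)] for _ in range(9)]

hidden_k = matmul(DKe, W1)            # one 100-dimensional vector per key token
hidden_s = matmul(DSe, W1)            # one 100-dimensional vector per standard token
print(len(hidden_k), len(hidden_k[0]), len(hidden_s), len(hidden_s[0]))
```

Using the same W_1 for both inputs keeps the key and standard hidden vectors in one space, which the alignment step relies on.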
S132: obtaining the similarity weight matrix of the key element and the standard element through an alignment operation.
The similarity weight matrix E of the key element and the standard element is obtained through an alignment operation [formula image]. The operation is computed from the vector of the i-th token in the hidden state vector of the key element and the vector of the j-th token in the hidden state vector of the standard element, where i ranges from 0 to the number of key element tokens and j ranges from 0 to the number of standard element tokens.
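A sketch of the alignment operation follows. The formula itself is an image in the source; the code assumes the dot-product alignment used in the standard ESIM model, in which each entry of E is the inner product of one key token vector and one standard token vector.

```python
def alignment_matrix(hidden_k, hidden_s):
    """E[i][j] = dot product of key token i and standard token j (assumed form)."""
    return [[sum(a * b for a, b in zip(k_i, s_j)) for s_j in hidden_s]
            for k_i in hidden_k]


# Toy hidden state vectors: 3 key tokens and 2 standard tokens, 2 dimensions.
hidden_k = [[1, 0], [0, 1], [1, 1]]
hidden_s = [[1, 2], [3, 4]]

E = alignment_matrix(hidden_k, hidden_s)
print(E)   # → [[1, 3], [2, 4], [3, 7]]
```

E has one row per key token and one column per standard token, so its shape depends on both element lengths.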
S133: computing a weighted sum of the hidden state vectors of the standard element using the similarity weight matrix to obtain the similarity vector of the key element, and computing a weighted sum of the hidden state vectors of the key element using the similarity weight matrix to obtain the similarity vector of the standard element.
The similarity weight matrix E is used to compute a weighted sum of the hidden state vectors of the standard element, giving the key element similarity vector. Likewise, E is used to compute a weighted sum of the hidden state vectors of the key element, giving the standard element similarity vector [formula images: the two weighted sums].
Here l_s denotes the number of tokens of the standard element, l_K denotes the number of tokens of the key element, and e_ij denotes the value in row i, column j of the similarity weight matrix E; e_in and e_mj are defined in the same way.
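The two weighted sums can be sketched as below. Since the formulas are images in the source, the code assumes the standard ESIM soft attention, which is consistent with the symbols e_ij, e_in and e_mj mentioned above: row i of E is normalized with softmax over the l_s standard tokens to weight the standard vectors, and column j is normalized over the l_K key tokens to weight the key vectors.

```python
import math


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def similar_vectors(E, hidden_k, hidden_s):
    """Return (key similarity vectors, standard similarity vectors)."""
    dim = len(hidden_k[0])
    sim_k = []
    for row in E:                                  # one row per key token i
        w = softmax(row)                           # normalize over standard tokens
        sim_k.append([sum(w[j] * hidden_s[j][d] for j in range(len(hidden_s)))
                      for d in range(dim)])
    sim_s = []
    for j in range(len(hidden_s)):                 # one column per standard token j
        w = softmax([row[j] for row in E])         # normalize over key tokens
        sim_s.append([sum(w[i] * hidden_k[i][d] for i in range(len(hidden_k)))
                      for d in range(dim)])
    return sim_k, sim_s


# With an all-zero E the weights are uniform, so each similarity vector
# is simply the mean of the other element's hidden vectors.
hidden_k = [[1.0, 0.0], [0.0, 1.0]]
hidden_s = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
E = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
sim_k, sim_s = similar_vectors(E, hidden_k, hidden_s)
print(sim_k[0], sim_s[0])
```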
S134: subtracting and multiplying the hidden state vector and the similarity vector of the key element, and likewise those of the standard element, and performing soft alignment to obtain the key element information enhancement vector and the standard address information enhancement vector.
The hidden state vector and the similarity vector of the key element are subtracted and multiplied, and soft alignment is performed to obtain the key element information enhancement vector M_k. The standard address information enhancement vector is obtained in the same way [formula image].
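A sketch of the enhancement step is given below. The formula is an image in the source; the code assumes the standard ESIM construction, in which each token's hidden vector, its similarity vector, their element-wise difference and their element-wise product are concatenated. Whether the patent concatenates all four parts is an assumption.

```python
def enhance(hidden, similar):
    """Per token: [h ; s ; h - s ; h * s] (assumed ESIM-style enhancement)."""
    enhanced = []
    for h, s in zip(hidden, similar):
        difference = [a - b for a, b in zip(h, s)]
        product = [a * b for a, b in zip(h, s)]
        enhanced.append(h + s + difference + product)
    return enhanced


# One token with a 2-dimensional hidden vector and similarity vector.
M_k = enhance([[1.0, 2.0]], [[3.0, 4.0]])
print(M_k)   # → [[1.0, 2.0, 3.0, 4.0, -2.0, -2.0, 3.0, 8.0]]
```

Under this assumption the enhancement vector of each token is four times as wide as its hidden vector, which sets the input width of the Bi-LSTM in the next step.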
S135: inputting the key element information enhancement vector and the standard address information enhancement vector into the bidirectional long short-term memory (Bi-LSTM) neural network to obtain the key element matching vector and the standard element matching vector.
The key element information enhancement vector M_k is input into the Bi-LSTM to obtain the key element matching vector V_k; the standard element matching vector V_s is obtained in the same way.
S136, performing pooling operation on the key element matching vector to obtain a maximum key element pooling vector and an average key element pooling vector; performing pooling operation on the standard element matching vector to obtain a maximum pooling vector of the standard element and an average pooling vector of the standard element; and splicing the four obtained pooling vectors to obtain an element matching information vector.
The key element matching vector $V_k$ is max-pooled to obtain the key element maximum pooling vector $V_{k,max}$ and average-pooled to obtain the key element average pooling vector $V_{k,avg}$. The standard element maximum pooling vector $V_{s,max}$ and the standard element average pooling vector $V_{s,avg}$ are obtained in the same way.
The formulas for average pooling and maximum pooling are as follows:
$$V_{k,avg} = \frac{1}{l_k} \sum_{i=1}^{l_k} V_{k,i}, \qquad V_{k,max} = \max_{1 \le i \le l_k} V_{k,i}$$
where $V_{k,i}$ is the ith vector of the key element matching vector $V_k$ and $l_k$ is the number of tokens of the key element.
The four pooling vectors are concatenated to obtain the element matching information vector $V = [V_{k,avg}, V_{k,max}, V_{s,avg}, V_{s,max}]$.
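The pooling and concatenation of step S136 can be illustrated with a few lines of plain Python. The helper names `avg_pool` and `max_pool` and the numeric values are made up for the example; pooling is taken over the token dimension.

```python
def avg_pool(vectors):
    """Average over the token dimension, one value per feature."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def max_pool(vectors):
    """Maximum over the token dimension, one value per feature."""
    return [max(col) for col in zip(*vectors)]

# Toy matching vectors: 2 key tokens and 3 standard tokens, feature size 2.
V_k = [[1.0, 4.0], [3.0, 2.0]]
V_s = [[0.0, 1.0], [2.0, 5.0], [4.0, 3.0]]

# V = [V_k,avg, V_k,max, V_s,avg, V_s,max]
V = avg_pool(V_k) + max_pool(V_k) + avg_pool(V_s) + max_pool(V_s)
print(V)  # [2.0, 3.0, 3.0, 4.0, 2.0, 3.0, 4.0, 5.0]
```

Because pooling removes the token dimension, V has a fixed length regardless of how many tokens the key and standard elements contain.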
S137, inputting the element matching information vector into the fully connected layer and obtaining the matching value of each category through a normalized exponential function, wherein there are three categories: mismatch, match and possible match.
The element matching information vector V is input into the fully connected layer, and the matching value of each category (three categories, 0: mismatch; 1: match; 2: possible match) is obtained through the normalized exponential function (SOFTMAX). The fully connected layer comprises two fully connected neural networks, with tanh as the activation function between them. Each matching value output by the SOFTMAX function lies between 0 and 1.
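A minimal sketch of the classification head in step S137: two fully connected layers with tanh between them, followed by softmax over the three categories. The weights, the input vector and the names `linear` and `classify` are made up for illustration and are not trained values.

```python
import math

def linear(x, W, b):
    """Fully connected layer: one output per row of W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def classify(V, W1, b1, W2, b2):
    hidden = [math.tanh(h) for h in linear(V, W1, b1)]  # tanh between the two layers
    logits = linear(hidden, W2, b2)
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]            # stable softmax
    total = sum(exps)
    return [e / total for e in exps]

LABELS = {0: "mismatch", 1: "match", 2: "possible match"}

# Made-up parameters: input size 2, hidden size 2, 3 output categories.
V = [0.5, -1.0]
W1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
W2, b2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0]

probs = classify(V, W1, b1, W2, b2)
pred = max(range(len(probs)), key=lambda c: probs[c])
print(LABELS[pred])  # mismatch
```

The matching values sum to 1, and during validation or inference the category with the largest matching value is taken as the prediction.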
S138, calculating the loss value using a cross entropy loss function.
The loss function is
$$loss = -\sum_{i} y_i \log p_i$$
where $y_i$ is the one-hot label value of category i and $p_i$ is the output matching value of category i. For example, if the label category is 1, the one-hot label is [0,1,0]; if the output matching values are [0.4,0.2,0.4], the loss value is -(0 × log 0.4 + 1 × log 0.2 + 0 × log 0.4) = -log 0.2.
S139, updating the model parameters by gradient descent according to the loss value, and selecting the parameter version with the highest validation accuracy as the final trained ESIM inference model.
As shown in FIG. 2, the loss value of the region ESIM inference model is added to the loss values of the other two ESIM inference models and the three extraction matrices to obtain a total loss value, and gradient descent is then used to update the model parameters. The total loss value is calculated in the same way for the other ESIM inference models.
The ESIM inference model may traverse the training samples multiple times. After each pass over the training samples, the accuracy of the model is tested using the validation samples. The validation process is essentially the same as the training process, except that after step S137 the category with the largest matching value is selected as the prediction result and compared with the labeled result. If the categories are the same, the prediction is correct; otherwise it is wrong. The model training phase selects the parameter version with the highest validation accuracy as the final trained ESIM inference model.
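The selection rule described above, keeping the parameter version with the highest validation accuracy, might look like the following sketch. The version names and predictions are made up, and `accuracy` and `select_best` are hypothetical helpers, not part of the described system.

```python
def accuracy(predicted, labeled):
    """Fraction of validation samples whose predicted category equals the label."""
    correct = sum(1 for p, y in zip(predicted, labeled) if p == y)
    return correct / len(labeled)

def select_best(history):
    """history: list of (parameter_version, validation_accuracy), one per pass."""
    return max(history, key=lambda item: item[1])

labels = [1, 0, 2, 1]
history = [
    ("pass-1", accuracy([1, 1, 2, 0], labels)),  # 2 of 4 correct
    ("pass-2", accuracy([1, 0, 2, 0], labels)),  # 3 of 4 correct
    ("pass-3", accuracy([1, 0, 1, 0], labels)),  # 2 of 4 correct
]
best_version, best_acc = select_best(history)
print(best_version, best_acc)  # pass-2 0.75
```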
S14, training the region extraction matrix, the building extraction matrix and the road code extraction matrix using the text address word vectors.
In this step, the three extraction matrices are used to extract the three corresponding element parts of the text address vector. The training methods of the three extraction matrices (region, building and road code) are the same, and the process is as follows:
S141, taking the dot product of the extraction matrix with each token encoding vector in the text address word vector to obtain the text address vector.
The extraction matrix is used to take the dot product with each token encoding vector in the text address word vector ADDRe, giving the text address vector $res = (res_1, res_2, \ldots, res_n)$, where $res_i$ is the ith token extraction value:
$$res_i = W_2^{T} e_i$$
where $W_2$ is a matrix of dimension d1 × 1, $e_i$ is the encoding vector of the ith token in the text address word vector ADDRe, and $W_2^{T}$ is the transpose of $W_2$.
S142, obtaining a token extraction score for each token extraction value in the text address vector using a Sigmoid function.
For each token extraction value, the Sigmoid function is used to obtain the token extraction score $Score_i$:
$$Score_i = \mathrm{Sigmoid}(res_i)$$
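Steps S141 and S142 together amount to one dot product per token followed by a Sigmoid. A minimal sketch, with made-up encoding vectors of dimension d1 = 2 and a made-up extraction matrix (the function name `extraction_scores` is hypothetical):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def extraction_scores(token_vectors, w2):
    """res_i = W_2^T e_i for every token, then Score_i = Sigmoid(res_i)."""
    res = [sum(w * e for w, e in zip(w2, vec)) for vec in token_vectors]
    return [sigmoid(r) for r in res]

# Made-up values: 3 tokens, encoding dimension d1 = 2.
ADDRe = [[1.0, 0.0], [0.0, 0.0], [-1.0, 1.0]]
W2 = [2.0, -1.0]  # the d1 x 1 extraction matrix, stored as a flat list

scores = extraction_scores(ADDRe, W2)
print([round(s, 3) for s in scores])  # [0.881, 0.5, 0.047]
```

Each extraction matrix therefore acts as a per-token binary scorer: a score near 1 suggests the token belongs to that element, and a score near 0 suggests it does not.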
S143, generating element tag indexes according to the tag indexes, and calculating the loss value of each predicted token using a cross entropy loss function according to the token extraction scores and the element tag indexes.
In the training phase, element tag indexes are generated from the tag indexes. When generating the region tag index, every position whose tag index is 1 is assigned 1 and the rest are assigned 0; when generating the road code tag index, every position whose tag index is 2 is assigned 1 and the rest are assigned 0; when generating the building tag index, every position whose tag index is 3 is assigned 1 and the rest are assigned 0. The token extraction scores are compared with the element tag index, and the loss value of each predicted token is calculated using a cross entropy loss function, in the same way as in step S138.
In the inference matching phase, the indexes of the tokens whose extraction score $Score_i$ is greater than 0.5 are extracted, and the corresponding tokens are then extracted according to these indexes.
For example, the virtual text address
"Handong Province, Jingzhou City, Guangming District, Guangming Street, Gale Community, Building 5, Unit 2, Room 17B" (26 characters in the original Chinese text) has the indexes [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25]. Each token encoding vector is converted by the region extraction matrix to obtain the corresponding region vector, and each token extraction value in the region vector is passed through the Sigmoid function to obtain the token extraction scores [0.8,0.6,0.7,0.7,0.6,0.9,0.7,0.7,0.8,0.1,0.1,0.2,0.2,0.9,0.8,0.9,0.7,0.3,0.4,0.2,0.3,0.2,0.1,0.1,0.1].
In the training phase, the tag index is [1,1,1,1,1,1,1,1,1,2,2,2,2,1,1,1,1,3,3,3,3,3,3,3,3,3], so the region tag index is [1,1,1,1,1,1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0]. The loss value of the first token is $loss_1 = -\log 0.8$, and the loss value of the last token is $loss_{26} = -\log(1 - 0.1) = -\log 0.9$. The sum of the loss values of the text address is
$$loss = \sum_{i=1}^{26} loss_i$$
where $loss_i$ is the loss value of the ith token.
In the inference phase, the indexes whose token score is greater than 0.5 are extracted: the extracted indexes are [0,1,2,3,4,5,6,7,8,13,14,15,16], and the extracted token fragments are [han, east, province, jing, state, city, light, bright, district, large, wind, society, district].
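The numbers in this example can be reproduced with a short sketch. It assumes the natural logarithm and a per-token binary cross entropy; the helper names are hypothetical. The score list is used exactly as printed above, where it has 25 entries for 26 tokens, so the sketch only relies on the printed entries.

```python
import math

ELEMENT_TAGS = {"region": 1, "road_code": 2, "building": 3}

def element_tag_index(tag_index, element):
    """1 where the tag index equals the element's tag, 0 elsewhere."""
    tag = ELEMENT_TAGS[element]
    return [1 if t == tag else 0 for t in tag_index]

def token_loss(label, score):
    """Binary cross entropy for one token (natural logarithm assumed)."""
    return -math.log(score) if label == 1 else -math.log(1.0 - score)

tag_index = [1] * 9 + [2] * 4 + [1] * 4 + [3] * 9
scores = [0.8, 0.6, 0.7, 0.7, 0.6, 0.9, 0.7, 0.7, 0.8, 0.1, 0.1, 0.2, 0.2,
          0.9, 0.8, 0.9, 0.7, 0.3, 0.4, 0.2, 0.3, 0.2, 0.1, 0.1, 0.1]

region_index = element_tag_index(tag_index, "region")

# Training phase: loss of the first token (label 1, score 0.8)
# and of a token with label 0 and score 0.1, as in the text.
print(round(token_loss(region_index[0], scores[0]), 4))  # 0.2231
print(round(token_loss(0, 0.1), 4))                      # 0.1054

# Inference phase: keep the indexes whose score exceeds 0.5.
extracted = [i for i, s in enumerate(scores) if s > 0.5]
print(extracted)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 13, 14, 15, 16]
```

The extracted indexes coincide with the positions where the region tag index is 1, which is the outcome a well-trained region extraction matrix would be expected to give on this address.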
S144, updating the matrix parameters by gradient descent according to the loss value, and selecting the parameter version with the highest validation accuracy as the final trained extraction matrix.
As shown in FIG. 2, the loss value of the region extraction matrix is added to the loss values of the three ESIM inference models and the other two extraction matrices to obtain a total loss value, and gradient descent is then used to update the model parameters. The total loss value is calculated in the same way for the other extraction matrices.
The extraction task of the extraction matrix may traverse the training samples multiple times. After each pass over the training samples, the accuracy of the extraction matrix is tested using the validation samples. The validation process is essentially the same as the training process, except that in step S143 the corresponding token indexes are selected and compared with the labeled result; if the indexes are the same, the prediction is correct, otherwise it is wrong. The model training phase selects the parameter version with the highest validation accuracy as the final trained extraction matrix.
In this embodiment, the three models of the element inference layer and the three extraction matrices of the element extraction layer can be trained simultaneously in parallel, which improves efficiency.
Step S1 above is the address matching model training, in which the parameters of the address matching model are trained with the labeled training sample set to obtain a trained address matching model. As shown in FIG. 4, the following steps S2 to S3 are the address matching model inference, in which the trained address matching model is used to determine whether an input address pair matches.
Step S2, inputting the address pair to be matched, and generating prediction sample pairs from the address pair to be matched through the prediction sample construction module.
The address pair to be matched has the format [key address, standard address 1, standard address 2, ...].
The address pair to be matched enters the prediction sample construction module to generate prediction sample pairs.
The prediction sample construction module combines each standard address in the address pair to be matched with the key address to generate a prediction sample pair. For example:
Address pair to be matched: [key address, standard address 1, standard address 2, standard address 3]
The prediction sample pairs constructed by the prediction sample construction module are then:
Prediction sample pairs: [key address, standard address 1], [key address, standard address 2], [key address, standard address 3].
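The prediction sample construction of step S2 is a simple pairing of the key address with each standard address. A minimal sketch, where the function name `build_prediction_pairs` is hypothetical:

```python
def build_prediction_pairs(address_group):
    """Split [key, std1, std2, ...] into [[key, std1], [key, std2], ...]."""
    key_address, *standard_addresses = address_group
    return [[key_address, std] for std in standard_addresses]

pairs = build_prediction_pairs(
    ["key address", "standard address 1", "standard address 2", "standard address 3"]
)
for pair in pairs:
    print(pair)
# ['key address', 'standard address 1']
# ['key address', 'standard address 2']
# ['key address', 'standard address 3']
```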
Step S3, performing inference matching on the prediction sample pairs using the address matching model to obtain the corresponding matching results.
The BERT model is used to obtain the embedding vectors KADDRe and SADDRe of the key address and the standard address for the address pair to be matched. The matching result of each prediction sample pair has the structure [key address, standard address, matching result, matching value]. The matching results are then output in descending order of matching value.
The specific inference process of this step is shown in FIG. 5 and is as follows:
S31, converting the address pair to be matched into text embedding vectors using the BERT model, and inputting the text embedding vectors into the three extraction matrices to obtain the key region element, key building element and key road code element of the key address, and the standard region element, standard building element and standard road code element of the standard address.
The key address embedding vector KADDRe is input into the region extraction matrix, the building extraction matrix and the road code extraction matrix to obtain the key region element index KDi, the key building element index KBi and the key road code element index KCi, respectively.
The key region element, key building element and key road code element are then generated according to KDi, KBi and KCi.
For example, for a virtual key address whose key region element is "Jingzhou city wind factory" and whose key building element is "mountain and water group building 4", the key road code element is a null value, and the result of the road code ESIM model is therefore set to "no information".
The standard region element, standard building element and standard road code element are obtained from the standard address in the same way as for the key address.
S32, inputting the six elements into the BERT model again to obtain the corresponding element word vectors, and then using the three ESIM inference models correspondingly to obtain the matching results of the three elements, namely the region element matching result, the building element matching result and the road code element matching result.
The six elements are each input into the BERT model again to obtain their element word vectors. As shown in FIG. 5, the key region element word vector and the standard region element word vector are input into the region ESIM inference model to obtain the region element matching result; the key building element word vector and the standard building element word vector are input into the building ESIM inference model to obtain the building element matching result; and the key road code element word vector and the standard road code element word vector are input into the road code ESIM inference model to obtain the road code element matching result. The output of an ESIM inference model is one of 4 results: ["match", "no match", "possible match", "no information"]. The "no information" result is output only when at least one element of the pair input to the ESIM model is empty.
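The "no information" rule can be pictured as a guard placed in front of each ESIM inference model. In the sketch below, `toy_predict` is a hypothetical stand-in for a trained model (it only compares strings) and the element strings are made up.

```python
def match_element(key_element, standard_element, esim_predict):
    """Return "no information" if either element is empty, else ask the model."""
    if not key_element or not standard_element:
        return "no information"
    return esim_predict(key_element, standard_element)

def toy_predict(key_element, standard_element):
    """Stand-in for a trained ESIM model (hypothetical): exact string comparison."""
    return "match" if key_element == standard_element else "no match"

print(match_element("", "Guangming Street", toy_predict))            # no information
print(match_element("Building 4", "Building 4", toy_predict))         # match
print(match_element("Building 4", "Building 5", toy_predict))         # no match
```

Handling the empty case outside the model keeps the model itself a three-category classifier, consistent with the three categories used in training.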
S33, finally, calculating the final matching result comprehensively from the matching results of the three elements.
The comprehensive calculation method is designed flexibly according to the characteristics of the addresses to be matched in different areas. For example, when the regions match, the buildings do not match and the road codes match, the final matching result may be set to "match"; for the same combination, it may instead be set to "mismatch". Inference with the ESIM inference model proceeds in the same way as training, except that during inference the category with the largest matching value output by the model is output as the result.
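One way to make the comprehensive calculation configurable per area is a lookup table keyed by the three element results. This is only a sketch of that idea: the rule tables, the default value and the function name `combine` are assumptions, not the method's prescribed rules.

```python
def combine(region, building, road_code, rules, default="mismatch"):
    """Look up the final result for a (region, building, road code) combination."""
    return rules.get((region, building, road_code), default)

# Two hypothetical rule sets for different areas: the same combination of
# element results is mapped to different final results.
rules_area_a = {
    ("match", "match", "match"): "match",
    ("match", "no match", "match"): "match",
}
rules_area_b = {
    ("match", "match", "match"): "match",
    ("match", "no match", "match"): "mismatch",
}

combo = ("match", "no match", "match")
print(combine(*combo, rules_area_a))  # match
print(combine(*combo, rules_area_b))  # mismatch
```

Keeping the rules as data rather than code lets the same matching pipeline be reused across areas with different address conventions.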
In the embodiment of the invention, three simplified ESIM inference models and three extraction matrices are trained. A text address pair is converted into the corresponding word vector pair using the BERT model; three sub-elements of each address (the region element, the building element and the road code element) are extracted using the three extraction matrices; the sub-elements are matched separately using the simplified ESIM inference models; and finally whether the addresses match is judged comprehensively from the matching results of the three sub-elements. Compared with existing address matching methods based on deep learning, the method of the invention reduces the sensitivity to address length, increases the sensitivity to numbers, and improves the matching accuracy.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (7)

1. An improved segmentation inference address matching method, the method comprising the following steps:
step S1, training an address matching model, wherein the address matching model comprises an element inference model and an element extraction layer matrix, the element inference model comprises a region ESIM inference model, a building ESIM inference model and a road code ESIM inference model, and the element extraction layer matrix comprises a region extraction matrix, a building extraction matrix and a road code extraction matrix;
step S2, inputting the address pair to be matched, and generating a prediction sample pair by the address pair to be matched through a prediction sample construction module;
and step S3, performing inference matching on the prediction sample pair by using the address matching model to obtain a corresponding matching result.
2. The improved segmentation inference address matching method according to claim 1, wherein the specific process of step S3 is as follows:
s31, converting the address pairs to be matched into text embedded vectors by using a bert model, and inputting the text embedded vectors into the three extraction matrixes to obtain key area elements, key building elements and key road code elements of key addresses, and standard area elements, standard building elements and standard road code elements of standard addresses;
s32, inputting the six elements into the bert model again to obtain corresponding element word vectors, and then correspondingly using the three ESIM inference models to obtain matching results of the three elements, namely a region element matching result, a building element matching result and a road code element matching result;
and S33, finally, according to the matching results of the three elements, comprehensively calculating to obtain a final matching result.
3. The improved segmentation inference address matching method according to claim 2, wherein the address matching model is trained as follows:
s11, inputting samples, and dividing the training sample set into training samples and verification samples according to the proportion;
the format of each sample is [area mark sample, building mark sample, road code mark sample, segmented address mark sample], wherein the format of the area mark sample, the building mark sample and the road code mark sample is [key element, standard element, mark], the format of the segmented address mark sample is [text address, mark index], and the text address refers to a key address or a standard address;
s12, converting the key elements and standard elements in the area, the building and the road code mark sample and the text address in the segmented address mark sample into corresponding word vectors by using a bert model;
s13, training a region ESIM inference model, a building ESIM inference model and a road code ESIM inference model through key element word vectors and standard element word vectors;
and S14, training a region extraction matrix, a building extraction matrix and a road code extraction matrix through the text address word vectors.
4. The improved segmentation inference address matching method according to claim 3, wherein step S13 and step S14 are trained in parallel.
5. The improved segmentation inference address matching method as claimed in claim 4, wherein in step S12, the key elements and standard elements in the area, building, road code mark sample are called address elements, and the specific process of step S12 is as follows:
dividing the text address and the address element into words;
converting the text address and the address element of the character into a word element code by using a bert model, and obtaining a corresponding position code;
and respectively inputting the word element codes and the position codes into the bert model to obtain corresponding word vectors.
6. The improved segmentation inference address matching method according to claim 5, wherein in step S13, the three ESIM inference models, namely the region ESIM inference model, the building ESIM inference model and the road code ESIM inference model, are simplified models whose training methods are the same, as follows:
simultaneously inputting the key element word vectors and the standard element word vectors into a fully-connected neural network to obtain hidden layer state vectors of the key elements and the standard elements;
obtaining a similar weight matrix of the key elements and the standard elements through alignment operation;
weighting and summing the hidden state vectors of the standard elements by using the similar weight matrix to obtain similar vectors of the key elements, and weighting and summing the hidden state vectors of the key elements by using the similar weight matrix to obtain similar vectors of the standard elements;
respectively subtracting and multiplying the hidden state vector and the similar vector of the key element and the hidden state vector and the similar vector of the standard element, and performing soft alignment to obtain an enhanced vector of the information of the key element and an enhanced vector of the information of the standard address;
inputting the key element information enhancement vector and the standard address information enhancement vector into a bidirectional long-short term memory neural network to obtain a key element matching vector and a standard element matching vector;
performing pooling operation on the key element matching vector to obtain a maximum pooling vector of the key elements and an average pooling vector of the key elements; performing pooling operation on the standard element matching vectors to obtain a standard element maximum pooling vector and a standard element average pooling vector; splicing the four obtained pooling vectors to obtain an element matching information vector;
inputting the element matching information vector into a fully connected layer, and obtaining a matching value of each category through a normalized exponential function, wherein there are three categories, namely mismatch, match and possible match;
calculating a loss value using a cross entropy loss function;
and updating the model parameters by gradient descent according to the loss value, and selecting the parameter version with the highest validation accuracy as the final trained ESIM inference model.
7. The improved segmentation inference address matching method as claimed in claim 6, wherein in step S14, the training methods of the three extraction matrices, i.e. the region extraction matrix, the building extraction matrix and the road code extraction matrix, are consistent, and the procedure is as follows:
taking the dot product of the extraction matrix with each token encoding vector in the text address word vector to obtain a text address vector;
obtaining a token extraction score for each token extraction value in the text address vector using a Sigmoid function;
generating element tag indexes according to the tag indexes, and calculating the loss value of each predicted token using a cross entropy loss function according to the token extraction scores and the element tag indexes;
and updating the matrix parameters by gradient descent according to the loss value, and selecting the parameter version with the highest validation accuracy as the final trained extraction matrix.
CN202210573269.3A 2022-05-25 2022-05-25 Improved segmentation inference address matching method Pending CN114936627A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210573269.3A CN114936627A (en) 2022-05-25 2022-05-25 Improved segmentation inference address matching method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210573269.3A CN114936627A (en) 2022-05-25 2022-05-25 Improved segmentation inference address matching method

Publications (1)

Publication Number Publication Date
CN114936627A true CN114936627A (en) 2022-08-23

Family

ID=82864974

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210573269.3A Pending CN114936627A (en) 2022-05-25 2022-05-25 Improved segmentation inference address matching method

Country Status (1)

Country Link
CN (1) CN114936627A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115455315A (en) * 2022-11-10 2022-12-09 吉奥时空信息技术股份有限公司 Address matching model training method based on comparison learning
CN116501827A (en) * 2023-06-26 2023-07-28 北明成功软件(山东)有限公司 BIM-based market subject and building address matching and positioning method
CN116501827B (en) * 2023-06-26 2023-09-12 北明成功软件(山东)有限公司 BIM-based market subject and building address matching and positioning method

Similar Documents

Publication Publication Date Title
CN114676353B (en) Address matching method based on segmentation inference
CN114936627A (en) Improved segmentation inference address matching method
CN112765358A (en) Taxpayer industry classification method based on noise label learning
CN106909611B (en) Hotel automatic matching method based on text information extraction
CN112527938A (en) Chinese POI matching method based on natural language understanding
CN113592037B (en) Address matching method based on natural language inference
CN113326377A (en) Name disambiguation method and system based on enterprise incidence relation
CN111368542A (en) Text language association extraction method and system based on recurrent neural network
CN115408525B (en) Letters and interviews text classification method, device, equipment and medium based on multi-level label
CN114863091A (en) Target detection training method based on pseudo label
CN117237559A (en) Digital twin city-oriented three-dimensional model data intelligent analysis method and system
CN115905959A (en) Method and device for analyzing relevance fault of power circuit breaker based on defect factor
CN115455315B (en) Address matching model training method based on comparison learning
CN113312498A (en) Text information extraction method for embedding knowledge graph by undirected graph
CN115146635B (en) Address segmentation method based on domain knowledge enhancement
CN117010373A (en) Recommendation method for category and group to which asset management data of power equipment belong
CN116467410A (en) Address matching method and device, electronic equipment and computer readable storage medium
CN117172235A (en) Class case discrimination method and system based on similarity measurement
CN117094835A (en) Multi-target group classification method for social media content
CN115270774B (en) Big data keyword dictionary construction method for semi-supervised learning
CN115168548B (en) Recall-sorting based address matching method
CN115719070A (en) Multi-step attack detection model pre-training method based on alarm semantics
CN115913992A (en) Anonymous network traffic classification method based on small sample machine learning
CN106816871B (en) State similarity analysis method for power system
CN113342982B (en) Enterprise industry classification method integrating Roberta and external knowledge base

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination