CN111125365A - Address data labeling method and device, electronic equipment and storage medium - Google Patents

Address data labeling method and device, electronic equipment and storage medium

Info

Publication number
CN111125365A
CN111125365A (application CN201911349674.1A; granted publication CN111125365B)
Authority
CN
China
Prior art keywords
address
sample data
data
sample
labeling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911349674.1A
Other languages
Chinese (zh)
Other versions
CN111125365B (en)
Inventor
黄绿君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JD Digital Technology Holdings Co Ltd
Original Assignee
JD Digital Technology Holdings Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JD Digital Technology Holdings Co Ltd
Priority to CN201911349674.1A
Publication of CN111125365A
Application granted
Publication of CN111125365B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/387 Retrieval characterised by using metadata using geographical or spatial information, e.g. location
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Abstract

The present disclosure provides an address data labeling method, an address data labeling apparatus, an electronic device, and a computer-readable storage medium, and belongs to the technical field of data processing. The method includes the following steps: acquiring an address labeling model, where the address labeling model is pre-trained based on unlabeled first sample data and labeled second sample data; splitting an address to be labeled into a plurality of characters so as to convert it into a character sequence to be labeled, formed by arranging those characters; processing the character sequence to be labeled with the address labeling model to obtain a labeled data sequence; and determining the labeling result of the address to be labeled from the labeled data sequence. The method and apparatus can label address data accurately and efficiently.

Description

Address data labeling method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to an address data labeling method, an address data labeling apparatus, an electronic device, and a computer-readable storage medium.
Background
With the advent of the information age, a large amount of data is generated, and to facilitate analysis, processing, and sound decision-making, this data is usually labeled in a systematic way. In particular, when spatial analysis is performed on human behavior or socio-economic activity, labeling address data is especially important: effectively labeled address data can provide a quantitative decision basis for urban management and commercial operation.
Existing address data labeling is usually performed manually based on preset rules. However, address data takes many forms, and different strings may carry the same meaning; for example, "Inner Mongolia" and "Inner Mongolia Autonomous Region" refer to the same region and should receive the same label. Because address data updates quickly and is highly diverse, the address labeling rules and the address database must be maintained and updated regularly, which consumes substantial labor cost. In addition, since this approach depends heavily on manual operation, the labeling cost is generally proportional to the scale of the address data set; large-scale labeling therefore requires a large investment of manpower and material resources and a long labeling period, so efficiency is low and accuracy cannot be guaranteed.
Therefore, how to label address data accurately and efficiently is an urgent problem to be solved.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure provides an address data labeling method, an address data labeling apparatus, an electronic device, and a computer-readable storage medium, which at least to some extent overcome the high labor cost and low accuracy of existing address data labeling.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to one aspect of the present disclosure, there is provided an address data labeling method, including: acquiring an address labeling model, where the address labeling model is pre-trained based on unlabeled first sample data and labeled second sample data; splitting an address to be labeled into a plurality of characters so as to convert it into a character sequence to be labeled, formed by arranging those characters; processing the character sequence to be labeled with the address labeling model to obtain a labeled data sequence; and determining the labeling result of the address to be labeled from the labeled data sequence.
In an exemplary embodiment of the present disclosure, the address labeling model is trained by: acquiring first sample data, second sample data, and address category labels of the second sample data; pre-training a machine learning model with the first sample data to generate an intermediate model; constructing an initial address labeling model based on at least a portion of the intermediate model; and training the address labeling model using the second sample data and its address category labels.
In an exemplary embodiment of the present disclosure, the acquiring of the first sample data and the second sample data includes: acquiring initial sample data and performing standardization processing on it; performing stratified sampling on the standardized initial sample data and updating the order of the sampled data; and dividing the initial sample data into the first sample data and the second sample data according to a preset ratio.
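The acquisition steps above (standardize, stratified-sample, reshuffle, split by a preset ratio) can be sketched as follows; the grouping key and the 9:1 split ratio are illustrative assumptions, not values fixed by the disclosure:

```python
import random
from collections import defaultdict

def stratified_split(addresses, key_fn, unlabeled_ratio=0.9, seed=42):
    """Group addresses into strata by key_fn (e.g. a province prefix),
    shuffle within each stratum, then split every stratum by the preset
    ratio into unlabeled first sample data and to-be-labeled second
    sample data."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for addr in addresses:
        strata[key_fn(addr)].append(addr)
    first, second = [], []
    for group in strata.values():
        rng.shuffle(group)
        cut = int(len(group) * unlabeled_ratio)
        first.extend(group[:cut])
        second.extend(group[cut:])
    # Final reshuffle so the output order carries no stratum structure
    rng.shuffle(first)
    rng.shuffle(second)
    return first, second
```

Only the smaller `second` set would then be labeled manually, in line with the semi-supervised training scheme described above.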
In an exemplary embodiment of the present disclosure, after dividing the initial sample data into the first sample data and the second sample data, the address category labels of the second sample data are acquired by: labeling each character in the second sample data with a preset labeling method to obtain the address category label of each character; and updating the address category label of each character in the second sample data according to a verification result for those labels.
In an exemplary embodiment of the disclosure, pre-training a machine learning model with the first sample data to generate an intermediate model includes: determining a plurality of groups of sample address pairs from the first sample data and converting each group of sample address pairs into a sample address sequence; inputting the sample address sequence into the machine learning model to obtain result values of preset subtasks; and updating the parameters of the machine learning model according to a first loss function to obtain the intermediate model, where the first loss function includes the error between the result value of each subtask and the label value of that subtask. The subtasks include any one or a combination of the following: judging whether the province characters of the two addresses in a sample address pair are the same; judging whether the city characters of the two addresses are the same; and judging whether the district characters of the two addresses are the same.
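A minimal sketch of how the sample address pairs and their subtask labels might be constructed. The dict fields (`province`, `city`, `district`, `text`) and the BERT-style `[CLS]`/`[SEP]` markers are assumptions for illustration; the disclosure only requires that the three same/different comparisons be made:

```python
def make_pair_labels(addr_a, addr_b):
    """Binary subtask label values for one sample address pair."""
    return {
        "same_province": int(addr_a["province"] == addr_b["province"]),
        "same_city": int(addr_a["city"] == addr_b["city"]),
        "same_district": int(addr_a["district"] == addr_b["district"]),
    }

def make_pair_sequence(addr_a, addr_b, cls="[CLS]", sep="[SEP]"):
    """Concatenate the two addresses of a pair into one character
    sequence for the pre-training model."""
    return [cls] + list(addr_a["text"]) + [sep] + list(addr_b["text"]) + [sep]
```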
In an exemplary embodiment of the disclosure, after converting each group of sample address pairs into a sample address sequence, the method further includes: performing randomized substitution on one or more characters in the sample address sequence. The subtasks then further include: predicting the original character corresponding to each substituted character in the sample address sequence.
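The randomized substitution step can be sketched as below; the substitution rate and replacement vocabulary are assumptions, and in actual pre-training the model would be asked to predict the returned `(position, original character)` targets:

```python
import random

def randomize_characters(chars, vocab, mask_prob=0.15, seed=None):
    """Replace roughly mask_prob of the positions with a random
    character drawn from vocab, and return the corrupted sequence
    together with (position, original character) prediction targets."""
    rng = random.Random(seed)
    corrupted = list(chars)
    targets = []
    for i, ch in enumerate(chars):
        if rng.random() < mask_prob:
            corrupted[i] = rng.choice(vocab)
            targets.append((i, ch))
    return corrupted, targets
```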
In an exemplary embodiment of the present disclosure, training the address labeling model using the second sample data and its address category labels includes: constructing a second loss function based on normalization over labeling paths; and inputting the second sample data into the address labeling model and updating the model parameters according to the second loss function, which measures the error between the output labeling path and the address category labels of the second sample data, so as to train the address labeling model.
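A loss "based on normalization of the labeling path" is consistent with a linear-chain CRF negative log-likelihood, where the score of the gold labeling path is normalized by the log-partition over all possible paths; the disclosure does not name CRF explicitly, so the following pure-Python sketch is one interpretation:

```python
import math

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def crf_neg_log_likelihood(emissions, transitions, tags):
    """-(score(gold path) - log Z) for a linear-chain CRF.
    emissions: per-step lists of per-tag scores;
    transitions[i][j]: score of moving from tag i to tag j;
    tags: the gold labeling path."""
    n_tags = len(emissions[0])
    # Score of the gold labeling path
    score = emissions[0][tags[0]]
    for t in range(1, len(emissions)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    # Forward algorithm for the partition function log Z
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            log_sum_exp([alpha[i] + transitions[i][j] for i in range(n_tags)])
            + emissions[t][j]
            for j in range(n_tags)
        ]
    log_z = log_sum_exp(alpha)
    return log_z - score
```

Summing `exp(-loss)` over every possible path yields 1, which is exactly the path-normalization property the loss relies on.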
In an exemplary embodiment of the present disclosure, after the address category labels of the second sample data are acquired, the method includes: acquiring the quantity of second sample data under each address category and calculating the sample proportion of each address category; and, if the sample proportion of at least one address category is lower than a preset threshold, adjusting at least a part of the second sample data so that the sample proportion of each address category in the adjusted second sample data meets a preset condition.
In an exemplary embodiment of the present disclosure, adjusting at least a part of the second sample data includes: deleting a part of the data from the second sample data of any address category whose sample proportion is higher than the preset threshold; and/or constructing new sample data based on the second sample data of any address category whose sample proportion is lower than the preset threshold and adding the new sample data to the second sample data.
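The proportion check and adjustment above can be sketched as follows; simple duplication of existing items stands in for "constructing new sample data", and the 10% threshold is an illustrative assumption:

```python
import random
from collections import Counter

def rebalance(samples, labels, min_ratio=0.1, seed=0):
    """Ensure every address category reaches min_ratio of the original
    data size by duplicating under-represented items (a crude stand-in
    for constructing new sample data)."""
    rng = random.Random(seed)
    data = list(zip(samples, labels))
    counts = Counter(labels)
    target = int(min_ratio * len(data))
    for cat, n in counts.items():
        pool = [d for d in data if d[1] == cat]
        while n < target:
            data.append(rng.choice(pool))
            n += 1
    rng.shuffle(data)
    return data
```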
In an exemplary embodiment of the present disclosure, splitting the address to be labeled into a plurality of characters includes: acquiring the address to be labeled and performing text cleaning on it; and splitting the cleaned address into single characters and converting each character into a numerical index. Processing the character sequence to be labeled with the address labeling model to obtain a labeled data sequence includes: processing the numerical indexes with the address labeling model to obtain a labeled data sequence formed by the tag index of each character. Determining the labeling result of the address to be labeled from the labeled data sequence includes: looking up the tag indexes in a preset tag dictionary to determine the labeling result of the address to be labeled.
According to an aspect of the present disclosure, there is provided an address data labeling apparatus including: the model acquisition module is used for acquiring an address labeling model, and the address labeling model is obtained by pre-training based on unlabeled first sample data and labeled second sample data; the character splitting module is used for splitting the address to be marked into a plurality of characters so as to convert the address to be marked into a character sequence to be marked, wherein the character sequence to be marked is formed by arranging the characters; the sequence processing module is used for processing the character sequence to be labeled by adopting the address labeling model to obtain a labeled data sequence; and the result determining module is used for determining the labeling result of the address to be labeled according to the labeling data sequence.
In an exemplary embodiment of the present disclosure, the address labeling model is trained by: the data acquisition unit is used for acquiring first sample data, second sample data and an address category label of the second sample data; an intermediate model generation unit, configured to pre-train a machine learning model using the first sample data, and generate an intermediate model; an initial model building unit, configured to build an initial address tagging model based on at least a part of the intermediate model; and the model training unit is used for training and obtaining the address labeling model by utilizing the second sample data and the address category label of the second sample data.
In an exemplary embodiment of the present disclosure, the data acquisition unit includes: a standardization processing subunit, used for acquiring initial sample data and performing standardization processing on it; a stratified sampling subunit, used for performing stratified sampling on the standardized initial sample data and updating the order of the sampled data; and a data dividing subunit, used for dividing the initial sample data into the first sample data and the second sample data according to a preset ratio.
In an exemplary embodiment of the present disclosure, after dividing initial sample data into the first sample data and second sample data, an address category tag of the second sample data is acquired by: the character labeling subunit is configured to label each character in the second sample data by using a preset labeling method to obtain an address category label of each character; and the label updating subunit is used for updating the address category label of each character in the second sample data according to the verification result aiming at the address category label.
In an exemplary embodiment of the present disclosure, the intermediate model generating unit includes: a sequence conversion subunit, configured to determine multiple groups of sample address pairs from the first sample data, and convert each group of sample address pairs into a sample address sequence; the sequence input subunit is used for inputting the sample address sequence into a machine learning model so as to obtain a result value of a preset subtask; a parameter updating subunit, configured to update parameters of the machine learning model according to a first loss function to obtain the intermediate model, where the first loss function includes an error between a result value of the subtask and a tag value of the subtask; wherein the subtasks include any one or a combination of more than one of: judging whether the province characters of the two addresses in the sample address pair are the same; judging whether the city characters of the two addresses in the sample address pair are the same; and judging whether the zone characters of the two addresses in the sample address pair are the same.
In an exemplary embodiment of the present disclosure, after converting each group of sample address pairs into a sample address sequence, the address data labeling apparatus further includes: the replacement processing module is used for carrying out randomized replacement processing on one or more characters in the sample address sequence; the subtasks further include: and predicting the original character corresponding to the character subjected to the randomized substitution in the sample address sequence.
In an exemplary embodiment of the present disclosure, the model training unit includes: the function construction subunit is used for constructing a second loss function based on the normalization of the labeling path; and the model training subunit is used for inputting the second sample data into the address labeling model, and updating the parameters of the address labeling model according to the error between the output labeling path and the address category label of the second sample data and the second loss function so as to train and obtain the address labeling model.
In an exemplary embodiment of the present disclosure, after acquiring the address category tag of the second sample data, the address data labeling apparatus includes: the sample proportion calculation module is used for acquiring the quantity of the second sample data under each address category and calculating the sample proportion of each address category; and the data adjusting module is used for adjusting at least part of data in the second sample data if the sample proportion of at least one address category is lower than a preset threshold value, so that the sample proportion of each address category in the adjusted second sample data meets the preset condition.
In an exemplary embodiment of the present disclosure, the data adjusting module includes: the data deleting unit is used for deleting a part of data from second sample data of the address category with the sample proportion higher than the preset threshold; and/or the sample construction unit is used for constructing new sample data based on second sample data of the address category of which the sample proportion is lower than the preset threshold value, and adding the new sample data into the second sample data.
In an exemplary embodiment of the present disclosure, the character splitting module includes: an address acquisition unit, used for acquiring an address to be labeled and performing text cleaning on it; and a character splitting unit, used for splitting the cleaned address into single characters and converting each character into a numerical index. The sequence processing module includes: an index processing unit, used for processing the numerical indexes with the address labeling model to obtain a labeled data sequence formed by the tag index of each character. The result determination module includes: an index lookup unit, used for looking up the tag indexes in a preset tag dictionary to determine the labeling result of the address to be labeled.
According to an aspect of the present disclosure, there is provided an electronic device including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the method of any one of the above via execution of the executable instructions.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any one of the above.
Exemplary embodiments of the present disclosure have the following advantageous effects:
An address labeling model is acquired, the model being pre-trained based on unlabeled first sample data and labeled second sample data; the address to be labeled is split into a plurality of characters and thereby converted into a character sequence to be labeled, formed by arranging those characters; the character sequence is processed with the address labeling model to obtain a labeled data sequence; and the labeling result of the address to be labeled is determined from the labeled data sequence. On the one hand, this exemplary embodiment obtains the labeling result by building an address labeling model to process the character sequence of the address to be labeled; it depends far less on manual operation, the labeling process is highly automated, operation is simple and fast, and accuracy is higher. On the other hand, label annotation is performed only on part of the sample data during model training, after which the trained address labeling model can label large-scale address data; this shortens the labeling period, reduces human resource cost, and gives the method wide applicability.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.
Fig. 1 schematically shows a flowchart of an address data labeling method in the present exemplary embodiment;
FIG. 2 schematically illustrates a sub-flow diagram of an address data annotation method in the present exemplary embodiment;
FIG. 3 schematically illustrates a sub-flow diagram of another address data labeling method in the present exemplary embodiment;
FIG. 4 is a sub-flowchart schematically illustrating yet another address data labeling method in the present exemplary embodiment;
FIG. 5 schematically illustrates a flow chart for training an intermediate model in the present exemplary embodiment;
FIG. 6 is a diagram schematically illustrating an architecture of an address labeling model in the present exemplary embodiment;
FIG. 7 schematically illustrates a flow chart of migrating a training intermediate model in the present exemplary embodiment;
fig. 8 schematically shows a flowchart of processing data in the present exemplary embodiment;
FIG. 9 is a flowchart schematically showing another address data labeling method in the present exemplary embodiment;
fig. 10 is a block diagram schematically showing the configuration of an address data labeling apparatus in the present exemplary embodiment;
fig. 11 schematically illustrates an electronic device for implementing the above-described method in the present exemplary embodiment;
fig. 12 schematically illustrates a computer-readable storage medium for implementing the above-described method in the present exemplary embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
An exemplary embodiment of the present disclosure first provides an address data labeling method. Application scenarios of this embodiment may include the following: in an e-commerce platform, a large number of user delivery addresses are obtained and labeled through this exemplary embodiment, and the relation between region and shopping habits is analyzed from the labeling results so as to analyze user behavior; or, when training data for machine learning is obtained, the training data is labeled so that training labels are determined from the labeling results, and a machine learning model is trained on the training data and its corresponding labels.
The exemplary embodiment is further described with reference to fig. 1, and as shown in fig. 1, the address data labeling method may include the following steps S110 to S140:
step S110, an address labeling model is obtained, and the address labeling model is obtained based on the first sample data without the label and the second sample data with the label through pre-training.
The address labeling model is a model that processes an address to be labeled to obtain its labeling result, and may be a pre-trained model. Sample data refers to the training data used for model training; the present exemplary embodiment involves two kinds, the unlabeled first sample data and the labeled second sample data. During training, a machine learning model may first be pre-trained on the unlabeled first sample data to generate an intermediate model; the intermediate model is then trained on the labeled second sample data and its corresponding labels, and the model parameters are adjusted to obtain the final address labeling model. Because the address labeling model can be determined from a large amount of unlabeled first sample data and a small amount of labeled second sample data, the resource cost of manual labeling can be greatly reduced compared with a machine learning model trained on a large amount of labeled training data, and the model's dependence on manually labeled data is lowered.
Step S120, splitting the address to be labeled into a plurality of characters, so as to convert the address to be labeled into a character sequence to be labeled, which is formed by arranging a plurality of characters.
The address to be labeled refers to data requiring address labeling, such as a home address, office address, school address, hospital address, organization address, or the address of each link in a goods circulation supply chain. Address data records the geographic spatial information corresponding to human behavior or socio-economic activity and is a very valuable data resource in the big data era: spatial analysis based on address data can provide a quantitative decision basis for urban management and commercial operation. Address data is also complex and particular. For example, address naming lacks a uniform specification; addresses may be written with abbreviations, descriptive information, variant characters, and simplified or traditional forms; and Chinese addresses, unlike English ones, are not divided into geographic units by punctuation marks. Before processing the address to be labeled, the present exemplary embodiment may therefore split it into characters. For example, the address data "Hongqiao Town, Leqing City, Wenzhou City, Zhejiang Province" may be split character by character into "zhe/jiang/province/wen/zhou/city/le/qing/city/hong/qiao/town", and these characters are then converted into a character sequence to be labeled so that the address labeling model can process them.
Step S130, processing the character sequence to be labeled with the address labeling model to obtain a labeled data sequence.
Step S140, determining the labeling result of the address to be labeled from the labeled data sequence.
The character sequence converted from the address to be labeled is taken as the input of the address labeling model, and the model processes it to obtain the labeling result of the address. The labeling result is data that reflects the attributes of the data being labeled. It may be the direct output of the machine learning model, such as a numerical or alphabetical result that is then converted into the final labeling result, or it may itself be the final labeling result. For example, for the address data "Nanshan District, Shenzhen City, Guangdong", the result may be the numerical values "1", "2", "3", which are converted into the final labels "province", "city", "district"; the result may instead be segment-level labels such as "Guangdong: province", "Shenzhen: city", "Nanshan: district", or character-level labels in which every character of "Guangdong" is labeled "province", every character of "Shenzhen" is labeled "city", and every character of "Nanshan" is labeled "district". The present disclosure does not specifically limit the form of the labeling result.
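The index-to-label conversion described above can be sketched as follows; the tag dictionary contents are illustrative assumptions, since the disclosure does not fix the indices:

```python
# Hypothetical tag dictionary mapping model output indices to labels.
TAG_DICT = {1: "province", 2: "city", 3: "district"}

def decode_labels(address, tag_indices, tag_dict=TAG_DICT):
    """Pair each character of the address with the label found by
    looking up its predicted tag index in the tag dictionary."""
    return [(ch, tag_dict[i]) for ch, i in zip(address, tag_indices)]
```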
In an exemplary embodiment, in step S120, splitting the address to be labeled into a plurality of characters may include the following steps:
step S210, acquiring an address to be labeled, and performing text cleaning processing on the address to be labeled;
step S220, splitting a single character of the address to be marked after the text cleaning processing, and converting each split character into a numerical index;
step S130 may include:
step S230, processing the digital index by adopting an address labeling model to obtain a labeled data sequence formed by label indexes of each character;
step S140 may include:
step S240, according to the tag index, searching in a preset tag dictionary to determine the labeling result of the address to be labeled.
In this exemplary embodiment, the address to be labeled may be obtained from a database stored on the terminal device, or determined from address data entered by a user in real time so that it can be labeled immediately; the present disclosure does not specifically limit this. Considering the great variation in the format and form of acquired addresses, text cleaning may be performed after an address to be labeled is acquired, including but not limited to: removing address entries that are too short or too long; converting full-width characters to half-width; and removing spaces, tab characters, quotation marks, brackets, and other Chinese punctuation marks from the address text. The cleaned address is then split into single characters, and each character is converted into a numerical index so that the address labeling model can process it and produce a result for each index; finally, the label corresponding to each numerical index is determined from a preset tag dictionary, which gives the labeling result for each character of the address to be labeled.
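A sketch of the cleaning and indexing steps, assuming NFKC normalization for the full-width-to-half-width conversion; the punctuation set and the length bounds are illustrative assumptions:

```python
import unicodedata

# Illustrative punctuation set; the disclosure lists spaces, tabs,
# quotation marks, and brackets as examples of characters to remove.
PUNCT = set(" \t\"'（）()[]【】，。、；：,.;:")

def clean_address(text, min_len=4, max_len=64):
    """Text cleaning: full-width to half-width via NFKC, punctuation
    removal, and rejection of too-short or too-long entries."""
    text = unicodedata.normalize("NFKC", text)
    text = "".join(ch for ch in text if ch not in PUNCT)
    if not (min_len <= len(text) <= max_len):
        return None
    return text

def to_indices(text, char_vocab, unk=0):
    """Split the cleaned address into single characters and map each
    one to its numerical index (unknown characters map to unk)."""
    return [char_vocab.get(ch, unk) for ch in text]
```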
Based on the above description, in the present exemplary embodiment, an address labeling model is obtained, where the address labeling model is pre-trained on the basis of unlabeled first sample data and labeled second sample data; the address to be labeled is split into a plurality of characters so as to convert it into a character sequence to be labeled, formed by arranging those characters; the character sequence to be labeled is processed by the address labeling model to obtain a labeled data sequence; and the labeling result of the address to be labeled is determined according to the labeled data sequence. On the one hand, this exemplary embodiment obtains the labeling result by building an address labeling model to process the character sequence of the address to be labeled, relying less on manual operation to label address data: the labeling process is more intelligent, the operation is simple and fast, and the accuracy is higher. On the other hand, in this exemplary embodiment, labels are attached to only part of the sample data during model training, and the resulting address labeling model can then label large-scale address data, which shortens the labeling cycle, reduces human resource cost, and gives the method wide applicability.
In an exemplary embodiment, as shown in fig. 3, in step S110, the address labeling model may be trained by:
step S310, acquiring first sample data, second sample data and an address category label of the second sample data;
step S320, pre-training a machine learning model by utilizing first sample data to generate an intermediate model;
step S330, constructing an initial address labeling model based on at least one part of the intermediate model;
step S340, training and obtaining an address labeling model by using the second sample data and the address category label of the second sample data.
The difference between the two is that the first sample data is unlabeled sample data while the second sample data is labeled sample data. The present exemplary embodiment may construct a robust deep learning model from the first and second sample data to reduce the dependence on annotated data. Specifically, the machine learning model can be pre-trained on the first sample data in an unsupervised learning manner to generate an intermediate model, and the intermediate model can then undergo transfer learning with the labeled second sample data to obtain the final address labeling model. An address labeling model constructed in this way does not need regular manual updating and maintenance, reducing the manpower and material investment consumed by manually labeling a large amount of data. In addition, to ensure the accuracy of model training, the exemplary embodiment may acquire diverse and comprehensive sample data from multiple platforms; for example, the full set of shipping addresses from the most recent quarter in the order databases of multiple e-commerce platforms may be acquired as sample data.
In an exemplary embodiment, as shown in fig. 4, the step S310 of acquiring the first sample data and the second sample data may include the following steps:
step S410, obtaining initial sample data and carrying out standardization processing on the initial sample data;
step S420, performing stratified sampling on the initial sample data after the standardization processing, and updating the order of the initial sample data after the stratified sampling;
step S430, according to a preset ratio, dividing the initial sample data into first sample data and second sample data.
The initial sample data refers to unprocessed sample data used for model training. After the initial sample data is acquired, in order to improve the accuracy and convenience of machine learning model training, it may be subjected to standardization processing, which may specifically include but is not limited to: removing data that is too short or too long; character conversion, for example converting full-width characters into half-width characters; special-character processing, such as eliminating spaces, tabs, quotation marks, brackets of various kinds, and other punctuation marks in the data; and deduplication. In addition, the randomness of data acquisition may cause data imbalance; for example, 100 acquired address entries for Beijing might contain 60 entries for one district and only 20 each for two others. To keep the sample data unbiased, this exemplary embodiment may further apply stratified sampling to the standardized initial sample data; specifically, the acquired address data may be sampled in strata by province, city, and district/county, so that, continuing the example, the number of address entries per district is kept evenly distributed. The order of the stratified-sampled initial sample data is then updated, for example by randomly shuffling it. Finally, the initial sample data is divided, according to a preset ratio, into unlabeled first sample data and second sample data to be labeled. The preset ratio can be customized as needed, and the present disclosure does not specifically limit it.
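The sampling, shuffling, and splitting steps can be sketched as follows. Grouping by a (province, city, district) key, the per-group cap, and the 90/10 split ratio are illustrative assumptions; the patent leaves the ratio user-defined:

```python
import random
from collections import defaultdict

def stratified_split(records, key_fn, per_group_cap, unlabeled_ratio=0.9, seed=42):
    """Stratify records by key_fn, cap each stratum, shuffle, and split into
    (first_sample_data, second_sample_data) by a preset ratio."""
    groups = defaultdict(list)
    for rec in records:
        groups[key_fn(rec)].append(rec)
    rng = random.Random(seed)
    sampled = []
    for recs in groups.values():
        rng.shuffle(recs)
        sampled.extend(recs[:per_group_cap])  # cap each stratum to even out the distribution
    rng.shuffle(sampled)                      # update (randomize) the overall order
    cut = int(len(sampled) * unlabeled_ratio)
    return sampled[:cut], sampled[cut:]
```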
In an exemplary embodiment, after the above step S430, the address category tag of the second sample data may be acquired by:
labeling each character in the second sample data by adopting a preset labeling method to obtain an address category label of each character;
and updating the address category label of each character in the second sample data according to the verification result aiming at the address category label.
In order for the labeling results of address sequences to comprehensively reflect the administrative hierarchy, geographic entities, and other meaningful text information in address data, and thus to generalize across multi-service scenarios, this exemplary embodiment analyzes the attributes and characteristics of address data and summarizes and abstracts the categories appearing in it, such as the 15 label categories shown in Table 1, defining the content of each label category. Note that the label categories of the present disclosure are not limited to the 15 categories in Table 1.
TABLE 1 Category tag for address data and definition thereof
Province (province); City (city); County (district/county); Town (town); Village (village/community); Road (road); RoadNum (road number); RoadAux (road auxiliary point); Residence (residence/house); Build (building); School (school); Organization (organization); BuildNum (building number); Unit (unit); Location (direction/position). (The full table with per-category definitions appears as an image in the original publication.)
In the present exemplary embodiment, each character in the second sample data may be labeled by a preset labeling method: for example, the second sample data is imported into a text labeling tool (e.g., doccano), and each character is labeled using the BIO (Begin, Inside, Outside) sequence labeling scheme. To ensure the accuracy of the address category labels, this exemplary embodiment may further re-verify and correct the labeled results through manual verification, and update the address category label of each character in the second sample data according to the verification result, so as to ensure the accuracy and consistency of each character's address category label.
Labeling the address data with the BIO scheme means tagging each character in the address data as "B-X", "I-X", or "O". Here "B-X" indicates that the character is in a segment of type X and at the beginning of that segment; "I-X" indicates that the character is in a segment of type X but not at its beginning; and "O" indicates that the character belongs to no type. All possible address category labels may thus include: "B-Province", "I-Province", "B-City", "I-City", "B-County", "I-County", "B-Town", "I-Town", "B-Village", "I-Village", "B-Road", "I-Road", "B-RoadNum", "I-RoadNum", "B-RoadAux", "I-RoadAux", "B-Residence", "I-Residence", "B-Build", "I-Build", "B-School", "I-School", "B-Organization", "I-Organization", "B-BuildNum", "I-BuildNum", "B-Unit", "I-Unit", "B-Location", "I-Location", and "O", corresponding to the address category labels in Table 1. The specific label format is shown in Table 2:
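The BIO conversion can be illustrated with a small helper that turns an address plus labeled entity spans into per-character tags. The span format is a hypothetical stand-in for whatever a tool such as doccano exports:

```python
def spans_to_bio(text, spans):
    """spans: list of (start, end_exclusive, category) tuples over `text`.
    Returns one BIO tag per character: B-<cat> at a span start, I-<cat> inside,
    O outside any span."""
    tags = ["O"] * len(text)
    for start, end, cat in spans:
        tags[start] = f"B-{cat}"
        for i in range(start + 1, end):
            tags[i] = f"I-{cat}"
    return tags
```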
TABLE 2 annotation result form based on BIO sequence annotation
(The table content appears as an image in the original publication.)
In an exemplary embodiment, the step S320 may include the following steps:
determining a plurality of groups of sample address pairs from the first sample data, and converting each group of sample address pairs into a sample address sequence;
inputting the sample address sequence into a machine learning model to obtain a preset result value of the subtask;
updating parameters of the machine learning model according to a first loss function to obtain an intermediate model, wherein the first loss function comprises an error between a result value of the subtask and a tag value of the subtask;
wherein, the subtasks can include any one or combination of more of the following:
judging whether the province characters of two addresses in the sample address pair are the same;
judging whether the city characters of the two addresses in the sample address pair are the same;
and judging whether the zone characters of the two addresses in the sample address pair are the same.
When the machine learning model is pre-trained on the unlabeled first sample data, the first sample data required for model construction can be pre-processed and divided proportionally into a training set, a verification set, and a test set. For each address in the training set, an address is randomly selected from the remaining addresses with probability p, or the address itself is selected with probability 1-p, to construct a sample address pair. In other words, when a certain address is taken from the first sample data as one member of a sample address pair, another address is chosen from the other addresses with probability p, while with probability 1-p the address is paired with itself. The probability p may be set by the user as needed, for example p = 50%, which this disclosure does not specifically limit. The sample address sequence is then input into the machine learning model to obtain the result values of the preset subtasks. In this exemplary embodiment, any one or combination of the three subtasks may be set: the characters of the province, city, and district/county of the two addresses in the sample address pair are compared to determine whether the provinces of the two sample addresses are the same, whether the cities are the same, and whether the districts/counties are the same.
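The pair-construction step with its three binary subtask labels can be sketched as below. The `parse_fn` that extracts (province, city, district) from an unlabeled address is a hypothetical stand-in; the patent does not specify how those fields are obtained for the unlabeled corpus:

```python
import random

def build_pairs(addresses, parse_fn, p=0.5, seed=0):
    """parse_fn(addr) -> (province, city, district). With probability p, pair an
    address with a different random address; otherwise with itself. Each pair
    carries three 0/1 labels: same province / same city / same district."""
    rng = random.Random(seed)
    pairs = []
    for i, addr in enumerate(addresses):
        if rng.random() < p and len(addresses) > 1:
            j = rng.choice([k for k in range(len(addresses)) if k != i])
            other = addresses[j]
        else:
            other = addr
        pa, pb = parse_fn(addr), parse_fn(other)
        labels = tuple(int(a == b) for a, b in zip(pa, pb))
        pairs.append((addr, other, labels))
    return pairs
```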
Unlike ordinary text data, address data can contain the spatial topological relations of geographic regions, so its semantic expression carries obvious administrative levels and membership between those levels. For example, in address 1, "Hongnan Road, Hongqiao Town, Yueqing City, Wenzhou City, Zhejiang Province", the elements "Zhejiang", "Wenzhou City", and "Yueqing City" carry administrative levels and level-to-level subordination, which are important geographic information contained in the address data; meanwhile, the "Hangzhou" in address 2 ("Yangshan Township, Hangzhou") and the "Wenzhou" in address 1 are in a parallel, same-level relationship. Thus, the present exemplary embodiment proposes to encode such external knowledge and administrative-level membership into the semantic representation of the text by adding to the pre-trained language model the subtask constraints of "whether the province characters of two addresses in a sample address pair are the same", "whether the city characters of two addresses in a sample address pair are the same", and "whether the district characters of two addresses in a sample address pair are the same".
Specifically, when the model training data is preprocessed to construct the sample address pairs, identification bits "[CLS1]", "[CLS2]", and "[CLS3]" are reserved for the three subtasks of "whether the province characters of two addresses in the sample address pair are the same", "whether the city characters are the same", and "whether the district characters are the same", respectively. After the machine learning model (a BERT model may be used in the present exemplary embodiment) extracts semantic information layer by layer, the multidimensional vectors corresponding to "[CLS1]", "[CLS2]", and "[CLS3]" are extracted from the vector matrix output by the uppermost layer and each is fed into a binary classification model; the loss values loss1, loss2, and loss3 are calculated, and the model parameters are updated accordingly to obtain the intermediate model.
It should be noted that, to facilitate processing of the sample address pairs by the machine learning model, the present exemplary embodiment may first perform data processing on the first sample data and convert each group of sample address pairs into a sample address sequence. Specifically, each sample address pair obtained above may be split into single characters, and "[START]" and "[END]" marks added at the beginning and end to indicate where an address sequence starts and ends; a corpus dictionary is constructed, and the lengths of the two sample addresses are aligned to the maximum length set by the parameters. If the total length of the two sample addresses exceeds the set maximum length, the longer sample address is selected and non-"[START]"/"[END]" characters are deleted from its head or tail, chosen at random in turn, until the maximum-length condition is met; if the total length of the two address texts is less than the set maximum length, special characters "[PAD]" are appended at the tail. A sample address sequence to be processed by the machine learning model is constructed in this way.
In an exemplary embodiment, after converting each group of sample address pairs into a sample address sequence, the address data labeling method may further include:
performing randomized substitution processing on one or more characters in the sample address sequence;
the above subtasks may further include:
and predicting the original character corresponding to the character subjected to the randomized substitution in the sample address sequence.
In the present exemplary embodiment, in addition to the three subtasks described above, a "masked character prediction" subtask may be used to pre-train the machine learning model. Specifically, randomized substitution may be performed on one or more characters in the sample address sequence, including but not limited to randomly masking, replacing, or deleting a certain proportion of characters: of the characters selected for substitution, 80% may be replaced by "[MASK]", 10% kept as the original character, and 10% replaced by a character drawn at random from the corpus dictionary. Through the "masked character prediction" subtask, the original character corresponding to each substituted character in the sample address sequence is predicted; the loss value of this subtask is recorded as loss4, and the parameters of the machine learning model are updated according to the first loss function to obtain the intermediate model.
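A hedged sketch of the corruption step, following the BERT-style 80/10/10 scheme the text describes; the selection rate and vocabulary here are illustrative assumptions:

```python
import random

def mask_sequence(chars, vocab, mask_rate=0.15, seed=0):
    """Select ~mask_rate of positions; of those, 80% become '[MASK]', 10% keep
    the original character, 10% get a random vocabulary character. Returns the
    corrupted sequence and a dict of position -> original char to predict."""
    rng = random.Random(seed)
    corrupted, targets = list(chars), {}
    for i, ch in enumerate(chars):
        if rng.random() >= mask_rate:
            continue
        targets[i] = ch                        # original character to be predicted
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"
        elif roll < 0.9:
            pass                               # keep original character
        else:
            corrupted[i] = rng.choice(vocab)   # random replacement
    return corrupted, targets
```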
It should be noted that, in this exemplary embodiment, the pre-training language model may be a model including multi-subtask constraints, a total first loss function of the pre-training language model may be obtained by weighted summation of the first loss functions of each subtask, and a weight of the first loss function may be set by a user according to needs, which is not specifically limited in this disclosure.
In the present exemplary embodiment, as shown in fig. 5, the training process of the intermediate model may include the following steps:
step S510, acquiring first sample data;
step S520, determining a plurality of groups of sample address pairs from the first sample data, and converting each group of sample address pairs into a sample address sequence;
step S530, constructing a pre-training intermediate model, designing a subtask, and defining a first loss function;
step S540, training a machine learning model based on the first sample data, and calculating a loss value of a first loss function in a forward propagation process;
step S550, judging whether the loss value is lower than a first preset threshold value;
step S560, if the value is lower than the first preset threshold value, ending the model training to obtain an intermediate model after the pre-training is finished;
if the value is higher than the first preset threshold value, executing the step S570, performing a back propagation process, calculating the parameter updating gradient of each layer, and updating the weight value;
and returning to execute step S540, iterating multiple times until the model converges and the loss value falls below the first preset threshold, so as to obtain the pre-trained intermediate model and its optimal parameters.
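The iterate-until-threshold loop of Fig. 5 can be sketched framework-agnostically. `model.loss(batch)` and `model.step()` are hypothetical stand-ins for the forward pass (the weighted sum of the subtask losses) and the back-propagation update:

```python
def pretrain(model, batches, threshold, max_iters=1000):
    """Train until the mean loss falls below `threshold` (Fig. 5, S550/S560)
    or max_iters is reached. Returns (iterations_run, final_loss)."""
    loss = float("inf")
    for it in range(max_iters):
        loss = sum(model.loss(b) for b in batches) / len(batches)  # forward pass
        if loss < threshold:
            return it, loss        # converged: pre-trained intermediate model ready
        model.step()               # back-propagate gradients, update weights
    return max_iters, loss
```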
In an exemplary embodiment, the step S340 may include the following steps:
constructing a second loss function based on the normalization of the labeling path;
and inputting the second sample data into the address labeling model, and updating parameters of the address labeling model according to the error between the output labeling path and the address category label of the second sample data and the second loss function so as to train and obtain the address labeling model.
The present exemplary embodiment may transfer, in a fine-tuning manner, the language features, semantic patterns, membership between administrative levels, and other information about address data learned by the intermediate model pre-trained on the large-scale unlabeled first sample data, so as to reduce the dependence of the address labeling model on large-scale labeled sample data.
When performing migration training of the address labeling model, the second sample data (address data) and its label form may be as shown in Table 2. The present exemplary embodiment may construct training data of the following form from the second sample data: [CLS1] [CLS2] [CLS3] [START] [C1] … [Cm] [END] [PAD]. Here [CLSi] (i = 1, 2, 3) denotes the identification bits reserved for the 3 subtasks proposed in this exemplary embodiment, [Cj] denotes each character after the address data is split into single characters, and m denotes the number of single characters in the address data. Each character in the address text is then converted into the integer index of the corresponding character in the corpus dictionary.
The labeling labels ["START", "B-Province", "I-Province", "B-City", "I-City", "B-County", "I-County", "B-Town", "I-Town", "B-Village", "I-Village", "B-Road", "I-Road", "B-RoadNum", "I-RoadNum", "B-RoadAux", "I-RoadAux", "B-Residence", "I-Residence", "B-Build", "I-Build", "B-School", "I-School", "B-Organization", "I-Organization", "B-BuildNum", "I-BuildNum", "B-Unit", "I-Unit", "B-Location", "I-Location", "O", "END"] are mapped to corresponding integer values, for example integers 0 to 33, and the mapping is stored as a label dictionary for converting the results of model inference back into labels. The processed second sample data is divided into a training set, a verification set, and a test set in a certain proportion, such as 70%:15%:15%. The model is trained on the training set, and the hyper-parameters are adjusted according to the model's performance on the verification set.
Fig. 6 shows an architecture diagram of the address labeling model in the present exemplary embodiment, which specifically includes the following. The input layer 610 is built on the intermediate model pre-trained on large-scale unlabeled data (for example, an address-oriented BERT-base (768)). From the vector matrix output by the uppermost layer of the pre-trained intermediate model, the vectors corresponding to "[CLS1] [CLS2] [CLS3]" are extracted and input into the first middle layer 620 (TextCNN (Text Convolutional Neural Network) (64) + Dropout) to obtain a global representation vector of the address sequence, while the vectors corresponding to "[START] [C1] … [Cm] [END] [PAD]" are extracted and input into the second middle layer 630, a bidirectional LSTM (128) (Long Short-Term Memory; a GRU, Gated Recurrent Unit, may be used instead), i.e., two LSTMs propagating in opposite directions. The global representation vector output by the TextCNN is then concatenated with the vector corresponding to each character of the address sequence, and in the third middle layer 640 (Dropout (0.5)) the output vectors of the forward and backward LSTMs are concatenated. Two fully connected layers with dropout, the fourth middle layer 650 (Dense Layer (128)) and the fifth middle layer 660 (Dense Layer (33)), then produce the score of each character in the address sequence for each address category label, i.e., the state matrix. Finally, a CRF (Conditional Random Field) layer is connected as the sixth middle layer 670 to learn the implicit constraints between characters in the address sequence; for example, the beginning of a sentence should be "B-" or "O" rather than "I-", and in the pattern "B-label1 I-label2 I-label3 …", labels 1, 2, and 3 should be the same label category.
"B-City I-City" is correct, while "B-City I-Organization" is incorrect; "O I-label" is erroneous, the beginning of the named entity should be "B-" instead of "I-", and so on. The probability score of the address class label transition of the front character and the rear character in the address sequence is represented by a state transition matrix defined in a CRF layer. The final goal of the address labeling model is to find an optimal sequence labeling path, i.e. the score of the optimal path should be the highest score among all paths.
The second loss function of the address labeling model may be defined as:

$$Loss = -\log \frac{e^{s_{realPath}}}{\sum_{p} e^{s_{p}}}, \qquad s_{p} = \sum_{i=1}^{m} x_{i,y_i} + \sum_{i=1}^{m-1} t_{y_i,\,y_{i+1}}$$

where $s_p$ is the score of each possible sequence labeling path and $s_{realPath}$ is the score of the optimal path obtained by iterative optimization of the address labeling model; $x_{i,y_i}$ is the state score in the state matrix, with $i$ the position index of the $i$-th character in the address sequence and $y_i$ the index of the address category label corresponding to the $i$-th character; and $t_{y_i,y_{i+1}}$ is the likelihood score in the transition matrix of transitioning from label $y_i$ to label $y_{i+1}$.
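The CRF path score and loss can be made concrete with a brute-force numeric sketch: a path's score is its state scores plus its transition scores, and the loss is the negative log-probability of the true path. Enumeration is only feasible for this tiny example; real CRF layers use the forward algorithm instead:

```python
import itertools, math

def path_score(states, trans, path):
    """states[i][y]: state score of label y at position i; trans[a][b]: transition
    score from label a to label b; path: tuple of label indexes."""
    s = sum(states[i][y] for i, y in enumerate(path))      # sum of x_{i, y_i}
    s += sum(trans[a][b] for a, b in zip(path, path[1:]))  # sum of t_{y_i, y_{i+1}}
    return s

def crf_loss(states, trans, real_path):
    """-log softmax of the real path's score over all possible paths."""
    n_labels = len(states[0])
    all_paths = itertools.product(range(n_labels), repeat=len(states))
    log_z = math.log(sum(math.exp(path_score(states, trans, p)) for p in all_paths))
    return log_z - path_score(states, trans, real_path)
```

Minimizing this loss pushes the true path's score toward being the highest of all paths, which is exactly the training goal stated above.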
As shown in fig. 7, a specific training process may include the following steps:
step S710, acquiring second sample data and an address category label corresponding to the second sample data;
step S720, constructing an address labeling model on the basis of the pre-trained intermediate model, and defining a second loss function;
step S730, calculating a loss value of a second loss function in the forward propagation process based on the second sample data migration training intermediate model;
step S740, determining whether the loss value is lower than a second preset threshold value;
step S750, if the value is lower than a second preset threshold value, ending model training to obtain a trained optimal address labeling model;
if the value is higher than the second preset threshold value, executing the step S760, performing a back propagation process, calculating the parameter updating gradient of each layer, and updating the weight value;
and continuing to execute step S730, iterating multiple times until the model converges and the loss value falls below the second preset threshold, so as to obtain the final address labeling model and its optimal parameters.
Finally, the performance of the resulting address labeling model is evaluated on the test set. The test results of this exemplary embodiment show that the address text sequence labeling method based on transfer learning achieves excellent performance across label types. For example, the address labeling model achieves F1 values of 90% or more on 8 categories of labels (province, city, district/county, town, road, road number, building number, and unit), F1 values of 84.95% and 83.60% on the school and residence labels, respectively, and on the remaining 5 categories: road auxiliary point: 79.67, organization: 77.62, village/community: 74.49, building: 74.28, direction/position: 74.47. The detailed performance of the model is shown in Table 3:
TABLE 3 accuracy, recall and F1 values of the Address Mark model on the test set for each Address Category tag
(The table content appears as an image in the original publication.)
In an exemplary embodiment, after the address category tag of the second sample data is acquired in step S310, the address data tagging method may include the following steps:
acquiring the quantity of second sample data under each address category, and calculating the sample proportion of each address category;
and if the sample proportion of at least one address category is lower than the preset threshold, adjusting at least a part of data in the second sample data to enable the sample proportion of each address category in the adjusted second sample data to meet the preset condition.
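The proportion check in the two steps above can be sketched directly; the flat per-entity tag list and the threshold value are illustrative assumptions:

```python
from collections import Counter

def scarce_categories(tags, threshold):
    """tags: flat list of per-entity category names in the second sample data.
    Returns the address categories whose sample proportion is below threshold,
    i.e., the trigger for the adjustment step."""
    counts = Counter(tags)
    total = sum(counts.values())
    return sorted(cat for cat, c in counts.items() if c / total < threshold)
```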
Considering the particularity of address data, for example that nearly every address contains administrative-level labels such as province, city, district/county, and village/community, the label counts in the second sample data are extremely unbalanced: province, city, and district/county labels far outnumber the other address category labels, causing the model to pay more attention to labeling those three correctly even though they are themselves relatively easy to label. Meanwhile, buildings, organizations, schools, residences, villages/communities, and the like have few samples due to the diversity of entity names, so the address labeling model is insufficiently trained on those labels. Therefore, after the address category labels of the second sample data are obtained, this exemplary embodiment may count the second sample data under each address category, calculate the sample proportion of each address category, and, if at least one address category's sample proportion is below a preset threshold, adjust at least part of the second sample data, that is, perform data imbalance processing or data enhancement processing.
It should be noted that, to avoid data distortion caused by excessive data adjustment, this exemplary embodiment may set a preset condition: after counting the number of each label category in the processed training set, data adjustment is performed selectively in the above manner so that the label counts become approximately equal, and the adjustment process ends once the preset condition is satisfied. For example, data adjustment stops when the amount of adjusted data exceeds 2 times the size of the original training set.
Specifically, in an exemplary embodiment, adjusting at least a part of data in the second sample data may include the following steps:
(1) deleting a part of data from second sample data of the address category with the sample proportion higher than a preset threshold; and/or
(2) constructing new sample data based on the second sample data of the address categories whose sample proportion is lower than the preset threshold, and adding the new sample data to the second sample data.
For example, since province, city, and district/county labels far outnumber the other address category labels in the second sample data, some second sample data under the province, city, and district/county address category labels may be deleted in manner (1), keeping the amount of second sample data balanced across the province, city, district/county, village/community, and other address category labels.
In addition, the second sample data may be adjusted in manner (2). Specifically, the entity names of the low-volume labels (such as building, organization, school, residence, and village/community) are extracted from the labeled data and deduplicated to construct an entity-name set per category; labeled data in a certain proportion is then randomly drawn from the labeled second sample data, each address in it is adjusted in a targeted way, and the result is added to the training set so that the sample amounts of the labels in the combined training set are balanced. For example, entity-name sets are constructed for the 5 categories of building, organization, school, residence, and village/community, and for each address n (0 <= n <= 5) label categories are randomly selected for data adjustment. For each selected label, one of 3 actions, "no processing", "delete the label's entity", or "randomly replace the current entity name with one from the corresponding entity-name set", is chosen at random, and the adjusted data is added to the training set.
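The per-address augmentation for a scarce category can be sketched as below. The token format (a list of (character, tag) pairs) and the single-span assumption are illustrative simplifications of the patent's procedure:

```python
import random

def augment(tokens, category, name_pool, rng):
    """tokens: list of (char, tag). For the span tagged B-/I-<category>, randomly
    keep it, delete it, or replace it with a name drawn from name_pool."""
    span = [i for i, (_, t) in enumerate(tokens) if t.endswith(category)]
    if not span:
        return tokens
    action = rng.choice(["keep", "delete", "replace"])
    if action == "keep":
        return tokens
    head, tail = tokens[:span[0]], tokens[span[-1] + 1:]
    if action == "delete":
        return head + tail
    name = rng.choice(name_pool)  # swap in a random entity name of the same category
    new = [(name[0], f"B-{category}")] + [(c, f"I-{category}") for c in name[1:]]
    return head + new + tail
```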
Fig. 8 shows a flowchart of data processing in the present exemplary embodiment, which may specifically include the following steps:
step S810, acquiring second sample data;
step S820, labeling each character in the second sample data by adopting a preset labeling method to obtain an address category label of each character;
step S830, updating the address category label of each character in the second sample data according to the verification result aiming at the address category label;
step 840, obtaining the quantity of second sample data under each address category, and calculating the sample proportion of each address category;
step S850, judging whether the sample proportion of each address type in the second sample data is balanced;
step S860, if the sample proportion of each address category in the second sample data is not balanced, adjusting at least a part of data in the second sample data to enable the sample proportion of each address category in the adjusted second sample data to meet a preset condition;
in step S870, if the sample proportion of each address category in the second sample data is balanced, the data processing flow is completed.
Fig. 9 is a flowchart illustrating another address data labeling method in the present exemplary embodiment, which may specifically include the following steps:
step S910, obtaining initial sample data and carrying out standardization processing on the initial sample data;
step S920, dividing the initial sample data into first sample data and second sample data according to a preset proportion;
step S930, pre-training a machine learning model by utilizing the first sample data to generate an intermediate model;
step S940, marking each character in the second sample data by adopting a preset marking method, and verifying to obtain an address category label of each character;
step S950, performing data adjustment on the sample data with unbalanced sample proportion, and constructing second sample data required by transfer learning;
step S960, migrating the training intermediate model by using the second sample data and the address category label of the second sample data to obtain an address labeling model;
step S970, acquiring the address to be labeled, and processing the address to be labeled by adopting the address labeling model to obtain a labeling result of the address to be labeled.
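The flow of steps S910 through S970 can be sketched as a single driver function. Every stage here is a placeholder stand-in (the default arguments are trivial stubs), not an API defined by the patent:

```python
def build_address_labeler(initial_samples, split_ratio=0.8,
                          normalize=str.strip,
                          pretrain=lambda data: {"pretrained_on": len(data)},
                          annotate=lambda data: [(s, "O") for s in data],
                          transfer_train=lambda model, labeled:
                              {**model, "finetuned_on": len(labeled)}):
    """Illustrative driver for steps S910-S970."""
    samples = [normalize(s) for s in initial_samples]   # S910: standardize
    cut = int(len(samples) * split_ratio)               # S920: split by ratio
    first, second = samples[:cut], samples[cut:]
    intermediate = pretrain(first)                      # S930: pre-train
    labeled = annotate(second)                          # S940-S950: label, verify, balance
    return transfer_train(intermediate, labeled)        # S960: transfer learning
```

With a 0.8 split ratio, five initial samples yield four pre-training samples and one labeled fine-tuning sample in this sketch.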
An exemplary embodiment of the present disclosure also provides an address data labeling apparatus. Referring to fig. 10, the apparatus 1000 may include a model obtaining module 1010, configured to obtain an address labeling model, where the address labeling model is obtained by pre-training on the basis of unlabeled first sample data and labeled second sample data; a character splitting module 1020, configured to split the address to be labeled into multiple characters, so as to convert the address to be labeled into a character sequence to be labeled, where the character sequence is formed by arranging multiple characters; a sequence processing module 1030, configured to process the character sequence to be labeled by using an address labeling model, so as to obtain a labeled data sequence; and the result determining module 1040 is configured to determine, according to the labeled data sequence, a labeling result of the address to be labeled.
In an exemplary embodiment, the address labeling model may be trained by: the data acquisition unit is used for acquiring first sample data, second sample data and an address category label of the second sample data; the intermediate model generation unit is used for pre-training the machine learning model by utilizing the first sample data to generate an intermediate model; an initial model building unit, configured to build an initial address tagging model based on at least a part of the intermediate model; and the model training unit is used for training and obtaining the address labeling model by utilizing the second sample data and the address category label of the second sample data.
In an exemplary embodiment, the data acquisition unit may include: the standardization processing subunit is used for acquiring initial sample data and carrying out standardization processing on the initial sample data; the hierarchical processing subunit is used for carrying out hierarchical sampling on the initial sample data after the standardization processing and updating the sequence of the initial sample data after the hierarchical sampling; and the data dividing subunit is used for dividing the initial sample data into first sample data and second sample data according to a preset proportion.
In an exemplary embodiment, after the initial sample data is divided into the first sample data and the second sample data, the address category tag of the second sample data may be acquired by: the character labeling subunit is used for labeling each character in the second sample data by adopting a preset labeling method to obtain an address category label of each character; and the label updating subunit is used for updating the address category label of each character in the second sample data according to the verification result aiming at the address category label.
In an exemplary embodiment, the intermediate model generating unit may include: the sequence conversion subunit is used for determining a plurality of groups of sample address pairs from the first sample data and converting each group of sample address pairs into a sample address sequence; the sequence input subunit is used for inputting the sample address sequence into the machine learning model to obtain a result value of a preset subtask; a parameter updating subunit, configured to update parameters of the machine learning model according to a first loss function to obtain an intermediate model, where the first loss function includes an error between a result value of the subtask and a tag value of the subtask; wherein the subtasks include any one or combination of more than one of the following: judging whether the province characters of two addresses in the sample address pair are the same; judging whether the city characters of the two addresses in the sample address pair are the same; and judging whether the zone characters of the two addresses in the sample address pair are the same.
In an exemplary embodiment, after converting each group of sample address pairs into a sample address sequence, the address data labeling apparatus may further include: the replacement processing module is used for carrying out randomized replacement processing on one or more characters in the sample address sequence; the subtasks also include: and predicting the original character corresponding to the character subjected to the randomized substitution in the sample address sequence.
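The pre-training subtasks above reduce to simple label generators. A sketch under stated assumptions (addresses are pre-parsed into province/city/district fields, and the replacement rate and vocabulary are illustrative choices, not values from the patent):

```python
import random

def pair_subtask_labels(addr_a, addr_b):
    """Subtask labels for one sample address pair: whether the two
    addresses share the same province, city, and district fields."""
    return {
        "same_province": addr_a["province"] == addr_b["province"],
        "same_city": addr_a["city"] == addr_b["city"],
        "same_district": addr_a["district"] == addr_b["district"],
    }

def randomize_chars(sequence, rate=0.15, vocab="ABCDEFG"):
    """Randomized replacement subtask input: replace characters at
    random; the model must predict the originals recorded in `targets`."""
    out, targets = [], {}
    for i, ch in enumerate(sequence):
        if random.random() < rate:
            targets[i] = ch                # original character to predict
            out.append(random.choice(vocab))
        else:
            out.append(ch)
    return "".join(out), targets
```

The pair labels supply the tag values compared against the subtask result values in the first loss function.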
In an exemplary embodiment, the model training unit may include: the function construction subunit is used for constructing a second loss function based on the normalization of the labeling path; and the model training subunit is used for inputting the second sample data into the address labeling model, and updating the parameters of the address labeling model according to the error between the output labeling path and the address category label of the second sample data and the second loss function so as to train and obtain the address labeling model.
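A loss "based on the normalization of the labeling path" matches the standard linear-chain CRF likelihood: the gold path's score normalized over all possible paths. A brute-force sketch for tiny sequences (the score values and tag set are illustrative; a real implementation would use dynamic programming rather than enumerating paths):

```python
import math
from itertools import product

def crf_path_loss(emissions, transitions, gold_path):
    """Negative log-probability of the gold labeling path.

    emissions:   list of {tag: score} dicts, one per character
    transitions: {(prev_tag, tag): score}
    gold_path:   tuple of gold tags, one per character
    """
    tags = list(emissions[0])

    def path_score(path):
        s = sum(emissions[i][t] for i, t in enumerate(path))
        s += sum(transitions[(a, b)] for a, b in zip(path, path[1:]))
        return s

    # Partition function: log-sum-exp over every possible labeling path.
    log_z = math.log(sum(math.exp(path_score(p))
                         for p in product(tags, repeat=len(emissions))))
    return log_z - path_score(gold_path)
```

Minimizing this loss raises the gold path's share of the total path probability mass, which is the normalization the second loss function refers to.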
In an exemplary embodiment, after acquiring the address category tag of the second sample data, the address data annotating device may include: the sample proportion calculation module is used for acquiring the quantity of second sample data under each address category and calculating the sample proportion of each address category; and the data adjusting module is used for adjusting at least part of data in the second sample data if the sample proportion of at least one address category is lower than a preset threshold value, so that the sample proportion of each address category in the adjusted second sample data meets a preset condition.
In an exemplary embodiment, the data adjustment module may include: the data deleting unit is used for deleting a part of data from second sample data of the address category with the sample proportion higher than a preset threshold value; and/or the sample construction unit is used for constructing new sample data based on second sample data of the address category with the sample proportion lower than the preset threshold value, and adding the new sample data into the second sample data.
In an exemplary embodiment, the character splitting module may include: the address acquisition unit is used for acquiring the address to be marked and carrying out text cleaning processing on the address to be marked; the character splitting unit is used for splitting the address to be marked into single characters after the text cleaning processing, and converting each split character into a numerical index; the sequence processing module comprises: the index processing unit is used for processing the numerical index by adopting an address labeling model to obtain a labeled data sequence formed by the label indexes of each character; the result determination module includes: and the index searching unit is used for searching in a preset label dictionary according to the label index so as to determine the labeling result of the address to be labeled.
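The index conversion on the way in and the label-dictionary lookup on the way out are thin mapping layers. A minimal sketch, assuming a character vocabulary with index 0 reserved for unknown characters and a simple integer-to-tag dictionary (both layouts are assumptions, not specified by the patent):

```python
def encode_address(address, char_vocab):
    """Split the cleaned address into single characters and map each
    to its numerical index; unknown characters fall back to index 0."""
    return [char_vocab.get(ch, 0) for ch in address]

def decode_labels(label_indices, label_dict):
    """Look up each predicted label index in the preset label
    dictionary to recover the labeling result."""
    return [label_dict[i] for i in label_indices]
```

The address labeling model sits between these two steps, consuming the index sequence and emitting the label-index sequence.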
The specific details of each module/unit in the above-mentioned apparatus have been described in detail in the embodiment of the method section, and the details that are not disclosed may refer to the contents of the embodiment of the method section, and therefore are not described herein again.
Exemplary embodiments of the present disclosure also provide an electronic device capable of implementing the above method.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module" or "system."
An electronic device 1100 according to such an exemplary embodiment of the present disclosure is described below with reference to fig. 11. The electronic device 1100 shown in fig. 11 is only an example and should not bring any limitations to the function and scope of use of the embodiments of the present disclosure.
As shown in fig. 11, electronic device 1100 is embodied in the form of a general purpose computing device. The components of the electronic device 1100 may include, but are not limited to: the at least one processing unit 1110, the at least one memory unit 1120, a bus 1130 connecting different system components (including the memory unit 1120 and the processing unit 1110), and a display unit 1140.
Where the memory unit stores program code, the program code may be executed by the processing unit 1110 to cause the processing unit 1110 to perform the steps according to various exemplary embodiments of the present disclosure as described in the above-mentioned "exemplary methods" section of this specification. For example, the processing unit 1110 may execute steps S110 to S140 shown in fig. 1, may execute steps S210 to S240 shown in fig. 2, and the like.
The storage unit 1120 may include readable media in the form of volatile storage units, such as a random access memory unit (RAM) 1121 and/or a cache memory unit 1122, and may further include a read-only memory unit (ROM) 1123.
The storage unit 1120 may also include a program/utility 1124 having a set (at least one) of program modules 1125, such program modules 1125 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 1130 may be representative of one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 1100 may also communicate with one or more external devices 1300 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 1100, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 1100 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 1150. Also, the electronic device 1100 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the internet) via the network adapter 1160. As shown, the network adapter 1160 communicates with the other modules of the electronic device 1100 over the bus 1130. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 1100, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the exemplary embodiments of the present disclosure.
Exemplary embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps according to various exemplary embodiments of the disclosure described in the above-mentioned "exemplary methods" section of this specification, when the program product is run on the terminal device.
Referring to fig. 12, a program product 1200 for implementing the above method according to an exemplary embodiment of the present disclosure is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++, or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
Furthermore, the above-described figures are merely schematic illustrations of processes included in methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit according to an exemplary embodiment of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided and embodied by a plurality of modules or units.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.

Claims (13)

1. An address data labeling method is characterized by comprising the following steps:
acquiring an address labeling model, wherein the address labeling model is obtained by pre-training based on unlabeled first sample data and labeled second sample data;
splitting an address to be marked into a plurality of characters so as to convert the address to be marked into a character sequence to be marked, wherein the character sequence to be marked is formed by arranging the characters;
processing the character sequence to be marked by adopting the address marking model to obtain a marked data sequence;
and determining the labeling result of the address to be labeled according to the labeling data sequence.
2. The method of claim 1, wherein the address labeling model is trained by:
acquiring first sample data, second sample data and an address category label of the second sample data;
pre-training a machine learning model by using the first sample data to generate an intermediate model;
constructing an initial address labeling model based on at least a portion of the intermediate model;
and training and obtaining the address labeling model by using the second sample data and the address category label of the second sample data.
3. The method of claim 2, wherein said obtaining first and second sample data comprises:
acquiring initial sample data, and carrying out standardization processing on the initial sample data;
carrying out layered sampling on the initial sample data after the standardization processing, and updating the sequence of the initial sample data after the layered sampling;
and dividing the initial sample data into the first sample data and the second sample data according to a preset proportion.
4. The method according to claim 3, wherein after dividing initial sample data into the first sample data and second sample data, obtaining an address class label of the second sample data by:
labeling each character in the second sample data by adopting a preset labeling method to obtain an address category label of each character;
and updating the address category label of each character in the second sample data according to the verification result aiming at the address category label.
5. The method of claim 2, wherein pre-training a machine learning model with the first sample data, generating an intermediate model, comprises:
determining a plurality of groups of sample address pairs from the first sample data, and converting each group of sample address pairs into a sample address sequence;
inputting the sample address sequence into a machine learning model to obtain a result value of a preset subtask;
updating parameters of the machine learning model according to a first loss function to obtain the intermediate model, wherein the first loss function comprises an error between a result value of the subtask and a tag value of the subtask;
wherein the subtasks include any one or a combination of more than one of:
judging whether the province characters of the two addresses in the sample address pair are the same;
judging whether the city characters of the two addresses in the sample address pair are the same;
and judging whether the zone characters of the two addresses in the sample address pair are the same.
6. The method of claim 5, wherein after converting each set of sample address pairs into a sequence of sample addresses, the method further comprises:
performing randomized substitution processing on one or more characters in the sample address sequence;
the subtasks further include:
and predicting the original character corresponding to the character subjected to the randomized substitution in the sample address sequence.
7. The method of claim 2, wherein said training and deriving the address labeling model using the second sample data and the address class label of the second sample data comprises:
constructing a second loss function based on the normalization of the labeling path;
inputting the second sample data into the address labeling model, and updating parameters of the address labeling model according to the error between the output labeling path and the address category label of the second sample data and the second loss function so as to train and obtain the address labeling model.
8. The method of claim 2, wherein after obtaining the address class tag for the second sample data, the method comprises:
acquiring the quantity of the second sample data under each address category, and calculating the sample proportion of each address category;
if the sample proportion of at least one address category is lower than a preset threshold value, adjusting at least a part of data in the second sample data to enable the sample proportion of each address category in the second sample data after adjustment to meet the preset condition.
9. The method of claim 8, wherein said adjusting at least a portion of said second sample data comprises:
deleting a part of data from second sample data of the address category with the sample proportion higher than the preset threshold; and/or
And constructing new sample data based on second sample data of the address category with the sample proportion lower than the preset threshold value, and adding the new sample data into the second sample data.
10. The method of claim 1, wherein the splitting the address to be labeled into a plurality of characters comprises:
acquiring an address to be marked, and performing text cleaning processing on the address to be marked;
splitting the address to be marked into single characters after the text cleaning processing is carried out, and converting each split character into a numerical index;
the processing the character sequence to be labeled by adopting the address labeling model to obtain a labeled data sequence comprises the following steps:
processing the digital index by adopting an address labeling model to obtain a labeled data sequence formed by the label indexes of each character;
the determining the labeling result of the address to be labeled according to the labeling data sequence includes:
and searching in a preset label dictionary according to the label index to determine the labeling result of the address to be labeled.
11. An address data labeling apparatus, comprising:
the model acquisition module is used for acquiring an address labeling model, and the address labeling model is obtained by pre-training based on unlabeled first sample data and labeled second sample data;
the character splitting module is used for splitting the address to be marked into a plurality of characters so as to convert the address to be marked into a character sequence to be marked, wherein the character sequence to be marked is formed by arranging the characters;
the sequence processing module is used for processing the character sequence to be labeled by adopting the address labeling model to obtain a labeled data sequence;
and the result determining module is used for determining the labeling result of the address to be labeled according to the labeling data sequence.
12. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any of claims 1-10 via execution of the executable instructions.
13. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1-10.
CN201911349674.1A 2019-12-24 2019-12-24 Address data labeling method and device, electronic equipment and storage medium Active CN111125365B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911349674.1A CN111125365B (en) 2019-12-24 2019-12-24 Address data labeling method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911349674.1A CN111125365B (en) 2019-12-24 2019-12-24 Address data labeling method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111125365A true CN111125365A (en) 2020-05-08
CN111125365B CN111125365B (en) 2022-01-07

Family

ID=70500079

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911349674.1A Active CN111125365B (en) 2019-12-24 2019-12-24 Address data labeling method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111125365B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112069293A (en) * 2020-09-14 2020-12-11 上海明略人工智能(集团)有限公司 Data annotation method and device, electronic equipment and computer readable medium
CN112131415A (en) * 2020-09-18 2020-12-25 北京影谱科技股份有限公司 Method and device for improving data acquisition quality based on deep learning
CN112131384A (en) * 2020-08-27 2020-12-25 科航(苏州)信息科技有限公司 News classification method and computer-readable storage medium
CN112257413A (en) * 2020-10-30 2021-01-22 深圳壹账通智能科技有限公司 Address parameter processing method and related equipment
CN112488200A (en) * 2020-11-30 2021-03-12 上海寻梦信息技术有限公司 Logistics address feature extraction method, system, equipment and storage medium
CN112579919A (en) * 2020-12-09 2021-03-30 小红书科技有限公司 Data processing method and device and electronic equipment
CN112989166A (en) * 2021-03-26 2021-06-18 杭州有数金融信息服务有限公司 Method for calculating actual business territory of enterprise
CN113011157A (en) * 2021-03-19 2021-06-22 中国联合网络通信集团有限公司 Method, device and equipment for hierarchical processing of address information
KR20210154069A (en) * 2020-06-11 2021-12-20 베이징 바이두 넷컴 사이언스 앤 테크놀로지 코., 엘티디. Method, apparatus, device and storage medium for training model
WO2022134592A1 (en) * 2020-12-23 2022-06-30 深圳壹账通智能科技有限公司 Address information resolution method, apparatus and device, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109684440A (en) * 2018-12-13 2019-04-26 北京惠盈金科技术有限公司 Address method for measuring similarity based on level mark
CN109960800A (en) * 2019-03-13 2019-07-02 安徽省泰岳祥升软件有限公司 Weakly supervised file classification method and device based on Active Learning
CN109977395A (en) * 2019-02-14 2019-07-05 北京三快在线科技有限公司 Handle method, apparatus, electronic equipment and the readable storage medium storing program for executing of address text
CN110222337A (en) * 2019-05-28 2019-09-10 浙江邦盛科技有限公司 A kind of Chinese address segmenting method based on transformer and CRF
CN110297909A (en) * 2019-07-05 2019-10-01 中国工商银行股份有限公司 A kind of classification method and device of no label corpus

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109684440A (en) * 2018-12-13 2019-04-26 北京惠盈金科技术有限公司 Address method for measuring similarity based on level mark
CN109977395A (en) * 2019-02-14 2019-07-05 北京三快在线科技有限公司 Handle method, apparatus, electronic equipment and the readable storage medium storing program for executing of address text
CN109960800A (en) * 2019-03-13 2019-07-02 安徽省泰岳祥升软件有限公司 Weakly supervised file classification method and device based on Active Learning
CN110222337A (en) * 2019-05-28 2019-09-10 浙江邦盛科技有限公司 A kind of Chinese address segmenting method based on transformer and CRF
CN110297909A (en) * 2019-07-05 2019-10-01 中国工商银行股份有限公司 A kind of classification method and device of no label corpus

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102534721B1 (en) 2020-06-11 2023-05-22 베이징 바이두 넷컴 사이언스 앤 테크놀로지 코., 엘티디. Method, apparatus, device and storage medium for training model
JP7166322B2 (en) 2020-06-11 2022-11-07 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Methods, apparatus, electronics, storage media and computer programs for training models
KR20210154069A (en) * 2020-06-11 2021-12-20 베이징 바이두 넷컴 사이언스 앤 테크놀로지 코., 엘티디. Method, apparatus, device and storage medium for training model
JP2021197137A (en) * 2020-06-11 2021-12-27 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Method, device, electronic apparatus, storage medium, and computer program for training model
CN112131384A (en) * 2020-08-27 2020-12-25 科航(苏州)信息科技有限公司 News classification method and computer-readable storage medium
CN112069293A (en) * 2020-09-14 2020-12-11 上海明略人工智能(集团)有限公司 Data annotation method and device, electronic equipment and computer readable medium
CN112069293B (en) * 2020-09-14 2024-04-19 上海明略人工智能(集团)有限公司 Data labeling method, device, electronic equipment and computer readable medium
CN112131415A (en) * 2020-09-18 2020-12-25 北京影谱科技股份有限公司 Method and device for improving data acquisition quality based on deep learning
CN112257413B (en) * 2020-10-30 2022-05-17 深圳壹账通智能科技有限公司 Address parameter processing method and related equipment
CN112257413A (en) * 2020-10-30 2021-01-22 深圳壹账通智能科技有限公司 Address parameter processing method and related equipment
WO2022089227A1 (en) * 2020-10-30 2022-05-05 深圳壹账通智能科技有限公司 Address parameter processing method, and related device
CN112488200A (en) * 2020-11-30 2021-03-12 上海寻梦信息技术有限公司 Logistics address feature extraction method, system, equipment and storage medium
CN112579919B (en) * 2020-12-09 2023-04-21 小红书科技有限公司 Data processing method and device and electronic equipment
CN112579919A (en) * 2020-12-09 2021-03-30 小红书科技有限公司 Data processing method and device and electronic equipment
WO2022134592A1 (en) * 2020-12-23 2022-06-30 深圳壹账通智能科技有限公司 Address information resolution method, apparatus and device, and storage medium
CN113011157A (en) * 2021-03-19 2021-06-22 中国联合网络通信集团有限公司 Method, device and equipment for hierarchical processing of address information
CN112989166A (en) * 2021-03-26 2021-06-18 杭州有数金融信息服务有限公司 Method for calculating actual business territory of enterprise

Also Published As

Publication number Publication date
CN111125365B (en) 2022-01-07

Similar Documents

Publication Publication Date Title
CN111125365B (en) Address data labeling method and device, electronic equipment and storage medium
Qi et al. Finding all you need: web APIs recommendation in web of things through keywords search
CN110825882B (en) Knowledge graph-based information system management method
US9268766B2 (en) Phrase-based data classification system
AU2020321751A1 (en) Neural network system for text classification
WO2022218186A1 (en) Method and apparatus for generating personalized knowledge graph, and computer device
CN109992673A (en) Knowledge graph generation method, device, equipment, and readable storage medium
JP2022003512A (en) Method and apparatus for constructing quality evaluation model, electronic device, storage medium, and computer program
US11194963B1 (en) Auditing citations in a textual document
US20220100772A1 (en) Context-sensitive linking of entities to private databases
US11681876B2 (en) Cascaded fact-based summarization
CN116383399A (en) Event public opinion risk prediction method and system
Yang et al. DOMFN: A divergence-orientated multi-modal fusion network for resume assessment
Wei et al. GP-GCN: Global features of orthogonal projection and local dependency fused graph convolutional networks for aspect-level sentiment classification
Liu et al. Supporting features updating of apps by analyzing similar products in App stores
US20220100967A1 (en) Lifecycle management for customized natural language processing
CN116432611A (en) Manuscript writing auxiliary method, system, terminal and storage medium
US11475211B1 (en) Elucidated natural language artifact recombination with contextual awareness
CN114936564A (en) Multi-language semantic matching method and system based on alignment variational self-coding
Wang et al. Construction of bilingual knowledge graph based on meteorological simulation
CN116028620B (en) Method and system for generating patent abstract based on multi-task feature cooperation
US20240054282A1 (en) Elucidated natural language artifact recombination with contextual awareness
CN117251650B (en) Geographic hotspot center identification method, device, computer equipment and storage medium
Carroll et al. Modeling the annotation process for ancient corpus creation
Oliveira et al. Sentiment analysis of stock market behavior from Twitter using the R Tool

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: Room 221, 2/F, Block C, 18 Kechuang 11th Street, Daxing District, Beijing, 100176

Applicant after: Jingdong Technology Holding Co.,Ltd.

Address before: Room 221, 2/F, Block C, 18 Kechuang 11th Street, Daxing District, Beijing, 100176

Applicant before: Jingdong Digital Technology Holding Co.,Ltd.

Address after: Room 221, 2/F, Block C, 18 Kechuang 11th Street, Daxing District, Beijing, 100176

Applicant after: Jingdong Digital Technology Holding Co.,Ltd.

Address before: Room 221, 2/F, Block C, 18 Kechuang 11th Street, Daxing District, Beijing, 100176

Applicant before: JINGDONG DIGITAL TECHNOLOGY HOLDINGS Co.,Ltd.

GR01 Patent grant