CN111125365A - Address data labeling method and device, electronic equipment and storage medium - Google Patents

Address data labeling method and device, electronic equipment and storage medium

Info

Publication number
CN111125365A
CN111125365A (application CN201911349674.1A; granted publication CN111125365B)
Authority
CN
China
Prior art keywords
address
sample data
data
sample
labeling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911349674.1A
Other languages
Chinese (zh)
Other versions
CN111125365B (en)
Inventor
黄绿君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JD Digital Technology Holdings Co Ltd
Original Assignee
JD Digital Technology Holdings Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JD Digital Technology Holdings Co Ltd
Priority to CN201911349674.1A
Publication of CN111125365A
Application granted
Publication of CN111125365B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/387 Retrieval characterised by using metadata using geographical or spatial information, e.g. location
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Abstract

The present disclosure provides an address data labeling method, an address data labeling apparatus, an electronic device, and a computer-readable storage medium, and belongs to the technical field of data processing. The method includes the following steps: acquiring an address labeling model, where the address labeling model is pre-trained based on unlabeled first sample data and labeled second sample data; splitting an address to be labeled into a plurality of characters so as to convert it into a character sequence to be labeled, formed by arranging those characters; processing the character sequence to be labeled with the address labeling model to obtain a labeled data sequence; and determining the labeling result of the address to be labeled from the labeled data sequence. The method and apparatus can label address data accurately and efficiently.

Description

Address data labeling method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to an address data labeling method, an address data labeling apparatus, an electronic device, and a computer-readable storage medium.
Background
With the advent of the information age, a large amount of data is generated, and to facilitate analysis, processing, and sound decision-making, this data is usually labeled in a systematic way. In particular, when spatial analysis is performed on human behavior or socio-economic activity, labeling address data is especially important: effectively labeled address data can provide a quantitative decision basis for urban management and commercial operation.
Existing address data labeling is usually performed manually based on preset rules. However, address data takes many forms, and different strings may carry the same meaning; for example, "Inner Mongolia" and "Inner Mongolia Autonomous Region" refer to the same region and should receive the same label. Because address data updates quickly and is highly diverse, the address labeling rules and the address database must be maintained and updated regularly, which consumes substantial labor cost. In addition, since this approach depends heavily on manual operation, the labeling cost is generally proportional to the scale of the address data set; large-scale labeling therefore requires a large investment of manpower and material resources and a long labeling period, so efficiency is low and accuracy cannot be guaranteed.
Therefore, how to label address data accurately and efficiently is an urgent problem to be solved.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure provides an address data labeling method, an address data labeling apparatus, an electronic device, and a computer-readable storage medium, which at least to some extent overcome the high labor cost and low accuracy of existing address data labeling.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to one aspect of the present disclosure, there is provided an address data labeling method, including: acquiring an address labeling model, where the address labeling model is pre-trained based on unlabeled first sample data and labeled second sample data; splitting an address to be labeled into a plurality of characters so as to convert it into a character sequence to be labeled, formed by arranging those characters; processing the character sequence to be labeled with the address labeling model to obtain a labeled data sequence; and determining the labeling result of the address to be labeled from the labeled data sequence.
In an exemplary embodiment of the present disclosure, the address labeling model is trained by: acquiring first sample data, second sample data, and address category labels of the second sample data; pre-training a machine learning model with the first sample data to generate an intermediate model; constructing an initial address labeling model based on at least a portion of the intermediate model; and training the address labeling model using the second sample data and its address category labels.
In an exemplary embodiment of the present disclosure, the acquiring of the first sample data and the second sample data includes: acquiring initial sample data and performing standardization processing on it; performing stratified sampling on the standardized initial sample data and updating the order of the sampled data; and dividing the initial sample data into the first sample data and the second sample data according to a preset ratio.
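The acquisition steps above (standardize, stratified-sample, reshuffle, split by a preset ratio) can be sketched as follows; the grouping key and the 9:1 split ratio are illustrative assumptions, not values fixed by the disclosure:

```python
import random
from collections import defaultdict

def stratified_split(addresses, key_fn, unlabeled_ratio=0.9, seed=42):
    """Group addresses into strata by key_fn (e.g. a province prefix),
    shuffle within each stratum, then split every stratum by the preset
    ratio into unlabeled first sample data and to-be-labeled second
    sample data."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for addr in addresses:
        strata[key_fn(addr)].append(addr)
    first, second = [], []
    for group in strata.values():
        rng.shuffle(group)
        cut = int(len(group) * unlabeled_ratio)
        first.extend(group[:cut])
        second.extend(group[cut:])
    # Final reshuffle so the output order carries no stratum structure
    rng.shuffle(first)
    rng.shuffle(second)
    return first, second
```

Only the smaller `second` set would then be labeled manually, in line with the semi-supervised training scheme described above.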
In an exemplary embodiment of the present disclosure, after dividing the initial sample data into the first sample data and the second sample data, the address category labels of the second sample data are acquired by: labeling each character in the second sample data with a preset labeling method to obtain the address category label of each character; and updating the address category label of each character in the second sample data according to a verification result for those labels.
In an exemplary embodiment of the disclosure, pre-training a machine learning model with the first sample data to generate an intermediate model includes: determining a plurality of groups of sample address pairs from the first sample data and converting each group of sample address pairs into a sample address sequence; inputting the sample address sequence into the machine learning model to obtain result values of preset subtasks; and updating the parameters of the machine learning model according to a first loss function to obtain the intermediate model, where the first loss function includes the error between the result value of each subtask and the label value of that subtask. The subtasks include any one or a combination of the following: judging whether the province characters of the two addresses in a sample address pair are the same; judging whether the city characters of the two addresses are the same; and judging whether the district characters of the two addresses are the same.
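A minimal sketch of how the sample address pairs and their subtask labels might be constructed. The dict fields (`province`, `city`, `district`, `text`) and the BERT-style `[CLS]`/`[SEP]` markers are assumptions for illustration; the disclosure only requires that the three same/different comparisons be made:

```python
def make_pair_labels(addr_a, addr_b):
    """Binary subtask label values for one sample address pair."""
    return {
        "same_province": int(addr_a["province"] == addr_b["province"]),
        "same_city": int(addr_a["city"] == addr_b["city"]),
        "same_district": int(addr_a["district"] == addr_b["district"]),
    }

def make_pair_sequence(addr_a, addr_b, cls="[CLS]", sep="[SEP]"):
    """Concatenate the two addresses of a pair into one character
    sequence for the pre-training model."""
    return [cls] + list(addr_a["text"]) + [sep] + list(addr_b["text"]) + [sep]
```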
In an exemplary embodiment of the disclosure, after converting each group of sample address pairs into a sample address sequence, the method further includes: performing randomized substitution on one or more characters in the sample address sequence. The subtasks then further include: predicting the original character corresponding to each substituted character in the sample address sequence.
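The randomized substitution step can be sketched as below; the substitution rate and replacement vocabulary are assumptions, and in actual pre-training the model would be asked to predict the returned `(position, original character)` targets:

```python
import random

def randomize_characters(chars, vocab, mask_prob=0.15, seed=None):
    """Replace roughly mask_prob of the positions with a random
    character drawn from vocab, and return the corrupted sequence
    together with (position, original character) prediction targets."""
    rng = random.Random(seed)
    corrupted = list(chars)
    targets = []
    for i, ch in enumerate(chars):
        if rng.random() < mask_prob:
            corrupted[i] = rng.choice(vocab)
            targets.append((i, ch))
    return corrupted, targets
```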
In an exemplary embodiment of the present disclosure, training the address labeling model using the second sample data and its address category labels includes: constructing a second loss function based on normalization over labeling paths; and inputting the second sample data into the address labeling model and updating the model parameters according to the second loss function, which measures the error between the output labeling path and the address category labels of the second sample data, so as to train the address labeling model.
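A loss "based on normalization of the labeling path" is consistent with a linear-chain CRF negative log-likelihood, where the score of the gold labeling path is normalized by the log-partition over all possible paths; the disclosure does not name CRF explicitly, so the following pure-Python sketch is one interpretation:

```python
import math

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def crf_neg_log_likelihood(emissions, transitions, tags):
    """-(score(gold path) - log Z) for a linear-chain CRF.
    emissions: per-step lists of per-tag scores;
    transitions[i][j]: score of moving from tag i to tag j;
    tags: the gold labeling path."""
    n_tags = len(emissions[0])
    # Score of the gold labeling path
    score = emissions[0][tags[0]]
    for t in range(1, len(emissions)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    # Forward algorithm for the partition function log Z
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            log_sum_exp([alpha[i] + transitions[i][j] for i in range(n_tags)])
            + emissions[t][j]
            for j in range(n_tags)
        ]
    log_z = log_sum_exp(alpha)
    return log_z - score
```

Summing `exp(-loss)` over every possible path yields 1, which is exactly the path-normalization property the loss relies on.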
In an exemplary embodiment of the present disclosure, after the address category labels of the second sample data are acquired, the method includes: acquiring the quantity of second sample data under each address category and calculating the sample proportion of each address category; and, if the sample proportion of at least one address category is lower than a preset threshold, adjusting at least a part of the second sample data so that the sample proportion of each address category in the adjusted second sample data meets a preset condition.
In an exemplary embodiment of the present disclosure, adjusting at least a part of the second sample data includes: deleting a part of the data from the second sample data of any address category whose sample proportion is higher than the preset threshold; and/or constructing new sample data based on the second sample data of any address category whose sample proportion is lower than the preset threshold and adding the new sample data to the second sample data.
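The proportion check and adjustment above can be sketched as follows; simple duplication of existing items stands in for "constructing new sample data", and the 10% threshold is an illustrative assumption:

```python
import random
from collections import Counter

def rebalance(samples, labels, min_ratio=0.1, seed=0):
    """Ensure every address category reaches min_ratio of the original
    data size by duplicating under-represented items (a crude stand-in
    for constructing new sample data)."""
    rng = random.Random(seed)
    data = list(zip(samples, labels))
    counts = Counter(labels)
    target = int(min_ratio * len(data))
    for cat, n in counts.items():
        pool = [d for d in data if d[1] == cat]
        while n < target:
            data.append(rng.choice(pool))
            n += 1
    rng.shuffle(data)
    return data
```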
In an exemplary embodiment of the present disclosure, splitting the address to be labeled into a plurality of characters includes: acquiring the address to be labeled and performing text cleaning on it; and splitting the cleaned address into single characters and converting each character into a numerical index. Processing the character sequence to be labeled with the address labeling model to obtain a labeled data sequence includes: processing the numerical indexes with the address labeling model to obtain a labeled data sequence formed by the tag index of each character. Determining the labeling result of the address to be labeled from the labeled data sequence includes: looking up the tag indexes in a preset tag dictionary to determine the labeling result of the address to be labeled.
According to an aspect of the present disclosure, there is provided an address data labeling apparatus including: the model acquisition module is used for acquiring an address labeling model, and the address labeling model is obtained by pre-training based on unlabeled first sample data and labeled second sample data; the character splitting module is used for splitting the address to be marked into a plurality of characters so as to convert the address to be marked into a character sequence to be marked, wherein the character sequence to be marked is formed by arranging the characters; the sequence processing module is used for processing the character sequence to be labeled by adopting the address labeling model to obtain a labeled data sequence; and the result determining module is used for determining the labeling result of the address to be labeled according to the labeling data sequence.
In an exemplary embodiment of the present disclosure, the address labeling model is trained by: the data acquisition unit is used for acquiring first sample data, second sample data and an address category label of the second sample data; an intermediate model generation unit, configured to pre-train a machine learning model using the first sample data, and generate an intermediate model; an initial model building unit, configured to build an initial address tagging model based on at least a part of the intermediate model; and the model training unit is used for training and obtaining the address labeling model by utilizing the second sample data and the address category label of the second sample data.
In an exemplary embodiment of the present disclosure, the data acquisition unit includes: a standardization processing subunit, used for acquiring initial sample data and performing standardization processing on it; a stratified sampling subunit, used for performing stratified sampling on the standardized initial sample data and updating the order of the sampled data; and a data dividing subunit, used for dividing the initial sample data into the first sample data and the second sample data according to a preset ratio.
In an exemplary embodiment of the present disclosure, after dividing initial sample data into the first sample data and second sample data, an address category tag of the second sample data is acquired by: the character labeling subunit is configured to label each character in the second sample data by using a preset labeling method to obtain an address category label of each character; and the label updating subunit is used for updating the address category label of each character in the second sample data according to the verification result aiming at the address category label.
In an exemplary embodiment of the present disclosure, the intermediate model generating unit includes: a sequence conversion subunit, configured to determine multiple groups of sample address pairs from the first sample data, and convert each group of sample address pairs into a sample address sequence; the sequence input subunit is used for inputting the sample address sequence into a machine learning model so as to obtain a result value of a preset subtask; a parameter updating subunit, configured to update parameters of the machine learning model according to a first loss function to obtain the intermediate model, where the first loss function includes an error between a result value of the subtask and a tag value of the subtask; wherein the subtasks include any one or a combination of more than one of: judging whether the province characters of the two addresses in the sample address pair are the same; judging whether the city characters of the two addresses in the sample address pair are the same; and judging whether the zone characters of the two addresses in the sample address pair are the same.
In an exemplary embodiment of the present disclosure, after converting each group of sample address pairs into a sample address sequence, the address data labeling apparatus further includes: the replacement processing module is used for carrying out randomized replacement processing on one or more characters in the sample address sequence; the subtasks further include: and predicting the original character corresponding to the character subjected to the randomized substitution in the sample address sequence.
In an exemplary embodiment of the present disclosure, the model training unit includes: the function construction subunit is used for constructing a second loss function based on the normalization of the labeling path; and the model training subunit is used for inputting the second sample data into the address labeling model, and updating the parameters of the address labeling model according to the error between the output labeling path and the address category label of the second sample data and the second loss function so as to train and obtain the address labeling model.
In an exemplary embodiment of the present disclosure, after acquiring the address category tag of the second sample data, the address data labeling apparatus includes: the sample proportion calculation module is used for acquiring the quantity of the second sample data under each address category and calculating the sample proportion of each address category; and the data adjusting module is used for adjusting at least part of data in the second sample data if the sample proportion of at least one address category is lower than a preset threshold value, so that the sample proportion of each address category in the adjusted second sample data meets the preset condition.
In an exemplary embodiment of the present disclosure, the data adjusting module includes: the data deleting unit is used for deleting a part of data from second sample data of the address category with the sample proportion higher than the preset threshold; and/or the sample construction unit is used for constructing new sample data based on second sample data of the address category of which the sample proportion is lower than the preset threshold value, and adding the new sample data into the second sample data.
In an exemplary embodiment of the present disclosure, the character splitting module includes: an address acquisition unit, used for acquiring an address to be labeled and performing text cleaning on it; and a character splitting unit, used for splitting the cleaned address into single characters and converting each character into a numerical index. The sequence processing module includes: an index processing unit, used for processing the numerical indexes with the address labeling model to obtain a labeled data sequence formed by the tag index of each character. The result determination module includes: an index lookup unit, used for looking up the tag indexes in a preset tag dictionary to determine the labeling result of the address to be labeled.
According to an aspect of the present disclosure, there is provided an electronic device including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the method of any one of the above via execution of the executable instructions.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any one of the above.
Exemplary embodiments of the present disclosure have the following advantageous effects:
An address labeling model is acquired, the model being pre-trained based on unlabeled first sample data and labeled second sample data; the address to be labeled is split into a plurality of characters and thereby converted into a character sequence to be labeled, formed by arranging those characters; the character sequence is processed with the address labeling model to obtain a labeled data sequence; and the labeling result of the address to be labeled is determined from the labeled data sequence. On the one hand, this exemplary embodiment obtains the labeling result by building an address labeling model to process the character sequence of the address to be labeled; it depends far less on manual operation, the labeling process is highly automated, operation is simple and fast, and accuracy is higher. On the other hand, label annotation is performed only on part of the sample data during model training, after which the trained address labeling model can label large-scale address data; this shortens the labeling period, reduces human resource cost, and gives the method wide applicability.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.
Fig. 1 schematically shows a flowchart of an address data labeling method in the present exemplary embodiment;
FIG. 2 schematically illustrates a sub-flow diagram of an address data annotation method in the present exemplary embodiment;
FIG. 3 schematically illustrates a sub-flow diagram of another address data labeling method in the present exemplary embodiment;
FIG. 4 is a sub-flowchart schematically illustrating yet another address data labeling method in the present exemplary embodiment;
FIG. 5 schematically illustrates a flow chart for training an intermediate model in the present exemplary embodiment;
FIG. 6 is a diagram schematically illustrating an architecture of an address labeling model in the present exemplary embodiment;
FIG. 7 schematically illustrates a flow chart of migrating a training intermediate model in the present exemplary embodiment;
fig. 8 schematically shows a flowchart of processing data in the present exemplary embodiment;
FIG. 9 is a flowchart schematically showing another address data labeling method in the present exemplary embodiment;
fig. 10 is a block diagram schematically showing the configuration of an address data labeling apparatus in the present exemplary embodiment;
fig. 11 schematically illustrates an electronic device for implementing the above-described method in the present exemplary embodiment;
fig. 12 schematically illustrates a computer-readable storage medium for implementing the above-described method in the present exemplary embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
An exemplary embodiment of the present disclosure first provides an address data labeling method. Application scenarios of this embodiment may include the following: in an e-commerce platform, a large number of user delivery addresses are obtained and labeled through this exemplary embodiment, and the relation between region and shopping habits is analyzed from the labeling results so as to analyze user behavior; or, when training data for machine learning is obtained, the training data is labeled so that training labels are determined from the labeling results, and a machine learning model is trained on the training data and its corresponding labels.
The exemplary embodiment is further described with reference to fig. 1, and as shown in fig. 1, the address data labeling method may include the following steps S110 to S140:
step S110, an address labeling model is obtained, and the address labeling model is obtained based on the first sample data without the label and the second sample data with the label through pre-training.
The address labeling model is a model that processes an address to be labeled to obtain its labeling result, and may be a pre-trained model. Sample data refers to the training data used for model training; the present exemplary embodiment involves two kinds, the unlabeled first sample data and the labeled second sample data. During training, a machine learning model may first be pre-trained on the unlabeled first sample data to generate an intermediate model; the intermediate model is then trained on the labeled second sample data and its corresponding labels, and the model parameters are adjusted to obtain the final address labeling model. Because the address labeling model can be determined from a large amount of unlabeled first sample data and a small amount of labeled second sample data, the resource cost of manual labeling can be greatly reduced compared with a machine learning model trained on a large amount of labeled training data, and the model's dependence on manually labeled data is lowered.
Step S120, splitting the address to be labeled into a plurality of characters, so as to convert the address to be labeled into a character sequence to be labeled, which is formed by arranging a plurality of characters.
The address to be labeled refers to data requiring address labeling, such as a home address, office address, school address, hospital address, organization address, or the address of each link in a goods circulation supply chain. Address data records the geographic spatial information corresponding to human behavior or socio-economic activity and is a very valuable data resource in the big data era: spatial analysis based on address data can provide a quantitative decision basis for urban management and commercial operation. Address data is also complex and particular. For example, address naming lacks a uniform specification; addresses may be written with abbreviations, descriptive information, variant characters, and simplified or traditional forms; and Chinese addresses, unlike English ones, are not divided into geographic units by punctuation marks. Before processing the address to be labeled, the present exemplary embodiment may therefore split it into characters. For example, the address data "Hongqiao Town, Leqing City, Wenzhou City, Zhejiang Province" may be split character by character into "zhe/jiang/province/wen/zhou/city/le/qing/city/hong/qiao/town", and these characters are then converted into a character sequence to be labeled so that the address labeling model can process them.
Step S130, processing the character sequence to be labeled with the address labeling model to obtain a labeled data sequence.
Step S140, determining the labeling result of the address to be labeled from the labeled data sequence.
The character sequence converted from the address to be labeled is taken as the input of the address labeling model, and the model processes it to obtain the labeling result of the address. The labeling result is data that reflects the attributes of the data being labeled. It may be the direct output of the machine learning model, such as a numerical or alphabetical result that is then converted into the final labeling result, or it may itself be the final labeling result. For example, for the address data "Nanshan District, Shenzhen City, Guangdong", the result may be the numerical values "1", "2", "3", which are converted into the final labels "province", "city", "district"; the result may instead be segment-level labels such as "Guangdong: province", "Shenzhen: city", "Nanshan: district", or character-level labels in which every character of "Guangdong" is labeled "province", every character of "Shenzhen" is labeled "city", and every character of "Nanshan" is labeled "district". The present disclosure does not specifically limit the form of the labeling result.
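The index-to-label conversion described above can be sketched as follows; the tag dictionary contents are illustrative assumptions, since the disclosure does not fix the indices:

```python
# Hypothetical tag dictionary mapping model output indices to labels.
TAG_DICT = {1: "province", 2: "city", 3: "district"}

def decode_labels(address, tag_indices, tag_dict=TAG_DICT):
    """Pair each character of the address with the label found by
    looking up its predicted tag index in the tag dictionary."""
    return [(ch, tag_dict[i]) for ch, i in zip(address, tag_indices)]
```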
In an exemplary embodiment, in step S120, splitting the address to be labeled into a plurality of characters may include the following steps:
step S210, acquiring an address to be labeled, and performing text cleaning processing on the address to be labeled;
step S220, splitting a single character of the address to be marked after the text cleaning processing, and converting each split character into a numerical index;
step S130 may include:
step S230, processing the digital index by adopting an address labeling model to obtain a labeled data sequence formed by label indexes of each character;
step S140 may include:
step S240, according to the tag index, searching in a preset tag dictionary to determine the labeling result of the address to be labeled.
In this exemplary embodiment, the address to be labeled may be obtained from a database stored on the terminal device, or determined from address data entered by a user in real time so that it can be labeled immediately; the present disclosure does not specifically limit this. Considering the great variation in the format and form of acquired addresses, text cleaning may be performed after an address to be labeled is acquired, including but not limited to: removing address entries that are too short or too long; converting full-width characters to half-width; and removing spaces, tab characters, quotation marks, brackets, and other Chinese punctuation marks from the address text. The cleaned address is then split into single characters, and each character is converted into a numerical index so that the address labeling model can process it and produce a result for each index; finally, the label corresponding to each numerical index is determined from a preset tag dictionary, which gives the labeling result for each character of the address to be labeled.
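A sketch of the cleaning and indexing steps, assuming NFKC normalization for the full-width-to-half-width conversion; the punctuation set and the length bounds are illustrative assumptions:

```python
import unicodedata

# Illustrative punctuation set; the disclosure lists spaces, tabs,
# quotation marks, and brackets as examples of characters to remove.
PUNCT = set(" \t\"'（）()[]【】，。、；：,.;:")

def clean_address(text, min_len=4, max_len=64):
    """Text cleaning: full-width to half-width via NFKC, punctuation
    removal, and rejection of too-short or too-long entries."""
    text = unicodedata.normalize("NFKC", text)
    text = "".join(ch for ch in text if ch not in PUNCT)
    if not (min_len <= len(text) <= max_len):
        return None
    return text

def to_indices(text, char_vocab, unk=0):
    """Split the cleaned address into single characters and map each
    one to its numerical index (unknown characters map to unk)."""
    return [char_vocab.get(ch, unk) for ch in text]
```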
Based on the above description, in the present exemplary embodiment, an address labeling model is obtained, where the address labeling model is pre-trained on the basis of unlabeled first sample data and labeled second sample data; the address to be labeled is split into a plurality of characters so as to convert it into a character sequence to be labeled, formed by arranging those characters; the character sequence to be labeled is processed by the address labeling model to obtain a labeled data sequence; and the labeling result of the address to be labeled is determined according to the labeled data sequence. On the one hand, this exemplary embodiment obtains the labeling result by building an address labeling model to process the character sequence of the address to be labeled, relying less on manual operation to label address data: the labeling process is more intelligent, the operation is simple and fast, and the accuracy is higher. On the other hand, in this exemplary embodiment, labels are attached to only part of the sample data during model training, and the resulting address labeling model can then label large-scale address data, which shortens the labeling cycle, reduces human resource cost, and gives the method wide applicability.
In an exemplary embodiment, as shown in fig. 3, in step S110, the address labeling model may be trained by:
step S310, acquiring first sample data, second sample data and an address category label of the second sample data;
step S320, pre-training a machine learning model by utilizing first sample data to generate an intermediate model;
step S330, constructing an initial address labeling model based on at least one part of the intermediate model;
step S340, training and obtaining an address labeling model by using the second sample data and the address category label of the second sample data.
The difference between the two is that the first sample data is unlabeled sample data while the second sample data is labeled sample data. The present exemplary embodiment may construct a robust deep learning model from the first and second sample data to reduce the dependence on annotated data. Specifically, the machine learning model can be pre-trained on the first sample data in an unsupervised learning manner to generate an intermediate model, and the intermediate model can then undergo transfer learning with the labeled second sample data to obtain the final address labeling model. An address labeling model constructed in this way does not need regular manual updating and maintenance, reducing the manpower and material investment consumed by manually labeling a large amount of data. In addition, to ensure the accuracy of model training, the exemplary embodiment may acquire diverse and comprehensive sample data from multiple platforms; for example, the full set of shipping addresses from the most recent quarter in the order databases of multiple e-commerce platforms may be acquired as sample data.
In an exemplary embodiment, as shown in fig. 4, the step S310 of acquiring the first sample data and the second sample data may include the following steps:
step S410, obtaining initial sample data and carrying out standardization processing on the initial sample data;
step S420, performing stratified sampling on the initial sample data after the standardization processing, and updating the order of the initial sample data after the stratified sampling;
step S430, according to a preset ratio, dividing the initial sample data into first sample data and second sample data.
The initial sample data refers to unprocessed sample data used for model training. After the initial sample data is acquired, in order to improve the accuracy and convenience of machine learning model training, it may be subjected to standardization processing, which may specifically include but is not limited to: removing data that is too short or too long; character conversion, for example converting full-width characters into half-width characters; special-character processing, such as eliminating spaces, tabs, quotation marks, brackets of various kinds, and other punctuation marks in the data; and deduplication. In addition, the randomness of data acquisition may cause data imbalance; for example, 100 acquired address entries for Beijing might contain 60 entries for one district and only 20 each for two others. To keep the sample data unbiased, this exemplary embodiment may further apply stratified sampling to the standardized initial sample data; specifically, the acquired address data may be sampled in strata by province, city, and district/county, so that, continuing the example, the number of address entries per district is kept evenly distributed. The order of the stratified-sampled initial sample data is then updated, for example by randomly shuffling it. Finally, the initial sample data is divided, according to a preset ratio, into unlabeled first sample data and second sample data to be labeled. The preset ratio can be customized as needed, and the present disclosure does not specifically limit it.
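The sampling, shuffling, and splitting steps can be sketched as follows. Grouping by a (province, city, district) key, the per-group cap, and the 90/10 split ratio are illustrative assumptions; the patent leaves the ratio user-defined:

```python
import random
from collections import defaultdict

def stratified_split(records, key_fn, per_group_cap, unlabeled_ratio=0.9, seed=42):
    """Stratify records by key_fn, cap each stratum, shuffle, and split into
    (first_sample_data, second_sample_data) by a preset ratio."""
    groups = defaultdict(list)
    for rec in records:
        groups[key_fn(rec)].append(rec)
    rng = random.Random(seed)
    sampled = []
    for recs in groups.values():
        rng.shuffle(recs)
        sampled.extend(recs[:per_group_cap])  # cap each stratum to even out the distribution
    rng.shuffle(sampled)                      # update (randomize) the overall order
    cut = int(len(sampled) * unlabeled_ratio)
    return sampled[:cut], sampled[cut:]
```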
In an exemplary embodiment, after the above step S430, the address category tag of the second sample data may be acquired by:
labeling each character in the second sample data by adopting a preset labeling method to obtain an address category label of each character;
and updating the address category label of each character in the second sample data according to the verification result aiming at the address category label.
In order for the labeling results of address sequences to comprehensively reflect the administrative hierarchy, geographic entities, and other meaningful text information in address data, and thus to generalize across multi-service scenarios, this exemplary embodiment analyzes the attributes and characteristics of address data and summarizes and abstracts the categories appearing in it, such as the 15 label categories shown in Table 1, defining the content of each label category. Note that the label categories of the present disclosure are not limited to the 15 categories in Table 1.
TABLE 1 Category tag for address data and definition thereof
Province (province); City (city); County (district/county); Town (town); Village (village/community); Road (road); RoadNum (road number); RoadAux (road auxiliary point); Residence (residence/house); Build (building); School (school); Organization (organization); BuildNum (building number); Unit (unit); Location (direction/position). (The full table with per-category definitions appears as an image in the original publication.)
In the present exemplary embodiment, each character in the second sample data may be labeled by a preset labeling method: for example, the second sample data is imported into a text labeling tool (e.g., doccano), and each character is labeled using the BIO (Begin, Inside, Outside) sequence labeling scheme. To ensure the accuracy of the address category labels, this exemplary embodiment may further re-verify and correct the labeled results through manual verification, and update the address category label of each character in the second sample data according to the verification result, so as to ensure the accuracy and consistency of each character's address category label.
Labeling the address data with the BIO scheme means tagging each character in the address data as "B-X", "I-X", or "O". Here "B-X" indicates that the character is in a segment of type X and at the beginning of that segment; "I-X" indicates that the character is in a segment of type X but not at its beginning; and "O" indicates that the character belongs to no type. All possible address category labels may thus include: "B-Province", "I-Province", "B-City", "I-City", "B-County", "I-County", "B-Town", "I-Town", "B-Village", "I-Village", "B-Road", "I-Road", "B-RoadNum", "I-RoadNum", "B-RoadAux", "I-RoadAux", "B-Residence", "I-Residence", "B-Build", "I-Build", "B-School", "I-School", "B-Organization", "I-Organization", "B-BuildNum", "I-BuildNum", "B-Unit", "I-Unit", "B-Location", "I-Location", and "O", corresponding to the address category labels in Table 1. The specific label format is shown in Table 2:
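The BIO conversion can be illustrated with a small helper that turns an address plus labeled entity spans into per-character tags. The span format is a hypothetical stand-in for whatever a tool such as doccano exports:

```python
def spans_to_bio(text, spans):
    """spans: list of (start, end_exclusive, category) tuples over `text`.
    Returns one BIO tag per character: B-<cat> at a span start, I-<cat> inside,
    O outside any span."""
    tags = ["O"] * len(text)
    for start, end, cat in spans:
        tags[start] = f"B-{cat}"
        for i in range(start + 1, end):
            tags[i] = f"I-{cat}"
    return tags
```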
TABLE 2 annotation result form based on BIO sequence annotation
(The table content appears as an image in the original publication.)
In an exemplary embodiment, the step S320 may include the following steps:
determining a plurality of groups of sample address pairs from the first sample data, and converting each group of sample address pairs into a sample address sequence;
inputting the sample address sequence into a machine learning model to obtain a preset result value of the subtask;
updating parameters of the machine learning model according to a first loss function to obtain an intermediate model, wherein the first loss function comprises an error between a result value of the subtask and a tag value of the subtask;
wherein, the subtasks can include any one or combination of more of the following:
judging whether the province characters of two addresses in the sample address pair are the same;
judging whether the city characters of the two addresses in the sample address pair are the same;
and judging whether the zone characters of the two addresses in the sample address pair are the same.
When the machine learning model is pre-trained on the unlabeled first sample data, the first sample data required for model construction can be pre-processed and divided proportionally into a training set, a verification set, and a test set. For each address in the training set, an address is randomly selected from the remaining addresses with probability p, or the address itself is selected with probability 1-p, to construct a sample address pair. In other words, when a certain address is taken from the first sample data as one member of a sample address pair, another address is chosen from the other addresses with probability p, while with probability 1-p the address is paired with itself. The probability p may be set by the user as needed, for example p = 50%, which this disclosure does not specifically limit. The sample address sequence is then input into the machine learning model to obtain the result values of the preset subtasks. In this exemplary embodiment, any one or combination of the three subtasks may be set: the characters of the province, city, and district/county of the two addresses in the sample address pair are compared to determine whether the provinces of the two sample addresses are the same, whether the cities are the same, and whether the districts/counties are the same.
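The pair-construction step with its three binary subtask labels can be sketched as below. The `parse_fn` that extracts (province, city, district) from an unlabeled address is a hypothetical stand-in; the patent does not specify how those fields are obtained for the unlabeled corpus:

```python
import random

def build_pairs(addresses, parse_fn, p=0.5, seed=0):
    """parse_fn(addr) -> (province, city, district). With probability p, pair an
    address with a different random address; otherwise with itself. Each pair
    carries three 0/1 labels: same province / same city / same district."""
    rng = random.Random(seed)
    pairs = []
    for i, addr in enumerate(addresses):
        if rng.random() < p and len(addresses) > 1:
            j = rng.choice([k for k in range(len(addresses)) if k != i])
            other = addresses[j]
        else:
            other = addr
        pa, pb = parse_fn(addr), parse_fn(other)
        labels = tuple(int(a == b) for a, b in zip(pa, pb))
        pairs.append((addr, other, labels))
    return pairs
```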
Unlike ordinary text data, address data can contain the spatial topological relations of geographic regions, so its semantic expression carries obvious administrative levels and membership between those levels. For example, in address 1, "Hongnan Road, Hongqiao Town, Yueqing City, Wenzhou City, Zhejiang Province", the elements "Zhejiang", "Wenzhou City", and "Yueqing City" carry administrative levels and level-to-level subordination, which are important geographic information contained in the address data; meanwhile, the "Hangzhou" in address 2 ("Yangshan Township, Hangzhou") and the "Wenzhou" in address 1 are in a parallel, same-level relationship. Thus, the present exemplary embodiment proposes to encode such external knowledge and administrative-level membership into the semantic representation of the text by adding to the pre-trained language model the subtask constraints of "whether the province characters of two addresses in a sample address pair are the same", "whether the city characters of two addresses in a sample address pair are the same", and "whether the district characters of two addresses in a sample address pair are the same".
Specifically, when the model training data is preprocessed to construct the sample address pairs, identification bits "[CLS1]", "[CLS2]", and "[CLS3]" are reserved for the three subtasks of "whether the province characters of two addresses in the sample address pair are the same", "whether the city characters are the same", and "whether the district characters are the same", respectively. After the machine learning model (a BERT model may be used in the present exemplary embodiment) extracts semantic information layer by layer, the multidimensional vectors corresponding to "[CLS1]", "[CLS2]", and "[CLS3]" are extracted from the vector matrix output by the uppermost layer and each is fed into a binary classification model; the loss values loss1, loss2, and loss3 are calculated, and the model parameters are updated accordingly to obtain the intermediate model.
It should be noted that, to facilitate processing of the sample address pairs by the machine learning model, the present exemplary embodiment may first perform data processing on the first sample data and convert each group of sample address pairs into a sample address sequence. Specifically, each sample address pair obtained above may be split into single characters, and "[START]" and "[END]" marks added at the beginning and end to indicate where an address sequence starts and ends; a corpus dictionary is constructed, and the lengths of the two sample addresses are aligned to the maximum length set by the parameters. If the total length of the two sample addresses exceeds the set maximum length, the longer sample address is selected and non-"[START]"/"[END]" characters are deleted from its head or tail, chosen at random in turn, until the maximum-length condition is met; if the total length of the two address texts is less than the set maximum length, special characters "[PAD]" are appended at the tail. A sample address sequence to be processed by the machine learning model is constructed in this way.
In an exemplary embodiment, after converting each group of sample address pairs into a sample address sequence, the address data labeling method may further include:
performing randomized substitution processing on one or more characters in the sample address sequence;
the above subtasks may further include:
and predicting the original character corresponding to the character subjected to the randomized substitution in the sample address sequence.
In the present exemplary embodiment, in addition to the three subtasks described above, a "masked character prediction" subtask may be used to pre-train the machine learning model. Specifically, randomized substitution may be performed on one or more characters in the sample address sequence, including but not limited to randomly masking, replacing, or deleting a certain proportion of characters: of the characters selected for substitution, 80% may be replaced by "[MASK]", 10% kept as the original character, and 10% replaced by a character drawn at random from the corpus dictionary. Through the "masked character prediction" subtask, the original character corresponding to each substituted character in the sample address sequence is predicted; the loss value of this subtask is recorded as loss4, and the parameters of the machine learning model are updated according to the first loss function to obtain the intermediate model.
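A hedged sketch of the corruption step, following the BERT-style 80/10/10 scheme the text describes; the selection rate and vocabulary here are illustrative assumptions:

```python
import random

def mask_sequence(chars, vocab, mask_rate=0.15, seed=0):
    """Select ~mask_rate of positions; of those, 80% become '[MASK]', 10% keep
    the original character, 10% get a random vocabulary character. Returns the
    corrupted sequence and a dict of position -> original char to predict."""
    rng = random.Random(seed)
    corrupted, targets = list(chars), {}
    for i, ch in enumerate(chars):
        if rng.random() >= mask_rate:
            continue
        targets[i] = ch                        # original character to be predicted
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"
        elif roll < 0.9:
            pass                               # keep original character
        else:
            corrupted[i] = rng.choice(vocab)   # random replacement
    return corrupted, targets
```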
It should be noted that, in this exemplary embodiment, the pre-training language model may be a model including multi-subtask constraints, a total first loss function of the pre-training language model may be obtained by weighted summation of the first loss functions of each subtask, and a weight of the first loss function may be set by a user according to needs, which is not specifically limited in this disclosure.
In the present exemplary embodiment, as shown in fig. 5, the training process of the intermediate model may include the following steps:
step S510, acquiring first sample data;
step S520, determining a plurality of groups of sample address pairs from the first sample data, and converting each group of sample address pairs into a sample address sequence;
step S530, constructing a pre-training intermediate model, designing a subtask, and defining a first loss function;
step S540, training a machine learning model based on the first sample data, and calculating a loss value of a first loss function in a forward propagation process;
step S550, judging whether the loss value is lower than a first preset threshold value;
step S560, if the value is lower than the first preset threshold value, ending the model training to obtain an intermediate model after the pre-training is finished;
if the value is higher than the first preset threshold value, executing the step S570, performing a back propagation process, calculating the parameter updating gradient of each layer, and updating the weight value;
and returning to execute step S540, iterating multiple times until the model converges and the loss value falls below the first preset threshold, so as to obtain the pre-trained intermediate model and its optimal parameters.
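The iterate-until-threshold loop of Fig. 5 can be sketched framework-agnostically. `model.loss(batch)` and `model.step()` are hypothetical stand-ins for the forward pass (the weighted sum of the subtask losses) and the back-propagation update:

```python
def pretrain(model, batches, threshold, max_iters=1000):
    """Train until the mean loss falls below `threshold` (Fig. 5, S550/S560)
    or max_iters is reached. Returns (iterations_run, final_loss)."""
    loss = float("inf")
    for it in range(max_iters):
        loss = sum(model.loss(b) for b in batches) / len(batches)  # forward pass
        if loss < threshold:
            return it, loss        # converged: pre-trained intermediate model ready
        model.step()               # back-propagate gradients, update weights
    return max_iters, loss
```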
In an exemplary embodiment, the step S340 may include the following steps:
constructing a second loss function based on the normalization of the labeling path;
and inputting the second sample data into the address labeling model, and updating parameters of the address labeling model according to the error between the output labeling path and the address category label of the second sample data and the second loss function so as to train and obtain the address labeling model.
The present exemplary embodiment may transfer, in a fine-tuning manner, the language features, semantic patterns, membership between administrative levels, and other information about address data learned by the intermediate model pre-trained on the large-scale unlabeled first sample data, so as to reduce the dependence of the address labeling model on large-scale labeled sample data.
When performing migration training of the address labeling model, the second sample data (address data) and its label form may be as shown in Table 2. The present exemplary embodiment may construct training data of the following form from the second sample data: [CLS1] [CLS2] [CLS3] [START] [C1] … [Cm] [END] [PAD]. Here [CLSi] (i = 1, 2, 3) denotes the identification bits reserved for the 3 subtasks proposed in this exemplary embodiment, [Cj] denotes each character after the address data is split into single characters, and m denotes the number of single characters in the address data. Each character in the address text is then converted into the integer index of the corresponding character in the corpus dictionary.
The labeling labels ["START", "B-Province", "I-Province", "B-City", "I-City", "B-County", "I-County", "B-Town", "I-Town", "B-Village", "I-Village", "B-Road", "I-Road", "B-RoadNum", "I-RoadNum", "B-RoadAux", "I-RoadAux", "B-Residence", "I-Residence", "B-Build", "I-Build", "B-School", "I-School", "B-Organization", "I-Organization", "B-BuildNum", "I-BuildNum", "B-Unit", "I-Unit", "B-Location", "I-Location", "O", "END"] are mapped to corresponding integer values, for example integers 0 to 33, and the mapping is stored as a label dictionary for converting the results of model inference back into labels. The processed second sample data is divided into a training set, a verification set, and a test set in a certain proportion, such as 70%:15%:15%. The model is trained on the training set, and the hyper-parameters are adjusted according to the model's performance on the verification set.
Fig. 6 shows an architecture diagram of the address labeling model in the present exemplary embodiment, which specifically includes the following. The input layer 610 is built on the intermediate model pre-trained on large-scale unlabeled data (for example, an address-oriented BERT-base (768)). From the vector matrix output by the uppermost layer of the pre-trained intermediate model, the vectors corresponding to "[CLS1] [CLS2] [CLS3]" are extracted and input into the first middle layer 620 (TextCNN (Text Convolutional Neural Network) (64) + Dropout) to obtain a global representation vector of the address sequence, while the vectors corresponding to "[START] [C1] … [Cm] [END] [PAD]" are extracted and input into the second middle layer 630, a bidirectional LSTM (128) (Long Short-Term Memory; a GRU, Gated Recurrent Unit, may be used instead), i.e., two LSTMs propagating in opposite directions. The global representation vector output by the TextCNN is then concatenated with the vector corresponding to each character of the address sequence, and in the third middle layer 640 (Dropout (0.5)) the output vectors of the forward and backward LSTMs are concatenated. Two fully connected layers with dropout, the fourth middle layer 650 (Dense Layer (128)) and the fifth middle layer 660 (Dense Layer (33)), then produce the score of each character in the address sequence for each address category label, i.e., the state matrix. Finally, a CRF (Conditional Random Field) layer is connected as the sixth middle layer 670 to learn the implicit constraints between characters in the address sequence; for example, the beginning of a sentence should be "B-" or "O" rather than "I-", and in the pattern "B-label1 I-label2 I-label3 …", labels 1, 2, and 3 should be the same label category.
"B-City I-City" is correct, while "B-City I-Organization" is incorrect; "O I-label" is erroneous, the beginning of the named entity should be "B-" instead of "I-", and so on. The probability score of the address class label transition of the front character and the rear character in the address sequence is represented by a state transition matrix defined in a CRF layer. The final goal of the address labeling model is to find an optimal sequence labeling path, i.e. the score of the optimal path should be the highest score among all paths.
The second loss function of the address labeling model may be defined as:

$$Loss = -\log \frac{e^{s_{realPath}}}{\sum_{p} e^{s_{p}}}, \qquad s_{p} = \sum_{i=1}^{m} x_{i,y_i} + \sum_{i=1}^{m-1} t_{y_i,\,y_{i+1}}$$

where $s_p$ is the score of each possible sequence labeling path and $s_{realPath}$ is the score of the optimal path obtained by iterative optimization of the address labeling model; $x_{i,y_i}$ is the state score in the state matrix, with $i$ the position index of the $i$-th character in the address sequence and $y_i$ the index of the address category label corresponding to the $i$-th character; and $t_{y_i,y_{i+1}}$ is the likelihood score in the transition matrix of transitioning from label $y_i$ to label $y_{i+1}$.
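The CRF path score and loss can be made concrete with a brute-force numeric sketch: a path's score is its state scores plus its transition scores, and the loss is the negative log-probability of the true path. Enumeration is only feasible for this tiny example; real CRF layers use the forward algorithm instead:

```python
import itertools, math

def path_score(states, trans, path):
    """states[i][y]: state score of label y at position i; trans[a][b]: transition
    score from label a to label b; path: tuple of label indexes."""
    s = sum(states[i][y] for i, y in enumerate(path))      # sum of x_{i, y_i}
    s += sum(trans[a][b] for a, b in zip(path, path[1:]))  # sum of t_{y_i, y_{i+1}}
    return s

def crf_loss(states, trans, real_path):
    """-log softmax of the real path's score over all possible paths."""
    n_labels = len(states[0])
    all_paths = itertools.product(range(n_labels), repeat=len(states))
    log_z = math.log(sum(math.exp(path_score(states, trans, p)) for p in all_paths))
    return log_z - path_score(states, trans, real_path)
```

Minimizing this loss pushes the true path's score toward being the highest of all paths, which is exactly the training goal stated above.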
As shown in fig. 7, a specific training process may include the following steps:
step S710, acquiring second sample data and an address category label corresponding to the second sample data;
step S720, constructing an address labeling model on the basis of the pre-trained intermediate model, and defining a second loss function;
step S730, calculating a loss value of a second loss function in the forward propagation process based on the second sample data migration training intermediate model;
step S740, determining whether the loss value is lower than a second preset threshold value;
step S750, if the value is lower than a second preset threshold value, ending model training to obtain a trained optimal address labeling model;
if the value is higher than the second preset threshold value, executing the step S760, performing a back propagation process, calculating the parameter updating gradient of each layer, and updating the weight value;
and continuing to execute step S730, iterating multiple times until the model converges and the loss value falls below the second preset threshold, so as to obtain the final address labeling model and its optimal parameters.
Finally, the performance of the resulting address labeling model is evaluated on the test set. The test results of this exemplary embodiment show that the address text sequence labeling method based on transfer learning achieves excellent performance across label types. For example, the address labeling model achieves F1 values of 90% or more on 8 categories of labels (province, city, district/county, town, road, road number, building number, and unit), F1 values of 84.95% and 83.60% on the school and residence labels, respectively, and on the remaining 5 categories: road auxiliary point: 79.67, organization: 77.62, village/community: 74.49, building: 74.28, direction/position: 74.47. The detailed performance of the model is shown in Table 3:
TABLE 3 accuracy, recall and F1 values of the Address Mark model on the test set for each Address Category tag
(The table content appears as an image in the original publication.)
In an exemplary embodiment, after the address category tag of the second sample data is acquired in step S310, the address data tagging method may include the following steps:
acquiring the quantity of second sample data under each address category, and calculating the sample proportion of each address category;
and if the sample proportion of at least one address category is lower than the preset threshold, adjusting at least a part of data in the second sample data to enable the sample proportion of each address category in the adjusted second sample data to meet the preset condition.
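The proportion check in the two steps above can be sketched directly; the flat per-entity tag list and the threshold value are illustrative assumptions:

```python
from collections import Counter

def scarce_categories(tags, threshold):
    """tags: flat list of per-entity category names in the second sample data.
    Returns the address categories whose sample proportion is below threshold,
    i.e., the trigger for the adjustment step."""
    counts = Counter(tags)
    total = sum(counts.values())
    return sorted(cat for cat, c in counts.items() if c / total < threshold)
```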
Considering the particularity of address data, for example that nearly every address contains administrative-level labels such as province, city, district/county, and village/community, the label counts in the second sample data are extremely unbalanced: province, city, and district/county labels far outnumber the other address category labels, causing the model to pay more attention to labeling those three correctly even though they are themselves relatively easy to label. Meanwhile, buildings, organizations, schools, residences, villages/communities, and the like have few samples due to the diversity of entity names, so the address labeling model is insufficiently trained on those labels. Therefore, after the address category labels of the second sample data are obtained, this exemplary embodiment may count the second sample data under each address category, calculate the sample proportion of each address category, and, if at least one address category's sample proportion is below a preset threshold, adjust at least part of the second sample data, that is, perform data imbalance processing or data enhancement processing.
It should be noted that, to avoid data distortion caused by excessive data adjustment, this exemplary embodiment may set a preset condition: after counting the number of each label category in the processed training set, data adjustment is performed selectively in the above manner so that the label counts become approximately equal, and the adjustment process ends once the preset condition is satisfied. For example, data adjustment stops when the amount of adjusted data exceeds 2 times the size of the original training set.
Specifically, in an exemplary embodiment, adjusting at least a part of data in the second sample data may include the following steps:
(1) deleting a part of data from second sample data of the address category with the sample proportion higher than a preset threshold; and/or
(2) constructing new sample data based on the second sample data of the address categories whose sample proportion is lower than the preset threshold, and adding the new sample data to the second sample data.
For example, since province, city, and district/county labels far outnumber the other address category labels in the second sample data, some second sample data under the province, city, and district/county address category labels may be deleted in manner (1), keeping the amount of second sample data balanced across the province, city, district/county, village/community, and other address category labels.
In addition, the second sample data may be adjusted in manner (2). Specifically, the entity names of the low-volume labels (such as building, organization, school, residence, and village/community) are extracted from the labeled data and deduplicated to construct an entity-name set per category; labeled data in a certain proportion is then randomly drawn from the labeled second sample data, each address in it is adjusted in a targeted way, and the result is added to the training set so that the sample amounts of the labels in the combined training set are balanced. For example, entity-name sets are constructed for the 5 categories of building, organization, school, residence, and village/community, and for each address n (0 <= n <= 5) label categories are randomly selected for data adjustment. For each selected label, one of 3 actions, "no processing", "delete the label's entity", or "randomly replace the current entity name with one from the corresponding entity-name set", is chosen at random, and the adjusted data is added to the training set.
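The per-address augmentation for a scarce category can be sketched as below. The token format (a list of (character, tag) pairs) and the single-span assumption are illustrative simplifications of the patent's procedure:

```python
import random

def augment(tokens, category, name_pool, rng):
    """tokens: list of (char, tag). For the span tagged B-/I-<category>, randomly
    keep it, delete it, or replace it with a name drawn from name_pool."""
    span = [i for i, (_, t) in enumerate(tokens) if t.endswith(category)]
    if not span:
        return tokens
    action = rng.choice(["keep", "delete", "replace"])
    if action == "keep":
        return tokens
    head, tail = tokens[:span[0]], tokens[span[-1] + 1:]
    if action == "delete":
        return head + tail
    name = rng.choice(name_pool)  # swap in a random entity name of the same category
    new = [(name[0], f"B-{category}")] + [(c, f"I-{category}") for c in name[1:]]
    return head + new + tail
```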
Fig. 8 shows a flowchart of data processing in the present exemplary embodiment, which may specifically include the following steps:
step S810, acquiring second sample data;
step S820, labeling each character in the second sample data by adopting a preset labeling method to obtain an address category label of each character;
step S830, updating the address category label of each character in the second sample data according to the verification result aiming at the address category label;
step 840, obtaining the quantity of second sample data under each address category, and calculating the sample proportion of each address category;
step S850, judging whether the sample proportion of each address type in the second sample data is balanced;
step S860, if the sample proportion of each address category in the second sample data is not balanced, adjusting at least a part of data in the second sample data to enable the sample proportion of each address category in the adjusted second sample data to meet a preset condition;
in step S870, if the sample proportion of each address category in the second sample data is balanced, the data processing flow is completed.
Fig. 9 is a flowchart illustrating another address data labeling method in the present exemplary embodiment, which may specifically include the following steps:
step S910, obtaining initial sample data and carrying out standardization processing on the initial sample data;
step S920, dividing the initial sample data into first sample data and second sample data according to a preset proportion;
step S930, pre-training a machine learning model by utilizing the first sample data to generate an intermediate model;
step S940, marking each character in the second sample data by adopting a preset marking method, and verifying to obtain an address category label of each character;
step S950, performing data adjustment on the sample data with unbalanced sample proportion, and constructing second sample data required by transfer learning;
step S960, migrating the training intermediate model by using the second sample data and the address category label of the second sample data to obtain an address labeling model;
step S970, acquiring the address to be labeled, and processing the address to be labeled by adopting the address labeling model to obtain a labeling result of the address to be labeled.
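The flow of steps S910 through S970 can be sketched as a single driver function. Every stage here is a placeholder stand-in (the default arguments are trivial stubs), not an API defined by the patent:

```python
def build_address_labeler(initial_samples, split_ratio=0.8,
                          normalize=str.strip,
                          pretrain=lambda data: {"pretrained_on": len(data)},
                          annotate=lambda data: [(s, "O") for s in data],
                          transfer_train=lambda model, labeled:
                              {**model, "finetuned_on": len(labeled)}):
    """Illustrative driver for steps S910-S970."""
    samples = [normalize(s) for s in initial_samples]   # S910: standardize
    cut = int(len(samples) * split_ratio)               # S920: split by ratio
    first, second = samples[:cut], samples[cut:]
    intermediate = pretrain(first)                      # S930: pre-train
    labeled = annotate(second)                          # S940-S950: label, verify, balance
    return transfer_train(intermediate, labeled)        # S960: transfer learning
```

With a 0.8 split ratio, five initial samples yield four pre-training samples and one labeled fine-tuning sample in this sketch.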
An exemplary embodiment of the present disclosure also provides an address data labeling apparatus. Referring to fig. 10, the apparatus 1000 may include a model obtaining module 1010, configured to obtain an address labeling model, where the address labeling model is obtained by pre-training on the basis of unlabeled first sample data and labeled second sample data; a character splitting module 1020, configured to split the address to be labeled into multiple characters, so as to convert the address to be labeled into a character sequence to be labeled, where the character sequence is formed by arranging multiple characters; a sequence processing module 1030, configured to process the character sequence to be labeled by using an address labeling model, so as to obtain a labeled data sequence; and the result determining module 1040 is configured to determine, according to the labeled data sequence, a labeling result of the address to be labeled.
In an exemplary embodiment, the address labeling model may be trained by: the data acquisition unit is used for acquiring first sample data, second sample data and an address category label of the second sample data; the intermediate model generation unit is used for pre-training the machine learning model by utilizing the first sample data to generate an intermediate model; an initial model building unit, configured to build an initial address tagging model based on at least a part of the intermediate model; and the model training unit is used for training and obtaining the address labeling model by utilizing the second sample data and the address category label of the second sample data.
In an exemplary embodiment, the data acquisition unit may include: the standardization processing subunit is used for acquiring initial sample data and carrying out standardization processing on the initial sample data; the hierarchical processing subunit is used for carrying out hierarchical sampling on the initial sample data after the standardization processing and updating the sequence of the initial sample data after the hierarchical sampling; and the data dividing subunit is used for dividing the initial sample data into first sample data and second sample data according to a preset proportion.
In an exemplary embodiment, after the initial sample data is divided into the first sample data and the second sample data, the address category tag of the second sample data may be acquired by: the character labeling subunit is used for labeling each character in the second sample data by adopting a preset labeling method to obtain an address category label of each character; and the label updating subunit is used for updating the address category label of each character in the second sample data according to the verification result aiming at the address category label.
In an exemplary embodiment, the intermediate model generating unit may include: the sequence conversion subunit is used for determining a plurality of groups of sample address pairs from the first sample data and converting each group of sample address pairs into a sample address sequence; the sequence input subunit is used for inputting the sample address sequence into the machine learning model to obtain a result value of a preset subtask; a parameter updating subunit, configured to update parameters of the machine learning model according to a first loss function to obtain an intermediate model, where the first loss function includes an error between a result value of the subtask and a tag value of the subtask; wherein the subtasks include any one or combination of more than one of the following: judging whether the province characters of two addresses in the sample address pair are the same; judging whether the city characters of the two addresses in the sample address pair are the same; and judging whether the zone characters of the two addresses in the sample address pair are the same.
In an exemplary embodiment, after converting each group of sample address pairs into a sample address sequence, the address data labeling apparatus may further include: the replacement processing module is used for carrying out randomized replacement processing on one or more characters in the sample address sequence; the subtasks also include: and predicting the original character corresponding to the character subjected to the randomized substitution in the sample address sequence.
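The pre-training subtasks above reduce to simple label generators. A sketch under stated assumptions (addresses are pre-parsed into province/city/district fields, and the replacement rate and vocabulary are illustrative choices, not values from the patent):

```python
import random

def pair_subtask_labels(addr_a, addr_b):
    """Subtask labels for one sample address pair: whether the two
    addresses share the same province, city, and district fields."""
    return {
        "same_province": addr_a["province"] == addr_b["province"],
        "same_city": addr_a["city"] == addr_b["city"],
        "same_district": addr_a["district"] == addr_b["district"],
    }

def randomize_chars(sequence, rate=0.15, vocab="ABCDEFG"):
    """Randomized replacement subtask input: replace characters at
    random; the model must predict the originals recorded in `targets`."""
    out, targets = [], {}
    for i, ch in enumerate(sequence):
        if random.random() < rate:
            targets[i] = ch                # original character to predict
            out.append(random.choice(vocab))
        else:
            out.append(ch)
    return "".join(out), targets
```

The pair labels supply the tag values compared against the subtask result values in the first loss function.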
In an exemplary embodiment, the model training unit may include: the function construction subunit is used for constructing a second loss function based on the normalization of the labeling path; and the model training subunit is used for inputting the second sample data into the address labeling model, and updating the parameters of the address labeling model according to the error between the output labeling path and the address category label of the second sample data and the second loss function so as to train and obtain the address labeling model.
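A loss "based on the normalization of the labeling path" matches the standard linear-chain CRF likelihood: the gold path's score normalized over all possible paths. A brute-force sketch for tiny sequences (the score values and tag set are illustrative; a real implementation would use dynamic programming rather than enumerating paths):

```python
import math
from itertools import product

def crf_path_loss(emissions, transitions, gold_path):
    """Negative log-probability of the gold labeling path.

    emissions:   list of {tag: score} dicts, one per character
    transitions: {(prev_tag, tag): score}
    gold_path:   tuple of gold tags, one per character
    """
    tags = list(emissions[0])

    def path_score(path):
        s = sum(emissions[i][t] for i, t in enumerate(path))
        s += sum(transitions[(a, b)] for a, b in zip(path, path[1:]))
        return s

    # Partition function: log-sum-exp over every possible labeling path.
    log_z = math.log(sum(math.exp(path_score(p))
                         for p in product(tags, repeat=len(emissions))))
    return log_z - path_score(gold_path)
```

Minimizing this loss raises the gold path's share of the total path probability mass, which is the normalization the second loss function refers to.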
In an exemplary embodiment, after acquiring the address category tag of the second sample data, the address data annotating device may include: the sample proportion calculation module is used for acquiring the quantity of second sample data under each address category and calculating the sample proportion of each address category; and the data adjusting module is used for adjusting at least part of data in the second sample data if the sample proportion of at least one address category is lower than a preset threshold value, so that the sample proportion of each address category in the adjusted second sample data meets a preset condition.
In an exemplary embodiment, the data adjustment module may include: the data deleting unit is used for deleting a part of data from second sample data of the address category with the sample proportion higher than a preset threshold value; and/or the sample construction unit is used for constructing new sample data based on second sample data of the address category with the sample proportion lower than the preset threshold value, and adding the new sample data into the second sample data.
In an exemplary embodiment, the character splitting module may include: the address acquisition unit is used for acquiring the address to be marked and carrying out text cleaning processing on the address to be marked; the character splitting unit is used for splitting the address to be marked into single characters after the text cleaning processing, and converting each split character into a numerical index; the sequence processing module comprises: the index processing unit is used for processing the numerical index by adopting an address labeling model to obtain a labeled data sequence formed by the label indexes of each character; the result determination module includes: and the index searching unit is used for searching in a preset label dictionary according to the label index so as to determine the labeling result of the address to be labeled.
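The index conversion on the way in and the label-dictionary lookup on the way out are thin mapping layers. A minimal sketch, assuming a character vocabulary with index 0 reserved for unknown characters and a simple integer-to-tag dictionary (both layouts are assumptions, not specified by the patent):

```python
def encode_address(address, char_vocab):
    """Split the cleaned address into single characters and map each
    to its numerical index; unknown characters fall back to index 0."""
    return [char_vocab.get(ch, 0) for ch in address]

def decode_labels(label_indices, label_dict):
    """Look up each predicted label index in the preset label
    dictionary to recover the labeling result."""
    return [label_dict[i] for i in label_indices]
```

The address labeling model sits between these two steps, consuming the index sequence and emitting the label-index sequence.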
The specific details of each module/unit in the above-mentioned apparatus have been described in detail in the embodiment of the method section, and the details that are not disclosed may refer to the contents of the embodiment of the method section, and therefore are not described herein again.
Exemplary embodiments of the present disclosure also provide an electronic device capable of implementing the above method.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module" or "system."
An electronic device 1100 according to such an exemplary embodiment of the present disclosure is described below with reference to fig. 11. The electronic device 1100 shown in fig. 11 is only an example and should not bring any limitations to the function and scope of use of the embodiments of the present disclosure.
As shown in fig. 11, electronic device 1100 is embodied in the form of a general purpose computing device. The components of the electronic device 1100 may include, but are not limited to: the at least one processing unit 1110, the at least one memory unit 1120, a bus 1130 connecting different system components (including the memory unit 1120 and the processing unit 1110), and a display unit 1140.
Where the memory unit stores program code, the program code may be executed by the processing unit 1110 to cause the processing unit 1110 to perform the steps according to various exemplary embodiments of the present disclosure as described in the above-mentioned "exemplary methods" section of this specification. For example, the processing unit 1110 may execute steps S110 to S140 shown in fig. 1, may execute steps S210 to S240 shown in fig. 2, and the like.
The storage unit 1120 may include readable media in the form of volatile storage units, such as a random access memory unit (RAM) 1121 and/or a cache memory unit 1122, and may further include a read-only memory unit (ROM) 1123.
The storage unit 1120 may also include a program/utility 1124 having a set (at least one) of program modules 1125, such program modules 1125 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 1130 may be representative of one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 1100 may also communicate with one or more external devices 1300 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 1100, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 1100 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 1150. Also, the electronic device 1100 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the internet) via the network adapter 1160. As shown, the network adapter 1160 communicates with the other modules of the electronic device 1100 over the bus 1130. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 1100, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the exemplary embodiments of the present disclosure.
Exemplary embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps according to various exemplary embodiments of the disclosure described in the above-mentioned "exemplary methods" section of this specification, when the program product is run on the terminal device.
Referring to fig. 12, a program product 1200 for implementing the above method according to an exemplary embodiment of the present disclosure is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++, or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
Furthermore, the above-described figures are merely schematic illustrations of processes included in methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit according to an exemplary embodiment of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided and embodied by a plurality of modules or units.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.

Claims (13)

1. An address data labeling method is characterized by comprising the following steps:
acquiring an address labeling model, wherein the address labeling model is obtained by pre-training based on unlabeled first sample data and labeled second sample data;
splitting an address to be marked into a plurality of characters so as to convert the address to be marked into a character sequence to be marked, wherein the character sequence to be marked is formed by arranging the characters;
processing the character sequence to be marked by adopting the address marking model to obtain a marked data sequence;
and determining the labeling result of the address to be labeled according to the labeling data sequence.
2. The method of claim 1, wherein the address labeling model is trained by:
acquiring first sample data, second sample data and an address category label of the second sample data;
pre-training a machine learning model by using the first sample data to generate an intermediate model;
constructing an initial address labeling model based on at least a portion of the intermediate model;
and training and obtaining the address labeling model by using the second sample data and the address category label of the second sample data.
3. The method of claim 2, wherein said obtaining first and second sample data comprises:
acquiring initial sample data, and carrying out standardization processing on the initial sample data;
carrying out layered sampling on the initial sample data after the standardization processing, and updating the sequence of the initial sample data after the layered sampling;
and dividing the initial sample data into the first sample data and the second sample data according to a preset proportion.
4. The method according to claim 3, wherein after dividing initial sample data into the first sample data and second sample data, obtaining an address class label of the second sample data by:
labeling each character in the second sample data by adopting a preset labeling method to obtain an address category label of each character;
and updating the address category label of each character in the second sample data according to the verification result aiming at the address category label.
5. The method of claim 2, wherein pre-training a machine learning model with the first sample data, generating an intermediate model, comprises:
determining a plurality of groups of sample address pairs from the first sample data, and converting each group of sample address pairs into a sample address sequence;
inputting the sample address sequence into a machine learning model to obtain a result value of a preset subtask;
updating parameters of the machine learning model according to a first loss function to obtain the intermediate model, wherein the first loss function comprises an error between a result value of the subtask and a tag value of the subtask;
wherein the subtasks include any one or a combination of more than one of:
judging whether the province characters of the two addresses in the sample address pair are the same;
judging whether the city characters of the two addresses in the sample address pair are the same;
and judging whether the zone characters of the two addresses in the sample address pair are the same.
6. The method of claim 5, wherein after converting each set of sample address pairs into a sequence of sample addresses, the method further comprises:
performing randomized substitution processing on one or more characters in the sample address sequence;
the subtasks further include:
and predicting the original character corresponding to the character subjected to the randomized substitution in the sample address sequence.
7. The method of claim 2, wherein said training and deriving the address labeling model using the second sample data and the address class label of the second sample data comprises:
constructing a second loss function based on the normalization of the labeling path;
inputting the second sample data into the address labeling model, and updating parameters of the address labeling model according to the error between the output labeling path and the address category label of the second sample data and the second loss function so as to train and obtain the address labeling model.
8. The method of claim 2, wherein after obtaining the address class tag for the second sample data, the method comprises:
acquiring the quantity of the second sample data under each address category, and calculating the sample proportion of each address category;
if the sample proportion of at least one address category is lower than a preset threshold value, adjusting at least a part of data in the second sample data to enable the sample proportion of each address category in the second sample data after adjustment to meet the preset condition.
9. The method of claim 8, wherein said adjusting at least a portion of said second sample data comprises:
deleting a part of data from second sample data of the address category with the sample proportion higher than the preset threshold; and/or
And constructing new sample data based on second sample data of the address category with the sample proportion lower than the preset threshold value, and adding the new sample data into the second sample data.
10. The method of claim 1, wherein the splitting the address to be labeled into a plurality of characters comprises:
acquiring an address to be marked, and performing text cleaning processing on the address to be marked;
splitting the address to be marked into single characters after the text cleaning processing is carried out, and converting each split character into a numerical index;
the processing the character sequence to be labeled by adopting the address labeling model to obtain a labeled data sequence comprises the following steps:
processing the digital index by adopting an address labeling model to obtain a labeled data sequence formed by the label indexes of each character;
the determining the labeling result of the address to be labeled according to the labeling data sequence includes:
and searching in a preset label dictionary according to the label index to determine the labeling result of the address to be labeled.
11. An address data labeling apparatus, comprising:
the model acquisition module is used for acquiring an address labeling model, and the address labeling model is obtained by pre-training based on unlabeled first sample data and labeled second sample data;
the character splitting module is used for splitting the address to be marked into a plurality of characters so as to convert the address to be marked into a character sequence to be marked, wherein the character sequence to be marked is formed by arranging the characters;
the sequence processing module is used for processing the character sequence to be labeled by adopting the address labeling model to obtain a labeled data sequence;
and the result determining module is used for determining the labeling result of the address to be labeled according to the labeling data sequence.
12. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any of claims 1-10 via execution of the executable instructions.
13. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1-10.
CN201911349674.1A 2019-12-24 2019-12-24 Address data labeling method and device, electronic equipment and storage medium Active CN111125365B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911349674.1A CN111125365B (en) 2019-12-24 2019-12-24 Address data labeling method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911349674.1A CN111125365B (en) 2019-12-24 2019-12-24 Address data labeling method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111125365A true CN111125365A (en) 2020-05-08
CN111125365B CN111125365B (en) 2022-01-07

Family

ID=70500079

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911349674.1A Active CN111125365B (en) 2019-12-24 2019-12-24 Address data labeling method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111125365B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112069293A (en) * 2020-09-14 2020-12-11 上海明略人工智能(集团)有限公司 Data annotation method and device, electronic equipment and computer readable medium
CN112131415A (en) * 2020-09-18 2020-12-25 北京影谱科技股份有限公司 Method and device for improving data acquisition quality based on deep learning
CN112131384A (en) * 2020-08-27 2020-12-25 科航(苏州)信息科技有限公司 News classification method and computer-readable storage medium
CN112257413A (en) * 2020-10-30 2021-01-22 深圳壹账通智能科技有限公司 Address parameter processing method and related equipment
CN112488200A (en) * 2020-11-30 2021-03-12 上海寻梦信息技术有限公司 Logistics address feature extraction method, system, equipment and storage medium
CN112579919A (en) * 2020-12-09 2021-03-30 小红书科技有限公司 Data processing method and device and electronic equipment
CN112989166A (en) * 2021-03-26 2021-06-18 杭州有数金融信息服务有限公司 Method for calculating actual business territory of enterprise
CN113011157A (en) * 2021-03-19 2021-06-22 中国联合网络通信集团有限公司 Method, device and equipment for hierarchical processing of address information
KR20210154069A (en) * 2020-06-11 2021-12-20 베이징 바이두 넷컴 사이언스 앤 테크놀로지 코., 엘티디. Method, apparatus, device and storage medium for training model
WO2022134592A1 (en) * 2020-12-23 2022-06-30 深圳壹账通智能科技有限公司 Address information resolution method, apparatus and device, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109684440A (en) * 2018-12-13 2019-04-26 北京惠盈金科技术有限公司 Address method for measuring similarity based on level mark
CN109960800A (en) * 2019-03-13 2019-07-02 安徽省泰岳祥升软件有限公司 Weakly supervised file classification method and device based on Active Learning
CN109977395A (en) * 2019-02-14 2019-07-05 北京三快在线科技有限公司 Handle method, apparatus, electronic equipment and the readable storage medium storing program for executing of address text
CN110222337A (en) * 2019-05-28 2019-09-10 浙江邦盛科技有限公司 A kind of Chinese address segmenting method based on transformer and CRF
CN110297909A (en) * 2019-07-05 2019-10-01 中国工商银行股份有限公司 A kind of classification method and device of no label corpus

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109684440A (en) * 2018-12-13 2019-04-26 北京惠盈金科技术有限公司 Address method for measuring similarity based on level mark
CN109977395A (en) * 2019-02-14 2019-07-05 北京三快在线科技有限公司 Handle method, apparatus, electronic equipment and the readable storage medium storing program for executing of address text
CN109960800A (en) * 2019-03-13 2019-07-02 安徽省泰岳祥升软件有限公司 Weakly supervised file classification method and device based on Active Learning
CN110222337A (en) * 2019-05-28 2019-09-10 浙江邦盛科技有限公司 A kind of Chinese address segmenting method based on transformer and CRF
CN110297909A (en) * 2019-07-05 2019-10-01 中国工商银行股份有限公司 A kind of classification method and device of no label corpus

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102534721B1 (en) 2020-06-11 2023-05-22 베이징 바이두 넷컴 사이언스 앤 테크놀로지 코., 엘티디. Method, apparatus, device and storage medium for training model
JP7166322B2 (en) 2020-06-11 2022-11-07 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Methods, apparatus, electronics, storage media and computer programs for training models
KR20210154069A (en) * 2020-06-11 2021-12-20 베이징 바이두 넷컴 사이언스 앤 테크놀로지 코., 엘티디. Method, apparatus, device and storage medium for training model
JP2021197137A (en) * 2020-06-11 2021-12-27 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Method, device, electronic apparatus, storage medium, and computer program for training model
CN112131384A (en) * 2020-08-27 2020-12-25 科航(苏州)信息科技有限公司 News classification method and computer-readable storage medium
CN112069293A (en) * 2020-09-14 2020-12-11 上海明略人工智能(集团)有限公司 Data annotation method and device, electronic equipment and computer readable medium
CN112069293B (en) * 2020-09-14 2024-04-19 上海明略人工智能(集团)有限公司 Data labeling method, device, electronic equipment and computer readable medium
CN112131415A (en) * 2020-09-18 2020-12-25 北京影谱科技股份有限公司 Method and device for improving data acquisition quality based on deep learning
CN112257413B (en) * 2020-10-30 2022-05-17 深圳壹账通智能科技有限公司 Address parameter processing method and related equipment
CN112257413A (en) * 2020-10-30 2021-01-22 深圳壹账通智能科技有限公司 Address parameter processing method and related equipment
WO2022089227A1 (en) * 2020-10-30 2022-05-05 深圳壹账通智能科技有限公司 Address parameter processing method, and related device
CN112488200A (en) * 2020-11-30 2021-03-12 上海寻梦信息技术有限公司 Logistics address feature extraction method, system, equipment and storage medium
CN112579919B (en) * 2020-12-09 2023-04-21 小红书科技有限公司 Data processing method and device and electronic equipment
CN112579919A (en) * 2020-12-09 2021-03-30 小红书科技有限公司 Data processing method and device and electronic equipment
WO2022134592A1 (en) * 2020-12-23 2022-06-30 深圳壹账通智能科技有限公司 Address information resolution method, apparatus and device, and storage medium
CN113011157A (en) * 2021-03-19 2021-06-22 中国联合网络通信集团有限公司 Method, device and equipment for hierarchical processing of address information
CN112989166A (en) * 2021-03-26 2021-06-18 杭州有数金融信息服务有限公司 Method for calculating actual business territory of enterprise

Also Published As

Publication number Publication date
CN111125365B (en) 2022-01-07

Similar Documents

Publication Publication Date Title
CN111125365B (en) Address data labeling method and device, electronic equipment and storage medium
Qi et al. Finding all you need: web APIs recommendation in web of things through keywords search
CN110825882B (en) Knowledge graph-based information system management method
US9268766B2 (en) Phrase-based data classification system
AU2020321751A1 (en) Neural network system for text classification
WO2022218186A1 (en) Method and apparatus for generating personalized knowledge graph, and computer device
CN109992673A (en) Knowledge graph generation method, device, equipment, and readable storage medium
JP2022003512A (en) Method and apparatus for constructing quality evaluation model, electronic device, storage medium, and computer program
US11194963B1 (en) Auditing citations in a textual document
US20220100772A1 (en) Context-sensitive linking of entities to private databases
US11681876B2 (en) Cascaded fact-based summarization
CN116383399A (en) Event public opinion risk prediction method and system
Yang et al. DOMFN: A divergence-orientated multi-modal fusion network for resume assessment
Wei et al. GP-GCN: Global features of orthogonal projection and local dependency fused graph convolutional networks for aspect-level sentiment classification
Liu et al. Supporting features updating of apps by analyzing similar products in App stores
US20220100967A1 (en) Lifecycle management for customized natural language processing
CN116432611A (en) Manuscript writing auxiliary method, system, terminal and storage medium
US11475211B1 (en) Elucidated natural language artifact recombination with contextual awareness
CN114936564A (en) Multi-language semantic matching method and system based on alignment variational self-coding
Wang et al. Construction of bilingual knowledge graph based on meteorological simulation
CN116028620B (en) Method and system for generating patent abstract based on multi-task feature cooperation
US20240054282A1 (en) Elucidated natural language artifact recombination with contextual awareness
CN117251650B (en) Geographic hotspot center identification method, device, computer equipment and storage medium
Carroll et al. Modeling the annotation process for ancient corpus creation
Oliveira et al. Sentiment analysis of stock market behavior from Twitter using the R Tool

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: Room 221, 2/F, Block C, 18 Kechuang 11th Street, Daxing District, Beijing, 100176

Applicant after: Jingdong Technology Holding Co.,Ltd.

Address before: Room 221, 2/F, Block C, 18 Kechuang 11th Street, Daxing District, Beijing, 100176

Applicant before: Jingdong Digital Technology Holding Co.,Ltd.

Address after: Room 221, 2/F, Block C, 18 Kechuang 11th Street, Daxing District, Beijing, 100176

Applicant after: Jingdong Digital Technology Holding Co.,Ltd.

Address before: Room 221, 2/F, Block C, 18 Kechuang 11th Street, Daxing District, Beijing, 100176

Applicant before: JINGDONG DIGITAL TECHNOLOGY HOLDINGS Co.,Ltd.

GR01 Patent grant