CN110968654A - Method, equipment and system for determining address category of text data

Info

Publication number: CN110968654A
Application number: CN201811149284.5A
Authority: CN
Inventors: 郑华飞; 谢朋峻; 李林琳; 司罗
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2018-09-29
Filing date: 2018-09-29
Publication date: 2020-04-07
Anticipated expiration: 2038-09-29
Also published as: CN110968654B

Abstract

The application provides a text data address category determining method, category identifying equipment, a text data address category determining system, computing equipment and a computer readable storage medium, and relates to the technical field of data processing. The method for determining the address category of the text data comprises the following steps: carrying out named entity identification on the text data to obtain an address field; performing administrative division completion on the address field; extracting the geographical interest point information in the supplemented address field; and inputting the geographical interest point information into an address category machine learning model component to obtain an address category corresponding to the geographical interest point information. According to the technical scheme, the natural text data is used as input, and the accurate address category is output.

Description

Method, equipment and system for determining address category of text data

Technical Field

The present application belongs to the field of data processing technology, and in particular, to a method for determining an address category of text data, a category identification device, a system for determining an address category of text data, a computing device, and a computer-readable storage medium.

Background

With the development of mobile internet and the popularization of smart phones, it becomes easier to acquire personal location data of users. The personal position data of the user recorded in the form of the track has wide application value in a plurality of fields such as e-commerce, criminal investigation and the like.

However, directly processing the original single-point position data in user behavior track data (i.e., text address fragments such as "Bamboo Water Rhyme in Spring Wind, Yuhang District", "Wulin One, Hangzhou", and "Wanxiang City, Hangzhou") is very difficult for applications such as real-time commodity recommendation in the e-commerce field and crime prevention in the criminal investigation field. POI (Point of Interest) identification and classification of the text address fragments in a user's behavior track data are therefore of great significance. The prior-art technical solutions for POI identification and classification mainly include:

1. POI classification service of electronic map

The online map services of electronic maps (such as Amap (Gaode Maps), Baidu Maps, and the like) maintain a massive standard address library, and can search according to an input address fragment and return the most relevant standard addresses. Each standard address carries POI category information. Electronic maps generally collect and refine the POI classification information in the standard address library by crowdsourcing. The drawback is obvious: category information cannot be obtained for newly appearing addresses (or POIs). Furthermore, crowdsourcing itself carries a risk of mislabeling, such as labeling an office building as a shopping mall.

2. Machine learning-based classification of calibrated POI data

This method is based on calibrated POI data. First, Chinese word segmentation preprocessing is performed on shop names; then a short-text vector space model is established; then a dictionary of main classification features is screened out using the information gain method; finally, the probability that a sample belongs to each category is estimated with a naive Bayes model. The machine learning-based method achieves a certain classification effect on calibrated POI data. However, when an address fragment of natural text is input (such as "the University of Science and Technology opposite the Zhongguancun Science Park"), the method does not identify and extract the POI. Instead, it treats the whole address fragment as the POI, performs Chinese word segmentation preprocessing, establishes the vector space model, and finally predicts the category with the naive Bayes model. Because of the distractor "Zhongguancun Science Park", the address fragment may be classified as "science park" rather than "university".

Because the prior art POI identification and classification solutions have the above drawbacks, a new solution is urgently needed to solve the above drawbacks.

Disclosure of Invention

In view of the above, the present application provides an address category determining method for text data, a category identifying device, an address category determining system for text data, a computing device, and a computer readable storage medium, where the method identifies text data in a behavior trajectory and extracts geographical interest point information, and then performs classification prediction on the geographical interest point information through an address category machine learning model component to obtain an accurate address category, so that natural text data corresponding to the behavior trajectory data of a user is used as input to output an accurate address category, and the method has a wide application value in multiple fields such as e-commerce and criminal investigation.

In order to achieve the above purpose, the present application provides the following technical solutions:

According to a first aspect of the present application, a method for determining an address category of text data is provided, including:

carrying out named entity identification on the text data to obtain an address field;

performing administrative division completion on the address field;

extracting the geographical interest point information in the supplemented address field;

and inputting the geographical interest point information into an address category machine learning model component to obtain an address category corresponding to the geographical interest point information.

According to a second aspect of the present application, there is provided a category identifying device comprising:

the named entity recognition module is used for carrying out named entity recognition on the text data to obtain an address field;

the administrative division completion module is used for performing administrative division completion on the address field;

the interest point extracting module is used for extracting the geographical interest point information in the supplemented address field;

and the address category determining module is used for inputting the geographical interest point information into an address category machine learning model component to obtain an address category corresponding to the geographical interest point information.

According to a third aspect of the present application, there is provided an address category determination system for text data, including a category identification device;

the category identifying device is configured to: carrying out named entity identification on the text data to obtain an address field; performing administrative division completion on the address field; extracting the geographical interest point information in the supplemented address field; and inputting the geographical interest point information into an address category machine learning model component to obtain an address category corresponding to the geographical interest point information.

According to a fourth aspect of the present application, a computing device is provided, comprising a processor and a storage device, wherein the processor is adapted to execute instructions, and the storage device stores a plurality of instructions adapted to be loaded by the processor to execute the above address category determination method for text data.

According to a fifth aspect of the present application, a computer-readable storage medium is proposed, which stores a computer program for executing the above-described address category determination method for text data.

According to the technical scheme, the method and the device have the advantages that the text data in the behavior track are identified, the geographical interest point information is extracted, the geographical interest point information is classified and predicted through the address category machine learning model component, the accurate address category is obtained, the natural text data of the behavior track data of the user is used as input, the accurate address category is output, and the method and the device have wide application value in multiple fields such as e-commerce and criminal investigation.

In order to make the aforementioned and other objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only some embodiments described in the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without any creative effort.

FIG. 1 is a schematic diagram illustrating an address category determination system for text data according to the present application;

FIG. 2 is a schematic diagram illustrating interaction between a category identification device and a model component training device in an address category determination system for text data according to the present application;

FIG. 3 is a schematic diagram illustrating the structure of a category identifying device according to the present application;

FIG. 4 is a schematic diagram of a model component training apparatus according to the present application;

FIG. 5 is a flow chart illustrating a method for determining address categories of text data according to the present application;

FIG. 6 is a schematic flowchart illustrating a process of extracting geographic interest point information in the method for determining address categories of text data according to the present application;

FIG. 7 is a flow diagram illustrating a method for training an address category machine learning model component of the present application;

FIG. 8 is a diagram illustrating a statistical word lattice classification method in an embodiment provided by the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The principles and spirit of the present application are explained in detail below with reference to several representative embodiments of the present application.

Although the present application provides method operational steps or apparatus configurations as illustrated in the following examples or figures, more or fewer operational steps or modular units may be included in the methods or apparatus based on conventional or non-inventive efforts. In the case of steps or structures which do not logically have the necessary cause and effect relationship, the execution sequence of the steps or the module structure of the apparatus is not limited to the execution sequence or the module structure shown in the embodiment or the drawings of the present application. The described methods or modular structures, when applied in an actual device or end product, may be executed sequentially or in parallel according to embodiments or the methods or modular structures shown in the figures.

The technical terms referred to in the present application will be first described below.

Point of interest: Point of Interest, POI for short, refers to a point of interest on an electronic map. Government agencies, companies, shopping malls, restaurants, and the like are all POIs.

POI category prediction: taking a POI fragment as input and outputting the point-of-interest category to which the POI belongs.

Named entity recognition: Named Entity Recognition, NER for short, also called "proper name recognition", refers to the recognition of entities with specific meaning in text, mainly including names of people, places, and organizations, and other proper nouns.

Aiming at the technical defects of the POI classification service of the electronic map and the classification scheme of calibrated POI data based on machine learning in the prior art, the applicant of the application provides an address category determination system and category identification equipment of text data, which identify the text data in a behavior track and extract geographic interest point information, classify and predict the geographic interest point information through an address category machine learning model component, finally obtain an accurate address category and have wide application value in a plurality of fields such as electronic commerce, criminal investigation and the like.

The following describes a specific embodiment of the present application. The present application provides a system for determining address categories of text data, and FIG. 1 shows a schematic structural diagram of the system. Referring to FIG. 1, the system includes a category identifying device 100 and may further include a model component training device 200. FIG. 2 shows a schematic diagram of the interaction between the category identifying device 100 and the model component training device 200. Referring to FIG. 1 and FIG. 2, in the present application:

S1: The category identifying device acquires text data.

In one implementation manner of the application, text data corresponding to a behavior track of a user on the internet or an intelligent terminal can be acquired.

In one embodiment of the present application, the category identifying device may be implemented by a server.

S2: The category identifying device extracts the geographic interest point information corresponding to the text data.

In an embodiment of the present application, extracting geographic point of interest information corresponding to text data may include:

S21: Performing named entity recognition on the text data to obtain an address field. For example, this may include: performing word segmentation on the text data to obtain word segmentation data; performing named entity recognition on the word segmentation data; and extracting the address field obtained from the named entity recognition.

In an embodiment of the present application, place names, which are the elements of text address data, are words of a very specialized domain, and administrative division place names are fixed and change infrequently. A word segmentation algorithm based on word granularity is therefore used to segment the text data. FIG. 8 shows a schematic diagram of a statistical word lattice classification method. Referring to FIG. 8, segmenting the text data may include:

1. Candidate word lattice: using dictionary matching to enumerate all possible segmented words of the input sentence, and storing the words in the form of a word lattice;

2. Calculating the weight of each path in the word lattice, where the weight is computed from the unigram statistical probability of each node and the bigram statistical probability between nodes in FIG. 8;

3. Finding the path with the maximum weight in the graph using a graph search algorithm, and taking it as the final word segmentation result.
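The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the toy vocabulary, the probability values, the interpolation weight `lam`, the start symbol `<s>`, and the unspaced Latin input string are all hypothetical, and dynamic programming over the lattice stands in for the unspecified graph search algorithm.

```python
import math

# Hypothetical toy statistics; a real system would estimate them from a corpus.
UNIGRAM = {"hang": 0.01, "zhou": 0.01, "hangzhou": 0.05, "yu": 0.01, "yuhang": 0.04}
BIGRAM = {("hangzhou", "yuhang"): 0.5}


def build_lattice(text, vocab, max_len=8):
    """Step 1: enumerate every dictionary word that starts at each position."""
    lattice = {i: [] for i in range(len(text))}
    for i in range(len(text)):
        for j in range(i + 1, min(len(text), i + max_len) + 1):
            if text[i:j] in vocab:
                lattice[i].append(text[i:j])
    return lattice


def edge_weight(prev, word, unigram, bigram, lam=0.7):
    """Step 2: log-weight of an edge, mixing bigram and unigram probabilities."""
    p = lam * bigram.get((prev, word), 0.0) + (1 - lam) * unigram[word]
    return math.log(p)


def segment(text, unigram=UNIGRAM, bigram=BIGRAM):
    """Step 3: return the maximum-weight path through the lattice."""
    lattice = build_lattice(text, unigram)
    # best[(position, last_word)] = (score, path) for paths covering text[:position]
    best = {(0, "<s>"): (0.0, [])}
    for pos in range(len(text)):
        states = [(k, v) for k, v in best.items() if k[0] == pos]
        for (_, prev), (score, path) in states:
            for word in lattice[pos]:
                key = (pos + len(word), word)
                cand = (score + edge_weight(prev, word, unigram, bigram), path + [word])
                if key not in best or cand[0] > best[key][0]:
                    best[key] = cand
    finals = [v for k, v in best.items() if k[0] == len(text)]
    return max(finals, key=lambda v: v[0])[1] if finals else []
```

With these toy statistics, the path through the two long dictionary words outweighs the path through four short ones, so `segment("hangzhouyuhang")` selects `["hangzhou", "yuhang"]`.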

Taking the text data "Hangzhou Yuhang Wenyi West Road Alibaba Xixi Park" as an example, the word segmentation data after word segmentation is "Hangzhou / Yuhang / Wenyi West Road / Alibaba Xixi Park".

In other embodiments of the present application, other word segmentation algorithms may also be used for word segmentation.

And after the word segmentation data are obtained, carrying out named entity recognition on the word segmentation data.

In one embodiment of the present application, place-name named entity recognition is performed on the word segmentation data: a conventional Conditional Random Field (CRF) model is trained on a labeled text address training corpus to perform NER on the word segmentation data. In other embodiments of the present application, other models may also be employed for named entity recognition.

After named entity recognition is performed on the word segmentation data, the address field obtained from the named entity recognition is extracted.

Taking the word segmentation data "Hangzhou / Yuhang / Wenyi West Road / Alibaba Xixi Park" as an example, the address field extracted after named entity recognition is "Hangzhou/city_abbr Yuhang/district_abbr Wenyi West Road/road Alibaba Xixi Park/building".
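The tagged address field in this example has a "token/tag" layout, which can be split into pairs for downstream processing. The following is a small sketch assuming that layout; the function name `parse_tagged_field` and the English rendering of the tokens are illustrative assumptions, not part of the application.

```python
import re

# One "token/tag" item: the token may contain spaces, the tag is a word such as city_abbr.
TAGGED_ITEM = re.compile(r"\s*([^/]+?)/(\w+)")


def parse_tagged_field(tagged):
    """Split 'token/tag token/tag ...' into a list of (token, tag) pairs."""
    return [(token, tag) for token, tag in TAGGED_ITEM.findall(tagged)]


example = ("Hangzhou/city_abbr Yuhang/district_abbr "
           "Wenyi West Road/road Alibaba Xixi Park/building")
pairs = parse_tagged_field(example)
```

Here `pairs` holds four entries, the last being `("Alibaba Xixi Park", "building")`.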

Extracting the geographical point of interest information of the text data may further include:

S22: Performing administrative division completion on the address field.

In one embodiment of the present application, the administrative division completion of the address field may include: completing and/or correcting the address field by comparison with a standard address library.

In one embodiment of the present application, the standard address library may be constructed according to an address structured level, the address structured level may include administrative divisions and address elements corresponding to a plurality of levels, and the standard address library may include a plurality of standard addresses, which may include fields such as an administrative division field, a road name field, and a geographic interest point field.

In one embodiment provided herein, the address structuring level may be divided with reference to the latest administrative areas across the country, and the address structuring level may include 21 levels in total, as shown in table 1.

TABLE 1

In one particular embodiment provided herein, a standard address library with, for example, 9 levels is constructed based on the address structuring levels, as shown in the partial screenshot in Table 2.

TABLE 2

In one embodiment provided herein, other numbers of address structuring levels may be constructed and a standard address library constructed based on the address structuring levels.

Taking the address field "Hangzhou/city_abbr Yuhang/district_abbr Wenyi West Road/road 969/num No./hao Alibaba Xixi Park/building" extracted after named entity recognition as an example, the address field after administrative division completion is "prov=Zhejiang Province city=Hangzhou City district=Yuhang District road=Wenyi West Road roadno=No. 969 poi=Alibaba Xixi Park".
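Completion against a standard address library can be illustrated with a simple prefix lookup. This is a sketch under stated assumptions: the three-entry `STANDARD_LIBRARY`, the `TAG_TO_FIELD` mapping, and the prefix-matching rule are hypothetical simplifications, and a real library would hold many more levels and entries.

```python
# Hypothetical miniature standard address library (administrative division fields only).
STANDARD_LIBRARY = [
    {"prov": "Zhejiang Province", "city": "Hangzhou City", "district": "Yuhang District"},
    {"prov": "Zhejiang Province", "city": "Hangzhou City", "district": "Xihu District"},
    {"prov": "Jiangsu Province", "city": "Nanjing City", "district": "Xuanwu District"},
]

# Assumed mapping from NER tags to standard address fields.
TAG_TO_FIELD = {"city_abbr": "city", "district_abbr": "district"}


def complete_divisions(pairs, library=STANDARD_LIBRARY):
    """Fill in missing administrative divisions from the first matching library entry."""
    partial = {TAG_TO_FIELD[tag]: tok for tok, tag in pairs if tag in TAG_TO_FIELD}
    completed = dict(partial)
    for entry in library:
        # An abbreviated name matches when it is a prefix of the standard name.
        if partial and all(entry[f].startswith(v) for f, v in partial.items()):
            completed = dict(entry)
            break
    # Carry over the non-administrative elements unchanged.
    for tok, tag in pairs:
        if tag == "road":
            completed["road"] = tok
        elif tag == "num":
            completed["roadno"] = tok
        elif tag == "building":
            completed["poi"] = tok
    return completed


pairs = [("Hangzhou", "city_abbr"), ("Yuhang", "district_abbr"),
         ("Wenyi West Road", "road"), ("969", "num"), ("No.", "hao"),
         ("Alibaba Xixi Park", "building")]
completed = complete_divisions(pairs)
```

In this sketch the abbreviated city and district select the Yuhang District entry, which supplies the missing province.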

S23: Extracting the geographic interest point information in the completed address field.

In one embodiment provided herein, the geographic point of interest information may be determined based on a value of a geographic point of interest field in the complemented address field.

Taking the address field "Hangzhou/city_abbr Yuhang/district_abbr Wenyi West Road/road 969/num No./hao Alibaba Xixi Park/building" extracted after named entity recognition as an example, the address field after administrative division completion is "prov=Zhejiang Province city=Hangzhou City district=Yuhang District road=Wenyi West Road roadno=No. 969 poi=Alibaba Xixi Park", and the extracted geographic interest point information is "Alibaba Xixi Park".
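Reading the geographic interest point out of a completed address field amounts to parsing its key=value pairs and returning the POI field. The sketch below assumes the serialized "key=value key=value" layout of the example; the helper names and the field name `poi` are assumptions made for illustration.

```python
import re

# A value runs until the next " key=" or the end of the string, so it may contain spaces.
FIELD = re.compile(r"(\w+)=(.*?)(?=\s+\w+=|$)")


def parse_completed_field(completed):
    """Parse 'key=value key=value ...' into a dictionary."""
    return {key: value for key, value in FIELD.findall(completed)}


def extract_poi(completed):
    """Return the value of the geographic interest point field, or None if absent."""
    return parse_completed_field(completed).get("poi")


completed = ("prov=Zhejiang Province city=Hangzhou City district=Yuhang District "
             "road=Wenyi West Road roadno=No. 969 poi=Alibaba Xixi Park")
poi = extract_poi(completed)
```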

S3: the model component training device trains the address category machine learning model component.

In one embodiment of the present application, the specific training process of the model component training device is as follows:

S31: A plurality of pieces of address data are acquired.

In one embodiment of the present application, a plurality of shipping addresses may be obtained from an e-commerce platform.

In one embodiment of the present application, addresses related to a shipping address may be determined within a range centered on the shipping address with a preset distance as the radius.

In one embodiment of the present application, the longitude and latitude of the shipping address may be adjusted to determine the addresses related to the shipping address.

In one embodiment of the present application, after the related addresses of the shipping address are determined, the related addresses may in turn be used as seed addresses to determine more related addresses.
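The radius-based expansion of seed addresses can be sketched as follows. This is an illustrative assumption rather than the method of the application: the haversine great-circle distance, the record layout with `name`, `lat`, and `lon` keys, the coordinates, and the `rounds` parameter are all hypothetical choices.

```python
import math

EARTH_RADIUS_M = 6371000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def related_addresses(seed, candidates, radius_m):
    """Candidates lying within radius_m of the seed address."""
    return [c for c in candidates
            if c is not seed
            and haversine_m(seed["lat"], seed["lon"], c["lat"], c["lon"]) <= radius_m]


def expand(seeds, candidates, radius_m, rounds=2):
    """Repeatedly treat newly found addresses as seeds to find more related addresses."""
    found = {s["name"]: s for s in seeds}
    frontier = list(seeds)
    for _ in range(rounds):
        next_frontier = []
        for seed in frontier:
            for c in related_addresses(seed, candidates, radius_m):
                if c["name"] not in found:
                    found[c["name"]] = c
                    next_frontier.append(c)
        frontier = next_frontier
    return sorted(found)


# Hypothetical coordinates: B is about 556 m north of A, C about 556 m north of B.
A = {"name": "A", "lat": 30.000, "lon": 120.0}
B = {"name": "B", "lat": 30.005, "lon": 120.0}
C = {"name": "C", "lat": 30.010, "lon": 120.0}
D = {"name": "D", "lat": 31.000, "lon": 120.0}
```

With an 800 m radius, one round starting from A reaches only B, while a second round that uses B as a seed also reaches C. D stays out of range in both cases.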

It should be noted that the shipping address is only an example; other types of address data can be obtained as needed in a specific implementation, and all such modifications should fall within the scope of the present application.

S32: category information corresponding to the address data and the point of interest data are determined.

In one embodiment of the application, the receiving address and the related address can be subjected to data crawling by means of an electronic map, and the electronic map automatically returns category information of the receiving address and the related address and interest point data.

In one particular embodiment provided herein, by means of a high-resolution map, a seed address and its nearby relevant address POI and category information are crawled, totaling a crawled 10 million high-resolution addresses, the category information including a 3-level hierarchy of categories, such as the three-level tags of the Alibaxi stream park: business housing-industrial park; song Qing age kindergarten: science and education culture service-school-kindergarten. After the 10 hundred million acquired high-resolution addresses are passed through the dead, the total number of 500 million POI data cover 4000 categories such as villages, residential areas, shopping centers and higher institutions.
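The deduplication of crawled records and the splitting of a multi-level category tag can be sketched as below. The record layout as (POI, category) pairs, the whitespace normalization, and the "-" separator are assumptions for illustration; the application does not specify how deduplication is carried out.

```python
def deduplicate(records):
    """Keep the first occurrence of each (poi, category) pair, ignoring outer whitespace."""
    seen = set()
    unique = []
    for poi, category in records:
        key = (poi.strip(), category.strip())
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def split_levels(category, sep="-"):
    """Split a hierarchical category tag into its levels."""
    return [part.strip() for part in category.split(sep)]


records = [
    ("Alibaba Xixi Park", "business residence-industrial park"),
    ("Alibaba Xixi Park ", "business residence-industrial park"),
    ("Soong Ching Ling Kindergarten",
     "science and education culture service-school-kindergarten"),
]
unique = deduplicate(records)
```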

S33: extracting the characteristics of the category information and the interest point data;

s34: and training the address category machine learning model component according to the category information after the feature extraction and the interest point data.

In one embodiment of the present application, the address category machine learning model component is, for example, a fastText model; in other embodiments of the present application, other models such as a CNN model may also be employed. The fastText model is an end-to-end neural network structure, which means that a raw text string can be used directly as the input of the neural network without extensive manual feature engineering, and the category to which the raw text string belongs is output at the output layer of the network. To improve the training effect of the model, embedding + bigram features can be used as the input of the neural network.
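To make the "embedding + bigram" input concrete, the sketch below builds unigram and bigram features from a token list and maps them to embedding-table rows with the hashing trick. It is a simplified illustration: CRC32 is used here only as a deterministic stand-in and is not claimed to be the hash function of any particular library, and the function names and bucket count are hypothetical.

```python
import zlib


def word_ngrams(tokens, n=2):
    """Return the unigrams followed by all word n-grams up to length n."""
    feats = list(tokens)
    for k in range(2, n + 1):
        feats += [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    return feats


def bucket_ids(features, num_buckets=2_000_000):
    """Map each feature string to an embedding-table row via a deterministic hash."""
    return [zlib.crc32(f.encode("utf-8")) % num_buckets for f in features]


features = word_ngrams(["Alibaba", "Xixi", "Park"])
rows = bucket_ids(features, num_buckets=1000)
```

In such a model, the embeddings at the selected rows would typically be averaged into one vector before the output layer.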

In an embodiment of the application, the category information and the interest point data after feature extraction are used as training data of the address category machine learning model component to train the address category machine learning model component.

In one embodiment of the application, the category information and the interest point data after feature extraction are divided into first-class data and second-class data;

taking the first kind of data as training data of the address category machine learning model component, and training the address category machine learning model component;

and taking the second class data as test data of the address category machine learning model component, and testing the trained address category machine learning model component.

In a specific embodiment provided by the application, in the training stage of the fastText model, the 5 million POI records are divided into a training data set and a test data set at a ratio of 3:1, and the word embedding dimension is set to 100. Since the number of target classes is large (4000), hierarchical softmax is selected as the loss function in this embodiment for higher training efficiency. A classification accuracy of 77.3% was achieved on the test set.
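A 3:1 split and the preparation of training lines can be sketched as follows. The shuffling seed, the helper names, and the choice to write samples in the `__label__` text format used by the open-source fastText tool are assumptions; the application does not describe its data format.

```python
import random


def split_3_to_1(samples, seed=0):
    """Shuffle the samples and split them into training and test sets at 3:1."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 3 // 4
    return shuffled[:cut], shuffled[cut:]


def to_fasttext_line(poi, category):
    """Format one sample as '__label__<category> <poi text>' (spaces in the label replaced)."""
    label = category.replace(" ", "_")
    return f"__label__{label} {poi}"


samples = [(f"poi {i}", "school kindergarten") for i in range(8)]
train, test = split_3_to_1(samples)
line = to_fasttext_line("Alibaba Xixi Park", "business residence # industrial park")
```

If the `fasttext` Python package were used, the settings described above would roughly correspond to calling `fasttext.train_supervised` with `dim=100`, `loss="hs"`, and `wordNgrams=2` on a file of such lines; that mapping is an assumption about tooling, not something stated in the application.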

S4: Inputting the geographic interest point information into the address category machine learning model component to obtain an address category corresponding to the geographic interest point information.

And after the training of the address category machine learning model component is finished, taking the geographical interest point information corresponding to the text data as the input of the address category machine learning model component to obtain the address category corresponding to the geographical interest point information.

In one embodiment of the present application, the address category includes at least one level of category information.

In one embodiment of the present application, the address category includes at least one level of category information and a probability value corresponding to the category information.

In a specific embodiment provided by the present application, taking the text data "Hangzhou Yuhang Wenyi West Road Alibaba Xixi Park" as an example, the extracted POI "Alibaba Xixi Park" is used as the input of the trained address category machine learning model component, and the output POI category is "business residence # industrial park: 0.1035". Here 0.1035 is the probability value of the category, which means that the most probable category of the POI "Alibaba Xixi Park" is "business residence # industrial park", and the probability of it belonging to any other category is less than 0.1035.
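Selecting the reported category from the model's per-category probabilities is an argmax, which is why a value as low as 0.1035 can still be the winner when the probability mass is spread over thousands of categories. In the sketch below, only the first entry of `example_scores` comes from the example above; the other two categories and their values are hypothetical.

```python
def top_k(probabilities, k=1):
    """Return the k most probable (category, probability) pairs, best first."""
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]


def format_prediction(label, prob):
    """Render a prediction as 'category: probability' with four decimals."""
    return f"{label}: {prob:.4f}"


example_scores = {
    "business residence # industrial park": 0.1035,  # from the example above
    "company # enterprise": 0.0871,                  # hypothetical
    "shopping # shopping mall": 0.0102,              # hypothetical
}
best_label, best_prob = top_k(example_scores)[0]
prediction = format_prediction(best_label, best_prob)
```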

Therefore, according to the system for determining the address category of the text data, the text data in the behavior track is identified through the category identification device, the geographic interest point information is extracted, then the geographic interest point information is classified and predicted through the address category machine learning model component obtained through training of the model component training device, the accurate address category is obtained, the natural text data corresponding to the behavior track data of the user is used as input, the accurate address category is output, and the system has wide application value in multiple fields such as e-commerce and criminal investigation.

In an embodiment of the present application, the category identification device and the model component training device may be coupled and deployed in the same independent server or server cluster, or may be deployed on different servers respectively.

Fig. 3 is a schematic structural diagram of a category identifying device according to the present application, and referring to fig. 3, the category identifying device 100 includes:

A named entity recognition module 101, configured to perform named entity recognition on the text data to obtain an address field.

In one embodiment of the present application, the named entity identifying module may include:

A word segmentation unit, configured to segment the text data to obtain word segmentation data.

In one embodiment of the present application, a word segmentation algorithm based on word granularity is used, because place names, which are the elements of text address data, are words of a very specialized domain, and administrative division place names are fixed and change infrequently.

Taking the text data "Hangzhou Yuhang Wenyi West Road Alibaba Xixi Park" as an example, the word segmentation data output after word segmentation is "Hangzhou / Yuhang / Wenyi West Road / Alibaba Xixi Park".

A recognition unit, configured to perform named entity recognition on the word segmentation data.

An extraction unit, configured to extract the address field obtained from the named entity recognition.

An administrative division completion module 102, configured to perform administrative division completion on the address field.

In an embodiment of the present application, the administrative division completion module may be specifically configured to complete and/or correct the address field by comparison with a standard address library.

In one embodiment of the application, a standard address library may be constructed based on an address structured level, where the address structured level includes administrative divisions and address elements corresponding to multiple levels, and the standard address library includes multiple standard addresses, which may include fields such as administrative division fields, road name fields, and geographic interest point fields.

In one particular embodiment provided herein, a standard address library with, for example, 9 levels may be constructed based on the address structuring levels, as shown in the partial screenshot in Table 2.

In one embodiment provided herein, other numbers of address structuring levels may be defined, and a standard address library may be constructed based on those address structuring levels.

An interest point extracting module 103, configured to extract the geographic interest point information in the completed address field.

In a specific embodiment provided by the present application, the interest point extracting module may be specifically configured to determine the geographic interest point information based on the value of the geographic interest point field in the completed address field.

Taking the address field "Hangzhou/city_abbr Yuhang/district_abbr Wenyi West Road/road 969/num No./hao Alibaba Xixi Park/building" as an example, the completed address field is "prov=Zhejiang Province city=Hangzhou City district=Yuhang District road=Wenyi West Road roadno=No. 969 poi=Alibaba Xixi Park", and the extracted geographic interest point information is "Alibaba Xixi Park".

An address category determining module 104, configured to input the geographic interest point information into the address category machine learning model component to obtain an address category corresponding to the geographic interest point information.

Fig. 4 is a schematic structural diagram of a model component training device according to the present application, and referring to fig. 4, the model component training device 200 includes:

an address obtaining module 201, configured to obtain multiple pieces of address data.

In one embodiment of the present application, the shipping address may be used as a seed address, and the related addresses corresponding to the shipping address may be determined. For example, the related addresses of the shipping address are determined within a range centered on the shipping address with a preset distance as the radius.

A data crawling module 202, configured to determine the category information and point-of-interest data corresponding to the address data.

A feature extraction module 203, configured to perform feature extraction on the category information and the point-of-interest data.

A model training module 204, configured to train the address category machine learning model component according to the feature-extracted category information and point-of-interest data.

In an embodiment of the application, the model training module includes a first training module, configured to use the category information and the point of interest data after feature extraction as training data of the address category machine learning model component, and train the address category machine learning model component.

In one embodiment of the present application, the model training module comprises:

the data dividing module is used for dividing the category information and the interest point data after the features are extracted into first-class data and second-class data;

the second training module is used for taking the first class of data as training data of the address category machine learning model component and training the address category machine learning model component;

and the model test module is used for taking the second class data as test data of the address category machine learning model component and testing the trained address category machine learning model component.
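The division into first-class (training) data and second-class (test) data can be sketched as a shuffled split. The split ratio, the fixed random seed and the function name `split_samples` are assumptions made for illustration; the present application does not specify how the division is performed.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_samples(samples: Sequence[T], test_ratio: float = 0.2,
                  seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle the samples and return (first-class data, second-class data)."""
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be strictly between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


train_data, test_data = split_samples(list(range(10)), test_ratio=0.2)
print(len(train_data), len(test_data))
```

Holding out the second-class data lets the trained component be tested on samples it did not see during training.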

Having described the address category determination system for text data, the category identification device, and the model component training device of the present application, the method of the present application is described next with reference to the drawings. For the implementation of the method, reference may be made to the implementation of the system; details already covered are not repeated.

Fig. 5 is a schematic flowchart illustrating a method for determining address categories of text data according to the present application, and referring to fig. 5, the method includes:

S101: Perform named entity recognition on the text data to obtain an address field.

S102: Perform administrative division completion on the address field.

S103: Extract the geographic interest point information from the completed address field.

S104: Input the geographic interest point information into the address category machine learning model component to obtain the address category corresponding to the geographic interest point information.
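The four steps can be chained as a single function, sketched below under stated assumptions. The stages are passed in as callables so that any recognizer, completion routine or classifier can be plugged in; the function name, the dictionary representation of the address field and the "poi" key are hypothetical and not taken from the present application.

```python
from typing import Callable, Dict, Optional

AddressField = Dict[str, str]  # assumed representation: field name -> value


def determine_address_category(
    text: str,
    recognize: Callable[[str], AddressField],           # S101
    complete: Callable[[AddressField], AddressField],   # S102
    classify: Callable[[str], str],                     # S104
    poi_key: str = "poi",
) -> Optional[str]:
    """Run S101 to S104 in order; return None when no point of interest is found."""
    field = recognize(text)
    completed = complete(field)
    poi = completed.get(poi_key)                        # S103
    if not poi:
        return None
    return classify(poi)


# Stub stages, for illustration only.
def stub_recognize(text: str) -> AddressField:
    return {"city": "hangzhou", "poi": "arioba xi yuan"}


def stub_complete(field: AddressField) -> AddressField:
    return {**field, "prov": "zhejiang"}


def stub_classify(poi: str) -> str:
    return "office park" if "yuan" in poi else "other"


print(determine_address_category("some text", stub_recognize, stub_complete, stub_classify))
```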

Fig. 6 is a schematic flowchart illustrating the process of extracting geographic interest point information in the method for determining address categories of text data according to the present application. Referring to Fig. 6, step S101 includes:

S201: Perform address word segmentation on the text data to obtain word segmentation data.

S202: Perform named entity recognition on the word segmentation data.

In one embodiment of the present application, a conventional Conditional Random Field (CRF) model is trained on a labeled corpus of text addresses and then used to perform named entity recognition (NER) on the word segmentation data. In other embodiments of the present application, other models may also be employed for named entity recognition.

S203: Extract the address field from the named entity recognition result.

S204: Perform administrative division completion on the address field.

In one embodiment of the present application, the address field may be completed and/or corrected by comparison with a standard address library.

The standard address library can be constructed based on an address structuring hierarchy, which can comprise a plurality of levels of administrative divisions and address elements. The standard address library can comprise a plurality of standard addresses, each of which comprises fields such as administrative division fields, a road name field and a geographic interest point field.

In one embodiment provided herein, the address structuring may be divided into a different number of levels, and the standard address library constructed accordingly.

S205: Extract the geographic interest point information from the completed address field.
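The completion against a standard address library in step S204 can be illustrated with a minimal lookup. The sketch assumes the library is a small in-memory list of records and that a partial address field is completed only when exactly one standard address agrees with it; correction of wrong values is not covered. The library entries, field names and the function name `complete_address` are hypothetical.

```python
from typing import Dict, List, Optional

AddressField = Dict[str, str]

# Toy standard address library; entries and field names are illustrative only.
STANDARD_ADDRESSES: List[AddressField] = [
    {"prov": "zhejiang", "city": "hangzhou", "road": "wen xi lu", "poi": "arioba xi yuan"},
    {"prov": "jiangsu", "city": "nanjing", "road": "zhongshan lu", "poi": "example plaza"},
]


def complete_address(field: AddressField,
                     library: List[AddressField] = STANDARD_ADDRESSES
                     ) -> Optional[AddressField]:
    """Fill missing fields from the single standard address that agrees with `field`.

    Returns None when no standard address matches or when the match is ambiguous.
    """
    matches = [
        std for std in library
        if all(std[k] == v for k, v in field.items() if k in std)
    ]
    if len(matches) != 1:
        return None
    # Values already present in the input take precedence over library values.
    return {**matches[0], **field}


partial = {"city": "hangzhou", "poi": "arioba xi yuan", "roadno": "969"}
print(complete_address(partial))
```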

Fig. 7 is a flowchart illustrating a method for training an address category machine learning model component according to the present application. Referring to Fig. 7, the method includes:

S301: Acquire a plurality of pieces of address data.

In one embodiment of the present application, a receiving address may be used as a seed address, and the related addresses corresponding to the receiving address may be determined. For example, the addresses that lie within a circle centered on the receiving address, with a preset distance as its radius, are taken as the related addresses of the receiving address.

S302: Determine the category information and the point of interest data corresponding to the address data.

In an embodiment of the application, data crawling may be performed on the receiving address and the related address to obtain category information and interest point data corresponding to the receiving address and the related address. For example, the receiving address and the related address can be subjected to data crawling by means of an electronic map, and the electronic map automatically returns the category information of the receiving address and the related address and the point-of-interest data.

S303: Extract features from the category information and the point of interest data.

S304: Train the address category machine learning model component according to the category information and the point of interest data after feature extraction.
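Steps S303 and S304 can be illustrated with a small, self-contained sketch. The present application does not name a feature type or a model, so the choices here are substitutes made for illustration: character bigrams as features and a multinomial naive Bayes classifier with Laplace smoothing. The sketch also assumes that the crawled category information serves as the label for each piece of point of interest data. All names and training samples are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def char_ngrams(text: str, n: int = 2) -> List[str]:
    """Feature extraction (S303): overlapping character n-grams of the POI text."""
    text = text.strip().lower()
    if len(text) < n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NaiveBayesCategoryModel:
    """Multinomial naive Bayes over character n-grams (illustrative stand-in)."""

    def __init__(self, alpha: float = 1.0) -> None:
        self.alpha = alpha
        self.class_counts: Counter = Counter()
        self.feature_counts: Dict[str, Counter] = defaultdict(Counter)
        self.vocab: Set[str] = set()

    def fit(self, samples: Iterable[Tuple[str, str]]) -> "NaiveBayesCategoryModel":
        """Training (S304): samples are (poi_text, category) pairs."""
        for poi, category in samples:
            self.class_counts[category] += 1
            for gram in char_ngrams(poi):
                self.feature_counts[category][gram] += 1
                self.vocab.add(gram)
        return self

    def predict(self, poi: str) -> str:
        if not self.class_counts:
            raise ValueError("model has not been trained")
        total = sum(self.class_counts.values())
        grams = char_ngrams(poi)
        best, best_score = "", -math.inf
        for category, n_c in self.class_counts.items():
            counts = self.feature_counts[category]
            # One extra slot in the denominator is reserved for unseen n-grams.
            denom = sum(counts.values()) + self.alpha * (len(self.vocab) + 1)
            score = math.log(n_c / total)
            for gram in grams:
                score += math.log((counts[gram] + self.alpha) / denom)
            if score > best_score:
                best, best_score = category, score
        return best


# Hypothetical training samples: (point of interest data, category information).
training = [
    ("west lake hotel", "hotel"),
    ("grand plaza hotel", "hotel"),
    ("city central hospital", "hospital"),
    ("children hospital", "hospital"),
    ("no. 2 middle school", "school"),
    ("experimental primary school", "school"),
]
model = NaiveBayesCategoryModel().fit(training)
print(model.predict("lakeside hotel"))
```

In practice the trained component would be any classifier that maps point of interest features to an address category; the naive Bayes model above only makes the data flow concrete.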

In an embodiment of the application, training the address category machine learning model component according to the category information and the interest point data after feature extraction includes: using the category information and the interest point data after feature extraction as training data of the address category machine learning model component, and training the address category machine learning model component.

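In one embodiment of the present application, training the address category machine learning model component according to the category information and the interest point data after feature extraction includes: dividing the category information and the interest point data after feature extraction into first-class data and second-class data; using the first-class data as training data to train the address category machine learning model component; and using the second-class data as test data to test the trained address category machine learning model component.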

The present application further provides a computing device comprising a processor and a storage device, wherein the processor is adapted to implement instructions, and the storage device stores a plurality of instructions adapted to be loaded by the processor to execute the above-described method for determining an address category of text data.

The present application also proposes a computer-readable storage medium storing a computer program for executing the above-described method for determining an address category of text data.

In summary, the present application provides a method for determining an address category of text data, a category identification device, a system for determining an address category of text data, a computing device, and a computer-readable storage medium. Geographic interest point information is extracted by performing recognition on the text data in a behavior trajectory, and the trained address category machine learning model component then classifies the geographic interest point information to predict an accurate address category. Natural text data corresponding to the behavior trajectory data of a user is thus taken as input and an accurate address category is produced as output, which has wide application value in fields such as e-commerce and criminal investigation.

It should be noted that while the operations of the method of the present invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.

Although the present application provides method steps as described in an embodiment or flowchart, more or fewer steps may be included based on conventional or non-inventive means. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of orders and does not represent the only order of execution. When an apparatus or client product in practice executes, it may execute sequentially or in parallel (e.g., in a parallel processor or multithreaded processing environment, or even in a distributed data processing environment) according to the embodiments or methods shown in the figures. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, the presence of additional identical or equivalent elements in a process, method, article, or apparatus that comprises the recited elements is not excluded.

The units, devices, modules, etc. set forth in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. Of course, in implementing the present application, the functions of each module may be implemented in one or more software and/or hardware, or a module implementing the same function may be implemented by a combination of a plurality of sub-modules or sub-units, and the like. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

Those skilled in the art will also appreciate that, in addition to implementing the controller as pure computer readable program code, the same functionality can be implemented by logically programming method steps such that the controller is in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller may therefore be considered as a hardware component, and the means included therein for performing the various functions may also be considered as a structure within the hardware component. Or even means for performing the functions may be regarded as being both a software module for performing the method and a structure within a hardware component.

The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, classes, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, or the like, and includes several instructions for enabling a computer device (which may be a personal computer, a mobile terminal, a server, or a network device) to execute the method according to the embodiments or some parts of the embodiments of the present application.

The embodiments in the present specification are described in a progressive manner, and the same or similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable electronic devices, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.

While the present application has been described with examples, those of ordinary skill in the art will appreciate that there are numerous variations and permutations of the present application without departing from the spirit of the application, and it is intended that the appended claims encompass such variations and permutations without departing from the spirit of the application.

Claims

1. A method for determining an address category of text data, comprising:

performing administrative division completion on the address field;

2. The method of claim 1, wherein the performing named entity recognition on the text data and obtaining the address field comprises:

performing word segmentation on the text data to obtain word segmentation data;

carrying out named entity recognition on the word segmentation data;

extracting the address field from the named entity recognition result.

3. The method of claim 1, wherein performing administrative division completion on the address field comprises: completing and/or correcting the address field by comparison with a standard address library.

4. The method of claim 3, wherein the standard address library comprises:

administrative division fields;

a road name field;

a geographic point of interest field.

5. The method of claim 4, wherein extracting the geographical point of interest information in the complemented address field comprises:

determining the geographic interest point information based on the value of the geographic interest point field in the completed address field.

6. The method of claim 1, wherein the address category machine learning model component is trained based on:

acquiring a plurality of pieces of address data;

determining category information and interest point data corresponding to the address data;

extracting the characteristics of the category information and the interest point data;

and training the address category machine learning model component according to the category information after feature extraction and the interest point data.

7. A category identification device, comprising:

8. The category identification device of claim 7, wherein the named entity identification module comprises:

the word segmentation unit is used for segmenting words of the text data to obtain word segmentation data;

the recognition unit is used for carrying out named entity recognition on the word segmentation data;

9. The category identification device of claim 7, wherein the administrative region completion module is specifically configured to:

completing and/or correcting the address field by comparison with a standard address library.

10. The category identifying device of claim 9, wherein the standard address library comprises:

administrative division fields;

a road name field;

a geographic point of interest field.

11. The category identification device of claim 10, wherein the interest point extraction module is specifically configured to:

12. The category identification device of claim 7, wherein the address category machine learning model component is trained based on the steps of:

acquiring a plurality of pieces of address data;

13. An address category determination system for text data, comprising a category identification device;

14. The system of claim 13, further comprising a model component training device;

the model component training device is used for training the address category machine learning model component as follows:

acquiring a plurality of pieces of address data;

15. A computing device, wherein the computing device comprises: a processor adapted to implement instructions and a storage device storing instructions adapted to be loaded by the processor and to perform the method of any of claims 1 to 6.

16. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for performing the method of any of claims 1 to 6.