CN107463711A - Tag matching method and device for data - Google Patents

Info

Publication number
CN107463711A
Authority
CN
China
Prior art keywords
label
sample label
sample
target field
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710723820.7A
Other languages
Chinese (zh)
Other versions
CN107463711B (en)
Inventor
王颜
崔乐乐
王传超
徐宏伟
姚民伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Inspur Cloud Service Information Technology Co Ltd
Original Assignee
Shandong Inspur Cloud Service Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Inspur Cloud Service Information Technology Co Ltd
Priority to CN201710723820.7A
Publication of CN107463711A
Application granted
Publication of CN107463711B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a tag matching method and device for data. The method includes: building a sample label table that contains at least one sample label and the hierarchical relationship among the sample labels, where all sample labels belong to the same tag type; extracting, according to the tag type of the at least one sample label, the target field corresponding to that tag type from data obtained in advance, the target field containing at least one keyword; for each sample label, determining whether the target field contains a target keyword corresponding to the sample label and, if so, determining the sample label as a reference label; and determining, according to the reference labels and the hierarchical relationship among the sample labels, at least one matching label from the at least one sample label for the data to which the target field belongs. This scheme can improve the accuracy of tag matching.

Description

Tag matching method and device for data
Technical field
The present invention relates to the field of computer technology, and in particular to a tag matching method and device for data.
Background
Data analysis helps people judge data accurately and take appropriate action, and it plays an important role in practice. Data analysis in turn depends on cleaning, processing, and tag matching of the data.
Existing tag matching mainly retrieves an associated word corresponding to the type of the label and treats the data attached to the retrieved associated word as the data that matches the label. For example, when the label is "Beijing", the data collected from the internet is searched for the associated word "city". If the word is found, the text before it is assumed to be the keyword corresponding to the label, that is, the text before "city" is assumed to be the keyword "Beijing", and the label is then taken as the matching label of the data.
In this process the matching label is determined solely by retrieving the associated word; nothing checks whether the keyword attached to the associated word actually corresponds to the label content. For example, when the characters before the associated word "city" are garbled, this method still matches them to the label "Beijing", so the accuracy of tag matching is low.
Summary of the invention
The embodiments of the present invention provide a tag matching method and device for data that can improve the accuracy of tag matching.
In a first aspect, an embodiment of the present invention provides a tag matching method for data, including:
building a sample label table, the sample label table including at least one sample label and the hierarchical relationship among the sample labels, where all sample labels belong to the same tag type;
extracting, according to the tag type of the at least one sample label, the target field corresponding to the tag type from data obtained in advance;
for each sample label, performing:
determining whether the target field contains a target keyword corresponding to the sample label and, if so, determining the sample label as a reference label;
determining, according to the determined reference labels and the hierarchical relationship among the sample labels, at least one matching label from the at least one sample label for the data to which the target field belongs.
Preferably,
after extracting, according to the tag type of the at least one sample label, the target field corresponding to the tag type from the data obtained in advance, the method further includes:
setting, according to the data format of the at least one sample label, a lexical analyzer corresponding to the data format;
building a full-text index on the target field and specifying the lexical analyzer that was set;
splitting the target field into at least one keyword with the specified lexical analyzer;
in which case
determining whether the target field contains a target keyword corresponding to the sample label includes:
using the full-text index built on the target field to retrieve whether the at least one keyword contains a target keyword corresponding to the sample label.
Preferably,
before determining whether the target field contains a target keyword corresponding to the sample label, the method further includes:
setting a cursor for each level according to the hierarchical relationship among the sample labels;
in which case
determining whether the target field contains a target keyword corresponding to the sample label includes:
determining the cursor for the sample label according to the level of the sample label;
using the determined cursor to search the target field for the target keyword.
Preferably,
before performing, for each sample label, the determination of whether the target field contains a target keyword corresponding to the sample label, the method further includes:
for the at least one sample label at each level, performing: determining the character length of each sample label and sorting the sample labels by character length;
in which case
performing, for each sample label, the determination of whether the target field contains a target keyword corresponding to the sample label includes:
following the sorted order of the at least one sample label and using the cursor of the level to which the sample labels belong, determining in turn whether the target field contains a target keyword corresponding to each sample label.
Preferably,
determining, according to the determined reference labels and the hierarchical relationship among the sample labels, at least one matching label from the at least one sample label for the data to which the target field belongs includes:
determining, according to the hierarchical relationship, whether the at least one sample label contains a superior label of the reference label; if so, taking the superior label and the reference label as the matching labels; otherwise, taking the reference label as the matching label.
In a second aspect, an embodiment of the present invention provides a tag matching device for data, including a construction unit, a field extraction unit, and a tag matching unit, wherein:
the construction unit is configured to build a sample label table, the sample label table including at least one sample label and the hierarchical relationship among the sample labels, where all sample labels belong to the same tag type;
the field extraction unit is configured to extract, according to the tag type of the at least one sample label in the sample label table built by the construction unit, the target field corresponding to the tag type from data obtained in advance;
the tag matching unit is configured to perform, for each sample label: determining whether the target field extracted by the field extraction unit contains a target keyword corresponding to the sample label and, if so, determining the sample label as a reference label; and to determine, according to the determined reference labels and the hierarchical relationship among the sample labels in the sample label table, at least one matching label from the at least one sample label for the data to which the target field belongs.
Preferably,
the field extraction unit is further configured to set, according to the data format of the at least one sample label, a lexical analyzer corresponding to the data format; to build a full-text index on the target field and specify the lexical analyzer that was set; and to split the target field into at least one keyword with the specified lexical analyzer;
the tag matching unit is configured to use the full-text index built on the target field to retrieve whether the at least one keyword contains a target keyword corresponding to the sample label.
Preferably,
the device further includes a setting unit, wherein:
the setting unit is configured to set a cursor for each level according to the hierarchical relationship among the sample labels;
the tag matching unit is configured to determine, according to the level to which the sample label belongs, the cursor for the sample label from the per-level cursors set by the setting unit, and to use the determined cursor to search the target field for the target keyword.
Preferably,
the setting unit is further configured to perform, for the at least one sample label at each level: determining the character length of each sample label and sorting the sample labels by character length;
the tag matching unit is configured to follow the sorted order of the at least one sample label and, using the cursor of the level to which the sample labels belong, determine in turn whether the target field contains a target keyword corresponding to each sample label.
Preferably,
the tag matching unit is configured to determine, according to the hierarchical relationship, whether the at least one sample label contains a superior label of the reference label; if so, to take the superior label and the reference label as the matching labels; otherwise, to take the reference label as the matching label.
The embodiments of the present invention provide a tag matching method and device for data. The target field corresponding to the tag type of the sample labels is extracted from data collected in advance, and the target field is then checked for keywords that correspond exactly to sample labels. If such keywords exist, the matching labels of the data to which the target field belongs are determined according to the hierarchical relationship among the sample labels. Tag matching thus compares keywords directly against labels instead of matching the data cut out around an associated word, which improves the accuracy of tag matching.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of a tag matching method for data provided by one embodiment of the present invention;
Fig. 2 is a flow chart of a tag matching method for data provided by another embodiment of the present invention;
Fig. 3 is a structural diagram of a tag matching device for data provided by one embodiment of the present invention;
Fig. 4 is a structural diagram of a tag matching device for data provided by another embodiment of the present invention.
Detailed description of the embodiments
To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.
As shown in Fig. 1, an embodiment of the present invention provides a tag matching method for data, which may include the following steps:
Step 101: build a sample label table, the sample label table including at least one sample label and the hierarchical relationship among the sample labels, where all sample labels belong to the same tag type;
Step 102: extract, according to the tag type of the at least one sample label, the target field corresponding to the tag type from data obtained in advance;
Step 103: for each sample label, determine whether the target field contains a target keyword corresponding to the sample label and, if so, determine the sample label as a reference label;
Step 104: determine, according to the determined reference labels and the hierarchical relationship among the sample labels, at least one matching label from the at least one sample label for the data to which the target field belongs.
In the above embodiment, the target field corresponding to the tag type of the sample labels is extracted from data collected in advance, and the target field is then checked for keywords that correspond exactly to sample labels. If such keywords exist, the matching labels of the data to which the target field belongs are determined according to the hierarchical relationship among the sample labels. Tag matching thus compares keywords directly against labels instead of matching the data cut out around an associated word, which improves the accuracy of tag matching.
For example, the sample label table contains the province code, province name, and province abbreviation of each of the 23 provinces and 5 autonomous regions; the city code, city name, and city abbreviation of each of the 4 municipalities directly under the central government and of each city subordinate to a province or autonomous region; and the district code, district name, and district abbreviation of each district subordinate to a city. The hierarchical relationship among the sample labels is then the administrative level of each division, and the tag type of every label is the address type. During tag matching, the address field is first located in the mass of acquired internet data. For example, the address field "Jinan City Lixia District" may be found in a piece of text data, and it can then be determined that this field contains keywords corresponding to the sample labels "Jinan City" and "Lixia District". According to the hierarchical relationship in the sample label table, that is, the hierarchy of administrative divisions, Jinan City is found to belong to Shandong Province, so the sample labels "Shandong Province", "Jinan City", and "Lixia District" are taken as the matching labels of the text data to which the address field belongs. In this way the keywords corresponding to sample labels are first determined exactly and then matched further according to the hierarchical relationship among the sample labels. Keywords are matched directly against labels instead of matching the data cut out around an associated word, which improves the accuracy of tag matching.
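Steps 101 to 103 of this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SampleLabel` class, the three-row table, and the plain substring test are hypothetical stand-ins, not the data structures or the retrieval mechanism of the patent.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SampleLabel:
    name: str               # label text, e.g. "Jinan City"
    level: str              # "province", "city" or "district"
    parent: Optional[str]   # name of the superior label, None for a province


# Hypothetical three-row excerpt of an address-type sample label table.
SAMPLE_LABELS = [
    SampleLabel("Shandong Province", "province", None),
    SampleLabel("Jinan City", "city", "Shandong Province"),
    SampleLabel("Lixia District", "district", "Jinan City"),
]


def reference_labels(target_field: str, labels=SAMPLE_LABELS) -> List[str]:
    """Return every sample label whose text occurs in the target field."""
    return [label.name for label in labels if label.name in target_field]


print(reference_labels("Jinan City Lixia District"))
# prints ['Jinan City', 'Lixia District']
```

Because each label is compared with the field directly, a garbled prefix in front of the word "City" produces no reference label at all.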
In one embodiment of the present invention, after step 102, the method may further include:
setting, according to the data format of the at least one sample label, a lexical analyzer corresponding to the data format;
building a full-text index on the target field and specifying the lexical analyzer that was set;
splitting the target field into at least one keyword with the specified lexical analyzer.
A specific implementation of step 103 may then include: using the full-text index built on the target field to retrieve whether the at least one keyword contains a target keyword corresponding to the sample label.
For example, if the data obtained in advance is in text format, the lexical analyzer is set to a Chinese lexical analyzer. A full-text index is then built on the target field with this Chinese lexical analyzer specified, and the analyzer segments the target field into multiple keywords. For example, if the target field is "Jinan City Lixia District", the Chinese lexical analyzer splits it into the keywords "Jinan City" and "Lixia District". During tag matching, taking the sample label "Jinan City" as an example, the keywords split out are checked for a target keyword corresponding to "Jinan City". Here such a keyword exists, so the sample label "Jinan City" is taken as a reference label of the target field. Choosing a lexical analyzer that suits the data format allows the acquired data to be segmented more accurately and reduces errors such as wrong or incomplete splits. This helps sample labels match keywords exactly and further improves the accuracy of tag matching.
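The idea of splitting the target field into keywords and looking labels up in an index can be sketched as follows. The sketch assumes a tiny hypothetical vocabulary and uses forward maximum matching as a stand-in for a real Chinese lexical analyzer; the in-memory inverted index is likewise only an illustration of a full-text index, not the one the embodiment builds.

```python
from collections import defaultdict

# Hypothetical dictionary standing in for the lexer's vocabulary:
# Jinan City, Lixia District, Shandong Province.
VOCABULARY = {"济南市", "历下区", "山东省"}
MAX_WORD_LEN = max(len(word) for word in VOCABULARY)


def split_keywords(field):
    """Forward maximum matching: take the longest dictionary word at each position."""
    keywords, i = [], 0
    while i < len(field):
        for size in range(min(MAX_WORD_LEN, len(field) - i), 0, -1):
            candidate = field[i:i + size]
            # Unknown text falls back to single characters.
            if candidate in VOCABULARY or size == 1:
                keywords.append(candidate)
                i += size
                break
    return keywords


def build_index(fields):
    """Map each keyword to the set of row ids whose target field contains it."""
    index = defaultdict(set)
    for row_id, field in fields.items():
        for keyword in split_keywords(field):
            index[keyword].add(row_id)
    return index


index = build_index({1: "济南市历下区", 2: "山东省济南市"})
print(sorted(index["济南市"]))  # prints [1, 2]
```

A label lookup is then a single dictionary access on the keyword, rather than a scan of every field.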
In one embodiment of the present invention, before step 103, the method may further include:
setting a cursor for each level according to the hierarchical relationship among the sample labels.
A specific implementation of step 103 may then include:
determining the cursor for the sample label according to the level of the sample label;
using the determined cursor to search the target field for the target keyword.
Here, according to the administrative level of each division, a province-level cursor, a city-level cursor, and a district-level cursor are set. The province-level cursor covers the abbreviations of all provinces in the sample label table, the city-level cursor covers the abbreviations of all cities, and the district-level cursor covers the full names of all districts. Each cursor is then used to search the target field for target keywords. Taking the district-level cursor as an example, the full name of each district is used in turn as the keyword for searching the target field. If the target field contains a keyword identical to the full name of some district, the sample label of that district can be taken as a matching label of the data to which the target field belongs. The province-level and city-level cursors are used in the same way to complete province and city tag matching. Determining target keywords through the cursor of each sample label traverses every sample label against the target field automatically and avoids omissions, which improves both the efficiency and the accuracy of tag matching.
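The per-level traversal can be pictured with Python generators playing the role of database cursors. The rows, the level names, and the substring test below are assumptions made for illustration only.

```python
# Hypothetical label rows: (level, name).
LABEL_ROWS = [
    ("province", "山东"),      # Shandong
    ("city", "济南"),          # Jinan
    ("city", "青岛"),          # Qingdao
    ("district", "历下区"),    # Lixia District
]


def open_cursor(level):
    """Yield the labels of one level, one at a time, like a database cursor."""
    for row_level, name in LABEL_ROWS:
        if row_level == level:
            yield name


def match_level(level, target_field):
    """Walk the cursor of one level and collect the labels found in the field."""
    return [name for name in open_cursor(level) if name in target_field]


field = "山东济南历下区"
for level in ("province", "city", "district"):
    print(level, match_level(level, field))
```

Keeping one cursor per level means each level can use its own label form, for example abbreviations for provinces and full names for districts, as described above.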
In one embodiment of the present invention, before step 103, the method may further include:
for the at least one sample label at each level, performing: determining the character length of each sample label and sorting the sample labels by character length.
A specific implementation of step 103 may then include:
following the sorted order of the at least one sample label and using the cursor of the level to which the sample labels belong, determining in turn whether the target field contains a target keyword corresponding to each sample label.
For example, the full names of the districts are sorted in descending order of character length. The district-level cursor is then used to determine, in that sorted order, whether the target field contains a keyword corresponding to each district full name. This effectively avoids label mismatches caused by a sample label being matched only partially, for example the full district name "Dongcheng District" being matched as "city".
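A minimal sketch of longest-first matching is given below. Sorting by descending length follows the text; masking the matched span so that a shorter label cannot reuse it is an added assumption, since the embodiment does not say how the partial match is suppressed.

```python
def match_longest_first(target_field, labels):
    """Try labels from longest to shortest; blank out each matched span."""
    remaining = target_field
    matched = []
    for label in sorted(labels, key=len, reverse=True):
        if label in remaining:
            matched.append(label)
            # Assumed detail: mask the span so a shorter label cannot reuse it.
            remaining = remaining.replace(label, "\0" * len(label))
    return matched


# "东城区" is Dongcheng District; "城区" is a shorter label contained in it.
print(match_longest_first("北京市东城区", ["城区", "东城区"]))  # prints ['东城区']
```

With ascending order or without masking, the shorter label would also be reported for the same address.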
In one embodiment of the present invention, a specific implementation of step 104 may include:
determining, according to the hierarchical relationship, whether the at least one sample label contains a superior label of the reference label; if so, taking the superior label and the reference label as the matching labels; otherwise, taking the reference label as the matching label.
For example, when "Jinan City" is determined to be a reference label, the administrative levels show that Jinan City belongs to Shandong Province, that is, "Shandong Province" is the superior label of "Jinan City", so "Shandong Province" and "Jinan City" are taken as the matching labels of the data. As another example, when the determined reference label is "Shandong Province", no superior label exists for it, so only this label is taken as the matching label of the data. Multi-level tag matching of data is thereby achieved, which further improves the accuracy of tag matching.
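The superior-label lookup of step 104 can be sketched with a parent map. The `PARENT` dictionary is a hypothetical representation of the hierarchy, and following the chain up through every ancestor (rather than a single superior label) is an assumption chosen so that a district also yields its province.

```python
# Hypothetical parent map derived from the hierarchy in a sample label table.
PARENT = {
    "Lixia District": "Jinan City",
    "Jinan City": "Shandong Province",
    "Shandong Province": None,
}


def matching_labels(reference_label):
    """Return the reference label preceded by all of its superior labels."""
    chain = []
    current = reference_label
    while current is not None:
        chain.append(current)
        current = PARENT.get(current)
    return list(reversed(chain))


print(matching_labels("Jinan City"))         # prints ['Shandong Province', 'Jinan City']
print(matching_labels("Shandong Province"))  # prints ['Shandong Province']
```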
The tag matching method for data provided by the embodiments of the present invention is described in detail below, taking the matching of province-level, city-level, and district-level sample labels in an Oracle database as an example. As shown in Fig. 2, the method may include the following steps:
Step 201: build a sample label table, the sample label table including at least one province-level sample label, at least one city-level sample label, and at least one district-level sample label.
The latest administrative divisions (names and codes) published by the National Bureau of Statistics are collected from the internet, and the division data of all levels is imported into the database and processed to generate the administrative division code table DM_REGION. The table contains the province code PROVINCE_CODE, province name PROVINCE_NAME, and province abbreviation PROVINCE_JC of each of the 23 provinces and 5 autonomous regions; the city code CITY_CODE, city name CITY_NAME, and city abbreviation CITY_JC of each of the 4 municipalities directly under the central government and of each city subordinate to a province or autonomous region; and the district code COUNTY_CODE, district name COUNTY_NAME, and district abbreviation COUNTY_JC of each district subordinate to a city.
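The shape of DM_REGION can be tried out with Python's built-in sqlite3 module. The column names come from the text above; the column types, the one-row-per-district layout, and the sample row are assumptions, and SQLite merely stands in for Oracle here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE DM_REGION (
        PROVINCE_CODE TEXT, PROVINCE_NAME TEXT, PROVINCE_JC TEXT,
        CITY_CODE     TEXT, CITY_NAME     TEXT, CITY_JC     TEXT,
        COUNTY_CODE   TEXT, COUNTY_NAME   TEXT, COUNTY_JC   TEXT
    )
    """
)
# One illustrative row (Shandong Province, Jinan City, Lixia District);
# check codes against the officially published list before relying on them.
conn.execute(
    "INSERT INTO DM_REGION VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    ("370000", "山东省", "山东", "370100", "济南市", "济南", "370102", "历下区", "历下"),
)
provinces = conn.execute(
    "SELECT DISTINCT PROVINCE_JC FROM DM_REGION WHERE PROVINCE_JC IS NOT NULL"
).fetchall()
print(provinces)  # prints [('山东',)]
```

In this layout the hierarchy is implicit: every district row carries the codes of its city and province.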
Step 202: extract the address field from the network text data obtained in advance, and set a Chinese lexical analyzer.
For example, the operating database user is DAT_CL, the network text data obtained in advance is data collected from the internet and stored in the database table DAT_CL.ADDRESS_INFO, and the address field extracted from it is ADDRESS. Because the sample labels in the sample label table are address-type labels, the address field is extracted from the acquired network data. Because the address field is in text format, a Chinese lexical analyzer is set. For example, the Chinese lexical analyzer is chinese_lexer, and it can be set with at least the following statement: exec ctx_ddl.create_preference('lexer_1','chinese_lexer').
Step 203: add a full-text index to the address field with the Chinese lexical analyzer specified, and use the specified Chinese lexical analyzer to split the address field into at least one keyword.
Here a full-text index named IDX_ADDR is added to the extracted address field. In this example the process can be implemented with the following statement:
CREATE INDEX IDX_ADDR ON ADDRESS_INFO(ADDRESS)
INDEXTYPE IS CTXSYS.CONTEXT PARAMETERS('lexer lexer_1');
Step 204: set a province-level cursor, a city-level cursor, and a district-level cursor. The province-level cursor corresponds to the at least one province-level sample label, the city-level cursor corresponds to the at least one city-level sample label, and the district-level cursor corresponds to the at least one district-level sample label.
For example, the province-level cursor is cur_province, the city-level cursor is cur_city, and the district-level cursor is cur_county.
The province-level cursor can be declared with at least the following statements:
cur_province:
CURSOR CUR_PROVINCE IS
SELECT DISTINCT W.PROVINCE_JC
FROM DM_REGION W
WHERE W.PROVINCE_JC IS NOT NULL
ORDER BY LENGTH(W.PROVINCE_JC) DESC; -- longest abbreviation first, as in step 205 (sort key assumed)
ROW_PROVINCE CUR_PROVINCE%ROWTYPE;
Step 205: for the cursor of each level, determine the character length of each sample label and sort the sample labels in descending order of character length.
For example, the sample labels are sorted in descending order of the length of each city name, so that the sample labels of cities with longer names come first and those of cities with shorter names come last.
Step 206: traverse each cursor and, following the sorted order, determine in turn whether the address field contains a keyword corresponding to each sample label; if so, determine the corresponding sample label as a reference label.
Taking the province-level cursor as an example, the name of each province is used in turn, in the descending order above, as the keyword for searching the address field. If the address field contains a keyword identical to the name of some province, the sample label of that province can be taken as a matching label of the data to which the address field belongs.
Taking Shandong as an example, whether the address field contains a keyword corresponding to "Shandong" can be determined with the following statement:
INSERT INTO ADDRESS_INFO_LABEL
SELECT W.*, 'Shandong', '', ''
FROM ADDRESS_INFO W
WHERE CONTAINS(ADDRESS, 'Shandong', 1) > 0;
Similarly, the city-level cursor and the district-level cursor can be traversed to complete city-level and district-level tag matching. For example, the district-level cursor cur_county can be traversed, with the name of each district used in turn as the keyword for a full-text search of the address field of the collected data, to complete district tag matching.
Step 207: according to the administrative level of each sample label, judge whether the sample label table contains a superior label of the determined reference label; if so, perform step 208, otherwise perform step 209.
Step 208: take the superior label and the reference label as the matching labels of the text data to which the address field belongs, and end the current process.
Step 209: take the reference label as the matching label.
For example, the address field "Jinan City Lixia District" may be extracted from a piece of text data, and it can then be determined that this field contains keywords corresponding to the sample labels "Jinan City" and "Lixia District". According to the hierarchical relationship in the sample label table, that is, the hierarchy of administrative divisions, Jinan City is found to belong to Shandong Province, so the sample labels "Shandong Province", "Jinan City", and "Lixia District" are taken as the matching labels of the text data to which the address field belongs.
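The insert-on-match pattern of step 206 can be exercised end to end with sqlite3. This is a sketch under assumptions: SQLite's `instr()` is a plain substring test that stands in for the Oracle Text `CONTAINS` operator, and the `ID` column and the two sample addresses are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ADDRESS_INFO (ID INTEGER, ADDRESS TEXT)")
conn.execute(
    "CREATE TABLE ADDRESS_INFO_LABEL "
    "(ID INTEGER, ADDRESS TEXT, PROVINCE TEXT, CITY TEXT, COUNTY TEXT)"
)
conn.executemany(
    "INSERT INTO ADDRESS_INFO VALUES (?, ?)",
    [(1, "山东济南市历下区"), (2, "广西桂林市")],
)


def label_province(province):
    """Copy rows whose address contains the province keyword, tagging them."""
    # instr() is a substring test standing in for Oracle Text CONTAINS.
    conn.execute(
        "INSERT INTO ADDRESS_INFO_LABEL "
        "SELECT W.ID, W.ADDRESS, ?, '', '' FROM ADDRESS_INFO W "
        "WHERE instr(W.ADDRESS, ?) > 0",
        (province, province),
    )


# In the embodiment these keywords would come from the province-level cursor.
for keyword in ("山东", "广西"):
    label_province(keyword)

rows = conn.execute(
    "SELECT ID, PROVINCE FROM ADDRESS_INFO_LABEL ORDER BY ID"
).fetchall()
print(rows)  # prints [(1, '山东'), (2, '广西')]
```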
In addition, when matching with an Oracle database, first check whether the database has the user and role required for full-text indexing, and grant the corresponding privileges to the user performing tag matching; in the above embodiment the privileges are granted to user DAT_CL. For tag matching, the tag matching table ADDRESS_INFO_LABEL can be built in advance to store the matching results, with the three fields PROVINCE, CITY, and COUNTY added to it. During matching, a province is taken from the sample label table DM_REGION, for example Shandong Province with the abbreviation Shandong, and Shandong is used as the keyword for a full-text search of the address field ADDRESS. For each address that matches Shandong, the PROVINCE label value is set to Shandong and the data is inserted into ADDRESS_INFO_LABEL.
Furthermore, because collected address information is not standardized, the sample label table contains the full name, abbreviation, and code of each address. During tag matching each level is generally matched in several ways: by full name, by abbreviation, and additionally by code. For example, if an address field is "Guangxi Zhuang Autonomous Region xx Street", it can be matched by the abbreviation "Guangxi" or by the full name "Guangxi Zhuang Autonomous Region". If the address field is "Guangxi Guilin City", it can be matched to the sample label "Guangxi" through the province abbreviation, so its matching label is "Guangxi"; if matching were done only by the province full name "Guangxi Zhuang Autonomous Region", this address field could not be matched to the label "Guangxi". Using the full name, abbreviation, and code at each level therefore further improves the accuracy of tag matching.
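Matching one level by full name, abbreviation, or code can be sketched as below. The tuple layout, the order in which the three forms are tried, and the choice to return the abbreviation as the label value are assumptions made for illustration.

```python
# Hypothetical rows: (code, full name, abbreviation) for province-level labels.
PROVINCES = [
    ("450000", "广西壮族自治区", "广西"),   # Guangxi Zhuang Autonomous Region
    ("370000", "山东省", "山东"),           # Shandong Province
]


def match_province(address):
    """Return the abbreviation of the first province matched by any form."""
    for code, full_name, short_name in PROVINCES:
        # Try the full name, then the abbreviation, then the numeric code.
        if full_name in address or short_name in address or code in address:
            return short_name
    return None


print(match_province("广西壮族自治区xx街道"))  # prints 广西
print(match_province("广西桂林市"))            # prints 广西
```

The second address contains only the abbreviation, so a full-name-only matcher would return nothing for it.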
After tag matching is complete, the collected text data falls into four classes: data matched at all three levels (province, city, and county), data matched at two levels (province and city), data matched at the province level only, and data that matched no level. Data matched to a label at any level can provide a dimension for data analysis. In addition, the results for address data matched at one, two, or three levels need to be verified. Possible errors include a street or residential community that shares its name with a province, city, or county administrative division. Such data is small in quantity and can be corrected manually.
In one embodiment of the invention, the tag match method provided by the invention was tested on a data set of 168,608 text addresses using three-level (province, city, county) tag matching: 83.33% of the records were matched at all three levels, 11% at the province and city levels, 0.6% at the province level only, and 5.07% could not be matched to any administrative division. The test took less than 1 minute. The invention thus automatically completes, in batch and in a short time, the tag matching of massive text information collected from the Internet, adds labels to the collected data, and provides more analysis dimensions for data analysis.
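As a quick arithmetic check of the reported figures, the four shares can be summed and converted to approximate record counts. The counts are derived here from the rounded percentages and are not stated in the text; the variable names are illustrative.

```python
TOTAL = 168608  # number of text addresses in the test data set

# Shares reported for each matching outcome, in percent.
shares = {
    "province, city and county": 83.33,
    "province and city": 11.0,
    "province only": 0.6,
    "unmatched": 5.07,
}

# Approximate record counts implied by the rounded percentages.
approx_counts = {name: round(TOTAL * pct / 100) for name, pct in shares.items()}
total_pct = sum(shares.values())

print(approx_counts)
print(total_pct)
```

The shares add up to 100% within rounding, so the four classes cover the whole data set.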
As shown in Fig. 3, an embodiment of the invention provides a tag match device for data, including: a construction unit 301, a field extraction unit 302, and a tag match unit 303; wherein,
The construction unit 301 is configured to build a sample label table, the sample label table including at least one sample label and the hierarchical relationship of each sample label, where each sample label corresponds to the same tag type;
The field extraction unit 302 is configured to extract, from data obtained in advance, the target field corresponding to the tag type, according to the tag type of the at least one sample label in the sample label table built by the construction unit 301;
The tag match unit 303 is configured to perform the following for each sample label: determine whether the target field extracted by the field extraction unit 302 contains a target keyword corresponding to the sample label, and if so, determine the sample label to be a reference label; and, according to the determined reference label and the hierarchical relationship of each sample label in the sample label table, determine, from the at least one sample label, at least one matching label corresponding to the data to which the target field belongs.
In one embodiment of the invention, the field extraction unit 302 may be further configured to set a lexical analyzer corresponding to the data format of the at least one sample label; to establish a full-text index for the target field and specify the lexical analyzer that was set; and to use the specified lexical analyzer to split the target field into at least one keyword;
The tag match unit is configured to use the full-text index established for the target field to retrieve whether the at least one keyword contains a target keyword corresponding to the sample label.
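The idea of splitting the target field into keywords and retrieving labels through an index can be sketched in Python as follows. This is not the database's lexical analyzer: it substitutes a simple dictionary-based forward maximum matching splitter and an in-memory inverted index. The functions `split_keywords` and `build_index` and the vocabulary are hypothetical. Chinese strings are used because word splitting only matters for text written without spaces.

```python
from collections import defaultdict


def split_keywords(field, vocabulary):
    """Forward maximum matching: at each position take the longest known word."""
    keywords, i = [], 0
    max_len = max(len(word) for word in vocabulary)
    while i < len(field):
        for size in range(min(max_len, len(field) - i), 0, -1):
            piece = field[i:i + size]
            if piece in vocabulary:
                keywords.append(piece)
                i += size
                break
        else:
            i += 1  # no known word starts here; skip one character
    return keywords


def build_index(fields, vocabulary):
    """Map each keyword to the set of row ids whose target field contains it."""
    index = defaultdict(set)
    for row_id, field in fields.items():
        for keyword in split_keywords(field, vocabulary):
            index[keyword].add(row_id)
    return index


# Shandong Province, Shandong, Jinan City, Lixia District
vocabulary = {"山东省", "山东", "济南市", "历下区"}
fields = {1: "山东省济南市历下区", 2: "山东济南市"}
index = build_index(fields, vocabulary)

print(split_keywords(fields[1], vocabulary))
print(sorted(index["济南市"]))
```

Checking whether a sample label has a target keyword in some record then becomes a dictionary lookup in `index` rather than a scan of every address.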
As shown in Fig. 4, in one embodiment of the invention, the device may further include a setting unit 401; wherein,
The setting unit 401 is configured to set a cursor for each level according to the hierarchical relationship of the sample labels;
The tag match unit 303 is configured to determine, according to the level to which the sample label belongs, the cursor corresponding to the sample label from among the cursors set by the setting unit for the levels; and to use the determined cursor to search the target field for the target keyword.
In one embodiment of the invention, the setting unit 401 is further configured to perform the following for the at least one sample label of each level: determine the character length of each of the at least one sample label, and sort the at least one sample label by character length;
The tag match unit 303 is configured to determine in turn, following the sorted order of the at least one sample label and using the cursor of the level to which the at least one sample label belongs, whether the target field contains a target keyword corresponding to each sample label.
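The effect of sorting by character length can be shown with a small Python sketch. It assumes that longer labels are tried first, which this passage does not state explicitly, and it models each level's cursor as a generator over that level's sorted labels. `LEVEL_KEYWORDS`, `level_cursor`, and `first_match` are hypothetical names.

```python
# Hypothetical keywords per level; a real table would hold many more.
LEVEL_KEYWORDS = {
    "province": ["Jilin", "Jilin Province"],
    "city": ["Jilin City", "Changchun City"],
}


def level_cursor(level):
    """Yield one level's keywords longest first, like a cursor over sorted rows."""
    yield from sorted(LEVEL_KEYWORDS[level], key=len, reverse=True)


def first_match(address, level):
    """Return the first keyword of the given level found in the address."""
    for keyword in level_cursor(level):
        if keyword in address:
            return keyword
    return None


print(first_match("Jilin Province Jilin City", "province"))
print(first_match("Jilin Province Jilin City", "city"))
```

Trying "Jilin Province" before "Jilin" means the shorter form is used only when the longer one is absent, which reduces matches on an incomplete part of a name.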
In one embodiment of the invention, the tag match unit 303 is configured to determine, according to the hierarchical relationship, whether the at least one sample label contains a parent label of the reference label; if so, the parent label and the reference label are taken as the matching labels; otherwise the reference label alone is taken as the matching label.
The information exchange between the units of the above device and their execution processes are based on the same concept as the method embodiments of the invention; for details, refer to the description of the method embodiments, which is not repeated here.
The invention further provides a computer-readable medium including execution instructions; when a processor of a storage controller executes the execution instructions, the storage controller performs the method provided by any of the above embodiments of the invention.
In addition, the invention further provides a storage controller, including a processor, a memory, and a bus. The memory is configured to store execution instructions, and the processor is connected to the memory through the bus. When the storage controller runs, the processor executes the execution instructions stored in the memory, so that the storage controller performs the method provided by any of the above embodiments of the invention.
In summary, the embodiments of the invention have at least the following advantages:
1. In the embodiments of the invention, the target field corresponding to the tag type of the sample labels is extracted from data collected in advance; it is then determined whether the target field contains a keyword corresponding to a sample label, and if so, the matching labels for the data to which the target field belongs are determined according to the hierarchical relationship of the sample labels. Tag matching thus matches the keywords corresponding to the labels directly, rather than matching labels against data obtained by intercepting associated words, which improves the accuracy of tag matching.
2. In the embodiments of the invention, a lexical analyzer corresponding to the data format of the sample labels is determined and used to split at least one keyword out of the target field. Choosing a lexical analyzer that suits the data format allows the acquired data to be split into words more accurately and reduces the rate of errors such as wrong or incomplete splits. This favors accurate matching between sample labels and keywords and so further improves the accuracy of tag matching.
3. In the embodiments of the invention, a cursor is set for each level according to the hierarchical relationship of the sample labels, and the cursor is used to search the target field for a target keyword corresponding to a sample label. This allows each sample label to be checked against the target field automatically and avoids missing target fields, which improves both the efficiency and the accuracy of tag matching.
4. In the embodiments of the invention, the sample labels of each level are sorted by their character lengths. Following the sorted order and using the cursor of each level, it is determined in turn whether the target field contains a target keyword corresponding to each sample label. This effectively avoids mismatches caused by a sample label corresponding only partially to the text, and further improves the accuracy of tag matching.
5. In the embodiments of the invention, after a reference label is determined, it is determined from the hierarchical relationship between the sample labels whether the reference label has a parent label; if so, the reference label and its parent label are both determined to be matching labels. This realizes multi-level tag matching of data and further improves the accuracy of tag matching.
6. In the embodiments of the invention, when address-type labels are matched against collected data, the sample label table includes the full name, short name, and area code of each address. During tag matching, the full name, short name, and code are all used at each level, which improves the accuracy of tag matching.
It should be noted that herein, such as first and second etc relational terms are used merely to an entity Or operation makes a distinction with another entity or operation, and not necessarily require or imply and exist between these entities or operation Any this actual relation or order.Moreover, term " comprising ", "comprising" or its any other variant be intended to it is non- It is exclusive to include, so that process, method, article or equipment including a series of elements not only include those key elements, But also the other element including being not expressly set out, or also include solid by this process, method, article or equipment Some key elements.In the absence of more restrictions, the key element limited by sentence " including one ", is not arranged Except other identical factor in the process including the key element, method, article or equipment being also present.
One of ordinary skill in the art will appreciate that:Realizing all or part of step of above method embodiment can pass through Programmed instruction related hardware is completed, and foregoing program can be stored in computer-readable storage medium, the program Upon execution, the step of execution includes above method embodiment;And foregoing storage medium includes:ROM, RAM, magnetic disc or light Disk etc. is various can be with the medium of store program codes.
It is last it should be noted that:Presently preferred embodiments of the present invention is the foregoing is only, is merely to illustrate the skill of the present invention Art scheme, is not intended to limit the scope of the present invention.Any modification for being made within the spirit and principles of the invention, Equivalent substitution, improvement etc., are all contained in protection scope of the present invention.

Claims (10)

1. A tag match method for data, characterized by comprising:
building a sample label table, the sample label table including at least one sample label and the hierarchical relationship of each sample label, wherein each sample label corresponds to the same tag type;
extracting, from data obtained in advance and according to the tag type of the at least one sample label, a target field corresponding to the tag type;
performing the following for each sample label:
determining whether the target field contains a target keyword corresponding to the sample label, and if so, determining the sample label to be a reference label;
determining, from the at least one sample label and according to the determined reference label and the hierarchical relationship of each sample label, at least one matching label corresponding to the data to which the target field belongs.
2. The method according to claim 1, characterized in that
after extracting, from the data obtained in advance and according to the tag type of the at least one sample label, the target field corresponding to the tag type, the method further comprises:
setting, according to the data format of the at least one sample label, a lexical analyzer corresponding to the data format;
establishing a full-text index for the target field and specifying the lexical analyzer that was set;
splitting the target field into at least one keyword using the specified lexical analyzer;
and then,
determining whether the target field contains a target keyword corresponding to the sample label comprises:
using the full-text index established for the target field to retrieve whether the at least one keyword contains a target keyword corresponding to the sample label.
3. The method according to claim 1, characterized in that
before determining whether the target field contains a target keyword corresponding to the sample label, the method further comprises:
setting a cursor for each level according to the hierarchical relationship of the sample labels;
and then,
determining whether the target field contains a target keyword corresponding to the sample label comprises:
determining the cursor corresponding to the sample label according to the level of the sample label;
using the determined cursor to search the target field for the target keyword.
4. The method according to claim 3, characterized in that
before performing, for each sample label, the determination of whether the target field contains a target keyword corresponding to the sample label, the method further comprises:
performing the following for the at least one sample label of each level: determining the character length of each of the at least one sample label, and sorting the at least one sample label by character length;
and then,
performing, for each sample label, the determination of whether the target field contains a target keyword corresponding to the sample label comprises:
determining in turn, following the sorted order of the at least one sample label and using the cursor of the level to which the at least one sample label belongs, whether the target field contains a target keyword corresponding to each sample label.
5. The method according to claim 1, characterized in that
determining, from the at least one sample label and according to the determined reference label and the hierarchical relationship of each sample label, at least one matching label corresponding to the data to which the target field belongs comprises:
determining, according to the hierarchical relationship, whether the at least one sample label contains a parent label of the reference label; if so, taking the parent label and the reference label as the matching labels; otherwise taking the reference label as the matching label.
6. A tag match device for data, characterized by comprising: a construction unit, a field extraction unit, and a tag match unit; wherein,
the construction unit is configured to build a sample label table, the sample label table including at least one sample label and the hierarchical relationship of each sample label, wherein each sample label corresponds to the same tag type;
the field extraction unit is configured to extract, from data obtained in advance, a target field corresponding to the tag type, according to the tag type of the at least one sample label in the sample label table built by the construction unit;
the tag match unit is configured to perform the following for each sample label: determine whether the target field extracted by the field extraction unit contains a target keyword corresponding to the sample label, and if so, determine the sample label to be a reference label; and, according to the determined reference label and the hierarchical relationship of each sample label in the sample label table, determine, from the at least one sample label, at least one matching label corresponding to the data to which the target field belongs.
7. The device according to claim 6, characterized in that
the field extraction unit is further configured to set, according to the data format of the at least one sample label, a lexical analyzer corresponding to the data format; to establish a full-text index for the target field and specify the lexical analyzer that was set; and to split the target field into at least one keyword using the specified lexical analyzer;
the tag match unit is configured to use the full-text index established for the target field to retrieve whether the at least one keyword contains a target keyword corresponding to the sample label.
8. The device according to claim 6, characterized by further comprising: a setting unit; wherein,
the setting unit is configured to set a cursor for each level according to the hierarchical relationship of the sample labels;
the tag match unit is configured to determine, according to the level to which the sample label belongs, the cursor corresponding to the sample label from among the cursors set by the setting unit for the levels; and to use the determined cursor to search the target field for the target keyword.
9. The device according to claim 8, characterized in that
the setting unit is further configured to perform the following for the at least one sample label of each level: determine the character length of each of the at least one sample label, and sort the at least one sample label by character length;
the tag match unit is configured to determine in turn, following the sorted order of the at least one sample label and using the cursor of the level to which the at least one sample label belongs, whether the target field contains a target keyword corresponding to each sample label.
10. The device according to claim 6, characterized in that
the tag match unit is configured to determine, according to the hierarchical relationship, whether the at least one sample label contains a parent label of the reference label; if so, to take the parent label and the reference label as the matching labels; otherwise to take the reference label as the matching label.
CN201710723820.7A 2017-08-22 2017-08-22 Data tag matching method and device Active CN107463711B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710723820.7A CN107463711B (en) 2017-08-22 2017-08-22 Data tag matching method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710723820.7A CN107463711B (en) 2017-08-22 2017-08-22 Data tag matching method and device

Publications (2)

Publication Number Publication Date
CN107463711A true CN107463711A (en) 2017-12-12
CN107463711B CN107463711B (en) 2020-07-28

Family

ID=60549314

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710723820.7A Active CN107463711B (en) 2017-08-22 2017-08-22 Data tag matching method and device

Country Status (1)

Country Link
CN (1) CN107463711B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101283353A (en) * 2005-08-03 2008-10-08 温克科技公司 Systems for and methods of finding relevant documents by analyzing tags
CN101350012A (en) * 2007-07-18 2009-01-21 北京灵图软件技术有限公司 Method and system for matching address
CN101686146A (en) * 2008-09-28 2010-03-31 华为技术有限公司 Method and equipment for fuzzy query, query result processing and filtering condition processing
US20130232153A1 (en) * 2012-03-02 2013-09-05 Cleversafe, Inc. Modifying an index node of a hierarchical dispersed storage index
CN104123366A (en) * 2014-07-23 2014-10-29 谢建平 Search method and server
CN104375992A (en) * 2013-08-12 2015-02-25 中国移动通信集团浙江有限公司 Address matching method and device
CN104834736A (en) * 2015-05-19 2015-08-12 深圳证券信息有限公司 Method and device for establishing index database and retrieval method, device and system
CN104967565A (en) * 2015-05-28 2015-10-07 烽火通信科技股份有限公司 Method and system for hybrid processing of upstream label and downstream label
US20160246799A1 (en) * 2015-02-20 2016-08-25 International Business Machines Corporation Policy-based, multi-scheme data reduction for computer memory
CN106503276A (en) * 2017-01-06 2017-03-15 山东浪潮云服务信息科技有限公司 A kind of method and apparatus of the time series databases for real-time monitoring system
CN106682147A (en) * 2016-12-22 2017-05-17 北京锐安科技有限公司 Mass data based query method and device

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108959244A (en) * 2018-06-07 2018-12-07 北京京东尚科信息技术有限公司 The method and apparatus of address participle
CN108959244B (en) * 2018-06-07 2022-08-09 北京京东尚科信息技术有限公司 Address word segmentation method and device
WO2020177073A1 (en) * 2019-03-05 2020-09-10 深圳市天软科技开发有限公司 Data set acquisition method, terminal device, and computer readable storage medium
CN110097407A (en) * 2019-05-10 2019-08-06 宁波奥克斯电气股份有限公司 A kind of generation method and system of user tag
CN111198887A (en) * 2019-12-31 2020-05-26 北京左医健康技术有限公司 Medicine indexing method, medicine retrieval method and system
CN111626808A (en) * 2020-02-26 2020-09-04 京东数字科技控股有限公司 Data processing method and apparatus, storage medium, and electronic apparatus
CN112528100A (en) * 2020-12-18 2021-03-19 厦门市美亚柏科信息股份有限公司 Label strategy recommending and marking method, terminal equipment and storage medium

Also Published As

Publication number Publication date
CN107463711B (en) 2020-07-28

Similar Documents

Publication Publication Date Title
CN107463711A (en) A kind of tag match method and device of data
CN101093478B (en) Method and system for identifying Chinese full name based on Chinese shortened form of entity
CN103123618B (en) Text similarity acquisition methods and device
CN105653706A (en) Multilayer quotation recommendation method based on literature content mapping knowledge domain
CN106909611B (en) Hotel automatic matching method based on text information extraction
CN101882163A (en) Fuzzy Chinese address geographic evaluation method based on matching rule
CN107145584A (en) A kind of resume analytic method based on n gram models
CN103605665A (en) Keyword based evaluation expert intelligent search and recommendation method
CN103425687A (en) Retrieval method and system based on queries
CN103678576A (en) Full-text retrieval system based on dynamic semantic analysis
CN109145073A (en) A kind of address resolution method and device based on segmentation methods
CN104199965A (en) Semantic information retrieval method
CN109145260A (en) A kind of text information extraction method
CN104484380A (en) Personalized search method and personalized search device
CN101794307A (en) Vehicle navigation POI (Point of Interest) search engine based on internetwork word segmentation idea
CN106055621A (en) Log retrieval method and device
CN109933797A (en) Geocoding and system based on Jieba participle and address dictionary
CN109165273A (en) General Chinese address matching method facing big data environment
CN105095091B (en) A kind of software defect code file localization method based on Inverted Index Technique
CN109522396B (en) Knowledge processing method and system for national defense science and technology field
CN107203526A (en) A kind of query string semantic requirement analysis method and device
CN102646124A (en) Method for automatically identifying address information
CN107577744A (en) Nonstandard Address automatic matching model, matching process and method for establishing model
CN114780680A (en) Retrieval and completion method and system based on place name and address database
CN106980639B (en) Short text data aggregation system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant