WO2016112782A1 - Method and system for extracting a user's life circle - Google Patents

Method and system for extracting a user's life circle

Info

Publication number
WO2016112782A1
Authority
WO
WIPO (PCT)
Prior art keywords
training
segmentation
actual
address
minimum
Prior art date
Application number
PCT/CN2015/099766
Other languages
English (en)
French (fr)
Inventor
邵佳帅
牟川
邢志峰
Original Assignee
北京京东尚科信息技术有限公司
北京京东世纪贸易有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京京东尚科信息技术有限公司, 北京京东世纪贸易有限公司
Publication of WO2016112782A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9032 Query formulation
    • G06F16/90324 Query formulation using system suggestions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data

Definitions

  • The invention relates to the technical field of electronic commerce, and in particular to a method and system for extracting a user's life circle.
  • The delivery address filled in by a user shopping on an e-commerce website contains a wealth of information, and identifying the residential community name, office building name, or office location name in the user's address is very important to the e-commerce company.
  • Existing methods for extracting "life circle" keywords from addresses generally perform word segmentation using a self-compiled lexicon.
  • a method for extracting a user's life circle comprising:
  • the address segmentation training step includes: acquiring a plurality of segmentation training addresses for training, segmenting the segmentation training addresses to obtain segmentation minimum training results, and acquiring a segmentation training address set, where the segmentation training address set includes the segmentation minimum training results and training idiom type annotations describing the idiom type of each segmentation minimum training result; acquiring a segmentation feature template, where the segmentation feature template includes at least one segmentation feature for characterizing the segmentation training address set; and training on the segmentation training address set and the segmentation feature template by using a conditional random field model to obtain an address segmentation training model;
  • the address identification training step includes: acquiring a plurality of identification training addresses for training, segmenting the identification training addresses to obtain identification minimum training results, and selecting training sensitive words related to the life circle type from the identification minimum training results to obtain an identification training address set, where the identification training address set includes the training sensitive words and training life circle type identifiers describing the life circle type of each training sensitive word; acquiring an identification feature template, where the identification feature template includes at least one identification feature for characterizing the identification training address set; and training on the identification training address set and the identification feature template by using a conditional random field model to obtain an address identification training model;
  • the actual address obtaining step includes: obtaining at least one actual address of the user, and segmenting the actual address to obtain an actual minimum segmentation result;
  • the actual address segmentation step includes: inputting the actual minimum segmentation result into the address segmentation training model to obtain an actual idiom type annotation describing the idiom type of the actual minimum segmentation result, and recombining the actual minimum segmentation results into a life circle name according to their actual idiom type annotations;
  • the actual address identification step includes: selecting an actual sensitive word related to the life circle type from the actual minimum segmentation result, and inputting the actual sensitive word into the address identification training model to obtain an actual life circle type identifier describing the life circle type of the actual sensitive word;
  • the life circle extraction step includes: generating, for each of the actual addresses, a life circle including the life circle name and an actual life circle type identifier for the corresponding actual sensitive word.
  • a user's life circle extraction system includes:
  • the address segmentation training module is configured to: acquire a plurality of segmentation training addresses for training, segment the segmentation training addresses to obtain segmentation minimum training results, and acquire a segmentation training address set, where the segmentation training address set includes the segmentation minimum training results and training idiom type annotations describing the idiom type of each segmentation minimum training result; acquire a segmentation feature template, where the segmentation feature template includes at least one segmentation feature for characterizing the segmentation training address set; and train on the segmentation training address set and the segmentation feature template by using a conditional random field model to obtain an address segmentation training model;
  • the address identification training module is configured to: acquire a plurality of identification training addresses for training, segment the identification training addresses to obtain identification minimum training results, and select training sensitive words related to the life circle type from the identification minimum training results to obtain an identification training address set, where the identification training address set includes the training sensitive words and training life circle type identifiers describing the life circle type of each training sensitive word; acquire an identification feature template, where the identification feature template includes at least one identification feature for characterizing the identification training address set; and train on the identification training address set and the identification feature template by using a conditional random field model to obtain an address identification training model;
  • the actual address obtaining module is configured to: obtain at least one actual address of the user, and segment the actual address to obtain an actual minimum segmentation result;
  • the actual address segmentation module is configured to: input the actual minimum segmentation result into the address segmentation training model to obtain an actual idiom type annotation describing the idiom type of the actual minimum segmentation result, and recombine the actual minimum segmentation results into a life circle name according to their actual idiom type annotations;
  • the actual address identification module is configured to: select an actual sensitive word related to the life circle type from the actual minimum segmentation result, and input the actual sensitive word into the address identification training model to obtain an actual life circle type identifier describing the life circle type of the actual sensitive word;
  • the life circle extraction module is configured to: generate, for each of the actual addresses, a life circle including the life circle name and the actual life circle type identifier of the corresponding actual sensitive word.
  • The invention trains the address segmentation training model and the address identification training model on training addresses, and uses these two models to extract the life circle name and the actual life circle type identifier respectively, thereby accurately identifying the name and type of the life circle in the user's address.
  • FIG. 1 is a flowchart of a method for extracting a life circle of a user according to the present invention;
  • FIG. 2 is a schematic diagram of an example of a segmentation training address set;
  • FIG. 3 is a schematic diagram of an example of a segmentation feature template;
  • FIG. 4 is a schematic diagram of an example of an identification training address set;
  • FIG. 5 is a schematic diagram of an example of an identification feature template;
  • FIG. 6 is a schematic diagram of an example of segmentation annotation;
  • FIG. 7 is a structural block diagram of a life circle extraction system of a user according to the present invention.
  • FIG. 1 is a flowchart of a method for extracting a life circle of a user according to the present invention, including:
  • Step S101 comprising: acquiring a plurality of segmentation training addresses for training, segmenting the segmentation training addresses to obtain segmentation minimum training results, and acquiring a segmentation training address set, where the segmentation training address set includes the segmentation minimum training results and training idiom type annotations describing the idiom type of each segmentation minimum training result; acquiring a segmentation feature template, where the segmentation feature template includes at least one segmentation feature for characterizing the segmentation training address set; and training on the segmentation training address set and the segmentation feature template by using a conditional random field model to obtain an address segmentation training model;
  • Step S102 comprising: acquiring a plurality of identification training addresses for training, segmenting the identification training addresses to obtain identification minimum training results, and selecting training sensitive words related to the life circle type from the identification minimum training results to obtain an identification training address set, where the identification training address set includes the training sensitive words and training life circle type identifiers describing the life circle type of each training sensitive word; acquiring an identification feature template, where the identification feature template includes at least one identification feature for characterizing the identification training address set; and training on the identification training address set and the identification feature template by using a conditional random field model to obtain an address identification training model;
  • Step S103 comprising: acquiring at least one actual address of the user, and segmenting the actual address to obtain an actual minimum segmentation result;
  • Step S104 comprising: inputting the actual minimum segmentation result into the address segmentation training model to obtain an actual idiom type annotation describing the idiom type of the actual minimum segmentation result, and recombining the actual minimum segmentation results into a life circle name according to their actual idiom type annotations;
  • Step S105 comprising: selecting an actual sensitive word related to the life circle type from the actual minimum segmentation result, and inputting the actual sensitive word into the address identification training model to obtain an actual life circle type identifier describing the life circle type of the actual sensitive word;
  • Step S106 comprising: generating, for each of the actual addresses, a life circle including the life circle name and an actual life circle type identifier for the corresponding actual sensitive word.
  • In step S101, a plurality of segmentation training addresses are acquired for training and segmented to obtain segmentation minimum training results, and a conditional random field model is used for training.
  • Segmenting the segmentation training addresses to obtain segmentation minimum training results can be implemented with an existing automatic segmentation method, for example the existing word segmentation tool snailseg, an open-source minimal segmentation program whose source code can be downloaded from GitHub.
  • The segmentation minimum training result is the minimum segmentation of a training address. For example, the minimum segmentation result of "Beichen Century Center" is: "North", "Chen", "Century", "Center".
  • The segmentation training address set is obtained by adding, to each segmentation minimum training result, a training idiom type annotation describing its idiom type.
  • The training idiom type annotations can be produced by manually annotating all segmentation minimum training results. An idiom type annotation records the role that a segmentation minimum training result plays when words are formed.
  • The idiom types are: the beginning of a word, the middle or end of a word, and a standalone word.
  • For example, the minimum segmentation result of "Lize Middle Road" is "Lize", "Middle", "Road"; "Lize" is the beginning of the word, and "Middle" and "Road" are the middle or end of the word.
  • An address unit such as "XX Center" may be an "office building", a "company", or an "institution". For example, "Beichen Century Center" is an office building, while "Shoushan Fuhai Pension Center" is an institution. Specifying such rules manually would be cumbersome and might still fail to cover every case.
  • Conditional random field (CRF) theory can be used in natural language processing tasks such as sequence labeling, data segmentation, and chunking. It has been applied to Chinese natural language processing tasks such as Chinese word segmentation, Chinese name recognition, and ambiguity resolution, where it performs well.
  • The input sequence x is the training data.
  • The main CRF tools are CRF, FlexCRF, CRF++, and CRFsuite; the present invention preferably uses CRFsuite.
  • The segmentation feature template describes the segmentation training address set.
  • Both training data and a feature template are needed for training, so that the model is trained on each feature according to the previously written feature template.
  • The feature function is a unified representation of the state feature function and the transition feature function.
  • The feature function is usually a binary function whose value is either 1 or 0.
  • The conditional random field model uses the following feature function:
  • The above formula (Equation 1) is the feature function that the conditional random field model uses to learn whether a relationship described in the feature template actually holds.
  • The feature template describes relationships between words. During training, if the training data matches one of the features written in the feature template, the result of Equation 1 for that feature is 1; otherwise the result is 0. In other words, the result of Equation 1 is obtained from the training data together with the feature template.
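For orientation, a binary feature function of the kind described here is commonly written as follows in the CRF literature. The notation (input sequence $x$, position $i$, labels $y_{i-1}$ and $y_i$, feature index $k$) is an assumption chosen for illustration and may differ from the notation of Equation 1.

```latex
f_k(y_{i-1}, y_i, x, i) =
\begin{cases}
1, & \text{if the relationship described by feature } k \text{ holds at position } i
     \text{ of } x \text{ with labels } (y_{i-1}, y_i), \\
0, & \text{otherwise.}
\end{cases}
```

For instance, a feature derived from a template entry on the current minimum segmentation result could take the value 1 when that result is "Lize" and its label is the beginning of a word, and 0 in every other case.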
  • In step S101, the training data is the segmentation training address set and the feature template is the segmentation feature template.
  • In step S102, the identification training address set is the training data of the conditional random field model, and the identification feature template is its feature template.
  • The conditional random field model obtains a weight for each feature of the feature template by computing the feature functions. In step S104, after the actual minimum segmentation result is input into the segmentation training model, the weights of the features are used to calculate the probability of each possible idiom type of the actual minimum segmentation result, and the annotation of the idiom type with the largest expected value is selected as the actual idiom type annotation.
  • Likewise, the identifier of the life circle type with the largest expected value is selected as the actual life circle type identifier.
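The role of the weights can be summarized with the standard linear-chain CRF form, given here as a sketch under assumed notation: $\lambda_k$ is the learned weight of feature function $f_k$, $Z(x)$ is the normalization term, and $\hat{y}$ is the selected labeling. None of these symbols comes from the original text.

```latex
p(y \mid x) = \frac{1}{Z(x)}
  \exp\Big( \sum_{i} \sum_{k} \lambda_k \, f_k(y_{i-1}, y_i, x, i) \Big),
\qquad
\hat{y} = \arg\max_{y} \, p(y \mid x)
```

Under this reading, choosing the annotation or identifier with the largest value corresponds to picking the labeling $\hat{y}$ that the trained model considers most probable for the input sequence.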
  • In step S104, each actual minimum segmentation result is annotated with an actual idiom type. According to these annotations, one or more actual minimum segmentation results are recombined, and the result is the life circle name.
  • The annotations and identifiers mentioned above denote an idiom type or a life circle type by means of letters, symbols, characters, or numerical values.
  • Step S101 specifically includes:
  • segmenting the plurality of segmentation training addresses by using an automaton rule, each segmentation training address being segmented into at least one segmentation minimum training result, and generating a training address set that includes a plurality of segmentation training address groups, where each segmentation training address group includes at least one segmentation training unit, each segmentation training unit includes one segmentation minimum training result, and the segmentation minimum training results of the segmentation training units in the same segmentation training address group are obtained by segmenting the same segmentation training address;
  • acquiring the segmentation training address set, where the segmentation training address set adds, to each segmentation training unit, a training idiom type annotation describing the idiom type of its segmentation minimum training result within the same segmentation training address group;
  • acquiring the segmentation feature template, where the segmentation feature template includes at least one segmentation feature for characterizing the segmentation training address set.
  • Step S103 specifically includes: acquiring at least one actual address of the user, and segmenting the actual address by using an automaton rule to obtain actual minimum segmentation results, each actual address being segmented into at least one actual minimum segmentation result;
  • Step S104 specifically includes:
  • generating an actual segmentation address set that includes a plurality of actual segmentation address groups, where each actual segmentation address group includes at least one actual segmentation unit, each actual segmentation unit includes one actual minimum segmentation result, and the actual minimum segmentation results of the actual segmentation units in the same actual segmentation address group are obtained by segmenting the same actual address;
  • the segmentation training unit further includes: a sensitive word identifier indicating whether the segmentation minimum training result is a sensitive word, and the length of the segmentation minimum training result;
  • the actual segmentation unit further includes: a sensitive word identifier indicating whether the actual minimum segmentation result is a sensitive word, and the length of the actual minimum segmentation result;
  • the segmentation features include:
  • a first joint feature defined by the segmentation minimum training result, sensitive word identifier, or length of a first segmentation training unit whose relative displacement is a first preset value, together with the segmentation minimum training result, sensitive word identifier, or length of at least one second segmentation training unit whose relative displacement is a second preset value; or
  • Step S102 specifically includes:
  • segmenting the plurality of identification training addresses by using an automaton rule, each identification training address being segmented into at least one identification minimum training result; selecting training sensitive words related to the life circle type from the identification minimum training results; and generating a training address set that includes a plurality of identification training address groups, where each identification training address group includes at least one identification training unit, each identification training unit includes one training sensitive word, and the training sensitive words of the identification training units in the same identification training address group are obtained by segmenting the same identification training address;
  • acquiring the identification training address set, where the identification training address set adds, to each identification training unit, a training life circle type identifier describing the life circle type of its training sensitive word;
  • Step S103 specifically includes: acquiring at least one actual address of the user, and segmenting the actual address by using an automaton rule to obtain actual minimum segmentation results, each actual address being segmented into at least one actual minimum segmentation result;
  • Step S105 specifically includes:
  • selecting actual sensitive words related to the life circle type from the actual minimum segmentation results, and generating an actual identification address set that includes a plurality of actual identification address groups, where each actual identification address group includes at least one actual identification unit, each actual identification unit includes one actual sensitive word, and the actual sensitive words of the actual identification units in the same actual identification address group are obtained by segmenting the same actual address;
  • the identification features include:
  • As an example, the conditional random field model is implemented using CRFsuite: FIG. 2 shows an example of a segmentation training address set, FIG. 3 an example of a segmentation feature template, FIG. 4 an example of an identification training address set, and FIG. 5 an example of an identification feature template.
  • The end of an address often contains details specific to the house number, which are interference information for life circle extraction and need to be removed.
  • For example, after the interference information is removed, "Marketing Data Expert Group, Jingdong Mall, Room 1609, 16th Floor, Block A, Beichen Century Center, No. 8 Beichen West Road, Chaoyang District, Beijing" becomes "16th Floor, Block A, Beichen Century Center, No. 8 Beichen West Road, Chaoyang District".
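The removal of interference information can be sketched as follows. This is a minimal illustration, not the rule used in the original: it assumes the address has already been split into components ordered from the largest region to the smallest detail (as in a Chinese postal address), and the `ROOM_PATTERN` regular expression and the function name `strip_interference` are hypothetical.

```python
import re

# Hypothetical rule: a component naming a room or unit marks the start of the
# house-number detail. The pattern is an illustrative assumption.
ROOM_PATTERN = re.compile(r"^(Room|Unit|Suite)\s*\d+", re.IGNORECASE)


def strip_interference(components):
    """Return the leading components of an address, dropping the room-level tail.

    `components` is ordered from the largest region to the smallest detail.
    """
    for index, component in enumerate(components):
        if ROOM_PATTERN.match(component):
            return list(components[:index])
    return list(components)


address = [
    "Beijing", "Chaoyang District", "No. 8 Beichen West Road",
    "Beichen Century Center", "Block A", "16th Floor",
    "Room 1609", "Jingdong Mall", "Marketing Data Expert Group",
]
cleaned = strip_interference(address)
```

Everything from the first room-level component onward is discarded, so the cleaned address ends at the floor.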
  • As shown in FIG. 2, each row of the segmentation training address set is a segmentation training unit, one or more segmentation training units form a segmentation training address group, and adjacent segmentation training address groups are separated by a blank line.
  • Each segmentation training unit has four columns. The first column is the minimum segmented word, that is, the segmentation minimum training result. The second column is the sensitive word identifier, which takes one of two symbols: "+" if the segmentation minimum training result ends with a sensitive address term such as "road" or "university", and "-" otherwise. The third column is the length of the segmentation minimum training result, for example the length of "Lize". The fourth column is the training idiom type annotation, which takes one of three symbols: "B" if the segmentation minimum training result is the beginning of a word, "I" if it is the middle or end of a word, and "O" if it forms a word on its own.
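The four-column layout can be illustrated with a short sketch that derives the rows from manually grouped words. The list `SENSITIVE_TERMS`, the function name `training_units`, and the tab separator are assumptions made for illustration; the lengths are computed on the English stand-ins used in this text, so they differ from the character counts of the original Chinese tokens.

```python
# Hypothetical list of sensitive address terms; the text names only
# "road" and "university" as examples.
SENSITIVE_TERMS = ("Road", "University")


def training_units(words):
    """Turn manually grouped words into four-column segmentation training units.

    `words` is a list of words, each given as its list of minimum
    segmentation results, e.g. [["Lize", "Middle", "Road"]].
    """
    rows = []
    for pieces in words:
        for position, piece in enumerate(pieces):
            flag = "+" if piece.endswith(SENSITIVE_TERMS) else "-"
            if len(pieces) == 1:
                tag = "O"  # a piece that forms a word on its own
            elif position == 0:
                tag = "B"  # beginning of a word
            else:
                tag = "I"  # middle or end of a word
            rows.append((piece, flag, len(piece), tag))
    return rows


group = training_units([["Lize", "Middle", "Road"], ["Beijing"]])
# One text line per unit; the tab separator is an assumption.
lines = ["\t".join(str(column) for column in row) for row in group]
```

Dropping the fourth column from each row gives the three-column form used for actual addresses.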
  • The actual segmentation address set is similar to the segmentation training address set. The only difference is that each actual segmentation unit has three columns: the first column is the actual minimum segmentation result, the second column is the sensitive word identifier, and the third column is the length of the actual minimum segmentation result.
  • The address segmentation training model computes the actual idiom type annotation of each actual minimum segmentation result and fills it in as the fourth column of the corresponding actual segmentation unit. The words can then be restored by recognizing the combinations of "B", "I", and "O".
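Restoring words from the "B", "I", and "O" annotations can be sketched as below. The function name `recombine` and its `separator` parameter are hypothetical; for Chinese tokens the pieces would simply be concatenated, while a space is used here because the examples are English stand-ins.

```python
def recombine(pieces, tags, separator=""):
    """Rebuild words from minimum segmentation results and their B/I/O tags."""
    words, current = [], []
    for piece, tag in zip(pieces, tags):
        if tag == "I" and current:
            current.append(piece)  # continue the word in progress
            continue
        if current:
            words.append(separator.join(current))
            current = []
        if tag == "B":
            current = [piece]  # start a new word
        else:
            words.append(piece)  # "O" (or a stray "I") stands alone
    if current:
        words.append(separator.join(current))
    return words


pieces = ["Beijing", "Lize", "Middle", "Road", "Wangjing", "Science Park"]
tags = ["O", "B", "I", "I", "B", "I"]
names = recombine(pieces, tags, separator=" ")
```

A "B" always closes the word in progress before starting a new one, so consecutive words do not need an "O" between them.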
  • Each row in the segmentation feature template represents a segmentation feature.
  • When different software is used to implement the conditional random field, the representation of the segmentation feature template differs, but the effect achieved is the same.
  • For the CRFsuite implementation, the form of the segmentation feature template is shown in FIG. 3, where:
  • w, pos, and m respectively represent the segmentation minimum training result, its sensitive word identifier, and its length.
  • The second element of a segmentation feature is the relative displacement, which refers to another segmentation training unit whose position differs from that of the current segmentation training unit by a preset value. For example, ('w', 0) represents the segmentation minimum training result at relative displacement 0, that is, the segmentation minimum training result of the current segmentation training unit, and ('w', 1) represents the segmentation minimum training result of the segmentation training unit on the line following the current one.
  • During training, the conditional random field model calculates the probability of each segmentation feature under the different training idiom type annotations. For example, for the feature ('w', 0), it computes, for the segmentation minimum training result of each segmentation training unit, the probability that the training idiom type annotation is B, the probability that it is I, and the probability that it is O.
  • Each segmentation feature indicates a correlation among one or more segmentation training units. For example, if the segmentation minimum training result of one segmentation training unit is found to have a certain compositional relationship with that of the next segmentation training unit, the feature can be written as (('w', 0), ('w', 1)). For instance, on observing that "Wangjing" and "Science Park" have such a relationship (here, they can form a word), the feature can be written as (('w', 0), ('w', 1)).
  • A segmentation feature does not mean that every two words must be related; it only means that a relationship between them is possible. During training, once the segmentation feature template describes the relationship, the CRF automatically generates a feature function with which it can learn whether the relationship holds between the two segmentation minimum training results. As another example, m stands for the length of each word: if it is observed that a two-character word is often followed by a three-character word to form a word, the feature can be written as (('m', 0), ('m', 1)), and the CRF will automatically generate a feature function to learn whether such a relationship holds between the lengths of the two segmentation minimum training results.
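How a template of (column, relative displacement) pairs expands into concrete features can be sketched as follows. This is an illustration under assumptions, not the content of chunking.py: the `TEMPLATES` tuple, the function name `apply_templates`, and the feature string format are all chosen here for demonstration.

```python
# Each template is a tuple of (column name, relative displacement) pairs.
TEMPLATES = (
    (("w", 0),),
    (("w", 0), ("w", 1)),
    (("m", 0), ("m", 1)),
)


def apply_templates(units, position, templates=TEMPLATES):
    """Return the feature strings that fire for the unit at `position`.

    `units` is one address group: a list of dicts with the columns
    "w" (piece), "pos" (sensitive word identifier) and "m" (length).
    """
    features = []
    for template in templates:
        values = []
        for column, offset in template:
            index = position + offset
            if not 0 <= index < len(units):
                break  # the displacement leaves the group: no feature
            values.append(str(units[index][column]))
        else:
            name = "|".join(f"{column}[{offset}]" for column, offset in template)
            features.append(f"{name}={'|'.join(values)}")
    return features


units = [
    {"w": "Wangjing", "pos": "-", "m": 8},
    {"w": "Science Park", "pos": "-", "m": 12},
]
first = apply_templates(units, 0)
last = apply_templates(units, 1)
```

A joint feature such as (('w', 0), ('w', 1)) fires only where both displacements stay inside the same address group, which is why the last unit of a group yields fewer features than the first.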
  • Feature templates can be written by hand. From a large number of observations, the observer summarizes relationships between words and expresses them in the feature template. The CRF then automatically generates feature functions according to the feature template and learns whether such relationships hold between words.
  • Here, "words" means words that carry actual address meaning, that is, life circle names.
  • From the segmentation training address set train.txt, the data file needed for CRF training can be generated automatically according to the pre-written segmentation feature template chunking.py.
  • word.model is the model data of the address segmentation training model obtained by training.
  • For the 5000 addresses of the test set, an actual segmentation address set is generated according to the above steps and then tested by using the address segmentation training model.
  • The test command is:
  • test.crfsuite.txt is the actual segmentation address set and check.txt is the result file, which stores the results computed by the address segmentation training model. That is, for each actual segmentation unit in the actual segmentation address set test.crfsuite.txt, the model evaluates each feature in the feature template, selects the actual idiom type annotation with the highest probability, and adds it to the corresponding actual segmentation unit. Comparing these results with the manual annotations made beforehand gives the accuracy of the address segmentation training model.
  • Correct rate (precision) = number of correct pieces of information extracted / number of pieces of information extracted
  • Recall rate = number of correct pieces of information extracted / number of pieces of information in the sample
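The two ratios can be computed as in the sketch below. The function name `precision_recall` and the sample data are hypothetical, and the sketch assumes each extracted item is unique within the sample so that sets can be used.

```python
def precision_recall(extracted, reference):
    """Compute the correct rate (precision) and recall rate of extracted items."""
    extracted, reference = set(extracted), set(reference)
    correct = len(extracted & reference)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


# Illustrative data only: three extracted names, four manually annotated names.
predicted = {"Lize Middle Road", "Beichen Century", "Chaoyang District"}
manual = {"Lize Middle Road", "Beichen Century Center", "Chaoyang District", "Beijing"}
precision, recall = precision_recall(predicted, manual)
```

Here two of the three extracted names are correct and two of the four annotated names are recovered.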
  • real.crfsuite.txt is the actual segmentation address set
  • real.txt is the actual result file.
  • The identification training address set includes training sensitive words, which are extracted from the identification minimum training results.
  • The training sensitive words are a subset of the identification minimum training results. For example, in the identification training address set shown in FIG. 4, each row is an identification training unit, one or more identification training units form an identification training address group, and adjacent identification training address groups are separated by a blank line.
  • Each identification training unit has two columns. The first column is the training sensitive word, such as "road", "village", or "number yard". Training sensitive words are selected by a preset rule during automaton rule segmentation, and house numbers such as 501 and 12-01 are handled uniformly by setting them to num.
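The uniform handling of house numbers can be sketched with a small normalization step. The `HOUSE_NUMBER` pattern and the function name `normalize_sensitive_word` are assumptions for illustration; the original preset rule is not given in the text.

```python
import re

# Assumed pattern: digits, optionally joined by hyphens (e.g. "501", "12-01").
HOUSE_NUMBER = re.compile(r"^\d+(-\d+)*$")


def normalize_sensitive_word(token):
    """Replace a house number with the placeholder "num"; keep other tokens."""
    return "num" if HOUSE_NUMBER.match(token) else token


tokens = ["road", "501", "village", "12-01", "number yard"]
normalized = [normalize_sensitive_word(token) for token in tokens]
```

Collapsing all house numbers into one placeholder keeps the model from treating every distinct number as a separate sensitive word.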
  • the second column of the identification training unit is the training life circle type identifier, which is replaced by a number, and each number represents only one type.
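The two-column layout described above can be sketched as follows. This is an illustration under assumptions: the tab separator, the numeric type identifiers in the example, and all function names are hypothetical; only the num placeholder and the blank line between groups come from the description.

```python
import re

# House numbers such as 501 or 12-01: digits, optionally joined by hyphens.
HOUSE_NUMBER = re.compile(r"\d+(-\d+)*")


def normalize_token(token):
    """Map house numbers to the single placeholder token 'num'."""
    return "num" if HOUSE_NUMBER.fullmatch(token) else token


def format_group(units):
    """units: list of (sensitive_word, type_id) pairs taken from one address."""
    return "\n".join(f"{normalize_token(word)}\t{type_id}" for word, type_id in units)


def format_training_set(groups):
    """Join address groups with a blank line, as described for FIG. 4."""
    return "\n\n".join(format_group(group) for group in groups) + "\n"


# Hypothetical type identifiers: 3 = road, 7 = house number, 4 = village.
print(format_training_set([[("路", 3), ("12-01", 7)], [("村", 4)]]))
```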
  • The identification feature template is similar to the segmentation feature template. FIG. 5 shows an identification feature template written by an observer from the observed relationships between address units. Different software implementations of conditional random fields express a feature template in different forms, but the effect is the same. FIG. 5 shows the form the template takes when the conditional random field model is implemented with CRFsuite. Because the identification training address set has only two columns, the identification feature template only needs to use w.
  • The conditional random field can then use the relationships described by the feature template to compute the transition probability between every two features, which yields the trained model.
  • The identification training model is obtained by training,
  • with the following command: crfsuite learn -m new_word.model train.crfsuite.txt
  • The model word.model is obtained, and the actual addresses are then converted in the manner described above into an actual identification address set.
  • The actual identification address set differs from the identification training address set in that each actual identification unit contains only an actual sensitive word and no life circle type identifier.
  • The actual identification address set is labeled with the resulting identification training model, using the following command: crfsuite tag -r -m word.model test.crfsuite.txt > check.txt
  • The identification training model then adds an actual life circle type identifier to each actual sensitive word.
  • Finally, each actual life circle type identifier is translated into the corresponding life circle type, so that every actual address can be associated with its life circle type.
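  • As shown in FIG. 6, executing step S104 segments "北京市朝阳区北辰西路8号北辰世纪中心16层京东商城" into "北京市", "朝阳区", "北辰西路", "8号", "北辰世纪中心", "16层" and "京东商城". Executing step S105 then labels the life circle type of "北京市" as "市" (city), of "朝阳区" as "区" (district), of "北辰西路" as "路" (road), of "8号" as "号" (number), of "北辰世纪中心" as "写字楼" (office building), of "16层" as "楼层" (floor) and of "京东商城" as "单位" (organization). This yields the following life circles: "北京市生活圈", "朝阳区生活圈", "北辰西路生活圈", "8号生活圈", "北辰世纪中心写字楼生活圈", "16层楼层生活圈" and "京东商城单位生活圈".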
  • FIG. 7 is a structural block diagram of the user life circle extraction system of the present invention, which includes:
  • an address segmentation training module 701, configured to: obtain a plurality of segmentation training addresses for training; segment the segmentation training addresses to obtain minimal segmentation training results; obtain a segmentation training address set that includes the minimal segmentation training results and training word-formation type labels describing the word-formation type of the minimal segmentation training results; obtain a segmentation feature template that includes at least one segmentation feature describing the segmentation training address set; and train a conditional random field model on the segmentation training address set and the segmentation feature template to obtain an address segmentation training model;
  • an address identification training module 702, configured to: obtain a plurality of identification training addresses for training; segment the identification training addresses to obtain minimal identification training results; select, from the minimal identification training results, training sensitive words related to life circle types; obtain an identification training address set that includes the training sensitive words and training life circle type identifiers describing the life circle type of each training sensitive word; obtain an identification feature template that includes at least one identification feature describing the identification training address set; and train a conditional random field model on the identification training address set and the identification feature template to obtain an address identification training model;
  • an actual address obtaining module 703, configured to: obtain at least one actual address of the user and segment the actual address to obtain actual minimal segmentation results;
  • an actual address segmentation module 704, configured to: input the actual minimal segmentation results into the address segmentation training model to obtain actual word-formation type labels describing the word-formation type of the actual minimal segmentation results, and recombine the actual minimal segmentation results into a life circle name according to those labels;
  • an actual address identification module 705, configured to: select, from the actual minimal segmentation results, actual sensitive words related to life circle types, and input the actual sensitive words into the address identification training model to obtain actual life circle type identifiers describing the life circle type of the actual sensitive words;
  • a life circle type module 706, configured to: generate, for each actual address, a life circle that includes the life circle name and the actual life circle type identifier of the corresponding actual sensitive word.
  • In one embodiment, the address segmentation training module is specifically configured to:
  • segment the plurality of segmentation training addresses using automaton rules, each segmentation training address yielding at least one minimal segmentation training result, and generate a training address set comprising a plurality of segmentation training address groups, wherein each segmentation training address group includes at least one segmentation training unit, each segmentation training unit includes one minimal segmentation training result, and the minimal segmentation training results of the units in one group are obtained by segmenting the same segmentation training address;
  • obtain a segmentation training address set in which a training word-formation type label is added to each segmentation training unit to describe the word-formation type of its minimal segmentation training result within the same segmentation training address group;
  • obtain a segmentation feature template that includes at least one segmentation feature describing the segmentation training address set;
  • train a conditional random field model on the segmentation training address set and the segmentation feature template to obtain the address segmentation training model;
  • the actual address obtaining module is specifically configured to: obtain at least one actual address of the user and segment the actual address using automaton rules to obtain actual minimal segmentation results, each actual address yielding at least one actual minimal segmentation result;
  • the actual address segmentation module is specifically configured to:
  • generate an actual segmentation address set comprising a plurality of actual segmentation address groups, wherein each actual segmentation address group includes at least one actual segmentation unit, each actual segmentation unit includes one actual minimal segmentation result, and the actual minimal segmentation results of the units in one group are obtained by segmenting the same actual address;
  • input the actual segmentation address set into the address segmentation training model to obtain actual word-formation type labels describing the word-formation type of each actual minimal segmentation result within the same actual segmentation address group, and, according to the label of each actual minimal segmentation result, recombine the actual minimal segmentation results within the same group into a life circle name.
  • In one embodiment, the segmentation training unit further includes: a sensitive word flag indicating whether the minimal segmentation training result is a sensitive word, and the length of the minimal segmentation training result;
  • the actual segmentation unit further includes: a sensitive word flag indicating whether the actual minimal segmentation result is a sensitive word, and the length of the actual minimal segmentation result;
  • the segmentation features include:
  • a single minimal-result feature defined by the minimal segmentation training result of a first segmentation training unit whose relative offset is a first preset value; or
  • a single sensitive-word feature defined by the sensitive word flag of a first segmentation training unit whose relative offset is a first preset value; or
  • a single length feature defined by the length of a first segmentation training unit whose relative offset is a first preset value; or
  • a first joint feature defined jointly by the minimal segmentation training result of a first segmentation training unit whose relative offset is a first preset value and the minimal segmentation training result, sensitive word flag or length of at least one second segmentation training unit whose relative offset is a second preset value; or
  • a second joint feature defined jointly by the sensitive word flag of a first segmentation training unit whose relative offset is a first preset value and the minimal segmentation training result, sensitive word flag or length of at least one second segmentation training unit whose relative offset is a second preset value; or
  • a third joint feature defined jointly by the length of a first segmentation training unit whose relative offset is a first preset value and the minimal segmentation training result, sensitive word flag or length of at least one second segmentation training unit whose relative offset is a second preset value.
  • In one embodiment, the address identification training module is specifically configured to:
  • segment the plurality of identification training addresses using automaton rules, each identification training address yielding at least one minimal identification training result; select, from the minimal identification training results, training sensitive words related to life circle types; and generate a training address set comprising a plurality of identification training address groups, wherein each identification training address group includes at least one identification training unit, each identification training unit includes one training sensitive word, and the training sensitive words of the units in one group are obtained by segmenting the same identification training address;
  • obtain an identification training address set in which a training life circle type identifier is added to each identification training unit to describe the life circle type of its training sensitive word;
  • obtain an identification feature template that includes at least one identification feature describing the identification training address set;
  • train a conditional random field model on the identification training address set and the identification feature template to obtain the address identification training model;
  • the actual address obtaining module is specifically configured to: obtain at least one actual address of the user and segment the actual address using automaton rules to obtain actual minimal segmentation results, each actual address yielding at least one actual minimal segmentation result;
  • the actual address identification module is specifically configured to:
  • select, from the actual minimal segmentation results, actual sensitive words related to life circle types, and generate an actual identification address set comprising a plurality of actual identification address groups, wherein each actual identification address group includes at least one actual identification unit, each actual identification unit includes one actual sensitive word, and the actual sensitive words of the units in one group are obtained by segmenting the same actual address;
  • input the actual identification address set into the address identification training model to obtain actual life circle type identifiers describing the life circle type of the actual sensitive words.
  • In one embodiment, the identification features include:
  • a single sensitive-word feature defined by the training sensitive word of a first identification training unit whose relative offset is a first preset value; or
  • a joint sensitive-word feature defined jointly by the training sensitive word of a first identification training unit whose relative offset is a first preset value and the training sensitive word of at least one second identification training unit whose relative offset is a second preset value.


Abstract

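A method and system for extracting a user's life circle. The method includes: obtaining a plurality of segmentation training addresses for training and training a conditional random field model to obtain an address segmentation training model; obtaining a plurality of identification training addresses for training and training a conditional random field model to obtain an address identification training model; obtaining at least one actual address of the user, segmenting the actual address to obtain actual minimal segmentation results, inputting them into the address segmentation training model to obtain actual word-formation type labels, and recombining the actual minimal segmentation results into a life circle name; selecting actual sensitive words from the actual minimal segmentation results and inputting them into the address identification training model to obtain actual life circle type identifiers; and generating, for each actual address, a life circle that includes the life circle name and the actual life circle type identifier. The method accurately identifies the name and type of the life circle in a user's address.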

Description

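Method and System for Extracting a User's Life Circle
Technical Field
The present invention relates to the technical field of e-commerce, and in particular to a method and system for extracting a user's life circle.
Background
The delivery addresses that users fill in when shopping on e-commerce websites contain a wealth of information. Identifying the names of residential communities, office buildings or workplaces in those addresses is important work for e-commerce companies.
Existing approaches to extracting the "life circle" keywords of an address generally perform word segmentation and lookup against a self-organized lexicon.
A self-organized lexicon, however, cannot accurately extract the name and type of a "life circle" from a delivery address.
Summary of the Invention
In view of this, and to address the technical problem that the prior art cannot accurately extract the name and type of a "life circle" from a delivery address, it is necessary to provide a method and system for extracting a user's life circle.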
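A method for extracting a user's life circle, including:
an address segmentation training step, including: obtaining a plurality of segmentation training addresses for training; segmenting the segmentation training addresses to obtain minimal segmentation training results; obtaining a segmentation training address set that includes the minimal segmentation training results and training word-formation type labels describing the word-formation type of the minimal segmentation training results; obtaining a segmentation feature template that includes at least one segmentation feature describing the segmentation training address set; and training a conditional random field model on the segmentation training address set and the segmentation feature template to obtain an address segmentation training model;
an address identification training step, including: obtaining a plurality of identification training addresses for training; segmenting the identification training addresses to obtain minimal identification training results; selecting, from the minimal identification training results, training sensitive words related to life circle types; obtaining an identification training address set that includes the training sensitive words and training life circle type identifiers describing the life circle type of each training sensitive word; obtaining an identification feature template that includes at least one identification feature describing the identification training address set; and training a conditional random field model on the identification training address set and the identification feature template to obtain an address identification training model;
an actual address obtaining step, including: obtaining at least one actual address of the user and segmenting the actual address to obtain actual minimal segmentation results;
an actual address segmentation step, including: inputting the actual minimal segmentation results into the address segmentation training model to obtain actual word-formation type labels describing the word-formation type of the actual minimal segmentation results, and recombining the actual minimal segmentation results into a life circle name according to those labels;
an actual address identification step, including: selecting, from the actual minimal segmentation results, actual sensitive words related to life circle types, and inputting the actual sensitive words into the address identification training model to obtain actual life circle type identifiers describing the life circle type of the actual sensitive words;
a life circle extraction step, including: generating, for each actual address, a life circle that includes the life circle name and the actual life circle type identifier of the corresponding actual sensitive word.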
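A system for extracting a user's life circle, including:
an address segmentation training module, configured to: obtain a plurality of segmentation training addresses for training; segment the segmentation training addresses to obtain minimal segmentation training results; obtain a segmentation training address set that includes the minimal segmentation training results and training word-formation type labels describing the word-formation type of the minimal segmentation training results; obtain a segmentation feature template that includes at least one segmentation feature describing the segmentation training address set; and train a conditional random field model on the segmentation training address set and the segmentation feature template to obtain an address segmentation training model;
an address identification training module, configured to: obtain a plurality of identification training addresses for training; segment the identification training addresses to obtain minimal identification training results; select, from the minimal identification training results, training sensitive words related to life circle types; obtain an identification training address set that includes the training sensitive words and training life circle type identifiers describing the life circle type of each training sensitive word; obtain an identification feature template that includes at least one identification feature describing the identification training address set; and train a conditional random field model on the identification training address set and the identification feature template to obtain an address identification training model;
an actual address obtaining module, configured to: obtain at least one actual address of the user and segment the actual address to obtain actual minimal segmentation results;
an actual address segmentation module, configured to: input the actual minimal segmentation results into the address segmentation training model to obtain actual word-formation type labels describing the word-formation type of the actual minimal segmentation results, and recombine the actual minimal segmentation results into a life circle name according to those labels;
an actual address identification module, configured to: select, from the actual minimal segmentation results, actual sensitive words related to life circle types, and input the actual sensitive words into the address identification training model to obtain actual life circle type identifiers describing the life circle type of the actual sensitive words;
a life circle type module, configured to: generate, for each actual address, a life circle that includes the life circle name and the actual life circle type identifier of the corresponding actual sensitive word.
By training on training addresses, the present invention obtains an address segmentation training model and an address identification training model, and uses these two models to extract from each actual address the corresponding life circle name and actual life circle type identifier respectively, thereby accurately identifying the name and type of the life circle in a user's address.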
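Brief Description of the Drawings
FIG. 1 is a flowchart of the method for extracting a user's life circle according to the present invention;
FIG. 2 is a schematic diagram of an example segmentation training address set;
FIG. 3 is a schematic diagram of an example segmentation feature template;
FIG. 4 is a schematic diagram of an example identification training address set;
FIG. 5 is a schematic diagram of an example identification feature template;
FIG. 6 is a schematic diagram of an example of segmentation and identification;
FIG. 7 is a structural block diagram of the system for extracting a user's life circle according to the present invention.
Detailed Description
The present invention is described in further detail below with reference to the drawings and specific embodiments.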
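FIG. 1 is a flowchart of the method for extracting a user's life circle according to the present invention, which includes:
Step S101, including: obtaining a plurality of segmentation training addresses for training; segmenting the segmentation training addresses to obtain minimal segmentation training results; obtaining a segmentation training address set that includes the minimal segmentation training results and training word-formation type labels describing the word-formation type of the minimal segmentation training results; obtaining a segmentation feature template that includes at least one segmentation feature describing the segmentation training address set; and training a conditional random field model on the segmentation training address set and the segmentation feature template to obtain an address segmentation training model;
Step S102, including: obtaining a plurality of identification training addresses for training; segmenting the identification training addresses to obtain minimal identification training results; selecting, from the minimal identification training results, training sensitive words related to life circle types; obtaining an identification training address set that includes the training sensitive words and training life circle type identifiers describing the life circle type of each training sensitive word; obtaining an identification feature template that includes at least one identification feature describing the identification training address set; and training a conditional random field model on the identification training address set and the identification feature template to obtain an address identification training model;
Step S103, including: obtaining at least one actual address of the user and segmenting the actual address to obtain actual minimal segmentation results;
Step S104, including: inputting the actual minimal segmentation results into the address segmentation training model to obtain actual word-formation type labels describing the word-formation type of the actual minimal segmentation results, and recombining the actual minimal segmentation results into a life circle name according to those labels;
Step S105, including: selecting, from the actual minimal segmentation results, actual sensitive words related to life circle types, and inputting the actual sensitive words into the address identification training model to obtain actual life circle type identifiers describing the life circle type of the actual sensitive words;
Step S106, including: generating, for each actual address, a life circle that includes the life circle name and the actual life circle type identifier of the corresponding actual sensitive word.
By training on training addresses, the present invention obtains an address segmentation training model and an address identification training model, and uses these two models to extract from each actual address the corresponding life circle name and actual life circle type identifier respectively, thereby accurately identifying the name and type of the life circle in a user's address.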
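In step S101, a plurality of segmentation training addresses are obtained for training, segmented into minimal segmentation training results, and used to train a conditional random field model. The segmentation into minimal segmentation training results can be carried out with an existing automatic segmentation method, for example the existing word segmentation tool snailseg, an open-source minimal-segmentation program whose source code can be downloaded from GitHub. A minimal segmentation training result is obtained by segmenting a training address into its smallest units; for example, the minimal segmentation of "北辰世纪中心" (Beichen Century Center) is "北", "辰", "世纪", "中心".
A training word-formation type label describing the word-formation type of each minimal segmentation training result is added to it, which gives the segmentation training address set. The labels can be assigned manually to all minimal segmentation training results. A word-formation type label marks the role a minimal training result plays when it forms a word. Preferably, the word-formation types are: beginning of a word, middle or end of a word, and a word on its own. For example, the minimal segmentation of "利泽中二路" is "利泽", "中二", "路"; here "利泽" is the beginning of a word, and "中二" and "路" are the middle or end of a word. The minimal segmentation of "北三环东路北京化工大学" is "北三环", "东路", "北京化工大学"; here "北三环" is the beginning of a word, "东路" is the middle or end of a word, and "北京化工大学" is a word on its own.
Address processing runs into many problems that rules cannot solve.
(1) Address segmentation faces segmentation ambiguity and new addresses. An example of segmentation ambiguity is "中关村北大街" versus "中关村/北大街": rules can hardly decide where to cut. And whenever a new address appears, new rules have to be added, which is an endless task.
(2) Address labeling faces labeling ambiguity, which is illustrated below.
For example, an address unit of the form "XX中心" (XX Center) may be an office building, a company or an institution. "北辰世纪中心" (Beichen Century Center) is an office building, whereas "寿山福海养老中心" (Shoushan Fuhai Elderly Care Center) is an institution. Specifying rules by hand for such cases is tedious and does not necessarily solve the problem well.
Conditional random field (CRF) theory can be applied to natural language processing tasks such as sequence labeling, data segmentation and chunking. It has been applied to Chinese natural language processing tasks such as Chinese word segmentation, Chinese person-name recognition and ambiguity resolution, with good results.
For a given conditional random field, the input sequence x is the training data and the output sequence y is the labeling result. By computing the expectations corresponding to the conditional probabilities P(Y_i = y_i | x) and P(Y_{i-1} = y_{i-1}, Y_i = y_i | x), the y_i with the largest expected value is chosen as the result for x_i.
The main CRF tools currently available include CRF, FlexCRF, CRF++ and CRFsuite; the present invention preferably uses CRFsuite.
How does a CRF-based model solve the address segmentation and labeling problems described above? Address units are in fact correlated. For example, "北辰世纪中心" is followed by "12层" (12th floor), a unit of type "floor", so the probability that "北辰世纪中心" is an office building is higher than the probability that it is an institution. The conditional random field model picks up a great deal of such information during training and then gives accurate answers during subsequent labeling. This is how a CRF uses the relationships between neighboring words.
The segmentation feature template describes the features of the segmentation training address set. A conditional random field model needs both training data and a feature template for training, so that the weight of each feature is learned according to the feature template written in advance. A feature function is a unified representation of state feature functions and transition feature functions. It is usually a binary function that takes the value 1 or 0. The conditional random field model uses the following feature function: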
Figure PCTCN2015099766-appb-000001
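The formula above (Formula 1) is a feature function defined for the conditional random field model; it is used to learn whether what the feature template describes is actually meaningful. The feature template written in advance describes certain relationships between words, and training is then run on the training data. If a feature of the training data matches one of the features in the template, the result of Formula 1 for that template feature is 1; if not, the result is 0. In other words, the result of Formula 1 is produced by the training data together with the feature template.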
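Formula 1 is available only as an image in the original publication. For orientation, the standard textbook form of a linear-chain conditional random field with binary feature functions is given below. The symbols f_k, \lambda_k and Z(x) follow common convention and are an assumption here, not a transcription of the patent's own formula.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\!\Big( \sum_{i=1}^{n} \sum_{k} \lambda_k \, f_k(y_{i-1}, y_i, x, i) \Big),
\qquad
Z(x) = \sum_{y'} \exp\!\Big( \sum_{i=1}^{n} \sum_{k} \lambda_k \, f_k(y'_{i-1}, y'_i, x, i) \Big),
\qquad
f_k(y_{i-1}, y_i, x, i) \in \{0, 1\}
```

Each f_k fires (takes the value 1) when the pattern it encodes is present at position i, and the weight \lambda_k learned in training says how strongly that pattern supports the label.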
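In step S101 the training data is the segmentation training address set and the feature template is the segmentation feature template. Likewise, in step S102 the identification training address set is the training data of the conditional random field model and the identification feature template is its feature template.
By evaluating the feature functions, the conditional random field model obtains the weight of each feature in the feature template. In step S104, once the actual minimal segmentation results are input into the segmentation training model, the weights of the template features are used to compute the probabilities of the possible word-formation types of each actual minimal segmentation result, and the label of the word-formation type with the largest expected value is chosen as the actual word-formation type label. Likewise, in step S105 the identifier of the life circle type with the largest expected value is chosen as the actual life circle type identifier.
In step S104, one or more actual minimal segmentation results are recombined according to the actual word-formation type label of each result, and the outcome is the life circle name.
The labels and identifiers mentioned above represent a word-formation type or a life circle type by means of letters, symbols, text or numbers.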
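In one embodiment:
Step S101 specifically includes:
segmenting the plurality of segmentation training addresses using automaton rules, each segmentation training address yielding at least one minimal segmentation training result, and generating a training address set comprising a plurality of segmentation training address groups, wherein each segmentation training address group includes at least one segmentation training unit, each segmentation training unit includes one minimal segmentation training result, and the minimal segmentation training results of the units in one group are obtained by segmenting the same segmentation training address;
obtaining a segmentation training address set in which a training word-formation type label is added to each segmentation training unit to describe the word-formation type of its minimal segmentation training result within the same segmentation training address group;
obtaining a segmentation feature template that includes at least one segmentation feature describing the segmentation training address set;
training a conditional random field model on the segmentation training address set and the segmentation feature template to obtain the address segmentation training model;
Step S103 specifically includes: obtaining at least one actual address of the user and segmenting the actual address using automaton rules to obtain actual minimal segmentation results, each actual address yielding at least one actual minimal segmentation result;
Step S104 specifically includes:
generating an actual segmentation address set comprising a plurality of actual segmentation address groups, wherein each actual segmentation address group includes at least one actual segmentation unit, each actual segmentation unit includes one actual minimal segmentation result, and the actual minimal segmentation results of the units in one group are obtained by segmenting the same actual address;
inputting the actual segmentation address set into the address segmentation training model to obtain actual word-formation type labels describing the word-formation type of each actual minimal segmentation result within the same actual segmentation address group, and, according to the label of each actual minimal segmentation result, recombining the actual minimal segmentation results within the same group into a life circle name.
Preferably:
the segmentation training unit further includes: a sensitive word flag indicating whether the minimal segmentation training result is a sensitive word, and the length of the minimal segmentation training result;
the actual segmentation unit further includes: a sensitive word flag indicating whether the actual minimal segmentation result is a sensitive word, and the length of the actual minimal segmentation result;
the segmentation features include:
a single minimal-result feature defined by the minimal segmentation training result of a first segmentation training unit whose relative offset is a first preset value; or
a single sensitive-word feature defined by the sensitive word flag of a first segmentation training unit whose relative offset is a first preset value; or
a single length feature defined by the length of a first segmentation training unit whose relative offset is a first preset value; or
a first joint feature defined jointly by the minimal segmentation training result of a first segmentation training unit whose relative offset is a first preset value and the minimal segmentation training result, sensitive word flag or length of at least one second segmentation training unit whose relative offset is a second preset value; or
a second joint feature defined jointly by the sensitive word flag of a first segmentation training unit whose relative offset is a first preset value and the minimal segmentation training result, sensitive word flag or length of at least one second segmentation training unit whose relative offset is a second preset value; or
a third joint feature defined jointly by the length of a first segmentation training unit whose relative offset is a first preset value and the minimal segmentation training result, sensitive word flag or length of at least one second segmentation training unit whose relative offset is a second preset value.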
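In one embodiment:
Step S102 specifically includes:
segmenting the plurality of identification training addresses using automaton rules, each identification training address yielding at least one minimal identification training result; selecting, from the minimal identification training results, training sensitive words related to life circle types; and generating a training address set comprising a plurality of identification training address groups, wherein each identification training address group includes at least one identification training unit, each identification training unit includes one training sensitive word, and the training sensitive words of the units in one group are obtained by segmenting the same identification training address;
obtaining an identification training address set in which a training life circle type identifier is added to each identification training unit to describe the life circle type of its training sensitive word;
obtaining an identification feature template that includes at least one identification feature describing the identification training address set;
training a conditional random field model on the identification training address set and the identification feature template to obtain the address identification training model;
Step S103 specifically includes: obtaining at least one actual address of the user and segmenting the actual address using automaton rules to obtain actual minimal segmentation results, each actual address yielding at least one actual minimal segmentation result;
Step S105 specifically includes:
selecting, from the actual minimal segmentation results, actual sensitive words related to life circle types, and generating an actual identification address set comprising a plurality of actual identification address groups, wherein each actual identification address group includes at least one actual identification unit, each actual identification unit includes one actual sensitive word, and the actual sensitive words of the units in one group are obtained by segmenting the same actual address;
inputting the actual identification address set into the address identification training model to obtain actual life circle type identifiers describing the life circle type of the actual sensitive words.
Preferably:
the identification features include:
a single sensitive-word feature defined by the training sensitive word of a first identification training unit whose relative offset is a first preset value; or
a joint sensitive-word feature defined jointly by the training sensitive word of a first identification training unit whose relative offset is a first preset value and the training sensitive word of at least one second identification training unit whose relative offset is a second preset value.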
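In a preferred embodiment of the present invention, the conditional random field model is implemented with CRFsuite. FIG. 2 shows an example of a segmentation training address set, FIG. 3 an example of a segmentation feature template, FIG. 4 an example of an identification training address set, and FIG. 5 an example of an identification feature template.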
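The addresses that users fill in are messy, with problems such as inconsistent letter case and mixed full-width and half-width characters, so they need to be preprocessed as follows:
1) Character normalization
Convert lowercase letters to uppercase
Convert Chinese numerals in expressions such as "xx层" (floor xx) to digits
Convert Arabic numerals in expressions such as "xx环" (xx Ring Road) to Chinese numerals
Convert full-width characters to half-width
Convert traditional Chinese to simplified Chinese
2) Removal of useless characters
Remove useless leading and trailing characters
Handle spaces and \t characters
Remove punctuation and special symbols
and so on
3) Removal of useless information
Remove phone numbers
Remove email addresses
and so on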
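Part of the preprocessing listed above can be sketched with the Python standard library alone. The sketch is an illustration under assumptions, not the authors' implementation: it treats any run of seven or more digits as a phone number, keeps hyphens so that house numbers such as 12-01 survive, and leaves out the numeral conversions and the traditional-to-simplified conversion, which need lookup tables not described in the source. All names are hypothetical.

```python
import re
import unicodedata

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\d{7,}")   # assumption: a run of 7+ digits is a phone number
NOISE = re.compile(r"[^\w-]")   # whitespace, punctuation, symbols (hyphen kept)


def preprocess(address):
    """Normalize one raw address string."""
    text = unicodedata.normalize("NFKC", address)  # full-width -> half-width
    text = EMAIL.sub("", text)   # remove emails before '@' and '.' are stripped
    text = PHONE.sub("", text)
    text = text.upper()          # lowercase -> uppercase
    return NOISE.sub("", text)   # drop spaces, tabs, punctuation, symbols


print(preprocess("北京市 朝阳区ａ座，电话13800138000 abc@example.com"))
```

Email addresses are removed before punctuation so that the pattern can still see the "@" and "." characters it relies on.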
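In addition, the end of an address often carries details down to the room number, which are noise for life circle extraction and need to be removed. For example, after removing the noise, "北京市朝阳区北辰西路8号北辰世纪中心A座16层京东商城1609室营销数据专家组" becomes "朝阳区北辰西路8号北辰世纪中心A座16层".
After string preprocessing and noise removal, a large number of addresses are duplicates. To keep the subsequent random sampling from producing overly uniform data, the addresses need to be deduplicated.
From all the addresses processed by the three steps above, 5,000 are randomly drawn as the training set and 5,000 are randomly drawn as the test set. When the training-set addresses are used for the address segmentation training model they are the segmentation training addresses; when they are used for the address identification training model they are the identification training addresses.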
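Deduplication and sampling can be sketched as follows. The source does not say whether the training and test sets are disjoint; this sketch assumes they are, and the fixed seed is added only to make the example reproducible. The function name and parameters are hypothetical.

```python
import random


def split_addresses(addresses, train_size=5000, test_size=5000, seed=0):
    """Deduplicate addresses, then draw disjoint training and test sets at random."""
    unique = list(dict.fromkeys(addresses))  # drop duplicates, keep first occurrence
    if len(unique) < train_size + test_size:
        raise ValueError("not enough distinct addresses")
    sample = random.Random(seed).sample(unique, train_size + test_size)
    return sample[:train_size], sample[train_size:]


train, test = split_addresses(["a", "b", "a", "c", "d"], train_size=2, test_size=2)
print(len(train), len(test))
```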
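As shown in FIG. 2, each line of the segmentation training address set is one segmentation training unit, one or more segmentation training units form a segmentation training address group, and two adjacent segmentation training address groups are separated by a blank line. Each segmentation training unit has four columns. The first column is the minimally segmented word, i.e. the minimal segmentation training result. The second column is the sensitive word flag, which takes one of two symbols: "+" if the minimal segmentation training result ends with a sensitive address word such as "路" (road) or "大学" (university), and "-" otherwise. The third column is the length of the minimal segmentation training result; for example, the length of "利泽" is 2. The fourth column is the training word-formation type label, which takes one of three symbols: "B" if the minimal segmentation training result is the beginning of a word, "I" if it is the middle or end of a word, and "O" if it is a word on its own.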
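The four-column unit described for FIG. 2 can be generated as in the sketch below. The tab separator and the two-entry suffix list are assumptions made for illustration; the source names only "路" and "大学" as examples of sensitive address words, and a real list would be longer. Function names are hypothetical.

```python
# The two sensitive address words given as examples in the description.
SENSITIVE_SUFFIXES = ("路", "大学")


def training_unit(token, label):
    """Build one line: token, sensitive word flag, length, word-formation label."""
    flag = "+" if token.endswith(SENSITIVE_SUFFIXES) else "-"
    return f"{token}\t{flag}\t{len(token)}\t{label}"


def training_group(tokens, labels):
    """Build the lines for one address; groups are later joined by blank lines."""
    return "\n".join(training_unit(t, l) for t, l in zip(tokens, labels))


print(training_group(["利泽", "中二", "路"], ["B", "I", "I"]))
```

Dropping the label argument gives the three-column form used for actual addresses, where the model fills in the fourth column.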
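The actual segmentation address set is similar to the segmentation training address set. The only difference is that each actual segmentation unit has three columns: the first is the actual minimal segmentation result, the second is the sensitive word flag, and the third is the length of the actual minimal segmentation result. On receiving the actual segmentation address set, the segmentation training model computes and fills in the fourth column of each actual segmentation unit, that is, it adds an actual word-formation type label to each actual minimal segmentation result. The words can then be restored by reading the combinations of "B", "I" and "O".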
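Restoring words from the "B", "I" and "O" labels can be sketched as follows, using the label meanings given in the description (B begins a word, I continues or ends it, O stands alone). The function name is hypothetical, and treating a stray "I" with nothing to extend as a standalone word is a choice made for this sketch.

```python
def merge_labels(tokens, labels):
    """Recombine minimal segments into words according to their B/I/O labels."""
    words, current = [], ""
    for token, label in zip(tokens, labels):
        if label == "I" and current:
            current += token        # continue the word in progress
            continue
        if current:
            words.append(current)   # close the previous word
            current = ""
        if label == "B":
            current = token         # open a new multi-unit word
        else:                       # "O", or a stray "I" with nothing to extend
            words.append(token)
    if current:
        words.append(current)
    return words


print(merge_labels(["北三环", "东路", "北京化工大学"], ["B", "I", "O"]))
```

For the example from the description, "北三环" (B) and "东路" (I) are joined into "北三环东路", while "北京化工大学" (O) stays a word on its own.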
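As shown in FIG. 3, each line of the segmentation feature template represents one segmentation feature. Different software implementations of conditional random fields express the segmentation feature template in different forms, but the effect is the same. When the conditional random field model is implemented with CRFsuite, the segmentation feature template takes the form shown in FIG. 3, where:
w, pos and m denote the minimal segmentation training result, its sensitive word flag and its length, respectively. The second element of a segmentation feature is the relative offset, which refers to another segmentation training unit a preset number of positions away from the current segmentation training unit. For example, ('w', 0) denotes the minimal segmentation training result at relative offset 0, i.e. that of the current segmentation training unit, while ('w', 1) denotes the minimal segmentation training result of the unit on the next line.
During training, the conditional random field model computes, for each segmentation feature, its probability under the different training word-formation type labels. For the feature ('w', 0), for example, it tallies, for the minimal segmentation training result of each segmentation training unit, the probability that the corresponding label is B, the probability that it is I and the probability that it is O.
Each segmentation feature expresses a possible correlation among one or more segmentation training units. For instance, if the minimal segmentation training result of one unit is found to have some compositional relationship with that of the next unit, this can be described in the feature template as (('w', 0), ('w', 1)). For example, when "望京" and "科技园" are observed to have such a relationship (here, they can form one word), the feature template can contain (('w', 0), ('w', 1)).
For (('w', 0), ('w', 1)), the conditional random field model computes during training the probability of the feature ('w', 0) under the different training word-formation type labels and the probability of the feature ('w', 1) under the different labels, and then computes their joint probability.
A segmentation feature does not mean that every two words are necessarily related; it only expresses that a relationship is possible. During training, the CRF automatically generates a feature function from the relationship described in the segmentation feature template, and training determines whether the two minimal segmentation training results are in fact related. As another example, m denotes the length of each word. If it is observed that a two-character unit followed by a three-character unit often forms a word, the feature template can contain (('m', 0), ('m', 1)), and the CRF again generates a feature function to learn whether such a pattern exists between the lengths of two minimal segmentation training results. Similarly, to express that "天" on line 14 of FIG. 2 may form a word with "新月" on line 13 and "宾馆" on line 15, one writes (('w', -1), ('w', 0), ('w', 1)) in the segmentation feature template, and the CRF automatically builds a feature function to learn the joint probability of the three words.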
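How a template of (field, offset) pairs turns into concrete attributes for one position can be sketched as below. This is a self-contained illustration and not the chunking.py script itself: the attribute naming, the four template entries chosen, and the function name are assumptions; only the (field, offset) notation and the example words come from the description.

```python
# Template entries taken from the examples in the description.
TEMPLATES = (
    (("w", 0),),
    (("w", 0), ("w", 1)),
    (("m", 0), ("m", 1)),
    (("w", -1), ("w", 0), ("w", 1)),
)


def expand(units, t, templates=TEMPLATES):
    """Return attribute strings for position t; units is a list of dicts of strings."""
    attributes = []
    for template in templates:
        values = []
        for field, offset in template:
            p = t + offset
            if p < 0 or p >= len(units):
                values = None   # template reaches outside the address: skip it
                break
            values.append(units[p][field])
        if values is not None:
            name = "|".join(f"{field}[{offset}]" for field, offset in template)
            attributes.append(f"{name}={'|'.join(values)}")
    return attributes


units = [{"w": "望京", "m": "2"}, {"w": "科技园", "m": "3"}]
print(expand(units, 0))
```

At position 0 the three-word template is skipped because offset -1 falls before the start of the address, while the pair templates yield one joint attribute each.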
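The feature template can be written by hand: the observer summarizes word-to-word relationships from extensive observation and expresses them in the feature template, and the CRF then automatically generates feature functions from the template to learn whether those relationships hold between words.
The term "forming a word" above means forming a word that carries actual address meaning, i.e. one that serves as a life circle name.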
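The training procedure is as follows:
Run the command
cat train.txt | python chunking.py > train.crfsuite.txt
to convert the segmentation training address set train.txt, according to the pre-written segmentation feature template chunking.py, into the data file required for CRF training.
Train the model with the training command:
crfsuite learn -m word.model train.crfsuite.txt
where word.model holds the model data of the resulting address segmentation training model.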
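The 5,000 addresses of the test set are converted by the steps above into an actual segmentation address set in the same format, and the address segmentation training model is then tested on that set. The test command is:
crfsuite tag -r -m word.model test.crfsuite.txt > check.txt
Here test.crfsuite.txt is the actual segmentation address set and check.txt is the result file, which stores the output of the address segmentation training model. That is, for each actual segmentation unit in test.crfsuite.txt, the model evaluates every feature in the feature template, selects the actual word-formation type label with the highest probability, and adds it to the corresponding actual segmentation unit. Comparing these labels with the results annotated by hand beforehand gives the accuracy of the address segmentation training model.
In the first test, the model identified 4,566 "life circles" in the 5,000 addresses of the test set, of which 4,060 were identified correctly. Computed over the results of repeated training and testing, the model's precision is between 82% and 89% and its recall is between 90% and 95%, where:
Precision = number of correctly extracted items / number of extracted items;
Recall = number of correctly extracted items / number of items in the sample.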
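Actual user addresses can therefore be obtained and processed by running:
crfsuite tag -r -m word.model real.crfsuite.txt > real.txt
where real.crfsuite.txt is the actual segmentation address set and real.txt is the actual result.
Finally, the actual minimal segmentation results are recombined according to their actual word-formation type labels in the actual result; the recombined result is the life circle name.
Address identification uses a similar approach.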
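The identification training address set consists of training sensitive words, which are extracted from the minimal identification training results. When the same training set is used, the minimal identification training results can be identical to the minimal segmentation training results. The training sensitive words are a subset of the minimal identification training results. In the identification training address set shown in FIG. 4, each line is one identification training unit, one or more identification training units form an identification training address group, and two adjacent identification training address groups are separated by a blank line. Each identification training unit has two columns. The first column is the training sensitive word, for example "路" (road), "村" (village) or "号院" (numbered courtyard). Training sensitive words can be selected by preset rules during automaton-rule segmentation; house numbers such as 501 or 12-01 are handled uniformly and all replaced by num. The second column of the identification training unit is the training life circle type identifier. A number stands in for the type, and each number represents exactly one type.
The identification feature template is similar to the segmentation feature template. FIG. 5 shows an identification feature template written by an observer from the observed relationships between address units. Different software implementations of conditional random fields express a feature template in different forms, but the effect is the same. FIG. 5 shows the form the template takes when the conditional random field model is implemented with CRFsuite. Because the identification training address set has only two columns, the identification feature template only needs to use w.
The conditional random field can then use the relationships described by the feature template to compute the transition probability between every two features, which yields the trained model.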
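The identification training model is obtained by training, with the following command:
crfsuite learn -m new_word.model train.crfsuite.txt
The model word.model is obtained, and the actual addresses are then converted in the manner described above into an actual identification address set. The actual identification address set differs from the identification training address set in that each actual identification unit contains only an actual sensitive word and no life circle type identifier.
The actual identification address set is labeled with the resulting identification training model, using the following command:
crfsuite tag -r -m word.model test.crfsuite.txt > check.txt
The identification training model then adds an actual life circle type identifier to each actual sensitive word.
Finally, each actual life circle type identifier is translated into the corresponding life circle type, so that every actual address can be associated with its life circle type.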
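As shown in FIG. 6, executing step S104 segments "北京市朝阳区北辰西路8号北辰世纪中心16层京东商城" into "北京市", "朝阳区", "北辰西路", "8号", "北辰世纪中心", "16层" and "京东商城". Executing step S105 then labels the life circle type of "北京市" as "市" (city), of "朝阳区" as "区" (district), of "北辰西路" as "路" (road), of "8号" as "号" (number), of "北辰世纪中心" as "写字楼" (office building), of "16层" as "楼层" (floor) and of "京东商城" as "单位" (organization). This yields the following life circles: "北京市生活圈", "朝阳区生活圈", "北辰西路生活圈", "8号生活圈", "北辰世纪中心写字楼生活圈", "16层楼层生活圈" and "京东商城单位生活圈".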
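The way a name and a type are joined in the FIG. 6 example can be sketched as follows. The rule used here, appending the type only when the name does not already end with it, is inferred from the seven examples in the description and is an assumption, not a rule the source states. The function name is hypothetical.

```python
def life_circle(name, type_name):
    """Compose a life circle string from a life circle name and its type."""
    # Inferred rule: do not repeat the type if the name already ends with it.
    suffix = "" if name.endswith(type_name) else type_name
    return f"{name}{suffix}生活圈"


# Name and type pairs from the FIG. 6 example.
labelled = [
    ("北京市", "市"), ("朝阳区", "区"), ("北辰西路", "路"), ("8号", "号"),
    ("北辰世纪中心", "写字楼"), ("16层", "楼层"), ("京东商城", "单位"),
]
for name, type_name in labelled:
    print(life_circle(name, type_name))
```

Under this rule "北京市" with type "市" gives "北京市生活圈", while "北辰世纪中心" with type "写字楼" gives "北辰世纪中心写字楼生活圈", matching the listed results.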
如图7所示为本发明一种用户的生活圈提取系统的结构模块图,包括:
地址切分训练模块701,用于:获取多个用于进行训练的切分训练地址,对所述切分训练地址进行切分得到切分最小训练结果,获取切分训练地址集,所述切分训练地址集包括所述切分最小训练结果以及用于描述所述切分最小训练结果的成词类型的训练成词类型标注,获取切分特征模板,所述切分特征模板包括至少一条用于对所述切分训练地址集进行特征描述的切分特征,将所述切分训练地址集和所述切分特征模板采用条件随机场模型进行训练,得到地址切分训练模型;
地址标识训练模块702,用于:获取多个用于进行训练的标识训练地址,对所述标识训练地址进行切分得到标识最小训练结果,从所述标识最小训练结果中选取与生活圈类型相关的训练敏感词,获取标识训练地址集,所述标识训练地址集包括训练敏感词以及用于描述所述训练敏感词的生活圈类型的训练生活圈类型标识,获取标识特征模板,所述标识特征模板包括至少一条对所述标识训练地址集进行特征描述的标识特征,将所述标识训练地址集和所述标识特征模板采用条件随机场模型进行训练,得到地址标识训练模型;
实际地址获取模块703,用于:获取至少一个所述用户的实际地址,将所述实际地址进行切分得到实际最小切分结果;
实际地址切分模块704,用于:将所述实际最小切分结果输入所 述地址切分训练模型,得到用于描述所述实际最小切分结果的成词类型的实际成词类型标注,根据所述实际最小切分结果的实际成词类型标注,将所述实际最小切分结果重新组合为生活圈名称;
实际地址标识模块705,用于:从所述实际最小切分结果中选取与生活圈类型相关的实际敏感词,将所述实际敏感词输入所述地址标识训练模型,得到用于描述所述实际敏感词的生活圈类型的实际生活圈类型标识;
生活圈类型模块706,用于:对每个所述实际地址生成包括所述生活圈名称及对相应的实际敏感词的实际生活圈类型标识的生活圈。
在其中一个实施例中:
所述地址切分训练模块,具体用于:
将多个用于进行训练的切分训练地址采用自动机规则进行切分,每个所述切分训练地址切分后得到至少一个切分最小训练结果,生成包括多个切分训练地址组的训练地址集,每个所述切分训练地址组包括至少一个切分训练单元,每个所述切分训练单元包括一个所述切分最小训练结果,且同一切分训练地址组所包括的切分训练单元的切分最小训练结果由同一切分训练地址切分后得到;
获取切分训练地址集,所述切分训练地址集对每个所述切分训练单元添加用于描述所述切分最小训练结果在同一切分训练地址组内的成词类型的训练成词类型标注;
获取切分特征模板,所述切分特征模板包括至少一条用于对所述切分训练地址集进行特征描述的切分特征;
对所述切分训练地址集和所述切分特征模板,采用条件随机场模型进行训练,得到地址切分训练模型;
所述实际地址获取模块,具体用于:获取至少一个所述用户的实际地址,将所述实际地址采用自动机规则进行切分得到实际最小切分结果,每个所述实际地址切分后得到至少一个实际最小切分结果;
所述实际地址切分模块,具体用于:
生成包括多个实际切分地址组的实际切分地址集,每个所述实际 切分地址组包括至少一个实际切分单元,每个所述实际切分单元包括一个所述实际最小切分结果,且同一实际切分地址组所包括的实际切分单元的实际最小切分结果由同一实际地址切分后得到;
将所述实际切分地址集输入所述地址切分训练模型,得到用于描述所述实际最小切分结果在同一实际切分地址组内的成词类型的实际成词类型标注,根据每个所述实际最小切分结果对应的实际成词类型标注,将同一实际切分地址组内的实际最小切分结果重新组合为生活圈名称。
在其中一个实施例中:
所述切分训练单元还包括:切分最小训练结果是否为敏感词的敏感词标识、切分最小训练结果的长度;
所述实际切分单元还包括:实际最小切分结果是否为敏感词的敏感词标识、实际最小切分结果的长度;
所述切分特征包括:
由相对位移为第一预设数值的第一切分训练单元包括的切分最小训练结果定义的最小结果单个特征;或者
由相对位移为第一预设数值的第一切分训练单元包括的敏感词标识定义的敏感词单个特征;或者
由相对位移为第一预设数值的第一切分训练单元包括的长度定义的长度单个特征;或者
由相对位移为第一预设数值的第一切分训练单元包括的切分最小训练结果与至少一个相对位移为第二预设数值的第二切分训练单元包括的切分最小训练结果、敏感词标识或长度共同定义的第一联合特征;或者
由相对位移为第一预设数值的第一切分训练单元包括的敏感词标识与至少一个相对位移为第二预设数值的第二切分训练单元包括的切分最小训练结果、敏感词标识或长度共同定义的第二联合特征;或者
由相对位移为第一预设数值的第一切分训练单元包括的长度与至少一个相对位移为第二预设数值的第二切分训练单元包括的切分最小 训练结果、敏感词标识或长度共同定义的第三联合特征。
在其中一个实施例中;
所述地址标识训练模块,具体用于:
将多个用于进行训练的标识训练地址采用自动机规则进行切分,每个所述标识训练地址切分后得到至少一个标识最小训练结果,从所述标识最小训练结果中选取与生活圈类型相关的训练敏感词,生成包括多个标识训练地址组的训练地址集,每个所述标识训练地址组包括至少一个标识训练单元,每个所述标识训练单元包括一个所述训练敏感词,且同一标识训练地址组所包括的标识训练单元的训练敏感词由同一标识训练地址切分后得到;
获取标识训练地址集,所述标识训练地址集对每个所述标识训练单元添加用于描述所述训练敏感词的生活圈类型的训练生活圈类型标识;
获取标识特征模板,所述标识特征模板包括至少一条对所述标识训练地址集进行特征描述的标识特征;
对所述标识训练地址集和所述标识特征模板,采用条件随机场模型进行训练,得到地址标识训练模型;
所述实际地址获取模块,具体用于:获取至少一个所述用户的实际地址,将所述实际地址采用自动机规则进行切分得到实际最小切分结果,每个所述实际地址切分后得到至少一个实际最小切分结果;
实际地址标识模块,具体用于:
从所述实际最小切分结果中选取与生活圈类型相关的实际敏感词,生成包括多个实际标识地址组的实际标识地址集,每个所述实际标识地址组包括至少一个实际标识单元,每个所述实际标识单元包括一个所述实际敏感词,且同一实际标识地址组所包括的实际标识单元的实际敏感词由同一实际地址切分后得到;
将所述实际标识地址集输入所述地址标识训练模型,得到用于描述所述实际敏感词的生活圈类型的实际生活圈类型标识。
在其中一个实施例中:
所述标识特征包括:
由相对位移为第一预设数值的第一标识训练单元包括的训练敏感词定义的敏感词单个特征;或者
由相对位移为第一预设数值的第一标识训练单元包括的训练敏感词与至少一个相对位移为第二预设数值的第二标识训练单元包括的训练敏感词共同定义的敏感词联合特征。
以上所述实施例仅表达了本发明的几种实施方式,其描述较为具体和详细,但并不能因此而理解为对本发明专利范围的限制。应当指出的是,对于本领域的普通技术人员来说,在不脱离本发明构思的前提下,还可以做出若干变形和改进,这些都属于本发明的保护范围。因此,本发明专利的保护范围应以所附权利要求为准。

Claims (10)

  1. 一种用户的生活圈提取方法,其特征在于,包括:
    地址切分训练步骤,包括:获取多个用于进行训练的切分训练地址,对所述切分训练地址进行切分得到切分最小训练结果,获取切分训练地址集,所述切分训练地址集包括所述切分最小训练结果以及用于描述所述切分最小训练结果的成词类型的训练成词类型标注,获取切分特征模板,所述切分特征模板包括至少一条用于对所述切分训练地址集进行特征描述的切分特征,将所述切分训练地址集和所述切分特征模板采用条件随机场模型进行训练,得到地址切分训练模型;
    地址标识训练步骤,包括:获取多个用于进行训练的标识训练地址,对所述标识训练地址进行切分得到标识最小训练结果,从所述标识最小训练结果中选取与生活圈类型相关的训练敏感词,获取标识训练地址集,所述标识训练地址集包括训练敏感词以及用于描述所述训练敏感词的生活圈类型的训练生活圈类型标识,获取标识特征模板,所述标识特征模板包括至少一条对所述标识训练地址集进行特征描述的标识特征,将所述标识训练地址集和所述标识特征模板采用条件随机场模型进行训练,得到地址标识训练模型;
    实际地址获取步骤,包括:获取至少一个所述用户的实际地址,将所述实际地址进行切分得到实际最小切分结果;
    实际地址切分步骤,包括:将所述实际最小切分结果输入所述地址切分训练模型,得到用于描述所述实际最小切分结果的成词类型的实际成词类型标注,根据所述实际最小切分结果的实际成词类型标注,将所述实际最小切分结果重新组合为生活圈名称;
    实际地址标识步骤,包括:从所述实际最小切分结果中选取与生活圈类型相关的实际敏感词,将所述实际敏感词输入所述地址标识训练模型,得到用于描述所述实际敏感词的生活圈类型的实际生活圈类型标识;
    生活圈提取步骤,包括:对每个所述实际地址生成包括所述生活 圈名称及对相应的实际敏感词的实际生活圈类型标识的生活圈。
  2. 根据权利要求1所述的用户的生活圈提取方法,其特征在于:
    所述地址切分训练步骤,具体包括:
    将多个用于进行训练的切分训练地址采用自动机规则进行切分,每个所述切分训练地址切分后得到至少一个切分最小训练结果,生成包括多个切分训练地址组的训练地址集,每个所述切分训练地址组包括至少一个切分训练单元,每个所述切分训练单元包括一个所述切分最小训练结果,且同一切分训练地址组所包括的切分训练单元的切分最小训练结果由同一切分训练地址切分后得到;
    获取切分训练地址集,所述切分训练地址集对每个所述切分训练单元添加用于描述所述切分最小训练结果在同一切分训练地址组内的成词类型的训练成词类型标注;
    获取切分特征模板,所述切分特征模板包括至少一条用于对所述切分训练地址集进行特征描述的切分特征;
    对所述切分训练地址集和所述切分特征模板,采用条件随机场模型进行训练,得到地址切分训练模型;
    所述实际地址获取步骤,具体包括:获取至少一个所述用户的实际地址,将所述实际地址采用自动机规则进行切分得到实际最小切分结果,每个所述实际地址切分后得到至少一个实际最小切分结果;
    所述实际地址切分步骤,具体包括:
    生成包括多个实际切分地址组的实际切分地址集,每个所述实际切分地址组包括至少一个实际切分单元,每个所述实际切分单元包括一个所述实际最小切分结果,且同一实际切分地址组所包括的实际切分单元的实际最小切分结果由同一实际地址切分后得到;
    将所述实际切分地址集输入所述地址切分训练模型,得到用于描述所述实际最小切分结果在同一实际切分地址组内的成词类型的实际成词类型标注,根据每个所述实际最小切分结果对应的实际成词类型标注,将同一实际切分地址组内的实际最小切分结果重新组合为生活圈名称。
  3. 根据权利要求2所述的用户的生活圈提取方法,其特征在于:
    所述切分训练单元还包括:切分最小训练结果是否为敏感词的敏感词标识、切分最小训练结果的长度;
    所述实际切分单元还包括:实际最小切分结果是否为敏感词的敏感词标识、实际最小切分结果的长度;
    所述切分特征包括:
    由相对位移为第一预设数值的第一切分训练单元包括的切分最小训练结果定义的最小结果单个特征;或者
    由相对位移为第一预设数值的第一切分训练单元包括的敏感词标识定义的敏感词单个特征;或者
    由相对位移为第一预设数值的第一切分训练单元包括的长度定义的长度单个特征;或者
    由相对位移为第一预设数值的第一切分训练单元包括的切分最小训练结果与至少一个相对位移为第二预设数值的第二切分训练单元包括的切分最小训练结果、敏感词标识或长度共同定义的第一联合特征;或者
    由相对位移为第一预设数值的第一切分训练单元包括的敏感词标识与至少一个相对位移为第二预设数值的第二切分训练单元包括的切分最小训练结果、敏感词标识或长度共同定义的第二联合特征;或者
    由相对位移为第一预设数值的第一切分训练单元包括的长度与至少一个相对位移为第二预设数值的第二切分训练单元包括的切分最小训练结果、敏感词标识或长度共同定义的第三联合特征。
  4. 根据权利要求1所述的用户的生活圈提取方法,其特征在于:
    所述地址标识训练步骤,具体包括:
    将多个用于进行训练的标识训练地址采用自动机规则进行切分,每个所述标识训练地址切分后得到至少一个标识最小训练结果,从所述标识最小训练结果中选取与生活圈类型相关的训练敏感词,生成包括多个标识训练地址组的训练地址集,每个所述标识训练地址组包括 至少一个标识训练单元,每个所述标识训练单元包括一个所述训练敏感词,且同一标识训练地址组所包括的标识训练单元的训练敏感词由同一标识训练地址切分后得到;
    获取标识训练地址集,所述标识训练地址集对每个所述标识训练单元添加用于描述所述训练敏感词的生活圈类型的训练生活圈类型标识;
    获取标识特征模板,所述标识特征模板包括至少一条对所述标识训练地址集进行特征描述的标识特征;
    对所述标识训练地址集和所述标识特征模板,采用条件随机场模型进行训练,得到地址标识训练模型;
    所述实际地址获取步骤,具体包括:获取至少一个所述用户的实际地址,将所述实际地址采用自动机规则进行切分得到实际最小切分结果,每个所述实际地址切分后得到至少一个实际最小切分结果;
    实际地址标识步骤,具体包括:
    从所述实际最小切分结果中选取与生活圈类型相关的实际敏感词,生成包括多个实际标识地址组的实际标识地址集,每个所述实际标识地址组包括至少一个实际标识单元,每个所述实际标识单元包括一个所述实际敏感词,且同一实际标识地址组所包括的实际标识单元的实际敏感词由同一实际地址切分后得到;
    将所述实际标识地址集输入所述地址标识训练模型,得到用于描述所述实际敏感词的生活圈类型的实际生活圈类型标识。
  5. 根据权利要求4所述的用户的生活圈提取方法,其特征在于:
    所述标识特征包括:
    由相对位移为第一预设数值的第一标识训练单元包括的训练敏感词定义的敏感词单个特征;或者
    由相对位移为第一预设数值的第一标识训练单元包括的训练敏感词与至少一个相对位移为第二预设数值的第二标识训练单元包括的训练敏感词共同定义的敏感词联合特征。
  6. 一种用户的生活圈提取系统,其特征在于,包括:
    地址切分训练模块,用于:获取多个用于进行训练的切分训练地址,对所述切分训练地址进行切分得到切分最小训练结果,获取切分训练地址集,所述切分训练地址集包括所述切分最小训练结果以及用于描述所述切分最小训练结果的成词类型的训练成词类型标注,获取切分特征模板,所述切分特征模板包括至少一条用于对所述切分训练地址集进行特征描述的切分特征,将所述切分训练地址集和所述切分特征模板采用条件随机场模型进行训练,得到地址切分训练模型;
    地址标识训练模块,用于:获取多个用于进行训练的标识训练地址,对所述标识训练地址进行切分得到标识最小训练结果,从所述标识最小训练结果中选取与生活圈类型相关的训练敏感词,获取标识训练地址集,所述标识训练地址集包括训练敏感词以及用于描述所述训练敏感词的生活圈类型的训练生活圈类型标识,获取标识特征模板,所述标识特征模板包括至少一条对所述标识训练地址集进行特征描述的标识特征,将所述标识训练地址集和所述标识特征模板采用条件随机场模型进行训练,得到地址标识训练模型;
    实际地址获取模块,用于:获取至少一个所述用户的实际地址,将所述实际地址进行切分得到实际最小切分结果;
    实际地址切分模块,用于:将所述实际最小切分结果输入所述地址切分训练模型,得到用于描述所述实际最小切分结果的成词类型的实际成词类型标注,根据所述实际最小切分结果的实际成词类型标注,将所述实际最小切分结果重新组合为生活圈名称;
    实际地址标识模块,用于:从所述实际最小切分结果中选取与生活圈类型相关的实际敏感词,将所述实际敏感词输入所述地址标识训练模型,得到用于描述所述实际敏感词的生活圈类型的实际生活圈类型标识;
    生活圈类型模块,用于:对每个所述实际地址生成包括所述生活圈名称及对相应的实际敏感词的实际生活圈类型标识的生活圈。
  7. 根据权利要求6所述的用户的生活圈提取系统,其特征在于:
    所述地址切分训练模块,具体用于:
    将多个用于进行训练的切分训练地址采用自动机规则进行切分,每个所述切分训练地址切分后得到至少一个切分最小训练结果,生成包括多个切分训练地址组的训练地址集,每个所述切分训练地址组包括至少一个切分训练单元,每个所述切分训练单元包括一个所述切分最小训练结果,且同一切分训练地址组所包括的切分训练单元的切分最小训练结果由同一切分训练地址切分后得到;
    获取切分训练地址集,所述切分训练地址集对每个所述切分训练单元添加用于描述所述切分最小训练结果在同一切分训练地址组内的成词类型的训练成词类型标注;
    获取切分特征模板,所述切分特征模板包括至少一条用于对所述切分训练地址集进行特征描述的切分特征;
    对所述切分训练地址集和所述切分特征模板,采用条件随机场模型进行训练,得到地址切分训练模型;
    所述实际地址获取模块,具体用于:获取至少一个所述用户的实际地址,将所述实际地址采用自动机规则进行切分得到实际最小切分结果,每个所述实际地址切分后得到至少一个实际最小切分结果;
    所述实际地址切分模块,具体用于:
    生成包括多个实际切分地址组的实际切分地址集,每个所述实际切分地址组包括至少一个实际切分单元,每个所述实际切分单元包括一个所述实际最小切分结果,且同一实际切分地址组所包括的实际切分单元的实际最小切分结果由同一实际地址切分后得到;
    将所述实际切分地址集输入所述地址切分训练模型,得到用于描述所述实际最小切分结果在同一实际切分地址组内的成词类型的实际成词类型标注,根据每个所述实际最小切分结果对应的实际成词类型标注,将同一实际切分地址组内的实际最小切分结果重新组合为生活圈名称。
  8. 根据权利要求7所述的用户的生活圈提取系统,其特征在于:
    所述切分训练单元还包括:切分最小训练结果是否为敏感词的敏 感词标识、切分最小训练结果的长度;
    所述实际切分单元还包括:实际最小切分结果是否为敏感词的敏感词标识、实际最小切分结果的长度;
    所述切分特征包括:
    由相对位移为第一预设数值的第一切分训练单元包括的切分最小训练结果定义的最小结果单个特征;或者
    由相对位移为第一预设数值的第一切分训练单元包括的敏感词标识定义的敏感词单个特征;或者
    由相对位移为第一预设数值的第一切分训练单元包括的长度定义的长度单个特征;或者
    由相对位移为第一预设数值的第一切分训练单元包括的切分最小训练结果与至少一个相对位移为第二预设数值的第二切分训练单元包括的切分最小训练结果、敏感词标识或长度共同定义的第一联合特征;或者
    由相对位移为第一预设数值的第一切分训练单元包括的敏感词标识与至少一个相对位移为第二预设数值的第二切分训练单元包括的切分最小训练结果、敏感词标识或长度共同定义的第二联合特征;或者
    由相对位移为第一预设数值的第一切分训练单元包括的长度与至少一个相对位移为第二预设数值的第二切分训练单元包括的切分最小训练结果、敏感词标识或长度共同定义的第三联合特征。
  9. 根据权利要求6所述的用户的生活圈提取系统,其特征在于:
    所述地址标识训练模块,具体用于:
    将多个用于进行训练的标识训练地址采用自动机规则进行切分,每个所述标识训练地址切分后得到至少一个标识最小训练结果,从所述标识最小训练结果中选取与生活圈类型相关的训练敏感词,生成包括多个标识训练地址组的训练地址集,每个所述标识训练地址组包括至少一个标识训练单元,每个所述标识训练单元包括一个所述训练敏感词,且同一标识训练地址组所包括的标识训练单元的训练敏感词由同一标识训练地址切分后得到;
    获取标识训练地址集,所述标识训练地址集对每个所述标识训练单元添加用于描述所述训练敏感词的生活圈类型的训练生活圈类型标识;
    获取标识特征模板,所述标识特征模板包括至少一条对所述标识训练地址集进行特征描述的标识特征;
    对所述标识训练地址集和所述标识特征模板,采用条件随机场模型进行训练,得到地址标识训练模型;
    所述实际地址获取模块,具体用于:获取至少一个所述用户的实际地址,将所述实际地址采用自动机规则进行切分得到实际最小切分结果,每个所述实际地址切分后得到至少一个实际最小切分结果;
    实际地址标识模块,具体用于:
    从所述实际最小切分结果中选取与生活圈类型相关的实际敏感词,生成包括多个实际标识地址组的实际标识地址集,每个所述实际标识地址组包括至少一个实际标识单元,每个所述实际标识单元包括一个所述实际敏感词,且同一实际标识地址组所包括的实际标识单元的实际敏感词由同一实际地址切分后得到;
    将所述实际标识地址集输入所述地址标识训练模型,得到用于描述所述实际敏感词的生活圈类型的实际生活圈类型标识。
  10. 根据权利要求9所述的用户的生活圈提取系统,其特征在于:
    所述标识特征包括:
    由相对位移为第一预设数值的第一标识训练单元包括的训练敏感词定义的敏感词单个特征;或者
    由相对位移为第一预设数值的第一标识训练单元包括的训练敏感词与至少一个相对位移为第二预设数值的第二标识训练单元包括的训练敏感词共同定义的敏感词联合特征。
PCT/CN2015/099766 2015-01-13 2015-12-30 一种用户的生活圈提取方法及系统 WO2016112782A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510016140.2A CN104598573B (zh) 2015-01-13 2015-01-13 一种用户的生活圈提取方法及系统
CN201510016140.2 2015-01-13

Publications (1)

Publication Number Publication Date
WO2016112782A1 (zh) 2016-07-21

Family

ID=53124358

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/099766 WO2016112782A1 (zh) 2015-01-13 2015-12-30 Method and system for extracting a user's life circle

Country Status (2)

Country Link
CN (1) CN104598573B (zh)
WO (1) WO2016112782A1 (zh)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
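CN104598573B (zh) * 2015-01-13 2017-06-16 北京京东尚科信息技术有限公司 Method and system for extracting a user's life circle
CN104850538A (zh) * 2015-05-08 2015-08-19 裴克铭管理咨询(上海)有限公司 Composite Chinese address word segmentation technique based on rules and statistical models
CN104933023B (zh) * 2015-05-12 2017-09-01 深圳市华傲数据技术有限公司 Chinese address word segmentation and labeling method
CN104933024B (zh) * 2015-05-12 2017-09-01 深圳市华傲数据技术有限公司 Chinese address word segmentation and labeling method
CN111858937B (zh) * 2016-12-14 2024-04-30 创新先进技术有限公司 Method and device for identifying false address information
CN111274802B (zh) * 2018-11-19 2023-04-18 阿里巴巴集团控股有限公司 Method and device for judging the validity of address data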

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008140117A (ja) * 2006-12-01 2008-06-19 National Institute Of Information & Communication Technology Apparatus for segmenting a Chinese character sequence into a Chinese word sequence
CN102360383A (zh) * 2011-10-15 2012-02-22 西安交通大学 Text-oriented method for extracting domain terms and term relations
CN102479191A (zh) * 2010-11-22 2012-05-30 阿里巴巴集团控股有限公司 Method and device for providing multi-granularity word segmentation results
US20140012569A1 (en) * 2012-07-03 2014-01-09 National Taiwan Normal University System and Method Using Data Reduction Approach and Nonlinear Algorithm to Construct Chinese Readability Model
CN103678684A (zh) * 2013-12-25 2014-03-26 沈阳美行科技有限公司 Chinese word segmentation method based on navigation information retrieval
CN104598573A (zh) * 2015-01-13 2015-05-06 北京京东尚科信息技术有限公司 Method and system for extracting a user's life circle

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SUN, HONG ET AL.: "Research on Chinese Toponym Recognition Method with Two-Layer CRF and Rules Combination", COMPUTER APPLICATIONS AND SOFTWARE, vol. 31, no. 11, 30 November 2014 (2014-11-30), pages 175, ISSN: 1000-386X *

Also Published As

Publication number Publication date
CN104598573A (zh) 2015-05-06
CN104598573B (zh) 2017-06-16

Similar Documents

Publication Publication Date Title
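WO2016112782A1 (zh) Method and system for extracting a user's life circle
CN110110054B (zh) Method for obtaining question-answer pairs from unstructured text based on deep learning
CN107168955B (zh) Chinese word segmentation method using word-context-based character embedding and neural networks
CN108959242B (zh) Target entity recognition method and device based on part-of-speech features of Chinese characters
EP3153978B1 (en) Address search method and device
CN111198948A (zh) Text classification correction method, apparatus, device, and computer-readable storage medium
CN106776538A (zh) Information extraction method for enterprise documents in non-standard formats
CN110781663B (zh) Training method and device for a text analysis model, and text analysis method and device
JP2023529939A (ja) Method and apparatus for extracting multimodal POI features
CN107193796B (zh) Public opinion event detection method and device
CN109147767A (zh) Method, apparatus, computer device, and storage medium for recognizing digits in speech
JP2015062117A (ja) Entity linking method and entity linking device
CN111488468B (zh) Geographic information knowledge point extraction method, apparatus, storage medium, and computer device
CN112163424A (zh) Data annotation method, apparatus, device, and medium
EP3495968A1 (en) Method and system for extraction of relevant sections from plurality of documents
CN107169321B (zh) Program plagiarism detection method and system combining attribute counting and structure metric techniques
CN111159332A (zh) BERT-based method for recognizing multiple intents in text
JP2019032704A (ja) Table data structuring system and table data structuring method
CN115659226A (zh) Data processing system for obtaining app labels
CN115795056A (zh) Method, server, and storage medium for constructing a knowledge graph from unstructured information
CN113656547A (zh) Text matching method, apparatus, device, and storage medium
JP2022151838A (ja) Extraction of open information from low-resource languages
Sagcan et al. Toponym recognition in social media for estimating the location of events
CN111539383B (zh) Formula knowledge point recognition method and device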
US8666987B2 (en) Apparatus and method for processing documents to extract expressions and descriptions

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15877693

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15877693

Country of ref document: EP

Kind code of ref document: A1