CN105630768B - Product name recognition method and device based on stacked conditional random fields - Google Patents

Product name recognition method and device based on stacked conditional random fields

Info

Publication number
CN105630768B
Authority
CN
China
Prior art keywords
word
product
name
random field
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510974820.5A
Other languages
Chinese (zh)
Other versions
CN105630768A (en
Inventor
黄河燕
杨献祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT
Priority to CN201510974820.5A
Publication of CN105630768A
Application granted
Publication of CN105630768B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a context-sensitive product name recognition method and device based on stacked conditional random fields, and belongs to the technical field of internet data processing and analysis. The method represents words with word vectors, measures the semantic similarity of words by the similarity of their vectors, and incorporates global context information by combining word vectors with word clustering. To handle the complex, nested structure of product names, it recognizes product names with a stacked conditional random field model. Compared with the prior art, the invention effectively addresses insufficient context information and complex nested structure in product name recognition, improves recognition performance for structurally complex product names, and achieves higher accuracy and F1 than conventional methods.

Description

Product name recognition method and device based on stacked conditional random fields
Technical field
The invention belongs to the field of internet data processing and analysis, and relates to a context-sensitive product name recognition method and device based on stacked conditional random fields.
Background
In the Web 2.0 era, with the rise of social network platforms such as microblogs, every internet user is no longer only a reader of information but also a publisher of it, and the internet has changed from an information publishing platform into an interactive platform. Over the past ten years, China's e-commerce industry has kept growing, and more and more companies conduct business on the internet and sell products through online promotion. By December 2013, 23.5% of enterprises nationwide conducted online sales, and 20.9% carried out promotion over the internet. More and more people are used to shopping online, discussing the products they have bought, and commenting on the strengths and weaknesses of products they have used or purchased on forums, microblogs, and shopping websites. Before buying a product, people habitually use search engines to check user reviews of it, and other users' assessments of a product's quality influence their purchase decisions. Enterprises of all kinds have also opened official microblog accounts and begun promoting their products on this new medium. At present, not only has the government begun to pay attention to how topics spread on the internet, but commercial enterprises have also begun to monitor and analyze online information from forums, microblogs, blogs, and similar sources. They hope to grasp the market reputation of their products, understand netizens' opinions and suggestions about them, and continuously monitor negative reviews so that they can carry out crisis management in time and protect the company's reputation. The internet has become an important way for companies in all industries to obtain competitive intelligence from public sources; companies track the market performance and new product releases of their competitors so that they can make suitable decisions in time. For any enterprise, the most basic part of monitoring internet information is following the products of its own industry and the products it makes itself. Accurately recognizing product names in the massive data on the internet is therefore the basis and prerequisite for industry public opinion monitoring, word-of-mouth analysis, and business intelligence.
Product name recognition means recognizing product name entities in text. It is a subfield of named entity recognition within information extraction and aims to identify the entities in a text that denote product names, in order to support upper-layer applications such as business intelligence. Current research on named entity recognition mainly targets traditional named entities such as person names, place names, and organization names. With the development of the internet and e-commerce, recognizing product names is becoming increasingly important, yet research on it is still relatively scarce. Unlike traditional named entities, product names usually have a more complex structure: they typically contain digits, letters, special characters, and Chinese characters, they are relatively long, and nesting is common. In addition, the internet in the Web 2.0 era is flooded with user-generated text. Because users differ in writing skill and in habits of expression, such text is much harder to process than traditional media such as news, and its application value is also higher than that of traditional news media. To recognize product names more accurately in massive internet data, both local and global context information must be taken into account and product name recognition methods must be improved.
Summary of the invention
The object of the invention is to improve product name recognition by focusing on the nesting problem of product names while making comprehensive use of context information. It proposes a context-sensitive product name recognition method based on stacked conditional random fields that effectively handles nesting in product names and improves the features by making full use of both local and global context information, thereby raising product name recognition performance.
The idea of the invention is to incorporate global context information through a word vector model and word clustering, compensating for the shortcomings of local context information, and to recognize product names with nested structure using a stacked conditional random field model.
The object of the invention is achieved through the following technical solution:
A context-sensitive product name recognition method based on stacked conditional random fields, comprising the following steps:
Step 1: Preprocess the corpus text with word segmentation and part-of-speech tagging;
Step 2: Represent the corpus text with features, taking the word as the unit;
Step 3: Represent the current word using the feature template required by the trained low-layer conditional random field model, then recognize it with the trained low-layer conditional random field model to obtain a preliminary recognition result, denoted label 1;
Step 4: Take each word's primary feature representation plus label 1 as its secondary feature representation;
Step 5: Represent the current word using the feature template required by the trained low-layer conditional random field model, then recognize it with the trained high-layer conditional random field model to obtain the final recognition result, denoted label 2;
Step 6: Attach the corresponding label to each word in the corpus text recognized as a product entity, and output the result.
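The two-layer flow of steps 3 to 5 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `stacked_recognize`, the row dictionaries, and the two toy taggers are assumptions that stand in for the trained low-layer and high-layer conditional random field models.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Row = Dict[str, str]
Tagger = Callable[[Sequence[Row]], List[str]]


def stacked_recognize(primary_rows: Sequence[Row],
                      low_tagger: Tagger,
                      high_tagger: Tagger) -> List[Tuple[str, str]]:
    """Run the low-layer tagger, feed its labels to the high-layer tagger."""
    # Step 3: preliminary recognition, one label per word (label 1).
    label1 = low_tagger(primary_rows)
    if len(label1) != len(primary_rows):
        raise ValueError("low-layer tagger must return one label per word")
    # Step 4: secondary features = primary features plus label 1.
    secondary_rows = [dict(row, label1=tag)
                      for row, tag in zip(primary_rows, label1)]
    # Step 5: final recognition (label 2).
    label2 = high_tagger(secondary_rows)
    if len(label2) != len(primary_rows):
        raise ValueError("high-layer tagger must return one label per word")
    return [(row["word"], tag) for row, tag in zip(primary_rows, label2)]


def toy_low_tagger(rows: Sequence[Row]) -> List[str]:
    """Hypothetical stand-in for the trained low-layer CRF model."""
    lookup = {"Samsung": "B-BRA", "Galaxy": "B-SER", "S3": "B-TYP"}
    return [lookup.get(row["word"], "O") for row in rows]


def toy_high_tagger(rows: Sequence[Row]) -> List[str]:
    """Hypothetical stand-in for the trained high-layer CRF model:
    a run of adjacent simple entities is merged into one product name."""
    labels, inside = [], False
    for row in rows:
        if row["label1"] == "O":
            labels.append("O")
            inside = False
        else:
            labels.append("I-PRO" if inside else "B-PRO")
            inside = True
    return labels
```

In a real system the two taggers would wrap calls to the trained models; the point of the sketch is only that label 1 becomes an extra feature column for the second pass.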
Preferably, the primary features include basic features, domain features, and a category feature. The basic features describe properties of the word itself: the word itself, its part of speech, whether it contains letters, whether it contains digits, and whether it contains special characters. The domain features describe domain-related properties of the word: whether the current word is a brand name, a series name, a model name, or a product attribute. The category feature indicates the category to which the word belongs.
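As an illustration of the basic features, the following sketch computes the five values for one word. The text does not define "letter", "digit", or "special character" precisely, so the definitions used here (ASCII letters, ASCII digits, ASCII punctuation) and the Y/N encoding are assumptions.

```python
import string
from typing import List


def basic_features(word: str, pos: str) -> List[str]:
    """Return [word, part of speech, has letter, has digit, has special char]."""
    # Assumption: "letter" means an ASCII letter, so Chinese characters do not count.
    has_letter = any(ch in string.ascii_letters for ch in word)
    # Assumption: "digit" means an ASCII digit 0-9.
    has_digit = any(ch in string.digits for ch in word)
    # Assumption: "special character" means ASCII punctuation such as '-' or '+'.
    has_special = any(ch in string.punctuation for ch in word)

    def flag(value: bool) -> str:
        return "Y" if value else "N"

    return [word, pos, flag(has_letter), flag(has_digit), flag(has_special)]
```

For example, `basic_features("Note2", "n")` yields the word, its tag, and the flags Y, Y, N.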
Preferably, the domain features are determined by string matching against a domain product knowledge base, which is built by the following process:
Crawl product-related data from domain-related websites;
Parse the crawled data to obtain a preliminary product entity list;
Manually correct the preliminary product entity list, specify the brand, series, and model of each product entity, and build and store a product entity list containing each product entity together with its brand, series, and model;
With reference to the crawled data, manually compile and store a list of common attributes of products in the domain.
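A minimal sketch of the knowledge base lookup is shown below. It assumes exact string matching on whole words, which is only one possible reading of "string matching"; the function names and the attribute entries `screen` and `battery` are hypothetical, while the two product rows follow the example product entity list in the embodiment.

```python
from typing import Dict, Iterable, List, Set, Tuple

EntityRow = Tuple[str, str, str, str]  # (product entity, brand, series, model)


def build_knowledge_base(entity_rows: Iterable[EntityRow],
                         attributes: Iterable[str]) -> Dict[str, Set[str]]:
    """Collect brand, series, model, and attribute names into lookup sets."""
    kb: Dict[str, Set[str]] = {
        "brand": set(), "series": set(), "model": set(),
        "attribute": set(attributes),
    }
    for _entity, brand, series, model in entity_rows:
        for key, value in (("brand", brand), ("series", series), ("model", model)):
            if value:  # a product may lack a series, for example
                kb[key].add(value)
    return kb


def domain_features(word: str, kb: Dict[str, Set[str]]) -> List[str]:
    """Return Y/N flags: is brand, is series, is model, is product attribute."""
    return ["Y" if word in kb[key] else "N"
            for key in ("brand", "series", "model", "attribute")]


ENTITY_ROWS = [
    ("Samsung Galaxy Note2", "Samsung", "Galaxy", "Note2"),
    ("Nokia Lumia 920", "Nokia", "Lumia", "920"),
]
ATTRIBUTES = ["screen", "battery"]  # hypothetical attribute list entries
KB = build_knowledge_base(ENTITY_ROWS, ATTRIBUTES)
```

Sets keep each lookup constant-time, which matters when every word of a large corpus is checked against the knowledge base.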
Preferably, the category feature of the current word is determined by the following process:
Based on the word vector model, cluster the words in it according to the similarity between words, where the similarity between the word vectors corresponding to two words A and B is calculated by the following formula:
After clustering is complete, assign a unique class number to each cluster;
Output the class number of the cluster to which the current word belongs.
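The similarity formula itself is not reproduced in this text. A common choice for comparing word vectors is cosine similarity, shown here as an assumption rather than as the formula of the original; the symbols $\vec{a}$, $\vec{b}$, and $n$ are introduced only for this illustration and denote the vectors of words A and B and their dimension.

```latex
\mathrm{sim}(\vec{a}, \vec{b})
  = \frac{\vec{a} \cdot \vec{b}}{\lVert \vec{a} \rVert \, \lVert \vec{b} \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^{2}} \; \sqrt{\sum_{i=1}^{n} b_i^{2}}}
```

Under this assumption the value lies in $[-1, 1]$, and a larger value means the two words occur in more similar contexts.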
Preferably, the word vector model is obtained by the following process:
Download domain-related web pages and parse them into plain text;
Segment the downloaded text into words;
Train the word vector model on the segmented text.
Preferably, label 1 and label 2 are annotated in the BIO scheme, where B marks the beginning of an entity, I marks the part of an entity other than its beginning, and O marks non-entity parts. Label 1 obtained in this way is one of the following:
B-BRA: the beginning element of a brand name;
I-BRA: an element of a brand name other than the beginning element;
B-SER: the beginning element of a series name;
I-SER: an element of a series name other than the beginning element;
B-TYP: the beginning element of a model name;
I-TYP: an element of a model name other than the beginning element;
B-COM: the beginning element of a company name;
I-COM: an element of a company name other than the beginning element;
B-PRO: the beginning element of a product name;
I-PRO: an element of a product name other than the beginning element;
O: a non-entity element.
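The label inventory above can be generated and checked mechanically. The sketch below builds the eleven labels and tests whether a label sequence is well formed; the consistency rule it enforces (an I- label must follow a B- or I- label of the same type) is the usual BIO convention and is an assumption here, since the text does not state it.

```python
from typing import List, Sequence

ENTITY_TYPES = ("BRA", "SER", "TYP", "COM", "PRO")
# B-xxx and I-xxx for each entity type, plus O for non-entity elements.
LABELS: List[str] = [f"{prefix}-{etype}"
                     for etype in ENTITY_TYPES
                     for prefix in ("B", "I")] + ["O"]


def is_valid_bio(labels: Sequence[str]) -> bool:
    """Check that every label is known and every I- label continues an entity."""
    previous_type = None  # type of the entity currently open, if any
    for label in labels:
        if label not in LABELS:
            return False
        if label == "O":
            previous_type = None
            continue
        prefix, etype = label.split("-")
        if prefix == "I" and previous_type != etype:
            return False  # I- without a matching B- or I- before it
        previous_type = etype
    return True
```

A check like this is useful on manually annotated training data, where a stray I- label is an easy mistake to make.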
Preferably, the trained low-layer and high-layer conditional random field models are obtained by the following process:
Collect product-related text as the training corpus;
Perform word segmentation and part-of-speech tagging on the training corpus;
Annotate the brand, series, model, company, product name, and other entities that occur in the segmented text to obtain sentences containing product entities;
Represent the product entities with primary features, label 1, and label 2;
Use the product entities represented with primary features and label 1 to train a conditional random field model, obtaining the trained low-layer conditional random field model; its feature template should cover the features of the previous word, the current word, and the next word;
Use the product entities represented with primary features, label 1, and label 2 to train a conditional random field model, obtaining the trained high-layer conditional random field model; its feature template should cover the features of the previous word, the current word, and the next word.
A context-sensitive product name recognition device based on stacked conditional random fields, comprising a domain product knowledge base, a word vector model, a trained low-layer conditional random field model, a trained high-layer conditional random field model, a text preprocessing module, a primary feature representation module, a secondary feature representation module, a preliminary product name recognition module, a final product name recognition module, and a recognition result output module. The text preprocessing module, primary feature representation module, preliminary product name recognition module, secondary feature representation module, final product name recognition module, and recognition result output module are connected in sequence; the domain product knowledge base and the word vector model are each connected to the primary feature representation module; the trained low-layer conditional random field model is connected to the preliminary product name recognition module; and the trained high-layer conditional random field model is connected to the final product name recognition module;
The domain product knowledge base is built according to the knowledge base construction process described in claim 3 and includes a product entity list and a common attribute list;
The word vector model is obtained according to the word vector training process described in claim 5;
The trained low-layer and high-layer conditional random field models are obtained according to the process of claim 7;
The text preprocessing module receives the text in which product names are to be recognized and performs word segmentation and part-of-speech tagging on it;
The primary feature representation module takes all words and their parts of speech produced by the text preprocessing module and derives their feature values from the domain product knowledge base and the word vector model, that is, represents each word with primary features;
The preliminary product name recognition module takes all words and their primary features output by the primary feature representation module, merges in the primary features of each word's previous and next words, and recognizes them with the trained low-layer conditional random field model to obtain the preliminary recognition result, label 1;
The secondary feature representation module combines the primary features of all words with the label 1 output by the preliminary product name recognition module to obtain the secondary feature representation of each word;
The final product name recognition module takes all words and their secondary features output by the secondary feature representation module, merges in the secondary features of each word's previous and next words, and recognizes them with the trained high-layer conditional random field model to obtain the final recognition result, label 2;
The recognition result output module takes all words and their label 2 output by the final product name recognition module, filters out elements that are not product name entities to obtain a recognition result list, replaces the corresponding words in the input text with the words and labels in the recognition result list, and outputs the result.
Preferably, the domain product knowledge base is periodically supplemented with the latest vocabulary of the domain, and the word vector model is periodically retrained on the latest related text according to the word vector training process described in claim 5.
Preferably, the primary feature values are represented as described in claim 2, and label 1 and label 2 are annotated in the manner of claim 6.
Beneficial effects
To address the complex structure of product names and the insufficient use of context information, the invention incorporates global context information through word vectors and uses stacked conditional random fields to recognize structurally complex product names. Compared with the prior art, the method effectively addresses insufficient context information and complex nested structure in product name recognition, and improves recognition performance for structurally complex product names. The accuracy and F1 of the method are higher than those of conventional methods. The invention is broadly applicable to product name recognition in news, microblogs, forums, and other social media.
Description of the drawings
Fig. 1 is a schematic flow chart of a context-sensitive product name recognition method based on stacked conditional random fields according to an embodiment of the invention.
Fig. 2 is a schematic structural diagram of a context-sensitive product name recognition device based on stacked conditional random fields according to an embodiment of the invention.
Detailed description
To make the objects, technical solutions, and advantages of the invention clearer, the invention and its principles are further described below with reference to specific embodiments. The embodiments described below only serve to explain the invention and are not intended to limit it.
The context-sensitive product name recognition method based on stacked conditional random fields is explained below using product name recognition in the mobile phone domain as an example. Its processing flow is shown in Fig. 1 and comprises the following steps:
Step 1: Manually collect text related to product names as the product name recognition corpus;
Step 2: Collect domain-related product name information and build the domain product knowledge base;
Step 3: Collect product-related text and train the word vector model;
Step 4: Select features and represent the corpus with the selected features;
Step 5: Train the low-layer conditional random field model for recognizing simple entities and the high-layer conditional random field model for recognizing structurally complex product names;
Step 6: Recognize product names automatically with the conditional random field models.
Each step is described in detail below:
Step 1: Manually collect text related to product names as the product name recognition corpus;
This step mainly prepares the corpus, which is used for model training and evaluation in the subsequent steps.
Since this embodiment uses product name recognition in the mobile phone domain as its example, the step is completed as follows:
Step 1-1: Download mobile-phone-related web pages from Zhongguancun Online, a website about mobile phone products, parse them, and keep only the text of each page body;
Step 1-2: Perform word segmentation and part-of-speech tagging on the resulting text, for example with ICTCLAS 2015;
Step 1-3: Manually annotate the brand, series, model, company, product name, and other entities that occur in the segmented text, obtaining 4000 sentences that contain product entities;
Step 2: Collect domain-related product name information from the internet and build the domain product knowledge base;
The product knowledge base mainly supplies domain knowledge to the subsequent steps; the knowledge base built in this step is needed during feature selection.
Since this embodiment uses product name recognition in the mobile phone domain as its example, the domain product knowledge base mainly contains products of the mobile phone domain, and is built as follows:
Step 2-1: Crawl mobile phone product data from the mobile phone channel of Zhongguancun Online;
Step 2-2: Parse the crawled data to obtain a preliminary product entity list; the following table is an example of the product entity list;
Step 2-3: Manually correct the preliminary product entity list, specify the brand, series, and model of each product entity, and build and store a product entity list containing each product entity together with its brand, series, and model; the specific form is shown in the table below;
Product entity | Brand name | Series name | Model name
Samsung Galaxy Note2 | Samsung | Galaxy | Note2
Nokia Lumia 920 | Nokia | Lumia | 920
Lenovo S890 | Lenovo | | S890
Step 2-4: With reference to the crawled data, manually compile and store a list of common attributes of products in the domain. An example of the product attribute list is as follows:
Step 3: Collect product-related text and train the word vector model;
The word vector model is mainly used to incorporate global context information, which further supplements the context information and improves product entity recognition.
Step 3-1: Crawl a large number of mobile-phone-related web pages from the mobile phone channel of Zhongguancun Online and from the Mobile Phone China website and parse them into plain text, and also crawl some mobile-phone-related microblog posts from Sina Weibo, obtaining 1,000,000 mobile-phone-related sentences;
Step 3-2: Segment the downloaded text into words, for example with ICTCLAS 2015;
Step 3-3: Train the word vector model on the segmented text using the open-source word vector tool Word2vec, with a window size of 10, a vector dimension of 100, and the skip-gram model. Training yields a word vector model in which each word is represented as a 100-dimensional vector; in subsequent work this 100-dimensional vector can be used to represent the corresponding word;
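Once trained, the vectors have to be read back for clustering. The sketch below parses vectors from the plain-text output format written by the word2vec tool (a header line with vocabulary size and dimension, then one word per line followed by its values). It assumes the model was saved in that text format rather than in binary; the function name `load_text_vectors` is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def load_text_vectors(lines: Iterable[str]) -> Tuple[int, Dict[str, List[float]]]:
    """Parse word2vec text-format lines into (dimension, {word: vector})."""
    iterator = iter(lines)
    header = next(iterator).split()
    dimension = int(header[1])  # header is "<vocabulary size> <dimension>"
    vectors: Dict[str, List[float]] = {}
    for line in iterator:
        parts = line.rstrip().split(" ")
        if len(parts) != dimension + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return dimension, vectors
```

With the settings described above, the dimension read from the header would be 100. A file can be passed directly, for example `load_text_vectors(open(path, encoding="utf-8"))`, where `path` is wherever the vectors were written.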
Step 4: Select features and represent the corpus with the selected features;
The main purpose of this step is to select features and to represent the training data and the test data with a unified set of features; the quality of the selected features directly affects the final recognition performance.
Step 4-1: Use the current word itself, its part of speech, whether it contains letters, whether it contains digits, and whether it contains special characters as the basic features. The current word is the word being processed when a segmented sentence is processed word by word. For example, the sentence "I bought a Samsung mobile phone, the newly launched Note2" is processed word by word; when processing reaches "Samsung", the current word is "Samsung", "a" is the previous word, and "mobile phone" is the next word.
Step 4-2: Using the knowledge base obtained in step 2, take whether the current word is a brand name, a series name, a model name, or a product attribute as separate domain features;
Step 4-3: Using the word vector model obtained in step 3, cluster all words contained in the word vector model with the K-means algorithm, where the similarity between words is measured by the similarity between their corresponding vectors. For two given vectors, their similarity is calculated as follows:
After clustering is complete, assign a unique class number to each cluster, and use the cluster to which the current word belongs as the category feature;
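To make the clustering step concrete, here is a small pure-Python sketch that assigns a class number to every word. It is a spherical k-means variant: vectors are length-normalized and each word joins the centroid with the highest dot product, which corresponds to cosine similarity. Both that similarity measure and the deterministic initialization (the first k words in sorted order) are assumptions made for illustration; the source only states that Kmeans is used with a vector similarity.

```python
import math
from typing import Dict, List, Optional


def normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def kmeans_classes(vectors: Dict[str, List[float]], k: int,
                   iterations: int = 20) -> Dict[str, int]:
    """Cluster words and return {word: class number in 0..k-1}."""
    words = sorted(vectors)
    if not 0 < k <= len(words):
        raise ValueError("k must be between 1 and the number of words")
    points = [normalize(vectors[word]) for word in words]
    centroids = [points[i][:] for i in range(k)]  # assumed deterministic init
    assignment: Optional[List[int]] = None
    for _ in range(iterations):
        # Assignment step: pick the centroid with the highest dot product.
        new_assignment = [
            max(range(k),
                key=lambda c: sum(a * b for a, b in zip(point, centroids[c])))
            for point in points
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: centroid = normalized mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                mean = [sum(column) / len(members) for column in zip(*members)]
                centroids[c] = normalize(mean)
    return dict(zip(words, assignment or []))
```

The returned class number can be written directly into the category feature column of each word.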
Step 4-4: The features described in steps 4-1 to 4-3 are used in the low-layer conditional random field to recognize simple entities, that is, brand names, series names, model names, company names, and similar entities. On top of these features, the label sequence produced by the low-layer conditional random field is used as a new feature for the high-layer conditional random field model, which recognizes structurally complex product names;
Step 4-5: Divide the 4000 sentences containing product entities obtained in step 1 into two parts, 3000 as training data and 1000 as test data, and represent each with the features described in steps 4-1 to 4-4. The words in the training data and test data use the sequence formats shown in Table 1 and Table 2, respectively. The label sequence is annotated in the BIO scheme: B marks the beginning of an entity, I marks the part of an entity other than its beginning, and O marks non-entity parts. In this example, B-BRA and I-BRA, B-SER and I-SER, B-TYP and I-TYP, B-COM and I-COM, and B-PRO and I-PRO denote the beginning and the remaining elements of a brand name, series name, model name, company name, and product name, respectively, and O denotes a non-entity element:
Table 1:
Word | Feature 1 value | Feature 2 value | Feature 3 value | …… | Feature n value | Label sequence
Table 2:
Word | Feature 1 value | Feature 2 value | Feature 3 value | …… | Feature n value
The data to be recognized use the sequence format shown in Table 2; the blank last column is filled in by the method of the invention, which achieves the final recognition.
Step 4-6: Convert the sentences obtained in step 1 into feature representations according to the rules defined in step 4-5;
Step 4-7: Recognizing a product name entity depends closely on the words before and after it, so a feature template is defined here to incorporate local context information. This embodiment uses CRF++ 0.53 to train and test the conditional random field models; the feature template only needs to be defined in the CRF++ template syntax, with template items that merge the features of the previous word, the current word, and the next word.
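A template of the kind described in step 4-7 can be generated programmatically. The sketch below emits CRF++ unigram template items of the form `U00:%x[row,column]` for the previous, current, and next word over every feature column. The exact template used in the embodiment is not given, so the item order, the function name `crfpp_template`, and the optional bigram line `B` are assumptions.

```python
from typing import Sequence


def crfpp_template(num_feature_columns: int,
                   offsets: Sequence[int] = (-1, 0, 1),
                   with_bigram: bool = True) -> str:
    """Build a CRF++ feature template covering previous/current/next word.

    Column 0 is the word itself; further columns hold the other features.
    """
    lines = ["# Unigram"]
    index = 0
    for column in range(num_feature_columns):
        for offset in offsets:
            # %x[row,column]: row is relative to the current token.
            lines.append(f"U{index:02d}:%x[{offset},{column}]")
            index += 1
    if with_bigram:
        # Assumption: also use the previous output label (CRF++ bigram item).
        lines += ["", "# Bigram", "B"]
    return "\n".join(lines) + "\n"
```

Writing the returned string to a file gives a template that can be passed to the CRF++ training command together with the column-formatted training data.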
Step 5: Train the low-layer conditional random field model for recognizing simple entities and the high-layer conditional random field model for recognizing structurally complex product names. The low-layer model recognizes simple-structure entities such as brand names, series names, model names, and company names; the high-layer model recognizes product name entities. A sample of the training corpus after feature representation is shown in the table below:
Word | Feature 1 | Feature 2 | Feature n | Label 1 | Label 2
I | N | N | Y | O | O
Like | N | N | N | O | O
Samsung | Y | N | N | B-BRA | B-PRO
Galaxy | N | Y | N | B-SER | I-PRO
S3 | N | N | Y | B-TYP | I-PRO
 | N | N | N | O | O
Step 5-1: Take the feature-represented training corpus in the table above without label 2, represent it with the feature template described in step 4-7, and train the low-layer conditional random field model, which is used to recognize simple-structure entities;
Step 5-2: Take the feature-represented training corpus in the table above, represent it with the feature template described in step 4-7, and train the high-layer conditional random field model, which is used to recognize structurally complex product names;
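The difference between the two training sets in steps 5-1 and 5-2 is only which columns are kept. The sketch below derives both from the sample rows: the low-layer data drops label 2 so that label 1 is the answer column, while the high-layer data keeps label 1 as a feature and label 2 as the answer. The tab-separated layout with the answer in the last column follows the CRF++ data format; the function names are hypothetical, and the final sample row, whose word is missing in the source table, is left out.

```python
from typing import List, Sequence, Tuple

# (word, feature 1, feature 2, feature n, label 1, label 2)
SampleRow = Tuple[str, str, str, str, str, str]

SAMPLE: List[SampleRow] = [
    ("I", "N", "N", "Y", "O", "O"),
    ("Like", "N", "N", "N", "O", "O"),
    ("Samsung", "Y", "N", "N", "B-BRA", "B-PRO"),
    ("Galaxy", "N", "Y", "N", "B-SER", "I-PRO"),
    ("S3", "N", "N", "Y", "B-TYP", "I-PRO"),
]


def low_layer_lines(rows: Sequence[SampleRow]) -> List[str]:
    """Training lines for the low-layer model: label 1 is the last column."""
    return ["\t".join(row[:-1]) for row in rows]


def high_layer_lines(rows: Sequence[SampleRow]) -> List[str]:
    """Training lines for the high-layer model: label 1 is a feature column
    and label 2 is the last (answer) column."""
    return ["\t".join(row) for row in rows]
```

In a CRF++ data file, sentences are separated by an empty line, so a caller would join these lines per sentence and insert a blank line between sentences.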
Step 6: Recognize product names automatically with the conditional random field models;
Step 6-1: Input the data to be recognized, represented with the features defined in step 4, into the low-layer conditional random field model to recognize simple entities. The input data has the form of the sample data in step 5 with the last two columns removed; the model adds a "label 1" column to the input data as its output, and simple entities can then be determined from the "label 1" results;
Step 6-2: Use the recognition result of the low-layer conditional random field obtained in step 6-1, that is, the output of step 6-1, as the input of the high-layer conditional random field model to recognize structurally complex product names; the model adds a "label 2" column to the input data as its output.
Step 6-3: Parse the recognition result of step 6-2 according to the meaning of the label sequence agreed in step 4-5, and filter out the non-entity elements labeled O to obtain the final product name recognition result.
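The parsing in step 6-3 amounts to collecting each B-PRO word together with the I-PRO words that follow it. A minimal sketch, assuming the words of an entity are joined with a separator chosen by the caller (a space here; an empty string would suit unsegmented Chinese output), and with a hypothetical function name:

```python
from typing import List, Sequence


def extract_entities(words: Sequence[str], labels: Sequence[str],
                     entity_type: str = "PRO", separator: str = " ") -> List[str]:
    """Collect entities of one type from a BIO label sequence, dropping O."""
    entities: List[str] = []
    current: List[str] = []  # words of the entity being built, if any
    for word, label in zip(words, labels):
        if label == f"B-{entity_type}":
            if current:
                entities.append(separator.join(current))
            current = [word]  # a B- label always starts a new entity
        elif label == f"I-{entity_type}" and current:
            current.append(word)
        else:
            if current:
                entities.append(separator.join(current))
            current = []
    if current:
        entities.append(separator.join(current))
    return entities
```

Calling it with `entity_type="BRA"` on the label 1 column would list brand names in the same way.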
A context-sensitive product name recognition device based on stacked conditional random fields is implemented according to the above method; its structure is shown in Fig. 2. The device comprises a domain product knowledge base, a word vector model, a trained low-layer conditional random field model, a trained high-layer conditional random field model, a text preprocessing module, a primary feature representation module, a secondary feature representation module, a preliminary product name recognition module, a final product name recognition module and a recognition result output module. The text preprocessing module, primary feature representation module, preliminary product name recognition module, secondary feature representation module, final product name recognition module and recognition result output module are connected in sequence; the domain product knowledge base and the word vector model are each connected to the primary feature representation module; the trained low-layer conditional random field model is connected to the preliminary product name recognition module; and the trained high-layer conditional random field model is connected to the final product name recognition module;
The domain product knowledge base is built according to the process for building a domain product knowledge base described in claim 3, and includes a product entity list and a common attribute list; so that it always covers the latest vocabulary changes in the domain, the latest vocabulary is added to the domain product knowledge base periodically;
The word vector model is obtained according to the process for training a word vector model described in claim 5; so that the word vector model can always track the latest changes in domain vocabulary, domain-related text is added periodically and the model is retrained;
The trained low-layer conditional random field model and the trained high-layer conditional random field model are obtained according to the process of claim 7;
The text preprocessing module receives the text in which product names are to be recognized and performs word segmentation and part-of-speech tagging on it;
The primary feature representation module takes all words and their parts of speech obtained by the text preprocessing module and derives their feature values from the domain product knowledge base and the word vector model, that is, it represents each word with a primary feature; preferably, the basic features, domain features and category features described above are used;
The preliminary product name recognition module takes all words and their primary features output by the primary feature representation module, fuses in the primary features of each word's previous and next words, and performs recognition with the trained low-layer conditional random field model to obtain the preliminary recognition result, label 1;
The secondary feature representation module combines the primary feature of each word output by the preliminary product name recognition module with label 1 to obtain the secondary feature representation of that word;
The final product name recognition module takes all words and their secondary features output by the secondary feature representation module, fuses in the secondary features of each word's previous and next words, and performs recognition with the trained high-layer conditional random field model to obtain the final recognition result, label 2;
The recognition result output module takes all words and their label 2 output by the final product name recognition module, filters out the elements that are not product name entities to obtain a recognition result list, replaces the corresponding words in the input text with the words and labels from the list, and outputs the result.
Preferably, label 1 and label 2 are assigned using the BIO scheme described above.
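The sequential connection of the modules can be illustrated with the sketch below. It is not the device itself: the function `run_pipeline`, the `Row` type and the column names `label1` and `label2` are hypothetical, and the two conditional random field models are passed in as plain callables so that any trained tagger could be substituted.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]  # one word together with its feature columns


def run_pipeline(
    text: str,
    preprocess: Callable[[str], List[Row]],
    primary_features: Callable[[List[Row]], List[Row]],
    low_model: Callable[[List[Row]], List[str]],
    high_model: Callable[[List[Row]], List[str]],
) -> List[Row]:
    rows = preprocess(text)            # text preprocessing module
    rows = primary_features(rows)      # primary feature representation module
    label1 = low_model(rows)           # preliminary product name recognition module
    # Secondary feature representation: primary features plus label 1.
    rows = [dict(row, label1=lab) for row, lab in zip(rows, label1)]
    label2 = high_model(rows)          # final product name recognition module
    rows = [dict(row, label2=lab) for row, lab in zip(rows, label2)]
    # Recognition result output module: drop non-entity (O) elements.
    return [row for row in rows if row["label2"] != "O"]
```

Passing the models as arguments keeps the sketch independent of any particular conditional random field library.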
Test result
To verify the effectiveness of the invention, this embodiment crawled a total of 70 million Sina Weibo microblog posts dated from February 2012 to April 2013. From these, 4,000 posts related to mobile phone products were randomly selected and manually annotated; 3,000 were used for training and 1,000 for testing. The comparison experiment uses a conditional random field model with the basic features of step 4-1 as its features for product name entity recognition. The evaluation metrics in this field include precision, recall and the F1 value; because the F1 value is a comprehensive metric, this experiment uses it as the evaluation metric, and a higher F1 value indicates better performance. The experimental results are shown in the following table:
As can be seen from the table, recognition of brand names, series names, model names and product entities all improves markedly, with the F1 value of product name entities rising the most. The experiment shows that the invention can effectively improve product name entity recognition.
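For reference, the F1 value combines precision and recall as their harmonic mean. The sketch below computes all three at the entity level; it assumes that an entity counts as correct only when its span and type both match exactly, which the text does not state, and the function name and tuple layout are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Entity-level scores; predicted and gold are sets of (start, end, type) tuples."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```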

Claims (2)

1. A context-sensitive product name recognition method based on stacked conditional random fields, the method comprising the following steps:
Step 1: performing word segmentation and part-of-speech tagging on the corpus text as preprocessing;
Step 2: performing primary feature representation on the corpus text, word by word;
Step 3: after representing the current word with the feature template required by the trained low-layer conditional random field model, performing recognition with the trained low-layer conditional random field model to obtain a preliminary recognition result, denoted label 1;
Step 4: using the primary feature representation of each word plus label 1 as its secondary feature representation;
Step 5: after representing the current word with the feature template required by the trained low-layer conditional random field model, performing recognition with the trained high-layer conditional random field model to obtain the final recognition result, denoted label 2;
Step 6: attaching the corresponding label 2 to each word in the corpus text that is recognized as a product entity, and outputting the result;
The primary feature comprises basic features, domain features and category features; the basic features represent properties of the word itself, including the word, its part of speech, whether it contains letters, whether it contains digits, and whether it contains special characters; the domain features represent the domain properties of the word, including whether the current word is a brand name, a series name, a model name, or a product attribute; the category feature represents the category to which the word belongs;
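A minimal sketch of the basic features is given below. It assumes that "letters" means ASCII letters and that "special characters" means ASCII punctuation; both are assumptions of this illustration, as are the function name and the feature keys.

```python
import string


def basic_features(word, pos):
    """Return the basic features of one segmented word and its part-of-speech tag."""
    return {
        "word": word,
        "pos": pos,
        # ASCII letters only, so that Chinese characters do not count as letters.
        "has_letter": any(c.isascii() and c.isalpha() for c in word),
        "has_digit": any(c.isdigit() for c in word),
        "has_special": any(c in string.punctuation for c in word),
    }
```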
The domain features are determined by string matching against the domain product knowledge base, which is built by the following process:
crawling product-related data from domain-related websites;
parsing the crawled data to obtain a preliminary product entity list;
manually correcting the preliminary product entity list, specifying the brand, series and model to which each product entity belongs, and building and storing a product entity list that contains the product entities together with their brands, series and models;
with reference to the crawled data, manually compiling and storing a list of common attributes of products in the domain;
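The string-matching lookup can be sketched as follows. The knowledge base is modelled here as a dictionary of lowercase string sets, and matching is exact after lowercasing; the data layout, the matching rule, the function name and the sample entries are all assumptions made for illustration.

```python
def domain_features(word, knowledge_base):
    """Look a word up in the knowledge base lists by exact, case-insensitive match.

    knowledge_base maps "brand", "series", "model" and "attribute"
    to sets of lowercase strings.
    """
    key = word.lower()
    return {
        "is_brand": key in knowledge_base["brand"],
        "is_series": key in knowledge_base["series"],
        "is_model": key in knowledge_base["model"],
        "is_attribute": key in knowledge_base["attribute"],
    }
```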
The category feature of the current word is determined by the following process:
based on the word vector model, clustering the words in it according to the similarity between words, where the similarity between the word vectors corresponding to two words A and B is calculated by the following formula:
after clustering is complete, assigning a unique class number to each category;
outputting the class number of the category to which the current word belongs;
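The similarity formula itself is not reproduced in this text. The sketch below therefore assumes cosine similarity, a common choice for word vectors, and uses a simple greedy threshold clustering as a stand-in because the clustering algorithm is not named; the function names, the threshold and the sample vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_classes(vectors, threshold=0.8):
    """Greedy clustering: a word joins the first class whose seed vector is similar enough."""
    seeds = []      # one representative vector per class; the index is the class number
    class_of = {}
    for word, vec in vectors.items():
        for class_id, seed in enumerate(seeds):
            if cosine_similarity(vec, seed) >= threshold:
                class_of[word] = class_id
                break
        else:
            # No existing class is close enough, so open a new one.
            seeds.append(vec)
            class_of[word] = len(seeds) - 1
    return class_of
```

The returned class number is what would serve as the category feature of a word.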
The word vector model is obtained by the following process:
downloading domain-related web pages and parsing them into plain text;
performing word segmentation on the downloaded text;
training the word vector model on the segmented text;
Label 1 and label 2 are assigned using the BIO scheme, in which B denotes the beginning of an entity, I denotes the part of an entity other than its beginning, and O denotes a non-entity part; label 1 obtained in this way is one of the following:
B-BRA: the beginning element of a brand name;
I-BRA: an element of a brand name other than the beginning element;
B-SER: the beginning element of a series name;
I-SER: an element of a series name other than the beginning element;
B-TYP: the beginning element of a model name;
I-TYP: an element of a model name other than the beginning element;
B-COM: the beginning element of a company name;
I-COM: an element of a company name other than the beginning element;
B-PRO: the beginning element of a product name;
I-PRO: an element of a product name other than the beginning element;
O: a non-entity element;
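The label set above can be enumerated and a label sequence checked as in the sketch below. The rule that an I- label must directly follow a B- or I- label of the same type is the usual BIO convention and is assumed here; the names `ENTITY_TYPES`, `LABELS` and `is_valid_bio` are hypothetical.

```python
ENTITY_TYPES = ["BRA", "SER", "TYP", "COM", "PRO"]
# B- and I- labels for every entity type, plus the single non-entity label O.
LABELS = [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")] + ["O"]


def is_valid_bio(labels):
    """Check that every label is known and every I-X directly follows B-X or I-X."""
    previous = "O"
    for label in labels:
        if label not in LABELS:
            return False
        if label.startswith("I-") and previous[2:] != label[2:]:
            return False
        previous = label
    return True
```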
The low layer conditional random field models trained and high-rise conditional random field models are obtained by following process:
The relevant text of product is collected as training corpus;
Participle and part-of-speech tagging are carried out to training corpus;
The brand occurred in the text after participle, series, model, company, ProductName entity are marked, is obtained comprising product entity Sentence;
A feature, label 1 and label 2 are carried out to product entity to indicate;
The low layer that the product entity indicated with a feature, label 1 has been trained for the training of conditional random field models Conditional random field models, the feature that feature templates should be including a upper word, current word and next word;
Training by the product entity indicated with a feature, label 1, label 2 for conditional random field models has been trained High-rise conditional random field models, feature templates should include a upper word, current word and next word feature.
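A feature template covering the previous, current and next word can be sketched as below. The function name, the key format such as `word[-1]` and the boundary marker `<PAD>` are hypothetical; the sketch only shows how a three-word window of feature columns might be assembled before being passed to a conditional random field trainer.

```python
def window_features(rows, i, columns):
    """Collect the given columns for the previous (-1), current (+0) and next (+1) word."""
    features = {}
    for offset in (-1, 0, 1):
        j = i + offset
        for col in columns:
            if 0 <= j < len(rows):
                value = rows[j][col]
            else:
                value = "<PAD>"  # hypothetical marker for positions outside the sentence
            features[f"{col}[{offset:+d}]"] = value
    return features
```

For the low-layer model the columns would be the primary features; for the high-layer model the `label1` column would be included as well.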
2. A context-sensitive product name recognition device based on stacked conditional random fields, constructed according to the product name recognition method of claim 1, characterized in that: it comprises a domain product knowledge base, a word vector model, a trained low-layer conditional random field model, a trained high-layer conditional random field model, a text preprocessing module, a primary feature representation module, a secondary feature representation module, a preliminary product name recognition module, a final product name recognition module and a recognition result output module; the text preprocessing module, primary feature representation module, preliminary product name recognition module, secondary feature representation module, final product name recognition module and recognition result output module are connected in sequence; the domain product knowledge base and the word vector model are each connected to the primary feature representation module; the trained low-layer conditional random field model is connected to the preliminary product name recognition module; and the trained high-layer conditional random field model is connected to the final product name recognition module.
CN201510974820.5A 2015-12-23 2015-12-23 A kind of product name recognition method and device based on stacking condition random field Active CN105630768B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510974820.5A CN105630768B (en) 2015-12-23 2015-12-23 A kind of product name recognition method and device based on stacking condition random field

Publications (2)

Publication Number Publication Date
CN105630768A CN105630768A (en) 2016-06-01
CN105630768B true CN105630768B (en) 2018-10-12

Family

ID=56045725

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510974820.5A Active CN105630768B (en) 2015-12-23 2015-12-23 A kind of product name recognition method and device based on stacking condition random field

Country Status (1)

Country Link
CN (1) CN105630768B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107301411B (en) * 2016-04-14 2020-07-10 科大讯飞股份有限公司 Mathematical formula identification method and device
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN106503192B (en) * 2016-10-31 2019-10-15 北京百度网讯科技有限公司 Name entity recognition method and device based on artificial intelligence
CN106528863B (en) * 2016-11-29 2019-07-02 中国国防科技信息中心 A kind of training of CRF identifier and technology and its attribute-name relationship are to abstracting method
CN108415896B (en) * 2017-02-09 2022-03-04 北京京东尚科信息技术有限公司 Deep learning model training method, word segmentation method, training system and word segmentation system
CN106980609A (en) * 2017-03-21 2017-07-25 大连理工大学 A kind of name entity recognition method of the condition random field of word-based vector representation
CN107193959B (en) * 2017-05-24 2020-11-27 南京大学 Pure text-oriented enterprise entity classification method
CN107844474A (en) * 2017-09-29 2018-03-27 华南师范大学 Disease data name entity recognition method and system based on stacking condition random field
CN110413769A (en) * 2018-04-25 2019-11-05 北京京东尚科信息技术有限公司 Scene classification method, device, storage medium and its electronic equipment
CN108763205B (en) * 2018-05-21 2022-05-03 创新先进技术有限公司 Brand alias identification method and device and electronic equipment
CN109189820B (en) * 2018-07-30 2021-08-31 北京信息科技大学 Coal mine safety accident ontology concept extraction method
CN109766541B (en) * 2018-12-12 2023-08-18 咪咕文化科技有限公司 Marketing strategy identification method, server and computer storage medium
CN113051918B (en) * 2019-12-26 2024-05-14 北京中科闻歌科技股份有限公司 Named entity recognition method, device, equipment and medium based on ensemble learning

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101477518A (en) * 2009-01-09 2009-07-08 昆明理工大学 Tour field named entity recognition method based on condition random field
CN103164426A (en) * 2011-12-13 2013-06-19 北大方正集团有限公司 Method and device of recognizing named entity

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Domain-Specific Product Named Entity Recognition from Chinese Microblog;Xianxiang Yang et al.;《2014 10th International Conference on Computational Intelligence and Security》;20141231;第218-222页 *
Product Named Entity Recogintion Using Conditional Random Fields;Fang Luo et al.;《2011 Fourth International Conference on Business Intelligence and Financial Engineering》;20111231;第86-89页 *
基于层叠CRFs的中文句子评价对象抽取;郑敏洁 等;《中文信息学报》;20130531;第27卷(第3期);第69-76页 *
基于层叠条件随机场模型的中文机构名自动识别;周俊生 等;《电子学报》;20060531;第34卷(第5期);第804-809页 *
微博客中的知识条目发现方法研究;石汇淼;《中国优秀硕士学位论文全文数据库 信息科技辑》;20150215;第2015年卷(第2期);第9、11、20页 *
针对产品命名实体识别的半监督学习方法;黄诗琳 等;《北京邮电大学学报》;20130430;第36卷(第2期);第20-23、54页 *

Also Published As

Publication number Publication date
CN105630768A (en) 2016-06-01

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant