CN103870446B - Descriptor screening method and device - Google Patents

Info

Publication number
CN103870446B
Authority
CN
China
Prior art keywords
descriptor
statistical value
dictionary
business object
description information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210551720.8A
Other languages
Chinese (zh)
Other versions
CN103870446A (en)
Inventor
侯磊
李军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201210551720.8A
Publication of CN103870446A
Application granted
Publication of CN103870446B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application discloses a descriptor screening method and device. For each of a plurality of business objects, the method extracts from the title content of the business object the descriptors that are present in a descriptor dictionary, and determines whether each extracted descriptor is also present in the specified description information of that business object. If it is, the first statistical value corresponding to the descriptor is updated by a set increment; if it is not, the second statistical value corresponding to the descriptor is updated by the set increment. After these statistics have been collected for every business object, the descriptors in the descriptor dictionary are screened according to their respective first and second statistical values to obtain an updated descriptor dictionary. The scheme provided by the embodiments of this application improves the accuracy with which descriptors of business objects are determined.

Description

Descriptor screening method and device
Technical field
This application relates to the fields of Internet technology and computer technology, and in particular to a descriptor screening method and device.
Background
In existing Internet technology, a website typically publishes business objects that users who log in to the website can browse and then act on. Taking an e-commerce website as an example, a business object may be a product published by a seller, and the information of the business object may be description information for the product's various features, such as its model, price, performance, and brand. Users who log in to the e-commerce website can browse the published information to learn the details of the product, and can then bookmark it, buy it, or recommend it to other users. Taking a community website as an example, a business object may be a post published by a community user, and the information of the business object may be the description information and the content of the post. Users who log in to the community website can browse the published information to learn the details of the post, and can then bookmark it, reply to it, or recommend it to other users.
In practice, the description information of a business object is usually entered by the provider of the business object when publishing it. For various reasons, such as operating errors or insufficient knowledge of the business object, the description information entered by the provider may be inaccurate. For example, when entering brand information, a provider who is unfamiliar with the exact brand name, or who is mistaken about it, may enter a brand word that does not correspond to any real brand. If a brand word list extracted from such erroneous brand information is then used for brand recognition of business objects, the recognition results will be inaccurate and will have to be corrected afterwards, which wastes processing resources and lowers the efficiency of brand recognition.
Summary of the invention
In view of this, the embodiments of this application provide a descriptor screening method and device to solve the prior-art problem that descriptors of business objects are determined inaccurately.
The embodiments of this application are implemented through the following technical solutions:
An embodiment of this application provides a descriptor screening method, including:
For each business object among a plurality of business objects, performing the following step A and step B:
Step A: based on the descriptors included in a descriptor dictionary, extracting from the title content of the business object the descriptors that are present in the descriptor dictionary;
Step B: determining whether each extracted descriptor is present in the specified description information of the business object; if so, updating the first statistical value corresponding to the descriptor by a set increment; if not, updating the second statistical value corresponding to the descriptor by the set increment;
After step A and step B have been performed for each of the plurality of business objects, screening the descriptors included in the descriptor dictionary according to the first and second statistical values corresponding to each descriptor, to obtain an updated descriptor dictionary.
An embodiment of this application further provides a descriptor screening device, including:
A first extraction unit, configured to, for each business object among a plurality of business objects, extract from the title content of the business object, based on the descriptors included in a descriptor dictionary, the descriptors that are present in the descriptor dictionary;
A statistics unit, configured to determine whether each extracted descriptor is present in the specified description information of the business object; if so, to update the first statistical value corresponding to the descriptor by a set increment; if not, to update the second statistical value corresponding to the descriptor by the set increment;
A screening unit, configured to, after each of the plurality of business objects has been processed by the first extraction unit and the statistics unit, screen the descriptors included in the descriptor dictionary according to the first and second statistical values corresponding to each descriptor, to obtain an updated descriptor dictionary.
In at least one of the above technical solutions provided by the embodiments of this application, the descriptors in the descriptor dictionary are screened as follows. First, for each of a plurality of business objects, the descriptors present in the descriptor dictionary are extracted from the title content of the business object. It is then determined whether each extracted descriptor is present in the specified description information of the business object: if so, the first statistical value of the descriptor is updated by a set increment; if not, the second statistical value is updated by the set increment. When a descriptor is present in both the title content and the specified description information of a business object, this indicates that the descriptor is accurate to some extent. Conversely, when a descriptor is present only in the title content and not in the specified description information, this indicates that the descriptor is inaccurate to some extent. Therefore, after the statistics have been collected for all of the business objects, each descriptor in the dictionary has a first and a second statistical value; the larger the first statistical value, the more accurate the descriptor, and the larger the second statistical value, the less accurate the descriptor. Screening the descriptors according to these values and removing the inaccurate ones yields an updated descriptor dictionary whose descriptors are more accurate, which improves the accuracy of the determined descriptors.
Other features and advantages of this application are set out in the description that follows; some become apparent from the description, and others can be understood by implementing the application. The objectives and other advantages of the application can be realized and obtained through the structures particularly pointed out in the written description, the claims, and the accompanying drawings.
Brief description of the drawings
The accompanying drawings provide a further understanding of this application and form part of the description. Together with the embodiments, they serve to explain the application and do not limit it. In the drawings:
Fig. 1 is a flowchart of the descriptor screening method provided by the embodiments of this application;
Fig. 2 is a flowchart of the descriptor screening method provided in Embodiment 1 of this application;
Fig. 3 is a flowchart of the descriptor recognition process provided in Embodiment 1 of this application;
Fig. 4 is a schematic structural diagram of the descriptor screening device provided in Embodiment 2 of this application.
Detailed description
To improve the accuracy of the descriptors determined for business objects, the embodiments of this application provide a descriptor screening method and device. The technical solution can be applied to the process of determining a descriptor dictionary for business objects, and can be implemented either as a method or as a device. Preferred embodiments of the application are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described here only illustrate and explain the application and do not limit it. Where no conflict arises, the embodiments of the application and the features of the embodiments may be combined with one another.
An embodiment of this application provides a descriptor screening method which, as shown in Fig. 1, includes:
For each business object among a plurality of business objects, performing the following steps 101 and 102:
Step 101: based on the descriptors included in the descriptor dictionary, extract from the title content of the business object the descriptors that are present in the descriptor dictionary.
Step 102: determine whether each extracted descriptor is present in the specified description information of the business object; if so, update the first statistical value corresponding to the descriptor by a set increment; if not, update the second statistical value corresponding to the descriptor by the set increment.
Step 103: after steps 101 and 102 have been performed for each of the plurality of business objects, screen the descriptors included in the descriptor dictionary according to the first and second statistical values corresponding to each descriptor, to obtain an updated descriptor dictionary.
The descriptors included in the descriptor dictionary may be the descriptors that appear in the specified description information of the plurality of business objects.
Further, in the method provided by the embodiments of this application, once the updated descriptor dictionary has been obtained, the screening procedure shown in Fig. 1 may be applied again to the updated dictionary, so that its descriptors are screened a second time and the accuracy of the descriptors it contains is improved further.
Further, in the method provided by the embodiments of this application, once the updated descriptor dictionary has been obtained, descriptor recognition may be performed on a business object based on the descriptors in the updated dictionary, in order to supplement the specified description information of the business object or to correct inaccurate descriptors in it. For a business object to be processed, this may specifically include:
Based on the descriptors included in the updated descriptor dictionary, extracting from the title content of the business object to be processed the descriptors that are present in the updated descriptor dictionary;
When an extracted descriptor is not present in the specified description information of the business object to be processed, adding the extracted descriptor to that specified description information, or replacing a descriptor in that specified description information with the extracted descriptor.
The method and device provided by this application are described in detail below through specific embodiments and with reference to the accompanying drawings.
Embodiment 1:
Fig. 2 is a flowchart of the descriptor screening method provided in Embodiment 1 of this application, which specifically includes the following processing steps:
Step 201: obtain the title content of each of a plurality of business objects, and the specified description information of each of the plurality of business objects.
The plurality of business objects may belong to the same category. Taking an e-commerce website as an example, the business objects may belong to the same product category, such as menswear, womenswear, or mobile phones.
The specified description information may be attribute information of a business object. For example, when the business object is a commodity, the specified description information may be the brand information of the commodity. The specified description information may depend on the type of descriptor to be screened subsequently; for example, when the descriptors to be screened are brand words, the specified description information may be brand information.
Step 202: determine the descriptors in the specified description information of the plurality of business objects.
Step 203: form the descriptor dictionary from the descriptors that appear in the specified description information of the plurality of business objects.
In this step, the number of times each descriptor in the descriptor dictionary appears in the specified description information of the plurality of business objects may also be counted; this count can be used later when screening the descriptors in the dictionary.
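Steps 201 to 203 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes each business object is a dict with hypothetical `title` and `brand` keys, where `brand` stands in for the specified description information. The function name and the sample data are likewise assumptions.

```python
from collections import Counter

def build_descriptor_dictionary(business_objects, field="brand"):
    """Collect descriptors from the specified description field and count them.

    The keys of the returned Counter form the initial descriptor dictionary;
    the values are the occurrence counts mentioned in step 203.
    """
    counts = Counter()
    for obj in business_objects:
        value = (obj.get(field) or "").strip()
        if value:  # skip objects whose specified description information is empty
            counts[value] += 1
    return counts

# Hypothetical sample data: "Blot" is a mistyped brand entered by a provider.
objects = [
    {"title": "Acme running shoes", "brand": "Acme"},
    {"title": "Acme trail shoes", "brand": "Acme"},
    {"title": "Bolt sneakers", "brand": "Blot"},
    {"title": "Plain socks", "brand": ""},
]
dictionary = build_descriptor_dictionary(objects)
```

Note that the mistyped brand still enters the initial dictionary; removing such entries is the job of the later screening step.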
After the initial descriptor dictionary has been obtained, each of the plurality of business objects is taken in turn as the business object to be processed, and the following steps 204 to 207 are performed.
Step 204: based on the descriptors included in the descriptor dictionary, extract from the title content of the current business object to be processed the descriptors that are present in the descriptor dictionary.
In this step, the title content of the current business object may be segmented into words, and each resulting word checked for presence in the descriptor dictionary.
Alternatively, for each descriptor in the descriptor dictionary, this step may determine whether that descriptor is present in the title content of the current business object; in this case the title content does not need to be segmented.
Preferably, this step may be implemented with the Aho-Corasick algorithm, a dictionary-based multi-pattern string matching algorithm that builds a Trie-like structure in the form of a finite-state machine.
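For illustration, a compact Aho-Corasick matcher for step 204 might look like the sketch below. The function names are hypothetical, and the code matches raw, case-sensitive substrings; any normalization or word-boundary handling a real system needs is not covered by the text and is left out here.

```python
from collections import deque

def build_automaton(words):
    """Build Aho-Corasick tables: goto transitions, failure links, outputs."""
    goto, fail, out = [{}], [0], [set()]
    for word in words:
        if not word:
            continue  # an empty pattern would match everywhere
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(word)
    queue = deque(goto[0].values())  # depth-1 states fail back to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]  # inherit patterns ending at the suffix
    return goto, fail, out

def extract_descriptors(title, automaton):
    """Return every dictionary word that occurs in the title, in one pass."""
    goto, fail, out = automaton
    state, found = 0, set()
    for ch in title:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found |= out[state]
    return found

automaton = build_automaton(["acme", "me", "bolt"])
hits = extract_descriptors("new acme shoes", automaton)
```

Because matching is by substring, `"me"` is reported inside `"acme"`. Whether such nested hits should count is a design choice the text does not address.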
Step 205: for each descriptor determined in step 204, that is, each descriptor that is present both in the descriptor dictionary and in the title content of the current business object to be processed, determine whether the descriptor is present in the specified description information of the current business object. If it is, proceed to step 206; if it is not, proceed to step 207.
Step 206: update the first statistical value corresponding to the descriptor by the set increment.
That is, the sum of the set increment and the previous first statistical value of the descriptor is taken as the updated first statistical value of the descriptor.
When statistics are subsequently collected for other business objects according to steps 204 to 207, the updated first statistical value obtained in this step is carried into the next statistical calculation for this descriptor. In other words, the next update of the descriptor's first statistical value builds on the updated value obtained in this step.
Put differently, the value of the first statistical value before an update is the value left by the previous update. The first time statistics are collected for a descriptor, its first statistical value is an initial value, which may be set to 0. The set increment may be set to 1.
Step 207: update the second statistical value corresponding to the descriptor by the set increment.
That is, the sum of the set increment and the previous second statistical value of the descriptor is taken as the updated second statistical value of the descriptor.
When statistics are subsequently collected for other business objects according to steps 204 to 207, the updated second statistical value obtained in this step is carried into the next statistical calculation for this descriptor. In other words, the next update of the descriptor's second statistical value builds on the updated value obtained in this step.
Put differently, the value of the second statistical value before an update is the value left by the previous update. The first time statistics are collected for a descriptor, its second statistical value is an initial value, which may be set to 0. The set increment may be set to 1.
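The counting loop of steps 204 to 207 can be sketched as below. This is a simplified illustration under stated assumptions: business objects are dicts with hypothetical `title` and `brand` keys, extraction uses a plain substring test rather than Aho-Corasick, and "present in the specified description information" is also approximated by a substring test.

```python
from collections import defaultdict

INCREMENT = 1  # the set increment; the text suggests 1, with initial values of 0

def collect_statistics(business_objects, dictionary, field="brand"):
    """Return (first, second) statistical values for each descriptor."""
    first = defaultdict(int)   # in title AND in specified description
    second = defaultdict(int)  # in title but NOT in specified description
    for obj in business_objects:
        title = obj["title"]
        description = obj.get(field) or ""
        for word in dictionary:
            if word not in title:           # step 204: descriptor absent from title
                continue
            if word in description:         # step 205
                first[word] += INCREMENT    # step 206
            else:
                second[word] += INCREMENT   # step 207
    return first, second

# Hypothetical sample data.
objects = [
    {"title": "Acme running shoes", "brand": "Acme"},
    {"title": "Bolt sneakers", "brand": "Blot"},
    {"title": "Bolt sandals", "brand": "Bolt"},
    {"title": "Blot style Acme boots", "brand": "Acme"},
]
first, second = collect_statistics(objects, ["Acme", "Blot", "Bolt"])
```

In this sample, the mistyped `"Blot"` never collects a first statistical value, which is the signal the later screening step relies on.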
Step 208: after steps 204 to 207 have been performed for each of the plurality of business objects, each descriptor in the descriptor dictionary has a corresponding first statistical value and second statistical value. In this step, the descriptors in the dictionary are screened according to these values to obtain the updated descriptor dictionary. Specifically, either of the following modes may be used:
First mode: first, determine a composite score for each descriptor in the descriptor dictionary from its first and second statistical values;
The higher the first statistical value of a descriptor, the more accurate the descriptor, so the composite score may increase as the first statistical value increases. Conversely, the higher the second statistical value of a descriptor, the less accurate the descriptor, so the composite score may decrease as the second statistical value increases;
The composite score can be calculated in various ways as required. For example, it may be the difference obtained by subtracting the second statistical value from the first statistical value, or the ratio of the first statistical value to the sum of the first and second statistical values;
Then, screen the descriptors in the descriptor dictionary according to their composite scores. For example, remove from the dictionary the descriptors whose composite score is below a preset score threshold and retain those whose composite score is not below it, thereby obtaining the updated descriptor dictionary.
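The two example score definitions of the first mode, and the threshold filter that follows them, can be sketched as below. The function names, sample statistics, and threshold values are illustrative assumptions; the text leaves the thresholds to be chosen as needed.

```python
def difference_score(p, n):
    """First statistical value minus second statistical value."""
    return p - n

def ratio_score(p, n):
    """Share of the first statistical value in the sum of both values."""
    total = p + n
    return p / total if total else 0.0  # assumption: score 0 when nothing was counted

def screen_by_score(stats, score_fn, threshold):
    """stats maps descriptor -> (first, second); keep scores not below threshold."""
    return {w for w, (p, n) in stats.items() if score_fn(p, n) >= threshold}

# Hypothetical statistics for three descriptors.
stats = {"Acme": (40, 2), "Blot": (1, 9), "Bolt": (12, 2)}
kept_by_difference = screen_by_score(stats, difference_score, threshold=5)
kept_by_ratio = screen_by_score(stats, ratio_score, threshold=0.8)
```

The difference favors frequently seen descriptors, while the ratio is independent of volume, so the two definitions can rank rare descriptors quite differently.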
Second mode: from the descriptors in the descriptor dictionary, select those whose first statistical value satisfies a first preset statistical value condition and whose second statistical value satisfies a second preset statistical value condition, and form the updated descriptor dictionary from them;
The first and second preset statistical value conditions can be set flexibly according to actual needs. For example, because a higher first statistical value indicates a more accurate descriptor and a higher second statistical value indicates a less accurate one, the first preset statistical value condition may be that the first statistical value is not less than a first preset statistical value threshold, and the second preset statistical value condition may be that the second statistical value is less than a second preset statistical value threshold.
If step 203 also counted the number of times each descriptor in the descriptor dictionary appears in the specified description information of the plurality of business objects, then in this step the descriptors may be screened according to their first and second statistical values together with that count, to obtain the updated descriptor dictionary. Specifically, either of the following modes may be used:
Third mode: first, determine a composite score for each descriptor in the descriptor dictionary from its first and second statistical values and the number of times it appears in the specified description information of the plurality of business objects;
The higher the first statistical value of a descriptor, the more accurate the descriptor, so the composite score may increase as the first statistical value increases. Conversely, the higher the second statistical value of a descriptor, the less accurate the descriptor, so the composite score may decrease as the second statistical value increases. In addition, the more often a descriptor appears in the specified description information of the plurality of business objects, the more accurate the descriptor, so the composite score may increase as this count increases;
The composite score can be calculated in various ways as required. For example, it may be the count plus the first statistical value minus the second statistical value, or a weighted sum of the count and the difference obtained by subtracting the second statistical value from the first statistical value;
Preferably, the embodiments of this application propose the following formula for calculating the composite score of a descriptor:
Score = log(C + n1) + ((P + n2) / (N + n2) + 1) / Th;
where Score is the composite score of a descriptor, C is the number of times the descriptor appears in the specified description information of the plurality of business objects, P is the first statistical value of the descriptor, N is the second statistical value of the descriptor, and Th is an adjustment threshold. n1 and n2 are smoothing coefficients used to obtain smoothed data; for example, n1 may be set to 2 and n2 to 1. The adjustment threshold Th can be set flexibly according to actual needs and the actual statistics, and is used to filter out noise;
Then, screen the descriptors in the descriptor dictionary according to their composite scores. For example, remove from the dictionary the descriptors whose composite score is below a preset score threshold and retain those whose composite score is not below it, thereby obtaining the updated descriptor dictionary.
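The third mode can be sketched as follows, using the formula as it is printed in this text. Several points are assumptions rather than statements from the source: the logarithm is taken as the natural log (the base is not specified), and the values chosen for `th` and `score_threshold`, the function names, and the sample statistics are purely illustrative.

```python
import math

def composite_score(c, p, n, th, n1=2, n2=1):
    """Score = log(C + n1) + ((P + n2) / (N + n2) + 1) / Th.

    c: occurrences in the specified description information
    p: first statistical value, n: second statistical value
    th: adjustment threshold; n1, n2: smoothing coefficients
    """
    return math.log(c + n1) + ((p + n2) / (n + n2) + 1) / th

def screen(stats, th, score_threshold):
    """stats maps descriptor -> (C, P, N); keep scores not below the threshold."""
    scores = {w: composite_score(c, p, n, th) for w, (c, p, n) in stats.items()}
    return {w for w, s in scores.items() if s >= score_threshold}

# Hypothetical statistics: a common, consistent brand and a rare, inconsistent one.
stats = {"Acme": (50, 40, 2), "Blot": (3, 1, 9)}
kept = screen(stats, th=10, score_threshold=3.0)
```

The smoothing coefficient `n2` keeps the ratio defined when a descriptor has a second statistical value of 0, and `n1` keeps the logarithm defined when the count is 0.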
Fourth mode: from the descriptors in the descriptor dictionary, select those whose first statistical value satisfies a first preset statistical value condition, whose second statistical value satisfies a second preset statistical value condition, and whose number of occurrences in the specified description information of the plurality of business objects satisfies a preset count condition, and form the updated descriptor dictionary from them;
The first preset statistical value condition, the second preset statistical value condition, and the preset count condition can be set flexibly according to actual needs. For example, a higher first statistical value indicates a more accurate descriptor, a higher second statistical value indicates a less accurate one, and a larger number of occurrences in the specified description information of the plurality of business objects indicates a more accurate one. Accordingly, the first preset statistical value condition may be that the first statistical value is not less than a first preset statistical value threshold, the second preset statistical value condition may be that the second statistical value is less than a second preset statistical value threshold, and the preset count condition may be that the number of occurrences in the specified description information of the plurality of business objects is not less than a preset count threshold.
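The condition-based modes (the second mode, and the fourth mode when occurrence counts are available) can be sketched with one filter. The threshold values and names here are illustrative assumptions; passing `min_count=None` reduces the filter to the second mode.

```python
def screen_by_conditions(stats, min_first, max_second, min_count=None):
    """stats maps descriptor -> (count, first, second).

    Keeps a descriptor when first >= min_first and second < max_second,
    and, if min_count is given, when count >= min_count as well.
    """
    kept = set()
    for word, (count, first, second) in stats.items():
        if first < min_first or second >= max_second:
            continue
        if min_count is not None and count < min_count:
            continue
        kept.add(word)
    return kept

# Hypothetical statistics: (occurrence count, first value, second value).
stats = {"Acme": (50, 40, 2), "Blot": (3, 1, 9), "Bolt": (4, 12, 2)}
second_mode = screen_by_conditions(stats, min_first=5, max_second=5)
fourth_mode = screen_by_conditions(stats, min_first=5, max_second=5, min_count=10)
```

Compared with a composite score, independent thresholds are easier to explain but cannot trade a weak value on one measure against a strong value on another.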
The descriptor screening method provided in Embodiment 1 of this application works as follows. When a descriptor is present in both the title content and the specified description information of a business object, this indicates that the descriptor is accurate to some extent. Conversely, when a descriptor is present only in the title content of a business object and not in its specified description information, this indicates that the descriptor is inaccurate to some extent. Therefore, after the statistics have been collected for all of the business objects, each descriptor in the descriptor dictionary has a first and a second statistical value; the larger the first statistical value, the more accurate the descriptor, and the larger the second statistical value, the less accurate the descriptor. Screening the descriptors according to these values and removing the inaccurate ones yields an updated descriptor dictionary whose descriptors are more accurate, which improves the accuracy of the determined descriptors.
In the embodiments of this application, after the updated descriptor dictionary has been obtained through the above screening method, the descriptors in the updated dictionary may also be sorted for display, for example in descending order of the composite score determined in the third mode above.
In the embodiments of this application, after the updated descriptor dictionary has been obtained through the above screening method, descriptor recognition may also be performed on a business object based on the descriptors in the updated dictionary, in order to supplement the specified description information of the business object or to correct inaccurate descriptors in it. For a business object to be processed, as shown in Fig. 3, this may specifically include the following processing steps:
Step 301, each descriptor included based on the descriptor dictionary after updating, in the title of pending business object Rong Zhong, extracts descriptor present in descriptor dictionary in the updated.
This step can carry out word segmentation processing by title content based on this business object, and determines that each participle obtained exists Whether the descriptor dictionary after this renewal exists.
Based on each descriptor in the descriptor dictionary after updating, this step can also determine that whether this descriptor is at this The title content of business object exists, now need not the title content to this business object and carry out word segmentation processing.
Preferably, this step specifically can use Aho-Corasick algorithm to realize.
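For reference, a compact Aho-Corasick matcher might look like the sketch below. It builds a character trie over the dictionary, adds failure links by breadth-first traversal, and then scans the title once, collecting every dictionary descriptor that occurs in it. The function names and the list-based state tables are illustrative choices, not taken from the source.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure, and output tables for the given descriptors."""
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:
        if not pat:
            continue  # ignore empty descriptors
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Breadth-first pass: depth-1 states keep failure link 0.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target (proper suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_descriptors(title, automaton):
    """Return the set of dictionary descriptors occurring anywhere in title."""
    goto, fail, out = automaton
    state, found = 0, set()
    for ch in title:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found.update(out[state])
    return found


automaton = build_automaton(["Acme", "me", "Zeta"])
hits = find_descriptors("Acme shoes", automaton)
```

Because matching is character based, the same sketch works on unsegmented Chinese titles, which is the case where skipping word segmentation matters most.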
Step 302: determine whether the extracted descriptor is present in the specified description information of the business object to be processed. If it is not present, go to step 303; if it is present, go to step 304.
Step 303: update the specified description information of the business object to be processed according to the extracted descriptor.
Specifically, the extracted descriptor may be added to the specified description information of the business object to be processed, or it may replace a descriptor in that specified description information.
For example, if the specified description information of the business object to be processed is empty, the extracted descriptor may be added to it; if the specified description information is not empty and the extracted descriptor is similar to a descriptor in it, for instance because the two contain identical characters, the extracted descriptor may replace that descriptor.
Step 304: leave the specified description information of the business object to be processed unchanged.
Because the descriptors in the updated descriptor dictionary are more accurate, performing descriptor recognition on business objects with the updated dictionary improves recognition accuracy and avoids later correction of the recognition results, which reduces wasted processing resources and improves the efficiency of descriptor recognition.
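Steps 301 to 304 can be sketched as a single function. Several details here are assumptions rather than statements from the source: the description is modeled as a list of descriptors, substring matching stands in for extraction, "similar" is read as "shares at least one character", and a descriptor that is neither present nor similar to an existing one is simply appended.

```python
def recognize(title, description, updated_dictionary):
    """Return the specified description information after steps 301-304.

    title: title content of the business object to be processed.
    description: list of descriptors currently in its specified description information.
    updated_dictionary: iterable of descriptors that survived screening.
    """
    result = list(description)  # do not mutate the caller's list
    for d in sorted(updated_dictionary):  # sorted only for deterministic output
        if d not in title:
            continue  # step 301: not extracted from the title
        if d in result:
            continue  # steps 302 and 304: already present, keep unchanged
        # Step 303: replace a similar descriptor if one exists, otherwise add.
        similar = next(
            (i for i, old in enumerate(result) if set(old) & set(d)), None
        )
        if similar is None:
            result.append(d)
        else:
            result[similar] = d
    return result


added = recognize("Acme running shoes", [], {"Acme", "Zeta"})
kept = recognize("Acme running shoes", ["Acme"], {"Acme"})
corrected = recognize("Acme running shoes", ["Acmee"], {"Acme"})
```

The shared-character test is deliberately crude; a production system would likely need a stricter similarity measure to avoid replacing unrelated descriptors.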
Embodiment 2:
Based on the same inventive concept as the descriptor screening method provided in the above embodiment, Embodiment 2 of the present application provides a descriptor screening device. Its structure is shown schematically in Fig. 4, and it includes:
A first extraction unit 401, configured to extract, for each business object among multiple business objects and based on the descriptors in a descriptor dictionary, the descriptors present in the descriptor dictionary from the title content of the business object;
A statistics unit 402, configured to determine whether the extracted descriptor is present in the specified description information of the business object; if it is, to update the first statistical value corresponding to the descriptor by a set increment; if it is not, to update the second statistical value corresponding to the descriptor by the set increment;
A screening unit 403, configured to, after every business object among the multiple business objects has been processed by the first extraction unit and the statistics unit, screen the descriptors in the descriptor dictionary according to the first statistical value and second statistical value corresponding to each descriptor, to obtain an updated descriptor dictionary.
Further, the above device also includes:
A dictionary determination unit 404, configured to determine the descriptors in the specified description information of the multiple business objects, and to form the descriptor dictionary from the descriptors that occur in that specified description information.
Further, the above device also includes:
A count determination unit 405, configured to count, for each descriptor in the descriptor dictionary, the number of times it occurs in the specified description information of the multiple business objects;
The screening unit 403 is specifically configured to screen the descriptors in the descriptor dictionary according to the first statistical value and second statistical value corresponding to each descriptor and the number of times each descriptor occurs in the specified description information of the multiple business objects.
Further, the screening unit 403 is specifically configured to determine a comprehensive score for each descriptor in the descriptor dictionary according to its corresponding first statistical value and second statistical value, and to screen the descriptors in the descriptor dictionary according to their comprehensive scores.
Further, the screening unit 403 is specifically configured to select, from the descriptors in the descriptor dictionary, those whose first statistical value satisfies a first preset statistical value condition and whose second statistical value satisfies a second preset statistical value condition, to form the updated descriptor dictionary.
Further, the above device also includes:
A second extraction unit 406, configured to extract, based on the descriptors in the updated descriptor dictionary, the descriptors present in the updated descriptor dictionary from the title content of a business object to be processed;
A descriptor supplementing unit 407, configured to, when the extracted descriptor is not present in the specified description information of the business object to be processed, add the extracted descriptor to that specified description information, or replace a descriptor in that specified description information with the extracted descriptor.
The functions of the above units correspond to the respective processing steps in the flows shown in Fig. 1 to Fig. 3 and are not repeated here.
In summary, the scheme provided in the embodiments of the present application includes: for each business object among multiple business objects, extracting from the title content of the business object, based on the descriptors in a descriptor dictionary, the descriptors present in the descriptor dictionary, and determining whether each extracted descriptor is present in the specified description information of the business object; if it is, updating the first statistical value corresponding to the descriptor by a set increment; if it is not, updating the second statistical value corresponding to the descriptor by the set increment; and, after these statistics have been collected for every business object among the multiple business objects, screening the descriptors in the descriptor dictionary according to the first statistical value and second statistical value corresponding to each descriptor, to obtain an updated descriptor dictionary. The scheme provided in the embodiments of the present application improves the accuracy with which descriptors of business objects are determined.
The screening device provided in the embodiments of the present application may be implemented by a computer program. Those skilled in the art should understand that the above division into modules is only one of many possible divisions; whether the device is divided into other modules or not divided into modules at all, any screening device having the above functions falls within the protection scope of the present application.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Obviously, those skilled in the art can make various changes and modifications to the present application without departing from its spirit and scope. If these changes and modifications fall within the scope of the claims of the present application and their technical equivalents, the present application is intended to cover them as well.

Claims (15)

1. A descriptor screening method, characterized by comprising:
for each business object among multiple business objects, performing the following step A and step B:
Step A: based on the descriptors in a descriptor dictionary, extracting from the title content of the business object the descriptors present in the descriptor dictionary;
Step B: determining whether the extracted descriptor is present in the specified description information of the business object; if it is, updating the first statistical value corresponding to the descriptor by a set increment; if it is not, updating the second statistical value corresponding to the descriptor by the set increment;
after step A and step B have been performed for every business object among the multiple business objects, screening the descriptors in the descriptor dictionary according to the first statistical value and second statistical value corresponding to each descriptor, to obtain an updated descriptor dictionary, wherein the larger the first statistical value corresponding to a descriptor, the more accurate the descriptor, and the larger the second statistical value corresponding to a descriptor, the less accurate the descriptor.
2. The method of claim 1, characterized in that the descriptor dictionary is determined by:
determining the descriptors in the specified description information of the multiple business objects;
forming the descriptor dictionary from the descriptors that occur in the specified description information of the multiple business objects.
3. The method of claim 2, characterized by further comprising:
counting, for each descriptor in the descriptor dictionary, the number of times it occurs in the specified description information of the multiple business objects;
wherein screening the descriptors in the descriptor dictionary according to the first statistical value and second statistical value corresponding to each descriptor is specifically:
screening the descriptors in the descriptor dictionary according to the first statistical value and second statistical value corresponding to each descriptor and the number of times each descriptor occurs in the specified description information of the multiple business objects.
4. The method of claim 1, characterized in that screening the descriptors in the descriptor dictionary according to the first statistical value and second statistical value corresponding to each descriptor specifically comprises:
determining a comprehensive score for each descriptor in the descriptor dictionary according to its corresponding first statistical value and second statistical value; and screening the descriptors in the descriptor dictionary according to their comprehensive scores.
5. The method of claim 4, characterized in that the comprehensive score is calculated by the following formula: Score = log(C + n1) + ((P + n2)/(N + n2) + 1)/Th;
wherein Score is the comprehensive score of a descriptor, C is the number of times the descriptor occurs in the specified description information of the multiple business objects, P is the first statistical value corresponding to the descriptor, N is the second statistical value corresponding to the descriptor, Th is an adjustment threshold, and n1 and n2 are smoothing adjustment coefficients.
6. The method of claim 1, characterized in that screening the descriptors in the descriptor dictionary according to the first statistical value and second statistical value corresponding to each descriptor specifically comprises:
selecting, from the descriptors in the descriptor dictionary, those whose first statistical value satisfies a first preset statistical value condition and whose second statistical value satisfies a second preset statistical value condition, to form the updated descriptor dictionary.
7. The method of claim 1, characterized by further comprising:
based on the descriptors in the updated descriptor dictionary, extracting from the title content of a business object to be processed the descriptors present in the updated descriptor dictionary;
when the extracted descriptor is not present in the specified description information of the business object to be processed, adding the extracted descriptor to that specified description information, or replacing a descriptor in that specified description information with the extracted descriptor.
8. The method of any one of claims 1-7, characterized in that the multiple business objects are business objects belonging to the same category.
9. The method of any one of claims 1-7, characterized in that each descriptor in the descriptor dictionary is a brand word, and the specified description information is brand information.
10. A descriptor screening device, characterized by comprising:
a first extraction unit, configured to extract, for each business object among multiple business objects and based on the descriptors in a descriptor dictionary, the descriptors present in the descriptor dictionary from the title content of the business object;
a statistics unit, configured to determine whether the extracted descriptor is present in the specified description information of the business object; if it is, to update the first statistical value corresponding to the descriptor by a set increment; if it is not, to update the second statistical value corresponding to the descriptor by the set increment;
a screening unit, configured to, after every business object among the multiple business objects has been processed by the first extraction unit and the statistics unit, screen the descriptors in the descriptor dictionary according to the first statistical value and second statistical value corresponding to each descriptor, to obtain an updated descriptor dictionary, wherein the larger the first statistical value corresponding to a descriptor, the more accurate the descriptor, and the larger the second statistical value corresponding to a descriptor, the less accurate the descriptor.
11. The device of claim 10, characterized by further comprising:
a dictionary determination unit, configured to determine the descriptors in the specified description information of the multiple business objects, and to form the descriptor dictionary from the descriptors that occur in the specified description information of the multiple business objects.
12. The device of claim 11, characterized by further comprising:
a count determination unit, configured to count, for each descriptor in the descriptor dictionary, the number of times it occurs in the specified description information of the multiple business objects;
wherein the screening unit is specifically configured to screen the descriptors in the descriptor dictionary according to the first statistical value and second statistical value corresponding to each descriptor and the number of times each descriptor occurs in the specified description information of the multiple business objects.
13. The device of claim 10, characterized in that the screening unit is specifically configured to determine a comprehensive score for each descriptor in the descriptor dictionary according to its corresponding first statistical value and second statistical value, and to screen the descriptors in the descriptor dictionary according to their comprehensive scores.
14. The device of claim 10, characterized in that the screening unit is specifically configured to select, from the descriptors in the descriptor dictionary, those whose first statistical value satisfies a first preset statistical value condition and whose second statistical value satisfies a second preset statistical value condition, to form the updated descriptor dictionary.
15. The device of claim 10, characterized by further comprising:
a second extraction unit, configured to extract, based on the descriptors in the updated descriptor dictionary, the descriptors present in the updated descriptor dictionary from the title content of a business object to be processed;
a descriptor supplementing unit, configured to, when the extracted descriptor is not present in the specified description information of the business object to be processed, add the extracted descriptor to that specified description information, or replace a descriptor in that specified description information with the extracted descriptor.
CN201210551720.8A 2012-12-18 2012-12-18 A kind of descriptor screening technique and device Active CN103870446B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210551720.8A CN103870446B (en) 2012-12-18 2012-12-18 A kind of descriptor screening technique and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210551720.8A CN103870446B (en) 2012-12-18 2012-12-18 A kind of descriptor screening technique and device

Publications (2)

Publication Number Publication Date
CN103870446A CN103870446A (en) 2014-06-18
CN103870446B true CN103870446B (en) 2016-12-28

Family

ID=50908990

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210551720.8A Active CN103870446B (en) 2012-12-18 2012-12-18 A kind of descriptor screening technique and device

Country Status (1)

Country Link
CN (1) CN103870446B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106469184B (en) * 2015-08-20 2019-12-27 阿里巴巴集团控股有限公司 Data object label processing and displaying method, server and client

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101101599A (en) * 2007-06-20 2008-01-09 精实万维软件(北京)有限公司 Method for extracting advertisement main information from web page
CN102473190A (en) * 2009-07-30 2012-05-23 阿尔卡特朗讯 Keyword assignment to a web page
CN102682001A (en) * 2011-03-09 2012-09-19 阿里巴巴集团控股有限公司 Method and device for determining suggest word

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7152059B2 (en) * 2002-08-30 2006-12-19 Emergency24, Inc. System and method for predicting additional search results of a computerized database search user based on an initial search query
JP2007104312A (en) * 2005-10-04 2007-04-19 Toshiba Corp Information processing method using electronic guide information and apparatus thereof

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
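A dictionary-based dynamic update model for search engine systems; Lei Ming et al.; Journal of Computer Research and Development (计算机研究与发展); 2000-10-31; Vol. 37, No. 10; pp. 1265-1270 *
Research on machine-learning-based text cluster description algorithms; Zhang Chengzhi; Proceedings of the 3rd National Conference on Information Retrieval and Content Security (第三届全国信息检索与内容安全学术会议论文集); 2007-11-01; pp. 216-225 *
Automatic enrichment of thesauri: extracting keywords from metadata and locating them; Wang Jun; Journal of Chinese Information Processing (中文信息学报); 2005-11-25; Vol. 19, No. 6; pp. 36-43 *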

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant