CN106980667A - Method and apparatus for labeling articles - Google Patents

Method and apparatus for labeling articles

Info

Publication number
CN106980667A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710172954.4A
Other languages
Chinese (zh)
Other versions
CN106980667B (en
Inventor
潘岸腾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Guangzhou Youshi Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Youshi Network Technology Co Ltd
Priority to CN201710172954.4A
Publication of CN106980667A
Priority to PCT/CN2018/071607 (WO2018171295A1)
Application granted
Publication of CN106980667B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and apparatus for labeling articles. The method comprises: extracting multiple keywords from all articles in an existing article resource library and building a keyword database; determining a first weight of each keyword in the keyword database with respect to each label in a pre-established tag library; determining, from the obtained first weights and the word frequency of each keyword, a second weight of each label in the tag library with respect to each article in the existing article resource library; and, based on the obtained second weights, selecting a certain number of labels in a predetermined manner and attaching them to the corresponding article.

Description

Method and apparatus for labeling articles
Technical field
The present invention relates to the technical field of information processing, and in particular to a method and apparatus for labeling articles.
Background
With the spread of communication networks and smart terminals, people are increasingly used to reading on electronic devices. For example, they log in to news or novel websites on a computer to read news or novels, or log in to an online library to read books. They also read through third-party applications installed on smart mobile terminals such as smartphones or tablets, for example the news app Toutiao, the novel app Shuqi Novel, and various periodical apps.
Whether users read news, novels, or papers on websites via a computer or through third-party reading applications, large volumes of news content, novels, and papers must be classified and consolidated, so that the consolidated data can conveniently be displayed by content category or recommended according to user interest.
When articles such as news, novels, or papers are classified and consolidated, many of them come from external data sources and carry no category or label information, so classifying them is difficult. The traditional approach is for operations staff to judge from experience which category an article belongs to. This approach has two drawbacks:
1. It consumes a great deal of labor. For each newly added article, especially highly time-sensitive news, operations staff must quickly read the article and then assign it to an existing category.
2. It is inefficient and costly, and highly specialized articles require professional judgment. Classifying articles one by one by hand is slow; and for highly specialized articles, such as news on economics, wealth management, or investment, whose content is very similar, only professionals can ensure correct classification, which is expensive.
Summary of the invention
An object of the present invention is to provide a method and apparatus for labeling articles that address the above problems.
An embodiment of the present invention provides a method for labeling articles, comprising:
extracting multiple keywords from all articles in an existing article resource library and building a keyword database, the keyword database including but not limited to: the multiple keywords, and the word frequency with which each keyword occurs in each article in the existing article resource library;
determining a first weight of each keyword in the keyword database with respect to each label in a pre-established tag library;
determining, from the obtained first weights and the word frequency of each keyword, a second weight of each label in the tag library with respect to each article in the existing article resource library;
selecting, based on the obtained second weights, a certain number of labels in a predetermined manner and attaching them to the corresponding article.
An embodiment of the present invention also provides an apparatus for labeling articles, comprising:
a keyword database building unit, configured to extract multiple keywords from all articles in an existing article resource library and build a keyword database, the keyword database including but not limited to: the multiple keywords, and the word frequency with which each keyword occurs in each article in the existing article resource library;
a first weight determining unit, configured to determine a first weight of each keyword in the keyword database with respect to each label in a pre-established tag library;
a second weight determining unit, configured to determine, from the obtained first weights and the word frequency of each keyword, a second weight of each label in the tag library with respect to each article in the existing article resource library;
a labeling unit, configured to select, based on the obtained second weights, a certain number of labels in a predetermined manner and attach them to the corresponding article.
Building the keyword database comprises:
First, extracting multiple segmented words from all articles in the existing article resource library using a word segmentation technique, and building a segmented-word library;
Then, determining the resolution of each segmented word in the segmented-word library:
where:
$S_i$ is the resolution of segmented word $i$;
$\theta$ is a user-defined decimal;
$P_{l,i}$ is the word frequency of segmented word $i$ of the segmented-word library in article $l$ of the existing article resource library; if segmented word $i$ does not appear in article $l$, then $P_{l,i} = 0$;
$|L|$ is the total number of articles in the existing article resource library;
$\mathrm{pct}([P_{l,i}]_{l \in L}, \theta, 1)$ sorts the elements of the array $[P_{l,i}]_{l \in L}$ in descending order of value and sums the elements ranked from the $\theta$ quantile position to the last position;
$\mathrm{pct}([P_{l,i}]_{l \in L}, 0, \theta)$ sorts the elements of the array $[P_{l,i}]_{l \in L}$ in descending order of value and sums the elements ranked from the first position to the $\theta$ quantile position;
$L$ is the set of all articles in the existing article resource library;
Finally, selecting a certain number of words as the keywords in a preset manner according to the resolution.
The first weight is determined as follows:
where:
$TW_{t,w}$ is the first weight of keyword $w$ in the keyword database with respect to label $t$ in the pre-established tag library; if the text of label $t$ does not contain keyword $w$, then $TW_{t,w}$ is 0;
$PLT_{l,t}$ is the word frequency with which label $t$ of the tag library occurs in article $l$ of the existing article resource library;
$PLW_{l,w}$ is the word frequency with which keyword $w$ of the keyword database occurs in article $l$ of the existing article resource library;
$|L|$ is the total number of articles in the existing article resource library;
$L$ is the set of all articles in the existing article resource library.
The second weight of a label with respect to an article is determined as follows:
where:
$LP_{l,t}$ is the second weight of label $t$ in the pre-established tag library with respect to article $l$ in the existing article resource library;
$TW_{t,w}$ is the first weight of keyword $w$ in the keyword database with respect to label $t$ in the pre-established tag library; if the text of label $t$ does not contain keyword $w$, then $TW_{t,w}$ is 0;
$PLW_{l,w}$ is the word frequency with which keyword $w$ of the keyword database occurs in article $l$ of the existing article resource library;
$N$ is the total number of keywords in the keyword database.
Preferably, the second weight of a label with respect to an article is standardized to obtain the relative second weight of the label with respect to the article, as follows:
where:
$LPC_{l,t}$ is the relative second weight of label $t$ in the pre-established tag library with respect to article $l$ in the existing article resource library;
$LP_{l,t}$ is the second weight of label $t$ in the pre-established tag library with respect to article $l$ in the existing article resource library;
$\overline{LP}_t$ is the average second weight of label $t$ in the pre-established tag library over the articles in the existing article resource library;
$|L|$ is the total number of articles in the existing article resource library.
Preferably, selecting a certain number of labels in a predetermined manner based on the obtained second weights and attaching them to the corresponding article comprises: selecting a certain number of labels in order of second weight and attaching them to the corresponding article, or selecting the one or more labels whose second weights exceed a preset threshold and attaching them to the corresponding article.
With the method and apparatus for labeling articles according to the present invention, establishing the association between the labels in the tag library and the articles makes it possible to attach suitable labels automatically to new articles from external data sources or to articles without labels. Each label represents one category, or several labels point to one category. This saves a great deal of labor, markedly improves on the inefficiency of manual operation, and substantially reduces operating costs.
Brief description of the drawings
Fig. 1 is a flowchart of the method for labeling articles provided by an embodiment of the present invention;
Fig. 2 is a schematic block diagram of the apparatus for labeling articles provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the embodiments and the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. The components of the embodiments generally described and illustrated in the drawings can be arranged and designed in a variety of configurations. Therefore, the following detailed description of the embodiments provided in the drawings is not intended to limit the scope of the claimed invention, but merely represents selected embodiments of the invention. All other embodiments obtained by those skilled in the art based on these embodiments without creative effort fall within the scope of protection of the invention.
Fig. 1 is a flowchart of the method for labeling articles provided by an embodiment of the present invention. As shown in Fig. 1, the method of the present invention for labeling articles comprises the following steps:
S1: Extract multiple keywords from all articles in the existing article resource library and build a keyword database, the keyword database including but not limited to: the multiple keywords, and the word frequency with which each keyword occurs in each article in the existing article resource library.
Network service providers that offer article reading on electronic devices all build an article resource library on their servers, so that users can read online, or download articles to a terminal for reading, on devices such as computers or smart terminals. "Article" here refers to any book or text that can be read, including but not limited to: novels of all types, papers, periodicals, textbooks for all subjects, exam tutoring books, workbooks, and so on. For ease of management, a service provider may also build multiple article resource libraries, for example a novel library for novels, a periodical library for papers and periodicals, a dedicated library for textbooks, tutoring books and workbooks, and a library for articles such as domestic news. Providers can decide this according to their own resource management policy; it is outside the scope of this invention, and all such libraries are collectively called the article resource library here.
To label articles automatically, multiple keywords must first be extracted from all articles in the existing article resource library to build the keyword database. This step is carried out as follows:
First, multiple segmented words are extracted from all articles in the existing article resource library using a word segmentation technique, and a segmented-word library is built.
As noted above, service providers of third-party electronic reading applications pre-build an article resource library on their servers and store all articles in it; this is common practice for those skilled in the art and is not described further here. Any known word segmentation technique is applied to each article in the existing article resource library to extract multiple segmented words, and a segmented-word library is built from them. The segmented-word library may include but is not limited to: 1. the association between each segmented word and each article, i.e., which articles each segmented word comes from; 2. the word frequency of each segmented word in each article.
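The segmented-word library described above can be sketched as a mapping from each word to the articles it occurs in, together with its frequency there. This is only an illustration: the `segment` function is a hypothetical whitespace splitter standing in for "any known word segmentation technique", and the article IDs and texts are invented.

```python
from collections import Counter, defaultdict


def segment(text):
    # Placeholder segmenter (assumption): lowercase and split on whitespace.
    # A real system would plug in an actual word segmentation tool here.
    return text.lower().split()


def build_word_library(articles):
    """Map each segmented word to {article_id: word frequency in that article}."""
    library = defaultdict(dict)
    for article_id, text in articles.items():
        for word, count in Counter(segment(text)).items():
            library[word][article_id] = count
    return dict(library)


# Invented sample data for illustration only.
articles = {
    "a1": "stocks rise as stocks rally",
    "a2": "exam prep guide for the exam",
}
library = build_word_library(articles)
```

The keys of each inner dictionary record which articles a word comes from (item 1 above), and the values record its word frequency per article (item 2); a word absent from an article simply has no entry, which corresponds to a frequency of 0.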
Then, the resolution of each segmented word in the segmented-word library is determined:
where:
$S_i$ is the resolution of segmented word $i$;
$\theta$ is a user-defined decimal whose value in practice is set according to the number of labels;
$|T|$ is the total number of labels in the existing tag library;
$P_{l,i}$ is the word frequency of segmented word $i$ of the segmented-word library in article $l$ of the existing article resource library; if segmented word $i$ does not appear in article $l$, then $P_{l,i} = 0$;
$|L|$ is the total number of articles in the existing article resource library.
Let $P$ be an array of real numbers, and let $\alpha$ and $\beta$ be real numbers in $[0,1]$ with $\alpha < \beta$. The function $\mathrm{pct}(P, \alpha, \beta)$ sorts the elements of $P$ in descending order of value and sums the elements ranked between the $\alpha$ quantile position and the $\beta$ quantile position. Note: because $0 \le \alpha < \beta \le 1$, $\alpha$ and $\beta$ must have fewer digits than the array size; for example, if the array has 1000 elements, $\alpha$ and $\beta$ must have fewer than 4 digits, i.e., only 1 to 3 digits after the decimal point. To locate the quantile positions, $\alpha$ and $\beta$ are first scaled by $10^N$ into integers, where $N$ is taken from the number of decimal digits of $\alpha$ and $\beta$ respectively; the elements between position $\alpha \cdot 10^N$ and position $\beta \cdot 10^N$ of the descending-sorted array are then selected and summed. For example, for $\mathrm{pct}(P, \alpha, \beta)$ where $P$ has 10,000 elements, $\alpha = 0.324$ and $\beta = 0.8792$, we have $\alpha \cdot 10^3 = 324$ and $\beta \cdot 10^4 = 8792$, so from the 10,000 elements sorted in descending order, the 8469 elements from position 324 through position 8792 are selected and their values are summed.
It follows that:
$\mathrm{pct}([P_{l,i}]_{l \in L}, \theta, 1)$ sorts the elements of the array $[P_{l,i}]_{l \in L}$ in descending order of value and sums the elements ranked from the $\theta$ quantile position to the last position;
$\mathrm{pct}([P_{l,i}]_{l \in L}, 0, \theta)$ sorts the elements of the array $[P_{l,i}]_{l \in L}$ in descending order of value and sums the elements ranked from the first position to the $\theta$ quantile position;
$L$ is the set of all articles in the existing article resource library, so $l \in L$ means that the article $l$ being computed belongs to the existing article resource library.
The definition of $\mathrm{pct}(P, \alpha, \beta)$ is illustrated with a concrete example below.
Consider $\mathrm{pct}([0,1,3,2,5], 0.2, 1)$. The elements of $[0,1,3,2,5]$ are first sorted in descending order of value, giving $[5,3,2,1,0]$. The position of the 0.2 quantile is $0.2 \cdot 10 = 2$, i.e., the 2nd element, 3; the position of the quantile 1 is the last one, i.e., the 5th element, 0. Hence $\mathrm{pct}([0,1,3,2,5], 0.2, 1) = 3+2+1+0 = 6$.
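A minimal sketch of the pct function follows, reading the two worked examples literally: a quantile written with d decimal places is scaled by 10^d to give a 1-based rank, quantile 0 maps to the first position, and quantile 1 maps to the last. The helper name `quantile_position` and the choice to pass quantiles as strings (so that the number of decimal places is unambiguous) are assumptions made for this illustration, not part of the source.

```python
from decimal import Decimal


def quantile_position(q, n):
    """1-based rank for quantile q (a string such as "0.324") in n elements."""
    d = Decimal(q)
    if d == 0:
        return 1  # quantile 0 is taken as the first position
    if d == 1:
        return n  # quantile 1 is taken as the last position
    places = -d.as_tuple().exponent  # number of digits after the decimal point
    return int(d.scaleb(places))     # e.g. "0.324" -> 324, "0.2" -> 2


def pct(values, alpha, beta):
    """Sort descending, then sum the elements ranked from alpha to beta inclusive."""
    ranked = sorted(values, reverse=True)
    start = quantile_position(alpha, len(ranked))
    end = quantile_position(beta, len(ranked))
    return sum(ranked[start - 1:end])


example = pct([0, 1, 3, 2, 5], "0.2", "1")  # 3 + 2 + 1 + 0
```

With this reading, `example` is 6, matching the worked example, and an array of 10,000 elements with quantiles "0.324" and "0.8792" selects ranks 324 through 8792, i.e., 8469 elements. Positions beyond the end of the array are silently truncated by the slice, which is a choice of this sketch.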
Finally, a certain number of words are selected as the keywords in a preset manner according to the resolution, and the keyword database is built from the selected keywords. Because the selected keywords come from the segmented-word library, the keyword database holds the same kinds of content as the segmented-word library, including but not limited to: 1. the association between each keyword and each article, i.e., which articles each keyword comes from; 2. the word frequency of each keyword in each article. Preferred ways of selecting the keywords according to the resolution include: selecting a certain number of segmented words in order of resolution; or selecting, randomly or in order, a certain number of segmented words from those whose resolution is greater than or equal to a preset threshold.
The resolution $S_i$ of segmented word $i$ expresses the ability of that word to distinguish article topics: the higher the resolution, the stronger that ability. For example, the word "postgraduate entrance exam preparation" is directly associated with the topic "postgraduate entrance exam", whereas "study" does not point clearly to any one topic, so the resolution of "postgraduate entrance exam preparation" is higher than that of "study".
Word frequency (TF) is a standard term in the field: in a given article, the word frequency is the number of times a given word occurs in that article.
The purpose of this first step is to select popular words as keywords; label content also consists of popular keywords, which lays the groundwork for the next step. The number of keywords selected can depend on practical needs. The approach used here is to select keywords as a certain percentage of the number of articles in the article resource library. For example, when the number of articles reaches the order of ten million, around 100,000 keywords can be selected. The selected keywords can be stored as a keyword database, a keyword list, or the like; a keyword database is used as the example here.
The existing tag library was mentioned above. To label articles, whether automatically or manually, a tag library must be established in advance so that labeling is standardized. Each label in the tag library is a keyword that points to a certain topic, such as "postgraduate entrance exam preparation" or "stock trading". The tag library can be built by any known method: from labels extracted by operations staff based on experience; from article labels already available on the market; from a unified article tag library agreed jointly by peers in the industry; or from a suitable combination of these. Building a unified article tag library through joint discussion among industry peers is preferred.
S2: Determine the first weight of each keyword in the keyword database with respect to each label in the pre-established tag library.
Once the keyword database has been built, the weight of each keyword in the keyword database with respect to each label in the pre-established tag library must be determined. It is called the first weight here and is determined as follows:
where:
$TW_{t,w}$ is the first weight of keyword $w$ in the keyword database with respect to label $t$ in the pre-established tag library; if the text of label $t$ does not contain keyword $w$, then $TW_{t,w}$ is 0;
$PLT_{l,t}$ is the word frequency with which label $t$ of the tag library occurs in article $l$ of the existing article resource library;
$PLW_{l,w}$ is the word frequency with which keyword $w$ of the keyword database occurs in article $l$ of the existing article resource library;
$|L|$ is the total number of articles in the existing article resource library.
$L$ is the set of all articles in the existing article resource library, so $l \in L$ means that article $l$ belongs to the existing article resource library. $\sum_{l \in L}(PLT_{l,t} \cdot PLW_{l,w})$ computes the value $PLT_{l,t} \cdot PLW_{l,w}$ for every article in the existing article resource library and sums these values; it can also be written as $\sum_{l=1}^{|L|}(PLT_{l,t} \cdot PLW_{l,w})$, where $|L|$ is the total number of articles in the existing article resource library.
In this way, the first weight of keyword $w$ in the keyword database with respect to label $t$ in the pre-established tag library serves as the link that prepares for establishing the association between labels and articles in the next step.
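The summation over articles described above can be sketched as follows. The first-weight formula itself is not reproduced in this text, so dividing the sum by the number of articles is an assumption made purely for illustration, as are the function name and the dictionary layout; the rule that sets the first weight to 0 when the label text does not contain the keyword is not modeled.

```python
def first_weight(plt_t, plw_w, num_articles):
    """Co-occurrence of label t and keyword w, averaged over all articles.

    plt_t: {article_id: frequency of label t in that article}
    plw_w: {article_id: frequency of keyword w in that article}
    ASSUMPTION: the sum of products is divided by |L|; the source's exact
    formula may differ.
    """
    shared = plt_t.keys() & plw_w.keys()  # all other articles contribute 0
    total = sum(plt_t[l] * plw_w[l] for l in shared)
    return total / num_articles


# Invented frequencies for illustration only.
plt_t = {"a1": 2, "a2": 1}
plw_w = {"a1": 3, "a3": 5}
tw = first_weight(plt_t, plw_w, num_articles=4)
```

Only articles in which both the label and the keyword occur add to the sum, so a keyword that frequently appears alongside a label across the library receives a larger first weight under this reading.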
S3: From the obtained first weights and the word frequency of each keyword, determine the second weight of each label in the tag library with respect to each article in the existing article resource library.
The word frequency with which each keyword occurs in each article of the existing article resource library is needed here; it was already counted when the keyword database was built and is stored there. For example, if the word "stock trading" appears 20 times in an article A that introduces stock trading, the word frequency of "stock trading" in article A is 20.
Using the obtained first weight of each keyword in the keyword database with respect to each label, together with the recorded word frequency of each keyword in each article of the article resource library, the weight of each label in the tag library with respect to each article in the existing article resource library is determined. It is called the second weight and is determined as follows:
where:
$LP_{l,t}$ is the second weight of label $t$ in the pre-established tag library with respect to article $l$ in the existing article resource library;
$TW_{t,w}$ is the first weight of keyword $w$ in the keyword database with respect to label $t$ in the pre-established tag library; if the text of label $t$ does not contain keyword $w$, then $TW_{t,w}$ is 0;
$PLW_{l,w}$ is the word frequency with which keyword $w$ of the keyword database occurs in article $l$ of the existing article resource library;
$N$ is the total number of keywords in the keyword database.
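One reading consistent with the symbols defined above is that the second weight sums, over all N keywords, the first weight of each keyword for the label multiplied by that keyword's frequency in the article. The formula image is not reproduced in this text, so the sketch below is an assumption; the function name and sample numbers are invented.

```python
def second_weight(tw_t, plw_l):
    """Assumed form: sum over keywords w of TW[t, w] * PLW[l, w].

    tw_t:  {keyword: first weight of that keyword for label t}
    plw_l: {keyword: word frequency of that keyword in article l}
    """
    # Keywords with no first weight for this label contribute 0.
    return sum(tw_t.get(w, 0.0) * freq for w, freq in plw_l.items())


# Invented values: "stock trading" appears 20 times in the article.
tw_t = {"stock trading": 0.5, "market": 0.25}
plw_l = {"stock trading": 20, "study": 3}
lp = second_weight(tw_t, plw_l)
```

Iterating over the keywords that actually occur in the article is equivalent to summing over all N keywords, because absent keywords have a frequency of 0.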
S4: Based on the obtained second weights, select a certain number of labels in a predetermined manner and attach them to the corresponding article.
After the second weight of each label in the pre-established tag library with respect to each article in the existing article resource library has been obtained, a certain number of labels are selected in a predetermined manner based on these second weights and attached to the corresponding article. Preferably, the labels are selected in order of second weight. For example, after the second weight of each label in the tag library with respect to an article A has been obtained, a certain number of labels are selected in descending order of second weight, such as the top 1 to 3 or the top 1 to 5 labels, and attached to article A. Alternatively, a threshold can be set in advance, and the one or more labels whose second weights exceed the preset threshold are attached to the corresponding article.
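The two selection modes described above (top-ranked labels, or labels above a preset threshold) can be sketched with one small helper. The function name, parameters, and sample weights are hypothetical.

```python
def choose_labels(weights, top_k=None, threshold=None):
    """Pick labels for one article from {label: second weight}.

    top_k keeps the highest-weighted labels; threshold keeps labels whose
    weight is strictly greater than the preset value. Both may be combined.
    """
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    if threshold is not None:
        ranked = [(label, w) for label, w in ranked if w > threshold]
    if top_k is not None:
        ranked = ranked[:top_k]
    return [label for label, _ in ranked]


# Invented second weights of four labels for an article A.
weights_a = {"stock trading": 9.5, "investment": 4.0, "exam prep": 0.3, "sports": 0.1}
top_two = choose_labels(weights_a, top_k=2)
above_one = choose_labels(weights_a, threshold=1.0)
```

Here both calls return the labels "stock trading" and "investment", in that order; with other data the two modes can of course disagree.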
In a preferred embodiment, so that the second weights of the labels in the tag library with respect to each article are placed on the same scale for comparison, which makes the comparison more accurate, the second weight of a label with respect to an article can be standardized to obtain the relative second weight of the label with respect to the article, as follows:
where:
$LPC_{l,t}$ is the relative second weight of label $t$ in the pre-established tag library with respect to article $l$ in the existing article resource library;
$LP_{l,t}$ is the second weight of label $t$ in the pre-established tag library with respect to article $l$ in the existing article resource library;
$\overline{LP}_t$ is the average second weight of label $t$ in the pre-established tag library over the articles in the existing article resource library;
$|L|$ is the total number of articles in the existing article resource library.
Thus $\overline{LP}_t = \frac{1}{|L|}\sum_{l \in L} LP_{l,t}$ is the sum of the second weights of label $t$ in the pre-established tag library over all articles in the existing article resource library, divided by the total number of articles in the existing article resource library.
After the relative second weights have been obtained, a certain number of labels are selected in a predetermined manner based on the relative second weights and attached to the corresponding article.
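The per-label average described above is unambiguous, but how it is combined with the second weight is not recoverable from this text. The sketch below assumes the relative second weight is the second weight divided by that average; this is one plausible standardization and is labeled as an assumption, as are the names and numbers used.

```python
def relative_second_weights(lp_t):
    """Standardize one label's second weights across all articles.

    lp_t: {article_id: second weight of label t for that article},
          with an entry for every article in the library.
    ASSUMPTION: relative weight = weight / average weight of the label.
    """
    mean = sum(lp_t.values()) / len(lp_t)  # sum over articles divided by |L|
    if mean == 0:
        return {article: 0.0 for article in lp_t}
    return {article: w / mean for article, w in lp_t.items()}


# Invented second weights of one label for two articles.
lpc = relative_second_weights({"a1": 10.0, "a2": 30.0})
```

Under this assumption a label that scores high on every article is scaled down, so labels with very different typical magnitudes become comparable for a given article.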
In a preferred embodiment, the articles in the article resource library are preferably articles with a strong topical focus, for example news articles, papers, and descriptive articles (such as application descriptions in an app store).
With the method for labeling articles according to the present invention, establishing the association between the labels in the tag library and the articles makes it possible to attach suitable labels automatically to new articles from external data sources or to articles without labels. Each label represents one category, or several labels point to one category. This saves a great deal of labor, markedly improves on the inefficiency of manual operation, and substantially reduces operating costs.
Fig. 2 is a schematic block diagram of the apparatus for labeling articles provided by an embodiment of the present invention. As shown in Fig. 2, the apparatus of the present invention for labeling articles comprises:
a keyword database building unit, configured to extract multiple keywords from all articles in the existing article resource library and build a keyword database, the keyword database including but not limited to: the multiple keywords, and the word frequency with which each keyword occurs in each article in the existing article resource library;
a first weight determining unit, configured to determine the first weight of each keyword in the keyword database with respect to each label in the pre-established tag library;
a second weight determining unit, configured to determine, from the obtained first weights and the word frequency of each keyword, the second weight of each label in the tag library with respect to each article in the existing article resource library;
a labeling unit, configured to select, based on the obtained second weights, a certain number of labels in a predetermined manner and attach them to the corresponding article.
The keyword library building unit builds the keyword library as follows:
First, word segmentation is used to extract multiple segmented words from all articles in the existing article resource library, and a segmented-word library is built;
Then, the resolution of each segmented word in the segmented-word library is determined:
$$S_i = \frac{\mathrm{pct}([P_{l,i}]_{l \in L}, \theta, 1)}{\mathrm{pct}([P_{l,i}]_{l \in L}, 0, \theta)} \cdot \frac{\mathrm{pct}([P_{l,i}]_{l \in L}, 0, 1)}{|L|}$$
Wherein:
$S_i$ represents the resolution of segmented word $i$;
$\theta$ is a user-defined decimal whose value is chosen in practice according to the number of labels;
$|T|$ represents the total number of labels in the existing tag library;
$P_{l,i}$ represents the word frequency of segmented word $i$ in the segmented-word library in article $l$ in the existing article resource library; if segmented word $i$ does not appear in article $l$, then $P_{l,i} = 0$;
$|L|$ represents the total number of articles in the existing article resource library;
$\mathrm{pct}([P_{l,i}]_{l \in L}, \theta, 1)$ means sorting the elements of the array $[P_{l,i}]_{l \in L}$ in descending order of value and summing the elements ranked from the $\theta$ quantile position to the last position;
$\mathrm{pct}([P_{l,i}]_{l \in L}, 0, \theta)$ means sorting the elements of the array $[P_{l,i}]_{l \in L}$ in descending order of value and summing the elements ranked from the 1st position to the $\theta$ quantile position; by the same definition, $\mathrm{pct}([P_{l,i}]_{l \in L}, 0, 1)$ is the sum of all elements;
$L$ represents the set of all articles in the existing article resource library;
Finally, a certain number of words are selected in a preset manner according to the resolution and used as the multiple keywords.
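As an illustration only, the resolution computation described above can be sketched in Python. The function names `pct` and `resolution` are hypothetical, and two details are assumptions because the text does not fix them: a quantile position q is mapped to the index floor(q * n), and a word whose top-ranked sum is zero is given resolution 0.

```python
import math
from typing import Sequence


def pct(values: Sequence[float], lo: float, hi: float) -> float:
    """Sum the descending-sorted values ranked between quantiles lo and hi."""
    ranked = sorted(values, reverse=True)
    n = len(ranked)
    # Assumption: quantile q maps to index floor(q * n); the source text
    # does not say how fractional positions are rounded.
    start, stop = math.floor(lo * n), math.floor(hi * n)
    return float(sum(ranked[start:stop]))


def resolution(freqs: Sequence[float], theta: float) -> float:
    """S_i for one segmented word, given its word frequency in every article."""
    head = pct(freqs, 0.0, theta)
    if head == 0:
        # Assumption: avoid division by zero by returning 0.
        return 0.0
    return pct(freqs, theta, 1.0) / head * pct(freqs, 0.0, 1.0) / len(freqs)


# Word frequencies of one segmented word across four articles (made-up data).
freqs = [4, 0, 2, 2]
print(resolution(freqs, theta=0.25))
```

How keywords are then chosen from the resolutions (the "preset manner") is not specified in the text, so the sketch stops at computing the value.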
The first weight determining unit determines the first weight as follows:
$$TW_{t,w} = \frac{\sum_{l \in L} (PLT_{l,t} \cdot PLW_{l,w})}{|L|}$$
Wherein:
$TW_{t,w}$ represents the first weight of keyword $w$ in the keyword library with respect to label $t$ in the pre-established tag library; if the text content of label $t$ does not contain keyword $w$, then $TW_{t,w}$ is 0;
$PLT_{l,t}$ represents the word frequency with which label $t$ in the tag library occurs in article $l$ in the existing article resource library;
$PLW_{l,w}$ represents the word frequency with which keyword $w$ in the keyword library occurs in article $l$ in the existing article resource library;
$|L|$ represents the total number of articles in the existing article resource library;
$L$ represents the set of all articles in the existing article resource library, so $l \in L$ means that article $l$ belongs to the existing article resource library, and $\sum_{l \in L} (PLT_{l,t} \cdot PLW_{l,w})$ means computing the value of $(PLT_{l,t} \cdot PLW_{l,w})$ for every article in the existing article resource library and summing these values.
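A minimal sketch of the first-weight formula, for illustration only. The function name `first_weight` and the data layout (one dictionary of word frequencies per article) are assumptions. The rule that the first weight is 0 when the label's text content lacks the keyword is not modeled, because the text does not say how a label's text content is represented.

```python
from typing import Dict, List


def first_weight(
    plt: List[Dict[str, float]],
    plw: List[Dict[str, float]],
    label: str,
    keyword: str,
) -> float:
    """Average over all articles of (label frequency * keyword frequency).

    plt[l] maps each label to its word frequency in article l (PLT);
    plw[l] maps each keyword to its word frequency in article l (PLW).
    """
    if len(plt) != len(plw):
        raise ValueError("plt and plw must describe the same articles")
    if not plt:
        return 0.0
    # A label or keyword missing from an article counts as frequency 0.
    total = sum(t.get(label, 0.0) * w.get(keyword, 0.0) for t, w in zip(plt, plw))
    return total / len(plt)


# Three made-up articles: frequencies of the label "sports" and keyword "goal".
plt = [{"sports": 2}, {"sports": 0}, {"sports": 1}]
plw = [{"goal": 3}, {"goal": 5}, {"goal": 1}]
print(first_weight(plt, plw, "sports", "goal"))
```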
The second weight determining unit determines the second weight of a label with respect to an article as follows:
$$LP_{l,t} = \sum_{w=1}^{n} (TW_{t,w} \cdot PLW_{l,w})$$
Wherein:
$LP_{l,t}$ represents the second weight of label $t$ in the pre-established tag library with respect to article $l$ in the existing article resource library;
$TW_{t,w}$ represents the first weight of keyword $w$ in the keyword library with respect to label $t$ in the pre-established tag library; if the text content of label $t$ does not contain keyword $w$, then $TW_{t,w}$ is 0;
$PLW_{l,w}$ represents the word frequency with which keyword $w$ in the keyword library occurs in article $l$ in the existing article resource library;
$n$ is the total number of keywords in the keyword library.
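For illustration, the second weight is a weighted sum of an article's keyword frequencies, which can be sketched as follows. The function name `second_weight` and the dictionary layout are hypothetical.

```python
from typing import Dict


def second_weight(tw_for_label: Dict[str, float], article_freqs: Dict[str, float]) -> float:
    """LP for one (article, label) pair.

    tw_for_label maps each keyword w to its first weight TW[t, w] for label t;
    article_freqs maps each keyword w to its word frequency PLW[l, w] in article l.
    """
    # Keywords absent from the article contribute 0 to the sum.
    return sum(
        weight * article_freqs.get(keyword, 0.0)
        for keyword, weight in tw_for_label.items()
    )


# Made-up first weights for one label and keyword frequencies for one article.
tw = {"goal": 2.0, "match": 0.5}
article = {"goal": 3, "match": 4, "other": 9}
print(second_weight(tw, article))
```

Words in the article that are not in the keyword library (here `"other"`) are ignored, since the sum runs only over the keywords.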
The label marking unit preferably selects a certain number of labels and marks them on the corresponding article, in a predetermined manner based on the obtained second weights, in one of the following ways: selecting a certain number of labels in order of the size of the second weight; or presetting a threshold and selecting the one or more labels whose second weights exceed the preset threshold. For example, after the second weight of each label in the tag library with respect to an article A is obtained, labels are selected in descending order of second weight, e.g. the labels ranked in the top 1-3 or top 1-5 are marked on article A.
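The two selection strategies just described can be sketched as below. This is an illustrative example; the function names `top_k_labels` and `labels_above` and the sample weights are hypothetical.

```python
from typing import Dict, List


def top_k_labels(weights: Dict[str, float], k: int) -> List[str]:
    """Return the k labels with the largest second weights, largest first."""
    ranked = sorted(weights, key=lambda label: weights[label], reverse=True)
    return ranked[:k]


def labels_above(weights: Dict[str, float], threshold: float) -> List[str]:
    """Return every label whose second weight exceeds the preset threshold."""
    return [label for label, weight in weights.items() if weight > threshold]


# Made-up second weights of three labels for one article A.
weights_for_article_a = {"sports": 8.0, "finance": 1.5, "tech": 3.0}
print(top_k_labels(weights_for_article_a, k=2))
print(labels_above(weights_for_article_a, threshold=2.0))
```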
In a preferred embodiment, so that the second weights of the labels in the tag library with respect to each article are placed on the same scale for comparison, making the comparison more accurate, the device of the present invention for marking labels on articles may further include: a relative second weight determining unit (not shown), configured to standardize the second weight of a label with respect to an article to obtain the relative second weight of the label with respect to the article, as follows:
$$LPC_{l,t} = \frac{LP_{l,t}}{\frac{1}{|L|} \sum_{l \in L} LP_{l,t}}$$
Wherein:
$LPC_{l,t}$ represents the relative second weight of label $t$ in the pre-established tag library with respect to article $l$ in the existing article resource library;
$LP_{l,t}$ represents the second weight of label $t$ in the pre-established tag library with respect to article $l$ in the existing article resource library;
$\frac{1}{|L|} \sum_{l \in L} LP_{l,t}$ represents the average weight of label $t$ in the pre-established tag library with respect to the articles $l$ in the existing article resource library;
$|L|$ represents the total number of articles in the existing article resource library.
Thus, $\frac{1}{|L|} \sum_{l \in L} LP_{l,t}$ is the sum of the second weights of label $t$ in the pre-established tag library with respect to all articles in the existing article resource library, divided by the total number of articles in the existing article resource library.
After the relative second weights are obtained, the label marking unit selects, in a predetermined manner based on the obtained relative second weights, a certain number of labels and marks them on the corresponding article.
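As a small illustrative sketch, the standardization divides each second weight of a label by that label's mean second weight over all articles. The function name `relative_second_weights` is hypothetical, and returning zeros when the mean is zero is an assumption that the text does not address.

```python
from typing import List


def relative_second_weights(lp_for_label: List[float]) -> List[float]:
    """Divide each article's second weight for one label by the label's mean.

    lp_for_label[l] is the second weight LP[l, t] of a fixed label t for article l.
    """
    if not lp_for_label:
        return []
    mean = sum(lp_for_label) / len(lp_for_label)
    if mean == 0:
        # Assumption: a label with no weight anywhere stays at 0.
        return [0.0] * len(lp_for_label)
    return [lp / mean for lp in lp_for_label]


# Made-up second weights of one label for three articles.
print(relative_second_weights([2.0, 4.0, 6.0]))
```

After this step a value above 1 means the label weighs more for that article than it does on average, which makes labels with very different raw magnitudes comparable.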
In a preferred embodiment, the articles in the article resource library are preferably articles with a strong topical focus, for example: news articles, academic papers, and descriptive articles (such as application descriptions in an app store).
Of course, as those skilled in the art will know, the relative second weight of a label with respect to an article may also be determined by the second weight determining unit, and need not be determined by a separate relative second weight determining unit.
It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the device described above may refer to the corresponding process in the foregoing method embodiment; the examples and related descriptions given in the method embodiment apply equally to the working process of the device and are not repeated here.
According to the device of the present invention for marking labels on articles, by establishing the relevance between the labels in the tag library and articles, suitable labels can be marked automatically on new articles from external data sources or on articles that have no labels. Each label represents one category, or multiple labels point to one category. This saves substantial labor cost, markedly improves on the inefficiency of manual operation, and significantly reduces operating cost.
The computer program product for the method of marking labels on articles provided by the embodiments of the present invention includes a computer-readable storage medium storing program code. The instructions included in the program code can be used to perform the method described in the foregoing method embodiment; for the specific implementation, refer to the method embodiment, which is not repeated here.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, smart tablet, smartphone, server, network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, removable hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disc.
The above is only a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall be covered by the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the claims.

Claims (12)

1. A method for marking labels on articles, comprising:
extracting multiple keywords from all articles in an existing article resource library and building a keyword library, the keyword library including but not limited to: the multiple keywords, and the word frequency with which each keyword occurs in each article in the existing article resource library;
determining the first weight of each keyword in the keyword library with respect to each label in a pre-established tag library;
determining, based on the obtained first weights and the word frequency of each keyword, the second weight of each label in the tag library with respect to each article in the existing article resource library;
selecting, in a predetermined manner based on the obtained second weights, a certain number of labels and marking them on the corresponding article.
2. The method according to claim 1, characterized in that, in the step of extracting multiple keywords from all articles in the existing article resource library and building the keyword library:
First, word segmentation is used to extract multiple segmented words from all articles in the existing article resource library, and a segmented-word library is built;
Then, the resolution of each segmented word in the segmented-word library is determined:
$$S_i = \frac{\mathrm{pct}([P_{l,i}]_{l \in L}, \theta, 1)}{\mathrm{pct}([P_{l,i}]_{l \in L}, 0, \theta)} \cdot \frac{\mathrm{pct}([P_{l,i}]_{l \in L}, 0, 1)}{|L|}$$
Wherein:
$S_i$ represents the resolution of segmented word $i$;
$\theta$ is a user-defined decimal;
$P_{l,i}$ represents the word frequency of segmented word $i$ in the segmented-word library in article $l$ in the existing article resource library; if segmented word $i$ does not appear in article $l$, then $P_{l,i} = 0$;
$|L|$ represents the total number of articles in the existing article resource library;
$\mathrm{pct}([P_{l,i}]_{l \in L}, \theta, 1)$ means sorting the elements of the array $[P_{l,i}]_{l \in L}$ in descending order of value and summing the elements ranked from the $\theta$ quantile position to the last position;
$\mathrm{pct}([P_{l,i}]_{l \in L}, 0, \theta)$ means sorting the elements of the array $[P_{l,i}]_{l \in L}$ in descending order of value and summing the elements ranked from the 1st position to the $\theta$ quantile position;
$L$ represents the set of all articles in the existing article resource library;
Finally, a certain number of words are selected in a preset manner according to the resolution and used as the multiple keywords.
3. The method according to claim 1, characterized in that, in the step of determining the first weight of each keyword in the keyword library with respect to each label in the pre-established tag library, the first weight is determined as follows:
$$TW_{t,w} = \frac{\sum_{l \in L} (PLT_{l,t} \cdot PLW_{l,w})}{|L|}$$
Wherein:
$TW_{t,w}$ represents the first weight of keyword $w$ in the keyword library with respect to label $t$ in the pre-established tag library; if the text content of label $t$ does not contain keyword $w$, then $TW_{t,w}$ is 0;
$PLT_{l,t}$ represents the word frequency with which label $t$ in the tag library occurs in article $l$ in the existing article resource library;
$PLW_{l,w}$ represents the word frequency with which keyword $w$ in the keyword library occurs in article $l$ in the existing article resource library;
$|L|$ represents the total number of articles in the existing article resource library;
$L$ represents the set of all articles in the existing article resource library.
4. The method according to claim 1, characterized in that, in the step of determining, based on the obtained first weights and the word frequency of each keyword, the second weight of each label in the tag library with respect to each article in the existing article resource library, the second weight of a label with respect to an article is determined as follows:
$$LP_{l,t} = \sum_{w=1}^{n} (TW_{t,w} \cdot PLW_{l,w})$$
Wherein:
$LP_{l,t}$ represents the second weight of label $t$ in the pre-established tag library with respect to article $l$ in the existing article resource library;
$TW_{t,w}$ represents the first weight of keyword $w$ in the keyword library with respect to label $t$ in the pre-established tag library; if the text content of label $t$ does not contain keyword $w$, then $TW_{t,w}$ is 0;
$PLW_{l,w}$ represents the word frequency with which keyword $w$ in the keyword library occurs in article $l$ in the existing article resource library;
$n$ is the total number of keywords in the keyword library.
5. The method according to claim 4, characterized in that it further comprises: standardizing the second weight of the label with respect to the article to obtain the relative second weight of the label with respect to the article, as follows:
$$LPC_{l,t} = \frac{LP_{l,t}}{\frac{1}{|L|} \sum_{l \in L} LP_{l,t}}$$
Wherein:
$LPC_{l,t}$ represents the relative second weight of label $t$ in the pre-established tag library with respect to article $l$ in the existing article resource library;
$LP_{l,t}$ represents the second weight of label $t$ in the pre-established tag library with respect to article $l$ in the existing article resource library;
$\frac{1}{|L|} \sum_{l \in L} LP_{l,t}$ represents the average weight of label $t$ in the pre-established tag library with respect to the articles $l$ in the existing article resource library;
$|L|$ represents the total number of articles in the existing article resource library.
6. The method according to claim 1, characterized in that the step of selecting, in a predetermined manner based on the obtained second weights, a certain number of labels and marking them on the corresponding article comprises: selecting a certain number of labels in order of the size of the second weight and marking them on the corresponding article; or selecting the one or more labels corresponding to one or more second weights that exceed a preset threshold and marking them on the corresponding article.
7. A device for marking labels on articles, comprising:
a keyword library building unit, configured to extract multiple keywords from all articles in an existing article resource library and build a keyword library, the keyword library including but not limited to: the multiple keywords, and the word frequency with which each keyword occurs in each article in the existing article resource library;
a first weight determining unit, configured to determine the first weight of each keyword in the keyword library with respect to each label in a pre-established tag library;
a second weight determining unit, configured to determine, based on the obtained first weights and the word frequency of each keyword, the second weight of each label in the tag library with respect to each article in the existing article resource library;
a label marking unit, configured to select, in a predetermined manner based on the obtained second weights, a certain number of labels and mark them on the corresponding article.
8. The device according to claim 7, characterized in that the process by which the keyword library building unit builds the keyword library comprises:
First, word segmentation is used to extract multiple segmented words from all articles in the existing article resource library, and a segmented-word library is built;
Then, the resolution of each segmented word in the segmented-word library is determined:
$$S_i = \frac{\mathrm{pct}([P_{l,i}]_{l \in L}, \theta, 1)}{\mathrm{pct}([P_{l,i}]_{l \in L}, 0, \theta)} \cdot \frac{\mathrm{pct}([P_{l,i}]_{l \in L}, 0, 1)}{|L|}$$
Wherein:
$S_i$ represents the resolution of segmented word $i$;
$\theta$ is a user-defined decimal;
$P_{l,i}$ represents the word frequency of segmented word $i$ in the segmented-word library in article $l$ in the existing article resource library; if segmented word $i$ does not appear in article $l$, then $P_{l,i} = 0$;
$|L|$ represents the total number of articles in the existing article resource library;
$\mathrm{pct}([P_{l,i}]_{l \in L}, \theta, 1)$ means sorting the elements of the array $[P_{l,i}]_{l \in L}$ in descending order of value and summing the elements ranked from the $\theta$ quantile position to the last position;
$\mathrm{pct}([P_{l,i}]_{l \in L}, 0, \theta)$ means sorting the elements of the array $[P_{l,i}]_{l \in L}$ in descending order of value and summing the elements ranked from the 1st position to the $\theta$ quantile position;
$L$ represents the set of all articles in the existing article resource library;
Finally, a certain number of words are selected in a preset manner according to the resolution and used as the multiple keywords.
9. The device according to claim 7, characterized in that the first weight determining unit determines the first weight as follows:
$$TW_{t,w} = \frac{\sum_{l \in L} (PLT_{l,t} \cdot PLW_{l,w})}{|L|}$$
Wherein:
$TW_{t,w}$ represents the first weight of keyword $w$ in the keyword library with respect to label $t$ in the pre-established tag library; if the text content of label $t$ does not contain keyword $w$, then $TW_{t,w}$ is 0;
$PLT_{l,t}$ represents the word frequency with which label $t$ in the tag library occurs in article $l$ in the existing article resource library;
$PLW_{l,w}$ represents the word frequency with which keyword $w$ in the keyword library occurs in article $l$ in the existing article resource library;
$|L|$ represents the total number of articles in the existing article resource library;
$L$ represents the set of all articles in the existing article resource library.
10. The device according to claim 7, characterized in that the second weight determining unit determines the second weight of the label with respect to the article as follows:
$$LP_{l,t} = \sum_{w=1}^{n} (TW_{t,w} \cdot PLW_{l,w})$$
Wherein:
$LP_{l,t}$ represents the second weight of label $t$ in the pre-established tag library with respect to article $l$ in the existing article resource library;
$TW_{t,w}$ represents the first weight of keyword $w$ in the keyword library with respect to label $t$ in the pre-established tag library; if the text content of label $t$ does not contain keyword $w$, then $TW_{t,w}$ is 0;
$PLW_{l,w}$ represents the word frequency with which keyword $w$ in the keyword library occurs in article $l$ in the existing article resource library;
$n$ is the total number of keywords in the keyword library.
11. The device according to claim 10, characterized in that the device further comprises: a relative second weight determining unit, configured to standardize the second weight of the label with respect to the article to obtain the relative second weight of the label with respect to the article, as follows:
$$LPC_{l,t} = \frac{LP_{l,t}}{\frac{1}{|L|} \sum_{l \in L} LP_{l,t}}$$
Wherein:
$LPC_{l,t}$ represents the relative second weight of label $t$ in the pre-established tag library with respect to article $l$ in the existing article resource library;
$LP_{l,t}$ represents the second weight of label $t$ in the pre-established tag library with respect to article $l$ in the existing article resource library;
$\frac{1}{|L|} \sum_{l \in L} LP_{l,t}$ represents the average weight of label $t$ in the pre-established tag library with respect to the articles $l$ in the existing article resource library;
$|L|$ represents the total number of articles in the existing article resource library.
12. The device according to claim 7, characterized in that the label marking unit is further configured to select a certain number of labels in order of the size of the second weight and mark them on the corresponding article, or to select the one or more labels corresponding to one or more second weights that exceed a preset threshold and mark them on the corresponding article.
CN201710172954.4A 2017-03-22 2017-03-22 A kind of method and apparatus to article mark label Active CN106980667B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201710172954.4A CN106980667B (en) 2017-03-22 2017-03-22 A kind of method and apparatus to article mark label
PCT/CN2018/071607 WO2018171295A1 (en) 2017-03-22 2018-01-05 Method and apparatus for tagging article, terminal, and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710172954.4A CN106980667B (en) 2017-03-22 2017-03-22 A kind of method and apparatus to article mark label

Publications (2)

Publication Number Publication Date
CN106980667A true CN106980667A (en) 2017-07-25
CN106980667B CN106980667B (en) 2019-04-12

Family

ID=59339570

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710172954.4A Active CN106980667B (en) 2017-03-22 2017-03-22 A kind of method and apparatus to article mark label

Country Status (2)

Country Link
CN (1) CN106980667B (en)
WO (1) WO2018171295A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102289523A (en) * 2011-09-20 2011-12-21 北京金和软件股份有限公司 Method for intelligently extracting text labels
CN103164471A (en) * 2011-12-15 2013-06-19 盛乐信息技术(上海)有限公司 Recommendation method and system of video text labels
US20160070803A1 (en) * 2014-09-09 2016-03-10 Funky Flick, Inc. Conceptual product recommendation
CN105893478A (en) * 2016-03-29 2016-08-24 广州华多网络科技有限公司 Tag extraction method and equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106980667B (en) * 2017-03-22 2019-04-12 广州优视网络科技有限公司 A kind of method and apparatus to article mark label

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018171295A1 (en) * 2017-03-22 2018-09-27 广州优视网络科技有限公司 Method and apparatus for tagging article, terminal, and computer readable storage medium
WO2018188378A1 (en) * 2017-04-10 2018-10-18 广州优视网络科技有限公司 Method and device for tagging label for application, terminal and computer readable storage medium
CN107748745A (en) * 2017-11-08 2018-03-02 厦门美亚商鼎信息科技有限公司 A kind of enterprise name keyword extraction method
CN107748745B (en) * 2017-11-08 2021-08-03 厦门美亚商鼎信息科技有限公司 Enterprise name keyword extraction method
CN111611461A (en) * 2019-05-14 2020-09-01 北京精准沟通传媒科技股份有限公司 Data processing method and device
CN110519654A (en) * 2019-09-11 2019-11-29 广州荔支网络技术有限公司 A kind of label determines method and device
CN110519654B (en) * 2019-09-11 2021-07-27 广州荔支网络技术有限公司 Label determining method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2018171295A1 (en) 2018-09-27
CN106980667B (en) 2019-04-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200415

Address after: 310052 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: Alibaba (China) Co.,Ltd.

Address before: 510627 Guangdong city of Guangzhou province Whampoa Tianhe District Road No. 163 Xiping Yun Lu Yun Ping square B radio tower 15 layer self unit 02

Patentee before: GUANGZHOU UC NETWORK TECHNOLOGY Co.,Ltd.