CN106980667A - Method and apparatus for labeling articles - Google Patents
- Publication number
- CN106980667A CN201710172954.4A
- Authority
- CN
- China
- Prior art keywords
- article
- label
- resources bank
- weight
- represent
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3346—Query execution using probabilistic model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3334—Selection or weighting of terms from queries, including natural language queries
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a method and apparatus for labeling articles. The method includes: extracting multiple keywords from all articles in an existing article repository and building a keyword library; determining a first weight of each keyword in the keyword library with respect to each label in a pre-established label library; determining, based on the obtained first weights and the term frequency of each keyword, a second weight of each label in the label library with respect to each article in the existing article repository; and, based on the obtained second weights, selecting a certain number of labels in a predetermined manner and applying them to the corresponding article.
Description
Technical field
The present invention relates to the technical field of information processing, and in particular to a method and apparatus for labeling articles.
Background
With the spread of communication networks and smart terminals, people are increasingly accustomed to reading on electronic products. For example, they log in to news or fiction websites on a computer to read news or novels, or log in to an online library to read books. They also read through third-party applications installed on smart mobile terminals such as smartphones or tablets, for example the news application "Jinri Toutiao" (Today's Headlines), the fiction application "Shuqi Novel", and various periodical apps.
Whether readers use news, fiction, or periodical websites on a computer or use third-party reading applications, the large volume of news, novels, or papers must be classified and consolidated. The consolidated data then serve as base data, which makes it convenient to display content by category or to make recommendations based on user interest.
When articles such as news, novels, or papers are classified, many of them come from external data sources and carry no category or label information, so classifying them is difficult. The traditional approach is for operations staff to judge from experience which category an article belongs to. This approach has two drawbacks:
1. It incurs a large labor cost. For each newly added article, especially highly time-sensitive news, operations staff must read the article quickly and then assign it to an existing category.
2. It is inefficient and costly, and highly specialized articles require professional judgment. Classifying articles manually one by one is slow. For highly specialized articles, such as news on economics, wealth management, or investment, whose contents are very similar, only professionals can ensure correct classification, which is expensive.
Summary of the invention
An object of the present invention is to provide a method and apparatus for labeling articles that address the above problems.
An embodiment of the present invention provides a method for labeling articles, which includes:
extracting multiple keywords from all articles in an existing article repository and building a keyword library, the keyword library including but not limited to: the multiple keywords, and the term frequency with which each keyword occurs in each article in the existing article repository;
determining a first weight of each keyword in the keyword library with respect to each label in a pre-established label library;
determining, based on the obtained first weights and the term frequency of each keyword, a second weight of each label in the label library with respect to each article in the existing article repository;
selecting, based on the obtained second weights, a certain number of labels in a predetermined manner and applying them to the corresponding article.
An embodiment of the present invention also provides an apparatus for labeling articles, which includes:
a keyword library building unit, for extracting multiple keywords from all articles in an existing article repository and building a keyword library, the keyword library including but not limited to: the multiple keywords, and the term frequency with which each keyword occurs in each article in the existing article repository;
a first weight determining unit, for determining a first weight of each keyword in the keyword library with respect to each label in a pre-established label library;
a second weight determining unit, for determining, based on the obtained first weights and the term frequency of each keyword, a second weight of each label in the label library with respect to each article in the existing article repository;
a labeling unit, for selecting, based on the obtained second weights, a certain number of labels in a predetermined manner and applying them to the corresponding article.
The process of building the keyword library includes:
First, multiple segmented words are extracted from all articles in the existing article repository using a word segmentation technique, and a segmented-word library is built;
Then, the discrimination degree of each segmented word in the segmented-word library is determined:
where:
$S_i$ denotes the discrimination degree of segmented word i;
$\theta$ is a user-defined decimal;
$P_{l,i}$ denotes the term frequency of segmented word i of the segmented-word library in article l of the existing article repository, with $P_{l,i} = 0$ if segmented word i does not appear in article l;
$|L|$ denotes the total number of articles in the existing article repository;
$pct([P_{l,i}]_{l \in L}, \theta, 1)$ denotes sorting the elements of the array $[P_{l,i}]_{l \in L}$ in descending order by value and summing the elements ranked from the $\theta$ quantile position through the last position;
$pct([P_{l,i}]_{l \in L}, 0, \theta)$ denotes sorting the elements of the array $[P_{l,i}]_{l \in L}$ in descending order by value and summing the elements ranked from the first position through the $\theta$ quantile position;
$L$ denotes the set of all articles in the existing article repository;
Finally, a certain number of words are selected as the keywords in a preset manner according to the discrimination degree.
The first weight is determined as follows:
where:
$TW_{t,w}$ denotes the first weight of keyword w in the keyword library with respect to label t in the pre-established label library; if the text of label t does not contain keyword w, $TW_{t,w}$ is 0;
$PLT_{l,t}$ denotes the term frequency with which label t of the label library occurs in article l of the existing article repository;
$PLW_{l,w}$ denotes the term frequency with which keyword w of the keyword library occurs in article l of the existing article repository;
$|L|$ denotes the total number of articles in the existing article repository;
$L$ denotes the set of all articles in the existing article repository.
The second weight of a label with respect to an article is determined as follows:
where:
$LP_{l,t}$ denotes the second weight of label t in the pre-established label library with respect to article l in the existing article repository;
$TW_{t,w}$ denotes the first weight of keyword w in the keyword library with respect to label t in the pre-established label library; if the text of label t does not contain keyword w, $TW_{t,w}$ is 0;
$PLW_{l,w}$ denotes the term frequency with which keyword w of the keyword library occurs in article l of the existing article repository;
$N$ is the total number of keywords in the keyword library.
Preferably, the second weight of the label with respect to the article is standardized to obtain a relative second weight of the label with respect to the article, as follows:
where:
$LPC_{l,t}$ denotes the relative second weight of label t in the pre-established label library with respect to article l in the existing article repository;
$LP_{l,t}$ denotes the second weight of label t in the pre-established label library with respect to article l in the existing article repository;
the remaining term denotes the average weight of label t in the pre-established label library with respect to the articles in the existing article repository;
$|L|$ denotes the total number of articles in the existing article repository.
Preferably, selecting a certain number of labels in a predetermined manner based on the obtained second weights and applying them to the corresponding article includes: selecting a certain number of labels in order of second weight and applying them to the corresponding article, or selecting the one or more labels whose second weights exceed a preset threshold and applying them to the corresponding article.
According to the method and apparatus for labeling articles of the present invention, by establishing the association between the labels in the label library and the articles, suitable labels can be applied automatically to new articles from external data sources or to articles without labels. Each label represents one category, or multiple labels point to one category. This saves a large labor cost, markedly improves on the inefficiency of manual operation, and substantially reduces operating cost.
Brief description of the drawings
Fig. 1 is a flow chart of the method for labeling articles provided by an embodiment of the present invention;
Fig. 2 is a schematic block diagram of the apparatus for labeling articles provided by an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the embodiments and the accompanying drawings. Evidently, the described embodiments are only some of the embodiments of the invention, not all of them. The components of the embodiments generally described and illustrated in the drawings herein can be arranged and designed in a variety of configurations. Therefore, the following detailed description of the embodiments provided in the drawings is not intended to limit the scope of the claimed invention, but merely represents selected embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without creative effort fall within the scope of protection of the invention.
Fig. 1 is a flow chart of the method for labeling articles provided by an embodiment of the present invention. As shown in Fig. 1, the method for labeling articles of the present invention comprises the following steps:
S1: Extract multiple keywords from all articles in the existing article repository and build a keyword library. The keyword library includes but is not limited to: the multiple keywords, and the term frequency with which each keyword occurs in each article in the existing article repository.
Network service providers that offer article reading on electronic products all set up an article repository on their servers, so that users can read online, or download and read on the terminal, using electronic products such as computers or smart terminals. "Article" here refers to any text that can be read, including but not limited to: novels of all types, papers, periodicals, textbooks for all subjects, exam tutoring books, workbooks, and so on. In addition, for ease of management, a service provider may set up multiple article repositories, for example a fiction repository for novels, a periodical repository for papers and periodicals, a dedicated repository for textbooks, tutoring books, and workbooks, and a repository for articles such as domestic news. Service providers decide this according to their own resource management policies; it is outside the scope of this discussion, and all such repositories are collectively referred to here as the article repository.
To label articles automatically, multiple keywords must first be extracted from all articles in the existing article repository and a keyword library built. This step is carried out as follows:
First, multiple segmented words are extracted from all articles in the existing article repository using a word segmentation technique, and a segmented-word library is built.
As described above, the service providers of the various third-party applications that offer electronic reading all set up an article repository on a server in advance and store all articles in it. This is common practice for those skilled in the art and is not described further here. Any known word segmentation technique is applied to each article in the existing article repository to extract multiple segmented words, and a segmented-word library is built from them. The segmented-word library can include but is not limited to: 1. the association between each segmented word and each article, that is, which articles each segmented word comes from; 2. the term frequency of each segmented word in each article.
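The two kinds of information held in the segmented-word library can be sketched as a single mapping from each word to its per-article term frequencies. This is only an illustrative sketch: the function name `build_word_library`, the sample articles, and the use of whitespace splitting as a stand-in for a real word segmentation technique are all assumptions, not part of the patent.

```python
from collections import Counter

def build_word_library(articles, segment=str.split):
    """Map each segmented word to {article_id: term frequency}.

    `segment` is a placeholder for any word segmentation technique;
    whitespace splitting is used here only for illustration.
    """
    library = {}
    for article_id, text in articles.items():
        # Count how often each segmented word occurs in this article.
        for word, count in Counter(segment(text)).items():
            library.setdefault(word, {})[article_id] = count
    return library

# Hypothetical two-article repository.
articles = {
    "a1": "stock market stock trading",
    "a2": "exam study exam exam",
}
library = build_word_library(articles)
```

The keys of each inner dictionary record which articles a word comes from, and the values record its term frequency there; a word absent from an article simply has no entry, which corresponds to a term frequency of 0.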
Then, the discrimination degree of each segmented word in the segmented-word library is determined:
where:
$S_i$ denotes the discrimination degree of segmented word i;
$\theta$ is a user-defined decimal whose value is set in practice according to the number of labels;
$|T|$ denotes the total number of labels in the existing label library;
$P_{l,i}$ denotes the term frequency of segmented word i of the segmented-word library in article l of the existing article repository, with $P_{l,i} = 0$ if segmented word i does not appear in article l;
$|L|$ denotes the total number of articles in the existing article repository.
Let P be an array of real numbers, and let α and β be real numbers in [0, 1] with α < β. The function pct(P, α, β) sorts the elements of P in descending order by value and sums the values of the elements ranked from the α quantile position to the β quantile position. Note: because 0 ≤ α < β ≤ 1, the number of decimal digits of α and β must be smaller than the number of digits in the element count. For example, if the array has 1000 elements, α and β must have fewer than 4 digits, that is, only 1 to 3 digits after the decimal point. To locate a quantile position, α and β are first scaled to integers by multiplying by $10^N$, where N is the number of decimal digits of α or β respectively; then, in the array sorted in descending order, the elements from position $\alpha \cdot 10^N$ to position $\beta \cdot 10^N$ are selected and their values are summed. For example, for pct(P, α, β) where P has 10,000 elements, α = 0.324, and β = 0.8792, we have $\alpha \cdot 10^3 = 324$ and $\beta \cdot 10^4 = 8792$, so from the 10,000 elements sorted in descending order the 8469 elements from position 324 through position 8792 are selected and their values are summed.
It follows that:
$pct([P_{l,i}]_{l \in L}, \theta, 1)$ denotes sorting the elements of the array $[P_{l,i}]_{l \in L}$ in descending order by value and summing the elements ranked from the $\theta$ quantile position through the last position;
$pct([P_{l,i}]_{l \in L}, 0, \theta)$ denotes sorting the elements of the array $[P_{l,i}]_{l \in L}$ in descending order by value and summing the elements ranked from the first position through the $\theta$ quantile position;
$L$ denotes the set of all articles in the existing article repository, so $l \in L$ means that the article l being computed belongs to the existing article repository.
The definition of pct(P, α, β) is illustrated below with a concrete example.
Consider pct([0,1,3,2,5], 0.2, 1). The elements of the array [0,1,3,2,5] are first sorted in descending order by value, giving [5,3,2,1,0]. The element at the 0.2 quantile position is at position 0.2 × 10 = 2, that is, the 2nd element, 3. The quantile position for the integer 1 is the last position, that is, the 5th element, 0. Hence pct([0,1,3,2,5], 0.2, 1) = 3 + 2 + 1 + 0 = 6.
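The pct function can be sketched in code as follows. This is a literal reading of the scaling rule in the text (multiply a quantile by 10 to the power of its number of decimal digits, with 0 mapped to the first position and 1 to the last), not a confirmed implementation; the helper name `quantile_position` is an assumption introduced for illustration.

```python
from decimal import Decimal

def quantile_position(q, n):
    """1-based position for quantile q in an array of n elements,
    following the scaling rule as literally described in the text."""
    d = Decimal(str(q))
    if d == 0:
        return 1   # quantile 0 denotes the first position
    if d == 1:
        return n   # quantile 1 denotes the last position
    digits = -d.as_tuple().exponent  # number of digits after the decimal point
    return int(d.scaleb(digits))     # e.g. 0.324 -> 324, 0.2 -> 2

def pct(values, alpha, beta):
    """Sort descending, then sum the elements ranked from the alpha
    quantile position through the beta quantile position (inclusive)."""
    ordered = sorted(values, reverse=True)
    start = quantile_position(alpha, len(ordered))
    end = quantile_position(beta, len(ordered))
    return sum(ordered[start - 1:end])

print(pct([0, 1, 3, 2, 5], 0.2, 1))  # 6, matching the worked example
```

With 10,000 elements, α = 0.324, and β = 0.8792, the same function sums the 8469 elements at positions 324 through 8792. Under this literal rule a position can exceed the array length if a quantile has too many decimal digits, which is why the text restricts the number of digits of α and β.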
Finally, a certain number of words are selected as the keywords in a preset manner according to the discrimination degree. A keyword library can then be built from the selected keywords. Because the selected keywords come from the segmented-word library, the keyword library contains the same kinds of content as the segmented-word library, including but not limited to: 1. the association between each keyword and each article, that is, which articles each keyword comes from; 2. the term frequency of each keyword in each article. Preferred ways of selecting a certain number of words as keywords according to the discrimination degree include: selecting a certain number of segmented words in order of discrimination degree, or selecting a certain number of segmented words, at random or in order, from those whose discrimination degree is greater than or equal to a preset threshold.
The discrimination degree $S_i$ of segmented word i expresses the ability of segmented word i to distinguish article topics: the higher the discrimination degree, the stronger that ability. For example, the word "postgraduate entrance exam preparation" is directly associated with the topic "postgraduate entrance exam", whereas "study" does not point clearly to any one topic, so the discrimination degree of "postgraduate entrance exam preparation" is higher than that of "study".
Term frequency (TF) is a standard term in the art: in a given article, the term frequency is the number of times a given word occurs in that article.
The purpose of the first step is to select popular words as keywords. Label content also consists of popular keywords, so this lays the groundwork for the next step. The number of keywords selected can depend on practical needs. The approach used here is to select keywords as a certain percentage of the number of articles in the article repository. For example, when the number of articles reaches the order of ten million, about 100,000 keywords can be selected. The selected keywords can be organized as a keyword library, a keyword list, or the like; the keyword library is used as the example here.
The existing label library has also been mentioned above. To label articles, whether automatically or manually, a label library must be established in advance so that labeling follows a standard set of labels. Each label in the label library is a key word pointing to a certain topic, such as "postgraduate entrance exam preparation" or "stock trading". The label library can be built by any known method: for example, from labels extracted by operations staff based on experience, from article labels already available on the market, from a unified article label library agreed jointly by peers in the industry, or from a suitable combination of these. A unified article label library agreed jointly by industry peers is preferred.
S2: Determine a first weight of each keyword in the keyword library with respect to each label in the pre-established label library.
After the keyword library has been built, the weight of each keyword in the keyword library with respect to each label in the pre-established label library must be determined. This weight is called the first weight here, and it is determined as follows:
where:
$TW_{t,w}$ denotes the first weight of keyword w in the keyword library with respect to label t in the pre-established label library; if the text of label t does not contain keyword w, $TW_{t,w}$ is 0;
$PLT_{l,t}$ denotes the term frequency with which label t of the label library occurs in article l of the existing article repository;
$PLW_{l,w}$ denotes the term frequency with which keyword w of the keyword library occurs in article l of the existing article repository;
$|L|$ denotes the total number of articles in the existing article repository.
$L$ denotes the set of all articles in the existing article repository, so $l \in L$ means that article l belongs to the existing article repository. $\sum_{l \in L} (PLT_{l,t} \cdot PLW_{l,w})$ denotes computing the value $PLT_{l,t} \cdot PLW_{l,w}$ for every article in the existing article repository and summing these values; equivalently, the sum runs over $l = 1, \dots, |L|$, where $|L|$ is the total number of articles in the existing article repository.
In this way, the first weight of keyword w in the keyword library with respect to label t in the pre-established label library serves as the link that prepares for establishing the association between labels and articles in the next step.
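The summation term $\sum_{l \in L} (PLT_{l,t} \cdot PLW_{l,w})$ described above can be sketched in code. The complete first-weight formula is not reproduced in this text, so the sketch computes only this sum and makes no claim about any normalization applied to it; the function name and the sample term frequencies are hypothetical.

```python
def cooccurrence_sum(label_tf, keyword_tf):
    """Sum over all articles l of PLT[l, t] * PLW[l, w].

    label_tf:   {article_id: term frequency of label t in that article}
    keyword_tf: {article_id: term frequency of keyword w in that article}
    An article missing from a dictionary counts as term frequency 0.
    """
    return sum(tf * keyword_tf.get(article_id, 0)
               for article_id, tf in label_tf.items())

# Hypothetical term frequencies for one label t and one keyword w.
plt_t = {"a1": 2, "a2": 0, "a3": 1}
plw_w = {"a1": 3, "a3": 4}
total = cooccurrence_sum(plt_t, plw_w)  # 2*3 + 0*0 + 1*4
```

The sum grows when the label and the keyword tend to appear frequently in the same articles, which is what lets the keyword act as a link between labels and articles.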
S3: Based on the obtained first weights and the term frequency of each keyword, determine a second weight of each label in the label library with respect to each article in the existing article repository.
The term frequency of each keyword in each article of the existing article repository is counted. This counting is completed when the keyword library is built, and the results are stored in the keyword library. For example, if the word "stock trading" occurs 20 times in an article A that introduces stock trading, the term frequency of "stock trading" in article A is 20.
Then, using the obtained first weight of each keyword in the keyword library with respect to each label, together with the recorded term frequency of each keyword in each article of the article repository, the weight of each label in the label library with respect to each article in the existing article repository is determined. This weight is called the second weight, and it is determined as follows:
where:
$LP_{l,t}$ denotes the second weight of label t in the pre-established label library with respect to article l in the existing article repository;
$TW_{t,w}$ denotes the first weight of keyword w in the keyword library with respect to label t in the pre-established label library; if the text of label t does not contain keyword w, $TW_{t,w}$ is 0;
$PLW_{l,w}$ denotes the term frequency with which keyword w of the keyword library occurs in article l of the existing article repository;
$N$ is the total number of keywords in the keyword library.
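The formula for the second weight is not reproduced in this text. As a minimal sketch, assume it is the sum over the N keywords of $TW_{t,w} \cdot PLW_{l,w}$; this is an assumed reading based only on the symbols defined above, not a confirmed formula, and the function name and sample values are hypothetical.

```python
def second_weight(first_weights, article_tf):
    """Assumed form: LP[l, t] = sum over keywords w of TW[t, w] * PLW[l, w].

    first_weights: {keyword: TW[t, w]} for one label t
    article_tf:    {keyword: PLW[l, w]} for one article l
    Keywords absent from the article contribute 0.
    """
    return sum(tw * article_tf.get(word, 0)
               for word, tw in first_weights.items())

# Hypothetical first weights for a label and term frequencies for an article.
tw_label = {"stock": 0.5, "fund": 0.25}
tf_article = {"stock": 20, "fund": 4, "exam": 1}
lp = second_weight(tw_label, tf_article)  # 0.5*20 + 0.25*4
```

Under this assumption, an article scores highly for a label when it contains many occurrences of keywords that carry a large first weight for that label.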
S4: Based on the obtained second weights, select a certain number of labels in a predetermined manner and apply them to the corresponding article.
After the second weight of each label in the pre-established label library with respect to each article in the existing article repository has been obtained, a certain number of labels are selected in a predetermined manner based on the second weights and applied to the corresponding article. Preferably, a certain number of labels are selected in order of second weight. For example, after the second weight of each label in the label library with respect to an article A has been obtained, a certain number of labels are selected in descending order of second weight, such as the top 1 to 3 or the top 1 to 5 labels, and applied to article A. Alternatively, a threshold can be preset, and the one or more labels whose second weights exceed the preset threshold are selected and applied to the corresponding article.
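Both selection strategies, top-ranked labels and labels above a threshold, can be sketched in one small function. The function name `select_labels`, its parameters, and the sample weights are hypothetical and chosen only for illustration.

```python
def select_labels(weights, top_k=None, threshold=None):
    """Choose labels for one article from {label: second weight}.

    top_k:     keep at most this many labels, highest weight first
    threshold: keep only labels whose weight is strictly greater
    """
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        ranked = [(label, w) for label, w in ranked if w > threshold]
    if top_k is not None:
        ranked = ranked[:top_k]
    return [label for label, _ in ranked]

# Hypothetical second weights of four labels for article A.
weights_a = {"stock trading": 11.0, "funds": 6.5, "exam prep": 0.4, "travel": 0.1}
top_three = select_labels(weights_a, top_k=3)
above_one = select_labels(weights_a, threshold=1.0)
```

A fixed top-k always yields labels even for weakly matching articles, while a threshold can yield none; the two can also be combined by passing both arguments.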
In a preferred embodiment, so that the obtained second weights of each label in the label library with respect to each article can be compared on a common scale, making the comparison more accurate, the second weight of the label with respect to the article can be standardized to obtain a relative second weight of the label with respect to the article, as follows:
where:
$LPC_{l,t}$ denotes the relative second weight of label t in the pre-established label library with respect to article l in the existing article repository;
$LP_{l,t}$ denotes the second weight of label t in the pre-established label library with respect to article l in the existing article repository;
the remaining term denotes the average weight of label t in the pre-established label library with respect to the articles in the existing article repository;
$|L|$ denotes the total number of articles in the existing article repository.
Thus the average weight is obtained by summing the second weights of label t in the pre-established label library over all articles in the existing article repository and dividing by the total number of articles in the existing article repository, that is, $\frac{1}{|L|} \sum_{l \in L} LP_{l,t}$.
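The average weight described above, the sum of a label's second weights over all articles divided by the number of articles, can be sketched as follows. How the relative second weight combines $LP_{l,t}$ with this average is not reproduced in this text, so the sketch stops at the average; the function name and sample values are hypothetical.

```python
def average_weight(second_weights_for_label):
    """Average of LP[l, t] over all articles l for one label t.

    second_weights_for_label: {article_id: LP[l, t]}, with one entry
    for every article in the repository (including zero weights).
    """
    return sum(second_weights_for_label.values()) / len(second_weights_for_label)

# Hypothetical second weights of one label across a three-article repository.
lp_t = {"a1": 11.0, "a2": 1.0, "a3": 3.0}
avg_t = average_weight(lp_t)  # (11.0 + 1.0 + 3.0) / 3
```

Because labels that are common words tend to have large second weights everywhere, comparing each article's weight against the label's own average puts different labels on a comparable footing.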
After the relative second weights have been obtained, a certain number of labels are selected in a predetermined manner based on the relative second weights and applied to the corresponding article.
In a preferred embodiment, the articles in the article repository are preferably articles with a strong topical focus, for example: news articles, papers, and descriptive articles (such as application descriptions in an app store).
According to the method for labeling articles of the present invention, by establishing the association between the labels in the label library and the articles, suitable labels can be applied automatically to new articles from external data sources or to articles without labels. Each label represents one category, or multiple labels point to one category. This saves a large labor cost, markedly improves on the inefficiency of manual operation, and substantially reduces operating cost.
Fig. 2 is a schematic block diagram of the apparatus for labeling articles provided by an embodiment of the present invention. As shown in Fig. 2, the apparatus for labeling articles of the present invention includes:
a keyword library building unit, for extracting multiple keywords from all articles in an existing article repository and building a keyword library, the keyword library including but not limited to: the multiple keywords, and the term frequency with which each keyword occurs in each article in the existing article repository;
a first weight determining unit, for determining a first weight of each keyword in the keyword library with respect to each label in a pre-established label library;
a second weight determining unit, for determining, based on the obtained first weights and the term frequency of each keyword, a second weight of each label in the label library with respect to each article in the existing article repository;
a labeling unit, for selecting, based on the obtained second weights, a certain number of labels in a predetermined manner and applying them to the corresponding article.
The keyword library building unit builds the keyword library as follows:
First, multiple segmented words are extracted from all articles in the existing article repository using a word segmentation technique, and a segmented-word library is built;
Then, the discrimination degree of each segmented word in the segmented-word library is determined:
where:
$S_i$ denotes the discrimination degree of segmented word i;
$\theta$ is a user-defined decimal whose value is set in practice according to the number of labels;
$|T|$ denotes the total number of labels in the existing label library;
$P_{l,i}$ denotes the term frequency of segmented word i of the segmented-word library in article l of the existing article repository, with $P_{l,i} = 0$ if segmented word i does not appear in article l;
$|L|$ denotes the total number of articles in the existing article repository;
$pct([P_{l,i}]_{l \in L}, \theta, 1)$ denotes sorting the elements of the array $[P_{l,i}]_{l \in L}$ in descending order by value and summing the elements ranked from the $\theta$ quantile position through the last position;
$pct([P_{l,i}]_{l \in L}, 0, \theta)$ denotes sorting the elements of the array $[P_{l,i}]_{l \in L}$ in descending order by value and summing the elements ranked from the first position through the $\theta$ quantile position;
$L$ denotes the set of all articles in the existing article repository;
Finally, a certain number of words are selected as the keywords in a preset manner according to the discrimination degree.
The first weight determining unit determines the first weight as follows:
where:
$TW_{t,w}$ denotes the first weight of keyword w in the keyword library with respect to label t in the pre-established tag library; if the text content of label t does not contain keyword w, $TW_{t,w}$ is 0;
$PLT_{l,t}$ denotes the word frequency with which label t in the tag library occurs in article l of the existing article resource library;
$PLW_{l,w}$ denotes the word frequency with which keyword w in the keyword library occurs in article l of the existing article resource library;
$|L|$ denotes the total number of articles in the existing article resource library.
L denotes the set of all articles in the existing article resource library, so $l \in L$ means that article l belongs to the existing article resource library. $\sum_{l\in L}(PLT_{l,t}\cdot PLW_{l,w})$ denotes computing the value $(PLT_{l,t}\cdot PLW_{l,w})$ for every article in the existing article resource library and summing these values; it can also be written as $\sum_{l=1}^{|L|}(PLT_{l,t}\cdot PLW_{l,w})$, where $|L|$ is the total number of articles in the existing article resource library.
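The summation term described above can be sketched in Python as follows. This covers only the sum of frequency products over all articles; how the first weight combines that sum with $|L|$, and the rule that sets the weight to 0, are not reproduced here. The function name, the dictionary layout and the sample data are hypothetical.

```python
def frequency_product_sum(label_freq, keyword_freq):
    """Sum PLT[l][t] * PLW[l][w] over all articles l for one (t, w) pair.

    label_freq:   dict mapping article id -> frequency of label t in it
    keyword_freq: dict mapping article id -> frequency of keyword w in it
    A missing article id in keyword_freq is treated as frequency 0.
    """
    return sum(
        label_freq[article] * keyword_freq.get(article, 0)
        for article in label_freq
    )


# Hypothetical frequencies for label "sports" and keyword "football".
plt_sports = {"a1": 2, "a2": 1, "a3": 0}
plw_football = {"a1": 3, "a2": 0, "a3": 5}

print(frequency_product_sum(plt_sports, plw_football))  # prints "6"
```

Only article a1 contains both the label and the keyword, so it is the only article that contributes to the sum.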
The second weight determining unit determines the second weight of a label with respect to an article as follows:
where:
$LP_{l,t}$ denotes the second weight of label t in the pre-established tag library with respect to article l in the existing article resource library;
$TW_{t,w}$ denotes the first weight of keyword w in the keyword library with respect to label t in the pre-established tag library; if the text content of label t does not contain keyword w, $TW_{t,w}$ is 0;
$PLW_{l,w}$ denotes the word frequency with which keyword w in the keyword library occurs in article l of the existing article resource library;
N is the total number of keywords in the keyword library.
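The formula itself is not reproduced in this text. Given the symbols defined above ($TW_{t,w}$, $PLW_{l,w}$ and the keyword count N), one plausible reading is a sum over the N keywords of first weight times word frequency. The Python sketch below illustrates that reading only; it is an assumption, not the formula of the source, and the function name and data are hypothetical.

```python
def second_weight(first_weights, keyword_freq_in_article):
    """Assumed form: LP[l][t] = sum over keywords w of TW[t][w] * PLW[l][w].

    first_weights:           dict keyword -> first weight TW for one label t
    keyword_freq_in_article: dict keyword -> frequency PLW in one article l
    Keywords absent from the article count as frequency 0.
    """
    return sum(
        tw * keyword_freq_in_article.get(keyword, 0)
        for keyword, tw in first_weights.items()
    )


# Hypothetical first weights of three keywords for the label "sports".
tw_sports = {"football": 2.0, "goal": 1.0, "stock": 0.0}
# Hypothetical keyword frequencies in one article.
plw_article = {"football": 3, "stock": 4}

print(second_weight(tw_sports, plw_article))  # prints "6.0"
```

Under this reading, a keyword with first weight 0 for a label never raises that label's score, however often it appears in the article.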
The label marking unit selects, based on the obtained second weights and in a predetermined manner, a number of labels and marks them on the corresponding article. Preferred embodiments include: selecting a number of labels in order of the size of the second weights and marking them on the corresponding article; or presetting a threshold and marking on the corresponding article the one or more labels whose second weights exceed the preset threshold. For example, after the second weight of each label in the tag library with respect to an article A is obtained, a number of labels are selected in descending order of second weight, for example the labels ranked in the top 1-3 or top 1-5, and marked on article A.
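The two selection strategies described above, top-ranked labels and labels above a preset threshold, can be sketched in Python as follows. The function names, the sample weights and the chosen values of k and the threshold are hypothetical.

```python
def select_top_k(weights, k):
    """Return the k labels with the largest second weights, best first."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[:k]


def select_above_threshold(weights, threshold):
    """Return the labels whose second weight exceeds the threshold, best first."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    return [label for label in ranked if weights[label] > threshold]


# Hypothetical second weights of three labels for one article A.
weights_for_a = {"sports": 0.9, "finance": 0.2, "technology": 0.5}

print(select_top_k(weights_for_a, 2))              # prints "['sports', 'technology']"
print(select_above_threshold(weights_for_a, 0.6))  # prints "['sports']"
```

The top-k strategy always returns the same number of labels per article, while the threshold strategy may return none when no label scores high enough.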
In a preferred embodiment, so that the second weights of the labels in the tag library with respect to each article are placed on the same scale for comparison, making the comparison more accurate, the device for marking labels on articles of the present invention may further include a relative second weight determining unit (not shown), which standardizes the second weight of a label with respect to an article to obtain the relative second weight of the label with respect to the article, as follows:
where:
$LPC_{l,t}$ denotes the relative second weight of label t in the pre-established tag library with respect to article l in the existing article resource library;
$LP_{l,t}$ denotes the second weight of label t in the pre-established tag library with respect to article l in the existing article resource library;
the average-weight term denotes the average second weight of label t in the pre-established tag library with respect to the articles l in the existing article resource library;
$|L|$ denotes the total number of articles in the existing article resource library.
That is, the average weight is obtained by summing the second weights of label t in the pre-established tag library with respect to all articles in the existing article resource library and dividing the sum by the total number of articles in the existing article resource library.
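The average weight described above can be sketched in Python as follows. Only the average is computed; how the standardization combines $LP_{l,t}$ with this average to give $LPC_{l,t}$ is not reproduced in this text and is therefore left out. The function name and sample values are hypothetical.

```python
def average_second_weight(second_weights):
    """Mean of one label's second weights over all articles.

    second_weights: dict mapping article id -> LP value for a fixed label t.
    The dict is assumed to hold one entry for every article, so its length
    plays the role of |L|.
    """
    return sum(second_weights.values()) / len(second_weights)


# Hypothetical second weights of the label "sports" for three articles.
lp_sports = {"a1": 1.0, "a2": 2.0, "a3": 3.0}

print(average_second_weight(lp_sports))  # prints "2.0"
```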
After the relative second weights are obtained, the label marking unit selects, based on the obtained relative second weights and in a predetermined manner, a number of labels and marks them on the corresponding article.
In a preferred embodiment, the articles in the article resource library are preferably articles with a strong topical focus, for example news articles, papers, and descriptive articles (for example, articles describing app-store applications).
Of course, as those skilled in the art will appreciate, the relative second weight of a label with respect to an article can also be determined by the second weight determining unit, without a separate relative second weight determining unit.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working process of the device described above may refer to the corresponding process in the foregoing method embodiment. The examples and related descriptions given in the method embodiment apply equally to explaining how the device works and are not repeated here.
The device for marking labels on articles according to the present invention establishes the relevance between the labels in the tag library and the articles, so that suitable labels can be marked automatically on new articles from external data sources or on articles without labels. Each label represents one category, or multiple labels point to one category. This saves substantial labor cost, markedly improves on the low efficiency of manual operation, and greatly reduces operating cost.
The computer program product for the method of marking labels on articles provided by the embodiments of the present invention includes a computer-readable storage medium storing program code. The instructions in the program code can be used to perform the method described in the foregoing method embodiment; for the specific implementation, refer to the method embodiment, which is not repeated here.
If the functions are implemented as software functional units and sold or used as an independent product, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes over the prior art, or a part of the technical solution, can be embodied as a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a smart tablet, a smartphone, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (12)
1. A method for marking labels on articles, comprising:
extracting a plurality of keywords from all articles in an existing article resource library and establishing a keyword library, the keyword library including but not limited to: the plurality of keywords, and the word frequency with which each keyword occurs in each article in the existing article resource library;
determining a first weight of each keyword in the keyword library with respect to each label in a pre-established tag library;
determining, based on the obtained first weights and the word frequency of each keyword, a second weight of each label in the tag library with respect to each article in the existing article resource library;
selecting, based on the obtained second weights and in a predetermined manner, a number of labels and marking them on the corresponding article.
2. The method according to claim 1, characterized in that the step of extracting a plurality of keywords from all articles in the existing article resource library and establishing a keyword library comprises:
first, extracting a plurality of segmented words from all articles in the existing article resource library using a word segmentation technique, and establishing a segmented-word library;
then, determining the resolution of each segmented word in the segmented-word library:
where:
$S_i$ denotes the resolution of segmented word i;
θ is a user-defined decimal;
$P_{l,i}$ denotes the word frequency, in article l of the existing article resource library, of segmented word i from the segmented-word library; if segmented word i does not appear in article l, then $P_{l,i}=0$;
$|L|$ denotes the total number of articles in the existing article resource library;
$\mathrm{pct}([P_{l,i}]_{l\in L}, \theta, 1)$ denotes sorting the elements of the array $[P_{l,i}]_{l\in L}$ in descending order of value and summing the elements ranked from the θ quantile position through the last position;
$\mathrm{pct}([P_{l,i}]_{l\in L}, 0, \theta)$ denotes sorting the elements of the array $[P_{l,i}]_{l\in L}$ in descending order of value and summing the elements ranked from the first position through the θ quantile position;
L denotes the set of all articles in the existing article resource library;
finally, selecting a number of words in a predetermined manner according to the resolution as the plurality of keywords.
3. The method according to claim 1, characterized in that, in the step of determining the first weight of each keyword in the keyword library with respect to each label in the pre-established tag library, the first weight is determined as follows:
where:
$TW_{t,w}$ denotes the first weight of keyword w in the keyword library with respect to label t in the pre-established tag library; if the text content of label t does not contain keyword w, $TW_{t,w}$ is 0;
$PLT_{l,t}$ denotes the word frequency with which label t in the tag library occurs in article l of the existing article resource library;
$PLW_{l,w}$ denotes the word frequency with which keyword w in the keyword library occurs in article l of the existing article resource library;
$|L|$ denotes the total number of articles in the existing article resource library;
L denotes the set of all articles in the existing article resource library.
4. The method according to claim 1, characterized in that, in the step of determining, based on the obtained first weights and the word frequency of each keyword, the second weight of each label in the tag library with respect to each article in the existing article resource library, the second weight of a label with respect to an article is determined as follows:
where:
$LP_{l,t}$ denotes the second weight of label t in the pre-established tag library with respect to article l in the existing article resource library;
$TW_{t,w}$ denotes the first weight of keyword w in the keyword library with respect to label t in the pre-established tag library; if the text content of label t does not contain keyword w, $TW_{t,w}$ is 0;
$PLW_{l,w}$ denotes the word frequency with which keyword w in the keyword library occurs in article l of the existing article resource library;
N is the total number of keywords in the keyword library.
5. The method according to claim 4, characterized by further comprising: standardizing the second weight of the label with respect to the article to obtain a relative second weight of the label with respect to the article, as follows:
where:
$LPC_{l,t}$ denotes the relative second weight of label t in the pre-established tag library with respect to article l in the existing article resource library;
$LP_{l,t}$ denotes the second weight of label t in the pre-established tag library with respect to article l in the existing article resource library;
the average-weight term denotes the average second weight of label t in the pre-established tag library with respect to the articles l in the existing article resource library;
$|L|$ denotes the total number of articles in the existing article resource library.
6. The method according to claim 1, characterized in that the step of selecting, based on the obtained second weights and in a predetermined manner, a number of labels and marking them on the corresponding article comprises: selecting a number of labels in order of the size of the second weights and marking them on the corresponding article, or selecting the one or more labels whose second weights exceed a preset threshold and marking them on the corresponding article.
7. A device for marking labels on articles, comprising:
a keyword library establishing unit, configured to extract a plurality of keywords from all articles in an existing article resource library and establish a keyword library, the keyword library including but not limited to: the plurality of keywords, and the word frequency with which each keyword occurs in each article in the existing article resource library;
a first weight determining unit, configured to determine a first weight of each keyword in the keyword library with respect to each label in a pre-established tag library;
a second weight determining unit, configured to determine, based on the obtained first weights and the word frequency of each keyword, a second weight of each label in the tag library with respect to each article in the existing article resource library;
a label marking unit, configured to select, based on the obtained second weights and in a predetermined manner, a number of labels and mark them on the corresponding article.
8. The device according to claim 7, characterized in that the process by which the keyword library establishing unit establishes the keyword library comprises:
first, extracting a plurality of segmented words from all articles in the existing article resource library using a word segmentation technique, and establishing a segmented-word library;
then, determining the resolution of each segmented word in the segmented-word library:
where:
$S_i$ denotes the resolution of segmented word i;
θ is a user-defined decimal;
$P_{l,i}$ denotes the word frequency, in article l of the existing article resource library, of segmented word i from the segmented-word library; if segmented word i does not appear in article l, then $P_{l,i}=0$;
$|L|$ denotes the total number of articles in the existing article resource library;
$\mathrm{pct}([P_{l,i}]_{l\in L}, \theta, 1)$ denotes sorting the elements of the array $[P_{l,i}]_{l\in L}$ in descending order of value and summing the elements ranked from the θ quantile position through the last position;
$\mathrm{pct}([P_{l,i}]_{l\in L}, 0, \theta)$ denotes sorting the elements of the array $[P_{l,i}]_{l\in L}$ in descending order of value and summing the elements ranked from the first position through the θ quantile position;
L denotes the set of all articles in the existing article resource library;
finally, selecting a number of words in a predetermined manner according to the resolution as the plurality of keywords.
9. The device according to claim 7, characterized in that the first weight determining unit determines the first weight as follows:
where:
$TW_{t,w}$ denotes the first weight of keyword w in the keyword library with respect to label t in the pre-established tag library; if the text content of label t does not contain keyword w, $TW_{t,w}$ is 0;
$PLT_{l,t}$ denotes the word frequency with which label t in the tag library occurs in article l of the existing article resource library;
$PLW_{l,w}$ denotes the word frequency with which keyword w in the keyword library occurs in article l of the existing article resource library;
$|L|$ denotes the total number of articles in the existing article resource library;
L denotes the set of all articles in the existing article resource library.
10. The device according to claim 7, characterized in that the second weight determining unit determines the second weight of the label with respect to the article as follows:
where:
$LP_{l,t}$ denotes the second weight of label t in the pre-established tag library with respect to article l in the existing article resource library;
$TW_{t,w}$ denotes the first weight of keyword w in the keyword library with respect to label t in the pre-established tag library; if the text content of label t does not contain keyword w, $TW_{t,w}$ is 0;
$PLW_{l,w}$ denotes the word frequency with which keyword w in the keyword library occurs in article l of the existing article resource library;
N is the total number of keywords in the keyword library.
11. The device according to claim 10, characterized in that the device further comprises: a relative second weight determining unit, configured to standardize the second weight of the label with respect to the article to obtain a relative second weight of the label with respect to the article, as follows:
where:
$LPC_{l,t}$ denotes the relative second weight of label t in the pre-established tag library with respect to article l in the existing article resource library;
$LP_{l,t}$ denotes the second weight of label t in the pre-established tag library with respect to article l in the existing article resource library;
the average-weight term denotes the average second weight of label t in the pre-established tag library with respect to the articles l in the existing article resource library;
$|L|$ denotes the total number of articles in the existing article resource library.
12. The device according to claim 7, characterized in that the label marking unit is further configured to select a number of labels in order of the size of the second weights and mark them on the corresponding article, or to select the one or more labels whose second weights exceed a preset threshold and mark them on the corresponding article.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710172954.4A CN106980667B (en) | 2017-03-22 | 2017-03-22 | A kind of method and apparatus to article mark label |
PCT/CN2018/071607 WO2018171295A1 (en) | 2017-03-22 | 2018-01-05 | Method and apparatus for tagging article, terminal, and computer readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710172954.4A CN106980667B (en) | 2017-03-22 | 2017-03-22 | A kind of method and apparatus to article mark label |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106980667A (en) | 2017-07-25 |
CN106980667B (en) | 2019-04-12 |
Family
ID=59339570
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710172954.4A Active CN106980667B (en) | 2017-03-22 | 2017-03-22 | A kind of method and apparatus to article mark label |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN106980667B (en) |
WO (1) | WO2018171295A1 (en) |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106980667B (en) * | 2017-03-22 | 2019-04-12 | 广州优视网络科技有限公司 | A kind of method and apparatus to article mark label |
2017
- 2017-03-22 CN CN201710172954.4A patent/CN106980667B/en active Active
2018
- 2018-01-05 WO PCT/CN2018/071607 patent/WO2018171295A1/en active Application Filing
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102289523A (en) * | 2011-09-20 | 2011-12-21 | 北京金和软件股份有限公司 | Method for intelligently extracting text labels |
CN103164471A (en) * | 2011-12-15 | 2013-06-19 | 盛乐信息技术(上海)有限公司 | Recommendation method and system of video text labels |
US20160070803A1 (en) * | 2014-09-09 | 2016-03-10 | Funky Flick, Inc. | Conceptual product recommendation |
CN105893478A (en) * | 2016-03-29 | 2016-08-24 | 广州华多网络科技有限公司 | Tag extraction method and equipment |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018171295A1 (en) * | 2017-03-22 | 2018-09-27 | 广州优视网络科技有限公司 | Method and apparatus for tagging article, terminal, and computer readable storage medium |
WO2018188378A1 (en) * | 2017-04-10 | 2018-10-18 | 广州优视网络科技有限公司 | Method and device for tagging label for application, terminal and computer readable storage medium |
CN107748745A (en) * | 2017-11-08 | 2018-03-02 | 厦门美亚商鼎信息科技有限公司 | A kind of enterprise name keyword extraction method |
CN107748745B (en) * | 2017-11-08 | 2021-08-03 | 厦门美亚商鼎信息科技有限公司 | Enterprise name keyword extraction method |
CN111611461A (en) * | 2019-05-14 | 2020-09-01 | 北京精准沟通传媒科技股份有限公司 | Data processing method and device |
CN110519654A (en) * | 2019-09-11 | 2019-11-29 | 广州荔支网络技术有限公司 | A kind of label determines method and device |
CN110519654B (en) * | 2019-09-11 | 2021-07-27 | 广州荔支网络技术有限公司 | Label determining method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
WO2018171295A1 (en) | 2018-09-27 |
CN106980667B (en) | 2019-04-12 |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
TR01 | Transfer of patent right | Effective date of registration: 20200415. Address after: 310052 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province. Patentee after: Alibaba (China) Co.,Ltd. Address before: 510627 Guangdong city of Guangzhou province Whampoa Tianhe District Road No. 163 Xiping Yun Lu Yun Ping square B radio tower 15 layer self unit 02. Patentee before: GUANGZHOU UC NETWORK TECHNOLOGY Co.,Ltd. |