CN106528894B - Method and device for setting label information
- Publication number
- CN106528894B (CN201611235463.1A)
- Authority
- CN
- China
- Prior art keywords
- keyword
- probability
- information
- subject information
- multimedia file
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7844—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- Library & Information Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present disclosure provides a method and device for setting label information, and belongs to the field of Internet technology. The method includes: obtaining caption information of a target multimedia file; segmenting the caption information into words to obtain a first keyword set; analyzing each keyword in the first keyword set to obtain label information of the target multimedia file; and setting the label information for the target multimedia file. By performing semantic analysis on the caption information of the target multimedia file, the disclosure extracts the label information of the target multimedia file and sets it for that file, which improves both the efficiency and the accuracy of setting label information.
Description
Technical field
The present disclosure relates to the field of Internet technology, and in particular to a method and device for setting label information.
Background
With the arrival of the information age, servers store more and more video files, and it becomes increasingly difficult for a user to obtain video files of interest from a server. To reduce this difficulty, the server can set label information for each video file, so that the user can select video files of interest from the server according to their label information.
Currently, the label information of a video file is usually defined by a person who watches the video file, and is then set for that video file. The label information includes the subject information to which the video file belongs; for example, the label information may be emotion or comedy.
Summary of the invention
To overcome the problems in the related art, the present disclosure provides a method and device for setting label information. The technical solution is as follows:
According to a first aspect of the embodiments of the present disclosure, a method for setting label information is provided, the method comprising:
obtaining caption information of a target multimedia file;
segmenting the caption information into words to obtain a first keyword set;
analyzing each keyword in the first keyword set to obtain label information of the target multimedia file; and
setting the label information for the target multimedia file.
In the embodiments of the present disclosure, semantic analysis is performed on the caption information of the target multimedia file to extract its label information, and the label information is set for the target multimedia file. This improves both the efficiency and the accuracy of setting label information.
In a possible implementation, analyzing each keyword in the first keyword set to obtain the label information of the target multimedia file comprises:
obtaining the probability of each keyword in the caption information, and obtaining the probability that each keyword belongs to each piece of subject information in a subject information library, where the subject information library stores multiple pieces of preset subject information;
determining, according to the probability of each keyword in the caption information and the probability that each keyword belongs to each piece of subject information, the probability that the target multimedia file belongs to each piece of subject information;
selecting from the subject information, according to the probability that the target multimedia file belongs to each piece of subject information, the preset number of pieces of subject information with the largest probabilities; and
composing the label information of the target multimedia file from the selected preset number of pieces of subject information.
In the embodiments of the present disclosure, the probability that the target multimedia file belongs to each piece of subject information is determined from the probability of each keyword in the caption information and the probability that each keyword belongs to each piece of subject information, and the preset number of pieces of subject information with the largest probabilities are then selected, which improves the accuracy of setting label information.
In a possible implementation, determining the probability that the target multimedia file belongs to each piece of subject information, according to the probability of each keyword in the caption information and the probability that each keyword belongs to each piece of subject information, comprises:
forming a first probability matrix from the probability of each keyword in the caption information, and forming a second probability matrix from the probability that each keyword belongs to each piece of subject information;
multiplying the inverse matrix of the second probability matrix by the first probability matrix to obtain a third probability matrix; and
obtaining, from the third probability matrix, the probability that the target multimedia file belongs to each piece of subject information.
In the embodiments of the present disclosure, the first probability matrix is formed from the probability of each keyword in the caption information, the second probability matrix is formed from the probability that each keyword belongs to each piece of subject information, and the probability that the target multimedia file belongs to each piece of subject information is determined from the two matrices. This improves the accuracy of that probability and, in turn, the accuracy of setting label information.
In a possible implementation, obtaining the probability that each keyword belongs to each piece of subject information in the subject information library comprises:
for each piece of subject information, obtaining the predetermined keyword set corresponding to the subject information; and
determining the probability that each keyword belongs to the subject information according to the probability of the keyword in the caption information, the predetermined keyword set, and the number of keywords the predetermined keyword set includes.
In a possible implementation, determining the probability that each keyword belongs to the subject information according to the probability of the keyword in the caption information, the predetermined keyword set, and the number of keywords the predetermined keyword set includes comprises:
if the predetermined keyword set includes the keyword, taking the ratio of the probability of the keyword in the caption information to the number of keywords the predetermined keyword set includes as the probability that the keyword belongs to the subject information; and
if the predetermined keyword set does not include the keyword, determining that the probability that the keyword belongs to the subject information is zero.
In the embodiments of the present disclosure, the ratio of the probability of each keyword in the caption information to the number of keywords the predetermined keyword set includes is taken as the probability that the keyword belongs to the subject information. Because this takes into account the probability of each keyword in the caption information, it improves the accuracy of the probability that each keyword belongs to the subject information and, in turn, the accuracy of setting label information.
In a possible implementation, segmenting the caption information into words to obtain the first keyword set comprises:
segmenting the caption information and forming a second keyword set from the words the caption information includes; and
removing keywords of a preset type from the second keyword set to obtain the first keyword set.
In the embodiments of the present disclosure, removing keywords of the preset type from the second keyword set not only reduces the amount of computation but also improves the accuracy of setting label information.
According to a second aspect of the embodiments of the present disclosure, a device for setting label information is provided, the device comprising:
an obtaining module, configured to obtain caption information of a target multimedia file;
a word segmentation module, configured to segment the caption information into words to obtain a first keyword set;
an analysis module, configured to analyze each keyword in the first keyword set to obtain label information of the target multimedia file; and
a setting module, configured to set the label information for the target multimedia file.
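The four modules above can be pictured as a small composition of callables. The following is only an illustrative sketch: the class name `LabelSettingDevice`, its field names, and the toy lambdas are assumptions made for this example and do not come from the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class LabelSettingDevice:
    """Four modules wired together; every interface here is hypothetical."""
    obtain: Callable[[str], str]               # obtaining module: file identifier -> caption text
    segment: Callable[[str], List[str]]        # word segmentation module: caption -> first keyword set
    analyze: Callable[[List[str]], List[str]]  # analysis module: keywords -> label information
    labels: Dict[str, List[str]]               # storage written by the setting module

    def run(self, file_id: str) -> List[str]:
        caption = self.obtain(file_id)
        keywords = self.segment(caption)
        label_info = self.analyze(keywords)
        self.labels[file_id] = label_info      # setting module: attach labels to the file
        return label_info


# Toy stand-ins for each module, for illustration only.
device = LabelSettingDevice(
    obtain=lambda file_id: {"v1": "friend friend enemy"}[file_id],
    segment=str.split,
    analyze=lambda kws: ["friendship"] if "friend" in kws else [],
    labels={},
)
result = device.run("v1")
```

Keeping each module behind a plain callable makes it easy to swap in a different segmentation tool or analysis strategy without touching the others.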
In a possible implementation, the analysis module comprises:
a first obtaining unit, configured to obtain the probability of each keyword in the caption information;
a second obtaining unit, configured to obtain the probability that each keyword belongs to each piece of subject information in a subject information library, where the subject information library stores multiple pieces of preset subject information;
a determining unit, configured to determine the probability that the target multimedia file belongs to each piece of subject information according to the probability of each keyword in the caption information and the probability that each keyword belongs to each piece of subject information;
a selecting unit, configured to select from the subject information, according to the probability that the target multimedia file belongs to each piece of subject information, the preset number of pieces of subject information with the largest probabilities; and
a first composing unit, configured to compose the label information of the target multimedia file from the selected preset number of pieces of subject information.
In a possible implementation, the determining unit is further configured to form a first probability matrix from the probability of each keyword in the caption information, form a second probability matrix from the probability that each keyword belongs to each piece of subject information, multiply the inverse matrix of the second probability matrix by the first probability matrix to obtain a third probability matrix, and obtain from the third probability matrix the probability that the target multimedia file belongs to each piece of subject information.
In a possible implementation, the second obtaining unit is further configured to obtain, for each piece of subject information, the predetermined keyword set corresponding to the subject information, and to determine the probability that each keyword belongs to the subject information according to the probability of the keyword in the caption information, the predetermined keyword set, and the number of keywords the predetermined keyword set includes.
In a possible implementation, the second obtaining unit is further configured to: if the predetermined keyword set includes the keyword, take the ratio of the probability of the keyword in the caption information to the number of keywords the predetermined keyword set includes as the probability that the keyword belongs to the subject information; and if the predetermined keyword set does not include the keyword, determine that the probability that the keyword belongs to the subject information is zero.
In a possible implementation, the word segmentation module comprises:
a segmenting unit, configured to segment the caption information into words;
a second composing unit, configured to form a second keyword set from the words the caption information includes; and
a removing unit, configured to remove keywords of a preset type from the second keyword set to obtain the first keyword set.
According to a third aspect of the embodiments of the present disclosure, a device for setting label information is provided, the device comprising:
a processor; and
a memory for storing instructions executable by the processor;
wherein the processor is configured to:
obtain caption information of a target multimedia file;
segment the caption information into words to obtain a first keyword set;
analyze each keyword in the first keyword set to obtain label information of the target multimedia file; and
set the label information for the target multimedia file.
The technical solutions provided by the embodiments of the present disclosure can have the following beneficial effects:
In the embodiments of the present disclosure, semantic analysis is performed on the caption information of the target multimedia file to extract its label information, and the label information is set for the target multimedia file. This improves both the efficiency and the accuracy of setting label information.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present disclosure.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the specification, serve to explain the principles of the present disclosure.
Fig. 1 is a flow chart of a method for setting label information according to an exemplary embodiment;
Fig. 2 is a flow chart of a method for setting label information according to an exemplary embodiment;
Fig. 3 is a block diagram of a device for setting label information according to an exemplary embodiment;
Fig. 4 is a block diagram of an analysis module according to an exemplary embodiment;
Fig. 5 is a block diagram of a word segmentation module according to an exemplary embodiment;
Fig. 6 is a block diagram of a device for setting label information according to an exemplary embodiment.
Detailed description
To make the purposes, technical solutions, and advantages of the present disclosure clearer, the embodiments of the present disclosure are described in further detail below with reference to the accompanying drawings.
Exemplary embodiments are described in detail here, with examples illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of devices and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
To make it easier for a user to obtain video files of interest from a server, the server needs to set label information for the video files it stores; the label information may include the subject information to which a video file belongs, and the like. The user can then select video files of interest from the server according to their label information.
In the related art, the label information of a video file is usually defined by a person who watches the video file, and is then set for that video file. However, the number of video files in a server is very large, and each video file is relatively long, usually about one hundred minutes; setting label information for video files manually is therefore inefficient. Moreover, the result is affected by the user's subjective factors, so the label information set by the user is inaccurate.
In the embodiments of the present disclosure, the server performs semantic analysis on the caption information of a multimedia file to extract the label information of the multimedia file, and sets the label information for the multimedia file. This improves both the efficiency and the accuracy of setting label information.
Fig. 1 is a flow chart of a method for setting label information according to an exemplary embodiment. The method may be executed by a server and, as shown in Fig. 1, includes the following steps.
In step S101, caption information of a target multimedia file is obtained.
In step S102, the caption information is segmented into words to obtain a first keyword set.
In step S103, each keyword in the first keyword set is analyzed to obtain label information of the target multimedia file.
In step S104, the label information is set for the target multimedia file.
In a possible implementation, analyzing each keyword in the first keyword set to obtain the label information of the target multimedia file comprises:
obtaining the probability of each keyword in the caption information, and obtaining the probability that each keyword belongs to each piece of subject information in a subject information library, where the subject information library stores multiple pieces of preset subject information;
determining, according to the probability of each keyword in the caption information and the probability that each keyword belongs to each piece of subject information, the probability that the target multimedia file belongs to each piece of subject information;
selecting from the subject information, according to the probability that the target multimedia file belongs to each piece of subject information, the preset number of pieces of subject information with the largest probabilities; and
composing the label information of the target multimedia file from the selected preset number of pieces of subject information.
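The selection step can be sketched as a top-k pick over a mapping from subject information to probability. The function name `select_labels` and the probability values below are hypothetical and serve only to illustrate the idea.

```python
import heapq
from typing import Dict, List


def select_labels(topic_probs: Dict[str, float], preset_number: int) -> List[str]:
    """Return the preset_number pieces of subject information with the largest probability."""
    # Iterating a dict yields its keys; the key function ranks them by probability.
    return heapq.nlargest(preset_number, topic_probs, key=topic_probs.get)


# Hypothetical probabilities, for illustration only.
topic_probs = {"friendship": 0.5, "emotion": 0.3, "love": 0.15, "comedy": 0.05}
labels = select_labels(topic_probs, preset_number=2)
```

The selected items, in descending order of probability, then make up the label information of the file.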
In a possible implementation, determining the probability that the target multimedia file belongs to each piece of subject information, according to the probability of each keyword in the caption information and the probability that each keyword belongs to each piece of subject information, comprises:
forming a first probability matrix from the probability of each keyword in the caption information, and forming a second probability matrix from the probability that each keyword belongs to each piece of subject information;
multiplying the inverse matrix of the second probability matrix by the first probability matrix to obtain a third probability matrix; and
obtaining, from the third probability matrix, the probability that the target multimedia file belongs to each piece of subject information.
In a possible implementation, obtaining the probability that each keyword belongs to each piece of subject information in the subject information library comprises:
for each piece of subject information, obtaining the predetermined keyword set corresponding to the subject information; and
determining the probability that each keyword belongs to the subject information according to the probability of the keyword in the caption information, the predetermined keyword set, and the number of keywords the predetermined keyword set includes.
In a possible implementation, determining the probability that each keyword belongs to the subject information according to the probability of the keyword in the caption information, the predetermined keyword set, and the number of keywords the predetermined keyword set includes comprises:
if the predetermined keyword set includes the keyword, taking the ratio of the probability of the keyword in the caption information to the number of keywords the predetermined keyword set includes as the probability that the keyword belongs to the subject information; and
if the predetermined keyword set does not include the keyword, determining that the probability that the keyword belongs to the subject information is zero.
In a possible implementation, segmenting the caption information into words to obtain the first keyword set comprises:
segmenting the caption information and forming a second keyword set from the words the caption information includes; and
removing keywords of a preset type from the second keyword set to obtain the first keyword set.
All of the above optional technical solutions can be combined in any manner to form optional embodiments of the present disclosure, which are not described one by one here.
Fig. 2 is a flow chart of a method for setting label information according to an exemplary embodiment. The method may be executed by a server and, as shown in Fig. 2, includes the following steps.
In step S201, the server obtains caption information of a target multimedia file.
The server stores a large number of multimedia files, and selects from them a multimedia file for which no label information has been set as the target multimedia file. The server also stores a correspondence between identifiers of multimedia files and subtitle files. Accordingly, this step may be:
According to the identifier of the target multimedia file, the server obtains the subtitle file of the target multimedia file from the correspondence between identifiers of multimedia files and subtitle files, and obtains the caption information of the target multimedia file from that subtitle file.
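A minimal sketch of this lookup follows, assuming the correspondence is a simple in-memory mapping. The identifiers, the file name, and the two dictionaries are hypothetical; a real server would read subtitle files from its own storage.

```python
from typing import Dict

# Hypothetical correspondence between multimedia file identifiers and subtitle files.
SUBTITLE_FILE_BY_ID: Dict[str, str] = {"video-001": "video-001.sub"}

# Subtitle contents held in memory for the sake of a self-contained example.
SUBTITLE_FILES: Dict[str, str] = {
    "video-001.sub": "The person who understands you most is not your friend, but your enemy.",
}


def get_caption_information(file_id: str) -> str:
    """Look up the subtitle file by identifier, then return its caption text."""
    subtitle_file = SUBTITLE_FILE_BY_ID[file_id]  # raises KeyError for an unknown identifier
    return SUBTITLE_FILES[subtitle_file]


caption = get_caption_information("video-001")
```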
The target multimedia file may be a video file or an audio file. The identifier of the target multimedia file may be its title, its number, or the like. The embodiments of the present disclosure do not specifically limit the identifier of the target multimedia file.
In step S202, the server segments the caption information into words to obtain a first keyword set.
In this step, the server may segment the caption information and form the first keyword set from the words the caption information includes; alternatively, the server may obtain the first keyword set through the following steps (1)-(2):
(1): The server segments the caption information and forms a second keyword set from the words the caption information includes.
Using a preset word segmentation tool, the server segments the caption information to obtain each word it includes, and forms the second keyword set from those words.
For example, the caption information is "The person who understands you most is not your friend, but your enemy." Segmenting this caption information with the preset word segmentation tool yields the words "most", "understand", "you", "people", "not", "you", "friend", "but", "you", "enemy", so the second keyword set is {"most", "understand", "you", "people", "not", "you", "friend", "but", "you", "enemy"}.
The preset word segmentation tool may be StandardAnalyzer (standard analyzer), ChineseAnalyzer (Chinese analyzer), CJKAnalyzer (CJK analyzer), or IKAnalyzer (IK analyzer). The embodiments of the present disclosure do not specifically limit the preset word segmentation tool.
Keywords such as "most" and similar function words play no key role in the label information. Therefore, to reduce the amount of computation and improve the accuracy of setting label information, in this step the server may also remove such keywords from the second keyword set through the following step (2).
(2): The server removes keywords of a preset type from the second keyword set to obtain the first keyword set.
Keywords of the preset type may be modal particles, auxiliary words, and the like. This step may then be: the server tags the part of speech of each keyword in the second keyword set, finds the keywords of the preset type in the second keyword set according to the part of speech of each keyword, and removes them from the second keyword set to obtain the first keyword set.
For example, the server removes "most", "you", "people", "not", and "but" from the second keyword set {"most", "understand", "you", "people", "not", "you", "friend", "but", "you", "enemy"}, and obtains the first keyword set {"understand", "friend", "enemy"}.
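The removal in this example can be sketched with a plain filter. Note the assumption: the text describes removal based on part-of-speech tagging, whereas this sketch substitutes a hand-written removal list that simply mirrors the example, and it uses English stand-ins for the segmented words. The names `remove_preset_keywords` and `removal_list` are hypothetical.

```python
from typing import Iterable, List, Set


def remove_preset_keywords(second_set: Iterable[str], removal_list: Set[str]) -> List[str]:
    """Drop every keyword found in removal_list, keeping the order of the rest."""
    return [kw for kw in second_set if kw not in removal_list]


second_keyword_set = ["most", "understand", "you", "people", "not",
                      "you", "friend", "but", "you", "enemy"]
removal_list = {"most", "you", "people", "not", "but"}  # assumed list, mirrors the example
first_keyword_set = remove_preset_keywords(second_keyword_set, removal_list)
```

With a real part-of-speech tagger, `removal_list` would be replaced by a check on each keyword's tag.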
In a possible implementation, the first keyword set may include synonyms or near-synonyms; for example, "capital" and "Beijing" are synonyms. Therefore, to reduce the amount of computation, after obtaining the first keyword set the server may also merge multiple synonyms or near-synonyms in the first keyword set into one keyword. Because this reduces the number of keywords in the first keyword set, it reduces the amount of computation on the server and thus improves the efficiency of setting label information.
In step S203, the server analyzes each keyword in the first keyword set to obtain the label information of the target multimedia file.
This step can be implemented in either the first way or the second way described below. In the first way, this step can be implemented through the following steps (1)-(3):
(1): The server obtains the probability of each keyword in the caption information.
The server obtains the number of times each keyword occurs in the caption information, calculates the sum of the occurrence counts of all keywords, and takes the ratio of each keyword's occurrence count to that sum as the probability of the keyword in the caption information.
It should be noted that if the server has merged multiple synonyms or near-synonyms in the first keyword set into one keyword, then to obtain the probability of that keyword in the caption information the server obtains the total number of times the keyword's synonyms or near-synonyms occur in the caption information, calculates the sum of the occurrence counts of all keywords, and takes the ratio of that total to the sum as the probability of the keyword in the caption information.
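The ratio described above can be sketched with a counter. The function name `keyword_probabilities` and the sample data are hypothetical; if synonyms have been mapped to one canonical keyword beforehand, their occurrences are summed automatically by the count.

```python
from collections import Counter
from typing import Dict, List


def keyword_probabilities(occurrences: List[str]) -> Dict[str, float]:
    """Map each keyword to (its occurrence count) / (sum of all occurrence counts)."""
    counts = Counter(occurrences)
    total = sum(counts.values())
    return {kw: n / total for kw, n in counts.items()}


# Keywords as they occur in a caption after filtering (hypothetical data).
probs = keyword_probabilities(["friend", "enemy", "friend", "understand"])
```

By construction the probabilities of all keywords sum to one.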
(2): The server obtains the probability that each keyword belongs to each piece of subject information in the subject information library, where the subject information library stores multiple pieces of preset subject information.
The preset subject information may be "friendship", "emotion", "love", and the like. This step can be implemented through the following steps (2-1)-(2-2):
(2-1): For each piece of subject information, the server obtains the predetermined keyword set corresponding to the subject information.
For each piece of subject information in the subject information library, the server stores a correspondence between the subject information and a predetermined keyword set. Accordingly, this step may be:
According to the subject information, the server obtains the corresponding predetermined keyword set from the correspondence between subject information and predetermined keyword sets. The predetermined keyword set includes multiple preset keywords that belong to the subject information.
For example, the server obtains the predetermined keyword set {friend, friendship, brotherhood} corresponding to the subject information "friendship".
(2-2): The server determines the probability that each keyword belongs to the subject information according to the probability of the keyword in the caption information, the predetermined keyword set, and the number of keywords the predetermined keyword set includes.
For each keyword, the server checks whether the predetermined keyword set includes the keyword. If it does, the server takes the ratio of the probability of the keyword in the caption information to the number of keywords the predetermined keyword set includes as the probability that the keyword belongs to the subject information.
If the predetermined keyword set does not include the keyword, the server determines that the probability that the keyword belongs to the subject information is zero.
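The two cases of step (2-2) can be sketched as a single function. The function name and the sample probabilities are hypothetical; the predetermined keyword set reuses the "friendship" example.

```python
from typing import Dict, Set


def keyword_topic_probability(keyword: str,
                              keyword_probs: Dict[str, float],
                              predetermined: Set[str]) -> float:
    """Probability that `keyword` belongs to the subject information owning `predetermined`."""
    if keyword in predetermined:
        # Ratio of the keyword's caption probability to the size of the predetermined set.
        return keyword_probs[keyword] / len(predetermined)
    return 0.0


keyword_probs = {"friend": 0.5, "enemy": 0.25, "understand": 0.25}  # hypothetical values
friendship = {"friend", "friendship", "brotherhood"}
p_friend = keyword_topic_probability("friend", keyword_probs, friendship)
p_enemy = keyword_topic_probability("enemy", keyword_probs, friendship)
```

Here "friend" is in the set of three keywords, so its value is 0.5 divided by 3, while "enemy" is not in the set and gets zero.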
(3): The server determines the probability that the target multimedia file belongs to each piece of subject information according to the probability of each keyword in the caption information and the probability that each keyword belongs to each piece of subject information.
This step can be implemented through the following steps (3-1)-(3-3):
(3-1): The server forms a first probability matrix from the probability of each keyword in the caption information, and forms a second probability matrix from the probability that each keyword belongs to each piece of subject information.
The server takes the probability of each keyword in the caption information as one row of data to form the first probability matrix. For each keyword, the server takes the probabilities that the keyword belongs to each piece of subject information as one row of data to form the second probability matrix.
The first probability matrix is an n × 1 matrix and the second probability matrix is an n × m matrix, where n is the number of keywords in the first keyword set and m is the number of pieces of preset subject information in the subject information library.
For example, each keyword is respectively A, B and C;A, probability of the B and C in the caption information is respectively PA、PBAnd PC,
The each subject information for including in subject information library is the theme 1, theme 2, theme 3 and theme 4 respectively;Keyword A belongs to each
The probability of subject information is respectively A1, A2, A3 and A4, and the probability that keyword B belongs to each subject information is respectively B1, B2, B3
And it is respectively C1, C2, C3 and C4 that B4, keyword C, which belong to the probability of each subject information,.
Then the first probability matrix isSecond probability matrix is
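As a rough sketch of step (3-1), the two matrices can be assembled as nested lists, one row per keyword. The function name and all numeric values below are assumptions made for illustration; only the shapes (n × 1 and n × m) follow the description.

```python
def build_probability_matrices(keywords, subjects, caption_prob, subject_prob):
    """Return (first, second): an n x 1 and an n x m nested list.

    caption_prob[k]    : probability of keyword k in the caption information.
    subject_prob[k][s] : probability that keyword k belongs to subject s
                         (missing entries are treated as zero).
    """
    first = [[caption_prob[k]] for k in keywords]
    second = [[subject_prob[k].get(s, 0.0) for s in subjects] for k in keywords]
    return first, second


# Illustrative data for keywords A, B, C and four themes.
keywords = ["A", "B", "C"]
subjects = ["theme 1", "theme 2", "theme 3", "theme 4"]
caption_prob = {"A": 0.5, "B": 0.3, "C": 0.2}
subject_prob = {
    "A": {"theme 1": 0.25},
    "B": {"theme 2": 0.15, "theme 3": 0.1},
    "C": {"theme 4": 0.05},
}
first, second = build_probability_matrices(keywords, subjects, caption_prob, subject_prob)
```

Each row of `second` corresponds to one keyword and each column to one theme, matching the layout of the second probability matrix described above.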
(3-2): The server multiplies the inverse matrix of the second probability matrix by the first probability matrix to obtain a third probability matrix.
The server determines the inverse matrix of the second probability matrix from the second probability matrix, and multiplies that inverse matrix by the first probability matrix to obtain the third probability matrix. The third probability matrix is an m × 1 matrix, and each row of data in it is the probability that the target multimedia file belongs to the corresponding piece of subject information.
For example, the third probability matrix obtained by the server is
$$\begin{bmatrix} P_1 \\ P_2 \\ P_3 \\ P_4 \end{bmatrix}$$
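Step (3-2) might be sketched as follows. An ordinary matrix inverse exists only for a square, non-singular matrix, so this sketch assumes n = m (as many keywords as pieces of subject information); the description does not say how other shapes are handled, and a pseudo-inverse would be one possible substitute, which is an assumption rather than something stated in the source. The helper names `invert` and `multiply` and the numbers are hypothetical.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("an ordinary inverse needs a square matrix")
    # Augment the matrix with the identity: [M | I].
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        # Clear the column in every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    # The right half of the augmented matrix is now the inverse.
    return [row[n:] for row in aug]


def multiply(left, right):
    """Plain matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*right)]
            for row in left]


# Illustrative 2 x 2 case: two keywords, two pieces of subject information.
second = [[0.5, 0.0],
          [0.0, 0.25]]
first = [[0.2],
         [0.1]]
third = multiply(invert(second), first)  # m x 1 third probability matrix
```

Each row of `third` is then read as the probability that the multimedia file belongs to the corresponding piece of subject information.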
(3-3): The server obtains, from the third probability matrix, the probability that the target multimedia file belongs to each piece of subject information.
Each row of data in the third probability matrix is the probability that the target multimedia file belongs to the corresponding piece of subject information, so the server can read these probabilities directly from the third probability matrix.
For example, if the third probability matrix is
$$\begin{bmatrix} P_1 \\ P_2 \\ P_3 \\ P_4 \end{bmatrix}$$
then P_1 is the probability that the target multimedia file belongs to subject information 1, P_2 is the probability that it belongs to subject information 2, P_3 is the probability that it belongs to subject information 3, and P_4 is the probability that it belongs to subject information 4.
(4): According to the probability that the target multimedia file belongs to each piece of subject information, the server selects, from the pieces of subject information, a preset number of pieces with the highest probabilities.
For ease of distinction, the preset number here is referred to as the first preset number. The first preset number can be set and changed as needed and is not specifically limited in the embodiments of the present disclosure; for example, it may be 1 or 2.
(5): The server forms the label information of the target multimedia file from the selected first preset number of pieces of subject information.
For example, if the selected subject information is comedy and love, the label information of the multimedia file is comedy and love.
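Steps (4) and (5) amount to a top-k selection over the rows of the third probability matrix. A minimal sketch, assuming the matrix is held as a nested list and that the subject names and probabilities below are purely illustrative:

```python
def select_labels(subjects, third_matrix, first_preset_number):
    """Return the subjects with the highest probabilities as the label information."""
    probabilities = [row[0] for row in third_matrix]
    # Sort (subject, probability) pairs from highest to lowest probability.
    ranked = sorted(zip(subjects, probabilities),
                    key=lambda pair: pair[1], reverse=True)
    return [subject for subject, _ in ranked[:first_preset_number]]


# Illustrative values; the first preset number is 2 here.
subjects = ["comedy", "action", "love", "horror"]
third_matrix = [[0.4], [0.1], [0.3], [0.2]]
labels = select_labels(subjects, third_matrix, 2)
```

With these values the two highest-probability subjects are comedy and love, which together form the label information.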
In the second implementation, this step may be as follows:
The server obtains the probability of each keyword in the caption information, selects from the keywords a second preset number of keywords with the highest probabilities, obtains the subject information to which each selected keyword belongs, and forms the label information of the target multimedia file from that subject information.
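The second implementation can be sketched as a top-k selection over keywords followed by a lookup. The keyword-to-subject mapping, the function name and the removal of repeated subjects are assumptions made for this illustration; the description only states that the subject information of the selected keywords forms the label information.

```python
def labels_from_top_keywords(caption_prob, keyword_subject, second_preset_number):
    """Pick the most probable keywords and collect the subjects they belong to."""
    # Keywords ordered from highest to lowest probability in the caption.
    ranked = sorted(caption_prob, key=caption_prob.get, reverse=True)
    labels = []
    for keyword in ranked[:second_preset_number]:
        subject = keyword_subject.get(keyword)
        # Assumption: skip unmapped keywords and avoid duplicate subjects.
        if subject is not None and subject not in labels:
            labels.append(subject)
    return labels


# Illustrative data.
caption_prob = {"laugh": 0.3, "kiss": 0.25, "joke": 0.2, "door": 0.05}
keyword_subject = {"laugh": "comedy", "joke": "comedy", "kiss": "love"}
labels = labels_from_top_keywords(caption_prob, keyword_subject, 3)
```

Here the three most probable keywords are "laugh", "kiss" and "joke", so the resulting label information is comedy and love.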
In step S204, the server sets the label information for the target multimedia file.
In the embodiments of the present disclosure, the server performs semantic analysis on the caption information of the target multimedia file, extracts the label information of the multimedia file, and sets the label information for the multimedia file. This improves both the efficiency and the accuracy of setting label information.
Fig. 3 is a block diagram of a device for setting label information according to an exemplary embodiment. Referring to Fig. 3, the device includes an acquisition module 301, a word segmentation module 302, an analysis module 303 and a setting module 304.
The acquisition module 301 is configured to obtain the caption information of a target multimedia file;
The word segmentation module 302 is configured to segment the caption information to obtain a first keyword set;
The analysis module 303 is configured to analyze each keyword in the first keyword set to obtain the label information of the target multimedia file;
The setting module 304 is configured to set the label information for the target multimedia file.
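The four modules can be pictured as stages of a simple pipeline. The sketch below is one possible arrangement only: the class name, the callable signatures and the in-memory store are hypothetical and are not part of the disclosed device.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class LabelSettingDevice:
    acquire: Callable[[str], str]              # module 301: file id -> caption text
    segment: Callable[[str], List[str]]        # module 302: caption -> first keyword set
    analyze: Callable[[List[str]], List[str]]  # module 303: keywords -> label information
    store: Dict[str, List[str]]                # module 304 records labels per file here

    def run(self, file_id: str) -> List[str]:
        caption = self.acquire(file_id)
        keywords = self.segment(caption)
        labels = self.analyze(keywords)
        self.store[file_id] = labels  # set the label information for the file
        return labels


# Toy stand-ins for the real modules.
device = LabelSettingDevice(
    acquire=lambda file_id: "a joke made her laugh",
    segment=lambda caption: caption.split(),
    analyze=lambda words: ["comedy"] if "joke" in words else [],
    store={},
)
result = device.run("video-1")
```

Keeping each module as a separate callable mirrors the point made in the description that the division into functional modules is only one possible arrangement.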
In one possible implementation, referring to Fig. 4, the analysis module 303 includes:
A first acquisition unit 3031, configured to obtain the probability of each keyword in the caption information;
A second acquisition unit 3032, configured to obtain the probability that each keyword belongs to each piece of subject information in a subject information library, the subject information library being configured to store multiple preset pieces of subject information;
A determination unit 3033, configured to determine the probability that the target multimedia file belongs to each piece of subject information, according to the probability of each keyword in the caption information and the probability that each keyword belongs to each piece of subject information;
A selection unit 3034, configured to select, from the pieces of subject information, a preset number of pieces with the highest probabilities, according to the probability that the target multimedia file belongs to each piece of subject information;
A first composition unit 3035, configured to form the label information of the target multimedia file from the selected preset number of pieces of subject information.
In one possible implementation, the determination unit 3033 is further configured to form a first probability matrix from the probability of each keyword in the caption information, form a second probability matrix from the probability that each keyword belongs to each piece of subject information, multiply the inverse matrix of the second probability matrix by the first probability matrix to obtain a third probability matrix, and obtain from the third probability matrix the probability that the target multimedia file belongs to each piece of subject information.
In one possible implementation, the second acquisition unit 3032 is further configured to, for each piece of subject information, obtain the predetermined keyword set corresponding to the subject information, and determine the probability that each keyword belongs to the subject information according to the probability of each keyword in the caption information, the predetermined keyword set, and the number of keywords contained in the predetermined keyword set.
In one possible implementation, the second acquisition unit 3032 is further configured to: if the predetermined keyword set contains the keyword, take the ratio of the keyword's probability in the caption information to the number of keywords contained in the predetermined keyword set as the probability that the keyword belongs to the subject information; and if the predetermined keyword set does not contain the keyword, determine that the probability that the keyword belongs to the subject information is zero.
In one possible implementation, referring to Fig. 5, the word segmentation module 302 includes:
A word segmentation unit 3021, configured to segment the caption information;
A second composition unit 3022, configured to form a second keyword set from the segmented words contained in the caption information;
A removal unit 3023, configured to remove keywords of a preset type from the second keyword set to obtain the first keyword set.
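The three units of the word segmentation module can be sketched as segment, collect, then filter. In this illustration a regular expression over English text stands in for a real word segmenter, and the "preset type" is modeled as a hypothetical set of words to drop; both are assumptions, not details from the disclosure.

```python
import re


def first_keyword_set(caption, removed_words):
    """Segment the caption, then drop keywords of the preset (removed) type."""
    # Unit 3021/3022: segment the caption and collect the second keyword set.
    second_set = re.findall(r"[a-z']+", caption.lower())
    # Unit 3023: remove the preset-type keywords to get the first keyword set.
    return [word for word in second_set if word not in removed_words]


# Illustrative input; the removed words play the role of the preset type.
removed = {"he", "a", "and", "she", "to"}
keywords = first_keyword_set("He told a joke, and she began to laugh.", removed)
```

With this input the first keyword set keeps "told", "joke", "began" and "laugh".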
In the embodiments of the present disclosure, the server performs semantic analysis on the caption information of the target multimedia file, extracts the label information of the multimedia file, and sets the label information for the multimedia file. This improves both the efficiency and the accuracy of setting label information.
All of the above optional technical solutions may be combined in any manner to form optional embodiments of the present disclosure, which are not described here one by one.
It should be noted that, when the device for setting label information provided by the above embodiment sets label information, the division into the functional modules described above is used only as an example. In practical applications, the functions may be assigned to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the device for setting label information provided by the above embodiment and the method embodiment for setting label information belong to the same concept; for the specific implementation process, see the method embodiment, which is not repeated here.
Fig. 6 is a block diagram of a device 600 for setting label information according to an exemplary embodiment. For example, the device 600 may be provided as a server. Referring to Fig. 6, the device 600 includes a processing component 622, which further includes one or more processors, and memory resources represented by a memory 632 for storing instructions executable by the processing component 622, such as application programs. The application programs stored in the memory 632 may include one or more modules, each corresponding to a set of instructions. In addition, the processing component 622 is configured to execute the instructions so as to perform the above method for setting label information.
The device 600 may also include a power supply component 626 configured to perform power management of the device 600, a wired or wireless network interface 650 configured to connect the device 600 to a network, and an input/output (I/O) interface 658. The device 600 can operate based on an operating system stored in the memory 632, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.
Other embodiments of the present disclosure will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses or adaptations of the present disclosure that follow its general principles and include common knowledge or conventional technical means in the art not disclosed herein. The specification and examples are to be regarded as illustrative only, with the true scope and spirit of the present disclosure indicated by the following claims.
It should be understood that the present disclosure is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.
Claims (9)
1. A method for setting label information, characterized in that the method comprises:
obtaining caption information of a target multimedia file;
segmenting the caption information, and merging synonyms or near-synonyms among the multiple keywords obtained after segmentation, to obtain a first keyword set;
obtaining the probability of each keyword in the first keyword set in the caption information, and obtaining the probability that each keyword belongs to each piece of subject information in a subject information library, the subject information library being used to store multiple preset pieces of subject information; forming a first probability matrix from the probability of each keyword in the caption information, and forming a second probability matrix from the probability that each keyword belongs to each piece of subject information; multiplying the inverse matrix of the second probability matrix by the first probability matrix to obtain a third probability matrix; obtaining, from the third probability matrix, the probability that the target multimedia file belongs to each piece of subject information; selecting, from the pieces of subject information, a preset number of pieces with the highest probabilities, according to the probability that the target multimedia file belongs to each piece of subject information; forming the label information of the target multimedia file from the selected preset number of pieces of subject information;
setting the label information for the target multimedia file.
2. The method according to claim 1, characterized in that obtaining the probability that each keyword belongs to each piece of subject information in the subject information library comprises:
for each piece of subject information, obtaining the predetermined keyword set corresponding to the subject information;
determining the probability that each keyword belongs to the subject information according to the probability of each keyword in the caption information, the predetermined keyword set, and the number of keywords contained in the predetermined keyword set.
3. The method according to claim 2, characterized in that determining the probability that each keyword belongs to the subject information according to the probability of each keyword in the caption information, the predetermined keyword set, and the number of keywords contained in the predetermined keyword set comprises:
if the predetermined keyword set contains the keyword, taking the ratio of the keyword's probability in the caption information to the number of keywords contained in the predetermined keyword set as the probability that the keyword belongs to the subject information;
if the predetermined keyword set does not contain the keyword, determining that the probability that the keyword belongs to the subject information is zero.
4. The method according to claim 1, characterized in that segmenting the caption information to obtain the first keyword set comprises:
segmenting the caption information, and forming a second keyword set from the segmented words contained in the caption information;
removing keywords of a preset type from the second keyword set to obtain the first keyword set.
5. A device for setting label information, characterized in that the device comprises:
an acquisition module, for obtaining caption information of a target multimedia file;
a word segmentation module, for segmenting the caption information, and merging synonyms or near-synonyms among the multiple keywords obtained after segmentation, to obtain a first keyword set;
an analysis module, for analyzing each keyword in the first keyword set to obtain the label information of the target multimedia file;
a setting module, for setting the label information for the target multimedia file;
wherein the analysis module comprises:
a first acquisition unit, for obtaining the probability of each keyword in the caption information;
a second acquisition unit, for obtaining the probability that each keyword belongs to each piece of subject information in a subject information library, the subject information library being used to store multiple preset pieces of subject information;
a determination unit, for determining the probability that the target multimedia file belongs to each piece of subject information, according to the probability of each keyword in the caption information and the probability that each keyword belongs to each piece of subject information;
a selection unit, for selecting, from the pieces of subject information, a preset number of pieces with the highest probabilities, according to the probability that the target multimedia file belongs to each piece of subject information;
a first composition unit, for forming the label information of the target multimedia file from the selected preset number of pieces of subject information;
wherein the determination unit is further used to form a first probability matrix from the probability of each keyword in the caption information, form a second probability matrix from the probability that each keyword belongs to each piece of subject information, multiply the inverse matrix of the second probability matrix by the first probability matrix to obtain a third probability matrix, and obtain from the third probability matrix the probability that the target multimedia file belongs to each piece of subject information.
6. The device according to claim 5, characterized in that the second acquisition unit is further used to, for each piece of subject information, obtain the predetermined keyword set corresponding to the subject information, and determine the probability that each keyword belongs to the subject information according to the probability of each keyword in the caption information, the predetermined keyword set, and the number of keywords contained in the predetermined keyword set.
7. The device according to claim 6, characterized in that
the second acquisition unit is further used to: if the predetermined keyword set contains the keyword, take the ratio of the keyword's probability in the caption information to the number of keywords contained in the predetermined keyword set as the probability that the keyword belongs to the subject information; and if the predetermined keyword set does not contain the keyword, determine that the probability that the keyword belongs to the subject information is zero.
8. The device according to claim 5, characterized in that the word segmentation module comprises:
a word segmentation unit, for segmenting the caption information;
a second composition unit, for forming a second keyword set from the segmented words contained in the caption information;
a removal unit, for removing keywords of a preset type from the second keyword set to obtain the first keyword set.
9. A device for setting label information, characterized by comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to:
obtain caption information of a target multimedia file;
segment the caption information, and merge synonyms or near-synonyms among the multiple keywords obtained after segmentation, to obtain a first keyword set;
obtain the probability of each keyword in the first keyword set in the caption information, and obtain the probability that each keyword belongs to each piece of subject information in a subject information library, the subject information library being used to store multiple preset pieces of subject information; form a first probability matrix from the probability of each keyword in the caption information, and form a second probability matrix from the probability that each keyword belongs to each piece of subject information; multiply the inverse matrix of the second probability matrix by the first probability matrix to obtain a third probability matrix; obtain, from the third probability matrix, the probability that the target multimedia file belongs to each piece of subject information; select, from the pieces of subject information, a preset number of pieces with the highest probabilities, according to the probability that the target multimedia file belongs to each piece of subject information; form the label information of the target multimedia file from the selected preset number of pieces of subject information;
set the label information for the target multimedia file.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611235463.1A CN106528894B (en) | 2016-12-28 | 2016-12-28 | The method and device of label information is set |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106528894A CN106528894A (en) | 2017-03-22 |
CN106528894B true CN106528894B (en) | 2019-11-15 |
Family
ID=58339089
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611235463.1A Active CN106528894B (en) | 2016-12-28 | 2016-12-28 | The method and device of label information is set |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106528894B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |