Detailed Description of the Embodiments
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort shall fall within the protection scope of the present invention.
In the embodiments of the present invention, a multimedia file may be a video file or an audio file. Specifically, a video file includes video data and text information, where the video data includes the finished video, video key frames, and the video subtitles or speech text. An audio file includes audio data and text information, where the audio data includes the finished audio and the audio text.
Fig. 1 is a flowchart of Embodiment 1 of the method for acquiring multimedia data according to the present invention. As shown in Fig. 1, the executing entity of this embodiment is a user terminal with storage space, and the method may be implemented in software on the user terminal. The method includes:
Step 101: receive a query request input by a user, the query request including multimedia information keywords.
In this embodiment, before the query request is received, multimedia information keywords related to a configuration rule input by the user may be generated according to that rule; the query request input by the user is then received and contains these keywords. For example, if the configuration rule input by the user before the query request is "Beijing housing prices", the multimedia information keywords generated from that rule may include "Beijing housing prices", "Five National Measures", "property tax", and so on. The query request then contains the multimedia information keywords "Beijing housing prices", "Five National Measures", and "property tax".
Step 102: retrieve the metadata in a database according to the multimedia information keywords, and determine the target identifier corresponding to the text information that matches the keywords, where the metadata includes the text information of multimedia files and the corresponding identifiers.
In this embodiment, the database is an HBase (Hadoop Database) database. HBase is suited to storing unstructured data; it is column-oriented rather than row-oriented, which makes reading and writing large volumes of data easier. A large amount of metadata is stored in the HBase database. The metadata includes the text information of multimedia files and the corresponding identifiers. The text information of a multimedia file includes the title, author, publication time, website, link address, region information, and abstract of the multimedia data, among others. The region information may be that of the website to which the multimedia data belongs, or that of a specific channel of that website.
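As a rough illustration of the metadata record described above, the following Python sketch models one record as text information plus its identifier and flattens it into column/value pairs. The class names, field names, and the `text:` column prefix are hypothetical choices made for this sketch; no HBase client is used.

```python
from dataclasses import dataclass, asdict


@dataclass
class TextInfo:
    """Text information of one multimedia file (field names are illustrative)."""
    title: str
    author: str
    publish_time: str   # e.g. "2013-03-01 10:00"
    website: str
    link: str           # link address of the web page
    region: str
    abstract: str


@dataclass
class Metadata:
    """One metadata record: text information plus its corresponding identifier."""
    identifier: int
    text: TextInfo


def to_row(meta: Metadata) -> dict:
    """Flatten a record into column -> value pairs (assumed layout, not HBase API)."""
    row = {"id": meta.identifier}
    row.update({f"text:{name}": value for name, value in asdict(meta.text).items()})
    return row
```

Keeping each field in its own column is one way to let later steps index or extend individual fields without rewriting the whole record.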
In this embodiment, the retrieval of the metadata in the database according to the multimedia information keywords is performed periodically. Because multimedia files on the Internet are constantly updated, the metadata in the database is constantly updated as well. Periodic retrieval finds the continuously updated text information that matches the keywords, which makes retrieval more timely and accurate. The retrieval period can be preset, for example to once every 10 minutes; this embodiment places no limit on the length of the retrieval period.
In this embodiment, before the metadata in the database is retrieved, indexes are created on the fields of the text information of each metadata record, such as the title, author, publication time, website, link address, region information, and abstract of the multimedia data, so that retrieval is faster.
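The text does not say how the index is built. One common choice is an inverted index, sketched here in Python under the assumption that fields are split on whitespace and that each keyword is a single token. The function names `build_index` and `search` are hypothetical.

```python
from collections import defaultdict


def build_index(records: dict[int, dict[str, str]]) -> dict[str, set[int]]:
    """Map each lower-cased token of every text field to the identifiers containing it."""
    index = defaultdict(set)
    for identifier, fields in records.items():
        for value in fields.values():
            for token in value.lower().split():
                index[token].add(identifier)
    return index


def search(index: dict[str, set[int]], keywords: list[str]) -> set[int]:
    """Return identifiers whose text information matches at least one keyword."""
    hits: set[int] = set()
    for keyword in keywords:
        hits |= index.get(keyword.lower(), set())
    return hits
```

With such an index, a keyword lookup touches only the identifiers listed under that token instead of scanning every metadata record.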
Step 103: output to the user the text information that matches the multimedia information keywords.
In this embodiment, the number of text information items output depends on the multimedia information keywords and on whether they concern a hot topic. When the keywords are few and concern a hot topic, more matching text information is obtained from the database. These items are collected and presented to the user as a list.
Step 104: receive the user's confirmation response to the text information.
Based on the output text information that matches the multimedia information keywords, the user can tell from the title, author, and abstract of each item what the corresponding multimedia data is about, and confirms one or more items of interest.
Step 105: obtain the multimedia data from a server according to the target identifier of the text information corresponding to the confirmation response.
In this embodiment, once the user's confirmation response to the text information has been received, the multimedia data can be obtained from the server according to the target identifier of the confirmed text information, so that the user can view it.
Step 106: output the multimedia data to the user.
In this embodiment, the multimedia data is output to the user. When the multimedia data is video data, the user can watch the video, view its key frames, and obtain the video subtitles or speech text. When the multimedia data is audio data, the user can listen to the audio, obtain the audio text, and so on.
In this embodiment, a query request input by a user is received, the query request including multimedia information keywords; the metadata in the database is retrieved according to the keywords, and the target identifier corresponding to the matching text information is determined, where the metadata includes the text information of multimedia files and the corresponding identifiers; the matching text information is output to the user; the user's confirmation response to the text information is received; the multimedia data is obtained from the server according to the target identifier of the confirmed text information; and the multimedia data is output to the user. The present invention can thus obtain multimedia data quickly and accurately on the basis of the text information of multimedia files, better meets users' personalized demands for multimedia data, and gives users a better experience.
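Steps 101 to 106 can be sketched end to end as follows. This is a minimal Python sketch, not the claimed implementation: plain dictionaries stand in for the database and the server, matching is a simple substring test, and the names `retrieve`, `acquire`, and `confirm` are hypothetical.

```python
def retrieve(database: dict[int, dict], keywords: list[str]) -> list[int]:
    """Step 102: return target identifiers whose text information matches any keyword."""
    matches = []
    for identifier, text_info in database.items():
        haystack = " ".join(text_info.values()).lower()
        if any(keyword.lower() in haystack for keyword in keywords):
            matches.append(identifier)
    return matches


def acquire(database: dict[int, dict], server: dict, keywords: list[str], confirm) -> list:
    """Steps 101-106, with dicts standing in for the database and the server.

    `confirm` plays the role of the user: it receives the list of matching
    text information and returns the identifiers the user confirms.
    """
    target_ids = retrieve(database, keywords)              # step 102
    listing = [(i, database[i]) for i in target_ids]       # step 103: list shown to user
    confirmed = confirm(listing)                           # step 104
    return [server[i] for i in confirmed if i in server]   # steps 105 and 106
```

A caller that confirms every listed item could pass `lambda listing: [i for i, _ in listing]` as `confirm`.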
Fig. 2 is a first flowchart of Embodiment 2 of the method for acquiring multimedia data according to the present invention. As shown in Fig. 2, the executing entity of this embodiment is a user terminal with storage space, and the method may be implemented in software on the user terminal. The method includes:
Step 201: collect multimedia files.
This embodiment is described taking the case in which all collected multimedia files are video files as an example.
In this embodiment, multimedia files are crawled from the Internet and stored on a local disk or a large server. For example, when the multimedia file is a video file, its text information, the finished video, the video key frames, and the video subtitles or speech text are stored. The text information of a video file includes the title, author, publication time, website, link address, region information, video abstract, and so on. If the publication time in the text information is empty, the time at which the text information was collected is used as the default publication time, so that the text information is complete. Video key frames are obtained with an automatic key-frame extraction technique, which extracts key frames from the finished video effectively; the extracted key frames are stored. The subtitles or speech text of a video are obtained by applying subtitle recognition or speech recognition to the finished video, converting the subtitles or speech into text; the converted subtitles or speech are stored in text form.
Step 202: extract the text information and the multimedia data of the multimedia file.
In this embodiment, because of network bandwidth limits, the text information of a multimedia file is usually collected first, followed by the finished video; key frames are then extracted from the finished video, and the video subtitles and speech text are recognized from it. The collected text information and multimedia data of the multimedia file are extracted and stored by category.
Step 203: filter the text information.
In this embodiment, a set of junk words is configured in advance for filtering out junk information. Text matching is used to judge whether the title and abstract of a text information item contain junk words. If a single text information item matches more than two different junk words, it is filtered out; the remaining text information is valid text information.
Step 204: perform a hash calculation on the web page link address of the text information and use the resulting hash value as the identifier corresponding to the text information; and perform a hash calculation on the web page link address in the association information of the multimedia data and use the resulting hash value as the identifier corresponding to the multimedia data.
In this embodiment, after the text information has been filtered, every text information item has a unique web page link address. The hash value of the web page link address is computed with the MurmurHash algorithm, and this hash value serves as the unique identifier of the text information.
This embodiment uses the MurmurHash algorithm to compute the hash value of the web page link address because MurmurHash is a non-cryptographic hash algorithm that offers performance advantages over traditional algorithms such as CRC32, MD5, and SHA-1, and has a relatively low collision rate.
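The text names MurmurHash but not a specific variant. As an illustration only, the Python sketch below implements the 32-bit MurmurHash3 (x86) variant in pure Python and uses it to derive an identifier from a web page link address. The choice of variant, the seed of 0, and the helper name `identifier_for` are assumptions of this sketch.

```python
MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """32-bit MurmurHash3 (x86 variant) of `data`, returned as an unsigned int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    nblocks = len(data) // 4

    # Body: mix each 4-byte little-endian block into the state.
    for i in range(nblocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32

    # Tail: mix the remaining 0-3 bytes.
    tail = data[4 * nblocks:]
    k = 0
    if len(tail) >= 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k

    # Finalization: force all bits of the state to avalanche.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def identifier_for(link: str) -> int:
    """Identifier derived from a web page link address (hypothetical helper)."""
    return murmur3_32(link.encode("utf-8"))
```

Because the identifier depends only on the link address, text information and multimedia data that carry the same address receive the same identifier, which is what allows the two to be associated later.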
In this embodiment, when the multimedia data is stored by category, the association information of each item of multimedia data is recorded. The association information includes the storage path of each extracted item of multimedia data, its file name, and the link address of its web page. The multimedia data includes the finished video, the video key frames, and the video subtitles or speech text. The hash value of the web page link address in the association information of the multimedia data is then computed with the MurmurHash algorithm, and this hash value serves as the identifier corresponding to the multimedia data.
If text information and multimedia data belong to the same multimedia file, their web page link addresses are identical, so the hash values computed from those addresses are identical as well. In this way the text information and the multimedia data of the same multimedia file are associated.
Step 205: deduplicate the text information.
Specifically, the deduplication of the text information can be divided into the following sub-steps, as shown in Fig. 3.
Step 205a: judge whether the identifier corresponding to the text information already exists in memory. If it does, execute step 205b; if it does not, execute step 205c.
In this embodiment, when the user terminal starts, it first reads the identifiers of text information recorded in a physical file, that is, it loads the identifiers of the text information whose upload was completed during the previous run of the user terminal, and places these identifiers in memory.
The identifiers of text information are stored in memory according to the publication time of the text information. The following description takes as an example the case in which memory holds the identifiers of text information published within the last three days. In memory, these identifiers are divided by publication time, in units of one hour, into 72 blocks, and the identifiers of expired text information are evicted periodically. Storing the identifiers in hourly blocks improves query speed.
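One possible in-memory layout for the hourly blocks is sketched below in Python. The class name, the use of Unix timestamps in seconds, and the dictionary-of-sets structure are assumptions made for this sketch; the text only specifies a three-day window split into 72 hourly blocks with periodic eviction.

```python
from collections import defaultdict


class RecentIdStore:
    """Identifiers of recently published text information, bucketed by publication hour."""

    def __init__(self, window_hours: int = 72):
        self.window_hours = window_hours
        self.buckets = defaultdict(set)   # hour index -> set of identifiers

    @staticmethod
    def _hour(timestamp: float) -> int:
        """Hour index of a Unix timestamp given in seconds."""
        return int(timestamp // 3600)

    def in_window(self, publish_ts: float, now_ts: float) -> bool:
        """True if the publication time falls within the retained window."""
        return 0 <= self._hour(now_ts) - self._hour(publish_ts) < self.window_hours

    def add(self, identifier: int, publish_ts: float) -> None:
        self.buckets[self._hour(publish_ts)].add(identifier)

    def contains(self, identifier: int, publish_ts: float) -> bool:
        """Only the single bucket for the publication hour is consulted."""
        return identifier in self.buckets.get(self._hour(publish_ts), set())

    def evict(self, now_ts: float) -> None:
        """Drop every bucket older than the retained window."""
        oldest_kept = self._hour(now_ts) - self.window_hours + 1
        for hour in [h for h in self.buckets if h < oldest_kept]:
            del self.buckets[hour]
```

Because a lookup inspects one hourly bucket instead of all retained identifiers, and eviction removes whole buckets at once, both operations stay cheap as the window fills.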
After the user terminal has started this time, it is judged whether the identifier corresponding to updated text information exists in memory.
In this embodiment, a Bloom filter is used when checking whether the identifier corresponding to updated text information exists in memory. The basic idea of the Bloom filter is to use a hash function to map an element to one position in an array of length m: if that position is 1, the element is taken to be in the set; otherwise it is not. The drawback of this approach is that collisions may occur when many elements are checked. The remedy is to use k hash functions that map the element to k positions: if all k positions are 1, the element is considered to be in the set (false positives remain possible); if any position is 0, the element is not in the set.
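A minimal Bloom filter along these lines is sketched below in Python. The array length m, the number of hash functions k, and the derivation of the k positions from one SHA-256 digest by double hashing are choices made for this sketch; the text does not specify them.

```python
import hashlib


class BloomFilter:
    """Bloom filter over an array of length m using k derived hash positions."""

    def __init__(self, m: int = 1 << 16, k: int = 4):
        self.m = m
        self.k = k
        self.bits = bytearray(m)   # one byte per position, for readability

    def _positions(self, item: str) -> list[int]:
        """Derive k positions from one digest (double hashing, an assumption here)."""
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1   # keep the step non-zero
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for position in self._positions(item):
            self.bits[position] = 1

    def __contains__(self, item: str) -> bool:
        # All k positions set: probably present. Any position clear: definitely absent.
        return all(self.bits[position] for position in self._positions(item))
```

A negative answer is always reliable, so a "not present" result can safely route an identifier to the next check, while a positive answer carries a small false-positive probability that shrinks as m grows relative to the number of stored elements.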
Step 205b: filter out the text information.
Step 205c: judge whether the publication time of the text information falls within a preset time. If it does, execute step 205d; if it does not, execute step 205e.
Step 205d: record the identifier corresponding to the text information in memory, and execute step 205f.
In this embodiment, recording the identifier corresponding to the text information in memory allows it to be used in subsequent in-memory deduplication checks of text information.
Step 205e: judge whether the identifier corresponding to the text information already exists in the database. If it does, execute step 205b; if it does not, execute step 205f.
Step 205f: store the text information and its corresponding identifier in the database in association with each other.
In this embodiment, when text information is deduplicated, before updated text information enters the database it is first judged whether its identifier already exists in memory. This is because multimedia files updated on the Internet are typically recently published ones, and the identifiers held in memory are likewise those of recently published text information. Since the HBase database holds a large amount of metadata, looking up text information directly in the HBase database would be time-consuming and would put query pressure on the database. Deduplicating in memory first, before updated text information enters the database, therefore effectively reduces the query pressure that deduplicating directly in the HBase database would cause.
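The decision flow of steps 205a to 205f can be sketched as a single Python function. A set stands in for the in-memory identifiers and a dictionary for the database; the function name, the parameter names, and the three-day default for the preset time are assumptions of this sketch.

```python
def deduplicate(identifier, publish_ts, now_ts, memory_ids: set, database: dict,
                text_info, window_seconds: float = 3 * 24 * 3600) -> bool:
    """Return True if the text information was stored, False if it was filtered out."""
    # Step 205a: is the identifier already in memory?
    if identifier in memory_ids:
        return False                           # step 205b: filter out

    # Step 205c: is the publication time within the preset time?
    if 0 <= now_ts - publish_ts <= window_seconds:
        memory_ids.add(identifier)             # step 205d: record in memory
    elif identifier in database:               # step 205e: fall back to the database
        return False                           # step 205b: filter out

    database[identifier] = text_info           # step 205f: store in association
    return True
```

Only text information that is both absent from memory and published outside the preset time triggers a database lookup, which is where the reduction in query pressure comes from.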
In steps 202 to 205 of this embodiment, the process of extracting the text information of the multimedia file, filtering it, computing its hash value, and deduplicating it, and the process of extracting the multimedia data and computing its hash value, may be carried out either simultaneously or one after the other; this embodiment imposes no restriction.
Step 206: store the text information and its corresponding identifier in the database in association with each other.
In this embodiment, after the text information has been deduplicated, the text information and its corresponding identifier are stored in association in the database, which is an HBase (Hadoop Database) database. Each text information item and its corresponding identifier form one metadata record.
Step 207: create indexes for the text information in the database.
In this embodiment, indexes are created for the text information in the database, for example on the title, author, publication time, website, link address, region information, and abstract of the multimedia data in the text information of each metadata record. Once the indexes have been created, subsequent retrieval by multimedia information keywords is faster and more efficient.
Step 208: judge whether the database contains an identifier identical to the identifier corresponding to the multimedia data. If it does, execute step 209; if it does not, execute step 210.
Step 209: store the multimedia data and its corresponding identifier in the server in association with each other, store the identifier corresponding to the multimedia data and its storage address in the server in the database in association with each other, and execute step 211.
In this embodiment, the server refers to a server such as Tomcat. When multimedia data is stored in the server, the storage address of each item of multimedia data in the server is recorded. The identifier corresponding to the multimedia data and its storage address in the server are stored in the database, so that when the user views multimedia data, the multimedia data can be obtained from the server according to the identifier and storage address held in the database.
Step 210: back up the identifier corresponding to the multimedia data, and execute step 208.
In this embodiment, after the identifier corresponding to the multimedia data has been backed up, and once the database has been checked for an identifier identical to that of the preceding item of multimedia data, it is judged again whether the database contains an identifier identical to the identifier corresponding to this multimedia data.
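Steps 208 to 210 can be sketched in Python as follows. Dictionaries stand in for the database and the server, a list stands in for the backed-up identifiers, and the `/media/<identifier>` address scheme is purely hypothetical.

```python
def store_multimedia(identifier: int, payload: bytes, database: dict,
                     server: dict, backlog: list) -> bool:
    """Return True if the multimedia data was stored, False if it was deferred."""
    # Step 208: does the database hold text information with the same identifier?
    if identifier not in database:
        backlog.append(identifier)            # step 210: back up for a later re-check
        return False

    address = f"/media/{identifier}"          # hypothetical storage address scheme
    server[address] = payload                 # step 209: store on the server
    database[identifier]["storage_address"] = address   # record address in database
    return True
```

Identifiers left in `backlog` would be retried once the matching text information has reached the database, which corresponds to returning to step 208.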
Step 211: judge whether the multimedia data is video subtitles or speech text. If it is, execute step 212; if it is not, execute step 213.
Step 212: add a video subtitle or speech text field to the text information with the corresponding identifier, and create an index for the video subtitle or speech text field.
In this embodiment, because a video subtitle or speech text field is added to the text information with the corresponding identifier and an index is created for that field, the user can search the text information by keywords that occur in the speech or subtitles.
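A minimal Python sketch of step 212 follows, assuming the database is a dictionary of text information records and the index is an inverted index from tokens to identifier sets. The function name and the field name `transcript` are hypothetical.

```python
def add_transcript(database: dict, index: dict, identifier: int, transcript: str) -> None:
    """Attach subtitle or speech text to existing text information and index it."""
    # Add the new field to the text information with the corresponding identifier.
    database[identifier]["transcript"] = transcript
    # Index every whitespace-delimited token of the new field.
    for token in transcript.lower().split():
        index.setdefault(token, set()).add(identifier)
```

After this call, a keyword that appears only in the subtitles or speech of a video still leads to that video's text information.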
Step 213 is identical to step 101 of Embodiment 1 of the method for acquiring multimedia data according to the present invention, and is not described again here.
Step 214: retrieve the metadata in the database according to the multimedia information keywords, and determine the target identifier corresponding to the text information that matches the keywords, where the metadata includes the text information of multimedia files and the corresponding identifiers.
Further, because a video subtitle or speech text field has been added to the text information and an index has been created for that field, the user can retrieve the metadata in the database by keywords from the video subtitles or speech text in the multimedia information, and can thereby retrieve more accurate and more comprehensive text information.
Step 215: output to the user the text information that matches the multimedia information keywords.
Further, after the matching text information has been output to the user, statistics along multiple dimensions such as time, website, and region are offered. The dimension selected by the user is received, the text information is counted along that dimension, and the statistical result is output, so that the user can monitor and analyze the multimedia data according to the result.
For example, suppose 10,000 text information items matching "Beijing housing prices", "Five National Measures", and "property tax" are output to the user, and the user chooses to count these 10,000 items along the time dimension. A curve of the number of videos published in each time period can then be output to the user, enabling the user to better analyze how multimedia data about "Beijing housing prices", "Five National Measures", and "property tax" is distributed on the Internet.
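Counting along a user-selected dimension can be sketched in Python with `collections.Counter`. The function name `count_by`, the field names, and the timestamp format are assumptions of this sketch.

```python
from collections import Counter


def count_by(text_infos: list[dict], dimension: str, key=lambda value: value) -> Counter:
    """Count text information items along one dimension (e.g. time, website, region).

    `key` optionally coarsens the raw field value, for example by
    truncating a timestamp to its date.
    """
    return Counter(key(info[dimension]) for info in text_infos)
```

For instance, `count_by(items, "publish_time", key=lambda ts: ts[:10])` groups timestamps of the form "2013-03-01 10:00" by day, which yields the per-period counts needed to draw a publication curve.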
Steps 216 to 218 are identical to steps 104 to 106 of Embodiment 1 of the method for acquiring multimedia data according to the present invention, and are not described again here.
In this embodiment, the hash values of the web page link addresses of the text information and of the multimedia data are computed separately, which associates the text information and the multimedia data of each multimedia file, so that the multimedia data can be obtained from the retrieved text information. Video subtitle and speech text fields are added to the text information, which provides more accurate and more comprehensive search results. Before the text information and its corresponding identifier are stored in the database, the text information is deduplicated in memory, which effectively reduces the query pressure that deduplicating directly in the HBase database would cause. The storage address of the multimedia data in the server is stored in the database, so the multimedia data the user is interested in can be obtained quickly according to the target identifier of the output text information and the storage address in the server, giving the user a better experience.
Fig. 4 is a schematic structural diagram of Embodiment 1 of the apparatus for acquiring multimedia data according to the present invention. As shown in Fig. 4, the apparatus includes a receiving module 401, a retrieval module 402, a determining module 403, an output module 404, and an obtaining module 405. The receiving module 401 is configured to receive a query request input by a user, the query request including multimedia information keywords. The retrieval module 402 is configured to retrieve the metadata in a database according to the multimedia information keywords. The determining module 403 is configured to determine the target identifier corresponding to the text information that matches the multimedia information keywords, where the metadata includes the text information of multimedia files and the corresponding identifiers. The output module 404 is configured to output to the user the text information that matches the multimedia information keywords. The receiving module 401 is further configured to receive the user's confirmation response to the text information. The obtaining module 405 is configured to obtain multimedia data from a server according to the target identifier of the text information corresponding to the confirmation response. The output module 404 is further configured to output the multimedia data to the user.
The apparatus of this embodiment can execute the technical solution of the method embodiment shown in Fig. 1. Its implementation principle and technical effect are similar and are not described again here.
Fig. 5 is a schematic structural diagram of Embodiment 2 of the apparatus for acquiring multimedia data according to the present invention. As shown in Fig. 5, the apparatus includes a receiving module 501, a retrieval module 502, a determining module 503, an output module 504, an obtaining module 505, a collection module 506, an extraction module 507, a computing module 508, a storage module 509, an adding module 510, and a judging module 511.
The receiving module 501 is configured to receive a query request input by a user, the query request including multimedia information keywords. The retrieval module 502 is configured to retrieve the metadata in a database according to the multimedia information keywords. The determining module 503 is configured to determine the target identifier corresponding to the text information that matches the multimedia information keywords, where the metadata includes the text information of multimedia files and the corresponding identifiers. The output module 504 is configured to output to the user the text information that matches the multimedia information keywords. The receiving module 501 is further configured to receive the user's confirmation response to the text information. The obtaining module 505 is configured to obtain multimedia data from a server according to the target identifier of the text information corresponding to the confirmation response. The output module 504 is further configured to output the multimedia data to the user.
Further, the collection module 506 is configured to collect multimedia files before the receiving module receives the query request input by the user, the query request including multimedia information keywords.
The extraction module 507 is configured to extract the text information of the multimedia file.
The computing module 508 is configured to perform a hash calculation on the web page link address of the text information and to use the resulting hash value as the identifier corresponding to the text information.
The storage module 509 is configured to store the text information and its corresponding identifier in the database in association with each other.
Further, the extraction module 507 is also configured to extract the multimedia data of the multimedia file after the collection module has collected the multimedia file.
The computing module 508 is also configured to perform a hash calculation on the web page link address in the association information of the multimedia data and to use the resulting hash value as the identifier corresponding to the multimedia data.
The storage module 509 is also configured to store the multimedia data and its corresponding identifier in the server in association with each other if the database contains an identifier identical to the identifier corresponding to the multimedia data.
The storage module 509 is also configured to store the identifier corresponding to the multimedia data and its storage address in the server in the database in association with each other.
Preferably, the adding module 510 is configured to add a video subtitle or speech text field to the text information with the corresponding identifier if the multimedia data is video subtitles or speech text, after the multimedia data and its corresponding identifier have been stored in the server in association with each other on the condition that the database contains an identifier identical to the identifier corresponding to the multimedia data.
Further, the judging module 511 is configured to judge, before the storage module stores the text information and its corresponding identifier in the database in association with each other, whether the identifier corresponding to the text information already exists in memory.
The judging module 511 is also configured to judge whether the identifier corresponding to the text information already exists in the database if it does not exist in memory.
The storage module 509 is also configured to store the text information and its corresponding identifier in the database in association with each other if the identifier corresponding to the text information does not exist in the database.
The apparatus of this embodiment can execute the technical solutions of the method embodiments shown in Fig. 2 and Fig. 3. Its implementation principle and technical effect are similar and are not described again here.
A person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.