Summary of the invention
The present invention solves the technical problem of providing a media search method and apparatus based on speech recognition, enabling users to search media content more accurately by voice.
To this end, the invention provides a media search method based on speech recognition, the method including the following steps:
Obtain a content index and metadata information of media;
Associate the content index with the metadata information to establish a media knowledge base;
Parse a collected user voice query to obtain a corresponding speech recognition text;
Search the media knowledge base according to the speech recognition text.
Wherein, obtaining the content index of media specifically includes:
Transcoding the received media into a unified coded format;
Marking the start and end points of programs in the transcoded media to obtain an index of the program layer;
Cutting each program in the program layer into fragments to obtain an index of the slice layer;
Performing speech recognition and subtitle recognition on each fragment in the slice layer to obtain an index of the character layer.
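As an illustrative sketch only (not part of the claimed method), the three-layer index described above can be modeled as nested records, with the character layer holding time-stamped words inside slices inside programs. All class and field names below are assumptions introduced for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class CharacterIndex:
    """Character layer: one recognized word and its time point."""
    word: str
    time_point: float  # seconds from the start of the media

@dataclass
class SliceIndex:
    """Slice layer: one shot, with its recognized words."""
    start: float
    end: float
    characters: list = field(default_factory=list)

@dataclass
class ProgramIndex:
    """Program layer: start/end of one program in the media."""
    program_id: str
    start: float
    end: float
    slices: list = field(default_factory=list)

# Toy index: one program containing one shot with two recognized words.
prog = ProgramIndex("news_0415", start=0.0, end=1800.0)
shot = SliceIndex(start=12.5, end=18.0)
shot.characters.append(CharacterIndex("weather", 13.1))
shot.characters.append(CharacterIndex("forecast", 13.6))
prog.slices.append(shot)

print(len(prog.slices), prog.slices[0].characters[0].word)  # → 1 weather
```

A search hit at the character layer can then be traced upward to its slice and program, which is what makes time-accurate media retrieval possible.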
Wherein, performing speech recognition and subtitle recognition on each fragment in the slice layer to obtain the index of the character layer specifically includes:
Obtaining the recognition paths of the speech recognition and the speech recognition text corresponding to each path;
Obtaining the recognition paths of the subtitle recognition and the subtitle recognition text corresponding to each path;
Merging the speech recognition text and the subtitle recognition text to obtain the index of the character layer.
Wherein, the metadata information includes, but is not limited to, the director, characters, subject matter, genre, region and language of the media.
Wherein, parsing the collected user voice query to obtain the corresponding speech recognition text specifically includes:
Receiving an audio signal of the user voice query;
Segmenting the decoded audio signal;
Performing speech recognition on each audio segment to obtain a segment recognition text;
Merging the segment recognition texts of all audio segments to obtain the speech recognition text.
Wherein, searching the media knowledge base according to the speech recognition text specifically includes:
Extracting metadata information present in the speech recognition text according to a preset metadata dictionary;
Performing a metadata search in the media knowledge base according to the extracted metadata information;
Extracting keyword information present in the speech recognition text according to a preset keyword database;
Performing a keyword search in the media knowledge base according to the keyword information;
Merging the result of the metadata search with the result of the keyword search to obtain the complete search result.
In addition, the present invention also proposes a media search apparatus based on speech recognition, including an acquisition module, an association module, a parsing module and a search module:
The acquisition module is configured to obtain the content index and metadata information of media;
The association module is configured to associate the content index and metadata information obtained by the acquisition module to establish a media knowledge base;
The parsing module is configured to parse the collected user voice query to obtain the corresponding speech recognition text;
The search module is configured to search the media knowledge base according to the speech recognition text.
Wherein, the acquisition module includes a transcoding unit, an indexing unit, a cutting unit and a recognition unit:
The transcoding unit is configured to transcode the received media into a unified coded format;
The indexing unit is configured to mark the start and end points of programs in the transcoded media to obtain the index of the program layer;
The cutting unit is configured to cut the programs in the media into fragments to obtain the index of the slice layer;
The recognition unit is configured to perform speech recognition and subtitle recognition on the fragments in the programs to obtain the index of the character layer.
Wherein, the parsing module includes a receiving unit, a decoding unit, a segmenting unit, a recognition unit and a merging unit:
The receiving unit is configured to receive the audio signal of the user voice query;
The decoding unit is configured to decode the audio signal;
The segmenting unit is configured to segment the decoded audio signal;
The recognition unit is configured to perform speech recognition on each audio segment to obtain a segment recognition text;
The merging unit is configured to merge the segment recognition texts of all audio segments to obtain the speech recognition text.
Wherein, the search module includes a first extraction unit, a first search unit, a second extraction unit, a second search unit and a merging unit:
The first extraction unit is configured to extract metadata information present in the speech recognition text according to the preset metadata dictionary;
The first search unit is configured to perform a metadata search in the media knowledge base according to the extracted metadata information;
The second extraction unit is configured to extract keyword information present in the speech recognition text according to the preset keyword database;
The second search unit is configured to perform a keyword search in the media knowledge base according to the keyword information;
The merging unit is configured to merge the metadata search result of the first search unit with the keyword search result of the second search unit to obtain the complete search result.
By using the media search method and apparatus based on speech recognition disclosed in the present invention, voice interaction is adopted at the front end to provide users with a more convenient mode of interaction, while at the back end the media content is recognized and a corresponding knowledge base is built, so that users can ultimately search media content by voice. Compared with traditional search methods, this method provides voice interaction on the client side, making interaction more natural and convenient, and performs content-based recognition and natural-language search on the server side, making the user's search of media content more accurate.
Embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Embodiment one of the present invention proposes a media search method based on speech recognition. As shown in Figure 1, the method includes the following steps:
Step 101: obtain the content index and metadata information of media;
Step 102: associate the content index with the metadata information to establish a media knowledge base;
Step 103: parse the collected user voice query to obtain the corresponding speech recognition text;
Step 104: search the media knowledge base according to the speech recognition text.
Wherein, obtaining the content index of media specifically includes:
Transcoding the received media into a unified coded format;
Marking the start and end points of programs in the transcoded media to obtain the index of the program layer;
Cutting each program in the program layer into fragments to obtain the index of the slice layer;
Performing speech recognition and subtitle recognition on each fragment in the slice layer to obtain the index of the character layer.
In this embodiment, as shown in Figure 2, content processing is performed on the media obtained from different signal sources to obtain an index of the media content. The specific steps include:
Transcode the media obtained from different signal sources into a unified format. Media data can be collected in several ways: broadcast television signals can be captured with a broadcast television acquisition card, videos on the network can be captured with a web crawler, or media can be obtained directly from storage media. The collected digital video files of various formats are transcoded into a defined, unified format using ffmpeg or other video transcoding software. For example, the transcoded video files are in AVI format and the transcoded audio files are in WAV format, and the transcoded media files are stored in a temporary storage area of the computer.
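The transcoding step can be sketched as building two ffmpeg invocations per source file: one producing the unified AVI video and one producing a mono 16 kHz WAV audio track. The target sample rate, channel count, and paths below are illustrative assumptions, not values fixed by the embodiment.

```python
import shlex

def transcode_cmds(src, stem):
    """Build ffmpeg commands that normalize one source file to AVI + WAV."""
    # Unified video container (codec choices left to ffmpeg defaults here).
    video = f"ffmpeg -y -i {shlex.quote(src)} {shlex.quote(stem + '.avi')}"
    # Audio only (-vn), downmixed to mono at 16 kHz, a common ASR input format.
    audio = (f"ffmpeg -y -i {shlex.quote(src)} -vn -ac 1 -ar 16000 "
             f"{shlex.quote(stem + '.wav')}")
    return video, audio

v, a = transcode_cmds("capture.ts", "/tmp/media/program_001")
print(v)
print(a)
```

In practice these commands would be executed per collected file (e.g. via `subprocess.run`), with the outputs written to the temporary storage area mentioned above.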
For media containing multiple programs, the start and end points of each program are marked to obtain the index of the program layer. The start and end points of programs can be marked either manually or automatically by computer. Automatic marking by computer includes the following steps:
Collect the media files of all programs to be marked, one file per program;
Extract the fingerprint features of the media file content and save them as corresponding templates;
Match the media file to be marked against the templates. When a part of the media file matches a template, the matching fragment gives the start and end times, within the media file, of the program corresponding to that template.
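As a minimal sketch of the template-matching step (the "fingerprint" here is a stand-in numeric sequence, not any particular fingerprinting algorithm), a template is slid across the media's fingerprint sequence and the first matching window yields the program's start and end positions:

```python
def locate_program(media_fp, template_fp, tol=0.0):
    """Return (start, end) indices of the first window matching the template."""
    n, m = len(media_fp), len(template_fp)
    for start in range(n - m + 1):
        window = media_fp[start:start + m]
        # A window matches when every feature is within tolerance of the template.
        if all(abs(a - b) <= tol for a, b in zip(window, template_fp)):
            return start, start + m  # frame indices map to start/end times
    return None  # program not present in this media file

media = [0, 0, 3, 1, 4, 1, 5, 9, 0, 0]
template = [4, 1, 5]
print(locate_program(media, template))  # → (4, 7)
```

Real fingerprint matching would use robust audio/video features and approximate matching, but the control flow, comparing a per-program template against sliding windows of the media, follows the steps above.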
For each program, shot segmentation is performed to obtain the index of the slice layer. A shot is the sequence of successive image frames recorded from the moment the camera is turned on to the moment it is turned off; it is the minimal physical unit of video. Within a shot, the features of adjacent and nearby video frames are similar and change little, but at a shot transition the features of the video frames usually change markedly. The steps of shot segmentation are as follows:
Select a feature to describe each frame image; preferably, the color RGB-space histogram of each frame is extracted as the feature of that frame.
Compute the frame difference, i.e. the difference between the RGB-space histograms of consecutive frames; preferably, the Euclidean distance is used as the measure.
Apply a selection strategy that analyzes these differences to determine the shot boundaries; preferably, a sliding-window detection method is used. The index of the slice layer consists of the start and end time points of each shot.
Wherein, performing speech recognition and subtitle recognition on each fragment in the slice layer to obtain the index of the character layer specifically includes:
Step 301: obtain the recognition paths of the speech recognition and the speech recognition text corresponding to each path;
Step 302: obtain the recognition paths of the subtitle recognition and the subtitle recognition text corresponding to each path;
Step 303: merge the speech recognition text and the subtitle recognition text to obtain the index of the character layer.
In this embodiment, for video fragments containing speech or subtitles, speech recognition and subtitle recognition are performed respectively, and for fragments containing both speech and subtitles, the speech recognition results and subtitle recognition results are merged to obtain the index of the character layer. Subtitles and speech are important clues for describing video media content. The specific steps include:
Using automatic continuous speech recognition, obtain the M best recognition paths of the speech recognition and the recognition result corresponding to each path;
Using subtitle recognition, obtain the M best recognition paths of the subtitle recognition and the recognition result corresponding to each path;
Merge the M best speech recognition paths and the M best subtitle recognition paths into a candidate result graph;
For each candidate word set in the candidate result graph, select the highest-scoring word according to a voting score rule as the word for the corresponding node, finally obtaining the fused recognition result. The recognition result, together with the time point at which each word occurs, is saved as the index of the character layer.
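The voting-based fusion can be sketched with a toy example. For simplicity this sketch assumes the M-best paths are already aligned word by word (equal length); a real system would align them on a candidate result graph or confusion network as described above.

```python
from collections import Counter

def fuse_paths(asr_paths, ocr_paths):
    """Pick the highest-voted word at each position across all candidate paths."""
    all_paths = asr_paths + ocr_paths
    fused = []
    for pos in range(len(all_paths[0])):
        # Each path casts one vote for its word at this position.
        votes = Counter(path[pos] for path in all_paths)
        fused.append(votes.most_common(1)[0][0])
    return fused

# Two ASR candidate paths and two subtitle (OCR) candidate paths (M = 2).
asr = [["the", "whether", "today"], ["the", "weather", "today"]]
ocr = [["the", "weather", "today"], ["tho", "weather", "togay"]]
print(fuse_paths(asr, ocr))  # → ['the', 'weather', 'today']
```

The example shows why fusion helps: an ASR homophone error ("whether") and OCR character errors ("tho", "togay") are each outvoted by the other modality's correct hypotheses.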
Wherein, the metadata information includes, but is not limited to, the director, characters, subject matter, genre, region and language of the media.
Wherein, parsing the collected user voice query to obtain the corresponding speech recognition text specifically includes:
Step 401: receive the audio signal of the user voice query;
Step 402: segment the decoded audio signal;
Step 403: perform speech recognition on each audio segment to obtain a segment recognition text;
Step 404: merge the segment recognition texts of all audio segments to obtain the speech recognition text.
In this embodiment, the user's voice query about media is collected. The voice query is recorded by a client-side recording module and, after compression and encoding, transmitted over the network to the server for processing.
Speech recognition is performed on the user's voice query to obtain its text result. The specific steps include: receive the audio signal from the client and decode it; preferably, the audio is decoded to PCM format. Perform endpoint detection on the decoded audio signal based on silence, so that the continuous audio signal is cut into several segments. Send each audio segment separately to a distributed continuous speech recognition engine for parallel speech recognition. Collect the speech recognition result fragments from all parallel processes and splice them into the complete speech recognition result.
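The silence-based segmentation step can be sketched as cutting the decoded PCM stream wherever the short-time energy stays below a threshold. Frame size and threshold below are illustrative assumptions; the recognition and splicing of the resulting segments would then proceed in parallel as described above.

```python
def segment_on_silence(samples, frame=4, thresh=0.1):
    """Split a list of PCM samples into voiced segments separated by silence."""
    segments, current = [], []
    for i in range(0, len(samples), frame):
        chunk = samples[i:i + frame]
        # Short-time energy of this frame.
        energy = sum(s * s for s in chunk) / len(chunk)
        if energy > thresh:
            current.extend(chunk)       # voiced: keep accumulating
        elif current:
            segments.append(current)    # silence ends a voiced segment
            current = []
    if current:
        segments.append(current)
    return segments

# Two bursts of speech separated by silence.
audio = [0.0] * 8 + [0.9, -0.8, 0.7, -0.9] + [0.0] * 8 + [0.6, -0.7, 0.8, -0.6]
segs = segment_on_silence(audio)
print(len(segs))  # → 2
```

Each returned segment would be dispatched to one recognition worker, and the segment texts spliced back together in order to form the full speech recognition result.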
Wherein, searching the media knowledge base according to the speech recognition text specifically includes:
Extracting metadata information present in the speech recognition text according to the preset metadata dictionary;
Performing a metadata search in the media knowledge base according to the extracted metadata information;
Extracting keyword information present in the speech recognition text according to the preset keyword database;
Performing a keyword search in the media knowledge base according to the keyword information;
Merging the result of the metadata search with the result of the keyword search to obtain the complete search result.
In this embodiment of the present invention, semantic understanding is performed on the text result of speech recognition, a search command against the media knowledge base is triggered, and the search results are returned to the user. The text result of speech recognition serves as the query text; semantic understanding of this text means extracting key, meaningful words from it to serve as query terms for retrieval. This step provides two methods of extracting query terms: one extracts query terms based on metadata, the other based on entities and concepts. Triggering the search command against the media knowledge base and returning the search results to the user specifically includes:
Extract the metadata information in the text result of speech recognition based on a predefined metadata dictionary and user query grammar rules:
Mark the metadata in the user's new query sentence using the collected metadata information of film and television media;
Match the marked user query sentence against the pre-collected user query grammar rules to obtain the most suitable metadata markup;
Extend the metadata information to obtain expanded metadata information. The extension mainly adds synonyms, related terms and the like according to a knowledge graph.
Extract keyword information such as entities and concepts from the text result of speech recognition. Machine learning methods are used to learn keyword information such as entities and concepts from a massive corpus; this learned information is then used to extract entity and concept keywords from the text result of speech recognition.
Extend the keyword information to obtain expanded keyword information. The extension mainly adds synonyms, related terms and the like according to the knowledge graph.
Perform a metadata search in the media knowledge base using the metadata information to obtain the metadata-based search result.
Perform a keyword search in the media knowledge base using the keyword information to obtain the keyword-based search result.
Merge the metadata-based search result with the keyword-based search result to obtain the final search result, and return it to the user.
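The two-pronged search and merge can be sketched as follows: a metadata match over structured fields, a keyword match over the content text, and a union of the two result lists. The tiny in-memory "knowledge base" and its field names are assumptions made purely for illustration.

```python
kb = [
    {"id": 1, "director": "Zhang", "type": "comedy", "text": "wedding banquet scene"},
    {"id": 2, "director": "Li", "type": "drama", "text": "courtroom speech"},
    {"id": 3, "director": "Zhang", "type": "drama", "text": "wedding speech"},
]

def metadata_search(kb, meta):
    """Entries whose structured fields match every extracted metadata value."""
    return [e["id"] for e in kb if all(e.get(k) == v for k, v in meta.items())]

def keyword_search(kb, keywords):
    """Entries whose content text contains any extracted keyword."""
    return [e["id"] for e in kb if any(k in e["text"] for k in keywords)]

def search(kb, meta, keywords):
    # Merge the two result lists into one, preserving first-seen order.
    merged = metadata_search(kb, meta) + keyword_search(kb, keywords)
    return list(dict.fromkeys(merged))

print(search(kb, {"director": "Zhang"}, ["speech"]))  # → [1, 3, 2]
```

A production system would rank the merged results rather than simply concatenate them, and the knowledge-graph expansion described above would widen `meta` and `keywords` with synonyms and related terms before these lookups run.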
In addition, embodiment two of the present invention proposes a media search apparatus based on speech recognition. As shown in Figure 3, the apparatus includes an acquisition module 1, an association module 2, a parsing module 3 and a search module 4:
The acquisition module 1 is configured to obtain the content index and metadata information of media;
The association module 2 is configured to associate the content index and metadata information obtained by the acquisition module to establish a media knowledge base;
The parsing module 3 is configured to parse the collected user voice query to obtain the corresponding speech recognition text;
The search module 4 is configured to search the media knowledge base according to the speech recognition text.
Wherein, the acquisition module includes a transcoding unit, an indexing unit, a cutting unit and a recognition unit:
The transcoding unit is configured to transcode the received media into a unified coded format;
The indexing unit is configured to mark the start and end points of programs in the transcoded media to obtain the index of the program layer;
The cutting unit is configured to cut the programs in the media into fragments to obtain the index of the slice layer;
The recognition unit is configured to perform speech recognition and subtitle recognition on the fragments in the programs to obtain the index of the character layer.
Wherein, the parsing module includes a receiving unit, a decoding unit, a segmenting unit, a recognition unit and a merging unit:
The receiving unit is configured to receive the audio signal of the user voice query;
The decoding unit is configured to decode the audio signal;
The segmenting unit is configured to segment the decoded audio signal;
The recognition unit is configured to perform speech recognition on each audio segment to obtain a segment recognition text;
The merging unit is configured to merge the segment recognition texts of all audio segments to obtain the speech recognition text.
Wherein, the search module includes a first extraction unit, a first search unit, a second extraction unit, a second search unit and a merging unit:
The first extraction unit is configured to extract metadata information present in the speech recognition text according to the preset metadata dictionary;
The first search unit is configured to perform a metadata search in the media knowledge base according to the extracted metadata information;
The second extraction unit is configured to extract keyword information present in the speech recognition text according to the preset keyword database;
The second search unit is configured to perform a keyword search in the media knowledge base according to the keyword information;
The merging unit is configured to merge the metadata search result of the first search unit with the keyword search result of the second search unit to obtain the complete search result.
By using the media search method and apparatus based on speech recognition disclosed in the present invention, voice interaction is adopted at the front end to provide users with a more convenient mode of interaction, while at the back end the media content is recognized and a corresponding knowledge base is built, so that users can ultimately search media content by voice. Compared with traditional search methods, this method provides voice interaction on the client side, making interaction more natural and convenient, and performs content-based recognition and natural-language search on the server side, making the user's search of media content more accurate.
The above embodiments merely illustrate, rather than limit, the technical solution of the present invention, which has been described in detail with reference only to preferred embodiments. Those of ordinary skill in the art should understand that the technical solution of the present invention may be modified or equivalently substituted without departing from its spirit and scope, and all such modifications shall be covered by the scope of the claims of the present invention.