Summary of the invention
The present invention solves the technical problem of providing a media search method and apparatus based on speech recognition, enabling users to search media content more accurately by voice.
To this end, the invention provides a media search method based on speech recognition, the method including the following steps:
Obtain a content index and metadata information of media;
Associate the content index with the metadata information to establish a media knowledge base;
Parse a collected user voice query to obtain a corresponding speech recognition text;
Search the media knowledge base according to the speech recognition text.
Wherein, obtaining the content index of media specifically includes:
Transcoding the received media into a unified coded format;
Marking the start and end points of programs in the transcoded media to obtain an index of the program layer;
Cutting each program in the program layer into fragments to obtain an index of the slice layer;
Performing speech recognition and subtitle recognition on each fragment in the slice layer to obtain an index of the character layer.
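As an illustrative sketch only (not part of the claimed method), the three-layer index described above can be modeled as nested records, with the character layer holding time-stamped words inside slices inside programs. All class and field names below are assumptions introduced for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class CharacterIndex:
    """Character layer: one recognized word and its time point."""
    word: str
    time_point: float  # seconds from the start of the media

@dataclass
class SliceIndex:
    """Slice layer: one shot, with its recognized words."""
    start: float
    end: float
    characters: list = field(default_factory=list)

@dataclass
class ProgramIndex:
    """Program layer: start/end of one program in the media."""
    program_id: str
    start: float
    end: float
    slices: list = field(default_factory=list)

# Toy index: one program containing one shot with two recognized words.
prog = ProgramIndex("news_0415", start=0.0, end=1800.0)
shot = SliceIndex(start=12.5, end=18.0)
shot.characters.append(CharacterIndex("weather", 13.1))
shot.characters.append(CharacterIndex("forecast", 13.6))
prog.slices.append(shot)

print(len(prog.slices), prog.slices[0].characters[0].word)  # → 1 weather
```

A search hit at the character layer can then be traced upward to its slice and program, which is what makes time-accurate media retrieval possible.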
Wherein, performing speech recognition and subtitle recognition on each fragment in the slice layer to obtain the index of the character layer specifically includes:
Obtaining the recognition paths of the speech recognition and the speech recognition text corresponding to each path;
Obtaining the recognition paths of the subtitle recognition and the subtitle recognition text corresponding to each path;
Merging the speech recognition text and the subtitle recognition text to obtain the index of the character layer.
Wherein, the metadata information includes, but is not limited to, the director, characters, subject matter, genre, region and language of the media.
Wherein, parsing the collected user voice query to obtain the corresponding speech recognition text specifically includes:
Receiving an audio signal of the user voice query;
Segmenting the decoded audio signal;
Performing speech recognition on each audio segment to obtain a segment recognition text;
Merging the segment recognition texts of all audio segments to obtain the speech recognition text.
Wherein, searching the media knowledge base according to the speech recognition text specifically includes:
Extracting metadata information present in the speech recognition text according to a preset metadata dictionary;
Performing a metadata search in the media knowledge base according to the extracted metadata information;
Extracting keyword information present in the speech recognition text according to a preset keyword database;
Performing a keyword search in the media knowledge base according to the keyword information;
Merging the result of the metadata search with the result of the keyword search to obtain the complete search result.
In addition, the present invention also proposes a media search apparatus based on speech recognition, including an acquisition module, an association module, a parsing module and a search module:
The acquisition module is configured to obtain the content index and metadata information of media;
The association module is configured to associate the content index and metadata information obtained by the acquisition module to establish a media knowledge base;
The parsing module is configured to parse the collected user voice query to obtain the corresponding speech recognition text;
The search module is configured to search the media knowledge base according to the speech recognition text.
Wherein, the acquisition module includes a transcoding unit, an indexing unit, a cutting unit and a recognition unit:
The transcoding unit is configured to transcode the received media into a unified coded format;
The indexing unit is configured to mark the start and end points of programs in the transcoded media to obtain the index of the program layer;
The cutting unit is configured to cut the programs in the media into fragments to obtain the index of the slice layer;
The recognition unit is configured to perform speech recognition and subtitle recognition on the fragments in the programs to obtain the index of the character layer.
Wherein, the parsing module includes a receiving unit, a decoding unit, a segmenting unit, a recognition unit and a merging unit:
The receiving unit is configured to receive the audio signal of the user voice query;
The decoding unit is configured to decode the audio signal;
The segmenting unit is configured to segment the decoded audio signal;
The recognition unit is configured to perform speech recognition on each audio segment to obtain a segment recognition text;
The merging unit is configured to merge the segment recognition texts of all audio segments to obtain the speech recognition text.
Wherein, the search module includes a first extraction unit, a first search unit, a second extraction unit, a second search unit and a merging unit:
The first extraction unit is configured to extract metadata information present in the speech recognition text according to the preset metadata dictionary;
The first search unit is configured to perform a metadata search in the media knowledge base according to the extracted metadata information;
The second extraction unit is configured to extract keyword information present in the speech recognition text according to the preset keyword database;
The second search unit is configured to perform a keyword search in the media knowledge base according to the keyword information;
The merging unit is configured to merge the metadata search result of the first search unit with the keyword search result of the second search unit to obtain the complete search result.
By using the media search method and apparatus based on speech recognition disclosed in the present invention, voice interaction is adopted at the front end to provide users with a more convenient mode of interaction, while at the back end the media content is recognized and a corresponding knowledge base is built, so that users can ultimately search media content by voice. Compared with traditional search methods, this method provides voice interaction on the client side, making interaction more natural and convenient, and performs content-based recognition and natural-language search on the server side, making the user's search of media content more accurate.
Embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Embodiment one of the present invention proposes a media search method based on speech recognition. As shown in Figure 1, the method includes the following steps:
Step 101: obtain the content index and metadata information of media;
Step 102: associate the content index with the metadata information to establish a media knowledge base;
Step 103: parse the collected user voice query to obtain the corresponding speech recognition text;
Step 104: search the media knowledge base according to the speech recognition text.
Wherein, obtaining the content index of media specifically includes:
Transcoding the received media into a unified coded format;
Marking the start and end points of programs in the transcoded media to obtain the index of the program layer;
Cutting each program in the program layer into fragments to obtain the index of the slice layer;
Performing speech recognition and subtitle recognition on each fragment in the slice layer to obtain the index of the character layer.
In this embodiment, as shown in Figure 2, content processing is performed on the media obtained from different signal sources to obtain an index of the media content. The specific steps include:
Transcode the media obtained from different signal sources into a unified format. Media data can be collected in several ways: broadcast television signals can be captured with a broadcast television acquisition card, videos on the network can be captured with a web crawler, or media can be obtained directly from storage media. The collected digital video files of various formats are transcoded into a defined, unified format using ffmpeg or other video transcoding software. For example, the transcoded video files are in AVI format and the transcoded audio files are in WAV format, and the transcoded media files are stored in a temporary storage area of the computer.
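The transcoding step can be sketched as building two ffmpeg invocations per source file: one producing the unified AVI video and one producing a mono 16 kHz WAV audio track. The target sample rate, channel count, and paths below are illustrative assumptions, not values fixed by the embodiment.

```python
import shlex

def transcode_cmds(src, stem):
    """Build ffmpeg commands that normalize one source file to AVI + WAV."""
    # Unified video container (codec choices left to ffmpeg defaults here).
    video = f"ffmpeg -y -i {shlex.quote(src)} {shlex.quote(stem + '.avi')}"
    # Audio only (-vn), downmixed to mono at 16 kHz, a common ASR input format.
    audio = (f"ffmpeg -y -i {shlex.quote(src)} -vn -ac 1 -ar 16000 "
             f"{shlex.quote(stem + '.wav')}")
    return video, audio

v, a = transcode_cmds("capture.ts", "/tmp/media/program_001")
print(v)
print(a)
```

In practice these commands would be executed per collected file (e.g. via `subprocess.run`), with the outputs written to the temporary storage area mentioned above.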
For media containing multiple programs, the start and end points of each program are marked to obtain the index of the program layer. The start and end points of programs can be marked either manually or automatically by computer. Automatic marking by computer includes the following steps:
Collect the media files of all programs to be marked, one file per program;
Extract the fingerprint features of the media file content and save them as corresponding templates;
Match the media file to be marked against the templates. When a part of the media file matches a template, the matching fragment gives the start and end times, within the media file, of the program corresponding to that template.
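As a minimal sketch of the template-matching step (the "fingerprint" here is a stand-in numeric sequence, not any particular fingerprinting algorithm), a template is slid across the media's fingerprint sequence and the first matching window yields the program's start and end positions:

```python
def locate_program(media_fp, template_fp, tol=0.0):
    """Return (start, end) indices of the first window matching the template."""
    n, m = len(media_fp), len(template_fp)
    for start in range(n - m + 1):
        window = media_fp[start:start + m]
        # A window matches when every feature is within tolerance of the template.
        if all(abs(a - b) <= tol for a, b in zip(window, template_fp)):
            return start, start + m  # frame indices map to start/end times
    return None  # program not present in this media file

media = [0, 0, 3, 1, 4, 1, 5, 9, 0, 0]
template = [4, 1, 5]
print(locate_program(media, template))  # → (4, 7)
```

Real fingerprint matching would use robust audio/video features and approximate matching, but the control flow, comparing a per-program template against sliding windows of the media, follows the steps above.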
For each program, shot segmentation is performed to obtain the index of the slice layer. A shot is the sequence of successive image frames recorded from the moment the camera is turned on to the moment it is turned off; it is the minimal physical unit of video. Within a shot, the features of adjacent and nearby video frames are similar and change little, but at a shot transition the features of the video frames usually change markedly. The steps of shot segmentation are as follows:
Select a feature to describe each frame image; preferably, the color RGB-space histogram of each frame is extracted as the feature of that frame.
Compute the frame difference, i.e. the difference between the RGB-space histograms of consecutive frames; preferably, the Euclidean distance is used as the measure.
Apply a selection strategy that analyzes these differences to determine the shot boundaries; preferably, a sliding-window detection method is used. The index of the slice layer consists of the start and end time points of each shot.
Wherein, performing speech recognition and subtitle recognition on each fragment in the slice layer to obtain the index of the character layer specifically includes:
Step 301: obtain the recognition paths of the speech recognition and the speech recognition text corresponding to each path;
Step 302: obtain the recognition paths of the subtitle recognition and the subtitle recognition text corresponding to each path;
Step 303: merge the speech recognition text and the subtitle recognition text to obtain the index of the character layer.
In this embodiment, for video fragments containing speech or subtitles, speech recognition and subtitle recognition are performed respectively, and for fragments containing both speech and subtitles, the speech recognition results and subtitle recognition results are merged to obtain the index of the character layer. Subtitles and speech are important clues for describing video media content. The specific steps include:
Using automatic continuous speech recognition, obtain the M best recognition paths of the speech recognition and the recognition result corresponding to each path;
Using subtitle recognition, obtain the M best recognition paths of the subtitle recognition and the recognition result corresponding to each path;
Merge the M best speech recognition paths and the M best subtitle recognition paths into a candidate result graph;
For each candidate word set in the candidate result graph, select the highest-scoring word according to a voting score rule as the word for the corresponding node, finally obtaining the fused recognition result. The recognition result, together with the time point at which each word occurs, is saved as the index of the character layer.
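The voting-based fusion can be sketched with a toy example. For simplicity this sketch assumes the M-best paths are already aligned word by word (equal length); a real system would align them on a candidate result graph or confusion network as described above.

```python
from collections import Counter

def fuse_paths(asr_paths, ocr_paths):
    """Pick the highest-voted word at each position across all candidate paths."""
    all_paths = asr_paths + ocr_paths
    fused = []
    for pos in range(len(all_paths[0])):
        # Each path casts one vote for its word at this position.
        votes = Counter(path[pos] for path in all_paths)
        fused.append(votes.most_common(1)[0][0])
    return fused

# Two ASR candidate paths and two subtitle (OCR) candidate paths (M = 2).
asr = [["the", "whether", "today"], ["the", "weather", "today"]]
ocr = [["the", "weather", "today"], ["tho", "weather", "togay"]]
print(fuse_paths(asr, ocr))  # → ['the', 'weather', 'today']
```

The example shows why fusion helps: an ASR homophone error ("whether") and OCR character errors ("tho", "togay") are each outvoted by the other modality's correct hypotheses.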
Wherein, the metadata information includes, but is not limited to, the director, characters, subject matter, genre, region and language of the media.
Wherein, parsing the collected user voice query to obtain the corresponding speech recognition text specifically includes:
Step 401: receive the audio signal of the user voice query;
Step 402: segment the decoded audio signal;
Step 403: perform speech recognition on each audio segment to obtain a segment recognition text;
Step 404: merge the segment recognition texts of all audio segments to obtain the speech recognition text.
In this embodiment, the user's voice query about media is collected. The voice query is recorded by a client-side recording module and, after compression and encoding, transmitted over the network to the server for processing.
Speech recognition is performed on the user's voice query to obtain its text result. The specific steps include: receive the audio signal from the client and decode it; preferably, the audio is decoded to PCM format. Perform endpoint detection on the decoded audio signal based on silence, so that the continuous audio signal is cut into several segments. Send each audio segment separately to a distributed continuous speech recognition engine for parallel speech recognition. Collect the speech recognition result fragments from all parallel processes and splice them into the complete speech recognition result.
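The silence-based segmentation step can be sketched as cutting the decoded PCM stream wherever the short-time energy stays below a threshold. Frame size and threshold below are illustrative assumptions; the recognition and splicing of the resulting segments would then proceed in parallel as described above.

```python
def segment_on_silence(samples, frame=4, thresh=0.1):
    """Split a list of PCM samples into voiced segments separated by silence."""
    segments, current = [], []
    for i in range(0, len(samples), frame):
        chunk = samples[i:i + frame]
        # Short-time energy of this frame.
        energy = sum(s * s for s in chunk) / len(chunk)
        if energy > thresh:
            current.extend(chunk)       # voiced: keep accumulating
        elif current:
            segments.append(current)    # silence ends a voiced segment
            current = []
    if current:
        segments.append(current)
    return segments

# Two bursts of speech separated by silence.
audio = [0.0] * 8 + [0.9, -0.8, 0.7, -0.9] + [0.0] * 8 + [0.6, -0.7, 0.8, -0.6]
segs = segment_on_silence(audio)
print(len(segs))  # → 2
```

Each returned segment would be dispatched to one recognition worker, and the segment texts spliced back together in order to form the full speech recognition result.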
Wherein, searching the media knowledge base according to the speech recognition text specifically includes:
Extracting metadata information present in the speech recognition text according to the preset metadata dictionary;
Performing a metadata search in the media knowledge base according to the extracted metadata information;
Extracting keyword information present in the speech recognition text according to the preset keyword database;
Performing a keyword search in the media knowledge base according to the keyword information;
Merging the result of the metadata search with the result of the keyword search to obtain the complete search result.
In this embodiment of the present invention, semantic understanding is performed on the text result of speech recognition, a search command against the media knowledge base is triggered, and the search results are returned to the user. The text result of speech recognition serves as the query text; semantic understanding of this text means extracting key, meaningful words from it to serve as query terms for retrieval. This step provides two methods of extracting query terms: one extracts query terms based on metadata, the other based on entities and concepts. Triggering the search command against the media knowledge base and returning the search results to the user specifically includes:
Extract the metadata information in the text result of speech recognition based on a predefined metadata dictionary and user query grammar rules:
Mark the metadata in the user's new query sentence using the collected metadata information of film and television media;
Match the marked user query sentence against the pre-collected user query grammar rules to obtain the most suitable metadata markup;
Extend the metadata information to obtain expanded metadata information. The extension mainly adds synonyms, related terms and the like according to a knowledge graph.
Extract keyword information such as entities and concepts from the text result of speech recognition. Machine learning methods are used to learn keyword information such as entities and concepts from a massive corpus; this learned information is then used to extract entity and concept keywords from the text result of speech recognition.
Extend the keyword information to obtain expanded keyword information. The extension mainly adds synonyms, related terms and the like according to the knowledge graph.
Perform a metadata search in the media knowledge base using the metadata information to obtain the metadata-based search result.
Perform a keyword search in the media knowledge base using the keyword information to obtain the keyword-based search result.
Merge the metadata-based search result with the keyword-based search result to obtain the final search result, and return it to the user.
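The two-pronged search and merge can be sketched as follows: a metadata match over structured fields, a keyword match over the content text, and a union of the two result lists. The tiny in-memory "knowledge base" and its field names are assumptions made purely for illustration.

```python
kb = [
    {"id": 1, "director": "Zhang", "type": "comedy", "text": "wedding banquet scene"},
    {"id": 2, "director": "Li", "type": "drama", "text": "courtroom speech"},
    {"id": 3, "director": "Zhang", "type": "drama", "text": "wedding speech"},
]

def metadata_search(kb, meta):
    """Entries whose structured fields match every extracted metadata value."""
    return [e["id"] for e in kb if all(e.get(k) == v for k, v in meta.items())]

def keyword_search(kb, keywords):
    """Entries whose content text contains any extracted keyword."""
    return [e["id"] for e in kb if any(k in e["text"] for k in keywords)]

def search(kb, meta, keywords):
    # Merge the two result lists into one, preserving first-seen order.
    merged = metadata_search(kb, meta) + keyword_search(kb, keywords)
    return list(dict.fromkeys(merged))

print(search(kb, {"director": "Zhang"}, ["speech"]))  # → [1, 3, 2]
```

A production system would rank the merged results rather than simply concatenate them, and the knowledge-graph expansion described above would widen `meta` and `keywords` with synonyms and related terms before these lookups run.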
In addition, embodiment two of the present invention proposes a media search apparatus based on speech recognition. As shown in Figure 3, the apparatus includes an acquisition module 1, an association module 2, a parsing module 3 and a search module 4:
The acquisition module 1 is configured to obtain the content index and metadata information of media;
The association module 2 is configured to associate the content index and metadata information obtained by the acquisition module to establish a media knowledge base;
The parsing module 3 is configured to parse the collected user voice query to obtain the corresponding speech recognition text;
The search module 4 is configured to search the media knowledge base according to the speech recognition text.
Wherein, the acquisition module includes a transcoding unit, an indexing unit, a cutting unit and a recognition unit:
The transcoding unit is configured to transcode the received media into a unified coded format;
The indexing unit is configured to mark the start and end points of programs in the transcoded media to obtain the index of the program layer;
The cutting unit is configured to cut the programs in the media into fragments to obtain the index of the slice layer;
The recognition unit is configured to perform speech recognition and subtitle recognition on the fragments in the programs to obtain the index of the character layer.
Wherein, the parsing module includes a receiving unit, a decoding unit, a segmenting unit, a recognition unit and a merging unit:
The receiving unit is configured to receive the audio signal of the user voice query;
The decoding unit is configured to decode the audio signal;
The segmenting unit is configured to segment the decoded audio signal;
The recognition unit is configured to perform speech recognition on each audio segment to obtain a segment recognition text;
The merging unit is configured to merge the segment recognition texts of all audio segments to obtain the speech recognition text.
Wherein, the search module includes a first extraction unit, a first search unit, a second extraction unit, a second search unit and a merging unit:
The first extraction unit is configured to extract metadata information present in the speech recognition text according to the preset metadata dictionary;
The first search unit is configured to perform a metadata search in the media knowledge base according to the extracted metadata information;
The second extraction unit is configured to extract keyword information present in the speech recognition text according to the preset keyword database;
The second search unit is configured to perform a keyword search in the media knowledge base according to the keyword information;
The merging unit is configured to merge the metadata search result of the first search unit with the keyword search result of the second search unit to obtain the complete search result.
By using the media search method and apparatus based on speech recognition disclosed in the present invention, voice interaction is adopted at the front end to provide users with a more convenient mode of interaction, while at the back end the media content is recognized and a corresponding knowledge base is built, so that users can ultimately search media content by voice. Compared with traditional search methods, this method provides voice interaction on the client side, making interaction more natural and convenient, and performs content-based recognition and natural-language search on the server side, making the user's search of media content more accurate.
The above embodiments merely illustrate, rather than limit, the technical solution of the present invention, which has been described in detail with reference only to preferred embodiments. Those of ordinary skill in the art should understand that the technical solution of the present invention may be modified or equivalently substituted without departing from its spirit and scope, and all such modifications shall be covered by the scope of the claims of the present invention.