CN101202888A

CN101202888A - Method and device for converting recognized speech into video

Info

Publication number: CN101202888A
Application number: CNA2006101610005A
Authority: CN
Inventors: 王东; 郑罡; 张嵩
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2006-12-11
Filing date: 2006-12-11
Publication date: 2008-06-18

Abstract

The invention discloses a method for converting recognized speech into video, comprising the following steps. Step 1: at startup, a media server establishes identification codes corresponding to the types of its video resources. Step 2: after receiving a request from an application server, the media server sets up a connection channel for an audio stream and receives the audio stream. Step 3: a speech recognition module of the media server recognizes the audio data and outputs the recognized data to a conversion processing program. Step 4: the conversion processing program receives and converts the recognized data, then compares the converted data with the identification codes established in step 1, thereby realizing the video conversion. Step 5: the media server outputs the converted video stream to terminal equipment over the network. The invention further provides a device for converting recognized speech into video. With the invention, conversion from speech recognition to video can be realized.

Description

Method and device for converting recognized speech into video

Technical field

The present invention relates to media servers and their applications and to the field of speech recognition, and in particular to a method and a device that use a media server to convert recognized speech into video.

Background art

The next generation network is a service-driven network. The media server is a standalone device that provides specialized media resource functions, and it is also an important device in the packet network. Its position in the system is shown in Fig. 1, a schematic diagram of the composition of a service-driven network. Under the control of the application server, the media server provides the media resource functions required by the various services in the softswitch, including playback, recording, dual-tone multi-frequency (DTMF) digit collection, fax, conferencing, text-to-speech (TTS) synthesis and automatic speech recognition (ASR). It also provides functions such as uploading and deleting voice data. Fig. 2 shows the composition of the media server.

With the development of science and technology, users' demand for multimedia in everyday life is becoming ever broader: they want not only text and sound but also a visual experience. This makes the development of new services particularly important. Among these, conversion from speech recognition to video is a problem worth researching and developing, yet related technical results are currently rare.

Summary of the invention

The present invention has been made in view of the above problem. To this end, it provides a mechanism that uses a media server to convert recognized speech into video, realizing sound-to-video conversion and thereby meeting users' needs.

The main idea of the invention is to take the output of the media server's speech recognition function as the input to a conversion processing program of the media server. The conversion processing program converts this input and outputs video, and the media server sends the output video stream to the terminal. In other words, on the basis of existing hardware resources, by making full use of those resources and adding a modest investment in software, services can be expanded according to the needs of network development.

First, according to one embodiment of the present invention, a method for converting recognized speech into video is provided.

The method comprises the following steps. First step: at startup, the media server establishes corresponding identification codes according to the types of its video resources. Second step: after receiving a request from an application server, the media server sets up a connection channel for an audio stream and receives the audio stream. Third step: the speech recognition module of the media server recognizes the audio data and outputs the recognized data to a conversion processing program. Fourth step: after receiving the recognized data, the conversion processing program converts it and compares the converted data with the identification codes established in the first step, thereby realizing the video conversion. Fifth step: the media server outputs the converted video stream to terminal equipment over the network.

In the first step, if a video resource is of a newly added type, the media server provides an interface through which the identification code corresponding to that video resource can be added in real time. In the second step, after receiving the audio stream, the media server notifies the speech recognition module to start processing. In the third step, after outputting the recognized data, the speech recognition module notifies the conversion processing program to start processing.

In the fourth step, after receiving the notification to start processing, the conversion processing program reads the data output by the speech recognition module into its own buffer. After comparing the converted data with the identification codes established in the first step, the conversion processing program adds the resulting video index to the buffer.

The conversion processing program then sorts the video index and, once sorting is complete, notifies the media server to start sending. After receiving this notification, the media server locates the video resources according to the video index and starts sending them.
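The sorting and lookup just described can be illustrated with a minimal Python sketch. The text does not specify the sort criterion or the form of a video index entry, so the `IndexEntry` type, the `position` sort key, the codes `V001`/`V002` and the file names are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from typing import NamedTuple


class IndexEntry(NamedTuple):
    code: str       # identification code matched by the conversion processing program
    position: int   # hypothetical sort key: word offset in the recognized speech


def sort_video_index(entries: list[IndexEntry]) -> list[IndexEntry]:
    """Order the video index; here simply by where each match occurred."""
    return sorted(entries, key=lambda entry: entry.position)


def find_video_resources(index: list[IndexEntry], code_to_path: dict[str, str]) -> list[str]:
    """Media-server side: resolve each index entry to a stored video resource."""
    return [code_to_path[e.code] for e in index if e.code in code_to_path]


# Hypothetical identification codes and resource file names.
code_to_path = {"V001": "birthday.3gp", "V002": "cake.3gp"}
video_index = [IndexEntry("V002", 7), IndexEntry("V001", 2)]

ordered = sort_video_index(video_index)                # sorting done: "notify" the server
to_send = find_video_resources(ordered, code_to_path)  # server locates the resources
print(to_send)
```

Keeping the sort and the lookup as separate functions mirrors the division of work in the text: the conversion processing program orders the index, and the media server resolves it to resources only after being notified.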

In addition, according to another embodiment of the present invention, a device for converting recognized speech into video is provided.

The device comprises: an identification code establishing module, for establishing corresponding identification codes according to the types of video resources when the media server starts; an audio stream receiving module, connected to the identification code establishing module, for setting up a connection channel for an audio stream and receiving the audio stream after the media server receives a request from an application server; a speech recognition module, connected to the audio stream receiving module, for recognizing the audio data and outputting the recognized data to a conversion processing module; the conversion processing module, connected to the speech recognition module and to the identification code establishing module, for converting the data received from the speech recognition module and comparing the converted data with the identification codes established by the identification code establishing module, thereby realizing the video conversion; and a video stream output module, connected to the conversion processing module, for outputting the converted video stream to terminal equipment over the network.

If a video resource is of a newly added type, the identification code establishing module provides an interface through which the identification code corresponding to that video resource can be added in real time. After receiving the audio stream, the audio stream receiving module notifies the speech recognition module to start processing.

After outputting the recognized data, the speech recognition module notifies the conversion processing module to start processing. After receiving this notification, the conversion processing module reads the data output by the speech recognition module into its own buffer. After comparing the converted data with the identification codes established by the identification code establishing module, the conversion processing module adds the resulting video index to the buffer.

The conversion processing module then sorts the video index and, once sorting is complete, notifies the video stream output module to start sending. After receiving this notification, the video stream output module locates the video resources according to the video index and starts sending them.

With the above technical solution, the present invention can convert recognized speech into video by means of a media server.

Description of drawings

The accompanying drawings described here are provided for a further understanding of the present invention and form part of this application. The illustrative embodiments of the invention and their description serve to explain the invention and do not unduly limit it. In the drawings:

Fig. 1 is a schematic diagram of the composition of a service-driven network;

Fig. 2 is a schematic diagram of the composition of a media server according to an embodiment of the invention;

Fig. 3 is a flow chart of the speech-to-video conversion method according to the first embodiment of the invention;

Fig. 4 is a schematic diagram of the speech-to-video conversion method according to an embodiment of the invention; and

Fig. 5 is a block diagram of the speech-to-video conversion device according to an embodiment of the invention.

Detailed description of the embodiments

The present invention is described in detail below with reference to the accompanying drawings.

First embodiment

First, the first embodiment of the present invention is described with reference to Fig. 3 and Fig. 4. Fig. 3 is a flow chart of the speech-to-video conversion method according to the first embodiment of the invention, and Fig. 4 is a schematic diagram of the speech-to-video conversion method according to an embodiment of the invention.

As shown in Fig. 3, the method for converting recognized speech into video according to the first embodiment comprises the following steps. Step S302: at startup, the media server establishes corresponding identification codes according to the types of its video resources. Step S304: after receiving a request from an application server, the media server sets up a connection channel for an audio stream and receives the audio stream. Step S306: the speech recognition module of the media server recognizes the audio data and outputs the recognized data to a conversion processing program. Step S308: after receiving the recognized data, the conversion processing program converts it and compares the converted data with the identification codes established in step S302, thereby realizing the video conversion. Step S310: the media server outputs the converted video stream to terminal equipment over the network.

In step S302, if a video resource is of a newly added type, the media server provides an interface through which the identification code corresponding to that video resource can be added in real time. In step S304, after receiving the audio stream, the media server notifies the speech recognition module to start processing. In step S306, after outputting the recognized data, the speech recognition module notifies the conversion processing program to start processing.
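The startup registration and the real-time interface described for step S302 could be sketched as follows. This is only an illustration under stated assumptions: the text does not define the format of an identification code, so the `"<type>:<file stem>"` scheme, the class name `IdentificationCodeRegistry` and the resource paths are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class IdentificationCodeRegistry:
    """Maps an identification code to the video resource it stands for."""

    codes: dict[str, str] = field(default_factory=dict)

    def build_at_startup(self, resources_by_type: dict[str, list[str]]) -> None:
        # Step S302: walk every known resource type once when the server starts.
        for resource_type, paths in resources_by_type.items():
            for path in paths:
                self.add(resource_type, path)

    def add(self, resource_type: str, path: str) -> str:
        # Interface for adding a code while the server is running.
        # Hypothetical code format: "<type>:<file stem>", e.g. "birthday:cake".
        stem = path.rsplit("/", 1)[-1].rsplit(".", 1)[0]
        code = f"{resource_type}:{stem}"
        self.codes[code] = path
        return code


registry = IdentificationCodeRegistry()
registry.build_at_startup({"birthday": ["/media/birthday/cake.3gp"]})

# A newly added resource type is registered later without a restart.
new_code = registry.add("festival", "/media/festival/lantern.3gp")
```

Routing the startup scan through the same `add` method as the runtime interface means that codes created at startup and codes added later are built in exactly the same way.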

In step S308, after receiving the notification to start processing, the conversion processing program reads the data output by the speech recognition module into its own buffer. Also in step S308, after comparing the converted data with the identification codes established in step S302, the conversion processing program adds the resulting video index to the buffer.
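A minimal sketch of the buffering and comparison in step S308 is given below. The text does not say what the "conversion" of recognized data consists of or how a match is decided, so the sketch assumes the recognized data is plain text, that conversion means lowercasing and splitting it into words, and that identification codes are looked up by keyword. The names `ConversionProgram` and `KEYWORD_TO_CODE` and the codes themselves are hypothetical.

```python
from collections import deque

# Hypothetical identification codes established at startup: keyword -> code.
KEYWORD_TO_CODE = {"birthday": "V001", "cake": "V002", "flowers": "V003"}


class ConversionProgram:
    def __init__(self, keyword_to_code):
        self.keyword_to_code = keyword_to_code
        self.buffer = deque()    # recognized data read from the recognition module
        self.video_index = []    # identification codes matched so far

    def on_notified(self, recognized_text):
        # Notification from the speech recognition module: read its output
        # into this program's own buffer.
        self.buffer.append(recognized_text)

    def convert(self):
        while self.buffer:
            text = self.buffer.popleft()
            # Assumed conversion: lowercase, split into words, strip punctuation.
            for word in text.lower().split():
                code = self.keyword_to_code.get(word.strip(".,!?"))
                if code is not None and code not in self.video_index:
                    self.video_index.append(code)
        return self.video_index


program = ConversionProgram(KEYWORD_TO_CODE)
program.on_notified("Happy birthday, Mom!")
index = program.convert()
```

In this sketch the video index is simply a list of matched codes; an actual implementation might store richer entries such as timestamps or resource types.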

For example, with reference to Fig. 3 and Fig. 4, the media server first establishes corresponding identification codes according to the video resource types when it starts, as follows:

(1) When the media server starts, it searches for resources of this type and establishes corresponding identification codes for them;

(2) If a resource of this kind is newly added, the media server provides an interface through which the identification code corresponding to the resource can be added in real time;

Second, after receiving a request from the application server, the media server sets up a connection channel for the audio stream, receives the voice stream and performs speech recognition:

(3) The application server sends a request to the media server, and the media server sets up a connection channel for the audio stream according to the request;

(4) After the channel is set up, the media server receives the audio stream and notifies the speech recognition module to start processing;

Then, the speech recognition module of the media server recognizes the audio data and outputs the recognized data to the conversion processing program:

(5) After receiving the instruction, the speech recognition module of the media server starts processing the audio data;

(6) The speech recognition module outputs the processed data to the conversion processing program and notifies it to start processing;

Next, the conversion program of the media server converts the recognized data after receiving it and compares the converted data with the identification codes, realizing the video conversion:

(7) On being notified, the conversion program of the media server reads the data into its own buffer;

(8) It matches the data to the corresponding identification codes and adds the resulting video index to the buffer, sorting it with an optimization algorithm at the same time; once the index is in order, it gives notification to start sending;

(9) On being notified, the media server locates the video resources according to the video index produced by the conversion program and starts sending;

Finally, the video stream is output to the terminal equipment over the network.

An example follows. Xiao Wang has just bought a 3G mobile phone and has subscribed to the speech-to-video service, whose service type he can configure from the phone. Through regular speech recognition practice, his recognition rate has improved considerably. One day it is his mother's birthday, but Xiao Wang is working away from home and cannot return. To send his wishes, he says a few words over the phone wishing his mother a happy birthday. The audio is forwarded through the application server to the media server. The media server performs speech recognition and converts the recognized data according to the corresponding video identification codes (which can be determined by the video service type Xiao Wang has configured), then sends the converted video to his mother's phone. In this way his mother not only sees her child and hears his voice, but also sees the birthday video sent by the service, with cake, flowers and text set to music, and feels her child's filial devotion.

The method can turn speech recognition data into accompanying video pictures that are dynamic and entertaining and can raise people's interest; it can be used in teaching, entertainment and other areas. As network bandwidth widens and 3G comes into use, users' demand for multimedia in everyday life will grow ever broader, which will make this function increasingly important.

The above is only an illustrative scenario and does not limit the uses of the invention. The invention provides a method that uses a media server to convert speech recognition into video, filling the gap left by the absence of such methods, and thereby realizing a conversion relationship between voice and video.

Second embodiment

The second embodiment of the present invention is described below with reference to Fig. 5. Fig. 5 is a block diagram of a device 500 for converting recognized speech into video according to an embodiment of the invention.

As shown in Fig. 5, the device 500 for converting recognized speech into video according to an embodiment of the invention comprises: an identification code establishing module 502, for establishing corresponding identification codes according to the types of video resources when the media server starts; an audio stream receiving module 504, connected to the identification code establishing module 502, for setting up a connection channel for an audio stream and receiving the audio stream after the media server receives a request from an application server; a speech recognition module 506, connected to the audio stream receiving module 504, for recognizing the audio data and outputting the recognized data to a conversion processing module 508; the conversion processing module 508, connected to the speech recognition module 506 and to the identification code establishing module 502, for converting the data received from the speech recognition module 506 and comparing the converted data with the identification codes established by the identification code establishing module 502, thereby realizing the video conversion; and a video stream output module 510, connected to the conversion processing module 508, for outputting the converted video stream to terminal equipment over the network.

If a video resource is of a newly added type, the identification code establishing module 502 provides an interface through which the identification code corresponding to that video resource can be added in real time. After receiving the audio stream, the audio stream receiving module 504 notifies the speech recognition module 506 to start processing.

After outputting the recognized data, the speech recognition module 506 notifies the conversion processing module 508 to start processing. After receiving this notification, the conversion processing module 508 reads the data output by the speech recognition module 506 into its own buffer. After comparing the converted data with the identification codes established by the identification code establishing module 502, the conversion processing module 508 adds the resulting video index to the buffer.

The conversion processing module 508 then sorts the video index and, once sorting is complete, notifies the video stream output module 510 to start sending. After receiving this notification, the video stream output module 510 locates the video resources according to the video index and starts sending them.
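The way modules 504 to 510 hand work to one another by notification can be sketched as a chain of small Python classes. This is a toy model, not the device itself: the recognizer is replaced by a stand-in that treats the audio bytes as UTF-8 text, the two dictionaries stand in for the output of module 502, sending is modeled by appending to a list, and all class names, codes and file names are hypothetical.

```python
class VideoStreamOutput:                      # plays the role of module 510
    def __init__(self, code_to_path):
        self.code_to_path = code_to_path
        self.sent = []

    def notify(self, video_index):
        # Locate each resource by its index entry and "send" it
        # (recorded in a list here instead of being streamed).
        self.sent.extend(self.code_to_path[code] for code in video_index)


class ConversionProcessing:                   # plays the role of module 508
    def __init__(self, keyword_to_code, output):
        self.keyword_to_code = keyword_to_code
        self.output = output

    def notify(self, recognized_words):
        buffer = list(recognized_words)       # read the data into own buffer
        matched = {self.keyword_to_code[w] for w in buffer if w in self.keyword_to_code}
        video_index = sorted(matched)         # assumed ordering: by code
        self.output.notify(video_index)       # sorting finished: start sending


class SpeechRecognition:                      # plays the role of module 506
    def __init__(self, converter):
        self.converter = converter

    def notify(self, audio):
        # Stand-in recognizer (assumption): the audio bytes are UTF-8 text.
        self.converter.notify(audio.decode("utf-8").lower().split())


class AudioStreamReceiver:                    # plays the role of module 504
    def __init__(self, recognizer):
        self.recognizer = recognizer

    def receive(self, audio):
        self.recognizer.notify(audio)         # stream received: start recognition


# These two tables stand in for what module 502 establishes at startup.
keyword_to_code = {"birthday": "V001", "cake": "V002"}
code_to_path = {"V001": "birthday.3gp", "V002": "cake.3gp"}

output = VideoStreamOutput(code_to_path)
receiver = AudioStreamReceiver(
    SpeechRecognition(ConversionProcessing(keyword_to_code, output))
)
receiver.receive(b"happy birthday cake")
```

Each class holds a reference only to the next module in the chain, which matches the "connected to" relationships listed for Fig. 5; the connection between modules 502 and 504 is not modeled.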

The above are only preferred embodiments of the present invention and are not intended to limit it. For those skilled in the art, the invention may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A method for converting recognized speech into video, characterized in that it comprises the following steps:

First step: at startup, a media server establishes corresponding identification codes according to the types of video resources;

Second step: after receiving a request from an application server, the media server sets up a connection channel for an audio stream and receives the audio stream;

Third step: a speech recognition module of the media server recognizes the audio data and outputs the recognized data to a conversion processing program;

Fourth step: after receiving the recognized data, the conversion processing program converts it and compares the converted data with the identification codes established in the first step, thereby realizing the video conversion; and

Fifth step: the media server outputs the converted video stream to terminal equipment over the network.

2. The method for converting recognized speech into video according to claim 1, characterized in that, in the first step, if a video resource is of a newly added type, the media server provides an interface for adding the identification code corresponding to the video resource in real time.

3. The method for converting recognized speech into video according to claim 1, characterized in that, in the second step, the media server notifies the speech recognition module to start processing after receiving the audio stream, and, in the third step, the speech recognition module notifies the conversion processing program to start processing after outputting the recognized data.

4. The method for converting recognized speech into video according to claim 1, characterized in that, in the fourth step, after receiving the notification to start processing, the conversion processing program reads the data output by the speech recognition module into its own buffer and, after comparing the converted data with the identification codes established in the first step, adds the resulting video index to the buffer.

5. The method for converting recognized speech into video according to claim 4, characterized in that the conversion processing program sorts the video index and, after sorting is complete, notifies the media server to start sending, and the media server, after receiving the notification from the conversion processing program, locates the video resources according to the video index and starts sending.

6. A device for converting recognized speech into video, characterized in that it comprises:

an identification code establishing module, for establishing corresponding identification codes according to the types of video resources when a media server starts;

an audio stream receiving module, connected to the identification code establishing module, for setting up a connection channel for an audio stream and receiving the audio stream after the media server receives a request from an application server;

a speech recognition module, connected to the audio stream receiving module, for recognizing the audio data and outputting the recognized data to a conversion processing module;

the conversion processing module, connected to the speech recognition module and to the identification code establishing module, for converting the data received from the speech recognition module and comparing the converted data with the identification codes established by the identification code establishing module, thereby realizing the video conversion; and

a video stream output module, connected to the conversion processing module, for outputting the converted video stream to terminal equipment over the network.

7. The device for converting recognized speech into video according to claim 6, characterized in that, if a video resource is of a newly added type, the identification code establishing module provides an interface for adding the identification code corresponding to the video resource in real time.

8. The device for converting recognized speech into video according to claim 6, characterized in that the audio stream receiving module notifies the speech recognition module to start processing after receiving the audio stream, the speech recognition module notifies the conversion processing module to start processing after outputting the recognized data, and the conversion processing module, after receiving the notification to start processing, reads the data output by the speech recognition module into its own buffer.

9. The device for converting recognized speech into video according to claim 8, characterized in that the conversion processing module, after comparing the converted data with the identification codes established by the identification code establishing module, adds the resulting video index to the buffer.

10. The device for converting recognized speech into video according to claim 9, characterized in that the conversion processing module sorts the video index and, after sorting is complete, notifies the video stream output module to start sending, and the video stream output module, after receiving the notification from the conversion processing module, locates the video resources according to the video index and starts sending.