WO2018108013A1 - Medium displaying method and terminal - Google Patents

Medium displaying method and terminal

Info

Publication number
WO2018108013A1
Authority
WO
WIPO (PCT)
Prior art keywords
pronunciation
target
phonetic
image
correspondence table
Application number
PCT/CN2017/114843
Other languages
French (fr)
Chinese (zh)
Inventor
张长帅 (ZHANG Changshuai)
Original Assignee
中兴通讯股份有限公司 (ZTE Corporation)
Application filed by 中兴通讯股份有限公司 (ZTE Corporation)
Publication of WO2018108013A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 1/00 Substation equipment, e.g. for use by subscribers
    • H04M 1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M 1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M 1/72484 User interfaces specially adapted for cordless or mobile telephones wherein functions are triggered by incoming communication events
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G06V 40/166 Detection; Localisation; Normalisation using acquisition arrangements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G06V 40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 1/00 Substation equipment, e.g. for use by subscribers
    • H04M 1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M 1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M 1/72403 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H04M 1/7243 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages
    • H04M 1/72439 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages for image or video messaging
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 1/00 Substation equipment, e.g. for use by subscribers
    • H04M 1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M 1/725 Cordless telephones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00 Television systems
    • H04N 7/14 Systems for two-way working
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00 Television systems
    • H04N 7/14 Systems for two-way working
    • H04N 7/141 Systems for two-way working between two video terminals, e.g. videophone
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 2250/00 Details of telephonic subscriber devices
    • H04M 2250/74 Details of telephonic subscriber devices with voice recognition means

Definitions

  • This document relates to, but is not limited to, the field of communication technologies, and in particular, to a media display method and terminal.
  • a mobile phone can access the network through a SIM card or Wi-Fi to make a video call.
  • a media display method and terminal are provided in an embodiment of the present invention.
  • an embodiment of the present invention provides a media display method, including:
  • the at least one frame of the target media display image is played and displayed through the display interface.
  • acquiring, according to the target call voice, the at least one frame of the target media display image that matches the target call voice includes:
  • determining, according to the voiceprint feature of the target call voice, at least one target pronunciation mouth shape required to produce the target call voice includes:
  • determining that the at least one pronunciation mouth shape is the at least one target pronunciation mouth shape required to produce the target call voice.
  • determining, according to the voiceprint feature of the target call voice, at least one target pronunciation mouth shape required to produce the target call voice includes:
  • matching at least one corresponding pronunciation mouth shape from the correspondence table between phonetic transcriptions and pronunciation mouth shapes;
  • determining that the at least one pronunciation mouth shape is the at least one target pronunciation mouth shape required to produce the target call voice.
  • the correspondence table between text and phonetic transcriptions includes a correspondence table between Chinese characters and pinyin; and matching, according to the text content, the phonetic combination corresponding to the text content from the correspondence table between text and phonetic transcriptions includes:
  • the correspondence table between phonetic transcriptions and pronunciation mouth shapes includes a correspondence table between pinyin and pronunciation mouth shapes, and matching, according to the phonetic combination, at least one pronunciation mouth shape from the correspondence table between phonetic transcriptions and pronunciation mouth shapes includes:
  • matching at least one corresponding pronunciation mouth shape from the correspondence table between pinyin and pronunciation mouth shapes.
  • the correspondence table between pinyin and pronunciation mouth shapes includes correspondences between the initials and finals of pinyin and pronunciation mouth shapes; and matching, according to the pinyin combination, at least one corresponding pronunciation mouth shape from the correspondence table between pinyin and pronunciation mouth shapes includes:
  • obtaining the corresponding at least one pronunciation mouth shape from the correspondences between initials and finals and pronunciation mouth shapes.
  • the correspondence table between text and phonetic transcriptions includes a correspondence table between English words and English phonetic symbols; and matching, according to the text content, the phonetic combination corresponding to the text content from the correspondence table between text and phonetic transcriptions includes:
  • the correspondence table between phonetic transcriptions and pronunciation mouth shapes includes a correspondence table between English phonetic symbols and pronunciation mouth shapes, and matching, according to the phonetic combination, at least one corresponding pronunciation mouth shape from the correspondence table between phonetic transcriptions and pronunciation mouth shapes includes:
  • matching at least one corresponding pronunciation mouth shape from the correspondence table between English phonetic symbols and pronunciation mouth shapes.
  • the correspondence table between English phonetic symbols and pronunciation mouth shapes includes correspondences between the vowels and consonants of the English phonetic symbols and pronunciation mouth shapes; and matching, according to the phonetic symbol combination, at least one corresponding pronunciation mouth shape from the correspondence table between English phonetic symbols and pronunciation mouth shapes includes:
  • obtaining the corresponding at least one pronunciation mouth shape by matching.
  • the determining a target emoticon packet corresponding to the target call voice includes:
  • the collecting the target call voice includes:
  • before the step of collecting the target call voice, the method further includes:
  • the personal image includes a facial image of a person, and the media material images include pronunciation mouth shape images corresponding to the initials and finals of pinyin, or pronunciation mouth shape images corresponding to the vowels and consonants of the English phonetic symbols.
  • integrating the personal image of the call contact with each of the media material images to generate at least one media display image, and obtaining an emoticon package including the at least one media display image, includes:
  • playing and displaying, through the display interface, the at least one frame of the target media display image includes:
  • the embodiment of the present invention further provides a media display terminal, including:
  • the acquisition module is configured to collect the target call voice
  • a first acquiring module configured to acquire, according to the target call voice, at least one frame target media display image that matches the target call voice
  • the display module is configured to perform play display on the at least one frame of the target media display image through the display interface.
  • the first acquiring module includes:
  • a first determining submodule configured to determine a target emoticon packet corresponding to the target call voice
  • a second determining submodule configured to determine, according to the voiceprint feature of the target call voice, at least one target pronunciation mouth shape required to produce the target call voice
  • an obtaining submodule configured to obtain, from the target emoticon package, at least one frame of the target media display image including the target pronunciation mouth shape.
  • the second determining submodule includes:
  • a first determining unit configured to determine, according to the voiceprint feature of the target call voice, text content corresponding to the target call voice
  • the first matching unit is configured to match the corresponding phonetic combination from a correspondence table between text and phonetic transcriptions according to the text content;
  • the second matching unit is configured to match at least one corresponding pronunciation mouth shape from a correspondence table between phonetic transcriptions and pronunciation mouth shapes according to the phonetic combination;
  • the second determining unit is configured to determine that the at least one pronunciation mouth shape is the at least one target pronunciation mouth shape required to produce the target call voice.
  • the correspondence table between the text and the phonetic includes a correspondence table between Chinese characters and pinyin;
  • the first matching unit includes:
  • the first matching subunit is configured to match, according to the text content, a pinyin combination corresponding to the text content from a correspondence table between Chinese characters and pinyin;
  • the correspondence table between phonetic transcriptions and pronunciation mouth shapes includes a correspondence table between pinyin and pronunciation mouth shapes, and the second matching unit includes:
  • the second matching subunit is configured to match at least one corresponding pronunciation mouth shape from the correspondence table between pinyin and pronunciation mouth shapes according to the pinyin combination.
  • the correspondence table between pinyin and pronunciation mouth shapes includes correspondences between the initials and finals of pinyin and pronunciation mouth shapes; the second matching subunit is configured to:
  • obtain the corresponding at least one pronunciation mouth shape from the correspondences between initials and finals and pronunciation mouth shapes.
  • the correspondence table between the text and the phonetic includes a correspondence table between the English and the English phonetic symbols;
  • the first matching unit includes:
  • the third matching subunit is configured to match the phonetic symbol combination corresponding to the text content from the correspondence table between English words and English phonetic symbols according to the text content;
  • the correspondence table between phonetic transcriptions and pronunciation mouth shapes includes a correspondence table between English phonetic symbols and pronunciation mouth shapes, and the second matching unit includes:
  • the fourth matching subunit is configured to match at least one corresponding pronunciation mouth shape from the correspondence table between English phonetic symbols and pronunciation mouth shapes according to the phonetic symbol combination.
  • the correspondence table between English phonetic symbols and pronunciation mouth shapes includes correspondences between the vowels and consonants of the English phonetic symbols and pronunciation mouth shapes; the fourth matching subunit is configured to:
  • match the corresponding at least one pronunciation mouth shape.
  • the first determining submodule includes:
  • a third determining unit configured to determine a target contact corresponding to the target call voice
  • the retrieval unit is configured to retrieve a target emoticon package pre-associated with the target contact.
  • the collecting module (301) includes:
  • the monitoring submodule is set to monitor the voice call process
  • the third determining submodule is configured to determine that the received counterpart voice is the target call voice.
  • the media display terminal further includes:
  • a second acquiring module configured to acquire a material resource package and a personal image of the call contact, wherein the material resource package includes at least one media material image
  • a generating module configured to integrate the personal image of the call contact with each of the media material images to generate at least one media display image, and obtain an emoticon package including the at least one media display image.
  • the personal image includes a facial image of the individual, and the media material images include pronunciation mouth shape images corresponding to the initials and finals of pinyin, or pronunciation mouth shape images corresponding to the vowels and consonants of the English phonetic symbols.
  • the generating module includes:
  • a replacement submodule configured to fill the mouth region of the facial image with the pronunciation mouth shape images in the media material images, replacing the original mouth
  • a generating submodule configured to generate a media display image corresponding to each pronunciation mouth shape image in the media material images, and obtain an emoticon package including the media display images corresponding to each pronunciation mouth shape image.
  • the display module includes:
  • the display submodule is configured to use an obtained display image as a background image on the display interface, and play and display the at least one frame of the target media display image.
  • virtual video telephony can break through the limitations of the network environment. The process does not depend on the network, saves data traffic or even removes traffic restrictions entirely, and accompanies the call with a virtualized video picture, making the call more lively and vivid, strengthening the communication effect, adding fun to communication, and improving the user experience.
  • FIG. 1 is a schematic flow chart of a media display method in an embodiment of the present invention
  • FIG. 2 is a schematic flow chart of another media display method in an embodiment of the present invention.
  • FIG. 3 is a block diagram showing the structure of a media display terminal in an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of a mouth-shape template in an embodiment of the present invention.
  • a mobile phone can access the network through a SIM card or Wi-Fi to make a video call
  • however, a video call usually relies on wireless networking, wireless network coverage is not in practice available anytime and anywhere, and making a video call over the SIM card connection is expensive.
  • this method therefore depends on, and is constrained by, the network environment, so video conversation cannot become the normal mode of communication; and when people talk, voice-only communication is neither flexible nor vivid, so the user experience is lacking.
  • a media display method is disclosed in the embodiment of the present invention.
  • Step 101 Collect a target call voice.
  • this step may occur during an ordinary telephone call or while voice is played back in instant messaging software such as WeChat. The target call voice may be the voice of the user in the call or the voice of the contact talking with the user.
  • the target call voice becomes the object of the subsequent processing.
  • the step of collecting the target call voice may include: monitoring a voice call process, and determining that the received counterpart voice is the target call voice. That is, the call voice of the other party is collected and passed to the next stage of processing to realize a fun call.
  • Step 102 Acquire, according to the target call voice, at least one frame of the target media display image that matches the target call voice.
  • the at least one frame of the target media display image may be selected from a set of image packets.
  • the at least one frame of the target media display image may match the content of the target call voice, or match the tone, volume, or mouth shape required to produce the target call voice. The target media display image may show an expression, a body motion, a symbol, or the location and context of the voice content.
  • Step 103 Play and display the at least one frame of the target media display image through the display interface.
  • body motions may be switched along with the voice.
  • this process matches at least one frame of the target media display image and plays and displays it to form a video-style playback effect, so that recognized voice is converted into a video display. This can be done by a local software application on the terminal device, which breaks through the limitations of the network environment and provides virtual video telephony. The process does not depend on the network, saves data traffic or even removes traffic restrictions, accompanies the call with a virtualized video picture, makes the call more lively and vivid, strengthens the communication effect, adds fun to communication, and improves the user experience.
  • a media display method is also disclosed in the embodiment of the present invention.
  • Step 201 Collect a target call voice.
  • the step of collecting the target call voice may include: monitoring a voice call process, and determining that the received counterpart voice is the target call voice. That is, the call voice of the other party is collected and passed to the next stage of processing to realize a fun call.
  • Step 202 Determine a target emoticon packet corresponding to the target call voice.
  • the target emoticon package may be a fixed emoticon package that matches different target call voices and is determined and read directly from the storage device; or it may be an emoticon package that changes with specific elements of the target call voice, in which case it needs to be matched and determined according to the target call voice.
  • the step of determining a target emoticon package corresponding to the target call voice includes:
  • the emoticon package may be resource content associated with the target contact. For example, when a known contact calls the user's mobile phone, the phone can determine from the address book that the target call voice is sent by that contact and select the emoticon package corresponding to the caller, which may be based on a photo of the target contact or a specific picture. Different settings can be made for specific contacts, making the displayed result more targeted and more interesting.
  • Step 203 Determine, according to the voiceprint feature of the target call voice, at least one target pronunciation mouth shape required to produce the target call voice.
  • different call voices correspond to different pronunciation mouth shapes, and different voices have different voiceprint feature information.
  • the required pronunciation mouth shapes can therefore be determined according to the voiceprint features in the collected target call voice.
  • the step of determining, according to the voiceprint feature of the target call voice, at least one target pronunciation mouth shape required to produce the target call voice includes:
  • a phonetic transcription expresses the pronunciation of text. Different languages correspond to different scripts, and different scripts correspond to different phonetic systems: when the text is Chinese, the phonetic transcription is pinyin; when the text is English, it is the English phonetic symbols. To determine the corresponding pronunciation mouth shapes from the voiceprint feature of the target call voice, the voice is first converted into text, the text is matched to the corresponding phonetic combination, and the phonetic combination is matched to the corresponding target pronunciation mouth shapes.
  • the process is implemented as follows: receive the voice, quantize the voice data, and call an open-source interface to convert the voice into text.
  • the principle is that different voices have different voiceprint feature information, so comparisons between voiceprint feature information and text can be recorded in advance to generate a database correspondence. After new voice is captured, it is compared with the pre-recorded database to find the corresponding text; converting text to a phonetic transcription works the same way.
  • the contrasts between different phonetic transcriptions and the corresponding texts are pre-recorded and written into an array to generate the database correspondence. After the voice is converted into text, the phonetic transcription is found in the database, and the target pronunciation mouth shapes corresponding to the phonetic transcription are obtained from the correspondence table between phonetic transcriptions and pronunciation mouth shapes preset in the database. The phonetic transcription is split and fitted to emoticon images, which are switched rapidly in the display to generate a virtual video effect.
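As an illustration of the table-driven lookup chain described above, the following Python sketch chains the pre-recorded correspondences with plain dictionaries. All table entries, image names, and the stubbed speech-recognition step are hypothetical stand-ins; a real implementation would pre-record full voiceprint/text and phonetic/mouth-shape tables and call an actual open-source recognition interface.

```python
# Minimal sketch of the pre-recorded database correspondences: the tables
# below are tiny hypothetical stand-ins, not the patent's actual data.

TEXT_TO_PHONETIC = {"你": "ni", "好": "hao"}       # text -> phonetic transcription
PHONETIC_TO_MOUTH = {"ni": "mouth_ni.png",         # phonetic -> mouth-shape image
                     "hao": "mouth_hao.png"}

def recognize_text(voice_samples):
    """Stand-in for the open-source speech-to-text interface."""
    return "你好"                                   # assume the voice said "ni hao"

def voice_to_mouth_images(voice_samples):
    text = recognize_text(voice_samples)            # voice -> text
    phonetics = [TEXT_TO_PHONETIC[ch] for ch in text]     # text -> phonetic
    return [PHONETIC_TO_MOUTH[p] for p in phonetics]      # phonetic -> mouth shapes

print(voice_to_mouth_images([0.1, 0.2]))
# ['mouth_ni.png', 'mouth_hao.png']
```

Rapidly displaying the returned images in order would produce the virtual video effect described.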
  • the correspondence table between text and phonetic transcriptions includes a correspondence table between Chinese characters and pinyin;
  • matching the phonetic combination corresponding to the text content then includes:
  • matching a pinyin combination corresponding to the text content from the correspondence table between Chinese characters and pinyin.
  • the correspondence table between phonetic transcriptions and pronunciation mouth shapes includes a correspondence table between pinyin and pronunciation mouth shapes, and matching, according to the phonetic combination, at least one pronunciation mouth shape from the correspondence table between phonetic transcriptions and pronunciation mouth shapes includes: matching, according to the pinyin combination, the corresponding at least one pronunciation mouth shape from the correspondence table between pinyin and pronunciation mouth shapes.
  • the process of converting Chinese characters into pinyin is also the same.
  • the relationships between pinyin and Chinese characters are pre-recorded by reading the pinyin table in the standard GBK character set database and writing it into an array to generate the database correspondence. After the voice is converted into Chinese characters, the characters are compared with the database,
  • the target pronunciation mouth shapes corresponding to the pinyin are obtained, the pinyin is split and fitted to expression images, and the display switches rapidly to generate a virtual video effect.
  • the correspondence table between pinyin and pronunciation mouth shapes includes correspondences between the initials and finals of pinyin and pronunciation mouth shapes; matching from the correspondence table between pinyin and pronunciation mouth shapes according to the pinyin combination then includes:
  • obtaining the corresponding at least one pronunciation mouth shape by matching.
  • the pinyin is located and split in the array structure to separate the initial and the final. Since some Chinese pronunciations correspond to a final alone, the resulting pinyin combination may include an initial plus a final or only a final. The emoticons are fitted according to the mouth shapes corresponding to the initials and finals and switched in the display window; the rapid switching of expressions generates a virtual video presentation.
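The locate-and-split step above can be sketched as follows. This is an illustrative sketch, not the patented implementation: the initials list approximates the standard pinyin initial set (treating y/w as initials is a simplifying assumption), and the fallback covers final-only syllables such as "ai".

```python
# Split one pinyin syllable into (initial, final); some syllables have no initial.

INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

def split_pinyin(syllable):
    """Return (initial, final); the initial is '' for final-only syllables."""
    for ini in INITIALS:                # two-letter initials are tried first
        if syllable.startswith(ini):
            return ini, syllable[len(ini):]
    return "", syllable                 # e.g. "ai", "e", "an"

print(split_pinyin("zhang"))   # ('zh', 'ang')
print(split_pinyin("ai"))      # ('', 'ai')
```

Each returned part would then index the corresponding mouth-shape image in the resource library.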
  • the correspondence table between pinyin and pronunciation mouth shapes may also be generated by import or download.
  • the storage device is provided with a mouth-shape resource library for the pinyin letters.
  • each mouth-shape image in the library corresponds to one pronunciation example of the pinyin letters, and the mouth shapes correspond one-to-one with the initials and finals of the pinyin table, so that every pronunciation can find its corresponding initial and final images in the mouth-shape library, making the pronunciation mouth shape truly consistent with the appearance of the speaker's actual mouth.
  • the correspondence table between text and phonetic transcriptions includes a correspondence table between English words and English phonetic symbols; and matching the phonetic combination corresponding to the text content from the correspondence table between text and phonetic transcriptions includes:
  • matching a phonetic symbol combination corresponding to the text content from the correspondence table between English words and English phonetic symbols.
  • the correspondence table between phonetic transcriptions and pronunciation mouth shapes includes a correspondence table between English phonetic symbols and pronunciation mouth shapes, and matching, according to the phonetic combination, at least one pronunciation mouth shape from the correspondence table between phonetic transcriptions and pronunciation mouth shapes includes: matching, according to the phonetic symbol combination, at least one pronunciation mouth shape from the correspondence table between English phonetic symbols and pronunciation mouth shapes.
  • the correspondence table between English phonetic symbols and pronunciation mouth shapes includes correspondences between the vowels and consonants of the English phonetic symbols and pronunciation mouth shapes; and matching, according to the phonetic symbol combination, at least one pronunciation mouth shape from the correspondence table between English phonetic symbols and pronunciation mouth shapes includes:
  • this process matches the text content in the correspondence table between English words and English phonetic symbols.
  • the English phonetic symbols are located and split in the array structure into vowels and consonants. Since some English pronunciations correspond to a vowel alone, the resulting phonetic symbol combination may contain a combination of vowels and consonants or only a vowel. The emoticon is fitted according to the mouth shapes corresponding to the vowels and consonants and switched in the display window; the rapid switching of expressions generates a virtual video presentation.
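The English counterpart of the splitting step can be sketched the same way. The symbol inventory below is a small hypothetical subset of the full English phonetic symbol table, and the greedy longest-match strategy is one plausible way to separate multi-character vowel symbols; neither is prescribed by the source.

```python
# Split an English phonetic transcription into known vowel/consonant units.

VOWELS = {"i:", "ɪ", "e", "æ", "ɑ:", "ə", "ʌ", "u:", "əʊ"}
CONSONANTS = {"h", "l", "s", "t", "k", "m", "n", "d"}

def split_symbols(transcription):
    """Greedily split a transcription such as 'həˈləʊ' into known units."""
    units, i = [], 0
    text = transcription.replace("ˈ", "")        # drop stress marks
    while i < len(text):
        for length in (2, 1):                    # longest match first ("əʊ" before "ə")
            unit = text[i:i + length]
            if unit in VOWELS or unit in CONSONANTS:
                units.append(unit)
                i += length
                break
        else:
            i += 1                               # skip unknown symbols
    return units

print(split_symbols("həˈləʊ"))   # ['h', 'ə', 'l', 'əʊ']
```

Each unit would then select the corresponding vowel or consonant mouth-shape image from the library.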
  • the correspondence table between English phonetic symbols and pronunciation mouth shapes may also be generated by import or download.
  • the solution is described based on a mobile phone, taking virtual video as an example.
  • the storage device is provided with a mouth-shape resource library for the English phonetic symbols.
  • each mouth-shape image in the library corresponds to one pronunciation example of an English phonetic symbol, and the mouth shapes correspond one-to-one with the vowels and consonants of the English phonetic symbol table, so that every pronunciation can find its corresponding vowel and consonant pronunciation images in the mouth-shape library and combine them, making the pronunciation mouth shape truly consistent with the actual mouth of the caller.
  • Step 204 Match, from the target emoticon package, at least one frame of the target media display image including the target pronunciation mouth shape.
  • the target emoticon package is an emoticon package containing different mouth shapes, and the media display images it includes are display images with differently shaped mouths. Optionally, the target media display image may also contain other changing elements that match the mouth shape, such as eyebrow changes, expressions, and face shapes.
  • Step 205 Perform play display on at least one frame of the target media display image through the display interface.
  • the user terminal, for example a mobile phone, adapts the associated expressions according to voice recognition and synthesizes the corresponding image resources, so that the picture plays continuously along with the voice and the expressions switch and update continuously.
  • a virtual video scene is thus generated in a network-free environment, realizing a virtually displayed video dialogue and making dialogue and communication more effective and interesting.
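The image-resource synthesis mentioned above — pasting a pronunciation mouth-shape image into the mouth region of the contact's facial image — can be sketched with a toy pixel model. Images are modeled as nested lists of grayscale values and the region coordinates are hypothetical; a real implementation would operate on bitmap resources and locate the mouth region by face detection.

```python
# Toy sketch of mouth-region replacement for generating a media display image.

def paste_mouth(face, mouth, top, left):
    """Return a copy of `face` with `mouth` pasted at (top, left)."""
    out = [row[:] for row in face]                  # copy the facial image
    for r, row in enumerate(mouth):
        for c, pixel in enumerate(row):
            out[top + r][left + c] = pixel          # fill the mouth region
    return out

face = [[0] * 4 for _ in range(4)]                  # 4x4 blank "face"
mouth = [[9, 9]]                                    # 1x2 "mouth shape"
frame = paste_mouth(face, mouth, top=2, left=1)
print(frame[2])   # [0, 9, 9, 0]
```

Generating one such frame per mouth-shape image yields the emoticon package of media display images.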
  • Step 1 Read the pinyin content of the standard Chinese character encoding character set (GBK) database in advance, and write the pinyin data into an array.
  • each element of the array is a structure divided into four parts: the pinyin, the initial after splitting, the final, and the Chinese character corresponding to the pinyin.
  • Step 2 Convert the voice into text through the open-source interface, search the array with the text as the key to find the corresponding pinyin, and then split the pinyin into the corresponding initial and final.
  • Step 3 Use the initials and finals to look up the associated expressions, create a visual dialog on the desktop as a user interface (UI) window, and present the expressions in the UI window.
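The three steps above can be sketched end to end. Each array element mirrors the four-part structure (pinyin, split initial, split final, corresponding Chinese character); the two table entries are illustrative, standing in for the full GBK pinyin table.

```python
# End-to-end sketch of the Step 1-3 pipeline with a two-entry table.

from collections import namedtuple

Entry = namedtuple("Entry", "pinyin initial final hanzi")

# Step 1: the pre-built array (in practice read from the GBK pinyin table).
TABLE = [
    Entry("ni", "n", "i", "你"),
    Entry("hao", "h", "ao", "好"),
]

def lookup(hanzi):
    """Step 2: search the array with the character as the key."""
    for entry in TABLE:
        if entry.hanzi == hanzi:
            return entry
    return None

def expressions_for(text):
    """Step 3: collect the initial/final units used to pick expressions."""
    units = []
    for ch in text:
        entry = lookup(ch)
        if entry:
            units.extend([entry.initial, entry.final])
    return units

print(expressions_for("你好"))   # ['n', 'i', 'h', 'ao']
```

The returned units index the mouth-shape expressions presented in the UI window.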
  • The time length of the target call voice may be calculated first, and each frame of the target media display image is then played continuously according to that length, with the displayed image switching as playback proceeds. The playing time of one frame is obtained by dividing the length of the target call voice by the number of frames of the target media display image. Taking "Hello" as an example, if the voice "hello" lasts 1 second and is decomposed into four expression images, each image is displayed for 1/4 second, that is, 250 milliseconds. In this way, the at least one frame of the target media display image matched to the target call voice is continuously switched and presented.
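The per-frame timing rule above is simple arithmetic; a minimal sketch (the function name is illustrative, not from the patent):

```python
def per_frame_ms(voice_duration_ms, frame_count):
    """Per-frame display time: voice length divided by the number of matched frames."""
    if frame_count <= 0:
        raise ValueError("need at least one frame")
    return voice_duration_ms / frame_count
```

For the "hello" example, a 1000 ms voice split across four mouth-shape frames yields 250 ms of display time per frame.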
  • The expression presented in the window corresponds to the received voice. The user's emoticon images are switched rapidly in the UI window, presenting a video-like effect and thereby realizing a virtual video call function.
  • Optionally, the step of playing and displaying the at least one frame of the target media display image through the display interface includes: playing and displaying, through the display interface, the at least one frame of the target media display image with an obtained display image as the background picture.
  • When playing and displaying the target media display image, a play background may be added. The background may be a fixed background associated with the set contact, or a corresponding image may be matched and acquired as the background image according to a keyword or word recognized in the target call voice, so that the background image changes with the user's voice content.
  • This process mainly performs voice matching and synthesizes known scene images (background images) from the library without requiring a wireless network environment, thereby realizing virtual scene reproduction in a network-free environment. The continuously switched scenes and expressions realize a virtual video dialogue, making the conversation more effective and interesting.
  • The core implementation process is as follows. For example, when a voice call is made on a mobile phone, the phones of both parties have pre-stored expression and scene images. When receiving the other party's voice, the local terminal first activates the speech recognition module and switches the corresponding media display resources according to the recognition result, displaying them in the terminal dialogue interface. As the voice continues, speech recognition continuously adapts the corresponding expression resources. The terminal interface can display a fixed background image or adapt a background image to the voice content; the expressions are matched with the background image in the terminal interface, and the visual effect of the fast switching is a realistic video dialogue scene, thereby realizing a virtual-reality video dialogue.
  • Optionally, before the step of collecting the target call voice, the method further includes obtaining an asset resource package and a personal image of the call contact, and integrating them into an emoticon package. The personal image may be a personal expression image or an image associated with the call contact. This process corresponds to the initialization of the emoticon package containing the media display images, which enables at least one frame of the target media display image to be acquired according to the target call voice. The asset resource package can be downloaded in advance while a network is available.
  • The process of integrating the personal image with each media material image may be implemented by, for example, image filling, partial replacement, or partial coverage.
  • The user installs the virtual simulation software on the mobile phone; the software can set the image scene arbitrarily, upload the user image, initialize the presets, and generate the user emoticon package. The user's mobile phone stores the required image resources, which may be captured by camera beforehand or downloaded from the network to the phone, and usually include the user's personal image, the material resource package, and the scene pictures of a typical video conversation.
  • The material resource package is, for example, a mouth-shape (lip) resource package, which is integrated by the software of the embodiment of the present invention and provided for the user's use. Taking the mouth resource package as an example, the software itself integrates the lip image resources. The user provides a personal image before use; the software is initialized, the lip images are combined with the personal image, and image optimization is performed to synthesize the lip emoticon package. An emoticon package corresponding to the user's pinyin alphabet pronunciations is thus generated. The image synthesis first clears the lip and edge regions of the user's face image, superimposes lip resources of the same size, and then optimizes the image to obtain a user-defined emoticon package corresponding to the pinyin alphabet.
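The clear-then-superimpose step can be sketched as a simple paste of a same-size lip patch into the face image. This is an illustrative stand-in (function name and list-of-lists grayscale representation are assumptions, not the patent's implementation):

```python
def paste_lip(face, lip, top, left):
    """Clear the mouth region of `face` and superimpose the same-size `lip`
    patch at (top, left); images are row-major grayscale pixel lists."""
    out = [row[:] for row in face]        # work on a copy, keep the original
    for r, lip_row in enumerate(lip):
        for c, value in enumerate(lip_row):
            out[top + r][left + c] = value  # overwrite the cleared region
    return out
```

A real pipeline would follow this with the smoothing/optimization pass described below so the pasted edges blend into the face.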
  • Optionally, the personal image includes a facial image of the person, and the media material images include pronunciation mouth-shape images corresponding to the initials and finals in pinyin, or pronunciation mouth-shape images corresponding to the vowels and consonants in English phonetic symbols.
  • The step of integrating the personal image of the call contact with each media material image to generate at least one media display image and obtain an emoticon package containing the at least one media display image includes: generating the media display image corresponding to each pronunciation mouth-shape image, and obtaining an emoticon package containing the media display images corresponding to the pronunciation mouth-shape images.
  • The technical implementation steps of the face-and-mouth synthesized expression package are as follows:
  • Step 1: Taking a call between two parties as an example, the voice communication device held by A stores the image resource of B in advance, together with the lip image resources corresponding to all letter pronunciations of the Chinese pinyin alphabet.
  • Step 2: Convert the face-and-mouth color image into a grayscale image using Gray = R*0.299 + G*0.587 + B*0.114. To avoid slow floating-point operations, an integer algorithm with rounding is introduced, giving the equivalent variant Gray = (R*30 + G*59 + B*11 + 50) / 100, which improves conversion efficiency.
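Both forms of the grayscale conversion in Step 2 can be written directly from the formulas above (function names are illustrative):

```python
def gray_float(r, g, b):
    """Floating-point luma: Gray = R*0.299 + G*0.587 + B*0.114."""
    return r * 0.299 + g * 0.587 + b * 0.114

def gray_int(r, g, b):
    """Integer variant from the text; the +50 term rounds to the nearest
    integer before the division by 100."""
    return (r * 30 + g * 59 + b * 11 + 50) // 100
```

For example, gray_float(100, 50, 200) is 82.05, and the integer variant rounds this to 82 without any floating-point arithmetic.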
  • Step 3: Take a grayscale threshold on the grayscale image and use it to segment the face, implementing mouth region detection. Perform edge detection on the face image and apply a mean-operator template (that is, each pixel value is recalculated from its adjacent pixels) to mean-process the grayscale image; feature areas of the face, such as the eye area, can thus be detected, or the mouth area can be identified from the symmetry and structural distribution of the face.
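The mean-operator template in Step 3 is a standard 3×3 mean filter. A minimal sketch (assuming grayscale images as row-major lists; borders are left unchanged for simplicity):

```python
def mean_filter(img):
    """3x3 mean operator: replace each interior pixel with the integer
    average of its 3x3 neighbourhood (including itself)."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            total = sum(img[y + dy][x + dx]
                        for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            out[y][x] = total // 9
    return out
```

Smoothing with this operator suppresses isolated noise before thresholding, which makes the segmented feature regions (eyes, mouth) more coherent.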
  • Step 4: Replace the original lip resource into the lip area detected in the face region to generate an expression. The pixel values of the lip image generated in Step 1 are sampled and quantized so that its number of pixels equals the number of pixels in the lip area of Step 3; the quantized lip resource is then replaced and filled into the lip area of Step 3, and the generated facial expression image is reconstructed.
  • Step 5: Perform Gaussian filtering on the newly generated expression image to enhance the smoothness of the synthesized image, and generate an expression resource library. Each expression image in the library is a pronunciation mouth shape of the face corresponding to a letter of the pinyin alphabet. The face and the different mouth shapes are combined to generate different expression images, each corresponding to the pronunciation of a pinyin letter. The newly generated facial expression image is denoised and smoothed by Gaussian filtering to obtain a clear expression image.
  • The template calculated directly from the Gaussian function contains floating-point numbers, so an integer 5×5 template operator with coefficient 1/273 is used, as shown in Figure 4.
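Figure 4 is not reproduced here, but the usual integer 5×5 Gaussian template whose coefficients sum to 273 (hence the 1/273 factor) is well known, and applying it can be sketched as follows (function name and list-based image representation are illustrative):

```python
# Standard integer 5x5 Gaussian template; the 25 coefficients sum to 273.
GAUSS_5X5 = [
    [1,  4,  7,  4, 1],
    [4, 16, 26, 16, 4],
    [7, 26, 41, 26, 7],
    [4, 16, 26, 16, 4],
    [1,  4,  7,  4, 1],
]

def gaussian_5x5(img):
    """Smooth interior pixels with the 1/273 integer Gaussian template;
    a 2-pixel border is left unchanged for simplicity."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(2, h - 2):
        for x in range(2, w - 2):
            acc = sum(GAUSS_5X5[dy + 2][dx + 2] * img[y + dy][x + dx]
                      for dy in range((-2), 3) for dx in range((-2), 3))
            out[y][x] = acc // 273
    return out
```

Because the weights are integers and the division by 273 happens once per pixel, the smoothing avoids per-tap floating-point multiplications, matching the efficiency goal stated above.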
  • In this process, at least one frame of the target media display image is matched according to the collected call voice, and the target media display images are played continuously to form a video-style playing effect, converting speech recognition into a video display. A local software application on the terminal device thus breaks through the limitation of the network environment and enables virtual video-telephone communication: the process does not depend on the network, saves traffic, and even removes the traffic restriction, so that the call is accompanied by virtualized video pictures and becomes more vivid and lively, enhancing the communication effect and adding fun to the conversation.
  • The embodiment of the present invention further discloses a media display terminal, which, as shown in FIG. 3, includes a collecting module 301, a first acquiring module 302, and a display module 303. The media display terminal may be a voice-capable terminal such as a smart watch or a mobile phone.
  • The collecting module 301 is configured to collect a target call voice.
  • The first acquiring module 302 is configured to acquire, according to the target call voice, at least one frame of a target media display image that matches the target call voice.
  • The display module 303 is configured to play and display the at least one frame of the target media display image through the display interface.
  • Optionally, the first acquiring module includes:
  • a first determining submodule, configured to determine a target emoticon package corresponding to the target call voice;
  • a second determining submodule, configured to determine, according to the voiceprint feature of the target call voice, at least one target pronunciation mouth shape required to produce the target call voice; and
  • an obtaining submodule, configured to match, from the target emoticon package, at least one frame of the target media display image containing the target pronunciation mouth shape.
  • Optionally, the second determining submodule includes:
  • a first determining unit, configured to determine, according to the voiceprint feature of the target call voice, the text content corresponding to the target call voice;
  • a first matching unit, configured to match, according to the text content, the phonetic combination corresponding to the text content from the correspondence table between text and phonetic transcription;
  • a second matching unit, configured to match, according to the phonetic combination, at least one corresponding pronunciation mouth shape from the correspondence table between phonetic transcription and pronunciation mouth shapes; and
  • a second determining unit, configured to determine that the at least one pronunciation mouth shape is the at least one target pronunciation mouth shape required to produce the target call voice.
  • Optionally, the correspondence table between text and phonetic transcription includes a correspondence table between Chinese characters and pinyin, and the first matching unit includes a first matching subunit configured to match, according to the text content, the pinyin combination corresponding to the text content from the correspondence table between Chinese characters and pinyin.
  • The correspondence table between phonetic transcription and pronunciation mouth shapes includes a correspondence table between pinyin and pronunciation mouth shapes, and the second matching unit includes a second matching subunit configured to match, according to the pinyin combination, at least one corresponding pronunciation mouth shape from the correspondence table between pinyin and pronunciation mouth shapes.
  • Optionally, the correspondence table between pinyin and pronunciation mouth shapes includes correspondences between the initials and finals in pinyin and the pronunciation mouth shapes; the second matching subunit is configured to determine the initials and finals (or only the finals) contained in the pinyin combination and match the corresponding at least one pronunciation mouth shape.
  • Optionally, the correspondence table between text and phonetic transcription includes a correspondence table between English and English phonetic symbols, and the first matching unit includes a third matching subunit configured to match, according to the text content, the phonetic symbol combination corresponding to the text content from the correspondence table between English and English phonetic symbols.
  • The correspondence table between phonetic transcription and pronunciation mouth shapes includes a correspondence table between English phonetic symbols and pronunciation mouth shapes, and the second matching unit includes a fourth matching subunit configured to match, according to the phonetic symbol combination, at least one corresponding pronunciation mouth shape from the correspondence table between English phonetic symbols and pronunciation mouth shapes.
  • Optionally, the correspondence table between English phonetic symbols and pronunciation mouth shapes includes correspondences between the vowels and consonants in the English phonetic symbols and the pronunciation mouth shapes; the fourth matching subunit is configured to determine the vowels and consonants (or only the vowels) contained in the phonetic symbol combination and match the corresponding at least one pronunciation mouth shape.
  • Optionally, the first determining submodule includes:
  • a third determining unit, configured to determine the target contact corresponding to the target call voice; and
  • a retrieving unit, configured to retrieve a target emoticon package pre-associated with the target contact.
  • Optionally, the collecting module includes:
  • a monitoring submodule, configured to monitor the voice call process; and
  • a third determining submodule, configured to determine that the received voice of the other party is the target call voice.
  • Optionally, the terminal further includes:
  • a second obtaining module, configured to obtain an asset resource package and a personal image of a call contact, wherein the asset resource package contains at least one media material image; and
  • a generating module, configured to integrate the personal image of the call contact with each media material image to generate at least one media display image, obtaining an emoticon package containing the at least one media display image.
  • Optionally, the personal image includes a facial image of a person, and the media material images include pronunciation mouth-shape images corresponding to the initials and finals in pinyin, or pronunciation mouth-shape images corresponding to the vowels and consonants in English phonetic symbols.
  • Optionally, the generating module includes:
  • an identification submodule, configured to identify the mouth region in the facial image;
  • a replacement submodule, configured to fill and replace the pronunciation mouth-shape images of the media material images into the mouth region; and
  • a generating submodule, configured to generate a media display image corresponding to each pronunciation mouth-shape image in the media material images, obtaining an emoticon package containing the media display images corresponding to the pronunciation mouth-shape images.
  • Optionally, the display module includes a display submodule, configured to play and display, through the display interface, the at least one frame of the target media display image with an obtained display image as the background picture.
  • The media display terminal matches at least one frame of the target media display image according to the collected call voice and plays the target media display images continuously, forming a video-style playing effect and converting speech recognition into a video display. Virtual video-telephone communication can thus break through the limitation of the network environment: the process does not depend on the network, saves traffic, and even removes the traffic restriction, so that the call is accompanied by virtualized video pictures and becomes more vivid and lively, enhancing communication and adding fun to the conversation.
  • The embodiment of the present invention further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the method described in the foregoing embodiments.
  • Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information, such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical disc storage, magnetic cartridges, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery medium.
  • Relational terms such as "first" and "second" are merely used to distinguish one entity or operation from another, without necessarily requiring or implying any such actual relationship or order between these entities or operations. The terms "comprises", "comprising", or any other variation are intended to cover a non-exclusive inclusion, so that a process, method, article, or terminal device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or terminal device that comprises the element.
  • The embodiment of the present invention breaks through the limitation of the network environment and performs virtual video-telephone communication. The process does not depend on the network, saves traffic, and even removes the traffic restriction, so that the call is accompanied by virtualized video pictures and becomes more vivid and lively; the communication effect is strengthened, fun is added to the conversation, and the user experience is improved.

Abstract

A medium displaying method and terminal. The method comprises: acquiring a target call voice (101); acquiring, according to the target call voice, at least one target medium display image frame matching the target call voice (102); and displaying, via a display interface, the at least one target medium display image frame (103).

Description

Media Display Method and Terminal

Technical Field

This document relates to, but is not limited to, the field of communication technologies, and in particular to a media display method and terminal.

Background

During a phone call, a traditional feature phone can only communicate by voice; a traditional call has no concrete video scene, and voice communication is far less vivid, concrete, and memorable than video communication. A mobile phone can enter a network environment through a SIM card or wireless WiFi to realize a video call.

Summary of the Invention
The following is an overview of the subject matter described in detail herein. This summary is not intended to limit the scope of the claims.

The embodiments of the present invention provide a media display method and terminal, adopting the following technical solutions.

In one aspect, an embodiment of the present invention provides a media display method, including:

collecting a target call voice;

acquiring, according to the target call voice, at least one frame of a target media display image matching the target call voice; and

playing and displaying the at least one frame of the target media display image through a display interface.
Optionally, the acquiring, according to the target call voice, of at least one frame of a target media display image matching the target call voice includes:

determining a target emoticon package corresponding to the target call voice;

determining, according to the voiceprint feature of the target call voice, at least one target pronunciation mouth shape required to produce the target call voice; and

matching, from the target emoticon package, at least one frame of the target media display image containing the target pronunciation mouth shape.
Optionally, the determining, according to the voiceprint feature of the target call voice, of at least one target pronunciation mouth shape required to produce the target call voice includes:

determining, according to the voiceprint feature of the target call voice, the text content corresponding to the target call voice;

matching, according to the text content, the corresponding pinyin combination from a correspondence table between text and pinyin;

matching, according to the pinyin combination, at least one corresponding pronunciation mouth shape from a correspondence table between pinyin and pronunciation mouth shapes; and

determining that the at least one pronunciation mouth shape is the at least one target pronunciation mouth shape required to produce the target call voice.
Optionally, the determining, according to the voiceprint feature of the target call voice, of at least one target pronunciation mouth shape required to produce the target call voice includes:

determining, according to the voiceprint feature of the target call voice, the text content corresponding to the target call voice;

matching, according to the text content, the phonetic combination corresponding to the text content from a correspondence table between text and phonetic transcription;

matching, according to the phonetic combination, at least one corresponding pronunciation mouth shape from a correspondence table between phonetic transcription and pronunciation mouth shapes; and

determining that the at least one pronunciation mouth shape is the at least one target pronunciation mouth shape required to produce the target call voice.
Optionally, the correspondence table between text and phonetic transcription includes a correspondence table between Chinese characters and pinyin, and the matching, according to the text content, of the phonetic combination corresponding to the text content from the correspondence table between text and phonetic transcription includes:

matching, according to the text content, the pinyin combination corresponding to the text content from the correspondence table between Chinese characters and pinyin.

The correspondence table between phonetic transcription and pronunciation mouth shapes includes a correspondence table between pinyin and pronunciation mouth shapes, and the matching, according to the phonetic combination, of at least one corresponding pronunciation mouth shape from the correspondence table between phonetic transcription and pronunciation mouth shapes includes:

matching, according to the pinyin combination, at least one corresponding pronunciation mouth shape from the correspondence table between pinyin and pronunciation mouth shapes.
Optionally, the correspondence table between pinyin and pronunciation mouth shapes includes correspondences between the initials and finals in pinyin and the pronunciation mouth shapes, and the matching, according to the pinyin combination, of at least one corresponding pronunciation mouth shape from the correspondence table between pinyin and pronunciation mouth shapes includes:

determining, according to the pinyin combination, the initials and finals contained in the pinyin combination, or determining the finals contained in the pinyin combination; and

matching, according to the initials and finals, or according to the finals, the corresponding at least one pronunciation mouth shape from the correspondences between the initials and finals and the pronunciation mouth shapes.
Optionally, the correspondence table between text and phonetic transcription includes a correspondence table between English and English phonetic symbols, and the matching, according to the text content, of the phonetic combination corresponding to the text content from the correspondence table between text and phonetic transcription includes:

matching, according to the text content, the phonetic symbol combination corresponding to the text content from the correspondence table between English and English phonetic symbols.

The correspondence table between phonetic transcription and pronunciation mouth shapes includes a correspondence table between English phonetic symbols and pronunciation mouth shapes, and the matching, according to the phonetic combination, of at least one corresponding pronunciation mouth shape from the correspondence table between phonetic transcription and pronunciation mouth shapes includes:

matching, according to the phonetic symbol combination, at least one corresponding pronunciation mouth shape from the correspondence table between English phonetic symbols and pronunciation mouth shapes.
Optionally, the correspondence table between English phonetic symbols and pronunciation mouth shapes includes correspondences between the vowels and consonants in the English phonetic symbols and the pronunciation mouth shapes, and the matching, according to the phonetic symbol combination, of at least one corresponding pronunciation mouth shape from the correspondence table between English phonetic symbols and pronunciation mouth shapes includes:

determining, according to the phonetic symbol combination, the vowels and consonants contained in the phonetic symbol combination, or determining the vowels contained in the phonetic symbol combination; and

matching, according to the vowels and consonants, or according to the vowels, the corresponding at least one pronunciation mouth shape from the correspondences between the vowels and consonants and the pronunciation mouth shapes.
Optionally, the determining of the target emoticon package corresponding to the target call voice includes:

determining the target contact corresponding to the target call voice; and

retrieving a target emoticon package pre-associated with the target contact.

Optionally, the collecting of the target call voice includes:

monitoring the voice call process; and

determining that the received voice of the other party is the target call voice.
Optionally, before the step of collecting the target call voice, the method further includes:

obtaining an asset resource package and a personal image of a call contact, wherein the asset resource package contains at least one media material image; and

integrating the personal image of the call contact with each media material image to generate at least one media display image, obtaining an emoticon package containing the at least one media display image.
Optionally, the personal image includes a facial image of a person, and the media material images include pronunciation mouth-shape images corresponding to the initials and finals in pinyin, or pronunciation mouth-shape images corresponding to the vowels and consonants in English phonetic symbols; the integrating of the personal image of the call contact with each media material image to generate at least one media display image and obtain an emoticon package containing the at least one media display image includes:

identifying the mouth region in the facial image;

filling and replacing the pronunciation mouth-shape images of the media material images into the mouth region; and

generating the media display image corresponding to each pronunciation mouth-shape image in the media material images, obtaining an emoticon package containing the media display images corresponding to the pronunciation mouth-shape images.
Optionally, the playing and displaying of the at least one frame of the target media display image through the display interface includes:

playing and displaying, through the display interface, the at least one frame of the target media display image with an obtained display image as the background picture.
In another aspect, an embodiment of the present invention further provides a media display terminal, including:

a collecting module, configured to collect a target call voice;

a first acquiring module, configured to acquire, according to the target call voice, at least one frame of a target media display image matching the target call voice; and

a display module, configured to play and display the at least one frame of the target media display image through a display interface.
Optionally, the first acquiring module includes:

a first determining submodule, configured to determine a target emoticon package corresponding to the target call voice;

a second determining submodule, configured to determine, according to the voiceprint feature of the target call voice, at least one target pronunciation mouth shape required to produce the target call voice; and

an obtaining submodule, configured to match, from the target emoticon package, at least one frame of the target media display image containing the target pronunciation mouth shape.
Optionally, the second determining submodule includes:
a first determining unit, configured to determine, according to the voiceprint feature of the target call voice, text content corresponding to the target call voice;
a first matching unit, configured to match, according to the text content, a corresponding phonetic combination from a text-to-phonetic-notation correspondence table;
a second matching unit, configured to match, according to the phonetic combination, at least one corresponding pronunciation mouth shape from a phonetic-notation-to-mouth-shape correspondence table;
a second determining unit, configured to determine that the at least one pronunciation mouth shape is the at least one target pronunciation mouth shape required for uttering the target call voice.
Optionally, the text-to-phonetic-notation correspondence table includes a correspondence table between Chinese characters and pinyin; the first matching unit includes:
a first matching subunit, configured to match, according to the text content, a pinyin combination corresponding to the text content from the correspondence table between Chinese characters and pinyin;
the phonetic-notation-to-mouth-shape correspondence table includes a correspondence table between pinyin and pronunciation mouth shapes, and the second matching unit includes:
a second matching subunit, configured to match, according to the pinyin combination, at least one corresponding pronunciation mouth shape from the correspondence table between pinyin and pronunciation mouth shapes.
Optionally, the correspondence table between pinyin and pronunciation mouth shapes includes correspondences between pinyin initials and finals and pronunciation mouth shapes; the second matching subunit is configured to:
determine, according to the pinyin combination, the initials and finals contained in the pinyin combination, or determine the finals contained in the pinyin combination;
match, according to the initials and finals, or according to the finals, at least one corresponding pronunciation mouth shape from the correspondences between the initials and finals and pronunciation mouth shapes.
Optionally, the text-to-phonetic-notation correspondence table includes a correspondence table between English text and English phonetic symbols; the first matching unit includes:
a third matching subunit, configured to match, according to the text content, a phonetic-symbol combination corresponding to the text content from the correspondence table between English text and English phonetic symbols;
the phonetic-notation-to-mouth-shape correspondence table includes a correspondence table between English phonetic symbols and pronunciation mouth shapes, and the second matching unit includes:
a fourth matching subunit, configured to match, according to the phonetic-symbol combination, at least one corresponding pronunciation mouth shape from the correspondence table between English phonetic symbols and pronunciation mouth shapes.
Optionally, the correspondence table between English phonetic symbols and pronunciation mouth shapes includes correspondences between vowels and consonants in the English phonetic symbols and pronunciation mouth shapes; the fourth matching subunit is configured to:
determine, according to the phonetic-symbol combination, the vowels and consonants contained in the phonetic-symbol combination, or determine the vowels contained in the phonetic-symbol combination;
match, according to the vowels and consonants, or according to the vowels, at least one corresponding pronunciation mouth shape from the correspondences between the vowels and consonants and pronunciation mouth shapes.
Optionally, the first determining submodule includes:
a third determining unit, configured to determine a target contact corresponding to the target call voice;
a retrieving unit, configured to retrieve a target emoticon package pre-associated with the target contact.
Optionally, the acquisition module (301) includes:
a monitoring submodule, configured to monitor a voice call process;
a third determining submodule, configured to determine that the received voice of the other party is the target call voice.
Optionally, the media display terminal further includes:
a second acquiring module, configured to acquire a material resource package and a personal image of a call contact, where the material resource package contains at least one media material image;
a generating module, configured to integrate the personal image of the call contact with each media material image to generate at least one media display image, obtaining an emoticon package containing the at least one media display image.
Optionally, the personal image includes a facial image of the individual, and the media material images include pronunciation mouth-shape images corresponding to pinyin initials and finals, or pronunciation mouth-shape images corresponding to vowels and consonants in the English phonetic symbols; the generating module includes:
an identifying submodule, configured to identify a mouth region in the facial image;
a replacing module, configured to fill and replace the mouth region with the pronunciation mouth-shape images in the media material images;
a generating submodule, configured to generate a media display image corresponding to each pronunciation mouth-shape image in the media material images, obtaining an emoticon package containing the media display image corresponding to each pronunciation mouth-shape image.
Optionally, the display module includes:
a display submodule, configured to play and display the at least one frame of target media display image through the display interface, with an acquired display image as the background picture.
The embodiments of the present invention have the following beneficial effects:
In the embodiments of the present invention, at least one frame of target media display image is matched according to the acquired call voice, and these target media display images are played and displayed continuously to form a video-style playback effect, converting speech recognition into video display. Through a local software application mode on the terminal device, virtual video-call communication can be carried out without the constraints of the network environment; the process does not depend on the network, saving data traffic or even dispensing with traffic limits, so that the call is accompanied by a virtualized video picture, making the call more vivid and lively, enhancing the communication effect, adding fun to communication, and improving the user experience.
Other aspects will become apparent upon reading and understanding the accompanying drawings and the detailed description.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a media display method in an embodiment of the present invention;
FIG. 2 is a schematic flowchart of another media display method in an embodiment of the present invention;
FIG. 3 is a structural block diagram of a media display terminal in an embodiment of the present invention;
FIG. 4 is a schematic diagram of an operator template in an embodiment of the present invention.
Detailed Description of the Embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Evidently, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Although a mobile phone can enter a network environment through a SIM card or wireless WiFi to make video calls, video calls usually depend on wireless networking, and in practice wireless network coverage is not available anytime and anywhere; making video calls over a cellular data connection is expensive. Because this approach depends on and is constrained by the network environment, video conversation cannot become the norm of communication, while voice-only exchange during calls is not flexible or vivid enough, leaving the user experience lacking.
An embodiment of the present invention discloses a media display method. As shown in FIG. 1, the method includes:
Step 101: acquire a target call voice.
This step may occur during a normal telephone call or during the playback of voice messages in instant messaging software such as WeChat. The target call voice may be the user's own voice in the call or the voice of the contact talking with the user; the target call voice becomes the object of the subsequent processing.
When applied to a normal telephone call, in one embodiment, the step of acquiring the target call voice may include: monitoring a voice call process; and determining that the received voice of the other party is the target call voice. That is, the call voice of the other contact in the call is acquired and subjected to the subsequent processing to realize a fun call.
Step 102: acquire, according to the target call voice, at least one frame of target media display image matching the target call voice.
The at least one frame of target media display image may be selected by matching from a preset group of image packages. It may match the content of the target call voice, or match the pitch, the volume, or the mouth shape required for uttering the target call voice. The target media display image may be a facial expression, a body movement, symbols of different heights, or a place and background environment corresponding to the voice content.
Step 103: play and display the at least one frame of target media display image through a display interface.
Correspondingly, during the continuous play and display of the at least one frame of target media display image, in addition to the facial expression changing with the voice, body movements may also be switched as the voice changes.
In this process, at least one frame of target media display image is matched according to the acquired call voice, and the target media display image is played and displayed to form a video-style playback effect, converting speech recognition into video display. Through a local software application mode on the terminal device, virtual video-call communication can be carried out without the constraints of the network environment; the process does not depend on the network, saving data traffic or even dispensing with traffic limits, so that the call is accompanied by a virtualized video picture, making the call more vivid and lively, enhancing the communication effect, adding fun to communication, and improving the user experience.
An embodiment of the present invention further discloses a media display method. As shown in FIG. 2, the method includes:
Step 201: acquire a target call voice.
When applied to a normal telephone call, in one embodiment, the step of acquiring the target call voice may include: monitoring a voice call process; and determining that the received voice of the other party is the target call voice. That is, the call voice of the other contact in the call is acquired and subjected to the subsequent processing to realize a fun call.
Step 202: determine a target emoticon package corresponding to the target call voice.
The target emoticon package may be a preset fixed emoticon package matching different target call voices, determined and read directly from a storage device; or it may be a targeted emoticon package that varies with specific elements of the target call voice, which needs to be determined by matching according to the target call voice.
In one embodiment, the step of determining the target emoticon package corresponding to the target call voice includes:
determining a target contact corresponding to the target call voice; and retrieving a target emoticon package pre-associated with the target contact. The emoticon package may be resource content pre-associated with the target contact. For example, when a known contact calls the user's mobile phone, the phone can learn that the target call voice comes from that contact and, based on the address book information, adapt the emoticon package corresponding to the calling contact, which may be a photo of the target contact or a specific picture. Making different settings for specific contacts makes the display results more targeted and more interesting.
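The per-contact pre-association described above can be modeled as a simple lookup with a default fallback. The following is a minimal sketch, not the patented implementation; the names and package identifiers are hypothetical:

```python
# Hypothetical sketch: map a caller (identified via the address book) to a
# pre-associated emoticon package, falling back to a default package.
DEFAULT_PACKAGE = "default_pack"

# Pre-associated packages, e.g. configured by the user per contact.
CONTACT_PACKAGES = {
    "Alice": "alice_photo_pack",
    "Bob": "bob_cartoon_pack",
}

def target_emoticon_package(contact_name):
    """Return the emoticon package pre-associated with the target contact."""
    return CONTACT_PACKAGES.get(contact_name, DEFAULT_PACKAGE)
```

A contact without a pre-associated package simply receives the default, so the display path never fails for unknown callers.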
Step 203: determine, according to the voiceprint feature of the target call voice, at least one target pronunciation mouth shape required for uttering the target call voice.
Different call voices correspond to different pronunciation mouth shapes, and different voices have different voiceprint feature information. Optionally, the required pronunciation mouth shapes may be determined according to the voiceprint features of the acquired target call voice.
In one embodiment, the step of determining, according to the voiceprint feature of the target call voice, the at least one target pronunciation mouth shape required for uttering the target call voice includes:
determining, according to the voiceprint feature of the target call voice, text content corresponding to the target call voice; matching, according to the text content, a phonetic combination corresponding to the text content from a text-to-phonetic-notation correspondence table; matching, according to the phonetic combination, at least one corresponding pronunciation mouth shape from a phonetic-notation-to-mouth-shape correspondence table; and determining that the at least one pronunciation mouth shape is the at least one target pronunciation mouth shape required for uttering the target call voice.
Here, the phonetic notation is the annotated pronunciation of the text. Different languages correspond to different scripts, and different scripts correspond to different phonetic-notation systems: for example, when the text is Chinese, the corresponding phonetic notation is pinyin; when the text is English, the corresponding phonetic notation is the English phonetic symbols. The process of determining the corresponding pronunciation mouth shapes from the voiceprint features of the target call voice first converts the voice into text, then matches the text to the corresponding phonetic combination, and then matches the target pronunciation mouth shapes corresponding to that phonetic combination.
Optionally, this process is implemented as follows: receive the voice, quantize the voice data, and call an open-source interface to convert the voice into text. The principle is that different voices have different voiceprint feature information; voiceprint feature information and the corresponding text can be recorded in advance to build a database of correspondences, and after a new voice is captured it is compared against the pre-recorded database to find the corresponding text. The text-to-phonetic conversion works the same way: the correspondences between different phonetic notations and different texts are recorded in advance and written into an array to build the database. After the voice is converted into text, the text is looked up in the database to find its phonetic notation, and the target pronunciation mouth shape corresponding to that notation is then obtained from a preset notation-to-mouth-shape correspondence table in the database. The notation is split and matched to expression images, which are switched rapidly on display to produce a virtual video effect.
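The comparison of a newly captured voice against the pre-recorded voiceprint database can be sketched as a nearest-match lookup over quantized feature vectors. This is only an illustrative sketch under the assumption that the features are fixed-length numeric vectors; the feature values and texts below are invented for illustration:

```python
import math

# Hypothetical stand-in for the pre-recorded database described above:
# each entry pairs a quantized voiceprint feature vector with its text.
VOICEPRINT_DB = [
    ((0.9, 0.1, 0.3), "你"),
    ((0.2, 0.8, 0.5), "好"),
]

def match_text(features):
    """Return the text whose recorded voiceprint features are closest
    (by Euclidean distance) to the newly captured, quantized features."""
    def dist(entry):
        ref, _ = entry
        return math.dist(ref, features)
    _, text = min(VOICEPRINT_DB, key=dist)
    return text
```

In practice the open-source speech-recognition interface mentioned above would perform this step; the sketch only shows the compare-against-database principle.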
On the one hand, when the text content corresponding to the target call voice is determined to be Chinese characters, in one embodiment, the text-to-phonetic-notation correspondence table includes a correspondence table between Chinese characters and pinyin, and matching, according to the text content, the phonetic combination corresponding to the text content from the text-to-phonetic-notation correspondence table includes:
matching, according to the text content, a pinyin combination corresponding to the text content from the correspondence table between Chinese characters and pinyin.
Correspondingly, the phonetic-notation-to-mouth-shape correspondence table includes a correspondence table between pinyin and pronunciation mouth shapes, and matching, according to the phonetic combination, the at least one corresponding pronunciation mouth shape from the phonetic-notation-to-mouth-shape correspondence table includes: matching, according to the pinyin combination, at least one corresponding pronunciation mouth shape from the correspondence table between pinyin and pronunciation mouth shapes.
The conversion from Chinese characters to pinyin works the same way: the correspondences between pinyin and Chinese characters are recorded in advance by reading the Chinese pinyin table from the standard GBK character set database and writing it into an array to build the database. After the voice is converted into Chinese characters, the characters are looked up in the database to find their pinyin, and the target pronunciation mouth shapes corresponding to the pinyin are then obtained from a preset pinyin-to-mouth-shape correspondence table in the database. The pinyin is split and matched to expression images, which are switched rapidly on display to produce a virtual video effect.
In one embodiment, the correspondence table between pinyin and pronunciation mouth shapes includes correspondences between pinyin initials and finals and pronunciation mouth shapes, and matching, according to the pinyin combination, the at least one corresponding pronunciation mouth shape from the correspondence table between pinyin and pronunciation mouth shapes includes:
determining, according to the pinyin combination, the initials and finals contained in the pinyin combination, or determining the finals contained in the pinyin combination; and matching, according to the initials and finals, or according to the finals, at least one corresponding pronunciation mouth shape from the correspondences between the initials and finals and pronunciation mouth shapes.
In this process, after the pinyin combination corresponding to the text content is matched from the correspondence table between Chinese characters and pinyin, the pinyin is located in the array structure and split into initials and finals. Since some Chinese pronunciations correspond to a final alone, the determined pinyin combination may contain a combination of initials and finals, or finals only. Expression images are matched according to the mouth shapes corresponding to the initials and finals and switched in the display window; the rapid switching of expressions produces a virtual video presentation. Optionally, the correspondence table between pinyin and pronunciation mouth shapes may also be generated by import or download at implementation time.
This solution takes making a call on a mobile phone to realize virtual video as an example. The storage device is provided with a mouth-shape resource library for the Chinese pinyin letters. In the pre-collected standard mouth-shape library, each mouth-shape image corresponds to a pronunciation example of a pinyin letter, and the mouth-shape diagrams correspond one-to-one with the initials and finals in the pinyin table, so that each pronunciation can be associated with the corresponding initial and final image combination in the library, making the displayed pronunciation mouth shape truly consistent with the mouth movements of the actual speaker.
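The splitting of a pinyin syllable into initial and final described above, including the finals-only case, can be sketched as follows. This is a minimal illustration, not the patented implementation; the image file names are hypothetical, and tone marks are ignored:

```python
# Hypothetical sketch: split a pinyin syllable into initial (shengmu) and
# final (yunmu). Syllables such as "an" have no initial, matching the
# "finals only" case described above.
INITIALS = ("zh", "ch", "sh",  # two-letter initials must be tried first
            "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

def split_pinyin(syllable):
    """Return (initial, final); initial is '' when the syllable has none."""
    for ini in INITIALS:
        if syllable.startswith(ini):
            return ini, syllable[len(ini):]
    return "", syllable

# Illustrative component-to-image table (file names invented).
MOUTH_SHAPES = {"n": "mouth_n.png", "i": "mouth_i.png",
                "h": "mouth_h.png", "ao": "mouth_ao.png", "an": "mouth_an.png"}

def mouth_images(syllable):
    """Map each non-empty pinyin component to its mouth-shape image."""
    return [MOUTH_SHAPES[p] for p in split_pinyin(syllable) if p]
```

Trying the two-letter initials zh/ch/sh before the single letters is the one ordering subtlety; otherwise "zhao" would wrongly split as ("z", "hao").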
On the other hand, when the text content corresponding to the target call voice is determined to be English, in one embodiment, the text-to-phonetic-notation correspondence table includes a correspondence table between English text and English phonetic symbols, and matching, according to the text content, the phonetic combination corresponding to the text content from the text-to-phonetic-notation correspondence table includes:
matching, according to the text content, a phonetic-symbol combination corresponding to the text content from the correspondence table between English text and English phonetic symbols.
Correspondingly, the phonetic-notation-to-mouth-shape correspondence table includes a correspondence table between English phonetic symbols and pronunciation mouth shapes, and matching, according to the phonetic combination, the at least one corresponding pronunciation mouth shape from the phonetic-notation-to-mouth-shape correspondence table includes: matching, according to the phonetic-symbol combination, at least one corresponding pronunciation mouth shape from the correspondence table between English phonetic symbols and pronunciation mouth shapes.
The conversion from English text to English phonetic symbols works the same way: the correspondences between English phonetic symbols and English text are recorded in advance and written into an array to build the database. After the voice is converted into English text, the text is looked up in the database to find its phonetic symbols, and the target pronunciation mouth shapes corresponding to the phonetic symbols are then obtained from a preset symbol-to-mouth-shape correspondence table in the database. The phonetic symbols are split and matched to expression images, which are switched rapidly on display to produce a virtual video effect.
In one embodiment, the correspondence table between English phonetic symbols and pronunciation mouth shapes includes correspondences between vowels and consonants in the English phonetic symbols and pronunciation mouth shapes, and matching, according to the phonetic-symbol combination, the at least one corresponding pronunciation mouth shape from the correspondence table between English phonetic symbols and pronunciation mouth shapes includes:
determining, according to the phonetic-symbol combination, the vowels and consonants contained in the phonetic-symbol combination, or determining the vowels contained in the phonetic-symbol combination; and matching, according to the vowels and consonants, or according to the vowels, at least one corresponding pronunciation mouth shape from the correspondences between the vowels and consonants and pronunciation mouth shapes.
In this process, after the phonetic-symbol combination corresponding to the text content is matched from the correspondence table between English text and English phonetic symbols, the phonetic symbols are located in the array structure and split into vowels and consonants. Since some English pronunciations correspond to a vowel alone, the determined phonetic-symbol combination may contain a combination of vowels and consonants, or vowels only. Expression images are matched according to the mouth shapes corresponding to the vowels and consonants and switched in the display window; the rapid switching of expressions produces a virtual video presentation. Optionally, the correspondence table between English phonetic symbols and pronunciation mouth shapes may also be generated by import or download at implementation time.
This solution takes making a call on a mobile phone to realize virtual video as an example. The storage device is provided with a mouth-shape resource library for the English phonetic symbols. In the pre-collected standard mouth-shape library, each mouth-shape image corresponds to a pronunciation example of an English phonetic symbol, and the mouth-shape diagrams correspond one-to-one with the vowels and consonants in the English phonetic symbol table, so that each pronunciation can be associated with the corresponding vowel and consonant pronunciation images in the library, making the displayed pronunciation mouth shape truly consistent with the mouth movements of the actual speaker.
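For the English case described above, classifying a phonetic-symbol combination into vowels and consonants and mapping each symbol to a mouth-shape image can be sketched as follows. The vowel set uses common IPA symbols for English; the word, its transcription, and the file names are illustrative assumptions:

```python
# Hypothetical sketch: split a phonetic-symbol combination into vowels and
# consonants, then look up each symbol's mouth-shape image.
VOWELS = {"i:", "ɪ", "e", "æ", "ɑ:", "ɒ", "ɔ:", "ʊ", "u:", "ʌ", "ə", "ɜ:",
          "eɪ", "aɪ", "ɔɪ", "aʊ", "əʊ", "ɪə", "eə", "ʊə"}

def classify(phonemes):
    """Return (vowels, consonants) contained in the phonetic-symbol combination."""
    vowels = [p for p in phonemes if p in VOWELS]
    consonants = [p for p in phonemes if p not in VOWELS]
    return vowels, consonants

def mouth_images(phonemes, table):
    """Look up each phonetic symbol in a symbol-to-mouth-shape table."""
    return [table[p] for p in phonemes]
```

As with the pinyin case, a combination may contain vowels only, so `classify` may return an empty consonant list.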
Step 204: match, from the target emoticon package, at least one frame of target media display image containing the target pronunciation mouth shape.
Optionally, the target emoticon package is an emoticon package containing different mouth shapes, and the media display images contained in it are display images in which the mouth opens and closes to different degrees. Optionally, in addition to the target pronunciation mouth shape, a target media display image may contain other varying elements, such as changes of eyebrows, expressions, and face shapes, to be combined with the mouth shape.
Step 205: play and display the at least one frame of target media display image through the display interface.
In this process, while the two parties converse, the user's terminal, for example a mobile phone, adapts the associated expressions according to speech recognition; after synthesis and optimization of the corresponding image resources, the pictures play continuously along with the voice and the expressions keep switching and updating, so that a virtual video scene can be generated without a network environment, realizing a virtually displayed video conversation and making the exchange more effective and more fun.
下面对该过程进行描述。该语音转换文字拼音并适配表情的技术实现步骤如下:The process is described below. The technical implementation steps of the speech conversion text pinching and adapting the expression are as follows:
Step 1. Read the Chinese-character pinyin content from the standard GBK (Chinese character encoding) character-set database in advance and write the pinyin data into an array. Each array element is a structure with four fields: the pinyin, the initial obtained by splitting the pinyin, the final, and the Chinese character corresponding to the pinyin.
Step 2. Convert the speech to text through an open-source interface, look up the text as a key in the array to find the corresponding pinyin, and then split the pinyin within the corresponding structure to obtain the associated initial and final.
Step 3. Look up the expressions associated with the initials and finals, create a visual dialog box on the desktop as a user interface (UI) window, and render the expressions in the UI window.
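The three steps above can be sketched in Python as follows. This is an illustrative sketch only, not part of the patent disclosure: the table contents and function names are hypothetical stand-ins for the GBK-derived pinyin array and the lip-shape lookup keys.

```python
# Initials list: two-letter initials must be tested before single letters.
PINYIN_INITIALS = (
    "zh", "ch", "sh",
    "b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h",
    "j", "q", "x", "r", "z", "c", "s", "y", "w",
)

def split_pinyin(pinyin):
    """Split a toneless pinyin syllable into (initial, final)."""
    for ini in PINYIN_INITIALS:
        if pinyin.startswith(ini):
            return ini, pinyin[len(ini):]
    return "", pinyin  # zero-initial syllable, e.g. "an"

# Tiny hand-made stand-in for the array read from the GBK database in Step 1;
# each entry mirrors the four-field struct (character -> pinyin, split below).
HANZI_TABLE = {
    "你": "ni",
    "好": "hao",
}

def text_to_mouth_keys(text):
    """Steps 2-3: map recognized text to the (initial, final) keys used to
    look up the associated lip-shape expressions."""
    keys = []
    for ch in text:
        pinyin = HANZI_TABLE[ch]
        keys.append(split_pinyin(pinyin))
    return keys

print(text_to_mouth_keys("你好"))  # [('n', 'i'), ('h', 'ao')]
```

The (initial, final) pairs produced here are then used to select expression images for display in the UI window.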
When the at least one frame of target media display image is played and displayed continuously, the presentation time can be determined by first computing the duration of the target call speech and, from that duration, deriving the display time of each frame of target media display image; continuous playback then proceeds according to that display time. Optionally, dividing the duration of the target call speech by the number of frames of target media display images yields the display time of one frame. Taking "你好" (hello) as an example, suppose the speech "你好" lasts 1 second and decomposes into expressions corresponding to 4 images; each image is then displayed for 1/4 second, i.e. 250 milliseconds. The at least one frame of target media display image matched to the target call speech is switched and presented continuously in the order in which the target call speech was received. The expressions shown in the window thus correspond to the received speech; during the call, as the speech keeps changing, the user's emoticon images switch rapidly in the UI window, producing a video effect and thereby realizing a virtual video-call function.
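The per-frame display time described above (speech duration divided by frame count, e.g. 1 second over 4 frames giving 250 ms for "你好") can be expressed as a small helper. This is an illustrative sketch, not part of the patent disclosure:

```python
def frame_duration_ms(speech_ms, frame_count):
    """Display time of each lip-shape frame: total speech duration
    divided evenly over the matched frames."""
    if frame_count <= 0:
        raise ValueError("need at least one frame")
    return speech_ms / frame_count

# "你好": 1 second of speech decomposed into 4 expression images.
print(frame_duration_ms(1000, 4))  # 250.0 ms per image
```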
As an embodiment, the step of playing and displaying the at least one frame of target media display image through the display interface includes: playing and displaying, through the display interface, the at least one frame of target media display image with an obtained display image as the background picture.
When the target media display images are played, a playback background may be added. The background may be a preset fixed background corresponding to the contact, or a matching image obtained according to keywords or words recognized in the target call speech, so that the background picture changes with the content of the user's speech.
This process mainly performs, without a wireless network, lip-shape lookup and matching through speech recognition, and composites the result with the known scene images (background pictures) in the library, thereby reproducing a virtual scene in a network-free environment. With continuously switching scenes and expressions, a virtual video-call display can be realized even offline, making conversation more effective and entertaining.
Optionally, the core implementation flow is as follows. For example, during a voice call on mobile phones, both parties' phones hold pre-stored expressions and scene images. When the other party's speech is received, the local speech-recognition module is started first, and the corresponding media display resources are switched according to the recognition result and shown in the terminal's call interface. As the speech changes, speech recognition continuously matches the corresponding expression resources; the terminal interface may display a fixed background picture or match the background to the speech content, and the expressions switch against the background picture in the terminal interface. The visual effect of this rapid switching is a realistic video-call scene, so a virtual-reality video call is realized locally.
作为一实施方式,其中,采集目标通话语音的步骤之前,还包括:As an embodiment, before the step of collecting the target call voice, the method further includes:
Obtaining a material resource package and a personal image of the call contact, where the material resource package contains at least one media material image; and integrating the personal image of the call contact with each media material image to generate at least one media display image, obtaining an emoticon package containing the at least one media display image.
The personal image may be a personal expression image, or an image associated with and related to the call contact. This process corresponds to the initialization of the emoticon package containing the media display images, so that at least one matching frame of target media display image can be obtained according to the target call speech. The material resource package may be downloaded in advance while a network is available. The integration of the personal image with each media material image may be implemented by methods such as matte filling, partial replacement, or partial overlay.
Optionally, for example, the user installs virtual-simulation software on the phone; the software can set arbitrary image scenes, upload user images, initialize presets, and generate the user's emoticon package. The user's phone first stores the required image resources, which may be photographed in advance or downloaded over the network to the phone, and which usually include the user's personal image, the material resource package, and scene pictures of typical video calls. The material resource package is, for example, a lip-shape resource package, which is integrated into the software of this embodiment of the invention for the user's use. Taking the lip-shape resource package as an example, the software itself integrates the lip-shape image resources; before use, the user provides a personal image and initializes the software, which composites the lip-shape images with the personal image, optimizes the result, and merges the lip-shape package into the user's own image, generating an emoticon package in which the user's face corresponds to the letters of the pinyin chart. The image-compositing technique first clears the mouth and its edge region in the user's face image, overlays a lip-shape resource of the same size, and then optimizes the image, obtaining a user-customized emoticon package corresponding to the pinyin alphabet.
As an embodiment, the personal image includes a facial image of the person, and the media material images include pronunciation lip-shape images corresponding to the initials and finals of pinyin, or pronunciation lip-shape images corresponding to the vowels and consonants of the English phonetic alphabet. The step of integrating the personal image of the call contact with each media material image to generate at least one media display image and obtain an emoticon package containing the at least one media display image includes:
Identifying the mouth region in the facial image; filling and replacing the mouth region with the pronunciation lip-shape images from the media material images; and generating a media display image corresponding to each pronunciation lip-shape image in the media material images, obtaining an emoticon package containing a media display image corresponding to each pronunciation lip-shape image.
Taking media material images that include pronunciation lip-shape images for the initials and finals of pinyin as an example, the face and lip-shape emoticon-package synthesis is implemented in the following steps:
Step 1. Taking a call between parties A and B as an example, the voice communication device held by A stores B's image resources in advance, together with lip-shape image resources for the pronunciation of every letter of the Chinese pinyin alphabet.
Step 2. Convert the face and lip-shape color images to grayscale. For color-to-gray conversion, the classic formula is Gray = R*0.299 + G*0.587 + B*0.114. To avoid slow floating-point arithmetic, an integer algorithm with rounding gives the equivalent variant Gray = (R*30 + G*59 + B*11 + 50)/100, improving conversion efficiency.
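The two grayscale formulas in Step 2 can be checked against each other in a few lines of Python. This sketch is illustrative only; it simply verifies that the integer variant tracks the floating-point formula to within one gray level:

```python
def gray_float(r, g, b):
    """Classic floating-point luminance formula."""
    return r * 0.299 + g * 0.587 + b * 0.114

def gray_int(r, g, b):
    """Integer variant from the text; the +50 implements rounding
    before the /100 truncation."""
    return (r * 30 + g * 59 + b * 11 + 50) // 100

for rgb in ((0, 0, 0), (255, 255, 255), (200, 120, 40)):
    f, i = gray_float(*rgb), gray_int(*rgb)
    assert abs(f - i) <= 1.0  # the two formulas agree within one level

print(gray_int(255, 255, 255))  # 255
```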
Step 3. Take a grayscale threshold and segment the face with it to detect the mouth region. Perform edge detection on the face image and process the grayscale image with a mean-operator template (each pixel value is updated to the mean of its neighboring pixels) to detect facial feature regions such as the eyes, mouth, and nose; alternatively, identify the mouth region from facial symmetry and structural distribution.
Step 4. Fill the mouth region detected in the face with the original lip-shape resources to generate expressions: quantize and sample each row of pixel values of the lip-shape image from Step 1 so that its pixel count matches that of the mouth region from Step 3, replace the mouth region of Step 3 with the resampled lip-shape resource, and reconstruct the facial expression image.
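The row-wise quantized sampling of Step 4 amounts to resampling each lip-image row to the pixel count of the detected mouth region. A minimal nearest-neighbour sketch (illustrative only; the patent does not specify the exact resampling method) is:

```python
def resample_row(row, target_len):
    """Stretch or shrink one row of pixel values to target_len samples
    using nearest-neighbour sampling."""
    if target_len <= 0:
        return []
    src_len = len(row)
    return [row[i * src_len // target_len] for i in range(target_len)]

# A 4-pixel lip row stretched to an 8-pixel mouth region.
print(resample_row([10, 20, 30, 40], 8))
# [10, 10, 20, 20, 30, 30, 40, 40]
```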
Step 5. Apply Gaussian filtering to the newly generated expression images to enhance the smoothness of the composites, and generate an expression resource library in which each expression image shows the face making the pronunciation lip shape of one letter of the alphabet.
Compositing the face with different lip shapes generates different expression images, each corresponding to the pronunciation lip-shape expression of one pinyin letter. The newly generated facial expression images need denoising and smoothing by Gaussian filtering to obtain clear expression images. The template computed from the Gaussian function contains floating-point numbers; to balance filtering quality against computational efficiency, an integer 5×5 template operator with coefficient 1/273 is used, as illustrated in the example of Figure 4.
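A commonly used integer 5×5 Gaussian kernel has entries summing to 273, matching the 1/273 coefficient mentioned above; since the patent's actual template in Figure 4 is not reproduced here, the kernel below is an assumption. The sketch applies it to one pixel neighbourhood:

```python
# Classic integer 5x5 Gaussian template; its 25 entries sum to 273.
KERNEL = [
    [1,  4,  7,  4, 1],
    [4, 16, 26, 16, 4],
    [7, 26, 41, 26, 7],
    [4, 16, 26, 16, 4],
    [1,  4,  7,  4, 1],
]

def gaussian_at(image, x, y):
    """Filtered value at (x, y); image is a list of equal-length rows,
    and the 5x5 window must fit inside it."""
    acc = 0
    for dy in range(-2, 3):
        for dx in range(-2, 3):
            acc += KERNEL[dy + 2][dx + 2] * image[y + dy][x + dx]
    return acc // 273  # normalize by the kernel sum

flat = [[100] * 5 for _ in range(5)]
print(gaussian_at(flat, 2, 2))  # a constant patch stays at 100
```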
In this process, at least one frame of target media display image is matched from the collected call speech, and these target media display images are played and displayed continuously, producing a video-style playback effect and converting speech recognition into a video display. Through a local software application on the terminal device, virtual video-call communication can break free of network-environment constraints: the process does not depend on the network and saves, or even dispenses with, data traffic, so the call is accompanied by a virtualized video picture, making it more vivid and lively, strengthening the communication effect, and adding fun to communication.
An embodiment of the present invention further discloses a media display terminal which, as shown in FIG. 3, includes a collection module 301, a first acquisition module 302, and a display module 303. The media display terminal may be a voice-capable terminal such as a smart watch or a mobile phone.
采集模块301,设置为采集目标通话语音。The collecting module 301 is configured to collect a target call voice.
第一获取模块302,设置为根据所述目标通话语音,获取与所述目标通话语音匹配的至少一帧目标媒体显示图像。The first obtaining module 302 is configured to acquire, according to the target call voice, at least one frame target media display image that matches the target call voice.
显示模块303,设置为通过显示界面对所述至少一帧目标媒体显示图像进行播放显示。The display module 303 is configured to perform play display on the at least one frame of the target media display image through the display interface.
The first acquisition module includes:
第一确定子模块,设置为确定与所述目标通话语音对应的目标表情包。The first determining submodule is configured to determine a target emoticon packet corresponding to the target call voice.
第二确定子模块,设置为依据所述目标通话语音的声纹特征,确定发出所述目标通话语音所需的至少一个目标发音口型。The second determining submodule is configured to determine at least one target pronunciation type required to issue the target call voice according to the voiceprint feature of the target call voice.
获取子模块,设置为从所述目标表情包中匹配得到包含所述目标发音口型的至少一帧目标媒体显示图像。Obtaining a sub-module, configured to obtain at least one frame of the target media display image including the target pronunciation lip shape from the target emoticon package.
其中,所述第二确定子模块,包括:The second determining submodule includes:
第一确定单元,设置为依据所述目标通话语音的声纹特征,确定与所述目标通话语音对应的文字内容。The first determining unit is configured to determine the text content corresponding to the target call voice according to the voiceprint feature of the target call voice.
第一匹配单元,设置为依据所述文字内容,从文字与注音的对应关系表中,匹配与所述文字内容对应的注音组合。The first matching unit is configured to match the phonetic combination corresponding to the text content from the correspondence table between the text and the phonetic according to the text content.
The second matching unit is configured to match, according to the phonetic combination, at least one corresponding pronunciation lip shape from the correspondence table between phonetics and pronunciation lip shapes.
第二确定单元,设置为确定所述至少一个发音口型为发出所述目标通话语音所需的至少一个目标发音口型。The second determining unit is configured to determine that the at least one pronunciation port type is at least one target pronunciation type required to issue the target call voice.
其中,所述文字与注音的对应关系表中包括汉字与拼音的对应关系表;所述第一匹配单元包括:The correspondence table between the characters and the phonetic includes a correspondence table between Chinese characters and pinyin; the first matching unit includes:
第一匹配子单元,设置为根据所述文字内容,从汉字与拼音的对应关系表中,匹配与所述文字内容对应的拼音组合。The first matching subunit is configured to match the pinyin combination corresponding to the text content from the correspondence table between the Chinese character and the pinyin according to the text content.
所述注音与发音口型的对应关系表中包括拼音与发音口型的对应关系表,所述第二匹配单元包括:The correspondence table between the phonetic transcription and the pronunciation mouth shape includes a correspondence table between the pinyin and the pronunciation mouth shape, and the second matching unit includes:
第二匹配子单元,设置为依据所述拼音组合,从拼音与发音口型的对应关系表中,匹配对应的至少一个发音口型。The second matching subunit is configured to match the corresponding at least one pronunciation lip shape from the correspondence table between the pinyin and the pronunciation lip shape according to the pinyin combination.
其中,所述拼音与发音口型的对应关系表中包括拼音中声母及韵母与发音口型的对应关系;所述第二匹配子单元是设置为:The correspondence table between the pinyin and the pronunciation mouth shape includes a correspondence between the initials and the finals in the pinyin and the pronunciation type; the second matching subunit is set as:
Determine, according to the pinyin combination, the initials and finals contained in the pinyin combination, or determine the finals contained in the pinyin combination; and match, according to the initials and finals, or according to the finals, at least one corresponding pronunciation lip shape from the correspondence between initials and finals and pronunciation lip shapes.
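The matching this sub-unit performs can be sketched as a table lookup from initials and finals to lip-shape images. The table contents and file names below are hypothetical placeholders, not the patent's actual lip-shape resource library:

```python
# Hypothetical correspondence table: pinyin initials/finals -> lip-shape images.
MOUTH_SHAPES = {
    "n": "mouth_n.png",
    "i": "mouth_i.png",
    "h": "mouth_h.png",
    "ao": "mouth_ao.png",
    "an": "mouth_an.png",
}

def shapes_for(parts):
    """parts: list of (initial, final) pairs; a zero-initial syllable has
    an empty initial, which is skipped so only the final is matched."""
    out = []
    for ini, fin in parts:
        if ini:
            out.append(MOUTH_SHAPES[ini])
        out.append(MOUTH_SHAPES[fin])
    return out

print(shapes_for([("n", "i"), ("h", "ao")]))
# ['mouth_n.png', 'mouth_i.png', 'mouth_h.png', 'mouth_ao.png']
```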
其中,所述文字与注音的对应关系表中包括英文与英语音标的对应关系表;所述第一匹配单元包括:The correspondence table between the text and the phonetic includes a correspondence table between the English and the English phonetic symbols; the first matching unit includes:
第三匹配子单元,设置为根据所述文字内容,从英文与英语音标的对应关系表中,匹配与所述文字内容对应的音标组合。The third matching subunit is configured to match the phonetic symbol corresponding to the text content from the correspondence table between the English and the English phonetic symbols according to the text content.
所述注音与发音口型的对应关系表中包括英语音标与发音口型的对应关系表,所述第二匹配单元包括:The correspondence table between the phonetic transcription and the pronunciation mouth shape includes a correspondence table between the English phonetic symbols and the pronunciation mouth shape, and the second matching unit includes:
第四匹配子单元,设置为依据所述音标组合,从英语音标与发音口型的对应关系表中,匹配对应的至少一个发音口型。The fourth matching subunit is configured to match the corresponding at least one pronunciation port type from the correspondence table between the English phonetic symbols and the pronunciation mouth shape according to the phonetic symbol combination.
其中,所述英语音标与发音口型的对应关系表中包括英语音标中元音及辅音与发音口型的对应关系;所述第四匹配子单元是设置为: The correspondence table between the English phonetic symbols and the pronunciation mouth shape includes a correspondence relationship between the vowels in the English phonetic symbols and the consonants and the pronunciation mouth shape; the fourth matching subunit is set as:
Determine, according to the phonetic-symbol combination, the vowels and consonants contained in the phonetic-symbol combination, or determine the vowels contained in the phonetic-symbol combination; and match, according to the vowels and consonants, or according to the vowels, at least one corresponding pronunciation lip shape from the correspondence between vowels and consonants and pronunciation lip shapes.
其中,所述第一确定子模块,包括:The first determining submodule includes:
第三确定单元,设置为确定所述目标通话语音所对应的目标联系人。The third determining unit is configured to determine a target contact corresponding to the target call voice.
调取单元,设置为调取与所述目标联系人预关联的目标表情包。The retrieval unit is configured to retrieve a target expression package pre-associated with the target contact.
其中,所述采集模块,包括:The collection module includes:
监听子模块,设置为监听语音通话进程。The monitor submodule is set to listen to the voice call process.
第三确定子模块,设置为确定接收到的对方通话语音为所述目标通话语音。The third determining submodule is configured to determine that the received counterpart voice is the target call voice.
其中,该终端还包括:The terminal further includes:
第二获取模块,设置为获取素材资源包及通话联系人的个人图像,其中,所述素材资源包中包含至少一个媒体素材图像。The second obtaining module is configured to obtain a personal image of the asset resource package and the call contact, wherein the asset resource package includes at least one media material image.
生成模块,设置为将所述通话联系人的个人图像与每一所述媒体素材图像进行整合,生成至少一个媒体显示图像,得到包含所述至少一个媒体显示图像的表情包。And a generating module, configured to integrate the personal image of the call contact with each of the media material images to generate at least one media display image, to obtain an emoticon package including the at least one media display image.
其中,所述个人图像中包括个人的脸部图像,所述媒体素材图像中包括拼音中声母及韵母所对应的发音口型图像或英语音标中元音及辅音所对应的发音口型图像,所述生成模块,包括:Wherein, the personal image includes a facial image of a person, and the image of the media material includes a pronunciation mouth shape image corresponding to the initial and the final voice in the pinyin, or a pronunciation mouth image corresponding to the vowel and the consonant in the English phonetic symbol. The generation module includes:
识别子模块,设置为识别所述脸部图像中的嘴部区域。An identification sub-module is provided to identify a mouth region in the facial image.
The replacement module is configured to fill and replace the mouth region with the pronunciation lip-shape images from the media material images.
生成子模块,设置为生成与所述媒体素材图像中的每一所述发音口型图像对应的媒体显示图像,得到包含与每一所述发音口型图像对应的媒体显示图像的表情包。And generating a sub-module, configured to generate a media display image corresponding to each of the vocal-mouth images in the media material image, to obtain an expression package including a media display image corresponding to each of the vocal-mouth images.
其中,所述显示模块,包括: The display module includes:
显示子模块,设置为通过所述显示界面,以获取得到的一显示图像为背景画面,对所述至少一帧目标媒体显示图像进行播放显示。The display sub-module is configured to obtain, by using the display interface, a obtained display image as a background image, and play and display the at least one frame of the target media display image.
According to the collected call speech, the media display terminal matches at least one frame of target media display image and plays and displays these target media display images continuously, producing a video-style playback effect and converting speech recognition into a video display. Through a local software application on the terminal device, virtual video-call communication can break free of network-environment constraints: the process does not depend on the network and saves, or even dispenses with, data traffic, so the call is accompanied by a virtualized video picture, making it more vivid and lively, strengthening the communication effect, and adding fun to communication.
本发明实施例还提供了一种计算机可读存储介质,存储有计算机可执行指令,所述计算机可执行指令被处理器执行时实现上述实施例所述的方法。The embodiment of the invention further provides a computer readable storage medium storing computer executable instructions, which are implemented by the processor to implement the method described in the foregoing embodiments.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods disclosed above, and the functional modules/units in the systems and apparatuses, may be implemented as software, firmware, hardware, and appropriate combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical units; for example, one physical component may have multiple functions, or one function or step may be performed jointly by several physical components. Some or all of the components may be implemented as software executed by a processor such as a digital signal processor or microprocessor, as hardware, or as an integrated circuit such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information, such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile discs (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. Moreover, it is well known to those skilled in the art that communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
尽管已描述了本发明实施例的可选实施例,但本领域内的技术人员一旦得知了基本创造性概念,则可对这些实施例做出另外的变更和修改。所以,所附权利要求意欲解释为包括可选实施例以及落入本发明实施例范围的所有变更和修改。Although alternative embodiments of the embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to the embodiments once they are aware of the basic inventive concept. Therefore, the appended claims are intended to be interpreted as including all alternatives and modifications of the embodiments of the invention.
最后,还需要说明的是,在本发明实施例中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者终端设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者终端设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者终端设备中还存在另外的相同要素。Finally, it should also be noted that in the embodiments of the present invention, relational terms such as first and second, etc. are merely used to distinguish one entity or operation from another entity or operation, without necessarily requiring or Imply that there is any such actual relationship or order between these entities or operations. Furthermore, the terms "comprises" or "comprising" or "comprising" or any other variations are intended to encompass a non-exclusive inclusion, such that a process, method, article, or terminal device that includes a plurality of elements includes not only those elements but also Other elements that are included, or include elements inherent to such a process, method, article, or terminal device. An element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or terminal device that comprises the element, without further limitation.
The above are optional embodiments of the present invention. It should be noted that those of ordinary skill in the art can make several improvements and refinements without departing from the principles described herein, and these improvements and refinements also fall within the protection scope of the present invention.
工业实用性Industrial applicability
The embodiments of the present invention break through network-environment constraints to enable virtual video-call communication. The process does not depend on the network and saves, or even dispenses with, data traffic, so the call is accompanied by a virtualized video picture, making it more vivid and lively, strengthening the communication effect, adding fun to communication, and improving the user experience.

Claims (24)

  1. 一种媒体显示方法,包括:A media display method includes:
    采集目标通话语音(101);Collecting target call voice (101);
    根据所述目标通话语音,获取与所述目标通话语音匹配的至少一帧目标媒体显示图像(102);Obtaining at least one frame target media display image that matches the target call voice according to the target call voice (102);
    通过显示界面对所述至少一帧目标媒体显示图像进行播放显示(103)。The at least one frame of the target media display image is played and displayed through the display interface (103).
  2. 根据权利要求1所述的方法,其中,所述根据所述目标通话语音,获取与所述目标通话语音匹配的至少一帧目标媒体显示图像(102),包括:The method of claim 1, wherein the acquiring at least one frame of the target media display image (102) that matches the target call voice according to the target call voice comprises:
    确定与所述目标通话语音对应的目标表情包(202);Determining a target emoticon packet corresponding to the target call voice (202);
    依据所述目标通话语音的声纹特征,确定发出所述目标通话语音所需的至少一个目标发音口型(203);Determining at least one target pronunciation type (203) required to issue the target call voice according to the voiceprint feature of the target call voice;
    matching, from the target emoticon package, at least one frame of target media display image containing the target pronunciation lip shape (204).
  3. 根据权利要求2所述的方法,其中,所述依据所述目标通话语音的声纹特征,确定发出所述目标通话语音所需的至少一个目标发音口型,包括:The method according to claim 2, wherein the determining, according to the voiceprint feature of the target call voice, the at least one target voice vocal type required to issue the target call voice comprises:
    依据所述目标通话语音的声纹特征,确定与所述目标通话语音对应的文字内容;Determining text content corresponding to the target call voice according to the voiceprint feature of the target call voice;
    依据所述文字内容,从文字与注音的对应关系表中,匹配与所述文字内容对应的注音组合;And matching, according to the text content, a phonetic combination corresponding to the text content from a correspondence table between text and phonetic;
    依据所述注音组合,从注音与发音口型的对应关系表中,匹配对应的至少一个发音口型;According to the phonetic combination, at least one corresponding pronunciation type is matched from the correspondence table between the phonetic and the pronunciation mouth type;
    确定所述至少一个发音口型为发出所述目标通话语音所需的至少一个目标发音口型。Determining the at least one pronunciation port type is at least one target pronunciation port type required to issue the target call voice.
  4. The method according to claim 3, wherein the correspondence table between text and phonetics includes a correspondence table between Chinese characters and pinyin; and the matching, according to the text content, of a phonetic combination corresponding to the text content from the correspondence table between text and phonetics includes:
    matching, according to the text content, a pinyin combination corresponding to the text content from the correspondence table between Chinese characters and pinyin;
    所述注音与发音口型的对应关系表中包括拼音与发音口型的对应关系表,所述依据所述注音组合,从注音与发音口型的对应关系表中,匹配对应的至少一个发音口型,包括:The correspondence table between the phonetic transcription and the pronunciation mouth shape includes a correspondence table between the pinyin and the pronunciation mouth shape, and according to the phonetic combination, matching at least one pronunciation port from the correspondence table between the phonetic transcription and the pronunciation mouth shape Type, including:
    依据所述拼音组合,从拼音与发音口型的对应关系表中,匹配对应的至少一个发音口型。According to the pinyin combination, at least one corresponding pronunciation type is matched from the correspondence table between the pinyin and the pronunciation type.
  5. 根据权利要求4所述的方法，其中，所述拼音与发音口型的对应关系表中包括拼音中声母及韵母与发音口型的对应关系；所述依据所述拼音组合，从拼音与发音口型的对应关系表中，匹配对应的至少一个发音口型，包括：The method according to claim 4, wherein the correspondence table between pinyin and pronunciation mouth shapes includes correspondences between pinyin initials and finals and pronunciation mouth shapes; and matching, according to the pinyin combination, the at least one corresponding pronunciation mouth shape from the correspondence table between pinyin and pronunciation mouth shapes comprises:
    根据所述拼音组合，确定所述拼音组合中所包含的声母和韵母，或者，确定所述拼音组合中所包含的韵母；Determining, according to the pinyin combination, the initials and finals contained in the pinyin combination, or determining the finals contained in the pinyin combination;
    依据所述声母和韵母，或者，依据所述韵母，从所述声母及韵母与发音口型的对应关系中，匹配得到相对应的至少一个发音口型。Matching, according to the initials and finals, or according to the finals, the corresponding at least one pronunciation mouth shape from the correspondences between the initials and finals and pronunciation mouth shapes.
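The initial/final decomposition of claim 5 can be sketched as below. The list is the standard set of Mandarin pinyin initials (with the two-letter initials zh/ch/sh tried first); the function and its return convention are illustrative assumptions, since the claim only requires that the initials and finals contained in the pinyin combination be identified.

```python
# Split a pinyin syllable into initial (shengmu) and final (yunmu).
# Two-letter initials are listed first so "zhang" matches "zh", not "z".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def split_pinyin(syllable: str) -> tuple:
    """Return (initial, final); the initial is '' for zero-initial syllables."""
    for ini in INITIALS:
        if syllable.startswith(ini):
            return ini, syllable[len(ini):]
    return "", syllable  # zero-initial syllable, e.g. "an"
```

Each component would then be looked up in the claimed initial/final-to-mouth-shape correspondences; the "finals only" alternative of the claim simply ignores the first element of the tuple.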
  6. 根据权利要求3所述的方法，其中，所述文字与注音的对应关系表中包括英文与英语音标的对应关系表；所述依据所述文字内容，从文字与注音的对应关系表中，匹配与所述文字内容对应的注音组合，包括：The method according to claim 3, wherein the correspondence table between text and phonetic notation includes a correspondence table between English text and English phonetic symbols; and matching, according to the text content, the phonetic combination corresponding to the text content from the correspondence table between text and phonetic notation comprises:
    根据所述文字内容，从英文与英语音标的对应关系表中，匹配与所述文字内容对应的音标组合；Matching, according to the text content, a phonetic symbol combination corresponding to the text content from the correspondence table between English text and English phonetic symbols;
    所述注音与发音口型的对应关系表中包括英语音标与发音口型的对应关系表，所述依据所述注音组合，从注音与发音口型的对应关系表中，匹配对应的至少一个发音口型，包括：The correspondence table between phonetic notation and pronunciation mouth shapes includes a correspondence table between English phonetic symbols and pronunciation mouth shapes, and matching, according to the phonetic combination, the at least one corresponding pronunciation mouth shape from the correspondence table between phonetic notation and pronunciation mouth shapes comprises:
    依据所述音标组合，从英语音标与发音口型的对应关系表中，匹配对应的至少一个发音口型。Matching, according to the phonetic symbol combination, at least one corresponding pronunciation mouth shape from the correspondence table between English phonetic symbols and pronunciation mouth shapes.
  7. 根据权利要求6所述的方法，其中，所述英语音标与发音口型的对应关系表中包括英语音标中元音及辅音与发音口型的对应关系；所述依据所述音标组合，从英语音标与发音口型的对应关系表中，匹配对应的至少一个发音口型，包括：The method according to claim 6, wherein the correspondence table between English phonetic symbols and pronunciation mouth shapes includes correspondences between vowels and consonants in the English phonetic symbols and pronunciation mouth shapes; and matching, according to the phonetic symbol combination, the at least one corresponding pronunciation mouth shape from the correspondence table between English phonetic symbols and pronunciation mouth shapes comprises:
    根据所述音标组合，确定所述音标组合中所包含的元音和辅音，或者，确定所述音标组合中所包含的元音；Determining, according to the phonetic symbol combination, the vowels and consonants contained in the phonetic symbol combination, or determining the vowels contained in the phonetic symbol combination;
    依据所述元音和辅音，或者，依据所述元音，从所述元音及辅音与发音口型的对应关系中，匹配得到相对应的至少一个发音口型。Matching, according to the vowels and consonants, or according to the vowels, the corresponding at least one pronunciation mouth shape from the correspondences between the vowels and consonants and pronunciation mouth shapes.
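Claim 7's vowel/consonant handling can be sketched the same way. The phoneme sets and viseme labels below are illustrative assumptions covering only a handful of English phonetic symbols, with the claim's "vowels only" alternative exposed as a flag.

```python
# Classify phonemes of an IPA-style transcription into vowels and consonants,
# then map each to a mouth-shape label. Sets and labels are placeholders,
# not a complete English inventory.
VOWELS = {"i:", "ɪ", "e", "æ", "ɑ:", "ɒ", "ʊ", "u:", "ə", "ʌ"}
VOWEL_VISEME = {"i:": "spread", "æ": "wide_open", "u:": "rounded"}
CONSONANT_VISEME = {"h": "open", "l": "tongue_up"}

def visemes(phonemes, vowels_only=False):
    """Mouth shapes for a phoneme sequence; optionally use vowels only."""
    out = []
    for p in phonemes:
        if p in VOWELS:
            out.append(VOWEL_VISEME.get(p, "neutral"))
        elif not vowels_only:
            out.append(CONSONANT_VISEME.get(p, "neutral"))
    return out
```

The vowels-only branch mirrors the claim's alternative of determining only the vowels contained in the phonetic symbol combination.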
  8. 根据权利要求2所述的方法，其中，所述确定与所述目标通话语音对应的目标表情包，包括：The method according to claim 2, wherein determining the target emoticon package corresponding to the target call voice comprises:
    确定所述目标通话语音所对应的目标联系人;Determining a target contact corresponding to the target call voice;
    调取与所述目标联系人预关联的目标表情包。Retrieving a target emoticon package pre-associated with the target contact.
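The contact-based retrieval of claim 8 reduces to a keyed lookup: resolve the caller to a contact, then fetch the emoticon package pre-associated with that contact. The table contents and the fallback to a default package are hypothetical.

```python
# Hypothetical contact -> pre-associated emoticon package table.
CONTACT_PACKAGES = {"Alice": "alice_pack", "Bob": "bob_pack"}

def target_package(contact: str, default: str = "default_pack") -> str:
    """Retrieve the emoticon package pre-associated with the target contact."""
    return CONTACT_PACKAGES.get(contact, default)
```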
  9. 根据权利要求1至8中任一项所述的方法，其中，所述采集目标通话语音，包括：The method according to any one of claims 1 to 8, wherein collecting the target call voice comprises:
    监听语音通话进程；Monitoring a voice call process;
    确定接收到的对方通话语音为所述目标通话语音。Determining the received call voice of the other party as the target call voice.
  10. 根据权利要求1-8任一项所述的方法，所述方法还包括：The method according to any one of claims 1 to 8, further comprising:
    所述采集目标通话语音之前，获取素材资源包及通话联系人的个人图像，其中，所述素材资源包中包含至少一个媒体素材图像；Before collecting the target call voice, acquiring a material resource package and a personal image of a call contact, wherein the material resource package includes at least one media material image;
    将所述通话联系人的个人图像与每一所述媒体素材图像进行整合，生成至少一个媒体显示图像，得到包含所述至少一个媒体显示图像的表情包。Integrating the personal image of the call contact with each of the media material images to generate at least one media display image, obtaining an emoticon package including the at least one media display image.
  11. 根据权利要求10所述的方法，其中，所述个人图像中包括个人的脸部图像，所述媒体素材图像中包括拼音中声母及韵母所对应的发音口型图像或英语音标中元音及辅音所对应的发音口型图像，所述将所述通话联系人的个人图像与每一所述媒体素材图像进行整合，生成至少一个媒体显示图像，得到包含所述至少一个媒体显示图像的表情包，包括：The method according to claim 10, wherein the personal image includes a facial image of a person, the media material images include pronunciation mouth shape images corresponding to the initials and finals in pinyin or to the vowels and consonants in the English phonetic symbols, and integrating the personal image of the call contact with each of the media material images to generate at least one media display image and obtain the emoticon package including the at least one media display image comprises:
    识别所述脸部图像中的嘴部区域；Identifying a mouth region in the facial image;
    将所述媒体素材图像中的发音口型图像在所述嘴部区域进行填充替换；Filling and replacing the mouth region with the pronunciation mouth shape images in the media material images;
    生成与所述媒体素材图像中的每一所述发音口型图像对应的媒体显示图像，得到包含与每一所述发音口型图像对应的媒体显示图像的表情包。Generating a media display image corresponding to each of the pronunciation mouth shape images in the media material images, and obtaining an emoticon package including the media display image corresponding to each of the pronunciation mouth shape images.
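The fill-and-replace step of claim 11 can be sketched on toy images represented as 2D lists: the identified mouth region is overwritten with each pronunciation mouth-shape patch, yielding one display frame per patch. The region coordinates are assumed to come from a separate face-recognition step; all names and data here are illustrative.

```python
def paste_mouth(face, patch, top, left):
    """Return a copy of `face` with `patch` filled into the mouth region."""
    out = [row[:] for row in face]          # do not mutate the original image
    for i, prow in enumerate(patch):
        for j, px in enumerate(prow):
            out[top + i][left + j] = px
    return out

def build_package(face, mouth_patches, top, left):
    """One media display frame per mouth-shape patch, as in claim 11."""
    return [paste_mouth(face, p, top, left) for p in mouth_patches]
```

A real implementation would locate `(top, left)` via facial landmark detection and blend pixel values rather than overwrite them.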
  12. 根据权利要求1所述的方法，其中，所述通过显示界面对所述至少一帧目标媒体显示图像进行播放显示，包括：The method according to claim 1, wherein playing and displaying the at least one frame of the target media display image through the display interface comprises:
    通过所述显示界面，以获取得到的一显示图像为背景画面，对所述至少一帧目标媒体显示图像进行播放显示(205)。Playing and displaying, through the display interface, the at least one frame of the target media display image with an obtained display image as the background picture (205).
  13. 一种媒体显示终端,包括:A media display terminal comprising:
    采集模块(301),设置为采集目标通话语音;An acquisition module (301) configured to collect a target call voice;
    第一获取模块(302),设置为根据所述目标通话语音,获取与所述目标通话语音匹配的至少一帧目标媒体显示图像;The first obtaining module (302) is configured to acquire, according to the target call voice, at least one frame target media display image that matches the target call voice;
    显示模块(303),设置为通过显示界面对所述至少一帧目标媒体显示图像进行播放显示。The display module (303) is configured to perform play display on the at least one frame of the target media display image through the display interface.
  14. 根据权利要求13所述的媒体显示终端，其中，所述第一获取模块(302)，包括：The media display terminal according to claim 13, wherein the first obtaining module (302) comprises:
    第一确定子模块，设置为确定与所述目标通话语音对应的目标表情包；a first determining submodule, configured to determine a target emoticon package corresponding to the target call voice;
    第二确定子模块，设置为依据所述目标通话语音的声纹特征，确定发出所述目标通话语音所需的至少一个目标发音口型；a second determining submodule, configured to determine, according to the voiceprint feature of the target call voice, at least one target pronunciation mouth shape required to utter the target call voice;
    获取子模块，设置为从所述目标表情包中匹配得到包含所述目标发音口型的至少一帧目标媒体显示图像。an obtaining submodule, configured to match, from the target emoticon package, at least one frame of a target media display image containing the target pronunciation mouth shape.
  15. 根据权利要求14所述的媒体显示终端,其中,所述第二确定子模块,包括:The media display terminal according to claim 14, wherein the second determining submodule comprises:
    第一确定单元,设置为依据所述目标通话语音的声纹特征,确定与所述目标通话语音对应的文字内容;a first determining unit, configured to determine, according to the voiceprint feature of the target call voice, text content corresponding to the target call voice;
    第一匹配单元,设置为依据所述文字内容,从文字与注音的对应关系表中,匹配与所述文字内容对应的注音组合;The first matching unit is configured to match the phonetic combination corresponding to the text content from the correspondence table between the text and the phonetic according to the text content;
    第二匹配单元，设置为依据所述注音组合，从注音与发音口型的对应关系表中，匹配对应的至少一个发音口型；a second matching unit, configured to match, according to the phonetic combination, at least one corresponding pronunciation mouth shape from the correspondence table between phonetic notation and pronunciation mouth shapes;
    第二确定单元，设置为确定所述至少一个发音口型为发出所述目标通话语音所需的至少一个目标发音口型。a second determining unit, configured to determine the at least one pronunciation mouth shape as the at least one target pronunciation mouth shape required to utter the target call voice.
  16. 根据权利要求15所述的媒体显示终端，其中，所述文字与注音的对应关系表中包括汉字与拼音的对应关系表；所述第一匹配单元包括：The media display terminal according to claim 15, wherein the correspondence table between text and phonetic notation includes a correspondence table between Chinese characters and pinyin; the first matching unit comprises:
    第一匹配子单元,设置为根据所述文字内容,从汉字与拼音的对应关系表中,匹配与所述文字内容对应的拼音组合;The first matching subunit is configured to match, according to the text content, a pinyin combination corresponding to the text content from a correspondence table between Chinese characters and pinyin;
    所述注音与发音口型的对应关系表中包括拼音与发音口型的对应关系表，所述第二匹配单元包括：The correspondence table between phonetic notation and pronunciation mouth shapes includes a correspondence table between pinyin and pronunciation mouth shapes, and the second matching unit comprises:
    第二匹配子单元，设置为依据所述拼音组合，从拼音与发音口型的对应关系表中，匹配对应的至少一个发音口型。a second matching subunit, configured to match, according to the pinyin combination, at least one corresponding pronunciation mouth shape from the correspondence table between pinyin and pronunciation mouth shapes.
  17. 根据权利要求16所述的媒体显示终端，其中，所述拼音与发音口型的对应关系表中包括拼音中声母及韵母与发音口型的对应关系；所述第二匹配子单元是设置为：The media display terminal according to claim 16, wherein the correspondence table between pinyin and pronunciation mouth shapes includes correspondences between pinyin initials and finals and pronunciation mouth shapes; the second matching subunit is configured to:
    根据所述拼音组合，确定所述拼音组合中所包含的声母和韵母，或者，确定所述拼音组合中所包含的韵母；determine, according to the pinyin combination, the initials and finals contained in the pinyin combination, or determine the finals contained in the pinyin combination;
    依据所述声母和韵母，或者，依据所述韵母，从所述声母及韵母与发音口型的对应关系中，匹配得到相对应的至少一个发音口型。and match, according to the initials and finals, or according to the finals, the corresponding at least one pronunciation mouth shape from the correspondences between the initials and finals and pronunciation mouth shapes.
  18. 根据权利要求15所述的媒体显示终端，其中，所述文字与注音的对应关系表中包括英文与英语音标的对应关系表；所述第一匹配单元包括：The media display terminal according to claim 15, wherein the correspondence table between text and phonetic notation includes a correspondence table between English text and English phonetic symbols; the first matching unit comprises:
    第三匹配子单元，设置为根据所述文字内容，从英文与英语音标的对应关系表中，匹配与所述文字内容对应的音标组合；a third matching subunit, configured to match, according to the text content, a phonetic symbol combination corresponding to the text content from the correspondence table between English text and English phonetic symbols;
    所述注音与发音口型的对应关系表中包括英语音标与发音口型的对应关系表，所述第二匹配单元包括：The correspondence table between phonetic notation and pronunciation mouth shapes includes a correspondence table between English phonetic symbols and pronunciation mouth shapes, and the second matching unit comprises:
    第四匹配子单元，设置为依据所述音标组合，从英语音标与发音口型的对应关系表中，匹配对应的至少一个发音口型。a fourth matching subunit, configured to match, according to the phonetic symbol combination, at least one corresponding pronunciation mouth shape from the correspondence table between English phonetic symbols and pronunciation mouth shapes.
  19. 根据权利要求18所述的媒体显示终端，其中，所述英语音标与发音口型的对应关系表中包括英语音标中元音及辅音与发音口型的对应关系；所述第四匹配子单元是设置为：The media display terminal according to claim 18, wherein the correspondence table between English phonetic symbols and pronunciation mouth shapes includes correspondences between vowels and consonants in the English phonetic symbols and pronunciation mouth shapes; the fourth matching subunit is configured to:
    根据所述音标组合，确定所述音标组合中所包含的元音和辅音，或者，确定所述音标组合中所包含的元音；determine, according to the phonetic symbol combination, the vowels and consonants contained in the phonetic symbol combination, or determine the vowels contained in the phonetic symbol combination;
    依据所述元音和辅音，或者，依据所述元音，从所述元音及辅音与发音口型的对应关系中，匹配得到相对应的至少一个发音口型。and match, according to the vowels and consonants, or according to the vowels, the corresponding at least one pronunciation mouth shape from the correspondences between the vowels and consonants and pronunciation mouth shapes.
  20. 根据权利要求14所述的媒体显示终端,其中,所述第一确定子模块,包括:The media display terminal according to claim 14, wherein the first determining submodule comprises:
    第三确定单元,设置为确定所述目标通话语音所对应的目标联系人;a third determining unit, configured to determine a target contact corresponding to the target call voice;
    调取单元,设置为调取与所述目标联系人预关联的目标表情包。The retrieval unit is configured to retrieve a target expression package pre-associated with the target contact.
  21. 根据权利要求13至20中任一项所述的媒体显示终端,其中,所述采集模块(301),包括:The media display terminal according to any one of claims 13 to 20, wherein the acquisition module (301) comprises:
    监听子模块,设置为监听语音通话进程;The monitoring submodule is set to monitor the voice call process;
    第三确定子模块,设置为确定接收到的对方通话语音为所述目标通话语音。The third determining submodule is configured to determine that the received counterpart voice is the target call voice.
  22. 根据权利要求13-20任一项所述的媒体显示终端,还包括:The media display terminal according to any one of claims 13 to 20, further comprising:
    第二获取模块，设置为获取素材资源包及通话联系人的个人图像，其中，所述素材资源包中包含至少一个媒体素材图像；a second acquiring module, configured to acquire a material resource package and a personal image of a call contact, wherein the material resource package includes at least one media material image;
    生成模块,设置为将所述通话联系人的个人图像与每一所述媒体素材图像进行整合,生成至少一个媒体显示图像,得到包含所述至少一个媒体显示图像的表情包。And a generating module, configured to integrate the personal image of the call contact with each of the media material images to generate at least one media display image, to obtain an emoticon package including the at least one media display image.
  23. 根据权利要求22所述的媒体显示终端，其中，所述个人图像中包括个人的脸部图像，所述媒体素材图像中包括拼音中声母及韵母所对应的发音口型图像或英语音标中元音及辅音所对应的发音口型图像，所述生成模块，包括：The media display terminal according to claim 22, wherein the personal image includes a facial image of a person, the media material images include pronunciation mouth shape images corresponding to the initials and finals in pinyin or to the vowels and consonants in the English phonetic symbols, and the generating module comprises:
    识别子模块，设置为识别所述脸部图像中的嘴部区域；an identifying submodule, configured to identify a mouth region in the facial image;
    替换模块，设置为将所述媒体素材图像中的发音口型图像在所述嘴部区域进行填充替换；a replacing module, configured to fill and replace the mouth region with the pronunciation mouth shape images in the media material images;
    生成子模块，设置为生成与所述媒体素材图像中的每一所述发音口型图像对应的媒体显示图像，得到包含与每一所述发音口型图像对应的媒体显示图像的表情包。a generating submodule, configured to generate a media display image corresponding to each of the pronunciation mouth shape images in the media material images, and obtain an emoticon package including the media display image corresponding to each of the pronunciation mouth shape images.
  24. 根据权利要求13所述的媒体显示终端,其中,所述显示模块,包括:The media display terminal according to claim 13, wherein the display module comprises:
    显示子模块，设置为通过所述显示界面，以获取得到的一显示图像为背景画面，对所述至少一帧目标媒体显示图像进行播放显示。a display submodule, configured to play and display, through the display interface, the at least one frame of the target media display image with an obtained display image as the background picture.
PCT/CN2017/114843 2016-12-14 2017-12-06 Medium displaying method and terminal WO2018108013A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201611154485.5 2016-12-14
CN201611154485.5A CN108234735A (en) 2016-12-14 2016-12-14 A kind of media display methods and terminal

Publications (1)

Publication Number Publication Date
WO2018108013A1 true WO2018108013A1 (en) 2018-06-21

Family

ID=62557913

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/114843 WO2018108013A1 (en) 2016-12-14 2017-12-06 Medium displaying method and terminal

Country Status (2)

Country Link
CN (1) CN108234735A (en)
WO (1) WO2018108013A1 (en)


Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109377540B (en) * 2018-09-30 2023-12-19 网易(杭州)网络有限公司 Method and device for synthesizing facial animation, storage medium, processor and terminal
CN110062116A (en) * 2019-04-29 2019-07-26 上海掌门科技有限公司 Method and apparatus for handling information
CN110336733B (en) * 2019-04-30 2022-05-17 上海连尚网络科技有限公司 Method and equipment for presenting emoticon
CN110784762B (en) * 2019-08-21 2022-06-21 腾讯科技(深圳)有限公司 Video data processing method, device, equipment and storage medium
CN110446066B (en) * 2019-08-28 2021-11-19 北京百度网讯科技有限公司 Method and apparatus for generating video
CN111063339A (en) * 2019-11-11 2020-04-24 珠海格力电器股份有限公司 Intelligent interaction method, device, equipment and computer readable medium
CN112804440B (en) * 2019-11-13 2022-06-24 北京小米移动软件有限公司 Method, device and medium for processing image
CN111596841B (en) * 2020-04-28 2021-09-07 维沃移动通信有限公司 Image display method and electronic equipment
CN111741162B (en) * 2020-06-01 2021-08-20 广东小天才科技有限公司 Recitation prompting method, electronic equipment and computer readable storage medium
EP3993410A1 (en) * 2020-10-28 2022-05-04 Ningbo Geely Automobile Research & Development Co., Ltd. A camera system and method for generating an eye contact image view of a person
CN114827648B (en) * 2022-04-19 2024-03-22 咪咕文化科技有限公司 Method, device, equipment and medium for generating dynamic expression package

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6539354B1 (en) * 2000-03-24 2003-03-25 Fluent Speech Technologies, Inc. Methods and devices for producing and using synthetic visual speech based on natural coarticulation
CN101482975A (en) * 2008-01-07 2009-07-15 丰达软件(苏州)有限公司 Method and apparatus for converting words into animation
CN101968893A (en) * 2009-07-28 2011-02-09 上海冰动信息技术有限公司 Game sound-lip synchronization system
CN104238991A (en) * 2013-06-21 2014-12-24 腾讯科技(深圳)有限公司 Voice input matching method and voice input matching device
CN104239394A (en) * 2013-06-18 2014-12-24 三星电子株式会社 Translation system comprising display apparatus and server and control method thereof

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104468959A (en) * 2013-09-25 2015-03-25 中兴通讯股份有限公司 Method, device and mobile terminal displaying image in communication process of mobile terminal

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021083125A1 (en) * 2019-10-31 2021-05-06 Oppo广东移动通信有限公司 Call control method and related product
CN112770063A (en) * 2020-12-22 2021-05-07 北京奇艺世纪科技有限公司 Image generation method and device
CN112770063B (en) * 2020-12-22 2023-07-21 北京奇艺世纪科技有限公司 Image generation method and device

Also Published As

Publication number Publication date
CN108234735A (en) 2018-06-29

Similar Documents

Publication Publication Date Title
WO2018108013A1 (en) Medium displaying method and terminal
CN110941954B (en) Text broadcasting method and device, electronic equipment and storage medium
CN110288077B (en) Method and related device for synthesizing speaking expression based on artificial intelligence
JP6019108B2 (en) Video generation based on text
CN112188304B (en) Video generation method, device, terminal and storage medium
US20190222806A1 (en) Communication system and method
CN111106995B (en) Message display method, device, terminal and computer readable storage medium
JP2014519082A5 (en)
CN111294463B (en) Intelligent response method and system
CA2677051A1 (en) A communication network and devices for text to speech and text to facial animation conversion
CN107291704A (en) Treating method and apparatus, the device for processing
CN112509609B (en) Audio processing method and device, electronic equipment and storage medium
KR20150017662A (en) Method, apparatus and storing medium for text to speech conversion
CN110990534A (en) Data processing method and device and data processing device
CN111199160A (en) Instant call voice translation method and device and terminal
CN113724683A (en) Audio generation method, computer device, and computer-readable storage medium
CN110830845A (en) Video generation method and device and terminal equipment
CN110298150B (en) Identity verification method and system based on voice recognition
CN111160051B (en) Data processing method, device, electronic equipment and storage medium
CN114398517A (en) Video data acquisition method and device
CN111462279B (en) Image display method, device, equipment and readable storage medium
CN112837668A (en) Voice processing method and device for processing voice
CN108174123A (en) Data processing method, apparatus and system
CN112562687B (en) Audio and video processing method and device, recording pen and storage medium
KR20180034927A (en) Communication terminal for analyzing call speech

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17880305

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17880305

Country of ref document: EP

Kind code of ref document: A1